Hi everyone, 

I want to share a demo of the code I've been developing to improve some of the results published here.
It's a simple k-opt algorithm that runs fast with k = 2, 3, 4 and maybe 5; beyond that you have to be patient, because the loops get huge.

It's my first competition on Kaggle and I feel like this was more of a "hardware war". As someone said in the discussion, in future editions it would be nice to add more complexity to the problem and reduce the amount of data. There's nothing I can do with my 8 GB laptop... Anyway, here's the code.

Any suggestions will be appreciated!

Thanks for reading.

**IMPORTS**

In [None]:
import time
import pandas as pd
import numpy as np
import sympy as sp
import matplotlib.pyplot as plt
import matplotlib.collections as mcoll
from itertools import chain
from itertools import combinations
from itertools import permutations
from itertools import product
from sklearn.neighbors import NearestNeighbors

**LOADING CITIES & SUBMISSION**

While loading the data, I create a third column 'Z' so the penalty can be applied as fast as possible: it is 1.0 for prime CityIds and 1.1 for all the others. As input I'm going to use an LKH solver solution provided by [Kostya Atarik](https://www.kaggle.com/kostyaatarik/traveling-santa-lkh-solution). (Thanks man)

In [None]:
def initial():
    df = pd.read_csv('../input/traveling-santa-2018-prime-paths/cities.csv')
    df['Z'] = 1 + .1 * ~df['CityId'].apply(sp.isprime)
    data = df[['X', 'Y', 'Z']].values
    tour = np.loadtxt('../input/traveling-santa-lkh-solution/pure1502650.csv', 
                      skiprows=1, dtype=int)
    return tour, data

tour, data = initial()

In [None]:
data[:5]

In [None]:
tour[:5]
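
To see what the 'Z' column ends up holding, here is a minimal sketch on a handful of made-up city ids. `is_prime` is a hypothetical trial-division stand-in for `sp.isprime`, used only so the snippet runs without sympy; the expression for `z` mirrors the one in `initial()`.

```python
import numpy as np

def is_prime(n):
    """Trial-division stand-in for sympy.isprime (hypothetical helper)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

city_ids = np.arange(8)  # made-up ids 0..7
prime = np.array([is_prime(int(c)) for c in city_ids])

# Same trick as in initial(): ~ flips the boolean mask, so non-primes get 1.1
z = 1 + 0.1 * ~prime
print(z)
```

With this layout, multiplying a step length by the 'Z' of its origin city applies the 10% penalty only when that city is not prime.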

**COST FUNCTION**

It can be used to compute the whole tour or just tour chunks. I keep the result as an array because I will be changing its values during the process, just like the tour array.

In [None]:
def distance(tour, data, pen=9):
    xy, z = np.hsplit(data[tour], [2])
    dist = np.hypot(*(xy[:-1] - xy[1:]).T)
    dist[pen::10] *= z[:-1][pen::10].flat
    return dist

dist = distance(tour, data)
dist

When computing the distance of a tour chunk, the variable 'pen' goes from 0 to 9 depending on the position in the tour of the chunk's first city. It's something like:

    pen = 9 - index_of_starting_city_in_tour % 10

Suppose we want the distances between the cities at positions 3 to 8 of the tour; we can get them without computing the whole tour distance:

*Note that between 6 cities there are 5 distances.

In [None]:
distance(tour[3:9], data, pen = 9 - 3 % 10) == dist[3:8]
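
The same check can be run on made-up data for every starting offset. This is only a sketch: the coordinates, the 'Z' values and the tour below are random, and `distance` is repeated here (with `.ravel()` in place of `.flat`) so the block runs on its own.

```python
import numpy as np

def distance(tour, data, pen=9):
    # Same logic as the notebook's distance(), repeated for self-containment
    xy, z = np.hsplit(data[tour], [2])
    dist = np.hypot(*(xy[:-1] - xy[1:]).T)
    dist[pen::10] *= z[:-1][pen::10].ravel()
    return dist

rng = np.random.default_rng(0)
n = 40
# Random X, Y and a random 1.0 / 1.1 'Z' column (toy data, not the real cities)
data = np.column_stack((rng.random((n, 2)) * 100,
                        rng.choice([1.0, 1.1], size=n)))
tour = rng.permutation(n)
full = distance(tour, data)

# A chunk of 12 cities starting at position `start` covers 11 steps
ok = all(
    np.allclose(distance(tour[start:start + 12], data, pen=9 - start % 10),
                full[start:start + 11])
    for start in range(25)
)
print(ok)
```

The offset works because a step at local index j sits at global index start + j, and the penalised steps are the ones whose global index is 9 modulo 10.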

To compute the initial score just sum it:

In [None]:
f'Initial Score: {np.sum(dist):.2f}'

**GENERATING CANDIDATES**

I use NearestNeighbors from [scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html) to get a set of cities to play with:

In [None]:
def candidates(data, opt, ext):
    nns = NearestNeighbors(n_neighbors=opt + ext).fit(data[:, :2])
    kne = nns.kneighbors(data[:, :2], return_distance=False)
    np.random.shuffle(kne)
    cand = set()
    for i in kne:
        for j in combinations(i[1:], opt - 1):
            cand.add(tuple(sorted((i[0],) + j)))
    return cand

In [None]:
list(candidates(data, opt=3, ext=0))[:5]

The 'ext' variable widens the neighbourhood of each city, so extra unique combinations are considered:

In [None]:
len(candidates(data, opt=2, ext=0))

In [None]:
len(candidates(data, opt=2, ext=1))
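
A rough way to see how 'ext' grows the search: each city is combined with opt - 1 of its opt + ext - 1 nearest neighbours, so before duplicates are removed there are at most n_cities * C(opt + ext - 1, opt - 1) candidates. The helper `max_candidates` below is hypothetical, just to do the arithmetic, and it assumes each city comes back as its own first neighbour.

```python
from math import comb

def max_candidates(n_cities, opt, ext):
    """Upper bound on len(candidates(...)) before the set removes duplicates."""
    return n_cities * comb(opt + ext - 1, opt - 1)

n_cities = 1000  # made-up size, not the real number of cities
for opt, ext in [(2, 0), (2, 1), (3, 1), (4, 2)]:
    print(opt, ext, max_candidates(n_cities, opt, ext))
```

The real counts are lower, because the same sorted tuple is often generated from several cities and the set keeps it only once.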

**BUILDING ALTERNATIVE PATHS**

To speed things up, I've provided a filter that only considers the alternatives in which every segment changes (it is either reversed or moved). Note that to calculate the distances I need to include the city before the chunk and the city after it -> (a, b).

In [None]:
def alternatives(tour, cuts, fil):
    edges = [tuple(x) for x in np.split(tour, cuts)[1:-1]]
    a, b = tour[cuts[0] - 1], tour[cuts[-1]]
    alter = set()
    for i in set(product(*zip(edges, [x[::-1] for x in edges]))):
        for j in permutations(i):
            if not fil or all(x != y for x, y in zip(edges, j)):
                alter.add(tuple(chain((a,), *j, (b,))))
    alter.discard(tuple(chain((a,), *edges, (b,))))
    return alter

For example, cutting the tour at indexes (2, 5, 7):

In [None]:
# edges
tour[2:5], tour[5:7]

In [None]:
# a, b
tour[1] , tour[7]

In [None]:
alternatives(tour, cuts = [2,5,7], fil=False)

In [None]:
alternatives(tour, cuts = [2,5,7], fil=True)
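
To get a feel for how many alternatives come out, here is a sketch of the same enumeration on a plain Python list. `alternatives_toy` is a hypothetical re-write that slices the list instead of using `np.split`, and the tour is just 0..9, so the numbers are only illustrative.

```python
from itertools import chain, permutations, product

def alternatives_toy(tour, cuts, fil):
    # Segments between consecutive cuts, plus the fixed cities around them
    edges = [tuple(tour[i:j]) for i, j in zip(cuts[:-1], cuts[1:])]
    a, b = tour[cuts[0] - 1], tour[cuts[-1]]
    alter = set()
    # Pick an orientation for every segment, then try every ordering
    for oriented in product(*zip(edges, [e[::-1] for e in edges])):
        for order in permutations(oriented):
            if not fil or all(x != y for x, y in zip(edges, order)):
                alter.add(tuple(chain((a,), *order, (b,))))
    alter.discard(tuple(chain((a,), *edges, (b,))))  # drop the current path
    return alter

tour = list(range(10))
unfiltered = alternatives_toy(tour, [2, 5, 7], fil=False)
filtered = alternatives_toy(tour, [2, 5, 7], fil=True)
print(len(unfiltered), len(filtered))  # 7 5
```

With m segments of at least two cities each there are at most 2^m * m! - 1 alternatives (7, 47 and 383 for m = 2, 3 and 4), which is why larger k gets slow so quickly.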

**SUBMISSION**

Easy.

In [None]:
def submit(tour):
    np.savetxt('submission.csv', tour, fmt='%d', header='Path', comments='')

**K-OPT**

Using the previous functions, here's the final one. It's really simple, as you can see: for each candidate I check whether any alternative reduces the distance and, if so, I take the one with the minimum distance and update the tour and dist arrays.

In [None]:
def kopt(opt, ext, fil):
    tour, data = initial()
    sequ = 1 + np.argsort(tour[1:])
    dist = distance(tour, data)
    print(f'opt:{opt} & ext:{ext} & fil:{fil} ...')
    cand = candidates(data, opt, ext)
    print(f' Initial Score:\t{np.sum(dist):0.2f}')
    for c in cand:
        cuts = sorted(sequ[j] for j in c)
        alter = alternatives(tour, cuts, fil)
        if not alter:
            continue
        atour, pen = np.array(list(alter)), 9 - (cuts[0] - 1) % 10
        adist = np.array([distance(x, data, pen) for x in atour])
        if np.any(np.sum(adist, 1) < np.sum(dist[cuts[0] - 1:cuts[-1]])):
            arg = np.argmin(np.sum(adist, 1))
            dist[cuts[0] - 1:cuts[-1]] = adist[arg]
            tour[cuts[0]:cuts[-1]] = atour[arg][1:-1]
            sequ[atour[arg][1:-1]] = range(cuts[0], cuts[-1])
    print(f' Final Score:\t{np.sum(dist):0.2f}')
    submit(tour)
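
One detail worth spelling out is `sequ`, which maps each city to its position in the tour so the candidates (city ids) can be turned into cuts (positions). The sketch below uses a made-up 4-city tour to show the lookup and how it stays valid after a segment is rewritten, following the same update as in `kopt`.

```python
import numpy as np

tour = np.array([0, 3, 1, 2, 0])   # toy closed tour, starts and ends at city 0
sequ = 1 + np.argsort(tour[1:])    # position of every city, skipping the start
print(sequ)                        # [4 2 3 1]

# City 0 maps to the closing position; every other city maps to its only position
assert np.array_equal(tour[sequ], np.arange(4))

# Rewrite positions 1..3 with a reversed segment and refresh the lookup
new_segment = tour[1:4][::-1].copy()
tour[1:4] = new_segment
sequ[new_segment] = np.arange(1, 4)
print(tour, sequ)
```

Only the entries of the cities inside the rewritten segment need refreshing, so the update costs time proportional to the chunk and not to the whole tour.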

Some runs:

In [None]:
t0 = time.time()
kopt(opt=2, ext=0, fil=False)
print(f'Time:\t{time.time()-t0:.2f}s')

In [None]:
t0 = time.time()
kopt(opt=2, ext=1, fil=False)
print(f'Time:\t{time.time()-t0:.2f}s')

In [None]:
t0 = time.time()
kopt(opt=3, ext=0, fil=False)
print(f'Time:\t{time.time()-t0:.2f}s')

In [None]:
t0 = time.time()
kopt(opt=4, ext=0, fil=True)
print(f'Time:\t{time.time()-t0:.2f}s')

**Bonus track: TOUR VISUALIZATION**

A cool visualization to check your submission. You can use any of the colormaps from [matplotlib](https://matplotlib.org/examples/color/colormaps_reference.html), just change 'Spectral' to whatever you want.

In [None]:
def graph():
    tour, data = initial()
    xy = data[tour][:, :2]
    segm = np.hstack((xy[:-1], xy[1:])).reshape(-1, 2, 2)
    lc = mcoll.LineCollection(segments=segm,
                              array=np.linspace(0, 1, len(segm)),
                              cmap=plt.get_cmap('Spectral'),
                              lw=.9)
    fig, ax = plt.subplots(figsize=(10,8))
    fig.subplots_adjust(left=0, bottom=0, right=1, top=1)
    ax.add_collection(lc)
    ax.plot(*xy.T, lw=.3, c='black')
    plt.show()
    
graph()