In this notebook, all the steps described in the report are run with the Genetic Algorithm. Executing the full notebook can take a few hours, so we have saved the outputs of the algorithms as text files that can be loaded instead of running the algorithms again.

The lines of code used to run the algorithms and save their outputs are commented out.

Note: throughout the notebook, the functions used to compute the distance of a route take care of adding 0 at the start or at the end of the route during computation. For this reason, some "full" routes in this notebook might not start with 0, or not end with 0, but this makes no difference. The function that computes the total distance, including the penalties for every 10th step that does not originate from a prime city, was tested on routes from the Kaggle challenge leaderboard to make sure it is correct.

# Imports

In [1]:
from datetime import datetime
import json
import numpy as np

In [2]:
cities = np.genfromtxt('cities.csv', delimiter=',', skip_header = 1)

# Clustering

In [3]:
from sklearn.cluster import KMeans

We partition the cities into 1000 clusters using k-means.

In [4]:
n = 1000

We have already run the following code and saved the cluster labels as a txt file, so we can simply load them now.

In [5]:
# kmeans = KMeans(n_clusters=n, random_state=0)
# kmeans.fit(cities[:, 1:3])
# np.savetxt('files/1000subsets.txt', kmeans.labels_) 

In [6]:
clusters = np.genfromtxt('files/1000subsets.txt', skip_header = 0).astype(int)

In [7]:
clusters

array([894, 654, 614, ..., 263, 305,  80])

We add the information regarding the cluster to the cities array.

In [8]:
kcities = np.concatenate((cities, clusters[:, np.newaxis]), 1)

We create a list of clusters ("subsets")

In [9]:
subs = [0]*n
for i in range(n):
    subs[i] = kcities[kcities[:, 3] == i][:, :3]

In [10]:
subs = np.array(subs, dtype=object)  # clusters differ in size, so recent NumPy versions need an explicit object array

The following are the sizes of some of the clusters:

In [11]:
lens = np.array([len(subs[i]) for i in range(n)])
lens[:50]

array([230, 118, 121, 168,  89, 163, 310, 288, 277, 156, 289, 228, 172,
       273, 151, 156, 261, 236, 190, 234, 183,  73, 242, 222, 228, 161,
       244, 282, 265, 198, 138, 288, 268, 181, 274, 158, 160, 203, 127,
       152, 216, 313, 222, 192, 218, 145, 241, 296, 151, 137])

The smallest cluster has only 12 cities:

In [12]:
np.min(lens)

12

The largest cluster consists of 407 cities:

In [13]:
np.max(lens)

407

In [None]:
np.median(lens)

The size of the union of the clusters is 197769, as expected.

In [14]:
sum(lens)

197769

# Running GA in each cluster

In [15]:
from ga import GA, route_fitness, shift_mutation, roulette_selection, two_point_crossover
from santas_path import total_length_straight, total_length_loop, edp, not_prime

For later comparison, the following is the total length of a random route through all the cities:

In [16]:
np.random.seed(4)
random_perm = np.random.permutation(len(cities))
total_length_loop(random_perm, cities)

442479971.65202415

We initialize the permutations in each cluster. We want to find the best permutation of the cities inside each cluster, where a permutation inside a cluster of size k is a permutation of the positions 0, ..., k-1, i.e., an ordering of the cities of that cluster.

In [17]:
subs_perms = [0]*n
for i in range(n):
    subs_perms[i] = np.arange(len(subs[i]))

We create a list of arrays, each consisting of the city ids in one cluster:

In [18]:
c_ids = [el[:,0] for el in subs]

We can find the route inside each cluster (as a sequence of city ids) as follows:

In [19]:
start_routes = []
for i in range(n):
    start_routes.append(c_ids[i][subs_perms[i]])
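As a small illustration of the indexing used above, here is a sketch with made-up ids (not taken from the dataset). A cluster-local permutation is a list of positions, and applying it to the cluster's id list gives the route in terms of city ids. Plain lists are used here; with NumPy arrays the same result is obtained by fancy indexing, as in c_ids[i][subs_perms[i]].

```python
# Hypothetical cluster with four cities; the ids are made up for illustration.
cluster_ids = [17, 42, 105, 230]

# A permutation of the positions 0..3: visit position 2 first, then 0, 3, 1.
perm = [2, 0, 3, 1]

# Equivalent to cluster_ids[perm] when both are NumPy arrays.
route = [cluster_ids[p] for p in perm]
print(route)  # [105, 17, 230, 42]
```

The identity permutation simply returns the ids in their original order, which is the starting point used in the notebook.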

To have a baseline for the route that we will soon obtain by using GA to optimize the permutation inside each cluster, we concatenate the routes inside the clusters as they are at the moment, that is, with the cities of each cluster simply ordered by city id.

In [20]:
start_route = np.concatenate(start_routes).astype(int)

The following is the total length of the route obtained by clustering the cities, without yet ordering the cities within each cluster or ordering the clusters themselves. It is already a significant improvement over a random path.

In [21]:
total_length_loop(start_route, cities)

13139371.401815962

We now run GA inside each cluster in order to find the best permutation of cities in each "province". As described in the report, at this stage we treat this as a standard TSP.
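The GA call below uses total_length_straight inside the clusters, while the full route is scored with total_length_loop. Neither implementation is shown in this notebook, so the following is only a sketch under the assumption, suggested by the names, that "straight" means an open path and "loop" means a closed tour. The function path_length and its closed parameter are hypothetical.

```python
import math

def path_length(points, closed=False):
    """Sum of Euclidean step lengths along `points` (a list of (x, y) pairs).

    With closed=True the step from the last point back to the first is added.
    """
    total = sum(math.hypot(x2 - x1, y2 - y1)
                for (x1, y1), (x2, y2) in zip(points, points[1:]))
    if closed and len(points) > 1:
        (x1, y1), (x2, y2) = points[-1], points[0]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Unit square visited corner by corner.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
open_len = path_length(square)                 # three sides
closed_len = path_length(square, closed=True)  # four sides
```

Using the open-path length inside a cluster seems natural, because the route leaves the cluster at its last city instead of returning to its first one.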

In [23]:
# np.random.seed(4)
# start = datetime.now()
# startl = datetime.now()
# p = 0 
# initial_dist = total_length_loop(start_route, cities)
# last_dist = initial_dist
# print('Initial distance: {}'.format(last_dist))
# for i in range(n):
#     if i % 100 == 0:
#         print('\nStart Loop {} at {}'.format(i, startl))
#     subs_perms[i] = GA(subs[i], np.inf, 30, 10, route_fitness, [shift_mutation], 0.1,
#                                       sel_fun=roulette_selection, cross_fun=two_point_crossover,
#                                       max_no_change = 500, length_fun = total_length_straight)
#     if (i % (99 + p) == 0) and i > 0:
#         p += 100
#         endl = datetime.now()
#         print('End loop {} at {}: {} seconds'.format(i, endl, (endl-startl).total_seconds()))
#         startl = datetime.now()
#         temp_routes = []
#         for j in range(n):
#             temp_routes.append(c_ids[j][subs_perms[j]])
#         temp_full_route = np.concatenate(temp_routes).astype(int)
#         temp_tot = total_length_loop(temp_full_route, cities)
#         print('Total distance so far: {}'.format(temp_tot))
#         if all(np.isin(temp_full_route, cities[:,0])) and all(np.isin(cities[:,0], temp_full_route)):
#             print("Checked: the path goes through all cities")
#         print('Improvement: {}'.format(temp_tot - last_dist))
#         print('Total Improvement: {}'.format(temp_tot - initial_dist))
#         last_dist = temp_tot    
#         
# end = datetime.now()
# print('\nTotal time: {}'.format((end-start).total_seconds()))

Initial distance: 13139371.401815962

Start Loop 0 at 2019-07-06 21:25:04.088554
End loop 99 at 2019-07-06 21:36:13.870470: 669.781916 seconds
Total distance so far: 12592426.630352162
Checked: the path goes through all cities
Improvement: -546944.7714638002
Total Improvement: -546944.7714638002

Start Loop 100 at 2019-07-06 21:36:13.870470
End loop 199 at 2019-07-06 21:46:43.478171: 629.607701 seconds
Total distance so far: 12049227.581763212
Checked: the path goes through all cities
Improvement: -543199.0485889502
Total Improvement: -1090143.8200527504

Start Loop 200 at 2019-07-06 21:46:43.478171
End loop 299 at 2019-07-06 21:57:30.111704: 646.633533 seconds
Total distance so far: 11510104.362655126
Checked: the path goes through all cities
Improvement: -539123.2191080861
Total Improvement: -1629267.0391608365

Start Loop 300 at 2019-07-06 21:57:30.111704
End loop 399 at 2019-07-06 22:08:13.824186: 643.712482 seconds
Total distance so far: 10955587.12708347
Checked: the path goes th

By running the algorithm, we have updated the list 'subs_perms' of permutations inside each cluster. Each element subs_perms[i] in this list is the best permutation found by running GA on subs[i]. We now order each list c_ids[i] by subs_perms[i] in order to obtain the routes as sequences of city ids.

In [24]:
# routes = []
# for i in range(n):
#     routes.append(c_ids[i][subs_perms[i]])

We can now concatenate these routes in order to find a route on the full set of cities.

In [25]:
# full_route = np.concatenate(routes)

We order the route so that it starts and ends at 0.

In [26]:
# zi = np.where(full_route == 0)[0][0]

In [27]:
# full_route = np.concatenate((full_route[zi:], full_route[:zi]))
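The rotation above can be illustrated on a toy route. This is a plain-Python sketch with a made-up route; rotate_to_start is a hypothetical helper, not part of the project's modules. Since the full route is scored as a closed loop, rotating it changes only where the list starts, not which edges are traversed.

```python
def rotate_to_start(route, start_id=0):
    """Rotate a cyclic route (a list of ids) so that start_id comes first."""
    k = route.index(start_id)
    return route[k:] + route[:k]

example = [3, 1, 0, 2, 4]
rotated = rotate_to_start(example)
print(rotated)  # [0, 2, 4, 3, 1]
```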

In [28]:
# full_route = full_route.astype(int)

We save this route so that we do not have to run the algorithm again:

In [29]:
# np.savetxt('files/full_route_GA_in_clusts.txt', full_route)

We load the saved file containing this route:

In [30]:
full_route = np.genfromtxt('files/full_route_GA_in_clusts.txt').astype(int)

As a check, we can see that the length of the full route is correct.

In [31]:
len(full_route)

197769

Also, every entry in the path is the id of a city:

In [32]:
all(np.isin(full_route, cities[:,0])) 

True

... and every city id is in the path:

In [33]:
all(np.isin(cities[:,0], full_route)) 

True
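Taken together, the three checks above (correct length, every entry is a city id, every city id appears) imply that the route visits each city exactly once. A compact equivalent is sketched below, assuming that the city ids are exactly 0, ..., N-1; is_valid_tour is a hypothetical helper.

```python
def is_valid_tour(route, n_cities):
    """True if `route` contains each id in 0..n_cities-1 exactly once."""
    return len(route) == n_cities and set(route) == set(range(n_cities))

ok = is_valid_tour([2, 0, 1], 3)          # every id once
duplicate = is_valid_tour([0, 0, 1], 3)   # id 2 missing, id 0 repeated
too_short = is_valid_tour([0, 1], 3)      # wrong length
```

The length test matters: without it, a route containing every id plus some repeated ones would still pass the two membership checks.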

The total length through this route:

In [34]:
total_length_loop(full_route, cities)

7805290.363298486

Finding the best permutation of cities inside each cluster has improved the total length of the route from ~13 139 371 to ~7 805 290.

Let us now see what would be the length of this route if we penalized every 10th step not starting from a prime city.

We create a boolean mask for non-prime numbers: every city_id such that "not_primes_bool[city_id] == True" will be penalized.

In [35]:
np_not_prime = np.vectorize(not_prime)
nums = np.arange(0, len(cities))
not_primes_bool = np_not_prime(nums)
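np.vectorize calls not_prime once per id in a Python-level loop. A mask of the same kind can also be built in one pass with a sieve of Eratosthenes. The sketch below is an alternative for illustration, not the project's not_prime (whose implementation is not shown here); it assumes that 0 and 1 count as non-prime, and not_prime_mask is a hypothetical name.

```python
def not_prime_mask(n):
    """Return a list m of length n with m[k] True when k is NOT prime."""
    is_prime = [True] * n
    for k in range(min(2, n)):
        is_prime[k] = False  # 0 and 1 are not prime
    p = 2
    while p * p < n:
        if is_prime[p]:
            # Every multiple of p from p*p upwards is composite.
            for multiple in range(p * p, n, p):
                is_prime[multiple] = False
        p += 1
    return [not flag for flag in is_prime]

mask = not_prime_mask(12)
primes_below_12 = [k for k in range(12) if not mask[k]]
print(primes_below_12)  # [2, 3, 5, 7, 11]
```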

We now use the mask to find the total length of the route, where we penalize each 10th step originating from a city that is not prime.

In [36]:
edp(full_route, cities, not_primes_bool)

7878150.9519051965

EDP stands for "Euclidean Distance with Penalties". It is the total length of the route through the cities, where each step is computed as the Euclidean distance between the coordinates of two cities, but every 10th step not originating from a prime city is made 10% longer.
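The definition above can be sketched in a few lines of plain Python. This is not the project's edp: the name edp_sketch and its argument layout are assumptions, steps are counted from 1 starting at the first element of the route as given, and, unlike the notebook's functions, the sketch does not add city 0 at the start or end.

```python
import math

def edp_sketch(route, coords, not_prime):
    """Route length where every 10th step leaving a non-prime city costs 10% more.

    route     -- sequence of city ids
    coords    -- coords[city_id] == (x, y)
    not_prime -- not_prime[city_id] is True when the id is not prime
    """
    total = 0.0
    for step, (a, b) in enumerate(zip(route, route[1:]), start=1):
        (xa, ya), (xb, yb) = coords[a], coords[b]
        d = math.hypot(xb - xa, yb - ya)
        if step % 10 == 0 and not_prime[a]:
            d *= 1.1  # penalty on the 10th, 20th, ... step
        total += d
    return total

# Toy data: city k sits at (k, 0), so every step of the route 0, 1, ..., 11 has length 1.
coords = [(k, 0) for k in range(12)]
route = list(range(12))
mask = [k not in (2, 3, 5, 7, 11) for k in range(12)]
penalized = edp_sketch(route, coords, mask)
```

In this toy route the 10th step leaves city 9, which is not prime, so exactly one of the eleven unit steps is lengthened to 1.1.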

#### Save subroutes as json

We also save the subroutes as a json file:

In [37]:
# routes_dict = dict()
# for i in range(len(routes)):
#     routes_dict[i] = list(routes[i])

In [38]:
# with open('files/subroutes_GA_in_clusts.json', 'w') as fp:
#     json.dump(routes_dict, fp)

# Running GA to find order of subroutes

We now have 1000 clusters and an optimized route inside each cluster. Next, we apply GA to find the best way to order these 1000 clusters.

We reload the routes saved in the previous step:

In [39]:
with open('files/subroutes_GA_in_clusts.json', 'r') as fp:
    loaded_json = json.load(fp)

In [40]:
routes = [loaded_json[str(i)] for i in range(1000)]

In [41]:
for i in range(len(routes)):
    routes[i] = np.array(routes[i]).astype(int)

In [42]:
routes = np.array(routes, dtype=object)  # subroutes differ in length, so recent NumPy versions need an explicit object array

We now apply GA for ordering the clusters.

In [43]:
from ga import subset_fitness

In [44]:
# np.random.seed(4)
# clusts_perm = GA(cities, np.inf, 30, 10, subset_fitness, [shift_mutation], 0.1,
#                         roulette_selection, cross_fun=two_point_crossover, max_no_change = 500,
#                         length_fun = total_length_loop, on_subsets= True, subs = routes, verbose = True)

Iter 0, ItNoChange 0, Best 7740827.460129939
Iter 1000, ItNoChange 15, Best 7165490.384662437
Iter 2000, ItNoChange 2, Best 6996542.826207113
Iter 3000, ItNoChange 4, Best 6883480.473114648
Iter 4000, ItNoChange 3, Best 6838172.580928457
Iter 5000, ItNoChange 29, Best 6795781.080666533
Iter 6000, ItNoChange 155, Best 6762940.747430191
Iter 7000, ItNoChange 195, Best 6752850.351150167
Iter 8000, ItNoChange 136, Best 6734042.59375663


We have obtained clusts_perm, which is the order of the clusters.

In [45]:
# new_full_route = np.concatenate(routes[clusts_perm])

We order it so that it starts and ends at 0:

In [46]:
# zin = np.where(new_full_route == 0)[0][0]
# new_full_route = np.concatenate((new_full_route[zin:], new_full_route[:zin]))

We save the route so we do not need to run the algorithm again:

In [47]:
# np.savetxt('files/route_after_clust_ordering.txt', new_full_route)

We load the route:

In [48]:
new_full_route = np.genfromtxt('files/route_after_clust_ordering.txt').astype(int)

In [49]:
new_full_route

array([     0, 161041,  97580, ...,  25100, 195889,  69414])

The total length through this route, starting and ending at 0, is the following:

In [50]:
total_length_loop(new_full_route, cities)

6726045.709191153

We perform the usual checks.
Every entry in the path is the id of a city...

In [51]:
all(np.isin(new_full_route, cities[:,0]))

True

... and every city is in the path

In [52]:
all(np.isin(cities[:,0], new_full_route))

True

The total distance considering penalties is not very different:

In [54]:
edp(new_full_route[1:], cities, not_primes_bool)

6788202.383675191

# Improving the route by moving prime cities

We start from the route found in the previous step, which has length 6 788 202.383675191 (penalties included).

Reload route:

In [55]:
start_route = np.genfromtxt('files/route_after_clust_ordering.txt').astype(int)

In [56]:
edp(start_route, cities, not_primes_bool)

6788202.383675191

We sort the cities dataset by the route found in the previous step. 

In [57]:
sorted_cities = cities[start_route[1:]]

We have excluded the first city in the route (i.e., the city with ID 0), as we want the SA to only work on inner segments of the full route.

We now want to run Simulated Annealing, starting from this route, with a special mutation function that reverses the order of two consecutive cities $c_1$ and $c_2$ with very high probability if $c_1$ is not prime and found at a tenth step, and $c_2$ is prime. We divide the route and the cities dataset into roughly 1000 subsets so that SA can focus on one segment of the route at a time.
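The mutation rule described above can be sketched as follows. This is not the project's reverse_primes_mutation, whose implementation and parameters (such as n_to_mute) are not shown here. The function name, the offset argument (the number of steps that precede the segment in the full route) and the swap probability p_swap are all assumptions made for illustration.

```python
import random

def reverse_primes_sketch(route, not_prime, offset=0, p_swap=0.95, rng=random):
    """Return a copy of `route` in which, at every 10th step, a non-prime city
    followed by a prime city is swapped with it with probability p_swap."""
    new_route = list(route)
    for i in range(len(new_route) - 1):
        step = offset + i + 1  # 1-based number of the step leaving position i
        c1, c2 = new_route[i], new_route[i + 1]
        if step % 10 == 0 and not_prime[c1] and not not_prime[c2]:
            if rng.random() < p_swap:
                new_route[i], new_route[i + 1] = c2, c1
    return new_route

# Toy mask for ids 0..11 (True = not prime) and a route where the 10th step
# leaves city 9 (not prime) and is followed by city 11 (prime).
mask = [k not in (2, 3, 5, 7, 11) for k in range(12)]
route = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10]
mutated = reverse_primes_sketch(route, mask, p_swap=1.0)
print(mutated)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 9, 10]
```

After the swap the 10th step leaves the prime city 11, so it is no longer penalized. Whether the swap shortens the route overall still depends on the geometry, which is why a move like this is run inside SA and not applied unconditionally.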

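We create new subsets of 200 consecutive cities each (except for the last, which is smaller). Since the route without city 0 contains 197768 cities, this gives 989 subsets instead of 1000.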

In [58]:
clusters = np.repeat(np.arange(1000), 200)[:len(cities)-1]
len(clusters)

197768

We add the information about membership in each subset to the cities dataset sorted by the route:

In [59]:
kscities = np.concatenate((sorted_cities, clusters[:, np.newaxis]), 1)

There are 989 unique subsets:

In [60]:
n = len(np.unique(clusters))
n

989
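The count of 989 can be checked by hand: 988 full subsets of 200 cities cover 197600 cities, and the remaining 168 cities form the last subset. The sketch below reproduces the labelling with integer division in plain Python, which gives the same labels as np.repeat(np.arange(1000), 200) truncated to the route length. The variable names are made up for illustration.

```python
n_cities_in_route = 197768  # route length without city 0, as in the notebook
chunk_size = 200

# Position k of the route belongs to subset k // chunk_size.
labels = [k // chunk_size for k in range(n_cities_in_route)]

n_subsets = labels[-1] + 1
last_size = n_cities_in_route - (n_subsets - 1) * chunk_size
print(n_subsets, last_size)  # 989 168
```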

We create the list of subsets:

In [61]:
subs = [0]*n

In [62]:
for i in range(n):
    subs[i] = kscities[kscities[:, 3] == i][:, :3]

In [63]:
lens = [len(subs[i]) for i in range(n)]
np.array(lens[-20:])

array([200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
       200, 200, 200, 200, 200, 200, 168])

We initialize the permutation on each subset:

In [64]:
subs_perm = [0]*n

In [65]:
for i in range(n):
    subs_perm[i] = np.arange(len(subs[i]))

In [68]:
c_ids = [el[:,0] for el in subs]

We run SA on each subset, with the described mutation function.

In [69]:
from sa import SA, reverse_primes_mutation
from santas_path import edp_unordered_straight

In [70]:
np.random.seed(4)
start = datetime.now()
startl = datetime.now()
p = 0
last_dist = edp(start_route, cities, not_primes_bool)
initial_dist = last_dist
print('Initial distance: {}'.format(last_dist))
for i in range(n):
    if (i % 100 == 0) and (i > 0):
        print('\nStart Loop {} at {}'.format(i, startl))
    
    perm_init = np.arange(len(subs[i]))
    subs_perm[i] = SA(subs[i], edp_unordered_straight, reverse_primes_mutation,
                          black_list = not_primes_bool, scale = 100000, n_to_mute= 10,
                          maxIter = np.inf,perm_init = perm_init, maxIterNoChange=1000)
    if ((i % (99 + p) == 0) or (i == n - 1)) and i > 0:
        p += 100
        endl = datetime.now()
        print('End loop {} at {}: {} seconds'.format(i, endl, (endl-startl).total_seconds()))
        startl = datetime.now()
    if i % 20 == 0:
        temp_routes = []
        for j in range(n):
            temp_routes.append(c_ids[j][subs_perm[j]])
        temp_full_route = np.concatenate(temp_routes).astype(int)
        temp_full_route = np.concatenate(([0], temp_full_route))
        temp_tot = edp(temp_full_route[1:], cities, not_primes_bool)
        print('\nLoop {}, Total distance so far: {}'.format(i, temp_tot))   
        if all(np.isin(temp_full_route, cities[:,0])) and all(np.isin(cities[:,0], temp_full_route)):
            print("Checked: the path goes through all cities")
        print('Improvement: {}'.format(temp_tot - last_dist))
        print('Total Improvement: {}'.format(temp_tot - initial_dist))
        last_dist = temp_tot        
end = datetime.now()
print('\nTotal time: {}'.format((end-start).total_seconds()))

Initial distance: 6788202.383675191

Loop 0, Total distance so far: 6788148.157136387
Checked: the path goes through all cities
Improvement: -54.22653880342841
Total Improvement: -54.22653880342841

Loop 20, Total distance so far: 6786631.467968951
Checked: the path goes through all cities
Improvement: -1516.6891674362123
Total Improvement: -1570.9157062396407

Loop 40, Total distance so far: 6785151.752531693
Checked: the path goes through all cities
Improvement: -1479.7154372576624
Total Improvement: -3050.631143497303

Loop 60, Total distance so far: 6782841.286835057
Checked: the path goes through all cities
Improvement: -2310.4656966365874
Total Improvement: -5361.0968401338905

Loop 80, Total distance so far: 6780548.740222682
Checked: the path goes through all cities
Improvement: -2292.5466123744845
Total Improvement: -7653.643452508375
End loop 99 at 2019-07-07 00:47:52.229280: 199.657126 seconds

Start Loop 100 at 2019-07-07 00:47:52.229280

Loop 100, Total distance so far: 67

End loop 899 at 2019-07-07 01:13:29.446761: 192.63786 seconds

Start Loop 900 at 2019-07-07 01:13:29.446761

Loop 900, Total distance so far: 6689399.306424883
Checked: the path goes through all cities
Improvement: -2854.733207870275
Total Improvement: -98803.07725030743

Loop 920, Total distance so far: 6687112.421960575
Checked: the path goes through all cities
Improvement: -2286.8844643086195
Total Improvement: -101089.96171461605

Loop 940, Total distance so far: 6685060.7603020705
Checked: the path goes through all cities
Improvement: -2051.6616585040465
Total Improvement: -103141.62337312009

Loop 960, Total distance so far: 6682661.37152527
Checked: the path goes through all cities
Improvement: -2399.388776800595
Total Improvement: -105541.01214992069

Loop 980, Total distance so far: 6680208.149763066
Checked: the path goes through all cities
Improvement: -2453.2217622036114
Total Improvement: -107994.2339121243
End loop 988 at 2019-07-07 01:16:21.863753: 172.416992 seconds

To

As before, we find the route inside each subset as a sequence of city ids, concatenate these routes, and add the first city (id 0) at the start to obtain the final route.

In [71]:
# c_ids = [el[:,0] for el in subs]
# routes = []
# for i in range(n):
#     routes.append(c_ids[i][subs_perm[i]])
# 
# final_full_route = np.concatenate(routes).astype(int)
# final_full_route = np.concatenate(([0], final_full_route))

In [72]:
# np.savetxt('files/final_after_sa.txt', final_full_route)

We have thus obtained the following route:

In [73]:
final_full_route = np.genfromtxt('files/final_after_sa.txt').astype(int)

In [74]:
final_full_route

array([     0, 161041,  97580, ..., 144501, 195889,  69414])

In [75]:
edp(final_full_route, cities, not_primes_bool)

6679052.190203142

The length of the final route, penalties included, is 6 679 052.190203142.