## Definitions

Given two different strings of equal length, the spacing between them is the number of other strings you would need to connect them on a word ladder. Equivalently, it is one less than the number of positions at which the two strings differ. Examples:

    spacing("shift", "shirt") => 0
    spacing("shift", "whist") => 1
    spacing("shift", "wrist") => 2
    spacing("shift", "taffy") => 3
    spacing("shift", "hints") => 4

The total spacing of a word list is the sum of the spacing between each consecutive pair of words on the word list, i.e. the number of (not necessarily distinct) strings you'd need to insert to make it into a word ladder. For example, the list:

    daily
    doily
    golly
    guilt

has a total spacing of 0 + 1 + 2 = 3.
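A minimal Python sketch of the two definitions, assuming the inputs are different strings of equal length. The name `total_spacing` is illustrative and does not come from the challenge text.

```python
def spacing(a, b):
    """Spacing between two different, equal-length strings."""
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing - 1


def total_spacing(words):
    """Sum of the spacing over each consecutive pair of the list."""
    return sum(spacing(a, b) for a, b in zip(words, words[1:]))


print(spacing("shift", "whist"))                              # 1
print(total_spacing(["daily", "doily", "golly", "guilt"]))    # 3
```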

## Challenge

Given an input list of unique words and a maximum total spacing, output a list of distinct words taken from the input list. The output list's total spacing must not exceed the given maximum. The output list should be as long as possible.

You are allowed to use existing libraries and research in forming your solution. (I'm guessing there's some graph theory algorithm that solves this instantly, but I don't know it.)

#### Example input

    abuzz
    carts
    curbs
    degas
    fruit
    ghost
    jupes
    sooth
    weirs
    zebra

Maximum total spacing: 10

#### Example output

The longest possible output given this input has a length of 6:

    zebra
    weirs
    degas
    jupes
    curbs
    carts
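For an input this small, the challenge can be checked by exhaustive search. The following is a sketch only: `longest_list` is a hypothetical helper, it tries every ordering that stays within the budget, and it is far too slow for the 1000-word challenge input.

```python
def spacing(a, b):
    return sum(1 for x, y in zip(a, b) if x != y) - 1


def longest_list(words, budget):
    """Exhaustive depth-first search; only practical for small inputs."""
    best = []

    def extend(path, used, remaining):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for i, w in enumerate(words):
            if i in used:
                continue
            cost = spacing(path[-1], w)
            if cost <= remaining:
                used.add(i)
                path.append(w)
                extend(path, used, remaining - cost)
                path.pop()
                used.remove(i)

    for i, w in enumerate(words):
        extend([w], {i}, budget)
    return best


words = "abuzz carts curbs degas fruit ghost jupes sooth weirs zebra".split()
best = longest_list(words, 10)
print(len(best), best)
```

The search may return a different list from the one shown above, since several lists of the same length can fit within the budget.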

#### Challenge input

A list of 1000 4-letter words randomly chosen from enable1 (loaded below from `366-hard-words.txt`).

Maximum total spacing of 100.

My best solution has a length of 602. How much higher can you get?

## Solution outline

- Calculate the pairwise spacing distances.
- Remove isolated nodes (those with no 0-cost edge to any other word).
- Add a hyper-node at distance 0 from every word, which turns the Hamiltonian path problem into a TSP.
- Run a TSP solver (LKH) on the remaining nodes.
- Remove the hyper-node, which turns the cycle into a path.
- Shorten the path until the cost requirement is met.

In [1]:
import numpy as np
import networkx as nx
import random
import collections
import os
import subprocess

In [2]:
with open('366-hard-words.txt') as f:
    file = f.read().splitlines()
    
print(f'Words:\n  {len(file)}')

Words:
  1000


### Calculate spacing distances

In [3]:
def spacing(a,b):
    return max(sum(1 for x,y in zip(a,b) if x != y)-1,0)

def distance_words(words):
    return [[spacing(x,y) for x in words] for y in words]

In [4]:
L = distance_words(file)

D = {x:0 for x in range(4)}
for x in L:
    for y in x:
        D[y] += 1
        
print(f'Edges grouped by weight:\n {D}')

Edges grouped by weight:
 {0: 3804, 1: 35618, 2: 243880, 3: 716698}


### Remove isolated nodes

In [5]:
# Redundant: I first investigated using cliques to build a 0-cost skeleton.
# Now this is only used to identify the isolated nodes,
# which could be found more cheaply by looking at the distance matrix directly.

def clique(L):
    edges = [(ix,iy,y) for ix,x in enumerate(L) 
                 for iy,y in enumerate(x) 
                 if not y]   
    G=nx.Graph()
    for x,y,w in edges:
        G.add_edge(x,y)        
    cliques = list(nx.algorithms.clique.find_cliques(G))
    return cliques

def member(cliques):
    members = collections.defaultdict(list)
    for ix,x in enumerate(cliques):
        for y in x:
            members[y].append([z for z in x if z != y])
    return members

In [6]:
cliques = clique(L)
members = member(cliques)

measure = {x:0 for x in range(1,9)}
for x in cliques:
    measure[len(x)] += 1  

print(f'Cliques:\n  {len(cliques)}')
print(f'Clique sizes:\n  {measure}')
print(f'Total clique nodes:\n  {sum((y*x) for x,y in measure.items())}')
print(f'Unique nodes:\n  {len(members)}')
print(f'Expensive nodes:\n  {measure[1]}')

Cliques:
  855
Clique sizes:
  {1: 149, 2: 518, 3: 131, 4: 37, 5: 10, 6: 8, 7: 1, 8: 1}
Total clique nodes:
  1839
Unique nodes:
  1000
Expensive nodes:
  149


In [7]:
expensive = {x[0] for x in cliques if len(x) == 1}

Lsmall = [[y for iy,y in enumerate(x) if iy not in expensive] 
          for ix,x in enumerate(L) if ix not in expensive]
filesmall = [x for ix,x in enumerate(file) if ix not in expensive]
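As the comment in cell 5 notes, the isolated nodes can also be read straight off the distance matrix. A minimal sketch, assuming a square matrix like `L` from `distance_words`; `isolated_indices` is a hypothetical helper and not part of the notebook.

```python
import numpy as np


def isolated_indices(dist):
    """Indices of words with no 0-cost edge to any *other* word."""
    d = np.array(dist)
    zero = (d == 0)
    np.fill_diagonal(zero, False)   # ignore each word's distance to itself
    return set(np.flatnonzero(~zero.any(axis=1)).tolist())


# Toy matrix: words 0 and 1 are neighbours, word 2 is isolated.
toy = [[0, 0, 2],
       [0, 0, 3],
       [2, 3, 0]]
print(isolated_indices(toy))  # {2}
```

Applied to `L`, this should yield the same set as `expensive`, without building a graph or enumerating cliques.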

### Add hyper-node

In [8]:
arr = np.array(Lsmall)
hamilithon_path = np.pad(arr, [(0,1),(0,1)], "constant")
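To illustrate why the padding works, here is a sketch on a toy 3-word matrix (the matrix and the helper `walk_cost` are made up for illustration). The extra row and column of zeros act as a hyper-node that can be entered and left for free, so the cheapest tour through it costs the same as the cheapest Hamiltonian path through the real words.

```python
import numpy as np
from itertools import permutations

# Toy symmetric distance matrix with a zero diagonal.
dist = np.array([[0, 1, 2],
                 [1, 0, 3],
                 [2, 3, 0]])
n = dist.shape[0]

# Append one row and one column of zeros: the hyper-node gets index n.
padded = np.pad(dist, [(0, 1), (0, 1)], "constant")


def walk_cost(d, order):
    """Total cost of visiting the nodes in the given order."""
    return int(sum(d[a, b] for a, b in zip(order, order[1:])))


# Cheapest open path over the real nodes, by brute force.
best_path = min(walk_cost(dist, p) for p in permutations(range(n)))
# Cheapest closed tour that starts and ends at the hyper-node.
best_tour = min(walk_cost(padded, (n,) + p + (n,)) for p in permutations(range(n)))
print(best_path, best_tour)  # 3 3
```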

### Run TSP-solver

In [9]:
# Uses the Windows LKH binary (lkh.exe), expected in the same directory
# http://www.akira.ruc.dk/~keld/research/LKH/lkh.exe

template = """NAME: {name}
TYPE: TSP
COMMENT: {name}
DIMENSION: {n_cities}
EDGE_WEIGHT_TYPE: EXPLICIT
EDGE_WEIGHT_FORMAT: LOWER_DIAG_ROW
EDGE_WEIGHT_SECTION
{matrix_s}EOF"""

def dumps_matrix(arr, name="Problem"):
    n_cities = arr.shape[0]
    width = len(str(arr.max())) + 1
    matrix_s = ""
    for i, row in enumerate(arr.tolist()):
        matrix_s += " ".join(["{0:>{1}}".format((int(elem)), width)
                              for elem in row[:i+1]])
        matrix_s += "\n"
    return template.format(**{'name': name,
                              'n_cities': n_cities,
                              'matrix_s': matrix_s})

def _create_lkh_par(tsp_path, runs=4):
    par_path = tsp_path + ".par"
    out_path = tsp_path + ".out"
    par = 'PROBLEM_FILE = {}\nRUNS = {}\nTOUR_FILE = {}'.format(tsp_path, runs, out_path)
    with open(par_path, 'w') as dest:
        dest.write(par)
    return par_path, out_path

def run(tsp_file='problem.tsp',runs = 1):
    with open(tsp_file, 'w') as problem:
        problem.write(dumps_matrix(hamilithon_path, name=tsp_file))
    par_path, out_path = _create_lkh_par(tsp_file, runs)
    
    subprocess.call(['lkh', par_path])

    with open(out_path) as solution:
        lkhout = solution.readlines()
    cost = int(lkhout[1].strip().split(' ')[-1])

    return [int(x)-1 for x in lkhout[6:-2:1]], cost
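`dumps_matrix` writes only `row[:i+1]` of each row, which matches the `LOWER_DIAG_ROW` format declared in the template: the lower triangle, row by row, diagonal included. A small sketch of that layout on a toy matrix (`lower_diag_rows` is an illustrative name, not part of the notebook):

```python
def lower_diag_rows(matrix):
    """Rows of the lower triangle, diagonal included (row i has i+1 entries)."""
    return [row[:i + 1] for i, row in enumerate(matrix)]


toy = [[0, 1, 2],
       [1, 0, 3],
       [2, 3, 0]]
for row in lower_diag_rows(toy):
    print(" ".join(str(v) for v in row))
# 0
# 1 0
# 2 3 0
```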

In [10]:
c,cost = run()

print(f'A TSP solution with spacing cost {cost} has been found for the {len(filesmall)} non-expensive nodes')

A TSP solution with spacing cost 113 has been found for the 851 non-expensive nodes


### Remove hypernode

In [11]:
ix = [ix for ix,x in enumerate(c) if x == len(c)-1][0]
p = c[ix+1:] + c[:ix]
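The same rotation, sketched as a standalone function on a toy tour. `cycle_to_path` is a hypothetical name. In the notebook the hyper-node is `len(c)-1` because it was appended last by the padding step, and the tour ids were already shifted to 0-based by `int(x)-1`.

```python
def cycle_to_path(tour, dummy):
    """Rotate a cyclic tour so it starts right after `dummy`, then drop `dummy`."""
    i = tour.index(dummy)
    return tour[i + 1:] + tour[:i]


# Cycle 2 -> 0 -> 3 -> 1 -> (back to 2), with 3 as the hyper-node.
print(cycle_to_path([2, 0, 3, 1], dummy=3))  # [1, 2, 0]
```

Cutting the cycle at the hyper-node removes only its two zero-cost edges, so the path keeps the tour's cost.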

### Shorten path

In [12]:
cost = 113

In [13]:
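def find_running_cost(p,file):
    # Collect every costly step: (position in p, word index x, word index y, spacing > 0).
    check = []
    mx = len(p)
    for ix,x,y in zip(range(mx),p,p[1:]):
        distance = spacing(file[x],file[y])
        if distance:
            check.append((ix,x,y,distance))
    
    # For consecutive costly steps, record (position of the first, gap to the next,
    # end of the first step, start of the next step).
    # Note: zip(check[:-2], check[1:]) stops early, so the last two costly steps
    # never start a record.
    result = []
    for (ix,ax,bx,cx),(iy,ay,by,cy) in zip(check[:-2],check[1:]):
        result.append((ix,iy-ix, bx,ay))
            
    return result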

result = find_running_cost(p,filesmall)

a = 0
b = len(c)

while cost > 100+2:
    if result[0][1] > result[-1][1]:
        a = result[0][0]
        del result[0]
    else:
        b = result[-1][0]
        del result[-1]
    cost -= 1
    
r = p[a:b]

print(f'A {len(r)} words ladder with spacing cost 100 has been found')

' '.join([filesmall[x] for x in r])

A 770 words ladder with spacing cost 100 has been found


'haem idem idea iced ired prez prep peep weep weer deer dyer eyer ewer twee tree gree bree free flee fere fare fard lard lari lars lacs lack jack jock mock yock rock rocs mocs macs maes maps mads muds mugs migs mils oils olds elds elks ilks ills ells eels mels mems mess moss foss fuss buss cuss coss coys coos cots tots bots buts butt bute byte eyne sync syne syce sice fice fico piso miso mist wist wisp wiry wary warm farm barm bark sark dark dank yank oink kink kino kine fine bine bike bize bile bale dale dals daks oaks oafs kafs kifs kids kirs kiss hiss hose poke pole pele hebe heme heck keck kept wept weft left loft coft soft sift gift girt airt airn tirl tiro tyro typo tyre tare tarp harp hasp hash hush tush push pugh pugs purs purl burl furl farl harl hail heil deil debt deet leet leer jeer peer peel keel reel reek geek geed deed dees zees rees rues dues dubs duos dups sups suds sums sumo such ouch orcs orca orra okra aura jura jury fury fumy mumm mumu muni mini minx miff biff tiff
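The trimming loop above works greedily from both ends and relies on the hand-tuned `100+2` threshold. As a sketch of an alternative (not the author's method), a two-pointer sliding window finds the longest contiguous stretch of the path whose total spacing fits the budget. `longest_window` is a hypothetical helper, and the cost list below is made-up toy data.

```python
def longest_window(costs, budget):
    """Longest run of consecutive steps with total cost <= budget.

    costs[i] is the spacing between path[i] and path[i+1] (non-negative).
    Returns (start, end) such that path[start:end] is the chosen sub-path.
    """
    best = (0, 1)          # a single word always fits (assumes a non-empty path)
    left = 0
    running = 0
    for right, c in enumerate(costs):
        running += c
        while running > budget:
            running -= costs[left]
            left += 1
        # Steps left..right are kept, i.e. words left..right+1.
        if right + 2 - left > best[1] - best[0]:
            best = (left, right + 2)
    return best


costs = [0, 2, 0, 0, 1, 0]        # spacing along a toy 7-word path
print(longest_window(costs, 1))   # (2, 7)
```

With the notebook's variables, the costs would be `[spacing(filesmall[x], filesmall[y]) for x, y in zip(p, p[1:])]` and the result would be sliced as `p[start:end]`.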

## Verify

In [14]:
sum([spacing(filesmall[x],filesmall[y]) for x,y in zip(r,r[1:])])

100