<i> This notebook presents a greedy heuristic for the Google HashCode photo slideshow optimization problem; more details can be found at https://www.kaggle.com/c/hashcode-photo-slideshow/data. The approach builds an n×n matrix of "interest metric" values between each pair of images. The highest-value pairs are repeatedly appended to the output sequence, and their corresponding values are nullified in the matrix, until all images have been added. After the individual horizontal and vertical images are ordered, a meta-level optimization pairs vertical images together into slides. Because double-vertical slides are not considered in the initial ordering, this may miss the global optimum, but the split reduces the order of magnitude of the computation. </i>
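
The greedy ordering described above can be sketched on a small score matrix. This is only one possible reading of the description, not the notebook's final implementation: the helper name `greedy_order` is hypothetical, the chain is extended from its tail only, and "nullifying" is done with `-inf` instead of zero because zero is a legitimate interest value.

```python
import numpy as np

def greedy_order(scores):
    """Greedy chain over a symmetric score matrix (hypothetical helper)."""
    s = np.array(scores, dtype=float)        # work on a copy of the input
    n = s.shape[0]
    if n < 2:
        return list(range(n))
    np.fill_diagonal(s, -np.inf)             # an image cannot follow itself
    # Seed the sequence with the single best-scoring pair.
    i, j = np.unravel_index(np.argmax(s), s.shape)
    order = [int(i), int(j)]
    s[:, [i, j]] = -np.inf                   # used images can no longer be chosen
    tail = int(j)
    while len(order) < n:
        nxt = int(np.argmax(s[tail]))        # best unused neighbour of the tail
        order.append(nxt)
        s[:, nxt] = -np.inf                  # nullify the newly used image
        tail = nxt
    return order

# Toy 4-image example; the values are made up for illustration.
toy_scores = [[0, 3, 1, 0],
              [3, 0, 2, 0],
              [1, 2, 0, 5],
              [0, 0, 5, 0]]
toy_order = greedy_order(toy_scores)
```

Extending only from the tail keeps each step to a single row scan, at the price of sometimes accepting a zero-interest transition when the tail has no good unused neighbour.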

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import gc
import psutil
import pickle
import time
from joblib import Parallel, delayed

In [2]:
with open('d_pet_pictures.txt', 'rb') as f:
    data = f.readlines()[1:]

In [3]:
def generate_tagdict(data):
    hdict, vdict = {}, {}
    for i in range(len(data)):
        #Each line reads: orientation, tag count, then the tags themselves
        row = data[i].decode('utf-8').split()
        #Skip row[1] (the tag count) so it is not stored as a tag
        if row[0]=="H": hdict[str(i)] = set(row[2:])
        else: vdict[str(i)] = set(row[2:])
    return hdict, vdict
hdict, vdict = generate_tagdict(data)

In [4]:
def calculate_interest(name1, name2):
    #Invariant: names are of the form (H/V):K where H/V indicates the dictionary in question and K is the key
    #The hdict and vdict dictionaries contain sets
    def namedisamb(s):
        kind, key = s.split(":")
        return hdict[key] if kind=="H" else vdict[key]
    tag1, tag2 = namedisamb(name1), namedisamb(name2)
    return min(len(tag1&tag2), len(tag1-tag2), len(tag2-tag1))
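
As a small worked example of the metric, the sketch below applies the same formula directly to two tag sets. The standalone helper `interest` and the tags themselves are made up for illustration; they are not part of the dataset.

```python
def interest(tags_a, tags_b):
    """Minimum of: shared tags, tags only in a, tags only in b."""
    return min(len(tags_a & tags_b), len(tags_a - tags_b), len(tags_b - tags_a))

photo_a = {"cat", "beach", "sun"}
photo_b = {"cat", "garden", "selfie", "sun"}

shared = photo_a & photo_b      # {"cat", "sun"}        -> 2 tags
only_a = photo_a - photo_b      # {"beach"}             -> 1 tag
only_b = photo_b - photo_a      # {"garden", "selfie"}  -> 2 tags
score = interest(photo_a, photo_b)
```

The smallest of the three counts decides the score, so two photos with identical tags, or with no tags in common, both score zero.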

In [5]:
#Generate interest master table
start = time.time()
hkeys = np.vectorize(lambda s: "H:"+str(s))(np.array(list(hdict.keys())))
vkeys = np.vectorize(lambda s: "V:"+str(s))(np.array(list(vdict.keys())))
totalkeys = np.append(hkeys, vkeys) 
intmatrix = np.zeros((totalkeys.shape[0], totalkeys.shape[0]), dtype=np.int32)
for i in range(totalkeys.shape[0]):
    for j in range(i+1, totalkeys.shape[0]):
        intmatrix[i, j] = calculate_interest(totalkeys[i], totalkeys[j])
    if (i+1)%10000==0: print("Calculation Checkpoint: "+str(i+1))
print(time.time()-start)

Calculation Checkpoint: 10000


KeyboardInterrupt: 
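
The pure-Python double loop had to be interrupted. One possible alternative, shown here as a sketch and not as the approach this notebook goes on to use, is to encode the tags as a sparse binary matrix so that the shared-tag counts come from a single matrix product. The helper names `build_tag_matrix` and `interest_block` are hypothetical. Rows are processed in blocks because a dense 90000×90000 `int32` matrix would need roughly 32 GB.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_tag_matrix(tag_sets):
    """Binary photo-by-tag matrix in CSR form, plus the tag count per photo."""
    vocab = {t: k for k, t in enumerate(sorted(set().union(*tag_sets)))}
    indptr = np.cumsum([0] + [len(s) for s in tag_sets])
    indices = np.array([vocab[t] for s in tag_sets for t in s], dtype=np.int64)
    data = np.ones(len(indices), dtype=np.int32)
    A = csr_matrix((data, indices, indptr), shape=(len(tag_sets), len(vocab)))
    return A, np.diff(indptr)

def interest_block(A, sizes, rows):
    """Interest of the photos in `rows` against every photo."""
    common = (A[rows] @ A.T).toarray()        # shared-tag counts
    only_a = sizes[rows][:, None] - common    # tags only in the row photo
    only_b = sizes[None, :] - common          # tags only in the column photo
    return np.minimum(common, np.minimum(only_a, only_b))

# Toy data; with the real dictionaries the list could be built from
# the values of hdict and vdict in a fixed key order.
toy_tags = [{"a", "b", "c"}, {"b", "c", "d", "e"}, {"x"}]
A, sizes = build_tag_matrix(toy_tags)
block = interest_block(A, sizes, [0, 1, 2])
```

Each call returns a dense block of shape `(len(rows), n)`, so the block size would need to be chosen to fit in available memory.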

In [6]:
print(time.time()-start)

3849.2248919010162


In [7]:
i

11191

In [10]:
sum(np.arange(90000, 90000-11191,-1))/sum(np.arange(90000,1,-1))

0.2332261382856798
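
The ratio above suggests that about 23% of the pairs had been evaluated when the loop was interrupted. A rough extrapolation of the full running time, assuming a constant cost per pair and that the elapsed time printed in cell 6 was all spent in the loop, could look like this:

```python
n = 90_000                       # number of images, as used in the cell above
rows_done = 11_191               # value of i at the interrupt
elapsed = 3849.2248919010162     # seconds, from the output of cell 6

# Row r of the upper triangle holds n - 1 - r pairs.
pairs_done = rows_done * (n - 1) - rows_done * (rows_done - 1) // 2
pairs_total = n * (n - 1) // 2
fraction = pairs_done / pairs_total

est_total_hours = elapsed / fraction / 3600
```

Under these assumptions the complete matrix would take roughly four and a half hours with this loop. The closed-form fraction closely matches the value computed above; the small difference comes from the off-by-one in the summation ranges.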