CMPE 266-01

Spring 2023

Team: Phillip Nguyen, Xuewei Zheng, Roger Kuo

Movie Recommendation System (using multi-dimensional indexing)

# Initialization

In [1]:
# If using Google Colab
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [21]:
import time
import pandas as pd

MovieLens 1M Dataset found here: https://grouplens.org/datasets/movielens/1m/

In [3]:
# Change directory
directory = "gdrive/My Drive/Colab Notebooks/CMPE 266/"
# directory = "gdrive/My Drive/CMPE 266/Group Project/"

movies = pd.read_csv(directory+"movies.dat", sep='::', names=['mID','title','genres'], engine='python', encoding="ISO-8859-1")
ratings = pd.read_csv(directory+"ratings.dat", sep='::', names=['uID','mID','rating','timestamp'], engine='python')
users = pd.read_csv(directory+"users.dat", sep='::', names=['uID','gender','age','occupation','zip'], engine='python')

# Data Preparation

## View datasets

Information on the tables and their columns: https://files.grouplens.org/datasets/movielens/ml-1m-README.txt

In [4]:
print(movies.shape)
movies.head()

(3883, 3)


| | mID | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [5]:
print(ratings.shape)
ratings.head()

(1000209, 4)


| | uID | mID | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [6]:
print(users.shape)
users.head()

(6040, 5)


| | uID | gender | age | occupation | zip |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


## Preprocessing

In [7]:
from sklearn.preprocessing import LabelEncoder

### Movies table

Omit the title column. We will use it later.

Each movie can have multiple genres, sorted in alphabetical order.
To simplify our process, we use only the first assigned genre.
Otherwise, we would have to add more feature dimensions, one per additional genre.
While that would work fine with machine learning, it is inefficient for our purposes with multi-dimensional indexing.

In [8]:
# Omit title column
m_df = movies.copy(deep=True)
m_df = m_df[['mID','genres']]
# Simplify genres column
m_df['genres'] = m_df['genres'].map(lambda genres: genres.split('|')[0])

# Encode the genres as integers
mle = LabelEncoder()
m_df['genres'] = mle.fit_transform(m_df['genres'])

m_df.head()

| | mID | genres |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 1 |
| 2 | 3 | 4 |
| 3 | 4 | 4 |
| 4 | 5 | 4 |
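
The encoding step above can be sketched on a few toy rows. The frame `toy` and the names `encoder` and `genre_to_code` are illustrative and not part of the notebook. The sketch shows that `LabelEncoder` assigns codes in sorted label order and that the codes can be decoded again. Note that integer codes imply an ordering: with a Euclidean metric, code 0 is treated as closer to code 1 than to code 17, even though the genres themselves have no such order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows in the same 'genres' format as movies.dat (illustrative only)
toy = pd.DataFrame({
    "mID": [1, 2, 3],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Keep only the first listed genre, then encode it as an integer
toy["first_genre"] = toy["genres"].map(lambda genres: genres.split("|")[0])
encoder = LabelEncoder()
toy["genre_code"] = encoder.fit_transform(toy["first_genre"])

# classes_ is sorted, so the codes follow the alphabetical order of the labels
genre_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(genre_to_code)

# Codes can be mapped back to their labels
print(encoder.inverse_transform(toy["genre_code"]))
```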


### Ratings table

Omit the timestamp column. It is not useful as a feature for our queries.

Omit the rating column. We may use it later.

In [9]:
r_df = ratings.copy(deep=True)
r_df = r_df[['uID','mID']]
r_df.head()

| | uID | mID |
|---|---|---|
| 0 | 1 | 1193 |
| 1 | 1 | 661 |
| 2 | 1 | 914 |
| 3 | 1 | 3408 |
| 4 | 1 | 2355 |


### Users table

Because many ratings come from the same user, we may not want to use every user feature.
Otherwise, the nearest neighbors of a query may all be ratings from that same user.

In [10]:
# Copy the users table so the original stays unchanged
u_df = users.copy(deep=True)

# Omit occupation and zip columns
u_df = u_df[['uID','gender','age']]

# Omit zip column
# u_df = u_df[['uID','gender','age','occupation']]

# Keep all columns
# u_df = u_df[['uID','gender','age','occupation','zip']]

# Encode the genders as integers
ule = LabelEncoder()
u_df['gender'] = ule.fit_transform(u_df['gender'])

u_df.head()

| | uID | gender | age |
|---|---|---|---|
| 0 | 1 | 0 | 1 |
| 1 | 2 | 1 | 56 |
| 2 | 3 | 1 | 25 |
| 3 | 4 | 1 | 45 |
| 4 | 5 | 1 | 25 |


### Combined table

Combine all three tables into one.

In [11]:
result = pd.merge(r_df, m_df, on='mID')
result = pd.merge(result, u_df, on='uID')

# Omit uID and mID
result = result[['genres','gender','age']]
feature_count = result.shape[1]
print(result.shape)
result.head()

(1000209, 3)


| | genres | gender | age |
|---|---|---|---|
| 0 | 7 | 0 | 1 |
| 1 | 2 | 0 | 1 |
| 2 | 11 | 0 | 1 |
| 3 | 7 | 0 | 1 |
| 4 | 2 | 0 | 1 |
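
The `get_recommendations` helper defined later looks up `ratings.iloc[i]`, which assumes that row `i` of `result` describes the same rating as row `i` of `ratings`. Depending on the pandas version, an inner merge may not preserve the row order of the left frame, so this assumption is worth checking. The following is a minimal sketch on toy frames (all `*_toy` names are hypothetical) of a merge that keeps the ratings order by using left joins, and verifies it.

```python
import pandas as pd

ratings_toy = pd.DataFrame({"uID": [1, 2, 1, 3], "mID": [20, 10, 10, 20]})
movies_toy = pd.DataFrame({"mID": [10, 20], "genres": [4, 7]})
users_toy = pd.DataFrame({"uID": [1, 2, 3], "gender": [0, 1, 1], "age": [1, 56, 25]})

# Left joins keep one output row per ratings row, in the original ratings order.
# validate= raises an error if a key is unexpectedly duplicated on the right side.
merged = ratings_toy.merge(movies_toy, on="mID", how="left", validate="many_to_one")
merged = merged.merge(users_toy, on="uID", how="left", validate="many_to_one")

features = merged[["genres", "gender", "age"]]

# Row i of features now describes the same rating as row i of ratings_toy
assert merged["mID"].tolist() == ratings_toy["mID"].tolist()
assert merged["uID"].tolist() == ratings_toy["uID"].tolist()
print(merged)
```

Keeping `mID` in the merged frame, as `merged` does here, is another option: titles can then be looked up from the merged frame itself rather than from `ratings`.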


# Multi-dimensional Indexing

## KD Tree

Insert our combined table into a KD-Tree.

In [12]:
from sklearn.neighbors import KDTree

In [13]:
KDTree.valid_metrics

['euclidean',
 'l2',
 'minkowski',
 'p',
 'manhattan',
 'cityblock',
 'l1',
 'chebyshev',
 'infinity']

In [22]:
# The default metric is minkowski with p=2, which is equivalent to euclidean.

# Create four trees for comparison
start = time.time()
kdt2 = KDTree(result, leaf_size=2, metric='euclidean')
print("Time to train kdt2:", time.time() - start)

start = time.time()
kdt4 = KDTree(result, leaf_size=4, metric='euclidean')
print("Time to train kdt4:", time.time() - start)

start = time.time()
kdt8 = KDTree(result, leaf_size=8, metric='euclidean')
print("Time to train kdt8:", time.time() - start)

start = time.time()
kdt16 = KDTree(result, leaf_size=16, metric='euclidean')
print("Time to train kdt16:", time.time() - start)

Time to train kdt2: 1.033564567565918
Time to train kdt4: 0.9127256870269775
Time to train kdt8: 0.8331103324890137
Time to train kdt16: 0.9205224514007568
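
The four builds above repeat the same pattern, which could also be written as a loop. This is a sketch on synthetic data, not the notebook's `result` table; the array `X` and the dictionaries `trees` and `build_seconds` are assumptions made for illustration. It uses `time.perf_counter`, a monotonic clock intended for measuring short durations.

```python
import time

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
# Synthetic stand-in for the combined table: genre code, gender, age bucket
X = np.column_stack([
    rng.integers(0, 18, size=5000),
    rng.integers(0, 2, size=5000),
    rng.choice([1, 18, 25, 35, 45, 50, 56], size=5000),
])

trees, build_seconds = {}, {}
for leaf_size in (2, 4, 8, 16):
    start = time.perf_counter()
    trees[leaf_size] = KDTree(X, leaf_size=leaf_size, metric="euclidean")
    build_seconds[leaf_size] = time.perf_counter() - start

for leaf_size, seconds in build_seconds.items():
    print(f"leaf_size={leaf_size}: built in {seconds:.4f} s")
```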


## LSH

In [17]:
# If using Colab
import sys
sys.path.insert(0,'/content/gdrive/My Drive/Colab Notebooks/CMPE 266')
# sys.path.insert(0,'/content/gdrive/My Drive/CMPE 266/Group Project')

In [18]:
# Reuse the lsh library provided in a previous class activity
# NOTE: This LSH library states that it is not optimal
from lsh import clsh

In [23]:
# Create six lsh for comparison

# 2 functions
start = time.time()
lsh12 = clsh(result.to_numpy(), ntables=1, nfunctions=2)
print("Time to train lsh12:", time.time() - start)

# 4 functions
start = time.time()
lsh14 = clsh(result.to_numpy(), ntables=1, nfunctions=4)
print("Time to train lsh14:", time.time() - start)

# 8 functions
start = time.time()
lsh18 = clsh(result.to_numpy(), ntables=1, nfunctions=8)
print("Time to train lsh18:", time.time() - start)

# 4 tables 2 functions
start = time.time()
lsh42 = clsh(result.to_numpy(), ntables=4, nfunctions=2)
print("Time to train lsh42:", time.time() - start)

# 4 tables 4 functions
start = time.time()
lsh44 = clsh(result.to_numpy(), ntables=4, nfunctions=4)
print("Time to train lsh44:", time.time() - start)

# 4 tables 8 functions
start = time.time()
lsh48 = clsh(result.to_numpy(), ntables=4, nfunctions=8)
print("Time to train lsh48:", time.time() - start)

Time to train lsh12: 5.172131299972534
Time to train lsh14: 13.894601106643677
Time to train lsh18: 22.716301918029785
Time to train lsh42: 21.07439351081848
Time to train lsh44: 57.07830023765564
Time to train lsh48: 94.97461485862732
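
To illustrate what the `ntables` and `nfunctions` parameters typically control, here is a minimal random-hyperplane LSH sketch. It is not the `clsh` implementation used above, whose internals are not shown in this notebook; the class `TinyLSH` and its method `find_neighbors` are hypothetical names. Each table hashes every point with `nfunctions` random hyperplanes, which is one plausible reason build time grows with both parameters.

```python
from collections import defaultdict

import numpy as np


class TinyLSH:
    """Random-hyperplane LSH. A hypothetical stand-in, not the clsh class."""

    def __init__(self, data, ntables=1, nfunctions=2, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=float)
        self.center = self.data.mean(axis=0)
        dims = self.data.shape[1]
        # Each table owns nfunctions random hyperplanes, so the build projects
        # the whole dataset ntables * nfunctions times
        self.planes = [rng.normal(size=(nfunctions, dims)) for _ in range(ntables)]
        self.tables = []
        for planes in self.planes:
            table = defaultdict(list)
            for row, key in enumerate(self._keys(self.data, planes)):
                table[key].append(row)
            self.tables.append(table)

    def _keys(self, points, planes):
        # One bit per hash function: which side of the hyperplane a point is on
        bits = (points - self.center) @ planes.T > 0
        return [tuple(b) for b in bits.tolist()]

    def find_neighbors(self, query, k):
        query = np.asarray(query, dtype=float).reshape(1, -1)
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._keys(query, planes)[0], []))
        rows = np.array(sorted(candidates), dtype=int)
        # Rank only the candidates that share a bucket with the query
        dists = np.linalg.norm(self.data[rows] - query, axis=1)
        return rows[np.argsort(dists)[:k]]


rng = np.random.default_rng(1)
data = rng.integers(0, 20, size=(1000, 3))  # synthetic 3-feature data
lsh = TinyLSH(data, ntables=4, nfunctions=2)
neighbors = lsh.find_neighbors(data[0], k=5)
print(neighbors)
```

In this sketch, more functions per table mean smaller buckets and fewer candidates to rank at query time, while more tables mean more buckets to look up and merge.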


# NN Querying

## Setup

In [24]:
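# Get the corresponding movies from a list of indices.
# Note: this assumes row i of `result` describes the same rating as row i of `ratings`,
# which holds only if the merges above preserved the row order of `ratings`.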
def get_recommendations(ind):
  print("\nRecommended movies:")
  for i in ind[0]:
    mID = ratings.iloc[i]['mID']
    movie = movies.loc[movies['mID'] == mID].iloc[0]['title']
    print(movie)

## Define our user(s)

In [25]:
# Show query input parameters
result.head(1)

| | genres | gender | age |
|---|---|---|---|
| 0 | 7 | 0 | 1 |


In [26]:
# Show genre mapping
print(dict(zip(mle.classes_, mle.transform(mle.classes_))))

{'Action': 0, 'Adventure': 1, 'Animation': 2, "Children's": 3, 'Comedy': 4, 'Crime': 5, 'Documentary': 6, 'Drama': 7, 'Fantasy': 8, 'Film-Noir': 9, 'Horror': 10, 'Musical': 11, 'Mystery': 12, 'Romance': 13, 'Sci-Fi': 14, 'Thriller': 15, 'War': 16, 'Western': 17}


In [27]:
# Show gender mapping
print(dict(zip(ule.classes_, ule.transform(ule.classes_))))

{'F': 0, 'M': 1}


Age and occupation mappings can be found under the USERS FILE DESCRIPTION section of the README: https://files.grouplens.org/datasets/movielens/ml-1m-README.txt

In [28]:
# Define our user(s)
user1 = [13, 0, 18] # Female, age 18-24, looking for Romance films
user2 = [6, 1, 45] # Male, age 45-49, looking for Documentaries
user3 = [1, 1, 25] # Male, age 25-34, looking for Adventure films

# Define how many movies we want
K = 5
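
Writing the query vectors by hand requires looking up each code in the mappings printed above. A small helper could build them from readable labels instead. This is a sketch: `make_query`, `genre_encoder`, `gender_encoder`, and `AGE_CODES` are hypothetical names, and the encoders are refitted here on the label sets printed above only so that the example is self-contained.

```python
from sklearn.preprocessing import LabelEncoder

# Refit encoders on the label sets printed above (self-contained stand-ins
# for the notebook's mle and ule)
genre_encoder = LabelEncoder().fit([
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
])
gender_encoder = LabelEncoder().fit(["F", "M"])
AGE_CODES = {1, 18, 25, 35, 45, 50, 56}  # age buckets listed in the ML-1M README


def make_query(genre, gender, age_code):
    """Hypothetical helper: return [genre_code, gender_code, age_code]."""
    if age_code not in AGE_CODES:
        raise ValueError(f"unknown age code: {age_code}")
    return [
        int(genre_encoder.transform([genre])[0]),
        int(gender_encoder.transform([gender])[0]),
        age_code,
    ]


print(make_query("Romance", "F", 18))      # same vector as user1
print(make_query("Documentary", "M", 45))  # same vector as user2
print(make_query("Adventure", "M", 25))    # same vector as user3
```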

## KD Tree

In [29]:
# Time a KDTree query and return the indices of the most similar user/preference rows
def run_KD(tree, user, k):
  start = time.time()
  dist, ind = tree.query([user], k=k)
  end = time.time()
  print("Time: ", end - start)
  return ind, end - start

In [30]:
# Run query on user1, using KD-Tree with leaf-size=2
print("Run query on user1, using KD-Tree with leaf-size=2")
ind, k12 = run_KD(kdt2, user1, K)
get_recommendations(ind)

# Run query on user1, using KD-Tree with leaf-size=4
print("\nRun query on user1, using KD-Tree with leaf-size=4")
ind, k14 = run_KD(kdt4, user1, K)
get_recommendations(ind)

# Run query on user1, using KD-Tree with leaf-size=8
print("\nRun query on user1, using KD-Tree with leaf-size=8")
ind, k18 = run_KD(kdt8, user1, K)
get_recommendations(ind)

# Run query on user1, using KD-Tree with leaf-size=16
print("\nRun query on user1, using KD-Tree with leaf-size=16")
ind, k116 = run_KD(kdt16, user1, K)
get_recommendations(ind)

Run query on user1, using KD-Tree with leaf-size=2
Time:  0.0016438961029052734

Recommended movies:
Dark Crystal, The (1982)
All About My Mother (Todo Sobre Mi Madre) (1999)
Time Bandits (1981)
Game, The (1997)
Donnie Brasco (1997)

Run query on user1, using KD-Tree with leaf-size=4
Time:  0.00041174888610839844

Recommended movies:
Dark Crystal, The (1982)
Time Bandits (1981)
Braveheart (1995)
Game, The (1997)
Donnie Brasco (1997)

Run query on user1, using KD-Tree with leaf-size=8
Time:  0.0006928443908691406

Recommended movies:
Game, The (1997)
Time Bandits (1981)
My Dog Skip (1999)
Donnie Brasco (1997)
Dark Crystal, The (1982)

Run query on user1, using KD-Tree with leaf-size=16
Time:  0.0013790130615234375

Recommended movies:
Game, The (1997)
Time Bandits (1981)
What's Eating Gilbert Grape (1993)
Donnie Brasco (1997)
Dark Crystal, The (1982)


In [31]:
# Run query on user2, using KD-Tree with leaf-size=2
print("Run query on user2, using KD-Tree with leaf-size=2")
ind, k22 = run_KD(kdt2, user2, K)
get_recommendations(ind)

# Run query on user2, using KD-Tree with leaf-size=4
print("\nRun query on user2, using KD-Tree with leaf-size=4")
ind, k24 = run_KD(kdt4, user2, K)
get_recommendations(ind)

# Run query on user2, using KD-Tree with leaf-size=8
print("\nRun query on user2, using KD-Tree with leaf-size=8")
ind, k28 = run_KD(kdt8, user2, K)
get_recommendations(ind)

# Run query on user2, using KD-Tree with leaf-size=16
print("\nRun query on user2, using KD-Tree with leaf-size=16")
ind, k216 = run_KD(kdt16, user2, K)
get_recommendations(ind)

Run query on user2, using KD-Tree with leaf-size=2
Time:  0.0016791820526123047

Recommended movies:
Mystery Science Theater 3000: The Movie (1996)
Little Big Man (1970)
Haunting, The (1999)
Creepshow (1982)
Lost in Space (1998)

Run query on user2, using KD-Tree with leaf-size=4
Time:  0.00035762786865234375

Recommended movies:
Creepshow (1982)
Little Big Man (1970)
Lost in Space (1998)
Haunting, The (1999)
Mystery Science Theater 3000: The Movie (1996)

Run query on user2, using KD-Tree with leaf-size=8
Time:  0.0006763935089111328

Recommended movies:
Creepshow (1982)
Lost in Space (1998)
Mystery Science Theater 3000: The Movie (1996)
Haunting, The (1999)
Conan the Barbarian (1982)

Run query on user2, using KD-Tree with leaf-size=16
Time:  0.00039267539978027344

Recommended movies:
Haunting, The (1999)
Lost in Space (1998)
Mystery Science Theater 3000: The Movie (1996)
Creepshow (1982)
Conan the Barbarian (1982)


In [32]:
# Run query on user3, using KD-Tree with leaf-size=2
print("Run query on user3, using KD-Tree with leaf-size=2")
ind, k32 = run_KD(kdt2, user3, K)
get_recommendations(ind)

# Run query on user3, using KD-Tree with leaf-size=4
print("\nRun query on user3, using KD-Tree with leaf-size=4")
ind, k34 = run_KD(kdt4, user3, K)
get_recommendations(ind)

# Run query on user3, using KD-Tree with leaf-size=8
print("\nRun query on user3, using KD-Tree with leaf-size=8")
ind, k38 = run_KD(kdt8, user3, K)
get_recommendations(ind)

# Run query on user3, using KD-Tree with leaf-size=16
print("\nRun query on user3, using KD-Tree with leaf-size=16")
ind, k316 = run_KD(kdt16, user3, K)
get_recommendations(ind)

Run query on user3, using KD-Tree with leaf-size=2
Time:  0.003661632537841797

Recommended movies:
Natural Born Killers (1994)
Tender Mercies (1983)
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Abyss, The (1989)

Run query on user3, using KD-Tree with leaf-size=4
Time:  0.0016841888427734375

Recommended movies:
Star Wars: Episode IV - A New Hope (1977)
Fast, Cheap & Out of Control (1997)
Abyss, The (1989)
Mission: Impossible (1996)
Natural Born Killers (1994)

Run query on user3, using KD-Tree with leaf-size=8
Time:  0.0048296451568603516

Recommended movies:
Natural Born Killers (1994)
Tender Mercies (1983)
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Fast, Cheap & Out of Control (1997)

Run query on user3, using KD-Tree with leaf-size=16
Time:  0.0030777454376220703

Recommended movies:
Star Wars: Episode IV - A New Hope (1977)
Tender Mercies (1983)
Abyss, The (1989)
Mission: Impossible (1996)
Natural Born Killers (1994)


## LSH

In [33]:
import numpy as np

In [34]:
# Count time of LSH query
def run_lsh(lsh, user, k):
  start = time.time()
  ind = lsh.findNeighbors(user, k=k)
  end = time.time()
  print("Time: ", end - start)
  return ind, end - start

In [35]:
# Run query on user1, using LSH with 2 functions
print("Run query on user1, using LSH with 2 functions")
ind, l112 = run_lsh(lsh12, np.array([user1]), K)
get_recommendations(ind)

# Run query on user1, using LSH with 4 functions
print("\nRun query on user1, using LSH with 4 functions")
ind, l114 = run_lsh(lsh14, np.array([user1]), K)
get_recommendations(ind)

# Run query on user1, using LSH with 8 functions
print("\nRun query on user1, using LSH with 8 functions")
ind, l118 = run_lsh(lsh18, np.array([user1]), K)
get_recommendations(ind)

# Run query on user1, using LSH with 4 tables 2 functions
print("\nRun query on user1, using LSH with 4 tables 2 functions")
ind, l142 = run_lsh(lsh42, np.array([user1]), K)
get_recommendations(ind)

# Run query on user1, using LSH with 4 tables 4 functions
print("\nRun query on user1, using LSH with 4 tables 4 functions")
ind, l144 = run_lsh(lsh44, np.array([user1]), K)
get_recommendations(ind)

# Run query on user1, using LSH with 4 tables 8 functions
print("\nRun query on user1, using LSH with 4 tables 8 functions")
ind, l148 = run_lsh(lsh48, np.array([user1]), K)
get_recommendations(ind)

Run query on user1, using LSH with 2 functions
Time:  19.070104360580444

Recommended movies:
Game, The (1997)
Dark Crystal, The (1982)
Donnie Brasco (1997)
All About My Mother (Todo Sobre Mi Madre) (1999)
Time Bandits (1981)

Run query on user1, using LSH with 4 functions
Time:  2.0396530628204346

Recommended movies:
Run Lola Run (Lola rennt) (1998)
Wizard of Oz, The (1939)
Beautician and the Beast, The (1997)
Forces of Nature (1999)
Blood Simple (1984)

Run query on user1, using LSH with 8 functions
Time:  0.24433183670043945

Recommended movies:
Batman (1989)
Raging Bull (1980)
Sting, The (1973)
Love and Death (1975)
American Beauty (1999)

Run query on user1, using LSH with 4 tables 2 functions
Time:  19.769280195236206

Recommended movies:
Game, The (1997)
Dark Crystal, The (1982)
Donnie Brasco (1997)
All About My Mother (Todo Sobre Mi Madre) (1999)
Time Bandits (1981)

Run query on user1, using LSH with 4 tables 4 functions
Time:  23.012451171875

Recommended movies:
Game, Th

In [36]:
# Run query on user2, using LSH with 2 functions
print("Run query on user2, using LSH with 2 functions")
ind, l212 = run_lsh(lsh12, np.array([user2]), K)
get_recommendations(ind)

# Run query on user2, using LSH with 4 functions
print("\nRun query on user2, using LSH with 4 functions")
ind, l214 = run_lsh(lsh14, np.array([user2]), K)
get_recommendations(ind)

# Run query on user2, using LSH with 8 functions
print("\nRun query on user2, using LSH with 8 functions")
ind, l218 = run_lsh(lsh18, np.array([user2]), K)
get_recommendations(ind)

# Run query on user2, using LSH with 4 tables 2 functions
print("\nRun query on user2, using LSH with 4 tables 2 functions")
ind, l242 = run_lsh(lsh42, np.array([user2]), K)
get_recommendations(ind)

# Run query on user2, using LSH with 4 tables 4 functions
print("\nRun query on user2, using LSH with 4 tables 4 functions")
ind, l244 = run_lsh(lsh44, np.array([user2]), K)
get_recommendations(ind)

# Run query on user2, using LSH with 4 tables 8 functions
print("\nRun query on user2, using LSH with 4 tables 8 functions")
ind, l248 = run_lsh(lsh48, np.array([user2]), K)
get_recommendations(ind)

Run query on user2, using LSH with 2 functions
Time:  21.16823959350586

Recommended movies:
Haunting, The (1999)
Creepshow (1982)
Mystery Science Theater 3000: The Movie (1996)
Lost in Space (1998)
Little Big Man (1970)

Run query on user2, using LSH with 4 functions
Time:  15.52151107788086

Recommended movies:
Haunting, The (1999)
Creepshow (1982)
Mystery Science Theater 3000: The Movie (1996)
Lost in Space (1998)
Little Big Man (1970)

Run query on user2, using LSH with 8 functions
Time:  15.719181060791016

Recommended movies:
Haunting, The (1999)
Creepshow (1982)
Mystery Science Theater 3000: The Movie (1996)
Lost in Space (1998)
Little Big Man (1970)

Run query on user2, using LSH with 4 tables 2 functions
Time:  22.524784326553345

Recommended movies:
Haunting, The (1999)
Creepshow (1982)
Mystery Science Theater 3000: The Movie (1996)
Lost in Space (1998)
Little Big Man (1970)

Run query on user2, using LSH with 4 tables 4 functions
Time:  19.497164011001587

Recommended mov

In [37]:
# Run query on user3, using LSH with 2 functions
print("Run query on user3, using LSH with 2 functions")
ind, l312 = run_lsh(lsh12, np.array([user3]), K)
get_recommendations(ind)

# Run query on user3, using LSH with 4 functions
print("\nRun query on user3, using LSH with 4 functions")
ind, l314 = run_lsh(lsh14, np.array([user3]), K)
get_recommendations(ind)

# Run query on user3, using LSH with 8 functions
print("\nRun query on user3, using LSH with 8 functions")
ind, l318 = run_lsh(lsh18, np.array([user3]), K)
get_recommendations(ind)

# Run query on user3, using LSH with 4 tables 2 functions
print("\nRun query on user3, using LSH with 4 tables 2 functions")
ind, l342 = run_lsh(lsh42, np.array([user3]), K)
get_recommendations(ind)

# Run query on user3, using LSH with 4 tables 4 functions
print("\nRun query on user3, using LSH with 4 tables 4 functions")
ind, l344 = run_lsh(lsh44, np.array([user3]), K)
get_recommendations(ind)

# Run query on user3, using LSH with 4 tables 8 functions
print("\nRun query on user3, using LSH with 4 tables 8 functions")
ind, l348 = run_lsh(lsh48, np.array([user3]), K)
get_recommendations(ind)

Run query on user3, using LSH with 2 functions
Time:  17.434277772903442

Recommended movies:
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Natural Born Killers (1994)
Abyss, The (1989)
Tender Mercies (1983)

Run query on user3, using LSH with 4 functions
Time:  16.600163459777832

Recommended movies:
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Natural Born Killers (1994)
Abyss, The (1989)
Tender Mercies (1983)

Run query on user3, using LSH with 8 functions
Time:  18.346718072891235

Recommended movies:
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Natural Born Killers (1994)
Abyss, The (1989)
Tender Mercies (1983)

Run query on user3, using LSH with 4 tables 2 functions
Time:  17.523998975753784

Recommended movies:
Mission: Impossible (1996)
Star Wars: Episode IV - A New Hope (1977)
Natural Born Killers (1994)
Abyss, The (1989)
Tender Mercies (1983)

Run query on user3, using LSH with 4 tables 4 functions
Time

## No multi-dimensional index

In [38]:
from sklearn.neighbors import NearestNeighbors

In [39]:
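# Time a regular NN query (no multi-dimensional index)
# Note: changing the algorithm to 'kd_tree' or 'ball_tree' gives results similar
# to our LSH and KD-Tree queries, but brute force returns different rows.
# Open question: is brute force more accurate because it compares against every point?
# (sklearn's KD-tree search is also exact, so the difference may instead come from
# how ties between equally distant rows are broken.)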
def run_Brute(result, user, k):
  nrst_neigh = NearestNeighbors(n_neighbors = k, algorithm = 'brute')
  nrst_neigh.fit(result)
  start = time.time()
  dist, ind = nrst_neigh.kneighbors(user)
  end = time.time()
  print("Time: ", end - start)
  return ind, end - start

In [40]:
print("Run query on user1, using brute force")
ind, b1 = run_Brute(result.values, np.array([user1]), K)
get_recommendations(ind)

Run query on user1, using brute force
Time:  0.05452609062194824

Recommended movies:
Creepshow (1982)
Animal House (1978)
Trading Places (1983)
Backdraft (1991)
Wild Wild West (1999)


In [41]:
print("Run query on user2, using brute force")
ind, b2 = run_Brute(result.values, np.array([user2]), K)
get_recommendations(ind)

Run query on user2, using brute force
Time:  0.03659987449645996

Recommended movies:
In the Line of Fire (1993)
Rain Man (1988)
They Shoot Horses, Don't They? (1969)
Courage Under Fire (1996)
Supercop (1992)


In [42]:
print("Run query on user3, using brute force")
ind, b3 = run_Brute(result.values, np.array([user3]), K)
get_recommendations(ind)

Run query on user3, using brute force
Time:  0.04035449028015137

Recommended movies:
Hustler, The (1961)
Mask, The (1994)
Dirty Dancing (1987)
Superman II (1980)
Manchurian Candidate, The (1962)


# Results / Analysis

Because of the limitations of our indexes and the number of features we used, our recommendation system is neither optimal nor very accurate.
Note, however, that the query results from the KD-Trees and the LSH indexes are similar.

Regardless, our main goal is to measure the performance of the two multi-dimensional indexes.

## KD Tree

In [43]:
print("Using KD-Tree leaf-size 2, our NN query averaged:")
print((k12 + k22 + k32)/3,"seconds of runtime.")

Using KD-Tree leaf-size 2, our NN query averaged:
0.0023282368977864585 seconds of runtime.


In [44]:
print("Using KD-Tree leaf-size 4, our NN query averaged:")
print((k14 + k24 + k34)/3,"seconds of runtime.")

Using KD-Tree leaf-size 4, our NN query averaged:
0.0008178551991780599 seconds of runtime.


In [45]:
print("Using KD-Tree leaf-size 8, our NN query averaged:")
print((k18 + k28 + k38)/3,"seconds of runtime.")

Using KD-Tree leaf-size 8, our NN query averaged:
0.0020662943522135415 seconds of runtime.


In [46]:
print("Using KD-Tree leaf-size 16, our NN query averaged:")
print((k116 + k216 + k316)/3,"seconds of runtime.")

Using KD-Tree leaf-size 16, our NN query averaged:
0.0016164779663085938 seconds of runtime.


## LSH

In [47]:
print("Using LSH with 2 functions, our NN query averaged:")
print((l112 + l212 + l312)/3,"seconds of runtime.")

Using LSH with 2 functions, our NN query averaged:
19.224207242329914 seconds of runtime.


In [48]:
print("Using LSH with 4 functions, our NN query averaged:")
print((l114 + l214 + l314)/3,"seconds of runtime.")

Using LSH with 4 functions, our NN query averaged:
11.387109200159708 seconds of runtime.


In [49]:
print("Using LSH with 8 functions, our NN query averaged:")
print((l118 + l218 + l318)/3,"seconds of runtime.")

Using LSH with 8 functions, our NN query averaged:
11.43674365679423 seconds of runtime.


In [50]:
print("Using LSH with 4 tables 2 functions, our NN query averaged:")
print((l142 + l242 + l342)/3,"seconds of runtime.")

Using LSH with 4 tables 2 functions, our NN query averaged:
19.93935449918111 seconds of runtime.


In [51]:
print("Using LSH with 4 tables 4 functions, our NN query averaged:")
print((l144 + l244 + l344)/3,"seconds of runtime.")

Using LSH with 4 tables 4 functions, our NN query averaged:
20.41827630996704 seconds of runtime.


In [52]:
print("Using LSH with 4 tables 8 functions, our NN query averaged:")
print((l148 + l248 + l348)/3,"seconds of runtime.")

Using LSH with 4 tables 8 functions, our NN query averaged:
21.612171014149983 seconds of runtime.


## Brute Force

In [53]:
print("Using brute force, our NN query averaged:")
print((b1 + b2 + b3)/3,"seconds of runtime.")

Using brute force, our NN query averaged:
0.04382681846618652 seconds of runtime.


## Analysis

### Training Time

For training time, the four KD-Trees took approximately 1 second each to fit.
- Time to train kdt2: 1.033564567565918
- Time to train kdt4: 0.9127256870269775
- Time to train kdt8: 0.8331103324890137
- Time to train kdt16: 0.9205224514007568

For LSH, however, training took longer when we increased either the number of tables or the number of functions.

- Time to train lsh12: 5.172131299972534
- Time to train lsh14: 13.894601106643677
- Time to train lsh18: 22.716301918029785
- Time to train lsh42: 21.07439351081848
- Time to train lsh44: 57.07830023765564
- Time to train lsh48: 94.97461485862732
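
As a rough check on how LSH training time scales, the recorded times above can be divided by the total number of hash functions (tables times functions per table). The dictionary `train_seconds` simply restates the numbers listed above; `per_function` is an illustrative name.

```python
# Recorded LSH training times from the run above: (ntables, nfunctions) -> seconds
train_seconds = {
    (1, 2): 5.172131299972534,
    (1, 4): 13.894601106643677,
    (1, 8): 22.716301918029785,
    (4, 2): 21.07439351081848,
    (4, 4): 57.07830023765564,
    (4, 8): 94.97461485862732,
}

per_function = {}
for (ntables, nfunctions), seconds in train_seconds.items():
    total_functions = ntables * nfunctions
    per_function[(ntables, nfunctions)] = seconds / total_functions
    print(f"{ntables} table(s) x {nfunctions} functions: "
          f"{per_function[(ntables, nfunctions)]:.2f} s per hash function")
```

The cost per hash function stays within a fairly narrow band (roughly 2.6 to 3.6 seconds) across all six configurations, which suggests that training time in this run grew approximately in proportion to the total number of hash functions.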

### Recommendation Time

For the KD-Tree, there is no discernible pattern suggesting that increasing the leaf size improves query performance.
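
Single sub-millisecond measurements are noisy, so a pattern across leaf sizes may be easier to judge from repeated queries. The following is a sketch under stated assumptions: `median_query_seconds` is a hypothetical helper, and the data is synthetic rather than the notebook's `result` table.

```python
import statistics
import time

import numpy as np
from sklearn.neighbors import KDTree


def median_query_seconds(tree, query, k, repeats=200):
    """Hypothetical helper: median wall-clock time of repeated k-NN queries."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        tree.query(query, k=k)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(20000, 3))  # synthetic 3-feature data
query = np.array([[13, 0, 18]])

for leaf_size in (2, 4, 8, 16):
    tree = KDTree(X, leaf_size=leaf_size, metric="euclidean")
    print(leaf_size, median_query_seconds(tree, query, k=5))
```

The median is used rather than the mean so that a single slow run, for example the first call, does not dominate the result.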

For LSH, query time improved only when using a low number of tables and a higher number of functions. Using 1 table with 4 or 8 functions produced the fastest times, averaging about 11 seconds. The other four LSH configurations averaged about 19 to 22 seconds.

Our brute force method averaged about 0.044 seconds per query, which is slower than the KD-Tree (roughly 0.001 to 0.002 seconds) but still far faster than LSH.
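
The relative speeds can be worked out directly from the averages recorded in the Results section. The dictionaries below restate those recorded values; the variable names are illustrative.

```python
# Average query times (seconds) recorded in the Results section above
kd_avg = {
    2: 0.0023282368977864585,
    4: 0.0008178551991780599,
    8: 0.0020662943522135415,
    16: 0.0016164779663085938,
}
lsh_avg = {
    (1, 2): 19.224207242329914,
    (1, 4): 11.387109200159708,
    (1, 8): 11.43674365679423,
    (4, 2): 19.93935449918111,
    (4, 4): 20.41827630996704,
    (4, 8): 21.612171014149983,
}
brute_avg = 0.04382681846618652

fastest_kd, slowest_kd = min(kd_avg.values()), max(kd_avg.values())
fastest_lsh = min(lsh_avg.values())

brute_vs_slowest_kd = brute_avg / slowest_kd
brute_vs_fastest_kd = brute_avg / fastest_kd
lsh_vs_brute = fastest_lsh / brute_avg

print(f"brute force vs KD-Tree: {brute_vs_slowest_kd:.0f}x to "
      f"{brute_vs_fastest_kd:.0f}x slower")
print(f"fastest LSH vs brute force: {lsh_vs_brute:.0f}x slower")
```

Since each average covers only three queries, these ratios are indicative rather than precise.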

### Recommendations

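Movie recommendations were mostly the same between the KD-Tree and LSH.

For KD-Tree, slight variance in recommendations occurred with bigger leaf sizes.

For LSH, the results were consistent for user2 and user3. For user1, the single-table indexes with 4 and 8 functions returned different movies from the other configurations shown.

In the outputs above, the brute force method returned none of the titles that the first two methods returned for the same user.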
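
One possible explanation for differing titles between exact methods is ties. With only three discrete features (18 genre codes, 2 genders, 7 age buckets), there are at most 252 distinct rows, so many ratings are identical to any given query and every one of them is an equally valid nearest neighbor. This cannot be confirmed from the recorded outputs alone; the sketch below only demonstrates the effect on synthetic data with assumed names.

```python
import numpy as np
from sklearn.neighbors import KDTree, NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in: 18 genre codes x 2 genders x 7 age buckets = 252 possible rows
X = np.column_stack([
    rng.integers(0, 18, size=20000),
    rng.integers(0, 2, size=20000),
    rng.choice([1, 18, 25, 35, 45, 50, 56], size=20000),
]).astype(float)
query = np.array([[13.0, 0.0, 18.0]])

# Count rows that are exactly equal to the query (all at distance 0)
exact_matches = int((X == query).all(axis=1).sum())
print("rows identical to the query:", exact_matches)

dist_kd, ind_kd = KDTree(X, leaf_size=4).query(query, k=5)
brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)
dist_bf, ind_bf = brute.kneighbors(query)

# Both searches are exact, so the distances agree even if the row indices differ
print(dist_kd, dist_bf)
print(sorted(ind_kd[0]), sorted(ind_bf[0]))
```

When far more than k rows tie at distance 0, which five are returned depends on each algorithm's traversal order, not on accuracy.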

### Overview

Comparing KD-Tree to LSH, our KD-Tree performed significantly faster in terms of both training and recommendation.

Using brute force was also faster than LSH, but slower than the KD-Tree.

It is important to note that the LSH library used in our implementation is not optimal.

Despite this, our results are expected. LSH is meant to address the curse of dimensionality that affects KD-Trees. However, we are using a very low number of dimensions (three), and with so few dimensions a KD-Tree can outperform LSH.