<h1 align='center'>Euclidean Distance: <br>
A Recommendation Engine Using Nearest Neighbors</h1>

<h2>Prerequisites</h2>
<ul>
    <li>You have seen the standard/Euclidean <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm" target="_blank">norm</a> of a vector</li>
    <li>You have some awareness of the ubiquity of making recommendations online.  For reference, briefly skim <a href="https://en.wikipedia.org/wiki/Recommender_system" target="_blank">this wiki</a>, <a href="https://www.thrillist.com/entertainment/nation/the-netflix-prize#" target="_blank"> this article</a> about Netflix from more than a decade ago, and <a href="https://www.nytimes.com/2017/04/17/arts/youtube-broadcasters-algorithm-ads.html" target="_blank"> this article</a> about YouTube indicating how this is still a difficult task. </li>
    <li>It would be helpful if you have read the article, Higher Dimensions and Linear Regression, as we will be using a 20-dimensional space below</li>
</ul>

<h2>Introduction</h2>
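Briefly introduce Euclidean distance: the distance between two vectors is the Euclidean norm of their difference, i.e., the straight-line distance between the points they represent.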

Describe the nature of our task: take a new user, John. How can we assess whether John will like the most recent Pixar film, Incredibles 2? It is tempting to find people who rated all of the other movies the way John did, i.e., those who have similar taste.

Nearest neighbors uses Euclidean distance to assess who is closest to John in terms of how they rated all of the other movies.
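
As a minimal sketch of this idea, the snippet below computes the Euclidean distance from one rating vector to every row of a small ratings matrix and returns the closest rows. The function name `nearest_neighbors`, the parameter `k`, and the three-movie ratings are all made up for illustration; they are not the data or code used elsewhere in this article.

```python
import numpy as np

def nearest_neighbors(target, others, k=3):
    """Return the indices of the k rows of `others` closest to `target`
    in Euclidean distance, together with those distances."""
    target = np.asarray(target, dtype=float)
    others = np.asarray(others, dtype=float)
    # Euclidean distance to every row: sqrt(sum_i (x_i - y_i)^2)
    distances = np.linalg.norm(others - target, axis=1)
    # Indices of the k smallest distances, nearest first
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Hypothetical ratings for three movies (illustrative numbers only)
john = [8.0, 6.0, 9.0]
others = [
    [8.0, 6.0, 9.0],  # identical taste to John
    [5.0, 9.0, 4.0],  # very different taste
    [7.0, 6.0, 9.0],  # differs by one point on the first movie
]

idx, dist = nearest_neighbors(john, others, k=2)
print(idx, dist)
```

One common way to turn the neighbors into a prediction is to average their ratings for the movie John has not seen; that choice is an assumption here, not something the outline above commits to.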

<h2>Our Data</h2>
look at our data briefly

show the simple linear regression for each movie against Incredibles 2

If we look at each of these individually, it is fairly easy to see who is close together.  Our goal is to push this same idea into 20 dimensions.

<h2>Appendix: Tools Used</h2>

Despite my best efforts, I wasn't able to find a dataset that provides ratings for all the Pixar movies grouped by user. I also don't have 500 friends.  As such, I am taking some liberties and generating our user observations.  To do this, we will start by taking the mean rating from IMDB for each movie and generating observations around that mean via the inverse of the normal cumulative distribution function.  

This method has the disadvantage that we are unlikely to generate users who really love or really hate Pixar movies, and most will have a balanced assessment of Pixar as a whole. This is perhaps not realistic but will suffice for the linear algebra discussion that is our goal.  Furthermore, because IMDB does not list the standard deviation, I have chosen to use a standard deviation of 1 for every movie, which is unlikely to be the case. But, I am not trying to measure the controversy for each movie, so this will be fine for our purposes. 
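
To illustrate the inverse-CDF idea on its own, here is a small sketch: uniform draws on (0, 1) are pushed through the inverse of the normal CDF (which scipy exposes as `norm.ppf`), and the results behave like draws from that normal distribution. The seed, the sample size, and the variable names `u` and `samples` are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# Inverse transform sampling: uniform(0, 1) draws mapped through the
# inverse CDF (ppf) of a normal with the desired mean and standard deviation.
u = rng.random(100_000)
samples = norm.ppf(u, loc=8.3, scale=1)  # 8.3 is the Toy Story mean from the list below

# The sample mean and standard deviation should land near 8.3 and 1
print(samples.mean(), samples.std())
```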

For reference, the Pixar movies and their ratings are:
<ol>
    <li> Toy Story, 1995: 8.3 </li>
    <li> A Bug's Life, 1998: 7.2 </li>
    <li> Toy Story 2, 1999: 7.9</li>
    <li> Monsters, Inc., 2001: 8.1</li>
    <li> Finding Nemo, 2003: 8.2</li>
    <li> The Incredibles, 2004: 8.0</li>
    <li> Cars, 2006:  7.2</li>
    <li> Ratatouille, 2007: 8.0</li>
    <li> WALL-E, 2008: 8.4</li>
    <li> Up, 2009: 8.3</li>
    <li> Toy Story 3, 2010: 8.3</li>
    <li> Cars 2, 2011: 6.3</li>
    <li> Brave, 2012: 7.2</li>
    <li> Monsters University, 2013: 7.3</li>
    <li> Inside Out, 2015: 8.4</li>
    <li> The Good Dinosaur, 2015: 8.3</li>
    <li> Finding Dory, 2016:  7.8</li>
    <li> Cars 3, 2017:  6.8</li>
    <li> Coco, 2017:  8.4</li>
    <li> Incredibles 2, 2018:  8.1</li>
</ol>

First let's create a function to generate 500 observations given some mean.

In [17]:
import numpy as np
from random import random
from scipy.stats import norm

#TODO: revisit this at some point. I am using continuous distribution methods and rounding
#for what should ideally be a discrete variable. The consequences are negligible for this
#application, but there must be a more direct way.

def obs_generator(mean):
    #first we will generate some random observations around the mean using the inverse of the CDF, using the built-in ppf
    initial = []

    for x in range(0,500):
        obs = norm.ppf(random(), loc=mean, scale=1)
        if obs > 10:
            initial.append(10)
        else:
            initial.append(obs)
        
    #500 draws already land close to the desired mean; standardizing the sample and shifting it
    #by the target mean brings it closer still (and sets the sample standard deviation to 1)
    temp_mean = np.mean(initial)
    temp_sd = np.std(initial)

    result = []

    for y in initial:
        adj = mean + ((y - temp_mean)/temp_sd)
        if adj > 10:
            result.append(10)
        else:
            result.append(round(adj,1))
        
    #because we will put this into a dataframe, we will return this as a numpy array    
    return np.array(result)

#let's do a quick sanity check
test = obs_generator(7.3)
print(np.mean(test))

7.2993999999999994
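
The same steps can also be written without explicit loops. The following is a sketch of an essentially equivalent vectorized version, not the code that produced the outputs in this appendix; the name `obs_generator_vectorized` and the `n` and `rng` parameters are hypothetical additions.

```python
import numpy as np
from scipy.stats import norm

def obs_generator_vectorized(mean, n=500, rng=None):
    """Generate n ratings around `mean` with standard deviation about 1."""
    rng = np.random.default_rng() if rng is None else rng
    # Inverse-CDF draws around the mean, capped at 10
    initial = np.minimum(norm.ppf(rng.random(n), loc=mean, scale=1), 10)
    # Standardize the sample, then shift it to the target mean
    adjusted = mean + (initial - initial.mean()) / initial.std()
    # Cap at 10 again and round to one decimal place
    return np.round(np.minimum(adjusted, 10), 1)

# Quick sanity check with an arbitrary seed
test_vec = obs_generator_vectorized(7.3, rng=np.random.default_rng(0))
print(test_vec.mean(), test_vec.max())
```

Passing a seeded generator makes the output repeatable, which the loop-based version above does not attempt.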


Now, let's create all of our observations:

In [18]:
#I will create variables so that I can control the order in which items are passed later
toy_obs =  obs_generator(8.3)
bug_obs = obs_generator(7.2) 
toy2_obs = obs_generator(7.9)
monst_obs = obs_generator(8.1)
nemo_obs = obs_generator(8.2)
incred_obs = obs_generator(8.0)
cars_obs = obs_generator(7.2)
rat_obs = obs_generator(8.0)
walle_obs = obs_generator(8.4)
up_obs = obs_generator(8.3)
toy3_obs = obs_generator(8.3)
cars2_obs = obs_generator(6.3)
brave_obs = obs_generator(7.2)
monst_univ_obs = obs_generator(7.3)
inside_obs = obs_generator(8.4)
dino_obs = obs_generator(8.3)
dory_obs = obs_generator(7.8)
cars3_obs = obs_generator(6.8)
coco_obs = obs_generator(8.4)
incred2_obs = obs_generator(8.1)

Now we will create our dataframe with columns as each movie and rows as each user:

In [19]:
import pandas as pd

user_ratings = pd.DataFrame({
    "Toy Story": toy_obs,
    "A Bugs Life": bug_obs,
    "Toy Story 2": toy2_obs,
    "Monsters, Inc.": monst_obs,
    "Finding Nemo": nemo_obs,
    "The Incredibles": incred_obs,
    "Cars": cars_obs,
    "Ratatouille": rat_obs,
    "WALL-E": walle_obs,
    "Up": up_obs,
    "Toy Story 3": toy3_obs,
    "Cars 2": cars2_obs,
    "Brave": brave_obs,
    "Monsters University": monst_univ_obs,
    "Inside Out": inside_obs,
    "The Good Dinosaur": dino_obs,
    "Finding Dory": dory_obs,
    "Cars 3": cars3_obs,
    "Coco": coco_obs,
    "Incredibles 2": incred2_obs
})

In [20]:
user_ratings.head()

,A Bugs Life,Brave,Cars,Cars 2,Cars 3,Coco,Finding Dory,Finding Nemo,Incredibles 2,Inside Out,Monsters University,"Monsters, Inc.",Ratatouille,The Good Dinosaur,The Incredibles,Toy Story,Toy Story 2,Toy Story 3,Up,WALL-E
0,6.7,6.3,6.2,6.0,7.3,8.6,9.9,7.3,7.9,9.0,9.0,8.7,7.9,8.4,7.0,10.0,9.0,8.6,9.3,8.3
1,7.9,6.2,9.2,7.6,7.8,4.7,6.3,7.4,8.7,8.0,8.1,8.7,7.5,8.9,8.5,9.7,7.0,6.5,7.5,7.9
2,6.3,7.0,5.7,7.0,6.2,7.5,7.5,8.0,6.9,8.4,7.6,7.5,8.2,10.0,8.9,8.3,7.4,7.0,8.4,8.2
3,7.2,8.7,6.8,6.1,7.1,6.9,7.6,8.9,7.3,8.3,7.0,6.6,8.2,8.0,7.2,6.7,8.9,8.8,9.5,8.6
4,8.1,6.7,7.5,6.5,6.1,8.0,7.3,5.8,9.3,8.4,8.4,9.1,7.5,9.7,7.2,10.0,7.0,8.5,8.5,8.6


In [21]:
user_ratings.describe()

,A Bugs Life,Brave,Cars,Cars 2,Cars 3,Coco,Finding Dory,Finding Nemo,Incredibles 2,Inside Out,Monsters University,"Monsters, Inc.",Ratatouille,The Good Dinosaur,The Incredibles,Toy Story,Toy Story 2,Toy Story 3,Up,WALL-E
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,7.1984,7.2006,7.2002,6.3002,6.8004,8.3928,7.7986,8.1948,8.0966,8.3964,7.3026,8.1,8.0002,8.2986,7.9998,8.2892,7.8994,8.2998,8.3006,8.3962
std,1.000319,1.001672,0.999589,1.00401,1.001422,0.984625,1.000129,0.991776,0.997938,0.993318,1.002548,0.996829,0.999228,0.996737,1.001512,0.982514,0.995913,1.000451,1.003231,0.991469
min,4.2,4.7,4.1,3.6,4.1,4.7,4.3,5.2,5.2,5.0,4.5,5.4,5.2,5.3,5.1,5.5,4.5,5.3,5.6,5.5
25%,6.5,6.5,6.575,5.6,6.1,7.7,7.1,7.5,7.4,7.7,6.6,7.4,7.3,7.6,7.3,7.6,7.2,7.675,7.6,7.8
50%,7.2,7.2,7.2,6.3,6.8,8.4,7.8,8.2,8.1,8.4,7.3,8.1,8.0,8.3,8.0,8.3,7.9,8.3,8.4,8.4
75%,7.9,7.9,7.8,6.9,7.4,9.2,8.5,8.9,8.8,9.1,8.0,8.8,8.7,9.025,8.7,9.0,8.6,9.1,9.1,9.2
max,10.0,10.0,9.9,9.2,9.5,10.0,9.9,10.0,9.9,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,9.9,10.0,10.0


We now have a dataframe with ratings from 500 users such that the mean rating for each movie closely matches the one provided by IMDB.

Again, this approach has some slight problems.  Notice under Cars 2 that the max rating is 9.2.  I'm sure there is someone out there who loves Cars 2 and gave it a 10.  But they are so far from the mean, even without knowing the standard deviation, that they aren't likely to have many neighbors.  Also, I have allowed our user ratings to be floats, rounded to one decimal place; I think IMDB only allows integer votes.  As I observed earlier, I would like to return to the structure of the obs_generator function to resolve this continuous vs. discrete variable issue at a later date.  Finally, in a survey this large, we would surely have some null values for users who haven't seen a particular movie.  I have generated ratings for every user for every movie, but, in practice, we would need to confront these null values and decide how to handle them.  For the linear algebra discussion above, though, I am content with these slight abuses of reality.
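
One possible way to address the continuous vs. discrete issue, sketched here under assumptions rather than adopted above: assign each integer rating from 1 to 10 the normal probability mass of the unit-wide interval centered on it, renormalize, and sample the integers directly. The function name `discrete_obs_generator` and its parameters are hypothetical, and unlike `obs_generator` this sketch does not adjust the sample afterwards, so the sample mean only approximates the target.

```python
import numpy as np
from scipy.stats import norm

def discrete_obs_generator(mean, sd=1.0, n=500, rng=None):
    """Sample n integer ratings in 1..10 from a discretized normal."""
    rng = np.random.default_rng() if rng is None else rng
    ratings = np.arange(1, 11)
    # Mass of each integer r = normal probability of the interval [r - 0.5, r + 0.5]
    upper = norm.cdf(ratings + 0.5, loc=mean, scale=sd)
    lower = norm.cdf(ratings - 0.5, loc=mean, scale=sd)
    probs = upper - lower
    # Renormalize, since the tails below 0.5 and above 10.5 are dropped
    probs = probs / probs.sum()
    return rng.choice(ratings, size=n, p=probs)

# Example with the Cars 2 mean and an arbitrary seed
cars2_discrete = discrete_obs_generator(6.3, rng=np.random.default_rng(0))
print(cars2_discrete.mean(), cars2_discrete.min(), cars2_discrete.max())
```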