### Codio Activity 19.5: Distance-Based Recommendations

As another example of a recommendation approach, this assignment makes distance-based recommendations using **cosine distance**. Using the ratings that users gave to movies, you will build an item-to-item distance matrix. You will then use those distances to recommend movies that are similar to the ones a user has rated highly.
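
To make the idea concrete, here is a minimal sketch of cosine distance computed by hand. The rating vectors and the helper name `cosine_distance` are made up for illustration and are not part of the assignment data. Cosine distance is one minus the cosine of the angle between two vectors, so movies rated in similar proportions by the same users end up close together.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical ratings from four users for three movies (0 = not rated).
movie_a = [5, 0, 4, 0]
movie_b = [4, 0, 5, 0]   # rated by the same users as movie_a
movie_c = [0, 3, 0, 5]   # rated by a disjoint set of users

d_ab = cosine_distance(movie_a, movie_b)  # small: similar rating patterns
d_ac = cosine_distance(movie_a, movie_c)  # 1.0: no raters in common
```

Two movies with no raters in common have a dot product of zero, so their distance is 1. This is why the distance tables later in the notebook contain so many values of exactly 1.0.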


In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('codio_19_5_solution/data/movie_ratings_19_5.csv', index_col=0)
df.head()

Unnamed: 0,movieId,title,userId,rating
5319,185,Clueless (1995),201,3.0
5320,185,Clueless (1995),219,2.0
5321,185,Clueless (1995),227,3.0
5322,185,Clueless (1995),229,4.0
5323,185,Clueless (1995),232,3.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 78009 entries, 5319 to 71480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  78009 non-null  int64  
 1   title    78009 non-null  object 
 2   userId   78009 non-null  int64  
 3   rating   78009 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 3.0+ MB


### Problem 1

#### Pivot Matrix

Below, use the DataFrame and the pandas `pivot_table` function to create a table in which the rows are movie titles, the columns are user IDs, and the values are the corresponding movie ratings.

In [7]:
piv_df = pd.pivot_table(df, index = 'title', columns = 'userId', values = 'rating').fillna(0)
piv_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Clueless (1995),0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,0.0,...,0.0,3.0,0.0,4.0,0.0,0.0,0.0,3.0,3.0,0.0
Coach Carter (2005),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0,0.0,0.0
Coal Miner's Daughter (1980),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
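
The pivot step can be seen more easily on a tiny table. The sketch below uses a hypothetical three-movie dataset (the names `toy`, `toy_piv`, and `toy_filled` are illustrative only) to show that `pivot_table` leaves `NaN` wherever a user has not rated a movie, and that `fillna(0)` then replaces those gaps with zeros.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (movie, user) pair.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'B', 'C'],
    'userId': [1, 2, 1, 3],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Rows are titles, columns are user IDs; unrated pairs come back as NaN.
toy_piv = pd.pivot_table(toy, index='title', columns='userId', values='rating')

# Replace NaN with 0 so that distance functions can operate on the matrix.
toy_filled = toy_piv.fillna(0)
```

Note that filling with 0 treats "not rated" the same as a rating of zero. That is a simplifying assumption of this approach rather than a statement about what users think of those movies.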


### Problem 2

#### Cosine Distance

To determine how similar two movies are, you can measure the distance between their rating vectors (the rows of `piv_df`). Below, use the scikit-learn function `pairwise_distances` with `cosine` as the metric. Assign the result to `distance_array`.

**Note**: Be sure to fill the missing values of `piv_df` with 0 before computing distances.

In [8]:
from sklearn.metrics.pairwise import pairwise_distances

In [10]:
distance_array = pairwise_distances(piv_df, metric = 'cosine')
distance_array.shape

(5075, 5075)
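
As a sanity check on what `pairwise_distances` returns, the following sketch runs it on a small, made-up item-by-user matrix (the array `X` and the variable names are assumptions for illustration). It checks that the result has one row and one column per movie, is symmetric, and has self-distances of zero up to floating-point round-off.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical item-by-user matrix: 3 movies (rows), 4 users (columns).
X = np.array([
    [5.0, 0.0, 4.0, 0.0],
    [4.0, 0.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 5.0],
])

D = pairwise_distances(X, metric='cosine')

# The result is square (one row and column per movie) and symmetric.
is_square = D.shape == (3, 3)
is_symmetric = np.allclose(D, D.T)

# Self-distances are zero up to floating-point round-off.
diag_is_zero = np.allclose(np.diag(D), 0.0)
```

Diagonal values such as `2.220446e-16` in the full distance matrix are round-off of this kind and can be read as zero.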

### Problem 3

#### Create a Distance DataFrame

Using your distance array, create a DataFrame whose index and columns are both the movie titles.

In [11]:
dist_df = pd.DataFrame(distance_array, index = piv_df.index, columns = piv_df.index)
dist_df.head()

title,'Round Midnight (1986),'Til There Was You (1997),Clueless (1995),Coach Carter (2005),Coal Miner's Daughter (1980),Cobb (1994),Cobra (1986),Cocaine Cowboys (2006),Cocktail (1988),Coco (2017),...,Zombieland (2009),"Zone, The (La Zona) (2007)",Zookeeper (2011),Zoolander (2001),Zoolander 2 (2016),Zootopia (2016),Zulu (1964),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005)
title
'Round Midnight (1986),0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.967259,1.0,1.0,1.0,1.0,1.0,1.0
'Til There Was You (1997),1.0,2.220446e-16,0.8843602,1.0,1.0,1.0,1.0,1.0,0.862171,1.0,...,1.0,1.0,1.0,0.893782,1.0,0.907731,1.0,1.0,0.92147,1.0
Clueless (1995),1.0,0.8843602,2.220446e-16,0.9449909,0.937616,0.888982,0.970826,1.0,0.911382,0.945281,...,0.913059,1.0,0.974194,0.835638,1.0,0.947219,1.0,0.906243,0.803393,1.0
Coach Carter (2005),1.0,1.0,0.9449909,1.110223e-16,1.0,1.0,1.0,0.816326,0.885037,0.880089,...,1.0,1.0,1.0,0.748878,0.781806,0.923039,1.0,0.781258,0.938593,1.0
Coal Miner's Daughter (1980),1.0,1.0,0.9376155,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.947376,1.0,1.0,1.0,1.0,0.863828,1.0


### Problem 4

#### Using the Distances to Make Recommendations

Use `dist_df` to decide which movie you would recommend to a user who rated `'xXx (2002)'` highly. In other words, which movie is most similar to it (has the smallest cosine distance)?

In [12]:
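# nsmallest() returns the five smallest distances in ascending order.
# Position 0 is expected to be 'xXx (2002)' itself (distance ~0),
# so position 1 is the closest other movie.
recommendation = dist_df['xXx (2002)'].nsmallest().index[1]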
recommendation

'Time Bandits (1981)'
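
The same lookup can be generalized to return several recommendations. The helper below is a sketch, not part of the assignment: the function name `recommend_similar` and the toy distance matrix are hypothetical. It drops the query title explicitly instead of relying on it sorting into first place, which avoids surprises when round-off or ties affect the ordering.

```python
import pandas as pd

def recommend_similar(dist_df, title, n=5):
    """Return the n titles closest to `title`, excluding `title` itself."""
    distances = dist_df[title].drop(labels=title)
    return distances.nsmallest(n).index.tolist()

# Hypothetical 3x3 distance matrix for illustration only.
titles = ['A', 'B', 'C']
toy_dist = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.5],
     [0.9, 0.5, 0.0]],
    index=titles, columns=titles,
)

top_for_a = recommend_similar(toy_dist, 'A', n=2)  # ['B', 'C']
```

With the notebook's data, a call such as `recommend_similar(dist_df, 'xXx (2002)', n=1)` would be the analogous query, assuming `dist_df` is built as in Problem 3.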