# Before we get started: conda 

Download it here:
https://www.anaconda.com/download/

conda is an open-source package and environment management system.

## Package Management

In [None]:
conda install scipy
python -c "import scipy; print(scipy.__version__)" # print scipy module version
conda list scipy
conda remove scipy
conda list

## Environment Management

In [None]:
# Create new environment installing default packages
conda create --name recommender_env python=3.6 anaconda

# Activate environment and check module version
conda activate recommender_env  # conda >= 4.4; older versions use `source activate`
python -c "import scipy; print(scipy.__version__)"

# Deactivate environment
conda deactivate

# Display environments and locations of files
conda info --envs

# Display modules in environment
conda list --name recommender_env

# copy environments
conda create --name recommender_env_copy --clone recommender_env

## Start Jupyter Notebook

In [None]:
# Activate your environment, navigate to the folder where you want to start your Jupyter notebook and type
jupyter notebook

In [None]:
# You can also define where to start your notebook
jupyter notebook --notebook-dir=/Users/yourname/folder1/folder2/

# Get Data

## Download data, folder structure, and reading data with pandas

Download Data here:
https://grouplens.org/datasets/movielens/

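The ml-latest-small dataset should work fine for our purpose. Unpack its CSV files into a `Data/` folder next to this notebook, which is where the code below reads them from.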

In [1]:
import numpy as np
import pandas as pd

PATH='Data/'

In [2]:
movies_df = pd.read_csv(f'{PATH}movies.csv', low_memory=False)
ratings_df = pd.read_csv(f'{PATH}ratings.csv', low_memory=False)
links_df = pd.read_csv(f'{PATH}links.csv', low_memory=False)
tags_df = pd.read_csv(f'{PATH}tags.csv', low_memory=False)

In [3]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


# Data Exploration

In [7]:
ratings_df.shape

(100836, 4)

In [8]:
ratings_df.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [9]:
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


## Lambda Function

Lambda Function: count distinct values for each column

In [10]:
ratings_df.apply(lambda x: len(x.unique()))

userId         610
movieId       9724
rating          10
timestamp    85043
dtype: int64
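
The same per-column count can also be obtained without a lambda via `DataFrame.nunique()`. A minimal sketch on a small made-up frame (`toy_df` and its values are illustrative, not taken from MovieLens):

```python
import pandas as pd

# Toy ratings frame; the values are made up for illustration
toy_df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 5.0, 4.0, 3.5],
})

# Lambda version, as in the cell above
distinct_lambda = toy_df.apply(lambda x: len(x.unique()))

# Built-in equivalent (note: nunique ignores NaN by default, unique() does not)
distinct_builtin = toy_df.nunique()

print(distinct_builtin)
```

On data without missing values, as in `ratings_df`, both versions should give the same counts.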

In [11]:
type(ratings_df.rating.unique())

numpy.ndarray

In [12]:
ratings_df.rating.unique()

array([ 4. ,  5. ,  3. ,  2. ,  1. ,  4.5,  3.5,  2.5,  0.5,  1.5])

## Aggregate Function

Suppose we want to know, for each user, their average rating and how many movies they rated.

In [13]:
df = ratings_df.groupby('userId').agg({'rating':[np.mean, np.size]}).reset_index()

df.head()

Unnamed: 0_level_0,userId,rating,rating
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,size
0,1,4.366379,232.0
1,2,3.948276,29.0
2,3,2.435897,39.0
3,4,3.555556,216.0
4,5,3.636364,44.0


In [14]:
df.columns

MultiIndex(levels=[['rating', 'userId'], ['mean', 'size', '']],
           labels=[[1, 0, 0], [2, 0, 1]])

In [15]:
df.columns = df.columns.droplevel()

In [16]:
df.head()

Unnamed: 0,Unnamed: 1,mean,size
0,1,4.366379,232.0
1,2,3.948276,29.0
2,3,2.435897,39.0
3,4,3.555556,216.0
4,5,3.636364,44.0


In [17]:
df.rename(columns={'': 'userId', 'mean': 'mean_ratings', 'size':'cnt_ratings'}, inplace=True)
df.head()

Unnamed: 0,userId,mean_ratings,cnt_ratings
0,1,4.366379,232.0
1,2,3.948276,29.0
2,3,2.435897,39.0
3,4,3.555556,216.0
4,5,3.636364,44.0
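
The `droplevel` and `rename` steps above are needed because `agg` with a dict of lists produces a two-level column index. As an alternative sketch, assuming pandas 0.25 or newer, named aggregation yields flat, already renamed columns in one step (`toy_ratings` is made-up data for illustration):

```python
import pandas as pd

# Toy ratings; values are made up for illustration
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 2],
    'rating': [4.0, 5.0, 3.0, 3.0, 4.5],
})

# Named aggregation: output_column=(input_column, aggregation)
user_stats = (
    toy_ratings.groupby('userId')
    .agg(mean_ratings=('rating', 'mean'), cnt_ratings=('rating', 'size'))
    .reset_index()
)

print(user_stats)
```

With this form `cnt_ratings` comes back as an integer column rather than the float values shown above.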


Let's check mean/median/std of this dataframe for mean_ratings and cnt_ratings. How would you do this? Will the mean for the ratings be the same as above? What do the results mean?

In [18]:
df.describe()

Unnamed: 0,userId,mean_ratings,cnt_ratings
count,610.0,610.0,610.0
mean,305.5,3.657222,165.304918
std,176.236111,0.480635,269.480584
min,1.0,1.275,20.0
25%,153.25,3.36,35.0
50%,305.5,3.694385,70.5
75%,457.75,3.9975,168.0
max,610.0,5.0,2698.0


Split the dataframe at the median of cnt_ratings into users at or below the median and users above it, then compare mean_ratings between the two groups.

In [19]:
s = df['cnt_ratings'].median()  # 70.5, the 50% value from describe() above

df_less_equal_median = df[df['cnt_ratings'] <= s]
df_higher_median = df[df['cnt_ratings'] > s]

In [20]:
df_less_equal_median.describe()

Unnamed: 0,userId,mean_ratings,cnt_ratings
count,305.0,305.0,305.0
mean,304.786885,3.70666,38.383607
std,171.359839,0.503885,14.313928
min,2.0,1.275,20.0
25%,155.0,3.380952,26.0
50%,302.0,3.772727,35.0
75%,456.0,4.055556,50.0
max,609.0,5.0,70.0


In [21]:
df_higher_median.describe()

Unnamed: 0,userId,mean_ratings,cnt_ratings
count,305.0,305.0,305.0
mean,306.213115,3.607785,292.22623
std,181.260273,0.451637,336.079669
min,1.0,2.14433,71.0
25%,144.0,3.339783,111.0
50%,307.0,3.647727,168.0
75%,462.0,3.912214,340.0
max,610.0,4.693333,2698.0


## Exercises:

- Write a lambda function where you divide each rating by 5 to generate values between 0 and 1
- Take the tags dataframe and build a new dataframe where you aggregate the tags over the movieIds and count them and sort them by their count (hint: sort_values on pandas dataframe with correct columns)

## Solutions

In [14]:
#ratings_df['rating'].apply(lambda x: x / 5).head()  # apply to the rating column only, not to the ids and timestamps

In [28]:
#df = tags_df.groupby('tag').agg({'tag':[np.size]}).reset_index()
#df.columns = df.columns.droplevel()
#df.sort_values(by=['size'], ascending=False)

## pandas profiling

Here's how to install it in conda: https://anaconda.org/conda-forge/pandas-profiling

In [22]:
import pandas_profiling
pandas_profiling.ProfileReport(ratings_df)



Dataset info

0,1
Number of variables,4
Number of observations,100836
Total Missing (%),0.0%
Total size in memory,3.1 MiB
Average record size in memory,32.0 B

Variable types

0,1
Numeric,4
Categorical,0
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: movieId

0,1
Distinct count,9724
Unique (%),9.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,19435
Minimum,1
Maximum,193609
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,170
Q1,1199
Median,2991
Q3,8122
95-th percentile,103250
Maximum,193609
Range,193608
Interquartile range,6923

0,1
Standard deviation,35531
Coef of variation,1.8282
Kurtosis,4.3762
Mean,19435
MAD,25955
Skewness,2.2108
Sum,1959777479
Variance,1262500000
Memory size,787.9 KiB

Value,Count,Frequency (%),Unnamed: 3
356,329,0.3%,
318,317,0.3%,
296,307,0.3%,
593,279,0.3%,
2571,278,0.3%,
260,251,0.2%,
480,238,0.2%,
110,237,0.2%,
589,224,0.2%,
527,220,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1,215,0.2%,
2,110,0.1%,
3,52,0.1%,
4,7,0.0%,
5,49,0.0%,

Value,Count,Frequency (%),Unnamed: 3
193581,1,0.0%,
193583,1,0.0%,
193585,1,0.0%,
193587,1,0.0%,
193609,1,0.0%,

Variable: rating

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.5016
Minimum,0.5
Maximum,5
Zeros (%),0.0%

0,1
Minimum,0.5
5-th percentile,1.5
Q1,3.0
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.5
Interquartile range,1.0

0,1
Standard deviation,1.0425
Coef of variation,0.29773
Kurtosis,0.12331
Mean,3.5016
MAD,0.8271
Skewness,-0.6372
Sum,353080
Variance,1.0869
Memory size,787.9 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,26818,26.6%,
3.0,20047,19.9%,
5.0,13211,13.1%,
3.5,13136,13.0%,
4.5,8551,8.5%,
2.0,7551,7.5%,
2.5,5550,5.5%,
1.0,2811,2.8%,
1.5,1791,1.8%,
0.5,1370,1.4%,

Value,Count,Frequency (%),Unnamed: 3
0.5,1370,1.4%,
1.0,2811,2.8%,
1.5,1791,1.8%,
2.0,7551,7.5%,
2.5,5550,5.5%,

Value,Count,Frequency (%),Unnamed: 3
3.0,20047,19.9%,
3.5,13136,13.0%,
4.0,26818,26.6%,
4.5,8551,8.5%,
5.0,13211,13.1%,

Variable: timestamp

0,1
Distinct count,85043
Unique (%),84.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1205900000
Minimum,828124615
Maximum,1537799250
Zeros (%),0.0%

0,1
Minimum,828124615
5-th percentile,847660000
Q1,1019100000
Median,1186100000
Q3,1436000000
95-th percentile,1519100000
Maximum,1537799250
Range,709674635
Interquartile range,416870000

0,1
Standard deviation,216260000
Coef of variation,0.17933
Kurtosis,-1.2679
Mean,1205900000
MAD,187540000
Skewness,-0.0087769
Sum,121602779665887
Variance,4.6769e+16
Memory size,787.9 KiB

Value,Count,Frequency (%),Unnamed: 3
1459787998,128,0.1%,
1459787997,124,0.1%,
1459787996,85,0.1%,
1459787995,37,0.0%,
828124616,37,0.0%,
829760898,34,0.0%,
829760897,30,0.0%,
829828005,23,0.0%,
829759809,21,0.0%,
829828006,21,0.0%,

Value,Count,Frequency (%),Unnamed: 3
828124615,20,0.0%,
828124616,37,0.0%,
828124762,1,0.0%,
829322340,9,0.0%,
829759809,21,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1537674927,1,0.0%,
1537674946,1,0.0%,
1537757040,1,0.0%,
1537757059,1,0.0%,
1537799250,1,0.0%,

Variable: userId

0,1
Distinct count,610
Unique (%),0.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,326.13
Minimum,1
Maximum,610
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,33
Q1,177
Median,325
Q3,477
95-th percentile,600
Maximum,610
Range,609
Interquartile range,300

0,1
Standard deviation,182.62
Coef of variation,0.55996
Kurtosis,-1.1814
Mean,326.13
MAD,157.53
Skewness,-0.079036
Sum,32885399
Variance,33350
Memory size,787.9 KiB

Value,Count,Frequency (%),Unnamed: 3
414,2698,2.7%,
599,2478,2.5%,
474,2108,2.1%,
448,1864,1.8%,
274,1346,1.3%,
610,1302,1.3%,
68,1260,1.2%,
380,1218,1.2%,
606,1115,1.1%,
288,1055,1.0%,

Value,Count,Frequency (%),Unnamed: 3
1,232,0.2%,
2,29,0.0%,
3,39,0.0%,
4,216,0.2%,
5,44,0.0%,

Value,Count,Frequency (%),Unnamed: 3
606,1115,1.1%,
607,187,0.2%,
608,831,0.8%,
609,37,0.0%,
610,1302,1.3%,

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# Concept of Matrix Factorization

# Data Preparation

## Pivot data

In [29]:
R_df = ratings_df.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
R_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## De-mean the data (subtract each user's mean) and convert it from a dataframe to a numpy array

In [30]:
R = R_df.values
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
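
To make the broadcasting step concrete, here is a minimal sketch on a made-up 2 x 3 matrix (`R_toy` is illustrative only). Note that, as in the cell above, the row mean is taken over all movies, including the zeros that stand for "not rated". The last line shows a possible alternative that averages over rated entries only; that variant is an assumption for comparison, not what this notebook does.

```python
import numpy as np

# Toy user x movie matrix; 0 stands for "not rated", as in R above
R_toy = np.array([
    [4.0, 0.0, 5.0],
    [0.0, 2.0, 0.0],
])

# Row means over ALL columns, unrated zeros included (same as the cell above)
row_means = np.mean(R_toy, axis=1)  # shape (2,)

# reshape(-1, 1) turns the means into a column, so each row has its own mean subtracted
R_toy_demeaned = R_toy - row_means.reshape(-1, 1)

# Alternative (assumption, not used in the notebook): mean over rated entries only
rated_means = R_toy.sum(axis=1) / (R_toy != 0).sum(axis=1)

print(row_means, rated_means)
```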

In [31]:
R

array([[ 4. ,  0. ,  4. , ...,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       ..., 
       [ 2.5,  2. ,  2. , ...,  0. ,  0. ,  0. ],
       [ 3. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 5. ,  0. ,  0. , ...,  0. ,  0. ,  0. ]])

In [32]:
R.shape

(610, 9724)

# Singular Value Decomposition in Python

Docs for the SVD:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html

## Exercise:

Build an SVD and calculate U, sigma and Vt with 50 embeddings (k = 50)

## Solution

In [33]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
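
As a sanity check of what `svds` returns, the following sketch runs it on a small random matrix (`A` is made up for illustration) and compares the result with the full SVD from NumPy. With the default solver, `k` must be smaller than `min(A.shape)`. The sketch sorts the singular values explicitly rather than relying on the order in which they are returned.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))  # toy dense matrix, made up for illustration

k = 3  # number of singular values/vectors to keep
U_k, s_k, Vt_k = svds(A, k=k)

# Sort the k components by singular value, largest first
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# Rank-k approximation of A
A_k = U_k @ np.diag(s_k) @ Vt_k

# The k values should match the k largest singular values of the full SVD
s_full = np.linalg.svd(A, compute_uv=False)
print(np.allclose(s_k, s_full[:k]))
```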

In [34]:
U

array([[-0.01716626,  0.00032694,  0.01194615, ...,  0.00335838,
        -0.06213084, -0.0596384 ],
       [-0.00501697, -0.0010387 ,  0.01505205, ..., -0.0012344 ,
         0.01767432, -0.00626322],
       [ 0.00065179,  0.00538272, -0.00649223, ...,  0.00067537,
        -0.00203417, -0.00064958],
       ..., 
       [-0.16117225,  0.13376441, -0.05557292, ..., -0.01464468,
        -0.01227985, -0.11854893],
       [ 0.0089301 , -0.00652333, -0.00537419, ..., -0.04097903,
        -0.01400112, -0.00856716],
       [ 0.01039005, -0.00880112,  0.06333814, ...,  0.06183579,
         0.20316391, -0.12143586]])

In [35]:
sigma

array([  67.86628347,   68.1967072 ,   69.02678246,   69.4170401 ,
         69.91863747,   70.02091789,   70.19408599,   71.67445157,
         72.43371861,   73.21879553,   73.43760593,   74.02644882,
         74.28978377,   74.9207733 ,   75.17528213,   75.59325141,
         76.70227225,   77.35717925,   78.39405157,   79.04344482,
         79.21217131,   80.56747647,   81.5467832 ,   82.1973482 ,
         83.04447645,   85.11688914,   85.74871886,   86.51711471,
         87.91550637,   90.33575237,   90.9340682 ,   92.26271695,
         93.39976829,   97.10067118,   99.28906754,   99.82361796,
        101.84794614,  105.97367358,  107.04782929,  109.20838712,
        112.80840902,  120.61532345,  122.64724436,  134.58721632,
        139.637245  ,  153.93097112,  163.73084057,  184.86187801,
        231.22453421,  474.20606204])

In [36]:
Vt

array([[  5.06053498e-02,  -1.46261894e-03,  -2.28232417e-03, ...,
          1.42764417e-03,   1.42764417e-03,  -2.96452853e-03],
       [ -2.95078801e-02,   2.17971445e-02,  -2.25072247e-02, ...,
         -2.92507189e-03,  -2.92507189e-03,   9.95934144e-05],
       [ -6.65561487e-02,  -1.43370497e-02,   2.64013814e-02, ...,
         -4.79377861e-04,  -4.79377861e-04,  -1.49239941e-03],
       ..., 
       [ -6.77263279e-02,  -6.97142996e-02,  -2.91611099e-02, ...,
         -2.24798857e-03,  -2.24798857e-03,  -2.06691001e-03],
       [ -2.84008740e-02,  -2.36032577e-03,  -2.47048049e-02, ...,
          7.01753154e-04,   7.01753154e-04,   1.36888991e-03],
       [ -7.60983302e-02,  -3.84874039e-02,  -1.24439904e-02, ...,
          5.10178162e-03,   5.10178162e-03,   4.81883687e-03]])

In [37]:
sigma = np.diag(sigma)

# Making Predictions

In [33]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
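
Written out, the line above multiplies the three factors back together and adds each user's mean to that user's row. The symbols $\mu$ (the vector `user_ratings_mean`) and $\mathbf{1}$ (a vector of ones) are notation introduced here for illustration, with $k = 50$:

$$\hat{R} = U \Sigma V^{\top} + \mu \mathbf{1}^{\top}, \qquad \hat{R}_{ij} = \sum_{f=1}^{k} U_{if}\,\sigma_f\,(V^{\top})_{fj} + \mu_i$$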

In [34]:
all_user_predicted_ratings

array([[  2.16732840e+00,   4.02750508e-01,   8.40183552e-01, ...,
         -2.34533753e-02,  -2.34533753e-02,  -5.87318552e-02],
       [  2.11459069e-01,   6.65755884e-03,   3.34547997e-02, ...,
          1.94980595e-02,   1.94980595e-02,   3.22813825e-02],
       [  3.58844848e-03,   3.05175179e-02,   4.63929239e-02, ...,
          5.90929301e-03,   5.90929301e-03,   8.00411072e-03],
       ..., 
       [  2.16136388e+00,   2.67091989e+00,   2.12845971e+00, ...,
         -4.40029476e-02,  -4.40029476e-02,   7.18717825e-02],
       [  7.80205947e-01,   5.33648654e-01,   9.64537701e-02, ...,
          4.35514249e-03,   4.35514249e-03,  -1.34622131e-03],
       [  5.36398127e+00,  -3.40945139e-01,  -1.75163291e-01, ...,
         -2.63577616e-02,  -2.63577616e-02,   5.15415792e-02]])

# Making Movie Recommendations

In [35]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,2.167328,0.402751,0.840184,-0.076281,-0.551337,2.504091,-0.890114,-0.026443,0.196974,1.593259,...,-0.023453,-0.019967,-0.026939,-0.026939,-0.023453,-0.026939,-0.023453,-0.023453,-0.023453,-0.058732
1,0.211459,0.006658,0.033455,0.017419,0.18343,-0.062473,0.083037,0.024158,0.04933,-0.15253,...,0.019498,0.016777,0.022219,0.022219,0.019498,0.022219,0.019498,0.019498,0.019498,0.032281
2,0.003588,0.030518,0.046393,0.008176,-0.006247,0.107328,-0.012416,0.003779,0.007297,-0.059362,...,0.005909,0.006209,0.00561,0.00561,0.005909,0.00561,0.005909,0.005909,0.005909,0.008004
3,2.051549,-0.387104,-0.252199,0.087562,0.130465,0.27021,0.477835,0.040313,0.025858,-0.017365,...,0.004836,0.004172,0.0055,0.0055,0.004836,0.0055,0.004836,0.004836,0.004836,-0.023311
4,1.344738,0.778511,0.065749,0.111744,0.273144,0.584426,0.25493,0.128788,-0.085541,1.023455,...,-0.008042,-0.007419,-0.008664,-0.008664,-0.008042,-0.008664,-0.008042,-0.008042,-0.008042,-0.010127


## Build function to get movie predictions per User

In [15]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # Row position; assumes userIds are contiguous and start at 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
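
The function above finds a user's row by position (`userID - 1`), which only works when the userIds are contiguous and start at 1. A minimal alternative sketch keeps the userIds as the index of the predictions frame and looks users up by label. All names and values here (`toy_ratings`, `toy_predictions`, `top_unseen`) are hypothetical and made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; note the userIds are NOT contiguous
toy_ratings = pd.DataFrame({
    'userId':  [10, 10, 20, 20, 35],
    'movieId': [1, 2, 2, 3, 1],
    'rating':  [5.0, 3.0, 4.0, 2.0, 1.0],
})
toy_R = toy_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Stand-in for the SVD output: any array with the same shape as toy_R works here
toy_predictions = np.array([
    [4.8, 3.1, 2.0],
    [3.9, 4.1, 2.2],
    [1.2, 0.4, 0.9],
])

# Keep the userIds as the row index so lookups use the id, not the row position
toy_preds_df = pd.DataFrame(toy_predictions, index=toy_R.index, columns=toy_R.columns)

def top_unseen(preds, ratings, user_id, n=5):
    """Return the n highest predicted ratings for movies the user has not rated yet."""
    seen = ratings.loc[ratings['userId'] == user_id, 'movieId'].tolist()
    return preds.loc[user_id].drop(labels=seen).sort_values(ascending=False).head(n)

print(top_unseen(toy_preds_df, toy_ratings, user_id=35))
```

The result is a Series indexed by movieId, which can be merged with `movies_df` in the same way as in the function above.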

In [16]:
already_rated, predictions = recommend_movies(preds_df, 400, movies_df, ratings_df, 10)

In [17]:
already_rated.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,400,6,5.0,1498870480,Heat (1995),Action|Crime|Thriller
10,400,608,5.0,1498870431,Fargo (1996),Comedy|Crime|Drama|Thriller
1,400,47,5.0,1498870391,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
26,400,3949,5.0,1498870465,Requiem for a Dream (2000),Drama
18,400,1210,5.0,1498870163,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi


In [21]:
predictions.head(10)

Unnamed: 0,movieId,title,genres
453,527,Schindler's List (1993),Drama|War
4108,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
2121,2858,American Beauty (1999),Drama|Romance
1711,2329,American History X (1998),Crime|Drama
1480,2028,Saving Private Ryan (1998),Action|Drama|War
814,1089,Reservoir Dogs (1992),Crime|Mystery|Thriller
94,110,Braveheart (1995),Action|Drama|War
967,1291,Indiana Jones and the Last Crusade (1989),Action|Adventure
3114,4226,Memento (2000),Mystery|Thriller
7729,91529,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX


# Outlook: What's next?

## How to test recommendations

Offline: RMSE (root mean squared error). Randomly hold out a subset of the ratings as a test set and build the model on the rest. Then predict the held-out ratings, subtract the predictions from the real ratings, square the differences, take their mean, and take the square root. The lower the RMSE, the better.
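
A minimal sketch of the RMSE computation itself: the mean of the squared errors, followed by the square root. The helper name `rmse` and the rating values are made up for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally shaped arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Toy held-out ratings and the model's predictions for them (made up)
held_out_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(held_out_ratings, predicted_ratings))  # 0.75
```

Because of the square root, the RMSE is on the same scale as the ratings, so 0.75 here means the predictions are off by roughly three quarters of a star.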

Online: A/B test the recommender system against a baseline or the existing recommender.

## User Matrix, Movie Matrix, Clustering for Categories 

The User Matrix gives us a numerical description of each user; the Movie Matrix gives us a numerical description of each movie. We can use these to cluster users or movies and discover categories of users or movies from their ratings alone.
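
One way this clustering could look is k-means on the rows of the factor matrix. The sketch below uses `scipy.cluster.vq.kmeans2` on a made-up 6 x 2 "movie matrix"; in this notebook the analogous inputs would be `Vt.T` (one row per movie) or `U` (one row per user). The choice of k-means and of two clusters is an assumption for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Toy "movie matrix": 6 movies x 2 latent factors, made up for illustration.
# Two clearly separated groups of three movies each.
group_a = np.array([0.0, 0.0]) + 0.01 * rng.normal(size=(3, 2))
group_b = np.array([10.0, 10.0]) + 0.01 * rng.normal(size=(3, 2))
movie_factors = np.vstack([group_a, group_b])

# k-means with 2 clusters; minit='++' picks well-spread starting centroids
centroids, labels = kmeans2(movie_factors, 2, minit='++')

print(labels)  # one cluster label per movie
```

Movies with the same label sit close together in the latent space, which can then be inspected by hand to see whether the cluster corresponds to a recognisable category.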

## Implicit Recommender Systems

Most of the time, there is no explicit rating, only implicit feedback. Algorithms such as Alternating Least Squares (ALS) can still produce good recommendations from such data. Examples: song recommendations, product recommendations.

# Recap

- conda as a package and environment manager, and how to use it
- read in data, explore and understand the data
- get to know matrix factorization and the concept of embeddings
- build a small recommender with SVD