# Model-based collaborative filtering
- uses a sparse utility matrix of user ratings (most entries are null, because each user rates only a few items)
- compresses the matrix using singular value decomposition (SVD)
    - SVD formula: **A = U * S * V^T**
    - A = original matrix (utility matrix)
    - U = left orthogonal matrix (holds important, non-redundant information about users)
    - V = right orthogonal matrix (holds important, non-redundant information about items); it appears transposed (V^T) in the product
    - S = diagonal matrix of singular values (each value weights how much its latent component contributes to A)

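As a minimal sketch of the factorization A = U S V^T (with V transposed), the following uses NumPy's `np.linalg.svd` on a small made-up matrix. The matrix values and the choice `k = 2` are assumptions for illustration only; they are not taken from the MovieLens data.

```python
import numpy as np

# Toy 4x3 utility matrix (rows = users, columns = items); values are made up.
A = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Thin SVD: U is (4, 3), s holds the 3 singular values, Vt is V transposed (3, 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation.
k = 2
A_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of what the truncation threw away.
reconstruction_error = np.linalg.norm(A - A_rank_k)
```

The singular values come back sorted from largest to smallest, so truncating to the first `k` keeps the strongest components and discards the weakest.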


## SVD matrix factorization

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import TruncatedSVD



MovieLens dataset [SOURCE](https://grouplens.org/datasets/movielens/100k/)

In [4]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
frame = pd.read_csv('data/ml-100k/u.data', sep="\t", names=columns)
frame.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


- user_id = user
- item_id = movie
- rating = user rating
- timestamp = time the rating was recorded (Unix seconds since 1970-01-01 UTC)
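The `timestamp` column in MovieLens 100K stores Unix epoch seconds. A quick way to sanity-check that reading is to convert one value with the standard library; the value below is the first timestamp shown in `frame.head()`, and the variable names are only illustrative.

```python
from datetime import datetime, timezone

# First timestamp shown in frame.head().
raw_timestamp = 881250949

# Interpret it as seconds since 1970-01-01 UTC.
rated_at = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```

For the whole column, `pd.to_datetime(frame['timestamp'], unit='s')` should give the same conversion in one call.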

In [6]:
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDB URL', 
           'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musicial', 'Mystery',
          'Romance', "Sci-Fi", "Thriller", "War", "Western"]

movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=columns, encoding='latin-1')

In [8]:
movies.head()

Unnamed: 0,item_id,movie title,release date,video release date,IMDB URL,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musicial,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [9]:
# now let's merge our datasets
combined_movies_data = pd.merge(frame, movies, on='item_id')
combined_movies_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp,movie title,release date,video release date,IMDB URL,unknown,Action,...,Fantasy,Film-Noir,Horror,Musicial,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
1,63,242,3,875747190,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
2,226,242,5,883888671,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
3,154,242,3,879138235,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
4,306,242,5,876503793,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# which movie has the most ratings?

combined_movies_data.groupby('item_id')['rating'].count().sort_values(ascending=False).head()

item_id
50     583
258    509
100    508
181    507
294    485
Name: rating, dtype: int64

In [14]:
mask = combined_movies_data['item_id'] == 50
combined_movies_data[mask]['movie title'].unique()

array(['Star Wars (1977)'], dtype=object)

The most-rated film in our dataset is Star Wars (1977), with 583 ratings. This count measures how often a film was rated, not how highly it was rated.
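Counting ratings measures popularity only. To look at rating level as well, one option is to aggregate the count and the mean together. The sketch below uses a small made-up frame in place of `combined_movies_data`; the titles, ratings, and the `min_ratings` threshold are all assumptions.

```python
import pandas as pd

# Made-up ratings standing in for combined_movies_data.
toy = pd.DataFrame({
    "movie title": ["A", "A", "A", "B", "B", "C"],
    "rating":      [5, 4, 5, 5, 5, 2],
})

# Number of ratings and mean rating per title.
stats = toy.groupby("movie title")["rating"].agg(["count", "mean"])

# Keep titles with enough ratings (threshold is arbitrary), then rank by mean.
min_ratings = 2
ranked = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(ranked)
```

The count filter matters because a title with a single 5-star rating would otherwise outrank a title with hundreds of ratings averaging slightly lower.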

## Building a utility matrix

In [17]:
# one row per user, one column per movie title, each cell holding that user's rating
# movies a user has not rated are null; we fill those with 0
ratings_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
ratings_crosstab.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0


## Transposing the Matrix

In [18]:
ratings_crosstab.shape

(943, 1664)

**Anatomy of Truncated SVD**
- (943, 1664) users as rows --> n_components=12 --> (943, 12): each user is described by 12 latent variables about movies

- (1664, 943) movies as rows --> n_components=12 --> (1664, 12): each movie is described by 12 latent variables that summarize generalized user tastes

This is how we compress the utility matrix. We want the second form, so we transpose the matrix to put movies on the rows.
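The effect of orientation on the output shape can be checked on a small random matrix. The sketch below uses plain NumPy rather than scikit-learn: it projects rows onto the top-k singular directions (U_k scaled by the singular values), which is the quantity `TruncatedSVD.fit_transform` approximates, up to sign and numerical tolerance. The helper `truncated_projection`, the matrix size, and `n_components = 3` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(17)

# Small stand-in for the utility matrix: 6 users x 9 movies, random ratings 0-5.
users_by_movies = rng.integers(0, 6, size=(6, 9)).astype(float)
n_components = 3


def truncated_projection(matrix, k):
    """Project each row onto the top-k singular directions (U_k * s_k)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]


# Users on the rows: one k-dimensional vector per user.
user_factors = truncated_projection(users_by_movies, n_components)

# Movies on the rows (transposed): one k-dimensional vector per movie.
movie_factors = truncated_projection(users_by_movies.T, n_components)

print(user_factors.shape, movie_factors.shape)
```

Whichever entity sits on the rows keeps its row count, and the other dimension is replaced by the number of components.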

In [19]:
x = ratings_crosstab.values.T
x.shape

(1664, 943)

### Decomposing the Matrix

In [20]:
SVD = TruncatedSVD(n_components=12, random_state=17)

resultant_matrix = SVD.fit_transform(x)
resultant_matrix.shape

(1664, 12)

### Generating a Correlation Matrix
**Making the Recommendation**
- generalized user tastes (1664, 12) --> Pearson's r --> movie-by-movie correlation matrix (1664, 1664) --> row for the movie of interest (1664,)

- Correlation Matrix
    - Recommend the items that correlate most strongly with your movie of interest, based on the generalized user tastes
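`np.corrcoef` treats each row of its input as one variable, so passing a movies-by-components matrix yields a movies-by-movies matrix of Pearson's r values. A small sketch with made-up latent vectors (the numbers are assumptions chosen so the relationships are easy to see):

```python
import numpy as np

# Three made-up movies described by 4 latent components each (rows = movies).
movie_factors = np.array([
    [1.0, 2.0, 3.0, 4.0],   # movie 0
    [2.0, 4.0, 6.0, 8.5],   # movie 1: rises with movie 0
    [4.0, 3.0, 2.0, 1.0],   # movie 2: exact mirror image of movie 0
])

# Row-wise Pearson correlation: entry [i, j] compares movie i with movie j.
corr = np.corrcoef(movie_factors)
print(corr.round(3))
```

The diagonal is each movie's correlation with itself, and the matrix is symmetric, so row `i` and column `i` carry the same information.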

In [21]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)

### Isolating Star Wars from the correlation matrix

In [22]:
# find the column index of Star Wars
movies_names = ratings_crosstab.columns
movies_list = list(movies_names)

star_wars = movies_list.index('Star Wars (1977)')
print(star_wars)

1398


In [23]:
# isolate the row of the correlation matrix that represents Star Wars
corr_star_wars = corr_mat[star_wars]
# each entry is the Pearson's r coefficient between Star Wars and one movie
corr_star_wars.shape
corr_star_wars.shape

(1664,)

### Recommending a Highly Correlated Movie

In [24]:
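# the < 1.0 bound is meant to drop Star Wars itself, but its self-correlation
# lands just under 1.0 through floating-point rounding, so it still appears in the output
list(movies_names[(corr_star_wars < 1.0) & (corr_star_wars > 0.9)])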

['Die Hard (1988)',
 'Empire Strikes Back, The (1980)',
 'Fugitive, The (1993)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Star Wars (1977)',
 'Terminator 2: Judgment Day (1991)',
 'Terminator, The (1984)',
 'Toy Story (1995)']

In [25]:
list(movies_names[(corr_star_wars < 1.0) & (corr_star_wars > 0.95)])

['Return of the Jedi (1983)', 'Star Wars (1977)']
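A fixed threshold returns a different number of titles for each movie, and the `< 1.0` bound does not reliably exclude the query movie, as the outputs here show. One alternative is to rank by correlation and drop the query by position. This is a sketch under stated assumptions: `top_n_similar` is a hypothetical helper, and the titles and correlation values are made up.

```python
import numpy as np


def top_n_similar(corr_row, titles, query_index, n=3):
    """Return the n titles with the highest correlation, skipping the query itself."""
    order = np.argsort(corr_row)[::-1]        # indices from highest to lowest r
    order = order[order != query_index][:n]   # drop the query movie by position
    return [(titles[i], float(corr_row[i])) for i in order]


titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
# Made-up correlation row for "Movie A"; note the self-correlation just under 1.0.
corr_row = np.array([0.9999999999, 0.97, 0.12, 0.91])

recommendations = top_n_similar(corr_row, titles, query_index=0, n=2)
print(recommendations)
```

With the notebook's variables, the equivalent call would be `top_n_similar(corr_star_wars, movies_list, star_wars, n=5)`.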