<DIV ALIGN=CENTER>

# Introduction to Recommender Systems
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

In this IPython Notebook, we introduce the concept of a Recommender System.

Market Basket
- Get the MovieLens data
- Explore the data with Pandas (Reda Blog)
- Build a Recommender System


-----


-----

## Movie Lens Data

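To begin exploring recommender systems, we first need data. We use the MovieLens dataset, which is available from the GroupLens website: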

http://grouplens.org/datasets/movielens/latest/

----

```bash

wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

unzip ml-latest-small.zip
```

The following is a summary of the MovieLens Small Dataset. You can read the entire file either at the command line or via `%load ml-latest-small/README.txt` in a code cell.


Summary
=======

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


-----

In [1]:
# Name of the directory holding the Small MovieLens data
data_dir = '/home/data_scientist/rppdm/data/ml-latest-small'

In [2]:
!ls -la $data_dir

total 3432
drwxr-xr-x 1 data_scientist staff     238 Jan 11 10:55 .
drwxr-xr-x 1 data_scientist staff     272 Feb 15 23:37 ..
-rw-r--r-- 1 data_scientist staff  207997 Jan 11 10:55 links.csv
-rw-r--r-- 1 data_scientist staff  515700 Jan 11 10:55 movies.csv
-rw-r--r-- 1 data_scientist staff 2580392 Jan 11 10:55 ratings.csv
-rw-r--r-- 1 data_scientist staff    8056 Jan 11 10:55 README.txt
-rw-r--r-- 1 data_scientist staff  199073 Jan 11 10:54 tags.csv


In [3]:
!wc -l $data_dir/ratings.csv

105340 /home/data_scientist/rppdm/data/ml-latest-small/ratings.csv


-----

## [Recommender Systems][rs]

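A recommender system suggests items that a user is likely to prefer, typically by using the ratings that this user and other users have already provided.

In the next code cells, we set up the Notebook, read the MovieLens ratings and movies into Pandas DataFrames, and explore them.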

-----
[rs]: https://en.wikipedia.org/wiki/Recommender_system

In [4]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import numpy.ma as ma


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings (this hides several Pandas warnings)
import warnings
warnings.filterwarnings("ignore")



In [5]:
import os

ratings_file = os.path.join(data_dir, 'ratings.csv')
movies_file = os.path.join(data_dir, 'movies.csv')

ratings = pd.read_csv(ratings_file)
movies = pd.read_csv(movies_file)

In [6]:
movies.tail()

Unnamed: 0,movieId,title,genres
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama
10328,149532,Marco Polo: One Hundred Eyes (2015),(no genres listed)


In [7]:
ratings.tail()

Unnamed: 0,userId,movieId,rating,timestamp
105334,668,142488,4.0,1451535844
105335,668,142507,3.5,1451535889
105336,668,143385,4.0,1446388585
105337,668,144976,2.5,1448656898
105338,668,148626,4.5,1451148148


In [8]:
ratings['movieId'].nunique()

10325

In [9]:
mv_lens = pd.merge(movies, ratings)

In [10]:
mv_lens.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2,5,859046895
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4,1303501039
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,5,858610933
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4,850815810
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,14,4,851766286


In [11]:
mv_lens.title.value_counts().head()


Pulp Fiction (1994)                 325
Forrest Gump (1994)                 311
Shawshank Redemption, The (1994)    308
Jurassic Park (1993)                294
Silence of the Lambs, The (1991)    290
Name: title, dtype: int64

In [12]:
mv_stats = mv_lens.groupby('title').agg({'rating': ['size', 'mean']})

# Number of ratings to consider top movie
rating_count = 20

top_movies = mv_stats['rating']['size'] >= rating_count
mv_stats[top_movies].sort_values(by=('rating', 'mean'), ascending=False).head()

title,rating size,rating mean
Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) (1984),22,4.477273
Touch of Evil (1958),21,4.47619
Cinema Paradiso (Nuovo cinema Paradiso) (1989),37,4.459459
"Shawshank Redemption, The (1994)",308,4.454545
Dr. Horrible's Sing-Along Blog (2008),23,4.434783
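
The effect of the rating-count cutoff is easier to see on a tiny table. The following is a minimal sketch that uses made-up titles and ratings (not MovieLens data); the names `toy`, `toy_stats`, `toy_top`, and `min_count` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy ratings: title A is rated once, title B three times
toy = pd.DataFrame({
    'title': ['A', 'B', 'B', 'B'],
    'rating': [5.0, 4.0, 4.5, 3.5],
})

# Count and average the ratings for each title
toy_stats = toy.groupby('title')['rating'].agg(['size', 'mean'])

# Keep only titles with at least min_count ratings, then rank by mean
min_count = 2
toy_top = toy_stats[toy_stats['size'] >= min_count]
toy_top = toy_top.sort_values(by='mean', ascending=False)

print(toy_stats)
print(toy_top)
```

Without the cutoff, title A would rank first on the strength of a single rating, which is why a minimum number of ratings is applied before sorting by the mean.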


----
### User Based Collaborative Filtering

-----

In [13]:
tmp_df = ratings.pivot(index='userId', columns='movieId', values='rating')



In [14]:
# Flag ratings above 3 as liked (1); lower or missing ratings become 0
the_data = (tmp_df > 3).astype(int).to_numpy()
print(the_data.shape)

(668, 10325)
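
To see what the pivot and the like/dislike conversion do, the following minimal sketch applies the same two steps to a made-up ratings table. The data and the names `toy`, `wide`, and `liked` are hypothetical; only the column names follow `ratings.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 2.5, 5.0, 3.0, 3.5],
})

# One row per user, one column per movie; unrated pairs become NaN
wide = toy.pivot(index='userId', columns='movieId', values='rating')

# A comparison with NaN is False, so unrated movies map to 0
liked = (wide > 3).astype(int).to_numpy()

print(wide)
print(liked)
```

Note that a 0 in the resulting matrix does not distinguish between a movie the user rated 3 or lower and a movie the user never rated.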


That is a very big array. Let's keep only the movies that have been reviewed by a minimum number of individuals. First, we group by movie id and count the number of reviews per movie. Then we build a new matrix.

In [15]:
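mvrs = ratings.groupby('movieId').size().sort_values(ascending=False)
# Label-based row lookup (.ix was removed from Pandas; reindex behaves the
# same here): labels with no matching row become NaN and are dropped
tmp_ratings = ratings.reindex(mvrs[mvrs > rating_count].index).dropna()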

In [114]:
tmp_df = tmp_ratings.pivot(index='userId', columns='movieId', values='rating')

In [17]:
# Flag ratings above 3 as liked (1); lower or missing ratings become 0
the_data = (tmp_df > 3).astype(int).to_numpy()
print(the_data.shape)

(152, 873)
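
To make the filtering step concrete, the following minimal sketch keeps only the ratings of movies that have more than a minimum number of ratings by testing membership of the `movieId` column with `isin`. It runs on made-up toy data, and `toy`, `counts`, `min_ratings`, `popular_ids`, and `popular` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Hypothetical toy ratings: movie 10 has 3 ratings, 20 has 2, 30 has 1
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
})

# Number of ratings per movie
counts = toy.groupby('movieId').size()

# Movie ids with more than min_ratings ratings
min_ratings = 1
popular_ids = counts[counts > min_ratings].index

# Keep only the rows whose movieId is one of the popular ids
popular = toy[toy['movieId'].isin(popular_ids)]

# Pivot the filtered ratings into a users-by-movies table
wide = popular.pivot(index='userId', columns='movieId', values='rating')
print(wide.shape)
```

Filtering on the `movieId` values, rather than on the row labels of the DataFrame, keeps every user who rated at least one of the retained movies.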


At this point we do not split the data into training and testing sets. We simply compute the cosine similarity between users to find likely matches.

In [18]:
# Define the Cosine Similarity function

def cosine_similarity(u, v):
    return(np.dot(u, v)/np.sqrt((np.dot(u, u) * np.dot(v, v))))

In [19]:
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 0, 0, 1, 1])
c = np.array([0, 1, 0, 1, 1])

print('cosine similarity(a, b) = {0:4.3f}'.format(cosine_similarity(a, b)))
print('cosine similarity(a, c) = {0:4.3f}'.format(cosine_similarity(a, c)))
print('cosine similarity(b, c) = {0:4.3f}'.format(cosine_similarity(b, c)))

print('cosine similarity(a, a) = {0:4.3f}'.format(cosine_similarity(a, a)))

cosine similarity(a, b) = 0.000
cosine similarity(a, c) = 0.333
cosine similarity(b, c) = 0.816
cosine similarity(a, a) = 1.000
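
When one vector must be compared against every row of a matrix, the same quantity can be computed for all rows at once. The following is a minimal sketch, not the function used in this Notebook; `cosine_similarities` is a hypothetical helper, and returning NaN for all-zero rows is an assumption made here so that such rows can be skipped with functions like `np.nanmax`.

```python
import numpy as np

def cosine_similarities(x, y):
    """Cosine similarity between each row of x and the vector y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Dot product of every row with y, and the product of the lengths
    dots = x @ y
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y)

    # A zero-length vector has no defined angle, so those entries stay NaN
    sims = np.full(dots.shape, np.nan)
    np.divide(dots, norms, out=sims, where=norms > 0)
    return sims

rows = np.array([[1, 1, 1, 0, 0],
                 [0, 0, 0, 1, 1],
                 [0, 0, 0, 0, 0]])
c = np.array([0, 1, 0, 1, 1])

sims = cosine_similarities(rows, c)
print(sims)
```

The first two rows are the vectors `a` and `b` from the previous cell, so their similarities to `c` match the values printed above (0.333 and 0.816).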


----

## Test one user

Do the analysis but for one user.

----

In [20]:
x = the_data

# Make a fake user (with movie ratings that will guarantee a match)
y = np.zeros(the_data.shape[1], dtype=np.int32)
y[6] = 1 ; y[10] = 1; y[15] = 1; y[64] = 1; y[136] = 1
y[180] = 1; y[230] = 1; y[339] = 1; y[622] = 1; y[703] = 1

# Store an array (as an attribute on tmp_df, not as a column) that maps
# each row in the x matrix to the matching row, and userId, of tmp_df
tmp_df.tmp_idx = np.arange(x.shape[0])

In [21]:
# Compute similarity, find maximum value
sims = np.apply_along_axis(cosine_similarity, 1, x, y)
mx = np.nanmax(sims)

usr_idx = np.where(sims==mx)[0][0]

print(y[:30])
print(x[usr_idx, :30])

print('\nCosine Similarity(y, x[{0:d}]) = {1:4.3f}' \
      .format(usr_idx, cosine_similarity(y, x[usr_idx])), end='\n\n')

# Now we subtract the vectors
# (any negative value is a movie to recommend)
mov_vec = y - x[usr_idx]

# We want a mask array, so we zero out any recommended movie.
mov_vec[mov_vec >= 0] = 1
mov_vec[mov_vec < 0] = 0

print(mov_vec[:30])

[0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

Cosine Similarity(y, x[7]) = 0.283

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1]
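
The subtraction step can be checked by hand on two short vectors. The following minimal sketch uses made-up like-vectors; `user`, `neighbor`, `recommend_idx`, and `same_idx` are hypothetical names for this illustration.

```python
import numpy as np

# Hypothetical like-vectors over six movies
# (1 = liked, 0 = not liked or not rated)
user     = np.array([1, 0, 1, 0, 0, 1])
neighbor = np.array([1, 1, 1, 0, 1, 0])

# The difference is -1 exactly where the neighbor liked a movie
# that the user did not
diff = user - neighbor
recommend_idx = np.flatnonzero(diff < 0)

# The same positions, written as a boolean condition
same_idx = np.flatnonzero((neighbor == 1) & (user == 0))

print(diff)
print(recommend_idx)
```

Here the movies at positions 1 and 4 would be recommended, since the neighbor liked them and the user did not.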


In [22]:
print('\n{0} Movie Recommendations for User = {1}' \
      .format(mov_vec[mov_vec == 0].shape[0], 
              tmp_df[tmp_df.tmp_idx == usr_idx].index[0]))


3 Movie Recommendations for User = 8.0


In [23]:
mov_ids = tmp_df[tmp_df.tmp_idx == usr_idx].columns

In [24]:
# Now make a masked array to find movies to recommend
# values are the movie ids, mask is the movies the most
# similar user liked.

ma_mov_idx = ma.array(mov_ids, mask = mov_vec)
mov_idx = ma_mov_idx[~ma_mov_idx.mask]        

In [25]:
# Now make a DataFrame of the movies of interest and display

mv_df = movies.loc[movies.movieId.isin(mov_idx)].dropna()

In [26]:
print(60*'-')

for movie in mv_df.title.values:
    print(movie)

print(60*'-', end='\n\n')

------------------------------------------------------------
Mr. Holland's Opus (1995)
I Shot Andy Warhol (1996)
William Shakespeare's Romeo + Juliet (1996)
------------------------------------------------------------



-----


In [27]:
from sklearn.model_selection import train_test_split

x, y = the_data, range(the_data.shape[0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state=42)

tmp_df.tmp_idx = np.array(y)

In [28]:
for idx, user in enumerate(x_test):
    
    # Compute similarity, find maximum value
    sims = np.apply_along_axis(cosine_similarity, 1, x_train, user)
    mx = np.nanmax(sims)
    
    # If maximum value is a real value    
    if mx > 0:
        
        # Find the index in the similarity matrix with maximum value
        train_idx = np.where(sims==mx)[0][0]
        
        # Now we subtract the vectors 
        # (any negative value is a movie to recommend)
        mov_vec = user - x_train[train_idx]
        
        # We make a mask array, so we zero out any recommended movie.
        mov_vec[mov_vec >= 0] = 1
        mov_vec[mov_vec < 0] = 0
        
        # We use the fact that y_train has the indices into the original
        # temporary data frame

        user_idx = tmp_df[tmp_df.tmp_idx == y_train[train_idx]]

        # State how many movies are being recommended for this user id
        print('{0} Movie Recommendations for User = {1}' \
              .format(mov_vec[mov_vec == 0].shape[0], \
                      tmp_df[tmp_df.tmp_idx == y_test[idx]].index[0]))
        
        print(60*'-')
        # Now make a masked array to find movies to recommend
        # values are the movie ids, mask is the movies the most
        # similar user liked.
        ma_mov_idx = ma.array(user_idx.columns, mask = mov_vec)
        mov_idx = ma_mov_idx[~ma_mov_idx.mask]
        
        # Now make a DataFrame of the movies of interest and display
        mv_df = movies.loc[movies.movieId.isin(mov_idx)].dropna()
        for movie in mv_df.title.values:
            print(movie)
            
        print(60*'-', end='\n\n')

45 Movie Recommendations for User = 380.0
------------------------------------------------------------
Flirting With Disaster (1996)
Star Wars: Episode IV - A New Hope (1977)
Stargate (1994)
Four Weddings and a Funeral (1994)
Independence Day (a.k.a. ID4) (1996)
Singin' in the Rain (1952)
American in Paris, An (1951)
Breakfast at Tiffany's (1961)
It Happened One Night (1934)
North by Northwest (1959)
Sabrina (1954)
Mr. Smith Goes to Washington (1939)
Willy Wonka & the Chocolate Factory (1971)
Sleeper (1973)
Candidate, The (1972)
Great Race, The (1965)
Bonnie and Clyde (1967)
Platoon (1986)
Return of the Pink Panther, The (1975)
Cook the Thief His Wife & Her Lover, The (1989)
English Patient, The (1996)
Blues Brothers, The (1980)
Godfather: Part II, The (1974)
Shining, The (1980)
Somewhere in Time (1980)
Field of Dreams (1989)
Jaws (1975)
Hunt for Red October, The (1990)
L.A. Confidential (1997)
Wag the Dog (1997)
Marty (1955)
Tom Jones (1963)
Man for All Seasons, A (1966)
French Connec

-----

## Student Activity

In the preceding cells, we used cosine similarity to find the user most similar to a given user, and we recommended the movies that the similar user liked. Now that you have run the Notebook, go back and make the following change to see how the results change.

1. Right now we only find the single most similar user. Modify the code to find the `n` most similar users.
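
As one possible starting point for ranking several users, the following minimal sketch returns the positions of the `n` largest values in an array of similarities while skipping NaN entries. `top_n_indices` is a hypothetical helper and the similarity values are made up; combining the recommendations from the selected users is left open.

```python
import numpy as np

def top_n_indices(sims, n):
    """Indices of the n largest similarities, best first, ignoring NaN."""
    sims = np.asarray(sims, dtype=float)

    # Positions that hold a real similarity value
    valid = np.flatnonzero(~np.isnan(sims))

    # Sort the valid positions from most to least similar
    order = valid[np.argsort(sims[valid])[::-1]]
    return order[:n]

# Hypothetical similarities between one user and five others
sims = np.array([0.10, np.nan, 0.80, 0.35, 0.00])

print(top_n_indices(sims, 2))
```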



-----

----
### Item Based Collaborative Filtering

-----

In [79]:
movie_data = the_data.T

In [88]:
mis = np.array([[cosine_similarity(md_i, md_j) \
                 for md_i in movie_data] for md_j in movie_data])

In [91]:
print(mis.shape)

(873, 873)
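
For larger matrices, the full movie-by-movie similarity matrix can also be computed with a single matrix product instead of a nested loop. The following is a minimal sketch on a made-up like matrix; `likes`, `items`, `norms`, `denom`, and `item_sims` are hypothetical names, and leaving NaN for movies that nobody liked is an assumption made here.

```python
import numpy as np

# Hypothetical users-by-movies like matrix (rows = users, columns = movies)
likes = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0]])

items = likes.T.astype(float)          # one row per movie
norms = np.linalg.norm(items, axis=1)  # length of each movie vector

# Divide each pairwise dot product by the product of the two lengths;
# pairs that involve an all-zero movie stay NaN
denom = np.outer(norms, norms)
item_sims = np.full(denom.shape, np.nan)
np.divide(items @ items.T, denom, out=item_sims, where=denom > 0)

print(item_sims.round(3))
```

The result is symmetric, with ones on the diagonal for every movie that has at least one like, so only half of the pairwise values are independent.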


In [109]:
mid = movies[movies.title.str.contains('Shawshank Redemption')].movieId.values[0]


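In [None]:
# Sketch of the item-based lookup (similar_movies is a hypothetical name)
def similar_movies(mv_id):
    # Given a movie ID, get its index into the similarity matrix
    mv_idx = list(tmp_df.columns).index(mv_id)

    # Then grab the matching row from the similarity matrix
    row = mis[mv_idx]

    # Go through the row, pairing every other movie ID with its
    # similarity score when that score is greater than zero
    pairs = [(m, s) for m, s in zip(tmp_df.columns, row)
             if s > 0 and m != mv_id]

    # Return the list sorted from most to least similar
    return sorted(pairs, key=lambda p: p[1], reverse=True)
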

In [110]:
print(mid)

318


In [116]:
tmp_df[mid].shape

(152,)