# Recommender Systems
### David Samuel

The dataset is acknowledged and cited for use in publications as follows:

MovieLens: http://grouplens.org/datasets/movielens/
-Thanks to Rich Davies for generating the data set.

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Graphlab: https://turi.com/learn/userguide/recommender/choosing-a-model.html

The dataset is available here: http://grouplens.org/datasets/movielens/latest/

<a id="top"></a>
## Table of Contents
________________________________________________________________________________________________________

### I.  Business Understanding
* <a href="#business_understanding">Business Understanding</a>

### II. Data Understanding 
* <a href="#data_understanding">Data Understanding</a>

### III. Modeling and Evaluation
* <a href="#modeling_and_evaluation">Modeling and Evaluation</a>
* <a href="#Collaborative_Filtering">Collaborative Filtering</a>

### IV.  Deployment
* <a href="#deployment">Deployment</a>

<a id="business_understanding"></a>
<a href="#top">Back to Top</a>
# Business Understanding 
• [10 points] Describe the purpose of the data set you selected (i.e., why was this data
collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific
dataset and the stakeholders' needs?

The dataset selected is the MovieLens latest dataset. It consists of a ratings file with over 24 million ratings for about 40,000 movie titles and a tags file with over 668,000 tag applications created by just under 26,000 users. The data was collected to build and maintain the recommender engine behind MovieLens, a database of users' movie ratings. The project lets users tune the matching algorithm to their own needs by rating movies.

Good algorithms will be measured by the trade-off between precision and recall. For a movie recommender like Netflix, the main stakeholder is arguably the user, who judges how good each recommended movie is and will likely use the service more when the recommendations are more precise. Precision is therefore the focus for most movie recommendations, since the user usually checks only the top 5 or so. Recall governs the completeness of the recommendations, which matters for a seasoned user who scrolls on through the next 10 or 20 recommendations looking for a movie.

<a id='data_understanding'></a>
<a href="#top">Back to Top</a>
# Data Understanding (20 points total)
• [10 points] Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file. Verify data quality: Are there missing values? Duplicate data?
Outliers? Are those mistakes? How do you deal with these problems?

The data is of high quality: no values are missing, as the SFrame tables and visualizations below show. The data set is also very large, which makes it possible to build customized recommendation engines.

In [4]:
import graphlab as gl
ratings = gl.SFrame.read_csv("ml-latest/ratings.csv", header=True)
ratings.head()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871
1,1609,3.0,945544824
1,1961,3.0,945544871
1,1972,1.0,945544871
2,441,2.0,1008942733
2,494,2.0,1008942733
2,1193,4.0,1008942667


In [17]:
# import the rest of the data

genome_scores = gl.SFrame.read_csv("ml-latest/genome-scores.csv", header=True, verbose=False)
genome_tags = gl.SFrame.read_csv("ml-latest/genome-tags.csv", header=True, verbose=False)
links = gl.SFrame.read_csv("ml-latest/links.csv", header=True, verbose=False)
movies = gl.SFrame.read_csv("ml-latest/movies.csv", header=True)
tags = gl.SFrame.read_csv("ml-latest/tags.csv", header=True, verbose=False)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


• [10 points] Visualize any important attributes appropriately. Important: Provide an
interpretation for any charts or graphs. 

In [111]:
gl.canvas.set_target('browser') # set viz target browser or ipynb

In [15]:
ratings = gl.load_sframe('ratings')
gl.canvas.set_target('ipynb')
ratings.show()

# As the num_undefined column shows, no column of the ratings dataset has missing values. Also, the distribution of ratings appears roughly normal, with some outliers at the low end.
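The same kind of check can be sketched without GraphLab. The snippet below is only an illustration, written for Python 3's standard library: the helper name `quality_report` and the four sample rows (copied from the `ratings.head()` output above) are my own choices, and the 0.5 to 5 range is the rating scale stated in this notebook. It counts empty fields per column, repeated (userId, movieId) pairs, and ratings outside the scale.

```python
import csv
import io
from collections import Counter

# A few rows copied from the ratings.head() output above (illustrative sample only).
SAMPLE = """userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
2,441,2.0,1008942733
"""


def quality_report(csv_text):
    """Count rows, empty fields, duplicate user/movie pairs and off-scale ratings."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Empty or absent fields, counted per column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1

    # A user should rate a given movie at most once.
    pair_counts = Counter((row["userId"], row["movieId"]) for row in rows)
    duplicates = sum(count - 1 for count in pair_counts.values() if count > 1)

    # Ratings are expected on the 0.5 to 5.0 scale.
    out_of_range = sum(
        1 for row in rows
        if row["rating"] and not 0.5 <= float(row["rating"]) <= 5.0
    )

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "out_of_range": out_of_range,
    }


report = quality_report(SAMPLE)
```

On the full file the same function could be fed the text of `ratings.csv`, although for 24 million rows the SFrame summary is the more practical tool.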

## Nearly 40,000 movies are rated from 0.5 to 5 by approximately 260,000 unique users

![title](scatter_movieId_relevance.png)

# The plot above shows the genome tag set with relevance by movie_id highly concentrated in the lower number ids

# The Genomes tag set shows 1,127 unique tag ids corresponding to the genome tagging system

In [18]:
gl.canvas.set_target('ipynb')
genome_scores.show()

# The Genomes score set shows 10,668 movies tagged relevant 3,997 times with 1,121 unique tag ids corresponding to the genome tagging system

# The links table is a reference to the IMDb and TMDb ids
As noted in the authors' description, there are some discrepancies among movie ids across movie databases. This will not affect my analysis of the ratings.

In [19]:
movies.show()

# Drama is the dominant genre with over 15% of the share.  Over 5% have no genre label at all

# 1,514 unique genre combinations have been identified, with a small discrepancy between the counts of movie titles and movie ids, which is likely due to repeated titles such as sequels.
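Because the genres column stores pipe-separated strings such as `Adventure|Comedy`, counting unique values of the column counts combinations rather than individual genres. The sketch below shows one way to separate the two counts. The helper name `genre_shares` is hypothetical, and the five sample strings are genre values taken from movie rows printed later in this notebook, so the numbers it produces say nothing about the full dataset.

```python
from collections import Counter

# Genre strings taken from movie rows printed later in this notebook (illustrative sample).
sample_genres = [
    "Adventure|Comedy",
    "Comedy|Romance",
    "Comedy|Romance",
    "Documentary",
    "Action|Adventure|Drama",
]


def genre_shares(genre_strings):
    """Return the share of movies that carry each individual genre label."""
    counts = Counter()
    for genre_string in genre_strings:
        # A movie can carry several genres, so each label is counted once per movie.
        for genre in set(genre_string.split("|")):
            counts[genre] += 1
    total = float(len(genre_strings))
    return {genre: count / total for genre, count in counts.items()}


shares = genre_shares(sample_genres)
unique_combinations = len(set(sample_genres))
unique_genres = len(shares)
```

Since one movie can have several labels, the per-genre shares do not need to sum to 1.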


In [20]:
tags.show()

# 17,110 movie reviewers tagged 23,849 different movies 49,627 times using 16 unique tags

<a id='modeling_and_evaluation'></a>
<a href="#top">Back to Top</a>
# Modeling and Evaluation (50 points total)
Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
the data you have chosen and use visualizations of the results to explain the performance and
expected outcomes whenever possible. Guide the reader through your analysis with plenty of
discussion of the results.

<a id="Collaborative_Filtering"></a>
<a href="#top">Back to Top</a>
# Collaborative Filtering
• Create user-item matrices or item-item matrices using collaborative filtering


In [14]:
import graphlab as gl
ratings = gl.SFrame.read_csv("data/ratings.csv", header=True)
ratings.head()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871
1,1609,3.0,945544824
1,1961,3.0,945544871
1,1972,1.0,945544871
2,441,2.0,1008942733
2,494,2.0,1008942733
2,1193,4.0,1008942667


# 24 Million ratings

In [21]:
training_data, validation_data = gl.recommender.util.random_split_by_user(ratings, 'userId', 'movieId')

In [32]:
# ratings.save('ratings')
# genome_scores.save('genome_scores')
# genome_tags.save('genome_tags')
# links.save('links')
# movies.save('movies')
# tags.save('tags')
# genome_scores = gl.SFrame.read_csv("/media/dave/HD Storage/Data Mining/lab3ml-latest/genome-scores.csv", header=True)
# genome_tags = gl.SFrame.read_csv("/media/dave/HD Storage/Data Mining/lab3ml-latest/genome-tags.csv", header=True)
# links = gl.SFrame.read_csv("/media/dave/HD Storage/Data Mining/lab3ml-latest/links.csv", header=True)
# movies = gl.SFrame.read_csv("/media/dave/HD Storage/Data Mining/lab3ml-latest/movies.csv", header=True)
# tags = gl.SFrame.read_csv("/media/dave/HD Storage/Data Mining/lab3ml-latest/tags.csv", header=True)

In [41]:
training_data.save('training'); validation_data.save('validation')

In [42]:
print 'training size:', training_data.shape[0], 'validation size:', validation_data.shape[0]

training size: 24386436 validation size: 17660
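The validation set is tiny next to the training set, which is what you would expect if the split holds out part of the ratings for only a sample of users. The sketch below illustrates that idea in plain Python. It is not GraphLab's implementation: the function name `split_by_user` is hypothetical, and the defaults of at most 1,000 sampled users and a 20% held-out share are my assumption about `random_split_by_user`, to be checked against the GraphLab Create documentation.

```python
import random
from collections import defaultdict


def split_by_user(rows, max_num_users=1000, item_test_proportion=0.2, seed=0):
    """Hold out a share of the ratings for a random sample of users.

    rows: list of (user_id, item_id, rating) tuples.
    Returns (train_rows, test_rows).
    """
    rng = random.Random(seed)

    # Group the ratings by user.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    # Only a sample of users contributes ratings to the held-out set.
    users = sorted(by_user)
    test_users = set(rng.sample(users, min(max_num_users, len(users))))

    train, test = [], []
    for user in users:
        for row in by_user[user]:
            # Each rating of a sampled user is held out with the given probability.
            if user in test_users and rng.random() < item_test_proportion:
                test.append(row)
            else:
                train.append(row)
    return train, test


# Synthetic example: 50 users with 20 ratings each, 10 users sampled for testing.
toy_rows = [(user, item, 3.0) for user in range(50) for item in range(20)]
toy_train, toy_test = split_by_user(toy_rows, max_num_users=10)
```

Every held-out user keeps most of their ratings in the training set, so the model can still produce recommendations for them at evaluation time.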


## Determine performance of the recommendations using different performance measures and explain what each measure does.

During training, model fit is tracked by the objective value, which is based on the squared error of the predicted ratings, and by the training RMSE. The step size is the SGD learning rate, not the mini-batch size.
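As a small worked example of the error measure, the sketch below computes MSE and RMSE by hand. The actual ratings are the first four ratings of user 1 shown earlier; the predicted values are made up for illustration and do not come from any of the models in this notebook.

```python
import math


def mse(actual, predicted):
    """Mean squared error between two equally long lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / float(len(actual))


def rmse(actual, predicted):
    """Root mean squared error, expressed in the same units as the ratings."""
    return math.sqrt(mse(actual, predicted))


actual_ratings = [2.0, 1.0, 5.0, 4.0]      # first four ratings of user 1
predicted_ratings = [2.5, 1.5, 4.0, 4.0]   # hypothetical model predictions

example_mse = mse(actual_ratings, predicted_ratings)    # (0.25 + 0.25 + 1.0 + 0.0) / 4
example_rmse = rmse(actual_ratings, predicted_ratings)
```

RMSE is easier to read than MSE because it is on the rating scale: an RMSE of 0.6 means predictions are off by roughly 0.6 stars.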

Precision is the fraction of the recommended items that are relevant, and recall is the fraction of the relevant items that are recommended.
Precision is the focus for most movie recommendations, since the user usually checks only the top 5 or so. Recall governs the completeness of the recommendations, which matters for a seasoned user who scrolls on through the next 10 or 20 recommendations looking for a movie.
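The two measures can be sketched for a single user as follows. This is a minimal illustration, not the evaluation code GraphLab runs: the function name `precision_recall_at_k` is my own, the recommended ids are just example movie ids, and the set of relevant (held-out) movies is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: item ids ordered from best to worst.
    relevant: item ids the user actually liked (for example, held-out ratings).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    # Precision: share of the k recommended slots filled with relevant items.
    precision = hits / float(k)
    # Recall: share of the relevant items that made it into the top k.
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


recommended_ids = [318, 858, 50, 527, 2324, 1198, 260, 1196, 2571, 296]
relevant_ids = {858, 1198, 4226}   # hypothetical held-out movies

p_at_5, r_at_5 = precision_recall_at_k(recommended_ids, relevant_ids, 5)
p_at_10, r_at_10 = precision_recall_at_k(recommended_ids, relevant_ids, 10)
```

In this example moving from k=5 to k=10 leaves precision at 0.2 while recall doubles from 1/3 to 2/3, which is the trade-off described above: a longer list finds more of the relevant movies without becoming any more precise.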

Because this dataset consists of user ratings that feed a movie recommendation engine, the ranking factorization recommender
will be used, and it will also be compared with the recommender that Turi selects by default.


In [122]:
m1 = gl.recommender.create(training_data, user_id="userId", item_id="movieId", target="rating")

In [123]:
m1.save('model1')

# Model 1 uses the default parameters with no tuning

In [124]:
r1 = m1.recommend(users=[1], k=10, verbose=False)

In [64]:
m2 = gl.ranking_factorization_recommender.create(training_data,
                                                 user_id='userId',
                                                 item_id='movieId',
                                                 target='rating', verbose=True)

In [88]:
m2.save('model2')

In [125]:
r = m2.recommend(users=[1], k=10)

In [62]:
r.save('results2')

# Now try with half the number of factors to reduce overfitting, a lower ranking regularization to increase diversity, and a lower sgd_step_size to increase accuracy.

In [46]:
# graphlab.recommender.factorization_recommender.create(observation_data, 
#                                                       user_id='user_id', item_id='item_id', 
#                                                       target=None, user_data=None, item_data=None, 
#                                                       num_factors=8, regularization=1e-08, linear_regularization=1e-10, 
#                                                       side_data_factorization=True, nmf=False, binary_target=False, max_iterations=50, 
#                                                       sgd_step_size=0, random_seed=0, solver='auto', verbose=True, **kwargs)


m3 = gl.recommender.ranking_factorization_recommender.create(training_data, user_id="userId", item_id="movieId", target="rating", 
                           num_factors=16, ranking_regularization=0.1, sgd_step_size=0.025)

# The adjustments lowered the final objective value and the overall RMSE
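To make the roles of `num_factors`, the step size and the regularization more concrete, here is a minimal sketch of a plain matrix factorization trained with SGD on squared error. It is an assumption-laden toy, not GraphLab's solver: it has no ranking term, no bias terms beyond the global mean, and no side data, and the names `train_mf` and `training_rmse` are hypothetical. The sample ratings are the first rows of the ratings file shown earlier.

```python
import math
import random


def train_mf(ratings, num_factors=2, step_size=0.025, regularization=1e-8,
             epochs=200, seed=0):
    """Fit rating ~ global_mean + dot(user_vector, item_vector) with plain SGD.

    ratings: list of (user_id, item_id, rating) tuples.
    Returns a predict(user_id, item_id) function.
    """
    rng = random.Random(seed)
    mean = sum(rating for _, _, rating in ratings) / float(len(ratings))

    # Small random starting vectors, one per user and one per item.
    user_vecs = {user: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
                 for user, _, _ in ratings}
    item_vecs = {item: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
                 for _, item, _ in ratings}

    def predict(user, item):
        return mean + sum(p * q for p, q in zip(user_vecs[user], item_vecs[item]))

    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for user, item, rating in data:
            error = rating - predict(user, item)
            p, q = user_vecs[user], item_vecs[item]
            for f in range(num_factors):
                p_f, q_f = p[f], q[f]
                # Gradient step on squared error with L2 regularization.
                p[f] += step_size * (error * q_f - regularization * p_f)
                q[f] += step_size * (error * p_f - regularization * q_f)
    return predict


def training_rmse(ratings, predict):
    """RMSE of the predictions on the ratings used for training."""
    squared = [(rating - predict(user, item)) ** 2 for user, item, rating in ratings]
    return math.sqrt(sum(squared) / float(len(squared)))


sample = [(1, 122, 2.0), (1, 172, 1.0), (1, 1221, 5.0), (1, 1441, 4.0),
          (2, 441, 2.0), (2, 494, 2.0), (2, 1193, 4.0)]

rmse_before = training_rmse(sample, train_mf(sample, epochs=0))
rmse_after = training_rmse(sample, train_mf(sample, num_factors=2, step_size=0.025))
```

A smaller step size makes each update more cautious, which tends to help accuracy at the cost of more iterations, and fewer factors leave the model less room to memorize individual ratings. A low training RMSE on seven ratings is of course pure memorization, so this only illustrates the mechanics.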

In [87]:
m3.save('model3')

# Now with this recommender, I will add tags as new data to the model, with diversity=1 to spice up the list and k=15 to explore the possible recall

In [61]:
# FactorizationRecommender.recommend(users=None, k=10, 
#                                    exclude=None, items=None, 
#                                    new_observation_data=None, 
#                                    new_user_data=None, new_item_data=None, 
#                                    exclude_known=True, diversity=0, random_seed=None, verbose=True)
# print_rows() returns None, so keep the SFrame in r3 and print it separately
r3 = m3.recommend(users=[1], new_user_data=tags, diversity=1, k=15)
r3.print_rows(num_rows=15)

+--------+---------+---------------+------+
| userId | movieId |     score     | rank |
+--------+---------+---------------+------+
|   1    |   318   |  4.5074127209 |  1   |
|   1    |   858   | 4.37136393829 |  2   |
|   1    |    50   |  4.3690995392 |  3   |
|   1    |   1198  | 4.23969798392 |  4   |
|   1    |   260   | 4.22989017932 |  5   |
|   1    |   1196  | 4.22630113034 |  6   |
|   1    |   296   | 4.19750653996 |  7   |
|   1    |   4226  |  4.1922656302 |  8   |
|   1    |   1203  | 4.18718409581 |  9   |
|   1    |   1193  | 4.18049112378 |  10  |
|   1    |   1197  | 4.17566738321 |  11  |
|   1    |   1136  | 4.16787082931 |  12  |
|   1    |   593   | 4.16511990918 |  13  |
|   1    |   1213  | 4.16254810473 |  14  |
|   1    |   904   | 4.11662172569 |  15  |
+--------+---------+---------------+------+
[15 rows x 4 columns]



In [58]:
r3.save('rec3')

# Changed item_data to include the genome scores of the movies, set the SGD step size to 0.15 (higher than the 0.025 used for m3), and set max_iterations to 50

In [142]:
m4 = gl.recommender.factorization_recommender.create(training_data, 
                                                      user_id='userId', item_id='movieId', 
                                                      target='rating', user_data=None, item_data=genome_scores, 
                                                      num_factors=32, regularization=1e-08, linear_regularization=1e-10, 
                                                      side_data_factorization=False, nmf=False, binary_target=False, max_iterations=50, 
                                                      sgd_step_size=0.15, random_seed=0, solver='auto', verbose=True)

In [143]:
m4.save('model4')

# Try diversity 2 now

In [174]:
# FactorizationRecommender.recommend(users=None, k=10, 
#                                    exclude=None, items=None, 
#                                    new_observation_data=None, 
#                                    new_user_data=None, new_item_data=None, 
#                                    exclude_known=True, diversity=0, random_seed=None, verbose=True)
r4 = m4.recommend(users=[20000], diversity=2, k=10)
r4

userId,movieId,score,rank
20000,26587,5.88738347308,1
20000,26472,5.75278928393,2
20000,7577,5.53962028439,3
20000,746,5.50737013979,4
20000,85,5.37479965344,5
20000,8368,5.36729537361,6
20000,1545,5.35802908653,7
20000,7301,5.26851659033,8
20000,2427,5.2001714807,9
20000,1354,5.19271676056,10


In [185]:
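for mo in r4['movieId']:
    # Select by movieId value. movies[mo] indexes by row position and returns a
    # different movie, which is how the output recorded below was produced.
    print movies[movies['movieId'] == mo][0]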

{'genres': 'Adventure|Comedy', 'movieId': 122515, 'title': 'Spymate (2006)'}
{'genres': 'Comedy|Romance', 'movieId': 122197, 'title': 'Paradise for Three (1938)'}
{'genres': 'Comedy|Romance', 'movieId': 7941, 'title': 'Smiles of a Summer Night (Sommarnattens leende) (1955)'}
{'genres': 'Documentary', 'movieId': 759, 'title': 'Maya Lin: A Strong Clear Vision (1994)'}
{'genres': 'Action|Adventure|Drama', 'movieId': 86, 'title': 'White Squall (1996)'}
{'genres': 'Drama|Musical', 'movieId': 25770, 'title': 'Applause (1929)'}
{'genres': 'Drama|Mystery|Romance|Thriller', 'movieId': 1597, 'title': 'Conspiracy Theory (1997)'}
{'genres': 'Comedy|Horror|Mystery', 'movieId': 7412, 'title': 'Cat and the Canary, The (1978)'}
{'genres': 'Crime|Film-Noir', 'movieId': 2511, 'title': 'Long Goodbye, The (1973)'}
{'genres': 'Action|Drama|Thriller', 'movieId': 1385, 'title': 'Under Siege (1992)'}


In [182]:
r2 = m2.recommend(users=[20000], diversity=2, k=10)
r2

userId,movieId,score,rank
20000,5952,4.67076550095,1
20000,318,4.44328416555,2
20000,4226,4.42112840383,3
20000,858,4.36185450523,4
20000,1197,4.31613331704,5
20000,3000,4.28852431147,6
20000,1203,4.238152903,7
20000,2858,4.22354891567,8
20000,2329,4.2220748183,9
20000,4011,4.20055375664,10


In [178]:
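# Look up each recommended title by movieId value rather than by row position
# (the output recorded below predates this lookup)
for mo in r2['movieId']:
    print movies[movies['movieId'] == mo][0]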

{'genres': 'Comedy|Romance', 'movieId': 7264, 'title': 'An Amazing Couple (2002)'}
{'genres': 'Drama', 'movieId': 321, 'title': 'Strawberry and Chocolate (Fresa y chocolate) (1993)'}
{'genres': 'Comedy|Drama', 'movieId': 3045, 'title': "Peter's Friends (1992)"}
{'genres': 'Romance', 'movieId': 1159, 'title': 'Love in Bloom (1935)'}
{'genres': 'Action|Drama|Thriller', 'movieId': 51, 'title': 'Guardian Angel (1994)'}
{'genres': 'Crime|Drama', 'movieId': 874, 'title': 'Killer: A Journal of Murder (1995)'}
{'genres': 'Comedy|Crime|Drama|Thriller', 'movieId': 296, 'title': 'Pulp Fiction (1994)'}
{'genres': 'Comedy|Crime|Mystery|Thriller', 'movieId': 2413, 'title': 'Clue (1985)'}
{'genres': 'Animation|Children|Comedy|Musical', 'movieId': 2102, 'title': 'Steamboat Willie (1928)'}
{'genres': 'Comedy', 'movieId': 4120, 'title': 'Hunk (1987)'}


In [180]:
r1 = m1.recommend(users=[20000], diversity=2, k=10)

In [181]:
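for mo in r1['movieId']:
    # Select by movieId value. movies[mo] indexes by row position and returns a
    # different movie, which is how the output recorded below was produced.
    print movies[movies['movieId'] == mo][0]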

{'genres': 'Crime|Drama', 'movieId': 874, 'title': 'Killer: A Journal of Murder (1995)'}
{'genres': 'Drama', 'movieId': 545, 'title': 'Harem (1985)'}
{'genres': 'Action|Drama|Thriller', 'movieId': 51, 'title': 'Guardian Angel (1994)'}
{'genres': 'Adventure', 'movieId': 941, 'title': 'Mark of Zorro, The (1940)'}
{'genres': 'Crime|Horror', 'movieId': 1219, 'title': 'Psycho (1960)'}
{'genres': 'Drama|War', 'movieId': 1242, 'title': 'Glory (1989)'}
{'genres': 'Action|Drama', 'movieId': 1773, 'title': 'Tokyo Fist (Tokyo ken) (1995)'}
{'genres': 'Action|Drama|Romance|War', 'movieId': 1224, 'title': 'Henry V (1989)'}
{'genres': 'Adventure|Drama|Romance', 'movieId': 2847, 'title': 'Only Angels Have Wings (1939)'}
{'genres': 'Children|Comedy|Fantasy|Musical', 'movieId': 3086, 'title': 'Babes in Toyland (1934)'}


In [150]:
m3.recommend(users=[20000], diversity=2, k=15).print_rows(num_rows=15)

+--------+---------+---------------+------+
| userId | movieId |     score     | rank |
+--------+---------+---------------+------+
| 20000  |   318   | 4.43728267072 |  1   |
| 20000  |    50   | 4.30588011114 |  2   |
| 20000  |   858   | 4.28301421224 |  3   |
| 20000  |   260   | 4.23174325792 |  4   |
| 20000  |   527   | 4.20819648398 |  5   |
| 20000  |   2324  | 4.14816158681 |  6   |
| 20000  |   2959  | 4.13683382837 |  7   |
| 20000  |   2329  | 4.10610322325 |  8   |
| 20000  |   5618  | 4.09568525521 |  9   |
| 20000  |    47   | 4.05159538655 |  10  |
| 20000  |   1203  |  4.0415584114 |  11  |
| 20000  |   1291  | 4.03862130313 |  12  |
| 20000  |   1210  | 4.03841572671 |  13  |
| 20000  |   1193  | 4.03781396939 |  14  |
| 20000  |  58559  | 4.00964690416 |  15  |
+--------+---------+---------------+------+
[15 rows x 4 columns]



• Use tables/visualization to discuss the found results. Explain each visualization in detail.


In [7]:
import graphlab as gl

m1 = gl.load_model('model1')
m2 = gl.load_model('model2')
m3 = gl.load_model('model3')
m4 = gl.load_model('model4')
m5 = gl.load_model('model5')
m6 = gl.load_model('model6')
m7 = gl.load_model('model7')



In [12]:
validation_data = gl.load_sframe('validation')

gl.canvas.set_target('ipynb')
model_comp = gl.compare(validation_data, [m1, m2, m3, m4, m5, m6, m7], verbose=False)
gl.show_comparison(model_comp, [m1, m2, m3, m4, m5, m6, m7])

Model compare metric: precision_recall


# Lowering the ranking regularization and halving the number of factors greatly reduced the precision and recall of m3. Compared with m2, however, m1 has a higher cutoff for recall than for precision. It appears that switching from the ranking factorization to the plain factorization recommender in models 4 through 7 destroyed the models' ability to recover precision and recall. Models 1, 2 and 3 performed the best.

In [98]:
r2 = m2.recommend(users=[1], k=10, verbose=True)
r2

userId,movieId,score,rank
1,318,4.91926253258,1
1,858,4.8972439733,2
1,50,4.85057966082,3
1,296,4.78105562596,4
1,527,4.72608348398,5
1,1193,4.71877410679,6
1,2858,4.68160211711,7
1,2324,4.66714703707,8
1,1213,4.63173211782,9
1,2959,4.62295391826,10


In [126]:
r3 = m3.recommend(users=[1], k=10, verbose=True)
r3

userId,movieId,score,rank
1,318,4.5074127209,1
1,858,4.37136393829,2
1,50,4.3690995392,3
1,527,4.33162067501,4
1,2324,4.24243187202,5
1,1198,4.23969798392,6
1,260,4.22989017932,7
1,1196,4.22630113034,8
1,2571,4.21091879761,9
1,296,4.19750653996,10


# Both models return an identical top 3 and share several other titles, but perhaps this just reflects how user 1 rated movies
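The amount of agreement between the two lists can be quantified. The sketch below uses the movie ids from the m2 and m3 outputs for user 1 shown above; the helper name `top_n_overlap` and the choice of Jaccard similarity as the measure are my own.

```python
def top_n_overlap(list_a, list_b):
    """Return the shared item ids and the Jaccard similarity of two lists."""
    set_a, set_b = set(list_a), set(list_b)
    shared = set_a & set_b
    # Jaccard similarity: size of the intersection over size of the union.
    jaccard = len(shared) / float(len(set_a | set_b))
    return sorted(shared), jaccard


# Top-10 movie ids for user 1, copied from the outputs above.
m2_user1 = [318, 858, 50, 296, 527, 1193, 2858, 2324, 1213, 2959]
m3_user1 = [318, 858, 50, 527, 2324, 1198, 260, 1196, 2571, 296]

shared_ids, jaccard = top_n_overlap(m2_user1, m3_user1)
same_top_3 = m2_user1[:3] == m3_user1[:3]
```

Six of the ten titles appear in both lists, which gives a Jaccard similarity of 6/14, or about 0.43, and the first three positions match exactly.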

In [127]:
r1 = m1.recommend(users=[1], k=10, verbose=True)
r1

userId,movieId,score,rank
1,318,5.0195966479,1
1,858,4.87502838342,2
1,527,4.82388411133,3
1,2324,4.71689416139,4
1,296,4.68793092339,5
1,356,4.66345484225,6
1,2858,4.65721906392,7
1,2329,4.64699668853,8
1,50,4.64651560454,9
1,1213,4.63802320926,10


In [187]:
m5 = gl.recommender.factorization_recommender.create(training_data, 
                                                      user_id='userId', item_id='movieId', 
                                                      target='rating', user_data=None, item_data=movies, 
                                                      num_factors=32, regularization=1e-08, linear_regularization=1e-10, 
                                                      side_data_factorization=False, nmf=False, binary_target=False, max_iterations=25, 
                                                      random_seed=0, solver='auto', verbose=True)

In [190]:
m5.save('model5')

In [191]:
m6 = gl.recommender.factorization_recommender.create(training_data, 
                                                      user_id='userId', item_id='movieId', 
                                                      target='rating', user_data=None, item_data=movies, 
                                                      num_factors=32, regularization=1e-08, linear_regularization=1e-10, 
                                                      side_data_factorization=False, nmf=False, binary_target=False, max_iterations=25, 
                                                      random_seed=0, solver='auto', verbose=True)

In [192]:
m6.save('model6')

In [None]:
m6 = gl.recommender.factorization_recommender.create(training_data, 
                                                      user_id='userId', item_id='movieId', 
                                                      target='rating', user_data=None, item_data=movies, 
                                                      num_factors=32, regularization=1e-08, linear_regularization=1e-10, 
                                                      side_data_factorization=False, nmf=False, binary_target=False, max_iterations=25, 
                                                      random_seed=0, solver='sgd', verbose=True)

In [194]:
m6.save('model6')

In [195]:
m7 = gl.recommender.factorization_recommender.create(training_data, 
                                                      user_id='userId', item_id='movieId', 
                                                      target='rating', user_data=None, item_data=None, 
                                                      num_factors=32, regularization=1e-08, linear_regularization=1e-10, 
                                                      side_data_factorization=False, nmf=False, binary_target=False, max_iterations=25, 
                                                      random_seed=0, solver='sgd', verbose=True)

In [196]:
m7.save('model7')

# Describe your results. What findings are the most compelling and why?

It is very interesting that when I changed item_data to include the genome scores of the movies, changed the step size, and increased max iterations, precision and recall dropped sharply, so the model had either very low precision and high recall, or very low recall and high precision. Movie recommendations are interesting, though, because you do not always want to be so precise that you overfit the recommendations to the same "relevant" movies while decent obscure movies sit unnoticed without many ratings or tags.

<a id="deployment"></a>
<a href="#top">Back to Top</a>
# Deployment 


# Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

The methodologies for this lab have not been streamlined for commercial use. Still, working with a dataset of over 24 million ratings and getting a recommender system that is fairly good, especially in terms of precision/recall, is remarkable.

I achieved my goal of tuning the recommender to achieve softer recommendations, as mentioned on the MovieLens website. A lack of tuning and a lack of data seem to be inherent problems with recommender systems.

# How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?

The utility of implementing custom-tailored recommender engines is far-reaching, and movie databases and ratings keep expanding as Netflix, Amazon, and other streaming services collect data and apply it to machine learning models.


# How would you deploy your model for interested parties?

I would use Amazon EC2 to run models with various settings for each user, build a recommendation profile for existing users, and then use that metadata to tune the recommendation engines of new users.


# What other data should be collected?

More movie metadata such as actors, budget, Facebook likes, and movie release date. Perhaps also some additional survey information, such as asking users to enter their top 10 favorite movies.

# How often would the model need to be updated, etc.?

It would be wise to maintain a recommendation list for each user's top 10, update the model every time the user logs in and clicks on a movie, and offer new chances to change the all-time top 10.