# Analysis of MovieLens Data

**Created by Phillip Efthimion, Scott Payne, Gino Varghese and John Blevins**

*MSDS 7331 Data Mining - Section 403 - Lab 3*

# Business Understanding (10 points total) - John
• [10 points] Describe the purpose of the data set you selected (i.e., why was this data
collected in the first place?). How will you measure the effectiveness of a good algorithm?
Why does your chosen validation method make sense for this specific dataset and the
stakeholders' needs?

# Data Understanding (20 points total) - Phillip
• [10 points] Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file. Verify data quality: Are there missing values? Duplicate data?
Outliers? Are those mistakes? How do you deal with these problems?
• [10 points] Visualize any important attributes appropriately. Important: Provide an
interpretation for any charts or graphs.

# Modeling and Evaluation (50 points total) - Gino, Scott
Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
the data you have chosen and use visualizations of the results to explain the performance and
expected outcomes whenever possible. Guide the reader through your analysis with plenty of
discussion of the results. Each option is broken down by:
• [10 Points] Train and adjust parameters
• [10 Points] Evaluate and Compare
• [10 Points] Visualize Results
• [20 Points] Summarize the Ramifications - John/Phillip

## Collaborative Filtering 
• Train: Create user-item matrices or item-item matrices using collaborative filtering
(adjust parameters).
• Eval: Determine performance of the recommendations using different performance
measures (explain the ramifications of each measure).
• Visualize: Use tables/visualization to discuss the found results. Explain each
visualization in detail.
• Summarize: Describe your results. What findings are the most compelling and why?
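
As a rough illustration of the Train step, a user-item matrix can be assembled from (userId, movieId, rating) rows. The sketch below is plain Python and is not how GraphLab stores its data internally. The rows for user 2 and the helper name `build_user_item_matrix` are made up for the example, and 0 is assumed to mean "not rated".

```python
# Rows are (userId, movieId, rating). The user 1 rows match the first rows of
# ratings.csv shown later in this notebook; the user 2 rows are hypothetical.
ratings = [
    (1, 31, 2), (1, 1029, 3), (1, 1061, 3),
    (2, 31, 4), (2, 1061, 5),  # hypothetical
]

def build_user_item_matrix(rows):
    """Return (matrix, user_ids, item_ids); 0 marks an unrated pair."""
    user_ids = sorted({u for u, _, _ in rows})
    item_ids = sorted({i for _, i, _ in rows})
    u_pos = {u: k for k, u in enumerate(user_ids)}
    i_pos = {i: k for k, i in enumerate(item_ids)}
    matrix = [[0] * len(item_ids) for _ in user_ids]
    for u, i, r in rows:
        matrix[u_pos[u]][i_pos[i]] = r
    return matrix, user_ids, item_ids

matrix, user_ids, item_ids = build_user_item_matrix(ratings)
for user, row in zip(user_ids, matrix):
    print(user, row)
```

Each row of the matrix is one user and each column one movie. An item-item approach compares columns, while a user-user approach compares rows.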

### Train and adjust parameters (10 points) - Gino, Scott
### Evaluate and Compare (10 points) - Gino, Scott
### Visualize Results  (10 points) - Gino, Scott
### Summarize the Ramifications (20 points) - John/Phillip

### 1 Actual : Train and adjust parameters (10 points) - Gino, Scott


In [146]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image
gl.canvas.set_target('ipynb')


# Data for Model 1
data_ratings = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/ratings.csv", column_type_hints={"rating":int})

# drop the timestamp column; it is not used by the recommender
data_ratings.remove_column('timestamp')

model_ratings = gl.recommender.create(data_ratings, user_id="userId", item_id="movieId", target="rating")
# recommend the top k=3 movies for every user (the k=5 call is kept commented out)
#results = model_ratings.recommend(users=None, k=5)
results = model_ratings.recommend(users=None, k=3)
model_ratings.save("my_model")


#Please ignore code block, the code was used for analysis
                                 
#data_rating.add_columns('')

#data_movies.add_column()

#model_ratings.user_id=data_movies.movieId 
                                                                  
#model_movies = gl.recommender.item_content_recommender.create(data_movies, user_id="movieId", item_id="title", target="genre")



In [147]:
results

userId,movieId,score,rank
1,593,4.79978343311,1
1,50,4.51099511448,2
1,858,4.45528289143,3
2,1035,4.96931787792,1
2,531,4.54748412434,2
2,147,4.48539825741,3
3,1259,4.99066420857,1
3,1089,4.86869617764,2
3,1625,4.8472318298,3
4,1304,6.32684537235,1


In [148]:
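# `data` is not defined in the cells shown; assuming it refers to the ratings SFrame loaded above
data = data_ratings
data.head()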

userId,movieId,rating
1,31,2
1,1029,3
1,1061,3
1,1129,2
1,1172,4
1,1263,2
1,1287,2
1,1293,2
1,1339,3
1,1343,2


In [149]:
from IPython.display import display
from IPython.display import Image
import graphlab.aggregate as agg


#gl.canvas.set_target('browser')
gl.canvas.set_target('ipynb')

count_rating = data.groupby(key_columns='userId', operations={'rating': agg.COUNT()})

count_movie = data.groupby(key_columns='userId', operations={'movieId': agg.COUNT()})



# count the ratings per movie to see how much the number of ratings varies between movies
# e.g. movie 1 has many more ratings than most
#count_movie_rating = data.groupby(key_columns='movieId', operations={'rating': agg.COUNT()})

data.show(view="Summary")

#data.show(view="Heat Map", x='movieId', y='count_movie_rating')

# rating vs. movieId: shows which rating values are most common, i.e. how the rated movies fall into the 5 rating categories
data.show(view="Bar Chart", x='rating', y='movieId')


#data.show(view="Heat Map", x="userId", y="count_movie")

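# number of movies rated by each user; the per-user count lives in the grouped SFrame `count_movie`,
# whose count column takes the name of the operations key ('movieId')
count_movie.show(view="Bar Chart", x="userId", y="movieId")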

# heat map of userId vs. movieId (logarithmic scale) to see how many users rated each movie
data.show(view="Heat Map", x="userId", y="movieId")
#sa.count_rating = gl.SArray(data=count_rating, int)

#data.show(view="Heat Map", x='movieId', y='count_rating')

#data.show(view="Summary")


#data.show(view="Heat Map", x="movieId", y="rating")


#data.show(view="Heat Map", x="userId", y="rating")



#data_array = gl.SArray(data)
                            
#data.show(view="Bar Chart", x="userId", y="count_rating")

#count_rating.show(view="Numeric")
#show(view="numeric", data.)

#data.show(view="Bar Chart", x="userId", y="count_movie")

#data_array.show()


# On data gaps: comparing the graphs "userId vs. count of ratings" and "userId vs. count of movies"
# shows no gaps in the data set. For example, userId 547 has 2391 movies on record
# and a rating for every one of them.
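
The two per-user counts come from the same rows, so a more direct check for gaps is to look for missing ratings and repeated (userId, movieId) pairs. The sketch below does that in plain Python on the ten rows shown by `data.head()`. The helper name `summarize` is hypothetical, and a missing rating is assumed to appear as `None`.

```python
from collections import Counter

# (userId, movieId, rating) rows copied from the data.head() output above.
rows = [
    (1, 31, 2), (1, 1029, 3), (1, 1061, 3), (1, 1129, 2), (1, 1172, 4),
    (1, 1263, 2), (1, 1287, 2), (1, 1293, 2), (1, 1339, 3), (1, 1343, 2),
]

def summarize(rows):
    """Return ratings per user, duplicated (user, movie) pairs, and rows with no rating."""
    ratings_per_user = Counter(user for user, _, _ in rows)
    pair_counts = Counter((user, movie) for user, movie, _ in rows)
    duplicates = [pair for pair, n in pair_counts.items() if n > 1]
    missing = [(user, movie) for user, movie, rating in rows if rating is None]
    return ratings_per_user, duplicates, missing

ratings_per_user, duplicates, missing = summarize(rows)
print(ratings_per_user[1], duplicates, missing)
```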

### Train and Compare

In [150]:
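# sample up to 150 users and hold out about 20% of each sampled user's ratings for testing;
# all remaining rows (including every row of the unsampled users) go to train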
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="userId", item_id="movieId",
                                                            max_num_users=150, item_test_proportion=0.2)
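
To make the split concrete, here is a minimal plain-Python sketch of a per-user holdout. It only approximates what `random_split_by_user` is documented to do: the function name `split_by_user`, the fixed seed, and the per-row coin flip are assumptions for illustration, not GraphLab's implementation.

```python
import random
from collections import defaultdict

def split_by_user(rows, max_num_users=150, item_test_proportion=0.2, seed=0):
    """Hold out roughly item_test_proportion of the rows of up to max_num_users users."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    users = sorted(by_user)
    test_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, test = [], []
    for user in users:
        if user not in test_users:
            # users that were not sampled contribute only to the training set
            train.extend(by_user[user])
            continue
        for row in by_user[user]:
            if rng.random() < item_test_proportion:
                test.append(row)
            else:
                train.append(row)
    return train, test

# Toy data: 20 hypothetical users with 10 ratings each.
toy_rows = [(u, i, 3) for u in range(20) for i in range(10)]
toy_train, toy_test = split_by_user(toy_rows, max_num_users=5)
print(len(toy_train), len(toy_test))
```

Because only some users are sampled, the test set is much smaller than 20% of all rows. That is consistent with the 95889 training observations out of 100004 reported later in this notebook.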

In [151]:

m1 = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="movieId", 
                                  target="rating",
                                  #only_top_k=5,
                                  only_top_k=3,
                                  #similarity_type="sine")                    
                                  similarity_type="cosine")

#rmse_results = item_item.evaluate(test)
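
For intuition about `similarity_type="cosine"` and `only_top_k`, the sketch below computes cosine similarity between item rating vectors and keeps the k most similar items per item. It is a plain-Python illustration on made-up data. The helper names are hypothetical, and GraphLab's own computation may differ in details such as normalization.

```python
import math
from collections import defaultdict

def item_vectors(rows):
    """Map each item to a sparse vector {user: rating}."""
    vecs = defaultdict(dict)
    for user, item, rating in rows:
        vecs[item][user] = rating
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(rows, k=3):
    """For every item, keep the k other items with the highest cosine similarity."""
    vecs = item_vectors(rows)
    neighbors = {}
    for item, vec in vecs.items():
        scored = [(other, cosine(vec, vecs[other])) for other in vecs if other != item]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[item] = scored[:k]
    return neighbors

# Hypothetical ratings: items A and B are rated identically, C is not.
toy = [("u1", "A", 5), ("u1", "B", 5), ("u2", "A", 4),
       ("u2", "B", 4), ("u2", "C", 1), ("u3", "C", 5)]
nearest = top_k_similar(toy, k=1)
print(nearest["A"])
```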



In [99]:
m2 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="movieId", 
                                  target="rating")

#rmse_results = rec1.evaluate(test)




### Comparison

In [152]:

model_comp = gl.recommender.util.compare_models(test,[m1,m2])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.406666666667 | 0.0260060796283 |
|   2    | 0.383333333333 | 0.0523927909334 |
|   3    | 0.331111111111 | 0.0627383536062 |
|   4    | 0.311666666667 |  0.076143529002 |
|   5    |     0.292      | 0.0903470627721 |
|   6    | 0.282222222222 |  0.102396882225 |
|   7    | 0.273333333333 |  0.111336718672 |
|   8    |      0.26      |  0.120089437014 |
|   9    | 0.25037037037  |  0.132885096745 |
|   10   |     0.244      |  0.143880638096 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.571932133261484)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  310   |   2   | 1.54033180726 |
+--------+-------+---------------+
[1 rows x 3 colum
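
The precision and recall columns above are reported per cutoff. As a reminder of what those numbers mean, here is a minimal sketch of precision@k and recall@k for a single user. The recommended list reuses user 1's top-3 movieIds from the `results` output, while the `relevant` set is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall for the top-k recommended items of one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = [593, 50, 858]   # user 1's top-3 from the results table above
relevant = {50, 1172}          # hypothetical held-out movies for that user

p_at_3, r_at_3 = precision_recall_at_k(recommended, relevant, 3)
print(p_at_3, r_at_3)
```

Averaging these per-user values over the test users gives figures of the kind shown as `mean_precision` and `mean_recall`. Precision tends to fall and recall to rise as the cutoff grows, which is the pattern in the table.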

In [153]:
params = {'user_id': 'userId', 
          'item_id': 'movieId', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=10,
        environment=None)

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-06-2017-11-29-0300000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-06-2017-11-29-0300000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-06-2017-11-29-0300000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-06-2017-11-29-0300000-9b258'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-06-2017-11-29-0300000-9b258' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-06-2017-11-29-0300000-9b258' scheduled.
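
The `params` dictionary mixes fixed values with lists of candidates. A simple way to see how many combinations it spells out on its own is to expand it with `itertools.product`, as sketched below. This is not a description of how `model_parameter_search` picks its models: the helper name `expand_grid` is hypothetical, and the search output further down also varies settings that are not listed in `params`.

```python
from itertools import product

# Same dictionary as in the cell above.
params = {'user_id': 'userId',
          'item_id': 'movieId',
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32],
          'regularization': [0.001],
          'linear_regularization': [0.001]}

def expand_grid(params):
    """Return one dict per combination, treating non-list values as fixed."""
    keys = sorted(params)
    value_lists = [params[k] if isinstance(params[k], list) else [params[k]] for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

grid = expand_grid(params)
print(len(grid))
```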


In [102]:
job.get_status()

{'Canceled': 0, 'Completed': 0, 'Failed': 0, 'Pending': 10, 'Running': 0}

In [103]:
job_result = job.get_results()

job_result.head()

model_id,item_id,linear_regularization,max_iterations,num_factors,num_sampled_negative_examples,ranking_regularization
9,movieId,0.001,50,32,4,0.1
8,movieId,0.001,25,8,4,0.25
1,movieId,0.001,50,32,4,0.5
0,movieId,0.001,25,16,8,0.25
3,movieId,0.001,50,12,4,0.5
2,movieId,0.001,50,16,8,0.1
5,movieId,0.001,50,12,8,0.25
4,movieId,0.001,25,24,4,0.25
7,movieId,0.001,25,12,8,0.5
6,movieId,0.001,50,16,8,0.5

regularization,target,user_id,training_precision@5,training_recall@5,training_rmse,validation_precision@5
0.001,rating,userId,0.366020864382,0.0233486863467,0.944114974915,0.149333333333
0.001,rating,userId,0.362444113264,0.0233259708834,1.00780036008,0.150666666667
0.001,rating,userId,0.374068554396,0.0238303483388,1.11370546438,0.154666666667
0.001,rating,userId,0.362444113264,0.0233259708834,1.01951912308,0.148
0.001,rating,userId,0.362444113264,0.0233259708834,1.11371087114,0.149333333333
0.001,rating,userId,0.366318926975,0.0233532666141,0.949925309765,0.152
0.001,rating,userId,0.362444113264,0.0233259708834,1.01962036348,0.149333333333
0.001,rating,userId,0.362444113264,0.0233259708834,1.00772223023,0.150666666667
0.001,rating,userId,0.387183308495,0.0248522624061,1.13520800339,0.164
0.001,rating,userId,0.387183308495,0.0248522624061,1.13527041802,0.164

validation_recall@5,validation_rmse
0.0407873136979,0.981967589655
0.0405225993817,1.02567597193
0.0445053764856,1.11574977847
0.0418273179021,1.03482603392
0.0418863288665,1.11573288606
0.0430282100564,0.986245291087
0.0419212146157,1.03483525442
0.0405225993817,1.02558280262
0.0447553714093,1.13429870344
0.0453851900034,1.13446615084


In [104]:
bst_prms = job.get_best_params()
bst_prms

{'item_id': 'movieId',
 'linear_regularization': 0.001,
 'max_iterations': 50,
 'num_factors': 32,
 'num_sampled_negative_examples': 4,
 'ranking_regularization': 0.1,
 'regularization': 0.001,
 'target': 'rating',
 'user_id': 'userId'}
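
The reported best parameters (32 factors, 50 iterations, `ranking_regularization` of 0.1) match model 9 in the results table, which is also the row with the lowest `validation_rmse`. The sketch below repeats that selection by hand and contrasts it with picking on `validation_precision@5`. It assumes the rows of the three printed column chunks line up in order, and that `get_best_params` selected on validation RMSE, which the output suggests but does not state.

```python
# (model_id, num_factors, validation_precision@5, validation_rmse), copied from
# the job_result.head() output above, assuming rows align across column chunks.
search_results = [
    (9, 32, 0.149333333333, 0.981967589655),
    (8, 8, 0.150666666667, 1.02567597193),
    (1, 32, 0.154666666667, 1.11574977847),
    (0, 16, 0.148, 1.03482603392),
    (3, 12, 0.149333333333, 1.11573288606),
    (2, 16, 0.152, 0.986245291087),
    (5, 12, 0.149333333333, 1.03483525442),
    (4, 24, 0.150666666667, 1.02558280262),
    (7, 12, 0.164, 1.13429870344),
    (6, 16, 0.164, 1.13446615084),
]

best_by_rmse = min(search_results, key=lambda row: row[3])
best_by_precision = max(search_results, key=lambda row: row[2])
print(best_by_rmse[0], best_by_precision[0])
```

The two criteria disagree here: the models with the highest validation precision@5 (models 6 and 7) have the highest validation RMSE, so the choice of metric changes which model counts as best.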

### Add Ranking Factorization Info


In [105]:
models = job.get_models()
models

[Class                            : RankingFactorizationRecommender
 
 Schema
 ------
 User ID                          : userId
 Item ID                          : movieId
 Target                           : rating
 Additional observation features  : 0
 User side features               : []
 Item side features               : []
 
 Statistics
 ----------
 Number of observations           : 95889
 Number of users                  : 671
 Number of items                  : 8928
 
 Training summary
 ----------------
 Training time                    : 5.438
 
 Model Parameters
 ----------------
 Model class                      : RankingFactorizationRecommender
 num_factors                      : 16
 binary_target                    : 0
 side_data_factorization          : 1
 solver                           : auto
 nmf                              : 0
 max_iterations                   : 25
 
 Regularization Settings
 -----------------------
 regularization                   : 0.001
 regul

In [106]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.2       | 0.0119710178764 |
|   2    |      0.19      | 0.0194341580494 |
|   3    | 0.184444444444 | 0.0321864612165 |
|   4    |     0.165      | 0.0378045295974 |
|   5    |     0.148      | 0.0418273179021 |
|   6    | 0.151111111111 | 0.0498603387353 |
|   7    | 0.151428571429 | 0.0556285371433 |
|   8    |     0.145      | 0.0602584157479 |
|   9    | 0.142962962963 |  0.065741651979 |
|   10   | 0.136666666667 |  0.069891776557 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.186666666667 | 0.0112748

In [107]:
## TODO: discuss why the rating is harder to predict in this model: it ranges from 1 to 5 instead of being 0 or 1,
## so ranking factorization is penalized by the size of each rating error rather than by a simple hit-or-miss decision
## 
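
One way to explore the hit-or-miss framing mentioned in the note above is to collapse the 1 to 5 ratings into a binary "liked" label before modeling. The sketch below shows only that conversion in plain Python. The threshold of 4 and the helper name `binarize` are assumptions for illustration, and this step was not part of the models trained in this notebook.

```python
# (userId, movieId, rating) rows copied from the data.head() output earlier.
rows = [
    (1, 31, 2), (1, 1029, 3), (1, 1061, 3), (1, 1129, 2), (1, 1172, 4),
    (1, 1263, 2), (1, 1287, 2), (1, 1293, 2), (1, 1339, 3), (1, 1343, 2),
]

def binarize(rows, threshold=4):
    """Replace each rating with 1 if it is at least `threshold`, else 0."""
    return [(user, movie, 1 if rating >= threshold else 0)
            for user, movie, rating in rows]

binary_rows = binarize(rows)
liked = [movie for _, movie, label in binary_rows if label == 1]
print(liked)
```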

In [174]:
# Data for Model 1
data_ratings = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/ratings.csv", column_type_hints={"rating":int})

# drop the timestamp column; it is not used by the recommender
data_ratings.remove_column('timestamp')

# Data for Model 2
data_movies = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/movies.csv", 
                          column_type_hints={"movieId":int})

#data_movies.remove_column('timestamp')
                                 


    
# Extra credit: add item side features (title and genres from movies.csv)
# and set ranking options (ranking_regularization, unobserved_rating_value)
model_ratings = gl.recommender.ranking_factorization_recommender.create(data_ratings, user_id="userId", item_id="movieId", 
                                      item_data=data_movies, target="rating", ranking_regularization=0.1, unobserved_rating_value=1)
# leaving k as 5
results = model_ratings.recommend(users=None, k=5)
model_ratings.save("my_model")
                                 

    

In [175]:
print(model_ratings)

Class                            : RankingFactorizationRecommender

Schema
------
User ID                          : userId
Item ID                          : movieId
Target                           : rating
Additional observation features  : 0
User side features               : []
Item side features               : ['movieId', 'title', 'genres']

Statistics
----------
Number of observations           : 100004
Number of users                  : 671
Number of items                  : 9125

Training summary
----------------
Training time                    : 14.039

Model Parameters
----------------
Model class                      : RankingFactorizationRecommender
num_factors                      : 32
binary_target                    : 0
side_data_factorization          : 1
solver                           : auto
nmf                              : 0
max_iterations                   : 25

Regularization Settings
----------------------

In [176]:
#ignore code

#80% train and 20% for testing 
#train, test = gl.recommender.util.random_split_by_user(model_ratings,
#                                                    user_id="userId", item_id="movieId",
#                                                            max_num_users=150, item_test_proportion=0.2)

In [177]:
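# Caution: model_ratings was trained on all 100004 ratings, which appears to include the rows in `test`,
# so its scores here are not a fair held-out comparison with m1 and m2
model_comp_rating = gl.recommender.util.compare_models(test,[m1,m2,model_ratings])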

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.406666666667 | 0.0260060796283 |
|   2    | 0.383333333333 | 0.0523927909334 |
|   3    | 0.331111111111 | 0.0627383536062 |
|   4    | 0.311666666667 |  0.076143529002 |
|   5    |     0.292      | 0.0903470627721 |
|   6    | 0.282222222222 |  0.102396882225 |
|   7    | 0.273333333333 |  0.111336718672 |
|   8    |      0.26      |  0.120089437014 |
|   9    | 0.25037037037  |  0.132885096745 |
|   10   |     0.244      |  0.143880638096 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.571932271133501)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  310   |   2   | 1.54033180726 |
+--------+-------+---------------+
[1 rows x 3 colum
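
The "Overall RMSE" line above is a root-mean-square error between predicted scores and the held-out ratings. A minimal sketch of that calculation, on made-up numbers, looks like this; the function name `rmse` and the sample values are illustrative only.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

predicted = [4.0, 3.0, 5.0]   # hypothetical model scores
actual = [2, 3, 4]            # hypothetical held-out ratings
error = rmse(predicted, actual)
print(error)
```

Since the ratings here only range from 1 to 5, an RMSE above 3 means the scores are far from the held-out ratings on average, which is worth keeping in mind when comparing that figure with the validation RMSE values near 1 from the parameter search.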

In [178]:
# Model 1 (m1) = item-item similarity recommender, based on which items each user interacted with

# Model 2 (m2) = RankingFactorizationRecommender


# Deployment (10 points total) - John
• Be critical of your performance and tell the reader how your current model might be usable by
other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
• How useful is your model for interested parties (i.e., the companies or organizations
that might want to use it)?
• How would you deploy your model for interested parties?
• What other data should be collected?
• How often would the model need to be updated, etc.?

# Exceptional Work (10 points total) - Scott
• You have free rein to provide additional analyses or combine analyses.