# Analysis of MovieLens Data

**Created by Phillip Efthimion, Scott Payne, Gino Varghese and John Blevins**

*MSDS 7331 Data Mining - Section 403 - Lab 3*

# Business Understanding (10 points total) - John
• [10 points] Describe the purpose of the data set you selected (i.e., why was this data
collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific
dataset and the stakeholders' needs?

# Data Understanding (20 points total) - Phillip
• [10 points] Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file. Verify data quality: Are there missing values? Duplicate data?
Outliers? Are those mistakes? How do you deal with these problems?
• [10 points] Visualize any important attributes appropriately. Important: Provide an
interpretation for any charts or graphs.

# Modeling and Evaluation (50 points total) - Gino, Scott
Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
the data you have chosen and use visualizations of the results to explain the performance and
expected outcomes whenever possible. Guide the reader through your analysis with plenty of
discussion of the results. Each option is broken down by:
• [10 Points] Train and adjust parameters
• [10 Points] Evaluate and Compare
• [10 Points] Visualize Results
• [20 Points] Summarize the Ramifications - John/Phillip

## Collaborative Filtering 
• Train: Create user-item matrices or item-item matrices using collaborative filtering
(adjust parameters).
• Eval: Determine performance of the recommendations using different performance
measures (explain the ramifications of each measure).
• Visualize: Use tables/visualization to discuss the found results. Explain each
visualization in detail.
• Summarize: Describe your results. What findings are the most compelling and why?

### Train and adjust parameters (10 points) - Gino, Scott
### Evaluate and Compare (10 points) - Gino, Scott
### Visualize Results  (10 points) - Gino, Scott
### Summarize the Ramifications (20 points) - John/Phillip

### 1 Actual : Train and adjust parameters (10 points) - Gino, Scott


In [9]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image

# sets the output of built in visualizations to the notebook instead of the browser based canvas utility
gl.canvas.set_target('ipynb') 



#Please ignore code block, the code was used for analysis
                                 
#data_rating.add_columns('')

#data_movies.add_column()

#model_ratings.user_id=data_movies.movieId 
                                                                  
#model_movies = gl.recommender.item_content_recommender.create(data_movies, user_id="movieId", item_id="title", target="genre")



In [10]:
# Reads the movie ratings data directly into an SFrame
data_ratings = gl.SFrame.read_csv("data/ml-latest-small/ratings.csv", column_type_hints={"rating":int})

# Removes timestamp column
data_ratings.remove_column('timestamp')


userId,movieId,rating
1,31,2
1,1029,3
1,1061,3
1,1129,2
1,1172,4
1,1263,2
1,1287,2
1,1293,2
1,1339,3
1,1343,2


## Simple Recommender Model
GraphLab can create a recommender model from an SFrame and chooses the type of model that best fits the data. The only requirement is that the SFrame contain a column of item IDs and a column of user IDs. An optional target value, such as a rating, can also be specified; if no target is specified, the model is based on item-item similarity.

In [13]:
# Because no model is specified, GraphLab will select the most appropriate model
auto_selected_model = gl.recommender.create(data, user_id="userId", item_id="movieId", target="rating")

# results = model_ratings.recommend(users=None, k=5)
# results = model_ratings.recommend(users=None, k=3)
# model_ratings.save("my_model")

### Simple code, powerful results 
With a single line of code, GraphLab examines the data and builds a recommender model with automatically chosen settings. Because we are interested in how different recommender models perform, we will look at a few of the recommendation models in GraphLab and see how well we can optimize their parameters.


## Train and Compare 
Before we build and optimize our recommender models, we need to split the data into training and testing sets so that we can measure how well each model performs on ratings it has not seen.

In [21]:
# Hold out 20% of each sampled user's items for testing (up to 150 users); everything else is used for training
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="userId", item_id="movieId",
                                                            max_num_users=150, item_test_proportion=0.2)
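
The split above holds out items only for a sample of users. The idea can be sketched in plain Python, assuming ratings are `(user, item, rating)` tuples. The function `split_by_user` and the toy data are hypothetical names made up for illustration; this is not GraphLab's implementation, and its sampling details may differ.

```python
import random
from collections import defaultdict


def split_by_user(rows, max_num_users=150, item_test_proportion=0.2, seed=0):
    """Hold out a share of each sampled user's rows; everything else is training data."""
    rng = random.Random(seed)

    # Group row indices by user (the first field of each row).
    by_user = defaultdict(list)
    for idx, row in enumerate(rows):
        by_user[row[0]].append(idx)

    # Sample the users whose rows are eligible for the test set.
    users = sorted(by_user)
    test_users = rng.sample(users, min(max_num_users, len(users)))

    test_idx = set()
    for user in test_users:
        for idx in by_user[user]:
            # Each row of a sampled user is held out with the given probability.
            if rng.random() < item_test_proportion:
                test_idx.add(idx)

    train_rows = [row for i, row in enumerate(rows) if i not in test_idx]
    test_rows = [row for i, row in enumerate(rows) if i in test_idx]
    return train_rows, test_rows


# Toy data (made up): 4 users with 10 ratings each.
toy = [(u, m, 3) for u in range(4) for m in range(10)]
toy_train, toy_test = split_by_user(toy, max_num_users=2)
```

Because users outside the sample contribute all of their rows to training, the overall train/test ratio is generally not 80/20; only the sampled users lose roughly 20% of their items.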

## Item-Item Similarity

In [27]:
# Create a recommender that uses item-item similarities based on users in common.
m1 = gl.recommender.item_similarity_recommender.create(train,
                                  user_id="userId",
                                  item_id="movieId",
                                  target="rating",
                                  only_top_k=3,              # number of similar items stored per item (5 was also tried)
                                  similarity_type="cosine")


#nearest_items = m1.get_similar_items()

#m1_nearest_items = gl.item_similarity_recommender.create(train, 
#                                  user_id="userId", 
#                                  item_id="movieId", 
#                                  target="rating",
#                                  #only_top_k=5,
#                                  only_top_k=3,
#                                  #similarity_type="sine")                    
#                                  similarity_type="cosine",
#                                  nearest_items= nearest_items)

#rmse_results = m1.evaluate(test)
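
To make the `similarity_type="cosine"` and `only_top_k` settings concrete, here is a minimal pure-Python sketch of cosine similarity between items and of keeping only the top k neighbors. The helper names and the toy ratings are assumptions made up for illustration; GraphLab's internal computation may differ in its details.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items, each given as {user: rating}."""
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(item, item_ratings, k=3):
    """Return the k items most similar to `item`, best first."""
    scores = [
        (other, cosine_similarity(item_ratings[item], item_ratings[other]))
        for other in item_ratings
        if other != item
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy item -> {user: rating} data, made up for illustration.
item_ratings = {
    "A": {1: 4, 2: 5},
    "B": {1: 4, 2: 5},
    "C": {3: 2},
    "D": {1: 1, 3: 5},
}
neighbors = top_k_similar("A", item_ratings, k=2)
```

Items rated by the same users with the same pattern (A and B here) score close to 1, while items with no users in common (A and C) score 0.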

## Ranking Factorization
The Ranking Factorization Recommender trains a model that predicts a score for every possible combination of user and item. The model's internal coefficients are learned from the known scores of users and items, and recommendations are then ranked by these predicted scores.
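
As a rough illustration of how such a model scores a user-item pair, the sketch below uses a common factorization form: a global bias, a user bias, an item bias, and the dot product of two latent factor vectors. All names and numbers are assumptions for illustration, and the single gradient step omits the regularization and ranking terms that a real solver would include.

```python
def predict_score(global_bias, user_bias, item_bias, user_factors, item_factors):
    """Predicted score: biases plus the dot product of the latent factor vectors."""
    interaction = sum(u * v for u, v in zip(user_factors, item_factors))
    return global_bias + user_bias + item_bias + interaction


def sgd_update(rating, predicted, user_factors, item_factors, learning_rate=0.01):
    """One unregularized gradient step on the squared error for a single known rating."""
    error = rating - predicted
    new_user = [u + learning_rate * error * v for u, v in zip(user_factors, item_factors)]
    new_item = [v + learning_rate * error * u for u, v in zip(user_factors, item_factors)]
    return new_user, new_item


# Made-up coefficients for one user and one movie.
user_factors = [1.0, 0.5]
item_factors = [0.5, 1.0]
score = predict_score(3.0, 0.5, -0.25, user_factors, item_factors)

# Nudge the factors toward an observed rating of 5 and score again.
user_factors2, item_factors2 = sgd_update(5.0, score, user_factors, item_factors)
new_score = predict_score(3.0, 0.5, -0.25, user_factors2, item_factors2)
```

After the update, `new_score` is closer to the observed rating than `score`, which is the sense in which the coefficients are learned from known scores.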

In [32]:
# Model that learns latent factors for each user and item and uses them to make rating predictions
m2 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="movieId", 
                                  target="rating")

#rmse_results = m2.evaluate(test)



## Comparison

In [34]:
model_comp = gl.recommender.util.compare_models(test, [m1,m2], model_names = ['Item-Item','Ranking Factorization'])

PROGRESS: Evaluate model Item-Item

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.406666666667 | 0.0260060796283 |
|   2    | 0.383333333333 | 0.0523927909334 |
|   3    | 0.331111111111 | 0.0627383536062 |
|   4    | 0.311666666667 |  0.076143529002 |
|   5    |     0.292      | 0.0903470627721 |
|   6    | 0.282222222222 |  0.102396882225 |
|   7    | 0.273333333333 |  0.111336718672 |
|   8    |      0.26      |  0.120089437014 |
|   9    | 0.25037037037  |  0.132885096745 |
|   10   |     0.244      |  0.143880638096 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.571932304218963)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  310   |   2   | 1.54033180726 |
+--------+-------+---------------+
[1 rows x 
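
The cutoff table above reports precision and recall averaged over users, and the RMSE lines measure rating error. The sketch below shows how these measures are commonly computed for a single user. The function names and the example lists are made up for illustration and are not taken from GraphLab.

```python
import math


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Made-up example: a ranked recommendation list and the user's held-out items.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 60, 70}

by_cutoff = {k: precision_recall_at_k(recommended, relevant, k) for k in (1, 2, 5)}
error = rmse([4, 2], [3, 4])
```

Raising the cutoff can only add hits, so recall never decreases, while precision tends to fall as less certain recommendations are included. That is the same pattern the table shows between cutoff 1 and cutoff 10.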

In [None]:
params = {'user_id': 'userId', 
          'item_id': 'movieId', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=10,
        environment=None)
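
In the `params` dictionary above, list-valued entries are the values to search over and scalar entries are held fixed. The sketch below expands such a dictionary into candidate settings; `expand_grid` is a hypothetical helper written for illustration, not part of GraphLab, and it does not reproduce how GraphLab chooses which candidates to train.

```python
from itertools import product


def expand_grid(params):
    """Expand list-valued entries into one dict per combination; scalars stay fixed."""
    fixed = {k: v for k, v in params.items() if not isinstance(v, list)}
    varied = {k: v for k, v in params.items() if isinstance(v, list)}
    names = sorted(varied)
    combos = []
    for values in product(*(varied[n] for n in names)):
        combo = dict(fixed)
        combo.update(zip(names, values))
        combos.append(combo)
    return combos


params = {'user_id': 'userId',
          'item_id': 'movieId',
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32],
          'regularization': [0.001],
          'linear_regularization': [0.001]}

candidates = expand_grid(params)
```

Only `num_factors` takes more than one value here, so there are 5 × 1 × 1 = 5 distinct combinations, fewer than the `max_models=10` limit.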

In [None]:
job.get_status()

In [None]:
job_result = job.get_results()

job_result.head()

In [None]:
bst_prms = job.get_best_params()
bst_prms

## Adding Side Data to the Model


In [None]:
models = job.get_models()
models

In [None]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

In [None]:
## TODO: discuss why the rating in the current model is harder to predict: ratings range from 1 to 5
## instead of being 0 or 1, so ranking factorization is penalized on the size of each rating error while
## building the model, rather than on a simple hit-or-miss decision.

In [None]:
# Data for Model 1
data_ratings = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/ratings.csv", column_type_hints={"rating":int})

# Removes timestamp column
data_ratings.remove_column('timestamp')

# Data for Model 2
data_movies = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/movies.csv", 
                          column_type_hints={"movieId":int})

#data_movies.remove_column('timestamp')
                                 


    
# Extra credit: adding side features (item_data).
# ranking_regularization and unobserved_rating_value are set to include the ranking correction.
model_ratings = gl.recommender.ranking_factorization_recommender.create(data_ratings, user_id="userId", item_id="movieId", 
                                      item_data=data_movies, target="rating", ranking_regularization=0.1, unobserved_rating_value=1)
# leaving k as 5
results = model_ratings.recommend(users=None, k=5)
model_ratings.save("my_model")
                                 

    

In [None]:
# Print the model summary (printing the bound `get` method shows nothing useful)
print(model_ratings)

In [None]:
#ignore code

#80% train and 20% for testing 
#train, test = gl.recommender.util.random_split_by_user(model_ratings,
#                                                    user_id="userId", item_id="movieId",
#                                                            max_num_users=150, item_test_proportion=0.2)

In [None]:
model_comp_rating = gl.recommender.util.compare_models(test,[m1,m2,model_ratings])

In [None]:
#Model:1 = item vs item similarity, if the user interacted with the item

#Model:2 = RankingFactorizationRecommender


# Deployment (10 points total) - John
• Be critical of your performance and tell the reader how your current model might be usable by
other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
• How useful is your model for interested parties (i.e., the companies or organizations
that might want to use it)?
• How would you deploy your model for interested parties?
• What other data should be collected?
• How often would the model need to be updated, etc.?

# Exceptional Work (10 points total) - Scott
• You have free rein to provide additional analyses or combine analyses.

In [None]:
from IPython.display import display
from IPython.display import Image
import graphlab.aggregate as agg


#gl.canvas.set_target('browser')
gl.canvas.set_target('ipynb')

# Count the ratings submitted by each user; the output column takes its name from the dict key.
count_rating = data.groupby(key_columns='userId', operations={'count_rating': agg.COUNT()})

# Count the movies rated by each user (one row per user-movie pair).
count_movie = data.groupby(key_columns='userId', operations={'count_movie': agg.COUNT()})

# Count of ratings for each movie, to see how the number of ratings varies from movie to movie
# (for example, movie 1 has many more ratings).
#count_movie_rating = data.groupby(key_columns='movieId', operations={'rating': agg.COUNT()})

data.show(view="Summary")

#data.show(view="Heat Map", x='movieId', y='count_movie_rating')

# Rating vs. movieId, to identify the most popular rating and how ratings fall into the rating categories.
data.show(view="Bar Chart", x='rating', y='movieId')

#data.show(view="Heat Map", x="userId", y="count_movie")

# Number of movies watched by each user. The counts live in the grouped SFrame, not in `data`.
count_movie.show(view="Bar Chart", x="userId", y="count_movie")

# Number of ratings submitted by each user, shown as a heat map.
count_rating.show(view="Heat Map", x="userId", y="count_rating")
#sa.count_rating = gl.SArray(data=count_rating, int)
#sa.count_rating = gl.SArray(data=count_rating, int)

#data.show(view="Heat Map", x='movieId', y='count_rating')

#data.show(view="Summary")


#data.show(view="Heat Map", x="movieId", y="rating")


#data.show(view="Heat Map", x="userId", y="rating")



#data_array = gl.SArray(data)
                            
#data.show(view="Bar Chart", x="userId", y="count_rating")

#count_rating.show(view="Numeric")
#show(view="numeric", data.)

#data.show(view="Bar Chart", x="userId", y="count_movie")

#data_array.show()


# Regarding data gaps: after comparing the graphs "userId vs. count of ratings" and "userId vs. count of movies",
# we can say that there are no data gaps in the data set. For example, userId 547 watched 2391 movies
# and also rated every movie that was watched.
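
The per-user counts compared in the note above can be sketched in plain Python. The first four rows below are copied from the `ratings.csv` preview earlier in this notebook; the row for user 2 and the helper name `per_user_counts` are made up for illustration.

```python
from collections import Counter, defaultdict


def per_user_counts(rows):
    """Return (ratings per user, distinct movies per user) for (user, movie, rating) rows."""
    rating_counts = Counter(user for user, _movie, _rating in rows)
    movies_by_user = defaultdict(set)
    for user, movie, _rating in rows:
        movies_by_user[user].add(movie)
    movie_counts = {user: len(movies) for user, movies in movies_by_user.items()}
    return rating_counts, movie_counts


rows = [
    (1, 31, 2), (1, 1029, 3), (1, 1061, 3), (1, 1129, 2),  # from the preview above
    (2, 31, 4),                                            # made-up row
]
rating_counts, movie_counts = per_user_counts(rows)
```

Since each row is one user-movie pair, the two counts agree for a user whenever that user has no duplicate rating of the same movie.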