# MovieLens 100K Dataset

http://files.grouplens.org/datasets/movielens/ml-100k.zip

http://grouplens.org/datasets/movielens/100k/

Install GraphLab Create with Command Line
https://turi.com/download/install-graphlab-create-command-line.html

We will use the MovieLens 100K dataset throughout. It consists of:

100,000 ratings (1-5) from 943 users on 1,682 movies.

Each user has rated at least 20 movies.

Simple demographic info for the users (age, gender, occupation, zip code).

Genre information for the movies.

Let's load this data into Python. The ml-100k.zip archive contains many files; let's load the three most important ones to get a sense of the data. I also recommend reading the README, which describes the different files in detail.


In [2]:
import numpy as np
import pandas as pd

# pass in column names for each CSV and read them using pandas. 
# Column names available in the readme file

#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('C:\\Users\\Madhur\\Desktop\\DS Deck\\Machine Learning\\Machine Learning Sections\\Recommender-Systems\\u.user', sep='|', names=u_cols, encoding='latin-1')

#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('C:\\Users\\Madhur\\Desktop\\DS Deck\\Machine Learning\\Machine Learning Sections\\Recommender-Systems\\u.data', sep='\t', names=r_cols, encoding='latin-1')

#Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy','Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('C:\\Users\\Madhur\\Desktop\\DS Deck\\Machine Learning\\Machine Learning Sections\\Recommender-Systems\\u.item', sep='|', names=i_cols, encoding='latin-1')

### Users dataset

In [3]:
print(users.shape)
users.head()

(943, 5)


| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


## What is GraphLab?

GraphLab is a new parallel framework for machine learning written in C++. It is an open source project designed with the scale, variety and complexity of real-world data in mind. It incorporates high-level algorithms and techniques such as Stochastic Gradient Descent (SGD), Gradient Descent and locking to deliver high performance. It helps data scientists and developers easily create and install applications at large scale.

But what makes it amazing? It’s the presence of neat libraries for data transformation, manipulation and model visualization. In addition, it comprises scalable machine learning toolkits that have (almost) everything required to improve machine learning models. The toolkits include implementations of deep learning, factorization machines, topic modeling, clustering, nearest neighbors and more.

Install GraphLab
----------------
conda install -c derickl graphlab-create 

pip install GraphLab-Create
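
The next cell wraps `ratings_base` and `ratings_test` in SFrames, but neither name is defined anywhere above. Here is a minimal sketch of how they could be loaded. It assumes they are meant to be the predefined `ua.base` / `ua.test` train/test split that ships inside ml-100k.zip, and that the extracted files live in a folder named by the hypothetical `DATA_DIR` variable. The helper name `load_ratings` is also made up for illustration.

```python
import os
import pandas as pd

# Assumption: folder where ml-100k.zip was extracted; adjust to your machine.
DATA_DIR = 'ml-100k'

# Same column layout as u.data (tab-separated, no header row).
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

def load_ratings(path):
    """Read one tab-separated MovieLens ratings file into a DataFrame."""
    return pd.read_csv(path, sep='\t', names=r_cols, encoding='latin-1')

# Assumption: the ua.base / ua.test split is the intended train/test pair.
# The guard keeps the cell from failing when the files are not present.
if os.path.exists(os.path.join(DATA_DIR, 'ua.base')):
    ratings_base = load_ratings(os.path.join(DATA_DIR, 'ua.base'))
    ratings_test = load_ratings(os.path.join(DATA_DIR, 'ua.test'))
    print(ratings_base.shape, ratings_test.shape)
```

Any other split of `u.data` would work just as well, as long as the training and test sets do not share ratings.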

In [4]:
import graphlab
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)

ModuleNotFoundError: No module named 'graphlab'

## A Simple Popularity Model

Every user receives the same recommendations, based on the most popular items. We’ll use GraphLab's popularity_recommender for this.

In [None]:
popularity_model = graphlab.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')

Arguments:

train_data: the SFrame which contains the required data

user_id: the column name which represents each user ID

item_id: the column name which represents each item to be recommended

target: the column name representing scores/ratings given by the user

In [None]:
#Get recommendations for first 5 users and print them
#users = range(1,6) specifies user ID of first 5 users
#k=5 specifies top 5 recommendations to be given
popularity_recomm = popularity_model.recommend(users=range(1,6),k=5)
popularity_recomm.print_rows(num_rows=25)

Did you notice something? The recommendations are the same for all users: 1500, 1201, 1189, 1122, 814, in the same order. This can be verified by checking the movies with the highest mean ratings in our ratings_base data set:

In [None]:
ratings_base.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(20)
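
To see why a popularity model cannot personalize, here is a small pure-Python sketch of the same idea: rank items by mean rating and return the top k. It is an illustration under simplifying assumptions, not GraphLab's implementation. The function name `top_k_by_mean_rating` and the toy ratings are invented for this example.

```python
from collections import defaultdict

def top_k_by_mean_rating(ratings, k):
    """Return the k item IDs with the highest mean rating.

    ratings: iterable of (user_id, item_id, rating) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
    means = {item: totals[item] / counts[item] for item in totals}
    # Sort by mean rating, highest first; break ties by item ID for determinism.
    return sorted(means, key=lambda item: (-means[item], item))[:k]

# Toy data (made up for illustration, not taken from MovieLens).
toy_ratings = [
    (1, 'A', 5), (2, 'A', 3), (3, 'A', 4),   # mean 4.0 over three ratings
    (1, 'B', 2), (2, 'B', 3),                # mean 2.5 over two ratings
    (3, 'C', 5),                             # mean 5.0 from a single rating
]

# The ranking never looks at who is asking, so every user gets this same list.
print(top_k_by_mean_rating(toy_ratings, k=2))  # ['C', 'A']
```

Note that item `C` tops the list on the strength of a single rating. Ranking by raw mean favors rarely rated items, which may be why unfamiliar movie IDs appear in the recommendations above.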

## A Collaborative Filtering Model

Let's create a model based on item similarity as follows:

In [None]:
#Train Model
item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='pearson')

#Make Recommendations:
item_sim_recomm = item_sim_model.recommend(users=range(1,6),k=5)
item_sim_recomm.print_rows(num_rows=25)

Here we can see that the recommendations are different for each user. So, personalization exists. 
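
The intuition behind an item similarity model is that two items are similar when the same users rate them in a similar way. The sketch below computes a Pearson correlation between two items over the users who rated both. It is only meant to illustrate the idea; GraphLab's internal `pearson` computation may differ in its details. The function name and the toy data are assumptions for this example.

```python
from math import sqrt

def pearson_item_similarity(ratings_by_item, item_a, item_b):
    """Pearson correlation between two items over users who rated both.

    ratings_by_item: dict mapping item_id -> {user_id: rating}.
    Returns 0.0 when fewer than two users overlap, or when either item's
    ratings have zero variance over the shared users.
    """
    a, b = ratings_by_item[item_a], ratings_by_item[item_b]
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[u] for u in common) / len(common)
    mean_b = sum(b[u] for u in common) / len(common)
    cov = sum((a[u] - mean_a) * (b[u] - mean_b) for u in common)
    var_a = sum((a[u] - mean_a) ** 2 for u in common)
    var_b = sum((b[u] - mean_b) ** 2 for u in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

# Toy data (made up): three users rating three items.
toy = {
    'X': {1: 5, 2: 3, 3: 1},
    'Y': {1: 4, 2: 3, 3: 2},   # rises and falls with X
    'Z': {1: 1, 2: 3, 3: 5},   # moves opposite to X
}

print(pearson_item_similarity(toy, 'X', 'Y'))  # 1.0
print(pearson_item_similarity(toy, 'X', 'Z'))  # -1.0
```

A recommender built on this would score unseen items by their similarity to the items a user has already rated highly, which is why different users end up with different lists.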

#### But how good is this model? 
We need some means of evaluating a recommendation engine. Let's focus on that in the next section.

## Evaluating Recommendation Engines

#### Recall
The fraction of the items a user likes that were actually recommended.
If a user likes, say, 5 items and the recommender shows 3 of them, the recall is 3/5 = 0.6.
#### Precision
The fraction of the recommended items that the user actually liked.
If 5 items were recommended and the user liked 4 of them, the precision is 4/5 = 0.8.
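
Both definitions reduce to counting the overlap between the recommended items and the liked items. A minimal sketch for a single user follows; the helper name `precision_recall` and the item labels are invented for illustration.

```python
def precision_recall(recommended, liked):
    """Precision and recall of one user's recommendation list.

    recommended: items shown to the user.
    liked: items the user actually likes.
    """
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall

# Recall example from the text: the user likes 5 items, 3 of them were shown.
_, recall = precision_recall(recommended=['a', 'b', 'c'],
                             liked=['a', 'b', 'c', 'd', 'e'])
print(recall)      # 0.6

# Precision example from the text: 5 items were shown, the user liked 4.
precision, _ = precision_recall(recommended=['a', 'b', 'c', 'd', 'x'],
                                liked=['a', 'b', 'c', 'd'])
print(precision)   # 0.8
```

In practice these values are computed per user for a fixed number of recommendations and then averaged across users.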

In [None]:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance,[popularity_model, item_sim_model])

Here we can make two quick observations:

The item similarity model is definitely better than the popularity model (by at least 10x).

On an absolute level, even the item similarity model performs poorly. It is far from being a useful recommendation system.