<h1 align="center"><font size="5">OVERVIEW</font></h1>

## 1. Introduction
Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous and can be commonly seen in online stores, movie databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement a simple version of one using Python and the Pandas library. <br>
### Story
In this kernel, a data scientist wants to solve two problems: suggesting promotions to people who become customers of XXX, and promoting the best-rated movies to customers who have not yet watched them.
### Definition of Algorithm
**Collaborative Filtering** is the process of filtering information by collecting human judgments (ratings), a form of “word of mouth”. Collaborative filtering (CF) is a technique commonly used to build personalized recommendations on the Web. 

**Limitations of content-based recommendation:**

- It is only capable of suggesting movies that are close to a certain movie. That is, it cannot capture a user's interests and provide recommendations across genres.
- It doesn't capture the personal interests and biases of a user. Anyone querying the model for recommendations based on a movie will receive the same recommendations for that movie, regardless of who they are.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
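
To make the "similar users" idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The toy ratings, the user names and the helper functions `cosine_sim` and `predict_rating` are hypothetical and exist only for illustration; this is not how Surprise works internally.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from the dataset.
ratings_by_user = {
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 4.0, "B": 5.0, "C": 2.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0},
}

def cosine_sim(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(target, item, data):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, their_ratings in data.items():
        if other == target or item not in their_ratings:
            continue
        sim = cosine_sim(data[target], their_ratings)
        num += sim * their_ratings[item]
        den += sim
    return num / den if den else None

# ann has not rated "D"; bob (similar taste) liked it, cat (different taste) did not.
prediction = predict_rating("ann", "D", ratings_by_user)
print(prediction)
```

Because ann's ratings resemble bob's more than cat's, the prediction lands closer to bob's rating of 5 than to cat's rating of 1.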

We will use the **Surprise** library, which provides powerful algorithms such as **Singular Value Decomposition (SVD)** that minimise RMSE (Root Mean Square Error) and give great recommendations.
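
As a rough illustration of what an SVD-style recommender learns, the sketch below trains a small biased matrix factorization with stochastic gradient descent, predicting each rating as global mean + user bias + item bias + dot product of latent factors. The toy ratings, the function name `train_mf` and the hyperparameter values are assumptions chosen for the example; the code is written in the spirit of Surprise's `SVD`, not copied from it.

```python
import random
from math import sqrt

# Hypothetical toy ratings as (user, item, rating) triples.
toy_ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0),
    ("u2", "m3", 1.0), ("u3", "m2", 2.0), ("u3", "m3", 5.0),
]

def train_mf(triples, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD and return a predict function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in triples) / len(triples)  # global mean rating
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - predict(u, i)
            # Gradient steps on the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def training_rmse(predict, triples):
    """Root mean square error of `predict` on the given triples."""
    return sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in triples) / len(triples))

mean_rating = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline_rmse = training_rmse(lambda u, i: mean_rating, toy_ratings)
trained_rmse = training_rmse(train_mf(toy_ratings), toy_ratings)
print(baseline_rmse, trained_rmse)
```

The trained model should fit the toy ratings better than always predicting the mean. Note that this sketch only handles users and items seen during training.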

#### Install the Libraries

In [1]:
!pip install surprise



You are using pip version 9.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [2]:
!pip install scikit-surprise



You are using pip version 9.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## 2. Loading Data

#### Load Libraries

In [3]:
# Dataframe manipulation library
import pandas as pd
# Math functions; we only need sqrt, so import just that
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

# Show every column when displaying dataframes
pd.options.display.max_columns = None

# Silence library warnings to keep the notebook output readable
import warnings
warnings.simplefilter('ignore')

In [4]:
reader = Reader()

#### Importing the Dataset

In [5]:
ratings = pd.read_csv('../Dataset/ratings_small.csv')
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


#### Sneak Peek at the Dataset

In [6]:
ratings.shape

(100004, 4)

In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


## 3. Execute Algorithm
The dataset module defines the Dataset class and other subclasses which are used for managing datasets.

In [8]:
#Load a built-in dataset.
data = Dataset.load_builtin('ml-100k')

Instead of running cross-validation, we can also fit our algorithm on the whole dataset. To do so, use the **build_full_trainset()** method, which builds a trainset object from all available ratings:

In [9]:
trainset = data.build_full_trainset()

To load a dataset from a pandas dataframe, you will need the **load_from_df()** method. You will also need a **Reader** object, but only the rating_scale parameter must be specified. Note that `Reader()` defaults to a rating scale of (1, 5), so pass rating_scale explicitly if your ratings use a different range. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order. Each row thus corresponds to a given rating. This is not restrictive, as you can easily reorder the columns of your dataframe.<br>

In [10]:
# Build a Surprise dataset from the ratings dataframe (user, item, rating columns).
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8972  0.8962  0.8960  0.9008  0.8959  0.8972  0.0019  
MAE (testset)     0.6912  0.6914  0.6901  0.6931  0.6871  0.6906  0.0020  
Fit time          8.29    7.10    8.21    7.28    7.15    7.61    0.53    
Test time         0.27    0.19    0.29    0.19    0.20    0.23    0.04    


{'fit_time': (8.294590473175049,
  7.095261812210083,
  8.209322452545166,
  7.277791738510132,
  7.152303695678711),
 'test_mae': array([0.69120988, 0.69137263, 0.69012456, 0.69307116, 0.68708794]),
 'test_rmse': array([0.89724847, 0.89622619, 0.89598644, 0.90083055, 0.89591054]),
 'test_time': (0.26535987854003906,
  0.18798375129699707,
  0.29363107681274414,
  0.19198346138000488,
  0.19641661643981934)}

We get a mean **Root Mean Square Error** of 0.8972, which is more than good enough for now. Let us now train on the dataset and arrive at predictions.
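
For reference, the two measures reported by the cross-validation can be computed by hand as follows. This is a minimal sketch with made-up ratings; the function names `rmse` and `mae` are local helpers, not the Surprise `accuracy` API.

```python
from math import sqrt

def rmse(true_ratings, predicted_ratings):
    """Root mean square error: large errors are penalised more heavily."""
    errors = [t - p for t, p in zip(true_ratings, predicted_ratings)]
    return sqrt(sum(e * e for e in errors) / len(errors))

def mae(true_ratings, predicted_ratings):
    """Mean absolute error: the average size of the error, in rating points."""
    errors = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(errors) / len(errors)

# Hypothetical true and predicted ratings for four user-movie pairs.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))  # 0.75
print(mae(true_ratings, predicted_ratings))   # 0.625
```

An RMSE of about 0.9 therefore means that predictions are typically off by a little less than one rating point.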

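Calling **fit()** trains the algorithm on a trainset; the fitted instance can then be used however we please. Here we first rebuild the full trainset from the ratings dataframe, because the earlier `trainset` was built from the built-in dataset:

In [11]:
trainset = data.build_full_trainset()
algo.fit(trainset)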

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16e4a4b1908>

If you don’t want to run a full cross-validation procedure, you can use **train_test_split()** to sample a trainset and a testset of given sizes.

In [12]:
trainset, testset = train_test_split(data, test_size=0.25)

algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16e4a4b4b38>
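
Conceptually, a random train/test split shuffles the ratings and holds out a fraction of them for evaluation. The sketch below shows that idea in plain Python; `split_ratings` and the toy triples are hypothetical, and the code is not Surprise's own implementation.

```python
import random

def split_ratings(triples, test_size=0.25, seed=0):
    """Shuffle (user, item, rating) triples and hold out a fraction as the testset."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy ratings as (user, item, rating) triples.
toy_triples = [
    (1, 31, 2.5), (1, 1029, 3.0), (2, 31, 4.0), (2, 50, 5.0),
    (3, 50, 3.5), (3, 1029, 1.0), (4, 31, 4.5), (4, 50, 2.0),
]

train_part, test_part = split_ratings(toy_triples, test_size=0.25)
print(len(train_part), len(test_part))  # 6 2
```

The model is fitted on the training part only, so the held-out ratings give an honest estimate of the prediction error.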

Before suggesting movies for userId 1, let us look at the ratings this user has already given:

In [13]:
ratings[ratings['userId'] == 1]

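| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |
| 5 | 1 | 1263 | 2.0 | 1260759151 |
| 6 | 1 | 1287 | 2.0 | 1260759187 |
| 7 | 1 | 1293 | 2.0 | 1260759148 |
| 8 | 1 | 1339 | 3.5 | 1260759125 |
| 9 | 1 | 1343 | 2.0 | 1260759131 |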


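We can now predict ratings by directly calling the **predict()** method. Let’s say you’re interested in user 1 and item 302, and you know that the true rating is `r_ui = 4`. Raw ids must match the type used when the data was loaded: they are strings when ratings are read from a file, but with **load_from_df()** they keep the dataframe's type (integers here). An id that the trainset does not know is not an error; the estimate then falls back on the global mean rating (plus any bias that is known).

In [14]:
uid = str(1)  # raw user id, passed as a string (the dataframe stores it as an integer)
iid = str(302)  # raw item id, passed as a string (the dataframe stores it as an integer)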

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 1          item: 302        r_ui = 4.00   est = 3.54   {'was_impossible': False}
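
To turn single predictions into suggestions for movies a customer has not yet watched, one option is to score every unseen movie and keep the highest estimates. The sketch below assumes a generic `predict(user, item)` callable; `top_n_unseen`, `fake_scores` and `fake_predict` are hypothetical stand-ins, and with a fitted Surprise model the callable could wrap `algo.predict(user, item).est`.

```python
def top_n_unseen(user, all_items, seen_items, predict, n=3):
    """Return the n unseen items with the highest predicted rating for `user`."""
    candidates = [item for item in all_items if item not in seen_items]
    scored = [(item, predict(user, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical predicted ratings standing in for a fitted model.
fake_scores = {101: 4.2, 102: 3.1, 103: 4.8, 104: 2.5}

def fake_predict(user, item):
    return fake_scores[item]

# User 1 has already watched movie 102, so it is excluded from the suggestions.
recommendations = top_n_unseen(1, fake_scores.keys(), {102}, fake_predict, n=2)
print(recommendations)  # [(103, 4.8), (101, 4.2)]
```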


<h1 align="center"><font size="5">CONCLUSION</font></h1>
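For the movie with ID 302, we get an estimated rating of **3.54**. Because the ids in cell [14] were passed as strings while the dataframe stores them as integers, this value is most likely the fallback global mean rather than a personalised estimate; passing integer ids should give a user-specific prediction. One striking feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and predicts ratings from how other users have rated the movie.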