# Movie Recommendation System

## 1. Objective
The goal of this project is to build a movie recommendation system using item-based collaborative filtering.

## 2. Description


In this project, I implement a recommendation algorithm using collaborative filtering. For novices like me, it serves as a foundation in recommendation systems and gives you something to start with.

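To implement item-based collaborative filtering, KNN (k-nearest neighbors) is a natural go-to model and a good baseline for recommender system development. But what is KNN? KNN is a non-parametric, lazy learning method: it makes no assumption about the distribution of the data and defers all computation to prediction time, when it looks up the k most similar items (or users) and combines their ratings.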
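To make the idea concrete, here is a minimal pure-Python sketch of item-based KNN prediction. It is written for illustration and is not how the Surprise library implements it; the toy `ratings` dictionary, the helper names `cosine_sim` and `predict`, and the choice of cosine similarity over co-rating users are all assumptions.

```python
from math import sqrt

# Hypothetical toy data: ratings[item][user] = rating
ratings = {
    "A": {"u1": 5.0, "u2": 3.0, "u3": 4.0},
    "B": {"u1": 4.0, "u2": 2.0, "u3": 5.0},
    "C": {"u1": 1.0, "u2": 5.0},
}

def cosine_sim(item_a, item_b):
    """Cosine similarity computed over the users who rated both items."""
    common = ratings[item_a].keys() & ratings[item_b].keys()
    if not common:
        return 0.0
    dot = sum(ratings[item_a][u] * ratings[item_b][u] for u in common)
    norm_a = sqrt(sum(ratings[item_a][u] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[item_b][u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, target, k=2):
    """Similarity-weighted average of the user's ratings on the k items
    most similar to `target`. Nothing is precomputed: all the work
    happens at prediction time, which is what "lazy" means here."""
    candidates = [
        (cosine_sim(target, other), ratings[other][user])
        for other in ratings
        if other != target and user in ratings[other]
    ]
    top_k = sorted(candidates, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in top_k) / total_sim

# u3 has not rated item C, so estimate it from items A and B
estimate = predict("u3", "C")
print(estimate)
```

The estimate always falls between the lowest and highest rating the user gave to the neighboring items, because it is a weighted average with non-negative weights.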

For the first model, I used the MovieLens 100k dataset: 100,000 ratings given by a set of users to a set of movies, together with movie metadata and user profiles. Because it is small, you can download it quickly and run the code on it, which makes it ideal for illustrative purposes.

You can download the dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip.

For the second model, I entered the data myself.

I originally wrote this mini project in an object-oriented style (I am converting it to a Jupyter notebook to visualise it more effectively). Hence, rather than following the standard data analysis process, I will show how I built the models and how they work, with several tests.

## 3. Modeling (1)
The first model is built with the MovieLens 100k dataset.

### 3.1 Import Libraries

In [1]:
import surprise
import pandas as pd

### 3.2 Load Dataset

In [2]:
data = surprise.Dataset.load_builtin('ml-100k')

In [3]:
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x000000000889D550>


In [4]:
# Print only the first five ratings: printing all 100,000 at once
# exceeds the notebook's IOPub data rate limit
print(data.raw_ratings[:5])



In [5]:
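# Each raw rating is a (user, item, rating, timestamp) tuple
df = pd.DataFrame(data.raw_ratings, columns=['user', 'item', 'rate', 'timestamp'])
df.head()

|   | user | item | rate | timestamp |
|---|------|------|------|-----------|
| 0 | 196  | 242  | 3.0  | 881250949 |
| 1 | 186  | 302  | 3.0  | 891717742 |
| 2 | 22   | 377  | 1.0  | 878887116 |
| 3 | 244  | 51   | 2.0  | 880606923 |
| 4 | 166  | 346  | 1.0  | 886397596 |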


### 3.3 Similarity Measure

In [6]:
option1 = {'name': 'msd'}      # Mean Squared Difference similarity
option2 = {'name': 'cosine'}   # Cosine similarity
option3 = {'name': 'pearson'}  # Pearson correlation coefficient
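The three measures can be sketched in plain Python as follows. This is an illustration of the formulas, assuming both vectors hold ratings from the same users in the same order; the function names and the sample values are hypothetical, and Surprise's own implementation differs in detail (for example, it only uses co-rated entries and handles edge cases such as too few common ratings).

```python
from math import sqrt

def msd_sim(x, y):
    """1 / (mean squared difference + 1), so identical vectors give 1.0."""
    msd = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 1.0 / (msd + 1.0)

def cosine_sim(x, y):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson_sim(x, y):
    """Cosine similarity of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    # A constant vector has zero variance; return 0.0 instead of dividing by zero
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

# Hypothetical ratings of two movies by the same three users
x = [3.5, 5.0, 3.0]
y = [1.5, 4.5, 2.5]

print(msd_sim(x, y), cosine_sim(x, y), pearson_sim(x, y))
```

MSD and cosine compare the raw ratings, while Pearson first subtracts each vector's mean, which makes it insensitive to one movie being rated systematically higher than the other.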

### 3.4 Get a recommendation list
Create a `KNNBasic` algorithm object, configured with the Pearson similarity (`option3`), to produce the recommendation list.


In [7]:
algo = surprise.KNNBasic(sim_options=option3)

### 3.5 Prepare a trainset

In [8]:
trainset = data.build_full_trainset()

### 3.6 Train the algorithm 
Train the algorithm on the trainset

In [9]:
algo.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0xa63c780>

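### 3.7 Recommend five movies for the 196th user

`get_neighbors` takes and returns Surprise inner ids. With the default `sim_options` (`user_based` is `True`), the values printed in this cell are the inner ids of the five nearest neighbors of inner id 196, which are users rather than movies.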

In [10]:
result = algo.get_neighbors(196, k=5)

for r1 in result :
    print(r1)

89
112
125
172
241
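To get neighbors that are movies and to read them as dataset ids, the inner ids have to be converted in both directions. Below is a minimal sketch of a helper, hypothetically named `similar_items`, that does this with `Trainset.to_inner_iid`, `get_neighbors`, and `Trainset.to_raw_iid`. It assumes the algorithm was fitted with `sim_options={'name': 'pearson', 'user_based': False}`, so that the similarity matrix is item-item; that setting is not used in the cells of this section.

```python
def similar_items(algo, trainset, raw_item_id, k=5):
    """Return the raw ids of the k items most similar to raw_item_id.

    Assumes `algo` was fitted on `trainset` with user_based=False,
    so that its neighbors are items rather than users.
    """
    inner_id = trainset.to_inner_iid(raw_item_id)       # raw id -> inner id
    neighbors = algo.get_neighbors(inner_id, k=k)       # inner ids of neighbors
    return [trainset.to_raw_iid(i) for i in neighbors]  # inner ids -> raw ids
```

With the built-in ml-100k data the raw ids are strings, so a call would look like `similar_items(algo, trainset, '242')`.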


## 4. Modeling (2)
The second model is built with data I entered myself.

### 4.1 Create a dataset
Create a dataset with the users' names and their ratings for the movies.


In [11]:
ratings_expand = {
        '마동석': {
            '택시운전사': 3.5,
            '남한산성': 1.5,
            '킹스맨:골든서클': 3.0,
            '범죄도시': 3.5,
            '아이 캔 스피크': 2.5,
            '꾼': 3.0,
        },
        '이정재': {
            '택시운전사': 5.0,
            '남한산성': 4.5,
            '킹스맨:골든서클': 0.5,
            '범죄도시': 1.5,
            '아이 캔 스피크': 4.5,
            '꾼': 5.0,
        },
        '윤계상': {
            '택시운전사': 3.0,
            '남한산성': 2.5,
            '킹스맨:골든서클': 1.5,
            '범죄도시': 3.0,
            '꾼': 3.0,
            '아이 캔 스피크': 3.5,
        },
        '설경구': {
            '택시운전사': 2.5,
            '남한산성': 3.0,
            '범죄도시': 4.5,
            '꾼': 4.0,
        },
        '최홍만': {
            '남한산성': 4.5,
            '킹스맨:골든서클': 3.0,
            '꾼': 4.5,
            '범죄도시': 3.0,
            '아이 캔 스피크': 2.5,
        },
        '홍수환': {
            '택시운전사': 3.0,
            '남한산성': 4.0,
            '킹스맨:골든서클': 1.0,
            '범죄도시': 3.0,
            '꾼': 3.5,
            '아이 캔 스피크': 2.0,
        },
        '나원탁': {
            '택시운전사': 3.0,
            '남한산성': 4.0,
            '꾼': 3.0,
            '범죄도시': 5.0,
            '아이 캔 스피크': 3.5,
        },
        '소이현': {
            '남한산성': 4.5,
            '아이 캔 스피크': 1.0,
            '범죄도시': 4.0
        }
}

### 4.2 Create a list and a set
The list collects the users' names and the set collects the movie titles.

In [12]:
name_list = []

In [13]:
movie_set = set()

### 4.3 Append the users' names and movies

In [14]:
# iterate as many times as the number of users
for user_key in ratings_expand :
    # print(user_key)
    name_list.append(user_key)
    # Append the movies that the current user has watched
    for movie_key in ratings_expand[user_key] :
        # print(user_key, ":", movie_key)
        movie_set.add(movie_key)
        
movie_list = list(movie_set)
print(name_list)
print(movie_list)

['마동석', '이정재', '윤계상', '설경구', '최홍만', '홍수환', '나원탁', '소이현']
['범죄도시', '아이 캔 스피크', '남한산성', '택시운전사', '꾼', '킹스맨:골든서클']
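A `set` has no defined order, and Python randomizes string hashing per process by default, so `movie_list` can come out in a different order on another run, which changes the item ids. One way to make the order reproducible is sketched below using `dict.fromkeys`, which drops duplicates while keeping first-seen order. `ratings_small` is a hypothetical two-user subset of `ratings_expand`, used only to keep the example short.

```python
# Hypothetical two-user subset of ratings_expand, for illustration only
ratings_small = {
    '마동석': {'택시운전사': 3.5, '남한산성': 1.5},
    '소이현': {'남한산성': 4.5, '범죄도시': 4.0},
}

# Iterating a dict yields its keys in insertion order
name_list = list(ratings_small)

# dict.fromkeys keeps the first occurrence of each title, in order
movie_list = list(dict.fromkeys(
    movie
    for user_ratings in ratings_small.values()
    for movie in user_ratings
))

print(name_list)
print(movie_list)
```

Sorting the set with `sorted(movie_set)` would be an equally reproducible alternative.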


### 4.4 Create a dataset to train

In [15]:
rating_dic = {
    'user_id' : [],
    'item_id' : [],
    'rating' :[]
}

### 4.5 Append the ratings

In [16]:
# iterate as many times as the number of users
for name_key in ratings_expand :
    # iterate as many times as the number of the movies user has watched
    for movie_key in ratings_expand[name_key] :
        # Extract the index no. of the user
        a1 = name_list.index(name_key)
        # Extract the index no. of the movie
        a2 = movie_list.index(movie_key)
        # Extract the ratings
        a3 = ratings_expand[name_key][movie_key]
        # Append it
        rating_dic['user_id'].append(a1)
        rating_dic['item_id'].append(a2)
        rating_dic['rating'].append(a3)

print(rating_dic['user_id'])
print(rating_dic['item_id'])
print(rating_dic['rating'])

[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7]
[3, 2, 5, 0, 1, 4, 3, 2, 5, 0, 1, 4, 3, 2, 5, 0, 4, 1, 3, 2, 0, 4, 2, 5, 4, 0, 1, 3, 2, 5, 0, 4, 1, 3, 2, 4, 0, 1, 2, 1, 0]
[3.5, 1.5, 3.0, 3.5, 2.5, 3.0, 5.0, 4.5, 0.5, 1.5, 4.5, 5.0, 3.0, 2.5, 1.5, 3.0, 3.0, 3.5, 2.5, 3.0, 4.5, 4.0, 4.5, 3.0, 4.5, 3.0, 2.5, 3.0, 4.0, 1.0, 3.0, 3.5, 2.0, 3.0, 4.0, 3.0, 5.0, 3.5, 4.5, 1.0, 4.0]
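`list.index` scans the list on every call. That is fine for eight users and six movies, but for larger data a dictionary lookup is the usual alternative. The sketch below builds the same kind of `rating_dic` with two lookup tables; `ratings_small`, `user_index`, and `movie_index` are hypothetical names introduced for this example.

```python
# Hypothetical two-user subset of ratings_expand, for illustration only
ratings_small = {
    '마동석': {'택시운전사': 3.5, '남한산성': 1.5},
    '소이현': {'남한산성': 4.5, '범죄도시': 4.0},
}
name_list = list(ratings_small)
movie_list = ['택시운전사', '남한산성', '범죄도시']

# Map each name and title to its position once, up front
user_index = {name: i for i, name in enumerate(name_list)}
movie_index = {title: i for i, title in enumerate(movie_list)}

rating_dic = {'user_id': [], 'item_id': [], 'rating': []}
for name, user_ratings in ratings_small.items():
    for title, rating in user_ratings.items():
        rating_dic['user_id'].append(user_index[name])
        rating_dic['item_id'].append(movie_index[title])
        rating_dic['rating'].append(rating)

print(rating_dic)
```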


In [17]:
df = pd.DataFrame(rating_dic)
df.head()

|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 0       | 3       | 3.5    |
| 1 | 0       | 2       | 1.5    |
| 2 | 0       | 5       | 3.0    |
| 3 | 0       | 0       | 3.5    |
| 4 | 0       | 1       | 2.5    |


In [18]:
# Create a Reader, which tells Surprise how to parse the data
# rating_scale: the range of the ratings
reader = surprise.Reader(rating_scale=(0.0, 5.0))

In [19]:
# Surprise expects the DataFrame columns in this order:
# first -> user, second -> item, third -> rating
col_list = ['user_id', 'item_id', 'rating']
data = surprise.Dataset.load_from_df(df[col_list], reader)

### 4.6 Train the algorithm

In [20]:
# Train
trainset = data.build_full_trainset()
option = {'name' : 'pearson'}
algo = surprise.KNNBasic(sim_options=option)
algo.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0xb9c8828>

## 5. Evaluation

In [21]:
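# Look up the three nearest neighbors of the user "소이현"
# Note: with the default sim_options (user_based=True), get_neighbors returns
# the inner ids of similar users, so the movie_list lookup in the loop does
# not rank the user's unrated movies by predicted rating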
index = name_list.index('소이현')
result = algo.get_neighbors(index, k=3)

for r1 in result :
    print(movie_list[r1 - 1])

꾼
택시운전사
킹스맨:골든서클
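An alternative way to produce recommendations is to estimate a rating for every movie the user has not rated and keep the highest estimates. The helper below is a sketch of that approach, not the method used in the cell above; the name `recommend` is hypothetical, and it relies only on `Trainset.to_inner_uid`, `Trainset.ur`, `Trainset.all_items`, `Trainset.to_raw_iid`, and `algo.predict(...).est` from Surprise.

```python
def recommend(algo, trainset, raw_user_id, n=3):
    """Return up to n (raw_item_id, estimated_rating) pairs for the
    items the user has not rated, highest estimate first."""
    inner_uid = trainset.to_inner_uid(raw_user_id)
    # trainset.ur maps an inner user id to a list of (inner item id, rating)
    rated = {inner_iid for inner_iid, _ in trainset.ur[inner_uid]}

    estimates = []
    for inner_iid in trainset.all_items():
        if inner_iid in rated:
            continue
        raw_iid = trainset.to_raw_iid(inner_iid)
        # predict() takes raw ids and returns a Prediction with an `est` field
        estimates.append((raw_iid, algo.predict(raw_user_id, raw_iid).est))

    return sorted(estimates, key=lambda pair: pair[1], reverse=True)[:n]
```

In this notebook the raw ids are the integer positions in `name_list` and `movie_list`, so a call could look like `recommend(algo, trainset, name_list.index('소이현'))`, with each returned item id mapped back to a title through `movie_list[item_id]`.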
