### Source Data 

#### MovieLens 25M Dataset
The MovieLens 25M dataset is a stable benchmark dataset of movie ratings. It contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users, and it includes tag genome data with 15 million relevance scores across 1,129 tags. It was released in December 2019.

[Movie Dataset link](https://grouplens.org/datasets/movielens/25m/)

In [14]:
import pandas as pd
import numpy as np
import re
import joblib

In [2]:
# import zipfile
# z = zipfile.ZipFile('ml-25m.zip')
# z.extractall()
# z.close()

### Reading Movie data

In [15]:
movie = pd.read_csv('ml-25m/movies.csv')  # forward slash avoids the invalid '\m' escape and works on every OS

In [4]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movie.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [16]:
def clean_title(title):
    # keep only ASCII letters and spaces; digits, punctuation and the "(year)" suffix are removed
    return re.sub(r'[^a-zA-Z ]', '', title)

In [17]:
movie['clean_title'] = movie.title.apply(clean_title)  # remove unnecessary characters so titles are easier to match in search

In [18]:
movie.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II
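
To make the effect of `clean_title` concrete, the sketch below applies the same pattern to a few titles. Note that digits are stripped as well, so sequels lose their number, and a trailing space is left behind. The variant `clean_title_keep_digits` is a hypothetical alternative, not part of this notebook, shown only to illustrate the trade-off.

```python
import re

def clean_title(title):
    # Same pattern as the notebook: keep only ASCII letters and spaces
    return re.sub(r'[^a-zA-Z ]', '', title)

def clean_title_keep_digits(title):
    # Hypothetical variant (an assumption): also keep digits and trim whitespace
    return re.sub(r'[^a-zA-Z0-9 ]', '', title).strip()

examples = ["Toy Story (1995)", "Toy Story 2 (1999)", "Bug's Life, A (1998)", "Amélie (2001)"]
cleaned = [clean_title(t) for t in examples]
kept = [clean_title_keep_digits(t) for t in examples]

for raw, a, b in zip(examples, cleaned, kept):
    print(repr(raw), '->', repr(a), '|', repr(b))
```

With the notebook's pattern, "Toy Story (1995)" and "Toy Story 2 (1999)" differ only in trailing whitespace after cleaning, which is why the sequels share a `clean_title` in the tables here.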


In [9]:
movie[['movieId','clean_title','genres']].to_csv('clean_movie.csv',index = False) # index = False -> the row index is not written as an extra column

In [19]:
movie.shape

(62423, 4)

### Data encoding

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(ngram_range=(1, 2))
vec_metric = vector.fit_transform(movie['clean_title'])  # encode movie titles as TF-IDF vectors for use by the search engine

In [67]:
joblib.dump(vector,'vectorizer.pkl')
joblib.dump(vec_metric,'vec_metric.pkl')

['vec_metric.pkl']
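
The two pickles can later be reloaded with `joblib.load`. The sketch below shows a round trip on a toy list of titles (a stand-in for `movie['clean_title']`) written to a temporary directory; the toy data and variable names are assumptions for illustration only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for movie['clean_title'] (assumption, not the real data)
titles = ["Toy Story", "Jumanji", "Grumpier Old Men"]
vec = TfidfVectorizer(ngram_range=(1, 2))
mat = vec.fit_transform(titles)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, 'vectorizer.pkl')
    mat_path = os.path.join(tmp, 'vec_metric.pkl')
    joblib.dump(vec, vec_path)
    joblib.dump(mat, mat_path)
    # Reload both objects, as a deployed app would do at start-up
    loaded_vec = joblib.load(vec_path)
    loaded_mat = joblib.load(mat_path)

# The reloaded vectorizer maps a new query into the same feature space
query = loaded_vec.transform(["toy story"])
print(query.shape, loaded_mat.shape)
```

The vectorizer and the matrix need to be saved together, since a query is only comparable to the matrix when it is transformed with the same fitted vocabulary.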

In [21]:
type(vec_metric)

scipy.sparse._csr.csr_matrix

In [22]:
vec_metric

<62423x117744 sparse matrix of type '<class 'numpy.float64'>'
	with 320037 stored elements in Compressed Sparse Row format>
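
The large number of columns comes from `ngram_range=(1, 2)`: every single word and every pair of adjacent words becomes a feature. A minimal sketch on two toy titles (an assumption, not the real data) makes this visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy titles standing in for movie['clean_title']
titles = ["Toy Story", "Toy Soldiers"]

demo_vector = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
demo_matrix = demo_vector.fit_transform(titles)

# Vocabulary holds lower-cased unigrams and bigrams
features = sorted(demo_vector.vocabulary_)
print(features)           # ['soldiers', 'story', 'toy', 'toy soldiers', 'toy story']
print(demo_matrix.shape)  # (2, 5)
```

Each row is stored sparsely, which is why the full matrix has only about 320 thousand stored elements despite having more than 117 thousand columns.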

### Search Engine

In [33]:
from sklearn.metrics.pairwise import cosine_similarity
def search(title):
    title = clean_title(title)
    query_vec= vector.transform([title])
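    similarity = cosine_similarity(query_vec,vec_metric).flatten() # flatten gives a 1D array of scores
    indices = np.argpartition(similarity,-5)[-5:] # the 5 best matches, in no guaranteed order
    indices = indices[np.argsort(similarity[indices])[::-1]] # order them by similarity
    result  = movie.iloc[indices] # most similar listed first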
    return result
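
A note on `np.argpartition`: it only guarantees that the k largest values end up in the last k positions, not that those positions are ordered. The sketch below, on a made-up similarity array, shows one way to get the top k ordered from best to worst.

```python
import numpy as np

# Made-up similarity scores for seven items
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.50, 0.20])
k = 3

# Indices of the k largest scores, in arbitrary order
top_k = np.argpartition(similarity, -k)[-k:]

# Sort just those k indices by score, best first
ordered = top_k[np.argsort(similarity[top_k])[::-1]]
print(ordered)  # [1 3 5]
```

Partitioning first and sorting only k values avoids a full sort of all 62,423 scores on every keystroke.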

### Display search

In [34]:
import ipywidgets as widgets
from IPython.display import display


print('Type at least three letters')
movie_input = widgets.Text(
    value = '',
    description = 'Movie Title : ',
    disabled = False
)

movie_list = widgets.Output()

def on_typ(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title)>2: # search only once at least three characters are typed
            display(search(title))


movie_input.observe(on_typ,names ='value')
display(movie_input,movie_list)


Type at least three letters


Text(value='', description='Movie Title : ')

Output()

### Reading ratings file of movies rated by all users

In [3]:
ratings = pd.read_csv('ml-25m/ratings.csv')

In [29]:
ratings.tail()

Unnamed: 0,userId,movieId,rating,timestamp
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434
25000094,162541,63876,5.0,1240952515


In [6]:
ratings.shape

(25000095, 4)

In [13]:
ratings[ratings['rating']>=4][['userId','movieId','rating']].to_csv('ratings.csv',index = False) # keep only ratings of 4 or higher, drop timestamp; saved for deployment

In [30]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [32]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [33]:
movieId = 1
# users who rated the given movieId 4 or higher
# unique() drops duplicates and keeps order of appearance; the result looks sorted because ratings.csv is ordered by userId
similar_user = ratings[(ratings['movieId']==movieId) & (ratings['rating']>=4)]['userId'].unique()
similar_user

array([     3,      5,      8, ..., 162530, 162533, 162534], dtype=int64)

In [34]:
len(similar_user)

37709

In [35]:
# movieIds rated 4 or higher by the users in similar_user
similar_user_rec = ratings[(ratings['rating'] >=4 ) & (ratings['userId'].isin(similar_user))]['movieId']
similar_user_rec   # similar_user_rec -> similar_user_recommendation

254              1
255             29
256             32
257             50
258            111
             ...  
24999332    166643
24999342    171763
24999348    177593
24999351    177765
24999378    198609
Name: movieId, Length: 5101989, dtype: int64

In [36]:
similar_user_rec = similar_user_rec.value_counts()/len(similar_user) # fraction of similar users who liked each movieId
# keep movies liked by more than 10% of similar users
similar_user_rec = similar_user_rec[similar_user_rec >0.1]
similar_user_rec

movieId
1       1.000000
318     0.549604
260     0.531518
356     0.517224
296     0.495744
          ...   
235     0.101249
1242    0.100931
1907    0.100772
3527    0.100613
2761    0.100135
Name: count, Length: 273, dtype: float64

This gives 273 movies liked by similar users. Not all of them are close matches to the searched movie, though: these users also rated other movies highly for their own reasons, and those movies may have little in common with the searched one.

In [37]:
# how much do all users like these 273 movies?

all_user = ratings[(ratings['movieId'].isin(similar_user_rec.index)) &(ratings['rating']>=4)] # ratings of 4 or higher for these movies, from any user
all_user['movieId'].value_counts() # number of users who liked each movie (not displayed mid-cell)

# fraction of users who liked each movie; the denominator counts users who liked at least one of these movies
all_user_rec = all_user['movieId'].value_counts()/len(all_user['userId'].unique())
all_user_rec # all_user_rec -> all_users_recommendation

movieId
318     0.440215
296     0.389659
356     0.367553
593     0.361897
2571    0.347994
          ...   
3175    0.049325
2081    0.047128
1282    0.044712
2761    0.039855
1907    0.039805
Name: count, Length: 273, dtype: float64

These values show the fraction of users who liked each movie, among all users who rated at least one of the 273 movies 4 or higher.
For example, about 44% of those users rated movieId = 318 4 or higher.

In [38]:
rec_percent = pd.concat([similar_user_rec,all_user_rec],axis = 1)
rec_percent.columns = ['similar','all']
rec_percent  # rec_percent -> recommendation_percent

movieId,similar,all
1,1.000000,0.235415
318,0.549604,0.440215
260,0.531518,0.325251
356,0.517224,0.367553
296,0.495744,0.389659
...,...,...
235,0.101249,0.055281
1242,0.100931,0.050805
1907,0.100772,0.039805
3527,0.100613,0.056879


In [39]:
# score of a movie = fraction among similar users / fraction among all users
rec_percent['score'] = rec_percent['similar']/rec_percent['all']
rec_percent.sort_values(ascending=False,by ='score',inplace =True)
rec_percent

movieId,similar,all,score
1,1.000000,0.235415,4.247819
3114,0.328914,0.102241,3.217054
78499,0.161924,0.057710,2.805840
2355,0.191095,0.068978,2.770367
2081,0.120714,0.047128,2.561408
...,...,...,...
99114,0.112732,0.091209,1.235967
2959,0.351826,0.292519,1.202745
6016,0.118380,0.099007,1.195678
109487,0.117426,0.102603,1.144469
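
The score can be read as a lift-style ratio: how much more often similar users like a movie than users in general do. A value well above 1 means the movie is specifically popular with people who liked the searched movie, not just popular overall. The sketch below recomputes the score for three rows, with the `similar` and `all` values copied from the tables above.

```python
import pandas as pd

# Three rows copied from the rec_percent output (rounded to 6 decimals)
rec = pd.DataFrame(
    {'similar': [1.000000, 0.328914, 0.549604],
     'all':     [0.235415, 0.102241, 0.440215]},
    index=pd.Index([1, 3114, 318], name='movieId'),
)

# Ratio of the two fractions
rec['score'] = rec['similar'] / rec['all']
rec = rec.sort_values('score', ascending=False)
print(rec)
```

movieId 318 is liked by more than half of the similar users, yet it scores lowest of the three because it is also liked by a large share of everyone else.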


In [40]:
rec_percent.head(10).merge(movie,left_index=True,right_on='movieId')
# join the top 10 rows of rec_percent (recommendation percent) with the movie dataset:
# rec_percent is the left table, keyed by its movieId index; movie is the right table, keyed by its movieId column.

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.235415,4.247819,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story
3021,0.328914,0.102241,3.217054,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story
14813,0.161924,0.05771,2.80584,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story
2264,0.191095,0.068978,2.770367,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A
1992,0.120714,0.047128,2.561408,2081,"Little Mermaid, The (1989)",Animation|Children|Comedy|Musical|Romance,Little Mermaid The
1818,0.100772,0.039805,2.531636,1907,Mulan (1998),Adventure|Animation|Children|Comedy|Drama|Musi...,Mulan
2669,0.100135,0.039855,2.512494,2761,"Iron Giant, The (1999)",Adventure|Animation|Children|Drama|Sci-Fi,Iron Giant The
1005,0.12806,0.054719,2.340299,1028,Mary Poppins (1964),Children|Comedy|Fantasy|Musical,Mary Poppins
1047,0.231801,0.099113,2.338762,1073,Willy Wonka & the Chocolate Factory (1971),Children|Comedy|Fantasy|Musical,Willy Wonka the Chocolate Factory
1249,0.103636,0.044712,2.317855,1282,Fantasia (1940),Animation|Children|Fantasy|Musical,Fantasia
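
The steps above can be collected into a single function. The sketch below is one possible way to do that, not the notebook's own code: the name `recommend`, its parameters, and the toy ratings are assumptions made for illustration.

```python
import pandas as pd

def recommend(ratings, movie_id, min_rating=4, min_share=0.1, top_n=10):
    """Hypothetical helper wrapping the steps above."""
    liked = ratings[ratings['rating'] >= min_rating]
    # Users who liked the searched movie
    similar_users = liked.loc[liked['movieId'] == movie_id, 'userId'].unique()
    if len(similar_users) == 0:
        return pd.DataFrame(columns=['similar', 'all', 'score'])
    # Fraction of similar users who liked each movie
    similar = (liked.loc[liked['userId'].isin(similar_users), 'movieId']
               .value_counts() / len(similar_users))
    similar = similar[similar > min_share]
    # Fraction of users overall who liked each of those movies
    everyone = liked[liked['movieId'].isin(similar.index)]
    overall = everyone['movieId'].value_counts() / everyone['userId'].nunique()
    rec = pd.concat([similar, overall], axis=1)
    rec.columns = ['similar', 'all']
    rec['score'] = rec['similar'] / rec['all']
    return rec.sort_values('score', ascending=False).head(top_n)

# Toy ratings: users 1 and 2 like movies 1, 2, 3; everyone likes movie 3
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'movieId': [1, 2, 3, 1, 2, 3, 3, 4, 3, 1],
    'rating':  [5, 5, 4, 4, 4.5, 5, 5, 4, 4, 2],
})
toy_rec = recommend(toy, movie_id=1)
print(toy_rec)
```

In the toy data, movie 3 is liked by every user, so its score is 1.0 and it ranks below movie 2, which only the similar users liked. The result can be merged with `movie` on `movieId` in the same way as above to show titles.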
