# Movie Recommendation Engine
##### Description: This project demonstrates recommendation models applied to movie selection. It mainly uses collaborative filtering, which looks for users with similar rating profiles and recommends the movies those users liked.

##### Data: The dataset was extracted from MovieLens, a database of movie ratings given by users. The small version used here contains approximately 100,000 ratings.

###### Author: Lucas Fernandes

### 0 - Installing the packages that will be used

In [1]:
## In this case, no new packages are needed

### 1 - Importing the libraries

In [2]:
#The libraries that we will work with
import pandas as pd
import numpy as np

import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt

from datetime import date

import warnings
warnings.simplefilter("ignore")

### 2 - Loading data from MovieLens

##### Loading Movies IDs

In [3]:
#File location:

#smaller dataset
filelocation = r'C:\Users\55119\Documents\Lucas - Minhas Pastas\Alura\Algoritimo de Recomentação\ml-latest-small\ml-latest-small\movies.csv'

#file location for the bigger dataset
#filelocation = r'C:\Users\55119\Documents\Lucas - Minhas Pastas\Alura\Algoritimo de Recomentação\ml-latest\ml-latest\movies.csv'
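
The absolute Windows paths above only resolve on the author's machine. As a minimal sketch, assuming the MovieLens files are unpacked into a folder next to the notebook (the `DATA_DIR` name and its value are hypothetical, not part of the original notebook), the file locations could be built with `pathlib` so that the same cell works on any operating system:

```python
from pathlib import Path

# Hypothetical folder holding the unpacked MovieLens files; adjust as needed.
DATA_DIR = Path("ml-latest-small")

# Path objects are joined with "/" regardless of the operating system.
movies_file = DATA_DIR / "movies.csv"
ratings_file = DATA_DIR / "ratings.csv"

print(movies_file.name, ratings_file.name)
```

`pd.read_csv` accepts `Path` objects directly, so either variable could be passed to it in place of `filelocation`.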

In [4]:
#loading the dataset and inspecting its initial properties:
data = pd.read_csv(filelocation)
print('______________Data set head is:___________')
print(data.head(2))
print('______________The number of col, and lines in the data set is:',data.shape)
print('______________the numeber of null values in the data is:',data.isna().sum().sum())
print('______________Data Type in the data set__________')
print(data.dtypes)


#storing the table under a descriptive name
movies_ids = data

______________Data set head is:___________
   movieId             title                                       genres
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2    Jumanji (1995)                   Adventure|Children|Fantasy
______________The number of col, and lines in the data set is: (9742, 3)
______________the numeber of null values in the data is: 0
______________Data Type in the data set__________
movieId     int64
title      object
genres     object
dtype: object


In [5]:
movies_ids.set_index('movieId',inplace=True)

##### Loading User Ratings

In [6]:
#File location:

#smaller dataset
filelocation = r'C:\Users\55119\Documents\Lucas - Minhas Pastas\Alura\Algoritimo de Recomentação\ml-latest-small\ml-latest-small\ratings.csv'

#bigger dataset
#filelocation = r'C:\Users\55119\Documents\Lucas - Minhas Pastas\Alura\Algoritimo de Recomentação\ml-latest\ml-latest\ratings.csv'

In [7]:
#loading the dataset and inspecting its initial properties:
data = pd.read_csv(filelocation)
print('______________Data set head is:___________')
print(data.head(2))
print('______________The number of col, and lines in the data set is:',data.shape)
print('______________the numeber of null values in the data is:',data.isna().sum().sum())
print('______________Data Type in the data set__________')
print(data.dtypes)


#storing the table under a descriptive name
ratings = data

______________Data set head is:___________
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
______________The number of col, and lines in the data set is: (100836, 4)
______________the numeber of null values in the data is: 0
______________Data Type in the data set__________
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object


In [8]:
#ratings.userId.unique()

### 3 - Using the k-nearest neighbors algorithm to define distance between users

#### 3.1 - Extracting the ratings of each user as a DataFrame

In [9]:
#defining a routine to fetch the ratings given by a user
def user_rating ( user ):
    
    #selecting, with query, every rating the user has given in the dataset
    rate_by_user = ratings.query('userId==%d' % user)
    
    #dropping the columns we don't need; drop returns a new DataFrame,
    #so the slice taken from ratings is never modified in place
    rate_by_user = rate_by_user.drop(columns = ['userId', 'timestamp'])
    
    #setting movieId as the new index
    rate_by_user = rate_by_user.set_index('movieId')
    
    return rate_by_user

#### 3.2 - Defining the distance between vectors

In [10]:
def distance_vectors (vector1, vector2):
    #using linalg.norm to calculate the Euclidean distance between the 2 vectors,
    #i.e. the square root of the sum of the squared differences
    distance = np.linalg.norm(vector1 - vector2)
    return distance
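
To make the distance concrete, here is a small worked example. The two rating vectors are made up for illustration; the sketch only checks that `np.linalg.norm` of the difference matches the Euclidean formula written out by hand.

```python
import numpy as np

# Hypothetical ratings that two users gave to the same three movies.
ratings_a = np.array([5.0, 3.0, 4.0])
ratings_b = np.array([4.0, 3.0, 2.0])

# Euclidean distance written out: square root of the summed squared differences.
manual = np.sqrt(np.sum((ratings_a - ratings_b) ** 2))

# The same value computed the way distance_vectors does it.
with_norm = np.linalg.norm(ratings_a - ratings_b)

print(manual, with_norm)  # both are sqrt(1 + 0 + 4) = sqrt(5)
```

Because the differences are summed, the distance tends to grow with the number of movies two users have in common, which is worth keeping in mind when comparing distances across pairs.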

#### 3.3 - Defining the distance between users

In [11]:
def distance_users(userid1,userid2,min_length = 5):
    
    #using the first routine to get the ratings of each user
    user_rates1=user_rating(userid1)
    user_rates2=user_rating(userid2)
    
    #joining the ratings in one DataFrame; join aligns the rows on the movieId index,
    #and dropna keeps only the movies rated by both users
    rates_vectors = user_rates1.join(user_rates2,lsuffix="_left",rsuffix = '_right').dropna()
    
    #with fewer than min_length movies in common, no distance is reported
    if (len(rates_vectors) < min_length):
        return [userid1, userid2, None]
    
    #using distance_vectors to calculate the distance between the users
    distance_between_users = distance_vectors(rates_vectors.rating_left, rates_vectors.rating_right)
    return [userid1, userid2, distance_between_users]
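
The join-then-dropna step is what restricts the comparison to movies both users rated. A minimal sketch with two made-up users (the ratings are hypothetical, shaped like the output of `user_rating`):

```python
import pandas as pd

# Hypothetical ratings, indexed by movieId as user_rating returns them.
user_a = pd.DataFrame({"rating": [5.0, 3.0, 4.0]},
                      index=pd.Index([1, 2, 3], name="movieId"))
user_b = pd.DataFrame({"rating": [4.0, 2.0]},
                      index=pd.Index([2, 3], name="movieId"))

# join is a left join on the index: movie 1 gets NaN on the right side,
# and dropna removes that row, leaving only the movies both users rated.
common = user_a.join(user_b, lsuffix="_left", rsuffix="_right").dropna()

print(common)
```

Movies rated only by the second user never enter the left join, so they are excluded as well.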
    
    

#### 3.4 - Defining the distance between all users in the dataset


In [12]:
#Creating a function to calculate the distance between one user and the other users in the dataset:

#defining the function
def distance_between_all_users(compared_user,numberofuserscompared = None):
    #creating the list that will store the distances
    distances=[]
    
    #looping over the users (only the first numberofuserscompared, if given)
    #and comparing each of them with the user we are looking at
    for other_user in ratings.userId.unique()[:numberofuserscompared]:
        result = distance_users(compared_user,other_user)
        distances.append(result)
        
    distances = pd.DataFrame(distances, columns = ['UserId_original','UserId_compared','Distance'])
    return distances
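
The loop above filters the full ratings table twice for every pair of users, which can become slow on the bigger dataset. The following is a sketch of a possible alternative, not the notebook's method: it pivots the ratings into a user-by-movie matrix and computes all distances from one user in a single pass. The function name `distances_from`, the toy data, and the `min_length=2` default are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table with the same columns the notebook uses.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 40],
    "rating":  [5.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0, 5.0],
})

def distances_from(ratings_df, user, min_length=2):
    # One row per user, one column per movie, NaN where there is no rating.
    matrix = ratings_df.pivot(index="userId", columns="movieId", values="rating")
    # Subtracting one user's row from every row; NaN wherever either rating is missing.
    diff = matrix - matrix.loc[user]
    # Number of movies each user has in common with the target user.
    common = diff.notna().sum(axis=1)
    # sum skips NaN by default, so only the common movies contribute.
    dist = np.sqrt((diff ** 2).sum(axis=1))
    # Same rule as distance_users: too few common movies means no distance.
    dist[common < min_length] = np.nan
    return dist.drop(user)

d = distances_from(toy, 1)
print(d)
```

Here user 2 shares three movies with user 1 and gets a distance of sqrt(5), while user 3 shares only one movie and is left as NaN. Note that `pivot` requires each (userId, movieId) pair to appear only once.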


In [13]:
#distances = distance_between_all_users(1,20).dropna()
#distances.head()


#### 3.5 - Creating a function to find the nearest users to a given user

In [14]:
def Nearests_from_user(user,numberofuserscompared = None, top_nearests = 10):
    #calculating the distance from the user to the other users;
    #pairs without enough movies in common have no distance and are dropped
    distances = distance_between_all_users(user,numberofuserscompared).dropna()
    
    #sorting from the nearest to the farthest:
    distances = distances.sort_values('Distance')
    
    #dropping the user's own row, we don't need to compare the user with itself;
    #errors='ignore' avoids a KeyError when that row is absent
    #(e.g. when numberofuserscompared leaves the user out)
    distances = distances.set_index('UserId_compared').drop(user, errors = 'ignore')
    
    return distances.head(top_nearests)
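
As a small illustration of the selection step, the sketch below uses a made-up distance table (the values are hypothetical) and picks the nearest users with `nsmallest`, which combines the sort and the `head` call into one operation.

```python
import numpy as np
import pandas as pd

# Hypothetical distances from user 1 to four other users; NaN means
# the pair did not have enough movies in common.
toy_distances = pd.DataFrame({
    "UserId_original": [1, 1, 1, 1],
    "UserId_compared": [2, 3, 4, 5],
    "Distance": [1.5, np.nan, 0.5, 3.0],
})

# Dropping pairs without a distance, then keeping the two smallest distances.
nearest = toy_distances.dropna().nsmallest(2, "Distance")

print(nearest)
```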

In [15]:
#example:
#a = Nearests_from_user(610)
#print(a)

#### 3.6 - Suggesting movies based on the closest users (k-nearest)

In [16]:
##defining a function that finds the nearest user to a given one and suggests the movies that user rated highest.
#def suggested_movies(user_to_suggest,numberofuserscompared = None,top_nearests = 10):

#    user_test = user_to_suggest
#
#    #getting the user's ratings:
#    user_ratting = user_rating(user_test)
#
#    Nearest = Nearests_from_user(user_test,numberofuserscompared,top_nearests)
#    #reporting the nearest user (row 0 after sorting)
#    print(f'The nearest user to user {user_test} is {Nearest.iloc[0].name}, Distance = {Nearest.iloc[0].Distance}')
#
#    #getting the ratings of the most similar user:
#    rating_similar = user_rating(Nearest.iloc[0].name)
#
#    #removing the movies that the original user has already seen:
#    rating_similar = rating_similar.drop(user_ratting.index, errors = 'ignore')
#
#    #sorting from the highest rating to the lowest
#    rating_similar = rating_similar.sort_values('rating' ,ascending= False )
#
#    #adding the movie names:
#    rating_similar = rating_similar.join(movies_ids)
#    rating_similar.dropna(inplace = True)
#    return(rating_similar)

In [17]:
#a = suggested_movies(100)

In [18]:
#a.head(15)

In [19]:
##defining a function that finds the nearest users to a given one and suggests the movies those users rated highest on average.
def suggested_moviesv02(user_to_suggest,numberofuserscompared = None ,top_nearests = 10):
    user_test = user_to_suggest

    #finding the nearest users:
    Nearest = Nearests_from_user(user_test,numberofuserscompared,top_nearests)

    #printing the nearest users and their distances:
    print(f'The users_id most near from the user {user_test} is {Nearest.index}, Distance = {Nearest.Distance}')

    Near_usersid_from_user_tested = Nearest.index

    #getting every rating given by the nearest users
    ratings_by_nearest_users = ratings.set_index('userId').loc[Near_usersid_from_user_tested]

    #grouping by movieId and taking the mean rating of each movie:
    mean_ratings_by_nearest_users = ratings_by_nearest_users.groupby('movieId').mean()[['rating']]

    #counting how many of the nearest users rated each movie,
    #so that movies seen by too few of them can be filtered out:
    count_ratings_by_nearest_users = ratings_by_nearest_users.groupby('movieId').count()[['rating']]
    
    #adding the counts next to the means
    mean_ratings_by_nearest_users = mean_ratings_by_nearest_users.join(count_ratings_by_nearest_users, lsuffix = '_means', rsuffix = '_counts')
    
    #keeping only the movies rated by at least half of the nearest users
    minimum_filter = top_nearests/2
    mean_ratings_by_nearest_users = mean_ratings_by_nearest_users.query('rating_counts >= %.2f' % minimum_filter)
    
    #sorting from the highest mean rating to the lowest
    mean_ratings_by_nearest_users = mean_ratings_by_nearest_users.sort_values('rating_means', ascending = False)

    #the five best movies to suggest, with titles and genres
    #(note: movies the user has already rated are not removed here)
    suggested_movies = mean_ratings_by_nearest_users.join(movies_ids).head(5)

    return (suggested_movies)
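
The mean, count, and minimum-votes filter can be seen on a tiny example. This is a sketch on made-up data, and it uses pandas named aggregation to build both columns in one `groupby` call instead of the two calls plus a join used above; the table and `top_nearests = 4` are hypothetical.

```python
import pandas as pd

# Hypothetical ratings given by four "nearest" users.
neighbour_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 2, 1],
    "movieId": [10, 10, 10, 10, 20, 20, 30],
    "rating":  [5.0, 4.0, 5.0, 4.0, 5.0, 5.0, 5.0],
})
top_nearests = 4

# Mean rating and number of ratings per movie, in one pass.
stats = (neighbour_ratings.groupby("movieId")["rating"]
         .agg(rating_means="mean", rating_counts="count"))

# Keeping movies rated by at least half of the nearest users.
kept = stats[stats["rating_counts"] >= top_nearests / 2]
kept = kept.sort_values("rating_means", ascending=False)

print(kept)
```

Movie 30 has a perfect mean but only one vote, so the filter removes it; movies 20 and 10 remain, in that order.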


In [20]:
suggested_moviesv02(152)

The users_id most near from the user 152 is Int64Index([286, 285, 557, 242, 587, 142, 5, 227, 446, 565], dtype='int64', name='UserId_compared'), Distance = UserId_compared
286    1.118034
285    1.224745
557    1.224745
242    1.322876
587    1.322876
142    1.414214
5      1.414214
227    1.414214
446    1.414214
565    1.414214
Name: Distance, dtype: float64


movieId  rating_means  rating_counts  title                                      genres
296      4.722222      9              Pulp Fiction (1994)                        Comedy|Crime|Drama|Thriller
593      4.7           5              Silence of the Lambs, The (1991)           Crime|Horror|Thriller
32       4.7           5              Twelve Monkeys (a.k.a. 12 Monkeys) (1995)  Mystery|Sci-Fi|Thriller
50       4.6           5              Usual Suspects, The (1995)                 Crime|Mystery|Thriller
527      4.571429      7              Schindler's List (1993)                    Drama|War


### 4 - Testing new Users

In [21]:
#Defining a new user:
def new_user(data):
    
    #first: taking the highest existing userId and adding 1
    new_user_Id = ratings['userId'].max()+1
    #printing the new ID:
    print(f"The new user ID is: {new_user_Id} ")
    
    #creating the DataFrame with the new user's ratings:
    new_user_ratting = pd.DataFrame(data,columns = ['movieId','rating'])
    
    #creating the userId column in this new DataFrame:
    new_user_ratting['userId'] =   new_user_Id
    
    #returning the existing ratings plus the new rows (the new rows have no timestamp)
    return pd.concat([ratings, new_user_ratting])
    

In [22]:
#defining a new user to test:
lucasoliveiras = ([
[55247,5], #Into the Wild - rating: 5
[356,5], #Forrest Gump - rating: 5
[4886,5], #Monsters, Inc. - rating: 5
[4896,5], #Harry Potter and the Sorcerer's Stone - rating: 5
[7153,5], #The Lord of the Rings: The Return of the King - rating: 5
[1,5], #Toy Story - rating: 5
])

In [23]:
#calling the function to add the new user
ratings = new_user(lucasoliveiras)

The new user ID is: 611 


In [24]:
#checking the ratings of the new user, with titles and genres
user_rating(611).join(movies_ids)

movieId  rating  title                                              genres
55247    5.0     Into the Wild (2007)                               Action|Adventure|Drama
356      5.0     Forrest Gump (1994)                                Comedy|Drama|Romance|War
4886     5.0     Monsters, Inc. (2001)                              Adventure|Animation|Children|Comedy|Fantasy
4896     5.0     Harry Potter and the Sorcerer's Stone (a.k.a. ...  Adventure|Children|Fantasy
7153     5.0     Lord of the Rings: The Return of the King, The...  Action|Adventure|Drama|Fantasy
1        5.0     Toy Story (1995)                                   Adventure|Animation|Children|Comedy|Fantasy


In [25]:
suggested_moviesv02(611)

The users_id most near from the user 611 is Int64Index([382, 169, 220, 330, 239, 483, 7, 534, 263, 288], dtype='int64', name='UserId_compared'), Distance = UserId_compared
382    0.866025
169    0.866025
220    1.118034
330    1.500000
239    1.500000
483    1.581139
7      1.802776
534    1.870829
263    1.870829
288    1.936492
Name: Distance, dtype: float64


movieId  rating_means  rating_counts  title                                      genres
527      5.0           5              Schindler's List (1993)                    Drama|War
318      4.9375        8              Shawshank Redemption, The (1994)           Crime|Drama
32       4.9           5              Twelve Monkeys (a.k.a. 12 Monkeys) (1995)  Mystery|Sci-Fi|Thriller
260      4.75          8              Star Wars: Episode IV - A New Hope (1977)  Action|Adventure|Sci-Fi
356      4.75          10             Forrest Gump (1994)                        Comedy|Drama|Romance|War
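
Forrest Gump (movieId 356) appears in the suggestions for user 611 even though that user has already rated it, because `suggested_moviesv02` does not remove movies the user has seen. A minimal sketch of one way to filter them out, reusing the `drop(..., errors='ignore')` idea from the earlier commented-out `suggested_movies`; the two small tables are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Hypothetical recommendation table, indexed by movieId.
suggestions = pd.DataFrame(
    {"rating_means": [5.0, 4.75, 4.75]},
    index=pd.Index([527, 260, 356], name="movieId"),
)

# Hypothetical ratings of the target user, also indexed by movieId.
already_rated = pd.DataFrame(
    {"rating": [5.0, 5.0]},
    index=pd.Index([356, 1], name="movieId"),
)

# errors="ignore" skips rated movies that are not among the suggestions.
unseen = suggestions.drop(index=already_rated.index, errors="ignore")

print(unseen)
```

In the real function this filter would need to run before `head(5)`, so that removed movies are replaced by the next best candidates.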


In [26]:
print(user_rating(288).sort_values('rating',ascending = False).head(30))

         rating
movieId        
1198        5.0
1204        5.0
924         5.0
2529        5.0
1193        5.0
1136        5.0
235         5.0
1097        5.0
2804        5.0
527         5.0
1080        5.0
260         5.0
1079        5.0
3037        5.0
296         5.0
3100        5.0
318         5.0
3108        5.0
3198        5.0
1023        5.0
3363        5.0
356         5.0
3421        5.0
5060        5.0
590         5.0
593         5.0
1291        5.0
1732        5.0
32          5.0
57669       5.0


In [27]:
Thamiris_filmes=([
[318,5], #The Shawshank Redemption
[4993,5], #A Walk to Remember
[5066,5], #The Lord of the Rings
[134853,5], #Inside Out
[4886,4], #noted as Texas Chainsaw Massacre, but movieId 4886 is Monsters, Inc. in the movies table
[53125,3], #Pirates of the Caribbean
])

In [28]:
ratings = new_user(Thamiris_filmes)

The new user ID is: 612 


In [29]:
print(f'Recomendation for the user {ratings.userId.max()}')
suggested_moviesv02(ratings.userId.max())

Recomendation for the user 612
The users_id most near from the user 612 is Int64Index([249, 63, 414, 18, 339, 105, 380, 103, 610, 177], dtype='int64', name='UserId_compared'), Distance = UserId_compared
249    0.866025
63     1.224745
414    1.500000
18     1.658312
339    1.870829
105    2.121320
380    2.236068
103    2.397916
610    2.549510
177    3.535534
Name: Distance, dtype: float64


movieId  rating_means  rating_counts  title                                              genres
1136     4.833333      6              Monty Python and the Holy Grail (1975)             Adventure|Comedy|Fantasy
1201     4.75          8              Good, the Bad and the Ugly, The (Buono, il bru...  Action|Adventure|Western
1080     4.75          6              Monty Python's Life of Brian (1979)                Comedy
293      4.714286      7              Léon: The Professional (a.k.a. The Professiona...  Action|Crime|Drama|Thriller
74458    4.714286      7              Shutter Island (2010)                              Drama|Mystery|Thriller
