# RECOMMENDATION SYSTEM BUILT STEP BY STEP WITHOUT A PREBUILT ALGORITHM


In this project I build a movie recommendation system without relying on ready-made algorithms, so that all the work that goes into such an algorithm can be appreciated.

As you will see, the functions are redefined several times on the way to a final version that recommends movies to a user based on the K closest users, in this case the closest 10.

For educational purposes this project is useful because I explain the entire process, but for professional purposes I highly recommend using an existing algorithm implementation, since it will greatly simplify the code.

In [79]:
import pandas as pd
import numpy as np
from math import sqrt

movies = pd.read_csv("/content/movies (1).csv")
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [2]:
# Renaming the columns and using movieId as the index
movies.columns = ['movieId', 'Title', 'Category']
movies = movies.set_index('movieId')
movies.head()

movieId,Title,Category
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [3]:
# Descriptive analysis
movies.describe()

,Title,Category
count,9742,9742
unique,9737,951
top,Emma (1996),Drama
freq,2,1053


In [4]:
# Rating data
rate = pd.read_csv("/content/ratings (1).csv")
rate.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
# Renaming the columns of the rating DataFrame
rate.columns = ['userId', 'movieId', 'rate','momentum']
rate.head()

,userId,movieId,rate,momentum
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
# Descriptive analysis
rate.describe()

,userId,movieId,rate,momentum
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [7]:
# Number of ratings per movie
rate.value_counts("movieId")

movieId
356       329
318       317
296       307
593       279
2571      278
         ... 
4093        1
4089        1
58351       1
4083        1
193609      1
Length: 9724, dtype: int64

In [8]:
# Let's find out the total number of ratings per movie
total_rates = rate.value_counts("movieId")
movies['total_rates'] = total_rates
movies.sort_values('total_rates', ascending = False).head(5)

movieId,Title,Category,total_rates
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0
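
The `total_rates` values above appear as floats (`329.0`) because assigning a Series to a DataFrame column aligns on the index: `movies` has 9742 rows but only 9724 movies have ratings, so the unrated ones receive `NaN`. A minimal sketch of this behavior, using hypothetical toy data (`toy_movies` and `toy_ratings` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy data (hypothetical, not taken from the dataset)
toy_movies = pd.DataFrame(
    {"Title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)
toy_ratings = pd.DataFrame({"movieId": [1, 1, 3], "rate": [4.0, 5.0, 3.0]})

# value_counts returns a Series indexed by movieId
counts = toy_ratings["movieId"].value_counts()

# Assignment aligns on the index: movie 2 has no ratings, so it gets NaN,
# and the whole column is therefore stored as float
toy_movies["total_rates"] = counts
print(toy_movies)
```

If integer counts are preferred, one option is to fill the missing values with 0 and cast the column to `int`.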


In [9]:
# Average score per movieId
rate.groupby('movieId').mean()

movieId,userId,rate,momentum
1,306.530233,3.920930,1.129835e+09
2,329.554545,3.431818,1.135805e+09
3,283.596154,3.259615,1.005110e+09
4,219.857143,2.357143,8.985789e+08
5,299.571429,3.071429,9.926643e+08
...,...,...,...
193581,184.000000,4.000000,1.537109e+09
193583,184.000000,3.500000,1.537110e+09
193585,184.000000,3.500000,1.537110e+09
193587,184.000000,3.500000,1.537110e+09


In [10]:
# Average rating per movie, keeping only the rate column
avg_rate = rate.groupby('movieId').mean()['rate']
avg_rate

movieId
1         3.920930
2         3.431818
3         3.259615
4         2.357143
5         3.071429
            ...   
193581    4.000000
193583    3.500000
193585    3.500000
193587    3.500000
193609    4.000000
Name: rate, Length: 9724, dtype: float64

In [11]:
# Let's add this average rating per movie to the DataFrame
movies['avg_rate'] = avg_rate
movies.sort_values('total_rates', ascending = False).head(5)

movieId,Title,Category,total_rates,avg_rate
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0,4.16129
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446


In [12]:
movies.sort_values('avg_rate', ascending = False).head(10)

movieId,Title,Category,total_rates,avg_rate
88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,1.0,5.0
100556,"Act of Killing, The (2012)",Documentary,1.0,5.0
143031,Jump In! (2007),Comedy|Drama|Romance,1.0,5.0
143511,Human (2015),Documentary,1.0,5.0
143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,1.0,5.0
6201,Lady Jane (1986),Drama|Romance,1.0,5.0
102217,Bill Hicks: Revelations (1993),Comedy,1.0,5.0
102084,Justice League: Doom (2012),Action|Animation|Fantasy,1.0,5.0
6192,Open Hearts (Elsker dig for evigt) (2002),Romance,1.0,5.0
145994,Formula of Love (1984),Comedy,1.0,5.0


In [13]:
# For a reliable average, a movie should have at least 50 ratings
movies_with_more_50_rates = movies.query('total_rates >=50').sort_values('avg_rate', ascending = False).head(5)
movies_with_more_50_rates

movieId,Title,Category,total_rates,avg_rate
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
1276,Cool Hand Luke (1967),Drama,57.0,4.27193
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,97.0,4.268041
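
Note that in the cell above `.head(5)` is applied before the assignment, so `movies_with_more_50_rates` holds only the five best-rated movies. A minimal sketch of the same idea that keeps the filtering and the truncation separate, on hypothetical toy data (the names `stats` and `top_rated` are assumptions made for illustration):

```python
import pandas as pd

# Toy ratings (hypothetical)
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "rate":    [4.0, 4.5, 3.5, 5.0, 2.0, 3.0],
})

# Count and mean per movie in a single pass
stats = toy_ratings.groupby("movieId")["rate"].agg(total_rates="count", avg_rate="mean")

def top_rated(stats, min_rates, n=5):
    """Keep movies with at least `min_rates` ratings, then rank by average."""
    popular = stats[stats["total_rates"] >= min_rates]
    return popular.sort_values("avg_rate", ascending=False).head(n)

print(top_rated(stats, min_rates=2))
```

With `min_rates=1` the movie with a single 5.0 rating would come first, which is the effect the minimum threshold is meant to avoid.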


# **Code test 1**

In [14]:
# movieIds of the movies the user has already seen
seen_movies = [1,19,21,10,2,27]
movies.loc[seen_movies]

movieId,Title,Category,total_rates,avg_rate
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
19,Ace Ventura: When Nature Calls (1995),Comedy,88.0,2.727273
21,Get Shorty (1995),Comedy|Crime|Thriller,89.0,3.494382
10,GoldenEye (1995),Action|Adventure|Thriller,132.0,3.496212
2,Jumanji (1995),Adventure|Children|Fantasy,110.0,3.431818
27,Now and Then (1995),Children|Drama,9.0,3.333333


In [15]:
# Movies of the Category: Crime|Drama
movies.query("Category=='Crime|Drama'")

movieId,Title,Category,total_rates,avg_rate
16,Casino (1995),Crime|Drama,82.0,3.926829
30,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,Crime|Drama,3.0,3.000000
36,Dead Man Walking (1995),Crime|Drama,67.0,3.835821
97,"Hate (Haine, La) (1995)",Crime|Drama,10.0,3.900000
117,"Young Poisoner's Handbook, The (1995)",Crime|Drama,1.0,3.000000
...,...,...,...,...
161582,Hell or High Water (2016),Crime|Drama,8.0,3.562500
168266,T2: Trainspotting (2017),Crime|Drama,4.0,3.750000
174727,Good Time (2017),Crime|Drama,1.0,3.000000
177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,8.0,4.750000


In [16]:
Crime_Drama = movies_with_more_50_rates.query("Category=='Crime|Drama'")
Crime_Drama.sort_values('avg_rate', ascending = False).head()

movieId,Title,Category,total_rates,avg_rate
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062


In [17]:
# Sorting by average rating
Crime_Drama.sort_values('avg_rate', ascending = False).head()

movieId,Title,Category,total_rates,avg_rate
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062


In [18]:
# Deleting the movies watched by the user
Crime_Drama.drop(seen_movies, errors = 'ignore').sort_values('avg_rate', ascending = False).head()

movieId,Title,Category,total_rates,avg_rate
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062


# **Code Test 2**

**Recommendation based on the Category of the last seen movie**

In [19]:
# Let's see what movies Alex has seen
Alex_seen_movies = [2,10, 296,2130, 1221, 4,25, 239,4234, 193585,318, 858]
movies.loc[Alex_seen_movies]

movieId,Title,Category,total_rates,avg_rate
2,Jumanji (1995),Adventure|Children|Fantasy,110.0,3.431818
10,GoldenEye (1995),Action|Adventure|Thriller,132.0,3.496212
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068
2130,Atlantic City (1980),Crime|Drama|Romance,8.0,3.6875
1221,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0,2.357143
25,Leaving Las Vegas (1995),Drama|Romance,76.0,3.625
239,"Goofy Movie, A (1995)",Animation|Children|Comedy|Romance,17.0,3.0
4234,"Tailor of Panama, The (2001)",Drama|Thriller,5.0,3.2
193585,Flint (2017),Drama,1.0,3.5


In [20]:
# Get the movies in the category of the last seen movie: Crime|Drama
movies.query("Category=='Crime|Drama'")

movieId,Title,Category,total_rates,avg_rate
16,Casino (1995),Crime|Drama,82.0,3.926829
30,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,Crime|Drama,3.0,3.000000
36,Dead Man Walking (1995),Crime|Drama,67.0,3.835821
97,"Hate (Haine, La) (1995)",Crime|Drama,10.0,3.900000
117,"Young Poisoner's Handbook, The (1995)",Crime|Drama,1.0,3.000000
...,...,...,...,...
161582,Hell or High Water (2016),Crime|Drama,8.0,3.562500
168266,T2: Trainspotting (2017),Crime|Drama,4.0,3.750000
174727,Good Time (2017),Crime|Drama,1.0,3.000000
177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,8.0,4.750000


In [21]:
# As always, let's take into account only the movies with at least 50 ratings
Crime_Drama=movies_with_more_50_rates.query("Category=='Crime|Drama'")
Crime_Drama.sort_values('avg_rate', ascending = False).head()

movieId,Title,Category,total_rates,avg_rate
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062


In [22]:
results =Crime_Drama.sort_values('avg_rate', ascending = False).head()

In [23]:
print(f"Considering the last category seen by Alex, we can recommend the following films:\n\n{results[['Title']]}\n")

Considering the last category seen by Alex, we can recommend the following films:

                                    Title
movieId                                  
318      Shawshank Redemption, The (1994)
858                 Godfather, The (1972)
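
Both recommended titles (318 and 858) are already in `Alex_seen_movies`, so in practice the seen movies should be dropped, as was done in Code test 1. The steps of this test can be sketched as a single function. This is only an illustration on hypothetical toy data; `recommend_by_last_category` and its parameters are names assumed here, not part of the notebook:

```python
import pandas as pd

# Toy catalogue (hypothetical), indexed by movieId like the notebook's `movies`
toy_movies = pd.DataFrame(
    {
        "Title": ["M1", "M2", "M3", "M4"],
        "Category": ["Crime|Drama", "Crime|Drama", "Comedy", "Crime|Drama"],
        "total_rates": [120, 80, 200, 3],
        "avg_rate": [4.4, 4.2, 3.9, 5.0],
    },
    index=pd.Index([10, 20, 30, 40], name="movieId"),
)

def recommend_by_last_category(movies, seen_movies, min_rates=50, n=5):
    """Recommend well-rated, unseen movies from the category of the last seen movie."""
    last_category = movies.loc[seen_movies[-1], "Category"]
    candidates = movies[
        (movies["Category"] == last_category) & (movies["total_rates"] >= min_rates)
    ]
    # Remove what the user has already watched
    candidates = candidates.drop(seen_movies, errors="ignore")
    return candidates.sort_values("avg_rate", ascending=False).head(n)

result = recommend_by_last_category(toy_movies, seen_movies=[30, 10])
print(result)
```

Here movie 40 is excluded for having too few ratings and movie 10 because it was already seen, so only movie 20 remains.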



# **Generalizing the Calculation of Distances**

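**Euclidean distance calculation by rating**

With a single movie, the distance between two users is simply the difference between their ratings:

* John - 5
* Christian - 4
* Sergio - 3.5

John - Christian = 5 - 4 = 1
Christian - Sergio = 4 - 3.5 = 0.5

With two movies, each user becomes a vector of ratings, for example:

John = [5, 5]
Sergio = [4, 4.5]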

In [24]:
John = np.array([5,5])
Sergio = np.array([4,4.5])

John - Sergio

array([1. , 0.5])

In [25]:
# Creating a Pythagoras function
from math import sqrt
def pythagoras(a,b):
   (delta_x,delta_y) = a - b
   return sqrt(delta_x*delta_x + delta_y*delta_y)
pythagoras(John,Sergio)

1.118033988749895

In [26]:
# NumPy computes the same distance with np.linalg.norm(a - b)
np.linalg.norm(John - Sergio)

1.118033988749895

In [27]:
# Redefining the Pythagorean function to distance
def distance(a,b):
  return np.linalg.norm(a-b)
distance(John,Sergio)

1.118033988749895
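
The value computed above is the Euclidean distance. As a worked check for the two-movie example (the symbols $a$, $b$ and $n$ are notation introduced here, not names from the notebook):

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

$$
d(\text{John}, \text{Sergio}) = \sqrt{(5 - 4)^2 + (5 - 4.5)^2} = \sqrt{1 + 0.25} = \sqrt{1.25} \approx 1.118
$$

This matches the result of both `pythagoras` and `np.linalg.norm`, and the same formula extends to any number of rated movies.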

In [28]:
rate.head()

,userId,movieId,rate,momentum
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [29]:
# Movies watched by the userId == 1
rate.query("userId==1")

,userId,movieId,rate,momentum
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
227,1,3744,4.0,964980694
228,1,3793,5.0,964981855
229,1,3809,4.0,964981220
230,1,4006,4.0,964982903


In [30]:
# Movies watched by the userId == 4
rate.query("userId == 4")

,userId,movieId,rate,momentum
300,4,21,3.0,986935199
301,4,32,2.0,945173447
302,4,45,3.0,986935047
303,4,47,2.0,945173425
304,4,52,3.0,964622786
...,...,...,...,...
511,4,4765,5.0,1007569445
512,4,4881,3.0,1007569445
513,4,4896,4.0,1007574532
514,4,4902,4.0,1007569465


In [31]:
user1= rate.query("userId==1")[['movieId','rate']].set_index('movieId')
user4 = rate.query("userId==4")[['movieId','rate']].set_index('movieId')

In [32]:
user1

movieId,rate
1,4.0
3,4.0
6,4.0
47,5.0
50,5.0
...,...
3744,4.0
3793,5.0
3809,4.0
4006,4.0


In [33]:
user4

movieId,rate
21,3.0
32,2.0
45,3.0
47,2.0
52,3.0
...,...
4765,5.0
4881,3.0
4896,4.0
4902,4.0


In [34]:
# Joining user1 with user4
differences = user1.join(user4, lsuffix='_left(1)', rsuffix='_right(4)').dropna()
differences

movieId,rate_left(1),rate_right(4)
47,5.0,2.0
235,4.0,2.0
260,5.0,5.0
296,3.0,1.0
441,4.0,1.0
457,5.0,5.0
553,5.0,2.0
593,4.0,5.0
608,5.0,5.0
648,3.0,3.0


In [35]:
differences['rate_left(1)']

movieId
47      5.0
235     4.0
260     5.0
296     3.0
441     4.0
457     5.0
553     5.0
593     4.0
608     5.0
648     3.0
919     5.0
1025    5.0
1060    4.0
1073    5.0
1080    5.0
1136    5.0
1196    5.0
1197    5.0
1198    5.0
1213    5.0
1219    2.0
1265    4.0
1282    5.0
1291    5.0
1500    4.0
1517    5.0
1580    3.0
1617    5.0
1732    5.0
1967    4.0
2078    5.0
2174    4.0
2395    5.0
2406    4.0
2571    5.0
2628    4.0
2692    5.0
2858    5.0
2959    5.0
2997    4.0
3033    5.0
3176    1.0
3386    5.0
3489    4.0
3809    4.0
Name: rate_left(1), dtype: float64

In [36]:
differences['rate_right(4)']

movieId
47      2.0
235     2.0
260     5.0
296     1.0
441     1.0
457     5.0
553     2.0
593     5.0
608     5.0
648     3.0
919     5.0
1025    4.0
1060    2.0
1073    4.0
1080    5.0
1136    5.0
1196    5.0
1197    5.0
1198    3.0
1213    4.0
1219    4.0
1265    4.0
1282    5.0
1291    4.0
1500    4.0
1517    4.0
1580    3.0
1617    2.0
1732    4.0
1967    5.0
2078    5.0
2174    5.0
2395    3.0
2406    4.0
2571    1.0
2628    1.0
2692    5.0
2858    5.0
2959    2.0
2997    4.0
3033    4.0
3176    4.0
3386    4.0
3489    1.0
3809    3.0
Name: rate_right(4), dtype: float64

In [37]:
# Distance between users
distance(differences['rate_left(1)'],differences['rate_right(4)'])

11.135528725660043
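
The distance above is computed only over the movies that both users rated, which is what the `join` followed by `dropna` achieves. A minimal sketch of that mechanism on hypothetical toy ratings (`user_a` and `user_b` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, indexed by movieId
user_a = pd.DataFrame({"rate": [5.0, 4.0, 3.0]}, index=pd.Index([1, 2, 3], name="movieId"))
user_b = pd.DataFrame({"rate": [2.0, 3.0, 5.0]}, index=pd.Index([2, 3, 4], name="movieId"))

# A left join keeps every movie of user_a; movies user_b did not rate become NaN
joined = user_a.join(user_b, lsuffix="_left", rsuffix="_right")

# dropna keeps only the movies both users rated (movieId 2 and 3)
common = joined.dropna()

# Euclidean distance over the common movies only
dist = np.linalg.norm(common["rate_left"] - common["rate_right"])
print(common)
print(dist)
```

Movie 1 (rated only by `user_a`) and movie 4 (rated only by `user_b`) do not contribute to the distance.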

In [38]:
# Let's create a function that returns the ratings of a user
def user_rates(userId):
  user_rates = rate.query("userId==%d" % userId)[['movieId','rate']]
  user_rates = user_rates.set_index('movieId')
  return user_rates

In [39]:
#Let's redefine the distance function
def difference_of_vectors(a,b):
  return np.linalg.norm(a-b)

In [40]:
def distance_between_users(user_id1, user_id2):
    rate1 = user_rates(user_id1)
    rate2 = user_rates(user_id2)
    differences = rate1.join(rate2, lsuffix='_left', rsuffix='_right').dropna()
    return difference_of_vectors(differences['rate_left'], differences['rate_right'])

In [41]:
distance_between_users(1, 4)

11.135528725660043

In [42]:
# Since we don't really need to show the joined data as before,
# I'm going to modify the previous function "distance_between_users"
# so that it returns both user ids together with the distance
def distance_between_users(user_id1,user_id2):
  rate1 = user_rates(user_id1)
  rate2 = user_rates(user_id2)
  # Join the two users' ratings (not the full `rate` table with itself)
  differences = rate1.join(rate2, lsuffix='_left', rsuffix='_right').dropna()
  distance = difference_of_vectors(differences['rate_left'],differences['rate_right'])
  return [user_id1, user_id2, distance]

In [43]:
# Distance between a ref user and the rest
users = rate['userId'].unique()
users_ref = 1

for user in users:
    information = distance_between_users(users_ref,user)
    print(information)

[1, 1, 0.0]
...

In [44]:
# Distance between a ref user and all the other users
def distance_of_all(users_ref):
  users = rate['userId'].unique()
  distance = []

  for user in users:
    information = distance_between_users(users_ref,user)
    distance.append(information)

  distance = pd.DataFrame(distance, columns = ['users_ref','user','distance'])
  return distance

In [45]:
# We can now redefine "distance_between_users": users with fewer than
# `minimo` movies in common get a very large distance (10000)
def distance_between_users(user_id1, user_id2, minimo=5):
    rate1 = user_rates(user_id1)
    rate2 = user_rates(user_id2)
    differences = rate1.join(rate2, lsuffix='_left', rsuffix='_right').dropna()
    if len(differences) < minimo:
        return [user_id1, user_id2, 10000]
    distance = difference_of_vectors(differences['rate_left'], differences['rate_right'])
    return [user_id1, user_id2, distance]

In [46]:
distance_of_all(1).sort_values('distance').head(10)

,users_ref,user,distance
0,1,1,0.0
76,1,77,0.0
510,1,511,0.5
365,1,366,0.707107
522,1,523,1.0
48,1,49,1.0
8,1,9,1.0
257,1,258,1.0
318,1,319,1.118034
397,1,398,1.224745


**Identifying the closest user**



In [47]:
# Calculation of nearest users
def closer_to_user_ref(users_ref):
  closest = distance_of_all(users_ref)
  closest=closest.sort_values("distance")
  closest = closest.set_index("user")
  closest = closest.drop(users_ref)
  return closest

In [48]:
closer_to_user_ref(1)

user,users_ref,distance
77,1,0.000000
511,1,0.500000
366,1,0.707107
523,1,1.000000
49,1,1.000000
...,...,...
190,1,10000.000000
60,1,10000.000000
576,1,10000.000000
545,1,10000.000000


In [49]:
# We redefine distance_of_all so it can optionally analyze only the first N users
def distance_of_all(users_ref, user_number_analyze=None):
  users = rate['userId'].unique()
  distance = []
  if user_number_analyze:
    users = users[:user_number_analyze]

  for user in users:
    information = distance_between_users(users_ref,user)
    distance.append(information)

  distance = pd.DataFrame(distance, columns = ['users_ref','user','distance'])
  return distance

In [50]:
def closer_to_user_ref(users_ref, user_number_analyze=None):
    closest = distance_of_all(users_ref, user_number_analyze)
    closest = closest.sort_values("distance")
    closest = closest.set_index("user")
    closest = closest.drop(users_ref)
    return closest

# Example analyzing only the first 100 users
closer_to_user_ref(1, 100)

user,users_ref,distance
77,1,0.000000
49,1,1.000000
9,1,1.000000
65,1,1.322876
90,1,1.414214
...,...,...
87,1,10000.000000
53,1,10000.000000
12,1,10000.000000
92,1,10000.000000


In [51]:
# Return None instead of a sentinel distance when the users share too few movies
def distance_between_users(user_id1, user_id2, minimo=5):
    rate1 = user_rates(user_id1)
    rate2 = user_rates(user_id2)
    differences = rate1.join(rate2, lsuffix='_left', rsuffix='_right').dropna()
    if len(differences) < minimo:
        return None
    distance = difference_of_vectors(differences['rate_left'], differences['rate_right'])
    return [user_id1, user_id2, distance]

In [52]:
def distance_of_all(user_ref, user_number_analyze=None):
    users = rate['userId'].unique()
    distance = []
    if user_number_analyze:
        users = users[:user_number_analyze]

    for user in users:
        information = distance_between_users(user_ref, user)
        if information:  # Check if the result is not None
            distance.append(information)
    distance = pd.DataFrame(distance, columns=['users_ref', 'user', 'distance'])
    return distance

In [53]:
# Analyzing the closest users
closer_to_user_ref(1, 100).head()

user,users_ref,distance
77,1,0.0
9,1,1.0
49,1,1.0
65,1,1.322876
90,1,1.414214
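
The minimum-overlap check matters because the norm of an empty vector is 0.0, so two users with no movies in common would otherwise look identical. A minimal sketch of the idea, assuming ratings are held as Series indexed by movieId (`distance_or_none` and `min_common` are hypothetical names that mirror `distance_between_users` and `minimo`):

```python
import numpy as np
import pandas as pd

def distance_or_none(ratings_a, ratings_b, min_common=5):
    """Euclidean distance over co-rated movies, or None if the overlap is too small."""
    common = ratings_a.index.intersection(ratings_b.index)
    if len(common) < min_common:
        return None
    return float(np.linalg.norm(ratings_a.loc[common] - ratings_b.loc[common]))

# Hypothetical users: a and b share one movie, a and c share three
a = pd.Series([5.0, 4.0, 3.0], index=[1, 2, 3])
b = pd.Series([5.0, 1.0], index=[1, 9])
c = pd.Series([4.0, 4.0, 5.0], index=[1, 2, 3])

print(float(np.linalg.norm(np.array([]))))   # norm of an empty vector
print(distance_or_none(a, b, min_common=2))  # too little overlap
print(distance_or_none(a, c, min_common=2))  # enough overlap
```

Returning `None` and skipping those users keeps them out of the ranking entirely, instead of relying on a large sentinel distance.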


In [71]:
# We keep only the most similar user
similar = closer_to_user_ref(1,100)
similar = similar.iloc[0].name
similar

77

# Recommending movies to the reference user taking into account the closest user

In [55]:
user_rates(similar)

movieId,rate
260,5.0
1196,5.0
1198,5.0
1210,5.0
2571,5.0
3578,5.0
3948,3.0
3996,5.0
4226,2.5
4878,1.0


In [72]:
users_ref = 1
# User reference rates
user_rates_ref = user_rates(users_ref)

# Getting the closest users
similares = closer_to_user_ref(users_ref, 100)
similar = similares.iloc[0].name

# Obtaining rates of the similar user
similar_rate = user_rates(similar)  # Call the function to get the DataFrame
similar_rate = similar_rate.drop(user_rates_ref.index, errors="ignore")  # Use drop on the DataFrame
similar_rate.sort_values("rate", ascending=False)

movieId,rate
8636,5.0
58559,5.0
33794,5.0
4993,5.0
5349,5.0
5378,5.0
8961,5.0
5816,5.0
5952,5.0
3996,5.0


In [73]:
# Generalizing the function

def suggestions(users_ref, user_number_analyze=None):
  user_rates_ref= user_rates(users_ref)

  similares = closer_to_user_ref(users_ref,user_number_analyze)
  similar=similares.iloc[0].name

  similar_rate=user_rates(similar)
  similar_rate = similar_rate.drop(user_rates_ref.index, errors = "ignore")
  recommendations = similar_rate.sort_values("rate", ascending = False)
  return recommendations

In [58]:
# Top 5 suggestion
suggestions(1,100).head(5)

movieId,rate
8636,5.0
58559,5.0
33794,5.0
4993,5.0
5349,5.0


In [74]:
# Let's modify the suggestions function to show more data
def suggestions(users_ref, user_number_analyze=None):
    user_rates_ref = user_rates(users_ref)

    similares = closer_to_user_ref(users_ref, user_number_analyze)
    similar = similares.iloc[0].name

    similar_rate = user_rates(similar)
    similar_rate = similar_rate.drop(user_rates_ref.index, errors="ignore")
    recommendations = similar_rate.sort_values("rate", ascending=False)
    return recommendations.join(movies).head(5)

In [60]:
# Top 5 suggestion
suggestions(1, 100).head(5)

movieId,rate,Title,Category,total_rates,avg_rate
8636,5.0,Spider-Man 2 (2004),Action|Adventure|Sci-Fi|IMAX,79.0,3.803797
58559,5.0,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,149.0,4.238255
33794,5.0,Batman Begins (2005),Action|Crime|IMAX,116.0,3.862069
4993,5.0,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061
5349,5.0,Spider-Man (2002),Action|Adventure|Sci-Fi|Thriller,122.0,3.540984


In [61]:
# Testing
suggestions(2,100).head(5)

movieId,rate,Title,Category,total_rates,avg_rate
1213,5.0,Goodfellas (1990),Crime|Drama,126.0,4.25
858,5.0,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
8957,5.0,Saw (2004),Horror|Mystery|Thriller,33.0,3.181818
1136,5.0,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy,136.0,4.161765
34405,5.0,Serenity (2005),Action|Adventure|Sci-Fi,50.0,3.94


**So far, the recommendation system relies on a single user: it suggests movies based only on the tastes of the user most similar to the reference user.**

In [76]:
# Now I have to redefine the functions so that the suggestions take
# into account more than one user, let's say 10
def closer_to_user_ref(users_ref,K_most_close=10, user_number_analyze=None):
  closest = distance_of_all(users_ref, user_number_analyze)
  closest=closest.sort_values("distance")
  closest = closest.set_index("user")
  closest = closest.drop(users_ref)
  return closest.head(K_most_close)

In [63]:
closer_to_user_ref(1,10,100)

user,users_ref,distance
77,1,0.0
9,1,1.0
49,1,1.0
65,1,1.322876
90,1,1.414214
25,1,1.414214
13,1,1.414214
30,1,1.802776
35,1,2.236068
26,1,2.236068


In [77]:
user_ref=1
user_number_analyze=100
K_most_close=10

# Getting user rates reference
user_rates_ref= user_rates(user_ref)

# Getting rates from similar users
similars = closer_to_user_ref(user_ref,K_most_close, user_number_analyze)
similar_users = similars.index

# Rates from similar users
similar_rates=rate.set_index("userId").loc[similar_users]
similar_rates = similar_rates.groupby("movieId").mean()[["rate"]]

# Let's make recommendations
recommendations = similar_rates.drop(user_rates_ref.index, errors = "ignore")
recommendations = recommendations.sort_values("rate", ascending = False)
recommendations.join(movies).head()

movieId,rate,Title,Category,total_rates,avg_rate
187593,5.0,Deadpool 2 (2018),Action|Comedy|Sci-Fi,12.0,3.875
261,5.0,Little Women (1994),Drama,42.0,3.880952
95510,5.0,"Amazing Spider-Man, The (2012)",Action|Adventure|Sci-Fi|IMAX,30.0,3.25
93510,5.0,21 Jump Street (2012),Action|Comedy|Crime,26.0,3.865385
339,5.0,While You Were Sleeping (1995),Comedy|Romance,98.0,3.469388
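
The averaging step above can be sketched in isolation as follows. This is a simplified illustration on hypothetical toy data, where the neighbours are given directly instead of being computed by distance (`average_neighbour_ratings` is a name assumed here):

```python
import pandas as pd

# Toy ratings table (hypothetical), same columns as the notebook's `rate`
toy_rate = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 20, 30, 40],
    "rate":    [5.0, 4.0, 4.0, 5.0, 3.0, 3.0, 4.5],
})

def average_neighbour_ratings(rate, neighbours, seen_movies):
    """Mean rating per movie among `neighbours`, excluding movies already seen."""
    from_neighbours = rate[rate["userId"].isin(neighbours)]
    means = from_neighbours.groupby("movieId")["rate"].mean()
    means = means.drop(seen_movies, errors="ignore")
    return means.sort_values(ascending=False)

# User 1 has seen movies 10 and 20; users 2 and 3 are the assumed neighbours
result = average_neighbour_ratings(toy_rate, neighbours=[2, 3], seen_movies=[10, 20])
print(result)
```

Note that a movie rated by a single neighbour (movie 40 here) keeps that one rating as its mean, so it can outrank movies that several neighbours agree on.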


# SUGGESTION FUNCTION

In [78]:
# Renaming closer_to_user_ref to knn (K nearest neighbours)
def knn(users_ref,K_most_close=10, user_number_analyze=None):
  closest = distance_of_all(users_ref, user_number_analyze)
  closest=closest.sort_values("distance")
  closest = closest.set_index("user")
  closest = closest.drop(users_ref)
  return closest.head(K_most_close)

In [66]:
# Redefining the suggestions function to take several nearby users into account
def suggestions(users_ref,K_most_close=10, user_number_analyze=None):
  # Getting the ratings of the reference user
  user_rates_ref= user_rates(users_ref)

  # Getting rates from similar users
  similars = knn(users_ref,K_most_close, user_number_analyze)
  similar_users = similars.index

  # Rates from similar users
  similar_rates=rate.set_index("userId").loc[similar_users]
  similar_rates = similar_rates.groupby("movieId").mean()[["rate"]]

  # Making recommendations
  recommendations = similar_rates.drop(user_rates_ref.index, errors = "ignore")
  recommendations = recommendations.sort_values("rate", ascending = False)
  return recommendations.join(movies)

In [67]:
suggestions(1,10,100)

movieId,rate,Title,Category,total_rates,avg_rate
187593,5.0,Deadpool 2 (2018),Action|Comedy|Sci-Fi,12.0,3.875000
261,5.0,Little Women (1994),Drama,42.0,3.880952
95510,5.0,"Amazing Spider-Man, The (2012)",Action|Adventure|Sci-Fi|IMAX,30.0,3.250000
93510,5.0,21 Jump Street (2012),Action|Comedy|Crime,26.0,3.865385
339,5.0,While You Were Sleeping (1995),Comedy|Romance,98.0,3.469388
...,...,...,...,...,...
5507,1.0,xXx (2002),Action|Crime|Thriller,24.0,2.770833
305,1.0,Ready to Wear (Pret-A-Porter) (1994),Comedy,9.0,2.833333
4878,1.0,Donnie Darko (2001),Drama|Mystery|Sci-Fi|Thriller,109.0,3.981651
4558,1.0,Twins (1988),Comedy,14.0,2.464286


# Recommending to a NEW USER

In [68]:
# Development of a function to create a new user
def new_user(new_rates):
    # The new userId is one more than the largest userId already in `rate`
    user_id = rate['userId'].max() + 1
    new_user_rates = pd.DataFrame(new_rates, columns=["movieId", "rate"])
    new_user_rates["userId"] = user_id
    return new_user_rates

rates = new_user([[122904,2],[1246,5],[2529,2],[2329,5],[2324,5],[1,2],[7,0.5],[2,2],[1196,1],[260,111]])
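
The distance functions read the global `rate` table, so the new user's ratings only influence the suggestions once they are appended to it. A minimal sketch of that step on a toy table (`add_new_user` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's `rate` DataFrame (hypothetical data)
toy_rate = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 10],
    "rate":    [4.0, 5.0, 3.0],
})

def add_new_user(rate, new_rates):
    """Return (new_user_id, ratings table with the new user's rows appended)."""
    user_id = int(rate["userId"].max()) + 1
    new_rows = pd.DataFrame(new_rates, columns=["movieId", "rate"])
    new_rows["userId"] = user_id
    combined = pd.concat([rate, new_rows], ignore_index=True)
    return user_id, combined

new_id, toy_rate = add_new_user(toy_rate, [[10, 2.0], [30, 5.0]])
print(new_id)
print(toy_rate)
```

In the notebook, the analogous step would presumably be to reassign `rate` to the combined table and then call `suggestions` with the new id; columns missing from the new rows, such as `momentum`, would be filled with `NaN` by `pd.concat`.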


In [69]:
num_users = rate['userId'].nunique()
print("Total number of users in the dataset:", num_users)

Total number of users in the dataset: 610


In [70]:
# Generation of the recommendation for userId 610, the last user in the dataset
suggestions(610).head(5)

movieId,rate,Title,Category,total_rates,avg_rate
1225,5.0,Amadeus (1984),Drama,76.0,4.184211
7025,5.0,"Midnight Clear, A (1992)",Drama|War,2.0,3.75
805,5.0,"Time to Kill, A (1996)",Drama|Thriller,35.0,3.657143
837,5.0,Matilda (1996),Children|Comedy|Fantasy,33.0,3.272727
838,5.0,Emma (1996),Comedy|Drama|Romance,30.0,3.916667


**As we can see, we now have a working recommendation system for existing users, and the same functions can serve new users once their ratings are added to the data.**