`1. Recommendation system`
`Implementing a recommendation system is critical for businesses and digital platforms that want to thrive in today's competitive environment. These systems use data-driven personalization to tailor content, products, and services to individual user preferences. The latter improves user engagement, satisfaction, retention, and revenue through increased sales and cross-selling opportunities. In this section, you will attempt to implement a recommendation system by identifying similar users' preferences and recommending movies they watch to the study user.`

`To be more specific, you will implement your version of the LSH algorithm, which will take as input the user's preferred genre of movies, find the most similar users to this user, and recommend the most watched movies by those who are more similar to the user.`

`Data: The data you will be working with can be found here.`

`Looking at the data, you can see that there is data available for each user for the movies the user clicked on. Gather the title and genre of the maximum top 10 movies that each user clicked on regarding the number of clicks.`

`1.2 Minhash Signatures`
`Using the movie genre and user_ids, try to implement your min-hash signatures so that users with similar interests in a genre appear in the same bucket.`

`Important note: You must write your minhash function from scratch. You are not permitted to use any already implemented hash functions. Read the class materials and, if necessary, conduct an internet search. The description of hash functions in the book may be helpful as a reference.`

`1.3 Locality-Sensitive Hashing (LSH)`
`Now that your buckets are ready, it's time to ask a few queries. We will provide you with some user_ids and ask you to recommend at most five movies to the user to watch based on the movies clicked by similar users.`

`To recommend at most five movies given a user_id, use the following procedure:`

`Identify the two most similar users to this user.`
`If these two users have any movies in common, recommend those movies based on the total number of clicks by these users.`
`If there are no more common movies, try to propose the most clicked movies by the most similar user first, followed by the other user.`
`Note: At the end of the process, we expect to see at most five movies recommended to the user.`

`Example: assume you've identified user A and B as the most similar users to a single user, and we have the following records on these users:`

`User A with 80% similarity`

`User B with 50% similarity`

`user	movie title	#clicks`

`A	Wild Child	20`

`A	Innocence	10`

`A	Coin Heist	2`

`B	Innocence	30`

`B	Coin Heist	15`

`B	Before I Fall	30`

`B	Beyond Skyline	8`

`B	The Amazing Spider-Man	5`

`Recommended movies in order:`

`Innocence`

`Coin Heist`

`Wild Child`

`Before I Fall`

`Beyond Skyline`


Gather the title and genre of the maximum top 10 movies that each user clicked on regarding the number of clicks:

In [1]:
import pandas as pd

df = pd.read_csv('netflix.csv')

def bestmovies(user):
    #The dictionary for all users
    users_dict = df.groupby('user_id').apply(lambda x: dict(x['movie_id'].value_counts())).to_dict()
    #The dictionary for the specified user
    topmovies = users_dict.get(user, {})
    # Sorting the dictionary by the value in descending order
    sorted_topmovies = dict(sorted(topmovies.items(), key=lambda item: item[1], reverse=True))
    # Return only the first 10 key-value pairs
    return dict(list(sorted_topmovies.items())[:10])

def moviegenres(user):
    # Creating a dictionary with movie_id as key and genre as value
    movie_genre_dict = pd.Series(df.genres.values,index=df.movie_id).to_dict()

    top_10_movies = bestmovies(user)

    # Create a dictionary with the top 10 movies as keys and their genres as values
    moviegenre = {movie: movie_genre_dict[movie] for movie in top_10_movies.keys()}
    return moviegenre

`1.2 Minhash Signatures`
`Using the movie genre and user_ids, try to implement your min-hash signatures so that users with similar interests in a genre appear in the same bucket.`

Making list of genres:

In [2]:
df_copy = df.copy()

# Splitting the genres and expanding them into separate rows
s = df_copy['genres'].str.split(',').apply(pd.Series, 1).stack()

# Removing whitespaces, dropping duplicates and sorting
s.index = s.index.droplevel(-1)
s.name = 'genres'
del df_copy['genres']
df_copy = df_copy.join(s)
genres_list = df_copy['genres'].str.strip().drop_duplicates().sort_values().tolist()

print(genres_list)


['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'NOT AVAILABLE', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western']


Making whichgenres column:

In [3]:
# Creating a new column with a list of 27 zeros
df['whichgenres'] = [[0]*27 for _ in range(len(df))]

# Updating the new column based on the presence of genres
for i, genre in enumerate(genres_list):
    df.loc[df['genres'].str.contains(genre), 'whichgenres'] = df.loc[df['genres'].str.contains(genre), 'whichgenres'].apply(lambda x: x[:i] + [1] + x[i+1:])

# Saving the updated dataframe to a new csv file
df.to_csv('Netflix_updated.csv', index=False)

Hash functions:

In [4]:
def h1(k):
    return (5*k + 3)%97
def h2(k):
    return(10*k + 3)%97
def h3(k):
    return(15*k + 3)%97
def h4(k):
    return(20*k + 3)%97
def h5(k):
    return(25*k + 3)%97
def h6(k):
    return(30*k + 3)%97
def h7(k):
    return(35*k + 3)%97
def h8(k):
    return(40*k + 3)%97
def h9(k):
    return(45*k + 3)%97
def h10(k):
    return(50*k + 3)%97
def h11(k):
    return(55*k + 3)%97
def h12(k):
    return(60*k + 3)%97

Making additional column with results of hash functions:

In [5]:
df = pd.read_csv('Netflix_updated.csv')
# Apply the hash functions to each row number
df['hash_results'] = df.index.to_series().apply(lambda x: [h1(x), h2(x), h3(x), h4(x), h5(x), h6(x), h7(x), h8(x), h9(x), h10(x), h11(x), h12(x)])

# Overwrite the original CSV file with the updated DataFrame
df.to_csv('Netflix_updated.csv', index=False)

In [6]:
# Define your minh function
def minh(elements):
    return min(h6(x) for x in elements)

# Apply the minh function to each subset of 4 elements in the hash_results column
df['minh_results'] = df['hash_results'].apply(lambda x: [minh(x[i:i+4]) for i in range(0, len(x), 4)])

# Overwrite the original CSV file with the updated DataFrame
df.to_csv('Netflix_updated.csv', index=False)
print(df.head())

   Unnamed: 0             datetime  duration  \
0       58773  2017-01-01 01:15:09       0.0   
1       58774  2017-01-01 13:56:02       0.0   
2       58775  2017-01-01 15:17:47   10530.0   
3       58776  2017-01-01 16:04:13      49.0   
4       58777  2017-01-01 19:16:37       0.0   

                                title  \
0  Angus, Thongs and Perfect Snogging   
1        The Curse of Sleeping Beauty   
2                   London Has Fallen   
3                            Vendetta   
4     The SpongeBob SquarePants Movie   

                                              genres release_date    movie_id  \
0                             Comedy, Drama, Romance   2008-07-25  26bd5987e8   
1                 Fantasy, Horror, Mystery, Thriller   2016-06-02  f26ed2675e   
2                                   Action, Thriller   2016-03-04  f77e500e7a   
3                                      Action, Drama   2015-06-12  c74aec7673   
4  Animation, Action, Adventure, Comedy, Family, ...   2004

A function that takes a user_id as input and returns user ids of the 2 users that are most similar to the user from the input:

In [18]:


#the function that extracts minh_results of the user
def users_minh(user_id):
    user_data = df[df['user_id'] == user_id]
    get_minh_results = user_data['minh_results'].tolist()
    return get_minh_results

def points(list1, list2):
    # Initialize a counter
    count = 0
    # Iterate over each 'minh' in the first list
    for minh1 in list1:
        for minh2 in list2:
            for i in range(3):
                if minh1[i] == minh2[i]:
                    # If the i-th elements are the same, increment the counter
                    count += 1

    return count

def similar2(user_id):
    # Load the data from the CSV file
    df = pd.read_csv('Netflix_updated.csv')
    
    # Get the 'minh_results' for the given user_id
    user_minh_results = users_minh(user_id)
    
    # Initialize an empty list to store the concerned movie_ids
    concerned_movie_ids = []
    
    score=dict()
    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        score[row['user_id']]=points(user_minh_results, users_minh(row['user_id']))
    sorted_dict = {k: v for k, v in sorted(score.items(), key=lambda item: item[1], reverse=True)}
    
    return (sorted_dict.keys[0], sorted_dict.keys[1])


In [19]:
user_id = "b15926c011"
print(similar2(user_id))

KeyboardInterrupt: 

In [9]:
df

Unnamed: 0.1,Unnamed: 0,datetime,duration,title,genres,release_date,movie_id,user_id,whichgenres,hash_results,minh_results
0,58773,2017-01-01 01:15:09,0.0,"Angus, Thongs and Perfect Snogging","Comedy, Drama, Romance",2008-07-25,26bd5987e8,1dea19f6fe,"[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]","[93, 93, 93]"
1,58774,2017-01-01 13:56:02,0.0,The Curse of Sleeping Beauty,"Fantasy, Horror, Mystery, Thriller",2016-06-02,f26ed2675e,544dcbc510,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ...","[8, 13, 18, 23, 28, 33, 38, 43, 48, 53, 58, 63]","[5, 23, 41]"
2,58775,2017-01-01 15:17:47,10530.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,7cbcc791bf,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[13, 23, 33, 43, 53, 63, 73, 83, 93, 6, 16, 26]","[5, 41, 7]"
3,58776,2017-01-01 16:04:13,49.0,Vendetta,"Action, Drama",2015-06-12,c74aec7673,ebf43c36b6,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[18, 33, 48, 63, 78, 93, 11, 26, 41, 56, 71, 86]","[23, 7, 34]"
4,58777,2017-01-01 19:16:37,0.0,The SpongeBob SquarePants Movie,"Animation, Action, Adventure, Comedy, Family, ...",2004-11-19,a80d6fc2aa,a57c992287,"[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...","[23, 43, 63, 83, 6, 26, 46, 66, 86, 9, 29, 49]","[14, 7, 0]"
...,...,...,...,...,...,...,...,...,...,...,...
671731,730504,2019-06-30 21:37:08,851.0,Oprah Presents When They See Us Now,Talk-Show,2019-06-12,43cd23f30f,57501964fd,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[33, 63, 93, 26, 56, 86, 19, 49, 79, 12, 42, 72]","[7, 18, 2]"
671732,730505,2019-06-30 21:49:34,91157.0,HALO Legends,"Animation, Action, Adventure, Family, Sci-Fi",2010-02-16,febf42d55f,d4fcb079ba,"[1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...","[38, 73, 11, 46, 81, 19, 54, 89, 27, 62, 0, 35]","[25, 8, 3]"
671733,730506,2019-06-30 22:00:44,0.0,Pacific Rim,"Action, Adventure, Sci-Fi",2013-07-12,7b15e5ada1,4a14a2cd5a,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[43, 83, 26, 66, 9, 49, 89, 32, 72, 15, 55, 95]","[7, 18, 4]"
671734,730507,2019-06-30 22:04:23,0.0,ReMastered: The Two Killings of Sam Cooke,"Documentary, Music",2019-02-08,52d49c515a,0b8163ea4b,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...","[48, 93, 41, 86, 34, 79, 27, 72, 20, 65, 13, 58]","[61, 29, 5]"
