# Vectorizing Users & Movies

The purpose of this notebook is to vectorize users based on their tags and their choice of movies.
We work on categorical features such as tags, titles and movie genres to construct numeric vectors. We also transform dates into ages, and then ages into youth_rates, to help identify which era best fits each user. These vectors and youth_rates let us compare users and movies, and thus identify shared topics and styles so that we can make better recommendations to our most valuable customers.

##### Importing libraries

In [2]:
# Import essential libraries:
import numpy as np
import pandas as pd
import re


##### Loading Tag DataFrame

In [3]:
# Load the dataset from 'tag.csv'
df_tag = pd.read_csv('input_data\\tag.csv', delimiter=',')

# Display the first 2 rows
df_tag.head(2)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18


##### Quick check of the number of users and the number of tags
Only 7,801 users have tagged movies, which is a very low number. These users can be considered opinion leaders, since they put more effort into judging movies.

In [4]:
# Calculate the number of unique users in the 'df_tag' DataFrame by counting distinct 'userId' values.
df_tag['userId'].nunique()


7801

In [5]:
# Retrieve the dimensions of the 'df_tag' DataFrame
df_tag.shape


(465564, 4)

##### Dropping a few NaN

In [6]:
# Find columns with NaN values
# Count NaN values for each column
nan_counts = df_tag.isna().sum()

# Filter and print only the columns with NaN values and their counts
nan_columns_counts = nan_counts[nan_counts > 0]
nan_columns_counts

tag    16
dtype: int64

In [7]:
# Remove rows with any missing values from the 'df_tag' DataFrame.
df_tag = df_tag.dropna()


In [8]:
# Import the Natural Language Toolkit (nltk) and download the 'names' dataset.
import nltk
nltk.download('names')


[nltk_data] Downloading package names to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

##### Replacing names in tags with "actor" or "actress"
The goal is to vectorize the tags, but the model does not know personal names and cannot vectorize them, so it is better to drop this information and replace it with something useful. This lets us feed the model information such as "is the main character male or female?"


In [9]:
from nltk.corpus import names

# Load male and female names from the NLTK names dataset.
male_names = set(names.words('male.txt'))
female_names = set(names.words('female.txt'))

# Function to replace a tag with 'actor' or 'actress' when it contains a first name.
# The first word found in the male list gives "actor"; in the female list, "actress".
# The whole tag is replaced, and matching is case-sensitive.
# If no name is found, the original tag is kept.
def replace_name(tag):
    words = tag.split()
    for word in words:
        if word in male_names:
            return "actor"
        elif word in female_names:
            return "actress"
    return tag

# Apply the 'replace_name' function to the 'tag' column in the 'df_tag' DataFrame.
# This replaces names with 'actor' or 'actress' as determined by the function.
df_tag['tag'] = df_tag['tag'].apply(replace_name)

# Display the first 2 rows of the updated DataFrame.
df_tag.head(2)


Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,actor,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
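
To make the effect of `replace_name` concrete, here is a small sketch that mirrors its rule on a few sample tags. The toy name sets are hypothetical stand-ins for the NLTK `names` corpus, so the results only illustrate the logic, not what the real corpus would return.

```python
# Hypothetical toy name sets standing in for the NLTK 'names' corpus.
toy_male_names = {"Mark", "Tom"}
toy_female_names = {"Meryl", "Emma"}

def replace_name_whole_tag(tag, male, female):
    """Same rule as the notebook's function: the first name found decides the whole tag."""
    for word in tag.split():
        if word in male:
            return "actor"
        if word in female:
            return "actress"
    return tag

examples = ["Mark Waters", "dark hero", "Meryl Streep drama", "mark of zorro"]
results = {
    tag: replace_name_whole_tag(tag, toy_male_names, toy_female_names)
    for tag in examples
}

for tag, replaced in results.items():
    print(f"{tag!r} -> {replaced!r}")
```

Two properties are worth keeping in mind: the whole tag is replaced as soon as one word matches, so any other words in the tag are lost, and matching is case-sensitive, so a lowercase `mark` is left untouched.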


##### Using the timestamp to create a tag age column, which is much more useful later on

In [10]:
# Convert the 'timestamp' column to datetime format if it is not already in that format.
df_tag['timestamp'] = pd.to_datetime(df_tag['timestamp'])

# Calculate the age in years from the difference between the current date and the 'timestamp'.
# Convert the age from days to years and round to the nearest whole number.
current_date = pd.Timestamp.now()
df_tag['age'] = ((current_date - df_tag['timestamp']).dt.days / 365.25).round(0)

# Drop the 'timestamp' column from the DataFrame as it is no longer needed.
df_tag = df_tag.drop(columns=['timestamp'])

# Display the first 2 rows of the updated DataFrame to check the changes.
df_tag.head(2)


Unnamed: 0,userId,movieId,tag,age
0,18,4141,actor,15.0
1,65,208,dark hero,11.0
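
Because `pd.Timestamp.now()` changes with every run, the ages above depend on when the notebook is executed. A minimal sketch with a fixed reference date shows the same computation in a reproducible form. The date below is an arbitrary assumption, not one used in this notebook.

```python
import pandas as pd

# The two timestamps shown in the first rows of the tag DataFrame.
timestamps = pd.to_datetime(pd.Series(["2009-04-24 18:19:40", "2013-05-10 01:41:18"]))

# Hypothetical fixed reference date, used instead of pd.Timestamp.now().
reference_date = pd.Timestamp("2024-06-01")

# Whole days elapsed, converted to years and rounded to the nearest year.
ages = ((reference_date - timestamps).dt.days / 365.25).round(0)
print(ages.tolist())
```

Pinning the reference date keeps the `age` column stable between runs, which makes downstream results easier to compare.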


##### Loading Movie DataFrame

In [11]:
# Load the dataset from 'movie.csv'
df_movie = pd.read_csv('input_data\\movie.csv')

# Display the first 2 rows
df_movie.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


##### Splitting genres and extracting title and year for easier use later on

In [12]:
# Split the 'genres' column values by '|' and join them with a space to form a single string of genres.
df_movie['genres'] = df_movie['genres'].str.split('|').str.join(' ')

# Extract the movie title and year from the 'title' column.
# The regex pattern captures the title and the year (in parentheses) into separate columns.
df_movie[['title', 'year']] = df_movie['title'].str.extract(r'^(.*)\s\((\d{4})\)$')

# Display the first 2 rows of the updated DataFrame to check the changes.
df_movie.head(2)


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,1995
1,2,Jumanji,Adventure Children Fantasy,1995
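
The extraction pattern only matches titles that end exactly with a four-digit year in parentheses. The sketch below runs the same regex on a few sample titles; the last two are made-up cases added to show what happens when a title does not fit.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Some Title Without Year",  # hypothetical: no "(YYYY)" suffix
    "Heat (1995) ",             # hypothetical: trailing space after the parenthesis
])

# Same pattern as in the notebook: title, one whitespace character, "(YYYY)" at the very end.
parts = titles.str.extract(r'^(.*)\s\((\d{4})\)$')
parts.columns = ["title", "year"]

# Rows that do not match are NaN in both columns.
unmatched = parts["title"].isna()
print(parts)
print("Unmatched rows:", int(unmatched.sum()))
```

Non-matching rows end up as NaN in both `title` and `year`, which is why a `dropna` on `title` is needed before converting `year` to an integer. Stripping whitespace first (for example with `str.strip()`) would be one way to rescue titles like the last one.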


##### Dropping a few NaN in titles, and transforming the year into an age to later create youth_rate for movies

In [13]:
# Remove rows with missing values in the 'title' column (55 rows in total).
df_movie = df_movie.dropna(subset=['title'])

# Convert the 'year' column to integer data type.
df_movie['year'] = df_movie['year'].astype(int)

# Create a new column 'age_movie' by calculating the age of the movie based on the current year (2016).
df_movie['age_movie'] = 2016 - df_movie['year']

# Drop the 'year' column from the DataFrame as it is no longer needed.
df_movie = df_movie.drop(columns=['year'])

# Display the first 2 rows
df_movie.head(2)


Unnamed: 0,movieId,title,genres,age_movie
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,21
1,2,Jumanji,Adventure Children Fantasy,21


##### Merging the tag-related DataFrame with the movie-related DataFrame

In [14]:
# Merge the 'df_tag' DataFrame with 'df_movie' on the 'movieId' column using a left join.
# This combines movie details such as title and genres with the tag information.
df_tag_title_genres = df_tag.merge(df_movie, on='movieId', how='left')

# Display the first 2 rows
df_tag_title_genres.head(2)

Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie
0,18,4141,actor,15.0,Head Over Heels,Comedy Romance,15.0
1,65,208,dark hero,11.0,Waterworld,Action Adventure Sci-Fi,21.0


In [15]:
# Retrieve the dimensions of the 'df_tag_title_genres' DataFrame
df_tag_title_genres.shape

(465548, 7)
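
A left join keeps every tag row, including tags on movies that were dropped from `df_movie` because their title could not be parsed; for those rows `title`, `genres` and `age_movie` are NaN, which is also why `age_movie` is displayed as a float. The sketch below uses small made-up frames and the `indicator` option of `merge` to count such rows.

```python
import pandas as pd

tags = pd.DataFrame({
    "userId": [18, 65, 70],
    "movieId": [4141, 208, 9999],  # 9999 is a hypothetical id absent from `movies`
    "tag": ["actor", "dark hero", "funny"],
})
movies = pd.DataFrame({
    "movieId": [4141, 208],
    "title": ["Head Over Heels", "Waterworld"],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = tags.merge(movies, on="movieId", how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("Tag rows without a matching movie:", n_unmatched)
```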

In [16]:
# Convert the 'title' and 'genres' columns to string data type.
df_tag_title_genres['title'] = df_tag_title_genres['title'].astype(str)
df_tag_title_genres['genres'] = df_tag_title_genres['genres'].astype(str)

##### Loading the pre-trained word2vec model to vectorize tags, titles, and genres

In [17]:
# Importing necessary libraries
import nltk
from nltk.data import find
import gensim

# Downloading required NLTK resources
nltk.download('punkt')  # Downloading tokenizers for NLTK
nltk.download('stopwords')
nltk.download('word2vec_sample')  # Downloading the word2vec sample model

# Finding the path of the pre-trained word2vec model
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

# Loading the pre-trained word2vec model using Gensim
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package word2vec_sample to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


In [18]:
# Model test
model.similarity('actor','actress')

0.79300094

##### Cleaning text: tokenizing, lowercasing, and removing punctuation, symbols and stopwords

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure the 'stopwords' and 'punkt' datasets are downloaded for stopword removal and tokenization.
nltk.download('stopwords')
nltk.download('punkt')

# Load English stopwords into a set.
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from a text:
# - Tokenize the text into words.
# - Filter out stopwords and non-alphabetic tokens.
# - Join the filtered words back into a single string.
def remove_stopwords(text):
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

# Apply the 'remove_stopwords' function to the 'title' and 'tag' columns.
df_tag_title_genres['title'] = df_tag_title_genres['title'].apply(remove_stopwords)
df_tag_title_genres['tag'] = df_tag_title_genres['tag'].apply(remove_stopwords)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
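
The following sketch reproduces the filtering logic on a few strings without downloading any NLTK data. It is a simplification: it splits on whitespace instead of using `word_tokenize`, and the stopword set is a tiny hypothetical one, so it only illustrates the two filters (stopwords and non-alphabetic tokens).

```python
# Hypothetical, tiny stopword set standing in for NLTK's English list.
toy_stop_words = {"over", "the", "of", "a", "and"}

def remove_stopwords_simple(text, stop_words):
    """Lowercase, split on whitespace, drop stopwords and non-alphabetic tokens."""
    words = text.lower().split()
    kept = [w for w in words if w not in stop_words and w.isalpha()]
    return " ".join(kept)

cleaned_title = remove_stopwords_simple("Head Over Heels", toy_stop_words)
cleaned_tag = remove_stopwords_simple("dark hero", toy_stop_words)
cleaned_numeric = remove_stopwords_simple("2001 odyssey", toy_stop_words)
cleaned_hyphen = remove_stopwords_simple("sci-fi", toy_stop_words)

print(repr(cleaned_title), repr(cleaned_tag), repr(cleaned_numeric), repr(cleaned_hyphen))
```

The `isalpha()` filter has a side effect: tokens containing digits or hyphens are discarded entirely, so a text made only of such tokens becomes an empty string.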


##### Replacing genre words the model does not understand with words it knows and can embed

The library does not know words like Sci-Fi or Film-Noir, so we map them to words the model understands.

In [20]:
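# Dictionary for mapping and replacing specific genre terms with more descriptive terms.
# Note: the keys are lowercase and unhyphenated, so they only match if 'genres' has been
# lowercased and stripped of hyphens beforehand.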
mapping_dict = {
    'scifi': 'future',
    'thriller': 'suspense',
    'filmnoir': 'cynical',
    'musical': 'singing',
    'western': 'cowboy'
}

# Replace the terms in the 'genres' column based on the mapping dictionary.
# Iterate through the dictionary and apply the replacements using regex.
for key, value in mapping_dict.items():
    df_tag_title_genres['genres'] = df_tag_title_genres['genres'].str.replace(key, value, regex=True)
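
The keys in `mapping_dict` are lowercase and unhyphenated, whereas the `genres` column shown earlier still contains values such as `Sci-Fi`. One possible way to make the replacements match is to normalize the column first. The sketch below does this on a toy Series; it is an assumption about the intended preprocessing, not a step taken in this notebook, and it does not check whether the resulting words exist in the embedding vocabulary.

```python
import pandas as pd

mapping_dict = {
    'scifi': 'future',
    'thriller': 'suspense',
    'filmnoir': 'cynical',
    'musical': 'singing',
    'western': 'cowboy',
}

# Toy genre strings in the same space-separated form as the notebook's 'genres' column.
genres = pd.Series([
    "Action Adventure Sci-Fi",
    "Crime Film-Noir Thriller",
    "Comedy Romance",
])

# Normalize first: lowercase and remove hyphens so that 'Sci-Fi' becomes 'scifi'.
normalized = genres.str.lower().str.replace("-", "", regex=False)

# Then apply the plain-text replacements.
for key, value in mapping_dict.items():
    normalized = normalized.str.replace(key, value, regex=False)

print(normalized.tolist())
```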

### Vectorizing the tags, titles and genres into 300-dimensional vectors

In [21]:
# Function to vectorize the words of a text using a given word embedding model.
# - For each word, retrieve its vector from the model.
# - If any word is not recognized, return NaN for the whole text.
# - If the text is empty, return NaN as well (avoids averaging an empty list).
# - Otherwise, return the mean of the word vectors.
def vectorize_tag(tag, model):
    vectors = []
    for word in tag.split():
        if word in model:
            vectors.append(model[word])
        else:
            return np.nan  # Return NaN if any word is not recognized
    if not vectors:
        return np.nan  # Empty text: nothing to average
    return np.mean(vectors, axis=0)  # Average vectors for the tag

# Apply the 'vectorize_tag' function to 'tag', 'title', and 'genres' columns.
# This creates vector representations for each tag, title, and genre using the embedding model.
df_tag_title_genres['tag_vector'] = df_tag_title_genres['tag'].apply(lambda x: vectorize_tag(x, model))
df_tag_title_genres['title_vector'] = df_tag_title_genres['title'].apply(lambda x: vectorize_tag(x, model))
df_tag_title_genres['genres_vector'] = df_tag_title_genres['genres'].apply(lambda x: vectorize_tag(x, model))

# Display the first 2 rows
df_tag_title_genres.head(2)



Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector
0,18,4141,actor,15.0,head heels,Comedy Romance,15.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...",
1,65,208,dark hero,11.0,waterworld,Action Adventure Sci-Fi,21.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...",,
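
To see how the averaging behaves, the sketch below applies the same rule to a toy embedding with three dimensions. The dictionary `toy_model` and its values are made up for illustration and stand in for the word2vec model; the function name is also hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embedding standing in for the 300-dimensional word2vec model.
toy_model = {
    "dark": np.array([1.0, 0.0, 2.0]),
    "hero": np.array([3.0, 2.0, 0.0]),
}

def vectorize_text(text, model):
    """Mean of the word vectors; NaN if any word is unknown or the text is empty."""
    vectors = []
    for word in text.split():
        if word not in model:
            return np.nan  # one unknown word invalidates the whole text
        vectors.append(model[word])
    if not vectors:
        return np.nan      # empty text: nothing to average
    return np.mean(vectors, axis=0)

known = vectorize_text("dark hero", toy_model)
unknown = vectorize_text("dark waterworld", toy_model)
empty = vectorize_text("", toy_model)

print(known, unknown, empty)
```

The all-or-nothing rule is strict: a single out-of-vocabulary word discards the whole text, which explains why a one-word title such as `waterworld` yields no vector.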


#### Replacing NaN with a NumPy array of 300 zeros so that all cells hold the same type


In [34]:
# Create a zero vector with 300 elements to use as a replacement for NaN values.
zero_vector = np.zeros(300)

# Replace NaN values with the zero vector in the 'tag_vector', 'title_vector', and 'genres_vector' columns.
# Use lambda functions to check if the value is NaN and substitute it with the zero vector.
df_tag_title_genres['tag_vector'] = df_tag_title_genres['tag_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)
df_tag_title_genres['title_vector'] = df_tag_title_genres['title_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)
df_tag_title_genres['genres_vector'] = df_tag_title_genres['genres_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)

# Display the first 2 rows
df_tag_title_genres.head(2)


Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector,user_movie_vector
0,18,4141,actor,15.0,head heels,Comedy Romance,15.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0039216503500938416, -0.02111555015047391, ..."
1,65,208,dark hero,11.0,waterworld,Action Adventure Sci-Fi,21.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.028805866837501526, 0.018409182627995808, 0..."


##### Calculating one average vector per user and movie from the 3 vectors (tag, title and genres)

In [37]:
# Function to calculate the average vector from 'tag_vector', 'title_vector', and 'genres_vector'.
# It creates an array of these vectors and computes their mean.
def calculate_average_vector(row):
    vectors = np.array([row['tag_vector'], row['title_vector'], row['genres_vector']])
    return np.mean(vectors, axis=0)

# Apply the 'calculate_average_vector' function to each row of the DataFrame.
# This calculates the average vector and adds it as a new column 'user_movie_vector'.
df_tag_title_genres['user_movie_vector'] = df_tag_title_genres.apply(calculate_average_vector, axis=1)

# Display the first 2 rows
df_tag_title_genres.head(2)


Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector,user_movie_vector
0,18,4141,actor,15.0,head heels,Comedy Romance,15.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0039216503500938416, -0.02111555015047391, ..."
1,65,208,dark hero,11.0,waterworld,Action Adventure Sci-Fi,21.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.028805866837501526, 0.018409182627995808, 0..."
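
Averaging with zero vectors shrinks the result: when only the tag vector is known, the mean over three vectors is one third of that tag vector. The sketch below contrasts this with an alternative that averages only the components that are present. This alternative is not what the notebook does; it is shown on toy 2-dimensional vectors only to make the difference visible, and the function name is hypothetical.

```python
import numpy as np

def mean_of_present(vectors):
    """Average only the vectors that are not all zeros; fall back to zeros if none is present."""
    present = [v for v in vectors if np.any(v)]
    if not present:
        return np.zeros_like(vectors[0])
    return np.mean(present, axis=0)

# Toy row: the tag vector is known, title and genres were replaced by zeros.
tag_vector = np.array([3.0, 6.0])
title_vector = np.zeros(2)
genres_vector = np.zeros(2)
row = [tag_vector, title_vector, genres_vector]

plain_mean = np.mean(np.array(row), axis=0)  # notebook's approach
present_mean = mean_of_present(row)          # alternative shown for comparison

print(plain_mean, present_mean)
```

Both results point in the same direction, so the cosine similarity of a single row is unchanged. The difference in length starts to matter when rows are averaged again per user, because rows with more known components then weigh more.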


##### Calculating the average user_vector with a groupby on userId to get only one vector per user

In [40]:
# Group the DataFrame by 'userId' and compute the mean for each numeric column.
df_grouped = df_tag_title_genres.groupby('userId').mean(numeric_only=True)


In [41]:

# Display the first 2 rows
df_grouped.head(2)

userId,movieId,age,age_movie
18,4141.0,15.0,15.0
65,15211.058824,11.470588,24.088235


In [25]:
# Retrieve the dimensions of the grouped DataFrame to show the number of rows and columns.
df_grouped.shape


(7801, 3)
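
`mean(numeric_only=True)` skips the vector columns because their cells hold arrays (object dtype), which is why only three columns remain in `df_grouped`. The sketch below shows one possible way to also average the vectors per user, by stacking each user's arrays. Column names follow the notebook, but the data is made up and this is not necessarily how the per-user vectors were produced originally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "userId": [18, 65, 65],
    "age": [15.0, 11.0, 12.0],
    "user_movie_vector": [
        np.array([1.0, 0.0]),
        np.array([0.0, 2.0]),
        np.array([2.0, 4.0]),
    ],
})

# Scalar columns: a regular groupby mean.
scalar_means = df.groupby("userId")[["age"]].mean()

# Vector column: stack each user's arrays into a matrix and average over the rows.
user_ids, user_vectors = [], []
for user_id, group in df.groupby("userId"):
    user_ids.append(user_id)
    user_vectors.append(np.stack(group["user_movie_vector"].to_list()).mean(axis=0))

vector_means = pd.Series(
    user_vectors,
    index=pd.Index(user_ids, name="userId"),
    name="user_vector",
)

# One row per user, with both the scalar means and the mean vector.
per_user = scalar_means.join(vector_means)
print(per_user)
```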

##### Grouping the 300 vectors into a list again

In [None]:
# Combine columns 'user_movie_vector_0' to 'user_movie_vector_299' into a single 'user_movie_vector' column,
# where each entry is a list of 300 vector elements.
df_grouped['user_movie_vector'] = df_grouped[[f'user_movie_vector_{i}' for i in range(300)]].apply(lambda row: row.values.tolist(), axis=1)

# Drop the individual vector columns 'user_movie_vector_0' to 'user_movie_vector_299' as they are now combined.
df_grouped.drop([f'user_movie_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'tag_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'title_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'genres_vector_{i}' for i in range(300)], axis=1, inplace=True)

# Drop the 'movieId' column as it is no longer needed.
df_grouped.drop(['movieId'], axis=1, inplace=True)

# Display the first 2 rows
df_grouped.head(2)

##### Renaming the columns

In [None]:
# Rename columns in 'df_grouped' for clarity:
df_grouped.rename(columns={
    'age': 'tag_mean_age',
    'age_movie': 'movie_mean_age',
    'user_movie_vector': 'user_vector'
}, inplace=True)

# Vectorizing movies

In [61]:
# Load the CSV file into a DataFrame
df = pd.read_csv('input_data\\tag.csv')

In [62]:
# Drop the 'timestamp' column
df.drop('timestamp', axis=1, inplace=True)
df.head(2)

Unnamed: 0,userId,movieId,tag
0,18,4141,Mark Waters
1,65,208,dark hero


In [65]:
# Search for NaN
df.isna().sum()

userId     0
movieId    0
tag        0
dtype: int64

In [64]:
# Drop rows with NaN
df.dropna(axis=0, inplace=True)

In [66]:
# Function to vectorize words in a tag using the provided model
def vectorize_tag(tag, model):
    vectors = []
    for word in tag.split():
        if word in model:
            vectors.append(model[word])
        else:
            return np.nan  # Return NaN if any word is not recognized by the model
    return np.sum(vectors, axis=0)  # Sum of the vectors for each tag

# Apply the vectorization function to the 'tag' column and create a new 'vector' column
df['vector'] = df['tag'].apply(lambda x: vectorize_tag(x, model))

# Drop rows where the 'vector' column contains NaN values (i.e., rows with unrecognized words)
df.dropna(subset=['vector'], inplace=True)


In [70]:
df_genome_tags = pd.read_csv('input_data\\genome_tags.csv')
df_genome_tags.head(5)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [71]:
# Function to vectorize the words of a tag
def vectorize_tag(tag, model):
    vectors = []
    for word in tag.split():
        if word in model:
            vectors.append(model[word])
        else:
            return np.nan  # Return NaN if a word is not recognized
    return np.sum(vectors, axis=0)  # Sum of the vectors for each tag

# Apply the vectorization function
df_genome_tags['vector'] = df_genome_tags['tag'].apply(lambda x: vectorize_tag(x, model))

# Drop rows where the vector is NaN (i.e., rows with unrecognized words)
df_genome_tags.dropna(subset=['vector'], inplace=True)
df_genome_tags.head(5)

Unnamed: 0,tagId,tag,vector
16,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ..."
17,18,absurd,"[0.0468222, -0.050771, 0.0409929, 0.0389245, -..."
18,19,action,"[-0.0138909, 0.0633263, 0.0185893, 0.0105203, ..."
19,20,action packed,"[0.0420461, 0.0945427, -0.0084770005, 0.010103..."
20,21,adaptation,"[0.0837244, -0.0257734, -0.0665421, 0.0565452,..."


In [72]:
# Load the CSV file into a DataFrame
df_genome_scores = pd.read_csv('input_data\\genome_scores.csv')
df_genome_scores.head(5)

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.025
1,1,2,0.025
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675


In [None]:
# Merge the 'df_genome_tags' and 'df_genome_scores' DataFrames on the 'tagId' column using an inner join
# This combines rows from both DataFrames where 'tagId' matches in both
df_tags_merged = pd.merge(left=df_genome_tags, right=df_genome_scores, how='inner', on='tagId')


In [74]:
len(df_tags_merged)

7640416

In [75]:
# Filter the merged DataFrame to include only rows where 'relevance' is greater than 0.5
# This retains only those tags with a relevance score above the specified threshold
df_tags_merged = df_tags_merged[df_tags_merged['relevance'] > 0.5]

# Output the number of rows remaining in the filtered DataFrame
len(df_tags_merged)


365316

In [76]:
# Function to multiply each vector by its relevance score
# This adjusts the vector representation based on the relevance score for each row
def apply_relevance(row):
    weighted_vector = np.array(row['vector']) * row['relevance']
    return weighted_vector

# Apply the 'apply_relevance' function to each row in the DataFrame
# This creates a new column 'weighted_vector' where each vector is scaled by its relevance score
df_tags_merged['weighted_vector'] = df_tags_merged.apply(apply_relevance, axis=1)

# Display the first 5 rows of the DataFrame to check the new 'weighted_vector' column
df_tags_merged.head(5)


Unnamed: 0,tagId,tag,vector,movieId,relevance,weighted_vector
1243,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ...",1392,0.97325,"[0.0075675636, 0.02316841, 0.06519743, 0.08661..."
2798,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ...",3148,0.85875,"[0.006677262, 0.020442715, 0.057527147, 0.0764..."
2849,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ...",3210,0.6765,"[0.0052601667, 0.016104218, 0.04531833, 0.0602..."
3430,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ...",3888,0.63675,"[0.0049510878, 0.01515796, 0.042655498, 0.0566..."
3703,17,abortion,"[0.00777556, 0.0238052, 0.0669894, 0.0890002, ...",4191,0.745,"[0.0057927924, 0.017734874, 0.049907103, 0.066..."


In [77]:
len(df_tags_merged)

365316

In [78]:
df_tags_merged.drop(['vector', 'relevance'], axis=1, inplace=True)

In [80]:
# Group by 'movieId' and calculate the average of 'weighted_vector' for each movie
# This aggregates the weighted vectors for each movie and computes the mean vector
df_avg_vector = df_tags_merged.groupby('movieId')['weighted_vector'].apply(lambda x: np.mean(np.stack(x), axis=0))

# Convert the result into a DataFrame, with 'movieId' as a column and the average vectors
df_avg_vector = df_avg_vector.reset_index()

# Display the first 5 rows of the resulting DataFrame
df_avg_vector.head(5)


Unnamed: 0,movieId,weighted_vector
0,1,"[0.034586858, 0.0041892263, -0.007967652, 0.04..."
1,2,"[0.02394018, 0.012072621, -0.012650338, 0.0410..."
2,3,"[0.025589697, -0.0041956874, -0.0049643833, 0...."
3,4,"[0.014996908, -0.0033440643, -0.023636833, 0.0..."
4,5,"[0.023046032, -0.010654649, -0.01876496, 0.041..."
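
The movie vector above is a plain mean of relevance-scaled tag vectors, that is, the sum of `relevance * vector` divided by the number of tags rather than by the sum of the relevance scores. The sketch below compares this with a normalized weighted average on toy 2-dimensional vectors (all values are made up) and checks that both point in the same direction.

```python
import numpy as np

# Toy tag vectors and relevance scores for a single movie.
tag_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
relevance = np.array([0.9, 0.6])

# Notebook's approach: scale each vector by its relevance, then take the plain mean.
weighted = tag_vectors * relevance[:, None]
mean_of_weighted = weighted.mean(axis=0)  # sum(w * v) / n

# Normalized weighted average, shown for comparison: sum(w * v) / sum(w).
normalized_average = np.average(tag_vectors, axis=0, weights=relevance)

# Cosine similarity between the two results.
cosine = mean_of_weighted @ normalized_average / (
    np.linalg.norm(mean_of_weighted) * np.linalg.norm(normalized_average)
)
print(mean_of_weighted, normalized_average, cosine)
```

Since the two results differ only by a positive scale factor, the choice does not affect a ranking based on cosine similarity; it would matter for measures that depend on vector length, such as Euclidean distance.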


#### Cosine Similarity on Vectors

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

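# Choose a specific user for analysis
# Note: 'df_user_vector' is not defined earlier in this notebook; it is assumed here to be
# the per-user DataFrame built above ('df_grouped'), indexed by 'userId' with a 'user_vector' column.
user_id = 10616
user_vector = df_user_vector.loc[user_id, 'user_vector']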

# Compute cosine similarity between the user's vector and each movie vector
similarities = cosine_similarity([user_vector], df_avg_vector['weighted_vector'].tolist())[0]

# Add the similarity scores to the DataFrame containing movie vectors
df_avg_vector['similarity'] = similarities

# Sort the movies by similarity in descending order
recommended_movies = df_avg_vector.sort_values(by='similarity', ascending=False)

# Get the list of movies already seen by the user
user_seen_movies = df_tag[df_tag['userId'] == user_id]['movieId'].tolist()

# Filter out movies that the user has already seen
recommended_movies = recommended_movies[~recommended_movies['movieId'].isin(user_seen_movies)]

# Display the top 5 recommended movies
print(recommended_movies.head(5))
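
The ranking step can be followed on a small example. The sketch below computes cosine similarity with NumPy only, as a stand-in for scikit-learn's `cosine_similarity`, then sorts and removes movies the user has already tagged. The movie ids, vectors and the helper name `cosine_scores` are made up for illustration.

```python
import numpy as np
import pandas as pd

def cosine_scores(user_vec, movie_matrix):
    """Cosine similarity between one vector and each row of a matrix (0 for zero-length vectors)."""
    user_vec = np.asarray(user_vec, dtype=float)
    movie_matrix = np.asarray(movie_matrix, dtype=float)
    denom = np.linalg.norm(user_vec) * np.linalg.norm(movie_matrix, axis=1)
    scores = np.zeros(len(movie_matrix))
    nonzero = denom > 0
    scores[nonzero] = (movie_matrix[nonzero] @ user_vec) / denom[nonzero]
    return scores

# Toy movie vectors (2 dimensions instead of 300).
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "weighted_vector": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([1.0, 1.0]),
        np.array([2.0, 1.0]),
    ],
})
user_vector = np.array([1.0, 0.0])
seen_movies = [1]  # movies the user has already tagged

movies["similarity"] = cosine_scores(
    user_vector, np.stack(movies["weighted_vector"].to_list())
)

# Drop seen movies, then sort by similarity in descending order.
ranked = movies[~movies["movieId"].isin(seen_movies)].sort_values(
    by="similarity", ascending=False
)
top_ids = ranked["movieId"].tolist()
print(ranked[["movieId", "similarity"]])
```

Movie 1 has the highest similarity but is excluded because it was already seen, so the recommendation list starts with the next closest vector.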


#### Recommended Movies for User 10616

In [None]:
# Merge the recommended movies with movie titles and other details from the movie DataFrame
recommended_movies_titles = recommended_movies.merge(df_movie, on='movieId', how='left')

# Drop the 'weighted_vector' column as it is no longer needed
recommended_movies_titles.drop(['weighted_vector'], axis=1, inplace=True)

# Display the top 10 recommended movies with titles and details
recommended_movies_titles.head(10)


Unnamed: 0,movieId,similarity,title,genres,age_movie
0,110566,0.90161,Son of Batman,Action Adventure Animation Crime Fantasy,10.0
1,113278,0.888484,Batman: Assault on Arkham,Action Animation Crime Thriller,10.0
2,106102,0.888108,Gambit,Comedy Crime,12.0
3,7697,0.886546,"Prince and the Showgirl, The",Comedy Romance,67.0
4,1723,0.882793,Twisted,Comedy Drama,28.0
5,2000,0.882739,Lethal Weapon,Action Comedy Crime Drama,37.0
6,104419,0.882469,Justice League: Crisis on Two Earths,Action Animation Sci-Fi,14.0
7,7720,0.880466,"Four Musketeers, The",Action Adventure Comedy Romance,50.0
8,31923,0.879845,"Three Musketeers, The",Action Adventure Comedy,51.0
9,27064,0.87742,Batman & Mr. Freeze: Subzero,Action Animation Children Crime,26.0
