# Vectorizing Users

The purpose of this notebook is to vectorize users based on their tags and their choice of movies.
We work on categorical features such as tags, titles and movie genres to build numeric vectors. We also convert dates into ages, and then ages into youth rates (`youth_rate`), to identify which era best fits each user. These vectors and youth rates let us compare users and movies, and so identify shared topics and styles in order to make better recommendations to our most valuable customers.

##### Importing necessary libraries

In [6]:
import numpy as np
import pandas as pd
import re


##### Loading Tag DataFrame

In [7]:
df_tag = pd.read_csv('input_data\\tag.csv', delimiter=',')
df_tag.head(4)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43


##### Quick check of the number of users and tags
There are only 7,801 users, which is a very low number. These users can be considered opinion leaders, since they put more effort into judging movies.

In [8]:
df_tag['userId'].nunique()

7801

In [9]:
df_tag.shape

(465564, 4)
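
With 465,564 tag rows for 7,801 users, each tagger contributes roughly 60 tags on average. To see how that activity is spread across users, a per-user count can help. The sketch below runs on a small made-up frame (`toy_tags` and its values are illustrative, not taken from `tag.csv`):

```python
import pandas as pd

# Toy stand-in for df_tag; the rows are made up for illustration
toy_tags = pd.DataFrame({
    "userId": [18, 65, 65, 65, 96],
    "tag": ["Mark Waters", "dark hero", "dark hero", "noir thriller", "mars"],
})

n_users = toy_tags["userId"].nunique()             # number of distinct taggers
tags_per_user = toy_tags.groupby("userId").size()  # number of tag rows per user
mean_tags = len(toy_tags) / n_users                # average number of tags per user
```

On the real data, `tags_per_user.describe()` would show whether a handful of users account for most of the tagging.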

##### Dropping a few NaN values

In [10]:
# Find columns with NaN values
# Count NaN values for each column
nan_counts = df_tag.isna().sum()

# Filter and print only the columns with NaN values and their counts
nan_columns_counts = nan_counts[nan_counts > 0]
nan_columns_counts

tag    16
dtype: int64

In [11]:
df_tag = df_tag.dropna()


In [12]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

##### Replacing names in tags with "actor" or "actress"
The goal is to vectorize the tags, but the model doesn't know personal names and can't vectorize them, so it is better to drop this information and replace it with something useful. This lets us feed the model information such as "is the main character male or female?"


In [13]:
from nltk.corpus import names

# Load male and female first names
male_names = set(names.words('male.txt'))
female_names = set(names.words('female.txt'))

# Replace the whole tag with "actor" or "actress" when it contains a known first name
def replace_name(tag):
    words = tag.split()
    for word in words:
        if word in male_names:
            return "actor"
        elif word in female_names:
            return "actress"
    return tag  # No first name found: keep the original tag

# Apply the function to the 'tag' column to replace first names
df_tag['tag'] = df_tag['tag'].apply(replace_name)

df_tag.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,actor,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
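
Two properties of this replacement are worth keeping in mind: the whole tag is replaced as soon as one word matches a first name, and the lookup is case-sensitive. The sketch below illustrates both with tiny hand-written name sets (`male_names_demo` and `female_names_demo` are hypothetical stand-ins for the NLTK lists):

```python
# Hypothetical miniature name lists standing in for the NLTK 'names' corpus
male_names_demo = {"Mark", "Tom"}
female_names_demo = {"Julia"}

def replace_name_demo(tag):
    # Same logic as replace_name above, but on the demo name sets
    for word in tag.split():
        if word in male_names_demo:
            return "actor"
        if word in female_names_demo:
            return "actress"
    return tag

examples = ["Mark Waters", "Julia Roberts", "dark hero", "mark waters"]
results = [replace_name_demo(t) for t in examples]
# "mark waters" is left untouched because the name sets are capitalized
```

If lowercase names should also be caught, one option would be to compare `word.capitalize()` against the sets, at the cost of more false positives on ordinary words.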


In [14]:
df_tag.dtypes


userId        int64
movieId       int64
tag          object
timestamp    object
dtype: object

##### Using the timestamp to create a tag age column, which is much more useful later on

In [15]:
# Convert the 'timestamp' column to datetime
df_tag['timestamp'] = pd.to_datetime(df_tag['timestamp'])

# Compute the tag age in years, rounded to the nearest year
current_date = pd.Timestamp.now()
df_tag['age'] = ((current_date - df_tag['timestamp']).dt.days / 365.25).round(0)

# Drop the 'timestamp' column
df_tag = df_tag.drop(columns=['timestamp'])

df_tag
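
Because the age is measured against `pd.Timestamp.now()`, the values shift every time the notebook is re-run. A minimal sketch of the same calculation against a fixed reference date (the date `2024-06-01` is an assumption chosen for illustration) gives reproducible ages:

```python
import pandas as pd

# Two timestamps taken from the head of the tag table
ts = pd.to_datetime(pd.Series(["2009-04-24 18:19:40", "2013-05-10 01:41:18"]))

# Assumed fixed reference date, so that results do not depend on the run date
reference_date = pd.Timestamp("2024-06-01")

# Whole days elapsed, converted to years and rounded to the nearest year
age_years = ((reference_date - ts).dt.days / 365.25).round(0)
```

With this reference date the two tags come out at 15 and 11 years.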

##### Loading Movie DataFrame

In [17]:
df_movie = pd.read_csv('input_data/movie.csv')

df_movie.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


##### Splitting genres and extracting title and year for easier use later on

In [18]:
# Split the genres on '|' and join them back with a space
df_movie['genres'] = df_movie['genres'].str.split('|').str.join(' ')

# Use str.extract to separate the title from the year
df_movie[['title', 'year']] = df_movie['title'].str.extract(r'^(.*)\s\((\d{4})\)$')

df_movie.head(3)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,1995
1,2,Jumanji,Adventure Children Fantasy,1995
2,3,Grumpier Old Men,Comedy Romance,1995
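
The extraction pattern only matches titles that end with a four-digit year in parentheses; anything else produces NaN in both new columns. A small sketch of that behavior (the title `"Untitled Project"` is a made-up example of a non-matching row):

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Crow, The (1994)", "Untitled Project"])

# Group 1 captures the title, group 2 the four-digit year at the end of the string
parts = titles.str.extract(r'^(.*)\s\((\d{4})\)$')
parts.columns = ["title", "year"]

# Rows that do not match the pattern are NaN in both columns
unmatched = parts["title"].isna().sum()
```

Note that the year comes back as a string, so it still has to be cast to an integer before any arithmetic.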


##### Dropping a few NaN titles and converting the release year into an age, to later create a youth_rate for movies

In [19]:
# Drop rows with no title or year (55 in total)
df_movie = df_movie.dropna(subset=['title'])

# Convert the 'year' column to integer
df_movie['year'] = df_movie['year'].astype(int)

# Create the age_movie column
df_movie['age_movie'] = 2024 - df_movie['year']

# Drop the 'year' column
df_movie = df_movie.drop(columns=['year'])

df_movie.head(5)

Unnamed: 0,movieId,title,genres,age_movie
0,1,Toy Story,Adventure Animation Children Comedy Fantasy,29
1,2,Jumanji,Adventure Children Fantasy,29
2,3,Grumpier Old Men,Comedy Romance,29
3,4,Waiting to Exhale,Comedy Drama Romance,29
4,5,Father of the Bride Part II,Comedy,29


##### Merging tag related DataFrame with movie related DataFrame

In [20]:
df_tag_title_genres = df_tag.merge(df_movie, on='movieId', how='left')

df_tag_title_genres.head(5)

Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie
0,18,4141,actor,15.0,Head Over Heels,Comedy Romance,23.0
1,65,208,dark hero,11.0,Waterworld,Action Adventure Sci-Fi,29.0
2,65,353,dark hero,11.0,"Crow, The",Action Crime Fantasy Thriller,30.0
3,65,521,noir thriller,11.0,Romeo Is Bleeding,Crime Thriller,31.0
4,65,592,dark hero,11.0,Batman,Action Crime Thriller,35.0


In [21]:
df_tag_title_genres.shape

(465548, 7)

In [22]:
df_tag_title_genres['title'] = df_tag_title_genres['title'].astype(str)
df_tag_title_genres['genres'] = df_tag_title_genres['genres'].astype(str)

##### Loading the pre-trained model word2vec to vectorize tags, titles, and genres

In [23]:
# Importing necessary libraries
import nltk
from nltk.data import find
import gensim

# Downloading required NLTK resources
nltk.download('punkt')  # Downloading tokenizers for NLTK
nltk.download('stopwords')
nltk.download('word2vec_sample')  # Downloading the word2vec sample model

# Finding the path of the pre-trained word2vec model
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

# Loading the pre-trained word2vec model using Gensim
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package word2vec_sample to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


In [24]:
model.similarity('actor','actress')

0.79300094

##### Cleaning text: tokenizing, lowercasing, removing punctuation, symbols, stopwords...

In [25]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make sure the stopwords and tokenizer data are downloaded
nltk.download('stopwords')
nltk.download('punkt')

# Load the English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from a text
def remove_stopwords(text):
    # Tokenize the lowercased text
    words = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not stopwords
    filtered_words = [word for word in words if word not in stop_words and word.isalpha()]
    # Join the filtered words back into a single string
    return ' '.join(filtered_words)

# Apply the function to the 'title' and 'tag' columns
df_tag_title_genres['title'] = df_tag_title_genres['title'].apply(remove_stopwords)
df_tag_title_genres['tag'] = df_tag_title_genres['tag'].apply(remove_stopwords)




[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jcrig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


##### Cleaning text, lower case, punctuation (might be redundant with previous cell)

In [26]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and symbols
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

# Apply the function to the 'tag', 'title' and 'genres' columns
df_tag_title_genres['tag'] = df_tag_title_genres['tag'].apply(clean_text)
df_tag_title_genres['title'] = df_tag_title_genres['title'].apply(clean_text)
df_tag_title_genres['genres'] = df_tag_title_genres['genres'].apply(clean_text)

df_tag_title_genres.head(10)

Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie
0,18,4141,actor,15.0,head heels,comedy romance,23.0
1,65,208,dark hero,11.0,waterworld,action adventure scifi,29.0
2,65,353,dark hero,11.0,crow,action crime fantasy thriller,30.0
3,65,521,noir thriller,11.0,romeo bleeding,crime thriller,31.0
4,65,592,dark hero,11.0,batman,action crime thriller,35.0
5,65,668,bollywood,11.0,song little road pather panchali,drama,69.0
6,65,898,screwball comedy,11.0,philadelphia story,comedy drama romance,84.0
7,65,1248,noir thriller,11.0,touch evil,crime filmnoir thriller,66.0
8,65,1391,mars,11.0,mars attacks,action comedy scifi,28.0
9,65,1617,,11.0,confidential,crime filmnoir mystery thriller,27.0


##### Replacing genres words that the model doesn't understand with words the model knows and is able to embed

In [27]:
# Mapping dictionary used to replace genre terms
mapping_dict = {
    'scifi': 'future',
    'thriller': 'suspense',
    'filmnoir': 'cynical',
    'musical': 'singing',
    'western': 'cowboy'
}

# Replace the terms in the 'genres' column
for key, value in mapping_dict.items():
    df_tag_title_genres['genres'] = df_tag_title_genres['genres'].str.replace(key, value, regex=True)

In [28]:
df_tag_title_genres.head(10)

Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie
0,18,4141,actor,15.0,head heels,comedy romance,23.0
1,65,208,dark hero,11.0,waterworld,action adventure future,29.0
2,65,353,dark hero,11.0,crow,action crime fantasy suspense,30.0
3,65,521,noir thriller,11.0,romeo bleeding,crime suspense,31.0
4,65,592,dark hero,11.0,batman,action crime suspense,35.0
5,65,668,bollywood,11.0,song little road pather panchali,drama,69.0
6,65,898,screwball comedy,11.0,philadelphia story,comedy drama romance,84.0
7,65,1248,noir thriller,11.0,touch evil,crime cynical suspense,66.0
8,65,1391,mars,11.0,mars attacks,action comedy future,28.0
9,65,1617,,11.0,confidential,crime cynical mystery suspense,27.0
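
The replacement above matches substrings, which is fine for the cleaned genre labels. If the same mapping were ever applied to free text, a whole-word variant may be safer. The sketch below is one possible variation (the word-boundary pattern is an addition, not what the notebook uses):

```python
import pandas as pd

genres = pd.Series(["action adventure scifi", "crime filmnoir thriller"])
mapping = {"scifi": "future", "thriller": "suspense", "filmnoir": "cynical"}

# \b anchors restrict each replacement to whole words only
for old, new in mapping.items():
    genres = genres.str.replace(rf"\b{old}\b", new, regex=True)
```

On these two rows the result is identical to the substring version, since every key appears as a standalone word.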


### Vectorizing the tags, titles and genres into 300-dimensional vectors

In [29]:
# Vectorize the words of a text (tag, title or genres) with the word2vec model
def vectorize_tag(tag, model):
    vectors = []
    for word in tag.split():
        if word in model:
            vectors.append(model[word])
        else:
            return np.nan  # Return NaN as soon as one word is not in the vocabulary
    # Mean of the word vectors; an empty string gives np.mean([]), i.e. NaN plus a RuntimeWarning
    return np.mean(vectors, axis=0)

# Apply the vectorization function
df_tag_title_genres['tag_vector'] = df_tag_title_genres['tag'].apply(lambda x: vectorize_tag(x, model))
df_tag_title_genres['title_vector'] = df_tag_title_genres['title'].apply(lambda x: vectorize_tag(x, model))
df_tag_title_genres['genres_vector'] = df_tag_title_genres['genres'].apply(lambda x: vectorize_tag(x, model))


df_tag_title_genres.head(3)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector
0,18,4141,actor,15.0,head heels,comedy romance,23.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...","[0.032902144, -0.0385218, -0.03976148, 0.07610..."
1,65,208,dark hero,11.0,waterworld,action adventure future,29.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...",,"[0.0060361014, 0.017076494, 0.011771732, 0.026..."
2,65,353,dark hero,11.0,crow,action crime fantasy suspense,30.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[-0.00227776, 0.0247481, 0.0133911, 0.00635654...","[0.041928604, 0.0007887692, 0.025844725, -0.00..."
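
The vectorization is all-or-nothing: a text gets a vector only if every one of its words is in the model vocabulary, which is why `waterworld` has no title vector above. The sketch below reproduces that logic on a toy model (a plain dict with 3-dimensional vectors, standing in for the 300-dimensional word2vec `KeyedVectors`) and adds an explicit guard for empty strings, which is an assumption about how one might avoid the warning:

```python
import numpy as np

# Toy "model": a dict supports the same `in` and `[]` lookups used above
toy_model = {
    "dark": np.array([1.0, 0.0, 0.0]),
    "hero": np.array([0.0, 1.0, 0.0]),
}

def vectorize_demo(text, model):
    vectors = []
    for word in text.split():
        if word not in model:
            return np.nan          # one unknown word invalidates the whole text
        vectors.append(model[word])
    if not vectors:
        return np.nan              # empty text: skip np.mean on an empty list
    return np.mean(vectors, axis=0)

v_known = vectorize_demo("dark hero", toy_model)          # mean of the two word vectors
v_unknown = vectorize_demo("dark waterworld", toy_model)  # NaN: 'waterworld' is unknown
v_empty = vectorize_demo("", toy_model)                   # NaN without a RuntimeWarning
```

A softer alternative would be to average only the known words and ignore the rest, which keeps more rows at the price of noisier vectors.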


#### Replacing NaN with a 300-element NumPy zero vector so that every cell has the same type


In [30]:
# Create a vector of 300 zeros
zero_vector = np.zeros(300)

# Replace NaN with the zero vector in each column
df_tag_title_genres['tag_vector'] = df_tag_title_genres['tag_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)
df_tag_title_genres['title_vector'] = df_tag_title_genres['title_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)
df_tag_title_genres['genres_vector'] = df_tag_title_genres['genres_vector'].apply(lambda x: zero_vector if isinstance(x, float) and np.isnan(x) else x)
df_tag_title_genres.head(3)

Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector
0,18,4141,actor,15.0,head heels,comedy romance,23.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...","[0.032902144, -0.0385218, -0.03976148, 0.07610..."
1,65,208,dark hero,11.0,waterworld,action adventure future,29.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0060361014, 0.017076494, 0.011771732, 0.026..."
2,65,353,dark hero,11.0,crow,action crime fantasy suspense,30.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[-0.00227776, 0.0247481, 0.0133911, 0.00635654...","[0.041928604, 0.0007887692, 0.025844725, -0.00..."


##### Calculating one average vector per user and movie from the 3 vectors (tag, title and genres)

In [31]:
# Compute the element-wise mean of the three vectors of a row
def calculate_average_vector(row):
    vectors = np.array([row['tag_vector'], row['title_vector'], row['genres_vector']])
    return np.mean(vectors, axis=0)

# Apply the function and store the mean vector in a new column
df_tag_title_genres['user_movie_vector'] = df_tag_title_genres.apply(calculate_average_vector, axis=1)
df_tag_title_genres.head(3)


Unnamed: 0,userId,movieId,tag,age,title,genres,age_movie,tag_vector,title_vector,genres_vector,user_movie_vector
0,18,4141,actor,15.0,head heels,comedy romance,23.0,"[0.0536976, -0.0352089, -0.0556269, 0.0234726,...","[-0.04193265, -0.02813775, 0.05097, -0.0277740...","[0.032902144, -0.0385218, -0.03976148, 0.07610...","[0.014889032, -0.03395615, -0.014806126, 0.023..."
1,65,208,dark hero,11.0,waterworld,action adventure future,29.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0060361014, 0.017076494, 0.011771732, 0.026...","[0.03081790062909325, 0.02410134735206763, 0.0..."
2,65,353,dark hero,11.0,crow,action crime fantasy suspense,30.0,"[0.0864176, 0.055227548, 0.06579455, -0.010537...","[-0.00227776, 0.0247481, 0.0133911, 0.00635654...","[0.041928604, 0.0007887692, 0.025844725, -0.00...","[0.042022813, 0.026921473, 0.035010125, -0.001..."


In [32]:
df_tag_title_genres.shape

(465548, 11)

##### Calculating the average user vector with a groupby on userId to get a single vector per user

In [38]:
# Group by userId and take the mean of every numeric column
df_grouped = df_tag_title_genres.groupby('userId').mean(numeric_only=True)

In [39]:
df_grouped.head(10)

Unnamed: 0_level_0,movieId,age,age_movie,tag_vector_0,tag_vector_1,tag_vector_2,tag_vector_3,tag_vector_4,tag_vector_5,tag_vector_6,...,user_movie_vector_290,user_movie_vector_291,user_movie_vector_292,user_movie_vector_293,user_movie_vector_294,user_movie_vector_295,user_movie_vector_296,user_movie_vector_297,user_movie_vector_298,user_movie_vector_299
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
18,4141.0,15.0,23.0,0.053698,-0.035209,-0.055627,0.023473,0.016077,0.110611,-0.009043,...,-0.006471,0.037527,-0.073639,-0.023889,0.023804,0.012073,0.019153,0.005831,-0.006938,0.05804
65,15211.058824,11.470588,32.088235,0.027663,0.014717,0.002878,0.010959,-0.001154,0.013754,0.017317,...,-0.022506,0.003585,-0.030945,-0.012529,-0.002457,-0.001832,-0.004014,-0.021792,0.000503,0.020233
96,106696.0,10.0,11.0,0.030466,-0.003008,-0.011646,0.029537,-0.007484,0.01105,0.017731,...,-0.071901,-0.004837,-0.058803,-0.012635,0.004143,0.001133,0.019925,0.011643,0.009393,-0.03077
121,31220.054054,13.0,20.959459,0.018217,-0.005181,-0.013041,0.0332,-0.008786,0.018358,-0.003804,...,-0.004458,0.019261,-0.05705,-0.027208,-0.007126,0.017245,-0.00274,-0.018061,0.007516,0.021165
129,35166.270492,13.737705,19.54918,0.028904,-0.00517,-0.011786,0.030328,-0.013535,0.023266,0.013479,...,-0.013489,0.017812,-0.052038,-0.021572,-0.00653,0.00958,0.002248,-0.02083,0.002735,0.013557
133,5847.0,10.0,22.2,0.035504,-0.000523,-0.022148,0.00721,0.0035,-0.00215,0.031261,...,-0.032205,0.007054,-0.043336,-0.017538,0.005027,0.002727,0.012269,-0.029577,0.001541,0.025244
190,1714.8,13.0,36.4,0.022145,0.050467,-0.018776,0.045312,-0.010063,0.011392,-0.012396,...,-0.032627,0.033438,-0.04464,-0.018476,0.013056,0.011426,-0.001755,0.005213,0.016037,-0.015467
205,1966.0,17.0,34.0,-0.00997,0.016945,-0.033396,0.036504,-0.017301,0.012105,0.026214,...,-0.019224,0.013812,-0.067321,-0.01004,0.000229,0.004797,-0.008424,-0.011488,0.021194,-0.02812
208,21246.833333,19.0,21.166667,0.015007,0.003165,-0.008897,0.017798,-0.007676,0.009924,0.006844,...,-0.025295,0.002331,-0.04606,-0.012975,0.009992,0.006089,0.009344,-0.030555,0.00093,0.01926
271,84772.0,13.0,13.0,0.053698,-0.035209,-0.055627,0.023473,0.016077,0.110611,-0.009043,...,-0.011303,0.018093,-0.044287,-0.034569,0.010871,0.018623,-0.018846,0.004163,0.001256,0.006668


In [40]:
df_grouped.shape

(7801, 1203)
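
The grouped frame has 1,203 columns: `movieId`, `age`, `age_movie` and 4 × 300 scalar columns such as `tag_vector_0`. This implies that each vector column was expanded into 300 numeric columns before the groupby, in cells that are not shown here. The sketch below is one way such an expansion could be done, on a toy frame with 2-dimensional vectors (the frame and its values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: one array-valued column, as in df_tag_title_genres
toy = pd.DataFrame({
    "userId": [65, 65, 96],
    "user_movie_vector": [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])],
})

# Stack the arrays into a 2-D matrix and give each component its own column
expanded = pd.DataFrame(
    np.vstack(toy["user_movie_vector"].tolist()),
    index=toy.index,
    columns=[f"user_movie_vector_{i}" for i in range(2)],
)
toy_wide = pd.concat([toy[["userId"]], expanded], axis=1)

# Now the per-user mean can be taken component by component
toy_grouped = toy_wide.groupby("userId").mean(numeric_only=True)
```

Expanding first is needed because `mean(numeric_only=True)` skips object columns that hold arrays.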

##### Grouping the 300 vector components back into a single list

In [41]:
# Combine columns user_movie_vector_0 to user_movie_vector_299 into a single user_movie_vector column
df_grouped['user_movie_vector'] = df_grouped[[f'user_movie_vector_{i}' for i in range(300)]].apply(lambda row: row.values.tolist(), axis=1)

# Drop the expanded component columns, which are now combined or no longer needed
df_grouped.drop([f'user_movie_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'tag_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'title_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop([f'genres_vector_{i}' for i in range(300)], axis=1, inplace=True)
df_grouped.drop(['movieId'], axis=1, inplace=True)


In [42]:
df_grouped.head(10)

Unnamed: 0_level_0,age,age_movie,user_movie_vector
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
18,15.0,23.0,"[0.014889031648635864, -0.03395615145564079, -..."
65,11.470588,32.088235,"[0.027792496606707573, 0.009135226303638487, 0..."
96,10.0,11.0,"[0.010905513501105208, -0.0034274699816402663,..."
121,13.0,20.959459,"[0.012775622906587044, -0.004583349273128955, ..."
129,13.737705,19.54918,"[0.02194722527459956, -0.0033614813859696637, ..."
133,10.0,22.2,"[0.03299186776081721, -0.003448138727496068, -..."
190,13.0,36.4,"[0.010119041614234447, 0.012334399422009788, 0..."
205,17.0,34.0,"[-0.00790642915914456, -0.0074115414172410965,..."
208,19.0,21.166667,"[0.01984358832447065, -0.0010980015051447684, ..."
271,13.0,13.0,"[0.020448155235499144, -0.015068457461893559, ..."


##### Renaming the columns

In [44]:
df_grouped.rename(columns={
    'age': 'tag_mean_age',
    'age_movie': 'movie_mean_age',
    'user_movie_vector': 'user_vector'
}, inplace=True)

In [46]:
df_grouped.head()

Unnamed: 0_level_0,tag_mean_age,movie_mean_age,user_vector
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
18,15.0,23.0,"[0.014889031648635864, -0.03395615145564079, -..."
65,11.470588,32.088235,"[0.027792496606707573, 0.009135226303638487, 0..."
96,10.0,11.0,"[0.010905513501105208, -0.0034274699816402663,..."
121,13.0,20.959459,"[0.012775622906587044, -0.004583349273128955, ..."
129,13.737705,19.54918,"[0.02194722527459956, -0.0033614813859696637, ..."


##### Changing the dataframe's name

In [47]:
df_user_vector = df_grouped

#### Studying the minimum, maximum, and mean age of movies and tags

As we can see, movies go back as far as 104 years, while tagging only started 19 years ago. This lets us determine, first, which era our users are interested in and, second, how recent their tagging is, and perhaps detect an evolution in taste.

In [50]:
print(df_user_vector['movie_mean_age'].min())
print(df_user_vector['movie_mean_age'].max())
print(df_user_vector['movie_mean_age'].mean())

print(df_user_vector['tag_mean_age'].min())
print(df_user_vector['tag_mean_age'].max())
print(df_user_vector['tag_mean_age'].mean())

9.0
104.0
26.72252585515137
9.0
19.0
14.4575634827061


##### Creation of both tag_youth_rate and movie_youth_rate

In [52]:
# Normalize each mean age to a youth rate between 0 (oldest) and 1 (youngest)
df_user_vector['tag_youth_rate'] = 1 - (df_user_vector['tag_mean_age'] - df_user_vector['tag_mean_age'].min()) / (df_user_vector['tag_mean_age'].max() - df_user_vector['tag_mean_age'].min())
df_user_vector['movie_youth_rate'] = 1 - (df_user_vector['movie_mean_age'] - df_user_vector['movie_mean_age'].min()) / (df_user_vector['movie_mean_age'].max() - df_user_vector['movie_mean_age'].min())

df_user_vector.head(3)

Unnamed: 0_level_0,tag_mean_age,movie_mean_age,user_vector,tag_youth_rate,movie_youth_rate
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
18,15.0,23.0,"[0.014889031648635864, -0.03395615145564079, -...",0.4,0.852632
65,11.470588,32.088235,"[0.027792496606707573, 0.009135226303638487, 0...",0.752941,0.756966
96,10.0,11.0,"[0.010905513501105208, -0.0034274699816402663,...",0.9,0.978947
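
The youth rate is a reversed min-max scaling: the youngest mean age in a column maps to 1 and the oldest to 0, with both bounds taken from the same column. A minimal sketch as a reusable helper (`youth_rate` is a hypothetical function name), checked against the movie ages reported above (minimum 9, maximum 104, and user 18 at 23):

```python
import pandas as pd

def youth_rate(ages: pd.Series) -> pd.Series:
    # Reversed min-max scaling: youngest -> 1, oldest -> 0
    span = ages.max() - ages.min()
    return 1 - (ages - ages.min()) / span

# Minimum, user 18, and maximum of movie_mean_age
movie_ages = pd.Series([9.0, 23.0, 104.0])
rates = youth_rate(movie_ages)
# Middle value: 1 - (23 - 9) / (104 - 9) = 1 - 14/95
```

Using one helper for both columns avoids mixing the bounds of one column into the other.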


##### Dropping age columns

In [53]:
df_user_vector.drop(['tag_mean_age','movie_mean_age'], axis=1, inplace=True)

##### Export dataframe to csv for ML use

In [None]:
# Export the dataframe to CSV; keep the index, since it holds the userId
df_user_vector.to_csv('output_data/user_avg_vectors_4.csv', index=True)
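
When the file is read back for ML use, the `user_vector` column comes in as text (for example `"[0.1, 0.2]"`), not as a list. The sketch below shows one way to restore it, using an in-memory buffer and a made-up two-user frame in place of the real file; keeping `userId` as the index in the export is an assumption of this sketch:

```python
import ast
import io
import pandas as pd

# Toy stand-in for df_user_vector, indexed by userId
toy = pd.DataFrame(
    {"user_vector": [[0.1, 0.2], [0.3, 0.4]], "movie_youth_rate": [0.85, 0.76]},
    index=pd.Index([18, 65], name="userId"),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written, so userId survives the round trip
buffer.seek(0)

# Read back and parse the stringified lists into real Python lists
loaded = pd.read_csv(buffer, index_col="userId")
loaded["user_vector"] = loaded["user_vector"].apply(ast.literal_eval)
```

For larger exports, a binary format that stores arrays natively may be more convenient than parsing strings.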