# Clustering Preprocessing
Users are clustered by the genres of the movies they rated. The data is prepared as follows:
1. Extract genres from the <i>genres</i> column of the movie metadata (JSON to one-hot encoding).
2. Merge user ratings and movie metadata.
3. Group the ratings so that only one entry per user is retained.
4. Normalize the genre values so that they sum to 1 for each user.
5. Discard unneeded columns.
6. For visualization purposes, transform the data to two dimensions.

Step 1 is already performed in the prediction preprocessing.
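
Steps 3 and 4 can be illustrated on a tiny example. The following is a minimal sketch, assuming the ratings have already been merged with the one-hot genre columns; the table `joined`, the two genre names, and all values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy table as it might look after the merge: one row per (user, movie) pair
# with one-hot genre flags; all values are made up for illustration
joined = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "Comedy": [1.0, 0.0, 1.0, 1.0, 0.0],
    "Drama":  [0.0, 1.0, 0.0, 1.0, 0.0],
})
genre_cols = ["Comedy", "Drama"]

# step 3: one row per user, genre flags summed over all rated movies
per_user = joined.groupby("userId")[genre_cols].sum()

# step 4: divide each row by its total so the genre shares sum to 1;
# a user whose movies carry no genre yields 0/0 = NaN, which is replaced by 0
shares = per_user.div(per_user.sum(axis=1), axis=0).fillna(0)
print(shares)
```

Normalizing by the row sum makes users comparable regardless of how many movies they rated: a user with 2 comedies out of 3 genre tags and a user with 200 out of 300 end up with the same profile.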

In [2]:
# change used width of browser window
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [3]:
# import packages
import numpy as np  # needed for np.savetxt when saving the transformed data
import pandas as pd
import re
import json

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

## 1. Load Data

In [7]:
# Load Preprocessed Movie Metadata
df_movies = pd.read_csv("clusterPreprocessing.csv")
display(df_movies.head(3))
# load user ratings
df_ratings = pd.read_csv("the-movies-dataset/ratings.csv")
df_ratings = df_ratings.drop(columns=["rating", "timestamp"])
display(df_ratings.head(3))

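| | War | Drama | Western | Thriller | Documentary | Science Fiction | Comedy | History | Music | Fantasy | ... | Animation | Foreign | Adventure | Romance | Family | Horror | Crime | Action | imdbId | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 114709 | 3.888157 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 113497 | 3.236953 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 113228 | 3.17555 |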


| | userId | movieId |
|---|---|---|
| 0 | 1 | 110 |
| 1 | 1 | 147 |
| 2 | 1 | 858 |


## 2. Merge User Ratings and Movie Metadata 

In [None]:
# join ratings and metadata
df_joined = 
 
# discard unneeded columns (all except userId and genres)
df_joined = df_joined.drop(columns=["movieId"])

# group values per user and aggregate genres
df_joined = df_joined.groupby("userId").sum()

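# genre columns
# assumption: every column of df_movies except the index column, imdbId and rating is a genre
genre_set = set(df_movies.columns) - {"Unnamed: 0", "imdbId", "rating"}

# normalize genre values so that each user's row sums to 1
df_joined[list(genre_set)] = df_joined[list(genre_set)].div(df_joined[list(genre_set)].sum(axis=1), axis=0)
# users without any genre information produce 0/0 = NaN; set those to 0
df_joined = df_joined.fillna(0)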

display(df_joined.head(3))
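
The ratings table is keyed by `movieId`, while the metadata shown in section 1 carries `imdbId`, so the two tables may not share a join column directly. The following is a minimal sketch of one way such a join could look, assuming a lookup table that maps `movieId` to `imdbId` is available. The `links` table here is hypothetical and not part of this notebook, and all values are made up; this is not necessarily the join used to produce `df_joined`.

```python
import pandas as pd

# Toy stand-ins; column names mirror the tables shown in section 1, values are made up
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [110, 147, 110]})
movies = pd.DataFrame({
    "imdbId": [900001, 900002],
    "Comedy": [1.0, 0.0],
    "Drama":  [0.0, 1.0],
})
# hypothetical lookup table mapping movieId to imdbId (an assumption, not from the notebook)
links = pd.DataFrame({"movieId": [110, 147], "imdbId": [900001, 900002]})

# first hop: ratings -> links on movieId; second hop: -> movies on imdbId
joined = (
    ratings
    .merge(links, on="movieId", how="inner")
    .merge(movies, on="imdbId", how="inner")
    .drop(columns=["movieId", "imdbId"])  # keep only userId and the genre columns
)
print(joined)
```

An inner join silently drops ratings for movies that have no metadata entry; comparing `len(ratings)` with `len(joined)` shows how many rows were lost.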

In [None]:
# save clustering data as csv
df_joined.to_csv("userclusterdata.csv", index=True)

## 3. Transform Data 

In [None]:
# pca: principal component analysis
a_pca = PCA(n_components=2).fit_transform(df_joined[list(genre_set)])

# tsne: t-distributed stochastic neighbor embedding
a_tsne = TSNE(n_components=2).fit_transform(df_joined[list(genre_set)])
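
To make the PCA step less opaque, the projection to two dimensions can be written out with plain NumPy. This is a sketch of the standard centered-SVD formulation on random toy data, not the scikit-learn implementation; the matrix `X` and its size are assumptions for illustration, and the signs of the resulting components may differ from what `PCA.fit_transform` returns. t-SNE is an iterative optimization without a comparably short form and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for the normalized user-by-genre matrix: 6 users, 4 genres,
# each row rescaled to sum to 1 (values are random, for illustration only)
X = rng.random((6, 4))
X = X / X.sum(axis=1, keepdims=True)

# PCA: center the columns, then project onto the two directions of largest variance
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # shape (6, 2)

# share of the total variance captured by the two components
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(X_2d.shape, round(float(explained), 3))
```

The `explained` ratio indicates how much of the structure survives the reduction; a low value means the 2D plot should be read with caution.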

In [None]:
# save transformed data as csv file
np.savetxt("tsne_allgenres.csv", a_tsne, delimiter=",")
np.savetxt("pca_allgenres.csv", a_pca, delimiter=",")