## Second Practice Activity of a Recommendation System
In the first notebook, I briefly explained recommendation systems and how they are classified. I also implemented two initial collaborative methods, an item-based movie recommendation and a user-based movie recommendation; both simply use similarity metrics to recommend new movies.

Here, I want to implement a content-based recommendation system, using the same movie dataset along with 3 new datasets.
Content-based approaches use information about the item or the user to make recommendations. We first preprocess the data to get a dataset in the desired format.

In [322]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [323]:
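def correlation_pearson_sparse_row(array, sparse_matrix):
    """Pearson correlation between a dense vector and every row of a CSR matrix."""
    yy = array - array.mean()  # center the dense vector
    xm = sparse_matrix.mean(axis=1).A.ravel()  # mean of each sparse row
    ys = yy / np.sqrt(np.dot(yy, yy))  # centered vector scaled to unit norm
    # per-row sqrt(sum((x - mean)^2)), computed from the stored values only
    # note: np.add.reduceat does not return 0 for rows with no stored values
    xs = np.sqrt(np.add.reduceat(sparse_matrix.data**2, sparse_matrix.indptr[:-1]) - sparse_matrix.shape[1]*xm*xm)

    # ys sums to zero, so the row means drop out of the numerator
    correl2 = np.add.reduceat(sparse_matrix.data * ys[sparse_matrix.indices], sparse_matrix.indptr[:-1]) / xs
    return correl2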
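
As a sanity check on the sparse Pearson trick above, the following sketch compares it with `np.corrcoef` on a tiny matrix. The function is re-declared under a hypothetical name (`pearson_dense_vs_sparse_rows`) so the example is self-contained, and it assumes every row has at least one stored value, since `np.add.reduceat` does not handle empty rows correctly.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pearson_dense_vs_sparse_rows(array, sparse_matrix):
    # Same idea as the notebook function: only stored values are touched.
    centered = array - array.mean()
    unit = centered / np.sqrt(np.dot(centered, centered))
    row_means = np.asarray(sparse_matrix.mean(axis=1)).ravel()
    starts = sparse_matrix.indptr[:-1]
    n_cols = sparse_matrix.shape[1]
    row_norms = np.sqrt(np.add.reduceat(sparse_matrix.data ** 2, starts) - n_cols * row_means ** 2)
    return np.add.reduceat(sparse_matrix.data * unit[sparse_matrix.indices], starts) / row_norms

# Toy data: three "movies" described by five binary tags (no empty rows).
dense = np.array([[1., 0., 1., 0., 1.],
                  [0., 1., 1., 0., 0.],
                  [1., 1., 0., 0., 1.]])
query = np.array([1., 0., 1., 1., 0.])

fast = pearson_dense_vs_sparse_rows(query, csr_matrix(dense))
reference = np.array([np.corrcoef(query, row)[0, 1] for row in dense])
print(np.allclose(fast, reference))
```

The two results agree because the centered query sums to zero, which makes the row means vanish from the numerator.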

### Datasets

In [324]:
movies = pd.read_csv("../Datasets/Movies IMDB/info.csv")

# 27278 movies; 16 movie titles are duplicated under different IDs
# the duplicated titles are removed later to avoid problems

movies.info()

movies.head()

# this movie dataset provides part of the content information about the movies: the genres

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [325]:
tags = pd.read_csv("../Datasets/Movies IMDB/tags.csv").drop(["timestamp", "userId"], axis = 1) 

# 19545 movies have at least one tag
# this dataset holds tags for certain movies; this information is also used as content about the movies
# it brings in new information such as actors, directors, etc.

tags.info()

tags.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   movieId  465564 non-null  int64 
 1   tag      465548 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.1+ MB


Unnamed: 0,movieId,tag
0,4141,Mark Waters
1,208,dark hero
2,353,dark hero
3,521,noir thriller
4,592,dark hero


In [326]:
movie_ids_ = tags["movieId"]

In [327]:
ratings = pd.read_csv("../Datasets/Movies IMDB/ratings.csv").drop("timestamp", axis = 1) # used later
# 26744 movies
# Dataset with user-movie interactions, with ratings in the interval [0.5, 5] in half-star steps

ratings.info()

ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int64  
 1   movieId  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 457.8 MB


Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


### Preprocessing the content data

To obtain the desired content dataset, we reformat the genres column in the movies dataset and group the relevant tags from the tags dataset by movie.

In [328]:
# normalizing the tags

formatted_tags = tags.copy()

# note: astype(str) turns the 16 missing tags into the literal string "nan"
formatted_tags['tag'] = tags['tag'].astype(str).apply(lambda x: x.capitalize().strip().replace(" ", "")) # capitalize and remove spaces
formatted_tags = formatted_tags.drop_duplicates()

tags_grouped_by_id = formatted_tags.groupby(['movieId'], as_index=False).agg({'tag': '; '.join}) # group the tags of each movie into one string

tags_grouped_by_id.head()

Unnamed: 0,movieId,tag
0,1,Watched; Computeranimation; Disneyanimatedfeat...
1,2,Timetravel; Adaptedfrom:book; Boardgame; Child...
2,3,Oldpeoplethatisactuallyfunny; Sequelfever; Gru...
3,4,Chickflick; Revenge; Characters; Clv
4,5,Dianekeaton; Family; Sequel; Stevemartin; Wedd...
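
To make the normalize-then-group step concrete, here is a minimal sketch on a made-up tags table. The names `toy_tags` and `normalize_tag` are hypothetical, and unlike the cell above this variant drops missing tags first instead of converting them to strings.

```python
import pandas as pd

# Hypothetical toy data: two spellings of the same tag and one missing tag.
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "tag": ["dark hero", "Dark Hero ", "pixar", None],
})

def normalize_tag(tag):
    # "Dark Hero " -> "Dark hero" -> "Darkhero"
    return tag.strip().capitalize().replace(" ", "")

clean = toy_tags.dropna(subset=["tag"]).copy()
clean["tag"] = clean["tag"].apply(normalize_tag)
clean = clean.drop_duplicates()  # the two "dark hero" variants collapse into one

# One row per movie, with its tags joined into a single string.
grouped = clean.groupby("movieId", as_index=False).agg({"tag": "; ".join})
print(grouped)
```

Movie 2 disappears from `grouped` because its only tag was missing.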


In [329]:
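# formatting movies
movies.loc[:, "genres"] = movies.loc[:, "genres"].str.replace("|", "; ", regex = False)
movies["genres"] = movies["genres"].replace("(no genres listed)", "") # replace "(no genres listed)" with an empty string

movies.head()

# note: this overwrites every genre with an empty string, so only the tags
# contribute to the content features (the outputs below reflect this)
movies.loc[:, "genres"] = ""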

In [330]:
df_merged = pd.merge(movies, tags_grouped_by_id, on = "movieId", how = "outer") # outer join of the two dataframes

df_merged["genres"] = df_merged["genres"].fillna("")
df_merged["tag"] = df_merged["tag"].fillna("")
df_merged["tags"] = df_merged["genres"] + "; "+ df_merged["tag"] # joining genres and tags into a single "tags" column
df_merged.drop(["genres", "tag"], axis = 1, inplace = True)

df_merged.head()

Unnamed: 0,movieId,title,tags
0,1,Toy Story (1995),; Watched; Computeranimation; Disneyanimatedfe...
1,2,Jumanji (1995),; Timetravel; Adaptedfrom:book; Boardgame; Chi...
2,3,Grumpier Old Men (1995),; Oldpeoplethatisactuallyfunny; Sequelfever; G...
3,4,Waiting to Exhale (1995),; Chickflick; Revenge; Characters; Clv
4,5,Father of the Bride Part II (1995),; Dianekeaton; Family; Sequel; Stevemartin; We...


In [331]:
without_tags = df_merged[df_merged["tags"] == "; "]["movieId"] # save to remove later

without_tags.shape

(7733,)

In [332]:
df_merged["movieId"].unique().shape # checking for repeated ids

(27278,)

In [333]:
print("tags: ",tags_grouped_by_id.shape, " \nmovies:", movies.shape, " \ndf_merged:", df_merged.shape)

tags:  (19545, 2)  
movies: (27278, 3)  
df_merged: (27278, 3)


In [334]:
vectorizer_tags = CountVectorizer(dtype = np.int8)

tags_vectorized = vectorizer_tags.fit_transform(df_merged.loc[:, "tags"])
tags_names = vectorizer_tags.get_feature_names_out()

tags_vectorized.data = np.where(tags_vectorized.data != 0, 1, 0)
tags_vectorized

<27278x34957 sparse matrix of type '<class 'numpy.int32'>'
	with 213797 stored elements in Compressed Sparse Row format>

In [335]:
# keep only the most frequent tags, which are the most informative ones

tags_counts = np.array(tags_vectorized.sum(axis=0))[0]
filter_ = (tags_counts > 10)

names_filtered = tags_names[filter_] # from ~35k tags down to ~3k

tags_vectorized = pd.DataFrame((tags_vectorized[:, filter_].toarray()), columns = names_filtered)
tags_vectorized.drop(["tag", "watched"], axis = 1, inplace = True)

tags_vectorized.shape

(27278, 2935)

In [336]:
tags_vectorized.head()

Unnamed: 0,007,01,02,03,04,05,06,07,08,09,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [337]:
tags_final = pd.concat([df_merged["movieId"], tags_vectorized], axis = 1)

tags_final.drop(tags_final[tags_final["movieId"].isin(without_tags)].index, inplace = True)
# removing movies without any tag

In [338]:
new_ratings = pd.merge(movies.drop("genres", axis = 1), ratings, on = "movieId")

new_ratings.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),3,4.0
1,1,Toy Story (1995),6,5.0
2,1,Toy Story (1995),8,4.0
3,1,Toy Story (1995),10,4.0
4,1,Toy Story (1995),11,4.5


In [339]:
count = new_ratings.groupby("title")["movieId"].nunique()

dup_movies = (count[count>1].index).tolist()

new_ratings = new_ratings.drop(new_ratings[new_ratings["title"].isin(dup_movies)].index)

new_ratings.shape
# dropping movies whose title maps to more than one movieId

(19940607, 4)

In [340]:
# Here we keep only the movies that appear in both datasets (ratings and tags)

tags_movie_ids = tags_final["movieId"]
ratings_movie_ids = new_ratings["movieId"]

common_movie_ids = pd.Series(list(set(tags_movie_ids) & set(ratings_movie_ids)))

common_movie_ids.shape

new_ratings = new_ratings.drop(new_ratings[~(new_ratings['movieId'].isin(common_movie_ids))].index)

tags_final = tags_final.drop(tags_final[~(tags_final['movieId'].isin(common_movie_ids))].index)

new_ratings.shape

(19794945, 4)

In [341]:
# Build the user x movie pivot table efficiently, directly as a CSR matrix

user_ids = new_ratings['userId'].unique()
title_ids = new_ratings['title'].unique()

user_to_row = {user_id: i for i, user_id in enumerate(user_ids)} # userId -> row index
title_to_col = {title: j for j, title in enumerate(title_ids)} # title -> column index

rows = [user_to_row[user_id] for user_id in new_ratings['userId']]
cols = [title_to_col[title] for title in new_ratings['title']]
ratings_values = new_ratings['rating'].tolist()

# note: dtype = np.int8 casts the ratings to integers, so half-star ratings are truncated (3.5 -> 3, 0.5 -> 0)
sparse_matrix = csr_matrix((ratings_values, (rows, cols)), shape=(len(user_ids), len(title_ids)), dtype = np.int8)

print(sparse_matrix.shape)

(138493, 18991)
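
A possible alternative for building the same kind of matrix is `pd.factorize`, which assigns row and column codes in order of first appearance without Python-level loops. This is only a sketch on made-up data (`toy` is hypothetical); it also uses a float dtype so that half-star ratings are kept as they are.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same layout as new_ratings.
toy = pd.DataFrame({
    "userId": [10, 10, 20, 30],
    "title": ["A", "B", "A", "C"],
    "rating": [3.5, 4.0, 0.5, 5.0],
})

# factorize returns integer codes plus the unique values in order of appearance.
row_codes, user_index = pd.factorize(toy["userId"])
col_codes, title_index = pd.factorize(toy["title"])

matrix = csr_matrix(
    (toy["rating"].to_numpy(dtype=np.float32), (row_codes, col_codes)),
    shape=(len(user_index), len(title_index)),
)
print(matrix.toarray())
```

`user_index` and `title_index` play the role of the lookup dictionaries: position `i` in each index is the label of row or column `i`.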


In [342]:
movie_names = pd.Series(list(title_to_col.keys()))
user_names = pd.Series(list(user_to_row.keys()))

pivot_csr = sparse_matrix

In [343]:
movie_ids = tags_final["movieId"]
movie_names = pd.merge(movies[["movieId", "title"]], tags_final["movieId"], on = "movieId")["title"]

print("names: ",movie_names.shape,"  ids: ", movie_ids.shape)

tags_final = tags_final.drop("movieId", axis = 1)

tags_array = np.array(tags_final, dtype = np.int8)
tags_csr = csr_matrix(tags_final, dtype = np.int8)

names:  (18991,)   ids:  (18991,)


In [344]:
del new_ratings, tags_grouped_by_id, tags_vectorized, tags_final, movies, ratings

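To make the user recommendations we need another dataset, final_csr, which can be interpreted as the characteristics of the movies watched by each user: the user's ratings are summed per tag and then divided by the number of movies that carry that tag.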

In [345]:
reduced_pivot = csr_matrix(pivot_csr, dtype = np.float16)

result_dot = reduced_pivot.dot(tags_csr)

In [346]:
tags_sum = np.array(tags_csr.sum(axis = 0), dtype = np.float16)[0]

In [347]:
final_coo = result_dot / tags_sum # extremely heavy line: dividing a sparse matrix by a dense array gives a dense result and uses a lot of RAM

final_csr = csr_matrix(final_coo, dtype = np.float16)
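
One way to avoid the dense intermediate is to scale the columns with a sparse diagonal matrix, so the product stays sparse throughout. The sketch below shows the idea on a toy example; the variable names are hypothetical and it has not been measured on the full dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Toy data: 2 users x 3 movies (ratings) and 3 movies x 2 tags (binary).
ratings = csr_matrix(np.array([[4., 0., 2.],
                               [0., 5., 0.]], dtype=np.float32))
tags = csr_matrix(np.array([[1., 0.],
                            [1., 1.],
                            [0., 1.]], dtype=np.float32))

# Number of movies carrying each tag; guard against tags with no movies.
tag_counts = np.asarray(tags.sum(axis=0)).ravel()
inverse_counts = np.divide(1.0, tag_counts, out=np.zeros_like(tag_counts), where=tag_counts > 0)

# (users x movies) @ (movies x tags) @ diag(1 / count) keeps everything sparse.
profiles = (ratings @ tags @ diags(inverse_counts)).tocsr()
print(profiles.toarray())
```

Right-multiplying by the diagonal matrix divides each tag column by its count, which is the same arithmetic as the dense division.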

In [348]:
del result_dot, tags_sum, final_coo, reduced_pivot

### Recommendations

With all the datasets preprocessed, we create the item and user recommendations.

In [349]:
def get_recommendation_item(movie_name, metric):
    if (movie_name is None):
        print("Movie not found.")
        return None
    
    # the exact title must be given, otherwise the lookup fails
    
    movie = tags_array[pd.Index(movie_names).get_loc(movie_name), :]

    if metric == "cosine":
        similarity = cosine_similarity(movie.reshape(1, -1), tags_csr)[0]
    elif metric == "pearson":
        similarity = correlation_pearson_sparse_row(movie, tags_csr)
    else: 
        print("Unknown Metric.")
        return None
    
    order = np.argsort(-similarity)

    top_10 = similarity[order[1:11]]  # position 0 is expected to be the movie itself
    top_10_names = movie_names.iloc[order[1:11]]
    
    return (list(top_10_names), top_10)

In [350]:
get_recommendation_item("Batman & Robin (1997)", "cosine")

(['Batman (1966)',
  'Batman Forever (1995)',
  'Superman III (1983)',
  'Batman: Mask of the Phantasm (1993)',
  'Batman Returns (1992)',
  'Batman Beyond: Return of the Joker (2000)',
  'Batman: Gotham Knight (2008)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman (1989)',
  'Superman II (1980)'],
 array([0.53422445, 0.4810512 , 0.47105572, 0.45425676, 0.40032038,
        0.39629696, 0.39629696, 0.3933979 , 0.34027233, 0.33968311]))

In [351]:
get_recommendation_item("Batman & Robin (1997)", "pearson")

(['Batman (1966)',
  'Batman Forever (1995)',
  'Superman III (1983)',
  'Batman: Mask of the Phantasm (1993)',
  'Batman Beyond: Return of the Joker (2000)',
  'Batman: Gotham Knight (2008)',
  'Batman Returns (1992)',
  'Superman IV: The Quest for Peace (1987)',
  'Superman II (1980)',
  'Batman (1989)'],
 array([0.52961592, 0.47397287, 0.4653604 , 0.44909869, 0.39343441,
        0.39343441, 0.39256963, 0.38655158, 0.33165842, 0.3290266 ]))

In [352]:
def get_recommendation_user(user_name):
    if (user_name is None):
        print("User not found.")
        return None
    user_id = pd.Index(user_names).get_loc(user_name)
    
    user_movies = pivot_csr[user_id, :].toarray()
    user_characteristics = final_csr[user_id, :]
    
    unwatched_movies = (user_movies == 0)[0]
    unwatched_movies_charac = tags_csr[unwatched_movies, :].T
    
    scores = user_characteristics.dot(unwatched_movies_charac).toarray()[0]
    
    names = movie_names[unwatched_movies]
    order = scores.argsort()[::-1]
    
    return names.iloc[order], scores[order]

In [353]:
get_recommendation_user(3)

(258                                    Pulp Fiction (1994)
 312                                    Forrest Gump (1994)
 4224     Lord of the Rings: The Fellowship of the Ring,...
 2462                                     Fight Club (1999)
 6090     Lord of the Rings: The Return of the King, The...
                                ...                        
 7834                      Strangers in Good Company (1990)
 18192                        Leave The World Behind (2014)
 2846     Carriers Are Waiting, The (Convoyeurs attenden...
 7824                       Mannequin 2: On the Move (1991)
 13331                 Stray Dogs (Sag-haye velgard) (2004)
 Name: title, Length: 18804, dtype: object,
 array([77.12048, 67.86047, 53.95102, ...,  0.     ,  0.     ,  0.     ],
       dtype=float32))

In [354]:
get_recommendation_user(2579)

(977      Star Wars: Episode VI - Return of the Jedi (1983)
 258                                    Pulp Fiction (1994)
 968                                          Aliens (1986)
 2114                                    Matrix, The (1999)
 416                                   Jurassic Park (1993)
                                ...                        
 15370                          Weight of the Nation (2012)
 5162     Labyrinth of Passion (Laberinto de Pasiones) (...
 12098                                          Ryan (2004)
 12097    Maradona, the Hand of God (Maradona, la mano d...
 14958                   Harvest Month, The (Elokuu) (1956)
 Name: title, Length: 18967, dtype: object,
 array([16.677773, 15.569359, 14.754047, ...,  0.      ,  0.      ,
         0.      ], dtype=float32))
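
The user recommendation returns the full ranking of unwatched movies. A minimal sketch of the same scoring step, truncated to the top `k`, could look like the following; all data and names here are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

titles = np.array(["A", "B", "C", "D"])
user_ratings = csr_matrix(np.array([[5., 0., 0., 0.]]))  # the user only rated "A"
movie_tags = csr_matrix(np.array([[1., 1.],
                                  [1., 0.],
                                  [0., 1.],
                                  [0., 0.]]))
user_profile = csr_matrix(np.array([[2.5, 5.0]]))  # hypothetical per-tag preferences

# Positions of the movies the user has not rated yet.
unwatched_idx = np.flatnonzero(user_ratings.toarray()[0] == 0)

# Score = dot product between the user profile and each candidate's tag vector.
scores = user_profile.dot(movie_tags[unwatched_idx].T).toarray()[0]

k = 2
order = np.argsort(-scores)[:k]
top_titles = titles[unwatched_idx[order]]
top_scores = scores[order]
print(list(top_titles), top_scores)
```

Indexing `titles` through `unwatched_idx[order]` maps the positions in the filtered score vector back to the original movie positions.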

#### Final remarks:
     -> Apparently everything is OK
     -> The user recommendation has some caveats, but for now I will leave it as is because I want to move on
     -> The optimization went more smoothly; I was able to reduce and fix some old errors
     -> Otherwise I believe everything is fine; maybe later, when I feel more like it, I will turn this into a class like the other example.