### Second Practice Activity of a Recommendation System

In the first notebook, I briefly explained recommendation systems and how they are categorized. I also implemented two initial collaborative methods, an item-based movie recommendation and a user-based movie recommendation; both simply use similarity metrics to recommend new movies.

Here, I want to implement a content-based recommendation system using the same movie dataset. Content-based approaches use information about the item or the user to select similar ones and recommend them. This dataset was prepared in another Python notebook, using the files in the movie_information folder.
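To make the idea concrete, here is a minimal sketch on made-up data (the three movies and their tags are hypothetical, not taken from the dataset). Each movie becomes a binary tag vector, and the movie whose vector is closest to the query is recommended.

```python
import numpy as np

# Hypothetical toy catalog: rows are movies, columns are the tags
# superhero, dark, comedy, romance.
titles = ["Movie A", "Movie B", "Movie C"]
tag_matrix = np.array([
    [1, 1, 0, 0],  # Movie A: superhero, dark
    [1, 0, 1, 0],  # Movie B: superhero, comedy
    [0, 0, 1, 1],  # Movie C: comedy, romance
], dtype=float)

def cosine_to_all(query, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

scores = cosine_to_all(tag_matrix[0], tag_matrix)
scores[0] = -np.inf  # never recommend the query movie itself
best_match = titles[int(np.argmax(scores))]
print(best_match)
```

Movie B shares one tag with Movie A while Movie C shares none, so Movie B is the recommendation in this toy case.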

In [12]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [13]:
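def correlation_pearson_sparse_row(array, sparse_matrix):
    """Pearson correlation between a dense 1-D array and each row of a CSR matrix."""
    yy = array - array.mean()  # center the query vector
    xm = sparse_matrix.mean(axis=1).A.ravel()  # mean of each row
    ys = yy / np.sqrt(np.dot(yy, yy))  # centered query scaled to unit norm
    # norm of each centered row: sqrt(sum(x^2) - n * mean^2)
    # note: np.add.reduceat assumes every row has at least one stored entry
    xs = np.sqrt(np.add.reduceat(sparse_matrix.data**2, sparse_matrix.indptr[:-1]) - sparse_matrix.shape[1]*xm*xm)

    # ys sums to zero, so only the stored (nonzero) entries contribute to the dot product
    correl2 = np.add.reduceat(sparse_matrix.data * ys[sparse_matrix.indices], sparse_matrix.indptr[:-1]) / xs
    return correl2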
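As a sanity check on the formula, the sketch below computes the same row-wise Pearson correlation with sparse matrix products instead of `np.add.reduceat` and compares it with `np.corrcoef` on a small dense example. The function name `pearson_rows_reference` and the toy matrix are assumptions made for illustration; this is a variant, not the function used in the rest of the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pearson_rows_reference(array, sparse_matrix):
    """Pearson correlation of a dense vector against every row of a sparse matrix."""
    n = sparse_matrix.shape[1]
    yy = array - array.mean()
    ys = yy / np.sqrt(yy @ yy)                       # centered, unit-norm query
    xm = np.asarray(sparse_matrix.mean(axis=1)).ravel()
    sq_sums = np.asarray(sparse_matrix.multiply(sparse_matrix).sum(axis=1)).ravel()
    xs = np.sqrt(sq_sums - n * xm * xm)              # norm of each centered row
    # ys sums to zero, so x . ys equals (x - mean(x)) . ys
    return (sparse_matrix @ ys) / xs

dense = np.array([
    [1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 1.0],
])
query = dense[0]
result = pearson_rows_reference(query, csr_matrix(dense))
expected = np.array([np.corrcoef(query, row)[0, 1] for row in dense])
print(np.allclose(result, expected))
```

Using matrix products avoids the empty-row pitfall of `reduceat`, although a row with no stored entries still has zero variance and therefore an undefined correlation.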

In [14]:
movies = pd.read_csv("../movie_information/movie.csv")
# 27278 movies; 16 titles are duplicated, but each has a different movieId
# I will keep them because their movieIds differ

movies.info()

movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
tags = pd.read_csv("../movie_information/tag.csv").drop(["timestamp", "userId"], axis = 1) 
# 19545 distinct movies have at least one tag

tags.info()

tags.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   movieId  465564 non-null  int64 
 1   tag      465548 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.1+ MB


Unnamed: 0,movieId,tag
0,4141,Mark Waters
1,208,dark hero
2,353,dark hero
3,521,noir thriller
4,592,dark hero


In [16]:
ratings = pd.read_csv("../movie_information/rating.csv").drop("timestamp", axis = 1) # used later
# 26744 distinct movies have at least one rating

ratings.info()

ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int64  
 1   movieId  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 457.8 MB


Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [17]:
# modifying tags

formatted_tags = tags.copy()

formatted_tags['tag'] = tags['tag'].astype(str).apply(lambda x: x.capitalize().strip().replace(" ", "")) # formatting tags
formatted_tags = formatted_tags.drop_duplicates()

tags_grouped_by_id = formatted_tags.groupby(['movieId'], as_index=False).agg({'tag': '; '.join}) # group tags by movieId

tags_grouped_by_id.head()

Unnamed: 0,movieId,tag
0,1,Watched; Computeranimation; Disneyanimatedfeat...
1,2,Timetravel; Adaptedfrom:book; Boardgame; Child...
2,3,Oldpeoplethatisactuallyfunny; Sequelfever; Gru...
3,4,Chickflick; Revenge; Characters; Clv
4,5,Dianekeaton; Family; Sequel; Stevemartin; Wedd...
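One detail worth checking in the cell above: `astype(str)` turns the few missing tags into the literal string `"nan"`, which then survives as a tag. The sketch below, on a hypothetical toy frame, drops missing tags first and strips whitespace before capitalizing; it is one possible variant, not the cleaning used for the results in this notebook.

```python
import pandas as pd

# Hypothetical toy data with a missing tag and two spellings of the same tag.
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tag": ["dark hero", "Dark Hero ", "pixar", None, "time travel"],
})

cleaned = toy_tags.dropna(subset=["tag"]).copy()      # drop missing tags up front
cleaned["tag"] = (
    cleaned["tag"]
    .str.strip()                                      # trim before capitalizing
    .str.replace(" ", "", regex=False)                # "dark hero" -> "darkhero"
    .str.capitalize()                                 # "darkhero" -> "Darkhero"
)
cleaned = cleaned.drop_duplicates()

grouped = cleaned.groupby("movieId", as_index=False).agg({"tag": "; ".join})
print(grouped)
```

Both spellings of "dark hero" collapse into a single `Darkhero` entry, and movie 2 keeps only `Timetravel`.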


In [18]:
# formatting movies
movies["genres"] = movies["genres"].str.replace("|", "; ", regex = False)
movies["genres"] = movies["genres"].replace("(no genres listed)", "") # replace the "no genres listed" placeholder with an empty string

movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure; Animation; Children; Comedy; Fantasy
1,2,Jumanji (1995),Adventure; Children; Fantasy
2,3,Grumpier Old Men (1995),Comedy; Romance
3,4,Waiting to Exhale (1995),Comedy; Drama; Romance
4,5,Father of the Bride Part II (1995),Comedy


In [19]:
df_merged = pd.merge(movies, tags_grouped_by_id, on = "movieId", how = "outer") # outer join of the two dataframes

df_merged["genres"] = df_merged["genres"].fillna("")
df_merged["tag"] = df_merged["tag"].fillna("")
df_merged["tags"] = df_merged["genres"] + "; " + df_merged["tag"] # join genres and tags into a single "tags" column
df_merged = df_merged.drop(["genres", "tag"], axis = 1)

df_merged.head()

Unnamed: 0,movieId,title,tags
0,1,Toy Story (1995),Adventure; Animation; Children; Comedy; Fantas...
1,2,Jumanji (1995),Adventure; Children; Fantasy; Timetravel; Adap...
2,3,Grumpier Old Men (1995),Comedy; Romance; Oldpeoplethatisactuallyfunny;...
3,4,Waiting to Exhale (1995),Comedy; Drama; Romance; Chickflick; Revenge; C...
4,5,Father of the Bride Part II (1995),Comedy; Dianekeaton; Family; Sequel; Stevemart...


In [20]:
without_tags = df_merged[df_merged["tags"] == "; "]["movieId"] # movies with neither genres nor tags; saved to remove later

without_tags.shape

(209,)

In [21]:
print("tags: ",tags_grouped_by_id.shape, " \nmovies:", movies.shape, " \ndf_merged:", df_merged.shape)

tags:  (19545, 2)  
movies: (27278, 3)  
df_merged: (27278, 3)


In [22]:
vectorizer_tags = CountVectorizer()

tags_vectorized = vectorizer_tags.fit_transform(df_merged.loc[:, "tags"])
tags_names = vectorizer_tags.get_feature_names_out()

tags_vectorized.data = np.where(tags_vectorized.data != 0, 1, 0)
tags_vectorized

# maybe change from int32 to int8 later

<27278x34957 sparse matrix of type '<class 'numpy.int32'>'
	with 265140 stored elements in Compressed Sparse Row format>
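It helps to know what `CountVectorizer` does with these strings. With default settings it lowercases the text and keeps tokens that match the pattern `(?u)\b\w\w+\b`, so punctuation such as `:` splits a tag in two and single-character tokens are dropped. The sketch below only mimics that default tokenization with `re`; the helper names are hypothetical and this is not the scikit-learn implementation.

```python
import re

# Default token pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_like_default(text):
    """Lowercase and split roughly the way CountVectorizer's defaults would."""
    return TOKEN_PATTERN.findall(text.lower())

def binary_bag(text, vocabulary):
    """1 if the vocabulary term occurs in the text, else 0."""
    present = set(tokenize_like_default(text))
    return [1 if term in present else 0 for term in vocabulary]

tokens = tokenize_like_default("Adventure; Adaptedfrom:book; Boardgame; R")
vocabulary = sorted(set(tokens))
row = binary_bag("Adventure; Adventure; Book", vocabulary)
print(tokens)
print(vocabulary, row)
```

If presence/absence is all that is needed, passing `binary=True` to `CountVectorizer` should produce the same 0/1 matrix without the `np.where` step.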

In [23]:
tags_counts = np.array(tags_vectorized.sum(axis=0))[0]
filter_ = (tags_counts > 10)

names_filtered = tags_names[filter_] # from ~35k to ~3k columns

tags_vectorized = pd.DataFrame((tags_vectorized[:, filter_].toarray()), columns = names_filtered)
tags_vectorized.drop(["tag", "watched"], axis = 1, inplace = True)

tags_vectorized.shape

(27278, 2935)
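The filtering step keeps only tags that appear in more than 10 movies and then drops two uninformative columns. A small NumPy sketch of the same idea follows; the toy matrix, the tag names, and the threshold of 2 are hypothetical and chosen so the example stays readable.

```python
import numpy as np

# Hypothetical binary movie-by-tag matrix (4 movies, 4 tags).
tag_names = np.array(["batman", "hero", "rare", "watched"])
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
])

min_movies = 2                      # the notebook uses "more than 10"
uninformative = ["watched"]         # tags that say nothing about content

doc_freq = X.sum(axis=0)            # number of movies carrying each tag
keep = (doc_freq >= min_movies) & ~np.isin(tag_names, uninformative)

X_filtered = X[:, keep]
kept_names = tag_names[keep]
print(kept_names, X_filtered.shape)
```

Because the matrix is binary, the column sum is the document frequency, so setting `min_df=11` on `CountVectorizer` should apply a comparable cut at fit time and keep the matrix sparse.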

In [24]:
tags_vectorized.head()

Unnamed: 0,007,01,02,03,04,05,06,07,08,09,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
tags_final = pd.concat([df_merged["movieId"], tags_vectorized], axis = 1)

tags_final.drop(tags_final[tags_final["movieId"].isin(without_tags)].index, inplace = True) # removing movies without any tag

tags_final.head() # looks fine here

Unnamed: 0,movieId,007,01,02,03,04,05,06,07,08,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# Code to create the pivot table efficiently as a CSR matrix
"""
new_ratings = pd.merge(movies.drop("genres", axis = 1), ratings, on = "movieId")

new_ratings.head()

user_ids = new_ratings['userId'].unique()
title_ids = new_ratings['title'].unique()

user_to_row = {user_id: i for i, user_id in enumerate(new_ratings['userId'].unique())} # maps each userId to a row index
title_to_col = {title: j for j, title in enumerate(new_ratings['title'].unique())} # maps each title to a column index

rows = [user_to_row[user_id] for user_id in new_ratings['userId']]
cols = [title_to_col[title] for title in new_ratings['title']]
ratings_values = new_ratings['rating'].tolist()

sparse_matrix = csr_matrix((ratings_values, (rows, cols)), shape=(len(user_ids), len(title_ids)))
movie_names = pd.Series(list(title_to_col.keys()))
user_names = pd.Series(list(user_to_row.keys()))
pivot_csr = sparse_matrix
pivot_csc = sparse_matrix.tocsc()

print(sparse_matrix.shape)
"""

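The disabled cell above builds the user-by-title pivot with Python dictionaries and list comprehensions. A vectorized variant is sketched below on hypothetical toy arrays: `np.unique(..., return_inverse=True)` yields the row and column codes directly. Unlike the dictionary approach, it orders users and titles by sorted value rather than by first appearance.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings in long format.
user_ids = np.array([10, 10, 42, 42, 7])
titles = np.array(["A", "B", "A", "C", "B"])
ratings = np.array([4.0, 3.5, 5.0, 2.0, 1.0])

# Unique labels plus, for every rating, the position of its label.
users, row_idx = np.unique(user_ids, return_inverse=True)
items, col_idx = np.unique(titles, return_inverse=True)

pivot_csr = csr_matrix((ratings, (row_idx, col_idx)),
                       shape=(len(users), len(items)))
print(users, items)
print(pivot_csr.toarray())
```

One caveat: when the same (row, column) pair occurs more than once, this constructor sums the values, which matters if ratings are keyed by title and a title is shared by several movieIds.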

In [27]:
"""
movies_Ids = movies_final["movieId"]
movie_names = movies_final["title"]

movies_final_array = movies_final.drop(["movieId", "title"], axis = 1)

movies_final = np.array(movies_final_array)
movies_csr = csr_matrix(movies_final_array)
# Still need to remove the title, set movieId as the index, and then convert to a sparse matrix
"""


In [28]:
movie_ids = tags_final["movieId"]
movie_names = pd.merge(movies[["movieId", "title"]], tags_final["movieId"], on = "movieId")["title"]

print("names: ",movie_names.shape,"  ids: ", movie_ids.shape)

tags_array = np.array(tags_final.drop("movieId", axis = 1))
tags_csr = csr_matrix(tags_final.drop("movieId", axis = 1))

names:  (27069,)   ids:  (27069,)


In [29]:
def get_recommendation_item(movie_name, metric):
    # exact title comparison; str.match would treat "(1997)" as a regex group
    if (movie_name is None) or not (movie_names == movie_name).any():
        print("Movie not found.")
        return None

    # position of the first movie with this title (a few titles are duplicated)
    movie_idx = int(np.flatnonzero((movie_names == movie_name).to_numpy())[0])
    movie = tags_array[movie_idx, :]
    print(movie_idx)

    if metric == "cosine":
        similarity = cosine_similarity(movie.reshape(1, -1), tags_csr)[0]
    elif metric == "pearson":
        similarity = correlation_pearson_sparse_row(movie, tags_csr)
    elif metric == "kernel":
        similarity = linear_kernel(movie.reshape(1, -1), tags_csr)[0]
    else:
        print("Unknown Metric.")
        return None

    order = np.argsort(-similarity)

    # skip position 0, which is expected to be the queried movie itself
    top_10 = similarity[order[1:11]]
    top_10_names = movie_names.iloc[order[1:11]]

    return (list(top_10_names), top_10)
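Slicing with `order[1:11]` assumes the queried movie always lands in position 0. When another movie ties with it, that is not guaranteed. The helper below is a hypothetical alternative that masks the query index explicitly before ranking; the name `top_k_excluding` is not part of the notebook.

```python
import numpy as np

def top_k_excluding(similarity, query_idx, k=10):
    """Indices and scores of the k most similar items, never returning the query itself."""
    scores = np.asarray(similarity, dtype=float).copy()
    scores[query_idx] = -np.inf                      # push the query to the end of the ranking
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]

# Item 1 ties with the query (item 2) at the maximum score.
similarity = np.array([0.5, 1.0, 1.0, 0.2])
indices, values = top_k_excluding(similarity, query_idx=2, k=2)
print(indices, values)
```

In this toy case a plain `argsort` followed by `[1:]` would keep the query and drop item 1, while the masked version returns items 1 and 0.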

In [30]:
get_recommendation_item("Batman & Robin (1997)", "kernel")

1511


(['Batman Forever (1995)',
  'Batman (1989)',
  'Batman (1966)',
  'Superman III (1983)',
  'Batman Returns (1992)',
  'Batman Begins (2005)',
  'Dark Knight, The (2008)',
  'Superman IV: The Quest for Peace (1987)',
  'Superman (1978)',
  'Catwoman (2004)'],
 array([20., 18., 17., 17., 17., 17., 16., 15., 15., 14.]))

In [31]:
get_recommendation_item("Batman & Robin (1997)", "cosine")

1511


(['Batman (1966)',
  'Superman III (1983)',
  'Batman Forever (1995)',
  'Batman Returns (1992)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.52463139, 0.47892074, 0.47619048, 0.43719282, 0.41826814,
        0.40915854, 0.37219368, 0.37115374, 0.36004115, 0.34718254]))
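The kernel and cosine rankings above differ because, on binary vectors, the linear kernel simply counts shared tags, while cosine divides that count by the vector norms and so penalizes movies that carry many tags. A toy illustration with made-up vectors:

```python
import numpy as np

query = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)        # 3 tags
heavily_tagged = np.ones(9)                                        # 9 tags, shares 3
lightly_tagged = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # 2 tags, shares 2

candidates = np.vstack([heavily_tagged, lightly_tagged])

# Linear kernel: plain dot product, i.e. the number of shared tags.
kernel_scores = candidates @ query

# Cosine: the same dot product divided by the product of the norms.
cosine_scores = kernel_scores / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
print(kernel_scores, cosine_scores)
```

The heavily tagged movie wins under the kernel (3 shared tags against 2), but the lightly tagged one wins under cosine (about 0.82 against 0.58), which is consistent with popular, heavily tagged titles ranking higher in the kernel results.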

In [32]:
get_recommendation_item("Batman & Robin (1997)", "pearson")

1511


(['Batman (1966)',
  'Superman III (1983)',
  'Batman Forever (1995)',
  'Batman Returns (1992)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.51952297, 0.47262464, 0.46858591, 0.42965372, 0.41224561,
        0.40173273, 0.36819993, 0.36639143, 0.3565201 , 0.33558011]))

#### Some open issues:
    -> I need to figure out how to do the user-based version of this
    -> Think about how to generalize the code (create a class)
    -> Think about how to optimize it as well (the matrix type (csr or csc), the preprocessing, and so on)
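For the first open issue, one common option is to build a user profile as the rating-weighted average of the tag vectors of the movies that user rated, and then rank unseen movies by cosine similarity to that profile. The sketch below is only an assumption about a possible direction, on hypothetical toy data; the function names are not part of the notebook.

```python
import numpy as np

def build_user_profile(item_vectors, rated_idx, ratings):
    """Rating-weighted average of the tag vectors of the movies a user rated."""
    weights = np.asarray(ratings, dtype=float)
    return weights @ item_vectors[rated_idx] / weights.sum()

def recommend_for_user(item_vectors, rated_idx, ratings, k=2):
    """Rank movies the user has not rated by cosine similarity to the profile."""
    profile = build_user_profile(item_vectors, rated_idx, ratings)
    norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(profile)
    scores = item_vectors @ profile / np.where(norms == 0, 1.0, norms)
    scores[rated_idx] = -np.inf                      # skip already rated movies
    return np.argsort(-scores, kind="stable")[:k]

# Hypothetical movie-by-tag matrix: superhero, dark, comedy, romance.
items = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

recs = recommend_for_user(items, rated_idx=[0], ratings=[5.0], k=2)
print(recs)
```

With a single rated movie the profile equals that movie's vector, so the result matches the item-based ranking; with more ratings the profile blends the tags of everything the user liked.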