### Second Practice Activity of a Recommendation System

In the first notebook, I briefly explained recommendation systems and how to divide them. Furthermore, I implemented two initial collaborative methods, an item-based movie recommendation, and a user-based movie recommendation; these are simply methods using similarity metrics to recommend new movies.

Here, I want to implement a Content recommendation system, using the same movie dataset. Content Approachs uses informations about the item/user to select similar ones and recommend. This dataset was prepared in the another Python Notebook using movie_informations.

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [2]:
def correlation_pearson_sparse_row(array, sparse_matrix):
    
    yy = array - array.mean() 
    xm = sparse_matrix.mean(axis=1).A.ravel() # pegando a média
    ys = yy / np.sqrt(np.dot(yy, yy)) # calculando ys
    xs = np.sqrt(np.add.reduceat(sparse_matrix.data**2, sparse_matrix.indptr[:-1]) - sparse_matrix.shape[1]*xm*xm)

    correl2 = np.add.reduceat(sparse_matrix.data * ys[sparse_matrix.indices], sparse_matrix.indptr[:-1]) / xs
    return correl2

In [3]:
movies = pd.read_csv("../movie_information/movie.csv")
# 27278 movies with 16 duplicated movies title, but with differents IDs
# i will maintain because they have differents movieIds

movies.info()

movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
tags = pd.read_csv("../movie_information/tag.csv").drop(["timestamp", "userId"], axis = 1) 
# 26744 movies

tags.info()

tags.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   movieId  465564 non-null  int64 
 1   tag      465548 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.1+ MB


Unnamed: 0,movieId,tag
0,4141,Mark Waters
1,208,dark hero
2,353,dark hero
3,521,noir thriller
4,592,dark hero


In [5]:
ratings = pd.read_csv("../movie_information/rating.csv").drop("timestamp", axis = 1) #use later
# 19545 movies

ratings.info()

ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int64  
 1   movieId  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 457.8 MB


Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [6]:
# modifying tags

formatted_tags = tags.copy()

formatted_tags['tag'] = tags['tag'].astype(str).apply(lambda x: x.capitalize().strip().replace(" ", "")) # formatting tags
formatted_tags = formatted_tags.drop_duplicates()

tags_grouped_by_id = formatted_tags.groupby(['movieId'], as_index=False).agg({'tag': '; '.join}) # to aggroup tags by movieId

tags_grouped_by_id.head()

Unnamed: 0,movieId,tag
0,1,Watched; Computeranimation; Disneyanimatedfeat...
1,2,Timetravel; Adaptedfrom:book; Boardgame; Child...
2,3,Oldpeoplethatisactuallyfunny; Sequelfever; Gru...
3,4,Chickflick; Revenge; Characters; Clv
4,5,Dianekeaton; Family; Sequel; Stevemartin; Wedd...


In [7]:
# formatting movies
movies.loc[:, "genres"] = movies.loc[:, "genres"].str.replace("|", "; ", regex = False)
movies.loc[:, "genres"].replace("(no genres listed)", "" , inplace = True) # basically change no genres listed to NaN

movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure; Animation; Children; Comedy; Fantasy
1,2,Jumanji (1995),Adventure; Children; Fantasy
2,3,Grumpier Old Men (1995),Comedy; Romance
3,4,Waiting to Exhale (1995),Comedy; Drama; Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
df_merged = pd.merge(movies, tags_grouped_by_id, on = "movieId", how = "outer") # outer join with the two dataframes

df_merged.loc[:, "genres"].replace(np.nan, "", inplace = True)
df_merged.loc[:, "tag"].replace(np.nan, "", inplace = True)
df_merged["tags"] = df_merged["genres"] + "; "+ df_merged["tag"] # uning genres and tags in one columns tags
df_merged.drop(["genres", "tag"], axis = 1, inplace = True)

df_merged.head()

Unnamed: 0,movieId,title,tags
0,1,Toy Story (1995),Adventure; Animation; Children; Comedy; Fantas...
1,2,Jumanji (1995),Adventure; Children; Fantasy; Timetravel; Adap...
2,3,Grumpier Old Men (1995),Comedy; Romance; Oldpeoplethatisactuallyfunny;...
3,4,Waiting to Exhale (1995),Comedy; Drama; Romance; Chickflick; Revenge; C...
4,5,Father of the Bride Part II (1995),Comedy; Dianekeaton; Family; Sequel; Stevemart...


In [9]:
without_tags = df_merged[df_merged["tags"] == "; "]["movieId"] # save to remove later

without_tags.shape

(209,)

In [10]:
print("tags: ",tags_grouped_by_id.shape, " \nmovies:", movies.shape, " \ndf_merged:", df_merged.shape)

tags:  (19545, 2)  
movies: (27278, 3)  
df_merged: (27278, 3)


In [11]:
vectorizer_tags = CountVectorizer()

tags_vectorized = vectorizer_tags.fit_transform(df_merged.loc[:, "tags"])
tags_names = vectorizer_tags.get_feature_names_out()

tags_vectorized.data = np.where(tags_vectorized.data != 0, 1, 0)
tags_vectorized

#talvez mudar dps de int32 para int8

<27278x34957 sparse matrix of type '<class 'numpy.int32'>'
	with 265140 stored elements in Compressed Sparse Row format>

In [12]:
tags_counts = np.array(tags_vectorized.sum(axis=0))[0]
filter_ = (tags_counts > 10)

names_filtered = tags_names[filter_] # 35k para 3k columns

tags_vectorized = pd.DataFrame((tags_vectorized[:, filter_].toarray()), columns = names_filtered)
tags_vectorized.drop(["tag", "watched"], axis = 1, inplace = True)

tags_vectorized.shape

(27278, 2935)

In [13]:
tags_vectorized.head()

Unnamed: 0,007,01,02,03,04,05,06,07,08,09,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
tags_final = pd.concat([df_merged["movieId"], tags_vectorized], axis = 1)

tags_final.drop(tags_final[tags_final["movieId"].isin(without_tags)].index, inplace = True) # removing movies without any tag

tags_final.head() # aqui ta ok

Unnamed: 0,movieId,007,01,02,03,04,05,06,07,08,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Code to create the pivot table in a efficient way in csr matrix
new_ratings = ratings.copy()

new_ratings.drop(new_ratings[~(new_ratings["movieId"].isin(tags_final["movieId"]))].index, inplace = True)

user_ids = new_ratings['userId'].unique()
title_ids = new_ratings['movieId'].unique()

user_to_row = {user_id: i for i, user_id in enumerate(new_ratings['userId'].unique())} # dict used to userNames
title_to_col = {title: j for j, title in enumerate(new_ratings['movieId'].unique())} # dict used to itemNames

rows = [user_to_row[user_id] for user_id in new_ratings['userId']]
cols = [title_to_col[title] for title in new_ratings['movieId']]
ratings_values = new_ratings['rating'].tolist()

sparse_matrix = csr_matrix((ratings_values, (rows, cols)), shape=(len(user_ids), len(title_ids)))
movie_names = pd.Series(list(title_to_col.keys()))
user_names = pd.Series(list(user_to_row.keys()))

movies_users_csr = sparse_matrix
movies_users_csc = sparse_matrix.tocsc()

print(sparse_matrix.shape)

(138493, 26535)


In [16]:
tags_final.drop(tags_final[~(tags_final["movieId"].isin(title_ids))].index, inplace = True)

movie_ids   = movies["movieId"]
movie_names = pd.merge(movies[["movieId", "title"]], tags_final["movieId"], on = "movieId")["title"]

print("Size: ",movie_names.shape[0])

tags_array = np.array(tags_final.drop("movieId", axis = 1))
tags_csr = csr_matrix(tags_final.drop("movieId", axis = 1))

Size:  26535


In [56]:
def get_recommendation_item(movie_name, metric):
    if (movie_name is None) or (movie_names.str.match(movie_name).any()):
        print("Movie not found.")
        return None
    
    movie = tags_array[pd.Index(movie_names).get_loc(movie_name), :]
    print(movie.reshape(1, -1).shape)
    print(pd.Index(movie_names).get_loc(movie_name))

    if metric == "cosine":
        similarity = cosine_similarity(movie.reshape(1, -1), tags_csr)[0]
    elif metric == "pearson":
        similarity = correlation_pearson_sparse_row(movie, tags_csr) 
    elif metric == "kernel":
        similarity = linear_kernel(movie.reshape(1, -1), tags_csr)[0]
    else: 
        print("Unknown Metric.")
        return None
    
    order = np.argsort(-similarity)

    top_10 = similarity[order[1:11]]
    top_10_names = movie_names[order[1:11]]
    
    return (list(top_10_names), top_10)

In [57]:
get_recommendation_item("Batman & Robin (1997)", "cosine")

(1, 2935)
1511


(['Batman (1966)',
  'Superman III (1983)',
  'Batman Forever (1995)',
  'Batman Returns (1992)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.52463139, 0.47892074, 0.47619048, 0.43719282, 0.41826814,
        0.40915854, 0.37219368, 0.37115374, 0.36004115, 0.34718254]))

In [19]:
get_recommendation_item("Batman & Robin (1997)", "cosine")

1511


(['Batman (1966)',
  'Superman III (1983)',
  'Batman Forever (1995)',
  'Batman Returns (1992)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.52463139, 0.47892074, 0.47619048, 0.43719282, 0.41826814,
        0.40915854, 0.37219368, 0.37115374, 0.36004115, 0.34718254]))

In [20]:
get_recommendation_item("Batman & Robin (1997)", "pearson")

1511


(['Batman (1966)',
  'Superman III (1983)',
  'Batman Forever (1995)',
  'Batman Returns (1992)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.51952297, 0.47262464, 0.46858591, 0.42965372, 0.41224561,
        0.40173273, 0.36819993, 0.36639143, 0.3565201 , 0.33558011]))

In [71]:
def get_recommendation_user(user_id, metric):
    first_user = (movies_users_csc[user_id, :]).toarray()[0] # tabela usuário linha, filme tabela
    watched_movies_index = np.where(first_user > 0.)[0]
    if metric == "cosine":
        similarity_func = cosine_similarity
    elif metric == "pearson":
        similarity_func = correlation_pearson_sparse_row
    else: 
        print("Unknown Metric.")
    # Calc similarity table
    
    # o que vamos usar para calcular
    
    # Creating the two Results for the ranking
    #total = np.zeros(num_columns)
    #similaritySums = np.zeros(num_columns)
    
    sim_normalized = np.zeros(tags_csr.shape[0])
    for review_index in watched_movies_index:
        item_reviewed = tags_array[movie_ids[movie_ids == title_ids[review_index]].index]
        
        review = first_user[review_index]
        similarities = cosine_similarity(item_reviewed.reshape(1, -1), tags_csr)[0]
        sim_normalized += (similarities * review)
        
    print(sim_normalized/5)
    
    # metade feito, falta selecionar legal os indices e ordenar os filmes
"""
    
    
    for review in watched_movies_reviews:
        item_reviewed = get_item_in_tags(review) # review seria as colunas, mas temos que ter informação quanto ao movieId
        for item in tags_csr:
            similarities = item_reviewed @ tags_csr
            sim_normalized += (similarities * review.nota) # testar assim
        






# ideia para o get_recommendation_user:
# preciso pegar um usuário e trazer filmes recomendados por ele com base em distância entre os itens.
#   -> pego os filmes assistidos pelos user, com nota
#   -> calculo toda a distância entre os filmes não assistidos e o filme assistido, multiplico pela nota do filme e salvo
#   -> repito o processo para todos os assistidos

"""

'\n    \n    \n    for review in watched_movies_reviews:\n        item_reviewed = get_item_in_tags(review) # review seria as colunas, mas temos que ter informação quanto ao movieId\n        for item in tags_csr:\n            similarities = item_reviewed @ tags_csr\n            sim_normalized += (similarities * review.nota) # testar assim\n        \n\n\n\n\n\n\n# ideia para o get_recommendation_user:\n# preciso pegar um usuário e trazer filmes recomendados por ele com base em distância entre os itens.\n#   -> pego os filmes assistidos pelos user, com nota\n#   -> calculo toda a distância entre os filmes não assistidos e o filme assistido, multiplico pela nota do filme e salvo\n#   -> repito o processo para todos os assistidos\n\n'

In [72]:
get_recommendation_user(1, "cosine")

[4.620441   2.61003401 3.26552764 ... 1.79362747 1.80724086 4.26359203]


In [None]:
def get_recommendation_user(user_id, metric):
    first_user = pivot_array[user_id, :]
    watched_movies_index = np.where(first_user > 0.)[0]

    if metric == "cosine":
        similarity_table = cosine_similarity(first_user.reshape(1, -1), pivot_csr)[0]
    elif metric == "pearson":
        similarity_table = correlation_pearson_sparse_row(first_user, pivot_csr)
    else: 
        print("Unknown Metric.")
        return None
    # Calc similarity table

    num_columns = pivot_csc.shape[1] # take num_movie
    
    # Creating the two Results for the ranking
    total = np.zeros(num_columns)
    similaritySums = np.zeros(num_columns)

    for col in range(num_columns):
        if col in watched_movies_index: continue
        column_data = pivot_csc[:, col].T # quero uma matriz linha
        # For each unwatched movie we calculate the (Similarity * Rating) Sum and the Similarity Sum 
        total[col] = column_data.dot(similarity_table)

        column_data.data = np.ones(column_data.data.shape)
        similaritySums[col] = column_data.dot(similarity_table)

    # the division between the total and similaritySums result in the "Predicted" Rate for that movie, calculated using the rate of the observed Rate
    similaritySums_smoothed = similaritySums + 1e-10 # to avoid division by 0
    index = (np.argsort(-(total/similaritySums_smoothed)))
    values = -(np.sort(-(total/similaritySums_smoothed)))
    return index, values

## Optimized

#### Alguns problemas:
    -> Preciso saber como fazer de User para isso
    -> Pensar na forma de generalizar o código (criar classe)
    -> Pensar em como otimizar também (o tipo de matriz(csr ou csc) o pré-processamento e tal)