### Second Practice Activity of a Recommendation System

In the first notebook, I briefly introduced recommendation systems and how they are categorized. I also implemented two initial collaborative methods, an item-based movie recommender and a user-based movie recommender; both simply use similarity metrics to recommend new movies.

Here, I want to implement a content-based recommendation system using the same movie dataset. Content-based approaches use information about the item or user to find similar ones and recommend them. The dataset was prepared in another Python notebook from the movie_information data.
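As a minimal sketch of the idea, assume a toy catalog of four items described by hand-made binary features (the item names and features below are invented for illustration and do not come from the dataset). A content-based recommender ranks the other items by how similar their feature vectors are to the chosen item:

```python
import numpy as np

# Hypothetical toy catalog: rows are items, columns are binary content features.
item_names = np.array(["A", "B", "C", "D"])
feature_names = ["action", "comedy", "superhero", "romance"]
features = np.array([
    [1, 0, 1, 0],  # A: action, superhero
    [1, 0, 1, 1],  # B: action, superhero, romance
    [0, 1, 0, 1],  # C: comedy, romance
    [1, 0, 0, 0],  # D: action
], dtype=float)


def most_similar(item, k=2):
    """Return the k items with the highest cosine similarity to `item`."""
    idx = int(np.where(item_names == item)[0][0])
    norms = np.linalg.norm(features, axis=1)
    scores = features @ features[idx] / (norms * norms[idx])
    scores[idx] = -np.inf  # never recommend the item itself
    order = np.argsort(-scores)[:k]
    return list(item_names[order]), scores[order]


names, scores = most_similar("A")
print(names, scores)
```

The rest of this notebook follows the same pattern at scale: genres and tags become the binary features, and the similarity is computed against a sparse matrix instead of a small dense array.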

In [239]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [240]:
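def correlation_pearson_sparse_row(array, sparse_matrix):
    # Pearson correlation between a dense vector and every row of a CSR matrix.
    # Note: np.add.reduceat assumes every row has at least one stored entry.
    yy = array - array.mean()  # center the query vector
    xm = sparse_matrix.mean(axis=1).A.ravel()  # mean of each row
    ys = yy / np.sqrt(np.dot(yy, yy))  # centered query scaled to unit norm
    # norm of each centered row: sqrt(sum(x^2) - n * mean^2)
    xs = np.sqrt(np.add.reduceat(sparse_matrix.data**2, sparse_matrix.indptr[:-1]) - sparse_matrix.shape[1]*xm*xm)

    # ys sums to zero, so only the stored (non-zero) entries contribute to the numerator
    correl2 = np.add.reduceat(sparse_matrix.data * ys[sparse_matrix.indices], sparse_matrix.indptr[:-1]) / xs
    return correl2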
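
The function above never densifies the matrix. This works because the centered query vector sums to zero, so subtracting each row's mean does not change the numerator and only the stored entries need to be visited. Below is a small dense check of that identity on made-up data, using `np.corrcoef` as the reference (an illustration only, not part of the notebook's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((5, 8)) < 0.4).astype(float)  # toy binary "movie x feature" matrix
X[:, 0] = 1.0  # force one 1 and one 0 per row so that no row is constant
X[:, 1] = 0.0
y = X[0]       # use the first row as the query vector

n = X.shape[1]
yc = y - y.mean()
ys = yc / np.sqrt(yc @ yc)  # centered, unit-norm query
row_means = X.mean(axis=1)
row_norms = np.sqrt((X**2).sum(axis=1) - n * row_means**2)  # norm of each centered row

# The numerator uses the raw rows: sum(ys) == 0 makes the row-mean term vanish.
pearson_shortcut = (X @ ys) / row_norms

# Reference: correlation of the query with each row, from the full correlation matrix.
pearson_reference = np.corrcoef(X)[0]

print(np.allclose(pearson_shortcut, pearson_reference))
```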

In [241]:
movies = pd.read_csv("./movie_information/movie.csv")

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


In [242]:
tags = pd.read_csv("./movie_information/tag.csv")

tags = tags.drop(["timestamp", "userId"], axis = 1)

tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   movieId  465564 non-null  int64 
 1   tag      465548 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.1+ MB


In [243]:
tags['tag'] = tags['tag'].astype(str).apply(lambda x: x.capitalize().strip().replace(" ", ""))
tags = tags.drop_duplicates()

tags.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195750 entries, 0 to 465563
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   movieId  195750 non-null  int64 
 1   tag      195750 non-null  object
dtypes: int64(1), object(1)
memory usage: 4.5+ MB


In [244]:
tagsGrouped = tags.groupby(['movieId'], as_index=False).agg({'tag': '; '.join})

tagsGrouped

Unnamed: 0,movieId,tag
0,1,Watched; Computeranimation; Disneyanimatedfeat...
1,2,Timetravel; Adaptedfrom:book; Boardgame; Child...
2,3,Oldpeoplethatisactuallyfunny; Sequelfever; Gru...
3,4,Chickflick; Revenge; Characters; Clv
4,5,Dianekeaton; Family; Sequel; Stevemartin; Wedd...
...,...,...
19540,131054,Dinosaurs
19541,131082,Documentary; Yoshitomonara
19542,131164,Vietnamwar
19543,131170,Alternatereality


In [245]:
movies["genres"] = movies["genres"].replace("(no genres listed)", "-")

movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),-


In [246]:
vectorizer_tags = CountVectorizer()

tags_ = vectorizer_tags.fit_transform(tagsGrouped.loc[:, "tag"])
tags_names = vectorizer_tags.get_feature_names_out()

tags_.data = np.where(tags_.data != 0, 1, 0)
tags_

<19545x34957 sparse matrix of type '<class 'numpy.int32'>'
	with 213797 stored elements in Compressed Sparse Row format>

In [247]:
tags_counts = np.array(tags_.sum(axis=0))[0]
filter_ = (tags_counts > 10)

names_filtered = tags_names[filter_] # from ~35k tags down to ~3k

In [248]:
tags_ = pd.DataFrame((tags_[:, filter_].toarray()), columns = names_filtered)

tags_

Unnamed: 0,007,01,02,03,04,05,06,07,08,09,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19540,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19541,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19542,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19543,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [249]:
tags_ = tags_.drop(["tag", "watched"], axis = 1)

In [250]:
tags  = pd.concat([tagsGrouped, tags_], axis = 1)

tags

Unnamed: 0,movieId,tag,007,01,02,03,04,05,06,07,...,yakuza,yasujirôozu,youth,youtube,zachgalifianakis,zatoichi,zhangyimou,zombie,zombies,zooeydeschanel
0,1,Watched; Computeranimation; Disneyanimatedfeat...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Timetravel; Adaptedfrom:book; Boardgame; Child...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Oldpeoplethatisactuallyfunny; Sequelfever; Gru...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Chickflick; Revenge; Characters; Clv,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Dianekeaton; Family; Sequel; Stevemartin; Wedd...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19540,131054,Dinosaurs,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19541,131082,Documentary; Yoshitomonara,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19542,131164,Vietnamwar,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19543,131170,Alternatereality,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [251]:
tags = tags.drop("tag", axis = 1)

In [252]:
vectorizer_movies = CountVectorizer()

genres = vectorizer_movies.fit_transform(movies.loc[:, "genres"])
genres_names = vectorizer_movies.get_feature_names_out()

genres.data = np.where(genres.data != 0, 1, 0)
genres

<27278x21 sparse matrix of type '<class 'numpy.int32'>'
	with 56233 stored elements in Compressed Sparse Row format>

In [253]:
genres = pd.DataFrame((genres.toarray()), columns = genres_names)

In [254]:
movies.drop("genres", axis = 1, inplace = True)

movies_with_genres = pd.concat([movies, genres], axis = 1)

In [255]:
movies_final = movies_with_genres.merge(tags, on='movieId', suffixes=('', '_tags'))

In [256]:
movies_final["title"]

0                                     Toy Story (1995)
1                                       Jumanji (1995)
2                              Grumpier Old Men (1995)
3                             Waiting to Exhale (1995)
4                   Father of the Bride Part II (1995)
                             ...                      
19540    Dinotopia: Quest for the Ruby Sunstone (2005)
19541                                Playground (2009)
19542                             Vietnam in HD (2011)
19543                                 Parallels (2015)
19544                               The Pirates (2014)
Name: title, Length: 19545, dtype: object

In [257]:
movies_Ids = movies_final["movieId"]
movie_names = movies_final["title"]

movies_final_array = movies_final.drop(["movieId", "title"], axis = 1)

movies_final_array = np.array(movies_final_array)
movies_csr = csr_matrix(movies_final_array)
# TODO: still need to remove the title, set movieId as the index, and then convert to a sparse matrix

In [258]:
def get_recommendation_item(movie_name, metric):
    if (movie_name is None) or (movie_name not in movie_names.values):
        print("Movie not found.")
        return None
    
    movie = movies_final_array[pd.Index(movie_names).get_loc(movie_name), :]
    print(pd.Index(movie_names).get_loc(movie_name))

    if metric == "cosine":
        similarity = cosine_similarity(movie.reshape(1, -1), movies_csr)[0]
    elif metric == "pearson":
        similarity = correlation_pearson_sparse_row(movie, movies_csr) 
    elif metric == "kernel":
        similarity = linear_kernel(movie.reshape(1, -1), movies_csr)[0]
    else: 
        print("Unknown Metric.")
        return None
    
    order = np.argsort(-similarity)

    top_10 = similarity[order[1:11]]
    top_10_names = movie_names[order[1:11]]
    
    return (list(top_10_names), top_10)



In [259]:
get_recommendation_item("Batman & Robin (1997)", "kernel")

1267


(['Batman Forever (1995)',
  'Batman (1989)',
  'Batman (1966)',
  'Batman Begins (2005)',
  'Superman III (1983)',
  'Batman Returns (1992)',
  'Spider-Man 2 (2004)',
  'Dark Knight, The (2008)',
  'Superman IV: The Quest for Peace (1987)',
  'Spider-Man (2002)'],
 array([21., 19., 18., 18., 17., 16., 15., 15., 15., 14.]))

In [260]:
get_recommendation_item("Batman & Robin (1997)", "cosine")

1267


(['Batman (1966)',
  'Batman Forever (1995)',
  'Superman III (1983)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman Returns (1992)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.53833374, 0.48279051, 0.47331914, 0.41337595, 0.4043729 ,
        0.39581656, 0.36784039, 0.36681262, 0.35583   , 0.35398265]))

In [261]:
get_recommendation_item("Batman & Robin (1997)", "pearson")

1267


(['Batman (1966)',
  'Batman Forever (1995)',
  'Superman III (1983)',
  'Batman: Mask of the Phantasm (1993)',
  'Superman IV: The Quest for Peace (1987)',
  'Batman Returns (1992)',
  'Batman: Gotham Knight (2008)',
  'Batman Beyond: Return of the Joker (2000)',
  'Supergirl (1984)',
  'Batman (1989)'],
 array([0.53324821, 0.47506666, 0.4669355 , 0.40728632, 0.3968593 ,
        0.38745031, 0.36381138, 0.36200697, 0.35227936, 0.3421945 ]))
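
The `kernel` ranking differs from the other two because, on 0/1 vectors, the linear kernel is just the number of features two movies share, which is why its scores above are whole numbers. A movie with many tags can therefore score highly simply by having more features. Cosine similarity divides that count by the two vector norms. A toy illustration with invented vectors (not taken from the dataset):

```python
import numpy as np

query = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
few_tags = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)   # shares 2 of its 2 features
many_tags = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=float)  # shares 3 of its 8 features


def linear(a, b):
    """Dot product: on binary vectors, the count of shared features."""
    return float(a @ b)


def cosine(a, b):
    """Dot product normalized by both vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


for name, vec in [("few_tags", few_tags), ("many_tags", many_tags)]:
    print(name, linear(query, vec), round(cosine(query, vec), 3))
```

Here the linear kernel prefers `many_tags` while cosine prefers `few_tags`, so which metric fits better depends on whether heavily tagged movies should be favored.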

#### To do: finish tidying up the content-based recommendation, perhaps move the dataset preparation into separate files, comment the content-based approach in more detail, and explore more complex ways of developing it