# Information retrieval for movie recommendation

Dataset this project is based on:   
[HBO Max](https://www.kaggle.com/datasets/dgoenrique/hbo-max-movies-and-tv-shows)  


In [1]:
import numpy as np 
import pandas as pd

from ast import literal_eval

In [2]:
# Load the ratings and the links table, which maps each movieId to its ids in other sources
ratings = pd.read_csv('../data/tmdb/ratings.zip')
links = pd.read_csv('../data/tmdb/links.zip')

# Inner join between both files
ratings = ratings.merge(links, how='inner', on='movieId')

# Keep only the movies with more than 750 ratings, treated as popular enough for recommendation
pop_movies = ratings['movieId'].value_counts().to_frame().query('count > 750').index
pop_movies = links.query('movieId in @pop_movies')['tmdbId'].dropna()

del ratings, links


## Reading the Files

The datasets come in CSV format, so pandas alone was used to read them, and the tables were then combined with a merge.


In [3]:
# Load the 'credits' dataset from a zipped CSV file
dt_c = pd.read_csv('../data/tmdb/credits.zip')

# Load the 'movies_metadata' dataset from a zipped CSV file
dt_m = pd.read_csv('../data/tmdb/movies_metadata.zip')

# Convert the 'id' column to numeric; values that cannot be parsed become NaN
dt_m['id'] = pd.to_numeric(dt_m['id'], errors='coerce')

# Convert the 'popularity' column to numeric; values that cannot be parsed become NaN
dt_m['popularity'] = pd.to_numeric(dt_m['popularity'], errors='coerce')

# Left-join the 'credits' DataFrame onto 'movies_metadata' using the 'id' column
dt_m = dt_m.merge(dt_c.set_index('id'), how='left', left_on=['id'], right_index=True)

# Drop rows with missing values in the 'id' or 'overview' columns
dt_m.dropna(subset=['id', 'overview'], inplace=True)

# Keep only the movies that meet the minimum engagement threshold
dt_m.query('id in @pop_movies', inplace=True)

# Reset index 
dt_m.reset_index(drop=True, inplace=True)

# Delete the 'credits' DataFrame to free up memory
del dt_c



In [4]:
# Define the variables
v = 'vote_count'  # Vote count column
m = 'vote_count.quantile(0.85)'  # Quantile of vote count
R = 'vote_average'  # Vote average column
C = 'vote_average.mean()'  # Mean of vote average

# Evaluate the score using the defined variables and assign it to a new column 'score'
dt_m.eval(f'score = ({v}/({v}+{m}) * {R}) + ({m}/({m}+{v}) * {C})', inplace=True)
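The expression above is a weighted rating: each movie's own average `R` is blended with the global mean `C`, and the blend depends on how its vote count `v` compares to the threshold `m`. The following is a minimal sketch of the same arithmetic on made-up numbers, using NumPy instead of `DataFrame.eval`; the function name `weighted_rating` and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def weighted_rating(vote_count, vote_average, quantile=0.85):
    """Blend each movie's average with the global mean, weighted by vote count."""
    v = np.asarray(vote_count, dtype=float)
    R = np.asarray(vote_average, dtype=float)
    m = np.quantile(v, quantile)  # vote-count threshold (85th percentile by default)
    C = R.mean()                  # mean rating across all movies
    return v / (v + m) * R + m / (v + m) * C

# Hypothetical toy data: the first movie has a high average but very few votes
votes = np.array([10, 100, 1000, 5000])
averages = np.array([9.5, 8.0, 7.5, 8.5])
scores = weighted_rating(votes, averages)
print(scores)
```

A movie with few votes ends up close to the global mean, while a movie with many votes keeps a score close to its own average.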

In [5]:
# Top ranking by normalized score
dt_m[['title', 'vote_average', 'score']].sort_values(by='score', ascending=False).head(10)

| | title | vote_average | score |
|---|---|---|---|
| 231 | The Shawshank Redemption | 8.5 | 8.195274 |
| 3624 | The Dark Knight | 8.3 | 8.103938 |
| 535 | The Godfather | 8.5 | 8.101349 |
| 1753 | Fight Club | 8.3 | 8.058688 |
| 216 | Pulp Fiction | 8.3 | 8.034883 |
| 3868 | Inception | 8.1 | 7.946423 |
| 262 | Forrest Gump | 8.2 | 7.936551 |
| 4169 | Interstellar | 8.1 | 7.911596 |
| 731 | The Empire Strikes Back | 8.2 | 7.861528 |
| 2976 | The Lord of the Rings: The Return of the King | 8.1 | 7.854521 |


In [6]:
# Top ranking by popularity
dt_m[['title', 'popularity', 'score']].sort_values(by='popularity', ascending=False).head(10)

| | title | popularity | score |
|---|---|---|---|
| 4283 | Minions | 547.488298 | 6.437074 |
| 4218 | Big Hero 6 | 213.849907 | 7.553117 |
| 4249 | Deadpool | 187.860492 | 7.298742 |
| 4250 | Guardians of the Galaxy Vol. 2 | 185.330992 | 7.346307 |
| 3812 | Avatar | 185.070892 | 7.126373 |
| 4213 | John Wick | 183.870374 | 6.901203 |
| 4198 | Gone Girl | 154.801009 | 7.623994 |
| 4223 | The Hunger Games: Mockingjay - Part 1 | 147.098006 | 6.589483 |
| 4251 | Captain America: Civil War | 145.882135 | 7.005543 |
| 216 | Pulp Fiction | 140.950236 | 8.034883 |



## Text Preprocessing

To reduce processing bottlenecks and make relevant terms easier to identify, noise is first removed with a regex. Tokenization is then applied, turning the text into a list of words so that TF-IDF can be applied in a vector space model.

A few additional steps were also carried out so that the processing runs without errors.
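As a rough illustration of the TF-IDF weighting mentioned above, the sketch below computes it by hand for three made-up documents. It assumes raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2-normalized rows, which is my understanding of scikit-learn's default settings. The helper name `tfidf` and the sample documents are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Hypothetical, already-cleaned documents
docs = ["toy cowboy toy", "space ranger toy", "roman general gladiator"]
vocab, matrix = tfidf(docs)
print(vocab)
print([round(x, 3) for x in matrix[1]])
```

In the second document, "toy" receives a lower weight than "space" because "toy" also appears in another document, so it is less distinctive.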



### Genre handling

### Removing words and converting to lowercase

In [7]:
# Remove punctuation: drop every character that is neither a word character nor whitespace
dt_m['p_overview'] = dt_m['overview'].replace(r'[^\w\s]', '', regex=True)

# Apply str.lower() and str.strip() in a single pass
dt_m['p_overview'] = dt_m['p_overview'].apply(lambda x: x.lower().strip() if isinstance(x, str) else x)
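To make the effect of this cleanup concrete, here is a small sketch that applies the same regex and the same `lower().strip()` step to a single string with the standard `re` module. The helper name `clean_text` and the sample sentence are assumptions for illustration.

```python
import re

def clean_text(text):
    """Strip punctuation, lowercase, and trim surrounding whitespace."""
    if not isinstance(text, str):
        return text  # leave NaN/None values untouched
    no_punct = re.sub(r'[^\w\s]', '', text)
    return no_punct.lower().strip()

sample = "Led by Woody, Andy's toys live happily in his room!"
cleaned = clean_text(sample)
print(cleaned)  # led by woody andys toys live happily in his room
```

Note that the apostrophe is removed without inserting a space, so "Andy's" becomes "andys".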


In [8]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopwords corpus if not already downloaded
nltk.download('stopwords')

# Set the English stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from text
# Note: word_tokenize also needs the NLTK Punkt tokenizer data to be installed
def remove_stopwords(text):
    if isinstance(text, str):  # Skip NaN/None values; only strings are tokenized
        tokens = word_tokenize(text)  # Tokenize the text into words
        tokens_sem_stopwords = [token for token in tokens if token.lower() not in stop_words]  # Remove stopwords
        texto_sem_stopwords = ' '.join(tokens_sem_stopwords)  # Join the remaining tokens back into text
        return texto_sem_stopwords
    else:
        return text

# Apply the remove_stopwords function to the 'p_overview' column and update the values
dt_m['p_overview'] = dt_m['p_overview'].map(remove_stopwords)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



### Tokenization and Lemmatization

**Tokenization:** Text tokenization is the process of splitting a text into smaller units called tokens. These tokens can be individual words, characters, sentences, or even specific parts of a text, depending on the context and on the needs of the natural language processing task.

**Lemmatization:** Text lemmatization is a linguistic process that reduces words to their base form. The goal is to map inflected words to their canonical form, called the "lemma". For example, lemmatization turns "running" into "run", "cars" into "car", and so on.
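The two ideas can be sketched with the standard library alone. The example below tokenizes a made-up sentence pair at sentence, word, and character level, then maps words to lemmas through a hand-written dictionary. That dictionary (`toy_lemmas`) is only a stand-in for a real lemmatizer such as the WordNet one used in this notebook, and the sample text is hypothetical.

```python
import re

text = "The cars are running fast. The toy is waiting."

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Word-level tokens: lowercase runs of word characters
words = re.findall(r'\w+', text.lower())

# Character-level tokens of the first word
chars = list(words[0])

# Hand-written lookup standing in for a real lemmatizer (illustrative only)
toy_lemmas = {"cars": "car", "running": "run", "are": "be", "is": "be", "waiting": "wait"}
lemmas = [toy_lemmas.get(word, word) for word in words]

print(sentences)
print(words)
print(lemmas)
```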



In [9]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet corpus if not already downloaded
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize text
def lemmatize_text(text):
    if isinstance(text, str):  # Skip NaN/None values
        # Lemmatize each word (lemmatize() treats words as nouns by default) and join them back into a string
        return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    else:
        return text

dt_m['p_overview'] = dt_m['p_overview'].map(lemmatize_text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Identifying the queries / documents

The index of the queries was kept separately so that each one can be located in the original dataset after TF-IDF, since the TF-IDF matrix resets the per-document index.

In [10]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow_hub as hub

# Load the pre-trained word embedding model
embed = hub.load("../models/Wiki-words-250_2")

# Sentences to encode
sentences = dt_m['p_overview']

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Compute TF-IDF values for the sentences
tfidf_matrix = vectorizer.fit_transform(sentences)

# Get the vocabulary (feature names) from the vectorizer
vocabulary = vectorizer.get_feature_names_out()

# Generate the embeddings for the sentences
embeddings = embed(sentences)

# Stack the TF-IDF values and the sentence embeddings side by side:
# the first len(vocabulary) columns hold TF-IDF, the remaining columns hold the embedding
combined_matrix = np.hstack([tfidf_matrix.toarray(), embeddings.numpy()])



In [11]:
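# Construct a reverse map from movie title to row index
indices = pd.Series(dt_m.index, index=dt_m['title'])
# Keep the first row for each title (drop_duplicates() compares the values, which are already unique)
indices = indices[~indices.index.duplicated(keep='first')]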

In [18]:
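from sklearn.metrics.pairwise import linear_kernel

# linear_kernel computes plain dot products. They equal cosine similarities only when every row
# has unit L2 norm; the TF-IDF block is normalized, but the appended embeddings are not guaranteed to be
cosine_sim = linear_kernel(combined_matrix)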
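Because the combined rows concatenate a TF-IDF block with an embedding block, a dot product and a cosine similarity can rank neighbours differently. The sketch below shows this on three made-up rows with NumPy only. The helper `cosine_sim_matrix` and the numbers are assumptions for illustration, not values from the dataset.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity: L2-normalize each row, then take dot products."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

# Hypothetical rows: two unit-norm "TF-IDF" columns followed by two unnormalized "embedding" columns
rows = np.array([
    [1.0, 0.0, 3.0, 4.0],   # large embedding norm
    [0.0, 1.0, 0.3, 0.4],
    [1.0, 0.0, 0.3, 0.4],
])

dot = rows @ rows.T            # what a linear kernel returns
cos = cosine_sim_matrix(rows)  # scale-invariant alternative

print(dot[1])  # scores of row 1 against every row, by dot product
print(cos[1])  # the same comparison by cosine similarity
```

With the dot product, row 1 scores higher against the large-norm row 0 than against itself, so the top-ranked entry is not necessarily the query. With cosine similarity, the query always ranks first with a score of 1.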

In [19]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 100 most similar movies, skipping the first entry (expected to be the movie itself)
    sim_scores = sim_scores[1:101]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return those 100 movies, ordered by weighted score
    return dt_m[['title', 'score']].iloc[movie_indices].sort_values(by='score', ascending=False)
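The retrieval step in the function above can also be written with `np.argsort`. The sketch below is one possible variant, not the notebook's implementation: it removes the query row by its index instead of assuming that it sorts first. The name `top_k_similar` and the similarity values are made up for illustration.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k highest scores, excluding the query itself."""
    order = np.argsort(sim_row)[::-1]    # indices sorted by descending score
    order = order[order != query_idx]    # drop the query row explicitly
    return order[:k]

# Hypothetical similarity scores of movie 1 against five movies
sim_row = np.array([0.2, 1.0, 0.9, 0.4, 0.95])
top = top_k_similar(sim_row, query_idx=1, k=3)
print(top)  # [4 2 3]
```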

In [20]:
dt_m.dropna(subset=['title']).query('title.str.contains("Toy Story")')

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,crew,score,p_overview
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",7.444365,led woody andys toy live happily room andys bi...
1843,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",90000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story-2,863.0,tt0120363,en,Toy Story 2,"Andy heads off to Cowboy Camp, leaving his toy...",...,Released,The toys are back!,Toy Story 2,False,7.3,3914.0,"[{'cast_id': 18, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8025073', 'de...",7.087499,andy head cowboy camp leaving toy device thing...
3859,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",200000000,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",http://disney.go.com/toystory/,10193.0,tt0435761,en,Toy Story 3,"Woody, Buzz, and the rest of Andy's toys haven...",...,Released,No toy gets left behind.,Toy Story 3,False,7.6,4710.0,"[{'cast_id': 6, 'character': 'Woody (voice)', ...","[{'credit_id': '5770143fc3a3683733000f3a', 'de...",7.340308,woody buzz rest andys toy havent played year a...


In [21]:
get_recommendations('Toy Story')

| | title | score |
|---|---|---|
| 2055 | Gladiator | 7.606264 |
| 4173 | Captain America: The Winter Soldier | 7.381240 |
| 432 | Fargo | 7.209395 |
| 4243 | Mad Max: Fury Road | 7.196152 |
| 4297 | Straight Outta Compton | 7.092325 |
| ... | ... | ... |
| 3888 | Eat Pray Love | 6.327258 |
| 3928 | I Am Number Four | 6.219208 |
| 3828 | Valentine's Day | 6.208060 |
| 4049 | Total Recall | 6.084154 |


In [None]:
dt_m.query('title=="Toy Story"')['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [None]:
dt_m.query('title=="Gladiator"')['overview'][2055]

"In the year 180, the death of emperor Marcus Aurelius throws the Roman Empire into chaos. Maximus is one of the Roman army's most capable and trusted generals and a key advisor to the emperor. As Marcus' devious son Commodus ascends to the throne, Maximus is set to be executed. He escapes, but is captured by slave traders. Renamed Spaniard and forced to become a gladiator, Maximus must battle to the death with other men for the amusement of paying audiences. His battle skills serve him well, and he becomes one of the most famous and admired men to fight in the Colosseum. Determined to avenge himself against the man who took away his freedom and laid waste to his family, Maximus believes that he can use his fame and skill in the ring to avenge the loss of his family and former glory. As the gladiator begins to challenge his rule, Commodus decides to put his own fighting mettle to the test by squaring off with Maximus in a battle to the death."

In [None]:
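from sklearn.metrics.pairwise import cosine_similarity

# Compare the embeddings of two movies (rows 0 and 3)
cosine_similarity(embeddings[0].numpy().reshape(1, -1), embeddings[3].numpy().reshape(1, -1))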

array([[0.8839029]], dtype=float32)