# Fácil Espaider Technical Case: Multilabel Classification of Movie Genres Based on Synopsis

## Solution Planning

### Input

* Movie metadata dataset with the characteristics of each movie and its assigned genres.
* Movie synopses with an identifier used to reference the movie metadata dataset.

### Output

* What is the delivery format of the solution?
    * 1 A Jupyter Notebook, in which the project was developed in cycles in order to manage and plan the next steps.
    * 2 A production model exposed through a Telegram bot that queries the model and returns the predicted genre labels for the given input.

Methodology
* CRISP-DS, an agile, cyclical methodology for developing data science projects
* Multilabel classification approach using NLP


Tools Used
* Python 3.10.6, Jupyter-Lab, Poetry, Git, Github.

## Implementations Carried Out in the Sprint

### Cycle 1

* Understanding the problem.
* Library imports and helper functions.
* Reading and understanding the data.
* Data description.
* Filtering the variables that will be used in the model.
* Exploratory analysis of the movie genre distribution.
* Feature Engineering (Bag of Words using TF-IDF)
* Metrics Definition
* Testing New Data



# 0.0 Imports and Helper Functions

In [1]:
import pandas as pd
import numpy as np
import csv
import json
import seaborn as sns
import unicodedata
import warnings
import pickle
import re

from tqdm import tqdm

from matplotlib import pyplot as plt
from IPython.display import HTML, display

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
#nltk.download('stopwords')
from nltk.corpus   import stopwords
from nltk.stem     import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
stop_words = set(stopwords.words('english'))

from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MultiLabelBinarizer


from sklearn.metrics import accuracy_score, precision_score, recall_score,f1_score,hamming_loss,jaccard_score,make_scorer

from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer


from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB


In [2]:
# stop_words_file = '../data/stopwords.txt'

In [3]:
# with open(stop_words_file, 'r') as file:
#     stop_words = set(file.read().split())

## 0.1 Helper Functions

In [4]:
def jupyter_settings():
    %matplotlib inline
    #%pylab inline
    
    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [25,12]
    plt.rcParams['font.size'] = 24
    
    display(HTML ('<style>.container {width:100% !important;} </style>') )
    pd.options.display.max_columns=None
    pd.options.display.max_rows = None
    pd.set_option('display.expand_frame_repr',False)
    pd.set_option('display.float_format', lambda x: '%.4f' % x)
    
    sns.set()
    
jupyter_settings()

# Expand contractions into full words (uses the global `contractions` dict loaded in section 4.0)
def cont_to_exp(x):
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x
    
def freq_words(x, terms = 30):
    all_words = ' '.join([text for text in x]) 
    all_words = all_words.split()
    fdist = nltk.FreqDist(all_words)
    words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())}) 
    
    # select the top `terms` most frequent words
    d = words_df.nlargest(columns="count", n = terms) 

    # visualize words and frequencies
    plt.figure(figsize=(12,15)) 
    ax = sns.barplot(data=d, x= "count", y = "word") 
    ax.set(ylabel = 'Word') 
    plt.show()
    
    
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)


def multilabel_metrics(model,y_true,y_pred):
   
    precision = precision_score(y_true,y_pred,average='weighted',zero_division=1)
    recall = recall_score(y_true,y_pred,average='weighted',zero_division=1)
    f1 = f1_score(y_true,y_pred,average='weighted',zero_division=1)
    hamming = hamming_loss(y_true,y_pred)
    jaccard = jaccard_score(y_true,y_pred,average='weighted',zero_division=1)
    
    
    model_name = model.__class__.__name__
    estimator_name = model.estimator.__class__.__name__
    full_name = '_'.join([model_name, estimator_name])
    
    return pd.DataFrame({'Model Name':full_name,
                         'precision': precision,
                         'recall': recall,
                         'f1_score': f1,
                         'hamming_loss': hamming,
                         'jaccard_score': jaccard},index=[0])


# Initialize stemmer and tokenizer
stemmer = SnowballStemmer("english")
tokenizer = RegexpTokenizer(r'\w+')


def tokenize_and_stem(text):
    # Tokenize the text into individual words
    tokens = tokenizer.tokenize(text.lower())
    
    # Apply the Snowball stemmer to each word
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    
    return stemmed_tokens


def clean_text(text):
    
    # Remove Url
    text = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)
    
    # Remove everything from a <ref tag up to the closing }}
    text = re.sub(r'<ref.*?}}', '', text)
    
    # lower case
    text = text.lower()
    
    # Remove Contraction and transform into full word
    text = cont_to_exp(text)
    
    # Remove Special Chars or punctuation
    text = re.sub('[^A-Z a-z 0-9-]+', '',text)
    
    # Removed Accented Chars
    text = remove_accented_chars(text)
    
    # Remove Stopwords
    text = remove_stopwords(text)
    
    # Remove all non-alphabetic characters
    text = re.sub('[^a-zA-Z]',' ',text)
    
    # Removed duplicated spaces
    text = " ".join(text.split())
    
    # Remove numbers in form of text
    text = re.sub(r'\b(zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred)\b', '', text)
    
    return text

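def predict_genres(model,text):
    # Relies on the global `tfidf`, `normalizer`, and `multilabel` objects loaded later in the notebook
    text = clean_text(text)
    text_vec = tfidf.transform([text])
    # Normalizer is stateless, so transform (not fit_transform) is enough at prediction time
    text_vec = normalizer.transform(text_vec)
    text_pred = model.predict(text_vec)
    
    # Map the binary prediction matrix back to genre names
    return multilabel.inverse_transform(text_pred)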


# 1.0 Data Description

## 1.1 Reading the Movie Data

In [5]:
df1 = pd.read_csv("../data/movie.metadata.tsv", sep = '\t', header = None)

### 1.1.1 Rename Columns

The dataset has no column names, so we name the columns we are interested in.

In [6]:
df1.columns = ['id_filme','col_2', 'nome_filme', 'cod_3', 'cod_4', 'cod_5', 'cod_6','cod_7','genero_filme']

## 1.2 Reading the Input Data (Movie Synopses in txt Format)

In [7]:
sinopse = []

with open("../data/plot_summaries.txt", 'r') as file:
    texto = csv.reader(file,dialect='excel-tab')
    
    for row in tqdm(texto):
        sinopse.append(row)

42303it [00:00, 74896.96it/s]


In [8]:
id_filme = []
sinopse_filme = []

In [9]:
for i in tqdm(sinopse):
    id_filme.append(i[0])
    sinopse_filme.append(i[1])
    
# Montar o DataFrame 

df_sinopse = pd.DataFrame({'id_filme':id_filme, 'sinopse_filme': sinopse_filme})

100%|██████████| 42303/42303 [00:00<00:00, 2126789.19it/s]


Check for duplicate ids in both DataFrames.
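
The notebook does not show the duplicate check itself. A minimal sketch of one way to do it is given below, using small stand-in frames rather than the real `df1` and `df_sinopse`; the frame names `meta` and `plots` and their values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the metadata and synopsis frames (hypothetical values)
meta = pd.DataFrame({'id_filme': ['1', '2', '2'], 'nome_filme': ['A', 'B', 'B']})
plots = pd.DataFrame({'id_filme': ['1', '2'], 'sinopse_filme': ['plot a', 'plot b']})

# Count rows whose id repeats an earlier row in each frame
dup_meta = int(meta.duplicated(subset='id_filme').sum())
dup_plots = int(plots.duplicated(subset='id_filme').sum())
print(dup_meta, dup_plots)

# Keep only the first occurrence of each id
meta_unique = meta.drop_duplicates(subset='id_filme', keep='first')
```

Whether duplicates should be dropped or inspected first depends on the data; the sketch only shows how to count and remove them.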

## 1.3 Data Types

We cast the movie id to string so that the two DataFrames can be joined.

In [10]:
df1['id_filme'] = df1['id_filme'].astype(str)

## 1.5 Join Datasets

To make the analysis easier, we join the synopses with the movie metadata.
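
Before running the join on the real data, the behavior of a left merge on string ids can be sketched on toy frames. The frame names `meta` and `plots` and their contents are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy frames: metadata ids parsed as integers, synopsis ids read as strings
meta = pd.DataFrame({'id_filme': [10, 20, 30], 'nome_filme': ['A', 'B', 'C']})
plots = pd.DataFrame({'id_filme': ['10', '30'], 'sinopse_filme': ['plot a', 'plot c']})

# Align the key dtypes before merging
meta['id_filme'] = meta['id_filme'].astype(str)

# A left join keeps every metadata row; movies without a synopsis get NaN
joined = pd.merge(meta, plots, on='id_filme', how='left')
missing = int(joined['sinopse_filme'].isna().sum())
print(joined)
```

Rows with a missing synopsis survive the left join, which is why a later `dropna()` is needed before training.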

In [11]:
df_final = pd.merge(df1,df_sinopse,on='id_filme',how='left')

# 2.0 Data Filtering

In [12]:
df2 = df_final.copy()

## 2.1 Select Columns

In [13]:
# Keep only the variables needed for the development of the project
df2 = df2[['id_filme','nome_filme','genero_filme','sinopse_filme']].copy()

## 2.2 Selecting Only the Genres

In [14]:
# Extract all genres
generos = []

for i in df2['genero_filme']:
    generos.append(list(json.loads(i).values()))
    
df2['genero_filme'] = generos
df2 = df2.dropna()

# Drop rows without a genre classification
df2_new = df2[~(df2['genero_filme'].str.len() == 0 )]

In [15]:
df2.shape, df2_new.shape

((42204, 4), (41793, 4))
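
The genre extraction and the empty-list filter above can be sketched on a tiny series. The raw strings below are hypothetical examples of a JSON object mapping a genre id to a genre name; the ids `g1`, `g2`, `g3` are made up.

```python
import json
import pandas as pd

# Hypothetical raw values: each cell is a JSON object mapping a genre id to a genre name
raw = pd.Series(['{"g1": "Drama", "g2": "Thriller"}', '{}', '{"g3": "Comedy"}'])

# Keep only the genre names
genres = raw.apply(lambda s: list(json.loads(s).values()))

# Drop rows whose genre list is empty
non_empty = genres[genres.str.len() > 0]
print(non_empty.tolist())
```

The `.str.len()` accessor works on list-valued cells as well as strings, which is what the filter in the cell above relies on.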

## Converting the Genres to labels

In [16]:
# multilabel = MultiLabelBinarizer()
with open('/home/jordanmalheiros/Estudismo/desafio_espaider/transformations/multilabel_transformation.pkl','rb') as f:
    multilabel = pickle.load(f)
y = multilabel.transform(df2_new['genero_filme'])

In [17]:
y.shape

(41793, 363)
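
Each of the 363 columns corresponds to one genre. As a small illustration of what the binarizer does, the sketch below fits a fresh `MultiLabelBinarizer` on three made-up genre lists instead of loading the pickled one.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre lists standing in for df2_new['genero_filme']
genre_lists = [['Drama', 'Thriller'], ['Comedy'], ['Comedy', 'Drama']]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(genre_lists)

# Columns follow the sorted class names
print(mlb.classes_)
print(y_toy)

# inverse_transform maps indicator rows back to genre tuples
decoded = mlb.inverse_transform(y_toy[:1])
```

`inverse_transform` is the same call `predict_genres` uses to turn model output back into genre names.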

# 3.0 EDA

In [18]:
df4 = df2_new.copy()

# 4.0 Data Preparation - PreProcessing and Cleaning

In [19]:
df5 = df4.copy()

In [20]:
with open('../data/contractions.txt') as file:
    data = file.read()

contractions = json.loads(data)

# Cleaning Sinopse Text
df5['sinopse_filme'] = df5['sinopse_filme'].apply(lambda x: clean_text(x))


# 5.0 Feature Creation

This project uses TF-IDF to convert the training and validation text into numerical vectors.

* Term frequency (TF) rewards terms that occur often within a document.
* On its own, raw frequency is a weak signal, since words like ‘this’ and ‘a’ occur very frequently across all documents; the inverse document frequency (IDF) factor down-weights such terms.

In [21]:
df6 = df5.copy()

## 5.1 Split Train and Test Dataset

In [22]:
x_train, x_validation, y_train, y_validation = train_test_split(df6['sinopse_filme'], y, test_size=0.2, random_state=42)

In [23]:
with open('/home/jordanmalheiros/Estudismo/desafio_espaider/transformations/tfidf_transformation.pkl','rb') as f:
    tfidf = pickle.load(f)

In [24]:
X = tfidf.transform(df6['sinopse_filme'])

In [25]:
X

<41793x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 5090379 stored elements in Compressed Sparse Row format>

In [26]:
# Create the TF-IDF matrix of word frequencies

# tfidf = TfidfVectorizer(analyzer='word',max_features=20000,max_df=0.85,min_df=3,ngram_range=(1, 3),tokenizer=tokenize_and_stem)
# pickle.dump(tfidf,open('/home/jordanmalheiros/Estudismo/desafio_espaider/transformations/tfidf_transformation.pkl','wb'))

# X = tfidf.fit_transform(df6['sinopse_filme'])

## 5.2 Apply TF-IDF Transformation and Normalization to the Dataset

Apply the fitted TF-IDF vectorizer and the normalizer to the training and validation sets.

In [27]:
# normalizer = Normalizer()

# pickle.dump(normalizer,open('/home/jordanmalheiros/Estudismo/desafio_espaider/transformations/norm_transformation.pkl','wb'))

with open('/home/jordanmalheiros/Estudismo/desafio_espaider/transformations/norm_transformation.pkl','rb') as f:
    normalizer = pickle.load(f)

x_train = tfidf.transform(x_train)
x_train = normalizer.fit_transform(x_train)

x_validation = tfidf.transform(x_validation)
x_validation = normalizer.transform(x_validation)


We first evaluate on the held-out validation set and later with cross-validation.
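
The evaluation cells are not included in this export. The sketch below shows what a hold-out evaluation followed by cross-validation could look like, assuming a one-vs-rest wrapper around `LogisticRegression` and synthetic data from `make_multilabel_classification` in place of the TF-IDF matrix and genre labels. The hyperparameters are placeholders, not the tuned values of the final model.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data standing in for the TF-IDF features and genre labels
X_toy, y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=42)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# One binary logistic regression per label
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_val)

# Hold-out metrics, using the same averaging as multilabel_metrics
val_f1 = f1_score(y_val, y_pred, average='weighted', zero_division=1)
val_hamming = hamming_loss(y_val, y_pred)
print(val_f1, val_hamming)

# 5-fold cross-validation on the full toy set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_validate(clf, X_toy, y_toy, cv=cv, scoring='f1_weighted')
print(cv_scores['test_score'].mean())
```

Hamming loss counts the fraction of individual label slots predicted wrongly, so lower is better, whereas the weighted F1 is higher for better models.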

# 6.0 Final Model

## 6.1 Join Train and Validation Dataset

In [28]:
x_train_final = X
y_train_final = y

## 6.2 Final Model - LogisticRegression()

In [29]:
with open('/home/jordanmalheiros/Estudismo/desafio_espaider/model/model_lr_tuned.pkl','rb') as f:
    model = pickle.load(f)

In [30]:
model

# 7.0 Predict Movie Genre

In [12]:
print('Enter the synopsis:')
sinopse = input()
# predict_genres(model,sinopse)

Enter the synopsis:


 On the lush alien world of Pandora live the Na'vi, beings who appear primitive but are highly evolved. Because the planet's environment is poisonous, human/Na'vi hybrids, called Avatars, must link to human minds to allow for free movement on Pandora. Jake Sully (Sam Worthington), a paralyzed former Marine, becomes mobile again through one such Avatar and falls in love with a Na'vi woman (Zoe Saldana). As a bond with her grows, he is drawn into a battle for the survival of her world.


# TEST API

In [13]:
import requests

In [14]:
def API_Test(sinopse,endpoint):
    
    # Send the raw synopsis as plain text to the prediction endpoint
    header = {'Content-type': 'text/plain'}
    data = sinopse

    r = requests.post(endpoint,data=data,headers=header)
    print('Status Code {}'.format(r.status_code))
    
    
    # The first element of the JSON response is the list of predicted genres
    r_json = r.json()
    genres = r_json[0]
    genre_str = ", ".join(genres)
    
    print('Genres Predicted:{}'.format(genre_str))
    

In [16]:
# Endpoint Local
# url = 'http://127.0.0.1:5000/genres_pred/predict'

# Endpoint Production
url = 'https://jbm-genrepred-deploy.onrender.com/genres_pred/predict'
# url = 'https://jbm-genre-prediction.onrender.com/genres_pred/predict'

API_Test(sinopse,url)

Status Code 200
Genres Predicted:Action, Fantasy, Science Fiction


In [67]:
r.json()[0]

['Action',
 'Action/Adventure',
 'Adventure',
 'Airplanes and airports',
 'Costume Adventure',
 'New Hollywood',
 'Science Fiction',
 'Thriller']

In [68]:
# Print the list of genres predicted by the model
r_json = r.json()
genres = r_json[0]
genre_str = ", ".join(genres)
print(genre_str)


Action, Action/Adventure, Adventure, Airplanes and airports, Costume Adventure, New Hollywood, Science Fiction, Thriller
