## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Genre Identification by Text Classification

## Sprint 3

We are solving a **Text Classification** problem: we train a model to predict a movie's genre from its description.

In this notebook, we will:
- Evaluate several classifiers with `TPOT`
- Keep TF-IDF as the feature extraction method

## 1. Get the Dataset
https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

In [1]:
import pandas as pd

In [2]:
df_train = pd.read_csv('./datasets/genre_classification_train_preprocessed.csv', sep=';')
df_test = pd.read_csv('./datasets/genre_classification_test_preprocessed.csv', sep=';')

In [3]:
df_train

| Unnamed: 0 | id | title | genre | description | label | description-pre |
|---|---|---|---|---|---|---|
| 0 | 27180 | Rivals (1972) | drama | Scott Jacoby, as a boy with an unhealthy and p... | 8 | scott jacoby boy unhealthy pathological attach... |
| 1 | 19975 | Kosava (1974) | drama | The story of two workers who returned from abr... | 8 | story workers returned abroad wants find good ... |
| 2 | 48284 | In Winter (2017) | drama | In Winter is an independent feature emerging f... | 8 | winter independent feature emerging classical ... |
| 3 | 37540 | Maria Chapdelaine (1950) | drama | At the beginning of the 20th century, in the N... | 8 | beginning th century north province quebec yea... |
| 4 | 43389 | The Gift Of (2018) | comedy | A delicious combo of romantic-comedy and socia... | 5 | delicious combo romanticcomedy social satire f... |
| ... | ... | ... | ... | ... | ... | ... |
| 43366 | 40649 | Mesto nic neví (1976) | crime | A summer's day. Sixteen-year-old Hedvika arriv... | 6 | summers day sixteenyearold hedvika arrives ost... |
| 43367 | 50892 | Join the Cult (2015) | documentary | Join The Cult follows Cult Of Tomorrows End, a... | 7 | join cult follows cult tomorrows end young ine... |
| 43368 | 28767 | Hancock's Half Hour: The New Neighbour (2016) | comedy | Whilst claiming all his neighbours are voyeurs... | 5 | whilst claiming neighbours voyeurs hancock kee... |
| 43369 | 37822 | New Project 'Zengin Sinifin Dizi Dibinde' (2013) | drama | Spring of 2013, Istanbul in the midst of youth... | 8 | spring istanbul midst youth upheavals iskender... |


In [4]:
df_test

| Unnamed: 0 | id | title | genre | description | label | description-pre |
|---|---|---|---|---|---|---|
| 0 | 14679 | Undesignated Driver (1996) | short | This video series, in national distribution wi... | 21 | video series national distribution film ideas ... |
| 1 | 8348 | Proteolysis (????) | action | "Proteolysis" is a gritty, action-adventure, s... | 0 | proteolysis gritty actionadventure set rural m... |
| 2 | 34987 | Intimately Yours (1998) | adventure | Love bondager Chelsea Pfeiffer ties and gags o... | 2 | love bondager chelsea pfeiffer ties gags harmo... |
| 3 | 15885 | 49 Days (????) | horror | Jason and Camille, sweethearts since childhood... | 13 | jason camille sweethearts childhood swimming n... |
| 4 | 42009 | The Torturer (2005) | horror | The twenty-four year-old aspirant actress Gine... | 13 | twentyfour yearold aspirant actress ginette ca... |
| ... | ... | ... | ... | ... | ... | ... |
| 10838 | 47207 | Uso Justo (2005) | short | When an experimental filmmaker decides to shoo... | 21 | experimental filmmaker decides shoot film fict... |
| 10839 | 53454 | The Perfect Girl (2015) | romance | A young boy (Jay) and a girl (Vedika) happen t... | 19 | young boy jay girl vedika happen meet tourist ... |
| 10840 | 21050 | "Trapped Minds" (2016) | drama | Trapped Minds is a 4-episode psychological thr... | 8 | trapped minds episode psychological thriller m... |
| 10841 | 44343 | Chronicles of a Silver Revolver (????) | short | With today's issue with gun violence and contr... | 21 | todays issue gun violence control director jd ... |
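
Both outputs show a `genre` string next to an integer `label`. A quick sanity check is to confirm that each label maps to exactly one genre and to look at the class counts. The sketch below runs on a handful of `(genre, label)` pairs copied from the outputs above; the `sample` frame is illustrative only, not the real dataset.

```python
import pandas as pd

# A few (genre, label) pairs copied from the outputs above; illustrative only.
sample = pd.DataFrame({
    'genre': ['drama', 'drama', 'comedy', 'crime', 'documentary', 'short', 'action', 'horror'],
    'label': [8, 8, 5, 6, 7, 21, 0, 13],
})

# Every label should correspond to exactly one genre, and vice versa.
genres_per_label = sample.groupby('label')['genre'].nunique()
labels_per_genre = sample.groupby('genre')['label'].nunique()
is_one_to_one = bool((genres_per_label == 1).all() and (labels_per_genre == 1).all())
print(f'One-to-one mapping: {is_one_to_one}')

# Class counts reveal imbalance, which matters when choosing an evaluation metric.
print(sample['genre'].value_counts())
```

On the real data, the same check would run on `df_train` instead of `sample`.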


## 2. Feature Extraction

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_train = tfidf.fit_transform(df_train['description-pre'])
y_train = df_train['label']

X_test = tfidf.transform(df_test['description-pre'])
y_test = df_test['label']

In [6]:
X_train.shape, X_test.shape

((43371, 128428), (10843, 128428))

In [7]:
print(f'Vocabulary size: {len(tfidf.vocabulary_)}')

Vocabulary size: 128428
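
The train and test matrices share the same 128428 columns because the vectorizer is fitted on the training descriptions only and then reused on the test descriptions. A minimal sketch of that behaviour on a toy corpus (the documents and the names `toy_tfidf`, `train_docs`, and `test_docs` are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for the preprocessed descriptions (illustrative only).
train_docs = ['young boy meets girl', 'gritty action adventure', 'boy finds revolver']
test_docs = ['girl finds spaceship']  # 'spaceship' never appears in train_docs

toy_tfidf = TfidfVectorizer()
toy_train = toy_tfidf.fit_transform(train_docs)  # learns the vocabulary and IDF weights
toy_test = toy_tfidf.transform(test_docs)        # reuses them; unseen words are ignored

print(toy_train.shape, toy_test.shape)           # both have one column per training term
print('spaceship' in toy_tfidf.vocabulary_)      # unseen test word is not in the vocabulary
```

Fitting on the training set only keeps test-set vocabulary and document frequencies out of the features.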


In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()

## 3. Evaluate Multiple Models with `TPOT`

In [19]:
from tpot import TPOTClassifier

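# 'TPOT sparse' restricts the search to operators that accept sparse input (the TF-IDF matrix)
# an integer cv is the number of k-fold splits, so it must be at least 2
model = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                       max_time_mins=2, scoring='f1_macro', random_state=42,
                       n_jobs=-1, config_dict='TPOT sparse', cv=2)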

In [None]:
# performing the search for best fit
model.fit(X_train, y_train)
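
TPOT searches over pipelines and ranks each candidate by the `scoring` metric under cross-validation. What that evaluation looks like for a single, fixed candidate can be sketched with scikit-learn alone. The toy corpus, the choice of `LogisticRegression`, and `cv=2` are assumptions for illustration, not what TPOT will necessarily select:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy two-class corpus (illustrative only); labels reuse the codes for action (0) and comedy (5).
docs = ['gritty action chase', 'action hero fights', 'explosive action battle', 'action car chase',
        'romantic comedy wedding', 'funny comedy satire', 'comedy roommates prank', 'silly comedy sketch']
labels = [0, 0, 0, 0, 5, 5, 5, 5]

# As in the notebook, TF-IDF is computed once, before cross-validation.
X_toy = TfidfVectorizer().fit_transform(docs)

clf = LogisticRegression(max_iter=1000)
# One macro-F1 score per fold; a search would compare the mean across candidates.
scores = cross_val_score(clf, X_toy, labels, scoring='f1_macro', cv=2)
print(scores, scores.mean())
```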

## 4. Evaluate the Model on the Test Set

In [None]:
# predict on the test set with the best pipeline found by TPOT
y_test_pred = model.predict(X_test)

In [None]:
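from sklearn.metrics import classification_report

# genre names ordered by integer label (assumes each label maps to exactly one genre)
genres = df_train.drop_duplicates('label').sort_values('label')
print(classification_report(y_test, y_test_pred, labels=genres['label'].tolist(),
                            target_names=genres['genre'].tolist()))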

In [None]:
from sklearn.metrics import f1_score

f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'F1 Test: {f1_test}')
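
Macro F1 is the unweighted mean of the per-class F1 scores, so rare genres count as much as frequent ones. The sketch below contrasts it with the support-weighted average on made-up, imbalanced labels (`y_true` and `y_pred` are illustrative, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy imbalanced labels (illustrative only): class 8 dominates, class 5 is rare.
y_true = [8, 8, 8, 8, 8, 8, 5, 5]
y_pred = [8, 8, 8, 8, 8, 8, 8, 5]

per_class = f1_score(y_true, y_pred, average=None)        # one F1 per class, ordered by label
macro = f1_score(y_true, y_pred, average='macro')         # unweighted mean of per_class
weighted = f1_score(y_true, y_pred, average='weighted')   # mean weighted by class support

print(per_class, macro, weighted)
```

The missed example of the rare class lowers the macro average more than the weighted one, which is why macro F1 is a stricter choice for imbalanced genre data.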

<br/>

The resulting **F1 score** did not improve after applying _text preprocessing_, at least for _Logistic Regression_.