## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Genre Identification by Text Classification

## Sprint 2

We continue working on a **Text Classification** problem: training a model to predict movies' genres from their descriptions. <br/>

In this notebook, we will:
- Perform _text preprocessing_
- Run the previous experiments again

## 1. Get the Dataset
https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

In [None]:
import pandas as pd

In [None]:
df_train = pd.read_csv('./datasets/genre_classification_train.csv', sep=';')
df_test = pd.read_csv('./datasets/genre_classification_test.csv', sep=';')

In [None]:
df_train

In [None]:
df_test

## 2. Text Preprocessing
- lowercase the text
- expand contractions
- remove:
  + punctuations
  + stop words
  + urls
  + emails
  + numbers
  + emojis
  + phone numbers
  + multiple whitespaces
  + currency symbols
  + special characters

In [None]:
import neattext.functions as ntx

def text_preprocessing(text_in: str) -> str:
    """Lowercase and clean a raw description, returning the cleaned text."""
    text = text_in.lower()
    text = ntx.fix_contractions(text)

    # strip structured patterns first: they rely on punctuation to be recognized
    text = ntx.remove_urls(text)
    text = ntx.remove_emails(text)
    text = ntx.remove_phone_numbers(text)
    text = ntx.remove_currency_symbols(text)
    text = ntx.remove_emojis(text)
    text = ntx.remove_numbers(text)

    # then drop punctuation and leftovers, and only then match stop words
    text = ntx.remove_punctuations(text)
    text = ntx.remove_special_characters(text)
    text = ntx.remove_stopwords(text)
    text = ntx.remove_multiple_spaces(text)

    return text
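
The order of the cleaning steps can matter. The stdlib-only sketch below does not use `neattext`: the regular expression and the helpers `strip_urls`, `strip_punctuation` and `squeeze_spaces` are simplified, hypothetical stand-ins. It suggests that stripping punctuation first can leave a URL glued together as a single token, whereas stripping URLs first removes it.

```python
import re
import string

# Simplified, hypothetical stand-ins for the library helpers (assumptions, not neattext's code)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text: str) -> str:
    """Replace anything that looks like a URL with a space."""
    return URL_PATTERN.sub(" ", text)

def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

def squeeze_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

sample = "see the trailer at https://example.com/trailer, it's great!"

# punctuation first: the URL loses '://' and '.', so the URL pattern no longer matches
punct_first = squeeze_spaces(strip_urls(strip_punctuation(sample)))

# URLs first: the URL is removed while it is still recognizable
urls_first = squeeze_spaces(strip_punctuation(strip_urls(sample)))

print(punct_first)
print(urls_first)
```

The same reasoning applies to emails, phone numbers and currency symbols, which are also recognized through punctuation-like characters.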

In [None]:
# progress bar in pandas
!pip install tqdm

In [None]:
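# pre-process the training set
from tqdm import tqdm

tqdm.pandas()  # registers `progress_apply` on pandas objects

df_train['description-pre'] = df_train['description'].progress_apply(text_preprocessing)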

In [None]:
df_train.head()

In [None]:
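# pre-process the testing set
from tqdm import tqdm

tqdm.pandas()  # registers `progress_apply` on pandas objects (safe to call again)

df_test['description-pre'] = df_test['description'].progress_apply(text_preprocessing)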

In [None]:
df_test.head()

In [None]:
# save the preprocessed datasets
df_train.to_csv('./datasets/genre_classification_train_preprocessed.csv', sep=';', index=False)
df_test.to_csv('./datasets/genre_classification_test_preprocessed.csv', sep=';', index=False)

## 3. Word Cloud for Train Set

In [None]:
# classes/genres
genres = df_train['genre'].unique()
genres

In [None]:
# plot a word cloud for each genre
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 9 x 3 grid: one subplot per genre
fig, axes = plt.subplots(9, 3, figsize=(15, 20))

for ax, genre in zip(axes.ravel(), genres):
    # join every description of this genre into a single text
    df_genre = df_train.query("genre == @genre")
    text = ' '.join(df_genre['description'])

    wordcloud = WordCloud().generate(text)
    ax.imshow(wordcloud)
    ax.set_title(genre)
    ax.axis('off')

While there are _stop words_ (which we should remove), we can clearly see that there is a **subset of specific words** related to each _genre_.

We should repeat this analysis after **_text cleaning/preprocessing_**.
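
One lightweight way to repeat the comparison numerically is to count the most frequent words per genre. The sketch below uses only the standard library and a tiny made-up sample (the rows and the helper `top_words` are hypothetical, not taken from the dataset). With the real data, the same idea would presumably be applied to the `description-pre` column.

```python
from collections import Counter, defaultdict

# Toy (genre, preprocessed description) pairs: made-up rows for illustration only
rows = [
    ("horror", "haunted house ghost haunted"),
    ("horror", "ghost hunts haunted night"),
    ("comedy", "family road trip goes wrong"),
    ("comedy", "wedding wrong family family"),
]

def top_words(rows, n=2):
    """Return the n most frequent words for each genre."""
    counts = defaultdict(Counter)
    for genre, description in rows:
        counts[genre].update(description.split())
    return {
        genre: [word for word, _ in counter.most_common(n)]
        for genre, counter in counts.items()
    }

for genre, words in top_words(rows).items():
    print(genre, words)
```

Comparing these lists before and after preprocessing gives a quick check of whether stop words still dominate a genre.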

## 4. Feature Extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_train = tfidf.fit_transform(df_train['description-pre'])
y_train = df_train['label']

X_test = tfidf.transform(df_test['description-pre'])
y_test = df_test['label']

In [None]:
X_train.shape, X_test.shape

In [None]:
print(f'Vocabulary size: {len(tfidf.vocabulary_)}')

The **vocabulary size has increased**, probably because of the _punctuation removal_ step. <br/>
I believe that _removing the punctuation_ from **compound words** such as `'well-known'` creates a _new word_, `'wellknown'`. However, the _corpus_ may also contain the single words `'well'` and `'known'`, which results in _three entries_ in the vocabulary instead of two.
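
This hypothesis can be checked on a toy corpus. The sketch below tokenizes with the regular expression that `TfidfVectorizer` uses by default (`token_pattern=r"(?u)\b\w\w+\b"`), which treats punctuation as a separator. The two sentences and the helpers `vocabulary` and `strip_punctuation` are made up for illustration.

```python
import re
import string

# Default token pattern of scikit-learn's TfidfVectorizer: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vocabulary(docs):
    """Set of distinct tokens across all documents."""
    return {tok for doc in docs for tok in TOKEN_PATTERN.findall(doc.lower())}

def strip_punctuation(text):
    """Delete every ASCII punctuation character (hypothetical stand-in for the cleaning step)."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Toy corpus (made up): the compound word and its two parts both occur
docs = ["a well-known actor", "the actor is well known"]

vocab_raw = vocabulary(docs)
vocab_clean = vocabulary([strip_punctuation(d) for d in docs])

print(len(vocab_raw), sorted(vocab_raw))
print(len(vocab_clean), sorted(vocab_clean))
```

On the raw text the hyphen splits `'well-known'` into two tokens that already exist, while after punctuation removal the glued form `'wellknown'` becomes an additional vocabulary entry.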

## 5. Train the Model

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(class_weight='balanced', n_jobs=-1)

logreg.fit(X_train, y_train)

In [None]:
# prediction on training set
y_train_pred = logreg.predict(X_train)

In [None]:
target_names = df_train[['genre', 'label']].sort_values(by='label')['genre'].unique()
target_names

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred, target_names=target_names))

In [None]:
from sklearn.metrics import f1_score

f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'F1 Train: {f1_train}')

In [None]:
from sklearn.metrics import balanced_accuracy_score

balacc_train = balanced_accuracy_score(y_train, y_train_pred)

print(f'Balanced Acc Train: {balacc_train}')
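
Both scores average per-class results with equal weight, so a rare genre counts as much as a frequent one. As a rough illustration (the toy labels and the helper `per_class_scores` are made up; in practice the `sklearn.metrics` functions above do this work), macro F1 is the unweighted mean of the per-class F1 scores and balanced accuracy is the unweighted mean of the per-class recalls:

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (recall, f1)} computed from scratch for each true label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        recall = tp / (tp + fn)
        f1 = 2 * tp / (2 * tp + fp + fn)
        scores[label] = (recall, f1)
    return scores

# Toy, imbalanced labels (made up): class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

scores = per_class_scores(y_true, y_pred)
balanced_acc = sum(recall for recall, _ in scores.values()) / len(scores)
macro_f1 = sum(f1 for _, f1 in scores.values()) / len(scores)

print(f'Balanced Acc: {balanced_acc:.3f}')
print(f'Macro F1: {macro_f1:.3f}')
```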

## 6. Evaluate the model on the Test Set

In [None]:
# prediction on testing set
y_test_pred = logreg.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred, target_names=target_names))

In [None]:
from sklearn.metrics import f1_score

f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'F1 Test: {f1_test}')

<br/>

The resulting **F1 score** has not improved after applying _text preprocessing_, at least for _Logistic Regression_.