## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Genre Identification by Text Classification

## Sprint 1

We will start by solving a **Text Classification** problem: training a model to predict a movie's genre from its description. <br/>

In this notebook, we will:
- Get the dataset
- Perform a simple analysis
- Split the dataset
- Perform feature extraction with TF-IDF (without text preprocessing)
- Train a simple model (logistic regression)

## 1. Get the Dataset
https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

In [None]:
import pandas as pd

In [None]:
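# fields are separated by ' ::: '; use a raw string so '\s' is read as a regex, not a string escape
df = pd.read_csv('./Genre Classification Dataset/train_data.txt', sep=r'\s:::\s', engine='python', header=None, names=['id', 'title', 'genre', 'description'])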
df
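
To see how the regular-expression separator splits each record, the sketch below parses two made-up lines that mimic the `id ::: title ::: genre ::: description` layout. The sample lines are hypothetical and are not taken from the dataset.

```python
import io

import pandas as pd

# Two hypothetical records in the same ' ::: '-separated layout
raw = (
    "1 ::: First Movie (2001) ::: drama ::: A family copes with loss.\n"
    "2 ::: Second Movie (2010) ::: comedy ::: Two friends open a bakery.\n"
)

# a multi-character regex separator requires the Python parsing engine
toy = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s:::\s',
    engine='python',
    header=None,
    names=['id', 'title', 'genre', 'description'],
)
print(toy)
```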

## 2. Simple EDA

### 2.1. Info

In [None]:
df.info()

In [None]:
# convert datatype from str to category
df['genre'] = df['genre'].astype('category')
df.info()

### 2.2. Class Proportion

In [None]:
df['genre'].value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.countplot(data=df, y='genre')

In [None]:
order = df['genre'].value_counts().index

plt.figure(figsize=(10,6))
sns.countplot(data=df, y='genre', order=order)

The dataset is hugely imbalanced!
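
One way to put a number on the imbalance is to compute the relative class frequencies and the ratio between the largest and the smallest class. This is a minimal sketch on a tiny made-up `toy` DataFrame (hypothetical data); with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical data, for illustration only)
toy = pd.DataFrame({'genre': ['drama'] * 6 + ['comedy'] * 3 + ['war'] * 1})

# relative frequency of each class
proportions = toy['genre'].value_counts(normalize=True)
print(proportions)

# ratio between the most and the least frequent class
counts = toy['genre'].value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f'Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}')
```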

### 2.3. Label Classes

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(df['genre'])

In [None]:
print('Class labels')

for label, class_ in enumerate(label_encoder.classes_):
    print(f'{class_} ==> {label}')

In [None]:
df['label'] = label_encoder.transform(df['genre'])

In [None]:
df

### 2.4. Check for Blank Descriptions

In [None]:
df[df['description'] == '']
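
Comparing against `''` finds only exactly empty strings. A slightly more defensive check, sketched below on a made-up `toy` DataFrame (hypothetical data), also counts missing values and whitespace-only descriptions.

```python
import pandas as pd

# Toy stand-in for df (hypothetical data, for illustration only)
toy = pd.DataFrame({'description': ['A detective story.', '', '   ', None]})

# missing values (NaN/None)
is_missing = toy['description'].isna()

# empty or whitespace-only strings (missing values are treated as blank here)
is_blank = toy['description'].fillna('').str.strip() == ''

n_missing = int(is_missing.sum())
n_blank_only = int((is_blank & ~is_missing).sum())
print(f'Missing: {n_missing}, blank or whitespace-only: {n_blank_only}')
```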

### 2.5. Word Cloud

In [None]:
from wordcloud import WordCloud

# Generate a word cloud image
text = ' '.join(df['description'])

wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,8))
plt.imshow(wordcloud)
plt.axis("off")

In [None]:
# classes/genres
genres = sorted(df['genre'].unique())
genres

In [None]:
# number of genres/classes
len(genres)

In [None]:
# plot a word cloud for each genre

# derive the grid size from the number of genres instead of hard-coding 9 x 3
n_cols = 3
n_rows = -(-len(genres) // n_cols)  # ceiling division
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 20))

# hide the axes of every panel, including any unused one
for ax in axes.flat:
    ax.axis('off')

for ax, genre in zip(axes.flat, genres):
    df_genre = df.query("genre == @genre")

    text = ' '.join(df_genre['description'])
    wordcloud = WordCloud().generate(text)
    ax.imshow(wordcloud)
    ax.set_title(f'{genre}')

While there are _stop words_ (which we should remove), we can clearly see that there is a **subset of specific words** related to each _genre_.

We should repeat this analysis after **_text cleaning/preprocessing_**.

## 3. Split the Dataset

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['genre'], random_state=42)

In [None]:
print(f"===> TRAIN\n{df_train['genre'].value_counts() / df_train.shape[0]}\n")
print(f"===> TEST\n{df_test['genre'].value_counts() / df_test.shape[0]}")

In [None]:
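# save the datasets (create the output folder first, otherwise to_csv fails)
import os
os.makedirs('./datasets', exist_ok=True)

df_train.to_csv('./datasets/genre_classification_train.csv', sep=';', index=False)
df_test.to_csv('./datasets/genre_classification_test.csv', sep=';', index=False)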

## 4. Feature Extraction

In [None]:
df_train.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# by default, it lowercases the text and keeps only tokens of 2+ alphanumeric characters
# (so punctuation is dropped); accents are NOT stripped unless strip_accents is set
tfidf = TfidfVectorizer()

X_train = tfidf.fit_transform(df_train['description'])
y_train = df_train['label']

X_test = tfidf.transform(df_test['description'])
y_test = df_test['label']
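
To check which preprocessing the vectorizer applies, one option is to run its analyzer on a sample string. The sentence below is made up for illustration. With the default settings the text is lowercased and single-character tokens and punctuation are dropped, while accents are kept unless `strip_accents` is set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up sample sentence
sample = 'Café owners, 2 DOGS & a 1999 mystery!'

# default analyzer: lowercasing + tokens of 2+ alphanumeric characters
default_tokens = TfidfVectorizer().build_analyzer()(sample)
print(default_tokens)

# same analyzer, but with accent stripping explicitly enabled
stripped_tokens = TfidfVectorizer(strip_accents='unicode').build_analyzer()(sample)
print(stripped_tokens)
```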

In [None]:
# all words in the vocabulary.
tfidf.vocabulary_

Note that the vocabulary contains many **stop words** and **numbers**, which carry little information about the genre and are likely to hinder our classification.
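
One possible way to keep stop words and numbers out of the vocabulary is to configure the vectorizer itself. This is a sketch under assumptions, not the approach used in this notebook: it relies on scikit-learn's built-in `'english'` stop-word list and on a letters-only `token_pattern`, and the two documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents, for illustration only
docs = [
    'In 1999 the detective and the thief meet in Paris.',
    'The 2 brothers open a bakery in 2005.',
]

# default settings: stop words and numbers end up in the vocabulary
default_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)

# drop English stop words and keep only tokens made of 2+ letters
filtered = TfidfVectorizer(
    stop_words='english',
    token_pattern=r'(?u)\b[^\W\d_]{2,}\b',
)
filtered_vocab = set(filtered.fit(docs).vocabulary_)

print(f'Default:  {sorted(default_vocab)}')
print(f'Filtered: {sorted(filtered_vocab)}')
```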

In [None]:
print(f'Vocabulary size: {len(tfidf.vocabulary_)}')

In [None]:
print('Number of Feats')
print(f'Train.shape: {X_train.shape}')
print(f'Test.shape: {X_test.shape}')

## 5. Train the Model

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(class_weight='balanced')

logreg.fit(X_train, y_train)
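
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so rare genres count more in the loss. The sketch below reproduces those weights on a made-up label vector (hypothetical data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# made-up imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y)

# weights computed by scikit-learn for the 'balanced' mode
weights = compute_class_weight('balanced', classes=classes, y=y)

# same weights computed by hand: n_samples / (n_classes * class_count)
manual = len(y) / (len(classes) * np.bincount(y))

for class_, weight in zip(classes, weights):
    print(f'class {class_}: weight {weight:.3f}')
```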

In [None]:
# prediction on training set
y_train_pred = logreg.predict(X_train)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred, target_names=label_encoder.classes_))

In [None]:
from sklearn.metrics import f1_score

f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'F1 Train: {f1_train}')
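
The `average='macro'` option takes the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The sketch below contrasts it with `average='weighted'` on made-up labels (hypothetical data) where the classifier ignores the minority class.

```python
from sklearn.metrics import f1_score

# made-up labels: the classifier always predicts the majority class
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

# per-class F1: class 1 is never predicted, so its F1 is set to 0
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

# macro: plain mean of per-class F1; weighted: mean weighted by class support
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f'Per-class F1: {per_class}')
print(f'Macro F1: {macro:.3f} | Weighted F1: {weighted:.3f}')
```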

In [None]:
from sklearn.metrics import balanced_accuracy_score

balacc_train = balanced_accuracy_score(y_train, y_train_pred)

print(f'Balanced Acc Train: {balacc_train}')
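
Balanced accuracy is the mean of the per-class recalls, which is why it suits imbalanced data better than plain accuracy. A minimal sketch on made-up labels (hypothetical data):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# made-up labels: the classifier always predicts the majority class
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# balanced accuracy matches the macro-averaged recall
macro_recall = recall_score(y_true, y_pred, average='macro', zero_division=0)

print(f'Accuracy: {acc:.2f} | Balanced accuracy: {bal_acc:.2f} | Macro recall: {macro_recall:.2f}')
```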

## 6. Evaluate the model on the Test Set

In [None]:
# prediction on testing set
y_test_pred = logreg.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred, target_names=label_encoder.classes_))

In [None]:
from sklearn.metrics import f1_score

f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'F1 Test: {f1_test}')
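
A confusion matrix can show which genres get mistaken for one another. The sketch below uses made-up labels and class names (hypothetical data); with the notebook's variables, `y_test`, `y_test_pred`, and `label_encoder.classes_` would take their place.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# made-up class names and labels, for illustration only
class_names = ['comedy', 'drama', 'horror']
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
cm_df.index.name = 'true'
cm_df.columns.name = 'predicted'
print(cm_df)
```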

<br/>

The resulting **F1 score** for the testing set is low, so we need to ***improve*** our solution. <br/>
Some possibilities:
- Perform _text preprocessing_
- Try different _feature extractors_
- Try different _classifiers_
- Tune the _hyperparameters_
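
Several of these ideas can be tried together by wrapping the vectorizer and the classifier in a `Pipeline` and searching over their settings with `GridSearchCV`. This is only a sketch under assumptions: the corpus is made up, the parameter grid is an arbitrary example, and `cv=2` is chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical data, for illustration only)
texts = [
    'A ghost haunts an old house', 'Zombies attack a small town',
    'A vampire stalks the night', 'A cursed doll terrorizes a family',
    'A clumsy waiter ruins a wedding', 'Two friends start a silly band',
    'A prank war gets out of hand', 'A goofy dad wins a talent show',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = horror, 1 = comedy

# feature extraction and classifier chained into a single estimator
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])

# example grid: the values are arbitrary choices for this sketch
param_grid = {
    'tfidf__stop_words': [None, 'english'],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=2)
search.fit(texts, labels)

print(f'Best params: {search.best_params_}')
print(f'Best cross-validated macro F1: {search.best_score_:.3f}')
```

Fitting the vectorizer inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the vocabulary.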