## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Genre Identification by Text Classification

## Sprint 3

We will start solving a **Text Classification** problem. We will train a model to predict movies' genres from their descriptions. <br/>

In this notebook, we will:
- Perform some _text preprocessing_
- Extract text features by gensim Word2Vec
- Run the previous experiments again

## 1. Get the Dataset
https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb

In [1]:
import pandas as pd

In [4]:
df_train = pd.read_csv('./datasets/genre_classification_train.csv', sep=';')
df_test = pd.read_csv('./datasets/genre_classification_test.csv', sep=';')

In [5]:
df_train

Unnamed: 0,id,title,genre,description,label
0,27180,Rivals (1972),drama,"Scott Jacoby, as a boy with an unhealthy and p...",8
1,19975,Kosava (1974),drama,The story of two workers who returned from abr...,8
2,48284,In Winter (2017),drama,In Winter is an independent feature emerging f...,8
3,37540,Maria Chapdelaine (1950),drama,"At the beginning of the 20th century, in the N...",8
4,43389,The Gift Of (2018),comedy,A delicious combo of romantic-comedy and socia...,5
...,...,...,...,...,...
43366,40649,Mesto nic neví (1976),crime,A summer's day. Sixteen-year-old Hedvika arriv...,6
43367,50892,Join the Cult (2015),documentary,"Join The Cult follows Cult Of Tomorrows End, a...",7
43368,28767,Hancock's Half Hour: The New Neighbour (2016),comedy,Whilst claiming all his neighbours are voyeurs...,5
43369,37822,New Project 'Zengin Sinifin Dizi Dibinde' (2013),drama,"Spring of 2013, Istanbul in the midst of youth...",8


In [6]:
df_test

Unnamed: 0,id,title,genre,description,label
0,14679,Undesignated Driver (1996),short,"This video series, in national distribution wi...",21
1,8348,Proteolysis (????),action,"""Proteolysis"" is a gritty, action-adventure, s...",0
2,34987,Intimately Yours (1998),adventure,Love bondager Chelsea Pfeiffer ties and gags o...,2
3,15885,49 Days (????),horror,"Jason and Camille, sweethearts since childhood...",13
4,42009,The Torturer (2005),horror,The twenty-four year-old aspirant actress Gine...,13
...,...,...,...,...,...
10838,47207,Uso Justo (2005),short,When an experimental filmmaker decides to shoo...,21
10839,53454,The Perfect Girl (2015),romance,A young boy (Jay) and a girl (Vedika) happen t...,19
10840,21050,"""Trapped Minds"" (2016)",drama,Trapped Minds is a 4-episode psychological thr...,8
10841,44343,Chronicles of a Silver Revolver (????),short,With today's issue with gun violence and contr...,21


## 2. Text Preprocessing

In [7]:
import neattext.functions as ntx

def text_preprocessing(text_in: str) -> str:
    text = text_in.lower()
    
    text = ntx.fix_contractions(text)
    text = ntx.remove_punctuations(text)
    text = ntx.remove_stopwords(text)
    text = ntx.remove_numbers(text)
    text = ntx.remove_emojis(text)
    text = ntx.remove_multiple_spaces(text)
    text = ntx.remove_special_characters(text)
    
    return text
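
To make the effect of these steps concrete, here is a rough approximation that uses only the standard library. It is a sketch, not how neattext implements its functions: the helper `toy_preprocessing` and the tiny stopword set `TOY_STOPWORDS` are hypothetical, and the example sentence is made up.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set (an assumption for illustration only)
TOY_STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "with", "as"}

def toy_preprocessing(text_in: str) -> str:
    text = text_in.lower()
    # drop ASCII punctuation (note: this glues hyphenated words together)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # drop digits (note: "20th" becomes "th")
    text = re.sub(r"\d+", "", text)
    # drop stopwords; joining on single spaces also collapses repeated whitespace
    words = [word for word in text.split() if word not in TOY_STOPWORDS]
    return " ".join(words)

example = "A delicious combo of romantic-comedy and social satire in 2018!"
print(toy_preprocessing(example))  # delicious combo romanticcomedy social satire
```

The same side effects are visible in the real outputs of this notebook, e.g. `romanticcomedy` and `th century` in the `description-pre` column.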

In [None]:
# progress bar in pandas
!pip install tqdm

In [8]:
from tqdm import tqdm
tqdm.pandas()  # it enables some new progress bar functions/methods for pandas

In [9]:
# pre-process the training set
df_train['description-pre'] = df_train['description'].progress_apply(lambda text: text_preprocessing(text))

100%|████████████████████████████████████████████████████████████████████████████| 43371/43371 [00:07<00:00, 5634.39it/s]


In [10]:
df_train.head()

Unnamed: 0,id,title,genre,description,label,description-pre
0,27180,Rivals (1972),drama,"Scott Jacoby, as a boy with an unhealthy and p...",8,scott jacoby boy unhealthy pathological attach...
1,19975,Kosava (1974),drama,The story of two workers who returned from abr...,8,story workers returned abroad wants find good ...
2,48284,In Winter (2017),drama,In Winter is an independent feature emerging f...,8,winter independent feature emerging classical ...
3,37540,Maria Chapdelaine (1950),drama,"At the beginning of the 20th century, in the N...",8,beginning th century north province quebec yea...
4,43389,The Gift Of (2018),comedy,A delicious combo of romantic-comedy and socia...,5,delicious combo romanticcomedy social satire f...


In [11]:
# pre-process the test set
df_test['description-pre'] = df_test['description'].progress_apply(lambda text: text_preprocessing(text))

100%|████████████████████████████████████████████████████████████████████████████| 10843/10843 [00:01<00:00, 5562.31it/s]


In [12]:
df_test.head()

Unnamed: 0,id,title,genre,description,label,description-pre
0,14679,Undesignated Driver (1996),short,"This video series, in national distribution wi...",21,video series national distribution film ideas ...
1,8348,Proteolysis (????),action,"""Proteolysis"" is a gritty, action-adventure, s...",0,proteolysis gritty actionadventure set rural m...
2,34987,Intimately Yours (1998),adventure,Love bondager Chelsea Pfeiffer ties and gags o...,2,love bondager chelsea pfeiffer ties gags harmo...
3,15885,49 Days (????),horror,"Jason and Camille, sweethearts since childhood...",13,jason camille sweethearts childhood swimming n...
4,42009,The Torturer (2005),horror,The twenty-four year-old aspirant actress Gine...,13,twentyfour yearold aspirant actress ginette ca...


In [13]:
# save the preprocessed datasets
df_train.to_csv('./datasets/genre_classification_train_preprocessed_sprint3.csv', sep=';', index=False)
df_test.to_csv('./datasets/genre_classification_test_preprocessed_sprint3.csv', sep=';', index=False)

## 3. Feature Extraction by gensim Word2Vec

In [None]:
!pip install gensim

In [14]:
import gensim.downloader as api

w2v_model = api.load('word2vec-google-news-300')

In [19]:
import numpy as np

def text_feat_extraction_by_word2vec(text: str, w2v_model) -> np.ndarray:
    words = text.split()
    
    words_embedding_list = []
    for word in words:
        # check if the word belongs to the (pre-trained) vocabulary
        if word in w2v_model:
            word_embedding = w2v_model[word]
            
            words_embedding_list.append(word_embedding)

    # do the same but in a pythonic way
    # words_embedding_list = [w2v_model[word] for word in words if word in w2v_model]

    if len(words_embedding_list) == 0:
        return np.zeros(300)

    words_embedding_np = np.array(words_embedding_list)
    
    # compute the average of each feature in the list of embeddings
    # return an averaged vector with 300 averages
    return words_embedding_np.mean(axis=0)
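
The averaging step can be checked by hand on a toy example. The sketch below assumes a plain dictionary `toy_model` with made-up 3-dimensional vectors standing in for the 300-dimensional pre-trained model; `mean_pool` is a hypothetical helper that mirrors the logic of the function above.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the real model uses 300 dimensions
toy_model = {
    "love":  np.array([1.0, 0.0, 2.0]),
    "story": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(text: str, model: dict, dim: int) -> np.ndarray:
    # keep only the words that have an embedding
    vectors = [model[word] for word in text.split() if word in model]
    if not vectors:
        # no known word: fall back to an all-zero vector
        return np.zeros(dim)
    # average each dimension across the words
    return np.mean(vectors, axis=0)

print(mean_pool("love story unknownword", toy_model, dim=3))  # [2. 1. 1.]
print(mean_pool("unknownword", toy_model, dim=3))             # [0. 0. 0.]
```

Out-of-vocabulary words are skipped, so they do not affect the average. A description with no known word at all maps to the zero vector.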

In [20]:
# returns a series with the result
X_train = df_train['description-pre'].progress_apply(lambda text: text_feat_extraction_by_word2vec(text, w2v_model))
X_test = df_test['description-pre'].progress_apply(lambda text: text_feat_extraction_by_word2vec(text, w2v_model))

100%|███████████████████████████████████████████████████████████████████████████| 43371/43371 [00:04<00:00, 10340.68it/s]
100%|████████████████████████████████████████████████████████████████████████████| 10843/10843 [00:01<00:00, 9807.35it/s]


In [22]:
print(type(X_train))
print(type(X_test))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [25]:
# it is a Series of numpy arrays
X_train

0        [0.07034633, 0.00084653107, -0.04805855, 0.077...
1        [0.10695253, 0.06018698, -0.012417497, 0.04755...
2        [0.060095984, 0.07227541, -0.034637567, 0.0578...
3        [0.026325287, 0.035334755, -0.0068383594, 0.03...
4        [0.08964, 0.028149772, -0.050548777, 0.1478370...
                               ...                        
43366    [0.05042376, 0.09617179, 0.047974724, 0.031843...
43367    [0.047248177, 0.058825534, 0.0027866364, 0.106...
43368    [0.03060913, 0.086205535, -0.0017564562, 0.033...
43369    [0.088308, 0.071201324, -0.008885701, 0.054457...
43370    [-0.01994053, 0.05230319, 0.013930782, 0.05590...
Name: description-pre, Length: 43371, dtype: object

In [26]:
# workaround to convert the Series of np.arrays into a 2D np.array
X_train = np.stack(X_train)
X_test = np.stack(X_test)
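
A small sketch of why this conversion is needed, using a toy Series (the name `toy_series` and its values are made up): converting a Series of arrays directly gives a 1D array of objects, while `np.stack` builds the 2D numeric matrix that scikit-learn expects.

```python
import numpy as np
import pandas as pd

# Toy Series holding one 3-dimensional vector per row
toy_series = pd.Series([np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])])

# direct conversion: a 1D array whose elements are themselves arrays
as_object = toy_series.to_numpy()
print(as_object.shape, as_object.dtype)  # (2,) object

# stacking: one row per sample, one column per feature
stacked = np.stack(toy_series)
print(stacked.shape, stacked.dtype)      # (2, 3) float64
```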

In [27]:
print(f'X_train.shape = {X_train.shape}')
print(f'X_test.shape = {X_test.shape}')

X_train.shape = (43371, 300)
X_test.shape = (10843, 300)


In [28]:
# labels
y_train = df_train['label']
y_test = df_test['label']

## 4. Train the models

In [29]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced', n_jobs=-1)

model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
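
This warning means the solver stopped at its iteration limit (the default `max_iter` of `LogisticRegression` is 100) before converging. Below is a minimal sketch of the two remedies the message itself suggests: scaling the features and raising `max_iter`. It runs on synthetic data from `make_classification` as a stand-in for `X_train`/`y_train`, and `max_iter=1000` is an arbitrary choice. Whether this changes the scores on the real dataset has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train); not the IMDb data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Standardize the features, then fit with a higher (arbitrarily chosen) iteration cap
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
clf.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations actually used by the solver
print(clf.named_steps['logisticregression'].n_iter_)
```

Wrapping the scaler and the classifier in one pipeline means the scaling learned on the training set is reused unchanged when predicting on the test set.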


In [30]:
# prediction on training set
y_train_pred = model.predict(X_train)

In [31]:
target_names = df_train[['genre', 'label']].sort_values(by='label')['genre'].unique()
target_names

array(['action', 'adult', 'adventure', 'animation', 'biography', 'comedy',
       'crime', 'documentary', 'drama', 'family', 'fantasy', 'game-show',
       'history', 'horror', 'music', 'musical', 'mystery', 'news',
       'reality-tv', 'romance', 'sci-fi', 'short', 'sport', 'talk-show',
       'thriller', 'war', 'western'], dtype=object)

In [32]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred, target_names=target_names))

              precision    recall  f1-score   support

      action       0.37      0.44      0.40      1052
       adult       0.29      0.77      0.42       472
   adventure       0.22      0.32      0.26       620
   animation       0.23      0.48      0.31       398
   biography       0.05      0.51      0.09       212
      comedy       0.65      0.39      0.49      5957
       crime       0.15      0.50      0.23       404
 documentary       0.82      0.32      0.46     10477
       drama       0.75      0.29      0.42     10890
      family       0.17      0.38      0.24       627
     fantasy       0.14      0.50      0.22       258
   game-show       0.49      0.90      0.63       155
     history       0.06      0.54      0.11       194
      horror       0.59      0.63      0.61      1763
       music       0.38      0.73      0.50       585
     musical       0.12      0.55      0.19       222
     mystery       0.13      0.52      0.20       255
        news       0.12    

In [33]:
from sklearn.metrics import f1_score

f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'F1 Train: {f1_train}')

F1 Train: 0.3454694024096135


In [34]:
from sklearn.metrics import balanced_accuracy_score

balacc_train = balanced_accuracy_score(y_train, y_train_pred)

print(f'Balanced Acc Train: {balacc_train}')

Balanced Acc Train: 0.5651363259974491
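
Both scores above are unweighted averages over the classes, so a rare genre counts as much as a frequent one: macro F1 is the mean of the per-class F1 scores, and balanced accuracy is the mean of the per-class recalls. The sketch below recomputes them from scratch on made-up labels. The helper `per_class_scores` is hypothetical and, unlike scikit-learn, only considers labels that occur in `y_true`.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (recall, f1)} computed from raw counts."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        recall = tp / (tp + fn)        # tp + fn > 0 because label occurs in y_true
        f1 = 2 * tp / (2 * tp + fp + fn)
        scores[label] = (recall, f1)
    return scores

# Toy labels: class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

scores = per_class_scores(y_true, y_pred)
balanced_acc = sum(recall for recall, _ in scores.values()) / len(scores)
macro_f1 = sum(f1 for _, f1 in scores.values()) / len(scores)

print(f'Balanced Acc: {balanced_acc:.3f}')  # Balanced Acc: 0.875
print(f'Macro F1: {macro_f1:.3f}')          # Macro F1: 0.829
```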


## 5. Evaluate the model on the Test Set

In [35]:
# prediction on testing set
y_test_pred = model.predict(X_test)

In [36]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred, target_names=target_names))

              precision    recall  f1-score   support

      action       0.34      0.40      0.36       263
       adult       0.27      0.72      0.40       118
   adventure       0.16      0.28      0.21       155
   animation       0.11      0.23      0.15       100
   biography       0.04      0.38      0.07        53
      comedy       0.64      0.38      0.48      1490
       crime       0.14      0.45      0.21       101
 documentary       0.81      0.32      0.46      2619
       drama       0.73      0.28      0.41      2723
      family       0.13      0.32      0.19       157
     fantasy       0.10      0.38      0.16        65
   game-show       0.39      0.67      0.50        39
     history       0.04      0.39      0.08        49
      horror       0.56      0.61      0.58       441
       music       0.39      0.74      0.51       146
     musical       0.06      0.31      0.10        55
     mystery       0.08      0.30      0.13        64
        news       0.07    

In [37]:
from sklearn.metrics import f1_score

f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'F1 Test: {f1_test}')

F1 Test: 0.30239037241574485


<br/>

The macro **F1 score** on the test set (about 0.30) did not improve over the previous experiments after switching to _Word2Vec_ features, at least for _Logistic Regression_.

We could try to improve these results by:
- applying other _text preprocessing_
    + stemming
    + lemmatization
    + POS tagging
    + ...
- training our own Word2Vec based on the corpus of the considered dataset
- evaluating other classifiers (e.g., MLP, Random Forest, ...)
- performing fine-tuning
- evaluating other feature extractors:
    + fastText
    + BERT
    + RoBERTa
    + ...
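
As one illustration of the list above, the sketch below compares two other classifiers by macro F1. It runs on synthetic data as a stand-in for the Word2Vec feature matrices, so the printed scores say nothing about the movie dataset. The hyperparameters are arbitrary placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the Word2Vec feature matrices; not the IMDb data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Hyperparameters below are arbitrary placeholders, not tuned values
candidates = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100, class_weight='balanced', random_state=0
    ),
    'MLP': MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = f1_score(y_te, clf.predict(X_te), average='macro')
    print(f'{name}: macro F1 = {results[name]:.3f}')
```

With the real data, `X_tr`/`y_tr` and `X_te`/`y_te` would be replaced by the `X_train`/`y_train` and `X_test`/`y_test` arrays built earlier in this notebook.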