# Genre Classification Project: Model Building and Evaluation

This notebook focuses on text vectorization, model building, evaluation, and exporting the final model.

## Contents
1. **Text Vectorization**:
    - Bag of Words (BOW)
    - N-Grams
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - Word2Vec

2. **Model Building**:
    - Multinomial Naive Bayes (NB)
    - Logistic Regression
    - Support Vector Classifier (SVC)

3. **Model Selection and Hyperparameter Tuning**:
    - Comparing performance of different models
    - Selecting the best-performing model
    - Applying basic hyperparameter tuning using GridSearchCV

4. **Pipeline and Export**:
    - Creating a complete pipeline including preprocessing and model
    - Exporting the final model using pickle

This notebook encapsulates the end-to-end process of building, evaluating, and exporting the best model for the genre classification task.


In [1]:
import numpy as np
import pandas as pd

### Import preprocessed data

In [3]:
df = pd.read_csv("data/movie-genre-classification-preprocessed.csv")
df.head()

Unnamed: 0,title,genre,description,cleaned_text
0,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...,listening conversation doctor parents year old...
1,Cupid (1997),thriller,A brother and sister with a past incestuous r...,brother sister past incestuous relationship cu...
2,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...,bus empties students field trip museum natural...
3,The Secret Sin (1915),drama,To help their unemployed father make ends mee...,help unemployed father make ends meet edith tw...
4,The Unrecovered (2007),drama,The film's title refers not only to the un-re...,film title refers recovered bodies ground zero...


In [14]:
df.shape

(54214, 4)

In [15]:
X = df["cleaned_text"]
y = df["genre"]

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score
import string

# Bag of Words with 2000 features

#### Training NB, SVC and LR and Saving the results

In [17]:
cv = CountVectorizer(max_features=2000)
X_bow = cv.fit_transform(X)

models = {
    "nb": MultinomialNB(),
    "svc": SVC(random_state=42),
    "lr": LogisticRegression(solver='lbfgs', max_iter=500)
}

# Split the data into training and test sets once; the split does not depend on the model
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

res = []

for name, model in models.items():
  # Train the model
  model.fit(X_train, y_train)
  # Make predictions on the test set
  y_pred = model.predict(X_test)
  print(f"{name} - OK")

  res.append([name, accuracy_score(y_test, y_pred)])

model_summary_bow = pd.DataFrame(res, columns=['model', 'accuracy'])
model_summary_bow

nb - OK
svc - OK
lr - OK


Unnamed: 0,model,accuracy
0,nb,0.510929
1,svc,0.545237
2,lr,0.537766
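
In the cell above the vectorizer is fitted on the full corpus before the train/test split, so vocabulary statistics from the test rows are visible during training. A minimal sketch, assuming a tiny made-up corpus (`toy_texts` and `toy_labels` are illustrative, not taken from the dataset), of wrapping the vectorizer and the classifier in a `Pipeline` so the vocabulary is learned from the training rows only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical, for illustration only)
toy_texts = [
    "detective hunts killer city", "killer stalks detective night",
    "couple falls love paris", "love story couple wedding",
    "detective solves murder case", "wedding love romance couple",
    "murder case killer escapes", "romance paris love letters",
]
toy_labels = ["thriller", "thriller", "romance", "romance",
              "thriller", "romance", "thriller", "romance"]

txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is fitted inside the pipeline, on training text only
bow_nb = Pipeline([
    ("bow", CountVectorizer(max_features=2000)),
    ("nb", MultinomialNB()),
])
bow_nb.fit(txt_train, lab_train)

vocab = set(bow_nb.named_steps["bow"].vocabulary_)
train_words = {w for doc in txt_train for w in doc.split()}
print(vocab <= train_words)  # the vocabulary holds training words only
print(bow_nb.score(txt_test, lab_test))
```

The same pattern would apply to the other vectorizers in this notebook; the accuracies reported here were obtained without it.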


# Bi-grams and Unigrams with 2000 features

#### Training NB, SVC and LR and Saving the results

In [18]:
cv = CountVectorizer(max_features=2000, ngram_range=(1,2))
X_ngrams = cv.fit_transform(X)

models = {
    "nb": MultinomialNB(),
    "svc": SVC(random_state=42),
    "lr": LogisticRegression(solver='lbfgs', max_iter=500)
}
# Split the data into training and test sets once; the split does not depend on the model
X_train, X_test, y_train, y_test = train_test_split(X_ngrams, y, test_size=0.2, random_state=42)

res = []

for name, model in models.items():
  # Train the model
  model.fit(X_train, y_train)
  # Make predictions on the test set
  y_pred = model.predict(X_test)
  print(f"{name} - OK")

  res.append([name, accuracy_score(y_test, y_pred)])

model_summary_ngrams = pd.DataFrame(res, columns=['model', 'accuracy'])
model_summary_ngrams

nb - OK
svc - OK
lr - OK


Unnamed: 0,model,accuracy
0,nb,0.508623
1,svc,0.543761
2,lr,0.539057


# TF-IDF with 2000 features

#### Training NB, SVC and LR and Saving the results

In [20]:
tfidf = TfidfVectorizer(max_features=2000)
X_tfidf = tfidf.fit_transform(X)

models = {
    "nb": MultinomialNB(),
    "svc": SVC(random_state=42),
    "lr": LogisticRegression(solver='lbfgs', max_iter=500)
}
# Split the data into training and test sets once; the split does not depend on the model
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

res = []

for name, model in models.items():
  # Train the model
  model.fit(X_train, y_train)
  # Make predictions on the test set
  y_pred = model.predict(X_test)
  print(f"{name} - OK")

  res.append([name, accuracy_score(y_test, y_pred)])

model_summary_tfidf = pd.DataFrame(res, columns=['model', 'accuracy'])
model_summary_tfidf

nb - OK
svc - OK
lr - OK


Unnamed: 0,model,accuracy
0,nb,0.506871
1,svc,0.561284
2,lr,0.568846


# Word2Vec Model

In [21]:
import gensim
import nltk
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [22]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
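# Concatenate all descriptions into one string for sentence tokenization
corpus = "".join(df['description'].values)

raw_sent = sent_tokenize(corpus)

story = []
stopwords = nltk.corpus.stopwords.words('english')

for sent in raw_sent:
  for word in nltk.word_tokenize(sent):
    if word in stopwords:
      # Caution: str.replace removes the stopword as a substring,
      # so it is also stripped from inside longer words
      sent = sent.replace(word, "")
  # Lowercase and tokenize the remaining text
  story.append(simple_preprocess(sent))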
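
The loop above removes stopwords with `str.replace`, which works on substrings rather than tokens. The sketch below, using a tiny hand-written stop list instead of NLTK's (an assumption made to keep it self-contained), contrasts that behavior with token-level filtering:

```python
sentence = "a cat sat on a mat"
stop_list = {"a", "on"}  # tiny illustrative stop list, not NLTK's

# Substring removal: every "a" and "on" disappears, even inside other words
by_replace = sentence
for word in sentence.split():
    if word in stop_list:
        by_replace = by_replace.replace(word, "")

# Token-level removal: only whole tokens are dropped
by_token = [w for w in sentence.split() if w not in stop_list]

print(by_replace.split())
print(by_token)  # ['cat', 'sat', 'mat']
```

Switching the notebook to token-level filtering would change the trained embeddings, so the numbers reported below would need to be regenerated.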

### Training custom Word2Vec model

In [25]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    vector_size=300
)

In [26]:
model.build_vocab(story)

In [27]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(16165019, 16983985)

### Function for feature vector

In [28]:
def get_feature_vector(document, model):
  """Return the mean Word2Vec vector of a document (zeros if no word is in the vocabulary)."""
  raw_sent = sent_tokenize(document)
  words = []
  for sent in raw_sent:
    for word in nltk.word_tokenize(sent):
      if word in stopwords:
        # Same substring-based removal as used when building the training corpus
        sent = sent.replace(word, "")
    words.extend(simple_preprocess(sent))

  # Get the word vectors for the words in the description
  word_vectors = [model.wv[word] for word in words if word in model.wv]
  if len(word_vectors) == 0:
    return np.zeros(model.vector_size)
  # Compute the mean of the word vectors
  feature_vector = np.mean(word_vectors, axis=0)
  return feature_vector
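
The averaging step can be illustrated without a trained model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `model.wv`; the helper name `mean_vector` is made up for this example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv
toy_vectors = {
    "space": np.array([1.0, 0.0, 2.0]),
    "war": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

print(mean_vector(["space", "war", "unknownword"], toy_vectors))  # [2. 1. 1.]
print(mean_vector(["unknownword"], toy_vectors))                  # [0. 0. 0.]
```

Averaging discards word order and gives every word equal weight, which is worth keeping in mind when comparing these features with the TF-IDF results.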

In [29]:
df['description'][0]

' Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.'

#### Training NB, SVC and LR and Saving the results

In [30]:
X_word2vec = df['description'].apply(lambda x: get_feature_vector(x, model))
X_word2vec = np.array(X_word2vec.tolist())

In [31]:
models = {
    "svc": SVC(random_state=42),
    "lr": LogisticRegression(solver='lbfgs', max_iter=1000)
}

# Split the data into training and test sets once; the split does not depend on the model
X_train, X_test, y_train, y_test = train_test_split(X_word2vec, y, test_size=0.2, random_state=42)

res = []

# Use `clf` as the loop variable so the Word2Vec `model` is not overwritten
for name, clf in models.items():
  # Train the model
  clf.fit(X_train, y_train)
  # Make predictions on the test set
  y_pred = clf.predict(X_test)
  print(f"{name} - OK")

  res.append([name, accuracy_score(y_test, y_pred)])

model_summary_word2vec = pd.DataFrame(res, columns=['model', 'accuracy'])
model_summary_word2vec

svc - OK
lr - OK


Unnamed: 0,model,accuracy
0,svc,0.514249
1,lr,0.521719
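
The four result tables can be combined into one view for model selection. The sketch below copies the accuracies reported above into a single DataFrame and pivots it; no new results are computed.

```python
import pandas as pd

# Accuracies copied from the result tables above
results = pd.DataFrame(
    [
        ("bow", "nb", 0.510929), ("bow", "svc", 0.545237), ("bow", "lr", 0.537766),
        ("ngrams", "nb", 0.508623), ("ngrams", "svc", 0.543761), ("ngrams", "lr", 0.539057),
        ("tfidf", "nb", 0.506871), ("tfidf", "svc", 0.561284), ("tfidf", "lr", 0.568846),
        ("word2vec", "svc", 0.514249), ("word2vec", "lr", 0.521719),
    ],
    columns=["vectorizer", "model", "accuracy"],
)

comparison = results.pivot(index="vectorizer", columns="model", values="accuracy")
best = results.loc[results["accuracy"].idxmax()]

print(comparison)
print(best["vectorizer"], best["model"], best["accuracy"])
```

TF-IDF with logistic regression scores highest (0.568846), which is consistent with the pipeline below being built on `TfidfVectorizer` and `LogisticRegression`.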


# Building Pipeline

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

In [6]:
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Custom Transformer
Read more about `FunctionTransformer`: [FunctionTransformer](https://www.linkedin.com/posts/pratheek-bedre_functiontransformer-activity-7214500699886272515-9uEz?utm_source=share&utm_medium=member_desktop)

In [7]:
def text_preprocessing(x):
  """Clean a pandas Series of raw documents and return the cleaned Series."""
  x_copy = x.copy()

  def remove_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

  stopwords = nltk.corpus.stopwords.words('english')
  def remove_stopwords(text):
    words = nltk.word_tokenize(text)
    # Drop stopwords and tokens of one or two characters
    return " ".join(w for w in words if w not in stopwords and len(w) > 2)

  # Convert all documents to lowercase
  x_copy = x_copy.str.lower()
  # Remove Twitter mentions from documents
  x_copy = x_copy.apply(lambda x: re.sub(r'@\S+', '', x))
  # Remove URLs from documents
  x_copy = x_copy.apply(lambda x: re.sub(r'http\S+', '', x))
  # Remove picture links from documents
  # (the unescaped "." matches any character, so words such as "picture" are removed too)
  x_copy = x_copy.apply(lambda x: re.sub(r'pic.\S+', '', x))
  # Keep only English letters (and "+"); digits and other characters become spaces
  x_copy = x_copy.apply(lambda x: re.sub(r'[^a-zA-Z+]', ' ', x))
  # Remove punctuation from all documents
  x_copy = x_copy.apply(remove_punctuation)
  # Remove stopwords and very short tokens
  x_copy = x_copy.apply(remove_stopwords)
  # Collapse repeated whitespace and trim leading/trailing spaces
  x_copy = x_copy.apply(lambda x: re.sub(r"\s\s+", " ", x).strip())
  # Stemming (stem() receives the whole document as a single string, not individual tokens)
  stemmer = PorterStemmer()
  x_copy = x_copy.apply(lambda x: stemmer.stem(x))

  return x_copy

TextPreprocessing = FunctionTransformer(text_preprocessing)
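
`PorterStemmer.stem` expects a single word. When it is given a whole document, as in the function above, the text is treated as one long word and only its ending is affected. A hedged sketch of the difference, and of a per-token alternative (not what the pipeline in this notebook does):

```python
from nltk.stem import PorterStemmer

toy_stemmer = PorterStemmer()
toy_text = "movies running"

# Whole-string call: the text is treated as one long word
whole = toy_stemmer.stem(toy_text)

# Per-token call: each word is stemmed separately
per_token = " ".join(toy_stemmer.stem(token) for token in toy_text.split())

print(whole)
print(per_token)  # movi run
```

Adopting per-token stemming would change the features seen by the model, so the tuning and test results below would have to be rerun.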

In [10]:
pipeline = Pipeline([
    ('text_preprocessing', TextPreprocessing),
    ('tfidf', TfidfVectorizer()),
    ('model', LogisticRegression(max_iter=500))
])

In [11]:
df = pd.read_csv("data/movie-genre-classification-preprocessed.csv")
df.head()

Unnamed: 0,title,genre,description,cleaned_text
0,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...,listening conversation doctor parents year old...
1,Cupid (1997),thriller,A brother and sister with a past incestuous r...,brother sister past incestuous relationship cu...
2,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...,bus empties students field trip museum natural...
3,The Secret Sin (1915),drama,To help their unemployed father make ends mee...,help unemployed father make ends meet edith tw...
4,The Unrecovered (2007),drama,The film's title refers not only to the un-re...,film title refers recovered bodies ground zero...


## Tuning

In [12]:
param_grid = {
    'tfidf__max_features': [2000, 3000, 5000],
    'model__C': [0.1, 1]
}

In [13]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['genre'])

In [14]:
grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(df['description'], y)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
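
The "6 candidates, totalling 18 fits" in the log follows directly from the grid: 3 values of `tfidf__max_features` times 2 values of `model__C`, each evaluated on 3 folds. A short check with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'tfidf__max_features': [2000, 3000, 5000],
    'model__C': [0.1, 1]
}
n_folds = 3

candidates = list(ParameterGrid(param_grid))
print(len(candidates))            # 6
print(len(candidates) * n_folds)  # 18
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full data after the search.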


In [17]:
grid_search.best_params_

{'model__C': 1, 'tfidf__max_features': 5000}

# Testing on Unseen data

In [19]:
test_df = pd.read_csv("data/test_data_solution.txt", sep=':::', names=['title', 'genre', 'description'], engine='python')
test_df.head()

Unnamed: 0,title,genre,description
1,Edgar's Lunch (1998),thriller,"L.R. Brane loves his life - his car, his apar..."
2,La guerra de papá (1977),comedy,"Spain, March 1964: Quico is a very naughty ch..."
3,Off the Beaten Track (2010),documentary,One year in the life of Albin and his family ...
4,Meu Amigo Hindu (2015),drama,"His father has died, he hasn't spoken with hi..."
5,Er nu zhai (1955),drama,Before he was known internationally as a mart...


In [20]:
y_pred = grid_search.best_estimator_.predict(test_df['description'])
y_true = label_encoder.transform(test_df['genre'])
accuracy_score(y_true, y_pred)

0.5823431734317344
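
Accuracy alone can hide weak performance on rare classes. If the genre distribution is skewed (this notebook does not check it), per-class metrics are worth reporting as well. The sketch below uses made-up labels to show how accuracy and macro-averaged F1 can diverge; applied to this notebook, `classification_report(y_true, y_pred)` would give the per-genre breakdown.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Toy labels (illustrative): one class dominates and the model always predicts it
toy_true = ["drama", "drama", "drama", "comedy"]
toy_pred = ["drama", "drama", "drama", "drama"]

acc = accuracy_score(toy_true, toy_pred)
macro_f1 = f1_score(toy_true, toy_pred, average="macro", zero_division=0)

print(acc)                 # 0.75
print(round(macro_f1, 3))  # 0.429
print(classification_report(toy_true, toy_pred, zero_division=0))
```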

# `58.23% Accuracy` On Unseen Data

## Exporting Model

In [22]:
label_encoder.inverse_transform(y_pred)

array([' drama ', ' drama ', ' documentary ', ..., ' comedy ', ' drama ',
       ' documentary '], dtype=object)
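
The predicted labels carry leading and trailing spaces (`' drama '`) because the file is split on `:::` and the padding around each field is kept. A minimal sketch of stripping that padding, using two made-up rows in an assumed `' ::: '`-delimited layout (the rows and the `id` column name are illustrative):

```python
import io
import pandas as pd

# Two made-up rows in an assumed ' ::: '-delimited layout (illustrative only)
raw = (
    "1 ::: First Title (1998) ::: thriller ::: Some description text\n"
    "2 ::: Second Title (1977) ::: comedy ::: Another description text\n"
)

sample = pd.read_csv(
    io.StringIO(raw), sep=':::', engine='python',
    names=['id', 'title', 'genre', 'description'],
)
padded = sample['genre'].tolist()
print(padded)    # [' thriller ', ' comedy ']

sample['genre'] = sample['genre'].str.strip()
stripped = sample['genre'].tolist()
print(stripped)  # ['thriller', 'comedy']
```

If labels are stripped, the same step has to be applied to both training and test data before fitting the `LabelEncoder`, since `transform` raises an error on labels it has not seen.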

In [25]:
import pickle

# Save the model to a file
with open('data/model.pkl', 'wb') as file:
    pickle.dump(grid_search.best_estimator_, file)

# Save the encoder to a file
with open('data/encoder.pkl', 'wb') as file:
    pickle.dump(label_encoder, file)
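
A sketch of the matching load step, under stated assumptions: toy stand-ins replace the fitted pipeline and encoder, and a temporary directory replaces `data/`. One point to keep in mind for the real pipeline is that pickle stores `text_preprocessing` by reference, so the function must be defined or importable under the same module name wherever the model is loaded, ideally with a compatible scikit-learn version.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the fitted pipeline and encoder (illustrative only)
toy_texts = ["killer detective murder", "love wedding romance",
             "murder case detective", "romance love letters"]
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(["thriller", "romance", "thriller", "romance"])
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=500)),
]).fit(toy_texts, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    encoder_path = os.path.join(tmp, "encoder.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(encoder_path, "wb") as f:
        pickle.dump(toy_encoder, f)

    # Load both artifacts back, as a serving script would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(encoder_path, "rb") as f:
        loaded_encoder = pickle.load(f)

new_text = ["detective investigates murder"]
predicted = loaded_encoder.inverse_transform(loaded_model.predict(new_text))
print(predicted)
```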