Create a machine learning model that can predict the genre of a movie based on its plot summary or other textual information. You can use techniques like TF-IDF or word embeddings with classifiers such as Naive Bayes, Logistic Regression, or Support Vector Machines.

In [3]:
import os 
os.listdir()

['test_data_solution.txt',
 'train_data.txt',
 'test_data.txt',
 'Movie_Gener_Classification.ipynb',
 'description.txt']

In [4]:
import pandas as pd 
import warnings 
warnings.filterwarnings("ignore")
data = pd.read_csv("description.txt")
data


Unnamed: 0,Train data:
0,ID ::: TITLE ::: GENRE ::: DESCRIPTION
1,ID ::: TITLE ::: GENRE ::: DESCRIPTION
2,ID ::: TITLE ::: GENRE ::: DESCRIPTION
3,ID ::: TITLE ::: GENRE ::: DESCRIPTION
4,Test data:
5,ID ::: TITLE ::: DESCRIPTION
6,ID ::: TITLE ::: DESCRIPTION
7,ID ::: TITLE ::: DESCRIPTION
8,ID ::: TITLE ::: DESCRIPTION
9,Source:


2. Create and Apply Functions to Read Data by Splitting on " ::: "

In [5]:
def load_data(file_path):                                         # Takes the path of a data file
    with open(file_path, 'r', encoding='utf-8') as f:             # Open the file in read mode with UTF-8 encoding
        data = f.readlines()                                      # Read all lines into a list
    data = [line.strip().split(' ::: ') for line in data]         # Split each line on ' ::: ' ("ID ::: TITLE ::: GENRE ::: DESCRIPTION" format)
    return data                                                   # Return the list of split lines

In [6]:
train_data = load_data("train_data.txt")                                              # To load training dataset and pass to load_data function
train_df = pd.DataFrame(train_data, columns=['ID', 'Title', 'Genre', 'Description'])  # Convert to dataframe and rename column with proper column name 

test_data = load_data("test_data.txt")
test_df = pd.DataFrame(test_data, columns=['ID', 'Title','Description'])              # Convert to dataframe and rename column with proper column name 

test_solution = load_data("test_data_solution.txt")
test_solution_df = pd.DataFrame(test_solution, columns=['ID', 'Title', 'Genre', 'Description'])  # Solution has 'Genre' column 
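
The loader above assumes every line splits into exactly the expected number of fields. A minimal sketch of a more defensive variant follows; the name load_data_checked, its n_fields parameter, and the sample lines are hypothetical and not part of the original notebook. It caps the number of splits so that a " ::: " inside a description stays in the description, and it skips lines that do not have enough fields.

```python
import os
import tempfile

import pandas as pd


def load_data_checked(file_path, n_fields):
    """Split each line on ' ::: ' into at most n_fields parts; skip malformed lines."""
    rows = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            # maxsplit keeps any later ' ::: ' inside the last field (the description)
            parts = line.strip().split(" ::: ", maxsplit=n_fields - 1)
            if len(parts) == n_fields:
                rows.append(parts)
    return rows


# Tiny made-up file, used only to illustrate the behavior
sample = (
    "1 ::: Film A (2000) ::: drama ::: A quiet story.\n"
    "2 ::: Film B (2001) ::: comedy ::: Jokes ::: and more jokes.\n"
    "broken line without separators\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

sample_rows = load_data_checked(sample_path, n_fields=4)
sample_df = pd.DataFrame(sample_rows, columns=["ID", "Title", "Genre", "Description"])
os.remove(sample_path)
print(sample_df)
```

With the real files, the call would presumably be load_data_checked("train_data.txt", n_fields=4) for the training data and n_fields=3 for the unlabeled test data.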

In [7]:
print("Train_data:")
train_df  # should have 4 columns: ID, Title, Genre, Description

Train_data:


Unnamed: 0,ID,Title,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...
...,...,...,...,...
54209,54210,"""Bonino"" (1953)",comedy,This short-lived NBC live sitcom centered on B...
54210,54211,Dead Girls Don't Cry (????),horror,The NEXT Generation of EXPLOITATION. The siste...
54211,54212,Ronald Goedemondt: Ze bestaan echt (2008),documentary,"Ze bestaan echt, is a stand-up comedy about gr..."
54212,54213,Make Your Own Bed (1944),comedy,Walter and Vivian live in the country and have...


In [8]:
print('\nTest Data:')
test_df    # should have 3 columns: ID, Title, Description


Test Data:


Unnamed: 0,ID,Title,Description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),Before he was known internationally as a marti...
...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard ..."


In [9]:
print('\nTest Solution:')
test_solution_df    # should have 4 columns: ID, Title, Genre, Description


Test Solution:


Unnamed: 0,ID,Title,Genre,Description
0,1,Edgar's Lunch (1998),thriller,"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),comedy,"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),documentary,One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),drama,"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),drama,Before he was known internationally as a marti...
...,...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),adult,"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard ..."


3. Feature Extraction : TF-IDF

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorize = TfidfVectorizer(max_features=10000)

X_train_tfidf = vectorize.fit_transform(train_df["Description"])
X_test_tfidf = vectorize.transform(test_df["Description"])

print(f"Training Data Shape: {X_train_tfidf.shape}")
print(f"Test Data Shape: {X_test_tfidf.shape}")

Training Data Shape: (54214, 10000)
Test Data Shape: (54200, 10000)
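
Both matrices have the same number of columns because the vectorizer is fitted on the training descriptions only and then reused on the test descriptions. The following is a small sketch of that behavior on a made-up corpus (the toy_* names and sentences are illustrative assumptions): words that never appeared during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, used only to illustrate fit vs. transform
toy_train = [
    "a haunted house at night",
    "a funny road trip",
    "ghosts haunt the old house",
]
toy_test = ["a funny ghost in a spaceship"]

toy_vectorizer = TfidfVectorizer()
toy_train_tfidf = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary and IDF weights
toy_test_tfidf = toy_vectorizer.transform(toy_test)        # reuses them; unseen words are dropped

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_tfidf.shape, toy_test_tfidf.shape)
```

Calling fit_transform on the test set instead would build a different vocabulary, and the trained models could no longer be applied to it.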


4. Encoding The Target Labels

In [11]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_df["Genre"])
print(f"Unique genres in the training data: {label_encoder.classes_}")

Unique genres in the training data: ['action' 'adult' 'adventure' 'animation' 'biography' 'comedy' 'crime'
 'documentary' 'drama' 'family' 'fantasy' 'game-show' 'history' 'horror'
 'music' 'musical' 'mystery' 'news' 'reality-tv' 'romance' 'sci-fi'
 'short' 'sport' 'talk-show' 'thriller' 'war' 'western']
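
As a quick illustration of what the encoder does, here is a sketch on a made-up list of genres (the toy_* names are assumptions). LabelEncoder stores the class names in sorted order, maps each label to its index in that order, and inverse_transform maps the integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, used only to show the round trip
toy_genres = ["drama", "comedy", "drama", "horror"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_genres)

print(list(toy_encoder.classes_))                       # class names in sorted order
print(list(toy_codes))                                  # integer code for each label
print(list(toy_encoder.inverse_transform(toy_codes)))   # back to the original names
```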


5. Model Building - Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train_tfidf, y_train)

y_pred = lr_model.predict(X_test_tfidf)
predicted_genres = label_encoder.inverse_transform(y_pred)

test_df["Predicted_Genre"] = predicted_genres
test_df[['Title', 'Predicted_Genre']]

Unnamed: 0,Title,Predicted_Genre
0,Edgar's Lunch (1998),drama
1,La guerra de papá (1977),drama
2,Off the Beaten Track (2010),documentary
3,Meu Amigo Hindu (2015),drama
4,Er nu zhai (1955),drama
...,...,...
54195,"""Tales of Light & Dark"" (2013)",drama
54196,Der letzte Mohikaner (1965),drama
54197,Oliver Twink (2007),comedy
54198,Slipstream (1973),drama


In [13]:

merged_df = pd.merge(
    test_solution_df[['ID', 'Genre']],  # Columns from test_solution_df
    test_df[['ID', 'Predicted_Genre']],  # Correct column name in test_df
    on='ID'  # Merge on 'ID'
)

# Check the merged DataFrame
print(merged_df)


          ID        Genre Predicted_Genre
0          1     thriller           drama
1          2       comedy           drama
2          3  documentary     documentary
3          4        drama           drama
4          5        drama           drama
...      ...          ...             ...
54195  54196       horror           drama
54196  54197      western           drama
54197  54198        adult          comedy
54198  54199        drama           drama
54199  54200        drama     documentary

[54200 rows x 3 columns]
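
The merge above relies on every ID appearing exactly once in both frames. A hedged sketch of how that assumption could be checked on made-up data is shown below (the toy_* frames are illustrative). The validate argument of pd.merge raises an error if the keys are not unique on both sides, and comparing lengths catches IDs that found no match.

```python
import pandas as pd

# Made-up frames that mimic the ID / Genre / Predicted_Genre layout
toy_solution = pd.DataFrame({"ID": ["1", "2", "3"], "Genre": ["drama", "comedy", "horror"]})
toy_predictions = pd.DataFrame({"ID": ["1", "2", "3"], "Predicted_Genre": ["drama", "drama", "horror"]})

# validate="one_to_one" raises pandas.errors.MergeError on duplicate IDs
toy_merged = pd.merge(toy_solution, toy_predictions, on="ID", how="inner", validate="one_to_one")

# An inner merge silently drops unmatched IDs, so compare the row counts
assert len(toy_merged) == len(toy_solution), "some IDs did not find a match"

toy_accuracy = (toy_merged["Genre"] == toy_merged["Predicted_Genre"]).mean()
print(toy_merged)
print(f"Toy accuracy: {toy_accuracy:.4f}")
```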


6. Model Evaluation - Logistic Regression

In [14]:
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(merged_df['Genre'], merged_df['Predicted_Genre'])
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(merged_df['Genre'], merged_df['Predicted_Genre']))

Accuracy: 0.5948

Classification Report:
              precision    recall  f1-score   support

      action       0.50      0.29      0.37      1314
       adult       0.64      0.24      0.35       590
   adventure       0.67      0.16      0.26       775
   animation       0.55      0.04      0.08       498
   biography       0.00      0.00      0.00       264
      comedy       0.54      0.60      0.57      7446
       crime       0.41      0.03      0.06       505
 documentary       0.68      0.87      0.76     13096
       drama       0.55      0.78      0.65     13612
      family       0.49      0.08      0.14       783
     fantasy       0.65      0.03      0.06       322
   game-show       0.90      0.49      0.64       193
     history       0.00      0.00      0.00       243
      horror       0.66      0.57      0.61      2204
       music       0.68      0.46      0.55       731
     musical       0.45      0.02      0.03       276
     mystery       0.40      0.01      0
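
Accuracy alone can hide weak performance on rare genres such as biography and history, whose rows in the report above are all zeros. The sketch below uses made-up labels (the toy_* names are assumptions) to show how a macro-averaged F1 score, which weights every class equally, exposes this when a model mostly predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 'drama' and 2 'crime'; the "model" always predicts 'drama'
toy_true = ["drama"] * 8 + ["crime"] * 2
toy_pred = ["drama"] * 10

toy_acc = accuracy_score(toy_true, toy_pred)
# zero_division=0 scores a never-predicted class as 0 without emitting a warning
toy_macro_f1 = f1_score(toy_true, toy_pred, average="macro", zero_division=0)
toy_weighted_f1 = f1_score(toy_true, toy_pred, average="weighted", zero_division=0)

print(f"Accuracy:    {toy_acc:.4f}")
print(f"Macro F1:    {toy_macro_f1:.4f}")
print(f"Weighted F1: {toy_weighted_f1:.4f}")
```

The same f1_score call could be applied to merged_df['Genre'] and merged_df['Predicted_Genre'] to get a single imbalance-aware number for each model.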

7. Model Building - Naive Bayes

In [15]:
from sklearn.naive_bayes import MultinomialNB           # Import MultinomialNB from sklearn
nb_model = MultinomialNB()                              # Build the Naive Bayes model
nb_model.fit(X_train_tfidf, y_train)                    # Train the Naive Bayes model on the training dataset

In [16]:
y_pred_nb = nb_model.predict(X_test_tfidf)
predicted_genres_nb = label_encoder.inverse_transform(y_pred_nb)
test_df['Predicted_Genre_NB'] = predicted_genres_nb
merged_df_nb = pd.merge(test_solution_df, test_df[['ID', 'Predicted_Genre_NB']], on ='ID')

8. Model Evaluation - Naive Bayes

In [17]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_nb = accuracy_score(merged_df_nb["Genre"], merged_df_nb['Predicted_Genre_NB'])
print(f"Naive Bayes Accuracy: {accuracy_nb}")

print("Naive Bayes Classification Report:")
print(classification_report(merged_df_nb['Genre'], merged_df_nb['Predicted_Genre_NB'], target_names=label_encoder.classes_))

Naive Bayes Accuracy: 0.5092066420664206
Naive Bayes Classification Report:
              precision    recall  f1-score   support

      action       0.57      0.03      0.06      1314
       adult       0.46      0.02      0.04       590
   adventure       0.77      0.04      0.08       775
   animation       0.00      0.00      0.00       498
   biography       0.00      0.00      0.00       264
      comedy       0.53      0.40      0.46      7446
       crime       0.00      0.00      0.00       505
 documentary       0.56      0.89      0.69     13096
       drama       0.44      0.84      0.58     13612
      family       0.00      0.00      0.00       783
     fantasy       0.00      0.00      0.00       322
   game-show       1.00      0.02      0.04       193
     history       0.00      0.00      0.00       243
      horror       0.77      0.23      0.35      2204
       music       0.89      0.02      0.04       731
     musical       0.00      0.00      0.00       276
     

9. Model Building : SVM (Support Vector Machine)

In [18]:
from sklearn.svm import SVC                         # Import the SVC class from sklearn
svm_model = SVC(kernel='linear')                    # Build a Support Vector Machine (SVM) model with a linear kernel
svm_model.fit(X_train_tfidf, y_train)               # Train the SVM model on the training dataset

In [19]:
y_pred_svm = svm_model.predict(X_test_tfidf)                                                  # Predict encoded genres for the test set
predicted_genres_svm = label_encoder.inverse_transform(y_pred_svm)                            # Convert the encoded predictions back to genre names
test_df['Predicted_Genre_SVM'] = predicted_genres_svm                                         # Add the predictions to the test DataFrame
merged_df_svm = pd.merge(test_solution_df, test_df[['ID', 'Predicted_Genre_SVM']], on='ID')   # Join the predictions with the true genres for evaluation
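
Training SVC(kernel='linear') on tens of thousands of sparse TF-IDF rows can take a long time. One commonly used alternative is LinearSVC, sketched below on a made-up corpus (the toy_* names and sentences are assumptions). It is not an exact drop-in: it uses a different solver and, by default, a squared hinge loss with a one-vs-rest scheme, so its scores may differ from those reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training sentences, used only to show the API
toy_docs = [
    "ghost haunted house scream",
    "haunted mansion ghost curse",
    "funny jokes laugh party",
    "laugh comedy funny wedding",
]
toy_labels = ["horror", "horror", "comedy", "comedy"]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)

toy_svm = LinearSVC()          # linear SVM designed for large sparse inputs
toy_svm.fit(toy_X, toy_labels)

toy_new = toy_vec.transform(["a haunted ghost story", "funny party jokes"])
toy_svm_pred = toy_svm.predict(toy_new)
print(toy_svm_pred)
```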

10. Model Evaluation - SVM (Support Vector Machine)

In [20]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_svm = accuracy_score(merged_df_svm['Genre'], merged_df_svm['Predicted_Genre_SVM'])
print(f"SVM Accuracy: {accuracy_svm}")

print("SVM Classification Report:")
print(classification_report(merged_df_svm['Genre'], merged_df_svm['Predicted_Genre_SVM'], target_names=label_encoder.classes_))

SVM Accuracy: 0.600110701107011
SVM Classification Report:
              precision    recall  f1-score   support

      action       0.44      0.36      0.40      1314
       adult       0.62      0.39      0.48       590
   adventure       0.55      0.22      0.31       775
   animation       0.47      0.13      0.21       498
   biography       0.00      0.00      0.00       264
      comedy       0.55      0.60      0.57      7446
       crime       0.31      0.04      0.06       505
 documentary       0.69      0.86      0.77     13096
       drama       0.56      0.77      0.65     13612
      family       0.52      0.10      0.17       783
     fantasy       0.37      0.06      0.10       322
   game-show       0.84      0.61      0.71       193
     history       0.00      0.00      0.00       243
      horror       0.67      0.60      0.63      2204
       music       0.67      0.51      0.58       731
     musical       0.47      0.03      0.06       276
     mystery       0.3

TEST CASE

In [21]:
# Assuming the models (lr_model, nb_model, svm_model) have already been trained
# and that `vectorize` is the TF-IDF vectorizer fitted on the training descriptions.

zoner_Description = [
    'Explosive fight scenes in the city streets',        # Action
    'A haunted mansion that traps its visitors ',        # Horror 
    'A brave adventure in search of lost treasure',      # Adventure 
    'A forbidden romance in the 1920s',                  # Romance 
    'A daring rescue mission with a love interest'       # Action 
]

# Step 1 : Vectorize the new test data using the same vectorizer
test_data_tfidf = vectorize.transform(zoner_Description)       # Transform the descriptions into Tf-IDF features

# Step 2 : Predict genres using each model
y_pred_lr = lr_model.predict(test_data_tfidf)      # Predict using Logistic Regression 
predicted_genres_lr = label_encoder.inverse_transform(y_pred_lr)    # Inverse Transforming to get genre names

y_pred_nb = nb_model.predict(test_data_tfidf)     # Predict using Naive Bayes 
predicted_genres_nb = label_encoder.inverse_transform(y_pred_nb)    # Inverse Transforming to get genre names

y_pred_svm = svm_model.predict(test_data_tfidf)    # Predict using SVM 
predicted_genres_svm = label_encoder.inverse_transform(y_pred_svm)   # Inverse Transforming to get genre names

# Step 3 : Output the predicted genres

print("Predicting the Genre using Logistic Regression :", predicted_genres_lr)
print("Predicting the Genre using Naive Bayes         :", predicted_genres_nb)
print("Predicting the Genre using SVM                 :", predicted_genres_svm)

for i, message in enumerate(zoner_Description):
    print(f"Story : {message}")
    print(f"Status :\tNaive Bayes Prediction : {predicted_genres_nb[i]}")
    print(f"\t\tLogistic Regression Prediction : {predicted_genres_lr[i]}")
    print(f"\t\tSVM Prediction : {predicted_genres_svm[i]}")
    print("="*100)    # Separate each message



Predicting the Genre using Logistic Regression : ['documentary' 'horror' 'adventure' 'drama' 'comedy']
Predicting the Genre using Naive Bayes         : ['documentary' 'horror' 'documentary' 'drama' 'drama']
Predicting the Genre using SVM                 : ['documentary' 'horror' 'adventure' 'drama' 'comedy']
Story : Explosive fight scenes in the city streets
Status :	Naive Bayes Prediction : documentary
		Logistic Regression Prediction : documentary
		SVM Prediction : documentary
Story : A haunted mansion that traps its visitors 
Status :	Naive Bayes Prediction : horror
		Logistic Regression Prediction : horror
		SVM Prediction : horror
Story : A brave adventure in search of lost treasure
Status :	Naive Bayes Prediction : documentary
		Logistic Regression Prediction : adventure
		SVM Prediction : adventure
Story : A forbidden romance in the 1920s
Status :	Naive Bayes Prediction : drama
		Logistic Regression Prediction : drama
		SVM Prediction : drama
Story : A daring rescue mission
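
The test case above has to vectorize, predict, and decode in separate steps for each model. One possible way to bundle these steps is a scikit-learn Pipeline, sketched here on a made-up corpus. The names genre_pipeline and predict_genre are hypothetical, and passing string labels directly to the classifier (instead of using a LabelEncoder) is a simplification of what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training sentences, used only to show the API
toy_docs = [
    "ghost haunted house scream",
    "haunted mansion ghost curse",
    "funny jokes laugh party",
    "laugh comedy funny wedding",
]
toy_labels = ["horror", "horror", "comedy", "comedy"]

# The pipeline fits the vectorizer and the classifier together
genre_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))
genre_pipeline.fit(toy_docs, toy_labels)


def predict_genre(description):
    """Return the predicted genre name for one raw text description."""
    return genre_pipeline.predict([description])[0]


for story in ["haunted ghost", "funny laugh"]:
    print(f"Story : {story} -> {predict_genre(story)}")
```

Because the pipeline owns its vectorizer, new descriptions can be passed in as plain strings without a separate transform call.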