# Load the Dataset

### Objective:
Read the `train_data.txt` file and load it into a clean DataFrame.

---

### Why?

A structured table, similar to a spreadsheet, makes it easy to work with movie plots and genres.
It also simplifies preprocessing, analysis, and model building.

In [2]:
import pandas as pd

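# Step 1: Read the file and parse each ' ::: '-separated line
data = []
with open('archive (12)/Genre Classification Dataset/train_data.txt', 'r', encoding='utf-8') as file:
    for line in file:
        parts = line.strip().split(' ::: ')
        # Keep only well-formed rows: ID, title, genre, plot
        if len(parts) == 4: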
            movie_id, title, genre, plot = parts
            data.append({
                'ID': int(movie_id),
                'Title': title,
                'Genre': genre,
                'Plot': plot
            })

# Step 2: Create a DataFrame
train_df = pd.DataFrame(data)

# Step 3: Check the first few entries
print(train_df.head())


   ID                             Title     Genre  \
0   1      Oscar et la dame rose (2009)     drama   
1   2                      Cupid (1997)  thriller   
2   3  Young, Wild and Wonderful (1980)     adult   
3   4             The Secret Sin (1915)     drama   
4   5            The Unrecovered (2007)     drama   

                                                Plot  
0  Listening in to a conversation between his doc...  
1  A brother and sister with a past incestuous re...  
2  As the bus empties the students for their fiel...  
3  To help their unemployed father make ends meet...  
4  The film's title refers not only to the un-rec...  
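
The loader above silently drops any line that does not split into exactly four fields. The following is a minimal sketch of the same parsing logic on in-memory text that also counts the skipped lines. The sample rows and the `parse_lines` helper are hypothetical and only illustrate the ` ::: ` format; unlike the cell above, the sketch passes `maxsplit=3` so that a plot containing the separator is kept rather than dropped.

```python
import io

SEP = " ::: "

def parse_lines(handle):
    """Parse 'ID ::: Title ::: Genre ::: Plot' lines; return (rows, skipped)."""
    rows, skipped = [], 0
    for line in handle:
        # maxsplit=3 keeps a plot intact even if it happens to contain the separator
        parts = line.strip().split(SEP, 3)
        if len(parts) != 4:
            skipped += 1
            continue
        movie_id, title, genre, plot = parts
        rows.append({"ID": int(movie_id), "Title": title, "Genre": genre, "Plot": plot})
    return rows, skipped

# Hypothetical sample: two well-formed lines and one malformed line
sample = io.StringIO(
    "1 ::: Example Movie (2001) ::: drama ::: A quiet town hides a secret.\n"
    "this line has no separators\n"
    "2 ::: Another Film (1999) ::: comedy ::: Two friends ::: one plan.\n"
)
rows, skipped = parse_lines(sample)
print(len(rows), "rows parsed,", skipped, "skipped")
```

A non-zero skipped count on the real file would be a reason to inspect those lines before training.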


# Preprocess the Plot Text

### Objective:
Clean the "Plot" column to make it suitable for machine learning.

---

### Why?

Raw text is noisy, so we normalize it before it reaches the model:

- Convert all text to **lowercase**.
- **Remove punctuation**, digits, and other special characters.
- **Remove stopwords** (common words like "the", "is", "a", etc.) that carry little meaning.



In [4]:

import re
import nltk
from nltk.corpus import stopwords

# Download stopwords (only once)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Define a text cleaning function
def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    # Join back to text
    return ' '.join(words)

# Apply the cleaning function
train_df['Cleaned_Plot'] = train_df['Plot'].apply(clean_text)

# View cleaned data
print(train_df[['Plot', 'Cleaned_Plot']].head())



                                                Plot  \
0  Listening in to a conversation between his doc...   
1  A brother and sister with a past incestuous re...   
2  As the bus empties the students for their fiel...   
3  To help their unemployed father make ends meet...   
4  The film's title refers not only to the un-rec...   

                                        Cleaned_Plot  
0  listening conversation doctor parents yearold ...  
1  brother sister past incestuous relationship cu...  
2  bus empties students field trip museum natural...  
3  help unemployed father make ends meet edith tw...  
4  films title refers unrecovered bodies ground z...  
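
Because `clean_text` deletes non-letter characters rather than replacing them with spaces, hyphenated words and contractions are fused, which is why the output above contains tokens such as `yearold` and `unrecovered`. The sketch below traces that rule on one sentence and compares it with a variant that substitutes a space instead. The sentence, the `clean` helper, and the tiny stopword set are hypothetical stand-ins, chosen so the example runs without downloading the NLTK list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
STOP_WORDS = {"the", "to", "a", "s"}

def clean(text, replacement=""):
    """Lowercase, replace every non-letter with `replacement`, drop stopwords."""
    text = re.sub(r"[^a-z\s]", replacement, text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sentence = "The film's title refers to a 12-year-old boy."
fused = clean(sentence)        # same rule as clean_text: non-letters deleted
spaced = clean(sentence, " ")  # variant: non-letters become spaces

print(fused)
print(spaced)
```

Which variant works better for genre prediction is an empirical question; the spaced version simply keeps `year` and `old` as separate vocabulary entries.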


# Feature Extraction (Vectorization)

### Objective:
Convert the cleaned plot text into numerical features.

---

### Why?

Machine learning models operate on numbers, not raw text.

We'll use **TF-IDF (Term Frequency-Inverse Document Frequency)** to:
- Weight each word by how often it appears in a plot and how rare it is across all plots.
- Represent text in a numeric format suitable for machine learning algorithms.
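
To make the weighting concrete, here is a small hand computation on three hypothetical, already-cleaned plots. It follows the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization documented as the defaults of scikit-learn's `TfidfVectorizer`. It is an illustration of the idea, not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned toy plots
docs = [
    "detective hunts killer",
    "detective falls love",
    "friends plan wedding",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many plots does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term:10s} {w:.3f}")
```

In the first plot, `detective` receives a lower weight than `hunts` or `killer` because it also appears in a second plot and is therefore less distinctive.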


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Fit and transform the cleaned plots; .toarray() converts the sparse result to a dense array
X = tfidf.fit_transform(train_df['Cleaned_Plot']).toarray()

# Target Variable (Genre)
y = train_df['Genre']

print(f"Shape of feature matrix: {X.shape}")


Shape of feature matrix: (54214, 5000)
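
Calling `.toarray()` turns the sparse TF-IDF result into a dense array, which can be costly at this size. A rough estimate of the memory involved, assuming 8-byte floating-point values (the element size is an assumption; the shape is the one printed above):

```python
# Shape reported above; float64 (8 bytes per value) is assumed as the element type
n_rows, n_cols = 54214, 5000
bytes_per_value = 8

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"Dense matrix: {dense_bytes / 1e9:.2f} GB")
```

If memory is tight, one option is to drop `.toarray()` and keep the sparse matrix, since scikit-learn estimators such as `LogisticRegression`, `RandomForestClassifier`, and `LinearSVC` accept sparse input.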


# Train-Test Split

**Objective**: Split data into training and testing sets.

**Why?**  
We train the model on one portion of the data and evaluate it on a separate, held-out portion. Performance on the held-out data estimates how well the model generalizes and reveals overfitting to the training data.

---

A common split is 80% for training and 20% for testing, though the proportions can vary depending on the dataset and application.


In [11]:
from sklearn.model_selection import train_test_split

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training size: {X_train.shape}, Testing size: {X_test.shape}")


Training size: (43371, 5000), Testing size: (10843, 5000)
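
The genres in this dataset are heavily imbalanced (the reports below list, for example, 40 `game-show` plots against 2,697 `drama` plots in the test set), so a purely random split can leave rare genres under-represented. The sketch below shows the idea of a stratified split by hand on hypothetical labels; `stratified_split` is an illustrative helper, not part of the notebook. With scikit-learn, passing `stratify=y` to `train_test_split` is the usual way to get the same effect.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class proportions in both."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 80 dramas, 15 comedies, 5 westerns
labels = ["drama"] * 80 + ["comedy"] * 15 + ["western"] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Every genre keeps about 20% of its examples in the test set, so even the rarest class is still evaluated.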


# Train Models
We'll try three models:

- Logistic Regression
- Random Forest
- SVM

---

# Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)

# Predict
y_pred_log = log_model.predict(X_test)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))


Logistic Regression Accuracy: 0.5787143779396846
              precision    recall  f1-score   support

      action       0.50      0.25      0.33       263
       adult       0.78      0.22      0.35       112
   adventure       0.41      0.14      0.21       139
   animation       0.64      0.09      0.15       104
   biography       0.00      0.00      0.00        61
      comedy       0.51      0.58      0.55      1443
       crime       0.33      0.02      0.04       107
 documentary       0.67      0.84      0.75      2659
       drama       0.54      0.78      0.64      2697
      family       0.48      0.09      0.15       150
     fantasy       0.00      0.00      0.00        74
   game-show       0.94      0.42      0.59        40
     history       0.00      0.00      0.00        45
      horror       0.62      0.55      0.58       431
       music       0.65      0.47      0.55       144
     musical       0.25      0.02      0.04        50
     mystery       0.00      0.0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
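
Rows with 0.00 precision and recall (for example `biography`, `fantasy`, and `history`) are genres the model never predicted correctly, and the `_warn_prf` output appears to be a fragment of the warning scikit-learn emits when a metric is undefined for such a class. Overall accuracy hides this, because it is dominated by the large genres. The toy example below, with hypothetical labels, shows how accuracy and macro-averaged recall can diverge:

```python
from collections import Counter

# Hypothetical predictions: the model never predicts the rare class "history"
y_true = ["drama"] * 8 + ["history"] * 2
y_pred = ["drama"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall per class = correct predictions for that class / true examples of it
support = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
recall = {label: hits[label] / support[label] for label in support}
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

This is why the macro-averaged figures in `classification_report` are worth reading alongside the headline accuracy.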


# Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Train
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.5001383380983123
              precision    recall  f1-score   support

      action       0.00      0.00      0.00       263
       adult       0.79      0.10      0.17       112
   adventure       0.43      0.09      0.14       139
   animation       0.00      0.00      0.00       104
   biography       0.00      0.00      0.00        61
      comedy       0.47      0.31      0.37      1443
       crime       1.00      0.01      0.02       107
 documentary       0.58      0.87      0.69      2659
       drama       0.43      0.84      0.57      2697
      family       1.00      0.03      0.05       150
     fantasy       0.00      0.00      0.00        74
   game-show       0.91      0.53      0.67        40
     history       0.00      0.00      0.00        45
      horror       0.61      0.19      0.29       431
       music       0.63      0.22      0.32       144
     musical       0.00      0.00      0.00        50
     mystery       0.00      0.00     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# SVM

In [15]:
from sklearn.svm import LinearSVC

# Train
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test)

# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.5696762888499493
              precision    recall  f1-score   support

      action       0.39      0.30      0.34       263
       adult       0.62      0.36      0.45       112
   adventure       0.35      0.22      0.27       139
   animation       0.36      0.17      0.23       104
   biography       0.00      0.00      0.00        61
      comedy       0.52      0.57      0.54      1443
       crime       0.15      0.05      0.07       107
 documentary       0.69      0.81      0.75      2659
       drama       0.56      0.70      0.62      2697
      family       0.28      0.15      0.19       150
     fantasy       0.22      0.05      0.09        74
   game-show       0.76      0.65      0.70        40
     history       0.00      0.00      0.00        45
      horror       0.58      0.62      0.60       431
       music       0.55      0.53      0.54       144
     musical       0.20      0.04      0.07        50
     mystery       0.16      0.05      0.08     

# Sample Test

In [16]:
# Sample test case
sample_data = ["A young boy discovers a magical world hidden in his backyard."]
sample_cleaned = [clean_text(sample_data[0])]  # Clean the text using the existing clean_text function

# Transform the sample data using the TF-IDF vectorizer
sample_features = tfidf.transform(sample_cleaned).toarray()

# Predict using the trained models
log_pred = log_model.predict(sample_features)
rf_pred = rf_model.predict(sample_features)
svm_pred = svm_model.predict(sample_features)

# Display predictions
print("Logistic Regression Prediction:", log_pred[0])
print("Random Forest Prediction:", rf_pred[0])
print("SVM Prediction:", svm_pred[0])

Logistic Regression Prediction: short
Random Forest Prediction: drama
SVM Prediction: animation
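
Predicting on new text requires repeating the same vectorization that was used in training. One way to keep those steps together is a scikit-learn `Pipeline`, sketched below on a tiny hypothetical corpus; the plots, labels, and the `model` name are illustrative only. In the notebook, `clean_text` could presumably be supplied through the vectorizer's `preprocessor` parameter, which is omitted here to keep the sketch short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus, only to make the sketch runnable
plots = [
    "a detective hunts a killer through the city",
    "a detective investigates a murder in a small town",
    "two friends plan a hilarious wedding prank",
    "a clumsy waiter causes hilarious chaos at a wedding",
]
genres = ["thriller", "thriller", "comedy", "comedy"]

# The pipeline vectorizes raw text and then classifies it in one call
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(plots, genres)

pred = model.predict(["a detective chases a killer"])
print("Pipeline prediction:", pred[0])
```

A fitted pipeline can also be saved and reloaded as a single object, which avoids mismatches between the vectorizer and the model.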
