# Task
Create a machine learning model to predict movie genres from plot summaries using data from "/content/train_data.txt", "/content/test_data.txt", and "/content/test_data_solution.txt".

## Load and preprocess the training data

### Subtask:
Load the training data from "/content/train_data.txt" and preprocess it for model training. This may involve tasks like tokenization, removing stop words, and converting text to numerical features.


In [4]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already present
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

# Load and parse the training data (fields are separated by ' ::: ')
data = []
with open('/content/train_data.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ::: ')
        if len(parts) == 4:
            data.append({'id': parts[0], 'title': parts[1], 'genre': parts[2], 'plot': parts[3]})

train_df = pd.DataFrame(data)

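# Handle missing values in 'plot' column.
# Assign the result back: calling fillna(..., inplace=True) on a column
# selection is deprecated and stops working in pandas 3.0.
train_df['plot'] = train_df['plot'].fillna('')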

# Build the stop-word set once instead of rebuilding it on every call
stop_words = set(stopwords.words('english'))

# Preprocessing function: lowercase, strip punctuation, drop stop words
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Apply preprocessing
train_df['plot_cleaned'] = train_df['plot'].apply(preprocess_text)

# Count plots that are empty after cleaning
empty_cleaned_plots = train_df[train_df['plot_cleaned'].str.strip() == '']
print(f"\nNumber of empty cleaned plots after re-parsing: {len(empty_cleaned_plots)}")


# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Limiting features for demonstration
X_train = tfidf_vectorizer.fit_transform(train_df['plot_cleaned'])

# Label Encoding
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_df['genre'])

print("Preprocessing complete. Shapes of features and labels:")
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)


Number of empty cleaned plots after re-parsing: 0
Preprocessing complete. Shapes of features and labels:
X_train shape: (54214, 5000)
y_train shape: (54214,)
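
To make the parsing and cleaning steps concrete, here is a minimal sketch on a single made-up line. The sample line and the tiny `STOP_WORDS` set are assumptions for illustration (the notebook itself uses NLTK's English list); the regex and the ` ::: ` separator are the ones used above.

```python
import re

# Hypothetical sample line in the "id ::: title ::: genre ::: plot" layout.
sample_line = "1 ::: Example Movie (2001) ::: drama ::: A young woman's journey, told in three parts!\n"

# Tiny stand-in stop-word set so the sketch runs without downloading NLTK data.
STOP_WORDS = {"a", "in", "the", "of", "and"}

def clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words (same steps as preprocess_text)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w not in stop_words)

parts = sample_line.strip().split(" ::: ")
record = dict(zip(["id", "title", "genre", "plot"], parts))
cleaned = clean(record["plot"])
print(record["genre"], "|", cleaned)
```

Note that stripping punctuation turns "woman's" into "womans". Contractions such as "don't" become "dont" in the same way, which no longer matches a stop-word entry that contains an apostrophe.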


## Load and preprocess the test data

### Subtask:
Load the test data from "/content/test_data.txt" and preprocess it using the same techniques as the training data.


**Reasoning**:
Load the test data, handle missing values, preprocess the plot summaries using the same function as the training data, and then transform the cleaned plots into a TF-IDF matrix using the already fitted vectorizer. Finally, print the shape of the resulting matrix.



In [5]:
# Load and parse the test data
test_data = []
with open('/content/test_data.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ::: ')
        if len(parts) == 3: # Test data has id, title, plot
            test_data.append({'id': parts[0], 'title': parts[1], 'plot': parts[2]})

test_df = pd.DataFrame(test_data)

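# Handle missing values in 'plot' column.
# Assign the result back: calling fillna(..., inplace=True) on a column
# selection is deprecated and stops working in pandas 3.0.
test_df['plot'] = test_df['plot'].fillna('')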

# Apply the same preprocessing function
test_df['plot_cleaned'] = test_df['plot'].apply(preprocess_text)

# Use the fitted tfidf_vectorizer to transform the test data
X_test = tfidf_vectorizer.transform(test_df['plot_cleaned'])

# Print the shape of the resulting TF-IDF feature matrix
print("Shape of TF-IDF feature matrix for test data:", X_test.shape)

Shape of TF-IDF feature matrix for test data: (54200, 5000)
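
The test matrix has the same 5000 columns as the training matrix because `transform` reuses the vocabulary learned by `fit_transform`. A small sketch with invented documents shows this, and also that words unseen during fitting are silently ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for illustration.
train_docs = ["detective solves murder", "couple falls love", "murder mystery detective"]
test_docs = ["detective falls love", "zombie apocalypse"]  # second doc has only unseen words

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuses them; no refitting on test data

print(X_tr.shape, X_te.shape)          # column counts match
unseen_row_sum = X_te.toarray()[1].sum()
print(unseen_row_sum)                  # all-zero row for the unseen-word document
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.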


## Train a machine learning model

### Subtask:
Choose a suitable machine learning model (e.g., Naive Bayes, Logistic Regression, or Support Vector Machine) and train it on the preprocessed training data.


**Reasoning**:
Import the Logistic Regression model and train it on the preprocessed training data.



In [6]:
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence

# Train the model
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.
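
As an alternative arrangement of the same two steps, the vectorizer and the classifier can be bundled with scikit-learn's `make_pipeline`, so the fitted vocabulary always travels with the model. This is a sketch on toy plots and genres invented for illustration, not the notebook's own setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration.
toy_plots = [
    "alien ship attacks space station",
    "astronaut stranded on distant planet",
    "couple falls in love at wedding",
    "love letter reunites old couple",
]
toy_genres = ["sci-fi", "sci-fi", "romance", "romance"]

# One object that vectorizes raw text and then classifies it.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(toy_plots, toy_genres)

pred = clf.predict(["alien ship lands on a distant planet"])
print(pred[0])
```

With this layout, string labels can be passed directly, and `clf.predict` accepts raw (cleaned) text rather than a precomputed TF-IDF matrix.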


## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the preprocessed test data using appropriate metrics (e.g., accuracy, precision, recall, F1-score). Compare the model's predictions with the true genres from "/content/test_data_solution.txt".


**Reasoning**:
Load the true genres from the solution file, preprocess them using the fitted label encoder, make predictions using the trained model on the test data, calculate evaluation metrics, and print the results.
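
Comparing predictions with the solution file only works if the true labels are in the same row order as the test rows. One way to sketch that alignment, assuming both files share a string `id` column, is a left merge in pandas. The toy frames below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; ids are strings, as they are after parsing the text files.
toy_test = pd.DataFrame({"id": ["1", "2", "3"]})
toy_solution = pd.DataFrame(
    {"id": ["3", "1", "2"], "genre": ["horror", "drama", "comedy"]}
)

# A left merge keeps the row order of toy_test; validate raises if ids are duplicated.
aligned = toy_test.merge(toy_solution, on="id", how="left", validate="one_to_one")

# Any NaN here would mean a test id had no entry in the solution file.
missing = int(aligned["genre"].isna().sum())
print(aligned["genre"].tolist(), missing)
```

The `validate="one_to_one"` argument is a cheap guard against duplicate ids that a plain dictionary lookup would hide.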



In [7]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the true genres from the solution file
solution_data = []
with open('/content/test_data_solution.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ::: ')
        if len(parts) == 4: # Solution data has id, title, genre, plot (similar to train but we only need id and genre)
            solution_data.append({'id': parts[0], 'genre': parts[2]})

solution_df = pd.DataFrame(solution_data)

# The solution file is not guaranteed to list movies in the same order as the
# test file, so look up each true genre by 'id' in the order the ids appear in test_df.
id_to_genre = solution_df.set_index('id')['genre'].to_dict()
true_genres_ordered = [id_to_genre[movie_id] for movie_id in test_df['id']]


# Preprocess the true genres using the same label_encoder fitted on the training data
y_true = label_encoder.transform(true_genres_ordered)

# Use the trained model to make predictions on the preprocessed test data (X_test)
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

# Print the calculated evaluation metrics
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1-score (weighted): {f1:.4f}")

Model Evaluation Metrics:
Accuracy: 0.5837
Precision (weighted): 0.5618
Recall (weighted): 0.5837
F1-score (weighted): 0.5447


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
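
Accuracy and weighted recall are identical in the output above (both 0.5837), and this is not a coincidence. Weighted recall sums each class's recall TP_c / n_c weighted by its share n_c / N, which reduces to the total number of correct predictions divided by N, that is, accuracy. The sketch below checks this on made-up labels and also contrasts weighted with macro averaging, which treats every class equally:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: class 0 dominates and class 2 is never predicted.
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 0, 0]

acc = accuracy_score(toy_true, toy_pred)
weighted_recall = recall_score(toy_true, toy_pred, average="weighted")

# zero_division=0 scores the never-predicted class as 0 without emitting a warning.
weighted_f1 = f1_score(toy_true, toy_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(toy_true, toy_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.4f} weighted_recall={weighted_recall:.4f}")
print(f"weighted_f1={weighted_f1:.4f} macro_f1={macro_f1:.4f}")
```

On imbalanced data the macro average is usually the more sensitive indicator of how rare classes are handled, because a missed minority class pulls it down regardless of how few samples that class has.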


## Summary:

### Data Analysis Key Findings

*   The training data was loaded, cleaned (lowercased, with punctuation and English stop words removed), and vectorized with TF-IDF capped at 5000 features; the genres were label encoded. The resulting feature matrix `X_train` has a shape of (54214, 5000), and the label vector `y_train` has a shape of (54214,).
*   The test data was loaded, cleaned with the same function, and transformed into a TF-IDF feature matrix `X_test` using the vectorizer fitted on the training data. `X_test` has a shape of (54200, 5000).
*   A Logistic Regression model (`max_iter=1000`) was trained on the preprocessed training data (`X_train`, `y_train`).
*   On the test set the model reached an Accuracy of 0.5837, a weighted Precision of 0.5618, a weighted Recall of 0.5837, and a weighted F1-score of 0.5447. Weighted recall equals accuracy by construction, so it carries no independent information here.
*   An `UndefinedMetricWarning` raised during the precision calculation indicates that the model never predicted at least one genre present in the test set. Precision is undefined for such genres, and scikit-learn counts it as 0 in the weighted average.
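
To find out which genres the warning refers to, one option is to compare the set of true labels with the set of predicted labels and decode the difference. The sketch below uses invented genres and its own encoder; in the notebook the same idea would apply to `y_true`, `y_pred`, and `label_encoder`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Invented genres for illustration.
enc = LabelEncoder().fit(["comedy", "drama", "horror", "western"])
toy_true = enc.transform(["drama", "comedy", "western", "horror", "drama"])
toy_pred = enc.transform(["drama", "drama", "drama", "comedy", "drama"])

# Encoded labels that occur in the truth but never in the predictions.
missing_codes = np.setdiff1d(np.unique(toy_true), np.unique(toy_pred))
never_predicted = enc.inverse_transform(missing_codes)
print(list(never_predicted))
```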

### Insights or Next Steps

*   The model's performance is moderate (accuracy 0.5837, weighted F1-score 0.5447). Checking the genre distribution for class imbalance, and trying other models or feature engineering techniques, could improve these metrics.
*   Analyzing the genres that were not predicted could provide insights into the model's weaknesses and guide future model improvements or data augmentation strategies.
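
If the genre distribution does turn out to be imbalanced, one option to try is scikit-learn's `class_weight="balanced"` setting, which weights each class by n_samples / (n_classes * class_count). Whether it helps on this dataset is not established here; the sketch below only shows what the weights look like on a made-up 8-to-2 label split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 8 'drama' samples and 2 'horror' samples.
toy_labels = np.array(["drama"] * 8 + ["horror"] * 2)
classes = np.unique(toy_labels)

# 'balanced' gives rarer classes proportionally larger weights.
weights = compute_class_weight("balanced", classes=classes, y=toy_labels)
print(dict(zip(classes, weights)))

# The same weighting can be requested directly when building the model.
balanced_model = LogisticRegression(class_weight="balanced", max_iter=1000)
```

A reweighted model typically trades some overall accuracy for better recall on rare genres, so it is worth comparing macro-averaged metrics before and after.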
