# IMDB Baseline Model (Structured Data Only)

This notebook trains a baseline model to predict whether a movie is highly rated or not (`label` column), using only structured features from the dataset.

We are **NOT** using review text or sentiment analysis here. This version focuses on numerical and categorical data like votes, year, and Rotten Tomatoes ratings.


## **Step 1: Load Data & Initialize Spark**

We start by launching a Spark session and loading the cleaned dataset. The dataset includes movie-level information like `numVotes`, `genre`, `tomatometer_rating`, etc.


In [1]:
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName("IMDB_Baseline_Model").getOrCreate()

# Load final cleaned dataset (structured data only)
df = spark.read.csv("final_cleaned_df.csv", header=True, inferSchema=True)

# Quick check on columns
df.printSchema()
df.show(5)


root
 |-- tconst: string (nullable = true)
 |-- movie_title: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- numVotes: integer (nullable = true)
 |-- label: boolean (nullable = true)
 |-- genre: string (nullable = true)
 |-- content_rating: string (nullable = true)
 |-- production_company: string (nullable = true)
 |-- tomatometer_status: integer (nullable = true)
 |-- tomatometer_rating: integer (nullable = true)
 |-- audience_status: integer (nullable = true)
 |-- audience_rating: integer (nullable = true)
 |-- review_score: string (nullable = true)
 |-- like_count: double (nullable = true)
 |-- label_int: integer (nullable = true)
 |-- reviews: string (nullable = true)
 |-- review_lemmatized: string (nullable = true)



## **Step 2: Encode Categorical Variables**

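Some columns are categorical (e.g., `genre`, `tomatometer_status`, `audience_status`; the two status columns are stored as integer codes rather than strings).  
They are converted into numeric indices using **StringIndexer** so they can be used as inputs to the machine learning model. With `handleInvalid="keep"`, values not seen during fitting are assigned to an extra index instead of raising an error.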



In [3]:
from pyspark.ml.feature import StringIndexer

# Index genre, tomatometer_status, and audience_status
genre_indexer = StringIndexer(inputCol="genre", outputCol="genre_indexed", handleInvalid="keep")
tomatometer_indexer = StringIndexer(inputCol="tomatometer_status", outputCol="tomatometer_status_indexed", handleInvalid="keep")
audience_indexer = StringIndexer(inputCol="audience_status", outputCol="audience_status_indexed", handleInvalid="keep")

# Apply indexers
df = genre_indexer.fit(df).transform(df)
df = tomatometer_indexer.fit(df).transform(df)
df = audience_indexer.fit(df).transform(df)

df.select("genre", "genre_indexed", "tomatometer_status", "tomatometer_status_indexed", "audience_status", "audience_status_indexed").show(5)


+-------+-------------+------------------+--------------------------+---------------+-----------------------+
|  genre|genre_indexed|tomatometer_status|tomatometer_status_indexed|audience_status|audience_status_indexed|
+-------+-------------+------------------+--------------------------+---------------+-----------------------+
|Unknown|          0.0|                 1|                       0.0|              1|                    0.0|
|Unknown|          0.0|                 1|                       0.0|              1|                    0.0|
|Unknown|          0.0|                 1|                       0.0|              1|                    0.0|
|Unknown|          0.0|                 1|                       0.0|              1|                    0.0|
|Unknown|          0.0|                 1|                       0.0|              1|                    0.0|
+-------+-------------+------------------+--------------------------+---------------+-----------------------+
only showing top 5 rows
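
To make the index values in the output above easier to read, here is a minimal plain-Python sketch of the idea behind frequency-ordered indexing. It is an illustration, not Spark's implementation: the function names and the sample `genres` list are hypothetical, and the alphabetical tie-break assumes the documented Spark 3.x behaviour for the default `frequencyDesc` ordering.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically (assumed to match Spark 3.x defaults).
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(values, mapping):
    """Apply the mapping; unseen labels share one extra index, like handleInvalid="keep"."""
    unseen_index = float(len(mapping))
    return [mapping.get(str(v), unseen_index) for v in values]

# Hypothetical genre column, not taken from the dataset
genres = ["Unknown", "Drama", "Unknown", "Comedy", "Drama", "Unknown"]
mapping = fit_string_index(genres)

# "Western" was not seen during fitting, so it falls into the extra bucket
indexed = transform_keep(genres + ["Western"], mapping)
print(mapping)
print(indexed)
```

Under this scheme the most frequent value receives index 0.0, which is consistent with `Unknown` mapping to `0.0` in the rows shown above.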

## **Step 3: Assemble Features**

All selected features (e.g., `numVotes`, `year`, `genre_indexed`, `tomatometer_rating`, etc.) are combined into a single `features` vector using **VectorAssembler**, which is the format Spark ML models expect.


In [4]:
from pyspark.ml.feature import VectorAssembler

# Define features
feature_cols = ["numVotes", "year", "tomatometer_rating", "audience_rating", 
                "genre_indexed", "tomatometer_status_indexed", "audience_status_indexed"]

# Assemble into a feature vector
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

df_ml = assembler.transform(df).select("features", "label")
df_ml.show(5)


+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1646.0,1935.0,78...| true|
|[1646.0,1935.0,78...| true|
|[1646.0,1935.0,78...| true|
|[1646.0,1935.0,78...| true|
|[1080.0,1935.0,78...| true|
+--------------------+-----+
only showing top 5 rows
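
Spark prints an assembled vector either densely, as `[v0,v1,...]`, or sparsely, as `(size,[indices],[values])`, when enough entries are zero. The sketch below shows how the two notations relate, in plain Python; the helper names and the sample row are hypothetical, and the claim that Spark picks the more compact form is my understanding of `VectorAssembler` rather than something the notebook states.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full vector, filling every unlisted position with 0.0."""
    out = [0.0] * size
    for i, x in zip(indices, values):
        out[i] = x
    return out

feature_cols = ["numVotes", "year", "tomatometer_rating", "audience_rating",
                "genre_indexed", "tomatometer_status_indexed", "audience_status_indexed"]

# Hypothetical row in feature_cols order: zero votes and all indexed columns at 0.0
row = [0.0, 1931.0, 60.0, 55.0, 0.0, 0.0, 0.0]
sparse = to_sparse(row)
print(sparse)

# Positions listed in the sparse form map back to feature names by index
print([feature_cols[i] for i in sparse[1]])
```

Reading a sparse vector this way, indices `[1,2,3]` refer to `year`, `tomatometer_rating`, and `audience_rating`, and every other feature is zero for that row.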



## **Step 4: Convert Target Column + Split Data**

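The target variable `label` is currently a boolean (True/False) and needs to be converted to integers (0 or 1) to work with **GBTClassifier**. Because `df_ml` from Step 3 still holds the boolean column, the feature vector is re-assembled after the cast.

We then randomly split the dataset, using a fixed seed for reproducibility, into:
- **80% training data** to train the model  
- **20% testing data** to evaluate model performance.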


In [5]:
from pyspark.sql.functions import col

# Convert label to Integer FIRST
df = df.withColumn("label", col("label").cast("integer"))

# Assemble features AFTER fixing label
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_ml = assembler.transform(df).select("features", "label")

# Now split data
train_data, test_data = df_ml.randomSplit([0.8, 0.2], seed=42)
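
A weighted random split of this kind assigns each row independently, so the 80/20 proportions are approximate rather than exact. The plain-Python sketch below illustrates that behaviour under the assumption that each row gets one uniform draw; `random_split` is a hypothetical helper, not Spark's `randomSplit`, and the row values are made up.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part using a single uniform draw per row."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total          # cumulative upper bound of each part
        bounds.append(acc)

    rng = random.Random(seed)     # fixed seed makes the split repeatable
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last part also catches any floating-point leftovers
            if draw < upper or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

rows = list(range(1000))          # hypothetical row ids
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))      # close to 800 / 200, but not guaranteed to be exact
```

Re-running with the same seed reproduces the same partition, which is why the notebook passes `seed=42`.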


## **Step 5: Train the Model**

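We train a **Gradient Boosted Trees Classifier (GBTClassifier)** using Spark MLlib. Gradient boosting builds an ensemble of decision trees one at a time, each correcting the errors of the trees before it, and it typically performs well on structured/tabular data. Here `maxIter=50` sets the number of boosting iterations, i.e. the number of trees.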
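
As a rough intuition for what boosting iterations do, here is a deliberately simplified plain-Python sketch. It is not how `GBTClassifier` is implemented: it uses squared loss and one-feature decision stumps instead of log loss and full trees, and the `ratings` and `labels` values are hypothetical. The point is only that each round fits the remaining errors and training error shrinks as rounds are added.

```python
def fit_stump(xs, residuals):
    """Least-squares decision stump: one threshold, one constant per side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value: right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def fit_boosted(xs, ys, n_iter, step_size=0.1):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_iter):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + step_size * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, step_size

def predict_boosted(model, x):
    base, stumps, step_size = model
    return base + step_size * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical single feature (a rating) and binary labels
ratings = [10, 25, 40, 55, 60, 75, 85, 95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for n_iter in (0, 1, 50):
    model = fit_boosted(ratings, labels, n_iter)
    print(n_iter, round(mse(model, ratings, labels), 4))
```

Training error falling with more rounds says nothing about test performance, which is why the evaluation in Step 6 uses held-out data.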


In [6]:
from pyspark.ml.classification import GBTClassifier

# Initialize and train model
gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=50)
gbt_model = gbt.fit(train_data)


## **Step 6: Make Predictions + Evaluate Performance**

We apply the trained model on the test set to generate predictions.

We evaluate the model using **ROC-AUC** and **Accuracy**:
- **ROC-AUC** measures how well the model ranks highly rated movies above low-rated ones across all decision thresholds (0.5 corresponds to random ranking, 1.0 to perfect ranking).
- **Accuracy** is the fraction of test rows whose predicted class matches the true label.
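
The two metrics can be sketched in plain Python to show what each one measures. This uses the pairwise-ranking interpretation of ROC-AUC rather than Spark's curve-based computation, and the `labels` and `scores` values are hypothetical, not taken from the model.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical true labels and model scores
labels = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.6, 0.7, 0.2, 0.4]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # fixed 0.5 threshold

print(roc_auc(labels, scores))          # 0.75
print(accuracy(labels, predictions))    # 4 of 6 rows correct
```

ROC-AUC depends only on the ordering of the scores, while accuracy depends on the chosen threshold, so the two can move independently.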




In [7]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Predict on test set
predictions = gbt_model.transform(test_data)
predictions.select("features", "label", "prediction").show(5)

# Evaluate performance
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
roc_auc = evaluator.evaluate(predictions)
print(f"ROC-AUC Score: {roc_auc:.4f}")


+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(7,[1,2,3],[1931....|    0|       1.0|
|(7,[1,2,3],[1933....|    1|       1.0|
|(7,[1,2,3],[1933....|    1|       1.0|
|(7,[1,2,3],[1933....|    1|       1.0|
|(7,[1,2,3],[1934....|    0|       1.0|
+--------------------+-----+----------+
only showing top 5 rows

ROC-AUC Score: 0.7146


In [8]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_acc.evaluate(predictions)
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.6554
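
An accuracy figure is easier to interpret next to the share of the majority class, since a model that always predicts the larger class already achieves that share. The sketch below shows the calculation in plain Python on hypothetical labels; the notebook does not report the class balance, so no comparison value is implied here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical test labels, not the notebook's data
example_labels = [1] * 6 + [0] * 4
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline: {baseline:.4f}")
```

On the actual test set, the same number could be obtained from the label counts of `test_data` and compared with the accuracy printed above.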
