# Movie Rating Prediction

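This notebook walks through a machine learning pipeline for predicting movie ratings from synthetic data. The model is a Random Forest Regressor, preceded by preprocessing for both categorical and numerical features. Because the synthetic ratings are drawn at random, independently of the features, the notebook illustrates the workflow and is not expected to show real predictive accuracy.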

---

## Outline
1. Import Required Libraries
2. Generate Synthetic Movie Data
3. Explore the Synthetic Dataset
4. Preprocessing: Encode Categorical and Scale Numerical Features
5. Split Data into Training and Test Sets
6. Train Random Forest Regression Model
7. Evaluate Model Performance
8. Predict Rating for a New Movie

## 1. Import Required Libraries
Import pandas, numpy, and scikit-learn modules needed for data generation, preprocessing, modeling, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

## 2. Generate Synthetic Movie Data
Create a synthetic dataset of 200 movies with five features (genre, director, actor, budget, and duration) and a target column, rating.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)
genres = ['Action', 'Comedy', 'Drama', 'Horror', 'Sci-Fi']
directors = ['Spielberg', 'Nolan', 'Tarantino', 'Scorsese', 'Kubrick']
actors = ['DiCaprio', 'Johansson', 'Pitt', 'Streep', 'Hanks']

data = {
    'genre': np.random.choice(genres, 200),
    'director': np.random.choice(directors, 200),
    'actor': np.random.choice(actors, 200),
    'budget_million': np.random.uniform(10, 200, 200),
    'duration_min': np.random.randint(80, 180, 200),  # upper bound 180 is exclusive
    'rating': np.random.uniform(4, 9, 200)  # Simulated ratings, drawn independently of the features
}
df = pd.DataFrame(data)
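
The ratings above are uniform random numbers, so no model can learn a relationship from them. As a minimal sketch of an alternative, the cell below builds ratings that depend on genre and budget plus noise. The per-genre offsets, the budget coefficient, the noise level, and the names `genre_effect` and `df_signal` are hypothetical values chosen only for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumption: Generator API, not np.random.seed
n = 200
genres = ['Action', 'Comedy', 'Drama', 'Horror', 'Sci-Fi']

# Hypothetical per-genre rating offsets (illustrative values only)
genre_effect = {'Action': 0.0, 'Comedy': -0.3, 'Drama': 0.8,
                'Horror': -0.8, 'Sci-Fi': 0.4}

genre = rng.choice(genres, n)
budget = rng.uniform(10, 200, n)
duration = rng.integers(80, 180, n)  # upper bound is exclusive

# Rating = base + genre offset + small budget effect + Gaussian noise
genre_offset = np.array([genre_effect[g] for g in genre])
noise = rng.normal(0, 0.5, n)
rating = np.clip(6.0 + genre_offset + 0.005 * (budget - 100) + noise, 1, 10)

df_signal = pd.DataFrame({
    'genre': genre,
    'budget_million': budget,
    'duration_min': duration,
    'rating': rating,
})
print(df_signal.groupby('genre')['rating'].mean().round(2))
```

With data like this, the per-genre means should differ visibly, which gives the regressor something to learn.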

## 3. Explore the Synthetic Dataset
Display the first few rows and summary statistics of the generated dataset. By default, df.describe() summarizes only the numerical columns.

In [None]:
# Display the first 5 rows of the dataset
df.head()

In [None]:
# Show basic statistics
df.describe()
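
Categorical columns can be explored with value counts and group-wise aggregates. The sketch below uses a small stand-in frame named `demo` (an assumption, so the cell runs on its own); the same calls should apply to the notebook's `df`.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for this standalone sketch
demo = pd.DataFrame({
    'genre': np.random.choice(['Action', 'Comedy', 'Drama'], 60),
    'rating': np.random.uniform(4, 9, 60),
})

# How many rows fall into each category
genre_counts = demo['genre'].value_counts()

# Mean, spread, and count of the target within each category
rating_by_genre = demo.groupby('genre')['rating'].agg(['mean', 'std', 'count'])

print(genre_counts)
print(rating_by_genre.round(2))
```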

## 4. Preprocessing: Encode Categorical and Scale Numerical Features
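Build a ColumnTransformer that applies OneHotEncoder to the categorical features and StandardScaler to the numerical features. Scaling is not required for tree-based models such as Random Forests, but it keeps the preprocessor reusable with other estimators.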

In [None]:
categorical_features = ['genre', 'director', 'actor']
numerical_features = ['budget_million', 'duration_min']

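preprocessor = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    # instead of raising an error at prediction time
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('num', StandardScaler(), numerical_features)
])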

## 5. Split Data into Training and Test Sets
Hold out 20% of the rows (40 of 200) as a test set using train_test_split; random_state fixes the shuffle so the split is reproducible.

In [None]:
X = df.drop(columns='rating')
y = df['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
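
A quick sanity check on the split sizes can catch mistakes early. The sketch below uses a stand-in 200-row frame (`demo`, an assumption so the cell runs on its own) with the same `test_size` and `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({'x': np.arange(200), 'rating': np.linspace(4, 9, 200)})
X_demo = demo.drop(columns='rating')
y_demo = demo['rating']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 80% of 200 rows go to training, 20% to testing, with no shared rows
print(X_tr.shape, X_te.shape)
print("Overlap:", len(set(X_tr.index) & set(X_te.index)))
```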

## 6. Train Random Forest Regression Model
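Chain the preprocessor and a Random Forest Regressor into a single Pipeline and fit it on the training data. Fitting through the pipeline ensures the encoder and scaler learn their parameters from the training rows only.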

In [None]:
model = Pipeline([
    ('pre', preprocessor),
    ('reg', RandomForestRegressor(n_estimators=100, random_state=42))
])

model.fit(X_train, y_train)
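
A single train/test split on 200 rows gives a noisy performance estimate, so k-fold cross-validation is a common complement. The sketch below is one possible way to do it with `cross_val_score`. It rebuilds a reduced stand-in dataset and pipeline (`demo`, `cv_model`) so the cell is self-contained, and the choice of 5 folds is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'genre': rng.choice(['Action', 'Comedy', 'Drama', 'Horror', 'Sci-Fi'], n),
    'budget_million': rng.uniform(10, 200, n),
    'rating': rng.uniform(4, 9, n),  # random target, as in the notebook
})

cv_model = Pipeline([
    ('pre', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['genre']),
        ('num', StandardScaler(), ['budget_million']),
    ])),
    ('reg', RandomForestRegressor(n_estimators=100, random_state=42)),
])

# The whole pipeline is refit inside each fold, so preprocessing never sees held-out rows
cv_scores = cross_val_score(
    cv_model, demo.drop(columns='rating'), demo['rating'], cv=5, scoring='r2'
)
print("R2 per fold:", np.round(cv_scores, 3))
print("Mean R2:", round(cv_scores.mean(), 3))
```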

## 7. Evaluate Model Performance
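Evaluate the model on the test set using Mean Squared Error and the R2 score. Because the ratings were drawn at random, independently of the features, expect an R2 score near or below zero: the model has no real signal to learn.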

In [None]:
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
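
Metrics are easier to interpret next to a trivial baseline. The sketch below scores a predictor that always outputs the training mean, using scikit-learn's `DummyRegressor`, and also reports RMSE, which is in rating units. The stand-in arrays `X_demo` and `y_demo` are assumptions so the cell runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.uniform(10, 200, size=(200, 1))
y_demo = rng.uniform(4, 9, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the training-set mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)

baseline_mse = mean_squared_error(y_te, base_pred)
baseline_rmse = np.sqrt(baseline_mse)
baseline_r2 = r2_score(y_te, base_pred)

print("Baseline MSE:", round(baseline_mse, 3))
print("Baseline RMSE:", round(baseline_rmse, 3))
print("Baseline R2:", round(baseline_r2, 3))
```

A constant prediction can never score above zero on R2, so a trained model is only useful if its test-set metrics clearly beat these numbers.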

## 8. Predict Rating for a New Movie
Use the trained pipeline to predict the rating for a new, user-defined movie.

In [None]:
# Example: Predict the rating for a new movie
new_movie = pd.DataFrame([{
    'genre': 'Action',
    'director': 'Nolan',
    'actor': 'DiCaprio',
    'budget_million': 150,
    'duration_min': 130
}])
predicted_rating = model.predict(new_movie)
print("Predicted rating for new movie:", predicted_rating[0])
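
A new movie may contain a category that never appeared in the training data. With scikit-learn's default `handle_unknown='error'`, OneHotEncoder raises a ValueError in that case, while `handle_unknown='ignore'` encodes the unseen value as all zeros. The sketch below contrasts the two on a toy column; the director name 'Villeneuve' is a hypothetical unseen value, not part of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'director': ['Nolan', 'Kubrick', 'Nolan']})
unseen = pd.DataFrame({'director': ['Villeneuve']})  # hypothetical unseen category

strict = OneHotEncoder().fit(train)                          # default: handle_unknown='error'
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)

# The strict encoder rejects the unseen category
try:
    strict.transform(unseen)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("Strict encoder error:", err)

# The lenient encoder maps it to a row of zeros (one column per known category)
encoded_unseen = lenient.transform(unseen).toarray()
print("Lenient encoding:", encoded_unseen)
```

An all-zero encoding means the model predicts without any information from that feature, so such predictions deserve extra caution.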