## Notebook 4: Predictive Analysis (`4_Predictive_Analysis.ipynb`)

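**Purpose:** To apply two machine learning models to the cleaned Netflix dataset: a classifier that predicts a title's type from its description, and a regression that relates a movie's duration to its release year.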

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('/Users/apple/Documents/Uni/Personal /Netlfix_Project/Data/netflix_cleaned.csv')
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Cast,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,September
1,s2,TV Show,Blood & Water,No Director,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,September
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Country Unavailable,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021,September
3,s4,TV Show,Jailbirds New Orleans,No Director,No Cast,Country Unavailable,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",2021,September
4,s5,TV Show,Kota Factory,No Director,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2021,September
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8788,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",2019,November
8789,s8804,TV Show,Zombie Dumb,No Director,No Cast,Country Unavailable,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",2019,July
8790,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,2019,November
8791,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",2020,January


*   **Task A: Classification (Logistic Regression)**
    *   **Goal:** Predict if a title is a "Movie" or "TV Show" based solely on its **Description**.
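    *   **Process:** We use **TF-IDF (Term Frequency-Inverse Document Frequency)** to turn the text descriptions into numerical vectors. TF-IDF gives a word more weight when it is frequent in one description but rare across the whole collection, so distinctive words (e.g., "season", "episode" vs. "film", "documentary") stand out.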
    *   **Model:** A Logistic Regression model is trained on these word vectors.
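
To make the process above concrete, here is a minimal sketch on a toy corpus. The four descriptions and their labels are invented for illustration and are not rows from the dataset; `make_pipeline` is used here only as one convenient way to chain the vectorizer and the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy descriptions, invented for illustration only.
docs = [
    "A detective returns for a new season of the crime series",
    "Each episode of this series follows a different family",
    "A feature film about a detective chasing a thief",
    "This documentary film explores life in the deep ocean",
]
labels = ["TV Show", "TV Show", "Movie", "Movie"]

# Turn each description into a row of TF-IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
X_toy = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_          # maps each word to its column index
print(X_toy.shape)                      # (number of documents, vocabulary size)
print(X_toy[0, vocab["season"]] > 0)    # "season" occurs in the first description
print(X_toy[2, vocab["season"]] == 0)   # and is absent from the third

# Chain vectorizer and classifier so both are fit on the same training text.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(docs, labels)
toy_prediction = model.predict(["A new season of the series"])[0]
print(toy_prediction)
```

With a pipeline like this, the vocabulary and IDF weights are learned only from whatever text is passed to `fit`, which keeps the test split out of the feature extraction step.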

In [4]:
# Can we predict if a title is a Movie or TV Show just by reading its description?

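# 1. Prepare Data
# Note: the vectorizer is fit on all descriptions before the split, so the
# test rows also influence the vocabulary and IDF weights (mild data leakage).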
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X = tfidf.fit_transform(df['description'])
y = df['type']

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# 4. Evaluate
y_pred = clf.predict(X_test)
print("Classification Report (Type Prediction):\n")
print(classification_report(y_test, y_pred))

Classification Report (Type Prediction):

              precision    recall  f1-score   support

       Movie       0.75      0.97      0.85      1221
     TV Show       0.79      0.29      0.42       538

    accuracy                           0.76      1759
   macro avg       0.77      0.63      0.63      1759
weighted avg       0.76      0.76      0.72      1759



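*   **Result:** The model reaches 0.76 accuracy on the 1,759 held-out titles, but performance is uneven: it recovers 97% of movies (recall 0.97) and only 29% of TV shows (recall 0.29). Description text alone lets it recognise movies reliably, while most TV shows are misclassified as movies, which is consistent with movies being the majority class.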
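
One way to put the accuracy figure in context is to compare it with a baseline that always predicts the majority class. The short calculation below uses only the support counts and recall printed in the classification report; the "always predict Movie" baseline is a hypothetical reference point, not something the notebook trains.

```python
# Support counts and recall copied from the classification report.
n_movie, n_tv = 1221, 538
n_total = n_movie + n_tv

# Accuracy of a hypothetical baseline that labels every title "Movie".
baseline_accuracy = n_movie / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")

# Approximate number of TV shows found and missed, from the reported recall.
tv_recall = 0.29
tv_found = round(tv_recall * n_tv)
tv_missed = n_tv - tv_found
print(f"TV shows found: {tv_found}, missed: {tv_missed}")
```

The baseline comes out at roughly 0.69, so the model's 0.76 is a modest improvement. If TV show recall matters, one option worth trying (untested here) is `LogisticRegression(class_weight="balanced")`, which reweights the classes during training.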


*   **Task B: Regression (Linear Regression)**
    *   **Goal:** Predict the **Duration** of a movie based on its **Release Year**.
    *   **Process:** We filter for movies, drop rows with a missing duration, and convert the duration column from text to an integer number of minutes (e.g., "90 min" becomes 90).
    *   **Model:** A Linear Regression model fits a straight line to the data to see if there is a trend (e.g., are movies getting longer or shorter over time?).
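
The cleaning and fitting steps described above can be sketched on a few made-up rows. The values below are invented so that the trend is exactly linear; they are not taken from the dataset, and `regex=False` is added here only to make the string replacement literal.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    "release_year": [2000, 2005, 2010, 2015, 2020],
    "duration": ["120 min", "110 min", "100 min", None, "80 min"],
})

# Drop the row with a missing duration, then convert "120 min" -> 120.
toy = toy.dropna(subset=["duration"])
toy["duration_min"] = (
    toy["duration"].str.replace(" min", "", regex=False).astype(int)
)

# Fit a straight line: duration_min = slope * release_year + intercept.
toy_reg = LinearRegression().fit(toy[["release_year"]], toy["duration_min"])
slope = toy_reg.coef_[0]
print(f"Slope: {slope:.2f} minutes per year")
```

In this toy data each later year removes two minutes, so the fitted slope is negative; the sign of the slope is what the notebook reads as the direction of the trend.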

In [7]:
# Is there a trend in movie length over the years? Let's try to predict duration based on the year it was released.

# 1. Prepare Data (Movies only)
movies = df[df['type'] == 'Movie'].copy()
movies.dropna(subset=['duration'], inplace=True)
movies['duration_min'] = movies['duration'].str.replace(' min', '').astype(int)

# We will use Release Year to predict Duration
X_reg = movies[['release_year']]
y_reg = movies['duration_min']

# 2. Split Data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# 3. Train Model
reg = LinearRegression()
reg.fit(X_train_r, y_train_r)

# 4. Evaluate
y_pred_r = reg.predict(X_test_r)

print("Regression Analysis (Duration Prediction):")
print(f"Mean Squared Error: {mean_squared_error(y_test_r, y_pred_r):.2f}")
print(f"R2 Score: {r2_score(y_test_r, y_pred_r):.4f}")
print(f"Coefficient (Trend): {reg.coef_[0]:.4f}")
# The coefficient is the change in predicted minutes per release year;
# a negative value implies that more recent movies tend to be shorter.

Regression Analysis (Duration Prediction):
Mean Squared Error: 691.11
R2 Score: 0.0343
Coefficient (Trend): -0.6231


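*   **Result:** The coefficient of -0.6231 indicates that predicted duration falls by about 0.62 minutes for each later release year. The fit is weak, however: an R2 score of 0.0343 means release year explains only about 3% of the variation in duration, and the Mean Squared Error of 691.11 (in squared minutes) shows that individual predictions are often far from the actual runtime.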
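
Because the Mean Squared Error is in squared minutes, taking its square root gives an error in minutes that is easier to read. The sketch below applies that conversion to the printed metrics and rescales the coefficient to a per-decade change; it uses only the numbers from the regression output.

```python
import math

# Values copied from the regression output.
mse = 691.11      # squared minutes
coef = -0.6231    # minutes per release year

rmse = math.sqrt(mse)            # typical prediction error, in minutes
change_per_decade = coef * 10    # trend expressed per ten release years

print(f"RMSE: {rmse:.1f} minutes")
print(f"Change per decade: {change_per_decade:.1f} minutes")
```

An error of roughly 26 minutes against a trend of about 6 minutes per decade illustrates why release year on its own is a poor predictor of an individual movie's length.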