## Feature Engineering

In [32]:
import numpy as np
import pandas as pd
from pathlib import Path

current_dir = Path().resolve()
data_path = current_dir.joinpath('Data').resolve()

ori_pickle_file_path  = data_path.joinpath('movies_df_original.pkl').resolve()
cleaned_pickle_file_path = data_path.joinpath('movies_df_cleaned.pkl').resolve()

# read df from pickle file
movies_df_cleaned = pd.read_pickle(cleaned_pickle_file_path)
movies_df_cleaned.head()

Unnamed: 0,movie_id,movie_title,movie_info,rating,genre,directors,in_theaters_date,on_streaming_date,runtime_in_minutes,critic_rating,critic_count,audience_rating,audience_count,release_year
0,1,Percy Jackson & the Olympians: The Lightning T...,A teenager discovers he's the descendant of a ...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,2010-02-12,2010-06-29,83.0,49,144,53.0,254287.0,2010
1,2,Please Give,Kate has a lot on her mind. There's the ethics...,R,Comedy,Nicole Holofcener,2010-04-30,2010-10-19,90.0,86,140,64.0,11567.0,2010
2,3,10,Blake Edwards' 10 stars Dudley Moore as George...,R,"Comedy, Romance",Blake Edwards,1979-10-05,1997-08-27,118.0,68,22,53.0,14670.0,1979
3,4,12 Angry Men (Twelve Angry Men),"A Puerto Rican youth is on trial for murder, a...",NR,"Classics, Drama",Sidney Lumet,1957-04-13,2001-03-06,95.0,100,51,97.0,105000.0,1957
4,5,"20,000 Leagues Under The Sea","This 1954 Disney version of Jules Verne's 20,0...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,1954-01-01,2003-05-20,127.0,89,27,74.0,68860.0,1954


### 1. Split the data into a training and test set, with the training data including movies released in theatres before 2010 and the test data including movies released in theatres in 2010 and after.

In [33]:
train_data = movies_df_cleaned[movies_df_cleaned['release_year'] < 2010].dropna(subset = ['runtime_in_minutes'])
test_data = movies_df_cleaned[movies_df_cleaned['release_year'] >= 2010].dropna(subset = ['runtime_in_minutes'])
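
The split above can be wrapped in a small helper with a sanity check. This is only a sketch on a toy frame: the function name `split_by_release_year` and the toy data are illustrative and not part of the original notebook. Rows with a missing `release_year` would fall into neither split, so the length check assumes the year column has no missing values.

```python
import pandas as pd

def split_by_release_year(df, cutoff_year=2010, year_col="release_year"):
    """Return (train, test): rows before cutoff_year, and rows from cutoff_year on."""
    train = df[df[year_col] < cutoff_year].copy()
    test = df[df[year_col] >= cutoff_year].copy()
    return train, test

# Toy data standing in for movies_df_cleaned (hypothetical values)
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "release_year": [1999, 2009, 2010, 2015],
})
toy_train, toy_test = split_by_release_year(toy)

# Every row lands in exactly one split, and the splits do not overlap in time
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["release_year"].max() < toy_test["release_year"].min()
```

Taking explicit copies here also means later column assignments act on independent frames rather than on slices of the original.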

### 2. Your goal is to predict critic_rating. Update your training and test data sets to NOT include these columns: audience_rating, critic_count, audience_count.

In [34]:
columns_to_exclude = ['audience_rating', 'critic_count', 'audience_count']
train_data = train_data.drop(columns=columns_to_exclude)
test_data = test_data.drop(columns=columns_to_exclude)

### 3. Create a new DataFrame containing the following ID column and features (using only the training data):
- movie_title
- runtime_in_minutes
- NEW: kid_friendly (1 if G or PG, 0 if other ratings)
- NEW: dummy variable columns for each genre

In [35]:
# kid_friendly

train_data['kid_friendly'] = train_data['rating'].apply(lambda x: 1 if x in ['G', 'PG'] else 0)
train_data.head()

Unnamed: 0,movie_id,movie_title,movie_info,rating,genre,directors,in_theaters_date,on_streaming_date,runtime_in_minutes,critic_rating,release_year,kid_friendly
2,3,10,Blake Edwards' 10 stars Dudley Moore as George...,R,"Comedy, Romance",Blake Edwards,1979-10-05,1997-08-27,118.0,68,1979,0
3,4,12 Angry Men (Twelve Angry Men),"A Puerto Rican youth is on trial for murder, a...",NR,"Classics, Drama",Sidney Lumet,1957-04-13,2001-03-06,95.0,100,1957,0
4,5,"20,000 Leagues Under The Sea","This 1954 Disney version of Jules Verne's 20,0...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,1954-01-01,2003-05-20,127.0,89,1954,1
5,6,"10,000 B.C.",A young outcast from a primitive tribe is forc...,PG-13,"Action & Adventure, Classics, Drama",Roland Emmerich,2008-03-07,2008-06-24,109.0,8,2008,0
6,7,The 39 Steps,A man in London tries to help a counterespiona...,NR,"Action & Adventure, Classics, Mystery & Suspense",Alfred Hitchcock,1935-08-01,2035-06-06,87.0,96,1935,0
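
As an aside, the same flag can be built without a row-wise `apply`. The sketch below uses `Series.isin` on a toy ratings series (the values are illustrative); a missing rating is treated as not kid-friendly, which matches what the lambda above would return.

```python
import pandas as pd

# Toy ratings, including a missing value (hypothetical data)
ratings = pd.Series(["G", "PG", "PG-13", "R", "NR", None])

# True where the rating is G or PG, then cast the booleans to 0/1
kid_friendly = ratings.isin(["G", "PG"]).astype(int)
print(kid_friendly.tolist())
```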


In [36]:
new_train_data = train_data[['movie_title', 'runtime_in_minutes', 'rating', 'genre', 'kid_friendly']]
new_train_data.head()

Unnamed: 0,movie_title,runtime_in_minutes,rating,genre,kid_friendly
2,10,118.0,R,"Comedy, Romance",0
3,12 Angry Men (Twelve Angry Men),95.0,NR,"Classics, Drama",0
4,"20,000 Leagues Under The Sea",127.0,G,"Action & Adventure, Drama, Kids & Family",1
5,"10,000 B.C.",109.0,PG-13,"Action & Adventure, Classics, Drama",0
6,The 39 Steps,87.0,NR,"Action & Adventure, Classics, Mystery & Suspense",0


In [37]:
# Get genre dummies: split the comma-separated genre string, one-hot encode each
# genre, then sum back to one row per movie. drop_first=True drops the first genre
# column (Action & Adventure), which becomes the baseline.
# Working on a separate Series avoids assigning to a slice of train_data.
genre_split = new_train_data['genre'].str.split(', ').explode()
genre_dummy = pd.get_dummies(genre_split, drop_first = True)
genre_dummy = genre_dummy.groupby(genre_split.index).sum()

new_train_data = pd.concat([new_train_data[['movie_title', 'runtime_in_minutes', 'kid_friendly', 'rating']], genre_dummy], axis=1)
new_train_data.head()


Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Horror,Kids & Family,Musical & Performing Arts,Mystery & Suspense,Romance,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western
2,10,118.0,0,R,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
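
For comparison, pandas can build multi-label indicator columns in one step with `Series.str.get_dummies`. This is an alternative sketch on toy data, not the method used above: it keeps every genre (there is no `drop_first`), so it would produce one more column than the notebook's version.

```python
import pandas as pd

# Toy genre strings with a non-default index (hypothetical data)
genres = pd.Series(
    ["Comedy, Romance", "Classics, Drama", "Action & Adventure, Drama"],
    index=[2, 3, 4],
    name="genre",
)

# One 0/1 column per distinct genre; the original index is preserved
genre_flags = genres.str.get_dummies(sep=", ")
print(genre_flags.columns.tolist())
```

Because each movie can belong to several genres, the columns are not mutually exclusive, so dropping one of them is a modeling choice rather than a requirement for avoiding perfect collinearity.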


### 4. Create 3 new features that you think will do a good job predicting the critic_rating. Each new feature should use various combinations of the columns from your training data.

#### 1. Create a binary feature for unrated movies

In [38]:
new_train_data['Not_Rated'] = new_train_data['rating'].apply(lambda x : 1 if x in ['Not Rated', 'NR'] else 0)
new_train_data.head()

Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Kids & Family,Musical & Performing Arts,Mystery & Suspense,Romance,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western,Not_Rated
2,10,118.0,0,R,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,1


#### 2. Divide runtime into three categories

In [39]:
q33 = new_train_data['runtime_in_minutes'].quantile(0.33)
q67 = new_train_data['runtime_in_minutes'].quantile(0.67)

# Create categories based on quantiles
def categorize_runtime(runtime):
    if runtime < q33:
        return 'Short'
    elif q33 <= runtime <= q67:
        return 'Medium'
    else:
        return 'Long'

new_train_data['runtime_category'] = new_train_data['runtime_in_minutes'].apply(categorize_runtime)
new_train_data.head()


Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Musical & Performing Arts,Mystery & Suspense,Romance,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western,Not_Rated,runtime_category
2,10,118.0,0,R,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,Long
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,Medium
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Long
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Long
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,Short


In [40]:
runtime_dummy = pd.get_dummies(new_train_data['runtime_category'], drop_first= True, dtype=int)
new_train_data = pd.concat([new_train_data, runtime_dummy], axis = 1 )
new_train_data = new_train_data.drop('runtime_category', axis = 1)
new_train_data.head()


Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Mystery & Suspense,Romance,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western,Not_Rated,Medium,Short
2,10,118.0,0,R,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,1,0
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,1
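
The binning rule above can also be written in vectorized form. The sketch below assumes the same rule (below the 33rd percentile is Short, up to and including the 67th percentile is Medium, otherwise Long) and uses toy runtimes. The key point is that the cut points come from the training runtimes only and are then reused unchanged on the test runtimes.

```python
import numpy as np
import pandas as pd

# Toy runtimes (hypothetical values)
train_runtime = pd.Series([80.0, 90.0, 95.0, 100.0, 110.0, 120.0, 130.0])
test_runtime = pd.Series([70.0, 98.0, 150.0])

# Cut points are learned from the training data only
q33 = train_runtime.quantile(0.33)
q67 = train_runtime.quantile(0.67)

def categorize(runtime):
    # np.select takes the first condition that is true for each element
    conditions = [runtime < q33, runtime <= q67]
    return np.select(conditions, ["Short", "Medium"], default="Long")

test_categories = categorize(test_runtime)
print(list(test_categories))
```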


#### 3. Director Popularity and movie count

In [41]:
new_train_data = pd.concat([new_train_data, train_data['directors']], axis = 1)
new_train_data.head()

Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Romance,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western,Not_Rated,Medium,Short,directors
2,10,118.0,0,R,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,Blake Edwards
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,1,1,0,Sidney Lumet
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Richard Fleischer
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Roland Emmerich
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,Alfred Hitchcock


In [42]:
director_counts = new_train_data['directors'].value_counts()
new_train_data['director_movie_count'] = new_train_data['directors'].map(director_counts)

In [43]:
def categorize_director(count):
    if count > 5:
        return 'HighPop'
    elif count > 2:
        return 'MediumPop'
    else:
        return 'LowPop'

new_train_data['director_popularity'] = new_train_data['director_movie_count'].apply(categorize_director)

dpdummy  = pd.get_dummies(new_train_data['director_popularity'], drop_first = True, dtype = int)
new_train_data = pd.concat([new_train_data,dpdummy], axis = 1)
new_train_data = new_train_data.drop(['director_popularity', 'directors'], axis = 1)
new_train_data.head()

Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,rating,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,...,Special Interest,Sports & Fitness,Television,Western,Not_Rated,Medium,Short,director_movie_count,LowPop,MediumPop
2,10,118.0,0,R,0,0,0,0,1,0,...,0,0,0,0,0,0,0,27.0,0,0
3,12 Angry Men (Twelve Angry Men),95.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,1,1,0,30.0,0,0
4,"20,000 Leagues Under The Sea",127.0,1,G,0,0,0,0,0,0,...,0,0,0,0,0,0,0,16.0,0,0
5,"10,000 B.C.",109.0,0,PG-13,0,0,0,1,0,0,...,0,0,0,0,0,0,0,8.0,0,0
6,The 39 Steps,87.0,0,NR,0,0,0,1,0,0,...,0,0,0,0,1,0,1,36.0,0,0
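
A count feature like this is learned from the data, so one option is to treat the training-set counts as a lookup table and apply that same table to any other data set. The sketch below shows the idea on toy data; the director names in `test_directors` and the choice to give unseen directors a count of 0 are assumptions made for illustration.

```python
import pandas as pd

# Toy director columns (hypothetical data)
train_directors = pd.Series(
    ["Sidney Lumet", "Sidney Lumet", "Blake Edwards", "Sidney Lumet"]
)
test_directors = pd.Series(["Sidney Lumet", "Unseen Director", "Blake Edwards"])

# Learn the counts once, from the training data
train_counts = train_directors.value_counts()

# Apply the same lookup to the test data; unseen directors get 0 (an assumption)
test_director_count = test_directors.map(train_counts).fillna(0).astype(int)
print(test_director_count.tolist())
```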


## Modeling

### 1. Make sure you apply the same transformations on your X_test and y_test data sets that you applied on the X_train and y_train data sets.

In [44]:
# Apply the same feature engineering to the test data set

test_data['kid_friendly'] = test_data['rating'].apply(lambda x: 1 if x in ['G', 'PG'] else 0)
new_test_data = test_data[['movie_title', 'runtime_in_minutes', 'rating', 'genre', 'kid_friendly']]

# Genre dummies, built on a separate Series to avoid assigning to a slice of test_data
genre_split_t = new_test_data['genre'].str.split(', ').explode()
genre_dummy_t = pd.get_dummies(genre_split_t, drop_first = True)
genre_dummy_t = genre_dummy_t.groupby(genre_split_t.index).sum()
# Keep exactly the genre columns seen in training, in the same order
genre_dummy_t = genre_dummy_t.reindex(columns=genre_dummy.columns, fill_value=0)

new_test_data = pd.concat([new_test_data[['movie_title', 'runtime_in_minutes', 'kid_friendly', 'rating']], genre_dummy_t], axis=1)

new_test_data['Not_Rated'] = new_test_data['rating'].apply(lambda x : 1 if x in ['Not Rated', 'NR'] else 0)

# categorize_runtime uses q33 and q67 from the training data, so the bins match
new_test_data['runtime_category'] = new_test_data['runtime_in_minutes'].apply(categorize_runtime)
runtime_dummyt = pd.get_dummies(new_test_data['runtime_category'], drop_first= True, dtype=int)
new_test_data = pd.concat([new_test_data, runtime_dummyt], axis = 1 )
new_test_data = new_test_data.drop('runtime_category', axis = 1)

# Note: director counts are recomputed from the test set here; mapping the
# training-set counts instead would mirror the training transformation more closely.
new_test_data = pd.concat([new_test_data, test_data['directors']], axis = 1)
director_counts = new_test_data['directors'].value_counts()
new_test_data['director_movie_count'] = new_test_data['directors'].map(director_counts)
new_test_data['director_popularity'] = new_test_data['director_movie_count'].apply(categorize_director)

dpdummyt  = pd.get_dummies(new_test_data['director_popularity'], drop_first = True, dtype = int)
new_test_data = pd.concat([new_test_data,dpdummyt], axis = 1)
new_test_data = new_test_data.drop(['director_popularity', 'directors'], axis = 1)
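
When dummy columns are built separately for the training and test data, the two frames can end up with different columns or a different column order. A common safeguard is to reindex the test frame to the training columns. The sketch below uses toy frames (column values are illustrative): a column missing from the test data is filled with 0, and a column that never appeared in training is dropped.

```python
import pandas as pd

# Toy feature frames with mismatched columns (hypothetical data)
train_X = pd.DataFrame({"Comedy": [1, 0], "Drama": [0, 1], "Western": [0, 1]})
test_X = pd.DataFrame({"Drama": [1, 0], "Comedy": [0, 1], "Documentary": [1, 0]})

# Align the test columns to the training columns: same set, same order
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_aligned.columns.tolist())
```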


### 2. Set up y_train, y_test, X_train, X_test

In [46]:
y_train = train_data['critic_rating']
y_test = test_data['critic_rating']
X_train = new_train_data.drop(['rating', 'movie_title', 'director_movie_count'], axis = 1)
X_test = new_test_data.drop(['rating', 'movie_title', 'director_movie_count'], axis = 1)


### 3. Fit models
- Model 1: Use only runtime_in_minutes
- Model 2: Use runtime_in_minutes and kid_friendly
- Model 3: Use runtime_in_minutes, kid_friendly and the dummy columns for the genres

In [47]:
from sklearn.linear_model import LinearRegression

# Model1
X_train1 = X_train[['runtime_in_minutes']]
X_test1 = X_test[['runtime_in_minutes']]

model1 = LinearRegression()

model1.fit(X_train1, y_train)
y_pred1 = model1.predict(X_test1)

# Model2
X_train2 = X_train[['runtime_in_minutes', 'kid_friendly']]
X_test2 = X_test[['runtime_in_minutes', 'kid_friendly']]

model2 = LinearRegression()

model2.fit(X_train2, y_train)
y_pred2 = model2.predict(X_test2)

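# Model3
# Note: X_train ends with five engineered columns (Not_Rated, Medium, Short,
# LowPop, MediumPop), so [:-6] also leaves out the last genre dummy, Western.
X_train3 = X_train.iloc[:, :-6]
X_test3 = X_test.iloc[:, :-6]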

model3 = LinearRegression()
model3.fit(X_train3, y_train)
y_pred3 = model3.predict(X_test3)
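
Positional slices such as `iloc[:, :-6]` are easy to get wrong by one column. A sketch of a name-based alternative is shown below on an empty toy frame with a shortened column list; the lists `base_cols` and `engineered_cols` are illustrative names, not part of the notebook. The last assertion shows how a positional slice of six drops one genre column when only five engineered columns follow the genres.

```python
import pandas as pd

# Toy frame with the same column layout as X_train, using two genres only
X = pd.DataFrame(columns=[
    "runtime_in_minutes", "kid_friendly", "Comedy", "Western",
    "Not_Rated", "Medium", "Short", "LowPop", "MediumPop",
])

base_cols = ["runtime_in_minutes", "kid_friendly"]
engineered_cols = ["Not_Rated", "Medium", "Short", "LowPop", "MediumPop"]

# Everything that is neither a base column nor an engineered column is a genre dummy
genre_cols = [c for c in X.columns if c not in base_cols + engineered_cols]
model3_cols = base_cols + genre_cols

assert model3_cols == ["runtime_in_minutes", "kid_friendly", "Comedy", "Western"]
# The positional slice drops six columns, one more than the five engineered ones
assert "Western" not in X.iloc[:, :-6].columns
```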

### 4. Score the linear regression

In [48]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def score_model(y_test, y_pred):
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    print(f"R²: {r2:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    
    return r2, mae, rmse

In [49]:
print("Model 1:")
score_model(y_test, y_pred1)

print("\nModel 2:")
score_model(y_test, y_pred2)

print("\nModel 3:")
score_model(y_test, y_pred3)

Model 1:
R²: -0.0024
MAE: 24.4777
RMSE: 28.3738

Model 2:
R²: -0.0047
MAE: 24.5485
RMSE: 28.4055

Model 3:
R²: 0.1272
MAE: 22.4495
RMSE: 26.4760


(0.1271883079097076,
 np.float64(22.449472328792012),
 np.float64(26.476005197745465))

Analysis:

R² (Coefficient of Determination):

Model 1 and Model 2 have negative R² values on the test set. A negative R² means the model's predictions have a larger squared error than a horizontal line at the mean of y_test, so runtime and kid_friendly alone capture essentially none of the variance in critic_rating.
Model 3 has an R² of 0.1272, which is still low but shows that the model captures some of the variance in the target variable. It is a clear improvement over the other two models.

MAE (Mean Absolute Error):

Model 3 has the lowest MAE (22.4495) compared to Model 1 (24.4777) and Model 2 (24.5485). This means Model 3 has the lowest average absolute error, making it the most accurate of the three at predicting the target variable.

RMSE (Root Mean Squared Error):

Model 3 also has the lowest RMSE (26.4760), compared to Model 1 (28.3738) and Model 2 (28.4055). This is another indication that Model 3 has a lower overall prediction error than Model 1 and Model 2.

Conclusion:

Model 3 performs the best among the three models. It has a higher R² value and lower MAE and RMSE, which suggests it is better at explaining the variance in the target and makes more accurate predictions.
The better performance of Model 3 is likely due to the additional features (the dummy columns for the genres). These features explain more of the variance in the target variable, improving the model's predictive capability.
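
To make the negative R² values concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot on three toy targets. Predicting the test mean gives exactly 0, and any constant prediction away from that mean gives a negative value. One plausible way this arises here (an assumption, not something verified in this notebook) is that the average critic rating before 2010 differs from the average in 2010 and after.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([40.0, 60.0, 80.0])   # toy targets, mean = 60

r2_mean = r2(y_true, np.full(3, 60.0))  # predict the test mean everywhere
r2_off = r2(y_true, np.full(3, 70.0))   # predict a constant that misses the mean

print(r2_mean)  # 0.0
print(r2_off)   # -0.375
```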

### 5. Fitting 3 more linear regression models on your own

In [50]:
X_train.head(1)

Unnamed: 0,runtime_in_minutes,kid_friendly,Animation,Anime & Manga,Art House & International,Classics,Comedy,Cult Movies,Documentary,Drama,...,Science Fiction & Fantasy,Special Interest,Sports & Fitness,Television,Western,Not_Rated,Medium,Short,LowPop,MediumPop
2,118.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
# Model 4: every column up to and including the Not_Rated flag
X_train4 = X_train.iloc[:, :-4]
X_test4 = X_test.iloc[:, :-4]

model4 = LinearRegression()

model4.fit(X_train4, y_train)
y_pred4 = model4.predict(X_test4)

print("Model 4:")
score_model(y_test, y_pred4)

# Model 5: Model 4 features plus the runtime-category dummies (Medium, Short)

X_train5 = X_train.iloc[:, :-2]
X_test5 = X_test.iloc[:, :-2]

model5 = LinearRegression()

model5.fit(X_train5, y_train)
y_pred5 = model5.predict(X_test5)

print("\nModel 5:")
score_model(y_test, y_pred5)

# Model 6: all features, adding the director-popularity dummies (LowPop, MediumPop)
X_train6 = X_train.iloc[:, :]
X_test6 = X_test.iloc[:, :]

model6 = LinearRegression()

model6.fit(X_train6, y_train)
y_pred6 = model6.predict(X_test6)

print("\nModel 6:")
score_model(y_test, y_pred6)


Model 4:
R²: 0.1637
MAE: 21.7475
RMSE: 25.9158

Model 5:
R²: 0.1670
MAE: 21.6743
RMSE: 25.8647

Model 6:
R²: 0.1579
MAE: 21.9027
RMSE: 26.0065


(0.1578676198973188,
 np.float64(21.902733279237044),
 np.float64(26.006527254924833))

#### Observations

**Model 4**
- **Features**: Builds on Model 3 by adding "whether the movie is not rated (NR)".
- **Performance**:
  - \( R^2 = 0.1637 \)
  - RMSE = 25.9158
  - MAE = 21.7475
  

**Model 5**
- **Features**: Builds upon Model 4 by adding "length of the movie" (the Medium and Short runtime-category dummies).
- **Performance**:
  - \( R^2 = 0.1670 \)
  - RMSE = 25.8647
  - MAE = 21.6743
- **Improvement**: Adding "length of the movie" increased \( R^2 \) and reduced RMSE and MAE, indicating better predictive power.

**Model 6**
- **Features**: Builds upon Model 5 by including "director popularity."
- **Performance**:
  - \( R^2 = 0.1579 \)
  - RMSE = 26.0065
  - MAE = 21.9027
- **Observation**: Adding "director popularity" lowered \( R^2 \) (from 0.1670 to 0.1579) and raised both RMSE and MAE compared to Model 5. This could indicate that "director popularity" is not a strong predictor for the target variable or introduces noise.


### Next Steps for Analysis
1. **Why is Model 6 underperforming compared to Model 5?**
   - **Correlation Check**: Investigate if "director popularity" is weakly correlated with the target variable.
   - **Multicollinearity**: Check if "director popularity" is highly correlated with other predictors, leading to multicollinearity.

2. **Consider Feature Selection or Regularization**:
   - Use techniques like **Lasso Regression** or **Ridge Regression** to identify and penalize less useful features.
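
The regularization idea can be sketched on synthetic data. The example below is not fitted to the movie data: it generates five features of which only the first two drive the target, then fits `Ridge` and `Lasso` from scikit-learn. The `alpha` values are arbitrary choices for illustration. Lasso tends to set the coefficients of uninformative features to exactly zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of five features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can zero out coefficients

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```

In practice the penalty strength would be tuned, for example with `RidgeCV` or `LassoCV`, and features on very different scales (such as runtime in minutes next to 0/1 dummies) are usually standardized first.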


##### Which Model Performs the Best?
- **Model 5** performs the best on the test set, with:
  - The highest \( R^2 \) (0.1670).
  - The lowest RMSE (25.8647) and MAE (21.6743) of the six models.
- Adding "whether the movie is not rated" and the runtime categories to runtime, kid_friendly and the genre dummies provides the most predictive power for the target variable in this scenario.



### 6. List 3 other things you could do at this point to try to improve your model

1. Keyword extraction could improve the model. Keywords could be taken from the readily available keywords in the TMDB API, or extracted from an existing column in the data set, i.e. movie_info.
2. We should explore non-parametric models. Simple linear regression is good for explainability but has its limitations; if accuracy is the goal, a non-parametric model might be a better choice.
3. We could also include crew and cast information, which is likewise available from TMDB.
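
As a rough illustration of the second point, the sketch below compares a linear regression with a random forest on synthetic data where the target depends on the square of the feature. Nothing here uses the movie data, and `RandomForestRegressor` is just one possible non-parametric model; the sample sizes and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic non-linear relationship: y = x^2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

# Simple hold-out split: first 300 rows to fit, last 100 to evaluate
X_fit, X_eval = X[:300], X[300:]
y_fit, y_eval = y[:300], y[300:]

linear = LinearRegression().fit(X_fit, y_fit)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)

linear_r2 = r2_score(y_eval, linear.predict(X_eval))
forest_r2 = r2_score(y_eval, forest.predict(X_eval))
print(f"linear R2: {linear_r2:.3f}, forest R2: {forest_r2:.3f}")
```

A straight line cannot follow the U-shaped relationship, while the forest can. Whether a similar gain would appear on the movie features is an open question that would need to be checked on the actual test set.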