![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [60]:
#Importing necessary libraries and modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [61]:
#Reading the dataset
df = pd.read_csv('rental_info.csv')
#Exploring the dataset
print(df.head())
print(df.info())
print(df.isna().sum())

                 rental_date  ... rental_rate_2
0  2005-05-25 02:54:33+00:00  ...        8.9401
1  2005-06-15 23:19:16+00:00  ...        8.9401
2  2005-07-10 04:27:45+00:00  ...        8.9401
3  2005-07-31 12:06:41+00:00  ...        8.9401
4  2005-08-19 12:30:04+00:00  ...        8.9401

[5 rows x 15 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13      

In [62]:
#Creating a rental_length_days column
#Examining the columns to identify the datetime format
print(df['return_date'].head())
print(df['rental_date'].head())
#The format is year-month-day, followed by the time and a UTC offset
#Converting the return_date and rental_date to datetime format
df['rental_date'] = pd.to_datetime(df['rental_date']).dt.tz_localize(None) #removed the timezone.
df['return_date'] = pd.to_datetime(df['return_date']).dt.tz_localize(None) #removed the timezone.
#Creating a rental_length_days column by subtracting rental_date from return_date
df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.days
#Examining the results
print(df['rental_length_days'].head())
print(df['rental_length_days'].value_counts())

0    2005-05-28 23:40:33+00:00
1    2005-06-18 19:24:16+00:00
2    2005-07-17 10:11:45+00:00
3    2005-08-02 14:30:41+00:00
4    2005-08-23 13:35:04+00:00
Name: return_date, dtype: object
0    2005-05-25 02:54:33+00:00
1    2005-06-15 23:19:16+00:00
2    2005-07-10 04:27:45+00:00
3    2005-07-31 12:06:41+00:00
4    2005-08-19 12:30:04+00:00
Name: rental_date, dtype: object
0    3
1    2
2    7
3    2
4    4
Name: rental_length_days, dtype: int64
7    1832
1    1829
8    1771
5    1767
6    1758
4    1757
2    1713
3    1694
9     894
0     846
Name: rental_length_days, dtype: int64
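Note that `.dt.days` keeps only the whole-day part of each timedelta, so a rental of 3 days and 20 hours counts as 3 days. The sketch below illustrates this on the first two rows printed above; `toy` and `rental_length_exact` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# First two rental/return timestamps, copied from the printed output above
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-06-18 19:24:16+00:00"],
})

# Parse as UTC, then drop the timezone, mirroring the cell above
for col in ["rental_date", "return_date"]:
    toy[col] = pd.to_datetime(toy[col], utc=True).dt.tz_localize(None)

delta = toy["return_date"] - toy["rental_date"]

# Whole days only (the fractional part is discarded)
toy["rental_length_days"] = delta.dt.days

# Fractional days, for comparison (illustrative column)
toy["rental_length_exact"] = delta.dt.total_seconds() / 86400

print(toy[["rental_length_days", "rental_length_exact"]])
```

If fractional rental lengths mattered for the target, the `total_seconds()` variant would be one option; the notebook's whole-day target is kept unchanged here.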


In [63]:
# Examining the special_features column
print(df['special_features'].head())
print(df['special_features'].value_counts())

# Create 'deleted_scenes' column (1 if the special_features string contains 'Deleted Scenes', else 0)
# special_features is stored as a string, so `in` performs a substring test
df['deleted_scenes'] = df['special_features'].apply(lambda x: 1 if 'Deleted Scenes' in x else 0)

# Create 'behind_the_scenes' column (1 if the special_features string contains 'Behind the Scenes', else 0)
df['behind_the_scenes'] = df['special_features'].apply(lambda x: 1 if 'Behind the Scenes' in x else 0)

# Verify the results
print(df[['deleted_scenes', 'behind_the_scenes']].head())
print(df['deleted_scenes'].value_counts())
print(df['behind_the_scenes'].value_counts())

0    {Trailers,"Behind the Scenes"}
1    {Trailers,"Behind the Scenes"}
2    {Trailers,"Behind the Scenes"}
3    {Trailers,"Behind the Scenes"}
4    {Trailers,"Behind the Scenes"}
Name: special_features, dtype: object
{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted 
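The two flag columns above can also be built with pandas' vectorized string methods instead of `apply`. This is a minimal sketch, assuming `special_features` holds plain strings as the printed values suggest; `toy` and `flags` are illustrative names, and the sample values are taken from the output above.

```python
import pandas as pd

# A few special_features values as they appear in the printed output
toy = pd.Series(
    [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes"}',
        '{Commentaries}',
        '{Commentaries,"Deleted Scenes","Behind the Scenes"}',
    ],
    name="special_features",
)

# regex=False treats the pattern as a literal substring
flags = pd.DataFrame({
    "deleted_scenes": toy.str.contains("Deleted Scenes", regex=False).astype(int),
    "behind_the_scenes": toy.str.contains("Behind the Scenes", regex=False).astype(int),
})

print(flags)
```

Both approaches perform the same substring test, so they should produce identical columns on the full dataset.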

In [64]:
#Creating the feature matrix X and the target vector y
X = df[['amount', 'length','release_year', 'rental_rate', 'replacement_cost', 
        'amount_2', 'length_2', 'rental_rate_2', 'NC-17', 'PG', 
        'PG-13', 'deleted_scenes', 'behind_the_scenes','R']]
y = df['rental_length_days']
#Splitting the data into training and test sets, including 20% in the test set and setting random_state to 9
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
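To see what the split above does, here is a small sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration). It shows the 80/20 proportions and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Same arguments as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=9)

# Repeating the call with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=9)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```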

In [65]:
#Creating and instantiating a RandomForestRegressor
#Random forests handle numeric and binary features well, capture feature interactions, and are robust to feature scales
best_model = RandomForestRegressor(n_estimators=100, random_state=42)
#Fitting the model 
best_model.fit(X_train, y_train)
#Predicting values
y_pred = best_model.predict(X_test)
#Extracting the mse score
best_mse = MSE(y_test, y_pred) #mean_squared_error was imported under the alias MSE
print(round(best_mse, 2))

2.03
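For reference, the MSE is the mean of the squared differences between actual and predicted values. The sketch below computes it by hand and compares it with a predict-the-mean baseline and the company's target of 3. The actual values are the first five rental lengths printed earlier; the predictions in `y_hat` are hypothetical numbers chosen for illustration, not model output.

```python
import numpy as np

# First five rental lengths from the notebook output
y_true = np.array([3, 2, 7, 2, 4])

# Hypothetical predictions, for illustration only
y_hat = np.array([4.0, 2.0, 5.0, 3.0, 4.0])

# MSE: average squared error
mse = np.mean((y_true - y_hat) ** 2)

# Baseline: always predict the mean of the observed values
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

# Company requirement: MSE of 3 or less
meets_target = mse <= 3

print(round(mse, 2), round(baseline_mse, 2), meets_target)
```

Comparing against a baseline like this gives a sense of how much of the error a model removes beyond simply guessing the average.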
