A DVD rental company wants to predict how many days a customer will rent a DVD for, based on a set of features. They have asked me to try out regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model I build will help the company plan its inventory more efficiently.

The data they provided is in the CSV file rental_info.csv. It has the following features:

"rental_date": The date (and time) the customer rents the DVD.

"return_date": The date (and time) the customer returns the DVD.

"amount": The amount paid by the customer for renting the DVD.

"amount_2": The square of "amount".

"rental_rate": The rate at which the DVD is rented.

"rental_rate_2": The square of "rental_rate".

"release_year": The year the movie being rented was released.

"length": Length of the movie being rented, in minutes.

"length_2": The square of "length".

"replacement_cost": The amount it will cost the company to replace the DVD.

"special_features": Any special features the DVD also has, for example trailers or deleted scenes.

"NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. The reference dummy has already been dropped.




The main aim of this project is to predict the number of days a customer rents a DVD for, using regression models.

This project will follow the following steps:

* Pre-process the data provided, in this case, a csv file called rental_info.csv. 

* Read in the csv file rental_info.csv using pandas.

* Create a column named "rental_length_days" using the columns "return_date" and "rental_date", and add it to the pandas DataFrame. This column should contain the number of days a DVD has been rented by a customer.

* Create two columns of dummy variables from "special_features", each taking the value 1 when:
  * the value contains "Deleted Scenes", stored as a column called "deleted_scenes";
  * the value contains "Behind the Scenes", stored as a column called "behind_the_scenes".

* Make a pandas DataFrame called X containing all the appropriate features you can use to run the regression models, avoiding columns that leak data about the target.

* Choose the "rental_length_days" as the target column and save it as a pandas Series called y.
 
Following the preprocessing, I will:

* Split the data into train and test sets (X_train, X_test, y_train, y_test), with 20% of the total data in the test set and without any features that leak information about the target variable.

* Set random_state to 9 whenever a function or method involves randomness, for example, when doing the train-test split.

* Recommend a model yielding a mean squared error (MSE) less than 3 on the test set.

* Save the model you would recommend as a variable named best_model, and save its MSE on the test set as best_mse.

STEP 1: PREPROCESSING OUR DATA



This involves preparing and cleaning the raw data before feeding it into the machine learning model. This includes tasks like handling missing values, scaling features (normalization and standardization), encoding categorical variables and removing outliers.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
rental_info = pd.read_csv(r"C:\Users\DELL\Downloads\rental_info.csv")
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
rental_info.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [27]:
rental_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


HANDLING MISSING VALUES

In [28]:
rental_info.isna().sum()

rental_date         0
return_date         0
amount              0
release_year        0
rental_rate         0
length              0
replacement_cost    0
special_features    0
NC-17               0
PG                  0
PG-13               0
R                   0
amount_2            0
length_2            0
rental_rate_2       0
dtype: int64

ENCODING CATEGORICAL VARIABLES

In [37]:
### Add dummy variables
# Add dummy for deleted scenes
rental_info["deleted_scenes"] =  np.where(rental_info["special_features"].str.contains("Deleted Scenes"), 1, 0)

# Add dummy for behind the scenes
rental_info["behind_the_scenes"] =  np.where(rental_info["special_features"].str.contains("Behind the Scenes"), 1, 0)
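The dummy encoding can be checked on a few hand-made rows. This is a minimal sketch: the `sample` frame is hypothetical and only mimics the "special_features" format shown in the output of rental_info.head(). Passing `regex=False` is an optional choice made here so that the pattern is matched as a literal substring.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the "special_features" format shown above
sample = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes"}',
        '{Trailers}',
        '{"Deleted Scenes","Behind the Scenes"}',
    ]
})

# regex=False treats the pattern as a plain substring rather than a regular expression
has_deleted = sample["special_features"].str.contains("Deleted Scenes", regex=False)
has_behind = sample["special_features"].str.contains("Behind the Scenes", regex=False)

# 1 when the feature is present in the row, 0 otherwise
sample["deleted_scenes"] = np.where(has_deleted, 1, 0)
sample["behind_the_scenes"] = np.where(has_behind, 1, 0)

print(sample[["deleted_scenes", "behind_the_scenes"]])
```

A row that lists both features gets a 1 in both columns, so the two dummies are not mutually exclusive.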

In [42]:
rental_info.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4,0,1


In [48]:
# Create a column with the number of days each DVD was rented by a customer.

rental_info['rental_length_days'] = pd.to_datetime(rental_info['return_date']) - pd.to_datetime(rental_info['rental_date'])
# Keep the whole-day count as an integer so the column can serve as a numeric regression target
rental_info['rental_length_days'] = rental_info['rental_length_days'].dt.days
rental_info['rental_length_days'].head()



0    3
1    2
2    7
3    2
4    4
Name: rental_length_days, dtype: int64
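A regression target has to be numeric, so it helps to see what a timedelta column can be turned into. The sketch below is illustrative only: the `sample` frame is a hypothetical two-row copy of the first records shown in rental_info.head(), and the fractional-day variant is an alternative, not something the task asks for.

```python
import pandas as pd

# Hypothetical two-row frame copied from the first records of rental_info.head()
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-06-18 19:24:16+00:00"],
})

# Subtracting two datetime columns gives a timedelta column
delta = pd.to_datetime(sample["return_date"]) - pd.to_datetime(sample["rental_date"])

# .dt.days keeps only the whole days (partial days are dropped)
whole_days = delta.dt.days

# total_seconds() / 86400 keeps the partial day as a fraction
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```

The first rental lasts 3 days and about 21 hours, so the whole-day count is 3 while the fractional value is close to 3.87.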

In [45]:
# Columns to exclude from the features: the raw text column, the target itself,
# and the two dates the target is computed from (keeping them would leak the target)
cols_to_drop = ["special_features", "rental_length_days", "rental_date", "return_date"]

STEP 2: CREATING A MODEL

In [None]:
# Split into feature and target sets
X = rental_info.drop(cols_to_drop, axis=1)
y = rental_info["rental_length_days"]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# For lasso
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Run OLS
from sklearn.linear_model import LinearRegression

# Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Split into training and test data
X_train,X_test,y_train,y_test = train_test_split(X, 
                                                 y, 
                                                 test_size=0.2, 
                                                 random_state=9)
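To see what `test_size=0.2` and a fixed `random_state` do, the split can be tried on a stand-in frame with the same number of rows that rental_info.info() reports (15861). This is a rough sketch: `X_demo` and `y_demo` are hypothetical placeholders, not the rental features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 15861  # row count reported by rental_info.info()

# Hypothetical stand-ins for X and y
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Repeating the call with the same random_state reproduces the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)
same_split = X_te.index.equals(X_te2.index)

print(len(X_tr), len(X_te), same_split)
```

Because 20% of 15861 is not a whole number, scikit-learn rounds the test set up and gives the remainder to the training set.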

In [None]:
# NOTE: Lasso (L1 regularization) and Ridge (L2 regularization) are not evaluation metrics. They are linear
# regression models whose penalty term is applied during training to shrink coefficients and reduce overfitting.
# Lasso can shrink some coefficients to exactly zero, which is why it is used here for feature selection.

# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9) 

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the features selected by Lasso
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1,101,1),
          'max_depth':np.arange(1,11,1)}

# Create a random forest regressor
rf = RandomForestRegressor(random_state=9)
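The selection rule `lasso_coef > 0` keeps only features with a positive coefficient, which also discards features that have a negative relationship with the target. The sketch below illustrates the difference from a `!= 0` rule on synthetic data. Everything in it is hypothetical (`X_demo`, `y_demo`, and the column names), and it is not a claim about which rule works better on the rental data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: one positive driver, one negative driver, one irrelevant column
rng = np.random.default_rng(9)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["pos", "neg", "noise"])
y_demo = 3 * X_demo["pos"] - 2 * X_demo["neg"] + rng.normal(scale=0.1, size=500)

lasso_demo = Lasso(alpha=0.3, random_state=9).fit(X_demo, y_demo)

# Rule used in the cell above: keep strictly positive coefficients only
kept_positive = X_demo.columns[lasso_demo.coef_ > 0].tolist()

# Alternative rule: keep every feature Lasso did not shrink to zero
kept_nonzero = X_demo.columns[lasso_demo.coef_ != 0].tolist()

print(kept_positive)
print(kept_nonzero)
```

On this toy data the positive-only rule drops the "neg" column even though it carries signal, so it is worth checking the sign of the coefficients before relying on either rule.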


STEP 3: MODEL TUNING, I.E. ADJUSTING HYPERPARAMETERS

a. GridSearchCV
b. RandomizedSearchCV
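The two search strategies differ in how many candidates they try: GridSearchCV evaluates every combination in the grid, while RandomizedSearchCV samples a fixed number of them (`n_iter`). A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and the deliberately tiny `param_space` are assumptions chosen so the example runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (hypothetical, not the rental data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=9)

# 2 x 3 = 6 possible combinations
param_space = {"n_estimators": [5, 10], "max_depth": [2, 3, 4]}

# Exhaustive search: fits all 6 combinations
grid = GridSearchCV(RandomForestRegressor(random_state=9),
                    param_grid=param_space,
                    cv=3,
                    scoring="neg_mean_squared_error")
grid.fit(X_demo, y_demo)

# Random search: fits only n_iter sampled combinations
rand = RandomizedSearchCV(RandomForestRegressor(random_state=9),
                          param_distributions=param_space,
                          n_iter=4,
                          cv=3,
                          scoring="neg_mean_squared_error",
                          random_state=9)
rand.fit(X_demo, y_demo)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])
print(n_grid, n_rand)
print(grid.best_params_, rand.best_params_)
```

When `scoring` is not set, both searches rank candidates with the estimator's default score, which is R-squared for regressors. Setting `scoring="neg_mean_squared_error"` as in this sketch ranks them by MSE instead.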

In [None]:

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Create a variable for the best hyper param
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
mse_random_forest= mean_squared_error(y_test, rf_pred)

STEP 4: EVALUATING OUR MODEL'S PERFORMANCE - CLASSIFICATION OR REGRESSION METRICS


Here, we will focus on regression metrics, which include:
a. Mean Absolute Error
b. Mean Squared Error
c. Root Mean Squared Error
d. R-squared(R2)
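The four metrics can be computed by hand on a handful of values to see how they relate. In this sketch `y_true` reuses the first five rental lengths shown earlier, while `y_pred` is a made-up set of predictions used only for illustration.

```python
import numpy as np

y_true = np.array([3, 2, 7, 2, 4])   # first five rental lengths from the data
y_pred = np.array([4, 2, 5, 3, 4])   # hypothetical predictions

errors = y_true - y_pred

# a. Mean Absolute Error: average size of the errors
mae = np.mean(np.abs(errors))

# b. Mean Squared Error: average squared error, penalizes large misses more
mse = np.mean(errors ** 2)

# c. Root Mean Squared Error: square root of MSE, in the same unit as the target (days)
rmse = np.sqrt(mse)

# d. R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

scikit-learn provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` in `sklearn.metrics` for the same calculations. Since the company's target is stated in terms of MSE, that is the metric used to choose the model here.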

In [None]:
# Random forest gives lowest MSE so:
best_model = rf
best_mse = mse_random_forest
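Rather than choosing the winner by eye, the comparison can be made explicit. The sketch below shows one possible way to do that. The MSE numbers in `candidate_mse` are placeholders invented for illustration and are NOT results from this notebook; in practice they would be replaced by mse_lin_reg_lasso and mse_random_forest.

```python
# Placeholder MSE values for illustration only (not results from this notebook)
candidate_mse = {
    "ols_on_lasso_features": 4.8,
    "random_forest": 2.2,
}

MSE_TARGET = 3  # the company's requirement on the test set

# Pick the candidate with the lowest test MSE
best_name = min(candidate_mse, key=candidate_mse.get)
best_mse_demo = candidate_mse[best_name]

# Confirm the winner actually meets the requirement
meets_target = best_mse_demo < MSE_TARGET

print(best_name, best_mse_demo, meets_target)
```

The final check matters because the lowest MSE among the candidates is not automatically below the target.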