![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have asked you to try out some regression models for this. The company wants a model that yields a mean squared error (MSE) of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features that the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
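
To make the MSE target in the brief concrete, here is a minimal sketch of how the metric is computed. The function name `mean_squared_error_manual` and the sample values are hypothetical and made up for illustration; they do not come from `rental_info.csv`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Return the average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical rental lengths in days (made-up values)
actual = [3, 5, 2, 7]
predicted = [4, 5, 1, 5]

# Squared errors are 1, 0, 1 and 4, so the mean is 6 / 4
mse = mean_squared_error_manual(actual, predicted)
print(mse)  # 1.5
```

Because the errors are squared, an MSE of 3 corresponds to a typical prediction error of roughly 1.7 days (the square root of 3).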

In [51]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

In [52]:
# Load the DVD rental data from the CSV file
rental_df = pd.read_csv('rental_info.csv')

# Add the target column: the rental length in whole days, computed as return_date - rental_date
rental_df["rental_length_days"] = (pd.to_datetime(rental_df["return_date"]) - pd.to_datetime(rental_df["rental_date"])).dt.days

print(rental_df.shape)

# Create two dummy columns from special_features: 1 if the text contains the given phrase, 0 otherwise
rental_df["deleted_scenes"] = np.where(rental_df["special_features"].str.contains("Deleted Scenes"),1,0)
rental_df["behind_the_scenes"] = np.where(rental_df["special_features"].str.contains("Behind the Scenes"),1,0)

rental_df.info()

# Select the non-object columns as model features (the .info() call above shows which columns qualify); the target rental_length_days is left out
features = ['amount', 'release_year', 'rental_rate', 'length', 'replacement_cost', 'NC-17', 'PG', 'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2', 'deleted_scenes', 'behind_the_scenes']

# Feature matrix used to train and test the models
X = rental_df[features]
# Target variable used to train and test the models
y = rental_df['rental_length_days']

print("Features X:")
print(X.head())
print("\nTarget variable y:")
print(y.head())

# Split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=9)

print(X_train.shape)
print(X_test.shape)

# Initialize a Lasso regression model to identify the most useful features
lasso = Lasso(alpha=0.1, random_state=9)

# Fit the model on the training data
lasso.fit(X_train, y_train)

# Get the coefficients of the fitted Lasso model
lasso_coef = lasso.coef_

# Flag the features whose coefficient is greater than 0
sel_feature_ind = lasso_coef > 0

# Get the names of the selected features
sel_feature_names = X.columns[sel_feature_ind]

# Subset the training and test data to the selected features
X_train_sel = X_train.loc[:, sel_feature_ind].reset_index(drop=True)
X_test_sel = X_test.loc[:, sel_feature_ind].reset_index(drop=True)

print("Selected features:", sel_feature_names)
print(X_train_sel)
print(X_test_sel)

(15861, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   rental_date         15861 non-null  object 
 1   return_date         15861 non-null  object 
 2   amount              15861 non-null  float64
 3   release_year        15861 non-null  float64
 4   rental_rate         15861 non-null  float64
 5   length              15861 non-null  float64
 6   replacement_cost    15861 non-null  float64
 7   special_features    15861 non-null  object 
 8   NC-17               15861 non-null  int64  
 9   PG                  15861 non-null  int64  
 10  PG-13               15861 non-null  int64  
 11  R                   15861 non-null  int64  
 12  amount_2            15861 non-null  float64
 13  length_2            15861 non-null  float64
 14  rental_rate_2       15861 non-null  float64
 15  rental_length_days  15861 non-null  int64
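
The cell above keeps only the features whose Lasso coefficient is greater than 0, which also drops any feature with a negative coefficient. A common variant keeps every feature with a non-zero coefficient instead. The sketch below illustrates the difference on synthetic data; `X_demo`, `y_demo`, and `lasso_demo` are hypothetical names, and the data is generated for illustration rather than taken from `rental_info.csv`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for this sketch): feature 0 has a positive effect,
# feature 1 has a negative effect, features 2 and 3 have no effect on the target
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

lasso_demo = Lasso(alpha=0.1)
lasso_demo.fit(X_demo, y_demo)

# Mask used in the cell above: positive coefficients only
positive_mask = lasso_demo.coef_ > 0
# Alternative mask: any coefficient that Lasso did not shrink to exactly zero
nonzero_mask = lasso_demo.coef_ != 0

print("coefficients:", lasso_demo.coef_)
print("positive only:", positive_mask)
print("non-zero:", nonzero_mask)
```

In this toy setup the feature with the negative effect is excluded by the first mask but kept by the second. Which rule suits the rental data is a modeling choice.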

In [53]:
# Define the candidate models to evaluate
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=9),
    'Random Forest': RandomForestRegressor(random_state=9)
}

# Define the hyperparameter values to search over for each model
param_distributions = {
    'Linear Regression': {},
    'Decision Tree': {'max_depth': [None, 10, 20, 30, 40, 50],
                      'min_samples_split': [2, 5, 10],
                      'min_samples_leaf': [1, 2, 4]},
    'Random Forest': {'n_estimators': [50, 100, 200],
                      'max_depth': [None, 10, 20, 30, 40, 50],
                      'min_samples_split': [2, 5, 10],
                      'min_samples_leaf': [1, 2, 4]}
}

best_model = None
best_mse = float('inf')

# Loop over the models, running a randomized hyperparameter search for each one
for model_name, model in models.items():
    print(model_name)
    # RandomizedSearchCV samples hyperparameter combinations from the values defined above and keeps the best one by 5-fold cross-validation
    random_search = RandomizedSearchCV(model, param_distributions[model_name], cv=5, random_state=9)
    random_search.fit(X_train, y_train)
    
    # Predict on the test set and compute the MSE
    y_pred = random_search.best_estimator_.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    
    # Keep this model only if its test MSE is the lowest seen so far
    if mse < best_mse:
        best_model = random_search.best_estimator_
        best_mse = mse

print("Best model found:", best_model)
print("Best test MSE:", best_mse)

Linear Regression
Decision Tree
Random Forest
Best model found: RandomForestRegressor(max_depth=20, n_estimators=50, random_state=9)
Best test MSE: 2.034920081021213
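
The search above uses the default scoring of `RandomizedSearchCV`, which for regressors is the estimator's own `score` method (R²), and then compares the tuned models by test-set MSE. One possible variant, sketched here on synthetic data, asks the search to optimize MSE directly through `scoring="neg_mean_squared_error"` and reads the chosen hyperparameters from `best_params_`. The demo arrays, the variable names, and the small parameter grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, not rental_info.csv)
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Score candidates by negative MSE so the search and the final metric agree
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=9),
    {"max_depth": [None, 5, 10], "min_samples_leaf": [1, 2, 4]},
    n_iter=5,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=9,
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated negative MSE, so flip the sign
cv_mse = -search.best_score_
test_mse = mean_squared_error(y_te, search.best_estimator_.predict(X_te))

print("Best hyperparameters:", search.best_params_)
print("Cross-validated MSE:", cv_mse)
print("Test MSE:", test_mse)
```

Comparing `cv_mse` with `test_mse` gives a rough check on whether the tuned model generalizes beyond the training folds.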
