## AutoML

In this short notebook, we run an AutoML tool (here TPOT, which we find generally well suited to regression tasks) on the dataset with minimal preparation: duplicate rows are dropped and categorical columns are one-hot encoded, but no feature engineering is done.

This gives us a baseline model against which to judge whether later changes (a better model, better features, etc.) actually have a significant effect on the metrics.

### Libraries

In [None]:
!pip install tpot

In [1]:
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

In [3]:
df = pd.read_csv("../data/get_around_pricing_project.csv")
df.head()

,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [4]:
# Shape of the dataset
print("The dataset is made of", df.shape[0], "observations and", df.shape[1], "features")

The dataset is made of 4843 observations and 15 features

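In the preview above, the `Unnamed: 0` column simply mirrors the row number (0, 1, 2, ...), which suggests it is an index that was saved with the CSV rather than a property of the car. If that assumption holds, it counts as one of the 15 "features", it makes every row unique (so dropping duplicates later removes nothing), and it ends up in the model inputs. The sketch below shows one possible way to load such a file so that the column becomes the index instead. It uses an inline sample with three rows copied from the preview and only a few of the columns, not the real file.

```python
import io
import pandas as pd

# Three rows copied from the preview above, reduced to a few columns for brevity.
# The empty first header cell is what pandas names "Unnamed: 0" by default.
csv_text = """,model_key,mileage,engine_power,rental_price_per_day
0,Citroën,140411,100,106
1,Citroën,13929,317,264
2,Citroën,183297,120,101
"""

# Default read: the unnamed first column is kept as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead of a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Whether to apply this to the real dataset depends on confirming that the column carries no information beyond the row order.
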

In [5]:
# Counting null values
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0                   0
model_key                    0
mileage                      0
engine_power                 0
fuel                         0
paint_color                  0
car_type                     0
private_parking_available    0
has_gps                      0
has_air_conditioning         0
automatic_car                0
has_getaround_connect        0
has_speed_regulator          0
winter_tires                 0
rental_price_per_day         0
dtype: int64

In [6]:
# Drop duplicate rows
df.drop_duplicates(inplace=True)

# Encode categorical information into numerical variables (0 or 1)
df = pd.get_dummies(df)

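To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does on a toy frame. The three rows reuse values from the preview, and the frame name `toy` is purely illustrative. Text columns are expanded into one indicator column per category, while numeric and boolean columns are passed through unchanged.

```python
import pandas as pd

# Toy frame with one numeric, one text, and one boolean column.
toy = pd.DataFrame({
    "mileage": [140411, 13929, 183297],
    "fuel": ["diesel", "petrol", "diesel"],
    "has_gps": [True, True, False],
})

# Only the text column "fuel" is replaced by indicator columns.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

With the default settings every category gets its own column; passing `drop_first=True` is an option if one redundant column per variable should be dropped.
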
In [7]:
# Separate target value from dataset
y = df["rental_price_per_day"]
print(y)

X = df.drop(columns=["rental_price_per_day"])
print(X)

# Separate dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

0       106
1       264
2       101
3       158
4       183
       ... 
4838    121
4839    132
4840    130
4841    151
4842    124
Name: rental_price_per_day, Length: 4843, dtype: int64
      Unnamed: 0  mileage  engine_power  private_parking_available  has_gps  \
0              0   140411           100                       True     True   
1              1    13929           317                       True     True   
2              2   183297           120                      False    False   
3              3   128035           135                       True     True   
4              4    97097           160                       True     True   
...          ...      ...           ...                        ...      ...   
4838        4838    39743           110                      False     True   
4839        4839    49832           100                      False     True   
4840        4840    19633           110                      False     True   
4841        4841    279

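The TPOT cell that follows scores pipelines with `neg_mean_squared_error`, so the "internal CV score" it reports is the mean squared error with its sign flipped: values closer to zero are better. Taking the square root of the negated score gives an RMSE in the same unit as `rental_price_per_day`, which is usually easier to interpret. The helper name `neg_mse_to_rmse` below is hypothetical, and the two scores are copied from the generation 1 and generation 4 lines of the training log.

```python
import math


def neg_mse_to_rmse(neg_mse: float) -> float:
    """Convert a negated MSE score (scikit-learn convention) into an RMSE."""
    return math.sqrt(-neg_mse)


# Best internal CV scores reported for generations 1 and 4.
for generation, score in [(1, -285.15653797427467), (4, -274.61194308749054)]:
    print(f"Generation {generation}: RMSE of about {neg_mse_to_rmse(score):.2f}")
```

Note that these are cross-validation scores on the training split; the value printed by `mean_squared_error(y_test, preds)` is a plain (positive) MSE on the held-out test set.
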
In [14]:
# TPOT setup
GENERATIONS = 32
POP_SIZE = 50
CV = 5

tpot = TPOTRegressor(
    generations=GENERATIONS, # Number of iterations of the pipeline optimization process
    population_size=POP_SIZE, # Number of individuals retained in the genetic programming population at each generation
    random_state=42, # Seed for reproducible runs
    n_jobs=8, # Number of processes to use in parallel for evaluating pipelines
    cv=CV, # Number of cross-validation folds
    verbosity=2, # Level 2 prints more information and shows a progress bar
    scoring="neg_mean_squared_error" # Metric used to evaluate each pipeline (negated MSE, so higher is better)
)

# Train the model
tpot.fit(X_train, y_train)

preds = tpot.predict(X_test)
print(mean_squared_error(y_test, preds))

Generation 1 - Current best internal CV score: -285.15653797427467
Generation 2 - Current best internal CV score: -285.15653797427467
Generation 3 - Current best internal CV score: -281.53588210995224
Generation 4 - Current best internal CV score: -274.61194308749054
Generation 5 - Current best internal CV score: -274.61194308749054
Generation 6 - Current best internal CV score: -274.61194308749054
Generation 7 - Current best intern