# PyCaret with Google Colab

## Contents:
- Part 1: Cleaning and Visualization
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1biEgivJEOUVS8KbeTXyb1lNgsVtbitYj)
- Part 2: Using PyCaret for Model Hyperparameter Tuning
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lXJhdH3rGnKQ_LjBGMh8ZK-Lf2VcfLW5)
- Part 3: Create Model
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14XIC90Lss_izdw-PE1cgIe4eECsXrHbY)



I couldn't install PyCaret and several other libraries on my machine (MacBook Air M1), and the problem was hard to fix, so I decided to use PyCaret on Google Colab instead. This also makes the work easier to share with others.

## Preparation 

Install the required libraries and import them into the notebook.

In [None]:
# Install PyCaret with pinned dependency versions. For more information, see: https://github.com/pycaret/pycaret
!pip install pycaret[full]==2.3.10 markupsafe==2.0.1 pyyaml==5.4.1 -qq



In [None]:
# Import libraries
!pip install -U matplotlib
import numpy as np
from pycaret.utils import enable_colab # enable Pycaret on Colab
import pandas as pd
import jinja2
import matplotlib.pyplot as plt
import matplotlib
import xgboost
from pycaret.regression import *
enable_colab()

### Import Data

Get the data from my Google Drive. You can find the project at: [MyGithub](https://github.com/northpr/GermanyRentalPrice)

In [None]:
!gdown --id 1yw4RN-Z9b7PlF45kC5HnXZokaXivk3EV

predict_df = pd.read_csv('predict_test.csv').iloc[:,1:]


## Basic inspection
Check the data before continuing to make sure that everything is on the right track.

In [None]:
# Dataframe that I want to use in my prediction
predict_df.head()

In [None]:
print(f"Number of rows in the dataframe: {predict_df.shape[0]}")

## Start using PyCaret on the df
I will use all of the variables from `predict_df`.
For how to prepare data with PyCaret, see the [PyCaret Tutorials](https://pycaret.gitbook.io/docs/get-started/tutorials).

`predict_df` keeps only basic variables that have a high correlation with the target and that are the easiest for users to obtain.
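
To make "high correlation" concrete, here is a minimal sketch of the Pearson correlation between a feature and the target. The numbers and the `floor` feature are hypothetical and are not taken from the rental dataset; only the idea of ranking features by their correlation with `totalRent` comes from the text above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy values, not the real data
living_space = [30.0, 45.0, 60.0, 75.0, 90.0]
floor = [3.0, 1.0, 4.0, 1.0, 2.0]
total_rent = [450.0, 600.0, 800.0, 950.0, 1200.0]

r_space = pearson_r(living_space, total_rent)
r_floor = pearson_r(floor, total_rent)
print(f"living_space vs totalRent: {r_space:.3f}")
print(f"floor vs totalRent: {r_floor:.3f}")
```

In this toy example the living space tracks the rent closely while the floor number barely does, so only the former would be kept under a correlation-based rule.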

In [None]:
p_data = predict_df.sample(frac=0.9, random_state=123)
# Hold out the remaining 10% of rows so they stay unseen during modeling
p_data_unseen = predict_df.drop(p_data.index)

p_data.reset_index(drop=True, inplace=True)
p_data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(p_data.shape))
print('Unseen Data For Predictions ' + str(p_data_unseen.shape))
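
A hold-out set is only useful if its rows are excluded from the modeling data. The following is a minimal sketch on a toy DataFrame, assuming the usual pattern of sampling 90% of the rows and dropping them from the full frame to obtain the unseen part. The column values and the `livingSpace` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for predict_df (hypothetical values)
toy_df = pd.DataFrame({
    "livingSpace": [50, 62, 75, 80, 95, 40, 55, 68, 110, 33],
    "totalRent": [600, 720, 900, 950, 1200, 480, 640, 800, 1500, 400],
})

# 90% of the rows for modeling, sampled reproducibly
modeling = toy_df.sample(frac=0.9, random_state=123)
# The rows that were not sampled are the unseen data
unseen = toy_df.drop(modeling.index)

# Check for shared rows before the indexes are reset
overlap = set(modeling.index) & set(unseen.index)

modeling = modeling.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

print("Data for Modeling:", modeling.shape)
print("Unseen Data For Predictions:", unseen.shape)
print("Overlapping rows:", len(overlap))
```

The overlap check has to run before `reset_index`, because afterwards both frames start counting from zero again.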

In [None]:
p_data.head()

### Setting up the PyCaret on prediction data

In [None]:
import jinja2
from pycaret.regression import *

# Setup the data and choose the target for the model.
exp_reg101 = setup(data = p_data, target = 'totalRent', session_id=123, 
                   normalize = True, silent = True, combine_rare_levels = True, rare_level_threshold = 0.05,
                   remove_multicollinearity = True, multicollinearity_threshold=0.95,experiment_name='experiment_1') 
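
Two of the `setup` options used above can be illustrated without PyCaret. This is a rough sketch of the ideas, not PyCaret's implementation: `normalize = True` rescales numeric features (z-score is the default method), and `combine_rare_levels` with `rare_level_threshold = 0.05` merges categories that cover less than 5% of the rows. The helper names, the `"rare"` label, and the toy values are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def combine_rare(levels, threshold=0.05, other_label="rare"):
    """Replace categories whose share of rows is below the threshold."""
    counts = Counter(levels)
    total = len(levels)
    rare = {level for level, n in counts.items() if n / total < threshold}
    return [other_label if level in rare else level for level in levels]


# Hypothetical toy values, not the real data
living_space = [40.0, 55.0, 70.0, 85.0, 100.0]
scaled = zscore(living_space)

cities = ["Berlin"] * 20 + ["Hamburg"] * 18 + ["Bonn", "Kiel"]  # 40 rows
combined = combine_rare(cities)

print([round(v, 3) for v in scaled])
print(Counter(combined))
```

Here Bonn and Kiel each cover 2.5% of the rows, so both fall under the 5% threshold and end up in one shared level.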

Exclude the models that I don't intend to use on this dataframe because of their complexity.

In [None]:
# Might take around 10-12 mins due to the complexity of Extreme Gradient Boost and CatBoost
best = compare_models(exclude = ['ransac','rf','ada','et','huber','knn','par','omp','dt','en'], n_select=3)

## CatBoost Regressor

From the model comparison table above, CatBoost Regressor gives the best result on MAE, RMSE, and R2, so we should consider this model.
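
For reference, the three metrics named above can be computed by hand. This is a minimal sketch using their standard definitions; the actual and predicted rents are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for two equally long lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Hypothetical toy values, not model output
actual = [500.0, 750.0, 1000.0, 1250.0]
predicted = [550.0, 700.0, 1100.0, 1250.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower is better for MAE and RMSE, while R2 is better the closer it is to 1. RMSE penalizes the single 100-unit miss more heavily than MAE does.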

In [None]:
catboost_para = create_model('catboost', round=2)
print(catboost_para)

### Hyperparameter Tuning
`tune_model` tunes the CatBoost model. Its output is a score grid with the cross-validation scores by fold for the best model, selected according to the `optimize` parameter.

[PyCaret: tune_model](https://pycaret.readthedocs.io/en/latest/api/regression.html#pycaret.regression.tune_model)
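
The idea behind the score grid can be sketched in plain Python: draw a few random hyperparameter candidates, score each one fold by fold, and keep the candidate with the best mean score. This is a simplified illustration, not PyCaret's implementation. A one-parameter toy model stands in for CatBoost, MAE is used as the score (PyCaret optimizes R2 by default for regression), and all names and data are hypothetical.

```python
import random


def fit_slope(xs, ys, alpha):
    """Slope of y ~ w * x with an L2 penalty alpha (toy model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_scores(xs, ys, alpha, n_folds=5):
    """Mean absolute error on each held-out fold."""
    scores = []
    for k in range(n_folds):
        test_idx = set(range(k, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        w = fit_slope(train_x, train_y, alpha)
        fold_mae = sum(abs(ys[i] - w * xs[i]) for i in test_idx) / len(test_idx)
        scores.append(fold_mae)
    return scores


rng = random.Random(123)
xs = [float(i) for i in range(1, 21)]
ys = [12.0 * x + rng.uniform(-5.0, 5.0) for x in xs]  # synthetic data

results = []
for _ in range(10):  # 10 random candidates
    alpha = 10 ** rng.uniform(-3, 2)  # sample alpha on a log scale
    fold_scores = cv_scores(xs, ys, alpha)
    results.append((sum(fold_scores) / len(fold_scores), alpha, fold_scores))

best_mean, best_alpha, best_folds = min(results, key=lambda r: r[0])
print(f"best alpha: {best_alpha:.4f}, mean MAE: {best_mean:.3f}")
print("MAE by fold:", [round(s, 3) for s in best_folds])
```

The per-fold list of the winning candidate plays the role of the score grid that `tune_model` prints.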

In [None]:
tuned_catboost = tune_model(catboost_para)
print(tuned_catboost)

In [None]:
tuned_catboost

### Visualization
The plot below shows that CatBoost considers numerical variables (Living Space, Additional Cost, and No. of Rooms) more important than categorical variables (City, Heating Type, and Room Condition).

In [None]:
plot_model(tuned_catboost, plot='feature_all')

## Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) is a model that often gives strong results on MAE, RMSE, and R2, so we should keep considering it for further use.

In [None]:
xgboost_para = create_model('xgboost', round=2)
print(xgboost_para)

### Visualization
The plot below shows that Extreme Gradient Boosting considers numerical variables (Living Space, Additional Cost, and No. of Rooms) more important than categorical variables (City, Heating Type, and Room Condition).

In [None]:
plot_model(xgboost_para, plot='feature_all')

Living Space, Additional Cost, and Number of Rooms are the most important features, so we can conclude that numerical variables are more important than categorical variables for Extreme Gradient Boosting.

In [None]:
plot_model(xgboost_para, plot = 'error')

In [None]:
xgb_evaluation = predict_model(xgboost_para)
xgb_evaluation;

## Light Gradient Boost

From the model comparison table above, Light Gradient Boosting Machine is also among the best models on MAE, RMSE, and R2, so we should keep this model.

In [None]:
lightgbm_para = create_model('lightgbm', round=2)
print(lightgbm_para)

### Hyperparameter Tuning
`tune_model` tunes the Light Gradient Boosting Machine model. Its output is a score grid with the cross-validation scores by fold for the best model, selected according to the `optimize` parameter.

[PyCaret: tune_model](https://pycaret.readthedocs.io/en/latest/api/regression.html#pycaret.regression.tune_model)

In [None]:
tuned_lgb = tune_model(lightgbm_para)
print(tuned_lgb)

### Visualization
The plot below shows that Light Gradient Boosting Machine considers numerical variables (Living Space, Additional Cost, and No. of Rooms) more important than categorical variables (City, Heating Type, and Room Condition).

In [None]:
plot_model(tuned_lgb, plot='feature_all')

Living Space, Additional Cost, and Number of Rooms are the most important features, so we can conclude that numerical variables are more important than categorical variables for Light Gradient Boosting Machine.

In [None]:
plot_model(tuned_lgb, plot = 'error')

In [None]:
lgb_evaluation = predict_model(tuned_lgb)
lgb_evaluation;

## Linear Regression
Why not compare the other models with Linear Regression? It's still the most straightforward model to understand, and we should look at how its decisions compare to LightGBM's.

`create_model` trains and evaluates the performance of a given estimator using cross-validation.<br>
[PyCaret: create_model](https://pycaret.readthedocs.io/en/latest/api/regression.html#pycaret.regression.create_model)

In [None]:
lrm = create_model('lr')
print(lrm)

### Hyperparameter Tuning
Tunes the Linear Regression model. 


In [None]:
tuned_lr = tune_model(lrm)
print(tuned_lr)

### Visualization
The plot shows that Linear Regression weights categorical variables more heavily than numerical ones: city and room condition rank ahead of the numerical variables.

In [None]:
plot_model(tuned_lr, plot='feature_all')

In [None]:
plot_model(tuned_lr, plot = 'error')

In [None]:
lr_evaluation = predict_model(tuned_lr)
lr_evaluation;

## Ridge Regression
Ridge Regression is my preferred regression algorithm over plain linear regression when predicting on new data. Its L2 penalty shrinks the coefficients, so you can fit a model with many features while reducing the risk of overfitting.
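
The effect of the ridge penalty is easiest to see with a single centred feature, where the fitted slope has a closed form. This is a minimal sketch with hypothetical numbers: `alpha = 0` gives the ordinary least-squares slope, and larger values of `alpha` pull the slope toward zero.

```python
def slope(xs, ys, alpha=0.0):
    """Slope of y ~ w * x with L2 penalty alpha: sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


# Hypothetical centred toy data
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols = slope(xs, ys)                      # no penalty
ridge_small = slope(xs, ys, alpha=1.0)   # mild shrinkage
ridge_large = slope(xs, ys, alpha=10.0)  # strong shrinkage

print(f"OLS slope: {ols:.3f}")
print(f"Ridge slope (alpha=1): {ridge_small:.3f}")
print(f"Ridge slope (alpha=10): {ridge_large:.3f}")
```

The shrinkage trades a little bias for lower variance, which is why the penalty strength is worth tuning rather than fixing by hand.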

In [None]:
ridge_model = create_model('ridge')
print(ridge_model)

### Hyperparameter Tuning
Tunes the Ridge Regression model. `tune_model` runs 10 search iterations by default; a larger `n_iter` can be passed to search more candidates. I want to use Ridge Regression as our main predictor because its regularization makes it less prone to overfitting than Linear Regression.


In [None]:
tuned_ridge = tune_model(ridge_model)
print(tuned_ridge)

### Visualization
From the plot below, Ridge Regression behaves much like Linear Regression, weighting the categorical variables more heavily than the numerical variables.

In [None]:
plot_model(tuned_ridge, plot='feature_all')

In [None]:
plot_model(tuned_ridge, plot = 'error')

In [None]:
ridge_evaluation = predict_model(tuned_ridge)
ridge_evaluation;