# Advanced House Price Predictions Using AutoML (PyCaret)
**PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the hypothesis-to-insight cycle time in an ML experiment. It enables data scientists to perform end-to-end experiments quickly and efficiently. Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can perform complex machine learning tasks with only a few lines of code. PyCaret is simple and easy to use.**

PyCaret is a deployment-ready library: all the steps performed in an ML experiment can be reproduced using a pipeline, and that pipeline can be saved in a binary file format that is transferable across environments.

PyCaret can also be used by citizen data scientists to perform machine learning experiments.

In [1]:
!pip install pycaret # to install pycaret on your notebook

Collecting pycaret
  Using cached pycaret-3.2.0-py3-none-any.whl (484 kB)

In [3]:
import pandas as pd
import numpy as np 
import seaborn as sns


In [8]:
# reading in the data 
df = pd.read_csv("C:\\Datasets\\General_Datasets\\USA_Housing.csv")

In [9]:
df.head()

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1059034.0 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.642455 | 6.0029 | 6.730821 | 3.09 | 40173.072174 | 1505891.0 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.067179 | 5.86589 | 8.512727 | 5.13 | 36882.1594 | 1058988.0 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 | 1260617.0 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 | 630943.5 | USNS Raymond\nFPO AE 09386 |


In [10]:
df.isnull().sum()

Avg. Area Income                0
Avg. Area House Age             0
Avg. Area Number of Rooms       0
Avg. Area Number of Bedrooms    0
Area Population                 0
Price                           0
Address                         0
dtype: int64

In [12]:
# Drop Address; it won't be needed in our experiment
df.drop(['Address'], axis=1, inplace=True)

In [13]:
df.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price'],
      dtype='object')

In [14]:
df.dtypes

Avg. Area Income                float64
Avg. Area House Age             float64
Avg. Area Number of Rooms       float64
Avg. Area Number of Bedrooms    float64
Area Population                 float64
Price                           float64
dtype: object

# AUTOML EXPERIMENT 
This section covers the following key steps:

**Setting up a regression experiment**

**Choosing the target feature for the experiment**

**Selecting the best model based on performance**

**Evaluating the model**

**Predicting on new data**

**Saving the model**

# Setting up a Regression Experiment

In [29]:
# Import only the class used below instead of a wildcard import
from pycaret.regression import RegressionExperiment

In [30]:
sess = RegressionExperiment()

In [33]:
sess.setup(df,target='Price',session_id=2024)

| | Description | Value |
|---|---|---|
| 0 | Session id | 2024 |
| 1 | Target | Price |
| 2 | Target type | Regression |
| 3 | Original data shape | (5000, 6) |
| 4 | Transformed data shape | (5000, 6) |
| 5 | Transformed train set shape | (3500, 6) |
| 6 | Transformed test set shape | (1500, 6) |
| 7 | Numeric features | 5 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


<pycaret.regression.oop.RegressionExperiment at 0x1234de8dc10>
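
The setup summary shows 5,000 rows divided into 3,500 training rows and 1,500 test rows, i.e. a 70/30 split (which, as far as I know, corresponds to PyCaret's default `train_size` of 0.7). The sketch below only illustrates that arithmetic with a seeded, shuffled index split in plain Python. The helper name `split_indices` is hypothetical, and the sketch will not reproduce the exact rows PyCaret selects.

```python
import random


def split_indices(n_rows, train_size=0.7, seed=2024):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = round(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(5000)
print(len(train_idx), len(test_idx))  # 3500 1500
```

Fixing the seed here plays the same role as `session_id=2024` in the experiment: rerunning the code gives the same partition.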

# Choosing the Best Model

In [34]:
best_model = sess.compare_models()

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lr | Linear Regression | 82644.0469 | 10452261683.2 | 102148.4273 | 0.913 | 0.1059 | 0.0774 | 5.092 |
| lasso | Lasso Regression | 82646.7727 | 10452952268.8 | 102151.7211 | 0.913 | 0.1059 | 0.0774 | 0.045 |
| ridge | Ridge Regression | 82647.782 | 10452958924.8 | 102151.8336 | 0.913 | 0.1059 | 0.0774 | 0.062 |
| lar | Least Angle Regression | 82646.7797 | 10452955750.4 | 102151.7367 | 0.913 | 0.1059 | 0.0774 | 0.048 |
| llar | Lasso Least Angle Regression | 82646.6211 | 10452875468.8 | 102151.4609 | 0.913 | 0.1059 | 0.0774 | 0.061 |
| gbr | Gradient Boosting Regressor | 89115.7406 | 12308109401.7558 | 110875.2251 | 0.8977 | 0.1208 | 0.0887 | 0.339 |
| lightgbm | Light Gradient Boosting Machine | 91275.1843 | 12921033673.5606 | 113606.7144 | 0.8926 | 0.1242 | 0.0916 | 0.17 |
| et | Extra Trees Regressor | 94322.4715 | 13943703336.5869 | 117980.11 | 0.8841 | 0.1311 | 0.0971 | 0.419 |
| rf | Random Forest Regressor | 97345.1242 | 15028607465.8741 | 122529.9507 | 0.8753 | 0.1343 | 0.1001 | 0.696 |
| en | Elastic Net | 99330.2211 | 15242241126.4 | 123407.2289 | 0.8737 | 0.1299 | 0.0972 | 0.032 |
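
To make the column names in the comparison table concrete, here is a minimal sketch that computes the same kinds of metrics by hand using only the standard library. The prices are made up for illustration and `regression_metrics` is a hypothetical helper. PyCaret's own implementation may differ in details, and as I understand it the values it reports are averages over cross-validation folds.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute the error metrics named in the comparison table."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # RMSLE compares log(1 + value), so it penalises relative error
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    # MAPE is expressed as a fraction (0.05 means 5%)
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - (mse * n) / ss_tot,
        "RMSLE": rmsle,
        "MAPE": mape,
    }


# Made-up prices, for illustration only
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower is better for every metric except R2, where values closer to 1 indicate a better fit.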


# Evaluating the Model and Predicting on the Hold-Out Set

In [35]:
sess.evaluate_model(best_model)

In [36]:
pred_hold_out = sess.predict_model(best_model)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 78751.0938 | 9786950656.0 | 98929.0156 | 0.9254 | 0.1028 | 0.0739 |
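
One quick way to read the hold-out result is to set it beside the cross-validated scores reported for Linear Regression in the model comparison. The sketch below does this with the numbers printed in this notebook. The dictionary names are my own, and the comparison is a rough sanity check, not a formal test.

```python
# Cross-validated scores for Linear Regression (from the model comparison)
cv_scores = {"MAE": 82644.0469, "RMSE": 102148.4273, "R2": 0.913, "MAPE": 0.0774}
# Hold-out scores for the same model (from predict_model)
holdout_scores = {"MAE": 78751.0938, "RMSE": 98929.0156, "R2": 0.9254, "MAPE": 0.0739}

# Relative change from cross-validation to hold-out, as a percentage
relative_change = {
    name: 100 * (holdout_scores[name] - cv_scores[name]) / cv_scores[name]
    for name in cv_scores
}

for name, change in relative_change.items():
    print(f"{name}: {change:+.2f}%")
```

The error metrics are slightly lower and R2 slightly higher on the hold-out set, which suggests the model is not overfitting the training folds, although a single split is limited evidence.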


# Prediction on New Data

In [37]:
# Drop the target to mimic unseen data (note: these rows were also part of the experiment's data)
new_data = df.drop(columns=['Price'])

In [38]:
predictions = sess.predict_model(best_model, data=new_data)
predictions.head(10)

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | prediction_label |
|---|---|---|---|---|---|---|
| 0 | 79545.460938 | 5.682861 | 7.009188 | 4.09 | 23086.800781 | 1223043.5 |
| 1 | 79248.640625 | 6.0029 | 6.730821 | 3.09 | 40173.070312 | 1494530.5 |
| 2 | 61287.066406 | 5.86589 | 8.512728 | 5.13 | 36882.160156 | 1253608.5 |
| 3 | 63345.238281 | 7.188236 | 5.586729 | 3.26 | 34310.242188 | 1121818.0 |
| 4 | 59982.195312 | 5.040555 | 7.839388 | 4.23 | 26354.109375 | 845633.75 |
| 5 | 80175.757812 | 4.988408 | 6.104512 | 4.04 | 26748.427734 | 1065891.25 |
| 6 | 64698.464844 | 6.025336 | 8.147759 | 3.41 | 60828.25 | 1671112.5 |
| 7 | 78394.335938 | 6.98978 | 6.620478 | 2.42 | 36516.359375 | 1571585.0 |
| 8 | 59927.660156 | 5.362125 | 6.393121 | 2.3 | 29387.396484 | 766693.0 |
| 9 | 81885.929688 | 4.423672 | 8.167688 | 6.1 | 40149.964844 | 1464261.0 |
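
Because these rows come from the original dataset, the predictions can be compared with the known prices. The sketch below does so for the first five rows, taking the actual prices as displayed by `df.head()` earlier (they may be rounded in that display) and the predictions from the output above. The variable names are illustrative.

```python
# Actual prices for rows 0-4, as displayed by df.head()
actual_prices = [1059034.0, 1505891.0, 1058988.0, 1260617.0, 630943.5]
# Model predictions for the same rows (prediction_label column)
predicted_prices = [1223043.5, 1494530.5, 1253608.5, 1121818.0, 845633.75]

# Absolute percentage error for each row
pct_errors = [
    abs(pred - actual) / actual
    for actual, pred in zip(actual_prices, predicted_prices)
]

for row, err in enumerate(pct_errors):
    print(f"row {row}: {err:.1%} off")

mean_pct_error = sum(pct_errors) / len(pct_errors)
print(f"mean over these rows: {mean_pct_error:.1%}")
```

Five rows are far too few to judge the model; the hold-out metrics remain the better summary of its accuracy.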


# Saving the Model

In [39]:
sess.save_model(best_model,'best_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Avg. Area Income',
                                              'Avg. Area House Age',
                                              'Avg. Area Number of Rooms',
                                              'Avg. Area Number of Bedrooms',
                                              'Area Population'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('clean_column_names',
                  TransformerWrapper(transformer=CleanColumnNames())),
                 ('trained_model', LinearRegression(n_jobs=-1))]),
 'best_pipeline.pkl')
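
The saved pipeline can later be reloaded to score data in a fresh session. The following is a minimal sketch, assuming the functional `load_model` and `predict_model` helpers from `pycaret.regression` and a `best_pipeline.pkl` file in the working directory. The wrapper `load_and_predict` is a hypothetical helper, not part of PyCaret.

```python
def load_and_predict(model_name, data):
    """Load a saved PyCaret pipeline and score a DataFrame with it."""
    # Imported lazily so this sketch can be defined without PyCaret installed
    from pycaret.regression import load_model, predict_model

    pipeline = load_model(model_name)  # PyCaret appends the '.pkl' extension
    return predict_model(pipeline, data=data)


# Example usage (assumes 'best_pipeline.pkl' exists and `new_data` has the
# same feature columns that were used in training):
# scored = load_and_predict("best_pipeline", new_data)
# scored["prediction_label"].head()
```

Since the pipeline bundles the preprocessing steps with the trained model, raw feature columns can be passed in directly.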