<a name="top"></a><h1>04. Modeling and Interpretation Using PyCaret</h1>

<p>Geospatial Analysis of the 2023 Earthquakes in Turkey<br />
<strong>Master Thesis</strong><br />
<strong>Master of Data Science</strong></p>


<p style="text-align:right">Gozde Yazganoglu (<em>gozde.yazganoglu@cunef.edu</em>)</p>

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Introduction](#1)
1. [Importing libraries](#2)
1. [Reading the data and Profiling](#3)
1. [Variables](#4)
1. [Visualization of the result of geostatistical data interpolation](#5)
1. [Saving data for other notebooks](#6)

## 1. Introduction <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)


In the four previous notebooks we carried out most of the geospatial analysis and gained some insight into the geographic properties of the data. However, maintaining an environment that combines geopandas with additional libraries such as PyCaret is difficult. For this reason, this notebook works with a plain pandas dataframe that retains the geographic variables, and tries to understand which of them affect damage_gra the most.

PyCaret is a low-code machine learning tool that helps us compare, optimize, interpret, and deploy models with just a few lines of code. Although little code is needed, it is still easy to adjust it to our needs. The most important PyCaret functions and classes we use are listed below.


    ClassificationExperiment,
    setup,
    compare_models,
    tune_model,
    interpret_model

## 2. Importing libraries  and Reading the Data <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

This notebook uses the environment defined in new_pycaret.yaml, which must be installed in order to run the notebook locally.

We use plain pandas here in order to avoid environment problems. The data nevertheless retains geographic information such as latitude, longitude, lags, and distances. Our hypothesis is that these variables are important for predicting damage_gra after the disaster.

In [1]:
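# importing libraries from the new_pycaret environment
# Only the classification module is star-imported: importing pycaret.regression
# the same way would shadow setup, compare_models, tune_model, etc.
from pycaret.classification import *
import pandas as pd
import pickle
import numpy as np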


In [2]:
#reading pandas dataframe
data = pd.read_csv('../data/processed/df.csv')

In [3]:
data

Unnamed: 0,obj_type,info,damage_gra,locality,population,income,total_sales,second_sales,water_access,elec_cons,...,std_lag_nearest_camping_distance,lag_nearest_earthquake_distance,std_nearest_earthquake_distance,std_lag_nearest_earthquake_distance,lag_nearest_fault_distance,std_nearest_fault_distance,std_lag_nearest_fault_distance,lag_damage_gra,std_damage_gra,std_lag_damage_gra
0,11_RESIDENTIAL_BUILDINGS,997_NOT_APPLICABLE,1,ADIYAMAN,316140,4092,40087,20574,0.98,2060,...,-0.336509,1.121437,0.823108,0.822247,0.017026,-1.708649,-1.707721,1.0,-0.208399,-0.2745
1,11_RESIDENTIAL_BUILDINGS,997_NOT_APPLICABLE,1,ADIYAMAN,316140,4092,40087,20574,0.98,2060,...,-0.338395,1.121020,0.821473,0.821468,0.017263,-1.703229,-1.705170,1.0,-0.208399,-0.2745
2,12_NON_RESIDENTIAL_BUILDINGS,1251_INDUSTRIAL_BUILDINGS,1,ADIYAMAN,316140,4092,40087,20574,0.98,2060,...,-0.319254,1.126003,0.829966,0.830761,0.014187,-1.733850,-1.738174,1.0,-0.208399,-0.2745
3,12_NON_RESIDENTIAL_BUILDINGS,1251_INDUSTRIAL_BUILDINGS,1,ADIYAMAN,316140,4092,40087,20574,0.98,2060,...,-0.318057,1.126158,0.831173,0.831050,0.014298,-1.737541,-1.736986,1.0,-0.208399,-0.2745
4,11_RESIDENTIAL_BUILDINGS,997_NOT_APPLICABLE,1,ADIYAMAN,316140,4092,40087,20574,0.98,2060,...,-0.339910,1.120674,0.820239,0.820823,0.017477,-1.699083,-1.702874,1.0,-0.208399,-0.2745
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98790,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364657,0.092105,-1.097597,-1.097274,0.031475,-1.551232,-1.552713,1.0,-0.208399,-0.2745
98791,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364976,0.092183,-1.098330,-1.097127,0.031408,-1.547625,-1.553434,1.0,-0.208399,-0.2745
98792,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364853,0.092170,-1.098204,-1.097152,0.031503,-1.552762,-1.552407,1.0,-0.208399,-0.2745
98793,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364460,0.092054,-1.097125,-1.097368,0.031509,-1.553062,-1.552347,1.0,-0.208399,-0.2745


We remove the class damage_gra = 0, since it represents buildings whose status is unknown to us. The 'percentage' variable is also excluded, as it is directly related to the damage_gra value.

In [11]:

# Removing damage grade 0 (status unknown)
data = data[data['damage_gra'] != 0]

data.groupby('damage_gra').count()

# removing percentage since it is directly correlated with damage grade
#data.drop('percentage', axis=1, inplace=True)

data.tail()

Unnamed: 0,obj_type,info,damage_gra,locality,population,income,total_sales,second_sales,water_access,elec_cons,...,std_lag_nearest_camping_distance,lag_nearest_earthquake_distance,std_nearest_earthquake_distance,std_lag_nearest_earthquake_distance,lag_nearest_fault_distance,std_nearest_fault_distance,std_lag_nearest_fault_distance,lag_damage_gra,std_damage_gra,std_lag_damage_gra
98790,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364657,0.092105,-1.097597,-1.097274,0.031475,-1.551232,-1.552713,1.0,-0.208399,-0.2745
98791,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364976,0.092183,-1.09833,-1.097127,0.031408,-1.547625,-1.553434,1.0,-0.208399,-0.2745
98792,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364853,0.09217,-1.098204,-1.097152,0.031503,-1.552762,-1.552407,1.0,-0.208399,-0.2745
98793,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.36446,0.092054,-1.097125,-1.097368,0.031509,-1.553062,-1.552347,1.0,-0.208399,-0.2745
98794,212_RAILWAYS,997_NOT_APPLICABLE,1,TURKOGLU,78976,5997,1938,536,0.95,4343,...,-0.364288,0.092011,-1.096724,-1.097448,0.031543,-1.554889,-1.551981,1.0,-0.208399,-0.2745


PyCaret offers two primary ways of working. One can either instantiate a ClassificationExperiment object (as described in the API documentation) or call module-level functions such as setup directly.

The setup function is the first step of any PyCaret experiment; it prepares the data for further analysis and modeling. In more detail:

- Data preprocessing: setup takes care of many preprocessing tasks. These include handling missing values, transforming categorical variables into numeric ones (via one-hot encoding or label encoding), and splitting the dataset into training and test sets.
- Pipeline creation: when you use setup, PyCaret internally creates a pipeline that processes the data. This pipeline can then be used seamlessly with other PyCaret functions.
- Customization: you can customize how setup behaves through its many parameters. For example, you can specify the ratio of the train/test split, choose to ignore certain features, or select a specific normalization method.
- Environment setup: it initializes the environment so that all PyCaret functions work together. The algorithms, transformations, and configurations you choose during setup stay consistent throughout the experiment.
- Informative display: once setup has run, it displays a grid with key details such as the number of features, the number of samples, and the transformation methods being used.
- Flexibility: although setup automates many tasks, it does not take away your control. If you disagree with a preprocessing step, you can change it through the parameters.

In essence, setup is the entry point into the library's streamlined workflow, taking care of much of the groundwork so you can focus on building and tuning your models.
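The customization options mentioned above can be sketched as follows. This is a minimal example, not the configuration used in this notebook: the helper names `build_setup_kwargs` and `run_setup` are our own, and ignoring the `locality` column is purely an illustrative assumption. The parameters `train_size`, `ignore_features`, `normalize`, and `normalize_method` are the ones we understand `setup` to accept in PyCaret 3.x; check them against the installed version.

```python
def build_setup_kwargs(target, session_id=123, train_size=0.7, ignore=None):
    """Collect keyword arguments for ClassificationExperiment.setup().

    Hypothetical helper: it only assembles a dictionary, so the choices are
    visible in one place before any preprocessing is run.
    """
    return {
        "target": target,
        "session_id": session_id,        # fixes the random seed
        "train_size": train_size,        # share of rows used for training
        "ignore_features": ignore,       # columns excluded from modeling
        "normalize": True,               # scale numeric features
        "normalize_method": "zscore",    # zero mean, unit variance
    }


def run_setup(data, setup_kwargs):
    """Create an experiment and run setup(); requires PyCaret to be installed."""
    from pycaret.classification import ClassificationExperiment

    experiment = ClassificationExperiment()
    experiment.setup(data=data, **setup_kwargs)
    return experiment


# Illustrative choice only: dropping 'locality' is an assumption, not a
# step taken in this notebook.
setup_kwargs = build_setup_kwargs("damage_gra", ignore=["locality"])
```

Calling `run_setup(data, setup_kwargs)` would then return a configured experiment. Keeping the arguments in a dictionary makes it easy to log or compare configurations between runs.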

In the following code, we've employed feature selection. Using PyCaret has notably reduced our coding effort compared to scikit-learn.



In [22]:
session = 123

s = ClassificationExperiment()

s.setup(data=data,
        target='damage_gra',
        session_id=session,
        feature_selection=True,
        feature_selection_method='classic',
        )

best = s.compare_models()


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5428
[LightGBM] [Info] Number of data points in the train set: 68790, number of used features: 75
[LightGBM] [Info] Start training from score -0.057225
[LightGBM] [Info] Start training from score -3.889599
[LightGBM] [Info] Start training from score -3.853993
[LightGBM] [Info] Start training from score -4.270839
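The `Start training from score` lines hint at the class balance of the training set. Assuming, as is our understanding of LightGBM's multiclass objective, that each initial score is the natural logarithm of the corresponding class frequency, the class shares can be recovered by exponentiating. The sketch below uses only the four values printed in the log; the variable names and the class ordering are our own assumptions.

```python
import math

# Initial scores copied from the LightGBM log, in the order printed
# (assumed to correspond to damage_gra 1, 2, 3, 4).
init_scores = [-0.057225, -3.889599, -3.853993, -4.270839]

# Assumption: score = ln(class frequency), so frequency = exp(score).
class_shares = [math.exp(score) for score in init_scores]
total_share = sum(class_shares)

for grade, share in enumerate(class_shares, start=1):
    print(f"damage_gra {grade}: {share:.1%} of the training rows")
```

Under that assumption the shares add up to approximately 1 and the first class accounts for roughly 94% of the training rows. Such a strong imbalance suggests that accuracy alone is a weak criterion for comparing models here.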


| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | damage_gra |
| 2 | Target type | Multiclass |
| 3 | Target mapping | 1: 0, 2: 1, 3: 2, 4: 3 |
| 4 | Original data shape | (98272, 53) |
| 5 | Transformed data shape | (98272, 11) |
| 6 | Transformed train set shape | (68790, 11) |
| 7 | Transformed test set shape | (29482, 11) |
| 8 | Numeric features | 49 |
| 9 | Categorical features | 3 |
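The shapes in this summary can be cross-checked by hand. The sketch below assumes PyCaret's default `train_size` of 0.7 (no value was passed to `setup`) and that the training count is rounded down. It also reproduces the target mapping by enumerating the sorted class labels. Variable names are illustrative.

```python
n_rows = 98272      # rows in the original data shape reported by setup
train_size = 0.7    # assumed default, not set explicitly in this notebook

n_train = int(n_rows * train_size)  # rounded down
n_test = n_rows - n_train

# Label encoding: sorted class labels are mapped to 0..k-1.
damage_grades = [1, 2, 3, 4]
target_mapping = {grade: code for code, grade in enumerate(sorted(damage_grades))}

print(n_train, n_test)    # 68790 29482
print(target_mapping)     # {1: 0, 2: 1, 3: 2, 4: 3}
```

Both counts match the transformed train and test set shapes in the summary, which is consistent with a 70/30 split.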


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)



In [None]:
print(best)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=123, verbose=0, warm_start=False)


In [None]:


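# tune_model is called on the experiment object. The module-level call
# tune_model(best) fails with "ValueError: _CURRENT_EXPERIMENT global variable
# is not set. Please run setup() first." because setup() was run through
# ClassificationExperiment rather than the functional API.
tuned_model = s.tune_model(best)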

In [None]:
dashboard = s.dashboard(best)

In [None]:
graph_types = ['pipeline',
               'error',
               'learning',
               'vc',
               'feature',
               'feature_all',
               'parameter',
               ]
for graph_type in graph_types:
    s.plot_model(best, plot=graph_type)

In [None]:
s.predict_model(best)

In [None]:
s.save_model(best, 'my_best_pipeline')

In [None]:




# Plot types available in interpret_model
interpretation_plots = ['summary', 'correlation', 'reason', 'pdp', 'msa', 'pfi']

interpretation = s.interpret_model(best, plot='summary')