<img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="360" />

# Insurance Dataset_Term3_Project1 (Pycaret)

### Problem Statement :- Automating ML and finding the best model for the insurance dataset with PyCaret

**We start from the altered dataset produced in the Insurance_Dataset_Term3_Project1_Part1 notebook. Its feature columns are Product_Info, Employment_Info, Family_Hist, InsuredInfo, Medical_History and Medical_Keyword (dummy variables), and its target column is Response. We need to prepare the data and deploy an efficient model for the target variable (Response).**

<center><img src="https://pycaret.org/wp-content/uploads/2020/03/Divi93_43.png" width="300" height="80" /></center>

### Features :-
- An open-source, low-code machine learning library for data preparation and model deployment.
- Time efficient (less time spent on code, more time on productive work).
- Simple and easy to use for end-to-end machine learning experiments with fewer lines of code.
- Its business-ready design allows quick and efficient prototyping from a notebook environment.

In [4]:
import pycaret                            # import required library
import numpy as np
import pandas as pd

In [2]:
pycaret.__version__

'2.0.0'

***Required dataset:- Insurance_df3***

In [5]:
Ins_df = pd.read_csv('Insurance_df3.csv', index_col=0)         # Insurance dataset (Insurance_df3.csv) for Pycaret model evaluation.
Ins_df.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_3,Product_Info_5,Product_Info_6,Product_Info_7,Employment_Info_2,Employment_Info_3,Employment_Info_5,InsuredInfo_1,...,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response,Product_Info_2
0,2,1,10,2,1,1,12,1,3,1,...,0,0,0,0,0,0,0,0,8,16
1,5,1,26,2,3,1,1,3,2,1,...,0,0,0,0,0,0,0,0,4,0
2,6,1,26,2,3,1,9,1,2,1,...,0,0,0,0,0,0,0,0,8,18
3,7,1,10,2,3,1,9,1,3,2,...,0,0,0,0,0,0,0,0,8,17
4,8,1,26,2,3,1,9,1,2,1,...,0,0,0,0,0,0,0,0,8,15


|Records|Features|Dataset Size|
|:---- | :-----|:-----|
|59381|115|15.2 MB|

|Variable|Description|
| :--  |  :---  |
|Id |A unique identifier associated with an application.|
|Product_Info_1-7 | A set of normalized variables relating to the product applied for |
|Ins_Age | Normalized age of applicant |
|Ht | Normalized height of applicant |
|Wt | Normalized weight of applicant |
|BMI | Normalized BMI of applicant |
|Employment_Info_1-6 | A set of normalized variables relating to the employment history of the applicant. |
|InsuredInfo_1-6 | A set of normalized variables providing information about the applicant. |
|Insurance_History_1-9 | A set of normalized variables relating to the insurance history of the applicant. |
|Family_Hist_1-5 | A set of normalized variables relating to the family history of the applicant. |
|Medical_History_1-41 | A set of normalized variables relating to the medical history of the applicant. |
|Medical_Keyword_1-48 | A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.|
|**Response** | **This is the target variable**, an ordinal variable relating to the final decision associated with an application |

***The following variables are all categorical (nominal):***

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41


***The following variables are continuous:***

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5


***The following variables are discrete:***

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

***dummy variables***

Medical_Keyword_1-48 are dummy variables.
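
The column ranges above (for example Medical_Keyword_1-48) can be expanded into explicit column names, which is handy if the lists are later passed to `setup` through its `categorical_features` or `numeric_features` parameters. This is a minimal sketch: the helper `expand_range` is a hypothetical name introduced here, and because the dataset was altered in the Part 1 notebook, only names that actually exist in `Ins_df` should be used.

```python
def expand_range(prefix, start, stop):
    """Return the column names prefix_start .. prefix_stop (inclusive)."""
    return [f"{prefix}_{i}" for i in range(start, stop + 1)]


# Medical_Keyword_1 ... Medical_Keyword_48, as listed in the variable table.
keyword_cols = expand_range("Medical_Keyword", 1, 48)

# The five discrete Medical_History columns named above.
discrete_cols = [f"Medical_History_{i}" for i in (1, 10, 15, 24, 32)]

print(len(keyword_cols), keyword_cols[0], keyword_cols[-1])
print(discrete_cols)
```

Filtering such a list with `[c for c in keyword_cols if c in Ins_df.columns]` would guard against columns dropped earlier.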

### Importing the module

In [4]:
from pycaret.classification import *               # import classification 

### Initializing the setup

In [5]:
clf1 = setup(data = Ins_df,                       # Let's do the setup for Ins_df dataset.
             target = 'Response',
             numeric_imputation = 'zero',
             ignore_features = ['Id'])

Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,6070
1,Target Type,Multiclass
2,Label Encoded,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7"
3,Original Data,"(59381, 115)"
4,Missing Values,False
5,Numeric Features,9
6,Categorical Features,105
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,
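
The "Label Encoded" row shows that the Response classes 1 to 8 are mapped to the integers 0 to 7. The sketch below reproduces that mapping in plain Python to show how an encoded prediction could be translated back to the original Response value. It illustrates the mapping only and is not PyCaret's internal code; the names `encode` and `decode` are assumptions.

```python
# Response classes as they appear in the dataset (1 through 8).
response_classes = [1, 2, 3, 4, 5, 6, 7, 8]

# Encoded label = position of the class in sorted order.
encode = {label: idx for idx, label in enumerate(sorted(response_classes))}

# Inverse mapping, used to read encoded predictions as Response values.
decode = {idx: label for label, idx in encode.items()}

print(encode)
print(decode[7])
```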


***The table above lists all the setup values and their descriptions. Having reviewed them, we can now compare various classification models.***

In [7]:
compare_models()         # we can use exclude=['catboost'] to save processing time       

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
0,Gradient Boosting Classifier,0.4618,0.0,0.2987,0.4249,0.4096,0.2843,0.3007,80.5283
1,CatBoost Classifier,0.4509,0.0,0.2916,0.4118,0.4067,0.2777,0.2885,176.5349
2,Light Gradient Boosting Machine,0.4498,0.0,0.2949,0.4077,0.4075,0.2788,0.2884,8.7634
3,Linear Discriminant Analysis,0.4495,0.0,0.3023,0.4164,0.4094,0.2798,0.2906,1.7862
4,Ridge Classifier,0.4476,0.0,0.2471,0.405,0.3843,0.2545,0.2758,0.5284
5,Extreme Gradient Boosting,0.4468,0.0,0.2963,0.4049,0.4036,0.274,0.284,68.1325
6,Ada Boost Classifier,0.4245,0.0,0.2724,0.3891,0.3736,0.2426,0.2553,3.7009
7,Extra Trees Classifier,0.3982,0.0,0.2472,0.3592,0.3653,0.2178,0.2225,6.5895
8,Random Forest Classifier,0.3796,0.0,0.2391,0.3459,0.3556,0.2047,0.2069,0.3699
9,Logistic Regression,0.3534,0.0,0.1566,0.2259,0.2297,0.0788,0.1131,1.6602


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=7028, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

**The model comparison above shows that the Gradient Boosting Classifier has the highest accuracy, precision and F1 score, so we create a Gradient Boosting Classifier model first.**
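
To check which model leads on each metric, the comparison table can be ranked column by column. The sketch below copies five columns from the table above into a plain dictionary; the helper `best_by` is a hypothetical name used only for this illustration.

```python
# (Accuracy, Recall, Prec., F1, TT in seconds) copied from the comparison table.
results = {
    "Gradient Boosting Classifier": (0.4618, 0.2987, 0.4249, 0.4096, 80.5283),
    "CatBoost Classifier": (0.4509, 0.2916, 0.4118, 0.4067, 176.5349),
    "Light Gradient Boosting Machine": (0.4498, 0.2949, 0.4077, 0.4075, 8.7634),
    "Linear Discriminant Analysis": (0.4495, 0.3023, 0.4164, 0.4094, 1.7862),
    "Ridge Classifier": (0.4476, 0.2471, 0.4050, 0.3843, 0.5284),
    "Extreme Gradient Boosting": (0.4468, 0.2963, 0.4049, 0.4036, 68.1325),
    "Ada Boost Classifier": (0.4245, 0.2724, 0.3891, 0.3736, 3.7009),
    "Extra Trees Classifier": (0.3982, 0.2472, 0.3592, 0.3653, 6.5895),
    "Random Forest Classifier": (0.3796, 0.2391, 0.3459, 0.3556, 0.3699),
    "Logistic Regression": (0.3534, 0.1566, 0.2259, 0.2297, 1.6602),
}
METRICS = ("Accuracy", "Recall", "Prec.", "F1", "TT (Sec)")


def best_by(metric, lowest=False):
    """Return the model with the highest (or lowest) value of a metric."""
    col = METRICS.index(metric)
    pick = min if lowest else max
    return pick(results, key=lambda name: results[name][col])


for metric in ("Accuracy", "Recall", "Prec.", "F1"):
    print(metric, "->", best_by(metric))
print("Fastest ->", best_by("TT (Sec)", lowest=True))
```

Ranked this way, Linear Discriminant Analysis has the highest recall and an F1 score within 0.0002 of the Gradient Boosting Classifier, while training in under 2 seconds instead of about 80.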

## Model 1: Gradient Boosting Classifier (gbc)

In [6]:
gbc  = create_model('gbc')      # Let's create model for gbc (short name for Gradient boosting classifier)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.4772,0.0,0.2984,0.4358,0.4266,0.3083,0.3233
1,0.4507,0.0,0.3153,0.4249,0.4029,0.2721,0.2857
2,0.47,0.0,0.3365,0.4304,0.4171,0.301,0.3152
3,0.491,0.0,0.3418,0.4715,0.4466,0.3256,0.341
4,0.4657,0.0,0.3131,0.433,0.4176,0.2898,0.3058
5,0.4513,0.0,0.2832,0.4122,0.3957,0.2673,0.2844
6,0.4753,0.0,0.354,0.4579,0.4308,0.3046,0.3198
7,0.4705,0.0,0.294,0.4312,0.4162,0.2936,0.3119
8,0.4729,0.0,0.3136,0.4364,0.4196,0.2977,0.316
9,0.4681,0.0,0.3131,0.4378,0.4158,0.2935,0.3092


**The table above shows the scores for each of the 10 cross-validation folds (the default) of the gradient boosting classifier (gbc) model. The mean and standard deviation of these fold scores summarize the model's performance.**
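
The mean and standard deviation can be recomputed from the per-fold values. The sketch below does this for the Accuracy column of the gbc table above, using only the standard library. Using the population standard deviation (`pstdev`) is an assumption about how the summary is computed.

```python
from statistics import mean, pstdev

# Accuracy of the 10 folds, copied from the create_model('gbc') output above.
gbc_fold_accuracy = [0.4772, 0.4507, 0.4700, 0.4910, 0.4657,
                     0.4513, 0.4753, 0.4705, 0.4729, 0.4681]

acc_mean = mean(gbc_fold_accuracy)   # average accuracy across folds
acc_sd = pstdev(gbc_fold_accuracy)   # spread of accuracy across folds

print(f"mean={acc_mean:.4f}  sd={acc_sd:.4f}")
```

A small standard deviation relative to the mean suggests the score does not depend strongly on how the data was split.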

In [7]:
gbc         # gbc model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=5370, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

**Let's tune the model to look for better scores.**

In [8]:
tuned_gbc = tune_model(gbc)      # tune gbc model 

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.4147,0.0,0.2377,0.3634,0.3778,0.2393,0.2441
1,0.4123,0.0,0.2527,0.3786,0.3796,0.2351,0.2399
2,0.4327,0.0,0.2689,0.3917,0.3941,0.2617,0.2678
3,0.4236,0.0,0.2652,0.3859,0.383,0.2458,0.2523
4,0.4152,0.0,0.2455,0.3675,0.379,0.2381,0.2436
5,0.3839,0.0,0.2286,0.35,0.3457,0.1912,0.1969
6,0.4079,0.0,0.2457,0.3682,0.3773,0.2335,0.2372
7,0.4055,0.0,0.2425,0.3777,0.3677,0.2219,0.2277
8,0.4103,0.0,0.2419,0.3643,0.3698,0.2301,0.2359
9,0.4055,0.0,0.2405,0.3667,0.3654,0.2223,0.2281
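
Tuning does not always improve a model, so it is worth comparing the fold scores before and after. The sketch below compares the Accuracy columns of the two gbc tables above; the helper `keep_tuned` is a hypothetical name and the decision rule (higher mean accuracy wins) is only one possible choice.

```python
from statistics import mean

# Fold accuracies copied from the create_model('gbc') table.
base_scores = [0.4772, 0.4507, 0.4700, 0.4910, 0.4657,
               0.4513, 0.4753, 0.4705, 0.4729, 0.4681]

# Fold accuracies copied from the tune_model(gbc) table.
tuned_scores = [0.4147, 0.4123, 0.4327, 0.4236, 0.4152,
                0.3839, 0.4079, 0.4055, 0.4103, 0.4055]


def keep_tuned(base, tuned):
    """Return True only if tuning raised the mean fold score."""
    return mean(tuned) > mean(base)


print(f"base mean  = {mean(base_scores):.4f}")
print(f"tuned mean = {mean(tuned_scores):.4f}")
print("keep tuned model:", keep_tuned(base_scores, tuned_scores))
```

For gbc the tuned mean is lower than the base mean, so under this rule the untuned model would be kept.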


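**Tuning lowered the fold accuracies here (roughly 0.38 to 0.43, against 0.45 to 0.49 for the untuned model). As the next step, we apply predict_model to the tuned_gbc model to score it on the hold-out set.**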

In [9]:
predictions_clfr = predict_model(tuned_gbc)  # predictions of model

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.412,0,0.2463,0.3718,0.3772,0.2354,0.2401


**The model evaluation tools provide the hyperparameters, ROC curve, confusion matrix, precision-recall curve, learning and validation curves, class report, feature importance and decision boundary. With this information we can evaluate the performance of the model.**

In [10]:
evaluate_model(tuned_gbc)       # Let's see this model_evaluation





**Having evaluated the Gradient Boosting Classifier (gbc), we now compare the Random Forest Classifier (rf), Decision Tree Classifier (dt) and Logistic Regression (lr).**

## Let's compare 3 models - RF, DT and LR

In [6]:
compare_models(include=['lr','dt','rf'])  

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
0,Random Forest Classifier,0.3819,0.0,0.2473,0.3508,0.3602,0.2099,0.2119,0.4286
1,Logistic Regression,0.3503,0.0,0.159,0.2321,0.2282,0.0769,0.1103,1.6836
2,Decision Tree Classifier,0.3179,0.0,0.2403,0.3176,0.3172,0.1531,0.1532,0.8302


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=6070, verbose=0,
                       warm_start=False)

**Observation: the comparison above shows that the RF model is best on accuracy, precision, recall and F1 score, so Random Forest Classifier (rf) is the first of these three models we build.**

## Model 2: Random Forest Classifier (rf)

In [7]:
rf = create_model('rf')     # create model

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.399,0.0,0.2546,0.3701,0.3809,0.238,0.2391
1,0.3702,0.0,0.2476,0.3509,0.3566,0.2013,0.2022
2,0.387,0.0,0.2492,0.3548,0.3647,0.2176,0.2194
3,0.3694,0.0,0.2256,0.3346,0.3472,0.1936,0.1953
4,0.3875,0.0,0.2605,0.3599,0.3659,0.2139,0.2162
5,0.3574,0.0,0.2521,0.3347,0.3421,0.1829,0.1839
6,0.3947,0.0,0.2497,0.355,0.3655,0.2185,0.2218
7,0.3815,0.0,0.2366,0.3474,0.3602,0.2118,0.2133
8,0.3791,0.0,0.2287,0.3348,0.3478,0.1977,0.201
9,0.3935,0.0,0.2689,0.3663,0.3708,0.2241,0.2265


In [8]:
rf

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=6070, verbose=0,
                       warm_start=False)

**Let's tune the random forest classifier model.**

In [9]:
tuned_rf = tune_model(rf)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.4736,0.0,0.2729,0.4582,0.4137,0.2969,0.3139
1,0.4447,0.0,0.2676,0.4304,0.3912,0.261,0.2746
2,0.4411,0.0,0.2629,0.3898,0.38,0.2557,0.2698
3,0.4597,0.0,0.2716,0.4237,0.4051,0.2814,0.2954
4,0.4681,0.0,0.2626,0.4381,0.4086,0.2885,0.3057
5,0.4452,0.0,0.2629,0.4108,0.3852,0.2566,0.2733
6,0.4537,0.0,0.2598,0.4253,0.3961,0.2714,0.2861
7,0.4549,0.0,0.2626,0.3979,0.3934,0.2725,0.2878
8,0.4729,0.0,0.2862,0.4547,0.4116,0.2907,0.3119
9,0.4465,0.0,0.2733,0.4145,0.3928,0.2646,0.2782


In [10]:
predictions_clsrf = predict_model(tuned_rf)       # Prediction of RF model

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.4533,0,0.2708,0.4288,0.4005,0.2743,0.2873


**Evaluation of the tuned Random Forest Classifier (rf) model.**

In [11]:
evaluate_model(tuned_rf)


**Random Forest Classifier Observation:-**

- **Confusion Matrix :-** We can compare True class & Predicted class for 1-8 response given.
- **Precision-Recall :-** We can see precision to recall score for model evaluation.
- **Class Report :-** We can view the classification report for each response class, with precision, recall, F1 score and support.
- **Feature Importance :-** We can compare feature importance like Medical_History, InsuredInfo, Medical_Keyword and so on.
- **ROC-AUC curve :-** We can compare ROC-AUC curve for 1-8 response.
- **Learning curve :-** We can compare Training and Cross validation score with Training instances.
- **Validation curve :-** We can compare Training and Cross validation score with Max_depth.
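
The confusion matrix and class report listed above can be illustrated on a small example. The sketch below uses hypothetical toy labels (not data from this notebook) and plain Python to count (true, predicted) pairs and derive per-class precision and recall. It also assumes that the Recall column in the score tables is macro-averaged over classes, which would explain why it sits well below Accuracy when some classes are rarely predicted correctly.

```python
from collections import Counter

# Hypothetical true and predicted Response labels (toy data for illustration).
y_true = [8, 8, 8, 6, 6, 2, 2, 1]
y_pred = [8, 8, 6, 6, 8, 2, 8, 8]

labels = sorted(set(y_true) | set(y_pred))
pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count


def precision(c):
    """Share of predictions of class c that were correct."""
    predicted = sum(pairs[(t, c)] for t in labels)
    return pairs[(c, c)] / predicted if predicted else 0.0


def recall(c):
    """Share of true class-c samples that were predicted as c."""
    actual = sum(pairs[(c, p)] for p in labels)
    return pairs[(c, c)] / actual if actual else 0.0


accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
macro_recall = sum(recall(c) for c in labels) / len(labels)

# Print the confusion matrix: one row per true class, one column per prediction.
for t in labels:
    print(t, [pairs[(t, p)] for p in labels])
print(f"accuracy={accuracy:.3f}  macro recall={macro_recall:.3f}")
```

In this toy case class 1 is never predicted, so its recall of zero pulls the macro average below the overall accuracy.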


## Model 3: Decision Tree Classifier (dt)

In [12]:
dt = create_model('dt')           # create model

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.3173,0.0,0.2059,0.3173,0.3169,0.153,0.1532
1,0.2921,0.0,0.2092,0.2966,0.2937,0.1236,0.1236
2,0.3281,0.0,0.2425,0.328,0.3274,0.1651,0.1653
3,0.3105,0.0,0.2469,0.3019,0.3058,0.1398,0.1399
4,0.3225,0.0,0.2559,0.3209,0.3213,0.159,0.1591
5,0.3008,0.0,0.226,0.3,0.2998,0.1301,0.1302
6,0.3249,0.0,0.2598,0.3278,0.325,0.1612,0.1615
7,0.3141,0.0,0.2557,0.3199,0.3165,0.1537,0.1537
8,0.3333,0.0,0.2552,0.326,0.3293,0.1676,0.1677
9,0.3357,0.0,0.2457,0.3376,0.3366,0.1774,0.1774


In [13]:
dt

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=6070, splitter='best')

In [14]:
tuned_dt = tune_model(dt)          # Let's tune the dt model

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.4375,0.0,0.2487,0.3818,0.3801,0.2505,0.2652
1,0.4026,0.0,0.2099,0.3296,0.3415,0.2043,0.215
2,0.4123,0.0,0.2195,0.3465,0.3534,0.2202,0.2304
3,0.4031,0.0,0.2161,0.3377,0.3483,0.2104,0.219
4,0.4043,0.0,0.2134,0.34,0.3379,0.1975,0.2128
5,0.3887,0.0,0.2137,0.3071,0.3283,0.1875,0.1971
6,0.432,0.0,0.2342,0.3721,0.3745,0.2478,0.2582
7,0.3959,0.0,0.2094,0.3203,0.334,0.1902,0.2024
8,0.4188,0.0,0.2228,0.3619,0.3456,0.2119,0.2319
9,0.4128,0.0,0.2261,0.3771,0.3648,0.2258,0.2351


In [15]:
predictions_clsdt = predict_model(tuned_dt)            # Prediction of dt model

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Decision Tree Classifier,0.4157,0,0.2437,0.3539,0.357,0.2222,0.2351


**Evaluation of the tuned Decision Tree Classifier (dt) model.**


In [26]:
evaluate_model(tuned_dt)


**Decision Tree Classifier Observation:-**

- **Confusion Matrix :-** We can compare True class & Predicted class for 1-8 response given.
- **Precision-Recall :-** We can see precision to recall score for model evaluation.
- **Class Report :-** We can view the classification report for each response class, with precision, recall, F1 score and support.
- **Feature Importance :-** We can compare feature importance like Medical_History, InsuredInfo, Family_Hist, Product_Info, Medical_Keyword and so on.
- **ROC-AUC curve :-** We can compare ROC-AUC curve for 1-8 response.
- **Learning curve :-** We can compare Training and Cross validation score with Training instances.
- **Validation curve :-** We can compare Training and Cross validation score with Max_depth.

## Model 4: Logistic Regression (lr)

In [31]:
lr = create_model('lr')           # create model

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.3594,0.0,0.1619,0.2056,0.2369,0.0882,0.1258
1,0.3425,0.0,0.1586,0.2544,0.2288,0.081,0.1038
2,0.351,0.0,0.1587,0.2005,0.2277,0.0818,0.1151
3,0.3418,0.0,0.1542,0.2771,0.2312,0.0659,0.0916
4,0.355,0.0,0.1633,0.1895,0.2282,0.0854,0.1203
5,0.367,0.0,0.1701,0.2942,0.2469,0.0946,0.1411
6,0.3526,0.0,0.1574,0.199,0.2176,0.0678,0.1133
7,0.3406,0.0,0.1569,0.2129,0.2176,0.0648,0.0925
8,0.3369,0.0,0.1428,0.1706,0.2079,0.0508,0.076
9,0.3562,0.0,0.1657,0.3176,0.2394,0.0889,0.1234


In [18]:
lr

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=6070, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
tuned_lr = tune_model(lr)           # Let's tune the lr model

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.357,0.0,0.1606,0.2024,0.2351,0.0878,0.1228
1,0.3462,0.0,0.1607,0.2412,0.2294,0.0818,0.1083
2,0.3522,0.0,0.1575,0.2001,0.2247,0.0784,0.1154
3,0.3369,0.0,0.1509,0.1905,0.2183,0.0593,0.0843
4,0.3562,0.0,0.1647,0.1916,0.2335,0.0924,0.1239
5,0.373,0.0,0.1745,0.333,0.2597,0.1102,0.1499
6,0.3538,0.0,0.1602,0.1996,0.2281,0.0787,0.1159
7,0.3394,0.0,0.1564,0.212,0.217,0.0637,0.0904
8,0.3357,0.0,0.1457,0.1712,0.212,0.0566,0.0793
9,0.3538,0.0,0.1649,0.203,0.2394,0.0906,0.1201


In [22]:
predictions_clslr = predict_model(tuned_lr)          # Prediction of lr model

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.3671,0,0.1726,0.2953,0.2495,0.1079,0.1445


**Evaluation of the tuned Logistic Regression (lr) model.**

In [23]:
evaluate_model(tuned_lr)


**Logistic Regression Observation:-**

- **Confusion Matrix :-** We can compare True class & Predicted class for 1-8 response given.
- **Precision-Recall :-** We can see precision to recall score for model evaluation.
- **Class Report :-** We can view the classification report for each response class, with precision, recall, F1 score and support.
- **Feature Importance :-** We can compare feature importance like Employment_Info, Product_Info, Medical_History, Medical_Keyword and so on.
- **ROC-AUC curve :-** We can compare ROC-AUC curve for 1-8 response.
- **Learning curve :-** We can compare Training and Cross validation score with Training instances.
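- **Validation curve :-** We can compare Training and Cross validation score against a model hyperparameter (Logistic Regression has no Max_depth parameter, as the model printout above shows).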

# Conclusion :-

**With PyCaret's model evaluation tools we can easily compare classification models. Combined with hyperparameter tuning and side-by-side scores, model selection, preparation and evaluation can be carried out quickly.**
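
As a compact summary, the hold-out scores reported by `predict_model` for the four tuned models can be ranked side by side. The sketch below copies Accuracy, F1 and Kappa from the outputs above; the helper `rank` is a hypothetical name used only for this illustration.

```python
# Hold-out scores copied from the four predict_model outputs above.
holdout = {
    "gbc": {"Accuracy": 0.4120, "F1": 0.3772, "Kappa": 0.2354},
    "rf": {"Accuracy": 0.4533, "F1": 0.4005, "Kappa": 0.2743},
    "dt": {"Accuracy": 0.4157, "F1": 0.3570, "Kappa": 0.2222},
    "lr": {"Accuracy": 0.3671, "F1": 0.2495, "Kappa": 0.1079},
}


def rank(metric):
    """Return model IDs ordered from best to worst on a hold-out metric."""
    return sorted(holdout, key=lambda m: holdout[m][metric], reverse=True)


for metric in ("Accuracy", "F1", "Kappa"):
    print(metric, "->", rank(metric))
```

On these numbers the tuned Random Forest Classifier ranks first on all three hold-out metrics, and Logistic Regression ranks last.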

# Thanks!!!