# PyCaret

https://pycaret.org/

There is a low-code tool for everything nowadays, so why not for AI? PyCaret can help you with:

* Exploratory Data Analysis
* Data Preprocessing
* Model Training
* Model Explainability
* MLOps

So let's give it a test run. A [quickstart](https://pycaret.gitbook.io/docs/get-started/quickstart) sounds nice. The following is a copy of its code; for the explanations you'll need to visit the website.

In [2]:
# load sample dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
from pycaret.classification import *
s = setup(data, target = 'Class variable', session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


In [None]:
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data, target = 'Class variable', session_id = 123)

# --> we'll be working with the functional API, not the OOP API

In [4]:
# functional API
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7689,0.8047,0.5602,0.7208,0.6279,0.4641,0.4736,0.537
ridge,Ridge Classifier,0.767,0.806,0.5497,0.7235,0.6221,0.4581,0.469,0.006
lda,Linear Discriminant Analysis,0.767,0.8055,0.555,0.7202,0.6243,0.4594,0.4695,0.007
rf,Random Forest Classifier,0.7485,0.7911,0.5284,0.6811,0.5924,0.415,0.4238,0.042
nb,Naive Bayes,0.7427,0.7955,0.5702,0.6543,0.6043,0.4156,0.4215,0.007
gbc,Gradient Boosting Classifier,0.7373,0.7914,0.555,0.6445,0.5931,0.4013,0.4059,0.028
ada,Ada Boost Classifier,0.7372,0.7799,0.5275,0.6585,0.5796,0.3926,0.4017,0.022
et,Extra Trees Classifier,0.7299,0.7788,0.4965,0.6516,0.5596,0.3706,0.3802,0.036
qda,Quadratic Discriminant Analysis,0.7282,0.7894,0.5281,0.6558,0.5736,0.3785,0.391,0.007
lightgbm,Light Gradient Boosting Machine,0.7133,0.7645,0.5398,0.6036,0.565,0.3534,0.358,0.053


Just noting: do you see how we just trained a dozen models in less than a minute without using a GPU? Try doing that with deep learning!

In [5]:
print(best)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


Did you notice all the different metrics that are reported in the comparison table above?

In [6]:
# functional API
evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

In [7]:
# functional API
predict_model(best)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7576,0.8568,0.5309,0.7049,0.6056,0.4356,0.4447


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
552,6,114,88,0,0,27.799999,0.247,66,0,0,0.8036
438,1,97,70,15,0,18.200001,0.147,21,0,0,0.9648
149,2,90,70,17,0,27.299999,0.085,22,0,0,0.9394
373,2,105,58,40,94,34.900002,0.225,25,0,0,0.7999
36,11,138,76,0,0,33.200001,0.420,35,0,1,0.6393
...,...,...,...,...,...,...,...,...,...,...,...
85,2,110,74,29,125,32.400002,0.698,27,0,0,0.8002
7,10,115,0,0,0,35.299999,0.134,29,0,1,0.6230
298,14,100,78,25,184,36.599998,0.412,46,1,0,0.5984
341,1,95,74,21,73,25.900000,0.673,36,0,0,0.9244
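
The hold-out metrics that `predict_model(best)` prints can be reproduced by hand from a confusion matrix. The sketch below rests on an assumption: the four counts (43 true positives, 18 false positives, 38 false negatives, 132 true negatives) are not printed by PyCaret here. They were inferred because they add up to the 231-row test set and are consistent with the reported values. AUC is left out because it depends on the predicted scores, not only on these four counts.

```python
from math import sqrt

# ASSUMPTION: confusion-matrix counts inferred from the reported metrics,
# not taken from any PyCaret output.
tp, fp, fn, tn = 43, 18, 38, 132
n = tp + fp + fn + tn  # 231 rows in the hold-out set

accuracy = (tp + tn) / n
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - p_expected) / (1 - p_expected)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

metrics = {
    "Accuracy": accuracy,
    "Recall": recall,
    "Prec.": precision,
    "F1": f1,
    "Kappa": kappa,
    "MCC": mcc,
}
for name, value in metrics.items():
    print(f"{name:<9}{value:.4f}")
```

Note how accuracy looks decent while recall is only a little above 0.5: the model misses almost half of the positive cases, which is why it pays to look at more than one metric.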


In [8]:
# functional API
predictions = predict_model(best, data=data)
predictions.head()


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7773,0.8357,0.5709,0.7321,0.6415,0.4836,0.4915


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
0,6,148,72,35,0,33.599998,0.627,50,1,1,0.694
1,1,85,66,29,0,26.6,0.351,31,0,0,0.9419
2,8,183,64,0,0,23.299999,0.672,32,1,1,0.7976
3,1,89,66,23,94,28.1,0.167,21,0,0,0.9454
4,0,137,40,35,168,43.099998,2.288,33,1,1,0.8394
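
A note on reading these columns: as far as the PyCaret documentation describes it, `prediction_score` is the probability of the predicted label, not of class 1. That fits the output above, where every score is at least 0.5. The helper below is a hypothetical illustration (its name is not part of PyCaret) of how to turn a label and score pair into the probability of the positive class.

```python
def positive_class_probability(label: int, score: float) -> float:
    """Convert (prediction_label, prediction_score) into P(class == 1).

    ASSUMPTION: the score is the probability of the predicted label,
    so for label 0 the positive-class probability is 1 - score.
    """
    return score if label == 1 else 1.0 - score


# (prediction_label, prediction_score) pairs copied from the five rows above.
rows = [(1, 0.694), (0, 0.9419), (1, 0.7976), (0, 0.9454), (1, 0.8394)]

probabilities = [
    round(positive_class_probability(label, score), 4) for label, score in rows
]
print(probabilities)
```

PyCaret's `predict_model` also documents a `raw_score` flag that should return one score column per class; check the documentation of your installed version before relying on it.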


In [None]:
# functional API
save_model(best, 'exports/my_best_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Number of times pregnant',
                                              'Plasma glucose concentration a 2 '
                                              'hours in an oral glucose '
                                              'tolerance test',
                                              'Diastolic blood pressure (mm Hg)',
                                              'Triceps skin fold thickness (mm)',
                                              '2-Hour serum insulin (mu U/ml)',
                                              'Body mass index (weight in '
                                              'kg/(height in m)^2)',
                                              'Diabetes pedigre...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanC

In [10]:
# functional API
loaded_model = load_model('exports/my_best_pipeline')  # same path as used in save_model above
print(loaded_model)

Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=C:\Users\Jochen\AppData\Local\Temp\joblib),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['Number of times pregnant',
                                             'Plasma glucose concentration a 2 '
                                             'hours in an oral glucose '
                                             'tolerance test',
                                             'Diastolic blood pressure (mm Hg)',
                                             'Triceps skin fold thickness (mm)',
                                             '2-Hour serum insulin (mu U/ml)',
                                             'Body mass index (weig...
                 TransformerWrapper(exclude=None, include=None,
                                    transformer=CleanColumnNames(match='[\\]\\[\\,\\{\\}\\"\\:]+'))),
          

# The actual exercise

There was not much exercise in the part before this. We also didn't complete the quickstart, but more copy-pasting would not have helped us any further.

What would help us (and the world) much more is to solve heart disease. Or just to help predict it. We'll be using a [Kaggle](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) dataset.

## Step 1: import dependencies

You need pandas and pycaret. Import them.

In [11]:
# DELETE

import pandas as pd
from pycaret.classification import *

## Step 2: Download and import data

Download the data from the link above and import it as a pandas DataFrame. It's also stored in the `files` folder.

In [12]:
# DELETE

df = pd.read_csv('files/heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


Count the number of distinct values per column. PyCaret likes to know which columns are categorical, and columns with only a few distinct values are likely to be categorical.

In [18]:
#DELETE
# df.info()
df.nunique()

age          41
sex           2
cp            4
trestbps     49
chol        152
fbs           2
restecg       3
thalach      91
exang         2
oldpeak      40
slope         3
ca            5
thal          4
target        2
dtype: int64

Good candidates are 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca' and 'thal'. Target is our target column which has only two values (0 or 1), so no need to include that in the categoricals.
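
The shortlisting above can be sketched in a few lines. The helper name and the cut-off of five distinct values are assumptions chosen for illustration; the counts are copied from the `df.nunique()` output.

```python
def shortlist_categoricals(unique_counts, target, max_levels=5):
    """Return columns with at most `max_levels` distinct values, minus the target.

    `unique_counts` is any mapping-like object with .items(), for example a
    dict or the pandas Series returned by df.nunique().
    """
    return [
        column
        for column, n_unique in unique_counts.items()
        if column != target and n_unique <= max_levels
    ]


# Distinct-value counts copied from the df.nunique() output above.
unique_counts = {
    "age": 41, "sex": 2, "cp": 4, "trestbps": 49, "chol": 152,
    "fbs": 2, "restecg": 3, "thalach": 91, "exang": 2, "oldpeak": 40,
    "slope": 3, "ca": 5, "thal": 4, "target": 2,
}

candidates = shortlist_categoricals(unique_counts, target="target")
print(candidates)
```

In the notebook itself you can pass `df.nunique()` directly instead of the hand-typed dict. A low count only makes a column a candidate; whether it really is categorical still depends on what the values mean.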

When looking at the description of the dataset (link above, but also [here](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)) we note that we were right on most counts:

1) age: not categorical
1) sex: categorical
1) chest pain type (4 values): categorical
1) resting blood pressure: not categorical
1) serum cholesterol in mg/dl: not categorical
1) fasting blood sugar > 120 mg/dl: categorical (True/False)
1) resting electrocardiographic results (values 0,1,2): categorical
1) maximum heart rate achieved: not categorical
1) exercise induced angina: categorical (True/False)
1) oldpeak = ST depression induced by exercise relative to rest: not categorical
1) the slope of the peak exercise ST segment: **not** categorical
1) number of major vessels (0-3) colored by fluoroscopy: **not** categorical
1) thal: 0 = normal; 1 = fixed defect; 2 = reversible defect: categorical

In [19]:
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'thal']

## Step 3: Train and evaluate model

Set up an experiment first. Make sure to pass the list of categorical features.

In [20]:
# DELETE

experiment = setup(df, target='target', categorical_features=cat_features)


Unnamed: 0,Description,Value
0,Session id,8409
1,Target,target
2,Target type,Binary
3,Original data shape,"(1025, 14)"
4,Transformed data shape,"(1025, 22)"
5,Transformed train set shape,"(717, 22)"
6,Transformed test set shape,"(308, 22)"
7,Numeric features,7
8,Categorical features,6
9,Preprocess,True
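
The shapes in this summary can be checked with a bit of arithmetic. The sketch below assumes PyCaret's default `train_size` of 0.7, and it assumes that two-level categorical columns are encoded into a single column while the other categorical columns get one column per level. These assumptions are not stated in the output above, but they reproduce the reported numbers.

```python
from math import floor

n_rows = 1025
train_size = 0.7  # ASSUMPTION: PyCaret's default, not passed explicitly above

# Row counts of the train and test split.
n_train = floor(train_size * n_rows)
n_test = n_rows - n_train

# Distinct levels per categorical feature, taken from df.nunique().
levels = {"sex": 2, "cp": 4, "fbs": 2, "restecg": 3, "exang": 2, "thal": 4}
n_numeric = 7  # age, trestbps, chol, thalach, oldpeak, slope, ca

# ASSUMPTION: two-level columns stay one column, the rest are one-hot encoded.
n_encoded = sum(1 if k == 2 else k for k in levels.values())

n_columns = n_numeric + n_encoded + 1  # +1 for the target column
print("train:", (n_train, n_columns))
print("test: ", (n_test, n_columns))
```

So the jump from 14 to 22 columns comes entirely from encoding `cp`, `restecg` and `thal`.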


Now that the experiment is set up, we can use it to compare the different models. Save the best result in a variable!

In [21]:
# DELETE

best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.9888,0.9971,0.9864,0.9921,0.9891,0.9777,0.9781,0.064
et,Extra Trees Classifier,0.9805,0.9991,0.9809,0.9814,0.981,0.9609,0.9612,0.055
lightgbm,Light Gradient Boosting Machine,0.979,0.9925,0.9782,0.9814,0.9795,0.9581,0.9586,0.101
dt,Decision Tree Classifier,0.9777,0.9779,0.9701,0.9861,0.9778,0.9555,0.9559,0.023
gbc,Gradient Boosting Classifier,0.961,0.9884,0.9592,0.9659,0.9618,0.9219,0.9234,0.044
ada,Ada Boost Classifier,0.8855,0.9592,0.9017,0.8795,0.8894,0.7706,0.7728,0.035
lr,Logistic Regression,0.8339,0.9124,0.88,0.8132,0.8442,0.6669,0.6717,0.698
ridge,Ridge Classifier,0.8241,0.9112,0.8827,0.7974,0.8368,0.647,0.6534,0.021
lda,Linear Discriminant Analysis,0.8241,0.9114,0.8827,0.7974,0.8368,0.647,0.6534,0.022
nb,Naive Bayes,0.8228,0.8818,0.8748,0.8006,0.8348,0.6444,0.6501,0.023


## Step 4: Test model

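Now that you have compared a lot of models, test the best one. Use only the last five rows of the data. Keep in mind that some of these rows were probably part of the training split, so the scores are a sanity check rather than a real evaluation.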

In [22]:
# DELETE

# predict_model(best_model, data=df.drop('target', axis=1).tail()) -> without the target column, so how you would normally use a model
predict_model(best_model, data=df.tail())

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,prediction_label,prediction_score
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1,1,0.87
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0,0,1.0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0,0,1.0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1,1,1.0
1024,54,1,0,120,188,0,1,113,0,1.4,1,1,3,0,0,1.0


## Step 5: Save the model

Save the best model as a pickle file.

In [23]:
# DELETE

save_model(best_model, 'files/heart_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['age', 'trestbps', 'chol',
                                              'thalach', 'oldpeak', 'slope',
                                              'ca'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',
                  TransformerWrapper(exclude...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
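
One practical detail when saving: `save_model` takes a name without extension and, as far as I know, writes `<name>.pkl`. It may not create missing folders for you. The helper below is hypothetical (not part of PyCaret) and only sketches how to make sure the target folder exists and what file to expect. It runs in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def prepare_model_path(name: str) -> Path:
    """Create the parent folder for `name` and return the expected pickle path.

    ASSUMPTION: the model is stored as '<name>.pkl', so only the
    folder needs to exist before saving.
    """
    target = Path(name)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.with_name(target.name + ".pkl")


with tempfile.TemporaryDirectory() as tmp:
    path = prepare_model_path(f"{tmp}/files/heart_model")
    folder_exists = path.parent.is_dir()
    print(path.name, folder_exists)
```

In the notebook you would call `prepare_model_path('files/heart_model')` before `save_model(best_model, 'files/heart_model')`, and pass the same name (without `.pkl`) to `load_model` later.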
            

And you may feel bad for your teacher having to look all this up, but [don't](https://youtu.be/sL-4rWuEiVw?si=wr5YAFCrg1LlSkcP).