# PyCaret

https://pycaret.org/

Low-code tools exist for just about everything nowadays, so why not for AI? PyCaret can help you with:

* Exploratory Data Analysis
* Data Preprocessing
* Model Training
* Model Explainability
* MLOps

So let's give it a test run.

Make sure you set up your virtual environment before running this code. Quick reminder:

```Shell
python -m venv venv
./venv/Scripts/activate
```
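
If you are not sure whether the activation worked, Python can tell you. A minimal sketch, relying on the standard-library behaviour that `sys.prefix` differs from `sys.base_prefix` inside an environment created with `venv`; the helper name `in_virtualenv` is made up for this example:

```python
import sys


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the Python it was built from.
    return prefix != base_prefix


print("Virtual environment active:", in_virtualenv())
```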

**Note**

I had some issues installing PyCaret in a Python 3.11 virtual environment. Installing Python 3.10 (making sure not to overwrite the default 3.11 installation) and building a virtual environment from that version fixed it. You can have two virtual environments in the same folder as long as they have different names. Something like this:

```Shell
&'C:\Python 3.10\python.exe' -m venv venv_caret
.\venv_caret\Scripts\activate
```
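
Before installing, it can save some time to check which interpreter the environment actually uses. A small sketch: the cut-off below only encodes the experience described in this note (3.10 worked, 3.11 did not). It is an assumption, not an official compatibility table, so check the PyCaret installation docs for the release you install. `is_known_good` is a made-up helper name.

```python
import sys

# Assumption taken from the note: 3.10 installed fine, 3.11 did not.
LAST_KNOWN_GOOD = (3, 10)


def is_known_good(version=sys.version_info) -> bool:
    """True when (major, minor) is at most the last version that worked in this note."""
    return tuple(version[:2]) <= LAST_KNOWN_GOOD


print("Running Python", ".".join(map(str, sys.version_info[:3])))
if not is_known_good():
    print("Newer than the version that worked in this note; the install may fail.")
```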

In [None]:
!pip install pandas numpy
!pip install pycaret

A [quickstart](https://pycaret.gitbook.io/docs/get-started/quickstart) sounds nice. The following is a copy of its code; for the explanations you'll need to visit the website.

In [1]:
# load sample dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from pycaret.classification import *
s = setup(data, target = 'Class variable', session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


In [None]:
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data, target = 'Class variable', session_id = 123)

# --> we'll be working with the functional API, not the OOP API

In [3]:
# functional API
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7689,0.8047,0.5602,0.7208,0.6279,0.4641,0.4736,0.447
ridge,Ridge Classifier,0.767,0.0,0.5497,0.7235,0.6221,0.4581,0.469,0.007
lda,Linear Discriminant Analysis,0.767,0.8055,0.555,0.7202,0.6243,0.4594,0.4695,0.009
rf,Random Forest Classifier,0.7485,0.7911,0.5284,0.6811,0.5924,0.415,0.4238,0.054
nb,Naive Bayes,0.7427,0.7955,0.5702,0.6543,0.6043,0.4156,0.4215,0.006
gbc,Gradient Boosting Classifier,0.7373,0.7918,0.555,0.6445,0.5931,0.4013,0.4059,0.027
ada,Ada Boost Classifier,0.7372,0.7799,0.5275,0.6585,0.5796,0.3926,0.4017,0.021
et,Extra Trees Classifier,0.7299,0.7788,0.4965,0.6516,0.5596,0.3706,0.3802,0.048
qda,Quadratic Discriminant Analysis,0.7282,0.7894,0.5281,0.6558,0.5736,0.3785,0.391,0.008
lightgbm,Light Gradient Boosting Machine,0.7133,0.7645,0.5398,0.6036,0.565,0.3534,0.358,0.059


In [4]:
print(best)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


Do you see all the different metrics that are measured? We'll get deeper into them in some of the later chapters.
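
As a small preview, several of these columns can be derived from four counts: true positives, false positives, false negatives and true negatives. The sketch below uses made-up counts (not taken from the diabetes run) and plain Python; `classification_metrics` is a hypothetical helper, not a PyCaret function. PyCaret reports scores averaged over cross-validation folds, so the values in the table will not follow from a single set of counts like this.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute a few common binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
    precision = tp / (tp + fp)                   # share of predicted positives that are right
    recall = tp / (tp + fn)                      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"Accuracy": accuracy, "Prec.": precision, "Recall": recall, "F1": f1}


# Invented counts, purely for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```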

In [5]:
# functional API
evaluate_model(best)


In [6]:
# functional API
predict_model(best)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7576,0.8568,0.5309,0.7049,0.6056,0.4356,0.4447


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
552,6,114,88,0,0,27.799999,0.247,66,0,0,0.8037
438,1,97,70,15,0,18.200001,0.147,21,0,0,0.9648
149,2,90,70,17,0,27.299999,0.085,22,0,0,0.9393
373,2,105,58,40,94,34.900002,0.225,25,0,0,0.7998
36,11,138,76,0,0,33.200001,0.420,35,0,1,0.6391
...,...,...,...,...,...,...,...,...,...,...,...
85,2,110,74,29,125,32.400002,0.698,27,0,0,0.8002
7,10,115,0,0,0,35.299999,0.134,29,0,1,0.6229
298,14,100,78,25,184,36.599998,0.412,46,1,0,0.5986
341,1,95,74,21,73,25.900000,0.673,36,0,0,0.9243
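
Note how `prediction_score` never drops below 0.5 in the rows shown: it appears to be the probability of the predicted label rather than the probability of class 1. Assuming that reading is right, the probability of the positive class can be recovered as in the sketch below. The (label, score) pairs are copied from the first five rows of the output; `positive_probability` is a made-up helper, not part of PyCaret.

```python
# (prediction_label, prediction_score) for rows 552, 438, 149, 373 and 36 of the output.
rows = [(0, 0.8037), (0, 0.9648), (0, 0.9393), (0, 0.7998), (1, 0.6391)]


def positive_probability(label: int, score: float) -> float:
    """Turn 'probability of the predicted label' into 'probability of class 1'."""
    return score if label == 1 else 1.0 - score


for label, score in rows:
    print(label, score, round(positive_probability(label, score), 4))
```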


In [7]:
# functional API
predictions = predict_model(best, data=data)
predictions.head()


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7773,0.8357,0.5709,0.7321,0.6415,0.4836,0.4915


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
0,6,148,72,35,0,33.599998,0.627,50,1,1,0.6939
1,1,85,66,29,0,26.6,0.351,31,0,0,0.9419
2,8,183,64,0,0,23.299999,0.672,32,1,1,0.7975
3,1,89,66,23,94,28.1,0.167,21,0,0,0.9453
4,0,137,40,35,168,43.099998,2.288,33,1,1,0.8393


In [8]:
# functional API
save_model(best, 'my_best_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Number of times pregnant',
                                              'Plasma glucose concentration a 2 '
                                              'hours in an oral glucose '
                                              'tolerance test',
                                              'Diastolic blood pressure (mm Hg)',
                                              'Triceps skin fold thickness (mm)',
                                              '2-Hour serum insulin (mu U/ml)',
                                              'Body mass index (weight in '
                                              'kg/(height in m)^2)',
                                              'Diabetes pedigre...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanC

In [9]:
# functional API
loaded_model = load_model('my_best_pipeline')
print(loaded_model)

Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=FastMemory(location=C:\Users\Jochen\AppData\Local\Temp\joblib),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['Number of times pregnant',
                                             'Plasma glucose concentration a 2 '
                                             'hours in an oral glucose '
                                             'tolerance test',
                                             'Diastolic blood pressure (mm Hg)',
                                             'Triceps skin fold thickness (mm)',
                                             '2-Hour serum insulin (mu U/ml)',
                                             'Body mass index (weig...
                 TransformerWrapper(exclude=None, include=None,
                                    transformer=CleanColumnNames(match='[\\]\\[\\,\\{\\}\\"\\:]+'))),
          

# The actual exercise

There was not much exercise in the part before this. We also didn't complete the quickstart, but more copy-pasting would not have helped us any further.

What would help us, and the world, much more is to solve heart disease. Or just help predict it. We'll be using a [Kaggle](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) dataset.

## Step 1: import dependencies

You need pandas and pycaret. Import them.

In [None]:
# DELETE

import pandas as pd
from pycaret.classification import *

## Step 2: Download and import data

Download the data from the link above and import it as a pandas DataFrame. It's also stored in the `files` folder.

In [None]:
# DELETE

df = pd.read_csv('files/heart.csv')
df.head()

Look at the types. Make a list with all the column names that contain categorical features.

In [None]:
df.dtypes
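
The types alone may not settle it: a categorical feature stored as integer codes shows up as a number. One heuristic is to count the distinct values per column, since a column with only a handful of distinct values is a candidate. A sketch on a tiny made-up frame with an arbitrary threshold. The values are invented; `sex` and `cp` are column names from this exercise, while `age` and `chol` are assumed to exist in the dataset.

```python
import pandas as pd

# Invented values, only meant to show the idea.
toy = pd.DataFrame({
    "age": [52, 53, 70, 61, 62, 58],
    "sex": [1, 1, 1, 1, 0, 0],
    "cp": [0, 1, 2, 0, 3, 1],
    "chol": [212, 203, 174, 204, 294, 248],
})

unique_counts = toy.nunique()   # number of distinct values per column
max_categories = 4              # arbitrary cut-off for this toy example
candidates = unique_counts[unique_counts <= max_categories].index.tolist()

print(unique_counts)
print(candidates)
```

On the real data, `df.nunique()` gives the same overview. The final call on what counts as categorical is still yours.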

In [None]:
# DELETE

cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'thal']

## Step 3: Train and evaluate model

Set up an experiment first. Make sure to pass the list of categorical features.

In [None]:
# DELETE

experiment = setup(df, target='target', categorical_features=cat_features)


Now that the experiment is set up, we can use it to compare the different models. Save the result in a variable!

In [None]:
# DELETE

best_model = compare_models()

## Step 4: Test model

Now that you have compared a lot of models, test the best one. Use only the bottom five rows of the data to test on.

In [None]:
# DELETE

# predict_model(best_model, data=df.drop('target', axis=1).tail()) -> without the target column, so how you would normally use a model
predict_model(best_model, data=df.tail())
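
One caveat: `df.tail()` returns rows that were already part of the data handed to `setup`, so the model may have seen some of them during training. A possible variation, sketched here on a made-up frame, is to set the last five rows aside before calling `setup`. The frame and its values are invented for illustration.

```python
import pandas as pd

# Invented stand-in for the heart data.
toy = pd.DataFrame({
    "feature": range(8),
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

holdout_df = toy.tail(5)    # last five rows, kept away from training
train_df = toy.iloc[:-5]    # everything before them

# In the exercise, train_df would go to setup(...) and holdout_df to predict_model(...).
print(len(train_df), len(holdout_df))
```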

## Step 5: Save the model

In a pickle file. `save_model` adds the `.pkl` extension for you.

In [None]:
# DELETE

save_model(best_model, 'files/heart_model')

And you may feel bad for your teacher having to look all this up, but [don't](https://youtu.be/sL-4rWuEiVw?si=wr5YAFCrg1LlSkcP).