# Using PyCaret for Classification Models

In [15]:
from pycaret.datasets import get_data
dataset = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


When setup is executed, PyCaret automatically infers the data type of every feature from properties of the data. The inference is usually correct, but not always. For that reason, PyCaret displays the inferred types and asks for confirmation once setup runs. Press Enter if all data types are correct, or type quit to exit the setup.

Correct data types matter because PyCaret automatically performs several type-specific preprocessing tasks that the machine learning models depend on.

Alternatively, you can pre-define the data types by passing the numeric_features and categorical_features parameters to setup.
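
As a minimal sketch of pre-defining the types, the snippet below lists the eight feature columns from the preview above and declares them as numeric. The helper name `setup_with_explicit_types` is hypothetical, and the import sits inside the function only so the column list can be inspected without PyCaret installed. Whether the confirmation prompt still appears depends on the PyCaret version.

```python
# Feature columns copied from the dataset preview; all eight hold numbers.
NUMERIC_COLUMNS = [
    'Number of times pregnant',
    'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
    'Diastolic blood pressure (mm Hg)',
    'Triceps skin fold thickness (mm)',
    '2-Hour serum insulin (mu U/ml)',
    'Body mass index (weight in kg/(height in m)^2)',
    'Diabetes pedigree function',
    'Age (years)',
]
TARGET = 'Class variable'


def setup_with_explicit_types(data):
    """Hypothetical helper: run setup with the feature types declared up front."""
    # Imported lazily so this snippet loads even where PyCaret is absent.
    from pycaret.classification import setup

    return setup(
        data=data,
        target=TARGET,
        session_id=123,
        numeric_features=NUMERIC_COLUMNS,  # declare these columns as numeric
    )
```

This dataset has no categorical columns, so categorical_features is left at its default here.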

## Creating the Dataset

In [16]:
# Keep 95% of the rows for modeling and hold out the rest as unseen data
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)

# Renumber both frames from 0 so their indexes are contiguous
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (730, 9)
Unseen Data For Predictions: (38, 9)


The metrics evaluated during cross-validation can be listed with the get_metrics function. Custom metrics can be added with the add_metric function and removed with the remove_metric function.
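
A hedged sketch of how a custom metric could be registered once setup has been executed. The `specificity` function and the `register_specificity` helper are illustrative names, not part of PyCaret, and the `add_metric` call assumes the (id, name, score_func) argument order of `pycaret.classification`.

```python
def specificity(y_true, y_pred):
    """True negative rate: share of actual negatives predicted as negative."""
    pairs = list(zip(y_true, y_pred))
    negatives = sum(1 for actual, _ in pairs if actual == 0)
    true_negatives = sum(
        1 for actual, pred in pairs if actual == 0 and pred == 0
    )
    # Return 0.0 when a fold contains no negative cases at all.
    return true_negatives / negatives if negatives else 0.0


def register_specificity():
    """Hypothetical helper; call it only after setup has been executed."""
    # Imported lazily so the metric itself can be tested without PyCaret.
    from pycaret.classification import add_metric, get_metrics

    add_metric('specificity', 'Specificity', specificity)
    return get_metrics()  # table of the metrics currently registered
```

Calling remove_metric('specificity') afterwards should drop the metric again.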

In [17]:
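from pycaret.classification import *

# Initialize the experiment; session_id fixes the random seed for reproducibility
exp_clf101 = setup(data=data, target='Class variable', session_id=123)

# Train and cross-validate the available models, then return the best one
best = compare_models()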

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7529,0.8105,0.5901,0.6845,0.6322,0.4478,0.4518,0.045
knn,K Neighbors Classifier,0.751,0.7719,0.6053,0.675,0.6318,0.4454,0.4516,0.052
lda,Linear Discriminant Analysis,0.7471,0.804,0.557,0.6836,0.609,0.4261,0.4342,0.008
lr,Logistic Regression,0.7451,0.8065,0.5412,0.6878,0.6006,0.4185,0.4285,2.165
ridge,Ridge Classifier,0.7451,0.0,0.557,0.678,0.6066,0.4221,0.4298,0.009
rf,Random Forest Classifier,0.7451,0.8095,0.5743,0.6677,0.615,0.427,0.4311,0.202
et,Extra Trees Classifier,0.7451,0.784,0.5357,0.6887,0.5997,0.4176,0.4265,0.2
lightgbm,Light Gradient Boosting Machine,0.7314,0.7821,0.5629,0.6545,0.5994,0.3998,0.4065,0.028
ada,Ada Boost Classifier,0.7294,0.7803,0.5678,0.6503,0.6026,0.3987,0.4037,0.044
nb,Naive Bayes,0.6863,0.7576,0.2953,0.6544,0.4002,0.227,0.2636,0.01


- Precision is the proportion of cases classified as positive that are actually positive. A low precision signals a high share of false positives.
- Recall is the proportion of actual positive cases that the model classifies as positive. A low recall signals a high share of false negatives.
- Accuracy is the proportion of all predictions that are correct.
- AUC stands for "Area Under the ROC Curve". It measures how well the model distinguishes between classes: the higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. PyCaret reports 0 for models that do not support probability estimates, which is why the Ridge Classifier shows an AUC of 0.0 above.
- F1-score, or F-measure, is the harmonic mean of precision and recall, so it is high only when both are high.
- Kappa indicates how much better the classifier performs than one that simply guesses at random according to the frequency of each class. Values below 0 indicate no agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement.
- MCC (Matthews correlation coefficient) is a correlation coefficient between the observed and predicted binary classifications, with values between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation.
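
To make these definitions concrete, here is a small pure-Python sketch that computes them from confusion-matrix counts, plus AUC from scores. The counts and scores are made up for illustration and do not come from the diabetes data; the code assumes binary labels and that both classes are present.

```python
from math import sqrt


def confusion_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Kappa compares observed accuracy with the accuracy expected by chance,
    # where chance follows the actual and predicted class frequencies.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)

    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        'Accuracy': accuracy, 'Recall': recall, 'Prec.': precision,
        'F1': f1, 'Kappa': kappa, 'MCC': mcc,
    }


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
print({name: round(value, 4) for name, value in metrics.items()})
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With these counts the accuracy is 0.7 and Kappa is 0.4, which falls at the top of the "fair" band listed above.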

## Analyze Model

The evaluate_model function analyzes the performance of a trained model on the test set. It may require re-training the model in certain cases.

In [18]:
evaluate_model(best)

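
evaluate_model renders an interactive widget, which only works inside a notebook. The sketch below shows a non-interactive alternative, assuming PyCaret's `plot_model` function with its `'auc'` and `'confusion_matrix'` plot names and its `save` parameter. The helper `save_diagnostic_plots` is a hypothetical name.

```python
DIAGNOSTIC_PLOTS = ('auc', 'confusion_matrix')


def save_diagnostic_plots(model, plots=DIAGNOSTIC_PLOTS):
    """Hypothetical helper: write each plot to a file instead of a widget."""
    # Imported lazily so this snippet loads even where PyCaret is absent.
    from pycaret.classification import plot_model

    for plot_name in plots:
        # save=True stores the figure as an image file rather than showing it.
        plot_model(model, plot=plot_name, save=True)
```

After setup and compare_models have run, it would be called as save_diagnostic_plots(best).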

## Predictions

In the output below, Label is the predicted class and Score is the predicted probability of that class.

In [19]:
unseen_predictions = predict_model(best, data=data_unseen)
unseen_predictions.head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,Label,Score
0,5,116,74,0,0,25.6,0.201,30,0,0,0.9299
1,1,146,56,0,0,29.7,0.564,29,0,1,0.7043
2,7,103,66,32,0,39.1,0.344,31,1,0,0.7174
3,1,71,48,18,76,20.4,0.323,22,0,0,0.9876
4,2,107,74,30,100,33.6,0.404,23,0,0,0.9325
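
Because Score is the probability of whichever class was predicted, it is not directly the probability of diabetes (class 1). A short sketch, using the five rows shown above and assuming a binary problem, converts Score into the probability of class 1 and counts the correct predictions. The function name `positive_probability` is illustrative.

```python
# (Class variable, Label, Score) for the five rows shown above.
rows = [
    (0, 0, 0.9299),
    (0, 1, 0.7043),
    (1, 0, 0.7174),
    (0, 0, 0.9876),
    (0, 0, 0.9325),
]


def positive_probability(label, score):
    """Convert the predicted-class probability into P(class = 1)."""
    return score if label == 1 else 1.0 - score


p_positive = [
    round(positive_probability(label, score), 4) for _, label, score in rows
]
correct = sum(1 for actual, label, _ in rows if actual == label)

print(p_positive)
print(f'{correct} of {len(rows)} rows predicted correctly')
```

For the first row, a Score of 0.9299 for Label 0 corresponds to a class 1 probability of 0.0701.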


## Saving Best Model

In [20]:
save_model(best, 'my_best_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[],
                                       target='Class variable',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeri...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
          

To load the saved pipeline back into the environment:

In [21]:
loaded_model = load_model('my_best_pipeline')
print(loaded_model)

Transformation Pipeline and Model Successfully Loaded
Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[],
                                      target='Class variable',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeri...
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
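
A loaded pipeline can score new rows in the same way as the in-memory model. The sketch below assumes that save_model writes the file with a `.pkl` extension appended to the given name; the helpers `pipeline_path` and `predict_with_saved_pipeline` are hypothetical names.

```python
from pathlib import Path


def pipeline_path(name):
    """Return the file that save_model is assumed to create for this name."""
    return Path(f'{name}.pkl')


def predict_with_saved_pipeline(name, new_data):
    """Hypothetical helper: load a saved pipeline and score new rows with it."""
    # Imported lazily so this snippet loads even where PyCaret is absent.
    from pycaret.classification import load_model, predict_model

    if not pipeline_path(name).exists():
        raise FileNotFoundError(
            f'{pipeline_path(name)} not found; run save_model first'
        )
    pipeline = load_model(name)  # the name is passed without the extension
    return predict_model(pipeline, data=new_data)
```

For this notebook it would be called as predict_with_saved_pipeline('my_best_pipeline', data_unseen).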
                                      