<H1>Data Preparation</H1>

<p>Extracted from https://pycaret.gitbook.io/docs/get-started/preprocessing/data-preparation </p>

<H2>Missing Values</H2>
<p>Datasets often contain missing values or empty records, typically encoded as blanks or NaN. Most machine learning algorithms cannot handle missing or blank values. Removing the affected samples is a basic strategy that is sometimes used, but it discards potentially valuable data along with the information and patterns it carries. A better strategy is usually to impute the missing values.</p>
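<p>The trade-off between dropping and imputing can be sketched in a few lines of plain Python. This is only an illustration of the idea, not PyCaret's implementation: the toy records and the helper <code>is_missing</code> are made up for the example, and mean imputation is assumed as the fill rule.</p>

```python
import math
from statistics import mean

# Toy records with gaps (NaN); values are illustrative, not taken from the hepatitis data.
rows = [
    {"BILIRUBIN": 1.0, "PROTIME": math.nan},
    {"BILIRUBIN": 0.9, "PROTIME": 80.0},
    {"BILIRUBIN": math.nan, "PROTIME": 60.0},
    {"BILIRUBIN": 0.7, "PROTIME": math.nan},
]


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Strategy 1: drop every row that has at least one missing value.
complete_rows = [r for r in rows if not any(is_missing(v) for v in r.values())]

# Strategy 2: replace each missing value with the mean of the observed values in its column.
columns = list(rows[0].keys())
col_means = {c: mean(r[c] for r in rows if not is_missing(r[c])) for c in columns}
imputed_rows = [
    {c: (col_means[c] if is_missing(r[c]) else r[c]) for c in columns} for r in rows
]

print(len(complete_rows), "row(s) left after dropping")
print(len(imputed_rows), "row(s) left after imputing")
```

<p>Dropping keeps only the single complete record, while imputation keeps all four, which is the cost the paragraph above describes.</p>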

In [1]:
# load dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')


Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


Unnamed: 0,Description,Value
0,Session id,2876
1,Target,Class
2,Target type,Binary
3,Original data shape,"(154, 20)"
4,Transformed data shape,"(154, 20)"
5,Transformed train set shape,"(107, 20)"
6,Transformed test set shape,"(47, 20)"
7,Numeric features,19
8,Rows with missing values,48.1%
9,Preprocess,True


In [2]:
clf1.get_config('dataset_transformed')

Unnamed: 0,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Class
118,54.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,3.2,85.000000,28.0,3.8,62.561405,2.0,0
73,50.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.5,100.000000,100.0,5.3,62.561405,1.0,0
2,78.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.000000,32.0,4.0,62.561405,1.0,0
32,41.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,81.000000,53.0,5.0,74.000000,1.0,0
90,42.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,4.6,100.579544,55.0,3.3,62.561405,2.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,47.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,166.000000,30.0,2.6,31.000000,2.0,1
102,51.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,4.6,215.000000,269.0,3.9,51.000000,2.0,0
113,36.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.1,141.000000,75.0,3.3,62.561405,2.0,0
135,51.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,0.8,100.579544,33.0,4.5,62.561405,2.0,0


<H2>Data Types</H2>
<p>Each feature in the dataset has an associated data type such as numeric, categorical, or datetime. PyCaret’s inference algorithm automatically detects the data type of each feature, but the inferred types are sometimes incorrect. Getting the data types right matters because several downstream processes depend on them; for example, missing values in numeric and categorical features are imputed differently. To override the inferred data types, use the numeric_features, categorical_features, and date_features parameters of the setup function. You can also use ignore_features to exclude certain features from model training.</p>
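<p>To see why the declared type matters, the sketch below fills the same gap in two ways depending on how the column is typed. It assumes mean imputation for numeric columns and most-frequent-value imputation for categorical ones, which are common choices but not confirmed here as PyCaret's settings; the function <code>impute</code> and the sample ages are hypothetical.</p>

```python
from statistics import mean, mode


def impute(values, kind):
    """Fill None entries: mean for 'numeric' columns, most frequent value otherwise."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]


# Illustrative ages with one gap (assumption: not real rows from the dataset).
ages = [30, 50, None, 30, 34]

# Treated as numeric, the gap is filled with the mean of the observed ages.
as_numeric = impute(ages, "numeric")

# Treated as categorical (values cast to strings), the gap is filled with the most common level.
as_category = impute([None if a is None else str(a) for a in ages], "categorical")

print(as_numeric)
print(as_category)
```

<p>The same missing entry becomes 36 under the numeric rule and '30' under the categorical rule, so an incorrectly inferred type silently changes the data the model sees.</p>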

<h2>Example 1 - Categorical Features</h2>

In [20]:

# init setup
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')
hepatitis['AGE'] = hepatitis['AGE'].astype(str)

from pycaret.classification import *
clf2 = setup(data = hepatitis, target = 'Class', categorical_features = ['AGE'], max_encoding_ohe=100)


Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


Unnamed: 0,Description,Value
0,Session id,7313
1,Target,Class
2,Target type,Binary
3,Original data shape,"(154, 20)"
4,Transformed data shape,"(154, 60)"
5,Transformed train set shape,"(107, 60)"
6,Transformed test set shape,"(47, 60)"
7,Numeric features,18
8,Categorical features,1
9,Rows with missing values,48.1%


In [18]:
clf2.data.dtypes

AGE                category
SEX                    int8
STEROID             float32
ANTIVIRALS             int8
FATIGUE                int8
MALAISE                int8
ANOREXIA               int8
LIVER BIG           float32
LIVER FIRM          float32
SPLEEN PALPABLE     float32
SPIDERS             float32
ASCITES             float32
VARICES             float32
BILIRUBIN           float32
ALK PHOSPHATE       float32
SGOT                float32
ALBUMIN             float32
PROTIME             float32
HISTOLOGY              int8
Class                  int8
dtype: object

In [21]:
clf2.get_config('dataset_transformed')

Unnamed: 0,AGE_27,AGE_45,AGE_39,AGE_31,AGE_44,AGE_34,AGE_47,AGE_52,AGE_20,AGE_32,...,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Class
25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,0.8,95.000000,46.0,3.80000,100.00000,1.0,0
112,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,1.2,81.000000,65.0,3.00000,63.21212,1.0,0
50,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,0.9,85.000000,60.0,4.00000,63.21212,1.0,0
130,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,8.0,101.522728,101.0,2.20000,63.21212,2.0,1
68,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,1.6,68.000000,68.0,3.70000,63.21212,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,2.0,1.0,4.6,215.000000,269.0,3.90000,51.00000,2.0,0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,0.9,81.000000,60.0,3.90000,52.00000,1.0,0
70,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,2.0,2.0,2.8,127.000000,182.0,3.85102,63.21212,1.0,1
99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,4.8,123.000000,157.0,2.70000,31.00000,2.0,1


In [6]:
print(f'Ordinal features: {clf2._fxs["Ordinal"]}')
print(f'Numeric features: {clf2._fxs["Numeric"]}')
print(f'Date features: {clf2._fxs["Date"]}')
print(f'Text features: {clf2._fxs["Text"]}')
print(f'Categorical features: {clf2._fxs["Categorical"]}')

Ordinal features: {}
Numeric features: ['SEX', 'STEROID', 'ANTIVIRALS', 'FATIGUE', 'MALAISE', 'ANOREXIA', 'LIVER BIG', 'LIVER FIRM', 'SPLEEN PALPABLE', 'SPIDERS', 'ASCITES', 'VARICES', 'BILIRUBIN', 'ALK PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'HISTOLOGY']
Date features: []
Text features: []
Categorical features: ['AGE']


<H2>One-Hot Encoding</H2>
<p>Categorical features contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot work with categorical features directly, so these must be transformed into numeric values before a model is trained. The most common type of categorical encoding is One-Hot Encoding (also known as dummy encoding), where each categorical level becomes a separate feature in the dataset containing binary values (1 or 0).
Because this step is required to run an ML experiment, PyCaret transforms all categorical features in the dataset using one-hot encoding. This is ideal for nominal categorical data, i.e. data that cannot be ordered. In other scenarios, other encoding methods should be used; for example, when the data is ordinal, i.e. has intrinsic levels, Ordinal Encoding should be used. One-Hot Encoding applies to all features that are either inferred as categorical or forced to be categorical using categorical_features in the setup function.</p>
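<p>A minimal sketch of one-hot encoding in plain Python follows. It is meant only to show the shape of the transformation; the function <code>one_hot</code> and the sample values are illustrative, and the column naming (<code>feature_level</code>) simply mimics the transformed column names seen in the outputs of this notebook.</p>

```python
def one_hot(values, prefix):
    """Return one dict per value with a 0/1 indicator for every distinct level."""
    levels = sorted(set(values))  # fixed, reproducible column order
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]


# Illustrative nominal feature (assumption: a handful of made-up values).
type_1 = ["Grass", "Fire", "Grass", "Water"]
one_hot_rows = one_hot(type_1, "Type 1")

for row in one_hot_rows:
    print(row)
```

<p>Each input value yields exactly one indicator set to 1, and the number of new columns equals the number of distinct levels, which is why high-cardinality features widen the transformed dataset so much.</p>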

<h2>Ignore Features and One-Hot Encoding</h2>

In [7]:
# load dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from pycaret.classification import *
clf3 = setup(data = pokemon, target = 'Legendary', ignore_features = ['#', 'Name'])

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


Unnamed: 0,Description,Value
0,Session id,2723
1,Target,Legendary
2,Target type,Binary
3,Original data shape,"(800, 13)"
4,Transformed data shape,"(800, 45)"
5,Transformed train set shape,"(560, 45)"
6,Transformed test set shape,"(240, 45)"
7,Ignore features,2
8,Numeric features,8
9,Categorical features,2


In [8]:
print(f'Ordinal features: {clf3._fxs["Ordinal"]}')
print(f'Numeric features: {clf3._fxs["Numeric"]}')
print(f'Date features: {clf3._fxs["Date"]}')
print(f'Text features: {clf3._fxs["Text"]}')
print(f'Categorical features: {clf3._fxs["Categorical"]}')

Ordinal features: {}
Numeric features: ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation']
Date features: []
Text features: []
Categorical features: ['Type 1', 'Type 2']


In [9]:
clf3.dataset_transformed

Unnamed: 0,Type 1_Grass,Type 1_Fighting,Type 1_Ground,Type 1_Normal,Type 1_Water,Type 1_Steel,Type 1_Electric,Type 1_Ghost,Type 1_Bug,Type 1_Fire,...,Type 2_Normal,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
505,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,454.0,74.0,100.0,72.0,90.0,72.0,46.0,4.0,False
497,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,525.0,70.0,110.0,70.0,115.0,70.0,90.0,4.0,False
113,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,425.0,60.0,80.0,110.0,50.0,80.0,45.0,1.0,False
116,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,385.0,90.0,55.0,75.0,60.0,75.0,30.0,1.0,False
562,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,528.0,95.0,100.0,85.0,108.0,70.0,70.0,5.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
635,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,290.0,45.0,30.0,50.0,55.0,65.0,45.0,5.0,False
362,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,335.0,50.0,85.0,40.0,85.0,40.0,35.0,3.0,False
764,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,289.0,44.0,38.0,33.0,61.0,43.0,70.0,6.0,False
796,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,700.0,50.0,160.0,110.0,160.0,110.0,110.0,6.0,True


<H2>Ordinal Encoding</H2>
<p>When categorical features have an intrinsic natural order, such as Low, Medium, and High, they must be encoded differently from nominal variables (which have no intrinsic order, e.g. Male or Female). This can be achieved using the ordinal_features parameter in the setup function, which accepts a dictionary mapping feature names to their levels in increasing order from lowest to highest.</p>
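<p>The mapping that an ordered list of levels implies can be sketched as follows. This is a plain-Python illustration under the assumption that levels are numbered from 0 upward in the order given; the function <code>ordinal_encode</code> is hypothetical and not part of PyCaret.</p>

```python
def ordinal_encode(values, levels):
    """Map each value to its position in `levels` (lowest level -> 0)."""
    rank = {level: position for position, level in enumerate(levels)}
    return [rank[value] for value in values]


salary = ["low", "medium", "medium", "high", "low"]
salary_codes = ordinal_encode(salary, ["low", "medium", "high"])

print(salary_codes)  # [0, 1, 1, 2, 0]
```

<p>Unlike one-hot encoding, the result stays in a single column and preserves the ordering low &lt; medium &lt; high.</p>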

In [10]:
# load dataset
from pycaret.datasets import get_data
employee = get_data('employee')

# init setup
from pycaret.classification import *
clf4 = setup(data = employee, target = 'left', ordinal_features = {'salary' : ['low', 'medium', 'high']})

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,department,salary,left
0,0.38,0.53,2,157,3,0,0,sales,low,1
1,0.8,0.86,5,262,6,0,0,sales,medium,1
2,0.11,0.88,7,272,4,0,0,sales,medium,1
3,0.72,0.87,5,223,5,0,0,sales,low,1
4,0.37,0.52,2,159,3,0,0,sales,low,1


Unnamed: 0,Description,Value
0,Session id,6078
1,Target,left
2,Target type,Binary
3,Original data shape,"(14999, 10)"
4,Transformed data shape,"(14999, 21)"
5,Transformed train set shape,"(10499, 21)"
6,Transformed test set shape,"(4500, 21)"
7,Ordinal features,1
8,Numeric features,7
9,Categorical features,2


In [11]:
clf4.get_config('dataset_transformed')

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,department_sales,department_hr,department_technical,...,department_support,department_product_mng,department_marketing,department_IT,department_management,department_accounting,salary_1.0,salary_0.0,salary_2.0,left
12994,0.78,0.60,4.0,206.0,3.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
10716,0.98,0.91,3.0,165.0,2.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
12731,0.79,0.94,4.0,232.0,5.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
10147,0.55,0.55,5.0,256.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
2949,0.15,0.86,3.0,204.0,4.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1712,0.36,0.51,2.0,157.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
7086,0.57,0.91,4.0,252.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0
13199,0.32,0.86,4.0,266.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1854,0.43,0.53,2.0,147.0,3.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1


<H2>Target Imbalance</H2>
<p>When the training dataset has an unequal distribution of the target classes, this can be corrected using the fix_imbalance parameter in setup. When it is set to True, SMOTE (Synthetic Minority Over-sampling Technique) is used as the default resampling method. The resampling method can be changed using the fix_imbalance_method parameter in setup.</p>
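<p>The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration, not the implementation PyCaret relies on; the function <code>smote_sketch</code>, its parameters, and the toy points are all assumptions made for the example.</p>

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy two-feature data: six majority points, three minority points.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.0), (0.2, 0.4)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

# Oversample the minority class until both classes have the same size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))
```

<p>Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.</p>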

In [None]:
# If you encounter an error, run the following in your virtual environment.
# It will not work if you run it inside the notebook.

# pip install --upgrade threadpoolctl
# Restart the kernel after installing.

In [12]:
# load dataset
from pycaret.datasets import get_data
credit = get_data('credit')

# init setup
from pycaret.classification import *
clf5 = setup(data = credit, target = 'default', session_id = 123, fix_imbalance = True)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0


Unnamed: 0,Description,Value
0,Session id,123
1,Target,default
2,Target type,Binary
3,Original data shape,"(24000, 24)"
4,Transformed data shape,"(33372, 24)"
5,Transformed train set shape,"(26172, 24)"
6,Transformed test set shape,"(7200, 24)"
7,Numeric features,23
8,Preprocess,True
9,Imputation type,simple


In [13]:
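# Pull y_train from the experiment (the training target before resampling:
# 16800 rows here, versus 26172 rows in the transformed train set above)
y_train = clf5.get_config('y_train')

# Check distribution of classes
y_train.value_counts()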

0    13086
1     3714
Name: default, dtype: int64

<H2>Remove Outliers</H2>
<p>PyCaret can identify and remove outliers from the dataset before training the model; this is enabled through the remove_outliers parameter within setup. Outliers are identified through PCA linear dimensionality reduction using the Singular Value Decomposition technique. The proportion of outliers is controlled through the outliers_threshold parameter.</p>
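<p>How a proportion-style threshold translates into removed rows can be sketched as below. Note that this sketch does not use the PCA-based detection described above: it swaps in a much simpler scoring rule (distance from the column means in standard-deviation units) purely to show the "drop the most extreme fraction" mechanics. The function <code>remove_outliers_sketch</code>, its <code>threshold</code> argument, and the sample rows are assumptions for illustration.</p>

```python
import math
from statistics import mean, pstdev


def remove_outliers_sketch(rows, threshold=0.1):
    """Drop the `threshold` fraction of rows that lie farthest from the column
    means, measured in standard-deviation units (a simple stand-in score)."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def score(row):
        return math.sqrt(
            sum(((v - c) / s) ** 2 for v, c, s in zip(row, centers, spreads))
        )

    n_drop = round(threshold * len(rows))
    order = sorted(range(len(rows)), key=lambda i: score(rows[i]))  # most typical first
    kept = sorted(order[: len(rows) - n_drop])  # restore original row order
    return [rows[i] for i in kept]


# Illustrative (age, bmi) pairs; the last row is a deliberately extreme, made-up value.
rows = [
    (19, 27.9), (18, 33.8), (28, 33.0), (33, 22.7), (32, 28.9),
    (31, 25.7), (46, 33.4), (37, 27.7), (37, 29.8), (60, 95.0),
]

cleaned = remove_outliers_sketch(rows, threshold=0.1)
print(len(rows), "->", len(cleaned))
```

<p>With ten rows and a threshold of 0.1, exactly one row (the extreme one) is removed; a larger threshold would remove proportionally more rows regardless of how unusual they actually are.</p>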

In [15]:
# load dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# init setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', remove_outliers = True)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Unnamed: 0,Description,Value
0,Session id,7461
1,Target,charges
2,Target type,Regression
3,Original data shape,"(1338, 7)"
4,Transformed data shape,"(1291, 10)"
5,Transformed train set shape,"(889, 10)"
6,Transformed test set shape,"(402, 10)"
7,Ordinal features,2
8,Numeric features,3
9,Categorical features,3


In [16]:
reg1.get_config('dataset_transformed')

Unnamed: 0,age,sex,bmi,children,smoker,region_southeast,region_northwest,region_northeast,region_southwest,charges
2,28.0,1.0,33.000000,3.0,0.0,1.0,0.0,0.0,0.0,4449.461914
885,32.0,1.0,28.930000,1.0,1.0,1.0,0.0,0.0,0.0,19719.695312
534,64.0,1.0,40.480000,0.0,0.0,1.0,0.0,0.0,0.0,13831.115234
615,47.0,0.0,36.630001,1.0,1.0,1.0,0.0,0.0,0.0,42969.851562
475,61.0,1.0,28.309999,1.0,1.0,0.0,1.0,0.0,0.0,28868.664062
...,...,...,...,...,...,...,...,...,...,...
193,56.0,0.0,26.600000,1.0,0.0,0.0,1.0,0.0,0.0,12044.341797
518,35.0,0.0,31.000000,1.0,0.0,0.0,0.0,0.0,1.0,5240.765137
445,45.0,0.0,33.099998,0.0,0.0,0.0,0.0,0.0,1.0,7345.083984
246,60.0,0.0,38.060001,0.0,0.0,1.0,0.0,0.0,0.0,12648.703125
