# Dataset's Story

This dataset contains acoustic features extracted from three replicated voice recordings of the sustained /a/ phonation for each of the 80 subjects (40 of them with Parkinson's disease).

## Source links
https://archive-beta.ics.uci.edu/ml/datasets/parkinson+dataset+with+replicated+acoustic+features
https://archive.ics.uci.edu/ml/machine-learning-databases/00489/

## Explanation of Variables

* 1. __ID__: Subject's identifier.

* 2. __Recording__: Number of the recording. 

* 3. __Status__: 0=Healthy; 1=PD 

* 4. __Gender__: 0=Man; 1=Woman 

* 5. __Pitch local perturbation measures__: __Jitter_rel__: relative jitter, __Jitter_abs__: absolute jitter, __Jitter_RAP__: relative average perturbation, __Jitter_PPQ__: pitch perturbation quotient.

* 6. __Amplitude perturbation measures__: __Shim_loc__: local shimmer, __Shim_dB__: shimmer in dB, __Shim_APQ3__: 3-point amplitude perturbation quotient, __Shim_APQ5__: 5-point amplitude perturbation quotient, __Shim_APQ11__: 11-point amplitude perturbation quotient.

* 7. __Harmonic-to-noise ratio measures__: __HNR05__: harmonic-to-noise ratio in the frequency band 0-500 Hz, __HNR15__: in 0-1500 Hz, __HNR25__: in 0-2500 Hz, __HNR35__: in 0-3500 Hz, __HNR38__: in 0-3800 Hz.

* 8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (__MFCC0, MFCC1,..., MFCC12__) and their derivatives (__Delta0, Delta1,..., Delta12__). 

* 9. __RPDE__: Recurrence period density entropy. 

* 10. __DFA__: Detrended fluctuation analysis. 

* 11. __PPE__: Pitch period entropy. 

* 12. __GNE__: Glottal-to-noise excitation ratio.

## Additional Information

_Important remarks before using this dataset:_

__1.__ _Each row can not be used independently, because is one of the three replications of one individual. Nature of data is dependent for each subject, but independent from one to another subject. So, traditional technique from machine learning can not be applied to this dataset, because those techniques are based on the independent nature of the instances. There are 240 instances but for only 80 subjects, so they are not independent. Techniques as those presented in Naranjo et al. (2016), Naranjo et al. (2017) or other specifically designed can be used._

__2.__ _The concept of replication considered here does not match the classical concept of statistical repeated measurements. The term 'replications' refers to the collection of features extracted from voice recordings belonging to the same subject. Since, in this context, features are extracted from multiple consecutive voice recordings from the same subject, in principle, the features should be identical. The imperfections in technology and the own biological variability result in non-identical replicated features that are more similar to one another than features from different subjects._

__3.__ _All information about how the dataset was generated is presented in Naranjo et al. (2016)._

_Relevant Papers:_

_Naranjo, L., Pérez, C.J., Campos-Roca, Y., Martín, J.: Addressing voice recording replications for Parkinson's disease detection. Expert Systems With Applications 46, 286-292 (2016)_
https://pubmed.ncbi.nlm.nih.gov/27209185/

_Naranjo, L., Pérez, C.J., Martín, J., Campos-Roca, Y.: A two-stage variable selection and classification approach for Parkinson's disease detection by using voice recording replications. Computer Methods and Programs in Biomedicine 142, 147-156 (2017)_
https://pubmed.ncbi.nlm.nih.gov/28325442/
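
Remark 1 above implies that the three recordings of a subject should not end up on both sides of a train/test split. A minimal sketch of a subject-wise split, assuming scikit-learn's `GroupShuffleSplit`. The `toy` frame, its values, the subject labels and the 25% test size are illustrative assumptions, not part of the dataset and not the method of Naranjo et al.:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the real data: 8 hypothetical subjects, 3 recordings each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ID": np.repeat([f"S{i:02d}" for i in range(8)], 3),
    "Status": np.repeat([0, 1, 0, 1, 0, 1, 0, 1], 3),
    "Jitter_rel": rng.normal(0.5, 0.1, size=24),
})

# Split on subjects, not on rows, so all recordings of a subject stay together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, toy["Status"], groups=toy["ID"]))

train_subjects = set(toy.iloc[train_idx]["ID"])
test_subjects = set(toy.iloc[test_idx]["ID"])
print(sorted(test_subjects))
```

This only keeps subjects from leaking between the two sets; it does not model the within-subject dependence the way the replication-aware approaches cited above do.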


## Data Scientist's Notes

### A Brief Overview of the Variables for Readers Unfamiliar with the Field

__ID:__ 

_The ID is a number assigned to the identity of the subjects._

However, since there are three separate recordings for each subject, each subject appears in the data as three separate rows. So it looks like I will have to do an additional operation on the ID. Based on the following explanation, quoted from the Additional Information section above, I can group the rows by ID and reduce each subject's recordings to a single record.

_"Each row can not be used independently, because is one of the three replications of one individual. Nature of data is dependent for each subject, but independent from one to another subject."_
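
The grouping idea can be sketched with pandas as follows. This is only an illustration under assumptions: the `toy` frame, the subject labels `SUBJ-A` and `SUBJ-B`, and the choice of the mean as the summary are mine, and only one acoustic column is shown.

```python
import pandas as pd

# Toy stand-in: two hypothetical subjects with three recordings each.
toy = pd.DataFrame({
    "ID": ["SUBJ-A"] * 3 + ["SUBJ-B"] * 3,
    "Recording": [1, 2, 3, 1, 2, 3],
    "Status": [0, 0, 0, 1, 1, 1],
    "Gender": [1, 1, 1, 0, 0, 0],
    "Jitter_rel": [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
})

# One row per subject: Status and Gender are constant within a subject,
# so "first" keeps them; the acoustic feature is averaged over recordings.
per_subject = (
    toy.drop(columns="Recording")
       .groupby("ID", as_index=False)
       .agg({"Status": "first", "Gender": "first", "Jitter_rel": "mean"})
)
print(per_subject)
```

Averaging removes the dependence between rows at the cost of discarding the within-subject variability, so it is a trade-off rather than a free fix.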

__Recording:__ 

_Number of the recording of the subjects_

__Pitch local perturbation measures__:

__Jitter_rel__: relative jitter,

__Jitter_abs__: absolute jitter,

__Jitter_RAP__: relative average perturbation,

__Jitter_PPQ__: pitch perturbation quotient

AND

__Amplitude perturbation measures__: 

__Shim_loc__: local shimmer, 

__Shim_dB__: shimmer in dB, 

__Shim_APQ3__: 3-point amplitude perturbation quotient, 

__Shim_APQ5__: 5-point amplitude perturbation quotient, 

__Shim_APQ11__: 11-point amplitude perturbation quotient


For the methods of measurements:

https://www.fon.hum.uva.nl/praat/manual/Voice_2__Jitter.html

https://www.fon.hum.uva.nl/praat/manual/Voice_3__Shimmer.html
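
To make the definitions in the Praat manual pages more concrete, here is a rough NumPy sketch of a few of these measures, computed from a sequence of glottal period lengths and peak amplitudes. The function names and the input values are my own illustrative assumptions, and I do not claim this matches the exact settings or scaling used to produce the dataset columns.

```python
import numpy as np

def jitter_abs(periods):
    """Mean absolute difference between consecutive periods (in seconds)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))

def jitter_local(periods):
    """Absolute jitter divided by the mean period (a ratio, not a percentage)."""
    periods = np.asarray(periods, dtype=float)
    return jitter_abs(periods) / np.mean(periods)

def jitter_rap(periods):
    """Relative average perturbation: deviation of each period from the mean
    of itself and its two neighbours, divided by the mean period."""
    periods = np.asarray(periods, dtype=float)
    local_mean = (periods[:-2] + periods[1:-1] + periods[2:]) / 3.0
    return np.mean(np.abs(periods[1:-1] - local_mean)) / np.mean(periods)

def shimmer_local(amplitudes):
    """Mean absolute difference between consecutive amplitudes over the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

def shimmer_db(amplitudes):
    """Mean absolute log-ratio of consecutive amplitudes, in dB."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(amplitudes[1:] / amplitudes[:-1])))

# Made-up period lengths (seconds) and peak amplitudes, for illustration only.
periods = [0.010, 0.012, 0.010, 0.012]
amplitudes = [1.0, 0.9, 1.0, 0.9]
print(jitter_abs(periods), jitter_local(periods), jitter_rap(periods))
print(shimmer_local(amplitudes), shimmer_db(amplitudes))
```

A perfectly periodic signal with constant amplitude gives zero for all five functions, which matches the intuition that these are perturbation measures.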

For more detailed information:

"Jitter and shimmer are measures of the cycle-to-cycle variations
of fundamental frequency and amplitude, respectively, which
have been largely used for the description of pathological voice
quality. Since they characterise some aspects concerning
particular voices, it is a priori expected to find differences in the values of jitter and shimmer among speakers. In this paper,
several types of jitter and shimmer measurements have been
analysed. Experiments performed with the Switchboard-I
conversational speech database show that jitter and shimmer
measurements give excellent results in speaker verification as
complementary features of spectral and prosodic parameters." [1]

"Jitter is noise in the temporal or timing domain.
Yes, it really is that simple. Normally though we apply the term to electrical or optical signals we can measure with oscilloscopes or other time measurement equipment.
We can think of jitter two ways.
• As an instantaneous effect: This one edge isn’t where I wanted it to be.
• As an accumulation of effects: in this series of edges, each edge is displaced an equal amount,
and the last edge shows the sum of the displacement times. "[2]

"The most important vocal acoustic parameters for
clinical use are measurements of noise, vocal extension
profile, acoustic spectrography, fundamental frequency and
perturbation index - jitter and shimmer.
According to Behlau et al. fundamental frequency
is determined physiologically by the number of cycles that
the vocal folds make in a second, and they are the natural
result of the length of these structures.
Jitter and shimmer represent the variations that occur
in the fundamental frequency. Whereas jitter indicates the
variability or perturbation of fundamental frequency, shimmer
refers to the same perturbation, but it is related to amplitude of sound wave, or intensity of vocal emission. Jitter is
affected mainly because of lack of control of vocal fold
vibration and shimmer with reduction of glottic resistance
and mass lesions in the vocal folds, which are related with
presence of noise at emission and breathiness." [3]

[1] https://nlp.lsi.upc.edu/papers/far_jit_07.pdf

[2] http://anlage.umd.edu/Microwave%20Measurements%20for%20Personal%20Web%20Site/Tek%20Intro%20to%20Jitter%2061W_18897_1.pdf

[3] https://www.scielo.br/j/rboto/a/jfYLfsybBtsWkfrnS5ZHhNP/?format=pdf&lang=en


__Note:__ _Based on my research and visualizations of the dataset, it seems that a single variable can represent both the Pitch Local Perturbation Measures and the Amplitude Perturbation Measures groups. According to the following visualizations, this variable is Jitter_rel (relative jitter).
So I am thinking of eliminating the other variables in these groups._
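
One simple way to check this kind of redundancy numerically is a correlation matrix. The sketch below uses a synthetic frame that is deliberately built so that two columns track `Jitter_rel`; it says nothing about the real data, and the 0.9 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns are noisy copies of the first, one is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "Jitter_rel": base,
    "Jitter_RAP": base + rng.normal(scale=0.05, size=200),
    "Shim_loc": base + rng.normal(scale=0.05, size=200),
    "unrelated": rng.normal(size=200),
})

# Absolute Pearson correlation of every column with Jitter_rel.
corr_with_ref = toy.corr().abs()["Jitter_rel"].drop("Jitter_rel")

# Columns that Jitter_rel could plausibly stand in for (arbitrary cut-off).
redundant = corr_with_ref[corr_with_ref > 0.9].index.tolist()
print(redundant)
```

On the real dataset the same two lines would be applied to the jitter and shimmer columns of `df`.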


__HNR Values:__

Harmonic-to-noise ratio (HNR): the ratio of the energy in the fundamental frequency and its harmonics to the energy in the noise. Its unit is dB, and high values indicate that the noise level is low relative to the harmonic component. This parameter is not measured by MDVP, but it can be measured with Praat and Dr. Speech Vocal Assessment (Tiger DRS, Inc.).

"Harmonic to Noise Ratio (HNR) measures the ratio between periodic and non-periodic components of a speech sound. It has become more and more important in the vocal acoustic analysis to diagnose pathologic voices. The measure of this parameter can be done with Praat software that is commonly accept by the scientific community has an accurate measure." [4]

These ratio variables take much larger values than the other variables, so I can apply a standardization process.

[4]https://www.sciencedirect.com/science/article/pii/S1877050918316739#:~:text=measures%20the%20ratio,an%20accurate%20measure.
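
As a small illustration of what standardization does to columns on very different scales, here is a pandas-only sketch. The values are rounded from the first four rows of the dataset preview; the `toy` name is an assumption for the example.

```python
import pandas as pd

# HNR05 is in the tens, Jitter_rel is below 1: very different scales.
toy = pd.DataFrame({
    "HNR05": [59.4, 59.8, 57.3, 62.2],
    "Jitter_rel": [0.255, 0.370, 0.235, 0.293],
})

# z-score each column: subtract the column mean and divide by the population
# standard deviation (ddof=0), the same convention sklearn.preprocessing.scale uses.
standardized = (toy - toy.mean()) / toy.std(ddof=0)
print(standardized.round(3))
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates a distance-based method just because of its unit.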

__MFCC Values:__ 

_Mel frequency cepstral coefficient_

Mel-frequency cepstrum[5][6] coefficients, known as MFCCs for short, are a Fourier-based transformation[7] used for feature extraction in applications such as speech recognition and speaker recognition. In short, they allow us to extract information from audio data that characterizes it.

The variables MFCC0, MFCC1,..., MFCC12 hold the coefficients of order 0 to 12 extracted from each subject's voice recording. So it looks like I will have to do a new operation on these variables.

[5] https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

[6] https://en.wikipedia.org/wiki/Mel_scale

[7] https://en.wikipedia.org/wiki/Fourier_transform
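
The "mel" part of MFCC refers to a perceptual frequency scale. A minimal sketch of one commonly used Hz-to-mel conversion (the 2595 and 700 constants are from that common formula; other variants exist, and the 0-4000 Hz range and the 13 points are arbitrary choices for the example):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels using a common logarithmic formula."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# 13 points evenly spaced on the mel scale between 0 Hz and 4000 Hz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 13))
print(edges_hz.round(1))
```

Points that are evenly spaced in mels get further apart in Hz as frequency rises, which is why mel filter banks are dense at low frequencies and sparse at high ones.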

__Delta Values:__

Delta variables (named Delta0, Delta1,..., Delta12) are the derivatives of the MFCC values. I will probably apply the same operations to this variable group as I do to the MFCC group.
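
For intuition, delta coefficients are usually obtained as a smoothed time derivative of a coefficient track. The sketch below implements the common regression formula over a window of ±`width` frames; the function name, the window width and the edge padding are my assumptions, and the dataset's Delta columns may have been computed differently (see Naranjo et al. (2016)).

```python
import numpy as np

def delta(coeffs, width=2):
    """Regression-based delta along axis 0 (time); edges are padded by repetition."""
    coeffs = np.asarray(coeffs, dtype=float)
    pad = [(width, width)] + [(0, 0)] * (coeffs.ndim - 1)
    padded = np.pad(coeffs, pad, mode="edge")
    n_frames = coeffs.shape[0]
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(coeffs)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # c[t + n]
        behind = padded[width - n : width - n + n_frames]   # c[t - n]
        out += n * (ahead - behind)
    return out / denom

# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = np.arange(10.0)
print(delta(ramp))
```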


__DFA Value:__

"In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical self-affinity of a signal. It is useful for analysing time series that appear to be long-memory processes (diverging correlation time, e.g. power-law decaying autocorrelation function) or 1/f noise.

The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform.

Peng et al. introduced DFA in 1994 in a paper that has been cited over 3,000 times as of 2020 and represents an extension of the (ordinary) fluctuation analysis (FA), which is affected by non-stationarities." [8]

[8]https://en.wikipedia.org/wiki/Detrended_fluctuation_analysis
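
To make the description above more tangible, here is a bare-bones first-order DFA sketch in NumPy: integrate the signal, remove a linear trend in windows of several sizes, and read the scaling exponent off a log-log fit. The function name, the window sizes and the test signals are assumptions for illustration; the DFA column in the dataset may be computed or normalized differently.

```python
import numpy as np

def dfa_exponent(signal, window_sizes):
    """Estimate the DFA scaling exponent with linear detrending."""
    signal = np.asarray(signal, dtype=float)
    profile = np.cumsum(signal - signal.mean())      # integrated series
    fluctuations = []
    for n in window_sizes:
        n_windows = len(profile) // n
        segments = profile[: n_windows * n].reshape(n_windows, n).T  # (n, n_windows)
        t = np.arange(n)
        coeffs = np.polyfit(t, segments, deg=1)       # one line per window
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        fluctuations.append(np.sqrt(np.mean((segments - trend) ** 2)))
    slope, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), deg=1)
    return slope

rng = np.random.default_rng(0)
white_noise = rng.normal(size=4096)
sizes = [16, 32, 64, 128, 256]
alpha_noise = dfa_exponent(white_noise, sizes)
alpha_walk = dfa_exponent(np.cumsum(white_noise), sizes)
print(alpha_noise, alpha_walk)
```

Uncorrelated noise should give an exponent near 0.5 and a random walk one near 1.5, which is a quick sanity check for any DFA implementation.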

__RPDE:__

Recurrence period density entropy

"Recurrence period density entropy (RPDE) is a method, in the fields of dynamical systems, stochastic processes, and time series analysis, for determining the periodicity, or repetitiveness of a signal."[9]

[9]https://en.wikipedia.org/wiki/Recurrence_period_density_entropy

__PPE Value__:

_Pitch Period Entropy_

The research quoted below introduces a new measure of dysphonia, called Pitch Period Entropy (PPE), as a new method for identifying people with Parkinson's disease.

As I understand it, many different voice measurement methods are used in the dataset. It seems I will have to build a model with which I can determine which values work most efficiently together or separately.

"We present an assessment of the practical value of existing traditional and non-standard measures for discriminating healthy people from people with Parkinson's disease (PD) by detecting dysphonia. We introduce a new measure of dysphonia, Pitch Period Entropy (PPE), which is robust to many uncontrollable confounding effects including noisy acoustic environments and normal, healthy variations in voice frequency. We collected sustained phonations from 31 people, 23 with PD. We then selected 10 highly uncorrelated measures, and an exhaustive search of all possible combinations of these measures finds four that in combination lead to overall correct classification performance of 91.4%, using a kernel support vector machine. In conclusion, we find that non-standard methods in combination with traditional harmonics-to-noise ratios are best able to separate healthy from PD subjects. The selected non-standard methods are robust to many uncontrollable variations in acoustic environment and individual subjects, and are thus well-suited to telemonitoring applications."[10]

[10]https://www.researchgate.net/publication/50377363_Suitability_of_Dysphonia_Measurements_for_Telemonitoring_of_Parkinson's_Disease

 __GNE Value__:
 
 _Glottal-to-noise excitation ratio_
 
 As I understand from the explanation below, GNE is a separate acoustic parameter that its authors compare against two existing noise measures, HNR and NNE. I want to compare it with the Status and Gender variables and see the results. However, I have doubts about its necessity in the model, since GNE and HNR (which is in the dataset; NNE is not) are both sensitive to additive noise and may carry overlapping information. After all, I may be able to eliminate one of the two.

"In this article a new acoustic parameter for the objective description of voice quality is introduced. It is based on
the correlation coefficient for Hilbert envelopes of different frequency bands. The parameter indicates whether a
given voice signal originates from vibrations of the vocal folds or from turbulent noise generated in the vocal tract
and is thus related to (but not a direct measure of) breathiness. Therefore it is named Glottal-to-Noise Excitation
Ratio (GNE Ratio). GNE is compared to HNR (Harmonics-to-Noise Ratio) and NNE (Normalized Noise Energy),
existing measures also sensitive to additive noise (turbulence). Experiments with artificial signals show that only
the GNE is almost independent of frequency modulation noise (jitter) and amplitude modulation noise (shimmer)."[11]

[11] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.549.2511&rep=rep1&type=pdf#:~:text=The%20parameter%20indicates%20whether%20a,Excitation%20Ratio%20(GNE%20Ratio)

## Result

__It was mentioned above that traditional machine learning methods cannot be applied to this dataset. However, a review of the papers by Naranjo et al. showed that they used Bayesian regression models.[12] In addition, my literature search turned up the paper "Gradient boosting for Parkinson's disease diagnosis from voice recordings", which studied this dataset and reported the best results with LGB.[13]__

__Therefore, I wanted to study this dataset myself and evaluate the results.__

__Results__: _Machine learning models such as Random Forest, Gradient Boosting Machine, XGBoost, CatBoost, LightGBM and Naive Bayes were tried, and the best result, 0.8472, was obtained with the CatBoost model._

[12] https://pubmed.ncbi.nlm.nih.gov/27209185/

[13] https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-01250-7

# Libraries

In [3]:
import numpy as np
import pandas as pd 
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

from sklearn import preprocessing 
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor
from catboost import CatBoostClassifier


# Dataset

In [4]:
df = pd.read_csv("ReplicatedAcousticFeatures-ParkinsonDatabase.csv")

In [5]:
df.head()

Unnamed: 0,ID,Recording,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,CONT-01,1,0,1,0.25546,1.5e-05,0.001467,0.001673,0.030256,0.26313,...,1.407701,1.417218,1.380352,1.42067,1.45124,1.440295,1.403678,1.405495,1.416705,1.35461
1,CONT-01,2,0,1,0.36964,2.2e-05,0.001932,0.002245,0.023146,0.20217,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.32287,1.314549,1.318999,1.323508
2,CONT-01,3,0,1,0.23514,1.3e-05,0.001353,0.001546,0.019338,0.1671,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.38891,1.305469,1.305402
3,CONT-02,1,0,0,0.2932,1.7e-05,0.001105,0.001444,0.024716,0.20892,...,1.5012,1.53417,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,CONT-02,2,0,0,0.23075,1.5e-05,0.001073,0.001404,0.013119,0.11607,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023


In [6]:
df.shape

(240, 48)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 48 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          240 non-null    object 
 1   Recording   240 non-null    int64  
 2   Status      240 non-null    int64  
 3   Gender      240 non-null    int64  
 4   Jitter_rel  240 non-null    float64
 5   Jitter_abs  240 non-null    float64
 6   Jitter_RAP  240 non-null    float64
 7   Jitter_PPQ  240 non-null    float64
 8   Shim_loc    240 non-null    float64
 9   Shim_dB     240 non-null    float64
 10  Shim_APQ3   240 non-null    float64
 11  Shim_APQ5   240 non-null    float64
 12  Shi_APQ11   240 non-null    float64
 13  HNR05       240 non-null    float64
 14  HNR15       240 non-null    float64
 15  HNR25       240 non-null    float64
 16  HNR35       240 non-null    float64
 17  HNR38       240 non-null    float64
 18  RPDE        240 non-null    float64
 19  DFA         240 non-null    f

In [8]:
df = df.drop(["ID","Recording"],axis=1)

In [9]:
df.head()

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0,1,0.25546,1.5e-05,0.001467,0.001673,0.030256,0.26313,0.017463,0.01966,...,1.407701,1.417218,1.380352,1.42067,1.45124,1.440295,1.403678,1.405495,1.416705,1.35461
1,0,1,0.36964,2.2e-05,0.001932,0.002245,0.023146,0.20217,0.01301,0.014097,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.32287,1.314549,1.318999,1.323508
2,0,1,0.23514,1.3e-05,0.001353,0.001546,0.019338,0.1671,0.011049,0.012683,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.38891,1.305469,1.305402
3,0,0,0.2932,1.7e-05,0.001105,0.001444,0.024716,0.20892,0.014525,0.015696,...,1.5012,1.53417,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,0,0,0.23075,1.5e-05,0.001073,0.001404,0.013119,0.11607,0.006461,0.008385,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023


# N/A Values

In [10]:
df.isnull().sum()

Status        0
Gender        0
Jitter_rel    0
Jitter_abs    0
Jitter_RAP    0
Jitter_PPQ    0
Shim_loc      0
Shim_dB       0
Shim_APQ3     0
Shim_APQ5     0
Shi_APQ11     0
HNR05         0
HNR15         0
HNR25         0
HNR35         0
HNR38         0
RPDE          0
DFA           0
PPE           0
GNE           0
MFCC0         0
MFCC1         0
MFCC2         0
MFCC3         0
MFCC4         0
MFCC5         0
MFCC6         0
MFCC7         0
MFCC8         0
MFCC9         0
MFCC10        0
MFCC11        0
MFCC12        0
Delta0        0
Delta1        0
Delta2        0
Delta3        0
Delta4        0
Delta5        0
Delta6        0
Delta7        0
Delta8        0
Delta9        0
Delta10       0
Delta11       0
Delta12       0
dtype: int64

# Descriptive Statistics

In [11]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Status,240.0,0.5,0.501045,0.0,0.0,0.5,1.0,1.0
Gender,240.0,0.4,0.490922,0.0,0.0,0.0,1.0,1.0
Jitter_rel,240.0,0.583987,0.535769,0.14801,0.29826,0.481455,0.681685,6.8382
Jitter_abs,240.0,4.4e-05,4.5e-05,7e-06,1.9e-05,3.5e-05,5.6e-05,0.00055
Jitter_RAP,240.0,0.003172,0.003373,0.000678,0.001551,0.002337,0.003678,0.043843
Jitter_PPQ,240.0,0.003532,0.004449,0.001036,0.001867,0.00287,0.003991,0.065199
Shim_loc,240.0,0.038428,0.023213,0.007444,0.024336,0.03296,0.045475,0.1926
Shim_dB,240.0,0.336832,0.205905,0.064989,0.211785,0.287885,0.39986,1.7476
Shim_APQ3,240.0,0.021499,0.013787,0.003344,0.01291,0.018571,0.025784,0.11324
Shim_APQ5,240.0,0.023468,0.014402,0.004103,0.014985,0.019897,0.0279,0.12076


In [12]:
df["Status"].value_counts()

0    120
1    120
Name: Status, dtype: int64

In [13]:
df.groupby("Status").mean()

Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0.45,0.482186,4e-05,0.002486,0.002787,0.032928,0.287586,0.018054,0.019978,0.025368,...,1.453583,1.460161,1.451821,1.446357,1.456511,1.444688,1.453727,1.441393,1.472436,1.457309
1,0.35,0.685788,4.9e-05,0.003858,0.004277,0.043928,0.386079,0.024945,0.026959,0.031974,...,1.232892,1.237667,1.223613,1.237214,1.225279,1.243267,1.229217,1.221474,1.220326,1.234979


# Outlier Observations

In [14]:
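# Local Outlier Factor with 20 neighbours; contamination = 0.1 flags about 10% of the rows as outliers
lof = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)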

In [15]:
lof.fit_predict(df)

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,
       -1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1, -1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

In [16]:
# negative_outlier_factor_: the more negative the score, the more abnormal the row
df_scores = lof.negative_outlier_factor_

In [17]:
df_scores[0:10]

array([-1.02912794, -0.98750191, -1.00216538, -0.98575286, -0.96651051,
       -0.98973178, -1.06220613, -1.02159179, -1.03745719, -1.03454109])

In [18]:
np.sort(df_scores)[:20]

array([-3.31808704, -1.90343024, -1.45254902, -1.31758157, -1.28700747,
       -1.2847338 , -1.23120414, -1.22822043, -1.22673033, -1.20360902,
       -1.18836181, -1.18524265, -1.14720974, -1.14146916, -1.14140184,
       -1.13732771, -1.13405632, -1.13119751, -1.12248582, -1.12165276])

In [19]:
# Cut-off: the 12th-lowest LOF score (index 11 of the sorted scores)
threshold_value = np.sort(df_scores)[11]

In [20]:
df[df_scores == threshold_value]

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
175,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059


In [21]:
swamping_value = df[df_scores == threshold_value]

In [22]:
outliers = df[df_scores < threshold_value]

In [23]:
outliers

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
12,0,0,0.78955,0.000103,0.004238,0.004574,0.039162,0.33289,0.021275,0.024791,...,1.346048,1.221651,1.441105,1.518406,1.5843,1.628549,1.385665,1.426097,1.515076,1.305066
35,0,0,1.1401,0.000105,0.006396,0.00621,0.041771,0.37992,0.022701,0.024203,...,1.359173,1.418337,1.453293,1.223665,1.437905,1.179541,1.457774,1.133481,1.522927,1.177012
43,0,0,0.32578,3.7e-05,0.001477,0.002297,0.03109,0.26649,0.018798,0.017405,...,1.466801,1.473792,1.448296,1.542649,1.497951,1.405354,1.344423,1.376743,1.459605,1.373816
81,0,1,0.14841,8e-06,0.000678,0.001036,0.008117,0.070521,0.003344,0.004103,...,1.762194,1.655392,1.785984,1.483136,1.518154,1.426148,1.400958,1.343733,1.370644,1.782916
88,0,1,0.8365,4.5e-05,0.005276,0.004068,0.03774,0.3467,0.021721,0.019687,...,1.623247,1.513092,1.660356,1.526017,1.382985,1.698539,1.405085,1.621746,1.522796,1.720484
92,0,1,0.4664,3.5e-05,0.002469,0.00287,0.014603,0.12666,0.006765,0.008213,...,1.774729,1.745344,1.708155,1.597912,1.738612,1.736856,1.752343,1.644977,1.713689,1.5651
130,1,0,0.47726,3.1e-05,0.002798,0.002665,0.038216,0.34316,0.021891,0.020847,...,1.535776,1.408177,1.70683,1.526697,1.597078,1.566296,1.45996,1.46473,1.603044,1.34622
157,1,1,0.91168,4.8e-05,0.005096,0.005412,0.062342,0.54934,0.03594,0.038618,...,0.963214,0.951079,1.005077,0.975537,0.764649,0.853069,1.026538,0.978925,1.00879,1.077665
160,1,0,0.40853,2.6e-05,0.001713,0.002134,0.020213,0.17457,0.011215,0.011996,...,1.118683,1.411682,1.035295,1.27492,1.153357,1.137231,1.166626,1.072314,1.181294,1.069168
184,1,0,0.20582,1.1e-05,0.001074,0.001391,0.027439,0.23685,0.014362,0.016783,...,1.263249,1.461548,1.32073,1.512891,1.210139,1.37435,1.460985,1.557161,1.360821,1.327815


In [24]:
outliers.to_records(index = False)

rec.array([(0, 0, 0.78955, 1.0252e-04, 0.0042382, 0.0045742, 0.039162 , 0.33289 , 0.021275 , 0.024791 , 0.034528 ,  78.97295802,  92.21333242, 105.3673573 , 113.153833  , 114.8115302 , 0.31925708, 0.76688914, 0.25655255, 0.91278195, 1.37348074, 1.25122642, 1.54013227, 1.552164  , 1.74778509, 1.68654725, 1.30260771, 1.35166611, 1.35186838, 1.26037926, 1.48355262, 1.498432  , 1.56718613, 1.64082686, 1.33433938, 1.39778336, 1.34604764, 1.22165111, 1.44110482, 1.5184062 , 1.58430022, 1.6285486 , 1.38566503, 1.42609698, 1.51507584, 1.30506559),
           (0, 0, 1.1401 , 1.0525e-04, 0.0063958, 0.00621  , 0.041771 , 0.37992 , 0.022701 , 0.024203 , 0.031866 , 101.2063256 , 109.6511166 , 120.7128299 , 128.2893253 , 129.9852356 , 0.38023876, 0.76022866, 0.39671166, 0.89446038, 1.56586574, 1.0382172 , 0.90051677, 1.51898447, 1.55129465, 1.5725921 , 1.61628588, 1.57490698, 1.5052214 , 1.47396252, 1.53493523, 1.53078748, 1.54358729, 1.44538039, 1.31570698, 1.44562935, 1.35917303, 1.41833725, 1.453

In [25]:
record = outliers.to_records(index = False)

In [26]:
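# Fill every outlier record with the threshold row's values (all columns, Status and Gender included)
record[:] = swamping_value.to_records(index = False)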

In [27]:
record

rec.array([(1, 1, 0.20278, 8.51e-06, 0.0013118, 0.0014976, 0.02233, 0.19321, 0.012544, 0.01302, 0.016276, 50.18104333, 46.44235068, 52.32971005, 55.45250641, 56.68659046, 0.17888125, 0.50186771, 0.00412669, 0.95157105, 1.1585117, 1.16612817, 1.10017139, 1.18644585, 1.06420159, 1.17316096, 1.10681619, 1.15286715, 1.16220314, 1.13629824, 1.25637395, 1.20490762, 1.16852665, 1.11866728, 1.17323651, 1.07349758, 1.17029407, 1.06968594, 1.13797063, 1.11592708, 1.1185114, 1.14926576, 1.09002635, 1.11623541, 1.17086463, 1.09605934),
           (1, 1, 0.20278, 8.51e-06, 0.0013118, 0.0014976, 0.02233, 0.19321, 0.012544, 0.01302, 0.016276, 50.18104333, 46.44235068, 52.32971005, 55.45250641, 56.68659046, 0.17888125, 0.50186771, 0.00412669, 0.95157105, 1.1585117, 1.16612817, 1.10017139, 1.18644585, 1.06420159, 1.17316096, 1.10681619, 1.15286715, 1.16220314, 1.13629824, 1.25637395, 1.20490762, 1.16852665, 1.11866728, 1.17323651, 1.07349758, 1.17029407, 1.06968594, 1.13797063, 1.11592708, 1.1185114, 1

In [28]:
df[df_scores < threshold_value] = pd.DataFrame(record, index = df[df_scores < threshold_value].index) 

In [29]:
df[df_scores <= threshold_value]

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
12,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
35,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
43,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
81,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
88,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
92,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
130,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
157,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
160,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
175,1,1,0.20278,9e-06,0.001312,0.001498,0.02233,0.19321,0.012544,0.01302,...,1.170294,1.069686,1.137971,1.115927,1.118511,1.149266,1.090026,1.116235,1.170865,1.096059
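
In the output above, rows such as 12 and 35, which had Status 0 and Gender 0 before the replacement, now carry Status 1 and Gender 1, because the whole threshold row was copied over them. A hedged alternative sketch that overwrites only the acoustic feature columns and leaves the labels untouched. The `toy` frame, the made-up scores and the `feature_cols` name are illustrative assumptions, not the notebook's data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Status": [0, 0, 1, 1],
    "Gender": [0, 1, 0, 1],
    "Jitter_rel": [0.25, 5.00, 0.30, 0.20],
    "Shim_loc": [0.03, 0.40, 0.02, 0.02],
})
scores = np.array([-1.0, -3.3, -1.1, -1.2])   # made-up LOF-style scores

threshold_value = np.sort(scores)[1]           # second-lowest score as cut-off
threshold_row = toy[scores == threshold_value].iloc[0]
is_outlier = scores < threshold_value

# Replace only the feature columns; Status and Gender keep their original values.
feature_cols = toy.columns.difference(["Status", "Gender"])
for col in feature_cols:
    toy.loc[is_outlier, col] = threshold_row[col]
print(toy)
```

Whether suppressing outliers toward a threshold row is appropriate at all is a separate modelling decision; the sketch only shows how to avoid changing the class labels while doing it.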


In [30]:
df.head(14)

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0,1,0.25546,1.5e-05,0.001467,0.001673,0.030256,0.26313,0.017463,0.01966,...,1.407701,1.417218,1.380352,1.42067,1.45124,1.440295,1.403678,1.405495,1.416705,1.35461
1,0,1,0.36964,2.2e-05,0.001932,0.002245,0.023146,0.20217,0.01301,0.014097,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.32287,1.314549,1.318999,1.323508
2,0,1,0.23514,1.3e-05,0.001353,0.001546,0.019338,0.1671,0.011049,0.012683,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.38891,1.305469,1.305402
3,0,0,0.2932,1.7e-05,0.001105,0.001444,0.024716,0.20892,0.014525,0.015696,...,1.5012,1.53417,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,0,0,0.23075,1.5e-05,0.001073,0.001404,0.013119,0.11607,0.006461,0.008385,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023
5,0,0,0.16489,1e-05,0.000819,0.001191,0.010666,0.094738,0.005518,0.006785,...,1.480657,1.675417,1.37346,1.709614,1.444187,1.383488,1.625396,1.651655,1.652845,1.427623
6,0,1,0.22506,1.4e-05,0.001358,0.00146,0.017181,0.14812,0.009609,0.01106,...,1.712147,1.419443,1.501822,1.503534,1.486685,1.648505,1.345959,1.741863,1.828781,1.655604
7,0,1,0.23086,1.5e-05,0.001349,0.001546,0.017775,0.1578,0.009262,0.011683,...,1.535326,1.627976,1.332839,1.25456,1.598743,1.297679,1.526714,1.64791,1.662981,1.609652
8,0,1,0.22898,1.5e-05,0.001375,0.001607,0.02011,0.17577,0.010571,0.013321,...,1.620783,1.431508,1.598949,1.394543,1.45937,1.313012,1.44747,1.354798,1.585025,1.334293
9,0,1,1.31,0.000103,0.008245,0.00628,0.030742,0.27064,0.01859,0.016261,...,1.54101,1.347021,1.526148,1.428505,1.51613,1.491684,1.579521,1.374581,1.550638,1.572821


# Data Standardization

In [31]:
cat_value_drop = df.drop(["Status","Gender"], axis = 1)
cat_values = df[["Status","Gender"]]

In [32]:
cat_value_drop

Unnamed: 0,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,HNR05,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0.25546,0.000015,0.001467,0.001673,0.030256,0.26313,0.017463,0.019660,0.021882,59.437966,...,1.407701,1.417218,1.380352,1.420670,1.451240,1.440295,1.403678,1.405495,1.416705,1.354610
1,0.36964,0.000022,0.001932,0.002245,0.023146,0.20217,0.013010,0.014097,0.016828,59.838895,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.322870,1.314549,1.318999,1.323508
2,0.23514,0.000013,0.001353,0.001546,0.019338,0.16710,0.011049,0.012683,0.013038,57.293808,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.388910,1.305469,1.305402
3,0.29320,0.000017,0.001105,0.001444,0.024716,0.20892,0.014525,0.015696,0.018330,62.179573,...,1.501200,1.534170,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,0.23075,0.000015,0.001073,0.001404,0.013119,0.11607,0.006461,0.008385,0.011037,67.534024,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,0.57585,0.000037,0.003701,0.005149,0.016868,0.14928,0.008835,0.010232,0.015297,28.530790,...,1.116409,1.104511,1.099866,1.080320,1.154057,1.117423,1.167076,1.132436,1.107824,1.109144
236,0.23322,0.000015,0.001270,0.001497,0.017923,0.16720,0.008436,0.011578,0.015473,33.617211,...,1.107477,1.083859,1.101819,1.114161,1.090095,1.140705,1.126667,1.158444,1.096073,1.141835
237,0.26862,0.000022,0.001354,0.001615,0.028040,0.24182,0.015937,0.015441,0.021133,56.853169,...,1.335189,1.385580,1.281551,1.367171,1.319055,1.367095,1.343193,1.374330,1.383364,1.456409
238,0.45376,0.000037,0.002724,0.002258,0.064605,0.58002,0.041295,0.027626,0.037650,60.096871,...,1.327629,1.349928,1.461323,1.350599,1.346363,1.415338,1.361937,1.331923,1.423062,1.307353


In [33]:
cat_values

Unnamed: 0,Status,Gender
0,0,1
1,0,1
2,0,1
3,0,0
4,0,0
...,...,...
235,1,0
236,1,0
237,1,0
238,1,0


In [34]:
preprocessing.scale(cat_value_drop, copy = False)

array([[-0.58101625, -0.62592023, -0.48236751, ...,  0.3992009 ,
         0.36644924,  0.09179239],
       [-0.36830627, -0.46944211, -0.3443091 , ..., -0.03567923,
        -0.08151006, -0.05861059],
       [-0.6188711 , -0.65844894, -0.51632506, ...,  0.31989683,
        -0.14354048, -0.14616714],
       ...,
       [-0.55650002, -0.4603155 , -0.51602797, ...,  0.25017934,
         0.21358973,  0.58406869],
       [-0.21159614, -0.14055298, -0.10889383, ...,  0.04739592,
         0.39559411, -0.13673616],
       [-0.41037136, -0.32547162, -0.39959783, ...,  0.35213786,
        -0.096673  ,  0.42025583]])
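The call above discards its return value and relies on `copy = False` to overwrite `cat_value_drop` in place, which depends on how the installed pandas and scikit-learn versions share memory. A minimal sketch of a more explicit variant, assuming a small made-up frame (`demo`, with two hypothetical feature columns) in place of the real data:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for cat_value_drop; the values are made up for illustration.
demo = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0, 4.0],
                     "feat_b": [10.0, 20.0, 40.0, 50.0]})

# Wrap the returned array in a new frame, keeping index and column names,
# instead of relying on in-place modification.
scaled = pd.DataFrame(preprocessing.scale(demo),
                      index=demo.index, columns=demo.columns)

print(scaled.mean().round(6).tolist())       # per-column means, approximately 0
print(scaled.std(ddof=0).round(6).tolist())  # per-column population std, approximately 1
```

Assigning the result to a named frame makes the later `join` independent of whether the in-place write actually happened.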

In [35]:
df = cat_values.join(cat_value_drop)

In [36]:
df

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0,1,-0.581016,-0.625920,-0.482368,-0.399512,-0.333175,-0.338382,-0.275735,-0.246098,...,0.385218,0.387523,0.276500,0.434974,0.577023,0.514389,0.359978,0.399201,0.366449,0.091792
1,0,1,-0.368306,-0.469442,-0.344309,-0.271091,-0.639881,-0.634816,-0.599614,-0.632767,...,-0.015243,-0.506103,-0.568066,0.108630,0.112829,0.151616,-0.034451,-0.035679,-0.081510,-0.058611
2,0,1,-0.618871,-0.658449,-0.516325,-0.428043,-0.804148,-0.805353,-0.742243,-0.731051,...,0.409324,-0.048015,-0.250870,0.478040,0.599785,0.167130,0.527763,0.319897,-0.143540,-0.146167
3,0,0,-0.510709,-0.565150,-0.589974,-0.451020,-0.572155,-0.601993,-0.489424,-0.521625,...,0.874865,0.937932,-0.008565,0.798990,0.680802,1.500949,1.080461,1.512640,1.225190,1.382203
4,0,0,-0.627049,-0.626362,-0.599451,-0.460013,-1.072419,-1.053500,-1.075970,-1.029821,...,0.912927,-0.001718,1.441577,1.704935,0.416104,1.168911,1.513907,1.012083,-0.179913,0.224355
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,1,0,0.015850,-0.131736,0.181335,0.381919,-0.910697,-0.892008,-0.903302,-0.901413,...,-1.140259,-1.084160,-1.142207,-1.200095,-0.845170,-1.055641,-0.794885,-0.906501,-1.049695,-1.095232
236,1,0,-0.622448,-0.624263,-0.540776,-0.439150,-0.865187,-0.804867,-0.932286,-0.807856,...,-1.187032,-1.181354,-1.132332,-1.037520,-1.151269,-0.942429,-0.992123,-0.782141,-1.103572,-0.937146
237,1,0,-0.556500,-0.460316,-0.516028,-0.412642,-0.428767,-0.442008,-0.386725,-0.539349,...,0.005475,0.238627,-0.223237,0.177962,-0.055558,0.158438,0.064751,0.250179,0.213590,0.584069
238,1,0,-0.211596,-0.140553,-0.108894,-0.268101,1.148549,1.202578,1.457633,0.307597,...,-0.034114,0.070837,0.686053,0.098349,0.075123,0.393031,0.156240,0.047396,0.395594,-0.136736
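Because the train/test split happens only later in the notebook, the means and standard deviations used for this standardization include rows that end up in the test set. A hedged sketch of the alternative, fitting the scaler on the training rows only; `X_demo` and `y_demo` are made-up arrays, not the Parkinson data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(17)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(240, 4))  # made-up features
y_demo = rng.integers(0, 2, size=240)                    # made-up labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=17)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)    # test rows reuse the training mean and std

print(X_tr_std.mean(axis=0).round(3))  # approximately 0 for every column
print(X_te_std.mean(axis=0).round(3))  # close to 0, but generally not exactly 0
```

For tree-based models such as CatBoost the practical effect of this change is usually small, since splits depend on the ordering of feature values rather than their scale.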


In [37]:
df.isnull()

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
236,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
237,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
238,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [38]:
y = df["Status"]
X = df.drop(["Status"], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30,
                                                    random_state = 17)
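
Each subject contributes three recordings, so a purely random row split can place recordings of the same subject in both the training and the test set. A minimal sketch of a subject-level split, assuming a subject identifier such as the dataset's `ID` column is still available; here it is simulated as `subject_id`, and the features and labels are made up:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_subjects, n_reps = 80, 3
subject_id = np.repeat(np.arange(n_subjects), n_reps)  # hypothetical ID column, 240 rows
rng = np.random.default_rng(17)
X_demo = rng.normal(size=(subject_id.size, 5))          # made-up features
y_demo = np.repeat(rng.integers(0, 2, size=n_subjects), n_reps)  # one label per subject

# test_size applies to groups (subjects), not to individual rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=17)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=subject_id))

# Subjects that appear on both sides of the split; should be empty.
shared = set(subject_id[train_idx]) & set(subject_id[test_idx])
print(len(train_idx), len(test_idx), len(shared))
```

With this kind of split, a test-set score reflects performance on subjects the model has never seen, which a row-level split does not guarantee.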

# CatBoost

In [39]:
catb = CatBoostClassifier().fit(X_train, y_train)

Learning rate set to 0.00481
0:	learn: 0.6897696	total: 150ms	remaining: 2m 30s
1:	learn: 0.6862181	total: 155ms	remaining: 1m 17s
2:	learn: 0.6834633	total: 159ms	remaining: 53s
3:	learn: 0.6800522	total: 164ms	remaining: 40.8s
4:	learn: 0.6765847	total: 168ms	remaining: 33.5s
5:	learn: 0.6738297	total: 173ms	remaining: 28.7s
6:	learn: 0.6710396	total: 178ms	remaining: 25.3s
7:	learn: 0.6678141	total: 183ms	remaining: 22.7s
8:	learn: 0.6645253	total: 189ms	remaining: 20.8s
9:	learn: 0.6614993	total: 194ms	remaining: 19.2s
10:	learn: 0.6581656	total: 199ms	remaining: 17.9s
11:	learn: 0.6553003	total: 207ms	remaining: 17s
12:	learn: 0.6522588	total: 214ms	remaining: 16.2s
13:	learn: 0.6493512	total: 220ms	remaining: 15.5s
14:	learn: 0.6464077	total: 225ms	remaining: 14.8s
15:	learn: 0.6433242	total: 229ms	remaining: 14.1s
16:	learn: 0.6400721	total: 234ms	remaining: 13.5s
17:	learn: 0.6370937	total: 239ms	remaining: 13s
18:	learn: 0.6346417	total: 245ms	remaining: 12.6s
19:	learn: 0.631

In [40]:
y_pred = catb.predict(X_test)
accuracy_score(y_test,y_pred)

0.8611111111111112

In [41]:
catb_params = {
    'iterations': [200,500],
    'learning_rate': [0.01,0.05, 0.1],
    'depth': [3,5,8] }
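
The size of this grid determines how many models the search has to train. A small sketch that counts the combinations with scikit-learn's `ParameterGrid`, using the 5-fold setting from the grid search cell that follows:

```python
from sklearn.model_selection import ParameterGrid

catb_params = {
    'iterations': [200, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'depth': [3, 5, 8],
}

n_candidates = len(ParameterGrid(catb_params))  # 2 * 3 * 3 combinations
n_folds = 5                                     # cv value used in the grid search
print(n_candidates, n_candidates * n_folds)     # → 18 90
```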

In [42]:
catb = CatBoostClassifier()
catb_cv_model = GridSearchCV(catb, catb_params, cv=5, n_jobs = -1, verbose = 2)
catb_cv_model.fit(X_train, y_train)
catb_cv_model.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits
0:	learn: 0.6860761	total: 4.57ms	remaining: 909ms
1:	learn: 0.6787322	total: 8.59ms	remaining: 850ms
2:	learn: 0.6708649	total: 12ms	remaining: 787ms
3:	learn: 0.6654880	total: 16.1ms	remaining: 789ms
4:	learn: 0.6579462	total: 19.9ms	remaining: 777ms
5:	learn: 0.6521875	total: 24ms	remaining: 777ms
6:	learn: 0.6462223	total: 27.8ms	remaining: 765ms
7:	learn: 0.6407302	total: 31.8ms	remaining: 762ms
8:	learn: 0.6353836	total: 35.7ms	remaining: 757ms
9:	learn: 0.6298339	total: 39.8ms	remaining: 757ms
10:	learn: 0.6238866	total: 43.6ms	remaining: 749ms
11:	learn: 0.6175032	total: 47.5ms	remaining: 744ms
12:	learn: 0.6132838	total: 51.2ms	remaining: 736ms
13:	learn: 0.6059541	total: 55.1ms	remaining: 732ms
14:	learn: 0.6008930	total: 58.6ms	remaining: 722ms
15:	learn: 0.5948111	total: 62.8ms	remaining: 722ms
16:	learn: 0.5905708	total: 66.6ms	remaining: 716ms
17:	learn: 0.5849980	total: 70.2ms	remaining: 710ms
18:	learn: 0.5807

{'depth': 5, 'iterations': 200, 'learning_rate': 0.01}
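
The search above uses plain 5-fold cross-validation, which splits by row. If a subject identifier is available, the folds can be built per subject instead. The sketch below is an illustration under assumptions: `groups` is a simulated subject ID, the data is made up, and a `DecisionTreeClassifier` stands in for `CatBoostClassifier` to keep the example light.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(17)
groups = np.repeat(np.arange(40), 3)                # hypothetical subject IDs, 3 rows each
X_demo = rng.normal(size=(groups.size, 4))           # made-up features
y_demo = np.repeat(rng.integers(0, 2, size=40), 3)   # one made-up label per subject

search = GridSearchCV(
    DecisionTreeClassifier(random_state=17),         # stand-in for CatBoostClassifier
    {"max_depth": [2, 3, 5]},
    cv=GroupKFold(n_splits=5),                       # folds never split a subject
)
search.fit(X_demo, y_demo, groups=groups)            # groups are forwarded to the splitter
print(search.best_params_)
```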

In [44]:
catb = CatBoostClassifier(iterations = 200, 
                          learning_rate = 0.01, 
                          depth = 5)

catb_tuned = catb.fit(X_train, y_train)
y_pred = catb_tuned.predict(X_test)

0:	learn: 0.6860761	total: 19.1ms	remaining: 3.79s
1:	learn: 0.6787322	total: 23.2ms	remaining: 2.29s
2:	learn: 0.6708649	total: 27.3ms	remaining: 1.79s
3:	learn: 0.6654880	total: 31.8ms	remaining: 1.56s
4:	learn: 0.6579462	total: 36.2ms	remaining: 1.41s
5:	learn: 0.6521875	total: 40.4ms	remaining: 1.31s
6:	learn: 0.6462223	total: 45ms	remaining: 1.24s
7:	learn: 0.6407302	total: 50.3ms	remaining: 1.21s
8:	learn: 0.6353836	total: 55.2ms	remaining: 1.17s
9:	learn: 0.6298339	total: 60.9ms	remaining: 1.16s
10:	learn: 0.6238866	total: 67.9ms	remaining: 1.17s
11:	learn: 0.6175032	total: 77ms	remaining: 1.21s
12:	learn: 0.6132838	total: 86.4ms	remaining: 1.24s
13:	learn: 0.6059541	total: 92.2ms	remaining: 1.22s
14:	learn: 0.6008930	total: 97.4ms	remaining: 1.2s
15:	learn: 0.5948111	total: 103ms	remaining: 1.19s
16:	learn: 0.5905708	total: 108ms	remaining: 1.17s
17:	learn: 0.5849980	total: 113ms	remaining: 1.14s
18:	learn: 0.5807091	total: 119ms	remaining: 1.13s
19:	learn: 0.5756963	total: 124

In [47]:
y_pred = catb_tuned.predict(X_test)
accuracy_score(y_test, y_pred)

0.8472222222222222
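
The two accuracy values are easier to compare as counts of correctly classified test rows. A small sketch, assuming the test set holds 72 rows (30% of the 240 instances mentioned in the dataset description):

```python
n_test = 72  # assumption: 30% of 240 rows

default_acc = 0.8611111111111112  # accuracy reported for the default model
tuned_acc = 0.8472222222222222    # accuracy reported for the tuned model

default_correct = round(default_acc * n_test)
tuned_correct = round(tuned_acc * n_test)

print(default_correct, tuned_correct)  # → 62 61
```

Under that assumption the two models differ by a single test recording, so the lower score of the tuned model is well within what one different split could produce.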