# Dataset's Story

Contains acoustic features extracted from 3 voice recording replications of the sustained /a/ phonation for each one of the 80 subjects (40 of them with Parkinson's Disease).

## Source links
https://archive-beta.ics.uci.edu/ml/datasets/parkinson+dataset+with+replicated+acoustic+features
https://archive.ics.uci.edu/ml/machine-learning-databases/00489/

## Explanation of Variables

* 1. __ID__: Subject's identifier. 

* 2. __Recording__: Number of the recording. 

* 3. __Status__: 0=Healthy; 1=PD 

* 4. __Gender__: 0=Man; 1=Woman 

* 5. __Pitch local perturbation measures__: __Jitter_rel__: relative jitter, __Jitter_abs__: absolute jitter, __Jitter_RAP__: relative average perturbation, __Jitter_PPQ__: pitch perturbation quotient.

* 6. __Amplitude perturbation measures__: __Shim_loc__: local shimmer, __Shim_dB__: shimmer in dB, __Shim_APQ3__: 3-point amplitude perturbation quotient, __Shim_APQ5__: 5-point amplitude perturbation quotient, __Shim_APQ11__: 11-point amplitude perturbation quotient.

* 7. __Harmonic-to-noise ratio measures__: __HNR05__: harmonic-to-noise ratio in the frequency band 0-500 Hz, __HNR15__: in 0-1500 Hz, __HNR25__: in 0-2500 Hz, __HNR35__: in 0-3500 Hz, __HNR38__: in 0-3800 Hz. 

* 8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (__MFCC0, MFCC1,..., MFCC12__) and their derivatives (__Delta0, Delta1,..., Delta12__). 

* 9. __RPDE__: Recurrence period density entropy. 

* 10. __DFA__: Detrended fluctuation analysis. 

* 11. __PPE__: Pitch period entropy. 

* 12. __GNE__: Glottal-to-noise excitation ratio.

## Additional Information

_Important remarks before using this dataset:_

__1.__ _Each row can not be used independently, because is one of the three replications of one individual. Nature of data is dependent for each subject, but independent from one to another subject. So, traditional technique from machine learning can not be applied to this dataset, because those techniques are based on the independent nature of the instances. There are 240 instances but for only 80 subjects, so they are not independent. Techniques as those presented in Naranjo et al. (2016), Naranjo et al. (2017) or other specifically designed can be used._

__2.__ _The concept of replication considered here does not match the classical concept of statistical repeated measurements. The term 'replications' refers to the collection of features extracted from voice recordings belonging to the same subject. Since, in this context, features are extracted from multiple consecutive voice recordings from the same subject, in principle, the features should be identical. The imperfections in technology and the own biological variability result in non-identical replicated features that are more similar to one another than features from different subjects._

__3.__ _All information about how the dataset was generated is presented in Naranjo et al. (2016)._
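
Remark 1 implies that the three rows of a subject should always stay on the same side of a train/test split. The following is a minimal sketch of one way to enforce that with scikit-learn's `GroupShuffleSplit`. It runs on synthetic data; the array names, the split size and the random seed are illustrative assumptions, and this is not the method of Naranjo et al.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the dataset: 80 subjects x 3 replications each.
rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(80), 3)   # group label for every row
X = rng.normal(size=(240, 5))               # placeholder features
y = np.repeat(np.arange(80) % 2, 3)         # placeholder labels

# Split by subject, so no subject contributes rows to both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))

train_subjects = set(subject_ids[train_idx])
test_subjects = set(subject_ids[test_idx])
print(len(train_subjects), len(test_subjects), train_subjects & test_subjects)
```

The printed intersection is empty when the split respects the subject grouping.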

_Relevant Papers:_

_Naranjo, L., Pérez, C.J., Campos-Roca, Y., Martín, J.: Addressing voice recording replications for Parkinson's disease detection. Expert Systems With Applications 46, 286-292 (2016)_
https://pubmed.ncbi.nlm.nih.gov/27209185/

_Naranjo, L., Pérez, C.J., Martín, J., Campos-Roca, Y.: A two-stage variable selection and classification approach for Parkinson's disease detection by using voice recording replications. Computer Methods and Programs in Biomedicine 142, 147-156 (2017)_
https://pubmed.ncbi.nlm.nih.gov/28325442/


## Data Scientist's Notes

### A Short Brief About the Variables for Those Unfamiliar with the Field:

__ID:__ 

_The ID is a number assigned to the identity of the subjects._

However, since there are three separate records for each subject, each subject appears as three rows under the same ID. So it looks like I will have to do an extra operation on the ID. Based on the following explanation, quoted from the Additional Information section above, I can group by ID and reduce each subject's records to one.

_"Each row can not be used independently, because is one of the three replications of one individual. Nature of data is dependent for each subject, but independent from one to another subject."_
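
As a rough illustration of the grouping idea, the sketch below collapses three replications per subject into one row with `groupby`. The toy values and the choice of summary statistics are assumptions for demonstration only; the preprocessing section of this notebook keeps the standard deviation.

```python
import pandas as pd

# Toy frame: 2 subjects x 3 replications, one feature (values are made up).
toy = pd.DataFrame({
    "ID": ["CONT-01"] * 3 + ["PARK-01"] * 3,
    "Status": [0, 0, 0, 1, 1, 1],
    "Jitter_rel": [0.25, 0.37, 0.24, 0.60, 0.58, 0.65],
})

# One row per subject: the label is constant within a subject,
# while the feature is summarized across the replications.
per_subject = toy.groupby("ID").agg(
    Status=("Status", "first"),
    Jitter_rel_mean=("Jitter_rel", "mean"),
    Jitter_rel_std=("Jitter_rel", "std"),
)
print(per_subject)
```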

__Recording:__ 

_Number of the recording of the subjects_

__Pitch local perturbation measures__:

__Jitter_rel__: relative jitter,

__Jitter_abs__: absolute jitter,

__Jitter_RAP__: relative average perturbation,

__Jitter_PPQ__: pitch perturbation quotient

AND

__Amplitude perturbation measures__: 

__Shim_loc__: local shimmer, 

__Shim_dB__: shimmer in dB, 

__Shim_APQ3__: 3-point amplitude perturbation quotient, 

__Shim_APQ5__: 5-point amplitude perturbation quotient, 

__Shim_APQ11__: 11-point amplitude perturbation quotient


For the methods of measurements:

https://www.fon.hum.uva.nl/praat/manual/Voice_2__Jitter.html

https://www.fon.hum.uva.nl/praat/manual/Voice_3__Shimmer.html

For more detailed information:

"Jitter and shimmer are measures of the cycle-to-cycle variations
of fundamental frequency and amplitude, respectively, which
have been largely used for the description of pathological voice
quality. Since they characterise some aspects concerning
particular voices, it is a priori expected to find differences in the values of jitter and shimmer among speakers. In this paper,
several types of jitter and shimmer measurements have been
analysed. Experiments performed with the Switchboard-I
conversational speech database show that jitter and shimmer
measurements give excellent results in speaker verification as
complementary features of spectral and prosodic parameters." [1]

"Jitter is noise in the temporal or timing domain.
Yes, it really is that simple. Normally though we apply the term to electrical or optical signals we can measure with oscilloscopes or other time measurement equipment.
We can think of jitter two ways.
• As an instantaneous effect: This one edge isn’t where I wanted it to be.
• As an accumulation of effects: in this series of edges, each edge is displaced an equal amount,
and the last edge shows the sum of the displacement times. "[2]

"The most important vocal acoustic parameters for
clinical use are measurements of noise, vocal extension
profile, acoustic spectrography, fundamental frequency and
perturbation index - jitter and shimmer.
According to Behlau et al. fundamental frequency
is determined physiologically by the number of cycles that
the vocal folds make in a second, and they are the natural
result of the length of these structures.
Jitter and shimmer represent the variations that occur
in the fundamental frequency. Whereas jitter indicates the
variability or perturbation of fundamental frequency, shimmer
refers to the same perturbation, but it is related to amplitude of sound wave, or intensity of vocal emission. Jitter is
affected mainly because of lack of control of vocal fold
vibration and shimmer with reduction of glottic resistance
and mass lesions in the vocal folds, which are related with
presence of noise at emission and breathiness." [3]

[1] https://nlp.lsi.upc.edu/papers/far_jit_07.pdf

[2] http://anlage.umd.edu/Microwave%20Measurements%20for%20Personal%20Web%20Site/Tek%20Intro%20to%20Jitter%2061W_18897_1.pdf

[3] https://www.scielo.br/j/rboto/a/jfYLfsybBtsWkfrnS5ZHhNP/?format=pdf&lang=en
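
To make the cycle-to-cycle idea concrete, here is a minimal sketch of the local (relative) jitter and local shimmer as described in the Praat manual pages linked above: the mean absolute difference between consecutive cycles, divided by the mean. The period and amplitude values are hypothetical, and whether the dataset's authors used exactly these definitions and units is not confirmed here.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

# Hypothetical glottal cycle measurements (not taken from the dataset).
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # cycle lengths in seconds
amplitudes = [0.80, 0.78, 0.82, 0.79]          # peak amplitude per cycle

jitter_local = local_perturbation(periods_s)    # perturbation of the period
shimmer_local = local_perturbation(amplitudes)  # perturbation of the amplitude
print(f"jitter = {jitter_local:.4f}, shimmer = {shimmer_local:.4f}")
```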


__Note:__ _As a result of my research and visualizations on the dataset, it seems that a single value can represent both the Pitch Local Perturbation Measures and the Amplitude Perturbation Measures groups. According to the following visualizations, this value is Jitter_rel (relative jitter).
So I'm thinking of eliminating the other variables in these groups._
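
One simple way to check whether one variable can stand in for a group is to look at pairwise correlations. The sketch below does this on synthetic columns that are deliberately built to be strongly related; the 0.9 cutoff and the generated values are assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly related columns standing in for the jitter group.
rng = np.random.default_rng(0)
base = rng.uniform(0.2, 1.0, size=100)
demo = pd.DataFrame({
    "Jitter_rel": base,
    "Jitter_RAP": 0.005 * base + rng.normal(0, 1e-4, size=100),
    "Jitter_PPQ": 0.006 * base + rng.normal(0, 1e-4, size=100),
})

corr = demo.corr()  # pairwise Pearson correlation
# Columns whose correlation with Jitter_rel exceeds the (assumed) cutoff.
redundant = corr["Jitter_rel"].drop("Jitter_rel") > 0.9
print(corr.round(3))
print(redundant)
```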


__HNR Values:__

Harmonic-to-noise ratio (HNR): the ratio of the energy in the fundamental frequency and its harmonics to the energy of the noise. Its unit is dB, and high values indicate that the noise is low relative to the harmonic (periodic) part of the voice. This parameter, which is not measured by MDVP, can be measured with Praat and Dr. Speech Vocal Assessment (Tiger DRS, Inc.).
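
Expressed in dB, the ratio is ten times the base-10 logarithm of harmonic energy over noise energy, so larger values mean relatively less noise. A minimal sketch with made-up energy shares (the function name `hnr_db` is hypothetical):

```python
import math

def hnr_db(harmonic_energy, noise_energy):
    """Harmonic-to-noise ratio in decibels."""
    return 10 * math.log10(harmonic_energy / noise_energy)

print(hnr_db(99, 1))    # 99% periodic energy, 1% noise: about 20 dB
print(hnr_db(50, 50))   # equal energy in both parts: 0 dB
```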

"Harmonic to Noise Ratio (HNR) measures the ratio between periodic and non-periodic components of a speech sound. It has become more and more important in the vocal acoustic analysis to diagnose pathologic voices. The measure of this parameter can be done with Praat software that is commonly accept by the scientific community has an accurate measure." [4]

These variables, which are ratio variables, take much larger values than the other variables, so I can standardize them.

[4]https://www.sciencedirect.com/science/article/pii/S1877050918316739#:~:text=measures%20the%20ratio,an%20accurate%20measure.
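
A minimal sketch of such a standardization with scikit-learn's `StandardScaler`, using hypothetical values on two very different scales. In a modeling workflow the scaler would normally be fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales (HNR-like vs jitter-like).
features = np.array([
    [60.0, 0.0015],
    [75.0, 0.0030],
    [90.0, 0.0021],
    [82.0, 0.0044],
])

# Each column is shifted to zero mean and scaled to unit variance.
scaled = StandardScaler().fit_transform(features)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```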

__MFCC Values:__ 

_Mel frequency cepstral coefficient_

Mel-frequency cepstral[5][6] coefficients, known as MFCCs for short, are a Fourier-based transformation[7] used for feature extraction in applications such as speech recognition and speaker recognition. In short, they let us extract information that characterizes the audio data.

The variables MFCC0, MFCC1,..., MFCC12 hold the cepstral coefficients of order 0 to 12 measured from each subject's voice recording. So it looks like I will have to do a new operation on these variables.

[5] https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

[6] https://en.wikipedia.org/wiki/Mel_scale

[7] https://en.wikipedia.org/wiki/Fourier_transform
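
The "mel" part of MFCC refers to a perceptual frequency scale. The sketch below uses one widely used conversion formula, m = 2595 * log10(1 + f / 700); other variants exist, and which one was used to produce this dataset is not stated here. The function names are hypothetical.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz become smaller steps in mels at higher frequencies.
for f in (500, 1000, 2000, 4000):
    print(f, round(hz_to_mel(f), 1))
```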

__Delta Values:__

Delta variables (named Delta0, Delta1,..., Delta12) are the derivatives of the MFCC values. I will probably apply the same operations to this variable group as to the MFCC group.
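
Delta coefficients are commonly understood as the change of each MFCC from frame to frame. The sketch below approximates that with finite differences on a hypothetical MFCC matrix; how the Delta columns of this dataset were actually computed is not stated here.

```python
import numpy as np

# Hypothetical MFCC matrix: rows are consecutive frames, columns are coefficients.
mfcc_frames = np.array([
    [1.0, 0.50],
    [1.2, 0.40],
    [1.5, 0.45],
    [1.4, 0.30],
])

# Central differences in the interior, one-sided differences at both ends.
delta_frames = np.gradient(mfcc_frames, axis=0)
print(delta_frames)
```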


__DFA Value:__

"In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical self-affinity of a signal. It is useful for analysing time series that appear to be long-memory processes (diverging correlation time, e.g. power-law decaying autocorrelation function) or 1/f noise.

The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform.

Peng et al. introduced DFA in 1994 in a paper that has been cited over 3,000 times as of 2020 and represents an extension of the (ordinary) fluctuation analysis (FA), which is affected by non-stationarities." [8]

[8]https://en.wikipedia.org/wiki/Detrended_fluctuation_analysis
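
For intuition, here is a compact sketch of the usual DFA recipe: integrate the mean-centered signal, detrend it inside windows of several sizes, and fit the slope of log-fluctuation against log-window-size. The window sizes, the first-order (linear) detrending and the non-overlapping windows are assumptions of this sketch, and the function name is hypothetical; it is not the procedure used to produce the DFA column.

```python
import numpy as np

def dfa_exponent(signal, window_sizes):
    """Estimate the DFA scaling exponent with linear detrending."""
    x = np.asarray(signal, dtype=float)
    profile = np.cumsum(x - x.mean())            # integrated, mean-centered signal
    fluctuations = []
    for n in window_sizes:
        n_windows = len(profile) // n
        segments = profile[: n_windows * n].reshape(n_windows, n).T  # (n, n_windows)
        t = np.arange(n)
        coeffs = np.polyfit(t, segments, 1)      # one linear fit per window
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        fluctuations.append(np.sqrt(np.mean((segments - trend) ** 2)))
    # Slope of log F(n) versus log n is the scaling exponent.
    slope, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), 1)
    return slope

rng = np.random.default_rng(0)
white_noise = rng.normal(size=4096)
alpha = dfa_exponent(white_noise, [16, 32, 64, 128, 256])
print(round(alpha, 2))   # uncorrelated noise is expected to give a value near 0.5
```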

__RPDE:__

Recurrence period density entropy

"Recurrence period density entropy (RPDE) is a method, in the fields of dynamical systems, stochastic processes, and time series analysis, for determining the periodicity, or repetitiveness of a signal."[9]

[9]https://en.wikipedia.org/wiki/Recurrence_period_density_entropy

__PPE Value__:

_Pitch Period Entropy_

The research quoted below introduces a new measure of dysphonia, called Pitch Period Entropy (PPE), as a new method for detecting Parkinson's disease.

As I understand it, many voice analysis measures are used in the dataset. It seems I will have to build a model that lets me determine which values work most efficiently together or separately.

"We present an assessment of the practical value of existing traditional and non-standard measures for discriminating healthy people from people with Parkinson's disease (PD) by detecting dysphonia. We introduce a new measure of dysphonia, Pitch Period Entropy (PPE), which is robust to many uncontrollable confounding effects including noisy acoustic environments and normal, healthy variations in voice frequency. We collected sustained phonations from 31 people, 23 with PD. We then selected 10 highly uncorrelated measures, and an exhaustive search of all possible combinations of these measures finds four that in combination lead to overall correct classification performance of 91.4%, using a kernel support vector machine. In conclusion, we find that non-standard methods in combination with traditional harmonics-to-noise ratios are best able to separate healthy from PD subjects. The selected non-standard methods are robust to many uncontrollable variations in acoustic environment and individual subjects, and are thus well-suited to telemonitoring applications."[10]

[10]https://www.researchgate.net/publication/50377363_Suitability_of_Dysphonia_Measurements_for_Telemonitoring_of_Parkinson's_Disease

__GNE Value__:

_Glottal-to-noise excitation ratio_

As I understand from the explanation below, GNE is a noise-related measure that its authors compare against the HNR and NNE measures. I want to compare it with the Status and Gender variables and see the results. However, I have doubts about its necessity in the model, since it describes noise in the voice much like the HNR values that are already in the dataset (NNE is not included in the dataset). In the end, I may eliminate one of the two.

"In this article a new acoustic parameter for the objective description of voice quality is introduced. It is based on
the correlation coefficient for Hilbert envelopes of different frequency bands. The parameter indicates whether a
given voice signal originates from vibrations of the vocal folds or from turbulent noise generated in the vocal tract
and is thus related to (but not a direct measure of) breathiness. Therefore it is named Glottal-to-Noise Excitation
Ratio (GNE Ratio). GNE is compared to HNR (Harmonics-to-Noise Ratio) and NNE (Normalized Noise Energy),
existing measures also sensitive to additive noise (turbulence). Experiments with artificial signals show that only
the GNE is almost independent of frequency modulation noise (jitter) and amplitude modulation noise (shimmer)."[11]

[11] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.549.2511&rep=rep1&type=pdf#:~:text=The%20parameter%20indicates%20whether%20a,Excitation%20Ratio%20(GNE%20Ratio)

## Result

__It was mentioned above that traditional machine learning methods cannot be applied to this dataset. However, a review of the papers by Naranjo et al. shows that they used Bayesian regression models.[12] In addition, a literature search turned up the paper titled "Gradient boosting for Parkinson's disease diagnosis from voice recordings", which studied this dataset and reported its best results with LGB (LightGBM). [13]__

__Therefore, I wanted to study this dataset myself and evaluate the results.__

__Results__: _Machine learning models such as Random Forest, Gradient Boosting Machine, XGBoost, CatBoost, LightGBM and Naive Bayes were tried, and the best result, 0.8472, was obtained with the CatBoost model._

[12] https://pubmed.ncbi.nlm.nih.gov/27209185/

[13] https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-01250-7

# Libraries

In [126]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn import preprocessing 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor
from catboost import CatBoostClassifier


# Dataset

In [63]:
df = pd.read_csv("ReplicatedAcousticFeatures-ParkinsonDatabase.csv")

In [64]:
df.head()

Unnamed: 0,ID,Recording,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,CONT-01,1,0,1,0.25546,1.5e-05,0.001467,0.001673,0.030256,0.26313,...,1.407701,1.417218,1.380352,1.42067,1.45124,1.440295,1.403678,1.405495,1.416705,1.35461
1,CONT-01,2,0,1,0.36964,2.2e-05,0.001932,0.002245,0.023146,0.20217,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.32287,1.314549,1.318999,1.323508
2,CONT-01,3,0,1,0.23514,1.3e-05,0.001353,0.001546,0.019338,0.1671,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.38891,1.305469,1.305402
3,CONT-02,1,0,0,0.2932,1.7e-05,0.001105,0.001444,0.024716,0.20892,...,1.5012,1.53417,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,CONT-02,2,0,0,0.23075,1.5e-05,0.001073,0.001404,0.013119,0.11607,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023


In [65]:
df.shape

(240, 48)

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 48 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          240 non-null    object 
 1   Recording   240 non-null    int64  
 2   Status      240 non-null    int64  
 3   Gender      240 non-null    int64  
 4   Jitter_rel  240 non-null    float64
 5   Jitter_abs  240 non-null    float64
 6   Jitter_RAP  240 non-null    float64
 7   Jitter_PPQ  240 non-null    float64
 8   Shim_loc    240 non-null    float64
 9   Shim_dB     240 non-null    float64
 10  Shim_APQ3   240 non-null    float64
 11  Shim_APQ5   240 non-null    float64
 12  Shi_APQ11   240 non-null    float64
 13  HNR05       240 non-null    float64
 14  HNR15       240 non-null    float64
 15  HNR25       240 non-null    float64
 16  HNR35       240 non-null    float64
 17  HNR38       240 non-null    float64
 18  RPDE        240 non-null    float64
 19  DFA         240 non-null    f

In [67]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recording,240.0,2.0,0.818203,1.0,1.0,2.0,3.0,3.0
Status,240.0,0.5,0.501045,0.0,0.0,0.5,1.0,1.0
Gender,240.0,0.4,0.490922,0.0,0.0,0.0,1.0,1.0
Jitter_rel,240.0,0.583987,0.535769,0.14801,0.29826,0.481455,0.681685,6.8382
Jitter_abs,240.0,4.4e-05,4.5e-05,7e-06,1.9e-05,3.5e-05,5.6e-05,0.00055
Jitter_RAP,240.0,0.003172,0.003373,0.000678,0.001551,0.002337,0.003678,0.043843
Jitter_PPQ,240.0,0.003532,0.004449,0.001036,0.001867,0.00287,0.003991,0.065199
Shim_loc,240.0,0.038428,0.023213,0.007444,0.024336,0.03296,0.045475,0.1926
Shim_dB,240.0,0.336832,0.205905,0.064989,0.211785,0.287885,0.39986,1.7476
Shim_APQ3,240.0,0.021499,0.013787,0.003344,0.01291,0.018571,0.025784,0.11324


# N/A Values

In [68]:
df.isnull().sum()

ID            0
Recording     0
Status        0
Gender        0
Jitter_rel    0
Jitter_abs    0
Jitter_RAP    0
Jitter_PPQ    0
Shim_loc      0
Shim_dB       0
Shim_APQ3     0
Shim_APQ5     0
Shi_APQ11     0
HNR05         0
HNR15         0
HNR25         0
HNR35         0
HNR38         0
RPDE          0
DFA           0
PPE           0
GNE           0
MFCC0         0
MFCC1         0
MFCC2         0
MFCC3         0
MFCC4         0
MFCC5         0
MFCC6         0
MFCC7         0
MFCC8         0
MFCC9         0
MFCC10        0
MFCC11        0
MFCC12        0
Delta0        0
Delta1        0
Delta2        0
Delta3        0
Delta4        0
Delta5        0
Delta6        0
Delta7        0
Delta8        0
Delta9        0
Delta10       0
Delta11       0
Delta12       0
dtype: int64

# Data Preprocessing

__The records of each subject were grouped by the ID variable and summarized by their standard deviation, in an attempt to obtain a single row per subject.__

In [69]:
df_s_g = df[["ID","Status","Gender"]]
df = df.drop(["Status","Gender","Recording"], axis= 1)

In [70]:
df.shape

(240, 45)

In [71]:
df.head()

Unnamed: 0,ID,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,CONT-01,0.25546,1.5e-05,0.001467,0.001673,0.030256,0.26313,0.017463,0.01966,0.021882,...,1.407701,1.417218,1.380352,1.42067,1.45124,1.440295,1.403678,1.405495,1.416705,1.35461
1,CONT-01,0.36964,2.2e-05,0.001932,0.002245,0.023146,0.20217,0.01301,0.014097,0.016828,...,1.331232,1.227338,1.213377,1.352739,1.354242,1.365692,1.32287,1.314549,1.318999,1.323508
2,CONT-01,0.23514,1.3e-05,0.001353,0.001546,0.019338,0.1671,0.011049,0.012683,0.013038,...,1.412304,1.324674,1.276088,1.429634,1.455996,1.368882,1.438053,1.38891,1.305469,1.305402
3,CONT-02,0.2932,1.7e-05,0.001105,0.001444,0.024716,0.20892,0.014525,0.015696,0.01833,...,1.5012,1.53417,1.323993,1.496442,1.472926,1.643177,1.551286,1.638346,1.604008,1.621456
4,CONT-02,0.23075,1.5e-05,0.001073,0.001404,0.013119,0.11607,0.006461,0.008385,0.011037,...,1.508468,1.334511,1.610694,1.685021,1.417614,1.574895,1.640088,1.533666,1.297536,1.382023


In [72]:
df_s_g = df_s_g.groupby(["ID"]).mean()

In [73]:
df_s_g = df_s_g.reset_index()

In [74]:
df_s_g

Unnamed: 0,ID,Status,Gender
0,CONT-01,0.0,1.0
1,CONT-02,0.0,0.0
2,CONT-03,0.0,1.0
3,CONT-04,0.0,1.0
4,CONT-05,0.0,0.0
...,...,...,...
75,PARK-36,1.0,1.0
76,PARK-37,1.0,1.0
77,PARK-38,1.0,1.0
78,PARK-39,1.0,0.0


In [75]:
df_s_g.Status = df_s_g.Status.astype(int)
df_s_g.Gender = df_s_g.Gender.astype(int)

In [76]:
df_s_g.head()

Unnamed: 0,ID,Status,Gender
0,CONT-01,0,1
1,CONT-02,0,0
2,CONT-03,0,1
3,CONT-04,0,1
4,CONT-05,0,0


In [77]:
df_s_g.shape

(80, 3)

In [78]:
df = df.groupby("ID").std()

In [79]:
df.head()

Unnamed: 0_level_0,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,HNR05,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
CONT-01,0.072503,4.572766e-06,0.000307,0.000372,0.005542,0.048593,0.003287,0.003688,0.004437,1.368432,...,0.045536,0.09495,0.084345,0.042047,0.057424,0.042181,0.059131,0.048435,0.060695,0.024888
CONT-02,0.064163,3.695894e-06,0.000157,0.000136,0.007505,0.060709,0.00495,0.004751,0.005001,4.340748,...,0.014424,0.171285,0.153256,0.116625,0.027663,0.134622,0.047599,0.064623,0.192594,0.127134
CONT-03,0.002959,3.48721e-07,1.3e-05,7.4e-05,0.001548,0.014031,0.000678,0.001168,0.001654,8.51157,...,0.088427,0.117069,0.134662,0.124808,0.073856,0.198271,0.090605,0.201892,0.124488,0.173769
CONT-04,0.148709,1.229108e-05,0.001187,0.00019,0.004275,0.037805,0.003379,0.001529,0.000905,6.842806,...,0.11155,0.113368,0.097098,0.059446,0.10904,0.022939,0.119272,0.16723,0.119695,0.108697
CONT-05,0.164566,2.114105e-05,0.001106,0.000844,0.003228,0.022632,0.002111,0.0025,0.003941,9.800123,...,0.088891,0.241085,0.1791,0.091858,0.150241,0.190669,0.079965,0.054692,0.085536,0.213939


In [80]:
df.shape

(80, 44)

In [81]:
df = df.reset_index()

In [82]:
df = df.drop(["ID"], axis= 1)

In [83]:
df.head()

Unnamed: 0,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,HNR05,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0.072503,4.572766e-06,0.000307,0.000372,0.005542,0.048593,0.003287,0.003688,0.004437,1.368432,...,0.045536,0.09495,0.084345,0.042047,0.057424,0.042181,0.059131,0.048435,0.060695,0.024888
1,0.064163,3.695894e-06,0.000157,0.000136,0.007505,0.060709,0.00495,0.004751,0.005001,4.340748,...,0.014424,0.171285,0.153256,0.116625,0.027663,0.134622,0.047599,0.064623,0.192594,0.127134
2,0.002959,3.48721e-07,1.3e-05,7.4e-05,0.001548,0.014031,0.000678,0.001168,0.001654,8.51157,...,0.088427,0.117069,0.134662,0.124808,0.073856,0.198271,0.090605,0.201892,0.124488,0.173769
3,0.148709,1.229108e-05,0.001187,0.00019,0.004275,0.037805,0.003379,0.001529,0.000905,6.842806,...,0.11155,0.113368,0.097098,0.059446,0.10904,0.022939,0.119272,0.16723,0.119695,0.108697
4,0.164566,2.114105e-05,0.001106,0.000844,0.003228,0.022632,0.002111,0.0025,0.003941,9.800123,...,0.088891,0.241085,0.1791,0.091858,0.150241,0.190669,0.079965,0.054692,0.085536,0.213939


In [84]:
df = df_s_g.join(df)
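
The join above aligns rows by position: it relies on both frames coming out of a `groupby` on `ID` in the same sorted order and then being reset to the same integer index. As a hedged alternative, the sketch below joins on the key itself, which does not depend on row order. The toy frames and their values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the label table and the per-subject feature table.
labels = pd.DataFrame({
    "ID": ["CONT-01", "CONT-02", "PARK-01"],
    "Status": [0, 0, 1],
    "Gender": [1, 0, 1],
})
features = pd.DataFrame({
    "ID": ["PARK-01", "CONT-01", "CONT-02"],   # deliberately in a different order
    "Jitter_rel": [0.30, 0.07, 0.06],
})

# Join on the key; validate raises if an ID is duplicated on either side.
merged = labels.merge(features, on="ID", how="inner", validate="one_to_one")
print(merged)
```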

In [85]:
df.head()

Unnamed: 0,ID,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,CONT-01,0,1,0.072503,4.572766e-06,0.000307,0.000372,0.005542,0.048593,0.003287,...,0.045536,0.09495,0.084345,0.042047,0.057424,0.042181,0.059131,0.048435,0.060695,0.024888
1,CONT-02,0,0,0.064163,3.695894e-06,0.000157,0.000136,0.007505,0.060709,0.00495,...,0.014424,0.171285,0.153256,0.116625,0.027663,0.134622,0.047599,0.064623,0.192594,0.127134
2,CONT-03,0,1,0.002959,3.48721e-07,1.3e-05,7.4e-05,0.001548,0.014031,0.000678,...,0.088427,0.117069,0.134662,0.124808,0.073856,0.198271,0.090605,0.201892,0.124488,0.173769
3,CONT-04,0,1,0.148709,1.229108e-05,0.001187,0.00019,0.004275,0.037805,0.003379,...,0.11155,0.113368,0.097098,0.059446,0.10904,0.022939,0.119272,0.16723,0.119695,0.108697
4,CONT-05,0,0,0.164566,2.114105e-05,0.001106,0.000844,0.003228,0.022632,0.002111,...,0.088891,0.241085,0.1791,0.091858,0.150241,0.190669,0.079965,0.054692,0.085536,0.213939


In [86]:
df.shape

(80, 47)

In [87]:
df = df.drop(["ID"], axis = 1)

In [88]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Status,80.0,0.5,0.503155,0.0,0.0,0.5,1.0,1.0
Gender,80.0,0.4,0.492989,0.0,0.0,0.0,1.0,1.0
Jitter_rel,80.0,0.179718,0.391044,0.002959189,0.048714,0.103746,0.163789,3.389431
Jitter_abs,80.0,1.4e-05,3.3e-05,2.421432e-07,4e-06,9e-06,1.2e-05,0.000285
Jitter_RAP,80.0,0.001144,0.002527,1.342572e-05,0.000294,0.000647,0.000972,0.021872
Jitter_PPQ,80.0,0.001176,0.003856,1.263738e-05,0.000263,0.000539,0.000907,0.034564
Shim_loc,80.0,0.009538,0.011573,0.0008266456,0.003889,0.005765,0.010564,0.066761
Shim_dB,80.0,0.084637,0.102936,0.008985,0.034379,0.051197,0.087649,0.604007
Shim_APQ3,80.0,0.005904,0.007298,0.0001986882,0.002294,0.003427,0.006361,0.040632
Shim_APQ5,80.0,0.005554,0.006582,0.0008710017,0.002214,0.003739,0.006264,0.037691


# Outlier Observations

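__Outliers were detected with the Local Outlier Factor (LOF) method and suppressed: each outlier was overwritten with the values of the observation at the chosen threshold score.__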

In [89]:
lof = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)

In [90]:
lof.fit_predict(df)

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

In [91]:
df_scores = lof.negative_outlier_factor_

In [92]:
df_scores[0:10]

array([-1.11634603, -1.00063476, -1.02310267, -0.99955392, -1.17114643,
       -1.08214625, -1.04583676, -1.08179491, -1.15467062, -1.17512983])

In [93]:
np.sort(df_scores)[:20]

array([-3.27689476, -1.71580017, -1.69581681, -1.63368145, -1.59551254,
       -1.38806   , -1.27844232, -1.26270854, -1.25511783, -1.24549715,
       -1.23462161, -1.21250884, -1.17666686, -1.17512983, -1.17114643,
       -1.15467062, -1.146841  , -1.13962896, -1.1365789 , -1.12990721])

In [94]:
threshold_value = np.sort(df_scores)[11]

In [95]:
df[df_scores == threshold_value]

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
67,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904


In [96]:
swamping_value = df[df_scores == threshold_value]

In [97]:
outliers = df[df_scores < threshold_value]

In [98]:
outliers

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
11,0,0,0.573984,5.3e-05,0.003779,0.003177,0.014522,0.145947,0.008605,0.009089,...,0.140553,0.100686,0.149783,0.047116,0.246413,0.067801,0.206328,0.074438,0.129344,0.027059
13,0,1,0.118966,1e-05,0.000813,0.000513,0.003274,0.027346,0.00232,0.001965,...,0.205038,0.04404,0.158308,0.175996,0.179842,0.208482,0.205575,0.16479,0.217571,0.214805
28,0,1,0.150218,9e-06,0.000689,0.000752,0.004057,0.051665,0.001395,0.001446,...,0.05682,0.146108,0.134987,0.142239,0.066595,0.068571,0.037327,0.073084,0.032711,0.148002
29,0,1,0.297666,1.6e-05,0.001904,0.001307,0.011328,0.080929,0.007649,0.004508,...,0.092341,0.095761,0.079022,0.094596,0.11334,0.093646,0.064769,0.115459,0.172773,0.157019
35,0,0,0.026287,2e-06,0.000138,0.000223,0.002595,0.02314,0.001539,0.001538,...,0.146625,0.23588,0.08002,0.232582,0.054529,0.083274,0.067433,0.003828,0.224203,0.072879
38,0,0,0.253817,2e-05,0.001647,0.001051,0.013607,0.13172,0.007207,0.007296,...,0.103308,0.159419,0.194642,0.215566,0.160818,0.24142,0.127173,0.257223,0.068246,0.064025
44,1,0,0.091313,6e-06,0.000516,0.0005,0.005815,0.05073,0.003268,0.003708,...,0.106529,0.22959,0.303352,0.172413,0.106278,0.186984,0.368433,0.284948,0.07782,0.138849
53,1,0,0.161427,1.1e-05,0.000611,0.000982,0.005208,0.04694,0.003103,0.002847,...,0.12515,0.17349,0.234933,0.123305,0.100218,0.160595,0.095512,0.128904,0.072038,0.045819
57,1,0,3.389431,0.000285,0.021872,0.034564,0.040784,0.323122,0.026101,0.024789,...,0.079611,0.016419,0.096569,0.071172,0.053817,0.127356,0.087,0.05022,0.02281,0.101855
58,1,1,0.021314,1e-06,0.000123,7.3e-05,0.003465,0.030011,0.002086,0.002704,...,0.016394,0.08976,0.02992,0.112791,0.067748,0.04598,0.048513,0.024188,0.027646,0.047911


In [99]:
outliers.to_records(index = False)

rec.array([(0, 0, 0.57398426, 5.28236422e-05, 0.00377885, 3.17732996e-03, 0.01452219, 0.14594722, 0.00860521, 0.00908928, 0.00836793, 21.10397969, 20.66249168, 20.6181093 , 20.85594767, 20.98684942, 0.04440703, 0.02146355, 1.20023883e-01, 0.01090234, 0.19888776, 0.20665413, 0.18504614, 0.28515106, 0.15346337, 0.15996338, 0.23890004, 0.17229709, 0.19512826, 0.17101747, 0.16066044, 0.2530174 , 0.13715257, 0.19162256, 0.06825478, 0.13737608, 0.14055305, 0.10068562, 0.14978346, 0.0471156 , 0.24641299, 0.0678007 , 0.20632817, 0.07443798, 0.12934449, 0.02705944),
           (0, 1, 0.11896584, 9.92242472e-06, 0.00081262, 5.13046794e-04, 0.00327439, 0.02734574, 0.00232016, 0.00196512, 0.00125564, 10.73658011, 12.56204192, 14.21503196, 14.80073327, 15.26698894, 0.02394733, 0.0088237 , 7.11621575e-02, 0.0125146 , 0.12781421, 0.31005381, 0.29942195, 0.20304617, 0.13684134, 0.13351792, 0.13161214, 0.17375959, 0.13713721, 0.14558523, 0.23486172, 0.19115401, 0.1824543 , 0.1637966 , 0.16981983, 0.132

In [100]:
record = outliers.to_records(index = False)

In [101]:
record[:] = swamping_value.to_records(index = False)

In [102]:
record

rec.array([(1, 0, 0.14192555, 1.17761061e-05, 0.00071701, 0.0010091, 0.00965307, 0.08412512, 0.00630685, 0.00686862, 0.00857979, 10.54059332, 10.16901919, 10.12948584, 10.47003307, 10.5377585, 0.02069747, 0.05197, 0.19749312, 0.0407782, 0.11274054, 0.1429649, 0.06520124, 0.19723888, 0.10856056, 0.18768472, 0.07440797, 0.1459366, 0.07979862, 0.18406437, 0.13760423, 0.1463425, 0.14208586, 0.14772975, 0.13472664, 0.10258305, 0.16095256, 0.13591679, 0.17900381, 0.11170355, 0.12154597, 0.15609398, 0.09358279, 0.13917882, 0.17378524, 0.11790429),
           (1, 0, 0.14192555, 1.17761061e-05, 0.00071701, 0.0010091, 0.00965307, 0.08412512, 0.00630685, 0.00686862, 0.00857979, 10.54059332, 10.16901919, 10.12948584, 10.47003307, 10.5377585, 0.02069747, 0.05197, 0.19749312, 0.0407782, 0.11274054, 0.1429649, 0.06520124, 0.19723888, 0.10856056, 0.18768472, 0.07440797, 0.1459366, 0.07979862, 0.18406437, 0.13760423, 0.1463425, 0.14208586, 0.14772975, 0.13472664, 0.10258305, 0.16095256, 0.13591679, 0.1

In [103]:
df[df_scores < threshold_value] = pd.DataFrame(record, index = df[df_scores < threshold_value].index) 

In [104]:
df[df_scores <= threshold_value]

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
11,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
13,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
28,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
29,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
35,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
38,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
44,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
53,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
57,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
58,1,0,0.141926,1.2e-05,0.000717,0.001009,0.009653,0.084125,0.006307,0.006869,...,0.160953,0.135917,0.179004,0.111704,0.121546,0.156094,0.093583,0.139179,0.173785,0.117904
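
The replacement step above overwrites every row whose outlier score falls below the threshold with one reference row. A minimal sketch of the same idea on toy data, assuming `df_scores` holds one score per row (lower meaning more anomalous) and that the reference row is the one sitting exactly at the threshold. The toy frame, the scores, and the names `toy`, `scores`, `threshold`, `reference_row`, and `mask` are illustrative and not taken from the notebook:

```python
import numpy as np
import pandas as pd

# Toy data: five rows, two features (hypothetical values).
toy = pd.DataFrame({"a": [1.0, 2.0, 50.0, 3.0, 80.0],
                    "b": [0.1, 0.2, 9.0, 0.3, 7.0]})

# Hypothetical per-row outlier scores; lower means more anomalous.
scores = np.array([-1.0, -1.1, -3.5, -1.2, -2.0])

# Use the third-lowest score as the cut-off (an arbitrary choice for this sketch).
threshold = np.sort(scores)[2]

# The row sitting exactly at the threshold serves as the replacement value.
reference_row = toy[scores == threshold].iloc[0]

# Overwrite every row scoring below the threshold, one column at a time.
mask = scores < threshold
for col in toy.columns:
    toy.loc[mask, col] = reference_row[col]
```

Rows at or above the threshold are left untouched, so only the most anomalous rows lose their original values.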


In [105]:
df.head(14)

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0,1,0.072503,4.572766e-06,0.000307,0.000372,0.005542,0.048593,0.003287,0.003688,...,0.045536,0.09495,0.084345,0.042047,0.057424,0.042181,0.059131,0.048435,0.060695,0.024888
1,0,0,0.064163,3.695894e-06,0.000157,0.000136,0.007505,0.060709,0.00495,0.004751,...,0.014424,0.171285,0.153256,0.116625,0.027663,0.134622,0.047599,0.064623,0.192594,0.127134
2,0,1,0.002959,3.48721e-07,1.3e-05,7.4e-05,0.001548,0.014031,0.000678,0.001168,...,0.088427,0.117069,0.134662,0.124808,0.073856,0.198271,0.090605,0.201892,0.124488,0.173769
3,0,1,0.148709,1.229108e-05,0.001187,0.00019,0.004275,0.037805,0.003379,0.001529,...,0.11155,0.113368,0.097098,0.059446,0.10904,0.022939,0.119272,0.16723,0.119695,0.108697
4,0,0,0.164566,2.114105e-05,0.001106,0.000844,0.003228,0.022632,0.002111,0.0025,...,0.088891,0.241085,0.1791,0.091858,0.150241,0.190669,0.079965,0.054692,0.085536,0.213939
5,0,1,0.079774,5.343918e-06,0.000539,0.000551,0.016862,0.145524,0.010524,0.011659,...,0.176824,0.079022,0.18275,0.064975,0.122877,0.091792,0.073726,0.039865,0.065188,0.050471
6,0,0,0.093556,1.039106e-05,0.000533,0.000435,0.007863,0.070059,0.004522,0.004903,...,0.206673,0.049412,0.206089,0.182368,0.23013,0.104754,0.082806,0.038617,0.135149,0.142957
7,0,1,0.118676,7.833761e-06,0.00064,0.000795,0.010867,0.088933,0.007475,0.004848,...,0.019543,0.023676,0.00551,0.046483,0.111449,0.049691,0.075671,0.040496,0.040285,0.031686
8,0,0,0.094234,1.103301e-05,0.000659,0.000618,0.004154,0.043579,0.003264,0.001266,...,0.101943,0.070169,0.098956,0.135556,0.082802,0.062786,0.018456,0.170319,0.035092,0.074232
9,0,1,0.027981,1.393921e-06,0.000194,0.000167,0.004415,0.039195,0.001393,0.002099,...,0.125272,0.086128,0.127841,0.224648,0.1568,0.113968,0.002261,0.081645,0.158778,0.151489


# Data Standardization

In [106]:
# Set the categorical columns aside so that only the numeric features are standardized
cat_value_drop = df.drop(["Status","Gender"], axis = 1)
cat_values = df[["Status","Gender"]]

In [107]:
cat_value_drop

Unnamed: 0,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,Shi_APQ11,HNR05,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0.072503,4.572766e-06,0.000307,0.000372,0.005542,0.048593,0.003287,0.003688,0.004437,1.368432,...,0.045536,0.094950,0.084345,0.042047,0.057424,0.042181,0.059131,0.048435,0.060695,0.024888
1,0.064163,3.695894e-06,0.000157,0.000136,0.007505,0.060709,0.004950,0.004751,0.005001,4.340748,...,0.014424,0.171285,0.153256,0.116625,0.027663,0.134622,0.047599,0.064623,0.192594,0.127134
2,0.002959,3.487210e-07,0.000013,0.000074,0.001548,0.014031,0.000678,0.001168,0.001654,8.511570,...,0.088427,0.117069,0.134662,0.124808,0.073856,0.198271,0.090605,0.201892,0.124488,0.173769
3,0.148709,1.229108e-05,0.001187,0.000190,0.004275,0.037805,0.003379,0.001529,0.000905,6.842806,...,0.111550,0.113368,0.097098,0.059446,0.109040,0.022939,0.119272,0.167230,0.119695,0.108697
4,0.164566,2.114105e-05,0.001106,0.000844,0.003228,0.022632,0.002111,0.002500,0.003941,9.800123,...,0.088891,0.241085,0.179100,0.091858,0.150241,0.190669,0.079965,0.054692,0.085536,0.213939
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,0.087015,4.062814e-06,0.000467,0.000348,0.009917,0.058511,0.006523,0.004149,0.005607,2.556724,...,0.142711,0.129427,0.162210,0.144232,0.101711,0.139037,0.103198,0.105527,0.157363,0.168384
76,0.348786,1.938841e-05,0.002368,0.001723,0.059913,0.565461,0.039214,0.028635,0.029929,3.743336,...,0.092997,0.021100,0.115034,0.100204,0.083589,0.075226,0.152393,0.066706,0.154291,0.103533
77,0.037466,1.952781e-06,0.000258,0.000172,0.004326,0.036449,0.002994,0.002122,0.001135,6.443996,...,0.097004,0.094840,0.107725,0.056577,0.167866,0.133751,0.181103,0.191542,0.215360,0.115435
78,0.184265,1.199181e-05,0.001317,0.002042,0.002304,0.024078,0.000775,0.001728,0.002611,2.614118,...,0.012772,0.027123,0.017356,0.017704,0.036760,0.024352,0.029283,0.030003,0.008649,0.023089


In [108]:
cat_values

Unnamed: 0,Status,Gender
0,0,1
1,0,0
2,0,1
3,0,1
4,0,0
...,...,...
75,1,1
76,1,1
77,1,1
78,1,0


In [109]:
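# Standardize each column to zero mean and unit variance.
# copy = False asks scale() to write the result into cat_value_drop itself.
preprocessing.scale(cat_value_drop, copy = False)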

array([[-0.45349089, -0.56765552, -0.60600714, ..., -0.99114859,
        -0.81446191, -1.46879416],
       [-0.51844375, -0.65301171, -0.78502565, ..., -0.73142254,
         1.17385906,  0.4906502 ],
       [-0.99506929, -0.97883106, -0.95602953, ...,  1.47101809,
         0.14719811,  1.38436689],
       ...,
       [-0.72634719, -0.8226892 , -0.66466143, ...,  1.30496064,
         1.51704362,  0.26644932],
       [ 0.41686398,  0.15452633,  0.60054665, ..., -1.28688946,
        -1.59902264, -1.50326735],
       [-0.29442474, -0.30563664, -0.12955921, ..., -1.24776466,
        -0.91101994, -0.44834425]])
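
`preprocessing.scale` centers each column on its mean and divides by its population standard deviation (`ddof=0`). The call above depends on `copy = False` modifying `cat_value_drop` in place, which scikit-learn documents as not guaranteed to work in every case. A minimal sketch of the same computation with an explicit assignment, using plain pandas on a toy frame (the values and the names `features` and `scaled` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric feature table (hypothetical values).
features = pd.DataFrame({"Jitter_rel": [0.07, 0.06, 0.15, 0.16],
                         "Shim_loc": [0.005, 0.007, 0.004, 0.003]})

# Column-wise z-score using the population standard deviation (ddof=0).
scaled = (features - features.mean()) / features.std(ddof=0)

# Sanity check values: means near 0, population standard deviations near 1.
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)
```

Assigning the result to a new name keeps the unscaled table available and does not rely on in-place behaviour.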

In [110]:
# Reattach the categorical columns to the standardized features
df = cat_values.join(cat_value_drop)

In [111]:
df

Unnamed: 0,Status,Gender,Jitter_rel,Jitter_abs,Jitter_RAP,Jitter_PPQ,Shim_loc,Shim_dB,Shim_APQ3,Shim_APQ5,...,Delta3,Delta4,Delta5,Delta6,Delta7,Delta8,Delta9,Delta10,Delta11,Delta12
0,0,1,-0.453491,-0.567656,-0.606007,-0.555768,-0.371531,-0.360937,-0.397070,-0.336841,...,-1.025362,-0.242142,-0.373819,-1.187865,-1.135954,-1.406001,-0.719616,-0.991149,-0.814462,-1.468794
1,0,0,-0.518444,-0.653012,-0.785026,-0.911107,-0.189446,-0.236663,-0.151584,-0.162319,...,-1.542822,1.242123,0.767415,0.351283,-1.811600,0.654297,-0.949393,-0.731423,1.173859,0.490650
2,0,1,-0.995069,-0.978831,-0.956030,-1.004560,-0.741947,-0.715441,-0.781927,-0.750849,...,-0.311994,0.187946,0.459484,0.520168,-0.762928,2.072915,-0.092479,1.471018,0.147198,1.384367
3,0,1,0.139967,0.183658,0.445090,-0.829500,-0.489039,-0.471591,-0.383444,-0.691511,...,0.072576,0.115986,-0.162618,-0.828791,0.035816,-1.834867,0.478710,0.914886,0.074944,0.137314
4,0,0,0.263453,1.045128,0.347688,0.155971,-0.586106,-0.627222,-0.570612,-0.531961,...,-0.304290,2.599318,1.195417,-0.159871,0.971172,1.903476,-0.304484,-0.890753,-0.439991,2.154175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,1,1,-0.340476,-0.617295,-0.414548,-0.591990,0.034372,-0.259209,0.080488,-0.261230,...,0.590863,0.428230,0.915705,0.921040,-0.130561,0.752714,0.158424,-0.075128,0.642773,1.281159
76,1,1,1.698081,0.874523,1.854068,1.480880,4.672007,4.940511,4.903975,3.760727,...,-0.235993,-1.678085,0.134421,0.012380,-0.541964,-0.669506,1.138653,-0.697993,0.596460,0.038354
77,1,1,-0.726347,-0.822689,-0.664661,-0.857001,-0.484314,-0.485497,-0.440328,-0.594129,...,-0.169354,-0.244273,0.013384,-0.887988,1.371298,0.634884,1.710705,1.304961,1.517044,0.266449
78,1,0,0.416864,0.154526,0.600547,1.960626,-0.671859,-0.612382,-0.767735,-0.658786,...,-1.570306,-1.560981,-1.483214,-1.690262,-1.605068,-1.803369,-1.314348,-1.286889,-1.599023,-1.503267


In [112]:
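# Target is Status (0 = healthy, 1 = PD); every other column is a feature
y = df["Status"]
X = df.drop(["Status"], axis = 1)

# Hold out 30% of the rows as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30,
                                                    random_state = 17)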
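
The split above comes after the whole table was standardized, so the test rows contributed to the scaling statistics. A sketch of the alternative order, splitting first and fitting the scaler on the training rows only. Note that `stratify` is an addition the notebook does not use, and that the synthetic arrays `X_demo` and `y_demo` merely stand in for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 80 rows, 4 features, 40 rows per class.
rng = np.random.default_rng(17)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 4))
y_demo = np.repeat([0, 1], 40)

# Split first; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=17, stratify=y_demo)

# Fit the scaler on training rows only, then reuse its mean and std for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

With this order the test set stays unseen until evaluation, at the cost of test columns no longer having exactly zero mean.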

# CatBoost

In [113]:
# Baseline CatBoost model with default hyperparameters
catb = CatBoostClassifier().fit(X_train, y_train)

Learning rate set to 0.003009
0:	learn: 0.6912377	total: 1.76ms	remaining: 1.75s
1:	learn: 0.6893044	total: 3.23ms	remaining: 1.61s
2:	learn: 0.6873055	total: 4.81ms	remaining: 1.6s
3:	learn: 0.6856513	total: 6.31ms	remaining: 1.57s
4:	learn: 0.6842391	total: 7.83ms	remaining: 1.56s
5:	learn: 0.6828232	total: 9.41ms	remaining: 1.56s
6:	learn: 0.6813825	total: 10.9ms	remaining: 1.55s
7:	learn: 0.6793796	total: 12.6ms	remaining: 1.56s
8:	learn: 0.6780947	total: 74.1ms	remaining: 8.15s
9:	learn: 0.6765093	total: 76.2ms	remaining: 7.54s
10:	learn: 0.6748049	total: 79.1ms	remaining: 7.11s
11:	learn: 0.6732366	total: 81.4ms	remaining: 6.71s
12:	learn: 0.6715400	total: 83.8ms	remaining: 6.36s
13:	learn: 0.6695773	total: 86.1ms	remaining: 6.07s
14:	learn: 0.6677604	total: 89ms	remaining: 5.84s
15:	learn: 0.6659046	total: 91.8ms	remaining: 5.64s
16:	learn: 0.6642384	total: 94.1ms	remaining: 5.44s
17:	learn: 0.6623308	total: 96.1ms	remaining: 5.24s
18:	learn: 0.6602648	total: 97.8ms	remaining: 5

In [114]:
# Accuracy of the baseline model on the held-out test set
y_pred = catb.predict(X_test)
accuracy_score(y_test, y_pred)

0.7916666666666666

In [115]:
catb_params = {
    'iterations': [200,500],
    'learning_rate': [0.01,0.05, 0.1],
    'depth': [3,5,8] }

In [116]:
catb = CatBoostClassifier()
catb_cv_model = GridSearchCV(catb, catb_params, cv=5, n_jobs = -1, verbose = 2)
catb_cv_model.fit(X_train, y_train)
catb_cv_model.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits
0:	learn: 0.6908627	total: 3.04ms	remaining: 1.52s
1:	learn: 0.6886750	total: 5.57ms	remaining: 1.39s
2:	learn: 0.6855477	total: 8.34ms	remaining: 1.38s
3:	learn: 0.6813863	total: 10.4ms	remaining: 1.29s
4:	learn: 0.6780481	total: 12.5ms	remaining: 1.24s
5:	learn: 0.6754931	total: 14.6ms	remaining: 1.2s
6:	learn: 0.6715885	total: 18.7ms	remaining: 1.32s
7:	learn: 0.6665158	total: 27.9ms	remaining: 1.71s
8:	learn: 0.6626897	total: 30.2ms	remaining: 1.65s
9:	learn: 0.6592667	total: 32.3ms	remaining: 1.58s
10:	learn: 0.6571492	total: 34.5ms	remaining: 1.53s
11:	learn: 0.6530219	total: 36.9ms	remaining: 1.5s
12:	learn: 0.6507169	total: 39ms	remaining: 1.46s
13:	learn: 0.6486642	total: 41.2ms	remaining: 1.43s
14:	learn: 0.6420884	total: 43.4ms	remaining: 1.4s
15:	learn: 0.6405358	total: 45.5ms	remaining: 1.38s
16:	learn: 0.6375201	total: 47.9ms	remaining: 1.36s
17:	learn: 0.6354997	total: 50.3ms	remaining: 1.35s
18:	learn: 0.63244

{'depth': 3, 'iterations': 500, 'learning_rate': 0.01}
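
The log line above reports 18 candidates and 90 fits. That follows from the grid itself: 2 × 3 × 3 parameter combinations, each fitted once per fold. A quick check of the arithmetic with the standard library (the names `candidates`, `n_folds`, `n_candidates`, and `n_fits` are illustrative):

```python
from itertools import product

catb_params = {
    "iterations": [200, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "depth": [3, 5, 8],
}

# Every combination of one value per parameter.
candidates = [dict(zip(catb_params, values))
              for values in product(*catb_params.values())]

n_folds = 5
n_candidates = len(candidates)      # 2 * 3 * 3
n_fits = n_candidates * n_folds     # one fit per candidate per fold
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training set, which is the training log printed after the search.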

In [118]:
catb = CatBoostClassifier(iterations = 500, 
                          learning_rate = 0.01, 
                          depth = 3)

catb_tuned = catb.fit(X_train, y_train)
y_pred = catb_tuned.predict(X_test)

0:	learn: 0.6908627	total: 2.67ms	remaining: 1.33s
1:	learn: 0.6886750	total: 4.58ms	remaining: 1.14s
2:	learn: 0.6855477	total: 6.26ms	remaining: 1.04s
3:	learn: 0.6813863	total: 8.13ms	remaining: 1.01s
4:	learn: 0.6780481	total: 11.4ms	remaining: 1.13s
5:	learn: 0.6754931	total: 14.2ms	remaining: 1.17s
6:	learn: 0.6715885	total: 17.1ms	remaining: 1.2s
7:	learn: 0.6665158	total: 18.8ms	remaining: 1.16s
8:	learn: 0.6626897	total: 20.5ms	remaining: 1.12s
9:	learn: 0.6592667	total: 22.3ms	remaining: 1.09s
10:	learn: 0.6571492	total: 24.3ms	remaining: 1.08s
11:	learn: 0.6530219	total: 26.3ms	remaining: 1.07s
12:	learn: 0.6507169	total: 28.3ms	remaining: 1.06s
13:	learn: 0.6486642	total: 30.6ms	remaining: 1.06s
14:	learn: 0.6420884	total: 32.6ms	remaining: 1.05s
15:	learn: 0.6405358	total: 34.3ms	remaining: 1.04s
16:	learn: 0.6375201	total: 36.1ms	remaining: 1.02s
17:	learn: 0.6354997	total: 37.9ms	remaining: 1.02s
18:	learn: 0.6324429	total: 39.7ms	remaining: 1s
19:	learn: 0.6298250	total

In [119]:
# Accuracy of the tuned model on the same held-out test set
y_pred = catb_tuned.predict(X_test)
accuracy_score(y_test, y_pred)

0.75
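
The two accuracies differ by a single test row: 0.7917 is 19/24 and 0.75 is 18/24, which is consistent with a test split of 24 rows (an inference from the fractions, not something the output states). A small sketch of how uncertain such an estimate is, using the normal-approximation standard error of a proportion. The function name `accuracy_standard_error` is hypothetical:

```python
import math

def accuracy_standard_error(correct, total):
    """Return (accuracy, standard error) for `correct` hits out of `total` rows."""
    acc = correct / total
    se = math.sqrt(acc * (1.0 - acc) / total)
    return acc, se

# Counts inferred from the reported accuracies, assuming 24 test rows.
default_acc, default_se = accuracy_standard_error(19, 24)
tuned_acc, tuned_se = accuracy_standard_error(18, 24)

print(f"default: {default_acc:.3f} +/- {default_se:.3f}")
print(f"tuned:   {tuned_acc:.3f} +/- {tuned_se:.3f}")
```

Both standard errors come out between 0.08 and 0.09, so a gap of one row between the default and the tuned model is well within noise for a test set this small.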