# Description

##### The goal of this project is to apply data science techniques to discover information in two distinct datasets. The first dataset contains features extracted by various speech signal processing algorithms from the speech recordings of Parkinson's Disease (PD) patients, with the aim of obtaining clinically useful information for PD assessment. The second dataset is used to predict forest cover type from cartographic variables only. It covers four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.

#### Parkinson's Disease Dataset

In [3]:
import pandas as pd

data1 = pd.read_csv('pd_speech_features.csv', sep=',')

data1.shape

(756, 755)

In [4]:
# By default read_csv loads text columns as 'object', not 'category', so check both
cat_vars = data1.select_dtypes(include=['object', 'category'])
cat_vars.columns.size

0

In [5]:
num_vars = data1.select_dtypes(include='number')
num_vars.columns.size

755

In [6]:
# Map each column that has missing values to its count of missing values
null_counts = data1.isna().sum()
null_vars = null_counts[null_counts > 0].to_dict()
print(len(null_vars))

0


In [7]:
data1.head(6)

|  | id | gender | PPE | DFA | RPDE | numPulses | numPeriodsPulses | meanPeriodPulses | stdDevPeriodPulses | locPctJitter | ... | tqwt_kurtosisValue_dec_28 | tqwt_kurtosisValue_dec_29 | tqwt_kurtosisValue_dec_30 | tqwt_kurtosisValue_dec_31 | tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 | tqwt_kurtosisValue_dec_34 | tqwt_kurtosisValue_dec_35 | tqwt_kurtosisValue_dec_36 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.85247 | 0.71826 | 0.57227 | 240 | 239 | 0.008064 | 8.7e-05 | 0.00218 | ... | 1.562 | 2.6445 | 3.8686 | 4.2105 | 5.1221 | 4.4625 | 2.6202 | 3.0004 | 18.9405 | 1 |
| 1 | 0 | 1 | 0.76686 | 0.69481 | 0.53966 | 234 | 233 | 0.008258 | 7.3e-05 | 0.00195 | ... | 1.5589 | 3.6107 | 23.5155 | 14.1962 | 11.0261 | 9.5082 | 6.5245 | 6.3431 | 45.178 | 1 |
| 2 | 0 | 1 | 0.85083 | 0.67604 | 0.58982 | 232 | 231 | 0.00834 | 6e-05 | 0.00176 | ... | 1.5643 | 2.3308 | 9.4959 | 10.7458 | 11.0177 | 4.8066 | 2.9199 | 3.1495 | 4.7666 | 1 |
| 3 | 1 | 0 | 0.41121 | 0.79672 | 0.59257 | 178 | 177 | 0.010858 | 0.000183 | 0.00419 | ... | 3.7805 | 3.5664 | 5.2558 | 14.0403 | 4.2235 | 4.6857 | 4.846 | 6.265 | 4.0603 | 1 |
| 4 | 1 | 0 | 0.3279 | 0.79782 | 0.53028 | 236 | 235 | 0.008162 | 0.002669 | 0.00535 | ... | 6.1727 | 5.8416 | 6.0805 | 5.7621 | 7.7817 | 11.6891 | 8.2103 | 5.0559 | 6.1164 | 1 |
| 5 | 1 | 0 | 0.5078 | 0.78744 | 0.65451 | 226 | 221 | 0.007631 | 0.002696 | 0.00783 | ... | 4.8025 | 5.0734 | 7.0166 | 5.9966 | 5.2065 | 7.4246 | 3.4153 | 3.5046 | 3.225 | 1 |


##### The dataset has 756 records and 755 attributes, all of them numeric, and contains no null values.
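
The checks above (shape, numeric attributes, null values) can be gathered into one reusable helper, which would also apply to the Covertype dataset. This is only a minimal sketch: the function name `summarize` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic structural facts about a DataFrame (hypothetical helper)."""
    return {
        "records": df.shape[0],
        "attributes": df.shape[1],
        # Columns whose dtype is numeric (int, float, ...)
        "numeric_attributes": df.select_dtypes(include="number").shape[1],
        # Columns that contain at least one missing value
        "attributes_with_nulls": int(df.isna().any().sum()),
    }


# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame(
    {"id": [0, 0, 1], "PPE": [0.8, 0.7, None], "label": ["a", "b", "c"]}
)
summary = summarize(toy)
print(summary)
```

Calling `summarize(data1)` on the loaded dataset should reproduce the figures reported in the cells above, assuming the file is read the same way.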

#### Covertype Dataset

# Preprocessing

### Reducing Sample Size

#### Parkinson's Disease Dataset

##### This dataset contains three measurements for each patient, so we average them to obtain a single sample per patient.

In [9]:
data1 = data1.groupby('id').mean().reset_index()

data1.shape

(252, 755)
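
Averaging per `id` is only meaningful if every patient really has three rows and if per-patient attributes such as `gender` and `class` do not vary across those rows. The sketch below shows one way this could be verified before aggregating. The helper name `check_groups`, its parameters, and the toy data are assumptions for illustration; the original notebook does not include this check.

```python
import pandas as pd


def check_groups(df, key="id", constant_cols=("gender", "class"), expected_size=3):
    """Return (keys with an unexpected row count, keys whose constant columns vary)."""
    # Number of rows per patient
    sizes = df.groupby(key).size()
    bad_size = sizes[sizes != expected_size].index.tolist()

    # Number of distinct values per patient in columns expected to be constant
    nunique = df.groupby(key)[list(constant_cols)].nunique()
    varying = nunique[(nunique > 1).any(axis=1)].index.tolist()
    return bad_size, varying


# Toy data (made up): patient 1 has inconsistent gender, patient 2 has only two rows.
toy = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2],
        "gender": [1, 1, 1, 0, 0, 1, 0, 0],
        "class": [1, 1, 1, 1, 1, 1, 0, 0],
        "PPE": [0.85, 0.77, 0.85, 0.41, 0.33, 0.51, 0.60, 0.62],
    }
)
bad_size, varying = check_groups(toy)
print(bad_size, varying)
```

If both lists come back empty for the raw data, the mean of `gender` and `class` equals their original per-patient value, so the averaged frame keeps valid labels.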

In [10]:
data1.head(6)

|  | id | gender | PPE | DFA | RPDE | numPulses | numPeriodsPulses | meanPeriodPulses | stdDevPeriodPulses | locPctJitter | ... | tqwt_kurtosisValue_dec_28 | tqwt_kurtosisValue_dec_29 | tqwt_kurtosisValue_dec_30 | tqwt_kurtosisValue_dec_31 | tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 | tqwt_kurtosisValue_dec_34 | tqwt_kurtosisValue_dec_35 | tqwt_kurtosisValue_dec_36 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.823387 | 0.69637 | 0.56725 | 235.333333 | 234.333333 | 0.00822 | 7.3e-05 | 0.001963 | ... | 1.561733 | 2.862 | 12.293333 | 9.7175 | 9.0553 | 6.2591 | 4.021533 | 4.164333 | 22.9617 | 1.0 |
| 1 | 1 | 0.0 | 0.415637 | 0.793993 | 0.592453 | 213.333333 | 211.0 | 0.008884 | 0.001849 | 0.00579 | ... | 4.918567 | 4.827133 | 6.117633 | 8.599667 | 5.737233 | 7.933133 | 5.490533 | 4.941833 | 4.467233 | 1.0 |
| 2 | 2 | 1.0 | 0.801973 | 0.619967 | 0.520563 | 319.333333 | 318.333333 | 0.006041 | 0.000104 | 0.002217 | ... | 41.1294 | 31.201933 | 14.584467 | 5.4468 | 3.462 | 4.772067 | 9.176633 | 11.8481 | 5.552367 | 1.0 |
| 3 | 3 | 0.0 | 0.828707 | 0.626097 | 0.537183 | 493.0 | 492.0 | 0.003913 | 4.2e-05 | 0.000757 | ... | 1.677633 | 1.9084 | 2.842167 | 3.493867 | 3.282433 | 3.085267 | 3.184433 | 4.032933 | 22.773633 | 1.0 |
| 4 | 4 | 0.0 | 0.831287 | 0.779397 | 0.726717 | 362.666667 | 361.666667 | 0.005622 | 0.002023 | 0.003593 | ... | 4.1046 | 4.285233 | 2.9532 | 2.799933 | 2.6451 | 2.811367 | 7.268333 | 13.338833 | 63.7669 | 1.0 |
| 5 | 5 | 1.0 | 0.82252 | 0.622083 | 0.35766 | 284.0 | 283.0 | 0.006815 | 4.6e-05 | 0.00094 | ... | 25.4825 | 29.795367 | 26.472767 | 43.9837 | 54.324967 | 49.879867 | 43.996667 | 37.834733 | 73.894367 | 1.0 |


### Feature Selection

# Classification