# Statistical Methods for High Dimensional Data project

This project applies dimensionality reduction methods to a classification task: detecting the presence of Parkinson's disease from speech signals.

First, we load the data and take a first look at the structure of the dataset.

In [3]:
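# Start from a clean workspace
rm(list = ls())
# skip = 1 drops the first row of the file, so the second row is used as the header
data <- read.csv("pd_speech_features.csv", header = TRUE, skip = 1)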

In [4]:
data

id,gender,PPE,DFA,RPDE,numPulses,numPeriodsPulses,meanPeriodPulses,stdDevPeriodPulses,locPctJitter,⋯,tqwt_kurtosisValue_dec_28,tqwt_kurtosisValue_dec_29,tqwt_kurtosisValue_dec_30,tqwt_kurtosisValue_dec_31,tqwt_kurtosisValue_dec_32,tqwt_kurtosisValue_dec_33,tqwt_kurtosisValue_dec_34,tqwt_kurtosisValue_dec_35,tqwt_kurtosisValue_dec_36,class
0,1,0.85247,0.71826,0.57227,240,239,0.008063530,0.000086800,0.00218,⋯,1.5620,2.6445,3.8686,4.2105,5.1221,4.4625,2.6202,3.0004,18.9405,1
0,1,0.76686,0.69481,0.53966,234,233,0.008258256,0.000073100,0.00195,⋯,1.5589,3.6107,23.5155,14.1962,11.0261,9.5082,6.5245,6.3431,45.1780,1
0,1,0.85083,0.67604,0.58982,232,231,0.008339590,0.000060400,0.00176,⋯,1.5643,2.3308,9.4959,10.7458,11.0177,4.8066,2.9199,3.1495,4.7666,1
1,0,0.41121,0.79672,0.59257,178,177,0.010857733,0.000182739,0.00419,⋯,3.7805,3.5664,5.2558,14.0403,4.2235,4.6857,4.8460,6.2650,4.0603,1
1,0,0.32790,0.79782,0.53028,236,235,0.008161574,0.002668863,0.00535,⋯,6.1727,5.8416,6.0805,5.7621,7.7817,11.6891,8.2103,5.0559,6.1164,1
1,0,0.50780,0.78744,0.65451,226,221,0.007631204,0.002696381,0.00783,⋯,4.8025,5.0734,7.0166,5.9966,5.2065,7.4246,3.4153,3.5046,3.2250,1
2,1,0.76095,0.62145,0.54543,322,321,0.005990989,0.000107266,0.00222,⋯,117.2678,75.3156,32.0478,7.7060,3.1060,4.6206,12.8353,13.8300,7.7693,1
2,1,0.83671,0.62079,0.51179,318,317,0.006073855,0.000135739,0.00282,⋯,3.8564,11.8909,7.2891,4.3682,3.6443,5.9610,11.7552,18.0927,5.0448,1
2,1,0.80826,0.61766,0.50447,318,317,0.006057188,0.000069300,0.00161,⋯,2.2640,6.3993,4.4165,4.2662,3.6357,3.7346,2.9394,3.6216,3.8430,1
3,0,0.85302,0.62247,0.54855,493,492,0.003910221,0.000039900,0.00075,⋯,1.6796,2.0474,2.8117,3.5070,3.2727,3.8415,3.9439,5.8807,38.7211,1


However, reading with `skip = 1` drops the first row of the file, which contains the "macrocategories" of the features. We don't want to lose that information, so we keep it.

In [12]:
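# Read the file again without a header so that the first row is available
data_1 <- read.csv("pd_speech_features.csv", header = FALSE)
# Keep only the first row (the macrocategories) before dropping the temporary copy
macrocategories <- data_1[1, ]
rm(data_1)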

In [14]:
# Column ranges (in the original file) and the macrocategory they belong to
dt <- cbind(
  c("3-23", "24-26", "27-30", "31-34", "35-56", "57-140", "141-322", "323-755"),
  c("Baseline Feature", "Intensity Parameters", "Formant Frequencies",
    "Bandwidth Parameters", "Vocal Fold", "MFCC", "Wavelet Features",
    "TQWT Features")
)
colnames(dt) <- c("Colonne", "Descrizione")

Here we can see the macrocategory names and the corresponding column numbers. Note that the first column holds the ID, while the second holds the gender of the person.

In [26]:
dt

| Colonne | Descrizione |
|---|---|
| 3-23 | Baseline Feature |
| 24-26 | Intensity Parameters |
| 27-30 | Formant Frequencies |
| 31-34 | Bandwidth Parameters |
| 35-56 | Vocal Fold |
| 57-140 | MFCC |
| 141-322 | Wavelet Features |
| 323-755 | TQWT Features |
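
The range strings above can also be expanded into one label per column, which makes it easier to select or count features by macrocategory. The following is only a minimal sketch: the names `ranges`, `labels`, `bounds` and `group_of_column` are hypothetical and do not appear in the original notebook, and the indices are assumed to refer to the columns of the original file (before the ID is removed).

```r
# Column ranges and macrocategory names copied from the table above
ranges <- c("3-23", "24-26", "27-30", "31-34", "35-56", "57-140", "141-322", "323-755")
labels <- c("Baseline Feature", "Intensity Parameters", "Formant Frequencies",
            "Bandwidth Parameters", "Vocal Fold", "MFCC", "Wavelet Features",
            "TQWT Features")

# Split each "start-end" string into two integers (one row per macrocategory)
bounds <- do.call(rbind, lapply(strsplit(ranges, "-", fixed = TRUE), as.integer))

# group_of_column (hypothetical name): macrocategory of every column,
# NA for columns outside every range (here the ID and the gender)
group_of_column <- rep(NA_character_, max(bounds))
for (i in seq_along(labels)) {
  group_of_column[bounds[i, 1]:bounds[i, 2]] <- labels[i]
}

# Number of columns per macrocategory
table(group_of_column)
```

With such a vector, something like `which(group_of_column == "MFCC")` would give the positions of one feature block in the original column order.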


In [23]:
# Print the first few feature names as an example
head(colnames(data))

First, we remove the ID column; we can also note that there are no missing values. For convenience, we move the response variable to the first column.

In [28]:
id <- data[,1]
data <- data[,-1]
data <- cbind(data[grep("class",colnames(data))],data[-grep("class",colnames(data))])
colnames(data)[1] <- "y"
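
The absence of missing values is stated but not shown in a cell. A check along the following lines could confirm it. This is a sketch on a small made-up data frame (`toy` is a hypothetical stand-in, with one `NA` inserted on purpose so that the check has something to find); on the real `data`, `anyNA(data)` would be expected to return `FALSE` if the statement holds.

```r
# Toy stand-in for the dataset, with one missing value added on purpose
toy <- data.frame(
  y   = c(1, 0, 1),
  PPE = c(0.85, NA, 0.41),
  DFA = c(0.72, 0.80, 0.62)
)

# TRUE if at least one value is missing anywhere in the data frame
has_missing <- anyNA(toy)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

has_missing
missing_per_column
```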

In [30]:
head(data["y"])

y
1
1
1
1
1
1


Let's see whether the values in the response variable are balanced.

In [32]:
data$y <- as.factor(data$y)
data$gender <- as.factor(data$gender)
prop.table(table(data$y))


        0         1 
0.2539683 0.7460317 
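
One way to read these proportions is as a reference point for the classifiers: a rule that always predicts the most frequent class would already be correct on that share of the observations. The sketch below simply redoes this arithmetic with the proportions printed above; `class_prop`, `majority_class` and `baseline_accuracy` are hypothetical names used only for illustration.

```r
# Class proportions copied from the output above
class_prop <- c("0" = 0.2539683, "1" = 0.7460317)

# Most frequent class and the accuracy of always predicting it
majority_class <- names(which.max(class_prop))
baseline_accuracy <- max(class_prop)

majority_class
baseline_accuracy
```

Any model fitted later is worth keeping only if its accuracy is clearly above this value.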

The classes are somewhat unbalanced (about 75% of the observations belong to class 1), but not severely, so it's not a big deal.<br>
We now proceed to divide the dataset into a training set and a test set.

In [35]:
set.seed(69)
train_fraction <- 0.75
# Randomly draw the row indices that go into the training set
train_id <- sample(seq_len(nrow(data)), size = floor(train_fraction * nrow(data)))
train_set <- data[train_id, ]
test_set <- data[-train_id, ]

In [37]:
dim(train_set)

In [38]:
dim(test_set)

Let's check whether the class proportions of the response variable are preserved in the two sets.

In [39]:
table(train_set$y)/length(train_set$y)


        0         1 
0.2557319 0.7442681 

In [40]:
table(test_set$y)/length(test_set$y)


        0         1 
0.2486772 0.7513228 

Fortunately, as expected, the proportions are still roughly the same.
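
With a purely random split, the proportions match only approximately and depend on the seed. If one wanted them to match by construction, a stratified split is a possible alternative: sample the training indices separately within each class. This is not what the notebook does. The sketch below assumes a toy response `y` with a roughly 25/75 class ratio, and `stratified_train_index` is a hypothetical helper.

```r
# Toy response with class counts chosen to mimic the roughly 25/75 ratio (assumption)
y <- factor(c(rep(0, 192), rep(1, 564)))

# Hypothetical helper: draw train_fraction of the indices within each class
stratified_train_index <- function(y, train_fraction) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_train <- floor(train_fraction * length(idx))
    # Index through sample.int to avoid sample()'s special case for length-1 input
    idx[sample.int(length(idx), size = n_train)]
  })
  unlist(picked, use.names = FALSE)
}

set.seed(69)
train_id <- stratified_train_index(y, 0.75)

# Class proportions in the two parts of the split
prop.table(table(y[train_id]))
prop.table(table(y[-train_id]))
```

The same indices could then be used as in the cell above, e.g. `data[train_id, ]` and `data[-train_id, ]`.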