The data comes from a Parkinson's disease dataset that is read with `pd.read_csv` from a URL (the file is in CSV format despite its `.xls` extension). The code does not describe the dataset explicitly, but the column names (for example `MDVP:Fo(Hz)`, `MDVP:Jitter(%)`, `MDVP:Shimmer`, `HNR`) point to acoustic measurements of voice recordings, and the `status` column is used below as the target. In general, Parkinson's datasets contain features describing a patient's voice, tremor, or movement that can help in diagnosing the disease.

In [1]:
import pandas as pd
import numpy as np

# **Loading the Dataset**

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/ammfat/datasets/main/Parkinsson%20disease.xls')
df

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,...,0.08270,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.10470,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,phon_R01_S50_2,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,...,0.07008,0.02764,19.517,0,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,phon_R01_S50_3,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,...,0.04812,0.01810,19.147,0,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,phon_R01_S50_4,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,...,0.03804,0.10715,17.883,0,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,phon_R01_S50_5,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,...,0.03794,0.07223,19.020,0,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


# **Checking Missing Values**

In [3]:
df.isnull().sum()

Unnamed: 0,0
name,0
MDVP:Fo(Hz),0
MDVP:Fhi(Hz),0
MDVP:Flo(Hz),0
MDVP:Jitter(%),0
MDVP:Jitter(Abs),0
MDVP:RAP,0
MDVP:PPQ,0
Jitter:DDP,0
MDVP:Shimmer,0
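
With many columns, the full per-column listing is long. A minimal sketch of how the check could be condensed to only the columns that actually contain missing values; the toy frame and its values are hypothetical and only stand in for `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; one value is deliberately missing.
toy = pd.DataFrame({
    "hnr": [21.0, np.nan, 20.7],
    "dfa": [0.81, 0.82, 0.83],
})

# Missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0]

# Single number for the whole frame; 0 would mean nothing is missing.
total_missing = int(missing_per_column.sum())

print(columns_with_missing)
print(total_missing)
```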


In [4]:
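# Features: every column except the recording name and the target
x = df.drop(['name', 'status'], axis=1)
# Target: the status column, kept as a one-column DataFrame
y = df[['status']]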

In [5]:
y.tail()

Unnamed: 0,status
190,0
191,0
192,0
193,0
194,0


In [6]:
x.dtypes

Unnamed: 0,0
MDVP:Fo(Hz),float64
MDVP:Fhi(Hz),float64
MDVP:Flo(Hz),float64
MDVP:Jitter(%),float64
MDVP:Jitter(Abs),float64
MDVP:RAP,float64
MDVP:PPQ,float64
Jitter:DDP,float64
MDVP:Shimmer,float64
MDVP:Shimmer(dB),float64


None of the features is non-numerical, so no encoding is needed.
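
The same conclusion can be reached without reading the dtype listing by eye. A minimal sketch, assuming a hypothetical toy frame in place of `x` (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for the feature matrix `x`.
toy = pd.DataFrame({
    "f0_hz": [119.9, 122.4, 116.7],
    "hnr": [21.0, 19.1, 20.7],
    "recording_id": ["a", "b", "c"],  # deliberately non-numeric
})

# Columns whose dtype is not numeric; an empty list means no encoding is needed.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Applied to `x`, an empty list would confirm that every feature is already numeric.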

# **Feature Selection with Variance Threshold**

In [7]:
from sklearn.feature_selection import VarianceThreshold

VarianceThreshold selects features based on their variance. Features with low variance (their values are nearly constant) usually provide little information and can be removed. With `threshold=0`, only features that have the same value in every sample are removed.
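
To make the effect of the threshold concrete, here is a minimal sketch on a hypothetical three-column toy frame (the column names, values, and the 0.01 threshold are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one constant, one nearly constant, one varying column.
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "nearly_constant": [0.0, 0.0, 0.0, 0.1],
    "varying": [1.0, 5.0, 9.0, 13.0],
})

# threshold=0 drops only the column that never changes.
selector_zero = VarianceThreshold(threshold=0).fit(toy)
kept_at_zero = toy.columns[selector_zero.get_support()].tolist()

# A small positive threshold also drops the nearly constant column.
selector_strict = VarianceThreshold(threshold=0.01).fit(toy)
kept_at_strict = toy.columns[selector_strict.get_support()].tolist()

print(kept_at_zero)
print(kept_at_strict)
```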

In [8]:
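# threshold=0 removes only features whose value is identical in every sample
var_threshold = VarianceThreshold(threshold=0)

var_threshold.fit(x)
var_threshold.get_support()  # boolean mask: True = feature kept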

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])
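
Variance depends on a column's units, so small-valued columns such as percentages have tiny variances regardless of how informative they are. A minimal sketch of how per-column variances could be inspected before choosing a threshold; the toy frame, its column names, and the 0.01 cut-off are hypothetical, and the population variance (`ddof=0`) is used on the assumption that it matches what `VarianceThreshold` computes:

```python
import pandas as pd

# Hypothetical toy frame with two columns on very different scales.
toy = pd.DataFrame({
    "pitch_hz": [120.0, 150.0, 180.0, 210.0],
    "jitter_pct": [0.004, 0.006, 0.008, 0.010],
})

# Population variance per column, sorted from smallest to largest.
variances = toy.var(ddof=0).sort_values()
print(variances)

# Columns that a selector with this (hypothetical) cut-off would remove.
cutoff = 0.01
below_cutoff = variances[variances <= cutoff].index.tolist()
print(below_cutoff)
```

Running the same inspection on `x` would show which columns sit just above or below a candidate threshold.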

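## **Identifying and Dropping Low Variance Features**

Note: the all-`True` mask above means that `threshold=0` removes nothing, yet the cells below report 16 low-variance columns. The execution counter jumps from In [8] to In [11], so the selector was presumably refit with a larger threshold in cells that are not shown; that threshold value is not recorded here.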

In [11]:
# Names of the columns the selector rejected (low variance), to be dropped
low_var_cols_another = []

kept_cols = x.columns[var_threshold.get_support()]
for column in x.columns:
    if column not in kept_cols:
        low_var_cols_another.append(column)

for col in low_var_cols_another:
    print(col, end=', ')

MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA, NHR, RPDE, DFA, spread2, PPE, 

In [12]:
# Same selection as above, written as a list comprehension
low_var_cols = [column for column in x.columns if column not in x.columns[var_threshold.get_support()]]

for col in low_var_cols:
    print(col)

MDVP:Jitter(%)
MDVP:Jitter(Abs)
MDVP:RAP
MDVP:PPQ
Jitter:DDP
MDVP:Shimmer
MDVP:Shimmer(dB)
Shimmer:APQ3
Shimmer:APQ5
MDVP:APQ
Shimmer:DDA
NHR
RPDE
DFA
spread2
PPE


In [13]:
x_high = x.drop(low_var_cols, axis=1)

## **Sampling the Resulting Data**

In [16]:
x_high.sample(5)

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),HNR,spread1,D2
92,148.272,164.989,142.299,18.78,-5.952058,2.344336
49,122.964,130.049,114.676,24.971,-6.482096,2.054419
30,197.076,206.896,192.055,26.775,-7.3483,1.743867
98,125.791,140.557,96.206,15.433,-5.159169,2.441612
192,174.688,240.005,74.287,17.883,-6.787197,2.679772


# **Conclusion**

This notebook reduces the dataset by removing low-variance features, which usually contribute little to an analysis or prediction model. Of the 22 original features, 16 are dropped, and the remaining 6 (`MDVP:Fo(Hz)`, `MDVP:Fhi(Hz)`, `MDVP:Flo(Hz)`, `HNR`, `spread1`, `D2`) are stored in the DataFrame `x_high` for use in further analysis or model building.
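
The fit, mask, and drop steps can be combined into one reusable function. A minimal sketch, assuming a hypothetical helper name `drop_low_variance` and a hypothetical toy frame; the threshold is passed explicitly because the value used for `x_high` is not recorded in this notebook:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_low_variance(features: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return only the columns whose variance exceeds `threshold`."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(features)
    # get_support() is a boolean mask aligned with the column order.
    return features.loc[:, selector.get_support()]


# Hypothetical toy data, not the Parkinson's dataset.
toy = pd.DataFrame({
    "pitch_hz": [120.0, 150.0, 180.0, 210.0],
    "jitter_pct": [0.004, 0.006, 0.008, 0.010],
})

reduced = drop_low_variance(toy, threshold=0.01)
print(reduced.columns.tolist())
```

Selecting columns through the mask keeps the result a DataFrame with its original column names, which `selector.transform` would not do by default.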