# Instructions

You are provided with the dataset parkinsons.csv to be used for classification. The numbers in each row measure certain aspects of a person’s voice, which may be related to whether the person has Parkinson’s disease. The target column is status, where 0 means no disease and 1 means the disease is present. Create and submit a Jupyter notebook according to the following requirements.

Load the data. (2 marks)

Drop the one non-numeric column from the data. (2 marks)

Use any suitable machine learning algorithm we have learnt about in class to train a model. (4 marks)

Evaluate your model on the train and test data. (2 marks)

Write a conclusion and a recommendation in a Markdown cell. (4 marks)

## Load the data

In [2]:
import pandas as pd
parkinsons = pd.read_csv("parkinsons.csv")

## Preview the data

In [3]:
parkinsons.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


## Drop the one non-numeric column from the data

In [4]:
num_parkinsons = parkinsons.drop('name', axis=1)
num_parkinsons.head()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,0.482,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335
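
When the name of the text column is not known in advance, the non-numeric columns can be found by dtype instead of by name. The following is a minimal sketch on a tiny hypothetical frame (the rows and the `demo` variable are made up for illustration and are not taken from parkinsons.csv):

```python
import pandas as pd

# Hypothetical two-row frame: one text column among numeric ones.
demo = pd.DataFrame({
    "name": ["rec_1", "rec_2"],
    "MDVP:Fo(Hz)": [119.992, 122.4],
    "status": [1, 1],
})

# Collect every column whose dtype is not numeric, then drop them all at once.
non_numeric = demo.select_dtypes(exclude="number").columns
numeric_only = demo.drop(columns=non_numeric)

print("Dropped columns:", list(non_numeric))
print(numeric_only.dtypes)
```

On the real data this should select the same `name` column that the cell above drops explicitly.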


## Use any suitable machine learning algorithm we have learnt about in class to train a model. (4 marks)

In [5]:
num_parkinsons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MDVP:Fo(Hz)       195 non-null    float64
 1   MDVP:Fhi(Hz)      195 non-null    float64
 2   MDVP:Flo(Hz)      195 non-null    float64
 3   MDVP:Jitter(%)    195 non-null    float64
 4   MDVP:Jitter(Abs)  195 non-null    float64
 5   MDVP:RAP          195 non-null    float64
 6   MDVP:PPQ          195 non-null    float64
 7   Jitter:DDP        195 non-null    float64
 8   MDVP:Shimmer      195 non-null    float64
 9   MDVP:Shimmer(dB)  195 non-null    float64
 10  Shimmer:APQ3      195 non-null    float64
 11  Shimmer:APQ5      195 non-null    float64
 12  MDVP:APQ          195 non-null    float64
 13  Shimmer:DDA       195 non-null    float64
 14  NHR               195 non-null    float64
 15  HNR               195 non-null    float64
 16  status            195 non-null    int64  
 1

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = num_parkinsons.drop('status', axis=1)
y = num_parkinsons['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# instantiate a model and fit it to the training set
logreg = LogisticRegression().fit(X_train, y_train)
# evaluate the model on the test set
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))

Test set score: 0.92


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
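
The warning above offers two remedies: scale the data or raise `max_iter`. A minimal sketch of both, assuming a synthetic stand-in for the voice features (generated with `make_classification`, so the printed scores say nothing about the Parkinson's data), could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 195 rows and 22 features, like the real table.
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Putting the scaler in a pipeline means it is fit on the training rows only,
# so no information from the test rows leaks into the preprocessing.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

print("Train accuracy: {:.2f}".format(scaled_logreg.score(X_tr, y_tr)))
print("Test accuracy: {:.2f}".format(scaled_logreg.score(X_te, y_te)))
```

In the notebook, the same pipeline could be fit on `X_train` and `y_train` in place of the bare `LogisticRegression()`; whether that changes the 0.92 test score would need to be checked by running it.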


## Evaluate your model on the train and test data. (2 marks)

In [49]:
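from sklearn.metrics import mean_squared_error
test_pred = logreg.predict(X_test)
# take the square root explicitly; the squared=False argument is not available in recent scikit-learn releases
rmse = mean_squared_error(y_test, test_pred) ** 0.5
print("Error on test data:", rmse)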

Error on test data: 0.2857142857142857


In [50]:
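from sklearn.metrics import mean_squared_error
train_pred = logreg.predict(X_train)
# take the square root explicitly; the squared=False argument is not available in recent scikit-learn releases
rmse = mean_squared_error(y_train, train_pred) ** 0.5
print("Error on train data:", rmse)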

Error on train data: 0.3792566630111542


## Write a conclusion and a recommendation in a Markdown cell. (4 marks)

I used logistic regression to train the model. I set X to be the dataset without the target variable, status, and y to be the target variable, status. I used train_test_split to split the data into training and test sets and computed the logistic regression score on the test set, which was 92%. I would prefer it to be around 98%, but for now this is acceptable. I used random_state=0 because it produced the highest test score among the values I tested; since the split was chosen by looking at the test score, the 92% is probably an optimistic estimate.

To evaluate the model on the training and test data, I used root mean squared error, which was 0.379 on the training data and 0.2857 on the test data. Because the labels and predictions are 0 or 1, RMSE here is the square root of the misclassification rate, so these values correspond to accuracies of about 86% on the training data and 92% on the test data. The error on the test data is lower than on the training data, so there is no sign of overfitting, which would show up as a training error well below the test error. It also does not mean that the model generalizes better to new data than to the data it was trained on; with only 195 rows, the gap more likely reflects a small test set and a favourable split. The convergence warning indicates that the solver stopped before converging, so the fit could probably be improved by scaling the features or increasing max_iter.

My recommendation is to use cross-validation, so that every row is used for both training and testing, and specifically Leave-One-Out Cross-Validation, which is affordable on a dataset this small. This would give a more reliable estimate of how the model performs on unseen data than a single train/test split.
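
The recommended Leave-One-Out Cross-Validation can be sketched as follows. This is an illustration under stated assumptions, not a result for this assignment: the data is a small synthetic set from `make_classification`, and the scaled pipeline is one possible model choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with X and y from the notebook.
X_demo, y_demo = make_classification(n_samples=60, n_features=8, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Leave-one-out: each row is held out once while the model trains on all the others,
# so there are as many folds as rows and each fold's accuracy is either 0 or 1.
loo_scores = cross_val_score(model, X_demo, y_demo, cv=LeaveOneOut())

print("Number of folds:", len(loo_scores))
print("LOOCV accuracy: {:.2f}".format(loo_scores.mean()))
```

The mean of the per-fold scores is the fraction of held-out rows classified correctly, which does not depend on any particular `random_state` for the split.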