# DAT210x - Programming with Python for DS

## Module6 - Lab3

In [None]:
import pandas as pd
import numpy as np

Load up the /Module6/Datasets/parkinsons.data data set into a variable X, being sure to drop the name column.

In [None]:
X = pd.read_csv('Datasets/parkinsons.data')
print(X.head(10))
X.drop('name', axis = 1, inplace = True)
print(X.isnull().sum())
print(X.dtypes)

Splice out the status column into a variable y and delete it from X.

In [None]:
y = X.status
X.drop(labels=['status'], axis=1, inplace= True)
print(X.columns)

Perform a train/test split. 30% test group size, with a random_state equal to 7.

In [None]:
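# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split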
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7)

Create an SVC classifier. Don't specify any parameters; just leave everything as default. Fit it against your training data and then score it against your testing data.

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)

That accuracy was just too low to be useful. We need to get it up. One way to do that would be to manually try a bunch of combinations of C and gamma values for your rbf kernel. But that could take a very long time, and you might unknowingly skip a pair of values that would have resulted in a very good accuracy.
Instead, let's get the computer to do what computers do best. Program a naive best-parameter search by creating nested for-loops. The outer for-loop should iterate a variable C from 0.05 to 2, using 0.05 unit increments. The inner for-loop should increment a variable gamma from 0.001 to 0.1, using 0.001 unit increments. As you know, Python's range() doesn't accept float steps, so you'll have to do some research on NumPy's arange(), if you don't already know how to use it.
Since the goal is to find the parameters that result in the model having the best accuracy score, you'll need a best_score = 0 variable that you initialize outside of the for-loops. Inside the inner for-loop, create an SVC model and pass the C and gamma parameters to its class constructor. Train and score the model appropriately. If the current best_score is less than the model's score, update best_score, being sure to print it out along with the C and gamma values that resulted in it.
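
A note on the float-stepped ranges described above: np.arange with a non-integer step can include or omit the endpoint depending on floating-point rounding, which is why the stop value has to be nudged slightly past the last value you want. As a minimal sketch of one alternative, np.linspace takes the number of points rather than the step and always includes both endpoints. The names C_values and gamma_values are illustrative and not part of the lab.

```python
import numpy as np

# 0.05, 0.10, ..., 2.00: 40 evenly spaced values, both endpoints included
C_values = np.linspace(0.05, 2.0, 40)

# 0.001, 0.002, ..., 0.100: 100 evenly spaced values
gamma_values = np.linspace(0.001, 0.1, 100)

print(len(C_values), C_values[0], C_values[-1])
print(len(gamma_values), gamma_values[0], gamma_values[-1])
```

Either array can be iterated in a for-loop exactly like the result of np.arange.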

In [None]:
best_score = 0 
for i in np.arange(start = 0.05, stop = 2.05, step = 0.05):
    for j in np.arange(start = 0.001, stop = 0.101, step = 0.001):
        model = SVC(C = i, gamma = j)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_C = model.C
            best_gamma = model.gamma
print("The highest score obtained:", best_score)
print("C value:", best_C)
print("gamma value:", best_gamma)
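
Because the loop above picks C and gamma by scoring against the test set, the best score it reports tends to be optimistic. One hedged alternative is scikit-learn's GridSearchCV, which selects parameters by cross-validation on the training data only and leaves the test set for a final check. The sketch below uses synthetic data from make_classification as a stand-in for the Parkinson's features, and a much coarser grid to keep it quick; the data, the grid, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 rows, 22 numeric features, binary label
X_syn, y_syn = make_classification(n_samples=200, n_features=22, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=7)

# Coarse grid over the same C and gamma ranges used in the lab
param_grid = {
    "C": np.linspace(0.05, 2.0, 8),
    "gamma": np.linspace(0.001, 0.1, 10),
}

# 5-fold cross-validation on the training split only
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best cross-validated score:", search.best_score_)
print("Best parameters:", search.best_params_)
print("Held-out test score:", search.score(X_te, y_te))
```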

Wait a second. Pull open the dataset's label file from: https://archive.ics.uci.edu/ml/datasets/Parkinsons
Look at the units on those columns: Hz, %, Abs, dB, etc. What happened to transforming your data? With all of those units interacting with one another, some pre-processing is surely in order.
Right after you perform the train/test split but before you train your model, inject scikit-learn's pre-processing code. Unless you have a good idea which one is going to work best, you're going to have to try the various pre-processors one at a time, checking to see if they improve your predictive accuracy.
Experiment with Normalizer(), MaxAbsScaler(), MinMaxScaler(), KernelCenterer(), and StandardScaler().
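
The ordering described above (split first, then pre-process) can be sketched as follows: the scaler is fitted on the training rows only, and the same fitted scaler is then applied to the test rows, so no information from the test set leaks into the transformation. This is a minimal sketch on synthetic data with deliberately mismatched column scales; the data and the variable names (X_demo, X_tr_s, and so on) are assumptions for illustration, and StandardScaler can be swapped for any of the other pre-processors listed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): five columns on very different scales
rng = np.random.RandomState(7)
X_demo = rng.normal(size=(200, 5)) * np.array([1.0, 100.0, 0.01, 10.0, 1000.0])
y_demo = (X_demo[:, 0] + 100.0 * X_demo[:, 2] > 0).astype(int)

# 1. Split first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

# 2. Fit the scaler on the training rows only, then transform both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Train and score on the scaled data
demo_model = SVC().fit(X_tr_s, y_tr)
demo_score = demo_model.score(X_te_s, y_te)
print(demo_score)
```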

In [None]:
X = pd.read_csv('Datasets/parkinsons.data')
X.drop('name', axis = 1, inplace = True)

In [None]:
y = X.status
X.drop(labels=['status'], axis=1, inplace= True)

In [None]:
from sklearn import preprocessing
#T = preprocessing.Normalizer().fit_transform(X)
#T = preprocessing.MaxAbsScaler().fit_transform(X)
#T = preprocessing.MinMaxScaler().fit_transform(X)
#T = preprocessing.KernelCenterer().fit_transform(X)
T = preprocessing.StandardScaler().fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)

In [None]:
best_score = 0 
for i in np.arange(start = 0.05, stop = 2.05, step = 0.05):
    for j in np.arange(start = 0.001, stop = 0.101, step = 0.001):
        model = SVC(C = i, gamma = j)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_C = model.C
            best_gamma = model.gamma
print("The highest score obtained:", best_score)
print("C value:", best_C)
print("gamma value:", best_gamma)

The accuracy score keeps creeping upwards. Let's have one more go at it. Remember how in a previous lab we discovered that SVMs are a bit sensitive to outliers, and that just throwing all of our unfiltered, dirty, or noisy data at them, particularly in high-dimensionality space, can actually cause the accuracy score to suffer?
Well, let's try to get rid of some useless features. Immediately after you do the pre-processing, run PCA on your dataset. The original dataset has 22 columns and 1 label column, so try experimenting with PCA n_components values between 4 and 14. Are you able to get a better accuracy?
If you are not, then forget about PCA entirely. However, if you are able to get a higher score, then be sure to keep that accuracy score in mind, and comment out all the PCA code for now.
In the same spot, run Isomap on the data. Manually experiment with every inclusive combination of n_neighbors between 2 and 5, and n_components between 4 and 6. Are you able to get a better accuracy?
If you are not, then forget about Isomap entirely. However, if you are able to get a higher score, then be sure to keep that figure in mind.
If either PCA or Isomap helped you out, then uncomment the appropriate transformation code so that you have the highest accuracy possible.
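
One way to keep the scale, reduce, classify steps in a fixed order is scikit-learn's Pipeline, which fits every step on the training data and reuses the fitted steps when scoring. The following is a rough sketch under stated assumptions, not the lab's required approach: the data is synthetic (make_classification), the SVC uses default parameters, and the names pipe, pca_scores, and best_n are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 rows, 22 numeric features
X_syn, y_syn = make_classification(n_samples=200, n_features=22, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=7)

# Try each n_components value from 4 to 14 inclusive
pca_scores = {}
for n in range(4, 15):
    pipe = Pipeline([
        ("scale", StandardScaler()),    # pre-processing
        ("pca", PCA(n_components=n)),   # dimensionality reduction
        ("svc", SVC()),                 # classifier with default parameters
    ])
    pipe.fit(X_tr, y_tr)
    pca_scores[n] = pipe.score(X_te, y_te)

best_n = max(pca_scores, key=pca_scores.get)
print(best_n, pca_scores[best_n])
```

Isomap from sklearn.manifold also implements fit and transform, so it could be swapped in for the "pca" step to compare the two reductions under the same setup.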

PCA

In [246]:
X = pd.read_csv('Datasets/parkinsons.data')
X.drop('name', axis = 1, inplace = True)

In [247]:
y = X.status
X.drop(labels=['status'], axis=1, inplace= True)

In [248]:
from sklearn import preprocessing
T = preprocessing.StandardScaler().fit_transform(X)

In [249]:
best_score = 0

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

for m in range(4, 15):
    pca = PCA(n_components = m)
    X_pca = pca.fit_transform(T)

    X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size = 0.3, random_state = 7)        
            
    for i in np.arange(start = 0.05, stop = 2.05, step = 0.05):
        for j in np.arange(start = 0.001, stop = 0.101, step = 0.001):
            model = SVC(C = i, gamma = j)
            model.fit(X_train, y_train)
            score = model.score(X_test, y_test)
            
            if score > best_score:
                best_score = score
                best_C = model.C
                best_gamma = model.gamma
                best_pca_components = pca.n_components        
                print("The current highest score is:", best_score, "with", best_pca_components, "PCA components and ", best_C, "C and ", best_gamma, "gamma")
                
print("The highest score obtained:", best_score)
print("PCA n_components:", best_pca_components)
print("C value:", best_C)
print("gamma value:", best_gamma)

The current highest score is: 0.796610169492 with 4 PCA components and  0.05 C and  0.001 gamma
The current highest score is: 0.813559322034 with 4 PCA components and  0.15 C and  0.029 gamma
The current highest score is: 0.830508474576 with 4 PCA components and  0.15 C and  0.032 gamma
The current highest score is: 0.847457627119 with 4 PCA components and  0.15 C and  0.036 gamma
The current highest score is: 0.864406779661 with 4 PCA components and  0.2 C and  0.024 gamma
The current highest score is: 0.881355932203 with 4 PCA components and  0.2 C and  0.034 gamma
The current highest score is: 0.898305084746 with 4 PCA components and  0.55 C and  0.094 gamma
The current highest score is: 0.915254237288 with 5 PCA components and  0.45 C and  0.065 gamma
The current highest score is: 0.932203389831 with 7 PCA components and  1.75 C and  0.099 gamma
The highest score obtained: 0.932203389831
PCA n_components: 7
C value: 1.75
gamma value: 0.099


Isomap

In [250]:
X = pd.read_csv('Datasets/parkinsons.data')
X.drop('name', axis = 1, inplace = True)

In [251]:
y = X.status
X.drop(labels=['status'], axis=1, inplace= True)

In [252]:
from sklearn import preprocessing
T = preprocessing.StandardScaler().fit_transform(X)

In [253]:
best_score = 0

from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


for k in range(2, 6):
    for l in range(4, 7):
        iso = Isomap(n_neighbors = k, n_components = l)
        X_iso = iso.fit_transform(T)                
        
        X_train, X_test, y_train, y_test = train_test_split(X_iso, y, test_size = 0.3, random_state = 7)        
        

        for i in np.arange(start = 0.05, stop = 2.05, step = 0.05):
            for j in np.arange(start = 0.001, stop = 0.101, step = 0.001):
                model = SVC(C = i, gamma = j)
                model.fit(X_train, y_train)
                score = model.score(X_test, y_test)

                if score > best_score:
                    best_score = score
                    best_C = model.C
                    best_gamma = model.gamma
                    best_n_neighbors = iso.n_neighbors
                    best_n_components = iso.n_components
                    print("The current highest score is:", best_score, "with", best_n_neighbors, 
                          "neighbors and", best_n_components, "components and", best_C, "C and ", best_gamma, "gamma")

                       
print("The highest score obtained:", best_score)
print("isomap n_neighbors:", best_n_neighbors)
print("isomap n_components:", best_n_components)
print("C value:", best_C)
print("gamma value:", best_gamma)

The current highest score is: 0.796610169492 with 2 neighbors and  4 components and 0.05 C and  0.001 gamma
The current highest score is: 0.813559322034 with 2 neighbors and  4 components and 0.1 C and  0.007 gamma
The current highest score is: 0.830508474576 with 2 neighbors and  4 components and 0.1 C and  0.008 gamma
The current highest score is: 0.847457627119 with 2 neighbors and  4 components and 0.1 C and  0.01 gamma
The current highest score is: 0.864406779661 with 2 neighbors and  4 components and 0.15 C and  0.008 gamma
The current highest score is: 0.881355932203 with 2 neighbors and  4 components and 0.45 C and  0.009 gamma
The current highest score is: 0.898305084746 with 2 neighbors and  4 components and 0.45 C and  0.011 gamma
The current highest score is: 0.915254237288 with 2 neighbors and  4 components and 0.5 C and  0.014 gamma
The current highest score is: 0.932203389831 with 2 neighbors and  4 components and 0.65 C and  0.036 gamma
The current highest score is: 0.9