# Doing Machine Learning in Scikit Learn

Now we'll take a look at how to conduct the whole machine learning pipeline in python with scikit-learn and pandas.

We'll use Homework 3 from 15.680 as the guideline for this section.

## Classification: MAGIC Gamma Telescope
https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope

The data represents registration of high energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. Cherenkov gamma telescope observes high energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas, and developing in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruc- tion of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of few hundreds to some 10000 Cherenkov photons get collected, in patterns (called the shower image), allowing to discriminate statistically those caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). The aim is build a model that can distinguish between the signal and background cases.

Attribute Information:
1. fLength: continuous; major axis of ellipse [mm]
2. fWidth: continuous; minor axis of ellipse [mm]
3. fSize: continuous; 10-log of sum of content of all pixels [in ;phot]
4. fConc: continuous; ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous; ratio of highest pixel over fSize [ratio]
6. fAsym: continuous; distance from highest pixel to center, projected onto major axis [mm] 
7. fM3Long: continuous; 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous; 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous; angle of major axis with vector to origin [deg]
10. fDist: continuous; distance from origin to center of ellipse [mm] 
11. class: g,h; gamma (signal), hadron (background)


First, let's load in the data:

In [1]:
import pandas as pd
df = pd.read_csv("magic04.csv", header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610,g
2,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620,g
...,...,...,...,...,...,...,...,...,...,...,...
19015,21.3846,10.9170,2.6161,0.5857,0.3934,15.2618,11.5245,2.8766,2.4229,106.8258,h
19016,28.9452,6.7020,2.2672,0.5351,0.2784,37.0816,13.1853,-2.9632,86.7975,247.4560,h
19017,75.4455,47.5305,3.4483,0.1417,0.0549,-9.3561,41.0562,-9.4662,30.2987,256.5166,h
19018,120.5135,76.9018,3.9939,0.0944,0.0683,5.8043,-93.5224,-63.8389,84.6874,408.3166,h


Extract the X and y

In [3]:
X = df.iloc[:, :-1]
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610
2,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620
...,...,...,...,...,...,...,...,...,...,...
19015,21.3846,10.9170,2.6161,0.5857,0.3934,15.2618,11.5245,2.8766,2.4229,106.8258
19016,28.9452,6.7020,2.2672,0.5351,0.2784,37.0816,13.1853,-2.9632,86.7975,247.4560
19017,75.4455,47.5305,3.4483,0.1417,0.0549,-9.3561,41.0562,-9.4662,30.2987,256.5166
19018,120.5135,76.9018,3.9939,0.0944,0.0683,5.8043,-93.5224,-63.8389,84.6874,408.3166


In [4]:
y_orig = df.iloc[:, -1]
y_orig

0        g
1        g
2        g
3        g
4        g
        ..
19015    h
19016    h
19017    h
19018    h
19019    h
Name: 10, Length: 19020, dtype: object

The y is labelled g and h. Let's transform that to 0/1 labels. Luckily sklearn has an easy function for this:

In [5]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y_orig)
le.transform(y_orig)

array([0, 0, 0, ..., 1, 1, 1])

In [13]:
?preprocessing.LabelEncoder

In [6]:
y = le.transform(y_orig)
y

array([0, 0, 0, ..., 1, 1, 1])

Next, we need to split the data into training, validation and test. Again, sklearn has a function for this, it's basically the same as `splitobs` from MLDataUtils.jl:

In [7]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25)

In [8]:
train_X.shape

(14265, 10)

In [9]:
test_X.shape

(4755, 10)

We also need to split the training to get the actual training set and the validation set

In [10]:
tr_X, vl_X, tr_y, vl_y = train_test_split(train_X, train_y, test_size=0.33)

Now we can start running our methods!

### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression(solver='liblinear')
m.fit(tr_X, tr_y)

LogisticRegression(solver='liblinear')

In [12]:
from sklearn.metrics import accuracy_score, roc_auc_score
print(accuracy_score(vl_y, m.predict(vl_X)))
print(roc_auc_score(vl_y, m.predict_proba(vl_X)[:, 1]))

0.7833474936278675
0.8348806265479917


This logistic regression is regularized, so let's try it out with different regularization types and amounts

In [14]:
best_score = 0
for penalty in ['l1', 'l2']:
    for C in [1e-2, 1e0, 1e2, 1e4, 1e6, 1e8]:
        m = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        m.fit(tr_X, tr_y)
        score = roc_auc_score(vl_y, m.predict_proba(vl_X)[:, 1])
        print(penalty, '\t', str(C).rjust(12), '\t', score)
        if score > best_score:
            best_score = score
            best_penalty = penalty
            best_C = C
print(best_penalty, best_C)

l1 	         0.01 	 0.8196329838055033
l1 	          1.0 	 0.8344170771363781
l1 	        100.0 	 0.8340550975269648
l1 	      10000.0 	 0.8340443219487355
l1 	    1000000.0 	 0.834033346822761
l1 	  100000000.0 	 0.8340772473266589
l2 	         0.01 	 0.8212624906910977
l2 	          1.0 	 0.8348806265479917
l2 	        100.0 	 0.8349309125797293
l2 	      10000.0 	 0.8341175559711471
l2 	    1000000.0 	 0.8341141636594822
l2 	  100000000.0 	 0.834917143785325
l2 100.0


Now train the final model

In [15]:
m = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
m.fit(train_X, train_y)
print(accuracy_score(test_y, m.predict(test_X)))
print(roc_auc_score(test_y, m.predict_proba(test_X)[:, 1]))

0.7901156677181914
0.8357549242706427


We can also try cross validation to select the parameters. There's a really nice interface for this in scikit-learn:

In [16]:
from sklearn.model_selection import GridSearchCV
params = {
    'penalty': ['l1', 'l2'],
    'C': [1e-2, 1e0, 1e2, 1e4, 1e6, 1e8]
}
model = GridSearchCV(LogisticRegression(solver='liblinear'), params, cv=5, scoring='roc_auc')
model.fit(train_X, train_y)

GridSearchCV(cv=5, estimator=LogisticRegression(solver='liblinear'),
             param_grid={'C': [0.01, 1.0, 100.0, 10000.0, 1000000.0,
                               100000000.0],
                         'penalty': ['l1', 'l2']},
             scoring='roc_auc')

In [17]:
model.best_params_

{'C': 1000000.0, 'penalty': 'l2'}

And now we can get performance on the test set:

In [18]:
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.7901156677181914
0.8357510842342699


#### Tuning via regularization paths

You've probably experienced that it's annoying trying to tune the regularization penalties in our various regression models. We don't know where to start our values, how far apart to space them, etc.

There's another way of tuning these parameters that [some methods support](http://scikit-learn.org/stable/modules/grid_search.html#model-specific-cross-validation). Where possible it's usually easier to use these approaches.

The core idea is that it turns out to be possible to train the model for **all** values of the regularization penalty at once, for not much more cost than training the model once normally (or sometimes even less cost than training it once!). This is called finding the regularization path of the model, and from there we are able to identify the best parameter value without having to worry about specifying a range of possible values.

Let's see how it works!

In [19]:
from sklearn.linear_model import LogisticRegressionCV
m = LogisticRegressionCV(scoring="roc_auc", solver='liblinear', cv=5)
m.fit(train_X, train_y)

LogisticRegressionCV(cv=5, scoring='roc_auc', solver='liblinear')

We can then inspect the best parameter, the validation scores, as well as calculate the test performance

In [20]:
# The best value of C
m.C_

array([166.81005372])

In [21]:
# The different values tried for C
m.Cs_

array([1.00000000e-04, 7.74263683e-04, 5.99484250e-03, 4.64158883e-02,
       3.59381366e-01, 2.78255940e+00, 2.15443469e+01, 1.66810054e+02,
       1.29154967e+03, 1.00000000e+04])

In [22]:
# The scores for C in each fold
m.scores_

{1: array([[0.77782724, 0.81349914, 0.82123102, 0.82387665, 0.83244154,
         0.83460034, 0.83471989, 0.83473611, 0.83448513, 0.83473503],
        [0.79117709, 0.82976832, 0.84025077, 0.84514986, 0.85599688,
         0.8595691 , 0.86035181, 0.86034803, 0.86023768, 0.86032855],
        [0.77191772, 0.81322111, 0.82820994, 0.83210779, 0.84112111,
         0.84300134, 0.84289478, 0.84290343, 0.84292453, 0.84291263],
        [0.75516011, 0.79634964, 0.80959   , 0.81352605, 0.82249028,
         0.82431341, 0.82420744, 0.82488814, 0.82412526, 0.82413499],
        [0.77518366, 0.81279885, 0.82304068, 0.82659826, 0.83580201,
         0.83787276, 0.8380328 , 0.83805983, 0.83747105, 0.83746834]])}

In [23]:
print(accuracy_score(test_y, m.predict(test_X)))
print(roc_auc_score(test_y, m.predict_proba(test_X)[:, 1]))

0.7903259726603575
0.8357048117959773


## Feature normalization

We've seen that feature normalization can help, especially for linear regression-based methods. This is easy to do in sklearn with by using a `Pipeline` to compose steps in the model building process.

We'll use it to build a model that standardizes the data before fitting (zeroing the mean and setting unit variance). The main advantage of the pipeline approach is that we don't have to worry about applying the same transformation to our datasets, or de-transforming results.

In [25]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LogisticRegressionCV(scoring="roc_auc", solver='liblinear', cv=5))
pipeline.fit(train_X, train_y)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregressioncv',
                 LogisticRegressionCV(cv=5, scoring='roc_auc',
                                      solver='liblinear'))])

In [26]:
print(accuracy_score(test_y, pipeline.predict(test_X)))
print(roc_auc_score(test_y, pipeline.predict_proba(test_X)[:, 1]))

0.789484752891693
0.8360408149785994


## Other methods

We can use the same generic approaches to fit our other classification models

### CART

In [None]:
from sklearn.tree import DecisionTreeClassifier
params = {'min_samples_leaf': [10, 50, 100], 'max_depth': range(1, 11)}
model = GridSearchCV(DecisionTreeClassifier(), params, cv=5)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

### Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

### Boosting

In [27]:
from sklearn.ensemble import GradientBoostingClassifier
params = {'learning_rate': [0.1, 0.2, 0.3], 'n_estimators': [50, 100, 500]}
model = GridSearchCV(GradientBoostingClassifier(), params, cv=5)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.8759200841219769
0.926824266879456


# Regression: Parkinsons Telemonitoring
https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device for remote symptom progres- sion monitoring. The recordings were automatically captured in the patient’s homes.

The columns in the table are subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recording from these individuals. The main aim of the data is to predict the motor UPDRS scores (`motor_UPDRS`) from the 16 voice measures.

### Exercise

Conduct the same comparison of methods on the parkinsons dataset. 

#### Read in and prepare the data

In [None]:
df = pd.read_csv("parkinsons_updrs.csv")
df

In [None]:
X = df.iloc[:, 6:]
X

In [None]:
y = df.loc[:, 'motor_UPDRS']
y

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25)

#### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(train_X, train_y)

In [None]:
from sklearn.metrics import r2_score
r2_score(test_y, m.predict(test_X))

Let's look at the coefficients

In [None]:
m.coef_

#### Ridge regression

In [None]:
from sklearn.linear_model import RidgeCV
m = RidgeCV(cv=5)
m.fit(train_X, train_y)
print(r2_score(test_y, m.predict(test_X)))
print(m.coef_)
print(m.alpha_)

#### Lasso Regression

In [None]:
from sklearn.linear_model import LassoCV
m = LassoCV(cv=5)
m.fit(train_X, train_y)
print(r2_score(test_y, m.predict(test_X)))
print(m.coef_)
print(m.alpha_)

#### CART

In [None]:
from sklearn.tree import DecisionTreeRegressor
params = {'min_samples_leaf': [10, 50, 100], 'max_depth': range(1, 11)}
model = GridSearchCV(DecisionTreeRegressor(), params, cv=5)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

#### Random Forests

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=1000)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

#### Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
params = {'learning_rate': [0.1, 0.2, 0.3], 'n_estimators': [50, 100, 500]}
model = GridSearchCV(GradientBoostingRegressor(), params, cv=5)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

#### Neural Networks

In [None]:
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(max_iter=1000)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))