
In [None]:
#using Breast Cancer Wisconsin (Diagnostic) data from UCI
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                  header=None)

The file has 32 columns and no header row: column 0 is the sample ID number, column 1 is the target label (M = malignant, B = benign), and columns 2 through 31 hold the 30 features.

In [None]:
#use LabelEncoder to transform class label from 'M', 'B' to integers
from sklearn.preprocessing import LabelEncoder

#assign the 30 features to NumPy array
X = df.loc[:, 2:].values
y = df.loc[:, 1].values  #target labels

#initialize labelencoder object
le = LabelEncoder()

#transform target labels
y = le.fit_transform(y)

In [None]:
#double check the mapping
le.transform(['M', 'B'])

array([1, 0])

^ so M -> 1 and B -> 0
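
Another way to read the mapping is the encoder's fitted `classes_` attribute, where each label's position is its integer code. A minimal sketch on toy labels (the names `toy_labels` and `toy_le` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (illustrative only)
toy_labels = np.array(['M', 'B', 'B', 'M', 'B'])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so the index of each label is its integer code
mapping = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the string labels
decoded = toy_le.inverse_transform(encoded)
print(decoded)
```

`inverse_transform` is handy later on for turning model predictions back into 'M' / 'B'.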

In [None]:
#split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    stratify=y,
                                                    random_state=1)
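
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` in both the training and test sets. A small sketch on synthetic labels (the `X_demo` / `y_demo` arrays are made up for illustration, not the breast cancer data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70 of class 0 and 30 of class 1 (illustrative only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20,
                                          stratify=y_demo,
                                          random_state=1)

# Per-class counts in each split; both should keep the 70/30 ratio
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(f'train: {train_counts}, test: {test_counts}')
```

Without `stratify`, a random split can drift away from the original ratio, which matters most for small or imbalanced datasets.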

We will use a logistic regression model to predict 'Benign' or 'Malignant'. The features are measured on very different scales, which matters both for PCA (it is sensitive to feature variance) and for the regularized solver behind `LogisticRegression`, so we'll standardize the data first. Let's also say we want to compress the data from 30 dimensions down to two. Instead of running each transformation and the model fit separately for the training and test sets, we can chain the following objects into one pipeline:

- `StandardScaler`
- `PCA`
- `LogisticRegression`

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

#instantiate pipeline object
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression())
#fit to data, fit method
pipe_lr.fit(X_train, y_train)

#predict on test set, predict method
y_pred = pipe_lr.predict(X_test)

#get accuracy score, score method
test_acc = pipe_lr.score(X_test, y_test)

print(f'Test Accuracy: {test_acc}')

Test Accuracy: 0.956140350877193


`make_pipeline` is a function that takes any number of transformers (objects that have `fit` and `transform` methods) followed by a final estimator that has `fit` and `predict` methods, and returns a `Pipeline` object.

When we call the `fit` method on `pipe_lr`, `StandardScaler` fits to the training data and transforms it. The transformed data is then passed to `PCA`, which also fits to and transforms its input. Finally, the output of `PCA` is passed to the estimator, `LogisticRegression`, which fits to it.

When we pass a dataset to the pipeline's `predict` method, the data goes through the `transform` method of each transformer (NOT `fit`! The transformers and estimator have already been fit!) and then into the estimator, which returns predictions on the transformed data.
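
To make the fit/predict flow concrete, here is a sketch that performs the same steps by hand and compares the result with a pipeline. It runs on synthetic data from `make_classification` (an assumption for self-containment, not the breast cancer data), and the names `demo_pipe`, `scaler`, `pca` and `clf` are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20,
                                          stratify=y_demo, random_state=1)

# Pipeline version: one fit call, one predict call
demo_pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Manual version of fit: each transformer fits AND transforms the training data
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
X_tr_std = scaler.fit_transform(X_tr)
X_tr_pca = pca.fit_transform(X_tr_std)
clf.fit(X_tr_pca, y_tr)

# Manual version of predict: transform only, reusing what was learned on the training data
X_te_pca = pca.transform(scaler.transform(X_te))
manual_pred = clf.predict(X_te_pca)

print(f'Same predictions: {np.array_equal(pipe_pred, manual_pred)}')
```

The key point is that the test set is never used to fit the scaler or the PCA; the pipeline handles that bookkeeping for us.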

----

# K-fold cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr,           #the pipe object we created above
                         X=X_train,
                         y=y_train,
                         cv=10,                       #the number of folds
                         n_jobs=1)                    #number of CPUs to use
print(f'Accuracy scores: {scores}')

Accuracy scores: [0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
 0.97777778 0.93333333 0.95555556 0.95555556]


In [None]:
import numpy as np
#mean score +- one std deviation
print(f'Average Accuracy Estimate: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')

Average Accuracy Estimate: 0.950 +/- 0.014


^ the `:.3f` inside the `{}` is a format specifier: it prints the number in fixed-point notation, rounded to 3 decimal places.
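
A quick check of the `:.3f` format specifier on a couple of made-up values (the variable names are illustrative):

```python
value = 0.9501234

# .3f means fixed-point notation with three digits after the decimal point
formatted = f'{value:.3f}'
print(formatted)

# Only the displayed text is rounded; the stored number is unchanged
print(value)

# Unlike round(), the format specifier keeps trailing zeros
padded = f'{0.5:.3f}'
print(padded, round(0.5, 3))
```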

----

In [None]:
#same thing using StratifiedKFold iterator
from sklearn.model_selection import StratifiedKFold

#create the fold generator; split() yields (train, test) index arrays
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)

scores = []

for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print(f'fold: {k+1}, Class distr.: {np.bincount(y_train[train])}, '
          f'Accuracy: {score:.3f}')

fold: 1, Class distr.: [256 153], Accuracy: 0.935
fold: 2, Class distr.: [256 153], Accuracy: 0.935
fold: 3, Class distr.: [256 153], Accuracy: 0.957
fold: 4, Class distr.: [256 153], Accuracy: 0.957
fold: 5, Class distr.: [256 153], Accuracy: 0.935
fold: 6, Class distr.: [257 153], Accuracy: 0.956
fold: 7, Class distr.: [257 153], Accuracy: 0.978
fold: 8, Class distr.: [257 153], Accuracy: 0.933
fold: 9, Class distr.: [257 153], Accuracy: 0.956
fold: 10, Class distr.: [257 153], Accuracy: 0.956


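^ notice how the class distribution (benign vs. malignant counts) of the training split is nearly identical in every fold: the "stratified" version preserves the class proportions of the full training set in each fold. This is also why the accuracies match the `cross_val_score` results above: for a classifier with an integer `cv`, `cross_val_score` uses `StratifiedKFold` under the hood.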
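
To see what stratification changes, a minimal sketch can compare plain `KFold` with `StratifiedKFold` on toy labels that are sorted by class. The data and the helper `fold_class_counts` are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 60 of class 0, then 40 of class 1 (illustrative only)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))  # feature values do not affect how the folds are built

def fold_class_counts(splitter):
    """Return the per-class sample counts of each test fold (hypothetical helper)."""
    return [np.bincount(y_demo[test], minlength=2).tolist()
            for _, test in splitter.split(X_demo, y_demo)]

plain_counts = fold_class_counts(KFold(n_splits=5))
strat_counts = fold_class_counts(StratifiedKFold(n_splits=5))

print(f'KFold:           {plain_counts}')
print(f'StratifiedKFold: {strat_counts}')
```

With unshuffled `KFold`, each test fold here contains only one class, while every `StratifiedKFold` test fold keeps the 60/40 ratio.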