### 14-Cancer Gene expression

Cancer Microarray Classification Problem (hard) 
- Training set of patients with 14 different types of cancer
- Gene expression measurements available for 16,063 genes at the time the dataset snapshot was taken 



**14 Cancer Info**

14-cancer  gene expression data. **16,063 genes, 144 training samples**,
54 test samples. 

One gene per row, one sample per column

Cancer classes are labelled as follows:

1.  breast
2.  prostate
3.  lung
4.  colorectal (spelled `collerectal` in the label file)
5.  lymphoma
6.  bladder
7.  melanoma
8.  uterus
9.  leukemia
10. renal
11. pancreas
12. ovary
13. meso
14. cns

Reference:

S. Ramaswamy and P.  Tamayo and  R. Rifkin and S. Mukherjee and C.H. Yeang and
M. Angelo and C. Ladd and M. Reich and E. Latulippe and J.P. Mesirov and
T. Poggio and W. Gerald and M. Loda and E.S. Lander and  T.R. Golub (2001)

Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures

Proc. Natl. Acad. Sci., 98, p15149-15154.


The primary goal is to train several linear classification algorithms, tune their parameters with cross-validation, and compare their performance on the test set. For this dataset (as with the Prostate Cancer data) we consider a shrunken centroid classifier, a standard-based model, L1- and L2-penalized discriminant analysis, and K-nearest neighbors.

In [None]:
""" Dataset

In a gene expression dataset we have thousands of rows representing individual genes 
and tens of columns representing samples. 

Graph of DNA microarray data: expression of genes (6830 rows) and samples 
(64 columns) for the human tumor data. 
Use a random sample of 100 rows for display. Build a heatmap, as rendered in R,
ranging from green (negative, underexpressed) to red (positive, overexpressed).
Use the colors from the R package. Missing values are shown in gray. 
Rows and columns are displayed in randomly chosen order. 

""" 

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
from matplotlib import gridspec, transforms
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, BoundaryNorm 
%matplotlib inline 
plt.rcParams['axes.linewidth'] = 0.5
gray_, gray__, gray___, blue_, blue__, purple_, orange_ = \
  "#231F20", "#646379", "#929495", '#57B5E8', '#bdd7e7', "#A021F1",  '#E69E00'
""" turn off numpy warnings for clean output """ 
import warnings  # np.warnings was removed in NumPy 1.24; use the stdlib module
warnings.filterwarnings('ignore')

In [None]:
""" process 14-cancer data 
train: 4764 rows, test_data:16063 rows,  
"""  
sample_columns = [f's{i+1}' for i in range(144)]
dataset_14Cancer_xtrain = pd.read_csv(
    '14cancer-xtrain.csv', sep=r'\s+', names=sample_columns)
dataset_14Cancer_xtrain.head(1), 

(   s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  ...  s135  s136  s137  s138  s139  \
 0 -73 -16   4 -31 -33 -37 -18 -26 -40   22  ...   274   -96   -96  -124  -201   
 
    s140  s141  s142  s143  s144  
 0  -196    34   -56  -245   -26  
 
 [1 rows x 144 columns],)

In [None]:
dataset_14Cancer_ytrain = pd.read_csv(
    '14cancer-ytrain.csv', sep=r'\s+', names=sample_columns)

In [None]:
""" cross validation - the 144 samples are split into 8 CV folds """
cv_indices = np.array([
[5, 2, 1, 3, 6, 4, 7, 8],             [14, 15, 12,  9, 11, 16, 10, 13], 
[23, 19, 20, 17, 21, 24, 18, 22],     [31, 32, 29, 28, 26, 30, 25, 27],
[35, 48, 38, 46, 42, 34, 47, 33],     [44, 45, 41, 40, 37, 43, 39, 36],
[55, 56, 49, 51, 53, 50, 52, 54],     [63, 59, 64, 61, 60, 62, 57, 58],
[69, 71, 67, 66, 72, 68, 70, 65],     [87, 91, 76, 86, 81, 88, 83, 96],
[92, 74, 89, 93, 95, 84, 79, 73],     [85, 90, 75, 77, 82, 94, 80, 78],
[99, 103,  98, 100,  97, 104, 102, 101],  
[105, 111, 106, 109, 107, 112, 108, 110],
[117, 118, 120, 113, 116, 115, 119, 114],
[128, 121, 122, 124, 125, 127, 123, 126],
[133, 139, 137, 138, 132, 142, 144, 135],
[136, 129, 130, 134, 141, 131, 143, 140],
])
cv_indices = (cv_indices.T -1).tolist()
cv_folds = []
for k in range(len(cv_indices)):
  # training indices: every fold except fold k, flattened into one list
  train = [idx for fold in cv_indices[:k] + cv_indices[k+1:] for idx in fold]
  cv_folds.append([train, cv_indices[k]])
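
As a sanity check on the fold construction, the sketch below repeats the same transpose-and-split logic on a small, made-up index matrix and verifies that each fold's train and test indices are disjoint and together cover every sample. The 3 x 4 matrix and the helper name `build_folds` are illustrative assumptions, not part of the notebook.

```python
def build_folds(index_matrix):
    """Column j of the 1-based index matrix becomes test fold j (0-based ids)."""
    columns = [[idx - 1 for idx in col] for col in zip(*index_matrix)]
    folds = []
    for k, test in enumerate(columns):
        train = [idx for j, col in enumerate(columns) if j != k for idx in col]
        folds.append((train, test))
    return folds

# Hypothetical layout: 3 rows x 4 folds (12 samples), mimicking cv_indices
toy_indices = [
    [2, 1, 3, 4],
    [7, 8, 5, 6],
    [9, 12, 10, 11],
]
toy_folds = build_folds(toy_indices)

for train, test in toy_folds:
    assert not set(train) & set(test)               # train and test are disjoint
    assert sorted(train + test) == list(range(12))  # every sample is used once
print(len(toy_folds), [len(test) for _, test in toy_folds])
```

With the notebook's 18 x 8 matrix the same logic yields 8 folds of 18 test samples and 126 training samples each.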

In [None]:
"""
train fold  126 
[4, 13, 22, 30, 34, 43, 54, 62, 68, 86, 91, 84, 98, 104, 116, 127, 132, 135, 1, 
14, 18, 31, 47, 44, 55, 58, 70, 90, 73, 89, 102, 110, 117, 120, 138, 128, 0, 11, 
19, 28, 37, 40, 48, 63, 66, 75, 88, 74, 97, 105, 119, 121, 136, 129, 5, 10, 20, 
25, 41, 36, 52, 59, 71, 80, 94, 81, 96, 106, 115, 124, 131, 140, 3, 15, 23, 29, 
33, 42, 49, 61, 67, 87, 83, 93, 103, 111, 114, 126, 141, 130, 6, 9, 17, 24, 46, 
38, 51, 56, 69, 82, 78, 79, 101, 107, 118, 122, 143, 142, 7, 12, 21, 26, 32, 35, 
53, 57, 64, 95, 72, 77, 100, 109, 113, 125, 134, 139]
"""

In [None]:
""" auxiliary functions: cross-validation error count (out of the 144 training 
    samples), its standard error, and test error count (out of the 54 test 
    samples); print the statistics """ 
def calculate_cross_validation_errors(grid_search):
  # each of the 8 folds holds 18 samples, so 18*(1 - accuracy) is an error count
  cv_errors = 18*(1 -np.stack(
      [grid_search.cv_results_[f'split{i}_test_score'] for i in range(8)]).T)
  best_cv_errors = cv_errors[grid_search.best_index_, :]
  cv_errors_count = np.sum(best_cv_errors)
  cv_errors_count_std = np.sqrt(np.var(best_cv_errors, ddof=1)*8)
  # ravel keeps a (54, 1) y_test from broadcasting into a (54, 54) comparison
  test_errors_count = np.sum(
      grid_search.best_estimator_.predict(X_test) != np.ravel(y_test))
  return cv_errors_count, cv_errors_count_std, test_errors_count

def print_cv_stats(grid_search):
  cv_errors_count, cv_errors_count_std, test_errors_count = \
                calculate_cross_validation_errors(grid_search)
  print(grid_search.best_params_)
  print(f'CV errors {cv_errors_count} ({cv_errors_count_std}) '
        f'Test errors {test_errors_count}')

In [None]:
sample_columns_test = [f's{i+1}' for i in range(54)]
dataset_14Cancer_xtest = pd.read_csv(
    '14cancer-xtest.csv', sep=r'\s+', names=sample_columns_test)
dataset_14Cancer_ytest = pd.read_csv(
    '14cancer-ytest.csv', sep=r'\s+', names=sample_columns_test)
dataset_14Cancer_xtest.head(1), dataset_14Cancer_ytest

(   s1  s2    s3  s4  s5  s6  s7  s8  s9  s10  ...  s45  s46  s47  s48  s49  \
 0 -44 -13 -64.3 -22 -28 -94 -59 -19 -24  -30  ...  -74 -157  -76  -24  -37   
 
    s50  s51  s52  s53  s54  
 0  -91 -107 -151 -124   17  
 
 [1 rows x 54 columns],
    s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  ...  s45  s46  s47  s48  s49  s50  \
 0   1   1   1   2   2   3   3   3   4    4  ...   14   14   12    4    3    2   
 
    s51  s52  s53  s54  
 0    2    2    2    1  
 
 [1 rows x 54 columns])

In [None]:
X = dataset_14Cancer_xtrain[sample_columns].values 
y_classtypes = [x+1 for x in range(14)]
y = dataset_14Cancer_ytrain[sample_columns].values
y_labels = { 
 '1': 'breast', '2': 'prostate', '3': 'lung', '4': 'collerectal', '5': 'lymphoma', 
 '6': 'bladder', '7': 'melanoma', '8': 'uterus', '9': 'leukemia', '10': 'renal', 
 '11': 'pancreas', '12': 'ovary', '13': 'meso', '14': 'cns' 
} 
X, y, X.shape, y.shape

(array([[  -73.,   -16.,     4., ...,   -56.,  -245.,   -26.],
        [  -69.,   -63.,   -45., ...,  -818.,  -235., -1595.],
        [  -48.,   -97.,  -112., ..., -1338.,  -127., -2085.],
        ...,
        [  -21.,     8.,   -32., ...,    -9.,   -53.,   -57.],
        [   94.,   153.,    16., ...,    75.,    98.,    16.],
        [  760.,  1197.,   154., ...,   992.,   901.,   243.]]),
 array([[ 1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,
          3,  3,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,
          5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,
          6,  6,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,
          8,  8,  8,  8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,  9,
          9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,
         10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11,
         12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13,
         14, 14, 14, 14

In [None]:
""" breast cancer and cns samples """ 
y[0, :8], y[0, 128:144], 

(array([1, 1, 1, 1, 1, 1, 1, 1]),
 array([14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14]))

In [None]:
""" labeling and indexing """ 
labels_ = pd.read_csv('14Cancer_label.txt', names=['cancer_type'])
labels_['cancer_type'] = labels_['cancer_type'].str.strip()
labels_grp = labels_.copy()

labels_grp.loc[labels_grp['cancer_type'] == 'breast', 'type'] = 1
labels_grp.loc[labels_grp['cancer_type'] == 'prostate', 'type'] = 2 
labels_grp.loc[labels_grp['cancer_type'] == 'lung', 'type'] = 3 
labels_grp.loc[labels_grp['cancer_type'] == 'collerectal', 'type'] = 4 
labels_grp.loc[labels_grp['cancer_type'] == 'lymphoma', 'type'] = 5 
labels_grp.loc[labels_grp['cancer_type'] == 'bladder', 'type'] = 6 
labels_grp.loc[labels_grp['cancer_type'] == 'melanoma', 'type'] = 7 
labels_grp.loc[labels_grp['cancer_type'] == 'uterus', 'type'] = 8 
labels_grp.loc[labels_grp['cancer_type'] == 'leukemia', 'type'] = 9 
labels_grp.loc[labels_grp['cancer_type'] == 'renal', 'type'] = 10 
labels_grp.loc[labels_grp['cancer_type'] == 'pancreas', 'type'] = 11 
labels_grp.loc[labels_grp['cancer_type'] == 'ovary', 'type'] = 12 
labels_grp.loc[labels_grp['cancer_type'] == 'meso', 'type'] = 13 
labels_grp.loc[labels_grp['cancer_type'] == 'cns', 'type'] = 14 

y_grp = labels_grp['cancer_type'].values
labels_grp.loc, y_grp, y_grp[13]

(<pandas.core.indexing._LocIndexer at 0x7fe90fe4dc20>,
 array(['breast', 'prostate', 'lung', 'collerectal', 'lymphoma', 'bladder',
        'melanoma', 'uterus', 'leukemia', 'renal', 'pancreas', 'ovary',
        'meso', 'cns'], dtype=object),
 'cns')

In [None]:
""" sampling """ 
random_rows = np.random.choice(range(X.shape[0]), 50, replace=False)
random_columns = np.random.choice(range(X.shape[1]), X.shape[1], replace=False)

In [None]:
import sklearn
from sklearn import cluster
from sklearn.cluster import KMeans  
from collections import Counter 
n_clusters = 14
k_means = KMeans(n_clusters=n_clusters, random_state=1).fit(X.T)
clusters = k_means.predict(X.T)
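# note: k-means cluster ids are arbitrary; pairing cluster c with y_grp[c]
# only reuses the class names as display labels for the cluster sizes
cluster_counts = {}
for c in range(14): cluster_counts[y_grp[c]] = Counter(clusters == c)[1]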

[(y_grp[c], Counter(clusters == c)) for c in range(14)], 
[(y_grp[c], Counter(clusters == c)[1]) for c in range(14)], 
cluster_counts,

({'breast': 12,
  'prostate': 38,
  'lung': 6,
  'collerectal': 4,
  'lymphoma': 6,
  'bladder': 12,
  'melanoma': 11,
  'uterus': 24,
  'leukemia': 4,
  'renal': 5,
  'pancreas': 5,
  'ovary': 5,
  'meso': 11,
  'cns': 1},)
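
Because K-means cluster ids are arbitrary, cluster `c` need not correspond to cancer class `c + 1`. A contingency table of true class against cluster id is one way to see how the clusters actually line up with the cancer types. The sketch below uses made-up labels and a hypothetical helper `contingency_table`; in the notebook one would pass the training labels and `clusters` instead.

```python
from collections import Counter

def contingency_table(true_labels, cluster_ids):
    """Count how many samples of each true class fall into each cluster."""
    pairs = Counter(zip(true_labels, cluster_ids))
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    return {c: [pairs[(c, k)] for k in clusters] for c in classes}

# Made-up example: 3 classes, 3 clusters, 6 samples
toy_true = ['breast', 'breast', 'lung', 'lung', 'cns', 'cns']
toy_clusters = [2, 2, 0, 1, 1, 1]
table = contingency_table(toy_true, toy_clusters)
for name, row in table.items():
    print(f'{name:8s}', row)   # one row per class, one column per cluster id
```

A class whose samples sit mostly in a single column is well recovered by the clustering; a class spread over many columns is not.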

In [None]:
""" process all data """ 


In [None]:
""" dataset 14Cancer 
one gene per row, one sample per column, labelled as follows:
1.breast    2.prostate  3.lung    4.collerectal 5.lymphoma 
6.bladder   7.melanoma  8.uterus  9.leukemia    10.renal  
11.pancreas 12.ovary    13.meso   14.cns

Use sklearn for the implementation. 
Implement the nearest centroid model - same implementation as in the book, 
the output task is slightly different.
Each class is represented by its centroid, with test samples classified 
to the class with the nearest centroid. 
Params: Delta for shrinking centroids to remove features, 
        all classes represented in the training set, centroids of each class, 
        mean of each feature, class prior probabilities, class variances of the 
        features, shrunken centroids of each class, the indices of features 
        that are not shrunken to the overall centroid
Calculate cross-validation errors, standard errors and test errors. 
Fit the centroid model to the training data - L2PenalizedDiscriminantAnalysis
Training X: training matrix of shape (n_samples, n_features)
Predict: target values as integers
Implement the penalized discriminant classifier and grid search [pipeline].
Perform classification on an array of test vectors X; the predicted class 
for each sample in X is returned.
""" 
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import PredefinedSplit, GridSearchCV, StratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier 
from sklearn.preprocessing import StandardScaler, LabelEncoder, Normalizer 
from sklearn.metrics import accuracy_score, label_ranking_loss
from sklearn.pipeline import Pipeline
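
The docstring above describes a nearest shrunken centroid classifier. scikit-learn's `NearestCentroid` (imported above) exposes a `shrink_threshold` parameter that applies this kind of centroid shrinkage, so it can serve as a rough stand-in. The sketch below tunes it on synthetic data; the data, the threshold grid, and the variable names are assumptions for illustration, and the threshold scale is not guaranteed to match the book's delta.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n_per_class, p, K = 20, 100, 3
y_toy = np.repeat(np.arange(K), n_per_class)
# only the first 5 features carry class signal; the rest are pure noise
centers = np.zeros((K, p))
centers[:, :5] = rng.normal(scale=4.0, size=(K, 5))
X_toy = centers[y_toy] + rng.normal(size=(y_toy.size, p))  # samples in rows

search = GridSearchCV(
    NearestCentroid(),
    {'shrink_threshold': [0.5, 1.0, 2.0, 4.0]},
    cv=4, scoring='accuracy'
).fit(X_toy, y_toy)

best = search.best_estimator_
# a feature no longer separates classes when all class centroids coincide on it
spread = best.centroids_.max(axis=0) - best.centroids_.min(axis=0)
print(search.best_params_, 'features kept:', int(np.sum(spread > 0)))
```

For the 14-cancer data the same call would take `X_train`, `y_train`, and `cv=cv_folds` once those are defined.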

In [None]:
class L2PenalizedDiscriminantAnalysis(BaseEstimator, ClassifierMixin):
    def __init__(self, lam=0.01):
      # lam blends the pooled covariance with its diagonal; the name matches
      # the 'classifier__lam' grid-search key, the default 0.01 is an assumption
      self.lam = lam 
   
    def fit(self, X, y): 
      """ X: (n_samples, n_features), y: class labels """ 
      # replace target values with classes 0, 1 .. K-1
      label_encoder = LabelEncoder()
      y = label_encoder.fit_transform(np.ravel(y))
      self.classes_ = label_encoder.classes_
      K, N = self.classes_.size, X.shape[0]
      # class prior probabilities
      _, counts_elements = np.unique(y, return_counts=True)
      self.priors_ = counts_elements/N
      # transform the predictors using SVD: X = U D V^T, so R = U D = X V
      u, s, vh = np.linalg.svd(X, full_matrices=False)
      U, D, V = u, np.diag(s), vh.T
      R = U @ D 
      # class means and pooled within-class covariance in the rotated space
      means = [np.mean(R[y == k], axis=0) for k in range(K)]
      covariances = np.zeros((R.shape[1], R.shape[1])) 
      for k in range(K):
        centered = R[y == k] - means[k]
        covariances += (centered.T @ centered) / (N-K)
      # shrink the covariance towards its diagonal
      covariances = self.lam * covariances + \
                    (1 - self.lam) * np.diag(np.diag(covariances))
      covariances_inverse = np.linalg.inv(covariances)

      self.coefficient_, self.intercept_ = [], []
      for k in range(K):
        m_k = np.atleast_2d(means[k]).T 
        p_k = self.priors_[k] 
        self.coefficient_.append(V @ covariances_inverse @ m_k)
        self.intercept_.append(
                          -0.5*m_k.T @ covariances_inverse @ m_k + np.log(p_k))
      self.coefficient_ = np.hstack(self.coefficient_)
      self.intercept_ = np.hstack(self.intercept_)
      return self

    def predict(self, X): 
      """ return the class with the largest linear discriminant score """ 
      scores = X @ self.coefficient_ + self.intercept_
      return self.classes_[np.argmax(scores, axis=1)]

In [None]:
""" replace target values with classes 0, 1 .. k """ 
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y.T)
#y = y.reshape(1, y.shape[0])
y.shape

(144,)

In [None]:
classes_ = label_encoder.classes_
K, N, p = classes_.size, X.shape[0], X.shape[1]
""" calculate classes prior - 14 total - posterior prob. """ 
elements, counts_elements = np.unique(y, return_counts=True)
priors = counts_elements/y.shape[0]
""" transform the predictors using SVD """ 
u, s, vh = np.linalg.svd(X, full_matrices=False)
U, D, V = u, np.diag(s), vh.T
R = U @ D 

classes_, counts_elements, K, N, p, priors, R,

(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 array([ 8,  8,  8,  8, 16,  8,  8,  8, 24,  8,  8,  8,  8, 16]),
 14,
 4763,
 144,
 array([0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.11111111,
        0.05555556, 0.05555556, 0.05555556, 0.16666667, 0.05555556,
        0.05555556, 0.05555556, 0.05555556, 0.11111111]),
 array([[-7.50334082e+02,  4.08265739e+02,  1.68248514e+02, ...,
         -2.12256892e+01, -2.48580477e+01, -2.72575536e+00],
        [-2.64540736e+03,  1.90876542e+03, -2.04994750e+02, ...,
         -3.19871644e+02, -2.72590684e+02,  2.71919452e+01],
        [-2.22533907e+03,  1.22033097e+03, -7.13821992e+02, ...,
         -1.92924413e+02, -1.70551878e+02, -2.23863434e+01],
        ...,
        [-1.10805585e+02,  5.56828579e+01,  7.80790322e+01, ...,
          2.43992604e+01,  1.20567857e+01, -9.55940437e-01],
        [ 1.34953509e+03, -1.91866994e+02, -7.00728565e+00, ...,
         -1.98704384e-01, -3.92393251e+01,  1.11557708e+01],
        [ 1.5

In [None]:
np.mean(R[:, 0:8], axis=0), np.mean(R[:, 128:144], axis=0),

(array([4063.55558524,   37.6613797 ,  163.96755797,  468.86296694,
          21.16343375,  135.40473937,   -5.65240022,  128.53328871]),
 array([-4.84270305,  0.77016689, -1.18387042, -2.49376477, -5.41445385,
        -2.89366795,  0.97953868, -1.15712376,  3.61022107, -0.29235715,
         2.69638216,  1.36074671,  1.00514961, -0.87753711,  1.283584  ,
        -0.73094995]))

In [None]:
np.mean(R[:, y==0], axis=0), np.mean(R[:, y==13], axis=0),

(array([4063.55558524,   37.6613797 ,  163.96755797,  468.86296694,
          21.16343375,  135.40473937,   -5.65240022,  128.53328871]),
 array([-4.84270305,  0.77016689, -1.18387042, -2.49376477, -5.41445385,
        -2.89366795,  0.97953868, -1.15712376,  3.61022107, -0.29235715,
         2.69638216,  1.36074671,  1.00514961, -0.87753711,  1.283584  ,
        -0.73094995]))

In [None]:
""" calculate means and covariance matrices
temporary work around - use copy() transposed """ 
means = [np.mean(R[:, y ==i], axis=0) for i in range(K)]
covs = np.zeros((N, N)) 
len(means), means[13]  

(14,
 array([-4.84270305,  0.77016689, -1.18387042, -2.49376477, -5.41445385,
        -2.89366795,  0.97953868, -1.15712376,  3.61022107, -0.29235715,
         2.69638216,  1.36074671,  1.00514961, -0.87753711,  1.283584  ,
        -0.73094995]))

In [None]:
for k in range(6):
  R_k = R[:, y == k]
  """ delta in the alg: index size, and index in R_k """ 
  for i in range(R_k.shape[1]):
    mx_ = np.atleast_2d(
        R_k[i, :] - means[k]
    )
    covs += (mx_@mx_.T) /(N-K)
covs.shape

(4763, 4763)

In [None]:
covs = 0.01 * covs + (1 - 0.01) * np.diag(np.diag(covs))
covs_inverse = np.linalg.inv(covs)

In [None]:
priors[13], np.atleast_2d(means[0]).shape, vh.shape

(0.1111111111111111, (1, 8), (144, 144))

In [None]:
a_ =np.atleast_2d(means[0]).T
means[0], np.atleast_2d(means[0]).T, 
a_.T@a_ # scalar 

array([[16795955.30597055]])

In [None]:
coefficient_, intercept_ = [], []
for k in range(K):
  m_k = np.atleast_2d(means[k]).T
  p_k = priors[k] 
  #V @ covs_inverse @ m_k

In [None]:
coefficient_, intercept_ = [], []
for k in range(K):
  m_k = np.atleast_2d(means[k]).T 
  p_k = priors[k] 
  # shape mismatch: V is (144, 144), covs_inverse (4763, 4763), m_k (8..24, 1)
  coefficient_.append(V @ covs_inverse @ m_k)
  # shape mismatch: m_k.T is (1, 8..24), covs_inverse (4763, 4763), m_k (8..24, 1)
  intercept_.append(-0.5*m_k.T @ covs_inverse @ m_k + np.log(p_k))

coefficient_, intercept_ = np.hstack(coefficient_),  np.hstack(intercept_)
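
The shape mismatches flagged in the comments above arise because the steps are run on `X` with genes as rows (4763 x 144). With samples as rows, the same SVD, class means, pooled covariance, and discriminant coefficients all have conforming shapes. The sketch below walks through those steps on synthetic data; the sizes, the random seed, and the fixed blending weight `lam = 0.01` are assumptions for illustration, not results from the 14-cancer data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 30, 200, 3                      # samples, features, classes (toy sizes)
y = np.repeat(np.arange(K), N // K)       # 10 samples per class
centers = rng.normal(scale=3.0, size=(K, p))
X = centers[y] + rng.normal(size=(N, p))  # rows are samples

# SVD of the samples-by-features matrix: X = U D V^T, so R = U D = X V
u, s, vh = np.linalg.svd(X, full_matrices=False)
V = vh.T                                  # (p, N)
R = u * s                                 # (N, N), equal to u @ np.diag(s)

# class means and pooled within-class covariance in the rotated space
means = np.stack([R[y == k].mean(axis=0) for k in range(K)])   # (K, N)
covs = np.zeros((N, N))
for k in range(K):
    centered = R[y == k] - means[k]
    covs += centered.T @ centered / (N - K)

lam = 0.01                                # assumed blending weight
covs = lam * covs + (1 - lam) * np.diag(np.diag(covs))
covs_inverse = np.linalg.inv(covs)

priors = np.bincount(y) / N
coefficient = V @ covs_inverse @ means.T                        # (p, K)
intercept = -0.5 * np.sum(means @ covs_inverse * means, axis=1) + np.log(priors)

scores = X @ coefficient + intercept      # (N, K) linear discriminant scores
y_hat = scores.argmax(axis=1)
print(coefficient.shape, intercept.shape, (y_hat == y).mean())
```

Working in the rotated space keeps the covariance at N x N rather than p x p, which is what makes the inverse tractable when there are far more genes than samples.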

In [None]:
""" Shrunken Centroids 

shrunken_centroids_clf = Pipeline([
  ('norm', Normalizer()), 
  ('classifier', ShrunkenCentroid())
])

shrunken_centroids_gridsearch = GridSearchCV(
  shrunken_centroids_clf, 
  {'classifier__C': np.linspace(0.1, 0.9, 9)}, 
  cv = cv_folds, scoring='accuracy'
).fit(X_train, y_train)
shrunken_centroids_n_genes = \
        shrunken_centroids_gridsearch.best_estimator_[1].feature_used_.size
print('Genes used', shrunken_centroids_n_genes)
print_cv_stats(shrunken_centroids_gridsearch)

""" 

In [None]:
""" Linear Penalized Discriminant Analysis 

pd_clf = Pipeline([
  ('norm', Normalizer()), 
  ('classifier', L2PenalizedDiscriminantAnalysis())
])

pd_clf_gridsearch = GridSearchCV(
  pd_clf, 
  {'classifier__lam': np.linspace(0.1, 0.9, 9)}, 
  cv = cv_folds, scoring='accuracy'
).fit(X_train, y_train)
print_cv_stats(pd_clf_gridsearch)
""" 


In [None]:
X = dataset_14Cancer_xtrain[sample_columns].values 
y = dataset_14Cancer_ytrain[sample_columns].values
y =  y.astype(int)
X, y

In [None]:
X_test = dataset_14Cancer_xtest[sample_columns_test].values 
y_test = dataset_14Cancer_ytest[sample_columns_test].values
y_test =  y_test.astype(int)
X_test, y_test  = X_test.T, y_test.ravel()  # labels flattened to shape (54,)
X_test, y_test

In [None]:
X_train, y_train  = X.T, y.ravel()  # labels flattened to shape (144,)

X, y_train, X_train.shape, y_train.shape, #8 CV folds 142 columns 

In [None]:
""" Support Vector Classifier """
support_vector_clf = Pipeline([
    ('norm', Normalizer()), 
    ('classifier', LinearSVC(tol=1e-3))
])

support_vector_gridsearch = GridSearchCV(
    support_vector_clf, 
    {'classifier__C': np.linspace(100, 3000, 3)},
    cv = cv_folds, scoring='accuracy'
).fit(X_train, y_train)
#print_cv_stats(support_vector_gridsearch)
support_vector_gridsearch.cv_results_, 

({'mean_fit_time': array([3.47678742, 3.63547033, 3.7692219 ]),
  'std_fit_time': array([0.49647574, 0.4257881 , 0.63155643]),
  'mean_score_time': array([0.00395453, 0.00400323, 0.00385669]),
  'std_score_time': array([0.00217638, 0.00376773, 0.00195273]),
  'param_classifier__C': masked_array(data=[100.0, 1550.0, 3000.0],
               mask=[False, False, False],
         fill_value='?',
              dtype=object),
  'params': [{'classifier__C': 100.0},
   {'classifier__C': 1550.0},
   {'classifier__C': 3000.0}],
  'split0_test_score': array([0.72222222, 0.72222222, 0.72222222]),
  'split1_test_score': array([0.83333333, 0.83333333, 0.83333333]),
  'split2_test_score': array([0.88888889, 0.88888889, 0.88888889]),
  'split3_test_score': array([0.88888889, 0.88888889, 0.88888889]),
  'split4_test_score': array([0.77777778, 0.77777778, 0.77777778]),
  'split5_test_score': array([0.72222222, 0.72222222, 0.72222222]),
  'split6_test_score': array([0.66666667, 0.66666667, 0.66666667]),
 

In [None]:
""" K Nearest Neighbors """ 
k_nearest_neighbors_clf = Pipeline([
    ('norm', Normalizer()), 
    ('classifier', KNeighborsClassifier())
])

k_nn_gridsearch = GridSearchCV(
    k_nearest_neighbors_clf, 
    {'classifier__n_neighbors': list(range(1, 5))}, 
    cv=cv_folds, scoring='accuracy'
).fit(X_train, y_train)
#print_cv_stats(k_nn_gridsearch)
k_nn_gridsearch.cv_results_, 

({'mean_fit_time': array([0.00980982, 0.01503885, 0.01113248, 0.00955918]),
  'std_fit_time': array([0.0026021 , 0.00510716, 0.00548386, 0.00158557]),
  'mean_score_time': array([0.00797987, 0.02752727, 0.01094353, 0.05292669]),
  'std_score_time': array([0.00393635, 0.02818853, 0.00673865, 0.02716431]),
  'param_classifier__n_neighbors': masked_array(data=[1, 2, 3, 4],
               mask=[False, False, False, False],
         fill_value='?',
              dtype=object),
  'params': [{'classifier__n_neighbors': 1},
   {'classifier__n_neighbors': 2},
   {'classifier__n_neighbors': 3},
   {'classifier__n_neighbors': 4}],
  'split0_test_score': array([0.66666667, 0.66666667, 0.61111111, 0.61111111]),
  'split1_test_score': array([0.66666667, 0.66666667, 0.61111111, 0.61111111]),
  'split2_test_score': array([0.66666667, 0.66666667, 0.61111111, 0.61111111]),
  'split3_test_score': array([0.83333333, 0.66666667, 0.66666667, 0.66666667]),
  'split4_test_score': array([0.66666667, 0.72222222

In [None]:
""" L1 Penalized Multinomial - SGD """ 
l1sgd_clf = SGDClassifier(
    # 'log_loss' is the logistic loss; scikit-learn < 1.1 spelled it 'log'
    loss='log_loss', penalty='l1', alpha=0.05, max_iter=1000,
    tol=1e-5, eta0=0.0005, learning_rate='adaptive'
)
l1_multinom_clf = Pipeline([ 
    ('norm', Normalizer()), 
    ('scale', StandardScaler()),
    ('classifier', l1sgd_clf)
])
l1_multinom_gridsearch = GridSearchCV(
    l1_multinom_clf, 
    {'classifier__alpha': [0.05]}, 
    cv = cv_folds, scoring='accuracy'
).fit(X_train, y_train)
l1_multinom_gridsearch.cv_results_

In [None]:
l1_multinom_gridsearch.cv_results_

{'mean_fit_time': array([59.65011004]),
 'std_fit_time': array([9.58614911]),
 'mean_score_time': array([0.00433144]),
 'std_score_time': array([0.00198086]),
 'param_classifier__alpha': masked_array(data=[0.05],
              mask=[False],
        fill_value='?',
             dtype=object),
 'params': [{'classifier__alpha': 0.05}],
 'split0_test_score': array([0.55555556]),
 'split1_test_score': array([0.72222222]),
 'split2_test_score': array([0.83333333]),
 'split3_test_score': array([0.88888889]),
 'split4_test_score': array([0.83333333]),
 'split5_test_score': array([0.72222222]),
 'split6_test_score': array([0.66666667]),
 'split7_test_score': array([0.88888889]),
 'mean_test_score': array([0.76388889]),
 'std_test_score': array([0.11023964]),
 'rank_test_score': array([1], dtype=int32)}
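
`calculate_cross_validation_errors` turns per-fold accuracies into error counts by multiplying each fold's error rate by its 18 test samples. Applying the same arithmetic to the eight `split*_test_score` values printed above gives the number of misclassified training samples for the L1-penalized model. The sketch uses only the standard library; the variable names are illustrative, and `round` is added here only because the printed accuracies are themselves rounded.

```python
import math
import statistics

fold_size = 18  # 144 training samples / 8 folds
fold_accuracy = [0.55555556, 0.72222222, 0.83333333, 0.88888889,
                 0.83333333, 0.72222222, 0.66666667, 0.88888889]

# error count per fold; rounding recovers the integer counts
fold_errors = [round(fold_size * (1 - acc)) for acc in fold_accuracy]
cv_errors_count = sum(fold_errors)
# same formula as the helper: sqrt(n_folds * sample variance of fold errors)
cv_errors_count_std = math.sqrt(
    statistics.variance(fold_errors) * len(fold_errors))

print(fold_errors, cv_errors_count, cv_errors_count_std)
# prints [8, 5, 3, 2, 3, 5, 6, 2] 34 6.0
```

The total of 34 errors out of 144 agrees with the reported `mean_test_score` of 0.7639, since 1 - 34/144 ≈ 0.7639.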