## Predicting whether a tissue sample is a tumor or normal from miRNA expression data using bagging support vector machines
The code is forked and only slightly modified from https://www.kaggle.com/thomasnelson/predicting-if-sample-is-tumor-bagsvm; I just tried to adapt it to what I encounter in day-to-day work. All credit goes to the original poster.

### Import modules and the data

In [47]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [48]:
data = pd.read_csv("../input/cervical.csv")

### Take a look at the top of the data. We have 714 features and 58 samples, plus a column of miRNA names.

In [49]:
print(data.shape)
data.head()

### Let's look at the library sizes (total read counts per sample)

In [50]:
sizes = data.sum(numeric_only=True)
sizes.plot.bar()

### Normalization

Our library sizes differ considerably between samples, so raw counts are not directly comparable. Standard practice is to apply a normalization method first.

Here I use counts per million (CPM). We divide each count by its sample's library size to get the proportion of total reads assigned to each miRNA, then multiply by 1 million to get counts per million.

In [51]:
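ID = data.ID
# Drop the name column so only numeric counts remain
# (the positional axis argument, drop('ID', 1), is rejected by recent pandas)
data = data.drop(columns='ID')
sums = data.sum()  # library size of each sample
cpm = data.div(sums) * 1000000  # counts per million
cpm.insert(loc=0, column='ID', value=ID)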
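
To make the CPM arithmetic concrete, here is a minimal sketch on a made-up count table. The names `toy`, `library_sizes`, and `toy_cpm` and all the numbers are mine for illustration, not part of the cervical data set.

```python
import pandas as pd

# Toy count table (made-up numbers): rows are miRNAs, columns are samples.
toy = pd.DataFrame(
    {"sample_A": [10, 30, 60], "sample_B": [100, 300, 600]},
    index=["miR-x", "miR-y", "miR-z"],
)

# Library size = total reads per sample (column sums).
library_sizes = toy.sum(axis=0)

# Divide each column by its library size, then scale to one million.
toy_cpm = toy.div(library_sizes, axis="columns") * 1_000_000

print(toy_cpm)
print(toy_cpm.sum(axis=0))  # every column now sums to about 1,000,000
```

The two toy samples have a tenfold difference in sequencing depth but identical proportions, so after CPM their columns match. That is the effect we want before comparing samples.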

In [52]:
print(cpm.shape)
cpm.head()

In [53]:
sizes = cpm.sum(numeric_only=True)
sizes.plot.bar()

### Here I re-format the data into a form suitable for the machine learning algorithm: samples as rows and features as columns, plus a vector of class labels for scikit-learn

In [54]:
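cpm = cpm.transpose()  # samples become rows, miRNAs become columns
cpm = np.array(cpm[1:])  # skip the first row, which holds the miRNA IDs
# Labels are hard-coded: this assumes the first 29 samples are normal and the last 29 are tumor
class_labels = np.array(["normal"]*29 + ["tumor"]*29)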
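
As a side note, transposing a table that still contains the text `ID` column produces an object-typed array, which is presumably where the later type-conversion warning comes from. A possible alternative, sketched here on a made-up table (the names `toy` and `X` are hypothetical), is to move the names into the index before transposing so the matrix stays numeric.

```python
import numpy as np
import pandas as pd

# Toy CPM table (made-up values) laid out like the notebook's table:
# an 'ID' column of miRNA names followed by one column per sample.
toy = pd.DataFrame({
    "ID": ["miR-x", "miR-y", "miR-z"],
    "s1": [1.0, 2.0, 3.0],
    "s2": [4.0, 5.0, 6.0],
})

# Move the names into the index first, so the transpose stays numeric.
X = toy.set_index("ID").T.to_numpy(dtype=float)

print(X.shape)  # 2 samples as rows, 3 miRNAs as columns
print(X.dtype)  # a float dtype rather than object
```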

In [55]:
cpm

In [56]:
class_labels

### Now we need to scale the data, which prevents the most highly expressed miRNAs from dominating the model. We get a warning about converting the data type, but that's okay.

In [57]:
from sklearn.preprocessing import StandardScaler

sc_cpm = StandardScaler()
cpm = sc_cpm.fit_transform(cpm)
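
To see what the scaler does, here is a small sketch on a made-up matrix (the values and the names `X` and `X_scaled` are illustrative only). Each column is shifted to mean 0 and rescaled to unit standard deviation, so a feature measured in the thousands ends up on the same scale as one measured in single digits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix: 4 samples (rows) x 2 features (columns) on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```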

In [58]:
cpm

### Now we'll use support vector machines with a linear kernel to create a classifier. Here I use bagging (bootstrap aggregating), an ensemble method, to reduce the risk of overfitting such a small data set.

In [59]:
from sklearn.ensemble import BaggingClassifier
from sklearn import svm
svc = svm.SVC(kernel='linear', C=1.0)

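# Set a RANDOM_STATE to use for bagging and cross-validation
RANDOM_STATE = 123454321

# The SVM is passed positionally because the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
classifier = BaggingClassifier(svc, n_estimators=10, random_state=RANDOM_STATE)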
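
Here is a minimal sketch of what the bagged classifier holds after fitting. It uses synthetic data from `make_classification` as a stand-in for the expression matrix (an assumption for illustration, not the real data). Each of the 10 SVMs is trained on its own bootstrap sample of the rows, and their predictions are combined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix: 58 samples, 50 features.
X, y = make_classification(n_samples=58, n_features=50, n_informative=5,
                           random_state=0)

# The base estimator is passed positionally so the call does not depend on
# whether the installed scikit-learn calls it `estimator` or `base_estimator`.
bag = BaggingClassifier(SVC(kernel="linear", C=1.0), n_estimators=10,
                        random_state=0)
bag.fit(X, y)

print(len(bag.estimators_))               # one fitted SVM per bootstrap sample
print(bag.estimators_samples_[0][:10])    # first few row indices drawn for SVM 0
print(bag.predict(X[:5]))                 # combined prediction of the ensemble
```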


**10-fold cross-validation, constructing the StratifiedKFold splitter explicitly to control the random_state parameter.**

In [60]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

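# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)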
scores = cross_val_score(classifier, cpm, class_labels, cv=cv)
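
To illustrate what "stratified" means here, this sketch splits a label vector with the same 29/29 class balance and counts the classes in each test fold. The placeholder feature matrix `X_dummy` and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array(["normal"] * 29 + ["tumor"] * 29)
X_dummy = np.zeros((len(labels), 1))  # placeholder features; only labels matter here

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Count how many samples of each class land in every test fold.
fold_counts = []
for train_idx, test_idx in cv.split(X_dummy, labels):
    values, counts = np.unique(labels[test_idx], return_counts=True)
    fold_counts.append(dict(zip(values.tolist(), counts.tolist())))

for i, c in enumerate(fold_counts):
    print(i, c)
```

With 29 samples per class and 10 folds, every test fold gets 2 or 3 samples of each class, so no fold ends up with only tumors or only normals.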

The overall score is the mean of the per-fold accuracies in the scores array.

In [61]:
print("10-fold cross-validated prediction accuracy: {:.2f}%".format(scores.mean()*100))
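
One caveat about the workflow above: the scaler was fitted on all 58 samples before cross-validation, so each held-out fold has already influenced the scaling. A possible variation, sketched here on synthetic data (not the real data set, and not what the original notebook does), wraps the scaler and the classifier in a scikit-learn pipeline so the scaling is re-fitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled CPM matrix and its labels.
X, y = make_classification(n_samples=58, n_features=50, n_informative=5,
                           random_state=0)

# Scaler and bagged SVM are fitted together, on training folds only.
model = make_pipeline(
    StandardScaler(),
    BaggingClassifier(SVC(kernel="linear", C=1.0), n_estimators=10,
                      random_state=0),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.shape)  # one accuracy value per fold
print("mean accuracy: {:.2f}%".format(scores.mean() * 100))
```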