# Breast cancer data 
# Machine learning using AdaBoost


### The data was obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
### Missing values, marked "?" in the raw file, were replaced with "NaN" so that rows with missing data are easy to find and drop
### Values in the class field were recoded as follows: 2 became 0 (benign) and 4 became 1 (malignant/cancer)
### No other pre-processing was done
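
The two pre-processing steps described above could be reproduced roughly as follows. This is only a sketch: the three inline rows are illustrative, the `id` column name is an assumption (the notebook never names the eleventh column), and the other column names are taken from the feature list used later in this notebook.

```python
import io

import pandas as pd

# Illustrative three-row excerpt in the raw UCI layout (no header row);
# the second row has a missing "bare nuclei" value marked with "?".
raw = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
)

# 'id' is an assumed name; the rest match the columns used in this notebook
columns = ['id', 'thickness', 'size', 'shape', 'adhesion', 'single',
           'nuclei', 'chromatin', 'nucleoli', 'mitosis', 'class']

# na_values turns every "?" into NaN while the file is parsed
df = pd.read_csv(raw, header=None, names=columns, na_values='?')

# Recode the label: 2 -> 0 (benign), 4 -> 1 (malignant)
df['class'] = df['class'].map({2: 0, 4: 1})

print(df['nuclei'].isna().sum())   # number of missing "bare nuclei" values
print(df['class'].tolist())        # recoded labels
```

With the labels recoded this way, class 1 is the "positive" class in every metric computed later.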

### Data description can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

### We are going to use breast cancer cell cytology features (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses) to predict the class (benign = 0, malignant/cancer = 1)

In [1]:
import pandas as pd

In [2]:
# Read the breast cancer CSV file
data = pd.read_csv('wisconsin_breast_cancer.csv')

In [3]:
data = data.dropna(how='any')  # Drop any rows that have missing values

In [4]:
data.shape

(683, 11)

In [5]:
# Create the feature data set
x = data[['thickness', 'size', 'shape', 'adhesion', 'single', 'nuclei', 'chromatin', 'nucleoli', 'mitosis']]

In [6]:
y = data['class']  # The class that we will predict using the features

In [7]:
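# sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)  # split data into train and test sets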

In [8]:
from sklearn.model_selection import KFold, cross_val_score  # cross-validation utilities
from sklearn.ensemble import AdaBoostClassifier

In [9]:
# Defining parameters for AdaBoost
num_folds = 10                  # number of folds for cross validation
num_instances = len(x_train)    # total number of observations going into the model
seed = 7                        # seed is fixed for reproducibility
num_trees = 30                  # builds up to 30 decision trees; boosting stops earlier
                                # if a perfect fit is reached

In [10]:
# KFold does not shuffle by default, so it takes no random_state here
kfold = KFold(n_splits=num_folds)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, x_train, y_train, cv=kfold)
print(results.mean()) 

0.960972850679


In [11]:
# The cross-validated mean score is an estimate of the accuracy to expect from this model on unseen data
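
To make the cross-validation step more concrete, here is a minimal pure-Python sketch of how unshuffled K-fold splitting partitions the training rows. The helper `kfold_indices` is hypothetical and written only for illustration. The value 512 is an assumption derived from this notebook: 683 rows minus the 171 rows that the default 25% test split holds out.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for unshuffled K-fold."""
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

# 512 is an assumption: 683 rows minus the 171 held out by train_test_split
fold_sizes = [len(val) for _, val in kfold_indices(512, 10)]
print(fold_sizes)
```

Each row lands in exactly one validation fold, so the model is scored once on every training row, each time by a model that never saw that row during fitting. The mean of the ten fold scores is the number printed above.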

In [12]:
model.fit(x_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=30, random_state=7)

In [13]:
y_pred_class = model.predict(x_test)  # predict the class of each test sample and store it in y_pred_class

In [14]:
# Let us find the accuracy on the held-out test set
model.score(x_test,y_test)

0.94736842105263153

In [15]:
# With logistic regression we had an accuracy of 0.929824561404; with AdaBoost we get 0.947368421053.
# This is still lower than the cross-validated mean score of 0.960972850679.
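
One rough way to judge whether that gap matters is to compare it with the sampling noise of the test accuracy itself. The sketch below uses the binomial standard error, which is an approximation added here and not part of the original analysis. It takes the test set size to be 171, the sum of the counts in the confusion matrix reported further down.

```python
import math

test_accuracy = 0.947368421053   # from model.score(x_test, y_test)
cv_mean = 0.960972850679         # from the 10-fold cross-validation
n_test = 171                     # test-set size (sum of the confusion matrix counts)

# Binomial standard error of an accuracy estimated on n_test samples
std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
gap = cv_mean - test_accuracy

print(round(std_error, 4), round(gap, 4))
```

Under this approximation the gap is smaller than one standard error, so the difference between the two scores may well be sampling noise and not a sign of overfitting.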

## Now let us create a confusion matrix to compute sensitivity, specificity, and other related statistics


In [17]:
from sklearn import metrics

In [18]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[103   4]
 [  5  59]]


In [19]:
# Rows are actual classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]  # true positives
TN = confusion[0, 0]  # true negatives
FP = confusion[0, 1]  # false positives
FN = confusion[1, 0]  # false negatives

In [26]:
# Let us see the sensitivity (recall) of our AdaBoost model
print(TP / float(TP + FN))

0.921875


In [21]:
# Let us calculate specificity
print(TN / float(TN + FP))

0.96261682243


In [22]:
# Precision: when the model predicts cancer, how often is it right?
# Also called positive predictive value
print(TP / float(TP + FP))

0.936507936508


In [23]:
# Negative predictive value
print(TN / float(TN + FN))

0.953703703704
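
The four ratios above, plus overall accuracy, can be collected in one small helper. This is a minimal sketch: `binary_metrics` is a hypothetical function written for illustration, and the counts are copied from the confusion matrix printed earlier.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return common diagnostic metrics from confusion matrix counts."""
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),   # share of cancers that were detected
        'specificity': tn / (tn + fp),   # share of benign cases cleared
        'ppv': tp / (tp + fp),           # precision of a "cancer" prediction
        'npv': tn / (tn + fn),           # reliability of a "benign" prediction
    }

# Counts from the confusion matrix printed above: [[103, 4], [5, 59]]
for name, value in binary_metrics(tn=103, fp=4, fn=5, tp=59).items():
    print(name, round(value, 4))
```

The accuracy computed from these counts, 162 / 171, matches the value returned by `model.score(x_test, y_test)`.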


### Calculate area under the curve

In [24]:
from sklearn.metrics import roc_auc_score
# Predicted probability of class "1" (cancer) for each test sample, stored in proba_cancer
proba_cancer = model.predict_proba(x_test)[:, 1]

In [25]:
# To compute the AUC we need the actual class labels and the predicted probabilities of the positive class "1"
roc_auc_score(y_test, proba_cancer)

0.99401285046728982
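
One way to read this number: the AUC is the probability that a randomly chosen malignant sample receives a higher predicted score than a randomly chosen benign one. The sketch below computes it that way by brute force. `pairwise_auc` is a hypothetical helper, and the labels and scores are toy values, not taken from this notebook's data.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, not taken from the notebook's data
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of about 0.994 therefore means that nearly every malignant test sample was scored above nearly every benign one, regardless of the 0.5 threshold used for the confusion matrix.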