# Breast Cancer Prediction 

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)

---

## Background
For this illustration, we predict breast cancer using UCI's Breast Cancer Wisconsin (Diagnostic) data set, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The goal is to use this data set to build a predictive model of whether a breast mass image indicates a benign or a malignant tumor.

---

## Setup

Now we'll import the Python libraries we'll need.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import time
import json

import al

---
## Data

Data sources:
* https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
* https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's download the data, save it in the local folder as data.csv, and take a look at it.

In [3]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)

# specify column names taken from wdbc.names
data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

# save the data
data.to_csv("data.csv", sep=',', index=False)

# print the shape of the data file
print(data.shape)

# show the top few rows
display(data.head())

# describe the data object
display(data.describe())

# we will also summarize the categorical field diagnosis
display(data.diagnosis.value_counts())


(569, 32)


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


B    357
M    212
Name: diagnosis, dtype: int64
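
As a small aside, the column names can also be passed to `read_csv` through its `names` parameter instead of being assigned afterwards. The sketch below is only an illustration: it reads an in-memory three-column excerpt of the first two rows shown above rather than the full file, and the names `sample_text` and `sample` are made up for this example.

```python
import io

import pandas as pd

# Three-column excerpt (id, diagnosis, radius_mean) of the first two rows
# shown above; the real file has 32 columns.
sample_text = "842302,M,17.99\n842517,M,20.57\n"

# header=None: the file has no header row; names: the labels to use instead.
sample = pd.read_csv(
    io.StringIO(sample_text),
    header=None,
    names=["id", "diagnosis", "radius_mean"],
)

print(sample.dtypes)
print(sample)
```

Passing `names` at read time keeps the column labels next to the call that creates the frame, which may be easier to follow than a separate assignment.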

#### Key observations:
* Data has 569 observations and 32 columns.
* First field is 'id'.
* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).
* There are 30 other numeric features available for prediction.
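
The class counts above also give a useful reference point: a model that always predicts the majority class would already be right for every benign case. A minimal sketch of that arithmetic, using only the counts printed by `value_counts()` (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {"B": 357, "M": 212}

n_total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / n_total

print(n_total)                      # 569
print(majority_class)               # B
print(round(baseline_accuracy, 3))  # 0.627
```

Any trained pipeline should be judged against this roughly 63% baseline rather than against 50%.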

## Create Features and Labels
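#### Encode the diagnosis as a binary label (1 = malignant, 0 = benign) and keep the 30 numeric columns as features. The data is split into training and validation sets further below, before the selected pipeline is fitted.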

In [5]:
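# label: 1 for malignant ('M'), 0 for benign ('B')
y = (data.iloc[:, 1] == 'M').astype(int).values
# features: the 30 numeric columns (the id and diagnosis columns are dropped)
X = data.iloc[:, 2:].values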
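
The label encoding can be checked on a tiny example. Comparing the column with `'M'` yields booleans, and converting those to integers gives 1 for malignant and 0 for benign. The sketch below uses a hypothetical four-row `diagnosis` series, not the real data, and shows that adding `0` and calling `astype(int)` produce the same result.

```python
import pandas as pd

# Hypothetical miniature diagnosis column (not taken from the data set).
diagnosis = pd.Series(["M", "B", "B", "M"])

# Boolean comparison, then cast to integers: True -> 1, False -> 0.
labels_astype = (diagnosis == "M").astype(int).values

# Adding 0 to a boolean series triggers the same bool-to-int conversion.
labels_plus_zero = ((diagnosis == "M") + 0).values

print(labels_astype)     # [1 0 0 1]
print(labels_plus_zero)  # [1 0 0 1]
```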

---
## Train


In [8]:
m = al.AL()

m.fit(X, y)
progs = m.get_programs()

Generating transforms
Programs of depth=0


Failed to apply transform(sklearn.pipeline.Pipeline): None
Failed to apply transform(sklearn.pipeline.Pipeline): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(builtins.module): None
Failed to apply transform(builtins.module): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): None
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): None
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>

Failed to apply transform(sklearn.ensemble.forest.RandomForestRegressor): None
Failed to apply transform(sklearn.ensemble.forest.RandomForestRegressor): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transf

[Transforms] Pruning programs of depth 0
Programs of depth=1


Failed to apply transform(sklearn.pipeline.Pipeline): None
Failed to apply transform(sklearn.pipeline.Pipeline): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(builtins.module): None
Failed to apply transform(builtins.module): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): None
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>

Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): None
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.decomposition.truncated_svd.TruncatedSVD): <class 'synthesis.runtime_helpers.ColumnLoop'>

Failed to apply transform(sklearn.ensemble.fores

Failed computing missing


Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): None
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(builtins.module): None
Failed to apply transform(builtins.module): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): None
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.text.TfidfVectorizer): None
Failed to apply transform(sklearn.feature_extraction.text.TfidfVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>

Failed to apply transform(sklearn.decomposition.truncated_svd.TruncatedSVD): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.preprocessing.imputa

[Transforms] Pruning programs of depth 1
Programs of depth=2


Failed to apply transform(sklearn.pipeline.Pipeline): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(builtins.module): None
Failed to apply transform(builtins.module): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): None
Failed to apply transform(sklearn.feature_extraction.text.CountVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): None
Failed to apply transform(sklearn.feature_extraction.dict_vectorizer.DictVectorizer): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.feature_selection.from_model.SelectFromModel): None
Failed to apply transform(sklearn.feature_selection.from_model.SelectFromModel): <class 'synthesis.runtime_helpers.ColumnLoop'>
Failed to apply transform(sklearn.decomposition.truncated_svd.TruncatedSVD): <class 'synthesis.runtime_helpers.Column

[Transforms] Pruning programs of depth 2
Generating modeling


Failed to apply model: builtins.module
Failed to apply model: xgboost.core.Booster
Failed to apply model: builtins.module
Failed to apply model: xgboost.core.Booster
Failed to apply model: builtins.module
Failed to apply model: xgboost.core.Booster
Failed to apply model: xgboost.core.Booster

Failed computing missing

Failed to apply model: builtins.module
Failed to apply model: sklearn.neighbors.kde.KernelDensity
Failed to apply model: xgboost.core.Booster
Failed to apply model: builtins.module
Failed to apply model: xgboost.core.Booster
Failed to apply model: sklearn.grid_search.GridSearchCV
Failed to apply model: builtins.module
Failed to apply model: xgboost.core.Booster

[Model] Pruning programs of depth 0
[Model] Pruning programs of depth 1
[Model] Pruning programs of depth 2
[Model] Pruning programs of depth 3
Adding test data


Failed to add test data, program depth: 0
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
Failed to add test data, program depth: 1
Failed to add test data, program depth: 1
Failed to add test data, program depth: 1
Failed to add test data, program depth: 3
Failed to add test data, program depth: 3

Print the pipeline code

In [9]:
print(progs[0].pipeline_code())
print('\n\n-----------------------\n')
print(progs[1].pipeline_code())
print('\n\n-----------------------\n')
print(progs[3].pipeline_code())
print('\n\n-----------------------\n')
print(progs[4].pipeline_code())

import sklearn.neural_network.multilayer_perceptron
import sklearn.preprocessing.imputation
import xgboost
import sklearn.preprocessing.data
import sklearn

from sklearn.pipeline import Pipeline
p = Pipeline([('t0', runtime_helpers.ColumnLoop(sklearn.preprocessing.data.StandardScaler)),('t1', sklearn.preprocessing.imputation.Imputer()),('t2', sklearn.preprocessing.data.MinMaxScaler()),('model', sklearn.neural_network.multilayer_perceptron.MLPClassifier())])
p.fit(X_train, y_train)
print(p.score(X_val, y_val))
p.fit(X, y)
def predict(X_test): return p.predict(X_test)


-----------------------

import sklearn.neural_network.multilayer_perceptron
import sklearn.preprocessing.imputation
import xgboost
import sklearn.preprocessing.data
import sklearn

from sklearn.pipeline import Pipeline
p = Pipeline([('t0', runtime_helpers.ColumnLoop(sklearn.preprocessing.data.StandardScaler)),('t1', sklearn.preprocessing.imputation.Imputer()),('t2', runtime_helpers.ColumnLoop(sklearn.preprocessing.data.M

In [20]:
import synthesis.runtime_helpers as runtime_helpers
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
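
The split above is an 80/20 train/validation split with no fixed seed, so the rows that land in each set change from run to run. If a separate held-out test set and a reproducible split are wanted, an 80/10/10 split could be sketched by calling `train_test_split` twice. The example below is an assumption-laden illustration: it uses random stand-in arrays with the same shape and class counts as the real data, and names such as `X_demo`, `X_hold`, and `X_tr` are invented so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shape and class counts as the real data
# (569 rows, 30 features, 357 benign / 212 malignant). Values are random.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(569, 30))
y_demo = np.array([0] * 357 + [1] * 212)

# Step 1: hold out 20% of the rows, preserving class proportions.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Step 2: split the held-out rows in half: validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

print(len(X_tr), len(X_va), len(X_te))  # 455 57 57
```

Here `stratify` keeps the benign/malignant ratio similar in every subset, and `random_state` makes the split repeatable; both are optional choices, not something the original cell uses.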

Select a pipeline and paste its code below

In [21]:
import sklearn.neural_network.multilayer_perceptron
import xgboost
import sklearn.preprocessing.data
import sklearn

from sklearn.pipeline import Pipeline
p = Pipeline([('t0', runtime_helpers.ColumnLoop(sklearn.preprocessing.data.StandardScaler)),('model', sklearn.neural_network.multilayer_perceptron.MLPClassifier())])
p.fit(X_train, y_train)
print(p.score(X_val, y_val))
p.fit(X, y)
def predict(X_test): return p.predict(X_test)



0.991228070175
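
The printed score corresponds to 113 correct predictions out of the 114 validation rows (20% of 569), so a single sample moving between classes shifts it by almost a percentage point. A cross-validated estimate is less sensitive to one lucky split. The sketch below is not the generated pipeline: it assumes scikit-learn's bundled copy of the same data set (`load_breast_cancer`), uses a plain `StandardScaler` in place of the `runtime_helpers.ColumnLoop` wrapper, and sets `max_iter` and `random_state` for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 113 correct out of 114 validation rows reproduces the printed score.
print(round(113 / 114, 12))  # 0.991228070175

# Bundled copy of the WDBC data: 569 rows, 30 features.
# Note: here the target is encoded as 0 = malignant, 1 = benign,
# the opposite of the y built earlier in this notebook.
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Simplified stand-in for the selected pipeline (assumption, see lead-in).
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", MLPClassifier(max_iter=1000, random_state=0)),
])

# Five stratified folds; the scaler is refit inside each training fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_cv, y_cv, cv=cv)

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a more cautious picture than one validation score.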


