# Breast Cancer prediction using Turi Create

The data for this Jupyter notebook was originally obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

Data description can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Objective 

Predict whether a sample is benign or malignant based on a set of nine cytology features (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitosis).

Data Preprocessing 

Rows with missing values ("NaN") were deleted. Column labels were added as the first row, based on the data description. Values in the 'class' field were recoded as follows: 2 became 0 (benign) and 4 became 1 (malignant/cancer). No other preprocessing was done. The data file "BreastCancer2.csv" is in the same GitHub folder as this Jupyter notebook.
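
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the script that was actually used to build "BreastCancer2.csv": the helper name `clean_rows`, the `id` column name, and the set of markers treated as missing values are assumptions made here. The other column names follow the feature list used later in this notebook.

```python
import csv
import io

# Column labels: 'id' is an assumed name for the first column; the rest
# match the feature names and target used later in this notebook.
COLUMNS = ['id', 'thickness', 'size', 'shape', 'adhesion', 'single',
           'nuclei', 'chromatin', 'nucleoli', 'mitosis', 'class']

# Markers treated as missing values (an assumption for this sketch).
MISSING = {'', '?', 'NaN', 'nan'}

# Recode the original class labels: 2 (benign) -> 0, 4 (malignant) -> 1.
CLASS_MAP = {'2': '0', '4': '1'}


def clean_rows(raw_text):
    """Return cleaned rows (header first) from header-less CSV text."""
    cleaned = [COLUMNS]
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue  # skip blank lines
        values = [v.strip() for v in row]
        if any(v in MISSING for v in values):
            continue  # drop rows with missing values
        values[-1] = CLASS_MAP[values[-1]]  # recode the class column
        cleaned.append(values)
    return cleaned


# Tiny made-up sample in the raw layout (not real records).
sample = (
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,10,10,8,7,10,9,7,1,4\n"
    "3,6,8,8,1,3,?,3,2,1,2\n"
)
rows = clean_rows(sample)
print(rows[1][-1], rows[2][-1], len(rows))
```

The third sample row contains a missing value and is dropped, so only the header and two recoded rows remain.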

In [1]:
# Import Turi Create as tc
import turicreate as tc

In [2]:
# Load the data from the CSV file and assign it to the variable data
data =  tc.SFrame('BreastCancer2.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int,int,int,int,int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
# Randomly split the data into training (about 80% of data) and testing (about 20% of data) sets.
train_data, test_data = data.random_split(0.8)
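
Because the split is random, the rows in each set change from run to run unless the random seed is fixed. The sketch below illustrates the idea in plain Python, assuming each row is assigned to the training set independently with probability 0.8, which would also explain why the resulting sizes are only approximately 80/20. `sketch_random_split` is a hypothetical helper written for illustration, not part of Turi Create.

```python
import random


def sketch_random_split(rows, fraction, seed):
    """Assign each row to the first set with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))  # stand-in for the rows of a data set
train_a, test_a = sketch_random_split(rows, 0.8, seed=42)
train_b, test_b = sketch_random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # sizes are close to, not exactly, 800/200
print(train_a == train_b)         # same seed gives the same split
```

In Turi Create itself, `random_split` accepts an optional `seed` argument for the same purpose, for example `data.random_split(0.8, seed=42)`.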

In [4]:
# Create the model. 'target' is the variable we are trying to predict; 'features' are the fields
# used to predict the target.
# tc.classifier.create evaluates the data using different models (such as logistic regression and random forest)
# and selects the best one.
# The best model is assigned to the variable 'model'.
# In this run, RandomForestClassifier was selected as the best model based on validation set performance.

model = tc.classifier.create(train_data, target='class',
                             features = ['thickness', 'size','shape','adhesion','single','nuclei',
                                         'chromatin','nucleoli','mitosis'
                                         ])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, SVMClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.962962985039
PROGRESS: RandomForestClassifier          : 1.0
PROGRESS: DecisionTreeClassifier          : 0.925925910473
PROGRESS: SVMClassifier                   : 0.962963
PROGRESS: LogisticClassifier              : 0.962963
PROGRESS: ---------------------------------------------
PROGRESS: Selecting RandomForestClassifier based on validation set performance.
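
The selection step in the log above amounts to picking the candidate with the highest validation accuracy. The sketch below reproduces that choice from the logged values (rounded to six decimal places) in plain Python; it is an illustration, not Turi Create's internal code. It also checks that the logged accuracies match 25/27, 26/27, and 27/27, which is consistent with a validation set of 27 rows. That row count is an inference from the numbers, not something the log states.

```python
from fractions import Fraction

# Validation accuracies copied from the log above, rounded to six decimals.
validation_accuracy = {
    'BoostedTreesClassifier': 0.962963,
    'RandomForestClassifier': 1.0,
    'DecisionTreeClassifier': 0.925926,
    'SVMClassifier': 0.962963,
    'LogisticClassifier': 0.962963,
}

# Pick the model with the highest validation accuracy.
best = max(validation_accuracy, key=validation_accuracy.get)
print(best)

# Assumed validation-set size (inferred from the accuracies, not logged).
n_rows = 27
checks = {}
for name, acc in validation_accuracy.items():
    correct = round(acc * n_rows)  # implied number of correct predictions
    checks[name] = round(float(Fraction(correct, n_rows)), 6) == acc
    print(name, f'{correct}/{n_rows}', checks[name])
```

If the validation set really is that small, one row changes the accuracy by about 3.7 percentage points, so the ranking of the models here should be read with caution.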


In [5]:
# Get predictions using the test data 
predictions = model.classify(test_data)

In [6]:
predictions

class,probability
0,0.876951314509
0,0.876951314509
1,0.773324787617
1,0.533553183079
0,0.876951314509
0,0.876951314509
1,0.604591429234
0,0.626685112715
1,0.577280879021
1,0.751432836056


In [7]:
# Compute evaluation metrics for the model on the test data with the model.evaluate method
results = model.evaluate(test_data)

In [8]:
results

{'accuracy': 0.9536423841059603,
 'auc': 0.986919675755343,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |   5   |
 |      1       |        1        |   54  |
 |      0       |        0        |   90  |
 |      0       |        1        |   2   |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9391304347826087,
 'log_loss': 0.22541536105489204,
 'precision': 0.9642857142857143,
 'recall': 0.9152542372881356,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+----+----+
 | threshold | fpr | tpr | p  | n  |
 +-----------+-----+-----+----+----+
 |    0.0    | 1.0 | 1.0 | 59 | 92 |
 |   1e-05   | 1.0 | 1.0 | 59 | 92 |
 |   2e-05   | 1.0 | 1.0 | 59 | 92 |
 |

On the test data, the model has an accuracy of 95.36% and an AUC of 98.69%.
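
The reported metrics can be checked directly against the confusion matrix in the results above, taking class 1 (malignant) as the positive class. The following is a minimal sketch using the standard definitions of accuracy, precision, recall, and F1; the variable names are chosen here for illustration.

```python
# Counts from the confusion matrix above (positive class = 1, malignant).
tp = 54  # target 1, predicted 1
fn = 5   # target 1, predicted 0
tn = 90  # target 0, predicted 0
fp = 2   # target 0, predicted 1

total = tp + fn + tn + fp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted malignant that are malignant
recall = tp / (tp + fn)               # share of malignant cases that were found
f1 = 2 * precision * recall / (precision + recall)

print(f'test rows = {total}')
print(f'accuracy  = {accuracy:.4f}')
print(f'precision = {precision:.4f}')
print(f'recall    = {recall:.4f}')
print(f'f1_score  = {f1:.4f}')
```

These values agree with the `accuracy`, `precision`, `recall`, and `f1_score` entries returned by `model.evaluate`. The 5 false negatives are malignant samples predicted as benign, which is the error type that recall captures.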


# End of Jupyter notebook