In [1]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [2]:
import pandas as pd
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

This is the **data**! The data that doctors should use to diagnose potential breast cancer:

In [4]:
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


These are the **labels**: they describe whether patients with the above symptoms ended up having cancer. In this dataset, `0`s are malignant (cancer) and `1`s are benign (no cancer):

In [5]:
y

0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     1
20     1
21     1
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
      ..
539    1
540    1
541    1
542    1
543    1
544    1
545    1
546    1
547    1
548    1
549    1
550    1
551    1
552    1
553    1
554    1
555    1
556    1
557    1
558    1
559    1
560    1
561    1
562    0
563    0
564    0
565    0
566    0
567    0
568    1
Length: 569, dtype: int32
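It is easy to mix up which number means what, so here is a quick sanity check of the label encoding. This is a minimal sketch that reads the class names stored with the dataset (`data.target_names`) and counts the patients in each class; the variable names `label_names` and `class_counts` are only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map each integer label to the class name stored with the dataset
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)

# Count how many patients fall into each class
y = pd.Series(data.target)
class_counts = y.map(label_names).value_counts()
print(class_counts)
```

The counts also show that the two classes are not the same size, which is worth keeping in mind when reading any error number later on.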

**MORE COMPLEX VERSION, MAY NEED PROFESSOR-LEVEL PACKAGES**

If you're missing packages, skip the cells below and go straight to the cell marked **RUN HERE FOR SIMPLE STANDALONE VERSION**

In [6]:
# The PyPI package is named scikit-learn; `sklearn` is only the import name.
# GridSearchCV ships inside scikit-learn (sklearn.model_selection),
# so there is no separate grid_search package to install.
!pip install scikit-learn


In [148]:
# sklearn.grid_search was removed from scikit-learn; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV

Let's import our curve-building algorithm using tree-logic as explained above.

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

Let's learn from our dataset.

In [8]:
%%time
import numpy as np
from sklearn.model_selection import GridSearchCV

# Every combination of these settings is tried with 5-fold cross-validation
param_grid = dict(
    max_features=np.arange(2, 8),
    max_depth=[2, 4],
    min_samples_split=[5, 10, 15, 20],
)
rfc = RandomForestRegressor(n_estimators=10)
gs = GridSearchCV(rfc, param_grid, cv=5, n_jobs=-1)
gs.fit(X.values, y.values)

In [146]:
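# Note: grid_scores_ only exists in scikit-learn versions before 0.20;
# newer versions expose the same information through gs.cv_results_
scores = gs.grid_scores_
scores[:10]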

[mean: 0.71057, std: 0.09425, params: {'max_depth': 2, 'max_features': 2, 'min_samples_split': 5},
 mean: 0.72736, std: 0.08405, params: {'max_depth': 2, 'max_features': 2, 'min_samples_split': 10},
 mean: 0.75531, std: 0.06236, params: {'max_depth': 2, 'max_features': 2, 'min_samples_split': 15},
 mean: 0.69665, std: 0.08930, params: {'max_depth': 2, 'max_features': 2, 'min_samples_split': 20},
 mean: 0.72999, std: 0.10873, params: {'max_depth': 2, 'max_features': 3, 'min_samples_split': 5},
 mean: 0.73250, std: 0.11170, params: {'max_depth': 2, 'max_features': 3, 'min_samples_split': 10},
 mean: 0.72664, std: 0.10219, params: {'max_depth': 2, 'max_features': 3, 'min_samples_split': 15},
 mean: 0.74243, std: 0.10487, params: {'max_depth': 2, 'max_features': 3, 'min_samples_split': 20},
 mean: 0.75516, std: 0.11585, params: {'max_depth': 2, 'max_features': 4, 'min_samples_split': 5},
 mean: 0.73956, std: 0.12662, params: {'max_depth': 2, 'max_features': 4, 'min_samples_split': 10}]
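On a recent scikit-learn the same table comes from `cv_results_` rather than `grid_scores_`. The following is a minimal sketch, not the search that produced the numbers above: it uses a deliberately tiny grid (an assumption, chosen only so it runs quickly) and a fixed `random_state`. Because the estimator is a regressor, each score is an R² value, not an accuracy.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()

# Tiny illustrative grid (assumption): 2 x 2 = 4 combinations
param_grid = dict(max_features=[2, 4], max_depth=[2, 4])
rfc = RandomForestRegressor(n_estimators=10, random_state=0)
gs = GridSearchCV(rfc, param_grid, cv=5)
gs.fit(data.data, data.target)

# cv_results_ is a dict of arrays; a DataFrame makes it readable
results = pd.DataFrame(gs.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
print('Best settings:', gs.best_params_)
```

Each row is one combination of settings, with the mean and standard deviation of its score across the 5 folds.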

These are the features that doctors ought to be paying attention to in order to diagnose cancer:

In [9]:
# Requires the grid search above to have finished, so that `gs` exists
pd.Series(gs.best_estimator_.feature_importances_,
          index=X.columns).sort_values().plot(kind='barh')

### RUN HERE FOR SIMPLE STANDALONE VERSION

This is the simpler version of the regression forest algorithm. It should run on a standard scikit-learn install with no extra packages. It's what we used last class.

We split a dataset into a training bundle and a testing bundle. We train a Machine Learning algorithm on the training bundle, and then we test on the test bundle to see *if it guesses right*!

In [106]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (25% held out for testing)
train_features, test_features, train_labels, test_labels = train_test_split(
    X.values, y.values, test_size=0.25, random_state=42)
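To see what the split actually produced, it helps to look at the size of each bundle. This is a small sketch that repeats the same split on the raw arrays and prints the shapes, plus the share of `1` labels in each bundle; `train_share` and `test_share` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same split settings as the cell above
train_features, test_features, train_labels, test_labels = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

# Rows are patients, columns are the 30 measurements
print('Training bundle:', train_features.shape)
print('Testing bundle: ', test_features.shape)

# Fraction of label 1 in each bundle; the two should be roughly similar
train_share = train_labels.mean()
test_share = test_labels.mean()
print('Share of 1s:', round(train_share, 2), round(test_share, 2))
```

If the two shares ever differ a lot on a dataset, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both bundles.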

In [107]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
rf = RandomForestRegressor(n_estimators=10, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

What's the prediction error on the test bundle?

In [16]:
import numpy as np

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

Mean Absolute Error: 0.07


Wowwwwww... 
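The mean absolute error is measured on 0/1 labels, so it does not directly say how often the forest *guesses right*. One way to get that number is to round each prediction to the nearest label and count the matches. This is a sketch under one assumption: a cutoff of 0.5 for turning the regressor's output into a 0 or a 1, which is a choice made here and not something the cells above do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train_features, test_features, train_labels, test_labels = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

# Same model settings as above
rf = RandomForestRegressor(n_estimators=10, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)

# Assumption: predictions of 0.5 or more count as label 1, the rest as label 0
predicted_labels = (predictions >= 0.5).astype(int)

# Fraction of test patients whose label was guessed correctly
accuracy = accuracy_score(test_labels, predicted_labels)
print('Accuracy:', round(accuracy, 3))
```

Since the target is a yes/no label, `RandomForestClassifier` is the more usual tool for this kind of problem; the regressor is kept here to match the rest of the notebook.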

In [17]:
# Saving feature names for later use
feature_list = list(X.columns)
feature_list

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension',
 'radius error',
 'texture error',
 'perimeter error',
 'area error',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension']

What are the symptoms that doctors should be paying attention to, in order of priority, to correctly diagnose breast cancer?

In [11]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: worst perimeter      Importance: 0.3
Variable: worst concave points Importance: 0.22
Variable: worst radius         Importance: 0.16
Variable: worst area           Importance: 0.11
Variable: mean concave points  Importance: 0.08
Variable: worst texture        Importance: 0.03
Variable: mean texture         Importance: 0.02
Variable: area error           Importance: 0.02
Variable: worst concavity      Importance: 0.02
Variable: concavity error      Importance: 0.01
Variable: concave points error Importance: 0.01
Variable: mean radius          Importance: 0.0
Variable: mean perimeter       Importance: 0.0
Variable: mean area            Importance: 0.0
Variable: mean smoothness      Importance: 0.0
Variable: mean compactness     Importance: 0.0
Variable: mean concavity       Importance: 0.0
Variable: mean symmetry        Importance: 0.0
Variable: mean fractal dimension Importance: 0.0
Variable: radius error         Importance: 0.0
Variable: texture error        Importance: 0.0
...
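The ranking above comes from the forest's built-in importances, which are computed on the training data and can split credit unevenly among features that carry nearly the same information (the perimeter, radius and area of a shape all rise together). As a cross-check, here is a hedged sketch using scikit-learn's `permutation_importance`, a different technique from the one above: it shuffles one column of the test bundle at a time and measures how much the score drops. The choices of `n_repeats=10` and of showing the top 5 are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train_features, test_features, train_labels, test_labels = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

rf = RandomForestRegressor(n_estimators=10, random_state=42)
rf.fit(train_features, train_labels)

# Shuffle each feature 10 times on the test bundle and record the score drop
result = permutation_importance(
    rf, test_features, test_labels, n_repeats=10, random_state=42)

# Average drop per feature, largest first
perm = pd.Series(result.importances_mean, index=data.feature_names)
perm = perm.sort_values(ascending=False)
print(perm.head(5))
```

If the two rankings agree on the top features, that is some reassurance; if they disagree, the built-in numbers deserve a second look before anyone treats them as diagnostic advice.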

# Conclusion

This is exactly what IBM's Watson is doing (albeit with ***a lot*** more data) and fooling the world into thinking it is some kind of Artificial Intelligence. Nope, it's just **statistics**.

What dataset would you like to apply this to, dress up like a magician, and amaze your friends?

<br />
<center>
    <img src="ipynb.images/fortune-teller-reading-crystal-ball.jpg" width=600 />
</center>