<h3>Building a Cancer Classifier using Random Forest</h3>

<h4>1- Load The Required Packages</h4>

In [0]:
import pandas as pd                                              #data manipulation
from sklearn.model_selection import train_test_split             #splitting the data to train and test
from sklearn import tree                                         #running a decision tree
from sklearn.ensemble import RandomForestClassifier              #running a random forest
from sklearn import datasets                                     #saved datasets

from sklearn import metrics                                      #assessing model performance
from sklearn.metrics import classification_report                #assessing model performance
from sklearn.metrics import confusion_matrix                     #assessing model performance
import matplotlib.pyplot as plt                                  #visualize model performance

pd.set_option('display.max_columns', 30)                         #display all columns in your data

<h4>2- Load The Data</h4>

In [0]:
cancer = datasets.load_breast_cancer()
X=pd.DataFrame(cancer.data,columns=cancer.feature_names)        #define your features (flat column names, not a MultiIndex)
Y=pd.Series(cancer.target)                                      #define the target variable
X.head()                                                        #view the first few rows from your features
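
Before modelling, it helps to know what the 0 and 1 in the target stand for. The sketch below reads the mapping from `cancer.target_names` and counts the patients in each class. The names `label_names` and `class_counts` are introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

cancer = datasets.load_breast_cancer()
Y = pd.Series(cancer.target)

# target_names is ordered by label value: position 0 names label 0, position 1 names label 1
label_names = dict(enumerate(cancer.target_names))
print(label_names)

# count how many patients fall in each class
class_counts = Y.map(label_names).value_counts()
print(class_counts)
```

In this dataset label 0 is malignant and label 1 is benign, which matters when reading the predictions and the confusion matrix later on.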

In [0]:
#print the dimensions of the dataset
print(X.shape)

In [0]:
#let's look at column names
X.columns

In [0]:
#let's summarize the data
X.describe()

<h4>3- Split into Train and Test</h4>

In [0]:
#split the data to 70% train and 30% test
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

<h4>4- Train your model: Random Forest</h4>

In [0]:
rf_model = RandomForestClassifier(max_depth=3,n_estimators=15)        #define the model
rf_model.fit(x_train, y_train)                                        #fit the model (train)
rf_model.score(x_train,y_train)                                       #accuracy on the training data

#what is the accuracy of this model?

Let's visualize one of the trees in the forest! (https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176)

In [0]:
#select which tree you want to visualize (an index into rf_model.estimators_)
selected_tree=2

from io import StringIO                                               #sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data2 = StringIO()
export_graphviz(rf_model.estimators_[selected_tree],
                out_file=dot_data2,
                filled=True,
                precision=2,
                feature_names=x_train.columns,
                rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data2.getvalue())
Image(graph.create_png())
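
If Graphviz or `pydotplus` is not installed, a tree can also be drawn with matplotlib alone. The following is a minimal sketch using `sklearn.tree.plot_tree`, assuming a scikit-learn version that provides it (0.21 or later). It trains its own small forest, named `rf_demo` here for illustration, so it runs independently of the cells above.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

cancer = datasets.load_breast_cancer()

# rf_demo is a stand-in for rf_model; random_state is fixed only to make the drawing repeatable
rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=42)
rf_demo.fit(cancer.data, cancer.target)

selected_tree = 2  # which tree of the forest to draw

fig, ax = plt.subplots(figsize=(16, 8))
drawn = plot_tree(rf_demo.estimators_[selected_tree],
                  feature_names=list(cancer.feature_names),
                  class_names=list(cancer.target_names),
                  filled=True, rounded=True, precision=2, ax=ax)
plt.show()
```

Each box shows the question asked at that node, and the colour indicates which class dominates among the training samples that reach it.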

<h4>5- Predict!</h4>

In [0]:
#let's pull information from one patient from the test set
patient1_test=(x_test.iloc[0:1,:])
patient1_test

In [0]:
#what would our model predict? Malignant or Benign?
rf_model.predict(patient1_test)

In [0]:
#can we predict the probability of a patient being malignant or benign?
rf_model.predict_proba(patient1_test)

In [0]:
#Can we predict multiple patients at once?
rf_model.predict(x_test)

In [0]:
#can we get the probability of each test case being malignant or benign? (display the first 10 lines)
rf_model.predict_proba(x_test)[0:10]

#do you see how the 0 and 1 were generated in the previous command?
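
One way to answer that question: for each patient, the predicted label is the class with the highest probability. The sketch below checks this on a separately trained forest (`rf_demo`, `x_te` and the other names are introduced here for illustration and are not part of the cells above).

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(cancer.data, cancer.target,
                                          test_size=0.3, random_state=42)
rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=42)
rf_demo.fit(x_tr, y_tr)

# one row per patient, one column per class (in the order of rf_demo.classes_)
proba = rf_demo.predict_proba(x_te)

# pick the column with the highest probability and translate it to a class label
labels_from_proba = rf_demo.classes_[np.argmax(proba, axis=1)]

# compare with what predict() returns
same = np.array_equal(labels_from_proba, rf_demo.predict(x_te))
print(same)
```

Each row of `proba` sums to 1, so with two classes the prediction flips from 0 to 1 once the second column exceeds 0.5.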

<h4>6- How well did we predict?</h4>

In [0]:
#what is the accuracy of the model on the test set?
rf_model.score(x_test,y_test)

In [0]:
#let's generate a confusion matrix!
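#rows are the true labels, columns are the predicted labels
#in this dataset label 0 is malignant and label 1 is benign, so malignant comes first
pd.DataFrame(confusion_matrix(y_test,rf_model.predict(x_test)),
             index=['malignant','benign'],
             columns=['predicted malignant','predicted benign'])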
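
Accuracy alone hides which kind of mistake the model makes. As a possible follow-up, `classification_report` (imported in step 1) summarizes precision and recall per class. This is a self-contained sketch with its own model, `rf_demo`, a name used only for this example.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(cancer.data, cancer.target,
                                          test_size=0.3, random_state=42)
rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=42)
rf_demo.fit(x_tr, y_tr)

# target_names follows the label order: 0 = malignant, 1 = benign
report = classification_report(y_te, rf_demo.predict(x_te),
                               target_names=cancer.target_names)
print(report)
```

For a cancer screen, the recall on the malignant row is usually the number to watch, since it reflects how many malignant cases the model catches.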

<h4>7- Identifying the important questions!</h4>

In [0]:
#let's create a data frame that contains information about how important each question is in generating the correct prediction!
feature_importances = pd.DataFrame(rf_model.feature_importances_,
                                   index=x_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

In [0]:
#display the dataframe. Which questions do you think are important?
feature_importances
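
The same information is often easier to read as a chart. Below is a minimal sketch that plots the ten largest importances with matplotlib (imported in step 1). It fits its own forest, `rf_demo`, and the variable `importances` is likewise introduced only for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

cancer = datasets.load_breast_cancer()
rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=42)
rf_demo.fit(cancer.data, cancer.target)

# importances are normalized to sum to 1; sort ascending so the largest bar ends up on top
importances = pd.Series(rf_demo.feature_importances_,
                        index=cancer.feature_names).sort_values()

ax = importances.tail(10).plot.barh(figsize=(8, 5))
ax.set_xlabel("importance")
plt.tight_layout()
plt.show()
```

With only 15 shallow trees the ranking can shift from one run to the next unless `random_state` is fixed, so treat the exact order as indicative.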

<h4>8- Let's build another model with fewer features!</h4>

In [0]:
#subset the questions we are interested in
X_reduced=X[['worst perimeter','worst concave points','worst radius','mean concave points','worst concavity']]   #define your features
Y=pd.Series(cancer.target)                                                                                       #define the target
X_reduced.head()

In [0]:
#split into train and test
x_train,x_test,y_train,y_test = train_test_split(X_reduced,Y,test_size=0.3,random_state=42)


In [0]:
#train a new model!
rf_model = RandomForestClassifier(max_depth=3,n_estimators=15)        #define the model
rf_model.fit(x_train, y_train)                                        #fit the model (train)
rf_model.score(x_train,y_train)                                       #accuracy on the training data

#what is the accuracy of this model?

In [0]:
#save the model!
import joblib                                     #sklearn.externals.joblib was removed from recent scikit-learn releases

joblib.dump(rf_model, "cancer_classifier.pkl")    #save the whole model into a file to be used later

#to load the model next time we just need to do:
#classifier = joblib.load("cancer_classifier.pkl")
#classifier.predict(newobs)
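
To check that a saved model behaves like the original, one option is a quick round trip: save, load, and compare predictions. The sketch below writes to a temporary directory so it leaves no file behind. `rf_demo`, `model_path` and `new_obs` are illustrative names, and it assumes the standalone `joblib` package is installed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(cancer.data, cancer.target,
                                          test_size=0.3, random_state=42)
rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=42)
rf_demo.fit(x_tr, y_tr)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "cancer_classifier.pkl")
    joblib.dump(rf_demo, model_path)        # save the fitted model
    classifier = joblib.load(model_path)    # load it back as a new object

# the loaded model should give the same answers as the original
new_obs = x_te[:5]
same_predictions = np.array_equal(classifier.predict(new_obs),
                                  rf_demo.predict(new_obs))
print(same_predictions)
```

New observations must have the same columns, in the same order, as the data the model was trained on. Only load model files from sources you trust, since they are pickle-based.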

<center><h3>Congratulations! You have built your first classifier!</h3></center>
<center><h5>www.thecodinghive.com</h5></center>