<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Use XGBoost to classify tumors with IBM Watson Machine Learning</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/images/cancer_banner-06.png" alt="Icon" width="700"> </th>
   </tr>
</table>

This notebook contains steps and code to get data from the IBM Watson Studio Community, create a predictive model, and start scoring new data. It introduces commands for getting data, basic data cleaning and exploration, model training, model persistence to the Watson Machine Learning repository, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 3.5, XGBoost, and scikit-learn.

You will use a publicly available data set, the Breast Cancer Wisconsin (Diagnostic) Data Set, to train an XGBoost model to classify breast cancer tumors (as benign or malignant) from 569 diagnostic images based on measurements such as radius, texture, perimeter, and area. XGBoost is short for “E**x**treme **G**radient **Boost**ing”.

The XGBoost classifier makes its predictions by summing the outputs of a collection of classification trees. It combines many weak learners into a single strong learner. Training is sequential: each new tree is fitted to correct the errors left by the trees trained before it.
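To make the sequential idea concrete, here is a minimal sketch of boosting with one-split trees (stumps) on a single made-up feature. It is a simplified illustration, not XGBoost's actual algorithm, which also uses second-order gradient information and regularization. The toy data and the helper names `fit_stump`, `stump_output`, and `boost` are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(ys, scores):
    """Mean binary cross-entropy of the current additive scores."""
    return -sum(math.log(sigmoid(s) if y == 1 else 1.0 - sigmoid(s))
                for y, s in zip(ys, scores)) / len(ys)

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) of the one-split tree
    that best fits the residuals in the least-squares sense."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    return best[1:]

def stump_output(stump, x):
    t, l_value, r_value = stump
    return l_value if x <= t else r_value

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    stumps, scores = [], [0.0] * len(xs)
    losses = [log_loss(ys, scores)]
    for _ in range(n_rounds):
        # Residuals are largest where the current ensemble is most wrong,
        # so the next stump concentrates on those records.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_output(stump, x)
                  for s, x in zip(scores, xs)]
        losses.append(log_loss(ys, scores))
    return stumps, losses

# Made-up data: one measurement per record, label 1 = malignant.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
stumps, losses = boost(xs, ys)

def predict_proba(x, learning_rate=0.5):
    """The ensemble prediction is the sum of all stump outputs, squashed to (0, 1)."""
    return sigmoid(sum(learning_rate * stump_output(s, x) for s in stumps))

print("Training loss before and after boosting:", losses[0], losses[-1])
```

No single stump can separate these labels, because the records at 4.0 and 5.0 are out of order. The sum of several stumps can, which is the sense in which weak learners combine into a stronger one.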


## Learning goals

You will learn how to:

-  Load a CSV file into a numpy array
-  Explore data
-  Prepare data for training and evaluation
-  Create an XGBoost machine learning model
-  Train and evaluate a model
-  Use cross-validation to optimize the model's hyperparameters
-  Persist a model in the Watson Machine Learning repository
-  Deploy a model for online scoring
-  Score sample data


## Contents

This notebook contains the following parts:

1.	[Set up the environment](#setup)
2.	[Load and explore the data](#load)
3.	[Create an XGBoost model](#model)
4.	[Persist model](#persistence)
5.	[Score the staged model](#scoring)
6.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Set up the environment

Before you use the sample code in this notebook, you have to perform the following setup tasks:

- Create a [Watson Machine Learning (WML) Service](https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/) instance (a free plan is offered and information about how to create the instance is [here](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html)).
- Download the **Breast Cancer Wisconsin (Diagnostic) Data Set** from the Watson Studio [Community](https://dataplatform.ibm.com/community?context=analytics).

**Note:** This notebook provides the code to download the data set; see [step 2](#load).

<a id="load"></a>
## 2. Load and explore the data

In this section you will load the data as a numpy array and perform a basic exploration.

To load the data as a numpy array, use `wget` to download the data, then use the `genfromtxt` function to read the data.

**Example**: First, you need to install the required packages. You can do this by running the following code. Run it only one time.

In [None]:
!pip install wget --upgrade

In [None]:
### Insert Project Data Or


In [None]:
### Fetch Remote Data
import wget, os

WisconsinDataSet = 'BreastCancerWisconsinDataSet.csv' 
if not os.path.isfile(WisconsinDataSet):
    link_to_data = 'https://apsportal.ibm.com/exchange-api/v1/entries/c173693bf48aeb22e41bbe2b41d79c1f/data?accessKey=941eec501eadcdceb5abd25cf7c029d5'
    WisconsinDataSet = wget.download(link_to_data)

print(WisconsinDataSet)

The CSV file **BreastCancerWisconsinDataSet.csv** is downloaded. Run the code in the next cells to load the file into a numpy array.

**Note:** Update `numpy` to ensure you have the latest version.

In [None]:
# Run this code to upgrade numpy.
!pip install numpy --upgrade

In [None]:
import numpy as np


np_data = np.genfromtxt(WisconsinDataSet, delimiter=',', names=True, dtype=None, encoding='utf-8')
print(np_data[0])

Run the code in the next cell to view the feature names and data storage types.

In [None]:
# Display the feature names and data storage types.
print(np_data.dtype)

In [None]:
# Display the number of records and features.
print('Number of rows: {}'.format(np_data.size))
print('Number of columns: {}'.format(len(np_data[0])))

You can see that the data set has 569 records and 32 columns: the `id`, the `diagnosis` label, and 30 measurements.
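With `names=True` and `dtype=None`, `genfromtxt` returns a one-dimensional structured array: one element per record, with fields named after the CSV header. The sketch below shows this on a three-row, in-memory sample. The rows and the `id` values are made up for illustration and only mimic the first few columns of the real file.

```python
import io
import numpy as np

# Made-up rows in the same layout as the first columns of the data set.
csv_text = (
    "id,diagnosis,radius_mean,texture_mean\n"
    "101,M,17.99,10.38\n"
    "102,B,13.54,14.36\n"
    "103,B,12.45,15.70\n"
)

sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                       dtype=None, encoding='utf-8')

print(sample.dtype.names)            # field names taken from the header row
print(sample['diagnosis'])           # one column, selected by field name
print(sample.size, len(sample[0]))   # number of records, number of columns
```

This is why the cells above use `np_data.size` for the row count and `len(np_data[0])` for the column count: the array has a single dimension, and each element is a whole record.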

<a id="model"></a>
## 3. Create an XGBoost model

In this section you will learn how to train and test an XGBoost model.

- [3.1. Prepare the data](#prepare)
- [3.2. Create the XGBoost model](#create)

### 3.1. Prepare the data<a id="prepare"></a>

Now, you can prepare your data for model building. You will use the `diagnosis` column as your target variable, so you must remove it from the set of predictors. You must also remove the `id` column, because it is only a record identifier.
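As a small-scale illustration of this preparation step, the following sketch encodes the target and extracts the predictors from three made-up records. It shows positional slicing, which relies on `id` and `diagnosis` being the first two columns, next to a name-based selection that does not depend on column order. The record values and the `_demo` variable names are assumptions for this example.

```python
import numpy as np

# Three made-up records with the same leading columns as the data set.
record_dtype = [('id', 'i8'), ('diagnosis', 'U1'),
                ('radius_mean', 'f8'), ('texture_mean', 'f8')]
records = np.array([(101, 'M', 17.99, 10.38),
                    (102, 'B', 13.54, 14.36),
                    (103, 'B', 12.45, 15.70)], dtype=record_dtype)

# Target: 1.0 for malignant ('M'), 0.0 for benign ('B').
y_demo = 1.0 * (records['diagnosis'] == 'M')

# Predictors by position: drop the first two columns.
X_by_position = np.array([list(r)[2:] for r in records])

# Predictors by name: keep every field except id and diagnosis.
feature_names = [n for n in records.dtype.names if n not in ('id', 'diagnosis')]
X_by_name = np.column_stack([records[n] for n in feature_names])

print(y_demo)
print(X_by_position.shape, np.array_equal(X_by_position, X_by_name))
```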

In [None]:
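# Encode the target: 1.0 for malignant ('M'), 0.0 for benign.
y = 1.0*(np_data['diagnosis'] == 'M')
# Keep the predictors only: drop the first two columns (id and diagnosis).
X = np.array([list(r)[2:] for r in np_data])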

Split the data set into: 
- Train data set
- Test data set

In [None]:
# Split the data set and create two data sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=143)

In [None]:
# List the number of records in each data set.
print("Number of training records: " + str(X_train.shape[0]))
print("Number of testing records : " + str(X_test.shape[0]))

The data has been successfully split into two data sets:
- The train data set, which is the larger group, will be used for training
- The test data set will be used for model evaluation and to test the assumptions of the model
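For intuition, a random hold-out split can be sketched in a few lines of numpy: shuffle the record indices, then cut them at the requested test fraction. This is an illustrative stand-in, not scikit-learn's implementation. The function name `holdout_split`, the placeholder data, and the choice to round the test count up are assumptions of this sketch.

```python
import math
import numpy as np

def holdout_split(X, y, test_size, seed):
    """Shuffle the record indices and cut them into a train part and a test part."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_test = int(math.ceil(test_size * len(X)))  # assumption: round the test count up
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the same number of records as the data set.
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.arange(569) % 2

Xtr, Xte, ytr, yte = holdout_split(X_demo, y_demo, test_size=0.35, seed=143)
print("Train records:", Xtr.shape[0], "Test records:", Xte.shape[0])
```

Every record lands in exactly one of the two sets, so the test accuracy is measured on records the model never saw during training.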

### 3.2. Create the XGBoost model<a id="create"></a>

Start by importing the necessary libraries.

In [None]:
# Import the libraries you need to create the XGBoost model.
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

#### 3.2.1. Create an XGBoost classifier

In this section you create an XGBoost classifier with default hyperparameter values and you will call it *xgb_model*. 

**Note:** The next sections show you how to improve this base model.

In [None]:
# Create the XGB classifier, xgb_model.
xgb_model = XGBClassifier()

Display the default parameters for *xgb_model*.

In [None]:
# List the default parameters.
print(xgb_model.get_xgb_params())

Now that your XGBoost classifier, *xgb_model*, is set up, you can train it by invoking the `fit` method. During training, you will also evaluate *xgb_model* on both the train and test data.

In [None]:
# Train and evaluate.
xgb_model.fit(X_train, y_train, eval_metric=['error'], eval_set=[(X_train, y_train), (X_test, y_test)])

**Note:** You can also use a pandas DataFrame instead of the numpy array.

Plot the model performance evaluated during the training process to assess model overfitting.

In [None]:
# Import the library
from matplotlib import pyplot

%matplotlib inline

In [None]:
# Plot and display the performance evaluation
xgb_eval = xgb_model.evals_result()
eval_steps = range(len(xgb_eval['validation_0']['error']))

fig, ax = pyplot.subplots(1, 1, sharex=True, figsize=(8, 6))

ax.plot(eval_steps, [1-x for x in xgb_eval['validation_0']['error']], label='Train')
ax.plot(eval_steps, [1-x for x in xgb_eval['validation_1']['error']], label='Test')
ax.legend()
ax.set_title('Accuracy')
ax.set_xlabel('Number of iterations')

You can see that the model overfits: accuracy decreases after about 60 iterations.
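A curve like this can also be read programmatically. The sketch below finds the iteration with the lowest error on the second evaluation set. The error values are hypothetical and much shorter than a real run; only the nested dictionary layout follows what `evals_result()` returns in the cells above.

```python
# Hypothetical per-iteration errors in the layout returned by evals_result().
xgb_eval_demo = {
    'validation_0': {'error': [0.05, 0.03, 0.02, 0.01, 0.00, 0.00]},  # train
    'validation_1': {'error': [0.08, 0.06, 0.04, 0.04, 0.05, 0.06]},  # test
}

test_error = xgb_eval_demo['validation_1']['error']

# Index of the first iteration that reaches the lowest test error.
best_iteration = min(range(len(test_error)), key=test_error.__getitem__)

# Iterations are counted from zero, so the number of trees is one more.
n_trees_demo = best_iteration + 1
print("Lowest test error after", n_trees_demo, "trees")
```

In this made-up run the train error keeps falling while the test error starts to rise again, which is the pattern that signals overfitting.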

Select the trained model obtained after 30 iterations.

In [None]:
# Select the trained model: use only the first n_trees trees for prediction.
n_trees = 30
y_pred = xgb_model.predict(X_test, ntree_limit=n_trees)

In [None]:
# Check the accuracy of the trained model.
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy: %.1f%%" % (accuracy * 100.0))

**Note:** You will use the accuracy value obtained on the test data to compare the accuracy of the model with default parameters to the accuracy of the model with tuned parameters.
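As a reminder of what is being compared, accuracy is the fraction of records whose predicted label matches the true label, and the `error` metric plotted earlier is its complement. The labels below are made up for illustration.

```python
# Made-up true and predicted labels (1.0 = malignant, 0.0 = benign).
y_true_demo = [1.0, 0.0, 0.0, 1.0, 0.0]
y_pred_demo = [1.0, 0.0, 1.0, 1.0, 0.0]

n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy_demo = n_correct / len(y_true_demo)   # fraction classified correctly
error_demo = 1.0 - accuracy_demo               # fraction misclassified

print("Accuracy: %.1f%%" % (accuracy_demo * 100.0))
```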

#### 3.2.2. Use grid search and cross-validation to tune the model 

You can use grid search and cross-validation to tune your model to achieve better accuracy.

XGBoost has an extensive catalog of hyperparameters, which provides great flexibility to shape the algorithm’s behavior. Here you will tune the regularization hyperparameters: the L1 penalty (`reg_alpha`) and the L2 penalty (`reg_lambda`).

Use a 5-fold cross-validation because your training data set is small.
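The mechanics of k-fold cross-validation can be sketched as follows: the training records are cut into five near-equal folds, and each fold takes one turn as the validation set while the other four are used for fitting. This is a plain, unstratified sketch on 12 hypothetical records; the function name `kfold_indices` is an assumption, and scikit-learn's own splitters may shuffle or stratify differently.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for consecutive, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # the first folds absorb the remainder
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(kfold_indices(n_samples=12, n_folds=5))
for train, val in folds:
    print("validate on", val, "train on", len(train), "records")
```

Each record is used for validation exactly once, so the averaged score makes use of the whole training set even though no model is ever scored on records it was fitted to.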

In the cell below, create the XGBoost classifier and set up the parameter grid for the search.

In [None]:
# Create the XGBoost classifier and set up the parameter grid.
xgb_model_gs = XGBClassifier()
parameters = {'reg_alpha': [0.0, 1.0], 'reg_lambda': [0.0, 1.0], 'n_estimators': [n_trees], 'seed': [1337]}

Use ``GridSearchCV`` to search for the best parameters over the parameter values that were specified in the previous cell.
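A grid search tries every combination of the listed values. The sketch below enumerates the combinations for the grid used in this notebook with `itertools.product`, to show how much work the search involves. The variable names are assumptions for this example, and the enumeration order is not meant to mirror scikit-learn's.

```python
from itertools import product

# Same values as the parameter grid in this notebook (n_trees is 30).
parameters_demo = {'reg_alpha': [0.0, 1.0], 'reg_lambda': [0.0, 1.0],
                   'n_estimators': [30], 'seed': [1337]}

names = sorted(parameters_demo)
candidates = [dict(zip(names, values))
              for values in product(*(parameters_demo[n] for n in names))]

n_folds = 5
n_fits = len(candidates) * n_folds  # one fit per candidate and fold

for candidate in candidates:
    print(candidate)
print("Candidates:", len(candidates), "Cross-validation fits:", n_fits)
```

Only `reg_alpha` and `reg_lambda` have more than one value, so the grid holds 2 x 2 = 4 candidates, and 5-fold cross-validation fits each of them 5 times. With `refit=True`, the best candidate is then fitted once more on the full training set.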

In [None]:
# Search for the best parameters.
clf = GridSearchCV(xgb_model_gs, parameters, scoring='accuracy', cv=5, verbose=1, n_jobs=-1, refit=True)
clf.fit(X_train, y_train)

From the grid scores, you can see the performance result of all parameter combinations including the best parameter combination based on model performance.

In [None]:
# View the performance result.
clf.cv_results_

Display the accuracy estimated using cross-validation and the hyperparameter values for the best model.

In [None]:
print("Best score: %.1f%%" % (clf.best_score_*100))
print("Best parameter set: %s" % (clf.best_params_))

Display the accuracy of the best parameter combination on the test set.

In [None]:
y_pred = clf.best_estimator_.predict(X_test, ntree_limit=n_trees)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.1f%%" % (accuracy * 100.0))

The accuracy on the test set is about the same for the tuned model as it is for the model trained with default hyperparameter values, even though the selected hyperparameters differ from the defaults.

#### 3.2.3. Model with pipeline data preprocessing

Here you learn how to use the XGBoost model within a scikit-learn pipeline.

Let's start by importing the required objects.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=10)
xgb_model_pca = XGBClassifier(n_estimators=n_trees)
pipeline = Pipeline(steps=[('pca', pca), ('xgb', xgb_model_pca)])

In [None]:
pipeline.fit(X_train, y_train)

Now you are ready to evaluate the accuracy of the model trained on the reduced set of features.

In [None]:
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.1f%%" % (accuracy * 100.0))

You can see that this model has a similar accuracy to the model trained using default hyperparameter values.
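For intuition about the PCA step in the pipeline, the sketch below reduces 30 features to 10 components with a singular value decomposition. It illustrates the idea only and is not scikit-learn's implementation; the function name `pca_project` and the random placeholder data are assumptions for this example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first principal directions.

    Returns the projected data and the variance captured by each component."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return X_centered @ Vt[:n_components].T, explained_variance[:n_components]

# Random placeholder data: 50 records with 30 features each.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 30))

X_reduced, variances = pca_project(X_demo, n_components=10)
print(X_reduced.shape)
print(variances)
```

The classifier in the pipeline then sees only the 10 projected columns, which is why the result is described as a model trained on a reduced set of features.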

Let's see how you can save your XGBoost pipeline using the WML service instance and deploy it for online scoring.

<a id="persistence"></a>
## 4. Persist model

In this section you learn how to use the Python client libraries to store your XGBoost model in the WML repository.

First, you must import the client libraries.

In [None]:
import pandas as pd
from dsx_ml.ml import save

# Store the fitted pipeline, passing the test data as DataFrames.
save(name='XGBoostTumorClassification',
     model=pipeline,
     x_test=pd.DataFrame(X_test),
     y_test=pd.DataFrame(y_test),
     algorithm_type='Classification',
     source='Use XGBoost to classify tumors.ipynb',
     description='Tumor Malignancy Classification with XGBoost')

Get the saved model metadata from WML.

<a id="scoring"></a>
## 5. Score the Staged Model


In this section you will learn how to use WML to create online scoring and score a new data record.

### Perform prediction

Now, extract the URL endpoint, *scoring_url*, which will be used to send scoring requests.

In [None]:
# Extract the endpoint URL and display it.
scoring_url = 'https://dsxl-api/v3/project/score/Python35/scikit-learn-0.20/wsl-workshop/XGBoostTumorClassification/1'
print(scoring_url)

Prepare the scoring payload with the values to score.
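The cell below defines two payload shapes: a list of records keyed by feature name, and a dictionary with a `values` list of rows. The sketch here builds both from the same three measurements and serializes them to JSON. It uses only a subset of the features for brevity, and which shape a given scoring endpoint accepts is an assumption you should check against your deployment.

```python
import json

# A subset of the features, for brevity (values taken from the sample record).
feature_names_demo = ['radius_mean', 'texture_mean', 'perimeter_mean']
feature_values_demo = [12.89, 13.12, 81.89]

# Shape 1: one dictionary per record, keyed by feature name.
payload_by_name = [dict(zip(feature_names_demo, feature_values_demo))]

# Shape 2: positional values, one inner list per record.
payload_by_position = {"values": [feature_values_demo]}

# requests.post(..., json=payload) performs this serialization for you.
body_by_name = json.dumps(payload_by_name)
body_by_position = json.dumps(payload_by_position)
print(body_by_name)
print(body_by_position)
```

The name-keyed shape is self-describing, while the positional shape relies on the values being in the same column order that the model was trained on.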

In [None]:
# Prepare scoring payload.
json_payload = [{"concave points_mean":0.01171,"perimeter_se":1.115,"fractal_dimension_mean":0.05581,"symmetry_se":0.01619,"smoothness_worst":0.09616,"concave points_se":0.005905,"fractal_dimension_se":0.002081,"concavity_se":0.01652,"compactness_se":0.01345,"compactness_mean":0.03729,"texture_mean":13.12,"radius_mean":12.89,"fractal_dimension_worst":0.06915,"area_mean":515.9,"radius_se":0.1532,"symmetry_worst":0.2309,"diagnosis":"B","concave points_worst":0.05366,"area_worst":577,"perimeter_mean":81.89,"smoothness_se":0.004731,"texture_se":0.469,"area_se":12.68,"compactness_worst":0.1147,"symmetry_mean":0.1337,"smoothness_mean":0.06955,"perimeter_worst":87.4,"concavity_worst":0.1186,"concavity_mean":0.0226,"texture_worst":15.54,"radius_worst":13.62,"id":8913}]

payload_scoring = {"values": [X_test[0].tolist()]}
print(payload_scoring)

In [None]:
# Perform prediction and display the result.
import requests, json, os

header_online = {'Content-Type': 'application/json', 'Authorization':os.environ['DSX_TOKEN']}
response_scoring = requests.post(scoring_url, json=json_payload, headers=header_online)
response_scoring.content

**Result**: The patient record is classified as a benign tumor.

<a id="summary"></a>
## 6. Summary and next steps     

You successfully completed this notebook! 

You learned how to use XGBoost machine learning as well as Watson Machine Learning to create and deploy a model. 

Check out our [Online Documentation](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html) for more samples, tutorials, documentation, how-tos, and blog posts. 

### Data citations

Lichman, M. (2013). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License.