# Predict heart failure with Watson Machine Learning

This notebook contains the steps and code to build a model that predicts heart failure and then deploy that model to Watson Machine Learning so it can be used in an application.

## Learning Goals

The learning goals of this notebook are:

* Load a CSV file into the Object Storage service linked to your Watson Studio project
* Create an Apache Spark machine learning model
* Train and evaluate a model
* Persist a model in a Watson Machine Learning repository

## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

* Create a Deployment Space (Watson Machine Learning service) instance and associate it with your project
* Upload heart failure data to the Watson Studio project

We'll be using a few libraries for this exercise:

1. [Watson Machine Learning Client](http://wml-api-pyclient.mybluemix.net/): Client library to work with the Watson Machine Learning service on IBM Cloud.
1. [Pixiedust](https://github.com/pixiedust/pixiedust): Python Helper library for Jupyter Notebooks
1. [ibmos2spark](https://github.com/ibm-watson-data-lab/ibmos2spark): Facilitates Data I/O between Spark and IBM Object Storage services

# 1.0 Install required packages

In [None]:
!pip install --user watson-machine-learning-client --upgrade | tail -n 1
!pip install --user pyspark==2.3.3 --upgrade|tail -n 1
!pip install --upgrade pixiedust | tail -n 1

import pixiedust

# 2.0 Load and Explore Data

Load the data as a pandas DataFrame. The dataset should already be uploaded to your Watson Studio project. If it is not, refer to the repo README file for instructions on uploading the sample dataset.

* Highlight the cell below by clicking it.
* Click the `10/01` "Find data" icon in the upper right of the notebook.
* In the `Find and add data` right hand panel, your data file should be listed.
* Click the `Insert to code` drop-down menu under your file name.
* Select `Insert pandas DataFrame`
* The code that brings the data into the notebook environment and creates a Pandas DataFrame will be added to the cell below.
* Run the cell

> **IMPORTANT**: Ensure the DataFrame is named `df_data_1`. If not, rename it.

In [None]:
# Place cursor below and insert the Pandas DataFrame for the sample dataset
import pandas as pd




We'll use the pandas naming convention `df` for our DataFrame. Make sure the cell below assigns the DataFrame name created above. For a locally uploaded file, the name looks like `df_data_1`, `df_data_2`, or `df_data_x`. For virtualized data, it looks like `data_df_1`, `data_df_2`, or `data_df_x`.

In [None]:
# for virtualized data
# df = data_df_1

# for local upload
df = df_data_1

### 2.1 Drop PATIENTID feature (column)
Sometimes you want to build your model on only part of your dataset. For example, `PATIENTID` is an identifier with no predictive value, so remove it from the dataset.

In [None]:
df = df.drop('PATIENTID', axis=1)
df.head(5)

### 2.2 Explore sample dataset

Explore the loaded dataset by using the following DataFrame methods:

* `df.info()` to print the data schema
* `df.count()` to count the non-null values in each column

After removing the `PATIENTID` column, the dataset contains ten fields and 10800 records. The `HEARTFAILURE` field is the one we would like to predict (the label).

In [None]:
df.info()

In [None]:
df.count()
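
On a pandas DataFrame, `count()` returns the number of non-null values per column rather than a single row total, so a column with missing values reports a smaller number than the others. The sketch below shows the difference on a hypothetical three-row frame; the values are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only.
sample = pd.DataFrame({
    "AGE": [45, 61, 38],
    "CHOLESTEROL": [180.0, np.nan, 210.0],
})

non_null_per_column = sample.count()  # non-null values in each column
total_rows = len(sample)              # number of rows, NaN or not

print(non_null_per_column)
print("rows:", total_rows)
```

A gap between a column's count and the row total is an early hint that the column contains NaN values.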



### 2.3 Handle NaN values
NaN values should be handled before training to create a more accurate model. Check whether your dataset contains any NaN values, and replace them if it does.

In [None]:
# Check if we have any NaN values
df.isnull().values.any()

The check returns `True`, so the sample dataset contains NaN values. The non-null counts printed by `df.info()` show that they are in the `CHOLESTEROL` column.
Set `nan_column` to the positional index of `CHOLESTEROL` (starting at 0).

In [None]:
nan_column = df.columns.get_loc("CHOLESTEROL")
print(nan_column)
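
`df.isnull().values.any()` only reports whether a NaN exists somewhere in the frame. One way to see which columns are affected is a per-column count. This is a minimal sketch on a hypothetical frame with made-up values, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the column names mirror the dataset.
sample = pd.DataFrame({
    "BMI": [22.5, 27.1, 30.4],
    "CHOLESTEROL": [180.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; summing it counts the NaN values per column.
nan_counts = sample.isnull().sum()
columns_with_nan = nan_counts[nan_counts > 0].index.tolist()

print(columns_with_nan)
```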

In [None]:
# Handle missing values for nan_column (CHOLESTEROL) by replacing them with the column mean

from sklearn.preprocessing import Imputer

imp = Imputer(missing_values="NaN", strategy="mean")

df.iloc[:, nan_column] = imp.fit_transform(df.iloc[:, nan_column].values.reshape(-1, 1))
df.iloc[:, nan_column] = pd.Series(df.iloc[:, nan_column])

In [None]:
# Check if we have any NaN values
df.isnull().values.any()


The check now returns `False`: the NaN values have been replaced with the column mean.
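
Note that `Imputer` was removed from scikit-learn in version 0.22, so the imputation cell above only runs on older releases. Where that is a concern, the same mean imputation can be sketched with pandas alone. The frame below is hypothetical and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
sample = pd.DataFrame({"CHOLESTEROL": [180.0, np.nan, 220.0]})

# mean() skips NaN by default, so this is the mean of the two known values.
column_mean = sample["CHOLESTEROL"].mean()

# Replace each NaN with that mean.
sample["CHOLESTEROL"] = sample["CHOLESTEROL"].fillna(column_mean)

print(sample)
```

In the notebook, the equivalent would be applied to `df["CHOLESTEROL"]`; this is an alternative under the stated assumption, not the approach the original cell uses.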


### 2.4 Visualize data

Python provides a rich set of visualization libraries. For example, with PixieDust's `display()` method you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, or maps.

To explore a data set, choose the desired chart type from the drop-down menu, then configure the chart options and display options.



In [None]:
import json
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import sklearn.feature_selection
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics

In [None]:
# Plot HEARTFAILURE frequency counts for each CHOLESTEROL value
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="CHOLESTEROL", hue="HEARTFAILURE", data=df)

In [None]:
# Plot HEARTFAILURE frequency counts for each FAMILYHISTORY value
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="FAMILYHISTORY", hue="HEARTFAILURE", data=df)

In [None]:
# Plot HEARTFAILURE frequency counts for each SMOKERLAST5YRS value
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="SMOKERLAST5YRS", hue="HEARTFAILURE", data=df)


In [None]:
display(df)


[Optional] Another way to explore the data is the built-in Data Refinery tool. From your main project page in Watson Studio, select the Assets tab and click the name of your training data. On the data preview page, click the Refine button to load the data into Data Refinery.

The data profiler and dozens of built-in charts, graphs, and statistics help you understand the quality and distribution of your data, and data types and column classifications are detected automatically. Select the Profile tab to better understand the values of the columns (features) used later when building the machine learning models.


# 3.0 Create Model



### 3.1 Preparation



In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.sql import SparkSession
import pandas as pd
import json

spark = SparkSession.builder.getOrCreate()
df_data = spark.createDataFrame(df)
df_data.head()

### 3.2 Split the data into training and test sets


In this subsection you will split your data into training and test data sets.


In [None]:
split_data = df_data.randomSplit([0.8, 0.20], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

As you can see, the data has been split into two data sets:

- The training data set, which is the larger of the two, is used to train the model.
- The test data set is held back and used to evaluate the trained model.
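
Spark's `randomSplit` assigns rows at random according to the weights, so the resulting sizes are close to, but not exactly, 80% and 20%. The seed (`24` above) makes the split repeatable. The pure-Python sketch below imitates that behaviour for illustration only; `seeded_split` is a hypothetical helper and not Spark's implementation.

```python
import random

def seeded_split(rows, train_fraction, seed):
    """Send each row to train or test with a seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row independently lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=24)
train_b, test_b = seeded_split(rows, 0.8, seed=24)

print(len(train_a), len(test_a))
print("repeatable:", train_a == train_b)
```

Because every row is drawn independently, the counts vary slightly around the requested fractions, which is why the notebook prints the actual record counts after splitting.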



### 3.3 Convert all String Fields to Numeric


Convert all the string fields to numeric ones by using the StringIndexer transformer.


In [None]:
stringIndexer_label = StringIndexer(inputCol="HEARTFAILURE", outputCol="label").fit(df_data)
stringIndexer_sex = StringIndexer(inputCol="SEX", outputCol="SEX_IX")
stringIndexer_famhist = StringIndexer(inputCol="FAMILYHISTORY", outputCol="FAMILYHISTORY_IX")
stringIndexer_smoker = StringIndexer(inputCol="SMOKERLAST5YRS", outputCol="SMOKERLAST5YRS_IX")
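
By default, `StringIndexer` orders the distinct values by descending frequency, so the most frequent value receives index `0.0`. The pure-Python sketch below mimics that mapping to show what the indexed columns contain. `fit_string_index` is a hypothetical helper, and the `"M"`/`"F"` values are assumed for illustration rather than taken from the dataset.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    ordered = [label for label, _ in counts.most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Assumed example values; "M" occurs more often, so it gets index 0.0.
sex_values = ["M", "F", "M", "M", "F", "M"]
mapping = fit_string_index(sex_values)
indexed = [mapping[v] for v in sex_values]

print(mapping)
print(indexed)
```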

### 3.4 Create a single vector


In this step, create a feature vector by combining all features together.


In [None]:
vectorAssembler_features = VectorAssembler(inputCols=["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX_IX","FAMILYHISTORY_IX","SMOKERLAST5YRS_IX","EXERCISEMINPERWEEK"], outputCol="features")
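
Conceptually, `VectorAssembler` concatenates the listed input columns, in the order given, into one numeric vector per row. A rough pure-Python picture of that step follows; `assemble` is a hypothetical helper and the row values are made up.

```python
def assemble(row, input_cols):
    """Build a feature list from a row, following the order of input_cols."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row after string indexing.
row = {"AGE": 52, "BMI": 27.4, "SEX_IX": 1.0, "CHOLESTEROL": 195}
features = assemble(row, ["CHOLESTEROL", "BMI", "AGE", "SEX_IX"])

print(features)
```

The order of `inputCols` fixes the position of each feature in the vector, so it must stay the same between training and scoring.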

### 3.5 Define Estimator


Next, define the estimator you want to use for classification. Random Forest is used in this case.


In [None]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

### 3.6 Map Indexed Labels back to Original Labels


Finally, convert the indexed labels back to the original labels.


In [None]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

In [None]:
transform_df_pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features])
transformed_df = transform_df_pipeline.fit(df_data).transform(df_data)
transformed_df.show()

### 3.7 Create Pipeline


Let's build the pipeline now. A pipeline consists of transformers and an estimator.


In [None]:
pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, rf, labelConverter])

### 3.8 Train Model 


Now, you can train your Random Forest model by using the previously defined pipeline and training data.


In [None]:
model = pipeline.fit(train_data)

### 3.9 Check Model Accuracy


You can check your model accuracy now. To evaluate the model, use test data.


In [None]:
predictions = model.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)

print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
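
With `metricName="accuracy"`, the evaluator reports the fraction of test rows whose `prediction` equals `label`, and the test error is its complement. The sketch below reproduces that arithmetic in plain Python on made-up values; `simple_accuracy` is a hypothetical helper, not part of Spark.

```python
def simple_accuracy(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical indexed labels and predictions for five test rows.
labels = [0.0, 1.0, 0.0, 0.0, 1.0]
predictions = [0.0, 1.0, 1.0, 0.0, 0.0]

acc = simple_accuracy(labels, predictions)
print("Accuracy = %g" % acc)
print("Test Error = %g" % (1.0 - acc))
```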


You can now tune your model to achieve better accuracy. To keep this example simple, the tuning section is omitted.


# 4.0 Save the model and test data

Add a unique name for MODEL_NAME.

In [None]:
MODEL_NAME = "my-model-today"

### 4.1 Save the model to ICP4D local Watson Machine Learning
Replace the `username` and `password` values of `*****` with your Cloud Pak for Data `username` and `password`.
The value for `url` should match the `url` for your Cloud Pak for Data cluster.

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

wml_credentials = {
                   "url": "https://zen-cpd-zen.apps.os-workshop-nov22.vz-cpd-nov22.com",
                   "username": "*****",
                   "password" : "*****",
                   "instance_id": "wml_local",
                   "version" : "2.5.0"
                  }

client = WatsonMachineLearningAPIClient(wml_credentials)
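
To avoid leaving a password in a saved notebook, the credentials dictionary can be built from environment variables instead of literals. This is only a sketch: the variable names `CPD_URL`, `CPD_USERNAME`, and `CPD_PASSWORD` and the helper `build_wml_credentials` are assumptions made for illustration, not names defined by Cloud Pak for Data or the client library.

```python
import os

def build_wml_credentials(env=os.environ):
    """Build the credentials dict from a mapping such as os.environ.

    The key names read from `env` are hypothetical; a missing key raises KeyError.
    """
    return {
        "url": env["CPD_URL"],
        "username": env["CPD_USERNAME"],
        "password": env["CPD_PASSWORD"],
        "instance_id": "wml_local",
        "version": "2.5.0",
    }

# Stand-in mapping so the sketch runs without real environment variables.
example_env = {
    "CPD_URL": "https://cpd.example.com",
    "CPD_USERNAME": "user",
    "CPD_PASSWORD": "secret",
}
credentials = build_wml_credentials(example_env)
print(sorted(credentials))
```

In the notebook, calling `build_wml_credentials()` with no argument would read the real environment, and the result could be passed to `WatsonMachineLearningAPIClient` in place of `wml_credentials`.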

In [None]:
client.spaces.list()

### Use the desired space as the `default_space`
Put the `GUID` of the desired space as the parameter below

In [None]:
client.set.default_space('<GUID>')

In [None]:
# Store our model
model_props = {client.repository.ModelMetaNames.NAME: MODEL_NAME,
               client.repository.ModelMetaNames.RUNTIME_UID : "spark-mllib_2.3",
               client.repository.ModelMetaNames.TYPE : "mllib_2.3"}
published_model = client.repository.store_model(model=model, pipeline=pipeline, meta_props=model_props, training_data=train_data)

In [None]:
# Use this cell to do any cleanup of previously created models and deployments
client.repository.list_models()
client.deployments.list()

# client.repository.delete('GUID of stored model')
# client.deployments.delete('GUID of deployed model')


### 4.2 Write the test data to a .csv so that we can later use it for evaluation

In [None]:
write_eval_CSV=test_data.toPandas()
write_eval_CSV.to_csv('/project_data/data_asset/HEARTFAILURE-SparkMLEval.csv', sep=',', index=False)
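
A quick way to confirm the file was written as expected is to read it back and compare it with the original frame. The sketch below does this in a temporary directory with a hypothetical two-row frame; the file name and the `"N"`/`"Y"` label values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical evaluation frame; values are illustrative only.
eval_df = pd.DataFrame({"AGE": [45, 61], "HEARTFAILURE": ["N", "Y"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "eval-sample.csv")
    # index=False keeps the pandas row index out of the file.
    eval_df.to_csv(csv_path, sep=",", index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(list(reloaded.columns))
```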

## Congratulations, you have created a model based on heart failure data and saved it to Watson Machine Learning!