# Payment Plan Campaign

Jupiter Energy is the leading supplier of clean, renewable energy for the greater Boston area, servicing nearly 5 million customers across 4 counties. To better serve their clients, especially those facing financial hardship, Jupiter is launching a new set of electricity rates and payment plans. These new plans will significantly lower the total cost of energy for their clients in need.

Your task is to identify the customers who could benefit from the new plans.

In this Python 3.10 notebook, you'll improve the quality of the data and then build a machine learning model to determine which clients should be offered the new payment plans because they are likely to miss payments. You’ll be guided through these steps:

- Step 1: Load the data
- Step 2: Explore the data
- Step 3: Prepare the data
- Step 4: Build and train models
- Step 5: Evaluate the models
- Step 6: Predict potential missed payments
- Step 7: Deploy the model (Optional)


#### Insert a project token

When you import this project from the Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

```python
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
```

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

* Click **More -> Insert project token** in the top-right menu section.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

* This should insert a cell at the top of this notebook similar to the example given above.

  > If an error is displayed indicating that no project token is defined, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data).

* Run the newly inserted cell before proceeding with the notebook execution below.

## Import libraries

Many popular open source libraries are pre-installed on Cloud Pak for Data platform environments. All you have to do is import them. If a library is not preinstalled, you can add it through the notebook or by adding a customization to the environment in which the notebook runs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

%matplotlib inline
plt.style.use('ggplot')

## Step 1: Load and access the data

In [None]:
df = pd.read_csv(project.get_file('Historical-Customer-Payments-Prepared.csv'))
df.head()

### Alternative load data method

Use the **Insert code to cell** function to automatically generate code that loads the data and shows the first 5 rows in a pandas DataFrame:
1. Click the **Code Snippets** icon on the notebook action bar.
2. In the side pane, click **Read data**.
3. Click **Select data from project**.
4. Click **Data assets > Historical-Customer-Payments-Prepared.csv**.
5. Click **Select**.
6. In the *Load as* drop-down list, select **pandas DataFrame**.
7. Click **Insert code to cell**.
8. Rename the DataFrame from `df_data_1` to `df` in the second-to-last line and then run the cell.

In [None]:
# rename df_data_X to df

## Step 2: Explore the data

You can use plots, graphs, and summary statistics to go through the data systematically. For example, you can plot the distribution of each variable, plot a time series of the data, transform variables, look at pairwise relationships between variables using scatterplot matrices, and generate summary statistics. Here are some of these methods.

### Check the summary statistics

In [None]:
df.describe()
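
By default, `describe()` summarizes only the numeric columns, so it can help to check data types and missing values as well. The following is a minimal sketch on a hypothetical toy table; the column names `ENERGY_USAGE` and `HOME_TYPE` are illustrative assumptions, not necessarily columns in the project data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customer data; column names are illustrative.
toy = pd.DataFrame({
    "ENERGY_USAGE": [310.0, np.nan, 275.5, 402.1],
    "HOME_TYPE": ["Apartment", "House", None, "House"],
    "MISSED_PAYMENT": ["Yes", "No", "No", "Yes"],
})

# Data type of each column: numeric versus object (string) columns.
print(toy.dtypes)

# Number of missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# include="all" adds the string columns to the summary table.
summary = toy.describe(include="all")
print(summary)
```

The same three calls can be run on `df` to see whether any column needs cleaning before modeling.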

### Check the target variable `MISSED_PAYMENT`

In [None]:
sns.countplot(x=df["MISSED_PAYMENT"])
plt.show()
df["MISSED_PAYMENT"].value_counts()

In the chart above, you can see that `MISSED_PAYMENT` is a binary variable whose two classes are roughly balanced. This column is your target variable: the model learns from customers' past missed payments to estimate whether a customer is likely to miss a future payment.
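
To put a number on how balanced the classes are, you can look at class shares instead of raw counts. This is a small sketch with hypothetical labels; the values `Yes` and `No` are an assumption about how the target is coded.

```python
import pandas as pd

# Hypothetical labels; the real values come from df["MISSED_PAYMENT"].
labels = pd.Series(
    ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
    name="MISSED_PAYMENT",
)

# Share of each class, as fractions that sum to 1.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the rarest class to the most common class; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()
print(round(balance_ratio, 2))
```

A ratio far below 1 would suggest that accuracy alone is a weak measure of model quality.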

### Check the correlation between numerical features to understand relationships in the data

In [None]:
plt.figure(figsize=(16, 10))
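# Keep only numeric columns so that corr() does not fail on string columns.
corr = df.drop(["CUSTOMER_ID"], axis=1).select_dtypes(include=np.number).corr()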
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cbar_kws={"shrink": .5})
heatmap.set_title('CORRELATION HEATMAP')

## Step 3: Prepare the data

To prepare your data for model building, you can use data pre-processing techniques, including the addition, deletion, or transformation of training data. Here you’ll set up the split between the training and testing data and transform some string data to numeric data to make it quantifiable. 

In [None]:
X = df.drop(["CUSTOMER_ID","MISSED_PAYMENT"], axis=1)
y = df["MISSED_PAYMENT"]

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=0.20)
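
The split above is purely random. If you want the training and test sets to keep the same class proportions as the full data, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical toy arrays and is an optional variation, not what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=10, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```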

In [None]:
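# Intended sentiment ordering for SMART_METER_COMMENTS. Note: this mapping is
# documentation only; it is not passed to the transformer below.
ordinal_cols_mapping = [{
    "col": "SMART_METER_COMMENTS",
    "mapping": [('Positive', 1), ('Negative', -1), ('Neutral', 0)]
}]

# String columns (except the ordinal one) are one-hot encoded; numeric columns are scaled.
categorical_columns = X.drop(["SMART_METER_COMMENTS"], axis=1).select_dtypes(include='object').columns.tolist()
numerical_columns = X.select_dtypes(include=np.number).columns.tolist()

column_transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    # By default, OrdinalEncoder sorts the categories it finds. If the column holds only
    # these three values, that gives Negative -> 0, Neutral -> 1, Positive -> 2.
    (OrdinalEncoder(), ["SMART_METER_COMMENTS"]),
    (MinMaxScaler(), numerical_columns),
    remainder='passthrough')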
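
To see what such a column transformer produces, here is a minimal sketch on a hypothetical three-row table. Only `SMART_METER_COMMENTS` is a column name taken from this notebook; `HOME_TYPE` and `MONTHLY_BILL` are made up for illustration. The sketch also passes an explicit category order to `OrdinalEncoder`, which is an assumption about the intended ordering rather than what the cell above does.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Hypothetical miniature of the feature table.
toy_X = pd.DataFrame({
    "HOME_TYPE": ["House", "Apartment", "House"],
    "SMART_METER_COMMENTS": ["Positive", "Negative", "Neutral"],
    "MONTHLY_BILL": [50.0, 100.0, 150.0],
})

toy_transformer = make_column_transformer(
    # One 0/1 column per category, in sorted order: Apartment, House.
    (OneHotEncoder(handle_unknown="ignore"), ["HOME_TYPE"]),
    # Explicit order: Negative -> 0, Neutral -> 1, Positive -> 2.
    (OrdinalEncoder(categories=[["Negative", "Neutral", "Positive"]]),
     ["SMART_METER_COMMENTS"]),
    # Rescale to the range [0, 1].
    (MinMaxScaler(), ["MONTHLY_BILL"]),
    sparse_threshold=0,  # always return a dense NumPy array
)

encoded = toy_transformer.fit_transform(toy_X)
print(encoded)
```

Each output row holds the two one-hot columns, then the ordinal code, then the scaled bill.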

## Step 4: Build and train models

To find the best model, you’ll train multiple candidate models. There are many predictive modeling algorithms to choose from. For a binary classification problem like this one, these are common choices:

- Random Forest
- Logistic Regression
- XGBoost

### Random Forest

In [None]:
rf_pipeline = make_pipeline(column_transformer, RandomForestClassifier(n_estimators=100))
rf_pipeline.fit(X_train, y_train)

y_rf_score = rf_pipeline.score(X_test, y_test)
print("Random Forest model accuracy:", np.round(y_rf_score, decimals=2))

### Logistic Regression

In [None]:
lr_pipeline = make_pipeline(column_transformer, LogisticRegression())
lr_pipeline.fit(X_train, y_train)

y_lr_score = lr_pipeline.score(X_test, y_test)
print("Logistic Regression model accuracy:", np.round(y_lr_score, decimals=2))

### XGBoost

In [None]:
xgb_pipeline = make_pipeline(column_transformer, XGBClassifier(use_label_encoder=False))

In [None]:
xgb_pipeline.fit(X_train, y_train)

In [None]:
y_xgb_score = xgb_pipeline.score(X_test, y_test)
print("XGBoost model accuracy:", np.round(y_xgb_score, decimals=2))

## Step 5: Evaluate the models

Now you must evaluate your candidate models. A useful method for evaluating the performance of a model is measuring the area under the Receiver Operating Characteristic (ROC) curve. An ROC curve plots the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) as the classification threshold varies.
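
As a worked illustration, the sketch below computes one point of an ROC curve by hand and then the area under the whole curve. The labels, scores, and the 0.5 threshold are hypothetical. Note that the false-positive rate equals 1 − specificity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for eight customers.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.7, 0.9])

# One point on the ROC curve: predict positive when the score is >= 0.5.
y_pred = (y_score >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

tpr = tp / (tp + fn)          # true-positive rate (sensitivity)
fpr = fp / (fp + tn)          # false-positive rate
specificity = tn / (tn + fp)  # fpr == 1 - specificity

# Area under the full curve, taken over all thresholds.
auc = roc_auc_score(y_true, y_score)
print(tpr, fpr, specificity, auc)
```

Repeating the threshold step for every possible cut-off traces out the full curve that `roc_curve` returns.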

In [None]:
# Column 1 of predict_proba holds the probability of the positive class.
y_rf_probs = rf_pipeline.predict_proba(X_test)[:, 1]
y_lr_probs = lr_pipeline.predict_proba(X_test)[:, 1]
y_xgb_probs = xgb_pipeline.predict_proba(X_test)[:, 1]

rf_auc = roc_auc_score(y_test, y_rf_probs)
lr_auc = roc_auc_score(y_test, y_lr_probs)
xgb_auc = roc_auc_score(y_test, y_xgb_probs)

rf_fpr, rf_tpr, _ = roc_curve(y_test, y_rf_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, y_lr_probs)
xgb_fpr, xgb_tpr, _ = roc_curve(y_test, y_xgb_probs)

plt.figure(figsize=(8, 6))

plt.plot([0,1],[0,1],'w--')
plt.plot(rf_fpr, rf_tpr, label='Random Forest (auc={:.1%})'.format(rf_auc))
plt.plot(lr_fpr, lr_tpr, label='Logistic (auc={:.1%})'.format(lr_auc))
plt.plot(xgb_fpr, xgb_tpr, label='XGBoost (auc={:.1%})'.format(xgb_auc))

plt.title('ROC-AUC CURVES')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

In the chart above, each line is color-coded by model. The legend shows that Random Forest has the highest ROC-AUC score, meaning that by this metric it performs best of the three models on the test set.
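
A single train/test split gives one ROC-AUC estimate per model. For a less split-dependent comparison, one option is cross-validation with `cross_val_score`, which is imported at the top of this notebook. The sketch below runs on synthetic data from `make_classification` and uses a simplified pipeline, so both are stand-ins for the project data and the pipelines above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data (stand-in for X and y).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=10)

# Simplified pipeline: scale the features, then fit a logistic regression.
toy_pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation, scoring each fold by ROC-AUC.
cv_auc = cross_val_score(toy_pipeline, X_syn, y_syn, cv=5, scoring="roc_auc")
print("Mean ROC-AUC:", cv_auc.mean(), "Std:", cv_auc.std())
```

The same call should work with `rf_pipeline`, `lr_pipeline`, or `xgb_pipeline` in place of `toy_pipeline`, and with `X` and `y` in place of the synthetic arrays.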

## Step 6: Predict potential missed payments

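Use the Random Forest model to predict which of 10 customers, randomly sampled from the original data set, might miss a payment. Because the sample is drawn from the full data set, some of these customers may also have been part of the training data.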

In [None]:
df_sample = df.sample(n = 10)
df_sample.rename(columns={'MISSED_PAYMENT': 'Actual'}, inplace=True)

y_proba = rf_pipeline.predict_proba(df_sample.drop(["CUSTOMER_ID","Actual"], axis=1))
df_sample["Prediction"] = labelencoder_y.inverse_transform(y_proba.argmax(axis=-1))
df_sample["Probability"] = y_proba.max(axis=-1)

print("Predicting potential missed payments for 10 customers")
df_sample[["CUSTOMER_ID", "Prediction", "Probability"] + X.columns.tolist()]

The `Prediction` column contains the class that the Random Forest model predicts for each customer, based on that customer's historical data.
The `Probability` column contains the model's estimated probability for the predicted class.
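
The sketch below shows how the two columns relate to the raw `predict_proba` output, using a hypothetical probability matrix. The class labels `No` and `Yes` and the 0.4 cut-off are assumptions for illustration only.

```python
import numpy as np

# Hypothetical predict_proba output for three customers.
# Columns follow the label encoder's class order.
classes = np.array(["No", "Yes"])
proba = np.array([
    [0.80, 0.20],
    [0.35, 0.65],
    [0.55, 0.45],
])

predicted = classes[proba.argmax(axis=-1)]  # most likely class per row
confidence = proba.max(axis=-1)             # probability of that class

# Alternative view: probability of a missed payment, with a custom cut-off.
p_missed = proba[:, 1]
flagged = p_missed >= 0.4  # hypothetical, more cautious threshold

print(predicted, confidence, flagged)
```

With a cut-off below 0.5, the third customer is flagged even though the most likely class is `No`, which may be preferable when missing an at-risk customer is costly.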

## Step 7: Deploy the model (Optional)

Deployment is the final stage of the lifecycle of a model or script. In a notebook, you can use the IBM Watson Machine Learning Python client library to deploy the trained machine learning model to IBM Watson Machine Learning.

Check out our online documentation, <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/wmls/wmls-deploy-overview.html" target="_blank" rel="noopener noreferrer">Deploying assets</a>, for more samples, tutorials, and information.


## Summary

In this notebook, you loaded and explored the available data, prepared the data, and built and evaluated three machine learning models to determine which clients should be offered payment plans.

### Author

**Eric Dong** is a Data Scientist at IBM.

***
Copyright © IBM Corp. 2023. This notebook and its source code are released under the terms of the MIT License.