In [None]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='cc6c9888-8c0a-4b88-a4f0-18910f2493cc', project_access_token='p-7a04d0040bd18434bf5e315f9911d660be4a34b7')


<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Use Spark and Python to Predict Equipment Purchase</b></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://github.com/pmservice/wml-sample-models/blob/master/spark/product-line-prediction/images/products_graphics.png?raw=true" alt="Icon"> </th>
   </tr>
</table>

This notebook demonstrates how to perform data analysis on a classification problem. You will build a machine learning model to predict clients' interests in terms of product line, such as golf accessories, camping equipment, etc. You will use a publicly available data set, **GoSales Transactions for Naive Bayes Model**, which details anonymous outdoor equipment purchases. The data contains five columns: the label `PRODUCT_LINE` and four features (predictors), namely `GENDER`, `AGE`, `MARITAL_STATUS` and `PROFESSION`.



**Note**: The GoSales data is available to the <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/8044492073eb964f46597b4be06ff5ea" target="_blank" rel="noopener noreferrer">Watson Studio Community</a>. The machine learning model is built using the <a href="http://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html" target="_blank" rel="noopener noreferrer">PySpark ML package</a>. Some familiarity with Python is helpful. This notebook is compatible with Python 3.6 and Spark 2.x.


## Learning goals

You will learn how to:

-  Load data into an Apache® Spark DataFrame.
-  Explore the data.
-  Prepare data for training and evaluation.
-  Create an Apache® Spark machine learning pipeline.
-  Train and evaluate a model.
-  Explore and visualize the prediction results.


## Contents

This notebook contains the following parts. You need to execute the code cells in order from top to bottom. To run a code cell, click the run cell button in the toolbar or press `Shift + Return`.

1.	[Load the data](#load)
2.	[Explore the data](#explore)
3.	[Build a machine learning model](#model)
4.	[Predict and visualize prediction results](#visualization)
5.	[Summary and next steps](#summary)

**Note**: Make sure you run the code cell at the top of this notebook. All projects in IBM Watson Studio have an authorization token that is used to enable access to project assets, for example data assets and connections, and is used by platform APIs. This token is called the project access token. The code to use this token was added to the notebook for you during the project template import.

<a id="load"></a>
## 1. Load the data

In this section, you will load the data as an Apache® Spark DataFrame. You will access the data set `GoSales_Tx_NaiveBayes.csv` loaded in the Cloud Object Storage bucket associated with the project. The CSV file is added to the project together with the notebook and project token when this Project Template is used to create a project. 

Execute the code cell below to create a connection to Cloud Object Storage, and use the Spark `read` method to read the data into a DataFrame. The code also displays the first five records of the data set. Take a moment to analyze the values present in the data set.

In [2]:
# @hidden_cell

import ibmos2spark
import pandas as pd
storage_metadata = project.get_storage_metadata()

credentials = {
    'endpoint': storage_metadata['properties']['endpoint_url'],
    'service_id': storage_metadata['properties']['credentials']['editor']['service_id'],
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token',
    'api_key': storage_metadata['properties']['credentials']['editor']['api_key']
}

configuration_name = 'cos_config'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('GoSales_Tx_NaiveBayes.csv', storage_metadata['properties']['bucket_name']))
df.take(5)

[Row(PRODUCT_LINE='Personal Accessories', GENDER='M', AGE='27', MARITAL_STATUS='Single', PROFESSION='Professional'),
 Row(PRODUCT_LINE='Personal Accessories', GENDER='F', AGE='39', MARITAL_STATUS='Married', PROFESSION='Other'),
 Row(PRODUCT_LINE='Mountaineering Equipment', GENDER='F', AGE='39', MARITAL_STATUS='Married', PROFESSION='Other'),
 Row(PRODUCT_LINE='Personal Accessories', GENDER='F', AGE='56', MARITAL_STATUS='Unspecified', PROFESSION='Hospitality'),
 Row(PRODUCT_LINE='Golf Equipment', GENDER='M', AGE='45', MARITAL_STATUS='Married', PROFESSION='Retired')]

As you can see, the data contains five columns: `PRODUCT_LINE`, `GENDER`, `AGE`, `MARITAL_STATUS` and `PROFESSION`. `PRODUCT_LINE` is the one you want to predict (label); the other four are the features (predictors).


<a id="explore"></a>
## 2. Explore the data

Now you have successfully loaded your data into an Apache® Spark DataFrame. In the next steps you will inspect the data and examine its properties. You will also do visual exploration to understand more about the structure of the dataset and the characteristics of the data.

You can check the schema of the DataFrame by executing the next code cell. You will notice that the attribute `AGE` is of string type even though it holds numerical data. You will convert it to a numerical type in the next subsection.

In [3]:
df.dtypes

[('PRODUCT_LINE', 'string'),
 ('GENDER', 'string'),
 ('AGE', 'string'),
 ('MARITAL_STATUS', 'string'),
 ('PROFESSION', 'string')]

### 2.1 Change data type<a id="dtype"></a>

Convert the attribute `AGE` from a string type to a numerical type.

In [4]:
from pyspark.sql.types import IntegerType
df = df.withColumn("AGE", df["AGE"].cast(IntegerType()))

Have a look at the new data types.

In [5]:
df.dtypes

[('PRODUCT_LINE', 'string'),
 ('GENDER', 'string'),
 ('AGE', 'int'),
 ('MARITAL_STATUS', 'string'),
 ('PROFESSION', 'string')]

### 2.2 Visualize the data
Data visualization helps to identify significant trends and characteristics of the data. Simple charts such as bar charts or pie charts can help you build a better understanding of the data. You will use the Python library `brunel` to create visualizations. `brunel` defines a highly succinct and novel language for interactive data visualizations based on tabular data. The `brunel` documentation and code can be found <a href="https://github.com/Brunel-Visualization/Brunel" target="_blank" rel="noopener noreferrer">here</a>. `brunel` plots support zooming in and out.

You have to convert the PySpark DataFrame into a Pandas DataFrame first in order to pass it to `brunel`. Execute the next code cell to do so.

In [6]:
df_pd = df.toPandas()

Now plot a couple of bar charts to understand the proportion of values in individual features. Let's start with the target column i.e. `PRODUCT_LINE`. 

In [7]:
import brunel
%brunel data('df_pd') bar x(PRODUCT_LINE) y(#count) color(PRODUCT_LINE) :: width=600, height=400

<IPython.core.display.Javascript object>

In the plot above, you can see that `Camping Equipment` is the `PRODUCT_LINE` that appears most often and `Outdoor Protection` the one that appears least often.

The following three bar plots show the distribution of the other categorical columns in the data set: `PROFESSION`, `MARITAL_STATUS` and `GENDER`.

In [8]:
%brunel data('df_pd') bar x(PROFESSION) y(#count) color(PROFESSION) :: width=600, height=400

<IPython.core.display.Javascript object>

In [9]:
%brunel data('df_pd') bar x(MARITAL_STATUS) y(#count) color(MARITAL_STATUS) :: width=600, height=400

<IPython.core.display.Javascript object>

In [10]:
%brunel data('df_pd') bar x(GENDER) y(#count) color(GENDER) :: width=600, height=400

<IPython.core.display.Javascript object>

The age of a customer might have a strong impact on buying preferences. Let's use another type of plot, a heat map, to visualize the relation between `AGE` and `PRODUCT_LINE`. A heat map provides an immediate visual summary of the information.

In [11]:
%brunel data('df_pd') x(PRODUCT_LINE) y(AGE) color(#count:blue) style('symbol:rect; size:100%;') :: width=750, height=500

<IPython.core.display.Javascript object>

The intensity of the blue color in the plot represents the number of purchases of a specific `PRODUCT_LINE` by customers of the respective `AGE`. You can see that people aged 20 to 40 are inclined to buy Camping Equipment and Personal Accessories, while people aged 50 to 60 have a higher tendency to buy Golf Equipment.
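
The heat map colors each cell by the number of records that share a `PRODUCT_LINE` and an `AGE` value. The following is a minimal sketch of that counting step in plain Python rather than `brunel`, using the five records previewed in section 1. Grouping ages into ten-year bands is an assumption made here for readability, and `age_band` is a hypothetical helper.

```python
from collections import Counter

# The five records shown by df.take(5) in section 1: (PRODUCT_LINE, AGE).
records = [
    ('Personal Accessories', 27),
    ('Personal Accessories', 39),
    ('Mountaineering Equipment', 39),
    ('Personal Accessories', 56),
    ('Golf Equipment', 45),
]


def age_band(age, width=10):
    """Return the lower bound of the age band, e.g. 39 -> 30 (hypothetical helper)."""
    return (age // width) * width


# Each key is one heat-map cell; its count would drive the color intensity.
cell_counts = Counter((product, age_band(age)) for product, age in records)

for (product, band), count in sorted(cell_counts.items()):
    print(product, band, count)
```

With the full data set, the same counts are what make the cells for ages 20 to 40 stand out for Camping Equipment and Personal Accessories.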

<a id="model"></a>
## 3. Build a machine learning model

In this section, you will learn how to:

- [3.1 Split data](#prep)
- [3.2 Create the pipeline](#pipe)
- [3.3 Train a model](#train)

### 3.1 Split data<a id="prep"></a>

To avoid overfitting our machine learning model and ensure a good performance on unseen data, split the data set into two data sets: 
- Train data set
- Test data set

In [12]:
split_data = df.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print('Number of training records: ' + str(train_data.count()))
print('Number of testing records : ' + str(test_data.count()))

Number of training records: 48176
Number of testing records : 12076


As you can see, the data has been successfully split into two data sets, with roughly 80% of the records in the train data set and 20% in the test data set.

-  The train data set, which is the larger group (about 80%), is used to train the model.
-  The test data set (about 20%) is held out and used to evaluate the trained model on data it has not seen.
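
`randomSplit` treats the weights as sampling probabilities, so the resulting sizes are close to, but not exactly, 80% and 20%. In the output above, 48176 of 60252 records (about 79.96%) landed in the train data set. The sketch below illustrates the idea with per-row random assignment in plain Python. It is a simplified stand-in, not Spark's implementation, and `bernoulli_split` is a hypothetical helper.

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=24):
    """Send each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(10000))  # stand-in for the DataFrame rows
train, test = bernoulli_split(rows)

print('Number of training records: ' + str(len(train)))
print('Number of testing records : ' + str(len(test)))
print('Train fraction: {:.4f}'.format(len(train) / len(rows)))
```

Passing the same seed again reproduces the same split, which is why the notebook fixes the seed to `24`.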

### 3.2 Create the pipeline<a id="pipe"></a>

In this subsection, you will create an Apache® Spark machine learning pipeline and train the model. In the first step, you need to import the Apache® Spark machine learning modules that will be needed in the subsequent steps.

In [13]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

Now, use the `StringIndexer` transformer to convert all string fields into numerical type.

In [14]:
stringIndexer_label = StringIndexer(inputCol='PRODUCT_LINE', outputCol='label').fit(df)
stringIndexer_prof = StringIndexer(inputCol='PROFESSION', outputCol='PROFESSION_IX')
stringIndexer_gend = StringIndexer(inputCol='GENDER', outputCol='GENDER_IX')
stringIndexer_mar = StringIndexer(inputCol='MARITAL_STATUS', outputCol='MARITAL_STATUS_IX')
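
By default, `StringIndexer` assigns indices by descending frequency, so the most frequent value receives index `0.0`. That is consistent with `Camping Equipment`, the most common product line, appearing with label `0.0` in the predictions preview in section 4.1. The following plain-Python sketch illustrates the mapping. `fit_string_index` is a hypothetical helper, the sample values are illustrative only, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}


# Illustrative sample, not the real column contents.
sample = (['Camping Equipment'] * 3
          + ['Personal Accessories'] * 2
          + ['Outdoor Protection'])

mapping = fit_string_index(sample)
print(mapping)
```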

In the following step, create a feature vector to combine all features (predictors) together. This transformer merges multiple columns into a vector column.

In [15]:
vectorAssembler_features = VectorAssembler(inputCols=['GENDER_IX', 'AGE', 'MARITAL_STATUS_IX', 'PROFESSION_IX'], outputCol='features')
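
Conceptually, the assembler concatenates the listed columns, in the order given by `inputCols`, into one numeric vector per row. The sketch below mimics that for a single row in plain Python. `assemble` is a hypothetical helper, and the row values are taken from the first record of the predictions preview in section 4.1.

```python
INPUT_COLS = ['GENDER_IX', 'AGE', 'MARITAL_STATUS_IX', 'PROFESSION_IX']


def assemble(row, input_cols=INPUT_COLS):
    """Collect the input columns of one row into a list of floats, in order."""
    return [float(row[col]) for col in input_cols]


# First record of the predictions preview in section 4.1.
row = {'GENDER_IX': 1.0, 'AGE': 18, 'MARITAL_STATUS_IX': 1.0, 'PROFESSION_IX': 0.0}

features = assemble(row)
print(features)  # [1.0, 18.0, 1.0, 0.0]
```

The order matters: position `i` of the feature vector always corresponds to entry `i` of `inputCols`, which is what later lets feature importances be matched back to column names.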

Next, select the estimator you want to use for classification. <a href="http://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier" target="_blank" rel="noopener noreferrer">Random Forest</a> is used in this example. It supports both binary and multiclass labels, as well as both continuous and categorical features.

In [16]:
rf = RandomForestClassifier(labelCol='label', featuresCol='features')

Finally, convert the indexed labels back to original labels. This transformer maps a column of indices back to a new column of corresponding string values. 

In [17]:
labelConverter = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=stringIndexer_label.labels)
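
The conversion is a positional lookup: a predicted index `i` maps to the `i`-th entry of the `labels` list learned by the label indexer. The sketch below shows the idea in plain Python. `index_to_string` is a hypothetical helper, and the list contains only the two labels whose indices are visible in the predictions preview in section 4.1; the remaining labels are omitted rather than guessed.

```python
# Index 0.0 and 1.0 as they appear in the predictions preview; other labels omitted.
labels = ['Camping Equipment', 'Personal Accessories']


def index_to_string(prediction, labels):
    """Translate a numeric prediction back to its original string label."""
    return labels[int(prediction)]


print(index_to_string(1.0, labels))  # Personal Accessories
print(index_to_string(0.0, labels))  # Camping Equipment
```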

Now build the pipeline. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.

In [18]:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, rf, labelConverter])

### 3.3 Train a model<a id="train"></a>

Now, you can train your Random Forest model by using the previously defined **pipeline** and **train data**. In order to train the `Random Forest` model, run the following cell. When the `fit()` method is called on the pipeline, all the stages defined in the pipeline are executed in order.

In [19]:
model_rf = pipeline_rf.fit(train_data)
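
To make the "stages are executed in order" behavior concrete, here is a toy sketch in plain Python. It is not Spark's implementation, and every class and function in it (`ToyIndexer`, `ToyIndexerModel`, `ToyAssembler`, `fit_pipeline`, `transform_pipeline`) is hypothetical. Estimator stages are fitted on the data produced by the preceding stages, and each stage's output is passed on to the next one.

```python
from collections import Counter


class ToyIndexer:
    """Estimator: learns a value -> index mapping from the data."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        ordered = sorted(counts, key=lambda value: (-counts[value], value))
        mapping = {value: float(i) for i, value in enumerate(ordered)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Transformer returned by ToyIndexer.fit()."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [dict(row, **{self.output_col: self.mapping[row[self.input_col]]})
                for row in rows]


class ToyAssembler:
    """Transformer: needs no fitting, only combines columns."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        return [dict(row, **{self.output_col: [float(row[c]) for c in self.input_cols]})
                for row in rows]


def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage's output to the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, 'fit') else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply the fitted stages in the same order to new data."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


# GENDER and AGE of the first three records shown in section 1.
train_rows = [{'GENDER': 'M', 'AGE': 27}, {'GENDER': 'F', 'AGE': 39}, {'GENDER': 'F', 'AGE': 39}]
stages = [ToyIndexer('GENDER', 'GENDER_IX'), ToyAssembler(['GENDER_IX', 'AGE'], 'features')]

fitted = fit_pipeline(stages, train_rows)
result = transform_pipeline(fitted, [{'GENDER': 'M', 'AGE': 45}])
print(result[0]['features'])  # [1.0, 45.0]
```

The assembler can only run after the indexer has added `GENDER_IX`, which is why the order of the stages in the pipeline matters.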

Congratulations! You just trained a machine learning model. You will check the **model accuracy** on the **test data** shortly.

First, look at which features carry the most importance in deciding the outcome of the model.

In [20]:
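# stages[5] is the fitted random forest. toArray() returns one importance per feature,
# in the same order as the VectorAssembler input columns, even if the vector is sparse.
feature_importances = pd.DataFrame({'feature' : ['GENDER', 'AGE', 'MARITAL_STATUS', 'PROFESSION'],
                                   'importance' : model_rf.stages[5].featureImportances.toArray()}).sort_values('importance', ascending=False)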
%brunel data('feature_importances') bar x(feature) y(importance) sort(importance) transpose

<IPython.core.display.Javascript object>

As you can see from the plot above, the features `AGE` and `PROFESSION` are important in deciding the outcome of the machine learning model.

In [21]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
accuracy = evaluatorRF.evaluate(predictions)

print('Accuracy = {:.2f}%'.format(accuracy*100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy)*100))

Accuracy = 59.12%
Test Error = 40.88%
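
The `accuracy` metric is the fraction of test records whose predicted index equals the true label index, and the test error is its complement. The sketch below computes both in plain Python for the five records shown in the predictions preview in section 4.1. `fraction_correct` is a hypothetical helper, and five records are far too few to say anything about the model itself.

```python
def fraction_correct(labels, predictions):
    """Return the share of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    if not labels:
        raise ValueError('at least one record is required')
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# label and prediction columns of the five previewed records in section 4.1.
labels = [0.0, 0.0, 0.0, 0.0, 0.0]
predictions = [1.0, 1.0, 0.0, 0.0, 0.0]

accuracy = fraction_correct(labels, predictions)
print('Accuracy = {:.2f}%'.format(accuracy * 100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy) * 100))
```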


You can tune your model to achieve better accuracy. For simplicity, the tuning step is omitted in this example.

<a id="visualization"></a>
## 4. Predict and visualize prediction results

In this section, you will learn how to score the model using test data and visualize the prediction results.

- [4.1 Make a prediction using the trained model and test data](#local)
- [4.2 Visualize results](#plotly)

### 4.1 Make a prediction using the trained model and test data<a id="local"></a>

In this subsection, you will score the model with the *test_data* data set that we kept aside earlier.

In [22]:
predictions = model_rf.transform(test_data)

Let us preview the predictions DataFrame. We will convert this to a Pandas dataframe first.

In [23]:
predictions_pd = predictions.toPandas()
predictions_pd.head()

Unnamed: 0,PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION,label,PROFESSION_IX,GENDER_IX,MARITAL_STATUS_IX,features,rawPrediction,probability,prediction,predictedLabel
0,Camping Equipment,F,18,Single,Other,0.0,0.0,1.0,1.0,"[1.0, 18.0, 1.0, 0.0]","[5.159752968474141, 10.526462190893437, 3.6442...","[0.2579876484237071, 0.5263231095446719, 0.182...",1.0,Personal Accessories
1,Camping Equipment,F,18,Single,Retail,0.0,7.0,1.0,1.0,"[1.0, 18.0, 1.0, 7.0]","[2.32340304861027, 15.434980322236182, 1.62385...","[0.11617015243051351, 0.7717490161118092, 0.08...",1.0,Personal Accessories
2,Camping Equipment,F,19,Single,Hospitality,0.0,5.0,1.0,1.0,"[1.0, 19.0, 1.0, 5.0]","[11.986626125432828, 5.913802323402165, 1.5209...","[0.5993313062716414, 0.29569011617010826, 0.07...",0.0,Camping Equipment
3,Camping Equipment,F,19,Single,Hospitality,0.0,5.0,1.0,1.0,"[1.0, 19.0, 1.0, 5.0]","[11.986626125432828, 5.913802323402165, 1.5209...","[0.5993313062716414, 0.29569011617010826, 0.07...",0.0,Camping Equipment
4,Camping Equipment,F,19,Single,Hospitality,0.0,5.0,1.0,1.0,"[1.0, 19.0, 1.0, 5.0]","[11.986626125432828, 5.913802323402165, 1.5209...","[0.5993313062716414, 0.29569011617010826, 0.07...",0.0,Camping Equipment


### 4.2 Visualize results <a id="plotly"></a>

In this subsection, you will use the Plotly package to explore the prediction results. Plotly is an online analytics and data visualization tool.

Import Plotly and the other required packages.

In [24]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly.plotly as py
import os  # needed for os.environ below
import sys

init_notebook_mode(connected=True)
sys.path.append(''.join([os.environ['HOME']])) 

Plot a pie chart that shows the predicted product-line interest.

In [25]:
cumulative_stats = predictions_pd.groupby(['predictedLabel']).count()
product_data = [go.Pie(labels=cumulative_stats.index, values=cumulative_stats['GENDER'])]
product_layout = go.Layout(title='Predicted product line client interest distribution')

fig = go.Figure(data=product_data, layout=product_layout)
iplot(fig)

With this data set, perform some analysis of the mean AGE per product line by using a bar chart.

In [26]:
# Select AGE before aggregating so that non-numeric columns are not averaged.
mean_age = predictions_pd.groupby('predictedLabel')['AGE'].mean()
age_data = [go.Bar(y=mean_age, x=mean_age.index)]

age_layout = go.Layout(
    title='Mean AGE per predicted product line',
    xaxis=dict(title = 'Product Line', showline=False),
    yaxis=dict(title = 'Mean AGE'))

fig = go.Figure(data=age_data, layout=age_layout)
iplot(fig)



Based on the bar plot you created, you can conclude that the clients predicted to be interested in golf equipment have a mean age of over 50 years.


<a id="summary"></a>
## 5. Summary and next steps     

You successfully completed this notebook! You learned how to load, explore and visualize data, and how to use Apache® Spark machine learning to create and evaluate a model.
 
Check out our [Online Documentation](https://dataplatform.cloud.ibm.com/community?context=wdp) for more samples, tutorials, documentation, how-tos, and blog posts. 

Copyright © 2017-2019 IBM. This notebook and its source code are released under the terms of the MIT License.