# Association Rule Learning (PySpark)

## Use Case: Brokerage Customers

Given a dataset of **brokerage customers** that records which classes of stocks each customer owns, learn an association rule model of the **stock classes** that are commonly owned by the same **brokerage customer**.

### Create DataFrame from dataset


In [1]:

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IAM_SERVICE_ID': 'iam-ServiceId-8f959616-4d92-4073-bbf2-1d3f4f6ffad0',
    'IBM_API_KEY_ID': 'VYkxc-7ohgrp0zdnHRSHAQW-aqI5m-b6j8K3YBllfR4a',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.bluemix.net/oidc/token',
    'BUCKET': 'datascienceinbanking-donotdelete-pr-hrnb5icgks2da6',
    'FILE': 'MysteryData.csv'
}


Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190422155237-0000
KERNEL_ID = 782a0057-3079-49d2-bc98-93849a6cd4bd
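
The generated comment above suggests removing credentials before sharing the notebook. One way to do that is to read them from environment variables instead of hard-coding them. The following is a minimal sketch, not part of the original notebook: the function `load_cos_credentials` and the `COS_*` variable names are hypothetical, and only the dictionary keys are taken from the `credentials` dictionary used in the next cell.

```python
import os

# Hypothetical environment variable names (assumption); the values on the
# right are the keys used by the notebook's `credentials` dictionary.
_ENV_TO_KEY = {
    "COS_ENDPOINT": "endpoint",
    "COS_SERVICE_ID": "service_id",
    "COS_IAM_ENDPOINT": "iam_service_endpoint",
    "COS_API_KEY": "api_key",
}


def load_cos_credentials(env=os.environ):
    """Build the credentials dict from environment variables.

    Raises KeyError naming every missing variable, so a misconfigured
    notebook fails early instead of at the first read.
    """
    missing = [name for name in _ENV_TO_KEY if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return {key: env[name] for name, key in _ENV_TO_KEY.items()}
```

With the variables set in the kernel's environment, `credentials = load_cos_credentials()` could stand in for the literal dictionary.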


In [2]:

import ibmos2spark
# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-8f959616-4d92-4073-bbf2-1d3f4f6ffad0',
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token',
    'api_key': 'VYkxc-7ohgrp0zdnHRSHAQW-aqI5m-b6j8K3YBllfR4a'
}

configuration_name = 'os_1e498447d5f74cd6b90b92b35bb6514e_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
input_df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('MysteryData.csv', 'datascienceinbanking-donotdelete-pr-hrnb5icgks2da6'))
input_df.show(10, truncate=False)


+-------+------+-----+------+---------+----------+--------------+--------------+-----------+---------------+------------+----+----+-----+-------+------+-------+------+---------+----------+-------+-----------+
|CustID |Gender|Age  |Income|Education|Profession|AccountBalance|NumTradesPerYr|AccountType|TradingStrategy|TradingStyle|Tech|Auto|Hotel|Airline|Energy|Biotech|Pharma|Utilities|Financials|Staples|Industrials|
+-------+------+-----+------+---------+----------+--------------+--------------+-----------+---------------+------------+----+----+-----+-------+------+-------+------+---------+----------+-------+-----------+
|1000001|F     |65+  |61041 |Masters  |Architect |198459        |25            |Cash       |Growth         |Trend       |Y   |N   |N    |N      |N     |N      |N     |N        |N         |N      |N          |
|1000002|M     |35-44|259522|Doctorate|Doctor    |388705        |24            |Cash       |Growth         |Trend       |N   |Y   |N    |N      |N     |Y      |N   

### Select distinct group and item pairs
**User Actions:**
1. Set **`group_index`** to the name of the feature that identifies a group for association. For example, if the input dataset contains all purchased products, we may set `group_index` to `'CUST_ORDER_NUMBER'` so that products purchased in the same order (or basket) are grouped together.
2. Set **`item_feature`** to the name of the feature for which we want to find association rules. For example, in market basket analysis, we may set this to `'PRODUCT_LINE'` to identify product lines that are commonly purchased together.
    **Note:** If your dataset does not contain a single feature for the items, but instead has multiple boolean features indicating item presence, give `item_feature` a sensible name for the type of item being grouped, and be sure to set **`item_features`**.
3. Set **`item_features`** to the names of the features that contain boolean flags indicating that an item is present in a given group. If your dataset has no such features, leave the variable as an empty list.

In [3]:
group_index = 'CustID'
item_feature = 'Stocks'
item_features = ['Tech','Auto','Hotel','Airline','Energy','Biotech','Pharma','Utilities','Financials','Staples','Industrials']

if item_features:
    group_items_df = input_df.select([group_index] + item_features)
else:
    group_items_df = input_df.select(group_index, item_feature).distinct()
group_items_df.show(truncate=False)

+-------+----+----+-----+-------+------+-------+------+---------+----------+-------+-----------+
|CustID |Tech|Auto|Hotel|Airline|Energy|Biotech|Pharma|Utilities|Financials|Staples|Industrials|
+-------+----+----+-----+-------+------+-------+------+---------+----------+-------+-----------+
|1000001|Y   |N   |N    |N      |N     |N      |N     |N        |N         |N      |N          |
|1000002|N   |Y   |N    |N      |N     |Y      |N     |N        |N         |N      |N          |
|1000003|Y   |N   |N    |N      |N     |N      |N     |N        |N         |N      |N          |
|1000004|Y   |Y   |Y    |Y      |N     |N      |N     |N        |N         |N      |N          |
|1000005|Y   |Y   |Y    |Y      |N     |N      |N     |N        |N         |N      |N          |
|1000006|Y   |N   |N    |N      |N     |N      |N     |N        |N         |N      |N          |
|1000007|Y   |Y   |N    |N      |N     |N      |N     |N        |N         |N      |N          |
|1000008|Y   |Y   |Y    |Y    

### Map each group to a list of items
Take the above DataFrame and transform it into an RDD containing one row per group, each holding the `group_index` value and the list of items in that group.

**User Action:** If **`item_features`** was set above, set **`true_value`** to the flag value that indicates an item is present in the group.

In [4]:
true_value = 'Y'

# If item presence is indicated by multiple features, transform the set of boolean features into a single feature holding an array of feature names
if item_features:

    # Given a row, return the names from item_features that are marked as true in that row
    def item_list(group):
        return [item for item in item_features if group[item] == true_value]

    # Map each row to a tuple of group ID and list of items
    group_items_rdd = group_items_df.rdd.map(lambda group: (group[0], item_list(group)))

# If the input dataset contains one row per item
else:

    # Map each unique `group_index` to the list of `item_feature` values found in that group (e.g. the list of all product lines that appeared in a given order)
    group_items_rdd = group_items_df.rdd.map(lambda group: (group[0], [group[1]])).reduceByKey(lambda x, y: x + y)

for row in group_items_rdd.take(20):
    print(row)

('1000001', ['Tech'])
('1000002', ['Auto', 'Biotech'])
('1000003', ['Tech'])
('1000004', ['Tech', 'Auto', 'Hotel', 'Airline'])
('1000005', ['Tech', 'Auto', 'Hotel', 'Airline'])
('1000006', ['Tech'])
('1000007', ['Tech', 'Auto'])
('1000008', ['Tech', 'Auto', 'Hotel', 'Airline'])
('1000009', ['Tech'])
('1000010', ['Tech', 'Financials'])
('1000011', ['Tech'])
('1000012', ['Tech'])
('1000013', ['Tech'])
('1000014', ['Tech', 'Auto'])
('1000015', ['Tech'])
('1000016', ['Tech', 'Auto'])
('1000017', ['Energy', 'Pharma', 'Financials', 'Industrials'])
('1000018', ['Tech', 'Auto'])
('1000019', ['Auto', 'Biotech'])
('1000020', ['Energy', 'Pharma', 'Financials', 'Industrials'])
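
The two branches of the cell above handle two input layouts that describe the same baskets. The following plain-Python sketch mirrors both on a tiny hand-made sample, without Spark. The helper names `items_from_flags` and `items_from_pairs` and the shortened feature list are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

ITEM_FEATURES = ["Tech", "Auto", "Biotech"]  # shortened list, for illustration only
TRUE_VALUE = "Y"


def items_from_flags(row):
    """Wide layout: one row per group, one Y/N flag per item."""
    return [name for name in ITEM_FEATURES if row[name] == TRUE_VALUE]


def items_from_pairs(pairs):
    """Long layout: one (group, item) pair per row, merged per group."""
    grouped = defaultdict(list)
    for group_id, item in pairs:
        grouped[group_id].append(item)
    return dict(grouped)


wide_rows = [
    {"CustID": "1000001", "Tech": "Y", "Auto": "N", "Biotech": "N"},
    {"CustID": "1000002", "Tech": "N", "Auto": "Y", "Biotech": "Y"},
]
from_wide = {row["CustID"]: items_from_flags(row) for row in wide_rows}

long_rows = [("1000001", "Tech"), ("1000002", "Auto"), ("1000002", "Biotech")]
from_long = items_from_pairs(long_rows)

print(from_wide == from_long)  # both layouts yield the same baskets
```

In this sketch the item order within a group follows the input order. With `reduceByKey` on a distributed RDD the order of items within a list is not guaranteed, which does not matter for FP-Growth because it treats each list as a set.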


### Filter out single-item groups and convert back to DataFrame

1. Filter out groups that contain only a single `item_feature` value, because they carry no co-occurrence information that could be used for training.
2. Convert the RDD back to a DataFrame.
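
One side effect of this filter is worth keeping in mind when reading the results: removing groups changes the totals that relative frequencies are measured against. The sketch below uses made-up baskets in plain Python to show the shift. The helper `fraction_with` is hypothetical and only illustrates the arithmetic.

```python
# Made-up baskets for illustration; two of them hold a single item.
baskets = [
    ["Auto"],
    ["Auto"],
    ["Auto", "Tech"],
    ["Auto", "Biotech"],
]


def fraction_with(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for basket in baskets if wanted <= set(basket))
    return hits / len(baskets)


# Same filter condition as in the notebook: keep groups with more than one item.
multi_item = [basket for basket in baskets if len(basket) > 1]

before = fraction_with(baskets, ["Auto", "Tech"])     # 1 of 4 baskets
after = fraction_with(multi_item, ["Auto", "Tech"])   # 1 of 2 baskets
print(before, after)
```

The count of baskets containing a single item such as `Auto` drops as well, so ratios built on those counts are relative to multi-item groups only. Whether that is the desired population depends on the question being asked.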

In [5]:
multi_item_groups_rdd = group_items_rdd.filter(lambda group: len(group[1]) > 1)

multi_item_groups_df = multi_item_groups_rdd.toDF([group_index, item_feature])
multi_item_groups_df.show(truncate=False)

+-------+-----------------------------------------+
|CustID |Stocks                                   |
+-------+-----------------------------------------+
|1000002|[Auto, Biotech]                          |
|1000004|[Tech, Auto, Hotel, Airline]             |
|1000005|[Tech, Auto, Hotel, Airline]             |
|1000007|[Tech, Auto]                             |
|1000008|[Tech, Auto, Hotel, Airline]             |
|1000010|[Tech, Financials]                       |
|1000014|[Tech, Auto]                             |
|1000016|[Tech, Auto]                             |
|1000017|[Energy, Pharma, Financials, Industrials]|
|1000018|[Tech, Auto]                             |
|1000019|[Auto, Biotech]                          |
|1000020|[Energy, Pharma, Financials, Industrials]|
|1000021|[Energy, Pharma, Financials, Industrials]|
|1000025|[Tech, Auto]                             |
|1000030|[Tech, Auto, Hotel, Airline]             |
|1000031|[Tech, Auto]                             |
|1000032|[Te

### Train FP-Growth Model and print the resulting frequent itemsets

**User Action:** Tune the parameters [**`min_support`**](https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html?highlight=associationrules#pyspark.ml.fpm.FPGrowth.minSupport), the minimum support for frequent itemsets, and [**`min_confidence`**](https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html?highlight=associationrules#pyspark.ml.fpm.FPGrowth.minConfidence), the minimum confidence for association rules. See the linked documentation for details.
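
To make the two thresholds concrete, the sketch below computes support and confidence by hand in plain Python. The transactions are invented for illustration and the function names are not part of the Spark API. Support is the fraction of transactions containing an itemset, and the confidence of a rule is the support of antecedent and consequent together divided by the support of the antecedent.

```python
# Invented transactions, for illustration only.
transactions = [
    {"Tech", "Auto"},
    {"Tech", "Auto", "Hotel"},
    {"Tech", "Financials"},
    {"Auto", "Biotech"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"Tech", "Auto"}))       # 2 of 4 transactions
print(confidence({"Auto"}, {"Tech"}))  # (2/4) / (3/4)
```

An itemset is kept as frequent only if its support is at least `min_support`, and a rule is reported only if its confidence is at least `min_confidence`.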

In [6]:
from pyspark.ml.fpm import FPGrowth

min_support = 0.05
min_confidence = 0.2

fpGrowth = FPGrowth(itemsCol=item_feature, minSupport=min_support, minConfidence=min_confidence)
model = fpGrowth.fit(multi_item_groups_df)
model.freqItemsets.show(truncate=False)

+-----------------------------------------+-----+
|items                                    |freq |
+-----------------------------------------+-----+
|[Biotech]                                |3706 |
|[Biotech, Auto]                          |3706 |
|[Financials]                             |17062|
|[Financials, Tech]                       |6079 |
|[Airline]                                |9876 |
|[Airline, Auto]                          |9876 |
|[Airline, Auto, Tech]                    |8817 |
|[Airline, Hotel]                         |9876 |
|[Airline, Hotel, Auto]                   |9876 |
|[Airline, Hotel, Auto, Tech]             |8817 |
|[Airline, Hotel, Tech]                   |8817 |
|[Airline, Tech]                          |8817 |
|[Energy]                                 |3481 |
|[Energy, Financials]                     |3481 |
|[Energy, Pharma]                         |3481 |
|[Energy, Pharma, Financials]             |3481 |
|[Energy, Industrials]                    |3481 |


### Display generated association rules

In [7]:
model.associationRules.show(truncate=False)

+-----------------------+-------------+-------------------+
|antecedent             |consequent   |confidence         |
+-----------------------+-------------+-------------------+
|[Pharma, Financials]   |[Industrials]|0.5075823855351415 |
|[Pharma, Financials]   |[Energy]     |0.5075823855351415 |
|[Utilities]            |[Financials] |0.7929547088425594 |
|[Utilities]            |[Staples]    |1.0                |
|[Utilities, Financials]|[Staples]    |1.0                |
|[Auto]                 |[Tech]       |0.825308396339037  |
|[Auto]                 |[Hotel]      |0.3023049374024304 |
|[Auto]                 |[Airline]    |0.3023049374024304 |
|[Airline, Hotel, Auto] |[Tech]       |0.8927703523693803 |
|[Airline]              |[Auto]       |1.0                |
|[Airline]              |[Hotel]      |1.0                |
|[Airline]              |[Tech]       |0.8927703523693803 |
|[Energy, Pharma]       |[Financials] |1.0                |
|[Energy, Pharma]       |[Industrials]|1
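
The confidence column can be checked against the frequent itemset counts printed earlier. This sketch copies three counts from that output and recomputes two of the rules. The helper `rule_confidence` is illustrative and not part of the Spark API.

```python
# Frequencies copied from the model.freqItemsets output above.
freq = {
    frozenset(["Airline"]): 9876,
    frozenset(["Airline", "Auto"]): 9876,
    frozenset(["Airline", "Tech"]): 8817,
}


def rule_confidence(antecedent, consequent):
    """freq(antecedent and consequent together) / freq(antecedent)."""
    a = frozenset(antecedent)
    return freq[a | frozenset(consequent)] / freq[a]


print(rule_confidence(["Airline"], ["Auto"]))  # 9876 / 9876
print(rule_confidence(["Airline"], ["Tech"]))  # 8817 / 9876
```

The first ratio is 1.0 and the second is roughly 0.8928, in line with the `[Airline] -> [Auto]` and `[Airline] -> [Tech]` rows of the rules table.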

### Visualize association rules using Brunel chord plot

In [8]:
# In Watson Studio
import brunel
pd_associationRules = model.associationRules.toPandas()
%brunel data('pd_associationRules') chord x(antecedent) y(consequent)  color(confidence) size(confidence) tooltip(antecedent, consequent, confidence) :: width=800, height=500


### Visualize association rules using Brunel edge plot

In [9]:
%%brunel data('pd_associationRules') edge color(confidence:sequential) key(antecedent, consequent) tooltip(antecedent, consequent, confidence) style('symbol:curvedArrow') + network y(antecedent, consequent) key(#values) label(#values) style('.label {font-size:10px}') 
:: width=800, height=600 


### Test the model with sample data
**User Action:** Fill the **`test_items`** list with a few item names; the model uses its association rules to predict further items for that set.

In [10]:
test_items = ["Auto", "Tech"]

test_data = spark.createDataFrame([(test_items, )], [item_feature])
model.transform(test_data).show(truncate=False)

+------------+----------------+
|Stocks      |prediction      |
+------------+----------------+
|[Auto, Tech]|[Hotel, Airline]|
+------------+----------------+
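
The prediction can be traced by hand. As described in the Spark documentation, `transform` looks for rules whose antecedent is fully contained in the input items and collects their consequents, leaving out items that are already present. The plain-Python sketch below mimics that logic with a subset of the rules visible in the truncated output above. The function `predict` is an illustration, not Spark's implementation.

```python
# Rules copied from the (truncated) model.associationRules output above,
# as (antecedent, consequent) pairs; confidence is not needed here.
rules = [
    (["Utilities"], ["Financials"]),
    (["Utilities"], ["Staples"]),
    (["Auto"], ["Tech"]),
    (["Auto"], ["Hotel"]),
    (["Auto"], ["Airline"]),
    (["Airline"], ["Auto"]),
    (["Airline"], ["Hotel"]),
    (["Airline"], ["Tech"]),
]


def predict(items, rules):
    """Collect consequents of every rule whose antecedent is contained in `items`."""
    owned = set(items)
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= owned:
            for item in consequent:
                # Skip items the customer already holds and avoid duplicates.
                if item not in owned and item not in prediction:
                    prediction.append(item)
    return prediction


print(predict(["Auto", "Tech"], rules))
```

For `["Auto", "Tech"]` only the `[Auto]` rules apply among those listed, and `Tech` is already owned, which leaves `Hotel` and `Airline`. The order of items in Spark's `prediction` column is not something to rely on.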



### References:
- https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html?highlight=associationrules#pyspark.ml.fpm.FPGrowth
- https://github.com/apache/spark/blob/master/examples/src/main/python/ml/fpgrowth_example.py


### Developed by Data Science Elite Team, IBM Analytics:

- David Thomason - Software Engineer

Copyright (c) 2018 IBM Corporation