# Exercise 1

For this exercise, the A-priori algorithm was implemented to find the frequent itemsets of sizes 2 and 3 and to extract association rules from them. The implementation uses Spark, specifically the PySpark library with the DataFrame API.

## Imports

PySpark is the only non-standard library required.

In [1]:
import os.path
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    array, array_max, array_sort, coalesce, col, collect_list, explode, lit, udf,
)
from pyspark.sql.types import StringType, ArrayType
from itertools import combinations, chain
from typing import Iterable, Any, List

## Spark initialization

Spark is initialized with as many worker threads as there are logical cores on the machine (`local[*]`).
We did not use a fixed value because the machines used for development had different numbers of CPU cores.

In [2]:
spark = SparkSession.builder \
    .appName('Apriori') \
    .config('spark.master', 'local[*]') \
    .getOrCreate()


## Prepare the data

The dataset contains medical data in which each row identifies a patient and a disease recorded for that patient at a certain time.
In this context, the diseases are the *items* and the patients are the *baskets*.

The data is in CSV format and is loaded together with its header.
The `START`, `STOP` and `ENCOUNTER` columns are dropped, as they are not useful for this problem.

In [3]:
df = spark.read \
    .option('header', True) \
    .csv('./data/conditions.csv.gz') \
    .drop('START', 'STOP', 'ENCOUNTER')

The dataframe `df` will have three columns: `PATIENT`, `CODE` and `DESCRIPTION`.

From this dataframe, we extract the mappings from `CODE` to its `DESCRIPTION`.
Throughout the algorithm, we will use the `CODE` to identify each disease, and then map it to its `DESCRIPTION` when we output the final results.

We first check how many unique diseases there are, so that we can determine whether we can keep this mapping in memory or not.

In [4]:
# print('Number of distinct CODE-DESCRIPTION pairs:', df.select('CODE', 'DESCRIPTION').distinct().count())
# print('Number of distinct CODEs:', df.select('CODE').distinct().count())

There is a discrepancy between the two counts because one `CODE` has two different descriptions.
For simplicity, we keep only one of the two descriptions.

Since there are only 159 diseases, the mapping fits comfortably in memory.
We therefore collect the distinct `CODE`-`DESCRIPTION` pairs into a hash table.

In [5]:
code_description_map = {r.CODE: r.DESCRIPTION
    for r in df \
    .select('CODE', 'DESCRIPTION') \
    .distinct() \
    .collect()
}
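
As a small illustration of why the dictionary ends up with one entry per `CODE`, the sketch below builds the same kind of mapping from made-up pairs (the codes and descriptions are invented for the example, not taken from the dataset). When a code appears twice, the later pair overwrites the earlier one. With Spark, the order of the rows returned by `collect()` after `distinct()` is not guaranteed, so which description survives should be treated as arbitrary.

```python
# Hypothetical (CODE, DESCRIPTION) pairs; code "20" has two descriptions.
pairs = [
    ("10", "Flu"),
    ("20", "Asthma"),
    ("20", "Asthma (disorder)"),
]

# Same construction as a dict comprehension over collected rows:
# a repeated key keeps only the last value seen.
code_to_description = {code: description for code, description in pairs}

print(len(pairs), len(code_to_description))
print(code_to_description["20"])
```

The difference between the number of pairs and the number of keys corresponds to the discrepancy between the two counts discussed above.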

For the algorithm, the diseases' `DESCRIPTION`s won't be needed anymore, as we have the mapping.
The distinct `PATIENT`-`CODE` pairs are taken (it doesn't make sense to have duplicate items within a basket).

In [6]:
df = df.drop('DESCRIPTION').distinct()
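
To make the items/baskets view concrete, here is a minimal plain-Python sketch of the same idea on invented rows. The helper name `to_baskets` and the sample data are assumptions for illustration only. Each patient becomes a basket, and using a set drops repeated items just as `distinct()` does for the `PATIENT`-`CODE` pairs.

```python
from collections import defaultdict

# Made-up (PATIENT, CODE) rows; the repeated ("p1", "A") row mimics
# the same condition being recorded twice for one patient.
rows = [
    ("p1", "A"), ("p1", "B"), ("p1", "A"),
    ("p2", "B"), ("p2", "C"),
]


def to_baskets(rows):
    """Group item codes by patient; a set silently drops duplicate items."""
    baskets = defaultdict(set)
    for patient, code in rows:
        baskets[patient].add(code)
    return dict(baskets)


baskets = to_baskets(rows)
print(baskets)
```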

## A-priori algorithm

We set the support threshold parameter to 1000, as recommended.

In [7]:
support_threshold = 1000

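The intermediate results of each pass are saved to disk in Parquet format (Spark's default format). A pass is only recomputed when its output directory does not exist yet; otherwise the saved results are read back.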

### First pass

In the first pass, the frequent items are found.
A "group by" operation on `CODE` counts the number of `PATIENT`s in which each `CODE` appears.
The diseases are then filtered by comparing the support stored in `COUNT` against `support_threshold`.

In [8]:
if not os.path.exists('frequent_diseases_k1'):
    frequent_diseases_k1 = df \
        .groupBy('CODE') \
        .count() \
        .withColumnRenamed('count', 'COUNT') \
        .filter(col('COUNT') >= support_threshold)
    
    frequent_diseases_k1.write.mode('overwrite').parquet(path='frequent_diseases_k1', compression='gzip')

frequent_diseases_k1 = spark.read.parquet('frequent_diseases_k1')
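
The logic of the first pass can also be sketched without Spark. The following is only an illustration on toy baskets, with a hypothetical helper `frequent_items` and a threshold of 3 instead of 1000. It counts in how many baskets each item occurs and keeps the items whose count reaches the threshold.

```python
from collections import Counter


def frequent_items(baskets, support_threshold):
    """Return {item: support} for items present in at least `support_threshold` baskets."""
    # set(basket) ensures an item is counted at most once per basket.
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item: n for item, n in counts.items() if n >= support_threshold}


# Toy baskets (invented): A and B appear in 3 baskets, C in 2.
baskets = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]
frequent = frequent_items(baskets, support_threshold=3)
print(frequent)
```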

The frequent items table is kept in memory for future passes (in a Python `set`, for quicker membership queries).

In [9]:
frequent_diseases_k1_set = {r.CODE for r in frequent_diseases_k1.select('CODE').collect()}

In [11]:
print('Number of frequent items:', len(frequent_diseases_k1_set))   # 131

Number of frequent items: 131


### Second pass

The second pass generates the frequent pairs of items.
For that, a UDF was developed that takes an array of items and returns the list of item pairs; this operation runs in Python.

In [12]:
@udf(returnType=ArrayType(ArrayType(StringType(), False), False))
def combine_pairs(elems: Iterable[Any]):
    return list(combinations(elems, 2))

First, the `CODE`s are filtered using the `frequent_diseases_k1_set`, so that we only have frequent diseases (*monotonicity of itemsets*: itemsets are only frequent if all their subsets are).
Then, for each `PATIENT` we collect its `CODE`s into an array, and then use that array in the UDF previously defined.

It's important to note that the array of `CODE`s should be sorted beforehand, so that pair comparison can be done properly. Since Spark doesn't have a "set" datatype, the elements should be kept in order so that two pairs (which are arrays) with the same items will be considered equal.
The `combinations` function is guaranteed to keep this order when generating pairs.
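
The effect of sorting can be seen in a short plain-Python sketch (the helper name `canonical_pairs` is an assumption for illustration). `itertools.combinations` emits tuples in the order of its input, so sorting the items first gives every pair a single canonical representation.

```python
from itertools import combinations


def canonical_pairs(items):
    """Return all 2-item combinations with the items in sorted order."""
    return list(combinations(sorted(items), 2))


# Without sorting, the same two items can appear as different tuples.
unsorted_pair = list(combinations(["C", "A"], 2))
sorted_pair = canonical_pairs(["C", "A"])
print(unsorted_pair, sorted_pair)

# With sorting, the input order no longer matters.
print(canonical_pairs(["C", "A", "B"]))
```

Because the canonical form is unique, grouping by the pair counts `("A", "C")` and `("C", "A")` as the same itemset.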

The result is a column `CODE_PAIRS` containing an array of pairs, where each pair is itself an array with two elements.
This column is exploded, producing a row for each pair within the arrays of pairs.

Afterwards, the same grouping procedure in the first pass is performed, grouping by the itemsets and obtaining the number of baskets each itemset belongs to, filtering with the `support_threshold`.

In [13]:
if not os.path.exists('frequent_diseases_k2'):
    frequent_diseases_k2 = df \
        .filter(col('CODE').isin(frequent_diseases_k1_set)) \
        .groupBy('PATIENT') \
        .agg(collect_list('CODE')) \
        .withColumn('collect_list(CODE)', array_sort('collect_list(CODE)')) \
        .withColumn('CODE_PAIRS', combine_pairs('collect_list(CODE)')) \
        .select('PATIENT', 'CODE_PAIRS') \
        .withColumn('CODE_PAIR', explode('CODE_PAIRS')) \
        .drop('CODE_PAIRS') \
        .groupBy('CODE_PAIR') \
        .count() \
        .withColumnRenamed('count', 'COUNT') \
        .filter(col('COUNT') >= support_threshold)
    
    frequent_diseases_k2.write.mode('overwrite').parquet(path='frequent_diseases_k2', compression='gzip')

frequent_diseases_k2 = spark.read.parquet('frequent_diseases_k2')

The table of frequent pairs is kept in memory for the third pass.

In [14]:
frequent_diseases_k2_set = {tuple(r.CODE_PAIR) for r in frequent_diseases_k2.select('CODE_PAIR').collect()}

In [16]:
print('Number of frequent pairs:', len(frequent_diseases_k2_set))   # 2940

Number of frequent pairs: 2940


### Third Pass

As in the second pass, a UDF was developed that returns an array of triples given an array of items.
This function also verifies that all immediate subsets of size $k-1$ of each returned triple are frequent (that is, all three pairs within the triple are frequent).

In [17]:
@udf(returnType=ArrayType(ArrayType(StringType(), False), False))
def combine_triples(elems: Iterable[Any]):
    return [
        combination for combination in list(combinations(elems, 3))
        if ((combination[0], combination[1]) in frequent_diseases_k2_set
            and (combination[0], combination[2]) in frequent_diseases_k2_set
            and (combination[1], combination[2]) in frequent_diseases_k2_set)
    ]
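
The pruning step can be tried in isolation with plain Python. This is a sketch on invented data, and `candidate_triples` is a hypothetical stand-in for the UDF above: a triple is kept only if each of its three pairs is in the set of frequent pairs.

```python
from itertools import combinations


def candidate_triples(items, frequent_pairs):
    """Return sorted triples whose three 2-item subsets are all frequent."""
    return [
        triple
        for triple in combinations(sorted(items), 3)
        if all(pair in frequent_pairs for pair in combinations(triple, 2))
    ]


# Invented frequent pairs, stored as sorted tuples.
frequent_pairs = {("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")}

# (A, B, D) is rejected because (B, D) is not frequent, and so on.
triples = candidate_triples({"A", "B", "C", "D"}, frequent_pairs)
print(triples)
```

Since the triple is built from sorted items, `combinations(triple, 2)` yields its pairs in the same sorted form used as keys in the set.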

The same approach as in the second pass is used; only the UDF differs.

In [21]:
if not os.path.exists('frequent_diseases_k3'):
    frequent_diseases_k3 = df \
        .filter(col('CODE').isin(frequent_diseases_k1_set)) \
        .groupBy('PATIENT') \
        .agg(collect_list('CODE')) \
        .withColumn('collect_list(CODE)', array_sort('collect_list(CODE)')) \
        .withColumn('CODE_TRIPLES', combine_triples('collect_list(CODE)')) \
        .select('PATIENT', 'CODE_TRIPLES') \
        .withColumn('CODE_TRIPLE', explode('CODE_TRIPLES')) \
        .drop('CODE_TRIPLES') \
        .groupBy('CODE_TRIPLE') \
        .count() \
        .withColumnRenamed('count', 'COUNT') \
        .filter(col('COUNT') >= support_threshold)

    frequent_diseases_k3.write.mode('overwrite').parquet(path='frequent_diseases_k3', compression='gzip')

frequent_diseases_k3 = spark.read.parquet('frequent_diseases_k3')

For consistency with the previous passes, the set of frequent triples is also collected in memory, although it is not used afterwards.

In [24]:
frequent_diseases_k3_set = {tuple(r.CODE_TRIPLE) for r in frequent_diseases_k3.select('CODE_TRIPLE').collect()}

In [26]:
print('Number of frequent triples:', len(frequent_diseases_k3_set))   # 13392

Number of frequent triples: 13392


### Most frequent

The 10 most frequent pairs and the 10 most frequent triples are each saved in a tab-separated file that includes a header.
Obtaining them involves sorting the respective dataframe by `COUNT` in descending order and taking the top 10 results.

In [39]:
with open('most_frequent_k2.csv', 'w') as f:
    print('pair\tcount', file=f)
    print(*(
            f'{r.CODE_PAIR}\t{r.COUNT}' for r in
            frequent_diseases_k2.sort('COUNT', ascending=False).take(10)
        ), sep='\n', file=f)

In [41]:
with open('most_frequent_k3.csv', 'w') as f:
    print('triple\tcount', file=f)
    print(*(
            f'{r.CODE_TRIPLE}\t{r.COUNT}' for r in
            frequent_diseases_k3.sort('COUNT', ascending=False).take(10)
        ), sep='\n', file=f)
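
As an alternative way to produce the same kind of listing, the sketch below uses only the standard library: `heapq.nlargest` picks the top entries by count and `csv.writer` with a tab delimiter writes them. The helper `write_top_k`, the joined-string formatting of the itemset and the sample counts are assumptions for illustration, not what the cells above do.

```python
import csv
import heapq
import io


def write_top_k(counts, k, header, out):
    """Write the k itemsets with the highest count as tab-separated rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    top = heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
    for itemset, count in top:
        writer.writerow([", ".join(itemset), count])


# Invented pair counts.
counts = {("A", "B"): 5, ("A", "C"): 9, ("B", "C"): 7}

buffer = io.StringIO()  # stands in for an open file
write_top_k(counts, 2, ("pair", "count"), buffer)
print(buffer.getvalue())
```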

### Association Rules

The total number of patients (baskets) is required for calculating the rule metrics.

In [25]:
n_patients = df.select('PATIENT').distinct().count()

As with the creation of pairs and triples, a UDF was created that takes an itemset as input and produces all of its subsets, excluding the empty set and the complete set itself.

Each of these subsets is paired with its complementary subset, and both are stored in a tuple.
For instance, the subset $S'$ of itemset $S$ is paired with the subset $S''$ such that $S' \cup S'' = S$ and $S' \cap S'' = \emptyset$.

In effect, this generates the association rules: the first subset is the left-hand side of the rule (`RULE_1`) and the complementary subset is the right-hand side (`RULE_2`).

In [23]:
@udf(returnType=ArrayType(ArrayType(ArrayType(StringType(), False), False), False))
def inner_subsets(itemset: List[str]):
    itemset = set(itemset)
    combis = chain.from_iterable(list(map(set, combinations(itemset, k))) for k in range(1, len(itemset)))
    return list((sorted(combi), sorted(itemset - combi)) for combi in combis)
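
To see what this produces, here is a plain-Python sketch of the same splitting on toy itemsets (`split_rules` is a hypothetical name, and it returns the rules in a deterministic order). An itemset with $n$ items yields $2^n - 2$ rules: 2 for a pair and 6 for a triple.

```python
from itertools import combinations


def split_rules(itemset):
    """Return (left, right) pairs covering every non-empty proper subset."""
    items = sorted(set(itemset))
    rules = []
    for k in range(1, len(items)):
        for left in combinations(items, k):
            # The right-hand side is the complement of the left-hand side.
            right = [item for item in items if item not in left]
            rules.append((list(left), right))
    return rules


pair_rules = split_rules(["B", "A"])
triple_rules = split_rules(["A", "B", "C"])
print(pair_rules)
print(len(triple_rules))
```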

To generate rules from the frequent pairs, the previous UDF is applied to the dataframe of item pairs.

In [24]:
rules_k2 = frequent_diseases_k2 \
    .withColumn('SUBSETS', inner_subsets('CODE_PAIR')) \
    .withColumn('SUBSETS', explode('SUBSETS')) \
    .withColumnRenamed('COUNT', 'COUNT_PAIR') \
    .select(col('SUBSETS')[0].alias('RULE_1'), col('SUBSETS')[1].alias('RULE_2'), 'COUNT_PAIR') \
    .join(frequent_diseases_k1, frequent_diseases_k1['CODE'] == col('RULE_1')[0], 'inner') \
    .withColumnRenamed('COUNT', 'COUNT_1') \
    .drop('CODE') \
    .join(frequent_diseases_k1, frequent_diseases_k1['CODE'] == col('RULE_2')[0], 'inner') \
    .withColumnRenamed('COUNT', 'COUNT_2') \
    .drop('CODE')

In [35]:
rules_k2_metrics = rules_k2 \
    .withColumn('CONFIDENCE', col('COUNT_PAIR') / col('COUNT_1')) \
    .withColumn('INTEREST', col('CONFIDENCE') - col('COUNT_2') / n_patients) \
    .withColumn('LIFT', n_patients * col('CONFIDENCE') / col('COUNT_2')) \
    .withColumn('STANDARDISED_LIFT', 
                (col('LIFT') - array_max(array(
                    (col('COUNT_1') + col('COUNT_2')) / n_patients - 1,
                    lit(1 / n_patients)
                )) / (col('COUNT_1') * col('COUNT_2') / (n_patients ** 2)))
                /
                ((n_patients / array_max(array(col('COUNT_1'), col('COUNT_2')))) - array_max(array(
                    (col('COUNT_1') + col('COUNT_2')) / n_patients - 1,
                    lit(1 / n_patients)
                )) / (col('COUNT_1') * col('COUNT_2') / (n_patients ** 2)))
    ) \
    .filter(col('STANDARDISED_LIFT') >= 0.2) \
    .sort('STANDARDISED_LIFT', ascending=False)
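
The metric expressions in the cell above can be checked by hand with a plain-Python version. The sketch below mirrors those expressions term by term. The function name `rule_metrics`, its parameter names and the toy counts are assumptions made for illustration. Here `p1` and `p2` are the supports of the two sides of the rule as fractions of the baskets, and the standardised lift rescales the lift between the lower and upper bounds used in the cell.

```python
def rule_metrics(count_both, count_1, count_2, n_baskets):
    """Compute rule metrics from raw counts, mirroring the Spark expressions."""
    p1 = count_1 / n_baskets
    p2 = count_2 / n_baskets
    confidence = count_both / count_1
    interest = confidence - p2
    lift = confidence / p2  # same as n_baskets * confidence / count_2
    # Bounds on the lift used for standardisation.
    lower = max(p1 + p2 - 1, 1 / n_baskets) / (p1 * p2)
    upper = 1 / max(p1, p2)
    standardised_lift = (lift - lower) / (upper - lower)
    return {
        "confidence": confidence,
        "interest": interest,
        "lift": lift,
        "standardised_lift": standardised_lift,
    }


# Toy counts: every basket containing the right-hand side also contains
# the left-hand side (count_both == count_2), so the lift reaches its upper bound.
metrics = rule_metrics(count_both=20, count_1=40, count_2=20, n_baskets=100)
print(metrics)
```

With these toy counts the standardised lift comes out at (approximately) 1, the same situation as in rules where `COUNT_PAIR` equals `COUNT_2`.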

In [29]:
rules_k3 = frequent_diseases_k3 \
    .withColumn('CODE_TRIPLE', inner_subsets('CODE_TRIPLE')) \
    .withColumn('CODE_TRIPLE', explode('CODE_TRIPLE')) \
    .withColumnRenamed('COUNT', 'COUNT_TRIPLE') \
    .select(col('CODE_TRIPLE')[0].alias('RULE_1'), col('CODE_TRIPLE')[1].alias('RULE_2'), 'COUNT_TRIPLE') \
    \
    .join(frequent_diseases_k1, array(frequent_diseases_k1['CODE']) == col('RULE_1'), 'left') \
    .withColumnRenamed('COUNT', 'COUNT_1') \
    .drop('CODE') \
    .join(frequent_diseases_k2, frequent_diseases_k2['CODE_PAIR'] == col('RULE_1'), 'left') \
    .withColumnRenamed('COUNT', 'COUNT_1_OTHER') \
    .drop('CODE_PAIR') \
    .withColumn('COUNT_1', coalesce('COUNT_1', 'COUNT_1_OTHER')) \
    .drop('COUNT_1_OTHER') \
    \
    .join(frequent_diseases_k1, array(frequent_diseases_k1['CODE']) == col('RULE_2'), 'left') \
    .withColumnRenamed('COUNT', 'COUNT_2') \
    .drop('CODE') \
    .join(frequent_diseases_k2, frequent_diseases_k2['CODE_PAIR'] == col('RULE_2'), 'left') \
    .withColumnRenamed('COUNT', 'COUNT_2_OTHER') \
    .drop('CODE_PAIR') \
    .withColumn('COUNT_2', coalesce('COUNT_2', 'COUNT_2_OTHER')) \
    .drop('COUNT_2_OTHER')

In [34]:
rules_k3_metrics = rules_k3 \
    .withColumn('CONFIDENCE', col('COUNT_TRIPLE') / col('COUNT_1')) \
    .withColumn('INTEREST', col('CONFIDENCE') - col('COUNT_2') / n_patients) \
    .withColumn('LIFT', n_patients * col('CONFIDENCE') / col('COUNT_2')) \
    .withColumn('STANDARDISED_LIFT', 
                (col('LIFT') - array_max(array(
                    (col('COUNT_1') + col('COUNT_2')) / n_patients - 1,
                    lit(1 / n_patients)
                )) / (col('COUNT_1') * col('COUNT_2') / (n_patients ** 2)))
                /
                ((n_patients / array_max(array(col('COUNT_1'), col('COUNT_2')))) - array_max(array(
                    (col('COUNT_1') + col('COUNT_2')) / n_patients - 1,
                    lit(1 / n_patients)
                )) / (col('COUNT_1') * col('COUNT_2') / (n_patients ** 2)))
    ) \
    .filter(col('STANDARDISED_LIFT') >= 0.2) \
    .sort('STANDARDISED_LIFT', ascending=False)

### Printing

In [36]:
rules_k2_metrics.show(truncate=False)



+----------------+----------------+----------+-------+-------+-------------------+-------------------+------------------+------------------+
|RULE_1          |RULE_2          |COUNT_PAIR|COUNT_1|COUNT_2|CONFIDENCE         |INTEREST           |LIFT              |STANDARDISED_LIFT |
+----------------+----------------+----------+-------+-------+-------------------+-------------------+------------------+------------------+
|[44054006]      |[422034002]     |20456     |77306  |20456  |0.26461076759889274|0.24693938821884232|14.973973559620212|1.0000000000000002|
|[1551000119108] |[1501000119109] |3035      |11705  |3035   |0.25929090132422045|0.2566690477644603 |98.89602733874415 |1.0000000000000002|
|[72892002]      |[398254007]     |22959     |205390 |22959  |0.11178246263206583|0.09194880995380139|5.635999805248552 |1.0000000000000002|
|[44054006]      |[1551000119108] |11705     |77306  |11705  |0.1514112746746695 |0.14129964504798342|14.973973559620212|1.0000000000000002|
|[67811000119

In [64]:
@udf(returnType=StringType())
def format_rule(rule_1: List[str], rule_2: List[str], *values: Any):
    return f'{{{", ".join(rule_1)}}} -> {{{", ".join(rule_2)}}}: {", ".join(map(str, values))}'

In [69]:
code_description_df.count()

160

In [65]:
# TODO: map code to description

rules_k2_metrics \
    .select(format_rule('RULE_1', 'RULE_2', 'STANDARDISED_LIFT', 'LIFT', 'CONFIDENCE', 'INTEREST')) \
    .show(truncate=False)



+------------------------------------------------------------------------------------------------------------------+
|format_rule(RULE_1, RULE_2, STANDARDISED_LIFT, LIFT, CONFIDENCE, INTEREST)                                        |
+------------------------------------------------------------------------------------------------------------------+
|{1551000119108} -> {1501000119109}: 1.0000000000000002, 98.89602733874415, 0.25929090132422045, 0.2566690477644603|
|{44054006} -> {422034002}: 1.0000000000000002, 14.973973559620212, 0.26461076759889274, 0.24693938821884232       |
|{44054006} -> {1551000119108}: 1.0000000000000002, 14.973973559620212, 0.1514112746746695, 0.14129964504798342    |
|{72892002} -> {398254007}: 1.0000000000000002, 5.635999805248552, 0.11178246263206583, 0.09194880995380139        |
|{443165006} -> {64859006}: 1.0, 20.9323158713224, 1.0, 0.9522269773613528                                         |
|{64859006} -> {443165006}: 1.0, 20.9323158713224, 0.31225475127

Example output

```
...
{Diabetes, Neoplasm} -> {Colon polyp}: 0.2000, ...
...
```
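
One possible way to get from codes to a line in this shape is sketched below in plain Python. The function `format_rule_with_descriptions`, the sample mapping and the choice of four decimal places are assumptions for illustration. It looks each code up in a `CODE`-to-`DESCRIPTION` dictionary and falls back to the raw code when no description is known.

```python
def format_rule_with_descriptions(rule_1, rule_2, values, descriptions):
    """Format a rule as '{left} -> {right}: v1, v2, ...' using descriptions."""

    def names(codes):
        # Fall back to the code itself if it has no known description.
        return ", ".join(descriptions.get(code, code) for code in codes)

    metrics = ", ".join(f"{value:.4f}" for value in values)
    return f"{{{names(rule_1)}}} -> {{{names(rule_2)}}}: {metrics}"


# Invented codes and metric values, for illustration only.
descriptions = {"1": "Diabetes", "2": "Neoplasm", "3": "Colon polyp"}
line = format_rule_with_descriptions(["1", "2"], ["3"], [0.2, 1.5], descriptions)
print(line)
```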