# Example 05: Association Rules
## Association rules in supermarket purchases

Association rules are used to discover elements that occur together within a given dataset, along with their possible associations.

The basic premise of association rules is to find elements whose presence implies the occurrence of other elements in the same transaction, that is, to find relationships or frequent patterns among sets of data. The term transaction refers to the set of items recorded together in a single operation, such as one query or one purchase.

A classic example is the association between products bought by a consumer: if a customer buys a given product, which other products do they also tend to buy? This technique is widely used in supermarkets and retail stores.

Spark implements the FP-Growth algorithm, a parallel frequent-pattern mining algorithm that serves the same purpose as *Apriori* but does not generate candidate itemsets explicitly.
 
### FP-Growth

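FP-Growth is an algorithm for mining frequent itemsets. Unlike Apriori-style algorithms designed for the same purpose, it encodes the transactions in a tree structure (the FP-tree) and extracts frequent itemsets from it without generating candidate sets explicitly. The Spark implementation uses the parallel FP-Growth algorithm described in *Li et al.*, **PFP: Parallel FP-Growth for Query Recommendation** [LI2008](http://dx.doi.org/10.1145/1454008.1454027). PFP distributes computation in such a way that each worker executes an independent group of mining tasks.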

In [1]:
# Load libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split
from pyspark.ml.fpm import FPGrowth

import time
start_time = time.time()

## Parameter configuration

In [2]:
# Path to dataset file
data_path='./data/'

## Creating Spark environment

In [3]:
# Create Spark Session
sc = SparkSession.builder \
     .master("local[*]") \
     .appName("AssociationRule") \
     .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/16 08:05:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Reading Data

In [4]:
# Read one transaction per line (csv) and split the comma-separated text into an array of items
data = (sc.read
       .text(data_path+"groceries.csv.gz")
       .select(split("value", ",").alias("items")))

data.show(truncate=False)

+----------------------------------------------------------------------------------------------+
|items                                                                                         |
+----------------------------------------------------------------------------------------------+
|[citrus fruit, semi-finished bread, margarine, ready soups]                                   |
|[tropical fruit, yogurt, coffee]                                                              |
|[whole milk]                                                                                  |
|[pip fruit, yogurt, cream cheese , meat spreads]                                              |
|[other vegetables, whole milk, condensed milk, long life bakery product]                      |
|[whole milk, butter, yogurt, rice, abrasive cleaner]                                          |
|[rolls/buns]                                                                                  |
|[other vegetables, UHT-milk, 

## Association Rules: Frequent Pattern Mining

Mining frequent items, itemsets, subsequences, or other substructures is usually among the first steps in analyzing a large-scale dataset.

### Configure the FPGrowth algorithm:

**itemsCol** = Name of the items column. The default is `items`, so it can be omitted when the column already has that name.

**minConfidence** = Minimal confidence for generating association rules, in the range [0.0, 1.0]. minConfidence does not affect the mining of frequent itemsets, only the generation of association rules.

**minSupport** = Minimal support level of a frequent pattern, in the range [0.0, 1.0]. Support says how popular an itemset is, measured as the proportion of transactions in which the itemset appears. Any pattern that appears more than (minSupport * size-of-the-dataset) times will be output in the frequent itemsets.

**numPartitions** = Number of partitions (at least 1) used by parallel FP-Growth. By default the param is not set, and the partition number of the input dataset is used.
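To make the role of minSupport concrete, the sketch below counts every itemset in a handful of hypothetical toy transactions (not taken from the groceries file) and keeps those above the support threshold. It is a brute-force enumeration for illustration only; it is not FP-Growth and it would not scale to a real dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, invented for this illustration
transactions = [
    ["whole milk", "yogurt"],
    ["whole milk", "other vegetables"],
    ["whole milk", "yogurt", "rolls/buns"],
    ["soda"],
    ["other vegetables", "soda"],
]

min_support = 0.3
# Threshold expressed as a number of transactions: 0.3 * 5 = 1.5
min_count = min_support * len(transactions)

# Count every non-empty subset of every transaction (brute force, not FP-Growth)
counts = Counter()
for transaction in transactions:
    items = sorted(set(transaction))
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            counts[itemset] += 1

# Keep only the itemsets that appear more often than the threshold
frequent = {itemset: c for itemset, c in counts.items() if c > min_count}

for itemset, c in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(list(itemset), c, c / len(transactions))
```

Raising `min_support` shrinks the result, and lowering it makes the number of itemsets grow quickly, which is why the threshold is usually tuned to the dataset.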

In [5]:
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.05, minConfidence=0.1)
fi = fpGrowth.fit(data)

# Display frequent itemsets.
fi.freqItemsets.sort('freq', ascending=False).show(truncate=False)

+------------------------------+----+
|items                         |freq|
+------------------------------+----+
|[whole milk]                  |2513|
|[other vegetables]            |1903|
|[rolls/buns]                  |1809|
|[soda]                        |1715|
|[yogurt]                      |1372|
|[bottled water]               |1087|
|[root vegetables]             |1072|
|[tropical fruit]              |1032|
|[shopping bags]               |969 |
|[sausage]                     |924 |
|[pastry]                      |875 |
|[citrus fruit]                |814 |
|[bottled beer]                |792 |
|[newspapers]                  |785 |
|[canned beer]                 |764 |
|[pip fruit]                   |744 |
|[other vegetables, whole milk]|736 |
|[fruit/vegetable juice]       |711 |
|[whipped/sour cream]          |705 |
|[brown bread]                 |638 |
+------------------------------+----+
only showing top 20 rows



### Display generated association rules

**Antecedent:** The items on the left-hand side of the rule (X).

**Consequent:** The items on the right-hand side of the rule (Y).

**Confidence:** How likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. It is measured by the proportion of transactions containing item X in which item Y also appears.

**Lift:** How much more likely item Y is to be purchased when item X is purchased, compared with how often Y is purchased overall. It is the confidence of {X -> Y} divided by the support of Y, which corrects for the effect of item popularity on confidence. A lift greater than 1 means that Y is more likely to be bought when X is bought, a lift less than 1 means that Y is less likely to be bought when X is bought, and a lift of 1 means no association.
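As a sanity check on these definitions, the sketch below recomputes confidence and lift for the rule {other vegetables -> whole milk} from the numbers printed in this notebook's output tables. The total number of transactions is never printed, so it is inferred here from the reported frequency and support; treat that value as an assumption.

```python
# Frequencies copied from the frequent-itemset table in this notebook
freq_other_vegetables = 1903   # [other vegetables]
freq_whole_milk = 2513         # [whole milk]
freq_both = 736                # [other vegetables, whole milk]

# Support of the rule as printed in the association-rule table
support_both = 0.07483477376715811

# Assumption: dataset size inferred as freq / support, since it is not printed
n_transactions = round(freq_both / support_both)

# Confidence of {other vegetables -> whole milk}:
# share of baskets with other vegetables that also contain whole milk
confidence = freq_both / freq_other_vegetables

# Lift: confidence divided by the overall support of the consequent
lift = confidence / (freq_whole_milk / n_transactions)

print(n_transactions, confidence, lift)
```

The two results should agree with the confidence and lift columns of the corresponding row in the rule table, up to floating-point rounding.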

In [6]:
fi.associationRules.sort('confidence', ascending=False).show(truncate=False)

+------------------+------------------+-------------------+------------------+-------------------+
|antecedent        |consequent        |confidence         |lift              |support            |
+------------------+------------------+-------------------+------------------+-------------------+
|[yogurt]          |[whole milk]      |0.40160349854227406|1.5717351405345266|0.05602440264361973|
|[other vegetables]|[whole milk]      |0.38675775091960063|1.5136340948246207|0.07483477376715811|
|[rolls/buns]      |[whole milk]      |0.30790491984521834|1.2050317893663836|0.05663446873411286|
|[whole milk]      |[other vegetables]|0.29287703939514526|1.513634094824621 |0.07483477376715811|
|[whole milk]      |[rolls/buns]      |0.2216474333465977 |1.2050317893663838|0.05663446873411286|
|[whole milk]      |[yogurt]          |0.2192598487863112 |1.5717351405345266|0.05602440264361973|
+------------------+------------------+-------------------+------------------+-------------------+



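### Verify the rules against the dataset

`transform` compares the items of each row against the antecedents of all association rules. When a row contains every item of a rule's antecedent, the rule applies, and its consequents that are not already in the row are added to the `prediction` column.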
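The matching logic can be sketched in plain Python. The `predict` helper below is hypothetical and only mimics the behavior described above, using the six rules printed in the rule table; it is not Spark's implementation, and the order of the predicted items may differ from Spark's output.

```python
# (antecedent, consequent) pairs copied from the association-rule table
rules = [
    (["yogurt"], ["whole milk"]),
    (["other vegetables"], ["whole milk"]),
    (["rolls/buns"], ["whole milk"]),
    (["whole milk"], ["other vegetables"]),
    (["whole milk"], ["rolls/buns"]),
    (["whole milk"], ["yogurt"]),
]

def predict(items, rules):
    """Hypothetical helper: collect consequents of every applicable rule."""
    basket = set(items)
    prediction = []
    for antecedent, consequent in rules:
        # A rule applies only when the basket contains the whole antecedent
        if set(antecedent) <= basket:
            for item in consequent:
                # Skip items already in the basket or already predicted
                if item not in basket and item not in prediction:
                    prediction.append(item)
    return prediction

print(predict(["tropical fruit", "yogurt", "coffee"], rules))
print(predict(["whole milk"], rules))
print(predict(["citrus fruit", "semi-finished bread", "margarine", "ready soups"], rules))
```

The first basket triggers only the yogurt rule, the second triggers the three rules whose antecedent is whole milk, and the third matches no rule and gets an empty prediction.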

In [7]:
fi.transform(data).show(truncate=False)

+----------------------------------------------------------------------------------------------+--------------------------------------+
|items                                                                                         |prediction                            |
+----------------------------------------------------------------------------------------------+--------------------------------------+
|[citrus fruit, semi-finished bread, margarine, ready soups]                                   |[]                                    |
|[tropical fruit, yogurt, coffee]                                                              |[whole milk]                          |
|[whole milk]                                                                                  |[rolls/buns, yogurt, other vegetables]|
|[pip fruit, yogurt, cream cheese , meat spreads]                                              |[whole milk]                          |
|[other vegetables, whole milk, condensed milk, 

In [8]:
sc.stop()
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 10.1537024974823 seconds ---
