# Background Information

**Affinity Analysis** discovers co-occurrence relationships among activities performed by specific individuals or groups. In retail, affinity analysis is often used for market basket analysis. **Market Basket Analysis** is a technique used to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. 

## Definitions

Let $I = \{i_1, i_2, ..., i_n\}$ be a set of $n$ binary attributes called *items*.
Let $T = \{t_1, t_2, ..., t_m\}$ be a database of *transactions*.

Each transaction *t* is a binary vector with $t[k] = 1$ if *t* contains the item $i_k$ and 0 otherwise. 

An **Association Rule** suggests a relationship between the items, and it's generally written like so: $ \{Milk\} \Rightarrow \{Bread\} $. Note that X and Y in the equation below can contain multiple items and of course $X, Y \subseteq I$. X is called the antecedent or left hand side (LHS), and Y is called the consequent, or right hand side (RHS).

$$ X \Rightarrow Y $$

**Support** is the relative frequency with which an itemset occurs. The support of X with respect to T is the proportion of transactions *t* in the dataset that contain the itemset X. High support indicates that a relationship applies to enough transactions to be useful.

$$ supp(X) = \frac{|\{t\in T; X \subseteq t\}|}{|T|} $$

$$ supp(X \cup Y) = \frac{|\{t\in T; X \cup Y \subseteq t\}|}{|T|} $$

$$ supp \in [0, 1] $$

$supp(X)$ is also known as the antecedent support, whereas $supp(Y)$ is the consequent support. 
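
As a minimal sketch of the support formula, the snippet below counts the baskets that contain an itemset. The five baskets and the `support` helper are made up for illustration and are not part of any library.

```python
# Toy baskets (hypothetical data, not from a real dataset)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

supp_milk = support({"milk"}, transactions)           # 3 of 5 baskets
supp_both = support({"milk", "bread"}, transactions)  # 2 of 5 baskets
print(supp_milk, supp_both)
```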

**Confidence** is the measure of reliability of the rule. A confidence of 0.50 would mean that in 50% of the cases where milk was purchased, so was bread. 

$$ conf(X \Rightarrow Y) = \frac{frq(X \cup Y)}{frq(X)} $$

$$ conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)} $$

$$ conf \in [0, 1] $$

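However, confidence can be misleading, especially when $supp(Y) > conf(X \Rightarrow Y)$: in that case Y is less likely to appear in transactions containing X than in transactions overall, even if the confidence looks high.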
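
A small sketch of this pitfall, using made-up baskets (the data and the `count` helper are assumptions for illustration): the rule {milk} ⇒ {bread} has a confidence of about 0.67, yet bread appears in 80% of all baskets, so milk buyers are less likely than average to buy bread.

```python
# Toy baskets (hypothetical data)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "eggs"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# conf(milk => bread) = frq(milk and bread) / frq(milk)
conf_milk_bread = count({"milk", "bread"}) / count({"milk"})

# Baseline: how often bread appears regardless of milk
supp_bread = count({"bread"}) / len(transactions)

print(f"confidence = {conf_milk_bread:.2f}, supp(bread) = {supp_bread:.2f}")
print("misleading" if supp_bread > conf_milk_bread else "ok")
```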

**Lift** is the ratio of the observed support to the support expected if X and Y were independent. It measures how many times more often X and Y occur together than they would if they were statistically independent. A lift less than 1 means X and Y are negatively correlated. A lift of 1 means X and Y are independent. Lift values > 1 are generally more "interesting" and could be indicative of a useful rule pattern. With a high lift, the chance that the rule is just a coincidence is lower. 

$$ lift(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \times supp(Y)} $$

$$ lift \in [0, \infty) $$

**Leverage** computes the difference between the observed frequency of X and Y occurring together and the frequency we would expect if X and Y were independent. A leverage value of 0 indicates independence.

$$ leverage(X \Rightarrow Y) = supp(X \cup Y) - (supp(X) \times supp(Y)) $$

$$ leverage \in [-1, 1] $$
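
As a worked example with assumed values (chosen for illustration), take $supp(X) = 0.6$, $supp(Y) = 0.8$ and $supp(X \cup Y) = 0.6$. Under independence we would expect X and Y together in $0.6 \times 0.8 = 0.48$ of transactions, so:

$$ lift(X \Rightarrow Y) = \frac{0.6}{0.6 \times 0.8} = 1.25 $$

$$ leverage(X \Rightarrow Y) = 0.6 - (0.6 \times 0.8) = 0.12 $$

Both say the same thing in different units: lift expresses the excess co-occurrence as a ratio, leverage as an absolute difference in support.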

**Conviction** is the ratio of the expected frequency that X occurs without Y if X and Y were independent divided by the observed frequency of incorrect predictions. A conviction of 1.2 would mean that the rule would be incorrect 20% more often if the association between X and Y was purely random chance. A high conviction means that the consequent is highly dependent on the antecedent. Rules that always hold have a value of infinity. If the items are independent, the conviction is 1.

$$ conv(X \Rightarrow Y) = \frac{1-supp(Y)}{1-conf(X \Rightarrow Y)} $$

$$ conv \in [0, \infty] $$
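
A minimal sketch of the conviction formula, including the infinite case for rules that always hold. The `conviction` helper and the input values are assumptions for illustration.

```python
import math

def conviction(supp_y, conf_xy):
    """conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))."""
    if conf_xy == 1.0:
        # The rule is never wrong, so the denominator is zero
        return math.inf
    return (1.0 - supp_y) / (1.0 - conf_xy)

conv_partial = conviction(supp_y=0.6, conf_xy=0.75)  # 0.4 / 0.25
conv_always = conviction(supp_y=0.8, conf_xy=1.0)    # rule always holds
conv_indep = conviction(supp_y=0.8, conf_xy=0.8)     # conf equals supp(Y)

print(conv_partial, conv_always, conv_indep)
```

When the confidence equals $supp(Y)$, knowing X tells us nothing about Y and the conviction comes out as 1.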

**Rule Power Factor (RPF)** indicates how intensely a rule's items are positively associated with each other. 

$$ rpf(X \Rightarrow Y) = \frac{supp^2(X \cup Y)}{supp(X)} $$

There are also other interestingness measures, such as all-confidence and collective strength.

## Process

Association rules generally need to satisfy some specified minimum support and confidence (or other interestingness) threshold. The best rules generally have **high support** (the rule applies to a large number of cases), **confidence** (the rule is reliable), and **lift** (the rule is not a coincidence).

1. A minimum support threshold is applied to find all frequent itemsets in a database. This step is important as support-pruning also tends to eliminate most spurious correlations.
2. Minimum interestingness constraints are applied to these frequent itemsets in order to form rules.

However, the first step, finding all frequent itemsets in a database, is difficult since it involves searching all item combinations. The set of possible itemsets is the power set over I and has size $2^n - 1$ (excluding the empty set). Efficient search is possible using the downward-closure property of support, which guarantees that all subsets of a frequent itemset are also frequent, and thus that no infrequent itemset can be a subset of a frequent itemset. 
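
To make the size of the search space and the downward-closure property concrete, here is a brute-force sketch on made-up baskets. It enumerates every non-empty itemset and then checks that each subset of a frequent itemset is itself frequent. The data, the threshold, and the variable names are assumptions for illustration.

```python
from itertools import combinations

# Toy baskets (hypothetical data)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "eggs"},
]
items = sorted(set().union(*transactions))
min_support = 0.4

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

# Brute force: every non-empty subset of the items, 2^n - 1 in total
all_itemsets = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]
frequent = {s for s in all_itemsets if support(s) >= min_support}

# Downward closure: every non-empty proper subset of a frequent itemset is frequent
closure_holds = all(
    frozenset(sub) in frequent
    for s in frequent
    for k in range(1, len(s))
    for sub in combinations(s, k)
)

print(len(all_itemsets), len(frequent), closure_holds)
```

Brute force is only workable here because there are three items; the number of candidates doubles with every item added.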

### Algorithms

The **Apriori** algorithm is used to find the frequent itemsets. It first identifies the frequent individual items in the database and then extends them into larger and larger item sets as long as those item sets appear sufficiently often in the database.

This is the algorithm used by the `apriori` function in the Python library `mlxtend`. A number of other algorithms are described on the [Wiki Page](https://en.wikipedia.org/wiki/Association_rule_learning) if you wish to write a custom implementation.
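
The level-wise search described above can be sketched in plain Python as follows. This is a simplified illustration, not `mlxtend`'s implementation; the function name `apriori_sketch` and the toy baskets are assumptions.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Count: keep the candidates that are frequent in the data
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets (hypothetical data)
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "eggs"},
]
result = apriori_sketch(baskets, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is where downward closure pays off: {bread, eggs, milk} is discarded without scanning the data because its subset {eggs, milk} is not frequent.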

## Differences between Affinity Analysis & Collaborative Filtering

Both affinity analysis and collaborative filtering can be used to recommend items to users, but in general they seek to answer different questions. Collaborative filtering answers "What items do users with similar interests to yours like?" Market basket analysis aims to answer "What items are frequently associated with each other?"

In market basket analysis, we consider each basket for a user separately, whereas collaborative filtering considers baskets aggregated per user. 

In market basket analysis, you also consider different measurements, such as support and lift, and there is directionality in the relationships. For example, $X \Rightarrow Y$ is not the same as $Y \Rightarrow X$. Collaborative filtering uses symmetric measures, like cosine similarity. 
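
The contrast in directionality can be sketched on made-up baskets. Here each basket stands in for one user's purchase vector, which is a simplifying assumption for illustration; the `count` and `cosine` helpers are also hypothetical.

```python
import math

# Toy baskets (hypothetical data); each basket plays the role of one user
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "eggs"},
]

def count(*items):
    """Number of baskets that contain all of the given items."""
    return sum(1 for t in transactions if set(items) <= t)

# Confidence depends on the direction of the rule
conf_milk_bread = count("milk", "bread") / count("milk")
conf_bread_milk = count("milk", "bread") / count("bread")

def cosine(a, b):
    """Cosine similarity between the binary purchase vectors of items a and b."""
    return count(a, b) / math.sqrt(count(a) * count(b))

cos_ab = cosine("milk", "bread")
cos_ba = cosine("bread", "milk")

print(conf_milk_bread, conf_bread_milk)  # two different values
print(cos_ab, cos_ba)                    # the same value in both directions
```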

Collaborative filtering can also surface more indirect similarities. For example, if you buy item 1, it could find that item 2 is bought along with it, and that items 3 and 4 are similar to item 2. It can then recommend items 3 and 4, even if they're not often associated with item 1. 

Affinity analysis is also generally used as an exploratory tool.

## Sources

Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93. p. 207.


# A Simple Example

We'll start with a simple example from the `mlxtend` documentation.

[Source](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)

[Documentation](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)

In [1]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Sample Dataset of Transactions

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Now, let's convert this dataset into a dataframe of binary vectors

te = TransactionEncoder()
transactions = te.fit(dataset).transform(dataset)
transactions_df = pd.DataFrame(transactions, columns = te.columns_)
transactions_df

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False
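
For intuition, the encoding step amounts to roughly the following in plain Python. This is a sketch of the idea, not how `TransactionEncoder` is implemented; the names `columns` and `encoded` are assumptions.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# One column per distinct item, in sorted order
columns = sorted({item for basket in dataset for item in basket})

# One boolean row per transaction: True where the basket contains the item
encoded = [[item in basket for item in columns] for basket in dataset]

print(columns)
print(encoded[0])
```

Note that the duplicate 'Onion' in the last transaction collapses to a single True, since the encoding records presence rather than quantity.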


In [2]:
# Let's get the most frequent item sets with a specified minimum support
# by default, apriori uses a min_support of 0.5

frequent_itemsets = apriori(transactions_df, min_support=0.6, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Onion, Eggs)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"


In [3]:
# Now let's generate our association rules
# association_rules() lets you specify your metric of interest and the threshold

from mlxtend.frequent_patterns import association_rules

# Rules are sorted by the metric in descending order
metric = 'confidence'
rules = association_rules(frequent_itemsets, metric=metric, min_threshold=0.70).sort_values(by=[metric], ascending=False).reset_index()
rules

Unnamed: 0,index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,1,(Eggs),(Kidney Beans),0.8,1.0,0.8,1.0,1.0,0.0,inf
1,2,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
2,4,(Milk),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
3,5,(Onion),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
4,6,(Yogurt),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
5,7,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
6,9,"(Onion, Eggs)",(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
7,10,(Onion),"(Kidney Beans, Eggs)",0.6,0.8,0.6,1.0,1.25,0.12,inf
8,0,(Kidney Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,0.0,1.0
9,3,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6


In [4]:
# The rules above already meet the confidence threshold; now keep only those that also have lift > 1

rules[rules['lift'] > 1]

Unnamed: 0,index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,2,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
5,7,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
7,10,(Onion),"(Kidney Beans, Eggs)",0.6,0.8,0.6,1.0,1.25,0.12,inf
9,3,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
10,8,"(Kidney Beans, Eggs)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
11,11,(Eggs),"(Kidney Beans, Onion)",0.8,0.6,0.6,0.75,1.25,0.12,1.6


# Example with Real Data

This data set comes from the UCI Machine Learning Repository and represents transactional data from a UK retailer from 2010-2011. It actually represents sales to wholesalers, so it's a little different from our assumed use case.

In [5]:
dataset = pd.read_excel('Data/OnlineRetail.xlsx')
dataset.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [6]:
dataset.shape

(541909, 8)

In [7]:
dataset['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

In [12]:
dataset[dataset['Country'].isin(['France', 'Germany'])].shape

(18052, 8)

In [13]:
# Let's consider a subset of the data to keep things smaller
dataset_sm = dataset[dataset['Country'].isin(['France', 'Germany'])].copy()

In [14]:
dataset_sm.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [15]:
dataset_sm['InvoiceNo'] = dataset_sm['InvoiceNo'].astype('str')

In [16]:
# Data Cleaning
dataset_sm.dropna(axis=0, subset=['InvoiceNo', 'StockCode', 'Description'], inplace=True)
# No rows had invalid invoice numbers or stock codes
dataset_sm.shape

(18052, 8)

In [17]:
# Remove cancelled transactions (invoice numbers containing 'C')
dataset_sm = dataset_sm[~dataset_sm['InvoiceNo'].str.contains('C')]
dataset_sm.shape

(17450, 8)

In [18]:
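# Let's transform the transactions into binary vectors
# Note: as written, this pivots the full `dataset`, not the cleaned `dataset_sm` subset,
# so the tables that follow cover every country and still include cancelled invoices
transactions_df = dataset.pivot_table(index='InvoiceNo', columns='Description', values='Quantity').fillna(0)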
transactions_df.head()

InvoiceNo,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
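
The pivot produces one row per invoice and one column per description. A rough plain-Python sketch of the same idea groups line items into a set of descriptions per invoice. The rows below are made up for illustration (the invoice numbers and quantities are hypothetical), and skipping cancellations and non-positive quantities is an assumption that the pivot itself does not make.

```python
from collections import defaultdict

# (invoice number, description, quantity): hypothetical rows for illustration
rows = [
    ("1001", "WHITE METAL LANTERN", 6),
    ("1001", "CREAM CUPID HEARTS COAT HANGER", 8),
    ("1002", "WHITE METAL LANTERN", 2),
    ("C1003", "WHITE METAL LANTERN", -2),  # cancellation with a negative quantity
]

baskets = defaultdict(set)
for invoice, description, quantity in rows:
    # Assumption: ignore cancelled invoices and non-positive quantities
    if invoice.startswith("C") or quantity <= 0:
        continue
    baskets[invoice].add(description)

# Boolean matrix: one row per invoice, one column per description
columns = sorted({d for items in baskets.values() for d in items})
matrix = {inv: [c in items for c in columns] for inv, items in baskets.items()}

print(columns)
for inv, row in matrix.items():
    print(inv, row)
```

Filtering on quantity matters because a later cast to boolean treats any non-zero value, including a negative one, as a purchase.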


In [19]:
# The Postage item isn't particularly interesting, so we'll drop this column

transactions_df = transactions_df.drop(['POSTAGE'], axis=1)
transactions_df.head()

InvoiceNo,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# We have to convert it to boolean values for the mlxtend functions
# This can take a while for large dataframes
transactions_df = transactions_df.astype(bool)

In [21]:
transactions_df.head()

InvoiceNo,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536366,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536367,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536368,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536369,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [24]:
# Now let's get the most frequent item sets

# These are itemsets that have appeared in at least 3% of transactions
frequent_itemsets = apriori(transactions_df, min_support=0.03, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.039311,(6 RIBBONS RUSTIC CHARM)
1,0.034198,(60 TEATIME FAIRY CAKE CASES)
2,0.040947,(ALARM CLOCK BAKELIKE GREEN)
3,0.032684,(ALARM CLOCK BAKELIKE PINK)
4,0.044220,(ALARM CLOCK BAKELIKE RED )
5,0.060010,(ASSORTED COLOUR BIRD ORNAMENT)
6,0.039311,(BAKING SET 9 PIECE RETROSPOT )
7,0.031089,(CHARLOTTE BAG PINK POLKADOT)
8,0.036570,(CHARLOTTE BAG SUKI DESIGN)
9,0.035425,(CHOCOLATE HOT WATER BOTTLE)


In [25]:
from mlxtend.frequent_patterns import association_rules

# Rules are sorted by the metric in descending order
metric = 'lift'
rules = association_rules(frequent_itemsets, metric=metric, min_threshold=2).sort_values(by=[metric], ascending=False).reset_index()
rules

Unnamed: 0,index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,0,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.045815,0.043238,0.032071,0.7,16.189404,0.03009,3.189206
1,1,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.043238,0.045815,0.032071,0.741722,16.189404,0.03009,3.694408
2,3,(JUMBO BAG PINK POLKADOT),(JUMBO BAG RED RETROSPOT),0.050356,0.087335,0.034075,0.676686,7.74813,0.029677,2.82284
3,2,(JUMBO BAG RED RETROSPOT),(JUMBO BAG PINK POLKADOT),0.087335,0.050356,0.034075,0.390164,7.74813,0.029677,1.557212


In [26]:
# Let's do further filtering to find rules with high lift and confidence
rules[(rules['lift'] > 5) & (rules['confidence'] > 0.5)]

Unnamed: 0,index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,0,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.045815,0.043238,0.032071,0.7,16.189404,0.03009,3.189206
1,1,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.043238,0.045815,0.032071,0.741722,16.189404,0.03009,3.694408
2,3,(JUMBO BAG PINK POLKADOT),(JUMBO BAG RED RETROSPOT),0.050356,0.087335,0.034075,0.676686,7.74813,0.029677,2.82284
