### Case Study: Cosmetic Store

Description: In this case study, we will implement affinity analysis using the Apriori and ECLAT algorithms.

### Scenario

Nyja, a cosmetic company, sells cosmetic products such as foundations, concealers, lipsticks, and highlighters, among others. It has multiple branches in more than 150 cities worldwide and sells its products both online and offline. The company senses a surge in online demand for cosmetics among women consumers and decides to run a promotion to market its product portfolio through several popular online channels. With the proliferation of smartphones, Internet usage has increased significantly, pushing millennials and Gen Zers toward online shopping channels.
At this point, the store manager would like to know how likely customers are to purchase two products together. To put this plan into action, the company decides to mine data from its existing database.


### Objectives

To figure out how likely it is for a customer to purchase any two products at the same time.
In addition to this primary objective, Nyja wants to use the data to generate new business insights for understanding business patterns and growth opportunities.

### About dataset

The dataset contains 2 columns and 957 rows:
* Column 1 (Transaction): Number of transactions made by the customers
* Column 2 (Items): Number of items purchased together

### Deliverable

Find the items purchased together using the Apriori and ECLAT algorithms.

Affinity analysis is also called market basket analysis.

It is mainly used on customer transactional databases to explore dependencies in customers' purchasing patterns.

### What is POS transactional data?

Point of Sale (POS) data has a large number of transaction records.
* Data collected using bar-code scanners
* Each record lists all items purchased by a customer on a single purchase transaction

***Are certain groups of items consistently purchased together? How would you use this knowledge?***


Consider two products, A and B. If items A and B are frequently bought together, certain steps could be taken to maximize profits.

* Place A and B near each other, so that customers who buy one do not have to go far to find the other
* Send a personalized ad promoting B to customers who bought A
* Offer discounts when people buy them together.

Identifying associations between products is called **association rule mining**.

### Import Required Libraries and read the file

In [1]:
import pandas as pd
import csv
import re

# COSMATICS_DATASET = 'cosmetics.csv'
# df = pd.read_csv(COSMATICS_DATASET)
# df = pd.melt(df, id_vars=['Trans'], value_vars=list(df.columns)[1:])
# df = df[df['value']==1]
# df = df.groupby('Trans')['variable'].apply(list).rename('items').reset_index()
# df['items'] = df['items'].apply(lambda x: ','.join(x))
# df.to_csv('cosmetics_items.csv',index=False)

COSMATICS_DATASET = 'cosmetics.csv'
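def read_data(file_path):
    """Read transactions as (transaction_no, set_of_items) tuples, skipping the header row."""
    data = []
    # Use a context manager so the file handle is always closed
    with open(file_path, encoding='utf-8') as f:
        for line in f.readlines()[1:]:
            line = re.sub(r'\"', '', line)  # strip quote characters
            items = [k.strip() for k in line.split(',')]
            transaction_no, items = items[0], items[1:]
            data.append((transaction_no, set(items)))
    return data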
data = read_data(COSMATICS_DATASET)
df = pd.DataFrame(data,columns=['transaction#','items'])
df

Unnamed: 0,transaction#,items
0,1,"{0, 1}"
1,2,"{0, 1}"
2,3,"{0, 1}"
3,4,"{0, 1}"
4,5,"{0, 1}"
5,6,"{0, 1}"
6,7,"{0, 1}"
7,8,"{0, 1}"
8,9,"{0, 1}"
9,10,"{0, 1}"


In [2]:
# import pandas as pd
# FACEPLATES_DATASET = 'mobile_faceplates_data.csv'
# def read_data(file_path):
#     data = []
#     f = open(file_path,encoding='utf-8')
#     for line in f.readlines():
#         items = [k.strip() for k in line.split(',')]
#         transaction_no, items = items[0], items[1:]
#         data.append((transaction_no,set(items)))
#     return data
# data = read_data(FACEPLATES_DATASET)
# df = pd.DataFrame(data,columns=['transaction#','items'])


First, we need to convert this data into a binary matrix format.
* Convert item labels into binary columns.
* **MultiLabelBinarizer** can be used for this.

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
temp_df = pd.DataFrame(mlb.fit_transform(df['items'].tolist()),columns=mlb.classes_)
df = pd.concat([df,temp_df], axis=1)
df

Unnamed: 0,transaction#,items,0,1
0,1,"{0, 1}",1,1
1,2,"{0, 1}",1,1
2,3,"{0, 1}",1,1
3,4,"{0, 1}",1,1
4,5,"{0, 1}",1,1
5,6,"{0, 1}",1,1
6,7,"{0, 1}",1,1
7,8,"{0, 1}",1,1
8,9,"{0, 1}",1,1
9,10,"{0, 1}",1,1


* Association rules are probabilistic "if-then" statements
* The basic idea is to examine all possible rules between items
* Select only the rules most likely to indicate true dependence
* Examples of if-then rules:
    * If {A, B} Then {C}
    * If {A, D} Then {E}
* Many rules are possible; how do we separate the good ones from the bad?
* In the rule "If red and white, then green", red and white are called the antecedent and green is called the consequent.
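
To make "many rules are possible" concrete, the sketch below enumerates every antecedent/consequent split of a single item set. The helper name `candidate_rules` and the items `A`, `B`, `C` are illustrative assumptions, not part of the case-study code.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split of `itemset` with both sides non-empty."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            # Everything not in the antecedent becomes the consequent
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

# Hypothetical three-item set, used only for illustration
toy_rules = list(candidate_rules({"A", "B", "C"}))
for antecedent, consequent in toy_rules:
    print("If", antecedent, "Then", consequent)
print(len(toy_rules), "candidate rules")
```

An item set with n items yields 2^n - 2 such splits, which is why the candidate rules need to be filtered with the measures that follow.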
    

*Performance Measure 1:* **Support**
* Consider only combinations that occur with high frequency in the database
* Support is the % (or number) of transactions in which both the antecedent (IF) and the consequent (THEN) appear
* The problem with evaluating frequency for every rule is that the number of possible rules grows exponentially with the number of distinct items.
* The Apriori algorithm helps with this problem.
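
As a minimal sketch of the support calculation, assuming a small made-up list of transactions (the item names and the `support` helper are hypothetical and not taken from `cosmetics.csv`):

```python
# Toy transactions (hypothetical data, not from the cosmetics file)
transactions = [
    {"lipstick", "foundation"},
    {"lipstick", "concealer"},
    {"lipstick", "foundation", "concealer"},
    {"foundation"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

support_lipstick = support({"lipstick"}, transactions)            # 3 of 4 transactions
support_pair = support({"lipstick", "foundation"}, transactions)  # 2 of 4 transactions
print(support_lipstick, support_pair)
```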

### Apriori algorithm

For k products…
* Set minimum support criterion
* Generate list of one-item sets that meet the support criterion
* Use list of one-item sets to generate list of two-item sets that meet support criterion
* Use list of two-item sets to generate list of three-item sets that meet support criterion 
* Continue up through k-item sets
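
The level-wise steps above can be sketched in plain Python. This is an illustrative toy version, not the `mlxtend` implementation used in this notebook; the function name `apriori_sketch` and the sample transactions are assumptions.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all item sets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent one-item sets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-item sets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical toy data, used only for illustration
toy_transactions = [
    {"lipstick", "foundation"},
    {"lipstick", "concealer"},
    {"lipstick", "foundation", "concealer"},
    {"foundation"},
]
toy_frequent = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, sup in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what saves work: a candidate is dropped without counting its support whenever one of its subsets is already known to be infrequent.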

In [4]:
import sys
sys.path.append('/Users/saurabhagrawal/anaconda3/envs/tapchief/lib/python3.6/site-packages')

In [5]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

In [6]:
freq_items = apriori(df.iloc[:,2:], min_support=0.1, use_colnames=True, verbose=1)
freq_items[10:]

Processing 2 combinations | Sampling itemset size 2


Unnamed: 0,support,itemsets


*Performance Measure 2:* **Confidence** : P(Consequent | Antecedent)
* Confidence: % of antecedent (IF) transactions that also have the consequent (THEN) item set
* Same as P(Consequent | Antecedent) = P(C & A)/P(A)
* Equivalently: number of transactions with both antecedent and consequent item sets / number of transactions with the antecedent item set
* In the above dataset: What is the confidence for “if white then blue”? : 4/8
* Problem with confidence: if the antecedent and/or consequent have high support, confidence can be high even when the two are independent, so a high value alone can be misleading.
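
A small worked example of the confidence formula, again on made-up transactions (the item names and helper functions are hypothetical). It also shows that confidence is not symmetric: swapping antecedent and consequent generally changes the value.

```python
# Toy transactions (hypothetical data, not from the cosmetics file)
transactions = [
    {"lipstick", "foundation"},
    {"lipstick", "concealer"},
    {"lipstick", "foundation", "concealer"},
    {"foundation"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(antecedent and consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

conf_c_to_l = confidence({"concealer"}, {"lipstick"})  # 0.5 / 0.5
conf_l_to_c = confidence({"lipstick"}, {"concealer"})  # 0.5 / 0.75
print(conf_c_to_l, round(conf_l_to_c, 3))
```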

In [7]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.1) 
rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(0),(1),1.0,0.957,0.957,0.957,1.0,0.0,1.0
1,(1),(0),0.957,1.0,0.957,1.0,1.0,0.0,inf


*Performance Measure 3:* **Lift Ratio** : confidence/(benchmark confidence)
* The benchmark assumes independence between antecedent and consequent
    * P(antecedent & consequent) = P(antecedent) x P(consequent)
    * Under independence: P(C|A) = P(C & A) / P(A) = P(C) x P(A) / P(A) = P(C)
    * Benchmark confidence = number of transactions with consequent item sets / number of transactions in database
* Lift = confidence / P(C)
* Lift > 1 indicates a rule that is useful in finding consequent item sets (i.e., more useful than selecting transactions randomly)
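
As a quick check of the formula, the snippet below recomputes confidence and lift from the support values reported in the rules table for this dataset (antecedent support 1.0, consequent support 0.957, joint support 0.957 for the rule (0) → (1)). The variable names are illustrative.

```python
# Support values copied from the rules table for the rule (0) -> (1)
antecedent_support = 1.0
consequent_support = 0.957
joint_support = 0.957

confidence = joint_support / antecedent_support  # P(C | A)
benchmark_confidence = consequent_support        # P(C), assuming independence
lift = confidence / benchmark_confidence

print(f"confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift of 1.0 means the rule does no better than the independence benchmark, which matches the `lift` column in the table.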

### Interpretations
* Lift ratio shows how effective the rule is in finding consequents vs. random (useful if finding particular consequent is important)
* Confidence shows the rate at which consequents will be found (useful in learning costs of promotion)
* Support measures overall impact (% of transactions affected)

### Process of selecting rules
* Generate all rules that meet the specified support and confidence thresholds:
    * First, find frequent item sets (those with sufficient support)
    * Then, from these item sets, generate rules with sufficient confidence

In [8]:
# with min support of 0.1 and min confidence of 0.7
freq_items = apriori(df.iloc[:,2:], min_support=0.1, use_colnames=True, verbose=1)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.7) 
rules

Processing 2 combinations | Sampling itemset size 2


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(0),(1),1.0,0.957,0.957,0.957,1.0,0.0,1.0
1,(1),(0),0.957,1.0,0.957,1.0,1.0,0.0,inf


## ECLAT algorithm

* Stands for: Equivalence Class Clustering and bottom-up Lattice Traversal.
* More scalable than the Apriori algorithm for finding frequent item sets
* It works on the vertical representation of the transaction dataset
* Transaction IDs are recorded against each item, instead of items against each transaction as in the Apriori algorithm
* Depth-first search is performed on this dataset
    * The item sets are arranged in a tree so that item sets which never occur together, or whose support is below the minimum support, are not expanded further
    * The support of an item set is computed from the intersection of its items' transaction ID sets.
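
A minimal sketch of the vertical representation and the intersection step, using a few made-up transactions (the IDs `t1` to `t4` and the item names are hypothetical):

```python
from collections import defaultdict

# Horizontal layout: transaction ID -> items (hypothetical toy data)
transactions = {
    "t1": {"lipstick", "foundation"},
    "t2": {"lipstick", "concealer"},
    "t3": {"lipstick", "foundation", "concealer"},
    "t4": {"foundation"},
}

# Vertical layout: item -> set of transaction IDs containing it
tidlists = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tidlists[item].add(tid)

# Support of {lipstick, foundation} comes from intersecting the two ID sets
pair_tids = tidlists["lipstick"] & tidlists["foundation"]
pair_support = len(pair_tids) / len(transactions)
print(sorted(pair_tids), pair_support)
```

Because the intersection already carries the matching transaction IDs, extending the pair with a third item only needs one more set intersection rather than another pass over the full dataset.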

In [9]:
from collections import defaultdict
data = read_data(COSMATICS_DATASET)
item2trans_dict = defaultdict(set)
for trans_id,items in data:
    for item in items:
        item2trans_dict[item].add(trans_id)
df = pd.DataFrame(item2trans_dict.items(), columns =['item','transaction_ids'])
df

Unnamed: 0,item,transaction_ids
0,0,"{218, 988, 823, 35, 274, 149, 290, 360, 377, 4..."
1,1,"{218, 988, 823, 35, 274, 149, 290, 360, 377, 4..."


In [10]:
freq_items = {}
dict_id = 0
min_support = 0.10
max_depth = 16
n_transations = len(data)
def eclat(tree_prefix, items, dict_id):
    for item,transaction_ids in items:
        curr_support = len(transaction_ids)
        if curr_support/n_transations >= min_support:
            freq_items[frozenset(tree_prefix + [item])] = curr_support/n_transations
            tree_suffix = []
            for suffix_item, suffix_transaction_ids in items:
                curr_transaction_ids = transaction_ids & suffix_transaction_ids
                # compare the support fraction (not the raw count) to min_support
                if len(curr_transaction_ids)/n_transations >= min_support:
                    tree_suffix.append((suffix_item, curr_transaction_ids))
            dict_id += 1
            if dict_id<=max_depth:
                eclat(tree_prefix + [item], sorted(tree_suffix, key=lambda x: len(x[1]), reverse=True), dict_id)
item2trans = [(item,trans) for item,trans in item2trans_dict.items()]
item2trans = sorted(item2trans,key = lambda x: len(x[1]), reverse=True)
eclat([],item2trans,dict_id)
freq_items =  [(support,set(item)) for item,support in freq_items.items()]
freq_items = sorted(freq_items,key=lambda x: x[0], reverse=True)
freq_items = pd.DataFrame(freq_items,columns=['support','itemsets'])
freq_items

Unnamed: 0,support,itemsets
0,1.0,{0}
1,0.957,"{0, 1}"
2,0.957,{1}


In [11]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.1) 
rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(0),(1),1.0,0.957,0.957,0.957,1.0,0.0,1.0
1,(1),(0),0.957,1.0,0.957,1.0,1.0,0.0,inf


### Summary
* Association rules (affinity analysis, market basket analysis) produce rules on associations between items from a database of transactions/events
* The most popular method to enumerate rules is the Apriori algorithm
* To reduce computation, we consider only “frequent” item sets (those with sufficient support)
* Performance is measured by confidence and lift ratio
* Can produce a profusion of rules; review is required to identify useful rules and to reduce redundancy