# <span style="font-size:40px;"><center>CUSTOMER INVOICES ANALYSIS</center> </span>

The Apriori algorithm finds itemsets that occur frequently in transactional data; these itemsets are then used to derive association rules.

Association rules are "if-then" statements that show how likely it is for data items to occur together in large data sets, across various types of databases.
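
To make these statements measurable, three numbers are commonly used: support (how often the items appear together), confidence (how often the rule holds when its "if" part is present) and lift (how much more often the items appear together than expected by chance). Here is a minimal sketch on a made-up list of transactions. The data and the helper names `support`, `confidence` and `lift` are assumptions for illustration only, not part of the invoice dataset.

```python
# Toy transactions (hypothetical data, not taken from the invoices)
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(support({"bread", "butter"}, transactions))                 # 0.4
print(round(confidence({"bread"}, {"butter"}, transactions), 3))  # 0.667
print(round(lift({"bread"}, {"butter"}, transactions), 3))        # 1.111
```

A lift above 1 suggests that the items appear together more often than they would if they were bought independently.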

It is especially useful for transaction analysis, when you have tons of invoices. Analysing the invoices can help you answer some important business questions: which items should be included in a bundle? Which items should be placed near each other? What should you offer the buyer?

In [1]:
import numpy as np 
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd 
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9


In [2]:
df = pd.read_excel("../input/customer-segmentation-dataset/Online Retail.xlsx", usecols = "A, C:D")

In [3]:
# Stripping extra spaces in the description
df['Description'] = df['Description'].str.strip()
 
# Dropping the rows without any invoice number
df.dropna(axis = 0, subset =['InvoiceNo'], inplace = True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
 
# Dropping cancelled transactions (in this dataset a 'C' in the invoice number marks a cancellation)
df = df[~df['InvoiceNo'].str.contains('C')]

In [4]:
df["Description"].value_counts()

WHITE HANGING HEART T-LIGHT HOLDER    2327
JUMBO BAG RED RETROSPOT               2115
REGENCY CAKESTAND 3 TIER              2019
PARTY BUNTING                         1707
LUNCH BAG RED RETROSPOT               1594
                                      ... 
FLOWER FAIRY 5 SUMMER DRAW LINERS        1
website fixed                            1
Found in w/hse                           1
OOPS ! adjustment                        1
PAPER CRAFT , LITTLE BIRDIE              1
Name: Description, Length: 4194, dtype: int64

Let's drop the rarest items (100 rows or fewer) and the most common ones (1,000 rows or more) to keep the dataset manageable and to surface more interesting relations.

In [5]:
df = df[df.groupby('Description').Description.transform('count')>100]
df = df[df.groupby('Description').Description.transform('count')<1000]

In [6]:
# Changing format of dataframe
basket = df.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

In [7]:
# Some descriptions are unwanted notes written in lowercase; drop those columns
basket.drop(list(basket.filter(regex="[a-z]")), axis=1, inplace=True)

In [8]:
basket.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YOU'RE CONFUSING ME METAL SIGN,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The Apriori algorithm relies on item counts, so it's important to make sure there are no duplicate or unnecessary items!

Another option is to extract entities from the item names to reduce the number of columns, if the brand or type of an item matters more than the exact product.
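
As a rough sketch of that idea, the helper below maps a full description to a coarser item type using a keyword list. The `ITEM_TYPES` list and the `item_type` function are hypothetical: the keywords are picked from descriptions that appear in this dataset, and you would choose your own.

```python
# Hypothetical keyword list: pick the item types that matter for your analysis
ITEM_TYPES = ["CHARLOTTE BAG", "JUMBO BAG", "LUNCH BAG", "HOT WATER BOTTLE"]


def item_type(description, item_types=ITEM_TYPES):
    """Return the first matching item type, or the description unchanged."""
    if not isinstance(description, str):
        return description  # leave missing values untouched
    for keyword in item_types:
        if keyword in description:
            return keyword
    return description


print(item_type("WOODLAND CHARLOTTE BAG"))  # CHARLOTTE BAG
print(item_type("JUMBO BAG APPLES"))        # JUMBO BAG
print(item_type("PARTY BUNTING"))           # PARTY BUNTING
```

It could be applied before building the basket, for example with `df['Description'].map(item_type)`, so that the quantities of all matching items end up summed under one column.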

In [9]:
def hot_encode(x):
    # 1 if at least one unit of the item is on the invoice, otherwise 0
    return 1 if x >= 1 else 0

The `apriori` function from the mlxtend library expects one-hot encoded data.

In [10]:
# Encoding the dataset
basket_encoded = basket.applymap(hot_encode)

In [11]:
# Building the model
frq_items = apriori(basket_encoded, min_support = 0.02, use_colnames = True)
 
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'])
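
To give an idea of what `min_support` controls, here is a minimal pure-Python sketch of the level-wise search that Apriori is known for: count single items, keep the frequent ones, then build larger candidates only from itemsets that were already frequent. This is an illustration on toy data, not mlxtend's actual implementation; `apriori_sketch` and the transactions are made up for the example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count each candidate and keep those that reach the threshold
        result = {}
        for candidate in candidates:
            share = sum(candidate <= t for t in transactions) / n
            if share >= min_support:
                result[candidate] = share
        return result

    # Level 1: single items
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    all_frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent = apriori_sketch(toy, min_support=0.5)
print(len(frequent))  # 6: three single items and three pairs
```

The triple `{A, B, C}` appears in only 2 of 5 toy transactions, so it falls below the threshold and is dropped.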

Each rule consists of an antecedent and a consequent: if the antecedent items are on an invoice, the consequent items are likely to be on it too.

In [12]:
rules = pd.DataFrame(rules)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(JUMBO BAG APPLES),(JUMBO BAG ALPHABET),0.050324,0.046619,0.020119,0.399796,8.575746,0.017773,1.588426
16,(JUMBO BAG APPLES),(JUMBO BAG PEARS),0.050324,0.030616,0.020891,0.415133,13.559148,0.01935,1.657443
15,(JUMBO BAG ALPHABET),(JUMBO BAG APPLES),0.046619,0.050324,0.020119,0.431567,8.575746,0.017773,1.670692
8,(CHOCOLATE HOT WATER BOTTLE),(HOT WATER BOTTLE I AM SO POORLY),0.044149,0.033807,0.020171,0.456876,13.514364,0.018678,1.778957
0,(CHARLOTTE BAG SUKI DESIGN),(CHARLOTTE BAG PINK POLKADOT),0.045384,0.038232,0.021097,0.464853,12.158742,0.019362,1.797202
4,(CHARLOTTE BAG SUKI DESIGN),(STRAWBERRY CHARLOTTE BAG),0.045384,0.037151,0.021354,0.470522,12.66498,0.019668,1.818485
2,(WOODLAND CHARLOTTE BAG),(CHARLOTTE BAG PINK POLKADOT),0.042966,0.038232,0.020325,0.473054,12.373256,0.018683,1.825173
18,(WOODLAND CHARLOTTE BAG),(STRAWBERRY CHARLOTTE BAG),0.042966,0.037151,0.020891,0.486228,13.087737,0.019295,1.874076
6,(CHARLOTTE BAG SUKI DESIGN),(WOODLAND CHARLOTTE BAG),0.045384,0.042966,0.023464,0.517007,12.032946,0.021514,1.981465
3,(CHARLOTTE BAG PINK POLKADOT),(WOODLAND CHARLOTTE BAG),0.038232,0.042966,0.020325,0.531629,12.373256,0.018683,2.043323
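
The columns of this table are tied together by simple formulas. As a quick check, the sketch below recomputes confidence, lift, leverage and conviction for the JUMBO BAG APPLES / JUMBO BAG ALPHABET pair from the three support values shown above. The variable names are my own; the numbers are copied from the table, so small rounding differences are expected.

```python
# Values copied from the rules table (rounded to 6 decimals there)
support_apples = 0.050324    # support of JUMBO BAG APPLES
support_alphabet = 0.046619  # support of JUMBO BAG ALPHABET
support_both = 0.020119      # support of the pair

# Confidence depends on the direction of the rule
conf_apples_to_alphabet = support_both / support_apples
conf_alphabet_to_apples = support_both / support_alphabet

# Lift and leverage are the same in both directions
lift = support_both / (support_apples * support_alphabet)
leverage = support_both - support_apples * support_alphabet

# Conviction for the rule APPLES -> ALPHABET
conviction = (1 - support_alphabet) / (1 - conf_apples_to_alphabet)

print(round(conf_apples_to_alphabet, 3))  # 0.4
print(round(conf_alphabet_to_apples, 3))  # 0.432
print(round(lift, 2))                     # 8.58
print(round(leverage, 4))                 # 0.0178
print(round(conviction, 2))               # 1.59
```

This is why rows 14 and 15 share the same lift but have different confidence values.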


Maybe you want to bundle "SPACEBOY LUNCH BOX" and "DOLLY GIRL LUNCH BOX"?
You can also notice that when people buy an item in one color (a bag, for example), they often also buy a similar item in another color or design.

In [13]:
frq_items['length'] = frq_items['itemsets'].apply(lambda x: len(x))
items = pd.DataFrame(frq_items[frq_items['length']==1]["itemsets"])
items.head().style.set_properties(subset=['itemsets'], **{'width': '300px'})

Unnamed: 0,itemsets
0,frozenset({'3 STRIPEY MICE FELTCRAFT'})
1,frozenset({'4 TRADITIONAL SPINNING TOPS'})
2,frozenset({'6 RIBBONS RUSTIC CHARM'})
3,frozenset({'60 CAKE CASES DOLLY GIRL DESIGN'})
4,frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'})


Itemsets are frozensets that can contain 1, 2, 3 or more frequent items; the `length` column added above records the size of each one.
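
Because frozensets are hashable and ignore item order, they work well as dictionary keys. Here is a small sketch with supports copied from the rules table above; the `support_of` lookup is a hypothetical name used only for this example.

```python
# Supports copied from the rules table; the dictionary itself is illustrative
support_of = {
    frozenset({"JUMBO BAG APPLES"}): 0.050324,
    frozenset({"JUMBO BAG PEARS"}): 0.030616,
    frozenset({"JUMBO BAG APPLES", "JUMBO BAG PEARS"}): 0.020891,
}

# The order of items does not matter when looking up an itemset
pair = frozenset(["JUMBO BAG PEARS", "JUMBO BAG APPLES"])
print(support_of[pair])  # 0.020891

# Filtering by size works like the `length` column
pairs_only = [itemset for itemset in support_of if len(itemset) == 2]
print(len(pairs_only))   # 1
```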

The Apriori algorithm is interesting for analysing any data that forms sequences or can be grouped. It can be applied, for example, to user actions stored in logs.
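
For log data, the main preparation step is turning individual events into transactions, for example one set of actions per session. Here is a minimal sketch with made-up records; the `(session_id, action)` layout and the helper names `to_transactions` and `one_hot_rows` are assumptions.

```python
from collections import defaultdict

# Hypothetical log records: (session_id, action)
log = [
    ("s1", "login"), ("s1", "search"), ("s1", "add_to_cart"),
    ("s2", "login"), ("s2", "search"),
    ("s3", "search"), ("s3", "add_to_cart"),
]


def to_transactions(records):
    """Group (key, item) pairs into one set of items per key."""
    groups = defaultdict(set)
    for key, item in records:
        groups[key].add(item)
    return dict(groups)


def one_hot_rows(transactions):
    """Return sorted column names and one True/False row per transaction."""
    columns = sorted({item for items in transactions.values() for item in items})
    rows = {key: [col in items for col in columns] for key, items in transactions.items()}
    return columns, rows


sessions = to_transactions(log)
columns, rows = one_hot_rows(sessions)
print(columns)     # ['add_to_cart', 'login', 'search']
print(rows["s2"])  # [False, True, True]
```

The rows could then be loaded with `pd.DataFrame.from_dict(rows, orient="index", columns=columns)` and passed to `apriori` in the same way as the encoded basket.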

It can reveal some hidden rules and dependencies.