# 1.1 Frequent Itemset Generation

In [1]:
import pandas as pd
pd.set_option("max_colwidth", 150)
f = "https://github.com/cs6220/cs6220.spring2019/raw/master/data/Online%20Retail.xlsx"
df = pd.read_excel(f)
basket = (df[df["Country"] == "United Kingdom"]
.groupby(["InvoiceNo", "Description"])["Quantity"]
.sum().unstack().reset_index().fillna(0)
.set_index("InvoiceNo")) # transform transactions into baskets of items
basket_sets = (basket >= 1).astype(int)  # convert counts to 0/1 indicators

In [2]:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(basket_sets, min_support=0.027, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)  # number of items in each itemset

In [3]:
basket_sets.head()

InvoiceNo,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
frequent_one_itemset = frequent_itemsets[frequent_itemsets['length'] == 1].sort_values('support', ascending=False).head(5)

In [5]:
frequent_two_itemset = frequent_itemsets[frequent_itemsets['length'] == 2].sort_values('support', ascending=False).head(5)

In [6]:
frequent_one_itemset

Unnamed: 0,support,itemsets,length
104,0.098276,(WHITE HANGING HEART T-LIGHT HOLDER),1
44,0.087931,(JUMBO BAG RED RETROSPOT),1
84,0.076452,(REGENCY CAKESTAND 3 TIER),1
74,0.072323,(PARTY BUNTING),1
61,0.063158,(LUNCH BAG RED RETROSPOT),1


In [7]:
frequent_two_itemset

Unnamed: 0,support,itemsets,length
110,0.035617,"(JUMBO BAG PINK POLKADOT, JUMBO BAG RED RETROSPOT)",2
109,0.031806,"(GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER )",2
112,0.03167,"(JUMBO STORAGE BAG SUKI, JUMBO BAG RED RETROSPOT)",2
111,0.029809,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG RED RETROSPOT)",2
113,0.027541,"(LUNCH BAG BLACK SKULL., LUNCH BAG RED RETROSPOT)",2


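The two tables above list the five 1-itemsets and the five 2-itemsets with the highest support. The highest support is 0.098276 among the 1-itemsets (WHITE HANGING HEART T-LIGHT HOLDER) and 0.035617 among the 2-itemsets (JUMBO BAG PINK POLKADOT, JUMBO BAG RED RETROSPOT).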
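
Support can be reproduced by hand on a small example. The sketch below uses a hypothetical four-basket toy dataset (not the Online Retail data) and a helper named `support`, both introduced here only for illustration. It computes the fraction of baskets that contain every item of an itemset, which is the quantity compared against `min_support`.

```python
# Toy baskets (hypothetical data, for illustration only).
baskets = [
    {"JUMBO BAG RED RETROSPOT", "JUMBO BAG PINK POLKADOT"},
    {"JUMBO BAG RED RETROSPOT", "PARTY BUNTING"},
    {"JUMBO BAG RED RETROSPOT", "JUMBO BAG PINK POLKADOT", "PARTY BUNTING"},
    {"REGENCY CAKESTAND 3 TIER"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


one_item = support({"JUMBO BAG RED RETROSPOT"}, baskets)
two_items = support({"JUMBO BAG RED RETROSPOT", "JUMBO BAG PINK POLKADOT"}, baskets)
print(one_item, two_items)
```

Adding an item to an itemset can never increase its support, which is consistent with the best 2-itemset having a lower support than the best 1-itemset.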

# 1.2 Association Rule Generation

In [8]:
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3).sort_values(by='lift', ascending=False).head(5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.042377,0.043421,0.031806,0.750535,17.285056,0.029966,3.834527
1,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.043421,0.042377,0.031806,0.732497,17.285056,0.029966,3.579862
8,(LUNCH BAG BLACK SKULL.),(LUNCH BAG RED RETROSPOT),0.055172,0.063158,0.027541,0.499178,7.903646,0.024056,1.870608
9,(LUNCH BAG RED RETROSPOT),(LUNCH BAG BLACK SKULL.),0.063158,0.055172,0.027541,0.436063,7.903646,0.024056,1.675414
2,(JUMBO BAG PINK POLKADOT),(JUMBO BAG RED RETROSPOT),0.052586,0.087931,0.035617,0.677308,7.702719,0.030993,2.826438


The table above lists the top five association rules ranked by lift, among rules with a confidence of at least 0.3.

What items make up one of the top association rules? Search online for the
items (or at least items with the same name). Do you think they are likely to be bought
together?
I searched for all the items online and found that, in each rule, the antecedent and the consequent are variants of the same product. For example, "GREEN REGENCY TEACUP AND SAUCER" and "ROSES REGENCY TEACUP AND SAUCER" are the same item in a different color, so they tend to be sold together (strong co-occurrence, with a lift of about 17.3). I believe they are likely to be bought together.
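
The metric columns in the rules table can be recomputed from the three support values alone. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the helper name `rule_metrics` is an assumption made for illustration, and the inputs are the rounded supports of the GREEN -> ROSES teacup rule from the table, so the results agree with the table only up to rounding.

```python
def rule_metrics(support_ab, support_a, support_b):
    """Metrics of the rule A -> B from supp(A and B), supp(A), supp(B)."""
    confidence = support_ab / support_a            # estimate of P(B | A)
    lift = confidence / support_b                  # P(B | A) / P(B)
    leverage = support_ab - support_a * support_b  # observed minus expected joint support
    conviction = (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# (GREEN REGENCY TEACUP AND SAUCER) -> (ROSES REGENCY TEACUP AND SAUCER)
m = rule_metrics(support_ab=0.031806, support_a=0.042377, support_b=0.043421)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```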

# 2.1 Association Rule Mining

In [9]:
import numpy as np
import pandas as pd
path = "https://raw.githubusercontent.com/cs6220/cs6220.spring2019/master/data/adult/"
names = pd.read_csv(path + "adult.names", sep="\n", header=None)
parse_cols = lambda x: x.str.split(":", expand=True).iloc[:, 0]
columns = np.roll(parse_cols(names.iloc[92:108, 0]), shift=-1)
df_adult = pd.read_csv(path + "adult.data", sep=",", header=None, index_col=False)
df_adult.columns = columns

In [10]:
df_adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,">50K, <=50K."
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# 2.1.1

In [11]:
df_categorical = df_adult.select_dtypes(include='object')

In [12]:
df_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,">50K, <=50K."
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [13]:
import patsy

In [14]:
workclass_data = patsy.dmatrix('workclass -1', df_categorical, return_type='dataframe')

In [15]:
education_data = patsy.dmatrix('education -1', df_categorical, return_type='dataframe')

In [16]:
marital_status_data = patsy.dmatrix("Q('marital-status') -1", df_categorical, return_type='dataframe')

In [17]:
occupation_data = patsy.dmatrix("occupation -1", df_categorical, return_type='dataframe')

In [18]:
relationship_data = patsy.dmatrix("relationship -1", df_categorical, return_type='dataframe')

In [19]:
race_data = patsy.dmatrix("race -1", df_categorical, return_type='dataframe')

In [20]:
sex_data = patsy.dmatrix("sex -1", df_categorical, return_type='dataframe')

In [21]:
native_country_data = patsy.dmatrix("Q('native-country') -1", df_categorical, return_type='dataframe')

In [22]:
income_data = patsy.dmatrix("Q('>50K, <=50K.') -1", df_categorical, return_type='dataframe')

In [23]:
transform_data = pd.concat([workclass_data, education_data, marital_status_data, occupation_data, relationship_data,
          race_data, sex_data, native_country_data, income_data], axis = 1)

In [24]:
transform_data.head()

Unnamed: 0,workclass[ ?],workclass[ Federal-gov],workclass[ Local-gov],workclass[ Never-worked],workclass[ Private],workclass[ Self-emp-inc],workclass[ Self-emp-not-inc],workclass[ State-gov],workclass[ Without-pay],education[ 10th],...,Q('native-country')[ Scotland],Q('native-country')[ South],Q('native-country')[ Taiwan],Q('native-country')[ Thailand],Q('native-country')[ Trinadad&Tobago],Q('native-country')[ United-States],Q('native-country')[ Vietnam],Q('native-country')[ Yugoslavia],"Q('>50K, <=50K.')[ <=50K]","Q('>50K, <=50K.')[ >50K]"
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


# 2.2.2

In [25]:
frequent_itemsets_adult = apriori(transform_data, min_support=0.4, use_colnames=True)

In [26]:
frequent_itemsets_adult

Unnamed: 0,support,itemsets
0,0.69703,(workclass[ Private])
1,0.459937,(Q('marital-status')[ Married-civ-spouse])
2,0.405178,(relationship[ Husband])
3,0.854274,(race[ White])
4,0.669205,(sex[ Male])
5,0.895857,(Q('native-country')[ United-States])
6,0.75919,"(Q('>50K, <=50K.')[ <=50K])"
7,0.595928,"(workclass[ Private], race[ White])"
8,0.458954,"(workclass[ Private], sex[ Male])"
9,0.618378,"(workclass[ Private], Q('native-country')[ United-States])"


In [27]:
rules_confidence = association_rules(frequent_itemsets_adult, metric="confidence", min_threshold=0.5)

In [28]:
rules_confidence['length'] = rules_confidence['antecedents'].apply(lambda x: len(x))+ rules_confidence['consequents'].apply(lambda x: len(x))

In [29]:
rules_confidence[(rules_confidence["confidence"] > 0.7) & (rules_confidence["length"] >= 4)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
84,"(workclass[ Private], Q('>50K, <=50K.')[ <=50K], Q('native-country')[ United-States])",(race[ White]),0.478916,0.854274,0.413132,0.862639,1.009793,0.004007,1.060905,4
85,"(race[ White], workclass[ Private], Q('native-country')[ United-States])","(Q('>50K, <=50K.')[ <=50K])",0.544455,0.75919,0.413132,0.7588,0.999485,-0.000213,0.99838,4
86,"(workclass[ Private], Q('>50K, <=50K.')[ <=50K], race[ White])",(Q('native-country')[ United-States]),0.456743,0.895857,0.413132,0.904519,1.009668,0.003956,1.090715,4
87,"(race[ White], Q('>50K, <=50K.')[ <=50K], Q('native-country')[ United-States])",(workclass[ Private]),0.580971,0.69703,0.413132,0.711106,1.020195,0.008178,1.048725,4
89,"(workclass[ Private], Q('>50K, <=50K.')[ <=50K])","(race[ White], Q('native-country')[ United-States])",0.544609,0.786862,0.413132,0.758586,0.964065,-0.015399,0.882874,4


1. { race[White], native-country[United-States], <=50K } -> { workclass[Private] }
2. { workclass[Private], race[White], <=50K } -> { native-country[United-States] }
3. { race[White], native-country[United-States], workclass[Private] } -> { <=50K }
4. { workclass[Private], native-country[United-States], <=50K } -> { race[White] }
5. { workclass[Private], <=50K } -> { race[White], native-country[United-States] }

The rules are generated with the confidence metric (min_threshold 0.5) from frequent itemsets found with min_support 0.4.
The five rules I picked above all have a confidence measure above 0.7, which means that at least 70% of the records that contain the antecedent also contain the consequent.

About 71% (confidence) of white people from the U.S. with an income of 50K or less work in the private sector.

About 90% (confidence) of white people who work in the private sector with an income of 50K or less are from the U.S.

About 76% (confidence) of white people from the U.S. who work in the private sector earn 50K or less.

About 86% (confidence) of people from the U.S. who work in the private sector with an income of 50K or less are white.

About 76% (confidence) of people who work in the private sector with an income of 50K or less are white people from the U.S.

Based on these rules, being white, being from the U.S., working in the private sector, and earning 50K or less co-occur frequently: each rule holds with more than 70% confidence. The lift of every one of these rules is close to 1, however, so the high confidence largely reflects how common each attribute is in the data rather than a strong dependence between them.
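
One way to check how much each antecedent actually shifts the probability of its consequent is to compare the rule's confidence with the consequent's baseline support. The sketch below does this for the five rules, with values copied from the table above; the shortened rule labels and the variable names are illustrative assumptions.

```python
# (label, confidence, consequent support) copied from the rules table above.
rules = [
    ("{White, United-States, <=50K} -> {Private}", 0.711106, 0.697030),
    ("{Private, White, <=50K} -> {United-States}", 0.904519, 0.895857),
    ("{White, United-States, Private} -> {<=50K}", 0.758800, 0.759190),
    ("{Private, United-States, <=50K} -> {White}", 0.862639, 0.854274),
    ("{Private, <=50K} -> {White, United-States}", 0.758586, 0.786862),
]

lifts = {}
for label, confidence, baseline in rules:
    # Lift is the ratio of the conditional probability to the baseline probability.
    lifts[label] = confidence / baseline
    print(f"{label}: confidence={confidence:.3f} baseline={baseline:.3f} lift={lifts[label]:.3f}")
```

In every case the confidence is within a few percentage points of the baseline, so knowing the antecedent changes the estimate of the consequent very little.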

# 2.2.3

In [30]:
rules_lift = association_rules(frequent_itemsets_adult, metric="lift", min_threshold=0.5)

In [31]:
rules_lift['length'] = rules_lift['antecedents'].apply(lambda x: len(x))+ rules_lift['consequents'].apply(lambda x: len(x))

In [32]:
rules_lift[(rules_lift['lift'] > 1.4) & (rules_lift['length'] == 3)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
60,"(Q('marital-status')[ Married-civ-spouse], relationship[ Husband])",(sex[ Male]),0.404902,0.669205,0.404871,0.999924,1.494196,0.133909,4361.194804,3
61,"(Q('marital-status')[ Married-civ-spouse], sex[ Male])",(relationship[ Husband]),0.409048,0.405178,0.404871,0.989789,2.44285,0.239134,58.253195,3
62,"(relationship[ Husband], sex[ Male])",(Q('marital-status')[ Married-civ-spouse]),0.405147,0.459937,0.404871,0.999318,2.172729,0.218529,791.612734,3
63,(Q('marital-status')[ Married-civ-spouse]),"(relationship[ Husband], sex[ Male])",0.459937,0.405147,0.404871,0.880275,2.172729,0.218529,4.968497,3
64,(relationship[ Husband]),"(Q('marital-status')[ Married-civ-spouse], sex[ Male])",0.405178,0.409048,0.404871,0.999242,2.44285,0.239134,779.643457,3
65,(sex[ Male]),"(Q('marital-status')[ Married-civ-spouse], relationship[ Husband])",0.669205,0.404902,0.404871,0.605002,1.494196,0.133909,1.506587,3


1. { marital-status[Married-civ-spouse], relationship[Husband] } -> { sex[Male] }
2. { marital-status[Married-civ-spouse], sex[Male] } ->	{ relationship[Husband] }
3. { relationship[Husband], sex[Male] } -> { marital-status [Married-civ-spouse] }
4. { marital-status[Married-civ-spouse] } -> { relationship[Husband], sex[Male] }
5. { relationship[Husband] } -> { marital-status[Married-civ-spouse], sex[ Male] }

The rules are generated with the lift metric (min_threshold 0.5) from frequent itemsets found with min_support 0.4.
The lift metric measures how much more often the antecedent and consequent of a rule occur together than we would expect if they were statistically independent; a lift greater than 1 indicates that they occur together more often than expected, i.e. they are positively correlated.

The five rules I picked above have a lift above 1.4 (roughly 1.5 to 2.4), which suggests that the antecedent and consequent of each rule occur together noticeably more often than expected, i.e. they are positively correlated. All five rules come from the same 3-itemset, so they share the same support of about 40%.

About 40% (support) of the records are married (Married-civ-spouse) husbands who are male, and more than 99.9% (confidence) of the married husbands are male (lift 1.5).

About 40% (support) of the records are married males who are husbands, and about 99% (confidence) of the married males are husbands (lift 2.4).

About 40% (support) of the records are male husbands who are married, and about 99.9% (confidence) of the male husbands are married (lift 2.2).

About 40% (support) of the records are married people who are male husbands, and about 88% (confidence) of the married people are male husbands (lift 2.2).

About 40% (support) of the records are husbands who are male and married, and about 99.9% (confidence) of the husbands are male and married (lift 2.4).

Based on these rules, Husband, Married-civ-spouse, and Male occur together much more often than expected.
Some of these rules are intuitive: for example, a husband is almost always male, and a married male is almost always recorded as a husband. The high lift values show that these antecedents and consequents are positively correlated and often occur together, which confirms the intuition.
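
Unlike confidence, lift does not depend on the direction of a rule. The sketch below illustrates this with the supports of rules 60 and 65 from the table above (A = {Married-civ-spouse, Husband}, B = {Male}); the variable names are illustrative, and the inputs are rounded, so the results match the table only approximately.

```python
support_ab = 0.404871  # {Married-civ-spouse, Husband, Male}
support_a = 0.404902   # {Married-civ-spouse, Husband}
support_b = 0.669205   # {Male}

# Confidence depends on which side is the antecedent.
conf_a_to_b = support_ab / support_a
conf_b_to_a = support_ab / support_b

# Lift divides by the support of the other side, so both directions agree.
lift_a_to_b = conf_a_to_b / support_b
lift_b_to_a = conf_b_to_a / support_a

print(f"A -> B: confidence={conf_a_to_b:.4f} lift={lift_a_to_b:.4f}")
print(f"B -> A: confidence={conf_b_to_a:.4f} lift={lift_b_to_a:.4f}")
```

This is why the six rows of the table fall into three pairs with identical lift but different confidence.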

# 2.2.4

In [33]:
rules_confidence[(rules_confidence["confidence"] > 0.7) & (rules_confidence["length"] >= 4)][1:2] #use confidence metric

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
85,"(race[ White], workclass[ Private], Q('native-country')[ United-States])","(Q('>50K, <=50K.')[ <=50K])",0.544455,0.75919,0.413132,0.7588,0.999485,-0.000213,0.99838,4


{ race[White], native-country[United-States], workclass[Private] } -> { <=50K }

About 76% of white people from the U.S. who work in the private sector earn 50K or less.

This is one of the top rules by the confidence metric. The rule has a fairly high confidence of about 0.76, but its lift (0.9995) is very slightly below 1, which indicates that the antecedent and consequent occur together almost exactly as often as expected under independence (neither positively nor negatively correlated). On the other hand, the support is about 0.41, which is the fraction of records that contain both the antecedent and the consequent. Since 0.41 exceeds the min_support threshold of 0.4 used when generating the frequent itemsets, the itemset underlying this rule is frequent.

In [34]:
rules_lift[(rules_lift['lift'] > 1.4) & (rules_lift['length'] == 3)][3:4] #use lift metric

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
63,(Q('marital-status')[ Married-civ-spouse]),"(relationship[ Husband], sex[ Male])",0.459937,0.405147,0.404871,0.880275,2.172729,0.218529,4.968497,3


{ marital-status[Married-civ-spouse] } -> { relationship[Husband], sex[Male] }

About 40.5% (support) of the records are married (Married-civ-spouse) people who are male husbands, and about 88% (confidence) of the married people are male husbands (lift 2.2).

This is one of the top rules by the lift metric. The rule has a fairly high confidence of about 0.88, and its high lift of about 2.2 indicates that the antecedent and consequent occur together more often than expected under independence (positively correlated). On the other hand, the support is about 0.405, which is the fraction of records that contain both the antecedent and the consequent. Since 0.405 exceeds the min_support threshold of 0.4 used when generating the frequent itemsets, the itemset underlying this rule is frequent.
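
The contrast between the two selected rules can also be read off the leverage column, which is the observed joint support minus the joint support expected under independence. The sketch below recomputes it from the rounded supports in the two tables above; the helper name `leverage` and the variable names are assumptions made for illustration.

```python
MIN_SUPPORT = 0.4  # threshold used when generating the frequent itemsets


def leverage(support_ab, support_a, support_b):
    """Observed joint support minus the support expected if A and B were independent."""
    return support_ab - support_a * support_b


# Rule 85: {White, Private, United-States} -> {<=50K}
confidence_rule = leverage(0.413132, 0.544455, 0.759190)
# Rule 63: {Married-civ-spouse} -> {Husband, Male}
lift_rule = leverage(0.404871, 0.459937, 0.405147)

print(f"rule 85 leverage: {confidence_rule:.6f}")  # close to zero
print(f"rule 63 leverage: {lift_rule:.6f}")        # clearly positive
print(0.413132 >= MIN_SUPPORT, 0.404871 >= MIN_SUPPORT)  # both itemsets are frequent
```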