# Problem 1:
Prepare association rules for all the data sets.
1) Try different values of support and confidence. Observe the change in the number of rules for different support and confidence values.
2) Change the minimum length in the apriori algorithm.
3) Visualize the obtained rules using different plots.

In [1]:
import pandas as pd

from mlxtend.frequent_patterns import apriori, association_rules

import matplotlib.pyplot as plt

In [2]:
# Matplotlib configurations

# Display interactive plots; used because it is convenient for displaying plots on GitHub.
%matplotlib notebook
# Font and figure size:
# Ref: https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot
SMALL_SIZE = 8
MEDIUM_SIZE = 9
BIGGER_SIZE = 12

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

In [3]:
book_df = pd.read_csv('book.csv')

In [4]:
book_df.head()

Unnamed: 0,ChildBks,YouthBks,CookBks,DoItYBks,RefBks,ArtBks,GeogBks,ItalCook,ItalAtlas,ItalArt,Florence
0,0,1,0,1,0,0,1,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,1,0,1,0,0,0,0
4,0,0,1,0,0,0,1,0,0,0,0


In [5]:
book_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   ChildBks   2000 non-null   int64
 1   YouthBks   2000 non-null   int64
 2   CookBks    2000 non-null   int64
 3   DoItYBks   2000 non-null   int64
 4   RefBks     2000 non-null   int64
 5   ArtBks     2000 non-null   int64
 6   GeogBks    2000 non-null   int64
 7   ItalCook   2000 non-null   int64
 8   ItalAtlas  2000 non-null   int64
 9   ItalArt    2000 non-null   int64
 10  Florence   2000 non-null   int64
dtypes: int64(11)
memory usage: 172.0 KB


## Observations:
- 2000 records and 11 book genres (i.e. 11 product categories).
- The features are binary, i.e. already one-hot encoded.
- There are no null values.
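
The binary-encoding and no-null observations can be checked directly rather than read off `info()`. A small sketch on a stand-in frame (`toy_df` is a hypothetical substitute for `book_df`, with invented rows). The cast to `bool` at the end is optional; to my knowledge recent `mlxtend` versions warn about non-boolean input, so it may be worth doing:

```python
import pandas as pd

# Hypothetical stand-in for book_df: three transactions over three genres.
toy_df = pd.DataFrame({
    "ChildBks": [0, 1, 1],
    "CookBks": [1, 0, 1],
    "GeogBks": [0, 0, 1],
})

# True only if every cell is 0 or 1.
is_binary = bool(toy_df.isin([0, 1]).all().all())

# Total number of missing cells across the whole frame.
n_missing = int(toy_df.isna().sum().sum())

# Boolean copy of the one-hot encoded frame.
toy_bool = toy_df.astype(bool)

print(is_binary, n_missing)
print(toy_bool.dtypes)
```

On the real data the same three expressions would be applied to `book_df`.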

In [6]:
# Column means give the support of each genre (fraction of transactions containing it); used to help choose min_support.
support_est_prod = book_df.mean(axis=0)
support_est_prod

ChildBks     0.4230
YouthBks     0.2475
CookBks      0.4310
DoItYBks     0.2820
RefBks       0.2145
ArtBks       0.2410
GeogBks      0.2760
ItalCook     0.1135
ItalAtlas    0.0370
ItalArt      0.0485
Florence     0.1085
dtype: float64

In [7]:
# Range within which the support of a single book genre falls.
min(support_est_prod), max(support_est_prod)

(0.037, 0.431)

## Observations:
- The support of each individual product lies in the range (0.037, 0.431), i.e. between 74 and 862 of the 2000 transactions.
- The support of an itemset cannot exceed the support of any of its subsets, so every pair of products has support below 0.431, and every triple has support no greater than that of the pairs it contains.
- For this problem, we try the min_support thresholds [0.025, 0.05, 0.15, 0.25] (chosen arbitrarily for now).
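
The expectation that larger itemsets have lower support follows from the definition of support. A minimal sketch with a handful of made-up baskets (the item names mirror the data set, but the transactions and the helper `itemset_support` are invented for illustration):

```python
# Toy transactions; not taken from book.csv.
transactions = [
    {"ChildBks", "CookBks"},
    {"ChildBks", "CookBks", "GeogBks"},
    {"CookBks"},
    {"ChildBks"},
    {"GeogBks"},
]

def itemset_support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

s_child = itemset_support({"ChildBks"}, transactions)
s_cook = itemset_support({"CookBks"}, transactions)
s_pair = itemset_support({"ChildBks", "CookBks"}, transactions)
s_triple = itemset_support({"ChildBks", "CookBks", "GeogBks"}, transactions)

# Each larger itemset is at most as frequent as the smaller ones it contains.
print(s_child, s_cook, s_pair, s_triple)
```

This is why the lowest thresholds in the list are needed if longer itemsets are to survive at all.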

In [8]:
support = [0.025, 0.05, 0.15, 0.25]
confidence = [0.4, 0.6] # Arbitrarily chosen for analysis.

In [9]:
# Build two dictionaries keyed by support threshold: one holding the frequent itemsets,
# the other holding one rules DataFrame per confidence threshold.
freq_set_per_suport = dict.fromkeys(support) # Frequent itemsets per support threshold.
rules_set = dict.fromkeys(support) # Association rules per support threshold.
for s_val in support:
    frequent_itemsets = apriori(book_df, min_support=s_val, use_colnames=True)
    frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len) # Number of items in each itemset.
    freq_set_per_suport[s_val] = frequent_itemsets # Store the frequent itemsets for this support value.

    rules_per_sup_val = []
    for c_val in confidence:
        rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=c_val)
        rules_per_sup_val.append(rules)
    rules_set[s_val] = rules_per_sup_val
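
To observe how the number of rules changes with the thresholds, the nested structure built above can be collapsed into a small table. A sketch, assuming a dictionary shaped like `rules_set` (support value mapped to a list of rule frames, one per confidence value); the helper `count_rules` and the toy frames are illustrative and not part of the notebook:

```python
import pandas as pd

def count_rules(rules_by_support, confidence_values):
    """Return a table of rule counts: rows = min_support, columns = min_confidence."""
    counts = {
        s_val: [len(rules) for rules in rules_list]
        for s_val, rules_list in rules_by_support.items()
    }
    table = pd.DataFrame.from_dict(counts, orient="index", columns=confidence_values)
    table.index.name = "min_support"
    table.columns.name = "min_confidence"
    return table

# Toy stand-in for rules_set: only the number of rows in each frame matters here.
toy_rules_set = {
    0.05: [pd.DataFrame({"lift": [1.2, 1.5, 1.1]}), pd.DataFrame({"lift": [1.5]})],
    0.25: [pd.DataFrame({"lift": [1.4]}), pd.DataFrame({"lift": []})],
}

rule_counts = count_rules(toy_rules_set, confidence_values=[0.4, 0.6])
print(rule_counts)
```

With the notebook's own objects the call would be `count_rules(rules_set, confidence)`.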


#### Frequent itemsets and rules for support = 0.025, confidence = 0.4


In [10]:
freq_set_per_suport[0.025].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [11]:
rules_set[0.025][0].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(CookBks),(ChildBks),0.431,0.423,0.256,0.593968,1.404179,0.073687,1.421069
2,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
3,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
4,(ChildBks),(DoItYBks),0.423,0.282,0.184,0.434988,1.542511,0.064714,1.27077
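
The metric columns can be reproduced by hand from the three support values. As a check, the sketch below recomputes the first row of the table above (YouthBks as antecedent, ChildBks as consequent) using the standard definitions of confidence, lift, leverage and conviction; the variable names are my own:

```python
# Supports read off row 0 of the rules table.
sup_a = 0.2475   # antecedent support: P(YouthBks)
sup_c = 0.423    # consequent support: P(ChildBks)
sup_ac = 0.165   # joint support: P(YouthBks and ChildBks)

confidence_ac = sup_ac / sup_a                      # P(consequent | antecedent)
lift_ac = confidence_ac / sup_c                     # > 1 indicates a positive association
leverage_ac = sup_ac - sup_a * sup_c                # joint support in excess of independence
conviction_ac = (1 - sup_c) / (1 - confidence_ac)   # grows as confidence approaches 1

print(confidence_ac, lift_ac, leverage_ac, conviction_ac)
```

The results agree with the tabulated values (0.666667, 1.576044, 0.060308, 1.731) up to rounding.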


#### Frequent itemsets and rules for support = 0.025, confidence = 0.6

In [12]:
freq_set_per_suport[0.025].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [13]:
rules_set[0.025][1].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
2,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
3,(RefBks),(ChildBks),0.2145,0.423,0.1515,0.706294,1.669725,0.060767,1.964548
4,(ArtBks),(ChildBks),0.241,0.423,0.1625,0.674274,1.594028,0.060557,1.771427


#### Frequent item sets and rules for support = 0.05, confidence = 0.4

In [14]:
freq_set_per_suport[0.05].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [15]:
rules_set[0.05][0].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(CookBks),(ChildBks),0.431,0.423,0.256,0.593968,1.404179,0.073687,1.421069
2,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
3,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
4,(ChildBks),(DoItYBks),0.423,0.282,0.184,0.434988,1.542511,0.064714,1.27077


#### Frequent itemsets and rules for support = 0.05, confidence = 0.6

In [16]:
freq_set_per_suport[0.05].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [17]:
rules_set[0.05][1].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
2,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
3,(RefBks),(ChildBks),0.2145,0.423,0.1515,0.706294,1.669725,0.060767,1.964548
4,(ArtBks),(ChildBks),0.241,0.423,0.1625,0.674274,1.594028,0.060557,1.771427


#### Frequent itemsets and rules for support = 0.15, confidence = 0.4

In [18]:
freq_set_per_suport[0.15].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [19]:
rules_set[0.15][0].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(CookBks),(ChildBks),0.431,0.423,0.256,0.593968,1.404179,0.073687,1.421069
2,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
3,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
4,(ChildBks),(DoItYBks),0.423,0.282,0.184,0.434988,1.542511,0.064714,1.27077


#### Frequent itemsets and rules for support = 0.15, confidence = 0.6

In [20]:
freq_set_per_suport[0.15].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.2475,(YouthBks),1
2,0.431,(CookBks),1
3,0.282,(DoItYBks),1
4,0.2145,(RefBks),1


In [21]:
rules_set[0.15][1].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
2,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
3,(RefBks),(ChildBks),0.2145,0.423,0.1515,0.706294,1.669725,0.060767,1.964548
4,(ArtBks),(ChildBks),0.241,0.423,0.1625,0.674274,1.594028,0.060557,1.771427


#### Frequent itemsets and rules for support = 0.25, confidence = 0.4

In [22]:
freq_set_per_suport[0.25].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.431,(CookBks),1
2,0.282,(DoItYBks),1
3,0.276,(GeogBks),1
4,0.256,"(CookBks, ChildBks)",2


In [23]:
rules_set[0.25][0].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(CookBks),(ChildBks),0.431,0.423,0.256,0.593968,1.404179,0.073687,1.421069
1,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124


#### Frequent itemsets and rules for support = 0.25, confidence = 0.6

In [24]:
freq_set_per_suport[0.25].head()

Unnamed: 0,support,itemsets,length
0,0.423,(ChildBks),1
1,0.431,(CookBks),1
2,0.282,(DoItYBks),1
3,0.276,(GeogBks),1
4,0.256,"(CookBks, ChildBks)",2


In [25]:
rules_set[0.25][1].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124


In [26]:
## Visualizing the results:
fig, ax = plt.subplots(nrows=len(support), ncols=len(confidence), figsize=(9, 21))

for i, s_val in enumerate(support):          # one row of panels per support threshold
    for j, c_val in enumerate(confidence):   # one column of panels per confidence threshold
        rules_df = rules_set[s_val][j]       # rules for this (support, confidence) pair
        ax[i, j].scatter(rules_df['support'], rules_df['confidence'], alpha=0.5)
        ax[i, j].set_xlabel('support')
        ax[i, j].set_ylabel('confidence')
        ax[i, j].set_title('min_support = {}, min_confidence = {}'.format(s_val, c_val))
plt.show()


## Filtering rows and modifying the max_len parameter:

### Filtering rows by length

In [27]:

sup_key = 0.025 # Use a support value of 0.025 for illustration.
freq_itemsets = freq_set_per_suport[sup_key]

In [28]:
freq_itemsets['length'].unique() # Find which itemset lengths are present.

array([1, 2, 3, 4, 5, 6], dtype=int64)

In [29]:
freq_itemsets[(freq_itemsets['length'] == 3) # Keep itemsets with exactly 3 items (for illustration).
            & (freq_itemsets['support'] >= 0.025)] # Already guaranteed by min_support=0.025; shows how to combine conditions.

Unnamed: 0,support,itemsets,length
55,0.1290,"(CookBks, YouthBks, ChildBks)",3
56,0.0950,"(DoItYBks, YouthBks, ChildBks)",3
57,0.0830,"(RefBks, YouthBks, ChildBks)",3
58,0.0805,"(ArtBks, YouthBks, ChildBks)",3
59,0.0990,"(YouthBks, ChildBks, GeogBks)",3
...,...,...,...
124,0.0290,"(RefBks, ItalCook, GeogBks)",3
125,0.0360,"(ArtBks, ItalCook, GeogBks)",3
126,0.0295,"(ArtBks, ItalArt, GeogBks)",3
127,0.0300,"(Florence, ArtBks, GeogBks)",3


### Changing the max_len parameter and obtaining the corresponding association rules.

In [30]:
# Same procedure as before, but with itemsets capped at three items (max_len=3).
# Both dictionaries are keyed by support threshold.
freq_set_per_suport1 = dict.fromkeys(support) # Frequent itemsets per support threshold.
rules_set1 = dict.fromkeys(support) # Association rules per support threshold.
for s_val in support:
    frequent_itemsets = apriori(book_df, min_support=s_val, use_colnames=True, max_len=3)
    frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len) # Number of items in each itemset.
    freq_set_per_suport1[s_val] = frequent_itemsets # Store the frequent itemsets for this support value.

    rules_per_sup_val = []
    for c_val in confidence:
        rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=c_val)
        rules_per_sup_val.append(rules)
    rules_set1[s_val] = rules_per_sup_val


#### Support threshold = 0.025, confidence = 0.4, max_len = 3

In [31]:
rules_set1[0.025][0].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(YouthBks),(ChildBks),0.2475,0.423,0.165,0.666667,1.576044,0.060308,1.731
1,(CookBks),(ChildBks),0.431,0.423,0.256,0.593968,1.404179,0.073687,1.421069
2,(ChildBks),(CookBks),0.423,0.431,0.256,0.605201,1.404179,0.073687,1.44124
3,(DoItYBks),(ChildBks),0.282,0.423,0.184,0.652482,1.542511,0.064714,1.660347
4,(ChildBks),(DoItYBks),0.423,0.282,0.184,0.434988,1.542511,0.064714,1.27077


In [32]:
rules_set1[0.025][0].shape # Not printing the rules here since there are lots of rows.

(272, 9)

**Note:** Capping the itemset length with max_len in the apriori algorithm yields fewer rules than before. This can also be seen in the smaller number of points in the scatter plots. As before, the association rules for any required support and confidence values can be extracted from the dictionary created above.
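
The effect of a length cap can be illustrated without `mlxtend`. The sketch below is a brute-force enumeration (not the Apriori algorithm, and not how `mlxtend` implements `max_len`) over made-up baskets; the helper `frequent_itemsets_bruteforce` is hypothetical. It shows that capping the itemset length can only remove itemsets, and hence can only remove the rules derived from them:

```python
from itertools import combinations

# Invented baskets, for illustration only.
baskets = [
    {"ChildBks", "CookBks", "GeogBks"},
    {"ChildBks", "CookBks", "GeogBks"},
    {"ChildBks", "CookBks"},
    {"CookBks", "GeogBks"},
    {"ChildBks"},
]

def frequent_itemsets_bruteforce(baskets, min_support, max_len=None):
    """Return {itemset: support} for every itemset that meets min_support."""
    items = sorted(set().union(*baskets))
    longest = len(items) if max_len is None else min(max_len, len(items))
    result = {}
    for k in range(1, longest + 1):
        for combo in combinations(items, k):
            sup = sum(set(combo) <= b for b in baskets) / len(baskets)
            if sup >= min_support:
                result[frozenset(combo)] = sup
    return result

uncapped = frequent_itemsets_bruteforce(baskets, min_support=0.3)
capped = frequent_itemsets_bruteforce(baskets, min_support=0.3, max_len=2)

# The capped result is a subset of the uncapped one.
print(len(uncapped), len(capped))
```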

In [33]:
## Visualizing the results:
fig, ax = plt.subplots(nrows=len(support), ncols=len(confidence), figsize=(9, 21))

for i, s_val in enumerate(support):          # one row of panels per support threshold
    for j, c_val in enumerate(confidence):   # one column of panels per confidence threshold
        rules_df = rules_set1[s_val][j]      # rules for this (support, confidence) pair, max_len=3
        ax[i, j].scatter(rules_df['support'], rules_df['confidence'], alpha=0.5)
        ax[i, j].set_xlabel('support')
        ax[i, j].set_ylabel('confidence')
        ax[i, j].set_title('min_support = {}, min_confidence = {}'.format(s_val, c_val))
plt.show()


## Conclusion:
Frequent itemsets and association rules were generated with the apriori algorithm for several combinations of support and confidence thresholds, and visualised using scatter plots. Building on this analysis, and with further use of metrics such as lift, a list of associated items can be extracted; for example, the best combinations of three book genres given that a customer buys a book from a particular genre.
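
As a sketch of that follow-up step, rules can be filtered and ranked by lift. The frame below copies five rows from the support = 0.025, confidence = 0.6 output shown earlier and keeps only the columns needed; the helper name `top_rules_by_lift` and the cut-off values are assumptions made for illustration:

```python
import pandas as pd

# Rows copied from the rules table for support = 0.025, confidence = 0.6.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"YouthBks"}), frozenset({"ChildBks"}),
                    frozenset({"DoItYBks"}), frozenset({"RefBks"}),
                    frozenset({"ArtBks"})],
    "consequents": [frozenset({"ChildBks"}), frozenset({"CookBks"}),
                    frozenset({"ChildBks"}), frozenset({"ChildBks"}),
                    frozenset({"ChildBks"})],
    "confidence": [0.666667, 0.605201, 0.652482, 0.706294, 0.674274],
    "lift": [1.576044, 1.404179, 1.542511, 1.669725, 1.594028],
})

def top_rules_by_lift(rules, min_lift=1.0, n=3):
    """Keep rules with lift above `min_lift` and return the `n` strongest."""
    strong = rules[rules["lift"] > min_lift]
    return strong.sort_values("lift", ascending=False).head(n)

best = top_rules_by_lift(toy_rules, min_lift=1.5, n=3)
print(best[["antecedents", "consequents", "lift"]])
```

The same function could be applied to any frame stored in `rules_set` or `rules_set1`, optionally after filtering on the length of the antecedents or consequents.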