## Kyle Calabro
## Dr. Tweneboah
## CMPS 620 - Homework Two: Part B
## 5 March 2021
-----

In [4]:
import pandas as pd

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

## Part One: Create Nested List of the Words in Each Line of the File

In [3]:
words_list = []

with open("bible.txt", "r") as bible_file:
    for line in bible_file:
        words_list.append(line.split())

## Part Two: Generating Frequent Itemsets
### a) Transform the data via the TransactionEncoder

In [7]:
te = TransactionEncoder()
te_array = te.fit(words_list).transform(words_list)

te_df = pd.DataFrame(te_array, columns = te.columns_)
te_df

|  | aaron | aaron's | aaronites | abaddon | abagtha | abana | abarim | abase | abased | abasing | ... | zorathites | zoreah | zorites | zorobabel | zuar | zuph | zur | zuriel | zurishaddai | zuzims |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31096 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 31097 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 31098 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 31099 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
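
To make the shape of `te_df` concrete, here is a minimal pure-Python sketch of the kind of one-hot encoding that `TransactionEncoder` performs: one column per distinct word, one boolean row per line. The helper `one_hot_encode` and the toy transactions are hypothetical stand-ins for illustration only; they are not part of mlxtend or of `bible.txt`.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)  # duplicates within a line collapse to one entry
        rows.append([column in present for column in columns])
    return columns, rows


# Toy transactions (hypothetical, not taken from bible.txt)
toy_transactions = [
    ["the", "lord", "god"],
    ["thou", "shalt", "not"],
    ["the", "lord", "said", "the"],  # the repeated word still yields a single True
]

columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Each row records only whether a word is present in a line, which is why the real matrix is overwhelmingly `False`: any single line uses a tiny fraction of the vocabulary.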


### b) Build the frequent itemsets via apriori and return the itemsets with at least .01 support

In [16]:
frequent_itemsets = apriori(te_df, min_support = 0.01, use_colnames = True)

In [17]:
frequent_itemsets

|  | support | itemsets |
|---|---|---|
| 0 | 0.019099 | (about) |
| 1 | 0.022990 | (according) |
| 2 | 0.035176 | (after) |
| 3 | 0.020932 | (again) |
| 4 | 0.044693 | (against) |
| ... | ... | ... |
| 351 | 0.013537 | (lord, thy, thee) |
| 352 | 0.015305 | (lord, thy, thou) |
| 353 | 0.012701 | (shalt, thou, thee) |
| 354 | 0.015144 | (shalt, thou, thy) |
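
For intuition about what `apriori` is doing, the following is a simplified level-wise sketch in plain Python. It is not mlxtend's implementation; the function names `support` and `apriori_sketch` and the toy data are assumptions made for illustration. The idea is that an itemset of size k + 1 is only considered if all of its size-k subsets were frequent.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori_sketch(transactions, min_support):
    """Level-wise search: only frequent k-itemsets are extended to size k + 1."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Keep the size-k candidates that meet the support threshold
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k + 1)-candidates, then prune any
        # candidate that has an infrequent k-subset
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(sub) in level for sub in combinations(c, k - 1))]
    return frequent


# Toy transactions (hypothetical)
toy = [["lord", "god", "said"], ["lord", "god"], ["lord", "thou"], ["thou", "shalt"]]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in result.items():
    print(sorted(itemset), s)
```

With a threshold of 0.5, the toy data yields three frequent single words and one frequent pair, mirroring how the real output lists single words first and larger itemsets last.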


### c) Select those frequent itemsets with at least size 2 that have support of at least 5 percent

In [22]:
# Keep itemsets with at least two items and support of at least 5 percent
is_large = frequent_itemsets["itemsets"].apply(len) >= 2
filtered_itemsets = frequent_itemsets[(frequent_itemsets["support"] >= 0.05) & is_large]
filtered_itemsets

|  | support | itemsets |
|---|---|---|
| 256 | 0.051188 | (lord, god) |


From the above output we can see that only one itemset with more than one item has a support of at least 0.05: (lord, god), with a support of 0.051188. Support is the fraction of transactions (here, lines of the file) that contain every item in the itemset, so this result tells us that the words "lord" and "god" appear together in roughly 5.1 percent of all lines. This makes sense given that we are examining the King James Bible.
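
One detail worth spelling out is that support counts lines, not word occurrences: a word that is repeated within a line still contributes only once. A small sketch with hypothetical toy lines (not taken from `bible.txt`) illustrates the difference:

```python
# Toy lines (hypothetical); each line is one transaction
toy_lines = [
    "the lord god said",
    "the lord the lord is god",  # "lord" occurs twice but counts once for this line
    "thou shalt not",
    "and god said",
]
transactions = [set(line.split()) for line in toy_lines]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


pair_support = support({"lord", "god"}, transactions)
lines_with_lord = sum("lord" in t for t in transactions)
total_lord_occurrences = sum(line.split().count("lord") for line in toy_lines)

print(pair_support)            # share of lines containing both words
print(lines_with_lord)         # lines in which "lord" appears
print(total_lord_occurrences)  # raw occurrences, which support ignores
```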

## Part Three: Association Rules
### a) Create the rules with their corresponding support, confidence, and lift. Use "lift" and min_threshold = 1.

In [23]:
rules = association_rules(frequent_itemsets, metric = "lift", min_threshold = 1)
rules.head()

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (lord) | (against) | 0.214366 | 0.044693 | 0.013119 | 0.061197 | 1.369271 | 0.003538 | 1.01758 |
| 1 | (against) | (lord) | 0.044693 | 0.214366 | 0.013119 | 0.293525 | 1.369271 | 0.003538 | 1.112048 |
| 2 | (am) | (lord) | 0.026076 | 0.214366 | 0.010578 | 0.405672 | 1.892426 | 0.004989 | 1.321886 |
| 3 | (lord) | (am) | 0.214366 | 0.026076 | 0.010578 | 0.049348 | 1.892426 | 0.004989 | 1.024479 |
| 4 | (said) | (answered) | 0.115816 | 0.015755 | 0.010675 | 0.092171 | 5.850226 | 0.00885 | 1.084174 |
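
Each metric column can be derived from the three support values in the same row. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the helper name `rule_metrics` is hypothetical. It is applied to the rounded supports of the (against) implies (lord) rule shown above, so the results agree with the table only to about four decimal places.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Derive rule metrics from antecedent, consequent, and joint support."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when confidence is exactly 1
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded supports copied from the (against) -> (lord) row
metrics = rule_metrics(support_a=0.044693, support_c=0.214366, support_ac=0.013119)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Swapping `support_a` and `support_c` gives the metrics for the reverse rule: lift and leverage are unchanged, while confidence and conviction differ, which matches rows 0 and 1 of the table.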


### b) Filter the dataframe to include only those with large lift (5) and high confidence (.6).

In [27]:
filtered_rules = rules[(rules["lift"] >= 5) & (rules["confidence"] >= 0.6)]
filtered_rules

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 5 | (answered) | (said) | 0.015755 | 0.115816 | 0.010675 | 0.677551 | 5.850226 | 0.00885 | 2.742089 |
| 6 | (art) | (thou) | 0.014533 | 0.124787 | 0.01434 | 0.986726 | 7.90728 | 0.012527 | 65.932714 |
| 79 | (hast) | (thou) | 0.027298 | 0.124787 | 0.026848 | 0.98351 | 7.881511 | 0.023442 | 53.075418 |
| 87 | (she) | (her) | 0.023408 | 0.038359 | 0.014083 | 0.601648 | 15.684715 | 0.013185 | 2.414051 |
| 181 | (thus) | (saith) | 0.022732 | 0.038488 | 0.01463 | 0.643564 | 16.721383 | 0.013755 | 2.697577 |
| 184 | (shalt) | (thou) | 0.039034 | 0.124787 | 0.039002 | 0.999176 | 8.007055 | 0.034131 | 1062.508601 |
| 249 | (thy, hast) | (thou) | 0.011318 | 0.124787 | 0.011029 | 0.974432 | 7.808762 | 0.009616 | 34.230554 |
| 261 | (lord, thus) | (saith) | 0.01627 | 0.038488 | 0.01389 | 0.853755 | 22.18265 | 0.013264 | 6.574666 |
| 265 | (thus) | (lord, saith) | 0.022732 | 0.028198 | 0.01389 | 0.611033 | 21.669011 | 0.013249 | 2.498413 |
| 266 | (lord, shalt) | (thou) | 0.012283 | 0.124787 | 0.01225 | 0.997382 | 7.992678 | 0.010718 | 334.331372 |


Among the filtered rules, two exhibit a perfect confidence of 1.0 and, in turn, infinite conviction: (shalt, thee) implies (thou) and (shalt, thy) implies (thou). For these two rules, every transaction that contains the antecedent also contains the consequent. Confidence is directional, however, so this does not mean that the consequent appears only alongside the antecedent. The rules (shalt) implies (thou) and (lord, shalt) implies (thou) also exhibit very high confidence, above 0.99 (0.999 and 0.997, respectively), indicating that the consequent almost always appears when the antecedent does. It is important to note, however, that while all of these rules exhibit high confidence and lift, with widely ranging conviction values, they also all exhibit relatively low support, the highest being 0.039. Each rule therefore describes at most about 4 percent of the lines, so these results should be interpreted with some caution.
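
The directional nature of confidence can be checked with the rounded supports from the (shalt) implies (thou) row of the table. The sketch below is illustrative only; the variable names and the `conviction` helper are hypothetical, and the conviction formula is the standard definition rather than code taken from mlxtend.

```python
# Rounded supports copied from the (shalt) -> (thou) row
support_shalt = 0.039034
support_thou = 0.124787
support_both = 0.039002

# Confidence divides the joint support by the antecedent's support,
# so reversing the rule changes the denominator
conf_shalt_to_thou = support_both / support_shalt
conf_thou_to_shalt = support_both / support_thou


def conviction(support_consequent, confidence):
    """Standard conviction; infinite when the rule never fails."""
    if confidence >= 1:
        return float("inf")
    return (1 - support_consequent) / (1 - confidence)


print(f"shalt -> thou: {conf_shalt_to_thou:.4f}")
print(f"thou -> shalt: {conf_thou_to_shalt:.4f}")
print(conviction(support_thou, 1.0))
```

Nearly every line containing "shalt" also contains "thou", but fewer than a third of the lines containing "thou" also contain "shalt". A confidence of exactly 1.0 makes the denominator of conviction zero, which is why such rules are reported with infinite conviction.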

### c) Conviction and the Problems with Confidence and Lift

Confidence is sensitive to the frequency of the consequent: a rule with a very frequent consequent will tend to have high confidence even when the antecedent and consequent are unrelated. Lift is susceptible to noise in small databases: rare itemsets with low counts that happen to occur together a few times can produce very large lift values. Conviction is designed to address these problems by comparing the frequency with which A would be expected to appear without B if the two were independent against the observed frequency of A appearing without B. Conviction is, therefore, a directed measure that is monotone in confidence and lift.
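
The two problems described above can be reproduced with made-up counts. The sketch below is a toy illustration under stated assumptions: the helper `metrics_from_counts` and all counts are hypothetical and are not derived from the Bible data.

```python
def metrics_from_counts(n, count_a, count_b, count_ab):
    """Confidence, lift, and conviction for the rule A -> B from raw counts."""
    confidence = count_ab / count_a
    support_b = count_b / n
    lift = confidence / support_b
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_b) / (1 - confidence)
    return {"confidence": confidence, "lift": lift, "conviction": conviction}


# Case 1: B appears in 90 percent of transactions and is independent of A.
# Confidence looks high, yet lift and conviction both sit at their neutral value.
frequent_consequent = metrics_from_counts(n=1000, count_a=100, count_b=900, count_ab=90)
print(frequent_consequent)

# Case 2: A and B each appear only twice, both times together.
# Two co-occurrences are enough to produce a very large lift.
rare_pair = metrics_from_counts(n=10000, count_a=2, count_b=2, count_ab=2)
print(rare_pair["lift"])
```

In the first case a confidence of 0.9 says nothing beyond the fact that B is common, which a conviction of 1 makes explicit. In the second case the lift is in the thousands even though it rests on only two transactions, which is one reason a minimum support threshold is applied before rules are generated.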