This notebook extracts association rules from an arbitrary, relatively large document. Each sentence
is treated as one transaction and each word as an item. Existing modules (pandas, NumPy and mlxtend) do the extraction.

[article link](https://aeon.co/essays/being-underslept-and-out-of-sync-is-a-political-injustice)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Read the text file and preprocess it

In [11]:
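# read the article one paragraph per line, skipping blank lines (UTF-8 assumed);
# recent NumPy releases reject a newline delimiter in np.loadtxt, so read the lines directly
with open('./Q4/article.txt', encoding='utf-8') as f:
    textArr = np.array([line.strip() for line in f if line.strip()])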
textArr

array(['For Uber drivers trying to make ends meet, it can be tempting to sleep in the car. It saves on a few journeys and helps make the most of peak-hour business. It keeps a driver readily available for work – and the apps favour those who can clock up the hours. There are carparks where the sleeping bags come out after dark, if only for five or six hours.',
       'Sleeping in a vehicle is clearly not great. There are the obvious obstacles to adequate rest – how to get comfortable, how to deal with the light, temperature and lack of facilities. The sleep is typically short and poor. Then there are questions of privacy – exposure to onlookers, from passersby to police. Sleeping in a car means breaking a norm, often attracting suspicion. To sleep where you work has its own degradations – a sense of permanent connection, perhaps of exploitation. And it almost certainly means sleeping alone.',
       'The carpark sleeper is one of the more dramatic expressions of poor sleep in the conte

In [12]:
# split the text into sentences
flattenedSentencesArr = np.hstack(np.char.split(textArr, sep='.'))
data = pd.DataFrame(flattenedSentencesArr)
data.columns = ['sentences']
data.head()

Unnamed: 0,sentences
0,"For Uber drivers trying to make ends meet, it ..."
1,It saves on a few journeys and helps make the...
2,It keeps a driver readily available for work ...
3,There are carparks where the sleeping bags co...
4,
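
Splitting on `'.'` alone ignores `?` and `!` and also cuts at decimal points. As a minimal sketch of an alternative (not what this notebook does, and it would change the numbers below), a regular expression can split only where sentence-ending punctuation is followed by whitespace. The helper name `split_sentences` is hypothetical.

```python
import re

def split_sentences(paragraph):
    """Split after '.', '!' or '?' only when whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    # drop the empty string produced by an empty paragraph
    return [p for p in parts if p]

sample = ("Sleeping in a vehicle is clearly not great. "
          "The sleep is typically short and poor.")
sentences = split_sentences(sample)
print(sentences)
```

Unlike the `'.'` split, this keeps the terminating punctuation on each sentence; the later symbol-removal step would strip it anyway.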


In [13]:
# remove line breaks
data['sentences'] = data['sentences'].apply(lambda x: x.replace('\n', ''))
# drop empty sentences
data['sentences'] = data['sentences'].replace('', np.nan)
data = data.dropna()
data.sample(5)

Unnamed: 0,sentences
89,Short sleep is often coupled with irregular sleep
107,Such trends are pronounced in developing coun...
65,They have their own significance
9,"Sleeping in a car means breaking a norm, ofte..."
149,They cannot be evaluated looking only at thei...


In [14]:
# lowercase, replace every non-word character with a comma, then split into a list of words
import re
data['sentences'] = data['sentences'].apply(lambda x: x.lower())
data['sentences'] = data['sentences'].apply(lambda x: re.sub(r'[^\w]', ',', x))
data['sentences'] = data['sentences'].apply(lambda x: x.split(','))
data.sample(5)

Unnamed: 0,sentences
139,"[, those, afflicted, by, illness, may, have, g..."
263,"[, renovating, them, for, an, age, of, desynch..."
102,"[, the, individual, must, adapt, or, risk, the..."
46,"[, the, sleep, is, typically, short, and, poor]"
180,"[, significant, minorities, are, excluded, fro..."


In [15]:
# drop the empty strings left over from the split
data['sentences'] = data['sentences'].apply(lambda x: [word for word in x if word])
# remove common words
common_words = ['the', 'a', 'an', 'and', 'of', 'in', 'is', 'are', 'was', 'were', 'that', 'this']
data['sentences'] = data['sentences'].apply(lambda x: [word for word in x if word not in common_words])
data.sample(5)

Unnamed: 0,sentences
139,"[those, afflicted, by, illness, may, have, gre..."
207,"[evening, however, they, succumb, more, easily..."
124,"[there, no, dignity]"
17,"[divides, around, sleep, have, rarely, been, s..."
236,"[some, argue, sleep, should, be, protected, le..."
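
To make the preprocessing steps above easier to follow, here is a minimal pure-Python sketch that applies them to a single sentence from the article. The function name `tokenize` is hypothetical; the regular expression and the common-word list are the ones used in the cells above.

```python
import re

COMMON_WORDS = ['the', 'a', 'an', 'and', 'of', 'in', 'is', 'are',
                'was', 'were', 'that', 'this']

def tokenize(sentence):
    """Turn one sentence into a transaction (a list of word items)."""
    lowered = sentence.lower()
    # every non-word character (spaces, punctuation, dashes) becomes a comma
    separated = re.sub(r'[^\w]', ',', lowered)
    # splitting on commas leaves empty strings, which are dropped here
    words = [w for w in separated.split(',') if w]
    return [w for w in words if w not in COMMON_WORDS]

transaction = tokenize(' The sleep is typically short and poor')
print(transaction)
```

The leading space in the input mirrors what the `'.'` split leaves behind; it produces an empty string that the filter removes.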


In [16]:
data.shape

(240, 1)

## Perform association rule mining using apriori

In [132]:
# install mlxtend
%pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.19.0
Note: you may need to restart the kernel to use updated packages.


In [18]:
from mlxtend.preprocessing import TransactionEncoder

In [19]:
# encode the data using transaction encoder
te = TransactionEncoder()
te_ary = te.fit(data['sentences']).transform(data['sentences'])
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,10,10pm,11am,1867,19,1900s,1941,1949,19th,2016,...,world,would,wrong,yawn,year,years,yet,yields,you,yourself
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
236,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
237,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
238,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
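
The boolean table above is a one-hot encoding: one column per distinct word, one row per sentence. The sketch below shows roughly what that amounts to in plain Python. It is an illustration under the assumption that columns are the sorted distinct items, not mlxtend's actual implementation, and `one_hot` is a hypothetical helper.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted distinct items and a boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in set(t) for col in columns] for t in transactions]
    return columns, rows

# two transactions taken from the preprocessed sentences shown earlier
toy = [['sleep', 'typically', 'short', 'poor'],
       ['there', 'no', 'dignity']]
columns, rows = one_hot(toy)
print(columns)
for row in rows:
    print(row)
```

Sorting the items as strings also explains why numeric tokens such as `10` and `1867` appear before the alphabetic words in the column list.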


In [20]:
from mlxtend.frequent_patterns import apriori

In [23]:
df_apr = apriori(df, min_support=0.03, use_colnames=True)
df_apr

Unnamed: 0,support,itemsets
0,0.033333,(also)
1,0.125000,(as)
2,0.070833,(at)
3,0.183333,(be)
4,0.066667,(but)
...,...,...
142,0.037500,"(to, sleep, by)"
143,0.033333,"(to, for, sleep)"
144,0.041667,"(it, to, sleep)"
145,0.037500,"(to, one, sleep)"
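
The support of an itemset is the fraction of transactions (here, sentences) that contain every item in it. The sketch below illustrates the level-wise idea behind Apriori on a hypothetical toy dataset: only itemsets whose subsets are all frequent are extended. It is a simplified teaching version, not mlxtend's implementation, and it treats `support >= min_support` as frequent.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, pruning early."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result, k = {}, 1
    while candidates:
        # keep the candidates of size k that reach the support threshold
        kept = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                kept[c] = s
        result.update(kept)
        # join frequent k-itemsets into (k+1)-itemsets ...
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        # ... and prune any candidate with an infrequent (k-1)-subset
        candidates = [c for c in joined
                      if all(frozenset(sub) in kept
                             for sub in combinations(c, k - 1))]
    return result

# hypothetical toy transactions, not taken from the article
toy = [{'short', 'sleep'}, {'poor', 'sleep'},
       {'short', 'sleep', 'poor'}, {'work', 'hours'}]
freq = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

In the toy data, `{short, poor}` appears in only one of four transactions, so the three-item candidate `{short, sleep, poor}` is pruned without being counted.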


In [24]:
df_apr.sort_values('support', ascending=False)

Unnamed: 0,support,itemsets
62,0.445833,(to)
48,0.375000,(sleep)
20,0.195833,(it)
3,0.183333,(be)
127,0.170833,"(to, sleep)"
...,...,...
101,0.033333,"(not, it)"
104,0.033333,"(less, sleep)"
107,0.033333,"(not, to)"
109,0.033333,"(on, sleep)"


In [25]:
# number of items in each itemset
df_apr['length'] = df_apr['itemsets'].apply(len)
df_apr.sample(3)

Unnamed: 0,support,itemsets,length
142,0.0375,"(to, sleep, by)",3
42,0.033333,(public),1
40,0.05,(poor),1


In [27]:
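# itemsets with at least 3 items (the support filter is redundant here, since min_support was already 0.03)
df_apr[(df_apr['length'] >= 3) & (df_apr['support'] > 0.02)]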

Unnamed: 0,support,itemsets,length
138,0.041667,"(to, sleep, as)",3
139,0.033333,"(be, can, sleep)",3
140,0.033333,"(be, to, for)",3
141,0.0625,"(be, to, sleep)",3
142,0.0375,"(to, sleep, by)",3
143,0.033333,"(to, for, sleep)",3
144,0.041667,"(it, to, sleep)",3
145,0.0375,"(to, one, sleep)",3
146,0.033333,"(their, to, people)",3


In [28]:
from mlxtend.frequent_patterns import association_rules

In [29]:
# generate rules from the frequent itemsets, keeping those with lift of at least 1
rules = association_rules(df_apr, metric='lift', min_threshold=1)

In [30]:
rules.sample(5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
73,(to),(other),0.445833,0.066667,0.0375,0.084112,1.261682,0.007778,1.019048
124,"(to, sleep)",(be),0.170833,0.183333,0.0625,0.365854,1.995565,0.031181,1.287821
116,"(be, to)",(for),0.116667,0.1625,0.033333,0.285714,1.758242,0.014375,1.1725
84,(short),(sleep),0.041667,0.375,0.0375,0.9,2.4,0.021875,6.25
139,(sleep),"(to, for)",0.375,0.079167,0.033333,0.088889,1.122807,0.003646,1.010671
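
The metric columns can be checked by hand. The sketch below recomputes them for rule 84, `(short) -> (sleep)`, using the standard definitions of confidence, lift, leverage and conviction. The sentence counts (10, 90 and 9 out of 240) are inferred from the reported supports and the 240 rows of `data`, so treat them as an assumption; `rule_metrics` is a hypothetical helper.

```python
def rule_metrics(n_antecedent, n_consequent, n_both, n_total):
    """Recompute rule metrics from raw transaction counts."""
    sup_a = n_antecedent / n_total
    sup_c = n_consequent / n_total
    sup_ac = n_both / n_total
    confidence = sup_ac / sup_a            # P(consequent | antecedent)
    lift = confidence / sup_c              # 1.0 would mean independence
    leverage = sup_ac - sup_a * sup_c      # observed minus expected co-occurrence
    # conviction is infinite for a rule that never fails
    conviction = (float('inf') if confidence == 1
                  else (1 - sup_c) / (1 - confidence))
    return {'support': sup_ac, 'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}

# counts inferred from the table: 0.041667 * 240, 0.375 * 240, 0.0375 * 240
metrics = rule_metrics(n_antecedent=10, n_consequent=90, n_both=9, n_total=240)
for name, value in metrics.items():
    print(f'{name}: {value:.6f}')
```

Up to rounding, these agree with the row shown above: confidence 0.9, lift 2.4, leverage 0.021875 and conviction 6.25.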


In [31]:
rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
157,(people),"(their, to)",0.079167,0.083333,0.033333,0.421053,5.052632,0.026736,1.583333
152,"(their, to)",(people),0.083333,0.079167,0.033333,0.400000,5.052632,0.026736,1.534722
155,(their),"(to, people)",0.154167,0.050000,0.033333,0.216216,4.324324,0.025625,1.212069
154,"(to, people)",(their),0.050000,0.154167,0.033333,0.666667,4.324324,0.025625,2.537500
22,(those),(by),0.091667,0.125000,0.041667,0.454545,3.636364,0.030208,1.604167
...,...,...,...,...,...,...,...,...,...
87,(sleep),(to),0.375000,0.445833,0.170833,0.455556,1.021807,0.003646,1.017857
68,(to),(or),0.445833,0.083333,0.037500,0.084112,1.009346,0.000347,1.000850
47,(to),(have),0.445833,0.083333,0.037500,0.084112,1.009346,0.000347,1.000850
69,(or),(to),0.083333,0.445833,0.037500,0.450000,1.009346,0.000347,1.007576


In [37]:
rules[(rules['lift'] >= 3) & (rules['confidence'] >= 0.2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
22,(those),(by),0.091667,0.125,0.041667,0.454545,3.636364,0.030208,1.604167
23,(by),(those),0.125,0.091667,0.041667,0.333333,3.636364,0.030208,1.3625
74,(their),(people),0.154167,0.079167,0.041667,0.27027,3.41394,0.029462,1.261883
75,(people),(their),0.079167,0.154167,0.041667,0.526316,3.41394,0.029462,1.785648
111,"(be, sleep)",(can),0.091667,0.1,0.033333,0.363636,3.636364,0.024167,1.414286
112,"(can, sleep)",(be),0.054167,0.183333,0.033333,0.615385,3.356643,0.023403,2.123333
114,(can),"(be, sleep)",0.1,0.091667,0.033333,0.333333,3.636364,0.024167,1.3625
152,"(their, to)",(people),0.083333,0.079167,0.033333,0.4,5.052632,0.026736,1.534722
154,"(to, people)",(their),0.05,0.154167,0.033333,0.666667,4.324324,0.025625,2.5375
155,(their),"(to, people)",0.154167,0.05,0.033333,0.216216,4.324324,0.025625,1.212069
