# Main

Student name: PHAN MANH TUNG 

Class: MoSIG M1

Student number: 42202349 

## Credit German

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [2]:
data = pd.read_csv("/content/credit-german.txt", delimiter='\t')

In [3]:
data.head(3)

Unnamed: 0,checking_status,disc_duration,credit_history,purpose,disc_amount,savings_status,employment,personal_status,other_parties,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,<0,lo_1_year,critical/other existing,radio/tv,1000_2000,no known savings,>=7,male single,none,real estate,>=55,none,own,two,skilled,one,yes,yes,good
1,0<=X<200,up_2_years,existing paid,radio/tv,up_2000,<100,1<=X<4,female div/dep/mar,none,real estate,<30,none,own,one,skilled,one,none,yes,bad
2,no checking,lo_1_year,critical/other existing,education,up_2000,<100,4<=X<7,male single,none,real estate,30<=X<55,none,own,one,unskilled resident,two,none,yes,good


In [4]:
# First we try to see the statistics of the dataset 
data.describe()

'''
Because the apriori algorithm is based on counting frequencies,
we first identify fields that are dominated by a single value, which add noise without adding information, and exclude them:
+ foreign_worker==yes (963/1000) -> almost every record is a foreign worker
+ other_payment_plans==none (814/1000) -> most people have no other payment plans
+ other_parties==none (907/1000) -> most people have no other parties
+ num_dependents==one (845/1000) -> most people have one dependent; about 15% have two
'''
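The screening described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual procedure: the helper name `dominant_share` and the 0.8 cut-off are assumptions chosen so that a column like `foreign_worker` would be flagged while a more evenly spread column would be kept.

```python
import pandas as pd


def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Return, for each column, the relative frequency of its most common value."""
    # value_counts sorts in descending order, so the first entry is the top value
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])


# Toy data with hypothetical values (not the real credit dataset)
toy = pd.DataFrame({
    "foreign_worker": ["yes"] * 9 + ["no"],
    "housing": ["own"] * 6 + ["rent"] * 4,
})

shares = dominant_share(toy)
# Assumed cut-off: flag columns where one value covers more than 80% of the rows
near_constant = shares[shares > 0.8].index.tolist()
print(shares.to_dict())
print(near_constant)
```

Columns flagged this way are only candidates for exclusion; whether to drop them remains a judgment call.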

In [5]:
# Make a copy of the dataframe, with the exclusion of the aforementioned fields
df = data.loc[:, ~data.columns.isin(["foreign_worker", "other_payment_plans", "other_parties", "num_dependents"])].copy()

'''
Hint 2: The transactions are not very meaningful without side information e.g., “yes, yes, good” is meaningless
without knowing to which field corresponds each of the values. Propose a smart way of transforming each of
the lines (from the second and until the last one) of the two files in order to have meaningful transactions.
'''

# The rows are not item lists as such: bare values such as "yes", "none" or "good" are MEANINGLESS without their column.
# So we prefix each value with its column name, in the form 'column_name==value'.
# Note: DataFrame.apply passes one column at a time (not one row), so col.name is the column name.
df[df.columns] = df[df.columns].apply(lambda col: str(col.name) + "==" + col)

# turn the df into a list to fit into the model
dataset = df.values.tolist()
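To make the transformation concrete, here is a small sketch on a two-row toy frame (the values are illustrative, not taken from the file). It shows how the column-name prefix turns ambiguous values like "yes" into self-describing items.

```python
import pandas as pd

# Toy frame: without column names, both "yes" values would be indistinguishable
toy = pd.DataFrame({
    "own_telephone": ["yes", "none"],
    "foreign_worker": ["yes", "yes"],
})

# Prefix every value with its column name; apply() hands over one column at a time
tagged = toy.apply(lambda col: col.name + "==" + col.astype(str))

# Each row becomes one transaction (a list of items)
transactions = tagged.values.tolist()
print(transactions)
# [['own_telephone==yes', 'foreign_worker==yes'], ['own_telephone==none', 'foreign_worker==yes']]
```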

In [6]:
# Association Rules

'''
You probably noticed that a same fixed parameter value may yield a number of frequent itemsets/association
rules that greatly varies from one dataset to another.
For the different datasets, tune the support/confidence/lift values, and filter the antecedents/consequents of
the rules such that you obtain a relatively small number of rules that you find interpretable and useful. The
definitions of the different support, confidence and lift measures are given in the slides "Rule Mining from
Data".
In the TP report that you will put on caseine, for each dataset, you will provide the set of extracted rules
that you consider as interpretable and useful, and you will summarize the choices that you have made and the
experiments that you have conducted for setting up the parameters leading to them. The code, outputs and
justifications will be given in a single python file.
'''

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
r = pd.DataFrame(te_ary, columns=te.columns_)

# I choose a low min_support to preserve itemsets containing class==bad (there are only 300/1000 records for that class)
frequent_itemsets = apriori(r, min_support=0.15, use_colnames=True) # support > 0.15

# Keep only rules with lift > 1 (which also implies conviction > 1);
# a value of exactly 1 means the antecedent and the consequent are independent
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.001)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(checking_status==no checking),(age==30<=X<55),0.394,0.550,0.246,0.624365,1.135210,0.029300,1.197973
1,(age==30<=X<55),(checking_status==no checking),0.550,0.394,0.246,0.447273,1.135210,0.029300,1.096382
2,(age==30<=X<55),(class==good),0.550,0.700,0.409,0.743636,1.062338,0.024000,1.170213
3,(class==good),(age==30<=X<55),0.700,0.550,0.409,0.584286,1.062338,0.024000,1.082474
4,(age==30<=X<55),(credit_history==critical/other existing ),0.550,0.293,0.189,0.343636,1.172820,0.027850,1.077147
...,...,...,...,...,...,...,...,...,...
2095,(housing==own),"(job==skilled, existing_credits==one, credit_h...",0.713,0.207,0.152,0.213184,1.029873,0.004409,1.007859
2096,(job==skilled),"(class==good, existing_credits==one, credit_hi...",0.630,0.234,0.152,0.241270,1.031068,0.004580,1.009582
2097,(class==good),"(job==skilled, existing_credits==one, credit_h...",0.700,0.211,0.152,0.217143,1.029113,0.004300,1.007847
2098,(existing_credits==one),"(job==skilled, credit_history==existing paid, ...",0.633,0.173,0.152,0.240126,1.388014,0.042491,1.088339
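The metric columns in the table above can all be derived from three supports. The sketch below uses the standard definitions of confidence, lift, leverage and conviction, and checks them against rule 0, (checking_status==no checking) -> (age==30<=X<55). The function name `rule_metrics` is a hypothetical helper, not part of mlxtend.

```python
def rule_metrics(support_a: float, support_c: float, support_ac: float) -> dict:
    """Compute rule metrics from the supports of A, C and (A and C)."""
    confidence = support_ac / support_a            # P(C | A)
    lift = confidence / support_c                  # 1.0 means A and C are independent
    leverage = support_ac - support_a * support_c  # observed minus expected co-occurrence
    # Conviction is unbounded when the rule never fails (confidence == 1)
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports copied from rule 0 in the table above
m = rule_metrics(support_a=0.394, support_c=0.550, support_ac=0.246)
print({name: round(value, 6) for name, value in m.items()})
```

Since conviction exceeds 1 exactly when confidence exceeds the consequent support, filtering on lift > 1 and on conviction > 1 selects the same rules.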


In [7]:
# Decision rules

print('''
Reasons to have a good credit: ( lift, conviction > 1 ; confidence > 0.5 ; support threshold is raised to 0.3 )

The rules suggest some meaningful factors associated with the "good" class, such as:
+ having no checking account (the strongest rule: confidence 0.88, lift 1.26)
+ owning a house
+ having a skilled job
+ being a single male
+ being between 30 and 55 years old
''')

result1 = rules[(rules['consequents'] == {'class==good'}) & (rules['confidence'] > 0.5) & (rules['conviction'] > 1)
                & (rules['support'] > 0.3)]

result1


Reasons to have a good credit: ( lift, conviction > 1 ; confidence > 0.5 ; support threshold is raised to 0.3 )

The rules suggest some meaningful factors associated with the "good" class, such as:
+ having no checking account (the strongest rule: confidence 0.88, lift 1.26)
+ owning a house
+ having a skilled job
+ being a single male
+ being between 30 and 55 years old



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(age==30<=X<55),(class==good),0.55,0.7,0.409,0.743636,1.062338,0.024,1.170213
50,(checking_status==no checking),(class==good),0.394,0.7,0.348,0.883249,1.261784,0.0722,2.569565
88,(housing==own),(class==good),0.713,0.7,0.527,0.73913,1.055901,0.0279,1.15
90,(job==skilled),(class==good),0.63,0.7,0.444,0.704762,1.006803,0.003,1.016129
94,(personal_status==male single),(class==good),0.548,0.7,0.402,0.733577,1.047967,0.0184,1.126027
304,"(age==30<=X<55, housing==own)",(class==good),0.417,0.7,0.324,0.776978,1.109969,0.0321,1.345161
788,"(existing_credits==one, housing==own)",(class==good),0.438,0.7,0.311,0.710046,1.014351,0.0044,1.034646
808,"(job==skilled, housing==own)",(class==good),0.452,0.7,0.342,0.756637,1.08091,0.0256,1.232727
814,"(own_telephone==none, housing==own)",(class==good),0.433,0.7,0.319,0.736721,1.052458,0.0159,1.139474
824,"(personal_status==male single, housing==own)",(class==good),0.408,0.7,0.314,0.769608,1.09944,0.0284,1.302128
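When many rules pass the thresholds, ranking them by lift helps surface the most informative ones first. The following is a sketch of such a filter on a small hand-built frame; the numbers are copied from the tables in this notebook, while the helper `rules_for` and its parameters are hypothetical names introduced for illustration.

```python
import pandas as pd

# A few rules copied from the outputs above (itemsets stored as frozensets, as mlxtend does)
rules_toy = pd.DataFrame({
    "antecedents": [
        frozenset({"age==30<=X<55"}),
        frozenset({"checking_status==no checking"}),
        frozenset({"job==skilled"}),
        frozenset({"class==good"}),
    ],
    "consequents": [
        frozenset({"class==good"}),
        frozenset({"class==good"}),
        frozenset({"class==good"}),
        frozenset({"housing==own"}),
    ],
    "support": [0.409, 0.348, 0.444, 0.527],
    "confidence": [0.743636, 0.883249, 0.704762, 0.752857],
    "lift": [1.062338, 1.261784, 1.006803, 1.055901],
})


def rules_for(rules, target, min_support=0.0, min_confidence=0.0):
    """Keep rules whose consequent is exactly {target}, strongest lift first."""
    wanted = frozenset({target})
    mask = (
        rules["consequents"].apply(lambda c: c == wanted)
        & (rules["support"] > min_support)
        & (rules["confidence"] > min_confidence)
    )
    return rules[mask].sort_values("lift", ascending=False)


top = rules_for(rules_toy, "class==good", min_support=0.3, min_confidence=0.5)
print(top[["antecedents", "confidence", "lift"]])
```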


In [8]:
print('''
Reasons to have a bad credit: ( lift, conviction > 1 ; support > 0.175 )

The rules suggest some meaningful factors associated with the "bad" class, such as:
+ disc_amount : up_2000
+ existing credits : one
+ not owning a telephone
+ savings status < 100

The support and confidence values are very low because class=='good' covers 700/1000 records,
which leaves only 300 records for class=='bad'.
Therefore, we do not set a confidence threshold for class=='bad'.
''')

result1 = rules[(rules['consequents'] == {'class==bad'}) & (rules['support'] > 0.175)]
result1


Reasons to have a bad credit: ( lift, conviction > 1 ; support > 0.175 )

The rules suggest some meaningful factors associated with the "bad" class, such as:
+ disc_amount : up_2000
+ existing credits : one
+ not owning a telephone
+ savings status < 100

The support and confidence values are very low because class=='good' covers 700/1000 records,
which leaves only 300 records for class=='bad'.
Therefore, we do not set a confidence threshold for class=='bad'.



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
68,(disc_amount==up_2000),(class==bad),0.568,0.3,0.179,0.315141,1.050469,0.0086,1.022108
70,(existing_credits==one),(class==bad),0.633,0.3,0.2,0.315956,1.053186,0.0101,1.023326
73,(own_telephone==none),(class==bad),0.596,0.3,0.187,0.313758,1.045861,0.0082,1.020049
75,(savings_status==<100),(class==bad),0.603,0.3,0.217,0.359867,1.199558,0.0361,1.093523
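The low confidence values follow directly from the class prior: confidence equals lift times the consequent support. The small sketch below (the helper name is hypothetical) reproduces two confidences from the tables in this notebook and shows why a rule towards class==bad with a similar lift ends up with a much lower confidence than one towards class==good.

```python
def confidence_from_lift(lift: float, consequent_support: float) -> float:
    """confidence = lift * P(consequent), from the definition lift = confidence / P(consequent)."""
    return lift * consequent_support


# (savings_status==<100) -> (class==bad): the prior of class==bad is 0.3
conf_bad = confidence_from_lift(1.199558, 0.3)
# (checking_status==no checking) -> (class==good): the prior of class==good is 0.7
conf_good = confidence_from_lift(1.261784, 0.7)

print(round(conf_bad, 6), round(conf_good, 6))
```

A confidence threshold of 0.5 would therefore require a lift above 1 / 0.3 (about 3.33) for any rule towards class==bad, which none of the rules shown reaches.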


In [9]:
# Abstract rules: rules describing relations among attributes other than the class (confidence > 0.7, support > 0.25)

result2= rules[(rules['consequents'] != {'class==good'} ) & (rules['consequents'] != {'class==bad'} ) 
& (rules['confidence'] > 0.7)  & (rules['support'] > 0.25)]
result2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
12,(age==30<=X<55),(housing==own),0.55,0.713,0.417,0.758182,1.063369,0.02485,1.186842
27,(age==<30),(job==skilled),0.371,0.63,0.264,0.71159,1.129508,0.03027,1.282897
28,(age==<30),(own_telephone==none),0.371,0.596,0.268,0.722372,1.212034,0.046884,1.455184
58,(checking_status==no checking),(housing==own),0.394,0.713,0.304,0.771574,1.082151,0.023078,1.256422
89,(class==good),(housing==own),0.7,0.713,0.527,0.752857,1.055901,0.0279,1.161272
120,(existing_credits==one),(credit_history==existing paid),0.633,0.53,0.478,0.755134,1.424782,0.14251,1.919419
121,(credit_history==existing paid),(existing_credits==one),0.53,0.633,0.478,0.901887,1.424782,0.14251,3.740577
164,(disc_duration==lo_1_year),(housing==own),0.359,0.713,0.266,0.740947,1.039196,0.010033,1.107882
166,(disc_duration==lo_1_year),(own_telephone==none),0.359,0.596,0.255,0.710306,1.191789,0.041036,1.394577
178,(employment==1<=X<4),(housing==own),0.339,0.713,0.252,0.743363,1.042585,0.010293,1.11831


## Habitudes de vie

In [10]:
data = pd.read_csv("/content/habitudes_de_vie.csv", delimiter='\t')

In [11]:
data.head()

Unnamed: 0,TYPELAIT,SELALIMENT,SELCONSO,ACTIVITESPORT,FUMER,HAB_BOISSON
0,2%MILK,MODERATE,MODERATE,DAILY,REGULAR,OCCASIONAL
1,SKIM,MODERATE,LOW,DAILY,NEVER,NEVER
2,NOMILK,NONE,???,DAILY,FORMER,NEVER
3,NOMILK,NONE,LOW,DAILY,OCCASIONAL,REGULAR
4,2%MILK,NONE,LOW,NEVER,REGULAR,OCCASIONAL


In [12]:
# First we try to see the statistics of the dataset 
data.describe()

# The dataset seems fairly balanced: the columns have similar numbers of unique values (4, 5 or 6),
# and no single value dominates a column (the most frequent is TYPELAIT==2%MILK with 231/360, which is acceptable)

Unnamed: 0,TYPELAIT,SELALIMENT,SELCONSO,ACTIVITESPORT,FUMER,HAB_BOISSON
count,360,360,360,360,360,360
unique,5,5,6,6,5,4
top,2%MILK,VERYLITTLE,LOW,NEVER,FORMER,REGULAR
freq,231,132,123,172,140,213


In [13]:
# Copy data
df = data.copy()
# Prefix each value with its column name (apply passes one column at a time)
df[df.columns] = df[df.columns].apply(lambda col: str(col.name) + "==" + col)
dataset = df.values.tolist()

# Encoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
r = pd.DataFrame(te_ary, columns=te.columns_)
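For readers unfamiliar with the encoding step, the sketch below builds the same kind of boolean item matrix by hand with plain pandas. It is a stand-in written for illustration, not mlxtend's implementation, and the three transactions are made-up examples using item names from this dataset.

```python
import pandas as pd

# Hypothetical transactions in the 'column_name==value' form
transactions = [
    ["FUMER==NEVER", "ACTIVITESPORT==DAILY"],
    ["FUMER==REGULAR", "ACTIVITESPORT==NEVER"],
    ["FUMER==REGULAR", "ACTIVITESPORT==DAILY"],
]

# One column per distinct item, in sorted order
items = sorted({item for transaction in transactions for item in transaction})

# One row per transaction; True where the item occurs in that transaction
onehot = pd.DataFrame(
    [[item in transaction for item in items] for transaction in transactions],
    columns=items,
)

# The support of a single item is the mean of its boolean column
support = onehot.mean()
print(onehot)
print(support.round(3).to_dict())
```

Apriori then only has to count how often groups of columns are True together.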

In [14]:
# Apriori
frequent_itemsets = apriori(r, min_support=0.15, use_colnames=True)
# each column has 4 to 6 unique values, so an even split gives each value a support of about 1/6 -> min_support ~ 0.15

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.001)

print('''
For this "life habits" dataset, there are no DECISION RULES, since it has no target/decision column.
We can only extract ABSTRACT RULES, in other words, relations between the habits.

The Apriori algorithm with support > 0.15 and lift > 1 yields a good outcome (confidence > 0.3, conviction > 1)

We can see meaningful connections among these habits:
Two-directional connections:
+ (FUMER==REGULAR) <-> (ACTIVITESPORT==NEVER)
+ (ACTIVITESPORT==DAILY) <-> (HAB_BOISSON==REGULAR)

+ (SELCONSO==LOW|MODERATE) <-> (ACTIVITESPORT==NEVER)

+ (ACTIVITESPORT==DAILY, HAB_BOISSON==REGULAR) <-> (TYPELAIT==2%MILK)

# SELALIMENT and SELCONSO have the same levels
+ (SELALIMENT==NONE) <-> (SELCONSO==VERYLOW)
+ (SELALIMENT==MODERATE) <-> (SELCONSO==MODERATE)

One-directional connections:
+ (FUMER==FORMER) -> (HAB_BOISSON==REGULAR)

..and so on..

''')

rules


For this "life habits" dataset, there are no DECISION RULES, since it has no target/decision column.
We can only extract ABSTRACT RULES, in other words, relations between the habits.

The Apriori algorithm with support > 0.15 and lift > 1 yields a good outcome (confidence > 0.3, conviction > 1)

We can see meaningful connections among these habits:
Two-directional connections:
+ (FUMER==REGULAR) <-> (ACTIVITESPORT==NEVER)
+ (ACTIVITESPORT==DAILY) <-> (HAB_BOISSON==REGULAR)

+ (SELCONSO==LOW|MODERATE) <-> (ACTIVITESPORT==NEVER)

+ (ACTIVITESPORT==DAILY, HAB_BOISSON==REGULAR) <-> (TYPELAIT==2%MILK)

# SELALIMENT and SELCONSO have the same levels
+ (SELALIMENT==NONE) <-> (SELCONSO==VERYLOW)
+ (SELALIMENT==MODERATE) <-> (SELCONSO==MODERATE)

One-directional connections:
+ (FUMER==FORMER) -> (HAB_BOISSON==REGULAR)

..and so on..




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(HAB_BOISSON==REGULAR),(ACTIVITESPORT==DAILY),0.591667,0.35,0.233333,0.394366,1.126761,0.02625,1.073256
1,(ACTIVITESPORT==DAILY),(HAB_BOISSON==REGULAR),0.35,0.591667,0.233333,0.666667,1.126761,0.02625,1.225
2,(TYPELAIT==2%MILK),(ACTIVITESPORT==DAILY),0.641667,0.35,0.230556,0.359307,1.026592,0.005972,1.014527
3,(ACTIVITESPORT==DAILY),(TYPELAIT==2%MILK),0.35,0.641667,0.230556,0.65873,1.026592,0.005972,1.05
4,(FUMER==REGULAR),(ACTIVITESPORT==NEVER),0.316667,0.477778,0.166667,0.526316,1.101591,0.01537,1.102469
5,(ACTIVITESPORT==NEVER),(FUMER==REGULAR),0.477778,0.316667,0.166667,0.348837,1.101591,0.01537,1.049405
6,(ACTIVITESPORT==NEVER),(HAB_BOISSON==NEVER),0.477778,0.258333,0.155556,0.325581,1.260315,0.03213,1.099713
7,(HAB_BOISSON==NEVER),(ACTIVITESPORT==NEVER),0.258333,0.477778,0.155556,0.602151,1.260315,0.03213,1.312613
8,(SELALIMENT==MODERATE),(ACTIVITESPORT==NEVER),0.269444,0.477778,0.155556,0.57732,1.208343,0.026821,1.235501
9,(ACTIVITESPORT==NEVER),(SELALIMENT==MODERATE),0.477778,0.269444,0.155556,0.325581,1.208343,0.026821,1.083238
