In [6]:
## create new data matrix for association rule analysis
import pandas as pd

all_data = pd.read_csv( 'data/wvs.csv')

data = pd.DataFrame()

Variables: 

* V10 - "Feeling of happiness" (1-2 happy, 3-4 not happy)
* Important in life: V4 - family, V5 - friends, V6 - leisure time, V7 - politics, V8 - work, V9 - religion (1-2 important, 3-4 not important)


In [7]:
variables = [ 'V10', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9']

for variable in variables:
    
    data[ variable ] = all_data[ variable ] >= 3 ## 3-4 not happy / not important TRUE
    
data.head()

     V10     V4     V5     V6     V7     V8     V9
0  False  False  False  False  False  False  False
1  False  False  False   True   True  False  False
2  False  False   True  False   True  False  False
3  False  False  False   True   True   True  False
4  False  False  False  False  False  False  False


In [8]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

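## mine frequent itemsets; the very low min_support keeps almost every combination
frequent_itemsets = apriori( data, min_support=0.0000001, use_colnames = True )
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

## derive rules; metric and min_threshold are written out but match the mlxtend defaults
rules = association_rules( frequent_itemsets, metric='confidence', min_threshold=0.8 )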

In [10]:
for rule in rules.to_dict('records'):    
    print( rule['antecedents'], '=>', rule['consequents'], 
            'confidence', '{0:.5f}'.format(rule['confidence']), 
            'lift', '{0:.5f}'.format(rule['lift']), 
             'support', '{0:.5f}'.format(rule['support']) )

frozenset({'V5', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.80369 lift 1.47483 support 0.00974
frozenset({'V4', 'V10', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.83908 lift 1.53978 support 0.00082
frozenset({'V5', 'V9', 'V10', 'V8'}) => frozenset({'V7'}) confidence 0.81641 lift 1.49817 support 0.00233
frozenset({'V8', 'V4', 'V6', 'V7'}) => frozenset({'V5'}) confidence 0.80153 lift 6.61343 support 0.00117
frozenset({'V5', 'V4', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.83333 lift 1.52924 support 0.00117
frozenset({'V9', 'V4', 'V6', 'V8'}) => frozenset({'V5'}) confidence 0.81356 lift 6.71271 support 0.00107
frozenset({'V9', 'V4', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.80508 lift 1.47740 support 0.00106
frozenset({'V5', 'V9', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.81463 lift 1.49492 support 0.00373
frozenset({'V6', 'V8', 'V5', 'V4', 'V10'}) => frozenset({'V7'}) confidence 0.87692 lift 1.60923 support 0.00064
frozenset({'V6', 'V9', 'V5', 'V4', 'V10'}) => frozen

## Tasks

* Analyse also lift and support. Do you find any rules which might be interesting for further investigation?

In [18]:
rule_number = 1

print("Best lift:\n")

for rule in rules.to_dict('records'):
    if rule['lift'] >= 3:
        print(rule_number, rule['antecedents'], '=>', rule['consequents'], 
                'confidence', '{0:.5f}'.format(rule['confidence']), 
                'lift', '{0:.5f}'.format(rule['lift']), 
                 'support', '{0:.5f}'.format(rule['support']) )
        rule_number += 1

rule_number = 1

print("\nBest support:\n")

for rule in rules.to_dict('records'):
    if rule['support'] >= .002:
        print(rule_number, rule['antecedents'], '=>', rule['consequents'], 
                'confidence', '{0:.5f}'.format(rule['confidence']), 
                'lift', '{0:.5f}'.format(rule['lift']), 
                 'support', '{0:.5f}'.format(rule['support']) )
        rule_number += 1

Best lift:

1 frozenset({'V8', 'V4', 'V6', 'V7'}) => frozenset({'V5'}) confidence 0.80153 lift 6.61343 support 0.00117
2 frozenset({'V9', 'V4', 'V6', 'V8'}) => frozenset({'V5'}) confidence 0.81356 lift 6.71271 support 0.00107
3 frozenset({'V6', 'V9', 'V7', 'V8', 'V4'}) => frozenset({'V5'}) confidence 0.84211 lift 6.94824 support 0.00089
4 frozenset({'V9', 'V7', 'V8', 'V5', 'V4'}) => frozenset({'V6'}) confidence 0.82474 lift 3.93753 support 0.00089
5 frozenset({'V9', 'V7', 'V8', 'V5', 'V4', 'V10'}) => frozenset({'V6'}) confidence 0.82353 lift 3.93174 support 0.00047

Best support:

1 frozenset({'V5', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.80369 lift 1.47483 support 0.00974
2 frozenset({'V5', 'V9', 'V10', 'V8'}) => frozenset({'V7'}) confidence 0.81641 lift 1.49817 support 0.00233
3 frozenset({'V5', 'V9', 'V6', 'V8'}) => frozenset({'V7'}) confidence 0.81463 lift 1.49492 support 0.00373


Five rules have a high lift. Take rule 2 as an example: when a respondent does NOT consider family (V4) AND religion (V9) AND leisure time (V6) AND work (V8) important, it is very likely (confidence 0.81) that they do NOT consider friends (V5) important either, and this is about 6.7 times more likely than among respondents in general. The other rules with the highest lift (>6) also have friends as the consequent. They seem to describe some kind of "general apathy" towards everything: there are individuals who do not think that anything is important for them. What is missing from these rules is interesting, however. Not being happy (V10) is present in only one of them, which goes against the "folk sociological/psychological" idea that this kind of general apathy should also be associated with not being happy.
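
To make the three measures concrete, here is a minimal sketch that computes support, confidence and lift by hand on a boolean table. The `rule_metrics` helper and the `toy` data are made up for illustration and are not part of mlxtend or the WVS file.

```python
import pandas as pd

def rule_metrics(df, antecedents, consequents):
    """Support, confidence and lift of antecedents => consequents on boolean data."""
    has_antecedents = df[list(antecedents)].all(axis=1)
    has_consequents = df[list(consequents)].all(axis=1)
    support = (has_antecedents & has_consequents).mean()   # share of rows matching the whole rule
    confidence = support / has_antecedents.mean()          # P(consequents | antecedents)
    lift = confidence / has_consequents.mean()             # confidence relative to the base rate
    return support, confidence, lift

# Toy data (hypothetical, 8 respondents)
toy = pd.DataFrame({
    'V4': [True, True, False, False, False, False, False, False],
    'V5': [True, True, False, False, True, False, False, False],
})

support, confidence, lift = rule_metrics(toy, ['V4'], ['V5'])
print(support, confidence, lift)

# Because lift = confidence / support(consequent), the printed rules also
# reveal base rates: for rule 2 this is the share of respondents with V5 TRUE.
print(0.81356 / 6.71271)   # roughly 0.12
print(0.80369 / 1.47483)   # same calculation for V7 in the best-support rule, roughly 0.54
```

Read this way, the high lift of the V5 rules partly reflects that few respondents consider friends unimportant to begin with, whereas not considering politics important is common, so rules ending in V7 can only reach a modest lift.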

However, none of the rules with high lift are among the rules with the highest support. For the rules with the best lift, support is at best about 0.001, which means that only about 0.1% of the cases match the whole rule. Hence, I would not consider that the rules we found provide interesting knowledge about how the variables are associated with each other in general. All the rules with the best support had quite a small lift.
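
The same comparison can be done with pandas filters instead of two separate loops. This is only a sketch: `rules_sample` holds four rules copied by hand from the output above, with the itemsets abbreviated to strings, and the thresholds are the ones used in the loops.

```python
import pandas as pd

# Four rules copied from the printed output (not the full rule set)
rules_sample = pd.DataFrame([
    {'rule': 'V8,V4,V6,V7 => V5', 'confidence': 0.80153, 'lift': 6.61343, 'support': 0.00117},
    {'rule': 'V9,V4,V6,V8 => V5', 'confidence': 0.81356, 'lift': 6.71271, 'support': 0.00107},
    {'rule': 'V5,V6,V8 => V7', 'confidence': 0.80369, 'lift': 1.47483, 'support': 0.00974},
    {'rule': 'V5,V9,V6,V8 => V7', 'confidence': 0.81463, 'lift': 1.49492, 'support': 0.00373},
])

# Rules that pass both thresholds at once
both = rules_sample[(rules_sample['lift'] >= 3) & (rules_sample['support'] >= 0.002)]

# Ranking by lift keeps the support column visible next to it
ranked = rules_sample.sort_values('lift', ascending=False)

print(len(both))
print(ranked[['rule', 'lift', 'support']])
```

On this sample no rule passes both filters, which is the pattern described above. The same boolean indexing should work on the `rules` DataFrame returned by `association_rules`, since it has `lift` and `support` columns.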


* Try adding more variables. Note that these can only be True/False variables.

The added variables are "Marital status" (V57) and "How many children do you have?" (V58). The first reason for choosing them is that they can easily be recoded as true/false variables (not married, no children). In addition, it is interesting to ask whether these variables are associated with what people consider important or with being happy.

In [37]:
data_with_extra_variables = pd.DataFrame()

variables = [ 'V10', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V57', 'V58' ]

for variable in variables:
    if variable == 'V58':
        data_with_extra_variables[ variable ] = all_data[ variable ] < 1 ## no children TRUE
    else:
        data_with_extra_variables[ variable ] = all_data[ variable ] >= 3 ## 3-4 not happy / not important  TRUE
                                                                          ## 3-6 not married TRUE

print(data_with_extra_variables.head())

     V10     V4     V5     V6     V7     V8     V9    V57    V58
0  False  False  False  False  False  False  False   True   True
1  False  False  False   True   True  False  False   True   True
2  False  False   True  False   True  False  False   True   True
3  False  False  False   True   True   True  False   True   True
4  False  False  False  False  False  False  False  False  False
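
One caveat about this recoding: comparisons such as `>= 3` and `< 1` silently assign a value to unanswered questions. The sketch below assumes (this is not verified against `data/wvs.csv`) that missing answers are stored as negative codes. Under that assumption, `-1 < 1` is TRUE, so a missing answer to V58 would count as "no children". Masking the negative codes first avoids this. The `toy` values are made up.

```python
import pandas as pd

# Made-up answers; negative codes are ASSUMED to mean "missing"
toy = pd.DataFrame({'V57': [1, 6, -2, 3], 'V58': [0, 2, -1, 0]})

valid = toy.where(toy >= 0)    # negative codes become NaN
complete = valid.dropna()      # keep respondents who answered both questions

recoded = pd.DataFrame({
    'V57': complete['V57'] >= 3,   # 3-6 not married TRUE
    'V58': complete['V58'] < 1,    # no children TRUE
})

print(recoded)
```

Dropping incomplete respondents is only one option. It shrinks the sample, so the support values would change slightly compared with the run above.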


In [44]:
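## mine frequent itemsets with a higher min_support than before
frequent_itemsets = apriori( data_with_extra_variables, min_support=0.001, use_colnames = True )
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

## derive rules; metric and min_threshold are written out but match the mlxtend defaults
rules = association_rules( frequent_itemsets, metric='confidence', min_threshold=0.8 )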

In [59]:
rule_count = 1

print('Lift >= 2.3: \n' )
for rule in rules.to_dict('records'):
    if rule['lift'] >= 2.3:
        print(rule_count, rule['antecedents'], '=>', rule['consequents'], 
                'confidence', '{0:.5f}'.format(rule['confidence']), 
                'lift', '{0:.5f}'.format(rule['lift']), 
                 'support', '{0:.5f}'.format(rule['support']) )
        rule_count += 1

rule_count = 1

print('\nSupport >= 0.01: \n' )
for rule in rules.to_dict('records'):
    if rule['support'] >= 0.01:
        print(rule_count, rule['antecedents'], '=>', rule['consequents'], 
                'confidence', '{0:.5f}'.format(rule['confidence']), 
                'lift', '{0:.5f}'.format(rule['lift']), 
                 'support', '{0:.5f}'.format(rule['support']) )
        rule_count += 1

        

Lift >= 2.3: 

1 frozenset({'V9', 'V4', 'V58'}) => frozenset({'V57'}) confidence 0.86364 lift 2.32371 support 0.00297
2 frozenset({'V4', 'V10', 'V7', 'V58'}) => frozenset({'V57'}) confidence 0.85714 lift 2.30624 support 0.00121
3 frozenset({'V8', 'V4', 'V6', 'V7'}) => frozenset({'V5'}) confidence 0.80153 lift 6.61343 support 0.00117
4 frozenset({'V9', 'V4', 'V6', 'V8'}) => frozenset({'V5'}) confidence 0.81356 lift 6.71271 support 0.00107
5 frozenset({'V58', 'V4', 'V6', 'V7'}) => frozenset({'V57'}) confidence 0.86842 lift 2.33658 support 0.00111
6 frozenset({'V9', 'V4', 'V58', 'V8'}) => frozenset({'V57'}) confidence 0.86667 lift 2.33186 support 0.00116

Support >= 0.01: 

1 frozenset({'V10', 'V7', 'V58'}) => frozenset({'V57'}) confidence 0.81648 lift 2.19684 support 0.02245
2 frozenset({'V9', 'V10', 'V58'}) => frozenset({'V57'}) confidence 0.80609 lift 2.16887 support 0.01123


With the added variables, some of the rules that had a high lift in the previous section (e.g. V8, V4, V6, V7 -> V5) still had the best lift. However, some rules with higher support (>0.01) emerged. One of these rules suggests that when a person is not happy AND does not consider politics important AND has no children, they are probably not married. The other rule is similar, but instead of NOT considering politics important the person does NOT consider religion important. It may not be surprising that not being married and not having children are associated. However, I was surprised that there were no strong associations between what a person considers important and whether they are unhappy. There does seem to be at least some kind of association between not being married and unhappiness, though.
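
Association rules are directional, so it matters on which side of `=>` V57 and V58 appear. A small sketch with made-up answers (not the WVS data) shows that confidence can differ sharply between the two directions while lift stays the same. The `confidence` helper is defined here only for illustration.

```python
import pandas as pd

# Made-up answers for illustration only
toy = pd.DataFrame({
    'V58': [True, True, True, False, False, False, False, False],   # no children
    'V57': [True, True, True, True, True, True, False, False],      # not married
})

def confidence(df, antecedent, consequent):
    """Share of rows with the antecedent that also have the consequent."""
    return (df[antecedent] & df[consequent]).sum() / df[antecedent].sum()

conf_forward = confidence(toy, 'V58', 'V57')    # no children => not married
conf_backward = confidence(toy, 'V57', 'V58')   # not married => no children

# Lift divides confidence by the base rate of the consequent and is symmetric
lift_forward = conf_forward / toy['V57'].mean()
lift_backward = conf_backward / toy['V58'].mean()

print(conf_forward, conf_backward)
print(lift_forward, lift_backward)
```

Since only rules with confidence of at least 0.8 are kept, a rule appearing in one direction does not imply that its reverse would also pass the threshold.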

## Some reflections

For some reason I did not find association rules an interesting machine learning method. Perhaps it is my lack of imagination, but I was not able to think of many (realistic) research questions that would require using association rules. This can of course be caused by two factors. First, the class reading, which was about leaders and government systems in different countries, did not provide any kind of revelation of how powerful and useful this method could be when "used the right way". Second, the frequent "shopping basket" example that is used to explain what association rules are about (obviously I had to google association rules a lot) framed the method in a way that kept me wondering "what kind of social scientific questions are similar to this task".

I guess that I can think of at least one case where association rules analysis would provide an interesting possibility (although it is quite an unrealistic case). Rogers Brubaker and his research team studied how "ethnicity works" (the study is reported in their book _Nationalist Politics and Everyday Ethnicity in a Transylvanian Town_). One interesting research stream was the question of when and how an interaction situation and those participating in it become framed as "ethnic". They collected data with ethnography and interviews. Analyzing this data produced a general understanding of what kinds of cues activated ethnic schemas; for example, a name that connected a stranger to the Hungarian minority was this kind of cue. It could be claimed that they produced some very general association rules between cues and the activation of ethnic schemas. In theory, this data (or similar large data) could be processed (or collected) so that there would be TRUE/FALSE statements of what was present in each interaction situation where ethnicity did or did not emerge. If this data was large enough, it could be analysed with the association rules method, and it could provide interesting and more detailed rules about how the environment, different cues and the activation of different ethnic schemas are connected to each other. This could help to illustrate some aspects of the interaction between cognition, action and the environment.
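
As a rough sketch of what such TRUE/FALSE coding could look like, each observed interaction can be written as a set of features and then expanded into a boolean table of the kind that `apriori` takes as input. The observations and feature names below are entirely hypothetical and are not taken from Brubaker's data.

```python
import pandas as pd

# Entirely hypothetical observations: each set lists what was present in one interaction
observations = [
    {'hungarian_name', 'ethnic_framing'},
    {'hungarian_name', 'public_office', 'ethnic_framing'},
    {'public_office'},
    {'hungarian_name'},
]

# All features that occur anywhere, in a fixed order
features = sorted(set().union(*observations))

# One row per interaction, one TRUE/FALSE column per feature
cue_matrix = pd.DataFrame(
    [{feature: feature in observation for feature in features}
     for observation in observations]
)

print(cue_matrix)
```

The resulting `cue_matrix` has the same shape as the `data` table used earlier, so the same rule-mining steps could in principle be applied to it.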

What I learned from this exercise was that, at least with the data that was used (and the variables I chose), it was really difficult to find rules that were interesting both in substance and statistically. The reason could be that this kind of data just does not provide a good foundation for association rules analysis. However, although I am a bit sceptical towards this method, it could be that with the right data and research question it could be used to find interesting things about the social world.