# Pattern mining in the *National Health Interview Survey*

My goal is to find out whether there are interesting patterns and rules in the co-occurrence of different medical conditions.

I used a data set available [here](https://www.cdc.gov/nchs/nhis/2020nhis.htm) in the *Sample Adult Interview* tab. The data were collected in 2020 as part of the *National Health Interview Survey*.

## Preprocessing of the data

### Let's get to know more about the data.

In [None]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
data = pd.read_csv('adult20.csv')

print('Number of attributes = %d' % (data.shape[1]))

data.head()

Number of attributes = 617


Unnamed: 0,URBRRL,RATCAT_A,INCGRP_A,INCTCFLG_A,FAMINCTC_A,IMPINCFLG_A,RJWKCLSOFT_A,RJWCLSNOSD_A,RJWRKCLSSD_A,RECJOBSD_A,...,PHSTAT_A,PROXYREL_A,PROXY_A,AVAIL_A,HHSTAT_A,INTV_MON,RECTYPE,WTFA_A,HHX,POVRATTC_A
0,3,14,5,0,100000,0,,,,,...,2,,,1,1,11,10,4526.109,H066706,6.47
1,3,11,4,0,75000,0,,,,,...,2,,,1,1,8,10,12809.039,H034928,3.64
2,3,14,4,0,90000,0,,,,,...,3,,,1,1,8,10,10322.534,H018289,6.76
3,3,11,3,0,65000,0,,,,,...,1,,,1,1,3,10,7743.375,H006876,3.79
4,3,8,1,0,25762,2,,,,,...,3,,,1,1,6,10,4144.724,H028842,2.1


### Dimensionality reduction

There are a lot of columns. In this example I want to know whether there are interesting rules about the co-occurrence of different conditions, which is why I picked only the columns that hold answers to the question *Ever been told that you have [name of the condition]?*

In [None]:
data = data[['HYPEV_A',
             'CHLEV_A',
             'ASEV_A',
             'CANEV_A',
             'DIBEV_A',
             'MIEV_A',
             'ANXEV_A',
             'DEPEV_A',
             'ARTHEV_A',
             'DEMENEV_A',
             'STREV_A',
             'CHDEV_A',
             'KIDWEAKEV_A',
             'HEPEV_A',
             'LIVEREV_A']]
print('Number of attributes = %d' % (data.shape[1]))
data.head()

Number of attributes = 15


Unnamed: 0,HYPEV_A,CHLEV_A,ASEV_A,CANEV_A,DIBEV_A,MIEV_A,ANXEV_A,DEPEV_A,ARTHEV_A,DEMENEV_A,STREV_A,CHDEV_A,KIDWEAKEV_A,HEPEV_A,LIVEREV_A
0,2,2,2,1,2,2,2,2,2,2,2,2,2.0,2.0,2.0
1,1,2,2,2,2,2,2,2,2,2,2,2,2.0,2.0,2.0
2,2,1,2,2,2,2,2,2,2,2,2,2,2.0,2.0,2.0
3,2,2,2,2,2,2,2,2,2,2,2,2,,,
4,1,2,2,2,2,2,2,1,1,2,2,2,,,


After that I have 15 columns relevant to the problem.

### Replacing non binary values with NaN

Each of these columns should contain only two values: 1, corresponding to the answer *Yes*, and 2, corresponding to the answer *No*. From the appendix to the dataset I know that the values 7, 8 and 9 correspond to *Refused*, *Not Ascertained* and *Don't Know*, so I replaced every value greater than 2 with NaN.

In [None]:
# mask() returns a new DataFrame instead of assigning into a slice,
# so no SettingWithCopyWarning is raised; codes 7, 8 and 9 become NaN
data = data.mask(data > 2)
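
As a quick illustration of the step above, here is a minimal sketch on a made-up frame (the name `toy` and its values are hypothetical, not taken from the survey). It uses `DataFrame.mask`, which is one way to turn the non-answer codes into NaN without assigning into an existing slice.

```python
import pandas as pd

# Hypothetical toy answers: 1 = Yes, 2 = No, 7/8/9 = non-answers
toy = pd.DataFrame({'HYPEV_A': [1, 2, 9],
                    'CHLEV_A': [7, 2, 1]})

# mask() replaces every cell where the condition holds with NaN
cleaned = toy.mask(toy > 2)

print(cleaned)
print(cleaned.isna().sum())
```

Only the cells holding 9 and 7 are replaced; the valid answers are left untouched.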


### Renaming columns




In [None]:
# the meanings of the data values can be found in the codebook from the NHIS
# https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2020/adult-codebook.pdf
data = data.rename(columns={'HYPEV_A':"hypertension",
                            'CHLEV_A':'cholesterol',
                            'ASEV_A':'asthma',
                            'CANEV_A':'cancer',
                            'DIBEV_A':'diabetes', 
                            'MIEV_A':'heart_attack',
                            'ANXEV_A':'anxiety', 
                            'DEPEV_A':'depression',
                            'ARTHEV_A': 'arthritis',
                            'DEMENEV_A': 'dementia',
                            'STREV_A': 'stroke',
                            'CHDEV_A': 'heart disease',
                            'KIDWEAKEV_A': 'failing kidneys',
                            'HEPEV_A': 'hepatitis',
                            'LIVEREV_A': 'liver condition'})
data.head()

Unnamed: 0,hypertension,cholesterol,asthma,cancer,diabetes,heart_attack,anxiety,depression,arthritis,dementia,stroke,heart disease,failing kidneys,hepatitis,liver condition
0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
1,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
2,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
3,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,
4,1.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,,,


### Removing missing data

In [None]:
print('Number of missing values:')
for col in data.columns:
    print('\t%s: %d' % (col, data[col].isna().sum()))

Number of missing values:
	hypertension: 51
	cholesterol: 108
	asthma: 29
	cancer: 34
	diabetes: 32
	heart_attack: 38
	anxiety: 51
	depression: 56
	arthritis: 43
	dementia: 19
	stroke: 30
	heart disease: 81
	failing kidneys: 13878
	hepatitis: 13874
	liver condition: 13874


I decided to replace the NaN values with the mode of each column.

In [None]:
# fill the missing values in every column with that column's mode
data = data.fillna(data.mode().iloc[0])

In [None]:
print('Number of missing values after:')
for col in data.columns:
    print('\t%s: %d' % (col, data[col].isna().sum()))

Number of missing values after:
	hypertension: 0
	cholesterol: 0
	asthma: 0
	cancer: 0
	diabetes: 0
	heart_attack: 0
	anxiety: 0
	depression: 0
	arthritis: 0
	dementia: 0
	stroke: 0
	heart disease: 0
	failing kidneys: 0
	hepatitis: 0
	liver condition: 0


### Removing duplicates

In [None]:
dups = data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

print('Number of rows before discarding duplicates = %d' % (data.shape[0]))
data = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % (data.shape[0]))

Number of duplicate rows = 29980
Number of rows before discarding duplicates = 31568
Number of rows after discarding duplicates = 1588
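
To see what this step does to the numbers that follow, consider a minimal sketch on a made-up frame (`toy` and its values are hypothetical). After `drop_duplicates` each row stands for a distinct combination of answers rather than for a respondent, so the share of *Yes* answers in a column, which is what Apriori later reports as support, is measured over distinct combinations.

```python
import pandas as pd

# Hypothetical toy data: 8 respondents, only 2 distinct answer patterns
# (1 = Yes, 2 = No, as in the survey coding)
toy = pd.DataFrame({'hypertension': [2] * 6 + [1] * 2,
                    'diabetes':     [2] * 6 + [1] * 2})

# share of Yes answers among all respondents
share_before = (toy['hypertension'] == 1).mean()

# share of Yes answers among distinct answer patterns
deduplicated = toy.drop_duplicates()
share_after = (deduplicated['hypertension'] == 1).mean()

print(len(toy), share_before)
print(len(deduplicated), share_after)
```

In this toy case the share of *Yes* answers rises from 0.25 to 0.5 once duplicates are dropped, which is worth keeping in mind when reading the support values below.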


### Replacing *1*, *2* values with *True*, *False*

This step is required before applying the *Apriori* algorithm.

In [None]:
# True where the answer is Yes (1), False where it is No (2);
# the comparison yields proper boolean columns
data = data == 1

data.head()

Unnamed: 0,hypertension,cholesterol,asthma,cancer,diabetes,heart_attack,anxiety,depression,arthritis,dementia,stroke,heart disease,failing kidneys,hepatitis,liver condition
0,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False


## Frequent patterns mining

### Finding frequent itemsets with Apriori algorithm with *min_sup=0.1*

In [None]:
frequent_itemsets = apriori(data, min_support=0.1, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

print(frequent_itemsets.head())
print(frequent_itemsets[frequent_itemsets['length'] > 1].head())
print(frequent_itemsets[frequent_itemsets['length'] > 2].head())

    support        itemsets  length
0  0.640428  (hypertension)       1
1  0.559194   (cholesterol)       1
2  0.324937        (asthma)       1
3  0.379723        (cancer)       1
4  0.366499      (diabetes)       1
     support                      itemsets  length
14  0.387280   (cholesterol, hypertension)       2
15  0.212846        (asthma, hypertension)       2
16  0.249370        (cancer, hypertension)       2
17  0.263224      (diabetes, hypertension)       2
18  0.196474  (hypertension, heart_attack)       2
     support                                   itemsets  length
65  0.132242        (cholesterol, asthma, hypertension)       3
66  0.152393        (cholesterol, cancer, hypertension)       3
67  0.178212      (diabetes, cholesterol, hypertension)       3
68  0.133501  (cholesterol, hypertension, heart_attack)       3
69  0.154912       (cholesterol, hypertension, anxiety)       3


Hypertension occurs in every itemset with more than one element printed above, probably because hypertension is a common condition (its support is 0.64).
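
For reference, the support of an itemset is the fraction of rows in which all of its items are `True`. The sketch below recomputes it by hand on a made-up frame (`toy` and the helper `support` are hypothetical names introduced for illustration); applied to `data`, the same expression should give the values in the `support` column above.

```python
import pandas as pd

# Hypothetical toy transactions, one row per answer pattern
toy = pd.DataFrame({'hypertension': [True, True, False, True],
                    'cholesterol':  [True, False, False, True],
                    'asthma':       [False, False, True, True]})

def support(frame, items):
    """Fraction of rows in which every column in `items` is True."""
    return frame[list(items)].all(axis=1).mean()

print(support(toy, ['hypertension']))                 # 3 of 4 rows
print(support(toy, ['hypertension', 'cholesterol']))  # 2 of 4 rows
```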

### Finding and evaluating association rules

Using the frequent itemsets I generated association rules. First, let's look at the values of *support*, *confidence* and *lift*.

In [None]:
# min_conf = 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# dropping leverage and conviction metrics
rules = rules.drop(['leverage', 'conviction'], axis=1)

rules.sort_values("lift", ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift
28,"(diabetes, hypertension, arthritis)",(cholesterol),0.154912,0.559194,0.110202,0.711382,1.272156
25,"(diabetes, depression)",(cholesterol),0.156801,0.559194,0.111461,0.710843,1.271193
27,"(diabetes, cholesterol, arthritis)",(hypertension),0.139169,0.640428,0.110202,0.791855,1.236446
12,"(cholesterol, stroke)",(hypertension),0.157431,0.640428,0.122166,0.776,1.211689
19,"(diabetes, heart disease)",(hypertension),0.13728,0.640428,0.106423,0.775229,1.210486


The *lift* values of these rules are greater than 1 (between 1.21 and 1.27), so I can conclude that these rules' itemsets are positively correlated, although only weakly.
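
The relation between the columns can be checked by hand. A minimal sketch, using the values of rule 28 copied from the table above (the variable names are illustrative): confidence is the rule's support divided by the antecedent support, and lift is confidence divided by the consequent support, so a lift above 1 means the consequent is more frequent among rows matching the antecedent than overall.

```python
# Values of rule 28: (diabetes, hypertension, arthritis) -> (cholesterol)
antecedent_support = 0.154912
consequent_support = 0.559194
rule_support = 0.110202

# confidence = P(consequent | antecedent)
confidence = rule_support / antecedent_support

# lift = confidence relative to the baseline frequency of the consequent
lift = confidence / consequent_support

print(round(confidence, 4), round(lift, 4))
```

Because the inputs are rounded to six decimals, the results agree with the table only up to rounding.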

Since not many people in the population have a lot of conditions, I have to take into consideration the influence of null-transactions, that is, transactions that do not contain any of the itemsets being examined. I picked two null-invariant measures that were recommended during lectures: *Kulczynski* and *Imbalance Ratio*.

In [None]:
# adding Kulczynski and IR metrics
rules.loc[:, 'Kulczynski'] = 0.5*(rules['support']/rules['antecedent support']+rules['support']/rules['consequent support'])
rules.loc[:, 'Imbalance ratio'] = (rules['antecedent support']-rules['consequent support']).abs()/(rules['antecedent support']+rules['consequent support']-rules['support'])

rules.sort_values("lift", ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,Kulczynski,Imbalance ratio
28,"(diabetes, hypertension, arthritis)",(cholesterol),0.154912,0.559194,0.110202,0.711382,1.272156,0.454227,0.669447
25,"(diabetes, depression)",(cholesterol),0.156801,0.559194,0.111461,0.710843,1.271193,0.455084,0.665625
27,"(diabetes, cholesterol, arthritis)",(hypertension),0.139169,0.640428,0.110202,0.791855,1.236446,0.481965,0.748824
12,"(cholesterol, stroke)",(hypertension),0.157431,0.640428,0.122166,0.776,1.211689,0.483379,0.714818
19,"(diabetes, heart disease)",(hypertension),0.13728,0.640428,0.106423,0.775229,1.210486,0.470702,0.749531


After analyzing Kulczynski and IR I can conclude that the relationships in these rules are neutral (*Kulc* values close to 0.5) and imbalanced (relatively high *IR*).
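
Why null-invariant measures matter here can be shown with a minimal sketch on made-up counts (the function `measures` and all numbers are hypothetical). The counts of transactions containing A, B, and both stay fixed, and only the number of null-transactions grows: lift changes, while Kulczynski and the imbalance ratio do not.

```python
import math

def measures(n_a, n_b, n_ab, n_total):
    """Return (lift, Kulczynski, imbalance ratio) from raw transaction counts."""
    sup_a = n_a / n_total
    sup_b = n_b / n_total
    sup_ab = n_ab / n_total
    lift = sup_ab / (sup_a * sup_b)
    kulc = 0.5 * (sup_ab / sup_a + sup_ab / sup_b)
    ir = abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)
    return lift, kulc, ir

# Same A/B counts, different numbers of null-transactions
lift_small, kulc_small, ir_small = measures(100, 400, 80, n_total=1_000)
lift_large, kulc_large, ir_large = measures(100, 400, 80, n_total=100_000)

print(lift_small, kulc_small, ir_small)
print(lift_large, kulc_large, ir_large)

# Kulczynski and IR are unaffected by the extra null-transactions
print(math.isclose(kulc_small, kulc_large), math.isclose(ir_small, ir_large))
```

In this toy case lift grows by a factor of 100 when null-transactions are added, while Kulczynski stays at 0.5, which is why a lift above 1 alone is weak evidence of an interesting rule.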

Overall these rules aren't very interesting. Hypertension and cholesterol are frequent in the dataset, and even though they are more likely to occur if a person also has the other conditions listed above, it is not surprising that someone with several conditions is also more likely to have conditions that occur frequently in the population.