# **FP-Growth Exploration of a Covid-19 Dataset (mlxtend on Colab Notebook)**
A very simple exploration that looks for associations in a Covid-19 dataset using the FP-Growth (Frequent Pattern Growth) algorithm from the Python library mlxtend. <br><br> The goal is to examine how the following factors relate to one another: hospitalization status, ICU admission, presence of a comorbid condition, gender, and death status. The experiment is run twice: Round #1 excludes the gender component; Round #2 includes it. <br><br> 

The questions to explore include:
*    Is a history of comorbid conditions associated with ICU admission?
*    Is gender associated with the death rate? <br><br>

Dataset source: "Covid-19 Case Surveillance" (https://www.kaggle.com/arashnic/covid19-case-surveillance-public-use-dataset). The dataset contains de-identified individual patient records covering demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. An overview of the dataset contents is available at: https://github.com/is004/eksplorasi-dataset-covid-19/blob/main/Eksplorasi_Dataset_Covid-19_IS004_01A.ipynb <br><br> Steps: <br> 

1.   Connect Google Colab to Google Drive and unzip the dataset
2.   Install and import the libraries
3.   Load the dataset into a dataframe <br><br>
4.   Preprocess and transform the dataset (Round #1: without "sex")
5.   Run fpgrowth to find frequent itemsets
6.   Generate association rules
7.   Interpret the results and draw conclusions <br><br>
8.   Preprocess and transform the dataset (Round #2: with "sex")
9.   Run fpgrowth to find frequent itemsets
10.  Generate association rules
11.  Interpret the results and draw conclusions



# **PREPARATION**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Unzip the dataset file:

In [2]:
!unzip '/content/drive/MyDrive/DATASET1/Covid-19/Kaggle_Covid-19.zip'

Archive:  /content/drive/MyDrive/DATASET1/Covid-19/Kaggle_Covid-19.zip
  inflating: COVID-19_Case_Surveillance_Public_Use_Data.csv  


Install packages: <br> (required: mlxtend-0.21.0)

In [3]:
!pip install mlxtend --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mlxtend
  Downloading mlxtend-0.21.0-py2.py3-none-any.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 6.8 MB/s eta 0:00:00
Installing collected packages: mlxtend
  Attempting uninstall: mlxtend
    Found existing installation: mlxtend 0.14.0
    Uninstalling mlxtend-0.14.0:
      Successfully uninstalled mlxtend-0.14.0
Successfully installed mlxtend-0.21.0


In [4]:
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd

Load the Covid-19 dataset into a dataframe:

In [5]:
data = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [6]:
data.head()

Unnamed: 0,cdc_report_dt,pos_spec_dt,onset_dt,current_status,sex,age_group,Race and ethnicity (combined),hosp_yn,icu_yn,death_yn,medcond_yn
0,2020/11/10,2020/11/10,,Laboratory-confirmed case,Male,10 - 19 Years,"Black, Non-Hispanic",No,Unknown,No,No
1,2020/11/14,2020/11/10,2020/11/10,Laboratory-confirmed case,Male,10 - 19 Years,"Black, Non-Hispanic",No,No,No,No
2,2020/11/19,2020/11/10,2020/11/09,Laboratory-confirmed case,Male,10 - 19 Years,"Black, Non-Hispanic",No,No,No,No
3,2020/11/14,2020/11/10,,Laboratory-confirmed case,Male,10 - 19 Years,"Black, Non-Hispanic",Missing,Missing,No,Missing
4,2020/11/13,2020/11/10,2020/11/10,Laboratory-confirmed case,Male,10 - 19 Years,"Black, Non-Hispanic",No,No,No,Yes


In [7]:
data.shape

(8405079, 11)

*****
# **ROUND #1**

*****
**PREPROCESSING**

Create a new dataframe (data1) containing only the columns to explore: <br>
(Round #1 uses only the columns "hosp_yn", "icu_yn", "death_yn", and "medcond_yn", in order to examine the relationships among these four statuses) <br><br> Column meanings: hosp_yn is hospitalization status; icu_yn is ICU admission status; death_yn is death status; medcond_yn indicates a history of comorbid conditions.

In [8]:
data1 = data[["hosp_yn", "icu_yn", "death_yn", "medcond_yn"]]
data1.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn
0,No,Unknown,No,No
1,No,No,No,No
2,No,No,No,No
3,Missing,Missing,No,Missing
4,No,No,No,Yes


Check the unique values in each column:

In [9]:
data1["hosp_yn"].unique()

array(['No', 'Missing', 'Unknown', 'Yes'], dtype=object)

In [10]:
data1["icu_yn"].unique()

array(['Unknown', 'No', 'Missing', 'Yes'], dtype=object)

In [11]:
data1["death_yn"].unique()

array(['No', 'Missing', 'Unknown', 'Yes'], dtype=object)

In [12]:
data1["medcond_yn"].unique()

array(['No', 'Missing', 'Yes', 'Unknown'], dtype=object)

Drop rows containing null, Missing, or Unknown; store the result in data1clean:

In [13]:
data1clean = data1.dropna()
# Filter the null-free frame (not the original data1) so the dropna() result is kept
unwanted = ['Missing', 'Unknown']
data1clean = data1clean.loc[~data1clean.isin(unwanted).any(axis=1)]

In [14]:
data1clean.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn
1,No,No,No,No
2,No,No,No,No
4,No,No,No,Yes
11,No,No,No,No
14,No,No,No,Yes


In [15]:
data1clean.count()

hosp_yn       550507
icu_yn        550507
death_yn      550507
medcond_yn    550507
dtype: int64
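
The filtering and Yes/No conversion used in this round can be sketched in plain Python, independent of pandas, to make the logic explicit. The records, the `DEFINITE` set, and the helper names `keep_definite` and `to_boolean` are illustrative assumptions, not part of the notebook.

```python
# Toy records standing in for rows of the surveillance data (illustrative values).
records = [
    {"hosp_yn": "No", "icu_yn": "Unknown", "death_yn": "No", "medcond_yn": "No"},
    {"hosp_yn": "No", "icu_yn": "No", "death_yn": "No", "medcond_yn": "Yes"},
    {"hosp_yn": "Missing", "icu_yn": "Missing", "death_yn": "No", "medcond_yn": "Missing"},
    {"hosp_yn": "Yes", "icu_yn": "Yes", "death_yn": "Yes", "medcond_yn": "Yes"},
    {"hosp_yn": "Yes", "icu_yn": None, "death_yn": "No", "medcond_yn": "Yes"},
]

# Only these two values count as a definite answer.
DEFINITE = {"Yes", "No"}


def keep_definite(rows):
    """Keep rows in which every field is a definite 'Yes' or 'No'."""
    return [row for row in rows if all(value in DEFINITE for value in row.values())]


def to_boolean(rows):
    """Map 'Yes' to True and 'No' to False for every field."""
    return [{key: value == "Yes" for key, value in row.items()} for row in rows]


clean = to_boolean(keep_definite(records))
print(len(clean))
print(clean[0])
```

Rows with a null, Missing, or Unknown value in any field are dropped as a whole, which is why the cleaned dataframe is so much smaller than the original.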

Transform the data to match the input format expected by mlxtend's fpgrowth: <br> (convert Yes/No to boolean True/False)

In [16]:
data1clean = data1clean.replace({'Yes': True, 'No': False})

In [17]:
data1clean.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn
1,False,False,False,False
2,False,False,False,False
4,False,False,False,True
11,False,False,False,False
14,False,False,False,True


In [18]:
data1clean.dtypes

hosp_yn       bool
icu_yn        bool
death_yn      bool
medcond_yn    bool
dtype: object

*****
**FP-GROWTH PROCESSING #1**

Run FP-Growth to find frequent itemsets and store the result as the dataframe data_freq: <br> (parameter min_support=0.03)

In [19]:
freqItemsets = fpgrowth(data1clean, min_support=0.03, use_colnames = True)

In [20]:
data_freq = pd.DataFrame(freqItemsets)

In [21]:
data_freq.sort_values(by=['support'], ascending=False)

Unnamed: 0,support,itemsets
0,0.52286,(medcond_yn)
1,0.170764,(hosp_yn)
4,0.143593,"(medcond_yn, hosp_yn)"
2,0.059189,(death_yn)
3,0.058108,(icu_yn)
8,0.057229,"(icu_yn, hosp_yn)"
5,0.055258,"(death_yn, medcond_yn)"
9,0.05134,"(icu_yn, medcond_yn)"
6,0.050859,"(death_yn, hosp_yn)"
11,0.050762,"(icu_yn, medcond_yn, hosp_yn)"
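
To make the "support" column concrete, here is a minimal sketch that computes supports by brute force on a handful of hypothetical transactions. It is not how FP-Growth works internally (FP-Growth avoids enumerating every combination by building an FP-tree), but it should yield the same set of frequent itemsets on small inputs. The transactions and the helper names `support` and `frequent_itemsets` are assumptions made for illustration.

```python
from itertools import combinations
from math import ceil

# Hypothetical transactions: each row is the set of items that are True.
transactions = [
    {"medcond_yn"},
    {"medcond_yn", "hosp_yn"},
    {"medcond_yn", "hosp_yn", "icu_yn"},
    {"hosp_yn"},
    set(),
    {"medcond_yn"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def frequent_itemsets(rows, min_support):
    """Brute-force enumeration of all itemsets meeting `min_support`."""
    items = sorted(set().union(*rows))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(set(combo), rows)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


freq = frequent_itemsets(transactions, min_support=0.3)
for itemset, s in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 3))

# Minimum number of rows an itemset needs at min_support=0.03 on 550,507 rows.
min_rows = ceil(0.03 * 550507)
print(min_rows)
```

With min_support=0.03 and the 550,507 cleaned rows of Round #1, an itemset has to occur in at least `min_rows` rows to appear in the table.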


Generate association rules from the frequent itemsets and store the result as the dataframe data_assoc: <br> (parameters metric="lift"; min_threshold=1)

In [22]:
assocRules = association_rules(freqItemsets, metric="lift", min_threshold=1)

In [23]:
data_assoc = pd.DataFrame(assocRules)

In [24]:
data_assoc.sort_values(by=['lift'], ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
25,"(death_yn, hosp_yn)",(icu_yn),0.050859,0.058108,0.032027,0.629724,10.837076,0.029072,2.543753
28,(icu_yn),"(death_yn, hosp_yn)",0.058108,0.050859,0.032027,0.551158,10.837076,0.029072,2.114646
27,(death_yn),"(icu_yn, hosp_yn)",0.059189,0.057229,0.032027,0.541094,9.454878,0.02864,2.054387
26,"(icu_yn, hosp_yn)",(death_yn),0.057229,0.059189,0.032027,0.559625,9.454878,0.02864,2.136388
17,(icu_yn),(death_yn),0.058108,0.059189,0.032123,0.552815,9.339815,0.028684,2.103852
16,(death_yn),(icu_yn),0.059189,0.058108,0.032123,0.54272,9.339815,0.028684,2.059772
21,(icu_yn),"(medcond_yn, hosp_yn)",0.058108,0.143593,0.050762,0.873582,6.08373,0.042418,6.774382
20,"(medcond_yn, hosp_yn)",(icu_yn),0.143593,0.058108,0.050762,0.353515,6.08373,0.042418,1.456943
29,(hosp_yn),"(death_yn, icu_yn)",0.170764,0.032123,0.032027,0.18755,5.838471,0.026541,1.191306
24,"(death_yn, icu_yn)",(hosp_yn),0.032123,0.170764,0.032027,0.997003,5.838471,0.026541,276.683062
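
The metric columns in this table follow from three supports. The sketch below recomputes them for rule 17, (icu_yn) --> (death_yn), using the rounded values shown above, so the results agree with the table only to about three decimal places. The function name `rule_metrics` is an illustrative assumption; the formulas are the standard definitions of confidence, lift, leverage, and conviction.

```python
def rule_metrics(antecedent_support, consequent_support, joint_support):
    """Standard association-rule metrics for a rule A --> C.

    antecedent_support: support(A)
    consequent_support: support(C)
    joint_support:      support(A and C together)
    """
    confidence = joint_support / antecedent_support
    lift = confidence / consequent_support
    leverage = joint_support - antecedent_support * consequent_support
    if confidence < 1:
        conviction = (1 - consequent_support) / (1 - confidence)
    else:
        conviction = float("inf")  # the rule never fails
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rule 17: (icu_yn) --> (death_yn), supports copied from the table above.
metrics = rule_metrics(
    antecedent_support=0.058108,
    consequent_support=0.059189,
    joint_support=0.032123,
)
for name, value in metrics.items():
    print(name, round(value, 4))
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent; a lift close to 1 means the rule carries little information.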


*****
**SIMPLE INTERPRETATION & CONCLUSIONS OF EXPERIMENT ROUND #1** <br><br>


*   The rules (antecedents --> consequents) with the highest confidence (> 0.98) are: (death_yn, icu_yn) --> (hosp_yn); (icu_yn, medcond_yn) --> (hosp_yn); (icu_yn) --> (hosp_yn). Confidence is read from antecedent to consequent, so these values show that nearly everyone admitted to the ICU was also recorded as hospitalized.
*   For rule 13, (hosp_yn) --> (icu_yn), confidence is fairly low (0.335): only about a third of hospitalized patients were admitted to the ICU.
*   The most interesting results are the top 6 rules, all with lift > 9.33. Three of these six have death_yn in the consequents and icu_yn in the antecedents. For example, (icu_yn) --> (death_yn) has confidence 0.553 against a baseline death_yn support of 0.059, which indicates a much higher death rate among ICU patients.
*   This picture is supported by rules 20 and 21, which link ICU admission with comorbidity at a fairly high lift of 6.08. The direction that matters here is rule 21, (icu_yn) --> (medcond_yn, hosp_yn), with confidence 0.874: most ICU patients had a history of comorbid conditions. The reverse, rule 20, (medcond_yn, hosp_yn) --> (icu_yn), has a confidence of only 0.354.
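
When reading these rules, keep in mind that confidence depends on the direction of a rule while lift does not. A small sketch using supports copied from the Round #1 frequent-itemsets table illustrates this for icu_yn and hosp_yn. The `support` dictionary and the helper names `confidence` and `lift` are illustrative assumptions.

```python
# Supports copied from the Round #1 frequent-itemsets table.
support = {
    frozenset({"icu_yn"}): 0.058108,
    frozenset({"hosp_yn"}): 0.170764,
    frozenset({"icu_yn", "hosp_yn"}): 0.057229,
}


def confidence(antecedent, consequent):
    """support(A and C) / support(A): share of A-rows that also contain C."""
    joint = frozenset(antecedent) | frozenset(consequent)
    return support[joint] / support[frozenset(antecedent)]


def lift(antecedent, consequent):
    """confidence(A --> C) / support(C); identical in both directions."""
    return confidence(antecedent, consequent) / support[frozenset(consequent)]


icu_to_hosp = confidence({"icu_yn"}, {"hosp_yn"})
hosp_to_icu = confidence({"hosp_yn"}, {"icu_yn"})

print(round(icu_to_hosp, 3))   # share of ICU patients who were hospitalized
print(round(hosp_to_icu, 3))   # share of hospitalized patients who were in the ICU
print(round(lift({"icu_yn"}, {"hosp_yn"}), 3))
print(round(lift({"hosp_yn"}, {"icu_yn"}), 3))
```

The two confidences differ by almost a factor of three, so a high-confidence rule should only be read from its antecedent to its consequent.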



*****
# **ROUND #2**

*****
**PREPROCESSING**

Create a new dataframe (data2) containing 5 columns: <br>
("hosp_yn", "icu_yn", "death_yn", "medcond_yn", "sex" )

In [25]:
data2 = data[["hosp_yn", "icu_yn", "death_yn", "medcond_yn", "sex"]]
data2.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn,sex
0,No,Unknown,No,No,Male
1,No,No,No,No,Male
2,No,No,No,No,Male
3,Missing,Missing,No,Missing,Male
4,No,No,No,Yes,Male


Check the unique values in the "sex" column: <br>
(the unique values of the other columns were already checked in Round #1)

In [26]:
data2["sex"].unique()

array(['Male', 'Unknown', 'Missing', 'Female', 'Other', nan], dtype=object)

Drop rows containing null; keep only rows containing Yes, No, Male, or Female; store the result in data2clean: <br>
(for the "sex" column only Male and Female are kept; for the other columns only Yes and No are kept)

In [27]:
data2clean = data2.dropna()
# Filter the null-free frame (not the original data2) so the dropna() result is kept
yn_cols = ['hosp_yn', 'icu_yn', 'death_yn', 'medcond_yn']
data2clean = data2clean.loc[data2clean[yn_cols].isin(['Yes', 'No']).all(axis=1) &
                            data2clean['sex'].isin(['Male', 'Female'])]

Verify the unique values of the "sex" column in data2clean:

In [28]:
data2clean['sex'].unique()

array(['Male', 'Female'], dtype=object)

In [29]:
data2clean.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn,sex
1,No,No,No,No,Male
2,No,No,No,No,Male
4,No,No,No,Yes,Male
11,No,No,No,No,Male
14,No,No,No,Yes,Male


In [30]:
data2clean.count()

hosp_yn       548892
icu_yn        548892
death_yn      548892
medcond_yn    548892
sex           548892
dtype: int64

Transform the data to match the input format expected by mlxtend's fpgrowth and store it in data2transform:
* Split the "sex" column into 2 columns, "male_yn" and "female_yn": 1) create a new column "female_yn" as a copy of "sex"; 2) rename "sex" to "male_yn"
* In "male_yn", change Male to True and Female to False; in "female_yn", change Male to False and Female to True
* Change Yes/No in the other columns to True/False

In [31]:
data2transform = data2clean.assign(female_yn=data2clean.sex)
data2transform = data2transform.rename(columns = {'sex':'male_yn'})

In [32]:
# Encode each sex as its own boolean item column
data2transform['male_yn'] = data2transform['male_yn'] == 'Male'
data2transform['female_yn'] = data2transform['female_yn'] == 'Female'
# Convert the remaining Yes/No columns to booleans
yn_cols = ['hosp_yn', 'icu_yn', 'death_yn', 'medcond_yn']
data2transform[yn_cols] = data2transform[yn_cols] == 'Yes'
data2transform.head()

Unnamed: 0,hosp_yn,icu_yn,death_yn,medcond_yn,male_yn,female_yn
1,False,False,False,False,True,False
2,False,False,False,False,True,False
4,False,False,False,True,True,False
11,False,False,False,False,True,False
14,False,False,False,True,True,False


In [33]:
data2transform.dtypes

hosp_yn       bool
icu_yn        bool
death_yn      bool
medcond_yn    bool
male_yn       bool
female_yn     bool
dtype: object
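
Splitting "sex" into "male_yn" and "female_yn" is a form of one-hot encoding: each category becomes its own boolean item so that it can take part in itemsets. A minimal plain-Python sketch of the idea follows; the helper name `one_hot` and the sample rows are illustrative assumptions.

```python
def one_hot(rows, column, categories):
    """Replace `column` with one boolean field per category, named '<category>_yn'."""
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{category.lower()}_yn"] = row[column] == category
        encoded.append(new_row)
    return encoded


rows = [
    {"death_yn": False, "sex": "Male"},
    {"death_yn": True, "sex": "Female"},
]

encoded = one_hot(rows, "sex", ["Male", "Female"])
for row in encoded:
    print(row)
```

Because the two columns are exact complements after filtering, every row contains exactly one of the two items, which is why both show a support close to 0.5 in the frequent-itemsets table.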

*****
**FP-GROWTH PROCESSING #2**

Run FP-Growth to find frequent itemsets and store the result as the dataframe data_freq2: <br> (parameter min_support=0.03)

In [34]:
freqItemsets2 = fpgrowth(data2transform, min_support=0.03, use_colnames = True)

In [35]:
data_freq2 = pd.DataFrame(freqItemsets2)

In [36]:
data_freq2.sort_values(by=['support'], ascending=False)

Unnamed: 0,support,itemsets
2,0.531126,(female_yn)
1,0.52315,(medcond_yn)
0,0.468874,(male_yn)
7,0.27749,"(female_yn, medcond_yn)"
6,0.24566,"(male_yn, medcond_yn)"
3,0.170959,(hosp_yn)
10,0.143755,"(medcond_yn, hosp_yn)"
9,0.091233,"(male_yn, hosp_yn)"
8,0.079726,"(female_yn, hosp_yn)"
12,0.077101,"(male_yn, medcond_yn, hosp_yn)"


Generate association rules from the frequent itemsets and store the result as the dataframe data_assoc2: <br> (parameters metric="lift"; min_threshold=1)

In [37]:
assocRules2 = association_rules(freqItemsets2, metric="lift", min_threshold=1)

In [38]:
data_assoc2 = pd.DataFrame(assocRules2)

In [39]:
data_assoc2.sort_values(by=['lift'], ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
84,(icu_yn),"(death_yn, hosp_yn)",0.058190,0.050934,0.032072,0.551158,10.821134,0.029108,2.114480
81,"(death_yn, hosp_yn)",(icu_yn),0.050934,0.058190,0.032072,0.629681,10.821134,0.029108,2.543242
89,(death_yn),"(icu_yn, medcond_yn)",0.059270,0.051411,0.030028,0.506624,9.854420,0.026981,1.922650
88,"(icu_yn, medcond_yn)",(death_yn),0.051411,0.059270,0.030028,0.584075,9.854420,0.026981,2.261775
83,(death_yn),"(icu_yn, hosp_yn)",0.059270,0.057308,0.032072,0.541112,9.442145,0.028675,2.054297
...,...,...,...,...,...,...,...,...,...
12,"(medcond_yn, hosp_yn)",(male_yn),0.143755,0.468874,0.077101,0.536334,1.143878,0.009698,1.145494
3,(hosp_yn),(male_yn),0.170959,0.468874,0.091233,0.533654,1.138161,0.011075,1.138910
2,(male_yn),(hosp_yn),0.468874,0.170959,0.091233,0.194579,1.138161,0.011075,1.029326
1,(medcond_yn),(male_yn),0.523150,0.468874,0.245660,0.469579,1.001504,0.000369,1.001330


In [40]:
data_assoc2.sort_values(by=['lift'], ascending=False).head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
84,(icu_yn),"(death_yn, hosp_yn)",0.05819,0.050934,0.032072,0.551158,10.821134,0.029108,2.11448
81,"(death_yn, hosp_yn)",(icu_yn),0.050934,0.05819,0.032072,0.629681,10.821134,0.029108,2.543242
89,(death_yn),"(icu_yn, medcond_yn)",0.05927,0.051411,0.030028,0.506624,9.85442,0.026981,1.92265
88,"(icu_yn, medcond_yn)",(death_yn),0.051411,0.05927,0.030028,0.584075,9.85442,0.026981,2.261775
83,(death_yn),"(icu_yn, hosp_yn)",0.05927,0.057308,0.032072,0.541112,9.442145,0.028675,2.054297
82,"(icu_yn, hosp_yn)",(death_yn),0.057308,0.05927,0.032072,0.559639,9.442145,0.028675,2.136269
46,(death_yn),(icu_yn),0.05927,0.05819,0.032168,0.542741,9.32706,0.02872,2.059687
47,(icu_yn),(death_yn),0.05819,0.05927,0.032168,0.552818,9.32706,0.02872,2.103683
90,(icu_yn),"(death_yn, medcond_yn)",0.05819,0.055333,0.030028,0.51603,9.325852,0.026808,1.951912
87,"(death_yn, medcond_yn)",(icu_yn),0.055333,0.05819,0.030028,0.542671,9.325852,0.026808,2.05937
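
Whether the sex items appear among the highest-lift rules can be checked mechanically rather than by eye. The sketch below does this for the ten rows displayed above, copied by hand into plain Python sets; the names `top_rules` and `rules_mentioning` are illustrative assumptions.

```python
SEX_ITEMS = {"male_yn", "female_yn"}

# (antecedents, consequents, lift) for the ten rows displayed above.
top_rules = [
    ({"icu_yn"}, {"death_yn", "hosp_yn"}, 10.821134),
    ({"death_yn", "hosp_yn"}, {"icu_yn"}, 10.821134),
    ({"death_yn"}, {"icu_yn", "medcond_yn"}, 9.854420),
    ({"icu_yn", "medcond_yn"}, {"death_yn"}, 9.854420),
    ({"death_yn"}, {"icu_yn", "hosp_yn"}, 9.442145),
    ({"icu_yn", "hosp_yn"}, {"death_yn"}, 9.442145),
    ({"death_yn"}, {"icu_yn"}, 9.327060),
    ({"icu_yn"}, {"death_yn"}, 9.327060),
    ({"icu_yn"}, {"death_yn", "medcond_yn"}, 9.325852),
    ({"death_yn", "medcond_yn"}, {"icu_yn"}, 9.325852),
]


def rules_mentioning(rules, items):
    """Return the rules whose antecedents or consequents contain any of `items`."""
    return [rule for rule in rules if (rule[0] | rule[1]) & items]


sex_rules = rules_mentioning(top_rules, SEX_ITEMS)
death_rules = rules_mentioning(top_rules, {"death_yn"})

print(len(sex_rules))
print(len(death_rules))
```

In the notebook itself, a similar test could presumably be applied to the frozensets stored in the antecedents and consequents columns of data_assoc2.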


*****
**SIMPLE INTERPRETATION & CONCLUSIONS OF EXPERIMENT ROUND #2** <br><br>


*   Overall, the results of experiment #2 do not differ much from those of experiment #1.
*   The "sex" element turns out to have little association with the other elements. This can be seen from the appearance of (male_yn) in rule 62, with a fairly low confidence of 0.399, and from the fact that none of the ten rules with the highest lift contains (male_yn) or (female_yn).
*   It can also be concluded that "sex" is not among the factors most strongly associated with death: no rule of the form (male_yn) --> (death_yn) or (female_yn) --> (death_yn) appears among the 20 rules with the highest lift.
*   More comprehensive associations might be found by including other features from the dataset beyond the 5 features selected here.



Save the resulting dataframes to csv files:

In [41]:
data_freq.to_csv('/content/drive/MyDrive/DATASET1/Covid-19/Output_FPG/data_freq.csv')
data_assoc.to_csv('/content/drive/MyDrive/DATASET1/Covid-19/Output_FPG/data_assoc.csv')
data_freq2.to_csv('/content/drive/MyDrive/DATASET1/Covid-19/Output_FPG/data_freq2.csv')
data_assoc2.to_csv('/content/drive/MyDrive/DATASET1/Covid-19/Output_FPG/data_assoc2.csv')
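
One thing to be aware of when saving: mlxtend stores itemsets as frozenset objects, so the csv files contain their text representation (for example `frozenset({'medcond_yn'})`), which is awkward to read back. A minimal sketch of flattening itemsets into sorted, comma-separated text before writing is shown below, using only the standard library. The helper name `itemset_to_text` and the two sample rows (taken from the Round #1 table) are illustrative assumptions.

```python
import csv
import io


def itemset_to_text(itemset):
    """Turn a frozenset of item names into a stable, readable string."""
    return ", ".join(sorted(itemset))


# Two rows from the Round #1 frequent-itemsets table.
rows = [
    {"support": 0.52286, "itemsets": frozenset({"medcond_yn"})},
    {"support": 0.143593, "itemsets": frozenset({"medcond_yn", "hosp_yn"})},
]

# Write to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["support", "itemsets"])
writer.writeheader()
for row in rows:
    writer.writerow(
        {"support": row["support"], "itemsets": itemset_to_text(row["itemsets"])}
    )

csv_text = buffer.getvalue()
print(csv_text)
```

Sorting the item names makes the output independent of the arbitrary iteration order of a frozenset.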