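# 1. How to Generate the Dataset

Korea National Health and Nutrition Examination Survey (KNHANES) data may be used only after agreeing to the terms of use of the Korea Disease Control and Prevention Agency (KDCA), so the raw data cannot be redistributed here.
Instead, the data processing code is provided. Run it in the following order.

1. Download the data (12 years, 2010 to 2021)
2. Generate the dataset

## [Generation Step 1] Download the data

From the KNHANES website ( https://knhanes.kdca.go.kr/knhanes/sub03/sub03_02_05.do ),
download the 12 '**SAS**' files of the '**Basic DB**' (기본DB) for 2010 through 2021.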

>hn10_all.sas7bdat<br>
>hn11_all.sas7bdat<br>
>hn12_all.sas7bdat<br>
>hn13_all.sas7bdat<br>
>hn14_all.sas7bdat<br>
>hn15_all.sas7bdat<br>
>hn16_all.sas7bdat<br>
>hn17_all.sas7bdat<br>
>hn18_all.sas7bdat<br>
>hn19_all.sas7bdat<br>
>hn20_all.sas7bdat<br>
>hn21_all.sas7bdat

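![Data download](fig1_download.png)

## [Generation Step 2] Place the processing code and the data in one folder

After downloading, place<br>
nationalhealth_preprocessing.ipynb (the Jupyter notebook),<br>
meta_data20.xlsx (the metadata Excel file), and<br>
the 12 DB files in the same working folder.

![Files before execution](fig2_files_before.png)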
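
Before running the notebook, it can help to confirm that everything is in place. The sketch below is not part of the provided notebook; the helper name find_missing_files is hypothetical, and the file names are simply the ones listed in this guide.

```python
from pathlib import Path

# File names as listed in this guide: 12 yearly DB files plus the notebook and metadata.
REQUIRED_FILES = (
    [f"hn{yy}_all.sas7bdat" for yy in range(10, 22)]
    + ["nationalhealth_preprocessing.ipynb", "meta_data20.xlsx"]
)


def find_missing_files(folder="."):
    """Return the required file names that are not present in `folder` (hypothetical helper)."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


missing = find_missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All 14 required files are in place.")
```

Run it from the working folder; an empty result means the folder matches the layout shown in the figure above.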

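## [Generation Step 3] Run the processing code

In the working folder that contains the DB files and the processing notebook,
run every cell of the dataset-building notebook (nationalhealth_preprocessing.ipynb).

![Execution](fig3_execution.png)

## [Generation Step 4] Verify the generated dataset
Finally, confirm that the file nationalhealth_2010to2021.csv has been created.

![Files after execution](fig4_files_after.png)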
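
Beyond checking that the file exists, a quick look at its shape can catch an incomplete run. This is a minimal sketch, assuming the CSV sits in the current folder; summarize_dataset is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path

import pandas as pd


def summarize_dataset(path):
    """Return (row count, column count) of the CSV at `path` (hypothetical helper)."""
    return pd.read_csv(path).shape


output = Path("nationalhealth_2010to2021.csv")
if output.is_file():
    rows, cols = summarize_dataset(output)
    print(f"{output.name}: {rows} rows x {cols} columns")
else:
    print(f"{output.name} not found; run the notebook first.")
```

The notebook output shown later in this guide ends at row index 35627 and selects 122 variables, so a result near 35,628 rows and 122 columns is what to expect with the same source files.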

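# 2. Dataset Processing Code

The dataset processing code is described in detail below.

(To check later how the dataset was processed, refer to the code below.)

* 2.1. Merge the datasets
* 2.2. Add disease variables
* 2.3. Select analysis variables
* 2.4. Handle special values (not applicable, unknown)
* 2.5. Remove rows with missing values (NaN)
* 2.6. Save the dataset file

## [Processing Code 1] Merge the datasets

Read the 12 DB files<br>
and merge them into a single dataset df.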

In [1]:
import pandas as pd

# The 12 KNHANES basic-DB files, one per survey year (2010 to 2021)
files = [
    'hn10_all.sas7bdat',
    'hn11_all.sas7bdat',
    'hn12_all.sas7bdat',
    'hn13_all.sas7bdat',
    'hn14_all.sas7bdat',
    'hn15_all.sas7bdat',
    'hn16_all.sas7bdat',
    'hn17_all.sas7bdat',
    'hn18_all.sas7bdat',
    'hn19_all.sas7bdat',
    'hn20_all.sas7bdat',
    'hn21_all.sas7bdat']
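df_merged = None
for file in files:
    # Two-digit survey year taken from the file name, e.g. 'hn10_all.sas7bdat' -> 10
    year = int(file.split('/')[-1].split('_')[0][-2:])
    if not (10 <= year <= 21):
        continue
    df = pd.read_sas(file)
    if year == 21:
        # Remember the 2021 spelling of each column name, keyed by its upper-case form
        dic_upper2original = {colID.upper(): colID for colID in df.columns}
    # Upper-case the names so columns match across years regardless of case
    df.columns = [colID.upper() for colID in df.columns]
    if df_merged is None:
        df_merged = df
        continue
    # join='inner' keeps only the columns present in every year
    df_merged = pd.concat([df_merged, df], axis=0, join='inner')
# Restore the 2021 spelling of the column names
df_merged.columns = df_merged.columns.map(dic_upper2original)
df_merged = df_merged.reset_index(drop=True)
df_merged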


Unnamed: 0,mod_d,ID,ID_fam,year,region,town_t,apt_t,psu,sex,age,...,N_PHOS,N_FE,N_NA,N_K,N_CAROT,N_RETIN,N_B1,N_B2,N_NIAC,N_VITC
0,b'2022.03.08',b'A308059801',b'A3080598',2010.0,1.0,1.0,2.0,b'A308',1.0,61.0,...,1158.741753,16.168011,7700.147136,4889.722884,8410.947935,40.636889,1.551362,1.160234,14.902740,302.078958
1,b'2022.03.08',b'A308059802',b'A3080598',2010.0,1.0,1.0,2.0,b'A308',2.0,54.0,...,,,,,,,,,,
2,b'2022.03.08',b'A308120201',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',1.0,33.0,...,862.497376,11.643875,5290.916555,2745.907566,1203.118204,109.254473,1.277699,1.184275,14.131862,50.702033
3,b'2022.03.08',b'A308120202',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',2.0,33.0,...,607.005191,7.011724,2816.639475,1463.945700,948.610489,82.484460,0.523133,0.766115,8.217986,39.511489
4,b'2022.03.08',b'A308120203',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',2.0,4.0,...,676.563698,5.877152,1183.633015,1410.599423,84.784175,110.651952,0.739844,0.869431,7.752621,82.259096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95305,b'2023.01.13.',b'R904351302',b'R9043513',2021.0,8.0,2.0,2.0,b'R904',1.0,25.0,...,,,,,,,,,,
95306,b'2023.01.13.',b'R904353001',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',1.0,45.0,...,1127.791215,14.174227,2722.535342,2193.573455,1472.979490,159.422485,0.881561,1.532354,14.065870,21.532513
95307,b'2023.01.13.',b'R904353002',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',2.0,43.0,...,1004.018227,7.434020,2373.257201,2361.510798,1413.207563,158.166347,0.986966,1.147640,10.985381,39.225403
95308,b'2023.01.13.',b'R904353003',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',1.0,8.0,...,1056.178696,7.072549,1678.226284,2458.049854,1360.526438,263.368377,0.913597,1.321368,10.198622,52.271571
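
The merge above relies on two ideas: column names are compared case-insensitively, and join='inner' keeps only the columns that every year shares. The toy sketch below illustrates both with made-up frames (df_2010, df_2021 and their columns are invented for the example, not real KNHANES data).

```python
import pandas as pd

# Toy stand-ins for two survey years; names overlap only partly and differ in case.
df_2010 = pd.DataFrame({"ID": ["a1"], "AGE": [61], "OLD_ONLY": [1]})
df_2021 = pd.DataFrame({"ID": ["b1"], "age": [45], "new_only": [2]})

# Remember the 2021 spelling, then upper-case every frame's column names.
upper_to_original = {col.upper(): col for col in df_2021.columns}
frames = [df.rename(columns=str.upper) for df in (df_2010, df_2021)]

# join='inner' drops OLD_ONLY and NEW_ONLY because they are not in both frames.
merged = pd.concat(frames, axis=0, join="inner", ignore_index=True)
merged.columns = merged.columns.map(upper_to_original)

print(sorted(merged.columns))  # ['ID', 'age']
print(len(merged))             # 2
```

A consequence worth keeping in mind: a variable that is missing from even one survey year will not appear in the merged dataset.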


## [Processing Code 2] Add disease variables

Add 13 disease variables to the dataset df.


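>**Disease variables** (Korean column name in parentheses)<br>
    1. Obesity (비만)<br>
    2. Hypertension (고혈압)<br>
    3. Diabetes (당뇨병)<br>
    4. Hypercholesterolemia (고콜레스테롤혈증)<br>
    5. Hypertriglyceridemia (고중성지방혈증)<br>
    6. Hepatitis B (B형간염)<br>
    7. Anemia (빈혈)<br>
    8. Stroke (뇌졸중)<br>
    9. Angina or myocardial infarction (협심증또는심근경색증)<br>
    10. Asthma (천식)<br>
    11. Atopic dermatitis (아토피피부염)<br>
    12. Osteoarthritis (골관절염)<br>
    13. Depression (우울증)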
    
The criteria for each disease follow the statistics below, published by the KDCA (https://knhanes.kdca.go.kr/knhanes/sub04/sub04_04_05.do).

![Disease statistics](fig5_disease_statistics.png)

In [2]:
data=df_merged
import numpy as np

# 1. Obesity (비만): age 19+, BMI measured, HE_dprg empty; positive if BMI >= 25
index1 = data[data["age"] >= 19].index
index2 = data.dropna(subset=["HE_BMI"]).index
index3 = data[data["HE_dprg"].isnull()].index
intersection_index = list(set(index1) & set(index2) & set(index3))
data.loc[:, "비만"] = np.nan
data.loc[intersection_index, "비만"] = (data.loc[intersection_index]["HE_BMI"] >= 25).astype(int)

# 2. Hypertension (고혈압): age 19+, all blood pressure readings present; positive if HE_HP == 3
index1 = data[data["age"] >= 19].index
index2 = data.dropna(subset=["HE_sbp1", "HE_sbp2", "HE_sbp3", "HE_dbp1", "HE_dbp2", "HE_dbp3"]).index
index3 = data[data["DI1_2"].isin([1, 2, 3, 4, 5, 8])].index
intersection_index = list(set(index1) & set(index2) & set(index3))
data.loc[:, "고혈압"] = np.nan
data.loc[intersection_index, "고혈압"] = data.loc[intersection_index]["HE_HP"].map({1: 0, 2: 0, 3: 1})

# 3. Diabetes (당뇨병): eligible rows first, then any one of five positive criteria
index1 = data[data["HE_dprg"].isnull()].index
index2 = data.dropna(subset=["HE_glu"]).index
index3 = data[data["HE_fst"] >= 8].index
index4 = data[(data["DE1_dg"].isin([0, 1, 8])) & (data["DE1_31"].isin([0, 1, 8])) & (data["DE1_32"].isin([0, 1, 8]))].index
index5 = data.dropna(subset=["HE_HbA1c"]).index
intersection_index = list(set(index1) & set(index2) & set(index3) & set(index4) & set(index5))

index6 = data[(data["HE_glu"] >= 126)].index
index7 = data[(data["DE1_31"] == 1)].index
index8 = data[(data["DE1_32"] == 1)].index
index9 = data[(data["DE1_dg"] == 1)].index
index10 = data[(data["HE_HbA1c"] >= 6.5)].index
union_index = list(set(index6) | set(index7) | set(index8) | set(index9) | set(index10))

diabetes_index = list(set(intersection_index) & set(union_index))
complement_index = list(set(intersection_index) - set(diabetes_index))

data.loc[:, "당뇨병"] = np.nan
data.loc[diabetes_index, "당뇨병"] = 1
data.loc[complement_index, "당뇨병"] = 0

# 4. Hypercholesterolemia (고콜레스테롤혈증): age 19+, HE_fst >= 8; positive if HE_chol >= 240 or DI2_2 == 1
index1 = data[(data["age"] >= 19) & (data["HE_fst"] >= 8)].index
index2 = data.dropna(subset=["HE_chol", "DI2_2"]).index
index3 = data[data["DI2_2"].isin([1, 2, 3, 4, 5, 8])].index
intersection_index = list(set(index1) & set(index2) & set(index3))
data.loc[:, "고콜레스테롤혈증"] = np.nan
data.loc[intersection_index, "고콜레스테롤혈증"] = ((data.loc[intersection_index]["HE_chol"] >= 240) | (data.loc[intersection_index]["DI2_2"] == 1)).astype(int)

# 5. Hypertriglyceridemia (고중성지방혈증): age 19+, HE_fst >= 12; positive if HE_TG >= 200
index1 = data[(data["age"] >= 19) & (data["HE_fst"] >= 12)].index
index2 = data.dropna(subset=["HE_TG"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "고중성지방혈증"] = np.nan
data.loc[intersection_index, "고중성지방혈증"] = (data.loc[intersection_index]["HE_TG"] >= 200).astype(int)

# 6. Hepatitis B (B형간염): age 10+, HE_hepaB is 0 or 1
index1 = data[data["age"] >= 10].index
index2 = data.dropna(subset=["HE_hepaB"]).index
index3 = data[data["HE_hepaB"].isin([0, 1])].index
intersection_index = list(set(index1) & set(index2) & set(index3))
data.loc[:, "B형간염"] = np.nan
data.loc[intersection_index, "B형간염"] = data.loc[intersection_index]["HE_hepaB"]

# 7. Anemia (빈혈): age 10+, HE_anem is 0 or 1
index1 = data[data["age"] >= 10].index
index2 = data.dropna(subset=["HE_anem"]).index
index3 = data[data["HE_anem"].isin([0, 1])].index
intersection_index = list(set(index1) & set(index2) & set(index3))
data.loc[:, "빈혈"] = np.nan
data.loc[intersection_index, "빈혈"] = data.loc[intersection_index]["HE_anem"]

# 8. Stroke (뇌졸중): age 30+, copied from DI3_dg
index1 = data[data["age"] >= 30].index
index2 = data.dropna(subset=["DI3_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "뇌졸중"] = np.nan
data.loc[intersection_index, "뇌졸중"] = data.loc[intersection_index]["DI3_dg"]

# 9. Angina or myocardial infarction (협심증또는심근경색증): age 30+, copied from DI4_dg
index1 = data[data["age"] >= 30].index
index2 = data.dropna(subset=["DI4_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "협심증또는심근경색증"] = np.nan
data.loc[intersection_index, "협심증또는심근경색증"] = data.loc[intersection_index]["DI4_dg"]

# 10. Asthma (천식): age 19+, copied from DJ4_dg
index1 = data[data["age"] >= 19].index
index2 = data.dropna(subset=["DJ4_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "천식"] = np.nan
data.loc[intersection_index, "천식"] = data.loc[intersection_index]["DJ4_dg"]

# 11. Atopic dermatitis (아토피피부염): age 19+, copied from DL1_dg
index1 = data[data["age"] >= 19].index
index2 = data.dropna(subset=["DL1_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "아토피피부염"] = np.nan
data.loc[intersection_index, "아토피피부염"] = data.loc[intersection_index]["DL1_dg"]

# 12. Osteoarthritis (골관절염): age 30+, copied from DM2_dg
index1 = data[data["age"] >= 30].index
index2 = data.dropna(subset=["DM2_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "골관절염"] = np.nan
data.loc[intersection_index, "골관절염"] = data.loc[intersection_index]["DM2_dg"]

# 13. Depression (우울증): age 19+, copied from DF2_dg
index1 = data[data["age"] >= 19].index
index2 = data.dropna(subset=["DF2_dg"]).index
intersection_index = list(set(index1) & set(index2))
data.loc[:, "우울증"] = np.nan
data.loc[intersection_index, "우울증"] = data.loc[intersection_index]["DF2_dg"]
data

Unnamed: 0,mod_d,ID,ID_fam,year,region,town_t,apt_t,psu,sex,age,...,고콜레스테롤혈증,고중성지방혈증,B형간염,빈혈,뇌졸중,협심증또는심근경색증,천식,아토피피부염,골관절염,우울증
0,b'2022.03.08',b'A308059801',b'A3080598',2010.0,1.0,1.0,2.0,b'A308',1.0,61.0,...,1.0,0.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
1,b'2022.03.08',b'A308059802',b'A3080598',2010.0,1.0,1.0,2.0,b'A308',2.0,54.0,...,0.0,0.0,1.0,0.0,8.0,8.0,8.0,0.0,8.0,8.0
2,b'2022.03.08',b'A308120201',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',1.0,33.0,...,0.0,1.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
3,b'2022.03.08',b'A308120202',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',2.0,33.0,...,0.0,0.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
4,b'2022.03.08',b'A308120203',b'A3081202',2010.0,1.0,1.0,2.0,b'A308',2.0,4.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95305,b'2023.01.13.',b'R904351302',b'R9043513',2021.0,8.0,2.0,2.0,b'R904',1.0,25.0,...,0.0,0.0,0.0,0.0,,,0.0,0.0,,0.0
95306,b'2023.01.13.',b'R904353001',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',1.0,45.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95307,b'2023.01.13.',b'R904353002',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',2.0,43.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95308,b'2023.01.13.',b'R904353003',b'R9043530',2021.0,8.0,2.0,2.0,b'R904',1.0,8.0,...,,,,,,,,,,
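
Each disease variable follows the same pattern: define the eligible rows, leave everything else as NaN, and assign 1 or 0 inside the eligible group. As an illustration only, the obesity rule can also be written with boolean masks instead of index sets. The toy rows below are invented; the column names and thresholds are the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the notebook (age, HE_BMI, HE_dprg).
toy = pd.DataFrame({
    "age":     [61, 33, 4, 40, 52],
    "HE_BMI":  [27.3, 22.1, 16.0, np.nan, 31.0],
    "HE_dprg": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Eligible rows: age 19+, BMI measured, HE_dprg empty (same filter as the notebook).
eligible = (toy["age"] >= 19) & toy["HE_BMI"].notna() & toy["HE_dprg"].isna()

# NaN outside the eligible group; 1.0 or 0.0 for BMI >= 25 inside it.
toy["비만"] = np.where(eligible, (toy["HE_BMI"] >= 25).astype(float), np.nan)

print(toy["비만"].tolist())  # [1.0, 0.0, nan, nan, nan]
```

The mask form avoids building Python sets over roughly 95,000 row labels, but the index-set version in the notebook produces the same labels for the same filter.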


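## [Processing Code 3] Select analysis variables

Select 122 analysis variables from the dataset df.

The selected variables cover 1) personal information, 2) blood test results, 3) nutrition survey results, and 4) disease information.

For variable IDs and detailed descriptions, see the 'Raw Data User Guide' (원시자료 이용지침서) file from the KDCA (https://knhanes.kdca.go.kr/knhanes/sub04/sub04_04_05.do) and meta_data20.xlsx, which summarizes that file.

![Metadata](fig6_metadata.png)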

In [3]:
variables="""ID
ID_fam
year
region
town_t
sex
age
incm
ho_incm
incm5
ho_incm5
edu
occp
cfam
genertn
allownc
marri_1
marri_2
fam_rela
tins
D_1_1
educ
EC1_1
EC_wht_23
EC_wht_5
EC_pedu_1
EC_pedu_2
BD1_11
BD2_1
BD2_31
dr_month
BP6_10
BP7
mh_stress
BS3_1
BE3_31
BE5_1
LW_mt
LW_mt_a1
LW_br
HE_fst
HE_HPdr
HE_DMdr
HE_mens
HE_prg
HE_HPfh1
HE_HPfh2
HE_HPfh3
HE_HLfh1
HE_HLfh2
HE_HLfh3
HE_IHDfh1
HE_IHDfh2
HE_IHDfh3
HE_STRfh1
HE_STRfh2
HE_STRfh3
HE_DMfh1
HE_DMfh2
HE_DMfh3
HE_rPLS
HE_sbp
HE_dbp
HE_ht
HE_wt
HE_wc
HE_BMI
HE_glu
HE_HbA1c
HE_chol
HE_HDL_st2
HE_TG
HE_ast
HE_alt
HE_hepaB
HE_HB
HE_HCT
HE_BUN
HE_crea
HE_WBC
HE_RBC
HE_Bplt
HE_Uph
HE_Unitr
HE_Usg
HE_Upro
HE_Uglu
HE_Uket
HE_Ubil
HE_Ubld
HE_Uro
HE_Ucrea
N_INTK
N_EN
N_WATER
N_PROT
N_FAT
N_CHO
N_CA
N_PHOS
N_FE
N_NA
N_K
N_CAROT
N_RETIN
N_B1
N_B2
N_NIAC
N_VITC
비만
고혈압
당뇨병
고콜레스테롤혈증
고중성지방혈증
B형간염
빈혈
뇌졸중
협심증또는심근경색증
천식
아토피피부염
골관절염
우울증
""".strip().split()
df_merged_selected = data[variables].copy()  # copy, so later assignments do not act on a slice of data
df_merged_selected

Unnamed: 0,ID,ID_fam,year,region,town_t,sex,age,incm,ho_incm,incm5,...,고콜레스테롤혈증,고중성지방혈증,B형간염,빈혈,뇌졸중,협심증또는심근경색증,천식,아토피피부염,골관절염,우울증
0,b'A308059801',b'A3080598',2010.0,1.0,1.0,1.0,61.0,2.0,2.0,3.0,...,1.0,0.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
1,b'A308059802',b'A3080598',2010.0,1.0,1.0,2.0,54.0,2.0,2.0,2.0,...,0.0,0.0,1.0,0.0,8.0,8.0,8.0,0.0,8.0,8.0
2,b'A308120201',b'A3081202',2010.0,1.0,1.0,1.0,33.0,3.0,3.0,3.0,...,0.0,1.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
3,b'A308120202',b'A3081202',2010.0,1.0,1.0,2.0,33.0,2.0,3.0,3.0,...,0.0,0.0,0.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0
4,b'A308120203',b'A3081202',2010.0,1.0,1.0,2.0,4.0,3.0,3.0,3.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95305,b'R904351302',b'R9043513',2021.0,8.0,2.0,1.0,25.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,,,0.0,0.0,,0.0
95306,b'R904353001',b'R9043530',2021.0,8.0,2.0,1.0,45.0,3.0,3.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95307,b'R904353002',b'R9043530',2021.0,8.0,2.0,2.0,43.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95308,b'R904353003',b'R9043530',2021.0,8.0,2.0,1.0,8.0,3.0,3.0,4.0,...,,,,,,,,,,
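
Selecting with a list of names raises a KeyError if any name is absent, which can happen if a different set of yearly files leaves a variable out of the merged dataset. The sketch below shows one way to check first; split_available is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def split_available(df, wanted):
    """Split `wanted` into names present in `df` and names that are missing (hypothetical helper)."""
    present = [name for name in wanted if name in df.columns]
    missing = [name for name in wanted if name not in df.columns]
    return present, missing


toy = pd.DataFrame({"ID": ["a1"], "age": [61], "HE_BMI": [27.3]})
present, missing = split_available(toy, ["ID", "age", "HE_glu"])

print(present)  # ['ID', 'age']
print(missing)  # ['HE_glu']
```

In the notebook this would be called as split_available(data, variables) just before the selection, so that a missing name is reported by name rather than through a traceback.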


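## [Processing Code 4] Handle special values (not applicable, unknown)

In the raw data, 'not applicable' is coded as 8, 88, 888, and so on, and 'unknown' as 9, 99, 999, and so on; the exact code differs from variable to variable.

These are unified so that 'not applicable' becomes -1 and 'unknown' becomes -2.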

In [4]:
df = df_merged_selected
df.loc[df['sex']==1,'LW_mt'] = 8   # for men (sex=1), set childbirth experience (LW_mt) to not applicable (8)
df.loc[df['sex']==1,'LW_mt_a1'] = 8   # for men (sex=1), set age at first childbirth (LW_mt_a1) to not applicable (8)
df.loc[df['sex']==1,'LW_br'] = 8   # for men (sex=1), set breastfeeding experience (LW_br) to not applicable (8)

lst_8 = ["HE_HPdr", "HE_DMdr", "HE_mens", "HE_prg", '협심증또는심근경색증']
lst_9 = ["cfam", "genertn", "marri_1", "D_1_1", "HE_HPfh1", "HE_HPfh2", "HE_HLfh1", "HE_HLfh2", 
         "HE_IHDfh1", "HE_IHDfh2", "HE_STRfh1", "HE_STRfh2", "HE_DMfh1", "HE_DMfh2"]
lst_99 = ["allownc", "fam_rela", "tins"]
lst_8_9 = ["EC1_1", "BD1_11", "BD2_1", "BD2_31", "BP6_10", "BP7", "BS3_1", "BE5_1", "LW_mt", 
           "LW_br", "HE_HPfh3", "HE_HLfh3", "HE_IHDfh3", "HE_STRfh3", "HE_DMfh3", 
           '뇌졸중', '천식', '아토피피부염', '골관절염', '우울증']
lst_88_99 = ["educ", "EC_wht_5", "EC_pedu_1", "EC_pedu_2", "BE3_31"]
lst_88_9_99 = ["marri_2"]
lst_888_999 = ["EC_wht_23", "LW_mt_a1"]

for col in df.columns:
    if col in lst_8:
        df.loc[df[col] == 8, col] = -1
    elif col in lst_9:
        df.loc[df[col] == 9, col] = -2
    elif col in lst_99:
        df.loc[df[col] == 99, col] = -2
    elif col in lst_8_9:
        df.loc[df[col] == 8, col] = -1
        df.loc[df[col] == 9, col] = -2
    elif col in lst_88_99:
        df.loc[df[col] == 88, col] = -1
        df.loc[df[col] == 99, col] = -2
    elif col in lst_88_9_99:
        df.loc[df[col] == 88, col] = -1
        df.loc[df[col] == 9, col] = -2
        df.loc[df[col] == 99, col] = -2
    elif col in lst_888_999:
        df.loc[df[col] == 888, col] = -1
        df.loc[df[col] == 999, col] = -2
    else:
        pass
    
df


Unnamed: 0,ID,ID_fam,year,region,town_t,sex,age,incm,ho_incm,incm5,...,고콜레스테롤혈증,고중성지방혈증,B형간염,빈혈,뇌졸중,협심증또는심근경색증,천식,아토피피부염,골관절염,우울증
0,b'A308059801',b'A3080598',2010.0,1.0,1.0,1.0,61.0,2.0,2.0,3.0,...,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,b'A308059802',b'A3080598',2010.0,1.0,1.0,2.0,54.0,2.0,2.0,2.0,...,0.0,0.0,1.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0
2,b'A308120201',b'A3081202',2010.0,1.0,1.0,1.0,33.0,3.0,3.0,3.0,...,0.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,b'A308120202',b'A3081202',2010.0,1.0,1.0,2.0,33.0,2.0,3.0,3.0,...,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,b'A308120203',b'A3081202',2010.0,1.0,1.0,2.0,4.0,3.0,3.0,3.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95305,b'R904351302',b'R9043513',2021.0,8.0,2.0,1.0,25.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,,,0.0,0.0,,0.0
95306,b'R904353001',b'R9043530',2021.0,8.0,2.0,1.0,45.0,3.0,3.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95307,b'R904353002',b'R9043530',2021.0,8.0,2.0,2.0,43.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95308,b'R904353003',b'R9043530',2021.0,8.0,2.0,1.0,8.0,3.0,3.0,4.0,...,,,,,,,,,,
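
The same recoding can be expressed as a table of per-column mappings, which keeps each variable's codes visible in one place. This is a sketch of an alternative, not the notebook's method: it uses pandas DataFrame.replace with a nested dictionary, and the toy values are invented. The columns and codes mirror a few entries from the lists above.

```python
import pandas as pd

NOT_APPLICABLE, UNKNOWN = -1, -2

# Per-column code tables, mirroring a few entries from the notebook's lists.
recode_rules = {
    "HE_prg":    {8: NOT_APPLICABLE},
    "cfam":      {9: UNKNOWN},
    "educ":      {88: NOT_APPLICABLE, 99: UNKNOWN},
    "marri_2":   {88: NOT_APPLICABLE, 9: UNKNOWN, 99: UNKNOWN},
    "EC_wht_23": {888: NOT_APPLICABLE, 999: UNKNOWN},
}

toy = pd.DataFrame({
    "HE_prg": [0, 8],
    "cfam": [3, 9],
    "educ": [88, 99],
    "marri_2": [1, 9],
    "EC_wht_23": [40, 888],
})

# A nested dict applies each inner mapping to its own column only.
recoded = toy.replace(recode_rules)
print(recoded)
```

Because each mapping is scoped to one column, a legitimate value of 8 or 9 in an unlisted column (region, for instance) is left untouched, as it is in the notebook's loop.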


## [Processing Code 5] Remove rows with missing values (NaN)

Keep only respondents aged 30 or older,
then drop every row that contains a missing value.

In [5]:
df2 = df.loc[df["age"] >= 30]   # keep respondents aged 30 or older
df2 = df2.dropna()
df2 = df2.reset_index(drop=True, inplace=False)
df2

Unnamed: 0,ID,ID_fam,year,region,town_t,sex,age,incm,ho_incm,incm5,...,고콜레스테롤혈증,고중성지방혈증,B형간염,빈혈,뇌졸중,협심증또는심근경색증,천식,아토피피부염,골관절염,우울증
0,b'A308780901',b'A3087809',2010.0,1.0,1.0,1.0,74.0,3.0,2.0,4.0,...,0.0,1.0,0.0,0.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0
1,b'A309099802',b'A3090998',2010.0,1.0,1.0,2.0,71.0,2.0,1.0,2.0,...,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,b'A309460901',b'A3094609',2010.0,1.0,1.0,2.0,61.0,3.0,2.0,4.0,...,0.0,0.0,0.0,0.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0
3,b'A309460902',b'A3094609',2010.0,1.0,1.0,1.0,32.0,2.0,2.0,2.0,...,0.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0
4,b'A310439801',b'A3104398',2010.0,1.0,1.0,2.0,63.0,4.0,4.0,5.0,...,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35624,b'R904322404',b'R9043224',2021.0,8.0,2.0,1.0,53.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35625,b'R904332601',b'R9043326',2021.0,8.0,2.0,1.0,50.0,2.0,2.0,2.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35626,b'R904346201',b'R9043462',2021.0,8.0,2.0,2.0,54.0,4.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35627,b'R904353001',b'R9043530',2021.0,8.0,2.0,1.0,45.0,3.0,3.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
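
In the outputs shown in this guide, the row count falls from 95,309 after merging to 35,628 after this step. Since dropna() with no arguments removes a row when any of the 122 columns is NaN, it can be useful to see which columns account for the loss. The sketch below shows the idea on invented toy data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age":    [74, 25, 45, 61],
    "HE_glu": [98.0, 90.0, np.nan, 110.0],
    "N_EN":   [1800.0, 2100.0, 1950.0, np.nan],
})

adults = toy.loc[toy["age"] >= 30]

# Missing values per column among the age-30+ rows, largest first.
print(adults.isna().sum().sort_values(ascending=False))

# dropna() with no arguments removes a row if any column is NaN.
complete = adults.dropna().reset_index(drop=True)
print(len(toy), "->", len(adults), "->", len(complete))  # 4 -> 3 -> 1
```

Running the per-column count on df just before the dropna() call in the notebook would show which variables are responsible for most of the removed rows.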


## [Processing Code 6] Save the result file

Save the processed dataset to a file.

In [6]:
df2.to_csv('nationalhealth_2010to2021.csv',index=False)
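
The ID and ID_fam columns appear in the outputs above as byte strings such as b'A308059801', and they are likely to be written to the CSV in that literal form. If plain text IDs are preferred, one option is to decode them before saving. The sketch below is an assumption-laden illustration: decode_bytes_columns is a hypothetical helper, UTF-8 is an assumed encoding, and the toy frame is invented.

```python
import pandas as pd

toy = pd.DataFrame({"ID": [b"A308059801", b"R904353001"], "age": [61.0, 45.0]})


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of `df` with bytes values in object columns decoded to str (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out


clean = decode_bytes_columns(toy)
print(clean["ID"].tolist())  # ['A308059801', 'R904353001']
```

Another route is the encoding parameter of pd.read_sas, which decodes string columns at load time; which encoding suits these files is not stated in this guide, so it would need to be checked against the data.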