# Introduction
We are using a menu dataset from [AI Hub](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=242).
This dataset, originally intended for training vision models, includes a large number of labeled images and an organized Excel file (.xlsx) containing nutrient and category information.
However, we only need the menu labels and their corresponding nutrients from this organized file.

The dataset contains around 50,000 detailed food entries.
Some of these are duplicates, and others are unusable for our purposes, such as raw ingredients or overly specific food items.
The script below cleans and refines this data.

> If you need the original dataset, refer to the [AI Hub](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=242) site.


## Refinement
### Fields we use
- Id
- Name
- Categories
- Carbohydrates
- Protein
- Fat
- Sugar
- Sodium (Na)
- Total energy
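
For reference, these fields can be tied to the source columns they appear to come from. The sketch below is an assumed mapping, based on the Korean header row printed later in this notebook. `FIELD_TO_SOURCE` is a hypothetical name, and treating `식품코드` (food code) as the Id is a guess rather than something the dataset documents.

```python
# Assumed mapping from the fields listed above to the source spreadsheet's
# column headers. FIELD_TO_SOURCE is an illustrative name, not part of the
# original notebook.
FIELD_TO_SOURCE = {
    "Id": ["식품코드"],
    "Name": ["식품명"],
    "Categories": ["식품대분류", "식품상세분류"],
    "Carbohydrates": ["탄수화물(g)"],
    "Protein": ["단백질(g)"],
    "Fat": ["지방(g)"],
    "Sugar": ["총당류(g)"],
    "Sodium": ["나트륨(㎎)"],
    "Total energy": ["에너지(㎉)"],
}

# Flatten to the list of source columns needed for these fields.
source_columns = [col for cols in FIELD_TO_SOURCE.values() for col in cols]
```

Categories maps to two columns because the source splits it into a broad category and a detailed one.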

### Data folder structure
- root
    - datas
        - raw
            - raw_menu_nutrient.xlsx
        - refined
            - menus
                - menus.xlsx
            - tags
                - tags.xlsx
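
The layout above could be expressed once as path constants, so later cells do not repeat string literals. This is a minimal sketch that assumes the notebook runs from the project root. The constant names and `ensure_output_dirs` are hypothetical and do not appear in the original notebook.

```python
from pathlib import Path

# Assumption: the working directory is the project root shown above.
ROOT = Path(".")
RAW_FILE = ROOT / "datas" / "raw" / "raw_menu_nutrient.xlsx"
REFINED_DIR = ROOT / "datas" / "refined"
MENUS_FILE = REFINED_DIR / "menus" / "menus.xlsx"
TAGS_FILE = REFINED_DIR / "tags" / "tags.xlsx"


def ensure_output_dirs() -> None:
    """Create the refined output folders if they do not exist yet."""
    for target in (MENUS_FILE, TAGS_FILE):
        target.parent.mkdir(parents=True, exist_ok=True)
```

Calling `ensure_output_dirs()` before writing results avoids a missing-directory error on a fresh checkout.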

# Imports

In [76]:
import pandas as pd
from pandas import DataFrame

# Load dataset

In [77]:
datas: DataFrame = pd.read_excel('./datas/raw/raw_menu_nutrient.xlsx', engine='openpyxl', sheet_name=1)

# Extract DataFrame

In [78]:
df: DataFrame = datas.copy()

In [79]:
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84,Unnamed: 85,Unnamed: 86,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110,Unnamed: 111,Unnamed: 112,Unnamed: 113,Unnamed: 114,Unnamed: 115,Unnamed: 116,Unnamed: 117,Unnamed: 118,Unnamed: 119,Unnamed: 120,Unnamed: 121,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Unnamed: 126,Unnamed: 127,Unnamed: 128,Unnamed: 129,Unnamed: 130,Unnamed: 131,Unnamed: 132,Unnamed: 133,Unnamed: 134,Unnamed: 135,Unnamed: 136,Unnamed: 137,Unnamed: 138,Unnamed: 139,Unnamed: 140,Unnamed: 141,Unnamed: 142,Unnamed: 143,Unnamed: 144,Unnamed: 145,Unnamed: 146,Unnamed: 147,Unnamed: 148,Unnamed: 149,Unnamed: 150,Unnamed: 151,Unnamed: 152,Unnamed: 153,Unnamed: 154,Unnamed: 155,Unnamed: 156,Unnamed: 157,Unnamed: 158,Unnamed: 159,Unnamed: 
160,Unnamed: 161,Unnamed: 162,Unnamed: 163,Unnamed: 164,Unnamed: 165,Unnamed: 166,Unnamed: 167,Unnamed: 168,Unnamed: 169,Unnamed: 170,Unnamed: 171,Unnamed: 172,Unnamed: 173,Unnamed: 174,Unnamed: 175,Unnamed: 176,Unnamed: 177,Unnamed: 178,Unnamed: 179,Unnamed: 180,Unnamed: 181,Unnamed: 182,Unnamed: 183,Unnamed: 184,Unnamed: 185,Unnamed: 186,Unnamed: 187,Unnamed: 188,Unnamed: 189,Unnamed: 190,Unnamed: 191,Unnamed: 192,Unnamed: 193,Unnamed: 194,Unnamed: 195,Unnamed: 196,Unnamed: 197,Unnamed: 198,Unnamed: 199,Unnamed: 200,Unnamed: 201,Unnamed: 202,Unnamed: 203,Unnamed: 204,Unnamed: 205,Unnamed: 206,Unnamed: 207,Unnamed: 208,Unnamed: 209,Unnamed: 210,Unnamed: 211,Unnamed: 212,Unnamed: 213,Unnamed: 214,Unnamed: 215,Unnamed: 216,Unnamed: 217,Unnamed: 218,Unnamed: 219,Unnamed: 220,Unnamed: 221,Unnamed: 222,Unnamed: 223,Unnamed: 224,Unnamed: 225,Unnamed: 226,Unnamed: 227,Unnamed: 228,Unnamed: 229,Unnamed: 230,Unnamed: 231,Unnamed: 232,Unnamed: 233,Unnamed: 234
0,최종 DB 업데이트 일 : 2021-03-10,,,"● 이 파일은 통계용으로 사용하기 위한 것으로, 기존 ""-""값이 ""0""으로 변환된 ...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,NO,SAMPLE_ID,식품코드,DB군,상용제품,식품명,연도,지역 / 제조사,채취시기,식품대분류,식품상세분류,1회제공량,내용량_단위,총내용량(g),총내용량(mL),에너지(㎉),에너지(kj),수분(g),수분(%),단백질(g),지방(g),탄수화물(g),총당류(g),자당(g),포도당(g),과당(g),유당(g),맥아당(g),갈락토오스(g),당알콜(g),에리스리톨(g),총 식이섬유(g),총 식이섬유(mg),총 식이섬유(%),수용성 식이섬유(g),불용성 식이섬유(g),셀룰로오스(%),리그닌(%),칼슘(㎎),철(㎎),철(㎍),마그네슘(㎎),인(㎎),칼륨(g),칼륨(㎎),나트륨(㎎),아연(㎎),구리(㎎),구리(㎍),망간(㎎),망간(㎍),셀레늄(㎍),몰리브덴(㎍),요오드(㎍),염소(㎎),비타민 A(㎍),비타민 A(㎍ RE),레티놀(㎍),베타카로틴(㎍),비타민 D(D2+D3)(㎍),비타민 D2(㎍),비타민 D3(㎍),비타민 D1(㎍),비타민 E(㎎),비타민 E(㎎ α-TE),알파 토코페롤(㎎),베타 토코페롤(㎎),감마 토코페롤(㎎),델타 토코페롤(㎎),알파 토코트리에놀(㎎),베타 토코트리에놀(㎎),감마 토코트리에놀(㎎),델타 토코트리에놀(㎎),토코페롤(㎎),토코트리에놀(㎎),비타민 K(㎎),비타민 K(㎍),비타민 K1(㎍),비타민 K2(㎍),비타민 B1(㎎),비타민 B1(㎍),비타민 B2(㎎),비타민 B2(㎍),나이아신(㎎),나이아신(㎎ NE),나이아신(NE)(㎎),나이아신(NE)(㎎ NE),니코틴산 (㎎),니코틴아마이드(㎎),판토텐산(㎎),판토텐산(㎍),비타민 B6(㎎),비타민 B6(㎍),피리독신(㎎),비오틴(㎍),엽산(DFE)(㎍),엽산 - 천연 엽산(㎍),엽산 - 합성 엽산(㎍),비타민 B12(㎎),비타민 B12(㎍),비타민 C(g),비타민 C(㎎),콜린(g),콜린(㎎),총 아미노산(g),총 아미노산(㎎),필수 아미노산(㎎),이소류신(㎎),류신(㎎),라이신(㎎),메티오닌(㎎),페닐알라닌(㎎),트레오닌(㎎),트립토판(㎎),발린(㎎),히스티딘(㎎),아르기닌(㎎),비필수 아미노산(㎎),티로신(㎎),시스테인(㎎),알라닌(㎎),아스파르트산(㎎),글루탐산(㎎),글리신(㎎),프롤린(㎎),세린(㎎),타우린(㎎),글리신 베타인 (㎎),호마린(㎎),트리고넬린(㎎),리보핵산(㎎),데옥시리보핵산(㎎),콜레스테롤(g),콜레스테롤(㎎),총 지방산(g),총 필수 지방산(g),총 포화 지방산(g),총 포화 지방산(%),부티르산(4:0)(g),부티르산(4:0)(㎎),카프로산(6:0)(g),카프로산(6:0)(㎎),카프릴산(8:0)(g),카프릴산(8:0)(㎎),카프르산(10:0)(g),카프르산(10:0)(㎎),라우르산(12:0)(g),라우르산(12:0)(㎎),라우르산(12:0)(%),트라이데칸산(13:0)(㎎),미리스트산(14:0)(g),미리스트산(14:0)(㎎),미리스트산(14:0)(%),펜타데칸산(15:0)(㎎),팔미트산(16:0)(g),팔미트산(16:0)(㎎),헵타데칸산(17:0)(㎎),스테아르산(18:0)(g),스테아르산(18:0)(㎎),스테아르산(18:0)(%),아라키드산(20:0)(g),아라키드산(20:0)(㎎),아라키드산(20:0)(%),헨에이코산산(21:0)(㎎),베헨산(22:0)(㎎),트리코산산(23:0)(㎎),리그노세르산(24:0)(㎎),총 단일 불포화지방산(g),총 단일 불포화지방산(%),미리스톨레산(14:1)(g),미리스톨레산(14:1)(㎎),미리스톨레산(14:1)(%),팔미톨레산(16:1)(g),팔미톨레산(16:1)(㎎),팔미톨레산(16:1)(%),헵타데센산(17:1)(㎎),올레산(18:1(n-9))(g),올레산(18:1(n-9))(㎎),올레산(18:1(n-9))(%),박센산(18:1(n-7))(g),박센산(18:1(n-7))(㎎),가돌레산(20:1)(g),가돌레산(20:1)(㎎),에루크산(22:1)(㎎),에루크산(22:1)(%),네르본산(24:1)(㎎),총 다중 불포화지방산(g),총 다중 불포화지방산(%),리놀레산(18:2(n-6)c)(g),리놀레산(18:2(n-6)c)(㎎),리놀레산(18:2(n-6)c)(%),알파 
리놀렌산(18:3(n-3))(g),알파 리놀렌산(18:3(n-3))(㎎),감마 리놀렌산(18:3(n-6))(g),감마 리놀렌산(18:3(n-6))(㎎),스테아리돈산(18:4)(%),에이코사디에노산(20:2(n-6))(g),에이코사디에노산(20:2(n-6))(㎎),에이코사트리에노산(20:3(n-3))(㎎),에이코사트리에노산(20:3(n-6))(g),에이코사트리에노산(20:3(n-6))(㎎),아라키돈산(20:4(n-6))(g),아라키돈산(20:4(n-6))(㎎),아라키돈산(20:4(n-6))(%),에이코사테트라에노산(20:4(n-3))(㎎),에이코사펜타에노산(20:5(n-3))(g),에이코사펜타에노산(20:5(n-3))(㎎),에이코사펜타에노산(20:5(n-3))(%),도코사디에노산(22:2)(㎎),도코사펜타에노산(22:5(n-3))(g),도코사펜타에노산(22:5(n-3))(㎎),도코사헥사에노산(22:6(n-3))(g),도코사헥사에노산(22:6(n-3))(㎎),도코사헥사에노산(22:6(n-3))(%),EPA와 DHA의 합(㎎),오메가 3 지방산(g),오메가 6 지방산(g),트랜스 지방산(g),트랜스 올레산(18:1(n-9)t)(g),트랜스 올레산(18:1(n-9)t)(㎎),트랜스 리놀레산 (18:2t)(g),트랜스 리놀레산 (18:2t)(㎎),트랜스 리놀레산(18:3t)(g),트랜스 리놀레산(18:3t)(㎎),트랜스 리놀레산(18:3t)(%),냉산가용성물질(㎎),총 불포화지방산(g),식염상당량(g),회분(g),폐기율(%),가식부(%),산가용성물질(%),카페인(㎎),성분표출처,발행기관
3,1,D000006-94-AVG,D000006,음식,품목대표,꿩불고기,2019,충주,평균,구이류,육류구이,500,g,0,0,368.8,0,412.6,0,33.5,8.5,39.7,16.9,7.2,2.8,2.8,0.7,3.5,0,0,0,9.8,0,0,0,0,0,0,105.61,0,4,85.39,458.05,0,1243.12,1264.31,3.99,0.32,0,0.68,0,47.55,0,0,0,0,0,0,1424.58,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8.73,0.06,0,0,0,0,0,0,0.33,0,3.61,0,0,0,0,0,0,0,0,0,0,0,63.13,0,0,0,0,0,2.99,0,0,0,0,0,1284.38,2328.6,2484.75,645.19,1274.93,1410.84,0,1367.69,848.539,2205.59,0,919.806,276.499,1642.38,2781.77,4784.61,1419.92,1092.3,1264.96,0,0,0,0,0,0,0,106.18,0,0,1.9,0,0,0,0,0,0,0,0,0,0.007,0,0,0,0.032,0,0,0,1.284,0,0,0.527,0,0,0.029,0,0,0,0,0,0,0,0,0,0,0,0.081,0,0,0,2.308,0,0,0.113,0,0.028,0,0,0,0,0,0,3.236,0,0,0.657,0,0.019,0,0,0.007,0,0,0,0,0.143,0,0,0,0,0,0,0,0,0,0.023,0,0,0,0,0,0.1,0.018,0,0.053,0,0,0,0,0,0,0,5.8,0,0,0,0,식약처('16) 제4권,식품의약품안전처
4,2,D000007-ZZ-AVG,D000007,음식,품목대표,닭갈비,2019,전국(대표),평균,구이류,육류구이,400,g,0,0,595.61,0,276.4,0,45.9,25.8,44.9,21.2,3.6,5.9,4.8,0,6.9,0,0,0,11.6,0,0,0,0,0,0,98.64,0,3.38,104.42,505.25,0,1200.24,1535.83,3.55,0.34,0,0.97,0,57.56,0,0,0,0,0,38.61,2133.37,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.82,0.04,0,0,0,0,0.24,0,0.37,0,1.23,0,0,0,0,0,0,0,0,0,0,0,108.13,0,0,0,1.12,0,5.54,0,0,0,0,0,1796.24,3245.41,3593.31,878.04,1830.14,2019.94,0,1966.32,1198.58,2851.72,0,1310.51,307.73,2465.2,4361.76,7778.84,2230.08,1865.83,1833.39,0,0,0,0,0,0,0,193.4,0,0,6,0,0,0,0,0,0,0,0.002,0,0.032,0,0,0,0.179,0,0,0,4.485,0,0,1.271,0,0,0.032,0,0,0,0,0,0,0,0,0.051,0,0,1.29,0,0,0,8.116,0,0,0.465,0,0.082,0,0,0,0,0,0,4.851,0,0,0.443,0,0.021,0,0,0.027,0,0,0.04,0,0.128,0,0,0,0.006,0,0,0,0.016,0,0.01,0,0,0,0,0,0.2,0.075,0,0.065,0,0.013,0,0,0,0,0,7,0,0,0,0,식약처('16) 제4권,식품의약품안전처


In [80]:
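# Row 2 holds the real column names; the data itself starts at row 3.
df.columns = df.iloc[2, :]
df = df.iloc[3:, :].copy()  # copy so later in-place changes do not touch a slice
df.set_index('NO', inplace=True)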
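
The header promotion in this cell can be tried on a small frame first. The sketch below uses a toy table that mimics the sheet's layout (a note row, a blank row, the real header, then data). `promote_header` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same shape of problem: the real header sits in row 2.
raw = pd.DataFrame([
    ["note", None, None],
    [None, None, None],
    ["NO", "식품명", "에너지(㎉)"],
    [1, "꿩불고기", 368.8],
    [2, "닭갈비", 595.61],
])


def promote_header(frame: pd.DataFrame, header_row: int, index_col: str) -> pd.DataFrame:
    """Use one row as the column names, drop everything above it, set the index."""
    body = frame.iloc[header_row + 1:].copy()
    # tolist() avoids carrying the row label over as the columns' name.
    body.columns = frame.iloc[header_row].tolist()
    return body.set_index(index_col)


tidy = promote_header(raw, header_row=2, index_col="NO")
```

Passing the header as a plain list keeps the stray `2` label out of the column header, which otherwise shows up in the printed output.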

# Extract features

In [83]:
# Keep only the identifier, category, and nutrient columns, selected by position.
df_new: DataFrame = df.iloc[:, [1, 2, 3, 4, 6, 8, 9, 14, 18, 19, 20, 21, 44]].copy()

In [84]:
df_new.head()

NO,식품코드,DB군,상용제품,식품명,지역 / 제조사,식품대분류,식품상세분류,에너지(㎉),단백질(g),지방(g),탄수화물(g),총당류(g),나트륨(㎎)
1,D000006,음식,품목대표,꿩불고기,충주,구이류,육류구이,368.8,33.5,8.5,39.7,16.9,1264.31
2,D000007,음식,품목대표,닭갈비,전국(대표),구이류,육류구이,595.61,45.9,25.8,44.9,21.2,1535.83
3,D000008,음식,품목대표,닭갈비,춘천,구이류,육류구이,558.47,45.5,31.6,23.1,8.5,1016.94
4,D000009,음식,품목대표,닭꼬치,전국(대표),구이류,육류구이,176.723,11.562,8.565,13.348,3.152,286.911
5,D000010,음식,품목대표,더덕구이,전국(대표),구이류,채소류구이,184.0,3.1,5.2,31.1,11.6,743.37


In [None]:
# Short English names, in column order; 'nat' is sodium (나트륨) in mg.
df_new.columns = ['code', 'type', 'prod', 'name', 'manuf', 'cate_big', 'cate_specific', 'energy', 'prot', 'fat', 'carb', 'sugar', 'nat']
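
Selecting by position and then renaming by list order works only while the sheet layout stays fixed. As an alternative, the sketch below selects and renames by header name using a single mapping. `RENAMES` and `select_and_rename` are hypothetical names. The sample row is copied from the output above, with `연도` added as an example of a column that gets dropped.

```python
import pandas as pd

# Source header -> short name, in the same order as the rename above.
RENAMES = {
    "식품코드": "code",
    "DB군": "type",
    "상용제품": "prod",
    "식품명": "name",
    "지역 / 제조사": "manuf",
    "식품대분류": "cate_big",
    "식품상세분류": "cate_specific",
    "에너지(㎉)": "energy",
    "단백질(g)": "prot",
    "지방(g)": "fat",
    "탄수화물(g)": "carb",
    "총당류(g)": "sugar",
    "나트륨(㎎)": "nat",
}


def select_and_rename(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the mapped columns and give them their short names."""
    return frame.loc[:, list(RENAMES)].rename(columns=RENAMES)


# One row from the notebook output, plus a column that is not in RENAMES.
sample = pd.DataFrame([{
    "식품코드": "D000006", "DB군": "음식", "상용제품": "품목대표",
    "식품명": "꿩불고기", "연도": 2019, "지역 / 제조사": "충주",
    "식품대분류": "구이류", "식품상세분류": "육류구이",
    "에너지(㎉)": 368.8, "단백질(g)": 33.5, "지방(g)": 8.5,
    "탄수화물(g)": 39.7, "총당류(g)": 16.9, "나트륨(㎎)": 1264.31,
}])
renamed = select_and_rename(sample)
```

With this approach a missing or renamed source column raises a `KeyError` instead of silently picking up the wrong data.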