# Understanding the diabetes dataset

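Let's take a look at the [diabetes time-series data](https://archive.ics.uci.edu/ml/datasets/Diabetes) downloaded from the UCI ML Repository.

I have no background knowledge of this domain, so the goal is to go through the files step by step and understand them.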

In [1]:
import os
from datetime import datetime as dt

import pandas as pd
import numpy as np

In [2]:
DATA_DIR = '../../data/diabetes/'
store_dir = '../../data/diabetes-project/0.merge/'

## Reading the directory

In [3]:
len(os.listdir(DATA_DIR)), os.listdir(DATA_DIR)

(73,
 ['README-DIABETES',
  'data-31',
  'data-36',
  'data-09',
  'data-07',
  'data-38',
  'data-53',
  'data-54',
  'data-62',
  'data-65',
  'data-06',
  'data-39',
  'data-01',
  'data-37',
  'data-08',
  'data-30',
  'data-64',
  'data-63',
  'data-55',
  'data-52',
  'data-70',
  'data-48',
  'data-41',
  'data-46',
  'data-12',
  'data-15',
  'data-23',
  'data-24',
  'data-47',
  'data-40',
  'data-49',
  'data-25',
  'data-22',
  'data-14',
  'data-13',
  'data-68',
  'data-57',
  'Data-Codes',
  'data-50',
  'data-59',
  'data-66',
  'data-61',
  'data-35',
  'data-32',
  'data-04',
  'data-03',
  'data-60',
  'data-58',
  'data-67',
  'data-51',
  'data-69',
  'data-56',
  'data-02',
  'data-05',
  'data-33',
  'data-34',
  'data-29',
  'data-16',
  'data-11',
  'data-18',
  'data-27',
  'data-20',
  'data-45',
  'Domain-Description',
  'data-42',
  'data-21',
  'data-19',
  'data-26',
  'data-10',
  'data-28',
  'data-17',
  'data-43',
  'data-44'])

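There are 73 files in total. Apart from the three files `README-DIABETES`, `Data-Codes`, and `Domain-Description`, every file follows the `data-[n]` naming format. We read the three metadata files first and then inspect the remaining data files.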
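The split between metadata files and patient files can be sketched as below. This is a minimal sketch, assuming patient files are named `data-` followed by exactly two digits; the `split_listing` helper and the short sample list are hypothetical stand-ins for `os.listdir(DATA_DIR)`.

```python
import re

# Assumption: patient files are named "data-" followed by exactly two digits.
DATA_FILE_PATTERN = re.compile(r'^data-(\d{2})$')


def split_listing(names):
    """Split a directory listing into sorted patient files and other (metadata) files."""
    data_files = sorted(n for n in names if DATA_FILE_PATTERN.match(n))
    meta_files = sorted(n for n in names if not DATA_FILE_PATTERN.match(n))
    return data_files, meta_files


# Hypothetical sample standing in for os.listdir(DATA_DIR)
sample = ['README-DIABETES', 'data-31', 'data-09', 'Data-Codes', 'Domain-Description', 'data-01']
data_files, meta_files = split_listing(sample)
print(data_files)
print(meta_files)
```

Sorting the names also gives a stable patient order, which `os.listdir` does not guarantee.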

### `README-DIABETES`

First, read the README to see what it says about the data.

In [4]:
with open(os.path.join(DATA_DIR, 'README-DIABETES'), 'r') as f:
    readme = f.readlines()
    
print(''.join(readme))

The DIABETES data sets in this directory are provided for use in 1994 
AI in Medicine symposium submissions.  Permission is granted to use the
data sets for other research purposes as long as appropriate credit is
given as to the source (AIM-94 data set provided by Michael Kahn, MD, PhD, 
Washington University, St. Louis, MO).


Index:
------

* Data-Codes: a listing of the codes used in the data sets.

* Domain-Description: This file describes the basic physiology and patho-
physiology of diabetes mellitus and its treatment.

* data-[01-70]: data sets covering several weeks' to months' worth of
outpatient care on 70 patients.  An additional 10 sets will be made
available two weeks prior to the symposium for interested parties.  Please
contact the organizers if you would like to obtain these data sets.


Methods:
--------

You do not need to use all the data in order to participate.  Use any 
subset of the available data from either the ICU data set or the diabetes 
data set.  Furtherm

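What the README tells us about the files:

 * Data-Codes: a listing of the codes used in the data sets
 * Domain-Description: a file describing the basic physiology and pathophysiology of diabetes and its treatment
 * data-[01-70]: data sets covering several weeks' to months' worth of outpatient care for 70 patients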

### `Data-Codes`

In [5]:
with open(os.path.join(DATA_DIR, 'Data-Codes'), 'r') as f:
    data_codes = f.readlines()
    
print(''.join(data_codes))

Diabetes patient records were obtained from two sources:  an automatic
electronic recording device and paper records.  The automatic device
had an internal clock to timestamp events, whereas the paper records
only provided "logical time" slots (breakfast, lunch, dinner,
bedtime).  For paper records, fixed times were assigned to breakfast
(08:00), lunch (12:00), dinner (18:00), and bedtime (22:00).  Thus
paper records have fictitious uniform recording times whereas
electronic records have more realistic time stamps.

Diabetes files consist of four fields per record.  Each field is
separated by a tab and each record is separated by a newline.

File Names and format:
(1) Date in MM-DD-YYYY format
(2) Time in XX:YY format
(3) Code
(4) Value

The Code field is deciphered as follows:

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59

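The data was collected from two sources:

 * automatic electronic recording device: measurement times are recorded relatively accurately.
 * paper record: events were recorded in logical time slots (breakfast, lunch, dinner, bedtime), which were then converted to 08:00, 12:00, 18:00, and 22:00. These timestamps are therefore less precise.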
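Records that fall exactly on one of the four fixed times could be flagged as sketched below. This is only a heuristic under a stated assumption: an electronic device can also log an event at exactly 08:00, so the flag marks candidates rather than confirmed paper records. `PAPER_SLOTS`, `is_logical_time`, and the sample rows are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# Fixed times assigned to paper records according to Data-Codes
PAPER_SLOTS = {'08:00', '12:00', '18:00', '22:00'}


def is_logical_time(times):
    """Return a boolean Series: True where the time equals one of the fixed slots."""
    # Zero-pad the hour so that '8:00' and '08:00' compare equal
    return times.str.strip().str.zfill(5).isin(PAPER_SLOTS)


# Hypothetical sample rows using the same time format as the data files
sample = pd.DataFrame({'time': ['9:09', '08:00', '17:08', '22:00']})
sample['maybe_paper'] = is_logical_time(sample['time'])
print(sample)
```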
 
Each data file stores records with four fields. The record layout is as follows.

File Names and format:
 * (1) Date in MM-DD-YYYY format
 * (2) Time in XX:YY format
 * (3) Code
 * (4) Value
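The date and time fields can be combined into a single timestamp, roughly as follows. This is a minimal sketch on hypothetical sample rows; `errors='coerce'` is a defensive choice that turns any value not matching the format into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample rows; real rows come from the tab-separated data files
sample = pd.DataFrame({
    'date': ['04-21-1991', '04-21-1991', '03-13-1989'],
    'time': ['9:09', '17:08', '08:00'],
})

# Combine both fields and parse them with an explicit MM-DD-YYYY HH:MM format.
# errors='coerce' turns unparseable values into NaT instead of raising.
sample['timestamp'] = pd.to_datetime(
    sample['date'] + ' ' + sample['time'],
    format='%m-%d-%Y %H:%M',
    errors='coerce',
)
print(sample)
```

Counting `sample['timestamp'].isna()` afterwards shows how many rows, if any, failed to parse.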

The code table is stored in the dictionary below.

In [6]:
code_dict = dict()

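# data_codes[20] is the first 'code = description' line ('33 = ...');
# the slice also drops the last 5 lines of the file
for line in data_codes[20:-5]:
    code, descrb = line.rstrip().split(' = ')
    code_dict[int(code)] = descrb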

code_dict[33]

'Regular insulin dose'
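The fixed slice `[20:-5]` depends on the exact line layout of `Data-Codes`. An alternative, sketched here under the assumption that every code line has the form `NN = description`, is to match that pattern and skip everything else. `CODE_LINE` and `parse_codes` are hypothetical names, and the sample lines are copied from the output shown earlier.

```python
import re

# Assumption: code lines look like "33 = Regular insulin dose"
CODE_LINE = re.compile(r'^\s*(\d+)\s*=\s*(.+?)\s*$')


def parse_codes(lines):
    """Build {code: description} from lines matching 'NN = description'."""
    codes = {}
    for line in lines:
        match = CODE_LINE.match(line)
        if match:
            codes[int(match.group(1))] = match.group(2)
    return codes


# Sample lines copied from the Data-Codes output
sample_lines = [
    'The Code field is deciphered as follows:\n',
    '\n',
    '33 = Regular insulin dose\n',
    '34 = NPH insulin dose\n',
    '35 = UltraLente insulin dose\n',
    '48 = Unspecified blood glucose measurement\n',
]
codes = parse_codes(sample_lines)
print(codes[33])
```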

### `Domain-Description`

In [7]:
with open(os.path.join(DATA_DIR, 'Domain-Description'), 'r') as f:
    domain = f.readlines()
    
print(''.join(domain))

A Non-technical Description of Key Concepts in Outpatient Monitoring
and Management of Insulin Dependent Diabetes Mellitus (IDDM) for the
AAAI Spring Symposium on Intepreting Clinical Data.


The following text is provided to orient you to the the diabetes data
set. It is meant as a quick introduction to the pertinent issues in
this domain for potential participants of the AAAI Spring Symposium on
Interpreting Clinical Data.  However, it is not meant to be a rigorous
or comprehensive review of the subject.

Isaac  Kohane, AIM-94 Co-Chair
8/27/1993
aim-94@camis.stanford.edu

------------------------------------------------------------------------

Patients with IDDM are insulin deficient. This can either be due to a)
low or absent production of insulin by the beta islet cells of the
pancreas subsequent to an auto-immune attack or b) insulin-resistance,
typically associated with older age and obesity, which leads to a
relative insulin-deficiency even though the insulin levels might be
no

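Patients with insulin-dependent diabetes mellitus (IDDM, type 1 diabetes) are insulin deficient. According to the text, this is due to either a) low or absent insulin production following an autoimmune attack, or b) insulin resistance, typically associated with older age and obesity. Regardless of the cause, IDDM leads to metabolic problems...

The rest of the file continues with a brief, non-technical description of diabetes.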

### `data-[01-70]`

In [8]:
df_01 = pd.read_csv(os.path.join(DATA_DIR, 'data-01'), sep='\t', header=None)
df_01.columns = ['date', 'time', 'code', 'value']
print(df_01.shape)
df_01.head()

(943, 4)


         date   time  code  value
0  04-21-1991   9:09    58    100
1  04-21-1991   9:09    33      9
2  04-21-1991   9:09    34     13
3  04-21-1991  17:08    62    119
4  04-21-1991  17:08    33      7


In [9]:
df_70 = pd.read_csv(os.path.join(DATA_DIR, 'data-70'), sep='\t', header=None)
df_70.columns = ['date', 'time', 'code', 'value']
print(df_70.shape)
df_70.head()

(341, 4)


         date   time  code  value
0  03-13-1989  08:00    58  354.0
1  03-13-1989  08:00    33    2.0
2  03-13-1989  08:00    34    8.0
3  03-13-1989  18:00    62  275.0
4  03-13-1989  18:00    33    1.0


The first patient has 943 records and the last patient has 341. As a first step, let's read all the files into a single DataFrame.

In [10]:
df_01.loc[:, 'pid'] = '01'
df_01['pid']

0      01
1      01
2      01
3      01
4      01
       ..
938    01
939    01
940    01
941    01
942    01
Name: pid, Length: 943, dtype: object

In [11]:
df_70.loc[:, 'pid'] = '70'
df_70['pid']

0      70
1      70
2      70
3      70
4      70
       ..
336    70
337    70
338    70
339    70
340    70
Name: pid, Length: 341, dtype: object

In [12]:
df_data = pd.concat([df_01, df_70], axis=0)
print(df_data.shape)
df_data.head()

(1284, 5)


         date   time  code  value pid
0  04-21-1991   9:09    58  100.0  01
1  04-21-1991   9:09    33    9.0  01
2  04-21-1991   9:09    34   13.0  01
3  04-21-1991  17:08    62  119.0  01
4  04-21-1991  17:08    33    7.0  01


All data files are merged in the same way and stored as a new DataFrame.

In [13]:
df_list = []

for n in range(1, 71):
    # Zero-pad the patient number to match the file names (data-01 ... data-70)
    pid = f'{n:02d}'

    df_tmp = pd.read_csv(os.path.join(DATA_DIR, f'data-{pid}'), sep='\t', header=None)
    df_tmp.columns = ['date', 'time', 'code', 'value']
    df_tmp['pid'] = pid

    df_list.append(df_tmp)

len(df_list)

70

In [14]:
df_data = pd.concat(df_list, axis=0)
print(df_data.shape)
df_data.head()

(29330, 5)


         date   time  code value pid
0  04-21-1991   9:09    58   100  01
1  04-21-1991   9:09    33     9  01
2  04-21-1991   9:09    34    13  01
3  04-21-1991  17:08    62   119  01
4  04-21-1991  17:08    33     7  01
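With the merged frame in place, the numeric codes can be translated into their descriptions using the dictionary built from `Data-Codes`. The sketch below uses a small subset of that dictionary and hypothetical sample rows; the `code_desc` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Subset of the code table shown in Data-Codes
code_dict = {
    33: 'Regular insulin dose',
    34: 'NPH insulin dose',
    58: 'Pre-breakfast blood glucose measurement',
}

# Hypothetical sample rows in the merged layout
sample = pd.DataFrame({
    'code': [58, 33, 34, 62],
    'value': [100, 9, 13, 119],
    'pid': ['01', '01', '01', '01'],
})

# Series.map returns NaN for codes that are missing from the dictionary
sample['code_desc'] = sample['code'].map(code_dict)
print(sample)
```

Rows where `code_desc` is `NaN` point to codes that are absent from the dictionary, which is a quick way to check that the code table was parsed completely.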


## Saving the data

In [15]:
date = dt.strftime(dt.now(), '%y-%m-%d')
df_data.to_pickle(os.path.join(store_dir, f'df_data_{date}.pkl'))
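`to_pickle` fails if the target directory does not exist, so it can help to create it first and read the file back as a check. This is a minimal sketch that writes to a temporary directory; `sample` and `out_dir` are hypothetical stand-ins for `df_data` and `store_dir`.

```python
import os
import tempfile
from datetime import datetime as dt

import pandas as pd

# Hypothetical stand-ins for df_data and store_dir
sample = pd.DataFrame({'code': [58, 33], 'value': [100, 9], 'pid': ['01', '01']})
out_dir = os.path.join(tempfile.mkdtemp(), '0.merge')

# Create the directory first; to_pickle does not create missing parent directories
os.makedirs(out_dir, exist_ok=True)

date = dt.now().strftime('%y-%m-%d')
path = os.path.join(out_dir, f'df_data_{date}.pkl')
sample.to_pickle(path)

# Read the file back to confirm the round trip
restored = pd.read_pickle(path)
print(restored.equals(sample))
```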

In [16]:
%store code_dict

Stored 'code_dict' (dict)
