# MIMIC-iv-CXR Data Analysis

> For the original data and further details, see [here](https://physionet.org/content/mimic-cxr/2.1.0/).

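## Data Summary

The MIMIC Chest X-ray (MIMIC-CXR) Database v2.1.0 is a large, publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. The imaging studies were performed at the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and the dataset was de-identified to satisfy the Safe Harbor requirements of the US Health Insurance Portability and Accountability Act of 1996 (HIPAA).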

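## Data Types

MIMIC-CXR combines three data modalities:

- Electronic health record (EHR) data
- Images (chest radiographs)
- Natural language (free-text reports)

The three modalities were processed largely independently and then integrated to build the database.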

## Checking the Label Distribution

In [None]:
import pandas as pd

chexpert = pd.read_csv('mimic-cxr-2.0.0-chexpert.csv.gz', compression='gzip')
chexpert

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227822,19999442,58708861,,,,,,,,,1.0,,,,,1.0
227823,19999733,57132437,,,,,,,,,1.0,,,,,
227824,19999987,55368167,1.0,-1.0,,,,,0.0,,,0.0,,,0.0,
227825,19999987,58621812,1.0,,,,,,,,,,,,,1.0


In [15]:
print('Total studies: {:,}'.format(len(chexpert['study_id'].unique())))

Total studies: 227,827


| Finding                    | Count with value 1 | Count with value not 1 |
|----------------------------|-------------------:|-----------------------:|
| Atelectasis                |             45,808 |                182,019 |
| Cardiomegaly               |             44,845 |                182,982 |
| Consolidation              |             10,778 |                217,049 |
| Edema                      |             27,018 |                200,809 |
| Enlarged Cardiomediastinum |              7,179 |                220,648 |
| Fracture                   |              4,390 |                223,437 |
| Lung Lesion                |              6,284 |                221,543 |
| Lung Opacity               |             51,525 |                176,302 |
| No Finding                 |             75,455 |                152,372 |
| Pleural Effusion           |             54,300 |                173,527 |
| Pleural Other              |              2,011 |                225,816 |
| Pneumonia                  |             16,556 |                211,271 |
| Pneumothorax               |             10,358 |                217,469 |
| Support Devices            |             66,558 |                161,269 |

In [17]:
# Count the values equal to 1 and not equal to 1 in each label column
# (NaN, 0, and -1 all fall into the "not 1" count)
result = []
for col in chexpert.columns[2:]:
    count_1 = (chexpert[col] == 1).sum()
    count_not_1 = (chexpert[col] != 1).sum()
    result.append({'Column': col, 'Count_1': count_1, 'Count_not_1': count_not_1})

# Print the result as a DataFrame
summary_df = pd.DataFrame(result)
print(summary_df)

                        Column  Count_1  Count_not_1
0                  Atelectasis    45808       182019
1                 Cardiomegaly    44845       182982
2                Consolidation    10778       217049
3                        Edema    27018       200809
4   Enlarged Cardiomediastinum     7179       220648
5                     Fracture     4390       223437
6                  Lung Lesion     6284       221543
7                 Lung Opacity    51525       176302
8                   No Finding    75455       152372
9             Pleural Effusion    54300       173527
10               Pleural Other     2011       225816
11                   Pneumonia    16556       211271
12                Pneumothorax    10358       217469
13             Support Devices    66558       161269
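
The `Count_not_1` column above merges three different states. The sketch below separates them on a small made-up frame (`toy` is hypothetical and only mirrors the layout of `chexpert`). It assumes the usual CheXpert labeler convention: 1.0 is positive, 0.0 is negative, -1.0 is uncertain, and a blank means the finding was not mentioned.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the label table; the values are invented for illustration.
toy = pd.DataFrame({
    'subject_id': [1, 1, 2, 3],
    'study_id': [11, 12, 21, 31],
    'Pneumonia': [1.0, np.nan, -1.0, 0.0],
    'No Finding': [np.nan, 1.0, np.nan, np.nan],
})

# Keep only the label columns (same effect as columns[2:] in the notebook).
labels = toy.drop(columns=['subject_id', 'study_id'])

# One row per label column, one column per label state.
breakdown = pd.DataFrame({
    'positive': labels.eq(1).sum(),
    'negative': labels.eq(0).sum(),
    'uncertain': labels.eq(-1).sum(),
    'missing': labels.isna().sum(),
})
print(breakdown)
```

Applied to the real table, the same expressions on `chexpert.iloc[:, 2:]` would show how much of each "not 1" count is an explicit negative and how much is simply blank.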


- Normal (No Finding) studies: 75,455
- Pneumonia studies: 16,556
- Abnormal studies without pneumonia: 135,816 (= 227,827 - 75,455 - 16,556)
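
The subtraction above is only valid if no study is both normal and pneumonia-positive. The pneumonia subset later in this notebook shows `No Finding` equal to 1 in none of its rows, so that condition appears to hold. A minimal sketch of the same split using boolean masks, on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy label table; values are invented for illustration.
toy = pd.DataFrame({
    'study_id': [11, 12, 21, 31, 41],
    'No Finding': [1.0, np.nan, np.nan, np.nan, np.nan],
    'Pneumonia': [np.nan, 1.0, -1.0, np.nan, 1.0],
    'Edema': [np.nan, np.nan, np.nan, 1.0, 1.0],
})

normal = toy['No Finding'].eq(1)
pneumonia = toy['Pneumonia'].eq(1)
other = ~normal & ~pneumonia

# The subtraction matches the mask count only when the two groups are disjoint.
overlap = int((normal & pneumonia).sum())
n_other_by_subtraction = len(toy) - int(normal.sum()) - int(pneumonia.sum())

print(overlap, int(other.sum()), n_other_by_subtraction)
```

Note that the third group is defined by exclusion. Study 21 in the toy frame has only an uncertain pneumonia label, yet it lands in that group, so "abnormal without pneumonia" should be read as "neither `No Finding` = 1 nor `Pneumonia` = 1".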

## Downloading the Pneumonia Data

In [18]:
# For studies with Pneumonia == 1, count the 1 and non-1 values in every label column
pneumonia_cases = chexpert[chexpert['Pneumonia']==1.0]

# Count the values equal to 1 and not equal to 1 in each label column
result = []
for col in pneumonia_cases.columns[2:]:
    count_1 = (pneumonia_cases[col] == 1).sum()
    count_not_1 = (pneumonia_cases[col] != 1).sum()
    result.append({'Column': col, 'Count_1': count_1, 'Count_not_1': count_not_1})

# Print the result as a DataFrame
summary_df = pd.DataFrame(result)
print(summary_df)

                        Column  Count_1  Count_not_1
0                  Atelectasis     3672        12884
1                 Cardiomegaly     3507        13049
2                Consolidation     2268        14288
3                        Edema     3052        13504
4   Enlarged Cardiomediastinum      546        16010
5                     Fracture      180        16376
6                  Lung Lesion      703        15853
7                 Lung Opacity     8397         8159
8                   No Finding        0        16556
9             Pleural Effusion     5007        11549
10               Pleural Other      206        16350
11                   Pneumonia    16556            0
12                Pneumothorax      280        16276
13             Support Devices     4543        12013
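
Raw counts are easier to compare as shares of the 16,556 pneumonia-positive studies. The sketch below reuses five of the counts printed above and converts them to percentages; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the output above (studies with Pneumonia == 1).
n_pneumonia = 16556
co_counts = pd.Series({
    'Lung Opacity': 8397,
    'Pleural Effusion': 5007,
    'Support Devices': 4543,
    'Atelectasis': 3672,
    'Cardiomegaly': 3507,
})

# Percentage of pneumonia-positive studies that also carry each finding.
share = (co_counts / n_pneumonia * 100).round(1)
print(share.sort_values(ascending=False))
```

On these numbers, roughly half of the pneumonia-positive studies are also labeled with `Lung Opacity`, which is by far the most frequent co-occurring finding.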


In [None]:
meta = pd.read_csv('mimic-cxr-2.0.0-metadata.csv.gz', compression='gzip')
meta

Unnamed: 0,dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,10000032,50414267,CHEST (PA AND LAT),PA,3056,2544,21800506,213014.531,CHEST (PA AND LAT),postero-anterior,Erect
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,10000032,50414267,CHEST (PA AND LAT),LATERAL,3056,2544,21800506,213014.531,CHEST (PA AND LAT),lateral,Erect
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,10000032,53189527,CHEST (PA AND LAT),PA,3056,2544,21800626,165500.312,CHEST (PA AND LAT),postero-anterior,Erect
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,10000032,53189527,CHEST (PA AND LAT),LATERAL,3056,2544,21800626,165500.312,CHEST (PA AND LAT),lateral,Erect
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,10000032,53911762,CHEST (PORTABLE AP),AP,2705,2539,21800723,80556.875,CHEST (PORTABLE AP),antero-posterior,
...,...,...,...,...,...,...,...,...,...,...,...,...
377105,428e2c18-5721d8f3-35a05001-36f3d080-9053b83c,19999733,57132437,CHEST (PA AND LAT),PA,3056,2544,21520708,224550.171,CHEST (PA AND LAT),postero-anterior,Erect
377106,58c403aa-35ff8bd9-73e39f54-8dc9cc5d-e0ec3fa9,19999733,57132437,CHEST (PA AND LAT),LATERAL,3056,2544,21520708,224550.171,CHEST (PA AND LAT),lateral,Erect
377107,58766883-376a15ce-3b323a28-6af950a0-16b793bd,19999987,55368167,CHEST (PORTABLE AP),AP,2544,3056,21451104,51448.218,CHEST (PORTABLE AP),antero-posterior,Erect
377108,7ba273af-3d290f8d-e28d0ab4-484b7a86-7fc12b08,19999987,58621812,CHEST (PORTABLE AP),AP,3056,2544,21451102,202809.234,CHEST (PORTABLE AP),antero-posterior,Erect


In [24]:
print('{:,}'.format(len(meta['subject_id'].unique())))
print('{:,}'.format(len(meta['study_id'].unique())))
print('{:,}'.format(meta['dicom_id'].count()))

65,379
227,835
377,110
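
The metadata file lists 227,835 studies, while the label file contains 227,827, a difference of 8. The source does not say why. One way to list the studies that have images but no label row is a left merge with `indicator=True`, sketched here on made-up frames (`meta_toy` and `labels_toy` are hypothetical stand-ins for `meta` and `chexpert`):

```python
import pandas as pd

# Toy stand-ins: three studies in the metadata, two in the label table.
meta_toy = pd.DataFrame({
    'dicom_id': ['a', 'b', 'c', 'd'],
    'study_id': [11, 11, 12, 13],
})
labels_toy = pd.DataFrame({'study_id': [11, 12]})

# One row per study, then mark which of them also appear in the label table.
meta_studies = meta_toy[['study_id']].drop_duplicates()
merged = meta_studies.merge(labels_toy, on='study_id', how='left', indicator=True)

# 'left_only' rows exist in the metadata but not in the labels.
unlabeled = merged.loc[merged['_merge'] == 'left_only', 'study_id'].tolist()
print(unlabeled)
```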


In [25]:
negbio = pd.read_csv('mimic-cxr-2.0.0-negbio.csv.gz', compression='gzip')
negbio

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227822,19999442,58708861,,,,,,,,,1.0,,,,,1.0
227823,19999733,57132437,,,,,,,,,1.0,,,,,
227824,19999987,55368167,1.0,-1.0,,,,,0.0,,,0.0,,,0.0,
227825,19999987,58621812,1.0,,,,,,,,,,,,,1.0
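
The rows displayed for the NegBio labels are identical to the CheXpert rows shown earlier, but the preview covers only a handful of studies. A rough way to measure agreement across a whole column is sketched below on made-up frames (`chexpert_toy` and `negbio_toy` are hypothetical). Two blanks are treated as agreement, because a plain `==` comparison evaluates NaN as unequal to NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two label tables; values are invented.
chexpert_toy = pd.DataFrame({'study_id': [11, 12, 13],
                             'Pneumonia': [1.0, np.nan, -1.0]})
negbio_toy = pd.DataFrame({'study_id': [11, 12, 13],
                           'Pneumonia': [1.0, np.nan, 0.0]})

# Align the two labelers on study_id.
both = chexpert_toy.merge(negbio_toy, on='study_id',
                          suffixes=('_chexpert', '_negbio'))
a = both['Pneumonia_chexpert']
b = both['Pneumonia_negbio']

# Equal values agree; two missing values also count as agreement.
agree = a.eq(b) | (a.isna() & b.isna())
print(f'agreement: {agree.mean():.2f}')
```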


In [26]:
split = pd.read_csv('mimic-cxr-2.0.0-split.csv.gz', compression='gzip')
split

Unnamed: 0,dicom_id,study_id,subject_id,split
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,50414267,10000032,train
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,50414267,10000032,train
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,53189527,10000032,train
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,53189527,10000032,train
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,53911762,10000032,train
...,...,...,...,...
377105,428e2c18-5721d8f3-35a05001-36f3d080-9053b83c,57132437,19999733,train
377106,58c403aa-35ff8bd9-73e39f54-8dc9cc5d-e0ec3fa9,57132437,19999733,train
377107,58766883-376a15ce-3b323a28-6af950a0-16b793bd,55368167,19999987,train
377108,7ba273af-3d290f8d-e28d0ab4-484b7a86-7fc12b08,58621812,19999987,train
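
To fetch only the pneumonia images, the label table can be joined to the split table and turned into a list of file paths. The sketch below assumes the directory layout `files/p<first two digits of subject_id>/p<subject_id>/s<study_id>/<dicom_id>.dcm`; verify this against the dataset's file listing before relying on it. The helper `dicom_path` and all IDs in the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def dicom_path(subject_id, study_id, dicom_id):
    """Build a relative path under the assumed MIMIC-CXR directory layout."""
    sid = str(subject_id)
    return f'files/p{sid[:2]}/p{sid}/s{study_id}/{dicom_id}.dcm'

# Toy stand-ins for `split` and `chexpert`; IDs and labels are invented.
split_toy = pd.DataFrame({
    'dicom_id': ['aaaa', 'bbbb', 'cccc'],
    'study_id': [50000001, 50000001, 50000002],
    'subject_id': [10000001, 10000001, 10000002],
    'split': ['train', 'train', 'test'],
})
labels_toy = pd.DataFrame({
    'study_id': [50000001, 50000002],
    'Pneumonia': [1.0, np.nan],
})

# Keep the images that belong to pneumonia-positive studies.
pneumonia_studies = labels_toy.loc[labels_toy['Pneumonia'].eq(1), 'study_id']
selected = split_toy[split_toy['study_id'].isin(pneumonia_studies)].copy()
selected['path'] = [
    dicom_path(s, t, d)
    for s, t, d in zip(selected['subject_id'], selected['study_id'], selected['dicom_id'])
]
print(selected[['split', 'path']])
```

The resulting `path` column could then be written to a text file and handed to a download tool, one path per line.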


# COVID-CXR Data Analysis

In [20]:
import pandas as pd

# Paths of the whitespace-delimited text file and the CSV to write
txt_file_path = './data/covid-cxr/train.txt'
csv_file_path = './data/covid-cxr/train.csv'

# Read with whitespace as the delimiter (the regex \s+ also handles runs of spaces).
# sep=r'\s+' replaces the deprecated delim_whitespace=True.
# Note: no header argument is given, so the first record becomes the column names.
df = pd.read_csv(txt_file_path, sep=r'\s+')

# Save as CSV
df.to_csv(csv_file_path, index=False)

print(f"Converted to '{csv_file_path}'")

Converted to './data/covid-cxr/train.csv'


In [21]:
covid_train = pd.read_csv(csv_file_path)
covid_train.head()

Unnamed: 0,379,1e64990d1b40c1758a2aaa9c7f7a85_jumbo.jpeg,negative,cohen
0,379,7223b8ad031187d9a142d7f7ca02c9_jumbo.jpeg,negative,cohen
1,380,3392dc7d262e28423caca517f98c2e_jumbo.jpeg,negative,cohen
2,380,ec3a480c0926ded74429df416cfb05_jumbo.jpeg,negative,cohen
3,382,a72aeb349a63c79ed24e473c434efe_jumbo.jpg,negative,cohen
4,382,ba45a47c3ef5060ec39891046be7ca_jumbo.jpg,negative,cohen


In [23]:
print((covid_train['negative'] == 'negative').sum())
print((covid_train['negative'] == 'positive').sum())


10663
57199
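
The column is called `negative` only because the text file has no header row, so pandas used the first record as column names. That record is then excluded from the data, which presumably leaves the negative count one short. A minimal sketch of reading the same layout with explicit names follows; the names in `columns` are assumptions, not taken from the dataset's documentation, and the sample lines are copied from the preview above.

```python
import io
import pandas as pd

# Three records in the same whitespace-delimited layout as train.txt.
sample = io.StringIO(
    "379 1e64990d1b40c1758a2aaa9c7f7a85_jumbo.jpeg negative cohen\n"
    "379 7223b8ad031187d9a142d7f7ca02c9_jumbo.jpeg negative cohen\n"
    "380 3392dc7d262e28423caca517f98c2e_jumbo.jpeg negative cohen\n"
)

# Hypothetical column names; header=None keeps the first record as data.
columns = ['patient_id', 'filename', 'label', 'source']
covid = pd.read_csv(sample, sep=r'\s+', header=None, names=columns)

print(len(covid))
print(covid['label'].value_counts())
```

With named columns, the class counts can be obtained in one call to `value_counts()` instead of two separate comparisons.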


In [25]:
import tarfile

# Extract the archive; filter='data' rejects unsafe members such as absolute paths
with tarfile.open('./data/covid-cxr/covid19_posi_metadata.tar.gz', 'r:gz') as tar:
    tar.extractall('./data/covid-cxr/unpacked', filter='data')
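
Before unpacking an archive from an external source, it can help to list its members first. The sketch below builds a tiny archive in a temporary directory (all file names are hypothetical), lists it, and extracts it. It passes `filter='data'` only when the running Python exposes `tarfile.data_filter`, since older interpreters do not accept that argument.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small file and pack it into a gzip-compressed tar archive.
    src = os.path.join(tmp, 'metadata.csv')
    with open(src, 'w') as f:
        f.write('id,label\n1,positive\n')
    archive = os.path.join(tmp, 'demo.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(src, arcname='metadata.csv')

    out_dir = os.path.join(tmp, 'unpacked')
    with tarfile.open(archive, 'r:gz') as tar:
        # Inspect the member names before writing anything to disk.
        names = tar.getnames()
        # Use the 'data' extraction filter where the interpreter supports it.
        extra = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
        tar.extractall(out_dir, **extra)

    extracted = sorted(os.listdir(out_dir))

print(names, extracted)
```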
