https://archive.ics.uci.edu/dataset/277/thoracic+surgery+data

Thoracic Surgery Data

    The data addresses a classification problem concerning post-operative life expectancy in lung cancer patients:
    class 1 - death within one year after surgery
    class 2 - survival.

- Dataset Characteristics
    - Multivariate
- Associated Tasks
    - Classification
- Feature Type
    - Integer, Real
- Features
    - 16
- Instances = Individual patients
    - 470

- Additional Information
    - The data was collected retrospectively at Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer in the years 2007 to 2011. The Centre is associated with the Department of Thoracic Surgery of the Medical University of Wroclaw and Lower-Silesian Centre for Pulmonary Diseases, Poland, while the research database constitutes a part of the National Lung Cancer Registry, administered by the Institute of Tuberculosis and Pulmonary Diseases in Warsaw, Poland.
- Has Missing Values?
    - No

- Additional Variable Information
1. DGN: Diagnosis - specific combination of ICD-10 codes for primary and secondary as well as multiple tumours, if any (DGN3,DGN2,DGN4,DGN6,DGN5,DGN8,DGN1)
2. PRE4: Forced vital capacity - FVC (numeric)
3. PRE5: Volume that has been exhaled at the end of the first second of forced expiration - FEV1 (numeric)
4. PRE6: Performance status - Zubrod scale (PRZ2,PRZ1,PRZ0)
5. PRE7: Pain before surgery (T,F)
6. PRE8: Haemoptysis before surgery (T,F)
7. PRE9: Dyspnoea before surgery (T,F)
8. PRE10: Cough before surgery (T,F)
9. PRE11: Weakness before surgery (T,F)
10. PRE14: T in clinical TNM - size of the original tumour, from OC11 (smallest) to OC14 (largest) (OC11,OC14,OC12,OC13)
11. PRE17: Type 2 DM - diabetes mellitus (T,F)
12. PRE19: MI - myocardial infarction up to 6 months (T,F)
13. PRE25: PAD - peripheral arterial diseases (T,F)
14. PRE30: Smoking (T,F)
15. PRE32: Asthma (T,F)
16. AGE: Age at surgery (numeric)
17. Risk1Y: 1-year survival period - (T)rue if the patient died within one year of surgery (T,F)

Class Distribution: the class attribute (Risk1Y) is binary. In the tables below, N denotes the negative value, listed as F in the attribute descriptions above.
   Risk1Y Value:   Number of Instances:
	T                  70
	N                  400
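
With 70 positive cases out of 470, the classes are clearly imbalanced. The following is a minimal sketch of what that implies for a baseline, using only the two counts listed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the distribution listed above.
died_within_1y = 70   # Risk1Y = T
survived = 400        # Risk1Y = N (negative class)
total = died_within_1y + survived

# Share of positive (death within one year) cases.
positive_rate = died_within_1y / total
# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = survived / total

print(f"instances: {total}")
print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

A classifier trained on this data would need to beat roughly 85% accuracy to improve on always predicting survival, so recall or ROC AUC is likely a more informative metric than raw accuracy here.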

Summary Statistics:

	Binary Attributes Distribution:
	   PRE7 Value:   Number of Instances:
		T              	31
		N             	439
	   PRE8 Value:   Number of Instances:
		T              	68
		N             	402
	   PRE9 Value:   Number of Instances:
		T              	31
		N             	439
	   PRE10 Value:   Number of Instances:
		T              	323
		N             	147
	   PRE11 Value:   Number of Instances:
		T              	78
		N             	392		
	   PRE17 Value:   Number of Instances:
		T              	35
		N             	435	
	   PRE19 Value:   Number of Instances:
		T              	2
		N             	468	
	   PRE25 Value:   Number of Instances:
		T              	8
		N             	462
	   PRE30 Value:   Number of Instances:
		T              	386
		N             	84			
	   PRE32 Value:   Number of Instances:
		T              	2
		N             	468
		
	Nominal Attributes Distribution:
	   DGN Value:   Number of Instances:
		DGN3           349
		DGN2           52
		DGN4           47
		DGN6           4
		DGN5           15
		DGN8           2		
		DGN1           1	
	   PRE6 Value:   Number of Instances:
		PRZ2           27
		PRZ1           313
		PRZ0           130
	   PRE14 Value:   Number of Instances:
		OC11           177
		OC14           17
		OC12           257
		OC13           19
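
As a quick consistency check, each attribute's value counts should add up to the 470 instances stated earlier. The sketch below does this for the three nominal attributes, with the counts copied from the listing above; the dictionary layout and names are assumptions made for illustration.

```python
# Value counts copied from the nominal attribute listing above.
nominal_counts = {
    "DGN": {"DGN3": 349, "DGN2": 52, "DGN4": 47, "DGN6": 4,
            "DGN5": 15, "DGN8": 2, "DGN1": 1},
    "PRE6": {"PRZ2": 27, "PRZ1": 313, "PRZ0": 130},
    "PRE14": {"OC11": 177, "OC14": 17, "OC12": 257, "OC13": 19},
}
EXPECTED_INSTANCES = 470  # number of patients stated in the dataset summary

# Sum the counts per attribute and compare against the expected total.
totals = {name: sum(counts.values()) for name, counts in nominal_counts.items()}
for name, attr_total in totals.items():
    status = "ok" if attr_total == EXPECTED_INSTANCES else "MISMATCH"
    print(f"{name}: {attr_total} ({status})")
```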

	Numeric Attributes Statistics:
	          Min    Max    Mean   SD
	PRE4:     1.4    6.3    3.3    0.9
	PRE5:     0.96   86.3   4.6    11.8
	AGE:      21     87     62.5   8.7
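
The PRE5 maximum of 86.3 stands out against a mean of 4.6. FEV1 is the part of the forced vital capacity exhaled in the first second, so it should not meaningfully exceed FVC (PRE4), whose maximum is 6.3. The sketch below shows one way such rows could be flagged. The three-row frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real frame would come from the CSV.
demo = pd.DataFrame({
    "PRE4": [3.80, 2.88, 4.56],   # FVC
    "PRE5": [2.80, 2.16, 72.80],  # FEV1; the last value is implausibly large
})

# FEV1 is part of FVC, so FEV1 > FVC points to a likely recording error.
implausible = demo[demo["PRE5"] > demo["PRE4"]]
print(implausible)
```

Whether to drop, cap, or correct such rows is a modelling decision the source does not address.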

In [4]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sklearn

In [19]:
columns = [
    'DGN', 'PRE4', 'PRE5', 'PRE6', 'PRE7', 'PRE8', 'PRE9', 'PRE10', 
    'PRE11', 'PRE14', 'PRE17', 'PRE19', 'PRE25', 'PRE30', 'PRE32', 
    'AGE', 'Risk1Y'
]
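# The file has no header row and one more column than `columns`,
# so pandas uses the leading column as the row index.
TS = pd.read_csv("./ThoraricSurgery.csv", header=None, names=columns)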
TS.head(20)

Unnamed: 0,DGN,PRE4,PRE5,PRE6,PRE7,PRE8,PRE9,PRE10,PRE11,PRE14,PRE17,PRE19,PRE25,PRE30,PRE32,AGE,Risk1Y
293,1,3.8,2.8,0,0,0,0,0,0,12,0,0,0,1,0,62,0
1,2,2.88,2.16,1,0,0,0,1,1,14,0,0,0,1,0,60,0
8,2,3.19,2.5,1,0,0,0,1,0,11,0,0,1,1,0,66,1
14,2,3.98,3.06,2,0,0,0,1,1,14,0,0,0,1,0,80,1
17,2,2.21,1.88,0,0,1,0,0,0,12,0,0,0,1,0,56,0
18,2,2.96,1.67,0,0,0,0,0,0,12,0,0,0,1,0,61,0
35,2,2.76,2.2,1,0,0,0,1,0,11,0,0,0,0,0,76,0
42,2,3.24,2.52,1,0,0,0,1,0,12,0,0,0,1,0,63,1
65,2,3.15,2.76,1,0,1,0,1,0,12,0,0,0,1,0,59,0
111,2,4.48,4.2,0,0,0,0,0,0,12,0,0,0,1,0,55,0
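
The rows above store the categorical attributes as integers rather than as the labels used in the attribute descriptions. Assuming each integer is simply the numeric suffix of the documented label (DGN 2 for DGN2, PRE6 1 for PRZ1, PRE14 12 for OC12) and that 1/0 stand for T/F, the labels could be restored as sketched below. The source does not confirm this mapping. The `demo` frame copies a few columns of the first three rows shown above.

```python
import pandas as pd

# Three rows copied from the head() output above (subset of columns).
demo = pd.DataFrame(
    {"DGN": [1, 2, 2], "PRE6": [0, 1, 1], "PRE14": [12, 14, 11], "Risk1Y": [0, 0, 1]},
    index=[293, 1, 8],
)

# Assumption: integer codes are the numeric suffixes of the documented labels,
# and 1/0 encode T/F for the binary class attribute.
decoded = pd.DataFrame({
    "DGN": "DGN" + demo["DGN"].astype(str),
    "PRE6": "PRZ" + demo["PRE6"].astype(str),
    "PRE14": "OC" + demo["PRE14"].astype(str),
    "Risk1Y": demo["Risk1Y"].map({1: "T", 0: "F"}),
})
print(decoded)
```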


In [20]:
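print(TS.shape)
# TS.info is a method: printing it without parentheses shows the bound-method
# repr (a dump of the frame, as in the output below). Call TS.info() instead
# to get the column dtypes and non-null counts.
print(TS.info)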

(470, 17)
<bound method DataFrame.info of      DGN  PRE4  PRE5  PRE6  PRE7  PRE8  PRE9  PRE10  PRE11  PRE14  PRE17  \
293    1  3.80  2.80     0     0     0     0      0      0     12      0   
1      2  2.88  2.16     1     0     0     0      1      1     14      0   
8      2  3.19  2.50     1     0     0     0      1      0     11      0   
14     2  3.98  3.06     2     0     0     0      1      1     14      0   
17     2  2.21  1.88     0     0     1     0      0      0     12      0   
..   ...   ...   ...   ...   ...   ...   ...    ...    ...    ...    ...   
98     6  3.04  2.40     2     0     0     0      1      0     11      0   
369    6  3.88  2.72     1     0     0     0      1      0     12      0   
406    6  5.36  3.96     1     0     0     0      1      0     12      0   
25     8  4.32  3.20     0     0     0     0      0      0     11      0   
447    8  5.20  4.10     0     0     0     0      0      0     12      0   

     PRE19  PRE25  PRE30  PRE32  AGE  Risk1Y 