# Thoracic Surgery Data Set



## Description
Data related to the post-operative life expectancy in lung cancer patients.

The column attributes are as follows:

| Id | Attribute | Domain |
| --- | --- | --- |
| 1   | DGN - Diagnosis | DGN3, DGN2, DGN4, DGN6, DGN5, DGN8, DGN1 |
| 2   | PRE4 - Forced vital capacity | numeric |
| 3   | PRE5 - Volume that has been exhaled at the end of the first second of forced expiration | numeric |
| 4   | PRE6 - Performance status - Zubrod scale | PRZ2,PRZ1,PRZ0 |
| 5   | PRE7 - Pain before surgery | T,F |
| 6  | PRE8 -  Haemoptysis before surgery | T,F |
| 7  | PRE9 - Dyspnoea before surgery | T,F |
| 8  | PRE10 - Cough before surgery | T,F |
| 9  | PRE11 - Weakness before surgery | T,F |
| 10  | PRE14 - T in clinical TNM - size of the original tumour, from OC11 (smallest) to OC14 (largest) | OC11,OC14,OC12,OC13 |
| 11  | PRE17 - Type 2 diabetes mellitus | T,F |
| 12  | PRE19 - Myocardial infarction (MI) up to 6 months | T,F |
| 13  | PRE25 - Peripheral arterial diseases | T,F |
| 14  | PRE30 - Smoking | T,F |
| 15  | PRE32 - Asthma | T,F |
| 16  | AGE - Age at surgery | numeric |
| 17  | Risk1Yr - 1 year survival period (T if the patient died within one year) | T,F |

source: https://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data


## Importing and processing the dataset

In [37]:
import os                        # for os.path.exists
import json                      # for loading metadata
import urllib.request            # for downloading remote files (urlretrieve lives in the request submodule)
import numpy as np
import pandas as pd
from scipy.io import arff

In [38]:
def download(remoteurl: str, localfile: str) -> str:
    """
    Download remoteurl into ../../datasets/classification/ under the name
    localfile, unless that file already exists.
    Returns the full local path as a string.
    """
    localfile = "../../datasets/classification/" + localfile
    if not os.path.exists(localfile):
        print("Downloading %s..." % localfile)
        os.makedirs(os.path.dirname(localfile), exist_ok=True)  # create the target directory if missing
        urllib.request.urlretrieve(remoteurl, localfile)
    return localfile
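
The skip-if-present behaviour of `download` can be checked without network access. The sketch below is an illustration, not the notebook's own code: `download_cached` and its `fetch` parameter are hypothetical names, and `fetch` defaults to `urllib.request.urlretrieve` only so that a stand-in can be injected. The URL is a placeholder that is never contacted.

```python
import os
import tempfile
import urllib.request


def download_cached(remoteurl, localfile, fetch=urllib.request.urlretrieve):
    """Fetch remoteurl into localfile unless localfile already exists.

    `fetch` is a hypothetical injection point: any callable taking
    (url, path) and writing the file at path.
    """
    if not os.path.exists(localfile):
        # Create the parent directory so the first download cannot fail on it
        os.makedirs(os.path.dirname(localfile) or ".", exist_ok=True)
        fetch(remoteurl, localfile)
    return localfile


# Usage with a fake fetcher that records calls instead of using the network
calls = []


def fake_fetch(url, path):
    calls.append(url)
    with open(path, "w") as fh:
        fh.write("@relation demo\n")


tmpdir = tempfile.mkdtemp()
target = os.path.join(tmpdir, "nested", "demo.arff")
first = download_cached("https://example.invalid/demo.arff", target, fetch=fake_fetch)
second = download_cached("https://example.invalid/demo.arff", target, fetch=fake_fetch)
print(len(calls))  # the second call finds the file and does not fetch again
```

Because the check is only on the path, a partially written or outdated file would also be treated as present; deleting it forces a fresh download.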

In [39]:
data_file = download("https://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff", "ThoracicSurgery.arff")

arff_data = arff.loadarff(data_file)

column_names = [
    'DGN',
    'PRE4',
    'PRE5',
    'PRE6',
    'PRE7',
    'PRE8',
    'PRE9',
    'PRE10',
    'PRE11',
    'PRE14',
    'PRE17',
    'PRE19',
    'PRE25',
    'PRE30',
    'PRE32',
    'AGE',
    'Risk1Yr']

cat_columns = ['DGN',
    'PRE6',
    'PRE7',
    'PRE8',
    'PRE9',
    'PRE10',
    'PRE11',
    'PRE14',
    'PRE17',
    'PRE19',
    'PRE25',
    'PRE30',
    'PRE32',
    'Risk1Yr']

data = pd.DataFrame(arff_data[0].tolist(), columns=column_names)

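# Decode the byte strings that loadarff returns for nominal attributes
for cat_col in cat_columns:
    data[cat_col] = data[cat_col].str.decode('utf-8')

# Drop rows with missing values ("?" in ARFF) before encoding,
# so that "?" is never turned into a category code
data = data.replace("?", np.nan)
data = data.dropna()

# Encode each categorical column as integer codes
for cat_col in cat_columns:
    data[cat_col] = pd.Categorical(data[cat_col]).codes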

X = data.iloc[:, :-1].to_numpy()   # all columns except the last are the features
y = data.iloc[:, -1].to_numpy()    # the last column (Risk1Yr) is the target
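
The encoding loop above relies on `pd.Categorical`, which, when no categories are passed, infers them from the observed values in sorted order. The integer codes therefore follow alphabetical order of the labels, so `F` becomes 0 and `T` becomes 1. The following sketch uses a few toy values from the attribute table (not rows from the actual file) to show how to recover the code-to-label mapping; the variable names are illustrative.

```python
import pandas as pd

# Toy values taken from the domains listed in the attribute table
pre6 = pd.Series(["PRZ2", "PRZ0", "PRZ1", "PRZ0"])
risk = pd.Series(["T", "F", "F", "T"])

pre6_cat = pd.Categorical(pre6)
risk_cat = pd.Categorical(risk)

# Map each integer code back to its label
pre6_mapping = dict(enumerate(pre6_cat.categories))
risk_mapping = dict(enumerate(risk_cat.categories))

print(pre6_cat.codes.tolist(), pre6_mapping)
print(risk_cat.codes.tolist(), risk_mapping)
```

Since the categories come from the data, a label that never occurs in a column gets no code, and the mapping should be read from `.categories` rather than assumed.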

## Importing libraries

In [40]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
import sklearn.tree
import sklearn.neighbors
import sklearn.ensemble
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
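
The notebook imports several classifiers but does not show them in use at this point. As a rough sketch of what a next step might look like, the example below fits two of the imported models on a 70/30 split and reports held-out accuracy. It runs on synthetic data from `make_classification` as a stand-in for `X` and `y`, so the scores say nothing about the thoracic surgery data; the `*_demo` and `Xd_*`/`yd_*` names and the `max_iter=1000` setting are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X and y (assumption: 16 features, binary target)
X_demo, y_demo = make_classification(n_samples=100, n_features=16, random_state=0)

# Same split settings as the notebook cell: 30% held out, fixed seed
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    scores[name] = model.score(Xd_test, yd_test)  # mean accuracy on the held-out 30%

print(Xd_train.shape, Xd_test.shape)
print(scores)
```

If the target classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions similar in both parts, and accuracy alone may be a misleading metric.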