# Heart Disease Classification
Simo Hyttinen<br>
Student #1503565<br>
Helsinki Metropolia University of Applied Sciences<br>
Last edited: <i>31.01.2018</i>


## Objectives
The objective of this assignment is to create a program which can preprocess a dataset and then use that dataset to train a dense neural network to predict whether a person has heart disease or not, based on 13 variables.

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd
import keras
from timeit import default_timer as timer

Using TensorFlow backend.


## Importing a preprocessed dataset
The following imports a preprocessed dataset and sets the column names.

In [33]:
colnames = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
            'restecg', 'thalach', 'exang', 'oldpeak',
            'slope', 'ca', 'thal', 'num']

In [34]:
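def format_prep_csv(filename):
    # '?' marks a missing value in the preprocessed files.
    # Note: read_csv treats the first line as a header by default, so if the
    # file has no header row, header=None is needed to keep the first record.
    df_ret = pd.read_csv(filename, na_values='?')
    df_ret.columns = colnames
    return df_ret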

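As a small illustration of how `na_values='?'` behaves, the sketch below reads a two-record, header-less sample from memory instead of from a file. The sample values are made up for the example, and passing `names=colnames` directly to `read_csv` is an alternative to assigning `df.columns` afterwards, not the approach used in the cell above.

```python
import io
import pandas as pd

colnames = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
            'restecg', 'thalach', 'exang', 'oldpeak',
            'slope', 'ca', 'thal', 'num']

# Made-up sample: comma-separated, no header row, '?' for a missing value.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
)

# Passing names= tells read_csv that the data has no header row,
# so both records are kept as data.
df_sample = pd.read_csv(sample, names=colnames, na_values='?')

print(df_sample.shape)                 # (rows, columns)
print(df_sample['ca'].isna().tolist()) # which 'ca' values are missing
```

Both records survive as data rows, and the `?` in the `ca` column of the second record is read as `NaN`.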

## Importing and preprocessing a raw dataset
This imports a set of raw data and parses it into intelligible entries.<br><br>
Each entry in the raw dataset is spread over 10 rows of 7-8 columns each (the first row of each entry has no 8th column). I looked up the position of each of the 14 values in the dataset documentation and worked out which row and column each one has to be read from. The values are collected in a temporary dataframe. Entries are told apart by taking the row index modulo 10, and the remainder decides what is done with the row. For remainders 0-8, the needed values (if the row holds any) are stored in the temporary dataframe. A remainder of 9 means the entry has ended, so missing values (coded as -9) are replaced with NaN and the temporary dataframe is appended to the processed dataframe.

In [35]:
def parse_raw_csv(filename):
    # Each raw line holds up to eight space-separated values.
    rawcols = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]
    df_raw = pd.read_csv(filename, sep=' ', names=rawcols)
    # df_ret starts with one placeholder row, which is dropped at the end.
    df_ret = pd.DataFrame(index=[0], columns=colnames)
    # df_temp holds the entry that is currently being read.
    df_temp = pd.DataFrame(index=[0], columns=colnames)

    for i, row in df_raw.iterrows():
        m = i % 10  # position of this row within its 10-row entry
        if m == 0:
            df_temp.loc[0, 'age'] = float(row['col3'])
            df_temp.loc[0, 'sex'] = int(row['col4'])
        elif m == 1:
            df_temp.loc[0, 'cp'] = int(row['col2'])
            df_temp.loc[0, 'trestbps'] = float(row['col3'])
            df_temp.loc[0, 'chol'] = float(row['col5'])
        elif m == 2:
            df_temp.loc[0, 'fbs'] = float(row['col1'])
            df_temp.loc[0, 'restecg'] = float(row['col4'])
        elif m == 4:
            df_temp.loc[0, 'thalach'] = float(row['col1'])
            df_temp.loc[0, 'exang'] = float(row['col7'])
        elif m == 5:
            df_temp.loc[0, 'oldpeak'] = float(row['col1'])
            df_temp.loc[0, 'slope'] = float(row['col2'])
            df_temp.loc[0, 'ca'] = float(row['col5'])
        elif m == 6:
            df_temp.loc[0, 'thal'] = float(row['col4'])
        elif m == 7:
            df_temp.loc[0, 'num'] = int(row['col3'])
        elif m == 9:
            # End of the entry: -9 marks a missing value in the raw files.
            for col in df_temp.columns:
                if df_temp.loc[0, col] == -9:
                    df_temp.loc[0, col] = np.nan
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
            df_ret = pd.concat([df_ret, df_temp], ignore_index=True)

    df_ret.drop(df_ret.index[0], inplace=True)
    return df_ret

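The (row, column) picks in `parse_raw_csv` can be cross-checked by converting each pair into the value's position within its entry, which can then be compared with the attribute numbering in the dataset documentation. The sketch below is only a sanity check, not part of the pipeline. It assumes the layout described above (7 values in the first row of an entry, 8 in each later row); `flat_position` and the `positions` table are hypothetical helpers introduced for this illustration.

```python
# Assumed layout: the first row of an entry has 7 values, later rows have 8.
FIRST_ROW_WIDTH = 7
ROW_WIDTH = 8

def flat_position(row_in_entry, col):
    """Return the 1-based position of a value within its entry.

    row_in_entry is the row index modulo 10 (0-9), col is the 1-based column.
    """
    if row_in_entry == 0:
        return col
    return FIRST_ROW_WIDTH + ROW_WIDTH * (row_in_entry - 1) + col

# (row index modulo 10, column number) pairs as read in parse_raw_csv.
positions = {
    'age': (0, 3), 'sex': (0, 4),
    'cp': (1, 2), 'trestbps': (1, 3), 'chol': (1, 5),
    'fbs': (2, 1), 'restecg': (2, 4),
    'thalach': (4, 1), 'exang': (4, 7),
    'oldpeak': (5, 1), 'slope': (5, 2), 'ca': (5, 5),
    'thal': (6, 4), 'num': (7, 3),
}

flat = {name: flat_position(r, c) for name, (r, c) in positions.items()}
for name, pos in flat.items():
    print(name, pos)

# divmod gives the entry number and the row-within-entry in one step.
entry_number, m = divmod(23, 10)  # raw row 23 is row 3 of entry 2 (0-based)
```

Under these assumptions the 14 values sit at positions 3, 4, 9, 10, 12, 16, 19, 32, 38, 40, 41, 44, 51 and 58 of each entry.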
Here the different datasets are combined into one big dataset. The already preprocessed version of the Cleveland data is used because the raw data file is corrupted. <i>New.data</i> is not used because it seems to have a slightly different formatting compared to the other files.

In [45]:
df_hu = parse_raw_csv("data/hungarian.data")
df_sw = parse_raw_csv("data/switzerland.data")
df_cl = format_prep_csv("data/processed.cleveland.data")
df_lbc = parse_raw_csv("data/long-beach-va.data")
df = pd.concat([df_hu, df_sw, df_cl, df_lbc], ignore_index=True)
print("Number of entries in the dataset: " + str(len(df.index)))

Number of entries in the dataset: 919


## Description of the data
The dataset contains 14 variables that describe the person in the entry:
- Sex
- Age
- Chest pain type
    - 1: Typical angina
    - 2: Atypical angina
    - 3: Non-anginal pain
    - 4: Asymptomatic
- Resting blood pressure (mm Hg) on admission
- Serum cholesterol (mg/dl)
- Fasting blood sugar
    - 1: Over 120 mg/dl
    - 0: 120 mg/dl or under
- Resting electrocardiographic results
    - 0: Normal
    - 1: ST-T wave abnormality
    - 2: Probable or definite left ventricular hypertrophy