# Understanding the Data

According to `heart-disease.names`, the dataset has 14 attributes and 303 instances. Of the 14 attributes, 13 are used for prediction and the 14th (`num`) is the predicted attribute. Nine attributes are categorical, counting the target `num`, and five are continuous.

### Categorical Attributes
- sex: 1 = male; 0 = female
- cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- exang: exercise induced angina (1 = yes; 0 = no)
- slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- num: diagnosis of heart disease (angiographic disease status)
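
The split between categorical and continuous attributes can be written down explicitly, which makes the counts easy to verify. The sketch below assumes the column order used when the data is loaded later in this notebook; the names `all_columns` and `continuous_columns` are illustrative.

```python
# Column names in file order (assumption: the same order used when loading the data).
all_columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
               'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# The nine attributes listed above as categorical, including the target `num`.
categorical_columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'num']

# Every remaining attribute is treated as continuous.
continuous_columns = [c for c in all_columns if c not in categorical_columns]

print(len(all_columns), len(categorical_columns), len(continuous_columns))
print(continuous_columns)
```

Note that `ca` is a small count (0-3) and is treated as categorical here, following the list above.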


## Load the Dataset

For this analysis, I'll be using the `processed.cleveland.data` dataset.

In [50]:
import pandas as pd

# The raw file has no header row, so the column names are supplied here
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# load the data
data_path = 'data/heart+disease/processed.cleveland.data'
data = pd.read_csv(data_path, header=None, names=columns)
data.head()

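| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |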


## Check unique values in Categorical Columns

Listing the unique values in each categorical column shows whether any column contains missing or unexpected values. This confirms which codes need a label when the columns are mapped, and any missing values that turn up can be handled appropriately.

In [51]:
# Get Unique Values in Each Column
def print_unique_values(df, categorical_columns):
    for column in categorical_columns:
        if column in df.columns:
            unique_values = df[column].unique()
            print(f"\n {len(unique_values)} Unique values in '{column}':")
            print(unique_values)
            print("-" * 50)
        else:
            print(f"\nWarning: Column '{column}' not found in the DataFrame")

categorical_columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'num']
print_unique_values(data, categorical_columns)


 2 Unique values in 'sex':
[1. 0.]
--------------------------------------------------

 4 Unique values in 'cp':
[1. 4. 3. 2.]
--------------------------------------------------

 2 Unique values in 'fbs':
[1. 0.]
--------------------------------------------------

 3 Unique values in 'restecg':
[2. 0. 1.]
--------------------------------------------------

 2 Unique values in 'exang':
[0. 1.]
--------------------------------------------------

 3 Unique values in 'slope':
[3. 2. 1.]
--------------------------------------------------

 5 Unique values in 'ca':
['0.0' '3.0' '2.0' '1.0' '?']
--------------------------------------------------

 4 Unique values in 'thal':
['6.0' '3.0' '7.0' '?']
--------------------------------------------------

 5 Unique values in 'num':
[0 2 1 3 4]
--------------------------------------------------
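
Listing unique values reveals that a placeholder exists, but not how often it occurs. A minimal sketch of a per-column count follows; it runs on a small made-up frame (not rows from the dataset) that mimics how `ca` and `thal` are read as strings, and the names `sample` and `placeholder_counts` are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; they mimic string-typed `ca` and `thal`.
sample = pd.DataFrame({
    'ca':   ['0.0', '3.0', '?', '1.0'],
    'thal': ['6.0', '?', '?', '3.0'],
    'sex':  [1.0, 0.0, 1.0, 1.0],
})

# Compare every value as a string so numeric and text columns are handled alike,
# then count the '?' placeholders in each column.
placeholder_counts = sample.apply(lambda col: col.astype(str).eq('?').sum())
print(placeholder_counts)
```

The same expression could be applied to `data[categorical_columns]` to see how many rows are affected before deciding how to handle them.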


### Explanation

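The output shows that `ca` and `thal` contain missing values, encoded as `?`. Because of this placeholder, pandas reads both columns as strings rather than numbers, which is why their values appear in quotes. I will handle these missing values by mapping `?` to a new category, `unknown`.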
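
For comparison, a common alternative (not the approach taken in this notebook) is to have pandas treat `?` as missing at load time through the `na_values` parameter of `read_csv`, which keeps `ca` and `thal` numeric. The sketch below reads two rows from an in-memory string; the second row is made up to show a `?` entry.

```python
import io
import pandas as pd

columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Two rows in the file's format; the second row is made up for illustration.
raw = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "50.0,0.0,3.0,120.0,220.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0\n"
)

# na_values='?' makes pandas parse '?' as NaN, so `ca` stays a float column.
df = pd.read_csv(raw, header=None, names=columns, na_values='?')

print(df['ca'].dtype)
print(df['ca'].isna().sum())
```

With this alternative the missing entries become `NaN` and would need to be dropped or imputed, whereas an explicit `unknown` category keeps every row and makes the missingness visible as a label.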

In [52]:
# label categorical columns: map the numeric codes to readable labels
sex_map = {1.0: 'male', 0.0: 'female'}
cp_map = {1.0: 'typical angina', 2.0: 'atypical angina', 3.0: 'non-anginal pain', 4.0: 'asymptomatic'}
fbs_map = {1.0: 'true', 0.0: 'false'}
restecg_map = {0.0: 'normal', 1.0: 'abnormal', 2.0: 'left ventricular hypertrophy'}
exang_map = {1.0: 'yes', 0.0: 'no'}
slope_map = {1.0: 'upsloping', 2.0: 'flat', 3.0: 'downsloping'}
# `ca` and `thal` were read as strings because of the '?' placeholder,
# so their keys are strings; '?' becomes its own 'unknown' category
ca_map = {'0.0': '0', '1.0': '1', '2.0': '2', '3.0': '3', '?': 'unknown'}
thal_map = {'3.0': 'normal', '6.0': 'fixed defect', '7.0': 'reversible defect', '?': 'unknown'}
# `num` is already an integer code (0-4); this identity map leaves it unchanged
num_map = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}

data['sex'] = data['sex'].map(sex_map)
data['cp'] = data['cp'].map(cp_map)
data['fbs'] = data['fbs'].map(fbs_map)
data['restecg'] = data['restecg'].map(restecg_map)
data['exang'] = data['exang'].map(exang_map)
data['slope'] = data['slope'].map(slope_map)
data['ca'] = data['ca'].map(ca_map)
data['thal'] = data['thal'].map(thal_map)
data['num'] = data['num'].map(num_map)

data.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | typical angina | 145.0 | 233.0 | true | left ventricular hypertrophy | 150.0 | no | 2.3 | downsloping | 0 | fixed defect | 0 |
| 1 | 67.0 | male | asymptomatic | 160.0 | 286.0 | false | left ventricular hypertrophy | 108.0 | yes | 1.5 | flat | 3 | normal | 2 |
| 2 | 67.0 | male | asymptomatic | 120.0 | 229.0 | false | left ventricular hypertrophy | 129.0 | yes | 2.6 | flat | 2 | reversible defect | 1 |
| 3 | 37.0 | male | non-anginal pain | 130.0 | 250.0 | false | normal | 187.0 | no | 3.5 | downsloping | 0 | normal | 0 |
| 4 | 41.0 | female | atypical angina | 130.0 | 204.0 | false | left ventricular hypertrophy | 172.0 | no | 1.4 | upsloping | 0 | normal | 0 |
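
One caveat with this approach: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a mistyped or missing key would silently create new missing values. A quick sanity check is to look for `NaN`s after mapping. The sketch below uses a small made-up series rather than the real data, and `'5.0'` is deliberately not a key.

```python
import pandas as pd

thal_map = {'3.0': 'normal', '6.0': 'fixed defect', '7.0': 'reversible defect', '?': 'unknown'}

# Made-up values for illustration; '5.0' has no entry in thal_map.
raw_thal = pd.Series(['3.0', '?', '7.0', '5.0'])
mapped = raw_thal.map(thal_map)

# Values without a dictionary entry come back as NaN after mapping.
unmapped = raw_thal[mapped.isna()].unique()
print(mapped.tolist())
print(unmapped)
```

On the notebook's DataFrame, `data[categorical_columns].isna().sum()` should report zero for every column if each observed value has a key in its map.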
