# Feature Engineering

* **Project:** M2: Mini Project
* **Author:** Jacob Buysse

In this project we are going to analyze the impact of feature engineering on model performance.  I have chosen a dataset containing 13 features and one target class for determining if heart disease is present in the patient.<br>
https://archive.ics.uci.edu/dataset/45/heart+disease

In this notebook we will be using...

In [21]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

Let us load the Cleveland data file and view the `head`/`info`/`describe` results.

In [2]:
df = pd.read_csv(
    './processed.cleveland.data',
    names=[
        'age', 'sex', 'cp', 'trestbps', 'chol',
        'fbs', 'restecg', 'thalach', 'exang', 'oldpeak',
        'slope', 'ca', 'thal', 'num'
    ],
    dtype={'ca': np.float64, 'thal': np.float64},
    na_values={'ca': ['?'], 'thal': ['?']}
)
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  num       303 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 33.3 KB


In [4]:
df.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,299.0,301.0,303.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.672241,4.734219,0.937294
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,0.937438,1.939706,1.228536
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,3.0,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,3.0,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,2.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,4.0


We have 303 rows.  There are 4 NA (`?`) values for `ca` and 2 NA (`?`) values for `thal`.  All feature columns were loaded as `float64`, though quite a few are listed as categorical on the source website.

Let us take a closer look at the target value.

In [5]:
df.num.value_counts()

num
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64

It is described as a categorical value, with 0 being the absence of heart disease and values 1 through 4 being different degrees of heart disease.  However, we can just try to train a binary classifier for absence (0) or presence (1-4).  Let us add the new target column.

In [6]:
df['disease'] = df.num.apply(lambda value: value != 0)
df.disease.describe()

count       303
unique        2
top       False
freq        164
Name: disease, dtype: object
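
As a quick check on the binarization, the class counts printed above can be rebuilt into a stand-in `num` column and converted with a vectorized comparison, which should give the same result as the `apply` call.  This is a minimal sketch: the series is reconstructed from the printed counts (an assumption for illustration), not read from the data file.

```python
import numpy as np
import pandas as pd

# Rebuild a stand-in `num` column from the value counts printed above
# (synthetic ordering; only the counts match the real data).
num = pd.Series(np.repeat([0, 1, 2, 3, 4], [164, 55, 36, 35, 13]), name="num")

# Vectorized equivalent of df.num.apply(lambda value: value != 0)
disease = num != 0

n_present = int(disease.sum())
n_absent = int((~disease).sum())
print(f"absent={n_absent}, present={n_present}")
```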

So we have 164 samples with no heart disease and 139 samples with heart disease.  Let us do a 75/25 train/test split stratified on the binary target and try a logistic regression to get a baseline performance before doing any feature engineering.  We will just drop the NA rows for now.

In [7]:
features = [
    'age', 'sex', 'cp', 'trestbps', 'chol',
    'fbs', 'restecg', 'thalach', 'exang', 'oldpeak',
    'slope', 'ca', 'thal'
]
filtered_df = df.dropna()
X = filtered_df[features].values
y = filtered_df.disease.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=777, stratify=filtered_df.disease)

def score_model(y_pred):
    print("Confusion Matrix")
    print(f"{confusion_matrix(y_test, y_pred)}")
    print(f"{classification_report(y_test, y_pred)}")

print(f"X_train {len(X_train)}, X_test {len(X_test)}, y_train {len(y_train)}, y_test {len(y_test)}")

X_train 222, X_test 75, y_train 222, y_test 75


We have 222 training samples and 75 testing samples with 13 numeric features and 1 binary target.

In [8]:
model = LogisticRegression(random_state=777, max_iter=2000).fit(X_train, y_train)
score_model(model.predict(X_test))

Confusion Matrix
[[35  5]
 [ 8 27]]
              precision    recall  f1-score   support

       False       0.81      0.88      0.84        40
        True       0.84      0.77      0.81        35

    accuracy                           0.83        75
   macro avg       0.83      0.82      0.82        75
weighted avg       0.83      0.83      0.83        75



We will use the overall score from the report to rank the models.  Our initial attempt reached 83% (both the accuracy and the weighted-average F1-score round to 0.83).  There is definitely a relationship, but let us see how much we can improve this model performance through feature engineering.
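
To make the ranking metric explicit, the summary rows of the report can be recomputed from the baseline confusion matrix.  The sketch below rebuilds label vectors from the four printed counts (the individual rows are synthetic; only the counts come from the output above) and uses `accuracy_score` and `f1_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Baseline confusion matrix printed above:
# rows are true classes (False, True), columns are predicted classes.
cm = np.array([[35, 5],
               [8, 27]])

# Expand the four cell counts into matching true/predicted label vectors.
y_true = np.repeat([False, False, True, True], cm.ravel())
y_pred = np.repeat([False, True, False, True], cm.ravel())

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={accuracy:.2f} weighted_f1={weighted_f1:.2f} macro_f1={macro_f1:.2f}")
# accuracy=0.83 weighted_f1=0.83 macro_f1=0.82
```

Note that `classification_report` prints the accuracy in the `f1-score` column, so it is worth checking which of these numbers is being compared between runs.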

## Imputing NA Values

There were two columns that were missing values.  Let us see what values make sense.

First, let us look at `ca` (a numeric feature).

In [9]:
df.ca.describe()

count    299.000000
mean       0.672241
std        0.937438
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        3.000000
Name: ca, dtype: float64

In [10]:
df.ca.value_counts()

ca
0.0    176
1.0     65
2.0     38
3.0     20
Name: count, dtype: int64

The most common value by far is 0 (176 of 299 rows, and also the median), so it seems pretty safe to use 0 for the 4 missing values.

In [11]:
df.ca = df.ca.fillna(0)

Now let us look at `thal` (a categorical feature).

In [12]:
df.thal.describe()

count    301.000000
mean       4.734219
std        1.939706
min        3.000000
25%        3.000000
50%        3.000000
75%        7.000000
max        7.000000
Name: thal, dtype: float64

In [13]:
df.thal.value_counts()

thal
3.0    166
7.0    117
6.0     18
Name: count, dtype: int64

The most common value is 3 (166 of 301 rows), so it seems pretty safe to use 3 for the 2 missing values.

In [14]:
df.thal = df.thal.fillna(3)
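
Both fills above use the most frequent value of the column.  As an alternative to hard-coding the constants, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` can learn them from the data.  This is a swapped-in technique shown on toy rows (not the real data), not what the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in rows (not the real data) with one missing value per column.
toy = pd.DataFrame({
    "ca": [0.0, 0.0, 1.0, np.nan, 2.0, 0.0],
    "thal": [3.0, 7.0, 3.0, 3.0, np.nan, 6.0],
})

# strategy="most_frequent" fills each column with its own mode.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # learned fill value per column
print(filled)
```

Fitting the imputer on the training rows only and then transforming the test rows would keep test information out of the fill values.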

Now let us repeat our test but include all of the rows.

In [15]:
X = df[features].values
y = df.disease.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=777, stratify=df.disease)
print(f"X_train {len(X_train)}, X_test {len(X_test)}, y_train {len(y_train)}, y_test {len(y_test)}")

X_train 227, X_test 76, y_train 227, y_test 76


We are now using 227 rows (5 more) for training and 76 rows (1 more) for testing.
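
The `stratify` argument is what keeps the class balance similar in both parts of the split.  A minimal sketch with synthetic labels that match the 164/139 class counts (the feature matrix is a placeholder, introduced only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class balance as the full dataset
# (164 negative, 139 positive); X is just a row index placeholder.
y = np.array([False] * 164 + [True] * 139)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=777, stratify=y
)

print(f"train rows: {len(y_train)}, test rows: {len(y_test)}")
print(f"overall positive rate: {y.mean():.3f}")
print(f"train positive rate:   {y_train.mean():.3f}")
print(f"test positive rate:    {y_test.mean():.3f}")
```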

In [16]:
model = LogisticRegression(random_state=777, max_iter=2000).fit(X_train, y_train)
score_model(model.predict(X_test))

Confusion Matrix
[[38  3]
 [ 8 27]]
              precision    recall  f1-score   support

       False       0.83      0.93      0.87        41
        True       0.90      0.77      0.83        35

    accuracy                           0.86        76
   macro avg       0.86      0.85      0.85        76
weighted avg       0.86      0.86      0.85        76



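Just including those 6 additional rows with imputed values improved the accuracy to 86% (+3%), with a weighted-average F1-score of 0.85.  Note that the test set also changed (76 rows instead of 75), so part of the difference may come from the split rather than from the imputation.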

## One-hot Encoding

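Many of the source columns are categorical values.  Using the category codes as numeric features imposes an artificial order and magnitude on them (for example, `thal` takes the codes 3, 6, and 7), which can distort a linear model such as logistic regression, so let us one-hot encode these features.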

In [17]:
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
train_df, test_df = train_test_split(df, test_size=0.25, random_state=77, stratify=df.disease)
hot_enc = OneHotEncoder()
hot_enc.fit(train_df[cat_features])
train_hot = hot_enc.transform(train_df[cat_features])
test_hot = hot_enc.transform(test_df[cat_features])
print(f"Training 1he {train_hot.shape}, Testing 1he {test_hot.shape}")

Training 1he (227, 19), Testing 1he (76, 19)


So we have converted our 7 categorical features into 19 one-hot encoded features.

In [18]:
num_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']
train_num = train_df[num_features].values
test_num = test_df[num_features].values
X_train = sp.hstack((train_hot, train_num))
y_train = train_df.disease.values
X_test = sp.hstack((test_hot, test_num))
y_test = test_df.disease.values
print(f"X_train {X_train.shape}, X_test {X_test.shape}, y_train {y_train.shape}, y_test {y_test.shape}")

X_train (227, 25), X_test (76, 25), y_train (227,), y_test (76,)
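
The manual stacking of the one-hot block and the numeric block can also be expressed with scikit-learn's `ColumnTransformer`.  This is an alternative to the approach used in this notebook, sketched on a toy frame with made-up rows and only a subset of the columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real frame: two categorical and two numeric columns.
toy = pd.DataFrame({
    "cp": [1, 2, 3, 4],
    "thal": [3, 6, 7, 3],
    "age": [63.0, 67.0, 37.0, 41.0],
    "chol": [233.0, 286.0, 250.0, 204.0],
})

combine = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp", "thal"]),
    ("num", "passthrough", ["age", "chol"]),
])
X = combine.fit_transform(toy)

# The result may be sparse or dense depending on its density.
X_dense = X.toarray() if hasattr(X, "toarray") else X
print(X_dense.shape)  # 4 + 3 one-hot columns plus 2 numeric columns
```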


In [19]:
model = LogisticRegression(random_state=777, max_iter=2000).fit(X_train, y_train)
score_model(model.predict(X_test))

Confusion Matrix
[[37  4]
 [ 8 27]]
              precision    recall  f1-score   support

       False       0.82      0.90      0.86        41
        True       0.87      0.77      0.82        35

    accuracy                           0.84        76
   macro avg       0.85      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76



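This dropped our score to 84% (+1% from base and -2% from the NA test).  Note that this split uses `random_state=77` rather than the 777 used earlier, so the test rows differ from the previous runs and the comparison is only approximate.  The unscaled numeric columns may also be a factor.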

## Numeric Scaling

Let us use `StandardScaler` on the remaining numeric features so that they have zero mean and unit variance.

In [20]:
scaler = StandardScaler()
scaler.fit(train_num)
train_scale = scaler.transform(train_num)
test_scale = scaler.transform(test_num)
X_train = sp.hstack((train_hot, train_num))
X_test = sp.hstack((test_hot, test_num))
model = LogisticRegression(random_state=777, max_iter=2000).fit(X_train, y_train)
score_model(model.predict(X_test))

Confusion Matrix
[[37  4]
 [ 8 27]]
              precision    recall  f1-score   support

       False       0.82      0.90      0.86        41
        True       0.87      0.77      0.82        35

    accuracy                           0.84        76
   macro avg       0.85      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76



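This did not change the score from 84%, and the confusion matrix is identical to the one-hot run.  That is expected here: the cell computes `train_scale` and `test_scale` but stacks the unscaled `train_num` and `test_num` into `X_train` and `X_test`, so the model was trained on exactly the same inputs as before.  The effect of scaling is therefore not tested by this run.  Overfitting due to the extra columns we added during one-hot encoding remains a possibility, but this result is not evidence for it.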
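
For the scaling to take effect, the scaled arrays have to be the ones passed to `sp.hstack`; the cell above passes `train_num` and `test_num`.  A minimal sketch with synthetic stand-in data (the shapes and distributions are made up for illustration), not a re-run of the experiment:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a sparse one-hot block and two unscaled numeric columns.
train_hot = sp.csr_matrix(rng.integers(0, 2, size=(8, 3)))
test_hot = sp.csr_matrix(rng.integers(0, 2, size=(4, 3)))
train_num = rng.normal(loc=[54.0, 246.0], scale=[9.0, 52.0], size=(8, 2))
test_num = rng.normal(loc=[54.0, 246.0], scale=[9.0, 52.0], size=(4, 2))

# Fit on the training rows only, then apply the same transform to both sets.
scaler = StandardScaler().fit(train_num)
train_scale = scaler.transform(train_num)
test_scale = scaler.transform(test_num)

# Stack the *scaled* arrays; stacking train_num here would skip the scaling.
X_train = sp.hstack((train_hot, train_scale)).tocsr()
X_test = sp.hstack((test_hot, test_scale)).tocsr()

print(X_train.shape, X_test.shape)
print(train_scale.mean(axis=0).round(6), train_scale.std(axis=0).round(6))
```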

## Dimensionality Reduction

We are possibly overfitting our training data due to the number of columns we have.  Let us use PCA to reduce dimensionality and see how that impacts the model performance.

In [32]:
pca = PCA(n_components=6, svd_solver='arpack')
pca.fit(X_train)
print(pca.explained_variance_ratio_.sum())

0.9992013948547115


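So we can keep just 6 components and still explain 99.9% of the variance.  Keep in mind that `X_train` still holds the unscaled numeric columns (for example, `chol` has a standard deviation of about 52, while a one-hot column has a variance of at most 0.25), so these 6 components are probably dominated by the 6 numeric features.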
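
How much variance a few components explain depends heavily on feature scale.  The sketch below compares PCA on unscaled and scaled versions of synthetic data; the column spreads loosely follow the `std` row of `df.describe()`, and everything else is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 227

# Synthetic stand-ins: 19 binary columns plus 6 numeric columns whose spreads
# roughly follow the std row of df.describe() above.
binary = rng.integers(0, 2, size=(n_rows, 19)).astype(float)
numeric = rng.normal(scale=[9.0, 17.6, 51.8, 22.9, 1.2, 0.9], size=(n_rows, 6))

unscaled = np.hstack((binary, numeric))
scaled = np.hstack((binary, StandardScaler().fit_transform(numeric)))

ratio_unscaled = PCA(n_components=6).fit(unscaled).explained_variance_ratio_.sum()
ratio_scaled = PCA(n_components=6).fit(scaled).explained_variance_ratio_.sum()
print(f"unscaled: {ratio_unscaled:.4f}, scaled: {ratio_scaled:.4f}")
```

On data like this, the unscaled ratio is close to 1 because the large-scale columns carry almost all of the variance, while the scaled ratio is considerably lower.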

In [34]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print(f"Train {X_train_pca.shape}, Test {X_test_pca.shape}")

Train (227, 6), Test (76, 6)


In [36]:
model = LogisticRegression(random_state=777, max_iter=2000).fit(X_train_pca, y_train)
score_model(model.predict(X_test_pca))

Confusion Matrix
[[37  4]
 [ 4 31]]
              precision    recall  f1-score   support

       False       0.90      0.90      0.90        41
        True       0.89      0.89      0.89        35

    accuracy                           0.89        76
   macro avg       0.89      0.89      0.89        76
weighted avg       0.89      0.89      0.89        76



This improved the model performance to 89% (+6% from base, +3% from previous best, and +5% from previous iteration).