<a href="https://colab.research.google.com/github/kiptuidenis/SKIES/blob/main/Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**FEATURE ENGINEERING**

The main goal of feature engineering is to create the best representation of the data, so that models can learn patterns effectively, by constructing and selecting the most relevant variables

**Feature Types**

1. Numerical Features:
Continuous variables that take a range of values
2. Categorical Features:
Discrete variables that belong to a specific category
3. Ordinal Features: Categorical features that have an inherent order
4. Text Features:
Unstructured data that must be converted to numerical form
5. Image Features: Pixel data from images, which must also be converted to numerical arrays

**Feature Selection Techniques**

1. *Filter Methods:*
Use statistical tests to determine the importance of each feature, e.g. ```ANOVA test```, ```Chi-Square Test```, ```Correlation```

2. *Wrapper Methods:*
Use machine learning models to evaluate feature subsets, e.g. ```Recursive Feature Elimination (RFE)```

3. *Embedded Methods:* Perform feature selection as part of model training, e.g. regularization such as ```L1``` (which can shrink coefficients to exactly zero); ```L2``` shrinks coefficients but rarely removes features outright

4. *Domain knowledge:* Expert knowledge helps in selecting the most meaningful features
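
The first three approaches can be sketched with scikit-learn. This is only an illustration on a synthetic dataset (generated with `make_classification`, not the notebook's data); keeping 4 features, using `LogisticRegression` as the estimator, and `C=0.1` are all arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 4 are informative (assumed setup)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

# Filter method: score each feature independently with the ANOVA F-test
filter_mask = SelectKBest(score_func=f_classif, k=4).fit(X, y).get_support()

# Wrapper method: repeatedly fit a model and drop the weakest feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
wrapper_mask = rfe.get_support()

# Embedded method: L1 regularization can drive some coefficients to exactly zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
embedded_mask = l1_model.coef_[0] != 0

print("Filter keeps:  ", filter_mask.nonzero()[0])
print("Wrapper keeps: ", wrapper_mask.nonzero()[0])
print("Embedded keeps:", embedded_mask.nonzero()[0])
```

The three methods do not have to agree on the same subset; comparing them is often a useful sanity check.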

**Feature Extraction Techniques**

Aims to reduce dimensionality while retaining essential information

1. *Principal Component Analysis (PCA)*: Projects the data onto principal components that capture most of the variance.

2. *t-SNE (t-Distributed Stochastic Neighbor Embedding)*: A non-linear technique, mainly used for visualization, that helps in understanding the cluster structure of the data

3. *Autoencoders*: A type of neural network that learns a compressed representation of the data. Useful when extracting complex features from images and text
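
As a rough sketch of the PCA case, the example below reduces five correlated features to two components. The data is made up (two hidden factors mixed into five columns), and `n_components=2` is an assumption chosen to match that construction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Build 5 correlated features from 2 hidden factors plus a little noise
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA is variance-based, so standardize the features first
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Reduced shape:", X_reduced.shape)
print("Variance explained per component:", pca.explained_variance_ratio_)
```

`explained_variance_ratio_` shows how much of the total variance each component retains, which helps when deciding how many components to keep.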

**Handling Missing Values**

Missing values can lead to biased results

**Techniques to handle missing values**

1. *Remove missing values*: If only a few values are missing, drop the affected rows or columns

2. *Imputation*: Fill in missing values using statistical techniques, e.g. mean, median, mode, or ```KNN imputation``` for more accurate estimates

3. *Predicting missing values*: Train a model to predict the missing values
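
The first two techniques might look like this in practice. The small table below is made up for illustration, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Tiny made-up table with two missing entries
df_demo = pd.DataFrame({
    "age":  [25.0, 30.0, np.nan, 40.0, 35.0],
    "chol": [180.0, np.nan, 210.0, 250.0, 230.0],
})

# Option 1: drop every row that contains a missing value
dropped = df_demo.dropna()

# Option 2: replace missing values with each column's median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df_demo),
    columns=df_demo.columns,
)

# Option 3: fill each gap with the average of the 2 most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df_demo),
    columns=df_demo.columns,
)

print(dropped.shape)
print(median_filled)
print(knn_filled)
```

Dropping rows is the simplest option but throws data away; here it removes two of the five rows.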

**Feature Scaling**

Prevents larger magnitude features from dominating smaller ones

**Techniques for feature scaling**

1. *Min-Max Scaling:* Scales values to a fixed range, usually between 0 and 1.
Often used when the data does not follow a normal distribution

2. *Standardization:* Transforms data to have zero mean and unit variance.
Standardization is less sensitive to outliers than Min-Max scaling, and in many cases is preferable to it

```How do we decide which features to scale? Or, once we have decided to scale, do we simply scale all the features? ```

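```Note: ``` *If we scale before the split, the scaling statistics (e.g. mean and standard deviation) are computed using the test data too, so the training data carries information about the test set. The model may then look good on the test data but perform worse on truly unseen data. The scaler should therefore be fitted after splitting, on the training data only, and then applied to the test data*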

```Data Leakage: ```Like giving the model a cheat sheet (a "mwakenya"). It invalidates the measured performance
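
The leakage-free order described in the note can be sketched as follows. The two features are randomly generated for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)

# Two made-up features on very different scales
X = np.column_stack([rng.normal(50, 10, 100), rng.normal(0.5, 0.1, 100)])
y = rng.integers(0, 2, 100)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Learn the scaling parameters from the training set only
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
# 3. Reuse those parameters on the test set (no refitting)
X_test_std = std_scaler.transform(X_test)

# Min-max scaling follows the same fit-on-train, transform-both pattern
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

The test set will usually not have exactly zero mean and unit variance after this, which is expected: it is scaled with the training set's statistics.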

**Algorithms that require feature scaling**
1. K-means - uses Euclidean distance
2. KNN - also measures the distance between sample pairs
3. PCA - looks for the directions of maximum variance
4. ANN - uses gradient descent to learn
5. Other models trained with gradient descent

Note: Algorithms that are not distance-based, such as Naive Bayes and tree-based models, generally do not need feature scaling. Logistic regression can work without it, but scaling usually helps when it is regularized or fitted with a gradient-based solver
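
A small worked example shows why distance-based algorithms are sensitive to scale. The three people below are hypothetical; the point is only that the salary column dominates the raw Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: [age in years, salary]
people = np.array([
    [25.0, 50000.0],   # a
    [60.0, 50500.0],   # b: very different age, similar salary
    [26.0, 52000.0],   # c: similar age, somewhat different salary
])

def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw distances: the salary difference swamps the 35-year age gap
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

# After standardization both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw:    a-b = {:.1f}, a-c = {:.1f}".format(raw_ab, raw_ac))
print("scaled: a-b = {:.2f}, a-c = {:.2f}".format(scaled_ab, scaled_ac))
```

On the raw data, `c` looks several times farther from `a` than `b` does, purely because salary is measured in larger numbers; after scaling the two distances are of similar size.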

**Encoding Categorical Features**

Machine learning models require numerical inputs, which is why encoding categorical features is important

**Encoding Methods**
1. *One-Hot Encoding* : Converts each category into its own binary column. Can increase the number of columns significantly, which might lead to dimensionality issues
2. *Label Encoding* : Maps each category to an integer. Best suited to features that have a natural order, since the integers imply one
3. *Target Encoding* : Replaces categories with their mean target values
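
The three methods can be illustrated on a tiny made-up table. The column names and the category order `small < medium < large` are assumptions for the example; `OrdinalEncoder` is used for the ordered feature because it lets the order be stated explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_demo = pd.DataFrame({
    "color":  ["red", "blue", "red", "green"],        # no natural order
    "size":   ["small", "large", "medium", "small"],  # natural order
    "target": [1, 0, 1, 0],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df_demo["color"], prefix="color")

# Integer encoding with an explicit order: small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = encoder.fit_transform(df_demo[["size"]]).ravel()

# Target encoding: replace each category with its mean target value
color_means = df_demo.groupby("color")["target"].mean()
color_target_encoded = df_demo["color"].map(color_means)

print(one_hot.columns.tolist())
print(size_encoded)
print(color_target_encoded.tolist())
```

Because target encoding uses the target itself, the category means should be computed on the training data only, otherwise it becomes another source of data leakage.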

**Auto Feature Selection**

In [None]:
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [None]:
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv(url, names=column_names, na_values="?")
df.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,,3.0,0


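Ask what this code does. (`pd.to_numeric` converts a column to a numeric dtype; with `errors="coerce"`, any value that cannot be parsed becomes `NaN` instead of raising an error.)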

In [None]:
# Make sure "thal" and "ca" are numeric; values that cannot be parsed become NaN
df["thal"] = pd.to_numeric(df["thal"], errors="coerce")
df["ca"] = pd.to_numeric(df["ca"], errors="coerce")



In [None]:
df.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,,3.0,0


In [None]:
df.isnull().sum()

Unnamed: 0,0
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


In [None]:
# Fill missing values with median
df.fillna(df.median(), inplace=True)
df.tail()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,0.0,3.0,0


🔹 Step 2: Feature Selection
1. Remove Low-Variance Features
•	VarianceThreshold removes features with little variation.


In [None]:
X = df.drop(columns=['target'])
y = df['target']

# Apply variance threshold
var_thresh = VarianceThreshold(threshold=0.01)
X_var_selected = var_thresh.fit_transform(X)

# Get selected feature names
selected_features_var = X.columns[var_thresh.get_support()]
print("Selected Features after Variance Threshold:", selected_features_var)


Selected Features after Variance Threshold: Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')


2. Remove Highly Correlated Features
•	Drops one feature from any pair with correlation > 0.8.


In [None]:
# Compute correlation matrix
corr_matrix = df.corr().abs()
corr_matrix

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
age,1.0,0.097542,0.104139,0.284946,0.20895,0.11853,0.148868,0.393806,0.091661,0.203805,0.16177,0.365323,0.128303,0.222853
sex,0.097542,1.0,0.010084,0.064456,0.199915,0.047862,0.021647,0.048663,0.146201,0.102173,0.037533,0.086048,0.380581,0.224469
cp,0.104139,0.010084,1.0,0.036077,0.072319,0.039975,0.067505,0.334422,0.38406,0.202277,0.15205,0.233117,0.262089,0.407075
trestbps,0.284946,0.064456,0.036077,1.0,0.13012,0.17534,0.14656,0.045351,0.064762,0.189171,0.117382,0.097528,0.134424,0.157754
chol,0.20895,0.199915,0.072319,0.13012,1.0,0.009841,0.171043,0.003432,0.06131,0.046564,0.004062,0.123726,0.018351,0.070909
fbs,0.11853,0.047862,0.039975,0.17534,0.009841,1.0,0.069564,0.007854,0.025665,0.005747,0.059894,0.140764,0.064625,0.059186
restecg,0.148868,0.021647,0.067505,0.14656,0.171043,0.069564,1.0,0.083389,0.084867,0.114133,0.133946,0.131749,0.024325,0.183696
thalach,0.393806,0.048663,0.334422,0.045351,0.003432,0.007854,0.083389,1.0,0.378103,0.343085,0.385601,0.265699,0.274142,0.41504
exang,0.091661,0.146201,0.38406,0.064762,0.06131,0.025665,0.084867,0.378103,1.0,0.288223,0.257748,0.145788,0.32524,0.397057
oldpeak,0.203805,0.102173,0.202277,0.189171,0.046564,0.005747,0.114133,0.343085,0.288223,1.0,0.577537,0.301067,0.342405,0.504092


In [None]:
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
age,,0.097542,0.104139,0.284946,0.20895,0.11853,0.148868,0.393806,0.091661,0.203805,0.16177,0.365323,0.128303,0.222853
sex,,,0.010084,0.064456,0.199915,0.047862,0.021647,0.048663,0.146201,0.102173,0.037533,0.086048,0.380581,0.224469
cp,,,,0.036077,0.072319,0.039975,0.067505,0.334422,0.38406,0.202277,0.15205,0.233117,0.262089,0.407075
trestbps,,,,,0.13012,0.17534,0.14656,0.045351,0.064762,0.189171,0.117382,0.097528,0.134424,0.157754
chol,,,,,,0.009841,0.171043,0.003432,0.06131,0.046564,0.004062,0.123726,0.018351,0.070909
fbs,,,,,,,0.069564,0.007854,0.025665,0.005747,0.059894,0.140764,0.064625,0.059186
restecg,,,,,,,,0.083389,0.084867,0.114133,0.133946,0.131749,0.024325,0.183696
thalach,,,,,,,,,0.378103,0.343085,0.385601,0.265699,0.274142,0.41504
exang,,,,,,,,,,0.288223,0.257748,0.145788,0.32524,0.397057
oldpeak,,,,,,,,,,,0.577537,0.301067,0.342405,0.504092


In [None]:
# Find features with correlation > 0.8
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]

print("Highly Correlated features to remove: ", to_drop)

Highly Correlated features to remove:  []


In [None]:
# Drop these features
df.drop(columns=to_drop, inplace=True)

3. Select Best Features (ANOVA F-Test)

In [None]:
X = df.drop(columns=['target'])
y = df['target']

# Select the best 8 features
select_k = SelectKBest(score_func=f_classif, k=8)
X_best = select_k.fit_transform(X, y)

# Get selected feature names
selected_features_kbest = X.columns[select_k.get_support()]
print("Selected Features after SelectKBest:", selected_features_kbest)
X_best

Selected Features after SelectKBest: Index(['sex', 'cp', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')


array([[  1.,   1., 150., ...,   3.,   0.,   6.],
       [  1.,   4., 108., ...,   2.,   3.,   3.],
       [  1.,   4., 129., ...,   2.,   2.,   7.],
       ...,
       [  1.,   4., 115., ...,   2.,   1.,   7.],
       [  0.,   2., 174., ...,   2.,   1.,   3.],
       [  1.,   3., 173., ...,   1.,   0.,   3.]])

Feature Scaling (StandardScaler)
StandardScaler transforms data so that it has zero mean and unit variance

In [20]:
# Split first so the scaler never sees the test data (avoids data leakage)
X_train, X_test, y_train, y_test = train_test_split(X_best, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Train and Test the Model

Random Forest Classifier for prediction

In [22]:
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy: {:.2f}".format(accuracy))

Model Accuracy: 0.51


Testing Model Accuracy with Hyperparameter Tuning for RandomForestClassifier
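
One possible way to approach the tuning is a grid search with cross-validation. The sketch below is an assumption about how this could be done, not the notebook's own method: it uses synthetic stand-in data so that it runs on its own, and the grid values are hypothetical. In the notebook, `X_train` / `y_train` would take the place of `X_tr` / `y_tr`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); replace with the notebook's features
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical grid: values chosen only for illustration
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

# Try every combination with 3-fold cross-validation on the training split
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training split by default
tuned_pred = search.best_estimator_.predict(X_te)
print("Best parameters:", search.best_params_)
print("Tuned accuracy: {:.2f}".format(accuracy_score(y_te, tuned_pred)))
```

The test split is only used once, at the end, so the reported accuracy is not influenced by the search itself.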