# Exercise: Feature Selection and Engineering


`#scikit-learn` `#data-preprocessing` `#logistic-regression` `#feature-engineering`


> Objectives:
>
> - Load and explore broken data.
> - Understand why the broken data cannot be used for training directly.
> - Apply minimal transformations to allow initial training.
> - Perform additional data cleaning, feature engineering, and normalization.
> - Compare the performance of models trained on minimally processed and fully processed data.


## Standard Deep Atlas Exercise Set Up


- [x] Ensure you are using the coursework Pipenv environment and kernel ([instructions](../SETUP.md))
- [x] Apply the standard Deep Atlas environment setup process by running this cell:


In [2]:
import sys, os
sys.path.insert(0, os.path.join('..', 'includes'))

import deep_atlas
from deep_atlas import FILL_THIS_IN
deep_atlas.initialize_environment()
if deep_atlas.environment == 'COLAB':
    %pip install -q python-dotenv==1.0.0

🎉 Running in a Virtual environment


## 🚦 Checkpoint: Start


- [ ] Run this cell to record your start time:


In [3]:
deep_atlas.log_start_time()

🚀 Success! Get started...


## Imports


- [x] Import the libraries used throughout this exercise for data manipulation, machine learning, and logging:


In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import logging

## Load Broken Data


We will begin by loading a dataset that contains issues such as missing values, incorrect data types, and potentially problematic records.

- [x] Load the broken dataset and display the first few rows to examine its contents:


In [5]:
logging.basicConfig(level=logging.INFO)


def load_data(file_path):
    try:
        data = pd.read_csv(file_path)
        logging.info(f"Data loaded successfully from {file_path}")
        return data
    except FileNotFoundError:
        logging.error(f"File not found: {file_path}")
        raise
    except pd.errors.EmptyDataError:
        logging.error(f"Empty CSV file: {file_path}")
        raise
    except pd.errors.ParserError:
        logging.error(f"Error parsing CSV file: {file_path}")
        raise


data = load_data("assets/fitness_data.csv")
data.head()

INFO:root:Data loaded successfully from assets/fitness_data.csv


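|   | Age | Gender | Weight | Height | Steps per Day | Calories Burned | Exercise Minutes | Resting Heart Rate | Workout Intensity | Hours of Sleep | Sleep Quality | Stress Level | Fitness Goal Achieved |
|---|-----|--------|--------|--------|---------------|-----------------|------------------|--------------------|-------------------|----------------|---------------|--------------|-----------------------|
| 0 | 56.0 | M | 65.1 | 179.8 | 5906.0 | 660.5 | 52.0 | 65.0 | lw | 9.7 | Good | 5.0 | No |
| 1 | NaN | M | 70.7 | 182.4 | 10272.0 | NaN | 39.0 | 71.0 | hgh | 7.0 | Poor | 5.0 | Yes |
| 2 | 32.0 | F | 47.9 | 174.3 | 8070.0 | 861.1 | 46.0 | 76.0 | med | 6.4 | Poor | 9.0 | Yes |
| 3 | 25.0 | F | 68.7 | 166.6 | 6597.0 | NaN | 58.0 | 60.0 | med | 8.5 | Good | 5.0 | No |
| 4 | 38.0 | M | NaN | 154.1 | 4771.0 | 670.2 | NaN | 73.0 | lw | 6.7 | Poor | 5.0 | No |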


## Explore Broken Data


Next, we will explore the dataset to identify any problems, such as missing values, incorrect data types, and duplicate records.

- [x] Explore the dataset by running the following code:


In [6]:
def explore_data(data):
    logging.info("Exploring data:")
    print(data.head())
    print("\nDataset info:")
    data.info()
    print("\nMissing values:")
    print(data.isnull().sum())
    print("\nData types:")
    print(data.dtypes)
    print("\nDataset description:")
    print(data.describe())

    duplicate_rows = data.duplicated(
        subset=[
            "Age",
            "Gender",
            "Height",
            "Exercise Minutes",
            "Resting Heart Rate",
        ]
    )
    print(f"\nNumber of duplicate rows: {duplicate_rows.sum()}")


explore_data(data)

INFO:root:Exploring data:


    Age Gender  Weight  Height  Steps per Day  Calories Burned  \
0  56.0      M    65.1   179.8         5906.0            660.5   
1   NaN      M    70.7   182.4        10272.0              NaN   
2  32.0      F    47.9   174.3         8070.0            861.1   
3  25.0      F    68.7   166.6         6597.0              NaN   
4  38.0      M     NaN   154.1         4771.0            670.2   

   Exercise Minutes  Resting Heart Rate Workout Intensity  Hours of Sleep  \
0              52.0                65.0                lw             9.7   
1              39.0                71.0               hgh             7.0   
2              46.0                76.0               med             6.4   
3              58.0                60.0               med             8.5   
4               NaN                73.0                lw             6.7   

  Sleep Quality  Stress Level Fitness Goal Achieved  
0          Good           5.0                    No  
1          Poor           5.0   

## Why Broken Data Cannot Be Trained On


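The dataset, as it stands, contains several issues:

- Missing values in numeric fields such as `Age`, `Weight`, `Calories Burned`, and `Exercise Minutes`
- Categorical fields stored as strings (`Gender`, `Workout Intensity`, `Sleep Quality`), some with abbreviated labels such as `lw`, `med`, and `hgh`
- Duplicate rows

These issues must be resolved before attempting to train a model. scikit-learn's `LogisticRegression` rejects input that contains `NaN` and expects numeric features, so fitting it on the raw data raises an error instead of producing a model.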
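
As a minimal sketch of why training fails, the cell below fits `LogisticRegression` on a tiny hand-made frame that contains one missing value. The values are hypothetical and are not rows from `fitness_data.csv`. scikit-learn should raise a `ValueError` rather than train.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one missing Age, similar to the gaps in the real dataset.
toy_X = pd.DataFrame(
    {"Age": [56.0, np.nan, 32.0, 25.0], "Weight": [65.1, 70.7, 47.9, 68.7]}
)
toy_y = ["No", "Yes", "Yes", "No"]

try:
    LogisticRegression().fit(toy_X, toy_y)
    fit_failed = False
except ValueError as error:
    # LogisticRegression does not accept NaN in its input features.
    fit_failed = True
    print(f"Training failed: {error}")
```

Imputing the missing value first, as the next section does, is what lets `fit` run.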


## Minimal Data Transformations


We will now perform the minimal transformations necessary to allow initial model training: imputing missing values (the median for numeric columns, the most frequent value for categorical columns) and one-hot encoding the categorical variables.

- [x] Run this code to apply the minimal preprocessing steps:


In [8]:
def create_preprocessor(numeric_columns, categorical_columns):
    return ColumnTransformer(
        transformers=[
            ("num", SimpleImputer(strategy="median"), numeric_columns),
            (
                "cat",
                Pipeline(
                    [
                        (
                            "imputer",
                            SimpleImputer(strategy="most_frequent"),
                        ),
                        (
                            "encoder",
                            OneHotEncoder(drop="first", sparse_output=False),
                        ),
                    ]
                ),
                categorical_columns,
            ),
        ]
    )


numeric_columns = [
    "Age",
    "Weight",
    "Height",
    "Steps per Day",
    "Calories Burned",
    "Exercise Minutes",
    "Resting Heart Rate",
    "Hours of Sleep",
    "Stress Level",
]
categorical_columns = ["Gender", "Workout Intensity", "Sleep Quality"]

preprocessor = create_preprocessor(numeric_columns, categorical_columns)

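X = data.drop("Fitness Goal Achieved", axis=1)
# Missing labels are treated as "No" here; dropping those rows is a safer alternative.
y = data["Fitness Goal Achieved"].fillna("No")

# Fitting on the full dataset means the imputers also see the rows later used for testing.
X_clean_min = preprocessor.fit_transform(X)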

## Train and Assess Initial Model


With the minimally processed data, we can now train a logistic regression model.

- [x] Train and assess the initial model:


In [9]:
def train_model(X, y, model_type="initial"):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{model_type.capitalize()} model performance:")
    print(classification_report(y_test, y_pred))
    return model


initial_model = train_model(X_clean_min, y, "initial")


Initial model performance:
              precision    recall  f1-score   support

          No       0.91      0.96      0.94       106
         Yes       0.96      0.90      0.93       100

    accuracy                           0.93       206
   macro avg       0.93      0.93      0.93       206
weighted avg       0.93      0.93      0.93       206



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Further Refinements


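We will now further clean and enhance the dataset by removing duplicates, engineering three new features (`Activity Intensity`, `Sleep Efficiency`, and `Calories per Step`), and standardizing the numeric features with `StandardScaler`. Note that this version drops the three categorical columns instead of encoding them.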

- [x] Perform the refinements:


In [11]:
def preprocess_data(data):
    data_deduped = data.drop_duplicates(
        subset=[
            "Age",
            "Gender",
            "Height",
            "Exercise Minutes",
            "Resting Heart Rate",
        ]
    ).copy()

    epsilon = 1e-10
    data_deduped["Activity Intensity"] = data_deduped["Steps per Day"] / (
        data_deduped["Exercise Minutes"] + epsilon
    )
    data_deduped["Sleep Efficiency"] = np.where(
        data_deduped["Sleep Quality"] == "Good",
        data_deduped["Hours of Sleep"] * 1.5,
        data_deduped["Hours of Sleep"],
    )
    data_deduped["Calories per Step"] = data_deduped["Calories Burned"] / (
        data_deduped["Steps per Day"] + epsilon
    )

    data_deduped = data_deduped.replace([np.inf, -np.inf], np.nan)

    numeric_columns = data_deduped.select_dtypes(
        include=[np.number]
    ).columns
    categorical_columns = data_deduped.select_dtypes(
        exclude=[np.number]
    ).columns

    data_deduped[numeric_columns] = data_deduped[numeric_columns].fillna(
        data_deduped[numeric_columns].median()
    )
    for col in categorical_columns:
        data_deduped[col] = data_deduped[col].fillna(
            data_deduped[col].mode()[0]
        )

    return data_deduped


data_preprocessed = preprocess_data(data)

X_final = data_preprocessed.drop(
    [
        "Fitness Goal Achieved",
        "Gender",
        "Workout Intensity",
        "Sleep Quality",
    ],
    axis=1,
)
y_final = data_preprocessed["Fitness Goal Achieved"].fillna("No")

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_final)
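
One detail of the ratio features is worth checking. Adding `epsilon` to a zero denominator yields a very large finite number, not `inf`, so the `replace([np.inf, -np.inf], np.nan)` step would not catch it. The sketch below uses hypothetical values to contrast this with an alternative that turns zero denominators into `NaN`, which median imputation can then fill. The alternative is an assumption for illustration and is not part of the exercise code.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second person logged zero exercise minutes.
toy = pd.DataFrame(
    {"Steps per Day": [6000.0, 4000.0], "Exercise Minutes": [60.0, 0.0]}
)

# Approach used in the exercise: add a small epsilon to the denominator.
epsilon = 1e-10
with_epsilon = toy["Steps per Day"] / (toy["Exercise Minutes"] + epsilon)

# Alternative (assumption): treat a zero denominator as missing instead.
with_nan = toy["Steps per Day"] / toy["Exercise Minutes"].replace(0, np.nan)

print(with_epsilon.tolist())  # second value is huge but finite
print(with_nan.tolist())      # second value is NaN
```

A single value on the order of 1e13 can dominate the mean and standard deviation that `StandardScaler` computes, so it is worth checking whether the real data contains zero denominators.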

## Train and Assess Final Model


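Finally, we will train a second logistic regression model using the fully processed dataset and compare its performance to the initial model. Because deduplication changes which rows remain, the two models are evaluated on different test sets (206 versus 200 samples), so treat the comparison as indicative rather than exact.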

- [ ] Train and assess the final model:


In [12]:
final_model = train_model(X_scaled, y_final, "final")


Final model performance:
              precision    recall  f1-score   support

          No       0.95      0.97      0.96       104
         Yes       0.97      0.95      0.96        96

    accuracy                           0.96       200
   macro avg       0.96      0.96      0.96       200
weighted avg       0.96      0.96      0.96       200
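
One way to compare preprocessing variants on equal footing is to score them on the same cross-validation folds. The sketch below does this on synthetic data from `make_classification`. The dataset, the `candidates` dictionary, and the choice of five folds are illustrative assumptions, not part of the exercise. Wrapping the scaler in a pipeline also means it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fitness data.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

candidates = {
    "unscaled": LogisticRegression(max_iter=1000, random_state=42),
    "scaled": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
    ),
}

# With an integer cv and a classifier, both candidates get the same stratified folds.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```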



### 🚦 Checkpoint: Stop


- [x] Complete the feedback form and run the following cell to log your stop time:


In [14]:
# deep_atlas.log_feedback(
#     {
#         # How long were you actively focused on this section? (HH:MM)
#         "active_time": "00:20",
#         # Did you feel finished with this section (Yes/No):
#         "finished": "Yes",
#         # How much did you enjoy this section? (1–5)
#         "enjoyment": 4,
#         # How useful was this section? (1–5)
#         "usefulness": 4,
#         # Did you skip any steps?
#         "skipped_steps": "No",
#         # Any obvious opportunities for improvement?
#         "suggestions": [],
#     }
# )
# deep_atlas.log_stop_time()

## You Did It!


In this exercise, you successfully:

- Loaded a broken dataset.
- Identified and addressed issues like missing values and incorrect data types.
- Trained an initial model with minimal data transformations.
- Performed further refinements like feature engineering and normalization.
- Compared the performance of models trained on minimally processed and fully processed data.

In this run, accuracy rose from 0.93 with minimal preprocessing to 0.96 with the fully processed data, which illustrates how much thorough data preprocessing can affect model performance. Well done!
