# 6. Feature Engineering

**Objective:** Transform the cleaned data into a format suitable for machine learning models. This includes creating composite scores, encoding categorical variables, and normalizing numerical features.

**Input:** `data/processed/cleaned_survey_data.csv`
**Output:** `data/processed/features_engineered.csv`

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import os

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [2]:
# Define paths
CLEANED_DATA_PATH = os.path.join('..', 'data', 'processed', 'cleaned_survey_data.csv')
PROCESSED_DATA_DIR = os.path.join('..', 'data', 'processed')
ENGINEERED_DATA_PATH = os.path.join(PROCESSED_DATA_DIR, 'features_engineered.csv')

# Load the cleaned dataset
try:
    df_cleaned = pd.read_csv(CLEANED_DATA_PATH)
    print(f"Cleaned dataset loaded successfully. Shape: {df_cleaned.shape}")
except FileNotFoundError:
    # Every later cell needs this data, so stop here rather than continue with an empty frame
    print(f"Error: Cleaned data file not found at {CLEANED_DATA_PATH}")
    raise

# Make a copy to work on
df_eng = df_cleaned.copy()

Cleaned dataset loaded successfully. Shape: (736, 31)


## 6.1 Create Composite Mental Health Score

Create a single score representing overall mental health distress by averaging the scores of Anxiety, Depression, Insomnia, and OCD.

In [3]:
mh_cols = ['Anxiety', 'Depression', 'Insomnia', 'OCD']
df_eng['MH_Composite'] = df_eng[mh_cols].mean(axis=1)

print("Created 'MH_Composite' score.")
display(df_eng[['Anxiety', 'Depression', 'Insomnia', 'OCD', 'MH_Composite']].head())

Created 'MH_Composite' score.


| | Anxiety | Depression | Insomnia | OCD | MH_Composite |
|---|---|---|---|---|---|
| 0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 7.0 | 2.0 | 2.0 | 1.0 | 3.0 |
| 2 | 7.0 | 7.0 | 10.0 | 2.0 | 6.5 |
| 3 | 9.0 | 7.0 | 3.0 | 3.0 | 5.5 |
| 4 | 7.0 | 2.0 | 5.0 | 9.0 | 5.75 |
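
One detail of the row-wise average is worth keeping in mind: `DataFrame.mean(axis=1)` skips missing values by default, so a row with a missing score would be averaged over fewer items. The sketch below uses a made-up two-row frame (not the survey data) to show the difference between the default and `skipna=False`.

```python
import numpy as np
import pandas as pd

mh_cols = ["Anxiety", "Depression", "Insomnia", "OCD"]

# Hypothetical rows for illustration only; the second row has a missing Depression score
toy = pd.DataFrame({
    "Anxiety": [3.0, 7.0],
    "Depression": [0.0, np.nan],
    "Insomnia": [1.0, 2.0],
    "OCD": [0.0, 1.0],
})

# Default behaviour: NaN is skipped, so row 1 is the mean of three scores
lenient = toy[mh_cols].mean(axis=1)

# skipna=False: any missing score makes the composite missing as well
strict = toy[mh_cols].mean(axis=1, skipna=False)

print(pd.DataFrame({"lenient": lenient, "strict": strict}))
```

If the cleaned data has no missing values in these four columns, both variants give the same result.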


## 6.2 Encode Categorical Variables

We need to convert categorical features into numerical representations.

*   **Binary Encoding:** Convert 'Yes'/'No' columns to 1/0.
*   **One-Hot Encoding:** Convert nominal categorical columns (`Primary streaming service`, `Fav genre`) into multiple binary columns.
*   **Ordinal Encoding:** Frequency columns were already mapped to 0-3 in Notebook 02. We will keep them as they are.
*   **Target Variable:** The `Music effects` column will be handled later during model preparation (Notebook 07) as it's the target, not a feature for this step.

In [4]:
# Identify binary columns
binary_cols = ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']
yes_no_map = {'Yes': 1, 'No': 0}

for col in binary_cols:
    if col in df_eng.columns:
        # Map directly: any value other than 'Yes'/'No' (including missing values) becomes <NA>
        # in the nullable integer column, so unexpected entries stay visible as missing
        df_eng[col] = df_eng[col].map(yes_no_map).astype('Int64')
    else:
        print(f"Warning: Binary column '{col}' not found.")


print(f"Applied binary encoding (1/0) to: {binary_cols}")
display(df_eng[binary_cols].head())
print("\nData types after binary encoding:")
print(df_eng[binary_cols].dtypes)

Applied binary encoding (1/0) to: ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']


| | While working | Instrumentalist | Composer | Exploratory | Foreign languages |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 1 | 0 |



Data types after binary encoding:
While working        Int64
Instrumentalist      Int64
Composer             Int64
Exploratory          Int64
Foreign languages    Int64
dtype: object
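
Mapping with a dictionary silently turns any value outside the dictionary into a missing value. A small check like the one below can surface such entries before they disappear. This is a sketch on a made-up series; `encode_yes_no` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

yes_no_map = {"Yes": 1, "No": 0}

def encode_yes_no(series):
    """Map 'Yes'/'No' to 1/0 and report raw values that the map did not cover."""
    encoded = series.map(yes_no_map).astype("Int64")
    # Values that were present in the raw data but came out missing after mapping
    unmapped = series[encoded.isna() & series.notna()].unique().tolist()
    return encoded, unmapped

# Hypothetical column with a lowercase 'yes' and a genuinely missing entry
raw = pd.Series(["Yes", "No", "yes", None, "No"], name="Composer")
encoded, unmapped = encode_yes_no(raw)

print(encoded.tolist())
print("Unmapped raw values:", unmapped)
```

An empty `unmapped` list means every non-missing entry was one of the expected labels.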


In [5]:
# Identify nominal columns for one-hot encoding
# Note: 'Music effects' is the target, excluded here.
one_hot_cols = ['Primary streaming service', 'Fav genre']

# Check if columns exist
one_hot_cols = [col for col in one_hot_cols if col in df_eng.columns]
print(f"Columns identified for One-Hot Encoding: {one_hot_cols}")

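# Initialize OneHotEncoder
# handle_unknown='ignore' encodes categories not seen during fit as all zeros;
# combined with drop='first', that is the same encoding as the dropped baseline category
# sparse_output=False returns a dense numpy array (requires scikit-learn >= 1.2)
# drop='first' removes one column per feature to avoid perfect multicollinearity
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')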

# Fit and transform the data
try:
    encoded_data = encoder.fit_transform(df_eng[one_hot_cols])
    # Get feature names for the new columns
    encoded_feature_names = encoder.get_feature_names_out(one_hot_cols)
    # Create a DataFrame with the encoded columns
    df_encoded = pd.DataFrame(encoded_data, columns=encoded_feature_names, index=df_eng.index)

    print(f"\nApplied One-Hot Encoding. New columns shape: {df_encoded.shape}")
    display(df_encoded.head())

    # Drop original columns and concatenate encoded ones
    df_eng = df_eng.drop(columns=one_hot_cols)
    df_eng = pd.concat([df_eng, df_encoded], axis=1)
    print(f"\nDataFrame shape after adding OHE columns: {df_eng.shape}")

except Exception as e:
    print(f"\nError during One-Hot Encoding: {e}")
    # Handle error, maybe skip OHE or debug

Columns identified for One-Hot Encoding: ['Primary streaming service', 'Fav genre']

Applied One-Hot Encoding. New columns shape: (736, 20)


Unnamed: 0,Primary streaming service_I do not use a streaming service.,Primary streaming service_Other streaming service,Primary streaming service_Pandora,Primary streaming service_Spotify,Primary streaming service_YouTube Music,Fav genre_Country,Fav genre_EDM,Fav genre_Folk,Fav genre_Gospel,Fav genre_Hip hop,Fav genre_Jazz,Fav genre_K pop,Fav genre_Latin,Fav genre_Lofi,Fav genre_Metal,Fav genre_Pop,Fav genre_R&B,Fav genre_Rap,Fav genre_Rock,Fav genre_Video game music
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0



DataFrame shape after adding OHE columns: (736, 50)


## 6.3 Normalize Numerical Features

Scale numerical features to zero mean and unit variance using `StandardScaler`. This helps algorithms that are sensitive to feature scales, such as SVMs and regularized logistic regression.

Numerical features to scale:
*   `Age`
*   `Hours per day`
*   `BPM`
*   `MH_Composite` (the new composite score)
*   Frequency columns (already ordinal 0-3, scaling is optional but can be beneficial for some models)

In [6]:
# Identify numerical columns for scaling
freq_cols = [col for col in df_eng.columns if col.startswith('Frequency [')]
numerical_cols_to_scale = ['Age', 'Hours per day', 'BPM', 'MH_Composite'] + freq_cols

# Check if columns exist and are numeric
valid_numerical_cols = []
for col in numerical_cols_to_scale:
    if col in df_eng.columns:
        if pd.api.types.is_numeric_dtype(df_eng[col]):
            valid_numerical_cols.append(col)
        else:
            print(f"Warning: Column '{col}' is not numeric, attempting conversion.")
            try:
                df_eng[col] = pd.to_numeric(df_eng[col], errors='coerce')
                # Check if conversion resulted in NaNs that need handling
                if df_eng[col].isnull().any():
                    print(f"Warning: Coercion introduced NaNs in '{col}'. Imputing with median.")
                    median_val = df_eng[col].median()
                    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
                    df_eng[col] = df_eng[col].fillna(median_val)
                valid_numerical_cols.append(col)
            except Exception as e:
                print(f"Error converting column '{col}' to numeric: {e}. Skipping scaling for this column.")
    else:
        print(f"Warning: Numerical column '{col}' not found for scaling.")

print(f"\nColumns identified for scaling: {valid_numerical_cols}")

# Initialize StandardScaler
scaler = StandardScaler()

# Apply scaling
if valid_numerical_cols:
    try:
        df_eng[valid_numerical_cols] = scaler.fit_transform(df_eng[valid_numerical_cols])
        print("\nApplied StandardScaler to numerical columns.")
        display(df_eng[valid_numerical_cols].head())
        print("\nStatistics after scaling (should be close to 0 mean, 1 std dev):")
        display(df_eng[valid_numerical_cols].describe())
    except Exception as e:
        print(f"\nError during scaling: {e}")
else:
    print("\nNo valid numerical columns found to scale.")



Columns identified for scaling: ['Age', 'Hours per day', 'BPM', 'MH_Composite', 'Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]']

Applied StandardScaler to numerical columns.


Unnamed: 0,Age,Hours per day,BPM,MH_Composite,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
0,-0.598118,-0.18927,-0.036885,-1.573781,-0.339753,-0.887173,-0.022036,-1.003479,-0.544607,0.597042,-1.09516,2.259831,2.769036,-0.066135,-1.076003,1.037022,0.700073,1.580606,-2.003294,0.700372
1,3.139555,-0.684951,-0.036886,-0.606008,0.672629,-0.887173,-0.976084,-0.012123,2.308283,-0.372987,2.103403,0.264347,1.611733,-0.066135,-1.076003,-0.033452,0.700073,-0.318702,0.899116,-0.233457
2,-0.598118,0.141184,-0.036885,1.087593,-1.352135,-0.887173,1.886059,-1.003479,-0.544607,-0.372987,-0.028972,2.259831,-0.702873,0.907372,0.687779,-1.103927,-1.190766,-0.318702,-1.035824,1.634202
3,2.973436,-0.354497,-0.036887,0.603707,0.672629,-0.887173,-0.976084,-0.012123,2.308283,-1.343015,2.103403,1.262089,2.769036,0.907372,-1.076003,-0.033452,0.700073,-1.268355,-2.003294,-1.167287
4,-0.598118,0.141184,-0.036886,0.724679,-1.352135,-0.887173,-0.022036,-1.003479,0.881838,1.567071,-1.09516,2.259831,1.611733,0.907372,-1.076003,-0.033452,1.645493,1.580606,-2.003294,-0.233457



Statistics after scaling (should be close to 0 mean, 1 std dev):


Unnamed: 0,Age,Hours per day,BPM,MH_Composite,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
count,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0
mean,6.757879e-17,-6.757879e-17,-2.413528e-18,-1.544658e-16,9.654113e-18,-6.757879e-17,-3.37894e-17,-2.6548810000000003e-17,-6.757879e-17,1.013682e-16,-3.8616450000000004e-17,5.3097620000000006e-17,2.8962340000000004e-17,6.275174e-17,4.827057e-17,3.37894e-17,9.895466000000001e-17,3.740969e-17,3.8616450000000004e-17,9.654113e-18
std,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068,1.00068
min,-1.262593,-1.180632,-0.03688889,-2.057667,-1.352135,-0.8871725,-0.9760841,-1.003479,-0.5446074,-1.343015,-1.09516,-0.7333943,-0.7028728,-1.039643,-1.076003,-2.174401,-1.190766,-1.268355,-2.003294,-1.167287
25%,-0.5981179,-0.519724,-0.03688604,-0.7269799,-0.339753,-0.8871725,-0.9760841,-1.003479,-0.5446074,-0.3729866,-1.09516,-0.7333943,-0.7028728,-1.039643,-1.076003,-1.103927,-1.190766,-1.268355,-1.035824,-1.167287
50%,-0.3489398,-0.1892702,-0.03688564,-0.001150545,-0.339753,0.1974769,-0.02203643,-0.01212257,-0.5446074,-0.3729866,-0.02897249,-0.7333943,-0.7028728,-0.06613502,-0.1941118,-0.03345232,-0.2453467,-0.3187017,-0.06835387,-0.2334574
75%,0.232476,0.4716376,-0.03688509,0.7246788,0.6726285,0.1974769,0.9320112,0.979234,0.8818376,0.5970421,1.037215,0.2643473,0.45443,0.9073725,0.6877789,1.037022,0.700073,0.630952,0.8991163,0.7003723
max,5.299099,6.750261,27.11088,2.660224,1.68501,2.366776,1.886059,1.970591,3.734728,1.567071,2.103403,2.259831,2.769036,1.88088,1.56967,1.037022,1.645493,1.580606,0.8991163,1.634202
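
In the summary above, every `BPM` quartile sits at about -0.0369 while the maximum is 27.11, which is approximately the square root of 735. That is the z-score a single dominating value takes among 736 rows, so the pattern is consistent with one extreme `BPM` entry compressing all other values. The sketch below reproduces the effect on synthetic numbers (the values and the `1e9` entry are made up) and compares it with `RobustScaler`, an alternative that this notebook does not use.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic tempo values for illustration only, plus one hypothetical bad entry
bpm = rng.normal(loc=120, scale=30, size=736)
bpm[0] = 1e9
column = bpm.reshape(-1, 1)

# StandardScaler: the outlier inflates the standard deviation and flattens everything else
z = StandardScaler().fit_transform(column).ravel()
z_spread_rest = z[1:].max() - z[1:].min()

# RobustScaler centres on the median and scales by the interquartile range
r = RobustScaler().fit_transform(column).ravel()
r_spread_rest = r[1:].max() - r[1:].min()

print(f"Outlier z-score: {z.max():.4f}, sqrt(n - 1): {np.sqrt(len(bpm) - 1):.4f}")
print(f"Spread of remaining values, StandardScaler: {z_spread_rest:.2e}")
print(f"Spread of remaining values, RobustScaler:   {r_spread_rest:.2f}")
```

Inspecting the unscaled `BPM` column in the cleaned data would confirm whether such an entry exists; clipping or correcting it before scaling is another option.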


## 6.4 Final Review and Save Engineered Features

Review the final structure and data types of the engineered DataFrame. Save it for the next step (model preparation).

In [7]:
print("\nFinal DataFrame Info:")
df_eng.info()

print("\nFirst 5 rows of the engineered DataFrame:")
display(df_eng.head())

print(f"\nFinal shape of engineered data: {df_eng.shape}")

# Ensure the target column 'Music effects' is still present if needed for next step
if 'Music effects' not in df_eng.columns:
    print("Warning: Target column 'Music effects' is missing. Re-adding from cleaned data.")
    if 'Music effects' in df_cleaned.columns:
         df_eng['Music effects'] = df_cleaned['Music effects']
    else:
         print("Error: Cannot re-add 'Music effects', missing in cleaned data too.")


# Save the engineered DataFrame
try:
    # Create directory if it doesn't exist
    os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)
    df_eng.to_csv(ENGINEERED_DATA_PATH, index=False)
    print(f"\nEngineered features successfully saved to: {ENGINEERED_DATA_PATH}")
except Exception as e:
    print(f"\nError saving engineered data: {e}")


Final DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 50 columns):
 #   Column                                                       Non-Null Count  Dtype  
---  ------                                                       --------------  -----  
 0   Age                                                          736 non-null    float64
 1   Hours per day                                                736 non-null    float64
 2   While working                                                736 non-null    Int64  
 3   Instrumentalist                                              736 non-null    Int64  
 4   Composer                                                     736 non-null    Int64  
 5   Exploratory                                                  736 non-null    Int64  
 6   Foreign languages                                            736 non-null    Int64  
 7   BPM                                                      

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Exploratory,Foreign languages,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,MH_Composite,Primary streaming service_I do not use a streaming service.,Primary streaming service_Other streaming service,Primary streaming service_Pandora,Primary streaming service_Spotify,Primary streaming service_YouTube Music,Fav genre_Country,Fav genre_EDM,Fav genre_Folk,Fav genre_Gospel,Fav genre_Hip hop,Fav genre_Jazz,Fav genre_K pop,Fav genre_Latin,Fav genre_Lofi,Fav genre_Metal,Fav genre_Pop,Fav genre_R&B,Fav genre_Rap,Fav genre_Rock,Fav genre_Video game music
0,-0.598118,-0.18927,1,1,1,1,1,-0.036885,-0.339753,-0.887173,-0.022036,-1.003479,-0.544607,0.597042,-1.09516,2.259831,2.769036,-0.066135,-1.076003,1.037022,0.700073,1.580606,-2.003294,0.700372,3.0,0.0,1.0,0.0,No effect,-1.573781,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3.139555,-0.684951,1,0,0,1,0,-0.036886,0.672629,-0.887173,-0.976084,-0.012123,2.308283,-0.372987,2.103403,0.264347,1.611733,-0.066135,-1.076003,-0.033452,0.700073,-0.318702,0.899116,-0.233457,7.0,2.0,2.0,1.0,No effect,-0.606008,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,-0.598118,0.141184,0,0,0,0,1,-0.036885,-1.352135,-0.887173,1.886059,-1.003479,-0.544607,-0.372987,-0.028972,2.259831,-0.702873,0.907372,0.687779,-1.103927,-1.190766,-0.318702,-1.035824,1.634202,7.0,7.0,10.0,2.0,No effect,1.087593,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,2.973436,-0.354497,1,0,1,1,1,-0.036887,0.672629,-0.887173,-0.976084,-0.012123,2.308283,-1.343015,2.103403,1.262089,2.769036,0.907372,-1.076003,-0.033452,0.700073,-1.268355,-2.003294,-1.167287,9.0,7.0,3.0,3.0,Improve,0.603707,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.598118,0.141184,1,0,0,1,0,-0.036886,-1.352135,-0.887173,-0.022036,-1.003479,0.881838,1.567071,-1.09516,2.259831,1.611733,0.907372,-1.076003,-0.033452,1.645493,1.580606,-2.003294,-0.233457,7.0,2.0,5.0,9.0,Improve,0.724679,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0



Final shape of engineered data: (736, 50)

Engineered features successfully saved to: ../data/processed/features_engineered.csv


## Feature Engineering Decisions Summary

*   **Composite Score:** Averaged `Anxiety`, `Depression`, `Insomnia`, `OCD` into `MH_Composite` to provide a single mental health indicator.
*   **Binary Encoding:** Mapped 'Yes'/'No' in `While working`, `Instrumentalist`, `Composer`, `Exploratory`, `Foreign languages` to 1/0.
*   **One-Hot Encoding:** Applied to `Primary streaming service` and `Fav genre` using `drop='first'` to avoid multicollinearity. This converts these nominal categories into numerical format suitable for modeling.
*   **Ordinal Features:** Kept the pre-mapped frequency columns (0-3) as ordinal features and scaled them along with the other numerical features.
*   **Normalization:** Used `StandardScaler` on `Age`, `Hours per day`, `BPM`, `MH_Composite`, and the frequency columns to standardize their scales, which benefits distance-based algorithms and those using regularization.
*   **Target Variable:** `Music effects` was retained but not transformed in this step; it will be handled during model-specific data preparation.
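*   **Retained and Dropped Columns:** The original mental health scores (`Anxiety`, `Depression`, `Insomnia`, `OCD`) were kept unscaled alongside `MH_Composite` for potential alternative modeling approaches. Because the composite is an exact linear combination of these four columns, a linear model should use either the composite or the individual scores, not both. Timestamp and Permissions were dropped earlier. The original `Primary streaming service` and `Fav genre` columns were dropped after one-hot encoding.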