# Feature Engineering for Hit Song Prediction

## Objectives
1. Create interaction features (e.g., energy × danceability)
2. Add polynomial features for non-linear relationships
3. Engineer temporal features (normalized year, year period)
4. Create domain-specific features
5. Evaluate impact on model performance


In this notebook, we enrich the original Spotify dataset with interaction terms, polynomial features, and simple year-based temporal features to capture relationships the baseline model might miss. After creating features such as energy × danceability, valence × energy, squared terms, and a “party factor,” we rebuild the feature matrix, apply the same stratified train–test split and scaling, and train a class-weighted logistic regression model. We then compare this model directly against one trained on the original feature set to see whether the new features improve predictive power, and we identify which engineered features carry the largest coefficients. The final engineered dataset is saved for use in later modeling.

---

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.metrics import f1_score, recall_score, precision_score, roc_auc_score

warnings.filterwarnings('ignore')
RANDOM_SEED = 42

project_root = Path.cwd().parent
processed_data_dir = project_root / 'data' / 'processed'
figures_dir = project_root / 'figures'

print("Setup complete")

Setup complete


## 1. Load Original Data

In [4]:
# Load the processed dataset
data_file = processed_data_dir / 'hits_dataset.csv'

if not data_file.exists():
    print("ERROR: hits_dataset.csv not found!")
    print(f"   Expected location: {data_file}")
    print("\nPlease run the following notebooks first:")
    print("   1. 01_Week1_Data_Setup_EDA.ipynb")
    print("   2. 02_Week2_Baseline_Modeling.ipynb")
    print("\nThese notebooks will create the hits_dataset.csv file needed for feature engineering.")
    raise FileNotFoundError(f"Required file not found: {data_file}")

df = pd.read_csv(data_file)
print(f"Dataset loaded: {df.shape}")
print(f"   Features: {df.shape[1]}")
print(f"   Samples: {df.shape[0]}")
df.head()

Dataset loaded: (113999, 13)
   Features: 13
   Samples: 113999


|   | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | is_hit | year | track_name | artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.676 | 0.461 | -6.746 | 0.143 | 0.0322 | 1e-06 | 0.358 | 0.715 | 87.917 | 0 | 2015 | Comedy | Gen Hoshino |
| 1 | 0.42 | 0.166 | -17.235 | 0.0763 | 0.924 | 6e-06 | 0.101 | 0.267 | 77.489 | 0 | 2015 | Ghost - Acoustic | Ben Woodward |
| 2 | 0.438 | 0.359 | -9.734 | 0.0557 | 0.21 | 0.0 | 0.117 | 0.12 | 76.332 | 0 | 2015 | To Begin Again | Ingrid Michaelson;ZAYN |
| 3 | 0.266 | 0.0596 | -18.515 | 0.0363 | 0.905 | 7.1e-05 | 0.132 | 0.143 | 181.74 | 0 | 2015 | Can't Help Falling In Love | Kina Grannis |
| 4 | 0.618 | 0.443 | -9.681 | 0.0526 | 0.469 | 0.0 | 0.0829 | 0.167 | 119.949 | 1 | 2015 | Hold On | Chord Overstreet |


## 2. Create Interaction Features

Combine features that might work together to predict hits.

In [5]:
# Get audio feature columns
exclude_cols = ['is_hit', 'year'] + df.select_dtypes(include=['object']).columns.tolist()
audio_features = [col for col in df.columns if col not in exclude_cols]

# Create interaction features
df_engineered = df.copy()

# Domain knowledge interactions
if 'energy' in audio_features and 'danceability' in audio_features:
    df_engineered['energy_x_danceability'] = df['energy'] * df['danceability']
    print("Created: energy × danceability")

if 'valence' in audio_features and 'energy' in audio_features:
    df_engineered['valence_x_energy'] = df['valence'] * df['energy']
    print("Created: valence × energy (happy & energetic)")

if 'loudness' in audio_features and 'energy' in audio_features:
    df_engineered['loudness_x_energy'] = df['loudness'] * df['energy']
    print("Created: loudness × energy")

if 'acousticness' in audio_features and 'energy' in audio_features:
    df_engineered['acoustic_vs_energy'] = df['acousticness'] - df['energy']
    print("Created: acousticness - energy (acoustic contrast)")

# Danceability composite
if all(f in audio_features for f in ['danceability', 'valence', 'energy']):
    df_engineered['party_factor'] = (df['danceability'] + df['valence'] + df['energy']) / 3
    print("Created: party_factor (avg of dance, valence, energy)")

print(f"\nDataset shape after interactions: {df_engineered.shape}")

Created: energy × danceability
Created: valence × energy (happy & energetic)
Created: loudness × energy
Created: acousticness - energy (acoustic contrast)
Created: party_factor (avg of dance, valence, energy)

Dataset shape after interactions: (113999, 18)
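
The repeated guard-and-multiply pattern above could also be written as a small helper. The following is a minimal sketch, not the notebook's own code: `add_product_features` is a hypothetical name, and the toy rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def add_product_features(frame, pairs):
    """Return a copy of `frame` with one `a_x_b` column per available (a, b) pair.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    out = frame.copy()
    for a, b in pairs:
        # Skip pairs whose source columns are missing, mirroring the guards above
        if a in out.columns and b in out.columns:
            out[f"{a}_x_{b}"] = out[a] * out[b]
    return out


# Toy rows for illustration only; these are not values from the dataset
toy = pd.DataFrame({
    "energy": [0.5, 0.8],
    "danceability": [0.4, 0.5],
    "valence": [0.2, 1.0],
})

toy_eng = add_product_features(
    toy,
    [("energy", "danceability"), ("valence", "energy"), ("loudness", "energy")],
)
print(toy_eng.columns.tolist())
```

Because the toy frame has no `loudness` column, that pair is skipped silently, which matches the behaviour of the `if` guards in the cell above.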


## 3. Polynomial Features

Capture non-linear relationships.

In [6]:
# Add squared terms for key features
key_features = ['danceability', 'energy', 'valence'] if all(f in audio_features for f in ['danceability', 'energy', 'valence']) else audio_features[:3]

for feature in key_features:
    if feature in df.columns:
        df_engineered[f'{feature}_squared'] = df[feature] ** 2
        print(f"Created: {feature}²")

print(f"\nDataset shape after polynomial features: {df_engineered.shape}")

Created: danceability²
Created: energy²
Created: valence²

Dataset shape after polynomial features: (113999, 21)


## 4. Temporal Features

The dataset provides only a release `year`, so month, season, and day of week cannot be extracted. Instead, we derive a normalized year and a three-level year period.

In [7]:
# Year-based features
if 'year' in df.columns:
    year_min = df['year'].min()
    year_max = df['year'].max()
    
    # Normalize year (0-1 scale)
    if year_max > year_min:
        df_engineered['year_normalized'] = (df['year'] - year_min) / (year_max - year_min)
    else:
        # All years are the same, set to 0.5
        df_engineered['year_normalized'] = 0.5
        print("Warning: All songs from the same year, year_normalized set to 0.5")
    
    # Year bins (early, mid, late period)
    try:
        if year_max - year_min >= 2:  # Need at least 3 distinct values for 3 bins
            df_engineered['year_period'] = pd.cut(df['year'], bins=3, labels=[0, 1, 2]).astype(int)
        else:
            # Not enough year range for binning
            df_engineered['year_period'] = 1  # Set all to middle period
            print(f"Warning: Year range too small ({year_min}-{year_max}), year_period set to 1")
    except Exception as e:
        print(f"Warning: Could not create year_period bins: {e}")
        df_engineered['year_period'] = 1
    
    print("Created: year_normalized, year_period")
else:
    print("No 'year' column found, skipping temporal features")

print(f"\nFinal engineered dataset shape: {df_engineered.shape}")
print(f"Added {df_engineered.shape[1] - df.shape[1]} new features")

Created: year_normalized, year_period

Final engineered dataset shape: (113999, 23)
Added 10 new features
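
To make the two year-based transformations concrete, here is a small sketch on made-up years (not the dataset's). It uses the same min-max formula and the same `pd.cut(..., bins=3)` call as the cell above. One thing it illustrates is that `pd.cut` with an integer `bins` produces equal-width intervals over the year range, not groups of equal size.

```python
import pandas as pd

# Made-up years for illustration only
years = pd.Series([2000, 2004, 2005, 2010, 2019, 2020])

# Min-max scaling: earliest year maps to 0, latest to 1
year_normalized = (years - years.min()) / (years.max() - years.min())

# Three equal-width bins over [min, max], labelled 0 (early), 1 (mid), 2 (late)
year_period = pd.cut(years, bins=3, labels=[0, 1, 2]).astype(int)

print(pd.DataFrame({
    "year": years,
    "year_normalized": year_normalized,
    "year_period": year_period,
}))
```

If equal-sized groups were preferred, `pd.qcut` would be the usual alternative, though that is a different design choice from the one made in this notebook.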


## 5. Prepare Data for Modeling

In [8]:
# Prepare data
exclude_cols = ['is_hit', 'year'] + df_engineered.select_dtypes(include=['object']).columns.tolist()
feature_cols = [col for col in df_engineered.columns if col not in exclude_cols]

X = df_engineered[feature_cols].values
y = df_engineered['is_hit'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape}")
print(f"Features: {len(feature_cols)}")

Training set: (91199, 19)
Features: 19


## 6. Compare Original vs Engineered Features

In [9]:
# Get original features for comparison
original_features = [col for col in df.columns if col not in ['is_hit', 'year'] and col not in df.select_dtypes(include=['object']).columns]

# Use the SAME train/test split for fair comparison
# Create stratified split indices to ensure both models use identical train/test sets
X_orig = df[original_features].values
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=RANDOM_SEED, stratify=y)

# Apply same split to original features
X_orig_train = X_orig[train_idx]
X_orig_test = X_orig[test_idx]
y_orig_train = y[train_idx]
y_orig_test = y[test_idx]

# Scale original features
scaler_orig = StandardScaler()
X_orig_train_scaled = scaler_orig.fit_transform(X_orig_train)
X_orig_test_scaled = scaler_orig.transform(X_orig_test)

# Train original features model
model_orig = LogisticRegression(class_weight='balanced', random_state=RANDOM_SEED, max_iter=1000)
model_orig.fit(X_orig_train_scaled, y_orig_train)
y_pred_orig = model_orig.predict(X_orig_test_scaled)

# Train engineered features model on the split and scaled data prepared in the previous cell
model_eng = LogisticRegression(class_weight='balanced', random_state=RANDOM_SEED, max_iter=1000)
model_eng.fit(X_train_scaled, y_train)
y_pred_eng = model_eng.predict(X_test_scaled)

# Verify same test set (should be True)
assert np.array_equal(y_orig_test, y_test), "Test sets don't match! Check random seed."

# Compare metrics on the SAME test set
print("\n" + "="*60)
print("ORIGINAL vs ENGINEERED FEATURES COMPARISON")
print("="*60)

metrics = {
    'Features Count': [len(original_features), len(feature_cols)],
    'Precision': [precision_score(y_orig_test, y_pred_orig), precision_score(y_test, y_pred_eng)],
    'Recall': [recall_score(y_orig_test, y_pred_orig), recall_score(y_test, y_pred_eng)],
    'F1 Score': [f1_score(y_orig_test, y_pred_orig), f1_score(y_test, y_pred_eng)]
}

comparison = pd.DataFrame(metrics, index=['Original', 'Engineered'])
print(comparison)

improvement = ((comparison.loc['Engineered', 'F1 Score'] - comparison.loc['Original', 'F1 Score']) /
               comparison.loc['Original', 'F1 Score'] * 100)
print(f"\nF1 Score Improvement: {improvement:+.2f}%")


ORIGINAL vs ENGINEERED FEATURES COMPARISON
            Features Count  Precision    Recall  F1 Score
Original                 9   0.039352  0.822430  0.075109
Engineered              19   0.040371  0.824766  0.076973

F1 Score Improvement: +2.48%
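
The reported improvement is relative to the original F1 score, so it helps to see the absolute change alongside it. The sketch below recomputes F1 from the precision and recall values printed in the table above, using the standard harmonic-mean definition. `f1_from` is a hypothetical helper name, and small differences from the table are expected because the inputs are rounded to six decimals.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the comparison table above
f1_orig = f1_from(0.039352, 0.822430)
f1_eng = f1_from(0.040371, 0.824766)

absolute_gain = f1_eng - f1_orig
relative_gain_pct = absolute_gain / f1_orig * 100

print(f"F1 original:   {f1_orig:.6f}")
print(f"F1 engineered: {f1_eng:.6f}")
print(f"Absolute gain: {absolute_gain:+.6f}")
print(f"Relative gain: {relative_gain_pct:+.2f}%")
```

Because F1 is a harmonic mean, it is pulled toward the smaller of its two inputs. Here precision is near 0.04 while recall is above 0.82, so F1 stays low for both models.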


## 7. Top Engineered Features

In [10]:
# Feature importance from coefficients
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': model_eng.coef_[0],
    'Abs_Coefficient': np.abs(model_eng.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print("\nTop 10 Most Important Features:")
print("="*60)
for idx, row in feature_importance.head(10).iterrows():
    feature_type = "ENGINEERED" if row['Feature'] not in original_features else "ORIGINAL"
    print(f"{row['Feature']:30s} {row['Coefficient']:+.4f}  [{feature_type}]")

# Count engineered features in top 10
top10_engineered = sum(1 for f in feature_importance.head(10)['Feature'] if f not in original_features)
print(f"\n{top10_engineered}/10 top features are engineered")


Top 10 Most Important Features:
valence_x_energy               +2.0903  [ENGINEERED]
energy_squared                 -1.6221  [ENGINEERED]
danceability                   +1.4115  [ORIGINAL]
energy_x_danceability          -1.3599  [ENGINEERED]
valence_squared                -1.2063  [ENGINEERED]
loudness                       +1.0922  [ORIGINAL]
instrumentalness               -1.0099  [ORIGINAL]
valence                        -0.9958  [ORIGINAL]
energy                         +0.9538  [ORIGINAL]
acoustic_vs_energy             -0.5369  [ENGINEERED]

5/10 top features are engineered
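
One caveat when reading coefficient magnitudes as importance: a feature and terms derived from it (such as `energy`, `energy_squared`, and `energy_x_danceability`) tend to be correlated, and a linear model can then spread weight across them in ways that make individual coefficients harder to interpret. The sketch below illustrates the correlation on synthetic data. The uniform random values are an assumption for illustration and stand in for the real audio features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two audio features on a 0-1 scale (assumption, not real data)
rng = np.random.default_rng(42)
energy = rng.uniform(0, 1, size=10_000)
danceability = rng.uniform(0, 1, size=10_000)

toy = pd.DataFrame({
    "energy": energy,
    "energy_squared": energy ** 2,
    "energy_x_danceability": energy * danceability,
})

# Pairwise Pearson correlations between the base feature and its derived terms
corr = toy.corr()
print(corr.round(2))
```

Running the same `.corr()` check on the engineered columns of `df_engineered` would show how strongly the real features overlap.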


## 8. Save Engineered Dataset

In [11]:
# Ensure output directory exists
processed_data_dir.mkdir(parents=True, exist_ok=True)

# Save engineered dataset
output_file = processed_data_dir / 'hits_dataset_engineered.csv'
df_engineered.to_csv(output_file, index=False)

print(f"Saved engineered dataset to: {output_file}")
print(f"   Original features: {len(original_features)}")
print(f"   Total features: {len(feature_cols)}")
print(f"   New features: {len(feature_cols) - len(original_features)}")
print(f"\nDataset info:")
print(f"   Rows: {df_engineered.shape[0]:,}")
print(f"   Columns: {df_engineered.shape[1]}")
print(f"   File size: {output_file.stat().st_size / 1024:.1f} KB")

Saved engineered dataset to: c:\Users\aruni\FeatureBeats\data\processed\hits_dataset_engineered.csv
   Original features: 9
   Total features: 19
   New features: 10

Dataset info:
   Rows: 113,999
   Columns: 23
   File size: 24436.6 KB


---


### New Features Created:
1. **Interaction Terms**: energy×danceability, valence×energy, etc.
2. **Polynomial Features**: Squared terms for key features
3. **Domain Features**: party_factor, acoustic_vs_energy (acoustic contrast)
4. **Temporal Features**: year_normalized, year_period

### Impact:
- Original features: 9 basic Spotify audio features
- Engineered features: 19 in total, adding 10 domain-informed interaction, polynomial, and temporal terms
- On the same test set, F1 rises from 0.0751 to 0.0770 (+2.48% relative), while precision stays near 0.04

### Usage:
Use `hits_dataset_engineered.csv` in subsequent modeling for potentially better results!
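
A later notebook might load the engineered file along the following lines. This is a sketch under stated assumptions: `load_feature_matrix` is a hypothetical helper, and the demo writes a tiny made-up CSV to a temporary directory so the example runs on its own. In the project, the path would be `processed_data_dir / 'hits_dataset_engineered.csv'`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_feature_matrix(csv_path, target="is_hit", drop=("year",)):
    """Read an engineered CSV and return (numeric features, target).

    Hypothetical helper: keeps numeric columns only, then removes the target
    and the raw `year`, matching the exclusions used in this notebook.
    """
    data = pd.read_csv(csv_path)
    numeric = data.select_dtypes(include="number")
    feature_cols = [c for c in numeric.columns if c not in (target, *drop)]
    return data[feature_cols], data[target]


# Stand-in file with made-up rows, used only to keep the sketch runnable
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_engineered.csv"
    pd.DataFrame({
        "energy": [0.4, 0.9],
        "energy_squared": [0.16, 0.81],
        "year": [2015, 2016],
        "year_normalized": [0.0, 1.0],
        "track_name": ["a", "b"],
        "is_hit": [0, 1],
    }).to_csv(demo_path, index=False)
    X_demo, y_demo = load_feature_matrix(demo_path)

print(X_demo.columns.tolist())
```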


Feature engineering adds 10 new predictors and yields a modest gain: F1 rises from 0.0751 to 0.0770, a relative improvement of about 2.5%, while precision stays near 0.04. Five of the ten largest coefficients belong to engineered features, which suggests they add useful signal, although coefficients of correlated terms (for example, energy and energy²) should be interpreted with care. Key contributors include valence × energy, energy², and energy × danceability. Overall, simple domain-informed transformations help the model capture somewhat more complex patterns, and the engineered dataset gives later models such as XGBoost a richer feature set to work with.

---