# X1: Feature Engineering - The Art of Creating Better Features

Feature engineering is often the difference between mediocre and winning models.

## Introduction

**Andrew Ng's famous quote:**

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Good features make simple models work. Bad features make complex models fail.

## Table of Contents

1. Handling Missing Data
2. Encoding Categorical Variables
3. Numerical Transformations
4. Feature Scaling
5. Interaction Features
6. Time-Based Features
7. Automated Feature Engineering

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# category_encoders is a separate third-party package (not part of scikit-learn)
from category_encoders import TargetEncoder
import warnings

warnings.filterwarnings('ignore')
print('✅ Libraries loaded')

## 1. Handling Missing Data

**Three Strategies:**

**A. Deletion:**
- Drop rows: When <5% missing
- Drop columns: When >70% missing

**B. Imputation:**
- Mean/Median: Numerical features
- Mode: Categorical features
- Forward/Backward fill: Time series
- Model-based: Predict missing values

**C. Flag Creation:**
- Create 'is_missing' binary feature
- Often helps models learn patterns

In [None]:
# Create sample data with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 60000, np.nan, 80000, 75000],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'NYC']
})
print('Original data:')
print(df)
print(f'\nMissing values:\n{df.isnull().sum()}')

# Strategy 1: Simple imputation
# fit_transform returns a 2-D array of shape (n, 1), so flatten it before assigning
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='mean')
df['age_imputed'] = num_imputer.fit_transform(df[['age']]).ravel()
df['income_imputed'] = SimpleImputer(strategy='median').fit_transform(df[['income']]).ravel()

# Strategy 2: Create missing flag (computed from the original, un-imputed column)
df['age_was_missing'] = df['age'].isnull().astype(int)
print('\nAfter imputation:')
print(df[['age', 'age_imputed', 'age_was_missing']])
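
The mode and forward/backward fill strategies from the list above can be sketched with plain pandas. The toy series and the names `cities` and `readings` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Mode imputation for a categorical column (toy data)
cities = pd.Series(['NYC', 'LA', np.nan, 'Chicago', 'NYC'])
cities_filled = cities.fillna(cities.mode()[0])  # most frequent value is 'NYC'

# Forward fill, then backward fill, for an ordered (time series) column.
# ffill carries the last observation forward; bfill covers a leading gap.
readings = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)
readings_filled = readings.ffill().bfill()

print(cities_filled.tolist())
print(readings_filled.tolist())
```

Forward fill only makes sense when the rows are in a meaningful order, which is why it is listed for time series rather than for arbitrary tabular data.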

## 2. Encoding Categorical Variables

**Methods:**

**A. Label Encoding:** For ordinal data (low, medium, high)
- Specify the category order explicitly; an encoder that sorts labels alphabetically will not preserve it

**B. One-Hot Encoding:** For nominal data with few categories (<10)
- Creates binary column for each category
- Avoids imposing false order

**C. Target Encoding:** For high-cardinality features (>10 categories)
- Replace category with mean target value
- Risk of overfitting, so compute the encoding within cross-validation folds

**D. Frequency Encoding:** Replace with category frequency

In [None]:
# One-Hot Encoding
df_cat = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
df_onehot = pd.get_dummies(df_cat, columns=['color'], prefix='color')
print('One-Hot Encoding:')
print(df_onehot)

# Target Encoding (mean of target for each category)
# Note: the means here use every row, including each row's own target.
# That is fine for a demo but leaks the target in real training.
df_target = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'price': [500, 300, 520, 400, 310]
})
target_means = df_target.groupby('city')['price'].mean()
df_target['city_encoded'] = df_target['city'].map(target_means)
print('\nTarget Encoding:')
print(df_target)
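
Frequency encoding and a cross-validated variant of target encoding are described above but not demonstrated. The following is a minimal sketch, assuming a toy `city`/`price` frame and a hypothetical output column `city_te_oof`; each row is encoded with category means computed from the other folds only, and categories unseen in those folds fall back to the fold's overall mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'price': [500, 300, 520, 400, 310, 480],
})

# Frequency encoding: share of rows belonging to each category
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

# Out-of-fold target encoding: a row never sees its own target value
df['city_te_oof'] = np.nan
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fit_part = df.iloc[fit_idx]
    fold_means = fit_part.groupby('city')['price'].mean()
    fallback = fit_part['price'].mean()  # used for categories absent from the fit folds
    encoded = df.iloc[enc_idx]['city'].map(fold_means).fillna(fallback)
    df.loc[df.index[enc_idx], 'city_te_oof'] = encoded.values

print(df)
```

Chicago appears only once, so its out-of-fold value is always the fallback mean rather than its own price, which is exactly the leakage this variant is meant to avoid.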

## 3. Numerical Transformations

**Common Transformations:**

**A. Log Transform:** For right-skewed data
- $x' = \log(x + 1)$
- Often makes the distribution closer to Gaussian

**B. Square Root:** For count data
- $x' = \sqrt{x}$

**C. Box-Cox:** Fits the power transformation parameter from the data
- Requires strictly positive values

**D. Binning:** Convert continuous to categorical
- Equal-width bins
- Equal-frequency bins (quantiles)
- Custom thresholds

In [None]:
# Log transformation for skewed data
np.random.seed(42)  # fixed seed so the printed skewness values are reproducible
skewed_data = np.exp(np.random.normal(0, 1, 1000))  # Highly skewed (log-normal)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(skewed_data, bins=50)
ax1.set_title('Original (Skewed)', fontweight='bold')
ax1.set_xlabel('Value')

log_transformed = np.log1p(skewed_data)  # log(1 + x), safe at x = 0
ax2.hist(log_transformed, bins=50)
ax2.set_title('Log Transformed (More Gaussian)', fontweight='bold')
ax2.set_xlabel('Log(Value + 1)')

plt.tight_layout()
plt.show()

print(f'Original skewness: {pd.Series(skewed_data).skew():.2f}')
print(f'Log-transformed skewness: {pd.Series(log_transformed).skew():.2f}')
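
Box-Cox and the three binning styles listed above can be sketched as follows. This assumes scikit-learn's `PowerTransformer` for Box-Cox and pandas `cut`/`qcut` for binning; the data, bin edges, and labels are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive

# Box-Cox: the exponent (lambda) is estimated from the data
pt = PowerTransformer(method='box-cox', standardize=False)
boxcox = pt.fit_transform(skewed.reshape(-1, 1)).ravel()
print(f'Fitted lambda: {pt.lambdas_[0]:.3f}')

# Binning
equal_width = pd.cut(skewed, bins=4)                               # same bin width
equal_freq = pd.qcut(skewed, q=4, labels=['q1', 'q2', 'q3', 'q4'])  # same row count
ages = pd.Series([5, 17, 30, 64, 80])
age_group = pd.cut(ages, bins=[0, 18, 65, 120],
                   labels=['child', 'adult', 'senior'])            # custom thresholds

print(pd.Series(equal_freq).value_counts().sort_index())
print(age_group.tolist())
```

On skewed data, equal-width bins tend to put most rows in the first bin, while quantile bins keep the counts balanced at the cost of uneven bin widths.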

## 4. Feature Scaling

**Why Scale?** Many algorithms (KNN, SVM, Neural Networks) are sensitive to feature magnitude.

**Methods:**

**A. Standardization (Z-score):**
- $x' = \frac{x - \mu}{\sigma}$
- Mean=0, Std=1
- Use when data is roughly Gaussian

**B. Min-Max Normalization:**
- $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
- Range: [0, 1]
- Use when you need a bounded range

**C. Robust Scaling:**
- Uses median and IQR (robust to outliers)

In [None]:
# Comparison of scaling methods
data = np.array([[1, 2000], [2, 3000], [3, 5000], [4, 10000], [5, 2500]])
df_scale = pd.DataFrame(data, columns=['feature1', 'feature2'])
print('Original data:')
print(df_scale)

# StandardScaler
scaler_std = StandardScaler()
df_std = pd.DataFrame(scaler_std.fit_transform(df_scale), columns=['f1_std', 'f2_std'])

# MinMaxScaler
scaler_mm = MinMaxScaler()
df_mm = pd.DataFrame(scaler_mm.fit_transform(df_scale), columns=['f1_mm', 'f2_mm'])

print('\nStandardized:')
print(df_std.head())
print('\nMin-Max Normalized:')
print(df_mm.head())
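
Robust scaling is listed above without an example. A small sketch, assuming scikit-learn's `RobustScaler` and a made-up column containing one outlier; the scaler is fitted on training data only and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One extreme value (100) dominates the mean and standard deviation
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_new = np.array([[2.5]])

robust = RobustScaler().fit(X_train)      # centers on the median, scales by the IQR
standard = StandardScaler().fit(X_train)  # centers on the mean, scales by the std

print('median:', robust.center_[0], 'IQR:', robust.scale_[0])
print('robust  :', robust.transform(X_train).ravel())
print('standard:', standard.transform(X_train).ravel())

# Reuse the statistics learned from the training data on unseen rows
print('new row, robust:', robust.transform(X_new).ravel())
```

With the robust scaler the four typical values stay spread between -1 and 0.5, whereas standardization squeezes them together because the outlier inflates the standard deviation.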

## 5. Interaction Features

**Idea:** Combine features to capture relationships that no single feature expresses

**Examples:**
- Price × Quantity = Total Revenue
- Weight / Height² = BMI
- Day of week × Hour = Weekly usage patterns

**Polynomial Features:**
- Degree 2: $[x_1, x_2] \rightarrow [x_1, x_2, x_1^2, x_1 x_2, x_2^2]$
- Captures non-linear relationships

In [None]:
# Polynomial features
X = np.array([[2, 3], [4, 5]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print('Original features: [x1, x2]')
print(X)
print('\nPolynomial features (degree=2): [x1, x2, x1², x1*x2, x2²]')
print(X_poly)
print(f'\nFeature names: {poly.get_feature_names_out()}')

## 6. Time-Based Features

**From timestamps, extract:**
- Hour, day of week, month, quarter, year
- Is_weekend, is_holiday
- Days since epoch
- Cyclical encoding (sin/cos for hour/month)

**Cyclical Encoding:**
- Hour 23 and Hour 0 are close, but numerically 23 apart
- Solution: $hour\_sin = \sin(2\pi \times hour / 24)$
- $hour\_cos = \cos(2\pi \times hour / 24)$

In [None]:
# Time-based feature extraction
# Lowercase 'h' is used for the hourly frequency; the uppercase 'H' alias is deprecated in recent pandas
dates = pd.date_range('2024-01-01', periods=5, freq='12h')
df_time = pd.DataFrame({'timestamp': dates})

# Extract features
df_time['hour'] = df_time['timestamp'].dt.hour
df_time['day_of_week'] = df_time['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df_time['is_weekend'] = (df_time['day_of_week'] >= 5).astype(int)

# Cyclical encoding for hour
df_time['hour_sin'] = np.sin(2 * np.pi * df_time['hour'] / 24)
df_time['hour_cos'] = np.cos(2 * np.pi * df_time['hour'] / 24)

print(df_time)
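
To check the claim that cyclical encoding brings hour 23 and hour 0 close together, the distance between encoded points can be computed directly. `cyclical_encode` is a hypothetical helper written for this sketch, not a library function.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 12, 23])
hour_sin, hour_cos = cyclical_encode(hours, period=24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance on the unit circle
dist_23_to_0 = np.linalg.norm(points[2] - points[0])
dist_12_to_0 = np.linalg.norm(points[1] - points[0])

print(f'distance(23h, 0h) = {dist_23_to_0:.3f}')
print(f'distance(12h, 0h) = {dist_12_to_0:.3f}')
```

Both the sine and the cosine column are needed: with the sine alone, hours 0 and 12 would both map to 0 and become indistinguishable. The same helper works for months with `period=12`.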

## 7. Automated Feature Engineering

**Tools:**

**Featuretools:** Automates feature creation using Deep Feature Synthesis
- Aggregation features
- Transformation features
- Handles multi-table data

**Best Practices:**
1. Start with domain knowledge
2. Create features based on business logic
3. Use automated tools for exploration
4. Always validate with cross-validation
5. Remove highly correlated features (absolute correlation > 0.95)
6. Use feature importance to prune
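
Best practice 5 can be sketched with a correlation matrix. This is one simple approach under stated assumptions: synthetic columns `a`, `a_copy`, and `b`, Pearson correlation, and the 0.95 threshold from the list above. It drops the later column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'a_copy': 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'b': rng.normal(size=200),                           # independent feature
})

# Absolute correlations, keeping only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates above the threshold with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print('Dropped:', to_drop)
print('Kept:', df_reduced.columns.tolist())
```

Which member of a correlated pair to keep is a judgment call; column order decides it here, but interpretability or missing-value rate may be better criteria in practice.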

## Conclusion

**Feature Engineering Checklist:**

**Data Cleaning:**
- ☑ Handle missing values
- ☑ Remove duplicates
- ☑ Fix data types

**Encoding:**
- ☑ One-hot encode nominal categories (<10 levels)
- ☑ Target encode high-cardinality (>10 levels)
- ☑ Label encode ordinal categories

**Transformations:**
- ☑ Log transform skewed features
- ☑ Scale features for distance-based models
- ☑ Create polynomial/interaction features

**Domain-Specific:**
- ☑ Extract time-based features
- ☑ Create ratios and aggregations
- ☑ Add domain knowledge features

**Validation:**
- ☑ Check for data leakage
- ☑ Validate with cross-validation
- ☑ Analyze feature importance

**Remember:** Good features > Complex models!
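
As a closing illustration of the validation items in the checklist, one way to guard against leakage is to wrap the preprocessing steps in a scikit-learn `Pipeline`, so imputers, scalers, and encoders are refitted on the training portion of every cross-validation fold. The dataset, column names, and choice of `LogisticRegression` below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on income and age
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'age': rng.integers(18, 70, size=n).astype(float),
    'income': rng.normal(60000, 15000, size=n),
    'city': rng.choice(['NYC', 'LA', 'Chicago'], size=n),
})
y = (df['income'] + 500 * df['age'] > 80000).astype(int)
df.loc[rng.choice(n, size=20, replace=False), 'age'] = np.nan  # inject missing values

numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocess = ColumnTransformer([
    ('num', numeric, ['age', 'income']),
    ('cat', categorical, ['city']),
])
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Every preprocessing step is fitted inside each fold, never on held-out rows
scores = cross_val_score(model, df, y, cv=5)
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```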