# Feature Engineering & Advanced ML — Notes, Explanations and Code

_Contains: what it does, when to use, and runnable code templates (commented where needed)._

## 1. Handling Missing Values

**What it does:** Fills or removes missing values (NaN) so models can train properly.

**When to use:** Use whenever your dataset has missing entries. Choose strategy based on data type and missingness pattern.

**Common techniques:** Mean/median/mode imputation, forward/backward fill, KNN/Iterative imputer, drop rows/columns.
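
As a minimal sketch of choosing a strategy by data type, the snippet below fills a numeric column with its median, fills a categorical column with its mode, and keeps a flag marking where values were missing. The toy DataFrame and the `age_missing` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only)
df = pd.DataFrame({
    "age": [25, np.nan, 37, 29],
    "city": ["A", "B", "A", None],
})

# Record missingness before filling; the flag itself can be a useful feature
df["age_missing"] = df["age"].isna().astype(int)

# Numeric column: the median is less sensitive to outliers than the mean
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

On real data, the fill values should be computed on the training split and reused on the test split.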

In [None]:
# EXAMPLES: Handling Missing Values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# sample DataFrame
df = pd.DataFrame({'age':[25, np.nan, 37, 29],
                   'salary':[50000, 60000, np.nan, 52000],
                   'city':['A','B','A', None]})

# 1) SimpleImputer (mean / median / most_frequent)
mean_imp = SimpleImputer(strategy='mean')
# df['age'] = mean_imp.fit_transform(df[['age']]).ravel()  # fit_transform returns a 2-D array

# 2) KNNImputer (uses nearby rows)
knn = KNNImputer(n_neighbors=2)
# df[['age','salary']] = knn.fit_transform(df[['age','salary']])

# 3) Forward/Backward fill
# df = df.ffill()  # or df.bfill(); fillna(method=...) is deprecated in pandas 2.x

# 4) Drop rows that are missing a value in the listed columns
# df = df.dropna(subset=['age','salary'])

# Note: uncomment lines above to run on your actual data.


## 2. Encoding Categorical Variables

**What it does:** Converts string/categorical values to numeric values so ML models can use them.

**When to use:** When dataset contains categorical/text columns. Choose method based on cardinality and whether order matters.

**Techniques:** Label Encoding, One-Hot, Target Encoding, Frequency/Binary/Hash encoding.
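
One practical concern with one-hot encoding is a category that appears at prediction time but not in training. The sketch below, on made-up `train` and `test` frames, fits scikit-learn's `OneHotEncoder` on training data only and uses `handle_unknown="ignore"` so an unseen city becomes an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test frames; "Boston" never appears in training
train = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"]})
test = pd.DataFrame({"city": ["LA", "Boston"]})

# Fit on training data only, then reuse the same encoder on new data
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["city"]])

# transform() returns a sparse matrix by default; densify for display
encoded = enc.transform(test[["city"]]).toarray()

print(list(enc.categories_[0]))
print(encoded)
```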

In [None]:
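# EXAMPLES: Encoding
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'city':['NY','LA','NY','SF'], 'quality':['low','high','medium','low'], 'target':[0,1,0,1]})

# 1) Ordinal encoding (for ordered categories); state the order explicitly.
# LabelEncoder is meant for targets and sorts labels alphabetically (high < low < medium).
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
# df['quality_enc'] = oe.fit_transform(df[['quality']]).ravel()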

# 2) One-Hot Encoding (for nominal) using pandas
# df_ohe = pd.get_dummies(df, columns=['city'], drop_first=True)

# 3) Target Encoding (mean target per category); compute on training folds only to avoid target leakage
# df['city_te'] = df.groupby('city')['target'].transform('mean')

# 4) Frequency Encoding
# freq = df['city'].value_counts().to_dict()
# df['city_freq'] = df['city'].map(freq)


## 3. Feature Scaling

**What it does:** Brings numeric features onto a similar scale, which helps many algorithms converge faster and perform better.

**When to use:** For distance-based models (KNN, SVM), gradient-based models (NN), or when features have very different units.

**Techniques:** StandardScaler, MinMaxScaler, RobustScaler.
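
A common pitfall is fitting the scaler on all the data, which leaks test-set statistics into training. The sketch below, using small made-up arrays, fits `StandardScaler` on the training rows only and reuses the learned mean and standard deviation on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: columns are age and income
X_train = np.array([[20.0, 20000.0], [40.0, 50000.0], [60.0, 120000.0]])
X_test = np.array([[40.0, 50000.0]])

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)

# Apply the same statistics to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
print(X_test_s)
```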

In [None]:
# EXAMPLES: Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd

df = pd.DataFrame({'age':[20,40,60], 'income':[20000, 50000, 120000]})

# Standardization (Z-score)
scaler = StandardScaler()
# df[['age','income']] = scaler.fit_transform(df[['age','income']])

# Min-Max scaling (0-1)
mms = MinMaxScaler()
# df[['age','income']] = mms.fit_transform(df[['age','income']])

# Robust scaling (median & IQR) - good with outliers
rs = RobustScaler()
# df[['age','income']] = rs.fit_transform(df[['age','income']])


## 4. Handling Outliers

**What it does:** Detects and manages extreme values that can skew model learning.

**When to use:** When features show extreme values or heavy tails; check with boxplots or summary stats.

**Techniques:** IQR, Z-score, Winsorization, transform (log), or model-based detection.
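
The IQR rule can be wrapped in a small helper that returns a boolean mask, so the same logic can either drop or just flag rows. The helper name `iqr_outlier_mask` and the multiplier `k` are illustrative choices; 1.5 is the conventional boxplot value.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

salary = pd.Series([30_000, 35_000, 40_000, 1_500_000, 45_000])
mask = iqr_outlier_mask(salary)

# Keep the flag as a feature, or filter the rows out
inliers = salary[~mask]
print(mask.tolist())
print(inliers.tolist())
```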

In [None]:
# EXAMPLES: Outlier detection/handling
import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame({'salary':[30_000, 35_000, 40_000, 1_500_000, 45_000]})

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
# filter valid rows
# df_iqr = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]

# Z-score method (needs a reasonably large sample; with only 5 rows no |z| can reach 3)
# z_scores = np.abs(stats.zscore(df['salary']))
# df_z = df[z_scores < 3]

# Winsorization (cap values)
# df['salary_cap'] = df['salary'].clip(lower=df['salary'].quantile(0.01), upper=df['salary'].quantile(0.99))

# Log transform to reduce impact (if values > 0)
# df['salary_log'] = np.log1p(df['salary'])


## 5. Feature Transformation

**What it does:** Applies mathematical transforms to make skewed distributions more normal or relationships more linear.

**When to use:** When features are skewed or relationships to target are nonlinear.

**Techniques:** Log, Box-Cox, Yeo-Johnson, Power transforms, binning.
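
Yeo-Johnson is available in scikit-learn through `PowerTransformer` and, unlike Box-Cox, also accepts zero and negative values. The sketch below applies it to synthetic log-normal data (generated here only for illustration) and compares skewness before and after.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (illustrative only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)

# Yeo-Johnson estimates a power parameter per feature;
# by default the output is also standardized to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson")
x_yj = pt.fit_transform(x)

skew_before = skew(x.ravel())
skew_after = skew(x_yj.ravel())
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```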

In [None]:
# EXAMPLES: Transformation
import numpy as np
import pandas as pd
from scipy.stats import boxcox

df = pd.DataFrame({'income':[20000, 30000, 40000, 1000000]})

# Log transform (handles skewness)
# df['income_log'] = np.log1p(df['income'])

# Box-Cox (requires positive data)
# df['income_bc'], _ = boxcox(df['income'] + 1)  # add 1 if zeros present

# Binning (continuous -> categorical)
# df['income_bin'] = pd.cut(df['income'], bins=[0,30000,70000,1e7], labels=['low','mid','high'])


## 6. Feature Creation

**What it does:** Builds new features from existing ones to expose helpful signals.

**When to use:** When domain knowledge suggests useful combinations or aggregations; often improves model power.

**Examples:** Ratios, date-time parts, text lengths, aggregated statistics.
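
Aggregated statistics are often built with `groupby(...).transform(...)`, which returns one value per original row. In the sketch below, the `customer_id` and `amount` columns are hypothetical; each transaction is compared with that customer's average.

```python
import pandas as pd

# Hypothetical transaction data
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Per-customer mean, broadcast back to every row of that customer
df["cust_mean_amount"] = df.groupby("customer_id")["amount"].transform("mean")

# Ratio feature: how large is this transaction relative to the customer's norm?
df["amount_vs_cust_mean"] = df["amount"] / df["cust_mean_amount"]

print(df)
```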

In [None]:
# EXAMPLES: Feature creation
import pandas as pd

df = pd.DataFrame({'loan_amount':[1000,2000,3000], 'income':[10000,20000,15000], 'date':['2020-01-01','2020-06-15','2021-03-10'], 'review':['good','bad product','excellent value']})
df['date'] = pd.to_datetime(df['date'])

# Debt-to-income ratio
# df['dti'] = df['loan_amount'] / df['income']

# Extract date parts
# df['month'] = df['date'].dt.month
# df['weekday'] = df['date'].dt.day_name()

# Text features
# df['review_len'] = df['review'].str.len()
# df['review_word_count'] = df['review'].str.split().apply(len)


## 7. Feature Selection

**What it does:** Chooses a subset of features that are most useful to the model.

**When to use:** When dataset has many features, to reduce overfitting and improve speed.

**Techniques:** Filter methods (correlation, chi2), wrapper (RFE), embedded (Lasso, tree importance), and permutation importance.
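
To illustrate a filter method, the sketch below builds a synthetic dataset with one informative column and three noise columns, then lets `SelectKBest` with the ANOVA F-test (`f_classif`) pick the single best feature. The data and the choice of `k=1` are assumptions for demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Binary target and one feature that tracks it closely
y = np.array([0, 1] * 50)
informative = y + rng.normal(scale=0.1, size=100)

# Three pure-noise features
noise = rng.normal(size=(100, 3))
X = np.column_stack([informative, noise])

# Score each feature independently against y and keep the top one
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)

print("selected column indices:", selected.tolist())
```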

In [None]:
# EXAMPLES: Feature selection
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Assume X, y are prepared feature matrix and target
# X, y = ...

# RFE (recursive feature elimination)
# model = LogisticRegression(max_iter=1000)
# rfe = RFE(model, n_features_to_select=5)
# rfe.fit(X, y)
# selected = rfe.support_

# SelectKBest (filter); chi2 requires non-negative feature values
# skb = SelectKBest(score_func=chi2, k=10)
# skb.fit(X, y)

# Tree-based importance
# rf = RandomForestClassifier()
# rf.fit(X, y)
# importances = rf.feature_importances_


## 8. Handling Imbalanced Data

**What it does:** Rebalances the class distribution so models don't ignore the minority class.

**When to use:** For classification problems with skewed class ratios.

**Techniques:** Oversampling (SMOTE), undersampling, class weights, ensemble methods tailored to imbalance.
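
Class weights need no resampling and no extra library. As a sketch, the snippet below computes "balanced" weights for a made-up 90/10 label vector with scikit-learn's `compute_class_weight`; each weight equals `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# "balanced" weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

The resulting dictionary can be passed as the `class_weight` argument of estimators such as `LogisticRegression` or `RandomForestClassifier`; passing `class_weight="balanced"` directly applies the same formula.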

In [None]:
# EXAMPLES: Imbalanced data handling
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Assume X, y prepared
# print('Before:', Counter(y))

# SMOTE (oversample minority); resample the training split only, never the test set
# sm = SMOTE(random_state=42)
# X_res, y_res = sm.fit_resample(X, y)
# print('After SMOTE:', Counter(y_res))

# Random undersampling
# rus = RandomUnderSampler(random_state=42)
# X_ru, y_ru = rus.fit_resample(X, y)


## 9. Dimensionality Reduction

**What it does:** Reduces the number of features while preserving as much variance or structure as possible.

**When to use:** When many features lead to high-dimensional data, for visualization or to reduce noise.

**Techniques:** PCA, LDA, t-SNE, UMAP.
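
PCA is sensitive to feature scale, so it is usually preceded by standardization, and `explained_variance_ratio_` shows how much variance the kept components retain. The sketch below uses synthetic data in which two of three features are nearly collinear, so two components should capture almost everything.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: feature 2 is almost a multiple of feature 1, feature 3 is independent
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize first so no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
```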

In [None]:
# EXAMPLES: Dimensionality reduction
from sklearn.decomposition import PCA

# Assume X, y prepared
# PCA example
# pca = PCA(n_components=2)
# X_pca = pca.fit_transform(X)



## 10. Hyperparameter Tuning

**What it does:** Searches for the hyperparameter values that give the best cross-validated performance.

**When to use:** Before finalizing a model; tuned hyperparameters usually improve generalization.

**Techniques:** GridSearchCV, RandomizedSearchCV.
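
When preprocessing is part of the model, it helps to tune the whole `Pipeline` so each CV fold refits the scaler on its own training portion. The sketch below does this on a synthetic dataset from `make_classification`; the step names `scale` and `clf` and the grid of `C` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```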

In [None]:
# EXAMPLES: Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# grid search example
param_grid = {'n_estimators':[50,100], 'max_depth':[5,10,None]}
# grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
# grid.fit(X, y)
# print(grid.best_params_, grid.best_score_)

# randomized search
# param_dist = {'n_estimators':[50,100,200], 'max_depth':[3,5,10,None]}
# rand = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=3, random_state=42)
# rand.fit(X, y)
# print(rand.best_params_)


## 11. Cross-Validation

**What it does:** Evaluates model stability by training/testing on multiple splits.

**When to use:** Prefer it over a single train/test split, especially with limited data.

**Techniques:** K-Fold, Stratified K-Fold, Time Series split, Leave-One-Out.
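
For time-ordered data the key property is that every training index precedes every test index. The sketch below prints the index sets produced by `TimeSeriesSplit` on eight dummy observations to make that visible.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered observations (values are placeholders)
X = np.arange(8).reshape(-1, 1)

tss = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tss.split(X)]

# The training window grows; the test window always lies in the future
for train, test in splits:
    print("train:", train, "test:", test)
```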

In [None]:
# EXAMPLES: Cross-validation
from sklearn.model_selection import cross_val_score, StratifiedKFold, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier  # used in the TimeSeriesSplit example

# k-fold CV
# scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
# print('CV scores:', scores, 'mean:', scores.mean())

# StratifiedKFold for classification
# skf = StratifiedKFold(n_splits=5)
# scores_skf = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)

# TimeSeriesSplit for time-ordered data
# tss = TimeSeriesSplit(n_splits=5)
# scores_ts = cross_val_score(RandomForestClassifier(), X, y, cv=tss)


## 12. Regularization

**What it does:** Adds a penalty on model complexity to reduce overfitting.

**When to use:** Use when model overfits; common in linear models and neural networks.

**Types:** L1 (Lasso), L2 (Ridge), ElasticNet.
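
The practical difference between L1 and L2 shows up in the coefficients: L1 can set some exactly to zero, while L2 only shrinks them. The sketch below fits both on synthetic regression data in which only the first of five features matters; the data and `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic regression: y depends on feature 0 only
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("non-zero (Lasso):", np.count_nonzero(lasso.coef_))
```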

In [None]:
# EXAMPLES: Regularization
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Lasso (L1) - can do feature selection by zeroing coeffs (these are regressors; scale features first)
# model_l1 = Lasso(alpha=0.1)
# model_l1.fit(X, y)

# Ridge (L2) - shrink coefficients
# model_l2 = Ridge(alpha=1.0)
# model_l2.fit(X, y)

# ElasticNet (mix of L1 & L2)
# model_en = ElasticNet(alpha=0.1, l1_ratio=0.5)
# model_en.fit(X, y)


## 13. Ensemble Learning

**What it does:** Combines multiple models to get better predictions than single models.

**When to use:** When single models are unstable or to boost performance.

**Types:** Bagging (RandomForest), Boosting (AdaBoost, GradientBoosting, XGBoost, LightGBM, CatBoost), Stacking/Blending.
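
A quick way to judge whether an ensemble helps is to cross-validate it against a single base learner on the same folds. The sketch below compares one decision tree with a random forest on synthetic data; bagging often, though not always, gives the higher and more stable score, so no particular outcome is assumed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Single high-variance model versus a bagged ensemble of such models
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5
)

print(f"tree:   mean={tree_scores.mean():.3f} std={tree_scores.std():.3f}")
print(f"forest: mean={forest_scores.mean():.3f} std={forest_scores.std():.3f}")
```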

In [None]:
# EXAMPLES: Ensemble methods
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Random Forest (bagging)
# rf = RandomForestClassifier(n_estimators=100)
# rf.fit(X, y)

# AdaBoost (boosting)
# ada = AdaBoostClassifier(n_estimators=50)
# ada.fit(X, y)

# Gradient Boosting
# gb = GradientBoostingClassifier()
# gb.fit(X, y)

# Stacking example
# estimators = [('svc', SVC(probability=True)), ('rf', RandomForestClassifier())]
# stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
# stack.fit(X, y)


## 14. Model Evaluation Metrics

**What it does:** Quantifies how well a model performs using appropriate metrics.

**When to use:** Always after training; pick metrics matching problem (classification vs regression, imbalanced data, business cost).
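
The classification metrics all derive from the four confusion-matrix counts. As a sketch, the snippet below computes precision, recall and F1 by hand for a small made-up prediction vector and checks them against scikit-learn's functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions for a binary problem
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```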

In [None]:
# EXAMPLES: Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.metrics import mean_squared_error, r2_score

# For classification
# print('Accuracy:', accuracy_score(y_test, y_pred))
# print('Precision:', precision_score(y_test, y_pred))
# print('Recall:', recall_score(y_test, y_pred))
# print('F1:', f1_score(y_test, y_pred))
# print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
# y_score_prob = model.predict_proba(X_test)  # `model` stands for your fitted classifier
# print('ROC AUC:', roc_auc_score(y_test, y_score_prob[:,1]))

# For regression
# print('MSE:', mean_squared_error(y_test, y_pred))
# print('R2:', r2_score(y_test, y_pred))


## 15. Train–Test Splitting Strategies

**What it does:** Splits data into training and testing sets in ways suitable for the problem.

**When to use:** Always — ensures evaluation on unseen data. Use stratify for classification, time-based splits for time-series.
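
The effect of `stratify` is easiest to see by counting classes after the split. The sketch below uses a made-up 80/20 label vector and confirms that both the train and test sets keep the 20% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 20% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits should keep the original class ratio
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```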

In [1]:
# EXAMPLES: Train-test split strategies
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# Simple split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified split (maintains class ratios)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# StratifiedShuffleSplit for repeated stratified splits
# sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
# for train_idx, test_idx in sss.split(X, y):
#     X_tr, X_te = X[train_idx], X[test_idx]  # NumPy arrays; use X.iloc[...] for DataFrames
#     y_tr, y_te = y[train_idx], y[test_idx]
