# Data Preprocessing

## 1. Dimensionality Reduction

Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the most important relationships in the data. It is often used to reduce overfitting and obtain a simpler model while maintaining acceptable model performance.

- **Linear Transformation**
    - Principal Component Analysis (PCA)
    - Singular Value Decomposition (SVD)
    - Linear Discriminant Analysis (LDA)
- **Non-linear Transformation**
    - Isomap Embedding
    - Locally Linear Embedding (LLE)
    - Quadratic Discriminant Analysis (QDA)
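As a rough illustration of the techniques listed above, the sketch below applies several of them to a small synthetic dataset and prints the shape before and after reduction. The dataset sizes, the `n_neighbors=10` setting, and the names `X_demo`, `y_demo`, `reducers`, and `reduced_shapes` are assumptions made for this example only. SVD is represented here by scikit-learn's `TruncatedSVD`. QDA is left out because scikit-learn's `QuadraticDiscriminantAnalysis` is a classifier and does not provide a `transform` method.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Small synthetic dataset (sizes chosen only to keep the example fast)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_redundant=2, random_state=0)

# One transformer per technique; all hyperparameters are illustrative
reducers = {
    "PCA": PCA(n_components=2),
    "SVD": TruncatedSVD(n_components=2, random_state=0),
    # LDA is supervised and yields at most (n_classes - 1) components,
    # so a binary problem allows only one
    "LDA": LinearDiscriminantAnalysis(n_components=1),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=0),
}

reduced_shapes = {}
for name, reducer in reducers.items():
    # y_demo is used by LDA and ignored by the unsupervised transformers
    X_low = reducer.fit_transform(X_demo, y_demo)
    reduced_shapes[name] = X_low.shape
    print(f"{name}: {X_demo.shape} -> {X_low.shape}")
```

Any of these transformers could in principle take the place of the reduction step in a `Pipeline`, so that the reduction is fitted on the training folds only.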

In [2]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.datasets import make_classification

from sklearn.pipeline import Pipeline

In [3]:
# Synthetic binary classification data: 20 informative features plus 10 redundant ones
X, y = make_classification(n_samples=10000, n_features=30, n_informative=20, n_redundant=10, random_state=0)

In [4]:
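# Baseline: logistic regression on all 30 features
model = LogisticRegression(class_weight='balanced')

# 5-fold stratified cross-validation, repeated twice (10 accuracy scores in total)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)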

In [5]:
print(f"Accuracy: {np.mean(scores):.3f} (std: {np.std(scores):.3f})")

Accuracy: 0.817 (std: 0.009)


In [6]:
# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

# Project the 30 features onto 5 principal components, then fit the classifier
steps = [('PCA', PCA(n_components=5)),
         ('m', LogisticRegression(class_weight='balanced'))]

model = Pipeline(steps=steps)

# Same cross-validation setup as the baseline, so the scores are comparable
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

print(f"Accuracy: {np.mean(scores):.3f} (std: {np.std(scores):.3f})")


Accuracy: 0.595 (std: 0.009)


CPU times: total: 46.9 ms
Wall time: 617 ms
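The PCA pipeline above keeps only 5 components, and its accuracy is well below the baseline, so it may be worth checking how the score changes with `n_components`. The sketch below repeats the same pipeline and cross-validation setup over a small grid. The grid values and the names `X_sweep`, `y_sweep`, `cv_sweep`, `component_grid`, and `sweep_results` are assumptions made for this example, and no particular result is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Same synthetic data settings as the notebook cells above
X_sweep, y_sweep = make_classification(n_samples=10000, n_features=30, n_informative=20,
                                       n_redundant=10, random_state=0)

cv_sweep = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# Hypothetical grid of component counts to compare
component_grid = [5, 10, 15, 20]

sweep_results = {}
for k in component_grid:
    pipe = Pipeline(steps=[('PCA', PCA(n_components=k)),
                           ('m', LogisticRegression(class_weight='balanced'))])
    fold_scores = cross_val_score(pipe, X_sweep, y_sweep, scoring='accuracy', cv=cv_sweep)
    # Store mean and standard deviation of the 10 fold scores
    sweep_results[k] = (np.mean(fold_scores), np.std(fold_scores))
    print(f"n_components={k}: accuracy {sweep_results[k][0]:.3f} (std: {sweep_results[k][1]:.3f})")
```

Because PCA sits inside the `Pipeline`, it is refitted on the training portion of every fold, which keeps the held-out fold from influencing the projection.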


## 2. Sampling

- Downsampling
- Upsampling
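A minimal sketch of both approaches, assuming a binary problem with a 900/100 class imbalance and using `sklearn.utils.resample`. The toy data and the names `X_imb`, `y_imb`, `X_down`, `y_down`, `X_up`, and `y_up` are made up for this illustration. Downsampling draws majority-class rows without replacement until they match the minority count. Upsampling draws minority-class rows with replacement until they match the majority count.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 900 samples of class 0 and 100 samples of class 1
rng = np.random.default_rng(0)
X_imb = rng.normal(size=(1000, 4))
y_imb = np.array([0] * 900 + [1] * 100)

X_major, X_minor = X_imb[y_imb == 0], X_imb[y_imb == 1]

# Downsampling: shrink the majority class to the size of the minority class
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)
X_down = np.vstack([X_major_down, X_minor])
y_down = np.array([0] * len(X_major_down) + [1] * len(X_minor))

# Upsampling: grow the minority class to the size of the majority class
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)
X_up = np.vstack([X_major, X_minor_up])
y_up = np.array([0] * len(X_major) + [1] * len(X_minor_up))

print("original:   ", np.bincount(y_imb))   # [900 100]
print("downsampled:", np.bincount(y_down))  # [100 100]
print("upsampled:  ", np.bincount(y_up))    # [900 900]
```

In a cross-validated workflow, resampling would normally be applied to the training folds only, so that the evaluation folds keep the original class distribution.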