# Machine Learning Project: LDA (Linear Discriminant Analysis)

# Introduction 

1.1 Project Goal

The objective of this project is to:

Select a supervised machine learning classification algorithm

Study it in detail (theoretically and empirically)

Identify a data characteristic that affects its performance

Propose a variant of the algorithm that addresses this weakness

Evaluate the modified version on the OpenML-CC18 benchmark datasets

In this project, we selected Linear Discriminant Analysis (LDA) and focused on one data characteristic, noise in the form of outliers, to which LDA is known to be sensitive.

# Background

2.1 Linear Discriminant Analysis (LDA)

Brief explanation: LDA is a generative classifier that models each class as a Gaussian with a shared covariance matrix.

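Fitting procedure and decision rule:

Estimate class means and class priors from the training data

Estimate the shared (pooled) covariance matrix

Assign each sample to the class with the highest linear discriminant score, which yields linear decision boundaries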

Assumptions:

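Each class-conditional distribution is multivariate Gaussian

All classes share the same covariance matrix

Under these two assumptions the optimal decision boundaries are linear; the classes do not need to be linearly separable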
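
For reference, the shared-covariance Gaussian model described above leads to the standard linear discriminant below. The notation is introduced here for illustration and does not appear elsewhere in this report: $\mu_k$ is the mean of class $k$, $\pi_k$ its prior, $\Sigma$ the shared covariance, $n$ the number of training samples, and $K$ the number of classes.

```latex
\delta_k(x) = x^\top \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k
            + \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_k \, \delta_k(x)

\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i \,:\, y_i = k}
               (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top
```

Both $\hat{\mu}_k$ and $\hat{\Sigma}$ are plain sample averages, which is why a few extreme points can move them substantially.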

2.2 Why LDA is sensitive to outliers

Means shift dramatically with extreme values

Covariance matrix becomes inflated

Decision boundary rotates and becomes unstable

Outliers → distort model parameters → significant accuracy drop.
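
The effect on the estimates can be illustrated with a small synthetic example. This is a sketch on made-up data, not one of the benchmark datasets; the 5% contamination rate and the outlier location are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean 2-D Gaussian sample standing in for one class (synthetic, illustrative).
X_clean = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Replace 5% of the rows with extreme points far from the bulk of the data.
X_noisy = X_clean.copy()
n_outliers = 10
X_noisy[:n_outliers] = rng.normal(loc=25.0, scale=1.0, size=(n_outliers, 2))

# How far the sample mean moves, and how much the total variance grows.
mean_shift = np.linalg.norm(X_noisy.mean(axis=0) - X_clean.mean(axis=0))
var_clean = np.trace(np.cov(X_clean, rowvar=False))
var_noisy = np.trace(np.cov(X_noisy, rowvar=False))

print(f"mean shift: {mean_shift:.2f}")
print(f"total variance, clean: {var_clean:.2f}, noisy: {var_noisy:.2f}")
```

Even this small fraction of extreme rows shifts the mean noticeably and inflates the covariance by a large factor, which is the mechanism listed above.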

# Dataset Description
3.1 Benchmark Source

We use datasets from the OpenML-CC18 Curated Classification benchmark suite, a collection of 72 classification datasets selected for standardized ML evaluation.

3.2 Dataset Loading & Preprocessing

Datasets obtained in ARFF format

Loaded locally using scipy.io.arff.loadarff

Features split into:

Numerical

Categorical

Preprocessing steps:

Numerical: Standardization

Categorical: One-Hot Encoding
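
The loading and preprocessing steps above could be sketched as follows. The helper names `load_arff` and `build_preprocessor`, the `target_column` argument, and the in-memory toy file are assumptions made for illustration; the project's actual code may differ.

```python
import io

import pandas as pd
from scipy.io import arff
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def load_arff(source, target_column):
    """Load an ARFF file (path or text file-like) into features X and labels y."""
    data, _meta = arff.loadarff(source)
    df = pd.DataFrame(data)
    # loadarff returns nominal attributes as bytes; decode them to str.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].str.decode("utf-8")
    y = df[target_column]
    X = df.drop(columns=[target_column])
    return X, y


def build_preprocessor(X):
    """Standardize numerical columns and one-hot encode categorical ones."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()
    return ColumnTransformer(
        [
            ("num", StandardScaler(), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array, which LDA expects
    )


# Tiny made-up ARFF file so the sketch runs without a dataset on disk.
toy_arff = io.StringIO(
    "@relation toy\n"
    "@attribute size numeric\n"
    "@attribute colour {red,blue}\n"
    "@attribute class {pos,neg}\n"
    "@data\n"
    "1.0,red,pos\n"
    "2.0,blue,neg\n"
    "3.0,red,pos\n"
)
X, y = load_arff(toy_arff, target_column="class")
X_prepared = build_preprocessor(X).fit_transform(X)
print(X_prepared.shape)
```

In a cross-validation setting the preprocessor would normally be fitted on the training fold only, so that test-fold statistics do not leak into the scaling.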

3.3 Dataset Summary Table

# Baseline Method: Standard LDA
4.1 Implementation Overview

Preprocess each dataset

Use stratified 5-fold cross-validation

Evaluate:

Accuracy

Macro F1-score

Evaluate both:

Clean data

Noisy data (outliers artificially injected into training folds)
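
A minimal sketch of this evaluation loop is shown below, assuming the features are already a preprocessed NumPy array. The function name `evaluate_cv` and the `corrupt_train` hook are hypothetical, and the data comes from `make_classification` rather than from the benchmark.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold


def evaluate_cv(model, X, y, corrupt_train=None, n_splits=5, seed=0):
    """Return mean accuracy and mean macro F1 over stratified folds.

    corrupt_train is an optional callable (X_train, rng) -> X_train_noisy
    applied to the training fold only; test folds are never modified.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rng = np.random.default_rng(seed)
    accs, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]
        if corrupt_train is not None:
            X_train = corrupt_train(X_train, rng)
        fitted = clone(model).fit(X_train, y_train)
        y_pred = fitted.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], y_pred))
        f1s.append(f1_score(y[test_idx], y_pred, average="macro"))
    return float(np.mean(accs)), float(np.mean(f1s))


# Synthetic stand-in data; not one of the OpenML-CC18 datasets.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
acc, f1 = evaluate_cv(LinearDiscriminantAnalysis(), X, y)
print(f"clean accuracy: {acc:.3f}, macro F1: {f1:.3f}")
```

Passing a noise function as `corrupt_train` gives the noisy-data condition while leaving the test folds clean.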

4.2 Noise Injection Strategy

Explain briefly:

A fixed percentage of training samples is selected at random

Their feature values are perturbed with extreme values (controlled outlier strength)

Only training folds are modified; test folds remain clean
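
One possible way to implement such an injection is sketched below. The function name `inject_outliers`, the default `fraction` and `strength` values, and the choice to shift every feature of a selected row are assumptions for illustration; the exact scheme used in the experiments is not specified in this outline.

```python
import numpy as np


def inject_outliers(X, rng, fraction=0.1, strength=10.0):
    """Return a copy of X in which a random subset of rows is pushed far out.

    fraction: share of rows to corrupt (hypothetical default).
    strength: size of the shift in units of each feature's standard deviation.
    """
    X_noisy = np.array(X, dtype=float, copy=True)
    n_samples, n_features = X_noisy.shape
    n_corrupt = int(round(fraction * n_samples))
    # Per-feature scale measured on the data before any corruption.
    scale = X_noisy.std(axis=0)
    rows = rng.choice(n_samples, size=n_corrupt, replace=False)
    # Random sign per entry so outliers appear on both sides of the data.
    signs = rng.choice([-1.0, 1.0], size=(n_corrupt, n_features))
    X_noisy[rows] += signs * strength * scale
    return X_noisy


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_train_noisy = inject_outliers(X_train, rng, fraction=0.1, strength=10.0)
print("rows changed:", int(np.any(X_train_noisy != X_train, axis=1).sum()))
```

The function is meant to be called on the training fold inside the cross-validation loop, never on the test fold.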

4.3 Baseline Performance Results

(Insert your baseline results table here)

4.4 Observations

Summarize what happened when noise was added

Identify datasets most affected by outliers

# Proposed Modification: Outlier-Filtered LDA
5.1 Motivation

LDA relies on estimating means and covariance from all data points. Outliers strongly distort these estimates.
Idea: remove samples with extreme z-scores before fitting LDA → more stable covariance → more robust classifier.

5.2 Algorithm Description

Steps:

Compute z-scores per feature, using the mean and standard deviation of the training data

Mark samples whose absolute z-score exceeds z_thresh in ANY feature as outliers

Remove them from the training data

Fit standard LDA on filtered data

Test normally (no filtering applied to test set)
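
The steps above can be sketched as a scikit-learn style estimator. This is an illustrative version written for this outline, not necessarily the project's own FilteredLDA class; the default `z_thresh=3.0`, the `n_removed_` attribute, and the fallback when a class would be filtered out entirely are assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


class FilteredLDA(ClassifierMixin, BaseEstimator):
    """LDA fitted only on training rows whose |z-score| stays within z_thresh."""

    def __init__(self, z_thresh=3.0):
        self.z_thresh = z_thresh

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0  # constant features never flag a row
        z = np.abs((X - mean) / std)
        # A row is kept only if no feature exceeds the threshold.
        keep = (z <= self.z_thresh).all(axis=1)
        # Assumption: fall back to all rows if filtering would drop a whole class.
        if len(np.unique(y[keep])) < len(np.unique(y)):
            keep = np.ones(len(y), dtype=bool)
        self.n_removed_ = int((~keep).sum())
        self.lda_ = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
        self.classes_ = self.lda_.classes_
        return self

    def predict(self, X):
        # No filtering at prediction time: every test row gets a label.
        return self.lda_.predict(np.asarray(X, dtype=float))


# Synthetic two-class data with five extreme rows placed in class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X[:5] = 50.0
model = FilteredLDA(z_thresh=3.0).fit(X, y)
print("training rows removed:", model.n_removed_)
```

Because z-scores are computed per feature over the whole training fold, rare one-hot categories can also receive large z-scores, which is one way the filter may discard legitimate samples.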

5.3 Implementation Code

(Include your FilteredLDA class and explanation)

5.4 Expected Benefits

More stable class means

Better covariance estimate

Less boundary distortion

Improved accuracy under noisy conditions

# Empirical Evaluation
6.1 Experimental Setup

Same 5-fold CV as baseline

Same noise setup

Same preprocessing pipeline

Metrics: Accuracy, Macro F1

6.2 Results Table

(Insert merged table: baseline vs filtered)

| Dataset | acc_noisy | acc_noisy_filtered | Δacc | f1_noisy | f1_noisy_filtered | Δf1 |
|---|---|---|---|---|---|---|
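
A merged table with these columns could be assembled with pandas roughly as follows. The dataset names and all numbers in this sketch are made-up placeholders used only to show the mechanics; they are not results from this project.

```python
import pandas as pd

# Placeholder values purely for illustration; NOT project results.
baseline = pd.DataFrame(
    {"Dataset": ["toy_a", "toy_b"], "acc_noisy": [0.70, 0.80], "f1_noisy": [0.65, 0.78]}
)
filtered = pd.DataFrame(
    {
        "Dataset": ["toy_a", "toy_b"],
        "acc_noisy_filtered": [0.75, 0.79],
        "f1_noisy_filtered": [0.72, 0.77],
    }
)

# Join on the dataset name, then compute filtered-minus-baseline differences.
merged = baseline.merge(filtered, on="Dataset", how="inner")
merged["Δacc"] = merged["acc_noisy_filtered"] - merged["acc_noisy"]
merged["Δf1"] = merged["f1_noisy_filtered"] - merged["f1_noisy"]
merged = merged[
    ["Dataset", "acc_noisy", "acc_noisy_filtered", "Δacc", "f1_noisy", "f1_noisy_filtered", "Δf1"]
]
print(merged.round(3).to_string(index=False))
```

A positive Δ means the filtered variant scored higher than the baseline on that dataset.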

6.3 Visual Comparison

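Insert the scatter plots, with the baseline score on the x-axis, the filtered score on the y-axis, and the diagonal marking equal performance: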

Scatter Plot: Accuracy (Noisy Data)

(Your plot goes here)

Scatter Plot: Macro F1 (Noisy Data)

(Your plot goes here)

Optional: Bar chart for improvement per dataset

(Your bar plot if you include one)

# Results and Analysis
7.1 Overall Improvements

Most datasets lie above the diagonal of the scatter plots, meaning filtered LDA scores higher than baseline LDA on noisy data

Outlier-filtering improves robustness without requiring complex models

7.2 Dataset-Level Behavior

Summaries:

Strong improvements: dataset4.arff

Moderate: dataset3.arff, dataset2.arff, dataset8.arff

Neutral: dataset7.arff, dataset9.arff

Small negative: dataset10.arff (possible over-filtering)

7.3 Interpretation

Z-score filtering protects LDA from extreme values

Means + covariance become more stable

Model generalizes better under noisy training data

# Conclusions
8.1 Summary

Baseline LDA is highly sensitive to outliers

We implemented a simple, effective outlier-filtered LDA

The variant improved performance on most noisy datasets, with neutral results on two and a small drop on one

8.2 Strengths

Simple

Computationally cheap

Easy to interpret

Improves robustness on several datasets (strongly on one, moderately on three)

8.3 Limitations

May remove useful data if threshold is too strict

Effectiveness depends on dataset size and distribution

8.4 Future Work

Explore robust covariance estimation (e.g., shrinkage, MinCovDet)

Use Mahalanobis distance for more intelligent outlier detection

Study multiclass and high-dimensional datasets in more depth (LDA itself already handles multiple classes)

Combine filtering with dimensionality reduction
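
As a starting point for the first two items, the sketch below flags outliers by robust Mahalanobis distance using scikit-learn's `MinCovDet`, and shows the built-in shrinkage option of `LinearDiscriminantAnalysis`. The helper name `mahalanobis_inliers`, the 97.5% chi-squared cutoff, and the synthetic data are assumptions; in an LDA setting the mask would more naturally be computed per class.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def mahalanobis_inliers(X, quantile=0.975, seed=0):
    """Boolean mask of rows whose robust squared Mahalanobis distance is small."""
    mcd = MinCovDet(random_state=seed).fit(X)
    d2 = mcd.mahalanobis(X)  # squared distances to the robust location
    return d2 <= chi2.ppf(quantile, df=X.shape[1])


# Synthetic data with ten shifted rows acting as outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:10] += 15.0
mask = mahalanobis_inliers(X)
print("rows flagged as outliers:", int((~mask).sum()))

# Shrinkage is a separate option built into scikit-learn's LDA.
# Labels here are random and only serve to make the call runnable.
y = rng.integers(0, 2, size=200)
shrunk_lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
shrunk_lda.fit(X[mask], y[mask])
```

Unlike per-feature z-scores, the Mahalanobis distance accounts for correlations between features, so it can flag points that are unusual only in combination.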

# Appendix
A.1 Full Code Listing

(Place all helper functions, classes, etc.)

A.2 Environment & Libraries

Python version

sklearn version

pandas version
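
The versions can be collected with a short snippet like the one below. SciPy is included as an assumption, since the ARFF loading described earlier relies on it.

```python
import platform

import pandas
import scipy
import sklearn

# Gather the interpreter and library versions used for the experiments.
versions = {
    "python": platform.python_version(),
    "scikit-learn": sklearn.__version__,
    "pandas": pandas.__version__,
    "scipy": scipy.__version__,
}
for name, version in versions.items():
    print(f"{name}=={version}")
```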