# Matrix Factorization: Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis, or LDA, is a multi-class classification algorithm that can also be used for dimensionality reduction. The number of dimensions for the projection must be between 1 and C-1, where C is the number of classes (and it cannot exceed the number of input features). In this case, our dataset is a binary classification problem (two classes), which limits the number of dimensions to 1.

The scikit-learn library provides the LinearDiscriminantAnalysis class, an implementation of Linear Discriminant Analysis that can be used as a dimensionality reduction data transform. The "n_components" argument sets the number of desired dimensions in the output of the transform.
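As a rough illustration of the C-1 limit, the sketch below fits the transform on two small synthetic datasets and prints the shape of the projected data. The variable names (`X2`, `X3`, and so on) and the three-class dataset are assumptions made for this example only; they are not used elsewhere in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes (C = 2): the projection can have at most C - 1 = 1 dimension.
X2, y2 = make_classification(n_samples=200, n_features=20, n_informative=10,
                             n_redundant=10, n_classes=2, random_state=7)
X2_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X2, y2)
print(X2_lda.shape)  # (200, 1)

# Three classes (C = 3): up to C - 1 = 2 dimensions are available.
# This three-class dataset is hypothetical and only illustrates the limit.
X3, y3 = make_classification(n_samples=300, n_features=20, n_informative=10,
                             n_redundant=10, n_classes=3, random_state=7)
X3_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X3, y3)
print(X3_lda.shape)  # (300, 2)
```

Asking for more components than the limit allows is rejected by recent scikit-learn releases, so for a binary problem `n_components=1` is the only valid setting.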

## Import libraries

In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

## Load data

We will use the make_classification() function to create a test binary classification dataset. The dataset will have 1,000 examples with 20 input features, 10 of which are informative and 10 of which are redundant. This gives the dimensionality reduction transform an opportunity to identify and remove redundant input features. The fixed random seed for the pseudorandom number generator ensures we generate the same synthetic dataset each time the code runs.

It is a binary classification task, and we will evaluate a LogisticRegression model both on the raw data and after the dimensionality reduction transform. The model will be evaluated using repeated stratified 10-fold cross-validation with three repeats. The mean and standard deviation of classification accuracy across all folds and repeats will be reported.
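To make the evaluation procedure concrete, here is a minimal sketch of what repeated stratified 10-fold cross-validation produces: 10 folds times 3 repeats gives 30 train/test splits, and stratification keeps the class ratio of each test fold close to that of the full dataset. The toy arrays `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# 10 folds x 3 repeats = 30 evaluations in total.
n_evals = cv.get_n_splits()
print(n_evals)

# Toy balanced labels (hypothetical): 50 examples of each class.
y_demo = np.array([0] * 50 + [1] * 50)
X_demo = np.zeros((100, 1))

# Fraction of positive examples in every test fold.
pos_fractions = [y_demo[test_idx].mean() for _, test_idx in cv.split(X_demo, y_demo)]
print(len(pos_fractions), min(pos_fractions), max(pos_fractions))
```

Each of the 30 accuracy scores is computed on a held-out fold, and the reported mean and standard deviation are taken over those 30 values.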

In [2]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)

In [3]:
print(X.shape)
print(y.shape)

(1000, 20)
(1000,)


## Baseline model

In [4]:
# define the model
model = LogisticRegression()

In [5]:
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [6]:
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.824 (0.034)


## Model with LDA

We will use a Pipeline to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure.
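The point of wrapping both steps in a Pipeline is that the LDA transform is refit on the training portion of every fold and only applied to the held-out portion, which avoids data leakage. The sketch below is one way to check this, under the assumption that a manual loop over the same splits should reproduce the pipeline scores; the names `pipe`, `pipe_scores`, and `manual_scores` are hypothetical and used only here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Pipeline version: transform and model are fit together inside each fold.
pipe = Pipeline(steps=[('lda', LinearDiscriminantAnalysis(n_components=1)),
                       ('m', LogisticRegression())])
pipe_scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=cv)

# Manual version: fit LDA on the training fold only, then transform the test fold.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    lda = LinearDiscriminantAnalysis(n_components=1)
    Z_train = lda.fit_transform(X[train_idx], y[train_idx])
    Z_test = lda.transform(X[test_idx])
    clf = LogisticRegression().fit(Z_train, y[train_idx])
    manual_scores.append(clf.score(Z_test, y[test_idx]))

print(np.allclose(pipe_scores, manual_scores))
```

Fitting the transform on the full dataset before cross-validation would let information from the test folds influence the projection, which is what the pipeline prevents.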

In [7]:
# define the pipeline
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

In [8]:
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [9]:
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.825 (0.034)


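In this case, the pipeline with LDA scores 0.825 compared to 0.824 for the baseline fit on the raw data. The difference of 0.001 is far smaller than the standard deviation of 0.034, so performance is effectively unchanged, even though the model now uses a single input feature instead of 20.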