# Linear Discriminant Analysis

We use LDA because its estimates are more stable than those of logistic regression when the classes are well separated, or when the sample is small and the predictors are approximately normally distributed within each class. It is also commonly used when **we have more than two response classes**.

The LDA approach models the distribution of X separately for each class, and then uses **Bayes' theorem** to turn these distributions into estimates of $Pr(Y|X)$.
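As a sketch of the step described above, Bayes' theorem can be written as follows. The symbols are assumptions taken from the usual textbook notation rather than from these notes: $\pi_k$ is the prior probability of class $k$, $f_k(x)$ is the density of X within class $k$, and $K$ is the number of classes.

$$Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

LDA estimates $\pi_k$ and $f_k$ from the training data and plugs them into this expression.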

The Bayes classifier used for prediction returns the class with the highest conditional probability given the observation's features.

To quantify how well the model fits, we use the **training error rate**: the number of misclassified training observations divided by the total number of training observations.
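A minimal sketch of the training error rate in R, assuming the built-in `iris` data as a stand-in dataset (it is not part of these notes) and `MASS::lda` as the classifier:

```r
library(MASS)

# Fit LDA on iris (assumed example data, not from the notes)
fit <- lda(Species ~ ., data = iris)

# Predicted classes for the same observations used to fit the model
pred <- predict(fit, iris)$class

# Training error rate: misclassified observations / total observations
train_error <- mean(pred != iris$Species)
train_error
```

Because the same observations are used for fitting and for evaluation, this value tends to be optimistic compared with the error on new data.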

The Bayes classifier attains the lowest possible test error rate, which is called the **Bayes error rate**:

$$1 - E_X\left(\max_j Pr(Y = j \mid X)\right)$$

This value is greater than 0 whenever the classes overlap, because some observations then have a most likely class with probability below 1. It plays the same role as the irreducible error.

We can obtain a **discriminant score**, derived by plugging the Gaussian density into Bayes' theorem, taking logarithms, and dropping the terms that do not depend on the class. LDA is called **linear** because, if we assume a Gaussian distribution with a common variance in every class, the discriminant score is a linear function of x.

From the discriminant score function, we can obtain the Bayes decision boundary, which minimizes misclassification. For p = 1 with two classes and equal prior probabilities, the decision boundary is $x = \frac{\mu_1 + \mu_2}{2}$.
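The discriminant score and the boundary above can be worked out explicitly for p = 1. This is a sketch under assumed notation: $\pi_k$ is the prior of class $k$, $\mu_k$ its mean, and $\sigma^2$ the variance shared by all classes. Taking the logarithm of $\pi_k f_k(x)$ for a Gaussian $f_k$ and dropping the terms that are the same for every class gives

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k$$

which is linear in $x$. With two classes and $\pi_1 = \pi_2$, setting $\delta_1(x) = \delta_2(x)$ gives

$$x \cdot \frac{\mu_1 - \mu_2}{\sigma^2} = \frac{\mu_1^2 - \mu_2^2}{2\sigma^2} \quad\Longrightarrow\quad x = \frac{\mu_1 + \mu_2}{2}$$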

When p > 1, the discriminant function is still linear if the observations in each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix common to all classes.
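For reference, the multivariate discriminant score has the following form. The notation is an assumption: $\mu_k$ is the mean vector of class $k$, $\Sigma$ the common covariance matrix, and $\pi_k$ the prior of class $k$.

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

The only term involving $x$ is $x^T \Sigma^{-1} \mu_k$, so the score is linear in $x$.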

From the discriminant score, we can obtain an estimated probability for each class given X. We will use this probability in order to choose the class which maximizes it.

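We can vary the classification threshold: lowering the posterior probability required to assign an observation to the positive class (below the default of 0.5) reduces false negatives, at the cost of more false positives.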

* Sensitivity is the complement of FN rate, also known as **true positive rate**.

* Specificity is the complement of FP rate, also known as **true negative rate**.
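The effect of the threshold on sensitivity and specificity can be sketched in R. The data below are simulated purely for illustration, and the names `d`, `post_pos`, and `rates` are hypothetical helpers, not part of the notes:

```r
library(MASS)

set.seed(1)  # simulated two-class data, purely illustrative
y <- factor(rep(c("neg", "pos"), times = c(150, 50)))
x <- c(rnorm(150, mean = 0), rnorm(50, mean = 1.5))
d <- data.frame(x = x, y = y)

fit <- lda(y ~ x, data = d)

# Posterior probability of the "pos" class for each observation
post_pos <- predict(fit, d)$posterior[, "pos"]

# Sensitivity and specificity at a given posterior threshold for "pos"
rates <- function(threshold) {
  pred_pos <- post_pos >= threshold
  actual_pos <- d$y == "pos"
  c(sensitivity = sum(pred_pos & actual_pos) / sum(actual_pos),
    specificity = sum(!pred_pos & !actual_pos) / sum(!actual_pos))
}

rates(0.5)  # default threshold
rates(0.2)  # lower threshold: fewer false negatives, more false positives
```

Lowering the threshold can only keep or increase the sensitivity and keep or decrease the specificity, which is the trade-off described above.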

If each class has its own covariance matrix $\Sigma_k$, we get **quadratic discriminant analysis** (QDA), whose discriminant score is a quadratic function of x rather than a linear one.
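A sketch of the QDA discriminant score, using assumed notation ($\mu_k$ for the class mean vector, $\pi_k$ for the class prior):

$$\delta_k(x) = -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log |\Sigma_k| + \log \pi_k$$

Because $\Sigma_k$ differs between classes, the term $x^T \Sigma_k^{-1} x$ no longer cancels when two classes are compared, which is where the quadratic boundary comes from.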

If we assume that, within each class, the predictors are independent of one another, we get the **naive Bayes** approach.
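Under this assumption the class density factorizes into one-dimensional densities. In the sketch below, $f_{kj}$ is assumed notation for the density of the $j$-th predictor within class $k$:

$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$

Only $p$ one-dimensional densities per class need to be estimated, instead of one $p$-dimensional density.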

We should choose among these discriminant analysis variants (LDA, QDA, naive Bayes) using contextual knowledge about how the data are distributed.

For two classes, LDA has the same form as the logistic regression model: the log odds of class 1 against class 2 are a linear function of x. The two methods differ in how the coefficients are estimated.
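A short sketch of why the two forms coincide, writing $p_1(x)$ for $Pr(Y = 1 \mid X = x)$ and $\delta_k$ for the LDA discriminant score (both are assumed notation). The terms dropped from the discriminant score are common to both classes, so they cancel in the ratio:

$$\log \frac{p_1(x)}{1 - p_1(x)} = \delta_1(x) - \delta_2(x) = c_0 + c_1 x_1 + \cdots + c_p x_p$$

Here $c_0, \ldots, c_p$ are functions of the class means, the common covariance, and the priors, whereas logistic regression estimates the same kind of coefficients directly by maximum likelihood.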

**Use LDA when the Gaussian assumption is well supported and there are more than two classes.**

Logistic regression uses **discriminative learning**, which maximizes the conditional likelihood based on Pr(Y|X), while LDA uses **generative learning**, since it maximizes the full likelihood based on Pr(Y, X).

Naive Bayes is popular when p is very large.

# Laboratory: LDA

In [1]:
library(dplyr)  # optional
library(languageR)
library(tibble)  # optional
library(MASS)


In [2]:
# Word types: the distinct words that occur in the text
# Word tokens: the individual occurrences of words (each type can occur many times)
head(spanishMeta)

Author,YearOfBirth,TextName,PubDate,Nwords,FullName
C,1916,X14458gll,1983,2972,Cela
C,1916,X14459gll,1951,3040,Cela
C,1916,X14460gll,1956,3066,Cela
C,1916,X14461gll,1948,3044,Cela
C,1916,X14462gll,1942,3053,Cela
M,1943,X14463gll,1986,3013,Mendoza


In [3]:
# Extracting column names
names(spanishMeta)  # or colnames(spanishMeta)

In [12]:
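# Calculating the number of texts per author and the mean publication date
# (n_texts and mean_pub_date are column names chosen here for illustration)
# The result has one row per author code: C, M, V
spanishMeta %>%
  group_by(Author) %>%
  summarize(n_texts = n(), mean_pub_date = mean(PubDate))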


In [13]:
?summarise