# Classification

Classification models aim to provide an automated decision, such as "Is this e-mail spam?"
Most classification models return a probability or propensity of belonging to a class of interest.
A cutoff can then be used to convert the propensity score to a decision, following this general approach:

1. Establish a cutoff probability for the class of interest above which we consider a record as belonging to that class.
2. Estimate (with any model) the probability that a record belongs to the class of interest.
3. If that probability is above the cutoff probability, assign the new record to the class of interest.

The higher the cutoff, the fewer records are predicted as belonging to that class.
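As a minimal sketch of the cutoff rule, the snippet below converts a list of propensity scores into decisions at two different cutoffs. The function name `classify` and the scores are made up for illustration; any model could have produced them.

```python
def classify(probabilities, cutoff=0.5):
    """Return 1 where the estimated probability is above the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]


# Hypothetical propensity scores from some fitted model.
scores = [0.10, 0.35, 0.55, 0.80, 0.95]

low = classify(scores, cutoff=0.3)
high = classify(scores, cutoff=0.7)

print(low, sum(low))    # [0, 1, 1, 1, 1] 4
print(high, sum(high))  # [0, 0, 0, 1, 1] 2
```

Raising the cutoff from 0.3 to 0.7 halves the number of records assigned to the class of interest in this toy example.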

## Naive Bayes

The naive Bayes algorithm uses the probability of observing predictor values,
given an outcome, to estimate the probability of observing outcome $Y = i$, given
a set of predictor values.

- **Conditional probability**: The probability of observing some event (say $X = i$) given some other event (say $Y = i$), written as $P(X_i | Y_i)$.
- **Posterior probability**: The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account).

To understand Bayesian classification, we can start out by imagining "non-naive"
Bayesian classification. For each record to be classified:

1. Find all the other records with the same predictor profile (i.e., where the predictor values are the same).
2. Determine what classes those records belong to and which class is most prevalent (i.e., probable).
3. Assign that class to the new record.

The preceding approach amounts to finding all the records in the sample that are
exactly like the new record to be classified in the sense that all the predictor
values are identical.
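The exact-matching procedure can be sketched in a few lines. The function name `exact_bayes_classify` and the toy records are assumptions made for illustration only; the function returns `None` when no training record has the same predictor profile.

```python
from collections import Counter


def exact_bayes_classify(records, labels, new_record):
    """Return the most prevalent class among records identical to new_record.

    Returns None when no training record matches exactly.
    """
    matches = [y for x, y in zip(records, labels) if x == new_record]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]


# Hypothetical training data: (gender, income) -> voted
records = [("F", "high"), ("F", "high"), ("F", "high"), ("M", "low"), ("M", "high")]
labels = ["yes", "yes", "no", "no", "yes"]

print(exact_bayes_classify(records, labels, ("F", "high")))    # yes
print(exact_bayes_classify(records, labels, ("M", "medium")))  # None
```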

### Why Exact Bayesian Classification Is Impractical

When the number of predictor variables exceeds a handful, many of the records
to be classified will have no exact match in the training data. Consider, for
example, a male Hispanic with high income from the US Midwest who voted in the last
election, did not vote in the prior election, has three daughters and one son, and
is divorced: even a large sample may contain no record with this exact profile.

_Despite its name, naive Bayes is not considered a method of Bayesian statistics. Naive Bayes
is a data-driven, empirical method requiring relatively little statistical expertise. The name
comes from the Bayes rule–like calculation in forming the predictions: specifically, the initial
calculation of predictor value probabilities given an outcome, and then the final calculation of
outcome probabilities._

### The Naive Solution

In the naive Bayes solution, we no longer restrict the probability calculation to
those records that match the record to be classified. Instead, we use the entire
data set. The naive Bayes modification is as follows:

1. For a binary response $Y = i$ ($i = 0$ or $1$), estimate the individual conditional probabilities for each predictor $P(X_j|Y=i)$; these are the probabilities that the predictor value is in the record when we observe $Y = i$. This probability is estimated by the proportion of $X_j$ values among the $Y = i$ records in the training set.
2. Multiply these probabilities by each other, and then by the proportion of records belonging to $Y = i$.
3. Repeat steps 1 and 2 for all the classes.
4. Estimate a probability for outcome $i$ by taking the value calculated in step 2 for class $i$ and dividing it by the sum of such values for all classes.
5. Assign the record to the class with the highest probability for this set of predictor values.
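The five steps above can be sketched directly with counts and proportions. This is an illustrative implementation under simple assumptions (categorical predictors stored as tuples, no smoothing); the function name `naive_bayes_posteriors` and the data are hypothetical.

```python
from collections import Counter, defaultdict


def naive_bayes_posteriors(records, labels, new_record):
    """Steps 1-4: per-class products of conditional proportions, then normalize."""
    n = len(labels)
    class_counts = Counter(labels)

    # counts[(c, j)][v] = number of class-c records whose predictor j equals v
    counts = defaultdict(Counter)
    for x, y in zip(records, labels):
        for j, value in enumerate(x):
            counts[(y, j)][value] += 1

    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # proportion of records belonging to class c
        for j, value in enumerate(new_record):
            score *= counts[(c, j)][value] / n_c  # estimate of P(X_j = value | Y = c)
        scores[c] = score

    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class has zero estimated probability for this record")
    return {c: s / total for c, s in scores.items()}


# Hypothetical training data: (gender, income) -> voted
records = [("F", "high"), ("F", "high"), ("F", "high"), ("M", "low"),
           ("M", "low"), ("M", "low"), ("F", "low"), ("F", "high")]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

# ("M", "high") never appears in the training data, yet a prediction is possible.
posteriors = naive_bayes_posteriors(records, labels, ("M", "high"))
predicted = max(posteriors, key=posteriors.get)  # step 5
print({c: round(p, 3) for c, p in posteriors.items()}, predicted)
```

Here the per-class products are $0.5 \cdot \frac{1}{4} \cdot \frac{3}{4}$ for "yes" and $0.5 \cdot \frac{2}{4} \cdot \frac{1}{4}$ for "no", which normalize to 0.6 and 0.4.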

This naive Bayes algorithm can also be stated as an equation for the probability
of observing outcome $Y = i$, given a set of predictor values $X_1, ...,X_p$:

\begin{equation}
P(Y=i | X_1, X_2, ..., X_p) = \frac{P(Y=i)P(X_1|Y=i)P(X_2|Y=i)...P(X_p|Y=i)}{P(X_1, X_2, ..., X_p)}
\end{equation}

The denominator $P(X_1, X_2, ..., X_p)$ is a scaling factor that ensures the probability
is between 0 and 1; it does not depend on $Y$:

\begin{equation}
P(X_1, X_2, ..., X_p) = P(Y=0)(P(X_1|Y=0)P(X_2|Y=0)...P(X_p|Y=0)) + P(Y=1)(P(X_1|Y=1)P(X_2|Y=1)...P(X_p|Y=1))
\end{equation}

**Why is this formula called “naive”? We have made a simplifying assumption
that the exact conditional probability of a vector of predictor values, given
observing an outcome, is sufficiently well estimated by the product of the
individual conditional probabilities $P(X_j|Y=i)$. In other words, in
estimating $P(X_j|Y=i)$ instead of $P(X_1, X_2, ..., X_p | Y=i)$, we are assuming that,
given the outcome, $X_j$ is independent of all the other predictor variables $X_k$ for $k \neq j$.**

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

On the flip side, although naive Bayes is known as a decent classifier, it is known to be a poor probability estimator, so its probability outputs (for example, from scikit-learn's `predict_proba`) should not be taken too seriously.

#### Note
_When a predictor category is absent in the training data, the algorithm assigns zero probability
to the outcome variable in new data, rather than simply ignoring this variable and using the
information from other variables, as other methods might. This is something to pay attention to
when binning continuous variables._
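A small numeric sketch shows why a single unseen category is so damaging. The conditional probabilities below are invented for illustration; the point is only that one zero factor forces the whole product, and therefore the posterior, to zero.

```python
from math import prod

# Hypothetical estimates of P(X_j | Y = i) for three predictors and two classes.
# The second predictor's category was never observed among the Y = 1 training records.
cond_probs = {0: [0.20, 0.10, 0.30], 1: [0.90, 0.0, 0.80]}
priors = {0: 0.5, 1: 0.5}

scores = {c: priors[c] * prod(cond_probs[c]) for c in priors}
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

print(posteriors)  # class 1 gets probability 0.0 despite two strongly favorable predictors
```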

### Numeric Predictor Variables

From the definition, we see that the Bayesian classifier works only with
categorical predictors (e.g., with spam classification, where presence or absence
of words, phrases, characters, and so on, lies at the heart of the predictive task).
To apply naive Bayes to numerical predictors, one of two approaches must be
taken:
- Bin and convert the numerical predictors to categorical predictors and apply the algorithm of the previous section.
- Use a probability model (for example, the normal distribution) to estimate the conditional probability $P(X_j|Y=i)$.
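The second approach can be sketched for a single numeric predictor by fitting a normal distribution to each class and using its density in place of the conditional proportion. This is an illustrative sketch, not a reference implementation; the function names and the loan data are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def gaussian_nb_posteriors(values, labels, new_value):
    """Posterior class probabilities for one numeric predictor."""
    scores = {}
    for c in set(labels):
        xs = [v for v, y in zip(values, labels) if y == c]
        prior = len(xs) / len(labels)
        # The per-class normal density stands in for P(X | Y = c).
        scores[c] = prior * normal_pdf(new_value, mean(xs), stdev(xs))
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical loan amounts (in thousands) and outcomes.
amounts = [5.0, 6.0, 7.0, 8.0, 15.0, 16.0, 18.0, 19.0]
outcome = ["paid", "paid", "paid", "paid",
           "default", "default", "default", "default"]

post = gaussian_nb_posteriors(amounts, outcome, 7.5)
print(max(post, key=post.get))  # paid
```

With several numeric predictors, the same density would be computed per predictor and the results multiplied together, exactly as in the categorical case.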

### Key Takeaways

- Naive Bayes works with categorical (factor) predictors and outcomes.
- It asks, "Within each outcome category, which predictor categories are most probable?"
- That information is then inverted to estimate probabilities of outcome categories, given predictor values.

## Discriminant Analysis

Discriminant analysis is the earliest statistical classifier; it was introduced by
R. A. Fisher in 1936.

- **Covariance**: A measure of the extent to which one variable varies in concert with another (i.e., similar magnitude and direction).
- **Discriminant function**: The function that, when applied to the predictor variables, maximizes the separation of the classes.
- **Discriminant weights**: The scores that result from the application of the discriminant function, and are used to estimate probabilities of belonging to one class or another.

While discriminant analysis encompasses several techniques, the most commonly used is linear discriminant analysis, or LDA. LDA is now less widely used with the advent of more
sophisticated techniques, such as tree models and logistic regression.

However, you may still encounter LDA in some applications and it has links to
other more widely used methods (such as principal components analysis). In addition, discriminant analysis can
provide a measure of predictor importance, and it is used as a computationally
efficient method of feature selection.

#### Note
Linear discriminant analysis should not be confused with Latent Dirichlet Allocation, also
referred to as LDA. Latent Dirichlet Allocation is used in text and natural language processing
and is unrelated to linear discriminant analysis.

### Covariance Matrix

Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (that is, the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables.

To understand discriminant analysis, it is first necessary to introduce the concept
of covariance between two or more variables. The covariance measures the
relationship between two variables $x$ and $z$. Denote the mean for each variable
by $\overline{x}$ and $\overline{z}$. The covariance $S_{x,z}$ between $x$ and $z$ is given by:
\begin{equation}
S_{x,z} = \frac{\sum_{i=1}^{n}(x_i-\overline{x})(z_i-\overline{z})}{n-1}
\end{equation}
where $n$ is the number of records.

As with the correlation coefficient, positive values indicate a
positive relationship and negative values indicate a negative relationship.
Correlation, however, is constrained to be between –1 and 1, whereas covariance
is on the same scale as the variables $x$ and $z$. The covariance matrix $\Sigma$ for $x$
and $z$ consists of the individual variable variances, $S^2_x$ and $S^2_z$, on the diagonal
(where row and column are the same variable) and the covariances between
variable pairs on the off-diagonals.

\begin{equation}
\hat{\Sigma} = \begin{bmatrix}
S^2_x & S_{x,z} \\
S_{x,z} & S^2_z \\
\end{bmatrix}
\end{equation}
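The covariance formula and the resulting 2x2 covariance matrix can be computed with plain Python as a quick check. The helper names and the data are illustrative assumptions; in practice a numerical library would normally be used.

```python
def covariance(x, z):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    if n != len(z) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / (n - 1)


def covariance_matrix(x, z):
    """Variances on the diagonal, covariances on the off-diagonals."""
    return [[covariance(x, x), covariance(x, z)],
            [covariance(z, x), covariance(z, z)]]


# Hypothetical data in which z decreases as x increases.
x = [1.0, 2.0, 3.0, 4.0]
z = [8.0, 6.0, 4.0, 2.0]

sigma = covariance_matrix(x, z)
for row in sigma:
    print([round(v, 3) for v in row])
# [1.667, -3.333]
# [-3.333, 6.667]
```

The negative off-diagonal entries reflect the opposite behavior of the two variables, and the matrix is symmetric because $S_{x,z} = S_{z,x}$.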

### Fisher’s Linear Discriminant

For simplicity, we focus on a classification problem in which we want to predict
a binary outcome $y$ using just two continuous numeric variables $(x,z)$.
Technically, discriminant analysis assumes the predictor variables are normally
distributed continuous variables, but, in practice, the method works well even for
nonextreme departures from normality, and for binary predictors. Fisher’s linear
discriminant distinguishes variation between groups, on the one hand, from
variation within groups on the other. Specifically, seeking to divide the records
into two groups, LDA focuses on maximizing the "between" sum of squares $SS_{between}$
(measuring the variation between the two groups) relative to the
"within" sum of squares $SS_{within}$ (measuring the within-group variation). In this
case, the two groups correspond to the records $(x_0, z_0)$ for which $y = 0$ and the
records $(x_1, z_1)$ for which $y = 1$. The method finds the linear combination $w_{x}x+w_{z}z$
that maximizes the following sum of squares ratio:

\begin{equation}
\frac{SS_{between}}{SS_{within}}
\end{equation}

The between sum of squares is the squared distance between the two group
means, and the within sum of squares is the spread around the means within each
group, weighted by the covariance matrix. Intuitively, by maximizing the
between sum of squares and minimizing the within sum of squares, this method
yields the greatest separation between the two groups.
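For two groups, a standard result is that the ratio is maximized by weights proportional to the inverse of the within-group scatter matrix times the difference in group means. The sketch below applies that closed form to two predictors in plain Python; the function names and data points are hypothetical, and the weights are only determined up to a scale factor.

```python
def mean_vec(points):
    """Mean of a list of (x, z) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Within-group sums of squares and cross-products around mean m."""
    sxx = sum((x - m[0]) ** 2 for x, _ in points)
    szz = sum((z - m[1]) ** 2 for _, z in points)
    sxz = sum((x - m[0]) * (z - m[1]) for x, z in points)
    return sxx, sxz, szz


def fisher_weights(group0, group1):
    """Weights (w_x, w_z) proportional to S_within^{-1} (mean1 - mean0)."""
    m0, m1 = mean_vec(group0), mean_vec(group1)
    a0, b0, d0 = scatter(group0, m0)
    a1, b1, d1 = scatter(group1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1  # pooled within matrix [[a, b], [b, d]]
    det = a * d - b * b
    if det == 0:
        raise ValueError("within-group scatter matrix is singular")
    dx, dz = m1[0] - m0[0], m1[1] - m0[1]
    # Closed-form inverse of a symmetric 2x2 matrix applied to (dx, dz).
    return ((d * dx - b * dz) / det, (-b * dx + a * dz) / det)


# Hypothetical records (x, z) for y = 0 and y = 1.
group0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
group1 = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]

w_x, w_z = fisher_weights(group0, group1)
scores0 = [w_x * x + w_z * z for x, z in group0]
scores1 = [w_x * x + w_z * z for x, z in group1]
print((w_x, w_z), scores0, scores1)
```

In this toy example every projected score for the $y = 1$ group lies above every score for the $y = 0$ group, so a single threshold on $w_{x}x+w_{z}z$ separates the two classes.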

### Key Takeaways

- Discriminant analysis works with continuous or categorical predictors, as well as categorical outcomes.
- Using the covariance matrix, it calculates a linear discriminant function, which is used to distinguish records belonging to one class from those belonging to another.
- This function is applied to the records to derive weights, or scores, for each record (one weight for each possible class) that determine its estimated class.