# Classification

There are two main types of classification:

* **Supervised classification**, where all the classes are known. In this case, the data are represented by a matrix with the categorical response variable $y$ in the first column.

* **Cluster analysis**, where the classes are not known in advance.

We denote the set of possible classes by $G$.

If the response variable takes only two values, it is called a **binary variable**. If it takes more than two values, it is called a **multinomial variable**.

Most common methods of classification:

Logistic regression, KNN, Linear Discriminant Analysis (LDA), SVM, Decision Trees, Random Forests, GAMs $\implies$ **supervised**

Cluster analysis $\implies$ **unsupervised**

The confusion matrix cross-tabulates the true classes against the predicted classes, so it shows both the correctly classified and the misclassified examples in your sample; the classification error can be computed from it. We can compute the confusion matrix on both the training and the test sets.

The global classification error is the sum, over all classes $i$, of the classification error of class $i$ times the probability that an individual belongs to class $i$. When it is computed on the training data, it is called the **empirical** or **training** classification error, and it typically underestimates the real classification error.
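As an illustration, the confusion matrix and the classification error can be computed in base R with `table()`. The labels below are made up for the example and do not come from any data set used in these notes.

```r
# Hypothetical true and predicted labels (illustrative only)
truth <- factor(c("good", "good", "good", "bad", "bad", "good", "bad", "good"),
                levels = c("good", "bad"))
pred  <- factor(c("good", "bad",  "good", "bad", "good", "good", "bad", "good"),
                levels = c("good", "bad"))

# Confusion matrix: rows are true classes, columns are predicted classes
conf_mat <- table(truth = truth, predicted = pred)
conf_mat

# Per-class error: share of each true class that was misclassified
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

# Class proportions in the sample (estimate of the class probabilities)
class_prob <- rowSums(conf_mat) / sum(conf_mat)

# Global error as the weighted sum of the per-class errors
global_error <- sum(class_error * class_prob)

# Same quantity computed directly as the share of misclassified cases
direct_error <- mean(truth != pred)
```

Both `global_error` and `direct_error` give the same number, which is why the global error is often read directly off the confusion matrix as the off-diagonal share.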

# Logistic Regression

We can model a binary classification problem as a linear problem in which the goal is to predict the value of $y$ (which has to be converted to a factor beforehand) by minimizing the prediction error of a linear model. We can then build the following classification rule:

$y_i = 1$ if $\hat{y}_i \geq 0.5$, $y_i = 0$ if $\hat{y}_i < 0.5$

If $Y \sim \text{Bern}(p)$, then $E(Y) = 1 \cdot p + 0 \cdot (1-p) = p$. Thus $E(Y) = P(Y = 1)$.

We are building a model that returns $\Pr(Y = 1 \mid X)$, the probability that a data point belongs to the default class given its features. We refer to it as $p(X)$; it takes values between 0 and 1. Linear regression does not estimate this probability well, because some of its estimated probabilities can fall below 0 or above 1, so we use **Logistic Regression** instead.
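The problem with linear regression can be seen on a small simulation. The sketch below uses made-up data (the slope `2` and the grid of `x` values are assumptions chosen for the example) and compares the fitted values of `lm` and `glm`.

```r
set.seed(1)

# Simulated data (illustrative only): binary response driven by one predictor
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Linear regression on the 0/1 response
lin_fit  <- lm(y ~ x)
lin_pred <- fitted(lin_fit)

# Logistic regression on the same data
log_fit  <- glm(y ~ x, family = binomial)
log_pred <- fitted(log_fit)

range(lin_pred)   # extends below 0 and above 1
range(log_pred)   # stays strictly inside (0, 1)
```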

If the response variable is multinomial, we can use **Multiclass Logistic Regression**.

The logistic function is

$$p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}$$

We can apply a monotonic transformation, called the **log odds** or **logit function**, to obtain the linear equation representing our logistic model (the log odds take values in $(-\infty, \infty)$).

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X$$

It is called log odds because it is simply the logarithm of the odds, represented by

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}$$

which are simply the probability of the event $Y = 1$ happening divided by the probability of it not happening.

* a one-unit increase of X changes the log odds by $\beta_1$

* a one-unit increase of X multiplies the odds by $e^{\beta_1}$

* a one-unit increase of X changes $p(X)$ in the direction given by the sign of $\beta_1$, by an amount that depends on the current value of X (the relationship is not linear; it is “S”-shaped)
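These three statements can be checked numerically. The coefficients below are hypothetical values chosen only for illustration; `plogis` is base R's logistic function and gives the same result as `p` here.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1
beta1 <- 0.5

# Logistic function p(X); identical to plogis(beta0 + beta1 * x) in base R
p    <- function(x) exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
odds <- function(x) p(x) / (1 - p(x))

x <- c(0, 1, 2, 6)

# Log odds are linear in x: each one-unit step adds beta1
log(odds(x + 1)) - log(odds(x))

# Odds are multiplied by exp(beta1) for each one-unit step
odds(x + 1) / odds(x)

# The change in p(X) itself depends on where x sits on the S-shaped curve
p(x + 1) - p(x)
```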



# Generalized Linear Models (GLM)

A family of models that also contains logistic regression.

For the models in this family, we choose the appropriate distribution for the response variable and then build a model that is linear in the $\beta$ parameters by applying a transformation (called the **link function**) to the expected value of $Y$.

For logistic regression the link function is the **logit**, which takes the logarithm of the odds.
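In R, the link function is stored in the family object passed to `glm`. A short sketch, using arbitrary example probabilities, shows the logit link and its inverse:

```r
fam <- binomial()            # binomial family; its default link is the logit
fam$link

p   <- c(0.1, 0.5, 0.9)      # example probabilities (arbitrary values)
eta <- fam$linkfun(p)        # link: log(p / (1 - p))
fam$linkinv(eta)             # inverse link: maps the linear scale back to p
```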

To estimate the parameters $\beta_0, \beta_1$ we use the **Maximum Likelihood Criterion**: the likelihood gives the probability of the observed data under the model with those parameter values, and we choose the values that maximize it.
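To make the criterion concrete, the following sketch maximizes the Bernoulli log-likelihood numerically with `optim` and compares the result with `glm`. The data are simulated, and the "true" coefficients `-0.5` and `1.2` are assumptions made for the example.

```r
set.seed(42)

# Simulated data; the "true" coefficients are assumptions made for the example
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative Bernoulli log-likelihood as a function of beta = (beta0, beta1)
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

# Numerical maximisation of the likelihood (optim minimises, hence the sign)
fit_optim <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# glm() maximises the same likelihood
fit_glm <- glm(y ~ x, family = binomial)

# The two sets of estimates should agree up to numerical tolerance
rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
```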

In R, we use the `glm` function to perform this estimation.

To use the model for prediction, we plug the estimated parameters into the logistic function and evaluate it at the chosen values of X, obtaining the probability that the observation belongs to the default class.

The common rule is to classify an observation as default when $p(X) > 0.5$. This threshold can be raised or lowered to trade false positives against false negatives.
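A minimal sketch of this workflow on simulated data follows. The data frame `sim`, the coefficients used to generate it, and the alternative threshold `0.3` are all assumptions made for illustration.

```r
set.seed(1)

# Simulated data (illustrative only): one predictor, binary response
n   <- 400
x   <- rnorm(n)
y   <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x))
sim <- data.frame(x = x, y = y)

# Fit the logistic regression by maximum likelihood
fit <- glm(y ~ x, data = sim, family = binomial)

# Predicted probabilities Pr(Y = 1 | X) for chosen values of x
new_x <- data.frame(x = c(-2, 0, 2))
p_new <- predict(fit, newdata = new_x, type = "response")
p_new

# Classify the training data with two different thresholds
p_hat    <- predict(fit, type = "response")
classify <- function(threshold) as.integer(p_hat > threshold)

table(truth = sim$y, predicted = classify(0.5))
table(truth = sim$y, predicted = classify(0.3))
```

Lowering the threshold can only move observations from predicted 0 to predicted 1, so it reduces false negatives at the price of more false positives.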


# Multiple Logistic Regression

It is simply an extension of simple logistic regression to several predictors:

$$p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}$$

One of the tasks of multiple regression is to separate the effects of several variables on the outcome. This avoids the **confounding** effect caused by variables left out of the model. When several variables are available, we should include them from the beginning of the analysis so that their effects become visible.
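Confounding can be illustrated with a simulation. In the sketch below (made-up data and coefficients), the variable `z` drives both the predictor `x1` and the response, while `x1` has no direct effect on `y`.

```r
set.seed(123)

# Simulated data (illustrative only): z drives both x1 and the response,
# while x1 has no direct effect on y
n  <- 2000
z  <- rnorm(n)
x1 <- z + rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(2 * z))

# Model that omits z: x1 picks up the effect of the missing variable
fit_simple <- glm(y ~ x1, family = binomial)

# Model that includes z: the coefficient of x1 shrinks towards zero
fit_full <- glm(y ~ x1 + z, family = binomial)

coef(fit_simple)["x1"]
coef(fit_full)["x1"]
```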


# Multiclass Logistic Regression

It is simply an extension of the binary logistic regression

$$\Pr(Y = k \mid X) = \frac{e^{\beta_{0k} + \beta_{1k}X_1 + \dots + \beta_{pk}X_p}}{\sum_{l = 1}^K e^{\beta_{0l} + \beta_{1l}X_1 + \dots + \beta_{pl}X_p}}$$
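The formula can be evaluated directly. The sketch below uses a hypothetical coefficient matrix `B` for $K = 3$ classes and $p = 2$ predictors; the first class is taken as the reference with all coefficients set to zero, a common convention for making the parameters identifiable.

```r
# Hypothetical coefficients for K = 3 classes and p = 2 predictors:
# one row per class, columns are (beta_0k, beta_1k, beta_2k)
B <- rbind(c( 0.0, 0.0,  0.0),
           c( 0.5, 1.0, -0.5),
           c(-1.0, 0.3,  0.8))

# Class probabilities for one observation x = (x1, x2)
class_probs <- function(x, B) {
  eta <- as.vector(B %*% c(1, x))   # linear predictor of each class
  exp(eta) / sum(exp(eta))          # normalise so the K probabilities sum to 1
}

probs <- class_probs(c(1.2, -0.4), B)
probs
sum(probs)         # the probabilities sum to 1
which.max(probs)   # index of the most probable class
```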

# Laboratory: Logistic Regression

In [28]:
library(dplyr)
library(gdata)
library(bestglm) # like leaps, used to select the best subset of variables
library(Fahrmeir)

In [29]:
data <- credit

In [30]:
head(data)

Y,Cuenta,Mes,Ppag,Uso,DM,Sexo,Estc
buen,no,18,pre buen pagador,privado,1049,mujer,vive solo
buen,no,9,pre buen pagador,profesional,2799,hombre,no vive solo
buen,bad running,12,pre buen pagador,profesional,841,mujer,vive solo
buen,no,12,pre buen pagador,profesional,2122,hombre,no vive solo
buen,no,12,pre buen pagador,profesional,2171,hombre,no vive solo
buen,no,10,pre buen pagador,profesional,2241,hombre,no vive solo


In [31]:
summary(data)

    Y                Cuenta         Mes                     Ppag    
 buen:700   no          :274   Min.   : 4.0   pre buen pagador:911  
 mal :300   good running:394   1st Qu.:12.0   pre mal pagador : 89  
            bad running :332   Median :18.0                         
                               Mean   :20.9                         
                               3rd Qu.:24.0                         
                               Max.   :72.0                         
          Uso            DM            Sexo               Estc    
 privado    :657   Min.   :  250   mujer :402   no vive solo:640  
 profesional:343   1st Qu.: 1366   hombre:598   vive solo   :360  
                   Median : 2320                                  
                   Mean   : 3271                                  
                   3rd Qu.: 3972                                  
                   Max.   :18424                                  

In [32]:
# Set variable names to lowercase
names(data) <- names(data) %>% tolower()
# Show the levels of each factor variable
data %>% select_if(is.factor) %>% sapply(levels)

In [33]:
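# List the levels of every column (non-factor columns have none);
# a formula lambda replaces the deprecated funs()
data %>% summarise_all(~ list(levels(.)))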

y,cuenta,mes,ppag,uso,dm,sexo,estc
"buen, mal","no , good running, bad running",,"pre buen pagador, pre mal pagador","privado , profesional",,"mujer , hombre","no vive solo, vive solo"


In [34]:
data %>% select_if(is.factor) %>% sapply(mapLevels)

$y
buen  mal 
   1    2 

$cuenta
          no good running  bad running 
           1            2            3 

$ppag
pre buen pagador  pre mal pagador 
               1                2 

$uso
    privado profesional 
          1           2 

$sexo
 mujer hombre 
     1      2 

$estc
no vive solo    vive solo 
           1            2 


In [37]:
# Translate variable names and factor levels to English
colnames(data) <- c("solvency", "account","loan_duration", "prev_pay_behavior", "loan_use", "credit", "sex", "civ_state")
levels(data$solvency) <- c("good", "bad")
levels(data$account) <- c("no", "good", "bad")
levels(data$prev_pay_behavior) <- c("good", "bad")
levels(data$loan_use) <- c("private", "professional")
levels(data$sex) <- c("woman", "man")
# Original order is "no vive solo" (does not live alone), "vive solo" (lives alone)
levels(data$civ_state) <- c("not alone", "alone")

In [38]:
head(data)

solvency,account,loan_duration,prev_pay_behavior,loan_use,credit,sex,civ_state
good,no,18,good,private,1049,woman,alone
good,no,9,good,professional,2799,man,not alone
good,bad,12,good,professional,841,woman,alone
good,no,12,good,professional,2122,man,not alone
good,no,12,good,professional,2171,man,not alone
good,no,10,good,professional,2241,man,not alone
