<div >
<img src = "../banner.jpg" />
</div>

# Classification (Cont.)

To work through the steps of probability-based classification, we’ll use a real dataset on loans and credit from a set of local lenders in Germany (taken from the UC Irvine Machine Learning Repository and cleaned for our purposes). 

Credit scoring is a classic classification problem and remains one of the major application domains for ML: the outcomes of past loans (default versus repayment) are used to train a model that predicts the performance of potential new loans.

\begin{align}
Default=f(x) + u
\end{align}

where $Default$ is an indicator variable, equal to 1 if the loan defaulted and 0 otherwise.



In [None]:
# Load libraries
require("pacman")
p_load(tidyverse)
set.seed(1011)

In [None]:
# Read the data
credit <- readRDS(url("https://github.com/ignaciomsarmiento/datasets/blob/main/credit_class.rds?raw=true"))
head(credit)

In [None]:
default<-credit$Default  # keep the numeric 0/1 outcome now; it is used later to compute error rates

# Recode the categorical variables as labeled factors
credit<-credit %>% mutate(Default=factor(Default,levels=c(0,1),labels=c("No","Si")),
                          history=factor(history,levels=c("good","poor","terrible"),labels=c("buena","mala","terrible")),
                          foreign=factor(foreign,levels=c("foreign","german"),labels=c("extranjero","aleman")),
                          purpose=factor(purpose,levels=c("newcar","usedcar","goods/repair","edu", "biz" ),labels=c("auto_nuevo","auto_usado","bienes","educacion","negocios")))         

In [None]:
mylogit <- glm(Default~., data = credit, family = "binomial")

In [None]:
pred<-predict(mylogit,newdata = credit, type = "response")
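With `type = "response"`, `predict` returns fitted probabilities, that is, the inverse logit of the linear predictor. A minimal sketch on simulated data (the objects `x_sim`, `y_sim` and `fit_sim` are hypothetical and unrelated to the credit data) that checks this relationship:

```r
# Simulated data: one predictor and a binary outcome (illustrative only)
set.seed(3)
x_sim <- rnorm(100)
y_sim <- rbinom(100, 1, plogis(-0.5 + x_sim))

fit_sim <- glm(y_sim ~ x_sim, family = "binomial")

eta   <- predict(fit_sim, type = "link")      # linear predictor x'beta
p_hat <- predict(fit_sim, type = "response")  # fitted probabilities

# The probabilities are the logistic transform of the linear predictor
max(abs(p_hat - plogis(eta)))
```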


In [None]:
## What are our misclassification rates?
rule <- 1/2 # Bayes rule: classify as default when the predicted probability exceeds 1/2

sum( (pred>rule)[default==0] )/sum(default==0) #False positive rate

In [None]:
sum( (pred>rule)[default==1] )/sum(default==1) #True positive rate
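The two rates above can be verified on a small toy example. The sketch below uses made-up outcomes and probabilities (`y_toy` and `prob_toy` are hypothetical, not drawn from the credit data) and the same 1/2 cutoff:

```r
# Made-up observed outcomes and predicted probabilities (illustrative only)
y_toy    <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob_toy <- c(0.1, 0.6, 0.3, 0.2, 0.8, 0.4, 0.9, 0.7, 0.55, 0.05)
cutoff   <- 0.5

# Predicted class: 1 when the probability exceeds the cutoff
yhat_toy <- as.numeric(prob_toy > cutoff)

# False positive rate: share of true 0s predicted as 1
fpr <- sum(yhat_toy[y_toy == 0]) / sum(y_toy == 0)
# True positive rate: share of true 1s predicted as 1
tpr <- sum(yhat_toy[y_toy == 1]) / sum(y_toy == 1)

table(predicted = yhat_toy, observed = y_toy)  # confusion matrix
c(FPR = fpr, TPR = tpr)
```

In this toy sample two of the six non-defaults are flagged (FPR = 1/3) and three of the four defaults are caught (TPR = 3/4). Lowering the cutoff would raise both rates.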

## Aside: Dummy Vars

In [None]:
p_load("caret")
dmy <- dummyVars(" ~ .", data = credit) # One-hot-encoding

credit <- data.frame(predict(dmy, newdata = credit))
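To see what one-hot encoding does to a factor, here is a small base R analogue using `model.matrix` on a hypothetical toy data frame. It is a sketch of the idea, not the `dummyVars` call used above, and its column names differ (`historybuena` here versus `history.buena` from `dummyVars`):

```r
# Toy factor with three levels (illustrative only)
toy <- data.frame(history = factor(c("buena", "mala", "terrible", "mala")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
onehot <- model.matrix(~ history - 1, data = toy)
onehot
```

Each row has exactly one 1, in the column of the level it belongs to.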

## Out of sample prediction

In [None]:
credit<- credit  %>% mutate(Default=factor(Default.Si,levels=c(0,1),labels=c("No","Si")))

In [None]:
inTrain <- createDataPartition(
  y = credit$Default.Si,## the dependent (target) variable 
  p = .7, ## use 70% of the data for the training set 
  list = FALSE)


train <- credit[ inTrain,]
test  <- credit[-inTrain,]

In [None]:
head(train)

### Logit

In [None]:
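ctrl<- trainControl(method = "cv",           # k-fold cross-validation
                    number = 5,              # number of folds
                    classProbs = TRUE,       # compute class probabilities
                    verboseIter = FALSE,     # do not print a training log
                    savePredictions = TRUE)  # keep the held-out predictions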
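The 5-fold cross-validation requested above can be sketched by hand. The example below runs on simulated data (`cv_data` and `folds` are hypothetical names) and assigns folds by simple random shuffling, which is an assumption of this sketch and may differ from how caret builds its folds:

```r
# Simulated data with one predictor and a binary outcome (illustrative only)
set.seed(4)
n <- 200
cv_data   <- data.frame(x = rnorm(n))
cv_data$y <- rbinom(n, 1, plogis(cv_data$x))

# Randomly assign each observation to one of 5 folds of equal size
folds <- sample(rep(1:5, length.out = n))

# For each fold: fit on the other four folds, evaluate on the held-out fold
acc <- sapply(1:5, function(k) {
  fit <- glm(y ~ x, data = cv_data[folds != k, ], family = "binomial")
  p   <- predict(fit, newdata = cv_data[folds == k, ], type = "response")
  mean((p > 0.5) == (cv_data$y[folds == k] == 1))  # held-out accuracy
})

mean(acc)  # cross-validated accuracy
```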


In [None]:
set.seed(1410)
mylogit_caret <- train(Default~duration+amount+installment+age+
                       history.buena+history.mala+
                       purpose.auto_nuevo+purpose.auto_usado+purpose.bienes+purpose.educacion+
                       foreign.extranjero+
                       rent.TRUE, 
                       data = train, 
                       method = "glm",
                       trControl = ctrl,
                       family = "binomial")


mylogit_caret

In [None]:
predictTest_logit <- data.frame(
  obs = test$Default,                                    ## observed class labels
  predict(mylogit_caret, newdata = test, type = "prob"),         ## predicted class probabilities
  pred = predict(mylogit_caret, newdata = test, type = "raw")    ## predicted class labels
)


In [None]:
head(predictTest_logit)

In [None]:
twoClassSummary(data = predictTest_logit, lev = levels(predictTest_logit$obs))
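`twoClassSummary` reports the area under the ROC curve together with sensitivity and specificity, and it treats the first factor level (here `No`) as the event of interest. A minimal base R sketch of these three quantities on made-up values follows. The vectors `obs_toy` and `p_no_toy` are hypothetical, and the AUC uses the rank (Mann-Whitney) formula, which is one common way to compute it:

```r
# Made-up observed classes and predicted probabilities of "No" (illustrative only)
obs_toy  <- factor(c("No", "No", "Si", "No", "Si", "Si"), levels = c("No", "Si"))
p_no_toy <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.65)

# Predicted class with a 1/2 cutoff on P(No)
pred_toy <- factor(ifelse(p_no_toy > 0.5, "No", "Si"), levels = c("No", "Si"))

# Sensitivity: share of true "No" predicted "No" (the first level is the event)
sens <- mean(pred_toy[obs_toy == "No"] == "No")
# Specificity: share of true "Si" predicted "Si"
spec <- mean(pred_toy[obs_toy == "Si"] == "Si")

# AUC via ranks: probability that a random "No" gets a higher P(No) than a random "Si"
r     <- rank(p_no_toy)
n_pos <- sum(obs_toy == "No")
n_neg <- sum(obs_toy == "Si")
auc   <- (sum(r[obs_toy == "No"]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(ROC = auc, Sens = sens, Spec = spec)
```

Because `No` is the first level, the sensitivity here measures how well non-defaults are identified. To read it as the detection rate for defaults, the factor levels would need to be reordered.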

## KNN

In [None]:
set.seed(1410)
mylogit_knn <- train(Default~duration+amount+installment+age+
                       history.buena+history.mala+
                       purpose.auto_nuevo+purpose.auto_usado+purpose.bienes+purpose.educacion+
                       foreign.extranjero+
                       rent.TRUE, 
                       data = train, 
                       method = "knn",
                       trControl = ctrl,
                       tuneGrid = expand.grid(k=c(3,5,7,9,11)))


mylogit_knn
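The idea behind KNN is to classify a new observation by majority vote among its $k$ closest training points. A minimal sketch in base R, assuming Euclidean distance and a toy dataset (`knn_predict`, `X_knn` and `y_knn` are hypothetical names, and this is not the implementation caret calls):

```r
# Classify one new point by majority vote among its k nearest neighbours
knn_predict <- function(X_train, y_train, x_new, k = 3) {
  d     <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))  # Euclidean distances to x_new
  nn    <- order(d)[seq_len(k)]                       # indices of the k closest points
  votes <- table(y_train[nn])                         # class counts among neighbours
  names(votes)[which.max(votes)]                      # most frequent class
}

# Toy training set: two well-separated clusters (illustrative only)
X_knn <- matrix(c(1, 1,  1, 2,  2, 1,
                  6, 6,  7, 6,  6, 7), ncol = 2, byrow = TRUE)
y_knn <- c("No", "No", "No", "Si", "Si", "Si")

knn_predict(X_knn, y_knn, c(1.5, 1.5), k = 3)
knn_predict(X_knn, y_knn, c(6.5, 6.5), k = 3)
```

Since the vote depends on distances, predictors measured on large scales (such as `amount`) can dominate those on small scales, which is why predictors are often standardized before running KNN.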

## LDA


\begin{align}
p (Y=1|X)=\frac{f(X|Y=1)p(Y=1)}{m(X)}
\end{align}


where $m(X)$ is the marginal distribution of $X$, i.e.

\begin{align}
m(X)=\int f(X|Y=y)p(Y=y)dy
\end{align}

Recall that $Y$ is discrete with only two states of nature, $y\in\{0,1\}$, so the integral reduces to a sum:


\begin{align}
m(X) &= f(X|Y=1)p(Y=1) + f(X|Y=0)p(Y=0) 
\end{align}


\begin{align}
m(X)     &= f(X|Y=1)p(Y=1) + f(X|Y=0)(1-p(Y=1))
\end{align}

We need to estimate $f(X|Y=1)$,  $f(X|Y=0)$ and $p(Y=1)$ 


#### By Hand


- Let's start by estimating $p(Y=1)$. We've done this before

    \begin{align}
    p(Y=1) = \frac{\sum_{i=1}^N 1[Y_i=1]}{N}
    \end{align}


In [None]:
p1<-sum(train$Default.Si)/dim(train)[1]
p1


- Next, estimate $f(X|Y=j)$ for $j=0,1$. 

    - If we assume a single predictor with $X|Y=j\sim N(\mu_j,\sigma_j)$, the problem boils down to estimating $\mu_j$ and $\sigma_j$.

    - LDA simplifies this further by assuming a common standard deviation across classes: $\sigma_j=\sigma$ $\forall j$.

To do this, split the sample into two groups, $Y=0$ and $Y=1$, estimate the moments, and use them to obtain $\hat{f}(X|Y=j)$.

**Means**

\begin{align}
\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i
\end{align}

In [None]:
#Means
mu1<-mean(train$duration[train$Default.Si==1])
mu1

In [None]:
mu0<-mean(train$duration[train$Default.Si==0])
mu0

**Variance**

\begin{align}
\hat{\sigma}^2 = \frac{1}{N-K} \sum_{k=1}^K \sum_{i:y_i=k} (x_i -\hat{\mu}_k)^2
\end{align}
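
With the two classes used here ($K=2$, $k\in\{0,1\}$), the pooled estimator can be written out term by term, so that $N-2$ plays the role of $N-K$ and each sum runs over one class:

\begin{align}
\hat{\sigma}^2 = \frac{1}{N-2}\left[\sum_{i:y_i=0}(x_i-\hat{\mu}_0)^2+\sum_{i:y_i=1}(x_i-\hat{\mu}_1)^2\right]
\end{align}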

In [None]:
# Pooled standard deviation: within-class sums of squares divided by N-K, with K=2
g1<-sum((train$duration[train$Default.Si==1]-mu1)^2)
g0<-sum((train$duration[train$Default.Si==0]-mu0)^2)


sigma<-sqrt((g1+g0)/(dim(train)[1]-2))
sigma

With the moments in hand, we can now evaluate $f(X|Y=j)$, $j=0,1$, on the test set. 

In [None]:
f1<-dnorm(test$duration,mean=mu1,sd=sigma)
f0<-dnorm(test$duration,mean=mu0,sd=sigma)

- Finally, plug everything into Bayes' rule and we are done:
\begin{align}
p (Y=1|X)=\frac{f(X|Y=1)p(Y=1)}{f(X|Y=1)p(Y=1) + f(X|Y=0)(1-p(Y=1))}
\end{align}


In [None]:
post_hand<-f1*p1/(f1*p1+f0*(1-p1))
head(post_hand)

In [None]:
p_load("MASS")     # LDA
lda_simple <- lda(Default.Si~duration, data = train)
lda_simple_pred<-predict(lda_simple,test)
names(lda_simple_pred)


In [None]:
posteriors<-data.frame(lda_simple_pred$posterior)
posteriors$hand<-post_hand

head(posteriors)
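The same comparison can be reproduced without downloading the credit data. The sketch below simulates one predictor (the data frame `sim` and its parameters are hypothetical), computes the posterior by hand with the pooled standard deviation, and compares it with the posterior from `MASS::lda`, which by default uses the class proportions as priors:

```r
library(MASS)

# Simulated data: binary outcome and one normally distributed predictor (illustrative only)
set.seed(1)
n   <- 200
y   <- rbinom(n, 1, 0.3)
x   <- rnorm(n, mean = ifelse(y == 1, 25, 19), sd = 10)
sim <- data.frame(y = y, x = x)

# By hand: prior, class means, pooled standard deviation (divisor N - 2)
p1  <- mean(sim$y)
mu1 <- mean(sim$x[sim$y == 1])
mu0 <- mean(sim$x[sim$y == 0])
s   <- sqrt((sum((sim$x[sim$y == 1] - mu1)^2) +
             sum((sim$x[sim$y == 0] - mu0)^2)) / (n - 2))

# Class-conditional densities and Bayes' rule
f1 <- dnorm(sim$x, mean = mu1, sd = s)
f0 <- dnorm(sim$x, mean = mu0, sd = s)
post_hand <- f1 * p1 / (f1 * p1 + f0 * (1 - p1))

# Posterior probability of class "1" from MASS::lda
post_lda <- predict(lda(y ~ x, data = sim), sim)$posterior[, "1"]

max(abs(post_hand - post_lda))  # largest discrepancy between the two
```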

### Caret

In [None]:
lda_fit <- train(Default~duration+amount+installment+age, 
                data=train, 
                method="lda",
                trControl = ctrl)

lda_fit

## Naive Bayes

In [None]:
p_load("klaR")
set.seed(1410)
mylogit_nb <- train(Default~duration+amount+installment+age+
                       history.buena+history.mala+
                       purpose.auto_nuevo+purpose.auto_usado+purpose.bienes+purpose.educacion+
                       foreign.extranjero+
                       rent.TRUE, 
                       data = train, 
                       method = "nb",
                       trControl = ctrl)


mylogit_nb
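Naive Bayes uses the same Bayes' rule as LDA but assumes the predictors are independent within each class, so the class-conditional density is a product of one-dimensional densities. A minimal sketch with two simulated predictors and normal marginals, each with its own class-specific mean and standard deviation (all object names are hypothetical, and this is an illustration of the idea rather than the `klaR` implementation):

```r
# Simulated data: binary outcome and two predictors (illustrative only)
set.seed(2)
n  <- 300
y  <- rbinom(n, 1, 0.3)
x1 <- rnorm(n, mean = ifelse(y == 1, 25, 19), sd = 10)
x2 <- rnorm(n, mean = ifelse(y == 1, 4, 3),  sd = 2)

# Class-conditional density under independence: product of normal marginals
nb_density <- function(x1_new, x2_new, cls) {
  dnorm(x1_new, mean(x1[y == cls]), sd(x1[y == cls])) *
    dnorm(x2_new, mean(x2[y == cls]), sd(x2[y == cls]))
}

# Prior and posterior via Bayes' rule
p1      <- mean(y)
num     <- nb_density(x1, x2, 1) * p1
post_nb <- num / (num + nb_density(x1, x2, 0) * (1 - p1))

head(post_nb)
```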