# STA 220 Data & Web Technologies for Data Analysis

### Lecture 15, 2/27/24, Visualizations for Classification

### Last week's topics
- Classification: 
    - LDA
    - Naive Bayes

### Today's topics
- Visualization for classification

### References

* Jakob Raymaekers & Peter J. Rousseeuw (2022): Silhouettes and Quasi Residual Plots for Neural Nets and Tree-based Classifiers, Journal of Computational and Graphical Statistics, 31 (4), 1332-1343
* Jakob Raymaekers, Peter J. Rousseeuw & Mia Hubert (2022): Class maps for visualizing classification results, Technometrics, 64 (2)

For convenience, today's code is in __R__. The data set `data_floralbuds` contains six features and a label with four levels: `bud`, `branch`, `scales` and `support`. 

In [None]:
library("classmap")

In [None]:
head(data_floralbuds, 3)

In [None]:
summary(data_floralbuds)

We are interested in classifying the observations. 

In [None]:
require("MASS")

In [None]:
fit <- lda(y~., data=data_floralbuds) #linear discriminant analysis
yhat <- predict(fit)

In [None]:
head(yhat$posterior, 3)

Either explicitly or implicitly, most classifiers provide posterior probabilities (cf. the linear discriminant analysis fit above). 
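
To make the link between posterior probabilities and predicted labels concrete, here is a minimal sketch on a made-up posterior matrix. The object names `post` and `pred_toy` and all numbers are illustrative assumptions, not output of `lda`. The predicted class is taken to be the column with the largest posterior probability in each row.

```r
# Toy posterior matrix: 3 objects (rows), 4 classes (columns); values are made up
post <- matrix(c(0.70, 0.10, 0.15, 0.05,
                 0.20, 0.25, 0.50, 0.05,
                 0.05, 0.05, 0.10, 0.80),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("branch", "bud", "scales", "support")))

# Each row of posterior probabilities sums to one
stopifnot(all(abs(rowSums(post) - 1) < 1e-12))

# Predicted label = class with the highest posterior probability in each row
pred_toy <- colnames(post)[max.col(post, ties.method = "first")]
pred_toy
```

Applying the same rule to `yhat$posterior` should, as far as the maximum-posterior rule is what `predict` uses, reproduce the `class` component shown next.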

In [None]:
head(yhat$class)

In [None]:
caret::confusionMatrix(yhat$class, data_floralbuds$y)$table

In [None]:
require(viridis)
vcrout <- vcr.da.train(data_floralbuds[,-7], data_floralbuds$y) # discriminant analysis in classmap; the default rule is "QDA", pass rule = "LDA" to mirror lda()

In [None]:
yhat <- factor(vcrout$pred, levels = levels(data_floralbuds$y)) # same level order as the reference labels

In [None]:
caret::confusionMatrix(yhat, data_floralbuds$y)$table

In [None]:
options(repr.plot.width=12, repr.plot.height=6)

In [None]:
stackedplot(vcrout, classCols=viridis::viridis(4), showOutliers = FALSE)

Suppose we have objects denoted by their index $i$ where $i = 1, \dots, n$, and there are classes
(labels, groups) $g$ with $g = 1, \dots, G$. The target is thus a discrete variable with $G$ levels.
Consider a case $i$ in the training set or a test set. 

Denote the posterior probabilities $\hat{p}(i, g)$ of object $i$ belonging to each of the classes $g$, with
$\sum_{g}\hat{p}(i, g) = 1$ for each $i$. 

Now assume that object $i$ has a known given label $g_i$. We wish to measure to what extent
the given label $g_i$ agrees with the classification. For this purpose we denote the highest
$\hat{p}(i, g)$ attained by a class different from $g_i$ as
$$\tilde{p}(i) = \max_g\{\hat{p}(i, g);\ g \neq g_i\}$$
The class attaining this maximum can be seen as the best alternative class. 
If $\hat{p}(i, g_i) > \tilde{p}(i)$ it follows that $g_i$ attains the overall highest value of $\hat{p}(i, g)$, 
so the classifier agrees with the
given class $g_i$. 
On the other hand, if $\hat{p}(i, g_i) < \tilde{p}(i)$ the classifier will not assign object $i$ to
class $g_i$.

We now compute the conditional posterior *probability of the best alternative class* when
comparing it with the given class $g_i$ as
$$
PAC(i) 
= 
\frac{\tilde{p}(i)}{\hat{p}(i, g_i) + \tilde{p}(i)}
$$
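
The definitions of $\tilde{p}(i)$ and $PAC(i)$ can be traced by hand on a small example. The sketch below uses a made-up posterior matrix and made-up given labels (all object names are illustrative assumptions). It only follows the two formulas above and is not how `classmap` computes them internally.

```r
# Toy posteriors (made-up numbers) and given labels for 3 objects
post <- matrix(c(0.70, 0.10, 0.15, 0.05,
                 0.20, 0.25, 0.50, 0.05,
                 0.40, 0.40, 0.10, 0.10),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("branch", "bud", "scales", "support")))
given <- c("branch", "bud", "branch")

n  <- nrow(post)
gi <- match(given, colnames(post))        # column index of the given class g_i
p_given <- post[cbind(seq_len(n), gi)]    # \hat{p}(i, g_i)

# Mask the given class, then take the row maximum: \tilde{p}(i)
post_alt <- post
post_alt[cbind(seq_len(n), gi)] <- -Inf
p_alt    <- apply(post_alt, 1, max)
best_alt <- colnames(post)[apply(post_alt, 1, which.max)]

# PAC(i) = \tilde{p}(i) / (\hat{p}(i, g_i) + \tilde{p}(i))
PAC_toy <- p_alt / (p_given + p_alt)
data.frame(given, best_alt, p_given, p_alt, PAC = round(PAC_toy, 3))
```

The first object agrees with its given label (PAC below 0.5), the second does not (PAC above 0.5), and the third is a tie between the given class and the best alternative (PAC equal to 0.5).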

We will produce a silhouette plot to visualize the classification. For each $i$, the silhouette width is defined as 
$$
s(i) = 1 - 2PAC(i).
$$
$s(i)$ ranges from $-1$ to $1$, with high values
reflecting that the given class of case $i$ fits very well, and negative
values indicating that the given class fits less well than the best
alternative class.
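
As a small worked example of the silhouette width, the sketch below converts made-up PAC values into $s(i)$ and averages them per given class. The values and object names are illustrative assumptions, and the per-class mean is just one natural summary.

```r
# Made-up PAC values and given labels for six objects (illustrative only)
PAC_toy <- c(0.05, 0.20, 0.50, 0.90, 0.10, 0.65)
given   <- factor(c("bud", "bud", "bud", "scales", "scales", "scales"))

# Silhouette width s(i) = 1 - 2 * PAC(i):
# PAC = 0 gives s = 1, PAC = 0.5 gives s = 0, PAC = 1 gives s = -1
s_toy <- 1 - 2 * PAC_toy
s_toy

# s(i) < 0 means the best alternative class has the higher posterior probability
misclassified <- s_toy < 0

# Average silhouette width per given class
avg_s <- tapply(s_toy, given, mean)
avg_s
```

In this toy example the class `bud` has a positive average width, while `scales` has a slightly negative one because two of its three objects fit the alternative class better.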

In [None]:
silplot(vcrout, classCols=viridis::viridis(4))

Another graphical display is obtained by plotting the PAC versus
a relevant data variable. This is not unlike plotting the absolute
residuals in regression, since small values of $PAC(i)$ indicate
that the model fits the data point nearly perfectly, whereas a
high $PAC(i)$ alerts us to a poorly fitted data point.

In [None]:
label = 'bud' # bud, branch, scales, support
PAC <- vcrout$PAC[vcrout$y==label] 
feat <- data_floralbuds[vcrout$y==label,3] # feature does not have to be part of the classification
qresplot(PAC, feat, plotErrorBars = TRUE)

The data feature on the x-axis does not have to be part of the classification
model, and it could also be a quantity derived from the data
features such as a principal component score or a prediction,
or just the index i of the data point if the data were recorded
sequentially.
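
The idea behind a quasi residual plot can be sketched in base R: group the feature into intervals and plot the average PAC per interval with error bars. The data below are simulated stand-ins, and the binning into five equal-count intervals is an assumption made for illustration. It is not a description of how `qresplot` chooses its intervals.

```r
set.seed(1)
# Simulated stand-ins (not the floral buds data): a feature and PAC values in [0, 1]
feat_toy <- runif(200, 0, 10)
PAC_toy  <- plogis(-2 + 0.4 * feat_toy + rnorm(200, sd = 0.5))

# Group the feature into 5 intervals with roughly equal counts
breaks <- quantile(feat_toy, probs = seq(0, 1, length.out = 6))
bin    <- cut(feat_toy, breaks = breaks, include.lowest = TRUE)

# Mean PAC, its standard error, and the mean feature value in each interval
m   <- tapply(PAC_toy, bin, mean)
se  <- tapply(PAC_toy, bin, function(v) sd(v) / sqrt(length(v)))
mid <- tapply(feat_toy, bin, mean)

plot(mid, m, type = "b", ylim = c(0, 1),
     xlab = "feature", ylab = "average PAC")
arrows(mid, m - se, mid, m + se, angle = 90, code = 3, length = 0.05)
abline(h = 0.5, lty = 2)  # above this line the best alternative class is preferred
```

An upward trend in such a plot suggests that the classifier fits worse for larger values of the feature.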

Class maps are quasi residual plots versus a feature reflecting how far
each case is from its class. This is based on some distance
measure $D(i, g)$ of a case $i$ relative to a class $g$. 

Next we estimate the cumulative distribution function of
$D(x, g)$ where $x$ is a random object generated from class $g$. The
farness of the object $i$ to the class $g$ is then defined as
$$
farness(i, g) = P[D(x, g) \leq D(i, g)].
$$
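
A minimal sketch of the farness definition, assuming the distribution of $D(x, g)$ is estimated by the empirical cdf of simulated distances. Both the simulated distances and the use of `ecdf` are simplifying assumptions. The estimate used inside `classmap` may differ.

```r
set.seed(2)
# Simulated distances D(x, g) of 100 objects of class g to their own class
D_class <- rchisq(100, df = 3)

# Empirical cdf as a simple stand-in for the estimated distribution of D(x, g)
F_hat <- ecdf(D_class)

# Farness of three hypothetical objects with distances 0.5, 3 and 20 to class g
D_new <- c(0.5, 3, 20)
farness_toy <- F_hat(D_new)
farness_toy
```

Farness is a probability, so it lies between 0 and 1 and increases with the distance: an object that lies farther from class $g$ than nearly all members of that class gets a farness close to 1.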

In [None]:
classmap(vcrout, 'bud', classCols=viridis::viridis(4)) # bud, branch, scales, support

Now, consider another data set. 

In [None]:
head(data_titanic, 3)

In [None]:
data_titanic <- na.omit(data_titanic)

In [None]:
help(data_titanic)

In [None]:
traindata <- data_titanic[which(data_titanic$dataType == "train"), -13]
str(traindata); table(traindata$y)
set.seed(123) # rpart is not deterministic

First, we will consider a tree-based classification. 

In [None]:
rpart.out <- rpart::rpart(y ~ Pclass + Sex + SibSp + Parch + Fare + Embarked, 
                   data = data_titanic, method = 'class', model = TRUE)

In [None]:
rpart.plot::rpart.plot(rpart.out)

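Be careful when reading the tree: the cells below check the node values against the raw proportions, and some leaves contain only a few observations. 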

In [None]:
mean(data_titanic$y=='survived')

In [None]:
mean((data_titanic$y=='survived')[data_titanic['Sex']=='male'])

In [None]:
sum(data_titanic['Sex']!='male' & data_titanic['Pclass']>=3 & data_titanic['Fare']>=23) #few obs

In [None]:
mytype <- list(nominal = c("Name", "Sex", "Ticket", "Cabin", "Embarked"), ordratio = c("Pclass"))
vcrtrain <- vcr.rpart.train(data_titanic[, -12], data_titanic$y, rpart.out, mytype)

In [None]:
confmat.vcr(vcrtrain)

In [None]:
stackedplot(vcrtrain, classCols=c(2,4))

In [None]:
silplot(vcrtrain, classCols = c(2, 4))

In [None]:
classmap(vcrtrain, "casualty", classCols = c(2, 4))

In [None]:
classmap(vcrtrain, "survived", classCols = c(2, 4))

Compare these visualizations to a logistic regression. 

In [None]:
str(vcrtrain)

Update `pred`, `predint` and `PAC` for the logistic regression. 

In [None]:
vcrtrain2 <- vcrtrain
fit <- glm(y~Pclass+Sex+Age+SibSp+Parch,family=binomial(link = logit),data=data_titanic)
pred <- fitted(fit) 

In [None]:
head(pred)

In [None]:
vcrtrain2$pred <- ifelse(pred<0.5, "casualty", "survived")
head(vcrtrain2$pred)

In [None]:
head(data_titanic$y)

In [None]:
# manually add to vcrtrain2 the predictions ... 
vcrtrain2$predint <- ifelse(pred<0.5, 1, 2)
head(vcrtrain2$predint)

In [None]:
# ... and the PAC: with two classes, PAC(i) is the fitted probability of the class other than the given one
vcrtrain2$PAC <- ifelse(data_titanic$y=='casualty', pred, 1-pred) 
head(vcrtrain2$PAC)
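
With only two classes the best alternative is always the other class, so $\tilde{p}(i) = 1 - \hat{p}(i, g_i)$ and $PAC(i)$ reduces to $\tilde{p}(i)$. The sketch below checks this on made-up fitted survival probabilities (the numbers and object names are illustrative assumptions).

```r
# Made-up fitted survival probabilities and given labels (illustrative only)
p_surv <- c(0.10, 0.80, 0.60, 0.30)
given  <- c("casualty", "casualty", "survived", "survived")

p_given <- ifelse(given == "survived", p_surv, 1 - p_surv)  # \hat{p}(i, g_i)
p_alt   <- 1 - p_given                                      # only one alternative class

PAC_general <- p_alt / (p_given + p_alt)                    # definition of PAC(i)
PAC_short   <- ifelse(given == "casualty", p_surv, 1 - p_surv)  # shortcut used in the cell above

all.equal(PAC_general, PAC_short)
```

The two versions agree because the denominator $\hat{p}(i, g_i) + \tilde{p}(i)$ equals 1 in the two-class case.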

In [None]:
confmat.vcr(vcrtrain)

In [None]:
caret::confusionMatrix(data_titanic$y[!is.na(data_titanic$Age)], factor(vcrtrain2$pred), 
                       dnn = c("Reference", "Prediction"))$table # same as confmat.vcr(vcrtrain2)

In [None]:
gridExtra::grid.arrange(
    stackedplot(vcrtrain, classCols=c(2,4)),
    stackedplot(vcrtrain2, classCols=c(2,4)), ncol = 2)

In [None]:
gridExtra::grid.arrange(gridExtra::arrangeGrob(
    silplot(vcrtrain, classCols = c(2, 4)), 
    silplot(vcrtrain2, classCols = c(2, 4)), ncol=2))

In [None]:
par(mfrow = c(1,2))
classmap(vcrtrain, "survived", classCols = c(2, 4)) # survived #casualty
classmap(vcrtrain2, "survived", classCols = c(2, 4))

Let's take a closer look at the `casualty` observation with the largest farness. 

In [None]:
# str(vcrtrain2)

In [None]:
cas <- vcrtrain2$y=='casualty' #& vcrtrain2$farness==0
idx <- which.max(vcrtrain2$farness[cas]); idx

In [None]:
vcrtrain2$X[cas,][idx,]

In [None]:
vcrtrain2$y[cas][idx]

Her fate is statistically unlikely, but well-known and documented ([wiki](https://en.wikipedia.org/wiki/Ida_Straus)). 

>  _We have lived together for many years. Where you go, I go._


In [None]:
index = vcrtrain$X$Sex=='male'
index2 = data_titanic$Sex=='male' 
par(mfrow = c(1,2))
qresplot(vcrtrain$PAC[index], vcrtrain$X$Age[index], plotErrorBars = TRUE)
qresplot(vcrtrain2$PAC[index2], data_titanic$Age[index2], plotErrorBars = TRUE)

In [None]:
require("pROC")

In [None]:
roc1 <- roc(fit$y, pred)
auc(roc1)

In [None]:
head(probs <- predict(rpart.out, type = "prob")[, "survived"]) # probability of "survived", the same event as in the logistic fit

In [None]:
roc0 <- roc(data_titanic$y,probs)
auc(roc0)

In [None]:
plot(roc0)
plot(roc1, col = 2, add = TRUE)

### Summary

The proposed visualizations focus on the cases in a classification. The new silhouette plot describes the strength of each object’s classification, grouped by class. Quasi residual plots yield other insights, such as trends in subsets of the data like the effect of age for male passengers on the Titanic. The class map provides additional information, as it can tell us which cases lie between classes, which cases are far from their given class, and which cases may be far from all classes. The class map allowed us to distinguish between feature noise and label noise. The displays also drew our attention to atypical cases that were inspected in more detail, providing further insights into the data. 