Copyright 2024 Luiz Barboza, Dale Bowman, Natasha A. Sahr, Vasile Rus, Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Educational Material: Classifiers

## The goal of this educational material

Educators implementing practical activities in an introductory class on classifiers will engage students in understanding the fundamental concepts of classification. Logistic regression, a key focus of the course, will be explored as a regression technique primarily used for binary classification, predicting the probability of an observation belonging to a positive class. The logistic regression's mathematical foundation, transitioning from linear regression through the sigmoid function to handle categorical outcomes, will be explained to provide students with a comprehensive understanding.

The module will elaborate on the versatility of logistic regression, applicable to various classification tasks such as identifying diabetes, spam emails, or other binary outcomes. Educators will stress that logistic regression's coefficients can be interpreted probabilistically, indicating the impact of predictor variables on the likelihood of positive classification. The distinction between binary, multinomial, and ordinal logistic regression will be conveyed, guiding students on appropriate usage.

Logistic regression's performance evaluation will be covered, emphasizing metrics such as accuracy, precision, and recall. The trade-off between precision and recall will be explained, highlighting their significance in scenarios where balancing the cost of false positives and false negatives is crucial, as seen in medical diagnosis. The course will conclude with insights into adjusting classification thresholds and the impact on precision and recall, providing educators with practical knowledge for effective model performance assessment and application.

## The proposed practice

In this educational experiment, educators are guiding students through the process of using Logistic Regression as a classifier to predict whether bank customers will default based on historical data. The dataset contains information about customers' salary, balance, and a binary default column (0 for no default, 1 for default). The educators provide access to training and testing datasets through URLs, emphasizing the importance of data exploration and loading using Blockly.

Students learn to use the scikit-learn library in Python, specifically the LogisticRegression class, to create and train a model. The code demonstrates the fitting process using the `fit()` method with salary and balance as features and default as the label. The educators encourage students to understand that the result is an object representing the trained model.

To evaluate the model's performance, students calculate accuracy using the `score()` method on the training dataset, and subsequently on the testing dataset. The demonstration shows how to use the accuracy_score function from scikit-learn to compare predicted values with the actual default values in the testing dataset. In the provided example, both training and testing accuracies yield a perfect score of 100%, indicating the model's proficiency in predicting customer defaults based on salary and balance information.

This practical activity enables students to apply machine learning concepts, reinforcing the steps of model training, prediction, and evaluation. The educators highlight the importance of understanding accuracy metrics and visually inspecting the differences between predicted and expected values in a real-world business scenario. Overall, this experiment provides a hands-on approach to machine learning with a focus on logistic regression and model assessment.

# Logistic Regression (Classifier)

We have a file with historical customers of a bank, with information about their **salary** and **balance**, as well as whether each customer **defaulted** or not. This last column is the one we want the model to predict. It is categorical: 0 (zero) for customers who never defaulted and 1 (one) for those who defaulted at some point with the bank. The idea behind Logistic Regression is to use the training dataset so the model can "learn" how to predict the default classification from historical information on salary and balance.

Here are the paths for the published training and testing datasets: `https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/banking_tr.csv` and `https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/banking_ts.csv` (almost the same; the training one ends with tr and the testing one with ts).

Similar to what was done previously, we can load this data using Blockly. With the training dataset loaded, we can create an instance/object (logreg) of the class **LogisticRegression** from the **sklearn.linear_model** (skl) library. With this logreg object, we can train the model using the `fit()` method, passing the salary and balance columns as features and the default column as the label, as follows:

![logreg](https://pbs.twimg.com/media/GEE_-NaXEAAaA1n?format=jpg&name=medium)

That will generate the Python code as shown below:


In [None]:
import sklearn.metrics as skm
import sklearn.linear_model as skl
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/banking_tr.csv')
test = pd.read_csv('https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/banking_ts.csv')
logreg = skl.LogisticRegression()
print(logreg.fit(train.get(['salary', 'balance']),train.default))


LogisticRegression()


This code does not produce any results of its own. The `fit()` method simply returns the trained model object, logreg, which is what `print` displays. If you want to know how good the training was, you should calculate the ***accuracy***, using the **score()** method. Notice that the feature and label columns are exactly the same as those used in the previous training (fit) step.

![score](https://pbs.twimg.com/media/GEFANe_XwAAFf4i?format=jpg&name=medium)

Generating the code below:

In [None]:
print(logreg.score(train.get(['salary', 'balance']),train.default))

1.0


In our example, it shows a perfect accuracy, 100%. We can now use the model to predict whether each client in the testing dataset is likely to default, based on their salary and balance:

![predict](https://pbs.twimg.com/media/GEFA8yAWAAADKdu?format=jpg&name=medium)


As we can see in the code below:

In [None]:
print(logreg.predict(test.get(['salary', 'balance'])))

[0 0 0 0 0 0 1 1 0 0]


The expected value for the default column can be seen from the testing dataset. This allows you to visually inspect the difference between the predicted and expected values.

A better way to do that is to calculate the **accuracy**, this time for the testing dataset. To do that, we use the `accuracy_score` function from the **sklearn.metrics** (skm) library, as shown in the blocks below:

![score](https://pbs.twimg.com/media/GEFBzzxWcAA0yBD?format=jpg&name=large)

Which generates the following code:

In [None]:
print(skm.accuracy_score(logreg.predict(test.get(['salary', 'balance'])),test.default))

1.0


Again a perfect accuracy, 100%, for our toy business scenario here!
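
To make the metric concrete, here is a minimal sketch that recomputes accuracy by hand as the fraction of matching labels. The `predicted` list is copied from the `predict()` output shown earlier. The `expected` list is an assumption inferred from the reported 100% test accuracy (it is not read from `banking_ts.csv`), and the helper name `accuracy` is hypothetical.

```python
# Predicted labels copied from the predict() output shown earlier.
predicted = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
# Assumption: with 100% test accuracy, the expected labels equal the predictions.
expected = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


print(accuracy(expected, predicted))  # 1.0
```

If even one of the ten predictions differed from its expected value, the same function would report 0.9 instead.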

# APPENDIX

This appendix can be used as target content for students of this practice. If you think it is applicable, feel free to use it, provided you give proper credit to this original material.

## Intro to Classifiers

The classification problem, simply stated, is to assign a new observation to a class, where the classes have been specified in advance.
There are many different types of classifiers as we will see throughout this course.

The difference between classification and clustering is simple.
In clustering, we don't know the classes in advance, so we group observations together to try to discover the classes.
In classification, we know the classes in advance, so we train a model that can predict the class for any new observation.

Clustering is an _unsupervised_ method since the correct classes are unknown.
Classification is a _supervised_ method since the classes are known.

## Logistic Regression

So far, we have looked at two broad kinds of supervised learning, classification and regression.
Classification predicts a class label for an observation (i.e., a row of the dataframe) and regression predicts a numeric value for an observation.

Logistic regression is a kind of regression that is primarily used for classification, particularly binary classification.
It does this by predicting the **probability** (technically the log-odds) of the positive class, which is assigned the label `1`.
If the probability is above a threshold, e.g., .50, then this predicted numeric value is interpreted as a classification of `1`.
Otherwise, the predicted numeric value is interpreted as a classification of `0`.
So **logistic regression predicts a numeric probability that we convert into a classification.**

Logistic regression is widely used in data science classification tasks, for example to:

* categorize a person as having diabetes or not having diabetes
* categorize an incoming email as spam or not spam

Because logistic regression is also regression, it captures the relationship between an outcome/dependent variable and the predictor/independent variables in a similar way to linear regression.
The major difference is that the coefficients in logistic regression can be interpreted probabilistically, so that we can say how much more likely a predictor variable makes a positive classification.

The most common kind of logistic regression is binary logistic regression, but it is possible to have:

* Binary/binomial logistic regression
* Multiclass/Multinomial logistic regression
* Ordinal logistic regression (there is an order among the categories)

## When to use logistic regression

Logistic regression works best when you need a classifier and want to be able to interpret the predictor variables easily, as you can with linear regression.
Because logistic regression is fundamentally regression, it has the same assumptions of linearity and additivity, which may not be appropriate for some problems.
Binary logistic regression is widely used and scales well, but multinomial variants typically begin to have performance problems when the number of classes is large.

## Mathematical Foundations of Logistic Regression for Binary Classification

We briefly review in this section the mathematical formulation of logistic regression for binary classification problems.
That is, the predicted categories are just two (say, 1 or 0) and each object or instance belongs to one and only one category.

Logistic regression expresses the relationship between the output variable, also called dependent variable, and the predictors, also called independent variables or features, in a similar way to linear regression with an additional twist.
The additional twist is necessary in order to transform the typical continuous value of linear regression onto a categorical value (0 or 1).

**From Linear Regression to Logistic Regression**

Let us review first the basics of linear regression and then discuss how to transform the mathematical formulation of linear regression such that the outcome is categorical.

In a typical linear regression equation, the output variable $Y$ is related to $n$ predictor variables $X_j$ ($j=1,\dots,n$) using the following linear relation, where the output $Y$ is a linear combination of the predictors $X_j$ with corresponding weights (or coefficients) $\beta_{j}$:

$$Y = {\beta}_{0} + \sum \limits _{j=1} ^{n} X_{j}{\beta}_{j}$$

In linear regression, the output $Y$ has continuous values between $-\infty$ and $+\infty$. In order to map such output values to just 0 and 1, we need to apply the sigmoid or logistic function.

$$\sigma (t) = \frac{1}{1 + e^{-t}}$$

A graphical representation of the sigmoid or logistic function is shown below (from Wikipedia).
The important part is that the output values are in the interval $(0,1)$ which is close to our goal of just predicted values 1 or 0.

![log](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png?20140704193223)
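
As a quick numeric check of the curve, the sketch below evaluates the sigmoid at a few inputs. It uses only the Python standard library. The function name `sigmoid` and the sample inputs are illustrative choices, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), written to avoid overflow for negative t."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)          # algebraically identical form: e^t / (1 + e^t)
    return e / (1.0 + e)


# Large negative inputs approach 0, large positive inputs approach 1.
for t in (-6, -2, 0, 2, 6):
    print(t, round(sigmoid(t), 4))
```

The value at `t = 0` is exactly 0.5, which is why .50 is the natural default threshold between the two classes.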

When the sigmoid is applied to the linear combination $Y = {\beta}_{0} + \sum \limits _{j=1} ^{n} X_{j}{\beta}_{j}$ from linear regression, we get the following formulation for logistic regression:
$$\frac{1}{1 + e^{-({\beta}_{0} + \sum \limits _{j=1} ^{n} X_{j}{\beta}_{j})}}$$

The net effect is that the typical linear regression output values ranging from $-\infty$ to $+\infty$ are now bound to $(0,1)$, which is typical for probabilities. That is, the above formulation can be interpreted as estimating the probability of instance $X$ (described by all predictors $X_j$) belonging to class 1.

$$ P( Y=1 | X ) = \frac{1}{1 + e^{-({\beta}_{0} + \sum \limits _{j=1} ^{n} X_{j}{\beta}_{j})}}$$

The probability of class 0 is then:

$$ P( Y=0 | X ) = 1 - P( Y=1 | X ) $$

Values close to 0 are deemed to belong to class 0 and values close to 1 are deemed to belong to class 1, thus resulting in a categorical output which is what we intend in logistic regression.
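
The following minimal sketch puts these pieces together: it applies the sigmoid $\sigma(t) = \frac{1}{1 + e^{-t}}$ to a linear combination of predictors and then converts the probability into a class with a .50 threshold. The coefficients and inputs are hypothetical values chosen only for illustration (they are not fitted to any dataset), and `predict_proba` is a hand-written helper, not the scikit-learn method of the same name.

```python
import math


def predict_proba(x, beta0, betas):
    """P(Y=1 | X): sigmoid of the linear combination beta0 + sum_j x_j * beta_j."""
    z = beta0 + sum(xj * bj for xj, bj in zip(x, betas))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration (not fitted to any data).
beta0, betas = -3.0, [0.5, 1.0]

for x in ([2.0, 1.0], [4.0, 2.0]):
    p1 = predict_proba(x, beta0, betas)  # probability of class 1
    p0 = 1.0 - p1                        # probability of class 0
    label = 1 if p1 >= 0.5 else 0        # threshold converts probability to a class
    print(x, round(p1, 3), round(p0, 3), label)
```

The first instance has a linear combination of -1 and lands in class 0. The second has a linear combination of +1 and lands in class 1.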

## Interpreting the Coefficients in Logistic Regression

One of the best ways to interpret the coefficients in logistic regression is to transform it back into a linear regression whose coefficients are easier to interpret.
From the earlier formulation, we know that:

$$ P( Y=1 | X ) = \frac{1}{1 + e^{-({\beta}_{0} + \sum \limits _{j=1} ^{n} X_{j}{\beta}_{j})}}$$

Dividing this probability by $1 - P( Y=1 | X )$ and taking the logarithm, we get:

$$ \log \frac{P ( Y=1 | X )}{1- P( Y=1 | X )} = {\beta}_{0} + \sum \limits _{j=1} ^{n}  X_{j}{\beta}_{j} $$

On the left-hand side of the above expression we have the log odds, that is, the logarithm of the odds, where the odds are defined as the ratio of the probability of class 1 to the probability of class 0. Indeed, the expression $\frac{P ( Y=1 | X )}{1- P( Y=1 | X )}$ is the odds because $1- P( Y=1 | X )$ is the probability of class 0, i.e., $P( Y=0 | X )$.

Therefore, we conclude that the log odds are a linear regression of the predictor variables weighted by the coefficients $\beta_{j}$. Each such coefficient therefore indicates the change in the log odds when the corresponding predictor changes by one unit (in the case of numerical predictors).
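
For readers who want the intermediate algebra, the steps can be spelled out as follows. As shorthand (introduced here only for this derivation), write $z = {\beta}_{0} + \sum_{j} X_{j}{\beta}_{j}$ for the linear combination and $p = P( Y=1 | X )$ for the predicted probability.

$$ p = \frac{1}{1 + e^{-z}} \quad \Rightarrow \quad 1 - p = \frac{e^{-z}}{1 + e^{-z}} $$

$$ \frac{p}{1 - p} = \frac{1}{e^{-z}} = e^{z} $$

$$ \log \frac{p}{1 - p} = z = {\beta}_{0} + \sum_{j} X_{j}{\beta}_{j} $$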

You may feel more comfortable with probabilities than odds, but you have probably seen odds expressed frequently in the context of sports.
Here are some examples:

- 1 to 1 means 50% probability of winning
- 2 to 1 means 67% probability of winning
- 3 to 1 means 75% probability of winning
- 4 to 1 means 80% probability of winning

Odds are just the probability of success divided by the probability of failure.
For example 75% probability of winning means 25% probability of losing, and $.75/.25=3$, and we say the odds are 3 to 1.
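
The conversion between odds and probability can be sketched in a few lines of Python. The two helper names are hypothetical, and the loop reproduces the four "N to 1" examples listed above.

```python
def odds_from_probability(p):
    """Odds = probability of success divided by probability of failure."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Inverse conversion: odds of 'N to 1' correspond to probability N / (N + 1)."""
    return odds / (1.0 + odds)


for odds in (1, 2, 3, 4):
    print(f"{odds} to 1 -> {probability_from_odds(odds):.0%}")

print(odds_from_probability(0.75))  # 3.0
```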

Because log odds are not intuitive (for most people), it is common to interpret the coefficients of logistic regression in terms of odds.
When a log odds coefficient has been converted to odds (using $e^\beta$), a value of 1.5 means the odds of the positive class are multiplied by 1.5 for each unit increase in the variable.
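
A small numeric check of this interpretation, under stated assumptions: a single predictor, hypothetical coefficients, and the fact that the odds equal $e$ raised to the linear combination (since the log odds are linear in the predictors). The helper name `odds` is illustrative.

```python
import math


def odds(x, beta0, beta1):
    """Odds of the positive class for one predictor: e^(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)


# Hypothetical coefficients; beta1 is chosen so that e^beta1 equals 1.5.
beta0, beta1 = -2.0, math.log(1.5)

# Compare the odds before and after a one-unit increase in the predictor.
ratio = odds(3.0, beta0, beta1) / odds(2.0, beta0, beta1)
print(round(math.exp(beta1), 6), round(ratio, 6))
```

Both printed numbers agree: raising the predictor by one unit multiplies the odds by $e^{\beta}$, whatever the starting value of the predictor.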

## Performance Evaluation

Performance evaluation for logistic regression is the same as for other classification methods.
The typical performance metrics for classifiers are accuracy, precision, and recall (also called sensitivity).
We previously talked about these, but we did not focus much on precision, so let's clarify that.

In some of our previous classification examples, there are only two classes that are equally likely (each is 50% of the data).
When classes are equally likely, we say they are **balanced**.
If our classifier is correct 60% of the time with two balanced classes, we know it is 10 percentage points better than chance.

However, sometimes things are very unbalanced.
Suppose we're trying to detect a rare disease that occurs once in 10,000 people.
In this case, a classifier that always predicts "no disease" will be correct 99.99% of the time.
This is because the **true negatives** in the data are so much more numerous than the **true positives**.
Because the metrics of accuracy and specificity use true negatives, they can be somewhat misleading when classes are imbalanced.

In contrast, precision and recall don't use true negatives at all (see the figure below).
This makes them behave more consistently in both balanced and imbalanced data.
For these reasons, precision, recall, and their combination F1 (also called f-measure) are very popular in machine learning and data science.


![matrix](https://pbs.twimg.com/media/GFrGd1_XwAE4a9e?format=jpg&name=medium)
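
The rare-disease scenario can be worked through numerically. The sketch below uses the standard definitions (accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name is hypothetical, and the counts assume exactly 10,000 people with one true case and a classifier that always predicts "no disease".

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall); None marks an undefined 0/0 ratio."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return accuracy, precision, recall


# Always predicting "no disease": the single sick person becomes a false negative.
accuracy, precision, recall = classification_metrics(tp=0, tn=9999, fp=0, fn=1)
print(accuracy, precision, recall)
```

Accuracy looks excellent, yet recall is zero and precision is undefined because no positive prediction was ever made, which is the sense in which accuracy can mislead on imbalanced data.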

<!-- NOTE: this became redundant with Tasha's KNN classification notebook. I modified to amplify precision, which she did not focus much on. -->
<!-- happens, it is easy

These are typical derived by compared the predicted output to the golden or actual output/categories in the expert labelled dataset.

For a binary classification case, we denote the category 1 as the positive category and category 0 as the negative category. Using this new terminology, When comparing the predicted categories to the actual categories we may end up with the following cases:
* True Positives (TP): instances predicted as belonging to the positive category and which in fact do belong to the positive category
* True Negatives (TN): instances predicted as belonging to the negative category and which in fact do belong to the negative category
* False Positives (FP): instances predicted as belonging to the positive category and which in fact do belong to the negative category
* False Negatives (FN): instances predicted as belonging to the negative category and which in fact do belong to the positive category

From these categories, we define the following metrics:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

Classfication methods that have a high accuracy are preferred in general although  -->
In some cases, maximizing precision or recall may be preferred.
For instance, a high recall is highly recommended when making a medical diagnosis, since it is preferable to err on the side of misdiagnosing someone as having cancer as opposed to missing someone who indeed has cancer, i.e., the method should try not to miss anyone who may indeed have cancer.
This idea is sometimes referred to as **cost-sensitive classification**, because there may be an asymmetric cost toward making one kind of mistake vs. another (i.e. FN vs. FP).

In general, there is a trade-off between precision and recall.
Pushing precision higher tends to lower recall, and vice versa.
Total recall (100% recall) is achievable by always predicting the positive class, i.e., label all instances as positive, in which case precision will be very low.

In the case of logistic regression, you can imagine that we changed the threshold from .50 to a higher value like .90.
This would make many observations previously classified as 1 now classified as 0.
The observations still classified as 1 would be very likely to truly be 1, since the model is at least 90% confident in each of them (high precision).
However, we would have lost all of the true 1s with predicted probabilities between 50% and 90% (lower recall).
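
The effect of moving the threshold can be sketched with a handful of made-up numbers. The probabilities and labels below are hypothetical (invented for illustration, not produced by any model in this material), and `precision_recall` is a hand-written helper that uses only the standard library.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall when probabilities >= threshold are classified as 1."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


# Hypothetical predicted probabilities of class 1 and the corresponding true labels.
probs = [0.95, 0.92, 0.85, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.0, 0.5, 0.9):
    print(threshold, precision_recall(probs, labels, threshold))
```

On this toy data, a threshold of 0.0 labels everything positive (recall 1.0, precision 0.6), while a threshold of 0.9 keeps only the two most confident predictions (precision 1.0, recall about 0.33).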

<!-- TODO: we need to normalize coverage of performance metrics across notebooks, particularly for classification -->