# Classification Metrics

## Model Evaluation

We just learned that once a machine learning process is complete, we use performance metrics to evaluate how well our model did.

Remember that in the train test split step, we ran our trained model on some test data and then evaluated its performance.


In this lecture we're going to discuss in more detail how we actually evaluate the performance of a model. For classification tasks, that means classification metrics.

The key classification metrics we need to understand are:

- Accuracy 
- Recall 
- Precision 
- F1-Score 

But first we should really understand the reasoning behind these metrics and how they actually work in the real world.

So we're going to follow along with a more realistic example: classifying a message as spam versus ham.

- Typically in any classification task your model can only achieve two results:

    - Either your model was **correct** in its prediction.
    - Or your model was **incorrect** in its prediction.

- Fortunately, incorrect versus correct extends to situations where you have multiple classes. It doesn't matter if you're trying to predict among eight different categories: when it comes to classification, each prediction from your model is fundamentally either correct or incorrect.
- Now, for the purposes of explaining the metrics, we're going to imagine a **binary classification** situation where we only have two available classes: spam versus ham.
- In our example, we're going to attempt to predict whether a text message is **Spam** or **Ham**, ham being another word for a legitimate message.
- Since this is a supervised learning process, we are first going to **fit or train** a model on **training data**. Then we will test the model on **testing data**.

- Once we have the model's predictions from the **X_test** data, we compare them to the **true y values**, that is, the correct labels. Remember that in a supervised learning process we're dealing with historical information that we already have the labels for.

- Keep in mind there are a few steps to convert the raw text message information into a format that the machine learning model can understand.
- Whenever we're dealing with raw text, we have to do a little bit of work in **vectorization** to translate the raw text into numerical information the machine learning model can understand. Whether it's a text message from a cell phone, an email, or a written movie review, there will be some steps, which we're going to cover later on, to convert that raw text into numerical information.


So, just to give you a brief overview of the vectorization process, we're going to wave our hands a little bit here.

![](../imgs/i01.png)

Essentially, we take some raw text message from the testing dataset, that is, from **X_test**, pass it through some **vectorizer**, and end up with a **vectorized version** of it.

To give you a brief idea of what that would look like: we start with some raw text information. Here's a text message from a cell phone saying "hey how are you doing, I've been doing well", et cetera.


![](../imgs/i02.png)

Theoretically, after we pass it through the **vectorizer**, it should be in some sort of vectorized format: a numerical matrix that the machine learning model can then understand.

You can imagine that we can do things such as count the number of times certain words show up. That would be numerical information, and we would format it in a way that the machine learning model can understand.
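
To make the word-counting idea a little more concrete, here is a minimal sketch in plain Python. The three messages and the `vectorize` helper are made up for illustration, and this is not the vectorizer we will use later in the course; it only shows how raw text can become rows of numbers.

```python
from collections import Counter

# Hypothetical example messages (not from a real dataset).
messages = [
    "hey how are you doing",
    "win a free prize now",
    "are you free now",
]

# Build a sorted vocabulary from every word that appears in any message.
vocabulary = sorted({word for message in messages for word in message.split()})


def vectorize(message, vocabulary):
    """Return one count per vocabulary word for a single message."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


# Each row is one message; each column is one vocabulary word.
matrix = [vectorize(message, vocabulary) for message in messages]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so every message ends up in the same fixed-size numerical format regardless of how long the original text was.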

We will learn about this entire process in a lot more detail. Things like **term frequency inverse document frequency** and **bag of words** are different techniques we can employ to vectorize text information. We are going to learn all of that in the feature extraction lecture.

- We set up this vectorization in a **pipeline**, and again there are many ways of transforming that raw text into numerical information.

- For right now, since we want to focus on understanding classification metrics, we're just going to assume that there was some underlying vectorization process that took place.

Let's jump right into the middle of that machine learning process pipeline and assume that we have a trained model and the model is trained on the training data.

![](../imgs/i03.png)

So we took in our **X_train** and our **y_train** and trained the model.

Now it comes time to actually evaluate the model's performance. What we need to do is take our **test** dataset and pass it into the trained model.

![](../imgs/i04.png)

So, for example, we grab a test message from **X_test** and pass it into the trained model. Because this is supervised learning with historical labeled information, we already know whether this test message is ham or spam.

![](../imgs/i05.png)

So keep in mind we always already know the correct label for this particular test data point. That's going to allow us to check whether the output of the trained model is correct or incorrect.


So we pass it through the trained model and, let's say in this particular instance, the trained model says: I think this text message is ham, a legitimate text message.

![](../imgs/i06.png)

Then all we do is compare the correct label that we already know to be true against the prediction that the trained model output.

![](../imgs/i07.png)


So here we compare ham against ham, and they match. In this case we get a correct prediction.

Keep in mind that for other data points the trained model could be incorrect.
It might say: oh, I think this text message is spam. In that case it would be an incorrect prediction.

![](../imgs/i08.png)

- So what we're going to do is we just repeat this process for all the text messages inside of our **X_test** data.

- At the end, we're going to have a count of correct matches and a count of incorrect matches.

- Here's the absolute key realization, probably the most fundamental part of this particular lecture: in real-world situations, not all incorrect or correct matches hold equal value, which is why we have those various classification metrics. It's not enough to know that you got a particular count of correct versus incorrect predictions; there are various ratios that we need to take into account.

- So in the real world a single metric won't tell the complete story.

- So to understand all of this we're going to bring back those four metrics we mentioned at the beginning and see how they're actually calculated.

- We can also organize our predicted values compared to the real values in what's known as a **confusion matrix**.


- Accuracy
    - Accuracy in classification problems is the **number of correct predictions** made by the model divided by the **total number of predictions**.


- For example, if the test dataset was 100 messages and our model correctly predicted 80 of those messages, then we would have 80 out of a total of 100 predictions that were correct. That means we were **0.8** or **80% accurate**.

- Accuracy is really useful when target classes are **well balanced**.

- In our example, that would mean we have roughly the same number of spam messages in our test data as we have ham or legitimate messages.
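
As a rough sketch of the counting described above, we can compare true labels against predictions one by one and compute accuracy as correct over total. The labels in `y_true` and `y_pred` here are made-up values for illustration, scaled down to 10 messages instead of 100.

```python
# Hypothetical true labels and model predictions for 10 test messages.
y_true = ["ham", "ham", "spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham", "ham", "spam", "ham", "ham"]

# Count how many predictions match the known correct label.
correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
incorrect = len(y_true) - correct

# Accuracy = number of correct predictions / total number of predictions.
accuracy = correct / len(y_true)

print(correct, incorrect, accuracy)  # 8 2 0.8
```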


Now here's a problem.

- Accuracy is not a good choice with **unbalanced classes**.


Let's imagine a situation where we had ninety-nine legitimate ham messages and one spam text message. If our model was a simple line of code that always predicted ham, we would get ninety-nine percent accuracy even though it was incorrect on every single instance of spam. The accuracy looks good only because we had an unbalanced class situation in our test dataset.
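
Here is a small sketch of that exact situation. The `always_ham` function is a hypothetical stand-in for the "simple line of code" model, and the data is constructed to match the 99 ham, 1 spam example.

```python
# 99 legitimate messages and a single spam message.
y_true = ["ham"] * 99 + ["spam"]


def always_ham(message):
    """A 'model' that ignores its input and always predicts ham."""
    return "ham"


y_pred = [always_ham(None) for _ in y_true]

# Overall accuracy looks excellent.
correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
accuracy = correct / len(y_true)

# But look only at the spam messages: how many did we actually catch?
spam_total = sum(1 for true in y_true if true == "spam")
spam_caught = sum(
    1 for true, pred in zip(y_true, y_pred) if true == "spam" and pred == "spam"
)

print(accuracy)                 # 0.99
print(spam_caught, spam_total)  # 0 1
```

The model never catches a single spam message, yet accuracy alone would suggest it is doing very well.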

- So in this situation we're going to want to understand **recall** and **precision**, which give us a better understanding of how the model performs on the other, smaller class.

- So let's quickly go over some formal definitions of Precision, Recall, and F1-Score (a combination of precision and recall).

- Recall
    - Ability of a model to find **all** the relevant cases within a dataset.
    - The precise definition of recall is the **number of true positives divided by the number of true positives plus the number of false negatives.**
    
When we see this in combination with the confusion matrix, it will make a lot more sense.

- Precision
    - Ability of a classification model to identify **only** the relevant data points.
    - Precision is defined as the **number of true positives divided by the number of true positives plus the number of false positives.**

- Recall versus Precision

    - Often you have a trade-off between Recall and Precision.
    - While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model says were relevant that actually were relevant.
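
The two definitions above can be sketched in a few lines of Python. This is an illustrative helper, not a library function: the name `precision_recall`, the choice of spam as the positive class, and returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall, treating `positive` as the relevant class."""
    pairs = list(zip(y_true, y_pred))
    # True positive: actually spam, predicted spam.
    tp = sum(1 for true, pred in pairs if true == positive and pred == positive)
    # False positive: actually ham, predicted spam.
    fp = sum(1 for true, pred in pairs if true != positive and pred == positive)
    # False negative: actually spam, predicted ham.
    fn = sum(1 for true, pred in pairs if true == positive and pred != positive)

    # Assumption: report 0.0 when the ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 4 spam and 6 ham messages.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this made-up data the model flags 5 messages as spam and 3 of them really are spam, so precision is 3 out of 5. There are 4 real spam messages and the model finds 3 of them, so recall is 3 out of 4.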

- F1-Score
    - In cases where we want to find an optimal blend of precision and recall we can combine the two metrics using what is called F1 score.
    - The F1 score is the harmonic mean of precision and recall taking both metrics into account in the following equation:
    
    $$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
    
    - We use the harmonic mean instead of a simple average because it punishes extreme values.
    - A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0.
    
So that means the score automatically becomes very small or zero if one of these values, either precision or recall, happens to be very poor.
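
To see how the harmonic mean punishes extreme values, here is a short sketch comparing it to a simple average. The helper names `f1_score` and `simple_average` are made up for this example, and returning 0.0 when both inputs are zero is an assumption to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


def simple_average(precision, recall):
    """Plain arithmetic mean, shown only for comparison."""
    return (precision + recall) / 2


# Extreme case from the text: perfect precision, zero recall.
print(simple_average(1.0, 0.0), f1_score(1.0, 0.0))  # 0.5 0.0

# A more balanced classifier: the two scores end up close together.
print(simple_average(0.6, 0.75), f1_score(0.6, 0.75))
```

When precision and recall are similar, F1 is close to the simple average; when one of them collapses, F1 collapses with it.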

- Precision and Recall typically make more sense in the context of the confusion matrix.

- So in the next lecture we're going to explore the confusion matrix and talk in more detail about the different classification metrics.