## Logistic Regression

It is a classification model that helps us make predictions when the output is a categorical variable. Because it is one of the most easily interpretable classification models, it is very commonly used in industries such as banking and healthcare.

### Binary Classification

In logistic regression, the output variable is a categorical variable.

Examples:
  - A finance company wants to know whether a customer will default or not
  - Predicting whether an email is spam or ham
  - Categorizing an email as professional, personal or official (this one has three classes, so it is a multiclass problem rather than a binary one).
  
So in Binary Classification:
  1. Two possible outputs
      - Customer would default or not
      - Email is spam or ham
      - Person would be diabetic or not

A problem with more than two possible outputs is called **Multiclass Classification**.

**Decision Boundary Approach**

![image-5.png](attachment:image-5.png)

Suppose there is another person, with a blood sugar level of 195, and you do not know whether that person has diabetes or not. What would you do then? Would you classify him/her as a diabetic or as a non-diabetic?

![image.png](attachment:image.png)

Now, based on the boundary, you may be tempted to declare this person a diabetic, but can you really do that? This person’s sugar level (195 mg/dL) is very close to the threshold (200 mg/dL), below which people are declared as non-diabetic. It is, therefore, quite possible that this person was just a non-diabetic with a slightly high blood sugar level. After all, the data does have people with slightly high sugar levels (220 mg/dL), who are not diabetics.

**Note** : Deciding the class solely on the basis of a cutoff is very risky. In other words, it would be risky to declare that a person is a diabetic solely because her/his blood sugar level is more than 195 mg/dL. Especially in the middle range, a patient could belong to either class, diabetic or non-diabetic.


### Sigmoid Curve

![image-6.png](attachment:image-6.png)

y (Prob of Diabetes) = $\frac {1}{1 + e^{-(\beta_{0} + \beta_{1}x)}}$

Since the **sigmoid curve** has all the properties you would want — **extremely low values** in the start, **extremely high values** in the end, and **intermediate values** in the middle — it’s a good choice for modelling the value of the **probability of diabetes**.


**Ques 1** : For the sigmoid curve (β0 = -15 and β1 = 0.065), what will be the probability of diabetes for a patient with sugar level 240?

**Ans** : ≈ 0.65 (0.6457, which truncates to the listed 0.64), using the formula **y (Prob of Diabetes)** = $\frac {1}{1 + e^{-(\beta_{0} + \beta_{1}x)}}$
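As a quick check of this answer, the sigmoid can be evaluated directly. This is a minimal sketch; the function name `sigmoid_probability` is illustrative and not part of the original notes.

```python
import math

def sigmoid_probability(x, beta0, beta1):
    """Return P = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Values from the question: beta0 = -15, beta1 = 0.065, sugar level = 240
p = sigmoid_probability(240, beta0=-15, beta1=0.065)
print(round(p, 3))  # 0.646
```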

Why can’t you just fit a straight line here? This would also have the same properties — low values in the start, high ones towards the end, and intermediate ones in the middle.

![image-2.png](attachment:image-2.png)

The main problem with a straight line is that it is not steep enough. In the sigmoid curve, as you can see, you have low values for a lot of points, then the values rise all of a sudden, after which you have a lot of high values. In a straight line, though, the values rise from low to high very uniformly, and hence the “boundary” region, the one where the probabilities transition from low to high, is not present.


### Finding the Best Fit Sigmoid Curve - I

Find the combination of $\beta_{0}$ and $\beta_{1}$ which fits the data best.

To find the **best-fit sigmoid curve**, you need to vary $\beta_{0}$ and $\beta_{1}$ until you get the combination of beta values that **maximises the likelihood**. For the diabetes example, the likelihood is given by the expression:

![image-7.png](attachment:image-7.png)

Best fitting combination of $\beta_{0}$ and $\beta_{1}$ will be the one which maximises the product:

(1-P1)(1-P2)(1-P3)(1-P4)(1-P6)(P5)(P7)(P8)(P9)(P10)

This product is called the **likelihood function**. It is the product of:

$\prod_{\text{non-diabetics}} (1 - P_{i}) \times \prod_{\text{diabetics}} P_{i}$


So, say that for the ten points in our example, the labels are a little different, somewhat like this:

|Point no.	|1	|2	|3	|4	|5	|6	|7	|8	|9	|10|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|Diabetes	|no	|no	|no	|yes	|no	|yes	|no	|yes	|yes	|yes|

In this case, the likelihood would be equal to :

(1-$P_{1}$)(1-$P_{2}$)(1-$P_{3}$)(1-$P_{5}$)(1-$P_{7}$)($P_{4}$)($P_{6}$)($P_{8}$)($P_{9}$)($P_{10}$)

**The best fitting sigmoid curve would be the one which maximises the value of above product.**
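The likelihood product can be sketched in a few lines. The labels below follow the table above, but the sugar levels and both pairs of beta values are hypothetical, chosen only to show that a curve placed near the data scores a higher likelihood than one placed far from it.

```python
import math

def sigmoid(x, beta0, beta1):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def likelihood(xs, labels, beta0, beta1):
    """Product of P_i for diabetics and (1 - P_i) for non-diabetics."""
    value = 1.0
    for x, is_diabetic in zip(xs, labels):
        p = sigmoid(x, beta0, beta1)
        value *= p if is_diabetic else (1.0 - p)
    return value

# Hypothetical sugar levels (mg/dL) for points 1..10; labels follow the table above
sugar = [150, 160, 170, 205, 180, 215, 200, 225, 235, 245]
diabetic = [False, False, False, True, False, True, False, True, True, True]

# Two hypothetical candidate curves
l_good = likelihood(sugar, diabetic, beta0=-20.0, beta1=0.1)  # midpoint at 200 mg/dL
l_poor = likelihood(sugar, diabetic, beta0=-30.0, beta1=0.1)  # midpoint at 300 mg/dL
print(l_good > l_poor)
```

In practice the beta values are not compared by hand like this; an optimiser searches for the pair that maximises the likelihood (usually its logarithm).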

So, just by looking at the curve here, you can get a general idea of the curve’s fit. Just look at the yellow bars for each of the 10 points. **A curve that has a lot of big yellow bars is a good curve**. For example, this curve is not a good fit:

![image-3.png](attachment:image-3.png)

This curve, though, is a better fit:

![image-4.png](attachment:image-4.png)

Clearly, this curve is a better fit. It has many big yellow bars, and even the small ones are reasonably large. Just by looking at this curve, you can tell that it will have a high likelihood value.


### Odds and Log Odds

**Note** - Optimization method : MLE (Maximum Likelihood Estimation) Optional

So far, we have seen the equation for logistic regression:

P = $\frac {1}{1 + e^{-(\beta_{0} + \beta_{1}x)}}$

It gives the relationship between P (the probability of diabetes) and x (the patient's blood sugar level). The relationship between P and x is so complex that it is difficult to understand what kind of trend exists between the two. If you increase x by regular intervals of, say, 11.5, how will that affect the probability? Will it also increase by some regular interval? If not, what will happen?

**Linearising the sigmoid equation** :

P = $\frac {1}{1 + e^{-(\beta_{0} + \beta_{1}x)}}$

1 - P = $\frac {e^{-(\beta_{0} + \beta_{1}x)}}{1 + e^{-(\beta_{0} + \beta_{1}x)}}$

$\frac {P}{1-P}$ = $e^{(\beta_{0} + \beta_{1}x)}$

ln($\frac {P}{1-P}$) = $(\beta_{0} + \beta_{1}x)$

**Odds** = $\frac {P}{1-P}$

**Log Odds** = ln($\frac {P}{1-P}$)

P = Probability of being diabetic

1-P = Probability of not being diabetic

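**Odds** increase monotonically with **P**: as P rises from 0 towards 1, the odds rise from 0 towards infinity. The relationship is not linear, so P and the odds are not strictly proportional.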
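The linearised form answers the earlier question about regular intervals of x: the probability does not change by a fixed amount, but the log odds do. The sketch below reuses the example coefficients from the sigmoid question (β0 = -15, β1 = 0.065) purely as an illustration; the x values are arbitrary.

```python
import math

beta0, beta1 = -15.0, 0.065  # example coefficients reused from the earlier question

def probability(x):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(x):
    p = probability(x)
    return p / (1.0 - p)

def log_odds(x):
    return math.log(odds(x))

step = 11.5
for x in (200.0, 211.5, 223.0):
    print(x, round(probability(x), 3), round(odds(x), 3), round(log_odds(x), 3))

# Each step adds beta1 * step to the log odds,
# which multiplies the odds by exp(beta1 * step)
odds_ratio = odds(211.5) / odds(200.0)
```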


**Ques 1** : In a dataset with mean 50 and standard deviation 12, what will be the value of a variable with an initial value of 20 after you standardise it?

**Ans** : -2.5  Using formula : ($\frac {X - \mu}{\sigma}$)

**Ques 2** : As mentioned in the lecture, you use **'fit_transform'** on the train set but just **'transform'** on the test set. Recall you had learnt this in linear regression as well. Why do you think this is done?

**Ans** : The 'fit_transform' command first learns the mean ($\mu$) and standard deviation ($\sigma$) of each variable from the train set (the 'fit' step) and then scales the variables using $X_{scaled}$ = $\frac {X - \mu}{\sigma}$ (the 'transform' step), so that each has a mean of 0 and a standard deviation of 1. When you move to the test set, you do not want the scaler to learn anything new; you want to reuse the centring and scaling learnt from the train set. This is why you don't apply 'fit' on the test data, just 'transform'.

https://datascience.stackexchange.com/questions/12321/difference-between-fit-and-fit-transform-in-scikit-learn-models
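The fit / transform split can be illustrated without any library. The class below is a hypothetical stand-in written for these notes, not scikit-learn's own scaler, and the train and test values are made up.

```python
import statistics

class SimpleStandardScaler:
    """Hypothetical stand-in that mimics the fit / transform split of a standard scaler."""

    def fit(self, values):
        # Learn the mean and (population) standard deviation from the training data only
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuse the stored training statistics; nothing is re-learned here
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train = [38.0, 62.0, 50.0, 44.0, 56.0]  # hypothetical training values
test = [20.0, 50.0]                      # hypothetical test values

scaler = SimpleStandardScaler()
train_scaled = scaler.fit_transform(train)  # fit + transform on the train set
test_scaled = scaler.transform(test)        # transform only on the test set
```

Because the test values are scaled with the training mean and standard deviation, the scaled test set generally does not have a mean of exactly 0, which is expected.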


**Ques 1** : Which of the following commands can be used to view the correlation table for the dataframe telecom?

**Ans** : telecom.corr()

**Ques 2** : Take a look at the heatmap provided above. Which of the variables have the highest correlation between them?

**Ans** : MultipleLines_No and MultipleLines_Yes (The following are the correlation values between the four pairs of variables given in the options:

  - 0.53
  - 0.54
  - -0.82
  - -0.64

As you can clearly see, the third pair, i.e. MultipleLines_No and MultipleLines_Yes, is the most strongly correlated, with a value of -0.82, the largest in absolute terms.)

**Ques 1** : After learning the coefficients of each variable, the model also produces a ‘p-value’ of each coefficient. Fill in the blanks so that the statement is correct: 

“The null hypothesis is that the coefficient is ---. If the p-value is small, you can say that the coefficient is significant and hence the null hypothesis -----.”

**Ans** : zero, can be rejected (Recall that the null hypothesis for any beta was $\beta_{i}$ = 0. If the p-value is small, you can say that the coefficient is significant, and hence, you can reject the null hypothesis that $\beta_{i}$ = 0.)

**Ques 1** : You saw that Rahim chose a cut-off of 0.5. What can be said about this threshold?

**Ans** : It was arbitrarily chosen by us, i.e. there’s nothing special about 0.5. We could have chosen something else as well.

**Ques 2** : Based on the RFE output shown above, which of the variables is least significant?

**Ans** : gender_Male  (RFE assigns ranks to the different variables based on their significance. While 1 means that the variable should be selected, a rank > 1 tells you that the variable is insignificant. The ranking given to 'gender_Male' by RFE is 9 which is the highest and hence, it is the most insignificant variable present in the RFE output.)

**Ques 3** : Suppose the following table shows the predicted values for the probabilities for 'Churn'. Assuming you chose an arbitrary cut-off of 0.5 wherein a probability of greater than 0.5 means the customer would churn and a probability of less than or equal 0.5 means the customer wouldn't churn, which of these customers do you think will churn? (More than one option may be correct.)

|Customer	|Probability(Churn)|
| --- | --- |
|A|0.45|
|B|0.67|
|C|0.98|
|D|0.49|
|E|0.03|

**Ans** : Customer B and C
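The cut-off rule from this question is a one-line filter. A minimal sketch, using the probabilities from the table above:

```python
churn_probability = {"A": 0.45, "B": 0.67, "C": 0.98, "D": 0.49, "E": 0.03}
cutoff = 0.5

# A customer is labelled 'Churn' only when the probability is strictly greater than the cutoff
predicted_churn = [name for name, p in churn_probability.items() if p > cutoff]
print(predicted_churn)  # ['B', 'C']
```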


## Confusion Matrix and Accuracy

You chose a cutoff of 0.5 in order to classify the customers into 'Churn' and 'Non-Churn'. Now, since you're classifying the customers into two classes, you'll obviously have some errors. The classes of errors that would be there are:

  - ``'Churn'`` customers being (incorrectly) classified as ``'Non-Churn'``
  - ``'Non-Churn'`` customers being (incorrectly) classified as ``'Churn'``
  

To capture these errors, and to evaluate how well the model performs, you'll use something known as the **'Confusion Matrix'**. A typical confusion matrix would look like the following:

![image-8.png](attachment:image-8.png)

This table shows a comparison of the **predicted** and **actual** labels. The **actual** labels are along the **vertical axis**, while the **predicted** labels are along the **horizontal axis**. Thus, the second row and first column (263) is the number of customers who have actually ``‘churned’`` but the model has predicted them as ``'non-churn'``.

Similarly, the cell at second row, the second column (298) is the number of customers who are actually ``‘churn’`` and also predicted as ``‘churn’``.

Now, the simplest model evaluation metric for classification models is **accuracy** - it is the percentage of correctly predicted labels. So what would the correctly predicted labels be? They would be:

  - ``'Churn'`` customers being actually identified as ``churn``
  - ``'Non-churn'`` customers being actually identified as ``non-churn``.
  
So the correctly predicted labels are contained in the first row and first column, and the last row and last column.

Accuracy = $\frac {\text{Correctly Predicted Labels}}{\text{Total Number of Labels}}$

Accuracy = $\frac {1406+298}{1406+143+263+298}$ = $\frac {1704}{2110}$ ≈ 80.76%
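The same calculation as a short sketch. The cell values are the ones quoted in the text; the `[[TN, FP], [FN, TP]]` layout is assumed from the description of rows (actual) and columns (predicted).

```python
# Rows are actual labels, columns are predicted labels: [[TN, FP], [FN, TP]]
confusion = [[1406, 143],
             [263, 298]]

correct = confusion[0][0] + confusion[1][1]   # diagonal: correctly predicted labels
total = sum(sum(row) for row in confusion)    # all labels
accuracy = correct / total
print(f"{accuracy:.2%}")  # 80.76%
```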


**Ques 1** : Given the confusion matrix below, can you tell how many 'Churns' were correctly identified, i.e. if the person has actually churned, it is predicted as a churn?

|Actual/Predicted	|Not Churn	|Churn|
| --- | --- | --- |
|Not Churn	|80	|30|
|Churn	|20	|70|

**Ans** : 70

**Ques 2** : From the confusion matrix you saw in the last question, compute the accuracy of the model.

**Ans** : 75%


**Ques 1** : Suppose you built a logistic regression model to predict whether a patient has lung cancer or not and you get the following confusion matrix as the output.

|Actual/Predicted	|No	|Yes|
| -- | -- | -- |
|No	|400	|100|
|Yes	|50	|150|

How many of the patients were wrongly identified as a 'Yes'?

**Ans** : 100

**Ques 2** : Take a look at the table again. How many of these patients were correctly labelled, i.e. if the patient had lung cancer it was actually predicted as a 'Yes' and if they didn't have lung cancer, it was actually predicted as a 'No'?

**Ans** : 550

**Ques 3** : From the table you used for the last two questions, what will be the accuracy of the model?

**Ans** : 78.57%


### Interpret the Model 

![image-9.png](attachment:image-9.png)

Refer to the above image, i.e. the final summary statistics after completing manual feature elimination. Now suppose you are a data analyst working for the telecom company, and you want to compare two  customers, customer A and customer B. For both of them, the value of the variables tenure, PhoneService, Contract_One year, etc. are all the same, except for the variable **PaperlessBilling**, which is equal to **1 for customer A** and **0 for customer B**.

In other words, customer A and customer B have the exact same behaviour as far as these variables are concerned, except that customer A opts for paperless billing, and customer B does not. Now use this information to answer the following questions.

**Ques 1** : Based on the above information, what can you say about the log odds of these two customers? 

PS: Recall the log odds for univariate logistic regression was given as:

ln($\frac {P}{1-P}$) = $\beta_{0}$ + $\beta_{1}$X

Hence, for multivariate logistic regression, it would simply become:

ln($\frac {P}{1-P}$) = $\beta_{0}$ + $\beta_{1}X_{1}$ + $\beta_{2}X_{2}$ + $\beta_{3}X_{3}$ + ---- + $\beta_{n}X_{n}$

**Ans** : log odds (customer A) > log odds (customer B) {Recall the log odds are just the linear term present in the logistic regression equation. Hence, here we have 13 variables, so the log odds will be given by: 

ln($\frac {P}{1-P}$) = $\beta_{0}$ + $\beta_{1}X_{1}$ + $\beta_{2}X_{2}$ + $\beta_{3}X_{3}$ + ---- + $\beta_{13}X_{13}$ 

Now, for the two customers, all beta and all x values are the same, except for X2 (the variable for paperless billing), which is equal to 1 for customer A and 0 for customer B. 

Hence, the log odds of customer A exceed those of customer B by the coefficient of 'PaperlessBilling', which is 0.3367.

Basically, for customer A, this term would be = 0.3367 * 1

And for customer B, this term would be = 0.3367 * 0}

**Ques 2** : Now, what can you say about the odds of churn for these two customers?

**Ans** : For customer A, the odds of churning are higher than for customer B (Recall that in the last question, you were told that log odds for customer A are higher than those for customer B. So, the odds of churning for customer A are also higher than the odds of churning for customer B. This is because, as the number increases, its log increases and vice versa.)

**Ques 3** : Now, suppose two customers, customer C and customer D, are such that their behaviour is exactly the same, except for the fact that customer C has OnlineSecurity, while customer D does not. What can you say about the odds of churn for these two customers?

**Ans** : For customer C, the odds of churning are lower than for customer D (Recall that the log odds for customer C will differ from those for customer D by a margin of $\beta_{OnlineSecurity}$. Since this coefficient is negative (-0.3739), the log odds of customer C will be 0.3739 less than those of customer D. Since the log odds of customer C are lower, naturally, the actual odds for C would also be lower.)
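Since a difference in log odds equals the log of the ratio of odds, exponentiating a coefficient shows by what factor the odds change when only that variable goes from 0 to 1. A small sketch using the two coefficients quoted in the answers above (the variable names are illustrative):

```python
import math

# Coefficients quoted in the answers above
beta_paperless_billing = 0.3367
beta_online_security = -0.3739

odds_ratio_a_vs_b = math.exp(beta_paperless_billing)  # odds(A) / odds(B), about 1.40
odds_ratio_c_vs_d = math.exp(beta_online_security)    # odds(C) / odds(D), about 0.69
```

So, with everything else equal, paperless billing multiplies the odds of churn by roughly 1.4, while online security multiplies them by roughly 0.69.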


### Graded Questions

**Ques 1** : Which of these methods is used for fitting a logistic regression model using statsmodels?

**Ans** : GLM()

**Ques 2** : Given the following confusion matrix, calculate the accuracy of the model.

|Actual/Predicted	|Nos	|Yeses|
| -- | -- | -- |
|Nos	|1000	|50|
|Yeses	|250	|1200|

**Ans** : 88%

**Ques 3** : Suppose you are building a logistic regression model to determine whether a person has diabetes or not.

Following are the values of predicted probabilities of 10 patients.

|Patient	|Probability(Diabetes)|
| --- | --- |
|A	|0.82|
|B	|0.37|
|C	|0.04|
|D	|0.41|
|E	|0.55|
|F	|0.62|
|G	|0.20|
|H	|0.91|
|I	|0.74|
|J	|0.33|
 
Assuming you arbitrarily chose a cut-off of 0.4, wherein if the probability is greater than 0.4, you'd conclude that the patient has diabetes and if it is less than or equal to 0.4, you'd conclude that the patient doesn't have diabetes, how many of these patients would be classified as diabetic based on the table above?

**Ans** : 6

**Ques 4** : Suppose you are working for a media services company like Netflix. They're launching a new show called 'Sacred Games' and you are building a logistic regression model which will predict whether a person will like it or not based on whether consumers have liked/disliked some previous shows. You have the data of five of the previous shows and you're just using the dummy variables for these five shows to build the model. If the variable is 1, it means that the consumer liked the show and if the variable is zero, it means that the consumer didn't like the show. The following table shows the values of the coefficients for these five shows that you got after building the logistic regression model.

|Variable Name	|Coefficient Value|
| --- | --- |
|TrueDetective_Liked	|0.47|
|ModernFamily_Liked	|-0.45|
|Mindhunter_Liked	|0.39|
|Friends_Liked	|-0.23|
|Narcos_Liked	|0.55|
 
Now, you have the data of three consumers Reetesh, Kshitij, and Shruti for these 5 shows indicating whether or not they liked these shows. This is shown in the table below:

|Consumer	|TrueDetective_Liked	|ModernFamily_Liked	|Mindhunter_Liked	|Friends_Liked	|Narcos_Liked|
| --- | --- | --- | --- | --- | --- |
|Reetesh	|1	|0	|0	|0	|1|
|Kshitij	|1	|1	|1	|0	|1|
|Shruti	    |0	|1	|0	|1	|1|

Based on this data, which one of these three consumers is most likely to like the new show 'Sacred Games'?

**Ans** : Reetesh

To find the person who is most likely to like the show, you can use log odds. Recall the log odds is given by:

ln($\frac {P}{1-P}$) = $\beta_{0}$ + $\beta_{1}X_{1}$ + $\beta_{2}X_{2}$ + $\beta_{3}X_{3}$ + ---- + $\beta_{n}X_{n}$

Here, there are five variables for which the coefficients are given. Hence, the log odds become:

ln($\frac {P}{1-P}$) = 0.47*$X_{1}$ - 0.45*$X_{2}$ + 0.39*$X_{3}$ - 0.23*$X_{4}$ + 0.55*$X_{5}$

As you can see, we have ignored the β0 since it will be the same for all the three consumers. Now, using the values of the 5 variables given, you get - 

$(Log Odds)_{Reetesh}$ = (0.47 X 1) - (0.45 X 0) + (0.39 X 0) - (0.23 X 0) + (0.55 X 1) = 1.02

$(Log Odds)_{Kshitij}$ = (0.47 X 1) - (0.45 X 1) + (0.39 X 1) - (0.23 X 0) + (0.55 X 1) = 0.96

$(Log Odds)_{Shruti}$ = (0.47 X 0) - (0.45 X 1) + (0.39 X 0) - (0.23 X 1) + (0.55 X 1) = -0.13

As you can clearly see, the log odds of Reetesh is the highest, hence, the odds of Reetesh liking the show is the highest and hence, he is most likely to like the new show, Sacred Games.
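The same comparison can be sketched in code, using the coefficients and the like/dislike values from the two tables above. The intercept is left out, as in the worked answer, because it is the same for everyone.

```python
coefficients = {
    "TrueDetective_Liked": 0.47,
    "ModernFamily_Liked": -0.45,
    "Mindhunter_Liked": 0.39,
    "Friends_Liked": -0.23,
    "Narcos_Liked": 0.55,
}

# Values are listed in the same column order as the coefficients
consumers = {
    "Reetesh": [1, 0, 0, 0, 1],
    "Kshitij": [1, 1, 1, 0, 1],
    "Shruti": [0, 1, 0, 1, 1],
}

# Intercept is omitted because it is identical for every consumer
log_odds = {
    name: sum(beta * x for beta, x in zip(coefficients.values(), likes))
    for name, likes in consumers.items()
}
most_likely = max(log_odds, key=log_odds.get)
print(most_likely)  # Reetesh
```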



## Metrics Beyond Accuracy: Sensitivity & Specificity

We almost always care more about one class than the other. **Accuracy**, on the other hand, tells you the model's performance on both classes combined, which is useful but not always the most important metric.

If you're building a model to determine whether you should block (where blocking is a 1 and not blocking is a 0) a customer's transactions or not based on his past transaction behaviour in order to identify frauds, you'd care more about getting the 0's right. This is because you might not want to wrongly block a good customer's transactions as it might lead to a very bad customer experience.

Hence, it is very crucial that you consider the **overall business problem** you are trying to solve to decide the metric you want to maximise or minimise.

This brings us to two of the most commonly used metrics to evaluate a classification model:

  1. Sensitivity
  2. Specificity
  
**Sensitivity** = $\frac {\text{Number of Actual Yeses Correctly Predicted}}{\text{Total Number of Actual Yeses}}$

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 3269 | 366 |
| Churn | 595 | 692 |


The different elements in this matrix can be labelled as follows:

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | **True Negatives** | **False Positives** |
| Churn | **False Negatives** | **True Positives** |


  - The first cell contains the actual 'Not Churns' being predicted as 'Not-Churn' and hence, is labelled 'True Negatives' (Negative implying that the class is '0', here, Not-Churn.).
  - The second cell contains the actual 'Not Churns' being predicted as 'Churn' and hence, is labelled 'False Positive' (because it is predicted as 'Churn' (Positive) but in actuality, it's not a Churn).
  - Similarly, the third cell contains the actual 'Churns' being predicted as 'Not Churn' which is why we call it 'False Negative'.
  - And finally, the fourth cell contains the actual 'Churns' being predicted as 'Churn' and so, it's labelled as 'True Positives'.

Now, to find out the **sensitivity**, you first need the **number of actual Yeses correctly predicted**. This number can be found in the last row, last column of the matrix (which is denoted as true positives). This number is **692**. Now, you need the **total number of actual Yeses**. This number will be the sum of the numbers present in the last row, i.e. the actual number of churns (this will include the actual churns being wrongly identified as not-churns, and the actual churns being correctly identified as churns). Hence, you get **(595 + 692) = 1287**.

**Sensitivity** = $\frac {692}{1287}$ = 53.768%

Thus, you can clearly see that although you had a **high accuracy (~80.475%)**, your sensitivity turned out to be **quite low (~53.768%)**


Now,

**Specificity** = $\frac {\text{Number of Actual Nos Correctly Predicted}}{\text{Total Number of Actual Nos}}$

This value is given by the **True Negatives (3269)** divided by the actual number of negatives, i.e. **True Negatives + False Positives (3269 + 366 = 3635)**. Hence, by replacing these values in the formula, you get specificity as:

**Specificity** = $\frac {3269}{3635}$ = 89.931%
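The three numbers quoted in this section can be reproduced from the four cells of the confusion matrix. A minimal sketch with illustrative variable names:

```python
tn, fp, fn, tp = 3269, 366, 595, 692  # cells of the churn confusion matrix above

sensitivity = tp / (tp + fn)                # 692 / 1287
specificity = tn / (tn + fp)                # 3269 / 3635
accuracy = (tp + tn) / (tn + fp + fn + tp)  # 3961 / 4922

print(f"{sensitivity:.3%} {specificity:.3%} {accuracy:.3%}")
```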


**Ques 1** : What is the number of False Positives for the model given below?

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 400 | 100 |
| Churn | 50 | 150 |

**Ans** : 100

**Ques 2** : What is the sensitivity of the above model?

**Ans** : 75%

**Ques 3** : Among the three metrics that you've learnt about, which one is the highest for the above model?

**Ans** : Specificity (80%) > Accuracy (78.57%) > Sensitivity (75%)


**Sensitivity** = $\frac {\text{True Positives}}{\text{True Positives + False Negatives}}$

**Specificity** = $\frac {\text{True Negatives}}{\text{True Negatives + False Positives}}$

**FalsePositiveRate** = $\frac {\text{False Positives}}{\text{True Negatives + False Positives}}$ = $\frac {\text{False Positives}}{\text{Total Number of Actual Negatives}}$ = **1 - Specificity**

The **FalsePositiveRate** is the number of false positives (0s predicted as 1s) divided by the total number of actual negatives.

**TruePositiveRate** = $\frac {\text{True Positives}}{\text{True Positives + False Negatives}}$ = $\frac {\text{True Positives}}{\text{Total Number of Actual Positives}}$ = **Sensitivity**

The **TruePositiveRate** is the number of positives correctly predicted divided by the total number of actual positives.

**PositivePredictiveValue** = $\frac {\text{True Positives}}{\text{True Positives + False Positives}}$, i.e. of everything we identified as positive, the fraction that is actually positive.

The **positive predictive value** is the number of positives correctly predicted divided by the total number of positives predicted. This is also known as **'Precision'**.

**NegativePredictiveValue** = $\frac {\text{True Negatives}}{\text{True Negatives + False Negatives}}$, i.e. of everything we identified as negative, the fraction that is actually negative.

The **negative predictive value** is the number of negatives correctly predicted divided by the total number of negatives predicted.

**Ques 1** : What is the number of False Negatives for the model given below?

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 80 | 40 |
| Churn | 30 | 50 |

**Ans** : 30

**Ques 2** : What is the approximate specificity of the above model?

**Ans** : 67%

**Ques 3** : Which among accuracy, sensitivity, and specificity is the highest for the above model?

**Ans** : Specificity (67%) > Accuracy (65%) > Sensitivity (62.5%)

**Ques 4** : Calculate the given three metrics (False Positive Rate, Positive Predictive Value, Negative Predictive Value) for the above model and identify which one is the largest among them.

**Ans** : Negative Predictive Value (72.73%) > Positive Predictive Value (56%) > False Positive Rate (33%)
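All of the metrics in these questions follow from the four cells of the confusion matrix. The helper below is a minimal sketch (the name `rates` is illustrative), applied to the 80 / 40 / 30 / 50 matrix from the questions above.

```python
def rates(tn, fp, fn, tp):
    """Return the confusion-matrix metrics defined above as a dictionary."""
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),   # 1 - specificity
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

metrics = rates(tn=80, fp=40, fn=30, tp=50)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```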


### ROC Curve

#### Model does not predict the class label as such, class label is something we assign according to the probability.

**Ques 1** : Given the following confusion matrix, calculate the value of True Positive Rate (TPR) and False Positive Rate (FPR).

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 300 | 200 |
| Churn | 100 | 400 |

**Ans** : TPR = 80%, FPR = 40%

**Ques 2** : Given data below, Calculate the True Positive Rate and False Positive rate for the cutoffs of 0.4 and 0.5. Which of these cutoffs, will give you a better model? 

Note: The good model is the one in which TPR is high and FPR is low.

| Customer | Churn | Predicted Churn Probability |
| --- | --- | --- |
| Thulasi | 1 | 0.52 |
| Aditi | 0 | 0.56 |
| Jaideep | 1 | 0.78 |
| Ashok | 0 | 0.45 |
| Amulya | 0 | 0.22 |

**Ans** : Cutoff of 0.5

| Customer | Churn | Predicted Churn Probability | Cutoff 0.4 | Cutoff 0.5 |
| --- | --- | --- | --- | --- |
| Thulasi | 1 | 0.52 | 1 | 1 |
| Aditi | 0 | 0.56 | 1 | 1 |
| Jaideep | 1 | 0.78 | 1 | 1 |
| Ashok | 0 | 0.45 | 1 | 0 |
| Amulya | 0 | 0.22 | 0 | 0 |

TPR(0.4) = 2 / (2+0) = 100%

FPR(0.4) = 2 / (2+1) = 67%

TPR(0.5) = 2 / (2+0) = 100%

FPR(0.5) = 1 / (1+2) = 33%

As you can see, with both cutoffs the TPR is 100%, but the cutoff of 0.5 gives a lower value of FPR. So clearly, a cutoff of 0.5 gives you a better model.

Please note that 0.5 just gives the better model among 0.4 and 0.5. It might be possible that there is a cutoff point which gives an even better model.
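The TPR and FPR values for the two cutoffs can be recomputed from the five customers in the table. This is a minimal sketch; the helper name `tpr_fpr` is illustrative, and a probability strictly above the cutoff is treated as a predicted churn.

```python
actual = [1, 0, 1, 0, 0]                      # Thulasi, Aditi, Jaideep, Ashok, Amulya
probability = [0.52, 0.56, 0.78, 0.45, 0.22]

def tpr_fpr(actual, probability, cutoff):
    """Return (TPR, FPR) when probabilities above the cutoff are labelled 1."""
    predicted = [1 if p > cutoff else 0 for p in probability]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

results = {cutoff: tpr_fpr(actual, probability, cutoff) for cutoff in (0.4, 0.5)}
for cutoff, (tpr, fpr) in results.items():
    print(f"cutoff={cutoff}: TPR={tpr:.0%}, FPR={fpr:.0%}")
```

Repeating this for many cutoffs between 0 and 1 gives the set of (FPR, TPR) points that make up an ROC curve.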

**ROC curves** show the **tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR)**. As was established from the formulas above, TPR and FPR are nothing but sensitivity and (1 - specificity), so an ROC curve can also be viewed as a **tradeoff between sensitivity and specificity**.

**TPR** should be high and **FPR** should be low. The ideal condition is TPR = 1 and FPR = 0, which is very rare in practice.

When you plot the **true positive rate** against the **false positive rate**, you get a graph which shows the trade-off between them and this curve is known as the ROC curve. 

![image-10.png](attachment:image-10.png)

As you can see, for higher values of TPR, you will also have higher values of FPR, which might not be good. So it's all about finding a balance between these two metrics and that's what the ROC curve helps you find. You also learnt that a good ROC curve is the one which touches the upper-left corner of the graph; so higher the area under the curve of an ROC curve, the better is your model.


**Ques 1** : You initially chose a threshold of 0.5 wherein a churn probability of greater than 0.5 would result in the customer being identified as 'Churn' and a churn probability of lesser than 0.5 would result in the customer being identified as 'Not Churn'. 

Now, suppose you decreased the threshold to a value of 0.3. What will be its effect on the classification?

**Ans** : More customers would now be classified as 'Churn'. (Initially, the threshold was 0.5. Look at the customers in the 0.3-0.5 probability range. They were being identified as 'Not Churn' before, but now, are being identified as 'Churn'. Hence, naturally, the number of people being identified as 'Churn' will increase.)

**Ques 2** : When the value of TPR increases, the value of FPR ------.

**Ans** : Increases (This can be clearly seen from the ROC curve as well. When the value of TPR (on the Y-axis) is increasing, the value of FPR (on the X-axis) also increases.)

**Ques 3** : You have the following five AUCs (Area under the curve) for ROCs plotted for five different models. Which of these models is the best?

| Model | AUC |
| --- | --- |
| A | 0.54 |
| B | 0.82 |
| C | 0.79 |
| D | 0.66 |
| E | 0.56 |

**Ans** : B (When the ROC curve is more towards the top left corner of the graph, the model is deemed to be more accurate. Hence, a greater area under the curve would mean the model is more accurate. Among the five models given, B has the highest AUC and hence is the most accurate model. Also, note that the highest value of AUC can be 1.)


### Interpreting the ROC Curve

![image-11.png](attachment:image-11.png)

### The $45^{\circ}$ Diagonal

For a completely random model, the ROC curve lies along the 45-degree line shown in the graph above, and in the best case it passes through the upper-left corner of the graph. So, for a model that does at least as well as random guessing, **the least area that an ROC curve can have is 0.5**, and **the highest area it can have is 1**.

Notice that in the curve when Sensitivity is increasing, (1 - Specificity) is increasing, it simply means that Specificity is decreasing.

### Area Under the Curve

By determining the Area under the curve (AUC) of a ROC curve, you can determine how good the model is. If the ROC curve is more towards the upper-left corner of the graph, it means that the model is very good and if it is more towards the 45-degree diagonal, it means that the model is almost completely random. So, the larger the AUC, the better will be your model which is something you saw in the last segment as well.
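One way to make the AUC concrete is through a well-known equivalence: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive case receives the higher score. The sketch below uses that pairwise definition rather than integrating a plotted curve; the function name `pairwise_auc` is illustrative, and the data are the five customers from the cutoff exercise earlier in these notes.

```python
def pairwise_auc(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

actual = [1, 0, 1, 0, 0]
scores = [0.52, 0.56, 0.78, 0.45, 0.22]
auc = pairwise_auc(actual, scores)  # 5 of the 6 pairs are ordered correctly
print(round(auc, 3))  # 0.833
```

A model that ranks every positive above every negative gets 1.0, while scores unrelated to the labels give a value near 0.5.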


**Ques 1** : Following is the ROC curve that you got.

![image-12.png](attachment:image-12.png)

As you can see, when the 'True Positive Rate' is 0.8, the 'False Positive Rate' is about 0.24. What will be the value of specificity, then?

**Ans** : 0.76 (FPR = 1 - Specificity)

**Ques 2** : Which of the following ROC curve represents the best model?

![image-13.png](attachment:image-13.png)

**Ans** : C


### Finding the Optimal Threshold

![image-14.png](attachment:image-14.png)

As you can see, when the probability thresholds are very low, the sensitivity is very high and specificity is very low. Similarly, for larger probability thresholds, the sensitivity values are very low but the specificity values are very high. And at about 0.3, the three metrics seem to be almost equal with decent values and hence, we choose 0.3 as the optimal cut-off point. The following graph also showcases that at about 0.3, the three metrics intersect.

![image-15.png](attachment:image-15.png)

As you can see, at about a threshold of 0.3, the curves of accuracy, sensitivity and specificity intersect, and they all take a value of around 77-78%.


**Ques 1** : Suppose you created a dataframe to find out the optimal cut-off point for a model you built. The dataframe looks like the following:

|Threshold	|Probability	|Accuracy	|Sensitivity	|Specificity|
| --- | --- | --- | --- | --- |
|0.0	|0.0	|0.21	|1.00	|0.00|
|0.1	|0.1	|0.39	|0.96	|0.22|
|0.2	|0.2	|0.56	|0.88	|0.49|
|0.3	|0.3	|0.59	|0.81	|0.53|
|0.4	|0.4	|0.62	|0.78	|0.63|
|0.5	|0.5	|0.74	|0.73	|0.74|
|0.6	|0.6	|0.81	|0.64	|0.79|
|0.7	|0.7	|0.78	|0.42	|0.83|
|0.8	|0.8	|0.63	|0.21	|0.92|
|0.9	|0.9	|0.56	|0.03	|0.98|
 
Based on the table above, what will the approximate value of the optimal cut-off be?

**Ans** : 0.5 (The optimal cut-off point exists where the values of accuracy, sensitivity, and specificity are fairly decent and almost equal. At the cut-off of 0.5, the metric values are 0.74, 0.73, and 0.74 respectively. This is the optimal value of threshold that you can have.)
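The "almost equal" criterion can be turned into a simple rule: pick the threshold where the gap between the largest and smallest of the three metrics is smallest. This rule is only one possible heuristic and is an assumption of this sketch; the numbers are the rows of the table above.

```python
# (threshold, accuracy, sensitivity, specificity) rows from the table above
rows = [
    (0.0, 0.21, 1.00, 0.00),
    (0.1, 0.39, 0.96, 0.22),
    (0.2, 0.56, 0.88, 0.49),
    (0.3, 0.59, 0.81, 0.53),
    (0.4, 0.62, 0.78, 0.63),
    (0.5, 0.74, 0.73, 0.74),
    (0.6, 0.81, 0.64, 0.79),
    (0.7, 0.78, 0.42, 0.83),
    (0.8, 0.63, 0.21, 0.92),
    (0.9, 0.56, 0.03, 0.98),
]

def spread(row):
    """Gap between the largest and smallest of the three metrics at one threshold."""
    _, *metrics = row
    return max(metrics) - min(metrics)

best_threshold = min(rows, key=spread)[0]
print(best_threshold)  # 0.5
```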

**Ques 2** : As you learnt, there is usually a trade-off between various model evaluation metrics, and you cannot maximise all of them simultaneously. For e.g., if you increase sensitivity (% of correctly predicted churns), the specificity (% of correctly predicted non-churns) will reduce. 

Let's say that you are building a telecom churn prediction model with the business objective that your company wants to implement an aggressive customer retention campaign to retain the 'high churn-risk' customers. This is because a competitor has launched extremely low-cost mobile plans, and you want to avoid churn as much as possible by incentivising the customers. Assume that budget is not a constraint.

Which of the following metrics should you choose to maximise?

**Ans** : Sensitivity (High sensitivity implies that your model will correctly identify almost all customers who are likely to churn. It does so by over-estimating the churn likelihood, i.e. it will misclassify some non-churns as churns. That is the trade-off to accept here, because in the opposite case you may lose to the competition some customers whom the model labelled as low churn risk.)


## Precision & Recall

  - **Precision**: Probability that a predicted 'Yes' is actually a 'Yes'.
  
![image-16.png](attachment:image-16.png)

Precision = $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

**'Precision'** is the same as the **'Positive Predictive Value'**

  - **Recall**:  Probability that an actual 'Yes' case is predicted correctly.
  
![image-17.png](attachment:image-17.png)

Recall = $\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

**'Recall'** is exactly the same as **sensitivity**.

**Precision** and **Recall** tradeoff 

![image-18.png](attachment:image-18.png)

As you can see, the curve is similar to what you got for sensitivity and specificity, except that the precision curve is quite jumpy towards the end. This is because the denominator of precision, (TP + FP), is not constant: it is the number of predicted 1s, which changes with the threshold. Because that count can swing wildly, you get a very jumpy curve.
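The effect of the changing denominator can be seen on a small example. The sketch below uses made-up toy labels and probabilities (hypothetical, not the churn data) and assumes a case is predicted as 1 when its probability is at least the cutoff. It records the number of predicted 1s (TP + FP) next to precision and recall at a few cutoffs.

```python
# Toy labels and predicted probabilities (hypothetical, for illustration only).
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

actual_pos = sum(y_true)  # TP + FN: the recall denominator, fixed for all cutoffs

rows = []
for cutoff in [0.05, 0.25, 0.45, 0.65, 0.85, 0.93]:
    # Actual labels of the cases that this cutoff predicts as 1.
    flagged = [label for label, p in zip(y_true, y_prob) if p >= cutoff]
    predicted_pos = len(flagged)  # TP + FP: the precision denominator
    tp = sum(flagged)
    precision = tp / predicted_pos if predicted_pos else float("nan")
    recall = tp / actual_pos
    rows.append((cutoff, predicted_pos, precision, recall))
    print(f"{cutoff:.2f}  TP+FP={predicted_pos:2d}  "
          f"precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data the recall denominator stays at 5 throughout, so recall only ever decreases. The precision denominator shrinks from 10 to 1, and at the highest cutoffs a single case moves precision from 0.50 to 1.00.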


**Ques 1** : Calculate the precision value for the following model.

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 400 | 100 |
| Churn | 50 | 150 |

**Ans** : 60%

**Ques 2** : Calculate the F1-score for the model in Ques 1 above.

**Ans** : 66.7% 

**F1-Score** = $2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
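The two answers can be checked with a few lines of arithmetic. This is a minimal sketch that reads the counts from the churn confusion matrix in Ques 1, taking rows as actual classes and columns as predicted classes; the variable names are chosen here for illustration.

```python
# Counts from the Ques 1 confusion matrix (rows = actual, columns = predicted).
tn, fp = 400, 100  # actual Not Churn: predicted Not Churn, predicted Churn
fn, tp = 50, 150   # actual Churn:     predicted Not Churn, predicted Churn

precision = tp / (tp + fp)  # 150 / 250
recall = tp / (tp + fn)     # 150 / 200
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.600 recall=0.750 f1=0.667
```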

**Ques 3** : When using the sensitivity-specificity tradeoff, you found out that the optimal cutoff point was 0.3. Now, when you plotted the precision-recall tradeoff, you got the following curve:

![image-19.png](attachment:image-19.png)

What is the optimal cutoff point according to the curve given above?

**Ans** : 0.42 (The optimal cutoff point is where the values of precision and recall will be equal. This is similar to what you saw in the sensitivity-specificity tradeoff curve as well. So, when precision and recall are both around 0.62, the two curves are intersecting. And at this place, if you extend the line to the X-axis as given, you can see that the threshold value is 0.42.)


**Ques 1** : Recall that in the last segment you saw that the cutoff based on the precision-recall tradeoff curve was approximately 0.42. When you take this cut-off, you get the following confusion matrix on the test set.

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 1294 | 234 |
| Churn | 223 | 359 |

What will the approximate value of accuracy be on the test set now?

**Ans** : 78%

**Ques 2** : For the confusion matrix you saw in the last question, what will the approximate value of recall be?

**Ans** : 62%
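Both answers follow from the test-set confusion matrix in Ques 1 above. Reading rows as actual classes and columns as predicted classes gives TP = 359, TN = 1294, FP = 234 and FN = 223, so the arithmetic works out as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{359 + 1294}{2110} = \frac{1653}{2110} \approx 0.78$$

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{359}{359 + 223} = \frac{359}{582} \approx 0.62$$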

![image-20.png](attachment:image-20.png)


### Graded Questions 

**Ques 1** : Suppose you got the following confusion matrix for a model by using a cutoff of 0.5. Calculate the sensitivity for this model. Now suppose that, for the same model, you changed the cutoff from 0.5 to 0.4 such that the number of true positives increased from 1050 to 1190. What will be the change in sensitivity?

| Actual/Predicted | Not Churn | Churn |
| --- | --- | --- |
| Not Churn | 1200 | 400 |
| Churn | 350 | 1050 |

Note: Report the answer in terms of new_value - old_value, i.e. if the sensitivity was, say, 0.6 earlier and then changed to 0.8, report it as (0.8 - 0.6), i.e. 0.2.

**Ans** : 0.1
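The answer can be worked out as follows. In the confusion matrix above, the actual Churn row contains 350 + 1050 = 1400 customers, and changing the cutoff does not change that total; it only moves customers between the false-negative and true-positive cells:

$$\text{Sensitivity}_{0.5} = \frac{1050}{1400} = 0.75, \qquad \text{Sensitivity}_{0.4} = \frac{1190}{1400} = 0.85$$

$$\text{Change} = 0.85 - 0.75 = 0.10$$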

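**Ques 2** : Calculate the values of precision and recall for the model in Ques 1 (at the cutoff of 0.5) and determine which of the two is higher.

**Ans** : Recall (75%) > Precision (72.41%), since recall = 1050 / (1050 + 350) and precision = 1050 / (1050 + 400).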

**Ques 3** : The True Positive Rate (TPR) metric is exactly the same as ----- .

**Ans** : Sensitivity

**Ques 4** : Suppose someone built a logistic regression model to predict whether a person has heart disease or not. All you have from their model is the following table, which contains data for 10 patients.

| Patient ID | Heart Disease | Predicted Probability for Heart Disease | Predicted Label |
| --- | --- | --- | --- |
| 1001 | 0 | 0.34 | 0 |
| 1002 | 1 | 0.58 | 1 |
| 1003 | 1 | 0.79 | 1 |
| 1004 | 0 | 0.68 | 1 |
| 1005 | 0 | 0.21 | 0 |
| 1006 | 0 | 0.04 | 0 |
| 1007 | 1 | 0.48 | 0 |
| 1008 | 1 | 0.64 | 1 |
| 1009 | 0 | 0.61 | 1 |
| 1010 | 1 | 0.86 | 1 |
 

Suppose you want to find the cutoff that was used to predict the classes, but it is not available to you. Can you identify which of the following cutoffs would be valid for the model above, based on the 10 data points given in the table? (More than one option may be correct.)

**Ans** : 0.50 and 0.55 (For patient 1007, the predicted probability is 0.48 and the predicted class is 0, so the cutoff has to be greater than 0.48. Also, for patient 1002, the predicted probability is 0.58 and the predicted class is 1, so the cutoff has to be less than 0.58.

Therefore, the cutoff can lie between 0.48 and 0.58, and hence 0.50 and 0.55 are both valid cutoffs for the model above.)
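The same reasoning can be checked mechanically. The sketch below copies the probabilities and predicted labels from the table and assumes the rule "predict 1 when the probability is at least the cutoff"; the helper `is_valid_cutoff` and the extra candidates 0.45 and 0.60 are hypothetical, added only to show cutoffs that fail.

```python
# (patient_id, predicted_probability, predicted_label), copied from the table above.
patients = [
    (1001, 0.34, 0), (1002, 0.58, 1), (1003, 0.79, 1), (1004, 0.68, 1),
    (1005, 0.21, 0), (1006, 0.04, 0), (1007, 0.48, 0), (1008, 0.64, 1),
    (1009, 0.61, 1), (1010, 0.86, 1),
]

# Highest probability still labelled 0, lowest probability already labelled 1.
max_prob_label_0 = max(p for _, p, label in patients if label == 0)
min_prob_label_1 = min(p for _, p, label in patients if label == 1)


def is_valid_cutoff(cutoff):
    """True if 'predict 1 when probability >= cutoff' reproduces every label."""
    return all((p >= cutoff) == (label == 1) for _, p, label in patients)


print(max_prob_label_0, min_prob_label_1)  # prints: 0.48 0.58
print([c for c in (0.45, 0.50, 0.55, 0.60) if is_valid_cutoff(c)])  # prints: [0.5, 0.55]
```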

**Ques 5** : For the model given above, calculate the values of accuracy, sensitivity, specificity, and precision. Which of these four metrics is the highest for the model?

**Ans** : Sensitivity

| Actual/Predicted | No Heart Disease | Heart Disease |
| --- | --- | --- |
| No Heart Disease | 3 | 2 |
| Heart Disease | 1 | 4 |

Sensitivity (80%) > Accuracy (70%) > Precision (67%) > Specificity (60%)
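The confusion matrix and the ranking above can be reproduced from the ten rows of the Ques 4 table. This is a minimal sketch in plain Python; the `pairs` list is copied from the "Heart Disease" and "Predicted Label" columns, and the variable names are chosen here for illustration.

```python
# (actual, predicted_label) for patients 1001 to 1010, copied from the Ques 4 table.
pairs = [(0, 0), (1, 1), (1, 1), (0, 1), (0, 0),
         (0, 0), (1, 0), (1, 1), (0, 1), (1, 1)]

# Count each cell of the confusion matrix.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

metrics = {
    "accuracy": (tp + tn) / len(pairs),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),
}

# Name of the metric with the largest value.
highest = max(metrics, key=metrics.get)

# Print the metrics from highest to lowest.
for name, value in sorted(metrics.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.0%}")
```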