<a href="https://colab.research.google.com/github/nalinis07/APT_Class_Copy_Links/blob/MASTER/AT_Lesson_73_Class_Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 73: Logistic Regression - Univariate Classification II

### Teacher-Student Activities

In this class, you will learn about the **sigmoid** function, which logistic regression uses to perform classification.

----

### Recap

In [None]:
# Import the required modules and load the heart disease dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

csv_file = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/uci-heart-disease/heart.csv'
df = pd.read_csv(csv_file)

# Print the number of records with and without heart disease.
print("Number of records in each label are")
print(df['target'].value_counts())

# Print the percentage of each label
print("\nPercentage of records in each label are")
print(df['target'].value_counts() * 100 / df.shape[0], "\n")

# Print the first five rows of the DataFrame.
df.head()

-----

#### Activity 1: Sigmoid Function

The sigmoid function maps any real number to a value between **0** and **1**. Its graph follows the shape of the English letter **S**. Mathematically, the sigmoid function is given as

$$y =  \frac{1}{1 + e^{-x}}$$

where,  

- $y$ is the output of the sigmoid function

- $x$ is an independent variable

- $e$ is Euler's number ($e \approx 2.718$)

Here, $x$ can take any real value, i.e., $x \in (-\infty, \infty)$, and $y$ always lies strictly between $0$ and $1$, i.e., $y \in (0, 1)$.

**Q: What is so great about the sigmoid function in machine learning?**

In machine learning, the output of the sigmoid function is interpreted as the probability of occurrence of an event. The probability of an event lies between 0 and 1, and so does the output of the sigmoid function. Hence, using the sigmoid function to calculate probabilities and classify outcomes (say 0 and 1) is a natural fit.

Before we go deeper into the sigmoid function, let's create a function in Python named `sigmoid()` that takes a series/array as an input and returns the corresponding sigmoid values.


In [None]:
# S1.1: Create a sigmoid function using the above formula.
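
One possible way to write `sigmoid()` is sketched below. It assumes the input is a number, a NumPy array, or a Pandas series, and it relies on `np.exp()` to apply the formula to every element at once.

```python
import numpy as np

def sigmoid(x):
    """Return 1 / (1 + e^(-x)) element-wise for a number, NumPy array or Pandas series."""
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))                          # 0.5
print(sigmoid(np.array([-10, 0, 10])))     # values close to 0, exactly 0.5, close to 1
```

For inputs that are far below zero, `np.exp(-x)` becomes very large and NumPy may warn about overflow. That does not happen for the small integers used in this lesson.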


Now that you have the sigmoid function, you can calculate its output for 100 random integers between -10 and 10 and plot the output against the corresponding input on a scatter plot to see the shape of the curve.

In [None]:
# S1.2: Create a numpy array having 100 random integers between -10 and 10. Pass the array as an input to the 'sigmoid()' function.


As you can see, the sigmoid function output is in the range of 0 to 1.

Let's create a scatter plot between the random integers and their corresponding sigmoid function output to check the shape of the curve.

In [None]:
# S1.3: Create a scatter plot for output array for the sigmoid function
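
The two steps above (generating the random integers and plotting their sigmoid outputs) could look roughly like this. The names `rand_ints` and `sig_out`, the seed, and the figure size are arbitrary choices made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The seed is an arbitrary choice so that the sketch is reproducible.
rng = np.random.default_rng(seed=42)
# The upper bound is exclusive, so 11 is used to include 10.
rand_ints = rng.integers(-10, 11, size=100)
sig_out = sigmoid(rand_ints)

plt.figure(figsize=(10, 4))
plt.scatter(rand_ints, sig_out)
plt.axhline(0.5, color='r', linestyle='--', label='y = 0.5')
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.legend()
plt.show()
```

Because the inputs are integers, the points fall on only 21 distinct $x$ positions, but they still trace out the S-shaped curve.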


As you can see, the arrangement of the points appears to form a shape of the English letter 'S'.

**Important Observations**

- *Observation 1:* If $x = 0$, then the output of the sigmoid function is $y = 0.5$ because

  $$y = \frac{1}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2} = 0.5$$

  Similarly, if $x = -1$, then the output of the sigmoid function is $y < 0.5$ because

  $$y = \frac{1}{1 + e^{1}} \approx \frac{1}{1 + 2.72} = \frac{1}{3.72} \approx 0.27 < 0.5$$

  And, if $x = 1$, then the output of the sigmoid function is $y > 0.5$ because

  $$y = \frac{1}{1 + e^{-1}} = \frac{e}{e + 1} \approx \frac{2.72}{3.72} \approx 0.73 > 0.5$$

- *Observation 2:* From the curve, you can also see that as the values on the $x$-axis increase, the values on the $y$-axis also increase. So you can say that **the sigmoid curve is strictly increasing**.

Based on the above two observations, the sigmoid output falls below 0.5 only for negative inputs and is 0.5 or more only for non-negative inputs. So, to obtain outputs on both sides of 0.5, the input values to the sigmoid function should include both negative and positive values. Hence, before building a univariate logistic regression model, first inspect the polarity (i.e. sign) of the values of the independent variable. If all the values are non-negative, then use the standard scaler method (subtract the mean and divide by the standard deviation) to normalise the values so that the values below the mean become negative.
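
Both observations can be checked numerically. The sketch below evaluates the sigmoid at $x = -1, 0, 1$, tests that the outputs keep rising over an evenly spaced grid, and shows what happens with an input that has no negative values. The array `positive_only` is a made-up example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Observation 1: outputs at -1, 0 and 1.
for x in (-1, 0, 1):
    print(f"sigmoid({x}) = {sigmoid(x):.4f}")   # 0.2689, 0.5000, 0.7311

# Observation 2: every output is larger than the one before it.
xs = np.linspace(-10, 10, 201)
ys = sigmoid(xs)
print(np.all(np.diff(ys) > 0))                  # True

# An input with only positive values never produces an output below 0.5.
positive_only = np.array([1, 2, 5, 10])
print(sigmoid(positive_only) >= 0.5)
```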

---

#### Activity 2: Classification Criteria in Logistic Regression

To classify an outcome as either **yes** or **no** (1 or 0), you need to choose a probability value as the **threshold value**. Let's say the threshold probability value is $0.5$. If for any input value, the corresponding sigmoid function output is
- less than 0.5, then you can label that outcome as 0 or **no**
- else, you can label that outcome as 1 or **yes**.

Let's create a function called `predict()` in Python that takes the output of the `sigmoid()` function and returns a Pandas series containing 0s and 1s as the output.

In [None]:
# S2.1: Create the 'predict()' function as described above.
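
A minimal sketch of such a `predict()` function is shown here. The `threshold` parameter (defaulting to 0.5) is an assumption of this sketch, added so that the same function can be reused with other threshold values later.

```python
import pandas as pd

def predict(sigmoid_output, threshold=0.5):
    """Label each sigmoid output as 1 if it is >= threshold, else 0."""
    labels = []
    for prob in sigmoid_output:
        if prob >= threshold:
            labels.append(1)
        else:
            labels.append(0)
    return pd.Series(labels)

# Hypothetical sigmoid outputs, used only to try the function out.
probs = pd.Series([0.1, 0.5, 0.73, 0.49])
print(predict(probs).tolist())   # [0, 1, 1, 0]
```

The explicit loop mirrors the description of the steps. For a NumPy array or Pandas series input, a vectorised expression such as `(sigmoid_output >= threshold).astype(int)` would give the same labels.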


In the above code,

- we iterate through each item of the `sigmoid_output` series,

- compare each item with the threshold value,

- if the item is greater than or equal to 0.5, we add `1` to a Python list, else add `0` to it,

- convert the Python list to a Pandas series using the `pd.Series()` function,

- return the Pandas series created

Now let's use the `predict()` function to classify the `sigmoid()` function outputs as 0 and 1.

In [None]:
# S2.2: Use the 'predict()' function to classify the 'sigmoid()' function outputs as 0 and 1.


Let's create a scatter plot between the random integers and their corresponding binary labels (i.e. 0 and 1).

In [None]:
# S2.3: Create a scatter plot between the random integers and their corresponding binary labels (i.e. 0 and 1).


With different threshold values, you will get different classifications. The best threshold value is the one that classifies the largest number of sigmoid function outputs correctly as 0 and 1.

Let's consider five different threshold probability values and classify the sigmoid function outputs as 0 and 1 based on them. Let's also create their scatter plots.

In [None]:
# S2.4: Consider 5 different threshold probability values and classify the sigmoid function outputs as 0 and 1 based on them.
# Also create their scatter plots.
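
As a rough sketch, the classification part of this step could loop over a list of thresholds and count the labels. The five threshold values, the seed, and the `threshold` parameter of `predict()` are assumptions of this sketch. The counts are printed instead of plotted to keep it short.

```python
import numpy as np
import pandas as pd

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(sigmoid_output, threshold=0.5):
    return pd.Series([1 if p >= threshold else 0 for p in sigmoid_output])

rng = np.random.default_rng(seed=42)             # arbitrary seed
rand_ints = rng.integers(-10, 11, size=100)      # 100 integers from -10 to 10
sig_out = sigmoid(rand_ints)

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]           # assumed threshold values
ones_per_threshold = {}
for t in thresholds:
    labels = predict(sig_out, t)
    ones_per_threshold[t] = int(labels.sum())
    zeros = len(labels) - ones_per_threshold[t]
    print(f"threshold = {t}: {ones_per_threshold[t]} ones, {zeros} zeros")
```

A higher threshold can only keep or reduce the number of points labelled 1, because fewer sigmoid outputs clear it.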


Now let's apply the same logic to classify the heart disease patients as 0 and 1 based on their cholesterol levels. But before that let's check the range of values in the `chol` column.

In [None]:
# S2.5: Get the descriptive statistics for the 'chol' column.


As you can see, all the values in the cholesterol column are non-negative. Let's first normalise them by calculating their $Z$-scores (the standard scaler method).

In [None]:
# S2.6: Normalise the 'chol' column values using the standard scaler method.
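
The $Z$-score of a value is $(x - \text{mean}) / \text{standard deviation}$. The sketch below applies it to a short, illustrative series of cholesterol-like values rather than to the full `chol` column. The helper name `standard_scaler()` is hypothetical.

```python
import pandas as pd

# Illustrative values only; in the lesson this would be df['chol'].
chol = pd.Series([233, 250, 204, 236, 354, 192], name='chol')

def standard_scaler(series):
    """Return the Z-scores of a Pandas series."""
    # Series.std() uses the sample standard deviation (ddof=1) by default,
    # whereas scikit-learn's StandardScaler divides by the population value (ddof=0).
    return (series - series.mean()) / series.std()

chol_scaled = standard_scaler(chol)
print(chol_scaled.round(3).tolist())
print("Has negative values:", bool((chol_scaled < 0).any()))
```

After scaling, the values below the mean are negative and the values above it are positive, which is what the sigmoid function needs to produce outputs on both sides of 0.5.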


In [None]:
# S2.7: Calculate the sigmoid output for both the scaled (or normalised) and non-scaled cholesterol values.
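
To see why scaling matters, the sketch below compares the sigmoid outputs for a few illustrative cholesterol-like values before and after scaling. The values are made up for this sketch and stand in for the `chol` column.

```python
import numpy as np
import pandas as pd

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative values only; in the lesson this would be df['chol'].
chol = pd.Series([233, 250, 204, 236, 354, 192])
chol_scaled = (chol - chol.mean()) / chol.std()

sig_raw = sigmoid(chol)            # e^(-x) is vanishingly small for x in the hundreds
sig_scaled = sigmoid(chol_scaled)

print(sig_raw.tolist())            # every value is 1.0
print(sig_scaled.round(3).tolist())
```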


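As you can see, all the probabilities (sigmoid outputs) for the non-scaled cholesterol values are 1. This is because the non-scaled values are large positive numbers, so $e^{-x}$ is practically 0. For the scaled `chol` values, the probabilities range between 0 and 1.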

Now for different thresholds between 0 and 1, let's classify whether a patient has heart disease or not based on the scaled cholesterol values.



In [None]:
# S2.8: Consider 5 different threshold probability values and classify the sigmoid function outputs as 0 and 1 based on them.
# Also create their scatter plots.


Now let's find out the number of 1s and 0s classified by the `predict()` function for a threshold of 0.5.

In [None]:
# S2.9: Find out the number of 1s and 0s classified by the 'predict()' function for a threshold of 0.5.


As you can see, the counts of values classified as 0 and 1 are almost the reverse of the counts of actual 0s and 1s in the dataset.

However, are the predicted 0s correctly classified as 0s and the predicted 1s correctly classified as 1s?

Let's find out by creating a confusion matrix.

In [None]:
# S2.10: Create a confusion matrix for the predicted values.
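
A confusion matrix cross-tabulates the actual labels against the predicted labels. The sketch below builds one with `pd.crosstab()` on two short, hypothetical label series. In the lesson, the actual labels would come from the `target` column and the predicted labels from `predict()`.

```python
import pandas as pd

# Hypothetical labels, used only to illustrate the layout of the matrix.
actual = pd.Series([1, 1, 1, 1, 0, 0, 0, 1, 0, 1], name='actual')
predicted = pd.Series([1, 0, 1, 1, 1, 0, 0, 0, 0, 1], name='predicted')

# Rows are the actual labels and columns are the predicted labels.
cm = pd.crosstab(actual, predicted)
print(cm)

tn, fp = cm.loc[0, 0], cm.loc[0, 1]   # actual 0: true negatives, false positives
fn, tp = cm.loc[1, 0], cm.loc[1, 1]   # actual 1: false negatives, true positives
print(f"TN = {tn}, FP = {fp}, FN = {fn}, TP = {tp}")   # TN = 3, FP = 1, FN = 2, TP = 4
```

The `confusion_matrix()` function in `sklearn.metrics` returns the same counts in the same row and column order as a NumPy array.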


There are a lot of false positives and false negatives in the predicted values. Let's print the f1-scores.

In [None]:
# S2.11: Print the f1-scores for the predicted values.
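
The f1-score of a label is the harmonic mean of its precision and recall. The sketch below computes it by hand from hypothetical confusion-matrix counts. The helper name `f1_from_counts()` is made up for this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """Return the f1-score given true positives, false positives and false negatives."""
    # Assumes tp > 0; otherwise precision or recall would divide by zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the heart disease dataset.
tp, fp, fn, tn = 4, 1, 2, 3

f1_class_1 = f1_from_counts(tp, fp, fn)
# For label 0 the roles swap: true negatives act as its true positives,
# false negatives as its false positives, and false positives as its false negatives.
f1_class_0 = f1_from_counts(tn, fn, fp)

print(f"f1-score for label 1: {f1_class_1:.4f}")   # 0.7273
print(f"f1-score for label 0: {f1_class_0:.4f}")   # 0.6667
```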


The f1-scores are also low. Hence, clearly, the cholesterol values alone cannot accurately predict whether a person has heart disease or not. You need to consider more features to build a logistic regression model for this purpose.

Let's stop here. In the next class, we will create a linear function using one of the features in the dataset and pass it as an input to the sigmoid function.

---

### **Project**

You can now attempt the **Applied Tech. Project 73 - Logistic Regression I - Univariate Classification II** on your own.


**Applied Tech. Project 73 - Logistic Regression I - Univariate Classification II**: https://colab.research.google.com/drive/15pzkxDsmwy4Nu5LoRygl8r5ZA7pAT8NH

---