In [None]:
# Logistic Regression

#### Probability vs Likelihood

*Probability:*

For example, consider a continuous distribution of mouse weights with mean = 32 grams, SD = 2.5, min = 24 grams, and max = 40 grams.

The probability that a randomly selected mouse weighs between 32 and 34 grams is the area under the curve between 32 and 34 grams. In this case the area under the curve is 0.29, meaning there is a 29% chance that a randomly selected mouse will weigh between 32 and 34 grams.

Expressed mathematically,

*Pr(weigh between 32 and 34 grams | mean=32, SD=2.5) = 0.29*

We can calculate the probability of other events by changing the left side of the expression. The right side, which defines the shape and location of the distribution, stays the same.

*Pr(mouse weight > 34 grams | mean=32, SD=2.5)*



![image.png](attachment:image.png)



*Likelihood:*

To talk about likelihood, assume that you have already weighed your mouse and it weighs 34 grams. The likelihood of that measurement, read off as the height of the curve at 34 grams, is 0.12.

![image.png](attachment:image.png)

Mathematically it is expressed as,

*L(mean=32, SD=2.5 | mouse weighs 34 grams) = 0.12*

In words, the likelihood of the distribution with mean = 32 and SD = 2.5, given that we weighed a 34-gram mouse, is 0.12.

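If we shift the distribution so that the new mean is 34, the likelihood of the same measurement rises, because 34 grams now sits at the peak of the curve. For a normal curve with SD = 2.5 the peak height is about 0.16.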

So with likelihood, the measurements on the right side are fixed, and we modify the shape and location of the distribution on the left side.

*Summary*

![image.png](attachment:image.png)
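The two numbers above can be reproduced with a short sketch. It assumes the mouse weights follow a normal distribution (the text only says "continuous") and ignores the stated min and max; the helper names `normal_pdf` and `normal_cdf` are illustrative, not from the original notebook.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal curve at x (used here as the likelihood)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Probability: the distribution is fixed, we ask about a range of weights.
prob_32_to_34 = normal_cdf(34, 32, 2.5) - normal_cdf(32, 32, 2.5)

# Likelihood: the measurement (34 g) is fixed, we vary the distribution.
likelihood_mean_32 = normal_pdf(34, mean=32, sd=2.5)
likelihood_mean_34 = normal_pdf(34, mean=34, sd=2.5)

print(round(prob_32_to_34, 2))       # 0.29
print(round(likelihood_mean_32, 2))  # 0.12
print(likelihood_mean_34 > likelihood_mean_32)  # True
```

Probability is an area, so it needs a range of weights; likelihood is a height, so it needs a single measurement.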

### Odds and Odds ratio

In [1]:
import pandas as pd

gender_df = pd.read_csv('oddData.csv')
gender_df.head(3)

|   | Gender | Purchase |
|---|--------|----------|
| 0 | Female | Yes      |
| 1 | Female | Yes      |
| 2 | Female | No       |


In [2]:
# creating frequency table using pandas crosstab feature

table = pd.crosstab(gender_df['Gender'], gender_df['Purchase'])
table

| Gender | Purchase = No | Purchase = Yes |
|--------|---------------|----------------|
| Female | 106           | 159            |
| Male   | 125           | 121            |


**Odds** describe the ratio of the probability of success to the probability of failure.

Probability of success = 159/265 (Yes / total number of females)

Probability of failure = 106/265 (No / total number of females)

*Odds = Probability of Success / Probability of Failure*

For females this gives 159/106 = 1.5. The higher the odds, the better the chance of success. Odds can take any value from 0 to infinity.
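As a minimal sketch, the odds for both groups and the odds ratio can be computed directly from the frequency table. The counts are copied from the crosstab output; the odds ratio is taken here as the female odds divided by the male odds, and the function name `odds_of_purchase` is illustrative.

```python
# Counts copied from the crosstab above.
counts = {
    "Female": {"No": 106, "Yes": 159},
    "Male": {"No": 125, "Yes": 121},
}

def odds_of_purchase(group):
    """Odds = P(Yes) / P(No); the group total cancels, leaving Yes / No."""
    total = group["Yes"] + group["No"]
    p_success = group["Yes"] / total
    p_failure = group["No"] / total
    return p_success / p_failure

female_odds = odds_of_purchase(counts["Female"])
male_odds = odds_of_purchase(counts["Male"])

# Odds ratio: how many times larger the female odds are than the male odds.
odds_ratio = female_odds / male_odds

print(f"Female odds: {female_odds:.2f}")  # 1.50
print(f"Male odds:   {male_odds:.2f}")    # 0.97
print(f"Odds ratio:  {odds_ratio:.2f}")   # 1.55
```

An odds ratio above 1 means the first group (females here) has higher odds of purchasing than the second.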


## Logistic Regression

Logistic Regression predicts whether something is True or False instead of predicting something continuous like size. It is therefore used for classification rather than regression.

Instead of fitting a line to the data, logistic regression fits an 'S'-shaped logistic function, also called the Sigmoid function.

The curve goes from 0 to 1, so it gives a probability. In linear regression the values on the y-axis can be anything, but with logistic regression the y-axis is confined to probability values between 0 and 1.

If the estimated probability is greater than 50%, the model predicts that the instance belongs to the positive class; otherwise it predicts that the instance belongs to the negative class.
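The S-shaped curve and the 50% decision rule can be sketched as follows. The input `z` stands for whatever score the fitted model produces for an instance; the values used here are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 (positive class) when the estimated probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical scores: strongly negative, neutral, strongly positive.
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

A score of 0 maps to a probability of exactly 0.5, which is the boundary between the two classes.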

One main difference between logistic regression and linear regression is how the model is fit to the data. With linear regression we fit the line using the '**least squares method**'. In other words, we find the line that minimizes the sum of squared residuals (SSE). We also use the residuals to calculate R-squared and to compare simpler models with more complicated ones.

Logistic regression doesn't have the same concept of a residual, so it doesn't use least squares or R-squared. Instead it uses something called **maximum likelihood**.
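To make the idea of maximum likelihood concrete, here is a minimal sketch that scores two hand-picked candidate curves on a toy data set. The data and both parameter pairs are hypothetical, and this is only a comparison, not an optimizer: a real fit would search for the parameters with the highest log-likelihood.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(intercept, slope, xs, ys):
    """Sum of the log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(intercept + slope * x)  # estimated P(y == 1)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

# Hypothetical toy data: one feature, binary labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

ll_curve = log_likelihood(-3.5, 1.0, xs, ys)  # an S-curve rising around x = 3.5
ll_flat = log_likelihood(0.0, 0.0, xs, ys)    # predicts 0.5 for every x

print(ll_curve > ll_flat)  # True: the S-curve explains the labels better
```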

### Q & A

• Is it possible to implement logistic regression with a neural network? Yes. A neural network is a universal approximator, so it can represent logistic regression; a single neuron with a sigmoid activation is a logistic regression model.

• To apply logistic regression to a multi-class classification problem, we can use the OneVsAll method.

• To find the best fit in logistic regression we use the maximum likelihood method.

• Mean squared error is not a good cost function for logistic regression. The model outputs a probability for a class label rather than an arbitrary real value, and combining MSE with the sigmoid gives a non-convex cost function, so log loss is used instead.

• A good way to compare logistic regression models is AIC; we prefer the model with the lower AIC.

• Standardization is not mandatory for logistic regression.

• For any real x from negative infinity to positive infinity, the logistic function returns a value between 0 and 1.

### Deriving Logistic Regression equation from Linear regression

The odds function has the advantage of transforming the probability, which takes values from 0 to 1, into an equivalent quantity with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from -∞ to ∞.


![image.png](attachment:image.png)

Odds function = probability of success / probability of failure

In linear regression Y takes values from -(infinity) to +(infinity), but in logistic regression Y takes values only from 0 to 1.

To recover the range from -(infinity) to +(infinity) in logistic regression, we use the odds function. Here the probability of success is Y and the probability of failure is (1-Y).

odds function = Y/(1-Y)

When Y = 0, the odds function is 0.

As Y approaches 1, the odds function approaches +infinity.

This gives values between 0 and infinity, but the range we need is from -(infinity) to +(infinity). To get it, we apply the natural log to the odds function, log(Y/(1-Y)).
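The remaining algebra can be sketched as follows, assuming a single input x and using β₀ and β₁ as illustrative names for the linear regression coefficients (these symbols do not appear in the original). Setting the log of the odds equal to the linear equation and solving for Y gives the sigmoid function:

$$
\begin{aligned}
\ln\left(\frac{Y}{1-Y}\right) &= \beta_0 + \beta_1 x \\
\frac{Y}{1-Y} &= e^{\beta_0 + \beta_1 x} \\
Y &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{aligned}
$$

The right-hand side can take any real value, while Y stays between 0 and 1.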

## Softmax Regression

To solve a multi-class classification problem with logistic regression, you can train and combine multiple binary classifiers (OneVsOne or OneVsAll). Alternatively, logistic regression can be generalized to support multiple classes directly; this is called Softmax Regression.

Given an instance, Softmax Regression first computes a score for each class, then estimates the probability of each class by applying the softmax function to the scores.

![image.png](attachment:image.png)
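A minimal sketch of the softmax function in plain Python is shown below. The three scores are hypothetical; subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    # Subtracting the max score avoids overflow in exp() without changing the output.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(scores)

# The predicted class is the one with the highest estimated probability.
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs], predicted_class)
```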

The Sigmoid function is used for two-class logistic regression, whereas the Softmax function is used for multiclass logistic regression.

Softmax regression Classifier can predict only one class at a time. So it should be used only with mutually exclusive classes (each instance belongs to only one class).

The objective of the softmax regression is to have a model that estimates a high probability for the target class (and consequently low probability for other classes). 

*Cross-Entropy* penalizes the model when it estimates a low probability for the target class. It is frequently used to measure how well a set of estimated class probabilities matches the target classes.

In older versions of Scikit-learn, LogisticRegression used one-versus-all by default when trained on more than two classes; you set the 'multi_class' hyperparameter to "multinomial" to switch it to Softmax Regression and specified a solver that supports it, such as "lbfgs". Since version 0.22 the default solver is "lbfgs" and 'multi_class' defaults to "auto", which already selects Softmax Regression for multiclass data.

It applies l2 regularization by default, which you can control using the hyperparameter C.
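A minimal usage sketch follows. It assumes Scikit-learn 0.22 or later, where the "lbfgs" solver with the default 'multi_class' setting fits a multinomial (softmax) model on multiclass data; on older versions you would pass `multi_class="multinomial"` explicitly. The iris data set and the values `C=10` and `max_iter=1000` are illustrative choices, not taken from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # three mutually exclusive classes

# C is the inverse of the l2 regularization strength:
# a larger C means weaker regularization.
softmax_reg = LogisticRegression(solver="lbfgs", C=10, max_iter=1000)
softmax_reg.fit(X, y)

# One row per instance, one column of estimated probability per class.
probabilities = softmax_reg.predict_proba(X[:1])
predicted = softmax_reg.predict(X[:1])

print(probabilities.round(3), predicted)
```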

### Cross Entropy and Embeddings

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability between 0 and 1. The loss increases as the predicted probability diverges from the actual label.
![image.png](attachment:image.png)
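The behaviour described above can be checked with a small sketch for the binary case. The function name and the three example probabilities are illustrative; the true label is fixed at 1 so that only the predicted probability changes.

```python
import math

def cross_entropy(true_label, predicted_prob):
    """Log loss for one binary example; predicted_prob is the estimated P(label == 1)."""
    if true_label == 1:
        return -math.log(predicted_prob)
    return -math.log(1 - predicted_prob)

# The true label is 1; the loss grows as the predicted probability moves away from it.
losses = [cross_entropy(1, p) for p in (0.9, 0.5, 0.1)]

print([round(loss, 3) for loss in losses])  # [0.105, 0.693, 2.303]
```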

Cross-entropy is used as the cost function when training classifiers. One-hot encoding of the labels works very well for most problems, until you get into a situation where you have thousands or millions of categories.

In that case the vector becomes very large and is zero almost everywhere, which is very inefficient. You can solve this problem with embeddings.

*OneHot Encoding* converts each value of a categorical variable into a new column and assigns it 1 or 0. This approach does not capture relationships between the categories; it gives equal importance to every column. Embeddings, by contrast, can capture those relationships.
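A short sketch of one-hot encoding shows why it cannot express relationships. The category values are hypothetical; the point is that any two different one-hot vectors have a dot product of 0, so no pair of categories looks more similar than any other.

```python
categories = ["cat", "dog", "bird"]  # hypothetical category values

def one_hot(value, categories):
    """Return a vector with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cat_vec = one_hot("cat", categories)
dog_vec = one_hot("dog", categories)
bird_vec = one_hot("bird", categories)

print(dog_vec)                 # [0, 1, 0]
print(dot(cat_vec, dog_vec))   # 0
print(dot(cat_vec, bird_vec))  # 0
```

The vector length equals the number of categories, which is why this representation becomes wasteful when there are very many of them.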

*Process*

We have an input variable, which is turned into logits using a linear model, Wx + b.

We then feed the logits (scores) into softmax to turn them into probabilities, and compare the probabilities to the labels using the cross-entropy function. The entire setup is called Multinomial Classification or Softmax Regression.

![cross%20Entropy.png](attachment:cross%20Entropy.png)
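The whole process can be sketched end to end in plain Python. The weights `W`, bias `b`, input `x`, and one-hot `label` below are hypothetical values for a problem with 3 classes and 2 features; in practice W and b would be learned by minimizing the loss.

```python
import math

def linear_logits(W, x, b):
    """Compute Wx + b: one score (logit) per class."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot_label):
    """-sum(label * log(prob)); only the target class contributes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t)

# Hypothetical parameters, input, and label.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b = [0.0, 0.1, -0.1]
x = [2.0, 1.0]
label = [1, 0, 0]  # the target is class 0

logits = linear_logits(W, x, b)     # step 1: linear model
probs = softmax(logits)             # step 2: scores to probabilities
loss = cross_entropy(probs, label)  # step 3: compare with the label

print([round(z, 2) for z in logits], [round(p, 3) for p in probs], round(loss, 3))
```

With these values the model puts most of its probability on class 1 rather than the target class 0, so the loss is relatively high.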