## Logistic Regression

Logistic Regression is used when the dependent variable (target or y variable) is categorical.

e.g
- predict an email is spam or not
- predict whether it's going to rain tomorrow or not

We cannot use linear regression for these kind of problems as linear regressio is unbounded. To solve these kind of problems is where Logistic Regression is used. 

Instead of fitting a line to the data (as was the case in linear regression), logistic regression fits an 'S' shaped logistic function

<img src='img/logistic-reg-1.png'/>

When the weights are closer to 0, y value indicates "Not Obese" and as the weight increases the y value slowly starts moving upwards to the "Obese" category.

Curve goes from 0 to 1

<img src='img/logistic-reg-2.png'/>

Consider a new point in the X-axis, there is a high probability that mouse is Obese

<img src='img/logistic-reg-3.png'/>

If we pick a point in the X-axis kind of in the middle, then 50% probability for the mouse to be Obese and so on.

Just like linear regression, logistic regression can also work with multiple independent variables. And the independent variables can be categorical or continous

In Linear regression, we fit the line using "least squares" method. i.e. We minimize the sum of the squares of the residuals. We also use R² to compare simple models to complicated models.

Logistic regression doesnt have any residuals and hence no least squares or R², instead it uses "maximum likelihood". Curve with the maximum likelihood is selected. We try fitting the curve that covers the maximum cordinates. 

In Linear regression, when we try the best fit line , the y value ranges from -infinity to +infinity. 
In Logistic regression, the y value as seen in the plot above ranges between 0 - 1 i.e. the probability values and that's a problem if we are trying to solve Logistic regression using Linear Models (Generalized linear models)
<img src='img/logistic-reg-4.png'/>

To solve the problem we can transform the y-axis from probability of obesity to the log(odds of obesity) as shown int he picture below
<img src='img/logistic-reg-5.png'/>

This transformation can be done using the <b>logit</b> function

<img src='img/logistic-reg-6.png'/>

### Logarithms - A quick refresher

<img src='img/logarithm-1.png'/>

<img src='img/logarithm-2.png'/>

### Odds & Log(Odds)

Odds are not the same as probabilities.

Odds are the ratio of something happening to something not happening.

e.g - ratio of my team winning / my team not winning

Probability is the ratio of something happening to everything that could happen.

e.g - ratio of my team winning / my team winning and losing

<img src='img/odds-1.png'/>

In the above example,

odds of winning = 5/3 = 1.7
probability of winning = 5/8 = 0.625
probability of losing = 3/8 = 0.375

if we take the ratio of probability of winning to probability of losing i.e. (5/8) / (3/8) = 5/3 which is the same as odds.

We can calculate the odds if we have the probability of winning. Let's call the probability of winning as 'p' then we can say

odds = (p) / (1 - p)

Why we need log(odds)

consider how the odds of winning changes based on how worst the team goes, 

e.g. 1/4 = 0.25 , 1/8 = 0.124 , 1/32 = .03125 .. as the odds of winning goes from bad to worse , the value approaches 0. 

if the odds are against the team winning then the value will be between 0 and 1

now let's consider how the odds of winning changes based on how the team performs better,

e.g. 4/3 = 1.3 , 8/3 = 2.7 , 32/3 = 10.7

If the odds are in favour of the team winning then the value will be between 1 and infinity

<img src='img/odds-2.png'/>

The magnitude of the odds of not in favour v/s in favour is not in the same scale and hence not symmetrical. Consider 1/6 and 6/1 in the below image

<img src='img/odds-3.png'/>

Now let's take log(odds) to make the whole thing symmetrical

<img src='img/odds-4.png'/>

Let's also quickly get introduced to the concept of 'odds ratio', consider the below dataset. 
The data represents 356 people
- 29 of these people have cancer
- 327 do not have cancer
- 140 people have mutated gene
- 216 people do not have mutated gene

<img src='img/odds-6.png'/>

We can use 'odds ratio' to determine if there is a relationship between mutated gene and cancer (if someone has mutated gene, are the odds higher that they have cancer ?)

Given that the person have mutated gene, the odds that they have cancer are - 23/117
Given that the person does not have mutated, gene the odds that they have cancer are - 6/210

odds ratio = (23/117) / (6/210) = 6.88
log(odds ratio) = log(6.88) = 1.93

The odds ratio tells us that the odds are 6.88 times greater, the person with mutated gene also will have cancer

Odds ratio and the log of the odd ratio are like R² , they indicate a relationship between two things. Just like R², the values corresponds to the effect size. i.e. larger value means that the mutated gene is a good predictor of cancer , smaller value means that the mutated gene is not a good predictor of cancer.

Just like R², we need to know that if the relationship is statistically significant, people uses 3 ways to determine if the odds ratio (or the log(odds ratio)) is statistically significant

- Fisher's Exact Test
- Chi-Square Test
- Wald Test

### Logit contd ...

Going back to the previous discussion of calculating odds from the probabilities,

<img src='img/odds-5.png'/>

After the transformation the new y axis will change from -infinity to +infinity as the y axis is presenting logit (log of the odds). The new plot will also be a straight line as opposed to a curve.

Now with logistic regression we can find the coefficients and errors. 

So far we were talking about a continous variable (weight) and it's relationship with obesity (categorical variable)

Let's see how the concept applies to a categorical variable

<img src='img/logistic-reg-7.png'/>

This type of logistic regression is similar to how t-test is done using linear models

<img src='img/logistic-reg-8.png'/>

<img src='img/logistic-reg-9.png'/>

We pair this equation with a design matrix to predict the size of the mouse given that it has normal or mutated version of the gene
Design matrix : - first column corresponds to the values to be substituted for B1 and the second column corresponds to the value to be substituted for B2,

<pre>
| 1 0 |
| 1 0 | 
| 1 0 |
| 1 0 |
| 1 1 |
| 1 1 |
| 1 1 |
| 1 1 |
</pre>

<img src='img/logistic-reg-10.png'/>

Now let's see how this concept of t-test is applied on logistic regression

<img src='img/logistic-reg-11.png'/>

<img src='img/logistic-reg-12.png'/>

size = log(odds geneₙₒᵣₘₐₗ) * B₁ + (log(odds geneₘᵤₜₐₜₑ𝒹) - log(odds geneₙₒᵣₘₐₗ)) * B₂

size = log(odds geneₙₒᵣₘₐₗ) * B₁ + (log(odds geneₘᵤₜₐₜₑ𝒹 / odds geneₙₒᵣₘₐₗ) * B₂

The second term here is log of odds ratio. It tells us, on a log scale, how much having the mutated gene increases(or decreases) the odds of mouse being obese 

When we solve the above equation with values, that's what we get as coefficients when we do logistic regression

In short, in terms of coefficients, logistic regression is the exact same as good old linear models except the coeffcients are in terms of log(odds)

### Finding the best fit curve a.k.a finding the maximum likelihood

<img src='img/logistic-reg-13.PNG'/>

We transform the y axis from probability of obesity to the log(odds of obesity). We cannot use least square in the new plot to get the best fit line because the residuals (distance from the data points to the line) are also infinity.

<img src='img/logistic-reg-14.PNG'/>

Instead of least squares we use maximum likelihood.

We project the original data points onto the candidate line. This gives each point a log odds value.

<img src='img/logistic-reg-15.PNG'/>

We can transofrm the candidate log(odds) to candidate probabilities by  using this formulae

p = eˡᵒᵍ⁽ᵒᵈᵈˢ⁾ / ( 1 + eˡᵒᵍ⁽ᵒᵈᵈˢ⁾)

<img src='img/logistic-reg-16.PNG'/>

Basically if we have probability as input, then we can convert them to log(odds) and if we have log(odds) as input we can transform back to probability. Let's figure out how we arrived at this formulae

log(p/(1-p)) = log(odds)

p / (1-p) = eˡᵒᵍ⁽ᵒᵈᵈˢ⁾

p = (1-p) * eˡᵒᵍ⁽ᵒᵈᵈˢ⁾ 

p = eˡᵒᵍ⁽ᵒᵈᵈˢ⁾ - p * eˡᵒᵍ⁽ᵒᵈᵈˢ⁾

p + p * eˡᵒᵍ⁽ᵒᵈᵈˢ⁾ =  eˡᵒᵍ⁽ᵒᵈᵈˢ⁾

p * (1 + eˡᵒᵍ⁽ᵒᵈᵈˢ⁾) = eˡᵒᵍ⁽ᵒᵈᵈˢ⁾

p = eˡᵒᵍ⁽ᵒᵈᵈˢ⁾ / ( 1 + eˡᵒᵍ⁽ᵒᵈᵈˢ⁾)

Now, for each points, pick the value on the y-axis , find the p value , plot the squiggle line.

Find the likelyhood of each mouse being obese looking at the probability value. 

Likelyhood for all of the obese mouse are the product of individual likelyhoods.

<img src='img/logistic-reg-17.PNG'/>

Although it's possible to calculate the likelihood as the product of the individual likelihoods, statisticians prefer to calculate the log of the likelihood instead. The squiggle that maximizes the likelihood is the same as the one that maximizes the log of the likelihood

log(likelihood of data given the squiggle) = log(0.49) + log(0.9) + log(0.91) + log(0.91) + log(0.92) + log(1-0.9) + log(1-0.3) + log(1-0.01) + log(1-0.01) 

Note - finding log (a * b ) = log a + log b

Now we keep repeating the process till we find the best(maximizes) likelihood.

### Found best fit curve - but how do we know it's useful ?

Like how R² provides a way to measure how useful the fit is, in Logistic Regression we can use log-likelihood (LL) as a measure.

We can use LL(fit) for the fitted line just like how we did SS(fit) for the fitted line in linear regression.

We have to find something similar to SS(mean) in linear regression which tells us the poorly fitted line.

We can find the log of odds of obesity for all the mice without considering weights. In the above example log(5/4) = 0.22. We get a horizontal line parallel to x and passing through y at point 0.22

We project the data onto this line, find the probability and plot the probability squiggle

<img src='img/logistic-reg-18.png'/>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<img src='img/logistic-reg-19.png'/>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<img src='img/logistic-reg-20.png'/>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<img src='img/logistic-reg-21.png'/>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<img src='img/logistic-reg-22.png'/>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Like in Linear Regression, even for Logistic Regression R² value falls between 0 and 1 (0 when fit is bad and 1 when fit is best)

### Accuracy for classification algorithms

In Regression algorithm we find the accuracy metrics using,
- MSE
- RMSE
- MAE
- MAPE
- R²

In Classification algorithm we find the accuracy metrics using,
- Confusion matrix
- proba graphs
- ROC AUC curves

### Confusion matrix

- <b>True Positives</b> - These are patients that got heart disease and was correctly identified by algorithm
- <b>True Negatives</b> - These are patients that did not have heard disease and was correctly identified by algorithm
- <b>False Negatives</b> - Patients has heart disease but algorithm said they didnt
- <b>False Positives</b> - Patients do not have heart disease but algorithm said they do

Associate 
True -> Model predicting correctly | False -> Model predicting incorrectly
Positive -> With heart condition | Negative -> Without heart condition

- True's are correct predictions 
- False's are incorrect predictions
    - False Negative - is incorrect Negative prediction. i.e. actual condition is positive.
    - False Positive - is incorrect Positive prediction. i.e actual condition is negative.

<img src='img/confmatrix-1.png'/>

We can run multiple models (e.g KNN , Random Forest), create the confusion matrix and then see which one performs the best.

### ROC AUC Curve

ROC - Receiver Operations Characteristics
AUC - Area Under Curve

In the graphs seen earlier for Logistic Regression the Y axis represents the probability values.

<img src='img/roc-1.png'/>

We can define a threshold line (say at y = 0.5) above which we shall classify all mice as obese and below which the mice will be classified as not obese. Now when a new observation comes it's easier to predict the class based on the probability. But, how do we find the perfect threshold forms the crux of the problem

Consider the 8 new data (4 obese and 4 non obese mice) for which we know the classification already and we try to run it through the model using the threshold 0.5

<img src='img/roc-2.png'/>

As we run through the classification, we see that 
- 3 non obese mice is predicted correctly
- 3 obese mice is predicted correctly
- 1 non obese is predcited as obese
- 1 obese is predicted as non-obese

<img src='img/roc-3.png'/>

We plot a confusion matrix with these numbers

<img src='img/roc-4.png'/>

Now it's clear that if we play around with the threshold value (any value between 0 and 1) we get a whole bunch of confusion matrices to choose from. Based on the problem statement in hand (e.g cancer detection - we dont want to miss catching any cases which means False Positives are acceptable) we have to find an optimum threshold. How do we find the best threshold ? That's where ROC curve comes in handy

ROC Curve - is a graph between FPR (False Positive Rate) & TPR (True Positive Rate)

TPR (Sensitivity) = (True Positives) / (True Positives + False Negatives)

FPR (1 - Specificity) = (False Positives) / (False Positives + True Negatives)

<img src='img/roc-5.png'/>

Green diagonal line shows where the TPR = FPR. 
Like you see, we plot the ROC curve with changing thresholds and thereby computing new TPRs and FPRs. 

The top most point on the y axis shows that model correctly classified 75% of the obese samples and 100% of the non obese samples.

ROC graph summarizes all of the confusion matrices each of the threshold produced.

<b> AUC - Area Under the Curve </b>
Higher the area under the curve better is the curve.

If red ROC curve represents logistic regression and blue represents random forest, we can choose logistic regression.

<img src='img/auc-1.png'/>

Note : instead of using FPR people might use precision
Precision = (True Positives) / (True Positives + False Positives)


### Binary Classification

When we have 2 values in the y variable

### Multi class classification

We can use an approach called <b>one v/s rest classifier</b>

Let's say the y variable has values like 1,2,3 then it's a multi class classification problem.. 

Approach is going to be as follows,

- Take all the rows from the training data where y=1. Let's assume there are 'n' rows. Flag the new y value as 'yes'
- Take 'n' rows from the remaining dataset which will have values other than y=1. Flag the y value for this set as 'no'
- Fit the data into a model (say model1) 
- Repeat the above steps for y=2 and y=3
- After the models are fitted for y=1,2,3 we have 3 models (model1, model2, model3) who can classify 1,2,3 respectively.
- The test data will be passed to the predict method of all the 3 models. The model giving the highest probability for 'yes' will be chosen and the 'y' variable it's trained to predict for will be the predicted 'y'variable. An example is given below which will clarify this further.

In [2]:
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randint(0,10,(1000,5)))
data[4]=np.random.randint(0,3,1000)
data.head()

Unnamed: 0,0,1,2,3,4
0,5,8,3,4,1
1,1,6,6,1,0
2,3,6,0,5,2
3,8,2,1,0,2
4,3,6,7,6,0


In [3]:
y=data[4]
x=data.drop(4,axis=1)
y.shape,x.shape

((1000,), (1000, 4))

In [4]:
y=data[4]
x=data.drop(4,axis=1)
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x,y)
lr.predict(x)

array([0, 0, 2, 1, 2, 2, 0, 0, 0, 0, 1, 2, 2, 0, 1, 2, 0, 1, 0, 0, 0, 2,
       0, 0, 2, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 2, 1, 2, 0, 0, 1, 2, 1, 2,
       0, 2, 2, 1, 2, 0, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2,
       0, 0, 2, 2, 1, 2, 2, 0, 1, 2, 1, 2, 0, 0, 2, 0, 2, 1, 1, 0, 2, 2,
       2, 0, 0, 2, 1, 1, 2, 2, 0, 1, 1, 0, 2, 1, 1, 0, 0, 2, 2, 1, 2, 2,
       2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 2, 1, 0, 2, 0, 2, 2, 0, 2,
       2, 0, 2, 1, 2, 1, 2, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 2, 1, 0, 2, 2,
       2, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 2,
       0, 2, 1, 0, 1, 2, 1, 1, 2, 0, 2, 1, 2, 0, 1, 1, 2, 0, 2, 2, 0, 0,
       0, 0, 0, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0, 2, 2, 2, 1, 0, 1,
       2, 2, 2, 1, 2, 2, 0, 2, 2, 1, 1, 1, 0, 2, 1, 0, 0, 2, 2, 2, 2, 0,
       0, 1, 2, 2, 1, 0, 1, 2, 0, 2, 2, 0, 1, 1, 0, 2, 2, 2, 1, 2, 0, 1,
       2, 1, 0, 2, 2, 1, 0, 2, 2, 0, 2, 0, 0, 2, 1, 2, 2, 1, 0, 0, 2, 0,
       2, 0, 1, 0, 0, 2, 2, 0, 2, 0, 0, 0, 2, 2, 1,

In [5]:
data

Unnamed: 0,0,1,2,3,4
0,5,8,3,4,1
1,1,6,6,1,0
2,3,6,0,5,2
3,8,2,1,0,2
4,3,6,7,6,0
...,...,...,...,...,...
995,9,7,1,7,0
996,4,7,4,5,1
997,1,9,0,4,1
998,4,5,6,5,2


In [6]:
#d2=data[data[4]!=0].sample(d1.shape[0])
d0_yes = data[data[4] == 0]
d0_no = data[data[4] != 0]
d0_yes.shape, d0_no.shape

((337, 5), (663, 5))

In [7]:
# take equal number of samples for y != 0 as y = 0
d0_no = data[data[4]!=0].sample(d0_yes.shape[0])
d0_yes.shape, d0_no.shape

((337, 5), (337, 5))

In [8]:
import warnings
warnings.filterwarnings('ignore')

In [20]:
predict_values=[]
test_dataset=pd.DataFrame(np.random.randint(10,100,(100,4)))
for i in data[4].unique():
    d1=data[data[4]==i]
    d1[4]="yes"
    d2=data[data[4]!=i].sample(d1.shape[0])
    d2[4]="No"
    
    final_dataset=d1.append(d2)
    print(i,final_dataset[4].unique(),d1.shape,d2.shape,final_dataset.shape)
    y=final_dataset[4]
    x=final_dataset.drop(4,axis=1)
    from sklearn.linear_model import LogisticRegression
    lr=LogisticRegression()
    lr.fit(x,y)
    
    predict_values.append(lr.predict_proba(test_dataset))

1 ['yes' 'No'] (322, 5) (322, 5) (644, 5)
0 ['yes' 'No'] (337, 5) (337, 5) (674, 5)
2 ['yes' 'No'] (341, 5) (341, 5) (682, 5)


In [15]:
predict_values[2]

array([[0.49675417, 0.50324583],
       [0.14853238, 0.85146762],
       [0.01935312, 0.98064688],
       [0.03418572, 0.96581428],
       [0.11773769, 0.88226231],
       [0.22072973, 0.77927027],
       [0.12228571, 0.87771429],
       [0.16933556, 0.83066444],
       [0.53618424, 0.46381576],
       [0.01256422, 0.98743578],
       [0.1864411 , 0.8135589 ],
       [0.71906647, 0.28093353],
       [0.44459149, 0.55540851],
       [0.19153542, 0.80846458],
       [0.26659083, 0.73340917],
       [0.22982042, 0.77017958],
       [0.4678813 , 0.5321187 ],
       [0.7783377 , 0.2216623 ],
       [0.52221452, 0.47778548],
       [0.43491274, 0.56508726],
       [0.02020308, 0.97979692],
       [0.2107718 , 0.7892282 ],
       [0.1250816 , 0.8749184 ],
       [0.38601765, 0.61398235],
       [0.44227896, 0.55772104],
       [0.14582325, 0.85417675],
       [0.26052121, 0.73947879],
       [0.42844235, 0.57155765],
       [0.61118175, 0.38881825],
       [0.4604117 , 0.5395883 ],
       [0.

In [14]:
predict_values[2].shape

(100, 2)

In [21]:
predict_values=np.array(predict_values)
predict_values.shape

# 3 model outputs 
# each model output has 100 rows and 2 columns 
# first column is probability of 'yes' and second column is probability of 'no'

(3, 100, 2)

In [22]:
# from each model output take the first column alone i.e. the probability of 'yes' and combine them
coordinates=list(zip(predict_values[0][:,0],predict_values[1][:,0],predict_values[2][:,0]))
coordinates=np.array(coordinates)
# coordinates is holding the probability of 'yes' from all the 3 models. Pick the one with maximum probability
coordinates.argmax(axis=1)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       2, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [23]:
coordinates

array([[0.98988895, 0.69738051, 0.01899272],
       [0.96256411, 0.70076047, 0.06323923],
       [0.94966665, 0.66557807, 0.53931507],
       [0.96955348, 0.79523224, 0.04354932],
       [0.97076093, 0.51987905, 0.10906437],
       [0.93377472, 0.50296839, 0.5480513 ],
       [0.92252093, 0.35862915, 0.74827202],
       [0.93677996, 0.48390685, 0.58708876],
       [0.98853996, 0.60584622, 0.24246191],
       [0.97940182, 0.72922648, 0.04361155],
       [0.98255276, 0.60715601, 0.04496913],
       [0.98801244, 0.70848785, 0.02826812],
       [0.98425349, 0.56299503, 0.09684469],
       [0.76519471, 0.56125072, 0.71872673],
       [0.8990476 , 0.51933953, 0.49998503],
       [0.95229961, 0.41464397, 0.60550754],
       [0.992032  , 0.62413924, 0.05556112],
       [0.88983425, 0.70422782, 0.14800316],
       [0.83078783, 0.42144152, 0.81102718],
       [0.85358153, 0.52708833, 0.51819678],
       [0.96968608, 0.58139561, 0.1809676 ],
       [0.971646  , 0.58993199, 0.21162826],
       [0.

In [23]:
predict_values[0]

array([[0.97709031, 0.02290969],
       [0.95418512, 0.04581488],
       [0.89550491, 0.10449509],
       [0.97929285, 0.02070715],
       [0.92155804, 0.07844196],
       [0.93452805, 0.06547195],
       [0.92693005, 0.07306995],
       [0.95787465, 0.04212535],
       [0.93992463, 0.06007537],
       [0.99619211, 0.00380789],
       [0.97817384, 0.02182616],
       [0.94283052, 0.05716948],
       [0.97880807, 0.02119193],
       [0.78512915, 0.21487085],
       [0.96915552, 0.03084448],
       [0.93841656, 0.06158344],
       [0.90898377, 0.09101623],
       [0.85627968, 0.14372032],
       [0.96143279, 0.03856721],
       [0.97777035, 0.02222965],
       [0.99379601, 0.00620399],
       [0.99432379, 0.00567621],
       [0.98392006, 0.01607994],
       [0.96387818, 0.03612182],
       [0.99339099, 0.00660901],
       [0.97057733, 0.02942267],
       [0.95454852, 0.04545148],
       [0.97492076, 0.02507924],
       [0.98510154, 0.01489846],
       [0.98102603, 0.01897397],
       [0.

- we fitted data into 3 models (nunique of y variable is 3). Data fitted,
	For model1 is y = 0 , y != 0 (equal number of rows as in y = 0)
	For model2 is y = 1 , y != 1 (equal number of rows as in y = 1)
	For model3 is y = 2 , y != 2 (equal number of rows as in y = 2)
- Each model does prediction on the same test data set.
- predict_values will have the probabilities of 'yes' and 'no' for each record in the test data set. 
- We pick all the 'yes' probabilities. Find the argument of the maximum probability. Let's imagine the argmax for a given row is 1 , this mean that the model2 gave the best result and hence the predicted y value for that row is 1

## Stats model

In [2]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

x=pd.DataFrame(np.random.randint(0,10,(100,5)))
x.columns=['x1','x2','x3','x4','x5']
y=np.random.randint(0,10,100)
### extra information as opposed to models
import statsmodels.api as sm
import statsmodels.formula.api as sfa
model=sm.OLS(y,x)
lm=model.fit()
lm.summary()

0,1,2,3
Dep. Variable:,y,R-squared (uncentered):,0.686
Model:,OLS,Adj. R-squared (uncentered):,0.67
Method:,Least Squares,F-statistic:,41.59
Date:,"Sat, 11 Apr 2020",Prob (F-statistic):,1.76e-22
Time:,22:54:06,Log-Likelihood:,-252.89
No. Observations:,100,AIC:,515.8
Df Residuals:,95,BIC:,528.8
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.1520,0.093,1.627,0.107,-0.033,0.338
x2,0.1160,0.093,1.247,0.216,-0.069,0.301
x3,0.1536,0.102,1.508,0.135,-0.049,0.356
x4,0.2146,0.107,2.014,0.047,0.003,0.426
x5,0.2962,0.088,3.361,0.001,0.121,0.471

0,1,2,3
Omnibus:,13.708,Durbin-Watson:,1.781
Prob(Omnibus):,0.001,Jarque-Bera (JB):,4.167
Skew:,0.049,Prob(JB):,0.125
Kurtosis:,2.005,Cond. No.,4.08


### F Statistics

ANOVA - Analysis of variance

SST - Treatment variation

Skewness

Kurbosis

<img src='img/skewness.png' />

## TODO

Read all the concepts above

Recording at https://drive.google.com/open?id=1Ur-GFmT2vp17mtG8fo3Qrpd7WqFG7AAE 

https://drive.google.com/open?id=1Ur-GFmT2vp17mtG8fo3Qrpd7WqFG7AAE 
