# Logistic Regression


## Assumptions

1. **Independence of observations**: observations are not related to one another. Time-series data is a common counterexample, since consecutive observations are usually correlated.
2. **Linear relationship**: the change in the response due to a one-unit change in a predictor is constant, regardless of the value of that predictor.
3. **Normality of residuals**: residuals should follow a normal distribution, which can be assessed with a histogram or a Q-Q plot.
4. **No or little multicollinearity**: predictors should not be highly correlated with each other, which can be assessed with the VIF (Variance Inflation Factor).
5. **Homoscedasticity (constant variance)**: the variance of the error is constant across the values of the predictors.
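
The VIF check in item 4 can be sketched in a few lines. This is a minimal illustration, assuming the standard definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. With only two predictors, $R_j^2$ is just their squared correlation. The data and the helper names `pearson_r` and `vif_two_predictors` are made up for this example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]    # almost a copy of x1
x3 = [1.0, -2.0, 2.0, -2.0, 1.0]  # uncorrelated with x1

vif_collinear = vif_two_predictors(x1, x2)
vif_independent = vif_two_predictors(x1, x3)
print(vif_collinear, vif_independent)
```

A VIF of 1 means no collinearity. A common rule of thumb treats values above 5 or 10 as a warning sign.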

## Advantages

1. TBD

## Limitations

1. TBD


## Least Squares

The least squares approach chooses $\hat{\beta}$ to minimize the **RSS (Residual Sum of Squares)**

<br>
<center>
$RSS = \sum(y_i - \hat{y_i})^2 = \sum(y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2} - ... - \hat{\beta_p}x_{ip})^2$
</center>
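
As a minimal sketch of the idea, the snippet below fits a model with a single predictor using the closed-form least squares solution and evaluates the RSS. The data and the function names `fit_simple_ols` and `rss` are made up for illustration and are not part of the original notes.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = beta0 + beta1 * x (one predictor)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def rss(x, y, beta0, beta1):
    """Residual sum of squares for a given pair of coefficients."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = fit_simple_ols(x, y)
rss_hat = rss(x, y, beta0_hat, beta1_hat)
print(beta0_hat, beta1_hat, rss_hat)

# Moving away from the least squares coefficients increases the RSS
print(rss(x, y, beta0_hat + 0.1, beta1_hat) > rss_hat)
```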

## Standard Error, Confidence Interval, Hypothesis Test

The sample mean $\hat{\mu}$ is an unbiased estimator of the population mean $\mu$, in the sense that on average, we expect $\hat{\mu}$ to equal $\mu$. However, a single estimate $\hat{\mu}$ may be a substantial underestimate or overestimate of $\mu$.

The standard error of $\hat{\mu}$, written as $SE(\hat{\mu})$, measures how far off that single estimate $\hat{\mu}$ is likely to be

<br>
<center>
$Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$
</center>

where $\sigma$ is the standard deviation of each realization $y_i$ of $Y$

In general, $\sigma^2$ is unknown but can be estimated from the data. The estimate of $\sigma$ is known as the **RSE (Residual Standard Error)**. For simple linear regression with one predictor it is

<br>
<center>
$RSE = \sqrt{RSS/(n-2)}$
</center>
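
The two quantities above can be computed directly. This is a rough sketch on made-up data. It assumes $\sigma$ in $SE(\hat{\mu})$ is estimated with the sample standard deviation (dividing by $n-1$), which is a choice made here and not stated in the notes. The RSE uses $n-2$ because the simple linear fit estimates two coefficients.

```python
import math

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

# Standard error of the sample mean: SE(mu_hat) = sigma / sqrt(n),
# with sigma estimated by the sample standard deviation (assumption)
y_bar = sum(y) / n
s = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
se_mean = s / math.sqrt(n)

# Residual standard error of the simple linear fit: sqrt(RSS / (n - 2))
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar
rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
rse = math.sqrt(rss / (n - 2))

print(se_mean, rse)
```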

Standard errors can be used to compute **confidence intervals**. A $95\%$ confidence interval is defined as a range of values such that with $95\%$ probability, the range will contain the true unknown value of the parameter.

For linear regression, the 95% confidence interval for $\beta_i$ approximately takes the form

<br>
<center>
$\hat{\beta_i} \pm 2 \cdot SE(\hat{\beta_i})$
</center>

or more precisely
<br>
<center>
$\hat{\beta_i} \pm {t_{0.975,n-2}} \cdot SE(\hat{\beta_i})$
</center>
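
A sketch of both intervals for the slope of a simple linear fit, on made-up data. It assumes the standard formula $SE(\hat{\beta_1})^2 = \sigma^2 / \sum(x_i - \bar{x})^2$ with $\sigma$ replaced by the RSE, which is not spelled out in the notes. The quantile $t_{0.975,3} \approx 3.182$ is hard-coded from a t-table because $n - 2 = 3$ here.

```python
import math

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar

rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
rse = math.sqrt(rss / (n - 2))

# Standard error of the slope (assumed formula, see lead-in)
se_beta1 = rse / math.sqrt(sxx)

# Approximate interval: beta1 +/- 2 * SE
ci_approx = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)

# More precise interval: beta1 +/- t_{0.975, n-2} * SE
T_975_DF3 = 3.182  # table value for 3 degrees of freedom
ci_exact = (beta1 - T_975_DF3 * se_beta1, beta1 + T_975_DF3 * se_beta1)

print(ci_approx, ci_exact)
```

With only five observations the t-based interval is noticeably wider than the $\pm 2 \cdot SE$ approximation. The two get closer as $n$ grows.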
    
Standard errors can also be used to perform hypothesis tests on coefficients. For example, $H_0$: There is no relationship between $X_i$ and $Y$

<br>
<center>
$H_0: \beta_i = 0$
</center>
    
To test the null hypothesis, we need to determine whether $\hat{\beta_i}$ is sufficiently far from zero that we can be confident that $\beta_i$ is non-zero. In practice, we compute a **t-statistic** given by

<br>
<center>
$t = \frac{\hat{\beta_i} - 0}{SE(\hat{\beta_i})}$
</center>
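
The t-statistic for the slope of a simple linear fit can be sketched as follows. The data and the function name `slope_t_statistic` are made up for illustration, and $SE(\hat{\beta_1})$ is assumed to be $RSE / \sqrt{\sum(x_i - \bar{x})^2}$.

```python
import math


def slope_t_statistic(x, y):
    """t-statistic for H0: beta1 = 0 in the model y = beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    se_beta1 = rse / math.sqrt(sxx)  # assumed standard-error formula
    return (beta1 - 0.0) / se_beta1


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

t_stat = slope_t_statistic(x, y)
print(t_stat)
```

A large absolute value of `t_stat` relative to the $t_{n-2}$ distribution is evidence against $H_0$.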
    
## $R^2$ Statistic
    
The $R^2$ statistic measures the proportion of variability in $Y$ that can be explained using $X$

<br>
<center>
$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$
</center>

- $TSS = \sum(y_i - \bar{y})^2$, total sum of squares, measures the total variance in the response $Y$
- $RSS = \sum(y_i - \hat{y_i})^2$, residual sum of squares, measures the amount of variability that is left unexplained after the modeling
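
A small sketch of the computation, using made-up data and a hypothetical helper `r_squared`. The fitted values come from a simple linear least squares fit.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - rss / tss


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Simple linear least squares fit
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * xi for xi in x]

r2 = r_squared(y, y_hat)
print(r2)

# Predicting the mean for every observation explains nothing: R^2 = 0
r2_mean_only = r_squared(y, [y_bar] * n)
print(r2_mean_only)
```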

## (1) Is there a relationship between the response and predictors?

Perform hypothesis testing with 
<br>
<center>
$H_0: \beta_1 = \beta_2 = ... = \beta_p = 0$
<br>
$H_a$: at least one $\beta_j$ is non-zero
</center>

The hypothesis test is performed by computing the F-statistic
<br>
<center>
$F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$
</center>
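
The formula translates directly into code. The sketch below uses made-up data with a single predictor ($p = 1$) so that the fit stays short. The function name `f_statistic` is hypothetical.

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 1

# Simple linear least squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar

tss = sum((yi - y_bar) ** 2 for yi in y)
rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

f_stat = f_statistic(tss, rss, n, p)
print(f_stat)
```

When $H_0$ is true the F-statistic is expected to be close to 1. A value much larger than 1 suggests that at least one predictor is related to the response.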

**Note**: Individual p-values for each variable cannot be used to check whether there is a relationship between the response and the predictors. For instance, when $p = 100$ and $H_0: \beta_1 = \beta_2 = ... = \beta_p = 0$ is true, no variable is truly associated with the response. However, about $5\%$ of the p-values associated with the individual variables will still be below $0.05$ by chance.
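
The arithmetic behind this note, as a quick sketch. It assumes the 100 individual tests are independent, which is a simplification made here for illustration.

```python
p = 100       # number of predictors, each tested individually
alpha = 0.05  # significance level of each test

# Expected number of p-values below alpha when H0 is true for every predictor
expected_false_positives = p * alpha

# Probability that at least one p-value falls below alpha
# (assuming independent tests)
prob_at_least_one = 1 - (1 - alpha) ** p

print(expected_false_positives, prob_at_least_one)
```

Under these assumptions roughly 5 of the 100 predictors look significant by chance, and seeing at least one such predictor is almost certain.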

**Limitation**: The approach of using F-statistic to test for any association between the predictors and response works when $p$ is relatively small, and certainly smaller compared to $n$. If $p > n$ then there are more coefficients $\beta_j$ to estimate than observations from which to estimate them.

In this case, we cannot fit the multiple linear regression with least squares, so the F-statistic cannot be used.

When $p$ is large, other approaches such as forward selection can be used

## (2) Prediction

- **Confidence Interval** is used to quantify the uncertainty surrounding the average `sales` over a *large number of cities*
- **Prediction Interval** is used to quantify the uncertainty surrounding `sales` for a *particular city*

Prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about `sales` for a given city in comparison to the average `sales` over many locations
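
A sketch comparing the two interval widths for a simple linear fit. It uses made-up data (not the `sales` data) and assumes the standard formulas: the standard error of the mean response at $x_0$ is $RSE \cdot \sqrt{1/n + (x_0 - \bar{x})^2 / \sum(x_i - \bar{x})^2}$, and the prediction interval adds 1 under the square root to account for the irreducible error. The multiplier 2 is the same rough approximation used for the coefficient intervals.

```python
import math

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Simple linear least squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar
rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
rse = math.sqrt(rss / (n - 2))

x0 = 3.5                     # hypothetical new predictor value
y0_hat = beta0 + beta1 * x0  # point prediction

leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
conf_half_width = 2 * rse * math.sqrt(leverage)        # average response at x0
pred_half_width = 2 * rse * math.sqrt(1.0 + leverage)  # one new response at x0

print(y0_hat - conf_half_width, y0_hat + conf_half_width)
print(y0_hat - pred_half_width, y0_hat + pred_half_width)
```

Both intervals are centered at the same point prediction. Only the width differs.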

## Practice

https://realpython.com/logistic-regression-python/

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

import numpy as np
```

```python
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
```

```python
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(x, y)
```

    LogisticRegression(random_state=0, solver='liblinear')

```python
model.intercept_
```

    array([-1.04608067])

```python
model.coef_
```

    array([[0.51491375]])

```python
# Probability estimates
model.predict_proba(x)
```

    array([[0.74002157, 0.25997843],
           [0.62975524, 0.37024476],
           [0.5040632 , 0.4959368 ],
           [0.37785549, 0.62214451],
           [0.26628093, 0.73371907],
           [0.17821501, 0.82178499],
           [0.11472079, 0.88527921],
           [0.07186982, 0.92813018],
           [0.04422513, 0.95577487],
           [0.02690569, 0.97309431]])

```python
# Predict class labels
model.predict(x)
```

    array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

```python
# Return the mean accuracy on the given test data and labels
model.score(x, y)
```

    0.9
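
To see where these numbers come from, the sketch below recomputes the class-1 probabilities, the predicted labels, and the accuracy by hand from the fitted intercept and coefficient shown above. It uses only the standard library. The 0.5 threshold and the manual confusion counts are written out here for illustration and are not taken from the tutorial.

```python
import math

# Fitted parameters copied from the output above
intercept = -1.04608067
coef = 0.51491375


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


xs = list(range(10))
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# P(y = 1 | x), i.e. the second column of predict_proba
p1 = [sigmoid(intercept + coef * xi) for xi in xs]

# Predict class 1 when the probability exceeds 0.5
labels = [1 if p > 0.5 else 0 for p in p1]

# Mean accuracy, as returned by model.score
accuracy = sum(int(a == b) for a, b in zip(y_true, labels)) / len(y_true)

# Confusion counts: true negatives, false positives, false negatives, true positives
tn = sum(1 for t, l in zip(y_true, labels) if t == 0 and l == 0)
fp = sum(1 for t, l in zip(y_true, labels) if t == 0 and l == 1)
fn = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 0)
tp = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 1)

print([round(p, 4) for p in p1])
print(labels, accuracy)
print(tn, fp, fn, tp)
```

The single error is the observation at `x = 3`, which has true label 0 but a predicted probability above 0.5.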