# Logistic Regression

Linear regression assumes a quantitative response variable, so it is not well suited to classification problems, where the response is qualitative. In many situations the response variable is qualitative, for example yes / no, true / false, or mild / moderate / severe. Qualitative variables are often referred to as *categorical*. Logistic regression is a classification technique (a classifier) that can be used to predict a qualitative response. Logistic regression gives a discrete outcome (cancer or not), whereas linear regression gives a continuous outcome (the value of a car).

## Types of Logistic Regression

* *Binary Logistic Regression*

  The categorical response has only two possible outcomes, for example Fair / Unfair or Agree / Disagree.

* *Multinomial Logistic Regression*

  The response has three or more categories without ordering, for example Veg, Non-veg, Vegan.

* *Ordinal Logistic Regression*

  The response has three or more categories with ordering, for example a movie rating from 1 to 5, or Very happy / Quite happy / Neither happy nor unhappy / Unhappy / Very unhappy.

## Applications of Logistic Regression
- To predict whether a patient has cancer or not
- To predict whether an email is spam or not
- To predict whether a tumor is malignant or not
- Image Segmentation and Categorization
- Handwriting recognition


## Function of Logistic Regression 


Logistic regression models the relationship between the dependent variable and one or more independent variables by estimating probabilities with its underlying logistic function (the sigmoid function):

 $$ f(t):=\frac{1}{1+e^{-t}} \quad \text{or} \quad \frac{e^{t}}{1+e^{t}} $$
 
 where $t\in \mathbb{R}$ and $f(t)\in (0,1)$.
 
The sigmoid function is an S-shaped curve that maps any real number to a value strictly between 0 and 1. It approaches these boundary values but never reaches them:
 $$ \lim_{t\to\infty} f(t)=1 \qquad \text{and} \qquad \lim_{t\to -\infty}f(t)=0$$
 
 <img src="files/logisticgraph.png" width='400' height='300'/>
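
As a rough illustration of the sigmoid function defined above, the following sketch evaluates $f(t)$ at a few points. The function name `sigmoid` and the sample inputs are choices made for this example only. The branch on the sign of $t$ uses the two equivalent forms of $f(t)$ so that `math.exp` is never called with a large positive argument.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function f(t) = 1 / (1 + e^{-t}) = e^t / (1 + e^t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t use e^t / (1 + e^t), which avoids overflow in exp(-t).
    z = math.exp(t)
    return z / (1.0 + z)

# Sample inputs (illustrative only): values approach 0 and 1 at the extremes.
for t in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"f({t:+.1f}) = {sigmoid(t):.6f}")
```

Mathematically $f(t)$ never equals 0 or 1, but in floating-point arithmetic very large $|t|$ can round to exactly 0.0 or 1.0.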
 
 
 Substituting $t=w\cdot x + b$ into the sigmoid function gives the logistic function in terms of the variable $x$:
$$ p(x)=\frac{1}{1+e^{-(w\cdot x +b)}} $$

$$P(y=1|\; x)=p(x)=\frac{1}{1+e^{-(w\cdot x +b)}}$$
$$P(y=0|\;x)=1-P(y=1|\;x)=1-p(x)=\frac{e^{-(w\cdot x +b)}}{1+e^{-(w\cdot x +b)}}$$

Taking the ratio of these two probabilities, we get
$$\frac{p}{1-p}=\frac{P(y=1|\;x)}{1-P(y=1|\;x)}=e^{w\cdot x +b}$$
Taking the natural logarithm of both sides gives
$$ \ln\left(\frac{p}{1-p}\right)=w\cdot x +b = x'\cdot \theta$$

where $\theta=(b,w_1,w_2,\dots,w_n)$ and $x'=(1,x_1,x_2,\dots,x_n)$. The goal is to find a suitable parameter vector $\theta$.


$\frac{p}{1-p}$ is called the *odds*: the ratio of the probability that the event occurs to the probability that it does not occur. The natural logarithm of the odds is called the *logit*,
$$\operatorname{logit}(p(x))=\ln\left(\frac{p(x)}{1-p(x)}\right)=w\cdot x+b = x'\cdot \theta$$
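
The relationship between probability, odds, and logit can be checked numerically. The sketch below is only an illustration: the weights `w`, the bias `b`, and the input `x` are made-up values, and the helper names are not from any library.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function; the simple form is fine for moderate t."""
    return 1.0 / (1.0 + math.exp(-t))

def odds(p: float) -> float:
    """Odds p / (1 - p) of an event with probability p."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Natural logarithm of the odds."""
    return math.log(odds(p))

# Hypothetical parameters and input, chosen only for the example.
w, b = [0.8, -0.5], 0.1
x = [1.5, 2.0]

t = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b   # linear score w.x + b
p = sigmoid(t)                                     # P(y = 1 | x)

print("score w.x + b =", t)
print("p(x)          =", p)
print("odds          =", odds(p), " e^(w.x+b) =", math.exp(t))
print("logit         =", logit(p))
```

Up to rounding, the odds equal $e^{w\cdot x+b}$ and the logit recovers the linear score, which is what the derivation above states.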


The probability that the data point $x^{(i)}$ belongs to class $y^{(i)}\in\{0,1\}$ is
$$ P(y^{(i)}|x^{(i)}) = p(x^{(i)})^{y^{(i)}}(1-p(x^{(i)}))^{1-y^{(i)}} $$
where $p(x)=\frac{1}{1+e^{-(x'\cdot \theta)}}$.

Maximum Likelihood Estimation (MLE) is a general approach to estimating parameters in statistical models. Assuming the data points are independent, the likelihood of the data is
$$P(\text{data};\theta)=\prod_{(x^{(i)},y^{(i)})\in \text{data}}p(x^{(i)})^{y^{(i)}}(1-p(x^{(i)}))^{1-y^{(i)}}$$

Taking the natural logarithm gives the log-likelihood
$$L(\theta)=\ln(P(\text{data};\theta))=\sum_{i=1}^{m} \left[ y^{(i)} \ln(p(x^{(i)};\theta)) + (1-y^{(i)})\ln(1-p(x^{(i)};\theta)) \right]$$
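
To make the log-likelihood concrete, here is a minimal sketch that evaluates $L(\theta)$ on a tiny data set. The data `X`, `y` and the two candidate parameter vectors are invented for illustration; `theta` is stored as $(b, w_1, \dots, w_n)$ as in the text.

```python
import math

def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def log_likelihood(theta, X, y):
    """L(theta) = sum_i [ y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # x' . theta with x' = (1, x_1, ..., x_n) and theta = (b, w_1, ..., w_n)
        t = theta[0] + sum(w_j * x_j for w_j, x_j in zip(theta[1:], x_i))
        p = sigmoid(t)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# Toy data (made up): one feature, labels switch from 0 to 1 as x grows.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

print("L(0, 0)  =", log_likelihood([0.0, 0.0], X, y))   # every p equals 0.5
print("L(-4, 2) =", log_likelihood([-4.0, 2.0], X, y))  # fits the labels better
```

A parameter vector that assigns higher probability to the observed labels gives a larger (less negative) value of $L(\theta)$. Note that `math.log` raises an error if a probability rounds to exactly 0, which can happen for extreme scores.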

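We want to find $\underset{\theta}{\operatorname{argmax}} \;P(\text{data};\theta)$ or, equivalently, $\underset{\theta}{\operatorname{argmax}}\; L(\theta)$. The two coincide because the logarithm is an increasing function.

We can maximize the log-likelihood with different methods such as Newton's method or gradient ascent (equivalently, gradient descent on $-L(\theta)$).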
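
One possible way to carry out this maximization is plain gradient ascent, using the gradient $\nabla_\theta L(\theta)=\sum_{i}\big(y^{(i)}-p(x^{(i)})\big)\,x'^{(i)}$. The sketch below is a minimal version under assumptions of our own: the toy data, the learning rate, and the number of steps are arbitrary choices, and the function names are hypothetical.

```python
import math

def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def log_likelihood(theta, X, y):
    """L(theta) = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

def fit_gradient_ascent(X, y, learning_rate=0.1, steps=2000):
    """Maximize L(theta) by repeatedly stepping along its gradient."""
    theta = [0.0] * (len(X[0]) + 1)            # theta = (b, w_1, ..., w_n)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(X, y):
            x_aug = [1.0] + list(x_i)          # x' = (1, x_1, ..., x_n)
            p = sigmoid(sum(t * v for t, v in zip(theta, x_aug)))
            for j, v in enumerate(x_aug):
                grad[j] += (y_i - p) * v       # dL/dtheta_j = sum_i (y_i - p_i) x'_ij
        theta = [t + learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up). The classes overlap, so the maximizer is finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0, 0, 1, 0, 1, 1]

theta_hat = fit_gradient_ascent(X, y)
print("theta_hat      =", theta_hat)
print("L at start     =", log_likelihood([0.0, 0.0], X, y))
print("L at theta_hat =", log_likelihood(theta_hat, X, y))
```

The step size matters: too large a value can make the iteration diverge, and for perfectly separable data the weights keep growing without bound. Newton's method typically needs far fewer iterations but requires the second derivatives of $L(\theta)$.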

Once a suitable value of $\theta$ is known, a new data point $x$ whose class $y$ is not given can be classified by comparing
$$ P(0|x) = 1-p(x) $$ 
and
$$ P(1|x) = p(x) $$
The point is assigned to the class with the larger probability, that is, to class 1 when $p(x)>0.5$ and to class 0 when $p(x)<0.5$.
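
The decision rule can be written in a few lines. In this sketch the parameter vector `theta` is a hypothetical fitted value, not the result of any real training run, and a tie at exactly $p(x)=0.5$ is assigned to class 1 by convention.

```python
import math

def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def classify(theta, x):
    """Return (predicted class, P(1|x)) for theta = (b, w_1, ..., w_n)."""
    p1 = sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], x)))
    p0 = 1.0 - p1
    label = 1 if p1 >= p0 else 0   # pick the class with the larger probability
    return label, p1

theta = [-4.0, 2.0]                # hypothetical parameters (b, w_1)
for x in ([1.0], [3.0]):
    label, p1 = classify(theta, x)
    print(f"x = {x}: P(1|x) = {p1:.4f}, predicted class = {label}")
```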

# Softmax Regression 

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression that we can use for multi-class classification. In logistic regression the labels are binary, $y^{(i)}\in\{0,1\}$. In contrast, multinomial logistic regression has $K$ different classes, so $y^{(i)}\in\{1,2,\dots ,K\}$. In place of the sigmoid function of logistic regression, we use the *softmax function*, which maps a vector of scores $t=(t^{(1)},\dots,t^{(K)})$ to the probabilities

$$P(y=i\,|\,t)=\frac{e^{t^{(i)}}}{\sum_{j=1}^{K} e^{t^{(j)}}} \qquad\text{for}\quad i=1,2,\dots,K$$

For each class $i$ we substitute $t^{(i)}=w^{(i)}\cdot z+b^{(i)}= \theta^{(i)} \cdot x$, where $\theta^{(i)}=(b^{(i)},w^{(i)}_1,w^{(i)}_2,\dots,w^{(i)}_n)$ and $x=(1,z_1,z_2,\dots,z_n)$.

Given a test input $x$, we want to evaluate the probability $P(y=k|x)$ for each value of $k=1,2,\dots,K$. This gives $K$ different probabilities, so the softmax function becomes

$$p(x;\theta)=
\begin{bmatrix}
P(y=1|x)\\
P(y=2|x)\\
\vdots\\
P(y=K|x)
\end{bmatrix}=\frac{1}{\sum_{j=1}^{K} e^{\theta^{(j)}\cdot x}}
\begin{bmatrix}
e^{\theta^{(1)}\cdot x}\\
e^{\theta^{(2)}\cdot x}\\
\vdots\\
e^{\theta^{(K)}\cdot x}\end{bmatrix}
$$
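
The vector of class probabilities can be computed as in the following sketch. The parameter vectors in `thetas` and the input `z` are invented for illustration. Subtracting the largest score before exponentiating is a common numerical safeguard: it multiplies the numerator and denominator by the same factor, so the probabilities are unchanged.

```python
import math

def softmax(scores):
    """Map K real scores to K probabilities that sum to 1."""
    m = max(scores)                              # shift to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(thetas, z):
    """P(y = k | x) for k = 1..K, with x = (1, z_1, ..., z_n)."""
    x = [1.0] + list(z)
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in thetas]
    return softmax(scores)

# Hypothetical parameters for K = 3 classes and n = 2 features.
thetas = [
    [0.1, 1.0, -0.5],    # theta^(1)
    [0.0, -0.3, 0.8],    # theta^(2)
    [-0.2, 0.2, 0.2],    # theta^(3)
]
probs = class_probabilities(thetas, [1.5, -0.5])
print("probabilities:", probs)
print("sum          :", sum(probs))
print("predicted k  :", probs.index(max(probs)) + 1)   # classes are numbered from 1
```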

where $\theta^{(1)},\dots,\theta^{(K)}\in \mathbb{R}^{n+1}$ are the parameters of our model. Notice that $\sum_{k=1}^{K}P(y=k|x)=1$. The parameters can be collected as the columns of a matrix $\theta$:

$$
\theta=\begin{bmatrix}
|&|&|&|\\
\theta^{(1)}&\theta^{(2)}&\cdots&\theta^{(K)}\\
|&|&|&|
\end{bmatrix}
$$
The function $L(\theta)$ in logistic regression can also be written in the following form
$$
L(\theta)=\sum_{i=1}^{m}\sum_{k=0}^{1} 1\{y^{(i)}=k\}\ln (P(y^{(i)}=k|x^{(i)}))
$$
where $1\{\cdot\}$ is the indicator function: $1\{\text{a true statement}\}=1$ and $1\{\text{a false statement}\}=0$.


In softmax regression, the function $L(\theta)$ has the form
$$
L(\theta)=\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y^{(i)}=k\}\ln (P(y^{(i)}=k|x^{(i)}))
$$
where 
$$P(y^{(i)}=k|x^{(i)})=\frac{e^{\theta^{(k)}\cdot x^{(i)}}}{\sum_{j=1}^{K}e^{\theta^{(j)}\cdot x^{(i)}}}$$
We need to find $\underset{\theta}{\operatorname{argmax}}\; L(\theta)$, which can be done with numerical optimization methods.
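
The softmax log-likelihood can be evaluated as in the sketch below. Because the indicator $1\{y^{(i)}=k\}$ is 1 only for the true class, the inner sum reduces to the log-probability of the observed label. The data and parameters are made up, and labels are assumed to be integers from 1 to $K$.

```python
import math

def class_probabilities(thetas, z):
    """Softmax probabilities P(y = k | x) with x = (1, z_1, ..., z_n)."""
    x = [1.0] + list(z)
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in thetas]
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_log_likelihood(thetas, X, y):
    """L(theta) = sum_i sum_k 1{y_i = k} ln P(y_i = k | x_i)."""
    total = 0.0
    for z_i, y_i in zip(X, y):
        probs = class_probabilities(thetas, z_i)
        total += math.log(probs[y_i - 1])        # only the true class contributes
    return total

# Toy data (made up): one feature, K = 3 classes labelled 1, 2, 3.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1, 2, 3, 3]

zero_thetas = [[0.0, 0.0] for _ in range(3)]     # every class gets probability 1/3
print("L at zero parameters:", softmax_log_likelihood(zero_thetas, X, y))
```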
## Properties of softmax regression parameterization

Suppose that every $\theta^{(i)}$ is replaced with $\theta^{(i)}-\psi$ for every $i=1,2,\dots,K$, where $\psi$ is some fixed vector. Then we obtain
$$
\begin{align}
P(y^{(i)}=k|x^{(i)};\theta)&=\frac{e^{(\theta^{(k)}-\psi)\cdot x^{(i)}}}{\sum_{j=1}^{K}e^{(\theta^{(j)}-\psi)\cdot x^{(i)}}}\\
&=\frac{e^{\theta^{(k)}\cdot x^{(i)}}e^{-\psi\cdot x^{(i)}}}{\sum_{j=1}^{K}e^{\theta^{(j)}\cdot x^{(i)}}e^{-\psi\cdot x^{(i)}}}\\
&=\frac{e^{\theta^{(k)}\cdot x^{(i)}}}{\sum_{j=1}^{K}e^{\theta^{(j)}\cdot x^{(i)}}}
\end{align}
$$
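In other words, subtracting the same vector $\psi$ from every parameter vector does not change the predicted probabilities, so many different parameter settings describe exactly the same model. This means that our softmax regression model is "overparameterized": the maximizer of $L(\theta)$ is not unique.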
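
This invariance is easy to confirm numerically. In the sketch below the parameter vectors, the shift `psi`, and the input `x` are arbitrary values picked for the example. The plain softmax formula is used on purpose so that the code mirrors the derivation above.

```python
import math

def softmax_probs(thetas, x):
    """P(y = k | x) computed directly from the softmax formula."""
    exps = [math.exp(sum(t * v for t, v in zip(theta_k, x))) for theta_k in thetas]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary example values; x already includes the leading 1.
thetas = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1]]
psi = [3.0, -2.0, 0.25]
x = [1.0, 0.6, -1.2]

# Replace every theta^(k) with theta^(k) - psi.
shifted = [[t - s for t, s in zip(theta_k, psi)] for theta_k in thetas]

p_original = softmax_probs(thetas, x)
p_shifted = softmax_probs(shifted, x)
max_diff = max(abs(a - b) for a, b in zip(p_original, p_shifted))

print("original:", p_original)
print("shifted :", p_shifted)
print("largest difference:", max_diff)   # zero up to rounding error
```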

## Relationship to Logistic Regression

In softmax regression, if we take the number of classes to be $K=2$, we get the following function

$$
p(x ;\theta)=\frac{1}{e^{\theta^{(1)}\cdot x}+e^{\theta^{(2)}\cdot x}}
\begin{bmatrix}
e^{\theta^{(1)}\cdot x}\\
e^{\theta^{(2)}\cdot x}\\
\end{bmatrix}
$$
Since softmax regression is overparameterized, we may subtract $\psi=\theta^{(2)}$ from both parameter vectors without changing the probabilities, which gives
$$
\begin{align}
p(x;\theta)&=\frac{1}{e^{(\theta^{(1)}-\theta^{(2)})\cdot x}+e^{(\theta^{(2)}-\theta^{(2)})\cdot x}}
\begin{bmatrix}
e^{(\theta^{(1)}-\theta^{(2)})\cdot x}\\
e^{(\theta^{(2)}-\theta^{(2)})\cdot x}
\end{bmatrix}\\
&=\begin{bmatrix}
\frac{e^{\theta'\cdot x}}{e^{\theta'\cdot x}+e^{\vec{0}\cdot x}}\\
\frac{e^{\vec{0}\cdot x}}{e^{\theta'\cdot x}+e^{\vec{0}\cdot x}}
\end{bmatrix}\\
&=\begin{bmatrix}
\frac{1}{1+e^{-\theta'\cdot x}}\\
\frac{e^{-\theta'\cdot x}}{1+e^{-\theta'\cdot x}}
\end{bmatrix}\\
&=\begin{bmatrix}
\frac{1}{1+e^{-\theta'\cdot x}}\\
1-\frac{1}{1+e^{-\theta'\cdot x}}
\end{bmatrix}
\end{align}
$$

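where the parameter $\theta':= \theta^{(1)}-\theta^{(2)}$. The first component is the sigmoid function evaluated at $\theta'\cdot x$, so for $K=2$ softmax regression reduces to logistic regression with parameter $\theta'$.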
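
A quick numerical check of this reduction, with two made-up parameter vectors and one made-up input (all values are illustrative assumptions):

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))

# Arbitrary example values; x already includes the leading 1.
theta_1 = [0.5, 1.2, -0.3]
theta_2 = [-0.1, 0.4, 0.9]
x = [1.0, 2.0, -1.0]

# Two-class softmax probability of the first class.
e1, e2 = math.exp(dot(theta_1, x)), math.exp(dot(theta_2, x))
p_softmax = e1 / (e1 + e2)

# Logistic regression with theta' = theta^(1) - theta^(2).
theta_prime = [a - b for a, b in zip(theta_1, theta_2)]
p_sigmoid = 1.0 / (1.0 + math.exp(-dot(theta_prime, x)))

print("softmax, first class:", p_softmax)
print("sigmoid(theta' . x) :", p_sigmoid)
```

The two printed values agree up to rounding error, and the second softmax component equals one minus this value.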

## References

1) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.

2) https://machinelearning-blog.com/2018/04/23/logistic-regression-101/

3) http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/

# Questions
1) What does "overparameterized", the most important property of softmax regression, mean? Formally, subtracting an arbitrary parameter vector has no effect, but what do we gain from this?

2) How do we obtain the softmax function? Is there a derivation of it from the sigmoid function?

3) Is there anything I have missed? What is my next step after this work?