### General concept

**Linearly separable**

* In Euclidean geometry, linear separability is a property of two sets of points. ... These two sets are linearly separable if there exists at least one line in the plane with all of the blue points on one side of the line and all the red points on the other side.

**Linearly non-separable**

* A linearly non-separable problem is a problem that, when represented in a pattern space, requires more than one straight cut to separate all of the patterns of one type from all of the patterns of another type.

<!-- <img src="https://lh3.googleusercontent.com/proxy/n5B_7sOJ8TKSmAaeblWuEcn6Z_ImJAj5U1yq_ITsQhpwfFReM6gq9XYzC-4-GkWBRlJPwzZbdKlnLxtFFGcgi3kDTw2IR3T4BmcKb_R_9YH9xtpkJcYKziMHolAbEQ">

**Credit** - Image from Internet -->

### Logistic Regression

* Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
* It is a classification model that uses the logistic function.
* Given the data points, the task is to find a plane that separates the two classes.
$$\Pi : w^Tx + b = 0$$
    - where
        - $\Pi$ is the plane
        - $w$ is the normal to the plane $\Pi$, assumed to be a unit vector $(||w|| = 1)$
        - $b$ is the intercept

![logistic_reg](https://user-images.githubusercontent.com/63333753/120607946-48195b00-c46e-11eb-82ec-1e3d7833f745.png)

Let's say we have two classes such that $y_i \in \{+1, -1\}$. Let's also assume that the plane passes through the origin $(b = 0)$.

* $d_i$ is the signed distance between $x_i$ and the plane $\Pi$, which is equal to $\frac{w^Tx_i}{||w||}$
* $d_j$ is the signed distance between $x_j$ and the plane $\Pi$, which is equal to $\frac{w^Tx_j}{||w||}$

If $x_i$ lies on the same side of the plane as the direction in which $w$ points, and $x_j$ lies on the opposite side, then with $||w|| = 1$ and $b = 0$ we have

* $d_i = w^Tx_i > 0 \implies y_i = +1$
* $d_j = w^Tx_j < 0 \implies y_j = -1$

> If $y_i*w^Tx_i$ and $y_j*w^Tx_j$ are greater than $0$, then it simply means that the classifier or model is correctly predicting.

To break it down, the important task here is to find the values of $w$ and $b$ that minimize the classification error, i.e., that make $y_iw^Tx_i$ greater than $0$ (and as large as possible) for as many points as possible.

> optimal $w^* = argmax \sum_{i=1}^n y_iw^Tx_i$
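
The rule above can be checked numerically. The following is a minimal sketch, assuming a hypothetical unit normal `w` and three made-up points; the helper names `dot` and `is_correct` are illustrative and do not come from any library.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def is_correct(w, x, y):
    """A point is correctly classified when y * w^T x > 0 (plane through the origin)."""
    return y * dot(w, x) > 0


# Hypothetical unit normal and toy points, chosen only for illustration.
w = [0.6, 0.8]                       # ||w|| = 1
points = [
    ([2.0, 1.0], +1),                # w^T x =  2.0, positive side, label +1
    ([-1.0, -3.0], -1),              # w^T x = -3.0, negative side, label -1
    ([1.0, -2.0], +1),               # w^T x = -1.0, wrong side for label +1
]

signed_distances = [dot(w, x) for x, _ in points]
correct = [is_correct(w, x, y) for x, y in points]

# The quantity that the argmax above tries to make as large as possible.
objective = sum(y * dot(w, x) for x, y in points)

print(signed_distances, correct, objective)
```

The third point contributes a negative term to the sum, which is exactly how a misclassified point lowers the objective.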

### Importance of sigmoid function

* Outliers can impact the classifier drastically. Because $y_iw^Tx_i$ is unbounded, a single far-away outlier can dominate the sum and shift the plane so that other points are misclassified.
* To prevent this, the sigmoid function is used.
* The sigmoid function is also called a squashing function: it maps any real value, however large in magnitude, into the interval $(0, 1)$.

![sigmoid_func](https://user-images.githubusercontent.com/63333753/120631798-e7961800-c485-11eb-8422-180209c5a506.png)

* The sigmoid function is often written as $\sigma(x) = \frac{1}{1 + e^{-x}}$
* When we introduce this function into the original equation, we get -

$$w^* = argmax \sum_{i=1}^n \sigma(y_iw^Tx_i)$$

* $\sigma$ approaches $0$ as its argument tends to $-\infty$
* $\sigma$ approaches $1$ as its argument tends to $+\infty$

* Finally, we can write the equation of the optimal $w^*$ as -

$$w^* = argmax \sum_{i=1}^n \frac{1}{1 + \exp(-y_iw^Tx_i)}$$

* The sigmoid function is easy to differentiate, since $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and its output in $(0, 1)$ has a probabilistic interpretation.
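
To make the squashing effect concrete, here is a small sketch. The list `margins` holds hypothetical values of $y_iw^Tx_i$, the last of which plays the role of an outlier; the numbers are made up for illustration.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Hypothetical values of y_i * w^T x_i; the last one is an outlier on the wrong side.
margins = [1.0, 2.0, 1.5, -100.0]

raw_sum = sum(margins)                              # dominated by the outlier
squashed_sum = sum(sigmoid(m) for m in margins)     # every term lies between 0 and 1

print(raw_sum, squashed_sum)
```

Without the sigmoid the single outlier drags the sum far below zero; with it, the outlier can cost at most one unit, the same as any other point.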

### Mathematical formulation of an objective function

**Monotonic function**

* The function $g(x)$ is said to be monotonic if it is either entirely increasing or entirely decreasing over its domain.
    - Monotonically increasing → if $x$ increases then $g(x)$ increases
    - Monotonically decreasing → if $x$ increases then $g(x)$ decreases

**Minima & Maxima**

* In mathematical analysis, the `maxima` and `minima` (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given range (the local or relative extrema), or on the entire domain (the global or absolute extrema).

<img src="https://datascienceintuition.files.wordpress.com/2017/12/local_global_maxmin.png">

**Credit** - Image from Internet

Let's say we have an optimization problem, i.e., $x^* = argmin(f(x))$ and let $f(x) = x^2$.

* The $argmin(f(x)) \implies argmin(x^2) = 0$

![x-sqaure-graph](https://user-images.githubusercontent.com/63333753/120757597-60e94580-c52e-11eb-87d1-a8d96d0a5ea1.PNG)

* From the above graph we can see that the minimum of $x^2$ is at $x = 0$.

If we apply a monotonically increasing function $g(x)$ on top of $f(x)$, i.e., $x^1 = argmin(g(f(x))) \implies argmin(g(x^2))$, and let's say that $g(x) = \log(x)$.

* We can claim that $x^* = x^1$, as the diagram below shows: a monotonically increasing $g$ preserves the ordering of the values of $f$, so the location of the minimum does not move.

![log_x-square-graph](https://user-images.githubusercontent.com/63333753/120758215-2d5aeb00-c52f-11eb-8128-92bca7fc2c02.PNG)

**Credit** - Images taken from Google
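
The claim that a monotonically increasing function does not move the minimizer can be checked with a brute-force grid search. This is only a sketch: it uses $f(x) = x^2 + 1$ instead of $x^2$ (an assumption made here so that the logarithm is defined at the minimum as well), and the grid bounds are arbitrary.

```python
import math


def f(x):
    """Shifted parabola; the +1 keeps f(x) > 0 so that log(f(x)) is defined everywhere."""
    return x * x + 1.0


# Evenly spaced candidate values from -3.00 to 3.00 in steps of 0.01.
grid = [i / 100.0 for i in range(-300, 301)]

x_star = min(grid, key=f)                             # argmin of f
x_one = min(grid, key=lambda x: math.log(f(x)))       # argmin of log(f)

print(x_star, x_one)
```

Both searches return the same grid point, because taking the logarithm changes the function values but not their order.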

Now, let $w^* = argmax \sum_{i=1}^n \frac{1}{1 + \exp(-y_iw^Tx_i)} \rightarrow (1)$

Apply the monotonic function $\log$ to $(1)$

$\implies w^* = argmax \sum_{i=1}^n \log \bigg(\frac{1}{1 + \exp(-y_iw^Tx_i)} \bigg)$

**Note**: $\log \big(\frac{1}{x}\big) = -\log(x)$

$\implies w^* = argmax \sum_{i=1}^n - \log \big({1 + \exp(-y_iw^Tx_i)} \big)$

**Note**: $argmax(-f(x)) = argmin(f(x))$

$\implies w^* = argmin \sum_{i=1}^n \log \big({1 + \exp(-y_iw^Tx_i)} \big) \rightarrow (2)$ where $y_i \in \{+1, -1\}$

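The equation $(2)$ represents the optimization problem of logistic regression. It is much less affected by outliers than the original sum, since points that lie far on the correct side of the plane contribute almost nothing to it. If we drop the $1$ inside the $\log$, then $\log$ and $\exp$ cancel out and we are left with $argmin \sum_{i=1}^n -y_iw^Tx_i$, which is the same problem we had without the sigmoid function.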
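
As a sanity check on the derivation, the sketch below evaluates both forms of the objective on a made-up data set and a made-up weight vector. The names `softplus`, `logistic_loss`, and `sum_log_sigmoid` are chosen here for illustration only.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))


def margins(w, X, y):
    """y_i * w^T x_i for every training point."""
    return [y_i * sum(wj * xj for wj, xj in zip(w, x_i)) for x_i, y_i in zip(X, y)]


def logistic_loss(w, X, y):
    """Equation (2): sum_i log(1 + exp(-y_i w^T x_i)), to be minimized."""
    return sum(softplus(-z) for z in margins(w, X, y))


def sum_log_sigmoid(w, X, y):
    """The form before the sign flip: sum_i log(sigma(y_i w^T x_i)), to be maximized."""
    return sum(math.log(1.0 / (1.0 + math.exp(-z))) for z in margins(w, X, y))


# Hypothetical toy data and weights.
X = [[1.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]]
y = [+1, +1, -1, -1]
w = [0.5, 0.25]

loss = logistic_loss(w, X, y)
log_lik = sum_log_sigmoid(w, X, y)

print(loss, log_lik)   # the two values are negatives of each other
```

Since one quantity is the negative of the other for every $w$, maximizing the first is the same as minimizing the second.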

### $L2$ Regularization : Overfitting and Underfitting

We know that

$$w^* = argmin \sum_{i=1}^n \log \big({1 + \exp(-y_iw^Tx_i)} \big) \rightarrow (1)$$

Let $Z_i = y_iw^Tx_i$

We can write $(1)$ as -

$$(1) \implies w^* = argmin \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) \rightarrow (2)$$

* The value of $\exp(-x)$ is always $> 0$

![exp_minus_x-graph](https://user-images.githubusercontent.com/63333753/120772600-1a034c00-c53e-11eb-9cdd-e2741998df27.PNG)

* We know that $\log(1) = 0$ and $\log(1 + \delta) \geq \log(1)$ if $\delta \geq 0$
* So, mathematically, $(2)$ is always $\geq 0$, and thus the smallest value it can approach is $0$

Now, for what values of $Z_i$ does $\exp(-Z_i)$ approach $0$?

* If $Z_i > 0$ and $Z_i \rightarrow +\infty \ (\forall i)$ then $\exp(-Z_i) \rightarrow 0$, therefore $\log(1 + \exp(-Z_i)) \rightarrow 0$ (explanation can be found from the above graph)
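
The effect can be seen numerically. In this sketch the toy data is linearly separable by the hypothetical direction $(1, 0)$, so every $Z_i$ is positive; scaling $w$ along that direction makes all $Z_i$ larger and pushes the loss towards $0$.

```python
import math


def logistic_loss(w, X, y):
    """sum_i log(1 + exp(-Z_i)) with Z_i = y_i * w^T x_i, in a stable form."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z_i = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        # log(1 + exp(-z)) = max(-z, 0) + log(1 + exp(-|z|))
        total += max(-z_i, 0.0) + math.log1p(math.exp(-abs(z_i)))
    return total


# Hypothetical separable data: the sign of the first coordinate gives the label.
X = [[1.0, 0.5], [2.0, -1.0], [-1.0, 1.0], [-3.0, 0.0]]
y = [+1, +1, -1, -1]

scales = [1.0, 10.0, 100.0]
losses = [logistic_loss([c, 0.0], X, y) for c in scales]

for c, loss in zip(scales, losses):
    print(c, loss)
```

The loss keeps shrinking as the weights grow, so without any constraint the optimizer is rewarded for making $w$ arbitrarily large.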

---

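When the training data is linearly separable, every $Z_i$ can be made positive, and the loss can then be pushed towards $0$ simply by letting the entries of $w$ grow without bound. Such a $w$ fits the training data perfectly but overfits. To avoid the problems of overfitting and underfitting, we apply regularization, i.e., we add a penalty on the size of $w$, weighted by $\lambda$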

$$(2) \implies w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda w^Tw \bigg] \rightarrow (3)$$

or

$$(2) \implies w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda ||w||_2^2 \bigg] \rightarrow (3)$$

or

$$(2) \implies w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda \sum_{j=1}^d w_j^2 \bigg] \rightarrow (3)$$

In $(3)$, the first term is called the **loss term** and the second term is called the **regularization term**. Here, $\lambda$ is a hyperparameter.

> $\lambda = 0 \implies$ overfitting <br> $\lambda = \text{very large} \implies$ underfitting

To summarize, the general pattern that is followed in machine learning is

$$\text{min}\bigg(\text{loss function over training data} + \text{regularization}\bigg)$$

We find the right $\lambda$ through cross validation techniques.
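
The influence of $\lambda$ on the size of the weights can be sketched with plain gradient descent on $(3)$. Everything below is illustrative: the toy data, the learning rate, the number of steps, and the three values of $\lambda$ are assumptions, and the step size is only tuned for this small example.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_l2(X, y, lam, lr=0.05, steps=1000):
    """Gradient descent on sum_i log(1 + exp(-Z_i)) + lam * w^T w, starting from w = 0."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [2.0 * lam * wj for wj in w]          # gradient of lam * w^T w
        for x_i, y_i in zip(X, y):
            z_i = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
            coeff = -sigmoid(-z_i) * y_i             # d/dZ log(1 + exp(-Z)) = -sigma(-Z)
            for j in range(d):
                grad[j] += coeff * x_i[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical separable toy data.
X = [[1.0, 0.5], [2.0, -1.0], [-1.0, 1.0], [-3.0, 0.0]]
y = [+1, +1, -1, -1]

norms = {}
for lam in (0.0, 0.1, 1.0):
    w = fit_l2(X, y, lam)
    norms[lam] = math.sqrt(sum(wj * wj for wj in w))
    print(lam, w, norms[lam])
```

With $\lambda = 0$ the weights keep growing for as long as training runs, while larger values of $\lambda$ pull them towards zero. In practice $\lambda$ would be chosen by cross validation rather than fixed by hand as it is here.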

### $L1$ Regularization and Sparsity

Instead of using the $L2$ norm (above), is there any other viable regularization? Yes. We have the $L1$ norm; when substituted into $(3)$, we get -

$$(3) \implies w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda ||w||_1\bigg] \rightarrow (4)$$

or

$$(3) \implies w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda \sum_{j=1}^d|w_j| \bigg] \rightarrow (4)$$

$L1$ regularization serves the same purpose as $L2$ regularization but has an additional advantage: sparsity.

* The solution to the logistic regression is said to be sparse if $w^*$ contains many $0$'s.
* The weights of the unimportant features become $0$.
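
The sparsity effect can be illustrated on a tiny example. The text above does not say how $(4)$ is solved; the sketch below assumes proximal gradient descent with soft-thresholding for the $L1$ case, which is one common choice, and plain gradient descent for the $L2$ case. The data, $\lambda$, and step size are made up, and the second feature is deliberately given a very weak signal.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_gradient(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i w^T x_i)) with respect to w."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        z_i = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        coeff = -sigmoid(-z_i) * y_i
        for j, xj in enumerate(x_i):
            grad[j] += coeff * xj
    return grad


def soft_threshold(v, t):
    """Shrink v towards 0 by t and return 0 when |v| <= t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit(X, y, lam, penalty, lr=0.1, steps=500):
    """penalty='l2': gradient descent; penalty='l1': proximal gradient (assumed solver)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = loss_gradient(w, X, y)
        if penalty == "l2":
            w = [wj - lr * (gj + 2.0 * lam * wj) for wj, gj in zip(w, grad)]
        else:
            # Gradient step on the loss term, then soft-thresholding for lam * ||w||_1.
            w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Hypothetical data: feature 1 carries the signal, feature 2 is nearly irrelevant.
X = [[2.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-2.0, -0.1]]
y = [+1, +1, -1, -1]

w_l1 = fit(X, y, lam=0.5, penalty="l1")
w_l2 = fit(X, y, lam=0.5, penalty="l2")

print("L1:", w_l1)
print("L2:", w_l2)
```

On this data the $L1$ solution sets the weight of the second feature to exactly $0$, whereas the $L2$ solution leaves it small but non-zero.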

On the other hand, we have ElasticNet, which combines the advantages of both the $L2$ and $L1$ norms. Its equation looks like -

$$w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \lambda_1 ||w||_1 + \lambda_2 ||w||_2^2 \bigg]$$

### Probabilistic interpretation for Logistic Regression

Refer to → https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

* Alternatively, we can derive logistic regression by probabilistic methods. It is the combination of **Gaussian Naive Bayes** and the **Bernoulli distribution**.

* In the case of $y_i \in \{+1, -1\}$ the optimization problem will be

$$w^* = argmin \bigg[ \sum_{i=1}^n \log \big(1 + \exp(-Z_i)\big) + \text{Regularization} \bigg]$$

* In the case of $y_i \in \{1, 0\}$ the optimization problem will be probabilistic in nature

$$w^* = argmin \bigg[ \sum_{i=1}^n -y_i \log(p_i) - (1 - y_i) \log(1 - p_i) + \text{Regularization} \bigg] \ \text{where} \ p_i = \sigma(w^Tx_i)$$

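Both formulations are equivalent: mapping the labels $\{1, 0\}$ to $\{+1, -1\}$ turns one loss term into the other.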
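
The equivalence of the two loss terms can be verified point by point. This sketch compares them over a handful of hypothetical scores $w^Tx_i$ and both labels, with regularization left out since it is identical in both formulations.

```python
import math


def sigmoid(t):
    """Logistic function; adequate for the moderate scores used below."""
    return 1.0 / (1.0 + math.exp(-t))


def loss_pm1(score, y_pm):
    """Loss term for labels in {+1, -1}: log(1 + exp(-y * w^T x))."""
    return math.log(1.0 + math.exp(-y_pm * score))


def loss_01(score, y01):
    """Loss term for labels in {1, 0}: -y log(p) - (1 - y) log(1 - p), p = sigma(w^T x)."""
    p = sigmoid(score)
    return -y01 * math.log(p) - (1 - y01) * math.log(1.0 - p)


# Hypothetical values of w^T x_i.
scores = [-3.0, -0.5, 0.0, 1.2, 4.0]

max_gap = max(
    abs(loss_pm1(s, y_pm) - loss_01(s, (y_pm + 1) // 2))   # +1 -> 1, -1 -> 0
    for s in scores
    for y_pm in (+1, -1)
)

print(max_gap)
```

The largest difference is at the level of floating-point rounding, which is what the algebra predicts: $-\log \sigma(s) = \log(1 + e^{-s})$ and $-\log(1 - \sigma(s)) = \log(1 + e^{s})$.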

### Loss minimization interpretation for Logistic Regression

>The aim of any optimization problem is to minimize the loss.

* Let's imagine that we assign a penalty to every prediction

> `+1` → incorrect classification <br> `+0` → correct classification

* In this case, the aim should be to minimize the number of incorrectly classified points.
* Thus, we have to find the parameter $w^*$ that minimizes error.
* This type of loss function is called `0_1_loss_function`.

From this,

* The ideal loss function looks like -

$$w^* = argmin \sum_{i=1}^n \text{0-1 loss}(y_i, x_i, w)$$

**Different loss functions plot**

<img src="http://i.stack.imgur.com/4DFDU.png">

**Credits** - Image from Internet

**Note**: Optimization problems in machine learning are solved using concepts from calculus such as differentiation. The 0-1 loss is a step function: it is not differentiable at the decision boundary and its gradient is zero everywhere else, so gradient-based methods cannot make progress on it. To obtain a differentiable problem, we approximate it with the logistic loss function.
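
The contrast between the two losses can be tabulated as a function of $z = y_iw^Tx_i$. This is a minimal sketch with made-up values of $z$; treating $z = 0$ as a misclassification is a convention assumed here.

```python
import math


def zero_one_loss(z):
    """1 for a misclassified point (z <= 0), 0 for a correctly classified one."""
    return 1.0 if z <= 0 else 0.0


def logistic_loss(z):
    """Smooth surrogate log(1 + exp(-z)), written in a numerically stable form."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def logistic_loss_grad(z):
    """Derivative of the surrogate with respect to z, i.e. -sigma(-z)."""
    return -1.0 / (1.0 + math.exp(z))


# Hypothetical margins on both sides of the decision boundary.
margins = [-2.0, -0.5, 0.5, 2.0]
table = [(z, zero_one_loss(z), logistic_loss(z), logistic_loss_grad(z)) for z in margins]

for row in table:
    print(row)
```

The 0-1 column jumps from 1 to 0 and gives the optimizer no slope to follow, while the logistic column decreases smoothly and always has a usable, non-zero derivative for these values.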

### Feature importance and Model interpretability

* Let's say we have features like $(f_1, f_2, \dots, f_j, \dots, f_d)$ and corresponding to these features we have weights like $(w_1, w_2, \dots, w_j, \dots, w_d)$.

* Let's also assume that all **features are independent** (as in Naive Bayes). Now, to determine which features are important, we can rely on the absolute values of the corresponding weights and pick the features with the largest absolute weights. This comparison is meaningful only when the features are on comparable scales, e.g., after standardization.
    * If $|w_j|$ is large (irrespective of the sign), the impact of $f_j$ on the class label is large, and thus the corresponding $f_j$ is important for the model.
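
As a short sketch of this ranking, with hypothetical feature names and weights that are not taken from any trained model:

```python
# Hypothetical feature names and learned weights, for illustration only.
features = ["f1", "f2", "f3", "f4"]
weights = [0.4, -2.5, 1.1, 0.0]

# Rank by absolute weight, largest first; the sign only indicates the direction of the effect.
ranking = sorted(zip(features, weights), key=lambda fw: abs(fw[1]), reverse=True)
top_feature = ranking[0][0]

for name, weight in ranking:
    print(name, weight)
```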

### Collinearity of features

* If one feature can be expressed as a linear function of another, the two features are said to be **collinear**.

$$f_1 = \alpha_1 + \alpha_2 f_2$$

* If a feature can be expressed as a linear function of several other features, it is called **multicollinearity**.

$$f_1 = \alpha_1 + \alpha_2 f_2 + \alpha_3 f_3 + \dots + \alpha_n f_n$$

Let $w^* = \{1, 2, 3\}$ be the weights corresponding to the features $\{f_1, f_2, f_3\}$, and let the query point $x_q$ be $\{x_{q1}, x_{q2}, x_{q3}\}$.

Now, as per the model rule we do

$\implies w^Tx_q = w_1x_{q1} + w_2x_{q2} + w_3x_{q3}$

$\implies w^Tx_q = x_{q1} + 2x_{q2} + 3x_{q3}$

Let $f_2 = 1.5f_1$, so that $x_{q2} = 1.5x_{q1}$

$\implies w^Tx_q = x_{q1} + 3x_{q1} + 3x_{q3}$

$\implies w^Tx_q = 4x_{q1} + 3x_{q3}$

Therefore, an equivalent set of weights is $\{4, 0, 3\}$. Judging by these weights alone, we would conclude that $f_2$ is not important (which is wrong).

**Note**: If features are collinear, then we cannot use $|w_j|$ for the feature importance.
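
The arithmetic above can be confirmed in a few lines. The query points are hypothetical, constructed so that $x_{q2} = 1.5x_{q1}$ holds; `w_alternative` is an extra weight vector added here to show that the weight can just as well be moved entirely onto $f_2$.

```python
def score(w, x):
    """w^T x for one query point."""
    return sum(wj * xj for wj, xj in zip(w, x))


w_original = [1.0, 2.0, 3.0]
w_equivalent = [4.0, 0.0, 3.0]                   # all of f2's weight folded into f1
w_alternative = [0.0, 2.0 + 1.0 / 1.5, 3.0]      # all of f1's weight folded into f2

# Hypothetical query points that satisfy x_q2 = 1.5 * x_q1.
queries = [[x1, 1.5 * x1, x3] for x1, x3 in [(1.0, 2.0), (-2.0, 0.5), (4.0, -1.0)]]

gaps = [
    max(abs(score(w_original, x) - score(w_equivalent, x)),
        abs(score(w_original, x) - score(w_alternative, x)))
    for x in queries
]

print(gaps)
```

All three weight vectors produce the same score for every such point, so the individual values of $w_1$ and $w_2$ say nothing reliable about the importance of $f_1$ and $f_2$.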

**How do you determine if the features are multicollinear?**

* Standardize the features
* Compute the weights by solving the optimization problem
* Perturb the features by adding slight noise
* Recompute the weights by solving the optimization problem again
* If the initial and final weights differ significantly, then the features are said to be multicollinear (we cannot use $|w_j|$ to determine the best features). Otherwise, they are not.
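
The steps above can be sketched as follows. This is only an outline under several assumptions: the model is fitted by gradient descent with a small $L2$ penalty, the noise is Gaussian with a hypothetical scale of `0.01`, and the toy data is made up. How large a shift counts as "significant" is not specified above and is left to the reader.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def standardize(X):
    """Scale every column to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]


def fit(X, y, lam=0.01, lr=0.05, steps=2000):
    """Gradient descent on sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [2.0 * lam * wj for wj in w]
        for x_i, y_i in zip(X, y):
            z_i = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
            coeff = -sigmoid(-z_i) * y_i
            for j in range(d):
                grad[j] += coeff * x_i[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def weight_shift(X, y, noise_scale=0.01, seed=0):
    """Standardize, fit, perturb, refit, and report the largest change in any weight."""
    rng = random.Random(seed)
    X_std = standardize(X)
    w_before = fit(X_std, y)
    X_noisy = [[v + rng.gauss(0.0, noise_scale) for v in row] for row in X_std]
    w_after = fit(X_noisy, y)
    shift = max(abs(a - b) for a, b in zip(w_before, w_after))
    return w_before, w_after, shift


# Hypothetical data in which f2 = 1.5 * f1.
f1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f3 = [0.5, -1.0, 0.3, 1.2, -0.7, 0.1]
X = [[a, 1.5 * a, c] for a, c in zip(f1, f3)]
y = [-1, -1, +1, -1, +1, +1]

w_before, w_after, shift = weight_shift(X, y)
print(w_before, w_after, shift)
```

Comparing `shift` against the typical magnitude of the weights gives a rough indication of how stable they are under small perturbations.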

### Logistic regression - Imbalanced data

* Refer to this video - https://www.youtube.com/watch?v=l8Dge0z1Zks&ab_channel=AppliedAICourse

### Training - Time complexity

* Training in logistic regression is basically solving the given optimization problem. A commonly used algorithm for this is **stochastic gradient descent**.
* The time complexity of the **training process** is roughly $O(nd)$ per pass over the data, where $n$ is the total number of points and $d$ is the dimensionality.
* The space complexity at runtime is $O(d)$ because the only thing that needs to be stored is $w^*$, a vector of $d$ optimized weights.
* The time complexity at runtime is also $O(d)$ per query point, since computing $w^Tx_q$ takes one multiplication and one addition per feature.

---
* If $d$ is small, prediction is extremely fast.
* If $d$ is large, we can use $L1$ regularization, which creates sparsity and so reduces the number of multiplications needed.
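
To show where these costs come from, here is a minimal sketch of stochastic gradient descent on the unregularized logistic loss. The toy data, learning rate, and number of epochs are assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict(w, x_q):
    """Runtime cost O(d): one pass over the d stored weights."""
    return +1 if sum(wj * xj for wj, xj in zip(w, x_q)) > 0 else -1


def sgd_epoch(w, X, y, lr, rng):
    """One pass over the n points; each update touches d weights, so O(n * d) in total."""
    order = list(range(len(X)))
    rng.shuffle(order)
    for i in order:
        z_i = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        step = lr * sigmoid(-z_i) * y[i]      # negative gradient of log(1 + exp(-z_i))
        w = [wj + step * xj for wj, xj in zip(w, X[i])]
    return w


# Hypothetical separable toy data.
X = [[1.0, 0.5], [2.0, -1.0], [-1.0, 1.0], [-3.0, 0.0]]
y = [+1, +1, -1, -1]

rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(20):
    w = sgd_epoch(w, X, y, lr=0.1, rng=rng)

predictions = [predict(w, x_q) for x_q in X]
print(w, predictions)
```

Only the vector `w` has to be kept after training, which is the $O(d)$ space requirement mentioned above.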

### Real world cases

* The decision surface is linear in nature. It can be a line or plane or hyperplane.
* The basic assumption here is that the data is linearly separable or almost linearly separable.
* The impact of outliers is small because of the sigmoid function. However, if outliers are problematic, one can remove them and re-train the model to fit the data better.

### Feature transformation

* Feature engineering and feature transformation are the important aspects in solving machine learning problems.

* In the case of non-linearly separable data, we have to do feature transformation in order to find the best separator.

<img src="https://miro.medium.com/max/1406/1*OpPID41jkJ70dslLHdP0_g.jpeg">

**Credits** - Image from Internet

**Types of transformations**

* Product transformation
* Square transformation
* Trigonometric transformation
* Boolean transformation
* Logarithmic transformation
* Exponential transformation
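
As an illustration of the square transformation, the sketch below builds two concentric rings of points, which no straight line through the original plane can separate, and shows that a single plane separates them after squaring each coordinate. The radii, the number of points, and the plane parameters `w` and `b` are hypothetical choices.

```python
import math


def square_transform(x):
    """Map (x1, x2) to (x1^2, x2^2)."""
    return [x[0] ** 2, x[1] ** 2]


# Two concentric rings: radius 1 is class -1, radius 2 is class +1.
angles = [2.0 * math.pi * k / 8 for k in range(8)]
inner = [[1.0 * math.cos(a), 1.0 * math.sin(a)] for a in angles]
outer = [[2.0 * math.cos(a), 2.0 * math.sin(a)] for a in angles]
X = inner + outer
y = [-1] * len(inner) + [+1] * len(outer)

# Hypothetical plane in the transformed space: x1^2 + x2^2 - 2.5 = 0.
# w is not normalized here; scaling w and b together does not change the sign.
w, b = [1.0, 1.0], -2.5

predictions = [
    +1 if sum(wj * xj for wj, xj in zip(w, square_transform(x))) + b > 0 else -1
    for x in X
]
print(predictions)
```

In the transformed space the inner ring lies on the line $x_1^2 + x_2^2 = 1$ and the outer ring on $x_1^2 + x_2^2 = 4$, so any plane between them classifies every point correctly.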