# Logistic Regression

## 1. Theory behind the Algorithm

### What is Logistic Regression?

Logistic Regression is a **classification** algorithm (our response $y$ is categorical). The main goal of this classifier is to map each input to a vector of probabilities, one per response category, giving the probability that the observation belongs to each category. Say each **input** is a vector of the form

$$x = \begin{pmatrix} x_1, & ..., & x_m \end{pmatrix}$$

And suppose our **output** lives in the following set $y \in \{0, 1\}$

We can now define the quantity we are ultimately interested in, namely the probability that our input belongs to a given category, which in this case takes the form $$p(x) = P(y=1|x) = 1 - P(y=0|x)$$

It is possible to mathematically derive the logistic function (I am displaying just the binary case for simplicity, but similar results hold for the multiclass case), obtaining

$$p(x) = \frac {e^{\beta_0 + \beta_1  x_1 + ... + \beta_m  x_m}}{1 + e^{\beta_0 + \beta_1  x_1 + ... + \beta_m  x_m}}$$
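The formula above can be sketched in a few lines of NumPy. This is only an illustration: the function name `logistic_probability` and the example coefficients are assumptions made for this sketch, not values from the dataset used later. It relies on the identity $\frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}$.

```python
import numpy as np

def logistic_probability(x, beta0, beta):
    """Return p(x) = e^z / (1 + e^z), with z = beta0 + beta . x."""
    z = beta0 + np.dot(x, beta)
    # 1 / (1 + e^(-z)) is algebraically identical to e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta0 = 0.0
beta = np.array([1.0, 2.0])

p_zero = logistic_probability(np.array([0.0, 0.0]), beta0, beta)  # z = 0
p_pos = logistic_probability(np.array([1.0, 1.0]), beta0, beta)   # z = 3
print(p_zero, p_pos)
```

When $z = 0$ the probability is exactly one half, and it moves towards 1 as $z$ grows.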

Therefore we can train our algorithm on the data to estimate the parameters $\beta_0, ..., \beta_m$ and then predict $\hat{y_i}$ for all $i \in \{0, ..., n\}$, usually as $\hat{y_i} = \unicode{x1D7D9}(\hat{p}(x_i)\geq0.5)$, where $\unicode{x1D7D9}$ is the indicator function, returning 1 (True) if its condition is satisfied and 0 (False) otherwise. In other words, if the predicted probability for input $i$ is at least $\frac{1}{2}$, the observation is predicted to belong to class 1, otherwise to class 0.
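The thresholding rule can be written as a small helper. This is a minimal sketch; `predict_labels` and the example probabilities are hypothetical names and values introduced here for illustration.

```python
import numpy as np

def predict_labels(probabilities, threshold=0.5):
    """Apply the indicator function: 1 where p >= threshold, 0 elsewhere."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Hypothetical predicted probabilities for three observations
p_hat = np.array([0.1, 0.5, 0.73])
labels = predict_labels(p_hat)
print(labels)
```

Note that a probability of exactly 0.5 is assigned to class 1, because the rule uses $\geq$.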

### When should I use it? What data should I feed it?

<ul>
<li>It works well with <strong>continuous data types</strong></li>
<li>Categorical variables need to be <strong>OneHotEncoded</strong></li>
<li>Logistic Regression assumes the classes can be separated by a boundary that is linear in the inputs, or in transformations of them that give a curved boundary (whereas trees make no such assumption)</li>
<li>This algorithm <strong>does not handle skewed classes well</strong>. For example, if 75% of the outputs are 1 and just 25% are 0, we need to <strong>balance</strong> the classes or adjust the <strong>class weights</strong></li>
</ul>
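As an illustration of the one-hot encoding point in the list above, here is a minimal sketch using `pandas.get_dummies`. The small frame and its column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy data: one continuous column and one categorical column
data = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'embarked': ['S', 'C', 'S', 'Q'],
})

# Each category of 'embarked' becomes its own 0/1 column
encoded = pd.get_dummies(data, columns=['embarked'], dtype=int)
print(encoded)
```

The continuous column is passed through unchanged, while the categorical column is replaced by one indicator column per category.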

Moral of the story: it is a good and reliable algorithm, yet its range of application can be considered narrow when compared to other algorithms.

## 2. Practical Example

Suppose I am interested in whether a person survived the Titanic disaster. As an emergency protocol, the ship first evacuated the young, the elderly and the women. Let us analyze the latter category with a logistic regression, checking whether being female had an impact on a passenger's probability of survival. For this purpose I will use the Titanic dataset, obtained through the seaborn library.

In [36]:
#I am importing the necessary dependencies
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
pd.options.mode.chained_assignment = None
df = sns.load_dataset('titanic')

In [37]:
df.head()

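| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | | Southampton | no | True |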


I will now check which unique values of gender appear in the dataset.

In [38]:
df['sex'].unique()

array(['male', 'female'], dtype=object)

Male and female are the only genders present in the dataset, thus our input feature is binary.
I will now select the input and output columns I'm interested in.

In [39]:
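# .copy() gives an independent DataFrame, so later column assignments do not act on a view of df
df_sex_survived = df[['sex','survived']].copy()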

Right now the values under the column "sex" are the strings 'male' and 'female'. I want to transform the column into a dummy variable with value 1 if the individual is female, and 0 otherwise.

In [40]:
# Vectorized comparison: True for 'female', False otherwise, then cast to 1/0
df_sex_survived['sex'] = (df_sex_survived['sex'] == 'female').astype(int)

In [41]:
df_sex_survived

| | sex | survived |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| ... | ... | ... |
| 886 | 0 | 0 |
| 887 | 1 | 1 |
| 888 | 1 | 0 |
| 889 | 0 | 1 |


I will now store part of my dataset for training purposes, and I will leave out another portion of it for testing purposes.

In [44]:
from sklearn.model_selection import train_test_split
X = df_sex_survived['sex'].values.reshape(-1, 1)  # reshape to 2D because there is a single feature
y = df_sex_survived['survived'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

The next step is to train the model

In [47]:
logistic_reg = LogisticRegression(random_state=0)
logistic_reg.fit(X_train, y_train)
# score() returns the accuracy as a fraction between 0 and 1, here computed on the training set
print('Training accuracy:', logistic_reg.score(X_train, y_train))

Training accuracy: 0.7849117174959872


We see a training accuracy of about 78%, which means that simply by knowing whether a passenger was male or female, we would correctly classify whether he/she survived the Titanic accident about 78% of the time on the training data.
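The score above is computed on the same data the model was trained on. A common complement is to score the held-out portion as well. The sketch below shows the pattern on synthetic data (an assumption, so that the example runs without downloading the Titanic dataset); the probabilities 0.75 and 0.2 are made up for illustration and are not estimates from the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a single binary feature (hypothetical data)
X = rng.integers(0, 2, size=(500, 1))
# Outcome is 1 with probability 0.75 when the feature is 1, and 0.2 otherwise
p = np.where(X[:, 0] == 1, 0.75, 0.2)
y = (rng.random(500) < p).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # accuracy on data seen during training
test_acc = model.score(X_test, y_test)     # accuracy on held-out data
print(f'train: {train_acc:.3f}, test: {test_acc:.3f}')
```

In the notebook itself, the equivalent call would be `logistic_reg.score(X_test, y_test)`.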

In [54]:
print('Coefficient of the regression', logistic_reg.coef_[0][0])

Coefficient of the regression 2.370100620067801


Unfortunately the coefficients of a logistic regression are not as straightforward to interpret as those of a linear regression, because they are expressed on the log-odds scale rather than as direct changes in probability. We can still make some observations by looking at the positive sign, which tells us that the input feature has a positive impact: being female increases the predicted probability of survival.
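One common way to read such a coefficient is to exponentiate it, which turns a change in log-odds into an odds ratio. The sketch below does this for the coefficient printed above; the variable names are chosen here for illustration.

```python
import math

# Coefficient printed by the notebook for the 'sex' dummy (1 = female)
coef_sex = 2.370100620067801

# exp(coefficient) is the multiplicative change in the odds of survival
# when the feature goes from 0 (male) to 1 (female)
odds_ratio = math.exp(coef_sex)
print(round(odds_ratio, 1))
```

Under this model, the odds of survival for a female passenger are therefore more than ten times those of a male passenger. This is a statement about odds, not about probabilities.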

## 3. Parameters in Sklearn

**Note:** in Sklearn, when we use LogisticRegression, regularization of the coefficients is applied by default (penalty='l2', C=1.0).

<em>LogisticRegression</em> (<em>penalty</em>='l2', *, <em>dual</em>=False, <em>tol</em>=0.0001, <em>C</em>=1.0, <em>fit_intercept</em>=True, <em>intercept_scaling</em>=1, <em>class_weight</em>=None, <em>random_state</em>=None, <em>solver</em>='lbfgs', <em>max_iter</em>=100, <em>multi_class</em>='auto', <em>verbose</em>=0, <em>warm_start</em>=False, <em>n_jobs</em>=None, <em>l1_ratio</em>=None)

**Penalty refers to the norm used to regularize the coefficients**, that is, how large coefficients are penalized (it is not a penalty on mispredictions). The options discussed here are **"$\ell_2$", "$\ell_1$"** and **"elasticnet"**. The norm we should use depends on our end goal. Using the $\ell_2$ norm, we keep all of our coefficients, with the least significant ones shrinking significantly in size. Using the $\ell_1$ norm, we force the less significant coefficients to go to zero, so the corresponding features are effectively dropped from the model. Elasticnet is a combination of the two: at the expense of more computational power, it is more flexible, which can be a pro, though it should be weighed against the risk of overfitting.
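The difference between the two norms can be observed by counting coefficients that are exactly zero. The following is a minimal sketch on synthetic data (an assumption: six features, of which only the first two influence the label). It assumes the installed scikit-learn version accepts the `penalty` argument shown in the signature above, and uses the `liblinear` solver for the $\ell_1$ model since the default `lbfgs` solver does not support that penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: only the first two of six features matter (hypothetical setup)
X = rng.normal(size=(300, 6))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Small C means strong regularization
l2_model = LogisticRegression(penalty='l2', C=0.1).fit(X, y)
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print('coefficients exactly zero with l2:', n_zero_l2)
print('coefficients exactly zero with l1:', n_zero_l1)
```

How many coefficients the $\ell_1$ model zeroes out depends on the data and on `C`, so the counts should be read as an illustration rather than a general result.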

Expanding a little more on the topic, the $\ell_p$ norm of a vector $x$ is defined as $||x||_p = (\sum_{i=1}^{m} |x_i|^p)^{1/p}$, where $p$ denotes the type of norm. In our case, the regularization is applied to our coefficients $\beta$, thus we have that

$$\ell_1 = ||\beta||_1 =\sum_{i=1}^{m} |\beta_i|$$

$$\ell_2 = ||\beta||_2 = (\sum_{i=1}^{m} |\beta_i|^2)^{1/2}$$

$$\ell_{elasticnet}=\alpha||\beta||_2^2 + (1-\alpha)||\beta||_1$$

where

$$\alpha = \frac{\lambda_2}{(\lambda_1 + \lambda_2)}$$

and $\lambda_1$, $\lambda_2$ are the weights placed on the $\ell_1$ and $\ell_2$ penalties respectively.

<em>More information on elasticnet can be found at https://web.stanford.edu/~hastie/TALKS/enet_talk.pdf</em>
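To make the $\ell_1$ and $\ell_2$ norms concrete, here is a small NumPy sketch. The helper `lp_norm` and the example vector are assumptions introduced for illustration; the results are checked against `numpy.linalg.norm`.

```python
import numpy as np

def lp_norm(beta, p):
    """Compute (sum_i |beta_i|^p)^(1/p)."""
    return np.sum(np.abs(beta) ** p) ** (1.0 / p)

# Hypothetical coefficient vector
beta = np.array([3.0, -4.0])

l1 = lp_norm(beta, 1)  # |3| + |-4| = 7
l2 = lp_norm(beta, 2)  # sqrt(9 + 16) = 5
print(l1, l2)

# Same values from NumPy's built-in norm
print(np.linalg.norm(beta, ord=1), np.linalg.norm(beta, ord=2))
```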



**fit_intercept** is a boolean parameter deciding whether the intercept $\beta_0$ should be included in the model or not.

**class_weight** refers to the balancing of the classes when the training data is skewed. For example, this is useful if we're predicting a binary label 0 or 1 and the vast majority of our training dataset is labelled 1, with far fewer observations labelled 0.
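As a sketch of what balancing means in practice, the weights below follow the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The 75/25 label vector is a made-up example.

```python
import numpy as np

# Hypothetical skewed labels: 75 observations of class 1, 25 of class 0
y = np.array([1] * 75 + [0] * 25)

counts = np.bincount(y)                    # observations per class: [25, 75]
weights = len(y) / (len(counts) * counts)  # balanced heuristic

for label, weight in enumerate(weights):
    print(f'class {label}: weight {weight:.3f}')
```

The rare class receives the larger weight, so its errors count for more during training. Passing `class_weight='balanced'` to `LogisticRegression` applies this kind of weighting automatically.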

**random_state** is the seed of the random number generator that some solvers use to shuffle the data. By fixing it, we make the results reproducible, so that when we compare different models the differences do not originate from differences in the random shuffling.

**solver** refers to the algorithm used in the optimization process. The decision should be made according to the size of the dataset, the number of classes we want to predict, and the chosen penalty (not every solver supports every penalty).

**max_iter** refers to the maximum number of iterations we are willing to tolerate. If the algorithm does not converge, it could theoretically run forever. By setting a large yet reasonable number of iterations, we should be able to get close to convergence even when the solver keeps wiggling around the minimum of the optimization problem without ever settling exactly on it.
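The interplay between an iteration cap and a convergence tolerance can be illustrated with a hand-written optimizer. The sketch below uses plain gradient descent, which is not one of scikit-learn's solvers; the function `fit_logistic_gd`, the learning rate and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def fit_logistic_gd(X, y, learning_rate=0.5, max_iter=100, tol=1e-4):
    """Fit a logistic regression by gradient descent.

    Stops when the largest gradient component falls below tol,
    or after max_iter iterations, whichever comes first.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # column of ones for beta_0
    beta = np.zeros(X_aug.shape[1])
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X_aug @ beta))
        gradient = X_aug.T @ (p - y) / len(y)
        if np.max(np.abs(gradient)) < tol:
            return beta, iteration, True   # converged
        beta -= learning_rate * gradient
    return beta, max_iter, False           # iteration cap reached

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * X[:, 0])))
y = (rng.random(200) < true_p).astype(float)

_, n_short, converged_short = fit_logistic_gd(X, y, max_iter=5)
_, n_long, converged_long = fit_logistic_gd(X, y, max_iter=10000)
print('max_iter=5     ->', n_short, 'iterations, converged:', converged_short)
print('max_iter=10000 ->', n_long, 'iterations, converged:', converged_long)
```

With a very small cap the optimizer stops before the tolerance is met, which is the situation scikit-learn reports with a convergence warning.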

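**multi_class** refers to the loss to be used when there are more than two classes. With "ovr" a separate binary problem is fit for each label (one-vs-rest), whereas with "multinomial" a single multinomial loss is fit across all classes. If we are dealing with a multiclass classification we can set it to "multinomial".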
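In the multinomial case, the logistic function is replaced by the softmax function, which turns one linear score per class into a vector of probabilities. Below is a minimal NumPy sketch; the scores are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1 in each row."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical linear scores for two observations and three classes
scores = np.array([[1.0, 2.0, 0.5],
                   [0.0, 0.0, 0.0]])

probs = softmax(scores)
predicted = probs.argmax(axis=1)  # class with the highest probability
print(probs)
print(predicted)
```

Equal scores give equal probabilities, and the predicted class is simply the one with the highest probability.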