<a href="https://colab.research.google.com/github/jp7252/ML4RM/blob/main/Class_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Logistic Regression - Training

- A logistic regression model uses the following equations to estimate the probability that $Y = 1$ given a single input feature $X$ (here, its size):

$$
Pr(Y=1|X=x)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}
$$
$$
Pr(Y=0|X=x)=1-Pr(Y=1|X=x)=\frac{1}{1+e^{\beta_0+\beta_1X}}
$$
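As a quick numerical check of the two formulas above, here is a minimal sketch in plain Python. The function name `prob_y1` and the coefficient values are made up for illustration; they are not fitted to any data.

```python
import math

def prob_y1(x, beta0, beta1):
    """Pr(Y=1 | X=x) under the logistic model."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -3.0, 0.5

for x in [0, 6, 12]:
    p1 = prob_y1(x, beta0, beta1)
    p0 = 1.0 - p1  # Pr(Y=0 | X=x)
    print(f"x={x:>2}  Pr(Y=1)={p1:.3f}  Pr(Y=0)={p0:.3f}")
```

With these values the linear term $\beta_0+\beta_1x$ is zero at $x = 6$, so the two probabilities are equal there; smaller $x$ favors class 0 and larger $x$ favors class 1.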

- How do we find the best $\beta_0$ and $\beta_1$? As with linear regression, most machine learning algorithms need a loss/cost function to optimize.
- Given an input X with n independent observations, the likelihood gives the joint probability of having the observations with the prescribed labels:

$$L(\beta_0, \beta_1) = \prod_{i:y_i=1}Pr(x_i,\beta_0,\beta_1) \prod_{i:y_i=0}\left(1-Pr(x_i,\beta_0,\beta_1)\right)$$

- The first product gives the probability of successfully predicting the “1”s and the second product is the probability of successfully predicting the “0”s in the given data.

- The likelihood function $L(\beta_0, \beta_1)$ gives the probability of making the same prediction as the observed data.
- Among all candidate pairs $(\beta_0, \beta_1)$, the pair with the higher $L$ assigns a higher probability to the observed class labels.
- We want to pick $\beta_0$ and $\beta_1$ to maximize the likelihood $L(\beta_0, \beta_1)$, i.e., to maximize the “agreement” of the selected model with the observed data.
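To make the comparison between candidate pairs concrete, the sketch below evaluates the likelihood for two pairs of coefficients on a tiny data set. The data and both coefficient pairs are invented for illustration, and `likelihood` is a hypothetical helper name.

```python
import math

def prob_y1(x, beta0, beta1):
    """Pr(Y=1 | X=x) under the logistic model."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(xs, ys, beta0, beta1):
    """Product of the probabilities the model assigns to the observed labels."""
    L = 1.0
    for x, y in zip(xs, ys):
        p = prob_y1(x, beta0, beta1)
        L *= p if y == 1 else (1.0 - p)
    return L

# Toy data (made up): larger x tends to go with label 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

L_a = likelihood(xs, ys, beta0=-3.5, beta1=1.0)  # slope agrees with the data
L_b = likelihood(xs, ys, beta0=3.5, beta1=-1.0)  # slope points the wrong way
print(L_a, L_b)
```

The first pair yields a much larger likelihood, so it "agrees" better with the labels and would be preferred.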

- In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood:

$$\log L(\beta_0, \beta_1)=\sum_{i=1}^{n}\left[y_i\log Pr(x_i, \beta_0, \beta_1) + (1-y_i)\log\left(1-Pr(x_i, \beta_0, \beta_1)\right)\right]$$

- We can then maximize it with optimization methods like [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method) or the [BFGS algorithm](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm).
- Most optimization routines minimize a loss function, so people often refer to the log loss when they talk about the optimization step in logistic regression:

$$\text{Log Loss} = -\log L(\beta_0, \beta_1)$$

- In practice, a penalty term is usually added to the loss/cost function, just as in ridge/lasso regression. See the [scikit-learn `LogisticRegression` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for details.
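The unpenalized log loss defined above can be minimized in a few lines. The sketch below uses plain gradient descent rather than Newton's method or BFGS, only because it fits in a short block; the toy data, the learning rate, the step count, and the helper names `log_loss` and `fit` are all assumptions made for illustration.

```python
import math

def prob_y1(x, beta0, beta1):
    """Pr(Y=1 | X=x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_loss(xs, ys, beta0, beta1):
    """Negative log-likelihood of the labels under (beta0, beta1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob_y1(x, beta0, beta1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total

def fit(xs, ys, lr=0.1, n_steps=5000):
    """Gradient descent on the log loss, starting from (0, 0)."""
    beta0, beta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the log loss: sums of (p_i - y_i) and (p_i - y_i) * x_i
        g0 = sum(prob_y1(x, beta0, beta1) - y for x, y in zip(xs, ys))
        g1 = sum((prob_y1(x, beta0, beta1) - y) * x for x, y in zip(xs, ys))
        # Divide by n so the step size does not depend on the sample size
        beta0 -= lr * g0 / n
        beta1 -= lr * g1 / n
    return beta0, beta1

# Toy data (made up), not linearly separable, so the optimum is finite
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit(xs, ys)
print("start loss:", log_loss(xs, ys, 0.0, 0.0))
print("final loss:", log_loss(xs, ys, b0, b1), "at", (b0, b1))
```

Library solvers reach the same kind of optimum far more efficiently; this loop is only meant to show what "minimizing the log loss" means operationally.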

In [None]:
import pandas as pd
df = pd.read_csv("https://drive.google.com/uc?id=1Ijs6Quta_ZAd3dsKWMvI6pxaHjpXgFoU")

In [None]:
df.sample(10)

Unnamed: 0,loan_amnt,int_rate,home_ownership,annual_inc,term,employment_years,loan_outcome
6865,6000.0,0.0762,OWN,,36.0,8.0,0
5128,12000.0,0.1114,RENT,105000.0,36.0,5.0,0
10524,10200.0,0.1099,RENT,45000.0,36.0,4.0,0
1596,5000.0,0.1114,MORTGAGE,50000.0,36.0,3.0,0
2382,7000.0,0.0967,RENT,36000.0,36.0,10.0,0
4841,15000.0,0.2049,RENT,36000.0,36.0,2.0,1
9180,1400.0,0.1824,RENT,41000.0,36.0,6.0,1
502,12000.0,0.1649,RENT,130000.0,36.0,6.0,0
10741,5600.0,0.1099,RENT,77000.0,36.0,4.0,0
7840,17875.0,0.1854,MORTGAGE,70000.0,60.0,10.0,1


In [None]:
y = df['loan_outcome']
X = df.drop('loan_outcome', axis=1)

In [None]:
X.head()

Unnamed: 0,loan_amnt,int_rate,home_ownership,annual_inc,term,employment_years
0,12500.0,0.1727,RENT,30000.0,60.0,5.0
1,12000.0,0.1629,RENT,88365.0,36.0,9.0
2,17500.0,0.1727,MORTGAGE,45000.0,60.0,7.0
3,6000.0,0.1349,RENT,50000.0,36.0,5.0
4,18000.0,0.079,RENT,56964.0,36.0,10.0


- We see that `home_ownership` is a categorical column with string values. However, a model cannot work with a string such as "RENT" directly, so we need a way to convert these values into numbers.
- One solution is one-hot encoding. Let's see how it is implemented.

In [None]:
X = pd.get_dummies(X)

In [None]:
X.head()

Unnamed: 0,loan_amnt,int_rate,annual_inc,term,employment_years,home_ownership_MORTGAGE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT
0,12500.0,0.1727,30000.0,60.0,5.0,False,False,False,True
1,12000.0,0.1629,88365.0,36.0,9.0,False,False,False,True
2,17500.0,0.1727,45000.0,60.0,7.0,True,False,False,False
3,6000.0,0.1349,50000.0,36.0,5.0,False,False,False,True
4,18000.0,0.079,56964.0,36.0,10.0,False,False,False,True


- Before we fit any model, we need to check if there are any missing values in the dataset.

In [None]:
X.isnull().sum()

Unnamed: 0,0
loan_amnt,118
int_rate,142
annual_inc,114
term,131
employment_years,144
home_ownership_MORTGAGE,0
home_ownership_OTHER,0
home_ownership_OWN,0
home_ownership_RENT,0


- KNN can be used to impute the missing values in the dataset. It fills each missing value with the average of that feature over the nearest neighbors. However, it can be really slow if the dataset is large.
- You can also manually impute the missing values depending on whether the column is numerical or categorical.
    - Use mean/median for numerical columns
    - Use mode for categorical columns
- **What if 20% of the values in a column are missing?**
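The manual approach from the list above might look like the following sketch. The toy frame is invented for illustration and is not the loan dataset; the choice of median over mean is an assumption that suits a column with an outlier.

```python
import pandas as pd

# Toy frame (made up) with one numerical and one categorical column
toy = pd.DataFrame({
    "annual_inc": [30000.0, None, 45000.0, 50000.0, 200000.0],
    "home_ownership": ["RENT", "RENT", None, "MORTGAGE", "RENT"],
})

# Numerical column: the median is less sensitive to outliers than the mean
toy["annual_inc"] = toy["annual_inc"].fillna(toy["annual_inc"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["home_ownership"] = toy["home_ownership"].fillna(toy["home_ownership"].mode()[0])

print(toy)
```

Note that for a categorical column this has to happen before one-hot encoding, while the column still holds the original category labels.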

In [None]:
from sklearn.impute import KNNImputer

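# Fill each missing value with the mean of its 5 nearest neighbors
imputer = KNNImputer(n_neighbors=5)
X = imputer.fit_transform(X)  # returns a NumPy array, so column names are lost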

In [None]:
pd.DataFrame(X).isnull().sum()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0


### Imbalanced dataset

In [None]:
y.value_counts()

loan_outcome,count
0,10060
1,2201


- Using the `stratify` parameter when splitting the dataset makes sure that the training and test sets have a similar class distribution.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
sum(y_train)/y_train.shape[0]

0.17956187368911675

In [None]:
sum(y_test)/y_test.shape[0]

0.17939657515629248

- Pay attention to the `class_weight` parameter. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.
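For intuition, the "balanced" weights can be reproduced by hand. scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to the class counts shown by `y.value_counts()` above; the model itself is fitted on `y_train`, so the weights it actually uses would differ slightly.

```python
# Class counts taken from y.value_counts() above (full dataset, before the split)
counts = {0: 10060, 1: 2201}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(weights)
```

The minority class (defaults) gets the larger weight, and each class ends up contributing the same total weight to the loss.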

In [None]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(class_weight="balanced")
logit.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
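
The warning above suggests two remedies: scale the data or raise `max_iter`. A minimal sketch of both, assuming a synthetic stand-in for the loan data (generated with `make_classification`, not the real dataset) and an arbitrary `max_iter=1000`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=42
)
X_demo[:, 0] *= 10000.0  # mimic a column on a much larger scale, like loan_amnt

# Standardize each feature, then fit; max_iter is raised from the default of 100
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_demo, y_demo)

n_iter = model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Putting the scaler inside a pipeline means it is fitted only on the data passed to `fit`, which avoids leaking test-set statistics when the pipeline is trained on `X_train`.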


In [None]:
logit.score(X_test, y_test)

0.5444414243000816

In [None]:
y_pred = logit.predict(X_test)
sum(y_pred)

1902

In [None]:
y_pred[:10]

array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

In [None]:
logit.coef_
# the first three coefficients correspond to loan amount, interest rate, annual income

array([[ 2.89760350e-05,  7.78742953e-02, -8.47181231e-06,
         1.59917703e-02, -1.62706585e-02, -3.54476467e-01,
        -4.41600319e-03,  2.01808480e-02,  8.84459370e-02]])

### How to interpret the beta coefficients?

- Suppose we have a fair 6-sided die. Consider the following events:
    - Success: Rolling a 2 or 5.
    - Failure: Rolling a 1, 3, 4, or 6.
- What is the probability p of success?
$$p = \frac{\text{number of successful outcomes}}{\text{number of possible outcomes}} = \frac{2}{6}=\frac{1}{3}$$

- What are the odds of success?
$$\text{Odds}=\frac{p}{1-p}=\frac{1/3}{2/3}=\frac{1}{2}$$

- To see how odds arise naturally in logistic regression, start from the model and derive:

$$p=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$

$$1-p=\frac{1}{1+e^{\beta_0+\beta_1X}}$$

$$\frac{p}{1-p}=e^{\beta_0+\beta_1X}$$

$$\log\left(\frac{p}{1-p}\right)=\log\left(e^{\beta_0+\beta_1X}\right)=\beta_0+\beta_1X$$

- Thus logistic regression can be viewed as a linear regression on the log odds.
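
One consequence of the log-odds form is that increasing $X$ by one unit adds $\beta_1$ to the log odds, which multiplies the odds by $e^{\beta_1}$. The sketch below checks this numerically; the coefficients and the value of `x` are hypothetical and are not taken from the fitted model.

```python
import math

def prob_y1(x, beta0, beta1):
    """Pr(Y=1 | X=x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1.0 - p)

# Die example from above: p = 1/3 gives odds of about 1/2
print(odds(1 / 3))

# Hypothetical coefficients, for illustration only
beta0, beta1 = -2.0, 0.7

x = 1.5
ratio = odds(prob_y1(x + 1, beta0, beta1)) / odds(prob_y1(x, beta0, beta1))
print(ratio, math.exp(beta1))  # the two values agree up to rounding
```

The ratio does not depend on the starting value of `x`, which is why $e^{\beta_1}$ is usually reported as the odds ratio for a feature.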

### Choosing the right metrics

- **Precision**: the fraction of predicted positive cases that are truly positive
    $$Precision=\frac{TP}{TP+FP}$$

- **Recall**: the fraction of actual positive cases that the model detects
    $$Recall=\frac{TP}{TP+FN}$$

- A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score.

$$F1=2\cdot\frac{precision\cdot recall}{precision+recall}$$

In [None]:
from sklearn.metrics import classification_report

y_pred = logit.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.52      0.65      3019
           1       0.23      0.67      0.35       660

    accuracy                           0.54      3679
   macro avg       0.56      0.59      0.50      3679
weighted avg       0.76      0.54      0.60      3679



- Which one should we focus on given the context of the dataset?
    - If a loan defaulted but the model did not flag it, what is the loss?
    - If a loan did not default but the model predicted that it would, what is the loss?

In [None]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

[[1560 1459]
 [ 217  443]]


In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [None]:
(tn, fp, fn, tp)

(1560, 1459, 217, 443)
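
As a cross-check, the metrics for class 1 in the classification report can be recomputed from these four counts using the formulas from this section. This is a small sketch with the counts copied from the output above:

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 1560, 1459, 217, 443

precision = tp / (tp + fp)  # of the loans flagged as default, how many defaulted
recall = tp / (tp + fn)     # of the loans that defaulted, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.23 recall=0.67 f1=0.35 accuracy=0.54
```

These match the row for class 1 and the accuracy line in the report: the model catches about two thirds of the defaults, at the cost of many false alarms.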