### Logistic Regression Example 3.1
The estimated coefficients are $\hat{\beta}_{0} = -10.6513$ and $\hat{\beta}_{1} = 0.0055$.

Thus, if an individual has **balance = 1000**, then the model yields
\begin{equation}
\hat{p}(1000)
=\dfrac{e^{-10.65+ 0.0055\cdot 1000}}{1+e^{-10.65+ 0.0055\cdot 1000}}
\approx 0.00577
\end{equation}

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load data
df = pd.read_csv('./data/Default.csv', sep=';')

# Add a numerical (0/1) column for default
df = df.join(pd.get_dummies(df['default'],
                            prefix='default',
                            drop_first=True,
                            dtype=int))

# Fit logistic model
x = df['balance']
y = df['default_Yes']

x_sm = sm.add_constant(x)

model = sm.GLM(y, x_sm, family=sm.families.Binomial())
model = model.fit()

# Predict for balance = 1000 (the leading 1 is the constant term)
x_pred = [1, 1000]
y_pred = model.predict(x_pred)

print(y_pred)

This default probability is well below $1\%$. However, an individual with **balance = 2000** has a default probability
of approximately $59\%$.
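
Both probabilities can be checked by evaluating the logistic function directly. The following is a minimal sketch that hard-codes the coefficient estimates quoted above; the helper name `default_probability` is introduced here for illustration only.

```python
import math

# Coefficient estimates quoted in the text (rounded to four decimals)
BETA_0 = -10.6513
BETA_1 = 0.0055


def default_probability(balance):
    """Evaluate the fitted logistic function at a given balance."""
    eta = BETA_0 + BETA_1 * balance        # linear predictor
    return 1.0 / (1.0 + math.exp(-eta))    # same as e^eta / (1 + e^eta)


for balance in (1000, 2000):
    print(balance, round(default_probability(balance), 4))
```

Because the coefficients are rounded, the results may differ from the output of `model.predict` in the last digits.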

In [None]:
# Predict for balance = 2000
x_pred = [1, 2000]
y_pred = model.predict(x_pred)

print(y_pred)

### Logistic Regression Example 3.2
For the **Default** data, the following **Python** code computes the training classification error.

In [None]:
""" Follows Example 3.1 """
# Predict for training data
x_pred = x_sm
y_pred = model.predict(x_pred)
print(y_pred[10])

# Round to 0 or 1
y_pred = y_pred.round()
print(y_pred[10])

# Compute training error
e_train = abs(y - y_pred)
e_train = e_train.mean()

print(e_train)

The training error in this example is 0.0275, that is, approximately 97.25% of the cases in the training set are classified correctly.
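
The training error is the fraction of observations whose predicted class differs from the true class. The sketch below uses made-up labels (the arrays `y_true` and `p_hat` are illustrative and not taken from the **Default** data) to show that, for 0/1 labels, the mean absolute difference used above coincides with the misclassification rate.

```python
import numpy as np

# Illustrative labels and estimated probabilities (not from the Default data)
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p_hat = np.array([0.05, 0.40, 0.80, 0.60, 0.30, 0.10, 0.20, 0.90])

# Classify as 1 when the estimated probability exceeds 0.5
y_hat = (p_hat > 0.5).astype(int)

# Mean absolute difference, as in the example above
e_abs = np.abs(y_true - y_hat).mean()

# Fraction of mismatches, i.e. the misclassification rate
e_mis = (y_true != y_hat).mean()

# Training error (two equivalent forms) and the share classified correctly
print(e_abs, e_mis, 1 - e_mis)
```

Writing the threshold explicitly, rather than rounding, makes the cutoff visible and easy to change.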

### Logistic Regression Example 3.3
The following **Python** code produces the confusion matrix for the **Default** data set and the logistic regression model.

In [None]:
""" Follows Example 3.2 """
# Create confusion matrix
confusion = pd.DataFrame({'predicted': y_pred,
                          'true': y})
confusion = pd.crosstab(confusion.predicted, confusion.true, 
                        margins=True, margins_name="Sum")

print(confusion)

In [None]:
import seaborn as sn

# fmt='d' prints the counts as integers instead of scientific notation
sn.heatmap(confusion, annot=True, fmt='d')

Out of $9667$ cases with **default=No**, the vast majority, $9625$, are classified correctly. In contrast, only approximately $1/3$
of the **default=Yes** cases are classified correctly. The confusion matrix thus shows that the present classification scheme is of little use, in particular
if you want to predict the case of **default=Yes**.

The reason for this poor result is the *imbalance* of the two classes. The training data contains only $333$ out of $10000$ cases with **default=Yes**.
Therefore, the likelihood function is dominated by the factors corresponding to **default=No**, so the parameters are chosen mainly to fit those
cases. Note also that the trivial classifier that assigns $\hat{f}(x)=0$ to every observation $x$ has a classification error of $333/10000=0.0333$, which
is not much worse than that of our logistic model.
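
The figures quoted above determine the whole confusion matrix. The sketch below recovers the remaining cells by arithmetic, assuming that the training error $0.0275$ from Example 3.2 refers to the same predictions; the variable names are chosen here for illustration.

```python
# Figures quoted in the text
n_total = 10000
n_yes = 333                  # true default=Yes cases
n_no = n_total - n_yes       # true default=No cases
tn = 9625                    # default=No cases classified correctly
e_train = 0.0275             # training error from Example 3.2

# The remaining cells follow by arithmetic
fp = n_no - tn                       # No cases predicted as Yes
n_wrong = round(e_train * n_total)   # total number of misclassified cases
fn = n_wrong - fp                    # Yes cases predicted as No
tp = n_yes - fn                      # Yes cases classified correctly

acc_no = tn / n_no            # share of default=No classified correctly
acc_yes = tp / n_yes          # share of default=Yes classified correctly
e_trivial = n_yes / n_total   # error of always predicting default=No

print(fp, fn, tp)
print(round(acc_no, 4), round(acc_yes, 4), e_trivial)
```

The per-class shares make the imbalance explicit: the overall error hides how differently the two classes are treated.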

The situation can also be visualized by the histograms of the estimated probabilities of **default=Yes** separated by true class.

It is striking that the **default=No** group has a high concentration of probabilities near $0$, which is reasonable for this group. On the other hand,
the estimated probabilities for the **default=Yes** cases do not exhibit high mass at $1$. Instead, their histogram peaks close to $0$ as well!
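
Such class-wise histograms could be computed along the following lines. This is only a sketch on synthetic data: the balance distribution (normal with mean 800 and standard deviation 500, truncated at zero) is an assumption made here, not the real **Default** data, and the probabilities use the rounded coefficients from Example 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Default data (assumption, not the real data)
balance = np.clip(rng.normal(loc=800, scale=500, size=10000), 0, None)

# Probabilities from the logistic model with the coefficients of Example 3.1
p_hat = 1.0 / (1.0 + np.exp(-(-10.6513 + 0.0055 * balance)))

# Simulate the true class from these probabilities
y_true = rng.binomial(1, p_hat)

# Histogram counts over ten equal-width bins on [0, 1], separated by true class
bins = np.linspace(0, 1, 11)
counts_no, _ = np.histogram(p_hat[y_true == 0], bins=bins)
counts_yes, _ = np.histogram(p_hat[y_true == 1], bins=bins)

# One row per bin, labelled by the lower bin edge
for lower, c_no, c_yes in zip(bins[:-1], counts_no, counts_yes):
    print(f"from {lower:.1f}:  No = {c_no:5d}  Yes = {c_yes:4d}")
```

With the fitted model from Example 3.1, `p_hat` and `y_true` would instead be the model's predictions on the training data and the column `default_Yes`; the two groups can then be passed to a plotting routine such as matplotlib's `hist`.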

### Logistic Regression Example 3.8
We can compute the F1 score by means of the `f1_score` function from `sklearn.metrics`.

In [None]:
""" Follows Example 3.3 """
from sklearn.metrics import f1_score

# Find F1-score
f1 = f1_score(y, y_pred, pos_label=1, average='binary')
print(f1)


**pos\_label** is an optional argument specifying the class label that is treated as the *positive* result (default: 1).
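
To see what this score measures, it can be computed by hand as the harmonic mean of precision and recall. The counts in the sketch below are not printed in the text; they are derived from the figures quoted in Examples 3.2 and 3.3 ($9667$ **default=No** cases, $9625$ of them classified correctly, $333$ **default=Yes** cases, training error $0.0275$) and should be read as an assumption.

```python
# Confusion-matrix cells with default=Yes as the positive class
# (derived from the figures quoted in Examples 3.2 and 3.3)
tp = 100    # Yes predicted as Yes
fp = 42     # No predicted as Yes
fn = 233    # Yes predicted as No

precision = tp / (tp + fp)    # share of predicted positives that are correct
recall = tp / (tp + fn)       # share of true positives that are found

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The low recall pulls the F1 score down even though the overall training error is small.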

### Logistic Regression Example 3.9
If we consider the case **default=No** as positive, then the F1 score changes to

In [None]:
""" Follows Example 3.8 """
# Find F1-score
f1 = f1_score(y, y_pred, pos_label=0, average='binary')
print(f1)
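
Changing the positive class swaps the roles of the cells in the confusion matrix. The sketch below illustrates this with a hypothetical helper `f1_from_confusion`; the matrix entries are derived from the figures quoted in Examples 3.2 and 3.3 and are an assumption, since the printed matrix is not reproduced in the text.

```python
import numpy as np

# Rows = predicted class, columns = true class (0 = No, 1 = Yes);
# cells derived from the figures quoted in Examples 3.2 and 3.3
confusion = np.array([[9625, 233],
                      [42, 100]])


def f1_from_confusion(cm, pos):
    """F1 score when class `pos` (0 or 1) is treated as positive."""
    tp = cm[pos, pos]              # predicted pos, truly pos
    fp = cm[pos, :].sum() - tp     # predicted pos, truly the other class
    fn = cm[:, pos].sum() - tp     # truly pos, predicted the other class
    return 2 * tp / (2 * tp + fp + fn)


print(f1_from_confusion(confusion, pos=1))   # default=Yes as positive
print(f1_from_confusion(confusion, pos=0))   # default=No as positive
```

With **default=No** as positive, the many correctly classified majority cases dominate, so the score is close to 1 and says little about how well defaults are detected.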


### Logistic Regression Example 3.10
We analyze the **Default** data set and fit a logistic regression model after downsampling the **default=No** class to the same size as the **default=Yes** class.

In [None]:
""" Follows Example 3.9 """
# Set random seed
np.random.seed(1)
# Index of Yes:
i_yes = df.loc[df['default_Yes'] == 1, :].index

# Random set of No, same size as the Yes class:
i_no = df.loc[df['default_Yes'] == 0, :].index
i_no = np.random.choice(i_no, replace=False, size=len(i_yes))

# Fit logistic model on downsampled data
i_ds = np.concatenate((i_no, i_yes))
x_ds = df.loc[i_ds, 'balance']
y_ds = df.loc[i_ds, 'default_Yes']

# Separate name so the design matrix of the full data (x_sm) is not overwritten
x_sm_ds = sm.add_constant(x_ds)

model_ds = sm.GLM(y_ds, x_sm_ds, family=sm.families.Binomial())
model_ds = model_ds.fit()

# Predict for downsampled data
x_pred_ds = x_sm_ds
y_pred_ds = model_ds.predict(x_pred_ds)

# Round to 0 or 1
y_pred_ds = y_pred_ds.round()

# Classification error on training data:
e_train = abs(y_ds - y_pred_ds)
e_train = e_train.mean()

print(np.round(e_train, 4))

In [None]:
# Create confusion matrix
confusion = pd.DataFrame({'predicted': y_pred_ds,
                          'true': y_ds})
confusion = pd.crosstab(confusion.predicted, confusion.true, 
                        margins=True, margins_name="Sum")

print(confusion)

In [None]:
# Print F1-scores
f1_pos = f1_score(y_ds, y_pred_ds, pos_label=1, average='binary')
f1_neg = f1_score(y_ds, y_pred_ds, pos_label=0, average='binary')

print('\nF1-Score (positive = default) = \n', f1_pos,
      '\nF1-Score (positive = not-default) = \n', f1_neg)

On the downsampled training set, the confusion matrix is balanced and the classification error is 0.1171, which amounts to 88.29% correctly classified samples. The F1 score with **default=Yes** as the positive case has improved considerably.

Furthermore, the histograms of the predicted probabilities have a completely different shape than before: the separation of the two classes becomes clearly visible.