In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
from IPython.display import Image

# Logistic Regression and Overfitting / Underfitting

Sources:

- RITHP, **Logistic Regression and regularization: Avoiding overfitting and improving generalization,** https://medium.com/@rithpansanga/logistic-regression-and-regularization-avoiding-overfitting-and-improving-generalization-e9afdcddd09d
- Nilesh Parashar, **From Generalization to Overfitting: The Science Behind Data Overfitting,** https://medium.com/@niitwork0921/from-generalization-to-overfitting-the-science-behind-data-overfitting-65f5c6901729
- Charles Chi, **Overfitting and Underfitting during Model Training,** https://medium.com/ai-assimilating-intelligence/overfitting-and-underfitting-in-model-training-e0b14a89bd49
- Ivan Zakharchuk, **Generalization, Overfitting, and Under-fitting in Supervised Learning,** https://ivanzakharchuk.medium.com/generalization-overfitting-and-underfitting-in-supervised-learning-a21f02ebf3df


#### Overfitting in logistic regression:

- Occurs when the model has too many parameters (degrees of freedom) relative to the size of the training data.
- Leads to high training accuracy but low test accuracy. 
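
The gap between training and test accuracy can be reproduced on a small synthetic problem. The following is only a sketch: the dataset from `make_classification`, the feature counts, and the very large `C` (almost no regularization) are assumptions chosen for illustration, not part of the churn example used later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 200 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=200, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A very large C means the penalty is almost switched off
model = LogisticRegression(C=1e4, max_iter=5000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With more features than training samples, the model can fit the training set almost perfectly while doing noticeably worse on unseen data.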

In [None]:
Image(filename="figures/regularization_1.png", width=500)

In [None]:
Image(filename="figures/regularization_2.png", width=500)

#### Regularization:

- A technique that reduces overfitting by adding a **penalty term** to the objective function (loss function) that the model is trying to minimize. 
- The penalty term **reduces the effective complexity of the model** by discouraging large coefficients.
- This improves generalization by **reducing the variance** of the model, at the cost of a small increase in bias.

#### Types of regularization:

- **L1 (Lasso)** regularization:
    * Adds a penalty term to the objective function proportional to the sum of the absolute values of the coefficients.
    * Leads to a sparse model, where many of the coefficients are exactly equal to zero.
    * Useful for feature selection because it can automatically identify and remove unnecessary or redundant features.
      
- **L2 (Ridge)** regularization (the default in scikit-learn's `LogisticRegression`):
    * Adds a penalty term to the objective function proportional to the sum of the squared coefficients.
    * Leads to a model with all coefficients close to zero, but not necessarily equal to zero.
    * Often a reasonable default when most features are expected to contribute, because it shrinks coefficients smoothly instead of discarding features.
      
- **Elastic Net** regularization:
    * Combines L1 and L2 regularization by adding a penalty term that is a weighted sum of the absolute values and the squares of the coefficients.
    * Leads to a model with some coefficients equal to zero and some close to zero.
    * Useful when there are correlated features in the data (Lasso is prone to selecting only one of them).
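
The sparsity difference between L1 and L2 can be checked by counting coefficients that are exactly zero. This is a minimal sketch on synthetic data; the dataset, `C=0.05`, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 20 features, only 3 carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# L1 needs a solver that supports it, e.g. 'liblinear'
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.05).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print(f"Zero coefficients with L1: {n_zero_l1} of {l1_model.coef_.size}")
print(f"Zero coefficients with L2: {n_zero_l2} of {l2_model.coef_.size}")
```

The L1 model is expected to drop most of the uninformative features, while the L2 model keeps all coefficients small but nonzero.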

#### Regularization procedure:

- Choose strength of regularization (hyperparameter) via cross-validation.
- Select the hyperparameter that gives the best performance on the validation set. 
- Try a range of values on a logarithmic scale (e.g. 10⁻⁶ to 10⁶). 
- Use a grid search to try a range of values for multiple hyperparameters at once.
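
The procedure above can be sketched with `GridSearchCV`. The synthetic dataset, the 5-fold setting, and the choice of `liblinear` (so that both `'l1'` and `'l2'` can be searched) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption), standing in for a real training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Logarithmic grid for C, searched jointly with the penalty type
param_grid = {
    "C": np.logspace(-6, 6, 13),
    "penalty": ["l1", "l2"],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_penalty = search.best_params_["penalty"]
best_score = search.best_score_
print(f"Best C: {best_C:g}, best penalty: {best_penalty}, "
      f"mean CV accuracy: {best_score:.2f}")
```

The final model should still be evaluated on a held-out test set that was not used during the search.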


**Other techniques to reduce overfitting:**

- **Feature selection and dimensionality reduction methods**: 
    * Keep the most informative features and discard the rest. 

- **Ensemble techniques:**
    * Ensemble techniques like **bagging** and **boosting** combine multiple models. 
    * Combining the predictions of several models improves generalization.
    * It also mitigates the effect of individual model biases.
      
- **Increasing the amount of training data** and **simplifying the model**

- **Pruning (for decision trees):** 
    * Pruning reduces the size of decision trees by cutting off branches that have little importance, thus simplifying the model and reducing the risk of overfitting.

- **Dropout (for neural networks):**
    * Dropout randomly disables a fraction of neurons during training.
    * This helps prevent the network from becoming overly dependent on any specific set of features, i.e. it helps the network generalize.
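
As an illustration of pruning, the sketch below compares an unpruned decision tree with one pruned via scikit-learn's cost-complexity parameter `ccp_alpha`. The synthetic dataset (with 10% label noise) and the value `ccp_alpha=0.01` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (assumption) so the full tree can overfit
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unpruned tree: grows until the training data is fit as closely as possible
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: branches that do not justify their added complexity are removed
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned_tree.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: {tree.tree_.node_count} nodes, "
          f"train accuracy {tree.score(X_train, y_train):.2f}, "
          f"test accuracy {tree.score(X_test, y_test):.2f}")
```

A suitable `ccp_alpha` would normally be chosen by cross-validation rather than fixed by hand.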

#### How to implement logistic regression with regularization in Python:

- Use the `LogisticRegression` class in `scikit-learn` with the **`penalty`** and **`C`** hyperparameters.

- Set the **`penalty`** hyperparameter to **`'l1'`, `'l2'`,** or **`'elasticnet'`.** The default solver (`lbfgs`) supports only `'l2'`; `'l1'` requires `solver='liblinear'` or `solver='saga'`, and `'elasticnet'` requires `solver='saga'` together with an `l1_ratio` value.

- The **`C`** hyperparameter is the *inverse* of the regularization strength (default `C=1.0`):
    * A smaller value for `C` (e.g. C=0.01) leads to stronger regularization and a simpler model.
    * A larger value (e.g. C=1.0) leads to weaker regularization and a more complex model.
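
The effect of `C` can be seen in the size of the learned coefficients. The sketch below uses a synthetic dataset and three arbitrary values of `C` (both are assumptions for illustration) and reports the Euclidean norm of the coefficient vector for each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption), standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C = stronger L2 penalty = coefficients pulled toward zero
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

The norm should grow as `C` increases, reflecting the weaker penalty.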

#### Example:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df_churn_pd = pd.read_csv("/data/IFI8410/sess09/mergedcustomers_missing_values_GENDER.csv")
df_churn_pd.head()

In [None]:
# Remove columns that are not required
df_churn_pd = df_churn_pd.drop(['ID'], axis=1)

In [None]:
# Prepare the features and the label for splitting into train and test datasets
features = df_churn_pd.drop(['CHURNRISK'], axis=1)

# Encode the CHURNRISK label as integers
label_encoder = LabelEncoder()
label = label_encoder.fit_transform(df_churn_pd['CHURNRISK'])
print("Encoded value of Churnrisk after applying label encoder : " + str(label))

In [None]:
# Assign the features and the encoded label
X = features
y = label

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### L1 regularization

In [None]:
# Set the regularization type (L1, L2, Elastic Net)
penalty = 'l1'

# The default solver ('lbfgs') does not support the L1 penalty,
# so use 'liblinear' ('saga' would also work)
solver = 'liblinear'

# Try several values of C (inverse regularization strength):
# a smaller C means stronger regularization
for C in [0.01, 0.1, 1.0]:
    # Create a logistic regression model
    model = LogisticRegression(penalty=penalty, C=C, solver=solver)

    # Train the model on the training set
    model.fit(X_train, y_train)

    # Evaluate the model on the test set
    accuracy = model.score(X_test, y_test)

    print(f'C={C}: Test accuracy: {accuracy:.2f}')

#### L2 regularization

In [None]:
# Set the regularization type (L1, L2, Elastic Net)
penalty = 'l2'

# Try several values of C (inverse regularization strength):
# a smaller C means stronger regularization
for C in [0.01, 0.1, 1.0]:
    # Create a logistic regression model
    model = LogisticRegression(penalty=penalty, C=C)

    # Train the model on the training set
    model.fit(X_train, y_train)

    # Evaluate the model on the test set
    accuracy = model.score(X_test, y_test)

    print(f'C={C}: Test accuracy: {accuracy:.2f}')

#### Elastic net regularization

In [None]:
# Set the regularization type (L1, L2, Elastic Net)
penalty = 'elasticnet'

# Elastic Net is only supported by the 'saga' solver and requires l1_ratio
# (0 = pure L2, 1 = pure L1). The values of l1_ratio and max_iter below are
# assumptions chosen for illustration; 'saga' converges faster on scaled features.
solver = 'saga'
l1_ratio = 0.5

# Try several values of C (inverse regularization strength):
# a smaller C means stronger regularization
for C in [0.01, 0.1, 1.0]:
    # Create a logistic regression model
    model = LogisticRegression(penalty=penalty, C=C, solver=solver,
                               l1_ratio=l1_ratio, max_iter=5000)

    # Train the model on the training set
    model.fit(X_train, y_train)

    # Evaluate the model on the test set
    accuracy = model.score(X_test, y_test)

    print(f'C={C}: Test accuracy: {accuracy:.2f}')

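#### Underfitting:

Underfitting occurs when the model is too simple to capture the underlying patterns, so it performs poorly on both the training and the test data. Ways to address it: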

- **Increase Model Complexity:** 
    * A more complex model may be necessary to capture the nuances of your data.
    * Add more layers to a neural network.
    * Use a more sophisticated algorithm to learn more complex patterns.

- **Feature Engineering:** 
    * Improve model performance by creating new features.
    * Useful features are often transformations (e.g. polynomial or interaction terms) that better expose the relationships between variables. 

- **Reduce Regularization:**
    * Reduce the strength of regularization.

- **Increase Training Time:**
    * Allow more time for training.
    * Run more epochs or iterations to give the model additional opportunities to learn from the data.
    * Related: tuning the learning rate may help as well.

- **Add More Data:**
    * If feasible, incorporate more training data to help the model to identify and learn the underlying structure of the dataset. 
    * More data provides a broader representation of the problem space, potentially improving the model’s accuracy and generalization.
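
Feature engineering as a remedy for underfitting can be sketched on data with a circular class boundary, which a plain logistic regression cannot separate. The `make_circles` dataset and the degree-2 polynomial features are assumptions chosen for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles (assumption): not separable by a straight line
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Plain logistic regression underfits: the decision boundary is linear
linear_model = LogisticRegression().fit(X_train, y_train)
linear_acc = linear_model.score(X_test, y_test)

# Adding squared and interaction terms lets the same model fit a circle
poly_model = make_pipeline(PolynomialFeatures(degree=2),
                           LogisticRegression(max_iter=1000))
poly_model.fit(X_train, y_train)
poly_acc = poly_model.score(X_test, y_test)

print(f"Linear features: test accuracy {linear_acc:.2f}")
print(f"Degree-2 features: test accuracy {poly_acc:.2f}")
```

The classifier is unchanged in both cases; only the feature representation differs.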

In [None]:
Image(filename="figures/regularization_3.png", width=500)

In [None]:
Image(filename="figures/regularization_4.png", width=500)