# [3920] Homework 2 - Logistic Regression
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-3920/main/data/credit_card_churners_1_2500.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 2:30 PM on the due date
  * No late submissions will be accepted
* Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework 2 Requirements
* Load data into dataframe
* Examine data
* Use SimpleImputer to replace missing values
* Prepare data for model training
* Train Logistic Regression model
  * If you get errors, change appropriate hyperparameters to eliminate errors
* Calculate and display model accuracy
* Re-train Logistic Regression model to achieve accuracy > 91%
  * Change hyperparameters accordingly to achieve this accuracy level
  * If you used hyperparameter random_state in your initial model training, do NOT change this value during model retrainings
  * Do NOT re-split training and test sets during model retrainings
* Calculate and display re-trained model accuracy

In [628]:
from datetime import datetime
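# Explicit format codes produce the same MM/DD/YY HH:MM:SS stamp as "%D %T" and also work on Windows
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')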

Run time: 06/10/25 20:10:41


### Import libraries

In [629]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Load data

#### Credit Card Churn Prediction
* https://www.kaggle.com/datasets/anwarsan/credit-card-bank-churn

Business Problem  
A business manager at a consumer credit card bank is facing customer attrition. They want to analyze the data to find the reasons behind it and use those findings to predict which customers are likely to leave.

Columns
* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - High School, College, Post-Graduate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: Revolving balance, i.e. the balance that carries over from one month to the next
* Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
* Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
* Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

In [630]:
# Read data from file (credit_card_churners_1_2500.csv) into dataframe
#  NOTE: Use CLIENTNUM as the index column
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-3920/main/data/credit_card_churners_1_2500.csv', index_col='CLIENTNUM')
SEED = 645  # Random seed reused for the train/test split and the models

### Examine data

In [631]:
# Review dataframe shape
df.shape

(2500, 23)

In [632]:
# Display first few rows of dataframe
df.head()

CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
719999508,1,37.0,3,0.0,1.0,0.0,24,3,1,2,...,1.561,2438,45,0.607,0.069,0,1,0,1,0
716713533,1,53.0,3,2.0,2.0,0.0,44,5,2,2,...,0.542,3393,58,0.871,0.065,0,1,0,0,1
711800658,0,42.0,3,0.0,0.0,0.0,36,2,3,3,...,0.577,2465,42,0.355,0.0,1,0,0,0,1
719384433,0,44.0,3,0.0,0.0,0.0,28,5,2,3,...,0.654,2581,57,0.781,0.641,1,0,1,0,0
718894233,1,53.0,2,1.0,1.0,0.0,36,6,2,4,...,0.698,2116,63,0.575,0.471,1,0,0,1,0


In [633]:
# Display distribution counts for target variable Attrition_Flag
df.Attrition_Flag.value_counts()

Attrition_Flag
1    2113
0     387
Name: count, dtype: int64
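
The counts above show a clear class imbalance, which matters when reading accuracy figures later. As a rough reference point, the sketch below computes the accuracy of a trivial model that always predicts the majority class. The counts are copied from the output above, and the names `class_counts` and `baseline_accuracy` are illustrative only. The figure is for the full dataset; the share in a particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 2113, 0: 387}

total_rows = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total_rows

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.4f}')  # 2113 / 2500 = 0.8452
```

Any accuracy reported for the trained models is best read against this baseline of roughly 84.5% rather than against 50%.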

### Prepare data

#### Check for missing values

In [634]:
df.isna().sum()

Attrition_Flag                0
Customer_Age                109
Dependent_count               0
Education_Level               0
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
Gender_F                      0
Gender_M                      0
Marital_Status_Divorced       0
Marital_Status_Married        0
Marital_Status_Single         0
dtype: int64

#### Use the SimpleImputer to replace missing values

In [635]:
imputer = SimpleImputer(strategy='median')

In [636]:
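# fit_transform returns a NumPy array, so pass the original index to keep CLIENTNUM as the row labels
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)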
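
To make the imputation step concrete, here is a minimal sketch on a made-up five-row frame (the values and the `toy_*` names are hypothetical, not taken from the churn file). With `strategy='median'`, each missing entry is replaced by the median of the observed values in the same column. Passing `index=` when rebuilding the dataframe keeps the original row labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two missing ages, no missing dependents
toy = pd.DataFrame(
    {'Customer_Age': [30.0, np.nan, 50.0, 40.0, np.nan],
     'Dependent_count': [1, 2, 3, 0, 4]},
    index=[101, 102, 103, 104, 105],
)

toy_imputer = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so columns and index are restored explicitly
toy_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_imputer.statistics_)  # per-column medians learned during fit
print(toy_filled)
```

The observed ages are 30, 40 and 50, so both missing ages become 40.0. Note that every column comes back as float, which is why integer-coded columns print with a trailing decimal point after imputation.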

#### Check for missing values again

In [637]:
df.isna().sum()

Attrition_Flag              0
Customer_Age                0
Dependent_count             0
Education_Level             0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Gender_F                    0
Gender_M                    0
Marital_Status_Divorced     0
Marital_Status_Married      0
Marital_Status_Single       0
dtype: int64

### Separate independent and dependent variables
* Independent variables: All remaining variables except Attrition_Flag
* Dependent variable: Attrition_Flag

In [638]:
X = df.drop('Attrition_Flag', axis=1)  # Independent variables (all feature columns)
y = df['Attrition_Flag']  # Dependent variable (target)

### Split data into training and test sets

In [639]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

### Train Logistic Regression model

In [640]:
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### If the above results in an error or warning, review the message, look up the documentation for LogisticRegression, change the appropriate model hyperparameter(s), and re-train the model
* Repeat until there is no error
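
The warning text above names two remedies: raising `max_iter` or scaling the data. The sketch below illustrates the scaling route on synthetic stand-in data, since it is not the approach used in this notebook. The data from `make_classification`, the inflated columns, and all `*_demo` / `*d_*` names are assumptions for illustration only. Wrapping `StandardScaler` and `LogisticRegression` in one pipeline means the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the churn file)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Inflate a few columns to mimic dollar-valued features such as Credit_Limit
X_demo[:, :3] *= 10_000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Standardize features, then fit the default lbfgs solver with the same max_iter=200
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_model.fit(Xd_train, yd_train)

# n_iter_ reports how many iterations the solver actually used
n_iter_used = scaled_model.named_steps['logisticregression'].n_iter_[0]
print(f'Iterations used: {n_iter_used} (limit: 200)')
print(f'Test accuracy: {scaled_model.score(Xd_test, yd_test):.3f}')
```

Comparing `n_iter_` with `max_iter` is a quick way to check whether a fit stopped because it converged or because it hit the iteration limit.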

In [641]:
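# Switch from the default lbfgs solver to liblinear and raise max_iter to avoid the convergence warning
model = LogisticRegression(solver='liblinear', max_iter=400, random_state=SEED)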
model.fit(X_train, y_train)

### Test model

In [642]:
predictions = model.predict(X_test)
print(predictions)

[1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

### Model evaluation

In [643]:
# Print model accuracy
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')


Model Accuracy: 0.906
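
Accuracy is the fraction of test rows whose predicted label matches the true label. With a 20% test split of 2,500 rows (500 rows), 0.906 corresponds to 453 correct predictions. The sketch below shows the calculation on ten made-up labels (the `toy_*` arrays are hypothetical) and adds a confusion matrix, which can be a useful companion to accuracy when one class dominates.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for ten rows
toy_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0])
toy_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1])

# Accuracy by hand: share of positions where the labels agree
manual_accuracy = (toy_true == toy_pred).mean()

# The same figure from scikit-learn
sk_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true labels (0, 1); columns are predicted labels (0, 1)
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

print(f'Manual accuracy:  {manual_accuracy}')   # 8 of 10 match -> 0.8
print(f'sklearn accuracy: {sk_accuracy}')
print(toy_cm)                                   # [[1 2], [0 7]]
```

In this toy case every class-1 row is predicted correctly while two of the three class-0 rows are missed, a pattern that the single accuracy number does not reveal.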


### Goal: Improve model performance to have accuracy > 91%

In [644]:
# Re-train model with at least 1 different or additional hyperparameter

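# C is the inverse of regularization strength: C=0.2 regularizes more strongly than the default C=1.0.
# random_state is kept unchanged as required, although it only affects the 'sag', 'saga' and 'liblinear' solvers.
model = LogisticRegression(solver='newton-cg', max_iter=500, C=0.2, random_state=SEED)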
model.fit(X_train, y_train)

### Test updated model

In [645]:
# Generate predictions against the test set
y_pred = model.predict(X_test)

### Evaluate updated model

In [646]:
# Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Re-trained Model Accuracy: {accuracy}')

Re-trained Model Accuracy: 0.918
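
The re-training step changes the solver and `C` while keeping the train/test split fixed. One way to explore such hyperparameters systematically is to loop over candidate values and re-fit on the same split each time. The sketch below does this on synthetic data. The dataset, the candidate list, and the `*_syn` / `*s_*` / `c_results` names are assumptions for illustration, and the resulting accuracies say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the churn file)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=1)

# Split once; every candidate below is trained and scored on this same split
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

c_results = {}
for c_value in [0.01, 0.1, 1.0, 10.0]:
    # Smaller C means stronger regularization
    clf = LogisticRegression(solver='newton-cg', C=c_value, max_iter=500)
    clf.fit(Xs_train, ys_train)
    c_results[c_value] = accuracy_score(ys_test, clf.predict(Xs_test))

for c_value, acc in c_results.items():
    print(f'C={c_value}: accuracy={acc:.3f}')
```

Because the split never changes inside the loop, any difference between the printed accuracies comes from `C` alone, which mirrors the rule of not re-splitting during re-training.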
