# Project # 1 - Logistic Regression
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/credit_card_churners_1_10k.csv

## Project #1 Requirements
* Load data into dataframe
* Examine data
* Use SimpleImputer to replace missing values
* Prepare data for model training
* Train Logistic Regression model (change hyperparameters and re-train as needed)
* Test model and evaluate model performance metrics

In [10]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 02/15/24 20:30:34


### Import libraries

In [11]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/credit_card_churners_1_10k.csv')

# Now, you can review the DataFrame's shape
dimensions = df.shape

### Load data

#### Credit Card Churn Prediction
* https://www.kaggle.com/datasets/anwarsan/credit-card-bank-churn

Business Problem  
A business manager of a consumer credit card bank is facing the problem of customer attrition. They want to analyze the data to find out the reason behind this and leverage the same to predict customers who are likely to drop off.

Columns
* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - High School, College, Post-Graduate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
* Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
* Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
* Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

In [12]:
# Read data from file (credit_card_churners_1_10k.csv) into dataframe
#  NOTE: Use CLIENTNUM as the index column

data_url = "https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/credit_card_churners_1_10k.csv"

data_table = pd.read_csv(data_url)

print(data_table.head())


# df = pd.read_csv)

   CLIENTNUM  Attrition_Flag  Customer_Age  Dependent_count  Education_Level  \
0  712965183               1          63.0                2              1.0   
1  714225333               1          48.0                4              1.0   
2  710512833               1          38.0                2              1.0   
3  716396358               1          52.0                2              1.0   
4  715609533               0          47.0                3              0.0   

   Income_Category  Card_Category  Months_on_book  Total_Relationship_Count  \
0              0.0            0.0              52                         5   
1              0.0            0.0              36                         5   
2              0.0            0.0              29                         6   
3              1.0            0.0              47                         5   
4              0.0            0.0              35                         1   

   Months_Inactive_12_mon  ...  Total_Amt_Ch

### Examine data

In [13]:
# Review dataframe shape
dimensions = df.shape

print("Total rows:", dimensions[0])
print("Total columns:", dimensions[1])

Total rows: 10000
Total columns: 24


In [14]:
# Display first few rows of dataframe
df.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
0,712965183,1,63.0,2,1.0,0.0,0.0,52,5,2,...,0.416,1188,35,0.75,0.781,1,0,0,1,0
1,714225333,1,48.0,4,1.0,0.0,0.0,36,5,1,...,0.661,1545,21,0.909,0.264,1,0,0,1,0
2,710512833,1,38.0,2,1.0,0.0,0.0,29,6,1,...,0.615,5178,79,0.756,0.405,1,0,0,1,0
3,716396358,1,52.0,2,1.0,1.0,0.0,47,5,3,...,0.921,1531,35,0.667,0.619,0,1,0,1,0
4,715609533,0,47.0,3,0.0,0.0,0.0,35,1,3,...,0.621,1887,36,0.333,0.0,1,0,0,0,1


In [15]:
# Display distribution counts for target variable Attrition_Flag

frequency_distribution = df['Attrition_Flag'].value_counts()

frequency_distribution

Attrition_Flag
1    8392
0    1608
Name: count, dtype: int64

### Prepare data

##### Check for missing values

In [16]:
absent_value = df.isna().sum()

print(absent_value)

CLIENTNUM                     0
Attrition_Flag                0
Customer_Age                502
Dependent_count               0
Education_Level               0
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
Gender_F                      0
Gender_M                      0
Marital_Status_Divorced       0
Marital_Status_Married        0
Marital_Status_Single         0
dtype: int64


#### Use the SimpleImputer to replace missing values

In [17]:
imputer = SimpleImputer(strategy='mean')

df['Customer_Age'] = imputer.fit_transform(df[['Customer_Age']])

updated_missing_values = df.isna().sum()


In [18]:
print(absent_value)

CLIENTNUM                     0
Attrition_Flag                0
Customer_Age                502
Dependent_count               0
Education_Level               0
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
Gender_F                      0
Gender_M                      0
Marital_Status_Divorced       0
Marital_Status_Married        0
Marital_Status_Single         0
dtype: int64


#### Check for missing values again

In [19]:
missing_values = df.isna().sum()

print(absent_value)

CLIENTNUM                     0
Attrition_Flag                0
Customer_Age                502
Dependent_count               0
Education_Level               0
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
Gender_F                      0
Gender_M                      0
Marital_Status_Divorced       0
Marital_Status_Married        0
Marital_Status_Single         0
dtype: int64


### Separate independent and dependent variables
* Independent variables: All remaining variables except Attrition_Flag
* Dependent variable: Attrition_Flag

In [20]:
dependent_variable = 'Attrition_Flag'

independent_variables = [
    'Customer_Age',
    'Dependent_count',
    'Education_Level',
    'Income_Category',
    'Card_Category',
    'Months_on_book',
    'Total_Relationship_Count',
    'Months_Inactive_12_mon',
    'Contacts_Count_12_mon',
    'Credit_Limit',
    'Total_Revolving_Bal',
    'Avg_Open_To_Buy',
    'Total_Amt_Chng_Q4_Q1',
    'Total_Trans_Amt',
    'Total_Trans_Ct',
    'Total_Ct_Chng_Q4_Q1',
    'Avg_Utilization_Ratio',
    'Gender_F',
    'Gender_M',
    'Marital_Status_Divorced',
    'Marital_Status_Married',
    'Marital_Status_Single'
]


X = df[independent_variables]
y = df[dependent_variable]

### Split data into training and test sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Train Logistic Regression model

In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logistic_regression_model = LogisticRegression(max_iter=1000)
logistic_regression_model.fit(X_train_scaled, y_train)

### If above results in error, review error message, look up documentation for LogisticRegression, and change model hyperparameter appropriately

### Test model

In [23]:
# Generate predictions against the test set

y_pred = logistic_regression_model.predict(X_test)

feature_names = X.columns
coefficients = logistic_regression_model.coef_

for feature, coef in zip(feature_names, coefficients[0]):
    print(f"{feature}: {coef}")


Customer_Age: 0.019990153998473804
Dependent_count: -0.14407826402829843
Education_Level: -0.04062591700408743
Income_Category: -0.21782386247678942
Card_Category: -0.14200193313092055
Months_on_book: 0.0859592204752861
Total_Relationship_Count: 0.7423578834642501
Months_Inactive_12_mon: -0.5299692646551385
Contacts_Count_12_mon: -0.5475441762710864
Credit_Limit: 0.11698391418787597
Total_Revolving_Bal: 0.6848314294942892
Avg_Open_To_Buy: 0.05552185282188204
Total_Amt_Chng_Q4_Q1: 0.10237928409757999
Total_Trans_Amt: -1.6412853449573885
Total_Trans_Ct: 2.8044436766137157
Total_Ct_Chng_Q4_Q1: 0.6294975060292627
Avg_Utilization_Ratio: 0.11379226548032693
Gender_F: -0.24183100614914127
Gender_M: 0.24183100614914127
Marital_Status_Divorced: -0.05525456900878645
Marital_Status_Married: 0.16709153766373594
Marital_Status_Single: -0.1332164778837748


### Model evaluation

In [24]:
from sklearn.metrics import accuracy_score

# Print model accuracy
y_pred = logistic_regression_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Accuracy: 0.2265


In [25]:
# Print classification report
from sklearn.metrics import classification_report

y_pred = logistic_regression_model.predict(X_test)

report = classification_report(y_test, y_pred)

print(report)

              precision    recall  f1-score   support

           0       0.15      0.84      0.26       322
           1       0.78      0.11      0.19      1678

    accuracy                           0.23      2000
   macro avg       0.47      0.48      0.22      2000
weighted avg       0.68      0.23      0.20      2000



In [26]:
# Print confusion matrix
y_pred = logistic_regression_model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)

print("confusion matrix:")
print(conf_matrix)


confusion matrix:
[[ 272   50]
 [1497  181]]
