# Lab 3: Extending Logistic Regression

This lab predicts the education level of credit card customers from the other features in the dataset. Such predictions are relevant to customer segmentation: marketing teams or customer relationship management departments within financial institutions could use them to tailor campaigns and product offerings to customers' education levels.

Target:

The model will learn patterns in the dataset to predict one of several education levels for each customer (a multi-class target) and whether the customer is likely to churn (a binary target). Together, these predictions support targeted marketing, customer retention strategies, and better decision-making for the credit card portfolio.

In [29]:
import pandas as pd

# load the dataset
file_path = 'data/BankChurners.csv'
df = pd.read_csv(file_path)

# display the counts of each education level
education_level_counts = df['Education_Level'].value_counts()
print("\nCounts of each Education_Level:")
print(education_level_counts)



Counts of each Education_Level:
Education_Level
Graduate         3128
High School      2013
Unknown          1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
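
The counts above show a clear class imbalance, which sets a floor for any classifier: always predicting the most frequent class is already correct about 31% of the time. The following minimal sketch computes the class shares and that majority-class baseline directly from the printed counts. The names `education_counts`, `class_share`, and `baseline_accuracy` are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts() output above.
education_counts = {
    'Graduate': 3128,
    'High School': 2013,
    'Unknown': 1519,
    'Uneducated': 1487,
    'College': 1013,
    'Post-Graduate': 516,
    'Doctorate': 451,
}

total = sum(education_counts.values())

# Share of the dataset held by each class.
class_share = {level: count / total for level, count in education_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(education_counts, key=education_counts.get)
baseline_accuracy = class_share[majority_class]

for level, share in class_share.items():
    print(f"{level:<14} {share:6.1%}")
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model for Education_Level should be judged against this baseline rather than against zero.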


In [30]:
# check for missing data in the dataframe
missing_data = df.isnull().sum()

# print the number of missing values for each column
print("Missing data per column:")
print(missing_data)

Missing data per column:
CLIENTNUM                                                                                                                             0
Attrition_Flag                                                                                                                        0
Customer_Age                                                                                                                          0
Gender                                                                                                                                0
Dependent_count                                                                                                                       0
Education_Level                                                                                                                       0
Marital_Status                                                                                                                        0
Income_Category        

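No column contains null values. Note, however, that isnull() only detects true nulls: the 1,519 customers whose Education_Level is "Unknown" are effectively missing values stored as a regular category.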
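
A null check only flags true nulls (NaN or None), so placeholder strings pass it unnoticed. The sketch below illustrates this on a small toy frame, assuming "Unknown" is the placeholder (as it is for Education_Level in the counts above). The toy values and the names `toy`, `null_counts`, and `unknown_counts` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are illustrative only.
toy = pd.DataFrame({
    'Education_Level': ['Graduate', 'Unknown', 'College', 'Unknown'],
    'Marital_Status': ['Married', 'Single', 'Unknown', 'Married'],
    'Customer_Age': [45, 49, 51, 40],
})

# True nulls per column: all zero, although some values are effectively missing.
null_counts = toy.isnull().sum()

# Compare every cell with the placeholder string and count matches per column.
unknown_counts = toy.eq('Unknown').sum()

print("Nulls per column:")
print(null_counts)
print("\n'Unknown' placeholders per column:")
print(unknown_counts)
```

Running the same comparison on `df` would show which columns rely on the placeholder and how many rows are affected.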


We choose Education_Level as the target variable because it has three or more classes to predict (seven, counting Unknown).

In [31]:
# define a dictionary to map each education level to a numerical label
education_level_map = {
    'Graduate': 1,
    'High School': 2,
    'Unknown': 3,
    'Uneducated': 4,
    'College': 5,
    'Post-Graduate': 6,
    'Doctorate': 7
}

# Apply it to the Education_Level column
df['Education_Level'] = df['Education_Level'].map(education_level_map)

# Convert Attrition_Flag column to binary labels
df['Attrition_Flag'] = df['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})

# drop unnecessary variables
df.drop(['CLIENTNUM', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1'], axis=1, inplace=True)

print("Final dataset description:")
print(df.describe())


Final dataset description:
       Attrition_Flag  Customer_Age  Dependent_count  Education_Level  \
count    10127.000000  10127.000000     10127.000000     10127.000000   
mean         0.160660     46.325960         2.346203         2.861361   
std          0.367235      8.016814         1.298908         1.770156   
min          0.000000     26.000000         0.000000         1.000000   
25%          0.000000     41.000000         1.000000         1.000000   
50%          0.000000     46.000000         2.000000         2.000000   
75%          0.000000     52.000000         3.000000         4.000000   
max          1.000000     73.000000         5.000000         7.000000   

       Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  \
count    10127.000000              10127.000000            10127.000000   
mean        35.928409                  3.812580                2.341167   
std          7.986416                  1.554408                1.010622   
min         13.

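Eddie: explain the numbers in detail (provide a breakdown of the variables after preprocessing, such as the mean, std, etc., for all variables, both numeric and categorical).

Education levels are assigned the numerical labels below. The labels follow the frequency order of the categories rather than an educational ranking, so the mean and quartiles reported for Education_Level above describe the label codes, not an ordered scale:

    'Graduate': 1,
    'High School': 2,
    'Unknown': 3,
    'Uneducated': 4,
    'College': 5,
    'Post-Graduate': 6,
    'Doctorate': 7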
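
Because these labels are arbitrary codes, any summary statistic computed on them depends on the coding. The sketch below uses only the class counts and the mapping shown earlier to reproduce the Education_Level mean from the describe() output, then shows that a different, equally arbitrary coding gives a different mean. The alphabetical coding `alphabetical_map` and the helper `mean_code` are hypothetical and not used in the notebook.

```python
# Class counts from the value_counts() output earlier in the notebook.
counts = {
    'Graduate': 3128, 'High School': 2013, 'Unknown': 1519, 'Uneducated': 1487,
    'College': 1013, 'Post-Graduate': 516, 'Doctorate': 451,
}

# The mapping applied to Education_Level in the notebook.
education_level_map = {
    'Graduate': 1, 'High School': 2, 'Unknown': 3, 'Uneducated': 4,
    'College': 5, 'Post-Graduate': 6, 'Doctorate': 7,
}

def mean_code(class_counts, mapping):
    """Mean of the numeric labels, weighted by how often each class occurs."""
    total = sum(class_counts.values())
    return sum(mapping[level] * n for level, n in class_counts.items()) / total

mean_original = mean_code(counts, education_level_map)

# Hypothetical alternative coding: number the categories alphabetically.
alphabetical_map = {level: i for i, level in enumerate(sorted(counts), start=1)}
mean_alphabetical = mean_code(counts, alphabetical_map)

print(f"Mean with the notebook's coding: {mean_original:.6f}")
print(f"Mean with alphabetical coding:   {mean_alphabetical:.6f}")
```

The first value matches the 2.861361 reported by describe(), which confirms that the figure summarizes label codes rather than any property of the customers.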

In [34]:
# display the first few rows of the preprocessed dataframe
print("\nFirst few rows of the preprocessed dataframe:")
print(df.head())


First few rows of the preprocessed dataframe:
   Attrition_Flag  Customer_Age Gender  Dependent_count  Education_Level  \
0               0            45      M                3                2   
1               0            49      F                5                1   
2               0            51      M                3                1   
3               0            40      F                4                2   
4               0            40      M                3                4   

  Marital_Status Income_Category Card_Category  Months_on_book  \
0        Married     $60K - $80K          Blue              39   
1         Single  Less than $40K          Blue              44   
2        Married    $80K - $120K          Blue              36   
3        Unknown  Less than $40K          Blue              34   
4        Married     $60K - $80K          Blue              21   

   Total_Relationship_Count  Months_Inactive_12_mon  Contacts_Count_12_mon  \
0                    

In [39]:
from sklearn.model_selection import train_test_split

X = df.drop(['Education_Level', 'Attrition_Flag'], axis=1) 
y_education = df['Education_Level']
y_attrition = df['Attrition_Flag']

# split the data into training and testing sets for education level
X_train_edu, X_test_edu, y_train_edu, y_test_edu = train_test_split(X, y_education, test_size=0.2)

# split the data into training and testing sets for attrition flag
X_train_attrition, X_test_attrition, y_train_attrition, y_test_attrition = train_test_split(X, y_attrition, test_size=0.2)

# print the shapes of the training and testing sets for education level and attrition flag
print("Training set shape (Education Level):", X_train_edu.shape, y_train_edu.shape)
print("Testing set shape (Education Level):", X_test_edu.shape, y_test_edu.shape)
print("Training set shape (Attrition Flag):", X_train_attrition.shape, y_train_attrition.shape)
print("Testing set shape (Attrition Flag):", X_test_attrition.shape, y_test_attrition.shape)


Training set shape (Education Level): (8101, 18) (8101,)
Testing set shape (Education Level): (2026, 18) (2026,)
Training set shape (Attrition Flag): (8101, 18) (8101,)
Testing set shape (Attrition Flag): (2026, 18) (2026,)
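
The two calls above shuffle independently and without a fixed seed, so the education and attrition targets end up with different row partitions that change on every run. One possible alternative is sketched below on synthetic data: train_test_split accepts several arrays at once, so a single call can split the features and both targets consistently, with random_state fixing the shuffle and stratify preserving the class proportions of one target. The toy data, the seed 42, and the choice to stratify on the education target are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are illustrative only).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({'feature_a': rng.normal(size=n), 'feature_b': rng.normal(size=n)})
# Imbalanced multi-class target: 700 / 200 / 100 rows per class.
y_education = pd.Series(np.repeat([1, 2, 3], [700, 200, 100]), name='Education_Level')
y_attrition = pd.Series(rng.integers(0, 2, size=n), name='Attrition_Flag')

# One call splits all three objects along the same rows.
X_train, X_test, y_train_edu, y_test_edu, y_train_att, y_test_att = train_test_split(
    X, y_education, y_attrition,
    test_size=0.2,
    random_state=42,       # fixed seed: the same split on every run
    stratify=y_education,  # keep education class proportions in both sets
)

print("Test set shape:", X_test.shape)
print("Education classes in the test set:")
print(y_test_edu.value_counts().sort_index())
```

Stratifying on one target does not guarantee matching proportions for the other, so the attrition rate in each set is still worth checking after the split.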


## Split into testing and training

Splitting the dataset into training and testing sets with an 80/20 ratio is common practice in machine learning, and it suits this dataset for several reasons:

Sufficient Training Data: With 80% of the dataset (8,101 of 10,127 rows) allocated for training, the model has ample data to learn the relationships between the features and the target variables, which helps it generalize to unseen data.

Adequate Testing Data: The remaining 20% (2,026 rows) is reserved for evaluating the model. A test set of this size gives reasonably stable estimates of metrics such as accuracy, precision, recall, and F1-score. The rarest class (Doctorate, 451 customers overall) contributes only about 90 test rows, however, so its per-class metrics will be noisier.

Computationally Feasible: With roughly 10,000 rows, training cost is not a constraint here. A single hold-out split requires fitting each model only once, which makes it cheaper than k-fold cross-validation.

Standard Practice: The 80/20 ratio is widely used in the machine learning literature. It balances the need for training data against the need for testing data without starving either set.

Reliable Performance Evaluation: The variability of a metric estimated on the test set shrinks as the test set grows, so a 2,026-row test set yields tighter estimates than a smaller hold-out set would.

Note that the splits above are random rather than stratified, so the class proportions in the training and testing sets may differ slightly from those in the full dataset.
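
The split sizes and the reliability argument above can be checked with a little arithmetic. The sketch below reproduces the reported set sizes (scikit-learn rounds the test size up) and uses the binomial standard error sqrt(p(1-p)/n) as a rough gauge of how much a test-set accuracy could vary from sampling alone. The binomial approximation, the worst-case value p = 0.5, and the helper name `accuracy_standard_error` are assumptions for illustration, not part of the original analysis.

```python
import math

n_samples = 10127
test_fraction = 0.2

# train_test_split rounds the test size up and gives the rest to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p estimated from n test rows."""
    return math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) over the whole test set.
se_full_test = accuracy_standard_error(0.5, n_test)

# Worst case over the rarest class: about 20% of the 451 Doctorate customers.
n_doctorate_test = round(test_fraction * 451)
se_doctorate = accuracy_standard_error(0.5, n_doctorate_test)

print(f"Train rows: {n_train}, test rows: {n_test}")
print(f"Standard error on the full test set: {se_full_test:.4f}")
print(f"Standard error on ~{n_doctorate_test} Doctorate rows: {se_doctorate:.4f}")
```

Under these assumptions, overall accuracy is pinned down to within roughly one percentage point, while metrics for the smallest class carry several times more uncertainty.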