# [3920] Homework 3 - KNN
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-3920/main/data/credit_card_churners_2_10k.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 2:30 PM on the due date
  * No late submission will be accepted
* You must submit a cleanly executed notebook (*.ipynb)
  * Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework 3 Requirements
* Load data
* Identify missing values and use SimpleImputer to replace missing values
* Ordinal Encode independent variables: 'Education_Level', 'Income_Category' and 'Card_Category'
* Dummy (one-hot) encode independent variables: 'Gender' and 'Marital_Status'
* Label encode dependent variable: 'Attrition_Flag'
* Separate independent and dependent variables
* Standardize independent variables
* Split data into training and test sets
* Train KNeighborsClassifier (with default hyperparameters)
* Calculate accuracy for KNeighborsClassifier (with default hyperparameters)
* Re-train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
  * NOTE: The objective of changing these hyperparameters is to improve model accuracy
    * If you used hyperparameter random_state in your initial model training, do NOT change this value during model retrainings
    * Do NOT re-split training and test sets during model retrainings
* Calculate accuracy for re-trained KNeighborsClassifier (with updated hyperparameters)

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 06/10/25 20:40:56


### Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Load data

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-3920/main/data/credit_card_churners_2_10k.csv', index_col='CLIENTNUM')

### Examine data

In [4]:
# Display first few rows of dataframe
df.head()

Unnamed: 0_level_0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
CLIENTNUM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
718759833,Existing Customer,44.0,F,2,High_School,Married,,Silver,35,4,3,2,32643.0,0,32643.0,1.3,1058,24,2.429,0.0
719084358,Attrited Customer,55.0,F,1,College,Married,Below_$40K,Blue,46,6,4,3,4232.0,0,4232.0,0.878,2312,37,0.609,0.0
772643058,Existing Customer,58.0,F,3,College,Married,Below_$40K,Blue,48,6,2,3,2800.0,1834,966.0,0.615,1571,36,0.636,0.655
718078008,Existing Customer,40.0,M,2,College,Divorced,$40K-$80K,Blue,36,3,2,1,20304.0,0,20304.0,0.809,7494,85,0.667,0.0
715479483,Existing Customer,35.0,M,2,College,Married,$80K-$120K,Blue,36,5,2,4,15279.0,1496,13783.0,0.997,2079,57,0.425,0.098


### Prepare data for model training

#### Use the SimpleImputer to replace missing values

In [5]:
# Check for missing values
df.isna().sum()

Attrition_Flag                 0
Customer_Age                 501
Gender                         0
Dependent_count                0
Education_Level                0
Marital_Status                 0
Income_Category             1101
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

In [6]:
# Create imputers
imputer_mean = SimpleImputer(strategy='mean')
imputer_freq = SimpleImputer(strategy='most_frequent')

# Apply mean imputer to Customer_Age
df['Customer_Age'] = imputer_mean.fit_transform(df[['Customer_Age']])

# Apply most frequent imputer to Income_Category
df['Income_Category'] = imputer_freq.fit_transform(df[['Income_Category']]).ravel()

In [7]:
# Display first few rows of updated dataframe
df.head()

Unnamed: 0_level_0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
CLIENTNUM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
718759833,Existing Customer,44.0,F,2,High_School,Married,Below_$40K,Silver,35,4,3,2,32643.0,0,32643.0,1.3,1058,24,2.429,0.0
719084358,Attrited Customer,55.0,F,1,College,Married,Below_$40K,Blue,46,6,4,3,4232.0,0,4232.0,0.878,2312,37,0.609,0.0
772643058,Existing Customer,58.0,F,3,College,Married,Below_$40K,Blue,48,6,2,3,2800.0,1834,966.0,0.615,1571,36,0.636,0.655
718078008,Existing Customer,40.0,M,2,College,Divorced,$40K-$80K,Blue,36,3,2,1,20304.0,0,20304.0,0.809,7494,85,0.667,0.0
715479483,Existing Customer,35.0,M,2,College,Married,$80K-$120K,Blue,36,5,2,4,15279.0,1496,13783.0,0.997,2079,57,0.425,0.098


#### Check for missing values again

In [8]:
df.isna().sum()

Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

#### Ordinal Encode Education_Level, Income_Category and Card_Category

In [9]:
# Ordinal encode column: Education_Level

In [10]:
oe_education = OrdinalEncoder(categories=[['High_School', 'College', 'Post-Graduate']])
df['Education_Level'] = oe_education.fit_transform(df[['Education_Level']])

In [11]:
# Ordinal encode column: Income_Category

In [12]:
oe_income = OrdinalEncoder(categories=[['Below_$40K', '$40K-$80K', '$80K-$120K', 'Above_$120K']])
df['Income_Category'] = oe_income.fit_transform(df[['Income_Category']])

In [13]:
# Ordinal encode column: Card_Category

In [14]:
oe_card = OrdinalEncoder(categories=[['Blue', 'Silver', 'Gold', 'Platinum']])
df['Card_Category'] = oe_card.fit_transform(df[['Card_Category']])

In [15]:
# Display first few rows of updated dataframe
df.head(10)

Unnamed: 0_level_0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
CLIENTNUM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
718759833,Existing Customer,44.0,F,2,0.0,Married,0.0,1.0,35,4,3,2,32643.0,0,32643.0,1.3,1058,24,2.429,0.0
719084358,Attrited Customer,55.0,F,1,1.0,Married,0.0,0.0,46,6,4,3,4232.0,0,4232.0,0.878,2312,37,0.609,0.0
772643058,Existing Customer,58.0,F,3,1.0,Married,0.0,0.0,48,6,2,3,2800.0,1834,966.0,0.615,1571,36,0.636,0.655
718078008,Existing Customer,40.0,M,2,1.0,Divorced,1.0,0.0,36,3,2,1,20304.0,0,20304.0,0.809,7494,85,0.667,0.0
715479483,Existing Customer,35.0,M,2,1.0,Married,2.0,0.0,36,5,2,4,15279.0,1496,13783.0,0.997,2079,57,0.425,0.098
779070033,Existing Customer,63.0,F,0,1.0,Single,0.0,0.0,44,3,4,2,1938.0,0,1938.0,0.536,3974,56,0.931,0.0
714046608,Existing Customer,46.351616,M,0,2.0,Married,1.0,0.0,39,5,6,3,19719.0,1395,18324.0,0.565,3572,73,0.738,0.071
708281433,Existing Customer,57.0,F,2,1.0,Married,0.0,0.0,36,5,1,3,1438.3,0,1438.3,0.565,3848,75,0.705,0.0
711613308,Attrited Customer,58.0,M,2,0.0,Single,1.0,0.0,48,2,1,2,2236.0,710,1526.0,0.434,2562,38,0.462,0.318
787565433,Existing Customer,46.351616,M,2,2.0,Married,2.0,0.0,43,4,2,4,4039.0,1397,2642.0,0.396,1494,36,0.286,0.346


#### Dummy (one-hot) encode Gender and Marital_Status

In [16]:
# Dummy (one-hot) encode column: Gender
df = pd.get_dummies(df, columns=['Gender'])

In [17]:
# Dummy (one-hot) encode column: Marital_Status
df = pd.get_dummies(df, columns=['Marital_Status'])

In [18]:
# Display first few rows of updated dataframe
df.head(10)

Unnamed: 0_level_0,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
CLIENTNUM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
718759833,Existing Customer,44.0,2,0.0,0.0,1.0,35,4,3,2,...,1.3,1058,24,2.429,0.0,True,False,False,True,False
719084358,Attrited Customer,55.0,1,1.0,0.0,0.0,46,6,4,3,...,0.878,2312,37,0.609,0.0,True,False,False,True,False
772643058,Existing Customer,58.0,3,1.0,0.0,0.0,48,6,2,3,...,0.615,1571,36,0.636,0.655,True,False,False,True,False
718078008,Existing Customer,40.0,2,1.0,1.0,0.0,36,3,2,1,...,0.809,7494,85,0.667,0.0,False,True,True,False,False
715479483,Existing Customer,35.0,2,1.0,2.0,0.0,36,5,2,4,...,0.997,2079,57,0.425,0.098,False,True,False,True,False
779070033,Existing Customer,63.0,0,1.0,0.0,0.0,44,3,4,2,...,0.536,3974,56,0.931,0.0,True,False,False,False,True
714046608,Existing Customer,46.351616,0,2.0,1.0,0.0,39,5,6,3,...,0.565,3572,73,0.738,0.071,False,True,False,True,False
708281433,Existing Customer,57.0,2,1.0,0.0,0.0,36,5,1,3,...,0.565,3848,75,0.705,0.0,True,False,False,True,False
711613308,Attrited Customer,58.0,2,0.0,1.0,0.0,48,2,1,2,...,0.434,2562,38,0.462,0.318,False,True,False,False,True
787565433,Existing Customer,46.351616,2,2.0,2.0,0.0,43,4,2,4,...,0.396,1494,36,0.286,0.346,False,True,False,True,False


#### Label encode Attrition_Flag (target)

In [19]:
# Label encode target: Attrition_Flag
le = LabelEncoder()
df['Attrition_Flag'] = le.fit_transform(df['Attrition_Flag'])

In [20]:
# Display first few rows of updated dataframe
df.head()

Unnamed: 0_level_0,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Gender_F,Gender_M,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
CLIENTNUM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
718759833,1,44.0,2,0.0,0.0,1.0,35,4,3,2,...,1.3,1058,24,2.429,0.0,True,False,False,True,False
719084358,0,55.0,1,1.0,0.0,0.0,46,6,4,3,...,0.878,2312,37,0.609,0.0,True,False,False,True,False
772643058,1,58.0,3,1.0,0.0,0.0,48,6,2,3,...,0.615,1571,36,0.636,0.655,True,False,False,True,False
718078008,1,40.0,2,1.0,1.0,0.0,36,3,2,1,...,0.809,7494,85,0.667,0.0,False,True,True,False,False
715479483,1,35.0,2,1.0,2.0,0.0,36,5,2,4,...,0.997,2079,57,0.425,0.098,False,True,False,True,False


### Separate independent and dependent variables
* Independent variables: All remaining variables except Attrition_Flag
* Dependent variable: Attrition_Flag

In [21]:
Y = df["Attrition_Flag"]
X = df.drop("Attrition_Flag", axis=1)

### Standardize independent variables

In [22]:
min_max_scaler = StandardScaler()
X = pd.DataFrame(min_max_scaler.fit_transform(X), columns=X.columns)

### Split data into training and test sets

In [23]:
x_test, x_train, y_test, y_train = train_test_split(X, Y, stratify=Y, test_size=0.2, random_state=42)




















### Train KNeighborsClassifier (with default hyperparameters)


In [24]:
knn = KNeighborsClassifier()
knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [25]:
knn.fit(x_train, y_train)

### Evaluate performance for KNeighborsClassifier (with default hyperparameters)

In [26]:
# Predict using the test set
y_pred = knn.predict(x_test)

In [27]:
# Print model accuracy score
accuracy_score_1 = accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((accuracy_score_1 * 100), 4)}%")

Accuracy = 88.4%


### Train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
NOTE: The objective of changing these hyperparameters is to improve model accuracy

In [28]:
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 10,
 'p': 2,
 'weights': 'uniform'}

In [29]:
knn.fit(x_train, y_train)

### Evaluate performance for KNeighborsClassifier (with updated hyperparameters)

In [30]:
# Predict using the test set
y_pred = knn.predict(x_test)

In [31]:
# Print model accuracy score
accuracy_score_1 = accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((accuracy_score_1 * 100), 4)}%")

Accuracy = 88.75%
