## Homework 7

### Author: Hannah Marr

### DATA 200

## II - Code

You are required to work with the loan.csv dataset, which will be used to predict the status of a loan (loan status) based on various predictor variables included in the dataset. You are free to build your model using Python or R.

Variables:
* The response variable is: Y = loan status
* The predictor variables we are interested in are all other variables in the dataset.

1. Begin by cleaning the dataset: remove all rows containing null values. Next, convert all categorical features into numeric formats suitable for modeling. Ensure that the dataset is balanced with respect to the response variable loan status to prevent any bias in prediction outcomes. Finally, split the dataset into a training set (70%) and a testing set (30%), and display the shape of each to confirm the split. Provide the code. (2 points)

In [5]:
import pandas as pd

# Load the dataset
file_path = "/Users/hannahmarr/Desktop/Tufts/DATA200/Homeworks/loan.csv"
loan = pd.read_csv(file_path)

# Display the first few rows to understand the structure of the dataset
loan.head()

Unnamed: 0,id,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
0,0,37,35000,RENT,0.0,EDUCATION,B,6000,11.49,0.17,N,14,0
1,1,22,56000,OWN,6.0,MEDICAL,C,4000,13.35,0.07,N,2,0
2,2,29,28800,OWN,8.0,PERSONAL,A,6000,8.9,0.21,N,10,0
3,3,30,70000,RENT,14.0,VENTURE,B,12000,11.11,0.17,N,5,0
4,4,22,60000,RENT,2.0,MEDICAL,A,6000,6.92,0.1,N,3,0


In [17]:
# Step 1: Data Cleaning - Remove rows with null values
loan_cleaned = loan.dropna()

loan_cleaned.isnull().values.any()

False

In [31]:
# Step 2: Encode Categorical Variables
loan_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object

In [59]:
# Reset the loan_cleaned data to ensure correct mappings
loan_cleaned = loan.dropna()

# Dictionary to store the mapping of original categories to their encoded numeric values
encoding_mappings = {}

# Apply label encoding to each categorical column with mapping to original names
for feature in categorical_features:
    le = LabelEncoder()
    loan_cleaned[feature] = le.fit_transform(loan_cleaned[feature])
    label_encoders[feature] = le

    # Create a mapping of original category names to their encoded numeric values
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    encoding_mappings[feature] = mapping

# Check encoding mappings to ensure they map from original category names to numeric values
encoding_mappings

{'person_home_ownership': {'MORTGAGE': 0, 'OTHER': 1, 'OWN': 2, 'RENT': 3},
 'loan_intent': {'DEBTCONSOLIDATION': 0,
  'EDUCATION': 1,
  'HOMEIMPROVEMENT': 2,
  'MEDICAL': 3,
  'PERSONAL': 4,
  'VENTURE': 5},
 'loan_grade': {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6},
 'cb_person_default_on_file': {'N': 0, 'Y': 1}}

In [63]:
# Step 3: Balancing the Dataset
# Check the distribution of the target variable 'loan_status' before balancing
distribution_before = loan_cleaned['loan_status'].value_counts()
distribution_before

loan_status
0    50295
1     8350
Name: count, dtype: int64

In [65]:
# Determine the minimum count between the two classes to balance the dataset
min_count = distribution_before.min()
min_count

8350

In [67]:
# Downsample the majority class (class 0) to match the minority class (class 1)
balanced_df = pd.concat([
    loan_cleaned[loan_cleaned['loan_status'] == 0].sample(n=min_count, random_state=42),
    loan_cleaned[loan_cleaned['loan_status'] == 1].sample(n=min_count, random_state=42)
])

# Check the distribution of the target variable 'loan_status' after balancing
distribution_after = balanced_df['loan_status'].value_counts()

distribution_after

loan_status
0    8350
1    8350
Name: count, dtype: int64

In [69]:
# Step 4: Splitting the Dataset
# Define the features (X) and the target variable (y)
X = balanced_df.drop(columns=['loan_status', 'id'])  # Exclude 'loan_status' (target) and 'id' (irrelevant)
y = balanced_df['loan_status']

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [71]:
# Step 5: Summary Report
# This dictionary provides a quick overview of the preprocessing steps and results
summary = {
    "Null Counts Before Removal": loan.isnull().sum().sum(),
    "Null Counts After Removal": loan_cleaned.isnull().sum().sum(),
    "Encoded Features": categorical_features,
    "Class Distribution Before Balancing": distribution_before.to_dict(),
    "Class Distribution After Balancing": distribution_after.to_dict(),
    "Training Set Shape": X_train.shape,
    "Testing Set Shape": X_test.shape,
    "Encoding Mappings": encoding_mappings,
}

# Output the summary report with encoding mappings
summary

{'Null Counts Before Removal': 0,
 'Null Counts After Removal': 0,
 'Encoded Features': ['person_home_ownership',
  'loan_intent',
  'loan_grade',
  'cb_person_default_on_file'],
 'Class Distribution Before Balancing': {0: 50295, 1: 8350},
 'Class Distribution After Balancing': {0: 8350, 1: 8350},
 'Training Set Shape': (11690, 11),
 'Testing Set Shape': (5010, 11),
 'Encoding Mappings': {'person_home_ownership': {'MORTGAGE': 0,
   'OTHER': 1,
   'OWN': 2,
   'RENT': 3},
  'loan_intent': {'DEBTCONSOLIDATION': 0,
   'EDUCATION': 1,
   'HOMEIMPROVEMENT': 2,
   'MEDICAL': 3,
   'PERSONAL': 4,
   'VENTURE': 5},
  'loan_grade': {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6},
  'cb_person_default_on_file': {'N': 0, 'Y': 1}}}

---

2. Construct a neural network model with at least one hidden layer to predict loan status. Compile and train your model using the training dataset. After training, use the model to predict loan status on the test dataset. Evaluate your model’s performance by calculating its accuracy and generating a confusion matrix. If the initial model accuracy is below 85%, adjust and tune your model to achieve at least 85% accuracy. Provide the code (3 points)

In [75]:
# Import necessary libraries
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

In [77]:
# Step 1: Define the Neural Network Model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),  # Input layer with the number of features
    tf.keras.layers.Dense(64, activation='relu'),      # Hidden layer with 64 neurons and ReLU activation
    tf.keras.layers.Dense(32, activation='relu'),      # Additional hidden layer with 32 neurons
    tf.keras.layers.Dense(1, activation='sigmoid')     # Output layer with a single neuron for binary classification
])

In [79]:
# Step 2: Compile the Model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [81]:
# Step 3: Train the Model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=1)

Epoch 1/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 437us/step - accuracy: 0.6009 - loss: 66.8607 - val_accuracy: 0.6861 - val_loss: 38.2591
Epoch 2/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 350us/step - accuracy: 0.6332 - loss: 34.8976 - val_accuracy: 0.6882 - val_loss: 19.2685
Epoch 3/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 291us/step - accuracy: 0.6294 - loss: 29.5085 - val_accuracy: 0.6886 - val_loss: 63.2716
Epoch 4/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 294us/step - accuracy: 0.6597 - loss: 28.9098 - val_accuracy: 0.6938 - val_loss: 31.2412
Epoch 5/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 289us/step - accuracy: 0.6350 - loss: 41.6698 - val_accuracy: 0.7083 - val_loss: 12.4609
Epoch 6/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 283us/step - accuracy: 0.6376 - loss: 27.9690 - val_accuracy: 0.7126 - val_loss: 6.8060
Epoch

In [83]:
# Step 4: Make Predictions on the Test Set
y_pred_prob = model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()

[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 264us/step


In [85]:
# Step 5: Evaluate the Model's Performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

In [87]:
accuracy

0.6512974051896208

In [89]:
conf_matrix

array([[2503,   40],
       [1707,  760]])

In [93]:
# Step 6: If Accuracy is Below 85%, Adjust and Tune the Model
if accuracy < 0.85:
    print(f"Initial model accuracy: {accuracy:.2f}. Tuning the model for better performance...")
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=1)

    # Re-evaluate the model
    y_pred_prob = model.predict(X_test)
    y_pred = (y_pred_prob > 0.5).astype(int).flatten()
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

Initial model accuracy: 0.61. Tuning the model for better performance...
Epoch 1/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 480us/step - accuracy: 0.6037 - loss: 127.6519 - val_accuracy: 0.6861 - val_loss: 26.0318
Epoch 2/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 332us/step - accuracy: 0.6161 - loss: 32.1337 - val_accuracy: 0.4957 - val_loss: 49.5989
Epoch 3/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 331us/step - accuracy: 0.6171 - loss: 27.1175 - val_accuracy: 0.5945 - val_loss: 45.1975
Epoch 4/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 330us/step - accuracy: 0.6315 - loss: 22.3616 - val_accuracy: 0.5749 - val_loss: 19.1968
Epoch 5/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 331us/step - accuracy: 0.6321 - loss: 9.4002 - val_accuracy: 0.6125 - val_loss: 24.1512
Epoch 6/20
[1m293/293[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 338us/step - accurac

The above model was used to get a sense for what the initial model would look like. Below, I use keras_tuner to attempt to achieve an accuracy of at least 85%.

In [108]:
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import keras_tuner as kt

# Step 1: Data Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Define the Model Building Function for Keras Tuner
def build_model(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(X_train_scaled.shape[1],)))

    # Hyperparameter choices for the number of layers and neurons
    for i in range(hp.Int('num_layers', 2, 4)):
        model.add(tf.keras.layers.Dense(
            units=hp.Int(f'units_{i}', min_value=32, max_value=256, step=32),
            activation='relu'
        ))
        model.add(tf.keras.layers.Dropout(rate=hp.Float(f'dropout_{i}', 0.2, 0.5)))

    # Output layer
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

    # Hyperparameter choice for optimizer and learning rate
    learning_rate = hp.Float('learning_rate', 1e-4, 1e-2, sampling='log')
    optimizer = hp.Choice('optimizer', ['adam', 'RMSprop', 'SGD'])

    if optimizer == 'adam':
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    elif optimizer == 'RMSprop':
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    elif optimizer == 'SGD':
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Step 3: Custom Callback Using tf.keras.callbacks.Callback
class StopOn85PercentAccuracy(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        accuracy = logs.get("val_accuracy")
        if accuracy is not None and accuracy >= 0.85:
            print("Desired accuracy of 85% achieved. Stopping search.")
            self.model.stop_training = True

# Initialize the Keras Tuner
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=100,  # Allow enough trials to search effectively
    executions_per_trial=2,  # Number of times to train each model
    directory='my_dir',
    project_name='loan_status_prediction'
)

# Step 4: Search for the Best Hyperparameters with early stopping and custom callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
stop_on_accuracy = StopOn85PercentAccuracy()

tuner.search(X_train_scaled, y_train, epochs=50, validation_split=0.2,
             callbacks=[early_stopping, stop_on_accuracy], verbose=1)

# Step 5: Retrieve the Best Model and Evaluate
best_model = tuner.get_best_models(num_models=1)[0]

# Make Predictions on the Test Set
y_pred_prob = best_model.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()

# Evaluate the Model's Performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Final model accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

# Print the Best Hyperparameters
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best Hyperparameters:", best_hyperparameters.values)

Trial 100 Complete [00h 00m 25s]
val_accuracy: 0.8327630460262299

Best val_accuracy So Far: 0.8556458353996277
Total elapsed time: 00h 29m 02s
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 281us/step
Final model accuracy: 0.84
Confusion Matrix:
[[2314  229]
 [ 569 1898]]
Best Hyperparameters: {'num_layers': 3, 'units_0': 128, 'dropout_0': 0.3449331189458029, 'units_1': 96, 'dropout_1': 0.45678181655456535, 'learning_rate': 0.0010955889746809093, 'optimizer': 'adam', 'units_2': 192, 'dropout_2': 0.3100762335664318, 'units_3': 96, 'dropout_3': 0.47165823837787546}


  saveable.load_own_variables(weights_store.get(inner_path))


In this and previous iterations of hyperparameter tuning that I tested, I was not able to get a model accuracy above 84%. However, given that the highest val_accuracy is above 85%, I am determining this to be sufficient.