# **MAS 651 Midterm Exam (40 points)**

Philip Bachas-Daunert

## **Question 1 (4 Points)**

Find out how many rows and columns this data set has. Check whether the data set has any missing values, and remove them if there are any. Show the first 10 observations of this data set.

### **Question 1 Answer**

In [1]:
# Import the necessary libraries
import pandas as pd

# Load the data
churn_data = pd.read_csv('https://raw.githubusercontent.com/wangx346/MAS651/main/Churn.csv')

# Display the shape of the DataFrame to understand its size
num_rows, num_columns = churn_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

# Check for the presence of any missing values within the DataFrame
if churn_data.isnull().values.any():
    print("There are missing values in the dataset.")
    
    # Remove rows with any missing values to ensure data integrity
    churn_data.dropna(inplace=True)
    
    # Display the new shape after removing missing values
    new_num_rows = churn_data.shape[0]
    print(f"New number of rows after removing missing values: {new_num_rows}")
else:
    print("There are no missing values in the dataset.")

# Display the first 10 observations of the DataFrame
print(churn_data.head(10))


Number of rows: 10000
Number of columns: 14
There are no missing values in the dataset.
   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   
5          6    15574012       Chu          645     Spain    Male   44   
6          7    15592531  Bartlett          822    France    Male   50   
7          8    15656148    Obinna          376   Germany  Female   29   
8          9    15792365        He          501    France    Male   44   
9         10    15592389        H?          684    France    Male   27   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00           
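
The check in the cell above only reports whether any value is missing anywhere. A per-column count is often more informative before deciding to drop rows. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are illustrative assumptions and are not taken from Churn.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not from Churn.csv)
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan, 699],
    "Geography": ["France", None, "France", "Spain"],
    "Age": [42, 41, 42, 39],
})

# Count missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

Applied to `churn_data`, the same `isnull().sum()` call would be expected to show zero for every column, since the cell above found no missing values.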

## **Question 2 (4 Points)**

Which 3 columns should not be used as features? Exclude these columns. Create the appropriate X matrix (representing categorical variables using dummy variables). Define the y vector as the last column of the data.

### **Question 2 Answer**

In [2]:
# Import necessary libraries
import pandas as pd

# Exclude the columns that should not be used as features
columns_to_exclude = ['RowNumber', 'CustomerId', 'Surname']

print(f"Exclude the columns that should not be used as features: {columns_to_exclude}. This is because they are identifiers and not relevant to the prediction.") 

# Drop the columns from the DataFrame
churn_data_filtered = churn_data.drop(columns=columns_to_exclude)

# Convert categorical variables into dummy variables
categorical_columns = ['Geography', 'Gender']
churn_data_with_dummies = pd.get_dummies(churn_data_filtered, columns=categorical_columns)

# Define the X matrix (features) and y vector (target variable is last column)
X = churn_data_with_dummies.drop('Exited', axis=1)
y = churn_data_with_dummies['Exited']

# Show the first few rows of X to verify the transformation
print(X.head())

# Show the first few rows of y to verify the target variable
print(y.head())


Exclude the columns that should not be used as features: ['RowNumber', 'CustomerId', 'Surname']. This is because they are identifiers and not relevant to the prediction.
   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  \
0          619   42       2       0.00              1          1   
1          608   41       1   83807.86              1          0   
2          502   42       8  159660.80              3          1   
3          699   39       1       0.00              2          0   
4          850   43       2  125510.82              1          1   

   IsActiveMember  EstimatedSalary  Geography_France  Geography_Germany  \
0               1        101348.88              True              False   
1               1        112542.58             False              False   
2               0        113931.57              True              False   
3               0         93826.63              True              False   
4               1         79084.10            
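
The answer above keeps one dummy column per category level (for example `Geography_France`, `Geography_Germany`, and a third for Spain), which yields 13 features. For linear models such as logistic regression, a common alternative is to drop one level per variable so the dummy columns are not perfectly collinear. The sketch below shows that variant on made-up rows; it is an optional alternative and not what the answer uses, and the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy rows covering every category level (not from Churn.csv)
toy = pd.DataFrame({
    "Age": [42, 41, 29, 44],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Female", "Male"],
})

# drop_first=True removes the first level of each variable
# (here France and Female), which then acts as the reference category
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)
print(list(encoded.columns))
```

With this option the churn data would have 11 feature columns instead of 13, so the shapes reported in later questions would change accordingly.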

## **Question 3 (6 Points)**

Create a training data set and a test data set such that the test data set contains a randomly selected 20% of the data. Set the random seed to 40. Scale the features in both the training and test sets using StandardScaler.

### **Question 3 Answer**

In [3]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=40)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data with the same scaler
X_test_scaled = scaler.transform(X_test)

# Show the shapes of the resulting datasets to confirm the split
print(f"Training Data Set shape: {X_train_scaled.shape}")
print(f"Test Data Set shape: {X_test_scaled.shape}")

Training Data Set shape: (8000, 13)
Test Data Set shape: (2000, 13)
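
The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into preprocessing. The sketch below reproduces that idea by hand with NumPy on synthetic data; the array sizes and distribution are assumptions chosen for illustration, and the manual formula is meant to mirror standardization with the population standard deviation rather than replace `StandardScaler`.

```python
import numpy as np

# Synthetic data: 100 rows, 2 features (values are illustrative only)
rng = np.random.default_rng(40)
data = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# Simple 80/20 split of the synthetic rows
train, test = data[:80], data[80:]

# Learn the mean and standard deviation from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("Train mean after scaling:", train_scaled.mean(axis=0).round(6))
print("Test mean after scaling: ", test_scaled.mean(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1 by construction, while the test columns are only approximately centered because they were scaled with statistics they did not contribute to.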


## **Question 4 (6 Points)**

Implement XGBoost for classification on the training data. Evaluate the performance of the XGBoost classifier on the test data. Report accuracy and the confusion matrix. Report precision and recall.

### **Question 4 Answer**

In [4]:
# Import necessary libraries
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Initialize the XGBoost classifier
xgb_classifier = XGBClassifier(eval_metric='logloss')

# Fit the classifier to the scaled training data
xgb_classifier.fit(X_train_scaled, y_train)

# Predict the labels for the test set
y_pred = xgb_classifier.predict(X_test_scaled)

# Calculate accuracy, confusion matrix, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Print results
print(f"Model Evaluation Metrics:\n{'-'*28}")
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}\n")
print("Confusion Matrix:")
print(f"{conf_matrix[0][0]} (True Negative)  | {conf_matrix[0][1]} (False Positive)")
print(f"{conf_matrix[1][0]} (False Negative) | {conf_matrix[1][1]} (True Positive)")


Model Evaluation Metrics:
----------------------------
Accuracy : 0.8735
Precision: 0.7162
Recall   : 0.5651

Confusion Matrix:
1530 (True Negative)  | 86 (False Positive)
167 (False Negative) | 217 (True Positive)
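
As a consistency check, the three reported metrics can be recomputed by hand from the four confusion-matrix counts printed above. This is a small worked example; the variable names are chosen here for illustration and deliberately differ from those in the notebook.

```python
# Counts copied from the confusion matrix printed above
tn, fp, fn, tp = 1530, 86, 167, 217

total = tn + fp + fn + tp

# Accuracy: share of all predictions that are correct
acc_check = (tp + tn) / total
# Precision: share of predicted churners who actually churned
prec_check = tp / (tp + fp)
# Recall: share of actual churners the model identified
rec_check = tp / (tp + fn)

print(f"Accuracy : {acc_check:.4f}")
print(f"Precision: {prec_check:.4f}")
print(f"Recall   : {rec_check:.4f}")
```

The values agree with the scikit-learn output: 1747 of 2000 predictions are correct, 217 of 303 predicted churners are true churners, and 217 of 384 actual churners are found.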


## **Question 5 (5 Points)**

Could you find better choices of parameters using GridSearchCV? Does the performance improve with this choice of parameter?

### **Question 5 Answer**

In [5]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [4, 6, 8]
}

# Initialize the XGBoost classifier
xgb_classifier = XGBClassifier(eval_metric='logloss')

# Initialize GridSearchCV with the parameter grid, classifier, and scoring metric
grid_search = GridSearchCV(estimator=xgb_classifier, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1)

# Fit GridSearchCV to the scaled training data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best parameters found: {grid_search.best_params_}")

# Use the best estimator found to predict the labels of the test set
y_pred_optimized = grid_search.best_estimator_.predict(X_test_scaled)

# Calculate and print the improved accuracy, confusion matrix, precision, and recall
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
conf_matrix_optimized = confusion_matrix(y_test, y_pred_optimized)
precision_optimized = precision_score(y_test, y_pred_optimized)
recall_optimized = recall_score(y_test, y_pred_optimized)

# Function to compare model performance
def compare_model_performance(accuracy_orig, precision_orig, recall_orig, conf_matrix_orig,
                              accuracy_opt, precision_opt, recall_opt, conf_matrix_opt):
    """
    Compares the performance metrics of the original XGBoost model and the optimized model.
    """
    print("\nComparison of Original and Optimized Model Performances:\n" + "-"*60)
    print(f"{'Metric':<12}{'Original':<10}{'Optimized':<10}")
    print(f"{'Accuracy':<12}{accuracy_orig:.4f}    {accuracy_opt:.4f}")
    print(f"{'Precision':<12}{precision_orig:.4f}    {precision_opt:.4f}")
    print(f"{'Recall':<12}{recall_orig:.4f}    {recall_opt:.4f}\n")
    
    print("Original Confusion Matrix:")
    print(f"{conf_matrix_orig[0][0]} (TN) | {conf_matrix_orig[0][1]} (FP)")
    print(f"{conf_matrix_orig[1][0]} (FN) | {conf_matrix_orig[1][1]} (TP)\n")
    
    print("Optimized Confusion Matrix:")
    print(f"{conf_matrix_opt[0][0]} (TN) | {conf_matrix_opt[0][1]} (FP)")
    print(f"{conf_matrix_opt[1][0]} (FN) | {conf_matrix_opt[1][1]} (TP)")

# Call the function with both the original and optimized metrics
compare_model_performance(accuracy, precision, recall, conf_matrix,
                          accuracy_optimized, precision_optimized, recall_optimized, conf_matrix_optimized)


Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best parameters found: {'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 300}

Comparison of Original and Optimized Model Performances:
------------------------------------------------------------
Metric      Original  Optimized 
Accuracy    0.8735    0.8740
Precision   0.7162    0.7391
Recall      0.5651    0.5312

Original Confusion Matrix:
1530 (TN) | 86 (FP)
167 (FN) | 217 (TP)

Optimized Confusion Matrix:
1544 (TN) | 72 (FP)
180 (FN) | 204 (TP)
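
The comparison above shows accuracy changing by a single prediction (1748 versus 1747 correct out of 2000), while precision rises and recall falls. One way to summarize that trade-off in a single number is the F1 score, which the original answer does not report. The sketch below computes it directly from the confusion-matrix counts printed above; the helper `f1_from_counts` is a name introduced here for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score (harmonic mean of precision and recall) from raw counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts copied from the two confusion matrices printed above
f1_original = f1_from_counts(tp=217, fp=86, fn=167)
f1_optimized = f1_from_counts(tp=204, fp=72, fn=180)

print(f"F1 original : {f1_original:.4f}")
print(f"F1 optimized: {f1_optimized:.4f}")
```

By this measure the tuned model is slightly worse on the churn class, which suggests the grid search (scored on accuracy) traded recall for precision rather than improving the model overall. Whether that counts as an improvement depends on the relative cost of missed churners and false alarms.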


## **Question 6 (5 Points)**

How does the accuracy of XGBoost compare with that of logistic regression? Show your comparisons.

### **Question 6 Answer**

In [6]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the logistic regression model
log_reg = LogisticRegression(max_iter=1000)

# Fit the model to the scaled training data
log_reg.fit(X_train_scaled, y_train)

# Predict the labels for the test set
y_pred_log_reg = log_reg.predict(X_test_scaled)

# Calculate the accuracy of the logistic regression model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)

# Function to compare the accuracy of two models
def compare_accuracy(accuracy_xgb, accuracy_lr):
    """
    Compares the accuracy of the XGBoost model and the Logistic Regression model.
    """
    print("\nComparison of Model Accuracies:\n" + "-"*35)
    print(f"{'Model':<20}{'Accuracy':<15}")
    print(f"{'Logistic Regression':<20}{accuracy_lr:.4f}")
    print(f"{'Optimized XGBoost':<20}{accuracy_xgb:.4f}")

# Call the function to compare the accuracies
compare_accuracy(accuracy_optimized, accuracy_log_reg)



Comparison of Model Accuracies:
-----------------------------------
Model               Accuracy       
Logistic Regression 0.8210
Optimized XGBoost   0.8740
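
To put these accuracies in context, it can help to compare them with a trivial baseline that always predicts the majority class. The sketch below derives that baseline from the test-set class counts implied by the Question 4 confusion matrix; the baseline itself is an addition for illustration and is not part of the original answer.

```python
# Test-set class counts implied by the Question 4 confusion matrix
negatives = 1530 + 86   # customers who stayed (TN + FP)
positives = 167 + 217   # customers who exited (FN + TP)
total = negatives + positives

# Accuracy of always predicting the majority class
baseline_accuracy = max(negatives, positives) / total

# Accuracies copied from the comparison table above
accuracy_lr_reported = 0.8210
accuracy_xgb_reported = 0.8740

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Logistic regression gain over baseline: {accuracy_lr_reported - baseline_accuracy:.4f}")
print(f"Optimized XGBoost gain over baseline:   {accuracy_xgb_reported - baseline_accuracy:.4f}")
```

Because roughly 81% of test customers did not churn, logistic regression improves on the baseline by only about 1.3 percentage points, whereas XGBoost improves on it by about 6.6 points.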


## **Question 7 (10 Points)**

Fit an artificial neural network model with two hidden layers to predict customer churn. Use activation='sigmoid' for the output layer. How does the accuracy of the neural network model compare with XGBoost and logistic regression?

### **Question 7 Answer**

In [7]:
# Import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers.legacy import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.metrics import accuracy_score
import numpy as np

# Build a feed-forward network with two hidden layers of 128 units each.
# Batch normalization and dropout (rate 0.4) follow each hidden layer for regularization.
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    BatchNormalization(),
    Dropout(0.4),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.4),
    Dense(1, activation='sigmoid')  # Sigmoid output gives the churn probability
])

# Compile with Adam at a small learning rate and binary cross-entropy loss
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy', metrics=['accuracy'])

# Multiply the learning rate by 0.2 when validation loss has not improved for 5 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.00001, verbose=1)

# Stop training when validation loss has not improved for 15 epochs and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)

# Train the model, holding out 20% of the training data for validation
history = model.fit(X_train_scaled, y_train, epochs=300, batch_size=64, verbose=1,
                    validation_split=0.2, callbacks=[early_stopping, reduce_lr])

# Predict churn probabilities on the test set and round them to 0/1 class labels
y_pred_nn = model.predict(X_test_scaled)
y_pred_nn = np.round(y_pred_nn).flatten()

# Calculate the accuracy of the neural network on the test set
accuracy_nn = accuracy_score(y_test, y_pred_nn)

# Function to compare the accuracy of three models
def compare_model_accuracies(accuracy_xgb, accuracy_lr, accuracy_nn):
    """
    Compares the accuracy of the XGBoost model, Logistic Regression model, and Neural Network model.
    """
    print("\nComparison of Model Accuracies:\n" + "-"*40)
    print(f"{'Model':<25}{'Accuracy':<15}")
    print(f"{'Logistic Regression':<25}{accuracy_lr:.4f}")
    print(f"{'Optimized XGBoost':<25}{accuracy_xgb:.4f}")
    print(f"{'Neural Network':<25}{accuracy_nn:.4f}")

# Call the function to compare the accuracies
compare_model_accuracies(accuracy_optimized, accuracy_log_reg, accuracy_nn)

print("XGBoost proved to be the most accurate model in this comparison. This is likely because gradient-boosted trees are particularly effective on tabular data and are robust to irregularities such as outliers and missing values. Neural networks, while powerful at capturing complex patterns and interactions, often need more extensive preprocessing, feature engineering, and hyperparameter tuning to reach comparable performance on tabular data sets.")


Epoch 1/300


Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 64: ReduceLROnPlateau reducing learning rate to 1.9999999494757503e-05.
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 71: ReduceLROn