
# Week 12 Assignment – Deep Learning vs. XGBoost Model Comparison

This notebook compares deep learning models with XGBoost using datasets generated from the Week 11 R script.
The goal is to evaluate training error, validation error, and execution time across varying dataset sizes and model configurations.


In [None]:
# Import the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import time

# Function to simulate data
def generate_data(size, seed=42):
    np.random.seed(seed)
    X = np.random.rand(size, 10)
    y = (np.sum(X, axis=1) > 5).astype(int)
    return pd.DataFrame(X, columns=[f"x{i}" for i in range(1, 11)]), pd.Series(y, name="outcome")

datasets = {
    "1000": generate_data(1000),
    "10000": generate_data(10000),
    "100000": generate_data(100000)
}
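
Each feature is drawn uniformly from [0, 1), so the sum of ten features has an expected value of 5 and the `> 5` threshold should split the labels roughly in half. The sketch below is one way to check that; `class_balance` is a hypothetical helper that is not part of the notebook, and it rebuilds the labels the same way `generate_data` does.

```python
import numpy as np

def class_balance(size, seed=42):
    """Fraction of positive labels for data built like generate_data (hypothetical helper)."""
    rng = np.random.RandomState(seed)        # legacy NumPy generator with a fixed seed
    X = rng.rand(size, 10)                   # 10 uniform features per row
    y = (X.sum(axis=1) > 5).astype(int)      # same labelling rule as generate_data
    return float(y.mean())

for n in (1000, 10000, 100000):
    print(n, round(class_balance(n), 3))
```

If the fraction is close to 0.5, a model that always predicts one class would have an error near 0.5, which is a useful reference point when reading the error columns.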


In [None]:
# Function to build deep learning model
from tensorflow.keras.layers import Input

def build_model(input_dim, hidden_layers):
    # Declare the input shape with an explicit Input layer; recent Keras versions
    # warn when input_dim is passed to a Dense layer in a Sequential model.
    model = Sequential()
    model.add(Input(shape=(input_dim,)))
    # One ReLU layer per entry in hidden_layers
    for nodes in hidden_layers:
        model.add(Dense(nodes, activation='relu'))
    # Single sigmoid unit for the binary outcome
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Function to train the model and record performance
def train_and_evaluate(X, y, hidden_layers):
    # Hold out 20% of the rows for validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    model = build_model(X_train.shape[1], hidden_layers)
    # Time only the fit call; perf_counter is intended for measuring intervals
    start = time.perf_counter()
    history = model.fit(X_train, y_train, epochs=5, batch_size=64, verbose=0, validation_data=(X_val, y_val))
    end = time.perf_counter()
    # Error = 1 - accuracy, taken from the final epoch
    training_error = 1 - history.history['accuracy'][-1]
    validation_error = 1 - history.history['val_accuracy'][-1]
    execution_time = end - start
    return training_error, validation_error, execution_time
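
The error values above are simply 1 minus accuracy. As a minimal sketch of what that means, the hypothetical `error_rate` helper below thresholds sigmoid outputs at 0.5 and counts mismatches; the probabilities and labels are made up for illustration. Note also that Keras reports training accuracy averaged over the batches of an epoch while the weights are still changing, whereas validation accuracy is computed at the end of the epoch, which may be part of why validation error can come out lower than training error.

```python
import numpy as np

def error_rate(probabilities, labels, threshold=0.5):
    """Misclassification rate, i.e. 1 - accuracy at the given threshold (hypothetical helper)."""
    predictions = (np.asarray(probabilities) > threshold).astype(int)
    return float(np.mean(predictions != np.asarray(labels)))

# Made-up sigmoid outputs and true labels, for illustration only
probs = [0.9, 0.2, 0.6, 0.4, 0.7]
labels = [1, 0, 0, 0, 1]
print(error_rate(probs, labels))  # one of the five predictions is wrong
```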


In [None]:
# Run experiments and store results
results = []
for size, (X, y) in datasets.items():
    for config in [([4], "1 hidden layer – 4 nodes"), ([4, 4], "2 hidden layers – 4 nodes each")]:
        tr_err, val_err, exec_time = train_and_evaluate(X, y, config[0])
        results.append({
            "Dataset Size": size,
            "Configuration": config[1],
            "Training Error": round(tr_err, 4),
            "Validation Error": round(val_err, 4),
            "Time (s)": round(exec_time, 2)
        })

results_df = pd.DataFrame(results)
results_df




| | Dataset Size | Configuration | Training Error | Validation Error | Time (s) |
|---|---|---|---|---|---|
| 0 | 1000 | 1 hidden layer – 4 nodes | 0.38 | 0.345 | 4.6 |
| 1 | 1000 | 2 hidden layers – 4 nodes each | 0.4075 | 0.43 | 7.1 |
| 2 | 10000 | 1 hidden layer – 4 nodes | 0.1231 | 0.0965 | 2.82 |
| 3 | 10000 | 2 hidden layers – 4 nodes each | 0.0792 | 0.0605 | 3.15 |
| 4 | 100000 | 1 hidden layer – 4 nodes | 0.0021 | 0.0034 | 17.27 |
| 5 | 100000 | 2 hidden layers – 4 nodes each | 0.0021 | 0.0034 | 16.39 |


When we trained the deep learning models, the effect of the extra hidden layer depended on the dataset size. With 1000 rows, the single-hidden-layer model had the lower validation error (0.345 vs. 0.43), and neither model was accurate. With 10000 rows, the model with 2 hidden layers did better (0.0605 vs. 0.0965). So the extra layer did not help consistently on the smaller datasets, and because each configuration was trained only once, for 5 epochs, these gaps may partly reflect run-to-run variation. With 100000 rows, both models reached the same validation error (0.0034). The 2-hidden-layer model finished a little faster there (16.39 s vs. 17.27 s), but the difference is small and the single-hidden-layer model is simpler, so it makes more sense to use it. In short, the benefit of more layers was mixed on the small datasets, and on the large dataset a simple model worked just as well.
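
To make the comparison easier to read, the sketch below puts the validation errors from the results table side by side and computes the gap between the two configurations. The numbers are copied from the table; the `Layers` column and the `2 minus 1` column name are labels introduced here for illustration.

```python
import pandas as pd

# Validation errors copied from the results table
results = pd.DataFrame({
    "Dataset Size": [1000, 1000, 10000, 10000, 100000, 100000],
    "Layers": [1, 2, 1, 2, 1, 2],
    "Validation Error": [0.345, 0.43, 0.0965, 0.0605, 0.0034, 0.0034],
})

# One row per dataset size, one column per number of hidden layers
by_size = results.pivot(index="Dataset Size", columns="Layers", values="Validation Error")

# Positive gap: the 2-layer model was worse; negative gap: it was better
by_size["2 minus 1"] = by_size[2] - by_size[1]
print(by_size)
```

A positive gap at 1000 rows, a negative gap at 10000 rows, and a zero gap at 100000 rows is the pattern the discussion above refers to.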