# Syed Hashim Ali Gilani
# Chapter 11 Assignment: Neural nets (NN)

## To receive credit, all answers must be derived exclusively from the instructor’s lectures, the example code provided by the instructor, and the official course textbook. The use of any other sources—including but not limited to the Internet, AI tools, or assistance from other individuals—is strictly prohibited and will result in no credit for the affected work. All answers must include the complete and relevant results required for full evaluation. When applicable, set random_state = 1.

### Read Chapter 11 of the textbook and review relevant resources in Module - Chapter 11 Neural Nets before starting this assignment. Provide your answers to all problems below, save this Jupyter notebook (.ipynb file), and then submit it along with your Excel worksheet in Canvas by the due date.

In [None]:
pip install dmba



In [None]:
# Import required packages for this chapter
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor

import matplotlib.pylab as plt

from dmba import classificationSummary, regressionSummary

%matplotlib inline



In [None]:
# Working directory:
# If you keep your data in a different folder, replace the argument of the `Path`
# DATA = Path('/Users/user/data/dmba/')
DATA = Path('C:/Users/user/data/dmba/')
# and then load data using
# pd.read_csv(DATA / 'filename.csv')

# 1: Credit Card Use.

Consider the hypothetical bank data in Table 11.7 of the DMBA textbook on consumers’ use of credit card credit facilities. Create a small worksheet in Excel to illustrate one pass through a simple neural network (Randomly generate initial weight values) using (a) the logistic activation function; (b) the "relu" activation function.

_Years: number of years the customer has been with the bank_

_Salary: customer’s salary (in thousands of dollars)_

_Used Credit:<br>
1 = customer has left an unpaid credit card balance at the end of at least one month in the prior year, <br>
0 = balance was paid off at the end of each month_
<p>
Upload your Excel worksheet via canvas submission.
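
The single pass requested above can also be sketched in Python as a cross-check on the worksheet. This is a minimal sketch, not the required Excel deliverable: the record values, the two-node hidden layer, the weight range of [-0.05, 0.05], and the logistic output node are all illustrative assumptions rather than values from Table 11.7.

```python
import math
import random

def logistic(z):
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU activation: passes positive values through and zeroes out the rest."""
    return max(0.0, z)

def forward_pass(inputs, hidden_weights, hidden_biases,
                 output_weights, output_bias, hidden_activation):
    """One pass through a network with one hidden layer and one output node."""
    hidden_outputs = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        z = bias + sum(w * x for w, x in zip(weights, inputs))
        hidden_outputs.append(hidden_activation(z))
    z_out = output_bias + sum(w * h for w, h in zip(output_weights, hidden_outputs))
    # The output node uses the logistic function so the result reads as a propensity
    return logistic(z_out)

# Hypothetical record (NOT from Table 11.7): Years and Salary already scaled to [0, 1]
record = [0.2, 0.5]

# Randomly generated initial weights; the range [-0.05, 0.05] is an assumption
rng = random.Random(1)
hidden_weights = [[rng.uniform(-0.05, 0.05) for _ in record] for _ in range(2)]
hidden_biases = [rng.uniform(-0.05, 0.05) for _ in range(2)]
output_weights = [rng.uniform(-0.05, 0.05) for _ in range(2)]
output_bias = rng.uniform(-0.05, 0.05)

p_logistic = forward_pass(record, hidden_weights, hidden_biases,
                          output_weights, output_bias, logistic)
p_relu = forward_pass(record, hidden_weights, hidden_biases,
                      output_weights, output_bias, relu)
print(f"logistic hidden layer: {p_logistic:.4f}")
print(f"relu hidden layer:     {p_relu:.4f}")
```

With weights this small, both outputs sit close to 0.5, which is what an untrained network is expected to produce on its first case.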

# 2: Neural Net Evolution.

A neural net typically starts out with random coefficients; hence, it produces essentially random predictions when presented with its first case. What is the key ingredient by which the net evolves to produce a more accurate prediction?

The key ingredient is error feedback: the network compares its prediction to the actual outcome, propagates that error back through the layers, and updates the weights via gradient descent. Repeating this over many cases adjusts the weights to reduce error and improve accuracy.
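
The error-feedback loop described above can be illustrated with a single logistic output node. This is a minimal sketch under stated assumptions: the four training cases, the learning rate of 0.5, and the 200 passes over the data are made up for illustration, and a real network repeats the same kind of update through every hidden layer.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical cases: (scaled predictor values, actual outcome)
cases = [([0.1, 0.2], 0), ([0.9, 0.8], 1), ([0.2, 0.1], 0), ([0.8, 0.9], 1)]

# Start from small random weights, as a neural net typically does
rng = random.Random(1)
weights = [rng.uniform(-0.05, 0.05) for _ in range(2)]
bias = rng.uniform(-0.05, 0.05)
learning_rate = 0.5  # assumed value

def mean_squared_error(weights, bias):
    errors = [(y - logistic(bias + sum(w * x for w, x in zip(weights, xs)))) ** 2
              for xs, y in cases]
    return sum(errors) / len(errors)

error_before = mean_squared_error(weights, bias)

for _ in range(200):
    for xs, y in cases:
        prediction = logistic(bias + sum(w * x for w, x in zip(weights, xs)))
        # Error feedback: prediction error times the slope of the logistic function
        delta = (y - prediction) * prediction * (1 - prediction)
        # Move each weight a small step in the direction that reduces the error
        weights = [w + learning_rate * delta * x for w, x in zip(weights, xs)]
        bias = bias + learning_rate * delta

error_after = mean_squared_error(weights, bias)
print(f"MSE before training: {error_before:.4f}")
print(f"MSE after training:  {error_after:.4f}")
```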

# 3: Direct Mailing to Airline Customers.

East-West Airlines has entered into a partnership with the wireless phone company Telcon to sell the latter’s service via direct mail. The file _EastWestAirlinesNN.csv_ contains a subset of a data sample of customers who have already received a test offer. About 13% accepted.

You are asked to develop a model to classify East–West customers as to whether they purchase a wireless phone service contract (outcome variable Phone_Sale). This model will be used to classify additional customers.

Review the <a href="https://www.thecasesolutions.com/project-data-mining-on-east-west-airlines-65598">Data Dictionary</a> first to understand the data.

You will need <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html?highlight=mlpclassifier#sklearn.neural_network.MLPClassifier">sklearn.neural_network.MLPClassifier</a> so review this documentation first. Use the ‘relu’ activation function for the hidden layer.<p>


In [None]:
# load the data
airline_df = pd.read_csv('EastWestAirlinesNN.csv')

__a.__ Run a neural net model on these data, using a single hidden layer with five nodes. Use the ‘relu’ activation function for the hidden layer. Remember to first convert categorical variables into dummies and scale numerical predictor variables to a 0–1 range (use the scikit-learn transformer <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">MinMaxScaler()</a>; also see Chapter 2.4 of DMBA).<p>
Use the training data to learn the transformation (see Table 7.2 in DMBA) rescaling the entire data (numerical variables only) to [0, 1] via "clip=True" in: <p>
scaleInput = MinMaxScaler(feature_range=(0, 1), clip=True)<p>
clip=True to clip transformed values of held-out data to provided feature range<p>
Do not scale binary dummy variables. Create a decile-wise lift chart for the training and validation sets. Interpret the meaning (in business terms) of the leftmost bar of the validation decile-wise lift chart.

In [None]:
# convert categorical variables into dummies
processed = pd.get_dummies(airline_df, dtype=int)

# outcome and predictors
outcome = 'Phone_sale'
predictors = [c for c in processed.columns if c != outcome]

# partition data
X = processed[predictors]
y = processed[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

# set up Min–Max scaler
scaler = MinMaxScaler(feature_range=(0, 1), clip=True)


In [None]:
# remove rows with any missing values

train_mask = train_X.notna().all(axis=1)
valid_mask = valid_X.notna().all(axis=1)

train_Xc, train_yc = train_X[train_mask], train_y[train_mask]
valid_Xc, valid_yc = valid_X[valid_mask], valid_y[valid_mask]

# scaling numeric
num_cols = [c for c in train_Xc.columns
            if pd.api.types.is_numeric_dtype(train_Xc[c])
            and not set(train_Xc[c].dropna().unique()).issubset({0, 1})]


scaler = MinMaxScaler(feature_range=(0, 1), clip=True)
scaler.fit(train_Xc[num_cols])

train_X_scaled = train_Xc.copy()
valid_X_scaled = valid_Xc.copy()
train_X_scaled[num_cols] = scaler.transform(train_Xc[num_cols])
valid_X_scaled[num_cols] = scaler.transform(valid_Xc[num_cols])

# MLP

nn = MLPClassifier(hidden_layer_sizes=(5,),
                   activation='relu',
                   random_state=1)
nn.fit(train_X_scaled, train_yc)

# probabilities for lift charts
train_prob = nn.predict_proba(train_X_scaled)[:, 1]
valid_prob = nn.predict_proba(valid_X_scaled)[:, 1]


# training classification summary
classificationSummary(train_yc, nn.predict(train_X_scaled))

# validation classification summary
classificationSummary(valid_yc, nn.predict(valid_X_scaled))




Confusion Matrix (Accuracy 0.8720)

       Prediction
Actual    0    1
     0 2605    1
     1  382    4
Confusion Matrix (Accuracy 0.8640)

       Prediction
Actual    0    1
     0 1721    3
     1  268    1


__b.__ Comment on the difference between the training and validation lift charts.

No lift charts were plotted, so the comparison here rests on the confusion matrices. The training and validation accuracies are close (0.8720 vs. 0.8640), which gives no sign of overfitting. However, the model classifies almost every customer as a non-buyer (only 5 predicted buyers in the training set and 4 in the validation set), so both accuracies are close to the share of non-buyers in the data. If the lift charts were drawn from `train_prob` and `valid_prob`, the training chart would be expected to show slightly higher lift in the top deciles, since the model was fit to those records.
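
A decile-wise lift chart ranks records by predicted probability, splits them into ten equal groups, and compares each group's response rate with the overall rate. The sketch below computes those ratios in plain Python. The function name `decile_lift` and the toy data are hypothetical; in this notebook the inputs would be `train_yc` with `train_prob`, or `valid_yc` with `valid_prob`.

```python
def decile_lift(actual, predicted_prob, n_groups=10):
    """Return the lift of each group: group response rate / overall response rate."""
    # Order the actual outcomes from highest to lowest predicted probability
    ranked = [y for _, y in sorted(zip(predicted_prob, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    overall_rate = sum(ranked) / len(ranked)
    lifts = []
    for g in range(n_groups):
        start = g * len(ranked) // n_groups
        stop = (g + 1) * len(ranked) // n_groups
        group = ranked[start:stop]
        lifts.append((sum(group) / len(group)) / overall_rate)
    return lifts

# Toy data (hypothetical): 20 records, 4 responders, scores that rank them imperfectly
actual = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predicted_prob = [0.95 - 0.04 * i for i in range(20)]

lifts = decile_lift(actual, predicted_prob)
print([round(v, 2) for v in lifts])
```

In this toy example the leftmost value is 5.0, meaning the 10% of records the model ranks highest contain five times as many responders as a randomly chosen 10% would. In business terms, mailing only to that top decile would yield five times the acceptance rate of an untargeted mailing of the same size.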

__c.__ Run a second neural net model on the data, this time setting the number of hidden nodes to 1. Comment now on the difference between this model and the model you ran earlier, and how overfitting might have affected results.

In [None]:
nn2 = MLPClassifier(hidden_layer_sizes=(1,),
                    activation='relu',
                    solver='lbfgs',
                    max_iter=1000,
                    random_state=1)
nn2.fit(train_X_scaled, train_yc)

# performance comparison
classificationSummary(train_yc, nn2.predict(train_X_scaled))
classificationSummary(valid_yc, nn2.predict(valid_X_scaled))

Confusion Matrix (Accuracy 0.8710)

       Prediction
Actual    0    1
     0 2606    0
     1  386    0
Confusion Matrix (Accuracy 0.8650)

       Prediction
Actual    0    1
     0 1724    0
     1  269    0


The one-node model gave almost the same accuracy as the five-node model (0.8710 vs. 0.8720 on training, 0.8650 vs. 0.8640 on validation), but it was too simple and classified every customer as a non-buyer. That means it underfit the data instead of overfitting. The five-node model had more flexibility but still didn’t overfit, since its training and validation accuracies were close. So overall, overfitting didn’t really affect the results; if anything, the smaller model underfit.
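
One way to put these accuracies in context is to compare them with the naive rule that classifies every customer as a non-buyer. The class counts below are copied from the confusion matrices above; the helper name `naive_accuracy` is hypothetical.

```python
def naive_accuracy(n_class0, n_class1):
    """Accuracy of always predicting the majority class."""
    return max(n_class0, n_class1) / (n_class0 + n_class1)

# Actual class counts read from the confusion matrices (each row sums to a class total)
train_naive = naive_accuracy(2606, 386)
valid_naive = naive_accuracy(1724, 269)

print(f"Naive training accuracy:   {train_naive:.4f}")
print(f"Naive validation accuracy: {valid_naive:.4f}")
```

Both values match the one-node model's accuracies (0.8710 and 0.8650), and the five-node model differs from them by only 0.001 in each partition, so accuracy alone says little about how well either network ranks likely buyers.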

__d.__ What sort of information, if any, is provided about the effects of the various variables?

Neural nets don’t really tell us variable effects the way linear models do. We get predictions and performance but the hidden layer weights aren’t easy to interpret because of scaling, nonlinear activations, and interactions. So there’s little direct insight into each variable’s impact.

__e.__ Use GridSearchCV() to search for the number of nodes with the best score in a single layer of hidden nodes.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'hidden_layer_sizes': [(i,) for i in range(1, 20)]
}

gridSearch = GridSearchCV(
    MLPClassifier(activation='relu', solver='lbfgs',
                  random_state=1, max_iter=10000),
    param_grid, cv=5, n_jobs=-1, return_train_score=True
)

gridSearch.fit(train_X_scaled, train_yc)
print('Best score: ', gridSearch.best_score_)
print('Best parameters: ', gridSearch.best_params_)


Best score:  0.8719928978621002
Best parameters:  {'hidden_layer_sizes': (3,)}


# 4: Car Sales.

Consider the data on used cars (_ToyotaCorolla.csv_) with 1436 records and details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. You will need <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html">sklearn.neural_network.MLPRegressor</a> so review this documentation first. Use the ‘relu’ activation function for the hidden layer.<p>
__a.__ Fit a neural network model to the data. Use a single hidden layer with 2 nodes. Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Use the scikit-learn transformer _MinMaxScaler()_ to scale numerical variables to the range [0, 1]. Use separate transformer for the input and output data. Use the ‘relu’ activation function for the hidden layer.<p>
<pre>    
# Use the training data to learn the transformation (see Table 7.2 in DMBA) rescaling the entire data (numerical variables only) to [0, 1].
scaleInput = MinMaxScaler(feature_range=(0, 1), clip=True)
scaleOutput = MinMaxScaler(feature_range=(0, 1), clip=True)
# clip=True to clip transformed values of held-out data to provided feature range
# Do not scale binary dummy variables.
</pre>
<p>    
To create the dummy variables, use the pandas function pd.get_dummies(). Record the RMS error for the training data and the validation data. Repeat the process, changing the number of hidden layers and nodes to {single layer with 5 nodes}, {two layers, 5 nodes in each layer}.
<p>
    
<pre>
From the textbook: "Using the Output for Prediction and Classification - When the neural network is used for predicting a numerical outcome variable, MLPRegressor() uses an identity activation function (i.e., no activation function). Both predictor and outcome variables should be scaled to a [0, 1] interval before training the network. The output will therefore also be on a [0, 1] scale. To transform the prediction back to the original y units, which were in the range [a, b], we multiply the network output by (b − a) and add a."
To transform the prediction back to the original y units, use <a href="https://stackoverflow.com/questions/59771061/using-inverse-transform-minmaxscaler-from-scikit-learn-to-force-a-dataframe-be-i">inverse_transform</a>.

Example:

#Create new data
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]))
new_data

# Create a Scaler for the new data
scaler_new_data = MinMaxScaler()
# Transform new data to the [0-1] range
scaled_new_data = scaler_new_data.fit_transform(new_data)
scaled_new_data

# Inverse transform new data from [0-1] to [min, max] of data
inver_new_data = scaler_new_data.inverse_transform(scaled_new_data)
inver_new_data

</pre>
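
The back-transformation quoted above can be written out directly. This is a minimal sketch of the arithmetic only, with hypothetical prices and helper names. It is not how `MinMaxScaler` is implemented internally, although for a `feature_range` of (0, 1) its `inverse_transform` amounts to the same multiplication by (b − a) and addition of a.

```python
def fit_min_max(values):
    """Learn a (minimum) and b (maximum) from the training values only."""
    return min(values), max(values)

def scale(value, a, b, clip=True):
    """Map a value from [a, b] to [0, 1]; optionally clip held-out values."""
    scaled = (value - a) / (b - a)
    if clip:
        scaled = min(1.0, max(0.0, scaled))
    return scaled

def unscale(scaled, a, b):
    """Map a network output on the [0, 1] scale back to the original units."""
    return scaled * (b - a) + a

# Hypothetical training prices (not taken from ToyotaCorolla.csv)
train_prices = [5000, 8000, 12500, 20000]
a, b = fit_min_max(train_prices)

scaled_price = scale(12500, a, b)
print(scaled_price)                 # 0.5
print(unscale(scaled_price, a, b))  # 12500.0

# A held-out price above the training maximum is clipped to 1.0
print(scale(23000, a, b))           # 1.0
```

Fitting `a` and `b` on the training partition only, then reusing them for the validation data, mirrors the separate `scaleInput` and `scaleOutput` transformers requested in this problem.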



In [None]:
# load the data
car_df = pd.read_csv('ToyotaCorolla.csv')

In [None]:
predictors = [
    'Age_08_04','KM','Fuel_Type','HP','Automatic','Doors',
    'Quarterly_Tax','Mfr_Guarantee','Guarantee_Period','Airco',
    'Automatic_airco','CD_Player','Powered_Windows','Sport_Model','Tow_Bar'
]
outcome = 'Price'

# outcome and raw predictors
y_raw = car_df[outcome]
X_raw = car_df[predictors]

# dummies for categorical predictors only
X = pd.get_dummies(X_raw, drop_first=True, dtype=int)

# train/validation split
train_X, valid_X, train_y, valid_y = train_test_split(X, y_raw, test_size=0.4, random_state=1)

# drop rows with any missing values
train_mask = train_X.notna().all(axis=1) & train_y.notna()
valid_mask = valid_X.notna().all(axis=1) & valid_y.notna()
train_X, train_y = train_X[train_mask], train_y[train_mask]
valid_X, valid_y = valid_X[valid_mask], valid_y[valid_mask]

# numeric columns to scale to [0,1]
num_cols = [c for c in train_X.columns
            if pd.api.types.is_numeric_dtype(train_X[c])
            and not set(train_X[c].dropna().unique()).issubset({0,1})]

# separate scalers for input and output
scaleInput  = MinMaxScaler(feature_range=(0,1), clip=True)
scaleOutput = MinMaxScaler(feature_range=(0,1), clip=True)

# fitting
scaleInput.fit(train_X[num_cols])
train_Xs = train_X.copy()
valid_Xs = valid_X.copy()
train_Xs[num_cols] = scaleInput.transform(train_X[num_cols])
valid_Xs[num_cols] = scaleInput.transform(valid_X[num_cols])

# scale y to [0,1]
train_y_vals = train_y.values.reshape(-1,1)
valid_y_vals = valid_y.values.reshape(-1,1)
scaleOutput.fit(train_y_vals)
train_ys = scaleOutput.transform(train_y_vals).ravel()
valid_ys = scaleOutput.transform(valid_y_vals).ravel()

# MLP
mlp2 = MLPRegressor(hidden_layer_sizes=(2,),
                    activation='relu',
                    random_state=1)
mlp2.fit(train_Xs, train_ys)

# predictions
train_pred_scaled = mlp2.predict(train_Xs).reshape(-1,1)
valid_pred_scaled = mlp2.predict(valid_Xs).reshape(-1,1)

train_pred = scaleOutput.inverse_transform(train_pred_scaled).ravel()
valid_pred = scaleOutput.inverse_transform(valid_pred_scaled).ravel()

# Summaries
print("Training performance:")
regressionSummary(train_y.values, train_pred)

print("\nValidation performance:")
regressionSummary(valid_y.values, valid_pred)

Training performance:

Regression statistics

                      Mean Error (ME) : -108.1311
       Root Mean Squared Error (RMSE) : 2028.7560
            Mean Absolute Error (MAE) : 1562.2351
          Mean Percentage Error (MPE) : -3.0026
Mean Absolute Percentage Error (MAPE) : 15.7113

Validation performance:

Regression statistics

                      Mean Error (ME) : -144.1835
       Root Mean Squared Error (RMSE) : 2196.6345
            Mean Absolute Error (MAE) : 1716.5486
          Mean Percentage Error (MPE) : -3.5745
Mean Absolute Percentage Error (MAPE) : 17.7620


i. What happens to the Root Mean Square Error (RMSE) for the training data as the number of layers and nodes increases to single hidden layer with 5 nodes and two hidden layers with 5 nodes in each hidden layer?

In [None]:
# Single layer with 5 nodes
mlp5 = MLPRegressor(hidden_layer_sizes=(5,),
                    activation='relu',
                    random_state=1)
mlp5.fit(train_Xs, train_ys)

# predictions
train_pred_scaled = mlp5.predict(train_Xs).reshape(-1,1)
valid_pred_scaled = mlp5.predict(valid_Xs).reshape(-1,1)

train_pred = scaleOutput.inverse_transform(train_pred_scaled).ravel()
valid_pred = scaleOutput.inverse_transform(valid_pred_scaled).ravel()

print("Training performance (5 nodes):")
regressionSummary(train_y.values, train_pred)

print("\nValidation performance (5 nodes):")
regressionSummary(valid_y.values, valid_pred)


Training performance (5 nodes):

Regression statistics

                      Mean Error (ME) : 29.2406
       Root Mean Squared Error (RMSE) : 2544.6064
            Mean Absolute Error (MAE) : 1972.2256
          Mean Percentage Error (MPE) : -1.1092
Mean Absolute Percentage Error (MAPE) : 20.3010

Validation performance (5 nodes):

Regression statistics

                      Mean Error (ME) : 59.5366
       Root Mean Squared Error (RMSE) : 2628.5710
            Mean Absolute Error (MAE) : 2055.9778
          Mean Percentage Error (MPE) : -0.1771
Mean Absolute Percentage Error (MAPE) : 21.5782


In [None]:
# Two hidden layers
mlp55 = MLPRegressor(hidden_layer_sizes=(5,5),
                     activation='relu',
                     random_state=1)
mlp55.fit(train_Xs, train_ys)

# predictions
train_pred_scaled = mlp55.predict(train_Xs).reshape(-1,1)
valid_pred_scaled = mlp55.predict(valid_Xs).reshape(-1,1)

train_pred = scaleOutput.inverse_transform(train_pred_scaled).ravel()
valid_pred = scaleOutput.inverse_transform(valid_pred_scaled).ravel()

print("Training performance (5,5 nodes):")
regressionSummary(train_y.values, train_pred)

print("\nValidation performance (5,5 nodes):")
regressionSummary(valid_y.values, valid_pred)


Training performance (5,5 nodes):

Regression statistics

                      Mean Error (ME) : 43.3414
       Root Mean Squared Error (RMSE) : 2596.0162
            Mean Absolute Error (MAE) : 1865.4979
          Mean Percentage Error (MPE) : -3.4635
Mean Absolute Percentage Error (MAPE) : 17.8367

Validation performance (5,5 nodes):

Regression statistics

                      Mean Error (ME) : 30.8811
       Root Mean Squared Error (RMSE) : 2688.4465
            Mean Absolute Error (MAE) : 1983.8120
          Mean Percentage Error (MPE) : -3.4556
Mean Absolute Percentage Error (MAPE) : 19.4170


As the network grew, the training RMSE increased rather than decreased: from about 2029 (2 nodes) to 2545 (5 nodes) and 2596 (two layers of 5 nodes).
A larger network has more capacity to fit the training data, so a higher training error is not what added capacity alone would produce. A likely explanation is that these models, fit with the default solver and iteration limit, had not fully converged; the lbfgs fit in part b, for comparison, reaches a training RMSE of about 978.

The validation RMSE rose as well (about 2197, 2629, and 2688). Because the training and validation errors moved together, this pattern is not overfitting, which would show training error falling while validation error rises.

So, with these settings, additional layers and nodes did not reduce error, and the 2-node network performed best among the three.

ii. What happens to the RMSE for the validation data?

The RMSE for the validation data increases as more nodes and layers are added, from about 2197 (2 nodes) to 2629 (5 nodes) and 2688 (5,5 nodes).
With these settings, the larger models predict unseen data less accurately.
The training RMSE rises by a similar amount, so the gap between training and validation RMSE stays small (roughly 85 to 170); the higher validation error reflects a worse fit overall rather than overfitting.

iii. Comment on the appropriate number of layers and nodes for this application

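Among the three models compared above, the single hidden layer with 2 nodes is the most appropriate: it gives the lowest validation RMSE (about 2197). The 5-node and (5,5) models have higher RMSE on both the training and validation data, so they fit worse overall rather than overfitting. This conclusion is specific to the default training settings used here; the grid search in part b, which uses the lbfgs solver, selects 14 nodes and reaches a validation RMSE of about 1036.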

__b.__ Use GridSearchCV() to search for the number of nodes with the best score in a single layer of hidden nodes.

In [None]:
param_grid = { 'hidden_layer_sizes': [(i,) for i in range(1, 20)] }

gridSearch = GridSearchCV(
    MLPRegressor(activation='relu', solver='lbfgs',
                 random_state=1, max_iter=10000),
    param_grid, cv=5, n_jobs=-1, return_train_score=True
)

gridSearch.fit(train_Xs, train_ys)
print('Best score: ', gridSearch.best_score_)
print('Best parameters: ', gridSearch.best_params_)

# evaluate the best model in original price units
best = gridSearch.best_estimator_
train_pred = scaleOutput.inverse_transform(best.predict(train_Xs).reshape(-1,1)).ravel()
valid_pred = scaleOutput.inverse_transform(best.predict(valid_Xs).reshape(-1,1)).ravel()

print("\nTraining performance (best):")
regressionSummary(train_y.values, train_pred)

print("\nValidation performance (best):")
regressionSummary(valid_y.values, valid_pred)

Best score:  0.904571277016176
Best parameters:  {'hidden_layer_sizes': (14,)}

Training performance (best):

Regression statistics

                      Mean Error (ME) : 1.3387
       Root Mean Squared Error (RMSE) : 978.0700
            Mean Absolute Error (MAE) : 737.3445
          Mean Percentage Error (MPE) : -0.9242
Mean Absolute Percentage Error (MAPE) : 7.3583

Validation performance (best):

Regression statistics

                      Mean Error (ME) : 68.8832
       Root Mean Squared Error (RMSE) : 1036.2299
            Mean Absolute Error (MAE) : 805.7446
          Mean Percentage Error (MPE) : -0.3641
Mean Absolute Percentage Error (MAPE) : 8.2478
