<a href="https://colab.research.google.com/github/mae25-create/data_visualization-analysis_practice/blob/main/Expedia_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying Neural Networks to the Expedia Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_clean = pd.read_csv("expedia_clean.csv")

The data preparation is going to be identical to the one we did in the last class (Random Forests and Gradient Boosting).

In [None]:
# Converting the dates to date format (as it was in object format)

df_clean["date_time"] = pd.to_datetime(df_clean["date_time"])
df_clean["srch_ci"] = pd.to_datetime(df_clean["srch_ci"])
df_clean["srch_co"] = pd.to_datetime(df_clean["srch_co"])

In [None]:
df_clean.drop(columns=["Unnamed: 0", "Unnamed: 0.1"], inplace=True)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96534 entries, 0 to 96533
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date_time                 96534 non-null  datetime64[ns]
 1   site_name                 96534 non-null  int64         
 2   posa_continent            96534 non-null  int64         
 3   user_location_country     96534 non-null  object        
 4   user_location_region      96534 non-null  int64         
 5   user_location_city        96534 non-null  int64         
 6   user_id                   96534 non-null  int64         
 7   is_mobile                 96534 non-null  int64         
 8   is_package                96534 non-null  int64         
 9   channel                   96534 non-null  int64         
 10  srch_ci                   96534 non-null  datetime64[ns]
 11  srch_co                   96534 non-null  datetime64[ns]
 12  srch_adults_cnt   

As we mentioned yesterday, cross-validation is not suitable here because the data has a time structure: using future data to predict past data makes no sense. Instead, we'll do the following.

Again, we split the data into 70% training, 15% validation, and 15% testing, based on the timestamp: the first 70% of the rows for training, the next 15% for validation, and the last 15% for testing. This way we train on past data to predict future data, which does make sense.

In [None]:
df = df_clean.copy()

# Ensure the data is sorted by search date
df = df.sort_values(by="date_time")

# Extract features from srch_ci (Check-in Date)
df["srch_ci_year"] = df["srch_ci"].dt.year
df["srch_ci_month"] = df["srch_ci"].dt.month
df["srch_ci_day"] = df["srch_ci"].dt.day
df["srch_ci_dow"] = df["srch_ci"].dt.dayofweek  # Monday=0, Sunday=6
df["srch_ci_hour"] = df["srch_ci"].dt.hour

# Extract features from srch_co (Check-out Date)
df["srch_co_year"] = df["srch_co"].dt.year
df["srch_co_month"] = df["srch_co"].dt.month
df["srch_co_day"] = df["srch_co"].dt.day
df["srch_co_dow"] = df["srch_co"].dt.dayofweek
df["srch_co_hour"] = df["srch_co"].dt.hour

# Delete time objects
df = df.drop(columns=["srch_ci", "srch_co", "date_time", "time"])

In [None]:
from sklearn.preprocessing import LabelEncoder

# Dealing with categorical variables

day_map = {
    "Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
    "Friday": 4, "Saturday": 5, "Sunday": 6
}
df["day_of_week"] = df["day_of_week"].map(day_map)

categorical_cols = ["user_location_country", "hotel_country"]

label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # Save encoder if you want to inverse transform later

In [None]:
# Define the split index (70% training, 15% validation, 15% testing)
n_total = len(df)
train_end = int(n_total * 0.70)
val_end = int(n_total * 0.85)

# Split the data
X_train = df.iloc[:train_end].drop("is_booking", axis=1)
y_train = df.iloc[:train_end]["is_booking"]

X_val = df.iloc[train_end:val_end].drop("is_booking", axis=1)
y_val = df.iloc[train_end:val_end]["is_booking"]

X_test = df.iloc[val_end:].drop("is_booking", axis=1)
y_test = df.iloc[val_end:]["is_booking"]

In [None]:
!pip install tensorflow



In [None]:
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

We haven't seen anything related to adaptive learning rates yet, so we'll fix the learning rate at $\eta=0.01$. We'll use a neural network with two hidden layers, both with ReLU as the activation function. The first hidden layer will have 30 neurons and the second will have 15. Finally, since we want the output to be a probability (between zero and one), we'll use the sigmoid function as the activation function for the output layer.
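
To get a feel for the size of this network, we can count its trainable parameters by hand: a dense layer with `n_inputs` inputs and `n_neurons` neurons has `n_inputs * n_neurons` weights plus `n_neurons` biases. The sketch below is only illustrative. The helper names are made up for this example, and the input width of 10 features is a hypothetical value (the real one is `X_train.shape[1]`).

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_neurons + n_neurons


def count_params(n_features, layer_sizes=(30, 15, 1)):
    """Total trainable parameters of a stack of dense layers."""
    total = 0
    n_inputs = n_features
    for n_neurons in layer_sizes:
        total += dense_params(n_inputs, n_neurons)
        n_inputs = n_neurons  # the output of this layer feeds the next one
    return total


# Hypothetical input width of 10 features, for illustration only
print(count_params(10))  # 330 + 465 + 16 = 811
```

In Keras, `model.summary()` reports the same count once the model has been built.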

As we explained before, the loss function for this type of problem is the binary cross-entropy loss. We haven't introduced the concept of an **epoch** yet, but it's finally time! So far we have said that every update of the model's parameters is one iteration. An epoch is a slightly different concept: it is one complete pass through the entire training dataset.

Let's use an example to explain it. Let's say we have 1,000 training samples and the mini-batch size is 100 samples. For each mini-batch (100 samples), the model will make an update after computing the gradient and loss for that mini-batch. In each epoch, the model will go through all 1,000 training samples, which means it will process 10 mini-batches (since $1000/100 = 10$ mini-batches).

To sum up: 1 iteration $=$ 1 mini-batch update. 1 epoch $=$ 1 full pass through the dataset (10 iterations in the previous example).
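
The relationship between epochs and iterations can be sketched in a couple of lines. The function name `iterations_per_epoch` is just an illustrative choice, and the second call uses a made-up dataset size to show the rounding when the batch size does not divide the number of samples evenly.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batch updates needed to see every sample once."""
    # Round up: the last mini-batch may be smaller than batch_size
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(1000, 100))  # 10, as in the example above
print(iterations_per_epoch(1050, 100))  # 11, the last batch has only 50 samples

# Total number of parameter updates over a full training run
epochs = 50
print(epochs * iterations_per_epoch(1000, 100))  # 500
```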

In Keras, we usually set the number of epochs rather than the number of iterations. In our case, we have 67,573 training rows (70% of 96,534), and we'll set the mini-batch size to 16. So we have $67573/16 \approx 4223.3$, which rounds up to 4224 iterations per epoch (we round up so that all the data is processed to complete the epoch).

In [None]:
from tensorflow.keras.optimizers import SGD
from sklearn.utils import class_weight

sgd_optimizer = SGD(learning_rate=0.01)

# Compute class weights
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weight_dict = dict(enumerate(class_weights))

# Define the neural network model; feel free to change the number of neurons
model = keras.Sequential([
    keras.layers.Dense(30, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer=sgd_optimizer, loss='binary_crossentropy', metrics=['recall'])

# Train the model with mini-batches of 16 samples
history = model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=1,
                    class_weight=class_weight_dict)
# verbose=1 shows what happens in every epoch.
# In the output, each epoch runs 4224 iterations, as expected.

# Evaluate the model on the validation data
y_pred = (model.predict(X_val) > 0.5).astype("int32") # The threshold to determine if the target is 0 or 1 is 0.5 (remember: probabilities).
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation Precision: {precision:.4f}")
print(f"Validation Recall: {recall:.4f}")
print(f"Validation F1 Score: {f1:.4f}")
print(f"Validation Confusion Matrix: {cm}")

Epoch 1/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 719726430921196887728979968.0000 - recall: 0.3981
Epoch 2/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.6955 - recall: 0.4924
Epoch 3/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.6893 - recall: 0.2770
Epoch 4/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.6922 - recall: 0.2518
Epoch 5/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.6987 - recall: 0.5765
Epoch 6/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 11s 2ms/step - loss: 0.6864 - recall: 0.1926
Epoch 7/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.6961 - recall: 0.7757
Epoch 8/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.6925 - recall: 0.2654
Epoch 9/

What are we seeing here? Training was pretty quick, but the results are terrible! The loss barely moves from about 0.69, the recall jumps around from epoch to epoch, and in the end the model simply predicts that nobody books.
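
The loss values around 0.69 in the log above are a useful warning sign: that is what binary cross-entropy gives when the model outputs 0.5 for every sample, since $-\ln(0.5) \approx 0.6931$. The sketch below computes the loss by hand on a few made-up labels to show this. The function name and the toy data are assumptions for illustration, not part of the notebook's pipeline.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [0, 0, 0, 1]  # made-up labels

# A model that always answers "50/50" has learned nothing
coin_flip = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
print(round(coin_flip, 4))  # 0.6931

# A model that is confident and right gets a much lower loss
confident = binary_cross_entropy(labels, [0.1, 0.1, 0.1, 0.9])
print(round(confident, 4))  # 0.1054
```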

We'll see that the problem is that the data is not normalized (or standardized). Let's do it, see what happens, and then explain why not doing it is a problem.

In [None]:
# Standardize the data (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
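
Why does this matter? Columns such as `user_id` or `user_location_city` hold large integers while `is_mobile` is 0 or 1, so without scaling the large columns dominate the weighted sums inside the network, and the gradients can blow up (which is a likely explanation for the huge loss in the first epoch above). The sketch below shows, under simplified assumptions, what standardization does to a single column: subtract the training mean and divide by the training standard deviation. The helper names and the numbers are hypothetical.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and (population) standard deviation from training data."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)  # constant columns are left unscaled


def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical values of one large-valued column
train_col = [1200.0, 56000.0, 31000.0, 8800.0]
val_col = [42000.0]

mean, std = fit_scaler(train_col)              # statistics come from training data only
train_scaled = transform(train_col, mean, std)
val_scaled = transform(val_col, mean, std)     # validation reuses the training statistics

print([round(z, 2) for z in train_scaled])
print([round(z, 2) for z in val_scaled])
```

Reusing the training statistics for the validation set mirrors the `fit_transform` / `transform` split in the cell above and avoids leaking information from future data into training.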

In [None]:
sgd_optimizer = SGD(learning_rate=0.01)

# Compute class weights
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weight_dict = dict(enumerate(class_weights))

# Define the neural network model
model = keras.Sequential([
    keras.layers.Dense(30, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer=sgd_optimizer, loss='binary_crossentropy', metrics=['recall'])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=16, verbose=1,
                    class_weight = class_weight_dict)

Epoch 1/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.6717 - recall: 0.6369
Epoch 2/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.6087 - recall: 0.7359
Epoch 3/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 11s 2ms/step - loss: 0.5763 - recall: 0.7657
Epoch 4/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 11s 2ms/step - loss: 0.5774 - recall: 0.8034
Epoch 5/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.5692 - recall: 0.8130
Epoch 6/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.5638 - recall: 0.8228
Epoch 7/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - loss: 0.5627 - recall: 0.8205
Epoch 8/50
4224/4224 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - loss: 0.5640 - recall: 0.8148
Epoch 9/50
4224/4224 ━━━━━━━━━━━━

In [None]:
# Evaluate the model on the validation data
y_pred = (model.predict(X_val_scaled) > 0.5).astype("int32")
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation Precision: {precision:.4f}")
print(f"Validation Recall: {recall:.4f}")
print(f"Validation F1 Score: {f1:.4f}")
print(f"Validation Confusion Matrix: {cm}")

[1m453/453[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step
Validation Accuracy: 0.6901
Validation Precision: 0.1064
Validation Recall: 0.5080
Validation F1 Score: 0.1759
Validation Confusion Matrix: [[9514 4023]
 [ 464  479]]


## Insight

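During training, the problem we want to avoid is overfitting. One way to fight it is to set some of the values inside the neural network to zero during training, so that the network cannot rely too much on any single neuron.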

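Now the results are much better than before: the model actually learns, and it finds about half of the bookings in the validation set. The only problem is that training is pretty slow (~10 minutes). Remember what we saw in class about choosing the right learning rate $\eta$? This is where Adam comes into play: it adapts the step size during training, which usually makes convergence faster.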
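
To make the idea behind Adam concrete, here is a minimal sketch of its update rule on a one-dimensional toy problem, minimizing $f(x) = (x-3)^2$. The function name `adam_minimize`, the toy objective, and the step count are illustrative assumptions; the decay rates 0.9 and 0.999 are commonly used values, not something tuned for this dataset.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a 1-D function with the Adam update rule, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step size adapts to gradient scale
    return x


# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

The division by `sqrt(v_hat)` is what makes the learning rate adaptive: directions with consistently large gradients take smaller steps, and directions with small gradients take larger ones.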

Let's make the neural network more complex and add regularization, because we can see some overfitting: the recall on the validation set (about 0.51) is well below the recall on the training set (above 0.80). Regularization will force the network to learn a simpler function.
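
The regularization used in the next cell is dropout. As a rough sketch of the mechanism: during training, each activation is set to zero with probability `rate`, and the surviving ones are scaled up by `1 / (1 - rate)` so that the expected value stays the same; at prediction time nothing is dropped. The function below is a simplified, hypothetical stand-in for what a dropout layer does, not the Keras implementation.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero out a fraction of activations and rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # no-op at prediction time
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)                 # fixed seed so the example is repeatable
hidden = [0.5, 1.2, 0.0, 2.0, 0.7]     # made-up activations of a hidden layer

dropped = dropout(hidden, rate=0.2, rng=rng)
inference = dropout(hidden, rate=0.2, training=False)

print(dropped)    # each entry is either 0.0 or the original value divided by 0.8
print(inference)  # unchanged
```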

In [None]:
from tensorflow.keras.optimizers import Adam

# Define the neural network model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

# Compile the model
adam_optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=adam_optimizer, loss='binary_crossentropy', metrics=['Recall'])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=64, verbose=1,
                    class_weight=class_weight_dict)

Epoch 1/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - Recall: 0.6240 - loss: 0.6488
Epoch 2/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - Recall: 0.7870 - loss: 0.5899
Epoch 3/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - Recall: 0.7988 - loss: 0.5737
Epoch 4/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - Recall: 0.8002 - loss: 0.5761
Epoch 5/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - Recall: 0.8001 - loss: 0.5687
Epoch 6/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - Recall: 0.8116 - loss: 0.5666
Epoch 7/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - Recall: 0.8105 - loss: 0.5817
Epoch 8/50
1056/1056 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - Recall: 0.8155 - loss: 0.5743
Epoch 9/50
1056/1056 ━━━━━━━━━━━━━━━━━

In [None]:
# Predict probabilities
y_proba = model.predict(X_val_scaled) # This is already a probability!

# Make predictions
y_pred = (y_proba > 0.5).astype("int32")

# Evaluate metrics
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, zero_division=0)
recall = recall_score(y_val, y_pred, zero_division=0)
f1 = f1_score(y_val, y_pred, zero_division=0)
cm = confusion_matrix(y_val, y_pred)

# Print results
print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation Precision: {precision:.4f}")
print(f"Validation Recall: {recall:.4f}")
print(f"Validation F1 Score: {f1:.4f}")
print(f"Validation Confusion Matrix:\n{cm}")

[1m453/453[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step
Validation Accuracy: 0.5892
Validation Precision: 0.1068
Validation Recall: 0.7211
Validation F1 Score: 0.1861
Validation Confusion Matrix:
[[7851 5686]
 [ 263  680]]


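In general, there is a trade-off between recall and precision: pushing recall higher tends to lower precision. Here the recall rose from 0.51 to 0.72 while the precision stayed at about 0.11, but the price was many more false positives (5,686 instead of 4,023) and a lower accuracy (0.59 instead of 0.69).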
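
All four metrics can be recomputed by hand from the two confusion matrices printed above, which makes the comparison between the SGD and Adam models easy to check. The helper name is an illustrative choice; the layout `[[TN, FP], [FN, TP]]` is the one `confusion_matrix` from scikit-learn uses for binary labels 0 and 1.

```python
def metrics_from_confusion(cm):
    """Accuracy, precision, recall and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Validation confusion matrices copied from the outputs above
sgd_cm = [[9514, 4023], [464, 479]]
adam_cm = [[7851, 5686], [263, 680]]

for name, cm in [("SGD", sgd_cm), ("Adam + dropout", adam_cm)]:
    acc, prec, rec, f1 = metrics_from_confusion(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} "
          f"recall={rec:.4f} f1={f1:.4f}")
```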

In a neural network with a sigmoid output layer, there is no separate step to predict probabilities: the output of `model.predict` is already in the range (0, 1).
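
A small sketch of why that is the case: the sigmoid squashes any real number into a value between 0 and 1, and thresholding that value at 0.5 is the same as checking the sign of the raw output. The function names here are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    """Turn the raw output of the last layer into a 0/1 prediction."""
    return int(sigmoid(z) > threshold)


for z in [-5.0, -2.0, 0.0, 2.0, 5.0]:
    print(z, round(sigmoid(z), 4), predict_label(z))
```

Lowering the threshold below 0.5 would label more searches as bookings, which raises recall at the cost of precision.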