**Part 0: Data Preparation**

In [1]:
from sklearn.datasets import fetch_openml
import pandas as pd

In [2]:
adult=fetch_openml(name="adult",version=2,as_frame=True)
data=adult.frame
data.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
x=data.drop(columns="class")
y=data['class']  #dividing my data into features and target

In [4]:
from sklearn.model_selection import train_test_split

# Target split: 70% train, 15% validation, 15% test
# Step 1: hold out 15% of all rows as the test set, leaving 85% for train + validation
x_train_valid,x_test,y_train_valid,y_test=train_test_split(x,y,test_size=0.15,shuffle=True)
# Step 2: 15/85 ≈ 0.1765 of the remainder equals 15% of the full data set
x_train,x_valid,y_train,y_valid=train_test_split(x_train_valid,y_train_valid,test_size=0.1765,shuffle=True)
print(len(y_train))
print(len(y_test))
print(len(y_valid))

34187
7327
7328
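The 0.1765 used in the second split and the three printed lengths can be checked with a little arithmetic. The sketch below is plain Python and not part of the original notebook; the helper name `split_sizes` is hypothetical, and it assumes the held-out row count is rounded up at each step, which is consistent with the counts printed above.

```python
import math

def split_sizes(n_rows, test_frac, valid_frac_of_rest):
    """Return (train, valid, test) row counts for a two-step split.

    Assumption: the held-out count is rounded up at each step.
    """
    n_test = math.ceil(n_rows * test_frac)
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_frac_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

# Share of the remaining 85% that corresponds to 15% of the full data
valid_frac = round(0.15 / 0.85, 4)
sizes = split_sizes(48842, 0.15, valid_frac)
print(valid_frac, sizes)   # 0.1765 (34187, 7328, 7327)
```

The result is a split of roughly 70% / 15% / 15% of the 48,842 rows.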


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
 14  class           48842 non-null  category
dtypes: category(9), int64(6)
memory usage: 2.7 MB


In [6]:
catagorical_feature=x_train.select_dtypes(include=["category"]).columns
numerical_feature=x_train.select_dtypes(exclude=["category"]).columns

print(catagorical_feature)  # categorical columns: one-hot encoded in the next cells
print(numerical_feature)

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country'],
      dtype='object')
Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')


In [7]:
from sklearn.preprocessing import OneHotEncoder , StandardScaler ,LabelEncoder
categorical_transformer = OneHotEncoder(handle_unknown="ignore")  # a category not seen during fit is encoded as all zeros instead of raising an error
numerical_transformer=StandardScaler()


In [8]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_feature),  # standardize every numerical feature
        ("cat", categorical_transformer, catagorical_feature)  # one-hot encode every categorical feature
    ]
)

X_train_proc = preprocessor.fit_transform(x_train)  # fit on the training set only, so nothing is learned from validation/test rows
X_valid_proc = preprocessor.transform(x_valid)  # reuse the fitted scaler and encoder on validation and test
X_test_proc = preprocessor.transform(x_test)


encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_valid_enc = encoder.transform(y_valid)
y_test_enc = encoder.transform(y_test)
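To make the two preprocessing ideas concrete, here is a minimal pure-Python sketch, not the scikit-learn implementation. It shows (a) statistics being learned from training values only and then reused, and (b) an unseen category becoming an all-zero row. All function names and the toy values are hypothetical.

```python
import math

def fit_standardizer(values):
    """Learn mean and population standard deviation from training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

def fit_one_hot(values):
    """Learn the sorted list of categories seen in training."""
    return sorted(set(values))

def one_hot(value, categories):
    # An unseen category matches nothing, so its row is all zeros
    return [1 if value == c else 0 for c in categories]

train_hours = [40, 50, 40, 30]            # toy values, not the real column
mean, std = fit_standardizer(train_hours)
valid_scaled = standardize([40, 60], mean, std)

cats = fit_one_hot(["Private", "Local-gov", "Private"])
print(cats)                                # ['Local-gov', 'Private']
print(one_hot("Private", cats))            # [0, 1]
print(one_hot("Never-worked", cats))       # [0, 0]
```

The validation values are scaled with the training mean and standard deviation, mirroring the `fit_transform` / `transform` split in the cell above.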

In [9]:
# Logistic regression: classical machine-learning baseline
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train_proc ,y_train_enc)


In [10]:
from sklearn.metrics import accuracy_score
train_acc_log = accuracy_score(y_train_enc, log_reg.predict(X_train_proc))
valid_acc_log = accuracy_score(y_valid_enc, log_reg.predict(X_valid_proc))
test_acc_log = accuracy_score(y_test_enc, log_reg.predict(X_test_proc))
print(f"accuracy of train: {train_acc_log*100:0.2f}%")
print(f"accuracy of validation: {valid_acc_log*100:0.2f}%")
print(f"accuracy of test: {test_acc_log*100:0.2f}%")


accuracy of train: 85.29%
accuracy of validation: 85.22%
accuracy of test: 85.31%


In [11]:
!pip install tensorflow



In [12]:
# Deep Neural Network==>deep learning
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

In [21]:
# Optimization using stochastic gradient descent (SGD); batch_size=1 in fit() updates the weights after every sample
from tensorflow.keras.optimizers import SGD
sgd_model=Sequential()
sgd_model.add(Dense(64,activation="relu",input_shape=(X_train_proc.shape[1],)))
sgd_model.add(Dense(32,activation="relu"))
sgd_model.add(Dense(1,activation="sigmoid"))

optimizer = SGD(learning_rate=0.01)
sgd_model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
sgd_model_fitting=sgd_model.fit(X_train_proc,y_train_enc,epochs=10,validation_data=(X_valid_proc,y_valid_enc),batch_size=1)
print(f"Training Stochastic Gradient Descent (SGD) completed after {len(sgd_model_fitting.epoch)} epochs.")

Epoch 1/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 81s 2ms/step - accuracy: 0.8392 - loss: 0.3406 - val_accuracy: 0.8521 - val_loss: 0.3202
Epoch 2/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 82s 2ms/step - accuracy: 0.8521 - loss: 0.3154 - val_accuracy: 0.8555 - val_loss: 0.3160
Epoch 3/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 80s 2ms/step - accuracy: 0.8591 - loss: 0.3032 - val_accuracy: 0.8536 - val_loss: 0.3150
Epoch 4/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 83s 2ms/step - accuracy: 0.8628 - loss: 0.2966 - val_accuracy: 0.8511 - val_loss: 0.3174
Epoch 5/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 70s 2ms/step - accuracy: 0.8564 - loss: 0.3071 - val_accuracy: 0.8543 - val_loss: 0.3131
Epoch 6/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 71s 2ms/step - accuracy: 0.8606 - loss: 0.3021 - val_accuracy: 0.8480 - val_loss: 0.3244
...

In [14]:
#optimization using SGD with Momentum
moment_model=Sequential()
moment_model.add(Dense(64,activation="relu",input_shape=(X_train_proc.shape[1],)))
moment_model.add(Dense(32,activation="relu"))
moment_model.add(Dense(1,activation="sigmoid"))

optimizer = SGD(learning_rate=0.01,momentum=0.9)  # momentum=0.9: each update depends heavily on past gradients
moment_model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
moment_model_fitting=moment_model.fit(X_train_proc,y_train_enc,epochs=10,validation_data=(X_valid_proc,y_valid_enc))
print(f"Training SGD with Momentum completed after {len(moment_model_fitting.epoch)} epochs.")

Epoch 1/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8237 - loss: 0.3744 - val_accuracy: 0.8532 - val_loss: 0.3193
Epoch 2/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8545 - loss: 0.3089 - val_accuracy: 0.8511 - val_loss: 0.3218
Epoch 3/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8547 - loss: 0.3062 - val_accuracy: 0.8559 - val_loss: 0.3140
Epoch 4/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8590 - loss: 0.3006 - val_accuracy: 0.8569 - val_loss: 0.3142
Epoch 5/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8584 - loss: 0.3037 - val_accuracy: 0.8548 - val_loss: 0.3133
Epoch 6/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.8555 - loss: 0.3041 - val_accuracy: 0.8528 - val_loss: 0.3152
Epoch 7/10
...

In [15]:
#optimization using Adam
from tensorflow.keras.optimizers import Adam
adam_model=Sequential()
adam_model.add(Dense(64,activation="relu",input_shape=(X_train_proc.shape[1],)))
adam_model.add(Dense(32,activation="relu"))
adam_model.add(Dense(1,activation="sigmoid"))

optimizer = Adam(learning_rate=0.01)
adam_model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
adam_model_fitting=adam_model.fit(X_train_proc,y_train_enc,epochs=10,validation_data=(X_valid_proc,y_valid_enc))
print(f"Training using Adam completed after {len(adam_model_fitting.epoch)} epochs.")

Epoch 1/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8422 - loss: 0.3361 - val_accuracy: 0.8519 - val_loss: 0.3157
Epoch 2/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8565 - loss: 0.3070 - val_accuracy: 0.8571 - val_loss: 0.3142
Epoch 3/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8594 - loss: 0.3021 - val_accuracy: 0.8555 - val_loss: 0.3160
Epoch 4/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8626 - loss: 0.2980 - val_accuracy: 0.8552 - val_loss: 0.3107
Epoch 5/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8625 - loss: 0.2926 - val_accuracy: 0.8562 - val_loss: 0.3207
Epoch 6/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8647 - loss: 0.2909 - val_accuracy: 0.8548 - val_loss: 0.3225
Epoch 7/10
...

## Compare the training and validation accuracy for each optimizer

SGD

- Training accuracy rose from about 84% in the first epoch to about 86%.

- Validation accuracy stayed at around 85% throughout.

- Learning happens, but it is slow: this run used batch_size=1, so each epoch took roughly 70–80 s.

SGD + Momentum

- Training accuracy started at about 82% and reached about 86%.

- Validation accuracy was stable at about 85–86%.

- Faster than plain SGD (3–6 s per epoch with the default batch size of 32).

Adam

- Training accuracy started high (about 84%) and reached about 87%.

- Validation accuracy reached about 86% (marginally the highest of the three).

- Very fast learning from the first epochs.

## Which converges faster? Which generalizes better?

Fastest convergence => Adam (the highest training accuracy of the three after the same number of epochs).

Best generalization (validation/test accuracy) => Adam, since its validation accuracy is slightly higher (about 86%), although the three optimizers differ by well under one percentage point.

## Why is Adam often better than SGD?

SGD: uses the same learning rate for all weights, so every weight's gradient is scaled by the same factor. This can make it slow along flat directions and make it oscillate around the solution along steep ones.

Adam: combines two ideas:

- Momentum (keeps a running average of past gradients to speed up movement in a consistent direction).

- RMSprop-style adaptive learning rates (each weight moves with its own step size, scaled by a running average of its squared gradients).

The result is faster training and often better validation accuracy.
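The difference between the two update rules can be seen in a single step. The following is a minimal sketch of the standard Adam update next to plain SGD, written in pure Python for illustration; the function names and the toy loss are assumptions, not code from this notebook or from Keras.

```python
import math

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: every weight uses the same learning rate."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of grad and grad**2."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Toy loss f(w) = w0**2 + 100 * w1**2: the two gradients differ by 100x
w = [1.0, 1.0]
grad = [2 * w[0], 200 * w[1]]

w_sgd = sgd_step(w, grad)
w_adam, m, v = adam_step(w, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(w_sgd)    # approximately [0.98, -1.0]
print(w_adam)   # approximately [0.99, 0.99]
```

With SGD the steep direction overshoots from 1.0 to about -1.0, while the flat direction barely moves. Adam normalizes each gradient by its own magnitude, so both weights take a step of about the learning rate.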

In [16]:
# Train with different batch sizes using the Adam optimizer
# Adam with batch size = 1
adam_model_bs1 = Sequential()
adam_model_bs1.add(Dense(64, activation="relu", input_shape=(X_train_proc.shape[1],)))
adam_model_bs1.add(Dense(32, activation="relu"))
adam_model_bs1.add(Dense(1, activation="sigmoid"))

adam_model_bs1.compile(optimizer=Adam(learning_rate=0.01),
                       loss="binary_crossentropy",
                       metrics=["accuracy"])

adam_model_bs1_fitting = adam_model_bs1.fit(X_train_proc, y_train_enc,epochs=10,batch_size=1,validation_data=(X_valid_proc, y_valid_enc))
print("Training with Adam and batch_size=1 completed.")

Epoch 1/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 82s 2ms/step - accuracy: 0.8343 - loss: 0.3621 - val_accuracy: 0.8186 - val_loss: 0.4111
Epoch 2/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 80s 2ms/step - accuracy: 0.8413 - loss: 0.3408 - val_accuracy: 0.8519 - val_loss: 0.3338
Epoch 3/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 91s 3ms/step - accuracy: 0.8498 - loss: 0.3315 - val_accuracy: 0.8442 - val_loss: 0.4338
Epoch 4/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 133s 2ms/step - accuracy: 0.8490 - loss: 0.3417 - val_accuracy: 0.8530 - val_loss: 0.3652
Epoch 5/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 88s 3ms/step - accuracy: 0.8543 - loss: 0.3288 - val_accuracy: 0.8470 - val_loss: 0.3472
Epoch 6/10
34187/34187 ━━━━━━━━━━━━━━━━━━━━ 88s 3ms/step - accuracy: 0.8511 - loss: 0.3384 - val_accuracy: 0.8463 - val_loss: 0.35...

In [17]:
# Adam with batch size = 32
adam_model_bs32 = Sequential()
adam_model_bs32.add(Dense(64, activation="relu", input_shape=(X_train_proc.shape[1],)))
adam_model_bs32.add(Dense(32, activation="relu"))
adam_model_bs32.add(Dense(1, activation="sigmoid"))

adam_model_bs32.compile(optimizer=Adam(learning_rate=0.01),
                        loss="binary_crossentropy",
                        metrics=["accuracy"])

adam_model_bs32_fitting = adam_model_bs32.fit(
    X_train_proc, y_train_enc,
    epochs=10,
    batch_size=32,
    validation_data=(X_valid_proc, y_valid_enc)
)
print("Training with Adam and batch_size=32 completed.")

Epoch 1/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8363 - loss: 0.3436 - val_accuracy: 0.8530 - val_loss: 0.3197
Epoch 2/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8571 - loss: 0.3104 - val_accuracy: 0.8558 - val_loss: 0.3145
Epoch 3/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8601 - loss: 0.3051 - val_accuracy: 0.8562 - val_loss: 0.3146
Epoch 4/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.8609 - loss: 0.3001 - val_accuracy: 0.8549 - val_loss: 0.3152
Epoch 5/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8632 - loss: 0.2963 - val_accuracy: 0.8552 - val_loss: 0.3149
Epoch 6/10
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8628 - loss: 0.2946 - val_accuracy: 0.8511 - val_loss: 0.3177
Epoch 7/10
...

In [19]:
# Adam with batch size = 128
adam_model_bs128 = Sequential()
adam_model_bs128.add(Dense(64, activation="relu", input_shape=(X_train_proc.shape[1],)))
adam_model_bs128.add(Dense(32, activation="relu"))
adam_model_bs128.add(Dense(1, activation="sigmoid"))

adam_model_bs128.compile(optimizer=Adam(learning_rate=0.01),
                        loss="binary_crossentropy",
                        metrics=["accuracy"])

adam_model_bs128_fitting = adam_model_bs128.fit(
    X_train_proc, y_train_enc,
    epochs=10,
    batch_size=128,
    validation_data=(X_valid_proc, y_valid_enc)
)
print("Training with Adam and batch_size=128 completed.")

Epoch 1/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8308 - loss: 0.3540 - val_accuracy: 0.8562 - val_loss: 0.3161
Epoch 2/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8576 - loss: 0.3047 - val_accuracy: 0.8549 - val_loss: 0.3132
Epoch 3/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8593 - loss: 0.3044 - val_accuracy: 0.8537 - val_loss: 0.3147
Epoch 4/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8608 - loss: 0.2991 - val_accuracy: 0.8564 - val_loss: 0.3158
Epoch 5/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8643 - loss: 0.2928 - val_accuracy: 0.8562 - val_loss: 0.3114
Epoch 6/10
268/268 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.8646 - loss: 0.2882 - val_accuracy: 0.8538 - val_loss: 0.3177
Epoch 7/10
268/268 ...

In [20]:
# Adam with batch size = 1024
adam_model_bs1024 = Sequential()
adam_model_bs1024.add(Dense(64, activation="relu", input_shape=(X_train_proc.shape[1],)))
adam_model_bs1024.add(Dense(32, activation="relu"))
adam_model_bs1024.add(Dense(1, activation="sigmoid"))

adam_model_bs1024.compile(optimizer=Adam(learning_rate=0.01),
                        loss="binary_crossentropy",
                        metrics=["accuracy"])

adam_model_bs1024_fitting = adam_model_bs1024.fit(
    X_train_proc, y_train_enc,
    epochs=10,
    batch_size=1024,
    validation_data=(X_valid_proc, y_valid_enc)
)
print("Training with Adam and batch_size=1024 completed.")

Epoch 1/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 2s 17ms/step - accuracy: 0.7693 - loss: 0.4342 - val_accuracy: 0.8534 - val_loss: 0.3218
Epoch 2/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8552 - loss: 0.3104 - val_accuracy: 0.8536 - val_loss: 0.3152
Epoch 3/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8582 - loss: 0.3052 - val_accuracy: 0.8552 - val_loss: 0.3140
Epoch 4/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.8609 - loss: 0.2984 - val_accuracy: 0.8506 - val_loss: 0.3183
Epoch 5/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.8606 - loss: 0.2979 - val_accuracy: 0.8532 - val_loss: 0.3164
Epoch 6/10
34/34 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.8644 - loss: 0.2926 - val_accuracy: 0.8553 - val_loss: 0.3146
Epoch 7/10
34/34 ...

*Training speed:*

- Batch size = 1 => Very slow (about 80–130 s per epoch), because the model updates its weights after every single sample. That means 34,187 updates per epoch and a much longer training time.

- Batch size = 32 & 128 => Much faster (about 1–4 s per epoch), since the model processes many samples per update (mini-batch training), and training accuracy still reaches about 86–87%.

- Batch size = 1024 => Even faster per epoch (only 34 updates), but with so few updates the model starts more slowly: training accuracy in the first epoch is only about 77%.

*Validation accuracy:*

- Batch size = 1: Validation accuracy fluctuates the most (between about 81.9% and 85.3% in the logged epochs), which is unstable. Sometimes good, sometimes it drops.

- Batch size = 32: Stable and high validation accuracy (about 85.1–85.6%). Often gives the best balance.

- Batch size = 128: Still good (about 85.4–85.6%) and at least as stable as 32 in the logged epochs.

- Batch size = 1024: Validation accuracy stayed in a similar range in this run (about 85.1–85.5%). In general it can drop with very large batches, because there are too few weight updates per epoch.


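*Test accuracy:*

Test accuracy was not computed for these four runs, so the points below describe the usual expectation rather than a measured result.

Small batch sizes (1, 32) → Usually generalize better, meaning they perform well on unseen test data.

Large batch sizes (1024) → Sometimes test accuracy is worse, because low-noise updates can settle on solutions that fit the training data closely but don’t generalize as well.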

*Generalization ability:*

- Best generalization is usually with batch size = 32 or 128.

- Batch size = 1 generalizes okay but is very noisy and inefficient.

- Batch size = 1024 can generalize worse, because its updates are very smooth and there are few of them per epoch. In this run its validation accuracy was still close to that of 32 and 128.

# Which batch size leads to the noisiest gradient updates?

Batch size = 1 => Because every update depends on only one sample, the gradient estimate jumps around a lot, i.e. it is very noisy.

# Which batch size generalizes better and why?

Batch size = 32 & 128 => Because:

- They balance noise and stability (mini-batch gradient descent): enough averaging for stable updates, enough noise to avoid fitting the training set too closely.
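Both effects of batch size, the number of updates per epoch and the noise in each update, can be illustrated without training a network. The sketch below is pure Python and makes a simplifying assumption: per-sample gradients are simulated as draws from a standard normal distribution. The helper names are hypothetical.

```python
import math
import random
import statistics

N_TRAIN = 34187  # training rows in this notebook

def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def batch_mean_spread(batch_size, n_batches=200, seed=0):
    """Std. dev. of the mean of `batch_size` simulated per-sample gradients.

    Assumption: each per-sample gradient is a draw from N(0, 1).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)

for bs in (1, 32, 128, 1024):
    print(bs, steps_per_epoch(N_TRAIN, bs), round(batch_mean_spread(bs), 3))
```

The step counts match the progress bars in the training logs (34187, 1069, 268 and 34), and the spread of the averaged gradient shrinks roughly with the square root of the batch size.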


In [22]:
#Overfitting and Regularization
large_model = Sequential()
large_model.add(Dense(256, activation="relu", input_shape=(X_train_proc.shape[1],)))
large_model.add(Dense(128, activation="relu"))
large_model.add(Dense(64, activation="relu"))
large_model.add(Dense(1, activation="sigmoid"))

large_model.compile(optimizer=Adam(learning_rate=0.001),loss="binary_crossentropy",metrics=["accuracy"])

history_large = large_model.fit(X_train_proc, y_train_enc,epochs=20,batch_size=32,validation_data=(X_valid_proc, y_valid_enc)
)

Epoch 1/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.8376 - loss: 0.3434 - val_accuracy: 0.8495 - val_loss: 0.3186
Epoch 2/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8588 - loss: 0.3065 - val_accuracy: 0.8570 - val_loss: 0.3139
Epoch 3/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8640 - loss: 0.2954 - val_accuracy: 0.8551 - val_loss: 0.3104
Epoch 4/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8667 - loss: 0.2914 - val_accuracy: 0.8552 - val_loss: 0.3145
Epoch 5/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8655 - loss: 0.2853 - val_accuracy: 0.8578 - val_loss: 0.3122
Epoch 6/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.8726 - loss: 0.2764 - val_accuracy: 0.8543 - val_loss: 0.3203
Epoch 7/20
...

# Do you see signs of overfitting?

Yes. Training accuracy keeps increasing steadily (83% → 91%).

Validation accuracy improves at first (to about 85%), but then stays flat or even drops slightly (to about 84%).

This is a sign of overfitting: the model continues to fit the training data better and better, but fails to improve on unseen validation data.
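The widening gap can be read straight from the training history. Here is a small sketch using the six epochs visible in the log above (later epochs are cut off in the output). The dictionary mirrors the layout of a Keras `history.history` object, and the helper names are hypothetical.

```python
# First six logged epochs of the large, unregularized model
history = {
    "accuracy":     [0.8376, 0.8588, 0.8640, 0.8667, 0.8655, 0.8726],
    "val_accuracy": [0.8495, 0.8570, 0.8551, 0.8552, 0.8578, 0.8543],
    "val_loss":     [0.3186, 0.3139, 0.3104, 0.3145, 0.3122, 0.3203],
}

def generalization_gap(hist):
    """Training accuracy minus validation accuracy, per epoch."""
    return [round(a - v, 4)
            for a, v in zip(hist["accuracy"], hist["val_accuracy"])]

def best_epoch(hist):
    """1-based epoch with the lowest validation loss."""
    losses = hist["val_loss"]
    return losses.index(min(losses)) + 1

gaps = generalization_gap(history)
print(gaps)
print(best_epoch(history))   # 3
```

The gap starts negative (validation ahead of training) and grows to almost two percentage points by epoch 6, while the validation loss is already past its minimum at epoch 3.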

In [23]:
#L2 regularization
from tensorflow.keras import regularizers

l2_model = Sequential()
l2_model.add(Dense(256, activation="relu", kernel_regularizer=regularizers.l2(0.01), input_shape=(X_train_proc.shape[1],)))
l2_model.add(Dense(128, activation="relu", kernel_regularizer=regularizers.l2(0.01)))
l2_model.add(Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01)))
l2_model.add(Dense(1, activation="sigmoid"))

l2_model.compile(optimizer=Adam(learning_rate=0.001),
                loss="binary_crossentropy",
                metrics=["accuracy"])

history_l2 = l2_model.fit(
    X_train_proc, y_train_enc,
    epochs=20,
    batch_size=32,
    validation_data=(X_valid_proc, y_valid_enc)
)


Epoch 1/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.8337 - loss: 1.0971 - val_accuracy: 0.8424 - val_loss: 0.3843
Epoch 2/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 9s 4ms/step - accuracy: 0.8509 - loss: 0.3679 - val_accuracy: 0.8487 - val_loss: 0.3644
Epoch 3/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 6s 4ms/step - accuracy: 0.8505 - loss: 0.3578 - val_accuracy: 0.8537 - val_loss: 0.3548
Epoch 4/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8545 - loss: 0.3500 - val_accuracy: 0.8523 - val_loss: 0.3546
Epoch 5/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.8520 - loss: 0.3491 - val_accuracy: 0.8484 - val_loss: 0.3583
Epoch 6/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8533 - loss: 0.3459 - val_accuracy: 0.8517 - val_loss: 0.3549
Epoch 7/20
...

In [24]:
#Dropout
from tensorflow.keras.layers import Dropout

dropout_model = Sequential()
dropout_model.add(Dense(256, activation="relu", input_shape=(X_train_proc.shape[1],)))
dropout_model.add(Dropout(0.5))   # randomly zero 50% of the previous layer's outputs during training
dropout_model.add(Dense(128, activation="relu"))
dropout_model.add(Dropout(0.5))
dropout_model.add(Dense(64, activation="relu"))
dropout_model.add(Dense(1, activation="sigmoid"))

dropout_model.compile(optimizer=Adam(learning_rate=0.001),
                     loss="binary_crossentropy",
                     metrics=["accuracy"])

history_dropout = dropout_model.fit( X_train_proc, y_train_enc, epochs=20,batch_size=32,validation_data=(X_valid_proc, y_valid_enc)
)


Epoch 1/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8300 - loss: 0.3681 - val_accuracy: 0.8581 - val_loss: 0.3144
Epoch 2/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8533 - loss: 0.3173 - val_accuracy: 0.8538 - val_loss: 0.3174
Epoch 3/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8564 - loss: 0.3087 - val_accuracy: 0.8518 - val_loss: 0.3185
Epoch 4/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8577 - loss: 0.3028 - val_accuracy: 0.8579 - val_loss: 0.3137
Epoch 5/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.8581 - loss: 0.3064 - val_accuracy: 0.8558 - val_loss: 0.3120
Epoch 6/20
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8568 - loss: 0.3074 - val_accuracy: 0.8569 - val_loss: 0.3134
Epoch 7/20
...

Before regularization:

- Training accuracy => kept rising (83% → 91%).

- Validation accuracy => peaked at about 85%, then dropped (to about 84%).

- Validation loss => increased a lot.


With L2 regularization:

- Training accuracy: stayed at about 85%.

- Validation accuracy: stable at about 85%, doesn’t drop.

- Validation loss: flatter, at about 0.34–0.36, with no big increase. The reported loss includes the L2 penalty, so it is not directly comparable with the loss of the other models.

- => L2 slowed down learning a bit, but kept train and validation much closer. This means less overfitting.


With Dropout:

- Training accuracy: lower than before, because dropout forces the network to train with randomly missing neurons.

- Validation accuracy: usually more stable across epochs.

- => Dropout makes the model less likely to memorize patterns, which helps it generalize better.

In my opinion, both reduced overfitting compared to the unregularized model.
In this run L2 regularization looked more effective, because it kept validation accuracy stable and the validation loss flat.
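What the two regularizers do can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the Keras implementation. The L2 term is taken to be the coefficient times the sum of squared weights, and dropout is written in its common "inverted" form, where surviving activations are rescaled during training. The function names are hypothetical.

```python
import random

def l2_penalty(weights, lam=0.01):
    """Extra loss term added by L2: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout for training: zero each unit with probability
    `rate` and rescale the survivors by 1 / (1 - rate)."""
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0
            for a in activations]

penalty = l2_penalty([1.0, -2.0, 0.5])   # about 0.0525 (0.01 * 5.25)
print(penalty)

acts = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(acts, rate=0.5, rng=random.Random(0))
print(dropped)   # each entry is either 0.0 or twice its input
```

Large weights are penalized quadratically by L2, which is why the regularized model's training loss starts high and its training accuracy stays lower. Dropout instead injects noise, so no single neuron can be relied on.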

In [26]:
#train with Early Stopping
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)

model_es = Sequential()
model_es.add(Dense(256, activation="relu", input_shape=(X_train_proc.shape[1],)))
model_es.add(Dense(128, activation="relu"))
model_es.add(Dense(64, activation="relu"))
model_es.add(Dense(1, activation="sigmoid"))

model_es.compile(optimizer=Adam(learning_rate=0.001),
                 loss="binary_crossentropy",
                 metrics=["accuracy"])

history_es = model_es.fit(
    X_train_proc, y_train_enc,
    epochs=50,
    batch_size=32,
    validation_data=(X_valid_proc, y_valid_enc),
    callbacks=[early_stop],
)
print(f"Training stopped after {len(history_es.epoch)} epochs.")

Epoch 1/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8373 - loss: 0.3460 - val_accuracy: 0.8551 - val_loss: 0.3159
Epoch 2/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8595 - loss: 0.3035 - val_accuracy: 0.8529 - val_loss: 0.3136
Epoch 3/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8604 - loss: 0.3009 - val_accuracy: 0.8579 - val_loss: 0.3106
Epoch 4/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8674 - loss: 0.2931 - val_accuracy: 0.8537 - val_loss: 0.3135
Epoch 5/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 6s 4ms/step - accuracy: 0.8658 - loss: 0.2875 - val_accuracy: 0.8581 - val_loss: 0.3146
Epoch 6/50
1069/1069 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8690 - loss: 0.2792 - val_accuracy: 0.8585 - val_loss: 0.3205
Training stopped...

# Comparison of epochs and accuracy

**Without early stopping:**

Training accuracy reached ~90%+.

Validation accuracy peaked around epoch 5–7 (~85%), then dropped to ~84% or lower.

Clear overfitting after epoch 7.

**With early stopping:**

Training stopped automatically after 6 epochs.

Training accuracy: ~86.9%.

Validation accuracy: ~85.8% (higher than the late epochs in the no-early-stopping run).

Validation loss: stayed low (~0.31–0.32) instead of blowing up.

### So early stopping saved about 14 of the 20 epochs used without it, and gave better validation accuracy than training for too long.


# How does early stopping prevent overfitting?

It monitors the validation loss.
When the validation loss has not improved for a set number of epochs (patience=3 here), training stops.
This prevents the model from continuing to learn the noise and quirks of the training data.

By restoring the best weights (restore_best_weights=True), we keep the model at the point where it generalized best.
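The stopping rule itself is simple enough to write out. Below is a minimal pure-Python sketch of patience-based early stopping, applied to the six validation losses logged in the run above. The function name is hypothetical, and this is a simplified illustration rather than the Keras callback.

```python
def early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-based.

    Stops once the loss has failed to improve for `patience`
    consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses from the early-stopping run
logged = [0.3159, 0.3136, 0.3106, 0.3135, 0.3146, 0.3205]
result = early_stopping(logged, patience=3)
print(result)   # (6, 3)
```

The lowest validation loss occurs at epoch 3 and the next three epochs bring no improvement, so training ends after epoch 6. With restored best weights, the model kept is the one from epoch 3.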

**Reflection**
1. What I learned

**Optimizers**: They decide how the model updates its weights to reduce the loss. Adaptive optimizers like Adam usually converge faster than plain SGD.

**Batch size**: Small batches make training much slower, but their noisier updates can help the model generalize. Large batches train faster per epoch but might not generalize as well.

**Regularization**: Early stopping halts training when validation performance no longer improves, which saves time and prevents the model from getting worse on unseen data. L1 can push the weights of irrelevant features to exactly zero, while L2 shrinks all weights toward zero. Dropout randomly disables neurons during training, so the network cannot rely on any single neuron and is less likely to overfit.

**Train/validation/test splits**: Splitting the data ensures we train on one part, tune the model on another, and finally test on a fresh set to check real performance.

# If I train a new deep learning model on tabular data, I would choose:

**Optimizer**: Adam, because it’s fast and usually gives good results.

**Batch size**: Medium size (like 32 or 64) for a balance between speed and accuracy.

**Regularization**: Dropout or L2 regularization, to avoid overfitting.

**Early stopping:** Yes, to stop training when validation loss stops improving.

**Data splitting strategy:** Use 70% training, 15% validation, 15% test (or similar). This way I can tune the model on validation and still have a fresh test set.