# __Dropout Regularization__

Dropout is a technique where:

- Randomly selected neurons are ignored during training; they are "dropped out". Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass.


- If neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

- The effect is that the network becomes less sensitive to the specific weights of neurons. This, in turn, results in a network that is capable of better generalization and is less likely to overfit the training data.
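
The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of "inverted" dropout, not the Keras implementation: the function name `dropout_forward` and the toy array are assumptions made for the example. Kept units are rescaled by `1 / (1 - rate)` so the expected activation stays the same, and nothing is dropped at inference time.

```python
import numpy as np

def dropout_forward(activations, rate, training, rng):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time every unit is kept and nothing is rescaled.
        return activations
    keep_prob = 1.0 - rate
    # Boolean mask: True where the unit is kept for this forward pass.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 6))  # hypothetical hidden-layer activations
train_out = dropout_forward(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout_forward(hidden, rate=0.5, training=False, rng=rng)
print(train_out)
```

With a rate of 0.5, each entry of `train_out` is either 0 (dropped) or 2 (kept and rescaled), while `infer_out` is identical to the input.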

Let's understand how it works.


## Steps to be followed:
1. Import the required libraries
2. Read a CSV file into a DataFrame
3. Create dummies
4. Standardize and prepare data for modeling
5. Perform K-fold cross-validation and model training
6. Final accuracy calculation on the test set

### Step 1: Import the required libraries

- Import libraries for data preprocessing: **pandas** for data manipulation, **SimpleImputer** and **StandardScaler** from **sklearn** for imputation and standardization, and **scipy.stats.zscore** for z-score standardization. Also import tools for model evaluation: metrics from **sklearn**, and **train_test_split** and **KFold** from **sklearn.model_selection**.
- Import the necessary components from TensorFlow Keras (**Sequential**, **Dense**, **Dropout**, and **regularizers**) to build a neural network model. These components allow for the creation of a sequential model with dense layers, dropout, and weight regularization.

In [1]:
!pip install tensorflow==2.17.0 scikeras==0.13.0 keras==3.2.0



In [3]:
import os

# Disable oneDNN optimizations to avoid potential minor numerical differences caused by floating-point round-off errors.
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

In [4]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import train_test_split, KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


### Step 2: Read a CSV file into a DataFrame
- Read a CSV file from the given URL into a Pandas DataFrame, using **na_values** to treat the strings `NA` and `?` as missing values.

In [5]:
# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

In [6]:
df.head(5)

| | id | job | area | income | aspect | subscriptions | dist_healthy | save_rate | dist_unhealthy | age | pop_dense | retail_dense | crime | product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | vv | c | 50876.0 | 13.1 | 1 | 9.017895 | 35 | 11.738935 | 49 | 0.885827 | 0.492126 | 0.0711 | b |
| 1 | 2 | kd | c | 60369.0 | 18.625 | 2 | 7.766643 | 59 | 6.805396 | 51 | 0.874016 | 0.34252 | 0.400809 | c |
| 2 | 3 | pe | c | 55126.0 | 34.766667 | 1 | 3.632069 | 6 | 13.671772 | 44 | 0.944882 | 0.724409 | 0.207723 | b |
| 3 | 4 | 11 | c | 51690.0 | 15.808333 | 1 | 5.372942 | 16 | 4.333286 | 50 | 0.889764 | 0.444882 | 0.361216 | b |
| 4 | 5 | kl | d | 28347.0 | 40.941667 | 3 | 3.822477 | 20 | 5.967121 | 38 | 0.744094 | 0.661417 | 0.068033 | a |


**Observation**
- The output shows the first five rows of the dataset.
- Each row represents one sample, and each column represents one attribute or feature of that sample.
- The columns are `id`, `job`, `area`, `income`, `aspect`, `subscriptions`, `dist_healthy`, `save_rate`, `dist_unhealthy`, `age`, `pop_dense`, `retail_dense`, `crime`, and `product`.
- `job`, `area`, and `product` hold categories; the remaining columns hold numeric measurements.

In [7]:
# Check for missing values
df.isnull().sum()

id                 0
job                0
area               0
income            59
aspect             0
subscriptions      0
dist_healthy       0
save_rate          0
dist_unhealthy     0
age                0
pop_dense          0
retail_dense       0
crime              0
product            0
dtype: int64
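
To see how **na_values** leads to the missing-value counts above, here is a small self-contained sketch. The inline CSV text and the name `toy` are made up for illustration; only the `na_values=['NA', '?']` argument comes from the notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with two different "missing" markers.
csv_text = "id,income\n1,50876\n2,NA\n3,?\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

# Both markers are parsed as NaN, so the column stays numeric.
missing_per_column = toy.isnull().sum()
print(toy.dtypes)
print(missing_per_column)
```

Without `'?'` in `na_values`, pandas would read `income` as a column of strings, and `isnull()` would miss that row.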

### Step 3: Create dummy variables

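- Use the **pd.get_dummies()** function to convert the categorical columns **'job'** and **'area'** into dummy variables, which represent the presence or absence of each category as binary values. With **drop_first=True**, the first category of each column is omitted so that the dummy columns are not redundant.

- Passing **columns=categorical_cols** replaces the original **'job'** and **'area'** columns with their dummy columns, so no separate **df.drop()** call is needed.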

In [8]:
# Convert categorical columns to dummy variables
categorical_cols = ['job', 'area']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [9]:
df.columns


Index(['id', 'income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate',
       'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime',
       'product', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv',
       'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd',
       'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb',
       'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp',
       'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_b', 'area_c',
       'area_d'],
      dtype='object')
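
The effect of `drop_first=True` is easier to see on a toy frame. The sketch below uses a made-up four-row DataFrame (`toy` is a hypothetical name); the `pd.get_dummies` call mirrors the one in the notebook. It is consistent with the column list above, which has no `job_11` or `area_a` column.

```python
import pandas as pd

# Hypothetical miniature version of the dataset.
toy = pd.DataFrame({
    "job": ["vv", "kd", "pe", "kd"],
    "income": [50876.0, 60369.0, 55126.0, 51690.0],
})

# The original 'job' column is replaced by its dummy columns.
# Categories are sorted, and the first one ('kd') is dropped.
encoded = pd.get_dummies(toy, columns=["job"], drop_first=True)
encoded_cols = list(encoded.columns)
print(encoded_cols)
```

A row whose job is the dropped category (`kd` here) is represented by zeros in all remaining `job_*` columns.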

In [10]:
# Convert all boolean columns to numeric (integer) format
boolean_columns = df.select_dtypes(include=['bool']).columns
df[boolean_columns] = df[boolean_columns].astype(int)

In [11]:
df.head()

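| | id | income | aspect | subscriptions | dist_healthy | save_rate | dist_unhealthy | age | pop_dense | retail_dense | ... | job_pz | job_qp | job_qw | job_rn | job_sa | job_vv | job_zz | area_b | area_c | area_d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 50876.0 | 13.1 | 1 | 9.017895 | 35 | 11.738935 | 49 | 0.885827 | 0.492126 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2 | 60369.0 | 18.625 | 2 | 7.766643 | 59 | 6.805396 | 51 | 0.874016 | 0.34252 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 55126.0 | 34.766667 | 1 | 3.632069 | 6 | 13.671772 | 44 | 0.944882 | 0.724409 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 51690.0 | 15.808333 | 1 | 5.372942 | 16 | 4.333286 | 50 | 0.889764 | 0.444882 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | 28347.0 | 40.941667 | 3 | 3.822477 | 20 | 5.967121 | 38 | 0.744094 | 0.661417 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |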


#### Train and Test split:

The `product` column is the target variable. The `id` column is removed because it adds no value to the training process.


In [12]:
# Split the data into features and targets before any preprocessing
x_columns = df.columns.drop(['product', 'id'])
y = pd.get_dummies(df['product']).values

X_train, X_test, y_train, y_test = train_test_split(df[x_columns], y, test_size=0.2, random_state=42)
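
The target is one-hot encoded with `pd.get_dummies`, and later cells turn it back into class indices with `np.argmax`. A minimal round-trip sketch, assuming a made-up `products` series:

```python
import numpy as np
import pandas as pd

# Hypothetical target column with three product classes.
products = pd.Series(["b", "c", "b", "a"], name="product")

dummies = pd.get_dummies(products)
class_names = list(dummies.columns)        # sorted category labels
y_onehot = dummies.values.astype(int)      # one row per sample, one column per class

# argmax recovers the index of the "hot" column for each row.
class_index = np.argmax(y_onehot, axis=1)
recovered = [class_names[i] for i in class_index]
print(y_onehot)
print(recovered)
```

The same `argmax` step is what lets predicted softmax probabilities be compared against the true labels with `accuracy_score`.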

### Step 4: Standardize and prepare data for modeling


In [13]:
# Apply imputation to numeric columns in train and use the same transformer for test
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize the numeric columns
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
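
The cell above fits the imputer and the scaler on the training data only and reuses them on the test data, which keeps test-set statistics out of preprocessing. The sketch below shows that pattern on tiny made-up arrays (`train` and `test` are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one missing value in train, one in test.
train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[np.nan, 20.0]])

# Medians are learned from the training rows only.
imputer = SimpleImputer(strategy="median")
train_imp = imputer.fit_transform(train)
test_imp = imputer.transform(test)

# Mean and standard deviation are also learned from the training rows only.
scaler = StandardScaler()
train_sc = scaler.fit_transform(train_imp)
test_sc = scaler.transform(test_imp)

print(train_sc.mean(axis=0), train_sc.std(axis=0))
print(test_imp, test_sc)
```

The missing test value is filled with the training median of its column, so after scaling it lands at that column's training mean only when median and mean coincide, as they do in this toy example.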

### Step 5: Perform K-fold cross-validation and model training
- Train a model using K-fold cross-validation with 5 folds.
- The model consists of a sequential neural network with two hidden layers, both using ReLU activation. The second hidden layer also applies L2 regularization to its weights.
- Dropout with a rate of 0.5 is applied to the output of the first hidden layer to reduce overfitting.
- The model is trained using the Adam optimizer and categorical cross-entropy loss function.
- The accuracy of each fold is calculated and printed.
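
Before looking at the training loop, it may help to see what `KFold` does with the indices on its own. The sketch below assumes 1,600 training rows with 45 features (the sizes implied by the 1280 + 320 shapes the loop reports) and uses a placeholder array `X_demo` instead of the real data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix with the same shape as the training data.
X_demo = np.zeros((1600, 45))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
seen_validation = []
for train_idx, val_idx in kf_demo.split(X_demo):
    # Each fold trains on 4/5 of the rows and validates on the remaining 1/5.
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_validation.extend(val_idx.tolist())

print(fold_sizes)
```

Across the five folds, every row is used for validation exactly once, so the per-fold accuracies together cover the whole training set.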

In [None]:
# Set up K-fold cross-validation
kf = KFold(5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_train_sc), 1):

    print("X_train[train_idx] shape:", X_train_sc[train_idx].shape)
    print("y_train[train_idx] shape:", y_train[train_idx].shape)
    print("X_train[test_idx] shape:", X_train_sc[test_idx].shape)
    print("y_train[test_idx] shape:", y_train[test_idx].shape)
    print()


    print(f"Fold #{fold}")
    model = Sequential([
        tf.keras.Input(shape=(X_train_sc.shape[1],)),  # Explicit input layer; Keras 3 warns when input_dim is passed to Dense
        Dense(50, activation='relu'),
        Dropout(0.5),  # Dropout to prevent overfitting
        Dense(25, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
        Dense(y_train.shape[1], activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # Train the model on the fold
    # Convert NumPy arrays to TensorFlow Tensors with float32 data type directly using appropriate indexes
    model.fit(tf.convert_to_tensor(X_train_sc[train_idx], dtype=tf.float32),
              tf.convert_to_tensor(y_train[train_idx], dtype=tf.float32),
              validation_data=(tf.convert_to_tensor(X_train_sc[test_idx], dtype=tf.float32),
                               tf.convert_to_tensor(y_train[test_idx], dtype=tf.float32)),
              epochs=50, batch_size=32, verbose=0)

    # Evaluate the model on the fold's test data
    predictions = model.predict(tf.convert_to_tensor(X_train_sc[test_idx], dtype=tf.float32))
    score = metrics.accuracy_score(np.argmax(y_train[test_idx], axis=1), np.argmax(predictions, axis=1))
    print(f"Accuracy for fold {fold}: {score:.4f}")


X_train[train_idx] shape: (1280, 45)
y_train[train_idx] shape: (1280, 7)
X_train[test_idx] shape: (320, 45)
y_train[test_idx] shape: (320, 7)

Fold #1


Accuracy for fold 1: 0.7188
X_train[train_idx] shape: (1280, 45)
y_train[train_idx] shape: (1280, 7)
X_train[test_idx] shape: (320, 45)
y_train[test_idx] shape: (320, 7)

Fold #2


Accuracy for fold 2: 0.7469
X_train[train_idx] shape: (1280, 45)
y_train[train_idx] shape: (1280, 7)
X_train[test_idx] shape: (320, 45)
y_train[test_idx] shape: (320, 7)

Fold #3


Accuracy for fold 3: 0.6750
X_train[train_idx] shape: (1280, 45)
y_train[train_idx] shape: (1280, 7)
X_train[test_idx] shape: (320, 45)
y_train[test_idx] shape: (320, 7)

Fold #4


Accuracy for fold 4: 0.7375
X_train[train_idx] shape: (1280, 45)
y_train[train_idx] shape: (1280, 7)
X_train[test_idx] shape: (320, 45)
y_train[test_idx] shape: (320, 7)

Fold #5




**Observation**
- The output shows the accuracy score for each fold of the cross-validation process:

  - Fold scores: For each fold, the output displays the fold number (for example, Fold #1) and the corresponding accuracy score (for example, 0.7188). The accuracy score is the proportion of correctly predicted labels among all labels in that fold's held-out data. Higher accuracy indicates better performance on data the model did not train on. The scores shown range from 0.6750 to 0.7469.
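
Per-fold scores are usually summarized by their mean and spread. The sketch below does this for the four fold accuracies that appear in the output above; the fifth fold's score is not shown there, so it is left out rather than guessed.

```python
import statistics

# Accuracies for folds 1 to 4, copied from the output above.
shown_fold_accuracies = [0.7188, 0.7469, 0.6750, 0.7375]

mean_accuracy = statistics.mean(shown_fold_accuracies)
spread = statistics.stdev(shown_fold_accuracies)  # sample standard deviation

print(f"Mean accuracy over shown folds: {mean_accuracy:.4f}")
print(f"Standard deviation over shown folds: {spread:.4f}")
```

A large spread relative to the mean would suggest that the model's performance depends noticeably on which rows end up in the held-out fold.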

### Step 6: Final accuracy calculation on the test set

In [None]:
# Final evaluation on the standardized test set
final_predictions = model.predict(tf.convert_to_tensor(X_test_sc, dtype=tf.float32)) # Convert X_test_sc to a Tensor
final_score = metrics.accuracy_score(np.argmax(y_test, axis=1), np.argmax(final_predictions, axis=1))
print(f"Final accuracy on test set: {final_score:.4f}")



**Observation**
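- The final accuracy score achieved by the model on the held-out test set is 0.69.
- Note that `model` here is the network trained in the last cross-validation fold, so it has seen only four fifths of the training data.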