# __Dropout Regularization__

Dropout is a technique where:

- Randomly selected neurons are ignored during training. They are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.


- If neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

- The effect is that the network becomes less sensitive to the specific weights of neurons. This, in turn, results in a network that is capable of better generalization and is less likely to overfit the training data.
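
The mechanism described above can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the function name `dropout_forward` and the toy `hidden` array are assumptions made for the example. It uses the "inverted dropout" convention, in which the surviving activations are scaled by `1 / (1 - rate)` during training so that nothing needs to change at inference time.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time every unit is kept and nothing is rescaled
        return activations, np.ones_like(activations)
    keep_prob = 1.0 - rate
    # Bernoulli mask: 1 keeps a unit, 0 drops it for this forward pass
    mask = (rng.random(activations.shape) < keep_prob).astype(activations.dtype)
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob, mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))  # hypothetical activations of a hidden layer
dropped, mask = dropout_forward(hidden, rate=0.5, rng=rng)

print("fraction of units kept:", mask.mean())
print("mean activation after dropout:", dropped.mean())
```

With a rate of 0.5, roughly half of the units are zeroed and the rest are doubled, so the mean activation stays close to its original value.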



## Steps to Be Followed:
1. Importing the required libraries
2. Reading a CSV file into a DataFrame
3. Creating the dummies
4. Preparing the data for modeling
5. Performing K-fold cross-validation and model training
6. Calculating the error

### Step 1: Importing the Required Libraries

- Imports libraries for data preprocessing, including z-score standardization using **scipy.stats.zscore**, and data manipulation using **pandas**. It also imports libraries for model evaluation, such as metrics from **sklearn** and train-test splitting from **sklearn.model_selection**.
- Imports the necessary components from TensorFlow Keras (**Sequential** and **Dense**) to build a neural network model. These components allow for the creation of a sequential model with dense layers and activation functions.

In [None]:

import pandas as pd
from scipy.stats import zscore
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation



### Step 2: Reading a CSV File into a DataFrame
- It reads a CSV file from a given URL into a Pandas DataFrame. The **na_values** argument tells Pandas to treat the strings **'NA'** and **'?'** as missing values.

In [None]:
# Dataset link: "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv"

In [None]:


# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])
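
To see what **na_values** changes, here is a small sketch on an in-memory CSV. The `csv_text` content is made up for illustration and is not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with two different "missing" markers
csv_text = "id,income,job\n1,50876,vv\n2,NA,kd\n3,?,pe\n"

# Without na_values, the "?" keeps the income column from being parsed as numeric
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, both markers become NaN and the column is parsed as float
toy = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "?"])

print(raw["income"].tolist())
print(toy["income"].tolist())
print("missing incomes:", toy["income"].isna().sum())
```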



In [None]:
df.head(5)

| | id | job | area | income | aspect | subscriptions | dist_healthy | save_rate | dist_unhealthy | age | pop_dense | retail_dense | crime | product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | vv | c | 50876.0 | 13.1 | 1 | 9.017895 | 35 | 11.738935 | 49 | 0.885827 | 0.492126 | 0.0711 | b |
| 1 | 2 | kd | c | 60369.0 | 18.625 | 2 | 7.766643 | 59 | 6.805396 | 51 | 0.874016 | 0.34252 | 0.400809 | c |
| 2 | 3 | pe | c | 55126.0 | 34.766667 | 1 | 3.632069 | 6 | 13.671772 | 44 | 0.944882 | 0.724409 | 0.207723 | b |
| 3 | 4 | 11 | c | 51690.0 | 15.808333 | 1 | 5.372942 | 16 | 4.333286 | 50 | 0.889764 | 0.444882 | 0.361216 | b |
| 4 | 5 | kl | d | 28347.0 | 40.941667 | 3 | 3.822477 | 20 | 5.967121 | 38 | 0.744094 | 0.661417 | 0.068033 | a |


**Observation**
- The output shows the first five rows of the DataFrame.
- Each row represents a sample or instance, while each column represents a different attribute or feature of that instance.
- The columns are id, job, area, income, aspect, subscriptions, dist_healthy, save_rate, dist_unhealthy, age, pop_dense, retail_dense, crime, and product.
- The **job**, **area**, and **product** columns hold categorical codes; the remaining columns hold numeric measurements.

### Step 3: Creating the Dummies

- It uses the **pd.get_dummies()** function to convert categorical columns **'job'** and **'area'** into dummy variables, which represent the presence or absence of each category as binary values.

- The resulting dummy variables are concatenated with the original DataFrame **df** using **pd.concat()**, which adds the dummy variables as new columns.

- Finally, the original categorical columns **'job'** and **'area'** are dropped from the DataFrame using the **df.drop()** function with the **axis=1** parameter set to remove columns. This ensures that only the dummy variables remain in the DataFrame.


In [None]:

df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)
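
The same three-step pattern (encode, concatenate, drop) is easier to follow on a tiny frame. The `toy` DataFrame below is a made-up example, not part of the dataset.

```python
import pandas as pd

# Hypothetical frame with one categorical column
toy = pd.DataFrame({"id": [1, 2, 3], "area": ["c", "d", "c"]})

# One indicator column per category, named with the given prefix
dummies = pd.get_dummies(toy["area"], prefix="area")

# Append the indicator columns, then drop the original categorical column
toy = pd.concat([toy, dummies], axis=1).drop("area", axis=1)

print(toy.columns.tolist())
print(toy)
```

Each row has exactly one indicator set among the `area_*` columns, which marks the category that row originally held.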


- The missing values in the **'income'** column of the DataFrame **'df'** are filled with the median value of the **'income'** column.

In [None]:
med = df['income'].median()
df['income'] = df['income'].fillna(med)
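
As a quick illustration of median imputation, the sketch below uses a short, hypothetical `income` Series with one missing entry. `Series.median()` skips missing values by default, so the fill value is computed from the observed entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with one missing entry
income = pd.Series([50876.0, np.nan, 55126.0, 51690.0, 28347.0])

med = income.median()        # NaN is skipped when computing the median
filled = income.fillna(med)  # Only the missing entry is replaced

print("median:", med)
print(filled.tolist())
```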

- The specified columns **('income', 'aspect', 'save_rate', 'age', 'subscriptions')** in the DataFrame **df** are standardized using z-score normalization, which transforms the values to have zero mean and unit variance.

In [None]:
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])
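
For reference, the z-score can be written out by hand. The sketch below uses the five income values shown in the `df.head(5)` output; the helper name `zscore_np` is an assumption for this example. It divides by the population standard deviation (`ddof=0`), which is also the default in **scipy.stats.zscore**.

```python
import numpy as np

def zscore_np(values):
    """Subtract the mean and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

incomes = np.array([50876.0, 60369.0, 55126.0, 51690.0, 28347.0])
z = zscore_np(incomes)

print("z-scores:", np.round(z, 3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))

# A single NaN makes the mean NaN, so every z-score becomes NaN
with_gap = zscore_np([1.0, np.nan, 3.0])
print("with a missing value:", with_gap)
```

The last lines show why missing incomes are filled before standardizing: one NaN in the input propagates to every standardized value.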

### Step 4: Preparing the Data for Modeling
- It selects the relevant columns from the DataFrame df by dropping the **'product'** and **'id'** columns and assigns them to the variable **x_columns**.
- It creates dummy variables for the **'product'** column using one-hot encoding and assigns the column names to the variable products. The target variable **'y'** is assigned the corresponding dummy variable values.

In [None]:

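x_columns = df.columns.drop('product').drop('id')
# Cast to float32 so the dummy columns and the numeric columns share one dtype
x = df[x_columns].values.astype(np.float32)
dummies = pd.get_dummies(df['product'])  # One-hot encode the target
products = dummies.columns
y = dummies.values.astype(np.float32)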
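
The round trip between class labels, one-hot rows, and class indices is what makes the later `np.argmax` calls work. Here is a minimal sketch on a made-up `labels` Series (the names ending in `_demo` are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical target labels
labels = pd.Series(["b", "c", "b", "a"], name="product")

dummies_demo = pd.get_dummies(labels)
products_demo = dummies_demo.columns           # Class names in column order
y_demo = dummies_demo.values.astype(np.float32)

# argmax recovers the column index of the single 1 in each row
class_idx = np.argmax(y_demo, axis=1)
decoded = products_demo[class_idx]

print(products_demo.tolist())
print(class_idx.tolist())
print(decoded.tolist())
```

Keeping the column index around (here `products_demo`) is what allows predicted class indices to be mapped back to product names.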

### Step 5: Performing K-Fold Cross-Validation and Model Training
- Train a model using K-fold cross-validation with 5 folds.
- The model is a sequential neural network with two hidden layers (50 and 25 units), both using ReLU activation; the second hidden layer also applies L1 activity regularization.
- Dropout with a rate of 0.5 is applied to the output of the first hidden layer to reduce overfitting.
- The model is trained using the Adam optimizer and categorical cross-entropy loss function.
- The accuracy of each fold is calculated and printed.
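
Before looking at the training loop, it may help to see what the splitter itself produces. The sketch below runs the same `KFold` configuration on a made-up 10-row array (`x_demo` and `kf_demo` are names chosen for this example) and counts how often each row lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(20).reshape(10, 2)   # Hypothetical data: 10 rows, 2 features
kf_demo = KFold(5, shuffle=True, random_state=42)

test_counts = np.zeros(len(x_demo), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(x_demo), start=1):
    test_counts[test_idx] += 1
    print(f"Fold #{fold}: {len(train_idx)} training rows, test rows {test_idx.tolist()}")

# Every row is held out exactly once across the five folds
print("times each row was held out:", test_counts.tolist())
```

Because every row is held out exactly once, concatenating the per-fold test predictions gives one out-of-sample prediction per row, although in shuffled order rather than the original row order.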

In [None]:
# Import the dropout layer, the regularizers, and the K-fold splitter
from tensorflow.keras.layers import Dropout
from tensorflow.keras import regularizers
from sklearn.model_selection import KFold

# Cross-validate
kf = KFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []
fold = 0

# Train the model
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")

    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Alternative regularizer to try: kernel_regularizer=regularizers.l2(0.01)

    model = Sequential()
    model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
    model.add(Dropout(0.5))  # Drop half of Hidden 1's outputs during training
    model.add(Dense(25, activation='relu',
                    activity_regularizer=regularizers.l1(1e-4)))  # Hidden 2
    # Usually do not add dropout after final hidden layer
    #model.add(Dropout(0.5))
    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              verbose=0, epochs=10)

    pred = model.predict(x_test)
    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1)
    oos_pred.append(pred)

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")


Fold #1
Fold score (accuracy): 0.67
Fold #2
Fold score (accuracy): 0.65
Fold #3
Fold score (accuracy): 0.6475
Fold #4
Fold score (accuracy): 0.61
Fold #5
Fold score (accuracy): 0.65


**Observation**
- The output shows the accuracy scores for each fold of the cross-validation process:

  - Fold Scores: The output displays the fold number (e.g., Fold #1) and the corresponding accuracy score (e.g., 0.67) for each fold. The accuracy score represents the proportion of correctly predicted labels to the total number of labels in the test set. Higher accuracy scores indicate better performance of the model on the test data.

  - Performance Variation: The model's performance varies across folds. The accuracy scores range from 0.61 to 0.67, a spread of 0.06, so the estimate depends modestly on which subset of the data is held out. This variation gives a rough indication of the stability and robustness of the model.
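
The fold scores printed above can be summarized with a few lines of NumPy. This is a small worked example on the five reported values; the variable names are chosen for illustration.

```python
import numpy as np

# Accuracy scores copied from the fold output above
fold_scores = np.array([0.67, 0.65, 0.6475, 0.61, 0.65])

mean_score = fold_scores.mean()
spread = fold_scores.max() - fold_scores.min()
sample_std = fold_scores.std(ddof=1)  # Sample standard deviation across folds

print(f"mean accuracy:  {mean_score:.4f}")
print(f"min-max spread: {spread:.4f}")
print(f"sample std:     {sample_std:.4f}")
```

The mean of the five fold scores is 0.6455, which matches the final score reported in Step 6. That agreement is expected when the folds are of equal size, because the overall accuracy is then the simple average of the per-fold accuracies.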

### Step 6: Calculating the Error
- It calculates the final accuracy score and creates a DataFrame combining the original data with the true values and predicted values.

In [None]:
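# Calculate the error

# Combine the per-fold out-of-sample (OOS) results into single arrays
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y, axis=1)  # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")

# Write the cross-validated prediction
# The folds are shuffled, so recover each prediction's original row index;
# kf.split(x) yields the same folds again because random_state is fixed
oos_index = np.concatenate([test for _, test in kf.split(x)])
oos_y = pd.DataFrame(oos_y, index=oos_index)
oos_pred = pd.DataFrame(oos_pred, index=oos_index)
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)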



Final score (accuracy): 0.6455



**Observation**
- The final accuracy score, computed over the out-of-sample predictions from all five folds, is 0.6455.