# **Data Cleaning and Neural Networks**

##### Keshav Ramji and Emily Paul

## **Set Up**

First, we'll clone our AI@Penn GitHub repository, which contains all of our datasets. After running the following cell, you'll be able to see the contents of the repository in the file browser on the left-hand side.

In [None]:
!git clone https://github.com/kjaisingh/AI-Penn.git

## **Data Cleaning: Black Friday Purchase Patterns**

Next, let's import the libraries we'll be using and name them:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn doesn't automatically import its subpackages, so we'll have to do this ourselves
from sklearn import preprocessing

Now we can load the training and testing datasets (credit to Analytics Vidhya for the data):

In [None]:
train = pd.read_csv('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/black_friday_train.csv')

test = pd.read_csv('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/black_friday_test.csv')

Print the training dataset to get a sense of what we're looking at:

In [None]:
train

We can immediately see that we'll need to do some pre-processing. Several columns contain non-numerical data, which we'll have to encode, and there are missing values that we'll need to deal with. Also note that different columns of numerical data have different scales, so we should normalize the dataframes before moving on.

First, let's delete the `User_ID` and `Product_ID` columns, since these labels don't encapsulate any qualities that might contribute to our model:

In [None]:
train = train.drop(labels=['User_ID', 'Product_ID'], axis = 1)

Now let's deal with the missing values; check to see which columns contain NaN values:

In [None]:
train.columns[train.isnull().any()]

So we now know that only the `Product_Category_2` and `Product_Category_3` columns have missing values. In this dataset, those columns appear to hold codes for additional categories that a product may belong to, so a missing value plausibly means that the product simply has no second or third category, which we can mark with a 0. Note that in reality, you should **never make assumptions** about what missing values mean, so you would do more research about the dataset and/or run your analysis multiple times to gauge the effects of different types of NaN replacement. For now though, let's just replace the NaNs in these two columns with 0s:

In [None]:
train['Product_Category_2'] = train['Product_Category_2'].fillna(0)
train['Product_Category_3'] = train['Product_Category_3'].fillna(0)

Now checking for NaN's should produce no columns:

In [None]:
train.columns[train.isnull().any()]
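
To get a feel for the NaN check and the fill step on something small, here is a minimal sketch on a made-up toy frame. The column names mirror the real data, but the values are invented for illustration, and `toy`, `missing_before`, and `missing_after` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# Count missing values per column rather than only listing the column names.
missing_before = toy.isnull().sum()
print(missing_before)

# Replace NaN with 0 in the two affected columns, mirroring the cells above.
filled = toy.fillna({"Product_Category_2": 0, "Product_Category_3": 0})
missing_after = filled.isnull().sum()
print(missing_after)
```

Counting per column is a cheap way to see how much of the data a replacement rule will touch before committing to it.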

Ok, next let's handle our categorical data. Check to see which columns contain non-numerical data by selecting the columns whose data types are not numeric:

In [None]:
train.select_dtypes(exclude=['number'])

For now, let's simply implement label encoding, but keep in mind that since we always want to **avoid introducing bias** we might use a different approach (e.g. one-hot encoding) in reality if that suits the data better. 

We'll use the label encoder provided by the scikit-learn library. This assigns an integer code to each distinct value of a categorical feature (the encoder is refit on every column we pass to `fit_transform`):

In [None]:
label_encoder = preprocessing.LabelEncoder()
train['Gender'] = label_encoder.fit_transform(train['Gender'])
train['Age'] = label_encoder.fit_transform(train['Age'])
train['City_Category'] = label_encoder.fit_transform(train['City_Category'])
train['Stay_In_Current_City_Years'] = label_encoder.fit_transform(train['Stay_In_Current_City_Years'])

Now checking for non-numerical values should produce no columns:

In [None]:
train.select_dtypes(exclude=['number'])
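
To make the difference between label encoding and one-hot encoding concrete, here is a small sketch on a made-up column. It uses pandas category codes as a stand-in for the scikit-learn label encoder (both assign integers to the sorted distinct values), and `toy`, `codes`, and `one_hot` are hypothetical names.

```python
import pandas as pd

# Made-up values for a single categorical column.
toy = pd.DataFrame({"City_Category": ["A", "C", "B", "A"]})

# Label encoding: each distinct category becomes one integer code (sorted order).
codes = toy["City_Category"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one indicator column per category, with no implied ordering.
one_hot = pd.get_dummies(toy["City_Category"], prefix="City").astype(int)
print(one_hot)
```

Label encoding keeps a single column but implies that C > B > A, which may or may not be meaningful; one-hot encoding avoids the ordering at the cost of extra columns.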

The last thing we need to do before moving on is to normalize the dataframe. We can do this using the `MinMaxScaler` provided by scikit-learn, which, when given no arguments, scales the values of each feature to lie in the range from 0 to 1:

In [None]:
scaler = preprocessing.MinMaxScaler()
scaled_train = scaler.fit_transform(train)

# the MinMaxScaler outputs a numpy ndarray, which we'll convert back into a pandas dataframe
train = pd.DataFrame(scaled_train, columns = train.columns)
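
For reference, min-max scaling to the range 0 to 1 is just `(x - min) / (max - min)` applied to each column. The sketch below works this out by hand with NumPy on made-up numbers; `X`, `col_min`, `col_max`, and `X_scaled` are hypothetical names.

```python
import numpy as np

# Two made-up feature columns on very different scales.
X = np.array([[1.0, 100.0],
              [3.0, 300.0],
              [5.0, 200.0]])

# Per-column minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# x_scaled = (x - min) / (max - min), applied column by column via broadcasting.
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, so both columns end up on the same footing.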

Let's take a look at our processed training dataset:

In [None]:
train

This is exactly what we wanted! Now let's pre-process our testing dataset the same way:

In [None]:
# drop the useless columns
test = test.drop(labels=['User_ID', 'Product_ID'], axis = 1)

# replace NaN's with 0's
test['Product_Category_2'] = test['Product_Category_2'].fillna(0)
test['Product_Category_3'] = test['Product_Category_3'].fillna(0)

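# implement label encoding on the categorical values
# note that there's no need to re-instantiate the label encoder, since fit_transform
# refits it on each column (the codes are therefore relearned from the test data)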
test['Gender'] = label_encoder.fit_transform(test['Gender'])
test['Age'] = label_encoder.fit_transform(test['Age'])
test['City_Category'] = label_encoder.fit_transform(test['City_Category'])
test['Stay_In_Current_City_Years'] = label_encoder.fit_transform(test['Stay_In_Current_City_Years'])

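# scale the dataframe
# note that there's no need to re-instantiate the scaler, since fit_transform refits it
# (the min and max used here therefore come from the test data, not the training data)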
scaled_test = scaler.fit_transform(test)
test = pd.DataFrame(scaled_test, columns = test.columns)

So checking for NaN's shouldn't produce any columns:

In [None]:
test.columns[test.isnull().any()]

And neither should checking for non-numerical values:

In [None]:
test.select_dtypes(exclude=['number'])

Printing the cleaned testing dataframe shows that it now looks to be in the same format as our cleaned training dataframe:

In [None]:
test
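
One caveat about the test-set cell: calling `fit_transform` on the test frame relearns the category codes and the min/max values from the test data. A common alternative is to learn them from the training data only and reuse them on the test data. Here is a minimal sketch of that idea in plain pandas, on made-up miniature frames; all names (`toy_train`, `toy_test`, `age_codes`, and so on) are hypothetical.

```python
import pandas as pd

# Made-up miniature train/test frames; column names mirror the real data.
toy_train = pd.DataFrame({"Age": ["0-17", "26-35", "55+"], "Occupation": [2, 10, 20]})
toy_test = pd.DataFrame({"Age": ["26-35", "55+"], "Occupation": [4, 15]})

# Learn the category -> integer mapping from the training data only.
age_codes = {label: code for code, label in enumerate(sorted(toy_train["Age"].unique()))}
toy_train["Age"] = toy_train["Age"].map(age_codes)
toy_test["Age"] = toy_test["Age"].map(age_codes)

# Learn min and max from the training data only, then reuse them on the test data.
train_min = toy_train.min()
train_max = toy_train.max()
scaled_train = (toy_train - train_min) / (train_max - train_min)
scaled_test = (toy_test - train_min) / (train_max - train_min)
print(scaled_test)
```

With this approach the same raw value always maps to the same encoded and scaled value in both frames. Note that test values outside the training range would fall outside 0 to 1.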

We're all set; at this point you could start to run an analysis on these cleaned datasets. 

Let's move on to exploring how to build a neural network!

## **Neural Networks: Predicting Audiobook Sales from Prior Purchases**

We're going to approach this using TensorFlow, another widely used machine learning library. 

TensorFlow 2.0, the version we'll be using here, integrates Keras (another commonly used framework) as its high-level front-end. 
Let's import the packages we'll be using:

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Our goal is to build a neural network that can predict audiobook sales from information about customers' prior purchases. Let's take a look at our dataset (credit to 365DataScience for the data):

In [None]:
dataset = pd.read_csv('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/Audiobooks_data_3_with_titles.csv')
dataset

Our objective here will be to use the other features to predict the targets column. The targets column displays 1 if a user who has previously purchased an audiobook becomes a repeat customer, and 0 if they do not return. 

We'll start by loading in our datasets - these are `.npz` files, a format for storing NumPy arrays. Don't worry too much about the preprocessing - we've already done it for you, and the resulting train, test, and validation `.npz` files are in our repository. 

If you're interested in how the pre-processing was done with numpy, you can take a look at the code block in the appendix at the bottom of this Colaboratory notebook. Running that cell will load `Audiobooks_data_3.csv` and generate the train, test, and validation `.npz` files for you.

Let's load our cleaned training, validation, and testing datasets:

In [None]:
# np.float and np.int were removed in NumPy 1.24, so we use the explicit np.float64 and np.int64 types
train = np.load('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/Audiobooks_train_data_3.npz')
train_inputs = train['inputs'].astype(np.float64)
train_targets = train['targets'].astype(np.int64)

validation = np.load('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/Audiobooks_validation_data_3.npz')
validation_inputs = validation['inputs'].astype(np.float64)
validation_targets = validation['targets'].astype(np.int64)

test = np.load('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/Audiobooks_test_data_3.npz')
test_inputs = test['inputs'].astype(np.float64)
test_targets = test['targets'].astype(np.int64)


Now let's declare how many inputs, outputs, and nodes per hidden layer we want our neural network to have:

In [None]:
input_size = 10 # This is because there are 10 columns that we use (not including ID) in our prediction
output_size = 2 # This is because there are 2 possible outputs - 1 (yes, customer returned) or 0 (no, did not return)
hidden_layer_nodes = 100 # can be chosen as seen fit - this is part of hyperparameter tuning, to see what size would be optimal

We'll now build our neural network:

In [None]:
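# Note: this line creates a Glorot normal initializer but never passes it to a layer,
# so the Dense layers below still use their default initializer (Glorot uniform).
# To actually use it, pass it as kernel_initializer=... to each Dense layer.
tf.keras.initializers.GlorotNormal(seed=None)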

model = tf.keras.Sequential([
                 tf.keras.layers.Input(shape = (input_size,)), # shape is a tuple that excludes the batch dimension
                 tf.keras.layers.Dense(hidden_layer_nodes, activation='relu'), 
                 tf.keras.layers.Dense(hidden_layer_nodes, activation='relu'),
                 #tf.keras.layers.Dropout(0.3,),
                 tf.keras.layers.Dense(output_size, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size = 32 # This is the number of samples used for each gradient update
epochs = 100     # The number of full passes through the training set

Let's first make note of how the model is being constructed using `keras.Sequential`. We have 2 fully connected hidden layers with 100 nodes each with the ReLU activation function, and a final dense layer with the 2 output prediction nodes.
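
As a quick sanity check on the size of this architecture, we can count its trainable parameters by hand: each dense layer has one weight per input-output pair plus one bias per output node. The helper `dense_params` below is a hypothetical name introduced for illustration.

```python
input_size = 10
hidden_layer_nodes = 100
output_size = 2

def dense_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output node.
    return n_in * n_out + n_out

# Layer widths from input to output: 10 -> 100 -> 100 -> 2.
layer_sizes = [input_size, hidden_layer_nodes, hidden_layer_nodes, output_size]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [1100, 10100, 202] 11402
```

The total should match the trainable parameter count that `model.summary()` reports for the model above.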

When we run `model.compile()`, we are effectively configuring the neural network for optimization through the loss function (we have chosen "sparse_categorical_crossentropy") and the optimization method (Adam). Adam has a pre-set learning rate of 0.001, but we can change this as well. 

We chose `sparse_categorical_crossentropy` here because it measures the loss between labels and predictions for a categorical variable, which is what we have: the targets take one of 2 possible values, 0 or 1 (did the customer return or not). The "sparse" variant expects the labels as integers (0 or 1) rather than as one-hot vectors, which matches how our targets are stored. 
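
To show what this loss computes, here is a small NumPy sketch on made-up numbers: for each sample we take the predicted probability of the true class, and the loss is the average negative log of those probabilities. The arrays `probs` and `labels` are invented for illustration, and this is a hand-rolled version, not the Keras implementation itself.

```python
import numpy as np

# Made-up softmax outputs for three customers: [P(class 0), P(class 1)].
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

# Integer labels (0 or 1), one per customer.
labels = np.array([0, 1, 1])

# Pick out the probability that the model assigned to the true class of each sample.
true_class_probs = probs[np.arange(len(labels)), labels]

# The loss is the mean negative log-probability of the true class.
loss = -np.mean(np.log(true_class_probs))
print(true_class_probs, loss)
```

Confident correct predictions (like 0.9) contribute little to the loss, while the third sample, where the true class only got 0.4, contributes the most.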

It's important to note that there are many activation functions that may be used in various stages, but ReLU is very widely used for the hidden layers, and softmax or sigmoid is commonly used for the output layer. 
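
For intuition, the three activation functions mentioned here are short enough to write out by hand. The following is a NumPy sketch of their standard definitions (not the Keras source), with `relu`, `softmax`, and `sigmoid` defined locally.

```python
import numpy as np

def relu(x):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the max for numerical stability, then normalize the exponentials to sum to 1.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(softmax(np.array([1.0, 1.0])))
print(sigmoid(0.0))
```

Softmax turns a vector of scores into probabilities across classes, which is why it pairs naturally with an output layer that has one node per class.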

Here is some TensorFlow documentation that goes through different loss and activation functions, as well as optimization methods, if anyone is interested in learning more about this. 

https://www.tensorflow.org/api_docs/python/tf/keras/losses

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers

https://www.tensorflow.org/api_docs/python/tf/keras/activations


Lastly, we need to fit the model - what we have done so far is like configuration, and what we now need to do is to actually train the neural network! Let's see our results: 

In [None]:
model.fit(train_inputs, train_targets, batch_size = batch_size, 
          epochs = epochs, validation_data = (validation_inputs, validation_targets),
          verbose = 2 # print one line per epoch (loss, accuracy, validation loss, validation accuracy) instead of a progress bar
          )  

We can also measure the model's performance on the test set, although this gives us a single summary of loss and accuracy rather than an epoch-by-epoch picture of our results. We can do this through `model.evaluate`, which takes in the test inputs, the test targets, and the batch size. 

In [None]:
model.evaluate(test_inputs, test_targets, batch_size = batch_size, verbose = 2)
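
The accuracy reported here is the fraction of samples for which the class with the highest predicted probability matches the target. A minimal NumPy sketch of that calculation, on made-up probabilities and targets (`probs`, `targets`, and `predicted` are hypothetical names):

```python
import numpy as np

# Made-up softmax outputs for four customers: [P(class 0), P(class 1)].
probs = np.array([[0.70, 0.30],
                  [0.40, 0.60],
                  [0.20, 0.80],
                  [0.55, 0.45]])
targets = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest probability in each row.
predicted = np.argmax(probs, axis=1)

# Accuracy is the share of predictions that match the targets.
accuracy = np.mean(predicted == targets)
print(predicted, accuracy)
```

On the real model, the same probabilities would come from `model.predict(test_inputs)`.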

Our results are not too bad! We're getting around 83% on our validation accuracy, and we haven't even used any special techniques yet. The first thing that we can do is hyperparameter tuning - changing the size of the hidden layers, possibly the batch size, the number of epochs, and so on. 

We definitely also need to be mindful of overfitting - there are a few techniques that we can use to combat it, most of which are beyond the scope of this bootcamp - but be sure to keep an eye out for future sessions where we may cover topics like these! However, one very easy method that we can employ is dropout (see the commented-out `Dropout` line in the model above), where we randomly ignore a certain fraction of a layer's units on each training iteration; this keeps the network from relying too heavily on any single unit. 
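
To illustrate what dropout does during training, here is a NumPy sketch of the usual "inverted dropout" formulation: each unit is zeroed with probability `rate`, and the surviving units are scaled up by `1 / (1 - rate)` so that the expected total stays the same. The function `dropout` below is a hand-written illustration, not the Keras layer itself.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each unit with probability 1 - rate, and zero it otherwise.
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the kept units so that the expected sum of activations is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))  # made-up layer outputs for a batch of 4 samples
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

A different random mask is drawn on every training step, and dropout is switched off entirely when the model is evaluated or used for prediction.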

For now, instead of always running for 100 epochs even when the validation loss has stopped improving, we will use TensorFlow's `EarlyStopping` callback to stop fitting the model after a "certain number of iterations" (epochs) have passed without the validation loss improving on its best value so far. 

In [None]:
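# Note: calling fit again continues training from the current weights; re-run the model-building cell first to start fresh
stop_early = tf.keras.callbacks.EarlyStopping(patience = 5) # This patience is the "certain number of iterations" mentioned above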
model.fit(train_inputs, train_targets, batch_size = batch_size, 
          epochs = epochs, validation_data=(validation_inputs, validation_targets), 
          verbose = 2, 
          callbacks = [stop_early]
          )  
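
The patience rule can be sketched in plain Python. The function below is a simplified illustration of the idea (track the best validation loss so far and stop once `patience` epochs in a row fail to beat it), not the Keras implementation; `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value (0.40) is reached at epoch 3.
val_losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.405, 0.44, 0.45]
print(stopping_epoch(val_losses, patience=5))  # 8
print(stopping_epoch(val_losses, patience=2))  # 5
```

A larger patience tolerates longer plateaus at the cost of more epochs; a smaller one stops sooner but may quit just before the loss improves again.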

Note, however, that the main drawback to using the `EarlyStopping` mechanism is that we do not get to see the broader set of validation accuracy results. This can lead us to believe that our maximum accuracy is lower than it actually is, and it does not give us a sense of the general consistency of performance - every percent in accuracy counts!

## **Appendix**

Here are the preprocessing steps for the audiobooks customer dataset that we didn't touch upon for the neural networks component of the programming session. 

In [None]:
# preprocessing steps
csv_data = np.loadtxt('/content/AI-Penn/Session 4 - Neural Networks (News Outreach)/Audiobooks_data_3.csv', delimiter = ',')
# drop the first column (the ID) and the last column (the targets) to get the inputs
unscaled_inputs = csv_data[:, 1:-1]
targets = csv_data[:, -1]

# balancing dataset - keep only as many 0 targets as there are 1 targets, so that
# the model does not simply learn to favor the more common class
num_one_targets = int(np.sum(targets))  # number of 1 targets, used for balancing
num_zero_targets = 0
removed_index = []
for i in range(targets.shape[0]):
    if targets[i] == 0:
        num_zero_targets += 1
        if num_zero_targets > num_one_targets:
            removed_index.append(i)
unscaled_inputs_equalpriors = np.delete(unscaled_inputs, removed_index, axis = 0)
target_equal_priors = np.delete(targets, removed_index, axis = 0)

# Scaling the set using sklearn's preprocessing 
scaled_inputs = preprocessing.scale(unscaled_inputs_equalpriors)

#shuffling - so we effectively have a randomized process of selecting values from the dataset for when it is partitioned
indices_shuffled = np.arange(scaled_inputs.shape[0])
np.random.shuffle(indices_shuffled)
shuffled_inputs = scaled_inputs[indices_shuffled]
shuffled_targets = target_equal_priors[indices_shuffled]

# splitting data into train, validation, and test -- currently 80:10:10 but can be tweaked
samples_count = shuffled_inputs.shape[0]
train_samples_count = int(0.8*samples_count) 
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count 

#Setting the train, validation, and test inputs and outputs to incorporate particular ranges of the set 
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]
#train_inputs = scaled_inputs[:train_samples_count]
#train_targets = target_equal_priors[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:(train_samples_count + validation_samples_count)]
validation_targets = shuffled_targets[train_samples_count:(train_samples_count + validation_samples_count)]

test_inputs = shuffled_inputs[(train_samples_count + validation_samples_count):]
test_targets = shuffled_targets[(train_samples_count + validation_samples_count):]


print(np.sum(train_targets), train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets)/validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets)/test_samples_count)

# Saving the numpy arrays as .npz files in the current working directory - these correspond to the files
# in the GitHub repository that we loaded to build our neural network (named to match those `_data_3` files)
np.savez('Audiobooks_train_data_3', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_validation_data_3', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_test_data_3', inputs=test_inputs, targets=test_targets)
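
As a side note, the balancing loop in the cell above can also be written without a Python-level loop. The sketch below shows one possible vectorized version on made-up arrays; it is an alternative formulation for illustration, and `zero_positions`, `balanced_inputs`, and `balanced_targets` are hypothetical names.

```python
import numpy as np

# Made-up targets (3 ones, 6 zeros) and matching two-column inputs.
targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
inputs = np.arange(len(targets) * 2).reshape(len(targets), 2)

num_one_targets = int(np.sum(targets))

# Positions of all 0 targets, in order of appearance.
zero_positions = np.flatnonzero(targets == 0)

# Keep the first num_one_targets zeros and remove every zero after that.
removed_index = zero_positions[num_one_targets:]

balanced_inputs = np.delete(inputs, removed_index, axis=0)
balanced_targets = np.delete(targets, removed_index, axis=0)
print(removed_index, balanced_targets)
```

Like the loop, this keeps the earliest zeros in the file; shuffling afterwards (as the cell above does) removes any ordering effect before the split.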
