**DeapSECURE module 4: Deep Learning**

# Session 2: Classifying Smartphone Apps with Keras

Welcome to the DeapSECURE online training program!
This is a Jupyter notebook for the hands-on learning activities of the
["Deep Learning" (DL) module](https://deapsecure.gitlab.io/deapsecure-lesson04-nn/),
Episode 5: ["Classifying Smartphone Apps with Keras"](https://deapsecure.gitlab.io/deapsecure-lesson04-nn/24-keras-classify/index.html).
Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.

### Overview

In this session, we will use this notebook to build neural network models that perform a classification task on the SherLock "Applications" dataset.
We will use a more realistic dataset containing 18 applications, which is more challenging for machine learning models.
Just like the previous lesson on [machine learning](https://deapsecure.gitlab.io/deapsecure-lesson03-ml/), the goal of this lesson is to build a neural network model which will accurately distinguish these 18 applications based on their resource usage characteristics.
We will find out in this session whether the neural network model will surpass the traditional machine learning models in terms of its accuracy.

> **Your challenge** in this notebook is to train machine learning models (similar to those introduced in the previous notebooks) using the "18-apps" dataset to correctly classify running apps on the smartphone with very high accuracy. Our target is to reach >99% accuracy.


**QUICK LINKS**
* [Setup](#sec-setup)
* [Loading Sherlock Data](#sec-load_data)
* [Neural Network Models](#sec-NN)
* [Comparison with Traditional Machine Learning Models](#sec-ML)


<a id="sec-setup"></a>
## 1. Setup Instructions

If you are opening this notebook from the Wahab OnDemand interface, you're all set.

If you see this notebook elsewhere and want to perform the exercises on the Wahab cluster, please follow the steps outlined in our setup procedure.

1. Make sure you have activated your HPC service.
2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
3. Create a new Jupyter session with the following parameters: Python version **3.7**, Python suite `tensorflow 2.6 + pytorch 1.10`, Number of Cores **4**, Number of GPU **0**, Partition `main`, and Number of Hours at least **4**. (See <a href="https://wiki.hpc.odu.edu/en/ood-jupyter" target="_blank">ODU HPC wiki</a> for more detailed help.)
4. From the JupyterLab launcher, start a new Terminal session. Then issue the following commands to get the necessary files:

       mkdir -p ~/CItraining/module-nn
       cp -pr /shared/DeapSECURE/module-nn/. ~/CItraining/module-nn

Using the file manager on the left sidebar, now change the working directory to `~/CItraining/module-nn`.
The file name of this notebook is `NN-session-2.ipynb`.

### 1.1 Reminder

* Throughout this notebook, `#TODO` is used as a placeholder that you need to replace with appropriate code.

* To run the code in a cell, press `Shift+Enter`.

* <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" target="_blank">Pandas cheatsheet</a>

* <a href="https://deapsecure.gitlab.io/deapsecure-lesson02-bd/10-pandas-intro/index.html#summary-indexing-syntax" target="_blank">Summary table of the commonly used indexing (subscripting) syntax</a> from our own lesson.

* <a href="https://keras.io/api/" target="_blank">Keras API document</a>

We recommend you open these in separate tabs or print them;
they are handy references when writing your own code.

### 1.2 Loading Python Libraries

We need to import the required libraries into this Jupyter Notebook:
`pandas`, `numpy`, `matplotlib.pyplot`, `sklearn`, and `tensorflow`.
Keras is now part of TensorFlow.

In [None]:
import os
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# CUSTOMIZATIONS (optional)
np.set_printoptions(linewidth=1000)

%matplotlib inline

In [None]:
# tools for machine learning:
import sklearn

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix

# classic machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [None]:
# tools for deep learning:
import tensorflow as tf
import tensorflow.keras as keras

# Import key Keras objects
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

<a id="sec-load_data"></a>
## 2. Loading Sherlock Application Dataset

First of all, we must repeat the data preparation steps, including data wrangling and exploration, on this bigger "18-apps" dataset.
We will run through typical preprocessing steps below.
While executing the code, please read it and understand which steps are needed to make the data ready for machine learning modeling.

In [None]:
df = pd.read_csv("sherlock/sherlock_18apps.csv", index_col=0)

## Summarize the dataset
print("* shape:", df.shape)
print()
print("* info::\n")
df.info()
print()
print("* describe::\n")
print(df.describe().T)
print()

### 2.1 Exploring the SherLock "18-apps" Dataset

Please use the standard pandas functions to explore the new table (e.g. `head()`, `tail()`, and so on).

**QUESTION**:

1. How many features exist in the original table?
2. From the pandas output in the previous cell, do you see any irregularities in the dataset?
3. What are the names of the applications contained in this "18-apps" dataset?
   Do you recognize some of these apps?
4. What are the frequencies of these apps in the dataset?
   Are there apps that are overrepresented or underrepresented in the dataset?
   According to this data, which apps are used most often by this user?

In [None]:
# Use this cell to do your exploration. If needed, add new cells below.

Find the counts (frequencies) of apps in the ApplicationName column.
*Hint*: Use the `value_counts()` method of a pandas column (a `Series`).

In [None]:
"""Find the counts (frequencies) of apps in the ApplicationName column.""";
#app_frequencies = df[#TODO].#TODO
#print(app_frequencies)
#print('Total num of apps = ', len(app_frequencies))

### 2.2 Data Cleaning and Preprocessing

**EXERCISE**: Based on the irregularities discovered above, write the Python code to clean the data so that we can use this dataset for analysis and machine learning.
We will repeat many steps we took for the 2-apps dataset here.

> We encourage you to perform data exploration and identify issues with the data before running these cleaning steps.
> However, the complete code for cleaning and preprocessing is given at the end of this notebook so that we can all get to the neural network modeling more quickly, which is the core mission of this module.

In [None]:
"""Enter your data cleaning procedure below""";

#TODO

*HINT*: Use the following minimal code skeleton as a starting point:
```python
df2 = df.drop(FIXME, axis=1)
df2.dropna(FIXME)
```

> **STOP**: Have you done your data cleaning yet? You cannot proceed to the next step with dirty data.
> Solution for data cleaning can be found at the very end of this notebook, if needed.

After the data is cleaned, the label column needs to be separated from the features.

In [None]:
"""Separate labels from the features""";
#labels = df2#TODO
#df_features = df2.drop(#TODO)

#### One-Hot Encoding

When using neural networks to do a classification task, we need to encode the labels using **one-hot encoding**.
This is necessary because many machine learning algorithms require numerically meaningful input and output variables.
With one-hot encoding, each app label is converted into a vector of 18 integers (one per app) in which only the position belonging to that app has a value of 1; all others are zeros.
Here is an illustration:

| App name                  | One-hot representation                                |
|---------------------------|-------------------------------------------------------|
| Calendar                  | `1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0`                 |
| Chrome                    | `0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0`                 |
| ES File Explorer          | `0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0`                 |
| ...                       |                                                       |
| WhatsApp                  | `0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0`                 |
| Zelle                     | `0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1`                 |

For more information on why we need one-hot encoding, see these articles:

* https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
* https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

> We did not have to do one-hot encoding in scikit-learn, because ML objects such as `DecisionTreeClassifier` handle the label encoding for us behind the scenes.
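
To make the encoding concrete, here is a minimal sketch on a toy label column. The four labels below are made up for illustration (they are not rows from the SherLock data), and the variable names `toy_labels`, `toy_onehot`, and `recovered` are our own. It also shows that the encoding is reversible: the original label is simply the name of the column that holds the `1`.

```python
import pandas as pd

# Toy labels for illustration only; not taken from the SherLock dataset.
toy_labels = pd.Series(["Chrome", "Calendar", "WhatsApp", "Chrome"],
                       name="ApplicationName")

# One column per distinct label, in sorted order.
# dtype=int forces 0/1 values (newer pandas versions default to True/False).
toy_onehot = pd.get_dummies(toy_labels, dtype=int)
print(toy_onehot)

# Every row contains exactly one 1.
print(toy_onehot.sum(axis=1).tolist())

# Recover the text label: the column name where the row has its maximum.
recovered = toy_onehot.idxmax(axis=1)
print((recovered == toy_labels).all())
```

The same `idxmax(axis=1)` idiom should work on `df_labels_onehot` if you ever need to map one-hot rows back to app names.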

In [None]:
df_labels_onehot = pd.get_dummies(labels)

With one-hot encoding, there is a `1` in a distinct position for every category and `0` everywhere else.
The cell below shows the first five rows; notice that there is only a single `1` in each row, with the rest being `0`.

In [None]:
df_labels_onehot.head()

Similarly, any input features that are of categorical data type will also have to be encoded using either integer encoding or one-hot encoding.

In [None]:
"""Perform one-hot encoding for **all** categorical features."""
print("Step: Converting all non-numerical features to one-hot encoding.")
# This will be explained later
df_features = pd.get_dummies(df_features)

**QUESTION**: Does anything change after one-hot encoding? Does the number of columns change?

In [None]:
"""Inspect the most current dataframe contents:""";
#df_features.head()

In [None]:
"""Step: Feature scaling using StandardScaler."""
print("Step: Feature scaling with StandardScaler")

# keep the unscaled feature matrix under a different name:
df_features_unscaled = df_features
scaler = preprocessing.StandardScaler()
scaler.fit(df_features_unscaled)

# Recast the features still in a dataframe form
df_features = pd.DataFrame(scaler.transform(df_features_unscaled),
                           columns=df_features_unscaled.columns,
                           index=df_features_unscaled.index)
print("After scaling:")
print(df_features.head(10))
print()
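
For readers who want to see what the scaling step does numerically, the following is a minimal sketch using only NumPy. It assumes the default `StandardScaler` behavior, which subtracts each column's mean and divides by its population standard deviation; the small matrix `X` is made up for illustration.

```python
import numpy as np

# Made-up feature matrix: 4 records, 2 features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

mean = X.mean(axis=0)        # per-column mean
std = X.std(axis=0)          # per-column population std (ddof=0)
X_scaled = (X - mean) / std  # standardize each column independently

print(X_scaled)
print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```

After scaling, every column has zero mean and unit standard deviation, so no single feature dominates the others merely because of its units.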

In [None]:
"""Step: Perform train-validation split on the master dataset.
This should be the last step before constructing & training the model.
"""
# fraction of the data reserved for the validation dataset
val_size = 0.2
# if reproducibility is desired:
#random_state = 34
# for the lesson:
random_state = np.random.randint(1000000)

In [None]:
print("Step: Train-validation split  val_size=%s  random_state=%s" \
      % (val_size, random_state))

train_features, val_features, train_L_onehot, val_L_onehot = \
    train_test_split(df_features, df_labels_onehot,
                     test_size=val_size, random_state=random_state)

print("- training dataset:   %d records" % (len(train_features),))
print("- validation dataset: %d records" % (len(val_features),))
print("Now the data is ready for machine learning!")
sys.stdout.flush()

After the cell above is executed, you will find new variables defined that hold the training & validation data:

* `df_features`: DataFrame of the features for the machine learning models
* `labels`: The labels (expected output of the ML models)
* `train_features` = training data's features
* `val_features` = validation data's features
* `train_L_onehot` = training data's labels (one-hot encoded)
* `val_L_onehot` = validation data's labels (one-hot encoded)

> The `random_state` argument above is optional.
> It is used to force reproducible results (in this case, a reproducible train/validation split), because `train_test_split` uses random numbers to shuffle the data before splitting it.

(Extra step)
We need to repeat the same train/validation split with the non-one-hot labels, which will be needed for the comparison with traditional ML:

In [None]:
train_features_x, val_features_x, train_labels, val_labels = \
    train_test_split(df_features, labels,
                     test_size=val_size, random_state=random_state)

(This is a case where the fixed `random_state` value is important so that we can get identical shuffling for the two `train_test_split` function calls.)
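
The sketch below illustrates this point on a toy dataset (all names and values here are made up for illustration). Two calls to `train_test_split` with the same `random_state` and the same number of rows should select the same rows, which we can check by comparing the pandas indices of the results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 10 records, 2 features, 2 classes.
toy_features = pd.DataFrame({"f1": np.arange(10), "f2": np.arange(10) * 2.0})
toy_labels = pd.Series(list("ababababab"), name="label")
toy_onehot = pd.get_dummies(toy_labels, dtype=int)

seed = 34
# First split: features + one-hot labels
tr_F1, va_F1, tr_onehot, va_onehot = train_test_split(
    toy_features, toy_onehot, test_size=0.2, random_state=seed)
# Second split: features + text labels, same seed
tr_F2, va_F2, tr_labels, va_labels = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=seed)

# The same rows end up in the training and validation sets of both splits.
same_rows = (tr_onehot.index.equals(tr_labels.index)
             and va_onehot.index.equals(va_labels.index))
# The one-hot labels and the text labels agree row by row.
consistent = (tr_onehot.idxmax(axis=1) == tr_labels).all()
print(same_rows, consistent)
```

A similar index comparison between `train_L_onehot` and `train_labels` is a cheap sanity check before comparing neural network and scikit-learn results.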

### 2.3 Inspecting Preprocessed SherLock "18-apps" Data

**EXERCISE**:
Take a peek at the training feature DataFrame *after* the preprocessing steps.
What can you learn from this?

In [None]:
"""Take a peek at the training feature DataFrame.""";
#TODO

**QUESTIONS:**

- How many features are there for each record?
- How many applications are there in the whole dataset?
- How many records are there in the training and validation datasets, respectively?

<a id="sec-NN"></a>

## 3. Building Neural Networks to Classify Applications

Let us now proceed by building some neural network models to classify smartphone apps.

### 3.1 First Model: No Hidden Layer

The simplest neural network model will have no hidden layer. The following is an example of a neural network model without any hidden layers:

In [None]:
def NN_Model_no_hidden(learning_rate):
    """Definition of deep learning model with no hidden layer"""
    model = Sequential([
        Dense(18, activation='softmax', input_shape=(19,),
              kernel_initializer='random_normal')
    ])
    # `learning_rate` is the current argument name; `lr` is deprecated
    adam_opt = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
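
To see what this single softmax layer computes, here is a minimal NumPy sketch of its forward pass: class probabilities are `softmax(x @ W + b)`, which has the same mathematical form as multinomial logistic regression. The weights and inputs below are random placeholders, not trained values, and the helper name `softmax` is our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 19, 18

# Placeholder parameters (random, untrained) with the shapes used by the layer.
W = rng.normal(scale=0.05, size=(n_features, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Four made-up (already scaled) input records.
x = rng.normal(size=(4, n_features))
probs = softmax(x @ W + b)

print(probs.shape)            # one probability per class for each record
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # index of the most probable class
```

Training adjusts `W` and `b` so that the largest probability in each row lands on the correct app.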

Let us now train this model with an initial *learning rate* of 0.0003 and observe what happens.

In [None]:
model_0 = NN_Model_no_hidden(0.0003)
model_0_history = model_0.fit(train_features,
            train_L_onehot,
            epochs=5, batch_size=32,
            validation_data=(val_features, val_L_onehot),
            verbose=2)

The `fit` function above returns the model's training history as a `History` object.
Of particular importance is its `history` attribute, a dictionary containing the values of the loss function, accuracy, etc. as computed during training:

In [None]:
model_0_history.history.keys()

In [None]:
model_0_history.history

This history can be recast as a DataFrame for easy inspection and/or saved as a CSV file:

In [None]:
df_model_0_history = pd.DataFrame(model_0_history.history)

In [None]:
df_model_0_history

In [None]:
df_model_0_history.to_csv('model_0_history.csv')

### 3.2 Visualizing Training History

Visualization of training history is a helpful aid in understanding the progress of the model training.
The model's `fit` function returns an object that contains the history of the loss function, accuracy, and potentially other metrics computed during every epoch of the training.
We can use these data to create a few graphs:

- A plot of accuracy on the training and validation datasets over training epochs.
- A plot of loss on the training and validation datasets over training epochs.

In [None]:
def plot_loss(model_history):
    # summarize history for loss
    plt.plot(model_history.history['loss'])
    plt.plot(model_history.history['val_loss'])
    plt.title('Model Loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper right')
    plt.show()

In [None]:
def plot_acc(model_history):
    # summarize history for accuracy
    plt.plot(model_history.history['accuracy'])
    plt.plot(model_history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()

In [None]:
plot_loss(model_0_history)
plot_acc(model_0_history)

In the next notebook we will ask the following questions:

1. What is the effect of changing the learning rate?
   (Examples: 0.03, 0.003, or 0.00003)
   
2. What will happen if we increase the `epochs` value? (To 10, 20?)

3. What is the ultimate accuracy of a no-hidden-layer model compared to Decision Tree and Logistic Regression?

**QUESTION**:

* Does this training result look converged to you? (The training has converged to a solution if the changes of the loss function and accuracy between two consecutive epochs have dropped to a very small value--for example, the values have changed by 0.1% or less between the two epochs.)

* If this training does not look converged, what can you do?
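
The convergence criterion described above can be checked with a few lines of Python. This is a minimal sketch: the helper names `relative_changes` and `looks_converged` are our own, the 0.1% tolerance follows the example in the question, and the loss values are made up for illustration.

```python
def relative_changes(values):
    """Relative change between each pair of consecutive epochs."""
    return [abs(curr - prev) / abs(prev)
            for prev, curr in zip(values[:-1], values[1:])]

def looks_converged(values, tol=1e-3):
    """True if the last epoch-to-epoch relative change is at most `tol`."""
    changes = relative_changes(values)
    return bool(changes) and changes[-1] <= tol

# Made-up loss values for illustration only.
fake_loss = [2.0, 1.5, 1.2, 1.19, 1.1895]

print(relative_changes(fake_loss))
print(looks_converged(fake_loss))       # last change is about 0.04%
print(looks_converged(fake_loss[:4]))   # last change is about 0.8%
```

In the notebook, you could pass `model_0_history.history['loss']` or `model_0_history.history['val_accuracy']` instead of `fake_loss`.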

### 3.3 One Hidden Layer

Apparently, the first NN model that we created above did not perform very well.
One way to improve the performance of a NN model is to add one or more hidden layers.
The function below defines a model with one hidden layer and an output layer, and uses the `Adam` optimizer that was used in the previous notebook.
We will use this function to test the performance based on different parameters (number of hidden neurons, hidden layers, learning rate, etc.).
To start, let us try an example of a model with 1 hidden layer, `18` hidden neurons, and a learning rate of `0.0003`.

In [None]:
def NN_Model_1H(hidden_neurons, learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    model = Sequential([
        # More hidden layers can be added here
        Dense(hidden_neurons, activation='relu', input_shape=(19,),
              kernel_initializer='random_normal'), # Hidden Layer
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')  # Output Layer
    ])
    # `learning_rate` is the current argument name; `lr` is deprecated
    adam_opt = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
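
Adding a hidden layer increases the number of trainable parameters. The sketch below counts them by hand, assuming 19 input features and 18 output classes as in the model definitions above; the helper names `dense_params` and `one_hidden` are our own.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units

n_features, n_classes = 19, 18

# No hidden layer: inputs connect directly to the 18 outputs.
no_hidden = dense_params(n_features, n_classes)
print("no hidden layer:", no_hidden)   # 19*18 + 18 = 360

def one_hidden(hidden_neurons):
    """Total parameters of a model with one hidden layer."""
    return (dense_params(n_features, hidden_neurons)
            + dense_params(hidden_neurons, n_classes))

for h in (4, 18, 80):
    print("hidden_neurons = %2d: %d parameters" % (h, one_hidden(h)))
```

You can compare these numbers with the "Total params" reported by `model.summary()` on a model built with the function above.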

In [None]:
model_1 = NN_Model_1H(18, 0.0003)
model_1_history = model_1.fit(train_features,
                              train_L_onehot,
                              epochs=10, batch_size=32,
                              validation_data=(val_features, val_L_onehot),
                              verbose=2)

In [None]:
plot_loss(model_1_history)
plot_acc(model_1_history)

## Extra Activities: Prelude to Model Tuning (Self-Exploration)

Now that we know how to use the function `NN_Model_1H`, we are ready to run a variety of tests using different hyperparameters.
This will be a topic for the subsequent notebook.
But those who are curious can get ahead by training similar models which differ in the number of hidden neurons.

  - What happens to the model's accuracy if we *increase* the number of hidden neurons (e.g. **25, 36, 40, 80** or beyond)?
  - What happens to the model's accuracy if we *decrease* the number of hidden neurons (e.g. **12, 8, 4, 2, 1**)?

We will run these tests in the next notebook, where we perform *model tuning* to find the optimal network architecture to perform the 18-apps classification task.

> **HINT:**
> The easiest way to do this exploration is to simply copy the code in the cell above and paste it in a new cell below, since most of the hyperparameters (`hidden_neurons`, `learning_rate`, `batch_size`, etc.) can be changed when calling the function or when fitting the model.

In [None]:
"""Start self exploration here""";


## Remarks

This process of experimentation with different parameters for the neural network can get repetitive and cause this notebook to become very long.
Instead, it would be more beneficial to run experiments like this in a scripting environment.
To do this, we need to identify the relevant code elements for our script.
In a general sense, this is what we should pick out:

* Useful Python libraries & user-defined functions
* Proper sequence of commands that were run throughout this notebook (e.g. one-hot encoding must be done before training the models)
* Code cells that require repetition to run many tests (e.g. the cells right above this section)

In brief, once the initial experiments are done and we have established a working pipeline for machine learning, we need to change the way we work.
Real machine-learning work requires many repetitive experiments, each of which may take a long time to complete.
Instead of running many experiments in Jupyter notebooks, where we must wait for each one to finish, we need to be able to carry out many experiments in parallel so that we can obtain our results in a timely manner.
This is the key reason why we should make a script for these experiments and submit the script to run them in batch (non-interactive) mode.
HPC is well suited for this type of workflow; in fact, it is most efficient when used in this way.
Here are the key components of the "batch" way of working:

* A job scheduler (such as SLURM job scheduler on HPC) to manage our jobs and run them on the appropriate resources;
* The machine learning script written in Python, which will read inputs from files and write outputs to files and/or standard output;
* The job script to launch the machine learning script in the non-interactive environment (e.g. HPC compute node);
* A way to systematically repeat the experiments with some variations. This can be done by adding some command-line arguments for the (hyper)parameters that will be varied for each test.
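
As an illustration of the last point, here is a minimal sketch of how a training script could accept hyperparameters on the command line using Python's standard `argparse` module. The option names and defaults are our own choices for illustration; they are not taken from the sample script in the hands-on package.

```python
import argparse

def parse_args(argv=None):
    """Parse hyperparameters; option names here are illustrative only."""
    parser = argparse.ArgumentParser(
        description="Train a one-hidden-layer model (sketch).")
    parser.add_argument("--hidden-neurons", type=int, default=18)
    parser.add_argument("--learning-rate", type=float, default=0.0003)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    return parser.parse_args(argv)

# In a real script, call parse_args() with no argument to read sys.argv.
# Here we pass an explicit list so the example runs anywhere.
args = parse_args(["--hidden-neurons", "64", "--learning-rate", "0.003"])
print(args.hidden_neurons, args.learning_rate, args.epochs, args.batch_size)
```

With this pattern, a job script can launch the same Python file many times, each time with a different set of option values.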

In your hands-on package, there is a folder called `expts-sherlock` which contains a sample Python script and SLURM job script that you can submit to the HPC cluster:

* `NN_Model-064n.py` shows an example of what a script converted from this notebook would look like.
  We recommend only one experiment per script to avoid complication.

* `NN_Model-064n.wahab.job` is the corresponding job script for ODU's Wahab cluster.

<a id="sec-ML"></a>
## 4. Comparison with Traditional Machine Learning 

Now let us try the traditional machine learning algorithms learned in the previous session.
Here we test **Decision Tree** and **Logistic Regression**.
To simplify the code, we will use the `model_evaluate` function to evaluate the performance of a machine learning model whose `predict` method returns class labels, as the scikit-learn classifiers do.

In [None]:
def model_evaluate(model,test_F,test_L):
    test_L_pred = model.predict(test_F)
    print("Evaluation by using model:",type(model).__name__)
    print("accuracy_score:",accuracy_score(test_L, test_L_pred))
    # Uncomment the following line to show the confusion matrix:
    #print("confusion_matrix:","\n",confusion_matrix(test_L, test_L_pred))
    return

> **NOTE**: You can uncomment the print statement above if you'd like to examine the confusion matrix whenever you evaluate the model.
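
Note that a Keras model's `predict` method returns an array of class probabilities (one row per record, one column per class) rather than text labels, so its output needs one extra conversion step before it can be compared with text labels. The following is a minimal sketch of that step using made-up probabilities and three made-up class names; in this notebook, the class names would correspond to the columns of `df_labels_onehot`.

```python
import numpy as np

# Illustrative class names and probabilities (not real model output).
class_names = np.array(["Calendar", "Chrome", "WhatsApp"])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
true_labels = np.array(["Calendar", "Chrome", "WhatsApp", "Chrome"])

# Pick the most probable class in each row, then map indices to names.
pred_labels = class_names[probs.argmax(axis=1)]

# Fraction of records whose predicted label matches the true label.
accuracy = (pred_labels == true_labels).mean()
print(pred_labels)
print("accuracy:", accuracy)
```

Once the predictions are in label form, they can be passed to `accuracy_score` or `confusion_matrix` in the same way as the scikit-learn predictions.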


The scikit-learn models below are trained on the original text labels (`train_labels`) rather than the one-hot encoded labels:

In [None]:
ML_dtc = DecisionTreeClassifier(criterion='entropy',
                                max_depth=6,
                                min_samples_split=8)
%time ML_dtc.fit(train_features, train_labels)

In [None]:
model_evaluate(ML_dtc, val_features, val_labels)

In [None]:
ML_log = LogisticRegression(solver='lbfgs')
%time ML_log.fit(train_features, train_labels)

In [None]:
model_evaluate(ML_log, val_features, val_labels)

**QUESTIONS**:

* Do you notice issues with the training process of any of the models above?
* (Optional) Can you find a way to ensure full convergence of the training?

By now, we have a pretty good background knowledge about this dataset.
And we know the accuracy scores we can get by using the Decision Tree and Logistic Regression methods,
which are reasonably good, but not close to 99%.

### Timing the Computation

Did you notice that training the logistic regression model takes a while?
Often we want to know *how long* a computation actually takes.
We can get this timing easily in Jupyter by prepending `%time` to the Python statement whose execution time we would like to measure.
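
The `%time` magic only works inside Jupyter/IPython. In a plain Python script, a similar wall-clock measurement can be made with the standard `time` module. The sketch below uses a hypothetical helper named `timed` and a trivial computation as a stand-in for model training; unlike `%time`, it reports wall time only.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs); return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload for illustration; replace with e.g. a model's fit call.
result, seconds = timed(sum, range(1_000_000))
print("result = %d, wall time = %.4f s" % (result, seconds))
```

For example, `timed(ML_log.fit, train_features, train_labels)` would time the logistic regression training in the same way.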

> #### About the Warning Message
>
> The training phase ends with a warning:
>
> ```
> ConvergenceWarning: lbfgs failed to converge (status=1):
> STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
> ```
>
> This happens because the solver has not reached convergence when the maximum number of iterations (`max_iter`, default 100) is reached.
> You may want to investigate by trying different solvers in the `LogisticRegression` object.
> Try the Scikit-learn documentation on [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), the `solver` argument, if you are interested.
> That may confirm whether a reasonable solution has indeed been reached (i.e. different solvers yield about the same accuracy). 
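
One way to investigate this is to vary the `max_iter` argument and watch whether the warning goes away and whether the accuracy changes. The sketch below does this on a small synthetic dataset generated by scikit-learn (not the SherLock data), so the numbers it prints say nothing about the 18-apps problem; it only illustrates the procedure.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class dataset for illustration only.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

for max_iter in (5, 1000):
    clf = LogisticRegression(solver='lbfgs', max_iter=max_iter)
    # Record warnings instead of printing them, so we can inspect them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    print("max_iter=%4d  iterations used=%s  warning=%s  train accuracy=%.3f"
          % (max_iter, clf.n_iter_, hit_limit, clf.score(X, y)))
```

If the accuracy barely changes once the warning disappears, the earlier solution was probably already reasonable.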

# Appendix A: Solutions

## Solution: Data Cleaning (Sec. 2.2)

The bare minimum cleaning steps for the SherLock "18-apps" data are:

```python
df2 = df.drop(['cminflt', 'guest_time'], axis=1)
df2.dropna(inplace=True)
```

### Alternative Solution (Advanced Users)

Verbose code is often helpful especially when automating the machine learning workflow.
The following segments of code are examples of self-documenting code which prints clear messages as it processes the data.

**STEP 1**: Remove columns that are entirely missing or carry no information.
```python
# Missing data or bad data
del_features_bad = [
    'cminflt', # all-missing feature
    'guest_time', # all-flat feature
]
df2 = df.drop(del_features_bad, axis=1)

print("Cleaning:")
print("- dropped %d columns: %s" % (len(del_features_bad), del_features_bad))
```
Output:
```
Cleaning:
- dropped 2 columns: ['cminflt', 'guest_time']
```

**STEP 2**: Remove rows with missing data.
```python
print("- remaining missing data (per feature):")

isna_counts = df2.isna().sum()
print(isna_counts[isna_counts > 0])
print("- dropping the rest of missing data")

df2.dropna(inplace=True)

print("- remaining shape: %s" % (df2.shape,))
```
Output:
```
- remaining missing data (per feature):
CPU_USAGE      52
cutime         52
num_threads    52
priority       52
rss            52
state          52
stime          52
utime          52
vsize          52
dtype: int64
- dropping the rest of missing data
- remaining shape: (273077, 17)
```