# Session 5: Assignment

```{contents}

```

## The mechanism of the Dropout layer

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Activation
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Input(...))
model.add(Dense(...)) # no activation
model.add(Dropout(0.2))
model.add(Activation(...))
```

In the example above, we create a Dropout layer with a dropout rate of `0.2`.

That is, during training `each unit` of the Dense layer's output has a 20% probability of being set to 0.

The remaining values are scaled up according to the formula

$$
\text{new value} = \text{old value} * \frac{1}{1-\text{rate}}
$$
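As a quick check of the formula above, with `rate = 0.2` every value that is kept is multiplied by 1.25:

$$
\text{new value} = \text{old value} * \frac{1}{1-0.2} = 1.25 * \text{old value}
$$

This scaling keeps the expected value of each unit unchanged, since a unit is kept with probability $1-\text{rate}$ and set to 0 with probability $\text{rate}$:

$$
(1-\text{rate}) * \frac{\text{old value}}{1-\text{rate}} + \text{rate} * 0 = \text{old value}
$$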

Let's go through the detailed example below to better understand this.

In [None]:
from tensorflow.random import set_seed
from tensorflow.keras.layers import Dropout

layer = Dropout(0.2, input_shape=(2,))

Create a matrix with shape = $(5,2)$ representing the output of the Dense layer

In [None]:
import numpy as np

np.random.seed(42)

data = np.arange(10).reshape(5, 2).astype(np.float32)
print(data)

Pass the above matrix through the Dropout layer

In [None]:
set_seed(42)

outputs = layer(data, training=True)
print(outputs)
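To make the mechanism concrete, the sketch below reproduces the same behaviour in plain NumPy. It is only an illustration: the helper name `inverted_dropout` is hypothetical, and its random mask will not match the one Keras draws, so the positions of the zeros will differ from the output above.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Illustrative helper (not the Keras implementation)."""
    # Keep each unit with probability 1 - rate
    keep_mask = rng.random(x.shape) >= rate
    # Kept units are scaled by 1 / (1 - rate); dropped units become 0
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
data = np.arange(10).reshape(5, 2).astype(np.float32)

outputs = inverted_dropout(data, rate=0.2, rng=rng)
print(outputs)
```

Every entry of `outputs` is either 0 or the original value multiplied by 1.25.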

## A simple Spam Classification

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Prepare the dataset

In this section, we will practice using an `MLP` model to classify phone messages (SMS) as spam or normal messages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/spam.csv', encoding='latin-1', usecols=[0,1],names=['Label','SMS'], header=0)
df.head()

Meaning of the values in the **Label** column:
- **ham** means a normal message
- **spam** means a spam message (ads, phishing, etc.)



#### TODO 1
- Print out **5 spam messages** and **5 ham messages**

In [None]:
# YOUR SOLUTION
spam = df[df.Label == 'spam']
spam[:5]

In [None]:
ham = df[df.Label == 'ham']
ham[:5]

We count the number of messages belonging to each **Label** to see whether this is a **balanced dataset** or an **imbalanced dataset**.

In [None]:
df['Label'].value_counts()

We take the data in the **SMS** column as the input **x**.

Then we convert **Label** into numbers:
- **ham** = 0
- **spam** = 1

In [None]:
# retrieve the data from the SMS column, then convert into numpy array
x = df['SMS'].values

# retrieve the data of the Label column
# binary mapping and convert into numpy array
y = df['Label'].map({'ham':0,'spam':1}).values

### Data Conversion

Remember that AI models take in data in the form of numbers. Therefore, we will transform our dataset from strings of characters into numbers.

There are many ways to do this transformation. Here we will use the simplest method, called **Count Vectorizer**. For example, we have a dataset of 4 sentences as follows:

In [None]:
example_data = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
  'Is this the first document?',
]

Perform **CountVectorizer** on `example_data`

In [None]:
from sklearn.feature_extraction.text import CountVectorizer  # import from the sklearn library

transformer = CountVectorizer()  # initialize CountVectorizer
transformer.fit(example_data)  # fit CountVectorizer on example_data

example_features = transformer.transform(example_data).toarray()  # create the feature matrix
print(example_features)

To understand the numbers in the variable `example_features`, we need to understand how the **Count Vectorizer** transformation works.

How the **Count Vectorizer** transformation works
- Step 1: Split the dataset into a list of distinct words (the vocabulary)

In [None]:
# We can see the result of step 1 by the following function
print(transformer.get_feature_names_out())

We see that `example_data` can be split into **9 distinct words**. `example_data` has 4 sentences

$\rightarrow$ `example_features` will have the shape ``(4, 9)``


In [None]:
print('Shape:',example_features.shape)

- Step 2: Match each sentence in the original data against the vocabulary from step 1. The resulting counts are the features of each sentence in the dataset. Example:
  - Sentence 1: `'This is the first document.'`. At each position, the number `n` means that the corresponding word appears `n` times in the sentence.
  ```
  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  [0 1 1 1 0 0 1 0 1]
  ```
  - Sentence 2: `'This document is the second document.'` The word **document** appears twice in this sentence $\rightarrow$ the value at position [1] is 2
  ```
  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  [0 2 0 1 0 1 1 0 1]
  ```
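The two steps above can be sketched in plain Python. This is a simplified illustration, not sklearn's actual implementation: it assumes the default `CountVectorizer` behaviour of lowercasing the text and keeping only tokens of two or more word characters, and the helper names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

example_data = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(sentence):
    # Lowercase, then keep tokens made of at least two word characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

# Step 1: build the sorted list of distinct words (the vocabulary)
vocabulary = sorted({token for s in example_data for token in tokenize(s)})

def count_vector(sentence):
    # Step 2: count how many times each vocabulary word appears in the sentence
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

features = [count_vector(s) for s in example_data]

print(vocabulary)
for row in features:
    print(row)
```

The first two rows printed match the vectors shown for Sentence 1 and Sentence 2.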

We will apply **Count Vectorizer** to the SMS dataset

In [None]:
# fit the vectorizer on all messages in x
transformer = CountVectorizer(stop_words='english').fit(x)
# transform the messages into a sparse matrix of word counts
x_vectors = transformer.transform(x)

print('Shape of x_vectors:', x_vectors.shape)

That is, in the dataset there are
- 5572 messages
- 8404 distinct words, so each sample in the dataset is represented by a feature vector of length 8404

Also, when initializing `CountVectorizer`, we pass the parameter `stop_words="english"`.

In language processing, the term `stop_words` refers to words that are used very frequently in a language but carry little meaning on their own. These words are ignored when the `CountVectorizer` builds its vocabulary. Run the cell below to see the list of English `stop_words`.



In [None]:
transformer.get_stop_words()

In [None]:
len(transformer.get_stop_words())

If you are not satisfied with the built-in list, you can create your own `list` of stop words and pass it to the `CountVectorizer`.

### Train Test Split

#### TODO 2
Divide the dataset into Train and Test sets with:
- Test set size is 20% of the total data
- Use a `stratified split`
- Shuffle the data
- Use `random_state=42`

In [None]:
# YOUR SOLUTION
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_vectors, y, test_size=0.2, shuffle=True, random_state=42, stratify=y)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
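To see what the stratified split does, here is a small sketch on made-up labels (80 ham, 20 spam). The arrays `toy_x` and `toy_y` are hypothetical and unrelated to the SMS data. With `stratify`, the train and test sets keep the same class ratio as the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% of them labelled 1 (spam)
toy_y = np.array([0] * 80 + [1] * 20)
toy_x = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, shuffle=True, random_state=42, stratify=toy_y
)

# Both ratios stay at 0.2, the same as in the full toy dataset
print('Train spam ratio:', toy_y_train.mean())
print('Test spam ratio:', toy_y_test.mean())
```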

### Train the model

#### TODO 3

Apply the MLP model to classify the above dataset. Once you're satisfied with the training results, draw a Confusion Matrix and print out a `classification_report`.

Comment on the predicted results of the model based on the Confusion Matrix and Classification Report.
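As a reminder of the two sklearn functions involved, here is a minimal sketch on made-up labels. `toy_true` and `toy_pred` are hypothetical and are not the output of any model. In your solution, use the test labels and your model's predictions instead.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels for illustration only (0 = ham, 1 = spam)
toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(toy_true, toy_pred)
print(cm)

print(classification_report(toy_true, toy_pred, target_names=['ham', 'spam']))
```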

In [None]:
# YOUR SOLUTION


Visualization

In [None]:
# YOUR SOLUTION

Is there any sign of overfitting?

Confusion Matrix here

In [None]:
# YOUR SOLUTION

Classification Report

In [None]:
# YOUR SOLUTION

#### Your comment