# Homework of Ch5. Long Short-Term Memory
----
This is the homework of TU-ETP-AD1062 Machine Learning Fundamentals.

For more information, please refer to:
https://sites.google.com/view/tu-ad1062-mlfundamentals/

For original dataset information, please visit:
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, SimpleRNN
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import Adadelta
from keras.preprocessing import sequence
from keras.preprocessing import text

import numpy as np
import pickle
import sklearn.model_selection
import csv

import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

## 1. Load and Pre-process the SMS Spam Collection Dataset
----
The code here demonstrates how to load the `SMS Spam Collection` dataset. It also includes the following pre-processing steps:
1. Load the SMS Spam Collection dataset
2. Define the following parameters:
    * Vocabulary size for one-hot encoding (`vocab_size`)
    * Maximum length of each sentence (`max_len`)
    * Maximum features, which is the output dimension of the word embedding (`max_features`)
    * Batch size for training (`batch_size`)
3. Conduct one-hot encoding: Use `keras.preprocessing.text.one_hot`, which helps you
    * Segment the sentence into an array of words
    * Encode each word as an integer index via one-hot encoding  
    Note that you should encode both your training data and your test data
4. Prepend 0s to each one-hot-encoded sentence to pad it to a fixed length: Use `keras.preprocessing.sequence.pad_sequences`

> **Your task**:  
> Complete steps 2, 3, and 4 above
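
To make steps 3 and 4 concrete, here is a minimal pure-Python sketch of the idea behind them. It is not the Keras implementation: `toy_encode` and `toy_pad` are hypothetical helpers, and the `zlib.crc32` hash is an assumption chosen only to keep the output deterministic. The real `one_hot` and `pad_sequences` functions handle punctuation filtering, truncation, and dtypes themselves.

```python
import zlib

def toy_encode(sentence, vocab_size):
    """Map each lower-cased, whitespace-separated word to an index in [1, vocab_size - 1]."""
    words = sentence.lower().split()
    # Index 0 is reserved for padding, so hashed indices start at 1.
    return [1 + zlib.crc32(w.encode("utf-8")) % (vocab_size - 1) for w in words]

def toy_pad(seq, max_len):
    """Left-pad with 0s, or keep only the last max_len items if the sequence is too long."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

encoded = toy_encode("Free entry in a weekly competition", vocab_size=1024)
padded = toy_pad(encoded, max_len=8)
print(encoded)  # six integer word indices
print(padded)   # the same indices preceded by two 0s
```

Because the same word always hashes to the same index, applying one encoding function to both the training and test sentences keeps their indices consistent. Different words can still collide on one index, which becomes more likely as `vocab_size` shrinks.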

In [None]:
# Step 1. Load SMS Spam Collection Dataset 
with open('uciml_sms_spam_train_tu-etp-ad1062-hw5.pickle', 'rb') as f_train:
    (X_train, y_train) = pickle.load(f_train)

with open('uciml_sms_spam_test_tu-etp-ad1062-hw5.pickle', 'rb') as f_test:
    X_test = pickle.load(f_test)

print("Step1. Result of original sentence:")
print("%s\n" % X_train[0])

# Step 2. Define the following parameters:
#  - Vocabulary size for one-hot encoding (vocab_size)
#  - Maximum length of each sentence (max_len)
#  - Maximum features, which is the output dimension of the word embedding (max_features)
#  - Batch size for training (batch_size)
# 
# Now it's your turn!: Adjust your parameters to maximize the performance
# ----------------------------------------------------------------
vocab_size = 1024
max_len = 128
max_features = 64
batch_size = 32

# Step 3. One-hot encoding
# 
# Now it's your turn!:
#     Use keras.preprocessing.text.one_hot to conduct one-hot encoding
#     Remember to set the parameter n to the vocab_size assigned above
# ----------------------------------------------------------------
# ...

print("Step3. Result of one_hot:")
print("%s\n" % X_train[0])

# Step 4. Prepend 0s to each one-hot-encoded sentence to pad it to a fixed length
#
# Now it's your turn!:
#     Use keras.preprocessing.sequence.pad_sequences
#     Remember to set the parameter maxlen to the max_len assigned above
# ----------------------------------------------------------------
# ...

print("Step4. Result of pad_sequences:")
print("%s\n" % X_train[0])

## 2. Construct LSTM
----
The code shown below constructs an LSTM with the following structure:
1. Embedding Layer:
    * `input_dim`: `vocab_size`, which is initialized as 1024
    * `output_dim`: `max_features`, which is initialized as 64
    * `input_length`: `max_len`, which is initialized as 128
2. LSTM Layer:
    * Unit size 64
3. Fully-connected layer
    * Unit size 256
4. Drop-out layer
5. Fully-connected layer with sigmoid activation

> **Your task**:  
> Adjust the architecture of your LSTM. Check the Keras documentation for more details, for example:
> - Embedding Layer: https://keras.io/layers/embeddings/#embedding
> - LSTM Layer: https://keras.io/layers/recurrent/#lstm
> - Dense Layer (Fully-connected layer): https://keras.io/layers/core/#dense
>
> You can also replace your `LSTM` layer with `SimpleRNN` or `GRU`
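
When resizing layers, it helps to know how many trainable weights each one adds. The sketch below works through that arithmetic for the default structure listed above, assuming the usual layer formulas with bias terms. The helper names are hypothetical, and `model.summary()` remains the authoritative count for the model you actually build.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary index.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four weight blocks (input, forget, and output gates plus the cell candidate),
    # each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size, max_features = 1024, 64
counts = {
    "embedding": embedding_params(vocab_size, max_features),  # 65536
    "lstm": lstm_params(max_features, 64),                    # 33024
    "dense_256": dense_params(64, 256),                       # 16640
    "dense_1": dense_params(256, 1),                          # 257
}
total = sum(counts.values())
print(counts)
print(total)  # 115457
```

Under these assumptions the embedding table accounts for more than half of the weights, so `vocab_size` and `max_features` are the first parameters to revisit if the model overfits.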

In [None]:
def create_lstm():
    model = Sequential()
    
    # Now it's your turn!: Adjust your LSTM
    # ----------------------------------------------------------------
    model.add(Embedding(vocab_size, max_features, input_length=max_len))
    model.add(LSTM(64))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
    
    return model

## 3. Cross-validation
----
The code shown below conducts 5-fold cross-validation on the training set using the model defined above (i.e., `create_lstm`).

> **Your task**:  
> - Keep adjusting `create_lstm()` and execute the following block
> - Make sure that you are happy with the 5-fold cross-validation result!
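
For intuition, the sketch below shows how k-fold cross-validation partitions sample indices: each fold is held out once for validation while the remaining folds are used for training. This is a simplified pure-Python illustration with a hypothetical helper `kfold_indices`; scikit-learn's own splitters may order or stratify the folds differently.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample lands in exactly one validation fold, so the five scores together reflect performance on the whole training set.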

In [None]:
# Use the batch_size defined in Step 2 so that adjusting it there takes effect here
neural_network = KerasClassifier(build_fn=create_lstm, epochs=5, batch_size=batch_size, verbose=1)

cv = 5
scores = sklearn.model_selection.cross_val_score(neural_network, X_train, y_train, cv=cv)

print("%d-fold Cross Validation Result" % cv)
print(scores)

## 4. Predict the testing set
----
The code shown below trains your `create_lstm()` model on the training set, predicts the testing data, and writes the predictions to a CSV file.

> **Your task:**
> 1. Download the testing set from the Kaggle website:  https://www.kaggle.com/t/ff59441b7e064bc2a5d8e9374cfe1a11
> 2. Put the downloaded testing set in the same location as this `*.ipynb` file
> 3. Execute the following block to:
>    - Train with `X_train` and `y_train`
>    - Predict with `X_test`
> 4. Upload your evaluation result to Kaggle (NOTICE: 5 submissions per day!)
> 5. Check your public scoreboard!
> 6. Submit your homework 5 (Google form), make sure to **Fill in your Trend Micro PSID team name!!**
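
The submission file is a two-column CSV with the header `id,answer`. As a hedged illustration of that layout, the sketch below writes a few made-up labels to an in-memory buffer using only the standard library. `to_submission_csv` is a hypothetical helper, and the labels are placeholder values, not model output.

```python
import csv
import io

def to_submission_csv(labels):
    """Return CSV text with one 'id,answer' row per label, ids counting from 0."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "answer"])
    for i, label in enumerate(labels):
        writer.writerow([i, int(label)])
    return buf.getvalue()

csv_text = to_submission_csv([0, 1, 0])  # placeholder labels
print(csv_text)
```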

In [None]:
# Train and predict the test data
neural_network.fit(X_train, y_train)
y_test_predict = neural_network.predict(X_test)

# Output as CSV
id_array = range(0,524)
# astype returns a new array, so assign the result back
y_test_predict = y_test_predict.astype('int8')
submission = np.stack((id_array, y_test_predict.reshape((len(id_array)))), axis=1)
np.savetxt("submission.csv", submission, fmt="%d", delimiter=',', header='id,answer', comments='')

print("CONGRATULATIONS! YOU'RE DONE! PLEASE SUBMIT YOUR submission.csv TO https://www.kaggle.com/t/ff59441b7e064bc2a5d8e9374cfe1a11")