# 2020 Introduction to Deep Learning Coding Ex 4: Spam Message Generator!

Contact T/A: Yeon-goon Kim, SNU ECE, CML. (ygoonkim@cml.snu.ac.kr)  

Dataset from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

In this homework, you will train a spam message generator, which is a basic RNN/LSTM/GRU char2char generation model. Your results may not be very good, but you can expect some sentence-like output if you complete this homework successfully.

## Now, we'll handle texts, not images. Are there any differences?

Of course, there are many differences between processing images and processing text. One is that text cannot be expressed directly as matrices or tensors. We know an image can be expressed as a Tensor(n_channel, width, height). But what about a text sentence? Can the word 'Homework' be expressed directly as a tensor? By what rules? With what shape? Even if it can, which word is it closer to, 'Burden' or 'Work'? This is called the 'Word Embedding Problem' and is considered one of the most important problems in Natural Language Processing (NLP) research. Fortunately, there are some general solutions to this problem (though not perfect ones), and both Tensorflow and Pytorch provide basic APIs that handle it automatically. You may use those APIs in this homework. 
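
To make the idea of an embedding concrete, here is a minimal pure-Python sketch. It is not how Tensorflow or Pytorch implement it internally; it only illustrates that an embedding is a lookup table from a token index to a vector, and that "closeness" is commonly measured with cosine similarity. The table is filled with random numbers (a trained model would learn these values), and the helpers `make_embedding_table`, `embed` and `cosine_similarity` are hypothetical names introduced for illustration.

```python
import math
import random
import string

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create a vocab_size x embedding_dim table of random values (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(text, table, alphabet=string.printable):
    """Map each character to its index, then look up the table row for that index."""
    return [table[alphabet.index(c)] for c in text]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

table = make_embedding_table(vocab_size=len(string.printable), embedding_dim=8)
vectors = embed("Homework", table)

# One vector per character, each of length embedding_dim
print(len(vectors), len(vectors[0]))
# Both 'o' characters share the same table row, so their similarity is 1
# (up to floating-point rounding)
print(cosine_similarity(vectors[1], vectors[5]))
# Different characters get unrelated random vectors before training
print(cosine_similarity(vectors[0], vectors[2]))
```

With trained weights, the hope is that related tokens end up with similar vectors; with the random table above, only identical characters are guaranteed to match.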

The other is that text is sequential data. Generally, when processing images, the input (ignoring batching) is just one image. With text, however, the input is usually one or more sentences or paragraphs, that is, a sequence of embedded characters or words. So, if we want to generate the word 'Homework' from the start token 'H', the model should behave differently when 'o' is given as input right after 'H' than when 'o' is given after 'Homew': the first 'o' should be followed by 'm', the second by 'r'. This is why we use RNN-based models in deep learning when processing text data.
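
The dependence on earlier characters can be sketched with a hand-written recurrent step. This is a toy vanilla RNN cell in pure Python with random, untrained weights; the hidden size, the weight names `W_x` and `W_h`, and the helpers `rnn_step` and `run` are assumptions made for illustration and are not part of the homework code. It shows that feeding the same character 'o' leads to different hidden states depending on what was read before it.

```python
import math
import random
import string

HIDDEN = 4                       # toy hidden size (assumption)
VOCAB = len(string.printable)    # size of the one-hot input

rng = random.Random(2020)
# W_x: input-to-hidden weights, W_h: hidden-to-hidden weights (random, untrained)
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(VOCAB)] for _ in range(HIDDEN)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def rnn_step(char, h_prev):
    """One step: h_t = tanh(W_x @ onehot(char) + W_h @ h_prev)."""
    idx = string.printable.index(char)
    # Multiplying W_x by a one-hot vector just selects column idx
    return [
        math.tanh(W_x[i][idx] + sum(W_h[i][j] * h_prev[j] for j in range(HIDDEN)))
        for i in range(HIDDEN)
    ]

def run(prefix):
    """Feed a string one character at a time and return the final hidden state."""
    h = [0.0] * HIDDEN
    for c in prefix:
        h = rnn_step(c, h)
    return h

h_after_Ho = run("Ho")          # 'o' read right after 'H'
h_after_Homewo = run("Homewo")  # 'o' read after 'Homew'
print(h_after_Ho)
print(h_after_Homewo)
```

Because the hidden state summarizes everything read so far, a layer stacked on top of it can learn to predict 'm' in the first case and 'r' in the second, even though the last input character is identical.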

## Requirement-Tensorflow
For this homework, I recommend using the latest stable "anaconda" version of Tensorflow, which is currently (2020-11-05) 2.2.x; the latest version (2.3.x) should not cause serious problems. The grading environment uses python3.7, but there are no major changes in python3.6/3.8, so those should not be a serious problem either. 
One other package is required to run the code: 'unidecode'. You can easily install it with 'pip install unidecode'.  

You can add other packages if you want, but if they are not included by default in Python/Tensorflow, you should contact the T/A to check whether you can use them.

In [13]:
####### Import Packages ##########

import tensorflow as tf
import time
import random
import string
import unidecode
import re
import os
import datetime

##################################

## Changeable Parameters

In [14]:
RANDOM_SEED = 2020
tf.random.set_seed(RANDOM_SEED)

###### TF 2.x automatically selects whether to use GPU (default) or CPU #########
#USE_GPU = True
#################################################################################


############################# Changeable Parameters #############################
SEQ_LENGTH = 200
N_ITER = 2500
TXT_GEN_PERIOD = 500
LEARNING_RATE = 0.005
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
#################################################################################

## Data Preparation (Contact T/A if you want to change)

In [15]:
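with open('./spam.txt', 'r') as f:
    textfile = f.read()

random.seed(RANDOM_SEED)

# Normalize to ASCII and collapse repeated spaces
textfile = unidecode.unidecode(textfile)
textfile = re.sub(' +',' ', textfile)

# Measure the length after cleaning so that sampled indices stay inside the text
TEXT_LENGTH = len(textfile)


def pick_input(textfile):
    # Pick a random chunk of SEQ_LENGTH + 1 characters (inputs plus the shifted targets)
    start_index = random.randint(0, TEXT_LENGTH - SEQ_LENGTH - 1)
    end_index = start_index + SEQ_LENGTH + 1
    return textfile[start_index:end_index]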

def char2tensor(text):
    lst = [string.printable.index(c) for c in text]
    tensor = tf.Variable(lst)
    return tensor

def draw_random_sample(textfile):    
    sampled_seq = char2tensor(pick_input(textfile))
    inputs = sampled_seq[:-1]
    outputs = sampled_seq[1:]
    return inputs, outputs

print(draw_random_sample(textfile))
print(draw_random_sample(textfile))

(<tf.Tensor: shape=(200,), dtype=int32, numpy=
array([54, 74,  4,  4,  0, 94, 24, 27, 94, 49, 50, 74,  4,  4,  0, 94, 54,
       14, 14, 94, 17, 14, 27, 77, 94, 32, 32, 32, 75, 54, 48, 54, 75, 10,
       12, 76, 30, 76, 23, 10, 29,  2,  7,  0,  8,  1,  9,  8,  0, 94, 54,
       55, 50, 51, 82, 94, 54, 14, 23, 13, 94, 54, 55, 50, 51, 94, 41, 53,
       49, 39, 94, 29, 24, 94,  6,  2,  4,  6,  8, 96, 56, 53, 42, 40, 49,
       55, 75, 94, 44, 22, 25, 24, 27, 29, 10, 23, 29, 94, 18, 23, 15, 24,
       27, 22, 10, 29, 18, 24, 23, 94, 15, 24, 27, 94,  0,  2, 94, 30, 28,
       14, 27, 75, 94, 55, 24, 13, 10, 34, 94, 18, 28, 94, 34, 24, 30, 27,
       94, 21, 30, 12, 20, 34, 94, 13, 10, 34, 62, 94,  2, 94, 15, 18, 23,
       13, 94, 24, 30, 29, 94, 32, 17, 34, 94, 73, 94, 21, 24, 16, 94, 24,
       23, 29, 24, 94, 17, 29, 29, 25, 77, 76, 76, 32, 32, 32, 75, 30, 27,
       10, 32, 18, 23, 23, 14, 27, 75, 12, 24, 22, 94, 29], dtype=int32)>, <tf.Tensor: shape=(200,), dtype=int32, numpy=
array([

## Custom Data Preparation Functions
You can add any other data preparation functions in the cell below.  

However, you should annotate each function you define precisely. One annotation line should not cover more than 5 lines of your code.
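
As a reference for what the provided `draw_random_sample` produces, here is a minimal pure-Python sketch of the same index encoding and one-character shift, without Tensorflow. The helper names `encode`, `decode` and `make_pair` are hypothetical; only the use of `string.printable` as the alphabet is taken from the cell above.

```python
import string

def encode(text):
    """Map each character to its position in string.printable."""
    return [string.printable.index(c) for c in text]

def decode(indices):
    """Inverse of encode: turn a list of indices back into a string."""
    return ''.join(string.printable[i] for i in indices)

def make_pair(chunk):
    """Split a chunk of N+1 characters into N inputs and N next-character targets."""
    ids = encode(chunk)
    return ids[:-1], ids[1:]

inputs, targets = make_pair("Homework")
# Each input character is paired with the character that follows it
for x, y in zip(inputs, targets):
    print(repr(string.printable[x]), "->", repr(string.printable[y]))
```

In this encoding a space is index 94 and a newline is index 96, which is why those two values appear so often in the sample tensors printed above.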

## Task1: RNN/LSTM/GRU Module

The main task is to create an RNN/LSTM/GRU network. You can use any tensorflow/keras API that is provided by default.

A basic build_model is given, but you can add other functions that help your network.

In [16]:
#################### WRITE DOWN YOUR CODE ################################

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

#################### WRITE DOWN YOUR CODE ################################

## Optional Task: Train & Generate Code

These cells define the training setup and the text generation function. 

You can change this code, but if you do, you should annotate precisely where you made changes.
One annotation line should not cover more than 5 lines of your changes.  
Also, do not delete the original code; just comment it out (or add new cells to the jupyter notebook).

In [17]:
model = build_model(vocab_size = 97,
  embedding_dim=EMBEDDING_DIM,
  rnn_units=HIDDEN_DIM,batch_size=1)

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
def accuracy(labels, logits):
    return tf.keras.metrics.sparse_categorical_accuracy(labels, logits)

# Note: Adam() uses its default learning rate here; LEARNING_RATE defined above is not passed in
optimizer = tf.keras.optimizers.Adam() 
model.compile(optimizer=optimizer, loss=loss) 

In [18]:
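def generate_text(model, start_string):
    # Generate SEQ_LENGTH characters, one at a time, starting from start_string
    num_generate = SEQ_LENGTH
    input_eval = char2tensor(start_string)
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    # Lower temperature gives more predictable text; higher gives more surprising text
    temperature = 0.8
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        # tf.random.categorical expects logits (unnormalized log-probabilities),
        # so the exponentiation below is commented out
        # predictions = tf.math.exp(predictions)
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        # Feed the sampled character back in; the stateful LSTM keeps the earlier context
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(string.printable[predicted_id])

    return (start_string + ''.join(text_generated))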
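
The effect of `temperature` in `generate_text` can be seen in isolation with a small pure-Python sketch. It does not use the Tensorflow code path: `softmax_with_temperature` and `sample_index` are hypothetical helpers, and the logits are made-up numbers. Dividing the logits by a temperature below 1 sharpens the distribution (more predictable text), while a temperature above 1 flattens it (more surprising text).

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate characters
rng = random.Random(0)
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    # Show the temperature, the resulting distribution, and one sampled index
    print(t, [round(p, 3) for p in probs], sample_index(probs, rng))
```

At a low temperature almost all of the probability mass sits on the highest-scoring character, which tends to produce repetitive output; raising the temperature spreads the mass out at the cost of more spelling noise.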

## Execution Code & Credit Criterion
Half Credit (4 points): Generate some ugly text, without any meaningful words.

Three-Quarters Credit (6 points): With SEQ_LENGTH 200, generate 6 or fewer different words.

Full Credit (8 points): With SEQ_LENGTH 200, generate 7 or more different words.



You can change this cell based on your code modifications above.


In [19]:
checkpoint_dir = './tf-checkpoints'+ datetime.datetime.now().strftime("_%Y.%m.%d-%H:%M:%S")
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_0")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [20]:
# Start Training by running this cell
for i in range(N_ITER):
    x, y = draw_random_sample(textfile)    
    history = model.fit(tf.expand_dims(x, 0), tf.expand_dims(y, 0), epochs=1, callbacks=[checkpoint_callback], verbose=0)
    if (i % TXT_GEN_PERIOD == 0):
        checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_%d"%i)
        checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_prefix,
            save_weights_only=True)
        print("\nIteration %d"%i)
        print(generate_text(model, 'joy'))
        print("\n\n")


Iteration 0
joyd&H'E9~?nnz1Vf(It75%#++77|-gNTRUd
=\5@RQY('O-]3+"xwdKi2D6@[Um }~N
f#!07^`m785#!Gc&*uP?]"tofCL~RDq+|l+Eucv,-;KX!@xK(L&9|X-VV,3k$95Cl]Lzg<d6U`5Z0^KbRxD8bX~$df\8XZ47,,V^rQLQY|xOrmB=!'2dKuAo+V<{ 9bJ5f$"]Z




Iteration 500
joy or all 080000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000




Iteration 1000
joy or to 871870 from to 871870 prom a comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to comtome to




Iteration 1500
joyour a call 09066612112 from a call 09066612112 from a call 09066612112 from a call 09066612112 from a call 09066612112 from a call 09066612112 from a call 09066612112 from a call 09066612112 from a ca




Iteration 2000
joyker for your call 0871230032 from land line only to claim call 0871230003 from land line only

## Model Loading Code
You can change this code if you don't like it or don't understand it.

In [30]:
model = build_model(97,EMBEDDING_DIM,HIDDEN_DIM,batch_size=1)
model.load_weights(tf.train.latest_checkpoint("./tf-checkpoints_2020.11.29-17:21:52"))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fa6ada47f28>

## Using Pre-trained Networks: Transformers

Actually, RNN-based sequential models are now regarded as a little old-fashioned, since the 'Transformer' model was introduced in the paper 'Attention is All You Need' (https://arxiv.org/abs/1706.03762). This model is now widely used in many state-of-the-art models for sequential data, and even in models for non-sequential data (e.g. images). However, the training cost is too high for this homework (you might need multiple million-won GPUs). Fortunately, there is a package called 'transformers' that contains multiple pre-trained transformer-based models that can be used directly. Below is an example of text generation using GPT2, which is one of the most popular pre-trained NLP models.

You can install this package with 'pip install transformers'. To download the pre-trained model, you may need 2GB or more of free disk space.

In [31]:
from transformers import pipeline, set_seed
# Install ipywidgets and restart the notebook if you get an error message.

generator = pipeline('text-generation', model='gpt2')
set_seed(25)
start_text = 'my name is park'
generator(start_text, max_length=300, num_return_sequences=2)

All model checkpoint layers were used when initializing TFGPT2Model.

All the layers of TFGPT2Model were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'my name is park and I really like it" says Gabor Yevlikos, a local journalist. "I went there this week because it\'s a local market that I believe in. We used to run a bakery which sells biscuits from the local supermarket."\n\nThe local supermarket has a £10-a-meal menu and they\'ve been putting a stop to many pies. So, they\'re making some new ones because there\'s no need to put them in the pies at the end of the day. "I like that part now, but we don\'t really want to be the only ones who use to be so busy. You don\'t want to put it in a few pies as we think it does bring out the best in us."\n\nThe good news is they have a more efficient sales policy. The bad news, though, is they all run up a £10-a-meal menu - so it\'s much quicker to order when it runs out. It\'s not like there\'s an \'oh, there\'s no one here\' response, like in stores and supermarkets, with a \'do not order\' message or a \'please don\'t order\' message as the shop is only closed two hours

## Task2: Follow a Tutorial of Fine-tuned Networks (2 points.)
There are hundreds of fine-tuned NLP models available for the 'transformers' package. Try one of these models and follow its tutorial (language translation models are excluded). The results must be meaningful or funny, and you must write down which model you chose and explain its function (e.g. what the input/output is, what it means, etc.) in one or two sentences. 

Hint: You can find a list of pre-trained models for the 'transformers' package at https://huggingface.co/models

In [18]:
#################### WRITE DOWN YOUR CODE ################################
from question_generator.questiongenerator import QuestionGenerator, print_qa
qg = QuestionGenerator()


with open("question_generator/articles/twitter_hack.txt", "r") as f:
    text = f.read()
print_qa(qg.generate(text, num_questions=10, answer_style='sentences'))



#################### WRITE DOWN YOUR CODE ################################

########### WRITE DOWN YOUR Explanation with annotation ##################

# Explanation:
# Takes a text as input, then generates arbitrary questions and answers from that text.


##########################################################################


Generating questions...

Evaluating QA pairs...

1) Q: How long did it take to get the Bitcoin to be sent to the address of his digital wallet?
   A: " On the official account of Mr Musk, the Tesla and SpaceX chief appeared to offer to double any Bitcoin payment sent to the address of his digital wallet "for the next 30 minutes". 

2) Q: How did the company fix the flaw?
   A: Last year, Twitter chief executive Jack Dorsey's account was hacked, but the company said it had fixed the flaw that left his account vulnerable. 

3) Q: What did the campaign of Joe Biden say about the hack?
   A: The campaign of Joe Biden, who is the current Democratic presidential candidate, said Twitter had "locked down the account within a few minutes of the breach and removed the related tweet". 

4) Q: What is the role of social media companies in the 2020 election?
   A: "Social media companies such as Twitter and, Facebook all have a duty to consider the damage and influence their platforms can have on t



## Additional Information: Project
Of course, there is a massive number of pre-trained models for images, NLP, and other domains available on the web under open-source licenses. You can fine-tune those models if your GPUs are good enough, or at least transfer their knowledge by using the output features of pre-trained networks. Or maybe neither; it is up to you.