# 2019 Introduction to Deep Learning HW4: Fake News Generator!

Created by Yeon-goon Kim, SNU ECE, CML.

In this homework, you will create a fake news generator: a basic character-to-character (char2char) generative model built on an RNN, LSTM, or GRU. Your results may not be very good, but you can expect some sentence-like output if you complete this homework successfully.

## Now We'll Handle Text, Not Images. Are There Any Differences?

Of course, there are many differences between processing images and processing text. One is that text cannot be expressed directly as a matrix or tensor. We know an image can be expressed as Tensor(n_channel, width, height). But what about text? Can the word 'Homework' be expressed as a tensor directly? By what rules? With what shape? Even if it can, which word is closer to it, 'Burden' or 'Work'? This is called the 'Word Embedding Problem' and is considered one of the most important problems in Natural Language Processing (NLP) research. Fortunately, there are some general solutions to this problem (though not perfect ones), and both TensorFlow (Keras) and PyTorch provide basic APIs that handle it automatically. You may investigate and use those APIs in this homework.
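To make the "which one is closer" question concrete, here is a minimal sketch that compares words by the cosine similarity of their embedding vectors. The three-dimensional vectors in `toy_embedding` are invented for illustration and do not come from any trained model; a real embedding layer would learn such vectors during training.

```python
import math

# Hypothetical 3-dimensional embeddings; the numbers are made up for this
# illustration and are not taken from any trained model.
toy_embedding = {
    "homework": [0.9, 0.8, 0.1],
    "work":     [0.8, 0.9, 0.0],
    "burden":   [0.1, 0.7, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_work = cosine_similarity(toy_embedding["homework"], toy_embedding["work"])
sim_burden = cosine_similarity(toy_embedding["homework"], toy_embedding["burden"])
print("homework vs work:  ", round(sim_work, 3))
print("homework vs burden:", round(sim_burden, 3))
```

With these made-up vectors, 'work' ends up closer to 'homework' than 'burden' does, but that is a property of the chosen numbers, not a claim about real language data.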

The other difference is that text is sequential data. When processing images, the input (ignoring the batch dimension) is usually a single image. With text, the input is usually one or more sentences or paragraphs, that is, a sequence of embedded characters or words. So if we want to generate the word 'Homework' from the start token 'H', the 'o' that follows 'H' and the 'o' that follows 'Homew' must be treated differently when given as input: the first should lead to 'm', the second to 'r'. This is why we use RNN-based models in deep learning when processing text data.
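The role of the hidden state can be sketched with a scalar vanilla RNN step. This is only an illustration: the weights `w_h` and `w_x` and the per-character values in `code` are arbitrary assumptions, whereas a real model would use learned weights and embedding vectors.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """One step of a scalar vanilla RNN: h_t = tanh(w_h * h_prev + w_x * x)."""
    return math.tanh(w_h * h_prev + w_x * x)

# Hypothetical scalar codes for characters (a real model would use embeddings).
code = {"H": 0.2, "o": -0.4, "m": 0.7, "e": 0.1, "w": -0.9}

def run(prefix):
    """Feed the prefix one character at a time and return the final hidden state."""
    h = 0.0  # initial hidden state
    for ch in prefix:
        h = rnn_step(h, code[ch])
    return h

h_after_Ho = run("Ho")          # last input is the 'o' that follows 'H'
h_after_Homewo = run("Homewo")  # last input is the 'o' that follows 'Homew'
print(h_after_Ho, h_after_Homewo)
```

The last input character is 'o' in both runs, yet the hidden states differ because each state also depends on everything that came before. That difference is what lets the network predict a different next character in each case.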


## Requirement
In this homework, you should use the latest version of TensorFlow r1, which as of now (2019-11-19) is TensorFlow 1.15.x. You should probably use Python 3.7, because Python 3.8 may not be compatible yet. To use the dataset, you must install the 'pandas' package, which makes it convenient to read and manipulate .csv files. You can easily install it with the command 'pip install pandas', or with conda if you use a conda environment. Don't worry if you don't know how to use pandas; the data pre-processing code is provided.

## Import Packages & Create Dataset
This code creates a dataset that automatically converts each character in the texts to an int, namely the index assigned to that character in vocab.txt.
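The character-to-index conversion can be sketched on a toy vocabulary. `toy_vocab` stands in for the contents of vocab.txt, and `encode` and `decode` are hypothetical helper names; the sketch uses a dict for lookups, whereas the notebook code below uses `list.index`.

```python
# Toy vocabulary standing in for vocab.txt (the real file has one entry per line).
toy_vocab = ["H", "e", "k", "m", "o", "r", "w"]

# A dict gives constant-time lookups, unlike list.index, which scans the list.
char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}

def encode(text):
    """Map each character to its index in the vocabulary."""
    return [char_to_idx[ch] for ch in text]

def decode(indices):
    """Map indices back to characters and join them into a string."""
    return "".join(toy_vocab[i] for i in indices)

ids = encode("Homework")
print(ids)
print(decode(ids))
```

Decoding the indices recovers the original string, which is also how generated indices are turned back into text later in the notebook.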

In [1]:
####### This code should not be changed except for 'USE_GPU'. If you must change it, please mail the T/A with a proper description.
import pandas as pd
import tensorflow as tf
tf.enable_eager_execution()
import numpy as np
import time

########### Change whether you would use GPU on this homework or not ############
USE_GPU = False
#################################################################################
if USE_GPU:
    device = '/device:GPU:0'
else:
    device = '/cpu:0'
print('using device:', device)

vocab = open('vocab.txt').read().splitlines()
n_vocab = len(vocab)
tf.random.set_random_seed(1)

# Change char to index.
def text2idx(csv_file, dname, vocab):
    ret = []
    data = csv_file[dname].values
    for datum in data:
        for char in str(datum):
            idx = vocab.index(char)
            ret.append(idx)
    ret = np.array(ret)
    return ret


# Create dataset to automatically iterate.
csv_file = pd.read_csv('data.csv', sep='|')

with tf.device(device):
    x = text2idx(csv_file, 'x_data', vocab)
    y = text2idx(csv_file, 'y_data', vocab)
    
seq_length = 64
examples_per_epoch = len(x)//seq_length
batch_size = 64
steps_per_epoch = examples_per_epoch//batch_size
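# Group characters into sequences first, then shuffle whole sequences and batch them.
# (Shuffling before the first batch would scramble individual characters.)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(seq_length, drop_remainder=True).shuffle(10000).repeat().batch(batch_size, drop_remainder=True)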
dataset.output_shapes

using device: /cpu:0
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.


(TensorShape([Dimension(64), Dimension(64)]),
 TensorShape([Dimension(64), Dimension(64)]))
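
The (64, 64) shape above comes from batching twice: the first batch groups single characters into sequences, and the second groups sequences into training batches. Here is a pure-Python sketch of that idea, ignoring shuffling and repetition; `chunk` is a hypothetical helper, and the sizes are shrunk so the result is easy to inspect.

```python
def chunk(items, size):
    """Group items into consecutive chunks of `size`, dropping any remainder."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

seq_length = 4   # 64 in the notebook
batch_size = 3   # 64 in the notebook
stream = list(range(30))  # stand-in for the encoded character indices

sequences = chunk(stream, seq_length)   # 7 sequences of 4 characters each
batches = chunk(sequences, batch_size)  # 2 batches of 3 sequences each

# Prints the number of batches, then (batch_size, seq_length) of one batch.
print(len(batches), len(batches[0]), len(batches[0][0]))
```

Each batch therefore has shape (batch_size, seq_length), and leftover elements that do not fill a whole chunk are dropped, as with `drop_remainder=True`.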

## Task1: RNN/LSTM/GRU Module

The main task is to create an RNN/LSTM/GRU network whose input and output shapes are both (batch_size, vocab_size). You can use the TensorFlow/Keras API, such as tf.keras.Model or tf.keras.Sequential, or bare-bones TensorFlow such as tf.layers.XXX. Any API that ships with TensorFlow is allowed.

In [None]:
#################### WRITE DOWN YOUR CODE ################################
## Recommended form. You can use another form, but in that case you may need to change some of the test or training code given later.
def selfModule(### args you need):
    return model
#################### WRITE DOWN YOUR CODE ################################

## Optional Task: Test Code

This code defines a test function that tests the network by generating a character sequence of length max_length starting from 'start_letter'.

In [None]:
####################### Test code. In most cases you don't need to change this except for the value of 'max_length', but it is OK to change it if you really need to.
def test(model, start_letter, n_vocab):
    max_length = 1000
    idx = vocab.index(start_letter)
    input_t = tf.constant([[idx]])
    output_sen = start_letter
    model.reset_states()
    for i in range(max_length):
        predictions = model(input_t)
        predictions = tf.squeeze(predictions)
        idx = tf.math.argmax(predictions)
        output_sen += vocab[idx.numpy()]        
        input_t = tf.constant([[idx.numpy()]])
    return output_sen
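
The test function above decodes greedily: at every step it takes the argmax of the predictions and feeds that character back in as the next input. Here is a minimal sketch of the same loop, with a hypothetical score table (`toy_scores`) standing in for the trained network, so it runs without TensorFlow.

```python
toy_vocab = ["a", "b", "c"]

# Hypothetical next-character scores: row i holds the scores given that the
# current character is toy_vocab[i]. A trained RNN would produce these instead.
toy_scores = [
    [0.1, 2.0, 0.3],  # after 'a', 'b' scores highest
    [0.2, 0.1, 1.5],  # after 'b', 'c' scores highest
    [1.8, 0.4, 0.2],  # after 'c', 'a' scores highest
]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def greedy_generate(start_letter, max_length):
    """Start from start_letter and append max_length greedily chosen characters."""
    idx = toy_vocab.index(start_letter)
    out = start_letter
    for _ in range(max_length):
        idx = argmax(toy_scores[idx])  # pick the single highest-scoring next char
        out += toy_vocab[idx]
    return out

generated = greedy_generate("a", 6)
print(generated)
```

Greedy decoding is deterministic, so the same start letter always yields the same text. Unlike this stateless toy table, the real model also carries a hidden state between steps, which is why `test` calls `model.reset_states()` before generating.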

## Task2: Train & Generate

Using the functions and network defined above, run your training process and show your results freely! Since this is a generation task with no clear test set, credit is given based on the quality of the generated sequence. Please see the document for the criteria. (Hint: Watch your loss carefully. If the final loss is between 1 and 2 or higher, you will get results that match basic credit. If the final loss is under about 0.1, you will get results that match full credit.)
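
To help interpret the loss values in the hint, here is a sketch of sparse categorical cross-entropy computed by hand from logits, as the negative log of the softmax probability assigned to the correct class. The function name `sparse_cross_entropy` and the vocabulary size of 50 are assumptions made for illustration only.

```python
import math

def sparse_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

n_classes = 50  # hypothetical vocabulary size, not the notebook's n_vocab
uniform_loss = sparse_cross_entropy([0.0] * n_classes, label=0)
print("loss of a uniform guess:", uniform_loss)
print("ln(n_classes):          ", math.log(n_classes))

# A mean loss L corresponds to a geometric-mean probability exp(-L)
# assigned to the correct character.
for loss_value in (2.0, 1.0, 0.1):
    print(loss_value, "->", round(math.exp(-loss_value), 3))
```

A model that guesses uniformly has a loss of ln(vocabulary size), and lower losses mean that more probability mass is placed on the correct next character.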

In [None]:
################## Main code. You can change the code or the hyperparameters in this cell freely.
do_restore = False

model = selfModule(20, n_vocab, 128, batch_size)

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer = tf.train.AdamOptimizer(learning_rate=0.01,
                                                 beta1=0.9,
                                                 beta2=0.999,
                                                 epsilon=2e-16),
              loss = loss)

if do_restore:
    model.load_weights('fng_tf.h5')

else:
    model.fit(dataset, epochs=10, steps_per_epoch=steps_per_epoch)
    model.save_weights('fng_tf.h5')

old_weights = model.get_weights()
t_model = selfModule(20, n_vocab, 128, 1)
t_model.set_weights(old_weights)
print(test(t_model, 'W', n_vocab))