## Sequence GAN from char-rnn
This is a character-level language model using recurrent neural networks based Sequence GAN (SeqGAN).
SeqGAN was proposed to cover discrete sequence data.
In this assignment, you will implement SeqGAN with shakespeare data used in assignment 3.

Original blog post & code:
https://github.com/LantaoYu/SeqGAN

That said, you are allowed to copy paste the codes from the original repo with an additional effort to apply it to our data.
HOWEVER, try to implement the model yourself first, and consider the original source code as a last resort.
You will learn a lot while wrapping around your head during the implementation. And you will understand more clearly in a code level.

### AND MOST IMPORTANTLY, IF YOU JUST BLINDLY COPY PASTE THE CODE, YOU SHALL RUIN YOUR EXAM.
### The exam is designed to be solvable for students that actually have written the code themselves.
At least strictly re-type the codes from the original repo line-by-line, and understand what each line means thoroughly.

## YOU HAVE BEEN WARNED.

Now proceed to the code. You may use textloader in previous assingment or not. You can freely create another python files (\*.py) and then import them. Following codes can be modified as you want. Just make sure that SeqGAN training works.



In [1]:
# ipython magic function for limiting the gpu to be seen for tensorflow
# if you have just 1 GPU, specify the value to 0
# if you have multiple GPUs (nut) and want to specify which GPU to use, specify this value to 0 or 1 or etc.
%env CUDA_DEVICE_ORDER = PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES = 2
# load a bunch of libraries
#from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
import numpy as np
import argparse
import time
import os
from six.moves import cPickle
from six import text_type
import sys

# this module is from the .py file of this folder
# it handles loading texts to digits (aka. tokens) which are recognizable for the model
from utils import TextLoader

# for TensorFlow vram efficiency: if this is not specified, the model hogs all the VRAM even if it's not necessary
# bad & greedy TF! but it has a reason for this design choice FWIW, try googling it if interested
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

data_dir = 'data/tinyshakespeare'
seq_length = 20
batch_size = 32

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=2


Write down the generator class and any other methods required for the generator class. You may define such methods in other python files (ex : utils.py).

In [2]:
#In the Model.py, there is the generator method.
import model
#We refered and copied coded from the original blog post & code(https://github.com/ofirnachum)
#Also, we applied the our things to the reference code.

Write down the discriminator class and any other methods required for the discriminator class. You may define such methods in other python files (ex : utils.py).

In [3]:
#In the Model.py, there is the discriminator, roll out, and other methods that helps to implement.
#We refered and copied coded from the original blog post & code(https://github.com/ofirnachum)

If you need any other class or method, use below blanks. You may insert or delete blanks as many you want. Of course, you may define them in other python files and then import them.

In [None]:
from __future__ import print_function

import codecs

__doc__ = """Char-based Seq-GAN on data from a book."""


import train

import os.path
import numpy as np
import tensorflow as tf
import random
import subprocess
import gzip

EMB_DIM = 20
HIDDEN_DIM = 25
SEQ_LENGTH = 20
START_TOKEN = 0

EPOCH_ITER = 10000
CURRICULUM_RATE = 0.02  # how quickly to move from supervised training to unsupervised
TRAIN_ITER = 1000000  # generator/discriminator alternating
D_STEPS = 2  # how many times to train the discriminator per generator step
SEED = 88

DATA_FILE = 'data/tinyshakespeare/input.txt'


def tokenize(s):
    return [c for c in ' '.join(s.split())]


def get_data(download=not os.path.exists(DATA_FILE)):
    token_stream = []
    is_gzip = False
    try:
        open(DATA_FILE).read(2)
    except UnicodeDecodeError:
        print("HERE")
    with gzip.open(DATA_FILE) if is_gzip else codecs.open(DATA_FILE, 'r', 'utf-8',errors='ignore') as f:
        for line in f:
            line = line if not is_gzip else line.decode('utf-8')
            if ("Even to the court, the heart, to the seat o' the brain;" in line or token_stream) and line.strip():
                token_stream.extend(tokenize(line.strip().lower()))
                token_stream.append(' ')
            if len(token_stream) > 10000 * SEQ_LENGTH:  # enough data
                break

    return token_stream


class BookGRU(model.GRU):

    def d_optimizer(self, *args, **kwargs):
        return tf.train.AdamOptimizer()  # ignore learning rate

    def g_optimizer(self, *args, **kwargs):
        return tf.train.AdamOptimizer()  # ignore learning rate


def get_trainable_model(num_emb):
    return BookGRU(
        num_emb, EMB_DIM, HIDDEN_DIM,
        SEQ_LENGTH, START_TOKEN)


def get_random_sequence(token_stream, word2idx):
    """Returns random subsequence."""
    start_idx = random.randint(0, len(token_stream) - SEQ_LENGTH)
    return [word2idx[tok] for tok in token_stream[start_idx:start_idx + SEQ_LENGTH]]


def verify_sequence(three_grams, seq):
    """Not a true verification; only checks 3-grams."""
    for i in range(len(seq) - 3):
        if tuple(seq[i:i + 3]) not in three_grams:
            return False
    return True


def main():
    random.seed(SEED)
    np.random.seed(SEED)

    token_stream = get_data()
    assert START_TOKEN == 0
    words = ['_START'] + list(set(token_stream))
    word2idx = dict((word, i) for i, word in enumerate(words))
    num_words = len(words)
    three_grams = dict((tuple(word2idx[w] for w in token_stream[i:i + 3]), True)
                       for i in range(len(token_stream) - 3))
    extends_words = []
    print('num words', num_words)
    print('stream length', len(token_stream))
    print('distinct 3-grams', len(three_grams))

    trainable_model = get_trainable_model(num_words)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('training')
    for epoch in range(TRAIN_ITER // EPOCH_ITER):
        print(" ")
        print('epoch', epoch)
        proportion_supervised = max(0.0, 1.0 - CURRICULUM_RATE * epoch)
        train.train_epoch(
            sess, trainable_model, EPOCH_ITER,
            proportion_supervised=proportion_supervised,
            g_steps=1, d_steps=D_STEPS,
            next_sequence=lambda: get_random_sequence(token_stream, word2idx),
            verify_sequence=lambda seq: verify_sequence(three_grams, seq),
            words=words)
    #print('words will be like ', words)
    #print(*extends_words, sep = "\n") 

if __name__ == '__main__':
    main()

num words 37
stream length 200018
distinct 3-grams 5193
training
 
epoch 0
running 10000 iterations with 1 g steps and 2 d steps
of the g steps, 1.00 will be supervised
>>>> correct generations (supervised, unsupervised): 0.0012998700129987 0.0
None
 
 
epoch 1
running 10000 iterations with 1 g steps and 2 d steps
of the g steps, 0.98 will be supervised
>>>> correct generations (supervised, unsupervised): 0.003875968992248062 0.05555555555555555
[' ', 'f', 'o', ' ', 's', 'e', "'", 's', 'o', 'b', 'u', ' ', 'o', 'n', 'd', ' ', 'o', 'm', ' ', 'o']
[" fo se'sobu ond om o"]
 
epoch 2
running 10000 iterations with 1 g steps and 2 d steps
of the g steps, 0.96 will be supervised
>>>> correct generations (supervised, unsupervised): 0.004789670970428988 0.09547738693467336
[' ', 'h', 'l', 'e', 't', 'a', 'r', ',', ' ', 'd', 'o', 'r', 'd', ' ', 'k', 'o', 's', 'e', 'm', 'a']
[' hletar, dord kosema']
 
epoch 3
running 10000 iterations with 1 g steps and 2 d steps
of the g steps, 0.94 will be supervi

Write down the main code processing the train. You should show at least 16 generated text sequences at the end of the training. We will judge your progress with your final generated result. Be sure to pretrain the generator using supervised frame before training the model with SeqGAN framework.