# 2019 Introduction to Deep Learning HW4: Fake News Generator!

Created by Yeon-goon Kim, SNU ECE, CML.

In this homework, you will create a fake news generator: a basic RNN/LSTM/GRU char2char generative model. Your results may not be very good, but you can expect some sentence-like output if you complete the homework successfully.

## Now we'll handle text, not images. Are there any differences?

Of course, there are many differences between processing images and text. One is that text cannot be expressed directly as a matrix or tensor. We know an image can be expressed as Tensor(n_channel, width, height). But what about text? Can the word 'Homework' be expressed as a tensor directly? By what rules? With what shape? Even if it can, which word is closer to it, 'Burden' or 'Work'? This is called the 'Word Embedding Problem' and is considered one of the most important problems in Natural Language Processing (NLP) research. Fortunately, there are some general solutions to this problem (though not perfect ones), and both TensorFlow (Keras) and PyTorch provide basic APIs that handle it automatically. You may investigate and use those APIs in this homework.
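
As a rough illustration of the idea, an embedding can be pictured as a lookup table holding one vector per vocabulary entry. The sketch below uses only the Python standard library. `toy_vocab`, `embedding_dim`, and `embed` are hypothetical names invented for this example, and the random values stand in for parameters that a real layer (for instance PyTorch's `nn.Embedding`) would learn during training.

```python
import random

# Hypothetical toy vocabulary; the homework reads its real one from vocab.txt.
toy_vocab = ['H', 'o', 'm', 'e', 'w', 'r', 'k']
embedding_dim = 4  # assumed size, chosen only for illustration

random.seed(0)
# One row of `embedding_dim` floats per vocabulary entry, randomly initialised.
# In a real model these rows are parameters updated by training.
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in toy_vocab]


def embed(text):
    """Map each character to its index, then look up that row of the table."""
    return [embedding_table[toy_vocab.index(ch)] for ch in text]


vectors = embed('Homework')
print(len(vectors), len(vectors[0]))  # 8 characters, 4 numbers each
```

Both occurrences of 'o' in 'Homework' receive exactly the same vector, so a lookup table on its own carries no information about where a character appears in the word.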

The other is that text is sequential data. When processing images, a single input (ignoring the batch) is just one image. With text, the input is usually one or more sentences or paragraphs, that is, a sequence of embedded characters or words. So if we want to generate the word 'Homework' from the start token 'H', the 'o' that follows 'H' and the 'o' that follows 'Homew' must be handled differently when given as input, because different characters precede them. This is why we use RNN-based models in deep learning when processing text data.
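
To make the point about context concrete, here is a minimal sketch of a recurrent update with a single scalar hidden state. It is not the model used in this homework: the weights `w_x` and `w_h` and the numeric character codes are made-up values. It only shows that the same input character leads to a different state depending on what came before it.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One step of a scalar Elman-style RNN: h_new = tanh(w_x * x + w_h * h)."""
    return math.tanh(w_x * x + w_h * h)


def run(sequence):
    """Feed a sequence of numbers through the RNN and return the final state."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h


# Hypothetical numeric codes for characters (a real model would use embeddings).
codes = {'H': 1.0, 'o': 2.0, 'm': 3.0, 'e': 4.0, 'w': 5.0}

state_after_Ho = run([codes[c] for c in 'Ho'])
state_after_Homewo = run([codes[c] for c in 'Homewo'])

# The last input is 'o' in both cases, yet the resulting states differ.
print(state_after_Ho, state_after_Homewo)
```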


## Requirement
For this homework I recommend using the latest version of PyTorch, which as of now (2019-11-19) is PyTorch 1.3.x. You should probably use Python 3.7, because Python 3.8 may not be compatible yet. To use the dataset, you must install the 'pandas' package, which makes it convenient to read and manipulate .csv files. You can easily install it with the command 'pip install pandas', or with conda if you use a conda venv. Don't worry if you don't know how to use pandas; the data pre-processing code is provided.

## Import Packages & Create Dataset
This code creates a dataset that automatically converts each character in the texts to an int, namely that character's index in vocab.txt.
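
The conversion can be pictured with a toy vocabulary. The provided `text2int` calls `vocab.index(char)` for every character, which scans the list each time and raises `ValueError` for a character that is not in the vocabulary. The sketch below is an alternative illustration, not the provided code: `toy_vocab`, `char_to_idx`, `encode`, and `decode` are hypothetical names, and a dictionary is used so each lookup is a single hash access.

```python
# Hypothetical toy vocabulary; the homework's real one is read from vocab.txt.
toy_vocab = ['a', 'e', 'n', 's', 'w', ' ']

# Precompute character -> index once instead of searching the list per character.
char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}


def encode(text):
    """Convert text to indices; raises KeyError for characters outside the vocabulary."""
    return [char_to_idx[ch] for ch in text]


def decode(indices):
    """Convert indices back to text."""
    return ''.join(toy_vocab[i] for i in indices)


encoded = encode('news')
print(encoded)          # [2, 1, 4, 3]
print(decode(encoded))  # news
```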

In [2]:
####### This code should not be changed except for 'USE_GPU'. If you really must change it, please mail the T/A with a proper description.
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import numpy as np
from torch.utils.data import Dataset, DataLoader
import time
import math

########### Change whether you would use GPU on this homework or not ############
USE_GPU = False
#################################################################################
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

vocab = open('vocab.txt').read().splitlines()
n_vocab = len(vocab)
torch.manual_seed(1)

# Change char to index.
def text2int(csv_file, dname, vocab):
    ret = []
    data = csv_file[dname].values
    for datum in data:
        for char in str(datum):
            idx = vocab.index(char)
            ret.append(idx)    
    ret = np.array(ret)
    return ret

# Create dataset to automatically iterate.
class NewsDataset(Dataset):
    def __init__(self, csv_file, vocab):
        self.csv_file = pd.read_csv(csv_file, sep='|')
        self.vocab = vocab
        # self.len = len(self.csv_file)

        # 169280 characters / 64 characters per sequence = 2645 sequences.
        self.len = 2645
        self.x_data = torch.tensor(text2int(self.csv_file, 'x_data', self.vocab)[:169280]).view(-1, 64)
        self.y_data = torch.tensor(text2int(self.csv_file, 'y_data', self.vocab)[:169280]).view(-1, 64)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.x_data[idx], self.y_data[idx]
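
The constants hard-coded in `NewsDataset` fit together: slicing to 169280 characters and reshaping with `.view(-1, 64)` yields exactly 2645 rows of 64 characters. The sketch below checks that arithmetic and mimics the reshape on plain lists. `chunk` is a hypothetical helper, and the batch size of 64 is taken from the `DataLoader` call later in this notebook.

```python
import math

SEQ_LEN = 64        # characters per sequence, from .view(-1, 64)
N_CHARS = 169280    # characters kept by the [:169280] slice
BATCH_SIZE = 64     # assumed here; matches the DataLoader call later in the notebook

n_sequences = N_CHARS // SEQ_LEN
n_batches = math.ceil(n_sequences / BATCH_SIZE)
last_batch = n_sequences - (n_batches - 1) * BATCH_SIZE


def chunk(indices, size):
    """Split a flat list into rows of `size`, dropping any incomplete tail."""
    usable = len(indices) - len(indices) % size
    return [indices[i:i + size] for i in range(0, usable, size)]


print(n_sequences, n_batches, last_batch)  # 2645 42 21
print(chunk(list(range(10)), 4))           # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because 2645 is not a multiple of 64, the final batch of each epoch holds only 21 sequences, so a model should not assume a fixed batch size.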

## Task1: RNN/LSTM/GRU Module

The main task is to create an RNN/LSTM/GRU network whose input and output shapes are both (batch_size, vocab_size). You can use PyTorch APIs such as nn.XXX, or barebone torch with F. I recommend using nn.XXX and the module form described below, but you can use any of the built-in PyTorch APIs.

In [3]:
#################### WRITE DOWN YOUR CODE ################################
## Recommended form for this task. You can use another form such as nn.Sequential or barebone PyTorch if you want, but in that case you may need to change some of the test or train code given later.
class CharacterLSTM(nn.Module):
    def __init__(self, vocab_len, device):
        super(CharacterLSTM, self).__init__()
        self.vocab_len = vocab_len
        self.device = device
        self.lstm = nn.LSTM(self.vocab_len, hidden_size=256,
                            batch_first=True,
                            dropout=0.3, num_layers=2)
        self.linear = nn.Linear(256, vocab_len)

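    def forward(self, x):  # x: (batch_size, sentence_len), integer character indices
        # Build one-hot vectors by comparing every vocabulary position with each index.
        out = torch.arange(self.vocab_len).view(1, 1, -1).repeat((*x.size(), 1)).to(self.device)
        out = (out == x.unsqueeze(2)).float()  # (batch_size, sentence_len, vocab_len)
        out, _ = self.lstm(out)  # (batch_size, sentence_len, 256)
        out = self.linear(out)
        return out  # (batch_size, sentence_len, vocab_len)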
#################### WRITE DOWN YOUR CODE ################################
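
The first two lines of `forward` above turn each character index into a one-hot vector by comparing every vocabulary position against the index. The same idea on plain Python lists looks like the sketch below, where `one_hot` is a hypothetical helper written for illustration. PyTorch also ships `torch.nn.functional.one_hot`, which returns an integer tensor that would need `.float()` before being fed to an LSTM.

```python
def one_hot(indices, vocab_len):
    """Return one row per index: 1.0 at the index position, 0.0 elsewhere."""
    return [[1.0 if position == idx else 0.0 for position in range(vocab_len)]
            for idx in indices]


rows = one_hot([2, 0, 3], vocab_len=4)
for row in rows:
    print(row)
```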

## Optional Task: Train Code

This code defines a train function that trains the network defined above.

In [4]:
def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

###################### Train code. In most cases you don't need to change this, but it's OK if you really need to.
def train(dataset, model, optimizer, n_iters):
    model.to(device=device)
    model.train()
    start = time.time()
    print_every = 50
    criterion = nn.CrossEntropyLoss()
    for e in range(n_iters):
        for i, (x, y) in enumerate(dataset):
            x = x.to(device=device)
            y = y.to(device=device)
            model.zero_grad()
            output = model(x)  # output: (batch_size, sentence_len, vocab_len)
            loss = criterion(output.view(-1, output.size(-1)), y.view(-1))  # flatten to (N, vocab_len) and (N,)
            loss.backward()
            optimizer.step()
        if e % print_every == 0:
            print(f"Iteration {e}, {e / n_iters * 100:.1f}% | {timeSince(start)}, Loss: {loss.item():.4f}")
            torch.save(model.state_dict(), './fng_pt.pt')
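
The reshape before the loss is needed because `nn.CrossEntropyLoss` expects one row of class scores per prediction, shaped (N, C), and one target index per row, shaped (N). The sketch below reproduces that computation on nested Python lists with the default mean reduction. `cross_entropy` and `flatten` are hypothetical helpers, and the toy shapes are made up. It also shows a useful reference point: a model that scores every character equally has a loss of ln(vocab_len).

```python
import math


def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood of targets under softmax(logits), natural log."""
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[target]
    return total / len(targets)


def flatten(nested):
    """Merge the first two levels: (batch, sentence_len, ...) -> (batch * sentence_len, ...)."""
    return [item for sentence in nested for item in sentence]


# Toy shapes: batch_size=2, sentence_len=3, vocab_len=5, every logit equal.
vocab_len = 5
output = [[[0.0] * vocab_len for _ in range(3)] for _ in range(2)]
y = [[1, 4, 0], [2, 2, 3]]

loss = cross_entropy(flatten(output), flatten(y))
print(loss, math.log(vocab_len))  # the two values agree for uniform scores
```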

## Optional Task: Test Code

This code defines a test function that tests the network by generating a character sequence of length (max_length) from 'start_letter'.

In [5]:
####################### Test code. In most cases you don't need to change this except for the value of 'max_length', but it's OK if you really need to.
def test(start_letter):
    max_length = 1000
    with torch.no_grad():
        idx = vocab.index(start_letter)
        input_nparray = [idx]
        input_nparray = np.reshape(input_nparray, (1, len(input_nparray)))
        inputs = torch.tensor(input_nparray, device=device, dtype=torch.long)
        output_sen = start_letter
        for i in range(max_length):
            output = model(inputs).squeeze(0)
            # topv, topi = output.topk(5)
            # topi = topi[-1][torch.multinomial(topv[-1], 1)]  # sample from top 5
            topv, topi = output.topk(1)
            topi = topi[-1].item()  # index of the most likely next character, as a Python int
            letter = vocab[topi]
            output_sen += letter
            idx = vocab.index(letter)
            input_nparray = np.append(input_nparray, [idx])
            inputs = torch.tensor(input_nparray, device=device, dtype=torch.long).unsqueeze(0)
    return output_sen
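
The test code above always picks the single most likely character, while the two commented-out lines hint at sampling from the top 5 instead. One caveat with that variant: `torch.multinomial` requires non-negative weights, and raw network outputs can be negative, so the scores would need to go through a softmax first. The sketch below shows the idea with the standard library only. `sample_top_k` and the example scores are hypothetical and are not part of the provided code.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Pick one index among the k largest logits, weighted by a softmax over those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]  # always positive
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [-1.2, 0.3, 2.5, -0.7, 1.9, 0.0]  # made-up scores for a 6-character vocabulary
picks = [sample_top_k(logits, k=3, rng=rng) for _ in range(20)]

# Every pick is one of the three highest-scoring indices (2, 4, and 1).
print(sorted(set(picks)))
```

With `k=1` this reduces to the greedy choice used in `test`, so the two strategies can share one code path.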

## Task2: Train & Generate

Using the functions and network defined above, run your training process and show your results freely! Since this is a generation task, there is no clear test set, so credit is given based on the quality of the generated sequence. Please see the document for the criteria. (Hint: watch your loss carefully. If the final loss is around 1~2 or more, you will get results that match basic credit. If the final loss is under ~0.1, you will get results that match full credit.)
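
One way to read the loss values in the hint, assuming the loss is the mean natural-log cross-entropy per character (which is what `nn.CrossEntropyLoss` computes by default), is to exponentiate it. The result, often called perplexity, is roughly the number of characters the model is still hesitating between at each step. `perplexity` below is a hypothetical helper for this illustration.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean natural-log cross-entropy into perplexity."""
    return math.exp(mean_cross_entropy)


for loss in (2.0, 1.0, 0.1):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
```

Note that the train function prints the loss of the last batch only, so the printed value is a noisy estimate of the loss over the whole dataset.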

In [7]:
print('using device:', device)
vocab_len = len(vocab)
model = CharacterLSTM(vocab_len, device)

do_restore = True

if do_restore:
    model.load_state_dict(torch.load('fng_pt.pt', map_location=lambda storage, location: storage))
    model.eval()
    model.to(device=device)
else:
    n_iters = 500
    dataset = NewsDataset(csv_file='data.csv', vocab=vocab)
    train_loader = DataLoader(dataset=dataset, batch_size=64, shuffle=False, num_workers=1)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=2e-16, weight_decay=0)
    train(train_loader, model, optimizer, n_iters)

print(test('W'))

using device: cpu
White and the Seven Dwarfs Moria wrete of Mr. Trump’s basial bower sayia cornull to their sideas and dosser to the endired the plants on the system to preventing them after decades of fabric posters, they was the friends and barnest stape father. He said. “There was still surpers in the city. “I was going to take my two painting both straved it was also leader as the shipping scients as a landscape prints of dolbars of the book what he had done it is Houst on Starios Coust A time than $500 and a dozed by made more things — graduations on the political narrative of the country’s most gunnsy. The fabric resord that experien in the Bronx, much of the plant — was because of a specific community,” he said. Groups like the services are artist can plant posty growing 40 millions and permanity,” said He said Dude test to put the city’s los — with his summer, with time, Ms. Kerr said, “I called my friends in the city. In a locais from about $15, that showed by Mr. Trump calls 