# M2177.003100 Deep Learning <br> Assignment #3 Part 2: Language Modeling with CharRNN

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Sangil Lee, October 2018, modified by Jungbeom Lee, October 2020.

This notebook builds a basic character-level RNN to classify words.

A character-level RNN reads words as a series of characters, outputting a prediction and "hidden state" at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.
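
The recurrence described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the model you are asked to write in char_rnn.py: the helper `run_rnn`, the scalar weights `w_in` and `w_rec`, and the character encoding are all hypothetical, and a real RNN uses weight matrices and vector-valued hidden states.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.9, h0=0.0):
    """Run a scalar toy RNN over a sequence and return every hidden state."""
    hidden = h0
    states = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous hidden state.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

# Encode each character as a small number (hypothetical encoding, for illustration only).
word = "Jones"
inputs = [(ord(c) - ord("a")) / 26 for c in word.lower()]
states = run_rnn(inputs)

# For classification, only the state after the last character is used.
final_state = states[-1]
print(len(states), final_state)
```

Because each state depends on the previous one, the same characters in a different order give a different final state, which is what lets the model react to spelling rather than to a bag of letters.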

Specifically, we will train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling.


Original blog post & code:
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html


This IPython notebook closely follows the code from that tutorial.

You are therefore allowed to copy and paste code from the original source.
HOWEVER, <font color=red> try to implement the model yourself first </font>, and treat the original source code as a last resort.
You will learn a lot by wrapping your head around the implementation, and you will understand the nuts and bolts of RNNs more clearly at the code level.




### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have finished **all of Assignment Parts 1-3**, run the *CollectSubmission.sh* script with your **Student number** as the input argument. <br>
This will produce a zipped file called *[Your student number].zip*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* team_#)

### Classifying words with character-level RNN (30 points)


1. Train the model successfully with working code. You need to implement the code in char_rnn.py.  (15 points)


2. After training, the final accuracy must be <font color=red> above 65% </font> (please see the last code block). The data is not split into train/validation/test sets. Don't forget: <font color=red> do NOT clear the outputs of any code block! </font> (15 points)




Now proceed to the code.

### Preparing Data:

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('lusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    # Use a context manager so the file is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines


print(category_lines['Italian'][:5])

['data/names/German.txt', 'data/names/Arabic.txt', 'data/names/Vietnamese.txt', 'data/names/Korean.txt', 'data/names/Italian.txt', 'data/names/Japanese.txt', 'data/names/Dutch.txt', 'data/names/Greek.txt', 'data/names/English.txt', 'data/names/Polish.txt', 'data/names/Spanish.txt', 'data/names/Czech.txt', 'data/names/Irish.txt', 'data/names/Chinese.txt', 'data/names/Scottish.txt', 'data/names/Russian.txt', 'data/names/French.txt', 'data/names/Portuguese.txt']
lusarski
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']
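
To see why `unicodeToAscii` removes accents, it helps to look at what NFD normalization does on its own. The sketch below uses only the standard library; `strip_accents` is a hypothetical helper that, unlike `unicodeToAscii`, does not also filter against `all_letters`.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# Written with escapes so the string is guaranteed to be precomposed:
# U+015A is S with acute, U+00E0 is a with grave.
name = "\u015alus\u00e0rski"
decomposed = unicodedata.normalize("NFD", name)

# NFD splits each accented letter into a base letter plus one combining mark.
print(len(name), len(decomposed))
print(strip_accents(name))
```

After decomposition the base letters are ordinary ASCII characters, so dropping the combining marks leaves a plain ASCII spelling.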


### Names to Tensors:

In [2]:
import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters> Tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])
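
The position of the single 1 in the printed tensor follows directly from the alphabet layout. The sketch below reproduces the encoding with plain Python lists, so it runs without PyTorch. The helper `one_hot` is hypothetical, and it deliberately raises on unknown characters, whereas `letterToIndex` returns -1 for them.

```python
import string

all_letters = string.ascii_letters + " .,;'"

def one_hot(letter):
    """Return a one-hot list of length len(all_letters) for a single letter."""
    index = all_letters.find(letter)
    if index < 0:
        # Assumption of this sketch: unknown characters are an error.
        raise ValueError("letter not in all_letters: %r" % letter)
    vector = [0.0] * len(all_letters)
    vector[index] = 1.0
    return vector

# 26 lowercase letters come first, so 'J' (10th uppercase letter) sits at 26 + 9.
print(len(all_letters), all_letters.find("J"))

# A word becomes one vector per character: 5 vectors of length 57 for 'Jones'.
encoded = [one_hot(c) for c in "Jones"]
print(len(encoded), len(encoded[0]))
```

The extra dimension of size 1 in `lineToTensor` is the batch dimension, which this list-based sketch leaves out.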


### Settings for training and inference:

In [3]:
# Fixed parameters, should not be changed.
N_CATEGORIES = len(all_categories)
N_LETTERS = len(all_letters)
N_EVAL_SAMPLES_PER_CATEGORY = 10

# Adjustable parameters. You can change these parameters freely.

N_HIDDEN = 256
LEARNING_RATE = 5e-2
N_ITERS = 100000
print_every = 1000

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i
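
`categoryFromOutput` picks the category with the highest score. A minimal sketch of the same idea on a plain list is shown below. The helper `top1` is hypothetical, and the three categories and their log-probabilities are made-up values for illustration.

```python
def top1(values):
    """Return (largest value, its index), mirroring topk(1) on a single row."""
    best_index = max(range(len(values)), key=lambda i: values[i])
    return values[best_index], best_index

# Hypothetical subset of categories with made-up log-probabilities.
categories = ["German", "Arabic", "Vietnamese"]
log_probs = [-2.3, -0.2, -1.9]

value, index = top1(log_probs)
print(categories[index], index)
```

Log-probabilities are negative, so the "largest" value is the one closest to zero.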


### Data Preparation

In [4]:
import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

def load_train_example(n_per_cls):
    category_tensors, line_tensors = [], []
    for category in all_categories:
        for line in category_lines[category][:n_per_cls]:
            category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
            line_tensor = lineToTensor(line)
            category_tensors.append(category_tensor)
            line_tensors.append(line_tensor)
    
    return category_tensors, line_tensors
    
for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

category_tensors, line_tensors = load_train_example(10)


category = Greek / line = Kanavos
category = Scottish / line = Mackay
category = Greek / line = Arvanitoyannis
category = English / line = Charge
category = Spanish / line = Garza
category = French / line = Degarmo
category = Russian / line = Vanzha
category = Scottish / line = Jamieson
category = Italian / line = Albrici
category = Greek / line = Dertilis


In [5]:
## Should NOT change this code block
def train(category_tensor, line_tensor):
    hidden = torch.zeros(1, N_HIDDEN).cuda()
    category_tensor = category_tensor.cuda()
    line_tensor = line_tensor.cuda()

    rnn.zero_grad()
    
    output = rnn(line_tensor, hidden)
    
    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-LEARNING_RATE)

    return output, loss.item()
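
The loop over `rnn.parameters()` in `train` is plain stochastic gradient descent written by hand: each parameter moves against its gradient, scaled by the learning rate. A minimal sketch of that update rule on a one-parameter problem is shown below. The function `sgd_step` and the quadratic objective are hypothetical choices for illustration.

```python
def sgd_step(params, grads, learning_rate):
    """Return new parameters after one plain SGD step: p - learning_rate * g."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grad = [2 * (w[0] - 3)]
    w = sgd_step(w, grad, learning_rate=0.05)

print(w[0])  # close to the minimizer w = 3
```

In the notebook the same update is applied in place with `add_(..., alpha=-LEARNING_RATE)`, and the gradients come from `loss.backward()` rather than from a hand-written formula.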

In [6]:
import torch
import torch.nn as nn
from char_rnn import RNN
%env CUDA_VISIBLE_DEVICES = 0




rnn = RNN(N_LETTERS, N_HIDDEN, N_CATEGORIES).cuda()
print(rnn)
rnn.train()
criterion = nn.NLLLoss()
current_loss = 0
all_losses = []

for iter in range(1, N_ITERS + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = 'correct!' if guess == category else '? (%s)' % category
        print('%d %d%% (%s) %s / %s %s %s' % (iter, iter / N_ITERS * 100, loss, line, guess, correct, LEARNING_RATE))
os.makedirs('models', exist_ok=True)  # make sure the checkpoint directory exists
torch.save(rnn.state_dict(), 'models/RNN_1.pth')

env: CUDA_VISIBLE_DEVICES=0
RNN(
  (lstm): LSTM(57, 256)
  (fc): Linear(in_features=256, out_features=18, bias=True)
  (softmax): LogSoftmax(dim=1)
)
1000 1% (2.9002845287323) Favreau / Greek ? (French) 0.05
2000 2% (2.9378175735473633) Battaglia / English ? (Italian) 0.05
3000 3% (2.3564138412475586) Guthrie / English correct! 0.05
4000 4% (2.685858726501465) Mandel / English ? (German) 0.05
5000 5% (0.696927547454834) Zhilnikov / Russian correct! 0.05
6000 6% (3.220208168029785) Kouba / Japanese ? (Czech) 0.05
7000 7% (1.0778177976608276) Kerner / German correct! 0.05
8000 8% (1.9001375436782837) Pei / Italian ? (Chinese) 0.05
9000 9% (2.3661084175109863) Kloeten / Scottish ? (Dutch) 0.05
10000 10% (0.03548244759440422) Moshkov / Russian correct! 0.05
11000 11% (0.06728974729776382) Georgeakopoulos / Greek correct! 0.05
12000 12% (1.558113932609558) Wan / Vietnamese ? (Chinese) 0.05
13000 13% (1.355043888092041) Van / Korean ? (Vietnamese) 0.05
14000 14% (1.33513605594635) Bustillo /
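
One way to read the logged losses: the model ends in a `LogSoftmax` layer and is trained with `NLLLoss`, so the loss for one example is the negative log-probability assigned to the correct category. The sketch below recomputes this with the standard library only; `log_softmax` and `nll_loss` are hypothetical re-implementations for illustration, not the PyTorch functions.

```python
import math

def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable form)."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

n_categories = 18

# A model that spreads probability evenly over 18 categories scores ln(18), about 2.89.
uniform = log_softmax([0.0] * n_categories)
print(nll_loss(uniform, 0))

# A model that strongly favors category 0 has a low loss if 0 is correct, a high one otherwise.
confident = log_softmax([5.0] + [0.0] * (n_categories - 1))
print(nll_loss(confident, 0), nll_loss(confident, 1))
```

The first logged losses (around 2.90) are close to ln(18), which is what near-uniform guessing would produce. Later values near zero mean the correct language received almost all of the probability.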

In [7]:
## Should NOT change this code block
category_tensors, line_tensors = load_train_example(N_EVAL_SAMPLES_PER_CATEGORY)

total_loss = 0
n_samples = 0
n_correct = 0
for idx in range(len(category_tensors)):
    n_samples += 1
    category_tensor, line_tensor = category_tensors[idx].cuda(), line_tensors[idx].cuda()
    hidden = torch.zeros(1, N_HIDDEN).cuda()
    
    output = rnn(line_tensor, hidden)

    
    loss = criterion(output, category_tensor)
    guess, guess_i = categoryFromOutput(output)
#     print(guess_i, category_tensor)
    if guess_i == category_tensor[0].data.cpu().numpy():
        n_correct += 1
    total_loss += loss.item()
    
print("eval with %d samples" % n_samples)
print(total_loss / n_samples)
print(n_correct / n_samples)

eval with 180 samples
0.5305531033135923
0.8611111111111112
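
The printed numbers can be checked with a little arithmetic. The sketch below assumes the values shown in this notebook (18 categories, 10 evaluation names per category, and the 65% threshold from the grading criteria). The count of correct predictions is inferred from the printed accuracy, not read from the code.

```python
n_categories = 18
n_eval_samples_per_category = 10
n_samples = n_categories * n_eval_samples_per_category

# Accuracy printed by the evaluation cell above.
reported_accuracy = 0.8611111111111112
n_correct = round(reported_accuracy * n_samples)

# "Above 65%" means strictly more than 65% of the samples (integer arithmetic).
min_correct = (65 * n_samples) // 100 + 1

print(n_samples, n_correct, min_correct)
```

Keep in mind that these evaluation names are taken from the same files used for training, so this figure measures fit to the training data, not generalization.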


In [8]:
# check your saved checkpoint again

rnn = RNN(N_LETTERS, N_HIDDEN, N_CATEGORIES).cuda()
rnn.eval()
rnn.load_state_dict(torch.load('models/RNN_1.pth'), strict=True)

category_tensors, line_tensors = load_train_example(N_EVAL_SAMPLES_PER_CATEGORY)

total_loss = 0
n_samples = 0
n_correct = 0
for idx in range(len(category_tensors)):
    n_samples += 1
    category_tensor, line_tensor = category_tensors[idx].cuda(), line_tensors[idx].cuda()
    hidden = torch.zeros(1, N_HIDDEN).cuda()
    
    output = rnn(line_tensor, hidden)

    
    loss = criterion(output, category_tensor)
    guess, guess_i = categoryFromOutput(output)
#     print(guess_i, category_tensor)
    if guess_i == category_tensor[0].data.cpu().numpy():
        n_correct += 1
    total_loss += loss.item()
    
print("eval with %d samples" % n_samples)
print(total_loss / n_samples)
print(n_correct / n_samples)


eval with 180 samples
0.5305531033135923
0.8611111111111112
