## Loading the data, padding (based on 2.0)

In [70]:
import sys
import os
import numpy as np
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.nn as nn
import torch.optim as optim

## Part 1 - Sentence generation (15 points).

Convert the model in Demo 2.1 into a character-based sentence generator. (Strip out the word segmentation objective.)  The model should, given a start symbol, produce a variety of sentences that terminate with a stop symbol (you will have to add these to the data).  The sentences that it generates should be of reasonable average length compared to the sentences in the training corpus (this needn't be precise). 

Report and discuss the changes you made to the notebook using Markdown inside the notebook.
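
The generation procedure described above can be sketched as a sampling loop: start from the start symbol, repeatedly sample the next character, and stop when the stop symbol is drawn or a length cap is reached. This is only a minimal sketch. `next_char_probs` and `toy_probs` are hypothetical stand-ins for a trained model's next-character distribution, and `max_len` is an assumed safety cap, not something the assignment specifies.

```python
import random

START, STOP = "é", "ë"  # the start and stop symbols used in this notebook


def generate(next_char_probs, max_len=200, rng=None):
    """Sample characters until STOP is drawn or max_len characters exist."""
    rng = rng or random.Random()
    chars = [START]
    while len(chars) < max_len:
        # next_char_probs returns (vocabulary, probabilities) for the prefix.
        vocab, probs = next_char_probs(chars)
        nxt = rng.choices(vocab, weights=probs, k=1)[0]
        chars.append(nxt)
        if nxt == STOP:
            break
    return "".join(chars)


def toy_probs(prefix):
    # Hypothetical stand-in for a trained model: a fixed distribution
    # that ignores the prefix.
    return ["我", "你", STOP], [0.4, 0.4, 0.2]


sentence = generate(toy_probs, rng=random.Random(0))
```

Sampling from the distribution, rather than always taking the most likely character, is what yields a variety of sentences from the same start symbol.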

## Part 2 - Dual objectives (10 points)

Copy the notebook from part 1 and augment the copy by adding back the word segmentation objective, as a second objective with its own loss.  (You could also in theory do Part 1 and Part 2 in reverse, by adding sentence generation with dual objectives first and then stripping out the word segmentation objective; this is equivalent.)  Note that multiple losses can be combined by simple, possibly weighted addition -- backpropagation works entirely correctly on the combined loss.
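
As a rough illustration of combining two losses by weighted addition, the sketch below uses two linear heads on a shared hidden representation. All tensor shapes, the random stand-in for the LSTM output, and the weight `seg_weight` are assumptions made for the example and are not taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, hidden_size, vocab = 4, 6, 8, 10

# Stand-in for the shared LSTM output (hypothetical shapes).
hidden = torch.randn(batch, seq_len, hidden_size, requires_grad=True)
gen_head = nn.Linear(hidden_size, vocab)  # next-character prediction
seg_head = nn.Linear(hidden_size, 2)      # word-boundary prediction

gen_targets = torch.randint(1, vocab, (batch, seq_len))
seg_targets = torch.randint(0, 2, (batch, seq_len))

gen_loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # 0 = padding character
seg_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)  # -1 = padded label

# CrossEntropyLoss expects (batch, classes, seq), hence the permute.
gen_loss = gen_loss_fn(gen_head(hidden).permute(0, 2, 1), gen_targets)
seg_loss = seg_loss_fn(seg_head(hidden).permute(0, 2, 1), seg_targets)

seg_weight = 0.5  # hypothetical weighting of the second objective
total_loss = gen_loss + seg_weight * seg_loss
total_loss.backward()  # gradients flow into both heads and the shared input
```

A single `backward()` call on the sum sends gradients through both heads and into the shared layers, so one optimizer step updates the model for both objectives.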

Report and discuss the changes you made to the notebook using Markdown inside the notebook.

## Part 3 - Analysis (5 points)

You now have three models.  The original word segmentation model, a sentence generation model, and a dual sentence-generation/word segmentation model. 

Compare the performance on the test data of the original word segmentation model between the original objective and the dual objective model.  In how many iterations do the models converge?  What are their final F1 and accuracy scores once they've converged? Are they any different?  If so, why?

Make the same comparison between the sentence generation model and the dual-objective model, except the performance measure is the per-word perplexity on the text corpus.
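
The assignment does not spell out how per-word perplexity should be computed for a character-level model. One common reading, assumed here, is to sum the negative log-likelihood (in nats, as `NLLLoss` with `LogSoftmax` produces when summed) over all characters and normalise by the number of words instead of the number of characters. The function name and the toy numbers are illustrative only.

```python
import math


def perplexity(total_nll, n_units):
    """exp of the average negative log-likelihood (in nats) per unit."""
    return math.exp(total_nll / n_units)


# Toy check: a model that is uniform over 4 characters, scoring
# 6 characters that form 3 words.
n_chars, n_words = 6, 3
total_nll = n_chars * math.log(4)

per_char = perplexity(total_nll, n_chars)
per_word = perplexity(total_nll, n_words)
```

In this toy case each word has two characters, so the per-word perplexity is the per-character perplexity squared.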

Report your findings in one of the notebooks.

## Part Bonus - Embeddings (15 points)

The training process trains character embedding vectors in the nn.Embedding layer, the indices of whose weights correspond to the indices of the character vocabulary.  Explore the question, with dimensionality reduction and scatter plots, of whether characters that are more likely to appear at the beginning of a Chinese word are also more similar to one another than characters more likely to appear in the middle or end. You can do this process on the original word segmentation model or on the dual objective -- your choice.  Report your findings in a separate Markdown file.
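
One possible way to approach this, sketched under assumptions: project the embedding matrix to two dimensions with PCA and colour each character by how often it begins a word. The random matrix below is a stand-in for the trained weights (which could be obtained with something like `model.emb.weight.detach().cpu().numpy()`), and `begin_ratio` assumes each text is aligned character by character with its label list, which only holds if start and stop symbols are left out.

```python
from collections import Counter

import numpy as np


def pca_2d(weights):
    """Project the rows of `weights` onto their first two principal components."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def begin_ratio(pairs):
    """Fraction of occurrences in which each character starts a word."""
    begins, totals = Counter(), Counter()
    for text, labels in pairs:
        for ch, label in zip(text, labels):
            totals[ch] += 1
            begins[ch] += label
    return {ch: begins[ch] / totals[ch] for ch in totals}


rng = np.random.default_rng(0)
fake_weights = rng.normal(size=(50, 200))  # stand-in for trained embeddings
coords = pca_2d(fake_weights)

toy_pairs = [("然而，", [1, 0, 1]), ("而且", [1, 0])]
ratios = begin_ratio(toy_pairs)
```

The two columns of `coords` can then be passed to a scatter plot, with the values from `begin_ratio` used as the colour of each point.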

In [52]:
def read_chinese_data(inputfilename):
    # Start and stop symbols: two single characters that do not occur in the dataset.
    start_symbol = "é"
    stop_symbol = "ë"

    sentences = []
    collection_words = [start_symbol]
    collection_labels = []

    # Read as UTF-8 explicitly so the result does not depend on the system locale.
    with open(inputfilename, "r", encoding="utf-8") as inputfile:
        for line in inputfile:
            # Example line: 1	看似	看似	AUX	VV	_	2	cop	_	SpaceAfter=No

            # Skip comment lines.
            if line[0] == '#':
                continue
            columns = line.split()
            # columns = ['1', '看似', '看似', 'AUX', 'VV', '_', '2', 'cop', '_', 'SpaceAfter=No']

            # An empty line marks the end of a sentence.
            if columns == []:
                sentences.append((''.join(collection_words) + stop_symbol, collection_labels))
                collection_words = [start_symbol]
                collection_labels = []
                continue

            # Append the word form (second column) to the current sentence.
            collection_words.append(columns[1])

            # Label 1 for the first character of a word, 0 for each following character.
            # The start and stop symbols get no label, so the sentence string is
            # two characters longer than its label list.
            collection_labels += [1] + ([0] * (len(columns[1]) - 1))

    return sentences

I first wanted to use descriptive start and stop markers (for example "< start >" and "< end >"), but multi-character markers would be harder to handle in a model that reads and generates text one character at a time. I therefore chose two single characters, "é" and "ë", which do not occur in the Chinese dataset. I chose letters rather than punctuation because the dataset may already contain symbols such as "!#?./;%".
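
The claim that a candidate symbol is absent from the data can be checked directly. The helper below is a small sketch and is not part of the original notebook. `unused_symbols` is a hypothetical name and the toy texts stand in for the real sentences.

```python
def unused_symbols(candidates, texts):
    """Return the candidate symbols that never occur in any of the texts."""
    seen = set("".join(texts))
    return [c for c in candidates if c not in seen]


# Toy stand-in for the corpus; 's' occurs in it, 'é' and 'ë' do not.
toy_texts = ["然而，這樣", "問題。s"]
free = unused_symbols(["é", "ë", "s"], toy_texts)
```

Run against the raw sentence texts (before the symbols are added), an empty result for a candidate would mean it is already taken and another character should be chosen.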

In [54]:
train_sentences = read_chinese_data('/scratch/lt2316-h20-resources/zh_gsd-ud-train.conllu')

In [55]:
test_sentences = read_chinese_data('/scratch/lt2316-h20-resources/zh_gsd-ud-test.conllu')

In [57]:
def index_chars(sentences):
    # Collect every distinct character that occurs in the sentences.
    char_set = set(''.join(sentences))
    # Index 0 is reserved for padding.
    char_list = [0] + list(char_set)
    char_index = {char: i for i, char in enumerate(char_list)}
    return char_list, char_index

In [58]:
char_list, char_index = index_chars([x[0] for x in train_sentences + test_sentences])

In [59]:
char_list

[0,
 '傭',
 '貶',
 '翌',
 '鍛',
 '烯',
 '諧',
 '尬',
 '晴',
 '酆',
 '餘',
 '勒',
 'é',
 '積',
 '弧',
 '燄',
 '十',
 '體',
 '城',
 '晃',
 '襄',
 '蜀',
 '患',
 '芭',
 '輻',
 '浩',
 '酸',
 '激',
 '粒',
 '籌',
 '滯',
 '植',
 '音',
 '毀',
 '疹',
 '擋',
 '擔',
 '緯',
 '呈',
 '內',
 '男',
 '醬',
 '：',
 '黨',
 '食',
 '修',
 '詔',
 '驢',
 '崎',
 '殆',
 '誤',
 '回',
 '孔',
 '隸',
 '統',
 '瀑',
 '伍',
 '妒',
 '扈',
 '懿',
 '陂',
 '溝',
 '青',
 '違',
 '貼',
 '婆',
 '銷',
 's',
 '摹',
 '擒',
 '姬',
 '灣',
 '羞',
 '章',
 '肌',
 '斥',
 '斂',
 '稱',
 '外',
 '入',
 '綉',
 '寨',
 '渚',
 '尚',
 '淮',
 '琅',
 '瑾',
 '衣',
 '航',
 '劾',
 '攪',
 '輛',
 '孟',
 '牡',
 '茸',
 '咖',
 '咨',
 '搞',
 '瓣',
 '狼',
 '綺',
 '愙',
 '佼',
 '磡',
 '粵',
 '而',
 '怡',
 '迫',
 '儲',
 '腓',
 '隨',
 '韃',
 '旁',
 '皖',
 '媽',
 '箏',
 '戌',
 '踢',
 '鐵',
 '殉',
 '漂',
 '兵',
 'i',
 '聳',
 '鰺',
 '邳',
 '濞',
 '寢',
 '鴉',
 '才',
 '緬',
 '跋',
 '投',
 '兒',
 '吹',
 '倫',
 '宇',
 '概',
 '兀',
 '召',
 '仲',
 '堪',
 '姻',
 '帖',
 '做',
 '紮',
 '嘆',
 '飢',
 '瀏',
 '代',
 '夕',
 '坪',
 '蔣',
 '襟',
 '佐',
 '廓',
 '逾',
 '器',
 '園',
 '頻',
 '玄',
 '熔',
 '注',
 '畹',
 '恤',
 '漲',
 '苯',


In [60]:
def convert_sentence(sentence, index):
    return [index[x] for x in sentence]

In [61]:
def pad_lengths(sentences, max_length, padding=0):
    return [x + ([padding] * (max_length - len(x))) for x in sentences]

In [62]:
def create_dataset(x, device="cpu"):
    converted = [(convert_sentence(x1[0], char_index), x1[1]) for x1 in x]
    X, y = zip(*converted)
    lengths = [len(x2) for x2 in X]
    padded_X = pad_lengths(X, max(lengths))
    Xt = torch.LongTensor(padded_X).to(device)
    padded_y = pad_lengths(y, max(lengths), padding=-1)
    yt = torch.LongTensor(padded_y).to(device)
    lengths_t = torch.LongTensor(lengths).to(device)
    return Xt, lengths_t, yt

In [63]:
train_X_tensor, train_lengths_tensor, train_y_tensor = create_dataset(train_sentences, "cuda:2")
test_X_tensor, test_lengths_tensor, test_y_tensor = create_dataset(test_sentences, "cuda:2")

## Packing the sequences for RNN

In [None]:
"""
testtensor = torch.randn((10,100,200))
testlengths = torch.randint(1, 100, (10,))
testlengths.size(), testlengths
packed = pack_padded_sequence(testtensor, testlengths, batch_first=True, enforce_sorted=False)
testtensor
packed
len(packed.batch_sizes)
unpacked = pad_packed_sequence(packed, batch_first=True, total_length=100)
unpacked
unpacked[0]
unpacked[0].size()
"""

## Batching (based on 1.0, 1.1, 1.2)

In [68]:
class Batcher:
    def __init__(self, X, lengths, y, device, batch_size=50, max_iter=None):
        self.X = X
        self.lengths = lengths # We need the lengths to efficiently use the padding.
        self.y = y
        self.device = device
        self.batch_size=batch_size
        self.max_iter = max_iter
        self.curr_iter = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.curr_iter == self.max_iter:
            raise StopIteration
        permutation = torch.randperm(self.X.size()[0], device=self.device)
        permX = self.X[permutation]
        permlengths = self.lengths[permutation]
        permy = self.y[permutation]
        splitX = torch.split(permX, self.batch_size)
        splitlengths = torch.split(permlengths, self.batch_size)
        splity = torch.split(permy, self.batch_size)
        
        self.curr_iter += 1
        return zip(splitX, splitlengths, splity)

In [26]:
"""
b = Batcher(train_X_tensor, train_lengths_tensor, train_y_tensor, torch.device('cuda:2'), max_iter=100)
testbatching = next(b)
testbatching
testbatch = next(testbatching)
testbatch
"""

## Modeling

In [32]:
"""
emb = nn.Embedding(len(char_list), 200, 0).to("cuda:2")
testX, testlengths, testy = testbatch
testembs = emb(testX)
testembs
testembs.size()
testembs.device
testlstm = nn.LSTM(200, 150, batch_first=True).to("cuda:2")
testembspadded = pack_padded_sequence(testembs, testlengths.to("cpu"), batch_first=True, enforce_sorted=False)
testoutput, teststate = testlstm(testembspadded)
testoutput
testunpacked = pad_packed_sequence(testoutput, batch_first=True)
testunpacked[0].size()
testsigm = nn.Sigmoid().to("cuda:2")
testoutput2 = testsigm(testunpacked[0])
testoutput2.size()
testlin = nn.Linear(150, 2).to("cuda:2")
testoutput3 = testlin(testoutput2)
testoutput3.size()
testsoft = nn.LogSoftmax(2).to("cuda:2")
testoutput4 = testsoft(testoutput3)
testoutput4
testy_short = testy[:, :max(testlengths)]
testy_short
testy_short.size()
max(testlengths)
testpermuted = testoutput4.permute(0, 2, 1)
testpermuted
nllloss = nn.NLLLoss(ignore_index=-1).to("cuda:2")
nllloss(testpermuted, testy_short)
"""

In [71]:
class Segmenter(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        
        self.vocab_size = vocab_size
        self.emb_size = emb_size
        
        self.emb = nn.Embedding(self.vocab_size, self.emb_size, 0)
        self.lstm = nn.LSTM(self.emb_size, 150, batch_first=True)
        self.sig1 = nn.Sigmoid()
        self.lin = nn.Linear(150, 2)
        self.softmax = nn.LogSoftmax(2)
        
    def forward(self, x, lengths):
        embs = self.emb(x)
        packed = pack_padded_sequence(embs, lengths.to("cpu"), batch_first=True, enforce_sorted=False)
        output1, _ = self.lstm(packed)
        unpacked, _ = pad_packed_sequence(output1, batch_first=True)
        output2 = self.sig1(unpacked)
        output3 = self.lin(output2)
        return self.softmax(output3)
        

In [72]:
def train(X, lengths, y, vocab_size, emb_size, batch_size, epochs, device, model=None):
    b = Batcher(X, lengths, y, device, batch_size=batch_size, max_iter=epochs)
    # Create a new model unless an existing one was passed in for further training.
    if model is None:
        m = Segmenter(vocab_size, emb_size).to(device)
    else:
        m = model
    loss = nn.NLLLoss(ignore_index=-1)  # -1 marks padded label positions
    optimizer = optim.Adam(m.parameters(), lr=0.005)
    epoch = 0
    for split in b:
        tot_loss = 0.0
        for batch in split:
            optimizer.zero_grad()
            o = m(batch[0], batch[1])
            # NLLLoss expects (batch, classes, seq); trim the labels to the
            # longest sequence in this batch.
            l = loss(o.permute(0,2,1), batch[2][:, :max(batch[1])])
            # Accumulate a plain float so the running total keeps no autograd history.
            tot_loss += l.item()
            l.backward()
            optimizer.step()
        print("Total loss in epoch {} is {}.".format(epoch, tot_loss))
        epoch += 1
    return m

In [77]:
model = train(train_X_tensor, train_lengths_tensor, train_y_tensor, len(char_list), 200, 50, 100, "cuda:2")
torch.save(model, 'assignment2.pt')

Total loss in epoch 0 is 37.636146545410156.
Total loss in epoch 1 is 26.489946365356445.
Total loss in epoch 2 is 22.705333709716797.
Total loss in epoch 3 is 19.984140396118164.
Total loss in epoch 4 is 17.6546573638916.
Total loss in epoch 5 is 15.47716999053955.
Total loss in epoch 6 is 13.739563941955566.
Total loss in epoch 7 is 12.198772430419922.
Total loss in epoch 8 is 10.802000045776367.
Total loss in epoch 9 is 9.713250160217285.
Total loss in epoch 10 is 9.795398712158203.
Total loss in epoch 11 is 8.993219375610352.
Total loss in epoch 12 is 7.485962390899658.
Total loss in epoch 13 is 6.4367451667785645.
Total loss in epoch 14 is 5.855354309082031.
Total loss in epoch 15 is 5.485112190246582.
Total loss in epoch 16 is 5.360109806060791.
Total loss in epoch 17 is 5.597396373748779.
Total loss in epoch 18 is 6.042062759399414.
Total loss in epoch 19 is 5.378505706787109.
Total loss in epoch 20 is 5.153847694396973.
Total loss in epoch 21 is 4.400579452514648.
Total loss in

## Evaluation

In [64]:
model.eval()

Segmenter(
  (emb): Embedding(3648, 200, padding_idx=0)
  (lstm): LSTM(200, 150, batch_first=True)
  (sig1): Sigmoid()
  (lin): Linear(in_features=150, out_features=2, bias=True)
  (softmax): LogSoftmax(dim=2)
)

In [65]:
with torch.no_grad():
    rawpredictions = model(test_X_tensor, test_lengths_tensor)

In [66]:
rawpredictions.size()

torch.Size([500, 156, 2])

In [67]:
rawpredictions

tensor([[[-5.5307e+00, -3.9711e-03],
         [-4.9722e-04, -7.6067e+00],
         [-1.7355e+01,  0.0000e+00],
         ...,
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02]],

        [[-1.1877e+01, -6.9141e-06],
         [-2.7702e-02, -3.6001e+00],
         [-7.6969e+00, -4.5432e-04],
         ...,
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02]],

        [[-6.5656e+00, -1.4089e-03],
         [-1.6689e-06, -1.3334e+01],
         [-6.5181e+00, -1.4776e-03],
         ...,
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02]],

        ...,

        [[-5.2260e+00, -5.3896e-03],
         [-5.4596e-05, -9.8150e+00],
         [-1.3359e+01, -1.5497e-06],
         ...,
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02],
         [-3.2601e+00, -3.9142e-02]],

        [[-1.4282e+01, -5.9605e-07

In [68]:
import math
math.log2(0.9), math.log2(0.8)

(-0.15200309344504995, -0.3219280948873623)

In [69]:
predictions = torch.argmax(rawpredictions, 2)

In [70]:
predictions

tensor([[1, 0, 1,  ..., 1, 1, 1],
        [1, 0, 1,  ..., 1, 1, 1],
        [1, 0, 1,  ..., 1, 1, 1],
        ...,
        [1, 0, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 0, 1,  ..., 1, 1, 1]], device='cuda:2')

In [71]:
predictions.size()

torch.Size([500, 156])

In [72]:
predictions[0]

tensor([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2')

In [73]:
test_sentences[0]

('然而，這樣的處理也衍生了一些問題。', [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1])

In [74]:
test_y_tensor[0]

tensor([ 1,  0,  1,  1,  0,  1,  1,  0,  1,  1,  0,  1,  1,  0,  1,  0,  1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], device='cuda:2')

In [75]:
test_lengths_tensor[0]

tensor(17, device='cuda:2')

In [76]:
collectpreds = []
collecty = []

In [77]:
for i in range(test_X_tensor.size(0)):
    collectpreds.append(predictions[i][:test_lengths_tensor[i]])
    collecty.append(test_y_tensor[i][:test_lengths_tensor[i]])

In [78]:
collecty

[tensor([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1], device='cuda:2'),
 tensor([1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
         0, 1, 1, 1, 0, 1, 0, 1], device='cuda:2'),
 tensor([1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
         1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1], device='cuda:2'),
 tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
         1, 1, 0, 1, 1, 1, 1, 1], device='cuda:2'),
 tensor([1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1], device='cuda:2'),
 tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
         1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
         1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], device='cuda:2'),
 tensor([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 1, 1], device='c

In [79]:
allpreds = torch.cat(collectpreds)

In [80]:
allpreds.size()

torch.Size([19206])

In [81]:
classes = torch.cat(collecty)

In [82]:
allpreds, classes

(tensor([1, 0, 1,  ..., 1, 0, 1], device='cuda:2'),
 tensor([1, 0, 1,  ..., 1, 0, 1], device='cuda:2'))

In [83]:
classes.size()

torch.Size([19206])

In [84]:
classes = classes.float()
allpreds = allpreds.float()

In [85]:
# Confusion counts for the positive class (1 = the character begins a word).
tp = (classes * allpreds).sum()                                      # label 1, predicted 1
fp = ((~classes.bool()).float() * allpreds).sum()                    # label 0, predicted 1
tn = ((~classes.bool()).float() * (~allpreds.bool()).float()).sum()  # label 0, predicted 0
fn = (classes * (~allpreds.bool()).float()).sum()                    # label 1, predicted 0

tp, fp, tn, fn

(tensor(11339., device='cuda:2'),
 tensor(776., device='cuda:2'),
 tensor(6418., device='cuda:2'),
 tensor(673., device='cuda:2'))

In [86]:
accuracy = (tp + tn) / (tp + fp + tn + fn)
accuracy

tensor(0.9246, device='cuda:2')

In [87]:
recall = tp / (tp + fn)
recall

tensor(0.9440, device='cuda:2')

In [88]:
precision = tp / (tp + fp)
precision

tensor(0.9359, device='cuda:2')

In [89]:
f1 = (2 * recall * precision) / (recall + precision)
f1

tensor(0.9399, device='cuda:2')
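
As a quick arithmetic check, accuracy and F1 can be recomputed in plain Python from the counts printed above. Both measures depend only on the sum of the two error counts (673 and 776), so the check does not rely on which of them is the false positives. The identity F1 = 2·TP / (2·TP + FP + FN) is the standard equivalent of the precision/recall form.

```python
# Counts taken from the evaluation output above.
tp, tn = 11339, 6418
errors = 673 + 776  # false positives plus false negatives

total = tp + tn + errors
accuracy = (tp + tn) / total
f1 = 2 * tp / (2 * tp + errors)

print(total, round(accuracy, 4), round(f1, 4))
```

The total matches the 19206 labelled characters in the test set, and the rounded values agree with the tensors reported above.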