# Project: Translation-based Text Style Transfer

## Set-up

Run the following commands to install the required libraries before proceeding.

1. For Translation Dataset:

python -m spacy download fr_core_news_sm

python -m spacy download en_core_web_sm


2. For Grammar Checker:

pip install --upgrade language-tool-python

In [1]:
import math
import torchtext
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.utils import download_from_url, extract_archive
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torch import Tensor
import io
import time

# Sentence Tokenizer
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

# Grammar Checker
import language_tool_python
grammar_tool = language_tool_python.LanguageTool('en-US')

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

[nltk_data] Downloading package punkt to /Users/kisukjang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load Data


### Vocab

In [2]:
eng_train_url = 'train.en.gz'
eng_val_url = 'val.en.gz'
eng_test_url = 'test_2016_flickr.en.gz'
eng_spacy_lng = 'en_core_web_sm'

dst_train_url = 'train.fr.gz'
dst_val_url = 'val.fr.gz'
dst_test_url = 'test_2016_flickr.fr.gz'
dst_spacy_lng = 'fr_core_news_sm'

In [3]:
from torchtext.datasets import AG_NEWS, YelpReviewFull

url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = (eng_train_url, dst_train_url)
val_urls = (eng_val_url, dst_val_url)
test_urls = (eng_test_url, dst_test_url)

train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

eng_lng_tokenizer = get_tokenizer('spacy', language=eng_spacy_lng)
dst_lng_tokenizer = get_tokenizer('spacy', language=dst_spacy_lng)

# Torchtext dataset
ag_train_iter, ag_test_iter = AG_NEWS()
ag_train_list = list(ag_train_iter)
ag_test_list = list(ag_test_iter)
ag_train_list_sub = ag_train_list[:int(len(ag_train_list) / 24)]
ag_test_list_sub = ag_test_list[:int(len(ag_test_list) / 15)]
yelp_train_iter, yelp_test_iter = YelpReviewFull()
yelp_train_list = list(yelp_train_iter)
yelp_test_list = list(yelp_test_iter)
yelp_train_list_sub = yelp_train_list[:int(len(yelp_train_list) / 130)]
yelp_test_list_sub = yelp_test_list[:int(len(yelp_test_list) / 15)]

print("AG_NEWS Train size:", len(ag_train_list))
print("AG_NEWS Test size:", len(ag_test_list))
print("YELP FULL Train size:", len(yelp_train_list))
print("YELP FULL Test size:", len(yelp_test_list))

AG_NEWS Train size: 120000
AG_NEWS Test size: 7600
YELP FULL Train size: 650000
YELP FULL Test size: 50000


Destination language vocab

In [4]:
counter = Counter()
with io.open(train_filepaths[1], encoding="utf8") as f:
    for string_ in f:
        counter.update(dst_lng_tokenizer(string_))
dst_lng_vocab = Vocab(counter, min_freq=1, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

DST_VOCAB_SIZE = len(dst_lng_vocab)
print("Dest Vocab size:", DST_VOCAB_SIZE)

Dest Vocab size: 11510


English language vocab

In [5]:
counter = Counter()
with io.open(train_filepaths[0], encoding="utf8") as f:
    for string_ in f:
        counter.update(eng_lng_tokenizer(string_))
for i in range(len(ag_train_list)):
    if "\\" in ag_train_list[i][1]:
        ag_train_list[i] = (ag_train_list[i][0], ag_train_list[i][1].replace("\\", " "))
    counter.update(eng_lng_tokenizer(ag_train_list[i][1]))
for i in range(len(yelp_train_list)):
    if "\\" in yelp_train_list[i][1]:
        yelp_train_list[i] = (yelp_train_list[i][0], yelp_train_list[i][1].replace("\\n\\n", " "))
        yelp_train_list[i] = (yelp_train_list[i][0], yelp_train_list[i][1].replace("\\n", " "))
        yelp_train_list[i] = (yelp_train_list[i][0], yelp_train_list[i][1].replace("\\", ""))
    counter.update(eng_lng_tokenizer(yelp_train_list[i][1]))

eng_lng_vocab = Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

ENG_VOCAB_SIZE = len(eng_lng_vocab)
print("English Vocab size:", ENG_VOCAB_SIZE)

English Vocab size: 408648
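
The cleanup loops in the cell above strip escaped newline markers before tokenizing. A minimal sketch of the Yelp branch, assuming the raw reviews contain the literal two-character sequence backslash + `n` (as the `replace` calls suggest); `clean_review` is a hypothetical helper name, not part of the notebook:

```python
def clean_review(text: str) -> str:
    """Apply the same three replacements, in the same order, as the Yelp loop."""
    text = text.replace("\\n\\n", " ")  # literal "\n\n" paragraph marker -> one space
    text = text.replace("\\n", " ")     # remaining literal "\n" marker -> one space
    return text.replace("\\", "")       # drop any other stray backslash

# Raw string: every backslash below is a literal character, not an escape.
raw = r'Great food.\n\nSlow service\nbut nice staff \"anyway\".'
print(clean_review(raw))
```

The order matters: replacing the double marker first avoids leaving two consecutive spaces where a paragraph break used to be.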


### Index

Define EOS, BOS, PAD indexes

In [6]:
PAD_IDX = eng_lng_vocab['<pad>']
BOS_IDX = eng_lng_vocab['<bos>']
EOS_IDX = eng_lng_vocab['<eos>']

1
2
3
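
The values 1, 2, 3 follow from the order of the `specials` list passed when the vocabulary was built. The sketch below is a plain-Python stand-in (not the torchtext `Vocab` class) that numbers the specials first and the remaining tokens by descending frequency; `build_stoi` is a hypothetical name and the tie-breaking rule is an assumption:

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_stoi(counter, specials):
    """Give the specials the lowest indices, then tokens by descending frequency."""
    itos = list(specials)
    # Sort alphabetically first so that frequency ties are broken deterministically.
    for token, _ in sorted(sorted(counter.items()), key=lambda kv: -kv[1]):
        if token not in specials:
            itos.append(token)
    return {tok: i for i, tok in enumerate(itos)}

stoi = build_stoi(Counter("a dog and a cat".split()), SPECIALS)
print(stoi['<pad>'], stoi['<bos>'], stoi['<eos>'])  # 1 2 3
```

Because both vocabularies are built with the same `specials` list, the indices taken from the English vocabulary are reused for the destination language.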


## Translation Model

(Adapted from https://pytorch.org/tutorials/beginner/translation_transformer.html)

### Process Data

Generate translation dataset

In [7]:
# Init tokens
def data_process(filepaths):
    raw_eng_lng_iter = iter(io.open(filepaths[0], encoding="utf8"))
    raw_dst_lng_iter = iter(io.open(filepaths[1], encoding="utf8"))
    data = []
    for (raw_eng, raw_dst) in zip(raw_eng_lng_iter, raw_dst_lng_iter):
        eng_lng_tensor_ = torch.tensor([eng_lng_vocab[token] for token in eng_lng_tokenizer(raw_eng.rstrip("\n"))],
                                dtype=torch.long)
        dst_lng_tensor_ = torch.tensor([dst_lng_vocab[token] for token in dst_lng_tokenizer(raw_dst.rstrip("\n"))],
                                dtype=torch.long)
        data.append((eng_lng_tensor_, dst_lng_tensor_))
    return data


train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)

### Load Data

Construct dataloaders with a batch size

In [8]:
from torch.nn.utils.rnn import pad_sequence

BATCH_SIZE = 128

def generate_batch(data_batch):
    dst_lng_batch, src_lng_batch = [], []
    for (src_lng_item, dst_lng_item) in data_batch:
        dst_lng_batch.append(torch.cat([torch.tensor([BOS_IDX]), dst_lng_item, torch.tensor([EOS_IDX])], dim=0))
        src_lng_batch.append(torch.cat([torch.tensor([BOS_IDX]), src_lng_item, torch.tensor([EOS_IDX])], dim=0))
    dst_lng_batch = pad_sequence(dst_lng_batch, padding_value=PAD_IDX)
    src_lng_batch = pad_sequence(src_lng_batch, padding_value=PAD_IDX)
    return src_lng_batch, dst_lng_batch

train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                       shuffle=True, collate_fn=generate_batch)

print("Train count:", len(train_iter.dataset))

Train count: 29000


Check how the data looks

In [9]:
for idx, (src, tgt) in enumerate(train_iter):
    src = src.to(device)
    tgt = tgt.to(device)
    print(src)
    print(tgt)
    break

tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [ 161, 2132, 1903,  ...,  161,  161,  161],
        [ 360, 2683, 2504,  ...,  520,  882,  217],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
tensor([[  2,   2,   2,  ...,   2,   2,   2],
        [  8,  17,  71,  ...,   8,  17,   8],
        [187, 177,  39,  ...,  14,  32,  73],
        ...,
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1]])
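
The printed tensors are time-major (sequence length x batch size): the first row is all `<bos>` (2) and the trailing rows are `<pad>` (1). A small plain-Python sketch of that layout, using lists instead of tensors; `pad_time_major` is a hypothetical helper that only imitates what `pad_sequence` does here:

```python
PAD, BOS, EOS = 1, 2, 3  # indices shown in the Index section

def pad_time_major(sequences, pad_value):
    """Pad to the longest sequence, then return rows where row t holds token t of every sequence."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_value] * (max_len - len(s)) for s in sequences]
    return [list(row) for row in zip(*padded)]

batch = [[BOS, 161, 360, EOS], [BOS, 2132, EOS]]
rows = pad_time_major(batch, PAD)
for row in rows:
    print(row)
```

Each column is one sentence, so the shorter sentence ends with `<eos>` followed by padding.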


### Model

Transformer-based Translation Model class

In [10]:
from torch.nn import (TransformerEncoder, TransformerDecoder,
                      TransformerEncoderLayer, TransformerDecoderLayer)


class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
                 emb_size: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward:int = 512, dropout:float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        encoder_layer = TransformerEncoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)

        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor,
                tgt_mask: Tensor, src_padding_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        memory = self.transformer_encoder(src_emb, src_mask, src_padding_mask)
        outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                        tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer_encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer_decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

Position Encoder Class

In [11]:
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding +
                            self.pos_embedding[:token_embedding.size(0),:])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size
    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
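
To make the sinusoidal table concrete, here is a plain-Python sketch of the same formula used in `PositionalEncoding`, assuming an even `emb_size`. It builds only the (maxlen x emb_size) table; the module above additionally inserts a batch dimension and applies dropout.

```python
import math

def positional_encoding(maxlen: int, emb_size: int):
    """Even columns hold sin(pos * den), odd columns hold cos(pos * den)."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, emb_size, 2):
            den = math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(pos * den)
            table[pos][i + 1] = math.cos(pos * den)
    return table

pe = positional_encoding(maxlen=3, emb_size=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
```

Lower column pairs oscillate quickly with position, higher pairs slowly, which gives every position a distinct pattern.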

Masks for source and target

In [12]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
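
The two kinds of mask can be illustrated without tensors. This sketch mirrors the logic of `generate_square_subsequent_mask` and the padding masks with nested lists; the function names are hypothetical:

```python
NEG_INF = float('-inf')
PAD = 1  # index shown in the Index section

def causal_mask(sz: int):
    """Row i may attend to columns 0..i (value 0.0); later columns are blocked with -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(sz)] for i in range(sz)]

def padding_mask(sequence):
    """True marks a padded position that attention should ignore."""
    return [tok == PAD for tok in sequence]

mask = causal_mask(3)
for row in mask:
    print(row)
print(padding_mask([2, 161, 3, 1, 1]))
```

The source mask in `create_mask` is all `False` because the encoder may attend to every source position; only padding is hidden from it.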

### Train

Define model parameters and instantiate model

In [13]:
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
NUM_EPOCHS = 25


transformer_model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                                 EMB_SIZE, ENG_VOCAB_SIZE, DST_VOCAB_SIZE,
                                 FFN_HID_DIM)

for p in transformer_model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer_model = transformer_model.to(device)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(
    transformer_model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9
)

Define train and evaluation functions

In [14]:
def train_epoch(model, train_iter, optimizer):
    model.train()
    losses = 0
    for idx, (src, tgt) in enumerate(train_iter):
        src = src.to(device) # Output: maxLen(seq_src) x Batch_Size
        tgt = tgt.to(device) # Output: maxLen(seq_tgt) x Batch_Size

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,
                                src_padding_mask, tgt_padding_mask, src_padding_mask)
        # Output: (maxLen(seq_tgt)-1) x Batch_Size x DST_LNG_VOCAB

        optimizer.zero_grad()

        tgt_out = tgt[1:,:]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1)) # ((maxLen(seq_tgt)-1) x Batch_Size) x DST_LNG_VOCAB, ((maxLen(seq_tgt)-1) x Batch_Size)
        loss.backward()

        optimizer.step()

        losses += loss.item()
    return losses / len(train_iter)


def evaluate(model, val_iter):
    model.eval()
    losses = 0
    for idx, (src, tgt) in enumerate(val_iter):
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,
                                  src_padding_mask, tgt_padding_mask, src_padding_mask)
        tgt_out = tgt[1:,:]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()
    return losses / len(val_iter)
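
Both functions use teacher forcing: the decoder receives the target without its last token and is scored against the target without its first token. A tiny sketch with one unpadded sentence (the token ids are illustrative):

```python
BOS, EOS = 2, 3

tgt = [BOS, 8, 187, 14, EOS]
tgt_input = tgt[:-1]  # what the decoder sees
tgt_out = tgt[1:]     # what it must predict at each step

for step, (seen, expected) in enumerate(zip(tgt_input, tgt_out)):
    print(f"step {step}: input {seen} -> predict {expected}")
```

In a padded batch the same slicing applies row-wise, and positions whose expected token is `<pad>` are skipped by `ignore_index=PAD_IDX` in the loss.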

Train Translation Model

In [15]:
for epoch in range(1, NUM_EPOCHS+1):
    start_time = time.time()
    train_loss = train_epoch(transformer_model, train_iter, optimizer)
    end_time = time.time()
    val_loss = evaluate(transformer_model, valid_iter)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
          f"Epoch time = {(end_time - start_time):.3f}s"))

### Test

In [16]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    for i in range(max_len-1):
        memory = memory.to(device)
        memory_mask = torch.zeros(ys.shape[0], memory.shape[0]).to(device).type(torch.bool)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                                    .type(torch.bool)).to(device)
        out = model.decode(ys, memory, tgt_mask) # Output: len(YS) x 1 x 512
        out = out.transpose(0, 1) # Output: 1 x len(YS) x 512
        prob = model.generator(out[:, -1]) # Output: 1 x 512 -> 1 x DST_LNG_VOCAB
        _, next_word = torch.max(prob, dim = 1) # Output: 1
        next_word = next_word.item()

        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0) # Output: len(YS) x 1 -> (len(YS) + 1) x 1
        if next_word == EOS_IDX:
            break
    return ys


def translate(model, src, src_vocab, tgt_vocab, src_tokenizer):
    model.eval()
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer(src)]+ [EOS_IDX]
    num_tokens = len(tokens)
    src = (torch.LongTensor(tokens).reshape(num_tokens, 1) )
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")
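
The control flow of `greedy_decode` can be seen in isolation with a toy scorer. In this sketch `toy_scores` is a hypothetical stand-in for the decoder plus `generator` (it is not a model); only the loop structure matches the function above:

```python
BOS, EOS = 2, 3

def toy_scores(prefix):
    """Hypothetical scorer: prefers token len(prefix) + 3, then EOS once the prefix has 3 tokens."""
    scores = [0.0] * 8
    target = EOS if len(prefix) >= 3 else len(prefix) + 3
    scores[target] = 1.0
    return scores

def greedy(score_fn, max_len, start_symbol=BOS):
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        # Greedy choice: take the single highest-scoring token.
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
        if next_word == EOS:
            break
    return ys

print(greedy(toy_scores, max_len=10))  # [2, 4, 5, 3]
```

Decoding stops either at `<eos>` or when `max_len` is reached, which is why `translate` passes `num_tokens + 5` as a length budget.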

In [17]:
translate(transformer_model, "Hello", eng_lng_vocab, dst_lng_vocab, eng_lng_tokenizer)

' indiquant desséchées desséchées desséchées desséchées desséchées filme'

### Save & Load Model

Save model

In [18]:
PATH = './translation_model_eng_to_fr.pt'

torch.save(transformer_model, PATH)

Load saved model

In [19]:
PATH = './translation_model_eng_to_fr.pt'

transformer_model = torch.load(PATH)

### Evaluate

Compute sentence BLEU on translated texts and target texts

In [21]:
from nltk.translate.bleu_score import sentence_bleu

def get_sentence_bleu():
    transformer_model.eval()

    total_score = 0.0
    for idx, (src, tgt) in enumerate(test_data):
        if idx % 100 == 0:
            print('Processing:', idx)
        eng = torch.cat([torch.tensor([BOS_IDX]), src, torch.tensor([EOS_IDX])], dim=0)
        eng = (torch.LongTensor(eng).reshape(eng.shape[0], 1) )
        eng_mask = (torch.zeros(eng.shape[0], eng.shape[0])).type(torch.bool)
        tgt_tokens = greedy_decode(transformer_model,  eng, eng_mask, max_len=eng.shape[0] + 5, start_symbol=BOS_IDX).flatten()
        eng_to_tgt = [dst_lng_vocab.itos[tok] for tok in tgt_tokens if dst_lng_vocab.itos[tok] != '<bos>' and dst_lng_vocab.itos[tok] != '<eos>' and dst_lng_vocab.itos[tok] != '<pad>']
        tgt_to_tgt = [dst_lng_vocab.itos[tok] for tok in tgt]
        score = sentence_bleu([tgt_to_tgt], eng_to_tgt)
            
        total_score += score
    return total_score / len(test_data)

print('Sentence BLEU Score:', get_sentence_bleu())
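
BLEU is built from clipped n-gram precisions. The sketch below computes only that building block in plain Python; it is not a replacement for `sentence_bleu`, which also combines several n-gram orders and applies a brevity penalty. The French tokens are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with counts clipped to the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "un chien court dans l' herbe".split()
candidate = "un chien dans l' herbe".split()
print(modified_precision(reference, candidate, 1))  # 1.0
print(modified_precision(reference, candidate, 2))  # 0.75
```

With the default weights, a hypothesis that shares no 4-gram with its reference gets a score close to zero, so short or noisy translations can pull the average down sharply.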

## Text Style Classification

(Adapted from https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)

### Process Data

Re-label data for text style classification

AG News is labeled as 1, and Yelp Review is labeled as 2

In [22]:
new_ag_train_list = []
new_ag_test_list = []
new_yelp_train_list = []
new_yelp_test_list = []
for i in range(len(ag_train_list)):
    new_ag_train_list.append((1, ag_train_list[i][1]))
for i in range(len(ag_test_list)):
    new_ag_test_list.append((1, ag_test_list[i][1]))
for i in range(len(yelp_train_list)):
    new_yelp_train_list.append((2, yelp_train_list[i][1]))
for i in range(len(yelp_test_list)):
    new_yelp_test_list.append((2, yelp_test_list[i][1]))
    
train_dataset = new_ag_train_list + new_yelp_train_list
test_dataset = new_ag_test_list + new_yelp_test_list

Generate DataLoader with a batch size

In [23]:
BATCH_SIZE = 64 # batch size for training

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.cat([torch.tensor([BOS_IDX]), 
                                    torch.tensor(text_pipeline(_text), dtype=torch.int64), 
                                    torch.tensor([EOS_IDX])], dim=0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.transpose(pad_sequence(text_list, padding_value=PAD_IDX), 0, 1)
    return label_list.to(device), text_list.to(device)

num_class = 2
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

print("Train size:", len(train_dataloader.dataset))
print("Val size:", len(valid_dataloader.dataset))
print("Test size:", len(test_dataloader.dataset))

Train size: 731500
Val size: 38500
Test size: 57600


In [24]:
text_pipeline = lambda x: [eng_lng_vocab[token] for token in eng_lng_tokenizer(x)]
label_pipeline = lambda x: int(x) - 1

In [25]:
text_pipeline('here is the an example')

[54, 17, 5, 71, 2219]

In [26]:
label_pipeline(10)

9

### Model

FCNet-based Classification Model class

In [27]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
    
    def forward(self, text):
        embedded = self.embedding(text)
        return self.fc(embedded)
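
Conceptually the classifier averages the token embeddings of a text and feeds the result to one linear layer (`nn.EmbeddingBag` uses mean pooling by default). A plain-Python sketch with made-up weights; all numbers and the name `classify` are illustrative assumptions:

```python
def classify(token_ids, embedding, weight, bias):
    """Mean-pool the embeddings of token_ids, then apply a linear layer."""
    dim = len(embedding[0])
    pooled = [sum(embedding[t][d] for t in token_ids) / len(token_ids)
              for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]

embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 dimensions
weight = [[1.0, -1.0], [0.0, 2.0]]                # 2 classes
bias = [0.0, 0.0]

logits = classify([0, 2], embedding, weight, bias)
print(logits, "-> class", logits.index(max(logits)))
```

Since the batches in this notebook are padded before being passed to the model, padding tokens appear to be averaged in along with real tokens.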

### Train

Define train and evaluate functions

In [29]:
import time

def train(dataloader):
    classification_model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = classification_model(text) # BATCH_SIZE x maxLen(text)
        # Output: BATCH_SIZE x C
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(classification_model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    classification_model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            predicted_label = classification_model(text)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

Train Classification Model

In [31]:
# Hyperparameters
EPOCHS = 5 # epoch
LR = 1  # learning rate
emsize = 64

classification_model = TextClassificationModel(ENG_VOCAB_SIZE, emsize, num_class).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(classification_model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)
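
The learning rate only decays in epochs where validation accuracy falls below the best value seen so far. The sketch below simulates that rule in plain Python, assuming each `scheduler.step()` multiplies the rate by `gamma` (step size 1); the accuracy values are invented for illustration:

```python
def simulate_lr(val_accuracies, lr=1.0, gamma=0.1):
    """Return the learning rate in effect after each epoch's validation check."""
    best = None
    history = []
    for acc in val_accuracies:
        if best is not None and best > acc:
            lr *= gamma   # accuracy dropped: decay
        else:
            best = acc    # new best (or first epoch): keep the rate
        history.append(lr)
    return history

history = simulate_lr([0.90, 0.95, 0.94, 0.96, 0.93])
print(history)
```

In this made-up run the rate drops after the third and fifth epochs, the two epochs that fall short of the running best.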

### Test

Evaluate trained model on test dataset

In [54]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Check Classification Model's performance

In [33]:
style_label = {1: "AG_NEWS",
               2: "YELP"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.cat([torch.tensor([BOS_IDX]), 
                                    torch.tensor(text_pipeline(text), dtype=torch.int64), 
                                    torch.tensor([EOS_IDX])], dim=0)
        text = torch.reshape(text, (1, text.shape[0]))
        output = classification_model(text)
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

classification_model = classification_model.to("cpu")
classification_model.eval()

print("This is a %s style" %style_label[predict(ex_text_str, text_pipeline)])

This is a AG_NEWS style


### Load Model

Load saved model

In [34]:
PATH_CLS = './classification_model.pt'

classification_model = torch.load(PATH_CLS)

## Text Style Generator

### Process Data

Tokenize the dataset at the sentence level and re-label it for the classification task

In [35]:
train_data = []
test_data = []
for i in range(len(ag_train_list_sub)):
    sentences = sent_tokenize(ag_train_list_sub[i][1])
    for sentence in sentences:
        text_tensor = torch.tensor([eng_lng_vocab[token] for token in eng_lng_tokenizer(sentence.rstrip("\n"))], dtype=torch.long)
        train_data.append((2, text_tensor))
for i in range(len(ag_test_list_sub)):
    sentences = sent_tokenize(ag_test_list_sub[i][1])
    for sentence in sentences:
        text_tensor = torch.tensor([eng_lng_vocab[token] for token in eng_lng_tokenizer(sentence.rstrip("\n"))], dtype=torch.long)
        test_data.append((2, text_tensor))


Generate DataLoader with a batch size

In [36]:
BATCH_SIZE = 64

# Split dataset
num_class = 2
num_train = int(len(train_data) * 0.95)
split_train_, split_valid_ = \
    random_split(train_data, [num_train, len(train_data) - num_train])

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text_tensor) in batch:
        label_list.append(_label - 1)
        new_tensor = torch.cat([torch.tensor([BOS_IDX]), _text_tensor, torch.tensor([EOS_IDX])], dim=0)
        text_list.append(new_tensor)
        
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, padding_value=PAD_IDX)
    return label_list.to(device), text_list.to(device)

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

print("Train size:", len(train_dataloader.dataset))
print("Val size:", len(valid_dataloader.dataset))
print("Test size:", len(test_dataloader.dataset))

Train size: 6389
Val size: 337
Test size: 709


### Model

Transformer Decoder based Text Style Generator class

In [37]:
class TextStyleGenerator(nn.Module):
    def __init__(self, num_decoder_layers: int,
                 emb_size: int, tgt_vocab_size: int,
                 dim_feedforward:int = 512, dropout:float = 0.1):
        super(TextStyleGenerator, self).__init__()
        
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)
        
        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)

        self.generator = nn.Linear(emb_size, tgt_vocab_size)

    def forward(self, memory: Tensor, trg: Tensor, tgt_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor, reconst):
        if reconst:
            tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
            outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                            tgt_padding_mask, memory_key_padding_mask)
            return self.generator(outs)
        else:
            ys = torch.ones(1, trg.shape[1]).fill_(BOS_IDX).type(torch.long).to(device)
            for i in range(trg.shape[0] + 5 - 1):
                tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                                            .type(torch.bool)).to(device)
                out = self.transformer_decoder(self.positional_encoding(
                          self.tgt_tok_emb(ys)), memory,
                          tgt_mask) # Output: maxLen(text) x BATCH_SIZE x 512
                out = out.transpose(0, 1) # Output: BATCH_SIZE x maxLen(text) x 512
                prob = self.generator(out[:, -1]) # Output: BATCH_SIZE x 512 -> BATCH_SIZE x DST_LNG_VOCAB
                _, next_word = torch.max(prob, dim = 1) # Output: BATCH_SIZE
                next_word = torch.reshape(next_word, (1, next_word.shape[0])) # Output: 1 x BATCH_SIZE
                ys = torch.cat([ys, next_word.type_as(ys)], dim=0)
                
                end_count = (next_word == EOS_IDX).sum(dim=1) + (next_word == PAD_IDX).sum(dim=1)
                if end_count == next_word.shape[1]:
                    break
            return ys

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer_decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)
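
The batched generation loop stops once every sequence in the batch has just produced `<eos>` or `<pad>`. A plain-Python sketch of that check, with `all_finished` as a hypothetical name:

```python
PAD, EOS = 1, 3

def all_finished(next_words, eos=EOS, pad=PAD):
    """True when every sequence's newest token is an end or padding token."""
    end_count = sum(1 for w in next_words if w in (eos, pad))
    return end_count == len(next_words)

print(all_finished([3, 1, 3]))   # True
print(all_finished([3, 57, 1]))  # False
```

The check looks only at the newest token of each sequence, so a sequence that emitted `<eos>` earlier and a different token now keeps the loop running until the length budget is exhausted.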

### Train

Define train and evaluate functions

In [38]:
import time

def train(model, dataloader, crit1, crit2):
    model.train()
    running_loss = 0.0
    total_acc, total_count = 0, 0
    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
    
        src = text
        tgt_input = text[:-1, :] # Output: maxLen(text) x BATCH_SIZE
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        
        # Encode first
        memory = transformer_model.encode(src, src_mask) # Output: maxLen(text) x BATCH_SIZE x 512
        
        # Get reconstruction loss for decoder
        logits = model(memory, tgt_input, tgt_mask, tgt_padding_mask, src_padding_mask, True)  # Output: maxLen(text) x BATCH_SIZE x ENG_LNG_VOCAB
        
        # Calculate loss
        tgt_out = text[1:,:]
        loss_reconst = crit1(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1)) # ((maxLen(text)-1) x Batch_Size) x DST_LNG_VOCAB, ((maxLen(text)-1) x Batch_Size)
        

        _, out = torch.max(torch.transpose(logits, 0, 1), dim = 2) # Output: BATCH_SIZE x maxLen(Text)
        predicted_label = classification_model(out.to(device)) # BATCH_SIZE x maxLen(text)
        loss_class = crit2(predicted_label, label)
        
        loss = loss_reconst + 0.2 * loss_class
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        
        optimizer.step()
        
        running_loss += loss.item()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

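        if idx % 10 == 0:
            # evaluate() returns the classifier's accuracy on generated text, not a loss
            val_acc = evaluate(model, valid_dataloader, crit2)
            model.train()  # evaluate() leaves the model in eval mode
            acc = total_acc / total_count
            print((f"Epoch: {epoch}, idx: {idx: .0f}, Train loss: {running_loss:.3f}, Val accuracy: {val_acc:.3f}, accuracy: {acc: .3f}"))
            total_acc, total_count = 0, 0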

        running_loss = 0.0

def evaluate(model, dataloader, crit2):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            print(idx)
            src = text
            tgt_input = text[:-1, :] # Output: maxLen(text) x BATCH_SIZE
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
            
            # Encode first
            memory = transformer_model.encode(src, src_mask)
            
            ys = model(memory, src, None, None, None, False)
            style_text = torch.transpose(ys, 0, 1).to(device)
            predicted_label = classification_model(style_text) # BATCH_SIZE x number of classes
            loss = crit2(predicted_label, label)  # computed for reference; only accuracy is returned
            
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
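
The training step above takes `torch.max` over the logits before calling the classifier, so the classifier only ever sees integer token ids. The following is a small stdlib-only sketch of what that implies, not the notebook's code: a predicted token id is a piecewise-constant function of the logits, so a loss computed from it has zero derivative with respect to those logits almost everywhere. `TOY_CLASS_LOSS`, `class_loss`, and `finite_difference` are hypothetical names introduced only for this illustration.

```python
def argmax(values):
    """Index of the largest value, like torch.max(...)[1] on one position."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical stand-in for crit2(classification_model(out), label):
# a fixed loss value for each token id the classifier could receive.
TOY_CLASS_LOSS = {0: 1.2, 1: 0.3, 2: 2.0}

def class_loss(logits):
    """Classification loss as a function of the generator's logits."""
    return TOY_CLASS_LOSS[argmax(logits)]

def finite_difference(fn, logits, index, eps=1e-4):
    """Numerical derivative of fn with respect to logits[index]."""
    bumped = list(logits)
    bumped[index] += eps
    return (fn(bumped) - fn(logits)) / eps

logits = [0.1, 2.5, -0.7]
grads = [finite_difference(class_loss, logits, i) for i in range(len(logits))]
print(grads)  # every entry is 0.0: small logit changes never change the argmax
```

If the classification term is meant to shape the generator, one would need a differentiable path from the logits to the classifier input (for example, feeding soft token distributions); whether that fits this project is a design decision the notebook does not settle.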

Train Text Style Generator Model

In [40]:
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
NUM_DECODER_LAYERS = 3

NUM_EPOCHS = 100


text_styler = TextStyleGenerator(NUM_DECODER_LAYERS,
                                 EMB_SIZE, ENG_VOCAB_SIZE,
                                 FFN_HID_DIM, dropout=0)

for p in text_styler.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

text_styler = text_styler.to(device)

crit1 = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
crit2 = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(
    text_styler.parameters(), lr=0.0005, betas=(0.9, 0.99), eps=1e-8
)

transformer_model.eval()
classification_model.eval()
for epoch in range(1, NUM_EPOCHS+1):
    train_loss = train(text_styler, temp_dataloader, crit1, crit2)
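
`crit1` is built with `ignore_index=PAD_IDX`, which means padded target positions contribute nothing to the reconstruction loss and the mean is taken over the remaining positions only. A minimal pure-Python sketch of that behaviour follows; the value `PAD_IDX = 1` and the helper name `masked_cross_entropy` are assumptions made for illustration and are not taken from the notebook.

```python
import math

PAD_IDX = 1  # assumed value, for illustration only

def masked_cross_entropy(logits, targets, pad_idx=PAD_IDX):
    """Mean negative log-likelihood over positions whose target is not pad_idx."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padded position: skipped entirely
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else float("nan")

logits = [[0.0, 0.0, 0.0],   # real token
          [5.0, 0.0, 0.0],   # padded position, ignored
          [0.0, 0.0, 0.0]]   # real token
targets = [2, PAD_IDX, 0]
loss = masked_cross_entropy(logits, targets)
print(loss)
```

Changing the logits on the padded row leaves `loss` unchanged, which is why batches with different amounts of padding remain comparable.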

### Test

In [47]:
print('Checking the results of test dataset.')
accu_test = evaluate(text_styler, test_dataloader, crit2)
print('test accuracy {:8.3f}'.format(accu_test))

In [42]:
@torch.no_grad()  # inference only, no gradients needed
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)
    
    # Encode with the same frozen translation encoder used in train() and evaluate()
    transformer_model.eval()
    
    memory = transformer_model.encode(src, src_mask).to(device)

    # Start the output sequence with the start symbol; shape: 1 x 1
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    for i in range(max_len-1):
        # Causal mask so each position attends only to earlier positions
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                                    .type(torch.bool)).to(device)
        
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Score the vocabulary from the last decoder position and keep the best token
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim = 1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


def transfer(model, src, src_vocab, tgt_vocab, src_tokenizer):
    model.eval()
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer(src)]+ [EOS_IDX]
    num_tokens = len(tokens)
    src = (torch.LongTensor(tokens).reshape(num_tokens, 1) )
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")
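
The control flow of `greedy_decode` can be seen in isolation with a toy scorer in place of the decoder. This is a sketch under stated assumptions: `toy_scores`, `greedy_decode_toy`, the five-token vocabulary, and the values `BOS = 0` and `EOS = 1` are all hypothetical and exist only for this example.

```python
BOS, EOS, VOCAB_SIZE = 0, 1, 5  # assumed toy values

def toy_scores(prefix):
    """Hypothetical stand-in for decoder + generator: scores for the next token."""
    last = prefix[-1]
    if last == 4:
        target = EOS
    elif last == BOS:
        target = 2
    else:
        target = last + 1
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]

def greedy_decode_toy(score_fn, start_symbol, eos_symbol, max_len):
    """Append the highest-scoring token until EOS or max_len tokens are reached."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
        if next_word == eos_symbol:
            break
    return ys

full = greedy_decode_toy(toy_scores, BOS, EOS, max_len=10)
capped = greedy_decode_toy(toy_scores, BOS, EOS, max_len=3)
print(full, capped)
```

Because each step commits to a single best token, a model whose scores barely depend on the prefix will repeat the same few tokens until `max_len` is hit.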

Check Text Style Generator's performance

In [44]:
transfer(text_styler, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", eng_lng_vocab, eng_lng_vocab, eng_lng_tokenizer)

' city , city , city , , , , city , city , city , city , city , city , , , , , city , city , , , , , city , city ,'

### Load Model

Load saved model

In [41]:
PATH_STYLE = './final_model.pt'

# map_location lets a checkpoint saved on GPU load on a CPU-only machine
text_styler = torch.load(PATH_STYLE, map_location=device)

### Evaluate

Compute the sentence-level BLEU score between each input text and its generated text

In [45]:
from nltk.translate.bleu_score import sentence_bleu

def get_sentence_bleu():
    text_styler.eval()

    total_score = 0.0
    for idx, (label, src) in enumerate(test_data):
        if idx % 100 == 0:
            print('Processing:', idx)
        src = torch.cat([torch.tensor([BOS_IDX]), src, torch.tensor([EOS_IDX])], dim=0)
        src_text = (" ".join([eng_lng_vocab.itos[tok] for tok in src]).replace("<bos>", "").replace("<eos>", ""))
        styled_text = transfer(text_styler, src_text, eng_lng_vocab, eng_lng_vocab, eng_lng_tokenizer)

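        # sentence_bleu expects token lists; raw strings would be compared character by character
        score = sentence_bleu([src_text.split()], styled_text.split())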
        if score > 0.07:
            print(src_text)
            print(styled_text)
            print(score)
            
        total_score += score
    return total_score / len(test_data)

print('Sentence BLEU Score:', get_sentence_bleu())
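
BLEU compares n-grams of whatever sequence items it is given, so it matters whether the texts are passed as token lists or as raw strings (a Python string is a sequence of characters). The sketch below is a hand-written clipped unigram precision, one ingredient of BLEU rather than the full NLTK score; `ngrams` and `modified_precision` are names made up for this illustration, and the two example texts are taken from the test output.

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of a sequence (works on a str or a list)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def modified_precision(reference, hypothesis, n=1):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total

src_text = "He was 83 ."
styled_text = "of of of of"

char_level = modified_precision(src_text, styled_text)                   # items are characters
token_level = modified_precision(src_text.split(), styled_text.split())  # items are tokens
print(char_level, token_level)
```

At character level the shared spaces alone produce a nonzero precision, while at token level the two texts share nothing.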

Compute a grammar error rate (LanguageTool matches per token, lower is better) on input texts and generated texts

In [52]:
def get_grammar_check_score():
    text_styler.eval()

    total_src_score = 0.0
    total_tgt_score = 0.0
    for idx, (label, src) in enumerate(test_data):
        if idx % 100 == 0:
            print('Processing:', idx)
        src = torch.cat([torch.tensor([BOS_IDX]), src, torch.tensor([EOS_IDX])], dim=0)
        src_text = (" ".join([eng_lng_vocab.itos[tok] for tok in src]).replace("<bos>", "").replace("<eos>", ""))
        styled_text = transfer(text_styler, src_text, eng_lng_vocab, eng_lng_vocab, eng_lng_tokenizer)

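        # Error rate = number of LanguageTool matches divided by the number of tokens
        src_score = len(grammar_tool.check(src_text)) / len(src)
        # Guard against an empty generation, which would otherwise divide by zero
        tgt_score = len(grammar_tool.check(styled_text)) / max(len(eng_lng_tokenizer(styled_text)), 1)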
        
        
        total_src_score += src_score
        total_tgt_score += tgt_score
    return total_src_score / len(test_data), total_tgt_score / len(test_data)

print('Grammar Score:', get_grammar_check_score())

Processing: 0
src_score: 0.16129032258064516
 Fears for T N pension after talks Unions representing workers at Turner    Newall say they are ' disappointed ' after talks with stricken parent firm Federal Mogul . 
tgt_score: 0.4864864864864865
 city , city , city , , , , city , city , city , city , city , , , , , , , city , city , , , , city , city ,
src_score: 0.22033898305084745
 The Race is On : Second Private Team Sets Launch Date for Human Spaceflight ( SPACE.com ) SPACE.com - TORONTO , Canada -- A <unk> of rocketeers competing for the   # 36;10 million Ansari X Prize , a contest <unk> funded suborbital space flight , has officially announced the <unk> date for its manned rocket . 
tgt_score: 0.48
 , city , city , , , , , , , , , , city , city , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.11538461538461539
 Ky. Company Wins Grant to Study Peptides ( AP ) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to

src_score: 0.2
 Dutch Retailer Beats Apple to Local Download Market   AMSTERDAM ( Reuters ) - Free Record Shop , a Dutch music   retail chain , beat Apple Computer Inc. to market on Tuesday   with the launch of a new download service in Europe 's latest   battleground for digital song services . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , city , , ... 
src_score: 0.13333333333333333
 Super ant colony hits Australia A giant 100 km colony of ants   which has been discovered in Melbourne , Australia , could threaten local insect species . 
tgt_score: 0.5277777777777778
 city , city , city , city , , city , city , city , city , city , city , , , , , city , city , , , city , city ,
src_score: 0.23076923076923078
 Socialites unite dolphin groups Dolphin groups , or " pods " , rely on socialites to keep them from collapsing , scientists claim . 
tgt_score: 0.5
 city , city , city , city , , city , city , city 

src_score: 0.11538461538461539
 He had planned on attending the Red Sox ' Family Day at Fenway Park yesterday morning , but he had to sleep in . 
tgt_score: 0.5
 city , city , city , , , city , city , city , city , city , city , , , , , , city , , , , ,
src_score: 0.13333333333333333
 After all , Ortiz had a son at home , and he ... 
tgt_score: 0.19047619047619047
 U.S. city of city of city of city of city of city of city of city of city of city
src_score: 0.2608695652173913
 They 've caught his eye In   <unk> themselves , quot ; Ricky Bryant , Chas Gessner , Michael Jennings , and David Patten did nothing Friday night to make Bill Belichick 's decision on what to do with his receivers any easier . 
tgt_score: 0.47368421052631576
 , city , city , city , , city , city , city , city , city , city , , , , , , city , , , , , , , , , , ... 
src_score: 0.1282051282051282
 Indians Mount Charge The Cleveland Indians pulled within one game of the AL Central lead by beating the Minnesota Twins ,

src_score: 0.22
 Venezuelans Flood Polls , Voting Extended   CARACAS , Venezuela ( Reuters ) - Venezuelans voted in huge   numbers on Sunday in a historic referendum on whether to recall   left - wing President Hugo Chavez and electoral authorities   prolonged voting well into the night . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , city , , ... 
src_score: 0.2463768115942029
 Dell Exits Low - End China Consumer PC Market   HONG KONG ( Reuters ) - Dell Inc. & lt;DELL.O&gt ; , the world 's   largest PC maker , said on Monday it has left the low - end   consumer PC market in China and cut its overall growth target   for the country this year due to stiff competition in the   segment . 
tgt_score: 0.48
 , city , city , , , , , , , , , , city , city , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.2037037037037037
 China Says Taiwan Spy Also Operated in U.S. - Media   BEIJING (

src_score: 0.03571428571428571
 Indian state rolls out wireless broadband Government in South Indian state of Kerala sets up wireless kiosks as part of initiative to bridge digital divide . 
tgt_score: 0.5
 city , city , city , city , , city , city , city , city , city , , , , , , city , city , , , , , city
src_score: 0.1276595744680851
 Hurricane Survivors Wait for Water , Gas PUNTA GORDA , Fla. - Urban rescue teams , insurance adjusters and National Guard troops scattered across Florida Monday to help victims of Hurricane Charley and deliver water and other supplies to thousands of people left homeless ... 
tgt_score: 0.47368421052631576
 , city , city , city , , , city , city , city , city , city , city , , , , , city , , , , , , , , city , ... 
src_score: 0.07547169811320754
 Jackson Squares Off With Prosecutor SANTA MARIA , Calif. - Fans of Michael Jackson erupted in cheers Monday as the pop star emerged from a double - decker tour bus and went into court for a showdown with the p

src_score: 0.08333333333333333
 International observers certified the results as clean and accurate . 
tgt_score: 0.16666666666666666
 U.S. of city of city of of of of city of city of of of city of
src_score: 0.11764705882352941
 Jailing of HK democrat in China ' politically motivated ' ( AFP ) AFP - Hong Kong democrats accused China of jailing one of their members on trumped - up prostitution charges in a bid to disgrace a political movement Beijing has been feuding with for seven years . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , city , ... 
src_score: 0.11666666666666667
 Kmart Swings to Profit in 2Q ; Stock Surges ( AP ) AP - Shares of Kmart Holding Corp. surged 17 percent Monday after the discount retailer reported a profit for the second quarter and said chairman and majority owner Edward Lampert is now free to invest the company 's   # 36;2.6 billion in surplus cash . 
tgt_score: 0.475
 , city

src_score: 0.17857142857142858
 China cracks down on   quot;phone sex quot ; services BEIJING , Aug. 17 ( Xinhuanet ) -- China is carrying out a nationwide campaign to crack down on   quot;phone sex quot ; services , paralleling another sweeping operation against Internet pornography , Minister of Information Industry Wang <unk> said here Tuesday . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , , , ... 
src_score: 0.2631578947368421
 Surviving Biotech 's <unk> Charly Travers offers advice on withstanding the volatility of the biotech sector . 
tgt_score: 0.48148148148148145
 U.S. city city , city , city , city , city , city , city , city , city , , , city , city ,
src_score: 0.07894736842105263
 Mr Downer shoots his mouth off Just what Alexander Downer was thinking when he declared on radio last Friday that   <unk> could fire a missile from North Korea to Sydney quot ; is unclear . 
tgt_score: 0.525
 , c

src_score: 0.21052631578947367
 # <unk> understand it can # 146;t play MP3 files , # 146 ; I said . 
tgt_score: 0.4444444444444444
 U.S. city , city , city , , , city , city , city , city , city , , , , city , city ,
src_score: 0.1875
 # <unk> life is a lot longer , # 146 ; he said . 
tgt_score: 0.375
 U.S. city , city , city , city of city of city of city of city , city , city , city ,
src_score: 0.0
 # 148 ; Aug 17 
tgt_score: 0.23076923076923078
 of of of of of of of of of of of of
src_score: 0.10869565217391304
 Mills Grabs <unk> Portfolio ; Taubman Likely to Lose Contracts Mills Corp. agreed to purchase a 50 percent interest in nine malls owned by General Motors Asset Management Corp. for just over <unk> billion , creating a new joint venture between the groups . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , city , ... 
src_score: 0.0
 The deal will extend ... 
tgt_score: 0.23076923076923078
 of of

src_score: 0.2222222222222222
 It 's easy to get confused . 
tgt_score: 0.2
 of city of city of of of of of of of of of of
Processing: 200
src_score: 0.17073170731707318
 Former Florida Swimming Coach Dies at 83 ( AP ) AP - William H. Harlan , the retired University of Florida swimming coach who led the Gators to eight conference titles , died Tuesday , school officials said . 
tgt_score: 0.475
 , city city , city , city , city , city , city , city , city , city , , , , , , city , city , , , , , , city , city , ... 
src_score: 0.16666666666666666
 He was 83 . 
tgt_score: 0.16666666666666666
 of of of of of of of of of of of
src_score: 0.11764705882352941
 US Men Have Right Touch in Relay Duel Against Australia THENS , Aug. 17 - So Michael Phelps is not going to match the seven gold medals won by Mark Spitz . 
tgt_score: 0.5
 , city city , city , city , , city , city , city , city , city , city , , , , city , city , , , , , , city , city , , ,
src_score: 0.1111111111111111
 And it is to

src_score: 0.25
 Karzai Promises Afghans Security for Election ( Reuters ) Reuters - Afghanistan 's President Hamid <unk> Afghans greater security when they go to vote in <unk> 's first ever democratic election during an <unk> speech on Wednesday . 
tgt_score: 0.48717948717948717
 , city , city , city , , city , city , city , city , city , city , , , , , , city , , , , , , , , , , , ... 
src_score: 0.17543859649122806
 Google Lowers Its IPO Price Range SAN JOSE , Calif. - In a sign that Google Inc. 's initial public offering is n't as popular as expected , the company lowered its estimated price range to between <unk> and <unk> per share , down from the earlier prediction of <unk> and <unk> per share ... 
tgt_score: 0.48
 , city , city , , , , , , , , , , city , city , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.07692307692307693
 Future Doctors , Crossing Borders Students at the Mount Sinai School of Medicine learn that diet and culture shape health i

src_score: 0.18181818181818182
 A lot of browsers may support   one or two , but I 'll bet none have them all . 
tgt_score: 0.4642857142857143
 U.S. city , city , city , , city , city , city , city , city , city , , , , , city , city
src_score: 0.08333333333333333
 Cops Test Handheld Fingerprint Reader Several Minnesota police departments are field testing a handheld device that scans a suspect 's fingerprint and digitally checks it against Minnesota 's criminal history and fingerprint database . 
tgt_score: 0.5
 , city city , city , city , city , city , city , city , city , city , , , , , city , city , city , , , , city , city , city , , ,
src_score: 0.09615384615384616
 Ross Stores Profit Plummets 40 Percent ( AP ) AP - Discount retailer Ross Stores Inc. Wednesday said its profit fell about 40 percent in the latest quarter due to problems with a new computer system that limited the company 's ability to respond to changes in customer demand . 
tgt_score: 0.47368421052631576
 , city c

src_score: 0.08
 Drive maker files counterclaims in patent suit Cornice blasts Seagate 's suit over patents for tiny hard drives used in portable gadgets . 
tgt_score: 0.5161290322580645
 city , city , city , , , city , city , city , city , city , city , , , , , city , city , , ,
src_score: 0.08064516129032258
 HP moves network scanning software into beta CHICAGO - Hewlett - Packard(HP ) has moved its Active Counter Measures network security software into beta tests with a select group of European and North American customers in hopes of readying the product for a 2005 release , an HP executive said at the HP World conference here in Chicago Wednesday . 
tgt_score: 0.4878048780487805
 , city , city , city , , , , city , , city , city , city , , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.203125
 Martin announces major overhaul of key staff in Prime Minister 's Office ( Canadian Press ) Canadian Press - OTTAWA ( CP ) - Paul Martin announced a major overhaul of his senior sta

src_score: 0.2571428571428571
 Boro Captain Warns of Duo # 39;s Threat Gareth Southgate has warned Barclays Premiership defences to be wary of <unk> back - to - form strikers Mark Viduka and Jimmy Floyd Hasselbaink . 
tgt_score: 0.5348837209302325
 , city city , city , city , city , city , city , city , city , city , city , , , city , city , city , , , city , city , city , , city , ,
src_score: 0.391304347826087
 Intuit Posts Wider Loss After Charge ( Reuters ) Reuters - Intuit Inc. ( INTU.O ) , maker <unk> No . 
tgt_score: 0.5161290322580645
 city , city , city , , , city , city , city , city , city , city , , , , , city , city , , ,
src_score: 0.18518518518518517
 1 U.S. tax presentation software TurboTax , on <unk> a wider quarterly loss after taking a <unk> charge during its seasonally weaker fourth quarter . 
tgt_score: 0.4864864864864865
 city , city , city , , , , city , city , city , city , city , , , , , , city , city , , , , , city , city ,
src_score: 0.14705882352941177
 Ham

src_score: 0.07317073170731707
 Credit Suisse to merge CSFB unit into parent Credit Suisse Group announced plans to merge its Credit Suisse First Boston Securities unit with the rest of the company # 39;s operations and cut as many as 300 jobs . 
tgt_score: 0.475
 , city city , city , city , city , city , city , city , city , city , , , , , , city , city , , , , , , city , city , ... 
src_score: 0.2
 Holiday - Shopping Season Remains Sluggish ( Reuters ) Reuters - U.S. shoppers have kept a tight <unk> their wallets this holiday season with indices on <unk> sluggish sales in the second week of the   season . 
tgt_score: 0.48717948717948717
 , city , city , city , city , city , city , city , city , city , city , , , , , city , , , , , , , , city , , ... 
src_score: 0.21951219512195122
 UN to begin second airlift of Vietnamese <unk> ( AFP ) AFP - The second major airlift of Vietnamese <unk> who fled to Cambodia 's remote jungles after April anti - government protests will begin at the wee

src_score: 0.125
 According to the report , consumers aged 12 and older in the United States were as likely to be aware of Apple Computer Inc. 's iTunes Music Store and Napster 2.0 when it came to recognizing digital music download brands -- each music service registered 20 percent of what <unk> refers to as " top - of - mind " awareness . 
tgt_score: 0.48
 , city , city , , , , , , , , , city , city , city , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.06382978723404255
 Qantas says record profit not big enough Australia # 39;s flagship carrier Qantas Airways has reported a record annual net profit but warned oil prices threatened its performance , increasing the chance of a hike in ticket price surcharges to offset its fuel bill . 
tgt_score: 0.5128205128205128
 , city city , city , city , , city , city , city , city , city , , , , , , , city , , , , , , , , city , , ... 
src_score: 0.12195121951219512
 State , drug chains reach agreement The state of M

src_score: 0.2222222222222222
 Ecuadorean Lawsuit Vs Texaco Boils Down to Science ( Reuters ) Reuters - After a <unk> court battles , lawyers on Wednesday took a lawsuit <unk> Indians accusing U.S. oil firm ChevronTexaco Corp .. <unk> polluting the Amazon jungle into the field . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , city , , , , , city , , , , , , , , , city , ... 
src_score: 0.06382978723404255
 Priceline , Ramada to Make Sites More Accessible to Blind In one of the first enforcement actions of the Americans with Disabilities Act on the Internet , two major travel services have agreed to make sites more accessible to the blind and visually impaired . 
tgt_score: 0.48717948717948717
 , city , city , city , , city , city , city , city , city , city , , , , , , city , , , , , , , , , , , ... 
src_score: 0.21428571428571427
 Atlantis " Evidence " Found in Spain , Ireland In 360 B.C. 
tgt_score: 0.2
 U.S. city of city of city of city

src_score: 0.10714285714285714
 Apple recalls 28,000 batteries APPLE has issued a safety recall for 28,000 batteries for its Powerbook notebooks , saying they posed a potential fire hazard . 
tgt_score: 0.5
 city , city , city , city , , city , city , city , city , city , , , , , , city , city , , , , city ,
src_score: 0.0
 Best Buy a Bad Deal ? 
tgt_score: 0.21428571428571427
 of of of of of of of of of of of of of
src_score: 0.125
 Attorney General Jim Petro is suing Best Buy , alleging the electronics retailer has engaged in unfair and deceptive business practices . 
tgt_score: 0.5333333333333333
 city , city , city , , city , city , city , city , city , city , , , , , , city , city , ,
src_score: 0.14035087719298245
 Liu brings China 4th gold in weightlifting at Athens Games ATHENS , Aug. 19 ( Xinhuanet ) -- Chinese Hercules Liu Chunhong Thursday lifted three world records on her way to winning the women # 39;s 69 kg gold medal at the Athens Olympics , the fourth of the power sport

src_score: 0.22448979591836735
 Google scores first - day bump of 18 ( USATODAY.com ) USATODAY.com - Even a big first - day jump in shares of Google ( GOOG ) could n't quiet debate over whether the Internet search engine 's contentious auction was a hit or a flop . 
tgt_score: 0.48717948717948717
 , city , city , city , , city , city , city , city , city , city , , , , , , city , , , , , , , , , city , ... 
src_score: 0.24489795918367346
 Carly Patterson Wins Gymnastics All - Around Gold   ATHENS ( Reuters ) - Carly Patterson upstaged Russian diva   Svetlana Khorkina to become the first American in 20 years to   win the women 's Olympic gymnastics all - round gold medal on   Thursday . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , , , ... 
src_score: 0.10714285714285714
 ChevronTexaco hit with <unk> M ruling Montana jury orders oil firm to pay up over gas pipeline leak from 1955 ; company plans to appea

src_score: 0.18333333333333332
 Antidepressants to Reflect Suicide Risk   WASHINGTON ( Reuters ) - The U.S. Food and Drug   Administration plans to update antidepressant labels to reflect   studies that suggest a link between the drugs and suicide in   youths , but remains cautious about the strength of such ties ,    according to documents released on Friday . 
tgt_score: 0.475
 , city , city , city , , , , city , , city , city , city , , , , , , , , , , , , , , , , , , , , ... 
src_score: 0.13513513513513514
 UAL and Its Creditors Agree to 30 - Day Extension UAL 's United Airlines will have a 30 - day extention on the period in which it can file an exclusive bankruptcy reorganization plan . 
tgt_score: 0.5348837209302325
 , city city , city , city , , city , city , city , city , city , city , , , , , city , city , , , , , city , city , , , city , ,
src_score: 0.16666666666666666
 Mood Mixed Among Darfur Rebels Ahead of Talks CORCHA CAMP , Sudan ( Reuters ) - A Sudanese rebel commande

src_score: 0.15
 Samsung plans to invest Won25,000bn in chips Samsung Electronics , the world # 39;s second largest computer chip manufacturer , yesterday said that it would invest Won25,000bn ( <unk> ) in its semiconductor business by 2010 to generate 
tgt_score: 0.475
 , city , city , city , city , city , city , city , city , city , city , , , , , city , city , , , , , , city , , , ... 
src_score: 0.038461538461538464
 Rosetta Mission Sniffing a Comet The European Rosetta mission will sample a comet as it tries to harpoon and hook onto its surface . 
tgt_score: 0.5
 city , city , city , , , , city , city , city , city , city , city , , , , , city , , , , ,
src_score: 0.05555555555555555
 A specially designed oven will cook the comet in analogy to sniffing for recognizable elements . 
tgt_score: 0.375
 U.S. city , city , city , , city of city of city of city of city , city , city , city
src_score: 0.11428571428571428
 More gold for Britain as Wiggins strikes Bradley Wiggins has given 

src_score: 0.26666666666666666
 Unknown Nesterenko Makes World Headlines ( Reuters ) Reuters - Belarus ' Yuliya Nesterenko won the <unk> 's athletics gold medal at the Olympics on <unk> over a field stripped of many big names because <unk> woes to win the 100 meters . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , city , , , , , city , , , , , , , , , city , ... 
src_score: 0.0
 Ready to Bet on Alternative Energy ? 
tgt_score: 0.2
 of city of city of of of of of of of of of of
src_score: 0.19047619047619047
 Well , Think Again When oil prices rise , public interest in alternative energy often does , too . 
tgt_score: 0.48148148148148145
 U.S. city , city , city , , city , city , city , city , city , city , , , city , , city
src_score: 0.09090909090909091
 But the logic is evidently escaping Wall Street . 
tgt_score: 0.17647058823529413
 of city of city of of of of of of city of of of of city
src_score: 0.15384615384615385
 Athletics 5 , D

src_score: 0.14285714285714285
 Putin Visits Chechnya Ahead of Election ( AP ) AP - Russian President Vladimir Putin made an unannounced visit to Chechnya on Sunday , laying flowers at the grave of the war - ravaged region 's assassinated president a week before elections for a new leader . 
tgt_score: 0.47368421052631576
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , city , ... 
src_score: 0.22857142857142856
 U.S. Softball Team Wins , Closes in on Gold ATHENS , Greece - Right now , the Americans are n't just a Dream Team - they 're more like the Perfect Team . 
tgt_score: 0.4878048780487805
 , city city , city , city , city , city , city , city , city , city , , , , , city , city , city , , , , city , city , , , city
src_score: 0.13953488372093023
 Lisa Fernandez pitched a three - hitter Sunday and Crystl Bustos drove in two runs as the Americans rolled to their eighth shutout in eight days , 5 - 0 over Australia , putting them into th

src_score: 0.125
 Indexes in Japan fall short of hype Japanese stocks have failed to measure up to an assessment made in April by Merrill Lynch # 39;s chief global strategist , David Bowers , who said Japan was   quot;very much everyone # 39;s favorite equity market . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , city , , ... 
src_score: 0.1568627450980392
 GAME DAY RECAP Sunday , August 22 Aramis Ramirez hit a three - run homer , Moises Alou also homered and the Chicago Cubs beat the Houston Astros 11 - 6 on Sunday in the testy conclusion of a three - game series between the NL Central rivals . 
tgt_score: 0.47368421052631576
 , city , city , city , , , city , city , city , city , city , city , , , , , city , , , , , , , , city , ... 
src_score: 0.1724137931034483
 SI.com HOUSTON ( Ticker ) -- Kerry Wood got plenty of run support but didn # 39;t stick around long enough to take advantage of it . 
tgt_sco

src_score: 0.09375
 Taiwan votes on leaner parliament A vote is due to be held in Taiwan on plans to halve the number of seats in the island 's famously heated legislature . 
tgt_score: 0.47368421052631576
 city , city , city , , , , city , city , city , city , city , , , , , , city , city , , , , , city , city , city
src_score: 0.08333333333333333
 Nikkei briefly regains 11,000 level TOKYO - Japan # 39;s benchmark Nikkei stock index briefly recovered to the 11,000 level Monday morning on widespread buying prompted by advances in US shares last Friday . 
tgt_score: 0.5
 , city city , city , city , city , city , city , city , city , city , , , , , city , city , , , , , city , city , city , , , ,
src_score: 0.07692307692307693
 HK walks out of 68 - month deflation cycle , official Hong Kong Financial Secretary Henry Tang said he believed Hong Kong has walked out of the consumer price deflation cycle that lingered for 68 months , according to the consumer price index trend in the past few

src_score: 0.1
 At issue is the ability to authenticate the original source of e - mail messages , a major 
tgt_score: 0.4230769230769231
 U.S. city , city , city , , city , city , city of city , city , city , , , city , city
src_score: 0.2222222222222222
 New Fat - Busting Microwave Oven Unveiled   TOKYO ( Reuters ) - Eyeing up that juicy steak but worried   about your waistline ? 
tgt_score: 0.5151515151515151
 city , city , city , city , , city , city , city , city , city , , , , , , city , city , , , , ,
src_score: 0.29545454545454547
 Japanese electronics maker Sharp Corp.   & lt;A HREF="http://www.reuters.co.uk / <unk> qtype = sym infotype = info qcat = <unk> ; says it has developed a new fat - busting microwave oven   that can melt some of your worries away . 
tgt_score: 0.48717948717948717
 , city city , city , , , , city , city , city , city , city , , , , , , , city , , , , , , , , city , , ... 
src_score: 0.1
 Israel OKs More West Bank Settlement Homes JERUSALEM Aug. 23 , 20

src_score: 0.1346153846153846
 ( AP ) AP - A state judge ruled Monday that the sign - up period should be reopened for the Nov. 2 election in Louisiana 's 5th Congressional District , where incumbent Rep. Rodney Alexander infuriated Democrats by switching to the Republican Party minutes before the qualifying deadline . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , , , ... 
src_score: 0.034482758620689655
 Oil price down as Iraq fears ease The price of oil has fallen as fears about interruptions to supplies being pumped out of Iraq eased slightly . 
tgt_score: 0.4857142857142857
 city , city , city , , , city , city , city , city , city , city , , , , , city , city , , , , city , city
src_score: 0.11764705882352941
 Israel Accelerates Settlement Drive As Sharon Pushes On With Gaza & lt;b&gt; ... &lt;/b&gt ; The Israeli government was accelerating its settlement program Monday with plans to build hundreds

src_score: 0.0
 A report by economists at the University of the Philippines said the country faces economic collapse 
tgt_score: 0.4166666666666667
 U.S. city , city , city , , of city of city of city of city , city , city , city ,
src_score: 0.16363636363636364
 US men # 39;s basketball routs Angola Athens , Greece ( Sports Network ) - Tim Duncan led a balanced American attack with 15 points and seven rebounds , as the United States men # 39;s basketball team completed the preliminary round with a resounding 89 - 53 victory over winless Angola . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , city , ... 
src_score: 0.1702127659574468
 Patterson gets silver on balance beam Athens , Greece ( Sports Network ) - American Carly Patterson , the women # 39;s all- around champion at the Summer Games , added another medal on Monday night with a silver in the balance beam competition . 
tgt_score: 0.48717948717948

src_score: 0.10204081632653061
 Intel drops prices on computer chips SAN FRANCISCO - Intel Corp. has cut prices on its computer chips by as much as 35 percent , though analysts on Monday said the cuts were probably unrelated to swelling inventories of the world # 39;s largest chip maker . 
tgt_score: 0.48717948717948717
 , city , city , city , , city , city , city , city , city , city , , , , , , city , , , , , , , , , city , ... 
src_score: 0.1282051282051282
 Israel Announces West Bank Housing Plans ( AP ) AP - Israel announced plans Monday for 500 new housing units in the West Bank , after an apparent U.S. policy shift that has infuriated the Palestinians . 
tgt_score: 0.5121951219512195
 , city city , city , city , , city , city , city , city , city , city , , , , , city , city , , , , , city , city , city , ... 
src_score: 0.07692307692307693
 The Palestinians oppose all Jewish settlement in the West Bank and Gaza Strip , lands where they hope to establish an independent state . 


src_score: 0.23076923076923078
 quot;Wow , quot ; Hamm told his twin brother Morgan . 
tgt_score: 0.21052631578947367
 U.S. city of city of city of city of city of city of of city of city of
src_score: 0.25
 <unk> never seen this before . 
tgt_score: 0.1875
 U.S. of city of of of of of of of of of of of of
src_score: 0.25
 quot ; 
tgt_score: 0.2
 of of of of of of of of of
src_score: 0.021739130434782608
 Unions protest as overtime rules take effect WASHINGTON -- Hundreds of workers rallied on the steps of the Labor Department yesterday to protest the implementation of new rules they say will cause as many as 6 million Americans to lose their overtime pay . 
tgt_score: 0.48717948717948717
 , city , city , city , , , city , city , city , city , city , , , , , , , city , , , , , , , , , city , ... 
src_score: 0.03571428571428571
 But the Bush administration officials who crafted the complex regulations insisted more workers will actually qualify for extra pay under the plan , which almos