<h1>Embedding Words and Types<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Why-Learn-Embeddings?" data-toc-modified-id="Why-Learn-Embeddings?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why Learn Embeddings?</a></span><ul class="toc-item"><li><span><a href="#Efficiency-of-Embeddings" data-toc-modified-id="Efficiency-of-Embeddings-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Efficiency of Embeddings</a></span></li><li><span><a href="#Approaches-to-Learning-Word-Embeddings" data-toc-modified-id="Approaches-to-Learning-Word-Embeddings-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Approaches to Learning Word Embeddings</a></span></li><li><span><a href="#The-Practical-Use-of-Pretrained-Word-Embeddings" data-toc-modified-id="The-Practical-Use-of-Pretrained-Word-Embeddings-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>The Practical Use of Pretrained Word Embeddings</a></span></li></ul></li><li><span><a href="#Example:-Learning-the-Continous-Bag-of-Words-Embeddings" data-toc-modified-id="Example:-Learning-the-Continous-Bag-of-Words-Embeddings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Example: Learning the Continous Bag of Words Embeddings</a></span><ul class="toc-item"><li><span><a href="#Data-Vectorization-Classes" data-toc-modified-id="Data-Vectorization-Classes-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data Vectorization Classes</a></span></li><li><span><a href="#The-CBOW-Model" data-toc-modified-id="The-CBOW-Model-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>The CBOW Model</a></span></li><li><span><a href="#Model-Training-&amp;-Evaluation" data-toc-modified-id="Model-Training-&amp;-Evaluation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Model Training &amp; Evaluation</a></span></li><li><span><a href="#Trained-Embeddings" data-toc-modified-id="Trained-Embeddings-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Trained Embeddings</a></span></li></ul></li><li><span><a href="#Example:-Transfer-Learning-Using-Pretrained-Embeddings-for-Document-Classification" data-toc-modified-id="Example:-Transfer-Learning-Using-Pretrained-Embeddings-for-Document-Classification-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Transfer Learning Using Pretrained Embeddings for Document Classification</a></span><ul class="toc-item"><li><span><a href="#Data-Vectorization-classes" data-toc-modified-id="Data-Vectorization-classes-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Data Vectorization classes</a></span></li><li><span><a href="#The-NewsClassifier" data-toc-modified-id="The-NewsClassifier-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>The NewsClassifier</a></span></li><li><span><a href="#Utils" data-toc-modified-id="Utils-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Utils</a></span></li><li><span><a href="#Model-Training-&amp;-Evaluation" data-toc-modified-id="Model-Training-&amp;-Evaluation-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Model Training &amp; Evaluation</a></span></li></ul></li></ul></div>

## Introduction

*Representataion Learning or Embedding* refer to learning the mapping from one discrete type to a point in the vector space. When the discrete types are words, the dense vector representation is called a _word embedding_. TF-IDF(Term Frequency-Inverse Document Frequency) is an example of _count based embedding_ method.

## Why Learn Embeddings?

- The count-based representations are also called _distributional representations_ because their significant content or meaning is represented by multiple dimensions in the vector. These representations are not learned from the data but heuristically constructed.

**Benefits of Low Dimensional Learned Representations:**
- Reducing the dimensionality is computationally efficient.
- The count based representations result in high dimensional vectors that encode similar information along many dimensions and do not share statistical strength.
- Very high dimensions in the input can result in real problems in machine learning and optimisation which is often called _curse of dimensionality_.
- Representations learned from task specific data are optimal for the task at hand.

### Efficiency of Embeddings

When we perform the matrix multiplication of one hot vector with weight matrix, the resulting vector is just selecting the row indicated by the non zero entry.

![Figure 5.1](../images/figure_5_1.png)

### Approaches to Learning Word Embeddings

Auxiliary Tasks used to train Word Embeddings:
- Given a sequence of words, predict the next word. This is also called the _language modeling task_.
- Given a sequence of words before and after, predict the missing word.
- Given a word, predict words that occur within a window, independent of the position.

### The Practical Use of Pretrained Word Embeddings

In [1]:
# Loading Embeddings
# Download Embeddings file from https://www.kaggle.com/danielwillgeorge/glove6b100dtxt?select=glove.6B.100d.txt
%load_ext nb_black

import numpy as np
from annoy import AnnoyIndex


class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index: mapping from word to integers.
            word_vectors: list of numpy array.
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]), metric="euclidean")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """
        Init from pretrained vector file.

        Vector filw should be of the format:
            word0 x0_0 x0_1, x0_2 ... x0_N
            word1 x1_0 x1_1, x1_2 ... x1_N

        Args:
            embedding_file: location of the file
        Returns:
            instance of PretrainedEmbeddings
        """
        word_to_index, word_vectors = {}, []
        with open(embedding_file) as fp:
            for line in fp.readlines():
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index=word_to_index, word_vectors=word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word: Input word to get embedding for.
        Returns:
            an embedding for given word
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closed_to_vector(self, vector, n=1):
        """
        Given a vector, return its n nearest neighbors.

        Args:
            vector: should match the size of the vectors in the Annoy Index.
            n: the number of neighbors to return
        Returns:
            Unsorted list of words nearest to the given vector.
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    def compute_and_print_analogy(self, word1, word2, word3):
        """
        Prints the solutions to analogies using word embeddings.

        Analogies are word1 to word2 as word3 is to __
        This methid will print: word1 : word2 :: word3 : word4

        Args:
            word1, word2, word3
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closed_words = self.get_closed_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closed_words = [word for word in closed_words if word not in existing_words]
        if len(closed_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        for word4 in closed_words:
            print(f"{word1}:{word2} :: {word3}:{word4}")


embeddings = PreTrainedEmbeddings.from_embeddings_file("../data/glove.6B.100d.txt")

<IPython.core.display.Javascript object>

In [2]:
# Relationships between word embeddings

# Relationship 1: the relationship between gendered nouns and pronouns
print("the relationship between gendered nouns and pronouns")
embeddings.compute_and_print_analogy("man", "he", "woman")
print()

# Relationship 2: Verb-noun relationships
print("Verb-noun relationships")
embeddings.compute_and_print_analogy("fly", "plane", "sail")
print()

#  Relationship 3: Noun-noun relationships
print("Noun-noun relationships")
embeddings.compute_and_print_analogy("cat", "kitten", "dog")
print()

# Relationship 4: Hypernymy (broader category)
print("Hypernymy (broader category)")
embeddings.compute_and_print_analogy("blue", "color", "dog")
print()

# Relationship 5: Meronymy (part-to-whole)
print("Meronymy (part-to-whole)")
embeddings.compute_and_print_analogy("toe", "foot", "finger")
print()

# Relationship 6: Troponymy (difference in manner)
print("Troponymy (difference in manner)")
embeddings.compute_and_print_analogy("talk", "communicate", "read")
print()

# Relationship 7: Metonymy (convention / figures of speech)
print("Metonymy (convention / figures of speech)")
embeddings.compute_and_print_analogy("blue", "democrat", "red")
print()

# Relationship 8: Adjectival scales
print("Adjectival scales")
embeddings.compute_and_print_analogy("fast", "fastest", "young")
print()

the relationship between gendered nouns and pronouns
man:he :: woman:she
man:he :: woman:never

Verb-noun relationships
fly:plane :: sail:ship
fly:plane :: sail:vessel

Noun-noun relationships
cat:kitten :: dog:puppy
cat:kitten :: dog:puppies
cat:kitten :: dog:toddler

Hypernymy (broader category)
blue:color :: dog:behavior
blue:color :: dog:touch
blue:color :: dog:viewer

Meronymy (part-to-whole)
toe:foot :: finger:ground
toe:foot :: finger:pointing

Troponymy (difference in manner)
talk:communicate :: read:interpret
talk:communicate :: read:typed
talk:communicate :: read:correctly
talk:communicate :: read:instructions

Metonymy (convention / figures of speech)
blue:democrat :: red:republican
blue:democrat :: red:congressman
blue:democrat :: red:senator

Adjectival scales
fast:fastest :: young:female
fast:fastest :: young:fellow
fast:fastest :: young:younger



<IPython.core.display.Javascript object>

In [3]:
embeddings.compute_and_print_analogy("fast", "fastest", "small")
embeddings.compute_and_print_analogy("man", "king", "woman")
embeddings.compute_and_print_analogy("man", "doctor", "woman")

fast:fastest :: small:smallest
fast:fastest :: small:large
man:king :: woman:queen
man:king :: woman:monarch
man:king :: woman:throne
man:doctor :: woman:nurse
man:doctor :: woman:physician


<IPython.core.display.Javascript object>

In [4]:
embeddings.compute_and_print_analogy("sachin", "cricket", "messi")

sachin:cricket :: messi:rugby
sachin:cricket :: messi:soccer
sachin:cricket :: messi:football
sachin:cricket :: messi:club


<IPython.core.display.Javascript object>

In [5]:
embeddings.compute_and_print_analogy("nifty", "sensex", "nasdaq")

nifty:sensex :: nasdaq:index
nifty:sensex :: nasdaq:composite


<IPython.core.display.Javascript object>

## Example: Learning the Continous Bag of Words Embeddings

The CBOW model is a multi‐ class classification task represented by scanning over texts of words, creating a context window of words, removing the center word from the context window, and classifying the context window to the missing word. It is actually like a fill-in-the-blank task. There is a sentence with a missing word, and the model’s job is to figure out what that word should be.

![Figure 5.2](../images/figure_5_2.png)

### Data Vectorization Classes

In [6]:
import os
from argparse import Namespace
from collections import Counter
import json
import re
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

import utils

<IPython.core.display.Javascript object>

In [7]:
class Vocabulary(object):
    """
    Class to process text and extract vocabulary for mapping.
    """

    def __init__(
        self, token_to_idx=None, mask_token="<MASK>", add_unk=True, unk_token="<UNK>"
    ):
        """
        Args:
            token_to_idx: a pre-existing map of tokens to indices.
            mask_token: the MASK token to add into the Vocab.
            add_unk: a flag that indicates whether to add the UNK token.
            unk_token: the UNK token to add into the vocab.
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}
        self._add_unk = add_unk
        self._unk_token = unk_token
        self._mask_token = mask_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """
        Returns a dictionary that can be serialized.
        """
        return {
            "token_to_idx": self._token_to_idx,
            "add_unk": self._add_unk,
            "unk_token": self._unk_token,
            "mask_token": self._mask_token,
        }

    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiates the vocab from a serialized dictionary.
        """
        return cls(**contents)

    def add_token(self, token):
        """
        Update mapping dicts based on the token.

        Args:
            token: the item to add into the vocab.
        Returns:
            index: the integer corresponding to the token.
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """
        Add a list of tokens into the Vocabulary.

        Args:
            tokens: a list of string tokens.
        Returns:
            indices: a list of indices corresponding to the tokens.
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """
        Retrieve the index associated with the token or the UNK
        index if token isnt present.

        Args:
            token: the token to lookup.
        Returns:
            index: the index corresponding to the token
        Notes:
            `unk_index` needs to be >= 0 (having been added into the vocab).
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """
        Return the token associated with the index.

        Args:
            index: the index to look up.
        Returns:
            token: the token corresponding to the index.
        Raises:
            KeyError: if the index is not in the vocab.
        """
        if index not in self._idx_to_token:
            raise KeyError(f"the index {index} is not in the vocab")
        return self._idx_to_token[index]

    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"

    def __len__(self):
        return len(self._token_to_idx)

<IPython.core.display.Javascript object>

In [8]:
class CBOWVectorizer(object):
    """
    The vectorizer which coordinates the vocabularies and puts them to use.
    """

    def __init__(self, cbow_vocab):
        """
        Args:
            cbow_vocab: maps words to integers.
        """
        self.cbow_vocab = cbow_vocab

    def vectorize(self, context, vector_length=-1):
        """
        Args:
            context: the string of words separated by a space.
            vector_length: an argument for forcing the length of index vector.
        """
        indices = [self.cbow_vocab.lookup_token(token) for token in context.split(" ")]
        if vector_length < 0:
            vector_length = len(indices)
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[: len(indices)] = indices
        out_vector[len(indices) :] = self.cbow_vocab.mask_index
        return out_vector

    @classmethod
    def from_dataframe(cls, cbow_df):
        """
        Instantiate the vectorizer from the dataset df.

        Args:
            cbow_df: the target dataset.
        Returns:
            an instance of the CBOWVectorizer.
        """
        cbow_vocab = Vocabulary()
        for index, row in cbow_df.iterrows():
            for token in row.context.split(" "):
                cbow_vocab.add_token(token)
            cbow_vocab.add_token(row.target)
        return cls(cbow_vocab)

    @classmethod
    def from_serializable(cls, contents):
        cbow_vocab = Vocabulary.from_serializable(contents["cbow_vocab"])
        return cls(cbow_vocab=cbow_vocab)

    def to_serializable(self):
        return {"cbow_vocab": self.cbow_vocab.to_serializable()}

<IPython.core.display.Javascript object>

In [9]:
class CBOWDataset(Dataset):
    def __init__(self, cbow_df, vectorizer):
        """
        Args:
            cbow_df: the dataset.
            vectorizer: vectorizer instantiated from dataset.
        """
        self.cbow_df = cbow_df
        self._vectorizer = vectorizer
        measure_len = lambda context: len(context.split(" "))
        self._max_seq_length = max(map(measure_len, cbow_df.context))

        self.train_df = self.cbow_df[self.cbow_df.split == "train"]
        self.train_size = len(self.train_df)

        self.val_df = self.cbow_df[self.cbow_df.split == "val"]
        self.val_size = len(self.val_df)

        self.test_df = self.cbow_df[self.cbow_df.split == "test"]
        self.test_size = len(self.test_df)

        self._lookup_dict = {
            "train": (self.train_df, self.train_size),
            "val": (self.val_df, self.val_size),
            "test": (self.test_df, self.test_size),
        }
        self.set_split("train")

    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """
        Load dataset and make a new vectorizer.

        Args:
            cbow_csb: location of the dataset.
        Returns:
            an instance of CBOWDataset.
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split == "train"]
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, cbow_csb, vectorizer_filepath):
        """
        Load dataset and the corresponding vectorizer.
        Used in the case in the vectorizer has been cached for re-use.

        Args:
            cbow_csb: location of the dataset.
            vectorizer_filepath: location of the saved vectorizer.
        Returns:
            an instance of CBOWDataset.
        """
        cbow_df = pd.read_csv(cbow_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(cbow_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """
        A static method for loading the vectorizer from file.

        Args:
            vectorizer_filepath: the location of the serialized vectorizer.
        Returns:
            an instance of CBOWVectorizer.
        """
        with open(vectorizer_filepath) as fp:
            return CBOWVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """
        Saves the vectorizer to disk using json.

        Args:
            vectorizer_filepath: the location to save the vectorizer.
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """
        Returns the vectorizer.
        """
        return self._vectorizer

    def set_split(self, split="train"):
        """
        Selects the splits in the dataset using a column in the dataframe.
        """
        self._train_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """
        The primary entryp point method for PyTorch dataset.

        Args:
            index: the index to the data point.
        Returns:
            a dictionary holding the data point's features(x_data) and label(y_target).
        """
        row = self._target_df.iloc[index]
        context_vector = self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)
        return {"x_data": context_vector, "y_target": target_index}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

<IPython.core.display.Javascript object>

### The CBOW Model

In [10]:
class CBOWClassifier(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        """
        Args:
            vocabulary_size: number of vocab items.
            embedding_size: size of the embeddings.
            padding_idx: default 0, Embedding will not use this index.
        """
        super(CBOWClassifier, self).__init__()
        self.embedding = nn.Embedding(
            num_embeddings=vocabulary_size,
            embedding_dim=embedding_size,
            padding_idx=padding_idx,
        )
        self.fc1 = nn.Linear(in_features=embedding_size, out_features=vocabulary_size)

    def forward(self, x_in, apply_softmax=False):
        """
        The forward pass of the classifier.

        Args:
            x_in: an input data tensor. x_in.shape should be (batch, input_dim).
            apply_softmax: a flag for the softmax activation.
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim).
        """
        x_embedded_sum = self.embedding(x_in).sum(dim=1)
        x_embedded_sum = F.dropout(x_embedded_sum, 0.3)
        y_out = self.fc1(x_embedded_sum)
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
        return y_out

<IPython.core.display.Javascript object>

### Model Training & Evaluation

In [11]:
args = Namespace(
    # Data and Path information
    cbow_csv="../data/books/frankenstein_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="models/chapter05/cbow",
    # Model hyper parameters
    embedding_size=50,
    # Training hyper parameters
    seed=1337,
    num_epochs=100,
    learning_rate=0.0001,
    batch_size=32,
    early_stopping_criteria=5,
    # Runtime options
    cuda=True,
    catch_keyboard_interrupt=True,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))


# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")

print("Using CUDA: {}".format(args.cuda))


# Set seed for reproducibility
utils.set_seed_everywhere(args.seed, args.cuda)

# handle dirs
utils.handle_dirs(args.save_dir)

Expanded filepaths: 
	models/chapter05/cbow/vectorizer.json
	models/chapter05/cbow/model.pth
Using CUDA: False


<IPython.core.display.Javascript object>

In [12]:
if args.reload_from_files:
    print("Loading dataset and loading vectorizer")
    dataset = CBOWDataset.load_dataset_and_load_vectorizer(
        args.cbow_csv, args.vectorizer_file
    )
else:
    print("Loading dataset & creating vectorizer")
    dataset = CBOWDataset.load_dataset_and_make_vectorizer(args.cbow_csv)
    dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = CBOWClassifier(
    vocabulary_size=len(vectorizer.cbow_vocab), embedding_size=args.embedding_size
)
print(classifier)

Loading dataset & creating vectorizer
CBOWClassifier(
  (embedding): Embedding(6138, 50, padding_idx=0)
  (fc1): Linear(in_features=50, out_features=6138, bias=True)
)


<IPython.core.display.Javascript object>

In [None]:
classifier = classifier.to(args.device)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer, mode="min", factor=0.5, patience=1
)

train_state = utils.train_model(
    classifier=classifier,
    loss_func=loss_func,
    optimizer=optimizer,
    scheduler=scheduler,
    dataset=dataset,
    args=args,
)
train_state = utils.evaluate_test_split(
    classifier, dataset, loss_func, train_state, args
)

Training Routine:   0%|          | 0/100 [00:00<?, ?it/s]

split=train:   0%|          | 0/1984 [00:00<?, ?it/s]

split=val:   0%|          | 0/425 [00:00<?, ?it/s]

--------------- 0th Epoch Stats---------------
Training Loss=8.791101831822646, Training Accuracy=1.5703755040322613
Validation Loss=8.070989081438857, Validation Accuracy=4.176470588235291.
------------------------------------------------------------
--------------- 10th Epoch Stats---------------
Training Loss=5.938392519229839, Training Accuracy=13.904989919354827
Validation Loss=6.7094865484798625, Validation Accuracy=13.0.
------------------------------------------------------------
--------------- 20th Epoch Stats---------------
Training Loss=5.410481493799914, Training Accuracy=15.265877016129041
Validation Loss=6.579481416590075, Validation Accuracy=13.98529411764705.
------------------------------------------------------------
--------------- 30th Epoch Stats---------------
Training Loss=5.16410692624987, Training Accuracy=15.908518145161317
Validation Loss=6.535959159626684, Validation Accuracy=14.55882352941176.
------------------------------------------------------------
--

### Trained Embeddings

In [None]:
def pretty_print(results):
    """
    Pretty Print Embedding Results.
    """
    for item in results:
        print(f"[{item[1]}] = {item[0]}")
        
def get_closed(target_word, word_to_idx, embeddings, n=5):
    """
    Get the n closest words to your word.
    """
    word_embedding = embeddings[word_to_idx[target_word.lower()]]
    distances = []
    for word, index in word_to_idx.items():
        if word == "<MASK>" or word == target_word:
            continue
        distances.append((word, torch.dist(word_embedding, embeddings[index])))
    results = sorted(distances, key=lambda x: x[1])[1:n+2]
    return results

In [None]:
word = input("Enter a word:")
embeddings = classifer.embedding.weight.data
word_to_idx = vectorizer.cbow_vocab._token_to_idx
pretty_print(
    get_closest(word, word_to_idx, embeddings, n=5)
)

## Example: Transfer Learning Using Pretrained Embeddings for Document Classification

### Data Vectorization classes

In [50]:
import os
from argparse import Namespace
from collections import Counter
import json
import re
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

import utils

In [51]:
class Vocabulary(object):
    """
    Class to process text and extract vocabulary for mapping.
    """
    def __init__(self, token_to_idx=None):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices.
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {
            idx: token
            for token, idx in self._token_to_idx.items()
        }
        
    def to_serializable(self):
        """
        Returns a dictionary that can be serialized.
        """
        return {
            'token_to_idx': self._token_to_idx
        }
    
    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiates the Vocabulary from a Srialized Dictionary.
        """
        return cls(**contents)
    
    def add_token(self, token):
        """
        Update mapping dicts based on the token.
        
        Args:
            token: the item to add into the Vocab.
        Returns:
            index: the integer corresponding to the token.
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    

    def add_many(self, tokens):
        """
        Add a list of tokens into the Vocab.
        
        Args:
            tokens: a list of string tokens.
        Returns:
            indices: a list of indices corresponding to the tokens.
        """
        return [self.add_token(token) for token in tokens]
    
    def lookup_token(self, token):
        """
        Retrieve the index associated with the token.
        
        Args:
            token: the token to look up.
        Returns:
            index: the index corresponding to the token.
        """
        return self._token_to_idx[token]
    
    def lookup_index(self, index):
        """
        Return the token associated with the index.
        
        Args:
            index: the index to lookup.
        Returns:
            token: the token corresponding to the index.
        Raises:
            KeyError: if the index is not in the vocab.
        """
        if index not in self._idx_to_token:
            raise KeyError(f"The index {index} is not in the Vocab.")
        return self._idx_to_token[index]
    
    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"
    
    def __len__(self):
        return len(self._token_to_idx)

![Figure 5.3](../images/figure_5_3.png)

In [59]:
class SequenceVocabulary(Vocabulary):
    def __init__(
        self, token_to_idx=None, unk_token="<UNK>",
        mask_token="<MASK>", begin_seq_token="<BEGIN>",
        end_seq_token="<END>"
    ):
        super(SequenceVocabulary, self).__init__(token_to_idx)
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token
        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)
        
    def to_serializable(self):
        contents = super(SequenceVocabulary, self).to_serializable()
        contents.update(
            {
                'unk_token': self._unk_token,
                'mask_token': self._mask_token,
                'begin_seq_token': self._begin_seq_token,
                'end_seq_token': self._end_seq_token
            }
        )
        return contents
    
    def lookup_token(self, token):
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
        

In [84]:
class NewsVectorizer(object):
    def __init__(self, title_vocab, category_vocab):
        self.title_vocab = title_vocab
        self.category_vocab = category_vocab
        
    def vectorize(self, title, vector_length=-1):
        indices = [self.title_vocab.begin_seq_index]
        indices.extend(
            self.title_vocab.lookup_token(token)
            for token in title.split(" ")
        )
        indices.append(self.title_vocab.end_seq_index)
        
        if vector_length < 0:
            vector_length = len(indices)
            
        out_vector = np.zeros(vector_length, dtype=np.int64)
        print(len(indices), out_vector.shape, title)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.title_vocab.mask_index
        return out_vector
    
    @classmethod
    def from_dataframe(cls, news_df, cutoff=25):
        category_vocab = Vocabulary()
        for category in sorted(set(news_df.category)):
            category_vocab.add_token(category)
        
        word_counts = Counter()
        for title in news_df.title:
            for token in title.split(" "):
                if token not in string.punctuation:
                    word_counts[token] += 1
        
        title_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count >= cutoff:
                title_vocab.add_token(word)
        return cls(title_vocab, category_vocab)
    
    @classmethod
    def from_serializable(cls, contents):
        title_vocab = SequenceVocabulary().from_serializable(
            contents['title_vocab']
        )
        category_vocab = Vocabulary.from_serializable(
            contents['category_vocab']
        )
        return cls(title_vocab=title_vocab, category_vocab=category_vocab)
    
    def to_serializable(self):
        return {
            'title_vocab': self.title_vocab.to_serializable(),
            'category_vocab': self.category_vocab.to_serializable()
        }

In [85]:
class NewsDataset(Dataset):
    def __init__(self, news_df, vectorizer):
        self.news_df = news_df
        self._vectorizer = vectorizer
        
        measure_len = lambda content: len(content.split(" "))
        self._max_seq_length = max(map(measure_len, news_df.title))

        self.train_df = self.news_df[self.news_df.split == 'train']
        self.train_size = len(self.train_df)
        
        self.val_df = self.news_df[self.news_df.split == 'val']
        self.val_size = len(self.val_df)
        
        self.test_df = self.news_df[self.news_df.split == 'test']
        self.test_size = len(self.test_df)
        
        self._lookup_dict = {
            'train': (self.train_df, self.train_size),
            'val': (self.val_df, self.val_size),
            'test': (self.test_df, self.test_size)
        }
        self.set_split('train')
        
        class_counts = news_df.category.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.category_vocab.lookup_token(item[0])
        
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequences = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequences, dtype=torch.float32)
        
    @classmethod
    def load_dataset_and_make_vectorizer(cls, news_csv):
        news_df = pd.read_csv(news_csv)
        train_news_df = news_df[news_df.split == 'train']
        return cls(news_df, 
                   NewsVectorizer.from_dataframe(train_news_df))
    
    @classmethod
    def load_dataset_and_load_vectorizer(cls, news_csv, vectorizer_filepath):
        news_df = pd.read_csv(news_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(news_csv, vectorizer)
    
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath) as fp:
            return NewsVectorizer.from_serializable(json.load(fp))
    
    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, 'w') as fp:
            json.dump(self._vectorizer.to_serializable(), fp)
            
    def get_vectorizer(self):
        return self._vectorizer
    
    def set_split(self, split='train'):
        self._train_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
        
    def __len__(self):
        return self._target_size
    
    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        title_vector = self._vectorizer.vectorize(
            row.title, self._max_seq_length
        )
        category_index = self._vectorizer.category_vocab.lookup_token(
            row.category
        )
        return {
            'x_data': title_vector,
            'y_target': category_index
        }
    
    def get_num_batches(self, batch_size):
        return len(self) // batch_size

### The NewsClassifier

In [86]:
class NewsClassifier(nn.Module):
    def __init__(
        self, embedding_size, num_embeddings, num_channels,
        hidden_dim, num_classes, dropout_p,
        pretrained_embeddings=None, padding_idx=0
    ):
        super(NewsClassifier, self).__init__()
        if pretrained_embeddings is None:
            self.emb = nn.Embedding(
                embedding_dim=embedding_size,
                num_embeddings=num_embeddings,
                padding_idx=padding_idx
            )
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.emb = nn.Embedding(
                embedding_dim=embedding_size,
                num_embeddings=num_embeddings,
                padding_idx=padding_idx,
                _weight=pretrained_embeddings
            )
        self.convnet = nn.Sequential(
            nn.Conv1d(
                in_channels=embedding_size,out_channels=num_channels,
                kernel_size=3
            ),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3),
            nn.ELU()
        )
        self._dropout_p = dropout_p
        self.fc1 = nn.Linear(num_channels, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        
    def forward(self, x_in, apply_softmax=False):
        # Why permute here?
        # Embed and permute so features are channels
        x_embedded = self.emb(x_in).permute(0, 2, 1)
        features = self.convnet(x_embedded)
        
        # Average and remove the extra dimension
        remaining_size = features.size(dim=2)
        features = F.avg_pool1d(features, remaining_size).squeeze(dim=2)
        features = F.dropout(features, p=self._dropout_p)
        
        # MLP Classifier
        intermediate_vector = F.relu(
            F.dropout(
                self.fc1(features),
                p=self._dropout_p))
        prediction_vector = self.fc2(intermediate_vector)
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector

### Utils

In [87]:
def load_glove_from_file(glove_filepath):
    word_to_index = {}
    embeddings = []
    with open(glove_filepath, 'r') as fp:
        for index, line in enumerate(fp):
            line = line.split(" ")
            word_to_index[line[0]] = index
            embedding_i = np.array(
                [float(val) for val in line[1:]]
            )
            embeddings.append(embedding_i)
    return word_to_index, np.stack(embeddings)

def make_embedding_matrix(glove_filepath, words):
    word_to_idx, glove_embeddings = load_glove_from_file(glove_filepath)
    embedding_size = glove_embeddings.shape[1]
    final_embeddings = np.zeros((len(words), embedding_size))
    for i, word in enumerate(words):
        if word in word_to_idx:
            final_embeddings[i, :] = glove_embeddings[word_to_idx[word]]
        else:
            embedding_i = torch.ones(1, embedding_size)
            torch.nn.init.xavier_uniform_(embedding_i)
            final_embeddings[i, :] = embedding_i
    return final_embeddings

### Model Training & Evaluation

In [88]:

args = Namespace(
    # Data and Path hyper parameters
    news_csv="../data/ag_news/news_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="models/chapter05/document_classification",
    # Model hyper parameters
    glove_filepath='../data/glove.6B.100d.txt', 
    use_glove=False,
    embedding_size=100, 
    hidden_dim=100, 
    num_channels=100, 
    # Training hyper parameter
    seed=1337, 
    learning_rate=0.001, 
    dropout_p=0.1, 
    batch_size=128, 
    num_epochs=100, 
    early_stopping_criteria=5, 
    # Runtime option
    cuda=True, 
    catch_keyboard_interrupt=True, 
    reload_from_files=False,
    expand_filepaths_to_save_dir=True
) 

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
    
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

# Set seed for reproducibility
utils.set_seed_everywhere(args.seed, args.cuda)

# handle dirs
utils.handle_dirs(args.save_dir)

Expanded filepaths: 
	models/chapter05/document_classification/vectorizer.json
	models/chapter05/document_classification/model.pth
Using CUDA: False


In [89]:
args.use_glove = True

if args.reload_from_files:
    # training from a checkpoint
    dataset = NewsDataset.load_dataset_and_load_vectorizer(args.news_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    dataset = NewsDataset.load_dataset_and_make_vectorizer(args.news_csv)
    dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()

# Use GloVe or randomly initialized embeddings
if args.use_glove:
    words = vectorizer.title_vocab._token_to_idx.keys()
    embeddings = make_embedding_matrix(glove_filepath=args.glove_filepath, 
                                       words=words)
    print("Using pre-trained embeddings")
else:
    print("Not using pre-trained embeddings")
    embeddings = None

classifier = NewsClassifier(embedding_size=args.embedding_size, 
                            num_embeddings=len(vectorizer.title_vocab),
                            num_channels=args.num_channels,
                            hidden_dim=args.hidden_dim, 
                            num_classes=len(vectorizer.category_vocab), 
                            dropout_p=args.dropout_p,
                            pretrained_embeddings=embeddings,
                            padding_idx=0)

Using pre-trained embeddings


In [90]:
classifer = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(
    classifer.parameters(),
    lr=args.learning_rate
)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer,
    mode='min',
    factor=0.5,
    patience=1
)
train_state = utils.train_model(
    classifier=classifier,
    loss_func=loss_func,
    optimizer=optimizer,
    scheduler=scheduler,
    dataset=dataset,
    args=args,
)
train_state = utils.evaluate_test_split(
    classifier, dataset, loss_func, train_state, args
)

Training Routine:   0%|          | 0/100 [00:00<?, ?it/s]

split=train:   0%|          | 0/656 [00:00<?, ?it/s]

split=val:   0%|          | 0/140 [00:00<?, ?it/s]

6 (19,) Women facing retirement poverty
12 (19,) Volkswagen workers walk out in Hanover, wage talks to continue
7 (19,) BA plane diverted after hoax
12 (19,) Four Studios Give Backing to a Format for DVD #39;s
14 (19,) Syria, Hamas: Israel tried to assassinate a movement #39;s cadre in &lt;b&gt;...&lt;/b&gt;
11 (19,) KCEU to Strike from Monday; Relay Strikes May Follow
13 (19,) IMF chief counsels Putin on how to double Russia's growth (AFP)
8 (19,) Heathrow Security Breach to be Investigated
10 (19,) Thomson Financial Receives Subpoena From US SEC (Update2)
11 (19,) Washington case done; will NCAA act on MSU next?
9 (19,) China Seeks to Soothe Markets on Rates
10 (19,) Tinker With Your Computer, and Reap the Rewards
5 (19,) Gates: Passwords passe
7 (19,) Playboy Forecasts Higher 2005 Profit
8 (19,) Notebook: Return in Hughes #39; skates?
5 (19,) Enter your e-mail:
7 (19,) Microsoft #39;s Competition From Mozilla
11 (19,) Army Says C.I.A. Hid More Iraqis Than It Claimed
6 (19,) Wish you

10 (19,) Amtrak in Another Hole with \$1.3 Bln Loss
8 (19,) Russia, Israel Agree on Anti-Terror Union
10 (19,) US warns N. Korea against missile tests (AFP)
9 (19,) Slip a Geek Book Under the Tree
6 (19,) Grumbling Over the Greenback
7 (19,) It's All in His Head
11 (19,) 12 Years After the Riots, Rodney King Gets Along
11 (19,) Kiely on hand to save the day for Addicks
11 (19,) YUKOS Can Pay Half Its \$3.4 Bln Tax Bill
10 (19,) Woman Lives With 6,000 Scorpions, Claims New Record
7 (19,) We dont recognize the results
7 (19,) Strong earthquake hits northern Japan
7 (19,) Rockies Terminate Neagle's Contract (AP)
8 (19,) World #39;s oldest man dies, 113
7 (19,) Leyland among Phils #39; candidates
5 (19,) McGahee Scores Four
7 (19,) Judge: Boston Scientific Violated Deal
9 (19,) NFL Game Summary - Minnesota at Indianapolis
11 (19,) News Corp #39;s move to US backed by investors
8 (19,) Guardian: Apple Must License or Perish
15 (19,) NIGERIA: Militia group vows to attack oil workers if firms

10 (19,) Enron Case Prosecutors Calling for a Single Trial
10 (19,) In Cup Gambit, U.S. Is Pairing Its Aces
6 (19,) Sony #39;s October surprise
6 (19,) Race inquiry hits trouble
8 (19,) And the Unexpected Olympics Winner Is...Athens
8 (19,) Ranieri honeymoon over after Valencia mauling
8 (19,) Remembering Hurricane Hazel 50 Years Later
10 (19,) Oil Above  #36;47 After New Record (Reuters)
15 (19,) First Heisman ballot is in hand, but actually picking a winner a &lt;b&gt;...&lt;/b&gt;
6 (19,) NASA Launches Hypersonic 'Scramjet'
8 (19,)  #39;Football animal #39; steps up
11 (19,) Royalty payout could await voices behind 'The Wall' (AFP)
13 (19,) The Customer Relationship Expert Takes a Dose of Its Own Medicine
8 (19,) First Sentence for Violating Privacy Law
6 (19,) Independence Continues to Struggle
10 (19,) Reeves Says EBay May Cause Identity Theft (AP)
10 (19,) Lycos Europe confronts strong resistance in spam war
10 (19,) Iran #39;s Revolutionary Guards getting a bigger appetite
7 (19

7 (19,) The Olympic  Drug Problem
9 (19,) Durazo Leads Athletics by Orioles 5-4 (AP)
4 (19,) The Rundown
9 (19,) Eriksson urges more rest for top players
6 (19,) UBS Buys Schwab Unit
10 (19,) Nikkei Falls as Xilinx Hits Tech Shares (Reuters)
12 (19,) Nastase and Basescu in close finish in Romanian presidential vote
7 (19,) Asian Stock Markets Close Mixed
8 (19,) Barwick Is New FA Chief Executive
8 (19,) UK hostage 'pleads for his life'
9 (19,) U.N. Said Not Protecting Sudan Refugees (AP)
10 (19,) Parker Hannifin completes purchase of Sporlan Valve Co.
8 (19,) New bank law due this week
7 (19,) Nations voice concern for Arafat
10 (19,) US Trade Deficit No Cause for Alarm-Evans (Reuters)
14 (19,) Nortel warns it now expects 2004 revenue growth will trail its &lt;b&gt;...&lt;/b&gt;
13 (19,) Work Out Now, Ache Later: How Your Muscles Pay You Back
8 (19,) London Stock Exchange eyes Asia HQ
11 (19,) Tivo, Netflix Close to Internet Movie Deal - Report
11 (19,) US, Iraqi forces start new move 

14 (19,) Big 6 banks poised for profit slowdown, but still expect double &lt;b&gt;...&lt;/b&gt;
9 (19,) India to build oil pipeline to Nepal
10 (19,) UPDATE 1-Henman hits back to level with Austria
8 (19,) Egypt Arrests Suspects in Taba Bombings
10 (19,) Swimming: Phelps Wins a Classic Then Steps Aside
7 (19,) Cricket: England beat Sri Lanka
9 (19,) Fourth-quarter earnings by Fisher, Henderson and Favre
11 (19,) Windows XP Service Pack 2: Is It Time Yet?
6 (19,) Opel workers end strike
6 (19,) Explosion rocks Israel checkpoint
7 (19,) Greece set for Games finale
9 (19,) I was kept in the dark: Anil
10 (19,) MSO Shares Still Confined by Martha #39;s Woes
12 (19,) State Compensation Insurance Fund Files an Average 5 Rate &lt;b&gt;...&lt;/b&gt;
6 (19,) Easy Going for Els
6 (19,) Bengals 16, Dolphins 13
6 (19,) One Shell boosts Eurostocks
11 (19,) Reuters Up on Report Instinet Up for Sale (Reuters)
8 (19,) Australian Says Practicality Key in Iraq
7 (19,) Allardyce swipe at Man Utd
12 (19,)

9 (19,) iTunes pricing unfair, says British consumer group
6 (19,) Florida St. Boots Wake
6 (19,) Corgentech treatment fails trial
11 (19,) Does Geico Own  #39;Geico #39; or Does Google?
10 (19,) Baseball legend Bonds testifies using BALCO substances: report
9 (19,) J. Lewis pleads guilty in drug case
7 (19,) Attack On Israeli Army Backfires
9 (19,) Japan's SMFG Bids \$29 Billion for UFJ
7 (19,) Hewitt back in top form
8 (19,) Orgeron to finalize deal with Rebels
8 (19,) Renowned European Laboratory Turns 50 (AP)
11 (19,) Gatlin Wins Olympic 100 Meters; Greene Finishes Third (Update4)
8 (19,) BA prepares new sick leave deal
10 (19,) Buyers back on Wall Street after two-day selloff
11 (19,) Cellphone Industry Hits Snag as It Woos Untapped Market
7 (19,) NEC restarts factory after earthquake
8 (19,) Envoys view North Korean blast site
12 (19,) Giants #39; Carter Is Showing He Can Stretch a Defense
6 (19,) Scientist Envisions Small-Scale Hydropower
6 (19,) PMC-Sierra Cuts Q3 Outlook
6 (19

10 (19,) AL Wrap: Ford, Radke Help Twins Sweep Tigers
12 (19,) Nokia Says Price Cuts Have Helped Boost Market Share (Update3)
13 (19,) US, EU to give negotiations on Boeing-Airbus subsidy row more time
13 (19,) China Raises Interest Rates for First Time in 9 Years (Update6)
10 (19,) Schroeder Won't Drop Equality Aim for East Germans
9 (19,) Microsoft paid CCIA \$19.75 million to settle
5 (19,) The (mis)information age
8 (19,) USA: Kmart appoints new chief executive
10 (19,) Study: 100,000 civilian deaths due to Iraq invasion
11 (19,) Shaukat Aziz takes oath as Pakistan #39;s new PM
9 (19,) Dell adds plasma TVs, printers to lineup
5 (19,) The Showdown Begins
14 (19,) A Texan's Race for the House That Could Lead to the F.C.C.
10 (19,) Orioles Beat Blue Jays to Start Twinbill (AP)
9 (19,) Low carbs thin out MGP earnings forecast
6 (19,) Four bidders court Boeing
4 (19,) Northeastern Conference
13 (19,) Militant Chechen Rebel Leader to Go on Trial for Masterminding &lt;b&gt;...&lt;/b&gt;
8

7 (19,) Brent Crude Falls to \$40
11 (19,) Update 8: Oil Remains Near Record Price in Asia
13 (19,) Martin offers  #36;700-million plan to improve native health (Canadian Press)
8 (19,) New sales push for struggling GM
10 (19,) Microsoft Changes Sender ID, AOL Back On Board
6 (19,) MCI adds managed services
9 (19,) Lycos Europe anti-spam Screensaver bites the dust!
8 (19,) Airline to raise number of flights
12 (19,) Wal-Mart asks for reversal of ruling in workers #39; suit
9 (19,) US Airways Pilots' Union OKs Labor Deal
7 (19,) EBay Reports Sharply Higher Profit
5 (19,) RANIERI BLASTS FRISK
7 (19,) Delta Revises 2003 Annual Report
9 (19,) Nikkei Poised to Fall for 8th Day
7 (19,) Gartner sees solid server sales
9 (19,) Middle East bid for UK water firm
10 (19,) IBM fits PCs with new hardware-based security chip
9 (19,) Directors also delay decision on first-quarter dividend
10 (19,) Outsourcers combine to take on IT services giants
9 (19,) World #39;s Biggest Aerospace Company in Offin

6 (19,) Brightest Lights Rise Together
8 (19,) State #39;s economy surges in October
11 (19,) Via gives a PCI Express boost to AMD-based computers
8 (19,) Drug hope in prostate cancer care
8 (19,) Loophole Pays Off on Upscale Buildings
8 (19,) Skoko tells of Roos #39; resolve
9 (19,) Red Cross Returns After Fallujah Offensive (AP)
4 (19,) Shear class
6 (19,) Race Is Wide Open
9 (19,) Crocodile attacks two campers in northern Australia
9 (19,) Colts Thrive in Grueling Four-Game Stretch (AP)
8 (19,) 10 Deaths Blamed on Hurricane Frances
7 (19,) Iran rejects EU nuclear proposal
7 (19,) Jackson Accuser Kin Wanted Compensation
11 (19,) 23 tigers suspectedly die of bird flu in Thailand
13 (19,) A Slowed Hurricane Jeanne Slams Into Florida, on Way to Tampa
12 (19,) I #39;m not the first bowler to get smashed: Kasprowicz
9 (19,) Pavin eagles his way into Buick lead
6 (19,) Apple launch new iPod
8 (19,) Hearing Set After Microsoft Rivals Quit
11 (19,) Kids happy to get an Apple in their stockin

13 (19,) XM Satellite in  #36;650 Million Deal with Pro Baseball (Reuters)
8 (19,) Hendry dismisses talk of grudge-carrying umps
7 (19,) DataPlay Discs Stage a Comeback
11 (19,) CONTROVERSIAL MOVE TO LET AMERICANS HIT REBEL POSTS HARDER
8 (19,) Dusenberry loves living in fast lane
9 (19,) US Airways, pilots reach pact on cuts
6 (19,) Office Depot dumps chairman
10 (19,) Pakistan prime minister-designate declared official victor in by-elections
10 (19,) Oprah #39;s Car Giveaway, Majorly Boosted Site Traffic
8 (19,) U.S. Forces Keep Sh'ite Militants Guessing
7 (19,) Allardyce backs Mutu to return
8 (19,) Tech companies building bridges with China
5 (19,) Above and beyond
8 (19,) Sharon contacts Labor on govt coalition
12 (19,) White Sox deal C. Lee to Brewers for Podsednik, Vizcaino
13 (19,) Online Games Could Be Next Big Thing, But Not Yet (Reuters)
9 (19,) BellSouth: Retiree Costs to Cut Q4 Earns
8 (19,) UK scientists bring hydrogen cars closer
9 (19,) PeopleSoft board rejects takeover

8 (19,) BA flies toward a stormy winter
6 (19,) High incomes, high scrutiny
7 (19,) BHP Billiton approves \$US990m mine
9 (19,) Clijsters #39; latest injury could be serious
10 (19,) Microsoft Previews 64-Bit XP, Promises No Price Hikes
8 (19,) Lehmann in, Katich out ... again
11 (19,) Mozilla Launching Second Act with E-Mail Client (Ziff Davis)
10 (19,) Borland to tout tool for building Microsoft apps
11 (19,) Garcia and Lara beat the elements on home turf
10 (19,)  quot;Blog quot; takes #1 word of 2004
9 (19,) Sonics soar past Mavs, improve to 17-3
8 (19,) Russian upper house ratifies Kyoto Protocol
11 (19,) U.S. Open: American Agony as Roddick, Agassi Go Out
9 (19,) Sears Shares Soar as Vornado Boosts Stake
12 (19,) Microsoft and labels in talks about copy protection and Longhorn
9 (19,) Controversial Microsoft security plan heads for Longhorn
7 (19,) UPS Steps Up RFID Efforts
13 (19,) Mixed bag for Putnam Marsh chief's exit met with relief, concern
11 (19,) Debatable: Should there 

9 (19,) US - Consumer Price Index (Oct, 2004)
8 (19,) Iran Plans to Resume Nuclear Enrichment
6 (19,) Another Real bad performance
8 (19,) Google shares set to hit market
11 (19,) Update 1: Tokyo Stocks Lower; Dollar Falls Vs. Yen
9 (19,) Japan asks Citi to end pvt banking
9 (19,) Dell to Offer SuSE Linux-Based PowerEdge Servers
10 (19,) Iranian, Egyptian hostages freed; 12 killed in Iraq
9 (19,) Axim X30 Getting Windows Media Player Update
9 (19,) Kuwait minister assures release of Indian hostages
13 (19,) Extra is not responsible for the content of external Internet &lt;b&gt;...&lt;/b&gt;
8 (19,) Kodak sues Sun for \$1 billion
12 (19,) Arsenal trio in injury race for Old Trafford summit (AFP)
7 (19,) Delta Pilots Approve Concession Package
9 (19,) Stewart Ordered to Prison Within 3 Weeks
8 (19,) Some rely on boats in Maldives
8 (19,) Drug: Test results excite Alpharetta maker
8 (19,) More protection urged for rare toothfish
6 (19,) IBM launches Atlantic (InfoWorld)
9 (19,) Northwest 

7 (19,) Phone defect could affect hearing
10 (19,) Stocks Up But Late Oil Spike Trims Gains
10 (19,) 'Grand Theft Auto' Seen Breaking Games Sales Record
11 (19,) US, European Officials Await Decision on Iranian Nuclear Program
9 (19,) A Submersible Robot Dives for Steamship Gold
9 (19,) New U.S. Offensive Could Backfire in Iraq
10 (19,) Computer Associates to Pay \$225 Over Accounting Scandal
10 (19,) Rockies #39; Neagle charged with soliciting a prostitute
8 (19,) False comfort in latest Chinese figures?
11 (19,) Palestinians Turn Toward Future as They Continue to Mourn
10 (19,) Heavy Guard as Israeli Parliament Debates Gaza Withdrawal
9 (19,) Nikkei Poised to Fall for 8th Day
8 (19,) US takes aim at e-mail spammers
7 (19,) Castroneves captures Indy race pole
8 (19,) Sun lures Siebel to x86 Solaris
9 (19,) McGrady's Flurry Sends Rockets Past Spurs (AP)
12 (19,) Iraqi PM warned to  #39;await Angel of Death #39;
12 (19,) Former F1 driver Jacques Villeneuve to test drive for Renault
8 (1

7 (19,) Jaguar workers resigned to closure
8 (19,) TSX chief will lead Royal shake-up
11 (19,) NASA To Test Laser Communications With Mars Spacecraft (SPACE.com)
12 (19,) French hostages fate hangs by a thread one month on
11 (19,) Constitutional Court Puts an End to Capital Relocation Controversy
11 (19,) Appeals court takes up law used against energy traders
7 (19,) Padres Defeat Dodgers 7-3 (AP)
10 (19,) Consumers seen as drivers of longer-term semi growth
8 (19,) Kroger's Profit Climbs, Misses Forecast (Reuters)
10 (19,) LaCassie, Price qualify for US Amateur Match Play
10 (19,) Dolphins GM may not have total job security
10 (19,) Getting Into Google Book Search - Google Print
10 (19,) U.S. Rejects N.Korea Conditions for Nuclear Talks (Reuters)
10 (19,) Giants Agree to 3-Year Deal With Matheny (AP)
10 (19,) i2 looks to technology rollout for a rebound
9 (19,) Nordstrom shares are hammered after earnings miss
10 (19,) Opec begs US help to calm oil market
8 (19,) Microsoft Warns of P

9 (19,) Atlas Copco to Sell Electric Tool Business
10 (19,) Atari to Reissue Scores of Old Video Games
10 (19,) Zarqawi Group Claims Killing of 50 Iraqis (AP)
7 (19,) PacifiCare to Buy American Medical
4 (19,) Today's schedule
9 (19,) Ginepri Ousts Srichaphan at CA Trophy (AP)
12 (19,) FIFA aims to help players prepare for 2006 World Cup
6 (19,) Kmart Appoints New CEO
7 (19,) Titan Surface Still a Mystery
7 (19,) Wife tried to sway judge
7 (19,) Nigerian oil flows, for now
10 (19,) Suicide Bomber Dies, Wounds Seven in Kabul Attack
8 (19,) Patriots Sign Top Pick Watson (Reuters)
9 (19,) Justices to hear Seattle newspapers #39; dispute
10 (19,) Notre Dame Bounces Back to Stun Michigan (AP)
9 (19,) Iraq's Sistani Begins Journey to Najaf-Witnesses (Reuters)
10 (19,) Count Delay in Afghan Poll, Candidates Coming Round
9 (19,) Dirrell Boosts American Title Hopes in Boxing
7 (19,) Hundreds Arrested in Telemarketing Scheme
9 (19,) PepsiCo Says Profit Rises on Tax Benefits
10 (19,) No time to r

7 (19,) Marconi wipes out debt mountain
11 (19,) US Appeals Nafta Ruling Against Tariffs on Canadian Lumber
9 (19,) HP's Fiorina stood up by Argentina's President
9 (19,) Radar Reaches Titan #39;s Surface Through Clouds
9 (19,) Comcast, Microsoft to Deploy Video Recorder Set-Tops
8 (19,) US raises interest rates to 2.25
11 (19,) Afghan women offer to take place of UN hostages
9 (19,) L-3 to Buy CAE Marine Controls Division
10 (19,) Cap Harnesses Human Thought to Move PC Cursor
6 (19,) Liechtenstein royals swap power
9 (19,) New spasm of violence sweeps across Iraq
9 (19,) Nokia's Back on Top (The Motley Fool)
7 (19,) Blockbuster Interested in Hollywood Chain
7 (19,) Ali: Boxing needs US commission
9 (19,) Dictators "Defacing" Famed Burma Temples, Editor Says
9 (19,) Ivory Coast to Pull Back Its Forces
9 (19,) JPMorgan to Absorb 4,000 IBM Workers (AP)
10 (19,) Alaska Air #39;s net beats analysts #39; projections
5 (19,) LATEST FOREX NEWS
9 (19,) U.S. Awaits Judgment on Venezuela Voting 

15 (19,) 3 Directors at Cox to Study Bid by Parent to Take Unit Private
8 (19,) Dollar Is Calm Before Payrolls Storm
9 (19,) IBM expands virtual resources for business partners
7 (19,) Canada slipping on environment: report
9 (19,) Rumsfeld eyes future US bases in Romania
10 (19,) J J to Buy Guidant for \$23.9 Billion
7 (19,) 100-meter winners are new crew
11 (19,) Saudis Pledge 800,000 Barrels Per Day of New Oil
6 (19,) TI increases dividend, buyback
10 (19,) BP Has Bumper Profits on High Oil Price
10 (19,) Revoking Microsoft #39;s FAT Patent Would Stir Innovation
9 (19,) Williams to box Klitschko for world title
17 (19,) Schoolgirl killer Homolka could quietly slip into any town in Canada: top cop (Canadian Press)
8 (19,) Labor groups upset over likely Wal-Mart
7 (19,) Israel Talks Tough to Arafat
7 (19,) Punch pays 335m for InnSpired
10 (19,) Just 16 centuries later, we return to Olympia
9 (19,) Security Tightened in Paris Following Embassy Bombing
9 (19,) Microsoft search encounter

11 (19,) World Bank Chief Likely to Leave in 2005 (AP)
10 (19,) Update 2: Prosecutor: Execs Conspired on Sham Sale
7 (19,) Study links cell phones, tumors
7 (19,) Google Slashes IPO Price Range
7 (19,) Xstrata makes play for WMC
11 (19,) Procter   Gamble CEO Gets \$3.5 Mln Bonus
9 (19,) Nitties grabs one-stroke lead in Australian PGA
7 (19,) Ratliff signs extension with Portland
9 (19,) US Still Wants Aircraft Deal with EU
12 (19,) Verizon: Wait for dial tone, punch in number, rock out
8 (19,) Dyke accuses Blair over Iraq war
10 (19,) Sudan Peace Talks Stuck on Issues of Security
8 (19,) Shell, Exxon Mobil Sell Gas Pipelines
8 (19,) Western investors eye Russia's wireless market
6 (19,) Socceroos #39; Kewell edge
7 (19,) Urgent review over Palace protest
7 (19,) New Ukraine vote ordered 
12 (19,) Musharraf To Shed Off Military Uniform By Year End: US
11 (19,) Update 3: Bank of America Cuts Some Fleet Workers
9 (19,) Cassini flyby gives little news on Titan
9 (19,) Hollywood Steps Into 

6 (19,) Virgin 'seeking Indian airline'
10 (19,) Australian striker Kewell fit for Solomon Islands game
8 (19,) Insiders Get Rich Through Google IPO
10 (19,) Right Whales Return to Georgia Coast Early (AP)
5 (19,) America's Car-Mart Accelerates
9 (19,) Oil Below \$45 as Gasoline Prices Slide
8 (19,) LSU rushing record up Broussard's alley
10 (19,) Small Asteroid Gives Earth its Closest Shave Yet
12 (19,) Owen Scores as Real Madrid Wins in Champions League (Update1)
10 (19,) Stocks Seen Higher as Oil Slips Below \$49
8 (19,) Boots shares dip on lacklustre outlook
10 (19,) Divided council gives initial OK to stadium bill
7 (19,) Nasa launches superfast jet plane
7 (19,) Male Contraception Gets a Boost
9 (19,) Els earns title as Woods is relegated
6 (19,) CSKA Moscow overcome PSG
10 (19,) Athens Olympics 2004 / Noguchi #39;s golden run
9 (19,) US Airways to boost number of flights
10 (19,) U.S. Stocks Flat as Oil Spikes Over \$51
8 (19,) CNNfn Network to Close by Mid-December
8 (19,) Hist

8 (19,) Citigroup's Thomson, Krawcheck Swap Posts (Reuters)
13 (19,) Oklahoma Moseys Into Texas and Takes the Beast by the Horns
7 (19,) Playing a different role (model)
10 (19,) Pujols, Rolen, Edmonds line up behind Bonds, Beltre
5 (19,) Todd MacCulloch Retires
9 (19,) French hostages fate rests on spy inquiry
10 (19,) Oklahoma #39;s mastery of Texas borders on ridiculous
8 (19,) AT amp;T keeps expanding Net-phone service
7 (19,) China Aviation seeks rescue plan
12 (19,) Arnesen dismisses reports of a rift with French coach Santini
7 (19,) Microsoft unveils IPTV-ready set-top box
11 (19,) U.S. to Give Cuba  #36;50,000 Hurricane Aid (AP)
8 (19,) Stocks Edge Up on Tech-Sector Optimism
6 (19,) HP drops Itanium workstations
7 (19,) Fed up with pension defaults
11 (19,) Oil Extends  #36;5 Rout, Uneasy Over China (Reuters)
11 (19,) Palestinians, Israelis meet to plan for Arafat #39;s absence
12 (19,) Compuware Seeks  #39;Severe #39; Sanctions in Suit vs. IBM
9 (19,) Ryanair makes EUR 4m off

9 (19,) Over 200 dead at Russia school 
6 (19,) Perez wins Vuelta stage
7 (19,) Gunners, Devils to call truce
10 (19,) Sprint, Nextel ringing up \$35B US wireless merger
5 (19,) Nepal talks hopes
11 (19,) China, Linux Could Put PalmSource in the Catbird Seat
12 (19,) Saudi Arabia to Boost Oil Production as Price Hits \$50
11 (19,) Report: Blair Warned of Iraq Chaos in '02 (AP)
10 (19,) Europe Sets Deadline for Iran on Weapons (AP)
12 (19,) CEO Louis Camilleri says Altria preparing to break up company
9 (19,) AOL Canada makes the leap into VoIP
7 (19,) Blogger grounded by her airline
10 (19,) Ex-Official Says Iraq Wants Show Trial for Saddam
9 (19,) Darfur crisis: Will new peace talks succeed?
6 (19,) Russians eyed in abductions
8 (19,) NASA #39;s Longest-Serving Astronaut to Retire
10 (19,) Stocks Set to Open Up; Grey in Spotlight
8 (19,) Former cybersecurity czar: Code-checking tools needed
7 (19,) Grim Signs Mark Global Warming
7 (19,) Pfizer to settle asbestos claims
10 (19,) Lethar

6 (19,) Firepower rests with Bucknell
8 (19,) Sup on a new alphabet soup
6 (19,) The Moon Eclipses Jupiter
8 (19,) Group seeks calendar, scheduling app interoperability
9 (19,) Bears' Secondary Forced to Play Hurt (AP)
7 (19,) Mr. Bush and Tax Reform
6 (19,) Mercedes Dulls DaimlerChrysler's Profit
8 (19,) Olympic joy in Greek, Chinese press
8 (19,) Taiko: Drum Master ships to retail
11 (19,) Microsoft still hopes to settle with EC, exec says
10 (19,) AP: Rehnquist Plans to Swear in Bush (AP)
6 (19,) Kidlow captures first victory
11 (19,) Jaguar given new lease of life by Red Bull
11 (19,) Daschle packs up after 26 years on Hill (USATODAY.com)
7 (19,) Symantec under attack from MyDoom
9 (19,) A Digital Doctor Treats Computer Contamination (washingtonpost.com)
5 (19,) Inter Thrash Valencia
8 (19,) Typhoon lashes southern Japan; 8 dead
8 (19,) Lexmark Issues Recall for 40,000 Printers
5 (19,) \$2,137,000 and Counting
8 (19,) Sybase upgrades PowerBuilder, plots RFID move
9 (19,) Indonesian

9 (19,) New species of monkey discovered in Arunachal
12 (19,) Judge warns US government to speed up Abu Ghraib probe
8 (19,) Montserrat volcano: Has it gone quiet?
7 (19,) Pirate for genuine XP trade
10 (19,) A tiny critter #39;s day in the sun
8 (19,) Outcasts of Greece and the Games
9 (19,) 3 hostages killed in Iraq, tape shows
13 (19,) GM to fire on all cylinders if Opel #39;s cost-cutting fails
6 (19,) Pacers will be charged
8 (19,) Germany #39;s 2004 Growth Outlook Raised
9 (19,) Werder could be next on Bayern warpath
11 (19,) Blue Titan adds reliable messaging to SOA tool (InfoWorld)
8 (19,) Stocks Up as Oil Prices Eases
11 (19,) Update 1: Sprint to Cut 700 Jobs, Expects Charge
7 (19,) GE meets 3Q, ups guidance
8 (19,) Titan moon holds on to mystery
4 (19,) FUTURES MOVERS
9 (19,) Link Popularity and Search Engine Ranking Pitfalls
12 (19,) Subway 500 victory overshadowed by 10 dead in company craft
9 (19,) Offshoring Forces Tech-Job Seekers To Shift Strategy
6 (19,) German busine

6 (19,) Insurer Allianz Beats Expectations
8 (19,) Browns' Winslow Needs Second Operation (AP)
7 (19,) Gailey: Ball still the starter
12 (19,) Intel Doubles Cash Dividend and Authorizes Repurchase of 500 &lt;b&gt;...&lt;/b&gt;
13 (19,) Group including Mansell says it has deal to save British GP
8 (19,) Diplomat, Parma Native, Killed In Iraq
9 (19,) Headset Maker Plantronics Surfs on Games Wave
7 (19,) Chacin #39;s debut a gem
11 (19,) Bush Shrugs Off Bad Polls on Iraq Outlook (AP)
8 (19,) IBM shows off new grid apps
9 (19,) Video Game Makers Rush to Cash In
8 (19,) Web giant Google cuts IPO prices
9 (19,) No travel problems reported during Turnpike strike
8 (19,) Jackson finds emotional rescue on court
10 (19,) AT amp;T to Offer Service at Circuit City
6 (19,) Broncos 34, Saints 13
7 (19,) Chinese retail sales hold firm
7 (19,) Microsoft scales back Passport ambitions
10 (19,) Microsoft Expands Intellectual Property Indemnification Coverage (Ziff Davis)
8 (19,) Midwest Business Growth 

8 (19,) Ivorian parties agree to peace proposals
9 (19,) Tyco Unit Signs Contract With UK Military
11 (19,) UN Special Rapporteur on violence against women visits Sudan
9 (19,) Oil Up on Winter Worries, Norway Outage
10 (19,) U.K. officials criticize secrecy in EDS system probe
8 (19,) Dollar hits record low vs. euro
9 (19,) Golf doesn #39;t need a buddy system
9 (19,) Birding Column: Scrub Jays' Peanut Feeding Frenzy
8 (19,) ObjectWeb plans open source BPEL server
9 (19,) Eye On Stocks For Monday, Dec. 20
14 (19,) Federal Court to Hear Microsoft Appeal of Internet Explorer Case (PC World)
9 (19,) Exploit Threat Ratchets Up On Windows Vulnerability
11 (19,) Typhoon Megi Slams North Japan, Death Toll Hits 13
12 (19,) News: TRMM Sees Rain from Hurricanes Fall Around the World
7 (19,) MLB: Texas 16, Cleveland 4
7 (19,) Utah Jazz: Russia trip cancelled
10 (19,) Kerry Supporters Left to Deal With Blues (AP)
9 (19,) Beckham ready to return for Real Madrid
9 (19,) Iraq Blames US-Led Forces fo

8 (19,) PeopleSoft, Oracle standoff likely to continue
9 (19,) Putin rules out talks with Chechen rebels
10 (19,) 16 including 3 Polish soldiers killed in Iraq
9 (19,) Computer Makers Sign Joint Code of Conduct
11 (19,) AmWest passes up bid to buy bankrupt ATA assets
9 (19,) Finally sync your BlackBerry with your Mac
16 (19,) Is more desire for sex worth risk to the heart? The FDA says no
9 (19,) OK, so Ellison is not a sociopath...
9 (19,) Digital signatures  #39;could be forged #39;
11 (19,) U.S. Pursues Samarra Offensive; 5 Dead, 20 Wounded (Reuters)
11 (19,) UN Says May Have Spotted Rwandan Troops in Congo
6 (19,) SuperGen Withdraws Drug Application
11 (19,) Oakland Hills Praised for Its Fair Ryder Cup Set-Up
5 (19,) Shopping Search Tactics
11 (19,) Death Toll Rises to 166 China Coal Blast (AP)
9 (19,) Johnson Leaves Game With Strained Hip (AP)
9 (19,) Manning closes in on Marinos TD mark
7 (19,) Sheen Stumps for Real-Life Politician
10 (19,) Maryland Dig May Reach Back 16,000 Year

9 (19,) 2 Assessments of Iraq, 2 Election Strategies
8 (19,) AP: Iran Converts Uranium Into Gas
8 (19,) Critical Flaw Discovered In MS Word
8 (19,) Two high-profile black coaches for Huskies
8 (19,) Nabi Says Vaccine Helped Smokers Quit
9 (19,) Analysis: Israel hand seen in Ivorian clash
10 (19,) Google Steals a Page from Amazon's Book Search
12 (19,) NFL Wrap: Gibbs Returns to NFL as Redskins Edge Buccaneers
12 (19,) Russia to Test New Soyuz Booster on Oct. 29 (Reuters)
7 (19,) China's economic boom slows down
10 (19,) Israeli Soldiers Kill Palestinian in West Bank Raid
9 (19,) Halliburton Says Bribes May Have Been Paid
6 (19,) I Steal Your Heritage
6 (19,) The Red Sox-Cardinals Legacy
8 (19,) Ex-Serono executive charged in bribery case
11 (19,) Microsoft questions future participation at CeBIT trade fair (AFP)
5 (19,) PSP Pricing Announced
5 (19,) a super-secure ThinkPad
8 (19,) Injured Gail Devers Can't Finish Hurdles
8 (19,) Smallest 'guitar string' to weigh atoms
6 (19,) Youzhny e

8 (19,) Japanese Lunar Probe Facing Delays (AP)
7 (19,) Hirst restaurant sale makes 11m
11 (19,) Disney #39;s Board on Trial as Shareholders Launch Case
8 (19,) Glaxo Sees Dip in Third-Quarter Profits
7 (19,) HD-DVD picks up Hollywood support
5 (19,) Wired Tools 2004
8 (19,) UTStarcom wins Indian IP gear contract
8 (19,) Durable Goods, Lower Oil Lift Stocks
10 (19,) Jurors Begin 2nd Phase Of Enron-Merrill Lynch Trial
8 (19,) Actional, Westbridge merge for Web services
9 (19,) Germany Could Lose Third Olympic Equestrian Gold
8 (19,) Clear Pictures of How We Think
7 (19,) Honda Expanding Production in China
11 (19,) Bonds #39; 700th homer ball draws bid topping \$800,000
10 (19,) Coalition thanks outgoing NASA Administrator Sean O #39;Keefe
9 (19,) Top pick Howard soaking it all in
9 (19,) Voter Registrations in Las Vegas Possibly Trashed
11 (19,) KPMG Units Agree to Pay to Settle Malpractice Suits
14 (19,) French FM in Amman after Qatar talks, heartened by calls for &lt;b&gt;...&lt;/b&g

7 (19,) United may cut 6,000 employees
9 (19,) Suicide Bomber at Pakistan Mosque Kills Three
12 (19,) Dealers: GM  #39;Red Tag #39; deals up to \$7,500
9 (19,) Unbeaten Auburn gets squeezed out of Orange
12 (19,) Ford #39;s out to put a lid on costly incentives
11 (19,) SUPERWEBINAR: Cisco #39;s 40-Gig CRS-1- Light Reading Test Results
6 (19,) Americans Forces Strike Fallujah
14 (19,) Yao Ming, Rockets in Shanghai for NBA #39;s first games in China
13 (19,) Rescuers in China search for 79 iron miners trapped in blaze
5 (19,) WHAT TO WEAR
8 (19,) Pakistan Parliament Elects New PM (AP)
7 (19,) US cyber security chief resigns
7 (19,) IPod Promoters Feel the Heat
9 (19,) Panel Says Pollution Plagues Great Lakes (AP)
8 (19,) Titan Posts Slightly Higher Profit (Reuters)
9 (19,) Yahoo battles Google for the cell phone
11 (19,) With Russia's Nod, Treaty on Emissions Clears Last Hurdle
5 (19,) Braves Hum Along
12 (19,) NASA's Return to Flight on Track, Shuttle Officials Say (SPACE.com)
10 (19,)

9 (19,) Google Shares Jump 18 In NASDAQ Debut
6 (19,) Correction: Mouse Product Review
12 (19,) Ballmer Calls Linux Threat Overblown, Touts Progress With Office &lt;b&gt;...&lt;/b&gt;
8 (19,) UK appeals for Bigley #39;s remains
7 (19,) Anti-French uproar in Ivory Coast
9 (19,) Foreign Doctors Find Arafat Has Flu -Confidant
8 (19,) BSkyB #39;s go-ahead to share buyback
8 (19,) Downer upbeat after North Korea meetings
10 (19,) UN remembers Iraq staff, a year after bombing
6 (19,) Problems come into focus
11 (19,) NL Wrap: Padres Ease Past Giants Despite Bonds Homer
8 (19,) NHS Strikes Money-Saving Deal with Microsoft
7 (19,) Bunch of TDs from Brady
6 (19,) Guaranteed contracts: the truth
10 (19,) Bill Gates remains atop list of richest: Forbes
10 (19,) Najaf police: a thin blue line between foes
10 (19,) Pakistan Says It Holds Suspects Planning Big Attacks
8 (19,) Canada #39;s Environmental Record Bad-Official Report
10 (19,) INTERVIEW: Australia #39;s Cochlear To Ramp Up Output
12 (19,)

12 (19,) Thousands turn out in big send-off for Malaysia #39;s Anwar
9 (19,) Veteran Mel Karmazin To Head Sirius Radio
9 (19,) Federal Reserve lifts target rate to 2.25
10 (19,) Iran Admits Mismanagement Behind Huge Quake Tolls (Reuters)
11 (19,) Asset Sales Seen of Limited Use to US Air
14 (19,) US dollar moves in tight range against major currencies in Asian &lt;b&gt;...&lt;/b&gt;
12 (19,) Intel Chief Begs Forgiveness, Says Company Became Too Relaxed #39;
7 (19,) Iran aide cites worse relations
11 (19,) Singapore will not change  quot;one-China quot; policy: PM
7 (19,) China targets more overseas acquisitions
11 (19,) Howard Stern will bring his show to satellite radio
13 (19,) Thai Premier Creates Panel to Probe 84 Deaths in Southern State
9 (19,) MCI Planning to Write Down Phone Assets
10 (19,) Samsung develops world's first five-megapixel camera phone (AFP)
13 (19,) Turkish Captive in Iraq Tells of Fearful Struggle to Hold On
8 (19,) Advanced Neuromodulation takes stake in Cyberon

7 (19,) Australian PM sets out priorities
7 (19,) Briefly: KDE updates Linux desktop
13 (19,) ETA Sets Off 7 Bombs Across Spain, at Least 5 Hurt
5 (19,) Wednesday #39;s preview
8 (19,) Warner learning to play it safe
9 (19,) HAL to showcase its wares for Putin
7 (19,) Sudan refugees report new attacks
10 (19,) Polish hostage in Iraq calls for troop pullout
12 (19,) Indian selectors pick two rookie fast bowlers for Bangladesh tour
11 (19,) UPDATE:Stanchart To Buy 51 Stake In Indonesia Bk Permata
9 (19,) US Airways seeks order to prohibit walkouts
11 (19,) Oil Shoots Up as Ivan Hits U.S. Shores (Reuters)
11 (19,) Euro sets new record high at 1.3074 dollars (AFP)
5 (19,) CELEBREX UNDER MICROSCOPE
9 (19,) Virginia Upsets No. 10 Arizona 78-60 (AP)
9 (19,) Fannie Mae might have to restate earnings
13 (19,) Iraqi prime minister to seek election reassurances from UN chief (AFP)
11 (19,) Greens Paint Grim Picture of Future, Warmer World (Reuters)
7 (19,) Ht Rangers 0 Celtic 0
8 (19,) Red Sox mi

9 (19,) Giants Look to Repeat Success Against Vikings
6 (19,) China textile exports develop
8 (19,) Racist Taunts Mar Soccer Match (AP)
6 (19,) Droughns hits ground running
11 (19,) Astros Top Rockies to Claim NL Wild Card (AP)
11 (19,) Catch The Shooting Stars In Tonight #39;s Meteor Shower
9 (19,) "Reindeer People" Resort to Eating Their Herds
9 (19,) N Korea says explosion was controlled demolition
7 (19,) Oil Falls to Six-Week Low
5 (19,) FROM THE DEBATE
8 (19,) Russia, China Hold Trade, Anti-Terrorism Talks
10 (19,) Iran extends condolences over death of UAE president
6 (19,) Global lessons in e-voting
9 (19,) Turkish Captives in Iraq Executed, Videotape Shows
7 (19,) Dollar Rallies on Jobs Numbers
9 (19,) Cell phone talker arrest refuels etiquette debate
9 (19,) Bagle Gets Stale But Remains a Threat
12 (19,) US Airways pilots to resume talks on pay cuts -NYT
11 (19,) NFL Wrap: McNabb Sparkles for Eagles, Manning Delights Colts
9 (19,) France Faces EU Fight Over Tax Rates
10 (19,)

6 (19,) Tellier clashed with owners
7 (19,) Kyoto 'won't hit' Russian economy
8 (19,) Greece is winner at these Olympics
4 (19,) Chip shots
5 (19,) Ohio St. Buckeyes
8 (19,) Wal-Mart announces \$10 billion share repurchase
9 (19,) Insurers obligated to pay controversial Marsh fees
9 (19,) African Union Urges Sudan, Rebels Overcome Impasse
7 (19,) Surprise! Ochoa surges to win
9 (19,) Dillon: Cash and carry setup working well
10 (19,) Blasts Inside Green Zone Kill at Least 5
8 (19,) Firefox Erodes IE Market Share (NewsFactor)
10 (19,) Lack of cash blamed for Beagle 2 failure
12 (19,) Department of Homeland Security Prevents Terrorist from Entering the U.S.
10 (19,) No. 20 Mississippi St. 74, New Orleans 59
8 (19,) MLB: San Francisco 9, Atlanta 5
10 (19,) US jobs and sales figures indicate weakening economy
6 (19,) Stike threatens oil exports
9 (19,) Corporate kleptocracy that mirrors Maxwell #39;s world
9 (19,) Perfect start for France in Federation Cup
5 (19,) Private World (Forbes.com

7 (19,) Of Bobbleheads and the Beast
12 (19,) EU hits back over Boeing #39;s  quot;massive quot; aid
9 (19,) Snow set to feel heat in Europe
6 (19,) Astronaut's Long Career Ends
10 (19,) Caminiti Dead at 41 of Possible Heart Attack
8 (19,) Brown pulls double duty for defense
8 (19,) Microsoft splashes cash on anti-spyware firm
11 (19,) Olympic bid suffers as Madrid #39;s image is tarnished
8 (19,) Powell: US will aid Palestinian elections
8 (19,) French cinemas act to jam mobiles
8 (19,) In the long run, humans prevail
4 (19,) The Rundown
20 (19,) In the  #39;Not For Long #39; league, the  #39;Skins #39; Joe Gibbs has missed an &lt;b&gt;...&lt;/b&gt;


ValueError: could not broadcast input array from shape (20,) into shape (19,)

In [91]:
dataset._max_seq_length

19