# Feed-Forward Networks for NLP

- Learning **intermediate representations** with specific properties, such as being linearly separable for a classification task, is one of the most profound consequences of using neural networks and is central to their modeling capabilities.
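
To make the idea of a linearly separable intermediate representation concrete, here is a minimal sketch using XOR. The weights are hand-picked for illustration (an assumption, not learned and not taken from the notes); they only show that a ReLU hidden layer can map inputs that no single line separates into a space where one linear function suffices.

```python
import torch
import torch.nn.functional as F

# The four XOR inputs and their labels; no single line separates
# the two classes in the original input space.
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0., 1., 1., 0.])

# Hand-picked (not learned) hidden-layer parameters, for illustration only.
W1 = torch.tensor([[1., 1.],
                   [1., 1.]])      # shape (input_dim, hidden_dim)
b1 = torch.tensor([0., -1.])

# Intermediate representation: hidden = relu(x @ W1 + b1)
hidden = F.relu(x @ W1 + b1)

# In the hidden space, one linear function reproduces XOR exactly.
w2 = torch.tensor([1., -2.])
scores = hidden @ w2

print(hidden)
print(scores)
```

The two inputs labeled 1 collapse onto the same hidden point, which is what makes a single linear boundary possible.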

# Multilayer perceptron

In [9]:
%matplotlib inline
# Display the value of every top-level expression in a cell, not only the last one
get_ipython().ast_node_interactivity = 'all'
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

seed = 23

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)

<torch._C.Generator at 0x1ee5c552290>

In [3]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(MultilayerPerceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the MLP

        Args:
            x_in (torch.Tensor): an input data tensor;
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation;
                should be False if used with the cross-entropy losses
        Returns:
            the resulting tensor; tensor.shape should be (batch, output_dim)
        """
        intermediate = F.relu(self.fc1(x_in))
        output = self.fc2(intermediate)
        if apply_softmax:
            output = F.softmax(output, dim=1)
        return output

In [4]:
batch_size = 2 # number of samples input at once
input_dim = 3
hidden_dim = 100
output_dim = 4
# Initialize model
mlp = MultilayerPerceptron(input_dim, hidden_dim, output_dim)
print(mlp)

MultilayerPerceptron(
  (fc1): Linear(in_features=3, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)


## Testing the MLP with random inputs

Without any training, the outputs depend only on the random weight initialization.

In [16]:
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))
x_input = torch.rand(batch_size, input_dim)
describe(x_input)

y_output = mlp(x_input, apply_softmax=False)
describe(y_output)

y_output = mlp(x_input, apply_softmax=True)
describe(y_output)

# Each row of the softmax output should sum to 1 (up to floating-point error)
torch.allclose(y_output.sum(dim=1), torch.ones(batch_size))


Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.2933, 0.9205, 0.5876],
        [0.1299, 0.6729, 0.1028]])
Type: torch.FloatTensor
Shape/size: torch.Size([2, 4])
Values: 
tensor([[-0.0724,  0.2681,  0.0131, -0.2113],
        [-0.0952,  0.1580,  0.0931, -0.1087]], grad_fn=<AddmmBackward0>)
Type: torch.FloatTensor
Shape/size: torch.Size([2, 4])
Values: 
tensor([[0.2291, 0.3220, 0.2495, 0.1994],
        [0.2231, 0.2874, 0.2693, 0.2201]], grad_fn=<SoftmaxBackward0>)


True
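
The `apply_softmax` flag is left off during training because PyTorch's `nn.CrossEntropyLoss` applies log-softmax internally and therefore expects raw scores (logits). The sketch below uses made-up logits and targets (assumptions for illustration, standing in for the output of `mlp(x_input)`) to show the equivalence, and what goes wrong if probabilities are passed instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits standing in for mlp(x_input, apply_softmax=False)
logits = torch.tensor([[2.0, -1.0, 0.5, 0.0],
                       [0.1, 0.2, 3.0, -2.0]])
targets = torch.tensor([0, 2])   # hypothetical class indices

loss_fn = nn.CrossEntropyLoss()

# Intended use: pass raw logits.
loss_from_logits = loss_fn(logits, targets)

# The same computation spelled out: log-softmax, then negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Passing probabilities applies softmax twice and yields a different loss.
loss_double_softmax = loss_fn(F.softmax(logits, dim=1), targets)

print(loss_from_logits.item(), loss_manual.item(), loss_double_softmax.item())
```

The first two values agree, while the third is noticeably larger because the second softmax flattens the distribution.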

## Example: Surname Classification with an MLP

## The Surnames Dataset

The surnames dataset is a collection of 10,000 surnames from 18 different nationalities.

The Vocabulary and Vectorizer are the same data structures used in “Example: Classifying Sentiment of Restaurant Reviews”, exemplifying a **polymorphism** that treats the character tokens of surnames in the same way as the word tokens of Yelp reviews. Instead of vectorizing by mapping word **tokens** to **integers**, the data is vectorized by mapping **characters** to **integers**.

-----------------

The Vocabulary coordinates **two** Python **dictionaries** that form a bijection between tokens (characters, in this example) and integers: the first dictionary maps characters to integer indices, and the second maps the integer indices back to characters.
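
The two-dictionary bijection can be sketched as follows. This is a simplified illustration, not the class used in the example: only `lookup_token` appears in the notes, while `add_token`, `lookup_index`, and the attribute names are assumptions, and unknown-token handling is left out.

```python
class Vocabulary:
    """A minimal sketch of a token <-> index bijection (illustrative only)."""

    def __init__(self):
        self._token_to_idx = {}   # e.g. 'a' -> 1
        self._idx_to_token = {}   # e.g. 1 -> 'a'

    def add_token(self, token):
        """Return the index for token, assigning the next free index if new."""
        if token not in self._token_to_idx:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return self._token_to_idx[token]

    def lookup_token(self, token):
        """Map a token to its integer index."""
        return self._token_to_idx[token]

    def lookup_index(self, index):
        """Map an integer index back to its token."""
        return self._idx_to_token[index]

    def __len__(self):
        return len(self._token_to_idx)


vocab = Vocabulary()
for character in "Nakamoto":
    vocab.add_token(character)

print(len(vocab))                                    # number of distinct characters
print(vocab.lookup_index(vocab.lookup_token("k")))   # round trip returns the token
```

Keeping both dictionaries in sync inside `add_token` is what guarantees that every lookup can be inverted.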

------------------

We use a one-hot representation and neither count character frequencies nor restrict the vocabulary to frequent items. This is mainly because the dataset is **small** and most characters are frequent enough.

In [None]:
from torch.utils.data import Dataset

class SurnameDataset(Dataset):
    # Implementation is nearly identical to Example 3-14
    def __getitem__(self, index):
        # Fetch the row, vectorize the surname, and look up the nationality index
        row = self._target_df.iloc[index]
        surname_vector = self._vectorizer.vectorize(row.surname)
        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)
        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}
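
Returning a dictionary from `__getitem__` works well with PyTorch's `DataLoader`, whose default collation batches each key separately. The sketch below uses a hypothetical `ToyDictDataset` with random data (an assumption for illustration, not the real `SurnameDataset`) that mimics the same dictionary layout.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDictDataset(Dataset):
    """Hypothetical stand-in returning the same dictionary layout as above."""

    def __init__(self, num_items=6, vector_size=5):
        rng = np.random.default_rng(0)
        # Random 0/1 vectors play the role of vectorized surnames.
        self._vectors = rng.integers(
            0, 2, size=(num_items, vector_size)).astype(np.float32)
        # Cycling integers play the role of nationality indices.
        self._labels = [i % 3 for i in range(num_items)]

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, index):
        return {'x_surname': self._vectors[index],
                'y_nationality': self._labels[index]}


loader = DataLoader(ToyDictDataset(), batch_size=2, shuffle=False)
batch = next(iter(loader))

print(batch['x_surname'].shape)   # one stacked tensor per key
print(batch['y_nationality'])
```

Each batch is itself a dictionary, so the training loop can refer to inputs and targets by name.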

## The SurnameVectorizer

Surnames are **sequences** of characters, and each character is an individual token in our Vocabulary. However, until “Convolutional Neural Networks”, we will ignore the sequence information and create a collapsed **one-hot** vector representation of the input by iterating over each character in the string input.
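
A minimal sketch of the collapsed one-hot idea follows. The standalone `vectorize` function and the lowercase-only `char_to_idx` mapping are assumptions for illustration; the actual vectorizer works through a Vocabulary object.

```python
import numpy as np


def vectorize(surname, char_to_idx):
    """Collapse a surname into one vector with a 1 for every character present."""
    one_hot = np.zeros(len(char_to_idx), dtype=np.float32)
    for character in surname:
        one_hot[char_to_idx[character]] = 1
    return one_hot


# Hypothetical character vocabulary: lowercase ASCII letters only.
char_to_idx = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

vec_a = vectorize("lee", char_to_idx)
vec_b = vectorize("eel", char_to_idx)

print(vec_a)
print(np.array_equal(vec_a, vec_b))
```

Because character order and repetition are discarded, "lee" and "eel" produce identical vectors, which is exactly the sequence information this representation gives up.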