# PyTorch Programming

## Tensors
Tensors are the key of Deep Learning programming, they are the generalization of matrices, that can have more than two dimensions.

In [1]:
##Libraries
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Tensors can be created from Python lists whith the function `torch.tensor()`

In [8]:
##Create a tensor from a list 
vec = [1, 2, 3, 4]
tens = torch.tensor(vec)
print(tens)
##indexing...
print(tens[0])
print(tens[0].item()) ##to get a scalar

tensor([ 1,  2,  3,  4])
tensor(1)
1


In [12]:
##Create a tensor of dimension 2x3
mat = [[1, 2, 3], [4, 5, 6]]
print(mat)
t_mat = torch.tensor(mat)
print(t_mat)
print(t_mat[0])##to get a tensor vector

[[1, 2, 3], [4, 5, 6]]
tensor([[ 1,  2,  3],
        [ 4,  5,  6]])
tensor([ 1,  2,  3])


In [13]:
##Create a 3D tensor of dimensions 2x2x2
triD = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
ten_triD = torch.tensor(triD)
print(ten_triD)
print(ten_triD[0])##to get a tensor matrix

tensor([[[ 1,  2],
         [ 3,  4]],

        [[ 5,  6],
         [ 7,  8]]])
tensor([[ 1,  2],
        [ 3,  4]])


Main customable object types in tensors are the _long_ format (i.e. integers) and the _float_. 

In [15]:
##Create a random tensor of type float
ften = torch.randn(3,4, dtype=torch.float)
print(ften)

tensor([[-0.8212,  0.4561,  0.1326, -1.3092],
        [-0.7630, -0.0695,  0.8082,  0.4521],
        [ 0.8198, -0.4683, -2.2071, -0.4712]])


In [18]:
##Create a  3D tensor of type int
iten = torch.ones(2, 2, 2, dtype=torch.long)
print(iten)

tensor([[[ 1,  1],
         [ 1,  1]],

        [[ 1,  1],
         [ 1,  1]]])


In [24]:
##All kinds of operations can be made with tensors (e.g. sum...)
##Concatenate by row two tensors
t1 = torch.tensor([[1, 2, 3], [7, 8, 9]])
t2 = torch.tensor([[4, 5, 6], [10, 11, 12]])
t = torch.cat([t1, t2])
print(t)

##Concatenate by column two tensors
t3 = torch.tensor([[1, 2, 3], [4, 5, 6]])
t4 = torch.tensor([[7, 8, 9], [10, 11, 12]])
tt = torch.cat([t3, t4], 1)
print(tt)

tensor([[  1,   2,   3],
        [  7,   8,   9],
        [  4,   5,   6],
        [ 10,  11,  12]])
tensor([[  1,   2,   3,   7,   8,   9],
        [  4,   5,   6,  10,  11,  12]])


In [29]:
##Start with a tensor of dimension 2x3x4 and reshape it to a tensor 2x12
ten234 = torch.randn(2, 3, 4)
ten212 = ten234.view(2, 12)
print(ten212)
print(ten234)

tensor([[ 1.3596,  0.7083,  3.0885, -0.6278, -0.4005, -0.3610, -0.5551,
         -0.4536,  0.0187, -1.4297,  0.0323, -0.8992],
        [-3.0237, -0.9567, -0.4114,  0.3894,  0.2449,  0.7987,  0.4092,
         -0.8265, -0.8696,  0.5213, -1.7497,  0.1060]])
tensor([[[ 1.3596,  0.7083,  3.0885, -0.6278],
         [-0.4005, -0.3610, -0.5551, -0.4536],
         [ 0.0187, -1.4297,  0.0323, -0.8992]],

        [[-3.0237, -0.9567, -0.4114,  0.3894],
         [ 0.2449,  0.7987,  0.4092, -0.8265],
         [-0.8696,  0.5213, -1.7497,  0.1060]]])


## Computation Graphs and Automatic Differentiation
Computation graph is an essential concept for the computation of the backpropagation gradient. 
When we add two tensors together, what we obtain is an output tensor and the unique information it carries are the data and the dimension. This means that the way in which the output tensor was created (i.e. result of a sum) is lost. If we use the flag `requires_grad=True`, then this information is carried by the output `torch.tensor()` object.

In [33]:
##Sum two tensors and keep track of the operation
x = torch.tensor([1, 2, 3, 4], dtype=torch.float, requires_grad=True)
y = torch.tensor([5, 6, 7, 8], dtype=torch.float, requires_grad=True)

z = x + y
print(z.grad_fn)

s = z.sum()
print(s.grad_fn) ##once requires_grad is flagged, all the operations are recorded

<AddBackward1 object at 0x7fdb97234a58>
<SumBackward0 object at 0x7fdb97234b70>


In [34]:
##Now, if we want to compute the partial derivative of s w.r.t x_0,...,x_4
s.backward()
print(x.grad)

tensor([ 1.,  1.,  1.,  1.])


In [49]:
##Change a requires_grad flag in-place
x = torch.randn(2, 2)
y = torch.randn(2, 2)
print(x.requires_grad, y.requires_grad) ##as we can observe we haven't keep track of the history
z = x + y
print(z.grad_fn)

x = x.requires_grad_()
y = y.requires_grad_()
print(x.requires_grad, y.requires_grad) ##we have changed the flag in-place
z = x + y
print(z.grad_fn)
print(z.requires_grad)

z = z.detach() ##we can erase the history
print(z.grad_fn)

##if we want to stop autograd to record the history of a tensor with flag equals True
x = torch.randn(2, 2, requires_grad=True)
y = torch.randn(2, 2, requires_grad=True)
z = x + y

with torch.no_grad():
    print(z.requires_grad) ##this is true, we need to act directly on x and y
    print((x ** 2).grad_fn)
    print((y ** 2).requires_grad)
    print((x + y).requires_grad)

False False
None
True True
<AddBackward1 object at 0x7fdb972a7f28>
True
None
True
None
False
False


## Deep Learning

Core of DL is the _affine_ transformation $f(x) = Ax + b$, where the matrix $A$ and vector $b$ are to be learned.

In deep learning the input are the rows of $X$ (n_samples, n_features).

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1) ##set seed

<torch._C.Generator at 0x7fd80c074510>

In [52]:
lin = nn.Linear(5, 3) ##linear map from R5 to R3
x = torch.randn(2, 5) ##the input should have 5 columns since we give the rows as input to the linear function
print(lin(x))
print(lin)

tensor([[-0.2889,  0.3574,  0.6554],
        [-0.9682,  0.0289,  0.4426]])
Linear(in_features=5, out_features=3, bias=True)


Since composition of affine transformation is an affine transformation, what we need are non-linear transformations in between the affine layers. The most common non-linear tansformations are: $\tanh(x)$, $\sigma(x)$, $ReLU(x)=\max(0, x)$. These functions have easy-to-compute gradients, this is why are largely used. $\sigma(x)$ is not used in practice, since its gradient tends to vanish very quickly as the absolute value of the argument grows. Models defaults to $\tanh(x)$ and $ReLU(x)$.

In Pytorch most non-linear functions are in `torch.functional`, which in this case was importe as `F`.

In [54]:
data = torch.randn(2, 2)
print(F.relu(data)) ##ReLU(x) = max(0, x) [rectified linear unit]

tensor([[ 0.0000,  0.2366],
        [ 0.2857,  0.6898]])


$Softmax(x)$ function is a non-linear function and usually is the last operation done in a network. It takes in a vector and returns a probability (used for classification tasks).

The i-th component of the function is given by: $(Softmax(x))_i=\frac{\exp(x_i)}{\sum_j\exp(x_j)}$

In [2]:
##Let us use Softmax
data = torch.randn(4)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())
print(F.log_softmax(data, dim=0))

tensor([ 0.3141,  0.2117,  0.1724,  0.3018])
tensor(1.)
tensor([-1.1581, -1.5525, -1.7578, -1.1981])


Since we want to minimize the _loss function_ we need to compute its partial derivatives, with PyTorch, since we can keep track of the objects used to compute it (it is a Tensor), it is straightforward. 

We can then perform standard gradient updates: $\theta^{(t+1)}=\theta^{(t)}-\eta\nabla_{\theta}L(\theta)$, with $\theta$ our parameters, $L(\theta)$ the loss function and $\eta$ a positive learning rate. This is called the _vanilla gradient update_. Other optimization algorithms (such as Adam...) are available in `torch.optim` package.

## Building a Network
All network components should inherit from `nn.Module`, this provides functionality to the components. For example, it makes it keep track of the trainable parameters and swapping between CPU and GPU with `.to(device)`, where the device can be a CPU device (`torch.device('cpu')`) or a CUDA device (`torch.device('cuda')`).

### Example: network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels ("English" and "Spanish") 

In [3]:
##logistic regression Bag-of-Words classifier (occurrences vector to represent sentences = BoW vector)
##if we denote the BOW vector as x, the output of the network will be: log(Softmax(Ax+b))
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

print(data)

[(['me', 'gusta', 'comer', 'en', 'la', 'cafeteria'], 'SPANISH'), (['Give', 'it', 'to', 'me'], 'ENGLISH'), (['No', 'creo', 'que', 'sea', 'una', 'buena', 'idea'], 'SPANISH'), (['No', 'it', 'is', 'not', 'a', 'good', 'idea', 'to', 'get', 'lost', 'at', 'sea'], 'ENGLISH')]


In [4]:
##now we need to map each word to a unique integer, which will be its index in the BOWs vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)


{'en': 3, 'cafeteria': 5, 'creo': 10, 'si': 24, 'una': 13, 'me': 0, 'idea': 15, 'que': 11, 'comer': 2, 'Yo': 23, 'is': 16, 'on': 25, 'get': 20, 'sea': 12, 'a': 18, 'lost': 21, 'to': 8, 'good': 19, 'buena': 14, 'at': 22, 'it': 7, 'Give': 6, 'not': 17, 'la': 4, 'gusta': 1, 'No': 9}


In [5]:
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

In [6]:
class BoWClassifier(nn.Module): #inheriting from nn.Module
    
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()
        
        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        self.linear = nn.Linear(vocab_size, num_labels)
        
    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)
    
def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1,-1)

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters

In [7]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.0273, -0.0240,  0.0544,  0.0097,  0.0716, -0.0764, -0.0143,
         -0.0177,  0.0284, -0.0008,  0.1714,  0.0610, -0.0730, -0.1184,
         -0.0329, -0.0846, -0.0628,  0.0094,  0.1169,  0.1066, -0.1917,
          0.1216,  0.0548,  0.1860,  0.1294, -0.1787],
        [-0.1865, -0.0946,  0.1722, -0.0327,  0.0839, -0.0911,  0.1924,
         -0.0830,  0.1471,  0.0023, -0.1033,  0.1008, -0.1041,  0.0577,
         -0.0566, -0.0215, -0.1885, -0.0935,  0.1064, -0.0477,  0.1953,
          0.1572, -0.0092, -0.1309,  0.1194,  0.0609]])
Parameter containing:
tensor([-0.1268,  0.1274])


In [9]:
with torch.no_grad(): ##no need to train
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(bow_vector)
    print(log_probs)

tensor([[ 1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.]])
tensor([[-0.7148, -0.6719]])


In [12]:
label_to_ix = {"SPANISH":0, "ENGLISH":1}

In [13]:
##Let's train the model: we pass instances through to get log probabilities, compute a loss function, 
##compute the gradient of the loss function and then update the parameters with a gradient step. 
##As a loss function for our logistic regression we use nn.NLLLoss(), which is the negative log likelihood loss.
##NLLLoss wants as input a vector of log probabilities and a target label, this is why we need log_softmax

# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

tensor([[-0.5511, -0.8588]])
tensor([[-0.7574, -0.6328]])


In [14]:
# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([ 0.1714, -0.1033])


In [15]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [17]:
# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        
        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)
        
        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()
        
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)
        
# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.0944, -2.4075]])
tensor([[-2.6246, -0.0752]])
tensor([ 0.6117, -0.5436])


## Word embeddings
Word embeddings are dense vectors of real numbers, one per word in the vocabulary. These vectors are built to preserve the semantic relation between words (i.e. _semantic similarity_), based on the linguistic assumption that words appearing in similar contexts are related to each other, __distributional hypothesis__. Hence, two vectors are close to each other, in terms of a distance, if the word they represent are semantically similar, i.e. appear in the same context. 

To encode semantic similarity in words we may think up some __semantic attributes__ and give scores to these attributes, e.g.

$$
q_{mathematician} = [can.run=2.3, likes.coffee=9.4, majored.in.Physics=-5.5,...]\\
q_{physicist} = [can.run=2.5, likes.coffee=9.1, majored.in.Physics=6.4,...]
$$

Then, as a measure of similarity:

$$
Sim(physicist, mathematician) = \frac{q_{physicist} \cdot q_{mathematician}}{||q_{physicist}||\ ||q_{mathematician}||} = \cos(\varphi)
$$

Where, $\varphi$ is the angle between the two vectors. In this way, extremely similar words (i.e. whose embedding points in the same direction) will have similarity $\sim 1$, whereas extremely dissimilar words will have similarity $\sim -1$.

Central to the idea of deep learning is that the neural network learns representation of the features, rather then requiring the programmer to design them. Hence, we will have some _latent semantic attributes_ that the network can learn. 

__Remark__: the output of the model won't be interpretable, if we obtain two embeddings, one for _mathematician_ and the other for _physicist_ with a similar value, let's say, in the second dimension, we are not able to understand what that exactly means.

### Word embedding in PyTorch
As for the one-hot vectors, we need to define and index for each word, when using embeddings. Embeddings are stored as a $|V|\times D$ matrix, where $D$ is the dimensionality of the embedding and $|V|$ is the dimension of the vocabulary. 

What can be done is mapping words to indices and store the 1-to-1 correspondance in a dictionary.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f4d880e4510>

In [3]:
word_to_ix = {"Hello":0, "Word":1} 
##we now want to obtain the embedding of these two words
embeds = nn.Embedding(2, 5) ##we need to specify the vocabulary size, dimensionality of the embedding
lookup_tensor = torch.tensor([word_to_ix["Hello"]], dtype=torch.long) ##we are talking of indices, hence we need Int
hello_embed = embeds(lookup_tensor)
print(lookup_tensor)
print(hello_embed)

tensor([ 0])
tensor([[-0.8923, -0.0583, -0.1955, -0.9656,  0.4224]])


#### N-Gram Language Modeling
Given a sequence of words $w$ we want to compute $\mathbb{P}(w_i | w_{i-1}, ..., w_{i-n+1})$, where $w_i$ is the $i$-th word of the sequence.

In [4]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

In [6]:
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i+1]], test_sentence[i+2]) for i in range(len(test_sentence)-2)]
print(trigrams[:3])

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [9]:
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

In [11]:
class NGramLanguageModeler(nn.Module): ##in this case the heredited class is nn.Module
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()
        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)
        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()
        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    
    losses.append(total_loss)
print(losses) # The loss decreased every iteration over the training data!

[tensor([ 521.5042]), tensor([ 519.1281]), tensor([ 516.7686]), tensor([ 514.4236]), tensor([ 512.0934]), tensor([ 509.7769]), tensor([ 507.4751]), tensor([ 505.1850]), tensor([ 502.9061]), tensor([ 500.6381])]


#### Continuos Bag-of-Words model
Tipically, CBOW is used to quickly train word embeddings and use these embeddings to initialize the embeddings of more complicated models. This process is called _pretraining embeddings_. CBOW tries to predict words given the context of a few word before and a few words after the target word. 

Given a target word $w_i$ and an $N$ context window on each side of the word, i.e. $C=\{w_{i-1}, ..., w_{i-N}, ..., w_{i+1}, ..., w_{i+N}\}$, the model tries to minimize:
$$
-\log(p(w_i|C))=-\log(Softmax(A(\sum_{w\in C}q_w)+b))
$$
where $q_w$ is the embedding for the word $w$.

In [18]:
import torch
import torch.nn as nn
import numpy as np

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context] ##word_to_ix is the dictionary with words in text and numbers
    tensor = torch.LongTensor(idxs)
    return tensor

def get_index_max(input):
    index = 0
    for i in range(1, len(input)):
        if input[i] > input[index]:
            index = i
        return index
    
def get_max_prob_result(input, ix_to_word):
    return ix_to_word[get_index_max(input)]

class CBOW(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation2 = nn.LogSoftmax(dim=-1)
        
    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view((1, -1))
        out = self.linear1(embeds)
        out = self.activation1(out)
        out = self.linear2(out)
        out = self.activation2(out)
        return out
    
    def get_word_embedding(self, word):
        word = torch.LongTensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)

In [13]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMDEDDING_DIM = 100

word_to_ix = {}
ix_to_word = {}

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()


# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

for i, word in enumerate(vocab):
    word_to_ix[word] = i
    ix_to_word[i] = word

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))

In [14]:
print(word_to_ix)

{'with': 0, 'are': 1, 'is': 2, 'a': 4, 'pattern': 5, 'rules': 6, 'program.': 3, 'evolve,': 15, 'We': 7, 'inhabit': 8, 'create': 9, 'computational': 10, 'of': 11, 'directed': 12, 'spirits': 13, 'other': 14, 'computer': 17, 'the': 16, 'they': 18, 'study': 19, 'processes.': 20, 'programs': 21, 'The': 22, 'Computational': 23, 'things': 38, 'evolution': 25, 'direct': 26, 'that': 27, 'spells.': 28, 'process.': 29, 'idea': 30, 'our': 31, 'conjure': 32, 'we': 33, 'by': 34, 'beings': 35, 'effect,': 36, 'computers.': 37, 'data.': 40, 'called': 39, 'about': 41, 'to': 42, 'abstract': 43, 'As': 44, 'manipulate': 45, 'People': 46, 'process': 24, 'processes': 47, 'In': 48}


In [15]:
print(ix_to_word)

{0: 'with', 1: 'are', 2: 'is', 3: 'program.', 4: 'a', 5: 'pattern', 6: 'rules', 7: 'We', 8: 'inhabit', 9: 'create', 10: 'computational', 11: 'of', 12: 'directed', 13: 'spirits', 14: 'other', 15: 'evolve,', 16: 'the', 17: 'computer', 18: 'they', 19: 'study', 20: 'processes.', 21: 'programs', 22: 'The', 23: 'Computational', 24: 'process', 25: 'evolution', 26: 'direct', 27: 'that', 28: 'spells.', 29: 'process.', 30: 'idea', 31: 'our', 32: 'conjure', 33: 'we', 34: 'by', 35: 'beings', 36: 'effect,', 37: 'computers.', 38: 'things', 39: 'called', 40: 'data.', 41: 'about', 42: 'to', 43: 'abstract', 44: 'As', 45: 'manipulate', 46: 'People', 47: 'processes', 48: 'In'}


In [16]:
print(data)

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea'), (['the', 'idea', 'a', 'computational'], 'of'), (['idea', 'of', 'computational', 'process.'], 'a'), (['of', 'a', 'process.', 'Computational'], 'computational'), (['a', 'computational', 'Computational', 'processes'], 'process.'), (['computational', 'process.', 'processes', 'are'], 'Computational'), (['process.', 'Computational', 'are', 'abstract'], 'processes'), (['Computational', 'processes', 'abstract', 'beings'], 'are'), (['processes', 'are', 'beings', 'that'], 'abstract'), (['are', 'abstract', 'that', 'inhabit'], 'beings'), (['abstract', 'beings', 'inhabit', 'computers.'], 'that'), (['beings', 'that', 'computers.', 'As'], 'inhabit'), (['that', 'inhabit', 'As', 'they'], 'computers.'), (['inhabit', 'computers.', 'they', 'evolve,'], 'As'), (['computers.', 'As', 'evolve,', 'processes']

In [19]:
model = CBOW(vocab_size, EMBEDDING_DIM)

loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [20]:
for epoc in range(50):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        model.zero_grad()
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.LongTensor([word_to_ix[target]]))
        loss.backward()
        optimizer.step()
        
        total_loss += loss.data

In [34]:
##TEST
context = ["we", "are", "to", "study"]
context_vector = make_context_vector(context, word_to_ix)
a = model(context_vector).data.numpy()
print('Raw text: {}\n'.format(' '.join(raw_text)))
print('Context: {}\n'.format(context))
print('Prediction: {}'.format(get_max_prob_result(a[0], ix_to_word)))

Raw text: We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

Context: ['we', 'are', 'to', 'study']

Prediction: with


#### LSTM
Sequence models are models where there is some sort of dependence through time between inputs. RNNs are networks that maintain some kind of state. In the LSTM, for each element in the sequence, there is a corresponding _hidden state_ $h_t$, that can contain information from arbitrary points earlier in the sequence. The hidden states can be used to predict words in a language model and other things...

PyTorch LSTM expects its inputs to be 3D tensors: (sequence, indexes instances in the mini-batch, indexes elements of the input).

In [35]:
##example
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f4d880e4510>

In [36]:
lstm = nn.LSTM(3, 3) ##input and output dim = 3
inputs = [torch.randn(1, 3) for _ in range(5)] ##sequence of length 5

##initialize the hidden state
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3)) ##clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]])
(tensor([[[-0.3368,  0.0959, -0.0538]]]), tensor([[[-0.9825,  0.4715, -0.0633]]]))
