# GPT Model Inference

Welcome! This notebook is a tutorial on how to use the model you've just trained on the Bittensor network.

In [1]:
import os
import torch
import bittensor
from nuclei.gpt2 import GPT2Nucleus
from torch.nn import functional as F

## Load the trained model
By default, miners store the model at `~/.bittensor/miners/<wallet-coldkey>-<wallet-hotkey>/gpt2_exodus/model.torch`. Note that the loss stored with the model is the `combined` loss: the local loss + remote network loss + distillation loss. Ideally, as the Bittensor network grows and more sophisticated models join, this combined loss will approach 0.

However, for now, a high combined loss does not necessarily mean the model will perform badly: a model may have a low local loss but a high remote loss. This happens when the local model is powerful and training correctly, but the models it talks to on the network are weak, for example when it is talking to N models that are all the same. Since this project is still in its early days, this may happen initially. As the network grows there will be more and more sophisticated models.
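To make the arithmetic concrete, here is a minimal sketch of how a low local loss can hide inside a high combined loss. The three numbers are hypothetical and chosen only for illustration; they do not come from a real training run.

```python
# Hypothetical loss values, chosen only to illustrate the arithmetic.
local_loss = 0.8         # local term: the model on its own
remote_loss = 6.5        # remote term: depends on the peers on the network
distillation_loss = 2.1  # distillation term

# The stored value is the sum of the three terms.
combined_loss = local_loss + remote_loss + distillation_loss
local_share = local_loss / combined_loss

print(f"combined loss: {combined_loss:.1f}")
print(f"local share of the combined loss: {local_share:.0%}")
```

In this made-up case the local term is only a small fraction of the total, so the combined number alone says little about how well the local model predicts text.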

In [2]:
model_path = os.path.expanduser('~/.bittensor/miners/default-default/gpt2_exodus/')

# Pick the available device, in case we're not loading the model on the machine it was trained on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(os.path.join(model_path, "model.torch"), map_location=device)

# Let's load up a Bittensor config
config = GPT2Nucleus.config()

# Let's load up the same nucleus config we trained our model with
config.nucleus.n_head = 32
config.nucleus.n_layer = 12
config.nucleus.block_size = 20
config.nucleus.device = device

# Load up the model
model = GPT2Nucleus(config)
model.load_state_dict(checkpoint['nucleus_state'])
print("Combined loss (local, remote, and distilled) of preloaded model: {}:".format(checkpoint['epoch_loss']))
# Load up the huggingface tokenizer
tokenizer = bittensor.tokenizer()

Combined loss (local, remote, and distilled) of preloaded model: inf:


## Inference function
In essence, the output of the current GPT model lives in the vocabulary of the HuggingFace tokenizer that Bittensor uses: each prediction is a token ID. To turn it into text, we simply decode those IDs with the same tokenizer.
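As a rough picture of that encode/decode round trip, here is a toy sketch. The five-entry vocabulary and the `encode`/`decode` helpers are hypothetical stand-ins, not the Bittensor tokenizer, which works on a far larger vocabulary.

```python
# Hypothetical five-entry vocabulary, for illustration only.
vocab = ["John", " is", " believed", " to", " sleep"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def encode(tokens):
    """Map a list of token strings to their integer IDs."""
    return [token_to_id[token] for token in tokens]

def decode(ids):
    """Map a list of integer IDs back to one string."""
    return "".join(vocab[i] for i in ids)

prompt_ids = encode(["John", " is"])
# Pretend the model appended three more token IDs to the prompt.
generated_ids = prompt_ids + [2, 3, 4]
text = decode(generated_ids)
print(text)
```

The model only ever sees and produces the integer IDs; the text appears once the IDs go back through the same mapping that produced them.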

In [3]:
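def top_k_logits(logits, k):
    """Keep the k largest logits in each row and set all others to -inf."""
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    # v[:, [-1]] is the k-th largest value per row; anything below it is masked out
    out[out < v[:, [-1]]] = -float('Inf')
    return out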

@torch.no_grad()
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    """
    Take a conditioning sequence of indices in x (of shape (b, t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time. Sampling has quadratic
    complexity in the sequence length, unlike an RNN, which is linear, and a finite context
    window of block_size, unlike an RNN, whose context window is unbounded.
    """
    block_size = model.get_block_size()-1
    model.eval()
    for _ in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed

        # Run a local forward call through the model
        output = model.local_forward(x_cond, training=False)

        # The local forward pass returns the local hidden layer, which still needs to be pushed
        # through the target layer. This projects it to bittensor.__vocab_size__ dimensions,
        # giving one logit per token that the tokenizer's decode function can turn into words.
        logits = model.target_layer(output.local_hidden)
        
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x
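
To see what the temperature and top-k steps do to the numbers, here is a small pure-Python sketch of the same two operations on a single row of logits. The helper names `softmax` and `top_k_filter` and the example logits are made up for illustration; the notebook itself uses the torch versions above.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalize to probabilities."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [value if value >= threshold else -math.inf for value in logits]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered, temperature=1.0)
sharper = softmax(filtered, temperature=0.5)

print("filtered logits:", filtered)
print("temperature 1.0:", [round(p, 3) for p in probs])
print("temperature 0.5:", [round(p, 3) for p in sharper])
```

Masked entries end up with probability zero, so they can never be drawn, and a temperature below 1 shifts more probability onto the top choice.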

## Sampling from the trained model

Now that we've built our `sample` function, let's actually use it! We start the sentence by putting a name in the `context` variable, and we let the model do the rest. Note that we can ask the model to predict as many tokens as we want; `num_words_predict` counts tokens, which do not always correspond to whole words. Here we use 10, as that produces legible sentences. The lower your model's loss, the better the predictions you'll get.

Bring it as close to 0 as you can by changing the `nucleus` parameters to adjust the model's architecture (number of heads, number of layers, etc.), or change the training settings through the `miner` settings (learning rate, weight decay rate, etc.).

In [8]:
context = "John"

# Tokenize the input
x = tokenizer(context, padding=True, truncation=True)['input_ids']
# Turn it into a tensor
x = torch.tensor(x, dtype=torch.long)
# Give it an extra dimension for the network's sake (expects a 2D tensor input)
x = x.unsqueeze(0)

num_words_predict = 10

# Let's sample the network for some output
y = sample(model, x, num_words_predict, temperature=1.0, sample=True, top_k=10)

# Decode the output
completion = ''.join([tokenizer.decode(i, skip_special_tokens=True) for i in y])

# Print what the model has predicted
print(completion)

John is believed to have kissed Mary.John believes Bill
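
The call above passes `sample=True`, so each new token is drawn at random from the filtered distribution and repeated runs can give different completions. The sketch below contrasts that with the greedy choice made when `sample=False`. It is pure Python, and the probabilities and helper names are hypothetical.

```python
import random

def pick_greedy(probs):
    """Return the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
    """Draw one index at random, weighted by the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.6, 0.3, 0.1, 0.0]  # made-up next-token distribution
rng = random.Random(0)        # fixed seed so the run is repeatable

greedy_picks = [pick_greedy(probs) for _ in range(5)]
sampled_picks = [pick_sampled(probs, rng) for _ in range(5)]

print("greedy: ", greedy_picks)
print("sampled:", sampled_picks)
```

Greedy decoding returns the same index every time, while sampling can pick any token with non-zero probability.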
