In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


### Create function for generating text given a prompt and output text length

In [9]:
def generate_text(prompt_text, max_new_tokens=50):
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

    # GPT-2 tokenizers do not have a pad token by default. Set it to eos_token.
    tokenizer.pad_token = tokenizer.eos_token

    # Encode the input text to get token IDs
    encoded_input = tokenizer(prompt_text, return_tensors='pt')

    input_ids = encoded_input['input_ids']
    attention_mask = encoded_input['attention_mask']

    # Load pre-trained model (weights)
    model = GPT2LMHeadModel.from_pretrained('distilgpt2')

    # Generate a sequence of new tokens after the prompt
    output_list = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,  # Set the number of new tokens to generate
        num_return_sequences=1,
        no_repeat_ngram_size=2,
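        # Note: top_k and top_p only take effect when do_sample=True is passed;
        # with the default do_sample=False, generate() decodes greedily and ignores them.
        top_k=50,
        top_p=0.95,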
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the output token IDs to a string
    generated_sequence = output_list[0]
    text = tokenizer.decode(generated_sequence, skip_special_tokens=True)

    return text

### Try out some very basic text generation given a prompt

In [10]:
# Usage
prompt = "Statistical genetics is a subfield of human genetics"
generated_text = generate_text(prompt, max_new_tokens=200)
print(generated_text)

Statistical genetics is a subfield of human genetics. It is the basis of the theory that genetics can be used to predict the genetic makeup of humans.

The genetic analysis of genes is based on the assumption that the genes are related to the environment. The genetic data are based upon the hypothesis that genes can have a role in the development of a particular type of genetic trait. This hypothesis is supported by the fact that genetic variation in genes may be related with the presence of other genetic factors. In addition, the data on genes in humans are not based solely on genetic information. Genetic variation is not a factor in human genetic development. However, genetic variations in genetic traits are also related. For example, in a human population, a genetic mutation can cause a mutation that causes a change in gene expression. Therefore, it is possible that a gene mutation may cause an increase in expression of certain genes. Thus, there is no evidence that an increased exp

### Exploring the tokenizer

In [11]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

sample_text = "Tokenization is essential for natural language processing."

# Tokenize text into tokens
tokens = tokenizer.tokenize(sample_text)
print(f'Tokens: {tokens}')

# Convert tokens to their respective IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f'Token IDs: {token_ids}')

# Convert IDs back to tokens
back_to_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f'Back to Tokens: {back_to_tokens}')

Tokens: ['Token', 'ization', 'Ġis', 'Ġessential', 'Ġfor', 'Ġnatural', 'Ġlanguage', 'Ġprocessing', '.']
Token IDs: [30642, 1634, 318, 6393, 329, 3288, 3303, 7587, 13]
Back to Tokens: ['Token', 'ization', 'Ġis', 'Ġessential', 'Ġfor', 'Ġnatural', 'Ġlanguage', 'Ġprocessing', '.']


In [12]:
# Encode text straight to IDs (tokenize and convert in one step).
# GPT-2's tokenizer adds no special tokens by default, so the IDs match those above.
encoded_input = tokenizer.encode(sample_text, add_special_tokens=True)
print(f'Encoded Input with special tokens (IDs): {encoded_input}')

# Decode the encoded IDs back to text
decoded_text = tokenizer.decode(encoded_input)
print(f'Decoded Text: {decoded_text}')



Encoded Input with special tokens (IDs): [30642, 1634, 318, 6393, 329, 3288, 3303, 7587, 13]
Decoded Text: Tokenization is essential for natural language processing.


In [14]:
# Demonstrate padding, truncation, and attention mask
# Padding ensures that sequences are the same length; the attention mask tells the model
# which positions hold real tokens (1) and which hold padding to ignore (0)

tokenizer.pad_token = tokenizer.eos_token

encoded_plus = tokenizer.encode_plus(
    sample_text,
    max_length=30,         # Pad or truncate to this length
    padding='max_length',  # Add padding
    truncation=True,       # Enable truncation to max_length
    return_tensors='pt',   # Return PyTorch tensors
    return_attention_mask=True
)
print(f'Encoded Plus with Padding and Attention Mask: {encoded_plus}')

# Let's see how special tokens work:
print(f"Special Tokens: {tokenizer.special_tokens_map}")

print(encoded_plus['attention_mask'])

Encoded Plus with Padding and Attention Mask: {'input_ids': tensor([[30642,  1634,   318,  6393,   329,  3288,  3303,  7587,    13, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]])}
Special Tokens: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]])
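
The relationship between the padded `input_ids` and the `attention_mask` above can be sketched in plain Python. This is only an illustration, not the tokenizer's actual implementation: the helper name `pad_with_mask` is hypothetical, and it assumes right-side padding and right-side truncation, which matches the output shown above.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Pad (or truncate) a list of token IDs and build the matching attention mask."""
    # Truncate from the right if the sequence is too long (assumption for this sketch)
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    # Real tokens get mask value 1, padding positions get 0
    padded = ids + [pad_id] * n_pad
    mask = [1] * len(ids) + [0] * n_pad
    return padded, mask


# Token IDs of the sample sentence, copied from the output above
sample_ids = [30642, 1634, 318, 6393, 329, 3288, 3303, 7587, 13]

# 50256 is the ID of <|endoftext|>, which is reused as the pad token here
padded_ids, mask = pad_with_mask(sample_ids, max_length=30, pad_id=50256)
print(padded_ids)
print(mask)
```

The nine real tokens are followed by 21 copies of the pad ID, and the mask has ones in exactly the first nine positions.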


### Understanding the model.generate() function

When using language models for text generation, two parameters that influence the diversity of the generated text are `top_k` and `top_p`. These parameters control the sampling process that the model uses to pick the next token, so in `model.generate()` they only take effect when sampling is enabled with `do_sample=True`.

#### top_k Sampling

- `top_k` sampling limits the model's choice for the next token to the `k` most likely tokens in the probability distribution produced by the model. The value of `top_k` can range from 1 to the size of the model's vocabulary.
- A `top_k` value of `1` is equivalent to greedy decoding: the model always chooses the most probable token, which makes the output deterministic and often repetitive.
- As `top_k` increases, the model's choices become more diverse, and the generated text becomes more varied and less predictable.
- However, a `top_k` that is too high lets low-probability tokens into the candidate pool, which can lead to nonsensical results.
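
The filtering step described in this list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code `model.generate()` runs internally: the helper `top_k_filter` and the five-token toy distribution are made up for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest probabilities
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


# Hypothetical next-token distribution over a five-token vocabulary
toy_probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]

top1 = top_k_filter(toy_probs, k=1)  # only the most probable token survives
top2 = top_k_filter(toy_probs, k=2)  # two candidates, renormalized to sum to 1
print(top1)
print(top2)
```

With `k=1` all probability mass sits on a single token, which is why that setting behaves like greedy decoding.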

#### top_p (Nucleus) Sampling

- `top_p` sampling, also known as nucleus sampling, takes a dynamic approach: the model samples from the smallest set of most probable tokens whose cumulative probability reaches the threshold `p`.
- The `top_p` value is a float between 0 and 1 that defines the probability mass to cover. For example, `top_p=0.9` means the model samples from the smallest set of tokens whose combined probability is at least 90%.
- Unlike `top_k`, `top_p` adapts to the shape of the next-token probability distribution: it keeps more tokens when the distribution is flat and fewer when there is a sharp drop-off in probability.
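
A rough sketch of nucleus filtering in plain Python shows how the candidate set grows and shrinks with the shape of the distribution. The helper `top_p_filter` and both toy distributions are hypothetical, and library implementations may treat the token that crosses the threshold slightly differently.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit tokens from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / cumulative
    return filtered


# Two made-up next-token distributions: one sharply peaked, one flat
peaked = [0.9375, 0.03125, 0.015625, 0.015625]
flat = [0.25, 0.25, 0.25, 0.25]

peaked_kept = sum(1 for x in top_p_filter(peaked, p=0.9) if x > 0)
flat_kept = sum(1 for x in top_p_filter(flat, p=0.9) if x > 0)
print(peaked_kept, flat_kept)
```

With the same `p`, the peaked distribution keeps a single token while the flat one keeps all four.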

#### Comparison and Interaction

- `top_k` and `top_p` can be used together, providing a way to manage the trade-off between diversity and relevance of the generated text.
- When both are used, the candidates are first restricted to the top-k tokens, and `top_p` filtering is then applied within that subset, keeping only the most probable tokens needed to reach the cumulative probability threshold.
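
The two-stage filtering described in this list, followed by a sampling step, might look roughly like this in plain Python. The function name `filter_top_k_then_top_p`, the toy distribution, and the fixed seed are all assumptions made for illustration; this is a sketch of the idea rather than the library's implementation.

```python
import random


def filter_top_k_then_top_p(probs, k, p):
    """Restrict to the top-k tokens, then apply nucleus filtering inside that subset."""
    # Stage 1: indices of the k most probable tokens
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    k_total = sum(probs[i] for i in order)
    # Stage 2: accumulate probabilities renormalized over the top-k subset
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= p:
            break
    # Renormalize the final candidates so they sum to 1
    kept_total = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_total for i in keep}


# Hypothetical next-token distribution over a five-token vocabulary
toy_probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
candidates = filter_top_k_then_top_p(toy_probs, k=3, p=0.8)

# Sample the next token ID from the filtered distribution (seeded for repeatability)
rng = random.Random(0)
next_token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
print(candidates, next_token)
```

Here `top_k=3` first removes the two least likely tokens, and `top_p=0.8` then trims the remaining three down to the two most probable before sampling.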

By tuning the `top_k` and `top_p` parameters, one can control the randomness and creativity of the generated text while still maintaining coherence and relevance to the given context.