# Setup

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch

This code imports key components needed for working with large language models using the Hugging Face transformers library.

The first line imports three specific tools: AutoTokenizer converts text into a format the model can process, AutoModelForCausalLM loads pre-trained language models that generate text one token at a time, and set_seed helps make results reproducible by seeding the random number generators.

The second line imports PyTorch (as 'torch'), which provides the deep learning framework that powers these language models. PyTorch handles the underlying mathematical operations and GPU acceleration.

The 'Auto' prefix on the transformers classes indicates that they automatically pick the right model-specific class and configuration based on the model name you pass in. This makes the code more flexible, since it works with different types of models without needing changes.

This code forms the foundation for tasks like text generation, completion, or fine-tuning language models. The next steps would typically involve loading a specific model and tokenizer, then using them to process text.

In [None]:
class CFG:
    model = "Qwen/Qwen2-0.5B"

This code defines a configuration class called CFG (short for "configuration") - a common pattern in machine learning projects to organize model settings and hyperparameters in one central place.

The class contains a single class variable 'model' that specifies which pre-trained model to use. In this case, it's set to "Qwen/Qwen2-0.5B", which refers to the 0.5 billion parameter version of the Qwen2 language model developed by Alibaba.

The name "Qwen2-0.5B" breaks down into several meaningful parts: "Qwen2" indicates it's the second generation of the Qwen model family, while "0.5B" tells us the model has roughly 500 million trainable parameters. This is considered a relatively small model compared to larger versions that might have billions of parameters.

When referenced in code, this configuration can be accessed using dot notation like CFG.model, making it easy to change the model choice in one place rather than updating it throughout the codebase. This approach to configuration management helps keep code organized and maintainable, especially in larger projects where you might want to experiment with different models or settings.

This configuration class could be expanded to include other settings like learning rate, batch size, or training epochs if the code is part of a training pipeline.
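As a rough sketch of how such a class might grow, the example below adds a few training settings. Every field other than `model` (`learning_rate`, `batch_size`, `num_epochs`) is a hypothetical placeholder with an arbitrary value, not something this notebook uses, and the class is named `ExampleCFG` so it does not shadow the real `CFG`.

```python
class ExampleCFG:
    # Taken from the notebook's CFG class.
    model = "Qwen/Qwen2-0.5B"
    # Hypothetical training settings with arbitrary illustrative values.
    learning_rate = 2e-5
    batch_size = 8
    num_epochs = 3


# Settings are read with dot notation, without creating an instance.
print(ExampleCFG.model, ExampleCFG.learning_rate, ExampleCFG.batch_size)
```

Keeping everything as class attributes means a change made in one place is picked up wherever `ExampleCFG.<name>` is read.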

# Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model)
model = AutoModelForCausalLM.from_pretrained(CFG.model)

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/661 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

These two lines set up the essential components needed to work with the Qwen2 language model. Let's break down what each line does:

The first line initializes the tokenizer using AutoTokenizer.from_pretrained(). Think of the tokenizer as a translator that converts human-readable text into numbers that the model can understand. When you type "Hello world", the tokenizer breaks this into smaller pieces called tokens and assigns each a unique numerical ID. The 'Auto' part means it automatically loads the specific tokenizer that was used to train the Qwen2 model - this is crucial because each model expects text to be split in a particular way.

The second line loads the language model itself using AutoModelForCausalLM.from_pretrained(). This downloads the model's configuration and its learned parameters, then builds the network from them - roughly 500 million parameters in this case. The term "CausalLM" in the class name refers to causal language modeling, where the model predicts the next token based only on the tokens that precede it, much as we read text from left to right.

Both functions take CFG.model as their argument, which we saw earlier was set to "Qwen/Qwen2-0.5B". When these functions run, they actually download the model files from Hugging Face's model hub, where Qwen has shared their pre-trained model. The files are typically cached locally after the first download to save time in future runs.

An important detail is that these operations can take noticeable time and memory, especially for larger models. The 0.5B model might take several seconds to load, and its weights need roughly 1-2GB of RAM depending on the numeric precision they are loaded in. Unless told otherwise, from_pretrained() places the model on the CPU, though you'd typically want to move it to a GPU for faster processing if you're planning to generate a lot of text or fine-tune it.
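The 1-2GB figure can be sanity-checked with simple arithmetic: parameter count times bytes per parameter. The sketch below assumes 0.5 billion parameters, counts the weights only (no activations or framework overhead), and uses a hypothetical helper name, `estimate_weight_memory_gb`. Which precision is actually used depends on the library version and the arguments passed when loading.

```python
def estimate_weight_memory_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Return the approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


NUM_PARAMETERS = 0.5e9  # "roughly 500 million", as stated above

fp32_gb = estimate_weight_memory_gb(NUM_PARAMETERS, 4)  # 32-bit floats
fp16_gb = estimate_weight_memory_gb(NUM_PARAMETERS, 2)  # 16-bit floats

print(f"32-bit weights: about {fp32_gb:.1f} GB")
print(f"16-bit weights: about {fp16_gb:.1f} GB")
```

The 988M `model.safetensors` download shown above is close to the 16-bit estimate, which suggests the checkpoint is stored at 16-bit precision.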

# Zero-shot generalization

In [None]:
tokenizer.encode(" positive"), tokenizer.encode(" negative")

([6785], [8225])

This line of code shows how the tokenizer converts the strings " positive" and " negative" into lists of token IDs that the model can understand. Notice that both words have a space before them - this is important for understanding the tokenization.

Running the code returns two lists, one per string. The space matters because subword tokenizers like this one store a leading space as part of the token, so a word in the middle of a sentence (" positive") is a different token from the same word at the very start of the text ("positive").

The output ([6785], [8225]) shows that each string maps to exactly one token: 6785 for " positive" and 8225 for " negative". Less common words are often split into several subword pieces, in which case the list would contain more than one ID.

The fact that each word is a single token is what makes the scoring function below possible: to see which of the two words the model prefers as its next token, it is enough to compare the scores of these two IDs.

If you're curious about why these particular numbers appear in the output, it's because each number is a position in the tokenizer's vocabulary. That vocabulary was built when the tokenizer was trained, by analyzing patterns in large amounts of text, and was then fixed before the model itself was trained. The numbers serve as lookup indices into this vocabulary, allowing the model to map each token to its learned representation.
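The idea of a vocabulary as a lookup table can be sketched with a plain dictionary. This is a toy stand-in, not the real tokenizer: only the two IDs come from the output above, and `TOY_VOCAB` and `lookup` are hypothetical names.

```python
# Toy vocabulary: the two IDs are the ones printed by tokenizer.encode above.
TOY_VOCAB = {
    " positive": 6785,
    " negative": 8225,
}

# Reverse table, for turning an ID back into its token string.
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def lookup(token: str) -> int:
    """Return the ID for a token string, treating the vocabulary as a lookup table."""
    return TOY_VOCAB[token]


print(lookup(" positive"), repr(ID_TO_TOKEN[8225]))
# The leading space is part of the key, so the bare word is a different entry.
print(" positive" in TOY_VOCAB, "positive" in TOY_VOCAB)
```

A real vocabulary has many thousands of entries and falls back to smaller pieces for strings it does not contain, which this toy table does not attempt.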


In [None]:
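def score(review):
    # The template (including its indentation) is kept exactly as written,
    # because changing the prompt text can change the model's answer.
    prompt = f"""Question: Is the following review positive or
    negative about the movie?
    Review: {review} Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # No gradients are needed for inference.
    with torch.no_grad():
        # Scores for the token that would follow the prompt:
        # sequence 0 in the batch, last position.
        final_logits = model(input_ids).logits[0, -1]
    # 6785 and 8225 are the token IDs of " positive" and " negative".
    if final_logits[6785] > final_logits[8225]:
        print("Positive")
    else:
        print("Negative")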

This function performs sentiment analysis on movie reviews. Let me break down how it works step by step:

First, the function takes a movie review as input and creates a prompt string using an f-string. This prompt frames the task as a question: "Is the following review positive or negative about the movie?" followed by the review text. The prompt ends with "Answer:", setting up the model to complete the sentence with either "positive" or "negative".
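To see exactly what text reaches the tokenizer, the sketch below rebuilds the prompt with the same template. `build_prompt` is a hypothetical helper name and the review text is made up for illustration.

```python
def build_prompt(review: str) -> str:
    # Same template as in score(); the indentation inside the triple-quoted
    # string is part of the text the model sees.
    prompt = f"""Question: Is the following review positive or
    negative about the movie?
    Review: {review} Answer:"""
    return prompt


example = build_prompt("A made-up one-line review.")
# repr() makes the newlines and leading spaces visible.
print(repr(example))
```

Because the prompt ends at "Answer:" with no trailing space, the natural next token begins with a space, which is why the comparison uses " positive" and " negative" rather than the bare words.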

Next, we use the tokenizer to convert this text prompt into numbers the model can process. Calling the tokenizer returns the token IDs as a PyTorch tensor (that's what return_tensors="pt" means), and we store them in input_ids.

The core of the analysis happens when we pass these input_ids to the model. The model processes the entire sequence and returns logits - raw scores, one per vocabulary entry at every position, before they are turned into probabilities. We're specifically looking at the logits for the final position in the sequence (that's what the [-1] index does) of the first and only sequence in the batch (index [0]).
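The indexing is easier to follow with the shapes written out. The sketch below uses a random tensor as a stand-in for `model(input_ids).logits`; the sizes are arbitrary toy values, far smaller than the real vocabulary.

```python
import torch

batch_size, seq_len, vocab_size = 1, 5, 10  # arbitrary toy sizes

# Stand-in for model(input_ids).logits: one score per vocabulary entry,
# at every position, for every sequence in the batch.
logits = torch.randn(batch_size, seq_len, vocab_size)

# First sequence in the batch, last position in that sequence.
final_logits = logits[0, -1]

print(logits.shape, final_logits.shape)
```

The result is a one-dimensional tensor with one entry per vocabulary item, so `final_logits[token_id]` reads the score of a single candidate next token.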

The most interesting part is in the final comparison. The numbers 6785 and 8225 are specific token IDs that correspond to " positive" and " negative" in the model's vocabulary (notice the space before each word - this matches what we saw in the earlier tokenizer.encode() example). The function compares the logits for these two tokens - essentially asking "which word does the model think is more likely to come next?"
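Hard-coded IDs only hold for this particular tokenizer. One possible alternative, sketched below, is to look the IDs up and check that each word really is a single token. `single_token_id` is a hypothetical helper, and `StubTokenizer` is a stand-in that exists only so the example runs without downloading a model.

```python
def single_token_id(tokenizer, text: str) -> int:
    """Return the ID of `text`, insisting that it encodes to exactly one token."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) != 1:
        raise ValueError(f"{text!r} encodes to {len(ids)} tokens, expected 1")
    return ids[0]


class StubTokenizer:
    """Stand-in with the two encodings shown earlier (not the real tokenizer)."""

    _table = {" positive": [6785], " negative": [8225]}

    def encode(self, text, add_special_tokens=False):
        return self._table[text]


stub = StubTokenizer()
positive_id = single_token_id(stub, " positive")
negative_id = single_token_id(stub, " negative")
print(positive_id, negative_id)
```

With the real tokenizer loaded earlier, the call would be `single_token_id(tokenizer, " positive")`, and the check would fail loudly if a different model split the word into several pieces.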

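If the score for the " positive" token (6785) is higher than the score for the " negative" token (8225), the function prints "Positive". Otherwise, it prints "Negative". Comparing raw logits is enough here, because converting them to probabilities would not change which of the two is larger. The comparison relies on what the model learned about language during pre-training, including how reviews are written.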
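Logits become probabilities through softmax, which preserves their order, so the comparison comes out the same either way. The sketch below checks this on made-up numbers; the four-entry `toy_logits` list and the choice of indices 1 and 3 are purely illustrative.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a four-token vocabulary; indices 1 and 3 stand in
# for the " positive" and " negative" token IDs.
toy_logits = [0.2, 1.5, -0.7, 2.1]
probs = softmax(toy_logits)

by_logit = toy_logits[1] > toy_logits[3]
by_prob = probs[1] > probs[3]
print(by_logit, by_prob)
```

Both comparisons agree, which is why the function can skip the softmax step entirely.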

This is a zero-shot approach: rather than training a new classification model, the function repurposes a language model's next-token prediction to perform sentiment analysis. The decision depends on whether the model finds " positive" or " negative" the more natural continuation of the carefully crafted prompt.

In [None]:
muh_review1 = """
It's so witless, in fact, that when we do discover the secret, we want to rewind the \
        film so we don't know the secret anymore.
        And then keep on rewinding, and rewinding, until we're back at the beginning, \
        and can get up from our seats and walk backward out of the theater and go down \
        the up escalator and watch the money spring from the cash register into our pockets.
        """

In [None]:
muh_review2 = """
I am required to award stars to movies I review. This time, I refuse to do it.
The star rating system is unsuited to this film. Is the movie good? Is it bad? Does it matter?
It is what it is and occupies a world where the stars don't shine.
        """

In [None]:
muh_review3 = """
"Pearl Harbor" is a two-hour movie squeezed into three hours, about how on Dec. 7, 1941, the Japanese staged a
surprise attack on an American love triangle. Its centerpiece is 40 minutes of redundant special effects,
surrounded by a love story of stunning banality. The film has been directed without grace, vision, or
originality, and although you may walk out quoting lines of dialog, it will not be because you admire them.
"""

In [None]:
score(muh_review1)

Negative


In [None]:
score(muh_review2)

Negative


In [None]:
score(muh_review3)

Negative
