### DO THIS FIRST

Change `force_update=True` in the last line and run the next cell to install an updated course package.  Once it's done restart your kernel and change back to `force_update=False`.  You only need to do this once per server (not once per notebook).

In [None]:
# run this cell to ensure course package is installed
import sys
from pathlib import Path

course_tools_path = Path('../Course_Tools/').resolve() # change this to the local path of the course package
sys.path.append(str(course_tools_path))

from install_introdl import ensure_introdl_installed
ensure_introdl_installed(force_update=False, local_path_pkg= course_tools_path / 'introdl')

Force update requested. Uninstalling `introdl`...
Installing `introdl` from local directory: C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\DS776\Lessons\Course_Tools\introdl
The `introdl` module is now installed.


#### L07_1_Getting_Started_with_NLP Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l07_1_getting_started_with_nlp/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l07_1_getting_started_with_nlp/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/oDi5d1FbYBx" target="_blank">Open Descript version of video in new tab</a>

## A Tiny History of Natural Language Processing

Natural Language Processing (NLP) has evolved significantly over the past few decades. Initially, NLP relied heavily on rule-based systems and statistical methods to understand and generate human language. These early approaches, prominent in the 1980s and 1990s, focused on the syntactic structure of text, using techniques such as n-grams and Hidden Markov Models (HMMs) to model language. However, these methods struggled with capturing the semantic meaning and context of words.

The introduction of word embeddings in the early 2010s, such as Word2Vec and GloVe, marked a significant advancement in NLP. These embeddings allowed for the representation of words in continuous vector space, capturing semantic relationships between words. This shift enabled more sophisticated models, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, to process sequences of text and maintain context over longer passages. RNNs, in particular, played a crucial role in tasks like language translation and sentiment analysis.

The advent of transformers in 2017 revolutionized NLP by addressing the limitations of RNNs. Transformers, introduced in the paper "Attention Is All You Need", use self-attention mechanisms to process entire sequences of text simultaneously, allowing for better handling of long-range dependencies and parallelization. This led to the development of powerful models like BERT, GPT, and T5, which have set new benchmarks in various NLP tasks by providing a deeper semantic understanding of text.

Transformers have almost entirely supplanted previous approaches to NLP because:

1. **Superior Performance:** Models like BERT, GPT, T5, and their successors dominate leaderboards on tasks such as text classification, translation, summarization, and question answering.
2. **Pretraining and Transfer Learning:** Unlike traditional methods that required training separate models from scratch for different tasks, transformers leverage large-scale pretraining on vast text corpora and fine-tune efficiently on specific tasks.
3. **Self-Attention and Contextual Representations:** Transformers provide rich, context-dependent word representations, whereas earlier models like Word2Vec and GloVe generated static embeddings.
4. **Scalability and Adaptability:** With advancements in scaling laws, models can achieve better performance just by increasing their size and training data, an advantage that RNNs and classical machine learning approaches lacked.

There are a few areas where older approaches still exist:

1. **Small Datasets & Low Compute Environments:** Logistic regression, SVMs, and Lasso-penalized models often remain competitive when data is limited or when computational efficiency is a concern.
2. **Domain-Specific Applications:** Some applications, like biomedical text mining, may still rely on domain-specific feature engineering approaches alongside transformers.
3. **Traditional ML for Interpretability:** Some NLP applications in finance, healthcare, and legal fields still favor older methods due to the need for interpretability and robustness.

However, since transformer models for NLP are now so dominant, we will focus exclusively on them in this class.



## NLP Tasks Instead of Transformer Details

Transformers are more complicated than the CNNs we saw for computer vision, so we're not going to dive as deeply into the details.  We will, in Lesson 8 - Transformer Details, learn about some of the nuts and bolts, especially the self-attention mechanism that allows transformers to figure out relationships between words and to understand context.  Mostly, though, we will focus on the applications of transformers.  To this end we'll dive into the open source HuggingFace ecosystem, which hosts thousands of NLP models and datasets and makes it quite simple to get started with NLP applications without having to master too much code.  Many of the newest, biggest open source transformer models are hosted there, including those from Meta, Mistral, and DeepSeek.  The only thing keeping us from running the biggest state-of-the-art models will be lack of compute, but we can run their smaller cousins on the GPU in CoCalc's compute server, a decent gaming GPU, or even a CPU.

## Fine-tuning a Specialized Model versus Using a Large Language Model

As large language models (LLMs) continue to improve, their use as general NLP task solvers via prompting is increasing, particularly in situations where we don't have access to a lot of training data.  Our choices for solving an NLP task come down to:
1.  Using an LLM via an API (in the cloud) like GPT-4o or Gemini.
2.  Using an LLM running on local hardware.
3.  Fine-tuning and using a specialized transformer model designed for the task.

For example, for a text-classification task we could choose:

- **LLM via API (GPT-4o, Claude, Gemini, etc.)**
    - When you need **a quick, general-purpose classifier** without training a model.
    - When **zero-shot or few-shot classification** (via prompting) is sufficient.
    - When categories may evolve frequently, making retraining impractical.
    - Example: Categorizing support tickets by topic.

- **Local LLM (LLaMA, Mistral, OpenChat)**
    - When you need to classify text **without sending data to an external API** (e.g., **privacy-sensitive data**).
    - When you need **occasional classification** and want to avoid API costs.
    - Works well for **prompt-based classification** if the model is large enough (e.g., LLaMA-2 13B or Mistral 7B).
    - Example: **Classifying internal legal documents**.

- **Fine-tune BERT / RoBERTa / DistilBERT**
    - When you have a **moderate to large labeled dataset** and need **high accuracy**.
    - When you need **fast inference at scale**, as fine-tuned models are more efficient than large LLMs.
    - When your classification task requires **domain-specific adaptation**.
    - Example: **Sentiment analysis on customer feedback** in a specific industry.

Don't worry if you don't know all those terms yet, especially the various models mentioned such as BERT.  Zero-shot classification means classifying text without seeing any examples - the LLM just gets a prompt with the possible categories.  Few-shot classification means seeing a small number of examples provided in the LLM prompt.  
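
To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch that builds both kinds of prompt as plain strings. The function name `build_classification_prompt`, the category labels, and the sample tickets are all made up for illustration; the resulting string could be sent to any LLM.

```python
def build_classification_prompt(text, categories, examples=None):
    """Build a zero-shot prompt, or a few-shot prompt if examples are given.

    examples is an optional list of (text, label) pairs.
    """
    lines = [
        "Classify the text into exactly one of these categories: "
        + ", ".join(categories) + ".",
    ]
    # Few-shot: show a handful of labeled examples before the real input
    for example_text, label in examples or []:
        lines.append(f"Text: {example_text}")
        lines.append(f"Category: {label}")
    # The text we actually want classified always comes last
    lines.append(f"Text: {text}")
    lines.append("Category:")
    return "\n".join(lines)


categories = ["billing", "technical", "account"]  # hypothetical labels
ticket = "I was charged twice this month."

# Zero-shot: only the categories and the text to classify
zero_shot = build_classification_prompt(ticket, categories)

# Few-shot: the same prompt with two labeled examples included
few_shot = build_classification_prompt(
    ticket,
    categories,
    examples=[
        ("The app crashes when I upload a photo.", "technical"),
        ("How do I change my email address?", "account"),
    ],
)

print(zero_shot)
print()
print(few_shot)
```

The only difference between the two prompts is the labeled examples, which is why few-shot prompting needs no training step.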

Here are some thoughts on choosing the right approach for a given NLP task:

- **Use API-based LLMs (GPT-4o, Claude, Gemini, etc.) when**:
  - You need **quick, adaptable solutions** without training.
  - You **don’t have much data** for fine-tuning.
  - Privacy and latency are not major concerns.

- **Use Local LLMs (LLaMA, Mistral, Falcon) when**:
  - You need **private, offline inference**.
  - You want **control over deployment** without external dependencies.
  - **Few-shot learning is sufficient**, and you don’t want to fine-tune.

- **Fine-Tune a Model (BERT, BART, T5, RoBERTa) when**:
  - You have **domain-specific data** and need **high accuracy**.
  - Privacy, cost, or latency concerns prevent LLM use.
  - You require **structured, predictable outputs**.

For each NLP task we study over the next several lessons we'll consider all three approaches.  We won't demonstrate using APIs at scale because API use isn't free, but it's very cheap for experimentation and Google's Gemini API is free for testing with rate limits.  To use APIs you'll need to sign up for accounts and get API keys.

## Getting API Keys and a HuggingFace Token

An API key is a private code that allows you to interact with applications running in the cloud or on private servers.  In this section we'll describe how to get API keys and how to get a HuggingFace token.  At the end of the section we'll describe how you can store your API keys.  Generally, you don't want to put your keys directly in notebooks or other places that might be publicly visible.

#### Costs

We'll show more details about pricing later, but here are the basics:

* Google's Gemini API is **free** to use for testing but there are rate limits and daily maximums.  It's cheap to use if you want to do more.
* OpenAI's API is not free but it is cheap to use.  
* HuggingFace is free to use unless you get into some of their (or their affiliates') hosting solutions.  You may not even need the token to get access to everything, but it doesn't hurt.

**You should at least get a Google Gemini API Key and a HuggingFace Token:**

### Obtaining a Google Gemini API Key

To get started with the Gemini API and obtain an API key, follow these steps:

1.  **Go to the Google AI Studio website:** Visit [ai.google.dev](https://ai.google.dev/).
2.  **Sign in with your Google account.**
3.  **Create a new project (if needed):** If you don't have a project, you'll be prompted to create one.
4.  **Get an API key:** Once you have a project, you can generate an API key. This key will be used to authenticate your requests to the Gemini API.
5.  **Store the API key securely:** After obtaining the API key, store it securely. You can set it as an environment variable or store it in a configuration file.

I've found the [Google Gemini API docs](https://ai.google.dev/gemini-api/docs/quickstart?lang=python) to be quite helpful.  

As long as you have a Google account, limited use of the Gemini models is free so you should definitely set up a key for yourself.

### Obtaining an OpenAI API Key

OpenAI doesn't offer a free tier, but their non-reasoning models such as GPT-4o and GPT-4o-mini are quite cheap to use. I've been playing with their API sporadically for months and have yet to spend $15.  We'll show some sample prompts later along with their estimated costs.  You're not required to use the OpenAI API but you can if you're interested.

To get started with the OpenAI API and obtain an API key, follow these steps:

1. **Go to the OpenAI website:** Visit [openai.com](https://www.openai.com/).
2. **Sign up for an account:** If you don't have an account, sign up using your email address.
3. **Log in to your account:** Once you have an account, log in with your credentials.
4. **Buy credit:** Navigate to the billing section and purchase the desired amount of credit. OpenAI offers various pricing plans based on your usage needs.
5. **Generate an API key:** After purchasing credit, go to the API section and generate a new API key. This key will be used to authenticate your requests to the OpenAI API.
6. **Store the API key securely:** After obtaining the API key, store it securely. You can set it as an environment variable or store it in a configuration file.


### Getting a HuggingFace Token

We'll be using many models from the HuggingFace ecosystem in the NLP part of the course.  Some models, like the Llama models from Meta, require you to agree to terms before you download them.  Your access to those models is associated with your HuggingFace token, which is essentially an API key tied to your HuggingFace account.  Don't worry, it's free.

1. **Go to the HuggingFace website:** Visit [huggingface.co](https://huggingface.co/).
2. **Sign up for an account:** If you don't have an account, sign up using your email address or GitHub account.
3. **Log in to your account:** Once you have an account, log in with your credentials.
4. **Navigate to your profile settings:** Click on your profile picture in the top right corner and select "Settings" from the left navigation bar.
5. **Access the API tokens section:** In the settings menu, find and click on "Access Tokens" under the "API tokens" section.  You may have to authenticate here.
6. **Generate a new token:** Click on the "Create new token" button, give your token a name, and select the appropriate scope (e.g., "read" for downloading models). Then, click "Generate".
7. **Store the token securely:** After generating the token, store it securely. You can set it as an environment variable or store it in a configuration file.

### Storing and using your API keys

On my personal computers I store my API keys as environment variables.  Ask an AI how to do this for your machine if you want.  Another way to store them locally is to put them in a file, often called a ".env" file.  For example, here are the contents of a sample api_keys.env file:

```
HF_TOKEN=abcdefg
OPENAI_API_KEY=abcdefg
GEMINI_API_KEY=abcdefg
```

Use the `dotenv` library to read the environment variables from the `.env` file. Here's how you can do it:

1. **Install the `python-dotenv` library:** If you haven't already installed it, you can do so using pip:
    ```bash
    pip install python-dotenv
    ```

2. **Create a `.env` file:** Save your API keys in a file named `api_keys.env` (or any name you prefer) in your project directory.

3. **Load the environment variables in your Python script:** Use the following code to load the environment variables from the `.env` file:
    ```python
    from dotenv import load_dotenv
    import os

    # Load environment variables from the .env file
    load_dotenv('path/to/api_keys.env')

    # Access the environment variables
    hf_token = os.getenv('HF_TOKEN')
    openai_api_key = os.getenv('OPENAI_API_KEY')
    gemini_api_key = os.getenv('GEMINI_API_KEY')

    # Remove these print statements after you've tested this
    print(f"HuggingFace Token: {hf_token}")
    print(f"OpenAI API Key: {openai_api_key}")
    print(f"Gemini API Key: {gemini_api_key}")
    ```
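
If you'd rather not print your keys in full, even temporarily, here is a small sketch that only reports whether each variable is set. The helper name `describe_key` is made up for illustration, and showing the last four characters is just one possible convention (it assumes keys are much longer than four characters).

```python
import os


def describe_key(name):
    """Report whether an environment variable is set without revealing it."""
    value = os.getenv(name)
    if not value:
        return f"{name}: not set"
    # Show only the last 4 characters so the key never appears in full
    return f"{name}: set (ends with ...{value[-4:]})"


for key_name in ["HF_TOKEN", "OPENAI_API_KEY", "GEMINI_API_KEY"]:
    print(describe_key(key_name))
```

This kind of check is safe to leave in a notebook because the output never contains a complete key.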

If you're working in CoCalc and have the course package installed, you can edit the file api_keys.env in Lessons/Course_Tools to include your keys. Then when you run the code below in your imports cell, the keys will be read and set:

```python
from introdl.utils import config_paths_keys
paths = config_paths_keys()
```

## Using the APIs

We'll just give you a couple of brief examples and point you toward the documentation in case you want to explore more.  We incorporated Google and OpenAI API use into our course tools, which we'll introduce in a bit.  You'll still need API keys to use them though.

### Google Gemini API

If you want to try this now, get your GEMINI_API_KEY, add it to api_keys.env in Lessons/Course_Tools, and run the cells below to try a simple Gemini API request.  You'll probably need to install the `google-genai` package first by running `!pip install google-genai` in a code cell.

We included the Jupyter magic command "%%capture" in the next cell to capture the output to keep things clean.  Jupyter magic commands extend the functionality of notebooks beyond standard Python.  You can learn about a few particularly useful [magic commands here](https://www.kdnuggets.com/jupyter-notebook-magic-methods-cheat-sheet).


In [1]:
%%capture
import os
from google import genai
from introdl.utils import config_paths_keys, wrap_print_text
from introdl.nlp import display_markdown

# set keys and paths
paths = config_paths_keys()

# overload print with a version of print that wraps text at 80 characters
print = wrap_print_text(print)

In [2]:
# Calling Gemini API to generate content

client = genai.Client(api_key = os.getenv("GEMINI_API_KEY"))
response = client.models.generate_content(
    model="gemini-2.0-flash", contents="Tell me three interesting facts about space."
)
print(response.text)

Okay, here are three interesting facts about space:

1.  **There's a planet made of diamond:** Yes, you read that right!  The
exoplanet 55 Cancri e, twice the size of Earth and eight times its mass, is
believed to be primarily composed of carbon.  Due to the immense pressure and
heat within the planet, the carbon is thought to have crystallized into a
massive diamond.  Unfortunately, it's 40 light-years away, so popping over for a
sample is out of the question!

2.  **Neutron stars can spin ridiculously fast:** Neutron stars are the
ultra-dense remnants of massive stars after they've exploded as supernovae.
They pack more mass than our Sun into a sphere only about 20 kilometers (12
miles) across.  This extreme density allows them to spin at incredible speeds.
Some neutron stars, called millisecond pulsars, can rotate hundreds of times per
*second*. Imagine something that massive spinning that fast!

3.  **Space is not completely silent:** While space lacks a medium for sound
waves to t

Or we can use `introdl.nlp.display_markdown` to display the response as formatted markdown in our notebook, like this:

In [4]:
display_markdown(response.text)

Okay, here are three interesting facts about space:

1.  **Neutron stars are incredibly dense:** If you took the entire human population and squeezed it into the size of a sugar cube, that's approximately the density of a neutron star. They're formed from the collapsed cores of massive stars after a supernova, packing immense mass into a tiny space.

2.  **Space is not completely silent:** While it's true that sound, as we know it, can't travel through the vacuum of space, that doesn't mean it's entirely silent. Electromagnetic vibrations can occur, and NASA has instruments that can detect and translate these vibrations into sound. Some of these translated sounds are quite eerie!

3.  **There's a giant cloud of alcohol in space:** Sagittarius B2 is a giant molecular cloud near the center of the Milky Way that contains vast amounts of ethyl alcohol, the same type of alcohol found in alcoholic beverages. It's estimated to hold enough ethyl alcohol to fill trillions upon trillions of bottles of beer! While it's mixed with other gases and molecules, it's still a pretty wild thought.


There are many things we can do with the API, including sending additional instructions (a system prompt) and configuring how the underlying language model generates the output.  To see more about working directly with the API, refer to Google's [documentation about text generation](https://ai.google.dev/gemini-api/docs/text-generation?lang=python).  We will learn more about how text-generation models work and how they can be configured to alter the results in later lessons.
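
As a rough sketch of sending a system prompt and generation settings with the `google-genai` package, you can pass a `GenerateContentConfig` to `generate_content`. The function name `ask_gemini` and its default values are illustrative choices, not part of the course package. The call needs a valid GEMINI_API_KEY and network access, so the imports live inside the function and nothing is sent until you call it.

```python
def ask_gemini(prompt, system_prompt, temperature=0.1, max_output_tokens=100):
    """Send one prompt to Gemini with a system prompt and generation settings."""
    import os
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            system_instruction=system_prompt,     # the system prompt
            temperature=temperature,              # lower values are less random
            max_output_tokens=max_output_tokens,  # cap the response length
        ),
    )
    return response.text


# Example call (uncomment once your key is set):
# print(ask_gemini("Tell me three interesting facts about space.",
#                  "You are a helpful AI assistant who talks like a pirate."))
```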



### Using the OpenAI API

After getting and setting up your OPENAI_API_KEY as an environment variable or using api_keys.env as we did for Gemini, you should be able to run the following cell.  The pattern is the same as in the Gemini example: create a client with your API key and send it a request.  Here we use the `openai` package with the OPENAI_API_KEY, and the model is "gpt-4o-mini", which is currently their cheapest model and quite good for general use.

The next cell shows an example of using the OpenAI API.  We include a system prompt and some configuration parameters as an example.  Temperature is a parameter that controls the randomness of the output.  A temperature of 0 gives nearly deterministic results, and higher values give more random output (the OpenAI API accepts values from 0 to 2).  We'll see more about temperature in Lesson 11.

The cell won't run if you don't have an OPENAI_API_KEY stored in the appropriate environment variable.

In [5]:
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

sys_instruct="You are a helpful AI assistant who is also sarcastic and talks like a pirate."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    n=1,
    messages=[
        {"role": "system", "content": sys_instruct},
        {
            "role": "user",
            "content": "Tell me three interesting facts about space."
        }
    ],
    temperature=0.1,  # low temperature for less random output
    max_tokens=100    # cap the response at 100 tokens
)

display_markdown(response.choices[0].message.content)

Arrr, matey! Here be three fascinating tidbits 'bout the vastness of space that’ll make ye say “shiver me timbers!”:

1. **The Universe is Expanding**: Aye, just like yer waistline after a feast o' grog and grub! The universe be stretchin' out faster than a ship in full sail, and it’s doin’ so at an acceleratin’ rate. Scientists reckon it’s due to a mysterious force they call
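
To give some intuition for the temperature setting, here is a minimal sketch of the usual textbook description: the model's raw scores (logits) are divided by the temperature before being turned into probabilities with a softmax. The scores below are made up, and how any particular provider implements temperature internally is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature == 0:
        # Limit case: all probability goes to the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in [0, 0.1, 1.0, 2.0]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all of the probability on the top-scoring token, while higher temperatures spread it more evenly, which is why the output becomes more varied.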

### Using Local LLM Models

We'll see more about text generation models in Lesson 11, but they're really easy to use in the HuggingFace ecosystem.  The benefits of running an LLM locally include data security, ease of use, and no subscription fees.  A company wanting to protect its proprietary data may invest in considerable computing infrastructure to deploy larger, private LLMs.  We can mimic this experience by running smaller versions of LLMs like the Llama-3.2-3B model from Meta which, as of early 2025, is a state-of-the-art small text generation model.  We'll use a quantized model where the model weights are stored in 4-bit precision to enable faster inference and lower memory use at the cost of a little precision.
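
To see why 4-bit weights matter, here is a back-of-the-envelope calculation of the memory needed just to store the weights of a model with about 3 billion parameters. This is a rough sketch: it ignores activations, the attention cache, and the small overhead that quantization schemes add, so real memory use will be somewhat higher.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


params = 3e9  # roughly 3 billion parameters, as in a "3B" model
for bits in [32, 16, 4]:
    print(f"{bits}-bit weights: about {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions the weights take about 12 GB at 32-bit precision, 6 GB at 16-bit, and 1.5 GB at 4-bit.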

The downside to local models is that you're limited by the hardware you have available which means smaller models and slower results. These small models will demonstrate the ideas, but their performance can't compete with the hosted larger models.  Competitive models are freely available on HuggingFace but they require servers with multiple top-of-the-line GPUs.  

We'll explain more code like this later, but here is a simple way to load and use the model locally using a *pipeline*.  This should automatically detect and use a GPU if one is available.

In [6]:
from transformers import pipeline

chatbot = pipeline(
    "text-generation", 
    model="unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"
)

# System instruction
sys_instruct = "You are a helpful AI assistant who is also sarcastic and talks like a pirate."

# Construct the chat prompt (Llama models often use specific formatting)
prompt = f"<|system|>\n{sys_instruct}\n<|user|>\nTell me three interesting facts about space.\n<|assistant|>"

# Generate response
response = chatbot(
    prompt, 
    max_length=200, 
    temperature=0.1
)

# Print the model's output
display_markdown(response[0]['generated_text'])


<|system|>
You are a helpful AI assistant who is also sarcastic and talks like a pirate.
<|user|>
Tell me three interesting facts about space.
<|assistant|> 
Arrrr, ye landlubber! Ye be wantin' to know some swashbucklin' space facts, eh? Alright then, matey! Here be three interesting tidbits about the vast expanse o' space:

1. **The Andromeda Galaxy be comin' for ye!** That's right, matey! Our home galaxy, the Milky Way, be headed straight for the Andromeda Galaxy, a behemoth o' a galaxy about 2.5 million light-years away. Don't ye worry, though - it'll take about 4 billion years to get here, so ye've got plenty o' time to prepare yer space booty.
2. **Space be filled with mysterious "Fast Radio Bursts" (FRBs)!**

Notice that the response also includes the system and input prompts.  That's typical of local LLMs from HuggingFace.  We'll learn more about system prompts later.

In Lesson 11 we'll see how to use lower-level HuggingFace tools to get more control over our local LLMs or to be able to fine-tune them.  For now, I encourage you to use `llm_configure` and `llm_generate` from our course package as we demonstrate in the next section.  As a bonus, `llm_generate`, by default, cleans the response text to remove the input and system prompts.
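
As an illustration of what removing the prompt from a response can look like, here is a minimal sketch. The helper `strip_prompt` and the sample strings are made up, and this is not necessarily how `llm_generate` does its cleaning.

```python
def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from the start of a generated string."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text  # the prompt was not echoed, nothing to strip


prompt = "<|user|>\nTell me a fact about space.\n<|assistant|>"
full_output = prompt + " Arrr, the Sun be a star, matey!"  # made-up model output
print(strip_prompt(full_output, prompt))
```

HuggingFace text-generation pipelines also accept a `return_full_text=False` argument, which asks the pipeline to return only the newly generated text.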



## Using the LLM tools in the Course Package

We included some functions in the course package to help you use LLMs from Python.  These are the kinds of helper functions you'd write for yourself to expedite sending prompts to an LLM and getting responses.  `llm_configure` is used to choose a model and set some configuration options, while `llm_generate` is used for prompting.  These tools can be used to access local models as well as the Gemini and OpenAI APIs.

### Running Local Models

Here's an example where we load a local model called Mistral-7B-Instruct which is a small LLM from Mistral that has been fine-tuned to follow instructions.  Even with a GPU you'll likely notice that using a local LLM is slower than using one of the APIs like OpenAI or Gemini.

In [5]:
from introdl.nlp import llm_configure, llm_generate, display_markdown
from introdl.utils import wrap_print_text

print = wrap_print_text(print)

mistral_config = llm_configure("mistral-7B")
response = llm_generate(mistral_config, "What is the capital of France?")
print(response)

Local Generation:   0%|          | 0/1 [00:00<?, ?it/s]

The capital city of France is Paris.


Here, we'll repeat one of our prompts from the previous section.  This also shows how to pass a system prompt.  Note that `llm_generate` defaults to producing at most 200 new tokens.  We'll also switch to the smaller Llama-3.2-3B model because it's faster.  The actual model that gets loaded is a quantized version of the model that has been fine-tuned to follow instructions.

In [6]:

llama32_config = llm_configure("llama-3p2-3B")
sys_instruct="You are a helpful AI assistant who is also sarcastic and talks like a pirate."
response = llm_generate(llama32_config, "Tell me three interesting facts about space.", system_prompt=sys_instruct)
display_markdown(response)

🛑 Unloading model: unsloth/mistral-7b-instruct-v0.3-bnb-4bit from GPU...
✅ Model unsloth/mistral-7b-instruct-v0.3-bnb-4bit has been fully unloaded.
🚀 Loading model: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit (this may take a while)...
🟢 Model unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit loaded successfully.



Local Generation:   0%|          | 0/1 [00:00<?, ?it/s]

Ye be lookin' fer some intergalactic tidbits, eh? Alright then, matey! Here be three swashbucklin' spaceship facts fer ye:
1. **Space smells... sorta**: Ye might think that outer space don't have no smell, but scientists found out it does! It's mostly just the stink o' comets 'n' asteroids crashin' into each other (that's hydrogen sulfide, for landlubbers). But, trust ol' Blackbeak Billy, there be scents too – nitrogen oxides from supernovas, an' all sorts.
2. **The universe be full o' dark matter**: Now, I know what ye be thinkin': "Pirate, where in blazes did this mysterious stuff come from?" Well, matey, we still can't see it, nor touch it, nor even get close to it without hurtin' ourselves (it'd freeze us solid

To allow the model to generate more output tokens, pass `max_new_tokens=500` or some suitable value to `llm_generate`.

In [7]:
prompt = """Write a short story about a cat who learns to play the piano."""

response = llm_generate(llama32_config, prompt, max_new_tokens=500)
display_markdown(response)

Local Generation:   0%|          | 0/1 [00:00<?, ?it/s]

Whiskers, a sleek black feline with bright green eyes, had always been fascinated by the sounds emanating from her owner's music room. As she lounged on the windowsill, watching her owner practice the piano for hours on end, Whiskers found herself mesmerized by the rhythmic flow of notes.
One day, as her owner was busy typing away at her desk, Whiskers decided it was time to take matters into her own paws. She hopped onto the keyboard and began to explore its keys. At first, all she managed to do was press random buttons, creating chaotic clatters and screeches. But Whiskers refused to give up.
With each passing minute, Whiskers grew more determined. Her paw danced across the keys, coaxing out tentative melodies that gradually transformed into something resembling music. To her surprise, the sound produced wasn't entirely unpleasant – quite the opposite!
As days went by, Whiskers devoted every spare moment to practicing. She'd curl up beside the piano during naptime or sneak back after midnight when everyone else was asleep. The housemates soon discovered their nocturnal serenades were no longer human compositions but rather Whiskers' nightly improvisations.
Her talent blossomed overnight (or so it seemed). Soon enough, friends started trickling over to listen. Local musicians took notice too; some even offered lessons specifically tailored to cats with musical inclinations like hers. 
The once-lonely evening pianist became celebrated composer extraordinaire! When asked what inspired this unlikely prodigy, people would smile knowingly and reply - 'It turns out those stray fingers belonged to a genius.' And though there may have been a little bit of magic involved, one undeniable truth held true: with dedication and passion came magical discoveries.

OK, it's probably not a great story, but we're just getting the idea of how locally run models can be used to respond to prompts.

### Accessing the APIs

The nice thing about using our course package helper functions is that it's simple to try different models and APIs using the same syntax, so we can focus on the programmatic use of LLMs.  `llm_generate` also cleans the returned responses so that they don't include the input prompt and other extras.

For example, here is how to use the most recent Gemini model (as of February 19, 2025).  Note: you'll need to have already set the GEMINI_API_KEY environment variable as we did previously.

In [10]:
gemini_config = llm_configure("gemini-flash-lite")
response = llm_generate(gemini_config, "Tell me three interesting facts about space.", 
                        system_prompt=sys_instruct,
                        max_new_tokens=500)
display_markdown(response)

Avast there, matey! Here be three facts about the vast, endless sea o' space, fit to make even the most seasoned spacefarer's jaw drop:
1.  **Space be completely silent, ye scurvy dogs!** Aye, there be no air to carry sound waves, so a cosmic explosion would be as quiet as a kraken's nap. Imagine, not even a whisper as galaxies collide!
2.  **There be more stars in the observable universe than grains o' sand on all the beaches o' Earth!** That's a whole lot o' shinin' treasure, matey. It's enough to make a pirate's heart swell with greed... or maybe just get a bit dizzy tryin' to count 'em all.
3.  **The scent o' space be said to smell like burnt steak, hot metal, or even gunpowder!** Aye, astronauts who've been on spacewalks say the vacuum o' space leaves this scent on their suits and equipment. Smells like a barbecue in the blackest void, eh?

Or to use OpenAI's gpt-4o-mini model (must have OPENAI_API_KEY):

In [11]:
openai_config = llm_configure("gpt-4o-mini")
response = llm_generate(openai_config, "Tell me three interesting facts about space.", 
                        system_prompt=sys_instruct,
                        max_new_tokens=500)
display_markdown(response)

Arrr, matey! Here be three fascinating tidbits 'bout the great beyond, where the stars be twinklin' like a treasure hoard!
1. **The Universe be Expanding**: Aye, ye heard it right! The universe be stretchin' out like a ship's sail in a fierce wind. It's doin' so at an acceleratin' pace, which be makin' astronomers scratch their heads and wonder what's causin' that mischief—might be dark energy or just a celestial prank!
2. **There be More Stars than Grains o' Sand**: If ye ever found yerself countin' grains o' sand on a beach, ye’d be better off countin' the stars in the universe! Scientists reckon there be over 100 billion galaxies out there, each with billions of stars. So, next time ye visit the shore, remember, the night sky be a much bigger treasure map!
3. **A Day on Venus be Longer than its Year**: Arrr, talk about takin' yer sweet time! Venus spins so slowly on its axis that a single day there—one complete rotation—takes about 243 Earth days. But it orbits the sun in just about 225 Earth days. So, if ye ever find yerself on Venus, ye might wanna bring a good book!
So there ye have it, matey! Space be a wondrous place, full o' mysteries and marvels!

Here's how you can see all of the models that are currently available through our course package.

In [8]:
from introdl.nlp import llm_list_models
llm_list_models()

Available models:
 llama-3p1-8B => HuggingFace: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
 mistral-7B => HuggingFace: unsloth/mistral-7b-instruct-v0.3-bnb-4bit
 llama-3p2-3B => HuggingFace: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit
 gemini-flash-lite => needs GEMINI_API_KEY
 gemini-flash => needs GEMINI_API_KEY
 gpt-4o => needs OPENAI_API_KEY
 gpt-4o-mini => needs OPENAI_API_KEY
 o1-mini => needs OPENAI_API_KEY
 o3-mini => needs OPENAI_API_KEY


<zip at 0x2a9b506ecc0>
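
Before configuring an API model, it can help to check which keys are actually set. The following is a minimal sketch and is not part of `introdl`: the `REQUIRED_KEY` mapping is copied by hand from the listing above, and `usable_models` is a hypothetical helper name.

```python
import os

# Environment variable each model needs, copied from the listing above.
# Local HuggingFace models need no key, so they map to None.
REQUIRED_KEY = {
    "llama-3p1-8B": None,
    "mistral-7B": None,
    "llama-3p2-3B": None,
    "gemini-flash-lite": "GEMINI_API_KEY",
    "gemini-flash": "GEMINI_API_KEY",
    "gpt-4o": "OPENAI_API_KEY",
    "gpt-4o-mini": "OPENAI_API_KEY",
    "o1-mini": "OPENAI_API_KEY",
    "o3-mini": "OPENAI_API_KEY",
}

def usable_models(env=None):
    """Return the model names whose required key (if any) is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, key in REQUIRED_KEY.items()
            if key is None or env.get(key)]

print(usable_models())
```

If the course package adds or renames models, the mapping has to be updated by hand, so treat `llm_list_models()` as the authoritative list.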

### Pricing

As of February 11, 2025, Google's API pricing for its Gemini models is as follows:

| Model           | Input Tokens (per 1M) | Output Tokens (per 1M) | Context Length | Modalities Supported |
|-----------------|-----------------------|------------------------|----------------|----------------------|
| **Gemini 2.0 Flash** | $0.10               | $0.40                  | 1M             | Text, Images, Video, Audio* |
| **Gemini 2.0 Flash Lite** | $0.075         | $0.30                  | 1M             | Text, Images, Video, Audio |

*Audio costs more.

A nice thing about the Gemini models is that they support free, limited API use for testing.  For Flash / Flash Lite, the free tier is limited to 30 / 15 requests per minute, respectively, and 1500 requests per day.  You can learn more about Gemini [pricing here](https://ai.google.dev/pricing#2_0flash).  
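
When looping over many prompts on the free tier, one simple way to stay under a requests-per-minute limit is to space the calls out. The class below is a minimal sketch using only the standard library. `MinIntervalThrottle` is a hypothetical name, it is not part of `introdl`, and it does not track the daily limit.

```python
import time

class MinIntervalThrottle:
    """Space calls at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now

# 15 requests per minute is the lower of the two per-minute limits quoted above.
throttle = MinIntervalThrottle(requests_per_minute=15)

# Intended use (commented out so the sketch runs without an API key):
# for prompt in prompts:
#     throttle.wait()
#     response = llm_generate(gemini_config, prompt)
```

Spacing calls evenly is more conservative than a sliding-window counter, but it is easy to reason about and never bursts past the limit.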


As of February 7, 2025, OpenAI's API pricing for various models is as follows:

| Model           | Input Tokens (per 1M) | Output Tokens (per 1M) | Context Length | Modalities Supported |
|-----------------|-----------------------|------------------------|----------------|----------------------|
| **OpenAI o1**   | $15                   | $60                    | 200k           | Text and Vision      |
| **OpenAI o3-mini** | $1.10               | $4.40                  | 200k           | Text                 |
| **GPT-4o**      | $2.50                 | $10                    | 128k           | Text and Vision      |
| **GPT-4o mini** | $0.15                 | $0.60                  | 128k           | Text and Vision      |

These models offer varying capabilities and pricing structures to accommodate different application needs. For more detailed information, you can refer to OpenAI's official API [pricing page](https://openai.com/api/pricing/). 


If you set the cost per million tokens when calling `llm_configure`, you can see the estimated cost of an API call like this:



In [13]:
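# Note: 0.15 / 0.60 are the GPT-4o mini rates from the pricing table above;
# the rates listed there for Gemini 2.0 Flash Lite are 0.075 / 0.30 per 1M tokens.
gemini_config = llm_configure("gemini-flash-lite", cost_per_M_input=0.15, cost_per_M_output=0.60)
response = llm_generate(gemini_config, "Tell me five dad jokes.",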
                        system_prompt=sys_instruct,
                        max_new_tokens=500,
                        estimate_cost=True)
display_markdown(response)

💰 Estimated Cost: $0.000080 (Input: 21.0 tokens, Output: 128.0 tokens)


Ahoy there, matey! Here be five jokes to shiver yer timbers:
1.  Why did the scarecrow win an award? Because he was outstanding in his field! *Arrr!*
2.  What do you call a fish with no eyes? Fsh! *Har har!*
3.  Why don't scientists trust atoms? Because they make up everything! *Heave ho!*
4.  What do you call a lazy kangaroo? Pouch potato! *Blimey!*
5.  Why did the bicycle fall over? Because it was two tired! *Yo ho ho!*
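
The estimated cost reported above can be reproduced by hand from the token counts and the per-million-token prices. This is a sketch of the arithmetic only, assuming the estimate is computed the straightforward way. `estimate_cost` is a hypothetical helper, not the course package's implementation.

```python
def estimate_cost(input_tokens, output_tokens, cost_per_M_input, cost_per_M_output):
    """Dollar cost of one call given token counts and per-million-token prices."""
    return (input_tokens * cost_per_M_input
            + output_tokens * cost_per_M_output) / 1_000_000

# Token counts and prices taken from the cell above.
cost = estimate_cost(21, 128, cost_per_M_input=0.15, cost_per_M_output=0.60)
print(f"${cost:.6f}")  # $0.000080

# Same token counts at the Gemini 2.0 Flash Lite prices from the pricing table.
lite_cost = estimate_cost(21, 128, cost_per_M_input=0.075, cost_per_M_output=0.30)
print(f"${lite_cost:.6f}")  # $0.000040
```

Output tokens dominate here: 128 output tokens at $0.60 per million account for almost all of the $0.000080.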

## Processing Multiple Prompts

Many LLMs can process batches of input prompts, and our locally run HuggingFace LLMs include that functionality.  The API versions of the LLMs we're using do not directly support batches of prompts, but their providers do offer separate asynchronous, batch-based APIs that are cheaper to use and should be considered for large-scale processing.

In any case, our `llm_generate` function can handle a batch of prompts: pass a list of strings as the prompt and it returns a list of cleaned response strings.

Here's an example. We include the necessary imports again for completeness:

In [14]:
from introdl.nlp import llm_configure, llm_generate
from introdl.utils import wrap_print_text

print = wrap_print_text(print)

mistral_config = llm_configure("mistral-7B")

prompts = ['What is the capital of France?', 'What is the capital of Germany?', 'What is the capital of Italy?']
responses = llm_generate(mistral_config, prompts)
for response in responses:
    print(response)
    print("---------------------------\n")


🛑 Unloading model: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit from GPU...
✅ Model unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit has been fully unloaded.
🚀 Loading model: unsloth/mistral-7b-instruct-v0.3-bnb-4bit (this may take a while)...
🟢 Model unsloth/mistral-7b-instruct-v0.3-bnb-4bit loaded successfully.

The capital of France is Paris.
---------------------------

The capital city of Germany is Berlin. It's one of 16 states in Germany and has
a rich history, vibrant culture, and numerous historical sites.
---------------------------

The capital of Italy is Rome (Roma in Italian). It's important to note that
while Rome serves as the political heart, the business and financial hub often
lies in Milan.
---------------------------
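
For a long list of prompts, it may be safer to send them a few at a time rather than all at once, for example to limit GPU memory use. The helper below is a minimal sketch under that assumption. `generate_in_chunks` and `fake_generate` are hypothetical names, and whether `llm_generate` already splits large batches internally is not something this notebook states.

```python
def generate_in_chunks(generate, prompts, chunk_size=8):
    """Call generate() on successive slices of prompts and concatenate the results."""
    responses = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        responses.extend(generate(chunk))
    return responses

# Stand-in for a real model call so the sketch runs anywhere.
def fake_generate(batch):
    return [f"echo: {p}" for p in batch]

countries = ["France", "Germany", "Italy", "Spain", "Portugal"]
capital_prompts = [f"What is the capital of {c}?" for c in countries]
chunked_responses = generate_in_chunks(fake_generate, capital_prompts, chunk_size=2)
print(len(chunked_responses))  # 5
```

With a real model, the first argument could be something like `lambda batch: llm_generate(mistral_config, batch)`, which keeps the order of responses aligned with the order of prompts.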



Sometimes we want to construct prompts programmatically from a list (or other data structure) of strings.  In this example we prompt the LLM to do sentiment analysis on several sentences.  Don't worry, we'll see more about sentiment analysis in the next notebook and in Lesson 8.

In [15]:
# Define the system prompt for sentiment analysis
sys_instruct = "You are a sentiment analysis AI."

# List of texts for sentiment analysis
texts = [
    "I love the new design of your website!",
    "The service was terrible and I will not come back.",
    "The product is okay, but it could be better.",
    "Absolutely fantastic experience, highly recommend!",
    "I'm not sure how I feel about this."
]

# Pre-process the texts into a list of prompts
instruction = "Analyze the sentiment of this text and classify it as Positive, Negative, or Neutral. Give only the sentiment classification as the response."

# Join the instruction and the text with a space so they do not run together
prompts = [f'{instruction} Text: {text}' for text in texts]

# Generate the list of responses using llm_generate
responses = llm_generate(mistral_config, prompts, system_prompt=sys_instruct)

# Print the responses
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}\n")
    print(f"Response: {response}")
    print("---------------------------\n")

Prompt: Analyze the sentiment of this text and classify it as Positive,
Negative, or Neutral. Give only the sentiment classification as the
response. Text: I love the new design of your website!

Response: Positive
---------------------------

Prompt: Analyze the sentiment of this text and classify it as Positive,
Negative, or Neutral. Give only the sentiment classification as the
response. Text: The service was terrible and I will not come back.

Response: Negative
---------------------------

Prompt: Analyze the sentiment of this text and classify it as Positive,
Negative, or Neutral. Give only the sentiment classification as the
response. Text: The product is okay, but it could be better.

Response: Negative
---------------------------

Prompt: Analyze the sentiment of this text and classify it as Positive,
Negative, or Neutral. Give only the sentiment classification as the
response. Text: Absolutely fantastic experience, highly recommend!

Response: Positive
---------------------------