<a href="https://colab.research.google.com/github/itinstructor/JupyterNotebooks/blob/main/Notebooks/LLM_With_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with Large Language Models

## What is a large language model (LLM)?

A large language model (LLM) is a type of artificial intelligence (AI) program that can recognize and generate text, among other tasks. LLMs are trained on huge sets of data — hence the name "large." LLMs are built on machine learning: specifically, a type of neural network called a transformer model.

In simpler terms, an LLM is a computer program that has been fed enough examples to be able to recognize and interpret human language or other types of complex data. Many LLMs are trained on data that has been gathered from the Internet — thousands or millions of gigabytes' worth of text. But the quality of the samples impacts how well LLMs will learn natural language, so an LLM's programmers may use a more curated data set.

LLMs use a type of machine learning called deep learning in order to understand how characters, words, and sentences function together. Deep learning involves the probabilistic analysis of unstructured data, which eventually enables the deep learning model to recognize distinctions between pieces of content without human intervention.

LLMs are then further trained via tuning: they are fine-tuned or prompt-tuned to the particular task that the programmer wants them to do, such as interpreting questions and generating responses, or translating text from one language to another.
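To make "learning from examples" a little more concrete, here is a toy sketch that is *not* how a real LLM works internally: it simply counts which word follows which in a tiny made-up corpus and turns the counts into probabilities. The corpus and the function name `next_word_probabilities` are invented for illustration; a real LLM learns far richer patterns with a neural network.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real LLMs are trained on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish ."

# Count how often each word is followed by each other word.
follow_counts = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1


def next_word_probabilities(word):
    """Return {next_word: probability} estimated from the counts."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


print(next_word_probabilities("the"))
```

Even this crude model "predicts" that *cat* is the most likely word after *the*, purely from the examples it was given.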

In [None]:
print("Hello World")

Hello World


## Transformers in a Nutshell

A **transformer** is a neural network architecture based on the concept of **attention**.
* Transformers are what make LLMs such as ChatGPT work.
* You feed a lot of text data into the neural network, and it learns which words relate to which other words.


<div>
    <center>
        <table>
            <tr>
                <td><img src="https://github.com/ericmanley/LLM4CSCurriculumWorkshop/blob/main/images/simple_self_attention.png?raw=1" width=400px></td>
                <td><img src="https://github.com/ericmanley/LLM4CSCurriculumWorkshop/blob/main/images/attention_vis1.png?raw=1" width=300px></td>
            </tr>
        </table>
    </center>
</div>


*image source:* Speech and Language Processing Fig. 10.2, https://web.stanford.edu/~jurafsky/slp3/10.pdf

*image source:* the original paper on transformers, **Attention Is All You Need**, https://arxiv.org/pdf/1706.03762.pdf
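As a rough sketch of the attention idea, the code below computes scaled dot-product attention for a single query in plain Python. The 2-dimensional vectors are made up for illustration; in a real transformer the queries, keys, and values are produced by learned layers and have hundreds of dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # How strongly does the query match each key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up vectors standing in for three words in a sentence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

The weights show how much the query "attends" to each word: keys that point in a similar direction to the query get larger weights.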

## Why transformers?

Unlike earlier recurrent architectures, which process text one token at a time, transformers can be trained *in parallel*.

LLMs use big models: they take many words as input, encode many word senses, stack many layers to extract high-level features of text, and are trained on massive amounts of text.

<div>
    <center>
        <img src="https://github.com/ericmanley/LLM4CSCurriculumWorkshop/blob/main/images/transformer_encoder_decoder.png?raw=1" width=300px>
    </center>
</div>

*image source:* Hugging Face NLP Course - **How do transformers work?** https://huggingface.co/learn/nlp-course/chapter1/4

# Hugging Face Access Token

1. Go to the left toolbar --> Click the Key icon.
2. Click --> Add new secret
3. Name: HF_TOKEN
4. Value: Paste your Hugging Face token
5. Actions: Click the Eye icon --> to hide the characters in your token.
6. Click Notebook access to turn it on.
7. Click the Key icon to close the Secrets panel.
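If you ever need to read the secret yourself, the sketch below shows one way to do it. It assumes the secret is named `HF_TOKEN` as in the steps above; the helper name `get_hf_token` and the fallback to an environment variable (for running outside Colab) are additions for illustration.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; try the environment.
            pass
    return os.environ.get(name)


token = get_hf_token()
# Never print the token itself; just report whether one was found.
print("Token found" if token else "No token found")
```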

## Installing the Hugging Face `transformers` library

The Hugging Face `transformers` library is already installed in Google Colab. If the next code cell gives an error, don't worry. We are using a CPU to run the LLM.

In [None]:
import sys
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.wh

### What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

It provides popular free, open-source libraries for natural language processing (and other) tasks.

It hosts *hundreds of thousands of models* that you can use in your own programs.

## A first transformers program: the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use


<div>
    <center>
        <img src="https://github.com/ericmanley/LLM4CSCurriculumWorkshop/blob/main/images/full_nlp_pipeline.svg?raw=1" width=600px>
    </center>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
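The three pipeline steps can be sketched with a toy stand-in. Nothing here comes from the `transformers` library: the word scores, the labels list, and the function names are all hypothetical, chosen only to show how tokenizing, predicting, and post-processing fit together.

```python
import math

# Hypothetical word scores standing in for a trained model's learned weights.
WORD_SCORES = {"love": 2.0, "great": 1.5, "hate": -2.0, "boring": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Step 1: split text into lowercase tokens (real tokenizers use subwords)."""
    return text.lower().split()


def toy_model(tokens):
    """Step 2: produce one raw score (logit) per label."""
    total = sum(WORD_SCORES.get(tok, 0.0) for tok in tokens)
    return [-total, total]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Step 3: convert logits to probabilities and report the top label."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    return postprocess(toy_model(tokenize(text)))


print(toy_pipeline("I love teaching"))
```

The result has the same shape as a real pipeline's output (a list with a `label` and a `score`), which is why the post-processing step matters: it turns raw numbers into something a program can use.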

## Sentiment Analysis

* We *are* specifying the kind of task: `sentiment-analysis`.
* This task classifies the sentiment of the string stored in the `text` variable.
* We *are not* asking for a specific model, so the library picks a default model for the task.
* The first time you run this, it has to download the model, which can take some time depending on your network connection.


In [None]:
# Import the pipeline function from the transformers library
from transformers import pipeline

# Create a sentiment analysis classifier pipeline using the default pre-trained model
classifier = pipeline("sentiment-analysis")
# classifier = pipeline("text-classification", model="madhurjindal/autonlp-Gibberish-Detector-492513457")

# Enter text to be classified
# example: "I love how easy it is to build sentiment-aware applications with the transformers library!"
text = input("Enter text to be classified: ")

# Use the created classifier to analyze the sentiment of the given text
results = classifier(text)

# Print the sentiment analysis results
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Enter text to be classified: I love teaching
[{'label': 'POSITIVE', 'score': 0.9993025064468384}]


**Test it out:** Try changing the input to get different labels/scores
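The classifier returns a list of dictionaries, so you usually want to pull out the label and score rather than print the raw list. A small sketch, using the sample output from the run above (the helper name `describe` is made up for this example):

```python
# Sample output copied from the run shown above.
results = [{'label': 'POSITIVE', 'score': 0.9993025064468384}]


def describe(result):
    """Format one prediction as 'LABEL (xx.x% confident)'."""
    return f"{result['label']} ({result['score']:.1%} confident)"


for result in results:
    print(describe(result))
```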

## Activity: Specifying a model

Now try asking for a specific model.

Replace one line of code in your earlier example.

You can find out more about this model by checking out its model card: https://huggingface.co/SamLowe/roberta-base-go_emotions

What are some things you notice about this model that are different from the first one?

In [None]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

## Activity: Explore additional models

Go to the Hugging Face models page: https://huggingface.co/models
* Click `Text Classification`
* Find another model that looks interesting to you and try it out
* You might be able to find models for spam detection, fake news detection, topic classification, etc.

## What about sequence-to-sequence models?

The transformers library also has models that generate output sequences: they take text (often long) as input and produce text as output.
* summarization
* translation
* question answering

Example:

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [None]:
# Article copied from https://www.npr.org/2024/04/02/1242197022/biden-xi-jinping-call-china
example_news_article = """
BEIJING and WASHINGTON, D.C. — President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday, marking the first conversation between the leaders since their face-to-face meeting in California in November.

The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia's war in Ukraine.

According to the Chinese readout, Xi told Biden strategic awareness "must always be the first 'button' to be fastened" in bilateral ties. The Chinese leader also elaborated his position on issues concerning Hong Kong, human rights and the South China Sea, the readout says.

The Chinese leader warned again that the "Taiwan issue" is an "insurmountable red line" in bilateral ties. Xi also urged Biden to "translate" his commitment of not supporting "Taiwan independence" into concrete actions, according to the readout.

Biden, in the call, emphasized the importance of maintaining peace and stability across the Taiwan Strait and the rule of law and freedom of navigation in the South China Sea, according to a White House readout.

The two leaders also discussed the global geopolitical situation. Biden, according to the White House, raised concerns over China's support for Russia's defense industrial base and its impact on European and transatlantic security. He also emphasized Washington's "enduring commitment" to the complete denuclearization of the Korean Peninsula.

Tuesday's call was the first time Biden and Xi have talked since they met in northern California in November. There, they agreed on a range of steps to try to prevent increasingly fraught U.S.-China ties from slipping into conflict, including more frequent contact at the leader level, between militaries and beyond.

Ahead of the call, a senior administration official told reporters the conversation would not represent a change in U.S. policy toward China, and competition remains a key feature.

"Intense competition requires intense diplomacy to manage tensions, address misperceptions and prevent unintended conflict. And this call is one way to do that," said the official, who spoke on condition of anonymity as he was not permitted to speak on the record.

Biden raised perennial U.S. concerns about China's "unfair trade policies and non-market economic practices," according to the White House readout — an issue that will be front and center when Treasury Secretary Janet Yellen visits China later this week.

The president also reiterated to his Chinese counterpart that Washington will continue to "take necessary actions to prevent advanced U.S. technologies from being used to undermine our national security, without unduly limiting trade and investment," the White House readout said.
"""

In [None]:
summary = summarizer(example_news_article)
print(summary)

[{'summary_text': ' President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday . The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia\'s war in Ukraine . Tuesday\'s call was the first time Biden and Xi have talked since they met in northern California in November .'}]
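Like the classifier, the summarizer returns a list of dictionaries. The sketch below pulls out the `summary_text`, tidies the spacing, and wraps it for easier reading. The sample is a shortened copy of the output above, and the cleanup step is just one reasonable choice.

```python
import textwrap

# Shortened copy of the summarizer output shown above.
summary = [{'summary_text': ' President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday .'}]

# Pull the text out of the first (and only) result.
text = summary[0]["summary_text"].strip()

# This model leaves a space before each period; remove it for display.
text = text.replace(" .", ".")

wrapped = textwrap.fill(text, width=80)
print(wrapped)
```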


## What about chat bots?

Chat bots need models that have been trained on conversational text.

To get the next response in a conversational thread, you need to pass in the entire conversation up to that point.

Models often use special tokens like `<s>` and `</s>` to indicate where a sequence begins and ends, but the exact tokens differ from model to model. For BlenderBot, see: https://huggingface.co/docs/transformers/en/model_doc/blenderbot

In [3]:
from transformers import pipeline
text_gen = pipeline("text2text-generation", model="facebook/blenderbot-400M-distill")

Device set to use cpu


In [4]:
conversation = "<s>What is computer science?</s>"
result = text_gen(conversation)
print(result)

[{'generated_text': ' Computer science is a branch of mathematics that deals with computing.'}]


In [None]:
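# Append the model's reply to the history, wrapped in the same delimiters
conversation += "<s>"+result[0]["generated_text"]+"</s>"
# Add the next user message, then send the whole conversation back to the model
conversation += "<s>Is it only related to math?</s>"
result = text_gen(conversation)
print(result)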

[{'generated_text': ' Yes, it is the study of algorithms and the theory of computation.'}]


In [None]:
conversation += "<s>"+result[0]["generated_text"]+"</s>"
print(conversation)

<s>What is computer science?</s><s> Computer science is a branch of mathematics that deals with computing.</s><s>Is it only related to math?</s><s> Yes, it is the study of algorithms and the theory of computation.</s>
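The repeated string concatenation above can be wrapped in a small helper. This is a sketch, not part of the `transformers` library: `build_prompt` is a made-up name, and it assumes every turn is wrapped in `<s>`/`</s>` exactly as in the cells above.

```python
def build_prompt(turns):
    """Join every turn of the conversation, wrapping each one in <s>...</s>."""
    return "".join(f"<s>{turn}</s>" for turn in turns)


# The four turns from the conversation above, in order.
turns = [
    "What is computer science?",
    " Computer science is a branch of mathematics that deals with computing.",
    "Is it only related to math?",
    " Yes, it is the study of algorithms and the theory of computation.",
]

print(build_prompt(turns))
```

Keeping the turns in a list makes it easy to append new messages, or to drop old ones if the conversation grows too long for the model.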


**Challenge:** Take the above code and create a chatbot.

In [6]:
from transformers import pipeline
text_gen = pipeline("text2text-generation", model="facebook/blenderbot-400M-distill")
conversation = input("Ask me a question: ")
# Add delimiters
conversation = "<s>" + conversation + "</s>"
result = text_gen(conversation)
print(result)
# Continue from here to create a conversation

Device set to use cpu


Ask me a question: What is candy
[{'generated_text': ' Candy is a sweet treat that is usually sweetened with sugar or sugar substitutes.'}]
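One possible shape for the conversation loop is sketched below. To keep it runnable without downloading a model, it uses a stand-in function (`echo_model`, invented for this example) in place of `text_gen`; the function name `chat` and its parameters are also assumptions, not part of any library.

```python
def chat(generate, user_messages):
    """Run a conversation, resending the full history on every turn.

    `generate` is any callable that takes the conversation string and returns
    a list like [{'generated_text': ...}], such as the text_gen pipeline.
    """
    conversation = ""
    replies = []
    for message in user_messages:
        # Add the user's message, wrapped in the delimiters.
        conversation += "<s>" + message + "</s>"
        # Ask the model for a reply to the whole conversation so far.
        reply = generate(conversation)[0]["generated_text"]
        # Remember the reply so the next turn has the full context.
        conversation += "<s>" + reply + "</s>"
        replies.append(reply)
    return conversation, replies


def echo_model(conversation):
    """Stand-in for the real model: reports how many turns it was sent."""
    turn_count = conversation.count("<s>")
    return [{"generated_text": f" I have seen {turn_count} turn(s)."}]


conversation, replies = chat(echo_model, ["Hello", "How are you?"])
print(replies)
```

To try it with the real model, pass `text_gen` instead of `echo_model`, and replace the fixed list of messages with a loop that calls `input()` until the user types something like `quit`. Long conversations may exceed the model's input limit, so you may need to keep only the most recent turns.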


## Resources

Free NLP Textbook: Speech and Language Processing by Dan Jurafsky and James H. Martin
* https://web.stanford.edu/~jurafsky/slp3/
* great for theoretical and intuitive understanding of concepts

Hugging Face NLP Course: https://huggingface.co/learn/nlp-course/
* great for engineering/implementation

Course Materials: https://github.com/ericmanley/F23-CS195NLP
* Natural Language Processing course for undergrads that includes lots of implementation
* Includes Jupyter Notebooks like this one

Fine-Tuning Models for new data
* Hugging Face fine-tuning chapter: https://huggingface.co/learn/nlp-course/chapter3/1
* From my NLP course: https://github.com/ericmanley/F23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb
