<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 2: 🤗 Hugging Face

**Description:** 

Learners will use the 🤗 Hugging Face Transformers library to explore aspects of language models including:

* Text Generation
* Sentiment Analysis
* Named Entity Recognition
* Question Answering
* Summarization

We will primarily use two libraries: transformers and huggingface_hub's inference client.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics
* Pandas Basics

**Knowledge Recommended:** 
* Python Intermediate
* Pandas Intermediate

**Data Format:** None

**Libraries Used:** 
* [🤗 Transformers](https://huggingface.co/docs/transformers/index)- provides APIs and tools to easily download and train pretrained models
* [PyTorch](https://pytorch.org/)- a popular machine learning framework

**Research Pipeline:** None
___

# Hugging Face

Hugging Face is an online community focused on AI models, datasets, apps, and infrastructure. It hosts a large, searchable collection of openly shared models, which makes it a good starting point for finding and working with them.

This notebook works primarily with the 🤗 Hugging Face libraries [huggingface_hub](https://huggingface.co/docs/hub/index) and [transformers](https://huggingface.co/docs/transformers/index). The `huggingface_hub` library connects your code with a variety of resources including models, datasets, and the Inference API. The `transformers` library, not to be confused with the AI architecture called [transformer](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)), can download models, create pipelines for inference, train models, and fine-tune models.

## The Hugging Face Website

### Models

* Filtering model search- Searching for the right model? You can filter models in a large variety of ways.
* Model cards- Get background information on how a model was constructed and how to use it.
* Gated models- Request permission to use a particular model.
* Model files- See the file names and sizes that make up the model.
* Community- See, submit, and fix current issues in the model.
* Deploy- See example code for deploying the model in various environments.
* Use this model- See how to use the model with the `transformers` library.

### Apps and Demos
* Spaces- Build small demonstration apps using software development kits (SDKs) like Streamlit, Gradio, and Docker
* Inference API- Try a model right in the browser
* Huggingchat- Try a chat model right in the browser

### Documentation
* Docs- Learn more about how to use the Hugging Face libraries


## Choosing a Model
In the previous class, we used Jupyter AI and Jupyter Magics to interact with foundation models designed for general purpose tasks. Today, we will consider the wider variety of models available and how to get started with them. Here are some questions to consider.

### What is the task you are trying to accomplish?
There are models designed for a wide variety of research tasks. The models on Hugging Face work in a variety of modalities:

* **Natural Language Processing**: text classification, named entity recognition, question answering, summarization, translation, multiple choice, text generation, vector databases
* **Computer Vision**: image classification, object detection, segmentation
* **Audio**: automatic speech recognition, audio classification
* **Multimodal**: table question answering, optical character recognition, information extraction from scanned documents, video classification, visual question answering, text to audio/image/video

### Can you use an existing model?

Models are trained on a particular dataset. If your data is similar to the material in the model's training data, you may be able to use the model "off-the-shelf" without any changes. The process of using a model is called "inference", and it uses significantly fewer resources than training a model from scratch or "fine-tuning" an existing model.

### What kind of compute is necessary for the task?

If a model has already been trained to do your task, then you can simply run inference on the existing model. If your data is slightly different from the model's training data, you may be able to "fine-tune" it to work better with your data. If you need a model to do a new kind of task, you may need to train it from scratch.

Very large models and/or complex tasks require more resources. While simple models can be run locally on a modest laptop, a very big model or complex task may require a high-end computer or server-grade hardware. It's a fool's errand to use too small a model or low-grade hardware for a difficult task, but it is also a waste to use too large a model or too much hardware for a simple task. The best way to discover if you have a good fit is to try running the model on a small sample of your data before throwing all your data at it. Sometimes, it may make sense to use different models for different parts of the data or to fine-tune an existing model. Training a model from scratch can be reasonable if you have lots of high-quality, labeled data and access to significant compute resources.
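
As a rough way to reason about whether a model will fit on your hardware, you can estimate the memory needed just to hold its weights: the number of parameters multiplied by the bytes used per parameter. The sketch below is a back-of-the-envelope illustration only. The function name `estimate_weight_memory_gb` is hypothetical, the GPT-2 parameter count is approximate, and the estimate ignores activations, optimizer state, and framework overhead, so real usage will be higher.

```python
# Rough rule of thumb: memory for weights = parameters x bytes per parameter.
# Activations, optimizer state, and framework overhead are NOT included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def estimate_weight_memory_gb(num_params, dtype="float32"):
    """Return the approximate size of the model weights in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# The smallest GPT-2 model has roughly 124 million parameters (approximate)
gpt2_params = 124_000_000
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimate_weight_memory_gb(gpt2_params, dtype):.2f} GB")
```

Fine-tuning typically needs several times more memory than inference, so treat this number as a lower bound rather than a budget.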

# Installations

In [None]:
# Install 🤗 Transformers
!pip install transformers

In [None]:
# Install SentencePiece, a subword tokenizer and detokenizer for natural language processing
# that supports byte-pair encoding (BPE) and unigram language models
!pip install sentencepiece

In [None]:
# Install sacremoses, a Python port of the Moses tokenizer
!pip install sacremoses

# Import libraries

In [None]:
from transformers import pipeline, set_seed
from huggingface_hub import login
from huggingface_hub import InferenceClient
from huggingface_hub import AsyncInferenceClient
import pandas as pd
from pathlib import Path
pd.set_option('display.max_colwidth', None)


# Local models
For researchers, one of the primary benefits of working with models locally is more transparency about the code and model weights. Most recognizable tech companies share very little detail about their LLMs, including:

* How they are constructed
* What was in their training data
* The model weights
* The prompts and guardrails

This kind of opacity might be helpful for a commercial product, but it is also antithetical to the values of good research, including the [FAIR Guiding Principles](https://www.nature.com/articles/sdata201618) which assert data should be:
* Findable
* Accessible
* Interoperable
* Reusable

There are some additional advantages for using models locally:

* No API means no internet connection is required once the model is downloaded
* You have the model weights
* You can fine-tune the weights

The downsides are:

* Models are usually not state-of-the-art
* Large models are too big for local inference
* Fine-tuning may require expensive hardware

## Managing Local Models

Language models and datasets come in many sizes. The models and datasets for this notebook were tested on the given tasks, but for other models/tasks it is a good idea to check the file size and requirements. If you load or use a language model that is too big, you may fill all of the available space (30 GB) and/or memory (8 GB) in your lab. If the memory is full, try restarting the kernel (or restarting the lab). If the disk space is full, before deleting your own files, delete the .cache directory to clear out downloaded datasets and models from your space. You can do this by running the following code cell:


In [None]:
# Delete the .cache folder
!rm -r /home/jovyan/.cache/

In [None]:
# Check current disk space usage
!df -h /home/jovyan/

If you are familiar with the command line, you can use a terminal session to remove individual models and datasets. 🤗 Hugging Face stores them in the following places.

**Datasets**
```~/.cache/huggingface/datasets```

**Models**
```~/.cache/huggingface/hub```

See the `manage-disk-space.ipynb` notebook in the root directory for more information, strategies, and code examples.
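
Before deleting anything, it can help to see which downloads take up the most room. The following is a minimal sketch using only the standard library. The helper names `dir_size_bytes` and `report_cache_usage` are hypothetical, and the path passed in at the bottom assumes the default Hugging Face cache location, which may differ in your environment.

```python
from pathlib import Path

def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (0 if it does not exist)."""
    path = Path(path)
    if not path.exists():
        return 0
    # Skip symlinks so files that are linked from several places are counted once
    return sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )

def report_cache_usage(cache_root):
    """Return (folder name, size in bytes) pairs, largest first."""
    cache_root = Path(cache_root).expanduser()
    rows = []
    if cache_root.exists():
        for child in cache_root.iterdir():
            if child.is_dir():
                rows.append((child.name, dir_size_bytes(child)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Assumed default cache location; adjust if your setup stores models elsewhere
for name, size in report_cache_usage("~/.cache/huggingface/hub"):
    print(f"{size / 1e6:10.1f} MB  {name}")
```

If the folder does not exist, the loop simply prints nothing.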

___

# Hugging Face Transformers Pipelines

The Transformers library contains a variety of pipelines for common model tasks. The `pipeline()` function can help you accomplish a variety of tasks, including:

* `feature-extraction`- Extracting features from a model for transfer learning
* `text-classification`- Classifying texts into groups, including sentiment analysis
* `sentiment-analysis`- Classifying texts into positive or negative sentiment
* `token-classification`- Group tokens, including Named Entity Recognition (NER)
* `ner`- Finding named entities in a text
* `question-answering`- Answering questions, often based on context
* `fill-mask`- Predicting masked tokens
* `summarization`- Create a shortened summary of a longer document
* `translation_xx_to_yy`- Translation from language xx to language yy
* `text2text-generation`- Generate text from text instructions
* `text-generation`- Predictive text generation based on a starting prompt
* `zero-shot-classification`- Attempt to classify texts without additional training with labeled data
* `conversational`- Conversational responses

There are also pipelines for working with other common model tasks, such as automatic speech recognition, audio classification, text-to-speech, text-to-image, image-segmentation, etc.

## Text Generation
By default, the 🤗 Transformers library text generation pipeline uses the Generative Pre-trained Transformer 2 (GPT-2) model by [OpenAI](https://openai.com/). This is a precursor of GPT-3.5, the model used for ChatGPT. This model was released in 2019, and you can find more information by reading its [model card](https://huggingface.co/gpt2/tree/main) on the 🤗 Hugging Face website. We include here several parameters:

* `set_seed` Makes the text generation reproducible by supplying the same seed value each time.
* `prompt` The starting text that the text generator continues.
* `truncation=True` Truncates the input if it is longer than the model can accept.
* `max_length` The maximum length, in tokens, of the returned text (the prompt plus the generated continuation). More text requires more time, and the upper limit is defined by the model.
* `num_return_sequences` Allows more than one sequence to be returned for the prompt.

In [None]:
# Text Generation
generator = pipeline('text-generation', model='gpt2')

In [None]:
# Define a task function
def create_text(prompt):
    # set_seed(42)  # Uncomment to get the same output on every run
    output = generator(prompt, truncation=True, max_length=100, num_return_sequences=3)
    return output

In [None]:
# Pass a prompt into the task function
prompt = "In the year 2067, the American people were forced to admit"
create_text(prompt)
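
To see why a seed makes generated text repeatable, here is a toy sketch that uses only Python's `random` module. It is not how `transformers` is implemented. The word probabilities and the `sample_words` helper are made up for illustration, whereas a real model computes a new probability distribution from the prompt at every step.

```python
import random

# Toy next-word probabilities (invented for illustration)
next_word_probs = {"that": 0.5, "defeat": 0.3, "nothing": 0.2}

def sample_words(seed, n=5):
    """Sample n words from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return rng.choices(words, weights=weights, k=n)

print(sample_words(42))
print(sample_words(42))  # same seed, same words
print(sample_words(7))   # a different seed will usually give different words
```

The same idea applies to the generator above: with a fixed seed, the random choices made during sampling are repeated, so the output is the same on each run.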

## Sentiment Analysis

By default, the 🤗 Transformers library text-classification pipeline uses the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model. This model is based on a distilled, uncased version of [BERT](https://huggingface.co/bert-base-uncased) that has been fine-tuned on the [Stanford Sentiment Treebank 2](https://huggingface.co/datasets/sst2) (SST-2) dataset. The SST-2 dataset is a binary classification dataset for training models to learn the sentiment of words, phrases, and sentences. It contains 215,154 unique manually labeled texts of varying lengths. The model card describes SST-2:

> The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

In [None]:
# Sentiment Analysis Prompts

### Deadpool and Wolverine Reviews
# A positive movie review of Deadpool and Wolverine from Metacritic
movie_review_pos = """
Deadpool & Wolverine is pure joy and such an enjoyable, entertaining ride. From beginning to end, the film takes you on a journey filled with laughter, absurdity, and a good bit of heart. This is a love letter to the Fox Marvel Universe, Hugh Jackman and Wolverine, and ultimate fan service to the long-time comic book and film adaptation fans of these characters. While it can be a bit thin on the overall depth of story, it more than makes up for this with big action set pieces, great oneliners and jokes throughout, along with giving us some of the most exhilarating cameos the MCU has seen yet. If this is the kick-off of the mutant saga and more, fans are in for a great ride. Deadpool & Wolverine deserves to be seen over and over again in theaters.
"""

# A negative movie review of Deadpool and Wolverine from Metacritic
movie_review_neg = """
Feige’s mainstream instincts are easy to detect here. The prior Deadpool films were scuzzy and cobbled together, even as the budget grew; the cameos from other Marvel characters felt half-hearted and perfunctory, inclusions for Deadpool to roll his eyes at, not for fans to cheer over. Deadpool & Wolverine, on the other hand, has that bland MCU sheen that makes all of its movies look expensive but nonthreatening, happily accepting of mediocrity rather than attempting something artsy or daring.
"""

### Tomorrow and Tomorrow and Tomorrow Reviews
# A positive book review of Tomorrow and Tomorrow and Tomorrow from Goodreads
book_review_pos = """
Tomorrow, and Tomorrow, and Tomorrow, is a multilayered novel about friendship, love, and video games.
Sam and Sadie met when they are kids and quickly bonded over their love of video games. They develop a friendship that spans almost 30 years. The novel follows the highs and lows of their friendship, including falling in love, falling out, a love triangle, successes, and failures. Throughout it all, the one constant in their lives is video games.
The narrative alternates primarily between Sadie and Sam's POVs. Sam and Sadie are both loveable, arrogant, infuriating, and flawed. The dynamics of their friendship are complicated by love, jealousy, and misunderstanding. I got a little sick of the friends to frenemies cycle between Sadie and Sam (more of Sadie’s anger towards Sam, but I understood her point of view). I loved them, but I also wanted to shake some sense in them.
Sam’s mother, Anna; Marx, Sam’s college roommate; Dov, Sadie's professor; are some additional characters who make an impact. My favorite characters were Sam’s grandparents, Dong Hyun and Bong Cha.
The novel blends reality and game worlds, and parts of the narrative take place in a virtual open world.
All characters are well-developed and multidimensional. Even the avatars are multidimensional.
I am not a huge fan of video games, but this book made me nostalgic for the video games of my childhood. I got all of the Oregon Trail and Mario references, but there were times that I was a little lost, but I didn’t mind because I learned so much about gaming. The reader doesn’t need to know much about video games to enjoy this book (but it might help!). There are also a lot of 80s, 90s, and early 2000s pop culture references mixed in. I loved reading the details behind creating a game and the gaming industry as I was introduced to a whole new world.
This is a well-written, complex, thought-provoking, and original novel. I was invested in the characters, and some moments hit me on an emotional level. I got teary-eyed towards the end. I won't forget these characters; this is a book that is going to stay with me for a long time.
"""

# A negative book review of Tomorrow and Tomorrow and Tomorrow from Goodreads
book_review_neg = """
This book is so utterly pretentiousness and trying so hard to be woke that I should have given up on it instead of seeing it to the end. I would have if the beginning hadn’t been so beautifully done. There’s a line in the book about a video game sequel being awful because it was farmed out to Indian programmers who had no interest in the game and that’s how this book feels after the incredible start.

The story began with Sadie and Sam central to the story. Sam was the obviously the more sympathetic of the two and the one you as a reader care about. Sadie was often annoying and then fell apart in a ridiculous way. I hoped her awful college self with the horrible college boyfriend would evolve and grow up but she never does. Even worse for the story is the tangents that from that point became the story. We suddenly get a new character who is rightly called boring later on. He is a NPC. He’s just too good and uninteresting to take up so much space. We get his backstory we don’t need. In a similar way later on we get two new characters that happen to be gay that bring nothing to the story other than a celebration of their sexuality which apparently is worth their inclusion. Much like tangents about their game that take up unnecessary page time and continue to dilute any attempt at storytelling. There’s plenty of politics, even to a ridiculously degree like actual comical bad guys intent on violence against those in favor of gay relationships and marriage. Ironically for a book full of wokeness with characters never being straight, celebrating gender fluidity, the book managed to ridicule cultural appropriation. The book is very focused on the race of the characters but never explores them in more than a superficial way.
One of the author’s worst faults was her pretentious word choices. Instead of writing in way that flowed she chose to constantly check her thesaurus for jarring words like jejune and verdigris every couple of pages. Ironically much like the criticism of a game her character created this book is pretentious and full of itself. The worst part is that could have been amazing if it had stayed as focused as it was in the beginning. This is not a story worth the journey so do not push play. I received a complimentary copy of this book.
"""

### Dave the Diver
# A positive video game review of Dave the Diver from Metacritic
game_review_pos = """
This is demanding work, but the game’s distinct but complementary loops of playful labour are highly compelling. The satisfaction of completing a challenging dive without needing to be rescued, then watching the rave reviews on “Cooksta” pour in, is profound. Stylish, witty and exquisitely designed, Dave the Diver uses several hooks to achieve its goal, while establishing the relationship between the food we eat and the world from which its harvested with useful urgency.
"""

# A negative video game review of Dave the Diver from Metacritic
game_review_neg = """
I thought it would be fun, but the exposition/mechanic dump in the opening hours really soured my experience. I spent as much time watching cutscenes and having dialogue spewed at me as playing the game, it felt like. The alternating management/action sections sounded interesting in the reviews I watched, but playing them was much slower and more monotonous than I would have anticipated. It's definitely the kind of game for which I'm thankful that Steam offers a 2 hour refund window.
"""

In [None]:
# Sentiment Analysis
classifier = pipeline("text-classification")

In [None]:
# Define a task function
def classify_sentiment(prompt):
    output = classifier(prompt)
    return output

In [None]:
# Pass a prompt into the task function
classify_sentiment(game_review_neg)
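
The text-classification pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below shows one way to collect results for several reviews into rows for comparison. So that it runs without downloading a model, it uses `fake_classifier`, a made-up stand-in that only mimics the output shape. Both `fake_classifier` and `tabulate_sentiment` are hypothetical names, and the scores are invented.

```python
def tabulate_sentiment(classify, reviews):
    """Run `classify` on each named review and collect label and score."""
    rows = []
    for name, text in reviews.items():
        result = classify(text)[0]  # one dictionary per input text
        rows.append({
            "review": name,
            "label": result["label"],
            "score": round(result["score"], 3),
        })
    return rows

def fake_classifier(text):
    """Stand-in that mimics the output shape of a text-classification pipeline."""
    label = "NEGATIVE" if "soured" in text else "POSITIVE"
    return [{"label": label, "score": 0.99}]  # invented score

sample_reviews = {
    "game_pos": "Stylish, witty and exquisitely designed.",
    "game_neg": "The opening hours really soured my experience.",
}
for row in tabulate_sentiment(fake_classifier, sample_reviews):
    print(row)
```

To use the real model, pass the `classifier` pipeline in place of `fake_classifier`. Models have a maximum input length, so very long reviews may need to be shortened or truncated first.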

## Named Entity Recognition

The `aggregation_strategy` parameter defines the strategy used to group entity tokens together, like "New York". Remember, the tokenization may also be at the subword level, so you could see "Microsoft" broken up into "Micro" and "soft". Additional aggregation strategies such as "first", "average", and "max" are discussed in the 🤗 Transformers [documentation](https://huggingface.co/transformers/v4.7.0/_modules/transformers/pipelines/token_classification.html).

In [None]:
# Named Entity Recognition
ner_tagger = pipeline("ner", aggregation_strategy="simple")

In [None]:
# Define a task function
def extract_entities(prompt):
    output = ner_tagger(prompt)
    return pd.DataFrame(output)

In [None]:
# Pass the prompt into the task function
extract_entities(movie_review_pos)
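
To make the idea of aggregation concrete, here is a simplified sketch of grouping token-level predictions into whole entities. It is an illustration under stated assumptions, not the library's implementation: it assumes tags of the form `B-XXX` (entity start) and `I-XXX` (entity continuation), assumes subword pieces are marked with a leading `##`, and averages the token scores. The function name `merge_entity_tokens` and the toy predictions are hypothetical.

```python
def merge_entity_tokens(tokens):
    """Merge consecutive tokens that belong to the same entity.

    Simplified rules: a 'B-' tag starts a new entity, an 'I-' tag with the
    same type continues the previous one, and '##' marks a subword piece.
    """
    entities = []
    for tok in tokens:
        prefix, _, group = tok["entity"].partition("-")
        continues = (
            bool(entities)
            and prefix == "I"
            and entities[-1]["entity_group"] == group
        )
        if continues:
            current = entities[-1]
            if tok["word"].startswith("##"):
                current["word"] += tok["word"][2:]    # glue subword on
            else:
                current["word"] += " " + tok["word"]  # separate whole words
            current["scores"].append(tok["score"])
        else:
            entities.append({
                "entity_group": group,
                "word": tok["word"],
                "scores": [tok["score"]],
            })
    return [
        {
            "entity_group": e["entity_group"],
            "word": e["word"],
            "score": sum(e["scores"]) / len(e["scores"]),
        }
        for e in entities
    ]

# Toy token-level predictions (invented for illustration)
toy_tokens = [
    {"word": "Micro", "entity": "B-ORG", "score": 0.98},
    {"word": "##soft", "entity": "I-ORG", "score": 0.96},
    {"word": "New", "entity": "B-LOC", "score": 0.99},
    {"word": "York", "entity": "I-LOC", "score": 0.97},
]
for entity in merge_entity_tokens(toy_tokens):
    print(entity)
```

Here the four tokens collapse into two entities, "Microsoft" and "New York", which is the kind of grouping an aggregation strategy performs for you.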

## Question Answering

The Question Answering pipeline has two required parameters: 

* `question` The question being asked
* `context` The source material that should be used to answer this question

In [None]:
# Question Answering
reader = pipeline("question-answering")

In [None]:
# Define a task function
def answer_question(question, context):
    output = reader(question=question, context=context)
    return pd.DataFrame([output])

In [None]:
# Pass the prompt into the task function
question = "What are the best parts?"
answer_question(question, movie_review_pos)
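
The question-answering pipeline is extractive: its output includes `start` and `end` character positions that point back into the context, alongside the `answer` and a `score`. The sketch below shows how those positions can be used to view an answer in its surrounding text. The result dictionary is hand-built to mimic the output shape (the score is invented), and `show_answer_in_context` is a hypothetical helper.

```python
def show_answer_in_context(context, result, window=20):
    """Return the answer in brackets with `window` characters on each side."""
    start, end = result["start"], result["end"]
    before = context[max(0, start - window):start]
    after = context[end:end + window]
    return f"...{before}[{context[start:end]}]{after}..."

context = "The film is thin on story but has big action set pieces and great jokes."
answer = "big action set pieces"
start = context.index(answer)

# Hand-built stand-in for a pipeline result; the score is invented
fake_result = {
    "score": 0.5,
    "start": start,
    "end": start + len(answer),
    "answer": answer,
}
print(show_answer_in_context(context, fake_result))
```

With a real result, pass the dictionary returned by `reader(question=..., context=...)` together with the same context string.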

## Summarization

The `clean_up_tokenization_spaces` parameter removes extraneous spaces created through the detokenization process. If tokenization breaks up a string into separate tokens, then detokenization joins together a series of tokens into a string.

In [None]:
# Summarization
summarizer = pipeline("summarization")

In [None]:
# Define a task function
def summarize(text):
    outputs = summarizer(text, max_length=75, clean_up_tokenization_spaces=True)
    return outputs[0]['summary_text']

In [None]:
# Pass the prompt into the task function
summarize(movie_review_pos)

## Translation

The translation pipeline may have length limitations based on the model selected. If your text is long, you may need to break it up into smaller chunks for analysis.

In [None]:
# Translation
ger_translator = pipeline('translation_en_to_de', model="Helsinki-NLP/opus-mt-en-de")

In [None]:
# Define a task function
def translate_to_german(text):
    output = ger_translator(text, clean_up_tokenization_spaces=True, min_length=100)
    return output

In [None]:
# Pass the prompt into the task function
translate_to_german(movie_review_pos)
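
One simple way to handle the length limitation is to split a long text at sentence boundaries and translate each chunk separately. The sketch below uses only the standard library. The helper `chunk_text` is hypothetical, the 400-character default is an arbitrary assumption, and character counts are only a rough proxy for the token limits that models actually enforce.

```python
import re

def chunk_text(text, max_chars=400):
    """Split text into chunks of roughly max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits (or is the first sentence)
        else:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one
    if current:
        chunks.append(current)
    return chunks

sample = "One. Two! Three? Four."
print(chunk_text(sample, max_chars=10))
```

Each chunk could then be passed to `translate_to_german` in a loop and the results joined back together.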

# Clear the memory and cache
It is a good practice to clear your cache and any variables in memory after using a notebook that loads a significant amount of data.

In [None]:
# Remove all variables from memory
%reset -f

In [None]:
# Delete the .cache folder
!rm -r /home/jovyan/.cache/

# Using the huggingface_hub `InferenceClient`

If you want to use large or state-of-the-art models, then using an API is often the best choice. Hugging Face offers API inference in two different services:

* [Serverless Inference API](https://huggingface.co/docs/api-inference/index)- Small cost to access through PRO plan (~$9/month), but provides shared resources suitable for research and prototyping
* [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index)- More expensive, dedicated, auto-scaling infrastructure designed for enterprise and production applications

The cost of Inference Endpoints depends on the provider (AWS, MS Azure, Google Cloud) and the hardware (from \$0.03/hr to more than \$100/hr). Assuming you already have a PRO account, you can log in and supply an access token to get started.

## Log in to InferenceClient()

In [None]:
# Log in using an access token
login()

## Generate images with `text_to_image`
Just like `transformers` has pipelines, the `InferenceClient` has methods for [a variety of tasks](https://huggingface.co/docs/huggingface_hub/v0.24.2/en/package_reference/inference_client#huggingface_hub.InferenceClient) such as:

* `document_question_answering`
* `feature_extraction`
* `image_to_text`
* `text_classification`
* `translation`

and many more for audio and images. Let's try a `text_to_image` example:

In [None]:
# Use the default text to image inference model
client = InferenceClient()
client.text_to_image("An astronaut riding a horse on mars")

In [None]:
# Specify a text to image model
client = InferenceClient(model="prompthero/openjourney-v4")

# Save the image locally
image = client.text_to_image("an astronaut riding a horse on mars")
image.save("astronaut.png")

In [None]:
# Pass the model to text_to_image instead of InferenceClient
client = InferenceClient()
client.text_to_image("an astronaut riding a horse on mars", model="prompthero/openjourney-v4")

We can choose another model and add additional inference steps to increase the quality of the output.

In [None]:
# A higher quality example from Stable Diffusion
client = InferenceClient(model="stabilityai/stable-diffusion-xl-base-1.0")
client.text_to_image("A photograph of a shark jumping out of a swimming pool", num_inference_steps=100)

For additional parameters, see the `text_to_image` [documentation](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_to_image).
___


## Sentiment Analysis with `text_classification`
If we want to do sentiment analysis, we could change to a `text_classification` task and specify a model for sentiment analysis.

In [None]:
# Specify a text_classification model for sentiment analysis
client = InferenceClient(model="cardiffnlp/twitter-roberta-base-sentiment-latest")
client.text_classification("This guy is a jerk. I would like to beat him up. He ruined my party and ate all my ice cream.")

In [None]:
# Specify a text_classification model for emotion analysis
client = InferenceClient(model="SamLowe/roberta-base-go_emotions")
client.text_classification("This guy is a jerk. I would like to beat him up. He ruined my party and ate all my ice cream.")

## Named Entity Recognition with `token_classification`
Similarly, we can quickly do Named Entity Recognition using `token_classification`.

In [None]:
# Specify a token_classification model for NER
client = InferenceClient(model='dslim/bert-base-NER')
client.token_classification(book_review_pos)

## Translate with `translation`
Here are some examples using translation.

In [None]:
# Specify a translation model: English to French
client = InferenceClient(model="facebook/mbart-large-50-many-to-many-mmt")
client.translation("I would like to eat ice cream in the Louvre, but they told me to jump in the Seine.", src_lang="en_XX", tgt_lang="fr_XX")

In [None]:
# Specify a translation model: English to Hindi
client = InferenceClient(model="facebook/mbart-large-50-many-to-many-mmt")
client.translation("I would like to eat ice cream in the National Museum, but they told me to jump in the Yamuna.", src_lang="en_XX", tgt_lang="hi_IN")

## Classify without labeled data with `zero_shot_classification`

In [None]:
# Use the default zero-shot classification model
client = InferenceClient()
client.zero_shot_classification(
    text=book_review_neg,
    labels=["positive", "negative", "pessimistic", "optimistic", "indifferent"],
    multi_label=True,
    hypothesis_template="This text is {} towards Sadie"
)
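
It can help to see what the `hypothesis_template` does with the labels. Zero-shot classifiers of this kind are typically built on natural language inference models: each label is inserted into the template to form a sentence, and the model scores how well the text supports that sentence. The sketch below only shows the string formatting step, not the model itself.

```python
labels = ["positive", "negative", "pessimistic", "optimistic", "indifferent"]
hypothesis_template = "This text is {} towards Sadie"

# Each label fills the {} placeholder to produce one hypothesis sentence
hypotheses = [hypothesis_template.format(label) for label in labels]
for hypothesis in hypotheses:
    print(hypothesis)
```

Changing the wording of the template changes the sentences the model is asked to judge, so it is worth experimenting with phrasing that matches your research question.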


## Convert text to speech with `text_to_speech`
Here we use the asynchronous version of the inference client. With `await`, Python pauses this task until the results come back from the server, without blocking other asynchronous work in the meantime.

In [None]:
# Specify a text to audio model
# Run this model async
client = AsyncInferenceClient(model='suno/bark')
audio = await client.text_to_speech("I would like to eat ice cream in the Louvre, but they told me to jump in the Seine!")
Path("ice_cream.flac").write_bytes(audio)
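
The benefit of an asynchronous client shows up when several requests are in flight at once. The sketch below illustrates the pattern with `asyncio` from the standard library. `fake_request` is a made-up stand-in that only sleeps, so no network access or account is needed; in real code, each call would be an awaited client method such as `text_to_speech`.

```python
import asyncio
import time

async def fake_request(name, delay):
    """Stand-in for a network call; it only sleeps for `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    """Start two requests together and wait for both to finish."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_request("first", 0.2),
        fake_request("second", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed
```

In a notebook, run `results, elapsed = await main()`. In a standalone script, use `results, elapsed = asyncio.run(main())` instead. Because the two waits overlap, the elapsed time should be close to the longest single delay rather than the sum of both.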