# Overview
- Learn about the structure of Hugging Face
- Understand the difference between the pipeline API and the Auto classes
- Distinguish encoder, decoder, and encoder-decoder models
- Cover common NLP tasks
    - Text classification
    - Sentiment analysis
    - Text summarization
    - Text translation
    - Text generation
    - Question answering
- Evaluate LLM responses

# What is Hugging Face
- Hugging Face is a home for the AI community where people share open-source models, datasets, and applications
- It offers two types of inference: local inference (free and convenient, but slow) and inference providers (fast, but paid)
- Hugging Face introduced the Transformers library to simplify working with pre-trained models. There are two ways to use it:
    - Use pipeline: easy and quick, but difficult to customize
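    - Use the Auto classes (`AutoTokenizer`, `AutoModelForSequenceClassification`, ...): load the tokenizer and model directly, which takes more code but gives more control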

## Use pipeline

### Text generation

In [7]:
from transformers import pipeline

gpt2_pipeline = pipeline(task="text-generation", model="openai-community/gpt2")

print(gpt2_pipeline("What if AI"))

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'What if AI had made people do something about their own lives?\n\nAI, as a species, is evolving rapidly, and its capabilities are becoming increasingly sophisticated.\n\nBut that evolution is being accompanied by a fundamental change in our thinking. The concept of "intelligence" is no longer a technical concept, but a philosophical one. We are becoming more and more aware of the power of human thought, and our thinking is beginning to take this into account.\n\nWhat are the implications of this shift?\n\nMost people don\'t realize the magnitude of this change. For example, humans are constantly in danger of being replaced by robots.\n\nBut many people don\'t realize that, and many of us have already invested our lives in learning how to use algorithms or other technologies to help us find and solve problems that threaten our freedom and our safety.\n\nThe implications are profound.\n\nThe U.S. government has been using AI to help solve problems that we worry about

In [9]:
results = gpt2_pipeline("What if AI", max_new_tokens=10, num_return_sequences=2)

for result in results:
    print(result['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What if AI could have a really good heart? That could be
What if AI was completely non-linear?

A.
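The pipeline returns the prompt followed by the continuation in `generated_text`. The following is a minimal sketch, assuming you only want the newly generated part. The helper `continuation` is hypothetical (it is not part of Transformers) and simply strips the prompt prefix; the sample result reuses one of the sequences printed above.

```python
def continuation(prompt: str, generated_text: str) -> str:
    """Return only the text that follows the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix
    return generated_text


prompt = "What if AI"
# One of the sequences printed by the cell above
result = {"generated_text": "What if AI could have a really good heart? That could be"}

new_text = continuation(prompt, result["generated_text"])
print(repr(new_text))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give the same effect without a helper.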


### Text classification
Some models that support this task:
- `abdulmatinomotoso/English_Grammar_Checker`: checks whether a sentence is grammatical
- `cross-encoder/qnli-electra-base`: checks whether a sentence answers a question
- `facebook/bart-large-mnli` (used with the `zero-shot-classification` task): categorizes text into classes supplied at inference time


In [10]:
# Create a pipeline for grammar checking
grammar_checker = pipeline(
  task="text-classification", 
  model="abdulmatinomotoso/English_Grammar_Checker"
)

# Check grammar of the input text
output = grammar_checker("I will walk dog")
print(output)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cpu


[{'label': 'LABEL_0', 'score': 0.9956323504447937}]




In [11]:
# Check grammar of the input text
output = grammar_checker("I love you")
print(output)

[{'label': 'LABEL_1', 'score': 0.9991169571876526}]
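The checker returns generic labels. Judging only from the two outputs above ("I will walk dog" gives `LABEL_0`, "I love you" gives `LABEL_1`), `LABEL_0` appears to mean ungrammatical and `LABEL_1` grammatical. The sketch below makes that mapping explicit; the mapping and the helper `readable` are assumptions for illustration and should be checked against the model card.

```python
# Assumed mapping, inferred from the two outputs above; verify against the model card
GRAMMAR_LABELS = {"LABEL_0": "ungrammatical", "LABEL_1": "grammatical"}


def readable(prediction: dict) -> str:
    """Format one pipeline prediction with a human-readable label."""
    name = GRAMMAR_LABELS.get(prediction["label"], prediction["label"])
    return f"{name} ({prediction['score']:.1%})"


# Predictions copied from the outputs above
print(readable({"label": "LABEL_0", "score": 0.9956323504447937}))
print(readable({"label": "LABEL_1", "score": 0.9991169571876526}))
```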


In [12]:
# Create the pipeline
classifier = pipeline(task="text-classification", model="cross-encoder/qnli-electra-base")

# Predict the output
output = classifier("Where is the capital of France?, Brittany is known for its stunning coastline.")

print(output)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu


[{'label': 'LABEL_0', 'score': 0.016212010756134987}]
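The cell above joins the question and the candidate sentence into a single string. Cross-encoders are normally given the two texts as separate segments, and the text-classification pipeline accepts a dict with `text` and `text_pair` keys for that purpose. The sketch below shows that input form; `as_pair` and `score_pair` are hypothetical helper names, and the score it produces is not expected to match the output above.

```python
def as_pair(question: str, sentence: str) -> dict:
    """Build the dict form that the text-classification pipeline accepts for text pairs."""
    return {"text": question, "text_pair": sentence}


def score_pair(question: str, sentence: str):
    """Score a (question, sentence) pair; downloads the model on first use."""
    # Imported lazily so that as_pair can be used without loading transformers
    from transformers import pipeline

    classifier = pipeline(task="text-classification", model="cross-encoder/qnli-electra-base")
    return classifier(as_pair(question, sentence))


pair = as_pair(
    "Where is the capital of France?",
    "Brittany is known for its stunning coastline.",
)
print(pair)
# To run the model: print(score_pair(pair["text"], pair["text_pair"]))
```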


In [13]:
text = "AI-powered robots assist in complex brain surgeries with precision."

# Create the pipeline
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# Create the categories list
categories = ["politics", "science", "sports"]

# Predict the output
output = classifier(text, categories)

# Print the top label and its score
print(f"Top Label: {output['labels'][0]} with score: {output['scores'][0]}")

Device set to use cpu


Top Label: science with score: 0.9510334730148315
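The zero-shot pipeline returns a dict whose `labels` and `scores` lists are aligned and sorted from most to least likely, which is why index 0 is the top label. The sketch below adds a confidence cutoff on top of that. The helper `top_label` and the 0.5 threshold are hypothetical, and in the sample dict only the first score is taken from the output above; the other two scores and the order of the remaining labels are placeholders.

```python
def top_label(output: dict, threshold: float = 0.5):
    """Return (label, score) for the best label, or None if it is below the threshold."""
    label, score = output["labels"][0], output["scores"][0]
    return (label, score) if score >= threshold else None


# Shape of a zero-shot result; only the first score comes from the output above,
# the remaining two are illustrative placeholders
sample = {
    "sequence": "AI-powered robots assist in complex brain surgeries with precision.",
    "labels": ["science", "sports", "politics"],
    "scores": [0.9510334730148315, 0.029, 0.0199],
}

print(top_label(sample))
print(top_label(sample, threshold=0.99))
```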


In [20]:
# Use a pipeline as a high-level helper for sentiment analysis
from transformers import pipeline

pipe = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# This SST-2 model predicts POSITIVE or NEGATIVE, so no candidate label list is needed
output = pipe("I love using Hugging Face!")
print(output)



Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9997085928916931}]


### Text summarization

In [22]:
original_text = "'\nGreece has many islands, with estimates ranging from somewhere around 1,200 to 6,000, depending on the minimum size to take into account. The number of inhabited islands is variously cited as between 166 and 227.\nThe Greek islands are traditionally grouped into the following clusters: the Argo-Saronic Islands in the Saronic Gulf near Athens; the Cyclades, a large but dense collection occupying the central part of the Aegean Sea; the North Aegean islands, a loose grouping off the west coast of Turkey; the Dodecanese, another loose collection in the southeast between Crete and Turkey; the Sporades, a small tight group off the coast of Euboea; and the Ionian Islands, chiefly located to the west of the mainland in the Ionian Sea. Crete with its surrounding islets and Euboea are traditionally excluded from this grouping.\n'"
# Create the summarization pipeline
summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum")

# Summarize the text
summary_text = summarizer(original_text)

# Compare the lengths (in characters, not tokens)
print(f"Original text length: {len(original_text)}")
print(f"Summary length: {len(summary_text[0]['summary_text'])}")
# Print the summary
print(f"Summary: {summary_text[0]['summary_text']}")

Device set to use cpu


Original text length: 831
Summary length: 438
Summary: the number of inhabited islands is diversely cited as between 166 and 227. The Greek islands are traditionally grouped into the following clusters: the Argo-Saronic Islands in the Saronic Gulf near Athens; the Cyclades, a large but dense collection occupying the central part of the Aegean Sea; the North Aegesan islands, . a loose grouping off the west coast of Turkey; the Dodecanese, another loose collection in the southeast between C
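The two lengths printed above are character counts, so the summary is roughly half the size of the original. A minimal sketch of that comparison as a ratio follows; `compression_ratio` is a hypothetical helper, and the numbers are the ones printed by the cell above.

```python
def compression_ratio(original_length: int, summary_length: int) -> float:
    """Summary length as a fraction of the original length (both in characters)."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return summary_length / original_length


# Character counts printed by the cell above
ratio = compression_ratio(831, 438)
print(f"Summary is {ratio:.1%} of the original length")
```

Note that the summary stops mid-word ("between C"), which suggests the output was cut off by the generation length limit rather than finished naturally.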


In [24]:
# Create a short summarizer
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_new_tokens=1, max_new_tokens=10)

# Summarize the input text
short_summary_text = short_summarizer(original_text)

# Print the short summary
print(short_summary_text[0]["summary_text"])

# Repeat for a long summarizer
long_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_new_tokens=50, max_new_tokens=150)

long_summary_text = long_summarizer(original_text)

# Print the long summary
print(long_summary_text[0]["summary_text"])

Device set to use cpu


the number of inhabited islands is diversely 


Device set to use cpu


the number of inhabited islands is diversely cited as between 166 and 227. The Greek islands are traditionally grouped into the following clusters: the Argo-Saronic Islands in the Saronic Gulf near Athens; the Cyclades, a large but dense collection occupying the central part of the Aegean Sea; the North Aegesan islands, . a loose grouping off the west coast of Turkey; the Dodecanese, another loose collection in the southeast between C


## Use Auto classes

In [None]:
# Load the tokenizer and model directly instead of going through a pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love using Hugging Face!", return_tensors="pt")
outputs = model(**inputs)   
print(outputs.logits)

tensor([[-3.9266,  4.2142]], grad_fn=<AddmmBackward0>)
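Unlike the pipeline, the model returns raw logits. To get a label and score, apply softmax and look up the label name for the highest-probability index. The sketch below does this in plain Python with the logits printed above. The `id2label` mapping is an assumption here; with a loaded model it would normally be read from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed by the cell above
logits = [-3.9266, 4.2142]
# Assumed mapping; with a loaded model, read it from model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": round(probs[best], 4)})
```

The resulting score agrees with the roughly 0.9997 that the pipeline reported for the same sentence.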
