<a href="https://colab.research.google.com/github/jchen8000/DemystifyingLLMs/blob/main/4_Pre-Training/Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4. Pre-Training

This script uses Hugging Face datasets. As a prerequisite, you need a Hugging Face account and an access token; see https://huggingface.co/docs/hub/security-tokens. Add the token to Colab Secrets as HF_TOKEN.
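
As a minimal sketch of how the token could be read at runtime, the helper below tries Colab Secrets first and falls back to an environment variable of the same name. The function name `get_hf_token` is hypothetical, and the environment-variable fallback is an assumption for running outside Colab; it is not part of the original notebook.

```python
import os


def get_hf_token():
    """Return the Hugging Face token, or None if it is not configured."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this
        # notebook: fall back to an environment variable (an assumption).
        return os.environ.get("HF_TOKEN")


token = get_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN not set")
```

If no token is found, public models and datasets can still be loaded, since authentication is recommended but optional for them.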


## 4.10 Pipelines

In [None]:
%pip install -q transformers==4.57.3

### Translation Example:

In [None]:
from transformers import pipeline

# Load the translation pipeline
translator = pipeline("translation_en_to_fr")


No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pa
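
The warning above recommends naming the model and revision explicitly. One way to do that, sketched here, is to pass the same model name and revision that the log reports as the default. The dictionary name `PINNED_TRANSLATION` and the helper `load_pinned_translator` are hypothetical; the values are copied from the log output.

```python
# Model name and revision are copied from the warning printed above.
PINNED_TRANSLATION = {
    "task": "translation_en_to_fr",
    "model": "google-t5/t5-base",
    "revision": "686f1db",
}


def load_pinned_translator(spec=PINNED_TRANSLATION):
    """Build a pipeline from an explicit task, model, and revision."""
    # Imported lazily so that nothing is loaded until the helper is called.
    from transformers import pipeline
    return pipeline(**spec)


# Usage (downloads the model on first call):
# translator = load_pinned_translator()
```

Pinning the revision should keep results reproducible even if the task default changes in a later `transformers` release.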

In [None]:
# Input text to translate
text = "Will you help me with my homework?"

# Translate the text
translation = translator(text)[0]["translation_text"]

print(f"English: {text}")
print(f"French: {translation}")

English: Will you help me with my homework?
French: Avez-vous de l'aide pour mes devoirs?
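
The call above indexes `[0]["translation_text"]` because the pipeline returns a list of dictionaries. A rough sketch of translating several sentences in one call follows; `translate_batch` is a hypothetical helper, and it assumes the pipeline returns one result per input string when given a list.

```python
def translate_batch(translator, texts):
    """Translate several strings in one call and return plain strings."""
    results = translator(list(texts))
    translations = []
    for item in results:
        # Some configurations may nest each result in its own list; unwrap if so.
        if isinstance(item, list):
            item = item[0]
        translations.append(item["translation_text"])
    return translations


# Usage with the pipeline loaded earlier:
# translate_batch(translator, ["Good morning.", "Thank you very much."])
```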


### Question-Answer Example:

In [None]:
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
context = """
The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials,
generally built along an east-to-west line across the historical northern borders of China in part to protect the
Chinese states and empires against the raids and invasions of the various nomadic groups from the Eurasian Steppe.
"""

question = "What is the Great Wall of China made of?"

# Use the pipeline to answer the question
answer = qa_pipeline(question=question, context=context)

print(f"Question: {question}")
print(f"Answer: {answer['answer']}")
print(f"Score: {answer['score']}")

Question: What is the Great Wall of China made of?
Answer: stone, brick, tamped earth, wood, and other materials
Score: 0.7061177492141724
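
The score is the model's confidence in the extracted span, so it can be used to reject weak answers. The sketch below shows one possible way to do this; the helper name `answer_or_none` and the 0.5 threshold are illustrative assumptions, not recommendations from the library.

```python
def answer_or_none(result, min_score=0.5):
    """Return the answer text, or None when the score is below min_score."""
    if result["score"] < min_score:
        return None
    return result["answer"]


# Values copied from the output above.
example = {
    "answer": "stone, brick, tamped earth, wood, and other materials",
    "score": 0.7061177492141724,
}

print(answer_or_none(example))                 # the answer text
print(answer_or_none(example, min_score=0.9))  # None
```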


### Sentiment Analysis Example:

In [None]:
# Name the task and the model explicitly. This model predicts three labels
# (negative, neutral, positive).
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
prompt1 = """
Customer Review: 'The shipping was quick and the item was perfect. Totally satisfied!'.
"""
sentiment_pipeline(prompt1)

[{'label': 'positive', 'score': 0.9825846552848816}]

In [None]:
prompt2 = """
Customer Review: 'The restaurant was terrible, and the service was even worse. Not going back there again.'
"""
sentiment_pipeline(prompt2)

[{'label': 'negative', 'score': 0.9405812621116638}]

In [None]:
prompt3 = """
Customer Review: 'This restaurant was clean in general, but it took a while to get my foods served.'
"""
sentiment_pipeline(prompt3)

[{'label': 'neutral', 'score': 0.5042181015014648}]
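
A score near 0.5 suggests the model is not strongly committed to the top label. One way to look closer, sketched below, is to request scores for every label and compare the top two. Passing `top_k=None` to the pipeline call is assumed here to return all label scores; `top_two_margin` is a hypothetical helper, and the sample scores are made up for illustration, not model output.

```python
def top_two_margin(scores):
    """Return (best_label, margin), where margin is best minus runner-up score."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    if len(ranked) < 2:
        return ranked[0]["label"], ranked[0]["score"]
    return ranked[0]["label"], ranked[0]["score"] - ranked[1]["score"]


# Made-up scores for illustration only (not produced by the model).
sample_scores = [
    {"label": "negative", "score": 0.25},
    {"label": "neutral", "score": 0.5},
    {"label": "positive", "score": 0.25},
]
label, margin = top_two_margin(sample_scores)
print(label, margin)

# Usage with the pipeline above (assumes top_k=None returns every label):
# all_scores = sentiment_pipeline(prompt3, top_k=None)
# label, margin = top_two_margin(all_scores)
```

A small margin is a hint that the review mixes positive and negative signals, as the third example does.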