<a href="https://colab.research.google.com/github/harnalashok/LLMs/blob/main/transformers_and_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<iframe src="https://www.kaggle.com/embed/aliabdin1/llm-01-how-to-use-llms-with-hugging-face?cellIds=4&kernelSessionId=140351055" height="300" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="⚡LLM 01 - How to use LLMs with Hugging Face⚡"></iframe>


In [None]:
# Last amended: 3rd April, 2024
# Ref: Kaggle: https://www.kaggle.com/code/aliabdin1/llm-01-how-to-use-llms-with-hugging-face?scriptVersionId=140351055

## Log into huggingface using your token:
See [this video](https://www.youtube.com/watch?v=mn_hdJ5w92A) for how to supply your token in Colab when logging in to Hugging Face.

In [1]:
# 0.0 Login to huggingface using your token:
#     Keep your token ready:

! pip install huggingface_hub
from huggingface_hub import notebook_login
notebook_login()




## Common LLM applications

The goal of this section is to get your feet wet with several LLM applications and to show how easy it can be to get started with LLMs.

As you go through the examples, note the datasets, models, APIs, and options used. These simple examples can be starting points when you need to build your own application.


In [2]:
# 0.1 Sacremoses is for the translation model Helsinki-NLP/opus-mt-en-es
%pip install sacremoses==0.0.53

Collecting sacremoses==0.0.53
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895240 sha256=9fa2205f98aae84f50b1d3a5f7205b1e796f87f396e32508f8b9cfa48d4dbfc0
  Stored in directory: /root/.cache/pip/wheels/00/24/97/a2ea5324f36bc626e1ea0267f33db6aa80d157ee977e9e42fb
Successfully built sacremoses
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.53


### About Accelerate

Each distributed training framework has its own way of doing things, which can require writing a lot of custom code to adapt it to your PyTorch training code and training environment. Accelerate offers a friendly way to interface with these frameworks without having to learn the specific details of each one, so you can focus on the training code and scale it to any distributed training environment. The same code can run in any single-node or distributed setting (single CPU, single GPU, multiple GPUs, and TPUs), with or without mixed precision (fp8, fp16, bf16).
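
To make the idea concrete, here is a minimal sketch of a training epoch written against Accelerate's `prepare()` and `backward()` methods. The function name `train_one_epoch` and its arguments are hypothetical and not part of this notebook; the model, optimizer, dataloader, and loss function are assumed to be ordinary PyTorch objects that you create yourself.

```python
def train_one_epoch(accelerator, model, optimizer, dataloader, loss_fn):
    """Run one training epoch, letting Accelerate handle device placement.

    `accelerator` is assumed to be an `accelerate.Accelerator` instance;
    the other arguments are assumed to be regular PyTorch objects.
    """
    # prepare() wraps the objects so they match the current hardware setup
    # (CPU, one GPU, several GPUs, ...). It returns them in the same order.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    total_loss = 0.0
    num_batches = 0
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        # accelerator.backward(loss) takes the place of loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1

    # Average loss over the epoch (0.0 if the dataloader was empty)
    return total_loss / max(num_batches, 1)


# Typical use (assumes `accelerate` and `torch` are installed):
#   from accelerate import Accelerator
#   accelerator = Accelerator()   # or Accelerator(mixed_precision="fp16")
#   avg_loss = train_one_epoch(accelerator, model, optimizer, dataloader, loss_fn)
```

The only Accelerate-specific lines are the `prepare()` call and `accelerator.backward(loss)`; the rest is a plain PyTorch loop, which is why the same code can move between hardware setups.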

In [3]:
!pip install -U accelerate --quiet

### About transformers:   
`transformers` is a library maintained by Hugging Face and the community for state-of-the-art machine learning with PyTorch, TensorFlow, and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. We are a bit biased, but we really like 🤗 transformers!

Exploring 🤗 transformers in the Hub

There are over 25,000 transformers models in the Hub which you can find by filtering at the left of the models page.

You can find models for many different tasks:

* Extracting the answer from a context [question-answering](https://huggingface.co/models?library=transformers&pipeline_tag=question-answering&sort=downloads)
* Creating summaries from a large text [summarization](https://huggingface.co/models?library=transformers&pipeline_tag=summarization&sort=downloads).
* Classify text e.g. as spam or not spam, [text-classification](https://huggingface.co/models?library=transformers&pipeline_tag=text-classification&sort=downloads).
* Generate a new text with models such as GPT [text-generation](https://huggingface.co/models?library=transformers&pipeline_tag=text-generation&sort=downloads).
* Identify parts of speech (verb, subject, etc.) or entities (country, organization, etc.) in a sentence [token-classification](https://huggingface.co/models?library=transformers&pipeline_tag=token-classification&sort=downloads).
* Transcribe audio files to text [automatic-speech-recognition](https://huggingface.co/models?library=transformers&pipeline_tag=automatic-speech-recognition&sort=downloads).
* Classify the speaker or language in an audio file [audio-classification](https://huggingface.co/models?library=transformers&pipeline_tag=audio-classification&sort=downloads).
* Detect objects in an image [object-detection](https://huggingface.co/models?library=transformers&pipeline_tag=object-detection&sort=downloads).
* Segment an image [image-segmentation](https://huggingface.co/models?library=transformers&pipeline_tag=image-segmentation&sort=downloads).
* Do Reinforcement Learning [reinforcement-learning](https://huggingface.co/models?library=transformers&pipeline_tag=reinforcement-learning&sort=downloads)!

Thanks to the in-browser widgets, you can try out the models directly in the browser without downloading them.
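
The task links in the list above differ only in their `pipeline_tag` query parameter. As a small sketch, the helper below rebuilds those URLs with the standard library. The name `hub_models_url` is hypothetical and is not part of any Hugging Face package; the query parameters are simply the ones visible in the links above.

```python
from urllib.parse import urlencode

HUB_MODELS_URL = "https://huggingface.co/models"


def hub_models_url(pipeline_tag: str, sort: str = "downloads",
                   library: str = "transformers") -> str:
    """Build a Hub model-search URL for one task (pipeline tag)."""
    query = urlencode({"library": library, "pipeline_tag": pipeline_tag, "sort": sort})
    return f"{HUB_MODELS_URL}?{query}"


# Print the search pages for two of the tasks used later in this notebook.
for tag in ("summarization", "text-classification"):
    print(hub_models_url(tag))
```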

In [4]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [5]:
from datasets import load_dataset
from transformers import pipeline

In [None]:
! mkdir -p /content/cache

In [None]:
xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir="/content/cache/"
)  # Note: cache_dir is where the downloaded data is stored and reused on later runs.
xsum_dataset  # The printed representation of this object shows the `num_rows` of each dataset split.

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [None]:
type(xsum_dataset)    # A special object of dict type
                      # Try xsum_dataset.<tab>

datasets.dataset_dict.DatasetDict

In [None]:
xsum_sample = xsum_dataset["train"].select(range(10))
# Transform to pandas:
display(xsum_sample.to_pandas())

| | document | summary | id |
|---|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | 35232142 |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | 40143035 |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | 35951548 |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | 36266422 |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | 38826984 |
| 5 | Simone Favaro got the crucial try with the las... | Defending Pro12 champions Glasgow Warriors bag... | 34540833 |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | 20836172 |
| 7 | Belgian cyclist Demoitie died after a collisio... | Welsh cyclist Luke Rowe says changes to the sp... | 35932467 |
| 8 | Gundogan, 26, told BBC Sport he "can see the f... | Manchester City midfielder Ilkay Gundogan says... | 40758845 |
| 9 | The crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | 30358490 |


[pipeline()](https://huggingface.co/docs/transformers.js/en/pipelines)

The `pipeline()` function is the easiest and fastest way to use a pretrained model for inference, as it takes care of all the preprocessing and postprocessing for you. Note that the page linked above documents Transformers.js, the JavaScript port of the library. There, the quantized version of the model you specify is used by default, which is smaller and faster but usually less accurate. To override this behaviour (i.e., use the unquantized model) in Transformers.js, you can pass a custom PretrainedOptions object as the third parameter to the pipeline function:

`pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { quantized: false,});`

The Python `transformers` library used in this notebook does not quantize by default: `pipeline()` loads the weights as they are stored in the checkpoint.

How to select a Hugging Face model:

To select a `summarization` model, go to [huggingface](https://huggingface.co/) and click on `Models` (the first item in the top menu). On the `Models` page, look for `Natural Language Processing-->Summarization` to list the models that perform summarization. Sort these by, say, `Trending` or `Most likes`, then click on any model name to open its page (`Model Card`). Copy the model's full name and pass it as the `model` argument below.

In [None]:
# Create a pipeline object:
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn",  # alternative: "t5-small"
    min_length=20,
    max_length=40,
    truncation=True,
    model_kwargs={"cache_dir": "/content/cache/"},
)  # Note: cache_dir is where the downloaded model files are stored and reused.

In [None]:
# Apply to 1 article
summarizer(xsum_sample["document"][0])

[{'summary_text': 'Many roads in Peeblesshire remain badly affected by standing water. The full cost of damage in Newton Stewart is still being assessed. First Minister Nicola Sturgeon visited the area to inspect'}]

In [None]:
# Create a pipeline object with defaults:
summarizer = pipeline(
    task="summarization",
    model_kwargs={"cache_dir": "/content/cache/"},
)  # Note: cache_dir is where the downloaded model files are stored and reused.

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
summarizer(xsum_sample["document"][0])

[{'summary_text': ' First Minister Nicola Sturgeon visited the area to inspect the damage . Many businesses and householders were affected by flooding in Newton Stewart . Many roads in Peeblesshire remain badly affected by standing water . Flood alert remains in place across the Borders because of the constant rain . Peebles was badly hit by problems, sparking calls to introduce more defences .'}]

In [None]:
len(xsum_sample['document'])   # 10

10

In [None]:
results = summarizer(xsum_sample["document"][3])
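
The outputs above show that the summarizer returns a list with one dictionary per input, keyed by `summary_text`, and that the text can carry a leading space. A small helper along the following lines could collect clean summary strings for several articles. The name `summarize_documents` is hypothetical; it only assumes that `summarizer` is a callable with the output shape shown above.

```python
from typing import Callable, List


def summarize_documents(summarizer: Callable, documents: List[str]) -> List[str]:
    """Summarize each document and return only the summary strings."""
    summaries = []
    for doc in documents:
        output = summarizer(doc)  # expected shape: [{'summary_text': '...'}]
        # Strip the surrounding whitespace that some models emit.
        summaries.append(output[0]["summary_text"].strip())
    return summaries


# Intended use with the objects defined earlier in this notebook:
#   summaries = summarize_documents(summarizer, xsum_sample["document"])
```

Calling the summarizer one document at a time keeps the sketch simple; it is not meant as the fastest way to process a large dataset.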

## Sentiment analysis

In [None]:
poem_dataset = load_dataset(
                             "poem_sentiment", version="1.0.0", cache_dir="/content/cache/"
                           )

poem_sample = poem_dataset["train"].select(range(10))
display(poem_sample.to_pandas())

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


| | id | verse_text | label |
|---|---|---|---|
| 0 | 0 | with pale blue berries. in these peaceful shad... | 1 |
| 1 | 1 | it flows so long as falls the rain, | 2 |
| 2 | 2 | and that is why, the lonesome day, | 0 |
| 3 | 3 | when i peruse the conquered fame of heroes, an... | 3 |
| 4 | 4 | of inward strife for truth and liberty. | 3 |
| 5 | 5 | the red sword sealed their vows! | 3 |
| 6 | 6 | and very venus of a pipe. | 2 |
| 7 | 7 | who the man, who, called a brother. | 2 |
| 8 | 8 | and so on. then a worthless gaud or two, | 0 |
| 9 | 9 | to hide the orb of truth--and every throne | 2 |


Please see the [model card](https://huggingface.co/nickwong64/bert-base-uncased-poems-sentiment) of `nickwong64/bert-base-uncased-poems-sentiment` for details about this model.

In [None]:
sentiment_classifier = pipeline(
                                 task="text-classification",
                                 model="nickwong64/bert-base-uncased-poems-sentiment",
                                 model_kwargs={"cache_dir": "/content/cache/"},
                                )

In [None]:
results = sentiment_classifier(poem_sample["verse_text"])

In [None]:
results

[{'label': 'positive', 'score': 0.9965937733650208},
 {'label': 'no_impact', 'score': 0.9987409710884094},
 {'label': 'negative', 'score': 0.995965838432312},
 {'label': 'mixed', 'score': 0.9687354564666748},
 {'label': 'mixed', 'score': 0.9759674668312073},
 {'label': 'mixed', 'score': 0.9665797352790833},
 {'label': 'no_impact', 'score': 0.9986388087272644},
 {'label': 'no_impact', 'score': 0.9986108541488647},
 {'label': 'negative', 'score': 0.9965572357177734},
 {'label': 'no_impact', 'score': 0.9985186457633972}]
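
Each prediction is a dictionary with a `label` and a `score`. As a quick sketch of how to inspect such a list, the snippet below counts the predicted labels and flags the less confident predictions. The values are copied from the output above so that the block runs on its own; the 0.97 threshold is an arbitrary choice for illustration.

```python
from collections import Counter

# Predictions copied from the output above.
results = [
    {"label": "positive", "score": 0.9965937733650208},
    {"label": "no_impact", "score": 0.9987409710884094},
    {"label": "negative", "score": 0.995965838432312},
    {"label": "mixed", "score": 0.9687354564666748},
    {"label": "mixed", "score": 0.9759674668312073},
    {"label": "mixed", "score": 0.9665797352790833},
    {"label": "no_impact", "score": 0.9986388087272644},
    {"label": "no_impact", "score": 0.9986108541488647},
    {"label": "negative", "score": 0.9965572357177734},
    {"label": "no_impact", "score": 0.9985186457633972},
]

THRESHOLD = 0.97  # assumption: chosen only to illustrate the filter

# How often was each label predicted?
label_counts = Counter(r["label"] for r in results)

# Which predictions fall below the confidence threshold?
low_confidence = [
    (i, r["label"], r["score"])
    for i, r in enumerate(results)
    if r["score"] < THRESHOLD
]

print(label_counts)
print(low_confidence)
```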

In [None]:
# Display the predicted sentiment side-by-side with the ground-truth label and original text.
# The score indicates the model's confidence in its prediction.
import pandas as pd

# Join predictions with ground-truth data
joined_data = (
    pd.DataFrame(results)
    .rename({"label": "predicted_label"}, axis=1)
    .join(poem_sample.to_pandas().rename({"label": "true_label"}, axis=1))
)

# Change label indices to text labels
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}
joined_data = joined_data.replace({"true_label": sentiment_labels})

display(joined_data[["predicted_label", "true_label", "score", "verse_text"]])

| | predicted_label | true_label | score | verse_text |
|---|---|---|---|---|
| 0 | positive | positive | 0.996594 | with pale blue berries. in these peaceful shad... |
| 1 | no_impact | no_impact | 0.998741 | it flows so long as falls the rain, |
| 2 | negative | negative | 0.995966 | and that is why, the lonesome day, |
| 3 | mixed | mixed | 0.968735 | when i peruse the conquered fame of heroes, an... |
| 4 | mixed | mixed | 0.975967 | of inward strife for truth and liberty. |
| 5 | mixed | mixed | 0.96658 | the red sword sealed their vows! |
| 6 | no_impact | no_impact | 0.998639 | and very venus of a pipe. |
| 7 | no_impact | no_impact | 0.998611 | who the man, who, called a brother. |
| 8 | negative | negative | 0.996557 | and so on. then a worthless gaud or two, |
| 9 | no_impact | no_impact | 0.998519 | to hide the orb of truth--and every throne |
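
The comparison in the table can be condensed into a single accuracy figure. The sketch below does this in plain Python, with the predicted labels and true label ids copied from the outputs above; the helper name `accuracy` is hypothetical. These ten verses come from the training split, which the model may have seen during fine-tuning, so the figure should not be read as a real evaluation.

```python
# Mapping from label ids to label names, as used in the cell above.
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}

# Predicted labels and true label ids, copied from the outputs above.
predicted = ["positive", "no_impact", "negative", "mixed", "mixed",
             "mixed", "no_impact", "no_impact", "negative", "no_impact"]
true_ids = [1, 2, 0, 3, 3, 3, 2, 2, 0, 2]


def accuracy(predicted_labels, true_label_ids, id_to_name):
    """Fraction of predictions whose label matches the ground truth."""
    matches = sum(
        pred == id_to_name[true_id]
        for pred, true_id in zip(predicted_labels, true_label_ids)
    )
    return matches / len(predicted_labels)


print(accuracy(predicted, true_ids, sentiment_labels))
```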
