<a href="https://colab.research.google.com/github/hackerpranavpandey/AIProject/blob/main/LLM_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Learning Objectives
1. Use a variety of existing models for a variety of common applications.
2. Understand basic prompt engineering.
3. Understand search vs. sampling for LLM interfaces.
4. Get familiar with the main Hugging Face abstractions: datasets, pipelines, tokenization, and models.

The `sacremoses` package is needed for the translation model.

In [None]:
%pip install sacremoses==0.0.53



In [None]:
%pip install datasets ## needed in order to use the datasets library
from datasets import load_dataset
from transformers import pipeline



**Summarization can take two forms:**

1- **extractive** (selecting representative excerpts from the text)

2- **abstractive** (generating novel text summaries)

Here we are using a model that does abstractive summarization.
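For contrast, the extractive form can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not what t5-small does: the function name `extractive_summary`, the word-frequency scoring rule, and the toy text are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, copied verbatim from the text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased alphabetic tokens).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentences by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


# Toy text invented for this sketch (not taken from xsum).
toy_text = (
    "The river flooded the town. "
    "The flood damaged roads and shops in the town. "
    "Officials will meet next week."
)
summary = extractive_summary(toy_text, num_sentences=1)
print(summary)
```

The result is always a sentence that already exists in the input, whereas an abstractive model may produce wording that appears nowhere in the article.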

Background Reading: Hugging Face summarization page

We are using :-

Data: **xsum dataset**, which provides a set of BBC articles and summaries

Model: **t5-small model**, which has 60 million parameters. Developed by Google, it is an encoder-decoder model that supports several tasks such as summarization, translation, Q&A, and text classification.



In [None]:
xsum_dataset=load_dataset(
    "xsum",
    version="1.2.0")
xsum_dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

This dataset provides 3 columns:

:- **document**: the BBC article text

:- **summary**: a ground-truth summary

:- **id**: the article ID

In [None]:
## Let's analyse some data
import numpy as np  ## used again in a later cell
import sys
## sys.getsizeof reports only the shallow size of the Python wrapper object, not the data it references
print(sys.getsizeof(xsum_dataset["validation"]))
## len() returns the number of rows without converting the split to an array
print("Number of articles for training",len(xsum_dataset["train"]))
print("Number of articles for validation",len(xsum_dataset["validation"]))
print("Number of articles for testing",len(xsum_dataset["test"]))

48
Number of articles for training 204045
Number of articles for validation 11332
Number of articles for testing 11334
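The value 48 printed first is far smaller than 11332 articles could occupy, because `sys.getsizeof` measures only the object itself and not anything it refers to. The sketch below illustrates this with a hypothetical `TableHandle` class; it is a stand-in invented for this example, not the actual `Dataset` implementation.

```python
import sys


class TableHandle:
    """Hypothetical stand-in for an object that only holds a reference to its data."""

    def __init__(self, rows):
        self.rows = rows


small = TableHandle(list(range(10)))
large = TableHandle(list(range(100_000)))

# The shallow size of the handle is the same no matter how much data it points to.
print(sys.getsizeof(small), sys.getsizeof(large))

# The referenced list itself does grow with its contents.
print(sys.getsizeof(small.rows) < sys.getsizeof(large.rows))
```

To judge how much data a split holds, the row count from `len()` is therefore a more useful number than `sys.getsizeof`.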


In [None]:
## use only the first 100 training examples
xsum_samples= xsum_dataset["train"].select(range(100))
display(xsum_samples)   ## shows only the schema and row count, not the rows themselves
display(xsum_samples.to_pandas()[0:5])   ## convert to pandas to view the rows; display renders a table, unlike print

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 100
})

| | document | summary | id |
|---|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | 35232142 |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | 40143035 |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | 35951548 |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | 36266422 |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | 38826984 |




We next use the Hugging Face pipeline tool to load a pre-trained model. In this LLM pipeline constructor, we specify:

**:-task:** The first argument specifies the task. More tasks are listed on Hugging Face.

**:-model:** the name of the model that we want to use for our task, here 't5-small'

**:-min_length, max_length:** the generated summary should be between these two lengths, measured in tokens

**:-truncation:** truncate the input if needed, since most LLMs have a limit on input size and an article may exceed it
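The idea behind truncation can be sketched without loading any model. This is a simplified illustration only: the helper `truncate_tokens`, the whitespace split, the toy sentence, and the limit of 5 are all assumptions for this example, whereas the real pipeline works on the model's own subword tokens and uses the model's own input limit.

```python
def truncate_tokens(tokens: list[str], max_input_tokens: int) -> list[str]:
    """Keep only the first max_input_tokens tokens and drop the tail."""
    return tokens[:max_input_tokens]


# Toy input invented for this sketch.
article = "one two three four five six seven eight"

# Whitespace split: a simplification of subword tokenization.
tokens = article.split()

# 5 is an arbitrary, hypothetical limit chosen to make the effect visible.
kept = truncate_tokens(tokens, 5)
print(kept)
print(f"dropped {len(tokens) - len(kept)} tokens")
```

Anything past the limit is never seen by the model, so a summary of a very long article reflects only its beginning.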


In [None]:
summarizer=pipeline(
    task="summarization",
    model="t5-small",
    min_length=10,
    max_length=50,
    truncation=True
)

In [None]:
## Apply to 1 article
summarizer(xsum_samples["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]

In [None]:
# Apply to the first 10 articles
results=summarizer(xsum_samples["document"][0:10])

In [None]:
## display the generated summaries alongside the reference summaries
import pandas as pd
## join aligns on the row index, so only the 10 summarized rows are kept
display(
    pd.DataFrame.from_dict(results)
    .rename({"summary_text": "generated_summary"}, axis=1)
    .join(xsum_samples.to_pandas())[
        ["generated_summary", "summary", "document"]
    ]
)

| | generated_summary | summary | document |
|---|---|---|---|
| 0 | the full cost of damage in Newton Stewart is s... | Clean-up operations are continuing across the ... | The full cost of damage in Newton Stewart, one... |
| 1 | a fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | A fire alarm went off at the Holiday Inn in Ho... |
| 2 | Sebastian Vettel will start third ahead of tea... | Lewis Hamilton stormed to pole position at the... | Ferrari appeared in a position to challenge un... |
| 3 | the 67-year-old is accused of committing the o... | A former Lincolnshire Police officer carried o... | John Edward Bates, formerly of Spalding, Linco... |
| 4 | a man receiving psychiatric treatment at the c... | An armed man who locked himself into a room at... | Patients and staff were evacuated from Cerahpa... |
| 5 | Gregor Townsend gave a debut to powerhouse win... | Defending Pro12 champions Glasgow Warriors bag... | Simone Favaro got the crucial try with the las... |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | Veronica Vanessa Chango-Alverez, 31, was kille... |
| 7 | the 25-year-old was hit by a motorbike during ... | Welsh cyclist Luke Rowe says changes to the sp... | Belgian cyclist Demoitie died after a collisio... |
| 8 | gundogan will not be fit for the start of the ... | Manchester City midfielder Ilkay Gundogan says... | Gundogan, 26, told BBC Sport he "can see the f... |
| 9 | the crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | The crash happened about 07:20 GMT at the junc... |


Now let's use a sentiment analysis model.

**data**:- poem_sentiment, a dataset with a sentiment label for each verse

**model**:- a fine-tuned version of BERT; BERT is also used for tasks such as entity recognition

In [None]:
%pip install datasets ## the Hugging Face library is "datasets" (plural); "dataset" is an unrelated SQL toolkit


In [None]:
poem_dataset=load_dataset(
    "poem_sentiment",
    version="1.0.0"
)
print(poem_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


In [None]:
print(len(poem_dataset["train"]))
print(len(poem_dataset["validation"]))
print(len(poem_dataset["test"]))
sys.getsizeof(poem_dataset["train"])  ## shallow size of the wrapper object only, not the data

892
105
104


48

In [None]:
poem_samples=poem_dataset["train"].select(range(10))
display(poem_samples.to_pandas())

| | id | verse_text | label |
|---|---|---|---|
| 0 | 0 | with pale blue berries. in these peaceful shad... | 1 |
| 1 | 1 | it flows so long as falls the rain, | 2 |
| 2 | 2 | and that is why, the lonesome day, | 0 |
| 3 | 3 | when i peruse the conquered fame of heroes, an... | 3 |
| 4 | 4 | of inward strife for truth and liberty. | 3 |
| 5 | 5 | the red sword sealed their vows! | 3 |
| 6 | 6 | and very venus of a pipe. | 2 |
| 7 | 7 | who the man, who, called a brother. | 2 |
| 8 | 8 | and so on. then a worthless gaud or two, | 0 |
| 9 | 9 | to hide the orb of truth--and every throne | 2 |
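The integer labels are easier to read once mapped to names. The sketch below assumes the mapping 0 = negative, 1 = positive, 2 = no_impact, 3 = mixed, which is my recollection of the poem_sentiment dataset card rather than something shown in this notebook; it can be checked with `poem_dataset["train"].features["label"].names`. The name `LABEL_NAMES` is introduced only for this example.

```python
# Assumed mapping for poem_sentiment; confirm it with
# poem_dataset["train"].features["label"].names before relying on it.
LABEL_NAMES = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}

# A few (verse_text, label) pairs copied from the sample table above.
samples = [
    ("it flows so long as falls the rain,", 2),
    ("and that is why, the lonesome day,", 0),
    ("of inward strife for truth and liberty.", 3),
    ("the red sword sealed their vows!", 3),
]

# Replace each integer label with its assumed name.
named = [(verse, LABEL_NAMES[label]) for verse, label in samples]
for verse, name in named:
    print(f"{name:>9} | {verse}")
```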
