<a href="https://colab.research.google.com/github/skyfallsin/MIECO/blob/main/AI_Guide_pick_a_model%2C_test_a_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First Steps with Language Models

Unlike other guides, this one is designed to:
- teach you how to always remain on the bleeding edge of published AI research
- not be tied to a closed-source / closed-data large language model (e.g. OpenAI, Anthropic)
- broaden your perspective on what's already out there for any given task
- create a data-led system for always identifying and using the state-of-the-art (SOTA) model for any particular task.

We're going to home in on "text summarization" as our first task.

## So... why are we not using an existing LLM?

Great question. Most available LLMs worth their salt can do many tasks, including summarization.

However, many of them are not open, are trained on undisclosed data, and exhibit biases. Responsible AI use requires careful choices, and we're here to help you make them.

Finally, most large LLMs require powerful GPU compute to use. While there are many models that you can use as a service, most of them cost money per API call. That cost is unnecessary when some of the more common tasks can be done at good quality with already available open models and off-the-shelf hardware.

## Why does using open models matter?

Over the last few decades, engineers have been blessed with being able to onboard by starting with open source projects, and eventually shipping open source to production. This default state is now at risk.

Yes, there are many open models available that do a great job. However, most guides don't discuss how to get started with them using simple steps and instead bias towards existing closed APIs.

Funding is flowing to commercial AI projects, which have larger budgets than open source contributors to market their work. This inevitably leads to engineers starting with closed source projects and shipping expensive closed projects to production.

# Our First Project - Summarization

We're going to:
- Get some long documents to summarize.
- Figure out how to summarize them using the current state-of-the-art open source models.
- Write some code to do so.

### Where can I grab some documents?
For simplicity's sake, let's grab a few HTML pages.

Note that in the real world, you will likely have to use other libraries to extract content for any particular file type.

In [1]:
# first, we will import the `requests` library to grab webpages
import requests

# TODO: replace these URLs
urls = ['https://www.cnn.com/2023/08/29/tech/ai-chatbot-hallucinations/index.html']
html_pages = [requests.get(url).text for url in urls]

Next, let's use the Python HTML parser BeautifulSoup to grab the body text of these pages

In [2]:
from bs4 import BeautifulSoup

page_content = []

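for html_page in html_pages:
    soup = BeautifulSoup(html_page, 'html.parser')
    if soup.body:
        # remove footer elements before extracting the text;
        # select() takes CSS selectors, so 'div.footer' matches <div class="footer">
        # (passing 'div.footer' as a tag name would never match anything)
        for tag in soup.body.select('footer, div.footer'):
            tag.decompose()
        page_content.append(soup.body.get_text())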

#print(page_content[0])
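
Text pulled out of a page body with `get_text()` usually carries a lot of blank lines and stray indentation. A minimal sketch of a cleanup step follows, assuming you only want non-empty lines with single spaces. The helper name `normalize_whitespace` is our own and not part of BeautifulSoup.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace within each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

If it helps, it could be applied as `page_content = [normalize_whitespace(p) for p in page_content]` before summarizing.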

Great. Now we're ready to start summarizing.

### A brief pause for context.

The AI space is moving so fast that it requires a tremendous amount of catching up on scientific papers every week to understand the lay of the land and the state of the art.

It's quite difficult for an engineer who is brand new to AI to figure out:
* which open models are even out there
* which models are appropriate for a particular task
* which benchmarks are used to evaluate those models
* which models are performing well based on evaluations
* which models can actually run on available hardware

For the working engineer on a deadline, this is problematic. There's not much centralized discourse on working with open source AI models. Instead there are fragmented X (formerly Twitter) threads, random private groups and lots of word-of-mouth transfer.

However, once you master a framework for addressing all of the above, you will have the means to forever be on the bleeding edge of published AI research.


### How do I get a list of available open summarization models?

For now, we recommend [Huggingface](https://huggingface.co/models?pipeline_tag=summarization) and their large directory of open models broken down by task. This is a great starting point. Note that larger LLMs are also included in these lists, so we will have to filter.
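
The same directory can also be queried from code. The sketch below is an assumption-heavy example: it assumes the public endpoint `https://huggingface.co/api/models` accepts the `pipeline_tag`, `sort`, `direction` and `limit` query parameters and returns a JSON list whose entries have an `id` field. The function names are our own, so check the Hugging Face Hub documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint; verify against the Hugging Face Hub docs.
HF_MODELS_API = "https://huggingface.co/api/models"


def build_models_url(task, limit=10):
    """Build a query URL for the most-downloaded models tagged with a task."""
    params = {"pipeline_tag": task, "sort": "downloads", "direction": -1, "limit": limit}
    return f"{HF_MODELS_API}?{urlencode(params)}"


def top_models(task, limit=10):
    """Fetch model ids for a task. Needs network access; assumes an 'id' field."""
    with urlopen(build_models_url(task, limit), timeout=30) as response:
        return [entry["id"] for entry in json.load(response)]
```

Calling `top_models("summarization")` would then return a short list of model ids to look into further. Download counts are only a popularity signal, not a quality measure.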

In this huge list of summarization models, which ones do we choose?

We don't know what any of these models were trained on, and that matters. For example, a summarizer trained on news articles will generally summarize news articles better than one trained on Reddit posts.

What we need is a set of metrics and benchmarks that we can use to do apples-to-apples comparisons of these models.

### How do I evaluate summarization models?

The steps below can be used to evaluate any available model for any task. They require hopping between a few sources of data for now, but we will be making this a lot easier moving forward.

Steps:
1. Find the most common datasets used to train models for summarization.
2. Find the most common metrics used to evaluate models for summarization across those datasets.

#### Finding datasets

The easiest way to do this is using _[Papers With Code](https://paperswithcode.com/methods)_, an excellent resource for finding the latest scientific papers by task that also have code repositories attached.

First, filter _Papers With Code_ datasets [by most cited text-based English datasets](https://paperswithcode.com/datasets?q=&v=lst&o=cited&lang=english&mod=texts&task=text-summarization&page=1)

Let's pick (as of this writing) the most cited dataset -- the "[CNN/DailyMail](https://paperswithcode.com/dataset/cnn-daily-mail-1)" dataset. Citation count is one reasonable marker of popularity.

Now, you don't need to download this dataset. But we're going to review the info _Papers With Code_ has provided to learn more about it for the next step. This dataset is also available on [Huggingface](https://huggingface.co/datasets/cnn_dailymail).

First, check the license. In this case, it's MIT licensed, which means it can be used for both commercial and personal projects.

Next, see if the papers using this dataset are recent. You can do this by sorting the papers by date in descending order. This particular dataset has many papers from 2023 - great!

Now, let's dig into how we can evaluate models that use this dataset.


#### Evaluating models

Next, we look for measured metrics that are common across datasets for the summarization task. BUT, if you're not familiar with the literature on summarization, you have no idea what those are.

To find out, pick a "Subtask" that's close to what you'd like to see. We'd like to summarize the CNN article we pulled down above, so let's choose "[Document Summarization](https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail)".

Now we're in business! This page contains a significant amount of new information.

There are mentions of three new terms: ROUGE-1, ROUGE-2 and ROUGE-L. These are the metrics that are used to [measure summarization performance](https://en.wikipedia.org/wiki/ROUGE_(metric)).

There is also a list of models and their scores on these three metrics - this is exactly what we're looking for.

Assuming we're looking at ROUGE-1 as our metric, we now have the top 3 models that we can evaluate in more detail. All 3 score close to 50, which is a promising ROUGE score (read up on ROUGE).
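
To make these metrics less abstract, here is a minimal sketch of ROUGE-1, ROUGE-2 and ROUGE-L computed as F1 scores, assuming plain lowercased whitespace tokenization. It is not the official ROUGE implementation (which adds preprocessing such as stemming), and the function names are our own. Scores here fall between 0 and 1, whereas leaderboards typically report them multiplied by 100.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(overlap, candidate_total, reference_total):
    """Combine precision and recall into F1, returning 0.0 when nothing overlaps."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between a candidate and a reference summary."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For the candidate "the cat lay on the mat" against the reference "the cat sat on the mat", five of six unigrams and three of five bigrams overlap, so ROUGE-1 is about 0.83 and ROUGE-2 is 0.6.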

### Testing out a model

OK, we have a few candidates, so let's pick a model that will run on our local machines. Many models get their best performance when running on GPUs, but there are many that also generate summaries fast on CPUs. Let's pick one of those to start - Google's Pegasus.
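
One rough way to judge whether a model will fit on a given machine is to estimate the memory its weights need: parameter count times bytes per parameter. The sketch below assumes 4 bytes per parameter (32-bit floats) by default and ignores activations and other runtime overhead, so treat the result as a lower bound. The function name is our own.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Approximate memory needed just to hold the weights, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumption: roughly 568 million parameters, the figure reported for PEGASUS-large.
# Check the model card for the checkpoint you actually use.
pegasus_gb = weight_memory_gb(568_000_000)
```

Under that assumption the weights alone come to a little over 2 GB, which is within reach of an ordinary laptop.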

In [3]:
# first we install huggingface's transformers library
%pip install transformers sentencepiece

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting

Then we [find Pegasus](https://huggingface.co/google/pegasus-cnn_dailymail) on Huggingface. Cool, there's a version trained entirely on the CNN/DailyMail dataset.

In [7]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

#src_text = [
#    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
#]

content_to_summarize = page_content[0]

# first we choose GPUs if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model from Huggingface
model_name = "google/pegasus-cnn_dailymail"
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

# setup the tokenizer to tokenize the text
tokenizer = PegasusTokenizer.from_pretrained(model_name)
batch = tokenizer(content_to_summarize, truncation=True, padding="longest", return_tensors="pt").to(device)

# now call the model to summarize the text
summarized = model.generate(**batch)

# run it through the decoder
summarized_text = tokenizer.batch_decode(summarized, skip_special_tokens=True)

print()
print(summarized_text[0])


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



ChatGPT has mesmerized us with its ability to produce authoritative, human-sounding responses to seemingly any prompt.<n>But as more people turn to this buzzy technology for things like homework help, workplace research, or health inquiries, one of its biggest pitfalls is becoming increasingly apparent.<n>Researchers have come to refer to this tendency of AI models to spew inaccurate information as hallucinations.
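
The summary above contains literal `<n>` markers, which this Pegasus checkpoint appears to emit between sentences in place of line breaks. A minimal sketch of a post-processing step follows, assuming you want one sentence per line. The helper name `clean_pegasus_summary` is our own and not part of the transformers library.

```python
def clean_pegasus_summary(text):
    """Replace literal "<n>" separators with line breaks and drop empty pieces."""
    sentences = (part.strip() for part in text.split("<n>"))
    return "\n".join(sentence for sentence in sentences if sentence)
```

It could be used as `print(clean_pegasus_summary(summarized_text[0]))` in place of the final `print` call.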
