<a href="https://colab.research.google.com/github/skyfallsin/MIECO/blob/main/AI_Guide_pick_a_model%2C_test_a_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First Steps with Language Models

Unlike other guides, this one is designed to:
- not be tied to a closed-source / closed-data large language model (e.g., OpenAI, Anthropic)
- broaden your perspective on what's already out there for any given task
- create a data-led system for always identifying and using the state-of-the-art (SOTA) model for any particular task.

We're going to home in on "text summarization" as our first task.

## So... why are we not using an existing LLM?

Great question. Most available LLMs worth their salt can do many tasks, including summarization.

However, many of them are not open, are trained on undisclosed data, and exhibit biases. Responsible AI use requires careful choices, and we're here to help you make them.

Finally, most large LLMs require powerful GPU compute to use. While there are many models that you can use as a service, most of them cost money per API call. That cost is unnecessary when some of the more common tasks can be done at good quality with already available open models and off-the-shelf hardware.

## Why does using open models matter?

Over the last few decades, we engineers have been blessed with being able to onboard by starting with open source projects, and eventually shipping open source to production.

However, recent developments in AI have changed this default state.

Yes, there are many open models available. However, most guides don't discuss how to get started with them using simple steps.

# Our First Project - Summarization

We're going to:
- Get some long documents to summarize.
- Figure out how to summarize them using the current state-of-the-art open source models.
- Write some code to do so.

### Where can I grab some documents?
For simplicity's sake, let's grab a few HTML pages.

Note that in the real world, you will likely have to use other libraries to extract content for any particular file type.

In [None]:
# first, we will import the `requests` library to grab webpages
import requests

# TODO: replace these URLs
urls = ['https://blog.mozilla.org/en/mozilla/responsible-ai-challenge-winners/']
html_pages = [requests.get(url).text for url in urls]

Next, let's use BeautifulSoup, a Python HTML parsing library, to grab the body text of these pages.

In [None]:
from bs4 import BeautifulSoup

page_content = []

for html_page in html_pages:
    soup = BeautifulSoup(html_page, 'html.parser')
    if soup.body:
        # select() takes CSS selectors, so this matches <footer> tags and <div class="footer">
        for tag in soup.body.select('footer, div.footer'):
            tag.decompose()
        page_content.append(soup.body.get_text())

print(page_content[0])

Great. Now we're ready to start summarizing.
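
Before passing the text to a model, it may help to tidy it first. Text pulled out of HTML with `get_text()` usually carries over the page's indentation and long runs of blank lines. The cell below is a minimal sketch of one way to clean that up; `clean_page_text` is a helper name made up for this guide, and the sample string is invented for illustration.

```python
import re

def clean_page_text(raw_text: str) -> str:
    """Collapse the stray whitespace that HTML-to-text extraction leaves behind."""
    # collapse runs of spaces and tabs inside each line, then trim the line
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw_text.splitlines()]
    # drop lines that are now empty and rejoin the rest
    return "\n".join(line for line in lines if line)

# invented sample, standing in for one entry of `page_content`
sample = "\n\n  Responsible AI \t Challenge \n\n\n   Winners announced  \n"
print(clean_page_text(sample))
```

To apply it to the scraped pages, something like `page_content = [clean_page_text(text) for text in page_content]` should work.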

### What's next? A brief pause.

The AI space is moving so fast that it requires a tremendous amount of catching up on scientific papers every week to understand the lay of the land.

It's quite difficult for an engineer who is brand new to AI to figure out:
1. which open models are even out there
2. which of those models are appropriate for a particular task
3. which benchmarks are used to evaluate those models
4. which models perform well on those benchmarks
5. which of the well-performing models can actually run on available hardware

For the journeyman engineer on a deadline, this is not tractable.
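
If we already had the answers to those questions as data, the selection itself would be a small filter-and-sort. The sketch below assumes exactly that: the `Candidate` record, its fields, and every value in the list are placeholders invented for illustration, not real models or real benchmark results. The hard part, which the rest of this guide works toward, is filling in that data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str               # placeholder identifier, not a real model
    task: str               # task the model is published for
    benchmark_score: float  # higher is better; which metric to use is left open here
    memory_gb: float        # estimated memory needed to run the model

def shortlist(candidates, task, max_memory_gb, top_k=3):
    """Keep models for `task` that fit in memory, best benchmark score first."""
    eligible = [
        c for c in candidates
        if c.task == task and c.memory_gb <= max_memory_gb
    ]
    return sorted(eligible, key=lambda c: c.benchmark_score, reverse=True)[:top_k]

# made-up entries, only here to exercise the function
candidates = [
    Candidate("model-a", "summarization", 0.41, 1.6),
    Candidate("model-b", "summarization", 0.44, 24.0),
    Candidate("model-c", "translation", 0.52, 2.0),
    Candidate("model-d", "summarization", 0.38, 0.9),
]

for c in shortlist(candidates, task="summarization", max_memory_gb=8):
    print(c.name, c.benchmark_score)
```

Note that `model-b` has the best score among the summarizers but is dropped because it does not fit the memory budget, which is the kind of trade-off question 5 is about.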


### How do I select a summarization model?

For now, we recommend [Hugging Face](https://huggingface.co/models?pipeline_tag=summarization) and their large directory of open models broken down by task. This is a great starting point. Note that larger LLMs are also included in these lists, so we will have to filter.
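
The same directory can also be queried from code. The sketch below uses only the standard library and assumes the Hub's public `/api/models` endpoint accepts `filter`, `sort`, `direction`, and `limit` query parameters and returns a JSON list with `id` (or `modelId`) and `downloads` fields. Treat those names as assumptions that may change; the function names are our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HUB_API = "https://huggingface.co/api/models"

def build_query_url(task="summarization", limit=10):
    """Build a Hub query for the most-downloaded models tagged with `task`."""
    # assumption: these query parameter names match the current Hub API
    params = {"filter": task, "sort": "downloads", "direction": -1, "limit": limit}
    return f"{HUB_API}?{urlencode(params)}"

def fetch_top_models(task="summarization", limit=10):
    """Return (model id, download count) pairs. Requires network access."""
    with urlopen(build_query_url(task, limit), timeout=30) as response:
        models = json.load(response)
    # assumption: each entry exposes its name as `id` or `modelId`
    return [(m.get("id") or m.get("modelId"), m.get("downloads")) for m in models]

print(build_query_url())
```

Calling `fetch_top_models()` should then give a short list to start from. Download counts measure popularity rather than quality, so this only narrows the field.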

There are a lot of models available, many of them forks of existing models, and it isn't easy to decide on even three models to test out.

We also don't know what any of these models are trained on. For example, a summarizer trained on news articles will likely perform better on news articles than one trained on Reddit posts.

We also don't know whether these models can run on available commodity hardware or require GPUs to run well.
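
A rough first check on hardware is the memory needed just to hold a model's weights: the number of parameters times the bytes used per parameter (4 for float32, 2 for float16). The sketch below applies that arithmetic; the parameter counts are hypothetical round numbers, and the estimate ignores activations and runtime overhead, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Rough memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# hypothetical parameter counts: one small model, one large one
for n in (400_000_000, 7_000_000_000):
    fp32 = estimate_weight_memory_gb(n)
    fp16 = estimate_weight_memory_gb(n, bytes_per_parameter=2)
    print(f"{n:,} parameters: ~{fp32:.1f} GB in float32, ~{fp16:.1f} GB in float16")
```

If the parameter count is listed on a model's page, this gives a quick sense of whether it is a laptop-sized model or one that needs a dedicated GPU.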

What we need is a set of metrics and benchmarks that we can use to do apples-to-apples comparisons of these models.

So, the next step is to uncover how summarization models are compared against each other.

### How do I benchmark summarization models?