<a href="https://colab.research.google.com/github/trancethehuman/ai-workshop-code/blob/main/Web_scraping_for_LLM_in_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Let's set up our test: Get my competitors' pricing from their websites

This is real. I'm not doing this for shits and giggles.

I'm building an interactive learning platform where content is taught using AI. Seems like everyone else is focusing on augmenting the authoring process and not the learning experience, but whatever.

In [4]:
competitor_sites = [
    {
        "name": "Cobertura y Bonificaciones FONASA",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3459.html"
    },
    {
        "name": "Cotizaciones de Salud FONASA",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3458.html"
    },
    {
        "name": "Afiliacion y desafiliacion FONASA",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3457.html"
    },
    {
        "name": "Seguro catastrófico",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-4010.html"
    },
    {
        "name": "Garantías explícitas en Salud GES",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3460.html"
    },
    {
        "name": "Licencias Médicas y Subsidios por Incapacidad Laboral",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3461.html"
    },
    {
        "name": "Ley Ricarte Soto",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-7142.html"
    },
    {
        "name": "Beneficiarios Fonasa",
        "url": "https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-6058.html"
    }
]

### Let's set up cost calculations

So we can compare them side by side.

We can calculate how much it'll cost by using OpenAI's `tiktoken` library.

(Side note: as of today, OpenAI hasn't updated `tiktoken` with the actual encoding used in `gpt-4o`, so we'll guesstimate using the `gpt-4` tokenization encoding, `cl100k_base`.)

In [5]:
pip install tiktoken --quiet

Note: you may need to restart the kernel to use updated packages.


In [6]:
import tiktoken

def count_tokens(input_string: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")

    tokens = tokenizer.encode(input_string)

    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 5) -> float:
    num_tokens = count_tokens(input_string)

    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens

    return total_cost

# Example usage:
input_string = "What's the difference between beer nuts and deer nuts? Beer nuts are about 5 dollars. Deer nuts are just under a buck."
cost = calculate_cost(input_string)
print(f"The total cost for using gpt-4o is: $US {cost:.6f}")

The total cost for using gpt-4o is: $US 0.000135
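
As a sanity check on the arithmetic: at the default rate of $5 per million tokens, the printed cost of 0.000135 corresponds to 27 tokens. Here is a minimal sketch of the same formula without the tokenizer. The helper `cost_from_token_count` is hypothetical, and the count of 27 is inferred from the output above rather than measured separately.

```python
def cost_from_token_count(num_tokens: int, cost_per_million_tokens: float = 5) -> float:
    # Same formula as calculate_cost, but starting from a known token count
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# 27 tokens at $5 per million tokens
print(f"{cost_from_token_count(27):.6f}")  # prints 0.000135
```

Passing a different `cost_per_million_tokens` lets you compare models with different input pricing.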


### Additionally, I want to see the test results in a nice table, so let's set that up.

In [7]:
pip install prettytable tqdm --quiet

Note: you may need to restart the kernel to use updated packages.


In [8]:
from typing import List, Callable, Dict
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def view_scraped_content(
        scrape_url_functions: List[Dict[str, Callable[[str], str]]], 
        sites_list: List[Dict[str, str]], 
        characters_to_display: int = 500, 
        table_max_width: int = 50) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = ALL

    cost_table.max_width = table_max_width
    cost_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data



## Set up all the scrapers

Let's set up all of our scrapers.

### Beautiful Soup

In [9]:
pip install requests beautifulsoup4 --quiet

Note: you may need to restart the kernel to use updated packages.


In [10]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup

def beautiful_soup_scrape_url(url: str) -> str:
    # Fetch the page and return the parsed HTML as a string (markup included)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)


### Reader API by Jina AI

Let's set up Jina AI's scrape method. This one is dead easy.

In [9]:
import requests

def scrape_jina_ai(url: str) -> str:
  response = requests.get("https://r.jina.ai/" + url)
  return response.text
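
The Reader API is called by prefixing the target URL with `https://r.jina.ai/`, as the function above does. Below is a slightly more defensive sketch, assuming you want a timeout and an exception on HTTP errors. The names `build_reader_url` and `scrape_jina_ai_checked` and the 30-second timeout are illustrative choices, not part of the original notebook.

```python
import requests

JINA_READER_PREFIX = "https://r.jina.ai/"

def build_reader_url(url: str) -> str:
    # The Reader endpoint is the prefix followed by the full target URL
    return JINA_READER_PREFIX + url

def scrape_jina_ai_checked(url: str, timeout_seconds: float = 30.0) -> str:
    # Hypothetical variant of scrape_jina_ai with a timeout and a status check
    response = requests.get(build_reader_url(url), timeout=timeout_seconds)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Because the comparison function wraps each scraper call in `try`/`except`, an error raised here would show up as an `Error: ...` cell in the table instead of being scored as content.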

### Firecrawl from Mendable

In [12]:
pip install firecrawl-py --quiet

Note: you may need to restart the kernel to use updated packages.


In [13]:
import firecrawl
import getpass

FIRECRAWL_API_KEY = getpass.getpass("Mendable API Key: ")

def scrape_firecrawl(url: str) -> str:
    # Scrape the page through Firecrawl and keep only its markdown rendering
    app = firecrawl.FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    scraped_data = app.scrape_url(url)["markdown"]
    return scraped_data

## Let's run all the scrapers and display them in our comparison table

In [10]:
list_of_scraper_functions = [
      # {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
      # {"name": "Firecrawl", "function": scrape_firecrawl},
      {"name": "Jina AI", "function": scrape_jina_ai}
      ]

#all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 700, 20)
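
Each entry pairs a display name with a callable that takes a URL and returns the scraped text, so any function with that shape can be plugged in. As a sketch, a stub scraper lets you exercise the pipeline without network calls or API keys. `scrape_stub` and `stub_scraper_functions` are hypothetical names, not part of the original notebook.

```python
from typing import Callable, Dict, List, Union

def scrape_stub(url: str) -> str:
    # Hypothetical offline scraper: returns canned text instead of fetching the URL
    return f"Title: stub\n\nURL Source: {url}\n\nMarkdown Content:\n(no content fetched)"

# Same shape as list_of_scraper_functions: a display name plus a url -> text callable
stub_scraper_functions: List[Dict[str, Union[str, Callable[[str], str]]]] = [
    {"name": "Stub", "function": scrape_stub},
]

for scraper in stub_scraper_functions:
    text = scraper["function"]("https://example.com")
    print(scraper["name"], "returned", len(text), "characters")
```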

In [11]:
import os 

In [19]:
def view_scraped_content2(
        scrape_url_functions: List[Dict[str, Callable[[str], str]]], 
        sites_list: List[Dict[str, str]], 
        characters_to_display: int = 500, 
        table_max_width: int = 50) -> List[Dict[str, str]]:
    
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    # Ensure the data/raw directory exists
    raw_data_dir = '/Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw'
    os.makedirs(raw_data_dir, exist_ok=True)
    if not os.path.exists(raw_data_dir):
        print(f"Directory creation failed: {raw_data_dir}")
        return scraped_data

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})

                    # Save content to a file
                    filename = os.path.join(raw_data_dir, f"{site['name'].replace(' ', '_')}_{function_name}.txt")
                    with open(filename, 'w', encoding='utf-8') as f:
                        f.write(content)
                    if os.path.exists(filename):
                        print(f"Successfully saved content to {filename}")
                    else:
                        print(f"Failed to save content to {filename}")
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    print(f"Failed to process site {site['name']} using {function_name}: {e}")
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = ALL

    cost_table.max_width = table_max_width
    cost_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data
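
Note that the filename built above replaces spaces only in the site name, so a scraper name such as `Jina AI` keeps its space, and accented characters pass through unchanged. If you want names that are safe everywhere, one option is to normalize both parts. This is a minimal sketch, assuming ASCII-only names are acceptable. `safe_filename` is a hypothetical helper, not part of the original notebook.

```python
import re
import unicodedata

def safe_filename(site_name: str, function_name: str, extension: str = ".txt") -> str:
    # Join both parts, then strip accents by decomposing and dropping non-ASCII marks
    raw = f"{site_name}_{function_name}"
    ascii_only = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of characters other than letters, digits, "_" and "-" into "_"
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", ascii_only).strip("_")
    return cleaned + extension

print(safe_filename("Seguro catastrófico", "Jina AI"))  # prints Seguro_catastrofico_Jina_AI.txt
```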

In [21]:
all_content = view_scraped_content2(list_of_scraper_functions, competitor_sites, 700, 20)

Processing site Cobertura y Bonificaciones FONASA using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  2.35it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Cobertura_y_Bonificaciones_FONASA_Jina AI.txt


Processing site Cotizaciones de Salud FONASA using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.76it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Cotizaciones_de_Salud_FONASA_Jina AI.txt


Processing site Afiliacion y desafiliacion FONASA using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.97it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Afiliacion_y_desafiliacion_FONASA_Jina AI.txt


Processing site Seguro catastrófico using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.98it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Seguro_catastrófico_Jina AI.txt


Processing site Garantías explícitas en Salud GES using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  2.47it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Garantías_explícitas_en_Salud_GES_Jina AI.txt


Processing site Licencias Médicas y Subsidios por Incapacidad Laboral using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.20it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Licencias_Médicas_y_Subsidios_por_Incapacidad_Laboral_Jina AI.txt


Processing site Ley Ricarte Soto using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.68it/s]


Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Ley_Ricarte_Soto_Jina AI.txt


Processing site Beneficiarios Fonasa using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.75it/s]

Successfully saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Beneficiarios_Fonasa_Jina AI.txt
Content Table:
+----------------------+----------------------+
|      Site Name       |   Jina AI content    |
+----------------------+----------------------+
|     Cobertura y      |  Title: Cobertura y  |
|    Bonificaciones    |    Bonificaciones    |
|        FONASA        |                      |
|                      | URL Source: https:// |
|                      | www.superdesalud.gob |
|                      | .cl/consultas/667/w3 |
|                      |   -propertyvalue-    |
|                      |      3459.html       |
|                      |                      |
|                      |  Markdown Content:   |
|                      |  #### [¿A partir de  |
|                      |      cuándo los      |
|                      |  beneficiarios del   |
|                      |    Fonasa pueden     |
|                      | comprar bonos?abrir  |
|  




In [28]:
all_content

[{'provider': 'Cobertura y Bonificaciones FONASA',
  'sites': [{'name': 'Jina AI',
    'content': 'Title: Cobertura y Bonificaciones\n\nURL Source: https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3459.html\n\nMarkdown Content:\n#### [¿A partir de cuándo los beneficiarios del Fonasa pueden comprar bonos?abrir cerrar](https://www.superdesalud.gob.cl/consultas/667/w3-propertyvalue-3459.html#recuadros_articulo_5078_0 "Los afiliados de Fonasa y sus cargas pueden comprar bonos, una vez que cumplan con los requisitos de pago de cotizaciones según la calidad laboral o previsional del cotizante.")\n\nLos afiliados de Fonasa y sus cargas pueden comprar bonos, una vez que cumplan con los requisitos de pago de cotizaciones según la calidad laboral o previsional del cotizante.\n\nLos bonos de Fonasa permiten acceder a una consulta o procedimiento médico con un prestador que se encuentre en convenio, a través de libre elección.\n\nFonasa cuenta con cuatro tipos de bonos para que te at

In [31]:
from pathlib import Path

In [2]:
import os 

current_directory = os.getcwd()
print(f"Current Directory: {current_directory}")

Current Directory: /Users/nicorod/Documents/repos/Chatbot-Fonasa/web_scraping


In [3]:
# `__file__` is not defined inside a notebook, so fall back to the current working directory
try:
    script_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    script_dir = os.getcwd()

# Assuming the notebook or script sits one level below the project root, move up one level
project_root = os.path.abspath(os.path.join(script_dir, os.pardir))

print(f"Project Root: {project_root}")

In [14]:
from typing import Any

def write_content_to_files(all_content: List[Dict[str, Any]], output_dir: str, project_root: str):
    # Set the root directory of the project manually
    print(f"Manually Set Project Root Directory: {project_root}")
    
    # Ensure the output directory exists
    output_path = os.path.join(project_root, output_dir)
    os.makedirs(output_path, exist_ok=True)
    
    for site in all_content:
        provider = site.get('provider', '')
        for site_info in site.get('sites', []):
            name = site_info.get('name', '')
            content = site_info.get('content', '')
            
            # Create the content for the file
            file_content = f"Title: {provider}\n\n{content}"
            
            # Create a safe filename
            filename = f"{provider.replace(' ', '_')}_{name.replace(' ', '_')}.txt"
            file_path = os.path.join(output_path, filename)
            
            # Write the content to the file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(file_content)
            print(f"Saved content to {file_path}")

In [18]:
output_directory = "data/raw"
project_root = "/Users/nicorod/Documents/repos/Chatbot-Fonasa/"
write_content_to_files(all_content, output_directory, project_root)

Manually Set Project Root Directory: /Users/nicorod/Documents/repos/Chatbot-Fonasa/
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Cobertura_y_Bonificaciones_FONASA_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Cotizaciones_de_Salud_FONASA_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Afiliacion_y_desafiliacion_FONASA_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Seguro_catastrófico_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Garantías_explícitas_en_Salud_GES_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Licencias_Médicas_y_Subsidios_por_Incapacidad_Laboral_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Ley_Ricarte_Soto_Jina_AI.txt
Saved content to /Users/nicorod/Documents/repos/Chatbot-Fonasa/data/raw/Beneficiarios_Fonasa_Jina_AI.txt


## Now let's use OpenAI and extract just the information we need

Let's see how accurate the extraction is for each provider.

First, we create an extraction function that uses OpenAI's gpt-4o to pull only the pricing content out of each provider's scraped version of each website.

In [15]:
pip install openai --quiet

Note: you may need to restart the kernel to use updated packages.


In [16]:
import getpass
from openai import OpenAI

OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key: ')

client = OpenAI(api_key=OPENAI_API_KEY)

def extract(user_input: str) -> str:
    # System prompt describing the JSON shape we want back
    entity_extraction_system_message = {"role": "system", "content": "Get me the three pricing tiers from this website's content, and return as a JSON with three keys: {cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}"}

    messages = [entity_extraction_system_message]
    messages.append({"role": "user", "content": user_input})

    # JSON mode asks the model to reply with a single JSON object
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=False,
        response_format={"type": "json_object"},
    )

    return response.choices[0].message.content

Enter your OpenAI API key: ··········
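
`extract` returns the model's reply as a JSON string rather than a Python object. Here is a minimal sketch of parsing and checking it, assuming the reply follows the three-key shape requested in the system message. `parse_pricing` and the sample reply are hypothetical and for illustration only.

```python
import json
from typing import Any, Dict

EXPECTED_TIERS = ("cheapest", "middle", "most_expensive")

def parse_pricing(raw_reply: str) -> Dict[str, Any]:
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON
    data = json.loads(raw_reply)
    missing = [tier for tier in EXPECTED_TIERS if tier not in data]
    if missing:
        raise ValueError(f"Reply is missing tiers: {missing}")
    return data

# Made-up reply in the requested shape, not real model output
sample_reply = (
    '{"cheapest": {"name": "A", "price": 1.0}, '
    '"middle": {"name": "B", "price": 2.0}, '
    '"most_expensive": {"name": "C", "price": 3.0}}'
)
print(parse_pricing(sample_reply)["middle"]["price"])  # prints 2.0
```

Checking for the expected keys up front makes it easier to spot pages where the model found no pricing tiers at all.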


### Then, we create a utility function to display that content in a table.

In [17]:
from typing import Any

def display_extracted_content(results: List[Dict[str, Any]], num_objects: int):
    table = PrettyTable()
    table.field_names = ["Site", "Provider Name", "Extracted Content"]

    # Ensure num_objects does not exceed the length of the results list
    num_objects = min(num_objects, len(results))

    # Process the specified number of items from the results list with a progress bar
    for result in tqdm(results[:num_objects], desc="Processing results"):
        provider_name = result["provider"]

        for site in result["sites"]:
            function_name = site["name"]
            content = site["content"]

            # Progress bar for each function
            for _ in tqdm(range(1), desc=f"Extracting content with {provider_name} for {function_name}"):
                extracted_content = extract(content)
                table.add_row([provider_name, function_name, extracted_content])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = ALL

    print("Extracted Content Table:")
    print(table)

In [18]:
display_extracted_content(all_content, num_objects=9)

Processing results:   0%|          | 0/4 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Beautiful Soup: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]

Extracting content with Articulate 360 by Adobe for Firecrawl:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Firecrawl: 100%|██████████| 1/1 [00:01<00:00,  1.72s/it]

Extracting content with Articulate 360 by Adobe for Jina AI:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
Processing results:  25%|██▌       | 1/4 [00:04<00:14,  4.72s/it]
Extracting content with 7taps for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with 7taps for Beautiful Soup: 100%|██████████| 1/1 [00:02<00:00,  2.44s/it]

Extracting content with 7taps for Firecra

Extracted Content Table:
+-------------------------+----------------+----------------------------------------------------+
|           Site          | Provider Name  |                 Extracted Content                  |
+-------------------------+----------------+----------------------------------------------------+
| Articulate 360 by Adobe | Beautiful Soup |                         {                          |
|                         |                |                    "cheapest": {                   |
|                         |                |                      "name": "",                   |
|                         |                |                      "price": 0.0                  |
|                         |                |                          },                        |
|                         |                |                     "middle": {                    |
|                         |                |                      "name": "",                




## Bonus: Scrapegraph-ai

Scrapegraph-ai takes care of the entire flow, from scraping to extraction. It's also interesting that it's node-based and can run on local models (Ollama is supported). But I couldn't find a way to get cost estimates based on tokens used.

Demo link: https://scrapegraph-ai-demo.streamlit.app/
