## Scrape AWS Machine Learning Blog posts

### Scraping using RSS Feed

The code block below parses the RSS feed of the AWS Machine Learning blog with the `feedparser` library. For each entry in the feed, it extracts the title, published date, tags, content, and URL of the blog post.

We then store this information in a pandas DataFrame. The loop builds one single-row DataFrame per blog post rather than one table for the whole feed.

Next, we extract the trailing string from the URL of the blog post and use it as the filename for the Parquet file. This is done using the `os.path.basename` and `os.path.normpath` functions.
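As a minimal sketch of this step, the snippet below applies the same two `os.path` calls to a made-up URL. The URL and the `slug_from_url` helper are assumptions for illustration only; the helper is a variant that first isolates the URL path with `urllib.parse.urlparse`, which may be worth considering if a link ever carries a query string.

```python
import os
from urllib.parse import urlparse

# Hypothetical URL used only for illustration.
link = "https://aws.amazon.com/blogs/machine-learning/example-post/"

# normpath drops the trailing slash; basename then returns the last segment.
filename = os.path.basename(os.path.normpath(link))
print(filename)  # example-post


def slug_from_url(url: str) -> str:
    """Variant (not the notebook's code) that ignores query strings and fragments."""
    path = urlparse(url).path
    return os.path.basename(os.path.normpath(path))


print(slug_from_url(link + "?utm_source=rss"))  # example-post
```

Applied directly to a URL with a query string, the plain `basename`/`normpath` pair would return the query string instead of the post name, which is why the variant parses the URL first.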

Finally, we save the DataFrame to a Parquet file using the `to_parquet` method. We specify the `pyarrow` engine for writing the Parquet file and use Snappy compression to reduce the file size.

This results in a separate Parquet file for each blog post in the RSS feed, named after the trailing string in the post's URL. Posts whose file already exists are skipped, so rerunning the cell only writes new entries.


In [None]:
# !pip install -U requests bs4 pyarrow pandas feedparser rich --quiet

In [22]:
import os
from pathlib import Path

import feedparser
import pandas as pd
import requests
from bs4 import BeautifulSoup
from rich import print

url = "https://aws.amazon.com/blogs/machine-learning/feed/"
feed = feedparser.parse(url)

# Path to store extracted blog posts to
DATADIR = Path("./data/aws/ml_blog_posts/rss")
DATADIR.mkdir(parents=True, exist_ok=True)


for entry in feed.entries:
    title = entry.title
    published = entry.published
    # Some entries may carry no tags; fall back to an empty list
    tags = [tag.term for tag in entry.get("tags", [])]
    content = entry.content[0].value
    link = entry.link

    # Store the extracted information in a pandas DataFrame
    data = {
        "Title": [title],
        "Published": [published],
        "Tags": [tags],
        "Content": [content],
        "URL": [link],
    }
    df = pd.DataFrame(data)

    # Use the trailing string of the URL as the file name
    filename = os.path.basename(os.path.normpath(link))
    parquet_file = DATADIR / f"{filename}.parquet"
    if not parquet_file.exists():
        # Save the DataFrame to a Parquet file
        print(f"Saving: {parquet_file}")
        df.to_parquet(parquet_file, engine="pyarrow", compression="snappy")

print(f"Files written to {DATADIR}")

In [None]:
def extract_blog_urls(feed_url):
    """Return the link of every entry in the RSS feed at `feed_url`."""
    feed = feedparser.parse(feed_url)
    return [entry.link for entry in feed.entries]


feed_url = "https://aws.amazon.com/blogs/machine-learning/feed/"
blog_urls = extract_blog_urls(feed_url)

for url in blog_urls:
    print(url)

### Scraping using BeautifulSoup

The code block below scrapes specific blog posts from the AWS Machine Learning blog using the `requests` and `BeautifulSoup` libraries. The URLs of the posts to scrape are listed in `urls`.

For each URL in the list, we send a GET request to the URL and parse the response using BeautifulSoup. We then locate and extract the title, metadata, authors, published date, content, and image URLs of the blog post using BeautifulSoup's `find` and `find_all` methods.
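The sketch below illustrates the `find` and `find_all` lookups on a small inline HTML string, so it runs without a network call. The markup is a made-up stand-in that only mimics the class and `property` attributes the scraper looks for; the real page layout may differ. It also shows one way to guard against `find` returning `None` when an element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attributes the scraper relies on.
SAMPLE_HTML = """
<article>
  <h1 class="blog-post-title"> Example post </h1>
  <span property="author"><span property="name">Jane Doe</span></span>
  <span property="author"><span property="name">John Roe</span></span>
  <time property="datePublished" datetime="2023-05-01T10:00:00-08:00">01 MAY 2023</time>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first match, or None when nothing matches.
title_tag = soup.find("h1", class_="blog-post-title")
title = title_tag.get_text(strip=True) if title_tag is not None else None

# find_all returns every match; skip authors that lack a nested name element.
authors = []
for author_tag in soup.find_all("span", attrs={"property": "author"}):
    name_tag = author_tag.find("span", attrs={"property": "name"})
    if name_tag is not None:
        authors.append(name_tag.get_text(strip=True))

time_tag = soup.find("time", attrs={"property": "datePublished"})
date_published = time_tag["datetime"] if time_tag is not None else None

# This element is absent from the sample, so the lookup yields None.
content_tag = soup.find("section", class_="blog-post-content")

print(title, authors, date_published, content_tag)
```

Checking for `None` before reading `.text` or an attribute avoids an `AttributeError` or `TypeError` if a page is missing one of the expected elements.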

We store this information in a single-row pandas DataFrame, one per blog post. We then extract the trailing string from the URL of the blog post and use it as the filename for the Parquet file.

Finally, we save the DataFrame to a Parquet file using the `to_parquet` method. We specify the `pyarrow` engine for writing the Parquet file and use Snappy compression to reduce the file size.

This results in a separate Parquet file for each blog post in the `urls` list, with the filename corresponding to the trailing string in the URL of the blog post. The Parquet files are saved in the specified directory (`DATADIR`).

The result is one structured record per scraped post, ready to be loaded for further analysis.


In [8]:
# Replace with the URLs of the blog posts you want to scrape
urls = [
    "https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/",
    "https://aws.amazon.com/blogs/machine-learning/deploy-amazon-sagemaker-autopilot-models-to-serverless-inference-endpoints/",
    "https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/",
    "https://aws.amazon.com/blogs/machine-learning/achieve-high-performance-with-lowest-cost-for-generative-ai-inference-using-aws-inferentia2-and-aws-trainium-on-amazon-sagemaker/",
    "https://aws.amazon.com/blogs/machine-learning/part-1-analyze-amazon-sagemaker-spend-and-determine-cost-optimization-opportunities-based-on-usage-part-1/",
    "https://aws.amazon.com/blogs/machine-learning/part-2-analyze-amazon-sagemaker-spend-and-determine-cost-optimization-opportunities-based-on-usage-part-2-sagemaker-notebooks-and-studio/",
    "https://aws.amazon.com/blogs/machine-learning/part-3-analyze-amazon-sagemaker-spend-and-determine-cost-optimization-opportunities-based-on-usage-part-3-processing-and-data-wrangler-jobs/",
    "https://aws.amazon.com/blogs/machine-learning/part-4-analyze-amazon-sagemaker-spend-and-determine-cost-optimization-opportunities-based-on-usage-part-4-training-jobs/",
    "https://aws.amazon.com/blogs/machine-learning/part-5-analyze-amazon-sagemaker-spend-and-determine-cost-optimization-opportunities-based-on-usage-part-5-hosting/",
]


DATADIR = Path("./data/aws/ml_blog_posts/bs")
DATADIR.mkdir(parents=True, exist_ok=True)

for url in urls:
    # Last path segment of the URL, with or without a trailing slash
    post_name = url.rstrip("/").split("/")[-1]
    parquet_file = DATADIR / f"{post_name}.parquet"

    if not parquet_file.exists():
        print(f"Scraping blog: {post_name}")
        # A timeout keeps a stalled connection from hanging the loop
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")

            # Locate and extract the desired information
            title = soup.find("h1", class_="blog-post-title").text.strip()
            # metadata = soup.find('footer', class_='blog-post-meta').text.strip()
            metadata_elements = soup.find_all(
                "span", attrs={"property": "articleSection"}
            )
            metadata = [mdata_element.text for mdata_element in metadata_elements]
            # print(metadata)

            author_elements = soup.find_all("span", attrs={"property": "author"})
            # Extract the author names and store them in a list
            author_names = [
                author_element.find("span", attrs={"property": "name"}).text
                for author_element in author_elements
            ]
            # print(author_names)

            # Extract datePublished
            time_element = soup.find("time", attrs={"property": "datePublished"})
            date_published = time_element["datetime"]
            # print(date_published)

            section_element = soup.find("section", class_="blog-post-content")
            content = section_element.text.strip()
            image_urls = [
                img["src"] for img in soup.find_all("img", class_="alignnone")
            ]

            # Store the extracted information in a pandas DataFrame
            data = {
                "Title": [title],
                "Tags": [metadata],
                "Authors": [author_names],
                "Published Date": [date_published],
                "Content": [content],
                "Image URLs": [image_urls],
                "URL": [url],
            }
            df = pd.DataFrame(data)

            df.to_parquet(parquet_file, engine="pyarrow", compression="snappy")
        else:
            print(f"Error fetching {url}: {response.status_code}")

print(f"Files written to {DATADIR}")

Scraping blog: achieve-high-performance-with-lowest-cost-for-generative-ai-inference-using-aws-inferentia2-and-aws-trainium-on-amazon-sagemaker
Files written to data/aws/ml_blog_posts/bs


### List all extracted files

In [None]:
# Use rglob to recursively find all files
file_list = list(DATADIR.rglob("*.parquet"))

# Print the list of files
print(file_list)