Version: 02.14.2023

# Lab 3.1: Extracting Text from Webpages and Images

In this lab, you will use Beautiful Soup to extract text from a webpage and load the results into a pandas dataframe.

In the second part of the lab, you will use Amazon Textract to extract text from images.


## Lab steps

To complete this lab, you will follow these steps:

1. [Extracting information from a webpage](#1.-Extracting-information-from-a-webpage)
2. [Extracting text from images](#2.-Extracting-text-from-images)
    


In [1]:
#Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade sagemaker
!pip install --upgrade beautifulsoup4
!pip install --upgrade html5lib
!pip install --upgrade requests
!pip install --upgrade textract-trp

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.0.1
    Uninstalling pip-25.0.1:
      Successfully uninstalled pip-25.0.1
Successfully installed pip-25.1.1
Collecting sagemaker
  Downloading sagemaker-2.244.0-py3-none-any.whl.metadata (16 kB)
Downloading sagemaker-2.244.0-py3-none-any.whl (1.7 MB)
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.243.3
    Uninstalling sagemaker-2.243.3:
      Successfully uninstalled sagemaker-2.243.3
Successfully installed sagemaker-2.244.0
Collecting html5lib
  D

## 1. Extracting information from a webpage
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will use Beautiful Soup to extract the titles, authors, summaries, publication dates, and hyperlinks of blog posts. The extracted text could then be used in a downstream NLP task, such as topic extraction, sentiment analysis, text-to-speech, or translation.

Start by importing both the **Beautiful Soup** and **requests** packages.

In [2]:
from bs4 import BeautifulSoup
import requests

The page you will parse is the [AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/).

Using your web browser, open the AWS Machine Learning Blog page.

Use the browser's *inspector mode* to discover the structure of the page. In Mozilla Firefox and Google Chrome, you can open the inspector by pressing CTRL+SHIFT+C. If you use a different browser, consult the browser documentation.

View the different elements of the webpage by moving your pointer over the page. Move the pointer over the following elements, and see whether you can find the tags that are used to identify the information:

* Title of the blog post
* Author
* Date published
* Text summary
* Hyperlink to the blog post

Don't worry if you can't find all the tags. The following code walkthrough will help you find tags.


First, use the **requests** library to load the webpage. Before you proceed, confirm that the HTTP status code is *200*.

In [3]:
page = requests.get('https://aws.amazon.com/blogs/machine-learning/')
page.status_code

200
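
The status check can also be automated. The following is a minimal sketch and not part of the original lab: `fetch_page` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies on the `timeout` argument of `requests.get` and on `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx status codes.

```python
import requests


def fetch_page(url, timeout=10):
    """Fetch url and return the response, raising on HTTP error codes.

    fetch_page is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx.
    response.raise_for_status()
    return response
```

With a helper like this, `page = fetch_page('https://aws.amazon.com/blogs/machine-learning/')` would stop the notebook with an exception instead of continuing with an error page.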

Load the **content** from the page into a **soup** object.

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')

View the entire page by using the `soup.prettify()` method.

**Note:** The content from the AWS Blogs page might be lengthy. To move to the next task, scroll down in this notebook.

In [None]:
print(soup.prettify())

You can access elements on the page by using dot (.) notation, which returns the first tag with the given name. For example, to view the title, use `soup.title`. If you want only the text, use the `text` property as follows:

In [6]:
print(soup.title.text)

AWS Machine Learning Blog
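
Dot notation returns only the first matching tag, and it returns `None` when no such tag exists. The sketch below illustrates this on a made-up HTML snippet. The markup and the `demo` variable are illustrative assumptions, not content from the AWS page.

```python
from bs4 import BeautifulSoup

# Made-up markup used only to illustrate dot notation.
demo_html = """
<html>
 <head><title>Demo page</title></head>
 <body><p>First paragraph</p><p>Second paragraph</p></body>
</html>
"""
demo = BeautifulSoup(demo_html, "html.parser")

print(demo.title)       # the whole tag: <title>Demo page</title>
print(demo.title.text)  # only the text: Demo page
print(demo.p.text)      # first <p> only: First paragraph
print(demo.table)       # None, because the snippet has no <table>
```

Because a missing tag yields `None`, chaining further attributes onto it (for example `demo.table.text`) raises an `AttributeError`.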


When you used the inspector to search for tags on the AWS Blogs page, you might have found that blog-post content is marked with `<article>` tags, which indicate a self-contained unit of content.

In [7]:
print(soup.article.prettify())

<article class="blog-post" typeof="TechArticle" vocab="https://schema.org/">
 <meta content="en-US" property="inLanguage"/>
 <meta content="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/05/01/ML-16708-Image1.png" property="image"/>
 <div class="lb-row lb-snap">
  <div class="lb-col lb-mid-6 lb-tiny-24">
   <a href="https://aws.amazon.com/blogs/machine-learning/elevate-marketing-intelligence-with-amazon-bedrock-and-llms-for-content-creation-sentiment-analysis-and-campaign-performance-evaluation/" property="url" rel="bookmark">
    <img alt="" class="attachment-large size-large wp-post-image" height="318" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/05/01/ML-16708-Image1.png" width="936"/>
   </a>
  </div>
  <div class="lb-col lb-mid-18 lb-tiny-24">
   <h2 class="lb-bold blog-post-title">
    <a href="https://aws.amazon.com/blogs/machine-learning/elevate-marketing-intelligence-with-amazon-bedrock-and-llms-f

Review the output. Can you find the title?

The title can be found at `soup.article.h2.span`:

In [8]:
print(soup.article.h2.span.prettify())

<span property="name headline">
 Elevate marketing intelligence with Amazon Bedrock and LLMs for content creation, sentiment analysis, and campaign performance evaluation
</span>



To display only the text, use the `text` property:

In [9]:
print(soup.article.h2.span.text)

Elevate marketing intelligence with Amazon Bedrock and LLMs for content creation, sentiment analysis, and campaign performance evaluation


Find the publish date of the article:

In [10]:
print(soup.article.time.text)

09 MAY 2025


Next, extract the article summary:

In [11]:
print(soup.article.section.p.text)

In the media and entertainment industry, understanding and predicting the effectiveness of marketing campaigns is crucial for success. Marketing campaigns are the driving force behind successful businesses, playing a pivotal role in attracting new customers, retaining existing ones, and ultimately boosting revenue. However, launching a campaign isn’t enough; to maximize their impact and help achieve […]


The author names are in the footer. A blog post can have multiple authors. For now, view the first `<span>` in the footer, which wraps the full author list:

In [12]:
print(soup.article.footer.span.prettify())

<span>
 by
 <span property="author" typeof="Person">
  <span property="name">
   Namita Mathew
  </span>
 </span>
 ,
 <span property="author" typeof="Person">
  <span property="name">
   Mayank Agrawal
  </span>
 </span>
 ,
 <span property="author" typeof="Person">
  <span property="name">
   Arghya Banerjee
  </span>
 </span>
 ,
 <span property="author" typeof="Person">
  <span property="name">
   Dhara Vaishnav
  </span>
 </span>
 , and
 <span property="author" typeof="Person">
  <span property="name">
   Wesley Petry
  </span>
 </span>
</span>
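
To pull individual names out of a nested structure like this one, `find()` returns the first matching tag and `find_all()` returns every match. The sketch below uses a made-up footer with placeholder names that mirrors the structure shown above. The `footer_html` and `demo_footer` names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up footer that mirrors the structure shown above (placeholder names).
footer_html = """
<footer>
 <span>by
  <span property="author" typeof="Person"><span property="name">First Author</span></span>,
  and
  <span property="author" typeof="Person"><span property="name">Second Author</span></span>
 </span>
</footer>
"""
demo_footer = BeautifulSoup(footer_html, "html.parser").footer

# The first author only: the first tag whose property attribute is "author".
first_author = demo_footer.find("span", {"property": "author"})
print(first_author.span.text.strip())

# Every author: find_all returns all matching tags in document order.
all_authors = [
    tag.span.text.strip()
    for tag in demo_footer.find_all("span", {"property": "author"})
]
print(all_authors)
```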



The hyperlink to the full article text is the last piece of information that you must find:

In [13]:
print(soup.article.div.a['href'])

https://aws.amazon.com/blogs/machine-learning/elevate-marketing-intelligence-with-amazon-bedrock-and-llms-for-content-creation-sentiment-analysis-and-campaign-performance-evaluation/
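
Square-bracket access assumes that the attribute exists. As a small sketch on made-up markup (the `demo_links` snippet is an assumption), `tag.get()` is the more forgiving alternative when an attribute might be missing:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <a> tag has no href attribute.
demo_links = BeautifulSoup(
    '<div><a href="https://example.com/post">Post</a><a>No link</a></div>',
    "html.parser",
)
first, second = demo_links.find_all("a")

print(first["href"])       # https://example.com/post
print(second.get("href"))  # None, because the attribute is missing
# second["href"] would raise KeyError instead of returning None.
```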


You have now identified all the relevant elements. You can find all the articles by using the `find_all()` function. You can then loop through the results and output information about the blog post, such as the title, author, and so on.

For example, to loop through all the articles and print the details of each one, including every author, use `find_all()`:

In [14]:
for article in soup.find_all('article'):
    print('==========================================')
    print(article.h2.span.text)
    authors = article.footer.find_all('span', {"property":"author"})
    print('by', end=' ')
    for author in authors:
        if author.span is not None:
            print(author.span.text, end=', ')
    print(f'on {article.time.text}')
    print(article.section.p.text)
    print(article.div.a['href'])
    

Elevate marketing intelligence with Amazon Bedrock and LLMs for content creation, sentiment analysis, and campaign performance evaluation
by Namita Mathew, Mayank Agrawal, Arghya Banerjee, Dhara Vaishnav, Wesley Petry, on 09 MAY 2025
In the media and entertainment industry, understanding and predicting the effectiveness of marketing campaigns is crucial for success. Marketing campaigns are the driving force behind successful businesses, playing a pivotal role in attracting new customers, retaining existing ones, and ultimately boosting revenue. However, launching a campaign isn’t enough; to maximize their impact and help achieve […]
https://aws.amazon.com/blogs/machine-learning/elevate-marketing-intelligence-with-amazon-bedrock-and-llms-for-content-creation-sentiment-analysis-and-campaign-performance-evaluation/
How Deutsche Bahn redefines forecasting using Chronos models – Now available on Amazon Bedrock Marketplace
by Kilian Zimmerer, Daniel Ringler, Michael Bohlke-Schneider, Florian

After you figure out the data format, you can add the results to a list:

In [15]:
blog_posts = []
for article in soup.find_all('article'):
    authors = article.footer.find_all('span', {"property":"author"})
    author_text = []
    for author in authors:
        if author.span is not None:
            author_text.append(author.span.text)
    # One row per article: title, authors, published date, summary, link
    blog_posts.append([
        article.h2.span.text,
        ', '.join(author_text),
        article.time.text,
        article.section.p.text,
        article.div.a['href'],
    ])
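
The loop above assumes that every article contains every element. The following sketch shows one possible way to tolerate missing pieces. The helper names `text_or_none` and `parse_article` and the sample markup are assumptions for illustration, and the summary field is left out to keep the example short.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_article(article):
    """Extract one blog-post record as a dict; missing pieces become None."""
    link = article.find("a", href=True)
    authors = [
        text_or_none(author.find("span", {"property": "name"}))
        for author in article.find_all("span", {"property": "author"})
    ]
    return {
        "title": text_or_none(article.find("h2")),
        "authors": ", ".join(name for name in authors if name),
        "published": text_or_none(article.find("time")),
        "link": link["href"] if link is not None else None,
    }


# Made-up markup: the second article has no link, no authors, and no <time>.
sample_html = """
<article><h2><a href="https://example.com/a"><span>Post A</span></a></h2>
 <footer><span property="author"><span property="name">Author One</span></span>
 <time>01 JAN 2024</time></footer></article>
<article><h2><span>Post B</span></h2></article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_article(a) for a in sample_soup.find_all("article")]
print(records)
```

Returning dictionaries instead of positional lists also keeps each value attached to its field name, which makes later column mix-ups less likely.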
    

Next, load the list into a pandas dataframe:

In [16]:
import pandas as pd
import time

In [17]:
df = pd.DataFrame(blog_posts, columns=['title','authors','published','summary','link'])
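
`pd.DataFrame` treats each inner list as one row and assigns the `columns` names by position. A small sketch with made-up rows (the `rows` and `demo_df` names are assumptions) shows the resulting shape, and that the dates are still plain strings at this point:

```python
import pandas as pd

# Made-up rows in the same order as the column names below.
rows = [
    ["Post A", "Author One", "01 JAN 2024"],
    ["Post B", "Author Two, Author Three", "15 FEB 2024"],
]
demo_df = pd.DataFrame(rows, columns=["title", "authors", "published"])

print(demo_df.shape)                      # (2, 3): two rows, three columns
print(type(demo_df.loc[0, "published"]))  # still a Python str, not a datetime
```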

You must convert the **published** column to a `datetime` value.

In [18]:
df['published'] = pd.to_datetime(df['published'], format='%d %b %Y')
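
The format string uses strftime-style directives: `%d` is the day of the month, `%b` is the abbreviated month name, and `%Y` is the four-digit year. As a hedged sketch on made-up values, the optional `errors="coerce"` argument turns strings that don't match the format into `NaT` instead of raising an error. The `raw` and `parsed` names are assumptions.

```python
import pandas as pd

# Two dates in the blog's format plus one deliberately malformed value.
raw = pd.Series(["09 MAY 2025", "07 MAY 2025", "not a date"])

# errors="coerce" replaces unparseable strings with NaT (not a time).
parsed = pd.to_datetime(raw, format="%d %b %Y", errors="coerce")
print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

Whether to coerce or to fail loudly is a design choice: coercing keeps a scraping job running, but the missing values then need to be checked afterward.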

Adjust the column width for pandas, and display the first five rows of the dataframe:

In [19]:
# Show the full contents of each column instead of truncating long text
pd.set_option('display.max_colwidth', None)
df.head()

,title,authors,published,summary,link
0,"Elevate marketing intelligence with Amazon Bedrock and LLMs for content creation, sentiment analysis, and campaign performance evaluation","Namita Mathew, Mayank Agrawal, Arghya Banerjee, Dhara Vaishnav, Wesley Petry",2025-05-09,"In the media and entertainment industry, understanding and predicting the effectiveness of marketing campaigns is crucial for success. Marketing campaigns are the driving force behind successful businesses, playing a pivotal role in attracting new customers, retaining existing ones, and ultimately boosting revenue. However, launching a campaign isn’t enough; to maximize their impact and help achieve […]",https://aws.amazon.com/blogs/machine-learning/elevate-marketing-intelligence-with-amazon-bedrock-and-llms-for-content-creation-sentiment-analysis-and-campaign-performance-evaluation/
1,How Deutsche Bahn redefines forecasting using Chronos models – Now available on Amazon Bedrock Marketplace,"Kilian Zimmerer, Daniel Ringler, Michael Bohlke-Schneider, Florian Saupe, John Liu, Pedro Eduardo Mercado Lopez, Simeon Brueggenjuergen",2025-05-07,"Whereas traditional forecasting methods typically rely on statistical modeling, Chronos treats time series data as a language to be modeled and uses a pre-trained FM to generate forecasts — similar to how large language models (LLMs) generate texts. Chronos helps you achieve accurate predictions faster, significantly reducing development time compared to traditional methods. In this post, we share how Deutsche Bahn is redefining forecasting using Chronos models, and provide an example use case to demonstrate how you can get started using Chronos.",https://aws.amazon.com/blogs/machine-learning/how-deutsche-bahn-redefines-forecasting-using-chronos-models-now-available-on-amazon-bedrock-marketplace/
2,Use custom metrics to evaluate your generative AI application with Amazon Bedrock,"Shreyas Subramanian, Adewale Akinfaderin, Ishan Singh, Jesse Manders",2025-05-06,"Now with Amazon Bedrock, you can develop custom evaluation metrics for both model and RAG evaluations. This capability extends the LLM-as-a-judge framework that drives Amazon Bedrock Evaluations. In this post, we demonstrate how to use custom metrics in Amazon Bedrock Evaluations to measure and improve the performance of your generative AI applications according to your specific business requirements and evaluation criteria.",https://aws.amazon.com/blogs/machine-learning/use-custom-metrics-to-evaluate-your-generative-ai-application-with-amazon-bedrock/
3,Build a gen AI–powered financial assistant with Amazon Bedrock multi-agent collaboration,"Suheel Farooq, Aswath Ram A Srinivasan, Girish Krishna Tokachichu, Qingwei Li",2025-05-02,"This post explores a financial assistant system that specializes in three key tasks: portfolio creation, company research, and communication. This post aims to illustrate the use of multiple specialized agents within the Amazon Bedrock multi-agent collaboration capability, with particular emphasis on their application in financial analysis.",https://aws.amazon.com/blogs/machine-learning/build-a-gen-ai-powered-financial-assistant-with-amazon-bedrock-multi-agent-collaboration/
4,WordFinder app: Harnessing generative AI on AWS for aphasia communication,"Kori Ramijoo, Scott Harding, Sonia Brownsett, David Copland, Kurt Sterzl, Mark Promnitz",2025-05-02,"In this post, we showcase how Dr. Kori Ramajoo, Dr. Sonia Brownsett, Prof. David Copland, from QARC, and Scott Harding, a person living with aphasia, used AWS services to develop WordFinder, a mobile, cloud-based solution that helps individuals with aphasia increase their independence through the use of AWS generative AI technology.",https://aws.amazon.com/blogs/machine-learning/wordfinder-app-harnessing-generative-ai-on-aws-for-aphasia-communication/


Now that the data is in a pandas dataframe, you can use this data in downstream NLP tasks. You will come back to this data in Module 5.

## 2. Extracting text from images
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will extract the text from an image by using Amazon Textract.

For this exercise, you will use the following simple image. This file was loaded into Amazon Simple Storage Service (Amazon S3) when you started the lab.

![Image of a simple document](../s3/simple-document-image.jpg)

Start by importing the library for the AWS SDK for Python (Boto3).

In [20]:
import boto3

Set up the variables for the bucket name and the document name.

In [21]:
# Document
s3BucketName = "c163835a4206196l10271439t1w761127724819-labbucket-jsdq2weqwidb"
documentName = "lab31/simple-document-image.jpg"

Extract text from the image by calling the Amazon Textract `detect_document_text` API operation.

In [22]:
# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

print(response)

{'DocumentMetadata': {'Pages': 1}, 'Blocks': [{'BlockType': 'PAGE', 'Geometry': {'BoundingBox': {'Width': 1.0, 'Height': 1.0, 'Left': 0.0, 'Top': 0.0}, 'Polygon': [{'X': 0.0, 'Y': 0.0}, {'X': 1.0, 'Y': 0.0}, {'X': 1.0, 'Y': 1.0}, {'X': 0.0, 'Y': 1.0}]}, 'Id': '17c61284-2888-48c2-a0d9-8e9fb6324523', 'Relationships': [{'Type': 'CHILD', 'Ids': ['9ce87fcc-a086-4743-a514-31d4bdb94727', '0c83d3b2-6746-4e54-84d3-2a90717be6d4', '3e3a2d66-6914-4a2c-b1fa-b8844c622971', '59c9f4e4-350a-4353-a6d2-980a59cab316']}]}, {'BlockType': 'LINE', 'Confidence': 99.52398681640625, 'Text': 'Amazon.com, Inc. is located in Seattle, WA', 'Geometry': {'BoundingBox': {'Width': 0.512660026550293, 'Height': 0.06824082136154175, 'Left': 0.06333211064338684, 'Top': 0.1989629715681076}, 'Polygon': [{'X': 0.06337157636880875, 'Y': 0.20793944597244263}, {'X': 0.5759921669960022, 'Y': 0.1989629715681076}, {'X': 0.5759671330451965, 'Y': 0.2590251564979553}, {'X': 0.06333211064338684, 'Y': 0.26720380783081055}]}, 'Id': '9ce87

The response looks unformatted, but the **Blocks** list contains the key information that you need. 
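
One way to get oriented in a response like this is to count the blocks by type. The sketch below runs on a hand-written stand-in, not on real Textract output: `sample_response` is an assumption that is trimmed to the keys used here.

```python
from collections import Counter

# Hand-written stand-in for a Textract response, trimmed to the keys used here.
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Hello world", "Confidence": 99.5},
        {"BlockType": "WORD", "Id": "w1", "Text": "Hello", "Confidence": 99.7},
        {"BlockType": "WORD", "Id": "w2", "Text": "world", "Confidence": 99.3},
    ],
}

# Count how many blocks of each type the response contains.
block_counts = Counter(block["BlockType"] for block in sample_response["Blocks"])
print(dict(block_counts))  # {'PAGE': 1, 'LINE': 1, 'WORD': 2}
```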

Extract this information from the **Blocks** list:

In [None]:
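# Print each detected line of text and collect the lines into one string
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        # The ANSI escape codes print the line in blue
        print('\033[94m' + item["Text"] + '\033[0m')
        text = text + " " + item["Text"]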
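
Each LINE block also carries a `Confidence` score, as the printed response shows. The following sketch, run on hand-written sample data, keeps only the lines at or above a threshold. The `extract_lines` helper, the 90.0 threshold, and the second sample line are assumptions for illustration.

```python
def extract_lines(response, min_confidence=90.0):
    """Return the text of LINE blocks with Confidence >= min_confidence."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written stand-in for a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Confidence": 99.5,
         "Text": "Amazon.com, Inc. is located in Seattle, WA"},
        {"BlockType": "LINE", "Confidence": 42.0, "Text": "smudged text"},
    ]
}

lines = extract_lines(sample_response)
sample_text = " ".join(lines)  # join avoids a leading space in the result
print(sample_text)
```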

You have now extracted the text from the image. You can use this text in a downstream NLP task.

You will now experiment with one additional image. This image contains *tables* of text.

![Image of Employment Application](../s3/employmentapp.png)

Set the new document name:

In [23]:
# Document
documentName = "lab31/employmentapp.png"

Call the Amazon Textract API again. However, this time, specify the **TABLES** feature type:

In [24]:
# Call Amazon Textract again, reusing the client created earlier

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
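
In the raw response, each table is encoded as a TABLE block whose CELL children carry 1-based `RowIndex` and `ColumnIndex` values and point to WORD blocks through CHILD relationships. The sketch below rebuilds a grid from hand-written sample blocks to show the idea. The helper names and `sample_blocks` are assumptions, and the sketch ignores merged cells and selection elements.

```python
def cell_text(cell, blocks_by_id):
    """Join the text of the WORD blocks that a CELL block points to."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def tables_from_blocks(blocks):
    """Return one grid (list of rows) per TABLE block."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [
            blocks_by_id[child_id]
            for rel in table.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
            if blocks_by_id[child_id]["BlockType"] == "CELL"
        ]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based; Python lists are 0-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, blocks_by_id)
        tables.append(grid)
    return tables


# Hand-written sample blocks (not real Textract output).
sample_blocks = [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Phone"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Number:"},
    {"BlockType": "WORD", "Id": "w3", "Text": "555-0100"},
]
print(tables_from_blocks(sample_blocks))
```

A dedicated parser library handles the details that this sketch skips, which is why it is usually the more practical choice.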


Parse the table by using the Amazon Textract results parser (**textract-trp**).

**Note:** You installed the Amazon Textract results parser when you ran the `pip install --upgrade textract-trp` command at the start of this notebook.

In [25]:
from trp import Document
doc = Document(response)

for page in doc.pages:
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

Table[0][0] = Applicant 
Table[0][1] = Information 
Table[1][0] = Full Name: Jane 
Table[1][1] = Doe 
Table[2][0] = Phone Number: 
Table[2][1] = 555-0100 
Table[3][0] = Home Address: 
Table[3][1] = 123 Any Street, Any Town, USA 
Table[4][0] = Mailing Address: 
Table[4][1] = same as home address 
Table[0][0] = 
Table[0][1] = 
Table[0][2] = Previous Employment 
Table[0][3] = History 
Table[0][4] = 
Table[1][0] = Start Date 
Table[1][1] = End Date 
Table[1][2] = Employer Name 
Table[1][3] = Position Held 
Table[1][4] = Reason for leaving 
Table[2][0] = 1/15/2009 
Table[2][1] = 6/30/2011 
Table[2][2] = AnyCompany 
Table[2][3] = Assistant Baker 
Table[2][4] = Family relocated 
Table[3][0] = 7/1/2011 
Table[3][1] = 8/10/2013 
Table[3][2] = AnyCompany Bread 
Table[3][3] = Baker 
Table[3][4] = Better opportunity 
Table[4][0] = 8/15/2013 
Table[4][1] = present 
Table[4][2] = Example Corp. 
Table[4][3] = Head Baker 
Table[4][4] = N/A, current employer 
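
For further processing, the nested loops can collect the cell text into a list of rows instead of printing it. The sketch below assumes only the attributes used in the loop above (`table.rows`, `row.cells`, and `cell.text`). The `table_to_rows` helper and the stub objects built with `SimpleNamespace` are assumptions, so the example runs without calling Amazon Textract.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Convert a table with .rows/.cells/.text into rows of stripped text."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def make_stub_table(grid):
    """Build a stand-in object with the same attributes as a parsed table."""
    return SimpleNamespace(
        rows=[
            SimpleNamespace(cells=[SimpleNamespace(text=value) for value in row])
            for row in grid
        ]
    )


# Values copied from the output above, including the trailing spaces.
stub = make_stub_table([
    ["Phone Number: ", "555-0100 "],
    ["Home Address: ", "123 Any Street, Any Town, USA "],
])
print(table_to_rows(stub))
```

The resulting list of rows has the same shape as the `blog_posts` list from the first part of the lab, so it could be passed to `pd.DataFrame` in the same way.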


You have now extracted the text from a different image, and you could continue to process it further, if needed.

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2023 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*