<a href="https://colab.research.google.com/github/jxtngx/torchtune-cookbook/blob/main/summarization/L0_Data_Acquisition_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks through a basic example of using BeautifulSoup to acquire and process data from Wikipedia.

With BeautifulSoup, we can parse a source page and extract its content body as text for a downstream model to summarize.

Note that this is a basic example of preprocessing. It leaves room for future work on additional data preparation techniques that would remove unneeded tokens from the context.

In [1]:
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup

Step 1: fetch the page data

In [2]:
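# set the base url for the source page
url = "https://en.wikipedia.org/wiki/Intelligent_agent"
# request the source page content (the timeout keeps the cell from hanging)
content = requests.get(url, timeout=30)
# stop early if the request failed, rather than parsing an error page
content.raise_for_status()
# parse the source page with BeautifulSoup
soup = BeautifulSoup(content.text, features='html.parser')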

Step 2: get the text from the article body

In [3]:
# find the body content
body = soup.find(attrs={"id": "bodyContent"})

Step 3: pack the data into a dict to save in JSON format

In [4]:
# create a data collection using the root article as the first entry
data = {"Intelligent Agent": {'url': url, 'body': body.get_text(separator=" ", strip=True)}}

In [5]:
# check out a slice of the article
data['Intelligent Agent']['body'][:500]

'From Wikipedia, the free encyclopedia Software agent which acts autonomously For the term in intelligent design, see Intelligent designer . Not to be confused with Embodied agent . Simple reflex agent diagram In artificial intelligence , an intelligent agent is an entity that perceives its environment , takes actions autonomously to achieve goals, and may improve its performance through machine learning or by acquiring knowledge . Leading AI textbooks define artificial intelligence as the "study'
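
The slice above shows stray spaces before punctuation ("Embodied agent ." and "artificial intelligence ,"), a side effect of joining text nodes with a space separator. As one possible direction for the additional data prep mentioned at the top of this notebook, here is a minimal sketch of a cleanup pass. The function name `clean_text` is hypothetical, and the patterns for citation markers and "[edit]" links are assumptions about what may appear in the scraped text, not something verified against the full article.

```python
import re


def clean_text(text: str) -> str:
    """Apply a few conservative cleanups to scraped article text."""
    # drop numeric citation markers such as "[1]" or "[ 12 ]" (assumed pattern)
    text = re.sub(r"\[\s*\d+\s*\]", "", text)
    # drop "[edit]" section links (assumed pattern)
    text = re.sub(r"\[\s*edit\s*\]", "", text)
    # remove the whitespace left in front of punctuation by the separator
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # collapse any remaining runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Not to be confused with Embodied agent . "
    "In artificial intelligence , an agent acts [ 1 ] ."
)
print(clean_text(sample))
```

These rules are deliberately narrow. Removing whitespace before punctuation can alter text where the space is meaningful, so it is worth spot-checking the cleaned output before using it downstream.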

Step 4: check the size of the content body against the model's context length

In [6]:
# count the characters in the body; context length is measured in tokens, so this is only a rough proxy
len(data['Intelligent Agent']['body'])

30259
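
The number above is a character count, while a model's context length is measured in tokens. A rough sketch of converting between the two follows. It assumes about 4 characters per token, a common rule of thumb for English text rather than a property of any particular tokenizer, and the context lengths used (8192 and 4096) are hypothetical values. For an exact answer, count tokens with the target model's own tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate the token count from the character count (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_length: int, chars_per_token: float = 4.0) -> bool:
    """Return True if the estimated token count is within the context length."""
    return estimate_tokens(text, chars_per_token) <= context_length


# stand-in string with the same length as the article body reported above
stand_in = "x" * 30259
print(estimate_tokens(stand_in))
# 8192 is a hypothetical context length, not a value taken from this notebook
print(fits_context(stand_in, context_length=8192))
```

Because the estimate can be off in either direction, it is safer to leave headroom for the prompt template and the generated summary instead of filling the context exactly.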

Step 5: save this data as a JSON file for later use



> Make sure to mount your Google Drive and add a new folder titled `intelligent-agents` before moving on.

Uncomment the cells below to save the JSON file to your Google Drive.

In [7]:
# dest = "/content/drive/MyDrive/intelligent-agents/"
# filepath = dest + "intelligent_agent.json"
# filepath

In [8]:
# with open(filepath, 'w') as fp:
#     json.dump(data, fp, indent=4)
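
To confirm that the saved file can be read back later, the save step can be paired with a reload. The sketch below writes to a temporary directory rather than Google Drive so it runs anywhere. The helper name `save_and_reload` and the placeholder body text are hypothetical, and the explicit UTF-8 encoding with `ensure_ascii=False` is a choice made here, not something the cells above do.

```python
import json
import tempfile
from pathlib import Path


def save_and_reload(data: dict, filepath: Path) -> dict:
    """Write data as JSON, then read it back from the same path."""
    # create the destination folder if it does not exist yet
    filepath.parent.mkdir(parents=True, exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as fp:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(data, fp, indent=4, ensure_ascii=False)
    with open(filepath, encoding="utf-8") as fp:
        return json.load(fp)


# placeholder entry with the same shape as the data dict built above
example = {
    "Intelligent Agent": {
        "url": "https://en.wikipedia.org/wiki/Intelligent_agent",
        "body": "placeholder text",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "intelligent-agents" / "intelligent_agent.json"
    reloaded = save_and_reload(example, target)

# True when the round trip preserves the data exactly
print(reloaded == example)
```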