# Data scraping

This notebook collects articles from the CBS Sports RSS feeds. Our goal is to go from no data to a set of articles that we can index and query later.

Since CBS Sports publishes an RSS feed for each sports category, we can avoid painful web scraping and collect articles from the feeds instead. This is much more efficient, because we can focus on the content rather than the structure of the website. Additionally, an RSS feed is explicitly meant to be consumed, whereas scraping a website is often against its terms of service (TOS).

## Setup

In [14]:
from momento_buffconf_workshop import NotebookConfiguration

config = NotebookConfiguration.for_scraping(run_demos_live=False)
config.print_status_banner()

🟡 USING CACHED DATA — 02-data-scraping relies on cached data (📦 snapshot 2025-07-22-16-36-20 (auto))


In [15]:
from copy import deepcopy
import hashlib

from langchain_community.document_loaders import RSSFeedLoader
from langchain_core.documents import Document
from tqdm import tqdm

from momento_buffconf_workshop import ArticleContent

## Collect Articles

### Choose feeds

We will use the [CBS Sports RSS feeds](https://www.cbssports.com/xml/rss) to collect articles. A feed is available for each sports category, such as NFL, NBA, and MLB:
![RSS feeds](../images/cbs-rss-feeds.png)


Below is an example of one such feed, for college basketball:
![College Basketball RSS feed](../images/cbs-rss-feed-basketball.png)

To get hands-on, follow the link above and look at the raw RSS feeds yourself.

In [16]:
cbssports_rss_feeds = {
    "general": "https://www.cbssports.com/rss/headlines/",
    "boxing": "https://www.cbssports.com/rss/headlines/boxing",
    "college_basketball": "https://www.cbssports.com/rss/headlines/college-basketball",
    "college_football": "https://www.cbssports.com/rss/headlines/college-football",
    "golf": "https://www.cbssports.com/rss/headlines/golf",
    "masters": "https://www.cbssports.com/rss/tag/masters/",
    "mlb": "https://www.cbssports.com/rss/headlines/mlb",
    "mma": "https://www.cbssports.com/rss/headlines/mma",
    "nba": "https://www.cbssports.com/rss/headlines/nba",
    "nfl": "https://www.cbssports.com/rss/headlines/nfl",
    "nhl": "https://www.cbssports.com/rss/headlines/nhl",
    "soccer": "https://www.cbssports.com/rss/headlines/soccer",
    "tennis": "https://www.cbssports.com/rss/headlines/tennis",
    "wwe": "https://www.cbssports.com/rss/headlines/wwe",
    "betting": "https://www.cbssports.com/rss/headlines/betting/"
}

### Fetch and parse

This step uses [feedparser](https://github.com/kurtmckee/feedparser) to fetch and parse the RSS feeds, which returns a list of articles, and then [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) under the hood to parse the HTML content of each article.

We use the helper class `ArticleContent` to store the content of each article, including the title, URL, and text content.
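
To make the "fetch and parse" step more concrete, here is a minimal sketch of what an RSS feed contains and how its entries can be pulled out. It is not what `RSSFeedLoader` or feedparser do internally: it uses only the Python standard library, and the feed text, the `SAMPLE_RSS` constant, and the `parse_rss_items` helper are all hypothetical, written for illustration. The point is that each `<item>` carries a title and a link, and the link is what the notebook later reads from `doc.metadata["link"]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS 2.0 snippet (not real CBS Sports data).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Sports Headlines</title>
    <item>
      <title>Team A edges Team B</title>
      <link>https://example.com/news/team-a-edges-team-b</link>
      <pubDate>Tue, 22 Jul 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Injury report</title>
      <link>https://example.com/news/injury-report</link>
      <pubDate>Tue, 22 Jul 2025 15:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text: str) -> list[dict[str, str]]:
    """Extract title, link, and publication date from each <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "published": item.findtext("pubDate", default=""),
            }
        )
    return items


entries = parse_rss_items(SAMPLE_RSS)
for entry in entries:
    print(entry["title"], "->", entry["link"])
```

A real feed parser also handles Atom feeds, namespaces, malformed XML, and date normalization, which is why the notebook relies on a library rather than hand-rolled parsing.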

In [17]:
if config.run_demos_live:
    articles: dict[str, list[Document]] = {}
    for category, url in tqdm(cbssports_rss_feeds.items()):
        loader = RSSFeedLoader(urls=[url], nlp=False)
        articles[category] = loader.load()

In [18]:
if config.run_demos_live:
    ArticleContent(
        articles=articles
    ).save_json(config.raw_article_path)
else:
    article_content = ArticleContent.load_json(config.raw_article_path)
    articles = article_content.articles

### Normalize ids

In [19]:
def hash_url_to_int(url: str) -> str:
    """Return a stable ID: the first 64 bits of the URL's SHA-256 digest, as a decimal string."""
    return str(int(hashlib.sha256(url.encode("utf-8")).hexdigest()[:16], 16))
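
The properties that make this hash usable as a document ID can be checked directly. The sketch below repeats the function from the cell above so that it runs on its own; the two URLs are hypothetical. It shows that the ID is deterministic, that it is a decimal string, and that it fits in an unsigned 64-bit integer, because only the first 16 hex digits of the digest are kept.

```python
import hashlib


def hash_url_to_int(url: str) -> str:
    # Same logic as the notebook cell: keep the first 16 hex digits
    # (64 bits) of the SHA-256 digest and render them as a decimal string.
    return str(int(hashlib.sha256(url.encode("utf-8")).hexdigest()[:16], 16))


# Hypothetical URLs, for illustration only.
url_a = "https://example.com/news/story-1"
url_b = "https://example.com/news/story-2"

id_a = hash_url_to_int(url_a)
id_b = hash_url_to_int(url_b)

# The same URL always maps to the same ID, across runs and machines.
assert id_a == hash_url_to_int(url_a)

# Distinct URLs map to distinct IDs (collisions are possible in principle,
# but very unlikely with 64 bits at this scale).
assert id_a != id_b

# Despite the function's name, the result is a string of decimal digits
# whose value fits in an unsigned 64-bit integer.
assert id_a.isdigit() and int(id_a) < 2**64
```

Because the ID depends only on the URL, the same article fetched from two different feeds receives the same ID.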

In [20]:
article_copy = deepcopy(articles)

In [21]:
# Use the hash of the url as the document id

for category, docs in article_copy.items():
    for doc in docs:
        doc.id = hash_url_to_int(doc.metadata["link"])

In [22]:
ArticleContent(
    articles=article_copy
).save_json(config.normalized_article_path)

### Quick sanity check

In [23]:
num_docs = sum(len(docs) for docs in article_copy.values())
ids = {doc.id for docs in article_copy.values() for doc in docs}
num_unique = len(ids)
print(f"Number of documents: {num_docs}")
print(f"Number of unique documents: {num_unique}")

Number of documents: 540
Number of unique documents: 457
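
The 83 documents that separate the two counts are repeats of an ID that was already seen. A plausible cause, which this notebook does not confirm, is that the same article is listed in more than one feed, for example in the general feed and in a sport-specific one. The sketch below shows one way to check that. It is an assumption-laden illustration: `ids_by_category` is a hypothetical stand-in for the real data, filled with made-up IDs.

```python
from collections import defaultdict

# Hypothetical stand-in for the real data: category -> list of document IDs.
ids_by_category = {
    "general": ["101", "102", "103"],
    "nfl": ["101", "201"],
    "nba": ["102", "301"],
}

# Invert the mapping: document ID -> set of categories it appears in.
categories_by_id: dict[str, set[str]] = defaultdict(set)
for category, doc_ids in ids_by_category.items():
    for doc_id in doc_ids:
        categories_by_id[doc_id].add(category)

# Keep only the IDs that occur in more than one category.
shared = {
    doc_id: sorted(categories)
    for doc_id, categories in categories_by_id.items()
    if len(categories) > 1
}

total_docs = sum(len(doc_ids) for doc_ids in ids_by_category.values())
unique_docs = len(categories_by_id)

print(f"Number of documents: {total_docs}")
print(f"Number of unique documents: {unique_docs}")
print(f"IDs shared across categories: {shared}")
```

To run this against the notebook's data, `ids_by_category` could be built as `{category: [doc.id for doc in docs] for category, docs in article_copy.items()}`. An ID repeated within a single feed would not show up in `shared`, so a complete check would also count occurrences per category.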
