# Scraping Forbes by Topics

- We scrape forbes.com by topic; each topic appears as "/{topic}" after the base site URL
- First, we build a list of topics, each with a title and a URL
- For each topic, we get the top 25 stories
- For each story, we grab its name, author(s), author type(s) (staff, contributor, etc.), date, and URL
- For each topic, we create a CSV in the format:
```
Article Name,Description,Author(s),Author Type(s),Date,Article URL

InnovationRx: Phil Knight’s $2 Billion Cancer Gift,"Amy Feldman, Alex Knapp","Forbes Staff, Forbes Staff","Aug 20, 2025",https://www.forbes.com/sites/innovationrx/2025/08/20/innovationrx-phil-knights-2-billion-cancer-gift/?ss=ai

"AI Therapists Belong In The Back Office, Not The Chair",John Samuels,Contributor,"Aug 19, 2025",https://www.forbes.com/sites/johnsamuels/2025/08/19/ai-therapists-belong-in-the-back-office-not-the-chair/?ss=ai
```
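
As a minimal sketch of how rows in this format could be written, Python's standard `csv` module quotes any field that contains a comma (titles, dates) and leaves the rest bare. The helper name `rows_to_csv` is hypothetical, and the description field is left empty because the sample rows above do not include one.

```python
import csv
import io

HEADER = ["Article Name", "Description", "Author(s)", "Author Type(s)", "Date", "Article URL"]


def rows_to_csv(rows):
    """Serialize article rows to CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


sample = [[
    "AI Therapists Belong In The Back Office, Not The Chair",
    "",  # description left empty: the sample rows above do not include one
    "John Samuels",
    "Contributor",
    "Aug 19, 2025",
    "https://www.forbes.com/sites/johnsamuels/2025/08/19/ai-therapists-belong-in-the-back-office-not-the-chair/?ss=ai",
]]

csv_text = rows_to_csv(sample)
print(csv_text)
```

Reading the text back with `csv.reader` restores the original fields, so commas inside titles and dates do not shift the columns.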

### Use the Requests library to load pages

Install and import

In [1]:
%pip install requests --upgrade --quiet
import requests

Note: you may need to restart the kernel to use updated packages.


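Save the home page to a local file

In [2]:
# Download the home page; the timeout keeps a stalled connection from hanging the notebook
forbes_home = requests.get('https://forbes.com', timeout=30).text
# Write as UTF-8 so non-ASCII characters in headlines survive on every platform
with open('home-page.html', 'w', encoding='utf-8') as f:
  f.write(forbes_home)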

### Use Beautiful Soup to parse and extract information

Install and import

In [3]:
%pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup


Grab all "a" tags that share the attribute pattern used for story topics: a "data-ga-track" attribute starting with "hamburger-L2". Skip Paid Program topics.

In [4]:
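a_html = ''        # concatenated HTML of the matching tags, kept for inspection
article_tags = []  # the matching tags themselves

a_tags = BeautifulSoup(forbes_home, 'html.parser').find_all('a')

for a in a_tags:
  a_str = a.prettify()
  # Keep links tagged "hamburger-L2..." and drop Paid Program entries
  if 'data-ga-track="hamburger-L2' in a_str and '| Paid Program' not in a_str:
    a_html += a_str
    article_tags.append(a)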
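
The same filter can be expressed on the parsed attributes instead of the prettified string. The sketch below is one possible variant, not the notebook's method: it passes a predicate function to `find_all`. The sample markup and the helper name `is_topic_link` are hypothetical; only the "hamburger-L2" prefix and the "| Paid Program" marker come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the attribute pattern described above
sample_html = """
<nav>
  <a data-ga-track="hamburger-L2-Breaking News" href="https://www.forbes.com/news/">Breaking News</a>
  <a data-ga-track="hamburger-L2-Example | Paid Program" href="https://www.forbes.com/sites/example/">Example | Paid Program</a>
  <a data-ga-track="footer-link" href="https://www.forbes.com/about/">About</a>
</nav>
"""


def is_topic_link(tag):
    """True for <a> tags tracked as 'hamburger-L2...' that are not Paid Program entries."""
    if tag.name != "a":
        return False
    track = tag.get("data-ga-track", "")
    if not track.startswith("hamburger-L2"):
        return False
    # Check both the attribute and the visible text for the Paid Program marker
    return "| Paid Program" not in track and "| Paid Program" not in tag.get_text()


soup = BeautifulSoup(sample_html, "html.parser")
topic_links = soup.find_all(is_topic_link)
found = [(a.get_text(strip=True), a["href"]) for a in topic_links]
print(found)
```

Matching on the attribute value avoids depending on how `prettify()` happens to lay out the tag.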

- Create a class that holds the basic information (name and link) from each topic tag
- Create a list of objects of that class, one per topic

In [5]:
class topic_url:
  """Name and link of one topic."""
  def __init__(self, name, href):
    self.name = name
    self.link = href
  def printer(self):
    print(self.name, self.link)

topics_urls = []

for tag in article_tags:
  try:
    topics_urls.append(topic_url(str(tag.string), tag['href']))
  except KeyError:
    # The tag has no href attribute, so there is nothing to link to
    continue
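
For comparison, a small record type like this can also be written as a dataclass. This is a sketch under assumptions, not the notebook's code: the names `Topic`, `from_link`, and `BASE_URL` are hypothetical, and resolving relative links with `urljoin` is only useful if some `href` values are relative, which the source does not state.

```python
from dataclasses import dataclass
from urllib.parse import urljoin

BASE_URL = "https://www.forbes.com"  # assumed base for resolving relative links


@dataclass(frozen=True)
class Topic:
    name: str
    link: str

    @classmethod
    def from_link(cls, text, href):
        """Build a Topic from link text and href, tidying whitespace and relative URLs."""
        return cls(name=" ".join(text.split()), link=urljoin(BASE_URL, href))


topic = Topic.from_link("  Breaking\n News ", "/news/")
print(topic)
```

A dataclass provides `__repr__` and equality for free, which makes a list of topics easier to inspect than with a hand-written `printer` method.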

### Use Pandas to map out topics data

Install and import

In [6]:
%pip install pandas --upgrade --quiet
import pandas as pd


In [7]:
import os

os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
topics_df = pd.DataFrame({'Topic': [topic.name for topic in topics_urls], 'URL': [topic.link for topic in topics_urls]})
topics_df.to_csv('data/topics.csv')
topics_df

Unnamed: 0,Topic,URL
0,Best-In-State Top Next-Gen Wealth Advisors 2025,https://www.forbes.com/lists/best-in-state-nex...
1,Breaking News,https://www.forbes.com/news/
2,White House Watch,https://www.forbes.com/trump/
3,Daily Cover Stories,https://www.forbes.com/daily-cover-stories/
4,AI’s Nuanced Impact And A Quest To Quantify It,https://www.forbes.com/sites/forbes-research/2...
...,...,...
138,Pinpoint by LinkedIn,https://www.forbes.com/games/pinpoint/
139,Queens by LinkedIn,https://www.forbes.com/games/queens/
140,Crossclimb by LinkedIn,https://www.forbes.com/games/crossclimb/
141,Forbes Video,https://www.forbes.com/video/
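
If `data/topics.csv` is read back later, the index that `to_csv` writes by default comes back as an `Unnamed: 0` column. A minimal sketch of the difference `index=False` makes, using an in-memory buffer and two rows from the table above (the variable names are illustrative):

```python
import io

import pandas as pd

topics = pd.DataFrame({
    "Topic": ["Breaking News", "Forbes Video"],
    "URL": ["https://www.forbes.com/news/", "https://www.forbes.com/video/"],
})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
topics.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
topics.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

Alternatively, `pd.read_csv(..., index_col=0)` restores the index when the file was written with the default settings.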


### Repeat process for each topic

Load each topic page and locate its stories through their description tags

In [8]:
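import os

topics_df['Article'] = 'NaN'
os.makedirs('data/topics', exist_ok=True)  # to_csv does not create missing directories

# Only the first five topics are processed here; widen the range to cover more
for i in range(0, 5):
  topic_df = pd.DataFrame(columns=['Article Name', 'Description', 'Author(s)', 'Author Type(s)', 'Date', 'URL'])

  topic_name = topics_df['Topic'][i]
  topic_link = topics_df['URL'][i]  # named topic_link so the topic_url class is not shadowed

  try:
    doc = requests.get(topic_link, timeout=30).text
  except requests.RequestException:
    continue  # skip topics whose page cannot be loaded

  # Each story card carries a description span clamped to three lines
  desc_tags = BeautifulSoup(doc, "html.parser").find_all('span', {'style': '-webkit-line-clamp:3'})

  for desc_tag in desc_tags:
    # The card is two levels above the description span; its first link is the title
    content_tag = desc_tag.parent.parent
    title_tag = content_tag.find('a')
    if title_tag is None:
      continue  # no title link in this card

    print(content_tag)  # show the card's HTML to inspect its structure

    topic_df.loc[len(topic_df), ['Article Name', 'Description', 'URL']] = [title_tag.text, desc_tag.text, title_tag['href']]

  # Fetch each article page; the response is not parsed yet, so the author and date columns stay empty
  for article_url in topic_df['URL']:
    try:
      doc = requests.get(article_url, timeout=30).text
    except requests.RequestException:
      continue

  # Replace characters that are not allowed in file names
  bad_chars = ['/', '*', '?', ':', '\\', '<', '>', '|']
  file_name = ''.join('...' if char in bad_chars else char for char in topic_name)

  topic_df.to_csv(f'data/topics/{file_name}.csv')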

<div class="WjVFB823"><div class="IE8ecQMQ"><span class="ycHdAQ4U jMF46RP6">1 hour ago</span></div><h3 class="HNChVRGc"><a class="_1-FLFW4R" href="https://www.forbes.com/sites/antoniopequenoiv/2025/09/03/trump-administrations-22-billion-harvard-funding-freeze-ruled-unconstitutional/" rel="" target="_self">Trump Administration’s $2.2 Billion Harvard Funding Freeze Ruled Unconstitutional</a></h3><p class="_5v7prWgS"><span class="Ccg9Ib-7" style="-webkit-line-clamp:3"></span></p><div class="Na3NTglQ _51GWpoAk _0AGvFFWM"><p class="ujvJmzbB">By<a aria-label="Antonio Pequeño IV" class="_4tin10wS YbfXuVMn" href="https://www.forbes.com/sites/antoniopequenoiv/" target="_self" title="https://www.forbes.com/sites/antoniopequenoiv/">Antonio Pequeño IV</a><span class="S7tzPEZ-">,</span></p><p class="AtUb4gy8">Forbes Staff</p></div><div class="klKBDGvF"><button aria-label="save article" class="sElHJWe4 NQX0jJYe" data-theme="light" disabled="" type="button"><svg class="fs-icon fs-icon--saved-article"
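
The card printed above already contains the date, author, and author type, so those columns could be filled from the topic page itself. The sketch below is one possible approach, not the notebook's method. It uses a trimmed copy of the printed card and relies on tag structure rather than class names, since the class names look auto-generated and may change. The helper name `parse_card` is hypothetical, and the structural rules (first `span` is the timestamp, author links carry `aria-label`, the author type is the next `p` after the byline) are assumptions drawn from this single sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the card printed above
card_html = """
<div class="WjVFB823"><div class="IE8ecQMQ"><span class="ycHdAQ4U jMF46RP6">1 hour ago</span></div>
<h3 class="HNChVRGc"><a class="_1-FLFW4R" href="https://www.forbes.com/sites/antoniopequenoiv/2025/09/03/trump-administrations-22-billion-harvard-funding-freeze-ruled-unconstitutional/">Trump Administration’s $2.2 Billion Harvard Funding Freeze Ruled Unconstitutional</a></h3>
<p class="_5v7prWgS"><span class="Ccg9Ib-7" style="-webkit-line-clamp:3"></span></p>
<div class="Na3NTglQ _51GWpoAk _0AGvFFWM"><p class="ujvJmzbB">By<a aria-label="Antonio Pequeño IV" class="_4tin10wS YbfXuVMn" href="https://www.forbes.com/sites/antoniopequenoiv/">Antonio Pequeño IV</a><span class="S7tzPEZ-">,</span></p><p class="AtUb4gy8">Forbes Staff</p></div>
</div>
"""


def parse_card(card):
    """Extract one CSV row from a story card, based on the structure of the sample above."""
    title_link = card.find("h3").find("a")
    date_tag = card.find("span")  # assumption: the first span holds the timestamp
    # Assumption: author links are the ones that carry an aria-label attribute
    author_links = card.find_all("a", attrs={"aria-label": True})
    author_type = ""
    if author_links:
        # Assumption: the author type is the next <p> after the byline paragraph
        type_tag = author_links[0].find_parent("p").find_next_sibling("p")
        author_type = type_tag.get_text(strip=True) if type_tag else ""
    return {
        "Article Name": title_link.get_text(strip=True),
        "Author(s)": ", ".join(a.get_text(strip=True) for a in author_links),
        "Author Type(s)": author_type,
        "Date": date_tag.get_text(strip=True) if date_tag else "",
        "URL": title_link["href"],
    }


card = BeautifulSoup(card_html, "html.parser").find("div")
row = parse_card(card)
print(row)
```

Note that the date in this sample is relative ("1 hour ago"), so matching the "Aug 20, 2025" style of the target CSV would need a further conversion step.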