# Stage 04 — Scrape a Small Table

Repo: `notebooks/stage04_scrape.ipynb`

## Task
- Public page with a simple table
- Parse with BeautifulSoup → build DataFrame
- Validate numeric/text columns
- Save raw CSV to `data/raw/`

In [1]:
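import os
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

RAW_DIR = 'data/raw'
os.makedirs(RAW_DIR, exist_ok=True)

# Example target (you can change it):
URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
soup = BeautifulSoup(resp.text, 'lxml')

# Select the first table with class 'wikitable' and let pandas parse it.
# Wrap the markup in StringIO: passing a literal HTML string to read_html is deprecated.
table = soup.select_one('table.wikitable')
df = pd.read_html(StringIO(str(table)))[0]
print('Rows x Cols:', df.shape)
df.head()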

Rows x Cols: (503, 8)


|   | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|--------|----------|-------------|-------------------|-----------------------|------------|-----|---------|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |
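
The cell above hands the selected table to `pd.read_html`. The task list also mentions parsing with BeautifulSoup and building the DataFrame from it; a minimal sketch of that route follows. It assumes a small inline snippet (`SAMPLE_HTML`, a hypothetical stand-in for the downloaded page) and a helper named `table_to_df`, neither of which is part of the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; values copied from the output above.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Symbol</th><th>Security</th><th>CIK</th></tr>
  <tr><td>MMM</td><td>3M</td><td>66740</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td><td>91142</td></tr>
</table>
"""


def table_to_df(html: str, selector: str = "table.wikitable") -> pd.DataFrame:
    """Build a DataFrame of strings from the first table matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib parser; 'lxml' also works if installed
    table = soup.select_one(selector)
    if table is None:
        raise ValueError(f"no table matches {selector!r}")
    rows = table.find_all("tr")
    # First row holds the column names, remaining rows hold the data cells.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    body = [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in rows[1:]
    ]
    return pd.DataFrame(body, columns=header)


sample_df = table_to_df(SAMPLE_HTML)
sample_df["CIK"] = pd.to_numeric(sample_df["CIK"])  # cells arrive as text
print(sample_df)
```

Unlike `pd.read_html`, this version does no type inference and ignores `rowspan`/`colspan`, so it only suits simple rectangular tables.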


In [3]:
# --- Validate & save ---
print(df.dtypes)
print('\nNA counts:')
print(df.isna().sum())

out_path = os.path.join(RAW_DIR, 'sp500_constituents.csv')
df.to_csv(out_path, index=False)
print('Saved:', out_path)

Symbol                   object
Security                 object
GICS Sector              object
GICS Sub-Industry        object
Headquarters Location    object
Date added               object
CIK                       int64
Founded                  object
dtype: object

NA counts:
Symbol                   0
Security                 0
GICS Sector              0
GICS Sub-Industry        0
Headquarters Location    0
Date added               0
CIK                      0
Founded                  0
dtype: int64
Saved: data/raw/sp500_constituents.csv
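
The dtype listing shows `Date added` and `Founded` stored as `object`, and the sample rows include a `Founded` value of `2013 (1888)`. One possible way to turn the printed checks into explicit validation is sketched below on three rows copied from the output above. The helper name `validate_constituents`, the derived columns `date_added` and `founded_year`, and the `%Y-%m-%d` date format (inferred from the sample rows only) are assumptions, not part of the notebook.

```python
import pandas as pd

# Three rows copied from the df.head() output above.
sample = pd.DataFrame({
    "Symbol": ["MMM", "ABT", "ABBV"],
    "Date added": ["1957-03-04", "1957-03-04", "2012-12-31"],
    "CIK": [66740, 1800, 1551152],
    "Founded": ["1902", "1888", "2013 (1888)"],
})


def validate_constituents(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with typed helper columns; raise ValueError on bad data."""
    out = df.copy()
    # Text column: tickers should be non-empty and unique.
    symbols = out["Symbol"].astype(str).str.strip()
    if (symbols == "").any() or symbols.duplicated().any():
        raise ValueError("empty or duplicate Symbol values")
    # Numeric column: CIK should be a positive integer.
    if not pd.api.types.is_integer_dtype(out["CIK"]) or (out["CIK"] <= 0).any():
        raise ValueError("CIK must be a positive integer")
    # Dates: errors='coerce' turns unparseable strings into NaT so they can be detected.
    out["date_added"] = pd.to_datetime(
        out["Date added"], format="%Y-%m-%d", errors="coerce"
    )
    if out["date_added"].isna().any():
        raise ValueError("unparseable 'Date added' values")
    # Founded may carry a note such as '2013 (1888)'; keep the first 4-digit year.
    year = out["Founded"].astype(str).str.extract(r"(\d{4})", expand=False)
    out["founded_year"] = pd.to_numeric(year, errors="coerce").astype("Int64")
    return out


checked = validate_constituents(sample)
print(checked.dtypes)
```

The helper works on a copy, so the frame written to `data/raw/` stays exactly as scraped and the typed columns can be added in a later stage.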
