In [1]:
import os
import pandas as pd
import requests

In [2]:
# Install needed libraries (safe to re-run)
%pip install -q yfinance python-dotenv beautifulsoup4 lxml

# Make sure the data/raw folder exists
import pathlib  # os is already imported in the first cell
pathlib.Path("../data/raw").mkdir(parents=True, exist_ok=True)
print("Ready: data/raw exists ✅")

DEPRECATION: Building 'multitasking' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'multitasking'. Discussion can be found at https://github.com/pypa/pip/issues/6334
Note: you may need to restart the kernel to use updated packages.
Ready: data/raw exists ✅


In [3]:
# --- API INGESTION: yfinance fallback per homework ---
# PDF asks: choose an endpoint/ticker, pull data (API or yfinance), parse types, validate, save to data/raw/

from datetime import datetime
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv
import yfinance as yf

# 0) Load secrets if needed (safe even if you don't use any API keys here)
load_dotenv()  # looks for a local .env (not committed to GitHub)

# 1) Pick a ticker (you can change later if you want)
TICKER = "AAPL"

# 2) Pull recent daily data via yfinance (fallback allowed by the homework)
df_api = yf.download(TICKER, period="6mo", interval="1d", auto_adjust=False, progress=False)

# 3) Tidy: move Date out of index, ensure dtypes (dates/floats)
df_api = df_api.reset_index()  # brings 'Date' out of the index
df_api["Date"] = pd.to_datetime(df_api["Date"])  # parse to datetime

# 4) Validate required columns, NA counts, and shape (simple rules)
required_cols = ["Date", "Open", "High", "Low", "Close", "Volume"]
missing = [c for c in required_cols if c not in df_api.columns]
assert not missing, f"Missing required columns: {missing}"

na_counts = df_api[required_cols].isna().sum()
assert na_counts.sum() == 0, f"Found NAs in required columns:\n{na_counts}"

assert len(df_api) > 0, "No rows returned from API."

print("Validation passed ✅")
print("Rows:", len(df_api), "| Columns:", list(df_api.columns))

# 5) Save raw CSV to data/raw/ with timestamped, reproducible filename
ts = datetime.now().strftime("%Y%m%d-%H%M")
out_path = Path("../data/raw") / f"api_yfinance_{TICKER}_{ts}.csv"
df_api.to_csv(out_path, index=False)
print("Saved to:", out_path.resolve())

Validation passed ✅
Rows: 125 | Columns: [('Date', ''), ('Adj Close', 'AAPL'), ('Close', 'AAPL'), ('High', 'AAPL'), ('Low', 'AAPL'), ('Open', 'AAPL'), ('Volume', 'AAPL')]
Saved to: /Users/ivysingal/bootcamp_ivy_singal/data/raw/api_yfinance_AAPL_20250820-2218.csv
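The column list printed above shows that this yfinance version returned two-level (MultiIndex) columns such as `('Close', 'AAPL')`, which `to_csv` writes as two header rows. A minimal sketch of flattening them to single-level names before saving, assuming a single-ticker download; the `flatten_columns` helper and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose columns keep only the first MultiIndex level."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # ('Close', 'AAPL') -> 'Close'; only safe when a single ticker was requested
        out.columns = out.columns.get_level_values(0)
    return out

# Toy frame mimicking the two-level layout printed above (values are made up).
cols = pd.MultiIndex.from_tuples([("Date", ""), ("Close", "AAPL"), ("Volume", "AAPL")])
demo = pd.DataFrame([["2025-08-20", 1.0, 100]], columns=cols)

flat = flatten_columns(demo)
print(list(flat.columns))
```

With several tickers in one download the first level alone would no longer be unique, so this sketch should not be applied there as-is.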


In [4]:
# --- WEB SCRAPING: small table from Wikipedia (per homework) ---
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from pathlib import Path

# 1) Target page with a nice table
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

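resp = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(resp.text, "html.parser")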

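# 2) Find the constituents table by its id
table = soup.find("table", {"id": "constituents"})
assert table is not None, "Table with id='constituents' not found; page layout may have changed."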

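# 3) Parse table into DataFrame (wrap the HTML in StringIO; passing a literal
#    HTML string to read_html is deprecated in recent pandas versions)
from io import StringIO
df_scrape = pd.read_html(StringIO(str(table)))[0]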

# 4) Basic validation
print("First 5 rows:")
print(df_scrape.head())

# Ensure key columns exist
required_cols = ["Symbol", "Security", "GICS Sector"]
missing = [c for c in required_cols if c not in df_scrape.columns]
assert not missing, f"Missing required columns: {missing}"

# Drop NAs and check length
df_scrape = df_scrape.dropna(subset=required_cols)
assert len(df_scrape) > 0, "Scraped DataFrame is empty!"

print("Validation passed ✅")
print("Rows:", len(df_scrape), "| Columns:", list(df_scrape.columns))

# 5) Save raw CSV to data/raw/
ts = datetime.now().strftime("%Y%m%d-%H%M")
out_path = Path("../data/raw") / f"scraped_sp500_{ts}.csv"
df_scrape.to_csv(out_path, index=False)
print("Saved to:", out_path.resolve())

First 5 rows:
  Symbol             Security             GICS Sector  \
0    MMM                   3M             Industrials   
1    AOS          A. O. Smith             Industrials   
2    ABT  Abbott Laboratories             Health Care   
3   ABBV               AbbVie             Health Care   
4    ACN            Accenture  Information Technology   

                GICS Sub-Industry    Headquarters Location  Date added  \
0        Industrial Conglomerates    Saint Paul, Minnesota  1957-03-04   
1               Building Products     Milwaukee, Wisconsin  2017-07-26   
2           Health Care Equipment  North Chicago, Illinois  1957-03-04   
3                   Biotechnology  North Chicago, Illinois  2012-12-31   
4  IT Consulting & Other Services          Dublin, Ireland  2011-07-06   

       CIK      Founded  
0    66740         1902  
1    91142         1916  
2     1800         1888  
3  1551152  2013 (1888)  
4  1467373         1989  
Validation passed ✅
Rows: 503 | Columns: [



## Data Ingestion Documentation (Stage 04)

**Sources & Endpoints**
- API (fallback): yfinance daily prices for `AAPL` (last ~6 months).
- Web scrape: Wikipedia “List of S&P 500 companies” (table id = `constituents`).

**Parameters Used**
- API: period = 6mo, interval = 1d, auto_adjust = False.
- Scrape: single HTML table parsed with pandas.read_html over the `#constituents` table.

**Validation Logic**
- API: required columns = Date, Open, High, Low, Close, Volume; assert no NAs; assert >0 rows.
- Scrape: required columns = Symbol, Security, GICS Sector; drop NA in required cols; assert >0 rows.
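The two validation rules above differ only in how they treat NAs, so they could share one helper. A minimal sketch, assuming a hypothetical `validate_frame` function (not in the notebook) that raises `ValueError` instead of using `assert`, since assertions are skipped when Python runs with `-O`:

```python
import pandas as pd

def validate_frame(df, required_cols, drop_na=False):
    """Check required columns, handle NAs, and require at least one row."""
    required_cols = list(required_cols)
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if drop_na:
        # Scrape-style rule: discard rows with NAs in required columns.
        df = df.dropna(subset=required_cols)
    else:
        # API-style rule: any NA in a required column is an error.
        na_counts = df[required_cols].isna().sum()
        if na_counts.sum() > 0:
            raise ValueError(f"Found NAs in required columns:\n{na_counts}")
    if len(df) == 0:
        raise ValueError("No rows left after validation.")
    return df

# Toy data: the missing sector value is made up for illustration.
toy = pd.DataFrame({"Symbol": ["MMM", "AOS"], "GICS Sector": ["Industrials", None]})
clean = validate_frame(toy, ["Symbol", "GICS Sector"], drop_na=True)
print(len(clean))
```

Calling the same helper with `drop_na=False` on the toy frame raises, which mirrors the stricter API rule.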

**Saved Files**
- `data/raw/api_yfinance_AAPL_<YYYYMMDD-HHMM>.csv`
- `data/raw/scraped_sp500_<YYYYMMDD-HHMM>.csv`
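The naming pattern above can be captured in a small helper. This is a sketch only: `raw_path` and its parameters are hypothetical names, and the directory is written relative to the project root rather than the notebook folder.

```python
from datetime import datetime
from pathlib import Path

def raw_path(prefix, when=None, raw_dir=Path("data/raw")):
    """Build <raw_dir>/<prefix>_<YYYYMMDD-HHMM>.csv for the given timestamp."""
    when = when or datetime.now()
    return raw_dir / f"{prefix}_{when.strftime('%Y%m%d-%H%M')}.csv"

# Fixed timestamp so the result is deterministic.
example = raw_path("api_yfinance_AAPL", datetime(2025, 8, 20, 22, 18))
print(example.as_posix())  # data/raw/api_yfinance_AAPL_20250820-2218.csv
```

Because the timestamp has minute resolution, two runs within the same minute write to the same file and the later one overwrites the earlier.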

**.env & Reproducibility**
- `.env` (secrets) kept local, **not committed**.
- If an API requires a key, load via `dotenv` in code.
- Notebook contains sources, params, and validation steps as required.
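Loading a key from `.env` might look like the sketch below. The variable name `DATA_API_KEY` and the `get_api_key` helper are hypothetical; the yfinance path in this notebook needs no key. The import is guarded so the snippet still runs where `python-dotenv` is not installed.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None  # fall back to variables already set in the environment

def get_api_key(name="DATA_API_KEY", env_file=".env"):
    """Return the key from the environment, loading a local .env first if possible."""
    if load_dotenv is not None:
        # Reads KEY=value pairs from env_file if it exists; by default it does
        # not override variables that are already set in the environment.
        load_dotenv(dotenv_path=env_file)
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to {env_file} or the environment.")
    return key
```

Failing loudly when the key is absent keeps a missing secret from surfacing later as an opaque HTTP error.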

**Assumptions & Risks**
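- The Wikipedia page structure can change (table id or column names), which would break the scrape.
- Market data has no rows for weekends or exchange holidays, so gaps in the date sequence are expected and are not flagged by the validation.
- If an API rate-limits requests or a key expires, ingestion may fail; yfinance is used as a permitted fallback.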
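
The rate-limit risk above is commonly handled by retrying with exponential backoff. A minimal stdlib-only sketch: `with_retries`, its parameters, and `flaky_fetch` are hypothetical, and in practice the wrapped callable would be the download call.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting base_delay * 2**i between tries."""
    last_error = None
    for i in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if i < attempts - 1:
                sleep(base_delay * 2 ** i)  # 1s, 2s, 4s, ... with the defaults
    raise RuntimeError(f"All {attempts} attempts failed") from last_error

# Simulated source that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Some clients may return an empty result rather than raise on failure; in that case the wrapped callable would need to raise on an empty frame for the retry to trigger.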