## Setup & Configuration

# SEC EDGAR S-1 Filing Downloader

**IMPORTANT: Run this notebook on your LOCAL computer, NOT in QuantConnect!**

QuantConnect has network restrictions that block SEC EDGAR access. This notebook downloads S-1 filings to your local machine; you can then upload them to QuantConnect if needed.

## What This Notebook Does

1. Reads IPO calendar CSV with ticker symbols
2. Searches SEC EDGAR for each company's S-1 filing
3. Downloads the S-1 HTML document
4. Saves files locally for parsing
5. Generates download report

## Requirements

```bash
pip install pandas requests beautifulsoup4 lxml tqdm
```

## SEC Requirements

The SEC requires you to:
- Identify yourself with a contact email address in the User-Agent header
- Limit to 10 requests per second
- Not overwhelm their servers

**You MUST update your email below before running!**
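
The two concrete rules above (an identifying User-Agent and a request-rate cap) can be sketched as follows. This is a minimal illustration, not the notebook's own helper: `Throttle` and `sec_headers` are hypothetical names, and the throttle assumes single-threaded use.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second of them happen per second."""

    def __init__(self, max_per_second: float = 10.0):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        """Sleep just long enough to keep min_interval between consecutive calls."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def sec_headers(name: str, email: str) -> dict:
    """Build a User-Agent header that identifies the requester (hypothetical helper)."""
    if "@" not in email:
        raise ValueError("email must be a real contact address")
    return {"User-Agent": f"{name} {email}"}
```

Calling `throttle.wait()` before each request keeps the rate under the cap regardless of how long each request takes, which a fixed sleep after every request does not guarantee in the other direction (it only ever slows things down further).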

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import os
from pathlib import Path
import json
from datetime import datetime
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")
print(f"Current directory: {os.getcwd()}")

✓ Imports successful
Current directory: /Users/leole/workspace/HandsOnAITradingBook/IPO Pre-Analysis Day-1 Trading Strategy




In [None]:
# CONFIGURATION - UPDATE THESE!

# REQUIRED: Your email address (SEC requirement)
YOUR_EMAIL = "your.email@example.com"  # ⚠️ CHANGE THIS!
YOUR_NAME = "Gnu Led LLC"      # ⚠️ CHANGE THIS!

# Input/Output paths
INPUT_CSV = "data/ipo_calendar.csv"     # CSV with 'ticker' column
OUTPUT_DIR = "data/s1_filings/"         # Where to save S-1 files

# Download settings
DELAY_BETWEEN_REQUESTS = 0.5            # Seconds (SEC allows 10/sec, we use 2/sec)
MAX_RETRIES = 3                         # Retry failed downloads
TIMEOUT = 30                            # Request timeout (seconds)

# Validation
if "your.email@example.com" in YOUR_EMAIL:
    print("❌ ERROR: Please update YOUR_EMAIL with your actual email!")
    print("The SEC requires identification in the User-Agent header.")
    print("Requests without a declared User-Agent may be blocked.")
else:
    print("✓ Configuration valid")
    print(f"User-Agent: {YOUR_NAME} {YOUR_EMAIL}")
    print(f"Input: {INPUT_CSV}")
    print(f"Output: {OUTPUT_DIR}")

## Helper Functions

In [None]:
def get_headers():
    """Generate SEC-compliant headers."""
    return {
        'User-Agent': f"{YOUR_NAME} {YOUR_EMAIL}",
        'Accept-Encoding': 'gzip, deflate',
        'Host': 'www.sec.gov'
    }

def make_request(url, retries=MAX_RETRIES):
    """Make HTTP request with retry logic."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=get_headers(), timeout=TIMEOUT)
            response.raise_for_status()
            time.sleep(DELAY_BETWEEN_REQUESTS)
            return response
        except requests.exceptions.RequestException as e:
            if attempt == retries - 1:
                raise e
            time.sleep(2 ** attempt)  # Exponential backoff
    return None

print("✓ Helper functions defined")

In [None]:
def get_cik_from_ticker(ticker):
    """
    Get CIK (Central Index Key) from ticker symbol.
    
    SEC uses CIK numbers instead of ticker symbols internally.
    """
    # Pass the ticker in the CIK parameter: `company=` searches by company name, not ticker
    search_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={ticker}"
    
    try:
        response = make_request(search_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Look for CIK in the company name span
        company_info = soup.find('span', {'class': 'companyName'})
        
        if company_info:
            text = company_info.get_text()
            if 'CIK#:' in text:
                # Extract CIK like "CIK#: 0001234567"
                cik = text.split('CIK#:')[1].split()[0].strip()
                return cik
        
        return None
        
    except Exception as e:
        print(f"  Error getting CIK for {ticker}: {e}")
        return None

# Test
test_cik = get_cik_from_ticker("AAPL")
print(f"✓ CIK lookup working (AAPL CIK: {test_cik})")
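
The lookup above scrapes the EDGAR search page. As an alternative, the SEC publishes a ticker-to-CIK mapping at `https://www.sec.gov/files/company_tickers.json`. The sketch below assumes that payload is a JSON object whose values carry `cik_str` and `ticker` fields; verify the live file before relying on it. `build_ticker_map` is a hypothetical helper, and the sample data stands in for the downloaded file.

```python
import json


def build_ticker_map(payload: dict) -> dict:
    """Map upper-cased ticker -> 10-digit, zero-padded CIK string."""
    mapping = {}
    for entry in payload.values():
        ticker = str(entry["ticker"]).upper()
        mapping[ticker] = str(entry["cik_str"]).zfill(10)
    return mapping


# Two entries shaped like the assumed file format (not the full file).
sample = json.loads(
    '{"0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},'
    ' "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"}}'
)
ticker_to_cik = build_ticker_map(sample)
print(ticker_to_cik["AAPL"])  # 0000320193
```

Building the map once and reusing it would replace one HTTP request per ticker with a single download for the whole run.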

In [None]:
def find_s1_url(ticker, cik=None):
    """
    Find the URL for the main S-1 filing document.
    
    Returns:
        URL of S-1 HTML document, or None if not found
    """
    # Get CIK if not provided
    if not cik:
        cik = get_cik_from_ticker(ticker)
        if not cik:
            return None
    
    # Search for S-1 filings
    search_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=S-1&dateb=&owner=exclude&count=10"
    
    try:
        response = make_request(search_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find filings table
        filings_table = soup.find('table', {'class': 'tableFile2'})
        
        if not filings_table:
            return None
        
        # Get first filing (most recent)
        rows = filings_table.find_all('tr')[1:]  # Skip header
        
        if not rows:
            return None
        
        # Find documents link in first row
        first_row = rows[0]
        docs_button = first_row.find('a', {'id': 'documentsbutton'})
        
        if not docs_button:
            return None
        
        # Go to documents page
        docs_url = 'https://www.sec.gov' + docs_button['href']
        response = make_request(docs_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find document table
        doc_table = soup.find('table', {'class': 'tableFile'})
        
        if not doc_table:
            return None
        
        # Find main S-1 document
        for row in doc_table.find_all('tr')[1:]:
            cols = row.find_all('td')
            
            if len(cols) >= 4:
                doc_type = cols[3].get_text().strip()
                
                # Look for S-1 or S-1/A document
                if 'S-1' in doc_type:
                    doc_link = cols[2].find('a')
                    
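                    if doc_link and doc_link.get('href'):
                        href = doc_link['href']
                        # Inline XBRL filings link to the viewer ("/ix?doc=/Archives/...");
                        # strip that prefix so we fetch the raw document instead
                        if href.startswith('/ix?doc='):
                            href = href[len('/ix?doc='):]
                        doc_url = 'https://www.sec.gov' + href
                        return doc_url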
        
        return None
        
    except Exception as e:
        print(f"  Error finding S-1 for {ticker}: {e}")
        return None

# Test
test_url = find_s1_url("RDDT")
if test_url:
    print(f"✓ S-1 URL finder working")
    print(f"  Example URL: {test_url[:80]}...")
else:
    print("⚠️  S-1 finder test failed (RDDT might not have S-1, or network issue)")

In [None]:
def download_s1(ticker, output_dir, cik=None):
    """
    Download S-1 filing and save as HTML.
    
    Returns:
        dict with status, filepath, and any error message
    """
    result = {
        'ticker': ticker,
        'success': False,
        'filepath': None,
        'cik': cik,
        'error': None,
        'url': None
    }
    
    try:
        # Check if already downloaded
        output_path = os.path.join(output_dir, f"{ticker}_s1.html")
        if os.path.exists(output_path):
            result['success'] = True
            result['filepath'] = output_path
            result['error'] = 'Already exists (skipped)'
            return result
        
        # Find S-1 URL
        url = find_s1_url(ticker, cik)
        
        if not url:
            result['error'] = 'S-1 filing not found'
            return result
        
        result['url'] = url
        
        # Download document
        response = make_request(url)
        
        # Save to file
        os.makedirs(output_dir, exist_ok=True)
        
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(response.text)
        
        # Get file size
        file_size = os.path.getsize(output_path) / 1024  # KB
        
        result['success'] = True
        result['filepath'] = output_path
        result['file_size_kb'] = file_size
        
        return result
        
    except Exception as e:
        result['error'] = str(e)
        return result

print("✓ Download function defined")

## Load IPO Calendar

In [None]:
# Load IPO calendar
if not os.path.exists(INPUT_CSV):
    print(f"❌ Input file not found: {INPUT_CSV}")
    print("\nCreating sample IPO calendar...")
    
    # Create sample data
    sample_ipos = [
        {'ticker': 'ARM', 'company': 'Arm Holdings', 'ipo_date': '2023-09-14', 'sector': 'Technology'},
        {'ticker': 'RDDT', 'company': 'Reddit Inc', 'ipo_date': '2024-03-21', 'sector': 'Technology'},
        {'ticker': 'CART', 'company': 'Instacart', 'ipo_date': '2023-09-19', 'sector': 'Technology'},
        {'ticker': 'BIRK', 'company': 'Birkenstock', 'ipo_date': '2023-10-11', 'sector': 'Consumer'},
        {'ticker': 'KVUE', 'company': 'Kenvue Inc', 'ipo_date': '2023-05-04', 'sector': 'Healthcare'},
    ]
    
    df = pd.DataFrame(sample_ipos)
    os.makedirs('data', exist_ok=True)
    df.to_csv(INPUT_CSV, index=False)
    print(f"✓ Created sample calendar with {len(df)} IPOs")
else:
    df = pd.read_csv(INPUT_CSV)
    print(f"✓ Loaded {len(df)} IPOs from {INPUT_CSV}")

# Validate
if 'ticker' not in df.columns:
    print("❌ ERROR: CSV must have 'ticker' column")
else:
    print(f"\nIPOs to download:")
    print(df[['ticker', 'company'] if 'company' in df.columns else ['ticker']].head(10))
    
    if len(df) > 10:
        print(f"... and {len(df) - 10} more")

## Download S-1 Filings

This will:
1. Loop through all tickers
2. Search SEC EDGAR for S-1 filing
3. Download and save HTML
4. Show progress bar
5. Generate report

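**Estimated time:** ~30 seconds per IPO as a conservative budget. Each ticker makes up to four requests (CIK lookup, filing search, document index, download), each followed by the configured delay.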
- 10 IPOs: ~5 minutes
- 50 IPOs: ~25 minutes
- 100 IPOs: ~50 minutes
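
The figures above are plain arithmetic on the 30-seconds-per-IPO budget. A small sketch with hypothetical helper names; the default of four requests per IPO is an assumption read off the notebook's functions (CIK lookup, filing search, document index, download).

```python
def estimate_minutes(n_ipos: int, seconds_per_ipo: float = 30.0) -> float:
    """Rough total run time in minutes, using the per-IPO time budget."""
    return n_ipos * seconds_per_ipo / 60


def throttle_floor_seconds(n_ipos: int, requests_per_ipo: int = 4,
                           delay: float = 0.5) -> float:
    """Lower bound from the deliberate delay alone (ignores network time and retries)."""
    return n_ipos * requests_per_ipo * delay


for n in (10, 50, 100):
    print(f"{n} IPOs: ~{estimate_minutes(n):.0f} minutes")
```

Comparing the two numbers shows how much of the budget is headroom for slow responses and retries rather than rate limiting itself.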

In [None]:
# Optional: Limit number for testing
LIMIT = None  # Set to 5 for testing, None for all

tickers = df['ticker'].tolist()
if LIMIT:
    tickers = tickers[:LIMIT]
    print(f"⚠️  Limited to first {LIMIT} tickers for testing")

print(f"\nDownloading S-1 filings for {len(tickers)} IPOs...")
print(f"Estimated time: ~{len(tickers) * 30 / 60:.0f} minutes")
print(f"Output directory: {OUTPUT_DIR}\n")

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download each filing
results = []
start_time = time.time()

for ticker in tqdm(tickers, desc="Downloading"):
    result = download_s1(ticker, OUTPUT_DIR)
    results.append(result)
    
    # Show status in progress bar
    status = "✓" if result['success'] else "✗"
    # 'error' is always present (None on success), so .get() with a default never falls back
    tqdm.write(f"{status} {ticker:6s} - {result['error'] or 'Downloaded'}")

elapsed = time.time() - start_time

# Create results dataframe
df_results = pd.DataFrame(results)

print(f"\n{'='*60}")
print("Download Complete!")
print(f"{'='*60}")
print(f"Time elapsed: {elapsed/60:.1f} minutes")
print(f"Average: {elapsed/max(len(tickers), 1):.1f} seconds per IPO")

## Download Report

In [None]:
# Summary statistics
total = len(df_results)
successful = df_results['success'].sum()
failed = total - successful
success_rate = successful / total * 100 if total else 0.0  # avoid division by zero on an empty run

print(f"\n📊 Summary Statistics")
print(f"{'-'*40}")
print(f"Total IPOs:        {total}")
print(f"Successful:        {successful} ({success_rate:.1f}%)")
print(f"Failed:            {failed} ({100-success_rate:.1f}%)")

# File sizes
if 'file_size_kb' in df_results.columns:
    downloaded = df_results[df_results['file_size_kb'].notna()]
    if len(downloaded) > 0:
        print(f"\nAverage file size: {downloaded['file_size_kb'].mean():.0f} KB")
        print(f"Total downloaded:  {downloaded['file_size_kb'].sum()/1024:.1f} MB")

# Error breakdown
if failed > 0:
    print(f"\n❌ Failed Downloads ({failed}):")
    print(f"{'-'*40}")
    
    failed_df = df_results[~df_results['success']]
    error_counts = failed_df['error'].value_counts()
    
    for error, count in error_counts.items():
        print(f"  {error}: {count}")
    
    print("\nFailed tickers:")
    for _, row in failed_df.iterrows():
        print(f"  - {row['ticker']:6s}: {row['error']}")

# Success examples
if successful > 0:
    print(f"\n✓ Successfully Downloaded ({min(5, successful)} examples):")
    print(f"{'-'*40}")
    
    success_df = df_results[df_results['success']]
    for _, row in success_df.head(5).iterrows():
        size = f"{row['file_size_kb']:.0f} KB" if 'file_size_kb' in row and pd.notna(row['file_size_kb']) else "N/A"
        print(f"  {row['ticker']:6s} - {size}")

## Save Download Log

In [None]:
# Save detailed log
log_file = os.path.join(OUTPUT_DIR, '_download_log.csv')
df_results.to_csv(log_file, index=False)

print(f"\n📄 Download log saved: {log_file}")

# Save summary report
summary_file = os.path.join(OUTPUT_DIR, '_download_summary.txt')
with open(summary_file, 'w') as f:
    f.write("SEC EDGAR S-1 Download Report\n")
    f.write("="*60 + "\n\n")
    f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"Total IPOs: {total}\n")
    f.write(f"Successful: {successful} ({success_rate:.1f}%)\n")
    f.write(f"Failed: {failed}\n")
    f.write(f"Time: {elapsed/60:.1f} minutes\n\n")
    
    if failed > 0:
        f.write("Failed Tickers:\n")
        f.write("-"*40 + "\n")
        for _, row in df_results[~df_results['success']].iterrows():
            f.write(f"{row['ticker']}: {row['error']}\n")

print(f"📄 Summary report saved: {summary_file}")

print(f"\n✓ All files saved to: {OUTPUT_DIR}")

## Verify Downloads

In [None]:
# List all downloaded files
s1_files = [f for f in os.listdir(OUTPUT_DIR) if f.endswith('_s1.html')]

print(f"\n📁 Files in {OUTPUT_DIR}:")
print(f"{'-'*60}")
print(f"Total S-1 HTML files: {len(s1_files)}")

if len(s1_files) > 0:
    print(f"\nFirst 10 files:")
    for filename in sorted(s1_files)[:10]:
        filepath = os.path.join(OUTPUT_DIR, filename)
        size = os.path.getsize(filepath) / 1024  # KB
        print(f"  {filename:30s} {size:8.0f} KB")
    
    if len(s1_files) > 10:
        print(f"  ... and {len(s1_files) - 10} more files")

## Preview Sample S-1 (Optional)

In [None]:
# Preview first downloaded S-1
if s1_files:
    sample_file = os.path.join(OUTPUT_DIR, s1_files[0])
    
    print(f"\n📄 Preview of {s1_files[0]}:")
    print(f"{'-'*60}")
    
    with open(sample_file, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Parse with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    text = soup.get_text()
    
    # Show first 1000 characters
    preview = text[:1000].strip()
    print(preview)
    print("\n...")
    print(f"\n✓ Parsed {len(text)} characters of text from the file")
else:
    print("No S-1 files to preview")

## Next Steps

### What to Do Now:

1. **Review failed downloads:**
   - Check `_download_log.csv` for errors
   - Manually download missing S-1s from SEC EDGAR
   - Some companies may not have S-1s (foreign issuers use F-1)

2. **Parse fundamentals:**
   - Go to `data_collection.ipynb` Step 3
   - Extract financial data from S-1 HTML files
   - Use manual entry template for accuracy

3. **Upload to QuantConnect (if needed):**
   - Zip the `data/s1_filings/` folder
   - Upload to your QuantConnect project
   - Reference in data_collection.ipynb
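
For the upload step, the folder can be zipped from Python instead of by hand. A minimal sketch using the standard library; `zip_filings` and the archive name are hypothetical, and the archive should be written outside the folder being zipped.

```python
import shutil
from pathlib import Path


def zip_filings(folder: str = "data/s1_filings", archive_base: str = "s1_filings") -> str:
    """Create <archive_base>.zip from the contents of folder and return its path."""
    source = Path(folder)
    if not source.is_dir():
        raise FileNotFoundError(f"No such folder: {source}")
    # root_dir makes paths inside the archive relative to the filings folder
    return shutil.make_archive(archive_base, "zip", root_dir=source)
```

Calling `zip_filings()` from the notebook's working directory would produce `s1_filings.zip` next to the `data/` folder.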

### Manual Download for Failed IPOs:

For any failed downloads:
1. Visit: `https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=TICKER&type=S-1` (replace `TICKER` with the symbol)
2. Find most recent S-1 or S-1/A filing
3. Click "Documents" button
4. Download the main .htm file
5. Save as `TICKER_s1.html` in the output directory
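
To avoid typing URLs by hand for each failed ticker, the search link and target filename can be generated. This sketch assumes the EDGAR company browser accepts a ticker in its `CIK` parameter together with `action=getcompany`; both helper names are hypothetical.

```python
from urllib.parse import urlencode


def manual_search_url(ticker: str, form_type: str = "S-1") -> str:
    """Build an EDGAR company-browser URL for a ticker (assumed query parameters)."""
    query = urlencode({"action": "getcompany", "CIK": ticker.upper(), "type": form_type})
    return f"https://www.sec.gov/cgi-bin/browse-edgar?{query}"


def expected_filename(ticker: str) -> str:
    """Filename the downloader looks for when deciding whether to skip a ticker."""
    return f"{ticker}_s1.html"


print(manual_search_url("RDDT"))
print(expected_filename("RDDT"))
```

Passing `form_type="F-1"` gives the equivalent search for foreign issuers.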

### Success Criteria:

- ✓ 80%+ success rate is excellent
- ✓ Files should be 100-500 KB each (typical S-1 size)
- ✓ Can open HTML files in browser and see financial tables
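
The first two criteria can be checked automatically against the result dictionaries produced by the download step. A sketch under stated assumptions: `check_downloads` is a hypothetical helper, and its default thresholds simply mirror the figures listed above.

```python
def check_downloads(results: list, min_rate: float = 0.8,
                    size_range_kb: tuple = (100, 500)) -> dict:
    """Summarise success rate and flag downloads whose size falls outside the range."""
    total = len(results)
    ok = [r for r in results if r.get("success")]
    rate = len(ok) / total if total else 0.0
    low, high = size_range_kb
    unusual = [
        r["ticker"] for r in ok
        if r.get("file_size_kb") is not None and not (low <= r["file_size_kb"] <= high)
    ]
    return {"success_rate": rate, "meets_target": rate >= min_rate, "unusual_size": unusual}
```

A flagged ticker is not necessarily a bad download; it is only a prompt to open the file and confirm it contains the filing rather than an error page.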

### Troubleshooting:

**No S-1 found:**
- Company may have filed F-1 (foreign issuers)
- Check if company went public via SPAC (different filings)
- Some direct listings don't have S-1s

**Rate limit errors:**
- Increase DELAY_BETWEEN_REQUESTS to 1.0 seconds
- Run during off-peak hours

**Connection errors:**
- Check internet connection
- SEC EDGAR may be down (rare)
- Try again later
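
The three troubleshooting cases above can be matched against the `error` strings in the download log. This is a heuristic sketch only: `troubleshooting_hint` is a hypothetical helper, and treating HTTP 429 or 403 in the message as a rate-limit signal is an assumption, not documented SEC behavior.

```python
def troubleshooting_hint(error) -> str:
    """Map a logged error value (None or a string) to one of the hints above."""
    if not error:
        return "ok"
    text = str(error).lower()
    if text.startswith("already exists"):
        return "ok"
    if text == "s-1 filing not found":
        return "no S-1 found: check for F-1 or SPAC filings"
    if "429" in text or "403" in text:
        return "possible rate limit: increase DELAY_BETWEEN_REQUESTS"
    if "connection" in text or "timed out" in text or "timeout" in text:
        return "connection problem: check network and retry later"
    return "unclassified: inspect the log entry"
```

Anything that lands in the unclassified bucket still needs a manual look at the log entry.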

In [None]:
print("\n" + "="*60)
print("🎉 S-1 DOWNLOAD COMPLETE!")
print("="*60)

print(f"\n✓ Downloaded {successful}/{total} S-1 filings")
print(f"✓ Saved to: {os.path.abspath(OUTPUT_DIR)}")
print(f"✓ Log file: {log_file}")

if failed > 0:
    print(f"\n⚠️  {failed} downloads failed - see log for details")
    print(f"You can manually download these from SEC EDGAR")

print(f"\n📋 Next: Extract fundamentals from S-1 files")
print(f"   → Open data_collection.ipynb Step 3")
print(f"   → Or manually enter data into CSV template")