# 01 - Data Acquisition

This notebook walks you through acquiring US trade data from USITC DataWeb and checking that the supporting reference data (GDP deflator, country/region mapping) is in place.

## Data Sources

| Source | Coverage | URL |
|--------|----------|-----|
| **USITC DataWeb** (Primary) | 1989-Present | https://dataweb.usitc.gov/ |
| **FRED GDP Deflator** | 1947-Present | https://fred.stlouisfed.org/series/GDPDEF |

## Step 1: Setup Environment

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys

# Define paths (assumes the notebook runs from a directory directly under the project root)
PROJECT_ROOT = Path.cwd().parent
DATA_RAW = PROJECT_ROOT / 'data' / 'raw' / 'usitc'
DATA_REFERENCE = PROJECT_ROOT / 'data' / 'reference'

# Add src to path for imports
sys.path.insert(0, str(PROJECT_ROOT / 'src'))

print(f"Project root: {PROJECT_ROOT}")
print(f"Raw data dir: {DATA_RAW}")
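
Step 4 reports an error when the raw data directory does not exist yet. A minimal sketch of creating the folders up front is shown here; `ensure_dirs` is a hypothetical helper, not part of this project's `src` package.

```python
from pathlib import Path

def ensure_dirs(*dirs):
    """Create each directory (and any missing parents); return them as Path objects."""
    created = []
    for d in dirs:
        p = Path(d)
        # exist_ok=True makes the call safe to repeat on every notebook run
        p.mkdir(parents=True, exist_ok=True)
        created.append(p)
    return created

# In this notebook the call would be (names from the cell above):
# ensure_dirs(DATA_RAW, DATA_REFERENCE)
```

This only creates empty folders; the reference CSVs and trade files still have to be placed in them.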

## Step 2: Download Data from USITC DataWeb

### Manual Download Process

1. **Go to** https://dataweb.usitc.gov/
2. **Sign in** (create a free account if needed)
3. **Click** "Imports: For Consumption" (or "Exports: Total" for export data)
4. **Configure query:**

   **Step 1: Trade Flow** - Already selected (Imports or Exports)
   
   **Step 2: Time Period**
   - Select "Annual"
   - Start Year: 1995
   - End Year: 2025 (or latest available)
   
   **Step 3: Commodity Classification**
   - For country-level analysis: Select "All Commodities" or use NAICS 2-digit for sector breakdown
   
   **Step 4: Geography**
   - Select "All Countries" or choose specific countries
   
   **Step 5: Measure**
   - Select "Customs Value" (for imports) or "FAS Value" (for exports)
   
   **Step 6: Disaggregation**
   - Check "Country" to get country-level breakdown
   - Optionally check "NAICS" for sector breakdown
   
5. **Download** as CSV

### Recommended Downloads

| File | Description | Query Settings |
|------|-------------|----------------|
| `imports_by_country_1995_2025.csv` | Annual imports by country | All commodities, All countries, 1995-2025 |
| `exports_by_country_1995_2025.csv` | Annual exports by country | All commodities, All countries, 1995-2025 |
| `imports_by_country_naics_1995_2025.csv` | Imports by country and sector | NAICS 2-digit, All countries, 1995-2025 |
| `exports_by_country_naics_1995_2025.csv` | Exports by country and sector | NAICS 2-digit, All countries, 1995-2025 |
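
To see which of the recommended files are still missing, something like the following could be used. It is a sketch under the assumption that you saved the downloads under exactly the file names in the table; `missing_downloads` is a hypothetical helper.

```python
from pathlib import Path

# File names copied from the "Recommended Downloads" table
RECOMMENDED_FILES = [
    "imports_by_country_1995_2025.csv",
    "exports_by_country_1995_2025.csv",
    "imports_by_country_naics_1995_2025.csv",
    "exports_by_country_naics_1995_2025.csv",
]

def missing_downloads(raw_dir, expected=RECOMMENDED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]

# Same location as DATA_RAW in Step 1, rebuilt here so the snippet stands alone
missing = missing_downloads(Path.cwd().parent / "data" / "raw" / "usitc")
print("Missing:", missing if missing else "none")
```

If you use a different end year or naming scheme, pass your own list as `expected`.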

## Alternative: Manual Download

If an automated (API) download is unavailable or fails, use the manual process from Step 2. In short:

1. Go to https://dataweb.usitc.gov/
2. Sign in with your account
3. Click "Imports: For Consumption"
4. Configure: Annual, 1995-2025 (or latest available), All Countries, Disaggregate by Country
5. Download as CSV
6. Repeat for Exports

## Step 3: Verify Reference Data

Check that reference data files are in place.

In [None]:
# Check GDP deflator data
deflator_path = DATA_REFERENCE / 'gdp_deflator.csv'
if deflator_path.exists():
    deflator_df = pd.read_csv(deflator_path)
    print(f"GDP Deflator data loaded: {len(deflator_df)} years")
    print(f"Years covered: {deflator_df['year'].min()} - {deflator_df['year'].max()}")
    display(deflator_df.head(10))
else:
    print(f"WARNING: GDP deflator file not found at {deflator_path}")
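
Beyond printing the minimum and maximum year, it can help to confirm that no year in the study period is missing. The sketch below assumes only the `year` column used in the cell above; `missing_deflator_years` is a hypothetical helper, and the `deflator` column in the example is a made-up placeholder.

```python
import pandas as pd

def missing_deflator_years(deflator_df, start_year=1995, end_year=2025):
    """Return the years in [start_year, end_year] that are absent from the 'year' column."""
    years = pd.to_numeric(deflator_df["year"], errors="coerce").dropna().astype(int)
    available = set(years)
    return [y for y in range(start_year, end_year + 1) if y not in available]

# Tiny made-up frame for illustration only; the values are not real deflator data
example = pd.DataFrame({"year": [1995, 1996, 1998], "deflator": [1.0, 1.0, 1.0]})
print(missing_deflator_years(example, 1995, 1998))  # -> [1997]
```

The most recent year may legitimately be missing if the source has not published it yet, so treat a gap at the end of the range differently from one in the middle.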

In [None]:
# Check country/region reference data
regions_path = DATA_REFERENCE / 'country_regions.csv'
if regions_path.exists():
    regions_df = pd.read_csv(regions_path)
    print(f"Country/Region data loaded: {len(regions_df)} countries")
    display(regions_df.head(10))
else:
    print(f"WARNING: Country regions file not found at {regions_path}")

## Step 4: Verify Downloaded Trade Data

In [None]:
# List files in raw data directory
print("Files in raw data directory:")
if DATA_RAW.exists():
    files = list(DATA_RAW.glob('*.*'))
    data_files = [f for f in files if f.suffix.lower() in ['.csv', '.xlsx', '.xls']]
    if data_files:
        for f in data_files:
            size_mb = f.stat().st_size / (1024 * 1024)
            print(f"  {f.name} ({size_mb:.2f} MB)")
    else:
        print("  No data files found. Please download from USITC DataWeb.")
else:
    print(f"  Directory does not exist: {DATA_RAW}")

In [None]:
# If data files exist, preview them (sorted so the same files are previewed on every run)
csv_files = sorted(DATA_RAW.glob('*.csv'))
if csv_files:
    for csv_file in csv_files[:2]:  # Preview first 2 files
        print(f"\n=== {csv_file.name} ===")
        df = pd.read_csv(csv_file, nrows=5)
        print(f"Columns: {list(df.columns)}")
        display(df)
else:
    print("No CSV files found yet. Download data from USITC DataWeb first.")
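
The preview above covers only the first two files. To compare column names across every download (for example, to spot an import file and an export file with different layouts), a sketch like this may be useful. `csv_headers` is a hypothetical helper and assumes the header is the first row of each file.

```python
from pathlib import Path
import pandas as pd

def csv_headers(raw_dir):
    """Map each CSV file name in raw_dir to its list of column names."""
    headers = {}
    for csv_file in sorted(Path(raw_dir).glob("*.csv")):
        # nrows=0 parses only the header row, so large files stay cheap to inspect
        headers[csv_file.name] = list(pd.read_csv(csv_file, nrows=0).columns)
    return headers

# In this notebook the call would be:
# for name, cols in csv_headers(DATA_RAW).items():
#     print(name, cols)
```

If a file starts with descriptive rows before the header, the reported column names will look wrong, which is itself a useful signal that the file needs a `skiprows` setting during cleaning.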

## Next Steps

Once you have downloaded the trade data files:

1. Place them in `data/raw/usitc/`
2. Proceed to **02_data_cleaning.ipynb** to clean and harmonize the data