# 🏅 Team USA: Data Exploration & BigQuery Loading

In this notebook, you'll explore more than 120 years of Team USA Olympic and Paralympic data, then load it into BigQuery for SQL analytics and machine learning.

**What you'll do:**
1. Load the athletes dataset into pandas for interactive exploration
2. Use the **Data Science Agent** to profile and visualize the data with AI
3. Load both datasets into BigQuery for analysis in the next task
4. Verify the data loaded correctly

---

## Step 1: Load the Athletes Data

Let's start by loading the athletes dataset from Google Cloud Storage into a pandas DataFrame. This gives you an in-memory copy that the Data Science Agent can analyze interactively.

The dataset contains **11,843 Team USA athletes** spanning the 1896 Athens Olympics through the 2024 Paris Games, with Olympic and Paralympic competitors in a single unified table.

In [None]:
import pandas as pd

# Load athletes data directly from Google Cloud Storage
df = pd.read_csv('gs://class-demo/team-usa/final/team_usa_athletes.csv')

print(f"Dataset shape: {df.shape[0]:,} athletes × {df.shape[1]} columns")
print(f"\nColumns: {', '.join(df.columns.tolist())}")
print("\nFirst 5 rows:")
df.head()
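
Before handing the DataFrame to an agent or to BigQuery, a quick per-column summary can help you spot missing values and unexpected types. The sketch below is one possible way to do that; the helper name `profile_columns` and the three-row sample frame are made up for illustration, so the code runs without access to the bucket.

```python
import pandas as pd


def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "distinct": frame.nunique(dropna=True),
    })


# Tiny made-up frame (not real athlete data) standing in for `df`.
sample = pd.DataFrame({
    "name": ["Athlete A", "Athlete B", "Athlete C"],
    "games_type": ["Olympic", "Paralympic", "Olympic"],
    "total_medals": [2.0, None, 0.0],
})

profile = profile_columns(sample)
print(profile)
```

In the notebook you would call `profile_columns(df)` on the DataFrame loaded above instead of the sample.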

## Step 2: Load Data into BigQuery

You've explored the data interactively. Now let's load it into BigQuery, where you can run SQL analytics and train ML models. We'll load directly from Google Cloud Storage using BigQuery's native import, which is much faster than uploading from this notebook.

First, set your project ID:

In [None]:
# TODO: Replace with your lab project ID
PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}

print(f"Project ID set to: {PROJECT_ID}")
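
A forgotten placeholder only surfaces later as a confusing `bq load` error, so it can be worth failing fast. The guard below is a minimal sketch; `require_project_id` is a hypothetical helper, and `my-lab-project` is a made-up ID used only to show the call.

```python
def require_project_id(project_id: str, placeholder: str = "YOUR_PROJECT_ID") -> str:
    """Return the trimmed project ID, or raise if it is empty or still the placeholder."""
    cleaned = project_id.strip()
    if not cleaned or cleaned == placeholder:
        raise ValueError("Set PROJECT_ID to your lab project ID before running the load cells.")
    return cleaned


# Made-up project ID, shown with stray whitespace that the helper trims.
print(require_project_id("  my-lab-project  "))
```

In the notebook, `PROJECT_ID = require_project_id(PROJECT_ID)` right after the assignment would stop execution before any load command runs.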

### Load the athletes table

This loads all 11,843 athletes with schema autodetection. The embedding column (3072-dimension vectors stored as JSON strings) makes this file large, so the load takes about a minute.

In [None]:
!bq load --project_id=$PROJECT_ID \
  --source_format=CSV \
  --autodetect \
  --replace \
  team_usa.athletes \
  gs://class-demo/team-usa/final/team_usa_athletes.csv
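
Because the embeddings arrive as JSON strings, they need decoding before any numeric work on the pandas side. The sketch below assumes each cell holds a JSON array of numbers; the helper name `parse_embedding` and the three-value example string are illustrative only (the real vectors have 3072 dimensions, and the actual column name is not shown here).

```python
import json


def parse_embedding(cell: str) -> list[float]:
    """Decode one JSON-encoded embedding cell into a list of floats."""
    values = json.loads(cell)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array of numbers")
    return [float(v) for v in values]


# Made-up cell with 3 values instead of 3072, just to show the round trip.
example_cell = "[0.25, -0.5, 1.0]"
vector = parse_embedding(example_cell)
print(len(vector), vector)
```

Applied to a DataFrame, this would look something like `df["embedding"].map(parse_embedding)`, where `embedding` is an assumed column name.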

### Load the results table

The results table (24,198 competition records) is smaller and loads in seconds.

In [None]:
!bq load --project_id=$PROJECT_ID \
  --source_format=CSV \
  --autodetect \
  --replace \
  team_usa.results \
  gs://class-demo/team-usa/final/team_usa_results.csv
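
If you prefer to stay in Python rather than shell out to `bq`, the BigQuery client library can run a comparable load job. This is a sketch under assumptions, not the lab's prescribed method: it assumes the `google-cloud-bigquery` package is installed, that the `team_usa` dataset already exists, and that `WRITE_TRUNCATE` is an acceptable stand-in for `--replace`. The helper names `full_table_id` and `load_csv_from_gcs` are hypothetical.

```python
def full_table_id(project_id: str, table: str, dataset: str = "team_usa") -> str:
    """Build a fully qualified table ID such as project.team_usa.results."""
    return f"{project_id}.{dataset}.{table}"


def load_csv_from_gcs(client, uri: str, table_id: str) -> int:
    """Load one CSV from Cloud Storage, replacing the table; return its row count."""
    # Imported here so full_table_id stays usable without the client library.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # counterpart of --autodetect
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # counterpart of --replace
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # block until the load job finishes (raises on failure)
    return client.get_table(table_id).num_rows


# Made-up project ID, only to show the ID format.
print(full_table_id("my-lab-project", "results"))
```

With `client = bigquery.Client(project=PROJECT_ID)`, a call such as `load_csv_from_gcs(client, "gs://class-demo/team-usa/final/team_usa_results.csv", full_table_id(PROJECT_ID, "results"))` would mirror the shell command above.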

> **Note on data types:** BigQuery's autodetect may type integer columns like `total_medals` and `games_count` as FLOAT64 (because some rows have NULL values). This is perfectly fine: BQML handles FLOAT64 without issues, and you can CAST to INT64 in queries if you prefer cleaner display.
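
The same display cleanup is possible on the pandas side. As a small sketch, assuming the float column holds only whole numbers and missing values, pandas' nullable `Int64` dtype keeps the gaps while dropping the trailing `.0`. The series below is made-up data, not real medal counts.

```python
import pandas as pd

# Made-up values: whole-number floats with one missing entry.
medals = pd.Series([3.0, None, 0.0], name="total_medals")

# "Int64" (capital I) is pandas' nullable integer dtype; missing values become <NA>.
as_int = medals.astype("Int64")
print(as_int)
```

Note that this conversion raises an error if any value has a fractional part, which makes it a cheap check that the column really is integer-valued.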

## Step 3: Verify the Load

Let's confirm everything loaded correctly with a few quick queries.

### Row counts

In [None]:
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)

# Check row counts
for table in ['athletes', 'results']:
    query = f"SELECT COUNT(*) as row_count FROM `{PROJECT_ID}.team_usa.{table}`"
    result = client.query(query).result()
    for row in result:
        print(f"team_usa.{table}: {row.row_count:,} rows")

You should see:
- `team_usa.athletes`: **11,843** rows
- `team_usa.results`: **24,198** rows
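
Rather than comparing the printed numbers by eye, you could check them in code. The sketch below uses the two expected counts listed above; the helper `find_mismatches` and the sample input dictionaries are hypothetical.

```python
# Expected row counts, taken from the list above.
EXPECTED_ROWS = {"athletes": 11_843, "results": 24_198}


def find_mismatches(actual: dict, expected: dict = EXPECTED_ROWS) -> dict:
    """Map each table whose count differs or is missing to (expected, actual)."""
    return {
        table: (count, actual.get(table))
        for table, count in expected.items()
        if actual.get(table) != count
    }


# Made-up query results: one matching set, one with a short results table.
print(find_mismatches({"athletes": 11_843, "results": 24_198}))
print(find_mismatches({"athletes": 11_843, "results": 24_000}))
```

An empty dictionary means both tables match; anything else names the table to reload.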

### Top medalists: who leads Team USA's all-time medal count?

In [None]:
query = """
SELECT
    name,
    primary_sport,
    games_type,
    CAST(games_count AS INT64) AS games_count,
    CAST(gold_count AS INT64) AS gold,
    CAST(silver_count AS INT64) AS silver,
    CAST(bronze_count AS INT64) AS bronze,
    CAST(total_medals AS INT64) AS total_medals
FROM `team_usa.athletes`
WHERE total_medals > 0
ORDER BY total_medals DESC
LIMIT 15
"""

top_medalists = client.query(query).to_dataframe()
print("🏅 Team USA All-Time Top Medalists:")
top_medalists

### Olympic vs. Paralympic athlete counts

In [None]:
# sql_engine: bigquery
# output_variable: breakdown
# start _sql
_sql = """
SELECT
    games_type,
    COUNT(*) as athlete_count,
    CAST(SUM(total_medals) AS INT64) as total_medals
FROM `team_usa.athletes`
GROUP BY games_type
""" # end _sql
from google.colab.sql import bigquery as _bqsqlcell
breakdown = _bqsqlcell.run(_sql)
breakdown
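
As a cross-check, the same breakdown can be computed in pandas from the DataFrame loaded in Step 1 and compared with the BigQuery result. This is a sketch that assumes the `name`, `games_type`, and `total_medals` columns used in the queries above; the helper `games_type_breakdown` and the three-row sample are made up for illustration.

```python
import pandas as pd


def games_type_breakdown(frame: pd.DataFrame) -> pd.DataFrame:
    """Count rows and sum medals per games_type, keeping a group for missing types."""
    return (
        frame.groupby("games_type", as_index=False, dropna=False)
        .agg(
            athlete_count=("name", "size"),       # row count per group, like COUNT(*)
            total_medals=("total_medals", "sum"),  # missing values are skipped
        )
    )


# Made-up rows (not real athlete data) standing in for `df`.
sample = pd.DataFrame({
    "name": ["Athlete A", "Athlete B", "Athlete C"],
    "games_type": ["Olympic", "Paralympic", "Olympic"],
    "total_medals": [2.0, None, 1.0],
})

local_breakdown = games_type_breakdown(sample)
print(local_breakdown)
```

One difference to keep in mind: for a group whose medal values are all missing, pandas reports a sum of 0 while SQL `SUM` returns NULL.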

---

## ✅ Data Loaded: Ready for Analytics!

You've loaded and explored more than 120 years of Team USA data. Here's what you accomplished:

- **11,843 athletes** and **24,198 competition results** are now in BigQuery
- You used the **Data Science Agent** to understand the data without writing analysis code
- You've identified key patterns: the Olympic/Paralympic split, sport distributions, and data quality characteristics
