# 🏅 Team USA: Data Exploration & BigQuery Loading

In this notebook, you'll explore more than 120 years of Team USA Olympic and Paralympic data in pandas, then load it into BigQuery for SQL analytics and machine learning.

**What you'll do:**
1. Load the athletes dataset into pandas for interactive exploration
2. Use the **Data Science Agent** to profile and visualize the data with AI
3. Load both datasets (athletes and competition results) into BigQuery for analysis in the next task
4. Verify the data loaded correctly

---

## Step 1: Load the Athletes Data

Let's start by loading the athletes dataset from Google Cloud Storage into a pandas DataFrame. This gives you an in-memory copy that the Data Science Agent can analyze interactively.

The dataset contains **12,222 Team USA athletes** spanning the 1896 Athens Olympics through the 2024 Paris Games — both Olympic and Paralympic competitors in a single unified table.

In [None]:
import pandas as pd

# Load athletes data directly from Google Cloud Storage
# (pandas reads gs:// paths through the gcsfs package)
df = pd.read_csv('gs://class-demo/team-usa/final/team_usa_athletes.csv')

print(f"Dataset shape: {df.shape[0]:,} athletes × {df.shape[1]} columns")
print(f"\nColumns: {', '.join(df.columns.tolist())}")
print("\nFirst 5 rows:")
df.head()

## Step 2: Explore with the Data Science Agent

Now let's use the AI-powered **Data Science Agent** to explore this data. Instead of writing analysis code yourself, you'll describe what you want to understand in plain English, and the agent will plan, code, execute, and visualize autonomously.

### How to use the agent

1. Look at the **chat panel at the bottom** of this notebook — it says *"What can I help you build?"*
2. Type a prompt describing what you want to explore
3. The agent will propose code cells (marked **✦ Gemini**) with a plan showing its steps
4. Click **Accept & Run** to execute each proposed cell
5. The agent may propose follow-up steps based on the results — keep clicking **Accept & Run**

> **Note:** The first prompt may take a minute or two as the agent initializes. Subsequent prompts will be faster.

---

### Prompt 1 — Dataset Profiling

Start by getting an overview of the data quality and structure. Copy this into the chat panel:

```
Profile this dataset. Show me the shape, data types, missing values, and basic statistics for the numeric columns.
```

**What to look for:**
- `height_cm` and `weight_kg` — available for ~65-69% of athletes (mostly Olympic). Almost no biometric data exists for Paralympic athletes.
- `reason`, `hero`, `philosophy`, etc. — populated only for Paris 2024 Paralympic athletes
- `profile_summary` and `embedding` — 100% coverage (AI-generated for every athlete)
- Most athletes have zero medals — making the team is the achievement!
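
If you'd like to cross-check what the agent reports, a minimal pandas sketch of the same profile might look like this. The `profile` helper and the small `sample` frame are made up for illustration; in the notebook you would call `profile(df)` on the DataFrame loaded in Step 1.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and percent populated."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "pct_populated": (df.notna().mean() * 100).round(1),
    })
    # Least-populated columns first, so gaps are easy to spot
    return summary.sort_values("pct_populated")


# Hypothetical sample data so the sketch runs on its own
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height_cm": [180.0, None, 165.0, None],
    "total_medals": [0, 0, 2, 1],
})

report = profile(sample)
print(f"Shape: {sample.shape[0]:,} rows x {sample.shape[1]} columns")
print(report)
print(sample.describe())  # basic statistics for the numeric columns
```

Sorting by `pct_populated` puts sparsely filled columns such as `height_cm` at the top of the report.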

---

### Prompt 2 — Olympic vs. Paralympic Distribution

```
Show me the distribution of athletes by games_type and gender. Include a visualization.
```

**What to notice:** The Olympic/Paralympic split and the gender balance within each. To see how representation has evolved, ask a follow-up that breaks the counts down by decade.
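
For comparison, the same distribution can be sketched by hand with `pd.crosstab`. The sample rows and the category labels (`"Olympic"`, `"Paralympic"`, `"F"`, `"M"`) are assumptions for illustration; check the actual values in your `df` before relying on them.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use the df loaded in Step 1
sample = pd.DataFrame({
    "games_type": ["Olympic", "Olympic", "Olympic", "Paralympic"],
    "gender": ["F", "M", "F", "M"],
})

# Rows are games_type, columns are gender, cells are athlete counts
counts = pd.crosstab(sample["games_type"], sample["gender"])
print(counts)

# With matplotlib available, a stacked bar chart is one way to visualize it:
# counts.plot(kind="bar", stacked=True)
```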

---

### Prompt 3 — Sports and Medal Distribution

```
Which sports have the most athletes and the most total medals? Show the top 10 of each with visualizations.
```

**What to notice:** Athletics and Swimming dominate. Paralympic sports like Wheelchair Basketball appear in the rankings — this dataset tells the *complete* Team USA story.
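
A rough pandas equivalent of this prompt groups by sport and ranks the two totals separately. The `top_sports` helper and the sample rows are hypothetical; the column names `primary_sport`, `name`, and `total_medals` are the ones queried later in this notebook.

```python
import pandas as pd


def top_sports(df: pd.DataFrame, n: int = 10):
    """Return (top n sports by athlete count, top n sports by total medals)."""
    by_sport = df.groupby("primary_sport").agg(
        athletes=("name", "size"),
        total_medals=("total_medals", "sum"),
    )
    return by_sport["athletes"].nlargest(n), by_sport["total_medals"].nlargest(n)


# Hypothetical sample so the sketch runs on its own
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "primary_sport": ["Athletics", "Athletics", "Athletics",
                      "Swimming", "Swimming", "Wheelchair Basketball"],
    "total_medals": [1, 0, 2, 4, 3, 1],
})

most_athletes, most_medals = top_sports(sample)
print(most_athletes)
print(most_medals)
```

Ranking the two measures independently matters because the sport with the most athletes is not necessarily the one with the most medals.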

---

### Keep Exploring! (Optional)

The agent is context-aware — it remembers everything in your notebook. Try your own questions:

- *"Who has the most career gold medals?"*
- *"How has the number of female athletes changed over the decades?"*
- *"What's the average career span of medalists vs. non-medalists?"*
- *"Show the distribution of athletes across decades by games_type"*

When you're ready, move on to Step 3 to load the data into BigQuery.

## Step 3: Load Data into BigQuery

You've explored the data interactively — now let's load it into BigQuery where you can run SQL analytics and train ML models. We'll load directly from Google Cloud Storage using BigQuery's native import, which is much faster than uploading from this notebook.

First, set your project ID:

In [None]:
# TODO: Replace with your lab project ID
PROJECT_ID = "YOUR_PROJECT_ID"  # e.g., "qwiklabs-gcp-00-abc123def456"

print(f"Project ID set to: {PROJECT_ID}")
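
Because every later cell depends on this value, it can help to fail fast if the placeholder was never replaced. A minimal sketch, using a hypothetical `check_project_id` helper that is not part of the lab:

```python
def check_project_id(project_id: str) -> str:
    """Raise early if the placeholder project ID was not replaced."""
    if not project_id or project_id == "YOUR_PROJECT_ID":
        raise ValueError("Set PROJECT_ID to your lab project ID before continuing.")
    return project_id


# Example ID taken from the comment in the cell above
checked = check_project_id("qwiklabs-gcp-00-abc123def456")
print(f"Project ID looks set: {checked}")
```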

### Load the athletes table

This loads all 12,222 athletes with schema autodetection. The `embedding` column (3,072-dimensional vectors stored as JSON strings) makes this file large, so the load takes about a minute.

In [None]:
!bq --project_id=$PROJECT_ID load \
  --source_format=CSV \
  --autodetect \
  --replace \
  team_usa.athletes \
  gs://class-demo/team-usa/final/team_usa_athletes.csv

### Load the results table

The results table (24,945 competition records) is smaller and loads in seconds.

In [None]:
!bq --project_id=$PROJECT_ID load \
  --source_format=CSV \
  --autodetect \
  --replace \
  team_usa.results \
  gs://class-demo/team-usa/final/team_usa_results.csv
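
If you prefer to stay in Python, the same kind of load can be sketched with the `google-cloud-bigquery` client. The helper names `table_ref` and `load_csv_from_gcs` are made up for this example. It assumes the `team_usa` dataset already exists and that the notebook is authenticated, and it uses `WRITE_TRUNCATE` as the counterpart of `--replace`.

```python
def table_ref(project_id: str, dataset: str, table: str) -> str:
    """Build a fully qualified table ID such as 'my-project.team_usa.results'."""
    return f"{project_id}.{dataset}.{table}"


def load_csv_from_gcs(project_id: str, uri: str, table_id: str) -> int:
    """Load a CSV from Cloud Storage into BigQuery and return the row count."""
    # Imported here so the sketch can be read without the library installed
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


# Not executed here because it needs credentials and a real project:
# rows = load_csv_from_gcs(
#     PROJECT_ID,
#     "gs://class-demo/team-usa/final/team_usa_results.csv",
#     table_ref(PROJECT_ID, "team_usa", "results"),
# )
print(table_ref("my-project", "team_usa", "results"))
```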

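> **Note on data types:** BigQuery's autodetect may type integer columns like `total_medals` and `games_count` as FLOAT64. This is likely because the CSV stores those values with a decimal point (for example `3.0`), which is common when a file is exported from a tool that holds integer columns with missing values as floats. This is perfectly fine: BQML handles FLOAT64 without issues, and you can `CAST` to INT64 in queries if you prefer cleaner display.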
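
pandas shows a similar effect on the Python side: an integer column that contains missing values is held as `float64`. If you want whole numbers in a DataFrame as well, the nullable `Int64` dtype is one option. The values below are made up for illustration.

```python
import pandas as pd

# A medal count column with one missing value is stored as float64
medals = pd.Series([3.0, None, 0.0], name="total_medals")
print(medals.dtype)  # float64

# The nullable integer dtype keeps the gap as <NA> instead of forcing floats
as_int = medals.astype("Int64")
print(as_int.dtype)  # Int64
print(as_int)
```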

## Step 4: Verify the Load

Let's confirm everything loaded correctly with a few quick queries.

### Row counts

In [None]:
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)

# Check row counts
for table in ['athletes', 'results']:
    query = f"SELECT COUNT(*) as row_count FROM `{PROJECT_ID}.team_usa.{table}`"
    result = client.query(query).result()
    for row in result:
        print(f"team_usa.{table}: {row.row_count:,} rows")

You should see:
- `team_usa.athletes`: **12,222** rows
- `team_usa.results`: **24,945** rows
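
To turn this visual check into an automatic one, you could compare the counts against the expected values. A minimal sketch, where `find_mismatches` is a hypothetical helper and `actual` stands in for the counts returned by the query above:

```python
EXPECTED_ROWS = {"athletes": 12_222, "results": 24_945}


def find_mismatches(actual: dict) -> dict:
    """Return {table: (actual, expected)} for every table whose count is off."""
    return {
        table: (actual.get(table), expected)
        for table, expected in EXPECTED_ROWS.items()
        if actual.get(table) != expected
    }


# Stand-in values; in the notebook, fill this dict from the query results
actual = {"athletes": 12_222, "results": 24_945}
problems = find_mismatches(actual)
print("All row counts match" if not problems else f"Mismatches: {problems}")
```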

### Top medalists — who leads Team USA's all-time medal count?

In [None]:
query = """
SELECT 
    name,
    primary_sport,
    games_type,
    CAST(games_count AS INT64) AS games_count,
    CAST(gold_count AS INT64) AS gold,
    CAST(silver_count AS INT64) AS silver,
    CAST(bronze_count AS INT64) AS bronze,
    CAST(total_medals AS INT64) AS total_medals
FROM `team_usa.athletes`
WHERE total_medals > 0
ORDER BY total_medals DESC
LIMIT 15
"""

top_medalists = client.query(query).to_dataframe()
print("🏅 Team USA All-Time Top Medalists:")
top_medalists

### Olympic vs. Paralympic athlete counts

In [None]:
query = """
SELECT 
    games_type,
    COUNT(*) as athlete_count,
    CAST(SUM(total_medals) AS INT64) as total_medals
FROM `team_usa.athletes`
GROUP BY games_type
ORDER BY athlete_count DESC
"""

breakdown = client.query(query).to_dataframe()
print("Olympic vs. Paralympic:")
breakdown
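
As a cross-check, the same aggregation can be reproduced in pandas and compared with the BigQuery result. This is a sketch with made-up sample rows and a hypothetical `breakdown_by_games_type` helper; in the notebook you would pass the `df` loaded in Step 1.

```python
import pandas as pd


def breakdown_by_games_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count athletes and sum medals per games_type, largest group first."""
    return (
        df.groupby("games_type")
        .agg(
            athlete_count=("total_medals", "size"),  # size counts rows, NaN included
            total_medals=("total_medals", "sum"),    # sum skips NaN, like SQL SUM
        )
        .sort_values("athlete_count", ascending=False)
        .reset_index()
    )


# Hypothetical sample so the sketch runs on its own
sample = pd.DataFrame({
    "games_type": ["Olympic", "Olympic", "Olympic", "Paralympic"],
    "total_medals": [1.0, 0.0, None, 2.0],
})

local_breakdown = breakdown_by_games_type(sample)
print(local_breakdown)
```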

---

## ✅ Data Loaded — Ready for Analytics!

You've loaded and explored more than 120 years of Team USA data. Here's what you accomplished:

- **12,222 athletes** and **24,945 competition results** are now in BigQuery
- You used the **Data Science Agent** to understand the data without writing analysis code
- You've identified key patterns: the Olympic/Paralympic split, sport distributions, and data quality characteristics

**In the next task**, you'll move to the BigQuery console to run analytical queries with **Gemini Cloud Assist** and train a **K-Means clustering model** that discovers hidden athlete career archetypes — using nothing but SQL.

Close this notebook and head back to the lab instructions for **Task 3**.