# Notebook 2 — Aggregates & Descriptives (UFO)

_INFO 4614 • Unit 1: Relational Databases (MySQL)_

**Builds on Notebook 1.** We'll stay in Python + SQL and frame everything as **research questions**, then translate those questions into queries. No joins yet.

We focus on:
- Aggregates: `COUNT`, `COUNT(DISTINCT ...)`, `MIN`, `MAX`, `AVG`, `SUM`
- `GROUP BY` and `HAVING`
- Handling **NULLs** with `COALESCE` / `IFNULL`
- Light **string** and **date/time** helpers for grouping (e.g., `LOWER`, `TRIM`, `STR_TO_DATE`, `YEAR`)

**Dataset.** Database **`ufo`**, table **`sightings`** (structure mirrors `ufo_scrubbed.csv`).

> Tip: Copy your working connection cell from **Notebook 1** or use the setup below.

**Submission checklist**

- ✅ All cells run top to bottom without errors
- ✅ Every **Exercise** cell is filled in and produces an output
- ✅ Results appear as DataFrames (and optional plots if you do the stretch goal)
- ✅ File is renamed `Notebook2_LastFirst.ipynb` before submission

## 0) Setup & connection

We'll install dependencies (same as Notebook 1), set connection variables for the **UFO** database, and ensure **TLS** using the Amazon RDS **global CA** (`global-bundle.pem`).

**Quoting column names with spaces.** If a column name has spaces or punctuation (e.g., ``duration (seconds)``), use **backticks** in MySQL: ``SELECT `duration (seconds)` FROM sightings``. You can also use `AS` to assign cleaner output names.
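As a small sketch of the quoting rule, the helper below wraps an identifier in backticks and doubles any backtick inside it, which is how MySQL escapes that character within a quoted identifier. The name `quote_ident` is hypothetical and not part of any library. For this notebook, typing the backticks by hand is enough.

```python
def quote_ident(name: str) -> str:
    """Wrap `name` in backticks for use as a MySQL identifier.

    Hypothetical helper, for illustration only.
    """
    return "`" + name.replace("`", "``") + "`"


# Build a query that selects the duration column under a cleaner alias.
column = quote_ident("duration (seconds)")
example_sql = f"SELECT {column} AS duration_s FROM sightings LIMIT 5"
print(example_sql)
```

Only use a helper like this for column names you control; values supplied by users belong in bound parameters, not in the SQL string.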

In [None]:
# Install light dependencies (same as Notebook 1)
!pip -q install mysql-connector-python SQLAlchemy pandas matplotlib

# Versions
import sys, sqlalchemy, pandas
print("Python:", sys.version.split()[0])
print("SQLAlchemy:", sqlalchemy.__version__)
print("pandas:", pandas.__version__)

In [None]:
# Connection settings (same host/port as Notebook 1; different DB name)
DB_HOST = "info4614.c7kemoi0y6yq.us-east-2.rds.amazonaws.com"
DB_PORT = 3306
DB_NAME = "ufo"

CA_CERT_PATH = "global-bundle.pem"

In [None]:
# Download the current Amazon RDS global certificate bundle (TLS)
!wget -q https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem -O "$CA_CERT_PATH"

import os
assert os.path.exists(CA_CERT_PATH), "PEM download failed — re-run or upload manually"
print("Saved:", os.path.abspath(CA_CERT_PATH))

In [None]:
# Alternative: upload a PEM you already have
# (Skip if the previous cell succeeded)
try:
    from google.colab import files
    print("Upload global-bundle.pem if needed")
    uploaded = files.upload()
except Exception as e:
    print("Upload not available here:", e)

### Enter your read‑only credentials

In [None]:
from getpass import getpass
DB_USER = input("MySQL username: ").strip()
DB_PASS = getpass("MySQL password: ").strip()

### Create SQLAlchemy engine

In [None]:
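from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL

# URL.create escapes special characters in the password (e.g., "@" or "/")
db_url = URL.create(
    "mysql+mysqlconnector", username=DB_USER, password=DB_PASS,
    host=DB_HOST, port=DB_PORT, database=DB_NAME,
)
engine = create_engine(
    db_url,
    connect_args={"ssl_ca": CA_CERT_PATH},
    pool_pre_ping=True,
)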

with engine.connect() as conn:
    print("SELECT 1 →", conn.execute(text("SELECT 1")).scalar_one())
    cipher = conn.execute(text("SHOW SESSION STATUS LIKE 'Ssl_cipher'")).fetchone()
    print("SSL cipher:", cipher[1] if cipher else None)

## 1) Quick schema check

In [None]:
import pandas as pd
from sqlalchemy import text

with engine.connect() as conn:
    # Wrap raw SQL in text() so it runs the same way across pandas/SQLAlchemy versions
    print("Tables:")
    display(pd.read_sql_query(text("SHOW TABLES"), conn))

    print("\n`ufo.sightings` definition:")
    display(pd.read_sql_query(text("DESCRIBE sightings"), conn))


## From research question → SQL query (mini recipe)

When you turn a research question into SQL, think in this order:

1. **What rows do I need?** → `FROM ... WHERE ...`
2. **What is the unit of analysis?** → columns in `GROUP BY` (or no grouping if a single summary)
3. **What measurements per group?** → aggregates like `COUNT(*)`, `AVG(x)`, `MIN/MAX`
4. **Which groups qualify?** → `HAVING` (filters *after* grouping), e.g., `HAVING COUNT(*) >= 30`
5. **How should results be presented?** → `ORDER BY ...`, `LIMIT ...`
6. **Any cleanup?** → `COALESCE(...)` for NULLs, `TRIM(...)` for stray whitespace, aliases with `AS`

We'll practice this repeatedly:
- Start with a **question**
- Write the **SQL pattern**
- Fill in the **code cell** to get a pandas DataFrame
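The recipe above can be sketched end to end on a toy table. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, and the table `toy_sightings`, its columns, and its rows are made up. The numbered SQL comments map to the recipe steps.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the rows are made up.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_sightings (country TEXT, city TEXT)")
toy.executemany(
    "INSERT INTO toy_sightings VALUES (?, ?)",
    [
        ("us", "austin"),
        ("us", "austin"),
        ("us", "austin"),
        ("us", "denver"),
        ("us", "denver"),
        ("us", "boise"),
        ("ca", "toronto"),
    ],
)

recipe_sql = """
SELECT city, COUNT(*) AS n       -- 3. measurement per group
FROM toy_sightings
WHERE country = 'us'             -- 1. which rows
GROUP BY city                    -- 2. unit of analysis
HAVING COUNT(*) >= 2             -- 4. which groups qualify
ORDER BY n DESC                  -- 5. presentation
LIMIT 5
"""
recipe_rows = toy.execute(recipe_sql).fetchall()
print(recipe_rows)
toy.close()
```

`boise` is dropped by `HAVING` because it has only one row, and `toronto` never reaches the grouping step because `WHERE` removes it first.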


## 2) Aggregates and GROUP BY

We’ll start with counts and then layer on more aggregate functions.

### RQ 1 — How big is this dataset? How diverse are shapes and countries?
**Question.** *How many total reports are there? How many distinct shapes and countries are represented?*

**Translate → SQL.**
- **Rows:** `FROM sightings`
- **Measures:** `COUNT(*)`, `COUNT(DISTINCT shape)`, `COUNT(DISTINCT country)`
- **Output:** one row with columns `total_rows`, `distinct_shapes`, `distinct_countries`
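
One detail worth keeping in mind when reading the result: `COUNT(*)` counts every row, while `COUNT(DISTINCT shape)` skips NULLs but still counts an empty string as a value. A minimal sketch, assuming an in-memory SQLite table with made-up rows as a stand-in for MySQL (both engines follow the same rule here):

```python
import sqlite3

# Made-up rows: two real shapes, one NULL, one empty string.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_sightings (shape TEXT)")
toy.executemany(
    "INSERT INTO toy_sightings VALUES (?)",
    [("disk",), ("disk",), ("light",), (None,), ("",)],
)

total_rows, distinct_shapes = toy.execute(
    "SELECT COUNT(*), COUNT(DISTINCT shape) FROM toy_sightings"
).fetchone()
print(total_rows, distinct_shapes)
toy.close()
```

Here the distinct count is 3 (`disk`, `light`, and the empty string), so a blank shape can inflate the number of "distinct shapes" by one.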

In [None]:
# 🔧 Exercise 2.1
import pandas as pd
from sqlalchemy import text

sql = """
SELECT
  COUNT(*)            AS total_rows,
  COUNT(DISTINCT shape)   AS distinct_shapes,
  COUNT(DISTINCT country) AS distinct_countries
FROM sightings;
"""
with engine.connect() as conn:
    df_counts = pd.read_sql_query(text(sql), conn)

df_counts

### RQ 2 — Which shapes get reported the most?
**Question.** *What are the top 10 reported UFO shapes?*

**Translate → SQL.**
- **Rows:** exclude NULL/empty `shape`
- **Group:** by `shape`
- **Measure:** `COUNT(*) AS n`
- **Present:** `ORDER BY n DESC LIMIT 10`

In [None]:
# 🔧 Exercise 2.2
sql = """
SELECT
  shape,
  COUNT(*) AS n
FROM sightings
WHERE ...
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_shapes = pd.read_sql_query(text(sql), conn)

df_shapes.head(10)

**📝 Interpretation (1–2 sentences).** In plain English, answer the research question above based on your result. 

- Mention any pattern or outlier you see.
- If relevant, note one limitation (e.g., small group sizes, missing data).

*Type your short answer here.*

### RQ 3 — Which U.S. states have the most sightings?
**Question.** *Among U.S. reports, which states have the highest counts?*

**Translate → SQL.**
- **Rows:** `WHERE country = 'us'` and non‑NULL/non‑empty `state`
- **Group:** by `state`
- **Measure:** `COUNT(*) AS n`
- **Present:** `ORDER BY n DESC LIMIT 15`

In [None]:
# 🔧 Exercise 2.3
sql = """
SELECT
  state,
  COUNT(*) AS n
FROM sightings
WHERE ...
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_states = pd.read_sql_query(text(sql), conn)

df_states.head(15)

## 3) Averages, mins/maxes, and HAVING

We’ll use the numeric column `` `duration (seconds)` `` and compute summaries by group. Use `HAVING` to filter **after** aggregation.
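
Before averaging, note that `AVG` ignores NULL values, while `COUNT(*)` counts every row in the group. The sketch below shows the difference on made-up rows, using an in-memory SQLite table as a stand-in for MySQL (SQLite accepts MySQL-style backticks). The table name `toy_sightings` and the extra column `n_with_duration` are illustrative assumptions, not part of the assignment.

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_sightings (shape TEXT, `duration (seconds)` REAL)")
toy.executemany(
    "INSERT INTO toy_sightings VALUES (?, ?)",
    [("disk", 10), ("disk", 30), ("disk", None), ("light", 60)],
)

avg_sql = """
SELECT
  shape,
  COUNT(*)                    AS n,                -- all rows in the group
  COUNT(`duration (seconds)`) AS n_with_duration,  -- rows with a non-NULL duration
  AVG(`duration (seconds)`)   AS avg_seconds       -- average over non-NULL values only
FROM toy_sightings
GROUP BY shape
HAVING COUNT(*) >= 2                               -- applied after aggregation
"""
avg_rows = toy.execute(avg_sql).fetchall()
print(avg_rows)
toy.close()
```

For `disk`, the average is computed from two values even though the group has three rows, so a threshold on `COUNT(*)` does not guarantee that many durations went into the average.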

### RQ 4 — Which shapes tend to have longer reported durations?
**Question.** *For each shape, what is the average reported `duration (seconds)`? Consider only shapes with enough data.*

**Translate → SQL.**
- **Group:** by `shape`
- **Measure:** ``AVG(`duration (seconds)`) AS avg_seconds``, plus `COUNT(*) AS n`
- **Quality filter:** `HAVING n >= 30`
- **Present:** `ORDER BY avg_seconds DESC`

In [None]:
# 🔧 Exercise 3.1
sql = """
SELECT
  shape,
  AVG(`duration (seconds)`) AS avg_seconds,
  COUNT(*)                  AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_avg_shape = pd.read_sql_query(text(sql), conn)

df_avg_shape.head(10)

### RQ 5 — How spread out are durations within each shape?
**Question.** *What are the min/max durations by shape, and a rounded average?*

**Translate → SQL.**
- **Group:** by `shape`
- **Measures:** `MIN`, `MAX`, `ROUND(AVG(...), 1)`, `COUNT(*) AS n`
- **Quality filter:** keep only groups with enough records if needed (e.g., `HAVING n >= 30`)
- **Present:** sort by `n` or by `avg_s_rounded` as useful

In [None]:
# 🔧 Exercise 3.2
sql = """
SELECT
  shape,
  MIN(`duration (seconds)`)              AS min_s,
  MAX(`duration (seconds)`)              AS max_s,
  ROUND(AVG(`duration (seconds)`), 1)    AS avg_s_rounded,
  COUNT(*)                               AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_minmax = pd.read_sql_query(text(sql), conn)

df_minmax.head(10)

## 4) Handling NULL/empty and simple string helpers

We'll group on a **filled** state value using `COALESCE`. Example pattern:

```sql
COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN')
```

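Reading from the inside out: `TRIM` strips surrounding spaces, `NULLIF(..., '')` turns an empty string into NULL, and `COALESCE(..., 'UNKNOWN')` replaces NULL with `'UNKNOWN'`. Blank, whitespace-only, and missing values therefore all land in the same bucket.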
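
To see what the pattern does row by row, here is a minimal sketch on made-up values. It assumes an in-memory SQLite table as a stand-in for MySQL; `TRIM`, `NULLIF`, and `COALESCE` behave the same way in both for these inputs. The table name `toy_states` is hypothetical.

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_states (state TEXT)")
toy.executemany(
    "INSERT INTO toy_states VALUES (?)",
    [("tx",), ("  ",), ("",), (None,), (" tx ",)],
)

bucket_rows = toy.execute(
    """
    SELECT state, COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN') AS state_bucket
    FROM toy_states
    ORDER BY rowid
    """
).fetchall()
for raw, bucket in bucket_rows:
    print(repr(raw), "->", bucket)
toy.close()
```

Whitespace-only, empty, and NULL inputs all map to `UNKNOWN`, and `' tx '` collapses into the same bucket as `'tx'`.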

### RQ 6 — How many U.S. reports have missing/blank states?
**Question.** *Count U.S. sightings by state, but put missing/blank ones into an `UNKNOWN` bucket.*

**Translate → SQL.**
- **Row filter:** `WHERE country = 'us'`
- **Grouping key:** `COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN')`
- **Measure:** `COUNT(*) AS n`
- **Present:** `ORDER BY n DESC`

In [None]:
# 🔧 Exercise 4.1
sql = """
SELECT
  COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN') AS state_bucket,
  COUNT(*) AS n
FROM sightings
WHERE country = 'us'
GROUP BY ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_state_bucket = pd.read_sql_query(text(sql), conn)

df_state_bucket.head(20)

## 5) Light date/time derived columns

The `datetime` column is stored as text in the form `MM/DD/YYYY HH:MM`. MySQL can parse it with `STR_TO_DATE`.
We'll extract **year** to see long‑term patterns.

Format string: `'%m/%d/%Y %H:%i'`.
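
If you want to sanity-check the format string outside the database, Python's `datetime.strptime` uses nearly the same specifiers. The one difference that matters here is minutes: MySQL uses `%i` (its `%M` means the month name), while Python uses `%M`. The sample value below is made up and only mirrors the `MM/DD/YYYY HH:MM` layout.

```python
from datetime import datetime

# Made-up sample in the same MM/DD/YYYY HH:MM layout as the `datetime` column.
sample = "10/10/1949 20:30"

# MySQL:  '%m/%d/%Y %H:%i'
# Python: '%m/%d/%Y %H:%M'   (minutes are %i in MySQL, %M in Python)
parsed = datetime.strptime(sample, "%m/%d/%Y %H:%M")
print(parsed.year, parsed.isoformat())
```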

### RQ 7 — Which years were the busiest for reports?
**Question.** *How many sightings were logged each year?*

**Translate → SQL.**
- **Derived year:** ``YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i')) AS yr``
- **Group:** by `yr`
- **Measure:** `COUNT(*) AS n`
- **Present:** sort by `n` desc; show top 15

In [None]:
# 🔧 Exercise 5.1
sql = """
SELECT
  YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i')) AS yr,
  COUNT(*) AS n
FROM sightings
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_years = pd.read_sql_query(text(sql), conn)

df_years.head(15)

**📝 Interpretation (1–2 sentences).** In plain English, answer the research question above based on your result. 

- Mention any pattern or outlier you see.
- If relevant, note one limitation (e.g., small group sizes, missing data).

*Type your short answer here.*

### RQ 8 — Does average reported duration change over time?
**Question.** *For each year with at least 100 reports, what is the average duration?*

**Translate → SQL.**
- **Derived year:** same `yr` expression as above
- **Group:** by `yr`
- **Measures:** ``AVG(`duration (seconds)`) AS avg_seconds``, `COUNT(*) AS n`
- **Quality filter:** `HAVING n >= 100`
- **Present:** `ORDER BY yr ASC`

In [None]:
# 🔧 Exercise 5.2
sql = """
SELECT
  YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i')) AS yr,
  AVG(`duration (seconds)`) AS avg_seconds,
  COUNT(*) AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_avg_by_year = pd.read_sql_query(text(sql), conn)

df_avg_by_year.head()

## 6) Light pandas (inspect + simple viz)

**RQ 2 (visual):** *Which shapes get reported the most?* Make a quick bar chart of your top 10 shapes result.

We'll keep pandas simple. Convert one of your SQL results into a small **bar chart**.

> If plotting causes issues, skip the plot and just show the DataFrame.

In [None]:
# Bar chart of top shapes (from Exercise 2.2)
import matplotlib.pyplot as plt

df_shapes_plot = df_shapes.copy().head(10)
df_shapes_plot.plot(kind="bar", x="shape", y="n", legend=False, title="Top 10 UFO shapes by count")
plt.xlabel("shape"); plt.ylabel("count")
plt.show()

## (Optional) Fallback: local CSV (pandas only)

If you need to test without a DB connection, you can upload `ufo_scrubbed.csv` and practice the pandas parts.
This does **not** replace the requirement to connect to the database for grading.

In [None]:
import pandas as pd
try:
    from google.colab import files
    print("Upload ufo_scrubbed.csv")
    uploaded = files.upload()
    csv_name = next(iter(uploaded.keys()))
    ufo_csv = pd.read_csv(csv_name)
    display(ufo_csv.head())  # an expression inside try: is not auto-displayed
except Exception as e:
    print("Upload not available here:", e)

## Troubleshooting

- **`Access denied ...` (1045)** → Check your read‑only username/password and ask the instructor if needed.
- **Timeout / can’t connect (2003)** → Usually a network problem or an RDS security group rule blocking your IP. Try campus Wi‑Fi or VPN; if that fails, ask the instructor to add your IP to the allowlist.
- **`ssl ca certificate` errors** → Re‑run the PEM download cell or upload the correct `global-bundle.pem` and ensure `connect_args={"ssl_ca": "global-bundle.pem"}`.
- **SQL errors** → Check backticks around columns with spaces or punctuation.

## Submit

- **Runtime → Restart and run all** to confirm a clean run
- **File → Download .ipynb**, rename to `Notebook2_LastFirst.ipynb`, and submit as directed