# Notebook 2 — Aggregates & Descriptives (UFO)

_INFO 4614 • Unit 1: Relational Databases (MySQL)_

**Builds on Notebook 1.** We'll stay in Python + SQL, now focusing on **aggregates** and **descriptive summaries** without joins:
- `COUNT`, `COUNT(DISTINCT ...)`, `MIN`, `MAX`, `AVG`, `SUM`
- `GROUP BY` and `HAVING`
- Handling **NULLs** with `COALESCE` / `IFNULL`
- Light **string** and **date/time** helpers for grouping (e.g., `LOWER`, `TRIM`, `STR_TO_DATE`, `YEAR`)

**Dataset.** Database **`ufo`**, table **`sightings`** (structure mirrors `ufo_scrubbed.csv`).

> Tip: You can copy your working connection code from **Notebook 1**. The cells below also provide a clean setup if you prefer to start fresh.

**Submission checklist**

- ✅ All cells run top to bottom without errors
- ✅ Every **Exercise** cell is filled in and produces an output
- ✅ Results appear as DataFrames (and optional plots if you do the stretch goal)
- ✅ File is renamed `Notebook2_LastFirst.ipynb` before submission

## 0) Setup & connection

We'll install dependencies (same as Notebook 1), set connection variables for the **UFO** database, and ensure **TLS** using the Amazon RDS **global CA** (`global-bundle.pem`).

⚠️ **Quoting column names with spaces.** Some columns in this table contain spaces or punctuation: ``duration (seconds)`` has a space and parentheses, and ``longitude `` ends with a trailing space. In MySQL, quote such identifiers with **backticks**, e.g. ``SELECT `duration (seconds)` FROM sightings``. Use `AS` to give the result a cleaner output name.
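
To try the backtick-plus-`AS` pattern without touching the course database, here is a minimal sketch. It assumes an in-memory SQLite table named `demo` (hypothetical, not part of the `ufo` database) as a stand-in; SQLite accepts MySQL-style backticks, so the quoting behaves the same way.

```python
import sqlite3

# Toy stand-in for the real table (assumption: SQLite instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (`duration (seconds)` REAL, `longitude ` REAL)")
conn.execute("INSERT INTO demo VALUES (120.0, -97.94)")

# Backticks quote the awkward names; AS gives the output columns clean names.
cur = conn.execute(
    "SELECT `duration (seconds)` AS duration_s, `longitude ` AS longitude FROM demo"
)
clean_names = [col[0] for col in cur.description]
rows = cur.fetchall()
print(clean_names, rows)  # ['duration_s', 'longitude'] [(120.0, -97.94)]
conn.close()
```

Note that the trailing space is part of the identifier, so it has to appear inside the backticks.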

In [None]:
# Install light dependencies (same as Notebook 1)
!pip -q install mysql-connector-python SQLAlchemy pandas matplotlib

# Versions
import sys, sqlalchemy, pandas
print("Python:", sys.version.split()[0])
print("SQLAlchemy:", sqlalchemy.__version__)
print("pandas:", pandas.__version__)

In [None]:
# Connection settings (same host/port as Notebook 1; different DB name)
DB_HOST = "info4614.c7kemoi0y6yq.us-east-2.rds.amazonaws.com"
DB_PORT = 3306
DB_NAME = "ufo"

CA_CERT_PATH = "global-bundle.pem"

In [None]:
# Download the current Amazon RDS global certificate bundle (TLS)
!wget -q https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem -O "$CA_CERT_PATH"

import os
assert os.path.exists(CA_CERT_PATH), "PEM download failed — re-run or upload manually"
print("Saved:", os.path.abspath(CA_CERT_PATH))

In [None]:
# Alternative: upload a PEM you already have
# (Skip if the previous cell succeeded)
try:
    from google.colab import files
    print("Upload global-bundle.pem if needed")
    uploaded = files.upload()
except Exception as e:
    print("Upload not available here:", e)

### Enter your read‑only credentials

In [None]:
from getpass import getpass
DB_USER = input("MySQL username: ").strip()
DB_PASS = getpass("MySQL password: ").strip()

### Create SQLAlchemy engine

In [None]:
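from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL

# URL.create escapes special characters in the password (e.g. "@" or "/"),
# which would otherwise corrupt a hand-formatted connection string.
db_url = URL.create(
    "mysql+mysqlconnector",
    username=DB_USER, password=DB_PASS,
    host=DB_HOST, port=DB_PORT, database=DB_NAME,
)

engine = create_engine(
    db_url,
    connect_args={"ssl_ca": CA_CERT_PATH},
    pool_pre_ping=True,
)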

with engine.connect() as conn:
    print("SELECT 1 →", conn.execute(text("SELECT 1")).scalar_one())
    cipher = conn.execute(text("SHOW SESSION STATUS LIKE 'Ssl_cipher'")).fetchone()
    print("SSL cipher:", cipher[1] if cipher else None)

## 1) Quick schema check

In [None]:
import pandas as pd
from sqlalchemy import text

with engine.connect() as conn:
    print("Tables:")
    # text() keeps raw SQL strings executable under SQLAlchemy 2.x
    display(pd.read_sql_query(text("SHOW TABLES"), conn))

    print("\n`ufo.sightings` definition:")
    display(pd.read_sql_query(text("DESCRIBE sightings"), conn))

## 2) Aggregates and GROUP BY

We’ll start with counts and then layer on more aggregate functions.

### 2.1 Overall row count + distincts
- Total number of sightings
- Number of **distinct** `shape` values
- Number of **distinct** `country` values

Return a single row with three columns: `total_rows`, `distinct_shapes`, `distinct_countries`.
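
The three counting forms treat missing values differently, which matters when you read the distinct counts. The sketch below uses a hypothetical in-memory SQLite table `demo` with made-up values (an assumption for illustration, not the course data); the NULL handling shown here follows standard SQL and should match MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (shape TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("disk",), ("disk",), ("light",), (None,), ("",)],
)

# COUNT(*) counts rows; COUNT(col) skips NULLs; COUNT(DISTINCT col) also
# collapses duplicates. The empty string is a real value, not NULL.
total, non_null, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(shape), COUNT(DISTINCT shape) FROM demo"
).fetchone()
print(total, non_null, distinct)  # 5 4 3
conn.close()
```

The distinct count of 3 includes the empty string, so a column with blank entries can report one more "value" than you expect.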

In [None]:
# 🔧 Exercise 2.1
import pandas as pd
from sqlalchemy import text

sql = """
SELECT
  COUNT(*)            AS total_rows,
  COUNT(DISTINCT shape)   AS distinct_shapes,
  COUNT(DISTINCT country) AS distinct_countries
FROM sightings;
"""
with engine.connect() as conn:
    df_counts = pd.read_sql_query(text(sql), conn)

df_counts

### 2.2 Top shapes by count
List the **top 10 shapes** (by number of rows), with columns `shape`, `n`. Exclude NULL/empty shapes.

**Hints:** `WHERE shape IS NOT NULL AND shape <> ''`, `GROUP BY shape`, `ORDER BY n DESC`, `LIMIT 10`.

In [None]:
# 🔧 Exercise 2.2
sql = """
SELECT
  shape,
  COUNT(*) AS n
FROM sightings
WHERE ...
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_shapes = pd.read_sql_query(text(sql), conn)

df_shapes.head(10)

### 2.3 U.S. states with the most reports
Filter to U.S. sightings (`country = 'us'`), group by `state`, and return the **top 15** states with the most reports.
Only include non‑NULL/non‑empty states.

Columns: `state`, `n`. Sorted by `n` descending.

In [None]:
# 🔧 Exercise 2.3
sql = """
SELECT
  state,
  COUNT(*) AS n
FROM sightings
WHERE ...
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_states = pd.read_sql_query(text(sql), conn)

df_states.head(15)

## 3) Averages, mins/maxes, and HAVING

We’ll use the numeric column `` `duration (seconds)` `` and compute summaries by group. Use `HAVING` to filter **after** aggregation.
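
As a quick illustration of the difference between `WHERE` (filters rows before grouping) and `HAVING` (filters groups after aggregation), here is a sketch on a hypothetical in-memory SQLite table `demo` with made-up columns `grp` and `val`. SQLite is an assumption standing in for MySQL; the clause order and semantics are the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (grp TEXT, val REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("a", 10), ("a", 30), ("a", None), ("b", 5), ("c", 7), ("c", 9)],
)

rows = conn.execute(
    """
    SELECT grp, AVG(val) AS avg_val, COUNT(*) AS n
    FROM demo
    WHERE grp <> 'c'          -- row filter: applied before grouping
    GROUP BY grp
    HAVING COUNT(*) >= 2      -- group filter: applied after aggregation
    ORDER BY avg_val DESC
    """
).fetchall()
print(rows)  # [('a', 20.0, 3)]
conn.close()
```

Group `a` reports `n = 3` but averages only two values: `AVG` skips the NULL while `COUNT(*)` counts every row. Keep that in mind when comparing `avg_seconds` and `n` in the exercises.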

### 3.1 Average duration by shape
Compute ``AVG(`duration (seconds)`)`` by `shape`. Return columns:
`shape`, `avg_seconds`, `n` (count per shape).

Sort by `avg_seconds` descending. Include only shapes with at least **30** sightings.

In [None]:
# 🔧 Exercise 3.1
sql = """
SELECT
  shape,
  AVG(`duration (seconds)`) AS avg_seconds,
  COUNT(*)                  AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_avg_shape = pd.read_sql_query(text(sql), conn)

df_avg_shape.head(10)

### 3.2 Min/Max sanity check
For the same groups (by `shape`), compute `MIN`, `MAX`, and a **rounded** average (to 1 decimal).

Columns: `shape, min_s, max_s, avg_s_rounded, n`.

In [None]:
# 🔧 Exercise 3.2
sql = """
SELECT
  shape,
  MIN(`duration (seconds)`)              AS min_s,
  MAX(`duration (seconds)`)              AS max_s,
  ROUND(AVG(`duration (seconds)`), 1)    AS avg_s_rounded,
  COUNT(*)                               AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_minmax = pd.read_sql_query(text(sql), conn)

df_minmax.head(10)

## 4) Handling NULL/empty and simple string helpers

We'll group on a **filled** state value using `COALESCE`. Example pattern:

```sql
COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN')
```

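Reading inside out: `TRIM` strips surrounding whitespace, `NULLIF(..., '')` turns the resulting empty string into NULL, and `COALESCE` replaces NULL with `'UNKNOWN'`. NULL, empty, and whitespace-only states therefore all land in the same bucket.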
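
To see what the pattern returns for each kind of input, the sketch below evaluates it one value at a time. It assumes an in-memory SQLite connection as a stand-in for MySQL (these three functions behave the same way in both) and uses made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up inputs: a clean value, whitespace only, empty, NULL, padded value.
values = ["tx", "  ", "", None, " ca "]
buckets = [
    conn.execute(
        "SELECT COALESCE(NULLIF(TRIM(?), ''), 'UNKNOWN')", (v,)
    ).fetchone()[0]
    for v in values
]
print(buckets)  # ['tx', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'ca']
conn.close()
```

The padded value comes back trimmed, so `' ca '` and `'ca'` would be grouped together rather than counted as separate states.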

### 4.1 U.S. counts with UNKNOWN bucket
Group U.S. sightings by a cleaned state bucket using the `COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN')` pattern.
Return columns: `state_bucket`, `n`, sorted by `n` desc.

In [None]:
# 🔧 Exercise 4.1
sql = """
SELECT
  COALESCE(NULLIF(TRIM(state), ''), 'UNKNOWN') AS state_bucket,
  COUNT(*) AS n
FROM sightings
WHERE country = 'us'
GROUP BY ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_state_bucket = pd.read_sql_query(text(sql), conn)

df_state_bucket.head(20)

## 5) Light date/time derived columns

The `datetime` column is stored as text in the form `MM/DD/YYYY HH:MM`. MySQL can parse it with `STR_TO_DATE`.
We'll extract **year** to see long‑term patterns.

Format string: `'%m/%d/%Y %H:%i'`.
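
If you want to sanity-check the layout in Python first, note that the specifiers are not identical: MySQL uses `%i` for minutes (and `%M` for the month name), while Python's `strptime` uses `%M` for minutes. A small sketch with a made-up timestamp in the documented `MM/DD/YYYY HH:MM` layout:

```python
from datetime import datetime

MYSQL_FORMAT = "%m/%d/%Y %H:%i"   # as passed to STR_TO_DATE in MySQL
PYTHON_FORMAT = "%m/%d/%Y %H:%M"  # Python equivalent: %M means minutes here

sample = "10/10/1949 20:30"  # made-up value, not taken from the table
parsed = datetime.strptime(sample, PYTHON_FORMAT)
print(parsed.year, parsed.minute)  # 1949 30
```

A string that does not match the format raises `ValueError` in Python. In MySQL, `STR_TO_DATE` returns NULL for a value it cannot parse, so such rows would show up as a NULL `yr` group.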

### 5.1 Sightings per year (top 15 years)
Create a derived column `yr` with ``YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i'))`` and count rows per year.
Sort by count desc and show the **top 15** rows.

Columns: `yr`, `n`.

In [None]:
# 🔧 Exercise 5.1
sql = """
SELECT
  YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i')) AS yr,
  COUNT(*) AS n
FROM sightings
GROUP BY ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_years = pd.read_sql_query(text(sql), conn)

df_years.head(15)

### 5.2 Average duration by year (filter with HAVING)
Using the same `yr` expression, compute the average duration per year where there are at least **100** records in that year.
Return `yr`, `avg_seconds`, `n` ordered by `yr` ascending.

In [None]:
# 🔧 Exercise 5.2
sql = """
SELECT
  YEAR(STR_TO_DATE(`datetime`, '%m/%d/%Y %H:%i')) AS yr,
  AVG(`duration (seconds)`) AS avg_seconds,
  COUNT(*) AS n
FROM sightings
GROUP BY ...
HAVING ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_avg_by_year = pd.read_sql_query(text(sql), conn)

df_avg_by_year.head()

## 6) Light pandas (inspect + simple viz)

We'll keep pandas simple. Convert one of your SQL results into a small **bar chart**.

> If plotting causes issues, skip the plot and just show the DataFrame.

In [None]:
# Bar chart of top shapes (from Exercise 2.2)
import matplotlib.pyplot as plt

df_shapes_plot = df_shapes.copy().head(10)
df_shapes_plot.plot(kind="bar", x="shape", y="n", legend=False, title="Top 10 UFO shapes by count")
plt.xlabel("shape"); plt.ylabel("count")
plt.show()

## (Optional) Fallback: local CSV (pandas only)

If you need to test without a DB connection, you can upload `ufo_scrubbed.csv` and practice the pandas parts.
This does **not** replace the requirement to connect to the database for grading.
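
For practising the pandas side, a `GROUP BY` with `COUNT(*)` and `AVG` maps roughly onto `groupby(...).agg(...)`. The sketch below uses a small made-up DataFrame (`toy` is hypothetical; only the column names mirror the table) rather than the real CSV.

```python
import pandas as pd

# Made-up rows; only the column names mirror the sightings table.
toy = pd.DataFrame(
    {
        "shape": ["disk", "light", "disk", None, "light", "disk"],
        "duration (seconds)": [60.0, 300.0, 120.0, 30.0, None, 90.0],
    }
)

summary = (
    toy.dropna(subset=["shape"])                  # like WHERE shape IS NOT NULL
    .groupby("shape")                             # like GROUP BY shape
    .agg(
        n=("duration (seconds)", "size"),         # like COUNT(*): counts all rows
        avg_seconds=("duration (seconds)", "mean"),  # like AVG: skips missing
    )
    .reset_index()
    .sort_values("n", ascending=False)            # like ORDER BY n DESC
)
print(summary)
```

As in SQL, the mean ignores missing durations while the row count does not, so `light` has `n = 2` but its average comes from a single value.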

In [None]:
import pandas as pd
try:
    from google.colab import files
    print("Upload ufo_scrubbed.csv")
    uploaded = files.upload()
    csv_name = next(iter(uploaded.keys()))
    ufo_csv = pd.read_csv(csv_name)
    display(ufo_csv.head())  # a bare expression inside try: is not rendered
except Exception as e:
    print("Upload not available here:", e)

## Troubleshooting

- **`Access denied ...` (1045)** → Check your read‑only username/password and ask the instructor if needed.
- **Timeout / can’t connect (2003)** → Network or RDS security group issue. Try campus Wi‑Fi or VPN; ask the instructor to allow your IP address if necessary.
- **`ssl ca certificate` errors** → Re‑run the PEM download cell or upload the correct `global-bundle.pem` and ensure `connect_args={"ssl_ca": "global-bundle.pem"}`.
- **SQL errors** → Check backticks around columns with spaces or punctuation.

## Submit

- **Runtime → Restart and run all** to confirm a clean run
- **File → Download .ipynb**, rename to `Notebook2_LastFirst.ipynb`, and submit as directed