# Notebook 1 — Getting Data Out

_INFO 4614 • Unit 1: Relational Databases (MySQL)_

**Goals.** Practice the essentials you'll need all term:

- Connect from **Google Colab → AWS RDS (MySQL)** using the **Amazon global CA** (`global-bundle.pem`)
- Run the **basic queries** that we introduced on Aug 21 and that you'll read about for Aug 26: `SELECT`, `WHERE`, `ORDER BY`, `LIMIT`, and simple `LIKE` pattern matching
- Pull results back into **pandas** DataFrames

**Data.** We'll use the class database: schema **`ramen`**, table **`ratings`**. This table mirrors `ramen_ratings_clean.csv`.

**Submission checklist**

- ✅ All cells run top to bottom without errors
- ✅ Each **Exercise** cell has working code (no `...` left)
- ✅ Outputs visible (first rows printed, DataFrames displayed)
- ✅ File is renamed `Notebook1_LastFirst.ipynb` before you submit

## 0) Setup

We'll install the client libraries, set a couple of variables, and (by default) **download** the Amazon RDS global certificate bundle, which is used to verify the server's TLS certificate.

If you prefer to **upload** a PEM file you already have, there's an alternate cell below.
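
Whichever option you use, it can help to confirm that the file really is a certificate bundle before connecting. The sketch below is one way to do that; `count_pem_certificates` is a hypothetical helper name, not part of any library, and it only counts PEM header lines rather than validating the certificates.

```python
from pathlib import Path


def count_pem_certificates(path):
    """Return how many PEM certificate blocks the file at `path` contains.

    Returns 0 when the file is missing or empty. This only counts
    "BEGIN CERTIFICATE" markers; it does not validate the certificates.
    """
    p = Path(path)
    if not p.is_file():
        return 0
    return p.read_text(errors="replace").count("-----BEGIN CERTIFICATE-----")
```

After the download (or upload) cell has run, `count_pem_certificates("global-bundle.pem")` should return a number greater than zero; a result of `0` suggests the download failed or the wrong file was uploaded.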

In [None]:
# Install dependencies (lightweight)
!pip -q install mysql-connector-python SQLAlchemy pandas

# Show versions (sanity check)
import sys, sqlalchemy, pandas
print("Python:", sys.version.split()[0])
print("SQLAlchemy:", sqlalchemy.__version__)
print("pandas:", pandas.__version__)

In [None]:
# Connection settings for the class database
DB_HOST = ""
DB_PORT = 3306
DB_NAME = "ramen"

# We'll save the Amazon RDS certificate bundle here
CA_CERT_PATH = "global-bundle.pem"

In [None]:
# Option A (recommended): download the current Amazon RDS global certificate bundle
# Source: AWS Trust Store (global bundle)
!wget -q https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem -O "$CA_CERT_PATH"

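import os
# wget -O creates the file even when the download fails, so also check that it is non-empty
assert os.path.exists(CA_CERT_PATH) and os.path.getsize(CA_CERT_PATH) > 0, "PEM download failed: try Option B (upload)"
print("Saved:", os.path.abspath(CA_CERT_PATH))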

In [None]:
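# Option B (alternative): upload a PEM you already have.
# If Option A worked, you can cancel the upload dialog.
from google.colab import files
print("Choose your PEM (e.g., global-bundle.pem)")
uploaded = files.upload()  # pick your .pem; it is saved to the current directory
if uploaded:
    CA_CERT_PATH = next(iter(uploaded))  # point at the uploaded file's actual name
    print("Using:", CA_CERT_PATH)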

## 1) Enter credentials securely

Enter the **read-only** username and password you were given. The password is read with `getpass`, so it is not echoed or saved in the notebook file. The username you type does appear in the cell output.

In [None]:
from getpass import getpass

DB_USER = input("MySQL username: ").strip()
DB_PASS = getpass("MySQL password: ").strip()

## 2) Connect to MySQL (test)

We'll use **SQLAlchemy** with the **mysql‑connector** driver and require TLS by pointing to the CA bundle.

In [None]:
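from urllib.parse import quote
from sqlalchemy import create_engine, text

engine = create_engine(
    # quote(..., safe="") percent-encodes characters such as "@", ":" or "/"
    # that would otherwise break the connection URL
    f"mysql+mysqlconnector://{quote(DB_USER, safe='')}:{quote(DB_PASS, safe='')}@{DB_HOST}:{DB_PORT}/{DB_NAME}",
    connect_args={"ssl_ca": CA_CERT_PATH},
    pool_pre_ping=True,  # test each pooled connection before reusing it
)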

# Quick smoke tests
with engine.connect() as conn:
    # 1) minimal "it works" check
    one = conn.execute(text("SELECT 1")).scalar_one()
    print("SELECT 1 →", one)

    # 2) confirm TLS is in use (should return a non-empty cipher name)
    row = conn.execute(text("SHOW SESSION STATUS LIKE 'Ssl_cipher'")).fetchone()
    print("SSL cipher:", row[1] if row else None)

In [None]:
# 🔧 Exercise: Re-create the engine yourself.
# Replace each `...` with the correct values, mirroring the working example above.
from sqlalchemy import create_engine

engine2 = create_engine(
    f"mysql+mysqlconnector://{...}:{...}@{...}:{...}/{...}",
    connect_args={"ssl_ca": ...},
    pool_pre_ping=True,
)

# Sanity check (run and expect 1)
from sqlalchemy import text
with engine2.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar_one())

## 3) Explore the schema

Let's verify the table exists and inspect its columns.

In [None]:
import pandas as pd
from sqlalchemy import text

with engine.connect() as conn:
    print("Tables:")
    # Wrap raw SQL in text(), as in the rest of this notebook
    display(pd.read_sql_query(text("SHOW TABLES"), conn))

    print("\n`ratings` definition:")
    display(pd.read_sql_query(text("DESCRIBE ratings"), conn))

## 4) Your first SELECTs

Use **SQL** and return results as **pandas** DataFrames. Keep it simple and clear.

### 4.1 First 5 rows
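Return 5 rows from `ramen.ratings`. Without an `ORDER BY`, MySQL does not guarantee which 5 rows come back, so treat this as a quick peek at the table.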

**Starter:**

```sql
SELECT * FROM ratings
LIMIT 5;
```

In [None]:
# 🔧 Exercise 4.1
import pandas as pd
from sqlalchemy import text

sql = """
SELECT ...
...
"""
with engine.connect() as conn:
    df_first5 = pd.read_sql_query(text(sql), conn)

df_first5.head()

### 4.2 Filter + sort
Get the **top 10** `Cup`-style ramen from **Japan**, sorted by `rating` (highest first). Return only the columns:
`review, brand, variety, style, country, rating`.

Hints:
- Use `WHERE` for filters
- Use `ORDER BY rating DESC`
- End with `LIMIT 10`
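
To see how the three clauses in the hints combine, here is a small sketch on toy data. It assumes an in-memory SQLite table named `snacks` (hypothetical, and unrelated to the class database) so it runs without any connection; the clause order is the same in MySQL.

```python
import sqlite3

# Toy data in an in-memory SQLite database (a stand-in, not the class database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snacks (name TEXT, kind TEXT, origin TEXT, score REAL)")
conn.executemany(
    "INSERT INTO snacks VALUES (?, ?, ?, ?)",
    [
        ("Sea Salt", "Chip", "USA", 4.0),
        ("Barbecue", "Chip", "USA", 4.5),
        ("Paprika", "Chip", "Germany", 5.0),
        ("Cheddar", "Cracker", "USA", 4.75),
        ("Jalapeno", "Chip", "USA", 3.0),
    ],
)

# Filter first (WHERE), then sort (ORDER BY), then cut off (LIMIT)
sql = """
SELECT name, score
FROM snacks
WHERE kind = 'Chip' AND origin = 'USA'
ORDER BY score DESC
LIMIT 2
"""
top_chips = conn.execute(sql).fetchall()
print(top_chips)  # [('Barbecue', 4.5), ('Sea Salt', 4.0)]
conn.close()
```

The German chip and the cracker are filtered out before sorting, so the highest score in the whole table does not appear in the result.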

In [None]:
# 🔧 Exercise 4.2
sql = """
SELECT ...
FROM ratings
WHERE ...
ORDER BY ...
LIMIT ...
"""
with engine.connect() as conn:
    df_top_jp = pd.read_sql_query(text(sql), conn)

df_top_jp.head(10)

### 4.3 Pattern matching (`LIKE`)
Find all rows whose `variety` contains the word **chicken**. With MySQL's default case-insensitive collations, `LIKE` matches `Chicken` and `chicken` alike, so you don't need `LOWER()`.

Return columns: `review, brand, variety, country, rating` sorted by `rating` (desc), then by `review` (asc) to break ties.
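
As a reminder of how the `%` wildcard behaves, the sketch below compares a "contains" pattern with a "starts with" pattern. It assumes a hypothetical in-memory SQLite table `teas`; SQLite's `LIKE` is also case-insensitive for ASCII text, which is close to what you'll see with MySQL's default collations.

```python
import sqlite3

# Toy data in an in-memory SQLite database (a stand-in, not the class database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teas (name TEXT)")
conn.executemany(
    "INSERT INTO teas VALUES (?)",
    [("Lemon Ginger",), ("ginger peach",), ("Earl Grey",), ("Gingerbread Chai",)],
)

# '%ginger%' matches the text anywhere in the value
contains = {r[0] for r in conn.execute("SELECT name FROM teas WHERE name LIKE '%ginger%'")}

# 'ginger%' matches only values that start with the text
starts_with = {r[0] for r in conn.execute("SELECT name FROM teas WHERE name LIKE 'ginger%'")}

print(sorted(contains))
print(sorted(starts_with))
conn.close()
```

Here `contains` holds three names, while `starts_with` drops "Lemon Ginger" because the pattern has no leading `%`.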

In [None]:
# 🔧 Exercise 4.3
sql = """
SELECT ...
FROM ratings
WHERE ...
ORDER BY ...
"""
with engine.connect() as conn:
    df_chicken = pd.read_sql_query(text(sql), conn)

df_chicken.head()

### 4.4 LIMIT with offset
Get 5 rows **starting at the 11th row** of the full table when ordered by `review` ascending.

Hint: `LIMIT <count> OFFSET <offset>`
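
The offset counts rows to **skip**, so it is one less than the 1-based position of the first row you want. The sketch below illustrates this on a hypothetical in-memory SQLite table `items` holding the numbers 1 through 30; the `LIMIT ... OFFSET ...` syntax is the same in MySQL.

```python
import sqlite3

# Toy data in an in-memory SQLite database (a stand-in, not the class database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 31)])

start_row = 4  # 1-based position of the first row we want
count = 3      # how many rows to return

rows = conn.execute(
    "SELECT id FROM items ORDER BY id ASC LIMIT ? OFFSET ?",
    (count, start_row - 1),  # skip the rows before start_row
).fetchall()
ids = [r[0] for r in rows]
print(ids)  # [4, 5, 6]
conn.close()
```

Always pair `OFFSET` with an `ORDER BY`; without a defined order, "the Nth row" has no stable meaning.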

In [None]:
# 🔧 Exercise 4.4
sql = """
SELECT review, brand, variety, country, rating
FROM ratings
ORDER BY review ASC
LIMIT ...
"""
with engine.connect() as conn:
    df_page = pd.read_sql_query(text(sql), conn)

df_page

## 5) Light pandas practice

We'll keep pandas minimal for now — just enough to inspect results.

In [None]:
# Summaries of your query from 4.2
df_top_jp.shape, df_top_jp.dtypes

In [None]:
# Basic descriptive stats on ratings (from your 4.3 result)
df_chicken["rating"].describe()

## (Optional) Fallback: local CSV

If you're stuck off‑network, you can test your **pandas** parts locally by reading the CSV that matches the database table.
This does **not** replace the requirement to connect to the database.

In [None]:
import pandas as pd
from google.colab import files

print("Upload ramen_ratings_clean.csv (matches the DB table)")
uploaded = files.upload()
csv_name = next(iter(uploaded.keys()))
ramen_csv = pd.read_csv(csv_name)
ramen_csv.head()

## Troubleshooting (read me if you get errors)

- **`Access denied for user ...` (1045)** → Username/password typo, or your user doesn't have permission. Ask the instructor to verify the **read‑only** account.
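- **Can't connect / timeout (2003)** → Network is blocking port **3306**, or the database firewall isn't allowing your IP. In Colab the connection comes from Google's servers, so switching to campus Wi‑Fi or VPN only helps if you run the notebook locally; otherwise ask the instructor to check security‑group rules.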
- **`ssl ca certificate` errors** → Re‑run the **download PEM** cell (or upload the correct PEM). Make sure `CA_CERT_PATH`, which is passed as `connect_args={"ssl_ca": CA_CERT_PATH}`, points to the file you actually downloaded or uploaded.
- **`No module named ...`** → Re‑run the install cell at the top.
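
For the **2003** case, a quick TCP check can tell you whether the port is reachable at all before you involve MySQL or TLS. This is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper name, and a `True` result only means the port accepted a connection, not that your credentials or certificate are correct.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

In the notebook you could call `can_reach(DB_HOST, DB_PORT)` after the settings cell. `False` points to a network or firewall problem rather than a username or password problem.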

## Submit

- **Runtime → Restart and run all** to ensure a clean run
- **File → Download .ipynb**, rename to `Notebook1_LastFirst.ipynb`, and submit per the syllabus