 # SQL Queries Notebook

 This notebook assumes that:
 - The MySQL database **`bigdata_project`** has already been created.
 - The tables **countries**, **indicators**, **indicator_values** and the view **all_data** exist
   (created by `db_setup.py`).
 - The `countries.region` column is already filled using `region_list.csv`.

 In this notebook we:

 1. Connect to MySQL & run quick diagnostics
 2. (Query i) Compute **yearly averages & standard deviations** for each indicator per region
 3. (Query ii) For **one country**, compute a **10-year rolling average** for all indicators
 4. (Query iii) Identify **top/bottom 10 countries per region** for each indicator over the last 10 years

 ## Step 0 – Imports & DB connection

In [None]:
import pandas as pd
import numpy as np
import pymysql

pd.options.display.max_rows = 50
pd.options.display.max_columns = 50

# Connect directly to the bigdata_project database
conn = pymysql.connect(
    host="localhost",
    user="root",
    password="XXX",          # ← PUT YOUR REAL MYSQL PASSWORD HERE
    database="bigdata_project",
    autocommit=True,
    charset="utf8mb4"
)

print("Connected to MySQL database 'bigdata_project'.")


 ## Step 1 – Quick diagnostics

 We:
 - Count rows in `countries`, `indicators`, `indicator_values`
 - Inspect how many countries have a region assigned
 - Preview the `all_data` view
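
 Because a table name cannot be passed as a query parameter, the row-count helper has to interpolate it into the SQL string. A minimal sketch of guarding that step with an allowlist is shown here. `ALLOWED_TABLES` and `count_rows_safe` are hypothetical names, and an in-memory SQLite database stands in for MySQL so the example runs without a server.

```python
import sqlite3

# Hypothetical allowlist: only these names may be interpolated into SQL.
ALLOWED_TABLES = {"countries", "indicators", "indicator_values"}


def count_rows_safe(cur, table_name: str) -> int:
    """Count rows in an allowlisted table; reject any other name."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"Unexpected table name: {table_name!r}")
    cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    return cur.fetchone()[0]


# Assumption: in-memory SQLite replaces MySQL purely for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE countries (country_id INTEGER, region TEXT)")
demo_cur.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [(1, "Europe & Central Asia"), (2, None), (3, "South Asia")],
)

total = count_rows_safe(demo_cur, "countries")
demo_cur.execute("SELECT COUNT(*) FROM countries WHERE region IS NOT NULL")
with_region = demo_cur.fetchone()[0]
print(f"countries: {total} rows, {with_region} with a region")
```

 The same two checks (total rows, rows with a non-NULL region) are what the diagnostics cell runs against the real database.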

In [None]:
cursor = conn.cursor()

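def count_rows(table_name: str) -> int:
    """Return the row count of a table; table_name must be a trusted identifier."""
    # Identifiers cannot be passed as query parameters, so the name is interpolated.
    cursor.execute(f"SELECT COUNT(*) FROM {table_name};")
    return cursor.fetchone()[0]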

tables = ["countries", "indicators", "indicator_values"]
print("Row counts per table:")
for t in tables:
    try:
        n = count_rows(t)
        print(f" - {t}: {n} rows")
    except Exception as e:
        print(f"Error counting rows in {t}:", e)

# How many countries have a non-NULL region?
cursor.execute("SELECT COUNT(*) FROM countries WHERE region IS NOT NULL;")
countries_with_region = cursor.fetchone()[0]
print("\nCountries with region set:", countries_with_region)

# Preview the all_data view (if it exists)
try:
    sample_all = pd.read_sql("SELECT * FROM all_data LIMIT 10;", conn)
    print("\nSample from all_data:")
    display(sample_all)
except Exception as e:
    print("\nCould not read from view all_data. Error:", e)


 ## Step 2 – Query (i): Yearly averages & standard deviations per region and indicator

 We want, for **each region & indicator & year**:
 - The average value
 - The standard deviation of the values

 We use the **indicator_values + countries + indicators** tables so that:
 - `countries.region` gives us the region
 - `indicators.indicator_code / indicator_name` describe the indicator

 We:
 1. Filter out rows with `NULL` value
 2. Exclude rows where region is `NULL`
 3. Group by `region, indicator_code, indicator_name, year`
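
 The three steps above can be sketched in plain Python on a handful of made-up rows. The data and the indicator code `IND.X` are invented for illustration. The sketch also shows one edge case worth expecting in the real output: a group with a single observation has no sample standard deviation, which SQL should report as `NULL`.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up rows: (region, indicator_code, year, value); None marks a NULL.
rows = [
    ("Region A", "IND.X", 2020, 2.0),
    ("Region A", "IND.X", 2020, 4.0),
    ("Region A", "IND.X", 2020, 6.0),
    ("Region A", "IND.X", 2021, 5.0),   # only one observation this year
    ("Region A", "IND.X", 2021, None),  # dropped: NULL value
    (None, "IND.X", 2020, 9.0),         # dropped: NULL region
]

# Steps 1 and 2: filter; step 3: group by (region, indicator_code, year).
groups = defaultdict(list)
for region, code, year, value in rows:
    if value is None or region is None:
        continue
    groups[(region, code, year)].append(value)

stats = {}
for key, values in sorted(groups.items()):
    # The sample standard deviation needs at least two points.
    sd = stdev(values) if len(values) > 1 else None
    stats[key] = (mean(values), sd, len(values))

for key, (avg, sd, n_obs) in stats.items():
    print(key, "avg:", avg, "stddev:", sd, "n_obs:", n_obs)
```

 Keeping `n_obs` next to the average, as the SQL query does, makes it easy to spot groups whose statistics rest on very few countries.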

In [None]:
query_yearly_stats = """
SELECT
    c.region,
    i.indicator_code,
    i.indicator_name,
    v.year,
    AVG(v.value)         AS avg_value,
    STDDEV_SAMP(v.value) AS stddev_value,
    COUNT(*)             AS n_obs
FROM indicator_values v
JOIN countries  c ON v.country_id   = c.country_id
JOIN indicators i ON v.indicator_id = i.indicator_id
WHERE v.value IS NOT NULL
  AND c.region IS NOT NULL
GROUP BY
    c.region,
    i.indicator_code,
    i.indicator_name,
    v.year
ORDER BY
    c.region,
    i.indicator_code,
    v.year;
"""

print("▶ Running yearly averages & stddev per region & indicator...")
yearly_stats_df = pd.read_sql(query_yearly_stats, conn)

print("Result shape:", yearly_stats_df.shape)
display(yearly_stats_df.head(20))


 ### Optional: inspect one indicator / region as a time-series

 You can filter `yearly_stats_df` by:
 - `region` (e.g., `"Europe & Central Asia"`)
 - `indicator_code` (e.g., `"SH.XPD.CHEX.GD.ZS"` for health expenditure)

 and then plot the resulting time series if you like.

In [None]:
# Example filter (change values below as you wish)
example_region = "Europe & Central Asia"
example_indicator = "SH.XPD.CHEX.GD.ZS"  # current health expenditure % of GDP

example_ts = yearly_stats_df[
    (yearly_stats_df["region"] == example_region) &
    (yearly_stats_df["indicator_code"] == example_indicator)
].sort_values("year")

print(f"\nTime series for {example_region} – {example_indicator}:")
display(example_ts.head(15))


 ## Step 3 – Query (ii): 10-year rolling average for all indicators for a given country

 For a single country (e.g. **Greece**):
 - Get all indicator values by year
 - For each indicator separately, compute a **10-year rolling average**

 We:
 1. Choose a country name or country code (you can easily change it).
 2. Pull all non-NULL values from the **all_data** view.
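 3. In pandas, group by `indicator_id` and apply a rolling mean over a window of 10 rows to `value`. Because rows with `NULL` values are filtered out, this window spans exactly 10 calendar years only when the indicator has no gaps.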
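
 A window counted in rows and a window counted in calendar years give the same result only when a series has one value per year. The sketch below compares the two on a made-up series with a gap. Both helper functions are hypothetical and written in plain Python for illustration; they are not part of the notebook.

```python
def rolling_mean_rows(values, window=10):
    """Mean of the current value and up to window - 1 preceding rows."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def rolling_mean_years(years, values, span=10):
    """Mean of the values whose year lies in (year - span, year]."""
    out = []
    for y in years:
        chunk = [v for yy, v in zip(years, values) if y - span < yy <= y]
        out.append(sum(chunk) / len(chunk))
    return out


# Made-up series with a gap between 2001 and 2015.
years = [2000, 2001, 2015, 2016]
values = [10.0, 20.0, 30.0, 40.0]

by_rows = rolling_mean_rows(values)
by_years = rolling_mean_years(years, values)

print("row window :", by_rows)   # 2015 still averages in 2000 and 2001
print("year window:", by_years)  # 2015 only sees years 2006 to 2015
```

 If gaps matter for an indicator, one option is to reindex each series to a full year range before taking the rolling mean; whether that is appropriate depends on how missing years should be treated.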

In [None]:
# Choose which country to analyse
country_name = "Greece"   # you can change this to any country in your data

query_country_ts = """
SELECT
    indicator_id,
    indicator_code,
    indicator_name,
    year,
    value
FROM all_data
WHERE country_name = %s
  AND value IS NOT NULL
ORDER BY indicator_id, year;
"""

print(f"\n▶ Pulling time series for country: {country_name}")
country_ts_df = pd.read_sql(query_country_ts, conn, params=[country_name])

print("Raw time-series shape:", country_ts_df.shape)
display(country_ts_df.head(15))

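# Compute a rolling average per indicator over the last 10 available rows.
# Note: the window counts rows, not calendar years, so it spans more than
# 10 years wherever an indicator has missing years.
country_ts_df["rolling_10y_avg"] = (
    country_ts_df
    .groupby("indicator_id")["value"]
    .transform(lambda s: s.rolling(window=10, min_periods=1).mean())
)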

print("\nSample with 10-year rolling averages:")
display(country_ts_df.head(20))


 ### Optional: filter recent years for easier inspection

 For instance, keep only rows with `year >= 2010` to focus on the modern period.

In [None]:
recent_country_ts = country_ts_df[country_ts_df["year"] >= 2010].copy()
print("Recent period shape:", recent_country_ts.shape)
display(recent_country_ts.head(20))


 ## Step 4 – Query (iii): Top/bottom 10 countries per region & indicator over the last 10 years

 Goal:
 - For each **region & indicator**, over the **last 10 available years** in the dataset:
   - Compute the **average value** for each country
   - Extract the **top 10** and **bottom 10** countries (based on that 10-year average)

 Approach:
 1. In SQL, build a base aggregated table:
    - Filter to years >= (max year - 9), where max year is the latest year in `indicator_values` across all indicators
    - Group by region, indicator, country → compute avg over that 10-year window
 2. In pandas, from that base table:
    - Sort by `avg_value_10y` descending, then `groupby(["region", "indicator_code"]).head(10)` → top 10
    - Sort by `avg_value_10y` ascending, then `groupby(["region", "indicator_code"]).head(10)` → bottom 10
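
 The approach above can be sketched in plain Python on made-up data. The function `top_bottom`, the sample rows and the value 2022 for the latest year are all assumptions for illustration. The sketch shows two details: `max year - 9` yields a window of exactly 10 years, and the top and bottom lists overlap whenever a group has fewer than 2n countries.

```python
from collections import defaultdict

# Assumed latest year; the cutoff mirrors the SQL filter year >= MAX(year) - 9.
max_year = 2022
window_years = list(range(max_year - 9, max_year + 1))

# Made-up base rows: (region, indicator_code, country_code, avg_value_10y).
base = [
    ("Region A", "IND.X", "AAA", 5.0),
    ("Region A", "IND.X", "BBB", 9.0),
    ("Region A", "IND.X", "CCC", 1.0),
    ("Region B", "IND.X", "DDD", 7.0),
]


def top_bottom(rows, n):
    """Return (top, bottom) dicts keyed by (region, indicator_code)."""
    grouped = defaultdict(list)
    for region, code, country, avg in rows:
        grouped[(region, code)].append((country, avg))
    top, bottom = {}, {}
    for key, members in grouped.items():
        # Sort within each group, then keep the first n in each direction.
        top[key] = sorted(members, key=lambda m: m[1], reverse=True)[:n]
        bottom[key] = sorted(members, key=lambda m: m[1])[:n]
    return top, bottom


top, bottom = top_bottom(base, n=2)

print("years in window:", len(window_years))
for key in top:
    print(key, "top:", top[key], "bottom:", bottom[key])
```

 With real data and n = 10, a region with fewer than 20 countries reporting an indicator will list some countries in both the top and the bottom table.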

In [None]:
query_top_base = """
SELECT
    c.region,
    i.indicator_code,
    i.indicator_name,
    c.country_code,
    c.country_name,
    AVG(v.value) AS avg_value_10y,
    COUNT(*)     AS n_obs
FROM indicator_values v
JOIN countries  c ON v.country_id   = c.country_id
JOIN indicators i ON v.indicator_id = i.indicator_id
WHERE v.value IS NOT NULL
  AND c.region IS NOT NULL
  AND v.year >= (SELECT MAX(year) - 9 FROM indicator_values)
GROUP BY
    c.region,
    i.indicator_code,
    i.indicator_name,
    c.country_code,
    c.country_name;
"""

print("\n▶ Building base 10-year aggregation per country/region/indicator...")
ten_year_base_df = pd.read_sql(query_top_base, conn)

print("10-year aggregated base shape:", ten_year_base_df.shape)
display(ten_year_base_df.head(20))


 ### Step 4.1 – Top 10 countries per region & indicator (last 10 years)

In [None]:
def top_n_per_region_indicator(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """
    For each (region, indicator_code) combination, returns the top n rows
    with the largest avg_value_10y.
    """
    return (
        df.sort_values(["region", "indicator_code", "avg_value_10y"], ascending=[True, True, False])
        .groupby(["region", "indicator_code"], group_keys=False)
        .head(n)
    )

top10_df = top_n_per_region_indicator(ten_year_base_df, n=10)

print("Top 10 per region & indicator (based on avg_value_10y):")
display(top10_df.head(30))


 ### Step 4.2 – Bottom 10 countries per region & indicator (last 10 years)

In [None]:
def bottom_n_per_region_indicator(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """
    For each (region, indicator_code) combination, returns the bottom n rows
    with the smallest avg_value_10y.
    """
    return (
        df.sort_values(["region", "indicator_code", "avg_value_10y"], ascending=[True, True, True])
        .groupby(["region", "indicator_code"], group_keys=False)
        .head(n)
    )

bottom10_df = bottom_n_per_region_indicator(ten_year_base_df, n=10)

print("Bottom 10 per region & indicator (based on avg_value_10y):")
display(bottom10_df.head(30))


 ### Optional: focus on a single indicator / region

 You can filter `top10_df` or `bottom10_df` to analyse specific cases, for example:
 - `indicator_code = "SH.XPD.CHEX.GD.ZS"` (health expenditure)
 - `region = "Europe & Central Asia"`

In [None]:
focus_region = "Europe & Central Asia"
focus_indicator = "SH.XPD.CHEX.GD.ZS"

top_focus = top10_df[
    (top10_df["region"] == focus_region) &
    (top10_df["indicator_code"] == focus_indicator)
].copy()

bottom_focus = bottom10_df[
    (bottom10_df["region"] == focus_region) &
    (bottom10_df["indicator_code"] == focus_indicator)
].copy()

print(f"\nTop 10 in {focus_region} for {focus_indicator}:")
display(top_focus)

print(f"\nBottom 10 in {focus_region} for {focus_indicator}:")
display(bottom_focus)


 ## Step 5 – Close connection

In [None]:
cursor.close()
conn.close()
print("Connection to MySQL closed.")
