# Practice Notebook: Files + PostgreSQL (Codespaces)

This notebook is intentionally **repo-name independent**.

**Important:**  
- Your **GitHub repo name** (e.g., `Data_Science-notebooks`) is *not* your PostgreSQL schema.  
- Your PostgreSQL **database/schema** for this course stays `hi5304` unless your instructor explicitly changes it.

We will:
1. Load a CSV from the repo `data/` folder  
2. Connect to PostgreSQL using a single, reusable connection  
3. Run SQL queries and analyze the results with pandas


## 0) Setup (run once)

This cell:
- finds the repo root (where the `data/` folder lives)
- sets `DATA_DIR` for file loading
- creates a PostgreSQL `engine` for `pd.read_sql(...)`
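
The setup steps listed above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the course's official setup cell. `SCHEMA`, `DATA_DIR` and `run_sql` are the names later cells rely on. `find_repo_root`, `build_pg_url` and `get_engine` are hypothetical helper names introduced here. The connection settings are read from the standard libpq environment variables (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`), and every fallback value is a guess that may not match your Codespace.

```python
import os
from functools import lru_cache
from pathlib import Path
from urllib.parse import quote_plus

SCHEMA = "hi5304"  # course schema from the note at the top of the notebook


def find_repo_root(start=None, marker="data"):
    """Return the nearest folder at or above `start` that contains `marker`, else None."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


def build_pg_url(env=None):
    """Build a SQLAlchemy URL for psycopg 3 from libpq-style environment variables."""
    env = os.environ if env is None else env
    user = quote_plus(env.get("PGUSER", "postgres"))   # assumed default user
    password = quote_plus(env.get("PGPASSWORD", ""))   # URL-escaped in case of special characters
    host = env.get("PGHOST", "localhost")              # assumed default host
    port = env.get("PGPORT", "5432")                   # PostgreSQL's default port
    database = env.get("PGDATABASE", SCHEMA)           # assumption: database is named like the schema
    auth = f"{user}:{password}" if password else user
    return f"postgresql+psycopg://{auth}@{host}:{port}/{database}"


@lru_cache(maxsize=1)
def get_engine():
    """Create the engine on first use and return the same object on later calls."""
    from sqlalchemy import create_engine  # imported lazily; needs sqlalchemy and psycopg installed
    return create_engine(build_pg_url())


def run_sql(query):
    """Run a SQL string and return the result as a pandas DataFrame."""
    import pandas as pd
    return pd.read_sql(query, get_engine())


REPO_ROOT = find_repo_root()
DATA_DIR = REPO_ROOT / "data" if REPO_ROOT is not None else None
print("Repo root:", REPO_ROOT)
print("Data directory:", DATA_DIR)
```

With these definitions in place, `engine = get_engine()` would give the `engine` object that `pd.read_sql(query, engine)` expects. That call needs the packages installed in the next cell and a reachable PostgreSQL server. Searching upward for `data/` rather than hard-coding a relative path keeps the notebook independent of the folder it is opened from.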


In [1]:
%pip install "psycopg[binary]" sqlalchemy pandas matplotlib scipy


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import psycopg
import sqlalchemy
import pandas as pd

print("All imports worked")


All imports worked


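In [None]:
from pathlib import Path
import pandas as pd

# Path.cwd() is the notebook's folder, not the repo root. In the original run it was
# /workspaces/Data_Science_Coding_Laboratory/lessons/Jupyter, which has no data/ subfolder,
# so walk upward until a folder containing data/ is found.
here = Path.cwd()
DATA_DIR = next(
    (p / "data" for p in (here, *here.parents) if (p / "data").is_dir()),
    here / "data",  # fallback: same path as before if no data/ folder is found
)
print("Working directory:", here)
print("Data directory exists:", DATA_DIR.exists())


In [None]:
# Load the patients CSV from the data/ folder (run after DATA_DIR is defined)
patients = pd.read_csv(DATA_DIR / "patients.csv")

patients.head()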


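In [None]:
# SCHEMA and engine are expected from the setup cell in section 0.
# SCHEMA is restated here so the query string can be built even if that cell was skipped.
SCHEMA = "hi5304"  # course schema named at the top of this notebook
schema = SCHEMA
query = f"""
SELECT *
FROM {schema}.patients
LIMIT 5;
"""

# engine must already be a SQLAlchemy engine connected to the course database
df_patients = pd.read_sql(query, engine)
df_patients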

In [None]:
# Shape of the dataset
patients.shape


In [None]:
# Column names
patients.columns


In [None]:
# Basic info
patients.info()


In [None]:
from pathlib import Path
import pandas as pd

# Relative path: assumes data/ sits one level above the notebook's working directory
DATA_DIR = Path("..") / "data"
patients = pd.read_csv(DATA_DIR / "patients.csv")

patients.head()


In [None]:
schema = SCHEMA
query = f"""
SELECT *
FROM {schema}.patients
LIMIT 5;
"""

df_patients = run_sql(query)
df_patients


In [None]:
schema = SCHEMA
query = f"""
SELECT
    p.patient_id,
    p.first_name,
    p.last_name,
    b.reading_date,
    b.systolic,
    b.diastolic,
    b.heart_rate
FROM {schema}.patients p
JOIN {schema}.bp_readings b
  ON p.patient_id = b.patient_id
ORDER BY b.reading_date;
"""

df_bp = run_sql(query)
df_bp.head()


In [None]:
schema = SCHEMA
query = f"""
SELECT
    patient_id,
    COUNT(*) AS medication_count
FROM {schema}.medications
GROUP BY patient_id
ORDER BY medication_count DESC;
"""

df_meds = run_sql(query)
df_meds


In [None]:
schema = SCHEMA
query = f"""
SELECT
    AVG(systolic) AS avg_systolic,
    AVG(diastolic) AS avg_diastolic,
    COUNT(*) AS total_readings
FROM {schema}.bp_readings;
"""

df_summary = run_sql(query)
df_summary


In [None]:
df_bp.describe()


In [None]:
df_bp.groupby("patient_id")[["systolic", "diastolic"]].mean()


In [None]:
import matplotlib.pyplot as plt

df_bp.plot(x="reading_date", y="systolic", kind="line")
plt.show()


In [None]:
schema = SCHEMA
query = f"""
SELECT * 
FROM {schema}.cardio1
WHERE sbp IS NOT NULL
  AND cardio IS NOT NULL;
"""
df = run_sql(query)

df.head()


In [None]:
df["cardio"].value_counts()


In [None]:
# Split the SBP values by cardiovascular disease status
sbp_cardio_0 = df.loc[df["cardio"] == 0, "sbp"]
sbp_cardio_1 = df.loc[df["cardio"] == 1, "sbp"]


In [None]:
# Summary statistics for each group
sbp_cardio_0.describe(), sbp_cardio_1.describe()


In [None]:
from scipy.stats import f_oneway

# One-way ANOVA comparing mean SBP between the two groups.
# With only two groups this is equivalent to an equal-variance two-sample t-test.
f_stat, p_value = f_oneway(sbp_cardio_0, sbp_cardio_1)
f_stat, p_value


In [None]:
import matplotlib.pyplot as plt

# Box plot of SBP for each cardio group
df.boxplot(column="sbp", by="cardio")
plt.xlabel("Cardio (0 = No CVD, 1 = CVD)")
plt.ylabel("Systolic Blood Pressure (mmHg)")
plt.title("SBP by Cardiovascular Disease Status")
plt.suptitle("")  # remove the automatic "Boxplot grouped by cardio" super-title
plt.show()
