# Data Validation in PostgreSQL

This notebook validates that the data loaded from `supermarket_sales.csv` was correctly inserted into the PostgreSQL database.

We will use SQL queries via `sqlalchemy` and `pandas` to verify data integrity.



In [None]:
# Import libraries

import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
import pandas as pd

In [None]:
# Load environment variables
load_dotenv()

# Retrieve credentials securely
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_HOST = os.getenv("DB_HOST")
DB_PORT = os.getenv("DB_PORT")
DB_NAME = os.getenv("DB_NAME")

# Connect to PostgreSQL
engine = create_engine(
    f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
)
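
The connection string above interpolates the credentials directly, so a missing variable becomes the literal text `None` and a password containing `@`, `:` or `/` can break the URL. A minimal sketch of one way to guard against both, assuming the same five environment variables; the helper name `build_db_url` is hypothetical and not part of the original notebook:

```python
import os
from urllib.parse import quote_plus

# Same variable names as in the cell above
REQUIRED_VARS = ("DB_USER", "DB_PASS", "DB_HOST", "DB_PORT", "DB_NAME")


def build_db_url(env=os.environ):
    """Hypothetical helper: return a PostgreSQL URL, failing early on missing settings."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    # Percent-encode user and password so '@', ':' or '/' cannot corrupt the URL
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASS"])
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )
```

With this in place, the engine could be created as `create_engine(build_db_url())` after `load_dotenv()` has run.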


In [None]:
# Validation queries
queries = {
    "Total rows": "SELECT COUNT(*) FROM supermarket_sales;",
    "Average total sales": "SELECT AVG(total) FROM supermarket_sales;",
    "Null values in total": "SELECT COUNT(*) FROM supermarket_sales WHERE total IS NULL;",
    "Null values in product line": "SELECT COUNT(*) FROM supermarket_sales WHERE product_line IS NULL;",
    "Null values in city": "SELECT COUNT(*) FROM supermarket_sales WHERE city IS NULL;",
    "Null values in customer type": "SELECT COUNT(*) FROM supermarket_sales WHERE customer_type IS NULL;",
    "Unique products sold": "SELECT COUNT(DISTINCT product_line) FROM supermarket_sales;",
}

# Execute and display results
for description, query in queries.items():
    df = pd.read_sql(query, engine)
    print(f"\n{description}:\n", df)


Total rows:
    count
0   1000

Average total sales:
           avg
0  322.966749

Null values in total:
    count
0      0

Null values in product line:
    count
0      0

Null values in city:
    count
0      0

Null values in customer type:
    count
0      0

Unique products sold:
    count
0      6
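
The four null checks above each scan the table separately. As a possible alternative, the sketch below builds a single query that reports all null counts in one pass, relying on the fact that `COUNT(col)` skips NULLs while `COUNT(*)` does not. The helper name `null_count_query` is hypothetical; table and column names are interpolated as-is, so this is only suitable for trusted, hard-coded identifiers like the ones used in this notebook.

```python
def null_count_query(table, columns):
    """Hypothetical helper: build one SELECT reporting the NULL count per column."""
    # COUNT(*) counts every row, COUNT(col) ignores NULLs,
    # so their difference is the number of NULLs in that column.
    parts = [f"COUNT(*) - COUNT({col}) AS {col}_nulls" for col in columns]
    return f"SELECT {', '.join(parts)} FROM {table};"


# Columns checked in the validation queries above
KEY_COLUMNS = ["total", "product_line", "city", "customer_type"]
query = null_count_query("supermarket_sales", KEY_COLUMNS)
print(query)
```

The resulting string can be passed to `pd.read_sql(query, engine)` like the other queries, returning one row with one column per check.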


## Validation Checklist

- [x] The total number of rows matches the CSV (expected 1000, found 1000).
- [x] The average of `total` (322.97) looks consistent with expectations.
- [x] No null values in the key columns checked (`total`, `product_line`, `city`, `customer_type`).
- [x] The number of unique product lines (6) matches the dataset.
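
The checklist can also be expressed as automated comparisons rather than a visual review. The following is a minimal sketch, not the notebook's actual method: `run_checks` and the dictionary keys are hypothetical names, the measured values are copied from the output above, and the expected values are assumed to come from the CSV.

```python
import math


def run_checks(measured, expected, rel_tol=1e-6):
    """Hypothetical helper: compare measured values with expected ones.

    Returns a list of failure messages; an empty list means every check passed.
    """
    failures = []
    for name, want in expected.items():
        got = measured.get(name)
        if got is None:
            failures.append(f"{name}: no measured value")
        elif isinstance(want, float):
            # Floats (such as averages) are compared with a tolerance
            if not math.isclose(got, want, rel_tol=rel_tol):
                failures.append(f"{name}: expected {want}, got {got}")
        elif got != want:
            failures.append(f"{name}: expected {want}, got {got}")
    return failures


# Values copied from the query output above
measured = {"rows": 1000, "avg_total": 322.966749, "nulls": 0, "product_lines": 6}
# Assumed expectations; in practice these would be computed from the CSV
expected = {"rows": 1000, "avg_total": 322.966749, "nulls": 0, "product_lines": 6}

failures = run_checks(measured, expected)
print(failures or "All checks passed")
```

In the notebook, each measured value could be read from a query result with something like `pd.read_sql(query, engine).iloc[0, 0]`, converting to `float` where needed.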


## Observations

Insights based on the queries above:

* 1) Row count matches the CSV file.
* 2) Average total sales matches the CSV file.
* 3) No nulls found in the checked columns; the data looks clean.
* 4) The number of product lines matches the CSV file.
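
The observations compare database results against the CSV file. One way to obtain the CSV-side figures is sketched below using only the standard library. The function name `csv_summary` is hypothetical, the header names `Total` and `Product line` are assumptions about the file (adjust them to the actual headers), and the three-row sample is invented purely to make the example runnable.

```python
import csv
import io
from statistics import mean


def csv_summary(fileobj, total_col="Total", product_col="Product line"):
    """Hypothetical helper: row count, mean total and distinct product lines of a CSV."""
    rows = list(csv.DictReader(fileobj))
    return {
        "rows": len(rows),
        "avg_total": mean(float(r[total_col]) for r in rows),
        "product_lines": len({r[product_col] for r in rows}),
    }


# Invented sample data standing in for supermarket_sales.csv
sample = io.StringIO("Product line,Total\nFood,10.0\nFood,20.0\nSports,30.0\n")
summary = csv_summary(sample)
print(summary)
```

For the real file, the same function could be called inside `with open("supermarket_sales.csv", newline="") as f:` and its result compared with the query output.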


### Git Tracking

Steps followed to commit this notebook:

```bash
git checkout -b validation-checks
git add 03_validate_and_git.ipynb
git commit -m "Added data validation notebook for supermarket sales"
git push origin validation-checks
```
