# Data Validation in PostgreSQL

This notebook validates that the data loaded from `supermarket_sales.csv` was correctly inserted into the PostgreSQL database.

We will use SQL queries via `sqlalchemy` and `pandas` to verify data integrity.



In [32]:
# Import libraries

import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
import pandas as pd

In [33]:
# Load environment variables
load_dotenv()

# Retrieve credentials securely
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_HOST = os.getenv("DB_HOST")
DB_PORT = os.getenv("DB_PORT")
DB_NAME = os.getenv("DB_NAME")

# Connect to PostgreSQL


engine = create_engine(
    f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
)
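
The f-string above produces a broken URL if a credential contains `@`, `:` or `/`, and an unset variable is silently rendered as the text `None`. A minimal sketch of a safer way to assemble the URL, assuming the same five environment variables; `build_postgres_url` is a hypothetical helper name and the sample credentials are made up for illustration:

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, name):
    """Return a postgresql+psycopg2 URL with user and password percent-encoded."""
    settings = {
        "DB_USER": user,
        "DB_PASS": password,
        "DB_HOST": host,
        "DB_PORT": port,
        "DB_NAME": name,
    }
    # Fail early instead of embedding the string "None" in the URL.
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    # safe='' also encodes '/', which would otherwise end the host section.
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{name}"
    )


# Made-up credentials, used only to show the escaping.
example_url = build_postgres_url("analyst", "p@ss:word/1", "localhost", "5432", "sales")
print(example_url)
```

The returned string can be passed to `create_engine` in place of the inline f-string.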


In [34]:
# Validation queries
queries = {
    "Total rows": "SELECT COUNT(*) FROM supermarket_sales;",
    "Average total sales": "SELECT AVG(total) FROM supermarket_sales;",
    "Null values in total": "SELECT COUNT(*) FROM supermarket_sales WHERE total IS NULL;",
    "Null values in product line": "SELECT COUNT(*) FROM supermarket_sales WHERE product_line IS NULL;",
    "Null values in city": "SELECT COUNT(*) FROM supermarket_sales WHERE city IS NULL;",
    "Null values in customer type": "SELECT COUNT(*) FROM supermarket_sales WHERE customer_type IS NULL;",
    "Unique products sold": "SELECT COUNT(DISTINCT product_line) FROM supermarket_sales;",
}

# Execute and display results
for description, query in queries.items():
    df = pd.read_sql(query, engine)
    print(f"\n{description}:\n", df)


Total rows:
    count
0   1000

Average total sales:
           avg
0  322.966749

Null values in total:
    count
0      0

Null values in product line:
    count
0      0

Null values in city:
    count
0      0

Null values in customer type:
    count
0      0

Unique products sold:
    count
0      6
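
The printed results above have to be checked by eye. One way to make the same checks fail loudly is to pair each query with an expected value. The sketch below is not the notebook's method: it uses an in-memory SQLite database from the standard library as a stand-in for PostgreSQL so that it runs anywhere, loads only three rows copied from the table, and the `run_checks` helper is hypothetical.

```python
import sqlite3

# Stand-in database: SQLite instead of PostgreSQL, three columns, three sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supermarket_sales (product_line TEXT, city TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO supermarket_sales VALUES (?, ?, ?)",
    [
        ("Health and beauty", "Yangon", 548.9715),
        ("Electronic accessories", "Naypyitaw", 80.22),
        ("Home and lifestyle", "Yangon", 340.5255),
    ],
)

# description -> (query, expected value). The expected values fit the three
# sample rows; against the real table they would be 1000, 0 and 6.
checks = {
    "Total rows": ("SELECT COUNT(*) FROM supermarket_sales", 3),
    "Null values in total": (
        "SELECT COUNT(*) FROM supermarket_sales WHERE total IS NULL",
        0,
    ),
    "Unique products sold": (
        "SELECT COUNT(DISTINCT product_line) FROM supermarket_sales",
        3,
    ),
}


def run_checks(connection, checks):
    """Run each query and return a message for every mismatch."""
    failures = []
    for description, (query, expected) in checks.items():
        actual = connection.execute(query).fetchone()[0]
        if actual != expected:
            failures.append(f"{description}: expected {expected}, got {actual}")
    return failures


failures = run_checks(conn, checks)
print(failures or "all checks passed")
```

With `pandas`, the same scalar can be read from the one-cell result via `df.iloc[0, 0]`.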


In [35]:
df = pd.read_sql("SELECT * FROM supermarket_sales", engine)

In [36]:
df.describe()

| | unit_price | quantity | tax_5 | total | date | cogs | gross_margin_percentage | gross_income | rating |
|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 55.67213 | 5.51 | 15.379369 | 322.966749 | 2019-02-14 00:05:45.600000 | 307.58738 | 4.761905 | 15.379369 | 6.9727 |
| min | 10.08 | 1.0 | 0.5085 | 10.6785 | 2019-01-01 00:00:00 | 10.17 | 4.761905 | 0.5085 | 4.0 |
| 25% | 32.875 | 3.0 | 5.924875 | 124.422375 | 2019-01-24 00:00:00 | 118.4975 | 4.761905 | 5.924875 | 5.5 |
| 50% | 55.23 | 5.0 | 12.088 | 253.848 | 2019-02-13 00:00:00 | 241.76 | 4.761905 | 12.088 | 7.0 |
| 75% | 77.935 | 8.0 | 22.44525 | 471.35025 | 2019-03-08 00:00:00 | 448.905 | 4.761905 | 22.44525 | 8.5 |
| max | 99.96 | 10.0 | 49.65 | 1042.65 | 2019-03-30 00:00:00 | 993.0 | 4.761905 | 49.65 | 10.0 |
| std | 26.494628 | 2.923431 | 11.708825 | 245.885335 | | 234.17651 | 0.0 | 11.708825 | 1.71858 |


In [37]:
df.notnull().sum()

invoice_id                 1000
branch                     1000
city                       1000
customer_type              1000
gender                     1000
product_line               1000
unit_price                 1000
quantity                   1000
tax_5                      1000
total                      1000
date                       1000
time                       1000
payment                    1000
cogs                       1000
gross_margin_percentage    1000
gross_income               1000
rating                     1000
dtype: int64

In [38]:
df.head()

| | invoice_id | branch | city | customer_type | gender | product_line | unit_price | quantity | tax_5 | total | date | time | payment | cogs | gross_margin_percentage | gross_income | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 2019-01-05 | 13:08:00 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.82 | 80.22 | 2019-03-08 | 10:29:00 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 2019-03-03 | 13:23:00 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.288 | 489.048 | 2019-01-27 | 20:33:00 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2019-02-08 | 10:37:00 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   invoice_id               1000 non-null   object        
 1   branch                   1000 non-null   object        
 2   city                     1000 non-null   object        
 3   customer_type            1000 non-null   object        
 4   gender                   1000 non-null   object        
 5   product_line             1000 non-null   object        
 6   unit_price               1000 non-null   float64       
 7   quantity                 1000 non-null   int64         
 8   tax_5                    1000 non-null   float64       
 9   total                    1000 non-null   float64       
 10  date                     1000 non-null   datetime64[ns]
 11  time                     1000 non-null   object        
 12  payment                  1000 non-n

## Validation Checklist

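- [1000] The total row count matches the CSV (expected 1000).
- [322.97] The average of `total` is plausible: it equals the mean of `cogs` (307.59) plus the mean of `tax_5` (15.38) reported by `df.describe()`.
- [NoNull] No null values in the key columns (`total`, `city`, `product_line`, `customer_type`); `df.notnull().sum()` also reports 1000 non-null values for all 17 columns.
- [6] The table contains 6 distinct product lines, matching the dataset.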
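
The "consistent with expectations" item can be made concrete with a small arithmetic check. In the rows shown by `df.head()`, `total` equals `cogs` plus `tax_5`; that relation is inferred from the displayed values rather than stated anywhere in the notebook, so treat it as an assumption. A minimal sketch using those five rows, with `inconsistent_rows` as a hypothetical helper:

```python
import math

# (cogs, tax_5, total) copied from the five rows shown by df.head().
rows = [
    (522.83, 26.1415, 548.9715),
    (76.4, 3.82, 80.22),
    (324.31, 16.2155, 340.5255),
    (465.76, 23.288, 489.048),
    (604.17, 30.2085, 634.3785),
]


def inconsistent_rows(rows, rel_tol=1e-9):
    """Return the rows where cogs + tax_5 differs from total beyond rel_tol."""
    return [
        (cogs, tax, total)
        for cogs, tax, total in rows
        if not math.isclose(cogs + tax, total, rel_tol=rel_tol)
    ]


print(inconsistent_rows(rows))

# The same relation holds for the column means reported by df.describe().
means_agree = math.isclose(307.58738 + 15.379369, 322.966749, rel_tol=1e-9)
print(means_agree)
```

`math.isclose` is used instead of `==` because the values are floats and exact equality can fail on rounding noise.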


## Observations

Insights based on the queries above:

* 1) Row count matches the CSV file.
* 2) Average total sales matches the CSV file.
* 3) No nulls found; the data looks clean.
* 4) The number of product lines matches the CSV file.
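
The comparisons against the CSV are stated above but not computed in the notebook. A minimal sketch of how the CSV-side figures could be obtained with the standard library, so they can be compared with the SQL results. The header names `product_line` and `total`, the `summarize_csv` helper, and the three-line in-memory sample are assumptions for illustration; the real `supermarket_sales.csv` may use different headers.

```python
import csv
import io


def summarize_csv(file_obj, total_col, category_col):
    """Return row count, mean of total_col and distinct count of category_col."""
    count = 0
    total_sum = 0.0
    categories = set()
    for row in csv.DictReader(file_obj):
        count += 1
        total_sum += float(row[total_col])
        categories.add(row[category_col])
    return {
        "rows": count,
        "avg_total": total_sum / count if count else None,
        "unique_categories": len(categories),
    }


# In-memory sample standing in for the file; values taken from df.head().
sample = io.StringIO(
    "product_line,total\n"
    "Health and beauty,548.9715\n"
    "Electronic accessories,80.22\n"
    "Health and beauty,489.048\n"
)
summary = summarize_csv(sample, "total", "product_line")
print(summary)
```

For the real file, `open("supermarket_sales.csv", newline="")` would replace the `io.StringIO` sample, and the three values would be compared with the `COUNT(*)`, `AVG(total)` and `COUNT(DISTINCT product_line)` results.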


### Git Tracking

Steps followed to commit this notebook:

```bash
git checkout -b validation-checks
git add 03_validate_and_git.ipynb
git commit -m "Added data validation notebook for supermarket sales"
git push origin validation-checks
```
