# Project 2, Part 4, Validate data in the staging tables using SQL

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Jack Galvin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import pandas as pd
import numpy as np
import math
import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [2]:
connection = psycopg2.connect(
    user="postgres",
    password="ucb",
    host="postgres",
    port="5432",
    database="postgres"
)

In [3]:
cursor = connection.cursor()

In [4]:
# Function to run a select query and return rows in a pandas dataframe
# Pandas converts numeric values from postgres to float
# If a float column holds only whole numbers, change it to a nullable integer


def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    """Run a select query and return the rows in a pandas dataframe."""

    if rollback_before_flag:
        connection.rollback()

    df = pd.read_sql_query(query, connection)

    if rollback_after_flag:
        connection.rollback()

    # fix the float columns that really should be integers

    for column in df:

        if df[column].dtype == "float64":

            # ignore NaN (SQL NULL) when checking for fractional parts
            non_null = df[column].dropna()

            if (non_null == np.floor(non_null)).all():
                df[column] = df[column].astype('Int64')

    return df

# 2.4.1 Validate the data types in the staging table stage_1_peak_sales

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric and date columns.

* sale_id - validate that it is numeric
* sale_date - validate that it is a date
* sub_total - validate that it is numeric
* tax - validate that it is numeric
* total_amount - validate that it is numeric

Hint: make use of the operators: 
* xxxx::numeric
* xxxx::date

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [12]:
# Validate data types in stage_1_peak_sales

rollback_before_flag = True
rollback_after_flag = True

query = """

select sale_id::numeric,
        sale_date::date,
        sub_total::numeric,
        tax::numeric,
        total_amount::numeric
from stage_1_peak_sales
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,sale_id,sale_date,sub_total,tax,total_amount
0,5763728874,2020-10-03,12,0,12
1,5763729036,2020-10-03,72,0,72
2,5763728904,2020-10-03,24,0,24
3,5763728973,2020-10-03,96,0,96
4,5763728757,2020-10-03,108,0,108
...,...,...,...,...,...
92,5763728927,2020-10-03,72,0,72
93,5763729096,2020-10-03,48,0,48
94,5763729268,2020-10-03,84,0,84
95,5763729237,2020-10-03,60,0,60
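A cast such as `xxxx::numeric` makes Postgres raise an error on the first value it cannot convert, so the query above running to completion is itself the validation. When a cast does fail, it can help to see which rows are responsible. The sketch below is a client-side illustration in plain Python and is not part of the lab code: the helper names `is_numeric` and `is_iso_date`, the sample rows, and the strict ISO date format are all assumptions, and Postgres accepts more date formats than this check does.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def is_numeric(text):
    """Return True when text parses as a finite decimal number."""
    try:
        return Decimal(text.strip()).is_finite()
    except InvalidOperation:
        return False


def is_iso_date(text):
    """Return True when text parses as an ISO date such as 2020-10-03."""
    try:
        date.fromisoformat(text.strip())
        return True
    except ValueError:
        return False


# Hypothetical staging rows: every column arrives as text
rows = [
    {"stage_id": 1, "sale_id": "1001", "sale_date": "2020-10-03", "total_amount": "12"},
    {"stage_id": 2, "sale_id": "10x2", "sale_date": "2020-10-03", "total_amount": "72"},
    {"stage_id": 3, "sale_id": "1003", "sale_date": "2020-13-03", "total_amount": "24"},
]

# Which check applies to which column
checks = {"sale_id": is_numeric, "sale_date": is_iso_date, "total_amount": is_numeric}

# Collect (stage_id, column) for every value that fails its check
bad_cells = [
    (row["stage_id"], column)
    for row in rows
    for column, check in checks.items()
    if not check(row[column])
]

print(bad_cells)
```

With these made-up rows, the list names stage_id 2 (bad `sale_id`) and stage_id 3 (bad `sale_date`), which is the information a failed cast alone does not give.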


# 2.4.2 Validate the data types in the staging table stage_1_peak_stores

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* location_id - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [14]:
# Validate data types in stage_1_peak_stores

rollback_before_flag = True
rollback_after_flag = True

query = """

select sale_id::numeric,
        location_id::numeric
from stage_1_peak_stores
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,sale_id,location_id
0,5763728874,12573
1,5763729036,12573
2,5763728904,12573
3,5763728973,12573
4,5763728757,12573
...,...,...
92,5763728927,12573
93,5763729096,12573
94,5763729268,12573
95,5763729237,12573


# 2.4.3 Validate the data types in the staging table stage_1_peak_customers

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* customer_id - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [15]:
# Validate data types in stage_1_peak_customers

rollback_before_flag = True
rollback_after_flag = True

query = """

select sale_id::numeric,
        customer_id::numeric
from stage_1_peak_customers
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,sale_id,customer_id
0,5763728874,3728404
1,5763729036,3729309
2,5763728904,3728508
3,5763728973,3728534
4,5763728757,3729188
...,...,...
92,5763728927,3728568
93,5763729096,3728990
94,5763729268,3728901
95,5763729237,3729019


# 2.4.4 Validate the data types in the staging table stage_1_peak_line_items

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* line_item_id - validate that it is numeric
* product_id - validate that it is numeric
* price - validate that it is numeric
* quantity - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [18]:
# Validate data types in stage_1_peak_line_items

rollback_before_flag = True
rollback_after_flag = True

query = """

select sale_id::numeric,
        line_item_id::numeric,
        product_id::numeric,
        price::numeric,
        quantity::numeric
from stage_1_peak_line_items
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,sale_id,line_item_id,product_id,price,quantity
0,5763728874,1,42314780,12,1
1,5763729036,1,42314677,12,1
2,5763729036,2,42314782,12,3
3,5763729036,3,42314784,12,2
4,5763728904,1,42314780,12,1
...,...,...,...,...,...
347,5763729237,2,42314678,12,2
348,5763729237,3,42314782,12,2
349,5763728673,1,42314677,12,2
350,5763728673,2,42314678,12,1


# 2.4.5 Validate the math on sub_total, tax, and total_amount in stage_1_peak_sales

Generally, we do not expect any issues with the math.  Write a simple query that validates the math:

total_amount = sub_total + tax

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [34]:
# Validate that total_amount = sub_total + tax in stage_1_peak_sales

rollback_before_flag = True
rollback_after_flag = True

query = """

select sub_total,
        tax,
        total_amount
from stage_1_peak_sales
where total_amount::numeric <> (sub_total::numeric + tax::numeric)
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,sub_total,tax,total_amount
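An empty result is only reassuring if the query would return something when the data is wrong. The sketch below is a minimal way to check that, assuming an in-memory SQLite database in place of Postgres: the table `demo_sales` and its rows are hypothetical, and `cast(... as numeric)` stands in for the Postgres `::numeric` operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table: all value columns are text, as in a staging load
conn.execute(
    "create table demo_sales "
    "(stage_id integer, sub_total text, tax text, total_amount text)"
)
conn.executemany(
    "insert into demo_sales values (?, ?, ?, ?)",
    [
        (1, "12", "0", "12"),
        (2, "72", "0", "72"),
        (3, "24", "1", "24"),  # deliberately wrong: 24 + 1 <> 24
    ],
)

# Same shape as the validation query: return only the rows that break the rule
bad_rows = conn.execute(
    """
    select stage_id, sub_total, tax, total_amount
    from demo_sales
    where cast(total_amount as numeric)
          <> cast(sub_total as numeric) + cast(tax as numeric)
    order by stage_id
    """
).fetchall()

print(bad_rows)  # [(3, '24', '1', '24')]
conn.close()
```

Only the deliberately broken row comes back, which confirms that the "return rows with errors" pattern flags a mismatch instead of passing silently.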


# 2.4.6 Validate the math between stage_1_peak_sales and stage_1_peak_line_items

Generally, we do not expect any issues with the math.  Write a simple query that validates the math:

total_amount in stage_1_peak_sales matches the sum of (price x quantity) in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [90]:
# Validate that total_amount in stage_1_peak_sales = sum of (price * quantity) in stage_1_peak_line_items

rollback_before_flag = True
rollback_after_flag = True

query = """

select sa.total_amount,
        sum(l.price::numeric * l.quantity::numeric) as line_items_total
from stage_1_peak_sales as sa
    join stage_1_peak_line_items as l
        on sa.sale_id = l.sale_id
group by sa.stage_id, sa.sale_id, sa.total_amount
having sa.total_amount::numeric <> sum(l.price::numeric * l.quantity::numeric)
order by sa.stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,total_amount,line_items_total
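To see what a failing row looks like for this kind of cross-table check, the sketch below sums price times quantity per sale and compares it with the sale total. It is an illustration under stated assumptions, not the lab solution: it uses an in-memory SQLite database, hypothetical tables `demo_sales` and `demo_line_items` with made-up rows, and `cast(... as numeric)` in place of the Postgres `::numeric` operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging tables with text value columns
conn.execute(
    "create table demo_sales (stage_id integer, sale_id text, total_amount text)"
)
conn.execute(
    "create table demo_line_items (sale_id text, price text, quantity text)"
)
conn.executemany(
    "insert into demo_sales values (?, ?, ?)",
    [(1, "A", "36"), (2, "B", "24")],
)
conn.executemany(
    "insert into demo_line_items values (?, ?, ?)",
    [
        ("A", "12", "1"),
        ("A", "12", "2"),  # sale A: 12*1 + 12*2 = 36, matches
        ("B", "12", "1"),  # sale B: 12*1 = 12, does not match 24
    ],
)

# One group per sale; keep only sales whose total disagrees with the line items
mismatches = conn.execute(
    """
    select sa.stage_id,
           sa.sale_id,
           sa.total_amount,
           sum(cast(l.price as numeric) * cast(l.quantity as numeric)) as line_items_total
    from demo_sales as sa
        join demo_line_items as l
            on sa.sale_id = l.sale_id
    group by sa.stage_id, sa.sale_id, sa.total_amount
    having cast(sa.total_amount as numeric)
           <> sum(cast(l.price as numeric) * cast(l.quantity as numeric))
    order by sa.stage_id
    """
).fetchall()

print(mismatches)  # [(2, 'B', '24', 12)]
conn.close()
```

Multiplying inside the `sum` keeps the check valid even if line items within one sale carry different prices. Because it uses an inner join, a sale with no line items at all would not appear here and would need a separate check.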


# 2.4.7 Validate the tax is always zero in stage_1_peak_sales

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [94]:
# Validate that tax = 0 in stage_1_peak_sales

rollback_before_flag = True
rollback_after_flag = True

query = """

select tax
from stage_1_peak_sales
where tax::numeric <> 0
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,tax


# 2.4.8 Validate the price is always 12 in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [95]:
# Validate that price = 12 in stage_1_peak_line_items

rollback_before_flag = True
rollback_after_flag = True

query = """

select price
from stage_1_peak_line_items
where price::numeric <> 12
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,price


# 2.4.9 Validate taxable is always N in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [172]:
# Validate that taxable = N in stage_1_peak_line_items

rollback_before_flag = True
rollback_after_flag = True

query = """

select taxable
from stage_1_peak_line_items
where taxable::text <> 'N'
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,taxable


# 2.4.10 Validate the store is the same for all in stage_1_peak_stores

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [188]:
# Validate that store is same for all in stage_1_peak_stores

rollback_before_flag = True
rollback_after_flag = True

query = """

select location_id
from stage_1_peak_stores
where location_id::numeric <> 12573
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,location_id
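The query above compares against a store id that was read off the earlier output. A complementary check that needs no hard-coded value is to count rows per distinct store: a single group means every row has the same store. The sketch below assumes an in-memory SQLite database and a hypothetical table `demo_stores` with made-up rows; it is an illustration, not part of the lab code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table with a text location_id column
conn.execute("create table demo_stores (stage_id integer, location_id text)")
conn.executemany(
    "insert into demo_stores values (?, ?)",
    [(1, "500"), (2, "500"), (3, "501")],  # the last row is the odd one out
)

# One output row per distinct store; more than one row means the stores differ
store_counts = conn.execute(
    """
    select location_id, count(*) as row_count
    from demo_stores
    group by location_id
    order by location_id
    """
).fetchall()

print(store_counts)  # [('500', 2), ('501', 1)]
conn.close()
```

Two groups come back for this sample, so the stores are not all the same. On clean data the result would be a single row.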


# 2.4.11 Validate the product id in stage_1_peak_line_items against peak_product_mapping

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [207]:
# Validate that every product_id in stage_1_peak_line_items exists in peak_product_mapping

rollback_before_flag = True
rollback_after_flag = True

query = """

-- left join keeps line items that have no match in the mapping table;
-- those rows come back with a null peak_product_id
select l.product_id,
        m.peak_product_id
from stage_1_peak_line_items as l
    left join peak_product_mapping as m
        on l.product_id::numeric = m.peak_product_id::numeric
where m.peak_product_id is null
order by l.stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,product_id,peak_product_id
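Checking ids against a mapping table is a search for rows with no match. An inner join on equality can never return such rows, because it keeps only the rows that do match. A common pattern is a left join filtered on a null key from the right-hand table. The sketch below illustrates that pattern under stated assumptions: an in-memory SQLite database, hypothetical tables `demo_line_items` and `demo_product_mapping` with made-up ids, and `cast(... as integer)` in place of the Postgres cast operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging rows (text ids) and a hypothetical mapping table
conn.execute("create table demo_line_items (stage_id integer, product_id text)")
conn.execute("create table demo_product_mapping (peak_product_id integer)")
conn.executemany(
    "insert into demo_line_items values (?, ?)",
    [(1, "100"), (2, "101"), (3, "999")],  # 999 has no mapping entry
)
conn.executemany(
    "insert into demo_product_mapping values (?)",
    [(100,), (101,)],
)

# Left join keeps every line item; unmatched ones have a null mapping key
unmapped = conn.execute(
    """
    select l.stage_id, l.product_id
    from demo_line_items as l
        left join demo_product_mapping as m
            on cast(l.product_id as integer) = m.peak_product_id
    where m.peak_product_id is null
    order by l.stage_id
    """
).fetchall()

print(unmapped)  # [(3, '999')]
conn.close()
```

Only the line item whose product id is missing from the mapping table is returned, so an empty result on real data means every product id has a mapping.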
