# Project 2, Part 5, Cleansing customer data

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Jack Galvin

Year: 2022

Semester: Spring

Section: 9


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [1]:
import pandas as pd
import numpy as np
import math
import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [2]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [3]:
cursor = connection.cursor()

In [4]:
# Function to run a select query and return rows in a pandas dataframe
# Pandas puts all numeric values from postgres to float
# If it will fit in an integer, change it to integer


def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df

# 2.5.1 Validate the city, state, and zip for stage_1_peak_customers against the zip_codes table

AGM does not want to give its customer list to 3rd party sales channels, including Peak Delivery.  For that reason, we can expect some variation in customer first and last names, and in the street.  However, the city, state, and zip should be validated by Peak's system, so we do not anticipate any issues.

Write a query that demonstrates that the city, state, and zip are valid for all records.  As we did in 2.4, it's usually best to write a query that returns errors.  In our case the query should not return anything.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [98]:
# Validate city, state, and zip in stage_1_peak_customers

rollback_before_flag = True
rollback_after_flag = True

query = """

select *
from stage_1_peak_customers
where (city::text, state::text, zip) not in (
    select city, state, zip
    from zip_codes)
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,stage_id,sale_id,customer_id,first_name,last_name,street,city,state,zip
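An empty result from a `not in` check is only convincing if the subquery cannot yield NULLs, because a `not in` predicate can silently drop rows when it does. A minimal sketch of the same validation written with `not exists` follows. It uses an in-memory SQLite database as a stand-in for Postgres, and the table `staged` and all rows are hypothetical toy data, not the project tables.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table zip_codes (city text, state text, zip text);
    create table staged (stage_id integer, city text, state text, zip text);

    -- One reference row has a NULL city to mimic dirty reference data.
    insert into zip_codes values ('Berkeley', 'CA', '94702'), (NULL, 'CA', '94563');
    insert into staged values (1, 'Berkeley', 'CA', '94702'), (2, 'Orinda', 'CA', '94563');
""")

# Error-returning query: a staged row is reported when no zip_codes row
# matches it on all three columns.
query = """
select s.*
from staged s
where not exists (
    select 1
    from zip_codes z
    where z.city = s.city
      and z.state = s.state
      and z.zip = s.zip)
order by s.stage_id
"""

invalid_rows = conn.execute(query).fetchall()
print(invalid_rows)  # [(2, 'Orinda', 'CA', '94563')]
```

Row 2 is reported because no reference row matches it on all three columns. The NULL city in the reference data does not hide it.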


# 2.5.2 Find all customer records in stage_1_peak_customers where any of first_name, last_name, and/or street do not match a customer in the customers table

AGM does not want to give its customer list to 3rd party sales channels, including Peak Delivery.  For that reason, we can expect some variation in customer first and last names, and in the street.  

Write a query that returns all customer records in stage_1_peak_customers where any of the first_name, last_name, and/or street do not match a customer in the customers table.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [95]:
# Find first names, last names, and streets that don't match

rollback_before_flag = True
rollback_after_flag = True

query = """

select *
from stage_1_peak_customers
where (first_name::text, last_name::text, street::text) not in (
    select first_name, last_name, street
    from customers)
order by stage_id;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,stage_id,sale_id,customer_id,first_name,last_name,street,city,state,zip
0,10,5763728768,3729016,Hyrum,Knuckles,86668 Spenser Terrace,Oakland,CA,94618
1,20,5763728877,3728936,Roseann,Coyish,11707 American Ash Ter,Orinda,CA,94563
2,24,5763728428,3729287,Hali,Ducker,8 Orion Pass,El Cerrito,CA,94530
3,26,5763728393,3728674,Melantha,Golborn,6140 North Field Alley,Orinda,CA,94563
4,36,5763729212,3729191,Eleni,Jansen,66 Bartelt Hill,Oakland,CA,94607
5,40,5763729129,3728856,Clyve,Humonds,22 Brent Wood Hill,Berkeley,CA,94709
6,51,5763728864,3729178,Rutledge,Hellwing,606 Gulf Plz,El Cerrito,CA,94530
7,60,5763729313,3728402,Kalli,Kemel,18373 Golf View Pass,Berkeley,CA,94702
8,72,5763728980,3729213,Honina,Philson,28 Clarendon Plaza,Berkeley,CA,94702
9,73,5763728921,3729194,Nicky,Haley,88424 Warrior Lane,Oakland,CA,94602
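The query above treats a staged record as matched when its (first_name, last_name, street) combination appears anywhere in customers. A stricter reading compares each staged record to the customer with the same customer_id. The sketch below illustrates that alternative. It assumes the customers table has a customer_id column that lines up with the staged customer_id, which this notebook does not confirm, and it uses in-memory SQLite with hypothetical toy rows.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (
        customer_id integer, first_name text, last_name text, street text);
    create table staged (
        stage_id integer, customer_id integer,
        first_name text, last_name text, street text);

    insert into customers values
        (1, 'Ann', 'Lee', '1 Main St'),
        (2, 'Bob', 'Ray', '2 Oak Ave');

    insert into staged values
        (10, 1, 'Ann', 'Lee', '1 Main St'),  -- exact match for customer 1
        (11, 2, 'Ann', 'Lee', '1 Main St'),  -- customer 1's details under id 2
        (12, 2, 'Rob', 'Ray', '2 Oak Ave');  -- first name differs for id 2
""")

# Compare each staged row to the customer with the same customer_id
# (assumed key) and report rows where any field differs or no customer exists.
query = """
select s.stage_id
from staged s
left join customers c
  on c.customer_id = s.customer_id
where c.customer_id is null
   or c.first_name <> s.first_name
   or c.last_name <> s.last_name
   or c.street <> s.street
order by s.stage_id
"""

mismatched_ids = [row[0] for row in conn.execute(query)]
print(mismatched_ids)  # [11, 12]
```

Stage row 11 would pass the tuple-based `not in` check, because its details exist in customers under a different id, but the id-based comparison flags it.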


# 2.5.3 Find the percentage of Peak's customer records that do not match to AGM's customers table

Write a query to find the percentage of Peak's customer records that do not match AGM's.  The percentage can be found by taking the number of customer records in stage_1_peak_customers that do not match, dividing by the total number of customer records in stage_1_peak_customers, and multiplying by 100.

Show the total number of Peak customer records in stage_1_peak_customers, the number that match to customers, the number that do not match to customers, and the percentage that do not match.

Show the percentage rounded to the nearest tenth.  It is not necessary to include a percent sign.

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [112]:
# Find the percentage of customer records from peak that do not match AGM

rollback_before_flag = True
rollback_after_flag = True

query = """

with a as (select count(*) as total_customers 
from stage_1_peak_customers),
b as (select count(*) as total_unmatched
from stage_1_peak_customers
where (first_name::text, last_name::text, street::text) not in (
    select first_name, last_name, street
    from customers))
select a.total_customers, (a.total_customers - b.total_unmatched) as total_matched, b.total_unmatched,
round((b.total_unmatched / a.total_customers::numeric * 100), 1) as pct_unmatched
from a,b;

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,total_customers,total_matched,total_unmatched,pct_unmatched
0,97,84,13,13.4
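The figures in this result can be cross-checked with plain arithmetic. The short sketch below recomputes the matched count and the percentage from the two counts shown above. It also shows why the query casts to `numeric`: dividing two integer counts in Postgres truncates, which Python's `//` operator mimics here.

```python
# Counts taken from the query result above.
total_customers = 97
total_unmatched = 13

total_matched = total_customers - total_unmatched

# Integer division truncates to zero, so the percentage would be lost
# without a cast to a non-integer type.
truncated = total_unmatched // total_customers * 100

# True division followed by rounding to the nearest tenth.
pct_unmatched = round(total_unmatched / total_customers * 100, 1)

print(total_matched, truncated, pct_unmatched)  # 84 0 13.4
```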


# 2.5.4 Executive summary on customer data

Write an executive summary on the customer data.  

The summary should be the equivalent of 3/4 to 1 page using standard fonts, spacing, and margins. 

As stated in the scenario, like most companies, AGM does not want to give out its customer list to 3rd party sales channels.  The downside is, as we have seen, that customer first names, last names, and street addresses will have some variations and not be exact matches.

The executives would like your recommendation of one of the following:
* Continue to withhold the customer data from 3rd party sales channels
* Give customer data to 3rd party sales channels

Recommend exactly one of these.

Support your recommendation with an explanation based on what you have seen from this preliminary data load.

You are not required to write any queries or create any data visualizations.  However, you may want to include some to enhance your submission.  Submissions with these tend to be higher quality, although not always.

You may use any number of code cells and/or markdown cells. 

You may alternate between code cells and markdown cells.  That is perfectly fine.  It is understood that before we present it, an editor would pull out the text, results of queries, and data visualizations.