------------------
# Extraction

In [1]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('customer_etl').getOrCreate()

Let's load cdw_sapp_custmer.json

In [2]:
cust_df = spark.read.json("source_data/cdw_sapp_custmer.json")

--------------------------------
# Exploratory Analysis

In [3]:
cust_df.printSchema()
cust_df.show(10)

root
 |-- APT_NO: string (nullable = true)
 |-- CREDIT_CARD_NO: string (nullable = true)
 |-- CUST_CITY: string (nullable = true)
 |-- CUST_COUNTRY: string (nullable = true)
 |-- CUST_EMAIL: string (nullable = true)
 |-- CUST_PHONE: long (nullable = true)
 |-- CUST_STATE: string (nullable = true)
 |-- CUST_ZIP: string (nullable = true)
 |-- FIRST_NAME: string (nullable = true)
 |-- LAST_NAME: string (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)
 |-- MIDDLE_NAME: string (nullable = true)
 |-- SSN: long (nullable = true)
 |-- STREET_NAME: string (nullable = true)

+------+----------------+------------+-------------+--------------------+----------+----------+--------+----------+---------+--------------------+-----------+---------+-----------------+
|APT_NO|  CREDIT_CARD_NO|   CUST_CITY| CUST_COUNTRY|          CUST_EMAIL|CUST_PHONE|CUST_STATE|CUST_ZIP|FIRST_NAME|LAST_NAME|        LAST_UPDATED|MIDDLE_NAME|      SSN|      STREET_NAME|
+------+----------------+------------+---

How many rows do we have in total in this dataframe?

In [4]:
cust_df.count()

952

Let's rearrange the columns so they make a bit more sense when looking at them.

In [5]:
cust_df = cust_df.select('CREDIT_CARD_NO', 'SSN', 'CUST_EMAIL', 'CUST_PHONE',
            'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'APT_NO', 'STREET_NAME',
            'CUST_CITY', 'CUST_ZIP', 'CUST_STATE', 'CUST_COUNTRY', 'LAST_UPDATED')

cust_df.show(5)

+----------------+---------+-------------------+----------+----------+-----------+---------+------+-----------------+------------+--------+----------+-------------+--------------------+
|  CREDIT_CARD_NO|      SSN|         CUST_EMAIL|CUST_PHONE|FIRST_NAME|MIDDLE_NAME|LAST_NAME|APT_NO|      STREET_NAME|   CUST_CITY|CUST_ZIP|CUST_STATE| CUST_COUNTRY|        LAST_UPDATED|
+----------------+---------+-------------------+----------+----------+-----------+---------+------+-----------------+------------+--------+----------+-------------+--------------------+
|4210653310061055|123456100|AHooper@example.com|   1237818|      Alec|         Wm|   Hooper|   656|Main Street North|     Natchez|   39120|        MS|United States|2018-04-21T12:49:...|
|4210653310102868|123453023|EHolman@example.com|   1238933|      Etta|    Brendan|   Holman|   829|    Redwood Drive|Wethersfield|   06109|        CT|United States|2018-04-21T12:49:...|
|4210653310116272|123454487|WDunham@example.com|   1243018|    Wilber|

In [6]:
cust_df.describe().show()

+-------+--------------------+--------------------+--------------------+------------------+----------+-----------+---------+------------------+-----------+---------+------------------+----------+-------------+--------------------+
|summary|      CREDIT_CARD_NO|                 SSN|          CUST_EMAIL|        CUST_PHONE|FIRST_NAME|MIDDLE_NAME|LAST_NAME|            APT_NO|STREET_NAME|CUST_CITY|          CUST_ZIP|CUST_STATE| CUST_COUNTRY|        LAST_UPDATED|
+-------+--------------------+--------------------+--------------------+------------------+----------+-----------+---------+------------------+-----------+---------+------------------+----------+-------------+--------------------+
|  count|                 952|                 952|                 952|               952|       952|        952|      952|               952|        952|      952|               952|       952|          952|                 952|
|   mean|4.210653353718597...|1.2345552588130252E8|                null|1239

From the summary above, the credit card numbers all appear to be 16 digits long, the SSNs 9 digits long, and the zip codes 5 digits long, and every customer is from the United States. The one clear issue is that the phone numbers are only 7 digits long when they should be 10.
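
The reasoning behind reading digit lengths off a summary can be sketched in plain Python: for a numeric column, if the min and max have the same number of digits, every value in between does too. The helper name `same_digit_count` is made up for this illustration, and the values are copied from rows displayed in this notebook as stand-ins, not the actual min and max reported by `describe()`. For string columns such as CREDIT_CARD_NO the min and max are ordered as text, so a direct length check is safer there.

```python
def same_digit_count(lo: int, hi: int) -> bool:
    """Return True when every integer from lo to hi has the same number of digits."""
    return 0 <= lo <= hi and len(str(lo)) == len(str(hi))

# Stand-in values copied from rows shown in this notebook (assumption:
# they are NOT the real min/max from describe(), which is truncated here).
ssn_lo, ssn_hi = 123453023, 123456941
phone_lo, phone_hi = 1235508, 1243459

ssn_uniform = same_digit_count(ssn_lo, ssn_hi)
phone_uniform = same_digit_count(phone_lo, phone_hi)

# How many digits a phone number is short of the expected 10.
missing_digits = 10 - len(str(phone_lo))

print(ssn_uniform, len(str(ssn_lo)))
print(phone_uniform, len(str(phone_lo)), missing_digits)
```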

CUST_ZIP and CUST_STATE are string columns, so their min and max values are based on lexicographic (character-by-character) ordering rather than numeric value or length, and they cannot confirm that every value has the same length. Let's confirm that the states are all 2 characters long and the zip codes are all 5 characters long.
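
As a small illustration of why min and max on string values can mislead, here is a plain-Python sketch. The values are made up for the example (one is deliberately malformed) and are not taken from the dataset.

```python
# Made-up zip-like strings; "9021" is deliberately malformed (4 characters).
zips = ["39120", "06109", "9021", "60091"]

# String ordering compares character by character, left to right,
# so "9021" sorts after "60091" because "9" > "6".
str_min, str_max = min(zips), max(zips)

# Numeric ordering compares magnitudes instead.
num_min, num_max = min(zips, key=int), max(zips, key=int)

# A direct length check catches the malformed value however it sorts.
bad_lengths = [z for z in zips if len(z) != 5]

print(str_min, str_max)
print(num_min, num_max)
print(bad_lengths)
```

Grouping by length, as the next cell does, is the Spark counterpart of the `bad_lengths` check.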

In [7]:
cust_df.withColumn("zip_len", F.length(cust_df["CUST_ZIP"]))\
    .groupBy("zip_len").count().show()

cust_df.withColumn("state_len", F.length(cust_df["CUST_STATE"]))\
    .groupBy("state_len").count().show()

+-------+-----+
|zip_len|count|
+-------+-----+
|      5|  952|
+-------+-----+

+---------+-----+
|state_len|count|
+---------+-----+
|        2|  952|
+---------+-----+



Let's make sure the emails are all well formed: each should contain an "@" followed later by a "." (for example, ".com") and no blank spaces.

In [8]:
email_filter_df = cust_df.filter(~F.col('CUST_EMAIL').rlike(r'^\S+@\S+\.\S+$'))
email_filter_df.select('CUST_EMAIL').show()

# ^\S+@\S+\.\S+$ is a loose check that the string looks like an email address.
# "^\S+" requires one or more non-whitespace characters at the start of the string,
# since an email address can't contain blank spaces. The "^" is necessary because
# rlike matches anywhere in the string: without it, "ex ample@example.com" would
# match on "ample@example.com". Next comes "@", then one or more non-whitespace
# characters, then a literal "." ("\." is needed because a bare "." in regex matches
# any single character). "\S+$" then requires one or more non-whitespace characters
# at the end of the string. The "$" is necessary for the same reason as the "^".

+----------+
|CUST_EMAIL|
+----------+
+----------+
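
The effect of the anchors can be checked outside Spark with Python's `re` module. This is only a sketch: the sample strings and the unanchored variant are made up for comparison, and `re.search` is used because it looks for a match anywhere in the string, which is how `rlike` treats the pattern.

```python
import re

# Same pattern as the Spark filter above, written as a raw string.
ANCHORED = re.compile(r'^\S+@\S+\.\S+$')
# Hypothetical variant without anchors, for comparison only.
UNANCHORED = re.compile(r'\S+@\S+\.\S+')

# Made-up sample strings; only the first is a well-formed address.
samples = [
    "AHooper@example.com",
    "ex ample@example.com",      # space inside the local part
    "no-at-sign.example.com",    # missing "@"
    "trailing@example.com ",     # trailing space
]

anchored_ok = [s for s in samples if ANCHORED.search(s)]
unanchored_ok = [s for s in samples if UNANCHORED.search(s)]

print(anchored_ok)
print(unanchored_ok)
```

Without the anchors, the strings containing a space still pass because a valid-looking substring is found. The anchored pattern is still a loose sanity check rather than full validation: a string such as "a@b@c.d" passes it.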



Let's make sure all the credit card numbers, SSNs, phone numbers, and emails are unique.

In [9]:
cust_df.select(F.countDistinct("CREDIT_CARD_NO")).show()
cust_df.select(F.countDistinct("SSN")).show()
cust_df.select(F.countDistinct("CUST_PHONE")).show()
cust_df.select(F.countDistinct("CUST_EMAIL")).show()

+------------------------------+
|count(DISTINCT CREDIT_CARD_NO)|
+------------------------------+
|                           952|
+------------------------------+

+-------------------+
|count(DISTINCT SSN)|
+-------------------+
|                952|
+-------------------+

+--------------------------+
|count(DISTINCT CUST_PHONE)|
+--------------------------+
|                       901|
+--------------------------+

+--------------------------+
|count(DISTINCT CUST_EMAIL)|
+--------------------------+
|                       928|
+--------------------------+



Hmmm, the phone numbers and emails aren't all unique: there are only 901 distinct phone numbers and 928 distinct emails across 952 rows. Let's explore further.

In [10]:
cust_df.groupBy('CUST_PHONE').count().orderBy(F.col('count').desc()).show()

+----------+-----+
|CUST_PHONE|count|
+----------+-----+
|   1241898|    3|
|   1236886|    3|
|   1237294|    3|
|   1243382|    2|
|   1239063|    2|
|   1240382|    2|
|   1243093|    2|
|   1243459|    2|
|   1238707|    2|
|   1242999|    2|
|   1235756|    2|
|   1242026|    2|
|   1240817|    2|
|   1235508|    2|
|   1236877|    2|
|   1240229|    2|
|   1243066|    2|
|   1236672|    2|
|   1240046|    2|
|   1242677|    2|
+----------+-----+
only showing top 20 rows
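
The gap between the row count and the distinct count is the number of rows that repeat an earlier value: 952 - 901 = 51 for phone numbers and 952 - 928 = 24 for emails. A small plain-Python sketch of the same bookkeeping, using a few phone numbers copied from the output above plus one value from the earlier preview that is assumed, for illustration only, to occur once:

```python
from collections import Counter

# Phone numbers from the grouped output, repeated as often as shown there,
# plus 1237818 (assumption: treated here as occurring only once).
phones = [1241898] * 3 + [1236886] * 3 + [1243382] * 2 + [1237818]

counts = Counter(phones)
total_rows = len(phones)
distinct_values = len(counts)

# Rows that repeat a value already seen on an earlier row.
surplus_rows = total_rows - distinct_values

# Values shared by more than one row, with how many rows share them.
shared = {phone: n for phone, n in counts.items() if n > 1}

print(total_rows, distinct_values, surplus_rows)
print(shared)
```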



Let's look at a sample phone number to see if any of the other customer info overlaps.

In [11]:
# CUST_PHONE is a long column, so compare against an integer literal.
cust_df.where(cust_df['CUST_PHONE'] == 1236886).show()

+----------------+---------+--------------------+----------+----------+-----------+---------+------+--------------+---------+--------+----------+-------------+--------------------+
|  CREDIT_CARD_NO|      SSN|          CUST_EMAIL|CUST_PHONE|FIRST_NAME|MIDDLE_NAME|LAST_NAME|APT_NO|   STREET_NAME|CUST_CITY|CUST_ZIP|CUST_STATE| CUST_COUNTRY|        LAST_UPDATED|
+----------------+---------+--------------------+----------+----------+-----------+---------+------+--------------+---------+--------+----------+-------------+--------------------+
|4210653315752398|123455638|FMeredith@example...|   1236886|   Francis|    Donnell| Meredith|   417|Lincoln Street| Wilmette|   60091|        IL|United States|2018-04-21T12:49:...|
|4210653352401004|123456941|  EWells@example.com|   1236886|     Edwin|      Alice|    Wells|   890|    5th Avenue|Mundelein|   60060|        IL|United States|2018-04-21T12:49:...|
|4210653399859149|123454047| EBeatty@example.com|   1236886|     Emery|    Susanna|   Beatty|  

It seems 3 different people share the same number. I would consider prepending an area code that corresponds to each customer's location, but in this example two of the customers are in the same state and fall under the same area code, so that would not make their numbers unique.