# Extraction phase

In [80]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ExtractCC').getOrCreate()

Let's load and show the first of three JSON files: cdw_sapp_branch.json

In [81]:
branch_df = spark.read.json("source_data/cdw_sapp_branch.json")
branch_df.printSchema()
branch_df.show()

root
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_ZIP: long (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)

+-----------------+-----------+------------+------------+------------+-------------------+----------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME|BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+------------+------------+-------------------+----------+--------------------+
|        Lakeville|          1|Example Bank|  1234565276|          MN|       Bridle Court|      1222|2018-04-18T16:51:...|
|        Lakeville|          1|Example Bank|  1234565276|          MN|       Bridle Court|     55044|2018-04-18T16:51:...|
|          Huntley|          2|Exam

--------------------------------
# Exploratory Analysis


Let's check whether the first three digits of every branch phone number are '123'.

In [82]:
new_df = branch_df.withColumn("first_phone", F.substring("BRANCH_PHONE", 1, 3))  # substring positions are 1-based
new_df.groupBy('first_phone').count().orderBy('count').show()

+-----------+-----+
|first_phone|count|
+-----------+-----+
|        123|  116|
+-----------+-----+



It seems that 123 is the area code for all the phone numbers. Now let's see if there is any branch name other than 'Example Bank'.

In [83]:
branch_df.groupBy('BRANCH_NAME').count().orderBy('count').show()

+------------+-----+
| BRANCH_NAME|count|
+------------+-----+
|Example Bank|  116|
+------------+-----+



Looks like some zip codes are incomplete. Let's see which ones have fewer than 5 digits.

In [84]:
branch_df.select('BRANCH_CITY', 'BRANCH_STATE','BRANCH_ZIP')\
    .where(F.length(branch_df["BRANCH_ZIP"]) < 5).show()

+------------+------------+----------+
| BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP|
+------------+------------+----------+
|   Lakeville|          MN|      1222|
|    Paterson|          NJ|      7501|
|Wethersfield|          CT|      6109|
|Hillsborough|          NJ|      8844|
|     Medford|          MA|      2155|
|    Rockaway|          NJ|      7866|
|  LongBranch|          NJ|      7740|
|   Irvington|          NJ|      7111|
|    NewHaven|          CT|      6511|
|      Quincy|          MA|      2169|
+------------+------------+----------+



After looking up the correct zip code for Paterson, NJ, it seems the leading 0 was dropped: instead of "7501", it should have been "07501". Glancing at the other states on this list (NJ, CT, and MA), I can conclude, based on my experience with zip codes, that those entries have the same dropped-leading-0 issue, since all of those states' zip codes start with a 0. The one exception is the Lakeville, MN entry (1222), which is not a dropped-0 case; the same branch also appears with the zip code 55044 in the output above.
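
Here is a small plain-Python illustration of why the zero goes missing. It assumes the source file stores the zip code as a bare JSON number, which the `long` type in the schema above suggests; the sample record itself is made up for the example.

```python
import json

# Hypothetical record, shaped like a row of cdw_sapp_branch.json.
record = json.loads('{"BRANCH_STATE": "NJ", "BRANCH_ZIP": 7501}')

# A JSON number is parsed as an integer, and an integer has no leading zeros.
print(record["BRANCH_ZIP"])                 # 7501
print(type(record["BRANCH_ZIP"]).__name__)  # int

# A number written with a leading zero is not valid JSON at all,
# so "07501" can only survive if it is stored as a string.
try:
    json.loads('{"BRANCH_ZIP": 07501}')
except json.JSONDecodeError:
    print("07501 is not a valid JSON number")
```

So once the zip code is numeric, the only way to get "07501" back is to convert it to a string and pad it.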

Let's add a leading 0 to every 4-digit zip code whose state is NJ, CT, NH, MA, VT, RI, or ME.

In [85]:
branch_df = branch_df.withColumn('BRANCH_ZIP',\
                    F.when((F.length(branch_df['BRANCH_ZIP']) < 5) &
                        branch_df['BRANCH_STATE'].isin(["NJ", "CT", "NH", "MA", "VT", "RI", "ME"]),\
                    F.format_string("0%s",branch_df['BRANCH_ZIP']))\
                    .otherwise(branch_df["BRANCH_ZIP"]))
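
To check the padding rule without a Spark session, here is a minimal plain-Python sketch of the same logic. The helper name `pad_zip` and the constant `LEADING_ZERO_STATES` are my own and only for illustration; the state list is the one used in the cell above.

```python
# States whose zip codes start with 0, as listed above.
LEADING_ZERO_STATES = {"NJ", "CT", "NH", "MA", "VT", "RI", "ME"}

def pad_zip(zip_code, state):
    """Return the zip code as a string, prepending one "0" where the rule applies."""
    text = str(zip_code)
    if len(text) < 5 and state in LEADING_ZERO_STATES:
        return "0" + text
    # Anything else is passed through unchanged, including short zips from other states.
    return text

print(pad_zip(7501, "NJ"))   # 07501
print(pad_zip(55044, "MN"))  # 55044
print(pad_zip(1222, "MN"))   # 1222
```

Like the Spark expression, this leaves a short zip code untouched when its state is not on the list.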

Let's see if all the zip codes are 5 digits long now.

In [86]:
branch_df.withColumn("zip_len", F.length(branch_df["BRANCH_ZIP"])) \
    .groupBy("zip_len").count().show()

+-------+-----+
|zip_len|count|
+-------+-----+
|      5|  115|
|      4|    1|
+-------+-----+



---------------------------
# Transform

Let's change the phone number to a (XXX)XXX-XXXX format.

In [87]:
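# Column.substr(startPos, length) is 1-based: area code, then 3 digits, then the last 4.
branch_df = branch_df.withColumn('BRANCH_PHONE',
                    F.format_string("(%s)%s-%s", branch_df['BRANCH_PHONE'].substr(1, 3),
                    branch_df['BRANCH_PHONE'].substr(4, 3), branch_df['BRANCH_PHONE'].substr(7, 4)))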