# Extraction phase

In [17]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ExtractCC').getOrCreate()

Let's load and show the first of three JSON files: cdw_sapp_branch.json

In [5]:
branch_df = spark.read.json("source_data/cdw_sapp_branch.json")
branch_df.printSchema()
branch_df.show()

root
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_ZIP: long (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)

+-----------------+-----------+------------+------------+------------+-------------------+----------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME|BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+------------+------------+-------------------+----------+--------------------+
|        Lakeville|          1|Example Bank|  1234565276|          MN|       Bridle Court|     55044|2018-04-18T16:51:...|
|          Huntley|          2|Example Bank|  1234618993|          IL|  Washington Street|     60142|2018-04-18T16:51:...|
|SouthRichmondHill|          3|Exam

--------------------------------
# Exploratory Analysis


Let's check whether the first three digits of every branch phone number are '123'.

In [39]:
# Spark's substring() positions are 1-based, so start at 1 to take the first three characters
new_df = branch_df.withColumn("first_phone", F.substring("BRANCH_PHONE", 1, 3))
new_df.groupBy('first_phone').count().orderBy('count').show()

+-----------+-----+
|first_phone|count|
+-----------+-----+
|        123|  115|
+-----------+-----+
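As a cross-check outside Spark, the same prefix count can be sketched in plain Python. The list holds only the five phone numbers visible in this notebook's truncated outputs, not all 115 rows, and `sample_phones` and `prefix_counts` are names introduced here for illustration. Note that Python slices are 0-based, whereas Spark's `substring` counts positions from 1.

```python
from collections import Counter

# Five phone numbers copied from the visible rows of the branch output;
# the remaining rows are not reproduced here.
sample_phones = ["1234565276", "1234618993", "1234985926", "1234663064", "1234849701"]

# Python slicing is 0-based: [:3] takes the first three characters.
prefix_counts = Counter(phone[:3] for phone in sample_phones)
print(prefix_counts)
```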



It looks like every phone number starts with 123, so that appears to be the shared area code. Now let's see whether any branch name differs from 'Example Bank'.

In [40]:
branch_df.groupBy('BRANCH_NAME').count().orderBy('count').show()

+------------+-----+
| BRANCH_NAME|count|
+------------+-----+
|Example Bank|  115|
+------------+-----+



It looks like some zip codes are incomplete. Let's see which ones have fewer than five digits.

In [46]:
branch_df.select('BRANCH_CITY', 'BRANCH_STATE','BRANCH_ZIP').where(F.length(branch_df["BRANCH_ZIP"]) < 5).show()

+------------+------------+----------+
| BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP|
+------------+------------+----------+
|    Paterson|          NJ|      7501|
|Wethersfield|          CT|      6109|
|Hillsborough|          NJ|      8844|
|     Medford|          MA|      2155|
|    Rockaway|          NJ|      7866|
|  LongBranch|          NJ|      7740|
|   Irvington|          NJ|      7111|
|    NewHaven|          CT|      6511|
|      Quincy|          MA|      2169|
+------------+------------+----------+



After looking up the correct zip code for Paterson, NJ, it appears the leading 0 was dropped: instead of "7501", it should be "07501". This is consistent with BRANCH_ZIP having been read as a long (see the schema), since an integer cannot keep a leading zero. The other entries on this list are all in NJ, CT, or MA, states whose zip codes start with 0, so I conclude they have the same dropped-leading-zero problem.
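The repair this reasoning implies, left-padding every zip code to five characters, can be sketched in plain Python with the nine values from the table above. This illustrates the logic only, assuming every valid zip code has exactly five digits; `short_zips` and `padded` are names introduced here, and in Spark the analogous built-in is `F.lpad`.

```python
# The nine short zip codes listed in the table above, as integers
# (the type Spark inferred for BRANCH_ZIP when reading the JSON).
short_zips = [7501, 6109, 8844, 2155, 7866, 7740, 7111, 6511, 2169]

# str.zfill left-pads with "0" up to the requested width and leaves
# strings that are already five characters long unchanged.
padded = [str(z).zfill(5) for z in short_zips]
print(padded)
```

Because `zfill` leaves five-character strings untouched, the same expression could be applied to every zip code without first filtering for the short ones.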


In [47]:
# Cast BRANCH_ZIP to string so that a leading 0 can be prepended to the zip code

branch_df = branch_df.withColumn('BRANCH_ZIP', F.col('BRANCH_ZIP').cast('string'))

In [59]:
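# The string concatenation below does not work: str() on a Column returns the
# column's Python repr rather than its row values, so F.when() receives a single
# constant string. Building the value with Column functions, e.g.
# F.concat(F.lit("0"), F.col("BRANCH_ZIP")) or F.lpad("BRANCH_ZIP", 5, "0"), avoids this.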
branch_df.select('BRANCH_ZIP').where(F.length(branch_df["BRANCH_ZIP"]) < 5)
branch_df.withColumn('BRANCH_ZIP', F.when(F.length(branch_df['BRANCH_ZIP']) < 5, "0" + str(branch_df['BRANCH_ZIP'])).otherwise(branch_df["BRANCH_ZIP"])).show()

+-----------------+-----------+------------+------------+------------+-------------------+--------------------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME|BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|          BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+------------+------------+-------------------+--------------------+--------------------+
|        Lakeville|          1|Example Bank|  1234565276|          MN|       Bridle Court|               55044|2018-04-18T16:51:...|
|          Huntley|          2|Example Bank|  1234618993|          IL|  Washington Street|               60142|2018-04-18T16:51:...|
|SouthRichmondHill|          3|Example Bank|  1234985926|          NY|      Warren Street|               11419|2018-04-18T16:51:...|
|       Middleburg|          4|Example Bank|  1234663064|          FL|   Cleveland Street|               32068|2018-04-18T16:51:...|
|    KingOfPrussia|          5|Example Bank|  1234849701|          PA