------------------
# Extraction

In [1]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('branch_etl').getOrCreate()

Let's load the first of three JSON files: cdw_sapp_branch.json

In [2]:
branch_df = spark.read.json("source_data/cdw_sapp_branch.json")

--------------------------------
# Exploratory Analysis


In [3]:
branch_df.printSchema()
branch_df.show(15)

root
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_ZIP: long (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)

+-----------------+-----------+------------+------------+------------+-----------------+----------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME|BRANCH_PHONE|BRANCH_STATE|    BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+------------+------------+-----------------+----------+--------------------+
|        Lakeville|          1|Example Bank|  1234565276|          MN|     Bridle Court|     55044|2018-04-18T16:51:...|
|          Huntley|          2|Example Bank|  1234618993|          IL|Washington Street|     60142|2018-04-18T16:51:...|
|SouthRichmondHill|          3|Example Bank| 

How many rows do we have in total in this dataframe?

In [4]:
branch_df.count()

115

Let's check whether the first three digits of every branch phone number are '123'.

In [5]:
# substring positions are 1-based: take 3 characters starting at position 1
new_df = branch_df.withColumn("first_phone", F.substring("BRANCH_PHONE", 1, 3))
new_df.groupBy('first_phone').count().orderBy('count').show()

+-----------+-----+
|first_phone|count|
+-----------+-----+
|        123|  115|
+-----------+-----+



It seems that every phone number starts with 123, so that is the area code for all branches.  Now let's see whether any branch has a name other than 'Example Bank'.

In [6]:
branch_df.groupBy('BRANCH_NAME').count().orderBy('count').show()

+------------+-----+
| BRANCH_NAME|count|
+------------+-----+
|Example Bank|  115|
+------------+-----+



Let's see if all the Branch Cities and codes are unique.

In [7]:
branch_df.select(F.countDistinct("BRANCH_CITY")).show()
branch_df.select(F.countDistinct("BRANCH_CODE")).show()

+---------------------------+
|count(DISTINCT BRANCH_CITY)|
+---------------------------+
|                        115|
+---------------------------+

+---------------------------+
|count(DISTINCT BRANCH_CODE)|
+---------------------------+
|                        115|
+---------------------------+



Looks like all our branch cities and branch codes are unique: both distinct counts equal the 115 total rows.  Now, it looks like some zip codes are incomplete.  Let's see which ones have fewer than 5 digits.

In [8]:
branch_df.select('BRANCH_CITY', 'BRANCH_STATE','BRANCH_ZIP')\
    .where(F.length(branch_df["BRANCH_ZIP"]) < 5).show()

+------------+------------+----------+
| BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP|
+------------+------------+----------+
|    Paterson|          NJ|      7501|
|Wethersfield|          CT|      6109|
|Hillsborough|          NJ|      8844|
|     Medford|          MA|      2155|
|    Rockaway|          NJ|      7866|
|  LongBranch|          NJ|      7740|
|   Irvington|          NJ|      7111|
|    NewHaven|          CT|      6511|
|      Quincy|          MA|      2169|
+------------+------------+----------+



After looking up the correct zip code for Paterson, NJ, it seems the leading 0 was dropped.  Instead of "7501", it should have been "07501".  This fits the schema above, where BRANCH_ZIP was read as a long, so a leading zero cannot be stored.  The other entries on this list are in NJ, CT, and MA, and zip codes in those states all start with a 0, so I conclude that they have the same dropped leading 0.



---------------------------
# Transform

Let's add a leading 0 to all the zip codes with only 4 digits and whose states are either NJ, CT, NH, MA, VT, RI, or ME.

In [9]:
branch_df = branch_df.withColumn('BRANCH_ZIP',\
                    F.when((F.length(branch_df['BRANCH_ZIP']) == 4) &
                        branch_df['BRANCH_STATE'].isin(["NJ", "CT", "NH", "MA", "VT", "RI", "ME"]),
                    F.format_string("0%s",branch_df['BRANCH_ZIP']))\
                    .otherwise(branch_df["BRANCH_ZIP"]))
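
For intuition, the rule in the cell above can be mirrored in plain Python.  This is only a sketch of the logic, not the Spark code itself; the helper name `pad_zip` and the constant `LEADING_ZERO_STATES` are made up for illustration.

```python
# States whose zip codes begin with 0 (the same list used in the Spark cell).
LEADING_ZERO_STATES = {"NJ", "CT", "NH", "MA", "VT", "RI", "ME"}

def pad_zip(zip_code, state):
    """Hypothetical helper: restore a dropped leading 0 on 4-digit zips."""
    text = str(zip_code)
    if len(text) == 4 and state in LEADING_ZERO_STATES:
        return "0" + text
    # Anything else is returned unchanged, as a string.
    return text

print(pad_zip(7501, "NJ"))   # Paterson, NJ
print(pad_zip(55044, "MN"))  # Lakeville, MN
```

Like the Spark version, the sketch returns strings, since a numeric type cannot keep the leading zero.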

Let's see if all the zip codes are 5 digits long now.

In [10]:
branch_df.withColumn("zip_len", F.length(branch_df["BRANCH_ZIP"]))\
    .groupBy("zip_len").count().show()

+-------+-----+
|zip_len|count|
+-------+-----+
|      5|  115|
+-------+-----+



The requirements state that a null zip code should be changed to 99999.  We don't currently have any zip codes shorter than 4 digits, nor any missing or null ones, but we will map both cases to "99999" in case they show up in future data.

In [11]:
branch_df = branch_df.withColumn('BRANCH_ZIP',
                    F.when((F.length(branch_df['BRANCH_ZIP']) < 4) |
                        branch_df['BRANCH_ZIP'].isNull(),"99999")\
                    .otherwise(branch_df["BRANCH_ZIP"]))
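
The fallback rule can be sketched in plain Python as well.  The helper name `clean_zip` is hypothetical, and the sketch assumes zip codes arrive as strings or `None`.

```python
def clean_zip(zip_code):
    """Hypothetical helper: fall back to "99999" for missing or too-short zips."""
    # Check for None first, because len() cannot be applied to it.
    if zip_code is None or len(str(zip_code)) < 4:
        return "99999"
    return str(zip_code)

print(clean_zip(None))
print(clean_zip("07501"))
```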

Let's change the phone number to a (XXX)XXX-XXXX format.

In [12]:
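branch_df = branch_df.withColumn('BRANCH_PHONE',
            F.format_string("(%s)%s-%s",
            F.substring('BRANCH_PHONE', 1, 3),    # area code: 3 chars from position 1
            F.substring('BRANCH_PHONE', 4, 3),    # next 3 chars from position 4
            F.substring('BRANCH_PHONE', 7, 4)))   # last 4 chars from position 7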
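
The same formatting can be sketched in plain Python on one of the sample numbers.  Note the different indexing: Python slices are 0-based with an exclusive end, while Spark's substring takes a 1-based start position and a length.  The helper name `format_phone` is made up for illustration.

```python
def format_phone(digits):
    """Hypothetical helper: format a 10-digit string as (XXX)XXX-XXXX."""
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"expected 10 digits, got {digits!r}")
    # Characters 0-2 are the area code, 3-5 the prefix, 6-9 the line number.
    return f"({digits[0:3]}){digits[3:6]}-{digits[6:10]}"

print(format_phone("1234565276"))
```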

The city names are missing the spaces between words, so let's fix that.

In [13]:
branch_df = branch_df.withColumn('BRANCH_CITY',
        F.regexp_replace(branch_df['BRANCH_CITY'], "(?<=.)([A-Z])", ' $1'))

# (?<=.) is a positive lookbehind.  It checks that some character precedes ([A-Z]), so the
# capital letter at the beginning of the string is not matched.
# ([A-Z]) matches any capital letter.  It is in parentheses because we want to capture this group.
# ' $1' is a blank space plus the first, and only, captured group.  This puts the matched
# capital letter back; otherwise it would have been replaced by the space alone.
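
The pattern can be tried outside Spark with Python's `re` module.  One difference to keep in mind: Spark's `regexp_replace` uses Java regex syntax, where the captured group is written `$1`, while Python writes it `\1`.  The helper name `split_city` is made up for this sketch.

```python
import re

def split_city(name):
    """Hypothetical helper: insert a space before every capital letter
    that is not the first character of the string."""
    return re.sub(r"(?<=.)([A-Z])", r" \1", name)

print(split_city("SouthRichmondHill"))
print(split_city("Lakeville"))
```

The lookbehind accepts any preceding character, including a space, so a name that already contained a space before a capital letter would end up with two.  That is not a problem here as long as the source city names contain no spaces.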

Ok, it looks cleaned up.

In [14]:
branch_df.show()

+-------------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|        BRANCH_CITY|BRANCH_CODE| BRANCH_NAME| BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-------------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|          Lakeville|          1|Example Bank|(123)456-5276|          MN|       Bridle Court|     55044|2018-04-18T16:51:...|
|            Huntley|          2|Example Bank|(123)461-8993|          IL|  Washington Street|     60142|2018-04-18T16:51:...|
|South Richmond Hill|          3|Example Bank|(123)498-5926|          NY|      Warren Street|     11419|2018-04-18T16:51:...|
|         Middleburg|          4|Example Bank|(123)466-3064|          FL|   Cleveland Street|     32068|2018-04-18T16:51:...|
|    King Of Prussia|          5|Example Bank|(123)484-9701|          PA|        14th Street|     19406|2018-04-18T16:

Let's export this data so we can load it into a database.

Spark gave me an error when I tried to save the Spark dataframe to a JSON file, so I converted it to a pandas dataframe and saved it with a pandas method.

In [15]:
pandas_df = branch_df.toPandas()
pandas_df.to_json('clean_data/branch.json')
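
One thing to check before loading: by default, pandas `to_json` writes a DataFrame in a column-keyed layout, while `spark.read.json` by default expects one JSON object per line.  If the loader needs the line-per-record layout, the stdlib sketch below shows what such a file looks like.  The two records are a subset of the cleaned rows shown above, and the temporary file path is illustrative only.

```python
import json
import os
import tempfile

# Two cleaned rows taken from the output above (a subset of the columns).
records = [
    {"BRANCH_CITY": "Lakeville", "BRANCH_CODE": 1, "BRANCH_ZIP": "55044"},
    {"BRANCH_CITY": "Huntley", "BRANCH_CODE": 2, "BRANCH_ZIP": "60142"},
]

# Write one JSON object per line (JSON Lines) to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "branch_sample.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line to confirm the round trip.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded == records)
```

With pandas itself, passing `orient='records', lines=True` to `to_json` should produce this layout.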