In [1]:
import pandas as pd
import zipfile
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('area_code_clean').getOrCreate()

Let's load the dataset.

In [3]:
#path to the zipped dataset
zip_data = 'source_data/full_area_code_dataset.zip'

#extract zipped file
with zipfile.ZipFile(zip_data, 'r') as zip_ref:
    zip_ref.extractall('source_data')
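Before extracting, it can help to check what the archive actually contains. The following is a minimal sketch using only the standard zipfile module; the helper name `list_zip_members` is hypothetical and not part of the original notebook.

```python
import zipfile


def list_zip_members(zip_path):
    """Return the names of the files stored in the archive at zip_path."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        return zip_ref.namelist()
```

Calling `list_zip_members(zip_data)` should list the JSON file that the next cell reads, assuming the archive stores it at its top level.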

In [4]:
npa_df = spark.read.json("source_data/full_area_code_dataset.json")

The dataset also contains NPA numbers (area codes) from Canada, so let's look at just the United States data.

In [5]:
npa_df.where(npa_df['country'] == "United States").show(10)

+--------------------+----------+-------------+----------+-----------+---------+------------+--------+---------+---+------+---+------+--------+-------+
|                 _id|      city|      country|countryISO|dstObserved|gmtOffset|gmtOffsetDST|latitude|longitude|npa|npanxx|nxx| state|stateISO|zipCode|
+--------------------+----------+-------------+----------+-----------+---------+------------+--------+---------+---+------+---+------+--------+-------+
|{5a8c58ba60ca6764...|    Valdez|United States|        US|          1|       -9|          -8| 61.1381|-146.3572|907|907200|200|Alaska|      AK|  99686|
|{5a8c58ba60ca6764...|    Juneau|United States|        US|          1|       -9|          -8| 58.2994|-134.3908|907|907209|209|Alaska|      AK|  99811|
|{5a8c58ba60ca6764...|    Juneau|United States|        US|          1|       -9|          -8| 58.2994|-134.3908|907|907209|209|Alaska|      AK|  99803|
|{5a8c58ba60ca6764...|      Jber|United States|        US|          1|       -9|        

Let's remove the unnecessary columns and keep just the ones we are interested in.

In [4]:
# filter on 'country' first, since the select below drops that column
npa_df = npa_df.where(npa_df['country'] == "United States")\
    .select('city', 'countryISO', 'npa', 'stateISO', 'zipCode')

In [5]:
npa_df.show()

+-----------+----------+---+--------+-------+
|       city|countryISO|npa|stateISO|zipCode|
+-----------+----------+---+--------+-------+
|     Valdez|        US|907|      AK|  99686|
|     Juneau|        US|907|      AK|  99811|
|     Juneau|        US|907|      AK|  99803|
|       Jber|        US|907|      AK|  99506|
|  Ketchikan|        US|907|      AK|  99901|
| Fort Yukon|        US|907|      AK|  99740|
|  Anchorage|        US|907|      AK|  99503|
|  Anchorage|        US|907|      AK|  99501|
|  Anchorage|        US|907|      AK|  99508|
|  Anchorage|        US|907|      AK|  99504|
|  Anchorage|        US|907|      AK|  99509|
|  Anchorage|        US|907|      AK|  99502|
|  Anchorage|        US|907|      AK|  99517|
|  Anchorage|        US|907|      AK|  99511|
|    Wasilla|        US|907|      AK|  99687|
|     Willow|        US|907|      AK|  99688|
|Eagle River|        US|907|      AK|  99577|
|     Seward|        US|907|      AK|  99664|
|  Ward Cove|        US|907|      

Let's see how many rows are in this dataframe.

In [6]:
npa_df.count()

365121

Yikes, that's a lot of data!  I wonder if there is redundant data now that we removed the unnecessary columns.

In [7]:
npa_df.select(F.countDistinct("city")).show()
npa_df.select(F.countDistinct("npa")).show()
npa_df.select(F.countDistinct("countryISO")).show()
npa_df.select(F.countDistinct("zipCode")).show()

+--------------------+
|count(DISTINCT city)|
+--------------------+
|               17342|
+--------------------+

+-------------------+
|count(DISTINCT npa)|
+-------------------+
|                279|
+-------------------+

+--------------------------+
|count(DISTINCT countryISO)|
+--------------------------+
|                         1|
+--------------------------+

+-----------------------+
|count(DISTINCT zipCode)|
+-----------------------+
|                  34748|
+-----------------------+



With only 34748 distinct zip codes across 365121 rows, it looks like there are a lot of duplicate entries, so there is a lot to trim.  Let's run the .distinct() method on the dataframe to remove all the duplicate rows.

In [8]:
npa_df = npa_df.distinct()

Let's see how much smaller the dataframe is now.

In [9]:
npa_df.count()

43404

There is one issue.  Some zip codes have multiple NPA numbers (area codes) associated with them.  

In [10]:
npa_df.orderBy('zipCode', 'npa').show(6)

+---------+----------+---+--------+-------+
|     city|countryISO|npa|stateISO|zipCode|
+---------+----------+---+--------+-------+
| Adjuntas|        US|787|      PR|  00601|
| Adjuntas|        US|939|      PR|  00601|
|   Aguada|        US|787|      PR|  00602|
|Aguadilla|        US|787|      PR|  00603|
|  Maricao|        US|787|      PR|  00606|
|  Maricao|        US|939|      PR|  00606|
+---------+----------+---+--------+-------+
only showing top 6 rows



Let's consolidate those area codes into a list so that each zip code and city pair appears only once.  We group by zip code and city, which removes the redundant rows, and apply an aggregate function to the "npa" column.  The function collect_list() aggregates the values into a list, and we name this aggregate column "npa_list" via the alias() function.

In [11]:
npa_df = npa_df.select("city","zipCode", "npa").groupBy("zipCode", "city")\
    .agg(F.collect_list("npa").alias("npa_list"))

In [12]:
npa_df.show(5)

+-------+---------+----------+
|zipCode|     city|  npa_list|
+-------+---------+----------+
|  00601| Adjuntas|[939, 787]|
|  00602|   Aguada|     [787]|
|  00603|Aguadilla|     [787]|
|  00606|  Maricao|[939, 787]|
|  00610|   Anasco|     [787]|
+-------+---------+----------+
only showing top 5 rows



Much better.  Let's export this data so we can use it in the customer_etl file.

Spark was giving me an error when I tried to save the Spark dataframe to a JSON file, so I converted it to a pandas dataframe and saved it using a pandas method.

In [13]:
pandas_df = npa_df.toPandas()
pandas_df.to_json("source_data/area_codes.json", orient='records', lines=True)

# I had to pass orient='records', lines=True because by default the pandas .to_json method
# saves each column as a dictionary entry in the JSON file, whereas the pyspark read.json
# function expects each row of the dataframe to be a single line in the JSON file.
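To make the difference between the two layouts concrete, here is a small sketch that serializes two rows copied from the output above both ways. The name `sample_df` is hypothetical, and storing the area codes as strings is an assumption about the column type.

```python
import pandas as pd

# Two rows copied from the consolidated output above (area codes assumed to be strings)
sample_df = pd.DataFrame({
    "zipCode": ["00601", "00602"],
    "city": ["Adjuntas", "Aguada"],
    "npa_list": [["939", "787"], ["787"]],
})

# Default layout: a single JSON object for the whole frame, keyed by column name
column_oriented = sample_df.to_json()

# Record layout with lines=True: one JSON object per row, one row per line
json_lines = sample_df.to_json(orient="records", lines=True)

print(column_oriented)
print(json_lines)
```

The second form is the line-delimited JSON that spark.read.json reads by default.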