------------------
# Extraction

In [28]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('credit_card_etl').getOrCreate()

Let's load cdw_sapp_credit.json

In [29]:
cc_df = spark.read.json("source_data/cdw_sapp_credit.json")

--------------------------------
# Exploratory Analysis

In [30]:
cc_df.printSchema()
cc_df.show(10)

root
 |-- BRANCH_CODE: long (nullable = true)
 |-- CREDIT_CARD_NO: string (nullable = true)
 |-- CUST_SSN: long (nullable = true)
 |-- DAY: long (nullable = true)
 |-- MONTH: long (nullable = true)
 |-- TRANSACTION_ID: long (nullable = true)
 |-- TRANSACTION_TYPE: string (nullable = true)
 |-- TRANSACTION_VALUE: double (nullable = true)
 |-- YEAR: long (nullable = true)

+-----------+----------------+---------+---+-----+--------------+----------------+-----------------+----+
|BRANCH_CODE|  CREDIT_CARD_NO| CUST_SSN|DAY|MONTH|TRANSACTION_ID|TRANSACTION_TYPE|TRANSACTION_VALUE|YEAR|
+-----------+----------------+---------+---+-----+--------------+----------------+-----------------+----+
|        114|4210653349028689|123459988| 14|    2|             1|       Education|             78.9|2018|
|         35|4210653349028689|123459988| 20|    3|             2|   Entertainment|            14.24|2018|
|        160|4210653349028689|123459988|  8|    7|             3|         Grocery|             5

How many rows do we have in total in this dataframe?

In [31]:
cc_df.count()

46694

In [32]:
cc_df.describe().show()

+-------+------------------+--------------------+-------------------+-----------------+------------------+------------------+----------------+------------------+--------------------+
|summary|       BRANCH_CODE|      CREDIT_CARD_NO|           CUST_SSN|              DAY|             MONTH|    TRANSACTION_ID|TRANSACTION_TYPE| TRANSACTION_VALUE|                YEAR|
+-------+------------------+--------------------+-------------------+-----------------+------------------+------------------+----------------+------------------+--------------------+
|  count|             46694|               46694|              46694|            46694|             46694|             46694|           46694|             46694|               46694|
|   mean| 75.00057823274939|4.210653353368964E15|1.234555184812824E8|14.50736711354778| 6.516875829871076|           23347.5|            null| 51.03938214759932|              2018.0|
| stddev|51.389074910957895|2.5604641248039957E7| 2561.2609103365367|8.06630502251638

From these results we can conclude that credit card numbers are 16 digits long, SSNs are 9 digits long, months are 12 or less, and days are 28 or less (guess no one shopped on the 29th, 30th, or 31st).
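The summary statistics only bound the values; a per-row check states the same expectations more directly. Below is a minimal plain-Python sketch of those checks, not the notebook's own method. The helper name `row_problems` is hypothetical, and the two sample rows reuse values printed by `cc_df.show(10)` above.

```python
# Plain-Python sketch of the range checks read off the describe() summary.
# row_problems is a hypothetical helper; it is not part of the notebook.

def row_problems(row):
    """Return a list of human-readable problems found in one transaction row."""
    problems = []
    if len(str(row["CREDIT_CARD_NO"])) != 16:
        problems.append("CREDIT_CARD_NO is not 16 digits")
    if len(str(row["CUST_SSN"])) != 9:
        problems.append("CUST_SSN is not 9 digits")
    if not 1 <= row["MONTH"] <= 12:
        problems.append("MONTH outside 1..12")
    if not 1 <= row["DAY"] <= 31:
        problems.append("DAY outside 1..31")
    return problems

# Sample rows copied from the cc_df.show(10) output.
rows = [
    {"CREDIT_CARD_NO": "4210653349028689", "CUST_SSN": 123459988, "DAY": 14, "MONTH": 2},
    {"CREDIT_CARD_NO": "4210653349028689", "CUST_SSN": 123459988, "DAY": 20, "MONTH": 3},
]

# Keep only the rows that fail at least one check.
bad_rows = [(r, row_problems(r)) for r in rows if row_problems(r)]
print(len(bad_rows))
```

An empty `bad_rows` list means every sampled row meets the expectations listed above.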

Let's make sure each CUST_SSN is tied to a single CREDIT_CARD_NO.  If the mapping is one-to-one, the distinct count of each column and the number of distinct (CREDIT_CARD_NO, CUST_SSN) pairs should all be the same.

In [33]:
cc_df.select(F.countDistinct("CREDIT_CARD_NO")).show()
cc_df.select(F.countDistinct("CUST_SSN")).show()
cc_ssn = cc_df.select("CREDIT_CARD_NO","CUST_SSN").groupBy("CREDIT_CARD_NO","CUST_SSN").count()
cc_ssn.count()

+------------------------------+
|count(DISTINCT CREDIT_CARD_NO)|
+------------------------------+
|                           952|
+------------------------------+

+------------------------+
|count(DISTINCT CUST_SSN)|
+------------------------+
|                     952|
+------------------------+



952

Phew!  Things look good there, surprisingly (given how inconsistent the data has been so far...)
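The three matching counts work as a one-to-one test because any card with two SSNs, or any SSN with two cards, would push the number of distinct pairs above one of the single-column counts. A minimal plain-Python sketch of the same idea follows. The function name `is_one_to_one` is hypothetical, the first pair comes from the output above, and the remaining card and SSN values are made up for illustration.

```python
from collections import defaultdict

def is_one_to_one(pairs):
    """True if every card maps to exactly one SSN and every SSN to exactly one card."""
    cards_per_ssn = defaultdict(set)
    ssns_per_card = defaultdict(set)
    for card, ssn in pairs:
        cards_per_ssn[ssn].add(card)
        ssns_per_card[card].add(ssn)
    return (all(len(cards) == 1 for cards in cards_per_ssn.values())
            and all(len(ssns) == 1 for ssns in ssns_per_card.values()))

# Repeated pairs are fine: they are the same customer using the same card.
good = [
    ("4210653349028689", 123459988),
    ("4210653349028689", 123459988),
    ("0000000000000001", 100000001),  # made-up values
]

# Here one SSN appears with a second card, so the mapping is no longer one-to-one.
bad = good + [("0000000000000002", 123459988)]  # made-up card number

print(is_one_to_one(good))
print(is_one_to_one(bad))
```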

Let's see what kind of values are in the TRANSACTION_TYPE column.

In [34]:
cc_df.groupBy('TRANSACTION_TYPE').count().orderBy(F.col('count').desc()).show()

+----------------+-----+
|TRANSACTION_TYPE|count|
+----------------+-----+
|           Bills| 6861|
|      Healthcare| 6723|
|            Test| 6683|
|       Education| 6638|
|   Entertainment| 6635|
|             Gas| 6605|
|         Grocery| 6549|
+----------------+-----+



6683 Test transactions?? Hmm, I wonder what that means.  Surely those can't be trial runs of transactions...

---------------
# Transforming

Let's move on to the date.  We are required to format it in a YYYYMMDD format.  

In order to do so we need to add leading zeroes to the single-digit days and months.  We will use F.lpad, which left-pads a value to a given length with a specified character, in our case a "0".  Note that this turns MONTH and DAY into string columns.

In [35]:
cc_df = cc_df.withColumn("MONTH", F.lpad(cc_df.MONTH, 2, '0'))
cc_df = cc_df.withColumn("DAY", F.lpad(cc_df.DAY, 2, '0'))

Let's create the TIMEID column now.

In [36]:
cc_df = cc_df.withColumn('TIMEID', F.format_string("%s%s%s", 
            cc_df['YEAR'], cc_df['MONTH'], cc_df['DAY']))

In [37]:
cc_df.select('DAY','MONTH','YEAR','TIMEID').show(5)

+---+-----+----+--------+
|DAY|MONTH|YEAR|  TIMEID|
+---+-----+----+--------+
| 14|   02|2018|20180214|
| 20|   03|2018|20180320|
| 08|   07|2018|20180708|
| 19|   04|2018|20180419|
| 10|   10|2018|20181010|
+---+-----+----+--------+
only showing top 5 rows



Looks good.  
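Eyeballing five rows is a quick check, but it does not prove every TIMEID is a real calendar date. Here is a minimal plain-Python sketch of the same zero-padding logic plus a stricter validation. The helper names `make_timeid` and `is_valid_timeid` are hypothetical and this is not how the notebook itself builds the column.

```python
from datetime import datetime

def make_timeid(year, month, day):
    """Build a YYYYMMDD string, zero-padding month and day to two digits."""
    return f"{year}{month:02d}{day:02d}"

def is_valid_timeid(timeid):
    """True if timeid is exactly 8 digits and parses as a real calendar date."""
    if len(timeid) != 8 or not timeid.isdigit():
        return False
    try:
        datetime.strptime(timeid, "%Y%m%d")
    except ValueError:
        return False
    return True

print(make_timeid(2018, 2, 14))      # 20180214
print(is_valid_timeid("20180214"))   # True
print(is_valid_timeid("20180230"))   # False: February has no 30th
print(is_valid_timeid("2018214"))    # False: month was not padded
```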

Let's drop the YEAR, DAY, and MONTH columns since we will not be transferring them into the database.

In [38]:
# drop() accepts several column names at once
cc_df = cc_df.drop("DAY", "MONTH", "YEAR")

Let's reorder the columns in the order in which they will be loaded into the DB.

In [39]:
cc_df = cc_df.select('CREDIT_CARD_NO','TIMEID','CUST_SSN','BRANCH_CODE',
            'TRANSACTION_TYPE','TRANSACTION_VALUE','TRANSACTION_ID')

Let's export this data so we can load it into a database.

Spark raised an error when I tried to save the Spark DataFrame to a JSON file, so I converted it to a pandas DataFrame and saved it with the pandas to_json method.

In [40]:
pandas_df = cc_df.toPandas()
pandas_df.to_json('clean_data/credit_card.json')
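One thing to keep in mind about the file layout: `spark.read.json` reads line-delimited JSON by default (one object per line), while pandas `to_json` on a DataFrame defaults to a single column-oriented object. Whether that matters depends on how the database loader reads the file, which this notebook does not show. Below is a minimal standard-library sketch of the line-delimited layout, assuming it is the one you want. The helper names and the temporary file path are hypothetical, and the two records reuse rows shown earlier.

```python
import json
import os
import tempfile

# Two records in the final column order, taken from rows displayed above.
records = [
    {"CREDIT_CARD_NO": "4210653349028689", "TIMEID": "20180214", "CUST_SSN": 123459988,
     "BRANCH_CODE": 114, "TRANSACTION_TYPE": "Education", "TRANSACTION_VALUE": 78.9,
     "TRANSACTION_ID": 1},
    {"CREDIT_CARD_NO": "4210653349028689", "TIMEID": "20180320", "CUST_SSN": 123459988,
     "BRANCH_CODE": 35, "TRANSACTION_TYPE": "Entertainment", "TRANSACTION_VALUE": 14.24,
     "TRANSACTION_ID": 2},
]

def write_json_lines(rows, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

def read_json_lines(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credit_card.jsonl")
    write_json_lines(records, path)
    round_tripped = read_json_lines(path)

print(len(round_tripped))
```

Writing CREDIT_CARD_NO and TIMEID as strings keeps the 16-digit card number and the zero-padded date from being reinterpreted as numbers when the file is read back.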