# 2.2 Transactions Data Cleaning

##### Description

Basic data exploration and column type formatting for transactions.csv.

##### Notebook Steps

1. Connect Spark
1. Input Data
1. Examine Data
1. Data Cleaning
1. Output Data

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark

Added endpoint http://ec2-3-85-12-54.compute-1.amazonaws.com:8998/
Starting Spark application


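| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 2 | application_1609241788334_0003 | pyspark | idle | Link | Link | ✔ |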


SparkSession available as 'spark'.
Cleaned up endpoint http://ec2-3-85-12-54.compute-1.amazonaws.com:8998/


## 2. Input Data

In [3]:
%%spark
df = spark.read.csv("s3://jolfr-capstone3/raw/transactions", header=True)

## 3. Examine Data

##### show()

In [4]:
%%spark
df.show()

+--------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|                msno|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|
+--------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|++6eU4LsQ3UQ20ILS...|               32|               90|            298|               298|            0|        20170131|              20170504|        0|
|++lvGPJOinuin/8es...|               41|               30|            149|               149|            1|        20150809|              20190412|        0|
|+/GXNtXWQVfKrEDqY...|               36|               30|            180|               180|            1|        20170303|              20170422|        0|
|+/w1UrZwyka4C9oNH...|               36|            

##### count()

In [5]:
%%spark
df.count()

1431009

##### describe()

In [6]:
%%spark
df.describe().show()

+-------+--------------------+-----------------+------------------+------------------+------------------+-------------------+--------------------+----------------------+-------------------+
|summary|                msno|payment_method_id| payment_plan_days|   plan_list_price|actual_amount_paid|      is_auto_renew|    transaction_date|membership_expire_date|          is_cancel|
+-------+--------------------+-----------------+------------------+------------------+------------------+-------------------+--------------------+----------------------+-------------------+
|  count|             1431009|          1431009|           1431009|           1431009|           1431009|            1431009|             1431009|               1431009|            1431009|
|   mean|                null|37.91835481118567| 66.01769590547649|281.78703488238017| 281.3172411913552| 0.7853025382789347|2.0168484537746444E7|   2.017110068205581E7|0.02455120827332323|
| stddev|                null|  4.9648049069269|10

##### printSchema()

In [7]:
%%spark
df.printSchema()

root
 |-- msno: string (nullable = true)
 |-- payment_method_id: string (nullable = true)
 |-- payment_plan_days: string (nullable = true)
 |-- plan_list_price: string (nullable = true)
 |-- actual_amount_paid: string (nullable = true)
 |-- is_auto_renew: string (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: string (nullable = true)

##### columns

In [8]:
%%spark
df.columns

['msno', 'payment_method_id', 'payment_plan_days', 'plan_list_price', 'actual_amount_paid', 'is_auto_renew', 'transaction_date', 'membership_expire_date', 'is_cancel']

##### head(5)

In [9]:
%%spark
df.head(5)

[Row(msno=u'++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=', payment_method_id=u'32', payment_plan_days=u'90', plan_list_price=u'298', actual_amount_paid=u'298', is_auto_renew=u'0', transaction_date=u'20170131', membership_expire_date=u'20170504', is_cancel=u'0'), Row(msno=u'++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=', payment_method_id=u'41', payment_plan_days=u'30', plan_list_price=u'149', actual_amount_paid=u'149', is_auto_renew=u'1', transaction_date=u'20150809', membership_expire_date=u'20190412', is_cancel=u'0'), Row(msno=u'+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=', payment_method_id=u'36', payment_plan_days=u'30', plan_list_price=u'180', actual_amount_paid=u'180', is_auto_renew=u'1', transaction_date=u'20170303', membership_expire_date=u'20170422', is_cancel=u'0'), Row(msno=u'+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=', payment_method_id=u'36', payment_plan_days=u'30', plan_list_price=u'180', actual_amount_paid=u'180', is_auto_renew=u'1', transaction_date=u'20170329', memb

##### Nulls per Column

In [10]:
%%spark
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|msno|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|
+----+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|   0|                0|                0|              0|                 0|            0|               0|                     0|        0|
+----+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+

##### Value Counts

In [11]:
%%spark
df.groupBy('payment_method_id').count().orderBy('count').show()

+-----------------+-----+
|payment_method_id|count|
+-----------------+-----+
|                5|    1|
|               24|    4|
|                2|    4|
|               25|    5|
|               10|   40|
|                3|   42|
|               11|   79|
|                8|  179|
|                6|  186|
|               26|  668|
|               14|  672|
|               18|  714|
|               16| 1842|
|               21| 1846|
|               27| 2074|
|               19| 2136|
|               17| 2532|
|               23| 2719|
|               12| 2858|
|               28| 3452|
+-----------------+-----+
only showing top 20 rows

In [12]:
%%spark
df.groupBy('payment_plan_days').count().orderBy('count').show()

+-----------------+-----+
|payment_plan_days|count|
+-----------------+-----+
|               31|    4|
|                3|    9|
|               21|   11|
|              110|   20|
|               35|   29|
|              230|   35|
|               45|   41|
|               80|   43|
|               70|   49|
|               14|   82|
|               10|  416|
|                1|  676|
|              270|  997|
|              450| 1762|
|              400| 1817|
|                0| 2218|
|              200| 3108|
|               60| 3134|
|              415| 3298|
|              240| 3440|
+-----------------+-----+
only showing top 20 rows

In [13]:
%%spark
df.groupBy('plan_list_price').count().orderBy('count').show()

+---------------+-----+
|plan_list_price|count|
+---------------+-----+
|             30|    1|
|             15|    1|
|             50|    2|
|            143|    4|
|            265|    4|
|           1300|    6|
|             70|   11|
|            105|   11|
|            131|   24|
|              1|   25|
|           1260|   25|
|           1150|   35|
|            400|   43|
|            126|   46|
|            350|   49|
|            210|   65|
|            596|   66|
|            134|  109|
|           2000|  130|
|           1399|  137|
+---------------+-----+
only showing top 20 rows

In [14]:
%%spark
df.groupBy('actual_amount_paid').count().orderBy('count').show()

+------------------+-----+
|actual_amount_paid|count|
+------------------+-----+
|              1778|    1|
|                15|    1|
|              1780|    1|
|                30|    1|
|               849|    1|
|               897|    1|
|               984|    1|
|                50|    2|
|               143|    4|
|               265|    4|
|              1300|    6|
|                70|   11|
|               105|   11|
|               131|   24|
|                 1|   25|
|              1260|   25|
|              1150|   35|
|               400|   43|
|               127|   46|
|               350|   49|
+------------------+-----+
only showing top 20 rows

In [15]:
%%spark
df.groupBy('is_auto_renew').count().orderBy('count').show()

+-------------+-------+
|is_auto_renew|  count|
+-------------+-------+
|            0| 307234|
|            1|1123775|
+-------------+-------+

In [16]:
%%spark
df.groupBy('is_cancel').count().orderBy('count').show()

+---------+-------+
|is_cancel|  count|
+---------+-------+
|        1|  35133|
|        0|1395876|
+---------+-------+
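
The two flag columns allow a quick arithmetic cross-check against the earlier cells: each flag's counts should add up to the `count()` result, and the share of `1` values should agree with the `describe()` means (about 0.7853 for is_auto_renew and 0.0246 for is_cancel). The sketch below is plain Python that runs locally, not on the cluster; the variable names are illustrative and the figures are copied by hand from the output above.

```python
# Figures copied from the notebook output above.
total_rows = 1431009                      # df.count()
auto_renew = {"0": 307234, "1": 1123775}  # groupBy('is_auto_renew').count()
cancel = {"1": 35133, "0": 1395876}       # groupBy('is_cancel').count()

# Each flag's counts should account for every row.
assert sum(auto_renew.values()) == total_rows
assert sum(cancel.values()) == total_rows

# Share of rows with the flag set; compare with the describe() means.
auto_renew_share = auto_renew["1"] / float(total_rows)
cancel_share = cancel["1"] / float(total_rows)
print("is_auto_renew share: {:.4f}".format(auto_renew_share))
print("is_cancel share: {:.4f}".format(cancel_share))
```

If either assertion fails, the value counts and the row count were probably taken from different versions of the data.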

## 4. Data Cleaning

In [17]:
%%spark
from pyspark.sql import types
from pyspark.sql.functions import col, to_date

### Columns

##### msno
The msno column holds the user IDs for the dataset, so it is renamed from msno to user_id.

In [18]:
%%spark
df = df.withColumnRenamed("msno","user_id")

##### payment_method_id
The payment_method_id column is cast from string to integer.

In [19]:
%%spark
df = df.withColumn("payment_method_id",col("payment_method_id").cast(types.IntegerType()))

##### payment_plan_days
The payment_plan_days column is cast from string to integer.

In [20]:
%%spark
df = df.withColumn("payment_plan_days",col("payment_plan_days").cast(types.IntegerType()))

##### plan_list_price
The plan_list_price column is cast from string to integer.

In [21]:
%%spark
df = df.withColumn("plan_list_price",col("plan_list_price").cast(types.IntegerType()))

##### actual_amount_paid
The actual_amount_paid column is cast from string to integer.

In [22]:
%%spark
df = df.withColumn("actual_amount_paid",col("actual_amount_paid").cast(types.IntegerType()))

##### is_auto_renew
The is_auto_renew column is cast from string to boolean.

In [23]:
%%spark
df = df.withColumn("is_auto_renew",col("is_auto_renew").cast(types.BooleanType()))

##### transaction_date
The transaction_date column is parsed from a yyyyMMdd string into a date.

In [24]:
%%spark
df= df.withColumn('transaction_date',to_date(df.transaction_date, 'yyyyMMdd'))
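
To illustrate what the `yyyyMMdd` pattern means for a single value, here is a minimal local sketch using Python's standard library, where `%Y%m%d` plays the same role. The helper name `parse_yyyymmdd` is hypothetical, and returning `None` for bad input is a choice made for this sketch, not a statement about how Spark's `to_date` behaves.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Parse an 8-digit yyyyMMdd string into a date, or return None."""
    # Reject anything that is not exactly eight digits before parsing.
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        # %Y%m%d is the strptime counterpart of the yyyyMMdd pattern.
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date, e.g. month 13.
        return None


# First transaction_date value shown by df.show() in section 3.
print(parse_yyyymmdd("20170131"))  # 2017-01-31
```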

##### membership_expire_date
The membership_expire_date column is parsed from a yyyyMMdd string into a date.

In [25]:
%%spark
df= df.withColumn('membership_expire_date',to_date(df.membership_expire_date, 'yyyyMMdd'))

##### is_cancel
The is_cancel column is cast from string to boolean.

In [26]:
%%spark
df = df.withColumn("is_cancel",col("is_cancel").cast(types.BooleanType()))
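
The integer and boolean casts in this section all follow the same pattern, so they could also be written as one loop over a mapping. The following is a sketch under assumptions, not the notebook's own method: `COLUMN_TYPES` and `cast_columns` are hypothetical names, and the type names `"int"` and `"boolean"` are passed to `Column.cast` as strings instead of `types.IntegerType()` and `types.BooleanType()`. The two date columns are left out because they are parsed with `to_date`.

```python
# Hypothetical mapping from column name to Spark SQL type name,
# mirroring the integer and boolean casts applied cell by cell above.
COLUMN_TYPES = [
    ("payment_method_id", "int"),
    ("payment_plan_days", "int"),
    ("plan_list_price", "int"),
    ("actual_amount_paid", "int"),
    ("is_auto_renew", "boolean"),
    ("is_cancel", "boolean"),
]


def cast_columns(df, column_types):
    """Return df with each listed column cast from its own values."""
    for name, type_name in column_types:
        # Source and target are always the same column, so a column cannot
        # be overwritten with another column's values by mistake.
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

In a `%%spark` cell this would be applied as `df = cast_columns(df, COLUMN_TYPES)`.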

## 5. Output Data

##### Final Check

In [27]:
%%spark
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- payment_method_id: integer (nullable = true)
 |-- payment_plan_days: integer (nullable = true)
 |-- plan_list_price: integer (nullable = true)
 |-- actual_amount_paid: integer (nullable = true)
 |-- is_auto_renew: boolean (nullable = true)
 |-- transaction_date: date (nullable = true)
 |-- membership_expire_date: date (nullable = true)
 |-- is_cancel: boolean (nullable = true)

In [28]:
%%spark
df.show()

+--------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|             user_id|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|
+--------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+
|++6eU4LsQ3UQ20ILS...|               32|               90|            298|               298|        false|      2017-01-31|            2017-05-04|    false|
|++lvGPJOinuin/8es...|               41|               30|            149|               149|         true|      2015-08-09|            2019-04-12|    false|
|+/GXNtXWQVfKrEDqY...|               36|               30|            180|               180|         true|      2017-03-03|            2017-04-22|    false|
|+/w1UrZwyka4C9oNH...|               36|            

##### Output to File

In [29]:
%%spark

df.write.format("com.databricks.spark.csv").option("header", "true").mode('overwrite').save('s3://jolfr-capstone3/interim/transactions')
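
CSV files carry no type information, so reading the interim output back with only `header=True` would return every column as a string again, as in the `printSchema()` output of section 3. One way to keep the cleaned types is to pass an explicit schema when reading. This is a sketch: `INTERIM_COLUMNS`, `schema_ddl` and `read_interim` are hypothetical names, and it assumes a Spark version whose CSV reader accepts a DDL-formatted schema string.

```python
# Hypothetical column list matching the schema printed in the final check.
INTERIM_COLUMNS = [
    ("user_id", "STRING"),
    ("payment_method_id", "INT"),
    ("payment_plan_days", "INT"),
    ("plan_list_price", "INT"),
    ("actual_amount_paid", "INT"),
    ("is_auto_renew", "BOOLEAN"),
    ("transaction_date", "DATE"),
    ("membership_expire_date", "DATE"),
    ("is_cancel", "BOOLEAN"),
]


def schema_ddl(columns):
    """Build a DDL schema string such as 'user_id STRING, is_cancel BOOLEAN'."""
    return ", ".join("{} {}".format(name, type_name) for name, type_name in columns)


def read_interim(spark, path):
    """Read the interim CSV with explicit column types instead of all strings."""
    return spark.read.csv(path, header=True, schema=schema_ddl(INTERIM_COLUMNS))
```

A downstream notebook could then call `read_interim(spark, 's3://jolfr-capstone3/interim/transactions')` and confirm the result with `printSchema()`.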
