# 2.4 Train Data Cleaning

##### Description

Basic data exploration and column formatting for train.csv.

##### Notebook Steps

1. Connect Spark
1. Input Data
1. Examine Data
1. Data Cleaning
1. Output Data

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark

Added endpoint http://ec2-3-85-12-54.compute-1.amazonaws.com:8998/
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 4 | application_1609241788334_0005 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.
Cleaned up endpoint http://ec2-3-85-12-54.compute-1.amazonaws.com:8998/


## 2. Input Data

In [3]:
%%spark
df = spark.read.csv("s3://jolfr-capstone3/raw/train", header=True)

## 3. Examine Data

##### show()

In [4]:
%%spark
df.show()

+--------------------+--------+
|                msno|is_churn|
+--------------------+--------+
|ugx0CjOMzazClkFzU...|       1|
|f/NmvEzHfhINFEYZT...|       1|
|zLo9f73nGGT1p21lt...|       1|
|8iF/+8HY8lJKFrTc7...|       1|
|K6fja4+jmoZ5xG6By...|       1|
|ibIHVYBqxGwrSExE6...|       1|
|kVmM8X4iBPCOfK/m1...|       1|
|moRTKhKIDvb+C8ZHO...|       1|
|dW/tPZMDh2Oz/ksdu...|       1|
|otEcMhAX3mU4gumUS...|       1|
|t5rqTxCnG7s5VBgEf...|       1|
|dfLS2/Pom6O3iUpo+...|       1|
|a7AtvhlY8KnKZGpiV...|       1|
|F45GsXJIeLvzUJqz/...|       1|
|SJCoxreWp6Cu9WPit...|       1|
|Oo2RDQixJ0pRWqec4...|       1|
|f91n3lDipDjRtAVNg...|       1|
|/L2095JD4M/BNLTCb...|       1|
|1AzXWFlRO6EfMBzfB...|       1|
|WkF/FvlzpBLFoa+Hm...|       1|
+--------------------+--------+
only showing top 20 rows

##### count()

In [5]:
%%spark
df.count()

970960

##### describe()

In [6]:
%%spark
df.describe().show()

+-------+--------------------+-------------------+
|summary|                msno|           is_churn|
+-------+--------------------+-------------------+
|  count|              970960|             970960|
|   mean|                null|0.08994191315811156|
| stddev|                null|0.28609867129385297|
|    min|+++hVY1rZox/33Ytv...|                  0|
|    max|zzzF1KsGfHH3qI6qi...|                  1|
+-------+--------------------+-------------------+

##### printSchema()

In [7]:
%%spark
df.printSchema()

root
 |-- msno: string (nullable = true)
 |-- is_churn: string (nullable = true)

##### columns

In [8]:
%%spark
df.columns

['msno', 'is_churn']

##### head(5)

In [9]:
%%spark
df.head(5)

[Row(msno=u'ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=', is_churn=u'1'),
 Row(msno=u'f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=', is_churn=u'1'),
 Row(msno=u'zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=', is_churn=u'1'),
 Row(msno=u'8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=', is_churn=u'1'),
 Row(msno=u'K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA=', is_churn=u'1')]

##### Null per Column

In [10]:
%%spark
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----+--------+
|msno|is_churn|
+----+--------+
|   0|       0|
+----+--------+
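
The null-count expression above relies on two behaviors: when() without an otherwise() yields null for rows that do not match, and count() skips nulls, so only the flagged rows are tallied. The plain-Python sketch below mimics that logic on a small made-up list. The helper count_missing and the sample values are hypothetical and not part of the notebook.

```python
import math

def count_missing(column_name, values):
    """Count None/NaN entries, mimicking count(when(isnan(c) | isNull(c), c))."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # when(cond, c): c is a plain string in the notebook, so matching rows
    # receive the literal column name and all other rows receive None (null).
    flagged = [column_name if is_missing(v) else None for v in values]
    # count(...) only counts the non-null entries.
    return sum(1 for f in flagged if f is not None)

# Hypothetical sample data, not taken from train.csv.
sample = ["user_a", None, float("nan"), "user_b"]
print(count_missing("msno", sample))
```

Because the second argument to when() is a literal rather than the column itself, rows whose value is null are still counted.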

##### Value Counts

In [11]:
%%spark
df.groupBy('is_churn').count().orderBy('count').show()

+--------+------+
|is_churn| count|
+--------+------+
|       1| 87330|
|       0|883630|
+--------+------+

## 4. Data Cleaning

In [12]:
%%spark
from pyspark.sql import types
from pyspark.sql.functions import col

##### msno
The msno column holds the user IDs for the dataset, so it is renamed to user_id.

In [13]:
%%spark
df = df.withColumnRenamed("msno","user_id")

##### is_churn
Because the CSV was read without a schema, is_churn was loaded as a string (see the printSchema() output above). It is cast to boolean here.

In [14]:
%%spark
df = df.withColumn("is_churn",col("is_churn").cast(types.BooleanType()))
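
For context, Spark's string-to-boolean cast maps a small set of strings to true or false and turns anything else into null, so a null count after the cast can serve as a safeguard. The plain-Python sketch below approximates that rule. The helper cast_to_boolean is hypothetical, and the accepted spellings are an assumption that should be confirmed against the Spark version in use.

```python
# Assumed spellings accepted by Spark's cast; verify for your Spark version.
TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
FALSE_STRINGS = {"f", "false", "n", "no", "0"}

def cast_to_boolean(value):
    """Approximate Spark's string -> boolean cast; unrecognized input becomes None."""
    if value is None:
        return None
    lowered = value.lower()
    if lowered in TRUE_STRINGS:
        return True
    if lowered in FALSE_STRINGS:
        return False
    return None  # Spark would produce null here

# The only values present in is_churn, per the value counts above.
print([cast_to_boolean(v) for v in ["1", "0"]])
```

Since is_churn contains only '0' and '1', every row maps to a definite boolean and the cast introduces no nulls.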

## 5. Output Data

##### Final Check

In [15]:
%%spark
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- is_churn: boolean (nullable = true)

In [16]:
%%spark
df.show()

+--------------------+--------+
|             user_id|is_churn|
+--------------------+--------+
|ugx0CjOMzazClkFzU...|    true|
|f/NmvEzHfhINFEYZT...|    true|
|zLo9f73nGGT1p21lt...|    true|
|8iF/+8HY8lJKFrTc7...|    true|
|K6fja4+jmoZ5xG6By...|    true|
|ibIHVYBqxGwrSExE6...|    true|
|kVmM8X4iBPCOfK/m1...|    true|
|moRTKhKIDvb+C8ZHO...|    true|
|dW/tPZMDh2Oz/ksdu...|    true|
|otEcMhAX3mU4gumUS...|    true|
|t5rqTxCnG7s5VBgEf...|    true|
|dfLS2/Pom6O3iUpo+...|    true|
|a7AtvhlY8KnKZGpiV...|    true|
|F45GsXJIeLvzUJqz/...|    true|
|SJCoxreWp6Cu9WPit...|    true|
|Oo2RDQixJ0pRWqec4...|    true|
|f91n3lDipDjRtAVNg...|    true|
|/L2095JD4M/BNLTCb...|    true|
|1AzXWFlRO6EfMBzfB...|    true|
|WkF/FvlzpBLFoa+Hm...|    true|
+--------------------+--------+
only showing top 20 rows

##### Output to File

In [17]:
%%spark

# Write the cleaned data as CSV with a header row, replacing any previous output.
df.write.csv("s3://jolfr-capstone3/interim/train", header=True, mode="overwrite")
