# 5.1 Data Prep

##### Description

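Prepares the cleaned feature data for modeling. The data is split chronologically into train and test subsets, downsampled, stripped of identifier columns, and written back to S3.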

##### Notebook Steps

1. Connect Spark
1. Input data
1. Split data
1. Downsample data
1. Drop unnecessary columns
1. Output data

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark

Added endpoint http://ec2-54-91-225-25.compute-1.amazonaws.com:8998/
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 0 | application_1612113777859_0001 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 2 | application_1612113777859_0003 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


## 2. Input Data

In [4]:
%%spark
df = spark.read.csv("s3://jolfr-capstone3/clean/features.csv", header=True, inferSchema=True)

## 3. Split Data

In [5]:
%%spark
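from pyspark.sql.functions import percent_rank
from pyspark.sql import Window
# Rank every row by time on a 0-1 scale so the split below is chronological.
# Note: a window with no partition keys moves all rows to a single partition.
df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("time")))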

In [6]:
%%spark
train_df = df.where("rank <= .8").drop("rank")
print((train_df.count(), len(train_df.columns)))

(21695948, 23)

In [7]:
%%spark
test_df = df.where("rank > .8").drop("rank")
print((test_df.count(), len(test_df.columns)))

(4206473, 23)
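
The counts above work out to roughly 83.8% train and 16.2% test rather than an exact 80/20 split. One plausible explanation, assuming `time` contains many repeated values, is that `percent_rank` gives tied rows the same (lowest) rank, so a whole group of ties lands on one side of the 0.8 cutoff. The plain-Python sketch below mimics the SQL definition `(rank - 1) / (n - 1)` on hypothetical toy data; the helper names `percent_rank_py` and `split_by_rank` and the sample values are illustrative and not part of the notebook.

```python
def percent_rank_py(values):
    """Mimic SQL PERCENT_RANK: (rank - 1) / (n - 1); ties share the lowest rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    # The first 0-based position of each distinct value in sorted order is rank - 1.
    first_pos = {}
    for pos, value in enumerate(sorted(values)):
        first_pos.setdefault(value, pos)
    return [first_pos[value] / (n - 1) for value in values]


def split_by_rank(values, cutoff=0.8):
    """Split values into (train, test) using the same cutoff as the notebook."""
    ranks = percent_rank_py(values)
    train = [v for v, r in zip(values, ranks) if r <= cutoff]
    test = [v for v, r in zip(values, ranks) if r > cutoff]
    return train, test


# Hypothetical toy timestamps: ten distinct values, then ten values with a tie.
distinct_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tied_times = [1, 2, 3, 4, 5, 6, 7, 7, 7, 10]

train_distinct, test_distinct = split_by_rank(distinct_times)
train_tied, test_tied = split_by_rank(tied_times)

print(len(train_distinct), len(test_distinct))
print(len(train_tied), len(test_tied))
```

With distinct timestamps the toy split is 8 to 2; with three rows tied at `7`, all of them share one rank below the cutoff and the split becomes 9 to 1.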

## Downsample Data

Keep a 5% random sample of each subset to speed up training.

In [8]:
%%spark
# Keep roughly 5% of the rows; the remaining 95% (drop) is discarded.
train_df, drop = train_df.randomSplit(weights = [0.05, 0.95], seed = 42)

In [9]:
%%spark
test_df, drop = test_df.randomSplit(weights = [0.05, 0.95], seed = 42)
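
`randomSplit` assigns rows at random, so the 5% figure is a target rather than an exact count. The sketch below illustrates the general idea in plain Python under stated assumptions: the weights are normalized into cumulative ranges, and each row goes to the range that contains its uniform random draw. The helper `random_split_py` is hypothetical, is not Spark's implementation, and will not select the same rows Spark picks for `seed = 42`.

```python
import random
from itertools import accumulate


def random_split_py(rows, weights, seed):
    """Assign each row to one split according to normalized weights."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1]; pin the last one to exactly 1.0
    # so floating-point drift cannot leave a row unassigned.
    bounds = [0.0] + [c / total for c in accumulate(weights)]
    bounds[-1] = 1.0
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i in range(len(weights)):
            if bounds[i] <= draw < bounds[i + 1]:
                splits[i].append(row)
                break
    return splits


# Hypothetical data: 100,000 row ids standing in for a DataFrame.
kept, dropped = random_split_py(list(range(100_000)), [0.05, 0.95], seed=42)
print(len(kept), len(dropped))
```

Applied to the counts from the split step, a 5% sample would leave roughly 1.08 million training rows and 0.21 million test rows, with the exact numbers varying by seed.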

## 4. Drop Unnecessary Columns

In [10]:
%%spark
train_df = train_df.drop("msno", "time")

In [11]:
%%spark
test_df = test_df.drop("msno", "time")

In [12]:
%%spark
from pyspark.sql.types import FloatType
from pyspark.sql.functions import col
train_df = train_df.withColumn("label", col("label").cast(FloatType()))
test_df = test_df.withColumn("label", col("label").cast(FloatType()))

## 5. Output Data

In [13]:
%%spark
# Spark writes a directory of part files at this path, not a single CSV file.
train_df.write.format("com.databricks.spark.csv").option("header", "true").mode('overwrite').save('s3://jolfr-capstone3/training/train.csv')

In [14]:
%%spark
# Spark writes a directory of part files at this path, not a single CSV file.
test_df.write.format("com.databricks.spark.csv").option("header", "true").mode('overwrite').save('s3://jolfr-capstone3/validation/validate.csv')
