# 2.1 Members with Features Data Cleaning

##### Description

Basic data examination and cleaning for the members feature data (mem-features.csv)

##### Notebook Steps

1. Connect Spark
1. Input Data
1. Examine Data
1. Clean Data
1. Output Data

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark

Added endpoint http://ec2-3-94-115-24.compute-1.amazonaws.com:8998/
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 4 | application_1610031470687_0005 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


## 2. Input Data

In [3]:
%%spark

df = spark.read.csv("s3://jolfr-capstone3/interim/mem-features.csv", header=True)

## 3. Examine Data

##### show()

In [4]:
%%spark
df.show()

##### count()

In [5]:
%%spark
df.count()

25849

##### describe()

In [6]:
%%spark
df.describe().show()

##### printSchema()

In [7]:
%%spark
df.printSchema()

root
 |-- msno: string (nullable = true)
 |-- time: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- registered_via: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SUM(transactions.payment_plan_days): string (nullable = true)
 |-- SUM(transactions.plan_list_price): string (nullable = true)
 |-- SUM(transactions.actual_amount_paid): string (nullable = true)
 |-- SUM(transactions.price_difference): string (nullable = true)
 |-- SUM(transactions.planned_daily_price): string (nullable = true)
 |-- SUM(transactions.daily_price): string (nullable = true)
 |-- TIME_SINCE_LAST(transactions.transaction_date): string (nullable = true)
 |-- AVG_TIME_BETWEEN(transactions.transaction_date): string (nullable = true)
 |-- ALL(transactions.is_auto_renew): string (nullable = true)
 |-- ALL(transactions.is_cancel): string (nullable = true)
 |-- MODE(transactions.payment_method_id): string (nullable = true)
 |-- NUM_UNIQUE(transactio

## 4. Clean Data

##### Rename Columns
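Featuretools names its aggregated columns with a '.', for example SUM(transactions.payment_plan_days). Spark treats a dot in a column reference as nested-field access unless the name is wrapped in backticks, so select statements on these columns fail. Each '.' is therefore replaced with a dash.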

In [8]:
%%spark
df = df.toDF(*(c.replace('.', '-') for c in df.columns))

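To make the effect of the rename concrete, the following is a minimal sketch in plain Python, with no Spark session required. It applies the same `replace('.', '-')` expression to a few column names copied from the printSchema() output above; the `columns` and `renamed` variables are illustrative names, not part of the notebook.

```python
# Column names copied from the printSchema() output above
columns = [
    "msno",
    "SUM(transactions.payment_plan_days)",
    "ALL(transactions.is_cancel)",
]

# Same string expression as the notebook cell, applied to plain strings
renamed = [c.replace('.', '-') for c in columns]

for old, new in zip(columns, renamed):
    print(f"{old} -> {new}")
```

Names without a dot, such as `msno`, pass through unchanged, so the expression is safe to apply to every column.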

##### Check Null Values

In [9]:
%%spark
from pyspark.sql.functions import when, count, col
null_values = df.select([count(when(col(c).isNull(), c)).alias(c) for c in 
           df.columns]).toPandas()

null_values

   msno  time  city  ...  label  days_to_churn  churn_date
0     0     0  3192  ...   8616          21078       25239

[1 rows x 253 columns]
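The `count(when(col(c).isNull(), c))` pattern works because `when` without an `otherwise` yields null whenever the condition is false, and `count` only counts non-null values, so the result is the number of nulls in the column. The sketch below mimics that behaviour in plain Python; the `when` and `count` helpers and the sample `rows` are hypothetical stand-ins, not the PySpark functions themselves.

```python
def when(condition, value):
    """Mimic Spark's when() without otherwise(): None when the condition is false."""
    return value if condition else None


def count(values):
    """Mimic Spark's count(): the number of non-null values."""
    return sum(v is not None for v in values)


# Hypothetical rows; None stands in for a Spark null
rows = [
    {"msno": "a", "city": "1", "label": "0"},
    {"msno": "b", "city": None, "label": None},
    {"msno": "c", "city": None, "label": "1"},
]

# One null count per column, analogous to the single-row result above
null_counts = {
    c: count(when(row[c] is None, c) for row in rows)
    for c in rows[0]
}
print(null_counts)  # {'msno': 0, 'city': 2, 'label': 1}
```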

##### Drop Rows with Null Labels
Unlabeled rows are dropped because they cannot be used for training or validation.

In [10]:
%%spark
null_values.label

0    8616
Name: label, dtype: int64

In [11]:
%%spark
old_len = df.count()
df = df.na.drop(subset='label')
new_len = df.count()

dropped = old_len - new_len

print(str(dropped) + ' rows have been dropped')

8616 rows have been dropped
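As a quick consistency check, the figures reported in this notebook can be combined in plain Python. This is only a sketch of the arithmetic using the values printed above (the variable names are illustrative); it does not query Spark.

```python
# Figures reported earlier in this notebook
total_rows = 25849   # df.count() before cleaning
null_labels = 8616   # null_values.label
dropped = 8616       # rows removed by df.na.drop(subset='label')

# Every dropped row should correspond to a null label
assert dropped == null_labels

remaining = total_rows - dropped
print(remaining)  # 17233
```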

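##### Drop Columns Above 90% Null
Columns that are more than 90% null are listed for removal, except 'days_to_churn', which is kept. The pandas-style expression df.na.sum() / len(df) fails on a Spark DataFrame with AttributeError: 'DataFrameNaFunctions' object has no attribute 'sum', so the null fractions are computed with Spark aggregations instead.

In [15]:
%%spark
from pyspark.sql.functions import when, count, col

# Count nulls per column in Spark; df.na has no sum() and len() is undefined for a Spark DataFrame
n_rows = df.count()
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).first().asDict()

# Columns that are more than 90% null, always keeping 'days_to_churn'
to_drop = [c for c, n in null_counts.items() if n / n_rows > 0.9 and c != 'days_to_churn']
to_drop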
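The 90% threshold rule in this section can be sketched in plain Python, independent of Spark. The helper `columns_to_drop` and its parameters are hypothetical names introduced for illustration. The counts are the ones shown by the null check above, which ran before the null-label rows were dropped, so the result is indicative only and may differ from what the cleaned DataFrame would give.

```python
def columns_to_drop(null_counts, n_rows, threshold=0.9, keep=("days_to_churn",)):
    """Return columns whose null fraction exceeds threshold, skipping those in keep."""
    return [
        name for name, nulls in null_counts.items()
        if nulls / n_rows > threshold and name not in keep
    ]


# Counts from the null check above (computed before null-label rows were dropped)
null_counts = {"msno": 0, "time": 0, "city": 3192, "label": 8616,
               "days_to_churn": 21078, "churn_date": 25239}
n_rows = 25849

to_drop = columns_to_drop(null_counts, n_rows)
print(to_drop)  # ['churn_date']
```

In the notebook itself, a list like this could then be passed to `df.drop(*to_drop)` to remove the columns from the Spark DataFrame.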

