# 2.7 Members w/ Features Data Cleaning

##### Description

Basic data inspection, cleaning, and formatting for the members feature matrix (mem-features.csv)

##### Notebook Steps

1. Connect Spark
1. Input Data
1. Examine Data
1. Clean Data
1. Check Data and Output

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark

## 2. Input Data

In [4]:
%%spark
df = spark.read.csv("s3://jolfr-capstone3/interim/mem-features.csv", header=True)

## 3. Examine Data

##### show()

In [4]:
%%spark
df.show()

##### count()

In [5]:
%%spark
df.count()

26428595

##### printSchema()

In [6]:
%%spark
df.printSchema()

root
 |-- msno: string (nullable = true)
 |-- time: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- registered_via: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SUM(transactions.payment_plan_days): string (nullable = true)
 |-- SUM(transactions.plan_list_price): string (nullable = true)
 |-- SUM(transactions.actual_amount_paid): string (nullable = true)
 |-- SUM(transactions.price_difference): string (nullable = true)
 |-- SUM(transactions.planned_daily_price): string (nullable = true)
 |-- SUM(transactions.daily_price): string (nullable = true)
 |-- TIME_SINCE_LAST(transactions.transaction_date): string (nullable = true)
 |-- AVG_TIME_BETWEEN(transactions.transaction_date): string (nullable = true)
 |-- ALL(transactions.is_auto_renew): string (nullable = true)
 |-- ALL(transactions.is_cancel): string (nullable = true)
 |-- MODE(transactions.payment_method_id): string (nullable = true)
 |-- NUM_UNIQUE(transactio

## 4. Clean Data

##### Rename Columns
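Featuretools names its generated columns with dots, for example SUM(transactions.daily_price). Spark treats a dot in an unquoted column reference as nested-field access, so select statements on these columns error out. The dots will be replaced with dashes.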

In [7]:
%%spark
df = df.toDF(*(c.replace('.', '-') for c in df.columns))
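
As a rough illustration of the rename, the plain-Python sketch below (no Spark session needed) applies the same replace to a few column names taken from the schema above. The helper name `rename_dotted` and the collision check are assumptions added for illustration; the notebook itself only calls `toDF`.

```python
def rename_dotted(columns, old=".", new="-"):
    """Replace `old` with `new` in every column name (hypothetical helper)."""
    renamed = [c.replace(old, new) for c in columns]
    # Guard against two different names collapsing into the same one
    if len(set(renamed)) != len(renamed):
        raise ValueError("rename would create duplicate column names")
    return renamed


columns = [
    "msno",
    "SUM(transactions.daily_price)",
    "ALL(transactions.is_cancel)",
]
renamed = rename_dotted(columns)
print(renamed)
```

An alternative that keeps the original names is to quote them with backticks in Spark column references; this notebook does not take that route.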

##### Drop Rows with Null Labels and Cast
All unlabeled rows will be dropped, as they cannot be used for training or validation. The label column will then be cast to a boolean datatype (via an integer cast, because the CSV was read as strings).

In [8]:
%%spark
from pyspark.sql.types import BooleanType, IntegerType

# Count rows before and after dropping rows with a null label
old_len = df.count()
df = df.na.drop(subset=['label'])
new_len = df.count()

# The label was read as a string: cast to integer first, then to boolean
df = df.withColumn("label", df["label"].cast(IntegerType()))
df = df.withColumn("label", df["label"].cast(BooleanType()))

dropped = old_len - new_len
print(str(dropped) + ' rows have been dropped with null labels')

526174 rows have been dropped with null labels

##### Label and Fill Boolean Columns

In [9]:
%%spark
# Create list of boolean columns
bool_cols = [c for c in df.columns if 'ALL' in c or (
    'WEEKEND' in c and 'PERCENT_TRUE' not in c)]

In [10]:
%%spark
import pyspark.sql.functions as F
from functools import reduce

# Replace 'True'/'False' strings with 1/0 to prevent a cast error.
# NOTE: otherwise(0) also fills nulls (and any value other than 'True') with 0
df = reduce(lambda df, c: df.withColumn(c, F.when(df[c] == 'True', 1).otherwise(0)), bool_cols, df)
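
The effect of the when/otherwise expression on individual values can be sketched in plain Python. This is only an analogue of the Spark expression, and the function name `to_flag` is hypothetical. The point is that anything other than the exact string 'True' ends up as 0, including nulls and differently cased strings.

```python
def to_flag(value):
    """Plain-Python analogue of F.when(col == 'True', 1).otherwise(0)."""
    # Only the exact string 'True' maps to 1; None and everything else map to 0
    return 1 if value == "True" else 0


raw_values = ["True", "False", None, "true"]
flags = [to_flag(v) for v in raw_values]
print(flags)
```

Because nulls become 0, the cast columns contain no nulls, which matches the `nullable = false` shown in the schema output for these columns.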


In [11]:
%%spark
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType

# Cast each boolean column
for column in bool_cols:
    df = df.withColumn(column, col(column).cast(BooleanType()))

df.select(bool_cols).printSchema()

root
 |-- ALL(transactions-is_auto_renew): boolean (nullable = false)
 |-- ALL(transactions-is_cancel): boolean (nullable = false)
 |-- WEEKEND(registration_init_time): boolean (nullable = false)
 |-- ALL(transactions-is_auto_renew WHERE is_cancel = 0): boolean (nullable = false)
 |-- ALL(transactions-is_auto_renew WHERE is_cancel = 1): boolean (nullable = false)
 |-- ALL(transactions-is_cancel WHERE is_auto_renew = 0): boolean (nullable = false)
 |-- ALL(transactions-is_cancel WHERE is_auto_renew = 1): boolean (nullable = false)
 |-- ALL(transactions-WEEKEND(transaction_date)): boolean (nullable = false)
 |-- ALL(transactions-WEEKEND(transaction_date) WHERE is_auto_renew = 0): boolean (nullable = false)
 |-- ALL(transactions-WEEKEND(transaction_date) WHERE is_cancel = 0): boolean (nullable = false)
 |-- ALL(transactions-WEEKEND(transaction_date) WHERE is_cancel = 1): boolean (nullable = false)
 |-- ALL(transactions-WEEKEND(transaction_date) WHERE is_auto_renew = 1): boolean (nullable 

##### Label and Fill Numeric Columns

In [12]:
%%spark
# Create list of numeric columns
num_tokens = ('SUM', 'AVG', 'MIN', 'MEAN', 'MAX', 'STD', 'COUNT', 'TOTAL', 'PERCENT_TRUE',
              'MODE', 'NUM_UNIQUE', 'MONTH', 'DAY', 'TIME_SINCE_LAST')
num_cols = [c for c in df.columns if any(t in c for t in num_tokens)
            or ('LAST' in c and 'WEEKEND' not in c)]
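
Both column lists are built from substring matches, so a name could in principle match both rules, in which case the later numeric cast would override the boolean one. The plain-Python sketch below reproduces the two predicates and checks a small sample for overlap. The function names are hypothetical, and the `PERCENT_TRUE(...)` sample name is an assumed example rather than a column confirmed by the output above.

```python
NUMERIC_TOKENS = (
    "SUM", "AVG", "MIN", "MEAN", "MAX", "STD", "COUNT", "TOTAL",
    "PERCENT_TRUE", "MODE", "NUM_UNIQUE", "MONTH", "DAY", "TIME_SINCE_LAST",
)


def is_bool_col(name):
    # Same rule as the boolean column list in this notebook
    return "ALL" in name or ("WEEKEND" in name and "PERCENT_TRUE" not in name)


def is_num_col(name):
    # Same rule as the numeric column list in this notebook
    return any(tok in name for tok in NUMERIC_TOKENS) or (
        "LAST" in name and "WEEKEND" not in name
    )


sample = [
    "msno",
    "SUM(transactions-daily_price)",
    "ALL(transactions-is_cancel)",
    "WEEKEND(registration_init_time)",
    "MODE(transactions-payment_method_id)",
    "PERCENT_TRUE(transactions-is_cancel)",  # assumed example name
]
bool_sample = [c for c in sample if is_bool_col(c)]
num_sample = [c for c in sample if is_num_col(c)]
overlap = sorted(set(bool_sample) & set(num_sample))
print(bool_sample, num_sample, overlap)
```

Running the same check against the full `df.columns` list would confirm whether any real column falls into both groups.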

In [13]:
%%spark
# Generate column means using describe on the subset of columns.
# describe() returns the rows count, mean, stddev, min, max, so index 1 is the mean row
col_means = df.select(num_cols).describe().collect()[1]

In [14]:
%%spark
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Fill nulls with the column mean (values are still strings here), then cast to double
for column in num_cols:
    df = df.na.fill(col_means[column], subset=[column])
    df = df.withColumn(column, col(column).cast(DoubleType()))

df.select(num_cols).printSchema()
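
The fill-then-cast step amounts to mean imputation on string-typed values. A minimal plain-Python sketch of that idea for a single column follows. The function name `mean_impute` is hypothetical, and the sketch assumes every non-null value parses as a number, which the notebook does not verify.

```python
def mean_impute(values):
    """Fill None entries with the mean of the present values, then cast to float."""
    present = [float(v) for v in values if v is not None]
    if not present:
        # Nothing to average over: leave the column untouched
        return list(values)
    mean = sum(present) / len(present)
    return [mean if v is None else float(v) for v in values]


# Values arrive as strings because the CSV was read without a schema
column = ["30", None, "90"]
filled = mean_impute(column)
print(filled)
```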

root
 |-- SUM(transactions-payment_plan_days): double (nullable = true)
 |-- SUM(transactions-plan_list_price): double (nullable = true)
 |-- SUM(transactions-actual_amount_paid): double (nullable = true)
 |-- SUM(transactions-price_difference): double (nullable = true)
 |-- SUM(transactions-planned_daily_price): double (nullable = true)
 |-- SUM(transactions-daily_price): double (nullable = true)
 |-- TIME_SINCE_LAST(transactions-transaction_date): double (nullable = true)
 |-- AVG_TIME_BETWEEN(transactions-transaction_date): double (nullable = true)
 |-- MODE(transactions-payment_method_id): double (nullable = true)
 |-- NUM_UNIQUE(transactions-payment_method_id): double (nullable = true)
 |-- MIN(transactions-payment_plan_days): double (nullable = true)
 |-- MIN(transactions-plan_list_price): double (nullable = true)
 |-- MIN(transactions-actual_amount_paid): double (nullable = true)
 |-- MIN(transactions-price_difference): double (nullable = true)
 |-- MIN(transactions-planned_dail

##### Check Remaining Columns

In [15]:
%%spark
from pyspark.sql.types import StringType

# List the columns that are still strings after the boolean and numeric casts
str_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, StringType)]

str_cols

['msno', 'time', 'city', 'bd', 'registered_via', 'gender', 'days_to_churn', 'churn_date']

##### Drop Categoricals with High Null Count
city, bd, registered_via, and gender are all categoricals that have a high number of null values. Imputing these features in a logical manner is not possible because they are categoricals, so they will be dropped.

In [16]:
%%spark
columns_to_drop = ['city', 'bd', 'registered_via', 'gender']
df = df.drop(*columns_to_drop)

##### Drop Future Features
The columns days_to_churn and churn_date are both calculated using information that is not available at inference time, because they require that a churn has already occurred. They will be dropped.

In [17]:
%%spark
columns_to_drop = ['days_to_churn', 'churn_date']
df = df.drop(*columns_to_drop)
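
A small guard against leakage can be sketched in plain Python: after the drop, none of the future-dependent columns should remain. The helper `drop_columns` is hypothetical, and the input list mirrors the string columns that are left once the categoricals have been removed.

```python
LEAKY_COLUMNS = {"days_to_churn", "churn_date"}


def drop_columns(columns, to_drop):
    """Return the column names that are not in to_drop, preserving order."""
    return [c for c in columns if c not in to_drop]


# String columns still present after the categorical drop
remaining = ["msno", "time", "days_to_churn", "churn_date"]
kept = drop_columns(remaining, LEAKY_COLUMNS)

# Fail loudly if a future-dependent column survived the drop
leaked = LEAKY_COLUMNS & set(kept)
print(kept, leaked)
```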

## 5. Check Data and Output

In [18]:
%%spark
print((df.count(), len(df.columns)))

(25902421, 247)
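
The final row count can be reconciled with the earlier outputs as a quick sanity check. The sketch below uses only the numbers printed in this notebook; the variable names are illustrative.

```python
# Row counts copied from the notebook outputs
rows_read = 26_428_595               # df.count() after loading
rows_dropped_null_label = 526_174    # rows dropped for null labels
rows_final = 25_902_421              # df.count() before writing

# No other step in this notebook removes rows, so these should agree
expected_final = rows_read - rows_dropped_null_label
counts_match = expected_final == rows_final

# Share of the input that had no label
null_label_share = rows_dropped_null_label / rows_read
print(counts_match, f"{null_label_share:.2%}")
```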

In [19]:
%%spark
df.printSchema()

root
 |-- msno: string (nullable = true)
 |-- time: string (nullable = true)
 |-- SUM(transactions-payment_plan_days): double (nullable = true)
 |-- SUM(transactions-plan_list_price): double (nullable = true)
 |-- SUM(transactions-actual_amount_paid): double (nullable = true)
 |-- SUM(transactions-price_difference): double (nullable = true)
 |-- SUM(transactions-planned_daily_price): double (nullable = true)
 |-- SUM(transactions-daily_price): double (nullable = true)
 |-- TIME_SINCE_LAST(transactions-transaction_date): double (nullable = true)
 |-- AVG_TIME_BETWEEN(transactions-transaction_date): double (nullable = true)
 |-- ALL(transactions-is_auto_renew): boolean (nullable = false)
 |-- ALL(transactions-is_cancel): boolean (nullable = false)
 |-- MODE(transactions-payment_method_id): double (nullable = true)
 |-- NUM_UNIQUE(transactions-payment_method_id): double (nullable = true)
 |-- MIN(transactions-payment_plan_days): double (nullable = true)
 |-- MIN(transactions-plan_list_pri

In [20]:
%%spark
df.write.format("com.databricks.spark.csv").option("header", "true").mode('overwrite').save('s3://jolfr-capstone3/clean/mem-features.csv')
print('DONE!')

DONE!