# Break Down the Curated Data
## Author: Dulan Wijeratne 1181873

In this notebook we break the curated data from the ETL down into five groups, so that later code can run without hitting memory-related errors. The groups are:
1. Merchant 
2. Consumers
3. Orders
4. Postcodes
5. Descriptions

To start, we create a Spark session and import the combined orders dataset, which contains every feature related to the orders.

In [2]:
from pyspark.sql import SparkSession, functions as f
import os

In [3]:

if not os.path.exists("../../../data/insights"):
    os.makedirs("../../../data/insights")

if not os.path.exists("../../../data/insights/pre_insights"):
    os.makedirs("../../../data/insights/pre_insights")
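
As a minimal sketch of an alternative to the two existence checks above: `os.makedirs` creates any missing parent directories, and `exist_ok=True` makes a repeated call harmless. The temporary root below is a hypothetical stand-in used only so the example does not touch the real `../../../data` tree.

```python
import os
import tempfile

# Hypothetical root for this demonstration only; the notebook itself
# writes to "../../../data/insights/pre_insights".
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "data", "insights", "pre_insights")

    # makedirs creates every missing parent directory, and exist_ok=True
    # turns a repeat call into a no-op instead of raising FileExistsError.
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)
```

A single call on the deepest path would cover both `insights` and `insights/pre_insights`.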

In [4]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)

23/10/19 02:25:44 WARN Utils: Your hostname, DulanComputer resolves to a loopback address: 127.0.1.1; using 172.30.15.25 instead (on interface eth0)
23/10/19 02:25:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/19 02:25:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/19 02:26:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Reading in the ETL Data

In [5]:
all_data_combined = spark.read.parquet("../../../data/curated/all_data_combined.parquet")

                                                                                

In [6]:
all_data_combined.show()

23/10/19 01:51:43 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+-----------------+------------+--------------+-------+-----------+----------------+--------------+---------------+------------------+----------------------------+--------------------+--------------------+----------------------+--------------------+--------------------+----------------------------+-----------------+-----------------+-----------------+-----------------+--------------+--------------+---------------------------------+-------------------------------+----------------------+-----------------------------+--------------------------------+-----------------------------+--------------------------+
|consumer_postcode|merchant_abn|order_datetime|user_id|consumer_id|   consumer_name|consumer_state|consumer_gender|      dollar_value|consumer_fraud_probability_%|       merchant_name|merchant_description|merchant_revenue_level|merchant_take_rate_%|             segment|merchant_fraud_probability_%|consumer_sa2_code|consumer_sa4_name|consumer_sa3_name|consumer_sa2_name|sa2_population|sa2

In [6]:
all_data_combined.columns

['consumer_postcode',
 'merchant_abn',
 'order_datetime',
 'user_id',
 'consumer_id',
 'consumer_name',
 'consumer_state',
 'consumer_gender',
 'dollar_value',
 'consumer_fraud_probability_%',
 'merchant_name',
 'merchant_description',
 'merchant_revenue_level',
 'merchant_take_rate_%',
 'segment',
 'merchant_fraud_probability_%',
 'consumer_sa2_code',
 'consumer_sa4_name',
 'consumer_sa3_name',
 'consumer_sa2_name',
 'sa2_population',
 'sa2_median_age',
 'sa2_median_mortgage_repay_monthly',
 'sa2_median_tot_prsnl_inc_weekly',
 'sa2_median_rent_weekly',
 'sa2_median_tot_fam_inc_weekly',
 'sa2_average_num_psns_per_bedroom',
 'sa2_median_tot_hhd_inc_weekly',
 'sa2_average_household_size']

### Removing Unnecessary Columns

In [7]:
# Drop the columns that none of the groups below need
all_df = all_data_combined.drop("consumer_gender", "user_id")

In [8]:
all_df.columns

['consumer_postcode',
 'merchant_abn',
 'order_datetime',
 'consumer_id',
 'consumer_name',
 'consumer_state',
 'dollar_value',
 'consumer_fraud_probability_%',
 'merchant_name',
 'merchant_description',
 'merchant_revenue_level',
 'merchant_take_rate_%',
 'segment',
 'merchant_fraud_probability_%',
 'consumer_sa2_code',
 'consumer_sa4_name',
 'consumer_sa3_name',
 'consumer_sa2_name',
 'sa2_population',
 'sa2_median_age',
 'sa2_median_mortgage_repay_monthly',
 'sa2_median_tot_prsnl_inc_weekly',
 'sa2_median_rent_weekly',
 'sa2_median_tot_fam_inc_weekly',
 'sa2_average_num_psns_per_bedroom',
 'sa2_median_tot_hhd_inc_weekly',
 'sa2_average_household_size']

### Saving the Data
We will save the data as separate grouped datasets. We will also rename some columns in the postcode group to make them easier to interpret.

In [14]:
merchant = all_df.select(
    "merchant_abn",
    "merchant_name",
    "merchant_revenue_level",
    "merchant_take_rate_%",
    "merchant_fraud_probability_%",
)

merchant.write.mode("overwrite").parquet("../../../data/insights/pre_insights/merchant.parquet")

                                                                                

In [15]:
consumers = all_df.select("merchant_abn",
                "merchant_name",
                "consumer_id",
                "consumer_name",
                "consumer_fraud_probability_%")
consumers.write.mode("overwrite").parquet("../../../data/insights/pre_insights/consumers.parquet")

                                                                                

In [11]:
orders = all_df.select(all_df.merchant_abn,
                       all_df.merchant_name,
                       all_df.consumer_id,
                       all_df.order_datetime,
                       all_df.dollar_value)

orders.write.mode("overwrite").parquet("../../../data/insights/pre_insights/orders.parquet")

                                                                                

In [9]:
descriptions = all_df.select(all_df.merchant_abn,
                     all_df.merchant_name,
                     all_df.merchant_description,
                     all_df.segment)

descriptions.write.mode("overwrite").parquet("../../../data/insights/pre_insights/descriptions.parquet")

                                                                                

In [18]:
postcode = all_df.select(all_df.merchant_abn,
                         all_df.consumer_postcode,
                         all_df.consumer_id,
                         all_df.merchant_name,
                         f.col("sa2_population").alias("estimated_population"),
                         f.col("sa2_median_age").alias("median_age"),
                         f.col("sa2_median_mortgage_repay_monthly").alias("median_mortgage_monthly"),
                         f.col("sa2_median_tot_prsnl_inc_weekly").alias("total_weekly_personal_income"),
                         f.col("sa2_median_rent_weekly").alias("median_weekly_rent"),
                         f.col("sa2_median_tot_fam_inc_weekly").alias("total_weekly_fam_income"),
                         f.col("sa2_average_num_psns_per_bedroom").alias("avg_num_persons_per_bedroom"),
                         f.col("sa2_median_tot_hhd_inc_weekly").alias("total_hhd_income_weekly"),
                         f.col("sa2_average_household_size").alias("avg_household_size"))

postcode.write.mode("overwrite").parquet("../../../data/insights/pre_insights/postcode.parquet")
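
The five selections above can also be written down as a plain mapping and checked against the column list before anything is written. The sketch below is illustrative, not part of the original pipeline: `GROUPS`, `missing_columns` and `unused_columns` are hypothetical names, and the postcode group lists the source column names as they are before `alias` is applied.

```python
# Columns of all_df after consumer_gender and user_id were dropped.
ALL_COLUMNS = [
    "consumer_postcode", "merchant_abn", "order_datetime", "consumer_id",
    "consumer_name", "consumer_state", "dollar_value",
    "consumer_fraud_probability_%", "merchant_name", "merchant_description",
    "merchant_revenue_level", "merchant_take_rate_%", "segment",
    "merchant_fraud_probability_%", "consumer_sa2_code", "consumer_sa4_name",
    "consumer_sa3_name", "consumer_sa2_name", "sa2_population",
    "sa2_median_age", "sa2_median_mortgage_repay_monthly",
    "sa2_median_tot_prsnl_inc_weekly", "sa2_median_rent_weekly",
    "sa2_median_tot_fam_inc_weekly", "sa2_average_num_psns_per_bedroom",
    "sa2_median_tot_hhd_inc_weekly", "sa2_average_household_size",
]

# Hypothetical mapping: output name -> source columns selected for it.
GROUPS = {
    "merchant": ["merchant_abn", "merchant_name", "merchant_revenue_level",
                 "merchant_take_rate_%", "merchant_fraud_probability_%"],
    "consumers": ["merchant_abn", "merchant_name", "consumer_id",
                  "consumer_name", "consumer_fraud_probability_%"],
    "orders": ["merchant_abn", "merchant_name", "consumer_id",
               "order_datetime", "dollar_value"],
    "descriptions": ["merchant_abn", "merchant_name",
                     "merchant_description", "segment"],
    "postcode": ["merchant_abn", "consumer_postcode", "consumer_id",
                 "merchant_name", "sa2_population", "sa2_median_age",
                 "sa2_median_mortgage_repay_monthly",
                 "sa2_median_tot_prsnl_inc_weekly", "sa2_median_rent_weekly",
                 "sa2_median_tot_fam_inc_weekly",
                 "sa2_average_num_psns_per_bedroom",
                 "sa2_median_tot_hhd_inc_weekly",
                 "sa2_average_household_size"],
}


def missing_columns(groups, available):
    """Map each group to the columns it requests that are not available."""
    result = {}
    for name, cols in groups.items():
        absent = [c for c in cols if c not in available]
        if absent:
            result[name] = absent
    return result


def unused_columns(groups, available):
    """List available columns that no group selects, in their original order."""
    used = {c for cols in groups.values() for c in cols}
    return [c for c in available if c not in used]


missing = missing_columns(GROUPS, ALL_COLUMNS)
unused = unused_columns(GROUPS, ALL_COLUMNS)
```

Run against the lists above, `missing` is empty and `unused` contains `consumer_state` and the four `consumer_sa2_code` / `consumer_sa*_name` columns, which none of the five groups carry forward. With a live session, something like `all_df.select(*GROUPS["orders"])` should reproduce the orders selection.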



                                                                                

In [19]:
spark.stop()

### Summary
The dataset from the ETL was split into five groups:
1. Merchant 
2. Consumers
3. Orders
4. Postcodes
5. Descriptions
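
A quick way to confirm that all five outputs were written is to check for their paths on disk. The sketch below assumes the file names used in the write calls of this notebook (note that the Postcodes group is saved as `postcode.parquet`); `missing_outputs` is a hypothetical helper, and the temporary directory only simulates the `pre_insights` folder.

```python
import os
import tempfile

# Output names as used in the notebook's write calls.
GROUP_FILES = ["merchant", "consumers", "orders", "postcode", "descriptions"]


def missing_outputs(base_dir, names=GROUP_FILES):
    """Return the names whose '<name>.parquet' path is absent under base_dir."""
    return [
        name for name in names
        if not os.path.exists(os.path.join(base_dir, f"{name}.parquet"))
    ]


# Demonstration on a temporary folder: create four of the five outputs
# (Spark writes each parquet dataset as a directory) and report the gap.
with tempfile.TemporaryDirectory() as base:
    for name in GROUP_FILES[:4]:
        os.makedirs(os.path.join(base, f"{name}.parquet"))
    demo_missing = missing_outputs(base)
```

In the notebook itself the call would be `missing_outputs("../../../data/insights/pre_insights")`, which should return an empty list once every cell has run.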