# Preparing the Spark job

In this section we will explore our full dataset with a Spark mindset. We already explored it with pandas in the introduction to learn what it contains and what it looks like, but now we need to prepare it, with the appropriate data types, for the calculations and aggregations we need. We will start with the data sample and then try the same scheme on the full dataset. As a result, we will build a program that can be run on HDFS.

The first step is to download the files.

In [None]:
#Sample
!wget 'https://files.datapress.com/london/dataset/smartmeter-energy-use-data-in-london-households/UKPN-LCL-smartmeter-sample.csv'

In [None]:
#Full dataset
!wget 'https://files.datapress.com/london/dataset/smartmeter-energy-use-data-in-london-households/Power-Networks-LCL-June2015(withAcornGps).zip'

The file is compressed in .zip format. The problem with this format is that it is not splittable: it cannot be broken into pieces that can each be read independently. Therefore, if we attempted a MapReduce operation directly on it we would get useless binary output. To be able to use it without decompressing it to disk (we do not have enough disk space), we will recompress it on the fly to bzip2, which can be split, using pipes.

In [None]:
!unzip -c 'Power-Networks-LCL-June2015(withAcornGps).zip'  | bzip2 > 'Power-Networks-LCL-June2015(withAcornGps).bz2'
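The same streaming idea can be sketched with Python's standard library, in case the shell tools are not available. This is only an illustration: the function name `zip_member_to_bz2` and its parameters are made up for this example, and it assumes the name of the member inside the archive is known.

```python
import bz2
import shutil
import zipfile


def zip_member_to_bz2(zip_path, member_name, bz2_path, chunk_size=1024 * 1024):
    """Recompress one member of a zip archive as bzip2 without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open returns a file-like object that decompresses on the fly
        with archive.open(member_name) as source:
            with bz2.BZ2File(bz2_path, 'wb') as target:
                # Copy in fixed-size chunks so that memory use stays bounded
                shutil.copyfileobj(source, target, chunk_size)
```

Only the bytes of the chosen member are copied, so the resulting file starts directly with the CSV header.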

Now we are ready to start. The commented lines would be needed if our virtual machine had not already been configured to import SparkContext and assign it to sc on startup.

In [29]:
#from pyspark import SparkContext
#sc = SparkContext()
from pyspark.sql.types import DateType
from datetime import datetime
import numpy as np
from pyspark import Row
sc

<pyspark.context.SparkContext at 0x7f6cef748d90>

In [31]:
sqlContext

<pyspark.sql.context.SQLContext at 0x7f235c0bb290>

Let us start with the sample.

In [1]:
path_sample = 'UKPN-LCL-smartmeter-sample.csv'

In [2]:
data_sample = sc.textFile(path_sample)

In [3]:
data_sample.take(3)

[u'LCLid,stdorToU,DateTime,KWH/hh (per half hour) ,Acorn,Acorn_grouped',
 u'MAC003718,Std,17/10/2012 13:00:00,0.09,ACORN-A,Affluent',
 u'MAC003718,Std,17/10/2012 13:30:00,0.16,ACORN-A,Affluent']

The header is useful in pandas but we need to remove it when working with Spark.

In [4]:
header = data_sample.first()

In [5]:
data_sample2 = data_sample.filter(lambda l: l != header).persist()

In [6]:
data_sample2.take(3)

[u'MAC003718,Std,17/10/2012 13:00:00,0.09,ACORN-A,Affluent',
 u'MAC003718,Std,17/10/2012 13:30:00,0.16,ACORN-A,Affluent',
 u'MAC003718,Std,17/10/2012 14:00:00,0.212,ACORN-A,Affluent']

Now we will map the file, extracting the relevant fields and converting them to the appropriate types. We will ignore the last field, which is a higher-level grouping of the ACORN category given in the second-to-last field. ACORN is a UK consumer socioeconomic segmentation system that may be useful in this work (see http://acorn.caci.co.uk/downloads/Acorn-User-guide.pdf).

In [74]:
def line2tuple(l):
    fields = l.split(',')
    if len(fields) == 6:
        ID = fields[0]
        tariff = fields[1]
        DateTime = datetime.strptime(fields[2], '%d/%m/%Y %H:%M:%S')
        ACORN = fields[4]
        try:
            consumption = float(fields[3])
        except ValueError:
            consumption = np.nan
        return (ID, tariff, DateTime, consumption, ACORN)
    else:
        return (np.nan,np.nan,np.nan,np.nan,np.nan)

In [75]:
data_sample2.map(line2tuple).take(3)

[(u'MAC003718',
  u'Std',
  datetime.datetime(2012, 10, 17, 13, 0),
  0.09,
  u'ACORN-A'),
 (u'MAC003718',
  u'Std',
  datetime.datetime(2012, 10, 17, 13, 30),
  0.16,
  u'ACORN-A'),
 (u'MAC003718',
  u'Std',
  datetime.datetime(2012, 10, 17, 14, 0),
  0.212,
  u'ACORN-A')]

In [71]:
rows = data_sample2.map(line2tuple)\
.map(lambda x: Row(ID = x[0], Tariff = x[1], DateTime = x[2], kWh_30min = x[3], ACORN = x[4]))

In [72]:
df = rows.toDF()

In [73]:
df.show(5)

+-------+--------------------+---------+------+---------+
|  ACORN|            DateTime|       ID|Tariff|kWh_30min|
+-------+--------------------+---------+------+---------+
|ACORN-A|2012-10-17 13:00:...|MAC003718|   Std|     0.09|
|ACORN-A|2012-10-17 13:30:...|MAC003718|   Std|     0.16|
|ACORN-A|2012-10-17 14:00:...|MAC003718|   Std|    0.212|
|ACORN-A|2012-10-17 14:30:...|MAC003718|   Std|    0.145|
|ACORN-A|2012-10-17 15:00:...|MAC003718|   Std|    0.104|
+-------+--------------------+---------+------+---------+
only showing top 5 rows



We can try calculating the mean grouped by tariff. In this case, there is only one consumer, who is subject to the standard tariff.

In [34]:
df.dropna().groupBy('Tariff').mean().collect()

[Row(Tariff=u'Std', avg(kWh_30min)=0.20900675947184585)]
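For reference, the aggregation above amounts to the following plain-Python computation, which can be handy for sanity-checking a few rows by hand. It is a simplified sketch: the helper name `mean_by_key` is made up for this example, and it only looks for NaN in the consumption value, not in the other columns.

```python
import math
from collections import defaultdict


def mean_by_key(records):
    """Mean of the values per key, skipping records whose value is NaN."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in records:
        if math.isnan(value):
            # Skip missing readings, the role dropna() is meant to play above
            continue
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

For instance, `mean_by_key([('Std', 0.09), ('Std', 0.16), ('Std', float('nan'))])` averages only the two valid readings.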

In [54]:
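#Checking the types, everything looks correct.
#(printSchema is referenced without parentheses, so the output is the method's repr, which still lists the column types)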
df.printSchema

<bound method DataFrame.printSchema of DataFrame[DateTime: timestamp, ID: string, Tariff: string, kWh_30min: double]>

Now let us start tackling the real thing:

In [36]:
path = 'Power-Networks-LCL-June2015(withAcornGps).bz2'

In [37]:
data = sc.textFile(path)

In [38]:
data.take(5)

[u'Archive:  Power-Networks-LCL-June2015(withAcornGps).zip',
 u'  inflating: Power-Networks-LCL-June2015(withAcornGps)v2.csv  ',
 u'LCLid,stdorToU,DateTime,KWH/hh (per half hour) ,Acorn,Acorn_grouped',
 u'MAC000002,Std,2012-10-12 00:30:00.0000000, 0 ,ACORN-A,Affluent',
 u'MAC000002,Std,2012-10-12 01:00:00.0000000, 0 ,ACORN-A,Affluent']

There are two new problems.

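First of all, there are two extra lines above the header. They were written by unzip -c, which prints the archive and member names along with the file contents (unzip -p would have sent only the contents to the pipe). We now need to remove 3 lines before mapping.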

In [39]:
data_no_header = data.zipWithIndex().filter(lambda x: x[1] > 2).keys()
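Note that zipWithIndex has to trigger a Spark job when the RDD has more than one partition. A possible alternative, sketched below, drops the leading lines of the first partition only. The function name `drop_first_lines` is made up for this example, and the sketch assumes that all three unwanted lines fall in partition 0, which should hold when they are the first lines of a single file.

```python
from itertools import islice


def drop_first_lines(partition_index, lines, n_skip=3):
    """Drop the first n_skip lines of partition 0; pass other partitions through."""
    if partition_index == 0:
        # islice skips the leading lines lazily, without loading the partition
        return islice(lines, n_skip, None)
    return lines

# With a SparkContext this could be applied as (not run here):
# data_no_header = data.mapPartitionsWithIndex(drop_first_lines)
```

The function receives the partition index and an iterator over that partition's lines, which is the calling convention of mapPartitionsWithIndex.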

In [57]:
data_no_header.map(line2tuple).take(3)

[(u'MAC000002', u'Std', u'2012-10-12 00:30:00.0000000', 0.0),
 (u'MAC000002', u'Std', u'2012-10-12 01:00:00.0000000', 0.0),
 (u'MAC000002', u'Std', u'2012-10-12 01:30:00.0000000', 0.0)]

Moreover, the DateTime format has changed compared to the sample provided! (The output above was obtained with the updated mapping function defined in the next cell, as the cell numbers show; the original one fails on this format.) Not very nice; let's hope that at least it is consistent throughout the full dataset.
We must change our mapping function accordingly. Strings in this format can later be cast to pyspark.sql TimestampType for the full dataset, so we will leave DateTime as a string for now.

In [56]:
def line2tuple(l):
    fields = l.split(',')
    if len(fields) == 6:
        ID = fields[0]
        tariff = fields[1]
        DateTime = fields[2]
        try:
            consumption = float(fields[3])
        except ValueError:
            consumption = np.nan
        return (ID, tariff, DateTime, consumption)
    else:
        return (np.nan,np.nan,np.nan,np.nan)
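As an alternative to deferring the conversion, the two layouts seen so far could be parsed in Python directly. The helper below is a sketch under assumptions: the name `parse_timestamp` and the two format constants are made up for this example, and it discards the fractional seconds, which are all zeros in the rows shown above (strptime's %f directive accepts at most six digits, while the full dataset carries seven).

```python
from datetime import datetime

SAMPLE_FORMAT = '%d/%m/%Y %H:%M:%S'  # e.g. 17/10/2012 13:00:00
FULL_FORMAT = '%Y-%m-%d %H:%M:%S'    # e.g. 2012-10-12 00:30:00.0000000


def parse_timestamp(text):
    """Parse either timestamp layout; return None if neither matches."""
    # Keep only the part before the fractional seconds, if any
    head = text.strip().split('.')[0]
    for fmt in (SAMPLE_FORMAT, FULL_FORMAT):
        try:
            return datetime.strptime(head, fmt)
        except ValueError:
            continue
    return None
```

Returning None for unparseable values keeps malformed lines from raising inside a map, at the cost of having to filter them out afterwards.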

We create our dataframe and check that DateTime is a string variable.

In [58]:
rows = data_no_header.map(line2tuple)\
.map(lambda x: Row(ID = x[0], Tariff = x[1], DateTime = x[2], kWh_30min = x[3]))

In [59]:
df = rows.toDF()

In [60]:
df.printSchema

<bound method DataFrame.printSchema of DataFrame[DateTime: string, ID: string, Tariff: string, kWh_30min: double]>

We change DateTime to pyspark.sql TimestampType (this is equivalent to datetime.datetime).

In [65]:
from pyspark.sql.types import TimestampType
df2 = df.withColumn('DateTime', df['DateTime'].cast(TimestampType()))

In [80]:
df2.printSchema

<bound method DataFrame.printSchema of DataFrame[DateTime: timestamp, ID: string, Tariff: string, kWh_30min: double]>

In [76]:
df2.show(5)

+--------------------+---------+------+---------+
|            DateTime|       ID|Tariff|kWh_30min|
+--------------------+---------+------+---------+
|2012-10-12 00:30:...|MAC000002|   Std|      0.0|
|2012-10-12 01:00:...|MAC000002|   Std|      0.0|
|2012-10-12 01:30:...|MAC000002|   Std|      0.0|
|2012-10-12 02:00:...|MAC000002|   Std|      0.0|
|2012-10-12 02:30:...|MAC000002|   Std|      0.0|
+--------------------+---------+------+---------+
only showing top 5 rows



Let's check whether calculating the mean grouped by tariff works. If it does, we have cleaned the dataset successfully.

In [79]:
df2.dropna().groupBy('Tariff').mean().collect()

[Row(Tariff=u'Std', avg(kWh_30min)=0.21507225601128732),
 Row(Tariff=u'ToU', avg(kWh_30min)=0.1986226410448441)]

That's great! We have also learnt the keys used to name the two tariff groups we expected to find ('Std' and 'ToU').

It will be useful in our analysis to know which consumers were subject to the Dynamic Time of Use (ToU) tariff during 2013. These had a standard tariff during the rest of the period included in the dataset, so we can evaluate behavioural changes due to the tariff scheme.

In [94]:
df2.filter(df2['Tariff'] == 'ToU').select('ID').drop_duplicates()

DataFrame[ID: string]

We can add a variable to our dataframe indicating whether a user is always subject to the standard flat rate or is subject to the dToU tariff during 2013. In order to be more efficient this should be done before mapping the text file.

We can now write our script to parse the full text file dataset.

In [96]:
####to be checked
data = sc.textFile(path)
data_no_header = data.zipWithIndex().filter(lambda x: x[1] > 2).keys()
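#Collect the distinct IDs of the ToU consumers into a Python set on the driver,
#so that "ID in ID_ToU" can be evaluated inside a map function
ID_ToU = set(data_no_header.map(lambda l: l.split(','))\
.filter(lambda f: len(f) == 6 and f[1] == 'ToU')\
.map(lambda f: f[0]).distinct().collect())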

KeyboardInterrupt: 

In [56]:
####to be checked
def line2tuple(l):
    fields = l.split(',')
    if len(fields) == 6:
        ID = fields[0]
        tariff = fields[1]
        DateTime = fields[2]
        ToU_User = 1 if ID in ID_ToU else 0  
        try:
            consumption = float(fields[3])
        except ValueError:
            consumption = np.nan
        return (ID, tariff, DateTime, consumption, ToU_User)
    else:
        return (np.nan,np.nan,np.nan,np.nan,np.nan)
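The two-pass logic of these unchecked cells can be tried on a handful of lines in plain Python before running it on the cluster. This is a minimal sketch, not the final job: the function names `tou_ids` and `flag_tou_user` are made up for this example, and any lines fed to it by hand are illustrative rather than taken from the dataset.

```python
def tou_ids(lines):
    """Return the set of consumer IDs that appear at least once with tariff 'ToU'."""
    ids = set()
    for line in lines:
        fields = line.split(',')
        # Same shape check as line2tuple: only well-formed lines are considered
        if len(fields) == 6 and fields[1] == 'ToU':
            ids.add(fields[0])
    return ids


def flag_tou_user(line, ids):
    """Return (ID, ToU_User), where ToU_User is 1 for IDs in ids and 0 otherwise."""
    consumer_id = line.split(',')[0]
    return (consumer_id, 1 if consumer_id in ids else 0)
```

The first pass builds the set of IDs and the second pass only needs a set lookup per line, which is why the set is computed before mapping the text file.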