<font color="#04B404"><h1 align="center">PySpark</h1></font>
<font color="#6E6E6E"><h2 align="center">HHAR dataset</h2></font>
<br>
<span>In this notebook we present a basic treatment of the 'Human Activity Recognition' dataset
(HHAR, https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition+Data+Set), which contains data on people's physical movements collected through the sensors (gyroscopes and accelerometers) of their phones and watches. More specifically, we will process four CSV files: Phones_accelerometer.csv, Phones_gyroscope.csv,
Watch_accelerometer.csv and Watch_gyroscope.csv. All of them share the following columns:
    
- Index: record id.
- Arrival_Time.
- Creation_Time.
- x, y, z: measured values along the x, y and z axes.
- User: user identifier from 'a' to 'i'.
- Model: phone/watch model.
- Device: the specific device that took the measurement.
- Gt: activity the user was performing during the measurement. Possible values are: bike, sit, stand, walk, stairsup, stairsdown and null.

The goal is to process these files to obtain a dataset with the following information (the exact format is not required):
User,Model,gt,mean(x,y,z),stdev(x,y,z),max(x,y,z),min(x,y,z)
</span>

In [28]:
# Spark libraries:
from pyspark import SparkContext, SparkConf
# Standard library:
from statistics import mean, stdev
import time as t  # to measure execution times

In [29]:
# Reuse the running SparkContext if there is one; otherwise create it.
sc = SparkContext.getOrCreate()

In [30]:
'''
Input  -> CSV file
Output -> RDD with format User | Model | Gt | Index | Arrival_Time | Creation_Time | x | y | z | Device

'''

def createRDD(file):
    
    RDD = sc.textFile(file) \
            .map(lambda el: el.split(',')) \
            .map(lambda el: (el[6], el[7], el[9], int(el[0]), int(el[1]), float(el[2]), \
                             float(el[3]), float(el[4]), float(el[5]), el[8]) ) 
            
    return RDD
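
The parsing step inside createRDD can be tried on a single line without Spark. The sketch below is only an illustration: `parse_line` is a hypothetical helper, the sample line contains made-up values, and skipping a header row is an assumption (the files read above may not have one).

```python
# Hypothetical helper: parse one CSV line into the tuple layout produced by createRDD.
# Assumed column order: Index, Arrival_Time, Creation_Time, x, y, z, User, Model, Device, gt

def parse_line(line):
    """Return (User, Model, Gt, Index, Arrival_Time, Creation_Time, x, y, z, Device),
    or None if the line is a header row or does not have exactly 10 fields."""
    el = line.strip().split(',')
    if len(el) != 10 or not el[0].isdigit():
        return None  # header row ("Index,...") or malformed line
    return (el[6], el[7], el[9], int(el[0]), int(el[1]), float(el[2]),
            float(el[3]), float(el[4]), float(el[5]), el[8])

# Made-up sample line, for illustration only:
sample = "0,1000,2000,-5.5,0.5,8.0,a,nexus4,nexus4_1,stand"
print(parse_line(sample))

# With Spark, this could replace the two map() calls, for example:
#   sc.textFile(file).map(parse_line).filter(lambda r: r is not None)
```

Filtering out `None` keeps the job from failing on the first unparsable line, at the cost of silently dropping such lines.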

In [31]:
'''Function that receives an RDD and a string variable ('x', 'y' or 'z') and returns another RDD with (three-field key),
mean(col), std(col), max(col), min(col). It uses the mean and stdev functions from the statistics library.

Input: rdd + string variable to select
Output: rdd with format ((key), mean, std, max, min)
'''

def calc_estad(rdd, var):
    if var == 'x':
        alt = rdd.map(lambda el: ( (el[0], el[1], el[2]) , el[6]) )
    elif var == 'y':
        alt = rdd.map(lambda el: ( (el[0], el[1], el[2]) , el[7]) )
    else:
        alt = rdd.map(lambda el: ( (el[0], el[1], el[2]) , el[8]) )
        
    rdd_est = alt.groupByKey() \
                 .map(lambda el: (el[0], list(el[1]) ) ) \
                 .map(lambda el: (el[0], (mean(el[1]), stdev(el[1]), max(el[1]), min(el[1])) ) ) 
            
            
    return rdd_est
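
calc_estad groups every value of a key into a list before computing the statistics. As a possible alternative, the same four statistics can be accumulated in a single pass. The sketch below is not the approach used in this notebook: `ZERO`, `seq_op`, `comb_op` and `finalize` are hypothetical names, and the sum-of-squares formula for the sample standard deviation is less numerically stable than `statistics.stdev`.

```python
import math
from functools import reduce

# Accumulator layout (an assumption of this sketch): (count, sum, sum_of_squares, max, min)
ZERO = (0, 0.0, 0.0, float('-inf'), float('inf'))

def seq_op(acc, value):
    """Fold one value into a partial accumulator."""
    n, s, sq, mx, mn = acc
    return (n + 1, s + value, sq + value * value, max(mx, value), min(mn, value))

def comb_op(a, b):
    """Merge two partial accumulators (e.g. from different partitions)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], max(a[3], b[3]), min(a[4], b[4]))

def finalize(acc):
    """Turn an accumulator into (mean, sample stdev, max, min)."""
    n, s, sq, mx, mn = acc
    avg = s / n
    if n > 1:
        var = (sq - n * avg * avg) / (n - 1)
        std = math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors
    else:
        std = float('nan')  # sample stdev is undefined for a single value
    return (avg, std, mx, mn)

# Simulate two partitions without Spark:
part1 = reduce(seq_op, [1.0, 2.0], ZERO)
part2 = reduce(seq_op, [3.0, 4.0], ZERO)
stats = finalize(comb_op(part1, part2))
print(stats)

# With Spark, these functions could be plugged in as:
#   alt.aggregateByKey(ZERO, seq_op, comb_op).mapValues(finalize)
```

Unlike `stdev`, which raises an error for a key with a single measurement, this sketch returns NaN in that case.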

In [32]:
aux1 = createRDD("../data/small_data/Phones_accelerometer.csv")

In [33]:
# Join on three axis x, y, z:
rddPhAc = calc_estad(aux1, 'x').join(calc_estad(aux1, 'y')).join(calc_estad(aux1, 'z'))

In [34]:
# Repeat the treatment for the Phones_gyroscope.csv file:
aux2 = createRDD("../data/small_data/Phones_gyroscope.csv")

In [35]:
rddPhGyr = calc_estad(aux2, 'x').join(calc_estad(aux2, 'y')).join(calc_estad(aux2, 'z'))

In [36]:
# Joining the phone accelerometer and gyroscope RDDs:
rddPhones = rddPhAc.join(rddPhGyr)

In [37]:
# Same treatment for the watch files:
aux3 = createRDD("../data/small_data/Watch_accelerometer.csv")
rddWatAc = calc_estad(aux3, 'x').join(calc_estad(aux3, 'y')).join(calc_estad(aux3, 'z'))

In [38]:
# Watch_gyroscope.csv
aux4 = createRDD("../data/small_data/Watch_gyroscope.csv")
rddWatGyr = calc_estad(aux4, 'x').join(calc_estad(aux4, 'y')).join(calc_estad(aux4, 'z'))

In [39]:
# Joining the watch accelerometer and gyroscope RDDs:
rddWatches = rddWatAc.join(rddWatGyr)

Finally, the result:

In [43]:
# Union of both watches and phones RDDs
_rdd = rddPhones.union(rddWatches)

In [46]:
print(_rdd.first())

(('a', 'nexus4', 'stand'), ((((-6.02649995057, 0.18456097501476634, -5.5202026, -7.0448303), (0.9334959509016, 0.24044618153708053, 1.9472808999999998, -0.84251404)), (8.01364601312, 0.17600865886717026, 8.638794, 7.149872)), (((0.0015888519490950001, 0.04277706596172393, 0.6321869000000001, -0.16569519), (0.001009460465647, 0.028614446745451536, 0.34971620000000003, -0.15550232)), (0.000442184429349, 0.04594334128107021, 0.44873047, -0.6001586999999999))))
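
Each record is nested as (key, (accelerometer, gyroscope)), where each sensor is ((x_stats, y_stats), z_stats) and each stats tuple is (mean, stdev, max, min). One possible way to get closer to the target layout User,Model,gt,mean(x,y,z),stdev(x,y,z),max(x,y,z),min(x,y,z) is to flatten each record into one row per sensor. The sketch below assumes exactly that nesting. `to_rows` and the extra sensor label are hypothetical, and the demo record uses placeholder numbers rather than real measurements.

```python
def to_rows(record):
    """Flatten one joined record into one row per sensor:
    (User, Model, gt, sensor, means(x,y,z), stdevs(x,y,z), maxs(x,y,z), mins(x,y,z))."""
    key, (acc, gyr) = record
    rows = []
    for sensor, ((x, y), z) in (("accelerometer", acc), ("gyroscope", gyr)):
        # x, y, z are each (mean, stdev, max, min); zip regroups them per statistic
        means, stdevs, maxs, mins = zip(x, y, z)
        rows.append(key + (sensor, means, stdevs, maxs, mins))
    return rows

# Placeholder record with the same nesting as the output above:
demo = (('a', 'nexus4', 'stand'),
        ((((1, 2, 3, 4), (5, 6, 7, 8)), (9, 10, 11, 12)),
         (((13, 14, 15, 16), (17, 18, 19, 20)), (21, 22, 23, 24))))

rows = to_rows(demo)
for row in rows:
    print(row)

# With Spark, this could be applied to the whole result as:
#   _rdd.flatMap(to_rows)
```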
