# Working with meteorological data 2

We will use meteorological data from Meteogalicia that contains the measurements of a weather station in Santiago during June 2017.

## Load data

In [1]:
rdd = sc.textFile('datasets/meteogalicia.txt')

## Extract date and temperature information

Filter the RDD, keeping only the "Temperatura media" (average temperature) lines, and extract the date and the temperature value from each one.

In [2]:
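def parse_temperature(line):
    # Keep the date (2nd field) and the value (last field); the rest is ignored
    (_, date, hour, _, _, _, value) = line.split()
    # Values use a decimal comma, so replace it with a point before converting
    return (date, float(value.replace(',', '.')))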
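
The decimal-comma conversion used in `parse_temperature` can be checked in isolation. This is a minimal sketch in plain Python; the helper name `to_float` and the sample strings are illustrative and are not read from the dataset file.

```python
def to_float(value):
    """Convert a number written with a decimal comma (e.g. '13,82') to a float."""
    return float(value.replace(',', '.'))

# Illustrative inputs, not taken from the dataset file
print(to_float('13,82'))
print(to_float('-9999'))
```

A string without a comma passes through `replace` unchanged, so integer-looking values are converted as well.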

In [3]:
temperatures = (rdd.filter(lambda line: 'Temperatura media' in line)
                .map(parse_temperature))

In [4]:
temperatures.take(5)

[(u'2017-06-01', 13.82),
 (u'2017-06-01', 13.71),
 (u'2017-06-01', 13.61),
 (u'2017-06-01', 13.52),
 (u'2017-06-01', 13.33)]

## Filter out invalid values

As we saw in part 1, a temperature value of -9999 indicates a missing measurement, so we filter these values out before performing any calculations on the data:

In [5]:
temperatures_clean = temperatures.filter(lambda (date, temp): temp != -9999)

NOTE: Python 3 removed tuple parameter unpacking, as explained in [PEP 3113](https://www.python.org/dev/peps/pep-3113/), so there the filter would be written as:

    temperatures_clean = temperatures.filter(lambda date_temp: date_temp[1] != -9999)
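
Another option that runs unchanged on Python 2 and Python 3 is to move the unpacking into a named function. The sketch below uses a plain list and the built-in `filter` so that it runs without Spark; the names `MISSING` and `is_valid` and the sample pairs are hypothetical.

```python
MISSING = -9999  # sentinel for a missing measurement

def is_valid(date_temp):
    # Unpack inside the body instead of in the parameter list (PEP 3113)
    date, temp = date_temp
    return temp != MISSING

# Made-up sample pairs in the same (date, temperature) shape as the RDD elements
sample = [('2017-06-01', 13.82), ('2017-06-01', -9999.0), ('2017-06-02', 15.1)]
clean = list(filter(is_valid, sample))
print(clean)
```

With the RDD, the same function could be passed directly, as in `temperatures.filter(is_valid)`.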

## Calculate the average temperature per day

In [6]:
def sum_pairs(a, b):
    # Add two (temperature_sum, count) pairs element-wise
    return (a[0]+b[0], a[1]+b[1])

In [7]:
averages = (temperatures_clean.map(lambda (date, temp): (date, (temp, 1)))
            .reduceByKey(sum_pairs)
            .map(lambda (date, (temp, count)): (date, temp/count)))

NOTE: In Python 3 the syntax gets ugly, especially in cases like this one where the structures are nested. In Python 3 the code above looks like this:

    averages = (temperatures_clean.map(lambda date_temp: (date_temp[0], (date_temp[1], 1)))
                .reduceByKey(sum_pairs)
                .map(lambda date__temp_count: (date__temp_count[0], date__temp_count[1][0]/date__temp_count[1][1])))
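
To see why each temperature is paired with a count of 1, the same sum-and-count pattern can be reproduced in plain Python. This is only a sketch of the idea under Python 3, not a description of how Spark executes `reduceByKey`; `average_by_key` and the sample values are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def sum_pairs(a, b):
    # Add two (temperature_sum, count) pairs element-wise
    return (a[0] + b[0], a[1] + b[1])

def average_by_key(pairs):
    """Average the values of (key, value) pairs per key via (sum, count) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append((value, 1))  # same role as the first map step
    averages = {}
    for key, values in grouped.items():
        total, count = reduce(sum_pairs, values)  # same role as reduceByKey
        averages[key] = total / count             # same role as the final map step
    return averages

# Made-up sample data, not taken from the dataset
sample = [('2017-06-01', 10.0), ('2017-06-01', 14.0), ('2017-06-02', 20.0)]
daily = average_by_key(sample)
print(daily)
```

Carrying the count alongside the sum keeps the reduction associative, which is why the average is computed only in the last step rather than inside the reducer.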



## Show the results sorted by date

In [8]:
averages.sortByKey().collect()

[(u'2017-06-01', 17.179580419580425),
 (u'2017-06-02', 16.007500000000004),
 (u'2017-06-03', 14.511736111111105),
 (u'2017-06-04', 14.889375000000005),
 (u'2017-06-05', 13.67486111111111),
 (u'2017-06-06', 14.901041666666666),
 (u'2017-06-07', 17.76305555555556),
 (u'2017-06-08', 17.49979166666667),
 (u'2017-06-09', 17.86694444444445),
 (u'2017-06-10', 19.207222222222224),
 (u'2017-06-11', 17.806250000000006),
 (u'2017-06-12', 20.020138888888884),
 (u'2017-06-13', 18.769027777777776),
 (u'2017-06-14', 17.93489510489511),
 (u'2017-06-15', 18.135486111111103),
 (u'2017-06-16', 22.042708333333337),
 (u'2017-06-17', 25.475902777777772),
 (u'2017-06-18', 26.350069444444443),
 (u'2017-06-19', 25.422708333333333),
 (u'2017-06-20', 26.977916666666665),
 (u'2017-06-21', 23.28430555555555),
 (u'2017-06-22', 19.56493055555555),
 (u'2017-06-23', 18.57861111111111),
 (u'2017-06-24', 17.6775),
 (u'2017-06-25', 19.57138888888889),
 (u'2017-06-26', 18.298125000000002),
 (u'2017-06-27', 17.025555555555