# Working with meteorological data using DataFrames

We will use meteorological data from Meteogalicia containing the measurements of a weather station in Santiago during June 2017.

## Load data

In [1]:
rdd = sc.textFile('datasets/meteogalicia.txt')

## Convert to a DataFrame

In [2]:
from pyspark.sql import Row

def parse_row(line):
    """Converts a line into a Row
       If the line is a data line it is converted to a Row and returned as a list with that Row,
       otherwise an empty list is returned.
    """
    # All data lines start with 6 spaces
    if line.startswith('      '):
        code = int(line[:17].strip())
        date_time = line[17:40]
        date, time = date_time.split()
        parameter = line[40:82].strip()
        value = float(line[82:].replace(',', '.'))
        return [Row(code=code, date=date, time=time, parameter=parameter, value=value)]
    return []

Using `flatMap` gives us the flexibility to return nothing from a call to the function; this is accomplished by returning an empty list.
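The effect of returning an empty list can be seen without a cluster. The following plain-Python sketch mimics the difference between `map` and `flatMap` using `itertools.chain.from_iterable`. The `parse` helper and the sample lines are hypothetical stand-ins for illustration only; they do not reproduce the real Meteogalicia file layout.

```python
from itertools import chain

def parse(line):
    """Hypothetical parser: return [value] for data lines, [] otherwise."""
    # Same convention as parse_row: data lines start with 6 spaces
    if line.startswith('      '):
        return [float(line.strip().replace(',', '.'))]
    return []

# Made-up sample: one header line, two data lines, one blank line
lines = ['HEADER', '      12,5', '      13,0', '']

# A map-style pass keeps one result per input line, including the empty lists
mapped = [parse(line) for line in lines]

# A flatMap-style pass concatenates the returned lists, so empty lists vanish
flat_mapped = list(chain.from_iterable(parse(line) for line in lines))

print(mapped)
print(flat_mapped)
```

With a map-style pass the non-data lines would still occupy a slot in the result, whereas the flattened version contains only the parsed values.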

In [3]:
data = rdd.flatMap(parse_row).toDF()

## Count the number of points

In [4]:
data.count()

16704

## Filter temperature data

In [5]:
from pyspark.sql.functions import col

t = data.where(col('parameter').like('Temperatura media %'))

## Find the maximum temperature of the month

In [6]:
t.groupBy().max('value').show()

+----------+
|max(value)|
+----------+
|      34.4|
+----------+



In [7]:
t.selectExpr('max(value)').show()

+----------+
|max(value)|
+----------+
|      34.4|
+----------+



In [8]:
# Import under an alias so Python's built-in max() is not shadowed
from pyspark.sql.functions import max as spark_max
t.select(spark_max(col('value'))).show()

+----------+
|max(value)|
+----------+
|      34.4|
+----------+



## Find the minimum temperature of the month

In [9]:
t.groupBy().min('value').show()

+----------+
|min(value)|
+----------+
|   -9999.0|
+----------+



The value -9999 is a sentinel code used to indicate a missing (unrecorded) value (N/A).

If we look at the possible values of "Códigos de validación" (validation codes, the `code` column), we see that valid points have code 1, so we can restrict the analysis to data with code 1.

In [10]:
t_validated = t.where(col('code') == 1)
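To see why the sentinel has to be excluded before aggregating, here is a small plain-Python sketch with made-up readings. The tuples and the code 3 used below are hypothetical and not taken from the dataset. It also shows a possible cross-check, dropping -9999 directly, which is an alternative to filtering on the validation code and not the approach used in this notebook.

```python
# Hypothetical (code, value) readings; -9999.0 marks a missing value
readings = [(1, 12.3), (1, 9.09), (3, -9999.0), (1, 21.7)]

MISSING = -9999.0

# Naive minimum: the sentinel wins because it is smaller than any real reading
naive_min = min(value for _, value in readings)

# Keep only validated points (code 1), mirroring what t_validated does
validated_min = min(value for code, value in readings if code == 1)

# Alternative cross-check: drop the sentinel value itself
non_missing_min = min(value for _, value in readings if value != MISSING)

print(naive_min, validated_min, non_missing_min)
```

In this toy example both filters agree, but on real data they need not: a point could carry a non-valid code and still hold a plausible-looking value, so the two filters answer slightly different questions.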

Now we can find the minimum temperature of the month using the validated data:

In [11]:
t_validated.groupBy().min('value').show()

+----------+
|min(value)|
+----------+
|      9.09|
+----------+



## Calculate the average temperature per day

In [12]:
t_validated.groupBy('date').mean('value').show(30)

+----------+------------------+
|      date|        avg(value)|
+----------+------------------+
|2017-06-22| 19.56493055555555|
|2017-06-07| 17.76305555555556|
|2017-06-24|           17.6775|
|2017-06-29|13.477083333333331|
|2017-06-19|25.422708333333333|
|2017-06-03|14.511736111111105|
|2017-06-23| 18.57861111111111|
|2017-06-28|15.242361111111105|
|2017-06-12|20.020138888888884|
|2017-06-30|             11.59|
|2017-06-26|18.298125000000002|
|2017-06-04|14.889375000000005|
|2017-06-18|26.350069444444443|
|2017-06-06|14.901041666666666|
|2017-06-09| 17.86694444444445|
|2017-06-21| 23.28430555555555|
|2017-06-25| 19.57138888888889|
|2017-06-14| 17.93489510489511|
|2017-06-16|22.042708333333337|
|2017-06-11|17.806250000000006|
|2017-06-08| 17.49979166666667|
|2017-06-13|18.769027777777776|
|2017-06-01|17.179580419580425|
|2017-06-02|16.007500000000004|
|2017-06-27|17.025555555555556|
|2017-06-17|25.475902777777772|
|2017-06-15|18.135486111111103|
|2017-06-20|26.977916666666665|
|2017-06

## Show the results sorted by date

In [13]:
t_validated.groupBy('date').mean('value').orderBy('date').show(30)

+----------+------------------+
|      date|        avg(value)|
+----------+------------------+
|2017-06-01|17.179580419580425|
|2017-06-02|16.007500000000004|
|2017-06-03|14.511736111111105|
|2017-06-04|14.889375000000005|
|2017-06-05| 13.67486111111111|
|2017-06-06|14.901041666666666|
|2017-06-07| 17.76305555555556|
|2017-06-08| 17.49979166666667|
|2017-06-09| 17.86694444444445|
|2017-06-10|19.207222222222224|
|2017-06-11|17.806250000000006|
|2017-06-12|20.020138888888884|
|2017-06-13|18.769027777777776|
|2017-06-14| 17.93489510489511|
|2017-06-15|18.135486111111103|
|2017-06-16|22.042708333333337|
|2017-06-17|25.475902777777772|
|2017-06-18|26.350069444444443|
|2017-06-19|25.422708333333333|
|2017-06-20|26.977916666666665|
|2017-06-21| 23.28430555555555|
|2017-06-22| 19.56493055555555|
|2017-06-23| 18.57861111111111|
|2017-06-24|           17.6775|
|2017-06-25| 19.57138888888889|
|2017-06-26|18.298125000000002|
|2017-06-27|17.025555555555556|
|2017-06-28|15.242361111111105|
|2017-06