# Working with meteorological data using DataFrames

We will use meteorological data from Meteogalicia containing the measurements of a weather station in Santiago during June 2017.

## Load data

In [None]:
rdd = sc.textFile('datasets/meteogalicia.txt')

## Convert to a DataFrame

In [None]:
from pyspark.sql import Row

def parse_row(line):
    """Convert a line of the input file into a Row.

    A data line is returned as a one-element list containing its Row;
    any other line yields an empty list.
    """
    # All data lines start with 6 spaces
    if line.startswith('      '):
        code = int(line[:17].strip())
        date_time = line[17:40]
        date, time = date_time.split()
        parameter = line[40:82].strip()
        value = float(line[82:].replace(',', '.'))
        return [Row(code=code, date=date, time=time, parameter=parameter, value=value)]
    return []
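
The fixed-width slicing in `parse_row` can be tried without Spark. The sketch below builds a made-up line using the column widths the function assumes (17 characters for the code, 23 for date and time, 42 for the parameter name, and the value afterwards). The sample contents, including the date format and the parameter name, are assumptions for illustration and are not taken from the real file.

```python
# Build a made-up line with the column widths that parse_row assumes.
# The field contents are hypothetical, not copied from meteogalicia.txt.
sample = (
    '      1'.ljust(17)                  # validation code, padded to 17 chars
    + '2017-06-01 00:10:00'.ljust(23)    # date and time, padded to 23 chars
    + 'Temperatura media'.ljust(42)      # parameter name, padded to 42 chars
    + '13,5'                             # value with a decimal comma
)

# Same slices as parse_row, stored in plain variables instead of a Row
code = int(sample[:17].strip())
date, time = sample[17:40].split()
parameter = sample[40:82].strip()
value = float(sample[82:].replace(',', '.'))

print(code, date, time, parameter, value)
```

Note that `date_time.split()` only unpacks cleanly when the slice holds exactly two whitespace-separated tokens.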

Using flatMap gives us the flexibility to return nothing from a call to the function: for lines that are not data lines we return an empty list, which adds no elements to the resulting RDD.

In [None]:
data = rdd.flatMap(parse_row).toDF()
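
To see why returning a list works with flatMap, here is a minimal plain-Python sketch of the same semantics. The helper names `flat_map` and `parse_or_skip` and the sample lines are hypothetical stand-ins, not part of Spark or of the dataset.

```python
from itertools import chain

def flat_map(func, items):
    """Plain-Python stand-in for RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))

def parse_or_skip(line):
    # Hypothetical simplified parser: keep lines starting with 6 spaces, drop the rest
    if line.startswith('      '):
        return [line.strip()]
    return []

# Made-up input mixing header, blank and data lines
lines = ['Header line', '      1 reading-a', '', '      1 reading-b']
parsed = flat_map(parse_or_skip, lines)
print(parsed)  # ['1 reading-a', '1 reading-b']
```

Each empty list disappears when the results are flattened, so only the data lines survive. A plain `map` would instead keep one (possibly empty) list per input line.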

## Count the number of points

In [None]:
data.???

## Filter temperature data

In [None]:
t = data.???

## Find the maximum temperature of the month

In [None]:
t.???

## Find the minimum temperature of the month

In [None]:
t.???

The value -9999 is a sentinel code that indicates a measurement was not registered (N/A).

If we look at the possible values of "Códigos de validación" (validation codes, stored in the `code` column), we see that valid points have code 1, so we can restrict the analysis to rows with code 1.
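
The effect of the sentinel on a minimum can be shown with a few made-up readings in plain Python. The `(code, value)` pairs below are hypothetical, and the non-valid code 3 is an assumption chosen only for illustration.

```python
# Hypothetical (code, value) pairs; only code 1 marks a validated reading
readings = [(1, 14.2), (3, -9999.0), (1, 9.8), (1, 21.5)]

# Without filtering, the sentinel wins the minimum
raw_min = min(value for _, value in readings)

# Keeping only validated readings gives a meaningful minimum
valid_min = min(value for code, value in readings if code == 1)

print(raw_min)    # -9999.0
print(valid_min)  # 9.8
```

The same idea applies to the DataFrame: filter on the `code` column first, then aggregate.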

In [None]:
t_validated = t.???

Now we can find the minimum temperature of the month using the validated data:

In [None]:
t_validated.???

## Calculate the average temperature per day

In [None]:
t_validated.???

## Show the results sorted by date

In [None]:
t_validated.???