# Filtering meteorological data

We will use meteorological data from Meteogalicia containing the measurements of a weather station in Santiago during June 2017.

## Load data

In [1]:
rdd = sc.textFile('datasets/meteogalicia.txt')

## Filter temperature data

Filter the RDD, keeping only the "Temperatura media" lines.

In [2]:
temperature_lines = rdd.filter(lambda line: 'Temperatura media' in line)

## Count the number of points

In [3]:
temperature_lines.count()

4176
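
As a rough illustration of what the filter and count steps do, here is a minimal plain-Python sketch that needs no Spark session. The sample lines are invented for the example and are not the real layout of `meteogalicia.txt`; only the substring test mirrors the notebook.

```python
# Hypothetical sample lines (NOT the real file format), used only to
# illustrate the substring test applied by rdd.filter above.
sample_lines = [
    "2017-06-01 00:00 Temperatura media 21,55",
    "2017-06-01 00:00 Humidade relativa media 80,0",
    "2017-06-01 00:10 Temperatura media 21,40",
]

# Same predicate as the lambda passed to rdd.filter
matching = [line for line in sample_lines if 'Temperatura media' in line]

# Equivalent of temperature_lines.count()
matching_count = len(matching)
print(matching_count)
```

In Spark the same predicate is evaluated lazily: `filter` only describes the transformation, and the work happens when an action such as `count()` is called.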

## Find the maximum temperature of the month

First we extract the temperature value, which is the seventh whitespace-separated field (index 6) of each line:

In [4]:
temperature_strings = temperature_lines.map(lambda line: line.split()[6])

The elements of temperature_strings are strings of the form "21,55". To use them as numbers we have to convert them to floats, which first requires replacing the "," with a ".":

In [5]:
values = temperature_strings.map(lambda value: value.replace(',', '.'))

And now we can convert them to floats:

In [6]:
temperatures = values.map(lambda value: float(value))
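
The two conversion steps can also be folded into a single function. The following is a minimal sketch in plain Python; `parse_decimal_comma` is a hypothetical helper name that does not appear in the original notebook.

```python
def parse_decimal_comma(value):
    """Convert a decimal-comma string such as "21,55" to a float.

    Hypothetical helper: it combines the replace step and the float
    conversion that the notebook performs in two separate map calls.
    """
    return float(value.replace(',', '.'))


print(parse_decimal_comma("21,55"))
```

With an RDD, such a helper could be passed directly to `map`, for example `temperature_strings.map(parse_decimal_comma)`, which would replace the two separate `map` calls with one.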

Finally we can calculate the maximum temperature:

In [7]:
temperatures.reduce(max)

34.4
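
To see why `reduce(max)` yields the maximum, it may help to look at the same idea in plain Python with `functools.reduce`. This is only a sketch: the sample values are made up, and Spark additionally applies the function within and across partitions, which is why the function passed to `reduce` should be commutative and associative.

```python
from functools import reduce

# Made-up sample values, not taken from the dataset
sample = [18.2, 34.4, 21.55]

# reduce applies max pairwise: max(max(18.2, 34.4), 21.55)
running_max = reduce(max, sample)
print(running_max)
```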

Sometimes it is useful to explore the API to find more direct ways to do what we want.

In this case the RDD object provides a **max()** method that does exactly this, so we can also write:

In [8]:
temperatures.max()

34.4

## Find the minimum temperature of the month

In [9]:
temperatures.min()

-9999.0

Reading the header of the dataset file, we can see that -9999 is used as a code to indicate N/A values.

So we have to filter out the -9999 values and repeat:

In [10]:
temperatures.filter(lambda value: value != -9999).min()

9.09
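
The same sentinel handling can be sketched in plain Python. The names `MISSING` and `min_valid` are hypothetical, and the sample values are made up for illustration; naming the sentinel as a constant just makes the intent of the filter easier to read.

```python
# Sentinel used by the dataset to mark N/A values (per the file header)
MISSING = -9999.0


def min_valid(values, missing=MISSING):
    """Return the minimum of values, ignoring the missing-value sentinel."""
    valid = [v for v in values if v != missing]
    return min(valid)


# Made-up sample values, including one sentinel
sample_values = [12.3, -9999.0, 9.09, 20.1]
print(min_valid(sample_values))
```

Applied to the RDD, the equivalent would be `temperatures.filter(lambda value: value != MISSING).min()`. Note that the sentinel did not affect the maximum computed earlier only because -9999 is smaller than every real measurement.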