# Creating a climate index API

In order to know which destinations have (or will have) the best weather on a given date we will be using historical data from all of Spain's weather stations, as well as packaging that information into an API that will allow us to remotely perform queries.

Since all information from 291 weather stations are stored in individual *csv* files, let's import the first one and see how can we coax the necessary data out of it.

# Transforming historical data into a weather index

*Indices are particularly valuable because they allow the integrated effects of a range of climatic variables to be quantified, facilitating an interpretation and rating of climatic conditions at a destination. Another advantage of indices is that they enable the climate of tourism destinations to be objectively compared and are therefore a convenient and more conceptually sound means to assess possible impacts of climate change on the distribution of climatic resources worldwide.*

                  - An Inter-Comparison of the Holiday Climate Index (HCI) and the Tourism Climate Index (TCI) in Europe
                    https://www.mdpi.com/2073-4433/7/6/80/htm
                    
                    
                    
                    
Our aim is to condense all meteorological data (wind, temperature, hours of daylight, rain...) into a single numnber (an index) that can be used to easily compare one destination with another. The objective is to create a dataframe where every row is a different destination and each column holds the index for each week of the year.

We can't use the **TCI** as-is because we lack some of the data (humidity and cloud cover %), so we will have to make an index of our own. From now on we will call it simply the **CWI** (**C**ycling **W**eather **I**ndex).

## Creating our **CWI**

While it would be nice to have all weather data for every town in Spain, we have to make do with the available datasets.

Of the present variables in the official meteo datasets, the following variables are of use:


- Date (to group values per week).
- Average temperature.
- Rain (mm).
- Average wind speed.


The hours of sunlight aren't really meaningful because we're crafting this custom index to be used only on Spain, which has a single timezone and the latitude change isn't that great.


The original **TCI** roughly assigns the following weights to each parameter:


**Maximum temperature:** 40%

**Cloud cover:** 20%

**Precipitation:** 30%

**Wind:** 10%



While those percentages are quite good, this model was made with tourism in mind. Cycling has a few diferrences with tourism and other leisure activities that must be kept in mind when creating our **CWS**:

- Clouds are OK as long as it doesn't rain.
- Wind is very harmful.


A first approach to our index could be the following:

**Maximum temperature:** 40%

**Precipitation:** 40%

**Wind:** 20%

## Importing our data

The first step is importing our data. Every weather station's data is contained in a different *csv* file, we'll import the first one to see how data is structured.

In [38]:
import pandas as pd
import time
from pathlib import Path

In [39]:
#Importing our dataframe.

df = pd.read_csv('meteo_test.csv', sep=';')

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17390 entries, 0 to 17389
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   FECHA          17390 non-null  object 
 1   INDICATIVO     17390 non-null  object 
 2   NOMBRE         17390 non-null  object 
 3   PROVINCIA      17390 non-null  object 
 4   ALTITUD        17390 non-null  int64  
 5   TMEDIA         16212 non-null  float64
 6   PRECIPITACION  16081 non-null  object 
 7   TMIN           16212 non-null  float64
 8   HORATMIN       16191 non-null  object 
 9   TMAX           16231 non-null  float64
 10  HORATMAX       16218 non-null  object 
 11  DIR            15588 non-null  float64
 12  VELMEDIA       16233 non-null  float64
 13  RACHA          15569 non-null  float64
 14  HORARACHA      15539 non-null  object 
 15  SOL            10102 non-null  float64
 16  PRESMAX        8928 non-null   float64
 17  HORAPRESMAX    8899 non-null   object 
 18  PRESMI

In [41]:
df.head()

Unnamed: 0,FECHA,INDICATIVO,NOMBRE,PROVINCIA,ALTITUD,TMEDIA,PRECIPITACION,TMIN,HORATMIN,TMAX,HORATMAX,DIR,VELMEDIA,RACHA,HORARACHA,SOL,PRESMAX,HORAPRESMAX,PRESMIN,HORAPRESMIN
0,1968-03-01,0002I,VANDELL�S,TARRAGONA,32,8.9,21.0,6.6,03:00,11.2,18:00,5.0,1.9,6.7,10:55,0.0,,,,
1,1968-03-02,0002I,VANDELL�S,TARRAGONA,32,10.9,0.0,6.0,07:10,15.8,15:00,32.0,1.1,18.6,23:40,8.6,,,,
2,1968-03-03,0002I,VANDELL�S,TARRAGONA,32,,0.0,,,,,32.0,6.1,19.2,00:55,8.6,,,,
3,1968-03-04,0002I,VANDELL�S,TARRAGONA,32,10.9,6.5,7.8,05:00,14.0,11:40,32.0,3.9,8.1,06:10,8.4,,,,
4,1968-03-05,0002I,VANDELL�S,TARRAGONA,32,,0.0,,,,,32.0,3.6,13.1,23:40,2.5,,,,


In [42]:
#Precipitation needs to be changed to float, but first some values ('ip') will have to go (it means 0 precipitation).

df['PRECIPITACION'] = df['PRECIPITACION'].str.replace('Ip', '0.0')

df["PRECIPITACION"] = pd.to_numeric(df["PRECIPITACION"])

## Using time series to get the average values

Since we want to calculate the **TCI** on a weekly basis it makes sense to compact our data with the same granularity. For this purpose we will be using time series.

In [43]:
#Converting 'FECHA' to Datetime format.

df["FECHA"] = df['FECHA'].apply(pd.to_datetime)

In [44]:
#Checking the result.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17390 entries, 0 to 17389
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   FECHA          17390 non-null  datetime64[ns]
 1   INDICATIVO     17390 non-null  object        
 2   NOMBRE         17390 non-null  object        
 3   PROVINCIA      17390 non-null  object        
 4   ALTITUD        17390 non-null  int64         
 5   TMEDIA         16212 non-null  float64       
 6   PRECIPITACION  16081 non-null  float64       
 7   TMIN           16212 non-null  float64       
 8   HORATMIN       16191 non-null  object        
 9   TMAX           16231 non-null  float64       
 10  HORATMAX       16218 non-null  object        
 11  DIR            15588 non-null  float64       
 12  VELMEDIA       16233 non-null  float64       
 13  RACHA          15569 non-null  float64       
 14  HORARACHA      15539 non-null  object        
 15  SOL            1010

Since the climate is changing it would be foolish to use all data for our forecast. The last 5 years will suffice.

In [45]:
#Creating a dataframe with data from the last 5 years.

filtered = df.loc[(df['FECHA'] >= '2016-01-01') & (df['FECHA'] < '2021-01-01')]

#Reseting the index.

filtered.reset_index(drop=True, inplace=True)

In [46]:
filtered.head()

Unnamed: 0,FECHA,INDICATIVO,NOMBRE,PROVINCIA,ALTITUD,TMEDIA,PRECIPITACION,TMIN,HORATMIN,TMAX,HORATMAX,DIR,VELMEDIA,RACHA,HORARACHA,SOL,PRESMAX,HORAPRESMAX,PRESMIN,HORAPRESMIN
0,2016-01-01,0002I,VANDELL�S,TARRAGONA,32,13.6,0.1,10.8,04:40,16.4,13:00,12.0,0.6,3.6,00:10,,,,,
1,2016-01-02,0002I,VANDELL�S,TARRAGONA,32,13.6,0.0,10.7,21:10,16.4,14:20,99.0,2.8,13.1,10:30,,,,,
2,2016-01-03,0002I,VANDELL�S,TARRAGONA,32,12.1,0.0,9.6,23:00,14.6,11:50,24.0,1.4,8.6,09:30,,,,,
3,2016-01-04,0002I,VANDELL�S,TARRAGONA,32,12.7,0.3,9.4,01:00,16.0,04:40,33.0,1.7,14.2,20:40,,,,,
4,2016-01-05,0002I,VANDELL�S,TARRAGONA,32,12.0,0.0,9.7,23:59,14.4,13:30,30.0,3.1,17.5,03:10,,,,,


In [47]:
#We don't have data for every single day, but this will be enough.

filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1291 entries, 0 to 1290
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   FECHA          1291 non-null   datetime64[ns]
 1   INDICATIVO     1291 non-null   object        
 2   NOMBRE         1291 non-null   object        
 3   PROVINCIA      1291 non-null   object        
 4   ALTITUD        1291 non-null   int64         
 5   TMEDIA         1276 non-null   float64       
 6   PRECIPITACION  1126 non-null   float64       
 7   TMIN           1276 non-null   float64       
 8   HORATMIN       1274 non-null   object        
 9   TMAX           1276 non-null   float64       
 10  HORATMAX       1274 non-null   object        
 11  DIR            1236 non-null   float64       
 12  VELMEDIA       1241 non-null   float64       
 13  RACHA          1236 non-null   float64       
 14  HORARACHA      1236 non-null   object        
 15  SOL            0 non-

Now that we have a dataset of the last 5 years it's time to calculate the weekly averages.

In [48]:
#Using datetime to add the week number as a new column.

filtered['week'] = None
weekNumber = filtered['FECHA'].dt.week.tolist()
filtered['week'] = weekNumber

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = None
  weekNumber = filtered['FECHA'].dt.week.tolist()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = weekNumber


In [49]:
filtered.head()

Unnamed: 0,FECHA,INDICATIVO,NOMBRE,PROVINCIA,ALTITUD,TMEDIA,PRECIPITACION,TMIN,HORATMIN,TMAX,...,DIR,VELMEDIA,RACHA,HORARACHA,SOL,PRESMAX,HORAPRESMAX,PRESMIN,HORAPRESMIN,week
0,2016-01-01,0002I,VANDELL�S,TARRAGONA,32,13.6,0.1,10.8,04:40,16.4,...,12.0,0.6,3.6,00:10,,,,,,53
1,2016-01-02,0002I,VANDELL�S,TARRAGONA,32,13.6,0.0,10.7,21:10,16.4,...,99.0,2.8,13.1,10:30,,,,,,53
2,2016-01-03,0002I,VANDELL�S,TARRAGONA,32,12.1,0.0,9.6,23:00,14.6,...,24.0,1.4,8.6,09:30,,,,,,53
3,2016-01-04,0002I,VANDELL�S,TARRAGONA,32,12.7,0.3,9.4,01:00,16.0,...,33.0,1.7,14.2,20:40,,,,,,1
4,2016-01-05,0002I,VANDELL�S,TARRAGONA,32,12.0,0.0,9.7,23:59,14.4,...,30.0,3.1,17.5,03:10,,,,,,1


In [50]:
#Creating a dataframe with the weekly mean values.

df_mean = filtered.groupby("week").mean()

## Assigning each value a rating 0-10

To create our index all values (temperature, rain and wind speed) will need to be rated on a given scale, from 0 to 10. For this we will be using the same ratings as the famous **TCI**.

In [51]:
#Assigning temperature values.

df_mean['tmax'] = None

for i in range(len(df_mean)):
    if df_mean['TMAX'].iloc[i] >= 39: #Checking the temperature, from high to low.
        df_mean['tmax'].iloc[i] = 0 #Assigning the score.
    elif df_mean['TMAX'].iloc[i] > 38:
        df_mean['tmax'].iloc[i] = 1
    elif df_mean['TMAX'].iloc[i] > 37:
        df_mean['tmax'].iloc[i] = 2
    elif df_mean['TMAX'].iloc[i] > 36:
        df_mean['tmax'].iloc[i] = 3
    elif df_mean['TMAX'].iloc[i] > 35:
        df_mean['tmax'].iloc[i] = 4
    elif df_mean['TMAX'].iloc[i] > 33:
        df_mean['tmax'].iloc[i] = 5
    elif df_mean['TMAX'].iloc[i] > 31:
        df_mean['tmax'].iloc[i] = 6
    elif df_mean['TMAX'].iloc[i] > 29:
        df_mean['tmax'].iloc[i] = 7
    elif df_mean['TMAX'].iloc[i] > 27:
        df_mean['tmax'].iloc[i] = 8
    elif df_mean['TMAX'].iloc[i] > 26:
        df_mean['tmax'].iloc[i] = 9
    elif df_mean['TMAX'].iloc[i] > 23:
        df_mean['tmax'].iloc[i] = 10
    elif df_mean['TMAX'].iloc[i] > 20:
        df_mean['tmax'].iloc[i] = 9
    elif df_mean['TMAX'].iloc[i] > 19:
        df_mean['tmax'].iloc[i] = 8
    elif df_mean['TMAX'].iloc[i] > 18:
        df_mean['tmax'].iloc[i] = 7
    elif df_mean['TMAX'].iloc[i] > 15:
        df_mean['tmax'].iloc[i] = 6
    elif df_mean['TMAX'].iloc[i] > 11:
        df_mean['tmax'].iloc[i] = 5
    elif df_mean['TMAX'].iloc[i] > 7:
        df_mean['tmax'].iloc[i] = 4
    elif df_mean['TMAX'].iloc[i] > 0:
        df_mean['tmax'].iloc[i] > 3
    elif df_mean['TMAX'].iloc[i] <= 3:
        df_mean['tmax'].iloc[i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [52]:
#Assigning precipitation values.

df_mean['rain'] = None

for i in range(len(df_mean)):
    if df_mean['PRECIPITACION'].iloc[i] >= 40:
        df_mean['rain'].iloc[i] = -2
    elif df_mean['PRECIPITACION'].iloc[i] >= 25:
        df_mean['rain'].iloc[i] = -1
    elif df_mean['PRECIPITACION'].iloc[i] > 12:
        df_mean['rain'].iloc[i] = 0
    elif df_mean['PRECIPITACION'].iloc[i] > 10:
        df_mean['rain'].iloc[i] = 1
    elif df_mean['PRECIPITACION'].iloc[i] > 9:
        df_mean['rain'].iloc[i] = 2
    elif df_mean['PRECIPITACION'].iloc[i] > 8:
        df_mean['rain'].iloc[i] = 3
    elif df_mean['PRECIPITACION'].iloc[i] > 7:
        df_mean['rain'].iloc[i] = 4
    elif df_mean['PRECIPITACION'].iloc[i] > 6:
        df_mean['rain'].iloc[i] = 5
    elif df_mean['PRECIPITACION'].iloc[i] > 5:
        df_mean['rain'].iloc[i] = 6
    elif df_mean['PRECIPITACION'].iloc[i] > 4:
        df_mean['rain'].iloc[i] = 7
    elif df_mean['PRECIPITACION'].iloc[i] > 3:
        df_mean['rain'].iloc[i] = 8
    elif df_mean['PRECIPITACION'].iloc[i] > 0:
        df_mean['rain'].iloc[i] = 9
    elif df_mean['PRECIPITACION'].iloc[i] == 0.0:
        df_mean['rain'].iloc[i] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [53]:
#Assigning wind values. Since our values are in m/s and we need Km/h, a conversion is needed.

df_mean['wind'] = None

for i in range(len(df_mean)):
    if df_mean['VELMEDIA'].iloc[i]*3.6 > 70:
        df_mean['wind'].iloc[i] = -10
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 65:
        df_mean['wind'].iloc[i] = -8
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 60:
        df_mean['wind'].iloc[i] = -6
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 55:
        df_mean['wind'].iloc[i] = -4
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 50:
        df_mean['wind'].iloc[i] = -2
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 40:
        df_mean['wind'].iloc[i] = 0
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 35:
        df_mean['wind'].iloc[i] = 2
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 30:
        df_mean['wind'].iloc[i] = 6
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 25:
        df_mean['wind'].iloc[i] = 7
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 20:
        df_mean['wind'].iloc[i] = 8
    elif df_mean['VELMEDIA'].iloc[i]*3.6 > 10:
        df_mean['wind'].iloc[i] = 9
    elif df_mean['VELMEDIA'].iloc[i]*3.6 < 10:
        df_mean['wind'].iloc[i] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [54]:
df_mean.head()

Unnamed: 0_level_0,ALTITUD,TMEDIA,PRECIPITACION,TMIN,TMAX,DIR,VELMEDIA,RACHA,SOL,PRESMAX,PRESMIN,tmax,rain,wind
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1,32,11.708571,0.202857,8.011429,15.417143,28.457143,2.78,14.708571,,,,6,9,9
2,32,10.165714,0.04,6.728571,13.602857,30.885714,3.382857,16.685714,,,,5,9,9
3,32,10.502857,2.554286,6.94,14.082857,30.285714,2.831429,14.02,,,,5,9,9
4,32,11.206667,2.193103,8.053333,14.326667,33.3,3.11,15.663333,,,,5,9,9
5,32,10.964286,1.292857,7.253571,14.664286,28.821429,2.6,14.267857,,,,5,9,10


## Assigning a CWI score to each week

Having all relative scores for any given week (max temperature, rain and wind speed) we can proceed to assign a score.

In [55]:
#Creating a new column by multiplying our individual scores by the corresponding %.

df_mean['CWI'] = df_mean['tmax']*0.4 + df_mean['rain']*0.4 + df_mean['wind']*0.2

In [56]:
#Our final dataframe needs only the last column (score).

df_score = pd.DataFrame(df_mean['CWI'])

In [57]:
df_score.head()

Unnamed: 0_level_0,CWI
week,Unnamed: 1_level_1
1,7.8
2,7.4
3,7.4
4,7.4
5,7.6


In [58]:
#Transposing the dataframe to condense all information in a single row.

df_t = df_score.T

In [59]:
#Reseting the index.

df_t.reset_index(drop=True, inplace=True)

In [60]:
#Finally, creating new columns for the weather station name and coordinates.

df_t['name'] = None
df_t['coords'] = None

In [61]:
df_t.head()

week,1,2,3,4,5,6,7,8,9,10,...,46,47,48,49,50,51,52,53,name,coords
0,7.8,7.4,7.4,7.4,7.6,7.8,8.0,8.0,8.0,8.4,...,7.8,8.0,7.8,8.0,7.8,8.0,8.0,8.0,,


In [62]:
c = df_t.columns.tolist()
r = df_t.iloc[0].tolist()

In [63]:
df_score.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 53
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   CWI     53 non-null     object
dtypes: object(1)
memory usage: 2.9+ KB


# Applying the scoring system to every weather station

Now that we have all the necessary steps to transform a dataframe into our desired structure it's time to create a function (or a loop) that can automatically perform this operation on all *csv* containing meteo data. To access each folder we will be using **Path**.

This will be done in three steps, with three separate functions/loops:

1. Iterate through every *csv* file in a folder.
2. Transform each *csv* file with the previous methods and return a dicionary.
3. Create a master dataframe using all generated dictionaries.

## Creating our parser

Let's begin by defining the parameters that our function must meet:

1. Perform all necessary transformations on a given *csv* file.
2. Return a dictionary where the first key is the station name, the second one the coords, and the next 53 correspond to a week with their respective values reflecting that week's CWI.

In [256]:
#Defining our function.

def score_creator(file):
    df = pd.read_csv(file, sep=';') #Opening the file.
    try:
        df['PRECIPITACION'] = df.PRECIPITACION.str.replace(r"[a-zA-Z]",'')
    except: 
        pass
    df["PRECIPITACION"] = pd.to_numeric(df["PRECIPITACION"]) #Typecasting column.
    df["FECHA"] = df['FECHA'].apply(pd.to_datetime) #Converting from object to Datetime.
     
    df['TMAX'].fillna(df['TMAX'].mean(), inplace=True) #Replacing NaNs.
    df['PRECIPITACION'].fillna(df['PRECIPITACION'].mean(), inplace=True)
    df['VELMEDIA'].fillna(df['VELMEDIA'].mean(), inplace=True)
    
    newname = df['NOMBRE'].iloc[1]
    
    
    filtered = df.loc[(df['FECHA'] >= '2016-01-01') & (df['FECHA'] < '2021-01-01')] #Grabbing data from the last 5 years.
    filtered.reset_index(drop=True, inplace=True) #Resetting index.
    
    filtered['week'] = None
    weekNumber = filtered['FECHA'].dt.week.tolist()
    filtered['week'] = weekNumber #Grouping by week.
    
    df_mean = filtered.groupby("week").mean() #Calculating the weekly average values.
    
    df_mean['tmax'] = None
    for i in range(len(df_mean)): #Assigning temp score.
        try:
            if df_mean['TMAX'].iloc[i] >= 39:
                df_mean['tmax'].iloc[i] = 0
            elif df_mean['TMAX'].iloc[i] > 38:
                df_mean['tmax'].iloc[i] = 1
            elif df_mean['TMAX'].iloc[i] > 37:
                df_mean['tmax'].iloc[i] = 2
            elif df_mean['TMAX'].iloc[i] > 36:
                df_mean['tmax'].iloc[i] = 3
            elif df_mean['TMAX'].iloc[i] > 35:
                df_mean['tmax'].iloc[i] = 4
            elif df_mean['TMAX'].iloc[i] > 33:
                df_mean['tmax'].iloc[i] = 5
            elif df_mean['TMAX'].iloc[i] > 31:
                df_mean['tmax'].iloc[i] = 6
            elif df_mean['TMAX'].iloc[i] > 29:
                df_mean['tmax'].iloc[i] = 7
            elif df_mean['TMAX'].iloc[i] > 27:
                df_mean['tmax'].iloc[i] = 8
            elif df_mean['TMAX'].iloc[i] > 26:
                df_mean['tmax'].iloc[i] = 9
            elif df_mean['TMAX'].iloc[i] > 23:
                df_mean['tmax'].iloc[i] = 10
            elif df_mean['TMAX'].iloc[i] > 20:
                df_mean['tmax'].iloc[i] = 9
            elif df_mean['TMAX'].iloc[i] > 19:
                df_mean['tmax'].iloc[i] = 8
            elif df_mean['TMAX'].iloc[i] > 18:
                df_mean['tmax'].iloc[i] = 7
            elif df_mean['TMAX'].iloc[i] > 15:
                df_mean['tmax'].iloc[i] = 6
            elif df_mean['TMAX'].iloc[i] > 11:
                df_mean['tmax'].iloc[i] = 5
            elif df_mean['TMAX'].iloc[i] > 7:
                df_mean['tmax'].iloc[i] = 4
            elif df_mean['TMAX'].iloc[i] > 0:
                df_mean['tmax'].iloc[i] > 3
            elif df_mean['TMAX'].iloc[i] <= 3:
                df_mean['tmax'].iloc[i] = 0
        except:
            df_mean['tmax'].iloc[i] = df_mean['tmax'].mean() #Filling missing values with the mean.
            
    df_mean['rain'] = None
    for i in range(len(df_mean)): #Assigning rain score.
        try:
            if df_mean['PRECIPITACION'].iloc[i] >= 40:
                df_mean['rain'].iloc[i] = -2
            elif df_mean['PRECIPITACION'].iloc[i] >= 25:
                df_mean['rain'].iloc[i] = -1
            elif df_mean['PRECIPITACION'].iloc[i] > 12:
                df_mean['rain'].iloc[i] = 0
            elif df_mean['PRECIPITACION'].iloc[i] > 10:
                df_mean['rain'].iloc[i] = 1
            elif df_mean['PRECIPITACION'].iloc[i] > 9:
                df_mean['rain'].iloc[i] = 2
            elif df_mean['PRECIPITACION'].iloc[i] > 8:
                df_mean['rain'].iloc[i] = 3
            elif df_mean['PRECIPITACION'].iloc[i] > 7:
                df_mean['rain'].iloc[i] = 4
            elif df_mean['PRECIPITACION'].iloc[i] > 6:
                df_mean['rain'].iloc[i] = 5
            elif df_mean['PRECIPITACION'].iloc[i] > 5:
                df_mean['rain'].iloc[i] = 6
            elif df_mean['PRECIPITACION'].iloc[i] > 4:
                df_mean['rain'].iloc[i] = 7
            elif df_mean['PRECIPITACION'].iloc[i] > 3:
                df_mean['rain'].iloc[i] = 8
            elif df_mean['PRECIPITACION'].iloc[i] > 0:
                df_mean['rain'].iloc[i] = 9
            elif df_mean['PRECIPITACION'].iloc[i] == 0.0:
                df_mean['rain'].iloc[i] = 10
        except:
            df_mean['rain'].iloc[i] = df_mean['rain'].mean() #Filling missing values with the mean.
            
    df_mean['wind'] = None
    for i in range(len(df_mean)): #Assigning wind score.
        try:
            if df_mean['VELMEDIA'].iloc[i]*3.6 > 70:
                df_mean['wind'].iloc[i] = -10
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 65:
                df_mean['wind'].iloc[i] = -8
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 60:
                df_mean['wind'].iloc[i] = -6
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 55:
                df_mean['wind'].iloc[i] = -4
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 50:
                df_mean['wind'].iloc[i] = -2
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 40:
                df_mean['wind'].iloc[i] = 0
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 35:
                df_mean['wind'].iloc[i] = 2
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 30:
                df_mean['wind'].iloc[i] = 6
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 25:
                df_mean['wind'].iloc[i] = 7
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 20:
                df_mean['wind'].iloc[i] = 8
            elif df_mean['VELMEDIA'].iloc[i]*3.6 > 10:
                df_mean['wind'].iloc[i] = 9
            elif df_mean['VELMEDIA'].iloc[i]*3.6 < 10:
                df_mean['wind'].iloc[i] = 10
        except:
            df_mean['wind'].iloc[i] = df_mean['wind'].mean() #Filling missing values with the mean.    
            
    df_mean['CWI'] = df_mean['tmax']*0.4 + df_mean['rain']*0.4 + df_mean['wind']*0.2 #Calculating the score.
    
    df_score = pd.DataFrame(df_mean['CWI']) #Keeping just the score.
    
    df_t = df_score.T #Transposing the dataframe.
    
    df_t['name'] = None
    df_t['coords'] = None
    
    df_t.reset_index(drop=True, inplace=True) #Resetting index.
    
    c = df_t.columns.tolist() #Creating a list of columns.
    r = df_t.iloc[0].tolist() #Creating a list of row values.
    
    keys = c #Creating a dictionary from our two lists.
    values = r
    dictionary = dict(zip(keys, values))
    
    dictionary['name'] = newname
    
    return dictionary #Returning our dictionary.

In [257]:
output = score_creator('B248-19710401-20210602.csv')

  df['PRECIPITACION'] = df.PRECIPITACION.str.replace(r"[a-zA-Z]",'')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = None
  weekNumber = filtered['FECHA'].dt.week.tolist()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = weekNumber #Grouping by week.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [258]:
#Testing our function.

output

{1: 6.6000000000000005,
 2: 6.6000000000000005,
 3: 4.800000000000001,
 4: 5.4,
 5: 6.6000000000000005,
 6: 6.200000000000001,
 7: 7.0,
 8: 7.0,
 9: 6.800000000000001,
 10: 6.6000000000000005,
 11: 7.3999999999999995,
 12: 5.2,
 13: 7.199999999999999,
 14: 7.199999999999999,
 15: 7.0,
 16: 7.3999999999999995,
 17: 6.6000000000000005,
 18: 7.3999999999999995,
 19: 7.8,
 20: 7.8,
 21: 9.0,
 22: 8.600000000000001,
 23: 9.0,
 24: 9.0,
 25: 9.4,
 26: 9.4,
 27: 8.600000000000001,
 28: 9.0,
 29: 9.4,
 30: 8.600000000000001,
 31: 9.0,
 32: 9.4,
 33: 9.0,
 34: 9.0,
 35: 9.4,
 36: 8.600000000000001,
 37: 7.6,
 38: 9.0,
 39: 9.0,
 40: 8.200000000000001,
 41: 7.8,
 42: 6.6000000000000005,
 43: 5.8,
 44: 7.3999999999999995,
 45: 5.800000000000001,
 46: 6.800000000000001,
 47: 6.4,
 48: 6.800000000000001,
 49: 6.0,
 50: 5.4,
 51: 4.0,
 52: 6.800000000000001,
 53: 6.6000000000000005,
 'name': 'SIERRA DE ALFABIA, BUNYOLA',
 'coords': None}

## Creating the component that iterates through all csv files and running it

In [260]:
#This loop will iterate through all files in our designated folder and apply the yet to be defined function to them.

start = time.time() #Starting a timer.

dict_list = [] #We will be appending our output dictionaries here.

directory = 'csv' #The folder containing our csv files.
 
files = Path(directory).glob('*') #Grabbing all files.
for file in files:
    dict_list.append(score_creator(file)) #Applying our score function.
    
stop = time.time() #Stopping our timer.
duration = (stop - start) / 60
print('Minutes:', duration) #Returning the elapsed minutes.

test = pd.DataFrame(dict_list) #Creating the dataframe from the list of dictionaries.

  df['PRECIPITACION'] = df.PRECIPITACION.str.replace(r"[a-zA-Z]",'')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = None
  weekNumber = filtered['FECHA'].dt.week.tolist()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered['week'] = weekNumber #Grouping by week.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


Minutes: 4.488581577936809


We can see that many rows (weather stations) are missing or have missing weeks, that's because they have faulty data or the recording simply doesn't fill the last 5 years. The rows with missing data can be safely dropped.

In [261]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291 entries, 0 to 290
Data columns (total 55 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       264 non-null    float64
 1   2       264 non-null    float64
 2   3       264 non-null    float64
 3   4       263 non-null    float64
 4   5       262 non-null    float64
 5   6       262 non-null    float64
 6   7       263 non-null    float64
 7   8       263 non-null    float64
 8   9       263 non-null    float64
 9   10      263 non-null    float64
 10  11      265 non-null    float64
 11  12      265 non-null    float64
 12  13      265 non-null    float64
 13  14      265 non-null    float64
 14  15      265 non-null    float64
 15  16      265 non-null    float64
 16  17      263 non-null    float64
 17  18      264 non-null    float64
 18  19      263 non-null    float64
 19  20      263 non-null    float64
 20  21      262 non-null    float64
 21  22      263 non-null    float64
 22  23

In [262]:
#Dropping the rows with missing data.

meteo = test.dropna(thresh=52)

In [263]:
#Finally, saving our resulting dataframe.

meteo.to_csv('meteo.csv', index=False)

# Adding the weather station coordinates

Now that we have our complete dataframe it's to assign every weather station its geographical coordinates. This can be done quite easily by using *ListadoEstaciones*, a *csv* file containing all identifiers and corresponding coordinates.

In [286]:
listado = pd.read_csv('ListadoEstaciones.csv', sep=';')

In [287]:
listado.head()

Unnamed: 0,ID,codigo_postal,nombre,provincia,latitud,longitud,elevacion,inicio,fin
0,1387E,8002.0,A CORU�A AEROPUERTO,A CORU�A,431825N,082219W,98,1971-12-01,2021-06-02
1,1387,8001.0,A CORU�A,A CORU�A,432157N,082517W,58,1930-10-01,2021-06-02
2,1393,8006.0,CABO VILAN,A CORU�A,430938N,091239W,50,1994-01-01,2021-06-02
3,1351,8004.0,ESTACA DE BARES,A CORU�A,434710N,074105W,80,1961-01-01,2021-06-02
4,1400,8040.0,FISTERRA,A CORU�A,425529N,091729W,230,1951-01-01,2021-06-02


Latitude and longitude aren't expressed in degrees but in degrees, minutes and seconds. Let's define two functions to convert them.

In [288]:
def latDD(x):
    '''
    Input: latitude in DDMMSS.
    
    Output: latitude in degrees.
    
    '''
    D = int(x[0:2])
    M = int(x[2:4])
    S = float(x[4:6])
    DD = D + float(M)/60 + float(S)/3600
    DD = round(DD, 5) #Setting a limit to 5 decimals.
    return DD

In [289]:
def longDD(x):
    '''
    Input: longitude in DDMMSS.
    
    Output: longitude in degrees.
    
    '''    
    D = int(x[0:2])
    M = int(x[2:4])
    S = float(x[4:6])
    DD = D + float(M)/60 + float(S)/3600
    if x[5] == 'W':
        DD = DD*(-1)
    DD = round(DD, 5)
    return DD

In [290]:
#Testing the first function.

latDD(listado['latitud'].iloc[0])

43.30694

In [291]:
#Testing the second one.

longDD(listado['longitud'].iloc[0])

8.37194

In [292]:
#Now let's apply the functions to the dataframe.

listado['latitud'] = listado['latitud'].map(lambda x: latDD(x))
listado['longitud'] = listado['longitud'].map(lambda x: longDD(x))

In [293]:
listado.head()

Unnamed: 0,ID,codigo_postal,nombre,provincia,latitud,longitud,elevacion,inicio,fin
0,1387E,8002.0,A CORU�A AEROPUERTO,A CORU�A,43.30694,8.37194,98,1971-12-01,2021-06-02
1,1387,8001.0,A CORU�A,A CORU�A,43.36583,8.42139,58,1930-10-01,2021-06-02
2,1393,8006.0,CABO VILAN,A CORU�A,43.16056,9.21083,50,1994-01-01,2021-06-02
3,1351,8004.0,ESTACA DE BARES,A CORU�A,43.78611,7.68472,80,1961-01-01,2021-06-02
4,1400,8040.0,FISTERRA,A CORU�A,42.92472,9.29139,230,1951-01-01,2021-06-02


In [294]:
#Now we can finally add the coordinates to the original meteo dataframe.

for i in range(len(listado)):
    for n in range(len(meteo)):
        if listado['nombre'].iloc[i] == meteo['name'].iloc[n]:
            meteo['coords'].iloc[n] = '('+ str(listado['latitud'].iloc[i]) + ',' + str(listado['longitud'].iloc[i]) + ')'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [296]:
#Checking the results.

meteo.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,46,47,48,49,50,51,52,53,name,coords
0,7.8,7.4,7.4,7.4,7.6,7.8,8.0,8.0,8.0,8.4,...,7.8,8.0,7.8,8.0,7.8,8.0,8.0,8.0,VANDELL�S,
1,7.8,7.4,7.4,7.4,7.8,7.8,8.0,8.0,7.8,8.6,...,8.2,7.8,7.8,7.8,7.8,8.0,7.8,7.4,REUS AEROPUERTO,"(41.145,1.16361)"
2,7.8,7.4,7.4,6.2,7.8,7.8,7.8,7.8,7.8,7.8,...,8.2,7.8,7.8,7.8,7.8,7.8,7.8,7.8,BARCELONA AEROPUERTO,"(41.29278,2.07)"
3,7.6,7.2,7.6,6.4,7.6,7.6,8.0,8.0,8.0,8.0,...,8.0,7.6,7.6,7.6,7.6,7.6,7.6,7.2,MANRESA,"(41.72,1.84028)"
4,7.4,7.4,7.4,6.6,7.4,7.4,7.4,7.8,7.8,7.8,...,7.0,7.8,7.4,7.0,7.4,7.4,7.4,7.4,"BARCELONA, FABRA","(41.41833,2.12417)"


## Adding the missing coordinates

Some coordinates are missing, but they will be easy to add since we have the weather station location.

In [302]:
#Creating a list of missing coords:

missing_names = [] #This list will hold the names of stations with missing coords.

for i in range(len(meteo)):
    if meteo['coords'].iloc[i] == None: #Checking for the missing placeholder (None).
        missing_names.append(meteo['name'].iloc[i]) #Appending the name to the list.    

In [303]:
missing_names

['VANDELL�S',
 'BAZTAN, IRURITA',
 'PINOSO',
 'VALDERREDIBLE, POLIENTES',
 'LOGRO�O AEROPUERTO',
 'PAMPLONA AEROPUERTO',
 'ARANGUREN, ILUNDAIN',
 'BARDENAS REALES, BASE A�REA',
 'ZARAGOZA AEROPUERTO',
 'HUESCA AEROPUERTO',
 'TORTOSA',
 'NAUT ARAN, ARTIES',
 'PALMA, AEROPUERTO',
 'PORTOCOLOM']

Now we will manually search the coordinates for each missing station and add them to our dataframe. The following code asks for the station name and coordinates, it then adds them to the dataframe.

In [322]:
print('Paste the station name:') #Asking for the station name.
name = input()

print('Paste the coordinates:') #Asking for the coords.
coords = input()

for i in range(len(meteo)):
    if meteo['name'].iloc[i] == name:
        meteo['coords'].iloc[i] = coords

Paste the station name:
PORTOCOLOM
Paste the coordinates:
(39.42022, 3.25379)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [326]:
#Saving our dataframe.

meteo.to_csv('meteo.csv', index=False)