# Creating a climate index API

In order to know which destinations have (or will have) the best weather on a given date we will be using historical data from all of Spain's weather stations, as well as packaging that information into an API that will allow us to remotely perform queries.

Since all information from 291 weather stations are stored in individual *csv* files, let's import the first one and see how can we coax the necessary data out of it.

# Transforming historical data into a weather index

*Indices are particularly valuable because they allow the integrated effects of a range of climatic variables to be quantified, facilitating an interpretation and rating of climatic conditions at a destination. Another advantage of indices is that they enable the climate of tourism destinations to be objectively compared and are therefore a convenient and more conceptually sound means to assess possible impacts of climate change on the distribution of climatic resources worldwide.*

                  - An Inter-Comparison of the Holiday Climate Index (HCI) and the Tourism Climate Index (TCI) in Europe
                    https://www.mdpi.com/2073-4433/7/6/80/htm
                    
                    
                    
                    
Our aim is to condense all meteorological data (wind, temperature, hours of daylight, rain...) into a single numnber (an index) that can be used to easily compare one destination with another. The objective is to create a dataframe where every row is a different destination and each column holds the index for each week of the year.

We can't use the **TCI** as-is because we lack some of the data (humidity and cloud cover %), so we will have to make an index of our own. From now on we will call it simply the **CWI** (**C**ycling **W**eather **I**ndex).

## Creating our **CWI**

While it would be nice to have all weather data for every town in Spain, we have to make do with the available datasets.

Of the present variables in the official meteo datasets, the following variables are of use:


- Date (to group values per week).
- Average temperature.
- Rain (mm).
- Average wind speed.


The hours of sunlight aren't really meaningful because we're crafting this custom index to be used only on Spain, which has a single timezone and the latitude change isn't that great.


The original **TCI** roughly assigns the following weights to each parameter:


**Maximum temperature:** 40%

**Cloud cover:** 20%

**Precipitation:** 30%

**Wind:** 10%



While those percentages are quite good, this model was made with tourism in mind. Cycling has a few diferrences with tourism and other leisure activities that must be kept in mind when creating our **CWS**:

- Clouds are OK as long as it doesn't rain.
- Wind is very harmful.


A first approach to our index could be the following:

**Maximum temperature:** 40%

**Precipitation:** 40%

**Wind:** 20%

## Importing our data

The first step is importing our data. Every weather station's data is contained in a different *csv* file, we'll import the first one to see how data is structured.

In [1]:
import pandas as pd
import time
from pathlib import Path

In [2]:
#Importing our dataframe.

df = pd.read_csv('meteo_test.csv', sep=';')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Id      2500 non-null   object 
 1   Fecha   2500 non-null   object 
 2   Tmax    2039 non-null   float64
 3   HTmax   2039 non-null   object 
 4   Tmin    2039 non-null   float64
 5   HTmin   2039 non-null   object 
 6   Tmed    2039 non-null   float64
 7   Racha   2041 non-null   float64
 8   HRacha  2041 non-null   object 
 9   Vmax    2041 non-null   float64
 10  HVmax   2041 non-null   object 
 11  TPrec   2007 non-null   float64
 12  Prec1   2157 non-null   float64
 13  Prec2   2126 non-null   float64
 14  Prec3   2113 non-null   float64
 15  Prec4   2087 non-null   float64
dtypes: float64(10), object(6)
memory usage: 312.6+ KB


In [4]:
df.head()

Unnamed: 0,Id,Fecha,Tmax,HTmax,Tmin,HTmin,Tmed,Racha,HRacha,Vmax,HVmax,TPrec,Prec1,Prec2,Prec3,Prec4
0,0009X,2013-05-07,25.4,17:40,14.6,00:30,20.0,44.0,16:30,24.0,09:40,0.0,0.0,0.0,0.0,0.0
1,0009X,2013-05-08,24.3,18:20,14.4,05:00,19.3,30.0,16:20,15.0,16:20,0.0,0.0,0.0,0.0,0.0
2,0009X,2013-05-09,21.6,16:50,12.7,05:00,17.2,25.0,23:20,16.0,23:59,1.8,0.0,0.0,0.0,1.8
3,0009X,2013-05-10,21.7,17:10,11.4,05:50,16.6,50.0,13:20,29.0,14:00,0.0,0.0,0.0,0.0,0.0
4,0009X,2013-05-11,22.3,16:40,10.8,01:00,16.5,46.0,13:30,26.0,13:50,0.0,0.0,0.0,0.0,0.0


In [7]:
#The precipitation values are split by time intervals, so we will just add them up to get a day's rain.

df['prec'] = None
df['prec'] = df['Prec1'] + df['Prec2'] + df['Prec3'] + df['Prec4']

## Using time series to get the average values

Since we want to calculate the **TCI** on a weekly basis it makes sense to compact our data with the same granularity. For this purpose we will be using time series.

In [10]:
#Converting 'FECHA' to Datetime format.

df["Fecha"] = df['Fecha'].apply(pd.to_datetime)

In [11]:
#Checking the result.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Id      2500 non-null   object        
 1   Fecha   2500 non-null   datetime64[ns]
 2   Tmax    2039 non-null   float64       
 3   HTmax   2039 non-null   object        
 4   Tmin    2039 non-null   float64       
 5   HTmin   2039 non-null   object        
 6   Tmed    2039 non-null   float64       
 7   Racha   2041 non-null   float64       
 8   HRacha  2041 non-null   object        
 9   Vmax    2041 non-null   float64       
 10  HVmax   2041 non-null   object        
 11  TPrec   2007 non-null   float64       
 12  Prec1   2157 non-null   float64       
 13  Prec2   2126 non-null   float64       
 14  Prec3   2113 non-null   float64       
 15  Prec4   2087 non-null   float64       
 16  prec    2007 non-null   float64       
dtypes: datetime64[ns](1), float64(11), object(5)
memory 

Since our new datasets keep data from 2013 onwards will be using all of it. If we had data from some decades ago we would have to trim it. Now it's time to calculate the weekly averages.

In [12]:
#Using datetime to add the week number as a new column.

df['week'] = None
weekNumber = df['Fecha'].dt.week.tolist()
df['week'] = weekNumber

  weekNumber = df['Fecha'].dt.week.tolist()


In [13]:
df.head()

Unnamed: 0,Id,Fecha,Tmax,HTmax,Tmin,HTmin,Tmed,Racha,HRacha,Vmax,HVmax,TPrec,Prec1,Prec2,Prec3,Prec4,prec,week
0,0009X,2013-05-07,25.4,17:40,14.6,00:30,20.0,44.0,16:30,24.0,09:40,0.0,0.0,0.0,0.0,0.0,0.0,19
1,0009X,2013-05-08,24.3,18:20,14.4,05:00,19.3,30.0,16:20,15.0,16:20,0.0,0.0,0.0,0.0,0.0,0.0,19
2,0009X,2013-05-09,21.6,16:50,12.7,05:00,17.2,25.0,23:20,16.0,23:59,1.8,0.0,0.0,0.0,1.8,1.8,19
3,0009X,2013-05-10,21.7,17:10,11.4,05:50,16.6,50.0,13:20,29.0,14:00,0.0,0.0,0.0,0.0,0.0,0.0,19
4,0009X,2013-05-11,22.3,16:40,10.8,01:00,16.5,46.0,13:30,26.0,13:50,0.0,0.0,0.0,0.0,0.0,0.0,19


In [14]:
#Creating a dataframe with the weekly mean values.

df_mean = df.groupby("week").mean()

## Assigning each value a rating 0-10

To create our index all values (temperature, rain and wind speed) will need to be rated on a given scale, from 0 to 10. For this we will be using the same ratings as the famous **TCI**.

In [15]:
#Assigning temperature values.

df_mean['tmax'] = None

for i in range(len(df_mean)):
    if df_mean['Tmax'].iloc[i] >= 39: #Checking the temperature, from high to low.
        df_mean['tmax'].iloc[i] = 0 #Assigning the score.
    elif df_mean['Tmax'].iloc[i] > 38:
        df_mean['tmax'].iloc[i] = 1
    elif df_mean['Tmax'].iloc[i] > 37:
        df_mean['tmax'].iloc[i] = 2
    elif df_mean['Tmax'].iloc[i] > 36:
        df_mean['tmax'].iloc[i] = 3
    elif df_mean['Tmax'].iloc[i] > 35:
        df_mean['tmax'].iloc[i] = 4
    elif df_mean['Tmax'].iloc[i] > 33:
        df_mean['tmax'].iloc[i] = 5
    elif df_mean['Tmax'].iloc[i] > 31:
        df_mean['tmax'].iloc[i] = 6
    elif df_mean['Tmax'].iloc[i] > 29:
        df_mean['tmax'].iloc[i] = 7
    elif df_mean['Tmax'].iloc[i] > 27:
        df_mean['tmax'].iloc[i] = 8
    elif df_mean['Tmax'].iloc[i] > 26:
        df_mean['tmax'].iloc[i] = 9
    elif df_mean['Tmax'].iloc[i] > 23:
        df_mean['tmax'].iloc[i] = 10
    elif df_mean['Tmax'].iloc[i] > 20:
        df_mean['tmax'].iloc[i] = 9
    elif df_mean['Tmax'].iloc[i] > 19:
        df_mean['tmax'].iloc[i] = 8
    elif df_mean['Tmax'].iloc[i] > 18:
        df_mean['tmax'].iloc[i] = 7
    elif df_mean['Tmax'].iloc[i] > 15:
        df_mean['tmax'].iloc[i] = 6
    elif df_mean['Tmax'].iloc[i] > 11:
        df_mean['tmax'].iloc[i] = 5
    elif df_mean['Tmax'].iloc[i] > 7:
        df_mean['tmax'].iloc[i] = 4
    elif df_mean['Tmax'].iloc[i] > 0:
        df_mean['tmax'].iloc[i] > 3
    elif df_mean['Tmax'].iloc[i] <= 3:
        df_mean['tmax'].iloc[i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [16]:
#Assigning precipitation values.

df_mean['rain'] = None

for i in range(len(df_mean)):
    if df_mean['prec'].iloc[i] >= 40:
        df_mean['rain'].iloc[i] = -2
    elif df_mean['prec'].iloc[i] >= 25:
        df_mean['rain'].iloc[i] = -1
    elif df_mean['prec'].iloc[i] > 12:
        df_mean['rain'].iloc[i] = 0
    elif df_mean['prec'].iloc[i] > 10:
        df_mean['rain'].iloc[i] = 1
    elif df_mean['prec'].iloc[i] > 9:
        df_mean['rain'].iloc[i] = 2
    elif df_mean['prec'].iloc[i] > 8:
        df_mean['rain'].iloc[i] = 3
    elif df_mean['prec'].iloc[i] > 7:
        df_mean['rain'].iloc[i] = 4
    elif df_mean['prec'].iloc[i] > 6:
        df_mean['rain'].iloc[i] = 5
    elif df_mean['prec'].iloc[i] > 5:
        df_mean['rain'].iloc[i] = 6
    elif df_mean['prec'].iloc[i] > 4:
        df_mean['rain'].iloc[i] = 7
    elif df_mean['prec'].iloc[i] > 3:
        df_mean['rain'].iloc[i] = 8
    elif df_mean['prec'].iloc[i] > 0:
        df_mean['rain'].iloc[i] = 9
    elif df_mean['prec'].iloc[i] == 0.0:
        df_mean['rain'].iloc[i] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [17]:
#Assigning wind values.

df_mean['wind'] = None

for i in range(len(df_mean)):
    if df_mean['Vmax'].iloc[i] > 70:
        df_mean['wind'].iloc[i] = -10
    elif df_mean['Vmax'].iloc[i] > 65:
        df_mean['wind'].iloc[i] = -8
    elif df_mean['Vmax'].iloc[i] > 60:
        df_mean['wind'].iloc[i] = -6
    elif df_mean['Vmax'].iloc[i] > 55:
        df_mean['wind'].iloc[i] = -4
    elif df_mean['Vmax'].iloc[i] > 50:
        df_mean['wind'].iloc[i] = -2
    elif df_mean['Vmax'].iloc[i] > 40:
        df_mean['wind'].iloc[i] = 0
    elif df_mean['Vmax'].iloc[i] > 35:
        df_mean['wind'].iloc[i] = 2
    elif df_mean['Vmax'].iloc[i] > 30:
        df_mean['wind'].iloc[i] = 6
    elif df_mean['Vmax'].iloc[i] > 25:
        df_mean['wind'].iloc[i] = 7
    elif df_mean['Vmax'].iloc[i] > 20:
        df_mean['wind'].iloc[i] = 8
    elif df_mean['Vmax'].iloc[i] > 10:
        df_mean['wind'].iloc[i] = 9
    elif df_mean['Vmax'].iloc[i] < 10:
        df_mean['wind'].iloc[i] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [18]:
df_mean.head()

Unnamed: 0_level_0,Tmax,Tmin,Tmed,Racha,Vmax,TPrec,Prec1,Prec2,Prec3,Prec4,prec,tmax,rain,wind
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1,11.98,4.12,8.046667,51.422222,29.066667,1.581818,0.456522,0.195556,0.43913,0.426087,1.581818,5,9,7
2,11.6125,4.11,7.8575,48.325,28.525,0.09,0.004651,0.01,0.0,0.073171,0.09,5,9,7
3,12.3975,4.6625,8.5375,48.95,26.55,1.507692,0.190909,0.747619,0.4,0.066667,1.507692,5,9,7
4,12.765957,5.506383,9.140426,48.723404,27.319149,0.934783,0.314894,0.347826,0.2,0.059574,0.934783,5,9,7
5,13.636735,5.422449,9.532653,51.489796,28.285714,0.926531,0.120755,0.180769,0.212,0.372549,0.926531,5,9,7


## Assigning a CWI score to each week

Having all relative scores for any given week (max temperature, rain and wind speed) we can proceed to assign a score.

In [19]:
#Creating a new column by multiplying our individual scores by the corresponding %.

df_mean['CWI'] = df_mean['tmax']*0.4 + df_mean['rain']*0.4 + df_mean['wind']*0.2

In [20]:
#Our final dataframe needs only the last column (score).

df_score = pd.DataFrame(df_mean['CWI'])

In [23]:
df_score.head()

Unnamed: 0_level_0,CWI
week,Unnamed: 1_level_1
1,7.0
2,7.0
3,7.0
4,7.0
5,7.0


In [24]:
#Transposing the dataframe to condense all information in a single row.

df_t = df_score.T

In [25]:
#Reseting the index.

df_t.reset_index(drop=True, inplace=True)

In [26]:
#Finally, creating new columns for the weather station name and coordinates.

df_t['name'] = None
df_t['coords'] = None

In [27]:
df_t.head()

week,1,2,3,4,5,6,7,8,9,10,...,46,47,48,49,50,51,52,53,name,coords
0,7.0,7.0,7.0,7.0,7.0,7.0,7.2,7.2,7.0,7.4,...,7.2,7.2,6.8,7.0,7.2,7.0,7.0,6.6,,


# Applying the scoring system to every weather station

Now that we have all the necessary steps to transform a dataframe into our desired structure it's time to create a function (or a loop) that can automatically perform this operation on all *csv* containing meteo data. To access each folder we will be using **Path**.

This will be done in three steps, with three separate functions/loops:

1. Iterate through every *csv* file in a folder.
2. Transform each *csv* file with the previous methods and return a dicionary.
3. Create a master dataframe using all generated dictionaries.

## Creating our parser

Let's begin by defining the parameters that our function must meet:

1. Perform all necessary transformations on a given *csv* file.
2. Return a dictionary where the first key is the station name, the second one the coords, and the next 53 correspond to a week with their respective values reflecting that week's **CWI**.

In [28]:
df.head(1)

Unnamed: 0,Id,Fecha,Tmax,HTmax,Tmin,HTmin,Tmed,Racha,HRacha,Vmax,HVmax,TPrec,Prec1,Prec2,Prec3,Prec4,prec,week
0,0009X,2013-05-07,25.4,17:40,14.6,00:30,20.0,44.0,16:30,24.0,09:40,0.0,0.0,0.0,0.0,0.0,0.0,19


In [59]:
#Defining our function. NEW FUNC TEST

def score_creator(file):
    df = pd.read_csv(file, sep=';') #Opening the file.

    df['Id'] = df['Id'].astype(str) #Typecasting as a str, otherwise the zeros on the left will be lost.
    
    df['prec'] = None
    df['prec'] = df['Prec1'] + df['Prec2'] + df['Prec3'] + df['Prec4']

    df["Fecha"] = df['Fecha'].apply(pd.to_datetime) #Converting from object to Datetime.
     
    df['Tmax'].fillna(df['Tmax'].mean(), inplace=True) #Replacing NaNs.
    df['prec'].fillna(df['prec'].mean(), inplace=True)
    df['Vmax'].fillna(df['Vmax'].mean(), inplace=True)
    
    newname = df['Id'].iloc[1]
    
    df['week'] = None
    weekNumber = df['Fecha'].dt.week.tolist()
    df['week'] = weekNumber #Grouping by week.
    
    df_mean = df.groupby("week").mean() #Calculating the weekly average values.
    
    df_mean['tmax'] = None

    for i in range(len(df_mean)):
        try:
            if df_mean['Tmax'].iloc[i] >= 39: #Checking the temperature, from high to low.
                df_mean['tmax'].iloc[i] = 0 #Assigning the score.
            elif df_mean['Tmax'].iloc[i] > 38:
                df_mean['tmax'].iloc[i] = 1
            elif df_mean['Tmax'].iloc[i] > 37:
                df_mean['tmax'].iloc[i] = 2
            elif df_mean['Tmax'].iloc[i] > 36:
                df_mean['tmax'].iloc[i] = 3
            elif df_mean['Tmax'].iloc[i] > 35:
                df_mean['tmax'].iloc[i] = 4
            elif df_mean['Tmax'].iloc[i] > 33:
                df_mean['tmax'].iloc[i] = 5
            elif df_mean['Tmax'].iloc[i] > 31:
                df_mean['tmax'].iloc[i] = 6
            elif df_mean['Tmax'].iloc[i] > 29:
                df_mean['tmax'].iloc[i] = 7
            elif df_mean['Tmax'].iloc[i] > 27:
                df_mean['tmax'].iloc[i] = 8
            elif df_mean['Tmax'].iloc[i] > 26:
                df_mean['tmax'].iloc[i] = 9
            elif df_mean['Tmax'].iloc[i] > 23:
                df_mean['tmax'].iloc[i] = 10
            elif df_mean['Tmax'].iloc[i] > 20:
                df_mean['tmax'].iloc[i] = 9
            elif df_mean['Tmax'].iloc[i] > 19:
                df_mean['tmax'].iloc[i] = 8
            elif df_mean['Tmax'].iloc[i] > 18:
                df_mean['tmax'].iloc[i] = 7
            elif df_mean['Tmax'].iloc[i] > 15:
                df_mean['tmax'].iloc[i] = 6
            elif df_mean['Tmax'].iloc[i] > 11:
                df_mean['tmax'].iloc[i] = 5
            elif df_mean['Tmax'].iloc[i] > 7:
                df_mean['tmax'].iloc[i] = 4
            elif df_mean['Tmax'].iloc[i] > 0:
                df_mean['tmax'].iloc[i] > 3
            elif df_mean['Tmax'].iloc[i] <= 3:
                df_mean['tmax'].iloc[i] = 0
        except:
            df_mean['tmax'].iloc[i] = df_mean['tmax'].mean() #Filling missing values with the mean.
            
    df_mean['rain'] = None
    for i in range(len(df_mean)):
        try:
            if df_mean['prec'].iloc[i] >= 40:
                df_mean['rain'].iloc[i] = -2
            elif df_mean['prec'].iloc[i] >= 25:
                df_mean['rain'].iloc[i] = -1
            elif df_mean['prec'].iloc[i] > 12:
                df_mean['rain'].iloc[i] = 0
            elif df_mean['prec'].iloc[i] > 10:
                df_mean['rain'].iloc[i] = 1
            elif df_mean['prec'].iloc[i] > 9:
                df_mean['rain'].iloc[i] = 2
            elif df_mean['prec'].iloc[i] > 8:
                df_mean['rain'].iloc[i] = 3
            elif df_mean['prec'].iloc[i] > 7:
                df_mean['rain'].iloc[i] = 4
            elif df_mean['prec'].iloc[i] > 6:
                df_mean['rain'].iloc[i] = 5
            elif df_mean['prec'].iloc[i] > 5:
                df_mean['rain'].iloc[i] = 6
            elif df_mean['prec'].iloc[i] > 4:
                df_mean['rain'].iloc[i] = 7
            elif df_mean['prec'].iloc[i] > 3:
                df_mean['rain'].iloc[i] = 8
            elif df_mean['prec'].iloc[i] > 0:
                df_mean['rain'].iloc[i] = 9
            elif df_mean['prec'].iloc[i] == 0.0:
                df_mean['rain'].iloc[i] = 10
        except:
            df_mean['rain'].iloc[i] = df_mean['rain'].mean() #Filling missing values with the mean.
            
    df_mean['wind'] = None
    for i in range(len(df_mean)):
        try:
            if df_mean['Vmax'].iloc[i] > 70:
                df_mean['wind'].iloc[i] = -10
            elif df_mean['Vmax'].iloc[i] > 65:
                df_mean['wind'].iloc[i] = -8
            elif df_mean['Vmax'].iloc[i] > 60:
                df_mean['wind'].iloc[i] = -6
            elif df_mean['Vmax'].iloc[i] > 55:
                df_mean['wind'].iloc[i] = -4
            elif df_mean['Vmax'].iloc[i] > 50:
                df_mean['wind'].iloc[i] = -2
            elif df_mean['Vmax'].iloc[i] > 40:
                df_mean['wind'].iloc[i] = 0
            elif df_mean['Vmax'].iloc[i] > 35:
                df_mean['wind'].iloc[i] = 2
            elif df_mean['Vmax'].iloc[i] > 30:
                df_mean['wind'].iloc[i] = 6
            elif df_mean['Vmax'].iloc[i] > 25:
                df_mean['wind'].iloc[i] = 7
            elif df_mean['Vmax'].iloc[i] > 20:
                df_mean['wind'].iloc[i] = 8
            elif df_mean['Vmax'].iloc[i] > 10:
                df_mean['wind'].iloc[i] = 9
            elif df_mean['Vmax'].iloc[i] < 10:
                df_mean['wind'].iloc[i] = 10
        except:
            df_mean['wind'].iloc[i] = df_mean['wind'].mean() #Filling missing values with the mean. 
            
    df_mean['CWI'] = df_mean['tmax']*0.4 + df_mean['rain']*0.4 + df_mean['wind']*0.2 #Calculating the score.
    
    df_score = pd.DataFrame(df_mean['CWI']) #Keeping just the score.
    
    df_t = df_score.T #Transposing the dataframe.
    
    df_t['name'] = None
    df_t['coords'] = None
    
    df_t.reset_index(drop=True, inplace=True) #Resetting index.
    
    c = df_t.columns.tolist() #Creating a list of columns.
    r = df_t.iloc[0].tolist() #Creating a list of row values.
    
    keys = c #Creating a dictionary from our two lists.
    values = r
    dictionary = dict(zip(keys, values))
    
    dictionary['name'] = newname
    
    return dictionary #Returning our dictionary.

In [33]:
output = score_creator('meteo_test.csv')

  weekNumber = df['Fecha'].dt.week.tolist()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [60]:
#Testing our function.

output

{1: 7.0,
 2: 7.0,
 3: 7.0,
 4: 7.0,
 5: 7.0,
 6: 7.0,
 7: 7.199999999999999,
 8: 7.6,
 9: 7.4,
 10: 7.4,
 11: 7.6,
 12: 6.800000000000001,
 13: 7.6,
 14: 7.200000000000001,
 15: 8.600000000000001,
 16: 8.600000000000001,
 17: 8.200000000000001,
 18: 8.8,
 19: 8.8,
 20: 8.8,
 21: 8.8,
 22: 9.2,
 23: 9.4,
 24: 8.8,
 25: 9.4,
 26: 8.600000000000001,
 27: 8.600000000000001,
 28: 8.600000000000001,
 29: 9.0,
 30: 8.600000000000001,
 31: 9.0,
 32: 8.600000000000001,
 33: 9.0,
 34: 9.0,
 35: 8.600000000000001,
 36: 9.0,
 37: 9.2,
 38: 9.0,
 39: 9.4,
 40: 9.0,
 41: 9.0,
 42: 9.0,
 43: 9.0,
 44: 8.600000000000001,
 45: 7.4,
 46: 7.200000000000001,
 47: 7.6,
 48: 7.199999999999999,
 49: 7.0,
 50: 7.199999999999999,
 51: 7.6,
 52: 7.199999999999999,
 53: 7.0,
 'name': '0009X',
 'coords': None}

## Creating the component that iterates through all csv files and running it

In [61]:
#This loop will iterate through all files in our designated folder and apply the yet to be defined function to them.

start = time.time() #Starting a timer.

dict_list = [] #We will be appending our output dictionaries here.

directory = 'csv' #The folder containing our csv files.
 
files = Path(directory).glob('*') #Grabbing all files.
for file in files:
    dict_list.append(score_creator(file)) #Applying our score function.
    
stop = time.time() #Stopping our timer.
duration = (stop - start) / 60
print('Minutes:', duration) #Returning the elapsed minutes.

test = pd.DataFrame(dict_list) #Creating the dataframe from the list of dictionaries.

  weekNumber = df['Fecha'].dt.week.tolist()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


Minutes: 3.3536048332850137


We can see that many rows (weather stations) are missing or have missing weeks, that's because they have faulty data or the recording simply doesn't fill the last 5 years. The rows with missing data can be safely dropped.

In [62]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 849 entries, 0 to 848
Data columns (total 55 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       718 non-null    float64
 1   2       721 non-null    float64
 2   3       721 non-null    float64
 3   4       722 non-null    float64
 4   5       722 non-null    float64
 5   6       722 non-null    float64
 6   7       727 non-null    float64
 7   8       727 non-null    float64
 8   9       728 non-null    float64
 9   10      733 non-null    float64
 10  11      734 non-null    float64
 11  12      734 non-null    float64
 12  13      734 non-null    float64
 13  14      735 non-null    float64
 14  15      735 non-null    float64
 15  16      735 non-null    float64
 16  17      734 non-null    float64
 17  18      734 non-null    float64
 18  19      735 non-null    float64
 19  20      735 non-null    float64
 20  21      735 non-null    float64
 21  22      735 non-null    float64
 22  23

In [63]:
#Dropping the rows with missing data.

meteo = test.dropna(thresh=52)

In [64]:
#Finally, saving our resulting dataframe.

meteo.to_csv('meteo.csv', index=False)

# Adding the weather station coordinates

Now that we have our complete dataframe it's to assign every weather station its geographical coordinates. This can be done quite easily by using *ListadoEstaciones*, a *csv* file containing all identifiers and corresponding coordinates.

In [65]:
listado = pd.read_csv('ListadoEstaciones.csv', sep=';')

In [66]:
listado.head()

Unnamed: 0,ID,lat,long
0,1363X,432646N,075141W
1,1387,432157N,082517W
2,1387E,431825N,082219W
3,1390X,431213N,084239W
4,1393,430938N,091239W


Latitude and longitude aren't expressed in degrees but in degrees, minutes and seconds. Let's define two functions to convert them.

In [67]:
def latDD(x):
    '''
    Input: latitude in DDMMSS.
    
    Output: latitude in degrees.
    
    '''
    D = int(x[0:2])
    M = int(x[2:4])
    S = float(x[4:6])
    DD = D + float(M)/60 + float(S)/3600
    DD = round(DD, 5) #Setting a limit to 5 decimals.
    return DD

In [68]:
def longDD(x):
    '''
    Input: longitude in DDMMSS.
    
    Output: longitude in degrees.
    
    '''    
    D = int(x[0:2])
    M = int(x[2:4])
    S = float(x[4:6])
    DD = D + float(M)/60 + float(S)/3600
    if x[5] == 'W':
        DD = DD*(-1)
    DD = round(DD, 5)
    return DD

In [69]:
#Testing the first function.

latDD(listado['lat'].iloc[0])

43.44611

In [70]:
#Testing the second one.

longDD(listado['long'].iloc[0])

7.86139

In [71]:
#Now let's apply the functions to the dataframe.

listado['lat'] = listado['lat'].map(lambda x: latDD(x))
listado['long'] = listado['long'].map(lambda x: longDD(x))

In [72]:
listado.head()

Unnamed: 0,ID,lat,long
0,1363X,43.44611,7.86139
1,1387,43.36583,8.42139
2,1387E,43.30694,8.37194
3,1390X,43.20361,8.71083
4,1393,43.16056,9.21083


In [73]:
#Now we can finally add the coordinates to the original meteo dataframe.

for i in range(len(listado)):
    for n in range(len(meteo)):
        if listado['ID'].iloc[i] == meteo['name'].iloc[n]:
            meteo['coords'].iloc[n] = '('+ str(listado['lat'].iloc[i]) + ',' + str(listado['long'].iloc[i]) + ')'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [75]:
#Checking the results.

meteo.head(50)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,46,47,48,49,50,51,52,53,name,coords
0,7.6,7.6,7.2,7.2,7.6,7.6,7.8,7.8,7.6,8.0,...,7.8,7.8,7.2,7.6,7.6,7.8,7.8,7.8,0002I,"(40.95806,0.87139)"
1,7.0,7.0,7.0,7.0,7.0,7.0,7.2,7.6,7.4,7.4,...,7.2,7.6,7.2,7.0,7.2,7.6,7.2,7.0,0009X,"(41.21389,0.96333)"
2,7.6,7.0,7.4,7.2,7.2,7.2,7.6,7.6,7.2,7.6,...,8.4,7.8,7.4,7.6,7.4,7.6,7.4,6.8,0016A,"(41.14972,1.17889)"
4,7.8,7.8,7.8,7.8,7.8,7.8,7.8,7.8,7.8,8.2,...,8.6,8.2,7.8,7.8,7.8,7.8,7.8,7.8,0042Y,"(41.12389,1.24917)"
5,7.0,7.0,7.0,5.6,6.8,6.8,7.2,7.2,6.8,7.2,...,7.6,6.6,7.2,7.2,6.8,7.2,7.0,6.8,0061X,"(41.41694,1.51917)"
6,7.4,7.4,7.4,7.0,7.8,7.4,7.8,7.8,7.8,7.8,...,8.2,7.8,7.8,7.8,7.8,7.8,7.4,7.4,0066X,"(41.33028,1.67694)"
7,7.4,7.4,7.4,7.4,7.8,7.4,7.4,7.8,7.8,7.8,...,7.8,7.4,7.8,7.8,7.8,7.8,7.8,7.4,0073X,"(41.24389,1.8525)"
8,7.0,7.0,7.2,7.0,7.2,7.0,7.4,7.4,7.2,7.2,...,7.8,7.8,7.4,7.4,7.4,7.4,7.4,6.8,76,"(41.29278,2.07)"
9,7.0,7.0,7.0,6.2,7.4,7.0,7.4,7.4,7.4,7.2,...,7.4,7.0,7.4,7.4,7.0,7.4,7.0,7.0,0092X,"(42.10139,1.8575)"
10,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.8,7.8,7.8,...,7.8,7.4,7.4,7.4,7.4,7.4,7.4,7.4,0106X,"(41.86639,1.8725)"


In [78]:
#Saving our dataframe.

meteo.to_csv('meteo.csv', index=False)