# Wind speed data for fire spread model

## Motivation for obtaining wind speed and direction data

The AgroSuccess model requires wind speed and direction data to determine how fires will spread. At the time of writing (December 2019) there are two possible wind spread sub-models I might choose, both of which require wind speed and direction data. See overviews of the sub-models below.

### Ellipse model
This type of model follows [Anderson et al. 1982][Anderson1982] and [Catchpole et al. 1992][Catchpole1992]: the shape of a fire is modelled as an ellipse and parameters for spread rates are estimated. The major axis of the ellipse is parallel to the wind direction, and the shape of the ellipse is determined by wind speed.

See the [catalog description][frames-desc] for a summary of a United States Department of Agriculture (USDA) working document describing the wind-driven wildland fire size model proposed by [Anderson, 1983][Anderson1983a].

While such a model would be simple to implement, there is not a large body of literature documenting how the spread parameters vary as a function of land cover type.

[frames-desc]: https://www.frames.gov/catalog/8149
[Anderson1983a]: https://www.frames.gov/documents/behaveplus/publications/Anderson_1983_INT-RP-305_ocr.pdf
[Anderson1982]: https://doi.org/10.1017/s0334270000000394
[Catchpole1992]: https://doi.org/10.1139/x92-129

### Process-based model

This is the approach taken in [Perry and Enright 2002][Perry2002a] and [Millington et al. 2009][Millington2009]. This approach considers how a fire spreads from one simulation grid cell to another. Given a burning cell, each of its neighbours is assigned a wind spread weight depending on its orientation with respect to the burning cell and the wind direction. Cells downwind of burning cells have the greatest wind spread weight.

<img src="img/perry-enright-wind-dir.png" width="350px">

Fig. 1: Wind risk weights for simulation cells neighbouring a burning cell. Arrows pointing straight up give the weights for cells downwind of the fire. Image taken from [Perry and Enright 2002][Perry2002a].

The approach of using arbitrary/ relative weights to determine directional fire spread risk was pioneered by [Karafyllidis and Thanailakis, 1997][Karafyllidis1997]. In that paper the authors propose the following direction-dependent wind spread weights:


```text
Weak       Strong
------     ------
↑ 1.1      ↑ 1.3 
↗ 1.04     ↗ 1.1 
→ 1.0      → 1.0 
↘ 1.0      ↘ 1.0 
↓ 0.9      ↓ 0.8 
```
No empirical justification is given for these choices, and the paper makes clear that the distances travelled by a fire driven by wind in this way should be thought of as having arbitrary units. 
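
The table can be read as a lookup keyed on the angle between a neighbouring cell and the downwind direction. A minimal sketch of that lookup follows. The function names, the convention that `wind_towards` is the direction the wind is blowing *towards*, and the assumption that the weights are mirrored to the left and right of the wind axis are all made for illustration and are not taken from the cited paper.

```python
COMPASS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']

# Weights indexed by the number of 45 degree steps between a neighbouring
# cell and the downwind direction (0 = directly downwind, 4 = upwind).
WEIGHTS = {
    'weak': [1.1, 1.04, 1.0, 1.0, 0.9],
    'strong': [1.3, 1.1, 1.0, 1.0, 0.8],
}


def wind_spread_weight(neighbour: str, wind_towards: str,
                       strength: str = 'weak') -> float:
    """Weight for spread into the neighbour lying in compass direction `neighbour`."""
    steps = abs(COMPASS.index(neighbour) - COMPASS.index(wind_towards))
    steps = min(steps, 8 - steps)  # assume symmetry about the wind axis
    return WEIGHTS[strength][steps]


def neighbour_weights(wind_towards: str, strength: str = 'weak') -> dict:
    """Weights for all eight neighbours of a burning cell."""
    return {d: wind_spread_weight(d, wind_towards, strength) for d in COMPASS}


weak_northward = neighbour_weights('N')
```

With the wind blowing towards the north, the northern neighbour receives the largest weight and the southern neighbour the smallest.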

On the other hand, justification for this approach of estimating risk weightings is given by the fact that [Millington et al., 2009][Millington2009] were able to produce wildfire frequency/ size statistics in a similar regime to those observed empirically using this approach.

[Millington2009]: https://doi.org/10.1016/j.envsoft.2009.03.013
[Perry2002a]: https://doi.org/10.1016/S0304-3800(02)00004-2
[Karafyllidis1997]: https://doi.org/10.1016/S0304-3800(96)01942-4

In [None]:
import calendar
from dataclasses import dataclass
from datetime import date
import logging
import os
from pathlib import Path
from typing import List, Tuple

from shapely.geometry import LinearRing, LineString, Point, Polygon

import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib.pyplot as plt
%matplotlib inline

from aemet_api import get_station_inventory, wind_data_for_sites
from aemet_api.geo import aemet_stations_to_gdf, stations_near_targets
from aemet_api.web import check_internet
from aemet_api.wind import degrees_to_cardinal, beaufort_number

Explicitly state inputs and primary and intermediate outputs for this notebook

In [None]:
pwd = os.getcwd().split('/')[-1]
DATA_DIR = Path('../inputs') if pwd == 'wind' else Path('inputs')
TMP_DIR = Path('../tmp') if pwd == 'wind' else Path('tmp')
OUTPUT_DIR = Path('../outputs') if pwd == 'wind' else Path('outputs')

INPUTS = {
    'site_location_info': OUTPUT_DIR / 'site_location_info.csv',
    'un_data_portugal': DATA_DIR / 'UNdata_Export_20191205_131516780.zip',
    'internet_connection': check_internet()
}

INTERMEDIATE = {
    'wind_download': TMP_DIR / 'wind_download.csv',
}

OUTPUTS = {
    'wind_clean': TMP_DIR / 'wind_clean.csv',
    'site_wind_speed_class_prob': OUTPUT_DIR / 'site_wind_speed_class_prob.csv',
    'site_wind_dir_probs': OUTPUT_DIR / 'site_wind_dir_probs.csv',
}

Dates between which we will try to obtain study site data

In [None]:
START_DATE, END_DATE = date(1990, 1, 1), date(2019, 11, 1)

## Choose weather stations to collect wind data from

### Load study site locations from file

This .csv file is produced using the [`epd-query` application](https://doi.org/10.5281/zenodo.3560683).

In [None]:
ssite_df = pd.read_csv(INPUTS['site_location_info'], sep=',')
ssite_df.head()

In [None]:
ssite_gdf = gpd.GeoDataFrame(
    ssite_df,
    geometry=[Point(xy) for xy in zip(ssite_df['londd'], ssite_df['latdd'])],
    crs={'init': 'epsg:4326'}
).to_crs(epsg=2062)  # Madrid 1870 (Madrid) / Spain 

In [None]:
ssite_gdf

### Find all available weather stations

In [None]:
API_KEY = ('eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJhbmRyZXcubGFuZUB'
           'rY2wuYWMudWsiLCJqdGkiOiJhNWE3MjdhMS1hYmM0LTQzNjU'
           'tODEwYy0xNGZlMjUyMjUxZDgiLCJpc3MiOiJBRU1FVCIsIml'
           'hdCI6MTU3NTM4NjM5MSwidXNlcklkIjoiYTVhNzI3YTEtYWJ'
           'jNC00MzY1LTgxMGMtMTRmZTI1MjI1MWQ4Iiwicm9sZSI6IiJ'
           '9.b_yvT_4L8mYMyozG91-4LDkG7SP-XqO6fc96O1G7bH0')

Download inventory of all weather stations from the AEMET API

In [None]:
station_gdf = aemet_stations_to_gdf(
    pd.DataFrame(get_station_inventory(API_KEY))
)

In [None]:
station_gdf.head()

In [None]:
station_gdf.plot()

The cluster to the bottom left is the Canary Islands.

In [None]:
print('Provinces in cluster to bottom left:\n' + '\n'.join(
    station_gdf[station_gdf['geometry'].x < -200000]['provincia'].unique())
)

In [None]:
station_gdf = station_gdf[
    ~station_gdf['provincia'].isin(['LAS PALMAS', 'STA. CRUZ DE TENERIFE'])
]

In [None]:
WORLD = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
SPAIN = WORLD[WORLD['name'] == 'Spain'].to_crs(epsg=2062)  # Madrid 1870 (Madrid) / Spain
PORTUGAL = WORLD[WORLD['name'] == 'Portugal'].to_crs(epsg=2062)

In [None]:
def plot_all_stations(axis_on: bool=False):    
    ax = SPAIN.plot()
    PORTUGAL.plot(ax=ax, color='w', edgecolor='k')
    station_gdf.plot(ax=ax, color='k', markersize=18)
    if not axis_on:
        ax.axis('off')
    plt.tight_layout()
    
plot_all_stations()

Reference map showing the Spanish mainland with all available weather stations overlaid in black.

### Find subset of stations within 50 km of study sites

In [None]:
buffer_dist = 50000
ssite_stations_gdf = stations_near_targets(station_gdf, ssite_gdf, max_dist=buffer_dist)
ssite_stations_gdf.head()

Map showing the locations of weather stations (black points) within 50 km of study sites. Note that as we are using data from the Spanish AEMET, the Portuguese site Charco da Candieira does not have any data at present. This will be handled in the section [Portugal](#portugal).

In [None]:
def plot_study_site_map():
    ax = SPAIN.plot()
    PORTUGAL.plot(ax=ax, color='w', edgecolor='k')
    ssite_gdf.buffer(buffer_dist).plot(ax=ax, edgecolor='y', color='g')
    for i, row in ssite_gdf.iterrows():
        plt.annotate(row['sitename'], (row['geometry'].x + 1000, 
                                       row['geometry'].y + 60000))
    ssite_stations_gdf.plot(ax=ax, color='k', markersize=18)
    ax.axis('off')
    plt.tight_layout()
    
plot_study_site_map()

DataFrames of note:

- `ssite_stations_gdf` links Spanish weather stations to study sites. Multiple stations per study site are allowed, but at this point there are no stations included for Charco da Candieira
- `station_gdf` contains the locations of all weather stations in mainland Spain
- `ssite_gdf` contains the locations of all study sites
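
The implementation of `stations_near_targets` is not shown in this notebook. Because the station and study site layers are both projected to EPSG:2062, whose units are metres, a 50 km search amounts to a Euclidean distance test. The sketch below illustrates that idea without any geospatial dependencies; the function name `stations_within` and the coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # projected (x, y) in metres


def stations_within(stations: Dict[str, Coord], site: Coord,
                    max_dist: float) -> List[str]:
    """Return the ids of stations no further than `max_dist` from `site`."""
    return sorted(
        station_id for station_id, (x, y) in stations.items()
        if math.hypot(x - site[0], y - site[1]) <= max_dist
    )


# Hypothetical coordinates, for illustration only
stations = {
    'A': (10_000.0, 0.0),
    'B': (0.0, 49_999.0),
    'C': (40_000.0, 40_000.0),  # about 56.6 km from the origin
}
nearby = stations_within(stations, site=(0.0, 0.0), max_dist=50_000)
```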

## Strategy for incorporating Portuguese site

The above analysis provides wind speed data for study sites in Spanish territory, but not **Charco da Candieira** which is in Portugal's Aveiro district (Anadia municipality). A search for open data characterising wind speed in Portugal was not as productive as that for Spain.

### Work done to obtain Portuguese wind speed data

#### Datasets used in the literature

In a literature search conducted in December 2019 I searched for "wind speed Portugal" on Web of Science and Google Scholar. I found no results providing directions to accessible open datasets. For example, in the paper [Fonte, Silva and Quadrado (2005) *Wind Speed Prediction using Artificial Neural Networks*](https://pdfs.semanticscholar.org/ee96/63a95cee7ad75d9de0320acc6881f11dbe3d.pdf), the authors state that:

> "The data set used in this work corresponds to the hourly average values of wind during the years of 2003 and 2004 in Faro."

but don't specify how they obtained that data. In [Rio, Esteves and Estanqueiro (2006) *Monthly Forecasts of the Average Wind Speed in Portugal*](https://www.researchgate.net/publication/259333270_Monthly_forecasts_of_the_average_wind_speed_in_Portugal) the authors state their data were obtained from the NOAA Operational Archive and Distribution System (NOMADS). However, the authors don't provide a link to the dataset, and the [NOMADS website](https://nomads.ncep.noaa.gov/) shows only data relating to the United States.

#### Commercial data sources

[WindStatistics](https://www.windfinder.com/windstatistics/sagres) provide commercial wind speed datasets, but these are not appropriate for research purposes as they are not openly accessible.

#### Institutional data sources

The Portuguese Institute for Sea and Atmosphere (IPMA) provide a Portuguese [climate change portal](http://portaldoclima.pt/en) which publishes temperature and precipitation data. IPMA also provide temperature and precipitation data [specific to individual weather stations](https://www.ipma.pt/en/oclima/normais.clima/1971-2000/#102). However, I was unable to find any official source of wind speed data provided by the Portuguese state.

The most fine-grained relevant data I have been able to obtain is from a [UN data portal](http://data.un.org/Data.aspx?d=CLINO&f=ElementCode%3a16%3bCountryCode%3aPO&c=2,5,6,7,10,15,18,19,20,22,24,26,28,30,32,34,36,38,40,42,44,46&s=CountryName:asc,WmoStationNumber:asc,StatisticCode:asc&v=1) showing monthly wind speed averages in Portugal at the district level. While this is a good start, it is not ideal because only point estimates are provided, which is not enough information to infer a wind speed *distribution*. Additionally, since Aveiro district is not included, it is necessary to choose a neighbouring district to use as a proxy.

### Process Portuguese wind speed data from UN data

UN data was downloaded on 5th December 2019 from [this](http://data.un.org/Data.aspx?d=CLINO&f=ElementCode%3a16%3bCountryCode%3aPO&c=2,5,6,7,10,15,18,19,20,22,24,26,28,30,32,34,36,38,40,42,44,46&s=CountryName:asc,WmoStationNumber:asc,StatisticCode:asc&v=1) page. It is necessary to visit that page in a browser and click the 'download' link to obtain the data as the page uses filtering performed in the browser to generate the dataset to download. I did also attempt to use an API the UN provide for downloading the data but found the [documentation](http://data.un.org/Host.aspx?Content=API) unintelligible.

Assuming the data are downloaded into the local file `DATA_DIR / UNdata_Export_20191205_131516780.zip` we view it as follows:

In [None]:
keep_cols = (['Country or Territory', 'Station Name', 'WMO Station Number',
              'Period', 'Statistic Description', 'Unit', 'Annual']
             + [calendar.month_abbr[x + 1] for x in range(12)])
un_data = pd.read_csv(INPUTS['un_data_portugal'],
                      usecols=keep_cols, sep=';')
un_data

Aveiro has an Atlantic coast and is situated between Porto and Coimbra.

In [None]:
candieira_wind_mean = (
    un_data.loc[
        un_data['Station Name'].isin(['Coimbra', 'Porto']), 
        ['Station Name', 'Annual']
    ]
    .set_index('Station Name')
)

In [None]:
print(candieira_wind_mean)

Take the measurements at Porto to be representative of the mean wind speed at Charco da Candieira

In [None]:
cdc_mean_wind_speed, = candieira_wind_mean.loc['Porto'].values

We now look to find a region along the Spanish Atlantic coast in which all points are a similar distance from the Atlantic coast as Charco da Candieira. We will then select a weather station within this region which has the closest average wind speed to `cdc_mean_wind_speed` to represent Charco da Candieira (in lieu of wind speed distribution data in Portugal). 

Steps to identify candidate stations to represent Charco da Candieira:

1. Measure distance between the site point and the Atlantic coast
2. Find the line comprising the points on the peninsula which are this distance from the Atlantic coast
3. Create a 50 km buffer around this line to mimic the search area around Spanish study sites
4. Identify a site search area by finding the intersection between the buffered Atlantic coast distance line and Spain
5. Obtain a listing of all Spanish weather stations within this search area
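
Step 1 relies on finding the point on the coastline that is nearest to the study site. The notebook does this with Shapely's `project` and `interpolate` methods. The following dependency-free sketch shows the underlying geometry on a toy square, so that the idea can be checked by hand; the function names and coordinates are illustrative assumptions.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float]


def nearest_on_segment(p: Coord, a: Coord, b: Coord) -> Coord:
    """Nearest point to `p` on the line segment from `a` to `b`."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a  # degenerate segment
    # Fraction along the segment of the orthogonal projection of p,
    # clamped so that the result stays between a and b
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (a[0] + t * dx, a[1] + t * dy)


def nearest_on_ring(p: Coord, ring: List[Coord]) -> Coord:
    """Nearest point to `p` on a closed ring of vertices."""
    candidates = [
        nearest_on_segment(p, ring[i], ring[(i + 1) % len(ring)])
        for i in range(len(ring))
    ]
    return min(candidates,
               key=lambda q: math.hypot(q[0] - p[0], q[1] - p[1]))


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
site = (2.0, 5.0)
closest = nearest_on_ring(site, square)
dist_to_boundary = math.hypot(closest[0] - site[0], closest[1] - site[1])
```

Here the nearest boundary point lies on the left-hand edge of the square, and the resulting distance plays the role of `dist_to_atl` in the cells that follow.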

In [None]:
def closest_point_on_poly_to_point(poly: Polygon, pt: Point) -> Point:
    """Find the closest point on a polygon to another point.
    
    Implementation lifted from this `SO answer`.
    
    _`SO answer`: https://stackoverflow.com/questions/33311616
    """
    pol_ext = LinearRing(poly.exterior.coords)
    dist = pol_ext.project(pt)
    nearest_p = pol_ext.interpolate(dist)
    return nearest_p

Isolate Shapely objects representing i. the point where Charco da Candieira is located, and ii. a polygon representing the whole of Iberia. 

In [None]:
cdc_pt = (
    ssite_gdf[ssite_gdf['sitename'] == 'Charco da Candieira']['geometry']
    .values[0]
)
iberia_poly = (
    pd.concat([SPAIN, PORTUGAL]).dissolve(by='continent')['geometry']
    .values[0]
)

In [None]:
iberia_gdf = gpd.GeoDataFrame([('iberia', iberia_poly)],
                              columns=['desc', 'geometry'])
closest_pt = closest_point_on_poly_to_point(iberia_poly, cdc_pt)
points_gdf = gpd.GeoDataFrame([('site', cdc_pt),('close_point', closest_pt)],
                              columns=['desc', 'geometry'])

In [None]:
base = iberia_gdf.plot()
points_gdf.plot(ax=base, color='r')

In [None]:
dist_to_atl = cdc_pt.distance(closest_pt)
atl_corridor = iberia_gdf.intersection(iberia_gdf.boundary.buffer(dist_to_atl))
base = atl_corridor.plot()
points_gdf.plot(ax=base, color='r')
#PORTUGAL.plot(ax=base)

Use a rectangular bounding box to trim sections not on the Atlantic coast

In [None]:
top, bottom, left, right = 1000000, 180000, 85000, 290000
atl_corridor = (
    gpd.GeoSeries(Polygon([(left, top), (right, top),
                           (right, bottom), (left, bottom)]))
    .intersection(atl_corridor)
)

In [None]:
base = atl_corridor.plot()
points_gdf.plot(ax=base, color='r')

In [None]:
atl_corridor_points = [
    Point(x[0], x[1]) for x in atl_corridor.boundary.values[0].coords
]

In [None]:
fig, ax = plt.subplots(figsize=(7, 12))
xs, ys = zip(*[(pt.x, pt.y) for pt in atl_corridor_points])
plt.scatter(xs, ys)
for i in range(len(atl_corridor_points)):
    plt.annotate(i, (xs[i] + 1000, ys[i] + 100))

Referring to the above plot of `atl_corridor`, the line which defines the set of points the same distance from the Atlantic coast as Charco da Candieira is given by the numbered points 1 - 25 excluding the kink at point 2.

In [None]:
dist_from_coast_line = LineString(
    [pt for i, pt in enumerate(atl_corridor_points) 
     if (i <= 25) and (i != 2) and (i > 0)]
)
dist_from_coast_gdf = gpd.GeoDataFrame([('dist_from_coast', dist_from_coast_line)],
                                       columns=['desc', 'geometry'])

Buffer the distance-from-coast line to obtain the area within 50 km of the line

In [None]:
dist_from_coast_area_gdf = dist_from_coast_gdf.buffer(50000)

In [None]:
base = iberia_gdf.plot(color='w', edgecolor='k')
dist_from_coast_gdf.plot(ax=base, color='r')
dist_from_coast_area_gdf.plot(ax=base, color='g', edgecolor='y')

The red line shows the points the same distance from the Atlantic coast as Charco da Candieira. The green buffer zone extends 50 km in all directions from this distance-from-coast line.

Find the intersection between the buffered distance-from-coast area and Spain

In [None]:
cdc_search_area = gpd.GeoDataFrame([
    ('cdc_search_area',
     SPAIN.iloc[0]['geometry']
     .intersection(dist_from_coast_area_gdf.values[0])[2])
], columns=['desc', 'geometry'], crs={'init': 'epsg:2062'})

In [None]:
base = SPAIN.plot()
cdc_search_area.plot(ax=base, color='r')
(ssite_gdf[ssite_gdf['sitename'] == 'Charco da Candieira'].to_crs(epsg=2062)
 .buffer(50000).plot(ax=base, edgecolor='y', color='g'))

The red marker on the above map shows the area which is both i. within the Atlantic coast selection zone, and ii. within Spain.

We now identify weather stations within this search area

In [None]:
cdc_candidate_sites = gpd.sjoin(station_gdf, cdc_search_area, how='inner', op='within') 

In [None]:
base = SPAIN.plot()
cdc_search_area.plot(ax=base, color='r')
(ssite_gdf[ssite_gdf['sitename'] == 'Charco da Candieira'].to_crs(epsg=2062)
 .buffer(50000).plot(ax=base, edgecolor='y', color='g'))
cdc_candidate_sites.plot(ax=base, color='k')

In [None]:
cdc_candidate_sites

Download wind speed and direction data for the candidate stations that could represent the Portuguese site

In [None]:
cdc_station_wind_data = wind_data_for_sites(
    START_DATE,
    END_DATE,
    cdc_candidate_sites['indicativo'].unique(),
    API_KEY
)

In [None]:
cdc_station_comparison_df = (
    cdc_station_wind_data
    .groupby(by='station_id')['ave_wind_speed']
    .agg(['count', 'mean'])
    .assign(cdc_mean_abs_diff=lambda df: (df['mean'] - cdc_mean_wind_speed).abs())
    .sort_values(by='cdc_mean_abs_diff')
)

assert cdc_station_comparison_df.iloc[0].name == '1428'
cdc_station_comparison_df

We see that station `1428` has the smallest absolute difference between its mean wind speed and the mean wind speed in Porto as reported by the UN. We select this station to represent Charco da Candieira's wind speed distribution.

In [None]:
cdc_site_id = '1428'

In [None]:
portugal_ssite_station_wind_data = (
    cdc_station_wind_data
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .set_index(['station_id', 'date'])
    .loc[pd.IndexSlice[cdc_site_id, :], :]
)

In [None]:
portugal_ssite_station_wind_data

In [None]:
portugal_ssite_station = pd.concat([
    ssite_gdf[ssite_gdf['sitename'] == 'Charco da Candieira']
    .iloc[0].drop('geometry'), # keep location from station not study site
    cdc_candidate_sites[
        cdc_candidate_sites['indicativo'] == cdc_site_id
    ].iloc[0]
]).to_frame().T[ssite_stations_gdf.columns]

Add selected station representing the Portuguese study site to the DataFrame linking weather stations to study sites

In [None]:
ssite_stations_gdf = pd.concat([ssite_stations_gdf, portugal_ssite_station]).reset_index(drop=True)

## Download wind speed and direction data for Spanish stations within 50 km of study sites

In [None]:
ssite_station_wind_data = wind_data_for_sites(
    START_DATE,
    END_DATE,
    # exclude Portuguese site from query to avoid re-downloading data
    ssite_stations_gdf[
        ssite_stations_gdf['indicativo'] != cdc_site_id
    ]['indicativo'].unique(),
    API_KEY
)

Join with previously downloaded data representing Portuguese site

In [None]:
ssite_station_wind_data = (
    pd.concat(
        [ssite_station_wind_data,
         portugal_ssite_station_wind_data.reset_index()],
        sort=True
    )
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .set_index(['station_id', 'date'])
)   

In [None]:
ssite_station_wind_data

In [None]:
ssite_station_wind_data.to_csv(INTERMEDIATE['wind_download'])

## Clean wind speed and direction data

In [None]:
try:
    ssite_station_wind_data
except NameError:
    ssite_station_wind_data = (
        pd.read_csv(INTERMEDIATE['wind_download'])
        .assign(date=lambda df: pd.to_datetime(df['date']))
        .set_index(['station_id', 'date'])
    )

In [None]:
ssite_station_wind_data

In [None]:
ssite_station_wind_data.dtypes

In [None]:
ssite_station_wind_data.loc['6302A']['ave_wind_speed'].rolling(7).mean().plot()

Plot of average wind speed for station 6302A (near the San Rafael site). Note the seasonal periodicity in wind speed.

### Check for sites missing wind speed data

In [None]:
ssites_with_observation_counts = (
    pd.merge(
        ssite_stations_gdf,
        (ssite_station_wind_data['ave_wind_speed'].dropna()
         .groupby(level='station_id').count()
         .rename('n_observations')
         .to_frame().reset_index()),
        how='left', left_on='indicativo', right_on='station_id'
    )
    .loc[
        :, ['sitename', 'provincia', 'nombre', 'indicativo', 'n_observations']
    ]
)

We see that only San Rafael and Monte Areo mire have stations missing data. Also, the Algendar station at Menorca Airport has over 10,000 observations 👍

In [None]:
ssites_with_observation_counts

Summarise count of observations by study site

In [None]:
ssites_with_observation_counts.groupby(by=['sitename'])['n_observations'].sum().astype(int)


In [None]:
ssite_stations_gdf.loc[:, ['indicativo', 'sitename']].head()

### Ensure time index is continuous

In [None]:
tmp = (
    ssite_station_wind_data.swaplevel().unstack().sort_index()
    .pipe(lambda df: df.reindex(
        pd.date_range(df.iloc[0].name, df.iloc[-1].name)
    ))
    .rename_axis('date')
    .stack().swaplevel().sort_index()
)

assert len(tmp.index) >= len(ssite_station_wind_data.dropna(how='all').index), (
    'Expect more rows when ensuring time index continuous, not fewer.'
)
ssite_station_wind_data = tmp

In [None]:
ssite_station_wind_data

Add study site name to index for each combination of station and date

In [None]:
# Like ssite_station_wind_data but index is site, station and date, not station and date
ssite_wind_data = (
    pd.merge(
        ssite_station_wind_data.reset_index(),
        ssite_stations_gdf.loc[:, ['indicativo', 'sitename']],
        how='left',
        left_on='station_id',
        right_on='indicativo',
    )
    .drop(columns=['indicativo'])
    .set_index(['sitename', 'station_id', 'date'])
    .sort_index()
    .dropna()
)    

In [None]:
ssite_wind_data

In [None]:
sites = ssite_wind_data.index.levels[0].values
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    if i < len(sites):
        # The 2x3 grid has room for 6 plots; hide any axes without a site
        ax.hist(ssite_wind_data.loc[sites[i], 'ave_wind_speed'], density=True)
        ax.set_title(sites[i])
        ax.set_xlim([0, 25])
    else:
        ax.axis('off')
plt.tight_layout()

Histograms of daily average wind speed across all stations for each site

### Clean wind direction data

In [None]:
test_dir = ssite_wind_data.loc[pd.IndexSlice[:, '6293X'], 'direction']
test_dir.hist()

In [None]:
test_dir.reset_index(level=[0, 1], drop=True).plot()

It looks very much like the values greater than 36 (indicating 360 degrees) are erroneous. Let's remove them:

In [None]:
test_dir.loc[test_dir > 36] = np.nan

In [None]:
assert test_dir.max() <= 36

In [None]:
test_dir.hist()

In [None]:
test_dir.reset_index(level=[0, 1], drop=True).plot()

This looks much more sensible. Apply this logic to the whole DataFrame.

In [None]:
ssite_wind_data.loc[ssite_wind_data['direction'] > 36, 'direction'] = np.nan

In [None]:
ssite_wind_data['direction'].hist()

The API data has units of 'tens of degrees', i.e. a value of 22 indicates a maximum wind gust direction of 220° clockwise from north. Convert these values to cardinal directions.
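
The conversion from tens of degrees to a compass point can be illustrated independently of `aemet_api`. The sketch below assumes eight compass points, each covering a 45° sector centred on its bearing; the name `to_cardinal_8` is hypothetical, and the library's `degrees_to_cardinal` may treat sector boundaries differently.

```python
COMPASS_8 = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']


def to_cardinal_8(degrees: float) -> str:
    """Map a bearing in degrees clockwise from north to a compass point."""
    # Shift by half a sector so that, e.g., 337.5 to 22.5 degrees maps to N
    sector = int(((degrees % 360) + 22.5) // 45) % 8
    return COMPASS_8[sector]


api_value = 22  # tens of degrees, as reported by the API
direction = to_cardinal_8(api_value * 10)
```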

In [None]:
ssite_wind_data['direction'] = (
    ssite_wind_data['direction']
    .apply(lambda x: degrees_to_cardinal(x * 10))
)

In [None]:
ssite_wind_data.head()

In [None]:
ssite_wind_data.to_csv(OUTPUTS['wind_clean'])

## Classify wind speed according to Beaufort No.

See [here][BeaufortDesc] for reference for Beaufort numbers.

We classify wind speed observations according to Beaufort numbers. With reference to the wind spread weights from Perry and Enright, 2002, Beaufort numbers are then grouped into low/ medium/ high wind speeds as follows:

- Low wind = Beaufort No. 0-2
- Med wind = Beaufort No. 3-5
- High wind = Beaufort No. >= 6

This is done because neither Perry and Enright, 2002, nor Karafyllidis and Thanailakis, 1997 specify the wind speed ranges to which their low/ medium/ high fire spread risk classifications apply.
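
The two-stage classification can be sketched as follows. This assumes wind speeds are expressed in metres per second and uses the commonly tabulated Beaufort thresholds; the names below are hypothetical, and `aemet_api.wind.beaufort_number` may use different units or boundary conventions.

```python
from bisect import bisect_right

# Lower bounds in m/s of Beaufort numbers 1 to 12 (assumed thresholds)
BEAUFORT_LOWER_BOUNDS_MS = [0.3, 1.6, 3.4, 5.5, 8.0, 10.8,
                            13.9, 17.2, 20.8, 24.5, 28.5, 32.7]


def beaufort_from_ms(speed_ms: float) -> int:
    """Beaufort number for a wind speed given in metres per second."""
    # The number of lower bounds at or below the speed is the Beaufort number
    return bisect_right(BEAUFORT_LOWER_BOUNDS_MS, speed_ms)


def speed_class_from_ms(speed_ms: float) -> str:
    """Low/ medium/ high class using the grouping listed above."""
    beaufort_no = beaufort_from_ms(speed_ms)
    if beaufort_no <= 2:
        return 'low'
    if beaufort_no <= 5:
        return 'medium'
    return 'high'


classes = [speed_class_from_ms(v) for v in (2.0, 6.0, 12.0)]
```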

[BeaufortDesc]: https://www.engineeringtoolbox.com/beaufort-wind-scale-d_184.html

In [None]:
pd.read_csv(OUTPUTS['wind_clean'])

In [None]:
try:
    ssite_wind_data
except NameError:
    ssite_wind_data = (
        pd.read_csv(OUTPUTS['wind_clean'])
        .assign(date=lambda df: pd.to_datetime(df['date']))
        .set_index(['sitename', 'station_id', 'date'])
    )   

In [None]:
ssite_wind_data['beaufort_no'] = (
    ssite_wind_data['ave_wind_speed'].transform(beaufort_number)
)

In [None]:
ssite_wind_data

Classify Beaufort numbers as low/ med/ high wind

In [None]:
def beaufort_to_class(beaufort_no: int) -> str:
    """Classify Beaufort number as low/ medium/ high wind"""
    if beaufort_no <= 2:
        return 'low'
    elif beaufort_no <= 5:
        return 'medium'
    return 'high'


def test_beaufort_to_class():
    assert beaufort_to_class(2) == 'low'
    assert beaufort_to_class(3) == 'medium'
    assert beaufort_to_class(5) == 'medium'
    assert beaufort_to_class(12) == 'high'
    
test_beaufort_to_class()

In [None]:
ssite_wind_data['wind_speed_class'] = (
    ssite_wind_data['beaufort_no'].transform(beaufort_to_class)
)

In [None]:
ssite_wind_data.head()

In [None]:
wind_speed_class_prob = (
    ssite_wind_data['wind_speed_class']
    .groupby(level='sitename').value_counts()
    .unstack().replace(np.nan, 0)
    .pipe(lambda df: df.divide(df.sum(1), axis=0))
    .loc[:, ['low', 'medium', 'high']]
)

wind_speed_class_prob

In [None]:
wind_speed_class_prob.to_csv(OUTPUTS['site_wind_speed_class_prob'])

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 7))
for i, ax in enumerate(axes.flat):
    if i < len(wind_speed_class_prob.index):
        # The 2x3 grid has room for 6 plots; hide any axes without a site
        data = wind_speed_class_prob.iloc[i]
        ax.bar(np.arange(3), data.values, tick_label=data.index)
        ax.set_title(data.name)
        ax.set_ylim([0, 1])
    else:
        ax.axis('off')
plt.tight_layout()

Probability of the wind speed being low, medium or high on any given day at each of the study sites.
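
One way a simulation might consume a row of this table is to draw a wind speed class for each simulated day. How AgroSuccess actually uses the output file is not described in this notebook, so the sketch below is only an illustration; the probabilities are made up and `sample_wind_class` is a hypothetical helper.

```python
import random
from typing import Dict


def sample_wind_class(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one wind speed class with probability given by `probs`."""
    classes = list(probs)
    weights = [probs[c] for c in classes]
    return rng.choices(classes, weights=weights, k=1)[0]


# Made-up probabilities for a single site, for illustration only
example_probs = {'low': 0.6, 'medium': 0.35, 'high': 0.05}
rng = random.Random(42)  # seeded so that repeated runs give the same draws
draws = [sample_wind_class(example_probs, rng) for _ in range(1000)]
```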

## Get wind direction probabilities for sites

In [None]:
wind_dir_probs = (
    ssite_wind_data['direction'].groupby(by='sitename').value_counts()
    .rename('num_obs_with_dir').reset_index('direction')
    .join(
        ssite_wind_data['direction'].groupby(level='sitename').count()
        .rename('total_num_obs')
    )
    .assign(dir_prob=lambda df: df['num_obs_with_dir'] / df['total_num_obs'])
    .set_index('direction', append=True)
    .loc[:, 'dir_prob']
    .sort_index()
    .swaplevel().unstack()
    .reindex(['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW'])
    .rename_axis('direction')
    .T
)
wind_dir_probs.to_csv(OUTPUTS['site_wind_dir_probs'])

In [None]:
wind_dir_probs

In [None]:
fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(12, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    if i < len(wind_dir_probs.index):
        ax.bar(x=np.arange(8), height=wind_dir_probs.iloc[i], tick_label=wind_dir_probs.columns)
        ax.set_title(wind_dir_probs.iloc[i].name)
    else:
        ax.axis('off')    
plt.tight_layout()

Wind direction probabilities for all study sites. These were obtained by counting the number of times the day's maximum gust was in each cardinal direction for each weather station, and normalising by the total number of observations for each study site.
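
The counting and normalising described here can be checked on a small example without pandas. The observations below are invented and `direction_probs` is a hypothetical helper; the notebook itself performs the equivalent calculation with `groupby` and `value_counts`.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def direction_probs(
    observations: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Per-site probability of each direction from (site, direction) pairs."""
    counts = defaultdict(Counter)
    for site, direction in observations:
        counts[site][direction] += 1
    # Normalise each site's counts by that site's total observations
    return {
        site: {d: n / sum(c.values()) for d, n in c.items()}
        for site, c in counts.items()
    }


# Invented observations pooled across the stations of each site
obs = [('Site A', 'N'), ('Site A', 'N'), ('Site A', 'SW'), ('Site A', 'W'),
       ('Site B', 'E'), ('Site B', 'E')]
probs = direction_probs(obs)
```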

## Cleanup

Delete intermediate files

In [None]:
for f in INTERMEDIATE.values():
    try:
        os.remove(f)
    except FileNotFoundError:
        logging.warning(f'Tried to delete {f} but could not find file')