## Class 496: weather observations and random forests

In this tutorial, we combine archived weather observations with a Random Forest (RF) classifier. Our goal is to use historical observations for Valparaiso to see whether we can predict the 12Z dry-bulb temperature from the 00Z observations alone.

The steps are:
- Obtain weather data for Valpo
- Decide which weather variables may be most important
- Produce one "post-processed" variable such as change in dry-bulb temperature over 6 hours
- Get the data into the right format (remove NaNs; correct dimensions)
- Train an RF classifier with half of the dataset; test with the other half
- Verify the performance with R-squared, mean-absolute-error (MAE), mean-squared-error (MSE), or similar
- Return to pick different variables, or "tune the dials" of the RF model.



### First, we need to download data from Synoptic, a repository for online weather observations in the US
You will need a token, which works like a password for downloading the data. JRL can send you one when prompted, but it should not be used outside of today's assignment. If you'd like your own token, visit the Synoptic website at synopticdata.com.

You will also need to pick a good weather station near here. Try choosing one via the GUI at www.mesowest.edu.

In [91]:
from synoptic.services import stations_timeseries
from datetime import datetime
import pandas as pd
import numpy as np

df = stations_timeseries(stid='KGRR', vars=['air_temp', 'wind_speed', 'dew_point_temperature'],
                         start=datetime(2021,10,20),
                         end=datetime(2022,10,20))



 🚚💨 Speedy Delivery from Synoptic API [timeseries]: https://api.synopticdata.com/v2/stations/timeseries?stid=KGRR&vars=air_temp,wind_speed,dew_point_temperature&start=202110200000&end=202210200000&token=🙈HIDDEN



In [92]:
df.drop(labels='dew_point_temperature_set_1', axis=1, inplace=True)
df.dropna(inplace=True)
df

date_time,air_temp,dew_point_temperature,wind_speed
2021-10-20 00:00:00+00:00,16.0,8.95,2.058
2021-10-20 00:05:00+00:00,16.0,8.95,2.058
2021-10-20 00:10:00+00:00,16.0,8.95,2.058
2021-10-20 00:15:00+00:00,15.0,8.95,2.058
2021-10-20 00:20:00+00:00,16.0,8.95,2.058
...,...,...,...
2022-10-19 23:45:00+00:00,3.0,-3.08,4.116
2022-10-19 23:50:00+00:00,3.0,-2.07,4.116
2022-10-19 23:53:00+00:00,3.3,-2.28,3.087
2022-10-19 23:55:00+00:00,3.0,-2.07,3.601


### Get the data tidy (e.g., remove NaNs or missing data), and list out the variables.

In [93]:
df0 = df[(df.index.hour == 0) & (df.index.minute == 0)]
df12 = df[(df.index.hour == 12) & (df.index.minute == 0)]

print(df0,df12)

                           air_temp  dew_point_temperature  wind_speed
date_time                                                             
2021-10-20 00:00:00+00:00      16.0                   8.95       2.058
2021-10-21 00:00:00+00:00      16.0                  12.98       2.572
2021-10-22 00:00:00+00:00       9.0                   6.97       6.173
2021-10-23 00:00:00+00:00       8.0                   6.99       0.000
2021-10-24 00:00:00+00:00       7.0                   5.99       0.000
...                             ...                    ...         ...
2022-10-16 00:00:00+00:00       6.0                   0.93       2.572
2022-10-17 00:00:00+00:00       7.0                   4.97       2.572
2022-10-18 00:00:00+00:00       5.0                   1.96       7.717
2022-10-19 00:00:00+00:00       4.0                   1.97       7.717
2022-10-20 00:00:00+00:00       3.0                  -2.07       3.601

[358 rows x 3 columns]                            air_temp  dew_point_temper
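
The 00Z and 12Z frames only work as a feature/label pair if row *i* in each refers to the same day; a missing observation in either one would silently shift the pairing. Below is a minimal sketch of aligning the two by calendar date. The frames `obs00` and `obs12` and their values are made up for illustration and are not the station data.

```python
import pandas as pd

# Made-up 00Z observations; 2021-10-22 is deliberately missing.
obs00 = pd.DataFrame(
    {"air_temp": [16.0, 16.0, 8.0]},
    index=pd.to_datetime(
        ["2021-10-20 00:00", "2021-10-21 00:00", "2021-10-23 00:00"], utc=True
    ),
)
# Made-up 12Z observations; 2021-10-21 is deliberately missing.
obs12 = pd.DataFrame(
    {"air_temp": [8.0, 5.0, 6.0]},
    index=pd.to_datetime(
        ["2021-10-20 12:00", "2021-10-22 12:00", "2021-10-23 12:00"], utc=True
    ),
)

# Re-index both frames by calendar date so rows can be matched day by day.
by_day00 = obs00.set_index(obs00.index.normalize())
by_day12 = obs12.set_index(obs12.index.normalize())

# An inner join keeps only the days present in both frames.
paired = by_day00.join(by_day12, how="inner", lsuffix="_00z", rsuffix="_12z")
print(paired)
```

Only the two days that appear in both frames survive the join, so each row holds a 00Z value and the 12Z value from the same date.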

### Convert the date and time into a format the RF will understand
For this, we will [take some assistance](https://www.mikulskibartosz.name/time-in-machine-learning/) from an online blog:

```python
# Here, df is the dataframe with our observations
def convert_time_to_angles(df):
    # Get the ISO week number - depends on the format of the timestamps
    week_num = df.index.isocalendar().week.to_numpy(dtype=float)
    # Decompose into sines and cosines: one full circle (2*pi) per 52 weeks.
    df["week_sin"] = np.sin(week_num * (2 * np.pi / 52))
    df["week_cos"] = np.cos(week_num * (2 * np.pi / 52))

    # Time of day is important too: one full circle per 24 hours.
    hours = df.index.hour + df.index.minute / 60
    df["time_sin"] = np.sin(hours * (2 * np.pi / 24))
    df["time_cos"] = np.cos(hours * (2 * np.pi / 24))
    return df
```
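
The point of the sine/cosine pair is that the end of one cycle lands next to the start of the next, which a raw week number (52 followed by 1) does not do. Here is a minimal sketch, assuming a 52-week year and the conventional angle of 2π × value / period. The helper name `encode_cyclic` is hypothetical.

```python
import numpy as np

def encode_cyclic(values, period):
    """Map values with the given period onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

weeks = np.array([1, 26, 52])
week_sin, week_cos = encode_cyclic(weeks, period=52)
points = np.column_stack([week_sin, week_cos])

# Straight-line distance between encoded weeks: 52 vs 1, and 26 vs 1.
dist_52_to_1 = np.linalg.norm(points[2] - points[0])
dist_26_to_1 = np.linalg.norm(points[1] - points[0])
print(dist_52_to_1, dist_26_to_1)
```

Week 52 ends up close to week 1, while week 26 sits on the opposite side of the circle, which is the behaviour we want a tree-based model to see.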

Now implement this yourself in our context.

In [94]:
week_num = df0.index.isocalendar().week.to_numpy(dtype=float)
hr0 = df0.index.hour.to_numpy()

# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning.
# Each angle covers one full circle (2*pi) per 52 weeks or per 24 hours.
df0 = df0.assign(
    weeks_sin=np.sin(week_num * (2 * np.pi / 52)),
    weeks_cos=np.cos(week_num * (2 * np.pi / 52)),
    hours_sin=np.sin(hr0 * (2 * np.pi / 24)),
    hours_cos=np.cos(hr0 * (2 * np.pi / 24)),
)
df0

date_time,air_temp,dew_point_temperature,wind_speed,weeks_sin,weeks_cos,hours_sin,hours_cos
2021-10-20 00:00:00+00:00,16.0,8.95,2.058,-0.935016,0.354605,0.0,1.0
2021-10-21 00:00:00+00:00,16.0,12.98,2.572,-0.935016,0.354605,0.0,1.0
2021-10-22 00:00:00+00:00,9.0,6.97,6.173,-0.935016,0.354605,0.0,1.0
2021-10-23 00:00:00+00:00,8.0,6.99,0.000,-0.935016,0.354605,0.0,1.0
2021-10-24 00:00:00+00:00,7.0,5.99,0.000,-0.935016,0.354605,0.0,1.0
...,...,...,...,...,...,...,...
2022-10-16 00:00:00+00:00,6.0,0.93,2.572,-0.970942,0.239316,0.0,1.0
2022-10-17 00:00:00+00:00,7.0,4.97,2.572,-0.935016,0.354605,0.0,1.0
2022-10-18 00:00:00+00:00,5.0,1.96,7.717,-0.935016,0.354605,0.0,1.0
2022-10-19 00:00:00+00:00,4.0,1.97,7.717,-0.935016,0.354605,0.0,1.0


### Let's train the RF classifier. Can we pass in a 6-hour temperature change as a way to provide "memory" to the RF model?
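
One way to build such a "memory" variable is to subtract the temperature observed six hours earlier from the current one. The sketch below uses synthetic hourly data (illustrative values, not the station record) and the hypothetical column name `temp_change_6h`. It shifts by clock time rather than by row count, on the assumption that real observations are not evenly spaced.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures rising 0.5 per hour (illustrative only).
times = pd.date_range(
    "2021-10-20 00:00", periods=24, freq=pd.Timedelta(hours=1), tz="UTC"
)
obs = pd.DataFrame({"air_temp": np.linspace(10.0, 21.5, 24)}, index=times)

# Shift the series forward by 6 hours of clock time, so the value at time t
# becomes the temperature observed at t - 6 h, then line it up with obs.
six_hours_ago = obs["air_temp"].shift(freq=pd.Timedelta(hours=6)).reindex(obs.index)
obs["temp_change_6h"] = obs["air_temp"] - six_hours_ago

# The first six hours have no earlier observation to compare with.
print(obs.head(8))
```

Rows without an observation exactly six hours earlier come out as NaN, so they would need to be dropped before training.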

In [95]:
# Demonstrate turning a boolean test into 0/1 integers
x = np.arange(10)
y = x > 5
print(np.array(y,dtype=int))

# Label each 12Z observation: 1 if the temperature exceeds 10, else 0.
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning.
df12 = df12.assign(exceed_10=(df12["air_temp"] > 10).astype(int))
df12

[0 0 0 0 0 0 1 1 1 1]


date_time,air_temp,dew_point_temperature,wind_speed,exceed_10
2021-10-20 12:00:00+00:00,8.0,8.00,2.058,0
2021-10-21 12:00:00+00:00,15.0,13.99,4.116,1
2021-10-22 12:00:00+00:00,5.0,3.99,0.000,0
2021-10-23 12:00:00+00:00,6.0,6.00,2.058,0
2021-10-24 12:00:00+00:00,-1.0,-1.00,0.000,0
...,...,...,...,...
2022-10-15 12:00:00+00:00,3.0,0.97,3.601,0
2022-10-16 12:00:00+00:00,5.0,2.97,1.543,0
2022-10-17 12:00:00+00:00,4.0,1.97,5.659,0
2022-10-18 12:00:00+00:00,3.0,-1.06,8.231,0


In [96]:
from sklearn.model_selection import train_test_split as tts
f_train, f_test, l_train, l_test = tts(df0.values, df12['exceed_10'].values, test_size=0.5)
print(f_train.shape, f_test.shape, l_train.shape, l_test.shape)

(179, 7) (179, 7) (179,) (179,)
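
By default `train_test_split` shuffles differently on every run, so scores change from one execution to the next. A minimal sketch of a repeatable split on synthetic data follows. The arrays are stand-ins, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))          # synthetic predictors
labels = (features[:, 0] > 0.8).astype(int)   # synthetic, imbalanced 0/1 labels

f_train, f_test, l_train, l_test = train_test_split(
    features, labels,
    test_size=0.5,
    random_state=42,   # fixes the shuffle so the split is repeatable
    stratify=labels,   # keeps the share of 1s similar in both halves
)
print(f_train.shape, f_test.shape, l_train.mean(), l_test.mean())
```

Stratifying is optional, but it helps when one class is much rarer than the other, as can happen with a temperature threshold.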


### Evaluate (verify) performance using a few measures.
Check the sklearn documentation for modules you can import to help (`sklearn.metrics`).


In [97]:
from sklearn.ensemble import RandomForestClassifier as RFC
rfc=RFC(n_estimators=100)

rfc.fit(f_train,np.ravel(l_train))
fcst = rfc.predict(f_test)

pd.DataFrame(fcst).head(20)

Unnamed: 0,0
0,0
1,1
2,1
3,1
4,1
5,0
6,0
7,1
8,0
9,0


In [98]:
pd.DataFrame(l_test).head(20)

Unnamed: 0,0
0,0
1,1
2,1
3,1
4,1
5,1
6,0
7,1
8,0
9,0


In [99]:
from sklearn import metrics
print(metrics.r2_score(l_test,fcst))

0.8380718531920393
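
R-squared is a regression measure. For 0/1 forecasts, accuracy and a confusion matrix are usually easier to interpret. As a small sketch, here are those measures applied to the ten forecast/observed pairs displayed above. The variable names `fcst_head` and `obs_head` are made up for this example.

```python
import numpy as np
from sklearn import metrics

# The ten forecast/observed pairs displayed above.
fcst_head = np.array([0, 1, 1, 1, 1, 0, 0, 1, 0, 0])
obs_head = np.array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0])

accuracy = metrics.accuracy_score(obs_head, fcst_head)
# Rows are observed classes (0, 1); columns are forecast classes (0, 1).
table = metrics.confusion_matrix(obs_head, fcst_head, labels=[0, 1])
# For 0/1 labels, mean absolute error is the fraction of wrong forecasts.
mae = metrics.mean_absolute_error(obs_head, fcst_head)
print(accuracy, mae)
print(table)
```

The single miss in this sample is a day observed above the threshold but forecast below it, which shows up in the lower-left cell of the matrix.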


### Iterate!
At this point, if we find ourselves repeating code, or copying, pasting, and editing our way into spaghetti, we can turn the loose code into functions. Browse the cells above to see whether that is worth doing before we move on to iterating on the model.

#### What can we change?
For instance:
- Variables
- RF forest size
- RF parameters in documentation (here be dragons!)
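
One possible shape for that iteration is to wrap the split, fit, and scoring steps in a single function and loop over forest sizes. This is only a sketch on synthetic stand-in data. The function name `fit_and_score`, the seeds, and the list of sizes are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(features, labels, n_estimators=100, seed=0):
    """Split the data in half, train a forest, return (model, test accuracy)."""
    f_train, f_test, l_train, l_test = train_test_split(
        features, labels, test_size=0.5, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(f_train, l_train)
    return model, accuracy_score(l_test, model.predict(f_test))

# Synthetic stand-in data: only the first column carries any signal.
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 3))
labels = (features[:, 0] > 0).astype(int)

scores = {}
for size in (10, 50, 200):
    model, scores[size] = fit_and_score(features, labels, n_estimators=size)
    # feature_importances_ hints at which variables the forest relies on.
    print(size, round(scores[size], 3), model.feature_importances_.round(2))
```

Swapping in the real feature columns and labels lets you compare variable choices and forest sizes without copying cells.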