# Modeling Weather Geographies using XGBoost

In this notebook, we use `TMAX` data aggregated from weather stations around the world to build a machine learning model that predicts the daily maximum temperature at any given latitude and longitude. The model takes three continuous predictors (latitude, longitude, and elevation) and provides an estimated `TMAX` for a given day of the year.

Finally, we will show how to save this model to your Watson Studio filesystem to be used for online scoring.

<div class="alert alert-block alert-info"> Note: You will need to install the Basemap and GEOS libraries to dynamically produce output maps. For your convenience, the maps have been pre-rendered in this sample notebook.</div>

## Table of Contents
This notebook contains these main sections:

1. [Import Libraries](#Import_Libraries)
2. [The Data](#The_Data)
3. [The Model](#The_Model)
4. [Data Visualization](#Data_Visualization)
5. [Save Model to Watson Studio Filesystem](#Save_Model_to_Watson_Studio_Filesystem)
6. [Predict on New Data](#Predict_on_New_Data)
7. [Summary](#Summary)

<a id='Import_Libraries'></a>
## Import Libraries
Run the cell below to import the libraries used in this notebook.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
#from mpl_toolkits.basemap import Basemap, maskoceans

import numpy as np
import pandas as pd

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

import xgboost as xgb

<a id='The_Data'></a>
## The Data
The dataset was created using data from the Global Historical Climatology Network. For each weather station, we have averaged the `TMAX` over the station's entire history on four dates: the spring and autumn equinoxes and the summer and winter solstices.

In [2]:
df_data_1 = pd.read_csv("https://raw.githubusercontent.com/IBMDataScience/DSX-DemoCenter/master/weatherGeographies/data_assets/seasonal_data.csv")

Let's pull out the data from Boston's Logan International Airport:

In [3]:
df_data_1[df_data_1['name'].str.contains('BOSTON LOGAN')]

Unnamed: 0,id,latitude,longitude,elevation,country,name,state,21-Mar,21-Jun,21-Sep,21-Dec
27094,USW00014739,42.3606,-71.0106,3.7,US,BOSTON LOGAN INTL AP,MA,7.690244,25.507407,21.975309,3.892593


As shown above, we have the latitude-longitude coordinates and the elevation of the station. The last four columns contain the average daily maximum temperatures, in degrees Celsius, over the history of the station. For example, the average `TMAX` on the 21st of June over all the recorded history for Logan Airport's weather station is 25.5$^{\circ}$C.

<a id='The_Model'></a>
## The Model
Let's first split the data by columns into features and response variables, and then further into training and testing sets.

In [4]:
x = df_data_1[['elevation','latitude','longitude']]
y = df_data_1[['21-Mar','21-Jun','21-Sep','21-Dec']]

x_init, x_test, y_init, y_test = train_test_split(x, y['21-Jun'], test_size=.25)
x_train, x_val, y_train, y_val = train_test_split(x_init, y_init, test_size=.25)
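
As a quick check on how much data each set receives after the two successive splits, the arithmetic can be sketched in plain Python. The helper name `split_fractions` is hypothetical and not part of the notebook, and the real row counts will differ slightly because `train_test_split` works with whole rows.

```python
def split_fractions(test_size=0.25, val_size=0.25):
    """Fractions of the full dataset left after two successive hold-outs."""
    remaining = 1.0 - test_size      # share kept after the first split
    val = remaining * val_size       # validation share of the full data
    train = remaining - val          # share left for training
    return {"train": train, "validation": val, "test": test_size}

fractions = split_fractions()
print(fractions)
```

In other words, roughly 56% of the stations are used for training, 19% for validation, and 25% for testing.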

We will fit an **XGBoost** model. XGBoost is an advanced implementation of the gradient boosting algorithm. It uses its own data structure, called a `DMatrix`, to store the training and testing data. We create one `DMatrix` for each of our three sets below.

In [5]:
dtrain = xgb.DMatrix(x_train, label=y_train)
dval = xgb.DMatrix(x_val, label=y_val)
dtest = xgb.DMatrix(x_test, label = y_test)

val_model = xgb.train(params = {'eval_metric':'mae'}, 
                      dtrain = dtrain, 
                      num_boost_round=200,
                      evals=[(dval, "Test")],
                      early_stopping_rounds=1,
                      )

[0]	Test-mae:17.3622
Will train until Test-mae hasn't improved in 1 rounds.
[1]	Test-mae:12.1576
[2]	Test-mae:8.5344
[3]	Test-mae:6.03006
[4]	Test-mae:4.3361
[5]	Test-mae:3.2276
[6]	Test-mae:2.51321
[7]	Test-mae:2.06199
[8]	Test-mae:1.78793
[9]	Test-mae:1.62618
[10]	Test-mae:1.51968
[11]	Test-mae:1.45972
[12]	Test-mae:1.42318
[13]	Test-mae:1.38381
[14]	Test-mae:1.3684
[15]	Test-mae:1.35151
[16]	Test-mae:1.34271
[17]	Test-mae:1.33646
[18]	Test-mae:1.32589
[19]	Test-mae:1.32184
[20]	Test-mae:1.31396
[21]	Test-mae:1.30195
[22]	Test-mae:1.3003
[23]	Test-mae:1.29365
[24]	Test-mae:1.28396
[25]	Test-mae:1.27956
[26]	Test-mae:1.27445
[27]	Test-mae:1.27331
[28]	Test-mae:1.27344
Stopping. Best iteration:
[27]	Test-mae:1.27331



Unlike our `scikit-learn` implementation of gradient boosting, XGBoost can determine the number of boosting rounds automatically: with `early_stopping_rounds` set, training stops as soon as the chosen evaluation metric (in our case, *mean absolute error*) stops improving on the evaluation set. Below we print the best number of iterations to use.

In [6]:
val_model.best_ntree_limit

28

In [7]:
val_model.best_iteration

27
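
The stopping rule can be illustrated without XGBoost. The sketch below is a simplified stand-in, not XGBoost's internal implementation: the hypothetical helper `early_stopping` scans a list of scores for a metric to be minimized and stops once `patience` rounds have passed without improving on the best score. The scores are the `Test-mae` values printed in the training log above.

```python
def early_stopping(scores, patience=1):
    """Return (best_index, best_score), stopping after `patience` rounds without improvement."""
    best_index, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score < best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            break  # no improvement for `patience` rounds
    return best_index, best_score

# Test-mae values copied from the training log, rounds 0 to 28
test_mae = [17.3622, 12.1576, 8.5344, 6.03006, 4.3361, 3.2276, 2.51321,
            2.06199, 1.78793, 1.62618, 1.51968, 1.45972, 1.42318, 1.38381,
            1.3684, 1.35151, 1.34271, 1.33646, 1.32589, 1.32184, 1.31396,
            1.30195, 1.3003, 1.29365, 1.28396, 1.27956, 1.27445, 1.27331,
            1.27344]

best_round, best_mae = early_stopping(test_mae, patience=1)
print(best_round, best_mae)
```

The sketch picks round 27, which agrees with `best_iteration` above. In this run, `best_ntree_limit` is one larger because it counts trees rather than giving a zero-based round index.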

In [8]:
model = xgb.train(params = {'eval_metric':'mae'}, 
                  dtrain = dtrain, 
                  num_boost_round=26,
                  evals=[(dtest, "Test")],
                  verbose_eval = False)
y_pred = model.predict(dtest)
print("Mean Absolute Error: {}\nR^2 value: {}".format(mean_absolute_error(y_test,y_pred),r2_score(y_test,y_pred)))

Mean Absolute Error: 1.2920784997
R^2 value: 0.834500839387


Not bad! Remember that we are dealing with degrees Celsius. Our mean absolute error in this case is about 1.3 degrees. We also have a strong $R^2$ value of about 0.83.
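
To make the two metrics concrete, here is a minimal sketch that computes them by hand. The function names `mae_by_hand` and `r2_by_hand` are hypothetical, and the three temperatures are made-up illustrative numbers, not values from the dataset.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """R^2: 1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative temperatures in degrees Celsius (not from the dataset)
toy_true = [20.0, 25.0, 30.0]
toy_pred = [21.0, 24.0, 31.0]

toy_mae = mae_by_hand(toy_true, toy_pred)
toy_r2 = r2_by_hand(toy_true, toy_pred)
print(toy_mae, toy_r2)
```

Every toy prediction is off by exactly one degree, so the mean absolute error is 1 degree, while $R^2$ compares the squared errors (3 in total) with the spread of the true values around their mean (50 in total).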

Now let's fit the models on the entirety of the data, and then produce a grid of predicted `TMAX` values to visualize on a map.

In [9]:
june_data = xgb.DMatrix(x, label=y['21-Jun'])
xgb_jun = xgb.train(params = {}, dtrain = june_data, num_boost_round=26)

And now the model for December 21:

In [10]:
dec_data = xgb.DMatrix(x, label=y['21-Dec'])
xgb_dec = xgb.train(params = {}, dtrain = dec_data, num_boost_round=26)

<a id='Data_Visualization'></a>
## Data Visualization
Let's pull a matrix of elevation data. The data contains elevations at every degree of latitude in [-90,90] and at every 2 degrees of longitude in [-180,180). We'll provide the latitude, longitude, and elevation from this matrix to the model to create a matrix of the same size that contains predicted temperatures.

In [11]:
z = pd.DataFrame(np.empty([181,180]))
elevations = pd.read_csv("https://raw.githubusercontent.com/IBMDataScience/DSX-DemoCenter/master/weatherGeographies/data_assets/elevation.csv", index_col=0)

# Make sure that the matrices are indexed by (lon,lat) values
elevations.columns = elevations.columns.astype(int)
z.columns = elevations.columns
z.index = elevations.index
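
The `[181,180]` shape passed to `np.empty` follows from the grid spacing described above. A minimal check, using the same `range` calls as the plotting code in the next section:

```python
latitudes = list(range(-90, 91, 1))      # -90 to 90 inclusive, every degree
longitudes = list(range(-180, 180, 2))   # -180 to 178, every 2 degrees

n_lat, n_lon = len(latitudes), len(longitudes)
print(n_lat, n_lon, n_lat * n_lon)
```

Each map therefore needs one prediction for each of the 32,580 grid cells.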

### Plots
We'll use `matplotlib` Basemap to plot our data. First, we should fill our empty `z` temperature matrix with the predicted temperatures. Then we must flip the matrix, as Basemap wants 90$^{\circ}$S to be the first row in the matrix.

<div class="alert alert-block alert-info"> If you have installed the Basemap and GEOS libraries, uncomment the Basemap import at the top of the notebook and copy the code below into a Python cell to dynamically produce an output map.</div>

```python
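# Requires the Basemap import: from mpl_toolkits.basemap import Basemap, maskoceans
feature_order = ['elevation', 'latitude', 'longitude']  # column order used in training
for lon in range(-180,180,2):
    for lat in reversed(range(-90,91,1)):
        point = pd.DataFrame({
            "latitude": [lat],
            "longitude": [lon],
            "elevation": [elevations[lon][lat]]
        })[feature_order]  # reorder so the feature names match the trained model
        # predict returns a one-element array; store the scalar by label
        z.loc[lat, lon] = xgb_jun.predict(xgb.DMatrix(point))[0]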

plt.rcParams['figure.figsize'] = (15,15)
m = Basemap()
lon, lat = np.meshgrid(list(range(-180,180,2)),list(range(-90,91,1)))
x1,y1 = m(lon,lat)
m.drawcoastlines()
m.drawstates()
m.drawcountries()
m.drawmapboundary()
z1 = maskoceans(x1,y1,np.flip(np.array(z),axis=0))
cs = m.contourf(x1,y1,z1, 15)
plt.show()
```
![](https://github.com/IBMDataScience/DSX-DemoCenter/raw/master/weatherGeographies/notebooks/static/jun21.png)

This is a fairly predictable distribution of temperatures for the 21st of June. Africa and the southern United States are scorching hot, while Antarctica is frigidly cold. Elevation has also proved to be important: the Himalayas and Tibet are predicted to be colder than other regions at the same latitude, just as expected. It would be interesting to see how our map would differ if we used 21st of December data:

```python
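# Requires the Basemap import: from mpl_toolkits.basemap import Basemap, maskoceans
feature_order = ['elevation', 'latitude', 'longitude']  # column order used in training
for lon in range(-180,180,2):
    for lat in reversed(range(-90,91,1)):
        point = pd.DataFrame({
            "latitude": [lat],
            "longitude": [lon],
            "elevation": [elevations[lon][lat]]
        })[feature_order]  # reorder so the feature names match the trained model
        # predict returns a one-element array; store the scalar by label
        z.loc[lat, lon] = xgb_dec.predict(xgb.DMatrix(point))[0]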

m = Basemap()
lon, lat = np.meshgrid(list(range(-180,180,2)),list(range(-90,91,1)))
x1,y1 = m(lon,lat)
m.drawcoastlines()
m.drawstates()
m.drawcountries()
m.drawmapboundary()
z1 = maskoceans(x1,y1,np.flip(np.array(z),axis=0))
cs = m.contourf(x1,y1,z1, 15)
plt.show()
```
![](https://github.com/IBMDataScience/DSX-DemoCenter/raw/master/weatherGeographies/notebooks/static/dec21.png)

The model has done an excellent job of estimating the temperatures on 21-Dec. As expected, the southern hemisphere is in summer, and thus hot, while North America and Europe are in winter.
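
The nested loops in the plotting code call `predict` once per grid cell. One possible alternative is to collect every cell into a single table, predict once, and fold the flat result back into a latitude-by-longitude grid. The sketch below shows only that bookkeeping in plain Python; `build_grid_rows` and `reshape_predictions` are hypothetical helpers, and the elevation lookup and the predictions are placeholders rather than the real data or model.

```python
from itertools import product

FEATURE_ORDER = ["elevation", "latitude", "longitude"]  # order used in training

def build_grid_rows(latitudes, longitudes, elevation_at):
    """One (elevation, latitude, longitude) row per grid cell, latitude-major."""
    return [(elevation_at(lat, lon), lat, lon)
            for lat, lon in product(latitudes, longitudes)]

def reshape_predictions(flat, n_lat, n_lon):
    """Fold a flat, latitude-major list of predictions back into n_lat rows."""
    return [flat[i * n_lon:(i + 1) * n_lon] for i in range(n_lat)]

# Tiny demonstration with stand-ins for the elevation lookup and the model
lats, lons = [-90, 0, 90], [-180, 0]
rows = build_grid_rows(lats, lons, elevation_at=lambda lat, lon: 0.0)
fake_predictions = [30.0 - abs(lat) / 3.0 for _, lat, _ in rows]  # placeholder, not a real model
grid = reshape_predictions(fake_predictions, len(lats), len(lons))
print(grid)
```

In the notebook, the rows could be wrapped as `pd.DataFrame(rows, columns=FEATURE_ORDER)`, passed through `xgb.DMatrix`, and scored with a single `xgb_jun.predict` call, with `elevation_at` reading from the `elevations` matrix.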

<a id='Save_Model_to_Watson_Studio_Filesystem'></a>
## Save Model to Watson Studio Filesystem
We can now save `XGBoost` models to the Watson Studio filesystem for publishing, scoring, deployment, and evaluations. First, import the `save` function from the `dsx_ml.ml` library. The `save` function takes a few arguments, which are shown in the calls below.

In [12]:
from dsx_ml.ml import save

Using TensorFlow backend.


Now we can save both the June 21 and December 21 models.

In [13]:
save(model = xgb_jun,
     name = 'XGBJune21',
     x_test = x,
     y_test = pd.DataFrame(y['21-Jun']),
     algorithm_type = 'Regression',
     params = {})

{'path': '/user-home/999/DSX_Projects/dsx-samples/models/XGBJune21/1',
 'scoring_endpoint': 'https://dsxl-api/v3/project/score/Python27/xgboost-0.7/dsx-samples/XGBJune21/1'}

In [14]:
save(model = xgb_dec,
     name = 'XGBDec21',
     x_test = x,
     y_test = pd.DataFrame(y['21-Dec']),
     algorithm_type = 'Regression',
     params = {})

{'path': '/user-home/999/DSX_Projects/dsx-samples/models/XGBDec21/1',
 'scoring_endpoint': 'https://dsxl-api/v3/project/score/Python27/xgboost-0.7/dsx-samples/XGBDec21/1'}

### Model Metadata
The model will be stored in the models directory in your Watson Studio Project. Each model is stored as a directory, in which the model artifact and metadata are stored. The metadata is stored as a JSON file, which we can open and display.

In [15]:
import json
import os

uid = os.environ['DSX_USER_ID']
proj = os.environ['DSX_PROJECT_NAME']

with open('/user-home/{}/DSX_Projects/{}/models/XGBDec21/metadata.json'.format(uid,proj),'r') as infile:
    metadata_dict = json.load(infile)

print("Runtime: {}".format(metadata_dict['runtime']))
print("Model Type: {}".format(metadata_dict['type']))
print("Algorithm: {}".format(metadata_dict['algorithm']))

print("Feature(s):")
for feature in metadata_dict['features']:
    print('    '+feature['name'])

print("Latest Model Version: {}".format(metadata_dict['latestModelVersion']))
print("Label(s):")
for label in metadata_dict['labelColumns']:
    print('    '+label['name'])

Runtime: Python27
Model Type: xgboost-0.7
Algorithm: Booster
Feature(s):
    elevation
    latitude
    longitude
Latest Model Version: 1
Label(s):
    21-Dec


<a id='Predict_on_New_Data'></a>
## Predict on New Data

Let's make some predictions using new data. Below we have gathered the latitude, longitude, and elevation data for the cities of Chicago, IL and Miami, FL.

In [16]:
chicago_data = {
    "elevation" : 200.6,
    "latitude" : 41.995,
    "longitude" : -87.9336
}

miami_data = {
    "elevation" : 1,
    "latitude" : 25.7616798,
    "longitude" : -80.1917902
}

We can call the `predict` function of each model and print the results below:

In [17]:
new_data = xgb.DMatrix(pd.DataFrame([chicago_data, miami_data]))

jun21_temps = xgb_jun.predict(new_data)
dec21_temps = xgb_dec.predict(new_data)


output_str = (u'On June 21, it is predicted to be ' +
    str(jun21_temps[0].round(1)) + 
    u'\N{DEGREE SIGN} C in Chicago, and '+ 
    str(jun21_temps[1].round(1)) + 
    u'\N{DEGREE SIGN} C in Miami\n' + 
    u'On December 21, it is predicted to be ' + 
    str(dec21_temps[0].round(1)) + 
    u'\N{DEGREE SIGN} C in Chicago, and ' + 
    str(dec21_temps[1].round(1)) +
    u'\N{DEGREE SIGN} C in Miami')


print(output_str)

On June 21, it is predicted to be 27.0° C in Chicago, and 31.9° C in Miami
On December 21, it is predicted to be 1.6° C in Chicago, and 24.3° C in Miami
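
The string concatenation above can also be written more compactly with `str.format`. A minimal sketch, in which `format_forecast` is a hypothetical helper and the temperatures are the rounded values from the output above:

```python
def format_forecast(date_label, temps_by_city):
    """Build one sentence from (city, temperature in Celsius) pairs."""
    parts = [u"{:.1f}\N{DEGREE SIGN} C in {}".format(temp, city)
             for city, temp in temps_by_city]
    return u"On {}, it is predicted to be {}".format(date_label, u", and ".join(parts))

june_line = format_forecast("June 21", [("Chicago", 27.0), ("Miami", 31.9)])
december_line = format_forecast("December 21", [("Chicago", 1.6), ("Miami", 24.3)])
print(june_line)
print(december_line)
```

With the real models, the pairs would be built from the prediction arrays, for example `("Chicago", float(jun21_temps[0]))`.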


<a id='Summary'></a>
## Summary
In this notebook you learned how to create an XGBoost `Booster` model, create some data visualizations, and save the model in the Watson Studio environment.

<div class="alert alert-block alert-info">Note: To save resources and get the best performance, please use the code below to stop the kernel before exiting your notebook.</div>

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<hr>
Copyright &copy; IBM Corp. 2017. Released as licensed Sample Materials.