## Predicting river levels via machine learning
Given historical data, the goal is to predict river heights up to 5 days in advance.
Available features: GSOD weather data (temperature, pressure, wind speed, precipitation) and past river heights.
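
One way to frame this task, sketched here under assumptions, is as supervised regression: each row holds the river height on the current and previous days, and the targets are the heights 1 to 5 days later. The helper name `make_supervised`, the number of lags, and the synthetic series are hypothetical and only illustrate the framing; they are not taken from the analysis in this notebook.

```python
import numpy as np
import pandas as pd

def make_supervised(series, n_lags=3, horizon=5):
    """Build lagged features and future targets from a daily series.

    Hypothetical helper, for illustration only.
    """
    frame = pd.DataFrame({'h_t': series})
    # Past heights as features: h_t-1, h_t-2, ...
    for lag in range(1, n_lags + 1):
        frame['h_t-%d' % lag] = series.shift(lag)
    # Future heights as targets: h_t+1 ... h_t+horizon
    for step in range(1, horizon + 1):
        frame['h_t+%d' % step] = series.shift(-step)
    # Rows at either end lack a full history or a full future; drop them
    return frame.dropna()

# Synthetic daily series standing in for one station's heights
dates = pd.date_range('2013-11-01', periods=20, freq='D')
heights = pd.Series(np.arange(20, dtype=float), index=dates)
table = make_supervised(heights, n_lags=3, horizon=5)
print(table.shape)
```

With 3 lags and a 5-day horizon, the first 3 and last 5 days of the series are dropped, which is worth keeping in mind when the record is short or has gaps.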


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Load and gather river data
River data exists for several measurement stations.

In [2]:
# (station name, three-letter abbreviation) pairs
sites = [('chiangsaen', 'csa'), ('luangprabang', 'lua'), ('chiangkhan', 'ckh'), ('vientiane', 'vie'),
         ('nongkhai', 'non'), ('paksane', 'pak'), ('nakhonphanom', 'nak'), ('mukdahan', 'muk'),
         ('pakse', 'pks'), ('stungtreng', 'str'), ('kratie', 'kra'),
         ('kompongcham', 'kom'), ('phnompenhbassac', 'ppb'), ('kohkhel', 'koh'),
         ('neakluong', 'nea'), ('prekkdam', 'pre'), ('tanchau', 'tch'), ('chaudoc', 'cdo')]

dirr='../HydrologySection/river_analytics/MRC_alldata/'

def extractCont(site):
    """Concatenate one station's height files in chronological order."""
    periods = ['NDJFMAM_2013_2014', 'JJASO_2014', 'NDJFMAM_2014_2015', 'JJASO_2015',
               'NDJFMAM_2015_2016', 'JJASO_2016', 'NDJFMAM_2016_2017']
    # Load each period file, flatten it, and join everything into a single 1-D array
    return np.concatenate([np.ravel(np.loadtxt(dirr + site + '_' + p + '.txt'))
                           for p in periods])

In [3]:
# Extract river heights: one column per station
df_heights = pd.DataFrame()
for name, _abbrev in sites:
    df_heights[name] = extractCont(name)
# Add a daily date column; it stays the last column so later cells can read it by position
df_heights['date'] = pd.date_range('11/01/2013', periods=len(df_heights), freq='D')
print(df_heights.describe())

        chiangsaen  luangprabang   chiangkhan    vientiane     nongkhai  \
count  1300.000000   1300.000000  1300.000000  1300.000000  1300.000000   
mean      3.296785      6.562392     6.624008     3.287962     3.985992   
std       0.843536      2.089908     2.000935     1.971468     2.155641   
min       1.820000      3.660000     3.350000     0.530000     0.970000   
25%       2.660000      5.195000     5.315000     1.970000     2.580000   
50%       3.260000      6.000000     6.120000     2.635000     3.250000   
75%       3.700000      7.392500     7.562500     4.105000     4.802500   
max       7.350000     15.760000    14.490000    11.000000    11.840000   

           paksane  nakhonphanom     mukdahan        pakse   stungtreng  \
count  1300.000000   1298.000000  1299.000000  1300.000000  1300.000000   
mean      5.185638      3.672804     3.807760     3.323900     4.142469   
std       2.399079      2.434351     2.257907     2.409166     1.833934   
min       2.070000      

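The counts differ between stations: nakhonphanom has 1298 entries and mukdahan 1299, while the other stations shown have 1300. So not every station has an entry for every single day. Let's take a closer look!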

In [5]:
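# Row (ent) and column (sta) positions of every missing value
ent, sta = np.where(pd.isnull(df_heights))
# Print the date (last column) of each missing entry; a date repeats once per station missing it
for ii in ent:
    print(df_heights.iloc[ii, -1])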

2014-11-20 00:00:00
2015-04-08 00:00:00
2015-05-21 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-26 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-27 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
2015-05-28 00:00:00
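
Because the loop above prints one line per missing cell, the same date repeats once for each station lacking a value. A more compact view, sketched here on a small synthetic frame (the station names `a` and `b` and all values are made up), counts the gaps per station and per date.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for df_heights: two stations plus a trailing date column
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, np.nan, 3.0, 4.0],
})
demo['date'] = pd.date_range('2015-05-25', periods=len(demo), freq='D')

stations = demo.drop(columns='date')
# Number of missing values in each station column
missing_per_station = stations.isnull().sum()
# Number of stations missing a value on each date, keeping only dates with gaps
missing_per_date = stations.isnull().sum(axis=1)
missing_per_date.index = demo['date']
missing_per_date = missing_per_date[missing_per_date > 0]

print(missing_per_station)
print(missing_per_date)
```

Applied to the real data, the same two summaries would show at a glance which dates are missing at many stations and which stations have isolated gaps.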
