# Predicting Reservoir Storage

My project's goal is to model and predict reservoir water levels in California. Using the previous year's temperature and precipitation data, I hoped to see how accurately I could predict reservoir storage and provide a range of possible values without access to the current water year's data.

This project matters given California's intensifying climate crisis. For data analysis and model training, I use Folsom Lake, the 10th largest reservoir in California and a key water supply for millions of its residents. My goal was to create a model that uses exogenous seasonal variables to predict seasonal water supply and its annual minimum. Folsom Lake has been drawing increasingly deep into its reserves during lengthening droughts, falling as low as 14% of its capacity in 2014. My ultimate goal is to predict the conditions that would leave Folsom Lake unusable, i.e. near empty and unavailable as a water supply for the millions of people, the agricultural companies, and the aquatic species that rely on it.

In [1]:
#import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
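# NOTE: ARMA was removed in statsmodels 0.13 in favor of statsmodels.tsa.arima.model.ARIMA;
# this import targets an older statsmodels release.
from statsmodels.tsa.arima_model import ARMA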
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA', FutureWarning)
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARIMA', FutureWarning)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit




## The Data

To predict Folsom Lake's water storage, we can use daily data from the US Bureau of Reclamation. Four additional variables could help as exogenous predictors: reservoir inflow, release, evaporation, and precipitation.

In [2]:
# header=7: the column names sit on the eighth line of each file; the lines above it are skipped.
precipitation = pd.read_csv('data/folsomlake/folsomlake_daily_precipitation.csv', header=7)  # total inches
storage = pd.read_csv('data/folsomlake/folsomlake_daily_storage.csv', header=7)  # acre-feet
release = pd.read_csv('data/folsomlake/folsomlake_daily_release.csv', header=7)  # 24-hour average, cfs
inflow = pd.read_csv('data/folsomlake/folsomlake_daily_inflow.csv', header=7)  # 24-hour average, cfs
evaporation = pd.read_csv('data/folsomlake/folsomlake_daily_evaporation.csv', header=7)  # 24-hour sum, cfs
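
Since storage and the four exogenous variables arrive as separate daily tables, a natural next step is to line them up on a shared date index. The following is only a rough sketch of how that could look: the column names `date` and `value` are assumptions (the real exports may label them differently), `combine_daily_series` is a hypothetical helper, and the demo numbers are made up.

```python
import pandas as pd

def combine_daily_series(frames, date_col="date", value_col="value"):
    """Join several single-variable daily tables into one date-indexed frame.

    `date_col` and `value_col` are hypothetical column names; the real
    CSV exports may label these columns differently.
    """
    columns = {}
    for name, frame in frames.items():
        series = frame.set_index(pd.to_datetime(frame[date_col]))[value_col]
        # Coerce non-numeric placeholders (e.g. "missing") to NaN.
        columns[name] = pd.to_numeric(series, errors="coerce")
    # Outer-join on the date index so no day is silently dropped.
    return pd.concat(columns, axis=1).sort_index()

# Toy stand-ins for two of the loaded tables (values are made up).
storage_demo = pd.DataFrame({"date": ["2014-01-01", "2014-01-02", "2014-01-03"],
                             "value": [170000, 169500, 169100]})
inflow_demo = pd.DataFrame({"date": ["2014-01-02", "2014-01-03"],
                            "value": [410, "missing"]})

daily = combine_daily_series({"storage": storage_demo, "inflow": inflow_demo})
print(daily)
```

Keeping the outer join means gaps in any one variable show up as NaN rather than shortening the storage series, which leaves the choice of how to fill or drop them for a later step.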

The 8-station Sierra Index tracks precipitation across the Northern Sierras, and is an accurate measure of the water coming into the Sacramento River watershed. The Sacramento River watershed is the largest watershed in California, and contains many of the key reservoirs we will be modelling.

Meanwhile, the 5-station Sierra Index tracks precipitation in the Southern Sierras, and is a measure of the water year of the San Joaquin River watershed, the second largest watershed in the state.

I load both indices to use as exogenous variables.

In [3]:
northsierra = pd.read_csv('data/8SI.csv')
southsierra = pd.read_excel('data/5SI.xlsx')
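
Because the precipitation indices describe a water year rather than a calendar year, daily storage needs a matching water-year label before the two can be compared. Below is a minimal sketch, assuming the usual California convention that a water year runs from October 1 to September 30 and is named for the year in which it ends. The `water_year` helper is hypothetical and the storage values are made up for illustration.

```python
import pandas as pd

def water_year(dates):
    """Label each date with its water year (Oct 1 to Sep 30, named for the ending year)."""
    dates = pd.DatetimeIndex(dates)
    # Dates in October through December belong to the following year's water year.
    return dates.year + (dates.month >= 10).astype(int)

# Made-up daily storage values in acre-feet, for illustration only.
demo_storage = pd.Series(
    [500000.0, 300000.0, 250000.0, 400000.0],
    index=pd.to_datetime(["2013-09-30", "2013-10-01", "2014-01-15", "2014-10-01"]),
    name="storage",
)

# Annual minimum storage per water year, one candidate prediction target.
annual_min = demo_storage.groupby(water_year(demo_storage.index)).min()
print(annual_min)
```

Grouping by water year keeps a single wet season and the drawdown that follows it in the same bucket, which a calendar-year grouping would split in two.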