In [1]:
%matplotlib inline

# Forcing mechanisms for drifters entering and exiting Galveston Bay

An ongoing area of interest for response efforts at the Texas General Land Office is the set of conditions under which oil may exit or enter a given bay. Here, we limit our scope to Galveston Bay and use statistics to find relationships between drifter entrances/exits and potential forcing mechanisms.

## Relevant previous effort

This work builds on existing shelf model output (DJ's [20 year hindcast](http://barataria.tamu.edu:8080/thredds/dodsC/NcML/txla_hindcast_agg)) and on Dongyu's effort to blend the "coarse" resolution Galveston Bay model with the shelf model, which created a blended model product for seamlessly running drifters. The drifter simulations here are run using the blended model product.

## Run the drifters

The drifters were run on hafen (`/raid/home/kthyng/projects/bay/`) using the run.py script. Pertinent simulation details:

* Output drifter locations every 15 minutes
* Ran forward/backward for 14 days
* Circulation model output is hourly
* Surface drifters
* No subgrid diffusion
* 300 meter initial distance between drifters
* Drifters started in uniform array within bay
* Simulations are started from 4 dates:
 * 2010-02-01: winter winds, high river discharge
 * 2011-02-01: winter winds, low river discharge
 * 2010-07-01: summer winds, high river discharge
 * 2011-07-01: summer winds, low river discharge
* Simulations are run forward from the dates listed above, and backward from 2 weeks after the dates listed above. That is, the forward- and backward-running simulations cover the same time period but represent drifters exiting and entering the domain, respectively.
* New simulations are started every 4 hours for 2 weeks, for a total of 6 * 14 = 84 simulations for each set (i.e. season and run direction).
* Each simulation has 13,340 drifters, so each set (i.e. season and run direction) has 13,340 * 84 = 1,120,560 drifters, or over 1.1 million.
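
The counts in the list above can be checked with a short sketch. It assumes that start times fall every 4 hours on a 14-day window with the end of the window excluded; the variable names are illustrative and are not taken from `run.py`.

```python
from datetime import datetime, timedelta

# Assumption: simulations start every 4 hours over a 14-day window,
# and the end of the window itself is not a start time.
start = datetime(2010, 2, 1)
window = timedelta(days=14)
step = timedelta(hours=4)

start_times = []
t = start
while t < start + window:
    start_times.append(t)
    t += step

# Backward-running simulations are referenced to 2 weeks after the start date.
backward_refdate = start + window

drifters_per_simulation = 13340  # value quoted in the list above
n_simulations = len(start_times)
n_drifters = n_simulations * drifters_per_simulation

print(n_simulations)     # 84
print(n_drifters)        # 1120560
print(backward_refdate)  # 2010-02-15 00:00:00
```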

Note that the appropriate file names for these simulations fit the pattern: `2010-02-01_forward_14days_dx300` (or backward).

The simulations have already been run using:

    python2 run.py > logs/[etc] &

The drifter tracks are stored on hafen in `tracks`.

The figure below shows the initial drifter locations.
![](figures/starting_points_bay_dx300.png)

## When are drifters moving between the bay and the shelf?

### Run the analysis
This analysis was run with `calcs.io()`, with the refdates set in turn to each forward-running simulation start date (2010-02-01, 2011-02-01, 2010-07-01, 2011-07-01) and each backward-running simulation start date (2010-02-15, 2011-02-15, 2010-07-15, 2011-07-15). This gives files of the pattern `calcs/enterexit_sim3_[yyyy]-[mm]-forward_14days_dx300` (or backward).

### Calculate dataframes
Summarize the forcing mechanisms (read in from the blended model product, the shelf model, etc.) for each time period in a `pandas` dataframe, and combine them with the time series summary of when drifters are outside the domain. Do this using `calcs.make_dfs()`; the resulting files are of the form `df_2010-02_backward.csv`.
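
As a rough illustration of this step, the sketch below joins an hourly forcing table onto a 15-minute drifter summary. It is not the code in `calcs.make_dfs()`: all values are synthetic, and the column name `outside` and the choice of linear time interpolation are assumptions made for the example (`river` and `uwind` are borrowed from the regression output only as labels).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical drifter summary on the 15-minute output times of one 14-day set:
# the number of drifters outside the domain at each time (synthetic values).
times = pd.date_range("2010-02-01", periods=14 * 24 * 4, freq="15min")
drifters = pd.DataFrame(
    {"outside": rng.integers(0, 100, size=len(times))}, index=times
)

# Hypothetical hourly forcing (synthetic values), one extra hour at the end
# so that every 15-minute time lies between two hourly values.
hours = pd.date_range("2010-02-01", periods=14 * 24 + 1, freq="60min")
forcing = pd.DataFrame(
    {
        "river": rng.normal(500.0, 50.0, size=len(hours)),
        "uwind": rng.normal(0.0, 3.0, size=len(hours)),
    },
    index=hours,
)

# Interpolate the hourly forcing in time onto the 15-minute drifter times.
combined_index = forcing.index.union(times)
forcing_15min = (
    forcing.reindex(combined_index).interpolate(method="time").reindex(times)
)

# One row per output time: drifter summary plus forcing columns.
df = drifters.join(forcing_15min)
print(df.head())
```

A dataframe like `df` could then be written out with `df.to_csv(...)` under whatever file name the workflow expects.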

This was all done on hafen, but the df files have been copied to Tahoma too.

## Find relationships between drifter behavior and forcing mechanisms for each set of simulations individually

Run the statistics with `calcs.stats()`.
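
For orientation, here is a minimal sketch of the general idea of checking combinations of predictors and ranking them by adjusted r^2 and BIC. It is not `calcs.stats()` itself: the data are synthetic, the helper `fit_ols` is hypothetical, the BIC is the Gaussian-likelihood form up to an additive constant (so its values are not comparable to those printed below), and the p-value filter mentioned in the output is omitted.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, adjusted r^2, BIC."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    k = A.shape[1]  # number of fitted parameters, intercept included
    r2_adj = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    bic = n * np.log(rss / n) + k * np.log(n)  # up to an additive constant
    return coef, r2_adj, bic


# Synthetic predictors; the names only echo the output below.
rng = np.random.default_rng(42)
n = 500
data = {name: rng.normal(size=n) for name in ("river", "uwind", "theta")}
# Synthetic response that depends on river and theta only.
y = 2.0 * data["river"] - 3.0 * data["theta"] + rng.normal(scale=0.5, size=n)

# Fit every non-empty combination of predictors.
results = []
names = list(data)
for size in range(1, len(names) + 1):
    for combo in combinations(names, size):
        X = np.column_stack([data[name] for name in combo])
        coef, r2_adj, bic = fit_ols(X, y)
        results.append((combo, r2_adj, bic))

# Lowest BIC first.
results.sort(key=lambda item: item[2])
best_combo, best_r2_adj, best_bic = results[0]
for combo, r2_adj, bic in results:
    print(combo, round(r2_adj, 3), round(bic, 1))
```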

subtidal vs. tidal


In [1]:
import calcs
calcs.stats(which='subtidal', direction='forward')

Simulation set: calcs/df_2010-02_forward.csv
Number of combinations checked: 59
Top adjusted r^2, lowest BIC performers, no p>0.1: 4
Adjusted r^2: 0.86
BIC: 2272
coefficients: 
Intercept   -30.340305
river         0.753698
uwind         0.183440
theta       -15.875531
pvalues: 
river     0.000000e+00
uwind    1.621830e-126
theta     0.000000e+00

Adjusted r^2: 0.84
BIC: 2606
coefficients: 
Intercept   -34.210658
river         0.784992
s            -0.135435
theta       -17.957780
pvalues: 
river    0.000000e+00
s        5.900459e-54
theta    0.000000e+00

Adjusted r^2: 0.85
BIC: 2374
coefficients: 
Intercept   -30.415900
river         0.768867
theta       -15.886888
sustr         0.140292
pvalues: 
river     0.000000e+00
theta     0.000000e+00
sustr    2.620875e-104

Adjusted r^2: 0.82
BIC: 2838
coefficients: 
Intercept   -33.423779
river         0.800416
theta       -17.476123
pvalues: 
river    0.0
theta    0.0

Simulation set: calcs/df_2010-07_forward.csv
Number of combinations chec

## Cross-validation



## Notes to add somewhere

Bayesian Information Criterion [(BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion)

| ΔBIC    | Evidence against higher BIC        |
|---------|------------------------------------|
| 0 to 2  | Not worth more than a bare mention |
| 2 to 6  | Positive                           |
| 6 to 10 | Strong                             |
| >10     | Very Strong                        |
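
The table can be applied to the BIC values printed above for the 2010-02 forward set. The sketch below is illustrative: the function name `bic_evidence` is hypothetical, and because the table's ranges share their endpoints, treating each upper bound as inclusive is an assumption.

```python
def bic_evidence(delta_bic):
    """Label the evidence against the model with the higher BIC."""
    if delta_bic < 0:
        raise ValueError("delta_bic must be non-negative")
    if delta_bic <= 2:
        return "Not worth more than a bare mention"
    if delta_bic <= 6:
        return "Positive"
    if delta_bic <= 10:
        return "Strong"
    return "Very Strong"


# BIC values copied from the 2010-02 forward output above.
bics = {
    "river + uwind + theta": 2272,
    "river + theta + sustr": 2374,
    "river + s + theta": 2606,
    "river + theta": 2838,
}

best_model = min(bics, key=bics.get)
for model, bic in bics.items():
    if model != best_model:
        delta = bic - bics[best_model]
        print(f"{model} vs. {best_model}: delta BIC = {delta}, {bic_evidence(delta)}")
```

Every alternative here differs from the lowest-BIC model by more than 10, so each comparison falls in the last row of the table.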


[Scaling variables:](http://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia)

>In regression, it is often recommended to center the variables so that the predictors have mean $0$. This makes it so the intercept term is interpreted as the expected value of $Y_i$ when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of $Y_i$ when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). Another practical reason for scaling in regression is when one variable has a very large scale, e.g. if you were using population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. $10^{-6}$) which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions. The convention that you standardize predictions primarily exists so that the units of the regression coefficients are the same.

>As @gung alludes to and @MånsT shows explicitly (+1 to both, btw), centering/scaling does not effect your statistical inference in regression models - the estimates are adjusted appropriately and the $p$-values will be the same.

>Other situations where centering and/or scaling may be useful:

>* when you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, it may be the case that one variable has a larger impact on the sum due purely to its scale, which may be undesirable.
>* To simplify calculations and notation. For example, the sample covariance matrix of a matrix of values centered by their sample means is simply $X'X$. Similarly, if a univariate random variable $X$ has been mean centered, then $\mathrm{var}(X) = E(X^2)$ and the variance can be estimated from a sample by looking at the sample mean of the squares of the observed values.
>* Related to aforementioned, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

>Note that scaling is not necessary in the last two bullet points I mentioned and centering may not be necessary in the first bullet I mentioned, so the two do not need to go hand and hand at all times.


[Standardize variables](http://stats.stackexchange.com/questions/13267/how-to-sum-two-variables-that-are-on-different-scales/13271#13271) by:

$C_{scaled} = (C - C_{mean})/ C_{std}$
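
A minimal sketch of this scaling with NumPy follows. The function name `standardize` is illustrative, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the formula above does not say which estimator is meant; pandas' `Series.std()` defaults to `ddof=1` instead.

```python
import numpy as np


def standardize(c):
    """Return (c - mean) / std, using the population standard deviation."""
    c = np.asarray(c, dtype=float)
    return (c - c.mean()) / c.std()


# Example values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = standardize(values)
print(scaled)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

After scaling, the variable has mean 0 and standard deviation 1, so regression coefficients for different predictors are expressed in comparable units.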