# Interfaces
## The Arrows between the Boxes
Recall our tutorial workflow: the boxes represent actions (our modules), and the arrows represent the interfaces between those modules.

![Our Training Data Workflow](images/DataWorkflowTraining.png "Our Training Data Workflow")


## Defining the Interface Between Modules
It's important to define the interface between the modules so that when we swap out a specific implementation of one step, the other modules still know what to expect as input.

Let's look at our current baseline workflow and see what our interfaces are.

In [2]:
import dataretrieval.nwis as nwis

def acquire_streamflow_nwis_iv(site, start, end):
    df = nwis.get_record(sites=site, service='iv', start=start, end=end, parameterCD='00060')
    return df

def resample_to_daily(df):
    return df['00060'].resample('1D').mean()

def visualize_summary_statistics(df):
    print(df.describe())

# Acquire / Filter
df = acquire_streamflow_nwis_iv(site='04294000', start='2022-06-01', end='2022-11-01')
print(df)

# Manipulate
daily = resample_to_daily(df)
print(daily)

# Visualize
visualize_summary_statistics(daily)

                            site_no   00060 00060_cd
datetime                                            
2022-06-01 04:00:00+00:00  04294000  2240.0        A
2022-06-01 04:15:00+00:00  04294000  2210.0        A
2022-06-01 04:30:00+00:00  04294000  2210.0        A
2022-06-01 04:45:00+00:00  04294000  2190.0        A
2022-06-01 05:00:00+00:00  04294000  2190.0        A
...                             ...     ...      ...
2022-11-02 02:45:00+00:00  04294000   914.0        A
2022-11-02 03:00:00+00:00  04294000   860.0        A
2022-11-02 03:15:00+00:00  04294000   780.0        A
2022-11-02 03:30:00+00:00  04294000   718.0        A
2022-11-02 03:45:00+00:00  04294000   674.0        A

[14775 rows x 3 columns]
datetime
2022-06-01 00:00:00+00:00    1628.875000
2022-06-02 00:00:00+00:00    2029.770833
2022-06-03 00:00:00+00:00    2553.333333
2022-06-04 00:00:00+00:00    2503.229167
2022-06-05 00:00:00+00:00    1667.802083
                                ...     
2022-10-29 00:00:00+00:00     

So... our interface between acquire / filter and manipulate is a DataFrame with all kinds of stuff in it and coded column labels, and our interface between manipulate and visualize is a pandas Series (one column of a DataFrame). The single pandas Series seems to make sense to me for passing data to the visualization step, but that DataFrame has a bunch of extra stuff that we probably don't want. In fact, it complicates the resample_to_daily() function:

```
def resample_to_daily(df):
    return df['00060'].resample('1D').mean()
```

See how the function needs a column named '00060' for it to work? That severely limits the modularity, reusability, and generalizability of that function. Thinking about this, we probably want that function to simply be:

```
def resample_to_daily(df):
    return df.resample('1D').mean()
```

But if we tried that right now, the text columns (00060_cd in particular) would give us problems, because you can't take the mean of a string.
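To see the problem without a network call, here is a minimal sketch using a hypothetical four-row frame with the same column layout as the NWIS result (the values and the name `raw` are made up for illustration). Depending on the pandas version, averaging the text columns may raise a `TypeError` or the text columns may be handled differently, so the sketch catches the error rather than assuming one behavior, and then shows that restricting the frame to its numeric columns gives a clean daily mean.

```python
import pandas as pd

# Hypothetical stand-in for the NWIS frame: same column layout, made-up values.
index = pd.date_range(
    "2022-06-01 04:00", periods=4, freq=pd.Timedelta(hours=12), tz="UTC"
)
raw = pd.DataFrame(
    {
        "site_no": ["04294000"] * 4,               # text
        "00060": [2240.0, 2210.0, 2190.0, 2150.0], # numeric streamflow
        "00060_cd": ["A"] * 4,                     # text qualifier code
    },
    index=index,
)

# Averaging the whole frame: behavior on text columns varies by pandas version.
try:
    attempt = raw.resample("1D").mean()
    print("no error; columns returned:", list(attempt.columns))
except TypeError as err:
    print("resample failed:", err)

# Keeping only the numeric columns avoids the problem entirely.
numeric_daily = raw.select_dtypes("number").resample("1D").mean()
print(numeric_daily)
```

Selecting numeric columns works, but it still leaves the caller guessing which column holds the streamflow, which is why a narrower interface is worth considering.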

In [None]:
import dataretrieval.nwis as nwis

def acquire_streamflow_nwis_iv(site, start, end):
    df = nwis.get_record(sites=site, service='iv', start=start, end=end, parameterCD='00060')
    return df

def resample_to_daily(df):
    return df.resample('1D').mean()

def visualize_summary_statistics(df):
    print(df.describe())

# Acquire / Filter
df = acquire_streamflow_nwis_iv(site='04294000', start='2022-06-01', end='2022-11-01')

# Manipulate
daily = resample_to_daily(df)

# Visualize
visualize_summary_statistics(daily)

So... let's structure this interface a little bit better. It makes sense to me that acquire_streamflow_nwis_iv() should simply return a single column with a descriptive name like "streamflow" (plus its units) so we know what it is. Then, we can pass that one column around and manipulate it however we need. So, let's do that...

In [5]:
import dataretrieval.nwis as nwis

def acquire_streamflow_nwis_iv(site, start, end):
    df = nwis.get_record(sites=site, service='iv', start=start, end=end, parameterCD='00060')
    # '00060' is the USGS parameter code for discharge in cubic feet per second.
    # See https://help.waterdata.usgs.gov/parameter_cd?group_cd=PHY
    return df['00060'].rename('streamflow (ft^3/s)')

def resample_to_daily(df):
    return df.resample('1D').mean()

def visualize_summary_statistics(df):
    print(df.describe())

# Acquire / Filter
df = acquire_streamflow_nwis_iv(site='04294000', start='2022-06-01', end='2022-11-01')
print(df)

# Manipulate
daily = resample_to_daily(df)

# Visualize
visualize_summary_statistics(daily)

datetime
2022-06-01 04:00:00+00:00    2240.0
2022-06-01 04:15:00+00:00    2210.0
2022-06-01 04:30:00+00:00    2210.0
2022-06-01 04:45:00+00:00    2190.0
2022-06-01 05:00:00+00:00    2190.0
                              ...  
2022-11-02 02:45:00+00:00     914.0
2022-11-02 03:00:00+00:00     860.0
2022-11-02 03:15:00+00:00     780.0
2022-11-02 03:30:00+00:00     718.0
2022-11-02 03:45:00+00:00     674.0
Name: streamflow (ft^3/s), Length: 14775, dtype: float64
count     155.000000
mean      989.816458
std      1278.847398
min       110.748958
25%       306.796875
50%       549.020833
75%      1064.348958
max      9091.354167
Name: streamflow (ft^3/s), dtype: float64


There, now we have a standard interface between our modules that's simple and makes sense. There are a ton of options for what that interface could look like for your workflow. For this simple example, a single pandas Series works just fine. If the workflow were more complicated and we needed to pass more than one Series between our modules, the interface could be a list or a Python dictionary of Series instead.
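As a rough sketch of the dictionary option, the example below passes two named Series through one resampling step. Everything here is an assumption for illustration: the function name `resample_all_to_daily`, the second variable "gage height (ft)", and the values themselves are made up and are not part of the tutorial workflow.

```python
import pandas as pd

def resample_all_to_daily(series_by_name):
    """Apply the same daily-mean resample to every Series in the dictionary."""
    return {name: s.resample("1D").mean() for name, s in series_by_name.items()}

# Made-up observations every 12 hours, keyed by a descriptive name with units.
index = pd.date_range(
    "2022-06-01", periods=4, freq=pd.Timedelta(hours=12), tz="UTC"
)
observations = {
    "streamflow (ft^3/s)": pd.Series([2240.0, 2210.0, 2190.0, 2150.0], index=index),
    "gage height (ft)": pd.Series([5.0, 5.2, 4.8, 4.6], index=index),
}

daily_by_name = resample_all_to_daily(observations)
for name, daily in daily_by_name.items():
    print(name)
    print(daily)
```

The keys carry the meaning and units of each Series, so a downstream module can ask for what it needs by name instead of relying on a coded column label.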

## Other Interface / Data Structure Concerns
- Timezones
- Units (English vs. SI)
- Sampling Rate (15min, hourly, daily, etc.)
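One way to handle these concerns is to normalize each of them at the interface. The sketch below is only an illustration under stated assumptions: the helper names `to_utc`, `cfs_to_cms`, and `sampling_interval` are hypothetical, the input values are made up, and the choice of UTC and cubic meters per second as the "standard" is just one possible convention.

```python
import pandas as pd

# One cubic foot in cubic meters (the international foot is exactly 0.3048 m).
CFS_TO_CMS = 0.3048 ** 3

def to_utc(series):
    """Timezones: require a timezone-aware index and convert it to UTC."""
    if series.index.tz is None:
        raise ValueError("expected a timezone-aware DatetimeIndex")
    return series.tz_convert("UTC")

def cfs_to_cms(series):
    """Units: convert ft^3/s to m^3/s and rename the Series to match."""
    return (series * CFS_TO_CMS).rename("streamflow (m^3/s)")

def sampling_interval(series):
    """Sampling rate: return the most common spacing between timestamps."""
    return series.index.to_series().diff().dropna().value_counts().idxmax()

# Made-up 15-minute data stamped in local (US Eastern) time.
index = pd.date_range(
    "2022-06-01 00:00", periods=4, freq=pd.Timedelta(minutes=15),
    tz="America/New_York",
)
local_cfs = pd.Series(
    [2240.0, 2210.0, 2210.0, 2190.0], index=index, name="streamflow (ft^3/s)"
)

utc_cms = cfs_to_cms(to_utc(local_cfs))
print(utc_cms)
print("sampling interval:", sampling_interval(utc_cms))
```

Putting the units in the Series name and the timezone in the index means both travel with the data, so the receiving module can check them rather than assume them.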