# Sonde3 
##  Reads and converts binary water quality environmental instrument data to a DataFrame


### I.  Example Usage

Let's dive in!  

We have an example water quality instrument binary file, `"tests/ysi_test_files/SA08.dat"`.  This file was generated by a YSI 600LS instrument and is in a proprietary binary format.

#### Using the `sonde()` function we:

1.  `autodetect()` the file type and pass the file to the correct parser function
2.  `read_ysi()` the binary file and convert it to a pandas DataFrame
3.  Transform all datetimes to the UTC timezone
4.  Standardize the units to metric and rename the columns to standard name conventions
5.  Pass the DataFrame to `calculate_salinity_psu()` and `calculate_do_mgl()` to apply standard formulas to generate the salinity and dissolved oxygen columns.

In [1]:
import sonde3
import pandas   
metadata, df = sonde3.sonde("tests/ysi_test_files/SA08.dat")


  metadata, df = formats.read_ysi(filename, tzinfo)
  Rtx = (rt) ** 0.5


#### Why the runtime warnings?

1.  The YSI instrument files don't contain any timezone information.  Therefore, the function has to assume a timezone for the file in order to make the UTC conversion.

2. Often raw instrument files will contain impossible & incorrect values in the beginning and end of the file.  Examples: negative values for salinity or dissolved oxygen percentage.  `sonde3` does not trim the raw file, or perform QA analysis.  `sonde3` will pass the values as they were recorded by the instrument.
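
To see why the assumed timezone matters, here is a minimal sketch that uses only the Python standard library and does not call `sonde3`. The timestamp is an illustrative value and `to_utc` is a hypothetical helper, not part of the `sonde3` API.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


def to_utc(naive_local, tz_name):
    """Interpret a timezone-naive timestamp in tz_name and convert it to UTC."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


# Illustrative timestamp with no timezone attached, as in a raw instrument file
naive = datetime(2008, 7, 16, 9, 0, 31)

as_central = to_utc(naive, "America/Chicago")   # assume US Central time
as_mountain = to_utc(naive, "America/Denver")   # assume US Mountain time

print(as_central.isoformat())
print(as_mountain.isoformat())
```

The same recorded clock time maps to UTC instants one hour apart, so a wrong assumption shifts every row of the resulting DataFrame.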

##### We can now interact with the two dataframes produced by `sonde3`:


In [2]:
df.info() 
df.head() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
datetime_(UTC)                       700 non-null datetime64[ns, UTC]
water_temp_c                         700 non-null float64
water_conductivity_mS/cm             700 non-null float64
water_depth_m_nonvented              700 non-null float64
water_DO_%                           700 non-null float64
instrument_battery_voltage           700 non-null float64
water_specific_conductivity_mS/cm    700 non-null float64
water_salinity_PSU                   678 non-null float64
water_DO_mgl                         678 non-null float64
dtypes: datetime64[ns, UTC](1), float64(8)
memory usage: 49.3 KB


Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
0,2008-07-16 14:00:31+00:00,28.998718,3.7e-05,0.010862,93.391418,6.09375,3.5e-05,0.013536,7.18342
1,2008-07-16 15:00:31+00:00,28.482361,5.9e-05,0.016358,96.765137,6.09375,5.5e-05,0.013326,7.510631
2,2008-07-16 16:00:31+00:00,27.257385,0.000546,0.017263,103.529358,6.09375,0.000524,0.012655,8.212117
3,2008-07-16 17:00:31+00:00,29.507751,21.301758,0.542648,93.055725,6.09375,19.613107,11.601472,6.655432
4,2008-07-16 18:00:31+00:00,29.762268,21.454102,0.557098,94.18869,6.09375,19.665354,11.631321,6.706414


In [3]:
metadata

Unnamed: 0,Instrument_Type,Manufacturer,System_Signal,Program_Version,Instrument_Serial_Number,Station,Logging_Interval,Begin_Log_Time_(UTC),First_Sample_Time_(UTC),Filename
0,600,YSI,870489733,306,1012,SANT_CDT,3600,2008-07-16 13:51:00+00:00,2008-07-16 13:51:31+00:00,SA08.dat


### II.  Working with time zones


What if data was collected outside of US/Central time?  Pass the timezone information to `sonde3.sonde`:

In [4]:
import pytz
metadata, df = sonde3.sonde("tests/ysi_test_files/SA08.dat", pytz.timezone('US/Eastern'))
df.head()

  Rtx = (rt) ** 0.5


Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
0,2008-07-16 13:05:31+00:00,28.998718,3.7e-05,0.010862,93.391418,6.09375,3.5e-05,0.013536,7.18342
1,2008-07-16 14:05:31+00:00,28.482361,5.9e-05,0.016358,96.765137,6.09375,5.5e-05,0.013326,7.510631
2,2008-07-16 15:05:31+00:00,27.257385,0.000546,0.017263,103.529358,6.09375,0.000524,0.012655,8.212117
3,2008-07-16 16:05:31+00:00,29.507751,21.301758,0.542648,93.055725,6.09375,19.613107,11.601472,6.655432
4,2008-07-16 17:05:31+00:00,29.762268,21.454102,0.557098,94.18869,6.09375,19.665354,11.631321,6.706414


### III. Autodetecting files



Curious about what kind of instrument files you have in a directory?  Apply the `sonde3.autodetect` function:

```python
#what kind of file is this??
sonde3.autodetect("tests/greenspan_test_files/RIOA_20060718_CDT_GS7837.xls") 
```

In [5]:
sonde3.autodetect("tests/greenspan_test_files/RIOA_20060718_CDT_GS7837.xls") 

'greenspan_xls'

Copy this code snippet to the notebook line below to run `autodetect()` on all of the use-case examples in the `sonde3` package:

```python
# this script runs through all of the test examples and collects the autodetect results
import os

root_dir = 'tests'
results = []
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        # skip files whose name contains "_test.txt"
        if "_test.txt" in file:
            continue
        # build the full path once and reuse it
        path = os.path.join(directory, file)
        results.append(path + ' ' + sonde3.autodetect(path))

results 
```
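
If a per-format count is more useful than a flat list, the same directory walk can feed a `collections.Counter`. This is a minimal sketch: `tally_formats` is a hypothetical helper, not part of `sonde3`, and it assumes the detector returns a string such as `'greenspan_xls'` for every file it is given.

```python
import os
from collections import Counter


def tally_formats(root_dir, detect):
    """Count how many files under root_dir map to each detected format.

    `detect` is any callable that takes a file path and returns a format
    name, for example `sonde3.autodetect`.
    """
    counts = Counter()
    for directory, _subdirectories, files in os.walk(root_dir):
        for name in files:
            # same filter as the snippet above
            if "_test.txt" in name:
                continue
            counts[detect(os.path.join(directory, name))] += 1
    return counts


# Example call (requires sonde3 and its test files):
# tally_formats("tests", sonde3.autodetect)
```

Passing the detector in as an argument keeps the helper usable with any detection function.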

### IV. Generating Salinity and Dissolved Oxygen

Typically, deployed water quality instruments do not compute all columns of data internally.  Instead, these are calculated by the program used to read the file back at the lab.  For example, YSI instruments do not compute salinity or dissolved oxygen concentration.  This can cause confusion, because YSI's ECOWIN or Ecowatch Lite programs display salinity and DO mg/L when viewing a YSI *\*.dat* file.  However, the raw binary file does not include these columns, as they were not physically collected by the instrument during deployment.

For example, let's read the raw binary example file `"tests/ysi_test_files/SA08.dat"` and see what it contains:

In [6]:
metadata, SA08_BIN = sonde3.read_ysi("tests/ysi_test_files/SA08.dat",pytz.timezone('US/Central'))
SA08_BIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 6 columns):
datetime_(UTC)                700 non-null datetime64[ns, UTC]
water_temp_c                  700 non-null float64
water_conductivity_mS/cm      700 non-null float64
water_depth_m_nonvented       700 non-null float64
water_DO_%                    700 non-null float64
instrument_battery_voltage    700 non-null float64
dtypes: datetime64[ns, UTC](1), float64(5)
memory usage: 32.9 KB


The pandas DataFrame contains the columns for conductivity and % DO saturation, but not for salinity and DO concentration.

For comparison, let's read the comma-separated version of this file that was produced by the proprietary YSI Ecowin program:

In [7]:
metadata, SA08_CSV = sonde3.read_ysi_ascii("tests/ysi_test_files/SA08.CDF", pytz.timezone('US/Central'),delim=",")


The ECOWIN-exported comma-separated file has far more columns! The extra columns were derived through formulas from the instrument's observed measurements.

To generate these columns, pass the `SA08_BIN` DataFrame to `sonde3.calculate_salinity_psu()` and `sonde3.calculate_do_mgl()`, which calculate salinity (PSU) and dissolved oxygen (mg/L).

We can then compare our computed results to the ECOwatch program results:

In [8]:
SA08_BIN = sonde3.calculate_salinity_psu(SA08_BIN)
SA08_BIN = sonde3.calculate_do_mgl(SA08_BIN)
SA08_BIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 8 columns):
datetime_(UTC)                700 non-null datetime64[ns, UTC]
water_temp_c                  700 non-null float64
water_conductivity_mS/cm      700 non-null float64
water_depth_m_nonvented       700 non-null float64
water_DO_%                    700 non-null float64
instrument_battery_voltage    700 non-null float64
water_salinity_PSU            678 non-null float64
water_DO_mgl                  678 non-null float64
dtypes: datetime64[ns, UTC](1), float64(7)
memory usage: 43.8 KB


  Rtx = (rt) ** 0.5


So, why does `sonde3` produce fewer values for salinity and DO than the comma-separated file?  Is this a bug?

*NO!*  If you recall from the `sqrt()` warning above, some of the values in the file are invalid.  This produces null (**NaN**) values in those rows.
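
To see how an invalid reading turns into a NaN, the sketch below implements the Practical Salinity Scale 1978 (PSS-78) at surface pressure. It is an assumption that this is the "standard formula" behind `calculate_salinity_psu()`; the name `salinity_pss78` is hypothetical and this is not the `sonde3` source. The formula takes the square root of a conductivity ratio, which is undefined when the ratio is negative, so a negative conductivity reading cannot yield a salinity.

```python
import math

# PSS-78 coefficients (UNESCO 1983); pressure correction omitted (surface only)
A = (0.0080, -0.1692, 25.3851, 14.0941, -7.0261, 2.7081)
B = (0.0005, -0.0056, -0.0066, -0.0375, 0.0636, -0.0144)
C = (0.6766097, 2.00564e-2, 1.104259e-4, -6.9698e-7, 1.0031e-9)
K = 0.0162
COND_STD = 42.914  # mS/cm, conductivity of standard seawater (S=35) at 15 deg C


def salinity_pss78(cond_ms_cm, temp_c):
    """Practical salinity from conductivity (mS/cm) and temperature (deg C)."""
    # temperature dependence of the reference seawater conductivity
    r_temp = sum(c * temp_c ** i for i, c in enumerate(C))
    rt = (cond_ms_cm / COND_STD) / r_temp
    if rt < 0:
        # the square root below is undefined for a negative ratio
        return math.nan
    rtx = math.sqrt(rt)
    d_temp = (temp_c - 15.0) / (1.0 + K * (temp_c - 15.0))
    # both polynomials are in powers of sqrt(rt)
    return sum((a + d_temp * b) * rtx ** i for i, (a, b) in enumerate(zip(A, B)))


print(salinity_pss78(42.914, 15.0))     # standard seawater, close to 35
print(salinity_pss78(-3.1e-05, 22.56))  # negative conductivity gives nan
```

Called with 3.7e-05 mS/cm at 28.998718 °C, the first row of the example file, the sketch returns about 0.0135, in line with the 0.013536 that `sonde3` reports for that row.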

In [9]:
SA08_BIN.head()

Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_salinity_PSU,water_DO_mgl
0,2008-07-16 14:00:31+00:00,28.998718,3.7e-05,0.010862,93.391418,6.09375,0.013536,7.18342
1,2008-07-16 15:00:31+00:00,28.482361,5.9e-05,0.016358,96.765137,6.09375,0.013326,7.510631
2,2008-07-16 16:00:31+00:00,27.257385,0.000546,0.017263,103.529358,6.09375,0.012655,8.212117
3,2008-07-16 17:00:31+00:00,29.507751,21.301758,0.542648,93.055725,6.09375,11.601472,6.655432
4,2008-07-16 18:00:31+00:00,29.762268,21.454102,0.557098,94.18869,6.09375,11.631321,6.706414


Let's check our two files to see if our conversion methods match those used by YSI.  

Let's use `numpy.random` to pick a row to check:



In [10]:
from numpy import random
row = random.randint(2,677) # pick random row

sub1 =  SA08_BIN.iloc[row:row+1]  #row one binary file
sub2 = SA08_CSV.iloc[row:row+1]   #row two csv file

pandas.concat([sub1, sub2], axis=0,join="inner")

Unnamed: 0,water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_salinity_PSU,water_DO_mgl
459,29.989319,30.498535,0.449013,29.139709,5.859375,17.015122,2.00663
459,29.99,30.499,0.449,29.1,5.9,17.02,2.01


Looks the same!  Notice, however, that the csv file values are rounded due to the fixed digit precision.
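
The visual check can also be done numerically. The sketch below copies the two rows shown above and tests whether each full-precision value lies within half a unit of the last digit printed in the csv file. The helper `agrees_at_precision` is hypothetical and not part of `sonde3`.

```python
# Values copied from the comparison output above (row 459)
binary_row = [29.989319, 30.498535, 0.449013, 29.139709, 5.859375, 17.015122, 2.00663]
csv_row = [29.99, 30.499, 0.449, 29.1, 5.9, 17.02, 2.01]
csv_decimals = [2, 3, 3, 1, 1, 2, 2]  # digits printed after the decimal point


def agrees_at_precision(full, rounded, decimals):
    """True if `full` is within half a unit of the last printed digit of `rounded`."""
    half_unit = 0.5 * 10 ** -decimals
    # small epsilon absorbs floating-point representation error
    return abs(full - rounded) <= half_unit + 1e-12


matches = [
    agrees_at_precision(f, r, d)
    for f, r, d in zip(binary_row, csv_row, csv_decimals)
]
print(all(matches))  # prints True
```

A tolerance tied to the printed precision avoids false mismatches that an exact equality test would report.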

### V. Files from other Manufacturers
#### a. Hydrotech

In [11]:
metadata, mydf = sonde3.sonde("tests/hydrotech_test_files/0109DELT.CSV", pytz.timezone('US/Central'))

In [12]:
metadata

Unnamed: 0,Manufacturer,Instrument_Serial_Number,Model,Station,Deployment_Setup_Date,Filename
0,Hydrotech,081107-D,MiniSonde4a,0109delt,2009-01-13 00:00:00,0109DELT.CSV


In [13]:
mydf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 700 entries, 13 to 712
Data columns (total 7 columns):
Datetime_(UTC)                       700 non-null datetime64[ns, UTC]
water_temp_c                         700 non-null float64
water_specific_conductivity_mS/cm    700 non-null float64
water_salinity_PSU                   700 non-null float64
water_depth_m_nonvented              700 non-null float64
instrument_battery_voltage           700 non-null float64
water_conductivity_mS/cm             700 non-null float64
dtypes: datetime64[ns, UTC](1), float64(6)
memory usage: 43.8 KB


In [14]:
mydf.head()

Unnamed: 0,Datetime_(UTC),water_temp_c,water_specific_conductivity_mS/cm,water_salinity_PSU,water_depth_m_nonvented,instrument_battery_voltage,water_conductivity_mS/cm
13,2009-01-14 13:00:00+00:00,12.81,0.866,0.428305,-0.04,6.0,0.66437
14,2009-01-14 14:00:00+00:00,12.57,0.868,0.429304,-0.06,6.0,0.661926
15,2009-01-14 15:00:00+00:00,12.34,0.867,0.428759,-0.05,6.0,0.657354
16,2009-01-14 16:00:00+00:00,12.59,0.868,0.429306,-0.04,6.0,0.662257
17,2009-01-14 17:00:00+00:00,11.78,0.883,0.436876,-0.02,6.0,0.660041


In [15]:
mydf.tail()

Unnamed: 0,Datetime_(UTC),water_temp_c,water_specific_conductivity_mS/cm,water_salinity_PSU,water_depth_m_nonvented,instrument_battery_voltage,water_conductivity_mS/cm
708,2009-02-12 12:00:00+00:00,19.24,34.9,22.008874,1.03,3.9,31.060442
709,2009-02-12 13:00:00+00:00,19.22,34.9,22.008973,1.02,3.9,31.04711
710,2009-02-12 14:00:00+00:00,19.31,35.1,22.147645,1.02,3.9,31.285367
711,2009-02-12 15:00:00+00:00,19.44,35.8,22.63472,1.02,3.9,31.998183
712,2009-02-12 16:00:00+00:00,19.34,34.9,22.008374,1.03,0.0,31.127101


### Feature Development Testing

The code below is my own feature/bug testing.

In [16]:
metadata, baytcsv = sonde3.sonde("tests/ysi_test_files/0108BAYT.csv", pytz.timezone('US/Central'))

In [17]:
metadata, baytdat = sonde3.sonde("tests/ysi_test_files/0108BAYT.dat", pytz.timezone('US/Central'))

  Rtx = (rt) ** 0.5


In [18]:
metadata, baytCDF = sonde3.sonde("tests/ysi_test_files/0108BAYT.CDF", pytz.timezone('US/Central'))

In [19]:
baytcsv.head()

Unnamed: 0,Datetime_(UTC),water_temp_c,water_specific_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,water_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
0,2008-01-29 15:00:00+00:00,22.56,0.0,-0.093,103.2,0.0,0.01137,8.927547
1,2008-01-29 16:00:00+00:00,20.2,0.0,-0.096,101.3,0.0,0.0104,9.174617
2,2008-01-29 17:00:00+00:00,20.04,0.0,-0.085,103.5,0.0,0.010332,9.403651
3,2008-01-29 18:00:00+00:00,17.77,0.0,-0.092,102.3,0.0,0.009327,9.731162
4,2008-01-29 19:00:00+00:00,11.87,20.681,2.114,81.7,15.494557,12.3939,8.169018


In [20]:
baytCDF.head()

Unnamed: 0,Datetime_(UTC),water_temp_c,water_specific_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_conductivity_mS/cm,water_DO_mgl,water_salinity_PSU
0,2008-01-29 15:00:33+00:00,22.56,-0.0,-0.093,103.2,6.6,-0.0,8.927547,0.01137
1,2008-01-29 16:00:33+00:00,20.2,-0.0,-0.096,101.3,6.6,-0.0,9.174617,0.0104
2,2008-01-29 17:00:33+00:00,20.04,-0.0,-0.085,103.5,6.6,-0.0,9.403651,0.010332
3,2008-01-29 18:00:33+00:00,17.77,-0.0,-0.092,102.3,6.5,-0.0,9.731162,0.009327
4,2008-01-29 19:00:33+00:00,11.87,20.681,2.114,81.7,6.5,15.493,8.169087,12.392548


In [21]:
baytdat.head()

Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
0,2008-01-29 15:00:33+00:00,22.562561,-3.1e-05,-0.093325,103.227234,6.5625,-3.2e-05,,
1,2008-01-29 16:00:33+00:00,20.197754,-3e-05,-0.095846,101.288605,6.5625,-3.3e-05,,
2,2008-01-29 17:00:33+00:00,20.04303,-3e-05,-0.08506,103.504181,6.5625,-3.3e-05,,
3,2008-01-29 18:00:33+00:00,17.767944,-2.9e-05,-0.091656,102.253723,6.484375,-3.4e-05,,
4,2008-01-29 19:00:33+00:00,11.867371,15.493164,2.113831,81.650543,6.484375,20.680527,12.393562,8.164566


In [22]:
baytdat.tail()

Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
1057,2008-03-13 17:00:33+00:00,18.975525,0.000262,-0.052872,102.866364,5.546875,0.000296,0.009562,9.547421
1058,2008-03-13 18:00:33+00:00,19.33075,-2.8e-05,-0.05699,104.083252,5.546875,-3.1e-05,,
1059,2008-03-13 19:00:33+00:00,20.094299,-3e-05,-0.066307,103.873444,5.625,-3.4e-05,,
1060,2008-03-13 20:00:33+00:00,21.163635,-3.1e-05,-0.075314,104.251099,5.625,-3.4e-05,,
1061,2008-03-13 21:00:33+00:00,22.510376,-3.3e-05,-0.083,104.284668,5.625,-3.4e-05,,


In [23]:
root_dir = "C:/Users/ETurner/Desktop/Evans CMSS Research/Montagna N/Sonde/RAW/T5160804.CDF"

In [24]:
metadata, CDF = sonde3.sonde(root_dir, pytz.timezone('US/Central'))

In [25]:
root_dir2 = "C:/Users/ETurner/Desktop/Evans CMSS Research/Montagna N/Sonde/RAW/T5160804.dat"
metadata,DAT = sonde3.sonde(root_dir2, pytz.timezone('US/Central'))

  Rtx = (rt) ** 0.5


In [26]:
CDF.tail()

Unnamed: 0,Datetime_(UTC),water_temp_c,water_specific_conductivity_mS/cm,water_depth_m_nonvented,water_salinity_PSU,water_conductivity_mS/cm
1453,2016-08-19 14:15:08+00:00,20.9,0.193,-0.02,0.093957,0.177886
1454,2016-08-19 14:30:08+00:00,21.07,0.192,-0.019,0.093503,0.177588
1455,2016-08-19 14:45:08+00:00,21.17,0.191,-0.016,0.093042,0.177028
1456,2016-08-19 15:00:08+00:00,21.23,0.19,-0.016,0.092578,0.176319
1457,2016-08-19 15:15:08+00:00,21.26,0.188,-0.017,0.091642,0.17457


In [27]:
DAT.tail()

Unnamed: 0,datetime_(UTC),water_temp_c,water_conductivity_mS/cm,water_depth_m_nonvented,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU
1453,2016-08-19 16:15:08+00:00,20.900879,0.17804,-0.019606,5.546875,0.193163,0.094034
1454,2016-08-19 16:30:08+00:00,21.072083,0.177284,-0.019296,5.546875,0.191663,0.093345
1455,2016-08-19 16:45:08+00:00,21.165466,0.176792,-0.016361,5.546875,0.190764,0.092931
1456,2016-08-19 17:00:08+00:00,21.227722,0.176197,-0.016347,5.546875,0.189878,0.092521
1457,2016-08-19 17:15:08+00:00,21.263428,0.174881,-0.016842,5.546875,0.188321,0.091793
