# Sonde3 
##  Reads and converts binary water quality environmental instrument data to a DataFrame


### I.  Example Usage

Let's dive in!

We have an example water quality instrument binary file `"tests/ysi_test_files/SA08.dat"`.  This file was generated by a YSI 600LS instrument and is in a proprietary binary format.

#### Using the `sonde()` function we:

1.  `autodetect()` the file type and pass the file to the correct parser function
2.  `read_ysi()` the binary file and convert it to a pandas DataFrame
3.  Transform all datetimes to the UTC timezone
4.  Standardize the units to metric and rename the columns to standard name conventions
5.  Pass the DataFrame to `calculate_salinity_psu()` and `calculate_do_mgl()` to apply standard formulas to generate the salinity and dissolved oxygen columns.

In [1]:
import sonde3
import pandas
metadata, df = sonde3.sonde("tests/ysi_test_files/SA08.dat", remove_invalids=False, twdbparams=True)

  metadata, df = formats.read_ysi(filename, tzinfo)
  Rtx = (rt) ** 0.5


#### Why the runtime warnings?

1.  The YSI instrument files don't contain any timezone information.  Therefore, the function has to assume a timezone for the file in order to make the UTC conversion.

2. Raw instrument files often contain impossible or incorrect values at the beginning and end of the file, for example negative values for salinity or dissolved oxygen percentage.  `sonde3` does not trim the raw file or perform QA analysis; it passes the values as they were recorded by the instrument.

To automatically convert invalid negative values to zero, either set the flag **remove_invalids=True** or omit it altogether, as this is the default behavior of the package.
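
The effect of converting negative readings to zero can be pictured on a single column with a short, standard-library-only sketch. This is not `sonde3`'s implementation: `clamp_negatives` is a hypothetical helper, and the sample values are conductivity readings copied from the output shown in this README.

```python
def clamp_negatives(values):
    """Return a copy of values with every negative reading replaced by 0.0."""
    return [max(v, 0.0) for v in values]


# Conductivity readings (mS/cm) copied from the SA08.dat output in this README:
# two valid readings followed by two slightly negative ones.
raw = [21.301758, 21.454102, -6e-06, -5e-06]
cleaned = clamp_negatives(raw)
print(cleaned)
```

On a pandas column, `Series.clip(lower=0)` applies the same rule to every value at once.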

##### We can now interact with the two dataframes produced by `sonde3`:


In [2]:
df.info() 
df.head() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
Datetime_(UTC)                               700 non-null datetime64[ns, UTC]
water_temperature                            700 non-null float64
water_electrical_conductivity                700 non-null float64
water_depth_nonvented                        700 non-null float64
water_dissolved_oxygen_percent_saturation    700 non-null float64
instrument_battery_voltage                   700 non-null float64
water_specific_conductance                   700 non-null float64
seawater_salinity                            678 non-null float64
water_dissolved_oxygen_concentration         678 non-null float64
dtypes: datetime64[ns, UTC](1), float64(8)
memory usage: 49.3 KB


Unnamed: 0,Datetime_(UTC),water_temperature,water_electrical_conductivity,water_depth_nonvented,water_dissolved_oxygen_percent_saturation,instrument_battery_voltage,water_specific_conductance,seawater_salinity,water_dissolved_oxygen_concentration
0,2008-07-16 12:00:31+00:00,28.998718,3.7e-05,0.010862,93.391418,6.09375,3.5e-05,0.013536,7.18342
1,2008-07-16 13:00:31+00:00,28.482361,5.9e-05,0.016358,96.765137,6.09375,5.5e-05,0.013326,7.510631
2,2008-07-16 14:00:31+00:00,27.257385,0.000546,0.017263,103.529358,6.09375,0.000524,0.012655,8.212117
3,2008-07-16 15:00:31+00:00,29.507751,21.301758,0.542648,93.055725,6.09375,19.613107,11.601472,6.655432
4,2008-07-16 16:00:31+00:00,29.762268,21.454102,0.557098,94.18869,6.09375,19.665354,11.631321,6.706414


In [3]:
metadata

Unnamed: 0,Model,Manufacturer,Instrument_Serial_Number,Station,Deployment_Setup_Time,Filename,Deployment_Start_Time,Deployment_Stop_Time
0,600,YSI,1012,SANT_CDT,,SA08.dat,2008-07-16 12:00:31+00:00,2008-08-14 15:00:16+00:00


### II.  Working with time zones


What if data was collected outside of US/Central time?  Pass the timezone information to `sonde3.sonde`:

In [4]:
import pytz
metadata, df = sonde3.sonde("tests/ysi_test_files/SA08.dat", pytz.timezone('US/Eastern'))
df.head()

Unnamed: 0,Datetime_(UTC),water_temp_C,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_specific_conductivity_mS/cm,water_salinity_PSU,water_DO_mgl
0,2008-07-16 11:00:31+00:00,28.998718,3.7e-05,0.010862,93.391418,6.09375,3.5e-05,0.013536,7.18342
1,2008-07-16 12:00:31+00:00,28.482361,5.9e-05,0.016358,96.765137,6.09375,5.5e-05,0.013326,7.510631
2,2008-07-16 13:00:31+00:00,27.257385,0.000546,0.017263,103.529358,6.09375,0.000524,0.012655,8.212117
3,2008-07-16 14:00:31+00:00,29.507751,21.301758,0.542648,93.055725,6.09375,19.613107,11.601472,6.655432
4,2008-07-16 15:00:31+00:00,29.762268,21.454102,0.557098,94.18869,6.09375,19.665354,11.631321,6.706414
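
The one-hour shift relative to the earlier output follows from the UTC offsets involved. The sketch below reproduces it with the standard library only. It assumes fixed summer offsets (UTC-5 for Central Daylight Time, UTC-4 for Eastern Daylight Time) instead of a full timezone database, and the local wall-clock reading of 07:00:31 is inferred from the first row of the output, not read from the file.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for July only (assumption: daylight saving time in effect).
CDT = timezone(timedelta(hours=-5), "CDT")
EDT = timezone(timedelta(hours=-4), "EDT")

# The instrument stores a naive wall-clock time with no timezone attached.
naive = datetime(2008, 7, 16, 7, 0, 31)

# Attach an assumed timezone, then convert to UTC.
as_central = naive.replace(tzinfo=CDT).astimezone(timezone.utc)
as_eastern = naive.replace(tzinfo=EDT).astimezone(timezone.utc)

print(as_central.isoformat())  # 2008-07-16T12:00:31+00:00
print(as_eastern.isoformat())  # 2008-07-16T11:00:31+00:00
```

The same wall-clock reading lands on a different UTC instant depending on the assumed timezone, which is why passing the correct one matters.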


### III. Autodetecting files



Curious about what kind of instrument files you have in a directory?  Apply the `sonde3.autodetect` function:

```python
#what kind of file is this??
sonde3.autodetect("tests/greenspan_test_files/RIOA_20060718_CDT_GS7837.xls") 
```

In [5]:
sonde3.autodetect("tests/greenspan_test_files/RIOA_20060718_CDT_GS7837.xls") 

'greenspan_xls'

Copy this code snippet to the notebook line below to run `autodetect()` on all of the use-case examples in the `sonde3` package:

```python
# this script walks through all of the test examples and collects the autodetect results
import os

import sonde3

root_dir = 'tests'
results = []
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        # skip files whose names contain "_test.txt"
        if "_test.txt" in file:
            continue
        path = os.path.join(directory, file)
        results.append(path + ' ' + sonde3.autodetect(path))

results
```
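
To summarize a directory instead of listing every file, the results can be tallied by detected type. The sketch below is one way to do that. `summarize_formats` and its `detect` argument are hypothetical names, `detect` stands in for `sonde3.autodetect`, and the `try`/`except` assumes (without confirmation from the package) that detection may raise on unrecognized files.

```python
import os
from collections import Counter


def summarize_formats(root_dir, detect):
    """Count files under root_dir by the label that detect(path) returns."""
    counts = Counter()
    for directory, _subdirectories, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(directory, name)
            try:
                label = detect(path)
            except Exception:
                # Assumption: treat any failure as an undetectable file.
                label = "undetected"
            counts[label] += 1
    return counts
```

With the package installed, `summarize_formats('tests', sonde3.autodetect)` would return a `Counter` keyed by format name.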

### IV. Generating Salinity and Dissolved Oxygen

Deployed water quality instruments typically do not compute every parameter internally.  Instead, derived parameters are calculated by the program used to read the file back at the lab.  For example, YSI instruments do not compute salinity or dissolved oxygen concentration.  This can cause confusion, because viewing a YSI *\*.dat* file in YSI's ECOWIN or Ecowatch Lite program displays salinity and DO mg/L.  However, the raw binary file does not include these columns, as they were not physically collected by the instrument during deployment.

For example, let's read the raw binary file `"tests/ysi_test_files/SA08.dat"` and see what it contains:

In [6]:
metadata, SA08_BIN = sonde3.read_ysi("tests/ysi_test_files/SA08.dat",pytz.timezone('US/Central'))
SA08_BIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 6 columns):
Datetime_(UTC)                700 non-null datetime64[ns, UTC]
water_temp_C                  700 non-null float64
water_conductivity_mS/cm      700 non-null float64
water_depth_m_nonvented       700 non-null float64
water_DO_%                    700 non-null float64
instrument_battery_voltage    700 non-null float64
dtypes: datetime64[ns, UTC](1), float64(5)
memory usage: 32.9 KB


The pandas DataFrame contains the columns for conductivity and % DO saturation, but not for salinity and DO concentration.

For comparison, let's read the comma-separated version of this file that was produced by the proprietary YSI Ecowin program:

In [7]:
metadata, SA08_CSV = sonde3.read_ysi_ascii("tests/ysi_test_files/SA08.CDF", pytz.timezone('US/Central'),delim=",")


The ECOWIN-exported comma-separated file has far more columns! The extra columns were derived through formulas using the instrument's observed measurements.

To generate these columns, pass the `SA08_BIN` DataFrame to `sonde3.calculate_salinity_psu()` and `sonde3.calculate_do_mgl()`, which calculate salinity (PSU) and dissolved oxygen (mg/L).

We can then compare our computed results to the ECOwatch program results:

In [8]:
SA08_BIN = sonde3.calculate_salinity_psu(SA08_BIN)
SA08_BIN = sonde3.calculate_do_mgl(SA08_BIN)
SA08_BIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 8 columns):
Datetime_(UTC)                700 non-null datetime64[ns, UTC]
water_temp_C                  700 non-null float64
water_conductivity_mS/cm      700 non-null float64
water_depth_m_nonvented       700 non-null float64
water_DO_%                    700 non-null float64
instrument_battery_voltage    700 non-null float64
water_salinity_PSU            678 non-null float64
water_DO_mgl                  678 non-null float64
dtypes: datetime64[ns, UTC](1), float64(7)
memory usage: 43.9 KB


So, why does `sonde3` produce fewer values for salinity and DO than the comma-separated file?  Is this a bug?

*NO!*  If you recall the `sqrt()` warning above (`Rtx = (rt) ** 0.5`), some of the values in the file are invalid.  This produces null (**NaN**) values in those rows.

In [9]:
SA08_BIN.tail()

Unnamed: 0,Datetime_(UTC),water_temp_C,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_salinity_PSU,water_DO_mgl
695,2008-08-14 11:00:16+00:00,23.977051,-6e-06,-0.052356,-0.006866,5.546875,,
696,2008-08-14 12:00:16+00:00,23.938599,-5e-06,-0.050262,-0.006866,5.46875,,
697,2008-08-14 13:00:16+00:00,24.039307,-5e-06,-0.045875,-0.006866,5.46875,,
698,2008-08-14 14:00:16+00:00,24.107056,-5e-06,-0.038671,-0.006866,5.46875,,
699,2008-08-14 15:00:16+00:00,24.177551,-6e-06,-0.034785,-0.006866,5.46875,,
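
The rows above show the cause: the conductivity readings at the end of the deployment are slightly negative. Below is a minimal NumPy sketch of the arithmetic, assuming (as the `Rtx = (rt) ** 0.5` warning suggests) that the salinity formula takes the square root of a conductivity-derived quantity. The array holds raw readings from this README as illustrative stand-ins for that quantity.

```python
import numpy as np

# Two valid conductivity readings followed by two negative ones (mS/cm),
# used here as stand-ins for the ratio the salinity formula works on.
rt = np.array([21.301758, 21.454102, -6e-06, -5e-06])

# The square root of a negative float is undefined, so NumPy returns NaN
# (and normally emits a RuntimeWarning, silenced here for clarity).
with np.errstate(invalid="ignore"):
    rtx = rt ** 0.5

print(np.isnan(rtx))  # boolean mask of the entries that became NaN
print(int(np.isnan(rtx).sum()), "of", rtx.size, "values are NaN")
```

The 22 missing values reported by `info()` (700 rows minus 678 non-null) are consistent with this behavior.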


Let's check our two files to see if our conversion methods match those used by YSI.

Let's use `numpy.random` to pick a row to check:



In [10]:
from numpy import random
row = random.randint(2,677) # pick a random row index

sub1 = SA08_BIN.iloc[row:row+1]  # the chosen row from the binary file
sub2 = SA08_CSV.iloc[row:row+1]  # the same row from the csv file

pandas.concat([sub1, sub2], axis=0,join="inner")

Unnamed: 0,Datetime_(UTC),water_temp_C,water_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,instrument_battery_voltage,water_salinity_PSU,water_DO_mgl
204,2008-07-25 00:00:31+00:00,28.85498,32.961914,1.031036,93.282318,5.9375,18.971225,6.476434
204,2008-07-25 00:00:31+00:00,28.85,32.962,1.031,93.3,5.9,18.98,6.48


Looks the same!  Notice, however, that the csv file values are rounded due to the fixed digit precision.
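
A by-eye comparison can be made repeatable with a tolerance check. The sketch below uses the row 204 values printed above and an absolute tolerance of 0.05, which is an arbitrary choice for illustration and not a threshold taken from `sonde3` or YSI.

```python
import math

columns = ["water_temp_C", "water_conductivity_mS/cm", "water_depth_m_nonvented",
           "water_DO_%", "instrument_battery_voltage", "water_salinity_PSU",
           "water_DO_mgl"]
# Row 204 as read from the binary file and from the csv export.
binary_row = [28.85498, 32.961914, 1.031036, 93.282318, 5.9375, 18.971225, 6.476434]
csv_row = [28.85, 32.962, 1.031, 93.3, 5.9, 18.98, 6.48]

TOLERANCE = 0.05  # assumed absolute tolerance

# Absolute difference per column between the two versions of the row.
differences = {
    name: abs(b - c) for name, b, c in zip(columns, binary_row, csv_row)
}
all_close = all(
    math.isclose(b, c, abs_tol=TOLERANCE) for b, c in zip(binary_row, csv_row)
)

for name, diff in differences.items():
    print(f"{name}: {diff:.6f}")
print("all within tolerance:", all_close)
```

Printing the per-column differences makes it easy to see which columns differ only by rounding and which, if any, differ by more.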

### V. Files from other Manufacturers and Models
#### a. Hydrotech

In [11]:
#metadata, mydf = sonde3.sonde("tests/hydrotech_test_files/0109DELT.CSV", pytz.timezone('US/Central'))

In [12]:
#metadata

In [13]:
#mydf.info()

In [14]:
#mydf.head()

In [15]:
#mydf.tail()

#### b. YSI EXO Units

YSI EXO units are timezone aware.  Thus, we discover the internal timezone and convert to UTC if applicable.

In [16]:
sonde3.autodetect("tests/ysi_test_files/GE-SA-B_17H104157_090617_060000.csv")

'ysi_exo_csv'

In [17]:
#metadata, exo_df = sonde3.sonde("tests/ysi_test_files/GE-SA-B_17H104157_090617_060000.csv")

In [18]:
metadata

Unnamed: 0,Manufacturer,Instrument_Serial_Number,Model,Station,Deployment_Setup_Time,Deployment_Start_Time,Deployment_Stop_Time
0,YSI,,,,,2008-07-16 12:00:31+00:00,2008-08-14 15:00:16+00:00


In [19]:
#exo_df.head()

#### c. YSI ascii + Comma Separated.

YSI timeseries produced by a Kermit transfer using the YSI handset protocol.

In [20]:
metadata, df = sonde3.sonde("tests/ysi_test_files/0917GEB.txt")

  metadata, df = formats.read_ysi_ascii(filename, tzinfo, ',', None, [1, 2, 3])


In [21]:
metadata

Unnamed: 0,Manufacturer,Instrument_Serial_Number,Model,Station,Deployment_Setup_Time,Deployment_Start_Time,Deployment_Stop_Time
0,YSI,,,,,2017-09-08 12:15:08+00:00,2017-09-26 20:00:08+00:00


In [22]:
df.head()

Unnamed: 0,Datetime_(UTC),water_temp_C,water_tds_g/L,water_pressure_abs,water_depth_m_nonvented,instrument_battery_voltage,water_conductivity_mS/cm,water_specific_conductivity_mS/cm,water_salinity_PSU
0,2017-09-08 12:15:08+00:00,19.93,0.0,14.763,0.303,6.2,0.0,0.0,0.010285
1,2017-09-08 12:30:08+00:00,20.05,0.0,14.767,0.306,6.2,0.0,0.0,0.010336
2,2017-09-08 12:45:08+00:00,20.54,0.0,14.769,0.307,6.2,0.0,0.0,0.010544
3,2017-09-08 13:00:08+00:00,20.91,0.0,14.771,0.309,6.2,0.0,0.0,0.010699
4,2017-09-08 13:15:08+00:00,21.21,0.0,14.774,0.311,6.2,0.0,0.0,0.010823


#### d. Lowell Tiltmeter Comma Separated.

Concatenated timeseries file from a Lowell tiltmeter.

In [23]:
metadata, df = sonde3.sonde("tests/lowell_test_files/1603103_SAB_05122016.csv")

  metadata, df = formats.read_lowell(filename, tzinfo, ',')


In [24]:
sonde3.merge_lowell()

In [25]:
metadata.head()

Unnamed: 0,Deployment_Setup_Time,Deployment_Start_Time,Deployment_Stop_Time,Filename,Instrument_Serial_Number,Manufacturer,Model,Station
0,,2016-05-13 17:30:00+00:00,2016-07-15 18:00:00+00:00,,1603103,Lowell,TCM,


### VI. Package Validation Tests

#### A.  Check to see if the package is correctly handling daylight saving time.

The test case dataset **tests/ysi_test_files/0108BAYT.csv**, along with its *.dat* and *CDF* counterparts, was collected during a daylight saving time transition.  The three files were each processed differently.  The *.dat* file is the raw binary file from the instrument.  The *CDF* file was produced by the Ecowatch program.  The *csv* file was produced by converting the *CDF* file to a comma-separated file in Microsoft Excel.  If the times were not consistent between the three files, it would indicate that **sonde3** is not processing datetimes correctly.

In [26]:
# import all three versions of the file: CDF, csv, and binary *.dat
metadata, baytcsv = sonde3.sonde("tests/ysi_test_files/0108BAYT.csv", pytz.timezone('US/Central'))
metadata, baytdat = sonde3.sonde("tests/ysi_test_files/0108BAYT.dat", pytz.timezone('US/Central'))
metadata, baytCDF = sonde3.sonde("tests/ysi_test_files/0108BAYT.CDF", pytz.timezone('US/Central'))

In [27]:
#cut the first & last row from each file.
pandas.concat([baytCDF.iloc[0:1], baytdat.iloc[0:1], baytcsv.iloc[0:1] , \
               baytCDF.iloc[-1:], baytdat.iloc[-1:], baytcsv.iloc[-1:]], axis=0,join="inner")

Unnamed: 0,Datetime_(UTC),water_temp_C,water_specific_conductivity_mS/cm,water_depth_m_nonvented,water_DO_%,water_conductivity_mS/cm,water_DO_mgl,water_salinity_PSU
0,2008-01-29 15:00:33+00:00,22.56,-0.0,-0.0,103.2,-0.0,8.927547,0.01137
0,2008-01-29 15:00:33+00:00,22.562561,-0.0,-0.0,103.227234,-0.0,8.929467,0.011371
0,2008-01-29 15:00:00+00:00,22.56,0.0,-0.0,103.2,0.0,8.927547,0.01137
1061,2008-03-13 19:00:33+00:00,22.51,-0.0,-0.0,104.3,-0.0,9.031326,0.01135
1061,2008-03-13 19:00:33+00:00,22.510376,-0.0,-0.0,104.284668,-0.0,9.029934,0.01135
1061,2008-03-13 19:00:00+00:00,22.51,0.0,-0.0,104.3,0.0,9.031326,0.01135
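
The consistency check can also be expressed in code. The sketch below takes the timestamps printed in the table above and confirms that the three versions agree to within a minute. The one-minute tolerance is an assumption, chosen because the csv version carries no seconds. A processing error across the daylight saving transition would show up as a gap of a full hour.

```python
from datetime import datetime, timedelta

# Timestamps copied from the table above, in the order CDF, dat, csv.
first_rows = {
    "CDF": "2008-01-29 15:00:33+00:00",
    "dat": "2008-01-29 15:00:33+00:00",
    "csv": "2008-01-29 15:00:00+00:00",
}
last_rows = {
    "CDF": "2008-03-13 19:00:33+00:00",
    "dat": "2008-03-13 19:00:33+00:00",
    "csv": "2008-03-13 19:00:00+00:00",
}

MAX_SPREAD = timedelta(minutes=1)  # assumed tolerance for this check


def spread(rows):
    """Return the gap between the earliest and latest timestamp in rows."""
    times = [datetime.fromisoformat(text) for text in rows.values()]
    return max(times) - min(times)


for label, rows in (("first", first_rows), ("last", last_rows)):
    gap = spread(rows)
    print(label, gap, "OK" if gap <= MAX_SPREAD else "MISMATCH")
```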


#### B.  More parsing testing

In [28]:
metadata, df = sonde3.sonde("tests/lowell_test_files/1708018_3Basin.csv", twdbparams=True)
df.columns

Index(['Datetime_(UTC)', 'water_speed', 'water_bearing',
       'northward_water_velocity', 'eastward_water_velocity',
       'water_temperature'],
      dtype='object')

In [29]:
metadata, df = sonde3.sonde("tests/ysi_test_files/spotchecks-galveston-180618-081639.csv", twdbparams=True)
df.columns

AttributeError: 'str' object has no attribute 'seek'

In [None]:
df

In [None]:
df.columns

In [None]:

import pandas
import seawater
import datetime
import pytz
import sys
import csv
import six


print ("Python: ", sys.version)
print ("Our Maxsize: ", sys.maxsize)
print ("pandas: ", pandas.__version__)
print ("seawater: ", seawater.__version__)
print ("six: ", six.__version__)
print ("csv: ", csv.__version__)
print ("pytz: ", pytz.__version__)