# Data processing

The water monitoring project at the Montreux Jazz Festival has been running since 2016. Since then, the data has been collected and processed by a variety of people.

__Objective:__ Standardize the nomenclature from the different sampling years. Provide a model for storing and collecting data in the future.

__Purpose:__ Define the probability that a survey will exceed a threshold value within the period of the year defined by the survey results.
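As a rough illustration of the purpose stated above, the probability of exceeding a threshold can be estimated as the share of surveys whose value is above it. This is a minimal sketch, not the project's method: the function name `exceedance_probability`, the survey values and the threshold are all hypothetical.

```python
def exceedance_probability(values, threshold):
    """Share of surveys whose value is strictly greater than the threshold."""
    values = list(values)
    if not values:
        raise ValueError("no survey results supplied")
    return sum(v > threshold for v in values) / len(values)


# hypothetical survey results, in colonies per 100 ml
surveys = [0, 100, 1100, 300, 0, 700, 200, 1500]
# hypothetical threshold, not a value taken from the project
threshold = 500

p = exceedance_probability(surveys, threshold)
print(p)
```

Restricting `surveys` to one period of the year (for example, the weeks covered by the sampling) gives the estimate for that period.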

## Definitions

* colony: a circular discoloration of the media within a defined size and color range
* colony-count: the number of discolorations of the same hue for a media type
* media/medium: the environment that the water samples are placed in
* color: the observed color of the colony
* label: the assumed category of the color:
  * Bioindicator
  * Coliform
  * Other
* coef: the correction factor applied to the colony count so that counts can be reported per 100 ml of the original water sample
 
The purpose of the sampling is to identify the colonies that appear in the media and assign each one to one of the possible labels. The label of interest is _Bioindicator_: it represents the bacteria that originate from the organism of interest. In this case the organism is people, and the _Bioindicator_ originates from fecal contamination.

## Methods

The process requires collaborating with the data-manager(s) from the different project years and ensuring that the data from each year can be combined and interpreted together. The data for this collaboration is stored in the _componentdata_ folder.

The relationship of previous label <---> new label is stored in a dictionary or an array for the different possibilities of medium, color, label and coefficient. The new labels are applied to a data-frame.

The finished data (the result of the collaboration) is stored in the _end_ folder.

## Sample data

The sample data is an example of the desired output per year. This includes the following parameters:

1. colony-count
2. label
3. location
4. coefficient*count
5. week number
6. day of year
7. is-jazz: boolean
8. rainfall in millimeters
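Three of these parameters can be derived from the date, the count and the coefficient of a record. The sketch below is one way to do it with the standard library; the function name `derive_fields` and its output keys are hypothetical, and the use of ISO week numbers is an assumption (it agrees with the `week` and `doy` values of the 2023-06-12 records in the survey data).

```python
import datetime as dt


def derive_fields(date, count, coef):
    """Return the derived sample-data fields for one record."""
    return {
        "coef_count": coef * count,        # colonies per 100 ml
        "week": date.isocalendar()[1],     # ISO week number (assumed convention)
        "doy": date.timetuple().tm_yday,   # day of year, starting at 1
    }


# values taken from the Pink record of sample VNX1 on 2023-06-12
record = derive_fields(dt.date(2023, 6, 12), count=11.0, coef=100.0)
print(record)
```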

In [1]:
import pandas as pd
import datetime as dt
import numpy as np

project = "Hackuarium do it together water quality sampling"
site_markers = {"SVT":"o", "VNX":"D", "MRD":"X"}
species_colors = { "Bioindicator":"dodgerblue", "Coliform":"magenta"}
marker_colors = {"SVT":"black", "VNX":"green", "MRD":"goldenrod"}
sites = ["SVT", "VNX", "MRD"]

## Survey data

The table below shows the format of the survey data prior to processing. It is the result of the collaboration.

In [2]:
stddf = pd.read_csv("data/end/survey_data_2020_2023.csv")
# parse the date strings and keep only the calendar date (no time component)
stddf["date"] = pd.to_datetime(stddf["date"]).dt.date
stddf.head()

Unnamed: 0,date,sample,temperature,media,color,count,image (48h),coef,date_sample,year,location,doy,week,isjazz,label,medium
0,2023-06-12,VNX1,16.1,ECC-A,Dark Blue,0.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Bioindicator,ECC-A
1,2023-06-12,VNX1,16.1,ECC-A,Turquoise,0.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Other,ECC-A
2,2023-06-12,VNX1,16.1,ECC-A,Pink,11.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Coliform,ECC-A
3,2023-06-12,VNX2,16.1,ECC-A,Dark Blue,1.0,20230614_205229.jpg,100.0,"('12/6/2023', 'VNX2')",2023,VNX,163,24,False,Bioindicator,ECC-A
4,2023-06-12,VNX2,16.1,ECC-A,Turquoise,1.0,20230614_205229.jpg,100.0,"('12/6/2023', 'VNX2')",2023,VNX,163,24,False,Other,ECC-A


In [3]:
new_df16 = pd.read_csv("data/componentdata/2016_Data.csv")

colors_2016 = [
    'P1_24h_big_blue', 'P1_24h_med_blue',
       'P1_24h_other', 'P1_24h_pink', 'P1_24h_turq', 'P1_qty_sample',
       'P2_24h_big_blue', 'P2_24h_med_blue', 'P2_24h_other', 'P2_24h_pink',
       'P2_24h_turq', 'P3_24h_big_blue', 'P3_24h_med_blue', 'P3_24h_other',
       'P3_24h_pink', 'P3_24h_turq']

new_df16.columns

Index(['Unnamed: 0', 'Date', 'Location', 'P1_24h_big_blue', 'P1_24h_med_blue',
       'P1_24h_other', 'P1_24h_pink', 'P1_24h_turq', 'P1_qty_sample',
       'P2_24h_big_blue', 'P2_24h_med_blue', 'P2_24h_other', 'P2_24h_pink',
       'P2_24h_turq', 'P3_24h_big_blue', 'P3_24h_med_blue', 'P3_24h_other',
       'P3_24h_pink', 'P3_24h_turq'],
      dtype='object')
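The 2016 column names pack several pieces of information into one string. The sketch below splits a name into plate number, incubation time and color so that the color part can later be passed to the label translation. The pattern is an assumption inferred from the column names shown above, and `parse_count_column` is a hypothetical helper.

```python
import re

# pattern assumed from the column names above: P<plate>_<hours>h_<color>
COLUMN_PATTERN = re.compile(r"^P(?P<plate>\d+)_(?P<hours>\d+)h_(?P<color>.+)$")


def parse_count_column(name):
    """Split a count column name into its parts; return None for other columns."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    return {
        "plate": int(match["plate"]),
        "hours": int(match["hours"]),
        "color": match["color"],
    }


for column in ["P1_24h_big_blue", "P2_24h_turq", "P1_qty_sample", "Date"]:
    print(column, parse_count_column(column))
```

Columns such as `P1_qty_sample` and `Date` do not follow the pattern and return `None`, so they can be kept as identifiers when the table is reshaped.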

In [4]:
new_df17 = pd.read_csv("data/componentdata/2017_Data.csv")
new_df17.columns

Index(['Date', 'Location', 'medium', 'Samples', 'Sampling_Notes', 'Water_temp',
       'Plating_notes', 'Temp_incubation', 'P1_qty_sample',
       'Image_24h_fluo_plate_one', 'P1_fluo_halo_colonies', 'P1_fluo_other',
       'Plate_one_24h_image', 'P1_24h_big_blue', 'P1_24h_med_blue',
       'P1_24h_green', 'P1_24h_turq', 'P1_24h_pink', 'P1_24h_other',
       'Comments_p1_24h', 'Plate_one_48h_image', 'P1_48h_big_blue',
       'P1_48h_med_blue', 'P1_48h_green', 'P1_48h_turq', 'P1_48h_pink',
       'P1_48h_other', 'Comments_p1_48h', 'P2_qty_sample',
       'Image_24h_fluo_plate_two', 'P2_fluo_halo_colonies', 'P2_fluo_other',
       'Plate_two_24h_image', 'P2_24h_big_blue', 'P2_24h_med_blue',
       'P2_24h_green', 'P2_24h_turq', 'P2_24h_pink', 'P2_24h_other',
       'Comments_p2_24h', 'Plate_two_48h_image', 'P2_48h_big_blue',
       'P2_48h_med_blue', 'P2_48h_green', 'P2_48h_turq', 'P2_48h_pink',
       'P2_48h_other', 'Comments_p2_48h', 'P3_qty_sample',
       'Image_24h_fluo_plate_three

### Applying labels

The colors that were used for the observations can be placed into three broad categories. 

1. Bioindicator
2. Coliform
3. Other

The microbiologist determines the correct label for the recorded color based on the specifics of the media/medium used to grow the culture.

The colors appropriate to each label are stored in a list, one list per label. The color of each record is tested for membership in each list. If the color is found, the corresponding label is returned; if it is not in any list, the original value is returned. The result is added to the data-frame as the label column.

```python
bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green", "green_met"]
coliforms = ["Pink", "pink"]
other = ["Turquoise", "Turquoise slow", "other"]

def translate_colors(x, bioindicators, coliforms, other):
    if x in bioindicators:
        return "Bioindicator"
    elif x in coliforms:
        return "Coliform"
    elif x in other:
        return "Other"
    else:
        return x

stddf["label"] = stddf.color.apply(lambda x: translate_colors(x, bioindicators, coliforms, other))
```
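Because an unrecognized color is returned unchanged, it can end up in the label column unnoticed. A small check like the one below lists the values that are not one of the three labels. It is a sketch only: `VALID_LABELS` and `unmapped_colors` are hypothetical names, and the example labels are made up.

```python
VALID_LABELS = {"Bioindicator", "Coliform", "Other"}


def unmapped_colors(labels):
    """Return the sorted values that are not one of the three labels."""
    return sorted(set(labels) - VALID_LABELS)


# hypothetical label column after translation; "mauve" is not in any of the lists above
labels = ["Bioindicator", "Other", "mauve", "Coliform", "mauve"]
leftover = unmapped_colors(labels)
print(leftover)
```

Calling `unmapped_colors(stddf["label"])` after the translation step shows which colors still need to be added to one of the lists.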

We do the same for the media/medium, except that we use a dictionary to store the mapping.

```python
media_names = {
    "ECC-A Card":"ECC-A",
    "new ECCA":"ECC-A",
    "E-coli side": "E coli",
    "ECC-side":"ECC",
    "selective":"Levine",
    "media":"EasyGel",
    "plus uv":"EasyGelPlus",
    "UVplus":"EasyGelPlus",
    "non-restrictive":"LB"
}

def translate_media(x, media_names):
    if x in media_names.keys():
        return media_names[x]
    else:
        return x


stddf["medium"] = stddf.media.apply(lambda x: translate_media(x, media_names))

```



### Labeling the date range of interest

Here are the Jazz festival dates for every sampling year:

* 2016: 2016-07-01 - 2016-07-16
* 2017: 2017-06-30 - 2017-07-15
* 2020: 2020-07-03 - 2020-07-18
* 2022: 2022-07-01 - 2022-07-16
* 2023: 2023-06-30 - 2023-07-15
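One way to turn these ranges into the `isjazz` flag is a lookup by year. The sketch below uses only the standard library and covers the three years present in the survey data. `JAZZ_DATES` and `is_jazz` are hypothetical names, and treating both end dates as inclusive is an assumption.

```python
import datetime as dt

# festival ranges for the years present in the survey data, copied from the list above
JAZZ_DATES = {
    2020: (dt.date(2020, 7, 3), dt.date(2020, 7, 18)),
    2022: (dt.date(2022, 7, 1), dt.date(2022, 7, 16)),
    2023: (dt.date(2023, 6, 30), dt.date(2023, 7, 15)),
}


def is_jazz(day):
    """True when the date falls inside the festival range of its year (ends inclusive)."""
    span = JAZZ_DATES.get(day.year)
    if span is None:
        return False
    start, end = span
    return start <= day <= end


print(is_jazz(dt.date(2023, 6, 12)))
print(is_jazz(dt.date(2023, 7, 1)))
```

Since the `date` column holds `datetime.date` values, the function can be applied directly with `stddf["date"].apply(is_jazz)`.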

In [5]:
# mask the date ranges
import datetime as dt

def make_date_object(x):
    # stddf["date"] holds datetime.date values, so return a date rather than a datetime
    return dt.datetime.strptime(x, "%Y-%m-%d").date()

# the isjazz column is already present in the stored survey data,
# so the assignments below are kept for reference only
# y1 = (stddf["date"] >= make_date_object("2020-07-03")) & (stddf["date"] <= make_date_object("2020-07-18"))
# y2 = (stddf["date"] >= make_date_object("2022-07-01")) & (stddf["date"] <= make_date_object("2022-07-16"))
# y3 = (stddf["date"] >= make_date_object("2023-06-30")) & (stddf["date"] <= make_date_object("2023-07-15"))

# stddf.loc[y1, "isjazz"] = True
# stddf.loc[y2, "isjazz"] = True
# stddf.loc[y3, "isjazz"] = True
stddf.head()

Unnamed: 0,date,sample,temperature,media,color,count,image (48h),coef,date_sample,year,location,doy,week,isjazz,label,medium
0,2023-06-12,VNX1,16.1,ECC-A,Dark Blue,0.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Bioindicator,ECC-A
1,2023-06-12,VNX1,16.1,ECC-A,Turquoise,0.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Other,ECC-A
2,2023-06-12,VNX1,16.1,ECC-A,Pink,11.0,20230614_205159.jpg,100.0,"('12/6/2023', 'VNX1')",2023,VNX,163,24,False,Coliform,ECC-A
3,2023-06-12,VNX2,16.1,ECC-A,Dark Blue,1.0,20230614_205229.jpg,100.0,"('12/6/2023', 'VNX2')",2023,VNX,163,24,False,Bioindicator,ECC-A
4,2023-06-12,VNX2,16.1,ECC-A,Turquoise,1.0,20230614_205229.jpg,100.0,"('12/6/2023', 'VNX2')",2023,VNX,163,24,False,Other,ECC-A


In [6]:
# translate colors
def translate_colors(x, bioindicators, coliforms, other):
    if x in bioindicators:
        return "Bioindicator"
    elif x in coliforms:
        return "Coliform"
    elif x in other:
        return "Other"
    else:
        return x

bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green", "green_met", "fluo_halo", "big_blue"]
coliforms = ["Pink", "pink", "purple", "med_blue"]
other = ["Turquoise", "Turquoise slow", "other", "mauve", "fluo_other", "green"]

# stddf["label"] = stddf.color.apply(lambda x: translate_colors(x, bioindicators, coliforms, other))

def translate_media(x, media_names):
    if x in media_names.keys():
        return media_names[x]
    else:
        return x

media_names = {
    "ECC-A Card": "ECC-A",
    "new ECCA": "ECC-A",
    "E-coli side": "E coli",
    "Double side E coli": "E coli",
    "ECC-side": "ECC",
    "Double side ECC": "ECC",
    "selective": "Levine",
    "media": "EasyGel",
    "plus uv": "EasyGelPlus",
    "UVplus": "EasyGelPlus",
    "non-restrictive": "LB",
    "levine": "Levine",
    "easy_gel": "EasyGel",
    "unil_kitchen": "LB",
    "micrology_card": "ECC"
}

# stddf["medium"] = stddf.media.apply(lambda x: translate_media(x, media_names))

## Rainfall

In [7]:
sample_data = pd.read_csv("data/end/rain_data_2016.csv")
sample_data.head()

Unnamed: 0,date,mm
0,2016-06-21,4.0
1,2016-06-22,0.6
2,2016-06-23,0.9
3,2016-06-24,13.1
4,2016-06-25,9.8
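The rainfall can be attached to the survey records by joining on the date. The sketch below shows one possible approach with a left merge, so that surveys without a rainfall record are kept with a missing value. The rainfall values are the first two rows shown above; the survey rows are hypothetical.

```python
import pandas as pd

# first two rainfall rows from the output above
rain = pd.DataFrame({"date": ["2016-06-21", "2016-06-22"], "mm": [4.0, 0.6]})

# hypothetical survey records, for illustration only
surveys = pd.DataFrame({
    "date": ["2016-06-22", "2016-06-22", "2016-06-30"],
    "location": ["SVT", "VNX", "MRD"],
})

# both date columns must share the same type before merging
for frame in (rain, surveys):
    frame["date"] = pd.to_datetime(frame["date"])

# a left merge keeps every survey row; days without rainfall data get NaN
merged = surveys.merge(rain, on="date", how="left")
print(merged)
```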


In [8]:
stddf["medium"].unique()

array(['ECC-A', 'ECC', 'E coli', 'LB', 'Levine'], dtype=object)

In [9]:
stddf["label"].unique()

array(['Bioindicator', 'Other', 'Coliform'], dtype=object)

In [10]:
stddf.coef.unique()

array([100.,   1.])

In [11]:
stddf.year.unique()

array([2023, 2022, 2020])

In [12]:
stddf.week.unique()

array([24, 25, 26, 27, 28, 29, 30, 31])

In [14]:
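# keep only the three monitored sites; copy so that later assignments do not raise SettingWithCopyWarning
stddf = stddf[stddf.location.isin(sites)].copy()
# try to cast the counts to int; with errors='ignore' the column is left unchanged if any value fails
stddf["count"] = stddf["count"].astype('int', errors='ignore')
# records whose count is the string 'nd' cannot be used as numbers
dropthese = stddf[stddf["count"] == 'nd']
dropthese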

Unnamed: 0,date,sample,temperature,media,color,count,image (48h),coef,date_sample,year,location,doy,week,isjazz,label,medium
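An alternative way to find counts that are not numbers, offered here as a sketch rather than as the method used above, is `pd.to_numeric` with `errors="coerce"`: every value that cannot be parsed becomes `NaN`, which makes the offending records easy to select. The example values are hypothetical.

```python
import pandas as pd

# hypothetical count column mixing numbers, numeric strings and the marker 'nd'
counts = pd.Series([0.0, "11", "nd", 1.0])

# values that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(counts, errors="coerce")

# original entries that failed to parse
bad = counts[numeric.isna()]

# remaining counts as integers
clean = numeric.dropna().astype(int)

print(bad.tolist())
print(clean.tolist())
```

Note that a count that was already missing in the source would also be selected by `numeric.isna()`.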
