# **DATA ORGANIZATION**

this script organizes the data, performs feature engineering, and writes other .csv/.xlsx files as needed

NOTE: the original data file will not be saved in this repository because it contains confidential location information. Each location is assigned a number, and we keep track of this list internally; however, this number will not be used as a feature in the algorithms

In [1]:
%load_ext autoreload
%autoreload 2

In [22]:
#imports and get raw data file
import pandas as pd
import numpy as np
import funcs
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#note that 'private_name' is the associated secret number for the different locations
df = pd.read_csv('../data/raw_data.csv')
t1 = pd.read_excel('../data/table1.xlsx')

In [3]:
'''
IMPORTANT NOTES/ASSUMPTIONS: 

- many of the tests are for other soil/water parameters (pH, electrical conductivity, etc) so we want to extract just pesticide tests...

- to be thorough, the DEC tested for numerous pesticides on each sample, many of which were not applied, resulting in lots of important 
  but unusable data where there is no detectable amount

- many farmers/pesticide appliers provided us information on which pesticides they used...the df includes a 'wasused' column that will be
  utilized to extract the usable feature...however many pesticides were detectable in cases where we did not know if it was applied, so it
  is ASSUMED that the pesticide was applied somewhere in close proximity

- different testing methods with different detection limits are used for different pesticides...these methods/limits are often improving...it
  is suspected that the lower the detection limit, the more likely a pesticide is to be detected...so it will be used as a parameter in 
  the algorithms...the detection limit was not entered for every test, however each 'parameter' has it entered for
  at least one test, so we fill the NaN values from another test of the same parameter
  
- all tests for sulfur as the parameter will be removed due to wildly varying behavior

- uninterested in loctype 'Pond', 'Categorical - potable', and 'Long term' as these were ancillary tests or not enough information 
  is known about the testing area

- FEATURE ENGINEERING: all nan results are considered zero...the pesticide was not detected

'''
# fill na results to 0 and make sure the column is numeric
df['result'] = pd.to_numeric(df['result'].fillna(0))

# clean detection limits: strip the flag characters (*, >, <, ?) and convert to float
# missing or unparsable limits stay NaN so they can be filled below
df['detlimit'] = df['detlimit'].astype(str).str.replace(r'[*<>?]', '', regex=True)
df['detlimit'] = pd.to_numeric(df['detlimit'], errors='coerce')

# fill each unfilled detlimit with the first limit found elsewhere for the same parameter
first_known = df.groupby('parameter')['detlimit'].transform('first')
df['detlimit'] = df['detlimit'].fillna(first_known)


#this contains all test rows to be put into algorithms
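# keep a test if the pesticide was not reported as unused, or if it was detected and has a known koc or kfoc
# (each comparison is parenthesized because & and | bind tighter than > and !=)
detected = df['result'] > 0
df_tests = df[(df['wasused'] != 'no') | (df['koc'].notnull() & detected) | (df['kfoc'].notnull() & detected)]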
df_tests = df_tests[df_tests['drainage_class'].notnull() & df_tests['soil_halflife'].notnull()]
df_tests = df_tests[df_tests['parameter'] != 'Sulfur']

#strip whitespaces
df_tests['loctype'] = df_tests['loctype'].apply(lambda x: x.strip())
df_tests = df_tests[df_tests['loctype'] != 'Pond']
df_tests = df_tests[df_tests['loctype'] != 'Categorical - potable']
df_tests = df_tests[df_tests['loctype'] != 'Long term']
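
The detection-limit fill described in the notes above can be sketched on a small made-up frame. This is only an illustration: the frame `toy` and its values are hypothetical, and only the column names 'parameter' and 'detlimit' come from the dataset.

```python
import numpy as np
import pandas as pd

# hypothetical tests: the limit was entered for only one test per parameter
toy = pd.DataFrame({
    'parameter': ['Atrazine', 'Atrazine', 'Diuron', 'Diuron'],
    'detlimit': ['<0.05', np.nan, np.nan, '0.1*'],
})

# strip flag characters and convert to float (missing entries stay NaN)
cleaned = toy['detlimit'].astype(str).str.replace(r'[*<>?]', '', regex=True)
toy['detlimit'] = pd.to_numeric(cleaned, errors='coerce')

# fill each missing limit with the first known limit for the same parameter
first_known = toy.groupby('parameter')['detlimit'].transform('first')
toy['detlimit'] = toy['detlimit'].fillna(first_known)
print(toy)
```

Taking the first known limit per parameter assumes the limit did not change between tests; if methods improved over time, a per-date fill may be more appropriate.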


In [4]:
'''
IMPORTANT CONCEPT

- in theory, the organic carbon-water partition coefficient ('koc' column) and the organic carbon-normalized Freundlich distribution 
  coefficient ('kfoc' column) can be treated as the same quantity

- this loop combines the columns, choosing koc first if it is available

'''
pcoef = []
for idx, row in df_tests.iterrows():
    if row['koc'] > 0 :
        pcoef += [float(row['koc'])]
    else :
        pcoef += [float(row['kfoc'])]

df_tests['pcoef'] = pcoef
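
The same koc-first rule can also be written without a loop. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and mirrors the loop's condition: use 'koc' when it is greater than zero, otherwise fall back to 'kfoc'.

```python
import numpy as np
import pandas as pd

# hypothetical coefficients: one row has no koc, one has a koc of zero
toy = pd.DataFrame({'koc': [100.0, np.nan, 0.0], 'kfoc': [80.0, 250.0, 40.0]})

# keep koc where it is a positive number, otherwise take kfoc
toy['pcoef'] = toy['koc'].where(toy['koc'] > 0, toy['kfoc'])
print(toy)
```

A NaN koc fails the `> 0` test, so those rows fall back to kfoc just as they do in the loop.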

In [5]:
'''
- extract all current columns of potential interest to be put into algorithms...NOT FINAL

- other minor fixes
'''
col_list = ['private_name', 'loctype', 'aquifer_vulnerability', 'drainage_class', 'detlimit', 'sampdate', 'parameter','gus', 'soil_halflife', 'simphalflife', 'morehalflives', 'pcoef', 'simpsorp', 'simpsorp2', 'result', 'simpresult']

#get all columns of interest
df_cols = df_tests.loc[:, col_list]

#replace all instances of 'well drained' to 'Well drained'
df_cols.replace(to_replace='well drained', value='Well drained', inplace = True)



In [6]:
'''
IMPORTANT CONCEPT:

- at many testing sites, samples were taken in both the downgradient and upgradient groundwater of the pesticide-treated area...
  these are distinguished by 'Categorical - upgradient' and 'Categorical - downgradient'...'Categorical - up and downgradient' indicates
  one site where the test was both upgradient of one treated area and downgradient of another

- tests were done at upgradient sites to find out if pesticides were already in the groundwater NOT as a result of the land-owners'
  application...this could be the result of a neighboring property applying pesticides, for example...if the same pesticide is detected downgradient 
  and upgradient of the pesticide application area, then the upgradient value should be subtracted from the downgradient value to get a better
  representation of what is happening with land-owners' pesticides

- this loop identifies upgradient/downgradient tests on the same sampling date and subtracts the upgradient result from the downgradient

'''

# reset index
df_reset = df_cols.reset_index(drop=True)

for idx, row in df_reset.iterrows():
    # find 'upgradient' or 'up and downgradient' test on same date for same parameter in same location
    if row['loctype'] in ['Categorical - downgradient','Categorical - up and downgradient']:
        sampdate = row['sampdate']
        parameter = row['parameter']
        loctype = row['loctype']
        name = row['private_name']
        # string comparison works because 'Categorical - downgradient' < 'Categorical - up and downgradient' < 'Categorical - upgradient'
        upgradient = df_reset[(df_reset['private_name'] == name) & (df_reset['sampdate'] == sampdate) & (df_reset['loctype'] > loctype) & (df_reset['parameter'] == parameter)]

        # if test has both 'upgradient' and 'up and downgradient' samples, then subtract just the 'up and downgradient'
        # sorting by loctype puts 'up and downgradient' results first, so we can just subtract out first index of whatever upgradient is
        if len(upgradient) > 0:
            upgradient = upgradient.sort_values('loctype')
            df_reset.loc[idx, 'result'] -= upgradient.loc[upgradient.index[0],'result']

# now extract out just the downgradient tests of interest
# .copy() makes an independent frame, so adding a column below does not raise SettingWithCopyWarning
df_adjusted = df_reset[(df_reset['loctype'] != 'Categorical - upgradient') & (df_reset['loctype'] != 'Categorical - up and downgradient')].copy()

# add a 'detected' column: 1 if result > 0 (detected), -1 if not
df_adjusted['detected'] = np.where(df_adjusted['result'] > 0, 1, -1)
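
For the simple case of one upgradient test per site, date, and parameter, the subtraction can be sketched with a merge instead of a loop. The frame `toy` and its values are hypothetical, and the sketch ignores the 'Categorical - up and downgradient' case handled by the loop above.

```python
import pandas as pd

# hypothetical tests: site 1 has an upgradient sample, site 2 does not
toy = pd.DataFrame({
    'private_name': [1, 1, 2],
    'sampdate': ['2020-06-01', '2020-06-01', '2020-06-01'],
    'parameter': ['Atrazine', 'Atrazine', 'Atrazine'],
    'loctype': ['Categorical - downgradient', 'Categorical - upgradient',
                'Categorical - downgradient'],
    'result': [0.5, 0.25, 0.75],
})

keys = ['private_name', 'sampdate', 'parameter']
up = toy.loc[toy['loctype'] == 'Categorical - upgradient', keys + ['result']]
up = up.rename(columns={'result': 'up_result'})
down = toy[toy['loctype'] == 'Categorical - downgradient']

# attach the matching upgradient result (NaN when there is none) and subtract it
adjusted = down.merge(up, on=keys, how='left')
adjusted['result'] = adjusted['result'] - adjusted['up_result'].fillna(0)
print(adjusted[keys + ['result']])
```

If a site had more than one upgradient test for the same key, the merge would duplicate the downgradient row, so that case would need to be aggregated first.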


In [16]:
# reset index again
df_adjusted = df_adjusted.reset_index(drop=True)

In [26]:
print(t1['parameter'].unique())
print(df_adjusted['parameter'].unique())
print('2,4-D' in t1['parameter'].unique())

['Aldicarb' 'Atrazine' 'Diuron' ' Metolachlor' 'Oxamyl' 'Picloram'
 'Prometryn' 'Simazine' 'Chlordane' 'Chlorothalonil' 'Chlorpyrifos'
 '2,4-D' 'Dicamba' 'Endosulfan' 'Heptachlor' 'Lindane' 'Phorate'
 'Propachlor' 'Toxaphene' 'Trifluralin' 'Alachlor' 'Carbaryl' 'Carbofuran'
 'Dinoseb' 'Ethoprop' 'Fonofos' 'Chlorthal dimethyl' 'Cyanazine' 'DBCP'
 'EDB ' 'Metribuzin' 'Naled' 'Prometon' 'Propylene dichloride' 'Aldrin'
 'Chloramben' '1,3-D' 'DDD' 'Endosulfan sulfate' 'Pendimethalin' 'Silvex'
 'DDT' 'Endrin' 'Dieldrin']
['Boscalid' 'Myclobutanil' 'Chlorpyrifos' 'Oxadiazon' 'Metolachlor OA'
 'Metolachlor ESA' 'Imidacloprid' 'Indaziflam' 'Atrazine' 'Linuron'
 'S-Metolachlor' 'Carbaryl' 'Flumioxazin' 'Glyphosate' 'Ethofumesate'
 'Acetamiprid' 'Malathion' 'Metribuzin' 'Fluopyram' 'Sulfentrazone'
 'Terbacil' 'Diuron' 'Acetochlor ESA' 'Simazine' 'Tebuconazole'
 'Thiamethoxam' 'Mandipropamid' 'Clethodim' 'JSE76' 'Mefentrifluconazole'
 'Pyrimethanil' 'Iprodione' 'Propiconazole' 'Bentazon'
 'Chloran

In [40]:
'''
IMPORTANT CONCEPT

- SWL lab members are currently deriving a new theoretical groundwater ubiquity score (TGUS) to be compared to the commonly used groundwater ubiquity
  score (GUS) derived by Gustafson (1989)

- dataframe 't1' contains columns for the GUS, TGUS, and TGUS* (a modified form of TGUS) for 45 different pesticides, as well as some more accurate
  soil halflife and partitioning coefficient values that need to be updated in our data

- we will consider the effect of all ubiquity scores together and separately for predicting test outcomes

- many tgus and tgus* values are not documented, so those need to be calculated using defined functions

'''
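# strip stray whitespace from the t1 names (e.g. ' Metolachlor', 'EDB ') so the lookups below can match
t1['parameter'] = t1['parameter'].str.strip()
for idx, row in df_adjusted.iterrows():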
  parameter = row['parameter']

  if parameter in t1['parameter'].unique():
    # get needed values from t1
    pcoef = t1[t1['parameter'] == parameter]['koc']
    shl = t1[t1['parameter'] == parameter]['soil_halflife']
    gus = t1[t1['parameter'] == parameter]['gus']
    tgus = t1[t1['parameter'] == parameter]['tgus']
    tgus_star = t1[t1['parameter'] == parameter]['tgus*']

    # add values to data
    df_adjusted.loc[idx, 'pcoef'] = pcoef.iloc[0]
    df_adjusted.loc[idx, 'soil_halflife'] = shl.iloc[0]
    df_adjusted.loc[idx, 'gus'] = gus.iloc[0]
    df_adjusted.loc[idx, 'tgus'] = tgus.iloc[0]
    df_adjusted.loc[idx, 'tgus*'] = tgus_star.iloc[0]

  else:
    tgus_star = funcs.tgus(row['soil_halflife'], row['pcoef'], star = True)
    tgus = funcs.tgus(row['soil_halflife'], row['pcoef'])
    df_adjusted.loc[idx, 'tgus*'] = tgus_star
    df_adjusted.loc[idx, 'tgus'] = tgus






In [41]:
# setup final dataframe
# for now, working with all raw numbers and not pre-decided categories
onehot_cols = ['aquifer_vulnerability','drainage_class']
raw_cols = ['gus','tgus', 'tgus*','soil_halflife', 'pcoef']

# normalize raw values
norm = scaler.fit_transform(df_adjusted.loc[:, raw_cols])
norm = round(pd.DataFrame(norm, columns = raw_cols), 3)

# onehot categorical
df_onehot = funcs.onehot(df=df_adjusted, columns = onehot_cols)
df_final = pd.concat([df_onehot, norm], axis = 1)

# append offset and re-add detected column
df_final['offset'] = np.ones((df_adjusted.shape[0]))
df_final['detected'] = df_adjusted['detected']
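
The layout of the final frame can be illustrated on made-up data. This sketch makes two assumptions: `pd.get_dummies` stands in for `funcs.onehot`, whose implementation is not shown here, and the z-score is computed by hand with the population standard deviation, which is the definition `StandardScaler` uses. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# hypothetical rows with one categorical and one numeric column
toy = pd.DataFrame({
    'drainage_class': ['Well drained', 'Poorly drained', 'Well drained'],
    'gus': [1.0, 2.0, 3.0],
})

# z-score with the population standard deviation (ddof=0), rounded to 3 places
scaled = ((toy[['gus']] - toy[['gus']].mean()) / toy[['gus']].std(ddof=0)).round(3)

# one indicator column per category (stand-in for funcs.onehot)
onehot = pd.get_dummies(toy[['drainage_class']])

# indicator columns first, then the normalized values, aligned on the index
toy_final = pd.concat([onehot, scaled], axis=1)
print(toy_final)
```

The concat aligns on the index, which is why the notebook resets the index of `df_adjusted` before this step.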


In [47]:
'''
IMPORTANT CONCEPT

- to compare the performance of the different groundwater ubiquity scores, we will make separate dataframes, each containing just one of the score values

- we will also separate out a dataframe with just the soil halflives/partitioning coefficient and no ubiquity scores to see how well raw values perform

- 'df_final' will be used to evaluate the performance of all ubiquity scores and raw data combined

'''

df_gus = df_final.loc[:, ~df_final.columns.isin(['tgus', 'tgus*', 'soil_halflife', 'pcoef'])]
df_tgus = df_final.loc[:, ~df_final.columns.isin(['gus', 'tgus*', 'soil_halflife', 'pcoef'])]
df_tgus_st = df_final.loc[:, ~df_final.columns.isin(['tgus', 'gus', 'soil_halflife', 'pcoef'])]
df_raw = df_final.loc[:, ~df_final.columns.isin(['tgus', 'tgus*', 'gus'])]

In [48]:
#write df_final as csv for future use
df_final.to_csv(path_or_buf = '../data/df_all.csv', sep = ',')
df_gus.to_csv(path_or_buf = '../data/df_gus.csv', sep = ',')
df_tgus.to_csv(path_or_buf = '../data/df_tgus.csv', sep = ',')
df_tgus_st.to_csv(path_or_buf = '../data/df_tgus*.csv', sep = ',')
df_raw.to_csv(path_or_buf = '../data/df_raw.csv', sep = ',')