# Data Organization

This script organizes the data, performs feature engineering, and writes additional .csv/.xlsx files as needed.

NOTE: The original data file is not saved in this repository because it contains confidential location information. Each location is assigned a number, and the list that maps numbers to locations is kept internally. This number is not used as a feature in the algorithms.

In [3]:
#imports and get raw data file
import pandas as pd
import numpy as np

#note that 'private_name' is the associated secret number for the different locations
df = pd.read_csv('../data/raw_data.csv')
df.head()

Unnamed: 0,private_name,loctype,site_code,aquifer_vulnerability,ecoregion,loccode,tot_depth,context,msym,drainage_class,...,koc,kfoc,group2023,wasused,analyst,result,qualifier,detlimit,units,resconstrain
0,0,Categorical - downgradient,Golf-2,high,Allegheny,Well-irrig,,"amidst treated area, heavy pumping",Alton gravelly fine sandy loam,Well drained,...,,,field,no,sp17,759.0,,,us/cm,759.0
1,0,Categorical - downgradient,Golf-2,high,Allegheny,Well-irrig,,"amidst treated area, heavy pumping",Alton gravelly fine sandy loam,Well drained,...,,,field,no,sp17,7.54,,,ph units,7.54
2,0,Categorical - downgradient,Golf-2,high,Allegheny,Well-irrig,,"amidst treated area, heavy pumping",Alton gravelly fine sandy loam,Well drained,...,,,ions,no,Sanchez,78.0,,,mgcaco3/l,78.0
3,0,Categorical - downgradient,Golf-2,high,Allegheny,Well-irrig,,"amidst treated area, heavy pumping",Alton gravelly fine sandy loam,Well drained,...,,,ions,no,CNAL,0.05,,<0.04,mg/l,0.02
4,0,Categorical - downgradient,Golf-2,high,Allegheny,Well-irrig,,"amidst treated area, heavy pumping",Alton gravelly fine sandy loam,Well drained,...,,,ions,no,CNAL,0.28,,<0.01,mg/l,0.005


In [5]:
'''
IMPORTANT NOTES/ASSUMPTIONS:
- many of the tests are for other soil/water parameters (pH, electrical conductivity, etc.), so we want to extract just the pesticide tests

- to be thorough, the DEC tested each sample for numerous pesticides, many of which were not applied, resulting in lots of important
  but unusable data where there is no detectable amount

- many farmers/pesticide applicators told us which pesticides they used...the df includes a 'wasused' column that is used to
  extract the usable feature rows...however, many pesticides were detected in cases where we did not think they had been applied, so it
  is ASSUMED that the pesticide was applied somewhere in close proximity, perhaps upstream, or that there were errors in communication with the
  farmers/pesticide applicators

- FEATURE ENGINEERING: all NaN results are treated as zero, i.e. the pesticide was not detected

- 'result' also needs to be converted to numeric values...there is an error in the entry at index 12756, so that row is dropped...
  errors='ignore' keeps the drop from failing when the cell is re-run after the row is already gone

'''
df = df.drop(labels=12756, axis=0, errors='ignore')
#to_numeric returns a new Series, so the converted values must be assigned back to the column
df['result'] = pd.to_numeric(df['result']).fillna(0)

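#this contains all feature rows to be put into algorithms...we will keep df as the raw file and modify this one
#keep a row if the pesticide was reported as used, or if it has a partition coefficient (koc or kfoc) and a detectable result
#each comparison is parenthesized because & and | bind more tightly than != and >
hasCoef = df['koc'].notnull() | df['kfoc'].notnull()
dfFeats = df[(df['wasused'] != 'no') | (hasCoef & (df['result'] > 0))]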
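
The cleaning and filtering rules in this cell can be checked on a small made-up frame. The sketch below is only an illustration: the values are invented, the variable names `toy`, `numeric`, `bad_rows`, `has_coef`, and `keep` are hypothetical, and only the column names come from the notebook. It first uses `errors='coerce'` to locate entries that cannot be parsed as numbers (the way a row such as index 12756 might be found), drops them, and then applies the same keep rule with every comparison parenthesized.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, column names borrowed from the notebook.
toy = pd.DataFrame({
    'wasused': ['yes', 'no', 'no', 'no', 'no'],
    'koc':     [100.0, 250.0, np.nan, np.nan, 300.0],
    'kfoc':    [np.nan, np.nan, 80.0, np.nan, np.nan],
    'result':  ['0.5', '0.2', '1.1', 'bad', np.nan],
})

# Coerce unparseable entries to NaN, then flag rows that were not NaN
# before coercion: those are the malformed entries.
numeric = pd.to_numeric(toy['result'], errors='coerce')
bad_rows = toy.index[numeric.isna() & toy['result'].notna()].tolist()

# Drop the malformed rows, convert, and treat missing results as zero.
toy = toy.drop(index=bad_rows)
toy['result'] = pd.to_numeric(toy['result']).fillna(0)

# Keep rule: reported as used, OR (has a coefficient AND detectable result).
has_coef = toy['koc'].notna() | toy['kfoc'].notna()
keep = (toy['wasused'] != 'no') | (has_coef & (toy['result'] > 0))
kept_rows = toy.index[keep].tolist()

print(bad_rows, kept_rows)
```

Without the inner parentheses, an expression such as `a.notnull() & b > 0` is evaluated as `(a.notnull() & b) > 0`, which is not the intended test.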

In [6]:
'''
- the organic carbon-water partition coefficient ('koc' column) and the organic carbon-normalized Freundlich distribution
  coefficient ('kfoc' column) are treated as the same quantity

- the two columns are combined into 'pcoef', choosing koc first if it is available (greater than zero) and kfoc otherwise
'''
#assign returns a new DataFrame, which avoids the SettingWithCopyWarning raised when setting a column on a slice of df
dfFeats = dfFeats.assign(pcoef=dfFeats['koc'].where(dfFeats['koc'] > 0, dfFeats['kfoc']).astype(float))
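
The rule described in this cell (take `koc` when it is positive, otherwise fall back to `kfoc`) can be tried on a small made-up frame. This is a minimal sketch with invented values; `toy` is a hypothetical name, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: invented coefficient values.
toy = pd.DataFrame({
    'koc':  [120.0, np.nan, 0.0, np.nan],
    'kfoc': [95.0, 40.0, 60.0, np.nan],
})

# Series.where keeps koc where the condition holds and takes kfoc elsewhere.
# NaN > 0 is False, so a missing koc falls through to kfoc as well.
pcoef = toy['koc'].where(toy['koc'] > 0, toy['kfoc'])

# assign builds a new frame instead of writing into a possible slice.
toy = toy.assign(pcoef=pcoef)
print(toy)
```

Note that this is not quite the same as `toy['koc'].fillna(toy['kfoc'])`: a `koc` of exactly zero is also replaced by `kfoc` here, matching the `> 0` test in the cell above. Rows where both coefficients are missing stay NaN.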
