# Data Organization

This script organizes the data, performs feature engineering, and writes other .csv/.xlsx files as needed.

NOTE: The original data file is not saved in this repository because it contains confidential location information. Each location is assigned a number, and we keep track of this list internally; however, this number will not be used as a feature in the algorithms.
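
The numbering described in the note could be produced along the following lines. This is only a sketch under assumptions: the `site_name` column and the example names are hypothetical, and `pd.factorize` is just one convenient way to map each distinct name to an integer; it is not necessarily how the internal list was built.

```python
import pandas as pd

# Hypothetical raw table with confidential site names (illustrative values only)
raw = pd.DataFrame({
    'site_name': ['Farm A', 'Farm B', 'Farm A', 'Farm C'],
    'result': [0.0, 1.2, 0.0, 0.4],
})

# factorize assigns 0, 1, 2, ... in order of first appearance
codes, names = pd.factorize(raw['site_name'])

# Lookup table that would be kept internally, outside the repository
key = pd.DataFrame({'private_name': range(len(names)), 'site_name': names})

# Shareable table: the confidential name is replaced by its number
anon = raw.drop(columns='site_name').assign(private_name=codes)
```

Only `anon` would be committed; `key` is what allows a number to be traced back to a site.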

In [2]:
#imports and get raw data file
import pandas as pd
import numpy as np
import funcs

#note that 'private_name' is the associated secret number for the different locations
df = pd.read_csv('../data/raw_data2.csv')
df = df.drop(labels= 17900, axis = 0) #saw from errors later on that this needs to be deleted
df.head()

Unnamed: 0,private_name,loctype,site_code,aquifer_vulnerability,ecoregion,loccode,tot_depth,context,msym,drainage_class,...,result,qualifier,detlimit,units,resconstrain,simphalflife,simpresult,simpsorp2,simpsorp,morehalflives
0,0,Categorical - downgradient,Green-18,high,Coastal,Well-1,296.0,corner of greenhouse,Elmridge fine sandy loam,3-Moderately well drained,...,,nd,<0.2,ug/l,0.1,1-short,0-ND,2-mobile,2,F <15
1,0,Categorical - downgradient,Green-18,high,Coastal,Well-2,276.0,downgradient edge of property,Elmridge fine sandy loam,3-Moderately well drained,...,,nd,<0.2,ug/l,0.1,1-short,0-ND,2-mobile,2,F <15
2,0,Categorical - upgradient,Green-18,high,Coastal,Well-upgradient,296.0,"upgradient from property, wetland above",Elmridge fine sandy loam,3-Moderately well drained,...,,nd,<0.2,ug/l,0.1,1-short,0-ND,2-mobile,2,F <15
3,1,Categorical - downgradient,Nur-3,medium,Great Lakes,Well-2,213.0,"downgradient edge of property, edge of boggy w...",Lima loam,3-Moderately well drained,...,,nd,<0.2,ug/l,0.1,1-short,0-ND,2-mobile,2,F <15
4,1,Categorical - downgradient,Nur-3,medium,Great Lakes,Well-2,213.0,"downgradient edge of property, edge of boggy w...",Lima loam,3-Moderately well drained,...,,nd,<0.2,ug/l,0.1,1-short,0-ND,2-mobile,2,F <15


In [3]:
'''
IMPORTANT NOTES/ASSUMPTIONS: 
- many of the tests are for other soil/water parameters (pH, electrical conductivity, etc) so we want to extract just pesticide tests...

- to be thorough, the DEC tested for numerous pesticides on each sample, many of which were not applied, resulting in lots of important 
  but unusable data where there is no detectable amount

- many farmers/pesticide appliers told us which pesticides they used...the df includes a 'wasused' column that is
  used to extract the usable feature rows...however, many pesticides were detected in cases where we did not think they were applied, so it
  is ASSUMED that the pesticide was applied somewhere in close proximity, perhaps upstream, or that there were errors in communication with the
  farmers/pesticide appliers

- FEATURE ENGINEERING: all nan results are considered zero...the pesticide was not detected

'''
df['result'] = df['result'].fillna(0)
df['result'] = pd.to_numeric(df['result'])  #assign the converted column back; the bare call discarded it

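#this contains all feature rows to be put into algorithms
#keep a row if the pesticide was (possibly) used, or if it was detected and has a koc or kfoc value
#note: '&' binds tighter than '>', so the detection test is computed once, and np.logical_or only combines two inputs
detected = df['result'] > 0
dfFeats = df[(df['wasused'] != 'no') | (df['koc'].notnull() & detected) | (df['kfoc'].notnull() & detected)]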
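
A row filter like the one above is sensitive to operator precedence: in Python, `&` binds more tightly than `>`, so each comparison needs its own parentheses (or its own variable). The sketch below shows the intended three-way OR on made-up data; the values are illustrative only and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows: only the columns the filter looks at
toy = pd.DataFrame({
    'wasused': ['no', 'yes', 'no', 'no'],
    'koc': [100.0, np.nan, np.nan, np.nan],
    'kfoc': [np.nan, np.nan, 50.0, np.nan],
    'result': [0.3, np.nan, 0.0, 0.0],
})

# Missing results count as "not detected"
toy['result'] = pd.to_numeric(toy['result'].fillna(0))
detected = toy['result'] > 0

# Each condition is parenthesized before being combined with | and &
keep = (
    (toy['wasused'] != 'no')
    | (toy['koc'].notnull() & detected)
    | (toy['kfoc'].notnull() & detected)
)
kept = toy[keep]
```

Row 0 is kept because it has a `koc` value and a detection, row 1 because `wasused` is not `'no'`; rows 2 and 3 are dropped because nothing was detected.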

In [4]:
'''
- in theory, the organic carbon-water partition coefficient ('koc' column) and the organic carbon normalized Freundlich distribution 
  coefficient ('kfoc' column) are treated as the same quantity

- this loop combines the two columns into one, choosing koc first if it is available (greater than zero) and falling back to kfoc otherwise
'''
pcoef = []
for idx, row in dfFeats.iterrows():
    if row['koc'] > 0 :
        pcoef += [float(row['koc'])]
    else :
        pcoef += [float(row['kfoc'])]

dfFeats['pcoef'] = pcoef

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfFeats['pcoef'] = pcoef
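
The message above is pandas' SettingWithCopyWarning. It appears because `dfFeats` was created by filtering `df`, so pandas cannot tell whether the new column is being written to a copy or to the original. A minimal sketch of one way to build the combined column without triggering it, using made-up values (the toy frame is an assumption, and this is an alternative to the loop rather than the code that was run):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfFeats with only the two coefficient columns
toy = pd.DataFrame({
    'koc': [120.0, np.nan, 0.0, np.nan],
    'kfoc': [95.0, 40.0, 15.0, np.nan],
})

# Use koc where it is positive, otherwise fall back to kfoc
# (NaN > 0 is False, so missing koc values also fall back)
pcoef = toy['koc'].where(toy['koc'] > 0, toy['kfoc'])

# assign returns a new DataFrame, so nothing is written through a slice
toy = toy.assign(pcoef=pcoef.astype(float))
```

Taking an explicit copy right after filtering (`df[mask].copy()`) is another common way to make the ownership of the data unambiguous.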


In [7]:
#extract all current columns of interest to be put into algorithms...NOT FINAL
dfAlg = dfFeats.loc[:, ['private_name', 'loctype', 'aquifer_vulnerability', 'drainage_class', 'sampdate',
                    'parameter', 'soil_halflife', 'simphalflife', 'morehalflives', 'pcoef', 'simpsorp', 
                    'simpsorp2', 'result', 'simpresult']]

#reset index
dfAlg = dfAlg.reset_index()

In [9]:
#change nominal data using feat_eng_nom function in funcs.py
dfAlg, vulnerability_categories = funcs.feat_eng_nom(dfAlg, 'aquifer_vulnerability')
#dfAlg = funcs.feat_eng_nominal(dfAlg, 'aquifer_vulnerability')
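
`funcs.py` is not shown here, so the following is only a guess at what a helper with this call shape might do: replace a nominal column with integer codes and return the categories so the codes can be mapped back. The name `encode_nominal` and its behavior are assumptions for illustration and are not the actual `feat_eng_nom`.

```python
import pandas as pd

def encode_nominal(frame, column):
    """Hypothetical stand-in: integer-encode one nominal column."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    # sort=True numbers the labels in sorted order; missing values become -1
    codes, categories = pd.factorize(out[column], sort=True)
    out[column] = codes
    return out, list(categories)

# Illustrative values only
toy = pd.DataFrame({'aquifer_vulnerability': ['high', 'medium', 'high', 'low']})
encoded, vulnerability_categories = encode_nominal(toy, 'aquifer_vulnerability')
```

Note that plain integer codes impose an arbitrary order (here alphabetical), which may or may not suit an ordered scale such as low/medium/high.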