# Datasets Site info - from EDI to jrn_metabase

Code for populating the DataSetSites table in jrn_metabase is below. The EDI Solr search
API is a little wonky: when site data (geographic descriptions and coordinates) are
requested, the response doesn't contain everything in the EML.
Requesting the geographic description returns only the geographicDescription element from the _first_ geographicCoverage found, even if there are multiple geographicCoverage elements. Requesting coordinates returns all boundingCoordinates elements found within the coverage element. So it makes sense to just count the spatial elements returned per datasetID.
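
The counting idea can be sketched as follows. This is a minimal illustration, not the pyEDIutils implementation: the `records` list, its dataset IDs, and its field layout are hypothetical stand-ins for whatever the search returns, with one tuple per spatial element.

```python
from collections import Counter

# Hypothetical search results: one (dataset ID, element name) pair per
# spatial element returned. Real responses will be structured differently.
records = [
    ("dataset_a", "geographicDescription"),
    ("dataset_a", "boundingCoordinates"),
    ("dataset_a", "boundingCoordinates"),
    ("dataset_b", "geographicDescription"),
    ("dataset_b", "boundingCoordinates"),
]

def count_elements(records, element):
    """Count how many times `element` was returned for each dataset ID."""
    return Counter(ds_id for ds_id, name in records if name == element)

coord_counts = count_elements(records, "boundingCoordinates")
desc_counts = count_elements(records, "geographicDescription")

# A dataset with more coordinate sets than descriptions has
# geographicCoverage elements that the description query did not return.
incomplete = [ds for ds in coord_counts if coord_counts[ds] > desc_counts[ds]]
print(incomplete)
```

Comparing the two counts per dataset flags the packages where the description query is known to be missing coverage elements.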

In [48]:
import sys
sys.path.append('/home/greg/GitHub/')
import pyEDIutils.search as edi
import pandas as pd
import numpy as np

# Establish database connection
sys.path.append('../')
import py2pg.connect as connect
conn = connect.connect('../jrn_metabase_dev.conn.json')

## Load DataSetAttributeEnumeration

In [49]:
# Do a query of the database for the table
sql = 'select * from lter_metabase."DataSetAttributeEnumeration";'
dat = pd.read_sql_query(sql, conn)

In [50]:
dat.head()

Unnamed: 0,DataSetID,EntitySortOrder,ColumnName,Code,Definition,CodeID
0,210001001,1,zone,G,Grassland vegetation type,vegtype_Grassland
1,210308004,1,qwt,SC,Other - SEE COMMENTS,BSNE_qwt_SC
2,210278001,1,water_07_08_trt,1.8,180% ambient precipitation (achieved with supp...,gcPPT_trt_1.8
3,210278002,1,PPT_trt,0.2,20% ambient precipitation (achieved with raino...,gcPPT_trt_0.2
4,210278002,1,PPT_trt,1.8,180% ambient precipitation (achieved with supp...,gcPPT_trt_1.8


In [51]:
dat2 = dat.loc[:,['Code', 'Definition', 'CodeID']]

In [52]:
# Get and count unique rows
lc = dat2.drop_duplicates()
print(len(lc))
lc.head()

502


Unnamed: 0,Code,Definition,CodeID
0,G,Grassland vegetation type,vegtype_Grassland
1,SC,Other - SEE COMMENTS,BSNE_qwt_SC
2,1.8,180% ambient precipitation (achieved with supp...,gcPPT_trt_1.8
3,0.2,20% ambient precipitation (achieved with raino...,gcPPT_trt_0.2
5,0.5,50% ambient precipitation (achieved with raino...,gcPPT_trt_0.5


In [53]:
# Number of unique CodeIDs - equals the row count above, so none are repeated
len(lc.CodeID.unique())

502

In [54]:
# Code + Definition is a unique constraint - look for dups
dups=lc.duplicated(subset=['Code', 'Definition'])
lc[dups]

Unnamed: 0,Code,Definition,CodeID


## Prepare table for metabase

In [56]:
# Format the dataframe to look like the ListCodes table in jrn_metabase.
# rename() returns a new dataframe here, which avoids the
# SettingWithCopyWarning raised by renaming in place.
df_in = lc.rename(columns={'Definition': 'CodeExplanation'})
df_in = df_in.loc[:, ['CodeID', 'Code', 'CodeExplanation']]
df_in.head()


Unnamed: 0,CodeID,Code,CodeExplanation
0,vegtype_Grassland,G,Grassland vegetation type
1,BSNE_qwt_SC,SC,Other - SEE COMMENTS
2,gcPPT_trt_1.8,1.8,180% ambient precipitation (achieved with supp...
3,gcPPT_trt_0.2,0.2,20% ambient precipitation (achieved with raino...
5,gcPPT_trt_0.5,0.5,50% ambient precipitation (achieved with raino...


In [57]:
# Some of these are in the database already - remove from incoming table
print(len(df_in))
test = ((df_in.CodeID=='dot4') | (df_in.CodeID=='NA6'))
print(test.sum())
df_in2 = df_in[~test]
len(df_in2)

502
2


500
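
Hard-coding the CodeIDs to skip works for two rows, but the same filter can be derived from the target table itself. A minimal sketch, assuming the CodeIDs already in the table have been read into a dataframe (for example with `pd.read_sql_query`); the `existing` and `incoming` frames and the `Code` values for the first two rows are made-up stand-ins:

```python
import pandas as pd

# Stand-in for the CodeIDs already present in the target table
existing = pd.DataFrame({"CodeID": ["dot4", "NA6"]})

# Stand-in for the incoming rows (df_in in the notebook); Code values
# for the first two rows are made up for illustration
incoming = pd.DataFrame({
    "CodeID": ["dot4", "NA6", "vegtype_Grassland"],
    "Code": ["a", "b", "G"],
})

# Keep only the incoming rows whose CodeID is not already in the table
already_loaded = incoming["CodeID"].isin(existing["CodeID"])
to_insert = incoming[~already_loaded]

print(already_loaded.sum(), len(to_insert))
```

This keeps the notebook rerunnable: rows loaded by an earlier run are filtered out without editing the list of CodeIDs by hand.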

## Now insert the table

In [58]:
import py2pg.populate as pop

In [59]:
# Copy the dataframe (df_in2) into the ListCodes table
pop.copy_from_file(conn, df_in2, 'lter_metabase."ListCodes"', sep=',')
# Close the database connection
conn.close()

copy_from_file() done
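
How `py2pg.populate.copy_from_file` works internally is not shown here. As a rough sketch of one way a bulk load like this can be done, the code below serializes rows to an in-memory CSV file and streams it to PostgreSQL with `COPY ... FROM STDIN`. The function names `rows_to_csv_buffer` and `copy_rows` are hypothetical, and `conn` is assumed to be a psycopg2 connection.

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples to an in-memory CSV file."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)  # rewind so the reader starts at the first row
    return buf

def copy_rows(conn, rows, table, columns):
    """Bulk-load rows with COPY ... FROM STDIN (assumes a psycopg2 connection)."""
    col_list = ", ".join('"{}"'.format(c) for c in columns)
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(table, col_list)
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_csv_buffer(rows))
    conn.commit()

# The csv module quotes fields that contain the separator, so a comma
# inside CodeExplanation does not split the row.
buf = rows_to_csv_buffer([("BSNE_qwt_SC", "SC", "Other, see comments")])
print(buf.read())
```

With the notebook's dataframe, the call would look like `copy_rows(conn, df_in2.itertuples(index=False, name=None), 'lter_metabase."ListCodes"', ['CodeID', 'Code', 'CodeExplanation'])`. Quoting matters here because the separator is a comma and free-text definitions may contain one.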
