# Unique Value Code

Using the ES-generated iDigBio dataframe, build a text list of all the unique values in each field.


In [1]:
import re
idb_df_version = "20161119"  # Hardcoded version of the idb parquet to use

In [2]:
df = sqlContext.read.load("/guoda/data/idigbio-{0}.parquet".format(idb_df_version))

## Small subset

Start by making a small selection to work with.

In [3]:
small_df = (df
            .where(df["stateprovince"] == "vermont")
            .where(df["genus"] == "acer")
            )
print(small_df.count())
small_df.printSchema()

447
root
 |-- barcodevalue: string (nullable = true)
 |-- basisofrecord: string (nullable = true)
 |-- bed: string (nullable = true)
 |-- canonicalname: string (nullable = true)
 |-- catalognumber: string (nullable = true)
 |-- class: string (nullable = true)
 |-- collectioncode: string (nullable = true)
 |-- collectionid: string (nullable = true)
 |-- collectionname: string (nullable = true)
 |-- collector: string (nullable = true)
 |-- commonname: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- coordinateuncertainty: float (nullable = true)
 |-- country: string (nullable = true)
 |-- countrycode: string (nullable = true)
 |-- county: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- coreid: string (nullable = true)
 |    |-- dc:rights: string (nullable = true)
 |    |-- dcterms:accessRights: string (nullable = true)
 |    |-- dcterms:bibliographicCitation: string (nullable = true)
 |    |-- dcterms:language: string (nullable = true)
 |    |

## Test set of fields to do this for
Set up the fields to iterate over, generating one file per field. These are hard-coded for now: just a few to make sure the summarization process works.

In [4]:
fields = ["stateprovince", "specificepithet"]
#, "data.dwc:specificepithet"]

In [5]:
p = re.compile(r'[\W_]+')  # Raw string so \W is not treated as a string escape
for field in fields:
    slug = p.sub("_", field)
    output_fn = "idigbio-{0}-unique-{1}".format(idb_df_version, slug)
    (small_df
     .groupBy(small_df[field])  # Reference the column on the dataframe being grouped
     .count()
     .write
     .format("com.databricks.spark.csv")
     .mode("overwrite")
     .option("header", "false")
     .save("/outputs/{0}.csv".format(output_fn))
    )
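
To make the naming step concrete, here is a minimal standalone sketch of how the slug pattern turns a field name into an output name. The helper `output_name` is hypothetical and not part of the notebook; the nested field name is the one commented out in the field list above.

```python
import re

# Same pattern as in the loop: any run of non-word characters or underscores.
SLUG_PATTERN = re.compile(r"[\W_]+")


def output_name(version, field):
    """Hypothetical helper: build the per-field output name used by the loop."""
    slug = SLUG_PATTERN.sub("_", field)
    return "idigbio-{0}-unique-{1}".format(version, slug)


print(output_name("20161119", "stateprovince"))
# idigbio-20161119-unique-stateprovince
print(output_name("20161119", "data.dwc:specificepithet"))
# idigbio-20161119-unique-data_dwc_specificepithet
```

The dot and colon in nested field names are each replaced by an underscore, so the result is safe to use as a path component.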

The nested fields turn out to be very expensive. They take much longer than the top-level ones, and Spark was killed for GC/memory pressure in this notebook even for data.dwc:genus, which should be smallish (50k).

Changed small_df to acer in vermont instead of all of vermont, and the test summary to specificepithet.

It doesn't seem to matter: the job eats all memory even when the small dataset is only 447 records.

## Building field list to iterate over

Now build up the full list of fields, then incorporate this into the real job.

In [20]:
field_set = set()
for s in small_df.schema:
    #print(s.dataType)
    #if str(s.dataType) in ["StringType", "FloatType", "TimestampType", "IntegerType", "BooleanType", "DoubleType"]:
    if not str(s.dataType).startswith("StructType"):
        field_set.add(s.name)
    else:
        for sub in s.dataType:
            field_set.add(".".join([s.name, sub.name]))
print(field_set)
print(len(field_set))

{'data.dwc:georeferenceVerificationStatus', 'data.dwc:earliestEraOrLowestErathem', 'data.dcterms:references', 'data.dwc:informationWithheld', 'data.dwc:verbatimElevation', 'data.dwc:geologicalContextID', 'data.dwc:phylum', 'data.dwc:coordinatePrecision', 'data.dwc:samplingEffort', 'genus', 'data.dwc:identificationRemarks', 'data.dwc:minimumDepthInMeters', 'data.dcterms:language', 'data.dwc:locationRemarks', 'data.dwc:originalNameUsage', 'datemodified', 'taxonid', 'data.dwc:verbatimTaxonRank', 'data.dwc:member', 'data.dwc:eventTime', 'data.dwc:typeStatus', 'data.dwc:accessRights', 'data.dwc:establishmentMeans', 'data.dwc:previousIdentifications', 'typestatus', 'data.dwc:collectionID', 'data.dwc:eventDate', 'data.dwc:georeferenceRemarks', 'data.dcterms:rights', 'data.dwc:behavior', 'data.dwc:bed', 'data.dwc:latestEonOrHighestEonothem', 'formation', 'data.dwc:class', 'data.dwc:earliestPeriodOrLowestSystem', 'catalognumber', 'data.dwc:associatedReferences', 'data.fcc:datePicked', 'data.dwc
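
The same one-level flattening can be sketched without a Spark session by checking the type structurally rather than by its string name. This sketch assumes the schema has been exported as a plain dict, for example with `small_df.schema.jsonValue()`; the function name `flatten_field_names` and the sample dict are illustrative and not from the notebook.

```python
def flatten_field_names(schema_json):
    """Return top-level names plus 'parent.child' names for struct columns.

    Only one level of nesting is expanded, matching the loop above.
    """
    names = set()
    for field in schema_json["fields"]:
        field_type = field["type"]
        # Simple types are plain strings ("string", "float"); structs are dicts.
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            for sub in field_type["fields"]:
                names.add(".".join([field["name"], sub["name"]]))
        else:
            names.add(field["name"])
    return names


# Hand-written stand-in for a tiny slice of the schema printed earlier.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "genus", "type": "string", "nullable": True, "metadata": {}},
        {"name": "coordinateuncertainty", "type": "float", "nullable": True, "metadata": {}},
        {
            "name": "data",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "coreid", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "dcterms:language", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(sorted(flatten_field_names(example_schema)))
# ['coordinateuncertainty', 'data.coreid', 'data.dcterms:language', 'genus']
```

Testing for a struct by shape avoids relying on how `str(s.dataType)` is rendered, which is an implementation detail of the Spark version in use.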