# GeoDeepDive Application Demo

## Wrangling text data from GeoDeepDive

The Python script `find_target.py` is a simplified component of the data-mining application used in [Peters, Husson and Wilcots (in press, Geology)](https://github.com/UW-Macrostrat/stromatolites_demo), which draws on the [GeoDeepDive](https://geodeepdive.org) digital library to constrain the spatio-temporal distribution of stromatolite fossils across Earth history. The script searches for user-defined words or phrases in five provided USGS Technical Reports that have been processed and annotated by GeoDeepDive. The output is a list of `document-sentence` tuples that uniquely describe the location of each specified word or phrase within the GeoDeepDive library, along with any adjectives used to describe it. It serves as a demonstration of what data derived from GeoDeepDive looks like, and how it can be manipulated in simple ways to gain knowledge about the published literature.

## Input description

The input to `find_target.py` consists of two text files: `input/sentences_nlp352` and `var/target_variables.py`. The `sentences_nlp352` file comes from the GeoDeepDive library and is a TSV file containing five technical reports from the United States Geological Survey that have been parsed using [Stanford Natural Language Processing](http://nlp.stanford.edu/) (version 3.5.2). More detailed information about the sentences table data structure can be found [here](https://github.com/jonhusson/gdd_demo/tree/master/input). A reference list describing the five included reports is available [here](https://github.com/jonhusson/gdd_demo/blob/master/input/references.pdf); it is also in the downloaded input folder as `references.pdf`.

The `var/target_variables.py` file can (and should!) be altered by the user, and principally consists of a Python list of strings called `target_names`. Each item in the list is searched for within the set of five documents, using Python's regular expressions module (`re`). The file also includes a list of "bad" words, `bad_words`, which should not be considered matches. For example, the default values provided are:


In [1]:
target_names = ['stromatol', r'\b' + 'Gamuza Formation' + r'\b']
bad_words = ['non-stromatolitic','nonstromatolitic','non-stromatolite']

meaning that words containing the string fragment `stromatol` will be returned (e.g., stromatolite, stromatolitic), as well as the phrase `Gamuza Formation`, provided the latter starts and ends at a word boundary (e.g., `TheNotGamuza Formation` will not be returned). Words such as non-stromatolitic, which would otherwise match the `stromatol` pattern in `target_names`, are vetoed by their inclusion in the `bad_words` list. These lists can be altered to anything you like!
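The matching and veto rule described above can be sketched roughly as follows. This is only an illustration, not the code in `find_target.py`: the helper `match_targets` and the sample sentence are hypothetical, matching is left case-sensitive (the real script may differ), and each match is simply widened to the surrounding whitespace before the bad-word check.

```python
import re

target_names = ['stromatol', r'\b' + 'Gamuza Formation' + r'\b']
bad_words = ['non-stromatolitic', 'nonstromatolitic', 'non-stromatolite']


def match_targets(sentence, target_names, bad_words):
    """Return (pattern, matched_text) pairs for one sentence string.

    Hypothetical helper: each regex match is widened to the whole
    whitespace-delimited text it falls in, then dropped if that text
    contains any entry of bad_words.
    """
    found = []
    for target in target_names:
        for m in re.finditer(target, sentence):
            # Widen the match to the nearest spaces on either side.
            start = sentence.rfind(' ', 0, m.start()) + 1
            end = sentence.find(' ', m.end())
            if end == -1:
                end = len(sentence)
            text = sentence[start:end].strip('.,;:()')
            # Veto matches that are really "bad" words.
            if any(bad in text for bad in bad_words):
                continue
            found.append((target, text))
    return found


# Invented sentence, not taken from the five reports.
example = ("Domal stromatolites overlie non-stromatolitic dolomite "
           "of the Gamuza Formation.")
print(match_targets(example, target_names, bad_words))
```

In this sketch, `stromatolites` and `Gamuza Formation` are kept, while `non-stromatolitic` is discarded by the veto step.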

## Running the script

The `find_target` function can simply be imported into this notebook (or a similar script) and run from there.

In [2]:
from find_target import find_target
hits = find_target(target_names, bad_words)

Alternatively, it can be run from your terminal, where it defaults to the `target_names` and `bad_words` specified in `var/target_variables.py`.


## Output description

The result of running `find_target.py` is written to `output/output.tsv` as a tab-delimited text file. Each row records one discovery of one of the strings specified in `var/target_variables.py`. The columns are described below:

Column | Description 
-------|--------
docid| identifier for the relevant document in the GeoDeepDive database, with metadata for it available through the GeoDeepDive API (e.g., [558dcf01e13823109f3edf8e](https://geodeepdive.org/api/articles?id=558dcf01e13823109f3edf8e))
sentid| identifier for the sentence within the specified document from which the `target` was extracted
target| discovered word or phrase (e.g., stromatolite, stromatolites, stromatolitic)
start\_idx| Pythonic index for the start of the discovered `target` (e.g., `0` would mean the first word in that sentence)
end\_idx| Pythonic index for the end of the discovered `target`
adjective| words determined by [NLP](http://nlp.stanford.edu/) to be adjectives describing `target` (e.g., `Riphean, domal stromatolites`)
sentence| full sentence in which `target` was discovered
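
To illustrate how `start_idx` and `end_idx` locate a target, here is a small sketch. It assumes that a sentence is held as a list of words and that `end_idx` is exclusive, as in ordinary Python slicing; that convention is an assumption, and the `words` list and the helper `extract_target` are invented for the example.

```python
# Hypothetical sentence split into words (not taken from the dataset).
words = ['Domal', 'stromatolites', 'occur', 'in', 'the', 'Gamuza', 'Formation']


def extract_target(words, start_idx, end_idx):
    """Join the words in the half-open range [start_idx, end_idx).

    Assumes end_idx is exclusive, like a Python slice.
    """
    return ' '.join(words[start_idx:end_idx])


print(extract_target(words, 1, 2))  # single-word target
print(extract_target(words, 5, 7))  # two-word phrase
```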






In [3]:
print(len(hits))
hits[:5]

458


[{'adjectives': [],
  'docid': '558dcf01e13823109f3edf8e',
  'end_idx': 8,
  'sentence': '__________________________ 10 Pitiquito Quartzite ____________________________ 11 Gamuza Formation - ___________________________ 12 Papalote Formation ___________________________ 14 Tecolote Quartzite ____________________________ 15 La Cienega Formation _________________________ 15 Puerto Blanco Formation - ______________________ 15 Proveedora Quartzite __________________________ 16 Page Physical stratigraphy of upper Proterozoic and Cambrian rocks Continued ______________________ 17 Buelna Formation ____________________________ 17 Cerro Prieto Formation ________________________ 17 Arrojos Formation ____________________________ 17 Tren Formation ______________________________ 17 Biostratigraphy _ _________________________________ 17 Paleocurrent studies ______________________________ 22 Regional correlations _____________________________ 25 Southern Great Basin _________________________ 25 San Ber

## Exercises
Using the provided Python script and some coding (in any language) of your own:

1. What adjectives are used to describe stromatolites?

2. Create a list of `document-sentence` tuples for sentences in this test set that contain BOTH `sandstone` and `limestone`, two commonly studied rock types.
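
As one possible starting point for the first exercise, the sketch below tallies adjectives across a list of hit records. It assumes each record is a dictionary with an `adjectives` key holding a list of strings, mirroring the shape of the `hits` entries printed earlier; the `sample_hits` data and the helper `count_adjectives` are invented for illustration.

```python
from collections import Counter

# Hypothetical records shaped like the entries of `hits`.
sample_hits = [
    {'docid': 'doc1', 'adjectives': ['domal', 'Riphean']},
    {'docid': 'doc1', 'adjectives': []},
    {'docid': 'doc2', 'adjectives': ['domal']},
]


def count_adjectives(hits):
    """Count how often each adjective occurs, ignoring case."""
    counts = Counter()
    for hit in hits:
        counts.update(adj.lower() for adj in hit['adjectives'])
    return counts


adjective_counts = count_adjectives(sample_hits)
print(adjective_counts.most_common())
```

Lower-casing before counting is a design choice that merges variants such as `Domal` and `domal`; drop it if capitalization matters for your question.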

## Additional Information

I recently added a simple script that seeks to determine the start of the "References" list in a given GDD document. This information may be helpful, because one may be interested in discarding phrase matches that happen within the reference list or bibliography, focusing only on the main body of the document. To run this extractor, simply type:

```
python find_refs.py
```

The output is written to `output/ref_start.tsv`, and consists of `docid-sentid` tuples. For example, for docid `55adf5cde13823763a830891`, the associated sentid is `2783`. This means that, for that particular document, sentences with sentids less than 2783 belong to the main body of the text, and sentences with sentids greater than or equal to 2783 are determined to be part of the reference list.
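
A minimal sketch of how this information might be used to discard matches inside the reference list is given below. It assumes that `ref_start.tsv` holds two tab-separated columns (docid, sentid) with no header row; the helpers `load_ref_starts` and `in_main_body` and the `candidate_hits` pairs are hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io


def load_ref_starts(handle):
    """Read tab-separated docid/sentid rows into {docid: reference_start}."""
    return {row[0]: int(row[1])
            for row in csv.reader(handle, delimiter='\t') if row}


def in_main_body(docid, sentid, ref_starts):
    """True if the sentence comes before the detected reference list.

    Documents with no detected reference list are treated as all main body.
    """
    start = ref_starts.get(docid)
    return start is None or int(sentid) < start


# Stand-in for open('output/ref_start.tsv'); the row is the example above.
ref_file = io.StringIO(u'55adf5cde13823763a830891\t2783\n')
ref_starts = load_ref_starts(ref_file)

# Hypothetical (docid, sentid) pairs to filter.
candidate_hits = [('55adf5cde13823763a830891', 120),
                  ('55adf5cde13823763a830891', 2783)]
body_hits = [h for h in candidate_hits if in_main_body(h[0], h[1], ref_starts)]
print(body_hits)
```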

## find_documents.py
In addition to the `find_target` script, the repository also contains a `find_documents` script. It takes the same input (`var/target_variables.py`) and outputs a list of the GeoDeepDive document IDs that include a matching term, along with a count of how many times each target variable occurs in each document (`output/docs_terms.txt`) and the text content of each match (`output/matching_documents.txt`). Additional logic can be applied in this script: if a list of lists is supplied in `var/target_variables.py`, at least one term from each list is required for the document to be considered a match. For example, `target_names = [['a'], ['b', 'c']]` will only consider a document to be a match if it includes 'a' and either 'b' or 'c'.
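
The list-of-lists rule amounts to an AND across the inner lists and an OR within each one. The sketch below shows that logic in isolation; the helper `document_matches` is hypothetical and is not the implementation in `find_documents.py`, and it assumes the whole document text is available as a single string.

```python
import re


def document_matches(text, target_groups):
    """True if at least one term from every group is found in the text."""
    return all(
        any(re.search(term, text) for term in group)
        for group in target_groups
    )


target_groups = [['a'], ['b', 'c']]
print(document_matches('a c', target_groups))  # has 'a' and 'c'
print(document_matches('b c', target_groups))  # lacks 'a'
```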

In [8]:
from find_documents import find_documents
find_documents()
with open("output/docs_terms.txt") as fin:
    for line in fin:
        print(line)

55b6cd71e13823bd29ba7d93	stromatol	32

55b6cd71e13823bd29ba7d93	Stromatol	1

558dcf01e13823109f3edf8e	Gamuza Formation	28

558dcf01e13823109f3edf8e	GAMUZA FORMATION	3

558dcf01e13823109f3edf8e	stromatol	53

558dcf01e13823109f3edf8e	Stromatol	2

55adf8dee13823763a8308a7	stromatol	257

55a68e92e13823757cc6fa6d	STROMATOL	2

55a68e92e13823757cc6fa6d	stromatol	32

55a68e92e13823757cc6fa6d	Stromatol	7

55adf5cde13823763a830891	stromatol	40

55adf5cde13823763a830891	Stromatol	1

