<a href="https://colab.research.google.com/github/poldham/githubpins/blob/master/spacy_tuples_pipe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Google Drive

In [1]:
# best to load your Google Drive and save files created to /content
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import Packages and Read in the Data

In [None]:
# uncomment if loading a larger model and update model reference in the next cell
# TBD

There are three datasets for the Bahamas:
1. bahamas_100.csv with 100 records, good for testing approaches because it runs quickly: [https://publicpins.blob.core.windows.net/bahamas/bahamas_100.csv](https://publicpins.blob.core.windows.net/bahamas/bahamas_100.csv)
2. bahamas_small.csv with 1000 records, for slightly bigger experiments: [https://publicpins.blob.core.windows.net/bahamas/bahamas_small.csv](https://publicpins.blob.core.windows.net/bahamas/bahamas_small.csv)
3. bahamas_texts.csv with 9600 texts, the current working data: [https://publicpins.blob.core.windows.net/bahamas/bahamas_texts.csv](https://publicpins.blob.core.windows.net/bahamas/bahamas_texts.csv)


In [11]:
import pandas as pd
import spacy

# load the spaCy model (for larger models, use !pip install x in a separate cell before loading)
nlp = spacy.load("en_core_web_sm")

df = pd.read_csv("https://publicpins.blob.core.windows.net/bahamas/bahamas_100.csv", usecols=['text', 'id'])

In [12]:
# take a look
df.head()

Unnamed: 0,text,id
0,Gilmer Limestone: Oolite Tidal Bars on Sabine ...,000-020-923-889-687
1,Inshore and offshore bottlenose dolphin (Tursi...,000-032-304-114-733
2,Choice of Exchange Rate Regimes. The 20 econom...,000-034-137-084-029
3,The behavior and sensory biology of elasmobran...,000-043-479-729-303
4,Fluorescent epibiotic microbial community on t...,000-063-519-499-978


In [13]:
# convert the df rows to (text, id) namedtuples using pandas .itertuples
# "meta" is the namedtuple name; the id is the context that nlp.pipe passes through

tup = list(df.itertuples(index=False, name='meta'))

print(tup[0])

meta(text='Gilmer Limestone: Oolite Tidal Bars on Sabine Uplift: ABSTRACT. Studies of cores, cuttings, and sample logs have shown that the Cotton Valley Limestone, also known as the "Gilmer Limestone," consists of a linear belt of oolitic and pelletal grainstones and packstones along the margin of the Sabine uplift. This belt extends from Gilmer field in Upshur County southward at least as far as Overton field in Smith County. The oolite grainstone sequences with leached, "chalky" porosity are restricted to the north-south trend and are replaced by lime-muddy, nonporous rocks to the east and west. The high percentage of ooids, the abundant festoon and tabular cross-beds in the grainstone belt, and the linear-shoal anatomy of the unit suggest that the Gilmer Limestone is an ancient analog of the tidal bars in the modern Bahamas. The Gilmer grainstones formed as a series of submarine bars which accumulated in the shallow, agitated water along the flanks of a peninsular shoal (the Sabine 
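`nlp.pipe(..., as_tuples=True)` expects each item to be a `(text, context)` pair, so the conversion above relies on `text` being the first column and `id` the second. `usecols` selects columns but keeps the column order of the file, so it may be safer to order the columns explicitly. A minimal sketch, using a small made-up frame in place of the Bahamas data (the sample rows and ids are illustrative assumptions):

```python
import pandas as pd

# Illustrative stand-in for the Bahamas data, with "id" deliberately first.
sample = pd.DataFrame(
    {
        "id": ["id-001", "id-002"],
        "text": [
            "Tidal bars in the modern Bahamas.",
            "Bottlenose dolphins offshore.",
        ],
    }
)

# Select the columns in (text, context) order before building the tuples.
# name=None yields plain tuples; name="meta" (as in the notebook) yields
# namedtuples, which unpack the same way.
pairs = list(sample[["text", "id"]].itertuples(index=False, name=None))

text, context = pairs[0]
print(text, "->", context)
```

Selecting `sample[["text", "id"]]` makes the pairing independent of how the CSV happens to order its columns.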

# Create an Entities Dataset
reference: [named entities](https://spacy.io/usage/linguistic-features#named-entities)

In [15]:
# Create an empty list to hold one record per entity
# run list(nlp.pipe(...)) with as_tuples=True, then iterate over the (doc, context) pairs and retrieve the entities
# set batch_size and n_process for parallel processing (outside of Colab)
# build a data frame from the collected records with the pandas .from_records method

entities = []

for doc, context in list(nlp.pipe(tup, batch_size = 10, n_process = 7, as_tuples=True)):
  for ent in doc.ents:
    entities.append(
        {
        "id":context, 
        "entity_text":ent.text, 
        "entity_label":ent.label_, 
        "entity_start":ent.start_char, 
        "entity_end":ent.end_char
        }
    )

entities_df = pd.DataFrame.from_records(entities)  
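Wrapping `nlp.pipe()` in `list()` builds every `Doc` in memory before the loop starts. Iterating over the generator directly processes the texts as a stream, which may matter for the 9600-text file. The sketch below shows that pattern. To keep it runnable without a model download, it assumes a blank English pipeline with an `entity_ruler` and a single made-up pattern in place of `en_core_web_sm`; the sample texts and ids are also illustrative.

```python
import spacy

# Assumption: a blank pipeline plus a rule-based entity_ruler stands in for
# en_core_web_sm so that this sketch runs without downloading a model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Bahamas"}])

# Illustrative (text, context) pairs in the same shape as `tup`.
records = [
    ("Tidal bars in the modern Bahamas.", "id-001"),
    ("No place names in this sentence.", "id-002"),
]

entities = []

# No list() around nlp.pipe: each Doc is handled as soon as it is produced.
# n_process can be added here when running outside of Colab.
for doc, context in nlp.pipe(records, batch_size=10, as_tuples=True):
    for ent in doc.ents:
        entities.append(
            {
                "id": context,
                "entity_text": ent.text,
                "entity_label": ent.label_,
                "entity_start": ent.start_char,
                "entity_end": ent.end_char,
            }
        )

print(entities)
```

The records collected this way have the same keys as in the cell above, so `pd.DataFrame.from_records(entities)` applies unchanged.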

# Take a look at the entities
The model picks up entities, but some of them are labeled incorrectly; in the output below, for example, "Sabine" (from "Sabine uplift") is tagged as PERSON.
See the annotation scheme for how to interpret the labels: [https://spacy.io/api/annotation#named-entities](https://spacy.io/api/annotation#named-entities)
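The label meanings can also be looked up from inside the notebook with `spacy.explain`, which returns spaCy's glossary description for a label. A short sketch using the three labels that appear in the output below:

```python
import spacy

# Labels taken from the entities output in this notebook.
labels = ["LOC", "PERSON", "GPE"]

# spacy.explain returns a short description string, or None for unknown terms.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```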

In [17]:

entities_df.head()

Unnamed: 0,id,entity_text,entity_label,entity_start,entity_end
0,000-020-923-889-687,the Cotton Valley Limestone,LOC,124,151
1,000-020-923-889-687,Sabine,PERSON,292,298
2,000-020-923-889-687,Gilmer,GPE,330,336
3,000-020-923-889-687,Upshur County,GPE,346,359
4,000-020-923-889-687,Overton,GPE,389,396


In [None]:
# Write to csv
entities_df.to_csv("entities_df.csv")
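By default `to_csv` also writes the data frame's row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, passing `index=False` is one option. A minimal sketch with two rows copied from the entities output above, written to in-memory buffers rather than files (the name `sample_entities` is illustrative):

```python
import io

import pandas as pd

sample_entities = pd.DataFrame.from_records(
    [
        {"id": "000-020-923-889-687", "entity_text": "Sabine",
         "entity_label": "PERSON", "entity_start": 292, "entity_end": 298},
        {"id": "000-020-923-889-687", "entity_text": "Gilmer",
         "entity_label": "GPE", "entity_start": 330, "entity_end": 336},
    ]
)

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
sample_entities.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample_entities.to_csv(without_index, index=False)
without_index.seek(0)
roundtrip = pd.read_csv(without_index)
print(roundtrip.columns.tolist())
```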

# Create a Part of Speech (POS) Data Frame

We may not end up using this table, but it could become important for disambiguating names and refining match patterns later on. Building it can take longer than the other steps because it produces one row per token.

reference: [linguistic annotations](https://spacy.io/usage/linguistic-features)


In [18]:
pos = []

for doc, context in list(nlp.pipe(tup, batch_size = 10, n_process = 7, as_tuples=True)):
  for token in doc:
    pos.append(
        {
        "id":context,
        "token_text":token.text,
        "token_lemma":token.lemma_,
        "token_pos":token.pos_,
        "token_tag":token.tag_, 
        "token_dep":token.dep_, 
        "token_shape":token.shape_
         }
    )

pos_df = pd.DataFrame.from_records(pos)    

In [19]:
pos_df.head()

Unnamed: 0,id,token_text,token_lemma,token_pos,token_tag,token_dep,token_shape
0,000-020-923-889-687,Gilmer,Gilmer,PROPN,NNP,compound,Xxxxx
1,000-020-923-889-687,Limestone,Limestone,PROPN,NNP,ROOT,Xxxxx
2,000-020-923-889-687,:,:,PUNCT,:,punct,:
3,000-020-923-889-687,Oolite,Oolite,PROPN,NNP,compound,Xxxxx
4,000-020-923-889-687,Tidal,Tidal,PROPN,NNP,compound,Xxxxx
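As one possible use of the POS table for name disambiguation, the rows can be filtered on `token_pos` to keep only the proper nouns in each document. This is a sketch of the idea rather than a tested matching strategy; it uses the five rows shown above, and the variable names are illustrative.

```python
import pandas as pd

doc_id = "000-020-923-889-687"

# The five rows from the pos_df.head() output, reduced to the columns used here.
sample_pos = pd.DataFrame.from_records(
    [
        {"id": doc_id, "token_text": "Gilmer", "token_pos": "PROPN"},
        {"id": doc_id, "token_text": "Limestone", "token_pos": "PROPN"},
        {"id": doc_id, "token_text": ":", "token_pos": "PUNCT"},
        {"id": doc_id, "token_text": "Oolite", "token_pos": "PROPN"},
        {"id": doc_id, "token_text": "Tidal", "token_pos": "PROPN"},
    ]
)

# Keep proper nouns only, then collect them per document id in token order.
proper_nouns = (
    sample_pos[sample_pos["token_pos"] == "PROPN"]
    .groupby("id")["token_text"]
    .apply(list)
)

print(proper_nouns.loc[doc_id])
```

A list like this could then be compared against the entity texts for the same `id` to flag entities whose tokens are not tagged as proper nouns.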


In [None]:
# write to csv

pos_df.to_csv("pos_df.csv")