# Import

This notebook uses spaCy to create training data for a Google AutoML NER model. The documentation pages listed below helped in creating it.

1. https://spacy.io/usage/training

2. https://spacy.io/api/matcher

3. https://spacy.io/api/tokenizer

4. https://spacy.io/usage/linguistic-features#sbd

5. https://spacy.io/usage/spacy-101

In [1]:
import spacy
import csv

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load Data

In [3]:
# Load txt files

session_116_13_txt = open('/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-13_OPPORTUNITIES IN AGRICULTURE.txt', mode='r').read()
session_116_15_txt = open('/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-15_REDUCING RISKS AND COSTS.txt', mode='r').read()
session_116_17_txt = open('/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-17_OVERCOMING THE HEALTH RISKS.txt', mode='r').read()
session_116_18_txt = open('/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-18_JUST CLEAN ENERGY ENVIRONMENT.txt', mode='r').read()
session_116_19_txt = open('/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-19_CREATING_A_CLIMATE_RESILIENT_AMERICA.txt', mode='r').read()
session_116_8_txt = open("""/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data/116-8_COLORADO'S ROADMAP FOR CLEAN ENERGY.txt""", mode='r').read()
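
The six open(...).read() calls above repeat the same folder path and never close their file handles. A minimal sketch of an alternative follows, assuming every .txt file in the folder should be loaded and that the files are UTF-8 encoded. The helper name load_transcripts and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_transcripts(folder, pattern="*.txt", encoding="utf-8"):
    """Read every file in `folder` matching `pattern` into a {filename: text} dict.

    Hypothetical helper: the glob pattern and the encoding are assumptions.
    """
    texts = {}
    for path in sorted(Path(folder).glob(pattern)):
        # read_text opens, reads, and closes the file in one call
        texts[path.name] = path.read_text(encoding=encoding)
    return texts

# Possible usage with the folder from the cell above:
# transcripts = load_transcripts(
#     '/content/drive/MyDrive/Online Presence/Uniqtech/Environmental Policy Data')
# txt_files = list(transcripts.values())
```

Keying the result by filename keeps track of which transcript each text came from, which the separate variables above do only by name.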

In [None]:
# Check files loaded

session_116_18_txt



In [4]:
# Note: session_116_8_txt is loaded above but is not included in this list
txt_files = [session_116_13_txt, session_116_15_txt, session_116_17_txt, session_116_18_txt, session_116_19_txt]

# Tokenize

Split the loaded documents into sentences and combine them into one list.

Input: txt documents

Output: list of spaCy sentence spans (Span objects)


In [5]:
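# Initialize the sentence splitter: https://spacy.io/usage/linguistic-features#sbd
from spacy.lang.en import English


all_sent = []

nlp = English()  # blank English pipeline with no trained model
# spaCy v2 API; in spaCy v3, replace the next two lines with nlp.add_pipe("sentencizer")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

# Collect the sentence spans from every document into one list
for txt in txt_files:
  doc = nlp(txt)
  for sent in doc.sents:
    all_sent.append(sent)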

In [6]:
len(all_sent)

9615

In [7]:
print("Type: " + str(type(all_sent[800])) + "\n" + "Sample: " + str(all_sent[800]))

Type: <class 'spacy.tokens.span.Span'>
Sample: First, DTE pays dairy 
farmers a share of the revenues earned from the sale of RNG, allowing 
these primarily family owned businesses to realize value from a waste 
byproduct.


# Entity Recognition

Use an existing spaCy model to annotate the sentences.

Input: list of spaCy sentence spans

Output: list of annotated dictionaries for the JSONL upload to Google Cloud (https://cloud.google.com/natural-language/automl/docs/prepare)


In [8]:
# Entity recognition
all_jsonl = []

nlp = spacy.load("en_core_web_sm") # model doc: https://spacy.io/models, https://spacy.io/models/en

for sent in all_sent:
  # One record per sentence; the name `record` avoids shadowing the built-in dict
  record = {"annotations": [],
            "text_snippet": {"content": None}
            }
  doc = nlp(sent.text)
  record["text_snippet"]["content"] = doc.text
  for ent in doc.ents:
    # Offsets are character positions within this sentence's text
    # print(ent.text, ent.start_char, ent.end_char, ent.label_)
    annotation_dict = {
        "text_extraction": {"text_segment": {"end_offset": ent.end_char, "start_offset": ent.start_char}},
        "display_name": ent.label_
    }
    record["annotations"].append(annotation_dict)
  all_jsonl.append(record)

In [9]:
all_jsonl[256]

{'annotations': [{'display_name': 'ORG',
   'text_extraction': {'text_segment': {'end_offset': 106,
     'start_offset': 103}}},
  {'display_name': 'CARDINAL',
   'text_extraction': {'text_segment': {'end_offset': 137,
     'start_offset': 120}}}],
 'text_snippet': {'content': 'Combining \ncurrent cover crop adoptees and this conservative estimate of future \nadoption would reduce GHG emissions by an estimated 26.8 to 38.2 MMT of \nCO2e per year.'}}
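
The offsets in a record like the one above are hard to check by eye. A minimal sketch of a sanity check follows: it slices the snippet text with each annotation's offsets so the labeled text can be read directly. The helper name annotated_spans and the example record are hypothetical, made up for illustration; only the dictionary layout comes from the cells above.

```python
def annotated_spans(record):
    """Return (label, text) pairs by slicing the snippet with each annotation's offsets."""
    content = record["text_snippet"]["content"]
    pairs = []
    for ann in record["annotations"]:
        seg = ann["text_extraction"]["text_segment"]
        pairs.append((ann["display_name"],
                      content[seg["start_offset"]:seg["end_offset"]]))
    return pairs

# Hypothetical record in the same layout as all_jsonl entries
example = {
    "annotations": [
        {"text_extraction": {"text_segment": {"start_offset": 0, "end_offset": 3}},
         "display_name": "ORG"}
    ],
    "text_snippet": {"content": "DTE pays dairy farmers."},
}
print(annotated_spans(example))  # [('ORG', 'DTE')]
```

Calling annotated_spans(all_jsonl[256]) in the notebook would show which text the ORG and CARDINAL labels actually cover.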

# Export to txt

Export the records to a txt file, one record per line. Then download the file, copy its contents, and paste them into the second column of a csv file. Finally, fill in the first column with TRAIN, TEST, VALIDATION, or UNASSIGNED.

In [10]:
# Write one JSON object per line (JSONL); json.dumps emits valid JSON, unlike str()
import json

with open('ner_data2.txt', mode='w') as data_file:
  for data in all_jsonl:
    data_file.write(json.dumps(data) + "\n")
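
The manual copy-and-paste step described above could also be scripted. The sketch below writes the two-column csv directly, with the split label in the first column and the JSON record in the second. The helper name write_split_csv, the 80/10/10 split ratio, and the fixed seed are assumptions for illustration, and the two-column layout follows the description in this notebook rather than a checked AutoML specification.

```python
import csv
import json
import random

def write_split_csv(records, out_file, train=0.8, validation=0.1, seed=0):
    """Write rows of [split, json_record] to an open text file.

    Hypothetical helper: the split ratios and the seed are assumptions.
    Records not assigned to TRAIN or VALIDATION go to TEST.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    writer = csv.writer(out_file)
    for record in records:
        r = rng.random()
        if r < train:
            split = "TRAIN"
        elif r < train + validation:
            split = "VALIDATION"
        else:
            split = "TEST"
        # json.dumps escapes newlines inside the text, so each record stays on one row
        writer.writerow([split, json.dumps(record)])

# Possible usage in the notebook:
# with open('ner_data2.csv', mode='w', newline='') as csv_file:
#     write_split_csv(all_jsonl, csv_file)
```

The csv module quotes the JSON column automatically, so commas and double quotes inside a record do not break the row.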