In [107]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Processing Label Studio Exports for spaCy
This notebook processes exports from Label Studio for use in spaCy.

In [108]:
import preprocessor
from preprocessor import Preprocessor

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Loading the Data
### 1.1. Loading JSON export from Label Studio
The first step is to load the JSON export from Label Studio. This is done using the `json` library and the predefined method `loadFile` of the `Preprocessor` class.
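
How the loader is implemented is not shown in this notebook. As a minimal sketch of an equivalent function, assuming the export is a UTF-8 encoded JSON file that holds a list of tasks (`load_export` is a hypothetical name, not part of the `Preprocessor` class):

```python
import json
from pathlib import Path


def load_export(path):
    """Read a Label Studio JSON export and return the parsed list of tasks."""
    # Being explicit about the encoding avoids surprises with the
    # platform default, for example on Windows.
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

The result is plain Python data (lists and dictionaries), so it can be inspected with ordinary indexing such as `export_data[0]`.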

In [109]:
export_data = preprocessor.loadFile('label_studio_test_notebook_231005.json')
# export_data = preprocessor.loadFile('label_studio_test_notebook_290923.json')

The exported data consists of one entry per text, and each entry holds the annotations that all users made for that text. The first annotated text in the export looks like this:

In [110]:
print(export_data[0])

{'id': 70476117, 'annotations': [{'id': 22641728, 'completed_by': {'id': 12687, 'email': 'marlougielen@gmail.com', 'first_name': '', 'last_name': ''}, 'result': [{'id': 'kOQ_kmIuD1', 'type': 'labels', 'value': {'end': 56, 'text': 'city ', 'start': 51, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'v3hgj0k0kp', 'type': 'labels', 'value': {'end': 337, 'text': 'cities ', 'start': 330, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'IVC04elaO2', 'type': 'labels', 'value': {'end': 602, 'text': 'one of the Seven Wonders of the Ancient World', 'start': 557, 'labels': ['type']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'uVurmXz1pv', 'type': 'labels', 'value': {'end': 633, 'text': 'monumental buildings', 'start': 613, 'labels': ['component']}, 'origin': 'manual', 'to_name': 'text', 'from_name': 'label'}, {'id': 'dfKW_UPcVr', 'type': 'labels', 'value': {'end': 664, 'text': 'Library 
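
The printed task is hard to read as a single line. The sketch below walks the same structure (`annotations`, then `result`, then `value`) and yields one tuple per labelled span. The sample task is a trimmed-down copy of the output above, and `iter_spans` is a hypothetical helper rather than part of the preprocessor.

```python
def iter_spans(task):
    """Yield (annotator_id, start, end, label) for every labelled span in one task."""
    for annotation in task.get("annotations", []):
        completed_by = annotation.get("completed_by")
        # In this export `completed_by` is a dictionary; fall back to the raw value otherwise.
        annotator = completed_by.get("id") if isinstance(completed_by, dict) else completed_by
        for result in annotation.get("result", []):
            if result.get("type") != "labels":
                continue  # skip relations and any other result types
            value = result["value"]
            for label in value["labels"]:
                yield annotator, value["start"], value["end"], label


# Trimmed-down task that mirrors the structure of the printed export.
sample_task = {
    "id": 70476117,
    "annotations": [{
        "completed_by": {"id": 12687},
        "result": [
            {"type": "labels",
             "value": {"start": 51, "end": 56, "text": "city ", "labels": ["type"]}},
            {"type": "labels",
             "value": {"start": 613, "end": 633, "text": "monumental buildings",
                       "labels": ["component"]}},
        ],
    }],
}

for span in iter_spans(sample_task):
    print(span)
```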

### 1.2 Converting to spaCy training format
To provide custom labels to spaCy, we need to convert the data to the following format:

```python
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING"), (15, 19, "HEIGHT")]),
]
```

The `process_export_sentences` method of the preprocessor handles this conversion. It returns the training data in the format above, together with the relations between the labelled spans.

In [111]:
training_data, relation_data = preprocessor.process_export_sentences(export_data)

[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
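
The internals of `process_export_sentences` are not shown. As a rough illustration of the conversion described in this section, the sketch below turns one task into the `(text, spans)` format. It assumes the raw text is stored under `task["data"]["text"]`, which depends on the labeling configuration and is not visible in the truncated output above; `to_training_example` and `text_key` are hypothetical names. It works on whole texts and does not split into sentences.

```python
def to_training_example(task, text_key="text"):
    """Convert one Label Studio task into a (text, [(start, end, label), ...]) tuple."""
    # Assumption: the raw text sits under task["data"][text_key].
    text = task["data"][text_key]
    spans = set()  # a set removes identical spans from different annotators
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("type") != "labels":
                continue  # relations are handled separately
            value = result["value"]
            for label in value["labels"]:
                spans.add((value["start"], value["end"], label))
    return text, sorted(spans)


task = {
    "data": {"text": "Tokyo Tower is 333m tall."},
    "annotations": [{"result": [
        {"type": "labels", "value": {"start": 0, "end": 11, "labels": ["BUILDING"]}},
        {"type": "labels", "value": {"start": 15, "end": 19, "labels": ["HEIGHT"]}},
    ]}],
}

text, spans = to_training_example(task)
print(text, spans)
# Offsets are character positions, so slicing recovers the labelled text.
print([text[start:end] for start, end, _ in spans])
```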


### 1.3 Checking results
We can print the first five relations to check that the conversion worked.

In [112]:
for i, (label, relations) in enumerate(relation_data.items()):
    if i > 4:
        break
    print(f"Labels: {label}")
    print(f"Relations: {relations}")
    print()

Labels: ('Ephesus ', 'Attic and Ionian Greek colonists')
Relations: ['org:created_by']

Labels: ('Ephesus ', 'city ')
Relations: ['org:is_type']

Labels: ('Ephesus ', 'cities ')
Relations: ['org:is_type']

Labels: ('Ephesus ', 'Ancient Greece')
Relations: ['org:located_in']

Labels: ('Ephesus ', 'the coast of Ionia')
Relations: ['org:located_in']



The same can be done for the labels, which are stored in a dictionary under the `entities` key. Let's print only the first item.

In [113]:
print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1]["entities"], sep = "\n")

Training data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[51, 56, 'type']
[59, 73, 'location']
[77, 95, 'location']
[122, 128, 'location']
[131, 144, 'location']
[146, 152, 'location']
[110, 121, 'date']
[1, 9, 'landmark_name']
[1, 8, 'landmark_name']
[51, 55, 'type']
[56, 73, 'location']
[74, 95, 'location']
[122, 127, 'location']
[128, 144, 'location']
[146, 152, 'location']
[1, 8, 'landmark_name']
[51, 55, 'type']
[59, 73, 'location']
[74, 95, 'location']
[97, 127, 'location']
[131, 144, 'location']
[146, 152, 'location']
[1, 8, 'landmark_name']
[48, 55, 'type']
[59, 73, 'location']
[77, 95, 'location']
[122, 127, 'location']
[131, 144, 'location']
[146, 152, 'location']
[1, 8, 'landmark_name']
[51, 55, 'type']
[59, 73, 'people']
[77, 95, 'location']
[122, 127, 'location']
[131, 144, 'location']
[146, 152, 'location']
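
The listing contains near-duplicate and overlapping spans, for example `[1, 9]` and `[1, 8]` for `landmark_name`, presumably because several annotators labelled the same sentence. spaCy's entity recognizer expects non-overlapping entity spans, so some resolution step is needed before training. Whether and how the `Preprocessor` does this is not shown; the sketch below is only one possible strategy (prefer the longest span), and `resolve_overlaps` is a hypothetical helper.

```python
def resolve_overlaps(entities):
    """Keep the longest spans first and drop any span that overlaps one already kept."""
    # Sort by length (longest first), then by start offset for a stable result.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


# A few of the spans printed above.
entities = [
    [51, 56, "type"], [59, 73, "location"], [1, 9, "landmark_name"],
    [1, 8, "landmark_name"], [51, 55, "type"], [56, 73, "location"],
]
print(resolve_overlaps(entities))
```

Preferring the longest span is a simple rule, not necessarily the right one for this data. With several annotators, a majority vote per span may be a better fit.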


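### 1.4 Preparing the data for spaCy
The data is now in the correct format, so it can be processed and saved as a spaCy training file. The `Preprocessor` class provides `preprocess_spacy` for this. The cell below currently calls `preprocess_json` instead, with the `preprocess_spacy` call commented out.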

In [114]:
# preprocessor.preprocess_spacy(training_data, warn = False)
preprocessor.preprocess_json(training_data)

## 2. Training the spaCy Model on the Training File

Start by importing the required spaCy modules.


In [115]:
import spacy
from spacy import displacy
from spacy.cli.train import train

# If a GPU is available, use it for training
spacy.prefer_gpu()

False

### 2.1 spaCy before training with custom labels

In [116]:
example_text = training_data[0][0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

print(nlp.pipe_names)

displacy.render(doc, style="ent")

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 2.2 Training a spaCy model
Training a spaCy model is usually done from the command line, which is why the following cells call the spaCy CLI through `subprocess`. The training process has a few steps:
1. The spaCy model needs a config file; all necessary files are in the `spacy` folder.
2. The model needs training data, which was exported to the `spacy` folder earlier in this notebook.
3. After training, the model is evaluated, and the results are printed for both training and evaluation.
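
The steps above can be wrapped in a small helper so that every step is run the same way. This is a sketch, not the notebook's own code: `project_command` and `run_step` are hypothetical names, and the step names are taken from the cells in this notebook (the project file that defines them is not shown).

```python
import subprocess
import sys


def project_command(step):
    """Build the argument list for one step of a spaCy project workflow."""
    # Calling the module through the current interpreter avoids PATH issues
    # inside virtual environments.
    return [sys.executable, "-m", "spacy", "project", "run", step]


def run_step(step, project_dir="spacy"):
    """Run one project step and return its captured standard output."""
    result = subprocess.run(
        project_command(step),
        cwd=project_dir,
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout


# Step names as used in this notebook.
STEPS = ["convert", "train", "evaluate"]
```

Passing the command as a list avoids `shell=True`, so the same call behaves the same on Windows and POSIX systems. A typical use would be `for step in STEPS: print(run_step(step))`.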

In [123]:
import subprocess

# shell=True lets the command string run on POSIX as well as on Windows
subprocess.run("spacy project run convert", cwd="spacy", shell=True, check=True)

# subprocess.run("spacy project run create-config", cwd="spacy")

CompletedProcess(args='spacy project run create-config', returncode=0)

In [127]:
result = subprocess.run("spacy project run train", cwd="spacy", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which
is now deprecated. Weasel will not validate your version of spaCy.

Running command: 'C:\Users\20182640\.virtualenvs\Text-Mining-xR8YyNgY\Scripts\python.exe' -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --training.eval_frequency 10 --training.max_steps 500 --training.patience 50 --gpu-id -1
ℹ Saving to output directory: training
ℹ Using CPU

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tok2vec', 'tagger', 'parser', 'senter',
'attribute_ruler', 'lemmatizer']
ℹ Initial learn rate: 0.01
E    #       LOSS NER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  SPEED   SCORE 
---  ------  --------  -------  -------  -------  -------  ---------  -

In [128]:
result = subprocess.run("spacy project run evaluate", cwd="spacy", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, encoding="utf-8")
print(result.stdout)

⚠ Your project configuration file includes a `spacy_version` key, which
is now deprecated. Weasel will not validate your version of spaCy.

Running command: 'C:\Users\20182640\.virtualenvs\Text-Mining-xR8YyNgY\Scripts\python.exe' -m spacy evaluate training/model-best corpus/dev.spacy --output training/metrics.json
ℹ Using CPU

TOK      100.00
TAG      -     
POS      -     
MORPH    -     
LEMMA    -     
UAS      -     
LAS      -     
NER P    100.00
NER R    100.00
NER F    100.00
SENT P   -     
SENT R   -     
SENT F   -     
SPEED    2287  

                   P        R        F
location      100.00   100.00   100.00
person_name   100.00   100.00   100.00
animal        100.00   100.00   100.00

✔ Saved results to training\metrics.json
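
The `NER P`, `NER R` and `NER F` values are precision, recall and F-score over predicted entity spans, where a prediction counts as correct only if both its boundaries and its label match the gold annotation exactly. A small worked sketch of that arithmetic follows; the spans and the function name `span_prf` are illustrative and not taken from the project.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(1, 8, "landmark_name"), (51, 55, "type"), (59, 73, "location")]
predicted = [(1, 8, "landmark_name"), (51, 56, "type")]  # second span is off by one character

precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

In this example one of two predictions is correct (P = 0.50) and one of three gold spans is found (R = 0.33), giving F = 0.40. A span that is off by a single character counts as a miss.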



### 2.3 Visualizing the results
The model is now trained. It can be loaded into spaCy, and its named entity recognition (NER) output for an example text can be visualized.

In [134]:
options = {
    "colors": {"location": "lightyellow",
               "person_name": "lightgreen",
               "landmark_name": "lightcoral",  # "lightred" is not a CSS color name
               "condition": "lightblue"}
}

# Now test the newly created spaCy model on a sample text and visualize it using displaCy
nlp = spacy.load("spacy/training/model-last/")

example_text = "The Phoenix Islands Protected Area (PIPA) is located in the Republic of Kiribati, an ocean nation in the central Pacific approximately midway between Australia and Hawaii. PIPA constitutes 11.34% of Kiribati's exclusive economic zone (EEZ), and with a size of , it is one of the largest marine protected areas (MPA) and one of the largest protected areas of any type (land or sea) on Earth. The PIPA was also designated as the world's largest and deepest UNESCO World Heritage Site in 2010.\nThe PIPA conserves one of the world's largest intact oceanic coral archipelago ecosystems, includes 14 known underwater seamounts (presumed to be extinct volcanoes) and other deep-sea habitats. The area contains approximately 800 known species of fauna, including about 200 coral species, 500 fish species, 18 marine mammals and 44 bird species. In total it is equivalent to the size of the state of California in the US, though the total land area is only . To the north of the PIPA is the U.S. administered Pacific Remote Islands Marine National Monument that is currently the world's largest designated MPA.\nHistory and administration.\nThe Republic of Kiribati, in partnership with the non-governmental conservation organizations Conservation International and the New England Aquarium, has formed the Phoenix Island Protected Area Conservation Trust (PIPA Trust). In 2018, the New England Aquarium resigned as a partner and the Aquarium of the Pacific joined the PIPA Trust Board.\nManagement and protection requirements necessary to maintain the values of this MPA are reflected both in the current interim management measures and the recently approved management plan. These include, but are not limited to, the following:\nThe administrators of the reserve had been criticized for the amount of fishing they allowed prior to full closure on January 1, 2015. As of Jan 1, 2015, all commercial extractive activities (including tuna fishing) are prohibited throughout the MPA. 
Only a small sustainable-use zone around Canton Island allows for limited activities to support the resident population. In the PIPA Management Plan 2015\u20132020, which was implemented following a Kiribati government decision in January 2014, there is a total ban on commercial fishing within the PIPA boundaries. The Territorial Sea (to 12 nm) and all lagoons of the 8 PIPA islands, (Kanton, Manra, Rawaki, Birnie, Mckean, Enderbury, Nikumaroro and Orona) to ensure there is no impact to marine and terrestrial species including habitats.\nIn November 2021, it was announced that Kiribati government will terminate the protected area to boost tuna fishing.\nNatural heritage.\nThere are three atolls with associated lagoons and perimeter coral reefs in the PIPA (Orona (Hull), Nikumaroro (Gardner), and Kanton (Aba-Riringa)), and five low islands surrounded by fringing reefs (Manra (Sydney), Rawaki (Phoenix), McKean, Birnie and Enderbury), and also two submerged reefs (Winslow and Carondelet). The area contains seven main habitats: island, lagoon, coral reef, deep reef, sea mount, deep benthos, and open ocean, which are all represented within the protected zone.\nThe 2000 surveys (Obura, et al.) in the Phoenix Islands (Orona (Hull), Nikumaroro (Gardner), and Kanton (Aba-Riringa)), five low reef islands surrounded by coral reefs (Manra (Sydney), Rawaki (Phoenix), McKean and Enderbury) identify that, at the time of these surveys, the reefs were in an excellent state of health, and free from the bleaching that has plagued reefs in other parts of the Pacific with no evidence of any coral diseases.\nThe coral reefs of the Phoenix Islands were notable for their moderate Live Coral Cover (LCC) of 20-40% and evidence of high physical breakage of coral by wave energy on the southern, eastern and northern reefs of the islands, which create coral rubble in the lagoons and base of the reefs. 
The dominant bottom cover of the lagoons was hard coral (36.0%), followed by coralline algae (red algae) (18.0%), coral rubble (16.7%), turf and fleshy algae (11.6%) and \"Halimeda\" (green macroalgae) (10.4%). The dominance of coral and coralline algae indicates healthy reef ecosystems dominated by calcifying organisms and active reef framework growth. The effect of exposure to storms is indicated by the dominance trends with storm resistant encrusting/submassive forms in windward sites, its somewhat lower abundance at leeward sites and a corresponding increase in more delicate plate forms, and the dominance of the more fragile table and staghorn corals in protected lagoon sites. Coral species diversity is higher on the larger islands of Nikumaroro, Kanton and Orona, which indicates the importance of the larger area of reef on these islands for support of biodiversity. Carpeting soft corals (\"Sinularia\" and \"Lobophytum\") were found at the bottom of the lagoons of Kanton and Orona, which are the only true lagoons in the Phoenix Islands.\nCrown-of-thorns starfish (\"Acanthaster planci\"), cushion star and other coral predators, such as the corallivorous snail \"Drupella spp.\", are found on the reefs of the Phoenix Islands, although there has not been any indication of destructive outbreaks of those predators on the reefs.\nSpecies of giant clam (\"Tridacna\") occur in low numbers: \"Tridacna squamosa\"; \"Tridacna maxima\"; but not \"Tridacna gigas\".\nTwo submerged reefs, Winslow and Carondelet, and at least 14 known seamounts together with open ocean and deep-sea habitat are an integral part of the Phoenix Islands Protected Area (PIPA). 
The New England Aquarium (NEAq), Boston University (BU), Woods Hole Oceanographic Institution, Sea Education Association (SEA), and Schmidt Ocean Institute have carried out scientific research expeditions of these seamount habitats, which have been identified being rich in deep-water coral and biodiversity supporting a variety of oceanic pelagic species. PIPA has been identified as an important feeding and spawning site for the tuna species. The dominant taxonomic group in the deep sea across all depths were the octocorals, followed by antipatharians, scleractinians, and then zoantharians.\nImpact of iron leaching from shipwrecks and anchor gear.\nThe PIPA is in a naturally iron poor region. The introduction of iron to this environment from shipwrecks and anchor gear, is linked to proliferation of turf algae and benthic bacterial communities, and degraded \u2018black reefs\u2019. Monitoring from 2000 to 2015 recorded the black reef originating at the 1929 wreck of the SS \"Norwich City\" on Nikumaroro progressing northward to sites away. The 2015 expedition to the PIPA recorded the presence of black reefs on five atolls (Enderbury, Kanton, Nikumaroro, McKean, Rawaki) and on Carondelet seamount associated with shipwreck debris. No recovery has been documented at black reefs observed between 2005 and 2015.\nIsland restoration and biosecurity program.\n \nFive of the eight islands in PIPA are currently designated as Important Bird Areas by Birdlife International. Today there are 19 species of seabirds living on the islands. Many other seabirds migrate through PIPA, including shearwaters and mottled petrels from Australia and New Zealand. 
Prominent species include the endemic, endangered Phoenix petrel.\nSome of the negative impacts of the introduction of non-native, invasive plants and animals include the elimination of native seabirds and plants, particularly through the destruction of the eggs and young, and introduced plants taking over other plant life, modifying the natural island ecosystem. Plants and animals that have been introduced over time include Pacific and Asian rats, rabbits, cats, ants, pigs, dogs and lantana.\nUntil PIPA was declared, the last comprehensive fauna surveys of the Phoenix Islands occurred in the 1960s. In 2006 a new survey was conducted to determine the extent of non-native pest species invasions on each island and the feasibility of a restoration program. From this work it was determined that pests - especially the feral rabbits on Rawaki Island and Asian rats on McKean Island - should be removed from the Phoenix Islands.\nSometime around the year 2002, Asian rats colonized McKean, apparently when a fishing trawler was wrecked on the island. The 2006 survey found that rat predation had virtually destroyed the once abundant populations of storm-petrels, blue noddles and other petrels and shearwaters. Rabbits on Rawaki were competing for and generally damaging necessary resources for the birds, as well as trampling nests.\nAs a first step towards biodiversity recovery on the islands of the PIPA, in mid 2008 rats and rabbits were targeted on McKean and Rawaki. In November\u2013December 2009 a check of these islands by a science team indicated that the eradication programs were successful. The responses from the plant life and bird life were spectacular with the team finding that seabirds were nesting successfully on McKean for the first time in nearly 10 years. Meanwhile, on Rawaki the vegetation recovery has enabled birds like blue noddies to find suitable nest sites throughout the island. Even frigatebirds were nesting on the now recovering plants. 
These restoration efforts will enable populations of Phoenix petrel, white-throated storm petrel, and other important seabird populations to recover in the PIPA. A second eradication expedition was successfully executed in July 2011, with two additional islands of the PIPA, Enderbury and Birnie, targeted for pest removal. Both islands had populations of the non-native Pacific rat.\nUNESCO World Heritage Site.\nOn January 30, 2009, the Republic of Kiribati submitted an application for the Phoenix Islands Protected Area for consideration on the United Nations Educational, Scientific and Cultural Organization (UNESCO) World Heritage List. This was the first nomination submitted by Kiribati since they ratified the Convention in 2000.\nOn August 1, 2010, at the 34th session of the World Heritage Committee in Bras\u00edlia, Brazil, the decision was made to inscribe PIPA onto the World Heritage List. It became the largest and deepest World Heritage site in the world."
doc = nlp(example_text)

displacy.render(doc, style="ent", jupyter=True, options=options)

# Show the tokens, their labels and their entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Kiribati location
Pacific location
approximately animal
Australia location
Hawaii location
11.34 location
Kiribati location
Earth location
Heritage Site person_name
2010 location
14 known person_name
California location
US location
U.S. location
Pacific location
Marine location
Monument location
History location
Kiribati location
International and person_name
England location
Aquarium location
Trust animal
Trust animal
2018 location
England location
Aquarium location
Aquarium location
Pacific location
Board location
January 1 person_name
2015 location
Jan 1 person_name
2015 location
Canton location
Island animal
Kiribati location
January 2014 person_name
Territorial Sea person_name
12 nm person_name
8 PIPA person_name
Kanton location
Manra location
Rawaki location
Birnie location
Mckean location
Enderbury location
Nikumaroro location
Orona location
November 2021 person_name
Kiribati location
Orona location
Nikumaroro location
Gardner location
Kanton location
Riringa location
Manra loca