# Regex and Prodigy (LABEL: Names)



This notebook annotates entities in sentences in Prodigy's format, one label at a time, for later use in machine learning. Prodigy is an annotation tool from the makers of spaCy. See <a href="https://prodi.gy/">Prodigy</a>.

## Imports

In [35]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import scipy as sp
import spacy 
from prodigy.util import write_jsonl



## Read the Data

In [36]:
importVersion = '013'

In [37]:
path = '../data/01_df_v{0}.pickle'.format(importVersion)  # Path to the data on your local machine; a Windows path is best written as a raw string (prefix r)
dfAstroNova = pd.read_pickle(path)

### Sort by Chapter

In [38]:
# Sort the data by book chapter; "appendix b" is temporarily set to NaN so that it sorts last
dfAstroNova['chapter'] = dfAstroNova.chapter.replace("appendix b", np.nan).astype(float)
dfAstroNova = dfAstroNova.rename_axis('MyIdx').sort_values(by=['chapter', 'MyIdx'], ascending=[True, True])
# Assign the result back instead of calling fillna(inplace=True) on the attribute
dfAstroNova['chapter'] = dfAstroNova['chapter'].fillna('appendix b')

In [39]:
# Renumber the rows (discarding the helper index 'MyIdx') and drop the raw HTML column
dfAstroNova = dfAstroNova.reset_index(drop=True)
dfAstroNova = dfAstroNova.drop(columns="html")

In [40]:
type(dfAstroNova)

pandas.core.frame.DataFrame

In [41]:
dfAstroNova.head(5)

| | text | links | italic | chapter | graphic | table | marginal | sentences | tagged |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chapter 1 | [] | [] | 1 | [] | [] | [] | [Chapter 1] | [[(Chapter, None), (1, NUM)]] |
| 1 | On the distinction between the first motion an... | [] | [] | 1 | [] | [] | [] | [On the distinction between the first motion a... | [[(On, None), (the, None), (distinction, None)... |
| 2 | The testimony of the ages confirms that the mo... | [] | [] | 1 | [] | [] | [ Terms: 1. The first motion is that of the wh... | [The testimony of the ages confirms that the m... | [[(The, None), (testimony, None), (of, None), ... |
| 3 | It is just this from which astronomy arose amo... | [] | [] | 1 | [ ch 1 gr 1] | [] | [] | [It is just this from which astronomy arose am... | [[(It, None), (is, None), (just, None), (this,... |
| 4 | Before the distinction between the first motio... | [] | [(such] | 1 | [] | [] | [ 2] | [Before the distinction between the first moti... | [[(Before, None), (the, None), (distinction, N... |


### Read the Sentences

In [42]:
df=dfAstroNova.sentences

In [43]:
sents=[]
for para in dfAstroNova.sentences:
    sents +=para

## Annotation in a Format Consistent with Prodigy

Prodigy accepts a specific input format: JSONL (newline-delimited JSON). Entities and other highlighted spans of text are defined in the "spans" property. An example record looks like this dictionary:

{'text': 'On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus, with a diurnal motion of 22’ 54” ^15.',
  'spans': [{'start': 22, 'end': 29, 'label': 'TIME'}]}

Here "start" and "end" are the character offsets of the entity in the sentence, and "label" is the entity type (in this notebook STAR, PLAN, or NAME).
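As a quick sanity check on this format, the offsets can be verified by slicing the text, and a list of such dictionaries can be serialized as JSONL with the standard library alone. The following is a minimal sketch: `to_jsonl` is a hypothetical helper (the notebook itself relies on Prodigy's `write_jsonl`), and the sample sentence is a shortened variant of the one above, with offsets computed rather than hard-coded.

```python
import json


def to_jsonl(records):
    """Serialize an iterable of dicts as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


text = "On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus."
entity = "Mars"

# Character offsets: start is inclusive, end is exclusive (like Python slices).
start = text.index(entity)
end = start + len(entity)
record = {"text": text, "spans": [{"start": start, "end": end, "label": "PLAN"}]}

# Slicing the text with the stored offsets must give back the entity.
assert text[start:end] == entity

jsonl = to_jsonl([record])
print(jsonl)
```

Computing the offsets from the match, as the regex cells in this notebook do with `match.span()`, avoids off-by-one errors from counting characters by hand.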

Because correcting the annotations is not very hard, we annotate the following three labels separately:
<ul>
<li>STAR: names of stars</li>
<li>PLAN: names of planets</li>
<li>NAME: names of people and places</li>
</ul>
and then merge the results for correction by annotators.
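The merge step itself is not shown in this notebook. One way it could work is sketched below, under the assumption that every label's run produces one task per sentence in the same order (each run iterates over the same `sents` list). `merge_examples` is a hypothetical helper name, and the two short tasks are made-up sample data.

```python
def merge_examples(*runs):
    """Merge several lists of {"text", "spans"} tasks built from the same texts in the same order."""
    if len({len(run) for run in runs}) > 1:
        raise ValueError("runs have different lengths")
    merged = []
    for tasks in zip(*runs):
        text = tasks[0]["text"]
        # Guard against runs that were built from different sentence lists.
        if any(task["text"] != text for task in tasks):
            raise ValueError("runs are not aligned on the same texts")
        spans = [span for task in tasks for span in task["spans"]]
        # Keep the spans in reading order.
        spans.sort(key=lambda span: (span["start"], span["end"]))
        merged.append({"text": text, "spans": spans})
    return merged


star_run = [{"text": "Mars was near Regulus.", "spans": [{"start": 14, "end": 21, "label": "STAR"}]}]
plan_run = [{"text": "Mars was near Regulus.", "spans": [{"start": 0, "end": 4, "label": "PLAN"}]}]

merged = merge_examples(star_run, plan_run)
print(merged[0]["spans"])
```

This sketch keeps overlapping spans from different labels as they are; resolving them would be left to the correction pass.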
    

## STAR


In [45]:
label = "STAR"
texts = sents
# One alternation of star names; a longer name precedes a shorter name it starts with
regex_patterns = [
    re.compile(
        r'(\bAldebaran\b|\bAlphard\b|\bAntares\b|\bArcturus\b|\bBack of Leo\b|\bBeta Leonis\b'
        r'|\bBeta Scorpii\b|\bBeta Tauri\b|\bBetelgeuse\b|\bcanis\b|\bCanis Minor\b|\bCor Leonis\b'
        r'|\bCor Scorpii,10\b|\bCor Scorpii\b|\bDenebola\b|\bdog\b|\bEpsilon Virginis\b'
        r'|\bErichthonius\b|\bHeart of Hydra\b|\bHydrae\b|\bKappa Geminorum\b|\bLambda Leonis\b'
        r'|\bNeck of Leo\b|\bOrion\b|\bPalilicium\b|\bPolaris\b|\bPollux\b|\bProcyon\b|\bRegulus\b'
        r'|\bSpica Virginis\b|\bTail of Leo\b|\bUrsa Major\b|\bUrsa\b|\bVindemiatrix\b|\bZeta Leonis\b)'
    )
]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

#write_jsonl("NER_STAR_01.jsonl", examples)

In [16]:
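def misaligned_token(examples):
    """Print every span whose character offsets do not fall on spaCy token boundaries."""
    counter = 0
    nlp = spacy.load("en_core_web_sm")
    for example in examples:
        doc = nlp(example["text"])
        for span in example["spans"]:
            # char_span returns None when start/end do not coincide with token boundaries
            char_span = doc.char_span(span["start"], span["end"])
            if char_span is None:
                counter += 1
                print("{}- Misaligned tokens-->".format(counter), example["text"], span)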

In [17]:
misaligned_token(examples)

1- Misaligned tokens--> Again, at 7h 15m on the morning of December 27, it was 36° 43’ from Cor Leonis,7 whose latitude is 0° 26½’; hence, its longitude at the end of 1582 is 17° 28⅓’ Cancer, altitude 14° 4’, and thus affected by refraction. {'start': 68, 'end': 78, 'label': 'STAR'}
2- Misaligned tokens--> [IV] On 1590 October 6, at 4h 45m in the morning, Mars was observed at an altitude of 12½ degrees, [and distances taken] from the Tail of Leo7 and the Heart of Hydra,8 with its declination. {'start': 151, 'end': 165, 'label': 'STAR'}
3- Misaligned tokens--> From this time interval, by the principles laid down above, the sun's mean motion is found to have gone 5s 25° 32' 50" beyond Cor Leonis,7 with an anomaly of 234° 54' 34" . {'start': 126, 'end': 136, 'label': 'STAR'}


### Edge Cases

As the output above shows, the regular expression matched some names that have a number (apparently a footnote marker) attached directly to them:
- Cor Leonis,7
- Hydra,8
- Cor Leonis,7

We treat these as edge cases: the span boundaries are not consistent with the tokens assigned by the model's tokenizer. Here the set of edge cases is small.

### One Possible Solution

We can extend the annotation rules to cover the examples above, so that a match includes the attached number and the span ends on a token boundary.
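Listing each numbered form by hand works for a small set of edge cases. As an alternative sketch (not the approach used in the next cell), an optional marker can be allowed after every name, so that the span always extends over an attached `,7` or `7`. The name list is a small illustrative subset, `build_pattern` is a hypothetical helper, and treating the trailing digits as footnote markers is an assumption about the source text.

```python
import re


def build_pattern(names):
    """Match any of the names, optionally followed by an attached marker such as ',7' or '7'."""
    # Longest names first, so that 'Ursa Major' is tried before 'Ursa'.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # (?:,?\d+)? absorbs the attached number; (?!\w) requires the match to end the word.
    return re.compile(r"\b(?:" + alternation + r")(?:,?\d+)?(?!\w)")


pattern = build_pattern(["Cor Leonis", "Heart of Hydra", "Tail of Leo", "Ursa", "Ursa Major"])

text = "from the Tail of Leo7 and the Heart of Hydra,8 with its declination."
matches = [match.group(0) for match in pattern.finditer(text)]
print(matches)
```

A comma followed by a space, as in "Cor Leonis, in 1582", is left outside the match, because the optional group only accepts digits that follow the name or the comma directly.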

In [34]:
label = "STAR"
texts = sents
# Numbered forms (e.g. 'Cor Leonis,7') precede the plain name so the longer form wins
regex_patterns = [
    re.compile(
        r'(\bAldebaran\b|\bAlphard\b|\bAntares\b|\bArcturus\b|\bBack of Leo\b|\bBeta Leonis\b'
        r'|\bBeta Scorpii\b|\bBeta Tauri\b|\bBetelgeuse\b|\bcanis\b|\bCanis Minor\b'
        r'|\bCor Leonis,7\b|\bCor Leonis\b|\bCor Scorpii,10\b|\bCor Scorpii\b|\bDenebola\b|\bdog\b'
        r'|\bEpsilon Virginis\b|\bErichthonius\b|\bHeart of Hydra,8\b|\bHeart of Hydra\b|\bHydrae\b'
        r'|\bKappa Geminorum\b|\bLambda Leonis\b|\bNeck of Leo\b|\bOrion\b|\bPalilicium\b|\bPolaris\b'
        r'|\bPollux\b|\bProcyon\b|\bRegulus\b|\bSpica Virginis\b|\bTail of Leo\b|\bUrsa Major\b'
        r'|\bUrsa\b|\bVindemiatrix\b|\bZeta Leonis\b)[:,]?'
    )
]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

write_jsonl("NER_STAR_V02.jsonl", examples)

In [21]:
misaligned_token(examples)

## PLAN

In [24]:
label = "PLAN"
texts = sents
regex_patterns = [
    re.compile(r'(\bEarth\b|\bJupiter\b|\bMars\b|\bMercury\b|\bMoon\b|\bNeptune\b|\bPluto\b|\bSaturn\b|\bSun\b|\bUranus\b|\bVenus\b)')
]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

#write_jsonl("NER_PLAN_01.jsonl", examples)

In [25]:
misaligned_token(examples)

1- Misaligned tokens--> If this wearisome method has filled you with loathing, it should more properly fill you with compassion for me, as I have gone through it at least seventy times at the expense of a great deal of time, and you will cease to wonder that the fifth year has now gone by since I took up Mars,12 although the year 1603 was nearly all given over to optical investigations. {'start': 282, 'end': 286, 'label': 'PLAN'}


### Edge Cases

As the output above shows, the regular expression matched a name that has a number attached directly to it:
- Mars,12

We treat this as an edge case: the span boundaries are not consistent with the tokens assigned by the model's tokenizer. Here there is only one edge case.
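Cases like this can be flagged with a cheap pre-check before spaCy is involved. The heuristic below is an assumption of this sketch, not spaCy's tokenizer logic: it reports a span as suspicious when the text right after it is a digit, or a comma immediately followed by a digit. `glued_marker_spans` is a hypothetical helper and the two tasks are made-up sample data; `doc.char_span` remains the authoritative check.

```python
import re

# A digit, or a comma directly followed by a digit, right after the end of a span.
GLUED_MARKER = re.compile(r",?\d")


def glued_marker_spans(examples):
    """Return (text, span) pairs whose span is directly followed by a number-like marker."""
    suspicious = []
    for example in examples:
        text = example["text"]
        for span in example["spans"]:
            # Pattern.match(string, pos) anchors the match at the span's end offset.
            if GLUED_MARKER.match(text, span["end"]):
                suspicious.append((text, span))
    return suspicious


sample_examples = [
    {"text": "since I took up Mars,12 although", "spans": [{"start": 16, "end": 20, "label": "PLAN"}]},
    {"text": "Mars, in 1603, was observed", "spans": [{"start": 0, "end": 4, "label": "PLAN"}]},
]

flagged = glued_marker_spans(sample_examples)
for text, span in flagged:
    print(text[span["start"]:span["end"]], "->", text[span["end"]:span["end"] + 3])
```

Only the first task is reported: in the second one the comma is followed by a space, so the name is not glued to a number.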

### One Possible Solution

As with STAR, we can extend the annotation rules to cover the example above.

In [26]:
label = "PLAN"
texts = sents
# 'Mars,12' precedes 'Mars' so that the attached number is included in the span
regex_patterns = [
    re.compile(r'(\bEarth\b|\bJupiter\b|\bMars,12\b|\bMars\b|\bMercury\b|\bMoon\b|\bNeptune\b|\bPluto\b|\bSaturn\b|\bSun\b|\bUranus\b|\bVenus\b)')
]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

write_jsonl("NER_PLAN_V02.jsonl", examples)

In [None]:
misaligned_token(examples)

## NAME

In [27]:
label = "NAME"
texts = sents
# 'Apollonius of Perga' precedes 'Apollonius' so that the full name is matched
regex_patterns = [
    re.compile(
        r'(\bAlbategnius\b|\bAlexandria\b|\bApanius\b|\bApollonius of Perga\b|\bApollonius\b'
        r'|\bAristarchus\b|\bAristotelian\b|\bAristotle\b|\bArzachel\b|\bBaron Friedrich Hoffmann\b'
        r'|\bBohemia\b|\bBrahean\b|\bChristian Severinus\b|\bCopernican\b|\bCopernicus\b'
        r'|\bDiogenes Laertius\b|\bEast Frisia\b|\bFabrician\b|\bFabricius|\bHipparchus\b|\bHven\b'
        r'|\bJohann Schuler\b|\bLansberg\b|\bLongomontanus\b|\bMaestlin\b|\bMagini\b'
        r'|\bMatthias Seiffard\b|\bPatricius\b|\bPeurbach\b|\bPrague\b|\bPrutenics\b|\bPtolemaic\b'
        r'|\bPythagoreans\b|\bRheticus\b|\bScaliger\b|\bStadius\b|\bTheodesius\b|\bTycho Brahe\b'
        r'|\bUraniborg|\bZalippus\b)'
    )
]

examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

#write_jsonl("NER_NAME_01.jsonl", examples)

### Check for Misaligned Tokens

This step is important: if Prodigy would run into misaligned tokens, we can see them beforehand and adjust the regex accordingly.

In [31]:
misaligned_token(examples)

1- Misaligned tokens--> Those with more experience consider them with good reason to be incompetent, or (if, like that man Patricius,4 they want to be known as philosophers) to act mad with reasoning. {'start': 99, 'end': 109, 'label': 'NAME'}
2- Misaligned tokens--> The followers of Aristotle, and even Scaliger,6 who professes to be a Christian, openly contend that this motion of the orbs is voluntary, and that the principle of volition for them is intellectual intuition and desire. {'start': 37, 'end': 46, 'label': 'NAME'}
3- Misaligned tokens--> For in Maestlin,4 on the twelfth at noon, Mars is put at 8° 20’ Gemini, and on the seventeenth, again at noon, it is at 6° 25’ Gemini. {'start': 7, 'end': 16, 'label': 'NAME'}
4- Misaligned tokens--> In Stadius,5 it is 1° 52’. {'start': 3, 'end': 11, 'label': 'NAME'}
5- Misaligned tokens--> 18/28 at 10h 30m, using the Tychonic  instruments (with the help of the learned Matthias Seiffard,22 bequeathed us by Tycho), I took the distance of Mars

### Edge Cases

As the output above shows, the regular expression matched some names that have a number attached directly to them:
- Stadius,5
- Scaliger,6
- Patricius,4
- Matthias Seiffard,22
- Magini,26
- Maestlin,4
- Fabricius2
- Brahe,7
- Arzachel,8

We treat these as edge cases: the span boundaries are not consistent with the tokens assigned by the model's tokenizer.

### One Possible Solution

As with STAR and PLAN, we can extend the annotation rules to cover the examples above.
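When extending the rules, the order of the alternatives matters: Python's `re` module tries them from left to right and takes the first one that matches, so a short name listed before a longer name that starts with it hides the longer one. A small demonstration with two names from the pattern (the variable names and the sample sentence are illustrative):

```python
import re

text = "as Apollonius of Perga showed"

# Short form first: the longer alternative is never reached.
short_first = re.compile(r"\bApollonius\b|\bApollonius of Perga\b")
# Long form first: the full name is matched.
long_first = re.compile(r"\bApollonius of Perga\b|\bApollonius\b")

short_match = short_first.search(text).group(0)
long_match = long_first.search(text).group(0)
print(short_match)
print(long_match)
```

The same reasoning applies to the numbered forms: `Maestlin,4` has to come before `Maestlin`, and `Tycho Brahe,7` before `Tycho Brahe`.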

In [32]:
label = "NAME"
texts = sents
# Numbered forms precede the plain name; 'Tycho Brahe,7' is listed before 'Tycho Brahe'
# because the leftmost match would otherwise stop at 'Tycho Brahe' and leave ',7' outside the span
regex_patterns = [
    re.compile(
        r'(\bAlbategnius\b|\bAlexandria\b|\bApanius\b|\bApollonius of Perga\b|\bApollonius\b'
        r'|\bAristarchus\b|\bAristotelian\b|\bAristotle\b|\bArzachel,8\b|\bArzachel\b'
        r'|\bBaron Friedrich Hoffmann\b|\bBohemia\b|\bBrahe,7\b|\bBrahean\b|\bChristian Severinus\b'
        r'|\bCopernican\b|\bCopernicus\b|\bDiogenes Laertius\b|\bEast Frisia\b|\bFabrician\b'
        r'|\bFabricius2\b|\bFabricius|\bHipparchus\b|\bHven\b|\bJohann Schuler\b|\bLansberg\b'
        r'|\bLongomontanus\b|\bMaestlin,4\b|\bMaestlin\b|\bMagini,26\b|\bMagini\b'
        r'|\bMatthias Seiffard,22\b|\bMatthias Seiffard\b|\bPatricius,4\b|\bPatricius\b|\bPeurbach\b'
        r'|\bPrague\b|\bPrutenics\b|\bPtolemaic\b|\bPythagoreans\b|\bRheticus\b|\bScaliger,6\b'
        r'|\bScaliger\b|\bStadius,5\b|\bStadius\b|\bTheodesius\b|\bTycho Brahe,7\b|\bTycho Brahe\b'
        r'|\bUraniborg|\bZalippus\b)'
    )
]

examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this sentence
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

write_jsonl("NER_NAME_V02.jsonl", examples)

In [33]:
misaligned_token(examples)