# Regex and Prodigy (LABEL: TIME)



This notebook annotates entities in sentences, following Prodigy's format, for different labels and for later use in machine learning. Prodigy is an annotation tool based on spaCy. See <a href="https://prodi.gy/">Prodigy</a>.

## Imports

In [1]:
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
import spacy
from prodigy.util import write_jsonl



## Read the Data

In [2]:
importVersion = '013'

In [3]:
# Path to the pickled DataFrame; adjust it to your local machine.
# On Windows, use a raw string (prefix the path with r) so backslashes are not read as escapes.
path = '../data/01_df_v{0}.pickle'.format(importVersion)
dfAstroNova = pd.read_pickle(path)

### Sort by Chapter

In [4]:
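# Sort the data by chapter. "appendix b" is temporarily mapped to NaN so the column
# can be cast to float and sorted numerically (NaN sorts last), then restored.
dfAstroNova['chapter'] = dfAstroNova.chapter.replace("appendix b", np.nan).astype(float)
dfAstroNova = dfAstroNova.rename_axis('MyIdx').sort_values(by=['chapter', 'MyIdx'], ascending=[True, True])
# Assign the result back instead of calling fillna(inplace=True) on an attribute access.
dfAstroNova['chapter'] = dfAstroNova['chapter'].fillna('appendix b')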

In [5]:
dfAstroNova.reset_index(inplace=True)
dfAstroNova=dfAstroNova.drop("MyIdx",axis=1,inplace=False)
dfAstroNova=dfAstroNova.drop("html",axis=1)

In [6]:
type(dfAstroNova)

pandas.core.frame.DataFrame

In [7]:
dfAstroNova.head(5)

|  | text | links | italic | chapter | graphic | table | marginal | sentences | tagged |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chapter 1 | [] | [] | 1 | [] | [] | [] | [Chapter 1] | [[(Chapter, None), (1, NUM)]] |
| 1 | On the distinction between the first motion an... | [] | [] | 1 | [] | [] | [] | [On the distinction between the first motion a... | [[(On, None), (the, None), (distinction, None)... |
| 2 | The testimony of the ages confirms that the mo... | [] | [] | 1 | [] | [] | [ Terms: 1. The first motion is that of the wh... | [The testimony of the ages confirms that the m... | [[(The, None), (testimony, None), (of, None), ... |
| 3 | It is just this from which astronomy arose amo... | [] | [] | 1 | [ ch 1 gr 1] | [] | [] | [It is just this from which astronomy arose am... | [[(It, None), (is, None), (just, None), (this,... |
| 4 | Before the distinction between the first motio... | [] | [(such] | 1 | [] | [] | [ 2] | [Before the distinction between the first moti... | [[(Before, None), (the, None), (distinction, N... |


In [8]:
len(dfAstroNova)

1605

### Read the Sentences

In [9]:
df=dfAstroNova.sentences


In [10]:
sents=[]
for para in dfAstroNova.sentences:
    sents +=para

## Annotation Based on a Format Consistent with Prodigy

Prodigy accepts a specific format: JSONL (newline-delimited JSON). Entities and other highlighted spans of text are defined in the "spans" property. An example task looks like this dictionary:

{'text': 'On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus, with a diurnal motion of 22’ 54” ^15.',
  'spans': [{'start': 22, 'end': 28, 'label': 'TIME'}]}

Here 'start' and 'end' are the character offsets of the entity (in this case TIME) within the sentence. The 'end' offset is exclusive, as in Python slicing, so text[22:28] is '8h 20m'.
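As a minimal sketch of this format, the snippet below builds one task by hand and serializes it with the standard `json` module, one JSON object per line. The helper name `to_jsonl` is hypothetical and only illustrates the layout; in this notebook the actual writing is done by `write_jsonl` from `prodigy.util`.

```python
import json

text = ("On 1595 October 30 at 8h 20m, they found Mars at 17° 48’ Taurus, "
        "with a diurnal motion of 22’ 54” ^15.")

# Offsets are 0-based character positions; "end" is exclusive, as in slicing.
entity = "8h 20m"
start = text.index(entity)
end = start + len(entity)
task = {"text": text, "spans": [{"start": start, "end": end, "label": "TIME"}]}


def to_jsonl(tasks):
    """Serialize tasks as newline-delimited JSON (hypothetical helper)."""
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in tasks)


line = to_jsonl([task])

# Round trip: parse the line again and slice the entity back out of the text.
restored = json.loads(line)
span = restored["spans"][0]
print(restored["text"][span["start"]:span["end"]])
```

Slicing the text with the stored offsets is a quick sanity check that a span really covers the intended entity.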

In [11]:
label = "TIME"
texts = sents

# Every alternative is a raw string so that \d and \s reach the regex engine unchanged.
regex_patterns = [
    re.compile(
        r"\d{1,2}h\s\d{1,2}m"                          # e.g. 10h 50m
        r"|\d{1,3}h"                                    # e.g. 8h
        r"|\d{1,2}\s?hours?\sand\s\d{1,2}\s?minutes"    # e.g. 4 hours and 5 minutes
        r"|\d{1,2}\s?hours\s\d{1,2}\s?minutes"          # e.g. 14 hours 41 minutes
        r"|\d{1,3}\s?hours"                             # e.g. 120 hours
    )
]

examples = []
for text in texts:
    spans = []  # one list per sentence, filled by every pattern
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

#write_jsonl("NER_TIME_01.jsonl", examples)

In [12]:
def print_sents_entities(examples):
    """Print every sentence that has at least one span, with its first entity only."""
    for example in examples:
        if example["spans"]:
            first = example["spans"][0]
            print(example["text"], [example["text"][first["start"]:first["end"]]])

In [13]:
print_sents_entities(examples)

On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations,  by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. ['10h 50m']
Therefore, on the seventeenth at the same hour of 10h 50m, Mars ought to have been seen at either 6° 41’ 50” Gemini, or 6° 44’ 50”. ['10h 50m']
At 9h 40m (which Tycho gives as the moment of opposition), it is 1’ 4” farther forward, at either 6° 42’ 54” or 6° 45’ 54”. ['9h 40m']
On 1582 December 28 at 11h 30m, they set Mars down at 16° 47’ Cancer by observation ^6. ['11h 30m']
On 1585 January 31 at 12h 0m, Mars was placed at 21° 18’ 11” Leo ^8. ['12h 0m']
The moment of opposition followed at 19h 35m, 7 hours and 35 minutes later. ['19h 35m']
On 1587 March 7 at 19h 10m they deduced the position of Mars from the observations, which was 25° 10’ 20” Virgo. ['19h 10m']
This they kept in the table, but changed the time to 17h 22m. ['17h 22m']
The difference of 1h 48

In [14]:
def print_entities(examples):
    """Collect the first entity of every sentence that has a span, then print them."""
    entities = []
    for example in examples:
        if example["spans"]:
            first = example["spans"][0]
            entities.append(example["text"][first["start"]:first["end"]])
    print("There is {} entities with this detail\n{}".format(len(entities), entities))

In [15]:
print_entities(examples)

There is 183 entities with this detail
['10h 50m', '10h 50m', '9h 40m', '11h 30m', '12h 0m', '19h 35m', '19h 10m', '17h 22m', '1h 48m', '12h 5m', '1h 30m', '12h 20m', '4 hours and 5 minutes', '4h 5m', '10h 30m', '8h 17m', '8h 20m', '11h 48m', '8h 30m', '5h 5m', '11h 50m', '11h 40m', '9h 40m', '8h 28m', '7h 15m', '8h', '7h 47m', '24h 21m', '10h', '7h 30m', '5h 0m', '5h', '5h 20m', '3h 0m', '23h 45m', '7h 34m', '4h 52m', '9h 18m', '8h 52m', '7h 53m', '7h 3m', '22h 11m', '8 hours', '7h 10m', '10h 15m', '12h 20m', '7h 10m', '6h 30m', '6h', '6h', '8h', '9h 42m', '6h 30m', '10h 50m', '120 hours', '14 hours 41 minutes', '1h 31m', '11h 30m', '24 hours', '11 hours 30 minutes', '12h 0m', '24 hours', '19h 14m', '12h', '1h 16m', '24 hours', '7h 23m', '12h 5m', '24 hours', '6h 23m', '12h 20m', '11h 50m', '19 hours 24 minutes', '10h 30m', '10h 20m', '6 hours', '5h 27m', '8h 20m', '16 hours', '0h 39m', '8h 30m', '7 hours', '3h 44m', '3h 40m', '11h 40m', '2h 22m', '10h 30m', '12h 40m', '3 hours 43 min
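Both helpers above report only the first span of each sentence, so a sentence with two time expressions contributes a single entity to the list. The following is a minimal sketch for listing every span instead; `make_task` and `all_entities` are hypothetical names that are not part of the notebook, and the pattern simply repeats the TIME alternatives used here.

```python
import re

# Same alternatives as the TIME pattern used in this notebook.
TIME_PATTERN = re.compile(
    r"\d{1,2}h\s\d{1,2}m"
    r"|\d{1,3}h"
    r"|\d{1,2}\s?hours?\sand\s\d{1,2}\s?minutes"
    r"|\d{1,2}\s?hours\s\d{1,2}\s?minutes"
    r"|\d{1,3}\s?hours"
)


def make_task(text, label="TIME"):
    """Build one Prodigy-style task dict from a sentence (hypothetical helper)."""
    spans = [{"start": m.start(), "end": m.end(), "label": label}
             for m in TIME_PATTERN.finditer(text)]
    return {"text": text, "spans": spans}


def all_entities(examples):
    """Return the text of every span in every example (hypothetical helper)."""
    return [ex["text"][s["start"]:s["end"]]
            for ex in examples for s in ex["spans"]]


sentence = "The moment of opposition followed at 19h 35m, 7 hours and 35 minutes later."
found = all_entities([make_task(sentence)])
print(found)
```

For this sentence the sketch reports both the clock time and the duration, whereas the first-span helpers show only `19h 35m`.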

## See Misaligned Tokens

This step is important: if some spans do not line up with token boundaries, Prodigy will run into misaligned tokens. Checking for them beforehand lets us see the affected spans and adjust the regex accordingly.
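To make "misaligned" concrete: a span is aligned when its start coincides with the start of a token and its end with the end of a token. The sketch below checks this with a deliberately simplified tokenizer that splits on whitespace and peels punctuation off the edges of each chunk. This is only an assumed stand-in for spaCy's tokenizer, whose rules are more elaborate, and `token_boundaries` and `is_aligned` are hypothetical names.

```python
import re

PUNCT = ".,;:!?()[]"


def token_boundaries(text):
    """Return the sets of token start and end offsets for a simplified tokenizer."""
    starts, ends = set(), set()
    for chunk in re.finditer(r"\S+", text):
        lo, hi = chunk.span()
        # Leading punctuation: each character becomes its own token.
        while lo < hi and text[lo] in PUNCT:
            starts.add(lo)
            ends.add(lo + 1)
            lo += 1
        # Trailing punctuation: each character becomes its own token.
        while hi > lo and text[hi - 1] in PUNCT:
            starts.add(hi - 1)
            ends.add(hi)
            hi -= 1
        # Whatever remains (including inner punctuation) stays one token.
        if lo < hi:
            starts.add(lo)
            ends.add(hi)
    return starts, ends


def is_aligned(text, start, end):
    """True if the span starts at a token start and ends at a token end."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends


text = "On 1580 November 12 at 10h 50m,1 they set Mars down"
print(is_aligned(text, 23, 30))  # span "10h 50m" ends inside the chunk "50m,1"
print(is_aligned(text, 23, 32))  # span "10h 50m,1" ends with the chunk
```

Under this simplified model the span `10h 50m` is rejected because the comma and the digit that follow are glued to `50m` in one token, which is the same kind of situation the spaCy-based check below reports.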

In [16]:
def misaligned_token(examples):
    counter=0
    nlp = spacy.load("en_core_web_sm")  
    for example in examples:  
        doc = nlp(example["text"])
        for span in example["spans"]:
            char_span = doc.char_span(span["start"], span["end"])
            if char_span is None:  
                counter+=1
                print("{}- Misaligned tokens-->".format(counter), example["text"],span)

In [17]:
misaligned_token(examples)

1- Misaligned tokens--> On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations,  by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. {'start': 23, 'end': 30, 'label': 'TIME'}
2- Misaligned tokens--> The moment designated for the opposition preceded this by 8h 17m (for it was at 2h 13m),14 to which corresponds a motion of 5’ 48” eastward. {'start': 80, 'end': 86, 'label': 'TIME'}
3- Misaligned tokens--> And since the diurnal motion at 21° Gemini (Cancer, today) is about 23', and that of the sun, 61', and the sum is 1° 24', those 41' therefore require 8 hours,11 at which time Mars was visible at 21° 8' Gemini, opposite the sun's apparent position. {'start': 150, 'end': 157, 'label': 'TIME'}


# Edge Cases

As the output above shows, the regular expression ran into a few specific formats in which the matched time is glued to trailing characters:
- 10h 50m,1
- 2h 13m),14
- 8 hours,11


We treat these as edge cases: the span found by the regex does not line up with the tokens assigned by the model's tokenizer, because the match ends inside a token rather than at its end. Here we have only a small set of edge cases.

## One Possible Solution 

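We can modify our annotation rules based on the examples above, adding the three edge cases to the pattern as literal alternatives so that the span covers the trailing characters as well: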

In [18]:
label = "TIME"
texts = sents

# The three literal edge cases come first: alternation is tried left to right,
# so they must precede the generic patterns to win at the same position.
regex_patterns = [
    re.compile(
        r"10h 50m,1|2h 13m\),14|8 hours,11"
        r"|\d{1,2}h\s\d{1,2}m"
        r"|\d{1,3}h"
        r"|\d{1,2}\s?hours?\sand\s\d{1,2}\s?minutes"
        r"|\d{1,2}\s?hours\s\d{1,2}\s?minutes"
        r"|\d{1,3}\s?hours"
    )
]

examples = []
for text in texts:
    spans = []  # one list per sentence, filled by every pattern
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": label})
    examples.append({"text": text, "spans": spans})

write_jsonl("NER_TIME_V03.jsonl", examples)
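The order of the alternatives matters in this pattern. Python's `re` module tries alternatives from left to right and keeps the first one that matches at a given position, so the literal edge cases have to precede the generic patterns. A minimal check on one sentence, with illustrative variable names that are not part of the notebook:

```python
import re

GENERIC = r"\d{1,2}h\s\d{1,2}m"   # generic "hours minutes" alternative
EDGE = r"10h 50m,1"               # one of the literal edge cases

edge_first = re.compile(EDGE + "|" + GENERIC)
generic_first = re.compile(GENERIC + "|" + EDGE)

text = "On 1580 November 12 at 10h 50m,1 they set Mars down"

m_edge = edge_first.search(text)
m_generic = generic_first.search(text)

print(m_edge.group(), m_edge.span())        # literal edge case wins
print(m_generic.group(), m_generic.span())  # generic pattern wins, ",1" is left out
```

With the generic alternative first, the match stops after `50m` and the span is the same one that was reported as misaligned.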

In [19]:
misaligned_token(examples)