# Refined Named Entity Recognition (NER) with a Customized spaCy Model and Regex Pattern Rules
 Gerd Graßhoff$^{(1,2)}$, Mohammad Yeghaneh$^{(2)}$

1: Max-Planck-Institute for the History of Science, Berlin

2: Humboldt University, Berlin

Date: October 2019

This notebook renders, in color, the output of a refined named entity recognition model on the famous book "Astronomia nova" (1609) ("The New Astronomy"), one of Kepler's most important works in astronomy. Using the observations in Tycho Brahe's superior planetary tables, Kepler realized that the orbit of Mars fits an ellipse.
<br>

We annotated the following entities in this book:
    
<ul>
<li>LONG: Longitudes and Coordinates in Different Formats</li>
<li>PARA: Numerical Parameters</li>
<li>ASTR: Astronomical Names</li>
<li>DATE: Dates in Different Formats</li>
<li>TIME: Times in Different Formats</li>
<li>STAR: Names of Stars</li>
<li>PLAN: Names of Planets</li>
<li>NAME: Names of People and Places</li>
<li>GEOM: Geometric Shapes</li>
</ul>
We first annotated the text one entity type at a time using regex pattern rules; an annotator then corrected the results in Prodigy, an annotation tool powered by active learning. At each step we trained a separate spaCy NER model for the entity type in question, and then merged the annotations. This way, the annotation for any single entity type can be improved independently, either with more precise regexes or with a different deep learning model. The merged annotation of all entities is rendered in color at the end of the notebook.
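The regex-rule step described above can be sketched as follows. This is a minimal illustration, not the rule set used for the model: the two patterns, the `PATTERNS` table, and the `annotate` helper are hypothetical. The output layout, `(text, {"entities": [(start, end, label)]})` with character offsets, is the one spaCy 2.x accepts as NER training data.

```python
import re

# Hypothetical pattern rules: one compiled regex per entity label.
PATTERNS = {
    "DATE": re.compile(
        r"\b\d{1,2}(?:st|nd|rd|th)? of "
        r"(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\b"
    ),
    "PLAN": re.compile(r"\b(?:Mercury|Venus|Earth|Mars|Jupiter|Saturn)\b"),
}


def annotate(text):
    """Return (text, {"entities": [(start, end, label), ...]}), sorted by offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return text, {"entities": sorted(spans)}


example = annotate("On the 15th of May the orbit of Mars was examined.")
print(example[1]["entities"])
```

Keeping one pattern per label mirrors the entity-by-entity workflow: a single rule can be tightened without touching the others.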


In [3]:
import spacy 
from spacy import displacy 


##  Resources

### Import resources

In [4]:
import pandas as pd
import numpy as np
importVersion = '013'

In [5]:
# Set this to the location of the data on your machine; on Windows, use a raw string (r'...') so backslashes are kept literally
path = '../data/01_df_v{0}.pickle'.format(importVersion)
dfAstroNova = pd.read_pickle(path)

## Sort the data

In [6]:
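# Sort the rows by chapter number; "appendix b" is mapped to NaN first so the column is numeric
dfAstroNova['chapter'] = dfAstroNova.chapter.replace("appendix b", np.nan).astype(float)
# Within each chapter, keep the original row order; NaN (the appendix) sorts last
dfAstroNova = dfAstroNova.rename_axis('MyIdx').sort_values(by=['chapter', 'MyIdx'], ascending=[True, True])
# Restore the appendix label; assigning the result back avoids relying on an in-place fill of a selected column
dfAstroNova['chapter'] = dfAstroNova['chapter'].fillna('appendix b')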
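The sorting trick in this cell can be tried on a small table. The sketch below is an illustration under assumptions: the `toy` frame and its values are made up, and `pd.to_numeric(..., errors="coerce")` is swapped in for the `replace(...).astype(float)` step because it turns any non-numeric label into NaN.

```python
import pandas as pd

# Hypothetical miniature of the real table: chapter labels stored as strings.
toy = pd.DataFrame(
    {"chapter": ["2", "appendix b", "10", "1"], "text": ["b", "z", "c", "a"]}
)

# Non-numeric labels become NaN, so the column can be sorted as numbers.
toy["chapter"] = pd.to_numeric(toy["chapter"], errors="coerce")

# Sort by chapter, with the original row order (the named index) as tie-breaker;
# sort_values places NaN rows last by default.
toy = toy.rename_axis("MyIdx").sort_values(by=["chapter", "MyIdx"])

# Restore the label on an object-typed column so numbers and strings can mix.
toy["chapter"] = toy["chapter"].astype(object).fillna("appendix b")

print(toy)
```

Sorting the labels as plain strings would put chapter 10 before chapter 2, which is why the numeric detour is needed.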

In [7]:
# Renumber the rows in sorted order (discarding the old index) and drop the unused html column
dfAstroNova = dfAstroNova.reset_index(drop=True)
dfAstroNova = dfAstroNova.drop("html", axis=1)

In [8]:
type(dfAstroNova)

pandas.core.frame.DataFrame

In [9]:
dfAstroNova.head(5)

| | text | links | italic | chapter | graphic | table | marginal | sentences | tagged |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chapter 1 | [] | [] | 1 | [] | [] | [] | [Chapter 1] | [[(Chapter, None), (1, NUM)]] |
| 1 | On the distinction between the first motion an... | [] | [] | 1 | [] | [] | [] | [On the distinction between the first motion a... | [[(On, None), (the, None), (distinction, None)... |
| 2 | The testimony of the ages confirms that the mo... | [] | [] | 1 | [] | [] | [ Terms: 1. The first motion is that of the wh... | [The testimony of the ages confirms that the m... | [[(The, None), (testimony, None), (of, None), ... |
| 3 | It is just this from which astronomy arose amo... | [] | [] | 1 | [ ch 1 gr 1] | [] | [] | [It is just this from which astronomy arose am... | [[(It, None), (is, None), (just, None), (this,... |
| 4 | Before the distinction between the first motio... | [] | [(such] | 1 | [] | [] | [ 2] | [Before the distinction between the first moti... | [[(Before, None), (the, None), (distinction, N... |


In [25]:
# Flatten the per-row sentence lists into a single list of sentences
texts = []
for sen in dfAstroNova.sentences:
    texts += sen
    

In [26]:
str1 = ' '.join(texts)

In [27]:
len(texts)

6699

## Load the custom spaCy model trained on data annotated with regex pattern rules

In [28]:
path= '../data/Model_V21'
nlp=spacy.load(path)

In [29]:
doc=nlp(str1)

In [30]:
# Entity labels to display and the CSS color used for each label
options = {
    "ents": ["TIME", "DATE", "PARA", "ASTR", "LONG", "STAR", "PLAN", "NAME", "GEOM"],
    "colors": {"TIME": "CORAL", "DATE": "TOMATO", "PARA": "LIGHTGRAY", "ASTR": "AQUA", "LONG": "MAGENTA",
               "STAR": "GREEN", "PLAN": "LIME", "NAME": "ROSYBROWN", "GEOM": "BLUE"},
}
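Because the label list and the color table must name the same labels, one option is to derive both from a single mapping. The sketch below is only a suggestion: `ENTITY_COLORS`, `build_options`, and `options_sketch` are hypothetical names, and the colors simply repeat the ones used in this notebook.

```python
# Hypothetical single source of truth: entity label -> CSS color name.
ENTITY_COLORS = {
    "TIME": "coral",
    "DATE": "tomato",
    "PARA": "lightgray",
    "ASTR": "aqua",
    "LONG": "magenta",
    "STAR": "green",
    "PLAN": "lime",
    "NAME": "rosybrown",
    "GEOM": "blue",
}


def build_options(colors):
    """Build a displaCy-style options dict whose 'ents' and 'colors' always agree."""
    return {"ents": list(colors), "colors": dict(colors)}


options_sketch = build_options(ENTITY_COLORS)
print(options_sketch["ents"])
```

Adding a new entity type then takes one new entry in the mapping rather than two coordinated edits.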

## Colorful visualization of entities  

In [31]:
displacy.render(doc, style="ent", jupyter=True, options=options)

## Named entity types overall and by target class
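One way to tabulate entity types is sketched below. It is an illustration under assumptions, not the notebook's own computation: the `pairs` list is made-up sample data, and with a processed spaCy document the pairs could instead be collected as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs standing in for the entities of a document.
pairs = [
    ("Mars", "PLAN"),
    ("Saturn", "PLAN"),
    ("Mars", "PLAN"),
    ("ellipse", "GEOM"),
    ("Tycho Brahe", "NAME"),
]

# Overall number of entities per type.
label_counts = Counter(label for _, label in pairs)

# Frequency of each surface form within its entity type.
forms_by_label = defaultdict(Counter)
for text, label in pairs:
    forms_by_label[label][text] += 1

print(label_counts.most_common())
print(forms_by_label["PLAN"].most_common())
```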

## NER results on other texts by Kepler (in progress)

This section only shows the model's results on other texts by Kepler. Applying the model to texts beyond the annotated book is one of our motivations for using machine learning and deep learning in this context.

In [19]:
# Triple quotes let the sample text span more than one line
str2 = """ if you want the exact time, was conceived mentally on the March 9 in this year One Thousand Six Hundred and Eighteen but unfelicitously submitted to calculation and rejected as false, finally, summoned back on the 15th of May, with a fresh assault undertaken, outfought the darkness of my mind by the great proof afforded by my labor of seventeen years on Brahe's observations and meditation upon it uniting in one concord, in such fashion that I first believed I was dreaming and was presupposing the object of my search among the principles. But it is absolutely certain and exact that the ratio which exists between the periodic times of any two planets is precisely the ratio of the 3/2th power of the mean distances, i.e., of the spheres themselves; provided, however, that the arithmetic mean between both diameters of the elliptic orbit be slightly less than the longer diameter. And so if any one take the period, say, of the Earth, which is one year, and the period of Saturn, which is thirty years, and extract the cube roots of this ratio and then square the ensuing ratio by squaring the cube roots, he will have as his numerical products the most just ratio of the distances of the Earth and Saturn from the sun.1Ninthly [IX]: If now you wish to measure with the same yardstick, so to speak, the true daily journeys of each planet through the ether, two ratios are to be compounded—the ratio of the true (not the apparent) diurnal arcs of the eccentric, and the ratio of the mean intervals of each planet from the sun (because that is the same as the ratio of the amplitude of the spheres),i.e., the true diurnal arc of each planet is to be multiplied by the semidiameter of its sphere: the products will be numbers fitted for investigating whether or not those journeys are in harmonic ratios.
For the cube root of 1 is 1, and the square of it is 1; and the cube root of 30 is greater than 3, and therefore the square of it is greater than 9. And Saturn, at its mean distance from the sun, is slightly higher [280] than nine times the mean distance of the Earth from the sun. Further on, in Chapter 9, the use of this theorem will be necessary for the demonstration of the eccentricities."""

In [20]:
doc2=nlp(str2)

In [21]:
displacy.render(doc2, style="ent", jupyter=True, options=options)

In [22]:
# Triple quotes let the sample text span more than one line
str3 = """circle square the private ratio of Jupiter circle alone had to be 6,561 : 8,000, i.e., 104,976 : 128,000 (by Proposition XXVIII). Therefore, if the compound ratio of both is divided by this, the private ratio of Mars will be left as 72,900 : 104,976,i.e., 25 : 36, the square root of which is 5 : 6.In another fashion, as follows: There is 1 : 32 or 120 : 3,840 from the aphelial movement of Saturn to the aphelial movement of the Earth, but from that same movement to the perihelial of Jupiter there is 1 : 3 or 120 : 360, with its increment. But from this to the aphelial movement of Mars is 5 : 24 or 360 : 1,728. Accordingly, from the aphelial movement of Mars to the aphelial movement of the Earth, there remains 1,728 : 3,840 minus the increment of the ratio of the diverging movements of Saturn and Jupiter. But from the same aphelial movement of the Earth to the perihelial of Mars there is 3 : 2,i.e., 3,840 : 2,500. Therefore between the aphelial and perihelial movements of Mars there remains the ratio 1,728 : 2,560,i.e., 27 : 40 or 81 : 120, minus the said increment. But 81 : 120 is a comma less than 80 : 120 or 2 : 3. Therefore, if a comma is taken away from 2 : 3, and the said increment (which by Proposition XXXVII is equal to the private ratio of Venus) is taken away too, the private ratio of Mars is left. But the private ratio of Venus is the diesis diminished by a comma, by Proposition XXVI. But the comma and the diesis diminished by a comma make a full diesis or 24 : 25. Therefore if you divide 2 : 3,i.e., 24 : 36 by the diesis 24 : 25, Mars’ private ratio of 25 : 36 is left, as before, the square root of which, or 5 : 6, goes to the intervals, by Chapter 3.Behold again the reason why—above, in Chapter 4—the extreme intervals of Mars have been found to embrace the harmonic ratio 5 : 6.XLII.
PROPOSITION.The great ratio of Mars and the Earth, or the common ratio of the diverging movements, has been necessarily made to be 54 : 125, smaller than the consonance 5 : 12 established by the prior reasons. For the private ratio of Mars had to be a perfect fifth, from which a diesis has been taken away, by the preceding proposition. But the common or minor ratio of the converging movements of Mars and the Earth had to be a perfect fifth or 2 : 3, by Proposition XV. Finally, the private ratio of the Earth is the diesis squared, from which a comma is taken away, by Propositions XXVI and XXVIII. But out of these elements is compounded the major ratio or that of the diverging movements of Mars and the Earth—and it is two perfect fifths (or 4 : 9, i.e., 108 : 243) plus a diesis diminished by a comma,i.e., plus 243 : 250; namely, it is 108 : 250 or 54 : 125,i.e., 608 : 1,500. But this is smaller than 625 : 1,500,i.e., than 5 : 12, in the ratio 602 : 625, which is approximately 36 : 37, smaller than 625 : 1,500,i.e., than 5 : 12, in the ratio 602 : 625, which is approximately 36 : 37, smaller than the least concord."""

In [23]:
doc3=nlp(str3)

In [24]:
displacy.render(doc3, style="ent", jupyter=True, options=options)