# Running spaCy on a DataFrame of Publications about Exoplanets in NASA ADS

Before you can use spaCy, you have to download the pretrained English model with:

In [11]:
#!python -m spacy download en_core_web_md

After the first run, you can comment this line out with `#` and load the model with `nlp = spacy.load('en_core_web_md')`.

## Loading the packages

In [12]:
import pandas as pd
import numpy as np
import re
from random import randint
from random import seed
seed(5)
from unidecode import unidecode
import spacy
from spacy.lang.en import English
from spacy import displacy
nlp = spacy.load('en_core_web_md', disable=["ner", "textcat", "entity_ruler", "merge_noun_chunks", "merge_entities", "merge_subtokens"])
from IPython.display import HTML

## Loading the dataframe into python

It is important to pass `orient = 'table'` here, because the dataframe was exported with this option to make sure the exported file is valid JSON. The head of the dataframe is then printed.

In [13]:
dfExoplanetsNASA = pd.read_json('./data/dfExoplanetsNASA_v2.json', orient = 'table')
dfExoplanetsNASA = dfExoplanetsNASA[dfExoplanetsNASA.abstract != 'None'].reset_index(drop=True)
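
To see why `orient = 'table'` has to match on both sides, here is a minimal round trip on a made-up two-row frame (the values are illustrative and not taken from the dataset). The `table` orientation writes a `schema` block next to the rows, and `read_json` uses it to restore the column types and the index.

```python
from io import StringIO

import pandas as pd

# Toy frame standing in for the real export; the values are made up.
df = pd.DataFrame({"title": ["Paper A", "Paper B"], "year": [2018, 2019]})

# orient='table' stores a schema (column names and types) plus the rows.
as_json = df.to_json(orient="table")

# Read the text back with the same orient so the schema is used.
restored = pd.read_json(StringIO(as_json), orient="table")

print(restored.dtypes)
```

Reading the same text with a different `orient` would not interpret the `schema` block as intended, which is why the option is repeated in the `read_json` call.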

In [14]:
dfExoplanetsNASA.head()

Unnamed: 0,bibcode,DOI,authors,affiliation,acknowledgements,grant,published,year,title,abstract,keywords,citation_count
0,2019NewA...70....1B,10.1016/j.newast.2018.12.005,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...","[School of Physics and Electronic Science, Gui...",We thank the anonymous referee for useful comm...,[],2019-07-00,2019,First photometric study of a short-period deta...,NSVS 10441882 is a newly discovered eclipsing ...,"[Binary, Eclipsing binary, Light curve, Orbita...",0
1,2019CNSNS..71...82A,10.1016/j.cnsns.2018.10.026,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...","[Dept. de Matemáticas, UAM-Iztapalapa, Ciudad ...",E. Barrabs has been supported by grants MTM201...,[],2019-06-00,2019,Ejection-Collision orbits in the symmetric col...,"In this paper, we consider the collinear symme...","[Collinear four-body problem, Ejection/collisi...",0
2,2019NewA...69...27E,10.1016/j.newast.2018.11.008,"[Şeyda Enez, Hasan Ali Dal]","[Ege University, Department of Astronomy and S...",We wish to thank the Turkish Scientific and Te...,[],2019-05-00,2019,Cool spot migration and flare activity of KIC ...,Analysing the photometrical data taken from th...,"[Techniques: Photometric, Methods: Statistical...",0
3,2019MNRAS.483.3465F,10.1093/mnras/sty3367,[Giacomo Fragione],"[Racah Institute for Physics, The Hebrew Unive...",Author thanks Nader Haghighipour for useful an...,[],2019-03-00,2019,Dynamical origin of S-type planets in close bi...,Understanding the origin of planets that have ...,"[planets and satellites: general, binaries: cl...",0
4,2019MNRAS.483.3448M,10.1093/mnras/sty3346,"[Kristina Monsch, Barbara Ercolano, Giovanni P...","[Universitäts-Sternwarte, Ludwig-Maximilians-U...",We thank Giovanni Rosotti and Jeff Jennings fo...,[],2019-03-00,2019,The imprint of X-ray photoevaporation of plane...,High-energy radiation from a planet host star ...,"[planets and satellites: formation, planet-dis...",0


## Selecting columns to work with

For this work we need only four columns: 'authors', 'title', 'published' and 'abstract'. To save memory, we keep only these four columns in the dataframe.

In [15]:
dfExoplanetsNASA = dfExoplanetsNASA[['authors', 'title', 'published', 'abstract']]

In [16]:
dfExoplanetsNASA.head()

Unnamed: 0,authors,title,published,abstract
0,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...",First photometric study of a short-period deta...,2019-07-00,NSVS 10441882 is a newly discovered eclipsing ...
1,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...",Ejection-Collision orbits in the symmetric col...,2019-06-00,"In this paper, we consider the collinear symme..."
2,"[Şeyda Enez, Hasan Ali Dal]",Cool spot migration and flare activity of KIC ...,2019-05-00,Analysing the photometrical data taken from th...
3,[Giacomo Fragione],Dynamical origin of S-type planets in close bi...,2019-03-00,Understanding the origin of planets that have ...
4,"[Kristina Monsch, Barbara Ercolano, Giovanni P...",The imprint of X-ray photoevaporation of plane...,2019-03-00,High-energy radiation from a planet host star ...


Now the dataframe is ready to work with. The abstracts are stored in the column 'abstract'. This column can be accessed by `dfExoplanetsNASA.abstract`.

## Search for all abstracts containing ' detected'

With `.str.contains('word')` applied to a column of a dataframe, you can check very quickly which cells contain the word.

In [17]:
dfExoplanetsNASAdetect = dfExoplanetsNASA[dfExoplanetsNASA.abstract.str.contains(' detected')].reset_index()

In [18]:
print(str(len(dfExoplanetsNASAdetect)) + ' abstracts contain the word " detected"!')

3196 abstracts contain the word " detected"!
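
The leading space in `' detected'` matters. A small sketch on made-up sentences shows what it filters out. The `regex=False` and `na=False` arguments are additions for illustration; the cell above relies on the defaults.

```python
import pandas as pd

# Made-up abstracts; the last entry stands in for a missing value.
abstracts = pd.Series([
    "A planet was detected around the star.",
    "The signal remained undetected.",
    None,
])

# The leading space keeps 'undetected' from matching, na=False treats
# missing values as non-matches, and regex=False does a plain substring test.
mask = abstracts.str.contains(" detected", regex=False, na=False)
print(mask.tolist())
```

Note that this pattern also skips a sentence-initial "Detected", because the match is case-sensitive and requires a preceding space.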


## Remove LaTeX characters from randomly chosen abstracts

In the next step we want to make a spaCy doc out of these abstracts. Scientific papers, especially in the natural sciences, are often written in LaTeX, and spaCy has some trouble with special characters such as `$`. So we will remove them beforehand using regex.

In [19]:
dfExoplanetsNASAdetect['abstract'] = [re.sub(r'\\', '', re.sub('{', '(', re.sub('}', ')', re.sub(r'\$', '', re.sub('<SUB>', '_', re.sub('</SUB>', '', re.sub('<SUP>', '^', re.sub('</SUP>', '', i)))))))) for i in dfExoplanetsNASAdetect.abstract]
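
The nested calls are easier to follow when written as an ordered list of replacements. The sketch below is an equivalent formulation, not the notebook's own code: it uses plain `str.replace` because every pattern is a literal string, and the helper name `clean_abstract` is hypothetical.

```python
# Literal replacements, applied in the same order as the nested re.sub calls
# (innermost call first).
REPLACEMENTS = [
    ("</SUP>", ""),
    ("<SUP>", "^"),
    ("</SUB>", ""),
    ("<SUB>", "_"),
    ("$", ""),
    ("}", ")"),
    ("{", "("),
    ("\\", ""),
]


def clean_abstract(text):
    """Strip LaTeX and markup remnants from one abstract (hypothetical helper)."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(clean_abstract(r"H<SUB>2</SUB>O at $10^{-3}$ \mu m"))
# H_2O at 10^(-3) mu m
```

Applied to the dataframe, this would read `dfExoplanetsNASAdetect['abstract'].map(clean_abstract)`.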

## Making a spaCy doc out of the abstracts, cutting them into sentences, and POS-tagging the sentences

Now that all the problematic characters are removed, we can make a spaCy doc out of each abstract and cut the abstracts into sentences. Let's also do some part-of-speech (POS) tagging and collect the dependency labels, which together describe the syntactic structure of a sentence. In the following example I'm interested in the lemmata contained in the abstracts and in the grammatical role they play in the sentence. For an explanation of the different tags, see https://spacy.io/api/annotation.

In [30]:
sentence = []
absnum = []
tag = []
pos = []
dep = []
lemma = []
for abstract in range(len(dfExoplanetsNASAdetect['abstract'])):
    doc = nlp(dfExoplanetsNASAdetect['abstract'][abstract])
    for sent in doc.sents:
        if ' detected' in sent.text:
            absnum.append(dfExoplanetsNASAdetect['index'][abstract])
            sentence.append(sent.text.strip())
            tags = []
            poss = []
            deps = []
            lemmas = []
            for token in sent:
                tags.append(token.tag_)
                poss.append(token.pos_)
                deps.append(token.dep_)
                lemmas.append(token.lemma_)
            tag.append(tags)
            pos.append(poss)
            dep.append(deps)
            lemma.append(lemmas)

In [33]:
dfExoplanetsNASAdetected = pd.DataFrame({'absnum':absnum,'sent':sentence,'tag':tag,'pos':pos,'dep':dep,'lemma':lemma})

In [36]:
dfExoplanetsNASAdetected

Unnamed: 0,absnum,sent,tag,pos,dep,lemma
0,2,"Apart from the cool spots, flare activity is a...","[RB, IN, DT, JJ, NNS, ,, NN, NN, VBZ, RB, VBN,...","[ADV, ADP, DET, ADJ, NOUN, PUNCT, NOUN, NOUN, ...","[advmod, prep, det, amod, pobj, punct, compoun...","[apart, from, the, cool, spot, ,, flare, activ..."
1,19,We find that the detectability of transits exp...,"[PRP, VBP, IN, DT, NN, IN, NNS, VBZ, DT, JJ, C...","[PRON, VERB, ADP, DET, NOUN, ADP, NOUN, VERB, ...","[nsubj, ROOT, mark, det, nsubj, prep, pobj, cc...","[-PRON-, find, that, the, detectability, of, t..."
2,43,The transit signal was detected in the data fr...,"[DT, NN, NN, VBD, VBN, IN, DT, NNS, IN, NNP, N...","[DET, NOUN, NOUN, VERB, VERB, ADP, DET, NOUN, ...","[det, compound, nsubjpass, auxpass, ROOT, prep...","[the, transit, signal, be, detect, in, the, da..."
3,59,The molecules are detected by cross-correlatin...,"[DT, NNS, VBP, VBN, IN, JJ, JJ, VBG, DT, VBN, ...","[DET, NOUN, VERB, VERB, ADP, ADJ, ADJ, VERB, D...","[det, nsubjpass, auxpass, ROOT, agent, subtok,...","[the, molecule, be, detect, by, cross, -, corr..."
4,64,These are only the second and third sdB pulsat...,"[DT, VBP, RB, DT, JJ, CC, JJ, NNP, NNS, IN, VB...","[DET, VERB, ADV, DET, ADJ, CCONJ, ADJ, PROPN, ...","[nsubj, ROOT, advmod, det, amod, cc, conj, com...","[these, be, only, the, second, and, third, sdB..."
5,73,r ≳ 10 cMpc h^-^1 detected at 2.7σ.,"[CD, NN, CD, NNS, CD, VBD, IN, NNP, .]","[NUM, NOUN, NUM, NOUN, NUM, VERB, ADP, PROPN, ...","[subtok, nmod, nummod, nsubj, appos, ROOT, pre...","[r, ≳, 10, cMpc, h^-^1, detect, at, 2.7σ, .]"
6,73,This is in contrast to equivalent measurements...,"[DT, VBZ, IN, NN, IN, JJ, NNS, IN, JJR, NNS, W...","[DET, VERB, ADP, NOUN, ADP, ADJ, NOUN, ADP, AD...","[nsubj, ROOT, prep, pobj, prep, amod, pobj, pr...","[this, be, in, contrast, to, equivalent, measu..."
7,73,This implies that faint galaxies beyond the re...,"[DT, VBZ, IN, JJ, NNS, IN, DT, NN, IN, JJ, NNS...","[DET, VERB, ADP, ADJ, NOUN, ADP, DET, NOUN, AD...","[nsubj, ROOT, mark, amod, nsubj, prep, det, po...","[this, imply, that, faint, galaxy, beyond, the..."
8,79,No rings or other material were detected withi...,"[DT, NNS, CC, JJ, NN, VBD, VBN, IN, CD, CD, NN...","[DET, NOUN, CCONJ, ADJ, NOUN, VERB, VERB, ADP,...","[det, nsubjpass, cc, amod, conj, auxpass, ROOT...","[no, ring, or, other, material, be, detect, wi..."
9,90,The 2015 K2 observations only spanned 74.8 day...,"[DT, CD, NNP, NNS, RB, VBN, CD, NNS, ,, CC, DT...","[DET, NUM, PROPN, NOUN, ADV, VERB, NUM, NOUN, ...","[det, nummod, compound, nsubj, advmod, ROOT, n...","[the, 2015, K2, observation, only, span, 74.8,..."
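
Each row now holds parallel lists, so position `i` in `lemma`, `tag`, `pos` and `dep` describes the same token. As a small sketch, using the lists of row 5 above (which are short enough to be shown in full), this is how one might look up the fine-grained tag of the token whose lemma is `detect`. The helper name `tag_of_lemma` is hypothetical and not part of the notebook.

```python
def tag_of_lemma(lemmas, tags, target="detect"):
    """Return the tag of the first token whose lemma equals target, else None."""
    for lemma, tag in zip(lemmas, tags):
        if lemma == target:
            return tag
    return None


# Row 5 of the table above.
lemmas = ["r", "≳", "10", "cMpc", "h^-^1", "detect", "at", "2.7σ", "."]
tags = ["CD", "NN", "CD", "NNS", "CD", "VBD", "IN", "NNP", "."]

print(tag_of_lemma(lemmas, tags))  # VBD
```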


In [38]:
#dfExoplanetsNASAdetected.to_json('./data/dfExoplanetsNASAdetected_v2.json', orient = 'table')

In [52]:
dfExoplanetsNASAdetect = pd.read_json('./data/dfExoplanetsNASAdetected100rand_v2.json', orient = 'table')
len(dfExoplanetsNASAdetect)

100

In [61]:
dfExoplanetsNASAdetectes = dfExoplanetsNASAdetected.merge(dfExoplanetsNASAdetect, 'inner', 'sent')
dfExoplanetsNASAdetectes = dfExoplanetsNASAdetectes[['absnum','sent','tag','pos','dep','lemma','label']]

In [64]:
dfExoplanetsNASAdetectes = dfExoplanetsNASAdetectes.drop_duplicates(['sent', 'label']).reset_index(drop = True)
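
The merge and de-duplication steps can be illustrated on two made-up frames. The sketch assumes that the annotated file contributes a `label` column keyed by `sent`, as the column selection above suggests. The data are invented.

```python
import pandas as pd

# Made-up stand-ins: 'tagged' mimics the spaCy output, 'labelled' the annotations.
tagged = pd.DataFrame({
    "absnum": [1, 1, 2],
    "sent": ["A was detected.", "A was detected.", "B was detected."],
})
labelled = pd.DataFrame({
    "sent": ["A was detected.", "C was detected."],
    "label": ["discovery", "None"],
})

# The inner merge keeps only sentences that occur in both frames.
merged = tagged.merge(labelled, how="inner", on="sent")

# A sentence occurring twice on the left yields two merged rows; keep one.
deduped = merged.drop_duplicates(["sent", "label"]).reset_index(drop=True)
print(len(merged), len(deduped))
```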

In [66]:
#dfExoplanetsNASAdetectes.to_json('./data/dfExoplanetsNASAdetected100rand_v3.json', orient = 'table')

## Formatting the annotated sentences as tuples

In [217]:
dfExoplanetsNASA = pd.read_json('./data/dfExoplanetsNASAdetected100rand_v3.json', orient = 'table')

In [220]:
dfExoplanetsNASAtupRoot = []
for i in range(len(dfExoplanetsNASA)):
    dfExoplanetsNASAtupRoot.append((dfExoplanetsNASA.sent[i], dfExoplanetsNASA.tagRootSent[i], dfExoplanetsNASA.label[i]))

dfExoplanetsNASAtupDetected = []
for i in range(len(dfExoplanetsNASA)):
    dfExoplanetsNASAtupDetected.append((dfExoplanetsNASA.sent[i], dfExoplanetsNASA.tagDetected[i], dfExoplanetsNASA.label[i]))
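
The same tuples can be built without an explicit index loop. A minimal sketch on a made-up two-row frame that reuses the column names `sent`, `tagRootSent` and `label` from the cell above:

```python
import pandas as pd

# Made-up frame with the same column names as the annotated data.
df = pd.DataFrame({
    "sent": ["We detected a planet.", "Planets detected so far are hot."],
    "tagRootSent": ["VBD", "VBP"],
    "label": ["discovery", "None"],
})

# itertuples with name=None yields one plain tuple per row, in column order.
columns = ["sent", "tagRootSent", "label"]
tuples = list(df[columns].itertuples(index=False, name=None))
print(tuples[0])
```

Swapping `tagRootSent` for `tagDetected` in `columns` gives the second list.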

In [222]:
dfExoplanetsNASAtupRoot[:5]

[("We detected visual companions within 1'' for 5 stars, between 1'' and 2'' for 7 stars, and between 2'' and 4'' for 15 stars.",
  'VBD',
  'discovery'),
 ('Using these data and photometry from the Spitzer Space Telescope, we have identified members with infrared excess emission from circumstellar disks and have estimated the evolutionary stages of the detected disks, which include 31 new full disks and 16 new candidate transitional, evolved, evolved transitional, and debris disks.',
  'VBN',
  'discovery'),
 ('Of the over 800 exoplanets detected to date, over half are on non-circular orbits, with eccentricities as high as 0.93.',
  'VBP',
  'None'),
 ('We find that for these false positive scenarios, CO at 2.35 μm, CO_2 at 2.0 and 4.3 μm, and O_4 at 1.27 μm are all stronger features in transmission than O_2/O_3 and could be detected with S/Ns ≳ 3 for an Earth-size planet orbiting a nearby M dwarf star with as few as 10 transits, assuming photon-limited noise.',
  'VBP',
  'discovery'

In [223]:
dfExoplanetsNASAtupDetected[:5]

[("We detected visual companions within 1'' for 5 stars, between 1'' and 2'' for 7 stars, and between 2'' and 4'' for 15 stars.",
  'VBD',
  'discovery'),
 ('Using these data and photometry from the Spitzer Space Telescope, we have identified members with infrared excess emission from circumstellar disks and have estimated the evolutionary stages of the detected disks, which include 31 new full disks and 16 new candidate transitional, evolved, evolved transitional, and debris disks.',
  'VBN',
  'discovery'),
 ('Of the over 800 exoplanets detected to date, over half are on non-circular orbits, with eccentricities as high as 0.93.',
  'VBN',
  'None'),
 ('We find that for these false positive scenarios, CO at 2.35 μm, CO_2 at 2.0 and 4.3 μm, and O_4 at 1.27 μm are all stronger features in transmission than O_2/O_3 and could be detected with S/Ns ≳ 3 for an Earth-size planet orbiting a nearby M dwarf star with as few as 10 transits, assuming photon-limited noise.',
  'VBN',
  'discovery'