# Using spaCy on a DataFrame of Publications about Exoplanets from NASA ADS

Before you can use spaCy, you have to download the pretrained English model with:

In [2]:
#!python -m spacy download en_core_web_md

After running this once, you can comment the line out again with `#` and load the model with `nlp = spacy.load('en_core_web_md')`.

## Loading the packages

In [39]:
import pandas as pd
import numpy as np
import re
from random import randint
from random import seed
seed(5)
from unidecode import unidecode
import spacy
from spacy.lang.en import English
from spacy import displacy
nlp = spacy.load('en_core_web_md', disable=["ner", "textcat", "entity_ruler", "merge_noun_chunks", "merge_entities", "merge_subtokens"])
from IPython.display import HTML

## Loading the dataframe into python

Here it is important to pass `orient='table'`, because the DataFrame was exported with this option so that the exported file is valid JSON. Rows whose abstract is the string 'None' are dropped, and the head of the DataFrame is printed out.

In [4]:
dfExoplanetsNASA = pd.read_json('./data/dfExoplanetsNASA_v2.json', orient = 'table')
dfExoplanetsNASA = dfExoplanetsNASA[dfExoplanetsNASA.abstract != 'None'].reset_index(drop=True)
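Why the `orient` argument has to match can be checked on a small round trip. This is only a sketch: the toy frame and its values are made up, and the JSON is kept in memory through `io.StringIO` instead of being written to `./data/`.

```python
import io

import pandas as pd

# Toy frame standing in for the real export (values are made up).
toy = pd.DataFrame({"title": ["Paper A", "Paper B"], "year": [2018, 2019]})

# orient="table" stores a schema (column names and types) next to the data ...
json_text = toy.to_json(orient="table")

# ... so the same orient is needed to parse the text back into a DataFrame.
restored = pd.read_json(io.StringIO(json_text), orient="table")
print(restored)
```

The restored frame has the same columns and values as the toy frame, which is the behaviour the loading cell above relies on.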

In [5]:
dfExoplanetsNASA.head()

Unnamed: 0,bibcode,DOI,authors,affiliation,acknowledgements,grant,published,year,title,abstract,keywords,citation_count
0,2019NewA...70....1B,10.1016/j.newast.2018.12.005,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...","[School of Physics and Electronic Science, Gui...",We thank the anonymous referee for useful comm...,[],2019-07-00,2019,First photometric study of a short-period deta...,NSVS 10441882 is a newly discovered eclipsing ...,"[Binary, Eclipsing binary, Light curve, Orbita...",0
1,2019CNSNS..71...82A,10.1016/j.cnsns.2018.10.026,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...","[Dept. de Matemáticas, UAM-Iztapalapa, Ciudad ...",E. Barrabs has been supported by grants MTM201...,[],2019-06-00,2019,Ejection-Collision orbits in the symmetric col...,"In this paper, we consider the collinear symme...","[Collinear four-body problem, Ejection/collisi...",0
2,2019NewA...69...27E,10.1016/j.newast.2018.11.008,"[Şeyda Enez, Hasan Ali Dal]","[Ege University, Department of Astronomy and S...",We wish to thank the Turkish Scientific and Te...,[],2019-05-00,2019,Cool spot migration and flare activity of KIC ...,Analysing the photometrical data taken from th...,"[Techniques: Photometric, Methods: Statistical...",0
3,2019MNRAS.483.3465F,10.1093/mnras/sty3367,[Giacomo Fragione],"[Racah Institute for Physics, The Hebrew Unive...",Author thanks Nader Haghighipour for useful an...,[],2019-03-00,2019,Dynamical origin of S-type planets in close bi...,Understanding the origin of planets that have ...,"[planets and satellites: general, binaries: cl...",0
4,2019MNRAS.483.3448M,10.1093/mnras/sty3346,"[Kristina Monsch, Barbara Ercolano, Giovanni P...","[Universitäts-Sternwarte, Ludwig-Maximilians-U...",We thank Giovanni Rosotti and Jeff Jennings fo...,[],2019-03-00,2019,The imprint of X-ray photoevaporation of plane...,High-energy radiation from a planet host star ...,"[planets and satellites: formation, planet-dis...",0


## Selecting columns to work with

For this analysis we only need the columns 'authors', 'title', 'published' and 'abstract'. To save memory, we keep only these four columns in the DataFrame.

In [8]:
dfExoplanetsNASA = dfExoplanetsNASA[['authors', 'title', 'published', 'abstract']]

In [9]:
dfExoplanetsNASA.head()

Unnamed: 0,authors,title,published,abstract
0,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...",First photometric study of a short-period deta...,2019-07-00,NSVS 10441882 is a newly discovered eclipsing ...
1,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...",Ejection-Collision orbits in the symmetric col...,2019-06-00,"In this paper, we consider the collinear symme..."
2,"[Şeyda Enez, Hasan Ali Dal]",Cool spot migration and flare activity of KIC ...,2019-05-00,Analysing the photometrical data taken from th...
3,[Giacomo Fragione],Dynamical origin of S-type planets in close bi...,2019-03-00,Understanding the origin of planets that have ...
4,"[Kristina Monsch, Barbara Ercolano, Giovanni P...",The imprint of X-ray photoevaporation of plane...,2019-03-00,High-energy radiation from a planet host star ...


Now the dataframe is ready to work with. The abstracts are stored in the column 'abstract'. This column can be accessed by `dfExoplanetsNASA.abstract`.

## Search through all abstracts containing 'detect'

Calling `.str.contains('word')` on a column of a DataFrame quickly checks which cells contain the given string. The match is on substrings, so 'detect' also matches 'detected' or 'detection'.

In [20]:
dfExoplanetsNASAdetect = dfExoplanetsNASA[dfExoplanetsNASA.abstract.str.contains('detect')]

In [21]:
print(str(len(dfExoplanetsNASAdetect)) + ' abstracts contain the word "detect"!')

10483 abstracts contain the word "detect"!
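Because `str.contains` matches substrings, the count above also includes abstracts that only contain forms such as "detection". The following sketch contrasts this with a whole-word match on a made-up Series; the sample texts and the variable names are assumptions for illustration, and `na=False` is used so that missing abstracts count as non-matches.

```python
import pandas as pd

# Made-up abstracts, including one missing value.
abstracts = pd.Series([
    "We detect a transit signal.",
    "Planet detection is hard.",
    "Nothing relevant here.",
    None,
])

# Substring match: also hits "detection".
substring_mask = abstracts.str.contains("detect", na=False)

# Whole-word match: \b marks a word boundary in the regular expression.
word_mask = abstracts.str.contains(r"\bdetect\b", regex=True, na=False)

print(substring_mask.tolist())
print(word_mask.tolist())
```

For this notebook the substring match is the useful one, since the later steps look for every verb form whose lemma contains 'detect'.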


## Remove LaTeX characters from the selected abstracts

In the next step we want to make a spaCy doc out of these abstracts. As scientific papers, especially in the natural sciences, are often written in LaTeX, spaCy has some trouble with special characters such as $. So we remove them beforehand using regular expressions.
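The effect of these substitutions can be tried on a single string before they are applied to the whole column in the next cell. The sketch below lists them as ordered (pattern, replacement) pairs; the helper name `clean_markup` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

# (pattern, replacement) pairs, applied in this order.
SUBSTITUTIONS = [
    (r"</SUP>", ""),
    (r"<SUP>", "^"),
    (r"</SUB>", ""),
    (r"<SUB>", "_"),
    (r"\$", ""),    # drop math-mode delimiters
    (r"\}", ")"),
    (r"\{", "("),
    (r"\\", ""),    # drop backslashes of LaTeX commands
]


def clean_markup(text):
    """Apply all substitutions to one string (hypothetical helper)."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


# Made-up sample text containing every kind of markup handled above.
sample = r"A planet of $2.1$ M<SUB>Jup</SUB> at 10<SUP>-3</SUP> au with \alpha{x}"
cleaned = clean_markup(sample)
print(cleaned)
```

The cleaned string reads `A planet of 2.1 M_Jup at 10^-3 au with alpha(x)`: subscripts and superscripts survive as `_` and `^`, while the characters that disturb the tokenizer are gone.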

In [25]:
# Work on an explicit copy so that the assignment below does not trigger
# pandas' SettingWithCopyWarning.
dfExoplanetsNASAdetect = dfExoplanetsNASAdetect.copy()
# Drop $ and backslashes, map braces to parentheses, turn <SUB>/<SUP> tags into _ and ^.
dfExoplanetsNASAdetect['abstract'] = [re.sub(r'\\', '', re.sub(r'\{', '(', re.sub(r'\}', ')', re.sub(r'\$', '', re.sub('<SUB>', '_', re.sub('</SUB>', '', re.sub('<SUP>', '^', re.sub('</SUP>', '', i)))))))) for i in dfExoplanetsNASAdetect.abstract]


## Making a spaCy doc out of the abstracts, cutting them into sentences and POS-tagging the sentences

Now that the problematic characters are removed, we can make a spaCy doc out of each abstract and cut it into sentences. Let's also do some part-of-speech (POS) tagging, which annotates the syntactic structure of a sentence. Here I am interested in the lemmata contained in the abstracts and the grammatical role they play in their sentence, so for every sentence containing 'detect' the text, lemma, POS, tag and dependency label of each token are stored. The grammatical structure of the first of these sentences is displayed further below. For an explanation of the different tags go to https://spacy.io/api/annotation. The first five sentences are printed out.

In [40]:
sentences = []
tokens = []
for abstract in dfExoplanetsNASAdetect['abstract']:
    doc = nlp(abstract)
    for sent in doc.sents:
        # Span.text is the supported attribute; the older Span.string
        # is not available in newer spaCy releases.
        if 'detect' in sent.text:
            sentences.append(sent.text.strip())
            # One row per token: [text, lemma, coarse POS, fine-grained tag, dependency label]
            tokens.append([[token.text, token.lemma_, token.pos_, token.tag_, token.dep_] for token in sent])

In [41]:
sentences[:5]

['Apart from the cool spots, flare activity is also detected on the target, and 226 flares were determined with their parameters.',
 'I prove that a transit shadow - whether umbral, antumbral, or penumbral - takes the shape of a parabolic cylinder, and finally present geometric constraints on Earth-based observers hoping to detect a three-body syzygy (or perfect alignment) - either in extrasolar systems or within the Solar system - potentially as a double annular eclipse.',
 'As radio emission from solar-like stars is concentrated in active regions, a planet occulting a star-spot can cause a disproportionately deep transit which should be detectable with major radio arrays currently under development, such as the Square Kilometre Array (SKA).',
 'We calculate the radiometric sensitivity of the SKA stages and components, finding that SKA2-Mid can expect to detect transits around the very nearest solar-like stars and many cool dwarfs.',
 'We investigate the role that planet detection ord

After removing all the problematic characters, the split sentences look quite clean!

## Different usage of "detect" as a verb

Which syntactic functions can 'detect' have as a verb? This is analysed by looking at its dependency label.

In [48]:
functionDetect = []
for i in tokens:
    for j in i:
        if 'detect' in j[1] and j[2] == 'VERB':
            functionDetect.append(j[4])
setFunctionDetect = list(set(functionDetect))

In [49]:
setFunctionDetect

['xcomp',
 'ccomp',
 'advcl',
 'csubjpass',
 'dep',
 'parataxis',
 'csubj',
 'nsubj',
 'pcomp',
 'relcl',
 'nmod',
 'acl',
 'meta',
 'amod',
 'conj',
 'ROOT',
 'oprd',
 'pobj',
 'acomp',
 'dobj',
 'compound']

These are the different functions of the verb 'detect' in the abstracts.
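The set only shows which labels occur, not how often. A possible follow-up is to count them with `collections.Counter`. The sketch below runs on a few made-up rows in the same `[text, lemma, pos, tag, dep]` layout as `tokens`; the rows and the name `toy_tokens` are assumptions, not output of the notebook.

```python
from collections import Counter

# Made-up sentences in the notebook's layout: [text, lemma, pos, tag, dep].
toy_tokens = [
    [["is", "be", "AUX", "VBZ", "auxpass"],
     ["detected", "detect", "VERB", "VBN", "ROOT"]],
    [["to", "to", "PART", "TO", "aux"],
     ["detect", "detect", "VERB", "VB", "xcomp"]],
    [["detected", "detect", "VERB", "VBN", "ROOT"],
     ["detection", "detection", "NOUN", "NN", "nsubj"]],
]

# Count the dependency label of every verb whose lemma contains 'detect'.
dep_counts = Counter(
    dep
    for sentence in toy_tokens
    for _text, lemma, pos, _tag, dep in sentence
    if "detect" in lemma and pos == "VERB"
)

print(dep_counts.most_common())
```

Replacing `toy_tokens` with `tokens` would give the frequency of each function over all abstracts, with the most common labels listed first.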

## Visualisation of the structure of the sentence

For a better understanding, let's visualize the grammatical structure of the first extracted sentence:

In [51]:
doc = nlp(sentences[0])
displacy.render(doc, style="dep", options={'compact': True})