# Applying spaCy to a DataFrame of Exoplanet Publications from NASA ADS

Before you can use spaCy, you have to download the pretrained model for the English language with:

In [1]:
# python -m spacy download en_core_web_md

After running the download once, you can comment the line out with `#` and load the model with `nlp = spacy.load('en_core_web_md')`.
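A minimal sketch of a guard for this step. It relies on the fact that `python -m spacy download` installs the model as a regular Python package, so its presence can be checked before loading. The helper name `model_installed` is hypothetical and only used for illustration.

```python
import importlib.util

MODEL_NAME = "en_core_web_md"


def model_installed(name):
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if model_installed(MODEL_NAME):
    import spacy

    nlp = spacy.load(MODEL_NAME)
else:
    # Nothing is downloaded automatically; the user is only told what to run.
    print(f"Model missing, run: python -m spacy download {MODEL_NAME}")
```

With such a check the notebook can be re-run from the top without editing the download cell by hand.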

## Loading the packages

In [2]:
import pandas as pd
import numpy as np
import re
from random import randint
from random import seed
seed(5)
from unidecode import unidecode
import spacy
from spacy.lang.en import English
from spacy import displacy
# Load the same model that was downloaded above
nlp = spacy.load('en_core_web_md')
from IPython.display import HTML

## Loading the dataframe into python

It is important to pass `orient = 'table'` here, because the dataframe was exported with this option to make sure the exported file is valid JSON. Rows whose abstract is the string 'None' are dropped, and the head of the dataframe is printed out.

In [3]:
dfExoplanetsNASA = pd.read_json('./data/dfExoplanetsNASA_v2.json', orient = 'table')
dfExoplanetsNASA = dfExoplanetsNASA[dfExoplanetsNASA.abstract != 'None'].reset_index(drop=True)

In [4]:
dfExoplanetsNASA.head()

Unnamed: 0,bibcode,DOI,authors,affiliation,acknowledgements,grant,published,year,title,abstract,keywords,citation_count
0,2019NewA...70....1B,10.1016/j.newast.2018.12.005,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...","[School of Physics and Electronic Science, Gui...",We thank the anonymous referee for useful comm...,[],2019-07-00,2019,First photometric study of a short-period deta...,NSVS 10441882 is a newly discovered eclipsing ...,"[Binary, Eclipsing binary, Light curve, Orbita...",0
1,2019CNSNS..71...82A,10.1016/j.cnsns.2018.10.026,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...","[Dept. de Matemáticas, UAM-Iztapalapa, Ciudad ...",E. Barrabs has been supported by grants MTM201...,[],2019-06-00,2019,Ejection-Collision orbits in the symmetric col...,"In this paper, we consider the collinear symme...","[Collinear four-body problem, Ejection/collisi...",0
2,2019NewA...69...27E,10.1016/j.newast.2018.11.008,"[Şeyda Enez, Hasan Ali Dal]","[Ege University, Department of Astronomy and S...",We wish to thank the Turkish Scientific and Te...,[],2019-05-00,2019,Cool spot migration and flare activity of KIC ...,Analysing the photometrical data taken from th...,"[Techniques: Photometric, Methods: Statistical...",0
3,2019MNRAS.483.3465F,10.1093/mnras/sty3367,[Giacomo Fragione],"[Racah Institute for Physics, The Hebrew Unive...",Author thanks Nader Haghighipour for useful an...,[],2019-03-00,2019,Dynamical origin of S-type planets in close bi...,Understanding the origin of planets that have ...,"[planets and satellites: general, binaries: cl...",0
4,2019MNRAS.483.3448M,10.1093/mnras/sty3346,"[Kristina Monsch, Barbara Ercolano, Giovanni P...","[Universitäts-Sternwarte, Ludwig-Maximilians-U...",We thank Giovanni Rosotti and Jeff Jennings fo...,[],2019-03-00,2019,The imprint of X-ray photoevaporation of plane...,High-energy radiation from a planet host star ...,"[planets and satellites: formation, planet-dis...",0
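The `orient = 'table'` round trip used when loading the data can be sketched on a toy frame. The two columns and their rows below are made up for illustration and only stand in for the real export; the filtering step mirrors the one in cell 3.

```python
import io

import pandas as pd

# Toy frame standing in for the real export (illustrative rows only).
df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["First abstract.", "None", "Third abstract."],
})

# orient='table' writes a "schema" block describing the columns plus the "data" rows.
exported = df.to_json(orient="table")

# Reading needs the same orient so that pandas interprets the schema block.
restored = pd.read_json(io.StringIO(exported), orient="table")

# Drop records whose abstract is the literal string 'None', then renumber the rows.
restored = restored[restored.abstract != "None"].reset_index(drop=True)
print(restored)
```

Wrapping the JSON text in `io.StringIO` keeps the call working in recent pandas versions, which expect a path or file-like object rather than a raw JSON string.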


## Selecting columns to work with

For the work with this dataframe we only need the columns 'authors', 'title', 'published' and 'abstract'. To save memory, we keep only these four columns in the dataframe.

In [5]:
dfExoplanetsNASA = dfExoplanetsNASA[['authors', 'title', 'published', 'abstract']]

In [6]:
dfExoplanetsNASA.head()

Unnamed: 0,authors,title,published,abstract
0,"[Zhang Bin, Qian Sheng-Bang, Liu Nian-Ping, Zh...",First photometric study of a short-period deta...,2019-07-00,NSVS 10441882 is a newly discovered eclipsing ...
1,"[M. Alvarez-Ramírez, E. Barrabés, M. Medina, M...",Ejection-Collision orbits in the symmetric col...,2019-06-00,"In this paper, we consider the collinear symme..."
2,"[Şeyda Enez, Hasan Ali Dal]",Cool spot migration and flare activity of KIC ...,2019-05-00,Analysing the photometrical data taken from th...
3,[Giacomo Fragione],Dynamical origin of S-type planets in close bi...,2019-03-00,Understanding the origin of planets that have ...
4,"[Kristina Monsch, Barbara Ercolano, Giovanni P...",The imprint of X-ray photoevaporation of plane...,2019-03-00,High-energy radiation from a planet host star ...


Now the dataframe is ready to work with. The abstracts are stored in the column 'abstract'. This column can be accessed by `dfExoplanetsNASA.abstract`.

## Pick 500 random abstracts from the dataframe

This code picks 500 random abstracts from the dataframe and collects them in a list. The abstracts are drawn with replacement, so the same abstract can be picked more than once.

In [7]:
abstracts = [i for i in dfExoplanetsNASA.abstract if i != 'None']
randabs = []
for i in range(500):
    # randint includes both endpoints, so the largest valid index is len(abstracts) - 1
    numpaper = randint(0, len(abstracts) - 1)
    randabs.append(abstracts[numpaper])
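`random.randint(a, b)` can return `b` itself, which is why index-based picking needs care with the upper bound. A minimal alternative sketch that avoids indices altogether is shown here. The list `abstracts_demo` is placeholder data, and the private generator `rng` is an assumption made so that the global seed stays untouched.

```python
import random

# Placeholder data; in the notebook this would be the list `abstracts`.
abstracts_demo = [f"abstract {n}" for n in range(2000)]

rng = random.Random(5)  # private generator with its own seed

# With replacement: rng.choice can never produce an out-of-range index.
with_repeats = [rng.choice(abstracts_demo) for _ in range(500)]

# Without replacement: every picked abstract is distinct.
without_repeats = rng.sample(abstracts_demo, k=500)

print(len(set(with_repeats)), len(set(without_repeats)))
```

Which variant fits depends on whether duplicate abstracts in the sample are acceptable for the later analysis.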

## Pick 10 abstracts from the random abstracts

This code picks 10 abstracts from the 500 random abstracts and collects them in a list.

In [8]:
seed(8)
randabs10 = []
for i in range(10):
    # randint includes both endpoints, so the largest valid index is len(randabs) - 1
    numpaper = randint(0, len(randabs) - 1)
    randabs10.append(randabs[numpaper])

## Remove LaTeX characters from randomly chosen abstracts

In the next step we want to make a spaCy doc out of these abstracts. Because scientific papers, especially in the natural sciences, are often written in LaTeX, spaCy has some trouble with special characters such as `$`. We therefore remove them beforehand using regex.

In [9]:
# Drop "$" and backslashes, and turn LaTeX braces into parentheses
randabs10 = [re.sub(r'[$\\]', '', i).replace('{', '(').replace('}', ')') for i in randabs10]
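The cleaning step can also be wrapped in a small function, which makes it easier to test on a single string. This is only a sketch: `clean_latex` is a hypothetical helper name, and the LaTeX input in the example is an assumed reconstruction of how such a phrase might look before cleaning.

```python
import re

# Characters that are simply dropped: "$" and the backslash.
_DROP = re.compile(r"[$\\]")
# Braces are mapped to parentheses.
_BRACES = str.maketrans({"{": "(", "}": ")"})


def clean_latex(text):
    """Remove '$' and backslashes and replace braces with parentheses."""
    return _DROP.sub("", text).translate(_BRACES)


print(clean_latex(r"DEM $\propto T^{3/2}$"))  # DEM propto T^(3/2)
```

Applied to the whole list, this would read `[clean_latex(a) for a in randabs10]`.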

## Making a spaCy doc out of the abstracts and splitting them into sentences

Now that all the bad characters are removed, we can make a spaCy doc out of each abstract and split it into sentences. The sentences of the first abstract are printed out.

In [10]:
sentences = []
for abstract in randabs10:
    doc = nlp(abstract)
    sent = []
    for i in doc.sents:
        # Span.text is used here because newer spaCy versions no longer provide Span.string
        sent.append(i.text.strip())
    sentences.append(sent)

In [11]:
sentences[0]

['The X-ray and EUV emission of stars plays a key role in the loss and evolution of the atmospheres of their planets.',
 'The coronae of dwarf stars later than M6 appear to behave differently to those of earlier spectral types and are more X-ray dim and radio bright.',
 'Too faint to have been observed by the Extreme Ultraviolet Explorer, their EUV behavior is currently highly uncertain.',
 'We have devised a method to use the Chandra X-ray Observatory High Resolution Camera to provide a measure of EUV emission in the 50-170 AA range and have applied it to the M6.5 dwarf LHS 248 in a pilot 10 ks exposure.',
 'Analysis with model spectra using simple, idealised coronal emission measure distributions inspired by an analysis of Chandra HETG spectra of the M5.5 dwarf Proxima Cen and results from the literature, finds greatest consistency with a very shallow emission measure distribution slope, DEM propto T^(3/2) or shallower, in the range log T=5.5-6.5.',
 'Within 2sigma confidence, a much

After removing all the bad characters it looks quite nice!
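If only the sentence boundaries are needed, a rule-based splitter avoids loading a pretrained model. The sketch below assumes spaCy v3, where built-in components are added by name. `nlp_rules` and `split_sentences` are illustrative names, and the example text is shortened from the abstract above. Because the rules only look at punctuation, the boundaries can differ from those found by the pretrained pipeline.

```python
import spacy

# Assumption: spaCy v3 API. A blank English pipeline has a tokenizer but no model.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # punctuation-based sentence boundaries


def split_sentences(text):
    """Return the stripped sentences of `text` as a list of strings."""
    doc = nlp_rules(text)
    return [sent.text.strip() for sent in doc.sents]


example = ("The X-ray and EUV emission of stars plays a key role. "
           "Their EUV behavior is currently highly uncertain.")
for sentence in split_sentences(example):
    print(sentence)
```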

## POS-tagging of abstract sentences

Let's do some part-of-speech (POS) tagging. POS tagging assigns a word class such as noun or verb to every token, and the dependency parse adds the grammatical role each token plays in its sentence. In the following example I'm interested in the lemmata contained in the abstracts and in the grammatical role they play. For every token we store the text, the lemma, the coarse POS tag, the fine-grained tag, the dependency label and the shape. The first 25 tokens of the first abstract are displayed. For the explanation of the different tags go to https://spacy.io/api/annotation.

In [12]:
tokens = []
for abstract in randabs10:
    doc = nlp(abstract)
    tok = []
    for token in doc:
        tok.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_])
    tokens.append(tok)

In [13]:
tokens[0][:25]

[['The', 'the', 'DET', 'DT', 'det', 'Xxx'],
 ['X', 'x', 'NOUN', 'NN', 'nmod', 'X'],
 ['-', '-', 'PUNCT', 'HYPH', 'punct', '-'],
 ['ray', 'ray', 'NOUN', 'NN', 'nmod', 'xxx'],
 ['and', 'and', 'CCONJ', 'CC', 'cc', 'xxx'],
 ['EUV', 'euv', 'PROPN', 'NNP', 'conj', 'XXX'],
 ['emission', 'emission', 'NOUN', 'NN', 'nsubj', 'xxxx'],
 ['of', 'of', 'ADP', 'IN', 'prep', 'xx'],
 ['stars', 'star', 'NOUN', 'NNS', 'pobj', 'xxxx'],
 ['plays', 'play', 'VERB', 'VBZ', 'ROOT', 'xxxx'],
 ['a', 'a', 'DET', 'DT', 'det', 'x'],
 ['key', 'key', 'ADJ', 'JJ', 'amod', 'xxx'],
 ['role', 'role', 'NOUN', 'NN', 'dobj', 'xxxx'],
 ['in', 'in', 'ADP', 'IN', 'prep', 'xx'],
 ['the', 'the', 'DET', 'DT', 'det', 'xxx'],
 ['loss', 'loss', 'NOUN', 'NN', 'pobj', 'xxxx'],
 ['and', 'and', 'CCONJ', 'CC', 'cc', 'xxx'],
 ['evolution', 'evolution', 'NOUN', 'NN', 'conj', 'xxxx'],
 ['of', 'of', 'ADP', 'IN', 'prep', 'xx'],
 ['the', 'the', 'DET', 'DT', 'det', 'xxx'],
 ['atmospheres', 'atmosphere', 'NOUN', 'NNS', 'pobj', 'xxxx'],
 ['of', 'of
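The token rows can be summarised with plain Python once they are stored as lists. The sketch below copies the first ten rows from the output above and counts the POS tags, collects the noun lemmata and looks up the root of the sentence. The variable names are illustrative.

```python
from collections import Counter

# First ten rows of tokens[0], copied from the output above:
# [text, lemma, pos, tag, dep, shape]
rows = [
    ['The', 'the', 'DET', 'DT', 'det', 'Xxx'],
    ['X', 'x', 'NOUN', 'NN', 'nmod', 'X'],
    ['-', '-', 'PUNCT', 'HYPH', 'punct', '-'],
    ['ray', 'ray', 'NOUN', 'NN', 'nmod', 'xxx'],
    ['and', 'and', 'CCONJ', 'CC', 'cc', 'xxx'],
    ['EUV', 'euv', 'PROPN', 'NNP', 'conj', 'XXX'],
    ['emission', 'emission', 'NOUN', 'NN', 'nsubj', 'xxxx'],
    ['of', 'of', 'ADP', 'IN', 'prep', 'xx'],
    ['stars', 'star', 'NOUN', 'NNS', 'pobj', 'xxxx'],
    ['plays', 'play', 'VERB', 'VBZ', 'ROOT', 'xxxx'],
]

# How often does each coarse POS tag occur?
pos_counts = Counter(pos for _text, _lemma, pos, _tag, _dep, _shape in rows)

# Lemmata of all tokens tagged as nouns.
noun_lemmas = [lemma for _text, lemma, pos, *_rest in rows if pos == 'NOUN']

# The token whose dependency label is ROOT is the head of the sentence.
root = next(text for text, _lemma, _pos, _tag, dep, _shape in rows if dep == 'ROOT')

print(pos_counts.most_common(1))  # [('NOUN', 4)]
print(noun_lemmas)                # ['x', 'ray', 'emission', 'star']
print(root)                       # plays
```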

For a better understanding let's visualize the grammatical structure of this sentence:

In [14]:
doc = nlp(sentences[0][0])
displacy.render(doc, style="dep", options={'compact': True})

TypeError: __init__() got an unexpected keyword argument 'encoding'