# Running spaCy on a DataFrame of Exoplanet Publications from NASA ADS

Before you can use spaCy, you have to download the pretrained model for the English language with:

In [6]:
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


After doing this once, you can comment out this line with `#` and simply load the model with `nlp = spacy.load('en_core_web_md')`.

## Loading the packages

In [7]:
import pandas as pd
import numpy as np
import re
from random import randint
from random import seed
seed(5)
from unidecode import unidecode
import spacy
from spacy.lang.en import English
from spacy import displacy
nlp = spacy.load('en_core_web_md')
from IPython.display import HTML

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

## Loading the dataframe into python

It is important to pass `orient = 'table'` here: the dataframe was exported with this option so that the exported file is valid JSON. Rows whose abstract is the placeholder string 'None' are dropped, and the head of the dataframe is printed out.

In [None]:
dfExoplanetsNASA = pd.read_json('./data/dfExoplanetsNASA_v2.json', orient = 'table')
dfExoplanetsNASA = dfExoplanetsNASA[dfExoplanetsNASA.abstract != 'None'].reset_index(drop=True)

In [None]:
dfExoplanetsNASA.head()

## Selecting columns to work with

For this work we only need the columns 'authors', 'title', 'published' and 'abstract'. To save memory, we keep only these four columns in the dataframe.

In [None]:
dfExoplanetsNASA = dfExoplanetsNASA[['authors', 'title', 'published', 'abstract']]

In [None]:
dfExoplanetsNASA.head()

Now the dataframe is ready to work with. The abstracts are stored in the column 'abstract'. This column can be accessed by `dfExoplanetsNASA.abstract`.

## Pick 500 random abstracts from dataframe

This code picks 500 random abstracts from the dataframe and collects them in a list. Because each index is drawn independently, the same abstract can appear more than once.

In [None]:
abstracts = [i for i in dfExoplanetsNASA.abstract if i != 'None']
randabs = []
for i in range(500):
    numpaper = randint(0, len(abstracts) - 1)  # randint includes both endpoints
    randabs.append(abstracts[numpaper])
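
If repeated abstracts are not wanted, sampling without replacement is an alternative. The following is a minimal sketch rather than the notebook's method: `sample_abstracts` is a hypothetical helper, and the generated strings only stand in for the real abstracts.

```python
import random


def sample_abstracts(abstracts, k, seed=5):
    """Return k distinct items from abstracts, reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    k = min(k, len(abstracts))  # never ask for more items than exist
    return rng.sample(abstracts, k)


# Stand-in data; in the notebook this would be the list `abstracts`.
demo_abstracts = [f"abstract {n}" for n in range(1000)]
demo_sample = sample_abstracts(demo_abstracts, 500)
print(len(demo_sample), len(set(demo_sample)))
```

Using a local `random.Random` instance keeps the draw reproducible without depending on the order in which other cells call `seed`.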

## Pick 10 abstracts from the random abstracts

This code picks 10 abstracts from the 500 random abstracts and collects them in a list.

In [None]:
seed(8)
randabs10 = []
for i in range(10):
    numpaper = randint(0, len(randabs) - 1)  # randint includes both endpoints
    randabs10.append(randabs[numpaper])

## Remove LaTeX characters from the randomly chosen abstracts

In the next step we want to make a spaCy doc out of these abstracts. Scientific papers, especially in the natural sciences, are often written in LaTeX, and spaCy has some trouble with special characters such as $. So we remove them beforehand using regular expressions: $ and backslashes are dropped, and curly braces become parentheses.

In [None]:
# Raw strings avoid invalid-escape warnings for patterns such as '\$'
randabs10 = [re.sub(r'\\', '', re.sub(r'\{', '(', re.sub(r'\}', ')', re.sub(r'\$', '', i)))) for i in randabs10]
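
The same four substitutions can also be expressed as a single translation table, which sidesteps regex escaping altogether. This is a minimal sketch, assuming the cleanup rules stay exactly as described (drop `$` and backslashes, turn braces into parentheses); `clean_latex` is a hypothetical name, not part of the notebook.

```python
# Map each LaTeX character to its replacement; None deletes the character.
_LATEX_TABLE = str.maketrans({"$": None, "\\": None, "{": "(", "}": ")"})


def clean_latex(text):
    """Strip '$' and backslashes and turn braces into parentheses (hypothetical helper)."""
    return text.translate(_LATEX_TABLE)


print(clean_latex(r"$M_{\odot}$ and \alpha"))
```

In the notebook this could be applied as `randabs10 = [clean_latex(i) for i in randabs10]`.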

## Making a spaCy doc out of the abstracts and cutting them into sentences

Now that the problematic characters are removed, we can make a spaCy doc out of each abstract and split it into sentences. The sentences of the first abstract are printed out.

In [None]:
sentences = []
for abstract in randabs10:
    doc = nlp(abstract)
    sent = []
    for i in doc.sents:
        sent.append(i.text.strip())  # Span.text replaces the older Span.string
    sentences.append(sent)

In [None]:
sentences[0]

After removing all the bad characters it looks quite nice!

## POS-tagging of abstract sentences

Let's do some part-of-speech (POS) tagging, which assigns each token its grammatical category. In the following example I'm interested in the lemmata contained in the abstracts and in the grammatical role they play in the sentence. For each token we store the text, lemma, coarse POS tag, fine-grained tag, dependency label and word shape; the first 25 tokens of the first abstract are displayed. For an explanation of the different tags, see https://spacy.io/api/annotation.

In [None]:
tokens = []
for abstract in randabs10:
    doc = nlp(abstract)
    tok = []
    for token in doc:
        tok.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_])
    tokens.append(tok)

In [None]:
tokens[0][:25]
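
Once the token rows are collected, they can be summarised without further spaCy calls. This is a minimal sketch, assuming rows in the same six-field layout as `tokens`; `count_pos` is a hypothetical helper, and the sample rows are hand-written for illustration, not real model output.

```python
from collections import Counter


def count_pos(rows):
    """Count coarse POS tags in rows of [text, lemma, pos, tag, dep, shape]."""
    return Counter(row[2] for row in rows)


# Hand-written illustrative rows (not real spaCy output).
demo_rows = [
    ["We", "we", "PRON", "PRP", "nsubj", "Xx"],
    ["detect", "detect", "VERB", "VBP", "ROOT", "xxxx"],
    ["two", "two", "NUM", "CD", "nummod", "xxx"],
    ["planets", "planet", "NOUN", "NNS", "dobj", "xxxx"],
]
pos_counts = count_pos(demo_rows)
print(pos_counts.most_common())
```

For the notebook's data this could be called as `count_pos(tokens[0])` to see which word classes dominate the first abstract.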

For a better understanding, let's visualize the grammatical structure of the first sentence of the first abstract:

In [None]:
doc = nlp(sentences[0][0])
displacy.render(doc, style="dep", options={'compact': True})