<a href="https://colab.research.google.com/github/julianflowers/herbivores_ghg/blob/master/Copy_of_spacy_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook outlines how to use natural language processing (NLP) to extract locations and numbers from documents.

It uses Python modules (the equivalent of R packages) designed for NLP to find relevant information in texts. This task is known as *named entity recognition* (NER). It relies on language models: statistical models that have been *trained* on large collections of text in which words and phrases were labelled with categories. A text is first broken down into words and punctuation (*tokens*); the model then predicts, from each token and its context, whether it is part of a named entity and assigns a category if so. Labelling the text in this way is called *annotation*.

## Getting started

The first step is to install the relevant packages and load them into Python. A widely used NLP package is `spacy` (https://spacy.io/). We will also install a general-purpose English language model called `en_core_web_lg`. The name follows spaCy's naming convention: **en** (English), **core** (general-purpose), **web** (trained on written web text such as blogs, news and comments) and **lg** (large).

In [None]:
# Install spaCy, then download the model version that matches the installed spaCy
!pip install spacy
!python -m spacy download en_core_web_lg

import spacy
from spacy import displacy


We'll also install `spacypdfreader`, which, unsurprisingly, reads PDFs into the Colab Python environment as spaCy documents.

In [None]:
!pip install spacypdfreader


In [6]:
from spacypdfreader import pdf_reader

And finally we can upload some files to process...

In [2]:
from google.colab import files

uploaded = files.upload()

User uploaded file "s13717-020-00230-z" with length 1827044 bytes
User uploaded file "sustainability-12-02425-v2.pdf" with length 2986118 bytes
User uploaded file "agronomy-11-01421-v2.pdf" with length 685973 bytes
User uploaded file "Testing%20DayCent%20and%20DNDC%20model%20simulations%20of%20N2O%20fluxes%20and%20assessing%20the%20impacts%20of%20climate%20change%20on%20the%20gas%20flux%20and%20biomass%20production%20from%20a%20humid%20pasture.pdf" with length 1938112 bytes
User uploaded file "s42452-020-03538-9.pdf" with length 1365197 bytes
User uploaded file "9995c9d0bab3b678e4835d05490b4c993120.pdf" with length 2305986 bytes
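
Note that the first file above is listed without a `.pdf` extension. `files.upload()` returns a dictionary mapping each file name to its contents as bytes, so one way to confirm that every upload really is a PDF is to check for the `%PDF-` marker that PDF files normally begin with. The sketch below is an illustration only: `pdf_names` is a hypothetical helper, not part of Colab or spaCy, and the sample dictionary is a made-up stand-in for the real `uploaded` object.

```python
def pdf_names(uploaded):
    """Return the names of uploads whose contents start with the PDF marker."""
    # Assumption: `uploaded` maps file names to bytes, as files.upload() does.
    return sorted(name for name, data in uploaded.items()
                  if data.startswith(b"%PDF-"))


# Stand-in for the real `uploaded` dictionary (made-up contents).
example_uploads = {
    "paper-without-extension": b"%PDF-1.7 ...",
    "notes.txt": b"plain text, not a PDF",
}
print(pdf_names(example_uploads))
```

In the notebook itself, `pdf_names(uploaded)` would list the files that are safe to pass to `pdf_reader`, whatever their extension.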


## Named entity recognition

We need to load the language model. Its NER component recognises 18 entity categories. Those most likely to be of interest are:


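*   CARDINAL - numbers that do not fall under another category
*   GPE - geopolitical entities (e.g. countries, cities, states)
*   LOC - non-GPE locations (e.g. mountain ranges, bodies of water)
*   DATE - absolute or relative dates or periods
*   PERSON - people
*   ORG - organisations (e.g. companies, agencies, institutions)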

In [9]:
ner = spacy.load("en_core_web_lg")


Read a PDF into a spaCy `Doc` object

In [10]:
text = pdf_reader("/content/sustainability-12-02425-v2.pdf", ner)

In [None]:
text

Run the code below to visualise how the NLP process has annotated the document

In [None]:
displacy.render(text,style="ent",jupyter=True)

Or print a list of the named entities with their categories

In [None]:
# `text` is already a processed Doc, so its entities can be read directly
for entity in text.ents:
    print(entity.text, entity.label_)

Let's try another document.

In [14]:
text1 = pdf_reader("/content/s42452-020-03538-9.pdf", ner)

In [None]:
displacy.render(text1,style="ent",jupyter=True)

We can extract just the entities we need by building a list of the text of every entity annotated with the category of interest, e.g. LOC or GPE.

In [16]:
# Collect the text of every entity labelled 'LOC'
list_of_loc = []
for entity in text1.ents:
  if entity.label_ == 'LOC':
    list_of_loc.append(entity.text)

# Collect the text of every entity labelled 'GPE'
list_of_gpe = []
for entity in text1.ents:
  if entity.label_ == 'GPE':
    list_of_gpe.append(entity.text)
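
The two loops above follow the same pattern, so they can be folded into a single pass over the entities. The sketch below uses a hypothetical helper, `entities_by_label`, which only assumes that each entity has `.text` and `.label_` attributes, as spaCy entities do. The `SimpleNamespace` objects are made-up stand-ins so that the example runs without a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(ents, labels):
    """Group entity texts by label, keeping only the labels of interest."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent.label_ in labels:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Made-up stand-ins for spaCy entities (illustration only).
fake_ents = [
    SimpleNamespace(text="Tigray", label_="LOC"),
    SimpleNamespace(text="Ethiopia", label_="GPE"),
    SimpleNamespace(text="2019", label_="DATE"),
    SimpleNamespace(text="Nile", label_="LOC"),
]
groups = entities_by_label(fake_ents, {"LOC", "GPE"})
print(groups)
```

In the notebook, `entities_by_label(text1.ents, {"LOC", "GPE", "CARDINAL"})` would return all the lists of interest in one dictionary.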

In [35]:
list_of_loc

['Africa',
 'Northern Ethiopia',
 'Tigray',
 'Northern Ethiopia',
 'Tigray',
 'Tigray',
 'Northern Ethiopia',
 'Northern Ethiopia',
 'Mai-Saba',
 'Northern Ethiopia',
 'North West',
 'Map',
 'Northern Ethiopia',
 'Southern \nTigray',
 'Southern Tigray',
 'South Western Ethiopia',
 'Illubabor',
 'South \nWestern Ethiopia',
 'Northern China',
 'South Wello Zone',
 'Central highlands',
 'Butajira Area',
 'the Middle Silluh Valley',
 'Northern Ethiopia',
 'Northern Ethiopia',
 'Pampa',
 'Nile',
 'Northern \nEthiopia',
 'Southern Ethiopia',
 'North-Western Ethiopia',
 'Northern Ethiopia',
 'J Arid Environ',
 'North Western Zone',
 'Africa',
 'the Tengger Desert',
 'Northern Ethiopia',
 'Central New \nSouth Wales',
 'Ethiopian Rift',
 'Northern Ethiopia',
 'Inner Mongolia',
 'Southern Ethiopia',
 'Peninsular Malaysia',
 'Southern']
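
The output contains many repeats, and some names are split by line breaks carried over from the PDF layout (for example `'Southern \nTigray'`). A minimal sketch of one way to tidy this is to collapse runs of whitespace into single spaces and then count how often each name occurs. `count_entities` is a hypothetical helper, and the sample list is a small excerpt in the style of the output above.

```python
from collections import Counter


def count_entities(names):
    """Collapse whitespace inside each name, then count occurrences."""
    cleaned = [" ".join(name.split()) for name in names]
    return Counter(cleaned)


sample = ["Northern Ethiopia", "Southern \nTigray", "Southern Tigray",
          "Northern \nEthiopia", "Tigray"]
counts = count_entities(sample)
for name, n in counts.most_common():
    print(n, name)
```

Applied to `list_of_loc`, this would give one row per distinct location with its frequency, which is usually easier to review than the raw list.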

In [None]:
list_of_gpe

In [None]:
# Append entities which have the label 'CARDINAL' to the list
list_of_numbers = []
for entity in text1.ents:
  if entity.label_ == 'CARDINAL':
    list_of_numbers.append(entity.text)

list_of_numbers
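
CARDINAL entities are returned as text, so they need converting before they can be treated as numbers in later analysis. The sketch below shows one cautious way to do that. `to_number` is a hypothetical helper; it assumes that commas are thousands separators (English-style formatting) and returns `None` for anything it cannot parse, such as numbers written as words. The sample strings are made up.

```python
def to_number(text):
    """Convert a string such as '1,200' or '3.5' to a float, else None."""
    # Assumption: commas are thousands separators, not decimal marks.
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


sample = ["1,200", "3.5", "two", " 12 "]
numbers = [to_number(s) for s in sample]
print(numbers)
```

Keeping the `None` values rather than dropping them makes it easy to see which entities still need checking by hand.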

We can then write these lists to text files, which can be opened in Excel (or R) for further analysis or filtering.

In [33]:
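# Write one entity per line; collapse line breaks inside entity text first
numbers = [" ".join(n.split()) for n in list_of_numbers]
with open("numbers.txt", mode="w", encoding="utf-8") as file:
    file.write("\n".join(numbers))

locs = [" ".join(loc.split()) for loc in list_of_loc]
with open("locs.txt", mode="w", encoding="utf-8") as file:
    file.write("\n".join(locs))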
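
If the goal is to open the results in Excel or R, a CSV file with a header row and one row per entity can be more convenient than plain text. The sketch below uses Python's built-in `csv` module. `write_entity_csv`, the column names and the file name `entities.csv` are illustrative choices, not part of the workflow above, and the rows are made up.

```python
import csv


def write_entity_csv(path, rows):
    """Write (label, text) pairs to a CSV file with a header row."""
    with open(path, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["label", "text"])
        for label, text in rows:
            # Collapse line breaks inside entity text so each entity is one cell
            writer.writerow([label, " ".join(text.split())])


# Made-up rows; in the notebook these could be built from text1.ents.
rows = [("LOC", "Southern \nTigray"), ("CARDINAL", "1,200")]
write_entity_csv("entities.csv", rows)
```

The `csv` module quotes values that contain commas, so an entity such as `1,200` stays in a single column.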