<a href="https://colab.research.google.com/github/julianflowers/herbivores_ghg/blob/master/Copy_of_spacy_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook outlines how to use natural language processing (NLP) to extract locations and numbers from documents.

It uses Python modules (the equivalent of R packages) that are designed for NLP to find relevant information in texts. This process is known as *named entity recognition* (NER). It relies on language models: statistical models that have been *trained* on large collections of text in which words and phrases were labelled by category. A text is first broken down into words (*tokens*), and the model then assigns a category to each token or run of tokens it recognises as an entity. This process is called *annotation*.
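
To make the terms *token* and *annotation* concrete, here is a deliberately simplified sketch in plain Python. The lookup table `TOY_LABELS` and the functions `tokenize` and `annotate` are hypothetical and exist only for illustration; a real spaCy model predicts labels from context rather than looking words up in a fixed table.

```python
import re

# Hypothetical, hand-made lookup table (illustration only, not spaCy data).
TOY_LABELS = {
    "France": "GPE",
    "Europe": "LOC",
    "2007": "DATE",
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def annotate(text):
    """Return (token, label) pairs; label is None when no category is found."""
    return [(tok, TOY_LABELS.get(tok)) for tok in tokenize(text)]

pairs = annotate("Grassland in France, Europe, 2007.")
# Keep only the tokens that received a category.
entities = [(tok, label) for tok, label in pairs if label is not None]
print(entities)
```

The result has the same shape as the output produced later in this notebook: a piece of text paired with a category label.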

## Getting started

The first step is to install the relevant packages and load them into Python. A widely used NLP package is `spacy` (https://spacy.io/). We will also install a core English language model called `en_core_web_lg`. The name stands for **en**glish, **core**, **web** (the model is trained on written web text such as blogs, news and comments) and **l**ar**g**e.

In [2]:
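# Install spaCy and the large English model first, then import.
# The 3.0.0 model wheel pins spaCy to the 3.0.x series, so importing after
# both installs ensures the matching version is the one that gets loaded.
!pip install spacy
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl

import spacy
from spacy import displacy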


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl (778.8 MB)
Collecting spacy<3.1.0,>=3.0.0
  Downloading spacy-3.0.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting typing-extensions<4.0.0.0,>=3.7.4
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Installing collected packages: typing-extensions, typer, spacy, en-core-web-lg
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 4.1.1

We'll also install `spacypdfreader`, which, unsurprisingly, reads PDFs into the Colab Python environment.

In [3]:
!pip install spacypdfreader


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacypdfreader
  Downloading spacypdfreader-0.2.1-py3-none-any.whl (9.4 kB)
Collecting rich<11.0.0,>=10.15.2
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
Collecting pdfminer.six<20211013,>=20211012
  Downloading pdfminer.six-20211012-py3-none-any.whl (5.6 MB)
Collecting cryptography
  Downloading cryptography-37.0.4-cp36-abi3-manylinux_2_24_x86_64.whl (4.1 MB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Installing collected packages: cryptography, commonmark, colorama, rich

In [4]:
from spacypdfreader import pdf_reader

And finally we can upload some files to process...

In [5]:
from google.colab import files

uploaded = files.upload()

Saving The Legacy of Destructive Snow Goose Foraging on Supratidal Marsh Habitat in the Hudson Bay Lowlands.pdf to The Legacy of Destructive Snow Goose Foraging on Supratidal Marsh Habitat in the Hudson Bay Lowlands.pdf
Saving Abdalla et al. 2009 N2O Ireland (1).pdf to Abdalla et al. 2009 N2O Ireland (1).pdf
Saving Allard et al. 2007 CO2 N2O CH4 France (1).pdf to Allard et al. 2007 CO2 N2O CH4 France (1).pdf


## Named entity recognition

We need to load the language model. The large model recognises 18 entity categories. Those most likely to be of interest are:


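*   CARDINAL - numerals that do not fall under another category
*   GPE - geopolitical entities (e.g. countries, cities, states)
*   LOC - non-GPE locations (e.g. mountain ranges, bodies of water)
*   DATE - absolute or relative dates or periods
*   PERSON - people, including fictional ones
*   ORG - organisations (e.g. companies, agencies, institutions)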

In [6]:
ner = spacy.load("en_core_web_lg")




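Read a PDF into a spaCy `Doc` object (called `text` here). The reader extracts the text and runs the language model over it in one step: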

In [7]:
text = pdf_reader("/content/The Legacy of Destructive Snow Goose Foraging on Supratidal Marsh Habitat in the Hudson Bay Lowlands.pdf", ner)

In [None]:
text

Run the code below to visualise how the NLP process has annotated the document:

In [None]:
displacy.render(text,style="ent",jupyter=True)

Or print a list of the named entities with their labels:

In [None]:
# `text` is already an annotated spaCy Doc, so the model does not need to be run again
for entity in text.ents:
    print(entity.text, entity.label_)

Let's try another document

In [14]:
text1 = pdf_reader("/content/Allard et al. 2007 CO2 N2O CH4 France (1).pdf", ner)

In [None]:
displacy.render(text1,style="ent",jupyter=True)

We can extract just the entities we need by building a list of all the text annotated with the category of interest, e.g. LOC or GPE:

In [16]:
# Collect the text of every entity labelled 'LOC' (non-GPE locations)
list_of_loc = []
for entity in text1.ents:
    if entity.label_ == 'LOC':
        list_of_loc.append(entity.text)

# Collect the text of every entity labelled 'GPE' (geopolitical entities)
list_of_gpe = []
for entity in text1.ents:
    if entity.label_ == 'GPE':
        list_of_gpe.append(entity.text)
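
The loops above differ only in the label they test, so the same result can be had in a single pass that groups every entity by its label. The sketch below is one way to do that. `group_by_label` is a hypothetical helper name, and the `Ent` tuple is only a stand-in so that the example runs without a model; spaCy entities expose the same `.text` and `.label_` attributes.

```python
from collections import defaultdict, namedtuple

def group_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    groups = defaultdict(list)
    for ent in ents:
        groups[ent.label_].append(ent.text)
    return dict(groups)

# Stand-in for spaCy entities (assumption: only .text and .label_ are used).
Ent = namedtuple("Ent", ["text", "label_"])
demo_ents = [
    Ent("France", "GPE"),
    Ent("Europe", "LOC"),
    Ent("234", "CARDINAL"),
    Ent("Germany", "GPE"),
]

groups = group_by_label(demo_ents)
print(groups["GPE"])            # texts labelled GPE, in document order
print(groups.get("ORG", []))    # .get avoids a KeyError for absent labels
```

With a real document this would be called as `group_by_label(text1.ents)`, after which `groups.get('LOC', [])` plays the role of `list_of_loc`.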

In [17]:
list_of_loc

['Urals',
 't3',
 'Europe',
 'the Last Glacial\nMaximum',
 'the U.K. Soil Biol',
 'Europe',
 'Europe',
 'Europe',
 'Breidt',
 'Vuichard',
 'Europe']
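
Because the same place can be tagged many times, a frequency count is often easier to read than the raw list. A minimal sketch using the standard library's `collections.Counter`, with the output above pasted in as a literal so that the example runs on its own:

```python
from collections import Counter

# The LOC entities printed above, copied in as a literal.
list_of_loc = ['Urals', 't3', 'Europe', 'the Last Glacial\nMaximum',
               'the U.K. Soil Biol', 'Europe', 'Europe', 'Europe',
               'Breidt', 'Vuichard', 'Europe']

loc_counts = Counter(list_of_loc)

# Most frequent entities first; repr() makes embedded newlines visible.
for place, n in loc_counts.most_common():
    print(n, repr(place))
```

A count also makes one-off hits such as `'t3'` stand out as candidates for checking by hand.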

In [18]:
list_of_gpe

['Bre´zet',
 'France',
 'France',
 'Le Roc',
 'France',
 'France',
 'France',
 'France',
 'Andosol',
 'France',
 'Batjes',
 'Hall',
 'France',
 'UK',
 'USA',
 'Fleach',
 'Fleach',
 'Retsch',
 'Germany',
 'Moss',
 'Denmark',
 'Magnugistics',
 'USA',
 'Batjes',
 'Granli',
 'Bockman',
 'Jarvis',
 'S.C.',
 'Headon',
 'A.',
 'A.S.',
 'P.H.',
 'A.',
 'Grunwald',
 'Vesala',
 'Batjes',
 'N.H.',
 'Batjes',
 'N.H.',
 'Soussana',
 'J.F.',
 'F.S.',
 'Matson',
 'H.A.',
 'New York',
 'A.',
 'Ogee',
 'Grunwald',
 'B.',
 'Knohl',
 'Loustau',
 'Miglietta',
 'Ourcival',
 'Pilegaard',
 'Soussana',
 'J.F.',
 'McTaggart',
 'I.P.',
 'K.J.',
 'J.-M.',
 'Blanchart',
 'M.C.',
 'Ndandou',
 'J.F.',
 'Nutr',
 'C.R.',
 'K.J.',
 'Ball',
 'B.C.',
 'Jolivot',
 'Environ',
 'W.A.',
 'Gilmanov',
 'Soussana',
 'J.F.',
 'Aires',
 'Dore',
 'Jacobs',
 'Sutton',
 'Environ',
 'Bockman',
 'O.C.',
 'Essex',
 'England',
 'Tokyo',
 'Japan',
 'A.',
 'Folberth',
 'S.C.',
 'Yamulki',
 'Stauch',
 'V.J.',
 'Environ',
 'Kirschbaum',
 '
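
Much of the GPE output above is not place names at all: author initials such as `'J.F.'` and surnames from the reference list have been tagged as locations. A rough first-pass filter can drop the initials and collapse duplicates before the list is checked by hand. The sketch below reflects an assumption about what counts as noise, not a rule from spaCy. `looks_like_initials` and `clean_places` are hypothetical helper names, and the sample is a handful of values copied from the output above.

```python
import re

# One or more capital letters, each followed by a full stop and optionally a
# hyphen: matches 'A.', 'J.F.' and 'J.-M.' but not 'France' or 'UK'.
INITIALS = re.compile(r"(?:[A-Z]\.-?)+")

def looks_like_initials(text):
    return INITIALS.fullmatch(text) is not None

def clean_places(entities):
    """Drop initials and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for text in entities:
        if looks_like_initials(text) or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept

sample = ['France', 'France', 'UK', 'USA', 'S.C.', 'A.', 'J.-M.',
          'Germany', 'New York', 'J.F.']
print(clean_places(sample))
```

Surnames such as `'Batjes'` still pass this filter, and a genuine abbreviation written with full stops would be dropped, so the result is a shorter list to review rather than a finished one.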

In [19]:
# Collect the text of every entity labelled 'CARDINAL' (numbers)
list_of_numbers = []
for entity in text1.ents:
    if entity.label_ == 'CARDINAL':
        list_of_numbers.append(entity.text)

list_of_numbers

['234',
 'two',
 'half',
 'four',
 'two',
 '97',
 '69',
 '89',
 '1',
 '1',
 '18',
 '48',
 '47–58',
 '0.52',
 '150 millions',
 'two',
 'One',
 'half',
 'two',
 '2',
 '2.1',
 '35',
 '7 8C',
 '0.7 8C',
 '14.8',
 '2.1.1',
 'two',
 '2.81',
 'half',
 '80',
 '174',
 'three',
 '2.2',
 '47–58',
 'two',
 '5–10',
 '2.3',
 '0.08',
 '2.4',
 'Eight',
 'two',
 't0',
 '3400Cx',
 '2.5',
 'Eight',
 'seven',
 '2.6',
 '1',
 '50',
 '47–58',
 '2',
 '3',
 '4',
 '5',
 '127',
 '2',
 '3',
 '2.7',
 '50 8C',
 '2300',
 '2.8',
 '16',
 '1',
 '47–58',
 '51',
 '14',
 '1',
 '3',
 '3.1',
 '1',
 '1128',
 '2',
 '3',
 '1',
 '5',
 '8C',
 '1',
 '3.2',
 '2',
 '3',
 '1',
 'about 4',
 'half',
 '0.9',
 '2',
 '3',
 'half',
 '1',
 '0.9',
 '106',
 '1',
 '1.2',
 '390',
 '0.5',
 '443',
 '0.5',
 '425',
 '391',
 '137',
 '1',
 '1',
 '3.3',
 '3',
 'close to',
 '3',
 '2',
 '52',
 '47–58',
 '3',
 '3',
 '42',
 '91',
 '3',
 '2',
 '3',
 '2',
 '3',
 '3.4',
 'between 100 and 150',
 '5B',
 '6',
 '3.5',
 '0.36',
 'about two',
 '2',
 '37.2',
 '0.0
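
The CARDINAL entities are strings, and not all of them are usable numbers: the list mixes digits (`'0.52'`), words (`'two'`), ranges (`'47–58'`) and PDF extraction debris (`'7 8C'`). One cautious option is to keep only the strings that Python can parse directly as numbers and set the rest aside for inspection. A minimal sketch, with `split_numeric` as a hypothetical helper and a sample copied from the output above:

```python
def split_numeric(strings):
    """Separate strings that parse as floats from those that do not."""
    numbers, rejected = [], []
    for s in strings:
        try:
            numbers.append(float(s))
        except ValueError:
            # Words, ranges and garbled values end up here for manual review.
            rejected.append(s)
    return numbers, rejected

sample = ['234', 'two', 'half', '0.52', '47–58', '150 millions',
          '7 8C', '2.1.1', '14.8']
numbers, rejected = split_numeric(sample)
print(numbers)
print(rejected)
```

Parsing successfully does not make a value a measurement: section numbers such as `'2.1'` also convert cleanly, so the numeric list still needs checking against the source document.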

We can then write these lists to text files, one entity per line, which can be opened in Excel (or R) for further analysis or filtering.

In [20]:
# Write one entity per line; utf-8 keeps characters such as the dash in '47–58' intact
with open("numbers.txt", mode="w", encoding="utf-8") as file:
    file.write("\n".join(list_of_numbers))

with open("locs.txt", mode="w", encoding="utf-8") as file:
    file.write("\n".join(list_of_loc))
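
Plain text files lose the entity labels. If the label is wanted alongside each value, a two-column CSV file opens directly in Excel or R. A minimal sketch using the standard library's `csv` module; the function name `write_entities_csv`, the file name `entities.csv` and the column headers are all assumptions for illustration, and the rows are a small sample taken from the lists above.

```python
import csv

def write_entities_csv(path, rows):
    """Write (text, label) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(rows)

rows = [("Europe", "LOC"), ("France", "GPE"), ("0.52", "CARDINAL")]
write_entities_csv("entities.csv", rows)
```

With a real document the rows could be built as `[(ent.text, ent.label_) for ent in text1.ents]`. The `csv` module also quotes values that contain commas or line breaks, which a newline-joined text file cannot represent.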