## Named Entity Recognition — Code

[Download relevant files](https://melaniewalsh.org/spacy.zip)

This notebook is a streamlined version of a previous lesson on **Named Entity Recognition**. It is primarily intended for those who want to reuse the code without the previous lessons' overview and explanations.

<img src="../images/Ada-Lovelace-NER.png" >

## Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy

We're also going to import `Counter`, a class from Python's built-in `collections` module, for counting people, places, and things later on, and the `pandas` library for organizing and displaying data. We're also raising the pandas display limits for the maximum number of rows and the maximum column width.

In [2]:
from collections import Counter

In [3]:
import pandas as pd
# Show up to 400 rows and up to 400 characters per column
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)
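
As a minimal sketch of what `Counter` will do for us later, the example below tallies a short, made-up list of names (the names are illustrative, not output from spaCy) and sorts them from most to least common with `.most_common()`.

```python
from collections import Counter

# Hypothetical list of extracted names, standing in for real spaCy output
names = ["Jo", "Laurie", "Jo", "Beth", "Jo", "Laurie"]

# Counter tallies how many times each name appears
names_tally = Counter(names)

# .most_common() returns (name, count) pairs, highest count first
top_names = names_tally.most_common()
print(top_names)
```

Each `(name, count)` pair becomes one row when a list like this is passed to `pd.DataFrame()`.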

## Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [None]:
!python -m spacy download en_core_web_sm

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages), including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. When this lesson was written, languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese didn't have their own trained models, though some of them (Chinese and Japanese, for example) have since gained them, so check the models page for the current list. spaCy also offers language and tokenization support for many of these languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [4]:
nlp = spacy.load('en_core_web_sm')

## Create a Processed spaCy Document

In [None]:
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

## Get Named Entities

|Type Label|Description|
|--- |--- |
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


All the named entities in our `document` can be found in the `document.ents` property. We can access each entity's label by iterating through `document.ents` with a simple `for` loop and pulling out the `.label_` attribute.

In [None]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)
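
The `displacy` module we imported earlier can highlight these entities in context. In a notebook, `displacy.render(document, style="ent")` is the usual call. The sketch below is self-contained so that it runs without the downloaded model: it assumes a blank English pipeline and two hand-labeled entities in a made-up sentence, then asks displaCy for the HTML markup as a string.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Ada Lovelace was born in London.")

# Hand-label two entities by token position; a trained model would predict these
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="PERSON"),  # "Ada Lovelace"
    Span(sample_doc, 5, 6, label="GPE"),     # "London"
]

# jupyter=False returns the HTML markup as a string instead of displaying it
html = displacy.render(sample_doc, style="ent", jupyter=False)

# The same loop as above works on the hand-labeled document
for named_entity in sample_doc.ents:
    print(named_entity.text, named_entity.label_)
```

With the real `document` from `nlp(text)`, the entities and labels come from the model's predictions rather than from hand-set spans.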

## Get People

|Type Label|Description|
|--- |--- |
|PERSON|People, including fictional.|

In [None]:
people = [named_entity.text for named_entity in document.ents if named_entity.label_ == "PERSON"]
people_tally = Counter(people)
df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

## Process Long Documents (or Many Documents)

Rather than creating a single processed `document` with `nlp()`, we're going to create a bunch of smaller spaCy `documents` with `nlp.pipe()`. The [`nlp.pipe()`](https://spacy.io/usage/processing-pipelines#processing) method is faster and more efficient when we're processing many documents.

In [None]:
filepath = "../texts/literature/Little-Women.txt"
text = open(filepath, encoding="utf-8").read()

# Split text on line breaks
chunked_text = text.split('\n')
# Process each chunk of text and return a list of processed documents
chunked_documents = list(nlp.pipe(chunked_text))

We `open()` and `.read()` our text file, then `.split()` the text on every line break `\n` and process each chunk of the text as its own document, returning a list of `chunked_documents`.
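
Splitting on every line break also produces empty strings wherever the text has blank lines. These are harmless, but they add documents that contain no entities. As an optional refinement (not part of the original lesson), a sketch like the one below drops the empty chunks before processing; the sample text is made up for illustration.

```python
# Made-up sample standing in for the contents of a text file
sample_text = "Jo wrote a letter.\n\nLaurie lived next door.\n   \nBeth played the piano."

# Split on line breaks, as above
chunks = sample_text.split("\n")

# Keep only chunks that contain something other than whitespace
nonempty_chunks = [chunk for chunk in chunks if chunk.strip()]

print(len(chunks), len(nonempty_chunks))
```

The filtered list can then be passed to `nlp.pipe()` in place of `chunked_text`.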

To extract people from all the `chunked_documents`, all we need to do is add one more `for` loop to our code and iterate through every document in `chunked_documents`.

In [29]:
people = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)
            
people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|--- |--- |--- |
|0|Jo|1295|
|1|Laurie|482|
|2|Beth|435|
|3|Amy|424|
|4|Meg|422|
|...|...|...|
|541|Gutenberg-tm|1|
|542|Project Gutenberg-tm's|1|
|543|Gregory B. Newby|1|
|544|Michael S. Hart|1|


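To get places instead of people, we can filter for the `GPE` label (countries, cities, states), this time with a single list comprehension instead of nested `for` loops.

In [None]: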
places = [
    named_entity.text
    for document in chunked_documents
    for named_entity in document.ents
    if named_entity.label_ == "GPE"
]

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

To write a dataframe to a CSV file, we can use `df.to_csv()`:

In [None]:
# Uncomment to save; at this point `df` holds the places tally
#df.to_csv("places.csv", encoding='utf-8', index=False)
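
To preview what the saved file will contain without writing to disk, the sketch below builds a small dataframe from a made-up tally and writes it to an in-memory buffer. `df.to_csv()` accepts a file-like object as well as a file path; the names and counts here are illustrative only.

```python
import io
from collections import Counter

import pandas as pd

# Made-up tally standing in for people_tally or places_tally
sample_tally = Counter(["Jo", "Laurie", "Jo"])
sample_df = pd.DataFrame(sample_tally.most_common(), columns=["character", "count"])

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # index=False leaves out the row numbers

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The first line holds the column names, and each following line holds one row of the dataframe.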