## Part-of-Speech Tagging & Keyword Extraction — Code

[Download relevant files](https://melaniewalsh.org/spacy.zip)

This notebook is a streamlined version of a previous lesson on **part-of-speech tagging**. It is primarily intended for those who want to reuse the code without the previous lessons' overview and explanations.

<img src="../images/Ada-Lovelace-NER.png" >

## Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy

We're also going to import `Counter` from the `collections` module for counting words later on, and the `pandas` library for organizing and displaying data. We're also raising the default pandas display limits for the maximum number of rows and the maximum column width.

In [2]:
from collections import Counter

In [3]:
import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)

## Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [5]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. When this lesson was written, languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese did not have their own NLP models; the list has grown since, so check the linked page for what is currently available. spaCy also offers language and tokenization support for many of these languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [6]:
nlp = spacy.load('en_core_web_sm')

## Create a Processed spaCy Document

To process a text, we read the file and pass its contents to `nlp()`:

`document = nlp(open(filepath, encoding='utf-8').read())`

In [7]:
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"

document = nlp(open(filepath, encoding='utf-8').read())
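
The one-liner above never closes the file explicitly. A minimal sketch of an alternative, assuming a hypothetical helper named `read_text_file` (it is not part of spaCy or of this lesson's files), reads the text with `pathlib` so the file is closed as soon as it has been read:

```python
from pathlib import Path


def read_text_file(filepath, encoding="utf-8"):
    """Return the full contents of a text file as a single string.

    Hypothetical helper for illustration. Path.read_text opens the file,
    reads everything, and closes the file again before returning.
    """
    return Path(filepath).read_text(encoding=encoding)


# Usage with this notebook's variables (needs the loaded `nlp` model):
# document = nlp(read_text_file(filepath))
```

The result is the same string that `open(filepath, encoding='utf-8').read()` returns, so the rest of the notebook works unchanged.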

## Get Part-Of-Speech Tags

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |


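To get part-of-speech tags for every word in a document, we iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. The `.dep_` attribute additionally gives each token's syntactic dependency label, which describes its grammatical relationship to other words in the sentence (for example, `nsubj` for a subject or `dobj` for a direct object).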


In [15]:
for token in document:
    print(token.text, token.pos_, token.dep_)

A DET det
gifted ADJ amod
mathematician NOUN ROOT
who PRON nsubjpass
is VERB auxpass
now ADV advmod
recognized VERB relcl
as ADP prep
the DET det
first ADJ amod
computer NOUN compound
programmer NOUN pobj
. PUNCT punct
By ADP ROOT
CLAIRE PROPN pobj
CAIN PROPN compound
MILLER ADV pobj


  SPACE 
A DET det
century NOUN npadvmod
before ADP prep
the DET det
dawn NOUN pobj
of ADP prep
the DET det
computer NOUN compound
age NOUN pobj
, PUNCT punct
Ada PROPN compound
Lovelace PROPN nsubj
imagined VERB ROOT
the DET det
modern ADJ amod
- PUNCT punct
day NOUN nmod
, PUNCT punct
general ADJ amod
- PUNCT punct
purpose NOUN compound
computer NOUN dobj
. PUNCT punct
It PRON nsubjpass
could VERB aux
be VERB auxpass
programmed VERB ccomp
to PART aux
follow VERB xcomp
instructions NOUN dobj
, PUNCT punct
she PRON nsubj
wrote VERB ROOT
in ADP prep
1843 NUM pobj
. PUNCT punct
It PRON nsubj
could VERB aux
not ADV neg
just ADV advmod
calculate VERB ROOT
but CCONJ cc
also ADV advmod
create VERB conj
, PUNCT
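
Printing every token gets long quickly, so it can help to summarize the tags instead. The sketch below tallies part-of-speech tags with `Counter`. To keep it runnable without a language model, it uses a hand-copied list of (text, tag) pairs taken from the printed output above for the sentence beginning "A century before the dawn of the computer age"; the variable names are illustrative.

```python
from collections import Counter

# (text, POS tag) pairs copied from the printed output above
tagged_sentence = [
    ("A", "DET"), ("century", "NOUN"), ("before", "ADP"), ("the", "DET"),
    ("dawn", "NOUN"), ("of", "ADP"), ("the", "DET"), ("computer", "NOUN"),
    ("age", "NOUN"), (",", "PUNCT"), ("Ada", "PROPN"), ("Lovelace", "PROPN"),
    ("imagined", "VERB"), ("the", "DET"), ("modern", "ADJ"), ("-", "PUNCT"),
    ("day", "NOUN"), (",", "PUNCT"), ("general", "ADJ"), ("-", "PUNCT"),
    ("purpose", "NOUN"), ("computer", "NOUN"), (".", "PUNCT"),
]

# Count how often each tag occurs, ignoring the token text
pos_tally = Counter(pos for _, pos in tagged_sentence)

# Print the tags from most to least frequent
for pos, count in pos_tally.most_common():
    print(pos, count)
```

With the processed `document` from this notebook, the same tally would be `Counter(token.pos_ for token in document)`.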

## Get Verbs

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| VERB  | verb                      | run, runs, running, eat, ate, eating          |

In [16]:
verbs = [token.text for token in document if token.pos_ == 'VERB']
verbs_tally = Counter(verbs)
df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

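|   | verb     | count |
|:-:|:--------:|:-----:|
| 0 | was      | 18    |
| 1 | could    | 11    |
| 2 | wrote    | 11    |
| 3 | had      | 8     |
| 4 | has      | 6     |
| 5 | have     | 5     |
| 6 | be       | 4     |
| 7 | said     | 4     |
| 8 | is       | 3     |
| 9 | imagined | 3     |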
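
Because the tally counts `token.text`, inflected forms such as "was", "is", and "be" land on separate rows. One way to merge them is to count base forms instead. The sketch below illustrates the idea on the counts from the table above, using a small hand-written mapping as a stand-in for lemmatization (an assumption made so the example runs without a language model).

```python
from collections import Counter

# Verb counts copied from the table above
verb_counts = {
    "was": 18, "could": 11, "wrote": 11, "had": 8, "has": 6,
    "have": 5, "be": 4, "said": 4, "is": 3, "imagined": 3,
}

# Hand-written stand-in for a lemmatizer (illustration only):
# maps some inflected forms of "be" and "have" to their base form
base_forms = {"was": "be", "is": "be", "had": "have", "has": "have"}

# Add each verb's count to its base form; unmapped verbs keep their own row
merged_counts = Counter()
for verb, count in verb_counts.items():
    merged_counts[base_forms.get(verb, verb)] += count

print(merged_counts.most_common(2))
```

In the notebook itself, swapping `token.text` for spaCy's `token.lemma_` attribute in the list comprehension should have a similar effect, with the model supplying the base forms.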


To write this dataframe to a CSV file, we can use `df.to_csv()`:

In [14]:
#df.to_csv("Lovelace-verbs.csv", encoding='utf-8', index=False)

## Get Keyword in Context

In [17]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [18]:
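def get_keyword_in_context(keyword, word_list, number_surrounding_words, pos_label=None):
    """Count the N-word phrases that come directly before or directly after `keyword`.

    `word_list` is a list of (word, POS label) pairs. If `pos_label` is given,
    only phrases containing a word with that label are counted.
    """
    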
    ngrams = []
    words_around_keyword = []
    adj_length_of_word_list = len(word_list) - (number_surrounding_words)
    
    keyword = keyword.lower()
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + (number_surrounding_words + 1)]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
    
    
    for word_label_pair in ngrams:
    
        words = [word.lower() for word, label in word_label_pair]
        #Collect the POS labels of the surrounding words, skipping the keyword itself
        labels = [label for word, label in word_label_pair if word.lower() != keyword]

        if keyword in words:

            if pos_label is None or pos_label in labels:
                #Keyword opens the window: keep the words that follow it
                if words[0] == keyword:
                    words_around_keyword.append(" ".join(words[1:]))

                #Keyword closes the window: keep the words that precede it
                elif words[number_surrounding_words] == keyword:
                    words_around_keyword.append(" ".join(words[:number_surrounding_words]))
    
    words_around_keyword = [word.lower() for word in words_around_keyword]
    
    return Counter(words_around_keyword).most_common()
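
The function above works in two steps: it slices the word list into overlapping windows, then keeps a window's other words whenever the keyword sits at the very start or very end of it. A stripped-down sketch of that logic on five words from the obituary's opening may make this easier to follow. The helper name `sliding_windows` is hypothetical, and POS labels are left out for brevity.

```python
def sliding_windows(items, window_size):
    """Return every run of `window_size` consecutive items as a list."""
    last_start = len(items) - window_size + 1
    return [items[i:i + window_size] for i in range(last_start)]


# Lowercased words from "...the first computer programmer. By..."
words = ["the", "first", "computer", "programmer", "by"]
keyword = "computer"

# Two surrounding words means windows of three words
windows = sliding_windows(words, 3)

contexts = []
for window in windows:
    if window[0] == keyword:
        # Keyword is first: keep the words after it
        contexts.append(" ".join(window[1:]))
    elif window[-1] == keyword:
        # Keyword is last: keep the words before it
        contexts.append(" ".join(window[:-1]))

print(contexts)  # ['the first', 'programmer by']
```

The middle window ("first computer programmer") is skipped because the keyword is neither first nor last, which is why each result holds words from only one side of the keyword.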

In [21]:
get_keyword_in_context("computer", tokens_and_labels, number_surrounding_words=2)

[('the first', 2),
 ('programmer by', 1),
 ('of the', 1),
 ('age ada', 1),
 ('general purpose', 1),
 ('it could', 1),
 ('leaves the', 1),
 ('she was', 1),
 ('programmer the', 1),
 ('what a', 1),
 ('could do', 1),
 ('contribution to', 1),
 ('science she', 1),
 ('how the', 1),
 ('would work', 1),
 ('martin a', 1),
 ('scientist at', 1)]

In [28]:
get_keyword_in_context("computer", tokens_and_labels, number_surrounding_words=2, pos_label="ADJ")

[('the first', 2), ('general purpose', 1)]

In [27]:
get_keyword_in_context("women", tokens_and_labels, number_surrounding_words=2)

[('celebration of', 1),
 ('in technology', 1),
 ('lived when', 1),
 ('were not', 1),
 ('industry where', 1),
 ('are severely', 1)]