# Parts of Speech and Keywords in Context

-------------------
**Contents of this notebook**

[Parts of Speech Tagging](#section-7)
- [Finding all the adjectives in a text](#section-8)
- [Finding all the subjects in a text](#section-9)
- [Finding the most common adjectives associated with a given keyword](#section-10)

-------------------

<a id='section-7'></a>
## Parts of Speech Tagging

We're going to use spaCy to identify parts of speech in a text.

In [None]:
#Imports
import spacy
from collections import Counter

In [None]:
#Download the language model you're interested in
!python -m spacy download en_core_web_md

In [None]:
#Load language model
nlp = spacy.load('en_core_web_md')

#Create spaCy document
text = open('soderberg-corpus/1897_Drizzle.txt', encoding='utf-8').read()
document = nlp(text)

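To get a part-of-speech tag for every word in a document, we iterate through the document's tokens and read the `.pos_` attribute of each one. The `.dep_` attribute gives each token's syntactic dependency relation, that is, its grammatical role in the sentence (for example, subject or object of a verb). Finer-grained part-of-speech tags are available in the `.tag_` attribute.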

In [None]:
#Iterate through the tokens in the spaCy document and print, for each token,
#its text, its POS label and its dependency label
for token in document:
    print(token.text, token.pos_, token.dep_)

If you inspect the output above, you may notice that the labels are not always correct. Accuracy also varies considerably between languages and models.
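The labels in that output are abbreviations, and checking them is easier once you know what each one stands for. As a small sketch, spaCy's `spacy.explain` function returns a short description of a label from its built-in glossary. The four labels looked up here are only illustrative choices, and the `labels` and `descriptions` names are made up for this example.

```python
import spacy

# Illustrative labels only: two POS tags and two dependency labels
labels = ["ADJ", "PROPN", "nsubj", "dobj"]

# spacy.explain looks the label up in spaCy's glossary;
# no language model needs to be loaded for this
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```

`spacy.explain` returns `None` for a label it does not recognise, so a misspelled tag is easy to spot.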

<a id='section-8'></a>
#### Finding all the adjectives in a text

In [None]:
"""
Create an empty list, loop over the tokens in the document,
and append each token's text to the list if the token is an adjective.

You can change the part-of-speech tag to whatever tag you're interested in,
e.g. adverbs (ADV), nouns (NOUN), pronouns (PRON), proper nouns (PROPN), etc.
"""
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)
adjs

In [None]:
#Count the most common adjectives
adjs_tally = Counter(adjs)
adjs_tally.most_common()
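Note that `Counter` treats strings with different capitalisation as different items, so a sentence-initial "Grey" and a mid-sentence "grey" are tallied separately. The sketch below shows the effect on a small, made-up list (`sample_adjs` is hypothetical and not taken from the corpus). In the notebook itself, one option would be to append `token.text.lower()` instead of `token.text`.

```python
from collections import Counter

# Hypothetical adjective list standing in for `adjs`; not real corpus output
sample_adjs = ["Grey", "grey", "old", "Old", "grey", "little"]

# Case-sensitive tally: "Grey" and "grey" are counted as separate items
raw_tally = Counter(sample_adjs)

# Case-folded tally: lowercase every adjective before counting
folded_tally = Counter(adj.lower() for adj in sample_adjs)

print(raw_tally["grey"], raw_tally["Grey"])
print(folded_tally.most_common(2))
```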

<a id='section-9'></a>
#### Finding all the subjects in a text

In [None]:
"""
Create an empty list, loop over the tokens in the document,
and append each token's text to the list if the token is a nominal subject.

You can change the dependency tag to whatever tag you're interested in,
e.g. nominal subjects (nsubj), direct objects (dobj),
or indirect objects (labeled dative in spaCy's English models)
"""

subjs = []
for token in document:
    if token.dep_ == 'nsubj':
        subjs.append(token.text)
subjs

In [None]:
#Count the most common subjects
subjs_tally = Counter(subjs)
subjs_tally.most_common()

<a id='section-10'></a>
#### Finding the most common adjectives associated with a given keyword

In [None]:
#Make a list of (word, POS label) pairs for the alphabetic tokens in the document
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [None]:
#Define a function to return list of ngrams
def make_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens)-(n-1)):
        ngrams.append(tokens[i:i+n])
    return ngrams
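To see what the function returns, here is a small worked example. It repeats the same slicing logic as a list comprehension so that the block runs on its own, and applies it to a hand-tagged list (`sample` is made up for illustration and has the same shape as `tokens_and_labels`).

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive items in tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - (n - 1))]

# Hypothetical hand-tagged tokens as (text, POS label) pairs
sample = [("The", "DET"), ("pale", "ADJ"), ("sun", "NOUN"),
          ("rose", "VERB"), ("slowly", "ADV")]

# Five tokens give three overlapping trigrams
trigrams = make_ngrams(sample, 3)
for trigram in trigrams:
    print([word for word, label in trigram])

# A list shorter than n gives no n-grams at all
too_long = make_ngrams(sample, 6)
print(too_long)
```

Each n-gram overlaps the next by n-1 tokens, so a list of k tokens yields k-(n-1) n-grams, and none if k is smaller than n.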

In [None]:
#Call your function
#Change the number to change your context window
#(i.e. the length of each n-gram; a neighbor can be up to n-1 words from the keyword)
ngrams = make_ngrams(tokens_and_labels, 6)
ngrams

In [None]:
#Define a function to return the most frequent words
#that appear in the same n-gram as a particular keyword
#and have a particular part-of-speech label
def get_neighbor_words_and_labels(keyword, ngrams, pos_label=None):

    neighbor_words = []
    keyword = keyword.lower()

    for ngram in ngrams:
        words = [word.lower() for word, label in ngram]
        if keyword in words:
            for word, label in ngram:
                #With no pos_label given, keep every word in the n-gram
                if pos_label is None or label == pos_label:
                    neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()

In [None]:
#Call your function
#For example, look for the most common adjectives associated with 'sun'
get_neighbor_words_and_labels('sun', ngrams, pos_label='ADJ')
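Because the n-grams overlap, the counts returned here are not simple co-occurrence counts: a word is tallied once for every n-gram that contains both it and the keyword, so words closer to the keyword get higher counts. If you would rather count each neighbor once per occurrence of the keyword, a different technique is to take a fixed window around each occurrence. The sketch below is one way to do that. The function name `count_neighbors_in_window`, its `window` parameter and the `sample` data are all hypothetical and not part of the notebook above.

```python
from collections import Counter

def count_neighbors_in_window(keyword, tokens_and_labels, window=5, pos_label=None):
    """Count words within `window` positions of each occurrence of keyword.

    Hypothetical helper: each neighbor is counted once per keyword occurrence.
    """
    keyword = keyword.lower()
    neighbors = []
    for i, (word, label) in enumerate(tokens_and_labels):
        if word.lower() != keyword:
            continue
        # Tokens before and after position i, excluding the keyword itself
        start = max(0, i - window)
        context = tokens_and_labels[start:i] + tokens_and_labels[i + 1:i + 1 + window]
        for neighbor, neighbor_label in context:
            if pos_label is None or neighbor_label == pos_label:
                neighbors.append(neighbor.lower())
    return Counter(neighbors).most_common()

# Hypothetical hand-tagged tokens as (text, POS label) pairs
sample = [("The", "DET"), ("pale", "ADJ"), ("sun", "NOUN"), ("rose", "VERB"),
          ("over", "ADP"), ("the", "DET"), ("grey", "ADJ"), ("town", "NOUN")]

narrow = count_neighbors_in_window("sun", sample, window=2, pos_label="ADJ")
wide = count_neighbors_in_window("sun", sample, window=4, pos_label="ADJ")
print(narrow)
print(wide)
```

In the notebook, the same function could be called on `tokens_and_labels` directly, without building n-grams first.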

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Chinese/03-POS-Keywords-Chinese.html#keyword-extraction) and William Turkel and Adam Crymble's ["Keywords in Context (using n-grams) with Python"](https://programminghistorian.org/en/lessons/keywords-in-context-using-n-grams).