## Part-of-Speech Tagging

We're going to use spaCy to identify the parts of speech in a text.

In [None]:
#Imports
import spacy
from collections import Counter

In [None]:
#Download the language model you're interested in
#e.g. for Chinese: python -m spacy download zh_core_web_sm
#e.g. for Korean: python -m spacy download ko_core_news_sm
#Visit: https://spacy.io/usage/models#languages for more
!python -m spacy download ko_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('ko_core_news_sm')

#Create spaCy document
#(the with block closes the file once it has been read)
with open('korean-corpus.txt', encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

To get part-of-speech tags for every word in a document, we iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. We’ll also pull out the `.lemma_` attribute, which gives us the uninflected form of the word, and the `.dep_` attribute, which gives the token's syntactic dependency relation to its head word.

In [None]:
#Iterate through tokens in spacy document and retrieve for each token
#the lemmatized version of that token, 
#the POS label associated with it and the Dependency label associated with it
for token in document:
    print(token.lemma_, token.pos_, token.dep_)

If you inspect the output above, you may notice that the tags are not always correct, and that accuracy varies considerably between languages and models.
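
One way to make that inspection less ad hoc is to group lemmas by their predicted tag and skim each group for obvious mistakes. The sketch below assumes only that each token exposes `.lemma_` and `.pos_`, as spaCy tokens do. `group_lemmas_by_pos` is a hypothetical helper name, and the stand-in tokens are invented English placeholders so the cell runs without a language model.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

def group_lemmas_by_pos(tokens):
    """Map each POS tag to a Counter of the lemmas that received it."""
    groups = defaultdict(Counter)
    for token in tokens:
        groups[token.pos_][token.lemma_] += 1
    return dict(groups)

# Stand-in tokens invented for illustration; with spaCy you would pass `document`
toy_tokens = [
    SimpleNamespace(lemma_="bright", pos_="ADJ"),
    SimpleNamespace(lemma_="sun", pos_="NOUN"),
    SimpleNamespace(lemma_="bright", pos_="ADJ"),
    SimpleNamespace(lemma_="rise", pos_="VERB"),
]

groups = group_lemmas_by_pos(toy_tokens)
for pos_label, lemma_counts in groups.items():
    # Show the five most frequent lemmas under each tag
    print(pos_label, lemma_counts.most_common(5))
```

Skimming the most frequent lemmas under each tag tends to surface systematic mislabelings faster than reading the token-by-token printout.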

#### Finding all the adjectives in a text

In [None]:
"""
Create an empty list,
then loop over the tokens in the document
and append each token's text to the list if it is an adjective.

You can change the part-of-speech tag to whatever tag you're interested in,
e.g. adverbs (ADV), nouns (NOUN), pronouns (PRON), proper nouns (PROPN), etc.
"""
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)
adjs

In [None]:
#Count the most common adjectives
adjs_tally = Counter(adjs)
adjs_tally.most_common()

#### Finding the most common adjectives associated with a given keyword

In [None]:
#Make a list of (lemma, POS label) pairs for the alphabetic tokens in the document
tokens_and_labels = [(token.lemma_, token.pos_) for token in document if token.is_alpha]

In [None]:
#Define a function to return list of ngrams
def make_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens)-(n-1)):
        ngrams.append(tokens[i:i+n])
    return ngrams
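
To see what the sliding window produces, here is a small sketch on invented toy data. It restates `make_ngrams` as a list comprehension so the cell is self-contained; the English words and tags are placeholders, not output from the Korean model.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive items as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - (n - 1))]

# Invented (word, POS label) pairs for illustration
toy_tokens = [("the", "DET"), ("bright", "ADJ"), ("sun", "NOUN"), ("rose", "VERB")]

toy_ngrams = make_ngrams(toy_tokens, 3)
for ngram in toy_ngrams:
    print(ngram)
# Four tokens yield 4 - (3 - 1) = 2 windows of three tokens each
```

If `n` is larger than the number of tokens, the range is empty and the function returns an empty list.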

In [None]:
#Call your function
#Change the number to change your context window
#(i.e. how many words you want around the keyword)
ngrams = make_ngrams(tokens_and_labels, 6)
ngrams

In [None]:
#Define a function to return the most frequent words
#that appear in the same n-gram as a particular keyword
#and are a particular part of speech
def get_neighbor_words_and_labels(keyword, ngrams, pos_label=None):

    neighbor_words = []
    keyword = keyword.lower()

    for ngram in ngrams:
        words = [word.lower() for word, label in ngram]
        if keyword in words:
            for word, label in ngram:
                #Keep every word when no POS label is given
                if pos_label is None or label == pos_label:
                    neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()

In [None]:
#Call your function
#For example, look for the most common adjectives near the lemma '보+면'
#(the keyword must be a lemma, because the n-grams were built from token.lemma_)
get_neighbor_words_and_labels('보+면', ngrams, pos_label='ADJ')
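
The counts returned here are counts of windows, not of occurrences in the text: a word that sits close to the keyword is counted once for every n-gram that contains both. The sketch below shows this on invented English toy data (the words and tags are placeholders, not model output), with both functions restated so the cell runs on its own.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive items as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - (n - 1))]

def get_neighbor_words_and_labels(keyword, ngrams, pos_label=None):
    """Tally words sharing an n-gram with keyword, optionally filtered by POS."""
    neighbor_words = []
    keyword = keyword.lower()
    for ngram in ngrams:
        words = [word.lower() for word, label in ngram]
        if keyword in words:
            for word, label in ngram:
                if pos_label is None or label == pos_label:
                    neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()

# Invented (word, POS label) pairs for illustration
toy_tokens = [
    ("the", "DET"), ("bright", "ADJ"), ("sun", "NOUN"),
    ("warm", "VERB"), ("the", "DET"), ("cold", "ADJ"), ("sea", "NOUN"),
]
toy_ngrams = make_ngrams(toy_tokens, 3)

toy_result = get_neighbor_words_and_labels("sun", toy_ngrams, pos_label="ADJ")
print(toy_result)
# [('bright', 2)]
```

Although "bright" occurs once in the toy text, it falls inside two of the three windows that contain "sun", so its count is 2. "cold" never shares a window with "sun" and is not counted. Note also that the keyword itself is included in the tally whenever it matches `pos_label` (or when `pos_label` is `None`).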

## Named Entity Recognition

#### Finding all named entities in a document

In [None]:
#We can use `.ents` to pull out all the Named Entities spaCy recognizes in the document
document.ents

In [None]:
#Get Named Entities and their label
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

In [None]:
#Visualize all the Named Entities using displacy
from spacy import displacy
displacy.render(document, style="ent")

In [None]:
#Get only Named Entities of a certain type (e.g. locations, which the Korean model labels LC)
for named_entity in document.ents:
    if named_entity.label_ == 'LC':
        print(named_entity)

#### Finding the most frequent Named Entities of a given type

In [None]:
#Define a function that finds the most frequent Named Entities of a given label
def find_most_frequent_NE(doc, NE_label=None):

    named_entities = []

    #Read entities from the doc argument rather than the global document
    for named_entity in doc.ents:
        if NE_label is None or named_entity.label_ == NE_label:
            named_entities.append(named_entity.text)
    return Counter(named_entities).most_common()

In [None]:
#Call your function for a given NE label (here LC)
#Labels differ between models, so PERSON or DATE may not exist in this one
find_most_frequent_NE(document, NE_label='LC')
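
The same tally can be checked without loading a model. This is a minimal sketch that assumes only that the document exposes `.ents` and that each entity exposes `.text` and `.label_`, as spaCy's do. `count_entities` is a hypothetical name, and the stand-in entities and their labels are invented for illustration.

```python
from collections import Counter
from types import SimpleNamespace

def count_entities(doc, label=None):
    """Tally entity texts in doc.ents, optionally keeping only one label."""
    return Counter(
        ent.text for ent in doc.ents
        if label is None or ent.label_ == label
    ).most_common()

# Stand-in document invented for illustration; a real spaCy Doc works the same way
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Seoul", label_="LC"),
    SimpleNamespace(text="Busan", label_="LC"),
    SimpleNamespace(text="Seoul", label_="LC"),
    SimpleNamespace(text="2020", label_="DT"),
])

locations_only = count_entities(toy_doc, label="LC")
all_entities = count_entities(toy_doc)
print(locations_only)
print(all_entities)
```

Taking the document as a parameter, rather than reading a global variable, lets the function be reused on several documents in the same notebook.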

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Chinese/03-POS-Keywords-Chinese.html#keyword-extraction) and William Turkel and Adam Crymble's ["Keywords in Context (using n-grams) with Python"](https://programminghistorian.org/en/lessons/keywords-in-context-using-n-grams).