In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Tokenization

Tokenization is the process of splitting text into minimal meaningful units such as words, punctuation marks, and symbols. For example, the sentence "We live in Paris" can be tokenized into four tokens: We, live, in, Paris. Tokenization is typically the first step of an NLP pipeline, because every later step operates on tokens.
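To make the idea concrete without any library, here is a minimal sketch of a naive tokenizer built on a regular expression. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions); the name `naive_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re
from typing import List

# Naive rule (an assumption for this sketch): a token is either a run of
# word characters or a single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


tokens = naive_tokenize("We live in Paris")
print("Number of tokens:", len(tokens))
print(tokens)

# Punctuation becomes its own token.
print(naive_tokenize("Hello, world!"))
```

A pattern this simple breaks down quickly: it would split "man's" into three pieces (man, ', s), whereas a rule-aware tokenizer such as spaCy's keeps the possessive marker together as 's.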

In [6]:
# Small example: nlp.tokenizer runs only the tokenizer and returns a Doc
doc = nlp.tokenizer("We live in Paris")
print("Number of tokens:", len(doc))

print("The tokens:")
for token in doc:
    print(token)

Number of tokens: 4
The tokens:
We
live
in
Paris



In [8]:
# Load the Jeopardy questions
import os

import pandas as pd

# Build the path with os.path.join so the separator is correct on any OS
data = pd.read_csv(os.path.join(os.getcwd(), "JEOPARDY_CSV_reduced.csv"))

# Lowercase the column names and strip surrounding whitespace
data.columns = data.columns.str.lower().str.strip()

# Keep the first 1,000 rows; .copy() avoids a SettingWithCopyWarning
# when the new column is added below
data = data.head(1000).copy()

# Run the spaCy pipeline on each question; each result is a Doc of tokens
data["question_tokens"] = data["question"].apply(nlp)

In [10]:
# View the first question and its tokens
example_question = data["question"].iloc[0]
example_question_tokens = data["question_tokens"].iloc[0]
print("The first question is:")
print(example_question)

# Individual tokens of the first question
print("The tokens from the first question are:")
for token in example_question_tokens:
    print(token)

The first question is:
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
The tokens from the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory


# Part-of-Speech (POS) tagging

Part-of-speech (POS) tagging is the process of assigning a word type to each token, such as noun, pronoun, verb, adverb, adjective, conjunction, preposition, or interjection. For "We live in Paris", the parts of speech are: pronoun, verb, preposition, and noun. The tag gives each token a bit more metadata, which makes it easier for the machine to work out how the tokens relate to one another.

In [11]:
# Part-of-speech tags for tokens in the first question

print("Here are the part-of-speech tags for each token in the first question:")
for token in example_question_tokens:
    print(token.text, token.pos_, spacy.explain(token.pos_))

Here are the part-of-speech tags for each token in the first question:
For ADP adposition
the DET determiner
last ADJ adjective
8 NUM numeral
years NOUN noun
of ADP adposition
his PRON pronoun
life NOUN noun
, PUNCT punctuation
Galileo PROPN proper noun
was AUX auxiliary
under ADP adposition
house NOUN noun
arrest NOUN noun
for ADP adposition
espousing VERB verb
this DET determiner
man NOUN noun
's PART particle
theory NOUN noun


# Dependency Parsing

Dependency parsing labels the syntactic relationship between each token and the token it depends on (its head), which assigns a syntactic structure to the sentence. Once the relationships are labeled, the entire sentence can be represented as a tree of token-to-token relationships. The token labeled ROOT is the head of the whole sentence.

In [15]:
# Dependency Parsing tags for tokens in the first question

for token in example_question_tokens:
    print(token.text, token.dep_, spacy.explain(token.dep_))
    
# Visualize the dependency parse
from spacy import displacy

displacy.render(example_question_tokens, style='dep', jupyter=True, options={'distance': 100})

For prep prepositional modifier
the det determiner
last amod adjectival modifier
8 nummod numeric modifier
years pobj object of preposition
of prep prepositional modifier
his poss possession modifier
life pobj object of preposition
, punct punctuation
Galileo nsubj nominal subject
was ROOT None
under prep prepositional modifier
house compound compound
arrest pobj object of preposition
for prep prepositional modifier
espousing pcomp complement of preposition
this det determiner
man poss possession modifier
's case case marking
theory dobj direct object


# Chunking

Chunking involves combining related tokens into a single unit, creating noun groups, verb groups, etc. For example, "New York City" could be treated as a single chunk instead of as three separate tokens. spaCy exposes noun groups through the `noun_chunks` attribute of a parsed document.

In [19]:
# Compare individual tokens with noun chunks for the same sentence
sentence = "My parents live in New York City"
doc = nlp(sentence)  # parse once and reuse the Doc

print("Only tokens:")
for token in doc:
    print(token.text)

print("\nChunks:")
for chunk in doc.noun_chunks:
    print(chunk.text)

Only tokens:
My
parents
live
in
New
York
City

Chunks:
My parents
New York City


# Lemmatization

Lemmatization is the process of converting words into their base (dictionary) forms. For example, lemmatization converts "horses" to "horse", "slept" to "sleep", and "biggest" to "big". Mapping the inflected forms of a word to one lemma shrinks the vocabulary the machine has to handle, which simplifies later text processing.

# Stemming

Stemming is a process related to lemmatization, but simpler. Stemming reduces words to their word stems. Stemming algorithms are typically rule-based. For example, the word "biggest" would be reduced to "big", but the word "slept" would not be reduced at all. Stemming sometimes results in nonsensical subwords, and we prefer lemmatization to stemming for this reason.
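The contrast between the two approaches can be sketched in a few lines. The stemmer below is a toy with made-up suffix rules (it is not the Porter stemmer or any library's algorithm), and `LEMMA_LOOKUP` is a hypothetical three-entry dictionary standing in for the vocabulary knowledge a real lemmatizer has.

```python
# Toy suffix rules, checked in order (assumed for illustration only).
SUFFIXES = ("est", "ing", "es", "ed", "s")

# Hypothetical lookup table standing in for a lemmatizer's vocabulary.
LEMMA_LOOKUP = {"horses": "horse", "slept": "sleep", "biggest": "big"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(stem) >= 3:
            # "bigg" -> "big": drop a doubled final consonant left by the rule.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the lowercased word if it is unknown."""
    return LEMMA_LOOKUP.get(word.lower(), word.lower())


for word in ["biggest", "slept", "horses"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

With these rules, "biggest" stems to "big", "slept" is left unchanged because no suffix rule matches the irregular form, and "horses" stems to the non-word "hors". The lookup-based function returns "big", "sleep", and "horse" for the same inputs.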

In [22]:
# Lemmas for the tokens in the first question

lemmatization = pd.DataFrame(
    [(token.text, token.lemma_) for token in example_question_tokens],
    columns=["original", "lemmatized"],
)

print(lemmatization)

     original lemmatized
0         For        for
1         the        the
2        last       last
3           8          8
4       years       year
5          of         of
6         his        his
7        life       life
8           ,          ,
9     Galileo    Galileo
10        was         be
11      under      under
12      house      house
13     arrest     arrest
14        for        for
15  espousing    espouse
16       this       this
17        man        man
18         's         's
19     theory     theory


# Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of finding spans of text that refer to real-world objects (entities) and assigning each a label such as person, organization, location, date, or currency. In "We live in Paris", "Paris" would be marked as a location.

In [24]:
# Print NER results
example_sentence = "George Washington was an American political leader, military general, statesman, and Founding Father of the United States, who served as the first president of the United States from 1789 to 1797.\n"

print(example_sentence)

print("Text Start End Label")
doc = nlp(example_sentence)
for ent in doc.ents:
    # Each entity is a span of one or more tokens, with character offsets
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Highlight the entities inline
displacy.render(doc, style="ent", jupyter=True)

George Washington was an American political leader, military general, statesman, and Founding Father of the United States, who served as the first president of the United States from 1789 to 1797.

Text Start End Label
George Washington 0 17 PERSON
American 25 33 NORP
the United States 104 121 GPE
first 141 146 ORDINAL
the United States 160 177 GPE
1789 to 1797 183 195 DATE
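
The Start and End columns in the output above are character offsets into the original string, with End exclusive, so slicing the sentence with them recovers each entity's text. The sketch below checks this in plain Python using the rows printed above; the variable names are assumptions for illustration.

```python
sentence = (
    "George Washington was an American political leader, military general, "
    "statesman, and Founding Father of the United States, who served as the "
    "first president of the United States from 1789 to 1797."
)

# (text, start, end, label) rows copied from the NER output above.
entities = [
    ("George Washington", 0, 17, "PERSON"),
    ("American", 25, 33, "NORP"),
    ("the United States", 104, 121, "GPE"),
    ("first", 141, 146, "ORDINAL"),
    ("the United States", 160, 177, "GPE"),
    ("1789 to 1797", 183, 195, "DATE"),
]

# Slicing with [start:end] should give back exactly the entity text.
recovered = [sentence[start:end] for _, start, end, _ in entities]
for (text, start, end, label), span in zip(entities, recovered):
    print(f"[{start}:{end}] {span!r} ({label}) matches: {span == text}")
```

Because the offsets index into the raw string rather than the token list, they stay valid even if the text is later tokenized differently.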


# Named Entity Linking (NEL)

Entity linking is the process of disambiguating entities against an external database, so that each mention in the text, in whatever form it appears, is linked to a single database entry. This is important both for entity resolution applications (e.g., deduplicating datasets) and for information retrieval applications. For example, we would want to resolve every mention of "George W. Bush" to the entry for George W. Bush, and not to the entry for George H. W. Bush, his father and also a former US President.
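
As a minimal sketch of the idea, the snippet below links mentions to entries in a tiny in-memory knowledge base by normalizing the mention and looking it up in an alias index. The knowledge base, its IDs (`KB:george_w_bush`, `KB:george_h_w_bush`), and the helper names are all hypothetical; a real linker would also use the surrounding context to choose between candidates.

```python
import re
from typing import Optional

# Hypothetical knowledge base: the IDs and alias lists are made up for this sketch.
KNOWLEDGE_BASE = {
    "KB:george_w_bush": ["George W. Bush", "George Walker Bush"],
    "KB:george_h_w_bush": ["George H. W. Bush", "George Herbert Walker Bush"],
}


def normalize(mention: str) -> str:
    """Lowercase, replace periods with spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", mention.replace(".", " ")).strip().lower()


# Map every normalized alias to the ID of the entry it belongs to.
ALIAS_INDEX = {
    normalize(alias): kb_id
    for kb_id, aliases in KNOWLEDGE_BASE.items()
    for alias in aliases
}


def link_entity(mention: str) -> Optional[str]:
    """Return the knowledge-base ID for a mention, or None if it is unknown."""
    return ALIAS_INDEX.get(normalize(mention))


for mention in ["George W. Bush", "george w bush", "George H.W. Bush", "Bush"]:
    print(f"{mention!r} -> {link_entity(mention)}")
```

Exact alias matching keeps the two presidents apart, but it returns `None` for the bare mention "Bush", which is ambiguous between the two entries. Resolving such mentions is where context-aware disambiguation becomes necessary.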