# LING 242 Python Lecture 7: Text processing

* Sentence segmentation
* Tokenization
* Lemmatization and Stemming
* POS tagging
* All-in-one preprocessing with SpaCy
* T/F questions

## Sentence segmentation

For many computational linguistics applications, the sentence is a key unit of processing. Most modern written languages have an end-of-sentence marker. For some languages, like Chinese, this marker is unambiguous, so splitting a text into sentences is very easy.

In [1]:
zh_text = "你好。我叫J。我是老师。"

zh_sents = zh_text.split("。")
zh_sents

['你好', '我叫J', '我是老师', '']
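Note the empty string at the end of the list, left over from the final 。. Here is a minimal sketch of one way to tidy this up. It assumes we also want to treat ！ and ？ as sentence-final, which goes beyond the example above, and the name zh_sents_clean is just for illustration.

```python
import re

zh_text = "你好。我叫J。我是老师。"

# Split on any of the three (assumed) sentence-final marks,
# then drop the empty string left after the last one.
zh_sents_clean = [s for s in re.split(r"[。！？]", zh_text) if s]
print(zh_sents_clean)
```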

However, English is not so easy because the period (".") is ambiguous: besides ending sentences, it also appears in abbreviations such as "U.S.". This is true for many other European languages as well.

In [2]:
en_text = "The luxury auto maker last year sold 1,214 cars in the U.S. Howard Mosher, president and chief executive officer, said he anticipates growth for the luxury auto maker in Britain and Europe, and in Far Eastern markets."

for sent in en_text.split("."):
    print(sent)


The luxury auto maker last year sold 1,214 cars in the U
S
 Howard Mosher, president and chief executive officer, said he anticipates growth for the luxury auto maker in Britain and Europe, and in Far Eastern markets



There are ways to improve things considerably. 
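As a rough illustration of one such improvement, the sketch below only splits after ".", "!" or "?" when whitespace and an uppercase letter follow, and skips a short list of abbreviations that rarely end a sentence. The function name simple_sent_split and the list NON_FINAL_ABBREVS are made up for this example, and the heuristic will still make mistakes on real text.

```python
import re

# Hypothetical, deliberately tiny list of abbreviations that rarely end a sentence.
NON_FINAL_ABBREVS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "e.g.", "i.e."}

def simple_sent_split(text):
    """Split after . ! ? when followed by whitespace and an uppercase letter."""
    sents = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1]
        if last_token in NON_FINAL_ABBREVS:
            continue  # probably an abbreviation, so keep reading
        sents.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sents.append(tail)
    return sents

demo = "Dr. Smith sold 1,214 cars in the U.S. Howard Mosher was pleased. Sales grew."
for s in simple_sent_split(demo):
    print(s)
```

Periods inside "1,214" or "U.S" are never split on, because no whitespace follows them. The final period of "U.S." is treated as a sentence end here only because "Howard" happens to be capitalized, which is exactly the kind of case where a heuristic like this can go wrong.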

For major languages with significant ambiguity, you'll probably want to use a dedicated sentence splitter, which NLTK has for some languages. But don't expect perfection!

In [3]:
import nltk

from nltk import sent_tokenize

sent_tokenize(en_text)

['The luxury auto maker last year sold 1,214 cars in the U.S. Howard Mosher, president and chief executive officer, said he anticipates growth for the luxury auto maker in Britain and Europe, and in Far Eastern markets.']

In [4]:
import nltk
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

tokenizer.tokenize(en_text)

['The luxury auto maker last year sold 1,214 cars in the U.S.',
 'Howard Mosher, president and chief executive officer, said he anticipates growth for the luxury auto maker in Britain and Europe, and in Far Eastern markets.']

## Tokenization

Again, breaking a sentence up into words seems like it should be easy, but actually rarely is. Spaces separate most word tokens, but punctuation is a problem.


In [5]:
sents = tokenizer.tokenize(en_text)
sent = sents[1]
sent


'Howard Mosher, president and chief executive officer, said he anticipates growth for the luxury auto maker in Britain and Europe, and in Far Eastern markets.'

In [6]:
sent.split(" ")

['Howard',
 'Mosher,',
 'president',
 'and',
 'chief',
 'executive',
 'officer,',
 'said',
 'he',
 'anticipates',
 'growth',
 'for',
 'the',
 'luxury',
 'auto',
 'maker',
 'in',
 'Britain',
 'and',
 'Europe,',
 'and',
 'in',
 'Far',
 'Eastern',
 'markets.']

In [7]:
import re
# raw string, so that \w reaches the regex engine rather than being read as a string escape
[match.group() for match in re.finditer(r"\w+", sent)]

['Howard',
 'Mosher',
 'president',
 'and',
 'chief',
 'executive',
 'officer',
 'said',
 'he',
 'anticipates',
 'growth',
 'for',
 'the',
 'luxury',
 'auto',
 'maker',
 'in',
 'Britain',
 'and',
 'Europe',
 'and',
 'in',
 'Far',
 'Eastern',
 'markets']

In computational linguistics for English, *clitics* (small words that are phonologically joined to a host word) such as "n't" and "'s" are often treated as separate words, and need to be dealt with specially. A proper word tokenizer (such as the one included in NLTK) will deal with these subtle issues.

In [8]:
from nltk import word_tokenize

word_tokenize(sent)

['Howard',
 'Mosher',
 ',',
 'president',
 'and',
 'chief',
 'executive',
 'officer',
 ',',
 'said',
 'he',
 'anticipates',
 'growth',
 'for',
 'the',
 'luxury',
 'auto',
 'maker',
 'in',
 'Britain',
 'and',
 'Europe',
 ',',
 'and',
 'in',
 'Far',
 'Eastern',
 'markets',
 '.']
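To get a feel for what such a tokenizer has to do with clitics and punctuation, here is a rough regular-expression sketch. It is not NLTK's implementation. The name simple_word_tokenize and the list of clitics are assumptions for illustration, and it only covers a few common English cases.

```python
import re

# Alternatives are tried in order at each position:
#   1. the clitic n't
#   2. apostrophe clitics such as 's, 're, 've, 'll, 'd, 'm
#   3. a word that is immediately followed by n't (so "doesn't" -> "does" + "n't")
#   4. any other run of word characters
#   5. a single punctuation character
TOKEN_RE = re.compile(r"n't|'(?:s|re|ve|ll|d|m)\b|\w+(?=n't)|\w+|[^\w\s]")

def simple_word_tokenize(text):
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("She doesn't like Howard's car."))
```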

In some languages (such as Chinese) there are no spaces between words. Much more sophisticated word segmenters are needed in this case, for example [jieba](https://github.com/fxsjy/jieba) for Chinese (you'll need to install the package to run this code). One challenge for these languages is that it isn't always clear what a word is!

In [9]:
# !pip install jieba
from jieba import cut
zh_text = "分词是小菜一碟" # tokenization is a piece of cake

print(" ".join(cut(zh_text)))

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/n9/3r6hrp015t9d1n6m8l36t0yh0000gn/T/jieba.cache
Loading model cost 0.301 seconds.
Prefix dict has been built successfully.


分词 是 小菜一碟
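To see why a dictionary helps, here is a sketch of a classic baseline, greedy forward maximum matching: at each position, take the longest dictionary entry that matches, and fall back to a single character. This is not how jieba works internally, which is more sophisticated. The function max_match and the toy vocabulary are made up for this example.

```python
def max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy vocabulary (an assumption; a real segmenter uses a large dictionary).
toy_vocab = {"分词", "小菜", "一碟", "小菜一碟"}
print(" ".join(max_match("分词是小菜一碟", toy_vocab)))
```

Note that the vocabulary contains both 小菜 and 小菜一碟. Which one counts as "the word" depends on the dictionary and the matching strategy, which is one reason word boundaries in these languages are not always clear-cut.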


## Lemmatization and Stemming

For some applications, it can be useful to ignore the morphological differences between words. A classic example is information retrieval (e.g. web search); if a user looks for "sleeping kitties", you might want to include in your results a page which mentions that "the kitty sleeps". By default, though, "kitty" and "kitties", and "sleeps" and "sleeping", are completely different word types.

In [10]:
text = ["The", "kitty", "sleeps"]

In [11]:
"sleeping" in text

False

In [12]:
"kitties" in text

False

Lemmatization converts an inflected form to a base, uninflected form. This decreases the size of your vocabulary (eliminating rare forms), and is often useful if you want to calculate statistics using lexicons but don't want to store all the possible forms. NLTK has a lemmatizer for English that uses the WordNet lexicon (more on WordNet in COLX 561). One tricky aspect is that it needs a part of speech to work well (if you don't give one, it treats the word as a noun). It works for both regular and irregular forms.

In [13]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
lemmatizer.lemmatize("kitties", "n")

'kitty'

In [15]:
lemmatizer.lemmatize("sleeping", "v")

'sleep'

In [16]:
lemmatizer.lemmatize("women", "n")

'woman'

In [17]:
lemmatizer.lemmatize("sleep", "v")

'sleep'

If we don't want to POS tag, trying to lemmatize as a noun first, then as a verb, will work in most cases, but not all:

In [18]:
def lemmatize(word):
    lemma = lemmatizer.lemmatize(word, "n")
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, "v")
    return lemma

In [19]:
lemmatize("kitties")

'kitty'

In [20]:
lemmatize("sleeping")

'sleep'

In [21]:
lemmatize("does")

'doe'

In [22]:
lemmatizer.lemmatize("does", "v")

'do'
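If we do have POS tags, the "does" problem goes away, but the tags first have to be converted to the single-letter codes the lemmatizer expects. Below is a minimal sketch of such a mapping from Penn Treebank tags. The function name penn_to_wordnet is hypothetical, and the sentence is tagged by hand rather than by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter: 'n', 'v', 'a' or 'r'."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else is treated as a noun

# Hand-tagged example (an assumption, not tagger output).
hand_tagged = [("she", "PRP"), ("does", "VBZ"), ("kitties", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in hand_tagged])
```

The resulting letter can be passed as the second argument to lemmatizer.lemmatize, so lemmatizer.lemmatize("does", penn_to_wordnet("VBZ")) is the same call as the one in the cell above.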

Stemming has a similar purpose, but it strips off both inflectional and derivational morphology to reach a stem. This stem is not always itself a word. Sometimes stemming incorrectly collapses (conflates) words with very different meanings, or fails to collapse related words. The most popular stemming algorithm for English is the [Porter stemmer](http://snowball.tartarus.org/algorithms/porter/stemmer.html), which applies a series of rewrite rules to get rid of common suffixes.

In [23]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [24]:
stemmer.stem("sleeping")

'sleep'

In [25]:
stemmer.stem("kitty")

'kitti'

In [26]:
stemmer.stem("kitties")

'kitti'

In [27]:
S1 = "automatization"
S2 = "automatic"
S3 = "has"
S4 = "have"

In [28]:
stemmer.stem(S1)

'automat'

In [29]:
stemmer.stem(S2)

'automat'

In [30]:
stemmer.stem(S3)

'ha'

In [31]:
stemmer.stem(S4)

'have'
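To give a feel for what "rewrite rules" means, here is a toy suffix stripper. It is not the Porter algorithm, which has several ordered steps and conditions on the shape of the stem. The rule list and the name toy_stem are invented for illustration, with the first two rules loosely modeled on Porter's first step.

```python
# Toy (suffix, replacement) rules, tried in order. NOT the real Porter rule set.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in RULES:
        # Require at least three characters of stem so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["kitties", "sleeping", "sleeps", "has", "caresses"]:
    print(w, "->", toy_stem(w))
```

Even these few rules reproduce "kitti" for "kitties", and the minimum-length check is the only thing stopping "has" from becoming "ha". Rules this crude over-strip easily, which is part of why real stemmers add conditions on the remaining stem.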

Let's try both lemmatizing and stemming the sentence below (after tokenizing), and compare the results.

In [32]:
S = "Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people"
words = word_tokenize(S)
lemmas = [lemmatize(word) for word in words]
print(lemmas)
stems = [stemmer.stem(word) for word in words]
print(stems)

['Whereas', 'disregard', 'and', 'contempt', 'for', 'human', 'right', 'have', 'result', 'in', 'barbarous', 'act', 'which', 'have', 'outrage', 'the', 'conscience', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'being', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and', 'belief', 'and', 'freedom', 'from', 'fear', 'and', 'want', 'ha', 'be', 'proclaim', 'a', 'the', 'highest', 'aspiration', 'of', 'the', 'common', 'people']
['wherea', 'disregard', 'and', 'contempt', 'for', 'human', 'right', 'have', 'result', 'in', 'barbar', 'act', 'which', 'have', 'outrag', 'the', 'conscienc', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'be', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and', 'belief', 'and', 'freedom', 'from', 'fear', 'and', 'want', 'ha', 'been', 'proclaim', 'as', 'the', 'highest', 'aspir', 'of', 'the', 'common', 'peopl']


## Part of Speech Tagging

We've already seen that POS tagging can be useful for doing analysis of corpora. It has other uses: for instance, it can be used to focus on particular kinds of words for certain applications, and it can provide simple word sense disambiguation (e.g. the word "cross", which can be a noun, a verb, or an adjective). The NLTK POS tagger for English is easy to use and effective.

In [33]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [34]:
from nltk import pos_tag

S = "I think the NLTK pos tagger is a pretty solid tool for doing computational linguistics"
print(pos_tag(S.split(" ")))

[('I', 'PRP'), ('think', 'VBP'), ('the', 'DT'), ('NLTK', 'NNP'), ('pos', 'NN'), ('tagger', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('pretty', 'RB'), ('solid', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('doing', 'VBG'), ('computational', 'JJ'), ('linguistics', 'NNS')]
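As a small illustration of "focusing on particular kinds of words", the sketch below keeps only the nouns. To stay self-contained it starts from the tagged list printed above, copied in by hand, rather than calling the tagger again.

```python
# Tagged output copied from the cell above.
tagged = [('I', 'PRP'), ('think', 'VBP'), ('the', 'DT'), ('NLTK', 'NNP'),
          ('pos', 'NN'), ('tagger', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
          ('pretty', 'RB'), ('solid', 'JJ'), ('tool', 'NN'), ('for', 'IN'),
          ('doing', 'VBG'), ('computational', 'JJ'), ('linguistics', 'NNS')]

# All Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```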


The standard tagset for English, used by the NLTK tagger (and most others), is the one created as part of the Penn Treebank annotation project, defined [here](https://www.anc.org/penn.html). We can also POS tag using the universal tagset if we like.

In [35]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [36]:
pos_tag(S.split(" "), tagset="universal")

[('I', 'PRON'),
 ('think', 'VERB'),
 ('the', 'DET'),
 ('NLTK', 'NOUN'),
 ('pos', 'NOUN'),
 ('tagger', 'NOUN'),
 ('is', 'VERB'),
 ('a', 'DET'),
 ('pretty', 'ADV'),
 ('solid', 'ADJ'),
 ('tool', 'NOUN'),
 ('for', 'ADP'),
 ('doing', 'VERB'),
 ('computational', 'ADJ'),
 ('linguistics', 'NOUN')]

We can also use NLTK to build a POS tagger for any language where we have a manually tagged corpus and/or morphological information which can be expressed in the form of regular expressions.

In [37]:
from nltk import RegexpTagger, UnigramTagger
from nltk.corpus import treebank 

patterns = [(r"\d+", "CD"),(r".*ing$", "VBG"), (r".*ed$", "VBD"),(r".*s$", "NNS"),(r".*", "NN")]
sentence = ["he", "googled", "cats"]
tagged = treebank.tagged_sents()

re_tagger= RegexpTagger(patterns)
uni_tagger= UnigramTagger(tagged, backoff=re_tagger)
print(pos_tag(sentence))
print(uni_tagger.tag(sentence))

[('he', 'PRP'), ('googled', 'VBD'), ('cats', 'NNS')]
[('he', 'PRP'), ('googled', 'VBD'), ('cats', 'NNS')]
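To make the idea of a unigram tagger with a regex backoff concrete, here is a plain-Python sketch of the same logic: look the word up in a table of most-frequent tags, and if it is unseen, fall back to the first regex pattern that matches. This is a simplified stand-in, not NLTK's implementation, and the three-sentence training corpus and the function names are invented for the example.

```python
import re
from collections import Counter, defaultdict

PATTERNS = [(r"\d+", "CD"), (r".*ing$", "VBG"), (r".*ed$", "VBD"),
            (r".*s$", "NNS"), (r".*", "NN")]

def regex_tag_word(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in PATTERNS:
        if re.match(pattern, word):
            return tag

def train_unigram(tagged_sents):
    """For every word seen in training, remember its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def toy_tag(tokens, model):
    """Use the unigram table when possible, otherwise back off to the regexes."""
    return [(tok, model.get(tok) or regex_tag_word(tok)) for tok in tokens]

# Tiny hand-made training corpus (an assumption, standing in for the treebank).
toy_corpus = [[("he", "PRP"), ("sleeps", "VBZ")],
              [("he", "PRP"), ("runs", "VBZ")],
              [("she", "PRP"), ("runs", "VBZ")]]
model = train_unigram(toy_corpus)

print(toy_tag(["he", "googled", "cats"], model))
print(regex_tag_word("he"))  # what the regexes alone would say
```

The last line shows why the lookup comes first: the regexes alone would call "he" a noun, since only the catch-all pattern matches it.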


Let's use NLTK to tokenize the sentences from en_text, and then, for each sentence, compare the tags from the main NLTK POS tagger and the simple one we just built. Are there many differences? When there are, which do you think is correct?

In [38]:
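for sent in sents:
    tokens = word_tokenize(sent)
    nltk_tags = pos_tag(tokens)        # NLTK's default tagger
    my_tags = uni_tagger.tag(tokens)   # our unigram tagger with regex backoff
    # print each token, followed by the NLTK tag and then our tag
    for (token, nltk_tag), (_, my_tag) in zip(nltk_tags, my_tags):
        print(token)
        print(nltk_tag)
        print(my_tag)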

The
DT
DT
luxury
NN
NN
auto
NN
NN
maker
NN
NN
last
JJ
JJ
year
NN
NN
sold
VBD
VBN
1,214
CD
CD
cars
NNS
NNS
in
IN
IN
the
DT
DT
U.S
NNP
NN
.
.
.
Howard
NNP
NNP
Mosher
NNP
NN
,
,
,
president
NN
NN
and
CC
CC
chief
JJ
NN
executive
NN
NN
officer
NN
NN
,
,
,
said
VBD
VBD
he
PRP
PRP
anticipates
VBZ
VBZ
growth
NN
NN
for
IN
IN
the
DT
DT
luxury
NN
NN
auto
NN
NN
maker
NN
NN
in
IN
IN
Britain
NNP
NNP
and
CC
CC
Europe
NNP
NNP
,
,
,
and
CC
CC
in
IN
IN
Far
NNP
RB
Eastern
NNP
NNP
markets
NNS
NNS
.
.
.
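Scanning three lines per token is tedious, so it can help to print only the disagreements. The helper below is a sketch (the name tag_disagreements is made up). To keep it self-contained, it is shown on a handful of (token, tag) pairs copied from the output above rather than on live tagger output.

```python
def tag_disagreements(tagged_a, tagged_b):
    """Return (token, tag_a, tag_b) for every position where the two tags differ."""
    return [(word, a, b)
            for (word, a), (_, b) in zip(tagged_a, tagged_b)
            if a != b]

# A few pairs copied from the output above: NLTK's tags, then our tagger's tags.
nltk_tags = [("sold", "VBD"), ("1,214", "CD"), ("cars", "NNS"),
             ("in", "IN"), ("the", "DT"), ("U.S", "NNP")]
my_tags = [("sold", "VBN"), ("1,214", "CD"), ("cars", "NNS"),
           ("in", "IN"), ("the", "DT"), ("U.S", "NN")]

for word, a, b in tag_disagreements(nltk_tags, my_tags):
    print(word, a, b)
```

In the notebook, the same function could be called on pos_tag(tokens) and uni_tagger.tag(tokens) inside the loop.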


## All-in-one preprocessing with SpaCy

NLTK isn't the only option for preprocessing in English with Python. Another popular choice is [SpaCy](https://spacy.io), which will do everything you might need in one line of code (including things we aren't covering in this course, e.g. parsing). It's fast and lightweight, but you'll need to install the package and its models.

In [39]:
# !python3 -m pip install spacy
# !python3 -m spacy download en_core_web_sm

In [40]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Hi there. How are you?")

You can access what you need in the resulting document object. The sentences are in sents, each sentence is a sequence of tokens, and each token has lemma_, pos_ (the universal POS tag), and tag_ (the treebank POS tag) attributes.

In [41]:
for sent in doc.sents:
    print(sent)
    for token in sent:
        print("token:", token)
        print("lemma:", token.lemma_)
        print("Universal POS:", token.pos_)
        print("PTB POS:", token.tag_)
        print("-----")

Hi there.
token: Hi
lemma: hi
Universal POS: INTJ
PTB POS: UH
-----
token: there
lemma: there
Universal POS: ADV
PTB POS: RB
-----
token: .
lemma: .
Universal POS: PUNCT
PTB POS: .
-----
How are you?
token: How
lemma: how
Universal POS: SCONJ
PTB POS: WRB
-----
token: are
lemma: be
Universal POS: AUX
PTB POS: VBP
-----
token: you
lemma: you
Universal POS: PRON
PTB POS: PRP
-----
token: ?
lemma: ?
Universal POS: PUNCT
PTB POS: .
-----


Let's run both SpaCy and NLTK on the text in en_text and look for differences in the results of their preprocessing.

In [42]:
spacy_sents = []
nltk_sents = []

en_doc = nlp(en_text)
for sent in en_doc.sents:
    spacy_sents.append([(str(token), token.tag_) for token in sent])
for sent in sent_tokenize(en_text):
    nltk_sents.append(pos_tag(word_tokenize(sent)))

print(len(spacy_sents))
print(len(nltk_sents))
for i in range(len(spacy_sents)):
    print(spacy_sents[i])
    print(nltk_sents[i])

1
1
[('The', 'DT'), ('luxury', 'NN'), ('auto', 'NN'), ('maker', 'NN'), ('last', 'JJ'), ('year', 'NN'), ('sold', 'VBD'), ('1,214', 'CD'), ('cars', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('U.S.', 'NNP'), ('Howard', 'NNP'), ('Mosher', 'NNP'), (',', ','), ('president', 'NN'), ('and', 'CC'), ('chief', 'JJ'), ('executive', 'JJ'), ('officer', 'NN'), (',', ','), ('said', 'VBD'), ('he', 'PRP'), ('anticipates', 'VBZ'), ('growth', 'NN'), ('for', 'IN'), ('the', 'DT'), ('luxury', 'NN'), ('auto', 'NN'), ('maker', 'NN'), ('in', 'IN'), ('Britain', 'NNP'), ('and', 'CC'), ('Europe', 'NNP'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('Far', 'JJ'), ('Eastern', 'JJ'), ('markets', 'NNS'), ('.', '.')]
[('The', 'DT'), ('luxury', 'NN'), ('auto', 'NN'), ('maker', 'NN'), ('last', 'JJ'), ('year', 'NN'), ('sold', 'VBD'), ('1,214', 'CD'), ('cars', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('U.S.', 'NNP'), ('Howard', 'NNP'), ('Mosher', 'NNP'), (',', ','), ('president', 'NN'), ('and', 'CC'), ('chief', 'JJ'), ('executive', 'N

SpaCy also has [multilingual support](https://spacy.io/usage/models#languages). Let's take a quick look at French.

In [43]:
# !python3 -m spacy download fr_core_news_sm

nlp = spacy.load('fr_core_news_sm')
doc = nlp("Considérant que la méconnaissance et le mépris des droits de l'homme ont conduit à des actes de barbarie qui révoltent la conscience de l'humanité et que l'avènement d'un monde où les êtres humains seront libres de parler et de croire, libérés de la terreur et de la misère, a été proclamé comme la plus haute aspiration de l'homme")
for sent in doc.sents:
    print(sent)
    for token in sent:
        print("token:", token)
        print("lemma:", token.lemma_)
        print("Universal POS:", token.pos_)  
        print("Language-specific POS:", token.tag_)
        print("------")

Considérant que la méconnaissance et le mépris des droits de l'homme ont conduit à des actes de barbarie qui révoltent la conscience de l'humanité et que l'avènement d'un monde où les êtres humains seront libres de parler et de croire, libérés de la terreur et de la misère, a été proclamé comme la plus haute aspiration de l'homme
token: Considérant
lemma: considérer
Universal POS: VERB
Language-specific POS: VERB
------
token: que
lemma: que
Universal POS: SCONJ
Language-specific POS: SCONJ
------
token: la
lemma: le
Universal POS: DET
Language-specific POS: DET
------
token: méconnaissance
lemma: méconnaissance
Universal POS: NOUN
Language-specific POS: NOUN
------
token: et
lemma: et
Universal POS: CCONJ
Language-specific POS: CCONJ
------
token: le
lemma: le
Universal POS: DET
Language-specific POS: DET
------
token: mépris
lemma: mépris
Universal POS: NOUN
Language-specific POS: NOUN
------
token: des
lemma: de
Universal POS: ADP
Language-specific POS: ADP
------
token: droits
lemm

SpaCy is solid and stable, but if you want the very best, state-of-the-art NLP tools and are willing to sacrifice speed, you might also try [Flair](https://github.com/zalandoresearch/flair). [TextBlob](https://textblob.readthedocs.io/en/dev/) is another popular one-stop option for NLP.

## T/F questions

1. The only reason English sentence segmentation is challenging is that the period is ambiguous.
2. Word tokenization for Chinese is easier than English, because Chinese characters correspond directly to words, whereas English has clitics.
3. After you stem a word, what is left is also a word.
4. The morphology of a word (its affixes) gives you a lot of information about what its part of speech is.
5. If you are doing both lemmatization and POS tagging, you should do POS tagging first.

See [_NLTK BOOK_](https://www.nltk.org/book/)