# COLX 521 Lecture 7: Text Preprocessing

* Sentence segmentation
* Tokenization
* Lemmatization and Stemming
* Part of Speech Tagging
* All-in-one preprocessing with SpaCy

## Sentence segmentation

For many computational linguistics applications, the sentence is a key unit of processing. All modern written languages have an end-of-sentence marker. For some languages, like Chinese, this marker is unambiguous, so extracting the sentences of a text is very easy.

In [1]:
zh_text = "你好。我叫志林。我是老师。"

# my code here
zh_sents = zh_text.split("。")
zh_sents
# my code here

['你好', '我叫志林', '我是老师', '']
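
Note that the simple split above drops the marker itself and leaves a trailing empty string. A minimal sketch of one way around this, using a regular expression that keeps each marker attached to its sentence. Treating ！ and ？ as sentence enders as well is an assumption added here for illustration.

```python
import re

zh_text = "你好。我叫志林。我是老师。"

# One or more non-terminator characters, optionally followed by a terminator.
# Including ！ and ？ alongside 。 is an assumption for illustration.
zh_sents = re.findall(r"[^。！？]+[。！？]?", zh_text)
print(zh_sents)  # ['你好。', '我叫志林。', '我是老师。']
```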

However, English is not so easy because the period (".") is ambiguous. It has various uses. This is true for many other European languages as well.

In [8]:
en_text = "Dr. Brooke got his Ph.D. from the Univ. of Toronto on 12/2013. Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long. He went into industry...but later came back to academia."

# my code here
print(en_text.split("."))
# my code here


['Dr', ' Brooke got his Ph', 'D', ' from the Univ', ' of Toronto on 12/2013', ' Dr', " Brooke's GPA was 4", "1, and his thesis wasn't half-bad, if a bit too long", ' He went into industry', '', '', 'but later came back to academia', '']


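There are ways to improve things considerably by using regular expressions; for instance, splitting only when a period is followed by a space and then an upper-case letter. The regex in the next cell also uses a negative lookbehind so that it does not split after "Dr". Note that the split consumes the period, so only the last sentence keeps its final punctuation.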

In [9]:
import re

regex = r"(?<!Dr)\. (?=[A-Z])"

# my code here
re.split(regex, en_text)
# my code here

['Dr. Brooke got his Ph.D. from the Univ. of Toronto on 12/2013',
 "Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long",
 'He went into industry...but later came back to academia.']
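
Hard-coding one lookbehind per abbreviation does not scale well. As a rough sketch of the same idea, we can find candidate boundaries with a regex and then skip those preceded by a known abbreviation. The abbreviation set and the function name `split_sentences` are hypothetical, and the list is deliberately tiny.

```python
import re

en_text = (
    "Dr. Brooke got his Ph.D. from the Univ. of Toronto on 12/2013. "
    "Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long. "
    "He went into industry...but later came back to academia."
)

# Hypothetical, deliberately tiny abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Ms", "Prof", "Univ"}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    sentences = []
    start = 0
    # Candidate boundary: a period, whitespace, then an upper-case letter.
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        words_before = text[start:match.start()].split()
        last_word = words_before[-1] if words_before else ""
        if last_word in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start() + 1])  # keep the period
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever is left is the last sentence
    return sentences

for sentence in split_sentences(en_text):
    print(sentence)
```

Unlike `re.split`, this keeps the final period on every sentence, but it is still only as good as its abbreviation list.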

For major languages with significant ambiguity, you'll probably want to use a dedicated sentence splitter; NLTK provides one for some languages (17). But don't expect perfection!

In [66]:
from nltk import sent_tokenize

# my code here
sent_tokenize(en_text)
# my code here


['Dr. Brooke got his Ph.D. from the Univ.',
 'of Toronto on 12/2013.',
 "Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long.",
 'He went into industry...but later came back to academia.']

## Tokenization

Again, breaking a sentence up into words seems like it should be easy, but actually rarely is. Spaces separate most word tokens, but punctuation is a problem.


In [69]:
#provided code
sents = sent_tokenize(en_text)
sent = sents[2]
sent


"Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long."

In [64]:
sent.split(" ")

['Dr.',
 "Brooke's",
 'GPA',
 'was',
 '4.1,',
 'and',
 'his',
 'thesis',
 "wasn't",
 'half-bad,',
 'if',
 'a',
 'bit',
 'too',
 'long.']

In [65]:
[match.group() for match in re.finditer(r"\w+", sent)]

['Dr',
 'Brooke',
 's',
 'GPA',
 'was',
 '4',
 '1',
 'and',
 'his',
 'thesis',
 'wasn',
 't',
 'half',
 'bad',
 'if',
 'a',
 'bit',
 'too',
 'long']

In computational linguistics for English, clitics (small words that are phonologically joined to a host word) such as "n't" and "'s" are often treated as separate words and need to be dealt with specially. A proper word tokenizer (such as the one included in NLTK) will deal with these subtle issues.

In [63]:
from nltk import word_tokenize

# my code here
word_tokenize(sent)
# my code here


['Dr.',
 'Brooke',
 "'s",
 'GPA',
 'was',
 '4.1',
 ',',
 'and',
 'his',
 'thesis',
 'was',
 "n't",
 'half-bad',
 ',',
 'if',
 'a',
 'bit',
 'too',
 'long',
 '.']
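
To get a feel for what such a tokenizer has to handle, here is a rough regex-based sketch that covers the cases in this one sentence: clitics, decimal numbers, hyphenated words and a few abbreviations. It is not how NLTK's tokenizer is implemented, and the abbreviation list and the name `simple_tokenize` are hypothetical.

```python
import re

TOKEN_RE = re.compile(
    r"""
    (?:Dr|Mr|Mrs|Ms)\.          # a few abbreviations keep their period (hypothetical short list)
  | [A-Za-z]+(?=n't)            # the host word in front of the clitic n't
  | n't                         # the clitic n't itself
  | '[a-z]+                     # other clitics such as 's, 're, 'll
  | \d+(?:[./,]\d+)*            # numbers such as 4.1 or 12/2013
  | [A-Za-z]+(?:-[A-Za-z]+)*    # plain and hyphenated words
  | \.\.\.                      # ellipsis
  | [^\w\s]                     # any other single punctuation mark
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    # Alternatives are tried left to right, so the more specific ones come first.
    return TOKEN_RE.findall(text)

sent = "Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long."
print(simple_tokenize(sent))
```

The order of the alternatives matters: if the plain-word pattern came before the clitic patterns, "wasn't" would be split as "wasn", "'t".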

In some languages (such as Chinese) there are no spaces between words. Much more sophisticated word segmenters are needed in this case, for example [jieba](https://github.com/fxsjy/jieba) for Chinese (you'll need to install the package to run this code). One challenge for these languages is that it isn't always clear what a word is!

In [52]:
#provided code
from jieba import cut
zh_text = "分词是小菜一碟" # tokenization is a piece of cake

print(" ".join(cut(zh_text)))

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\User\AppData\Local\Temp\jieba.cache
Loading model cost 0.860 seconds.
Prefix dict has been built succesfully.


分词 是 小菜一碟


Exercise: In English, a similar case to Chinese (and other no-space languages) occurs in the context of hashtags. Write code to automatically split the hashtags into their constituent words (you can assume only two, which makes it easier than the general case), using the NLTK words lexicon. 

In [None]:
#provided code
import nltk
nltk.download('words')

In [68]:
from nltk.corpus import words
en_words = set(words.words("en"))

def dehash(hashtag):
    hashtag = hashtag.strip("#")
    # your code here
    for i in range(1, len(hashtag)):  # try every split point, including a one-letter second word
        if hashtag[:i] in en_words and hashtag[i:] in en_words:
            return (hashtag[:i], hashtag[i:])
    # your code here

hashtags = ["#followme", "#goodmorning","#happyhour"]

for hashtag in hashtags:
    print(dehash(hashtag))

('follow', 'me')
('good', 'morning')
('happy', 'hour')
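
The exercise assumes exactly two words. For the general case, one possible approach is a recursive search with memoization that tries the longest dictionary prefix first and backtracks if the rest cannot be segmented. This is only a sketch: `TOY_VOCAB` is a hypothetical stand-in for the NLTK words lexicon so that the example runs without any downloads, and `segment` is an illustrative name.

```python
from functools import lru_cache

# Hypothetical toy lexicon; in the notebook you could pass en_words instead.
TOY_VOCAB = {"follow", "me", "good", "morning", "happy", "hour",
             "throw", "back", "thursday"}

def segment(hashtag, vocab=TOY_VOCAB):
    text = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def helper(start):
        # Returns a tuple of words covering text[start:], or None if impossible.
        if start == len(text):
            return ()
        for end in range(len(text), start, -1):  # longest prefix first
            prefix = text[start:end]
            if prefix in vocab:
                rest = helper(end)
                if rest is not None:
                    return (prefix,) + rest
        return None

    return helper(0)

print(segment("#throwbackthursday"))  # ('throw', 'back', 'thursday')
print(segment("#followme"))           # ('follow', 'me')
```

With a large lexicon that contains many very short entries, a search like this can return odd segmentations, so a real system would typically also score candidates, for example by word frequency.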


## Lemmatization and Stemming

For some applications, it can be useful to ignore the morphological differences between words. A classic example is information retrieval (e.g. web search); if a user looks for "sleeping kitties", you might want to include in your results a page which mentions that "the kitty sleeps". By default, though, "kitty" and "kitties", and "sleeps" and "sleeping", are completely different word types.

In [22]:
#provided code
text = ["The", "kitty", "sleeps"]

In [23]:
"sleeping" in text

False

In [24]:
"kitties" in text

False

Lemmatization converts an inflected form to a base, uninflected form. NLTK has a lemmatizer for English that uses the WordNet lexicon. One tricky aspect is that it requires a part-of-speech. It works for both regular and irregular forms.

In [41]:
#provided code
from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()

In [42]:
lemmatizer.lemmatize("kitties", "n")

'kitty'

In [43]:
lemmatizer.lemmatize("sleeping", "v")

'sleep'

In [94]:
lemmatizer.lemmatize("sleep", "v")

'sleep'

In [44]:
def lemmatize(word):
    lemma = lemmatizer.lemmatize(word, "n")
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, "v")
    return lemma

In [45]:
lemmatize("kitties")

'kitty'

In [46]:
lemmatize("sleeping")

'sleep'

In [47]:
lemmatize("does")

'doe'
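
The last result shows the weakness of trying the noun reading first: "does" is treated as the plural of "doe". If the text has already been POS tagged, one option is to map each Penn Treebank tag to the POS letter that the WordNet lemmatizer expects. The helper below is a sketch with a hypothetical name, and it only covers the four main categories.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (n, v, a or r)."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun

for tag in ["VBZ", "NNS", "JJ", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))

# In the notebook, this could be combined with the lemmatizer defined above, e.g.
# lemmatizer.lemmatize("does", penn_to_wordnet("VBZ"))
```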

Stemming has a similar purpose, but it strips off both inflectional and derivational morphology to reach a stem. This stem is often not itself a word. Sometimes stemming incorrectly conflates words with very different meanings. The most popular stemming algorithm for English is the [Porter stemmer](http://snowball.tartarus.org/algorithms/porter/stemmer.html).

In [95]:
#provided code
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [34]:
stemmer.stem("kitty")

'kitti'

In [35]:
stemmer.stem("kitties")

'kitti'

In [96]:
stemmer.stem("sleeping")

'sleep'

In [97]:
#provided code
S1 = "automatization"
S2 = "automatic"
S3 = "has"
S4 = "have"

In [98]:
stemmer.stem(S1)

'automat'

In [99]:
stemmer.stem(S2)

'automat'

In [87]:
stemmer.stem(S3)

'ha'

In [91]:
stemmer.stem(S4)

'have'
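
Results like "kitti" and "ha" make more sense once you see that the stemmer applies purely orthographic rewrite rules. The sketch below implements only two steps of the published Porter algorithm (step 1a for plural endings, step 1c for final "y"), with a simplified vowel check. It is not NLTK's implementation, which includes the remaining steps and some adjustments of its own, and the function names are illustrative.

```python
def step_1a(word):
    # Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def contains_vowel(stem):
    # Simplified: a, e, i, o, u are vowels, and so is y after a non-vowel letter.
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in "aeiou":
            return True
    return False

def step_1c(word):
    # Porter step 1c: final Y -> I when the rest of the word contains a vowel
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

def toy_stem(word):
    return step_1c(step_1a(word))

for w in ["kitty", "kitties", "has", "sky"]:
    print(w, "->", toy_stem(w))
```

Because the rules never consult a dictionary, "has" loses its final "s" just like a regular plural would.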

Exercise: Try both lemmatizing and stemming the sentence below (after tokenizing), and compare the results.

In [93]:
S = "Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people"
words = word_tokenize(S)
#your code here
lemmas = [lemmatize(word) for word in words]
print(lemmas)
stems = [stemmer.stem(word) for word in words]
print(stems)
#your code here

['Whereas', 'disregard', 'and', 'contempt', 'for', 'human', 'right', 'have', 'result', 'in', 'barbarous', 'act', 'which', 'have', 'outrage', 'the', 'conscience', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'being', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and', 'belief', 'and', 'freedom', 'from', 'fear', 'and', 'want', 'ha', 'be', 'proclaim', 'a', 'the', 'highest', 'aspiration', 'of', 'the', 'common', 'people']
['wherea', 'disregard', 'and', 'contempt', 'for', 'human', 'right', 'have', 'result', 'in', 'barbar', 'act', 'which', 'have', 'outrag', 'the', 'conscienc', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'be', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and', 'belief', 'and', 'freedom', 'from', 'fear', 'and', 'want', 'ha', 'been', 'proclaim', 'as', 'the', 'highest', 'aspir', 'of', 'the', 'common', 'peopl']


## Part of Speech Tagging

We've already seen that POS tagging can be useful for doing analysis of corpora. It has other uses: for instance, it can be used to focus on particular kinds of words for certain applications, and it can provide simple word sense disambiguation (e.g. the word "cross", which can be a noun, a verb, or an adjective). The NLTK POS tagger for English is easy to use and effective.

In [None]:
#provided code
import nltk
nltk.download('averaged_perceptron_tagger')

In [7]:
from nltk import pos_tag

S = "I think the NLTK pos tagger is a pretty solid tool for doing computational linguistics"
#my code here
print(pos_tag(S.split(" ")))
#my code here

[('I', 'PRP'), ('think', 'VBP'), ('the', 'DT'), ('NLTK', 'NNP'), ('pos', 'NN'), ('tagger', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('pretty', 'RB'), ('solid', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('doing', 'VBG'), ('computational', 'JJ'), ('linguistics', 'NNS')]


The standard tagset for English, used by the NLTK tagger (and most others), is the one created as part of the Penn Treebank annotation project, defined [here](https://www.anc.org/penn.html). Another popular tagset is the universal POS tagset, which is designed to be applicable to any language.

In [None]:
#provided code
nltk.download('universal_tagset')

In [92]:
pos_tag(S.split(" "), tagset="universal")

[('I', 'PRON'),
 ('think', 'VERB'),
 ('the', 'DET'),
 ('NLTK', 'NOUN'),
 ('pos', 'NOUN'),
 ('tagger', 'NOUN'),
 ('is', 'VERB'),
 ('a', 'DET'),
 ('pretty', 'ADV'),
 ('solid', 'ADJ'),
 ('tool', 'NOUN'),
 ('for', 'ADP'),
 ('doing', 'VERB'),
 ('computational', 'ADJ'),
 ('linguistics', 'NOUN')]

We can also use NLTK to build a POS tagger for any language where we have a manually tagged corpus and/or morphological information which can be expressed in the form of regular expressions.

In [8]:
#provided code
from nltk import RegexpTagger, UnigramTagger
from nltk.corpus import treebank 

patterns = [(r".*ing$", "VBG"), (r".*ed$", "VBD"),(r".*s$", "NNS"),(r".*", "NN")]
sentence = ["he", "googled", "cats"]
tagged = treebank.tagged_sents()

re_tagger= RegexpTagger(patterns)
uni_tagger= UnigramTagger(tagged, backoff=re_tagger)
print(pos_tag(sentence))
print(uni_tagger.tag(sentence))

[('he', 'PRP'), ('googled', 'VBD'), ('cats', 'NNS')]
[('he', 'PRP'), ('googled', 'VBD'), ('cats', 'NNS')]
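
To make the backoff idea concrete, here is a plain-Python sketch of roughly what this combination does: look the word up in a lexicon learned from tagged data, and fall back to the first matching suffix pattern otherwise. The patterns are the ones from the cell above; `TOY_LEXICON` is a hypothetical stand-in for what a unigram tagger would learn from the treebank, and the function names are illustrative.

```python
import re

# Same patterns as above; order matters, and the last one is a catch-all.
PATTERNS = [(r".*ing$", "VBG"), (r".*ed$", "VBD"), (r".*s$", "NNS"), (r".*", "NN")]

# Hypothetical toy lexicon standing in for tags learned from a tagged corpus.
TOY_LEXICON = {"he": "PRP", "the": "DT", "is": "VBZ"}

def regex_tag(word, patterns=PATTERNS):
    # Return the tag of the first pattern that matches the word.
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None

def tag_with_backoff(tokens, lexicon=TOY_LEXICON):
    # Known words get their lexicon tag; unknown words back off to the regex rules.
    return [(tok, lexicon.get(tok) or regex_tag(tok)) for tok in tokens]

print(tag_with_backoff(["he", "googled", "cats"]))
# [('he', 'PRP'), ('googled', 'VBD'), ('cats', 'NNS')]
print(regex_tag("is"))  # NNS, which is why the lexicon lookup comes first
```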


Exercise: Use NLTK to tokenize the sentences about me, and then, for each sentence, compare the tags from the main NLTK POS tagger and the simple one we just built. Are there many differences? When there are, which do you think is correct?

In [58]:
for sent in sents:
    # your code here
    tokens = word_tokenize(sent)
    print(pos_tag(tokens))
    print(uni_tagger.tag(tokens))
    # your code here

[('Dr.', 'NNP'), ('Brooke', 'NNP'), ('got', 'VBD'), ('his', 'PRP$'), ('Ph.D.', 'NN'), ('from', 'IN'), ('the', 'DT'), ('Univ', 'NNP'), ('.', '.')]
[('Dr.', 'NNP'), ('Brooke', 'NNP'), ('got', 'VBD'), ('his', 'PRP$'), ('Ph.D.', 'NN'), ('from', 'IN'), ('the', 'DT'), ('Univ', 'NN'), ('.', '.')]
[('of', 'IN'), ('Toronto', 'NNP'), ('on', 'IN'), ('12/2013', 'CD'), ('.', '.')]
[('of', 'IN'), ('Toronto', 'NNP'), ('on', 'IN'), ('12/2013', 'NN'), ('.', '.')]
[('Dr.', 'NNP'), ('Brooke', 'NNP'), ("'s", 'POS'), ('GPA', 'NNP'), ('was', 'VBD'), ('4.1', 'CD'), (',', ','), ('and', 'CC'), ('his', 'PRP$'), ('thesis', 'NN'), ('was', 'VBD'), ("n't", 'RB'), ('half-bad', 'JJ'), (',', ','), ('if', 'IN'), ('a', 'DT'), ('bit', 'NN'), ('too', 'RB'), ('long', 'RB'), ('.', '.')]
[('Dr.', 'NNP'), ('Brooke', 'NNP'), ("'s", 'POS'), ('GPA', 'NN'), ('was', 'VBD'), ('4.1', 'CD'), (',', ','), ('and', 'CC'), ('his', 'PRP$'), ('thesis', 'NNS'), ('was', 'VBD'), ("n't", 'RB'), ('half-bad', 'NN'), (',', ','), ('if', 'IN'), ('a'
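
Comparing long tagged lists by eye is error-prone. A small helper like the following sketch can list only the tokens where two taggers disagree. The name `tag_disagreements` is hypothetical, and the helper assumes both taggers were run on the same token list. The example data is the second sentence from the output above.

```python
def tag_disagreements(tagged_a, tagged_b):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    return [
        (tok_a, tag_a, tag_b)
        for (tok_a, tag_a), (tok_b, tag_b) in zip(tagged_a, tagged_b)
        if tag_a != tag_b
    ]

# Output of pos_tag and uni_tagger.tag for the second sentence above
main_tags = [('of', 'IN'), ('Toronto', 'NNP'), ('on', 'IN'), ('12/2013', 'CD'), ('.', '.')]
simple_tags = [('of', 'IN'), ('Toronto', 'NNP'), ('on', 'IN'), ('12/2013', 'NN'), ('.', '.')]

print(tag_disagreements(main_tags, simple_tags))
# [('12/2013', 'CD', 'NN')]
```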

## All-in-one preprocessing with SpaCy

NLTK isn't the only option for preprocessing in English with Python. Another popular choice is [SpaCy](https://spacy.io), which will do everything you might need in one line of code (you'll need to install the package and its models). It's fast and lightweight.

In [None]:
#provided code
#pip install spacy
#python -m spacy download en_core_web_sm

In [15]:
# provided code
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Hi there. How are you?")


You can access what you need in the resulting document object. The sentences are in sents, each sentence is a sequence of tokens, and each token has lemma_, pos_ (the universal POS tag), and tag_ (the Treebank POS tag) attributes.

In [4]:
#provided code
for sent in doc.sents:
    print(sent)
    for token in sent:
        print(token)
        print(token.lemma_)
        print(token.pos_)
        print(token.tag_)

Hi there.
Hi
hi
INTJ
UH
there
there
ADV
RB
.
.
PUNCT
.
How are you?
How
how
ADV
WRB
are
be
VERB
VBP
you
-PRON-
PRON
PRP
?
?
PUNCT
.


Exercise: Play around with SpaCy and NLTK for preprocessing English until you find some difference in the results of their preprocessing.

In [14]:
en_doc = nlp(en_text)
for sent in en_doc.sents:
    print([(str(token), token.tag_) for token in sent])
for sent in sent_tokenize(en_text):
    print(pos_tag(word_tokenize(sent)))


[('Dr.', 'NNP'), ('Brooke', 'NNP'), ('got', 'VBD'), ('his', 'PRP$'), ('Ph.D.', 'NN'), ('from', 'IN'), ('the', 'DT'), ('Univ', 'NNP'), ('.', '.')]
[('of', 'IN'), ('Toronto', 'NNP'), ('in', 'IN'), ('12/2013', 'CD'), ('.', '.')]
[('Dr.', 'NNP'), ('Brooke', 'NNP'), ("'s", 'POS'), ('GPA', 'NNP'), ('was', 'VBD'), ('4.3', 'CD'), (',', ','), ('and', 'CC'), ('his', 'PRP$'), ('thesis', 'NN'), ('was', 'VBD'), ("n't", 'RB'), ('half', 'RB'), ('-', 'HYPH'), ('bad', 'JJ'), (',', ','), ('if', 'IN'), ('a', 'DT'), ('bit', 'NN'), ('too', 'RB'), ('long', 'RB'), ('.', '.')]
[('He', 'PRP'), ('went', 'VBD'), ('into', 'IN'), ('industry', 'NN'), ('...', ':'), ('but', 'CC'), ('later', 'RB'), ('came', 'VBD'), ('back', 'RB'), ('to', 'IN'), ('academia', 'NN'), ('.', '.')]
[('Dr.', 'NNP'), ('Brooke', 'NNP'), ('got', 'VBD'), ('his', 'PRP$'), ('Ph.D.', 'NN'), ('from', 'IN'), ('the', 'DT'), ('Univ', 'NNP'), ('.', '.')]
[('of', 'IN'), ('Toronto', 'NNP'), ('in', 'IN'), ('12/2013', 'CD'), ('.', '.')]
[('Dr.', 'NNP'), ('B

Exercise: Try SpaCy's [multilingual support](https://spacy.io/usage/models#languages). If you can, pick a language that you know well enough that you can tell whether it is doing a good job.

In [1]:
# fr_core_news_sm
import spacy

nlp = spacy.load('fr_core_news_sm')
doc = nlp("Considérant que la méconnaissance et le mépris des droits de l'homme ont conduit à des actes de barbarie qui révoltent la conscience de l'humanité et que l'avènement d'un monde où les êtres humains seront libres de parler et de croire, libérés de la terreur et de la misère, a été proclamé comme la plus haute aspiration de l'homme")
for sent in doc.sents:
    print(sent)
    for token in sent:
        print(token)
        print(token.lemma_)
        print(token.pos_)
        print(token.tag_)

Considérant que la méconnaissance et le mépris des droits de l'homme ont conduit à des actes de barbarie qui révoltent la conscience de l'humanité et que l'avènement d'un monde où les êtres humains seront libres de parler et de croire, libérés de la terreur et de la misère, a été proclamé comme la plus haute aspiration de l'homme
Considérant
considérer
VERB
VERB__Tense=Pres|VerbForm=Part
que
que
SCONJ
SCONJ___
la
le
DET
DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art
méconnaissance
méconnaissance
NOUN
NOUN__Gender=Fem|Number=Sing
et
et
CCONJ
CCONJ___
le
le
DET
DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art
mépris
mépris
NOUN
NOUN__Gender=Masc|Number=Sing
des
un
DET
DET__Definite=Ind|Number=Plur|PronType=Art
droits
droit
NOUN
NOUN__Gender=Masc|Number=Plur
de
de
ADP
ADP___
l'
le
DET
DET__Definite=Def|Number=Sing|PronType=Art
homme
homme
NOUN
NOUN__Gender=Masc|Number=Sing
ont
avoir
AUX
AUX__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
conduit
conduire
VERB
VERB__Gen

SpaCy is solid and stable, but if you want the very best, state-of-the-art NLP tools and are willing to sacrifice speed, you might also try [Flair](https://github.com/zalandoresearch/flair).