# Text Processing

# Topics
- Overview
- Parsing, Stemming, Lemmatization
- Named Entity Recognition
- Stop Words
- Frequency Analysis
- Workshop: Document Summarization

# Toolkits

### NLTK: NLP toolkit

Book: http://www.nltk.org/book/

Wiki: https://github.com/nltk/nltk/wiki

Corpus: http://www.nltk.org/nltk_data/

### spaCy: another NLP toolkit

Simpler to use than NLTK (but usually fewer knobs)

API: https://spacy.io/api/

Models: https://spacy.io/usage/models

Tutorial: https://spacy.io/usage/spacy-101

# Setup

Run this command from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) conda install nltk spacy scikit-learn pandas
```


# What is Text Processing?

- A sub-field of Natural Language Processing (NLP)
- Natural Language Processing is ...
 - Teaching machines to understand and produce language (text, speech)
 - A combination of computer science and computational linguistics

# Text Processing Tasks

- Word categorization and tagging: part of speech, type of entity
- Semantic Analysis: finding meanings of documents
- Topic Modeling: finding topics from documents
- Document similarity: comparing if two documents are semantically similar
- etc.

Note: Speech is text processing + acoustic model

# Parsing, Stemming & Lemmatization

- Tokenization: splitting text into words
- Sentence boundary detection: splitting text into sentences
- Stemming: finding word stems
   - stating => state, reference => refer
- Lemmatization: finding the base form of words
   - was => be

## Tokenization

- Segmenting text into words, punctuation, etc.
- Rule-based

### Tokenization with spaCy

![tokenization in spaCy](assets/text/tokenization.svg)

(image: https://spacy.io/usage/spacy-101#annotations-token)

In [None]:
# Download the English model
# You can find other models here: https://spacy.io/models/en
!python -m spacy download en_core_web_sm

In [None]:
text = u"This is a test. A quick brown fox jumps over the lazy dog."

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

# sentence tokenizer
for sent in doc.sents:
    print()
    print(sent)

In [None]:
# word tokenizer
for token in doc:
    print(token.text)

In [None]:
spacy.explain('DET')

https://spacy.io/api/token

https://spacy.io/api/token#attributes

In [None]:
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

displacy.render(doc, style='dep', jupyter=True, options={'distance': 140})

### Tokenization with NLTK

http://www.nltk.org/api/nltk.tokenize.html

nltk.tokenize
 - sent_tokenize
 - word_tokenize
 - wordpunc_tokenize


In [None]:
# Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html

# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora
import nltk
nltk.download('punkt')

In [None]:
from nltk.tokenize import sent_tokenize

# list of sentences
sent_tokenize(text)

In [None]:
from nltk.tokenize import word_tokenize

# flat list of words and punctuations
word_tokenize(text)

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)

# list of lists
[word_tokenize(sentence) for sentence in sentences]

In [None]:
from nltk.tokenize import wordpunct_tokenize

text2 = "'The time is now 5.30am,' he said."

print(word_tokenize(text2))

print(wordpunct_tokenize(text2))

In [None]:
# Part of speech tagging
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
sentences = [word_tokenize(sentence) for sentence in sentences]

[nltk.pos_tag(word) for word in sentences]

In [None]:
spacy.explain('JJ')

#### Twitter-aware tokenizer

`nltk.tokenize.TweetTokenizer`

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

In [None]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

tknzr.tokenize(tweet)

In [None]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
tweet = '@remy: This is waaaaayyyy too much for you!!!!!!'

tknzr.tokenize(tweet)

## Stemming vs. Lemmatization

- Stemming uses rule-based heuristics
  - ponies => poni
  - Quicker, but less precision
- Lemmatization uses vocabulary and morphological analysis
  - ponies => pony
  - For English, not much improvement over stemming because context of word use is more important

https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

## Porter Stemmer

- 5 sequential phases of word reductions
- Applies rules such as "sses -> ss", "ies => i"

![stemmers](assets/text/stemmers.png)

(image: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

### Stemming & Lemmatization with spaCy

`spacy.lemmatizer.Lemmatizer`

https://spacy.io/api/lemmatizer

In [None]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

nlp = spacy.load('en_core_web_sm')
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

doc = nlp(text)

for token in doc:
    print(lemmatizer(token.text, token.pos_))

### Stemming & Lemmatization with NLTK

`nltk.stem`
- `PorterStemmer`
- `WordNetLemmatizer`

http://www.nltk.org/api/nltk.stem.html

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

tokens = word_tokenize(text)

for token in tokens:
    print(stemmer.stem(token))

In [None]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

tokens = word_tokenize(text)

for token in tokens:
    print(lemmatizer.lemmatize(token))

## Named Entity Recognition

- Find and classify entities within text
  - Persons
  - Organizations
  - Locations
  - Time expressions
  - Quantities
  - Phone numbers
  - etc
  
- Grammar-based models, trained classifiers

- Can be corpus-dependent, see https://spacy.io/api/annotation#named-entities

### Named Entity Recognition with spaCy

https://spacy.io/api/annotation#named-entities

In [None]:
nlp = spacy.load('en_core_web_sm')

text3 = u"Flight 224 is scheduled to arrive in Frankfurt at 4pm July 5th, 2018."
doc = nlp(text3)

for entity in doc.ents:
    print(entity.text, entity.label_, entity.start_char, entity.end_char)

In [None]:
spacy.explain('NORP')

In [None]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

### Named Entity Recognition with NLTK

```
nltk.ne_chunk()
```

https://www.nltk.org/book/ch07.html

In [None]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text3)
sentences = [word_tokenize(sentence) for sentence in sentences]

# Input to ne_chunk needs to be a part-of-speech tagged word
sentences_pos_tagged = [nltk.pos_tag(word) for word in sentences]

[nltk.ne_chunk(word_pos) for word_pos in sentences_pos_tagged]

## Stop words

Stop words are high-frequency words that don't contribute much lexical content:

- the
- a
- to

NLP libraries usually include a corpus of stop words.

Stop word lists:
- http://www.nltk.org/book/ch02.html#stopwords_index_term
- https://www.semantikoz.com/blog/free-stop-word-lists-in-23-languages/

### Stop words with spaCy

`spacy.lang.en.stop_words`

`token.is_stop`

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS

In [None]:
# Deutsch
from spacy.lang.de.stop_words import STOP_WORDS

STOP_WORDS

In [None]:
doc = nlp(text3)

for token in doc:
    print(token.text, token.is_stop)

In [None]:
# Adding stop words
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS.add('MLDDS')

doc = nlp(u"Sorry I'm not free tonite, I have MLDDS (lowercase: mldds).")

for token in doc:
    print(token.text, token.is_stop)

### Stop words with NLTK

```
nltk.corpus.stopwords
```

In [None]:
# Download corpus
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stopwords.words('english')

In [None]:
stopwords.words('german')

In [None]:
tokens = nltk.word_tokenize(text3)

stops = set(stopwords.words('english'))

for token in tokens:
    print(token, token in stops)

In [None]:
# Adding stop words
stops = stopwords.words('english')
stops.append("MLDDS")
stops = set(stops)

tokens = nltk.word_tokenize(u"Sorry I'm not free tonite, I have MLDDS (lowercase: mldds).")

for token in tokens:
    print(token, token in stops)

# Frequency Analysis

Answers two questions:

1. How often does a word appear in a document?

2. How important is a word in a document?

Measure: Term Frequency - Inverse Document Frequency (TF-IDF)

## Term Frequency

Most common formula:

$$\frac{f_{t, d}}{\sum_{t' \in d} \, f_{t',d}}$$

$f_{t, d}$: count of term $t$ in document $d$

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

## Inverse Document Frequency

Most common formula:

$$log\frac{N}{\mid\{d \in D : t \in d \}\mid}$$

$N$: number of documents

$\mid\{d \in D : t \in d \}\mid$: number of documents containing term $t$

## TD-IDF

$$tfidf(t, d, D) = tf(t, d) * idf(t, D)$$

|term|tf|idf|tf-idf|
|--|--|--|--|--|
|to|large|very small|closer to 0|
|coffee|small|large|not closer to 0|

## Computing TF-IDF

#### Scikit-learn:

```
sklearn.feature_extraction.text.CountVectorizer

sklearn.feature_extraction.text.TfidfVectorizer
```
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html


#### NLTK
Supports tf-idf, but less popular
```
nltk.text.TextCollection
```

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

In [None]:
text5 = u"This is a test.\n" \
    u"The quick brown fox jumps over the lazy dog.\n" \
    u"The early bird gets the worm.\n"

#### Computing Word Counts

In [None]:
# http://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')
doc = nlp(text5)
sentences = [sent.text for sent in doc.sents]

# Count word occurrences
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()

X_dense

In [None]:
vectorizer.get_feature_names()

In [None]:
# display as a dataframe
import pandas as pd

df_wc = pd.DataFrame(X_dense, columns=vectorizer.get_feature_names())
df_wc

#### Computing TF-IDF

In [None]:
# http://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is a combination of
#   CountVectorizer + TfidfTransformer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()

print(X_dense.shape)
print(vectorizer.get_feature_names())
X_dense

In [None]:
# for each sentence, get the highest tf-idf
import numpy as np

terms = vectorizer.get_feature_names()
tfidf_arr = np.array(X_dense)

for i in np.arange(len(sentences)):
    print(sentences[i])
    sorted_idx = np.argsort(tfidf_arr[i])[::-1]
    [print(terms[j], tfidf_arr[i][j]) for j in sorted_idx]
    print()

## Exercise

1. Get 3-5 of your own sample sentences
2. Compute the TF-IDF
3. Compute the TF-IDF with stop_words filtered out:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

```
from spacy.lang.en.stop_words import STOP_WORDS

vectorizer = TfidfVectorizer(stop_words=STOP_WORDS)

...

```

## N-grams

TF-IDF can be applied to N-grams (N words at a time), to try to capture some context information.

```
CountVectorizer(ngram_range=(minN, maxN)), ..)

TfidfVectorizer(ngram_range=(minN, maxN)), ..)
```

In [None]:
text5 = u"This is a test.\n" \
    u"The quick brown fox jumps over the lazy dog.\n" \
    u"The early bird gets the worm.\n"

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

nlp = spacy.load('en_core_web_sm')
doc = nlp(text5)
sentences = [sent.text for sent in doc.sents]

# Count word occurrences using 1 and 2-grams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()
print(X_dense.shape)

pd.DataFrame(X_dense, columns=vectorizer.get_feature_names())

## Exercise: TF-IDF with Trigrams

- Compute the TF-IDF for trigrams (1 to 3-grams), using your sample text.
- Try with and without stop words included

In [None]:
# Your code here

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is a combination of
#   CountVectorizer + TfidfTransformer
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()

print(X_dense.shape)
print(vectorizer.get_feature_names())

terms = vectorizer.get_feature_names()
tfidf_arr = np.array(X_dense)

for i in np.arange(len(sentences)):
    print(sentences[i])
    sorted_idx = np.argsort(tfidf_arr[i])[::-1]
    [print(terms[j], tfidf_arr[i][j]) for j in sorted_idx]
    print()

### NLTK N-gram support

You can also split text into trigrams and bigrams using NLTK.

In [13]:
from nltk import bigrams, trigrams, ngrams, word_tokenize

# http://www.taleswithmorals.com/aesop-fable-the-ant-and-the-grasshopper.htm
text6 = "In a field one summer's day a Grasshopper was hopping about, " \
        "chirping and singing to its heart's content."

words = word_tokenize(text6)

print(list(bigrams(words)))

[('In', 'a'), ('a', 'field'), ('field', 'one'), ('one', 'summer'), ('summer', "'s"), ("'s", 'day'), ('day', 'a'), ('a', 'Grasshopper'), ('Grasshopper', 'was'), ('was', 'hopping'), ('hopping', 'about'), ('about', ','), (',', 'chirping'), ('chirping', 'and'), ('and', 'singing'), ('singing', 'to'), ('to', 'its'), ('its', 'heart'), ('heart', "'s"), ("'s", 'content'), ('content', '.')]


In [14]:
print(list(trigrams(words)))

[('In', 'a', 'field'), ('a', 'field', 'one'), ('field', 'one', 'summer'), ('one', 'summer', "'s"), ('summer', "'s", 'day'), ("'s", 'day', 'a'), ('day', 'a', 'Grasshopper'), ('a', 'Grasshopper', 'was'), ('Grasshopper', 'was', 'hopping'), ('was', 'hopping', 'about'), ('hopping', 'about', ','), ('about', ',', 'chirping'), (',', 'chirping', 'and'), ('chirping', 'and', 'singing'), ('and', 'singing', 'to'), ('singing', 'to', 'its'), ('to', 'its', 'heart'), ('its', 'heart', "'s"), ('heart', "'s", 'content'), ("'s", 'content', '.')]


In [17]:
print(list(ngrams(words, 4)))

[('In', 'a', 'field', 'one'), ('a', 'field', 'one', 'summer'), ('field', 'one', 'summer', "'s"), ('one', 'summer', "'s", 'day'), ('summer', "'s", 'day', 'a'), ("'s", 'day', 'a', 'Grasshopper'), ('day', 'a', 'Grasshopper', 'was'), ('a', 'Grasshopper', 'was', 'hopping'), ('Grasshopper', 'was', 'hopping', 'about'), ('was', 'hopping', 'about', ','), ('hopping', 'about', ',', 'chirping'), ('about', ',', 'chirping', 'and'), (',', 'chirping', 'and', 'singing'), ('chirping', 'and', 'singing', 'to'), ('and', 'singing', 'to', 'its'), ('singing', 'to', 'its', 'heart'), ('to', 'its', 'heart', "'s"), ('its', 'heart', "'s", 'content'), ('heart', "'s", 'content', '.')]


# Workshop: Document Summarization

- assign a score to each word in a document corresponding to its level of "importance"
- rank each sentence in the document
  - by summing the individual word scores and dividing by the number of tokens in the sentence
- extract the top N highest scoring sentences and return them as our "summary"

Credits: 
- https://github.com/charlieg/A-Smattering-of-NLP-in-Python
- http://anthology.aclweb.org/P/P11/P11-3014.pdf

In [4]:
import datetime, re, sys
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import reuters
from nltk.stem import PorterStemmer

import nltk
nltk.download('reuters')

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

token_dict = {}
for article in reuters.fileids():
    token_dict[article] = reuters.raw(article)
        
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')
print('building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']')

tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds)
print('done! [process finished: ' + str(datetime.datetime.now()) + ']')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\issohl\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
building term-document matrix... [process started: 2018-06-23 22:34:29.859958]
done! [process finished: 2018-06-23 22:35:25.331088]


In [18]:
from random import randint

feature_names = tfidf.get_feature_names()
print ('TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents')

print('first term: ' + feature_names[0])
print('last term: ' + feature_names[len(feature_names) - 1])

for i in range(0, 4):
    print('random term: ' + feature_names[randint(1,len(feature_names) - 2)])

TDM contains 26304 terms and 10788 documents
first term: 'ali
last term: zzzz
random term: significnt
random term: maud
random term: pasta
random term: realism


In [21]:
import math
import nltk
import random

article_id = random.randint(0, tdm.shape[0] - 1)
article_text = reuters.raw(reuters.fileids()[article_id])

sent_scores = []
for sentence in nltk.sent_tokenize(article_text):
    score = 0
    sent_tokens = tokenize_and_stem(sentence)
    for token in (t for t in sent_tokens if t in feature_names):
        score += tdm[article_id, feature_names.index(token)]
    sent_scores.append((score / len(sent_tokens), sentence))

summary_length = int(math.ceil(len(sent_scores) / 5))
sent_scores.sort(key=lambda sent: sent[0])

print('*** SUMMARY ***')
for summary_sentence in sent_scores[:summary_length]:
    print(summary_sentence[1])

print('\n*** ORIGINAL ***')
print(article_text)

*** SUMMARY ***
Comdata said in the merger each share of the company's
  stock would be converted at the holders election into either 15
  dlrs in cash or a combination of 10 dlrs in cash and a unit of
  securities including common stock.
Comdata said WCAS and its affiliate investors would commit
  50 mln dlrs to buy the securities comprising the new entities
  units of securities resulting from the merger in the same
  proportions and at the same price as the company shareholders.

*** ORIGINAL ***
COMDATA &lt;CDN> IN MERGER AGREEMENT
  Comdata Network Inc said
  it has entered into a letter of intent with a limited
  partnership managed by Welsh, Carson, Anderson and Stowe (WCAS)
  to merge Comdata into a corproration to be formed by WCAS.
      Comdata said in the merger each share of the company's
  stock would be converted at the holders election into either 15
  dlrs in cash or a combination of 10 dlrs in cash and a unit of
  securities including common stock.
      Comdata said 