# GIAN 3: Cleaning, Separating, and Counting

You should now be familiar with searching and extracting text using regular expressions.

This notebook shows how to clean text and split large chunks of text into smaller elements. It also shows how you can count the elements of text and how you can use those counts directly to gain knowledge from text.

## 1. Cleaning

After you have gathered your data, the next step in a text mining project is almost always to clean that data.

Gathering and cleaning are the preparatory work of text mining, and together they usually take more than half of the total effort.

In our current demonstration, we will have a relatively easy cleaning step.

In other projects, more extensive cleaning steps may be involved, such as:

+ Stripping html tags
+ Removing time codes 
+ Removing duplicate documents
+ Removing OCR errors
+ ...
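
As a rough illustration of three of the steps listed above, here is a minimal sketch using only the standard library. The function names and the example strings are made up for this illustration, and the patterns are deliberately crude (for real HTML, a proper parser is the safer choice).

```python
import re

def strip_html_tags(text):
    """Remove anything that looks like an HTML tag (crude approximation)."""
    return re.sub(r"<[^>]+>", "", text)

def remove_time_codes(text):
    """Remove subtitle-style time codes such as 00:01:23,456."""
    return re.sub(r"\d{2}:\d{2}:\d{2}[,.]\d{3}", "", text)

def remove_duplicates(documents):
    """Keep the first copy of each document, preserving the original order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

# Hypothetical example inputs
html_example = "<p>It was the <b>best</b> of times</p>"
subtitle_example = "00:00:01,000 It was the best of times"

print(strip_html_tags(html_example))
print(remove_time_codes(subtitle_example).strip())
print(remove_duplicates(["a", "b", "a"]))
```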

As an example, we will use the file for Charles Dickens' "[A Tale of Two Cities](https://www.gutenberg.org/ebooks/98)", downloaded from [Project Gutenberg](https://www.gutenberg.org/). 

Books from Project Gutenberg contain all kinds of legalese and licensing information before and after the main text. In many cases, we want to remove this information.

In [None]:
# Regular expressions are incredibly useful for cleaning!
import re

In [None]:
# This file contains the book as downloaded from Project Gutenberg
t0 = open('GIAN3_data/pg98.txt', encoding="utf-8").read()

In [None]:
len(t0)

We use a Python [function](https://docs.python.org/3/tutorial/controlflow.html#defining-functions) to define some code that we can apply to any text.

In [None]:
def clean_gutenberg_text(text):
    """Remove front and back matter from Gutenberg text
    
    Takes a text string corresponding to a Project Gutenberg book
    and returns a text string with front and back matter removed.
    """
    m1 = re.search(r"START OF THIS PROJECT GUTENBERG EBOOK .+\n", text)
    m2 = re.search(r"End of the Project Gutenberg EBook .+\n", text)
    if m1 is None or m2 is None:
        raise ValueError("Project Gutenberg start/end markers not found")
    tstart = m1.end()   # index of the first character after the start-marker line
    tstop = m2.start()  # slicing excludes this index, so the end-marker line is dropped
    return text[tstart:tstop]

In [None]:
t1 = clean_gutenberg_text(t0)
len(t1)

In [None]:
# Let's check if we cleaned more or less correctly
print(t1[:200]) # first 200 characters
print("*******")
print(t1[-200:]) # final 200 characters
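
To see how the match positions drive the slicing, here is a small self-contained sketch. The `toy` string is invented for this illustration; it only mimics the two marker lines that the function searches for.

```python
import re

# A made-up miniature "book" with the same kind of marker lines
toy = (
    "License text...\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK TOY ***\n"
    "\n"
    "It was the best of times.\n"
    "\n"
    "End of the Project Gutenberg EBook of Toy\n"
    "More license text...\n"
)

start = re.search(r"START OF THIS PROJECT GUTENBERG EBOOK .+\n", toy)
end = re.search(r"End of the Project Gutenberg EBook .+\n", toy)

# span() returns (start_index, end_index) of a match.
# The end index already points at the first character after the match.
body = toy[start.span()[1]:end.span()[0]]
print(repr(body))
```

Printing with `repr` makes the leftover newlines around the main text visible, which is a handy check when cleaning.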

## 2. Separating or *tokenizing*

In lecture 2, we have seen how we can build up regular expressions to split text into words. In the domain of NLP (Natural Language Processing), splitting a text into words is called *tokenizing*. Tokenization can become very complex, but luckily we can use existing NLP software to do it for us.
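
As a reminder of the regular-expression approach, here is a minimal sketch of a tokenizer. The function name `simple_tokenize` and the pattern are assumptions made for this illustration; a library tokenizer such as spaCy's handles many more cases and may split contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:'\w+)* keeps contractions such as "wasn't" together;
    # [^\w\s] picks up each punctuation character as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = simple_tokenize("It was the best of times, wasn't it?")
print(tokens)
# ['It', 'was', 'the', 'best', 'of', 'times', ',', "wasn't", 'it', '?']
```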

For this lesson, we have chosen [spaCy](https://spacy.io/), a recent library that tries to make NLP practical and fast. SpaCy tries to give you the best available tools so you don't have to make any decision about which tokenizer, parser, or other component to use.

If you want to install spaCy, follow the instructions on the [website](https://spacy.io/usage/).

Things evolve quickly in NLP, so it is good to keep up-to-date about developments so that you can always use the best tool for the job.

In [None]:
import spacy

If you have not installed an English language model for spaCy, the code in the following cell provides an easy way to do so. If it doesn't work on your system, please [consult the documentation](https://spacy.io/usage/models).

In [None]:
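# Use the full model name; the short alias "en" is not available in newer spaCy versions
!python -m spacy download en_core_web_sm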

We can now load the English language model and disable the parser, the tagger, and the named entity recognizer (because we only want to do tokenizing).

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load(disable=['parser', 'tagger', 'ner'])

In [None]:
# Alternative: load the same model by name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
nlp.max_length = 2_000_000 # (makes spaCy work with longer documents)

In [None]:
doc1=nlp(t1) # Tokenize the entire document

In [None]:
# How many tokens are in the tokenized document? (spaCy also counts punctuation and whitespace as tokens)
nwords=len(doc1)
print(nwords)

Now that we have the text split into words, we can compute basic text statistics!

In [None]:
word_lengths=[len(word) for word in doc1] # lengths of all the words
mean_word_length=sum(word_lengths)/nwords
min_word_length=min(word_lengths)
max_word_length=max(word_lengths)

In [None]:
print("Mean word length:", mean_word_length)
print("Shortest word length:", min_word_length)
print("Longest word length:", max_word_length)

We can even find words of a particular length.

In [None]:
def words_of_length(words, length):
    result=[word.lower_ for word in words if len(word)==length]
    return(set(result))

In [None]:
print("words", words_of_length(doc1,6))

## 3. Counting words

Word frequencies are the basic building blocks of many text mining techniques. Once text has been tokenized it becomes very easy to count how often each particular word occurs in that text.

It is good to remember that:

+ We call each particular word a word *type*
+ We call every occurrence of a word a word *token*
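
The distinction between types and tokens can be made concrete with a toy sentence. This is only a sketch on a plain Python list (the sentence is made up), not on a spaCy document.

```python
from collections import Counter

tokens = "the cat saw the dog and the dog saw the cat".split()
counts = Counter(tokens)

n_tokens = len(tokens)  # every occurrence counts: 11
n_types = len(counts)   # distinct words only: the, cat, saw, dog, and -> 5

print(n_tokens, n_types)
print(counts["the"])    # the type "the" has 4 tokens
```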

In [None]:
from collections import Counter
from math import log10

In [None]:
doc1_wf=Counter([word.orth_ for word in doc1]) # frequency of the word in the document

In [None]:
doc1_wf.most_common(10)

In [None]:
# Let's make all words lowercase and get rid of punctuation!
# (a raw string keeps Python from interpreting the backslash in \w)
doc1_wf=Counter([word.lower_ for word in doc1 if re.search(r"\w+", word.lower_)])

In [None]:
doc1_wf.most_common(10)

### Relative Frequencies
It is very useful to have *relative frequencies* for words. It enables us to compare the frequencies of words across documents of different length.

Typical scales for word frequencies are occurrences per million and per billion words.

Transforming absolute frequencies to relative frequencies is very easy:

$f_{rel}=\frac{f_{abs}}{n} \times s$

In words: divide the absolute frequency $f_{abs}$ by the number of tokens $n$ in the document and multiply the result by the scale $s$ you want to use. So, if you want frequencies per million, you should multiply by 1000000, or, in scientific notation 1E6.

For instance, if the frequency of "the" in our document is 8052 and the number of tokens in the document is 180841, then the relative frequency of "the" would be:

In [None]:
(8052/180841)*1E6

Luckily, we can do this for all words in a document at once

In [None]:
n=sum(doc1_wf.values())
doc1_fpm=Counter({word: (frequency/n*1E6) for word, frequency in doc1_wf.items()})

In [None]:
doc1_fpm.most_common(10)
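
To illustrate why relative frequencies make documents of different lengths comparable, here is a small sketch. The helper name `per_million` and all counts are hypothetical.

```python
def per_million(count, n_tokens):
    """Convert an absolute count to a frequency per million tokens."""
    return count / n_tokens * 1e6

# Hypothetical counts of the same word in a short and a long document
short_doc = per_million(50, 10_000)    # 50 occurrences in 10,000 tokens
long_doc = per_million(400, 200_000)   # 400 occurrences in 200,000 tokens

print(round(short_doc), round(long_doc))
```

Although the word occurs eight times as often in the long document in absolute terms, its relative frequency there is lower (about 2000 per million against about 5000 per million).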

### Logarithmic Frequencies

For NLP applications, frequencies are usually transformed to a logarithmic scale.

For large documents, log10 of the frequency per billion words is a practical transformation.

$\log_{10}\left(\frac{f_{abs}+1}{n+ntypes} \times s\right)$

In other words: add one to the absolute frequency $f_{abs}$, divide this by the sum of the number of tokens $n$ and the number of types $ntypes$ in the document, multiply this result by the scale $s$ you want to use, and, finally, take the log10 of this number. If we want frequencies per million words, $s$ will be 1E6; if we want frequencies per billion words, it will be 1E9.

Another name for log10 frequencies per billion is the [Zipf scale](http://crr.ugent.be/archives/1352), named after the famous American linguist [George Kingsley Zipf](https://en.wikipedia.org/wiki/George_Kingsley_Zipf), who first described many of the properties of word frequencies.

In [None]:
n=sum(doc1_wf.values())
ntypes=len(doc1_wf) # number of distinct words (types)
doc1_zipf = Counter({word: log10((f+1)/(n+ntypes)*1E9) for word, f in doc1_wf.items()}) #log 10 frequency per billion words

In [None]:
doc1_zipf.most_common(10)
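
One consequence of adding one to the absolute frequency is that a word that never occurs still gets a finite value, whereas the logarithm of zero is undefined. The sketch below shows this with a stand-alone helper; the name `zipf_value` and the value of `ntypes` are assumptions for illustration only.

```python
from math import log10

def zipf_value(f_abs, n, ntypes, scale=1e9):
    """log10 of the smoothed frequency per `scale` words."""
    return log10((f_abs + 1) / (n + ntypes) * scale)

n = 180841      # number of tokens, as in the earlier example
ntypes = 10000  # hypothetical number of types

frequent = zipf_value(8052, n, ntypes)
unseen = zipf_value(0, n, ntypes)  # still defined thanks to the +1

print(round(frequent, 2), round(unseen, 2))

# Without the +1, an unseen word would need log10(0), which raises an error
try:
    log10(0 / n * 1e9)
except ValueError as error:
    print("log10(0) fails:", error)
```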

### The difference between frequencies and log frequencies

Let's make a graph where the frequency of each word is plotted against its rank. The word with the highest frequency gets rank 1, the word with the second highest frequency gets rank 2, etc.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from math import log10

In [None]:
frequencies=sorted(doc1_fpm.values(), reverse=True)
ranks=list(range(1, len(frequencies)+1)) # ranks start at 1

In [None]:
plt.plot(ranks, frequencies)

You can see that the curve hugs the bottom edge of the graph: there are very few words with a high frequency, and almost all words have a frequency so low that it sits near the bottom.

Log frequencies help us transform the frequencies, so that the difference between words with a frequency of 1 and 10 has the same importance as the difference between words with a frequency of 10 and 100, 100 and 1000, etc.
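
A quick numeric check of this property, as a minimal sketch with made-up frequencies:

```python
from math import log10

frequencies = [1, 10, 100, 1000]
logs = [log10(f) for f in frequencies]

# Differences between neighbouring raw frequencies grow tenfold each time
raw_steps = [b - a for a, b in zip(frequencies, frequencies[1:])]
# Differences between neighbouring log frequencies stay the same
log_steps = [b - a for a, b in zip(logs, logs[1:])]

print(raw_steps)  # [9, 90, 900]
print(log_steps)  # each step is 1 (up to floating-point rounding)
```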

In [None]:
# sort the frequencies in descending order for the graph
frequencies=sorted(doc1_zipf.values(),reverse=True)

In [None]:
plt.plot(ranks, frequencies)

The relation between a word's rank and its frequency is now much more visible on the graph!

To make future processing easy, we wrote a function that takes a document that was tokenized by spaCy and returns log10 frequencies per billion words (Zipf scale frequencies).

In [None]:
def zipf(doc, alphanumeric=True, lowercase=True):
    """Compute log 10 frequencies per billion from a spaCy doc"""
    # Keep only tokens with at least one word character, or any non-empty token
    filter_pattern = r"\w+" if alphanumeric else "."
    if lowercase:
        words = [word.lower_ for word in doc]
    else:
        words = [word.orth_ for word in doc]
    awf = Counter(w for w in words if re.search(filter_pattern, w))
    n = sum(awf.values())  # number of tokens
    ntypes = len(awf)      # number of types
    return Counter({word: log10((f+1)/(n+ntypes)*1E9) for word, f in awf.items()})

## 4. An application: Language identification

Word frequencies give us a convenient way to identify the language of a document.

Let's take some text in another language.

In [None]:
t2 = open('GIAN3_data/pg22367.txt', encoding="utf-8").read()

In [None]:
# Remove front and back matter
t2=clean_gutenberg_text(t2)

In [None]:
# Tokenize the content (we can use the English language tokenizer for most alphabetic languages)
doc2=nlp(t2)

In [None]:
doc2_zipf=zipf(doc2)

In [None]:
doc2_zipf.most_common(10)

In [None]:
# Compare this with the 10 most frequent words from the earlier document
doc1_zipf.most_common(10)

Maybe the top words for documents in a particular language are always very similar to each other?

In [None]:
doc3 = nlp(clean_gutenberg_text(open('GIAN3_data/pg5200.txt', encoding="utf-8").read()))
doc3_zipf=zipf(doc3)

In [None]:
doc3_zipf.most_common(10)

Let's try another document.

In [None]:
doc4 = nlp(clean_gutenberg_text(open('GIAN3_data/40739-0.txt', encoding="utf-8").read()))
doc4_zipf = zipf(doc4)

In [None]:
doc4_zipf.most_common(10)

We can now build an easy German-English text classification system by:

- Taking a number of documents for which we know the language
- Comparing the top $n$ words for a new document to the top $n$ words in each of those documents
- Predicting the language of the new document from the language of the document with the highest overlap in the top $n$ words.
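
The three steps above can be sketched on plain strings, without spaCy. The helper names `top_words` and `classify` and the tiny example texts are made up for this illustration; with texts this short the result is only a demonstration of the mechanics.

```python
from collections import Counter

def top_words(text, n=5):
    """Return the set of the n most frequent lowercase words in a text."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(n)}

def classify(new_text, known, n=5):
    """Pick the language of the known text with the largest top-word overlap."""
    new_top = top_words(new_text, n)
    scored = [(len(new_top & top_words(text, n)), language)
              for text, language in known]
    # max() returns the first entry with the highest overlap
    return max(scored, key=lambda pair: pair[0])[1]

# Toy "documents" with a known language
known = [
    ("the cat and the dog and the bird", "English"),
    ("der Hund und die Katze und der Vogel", "German"),
]

print(classify("the bird and the cat", known))
```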

In [None]:
docbase=((doc1_zipf, "English"), (doc2_zipf, "German"), (doc3_zipf, "English"), (doc4_zipf, "German"))

In [None]:
def german_or_english(newdoc, docbase, n=10):
    """Predict whether a document is German or English"""
    newdoc_zipf=zipf(newdoc)
    # The top words of the new document do not change, so compute them once
    new_topwords={word for word, frequency in newdoc_zipf.most_common(n)}
    results=[]
    for doc_zipf, language in docbase:
        db_topwords=[word for word, frequency in doc_zipf.most_common(n)]
        topword_overlap=len(new_topwords.intersection(db_topwords))
        results.append((topword_overlap, language))
    # Highest overlap first; ties are broken by language name in reverse alphabetical order
    results.sort(reverse=True)
    return(results[0][1])

In [None]:
doc5=nlp(clean_gutenberg_text(open('GIAN3_data/pg26971.txt', encoding="utf-8").read()))

In [None]:
german_or_english(doc5, docbase)

In [None]:
doc5[1000:1100]

In [None]:
doc6=nlp(clean_gutenberg_text(open('GIAN3_data/pg27000.txt', encoding="utf-8").read()))

In [None]:
german_or_english(doc6, docbase)

In [None]:
doc6[600:700]

In [None]:
t7= """I want to live,
I want to give
I've been a miner
for a heart of gold
It's these expressions
I never give
That keep me searching
for a heart of gold
And I'm getting old
Keeps me searching
for a heart of gold
And I'm getting old

I've been to Hollywood
I've been to Redwood
I crossed the ocean
for a heart of gold
I've been in my mind,
it's such a fine line
That keeps me searching
for a heart of gold
And I'm getting old
Keeps me searching
for a heart of gold
And I'm getting old

Keep me searching
for a heart of gold
You keep me searching
And I'm growing old
Keep me searching
for a heart of gold
I've been a miner
for a heart of gold
"""

In [None]:
doc7=nlp(t7)

In [None]:
german_or_english(doc7, docbase)

In [None]:
german_or_english(nlp("the cricket and the ant"), docbase) 