[View in Colaboratory](https://colab.research.google.com/github/gmum/natural-language-processing-classes/blob/master/lab-2-preprocessing/notebook.ipynb)

# Lecture 2 - Text preprocessing

## Example of preprocessing

(from an [article](https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html) by Matthew Mayo)

Beyond the standard Python libraries, we are also using the following:

- [NLTK](http://www.nltk.org/) - The Natural Language ToolKit is one of the best-known and most-used NLP libraries in the Python ecosystem, useful for all sorts of tasks from tokenization, to stemming, to part of speech tagging, and beyond
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - BeautifulSoup is a useful library for extracting data from HTML and XML documents
- [Inflect](https://pypi.org/project/inflect/) - This is a simple library for accomplishing the natural language related tasks of generating plurals, singular nouns, ordinals, and indefinite articles, and converting numbers to words

In [0]:
import re, string, unicodedata
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('gutenberg')
nltk.download('averaged_perceptron_tagger')
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer
!pip install inflect
import inflect

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Collecting inflect
  Downloading https://files.pythonhosted.org/packages/6e/1b/6b9b48323b714b5f66dbea2bd5d4166c4f99d908bc31d5307d14083aa9a2/inflect-1.0.1-py2.py3-none-any.whl (59kB)
Installing collected packages: inflect
Successfully installed inflect-1.0.1


We need some sample text. We'll start with something very small and artificial in order to easily see the results of what we are doing step by step.

In [0]:
sample = """<h1>Title Goes Here</h1>

<b>Bolded Text</b>
<i>Italicized Text</i>

<img src="this should all be gone"/>

<a href="this will be gone, too">But this will still be here!</a>

I run. He ran. She is running. Will they stop running?

I talked. She was talking. They talked to them about running. Who ran to the talking runner?

[Some text we don't want to keep is in here]

¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.

[This is some other unwanted text]

John: "Well, well, well."
James: "There, there. There, there."

&nbsp;&nbsp;

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.
I have to go get 2 tutus from 2 different stores, too.

22    45   1067   445

{{Here is some stuff inside of double curly braces.}}
{Here is more stuff in single curly braces.}

[DELETE]

</body>
</html>"""

A toy dataset indeed, but make no mistake: the steps we take here to preprocess this data are fully transferable.

The text data preprocessing framework:

![](https://www.kdnuggets.com/wp-content/uploads/text-preprocessing-framework-2.png)



### Noise Removal

Let's loosely define noise removal as text-specific normalization tasks which often take place prior to tokenization. Some would argue that, while the other 2 major steps of the preprocessing framework (tokenization and normalization) are basically task-independent, noise removal is much more task-specific.

Sample noise removal tasks could include:

- removing text file headers, footers
- removing HTML, XML, etc. markup and metadata
- extracting valuable data from other formats, such as JSON
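
Two of the tasks listed above can be sketched with the standard library alone. The record layout (`id`, `meta`, `body`) and the `*** START ***` / `*** END ***` marker lines are assumptions invented for this illustration; real data will use its own field names and delimiters.

```python
import json

# Hypothetical scraped record: the field names are made up for this sketch.
raw_record = '{"id": 17, "meta": {"lang": "en"}, "body": "I run. He ran."}'

def extract_body(record_json):
    """Return only the free-text field of a JSON record, dropping metadata."""
    record = json.loads(record_json)
    return record.get("body", "")

def strip_header_footer(text, start_marker, end_marker):
    """Keep only the text between two marker strings (markers excluded)."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        return text  # markers missing: leave the text untouched
    return text[start + len(start_marker):end].strip()

body = extract_body(raw_record)

# Hypothetical file with a header marker, the payload, and a footer.
document = "*** START ***\nShe is running.\n*** END ***\nLicense text"
core = strip_header_footer(document, "*** START ***", "*** END ***")

print(body)
print(core)
```

Both helpers return plain strings, so their output can be handed to the tokenization step unchanged.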

As you can imagine, the boundary between noise removal and data collection and assembly, on the one hand, is a fuzzy one, while the line between noise removal and normalization is blurred on the other. Given its close relationship with specific texts and their collection and assembly, many denoising tasks, such as parsing a JSON structure, would obviously need to be implemented prior to tokenization.

In our data preprocessing pipeline, we will strip away HTML markup with the help of the BeautifulSoup library, and use a regular expression to remove opening and closing square brackets and anything in between them (we assume this is necessary based on our sample text).
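
To see what the bracket-removal pattern used in `remove_between_square_brackets` does and does not match, here is a small standard-library sketch. The test strings are made up for illustration.

```python
import re

# Same pattern as in the notebook: a literal "[", then any run of characters
# that are not "]", then a literal "]".
BRACKETED = re.compile(r'\[[^]]*\]')

line = "keep [drop me] this {and this} too [DELETE]"
cleaned = BRACKETED.sub('', line)

# Nested brackets are not handled: the match stops at the first "]".
nested = BRACKETED.sub('', "[a [b] c]")

print(repr(cleaned))
print(repr(nested))
```

Curly braces are left alone, the surrounding whitespace is kept (so double spaces can remain), and nested brackets leave a stray tail behind. None of this matters for the sample text, but it is worth checking on real data.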

In [0]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

sample = denoise_text(sample)
print(sample)

### Tokenization

Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized. Tokenization is also referred to as text segmentation or lexical analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.
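
The two levels of segmentation described above can be illustrated with a deliberately naive regex sketch. This is not how NLTK tokenizes; it is only meant to show the sentence-level and word-level splits side by side.

```python
import re

text = "I run. He ran. Will they stop running?"

# Naive sentence segmentation: split on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive word tokenization: runs of word characters, or single punctuation marks.
tokens = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]

print(sentences)
print(tokens)
```

A rule this simple breaks on abbreviations such as "e.g." and on contractions such as "can't", which is one reason to prefer a trained or carefully engineered tokenizer for real text.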

For our task, we will tokenize our sample text into a list of words. This is done using NLTK's word_tokenize() function.



In [0]:
words = nltk.word_tokenize(sample)
print(words)

### Normalization
 
Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.

Normalizing text can mean performing a number of tasks, but for our framework we will approach normalization in 3 distinct steps: 
- stemming, 
- lemmatization,
- everything else. 

For specifics on what these distinct steps may be, [see this post](https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html).

Remember, after tokenization, we are no longer working at a text level, but now at a word level. Our normalization functions, shown below, reflect this. Function names and comments should provide the necessary insight into what each does.
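
The least obvious of these functions is `remove_non_ascii`, which relies on Unicode NFKD decomposition. The sketch below isolates that step on single words; the helper name `to_ascii` is introduced here only for illustration.

```python
import unicodedata

def to_ascii(word):
    """Decompose accented characters (NFKD), then drop anything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', word)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

name = "Sebasti\u00e1n"  # "Sebastián" written with a precomposed á
decomposed = unicodedata.normalize('NFKD', name)

# NFKD splits á into a plain "a" plus a combining accent, so the string grows.
print(len(name), len(decomposed))
# Encoding with errors='ignore' drops the accent and keeps the base letter.
print(to_ascii(name))
# "¡" has no ASCII base character, so nothing remains of it.
print(repr(to_ascii("\u00a1")))
```

Tokens that consist only of such characters become empty strings; in the pipeline these are discarded later by `remove_punctuation`, which skips empty results.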

In [0]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

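def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words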

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words

words = normalize(words)
print(words)

The stemming and lemmatization functions are called as follows:

In [0]:
def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

stems, lemmas = stem_and_lemmatize(words)
print('Stemmed:\n', stems)
print('\nLemmatized:\n', lemmas)

Depending on your NLP task or preference, one of these may be more appropriate than the other. See here for a [discussion on lemmatization vs stemming](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/).
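
As a rough intuition for the difference, the toy sketch below contrasts rule-based suffix stripping with a dictionary lookup. Both the suffix list and the lookup table are hypothetical and far smaller than anything a real stemmer or lemmatizer uses.

```python
# Toy illustration only: real stemmers (Porter, Lancaster) apply many ordered
# rules, and real lemmatizers consult a dictionary such as WordNet.
SUFFIXES = ("ing", "ed", "s")                                  # hypothetical rules
LEMMAS = {"ran": "run", "running": "run", "talked": "talk"}   # hypothetical lookup

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["running", "ran", "talked"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer can produce non-words ("runn") and misses irregular forms ("ran"), while the lookup handles both but only for words it knows about.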

To resolve ambiguous cases, lemmatization usually requires tokens to be accompanied by part-of-speech tags. For example, the lemma of the word "rose" depends on whether it is used as a noun or a verb:

In [0]:
lemmer = WordNetLemmatizer()
print(f"noun lemmatization: {lemmer.lemmatize('rose', 'n')}")
print(f"verb lemmatization: {lemmer.lemmatize('rose', 'v')}")

## Exercise 1.

In [0]:
text_1 = """Now the Children of Ilu´vatar are Elves and Men, the Firstborn
and the Followers. And amid all the splendours of the
World, its vast halls and spaces, and its wheeling fires, Ilu´vatar
chose a place for their habitation in the Deeps of Time
and in the midst of the innumerable stars. And this habitation
might seem a little thing to those who consider only the
majesty of the Ainur, and not their terrible sharpness; as who
should take the whole field of Arda for the foundation of a
pillar and so raise it until the cone of its summit were more
bitter than a needle; or who consider only the immeasurable
vastness of the World, which still the Ainur are shaping, and
not the minute precision to which they shape all things
therein. But when the Ainur had beheld this habitation in a
vision and had seen the Children of Ilu´vatar arise therein,
then many of the most mighty among them bent all their
thought and their desire towards that place. And of these
Melkor was the chief, even as he was in the beginning the
greatest of the Ainur who took part in the Music. And he
feigned, even to himself at first, that he desired to go thither
and order all things for the good of the Children of Ilu´vatar,
controlling the turmoils of the heat and the cold that had
come to pass through him. But he desired rather to subdue
to his will both Elves and Men, envying the gifts with which
Ilu´vatar promised to endow them; and he wished himself to
have subjects and servants, and to be called Lord, and to be
a master over other wills.
But the other Ainur looked upon this habitation set within
the vast spaces of the World, which the Elves call Arda,
the Earth; and their hearts rejoiced in light, and their eyes
beholding many colours were filled with gladness; but
because of the roaring of the sea they felt a great unquiet.
And they observed the winds and the air, and the matters of
which Arda was made, of iron and stone and silver and gold
and many substances: but of all these water they most greatly
praised. And it is said by the Eldar that in water there lives
yet the echo of the Music of the Ainur more than in any
substance else that is in this Earth; and many of the Children
of Ilu´vatar hearken still unsated to the voices of the Sea, and
yet know not for what they listen.
Now to water had that Ainu whom the Elves call Ulmo
turned his thought, and of all most deeply was he instructed
by Ilu´vatar in music. But of the airs and winds Manwe¨ most
had pondered, who is the noblest of the Ainur. Of the fabric
of Earth had Aule¨ thought, to whom Ilu´vatar had given skill
and knowledge scare less than to Melkor; but the delight and
pride of Aule¨ is in the deed of making, and in the thing made,
and neither in possession nor in his own mastery; wherefore
he gives and hoards not, and is free from care, passing ever
on to some new work.
And Ilu´vatar spoke to Ulmo, and said: ‘Seest thou not how
here in this little realm in the Deeps of Time Melkor hath
made war upon thy province? He hath bethought him of
bitter cold immoderate, and yet hath not destroyed the beauty
of thy fountains, nor of thy clear pools. Behold the snow,
and the cunning work of frost! Melkor hath devised heats
and fire without restraint, and hath not dried up thy desire
nor utterly quelled the music of the sea. Behold rather the
height and glory of the clouds, and the everchanging mists;
and listen to the fall of rain upon the Earth! And in these
clouds thou art drawn nearer to Manwe¨, thy friend, whom
thou lovest.’
Then Ulmo answered: ‘Truly, Water is become now fairer
than my heart imagined, neither had my secret thought conceived
the snowflake, nor in all my music was contained the
falling of the rain. I will seek Manwe¨, that he and I may make
melodies for ever to thy delight!’ And Manwe¨ and Ulmo have
from the beginning been allied, and in all things have served
most faithfully the purpose of Ilu´vatar.
But even as Ulmo spoke, and while the Ainur were yet
gazing upon this vision, it was taken away and hidden from
their sight; and it seemed to them that in that moment they
perceived a new thing, Darkness, which they had not known
before except in thought. But they had become enamoured
of the beauty of the vision and engrossed in the unfolding
of the World which came there to being, and their minds
were filled with it; for the history was incomplete and the
circles of time not full-wrought when the vision was taken
away. And some have said that the vision ceased ere the
fulfilment of the Dominion of Men and the fading of the
Firstborn; wherefore, though the Music is over all, the Valar
have not seen as with sight the Later Ages or the ending of
the World.
Then there was unrest among the Ainur; but Ilu´vatar called
to them, and said: ‘I know the desire of your minds that what
ye have seen should verily be, not only in your thought, but
even as ye yourselves are, and yet other. Therefore I say: Ea¨!
Let these things Be! And I will send forth into the Void the
Flame Imperishable, and it shall be at the heart of the World,
and the World shall Be; and those of you that will may go
down into it.’ And suddenly the Ainur saw afar off a light,
as it were a cloud with a living heart of flame; and they knew
that this was no vision only, but that Ilu´vatar had made a
new thing: Ea¨, the World that Is."""



1. Build a vocabulary that maps each token to its number of occurrences in the above text. Store the vocabulary as a list of tuples, sorted by the number of occurrences from largest to smallest. Use word_tokenize from nltk.

2. Repeat this process, but this time also convert all tokens to lowercase and lemmatize all tokens as verbs.

3. Use nltk.sent_tokenize to find the longest sentence (with respect to number of characters) in the text and return the number of words in this sentence (excluding punctuation!)

4. Read about the [different tokenizers](https://www.nltk.org/api/nltk.tokenize.html) in NLTK. Give an example of a sentence that TweetTokenizer().tokenize() tokenizes better, and one that word_tokenize() tokenizes better.


Hints:
- you might need to deal with \n after each line, it can be converted to a space
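
As a starting point for the counting part, here is a minimal sketch on a toy sentence. It uses `str.split` as a stand-in for a real tokenizer, so the numbers will not match the exercise; the toy text is made up for illustration.

```python
from collections import Counter

# Toy text with a line break, mirroring the hint about "\n".
toy_text = "the cat saw the dog\nand the bird"

# Convert newlines to spaces, then split on whitespace (stand-in tokenizer).
tokens = toy_text.replace("\n", " ").split()

counts = Counter(tokens)
# most_common() returns a list of (token, count) tuples, most frequent first.
vocab = counts.most_common()

print(vocab)
```

`Counter.most_common()` already yields the "list of tuples, sorted by count" shape that the first task asks for.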


In [0]:
#solution

In [0]:
# 1
vocab_1 = 

# 2
vocab_2 = 

# 3
num_tokens = 


assert len(vocab_1) == 379
assert len(vocab_2) == 336
assert num_tokens == 82

## Exercise 2.



In [0]:
raw = nltk.corpus.gutenberg.raw("burgess-busterbrown.txt")
# print(raw[:500])
words = nltk.corpus.gutenberg.words("burgess-busterbrown.txt")
# print(words[:20])
sents = nltk.corpus.gutenberg.sents("burgess-busterbrown.txt")
# print(sents[:5])

Using "burgess-busterbrown.txt", do the following:

1. Count the number of sentences containing the token "the" (case-insensitive).
2. Compute the average **token** length in the above corpus.
3. (Stemming) Find tokens from the above file whose stems differ between the [Porter](http://snowball.tartarus.org/algorithms/english/stemmer.html) and [Lancaster](https://www.nltk.org/_modules/nltk/stem/lancaster.html) stemming algorithms (after lowercasing the tokens).
4. (Lemmatization) Lemmatize the above corpus. Use the POS tagger (defined below) to improve the lemmatizer. Give an example (from the corpus) where using the POS tagger helps.
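
The corpus reader's `sents` view is a list of sentences, each a list of tokens. The sketch below shows the shape of the first two computations on a hand-made stand-in with the same structure; the three toy sentences are invented, so the results say nothing about the actual corpus.

```python
# Hand-made stand-in with the same nesting as the corpus sentences.
toy_sents = [
    ["The", "bear", "slept", "."],
    ["He", "ate", "berries", "."],
    ["Then", "the", "sun", "rose", "."],
]

# Count sentences with a token equal to "the", ignoring case.
# Comparing whole tokens means "Then" does not count as a match.
with_the = sum(
    1 for sent in toy_sents
    if any(tok.lower() == "the" for tok in sent)
)

# Average token length over all tokens (punctuation tokens included).
all_tokens = [tok for sent in toy_sents for tok in sent]
avg_len = sum(len(tok) for tok in all_tokens) / len(all_tokens)

print(with_the, round(avg_len, 2))
```

Whether punctuation tokens should count toward the average is a choice the exercise leaves open; this sketch includes them.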

In [0]:
#solution

In [0]:
# For 4:

def pt_to_wn(pos):
    """
    Takes a Penn Treebank tag and converts it to an
    appropriate WordNet equivalent for lemmatization.

    A list of Penn Treebank tags is available at:
    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    """

    from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV

    pos = pos.lower()

    if pos.startswith('jj'):
        tag = ADJ
    elif pos == 'md':
        # Modal auxiliary verbs
        tag = VERB
    elif pos.startswith('rb'):
        tag = ADV
    elif pos.startswith('vb'):
        tag = VERB
    elif pos == 'wrb':
        # Wh-adverb (how, however, whence, whenever...)
        tag = ADV
    else:
        # default to NOUN
        # This is not strictly correct, but it is good
        # enough for lemmatization.
        tag = NOUN

    return tag
  
  
def nltk_pos_tagger(tokens):
    """
    Takes a list of tokens and returns a list of 
    tuples [(token, wordnet_tag), ..]
    """

    # Tag tokens with part-of-speech:
    tagged = nltk.pos_tag(tokens) 

    # Convert our Treebank-style tags to WordNet-style tags.
    tagged = [(word, pt_to_wn(tag))
                     for (word, tag) in tagged]
    return tagged
  
  
def lemmatizer(tokens):
    tagged = nltk_pos_tagger(tokens)  
    
    # write code to lemmatize tokens using tags from nltk_pos_tagger
    
    pass
    
    