<a href="https://colab.research.google.com/github/gordeli/NLP_EDHEC/blob/main/colab/01_Text_Processing_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing @ EDHEC, 2022

# Part 1: Text Processing Basics

[-> Data Collection](https://colab.research.google.com/github/gordeli/NLP_EDHEC/blob/main/colab/02_Data_Collection.ipynb)

Dates: January 31 - February 11, 2022

Facilitator: [Ivan Gordeliy](https://www.linkedin.com/in/gordeli/)

(To edit this notebook: File -> Open in Playground Mode)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.


In [None]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# counting and data management
import collections
# operating system utils
import os
# regular expressions
import re
# additional string functions
import string
# system utilities
import sys
# urllib.request will be used to load web content
import urllib.request


# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download the Punkt sentence tokenizer models
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')
# Open Multilingual WordNet data
# (some NLTK versions need it for WordNet lookups)
nltk.download('omw-1.4')

# numpy: matrix library for Python
import numpy as np

# scipy: scientific operations
# works with numpy objects
import scipy

# matplotlib (and pyplot) for visualizations
import matplotlib
import matplotlib.pyplot as plt

# sklearn for basic machine learning operations
import sklearn
import sklearn.manifold
import sklearn.cluster

# wordcloud tool
!pip install wordcloud
from wordcloud import WordCloud

# for checking object memory usage
!pip install pympler
from pympler import asizeof

!pip install spacy
import spacy

# Downloading data
# ----------------
if not os.path.exists("aclImdb"):
    !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xzf aclImdb_v1.tar.gz

print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---
## Basic Text Preprocessing



### Built-in Python functions

- Basic Python functions provide a good starting place.

- First, we should try to split a sentence into individual words:

In [None]:
text = "EDHEC Business School (French: Ecole des Hautes Etudes Commerciales du Nord) is a French business school. As a Grande école in France, it specializes in business and management studies. It has five campuses: Lille, Nice, Paris, London, and Singapore,[4] and offers undergraduate (BBA), graduate (MSc and MiM), executive education (Global MBA, EMBA) PhD in Finance, and a variety of open and customized programmes.[5] It has 8,000 students enrolled in traditional graduate and undergraduate programmes, 150 partner universities and a network of more than 40,000 alumni in over 125 countries."

# We can split on all whitespace with split()
words = text.split()
print("WORDS:",words)
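
A note on the call above: `split()` with no argument treats any run of whitespace (spaces, tabs, newlines) as a single separator, whereas `split(' ')` splits on every individual space. The short sketch below uses a made-up `sample` string (not taken from the notebook) to show the difference. It also shows that punctuation stays attached to the tokens either way.

```python
# Illustrative string (an assumption for this sketch): it contains a
# double space and a newline
sample = "It has  five campuses:\nLille, Nice"

# No argument: any run of whitespace counts as one separator
whitespace_tokens = sample.split()

# Explicit single space: an empty string appears for the double space,
# and the newline stays inside a token
space_tokens = sample.split(" ")

print(whitespace_tokens)
print(space_tokens)
```

In both cases tokens such as `campuses:` and `Lille,` keep their punctuation, which is the problem the next step addresses.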

- It is fairly straightforward to do things like remove punctuation, lowercase, or access individual letters:

In [None]:
# for the first 10 words
for word in words[:10]:

    # print the string "word:", the word itself,
    # and end with a vertical bar character instead of a newline
    print("word:", word, end=' | ')
    
    # strip removes characters at the beginning and end of a string
    # string.punctuation contains: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    print("no punctuation:", word.strip(string.punctuation), end=' | ')
    
    # lower() and upper() change case
    print("lowercase:", word.lower(), end=' | ')
    
    # characters in strings can be indexed just like items in lists
    print("first letter:", word[0].upper())
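
Note that `strip()` only trims characters from the two ends of a string, so punctuation inside a token survives. If every punctuation character should go, one option is `str.translate()` with a deletion table. The sketch below is a minimal illustration; the helper names `strip_edges` and `remove_all_punctuation` are made up for this example.

```python
import string

def strip_edges(token):
    # Trim punctuation from the start and end only
    return token.strip(string.punctuation)

def remove_all_punctuation(token):
    # Build a table that deletes every character in string.punctuation
    table = str.maketrans("", "", string.punctuation)
    return token.translate(table)

for token in ["(French:", "Singapore,[4]", "40,000"]:
    print(token, "|", strip_edges(token), "|", remove_all_punctuation(token))
```

Which behaviour is preferable depends on the task: deleting inner punctuation turns `40,000` into `40000`, but it would also merge the parts of a hyphenated word.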

- How about dealing with multiple sentences?

In [None]:
# From https://en.wikipedia.org/wiki/Data_science

text =  'Data science is a "concept to unify statistics, data analysis, machine ' + \
        'learning and their related methods" in order to "understand and analyze ' + \
        'actual phenomena" with data. '

text += 'It employs techniques and theories drawn from many fields within the ' + \
        'context of mathematics, statistics, computer science, and information ' + \
        'science. '

text += 'Turing award winner Jim Gray imagined data science as a "fourth paradigm" ' + \
        'of science (empirical, theoretical, computational and now data-driven) ' + \
        'and asserted that "everything about science is changing because of the ' + \
        'impact of information technology" and the data deluge. '

text += 'In 2015, the American Statistical Association identified database ' + \
        'management, statistics and machine learning, and distributed and ' + \
        'parallel systems as the three emerging foundational professional communities.'

# We could try splitting on the period character...
sentences = text.split('.')
print('.\n'.join(sentences))

- When might this not work?

In [None]:
# Try this:
text =  "Dr. Martin registered the domain name drmartin.com before moving to the " + \
        "U.K. in January. "
text += "During that time, 1.6 million users visited her website... it was very " + \
        "unexpected and caused a server to crash."
sentences = text.split('.')
print('.\n'.join(sentences))
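
A tempting patch is to split only when sentence-ending punctuation is followed by whitespace. The sketch below tries this with a regular expression; `naive_sentence_split` is a hypothetical helper written for this illustration, not an NLTK function. It keeps `drmartin.com` and `1.6` intact, but still breaks on abbreviations such as `Dr.` and `U.K.` and on the ellipsis.

```python
import re

def naive_sentence_split(text):
    # Split at whitespace that directly follows '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text)

example = ("Dr. Martin registered the domain name drmartin.com before moving "
           "to the U.K. in January. During that time, 1.6 million users "
           "visited her website... it was very unexpected and caused a "
           "server to crash.")

pieces = naive_sentence_split(example)
for piece in pieces:
    print(repr(piece))
```

Hand-written rules like this tend to need an ever-growing list of exceptions, which is one motivation for using a trained tokenizer instead.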

### Introducing the Natural Language Toolkit (NLTK)

- NLTK is a very handy library for basic text processing operations.

- We can split sentences in a much smarter way:

In [None]:
sentences = nltk.sent_tokenize(text)
print('\n'.join(sentences))

- **What else can we do with NLTK?**
- Smarter word tokenization:

In [None]:
sentence_words = nltk.word_tokenize(sentences[0])
print("Words:",' '.join(sentence_words))

- Finding word stems:

In [None]:
# Add the words from the 2nd sentence
sentence_words += nltk.word_tokenize(sentences[1])

# Stemming
stemmer = nltk.stem.PorterStemmer()
stems = [stemmer.stem(word) for word in sentence_words]
print(stems)

Read more about Porter's stemming algorithm [here](https://tartarus.org/martin/PorterStemmer/).

- Labeling words with their part-of-speech, and even finding their lemmas:

In [None]:
# Part-of-speech tagging
pos_tags = nltk.pos_tag(sentence_words)
print("Parts of speech:",pos_tags)

# Lemmatization
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'
    
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for (word,pos) in pos_tags]
print("Lemmas:", ' '.join(lemmas))

[Complete list of POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
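
The `lookup_pos()` helper above maps every tag that is not a noun or verb to `'n'`. The WordNet lemmatizer also accepts `'a'` (adjective) and `'r'` (adverb), so a slightly richer mapping is possible. The sketch below is one possible extension, not part of the original notebook; the name `lookup_pos_extended` is made up, and it assumes Penn Treebank tags as input.

```python
def lookup_pos_extended(pos):
    # Penn Treebank adjective tags: JJ, JJR, JJS
    if pos.startswith("J"):
        return "a"
    # Verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    if pos.startswith("V"):
        return "v"
    # Adverb tags: RB, RBR, RBS (RP is a particle, so it is not matched)
    if pos.startswith("RB"):
        return "r"
    # Everything else falls back to noun, as in lookup_pos()
    return "n"

for tag in ["NNS", "VBD", "JJR", "RB", "RP", "IN"]:
    print(tag, "->", lookup_pos_extended(tag))
```

It could be passed to the lemmatizer in place of `lookup_pos`, for example `lemmatizer.lemmatize(word, lookup_pos_extended(pos))`.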

- Sometimes, it is helpful to remove "stopwords" such as "a", "the", "I", and "do".
    - It's worth thinking about whether or not these words are important in your application.
    - These kinds of words do carry a lot of important information!

In [None]:
# Stopword (non-content word) removal
stop_words = set(nltk.corpus.stopwords.words('english'))
# (the NLTK stopword list is lowercase, so compare lowercased tokens)
content_words = [word for word in sentence_words if word.lower() not in stop_words]
removed_stop_words = [word for word in sentence_words if word.lower() in stop_words]
print("Content words:", ' '.join(content_words))
print("Removed Stop words:", ' '.join(removed_stop_words))

- Let's look at a simple plot of the word frequencies in our sample text.

In [None]:
# Get word frequencies
frequencies = nltk.probability.FreqDist(sentence_words)

# Plot the frequencies
frequencies.plot(15,cumulative=False)
plt.show()
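
In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be reproduced with the standard library alone. The sketch below uses a made-up token list (an assumption for illustration, not `sentence_words`) to show what the frequency table looks like before it is plotted.

```python
import collections

# Illustrative tokens, not taken from the notebook
tokens = "the cat sat on the mat and the dog sat too".split()

# Count how often each token occurs
counts = collections.Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs
top_two = counts.most_common(2)
print(top_two)
print("distinct tokens:", len(counts), "| total tokens:", sum(counts.values()))
```

The same `most_common()` call also works on a `FreqDist` object.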

### Putting it together: Creating a Word Cloud
- Now, it's your turn to try out some of the techniques we've covered.

1. First, run the code block below labeled "Run this code first" to perform some setup.
2. Then, modify the code marked "Exercise 1" to convert a document into **preprocessed lemma frequencies**.
    - There is a sample solution below. It's hidden for now, but you can take a peek when you are ready.
3. Finally, run the code labeled "build a word cloud" to see the result.

In [None]:
#@title Run this code first: Wordcloud function and loading the document (double-click to view) {display-mode: "form"}


# Draw a wordcloud!
# Inputs:
#   freq_dist: a dictionary (or FreqDist) mapping strings to their counts
#   colormap: a matplotlib colormap name, or None for the default
def draw_wordcloud(freq_dist, colormap):
    
    #TODO add a few corpus specific checks here to make sure people have done casing, lemmatization, punct removal
    uniq_count = len(freq_dist.keys())
    print("Building a word cloud with",uniq_count,"unique words...")
    wc = WordCloud(colormap=colormap, width=1500, 
                   height=1000).generate_from_frequencies(freq_dist)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    
print("draw_wordcloud() function is ready to use.")

# Load the contents of the book "The Wonderful Wizard of Oz"
#   by L. Frank Baum (from Project Gutenberg)
file_name = "pg55.txt"
# download the text once if it is not already available locally
if not os.path.exists(file_name):
    urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/55/pg55.txt", file_name)
with open(file_name, encoding="utf-8") as f:
    document = f.read()

print('"The Wonderful Wizard of Oz" full text is loaded.')

**Exercise 1**

Write your code here. Make sure to click the "run" button when you're finished.

In [None]:
# Convert text to a FreqDist object
# containing the frequencies of the lemmas in the text.
# All stopwords should be removed.
# Inputs:
#   text: a string as input, possibly containing multiple sentences.
def text_to_lemma_frequencies(text):
    
# ------------- Exercise 1 -------------- #

    # write your preprocessing code here

    # replace this return statement with your own
    return nltk.probability.FreqDist(["Hello", "world", "hello", "world."])
# ---------------- End ------------------ #

    
# quick test (do not modify this)
test_doc = "This is a test. Does this work?"
result = text_to_lemma_frequencies(test_doc)
passed = result == nltk.probability.FreqDist(["test","work"])
if passed:
    print("Test passed!")
else:
    print("Test did not pass yet.")
    if isinstance(result, nltk.probability.FreqDist):
        print("got these words:", result.keys(),\
              "\nwith these counts:", result.values())
    else:
        print("Did not return a FreqDist object.")

Now, let's **build a word cloud** for the book "[The Wonderful Wizard of Oz](http://www.gutenberg.org/cache/epub/55/pg55.txt)."

In [None]:
# Get the word frequency distribution
freq_dist = text_to_lemma_frequencies(document)

# Use default colormap
colormap = None
# Bonus: try out some other matplotlib colormaps
#colormap = "spring" # see more here: https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html

# Call the function to draw the word cloud
draw_wordcloud(freq_dist, colormap)

In [None]:
#@title Sample Solution (double-click to view) Run to load sample solution. {display-mode: "form"}

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set()
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'
    
# quick test:
test_doc = "This is a test. Does this work?"
result = text_to_lemma_frequencies(test_doc)
passed = result == nltk.probability.FreqDist(["test","work"])
if passed:
    print("Test passed!")
else:
    print("Test did not pass yet.")
    if isinstance(result, nltk.probability.FreqDist):
        print("got these words:", result.keys(),\
              "\nwith these counts:", result.values())
    else:
        print("Did not return a FreqDist object.")

### Bonus: Zipf's Law

- Let's check the frequency distribution over the top N words in the book.

In [None]:
top_n_words = 100
freq_dist.plot(top_n_words, cumulative=False)
plt.show()

- You've just observed (a "Wizard of Oz" version of) [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law) at work!

- Remember that stopwords were removed before counting, so the most frequent words of the language are missing from this plot.

- _Try this_: 
    - Load the sample `text_to_lemma_frequencies()` function, then run the code below to see what this looks like with stopwords.

    - Pay attention to how the y-axis is different from the example above.

    - Compare the result to [this example](https://phys.org/news/2017-08-unzipping-zipf-law-solution-century-old.html).

In [None]:
freq_dist = text_to_lemma_frequencies(document, remove_stop_words=False)
top_n_words = 100
freq_dist.plot(top_n_words, cumulative=False)
plt.show()
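
To put a number on what the plots show: Zipf's Law says that a word's frequency is roughly proportional to 1/rank, so rank × frequency should stay roughly constant and the curve should be close to a straight line with slope near -1 on log-log axes. The sketch below checks both on a toy `Counter` constructed to follow the law exactly. The helpers `rank_frequency_table` and `loglog_slope` are hypothetical names written for this illustration, and the toy counts are made up.

```python
import collections
import math

def rank_frequency_table(counts, top_n=10):
    # Rows of (rank, word, frequency, rank * frequency)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

def loglog_slope(counts, top_n=100):
    # Least-squares slope of log(frequency) against log(rank)
    freqs = [freq for _, freq in counts.most_common(top_n)]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Toy counts chosen so that frequency = 120 / rank
toy = collections.Counter({"w1": 120, "w2": 60, "w3": 40, "w4": 30, "w5": 24})

rows = rank_frequency_table(toy)
slope = loglog_slope(toy)
for row in rows:
    print(row)
print("log-log slope:", round(slope, 3))
```

Because `FreqDist` supports `most_common()`, the same helpers can be called on `freq_dist` from the cells above; real text will only follow the pattern approximately.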