# Text Mining

![miners](img/text-miners.jpeg)

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

#### How is text mining different? What is text?

- Order the words from **SMALLEST** to **LARGEST** units
 - character
 - corpora
 - sentence
 - word
 - corpus
 - paragraph
 - document

(after it is all organized)

- Any disagreements about the terms used?

### Bag of Words Steps

<img style="float: left" src="./img/bag_of_word.jpg" width="200">

1. Make all text lowercase
2. Remove punctuation, numbers, symbols, etc.
3. Remove stop words; consider developing a custom stop word list
4. Stemming/Lemmatization
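
As a rough sketch of steps 1 through 3, the cell below uses only the standard library. The `STOP_WORDS` set and the `bag_of_words` helper are hypothetical names made up for this illustration, and the stop word list is deliberately tiny; stemming (step 4) is left out here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "in", "and", "was", "his"}

def bag_of_words(text):
    """Return a Counter of cleaned tokens (steps 1-3 only, no stemming)."""
    lowered = text.lower()                              # step 1: lowercase
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)    # step 2: drop punctuation, digits, symbols
    tokens = letters_only.split()
    kept = [t for t in tokens if t not in STOP_WORDS]   # step 3: remove stop words
    return Counter(kept)

bag = bag_of_words("The bug woke in his bed, and the bed was small: 1 bed!")
print(bag)
```

Note that this simple version also splits contractions at the apostrophe, which is one reason a more careful tokenizer is worth the effort.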

## New library!

NLTK is its own Python library, and of course it has its own [documentation](https://www.nltk.org/).

In [None]:
#from __future__ import print_function
import nltk
import sklearn
%matplotlib inline

In [None]:
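# Uncomment one of these on first run to download the NLTK data used below
# (stop word lists, WordNet, tokenizer and tagger models).
#nltk.download()

#nltk.download('popular')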

In [None]:
import string, re
import urllib.request  # urlopen lives in the request submodule

In [None]:
metamorph = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/5200/pg5200.txt').read()

In [None]:
metamorph

In [None]:
metamorph_st = metamorph.decode("utf-8")

In [None]:
print(metamorph_st[:1000])

## Your Turn!

In [None]:
# Load your article here; you're welcome to use gutenberg.org.



We're going to use a regular expression here. For more on RegEx see [this blog post](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285) and [this more official site](https://www.regular-expressions.info/).

In [None]:
pattern = "[a-zA-Z]+(?:'[a-z]+)?" # I'm looking for whole words, i.e.: one or more letters,
                                  # optionally followed by an apostrophe and lowercase letters.
metamorph_tokens_raw = nltk.regexp_tokenize(metamorph_st, pattern)
print(metamorph_tokens_raw[:100])
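
To see what this pattern keeps and drops, here is a small sketch on a made-up sentence. It uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize`; for a pattern like this one, with no capturing groups, the two should return the same matches. The `sample` string is invented for the example.

```python
import re

pattern = r"[a-zA-Z]+(?:'[a-z]+)?"
sample = "Gregor's room wasn't big; it held 3 chairs."

# Contractions and possessives stay whole; digits and punctuation are dropped.
words = re.findall(pattern, sample)
print(words)
```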

In [None]:
metamorph_tokens = [i.lower() for i in metamorph_tokens_raw]
print(metamorph_tokens[:100])

We often want to omit counts of very common words: "stopwords".

In [None]:
nltk.corpus.stopwords.words("english")

In [None]:
nltk.corpus.stopwords.words('greek')

In [None]:
stop_words = set(nltk.corpus.stopwords.words('english'))
metamorph_tokens_stopped = [w for w in metamorph_tokens if w not in stop_words]
print(metamorph_tokens_stopped[:100])

## Stemming / Lemmatizing

### Stemming - Porter Stemmer 


This algorithm is named for [Martin Porter](https://tartarus.org/martin/index.html).

In [None]:
porter = nltk.stem.PorterStemmer()
example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'seizing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

In [None]:
singles = [porter.stem(examp) for examp in example]
print(' '.join(singles))

### Stemming - Snowball Stemmer
![snowball](https://localtvwiti.files.wordpress.com/2018/08/gettyimages-936380496.jpg?quality=85&strip=all)

In [None]:
print(" ".join(nltk.stem.SnowballStemmer.languages))

In [None]:
snow = nltk.stem.SnowballStemmer("english")
print(snow.stem("running"))

### Porter vs Snowball

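The English Snowball stemmer, also known as Porter2, is Martin Porter's own revision of his original algorithm. [This discussion](https://intellipaat.com/community/3111/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-algorithms) of the Porter and Lancaster stemmers is helpful background on how stemming algorithms differ.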

In [None]:
print(snow.stem("generously"))
print(porter.stem('generously'))

### Use Snowball on _Metamorphosis_

In [None]:
meta_stemmed = [snow.stem(word) for word in metamorph_tokens_stopped]
print(meta_stemmed[:100])

### Lemmatization

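Lemmatization is often superior to mere stemming because it maps each word to a dictionary form (its lemma) instead of just stripping suffixes, which means it makes use of "deeper" facts about the language in question. NLTK's version relies on a lexical database called "WordNet". Let's see it in action.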

In [None]:
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
wordnet_lemmatizer.lemmatize('calculi', pos='n') # The 'pos' parameter is for Part Of Speech.
                                                 # The default value is 'n' (for Noun).

In [None]:
wordnet_lemmatizer.lemmatize('is')

In [None]:
wordnet_lemmatizer.lemmatize('is', pos='v')

In [None]:
' '.join(wordnet_lemmatizer.lemmatize(word) for word in metamorph_tokens_stopped[:100])

In [None]:
wordnet_lemmatizer.lemmatize('lifted')

In [None]:
wordnet_lemmatizer.lemmatize('lifted', pos='v')

In [None]:
nltk.pos_tag(['lifted'])

In [None]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": nltk.corpus.wordnet.ADJ,
                "N": nltk.corpus.wordnet.NOUN,
                "V": nltk.corpus.wordnet.VERB,
                "R": nltk.corpus.wordnet.ADV}

    return tag_dict.get(tag, nltk.corpus.wordnet.NOUN)

In [None]:
# Lemmatize a sentence with the appropriate POS tag

sentence = """
Oh say can you see, by the dawn's early light, what so proudly we hailed at the twilight's
last gleaming, whose broad stripes and bright stars, through the perilous fight, o'er the
ramparts we watched, were so gallantly streaming?
"""
[wordnet_lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)]

The above function comes from [this site](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/), which is a nice resource for help with lemmatization.

## Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text into Unicode and normalizing it to a consistent form (for example, reading UTF-8 bytes and applying NFC normalization).
- Handling of domain-specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
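
As one illustration of the Unicode item above, here is a minimal sketch using the standard library's `unicodedata` module. The sample text and the `strip_accents` helper are hypothetical; whether to strip accents at all depends on the language and the task.

```python
import unicodedata

# Bytes as they might arrive from a file or a web request.
raw_bytes = "Café déjà vu".encode("utf-8")
text = raw_bytes.decode("utf-8")

# The same visible character can be stored in more than one way:
# 'e' followed by a combining acute accent vs. a single precomposed 'é'.
composed = unicodedata.normalize("NFC", "e\u0301")

def strip_accents(s):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(text))
```

Normalizing before counting keeps visually identical words from being counted as different tokens.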

### Document statistics

Average word length in document

In [None]:
sum(map(len, meta_stemmed)) / len(meta_stemmed)

Number of words in document

In [None]:
len(meta_stemmed)
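
The same statistics can be checked by hand on a tiny token list. This is only a sketch: the `tokens` list is made up, and the vocabulary size and type-token ratio are extra statistics added here for illustration, not part of the cells above.

```python
from statistics import mean

# Hypothetical token list standing in for a processed document.
tokens = ["gregor", "woke", "gregor", "slept"]

avg_len = mean(len(t) for t in tokens)       # average token length
n_tokens = len(tokens)                       # number of tokens
n_types = len(set(tokens))                   # number of distinct tokens
type_token_ratio = n_types / n_tokens        # rough measure of lexical variety

print(avg_len, n_tokens, n_types, type_token_ratio)
```

Keep in mind that statistics computed on a stemmed, stop-word-filtered list describe the processed tokens rather than the original text.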

## Frequency distributions

In [None]:
meta_freqdist = nltk.FreqDist(meta_stemmed)

In [None]:
meta_freqdist.most_common(50)

In [None]:
meta_freqdist.plot(20, cumulative=False);

**TASK**: Create a word frequency plot for your article. Don't worry about stemming or lemmatizing to start.

Question: Should any more stop words be added to the list, given your plot results?
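
If the answer is yes, one possible approach is to extend the stop word set and filter again. The sketch below uses made-up tokens and a tiny `base_stop_words` set as a stand-in for NLTK's English list; the choice of "said" as an extra stop word is just an assumption for the example.

```python
from collections import Counter

# Hypothetical tokens, already lowercased and filtered with a base stop list.
tokens = ["gregor", "said", "room", "gregor", "said", "door", "said", "sister"]
base_stop_words = {"the", "and", "of"}   # stand-in for NLTK's English list

# Suppose the plot shows "said" dominating without telling us much.
custom_stop_words = base_stop_words | {"said"}

filtered = [t for t in tokens if t not in custom_stop_words]
top = Counter(filtered).most_common(1)
print(top)
```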

## Creating a DataFrame that compares the documents

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
df
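
To make the resulting table less of a black box, here is a hand-rolled sketch of the same document-term counts using only the standard library. It approximates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); it is not the library's actual implementation, and the variable names are made up for the example.

```python
import re
from collections import Counter

docs = ['why hello there', 'omg hello', 'she went there? omg']

# Approximation of the default tokenization: words of 2+ characters.
token_pattern = r"\b\w\w+\b"
tokenized = [re.findall(token_pattern, d.lower()) for d in docs]

# One column per distinct term, in alphabetical order.
vocab = sorted({t for toks in tokenized for t in toks})

# One row per document, one count per vocabulary term.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is one document's bag of words, which is why documents can be compared column by column.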