# Text Mining

<img src="img/text-miners.jpeg" width="500">

## How is text mining different? What is text?

- Order the words from **SMALLEST** to **LARGEST** units
 - character
 - corpora
 - sentence
 - word
 - corpus
 - paragraph
 - document

(after it is all organized)

- Any disagreements about the terms used?

## Tokenization



[Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)

### Start small

In [None]:
import string

In [None]:
token_test = "Here is a sentence. Or two, I don't think there will be more."
token_test_2 = "i thought this sentence was good."
token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"

In [None]:
# let's tokenize a document... into sentences
def make_sentences(doc):
    sentences = doc.split('.')
    return [s.strip() for s in sentences if s]

make_sentences(token_test_3)

In [None]:
import string
new_token = token_test_3.translate(str.maketrans('', '', string.punctuation))
new_token.split(' ')

In [None]:

# let's tokenize a document into words
# with these 3 test cases what would you look out for?
def tokenize_it(doc):
    return doc.split(' ')

tokenize_it(token_test_3)

## New library!

While we have seen language processing tools in Spark, NLTK is its own Python library, and of course it has its own [documentation](https://www.nltk.org/).

In [None]:
import nltk
import sklearn

### Bigger Data

In [None]:
import requests
resp = requests.get('http://www.gutenberg.org/cache/epub/5200/pg5200.txt')
metamorph = resp.text

In [None]:
print(metamorph[:1000])

Load your article here

In [None]:
import re
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"  # what is this nonsense?

[Regex playground](https://regexr.com/)

[Regex resources](https://www.programiz.com/python-programming/regex)
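
To unpack the pattern above, here is a minimal sketch that uses only the standard library `re` module. The sample sentence and the names `word_pattern` and `sample` are made up for illustration, and the pattern is written without the outer parentheses. It matches a run of letters, optionally followed by an apostrophe and more lowercase letters, so contractions stay in one piece while punctuation and digits are dropped.

```python
import re

# [a-zA-Z]+       one or more letters
# (?:'[a-z]+)?    optionally, an apostrophe followed by lowercase letters
#                 ("(?:...)" groups without capturing)
word_pattern = r"[a-zA-Z]+(?:'[a-z]+)?"

sample = "Gregor's room wasn't big; it had 4 walls."  # hypothetical text
tokens = re.findall(word_pattern, sample)
print(tokens)
```

Pasting the same pattern into the regex playground linked above is a quick way to see which parts of a sentence it picks up.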

In [None]:
metamorph_tokens_raw = nltk.regexp_tokenize(metamorph, pattern)
print(metamorph_tokens_raw[:1200])

In [None]:
## length of metamorph_tokens_raw

len(metamorph_tokens_raw)

In [None]:
import collections

In [None]:
collections.Counter(metamorph_tokens_raw)

In [None]:
metamorph_tokens = [i.lower() for i in metamorph_tokens_raw]
print(metamorph_tokens[:100])


In [None]:
# remove apostrophes and hyphens from each token
metamorph_tokens = [i.replace("'", "").replace("-", "") for i in metamorph_tokens]

In [None]:
## let's check the most frequent words
collections.Counter(metamorph_tokens).most_common(20)

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stopwords.words("english")[0:1000]

In [None]:
stop_words = set()


In [None]:
# a set makes the membership test on every token fast
stop_words = set(stopwords.words('english') + ['copyright', 'project', 'ebook', 'gutenberg'])

In [None]:
metamorph_tokens_stopped = [w for w in metamorph_tokens if w not in stop_words]
print(metamorph_tokens_stopped[:100])

## Stemming / Lemmatization

[Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

### Stemming - Porter Stemmer 
<img src="https://cdn.homebrewersassociation.org/wp-content/uploads/Baltic_Porter_Feature-600x800.jpg" width="300">

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

In [None]:
singles = [stemmer.stem(e) for e in example]
print(*singles)

### Stemming - Snowball Stemmer
<img src="https://localtvwiti.files.wordpress.com/2018/08/gettyimages-936380496.jpg" width="300">

In [None]:
print(*SnowballStemmer.languages, sep='\n')

In [None]:
stemmer = SnowballStemmer("english")
print(stemmer.stem("probably"))

### Porter vs Snowball

In [None]:
print(SnowballStemmer("english").stem("generously"))
print(SnowballStemmer("porter").stem("generously"))

### Use Snowball on Metamorphosis

In [None]:
metamorph_tokens_stopped

In [None]:
meta_stemmed = [stemmer.stem(word) for word in metamorph_tokens_stopped]
print(meta_stemmed[:100])

### Lemmatizer

[Resources](https://www.guru99.com/stemming-lemmatization-python-nltk.html)

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer looks words up in the WordNet corpus
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

In [None]:
from nltk.corpus import wordnet


def get_wordnet_pos(tag):

    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to noun, which is also the lemmatizer's own default
        return wordnet.NOUN

In [None]:
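# pos_tag needs a tagger model, e.g. nltk.download('averaged_perceptron_tagger')
# (the resource name varies between NLTK versions)
nltk.pos_tag(metamorph_tokens_stopped)[:10]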

In [None]:
metamorph_lemmas_pos = []
for x, y in nltk.pos_tag(metamorph_tokens_stopped):
    metamorph_lemmas_pos.append((x, get_wordnet_pos(y)))

In [None]:
metamorph_lemmas_pos


In [None]:
metamorph_lemmas_pos[:100]

[POS_tag](https://www.guru99.com/pos-tagging-chunking-nltk.html)

### Use Lemmatizer on Metamorphosis

In [None]:
meta_lemmaed = []
for word, pos in metamorph_lemmas_pos:
    meta_lemmaed.append(lemmatizer.lemmatize(word, pos=pos))
print(*zip(metamorph_tokens_stopped[100:200], meta_lemmaed[100:200]), sep='\n')

## Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text with a consistent encoding, such as UTF-8, and normalizing Unicode characters into a single form.
- Handling of domain-specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
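
Two of the items above, Unicode normalization and number handling, can be sketched with the standard library alone. This is a minimal illustration, not a full cleaning pipeline: the helper name `normalize_text`, the `<NUM>` placeholder, and the sample sentence are all assumptions made for the example.

```python
import re
import unicodedata

def normalize_text(text):
    """Hypothetical helper: normalize Unicode and mask numbers."""
    # NFKC folds compatibility characters (e.g. the "fi" ligature U+FB01) into plain ones
    text = unicodedata.normalize("NFKC", text)
    # replace every run of digits with a placeholder token
    text = re.sub(r"\d+", "<NUM>", text)
    return text

# "\ufb01" is the single-character "fi" ligature
print(normalize_text("The \ufb01rst edition sold 1915 copies."))
```

Whether numbers should be masked, dropped, or kept depends on the task; masking keeps the information that a number was present without creating a separate feature for every value.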

## Frequency distributions

In [None]:
from nltk import FreqDist

In [None]:
meta_freqdist = FreqDist(meta_stemmed)

In [None]:
meta_freqdist.most_common(50)

In [None]:
%matplotlib inline
meta_freqdist.plot(30, cumulative=False)

# Vectorization
## This step happens after we account for stop words and lemmas (depending on the library)
* We make a **count vector**, which is the formal term for a **bag of words**
* We use vectors to pass text into machine learning models
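
Before reaching for a library, it can help to build a bag of words by hand. The sketch below uses only `collections.Counter`; the two-document corpus and the variable names are hypothetical. Each document becomes a list of counts with one slot per vocabulary word, and word order is thrown away.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # hypothetical mini-corpus

# the vocabulary is every distinct word, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# each document becomes a list of counts, one slot per vocabulary word
count_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(count_vectors)
```

A vectorizer does the same bookkeeping, plus tokenization and lowercasing, and stores the result as a sparse matrix because most slots are zero.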


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Let's check out the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

In [None]:
# test the CountVectorizer method on 'basic_example'
basic_example = ['The Data Scientist wants to train a machine to train machine learning models.']
cv = CountVectorizer()
cv.fit(basic_example)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv.get_feature_names_out()

In [None]:
# what info can we get from cv?
# hint -- look at the docs again

In [None]:
basic_example

In [None]:
print(cv.transform(basic_example))

In [None]:
corpora = ['Data scientist discovered a brand new way.', 
           'Data scientist spent a lot of time in cleaning the data', 
           'Data scientist should not peek at their test data']

In [None]:
cv.fit(corpora)

In [None]:
cv.get_feature_names_out()

In [None]:
cv.transform(['But data scientist discovered the test data anyway', "A better data scientist didn't" ]).toarray()

## Vectorization allows us to compare two documents

In [None]:
# use pandas to help see what's happening
import pandas as pd

In [None]:
# cv was last fit on 'corpora', so re-fit it on 'basic_example' before transforming
example_vector_doc_1 = cv.fit_transform(basic_example)

In [None]:
# what is the type

print(type(example_vector_doc_1))

In [None]:
# what does it look like

print(example_vector_doc_1)

In [None]:
# let's visualize it
example_vector_df = pd.DataFrame(example_vector_doc_1.toarray(), columns=cv.get_feature_names_out())
example_vector_df

In [None]:
# here we transform new text using the vocabulary cv has already learned;
# words that cv has not seen before are ignored
new_text = ['the data scientist plotted the residual error of her model']
new_data = cv.transform(new_text)
new_count = pd.DataFrame(new_data.toarray(), columns=cv.get_feature_names_out())
new_count

## N-grams

In [None]:
# in this the object 'sentences' becomes the corpus
sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
             'the data scientist plotted the residual error of her model in her analysis',
             'Her analysis was so good, she won a Kaggle competition.',
             'The machine gained sentience']

In [None]:
# go back to the docs for count vectorizer, how would we use an ngram
# pro tip -- include stop words
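# ngram_range=(2, 2) keeps only two-word sequences; (1, 2) would keep single words as well
bigrams = CountVectorizer(ngram_range=(2, 2))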

In [None]:
bigram_vector = bigrams.fit_transform(sentences)
bigram_vector

In [None]:
print(f'There are {len(bigrams.get_feature_names_out())} features for this corpus')
bigrams.get_feature_names_out()[:26]

In [None]:
# let's visualize it
bigram_df = pd.DataFrame(bigram_vector.toarray(), columns=bigrams.get_feature_names_out())
bigram_df.head()
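
To see what an n-gram is without any library, the sketch below slides a window of `n` words across a sentence. The helper name `word_ngrams` is hypothetical, and splitting on whitespace is a simplification compared with a vectorizer's tokenizer.

```python
def word_ngrams(text, n):
    """Hypothetical helper: every run of n consecutive words in text."""
    words = text.lower().split()
    # the last window starts at index len(words) - n
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("The machine gained sentience", 2))
```

Keeping stop words matters here: dropping them first would glue together words that were never adjacent in the original sentence.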

# TF-IDF
## Term Frequency - Inverse Document Frequency

In [None]:
'the brown cow'

$$
\begin{aligned}
w_{i,j} &= tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} &= \text{number of occurrences of } i \text{ in } j \\
df_i &= \text{number of documents containing } i \\
N &= \text{total number of documents}
\end{aligned}
$$
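
The formula above can be worked through by hand on a tiny corpus. This is a minimal sketch under stated assumptions: the three tokenized documents and the function name `tf_idf` are made up, and the natural logarithm is used. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math

docs = [["the", "brown", "cow"],
        ["the", "brown", "dog"],
        ["the", "cat"]]  # hypothetical tokenized corpus

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`; assumes the term occurs somewhere in `docs`."""
    tf = doc.count(term)                      # occurrences of the term in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    return tf * math.log(len(docs) / df)      # N = len(docs)

for term in docs[0]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

A word that appears in every document, like "the" here, gets a weight of zero because log(N / df) = log(1) = 0, while a word unique to one document gets the largest weight.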


In [None]:
import pandas as pd

In [None]:
tf_idf_sentences = ['The data Scientist wants to train a machine to train machine learning models.',
                    'the data scientist plotted the residual error of her model in her analysis',
                    'Her data analysis was so good, she won a Kaggle competition.',
                    'The data machine']
# take out stop words
tfidf = TfidfVectorizer(stop_words='english')
# fit transform the sentences
tfidf_sentences = tfidf.fit_transform(tf_idf_sentences)

In [None]:
# visualize it
tfidf_df = pd.DataFrame(tfidf_sentences.toarray(), columns=tfidf.get_feature_names_out())

In [None]:
tfidf_df

In [None]:
# compared to bigrams
bigram_df

In [None]:
# now let's test out our TfidfVectorizer
test_tfidf = tfidf.transform(['this is a test document','look at me I am a test document'])

In [None]:
# this is a sparse matrix with one row per document
test_tfidf

In [None]:
test_tfidf_df = pd.DataFrame(test_tfidf.toarray(), columns=tfidf.get_feature_names_out())
test_tfidf_df

## Measuring the Similarity Between Documents

We can tell how similar two documents are to one another, normalizing for length, by taking the cosine similarity of their vectors.

For count or TF-IDF vectors, whose entries are never negative, this number ranges from 0 to 1, with 0 meaning the documents share no terms and 1 meaning the vectors point in exactly the same direction. A potential application of cosine similarity is a basic recommendation engine: to recommend articles similar to a given article, you could take its cosine similarity with every other article and return the highest-scoring ones.

<img src="./img/better_cos_similarity.png" width=600>
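
As a rough sketch of the recommendation idea, the code below computes cosine similarity by hand and picks the closest article. The count vectors, the `cosine` helper, and the variable names are hypothetical; a real system would use vectors produced by a vectorizer.

```python
import math

def cosine(u, v):
    # dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical count vectors over a shared four-word vocabulary
articles = [[2, 1, 0, 0],
            [4, 2, 0, 0],   # same proportions as article 0, but twice as long
            [0, 0, 3, 1]]   # shares no words with article 0

query = 0
scores = [cosine(articles[query], other) for other in articles]
scores[query] = -1.0  # do not recommend the article itself
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, scores)
```

Article 1 scores highest even though it is twice as long as article 0, which is the sense in which cosine similarity normalizes for document length.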

In [None]:
trial = CountVectorizer()
sunday_afternoon = ['I ate a burger at burger queen and it was very good.',
                    'I ate a hot dog at burger prince and it was bad',
                    'I drove a racecar through your kitchen door',
                    'I ate a hot dog at burger king and it was bad. I ate a burger at burger queen and it was very good']

trial.fit(sunday_afternoon)
text_data = trial.transform(sunday_afternoon)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# the 0th and 2nd index lines are very different, a number close to 0
cosine_similarity(text_data[0],text_data[2])


In [None]:
# the 0th and 3rd index lines are very similar, despite different lengths
cosine_similarity(text_data[0],text_data[3])