## Text Mining and NLP

## Part 1

### Situation:

Priya works in the Europe division of an international PR firm. Their largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss aware of current events by providing a weekly short list of articles on political events in Spain. Reviewing articles on the BBC takes her hours every week, and Priya is very busy! She wonders whether she could automate the process with text mining to save time.

### Discussion

- What type of problem is this?
- What are we trying to do?
- What steps do you think might be involved? (big picture steps)

![talk](https://media.giphy.com/media/l2SpQRuCQzY1RXHqM/giphy.gif)

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

### Steps with articles:

https://github.com/aapeebles/text_examples 

1. Create a list of words
2. Tally how many times each word is used
3. Order the words by frequency
4. Try to find similar articles in the group using only your frequencies
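
The four steps above can be sketched in plain Python before reaching for any library. This is a minimal illustration, not the workshop's solution: the two sample strings and the `word_frequencies` helper are made up for the example, and the tokenizer is a deliberately crude letters-only match.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Steps 1-2: list the words in a text and tally them."""
    words = re.findall(r"[a-z]+", text.lower())   # step 1: crude word list
    return Counter(words)                          # step 2: tally

# Hypothetical stand-ins for two articles
article_a = "Spain votes today. Spain waits for the votes."
article_b = "The match in Madrid ended late."

freq_a = word_frequencies(article_a)
freq_b = word_frequencies(article_b)

# Step 3: most_common() returns (word, count) pairs, highest count first
print(freq_a.most_common(3))

# Step 4: a very rough similarity signal is the set of words two articles share
shared = set(freq_a) & set(freq_b)
print(shared)
```

Here the only shared word is `the`, which hints at why common filler words are usually removed before comparing documents.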


Yes, the list might be long.
![list](https://media.giphy.com/media/YLHwkqayc1j7a/giphy.gif)

DISCUSS!

### Bag of Words Steps

![step by step](https://i.gifer.com/VxbJ.gif)

1. Make all text lowercase
2. Remove punctuation, numbers, symbols, etc
3. Remove stop words, perhaps develop custom stop words list
4. Stemming/Lemmatization


But what about tokenization? When's the best time to tokenize?
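
One possible ordering, sketched below, is to tokenize first with a regular expression that keeps only letters, which takes care of step 2 at the same time, and then lowercase, filter, and stem the tokens. The `preprocess` helper and the tiny `DEMO_STOP_WORDS` set are assumptions made for this illustration; in practice you would likely use NLTK's full stop word list.

```python
import re
from nltk.stem import SnowballStemmer

# Tiny illustrative stop word set (an assumption for this sketch, not NLTK's list)
DEMO_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "were", "was"}

def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Tokenize, lowercase, drop stop words, and stem a raw string."""
    stemmer = SnowballStemmer("english")
    # Tokenizing with a letters-only pattern also discards punctuation and numbers
    tokens = re.findall(r"[a-zA-Z]+(?:'[a-z]+)?", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The cats were running in 2 gardens!"))
```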

In [1]:
from __future__ import print_function
import nltk
import sklearn

In [3]:
nltk.download()  # needed when you bring in corpora such as Project Gutenberg texts, etc.

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request

In [None]:
# Download Kafka's "Metamorphosis" from Project Gutenberg as raw bytes
metamorph = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/5200/pg5200.txt').read()


In [None]:
metamorph_st = metamorph.decode("utf-8") 

Load your article here

In [None]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
metamorph_tokens_raw = nltk.regexp_tokenize(metamorph_st, pattern)
print(metamorph_tokens_raw[:100])
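
To see what the pattern keeps and what it drops, it can help to run it on a single sentence. The sketch below uses Python's built-in `re.findall` with the same pattern; the sample sentence is made up for the illustration.

```python
import re

pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
sample = "Gregor's room wasn't 3 metres wide."

# Letters, optionally followed by an apostrophe and more lowercase letters,
# so contractions and possessives stay in one piece while digits and
# punctuation are skipped.
tokens = re.findall(pattern, sample)
print(tokens)
```

Note that the token `Gregor's` keeps its capital letter and its apostrophe, so lowercasing and stop word removal still have work to do afterwards.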

In [None]:
metamorph_tokens = [i.lower() for i in metamorph_tokens_raw]
print(metamorph_tokens[:100])


In [None]:
from nltk.corpus import stopwords
stopwords.words("english")

In [None]:
stop_words = set(stopwords.words('english'))
metamorph_tokens_stopped = [w for w in metamorph_tokens if w not in stop_words]
print(metamorph_tokens_stopped[:100])

## Stemming / Lemmatization

### Stemming - Porter Stemmer 
![porter](https://cdn.homebrewersassociation.org/wp-content/uploads/Baltic_Porter_Feature-600x800.jpg)

In [None]:
from nltk.stem import *
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

In [None]:
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
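
One way to see why stemming helps a bag-of-words model is to group words by their stem: different surface forms collapse into a single count. The word list below is made up for this sketch.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["meeting", "meets", "meet", "plotted", "plotting", "plots"]

# Map each stem to the original words that reduce to it
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```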

### Stemming - Snowball Stemmer
![snowball](https://localtvwiti.files.wordpress.com/2018/08/gettyimages-936380496.jpg?quality=85&strip=all)

In [None]:
print(" ".join(SnowballStemmer.languages))

In [None]:
stemmer = SnowballStemmer("english")
print(stemmer.stem("buying"))

### Porter vs Snowball

In [None]:
print(SnowballStemmer("english").stem("generously"))
print(SnowballStemmer("porter").stem("generously"))



### Use Snowball on Metamorphosis

In [None]:
meta_stemmed = [stemmer.stem(word) for word in metamorph_tokens_stopped]
print(meta_stemmed[:100])

### Lemmatization

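Lemmatization reduces a word to its dictionary form (lemma) by looking it up in WordNet, a lexical database of English.

`from nltk.stem import WordNetLemmatizer`
`wordnet_lemmatizer = WordNetLemmatizer()`


Challenge of lemmatization: the lemmatizer needs to know the word's part of speech. It assumes a noun unless you pass `pos`:

`wordnet_lemmatizer.lemmatize(word, pos="v")`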

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [None]:
wordnet_lemmatizer.lemmatize('dreamt',pos='v')
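
Because the right `pos` differs from word to word, a common approach is to run a part-of-speech tagger first and translate its tags into the four letters WordNet understands. The helper below is a hypothetical sketch of that translation step: `penn_to_wordnet` is not an NLTK function, and it assumes Penn Treebank style tags such as those returned by `nltk.pos_tag`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs standing in for tagger output
tagged = [("dreamt", "VBD"), ("strange", "JJ"), ("dreams", "NNS")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
```

Each result can then be passed as the `pos` argument of `wordnet_lemmatizer.lemmatize`.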

## Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text from encodings such as UTF-8 and normalizing the Unicode characters to a consistent form.
- Handling of domain specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
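
A few of these considerations can be handled with the standard library alone. The sketch below, with a hypothetical `normalize_text` helper and a made-up sample sentence, normalizes Unicode, strips accents as a crude form of transliteration, and removes numbers. Whether stripping accents is appropriate depends on the language and the task, so treat it as one option rather than a rule.

```python
import re
import unicodedata

def normalize_text(text):
    """Normalize Unicode, strip accents, and drop digits (illustrative only)."""
    # NFKD splits accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks is a crude form of transliteration
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove digit runs such as years and amounts
    no_digits = re.sub(r"\d+", " ", no_accents)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", no_digits).strip()

print(normalize_text("Málaga and Cádiz reported 25 protests in 2019"))
```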

### Document statistics

Average word length in document

In [None]:
float(sum(map(len, meta_stemmed))) / len(meta_stemmed)

Number of words in document

In [None]:
len(meta_stemmed)

## What you've all been waiting for 

![big deal](http://reddebtedstepchild.com/wp-content/uploads/2013/04/Big-deal-gif.gif)


## Frequency distributions

In [None]:
meta_freqdist = FreqDist(meta_stemmed)

In [None]:
meta_freqdist.most_common(50)

In [None]:
meta_freqdist.plot(30,cumulative=False)

**TASK**: Create a word frequency plot for your article

Question: Should any more stop words be added to the list, given your plot results?
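
If the plot is dominated by words that carry little meaning for your article, one option is to extend the stop word set and filter again. The sketch below uses a small hard-coded set in place of NLTK's list so that it runs without downloads; `custom_stop_words` and the sample tokens are hypothetical.

```python
# Stand-in for set(stopwords.words('english')); an assumption for this sketch
stop_words = {"the", "a", "and", "of", "to"}

# Hypothetical domain-specific additions spotted in a frequency plot
custom_stop_words = {"said", "mr", "bbc"}
all_stop_words = stop_words | custom_stop_words

tokens = ["the", "minister", "said", "the", "vote", "of", "mr", "sanchez"]
filtered = [t for t in tokens if t not in all_stop_words]
print(filtered)
```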

In [None]:
meta_finder = BigramCollocationFinder.from_words(meta_stemmed)
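
Creating the finder only counts the bigrams; to get a ranked list you still need to score them. The sketch below shows one way to do that with NLTK's `BigramAssocMeasures`, using a short made-up token list in place of `meta_stemmed`. The frequency threshold of 2 is an arbitrary choice for the example.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Made-up tokens standing in for meta_stemmed
tokens = ["gregor", "samsa", "woke", "gregor", "samsa", "slept"]

finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()

# Keep only bigrams that occur at least twice
finder.apply_freq_filter(2)

# Rank the remaining bigrams by raw frequency and take up to 5
top_bigrams = finder.nbest(bigram_measures.raw_freq, 5)
print(top_bigrams)
```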

## Creating a Data frame that compares the documents

**Puzzle**: how could you adapt the code below to allow you to compare documents and word counts?

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)