# Natural Language Processing

<img src="https://media.giphy.com/media/msKNSs8rmJ5m/giphy.gif" alt="Drawing" style="width: 300px;"/>




<img src = 'images/thinking.jpeg'></img>

#### Scenario: You work for an international political consultant. Your work is on the level and not shady at all. Scott, a friendly coworker, is trying to sort through news articles to quickly filter political from non-political articles. He has heard you possess some solid NLP chops, and has asked for your help in automating the task.

<div>
<p float = 'left'> What type of problem is this?</p>
<p float = 'left'> What steps do you anticipate carrying out? </p>
<p float = 'left'> What challenges do you foresee? </p>
</div>


<img src = 'https://media.giphy.com/media/WqLmcthJ7AgQKwYJbb/giphy.gif' alt="Drawing" style="width: 300px;"  float = 'right'> </img>

## Vanilla Python Text Exploration

Explore the example texts by:

1. Creating a list of words in each text.
2. Counting the number of occurrences of the words.
3. Ordering the words by number of occurrences.
4. Comparing the counts.
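The four steps above can be sketched in plain Python as follows. This is one possible approach rather than the only solution: the two short strings are made-up stand-ins for the files in `text_examples/`, and `count_words` is a hypothetical helper name.

```python
import string
from collections import Counter

# Made-up stand-ins for the contents of the example text files.
text_a = "The senate passed the bill. The president signed the bill!"
text_b = "The team won the game, and the fans cheered."


def count_words(text):
    """Lowercase the text, strip punctuation, split on whitespace, and count words."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return Counter(cleaned.split())


counts_a = count_words(text_a)
counts_b = count_words(text_b)

# most_common() orders the words by number of occurrences, highest first.
print(counts_a.most_common(3))
print(counts_b.most_common(3))

# Words that appear in both texts, for a quick comparison.
shared = sorted(set(counts_a) & set(counts_b))
print(shared)
```

Splitting on whitespace is the crudest possible tokenizer; it is enough to see which words dominate each text before reaching for a library.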





In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict
import string

with open('text_examples/A.txt', 'r') as read_file:
    text = read_file.read()

In [None]:
# Create a list of words

In [None]:
# Count the number of occurrences

In [None]:
# Now you order the number of occurrences and compare the texts


<h2>Bag of Words</h2>

<img src = "images/bag_of_word.jpg"></img>

What is the problem with text in relation to machine learning?

Bag of words (BOW) takes a text, breaks it up into small pieces (words, bigrams, stems, or lemmas), and converts it into counts. These counts can then be fed into our familiar machine learning algorithms.
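To make the idea concrete, here is a minimal sketch of a bag of words built by hand. The two documents and the helper name `bag_of_words` are illustrative assumptions; in practice a library vectorizer would usually do this work.

```python
from collections import Counter

# Two tiny made-up documents.
docs = ["the vote was close", "the game was close the crowd was loud"]

# Break each document into pieces (here: whitespace-separated words).
tokenized = [doc.split() for doc in docs]

# The vocabulary is every distinct word across all documents, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary word; word order in the text is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document ends up as a list of numbers of the same length, which is the shape most machine learning algorithms expect.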

Question: Did any algorithm pop into your head that might be particularly suited to bags of words?

## Steps for creating a bag of words

1. make lowercase 
2. remove punctuation
3. remove stopwords
4. apply stemmer/lemmatizer
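A rough sketch of the four steps chained together might look like the following. The small `STOPWORDS` set is a hand-written stand-in for NLTK's stopword list (which needs a one-time download), and `prepare` is a hypothetical helper name. `PorterStemmer` is part of NLTK and needs no extra data.

```python
import string

from nltk.stem import PorterStemmer

# Hand-written stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {'the', 'a', 'an', 'and', 'are', 'is', 'in', 'of', 'to'}

stemmer = PorterStemmer()


def prepare(text):
    """Apply the four bag-of-words preparation steps to a raw string."""
    # 1. make lowercase
    text = text.lower()
    # 2. remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 3. remove stopwords
    tokens = [word for word in text.split() if word not in STOPWORDS]
    # 4. apply stemmer
    return [stemmer.stem(word) for word in tokens]


print(prepare("The cats and the dogs are running in the park!"))
```

Lowercasing comes first so that the stopword check and the stemmer both see a consistent form of each word.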



To help us with the above steps, we will introduce a new library, **NLTK** ([documentation](https://www.nltk.org/)).




In [None]:
# We will come back to the articles later, but to practice preparing a text,
# let's use another of the NLTK resources (https://www.nltk.org/book/ch02.html)

import nltk
from nltk import word_tokenize, regexp_tokenize
# nltk.download('gutenberg')  # run once if the corpus is not installed locally
nltk.corpus.gutenberg.fileids()
# text = nltk.corpus.gutenberg.raw(<fill_in>).replace('\n',' ')[:2000]


In [None]:
## tokenize the text
text_tokens = word_tokenize(text)


In [None]:
# or use regexp which can take care of punctuation removal as well.
#https://regexr.com/
pattern = ("([a-zA-Z]+(?:'[a-z]+)?)")
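To see what this pattern keeps and what it drops, the sketch below applies it with the standard library's `re.findall`. The sample sentence is made up. NLTK's `regexp_tokenize(text, pattern)` accepts the same pattern and should give equivalent tokens.

```python
import re

pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

sample = "Don't stop, it's Scott's 2 articles!"  # made-up sentence

# A run of letters, optionally followed by an apostrophe and lowercase letters,
# is kept as one token; digits and punctuation never match, so they drop out.
tokens = re.findall(pattern, sample)
print(tokens)
```

Because contractions and possessives stay in one piece, this does tokenizing and punctuation removal in a single pass.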


In [None]:
# 1. Make Lowercase


In [None]:
# 2. Remove punctuation
# 3. Remove Stopwords
from nltk.corpus import stopwords
# stopwords.words()

# Stemmers/Lemmatizers

In [None]:
## Stemmers/Lemmatizers
## why would I use a stemmer and not a lemmatizer at all times?
# raw_text = nltk.corpus.gutenberg.raw()
from nltk.stem import PorterStemmer, SnowballStemmer


example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

p_stem = PorterStemmer()
stemmed_words = [p_stem.stem(word) for word in example]
stemmed_words

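Porter stemmer: the least aggressive of the commonly used stemmers.

Snowball stemmer (also known as Porter2): a revision of the Porter algorithm, often described as slightly more aggressive in how it stems.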
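One way to judge the difference for yourself is to run both stemmers over the same words and compare the results side by side. The sketch below does that for a few words taken from the example list; the dictionary layout is just one convenient way to organize the comparison.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

words = ['caresses', 'running', 'flies', 'agreed', 'sensational', 'plotted']

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
}

# Collect each stemmer's output for the same words so they can be compared.
stems = {name: [stemmer.stem(word) for word in words]
         for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(f'{name:10}', result)
```

Neither stemmer promises to return real words; both only aim to map related forms to the same string.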

In [None]:
# Lemmatizers

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

from nltk import pos_tag
# Problem with POS
wordnet_lemmatizer.lemmatize(example[6])
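By default `lemmatize` treats every word as a noun, so a verb form such as `'agreed'` comes back unchanged unless a part of speech is passed in. `pos_tag` returns Penn Treebank tags, while the lemmatizer expects WordNet's single-letter codes, so a small translation step is needed. The sketch below shows one common way to bridge the two. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the commented usage assumes the WordNet corpus and a POS tagger model have already been fetched with `nltk.download`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else is treated as a noun, the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's part of speech along."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


for tag in ['NN', 'VBD', 'JJ', 'RB']:
    print(tag, '->', penn_to_wordnet(tag))

# Usage with NLTK (requires the downloaded resources mentioned above):
# from nltk import pos_tag
# from nltk.stem import WordNetLemmatizer
# lemmatize_tagged(pos_tag(['they', 'agreed']), WordNetLemmatizer())
```

Falling back to noun for unknown tags mirrors what the lemmatizer does when no part of speech is given.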

In [None]:
## Frequency Distributions the easy way

from nltk import FreqDist
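As a quick illustration, `FreqDist` can be built directly from a list of tokens and then queried much like a `Counter`. The token list below is made up for the example.

```python
from nltk import FreqDist

# Made-up tokens standing in for a prepared text.
tokens = ['the', 'winner', 'takes', 'the', 'cake', 'and', 'the', 'crown']

freq = FreqDist(tokens)

print(freq['the'])          # count for a single token
print(freq.most_common(2))  # tokens ordered by count, highest first
print(freq.N())             # total number of tokens counted
```

This replaces the manual counting and sorting from the vanilla Python exploration with a single object.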

## Comparing using data frames

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the bread week winner goes to the final', 
        'the winner never has a soggy bottom', 
        'the contestants at the bottom were poor bread bakers' ]
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

# Each word is a dimension!
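One way to check this is to look at the shape of the matrix that `CountVectorizer` produces: one row per document, one column per vocabulary word. The sketch below rebuilds the three example documents so it runs on its own; the extra sentence passed to `transform` is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the bread week winner goes to the final',
        'the winner never has a soggy bottom',
        'the contestants at the bottom were poor bread bakers']

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Rows are documents, columns are words. The default tokenizer ignores
# single-character tokens, so 'a' does not get a column.
n_docs, n_dimensions = X.shape
print(n_docs, n_dimensions)

# vocabulary_ maps each word to the index of its column (its dimension).
print(vec.vocabulary_['bread'])

# A new document is placed in the same dimensions; words not seen during
# fitting have no column and are dropped.
new_vector = vec.transform(['the final bread bake']).toarray()[0]
print(new_vector.sum())
```

The number of dimensions grows with the vocabulary, which is why real corpora lead to very wide, mostly empty matrices.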

![word](https://media.giphy.com/media/xT1R9ERHwyzbCkIwla/giphy.gif)