# Natural Language Processing (NLP)

## Introduction

*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky*

### What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models using data about a language

### What are some of the higher level task areas?

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [My application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the **language** and the **world**.

## Part 1: Reading in the Yelp Reviews

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [None]:
# read yelp.csv into a DataFrame
PATH = r'../data/yelp.csv'
yelp = pd.read_csv(PATH)

# create a new DataFrame that only contains the 5-star and 1-star reviews

# define X and y


# split the new DataFrame into training and testing sets
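
One way to fill in the steps above is sketched below. It uses a small toy DataFrame in place of `yelp.csv` so that it runs on its own; the column names `stars` and `text` and the variable name `yelp_best_worst` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for yelp.csv; the 'stars' and 'text' column names are assumptions
yelp = pd.DataFrame({
    'stars': [5, 1, 3, 5, 1, 4, 5, 1],
    'text': ['great food', 'terrible service', 'it was okay', 'loved it',
             'never again', 'pretty good', 'best pizza ever', 'rude staff'],
})

# keep only the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# define X (review text) and y (star rating)
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split into training and testing sets (fixed seed so the split is repeatable)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))
```

With the real data, the same filtering and splitting lines should apply unchanged once `yelp` has been read from the CSV file.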


## Part 2: Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy for English text; harder for languages that do not put spaces between words (such as Chinese or Japanese)

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test


In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")


In [None]:
# last 50 features


In [None]:
# show vectorizer options


[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- **lowercase:** boolean, True by default
- Convert all characters to lowercase before tokenizing.

In [None]:
# don't convert to lowercase


- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [None]:
# include 1-grams and 2-grams


In [None]:
# last 50 features


**Predicting the star rating:**

In [None]:
# use default options for CountVectorizer

# create document-term matrices

# use Naive Bayes to predict the star rating

# calculate accuracy


In [None]:
# calculate null accuracy


In [None]:
# define a function that accepts a vectorizer and calculates the accuracy
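
The prediction cells above could be filled in along the following lines. This is a sketch on toy reviews rather than the Yelp data, and the function name `tokenize_test` is a hypothetical choice, not something defined elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# toy reviews with 5-star / 1-star labels (assumed data)
X_train = ['great food great service', 'terrible food rude service',
           'great place', 'terrible place']
y_train = [5, 1, 5, 1]
X_test = ['great', 'terrible']
y_test = [5, 1]

def tokenize_test(vect):
    """Fit vect on the training text and return Naive Bayes testing accuracy."""
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    return metrics.accuracy_score(y_test, y_pred_class)

# default options, then 1-grams and 2-grams
accuracy = tokenize_test(CountVectorizer())
accuracy_bigrams = tokenize_test(CountVectorizer(ngram_range=(1, 2)))

# null accuracy: always predict the most frequent class in the testing set
null_accuracy = max(y_test.count(5), y_test.count(1)) / len(y_test)
print(accuracy, accuracy_bigrams, null_accuracy)
```

Wrapping the steps in a function makes it easy to compare vectorizer settings against the null accuracy baseline.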


In [None]:
# include 1-grams and 2-grams


## Part 3: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [None]:
# show vectorizer options


- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [None]:
# remove English stop words


In [None]:
# set of stop words


## Part 4: Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

In [None]:
# remove English stop words and only keep 100 features


In [None]:
# all 100 features


In [None]:
# include 1-grams and 2-grams, and limit the number of features


- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count of documents.

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents


## Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

In [None]:
# print the first review


In [None]:
# save it as a TextBlob object


In [None]:
# list the words


In [None]:
# list the sentences


In [None]:
# some string methods are available


## Part 6: Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

In [None]:
# initialize stemmer

# stem each word
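
A minimal stemming sketch using the `SnowballStemmer` imported at the top of this notebook. The word list is an assumption chosen for illustration.

```python
from nltk.stem.snowball import SnowballStemmer

# initialize an English stemmer
stemmer = SnowballStemmer('english')

# stem each word in a toy list (assumed)
words = ['running', 'cats', 'reviews', 'waited']
stems = [stemmer.stem(word) for word in words]
print(stems)
```

The stemmer only applies suffix-stripping rules, so it needs no dictionary and runs quickly.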


**Lemmatization:**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can give better results than stemming, because the output is an actual word rather than a truncated stem
- **Notes:** Uses a dictionary-based approach (slower than stemming)

In [None]:
# assume every word is a noun


In [None]:
# assume every word is a verb


In [None]:
# define a function that accepts text and returns a list of lemmas


In [None]:
# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)


In [None]:
# last 50 features


## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [None]:
# example documents


In [None]:
# Term Frequency


In [None]:
# Document Frequency


In [None]:
# Term Frequency-Inverse Document Frequency (simple version)


In [None]:
# TfidfVectorizer


**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)

"Tokens that occur frequently in a given string should have higher contribution to similarity than those that occur few times, as should those tokens that are rare among the set of strings under consideration."

## Part 8: Using TF-IDF to Summarize a Yelp Review

Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF!

In [None]:
# create a document-term matrix using TF-IDF


In [None]:
def summarize():
    
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        #review_text = unicode(yelp.text[review_id], 'utf-8')
        review_length = len(review_text)
    
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    
    # print words with the top 5 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)
    
    # print 5 random words
    print('\n' + 'RANDOM WORDS:')
    random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    for word in random_words:
        print(word)
    
    # print the review
    print('\n' + review_text)

## Part 9: Sentiment Analysis

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)


In [None]:
# understanding the apply method (create a column that is the length of the review)
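
As a sketch of how `apply` works, the example below runs a function on every value in a text column of a toy DataFrame (assumed data). The helper `count_words` is hypothetical; a function that returns `TextBlob(text).sentiment.polarity` could be applied in exactly the same way.

```python
import pandas as pd

# toy stand-in for the yelp DataFrame (assumed data)
yelp = pd.DataFrame({'text': ['great food', 'terrible', 'it was okay']})

# apply a built-in function to each review: length in characters
yelp['length'] = yelp.text.apply(len)

# apply a custom function to each review (hypothetical helper)
def count_words(text):
    return len(text.split())

yelp['n_words'] = yelp.text.apply(count_words)
print(yelp)
```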


In [None]:
# define a function that accepts text and returns the polarity


In [None]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)


In [None]:
# box plot of sentiment grouped by stars


In [None]:
# reviews with most positive sentiment


In [None]:
# reviews with most negative sentiment


In [None]:
# widen the column display


In [None]:
# negative sentiment in a 5-star review


In [None]:
# positive sentiment in a 1-star review


In [None]:
# reset the column display width


## Bonus: Adding Features to a Document-Term Matrix

In [None]:
# create a DataFrame that only contains the 5-star and 1-star reviews

# define X and y


# split into training and testing sets


In [None]:
# use CountVectorizer with text column only


In [None]:
# shape of other four feature columns


In [None]:
# cast other feature columns to float and convert to a sparse matrix


In [None]:
# combine sparse matrices
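
The steps above (vectorize the text, convert the other columns to a sparse matrix, then combine) might look like the sketch below. The toy DataFrame and its numeric column names `cool` and `useful` are assumptions for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# toy training data; the numeric column names are assumptions
X_train = pd.DataFrame({
    'text': ['great food', 'terrible service', 'great service'],
    'cool': [2, 0, 1],
    'useful': [3, 1, 0],
})

# use CountVectorizer with the text column only
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)

# cast the other feature columns to float and convert to a sparse matrix
extra = sparse.csr_matrix(X_train.drop('text', axis=1).astype(float).to_numpy())

# combine the sparse matrices side by side (same rows, more columns)
X_train_dtm_extra = sparse.hstack((X_train_dtm, extra)).tocsr()
print(X_train_dtm.shape, extra.shape, X_train_dtm_extra.shape)
```

The testing set would need the same treatment, using `vect.transform` rather than `fit_transform` so that the columns line up.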


In [None]:
# repeat for testing set


In [None]:
# use logistic regression with text column only


In [None]:
# use logistic regression with all features


## Bonus: Fun TextBlob Features

In [None]:
# spelling correction


In [None]:
# spellcheck


In [None]:
# definitions


In [None]:
# language identification


## Conclusion

- NLP is a gigantic field
- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible

## Further study
 - Susan Li writes frequently about NLP topics and how to implement different techniques: https://towardsdatascience.com/@actsusanli
 
## Regular Expressions
 - Regular expressions (regex) are a powerful tool for finding and cleaning up patterns in text
 - A good tutorial on regex basics: https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
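
As a small illustration of regular expressions in an NLP setting, the sketch below tokenizes a toy sentence (assumed data) with the same pattern that `CountVectorizer` uses by default, `\b\w\w+\b`, which keeps only tokens of two or more word characters. It also collapses repeated whitespace with `re.sub`.

```python
import re

text = "Great pizza!! I'd come back 2 times a week :)"  # toy sentence (assumed)

# tokens of 2+ word characters, after lowercasing
tokens = re.findall(r'\b\w\w+\b', text.lower())
print(tokens)

# collapse runs of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', 'too   many    spaces')
print(cleaned)
```

Single-character tokens such as "a" and "2" are dropped by this pattern, which matches how they were dropped from the vocabularies earlier in this notebook.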