# Python Examples for Basic Natural Language Processing (NLP) 

## Introduction to NLP:
The goal of an NLP pipeline is to convert raw text into a structured format from which useful features can be extracted.

## NLP Terminology
* A *document* can be a sentence, a paragraph or an article. It is represented as a sequence of N words (terms), denoted $D = (t_1, t_2, ..., t_N)$.
* A *corpus* is a collection of M text documents, denoted $C = \{D_1, D_2, ..., D_M\}$.
* A *dictionary* is the set of all distinct words that appear in at least one document of the corpus.
* The *Document-Term (DT) matrix* is a summary matrix in which each row corresponds to a document and each column corresponds to a term in the dictionary; each entry is the number of times that term occurs in that document.
* *Term frequency (tf)* is the raw count of a term in a document: $tf(t, d_j)$ is the number of times the term $t$ appears in document $d_j$.
* *Stop words* are very common words that carry little meaning on their own, such as "the", "a", "an", "in".

For example:

Document 1: the hello hello world

Document 2: the hello

Document 3: hi the

Then the dictionary is [the, hello, world, hi].


$$
\text{DT matrix} = \begin{bmatrix}
1 & 2 & 1 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1
\end{bmatrix}
$$
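
The toy example above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and a dictionary ordered by first appearance; the names `docs`, `dictionary` and `dt_matrix` are illustrative and not part of any library.

```python
from collections import Counter

docs = ["the hello hello world", "the hello", "hi the"]

# Tokenize on whitespace (an assumption that is enough for this toy corpus)
tokenized = [doc.split() for doc in docs]

# Dictionary: distinct terms in order of first appearance
dictionary = []
for tokens in tokenized:
    for term in tokens:
        if term not in dictionary:
            dictionary.append(term)

# DT matrix: one row per document, one column per dictionary term
dt_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    dt_matrix.append([counts[term] for term in dictionary])

print(dictionary)
for row in dt_matrix:
    print(row)
```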

## Tokenizing 

The first step is to turn the raw text into a list of words (tokens).

In [23]:
from nltk.tokenize import word_tokenize

raw_text = "the hello hello world. the hello. hi the"

In [24]:
# Split the raw text into individual words
document = word_tokenize(raw_text)
print(document)

['the', 'hello', 'hello', 'world', '.', 'the', 'hello', '.', 'hi', 'the']
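
To see roughly what a word tokenizer does, here is a simplified stand-in written with the standard `re` module. It is only a sketch: `simple_tokenize` is a hypothetical helper, and its single rule (runs of word characters, or single punctuation marks) is far cruder than what `word_tokenize` does, although it happens to give the same tokens for this input.

```python
import re

def simple_tokenize(text):
    # A token is either a run of word characters or one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

raw_text = "the hello hello world. the hello. hi the"
tokens = simple_tokenize(raw_text)
print(tokens)
```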


To treat each sentence as a separate document, use sentence tokenizing:

In [25]:
from nltk.tokenize import sent_tokenize

# Split the raw text into individual sentences
docs = sent_tokenize(raw_text)
print(docs)

['the hello hello world.', 'the hello.', 'hi the']


In [26]:
# Split individual sentences into words
corpus = []
for doc in docs:
    corpus.append(word_tokenize(doc))
print(corpus)

[['the', 'hello', 'hello', 'world', '.'], ['the', 'hello', '.'], ['hi', 'the']]


## Document-Term (DT) matrix
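As defined above, the Document-Term (DT) matrix has one row per document and one column per dictionary term. `CountVectorizer` builds it directly from the raw sentences. Note that it sorts the vocabulary alphabetically and drops punctuation, so the column order differs from the hand-built example above.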

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

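# Treat each sentence as one document
corpus = sent_tokenize(raw_text)
print(corpus)

# Build a bag of words model
v = CountVectorizer()
# Mapping from each term to its column index
print(v.fit(corpus).vocabulary_)
# Terms in column order (newer scikit-learn releases use get_feature_names_out() instead)
print(v.get_feature_names())
# Document-term count matrix
print(v.fit_transform(corpus).toarray())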

['the hello hello world.', 'the hello.', 'hi the']
{u'world': 3, u'the': 2, u'hi': 1, u'hello': 0}
[u'hello', u'hi', u'the', u'world']
[[2 0 1 1]
 [1 0 1 0]
 [0 1 1 0]]
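
Because `CountVectorizer` orders its columns alphabetically, its matrix is a column permutation of the hand-built DT matrix from the terminology section. The sketch below uses only the values printed above and reorders the columns to the dictionary order [the, hello, world, hi]; `reorder_columns` is a hypothetical helper, not a scikit-learn function.

```python
# Values copied from the CountVectorizer output above
vocabulary = {'world': 3, 'the': 2, 'hi': 1, 'hello': 0}
counts = [[2, 0, 1, 1],
          [1, 0, 1, 0],
          [0, 1, 1, 0]]

def reorder_columns(matrix, vocabulary, terms):
    # vocabulary maps each term to its column index in matrix
    return [[row[vocabulary[t]] for t in terms] for row in matrix]

dictionary = ['the', 'hello', 'world', 'hi']
reordered = reorder_columns(counts, vocabulary, dictionary)
for row in reordered:
    print(row)
```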


## Inverse Document Frequency (idf) 
IDF measures how informative a term is with respect to the whole corpus. Rather than treating all terms as equally important, IDF assigns each term a weight that scales up rare terms and scales down frequent ones. In this way, stop words like "the", "that", "if" and "is" get less weight than more informative words. The IDF weight is defined as $idf(t) = \log\frac{M}{|\{d \in C : t \in d\}|} + 1$, where $M$ is the total number of documents in the corpus $C$ and $|\{d \in C : t \in d\}|$ is the number of documents in which the term $t$ occurs. Note that idf depends only on the term and the corpus, so for a given corpus it is the same in every document.
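
As a worked example, the idf weight can be evaluated by hand for the three-document toy corpus from the terminology section. This is a small sketch using the natural logarithm; the function name `idf` and the pre-tokenized `corpus` list are illustrative choices.

```python
import math

# Toy corpus from the terminology section, already tokenized
corpus = [["the", "hello", "hello", "world"],
          ["the", "hello"],
          ["hi", "the"]]

M = len(corpus)  # total number of documents

def idf(term):
    # Number of documents that contain the term at least once
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(float(M) / doc_freq) + 1

for term in ["the", "hello", "world", "hi"]:
    print("%s %.4f" % (term, idf(term)))
```

The term "the" occurs in every document, so its weight is the minimum value of 1, while "world" and "hi" each occur in a single document and receive the largest weight.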

## TF-IDF
tf-idf is a popular weighting scheme that rescales the raw term frequency. It is computed as $tfidf(t,d) = tf(t,d) \times idf(t)$. Multiplying tf by idf down-weights terms that occur in many documents, so a term scores highly only when it is frequent in document $d$ and rare in the rest of the corpus. For this reason tf-idf also acts as a soft stop-word filter.


### Computing TF-IDF

In [65]:
import numpy as np
from sklearn.preprocessing import normalize

bag_of_words_matrix = v.fit_transform(corpus).toarray()
# Number of documents in which each term occurs
document_freq = np.sum(bag_of_words_matrix > 0, axis=0)
#print(document_freq)

# M is the number of documents
M = bag_of_words_matrix.shape[0]

idf = np.log(float(M) / document_freq) + 1
# Scale each column of term counts by that term's idf
tfidf = np.multiply(bag_of_words_matrix, idf)
# Optional: L2-normalize each row (document)
#tfidf = normalize(tfidf, axis=1)
print(tfidf)

[[ 2.81093022  0.          1.          2.09861229]
 [ 1.40546511  0.          1.          0.        ]
 [ 0.          2.09861229  1.          0.        ]]


In [68]:
# Alternatively, scikit-learn's TfidfTransformer computes tf-idf directly
# Documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
from sklearn.feature_extraction.text import TfidfTransformer

# norm=None and smooth_idf=False reproduce the manual computation above
transformer = TfidfTransformer(norm=None, smooth_idf=False)
print(transformer.fit_transform(bag_of_words_matrix).toarray())

[[ 2.81093022  0.          1.          2.09861229]
 [ 1.40546511  0.          1.          0.        ]
 [ 0.          2.09861229  1.          0.        ]]
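
The call above switches off two `TfidfTransformer` defaults. Left at their defaults (`smooth_idf=True`, `norm='l2'`), the transformer uses the smoothed weight $\ln\frac{1+M}{1+df(t)} + 1$ and then scales each row to unit Euclidean length. The plain-Python sketch below reproduces that computation by hand for the count matrix above. It illustrates the formula and is not scikit-learn's actual code; the helper names are made up.

```python
import math

# Count matrix printed earlier (columns: hello, hi, the, world)
counts = [[2, 0, 1, 1],
          [1, 0, 1, 0],
          [0, 1, 1, 0]]

M = len(counts)
n_terms = len(counts[0])

# Document frequency of each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: acts as if one extra document contained every term
smooth_idf = [math.log((1.0 + M) / (1.0 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    # Divide by the Euclidean length; leave all-zero rows untouched
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else row

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, smooth_idf)]
    tfidf_rows.append(l2_normalize(weighted))

for row in tfidf_rows:
    print([round(x, 4) for x in row])
```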


## N-grams
N-grams (or Ngrams) are sequences of N consecutive words in your corpus. They can be used as features to train a machine learning model.

In [27]:
from nltk.util import ngrams

# An example of 2-grams (bigrams) in our document above.
for token in ngrams(document, 2):
    print(token)

('the', 'hello')
('hello', 'hello')
('hello', 'world')
('world', '.')
('.', 'the')
('the', 'hello')
('hello', '.')
('.', 'hi')
('hi', 'the')
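
The same sliding-window idea can be written without NLTK. The sketch below is one possible pure-Python version; `simple_ngrams` is a hypothetical helper, not the NLTK implementation.

```python
def simple_ngrams(tokens, n):
    # Zip together n copies of the token list, each shifted by one position
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = ['the', 'hello', 'hello', 'world']
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so a list shorter than n yields none.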


## Stop Words
Stop words are very common words that carry little meaning on their own, such as "the", "a", "an", "in".

In [28]:
# List the English stop words that ship with NLTK
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [29]:
# Remove stop words from our documents
# Build the stop-word set once so each membership test is fast
stop_words_set = set(stopwords.words('english'))
document = [w.lower() for w in document if w.lower() not in stop_words_set]
print(document)

['hello', 'hello', 'world', '.', 'hello', '.', 'hi']
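
The filtered list still contains punctuation tokens such as '.'. One common way to drop them is the `isalpha()` check that the sentiment example later in this notebook also uses. A minimal sketch on the output above:

```python
# Token list printed above, after stop-word removal
document = ['hello', 'hello', 'world', '.', 'hello', '.', 'hi']

# Keep only tokens made up entirely of letters
words_only = [token for token in document if token.isalpha()]
print(words_only)
```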


## Stemming and Lemmatization

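Stemming and lemmatization are two forms of text normalization: both reduce inflected forms of a word to a common base form. Stemming strips suffixes with heuristic rules, while lemmatization uses a vocabulary to return the dictionary form (lemma) of a word.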

For example: 
* play, playing, played --> play
* better --> good
* mice --> mouse
* keys --> key

NLTK provides interfaces to several well-known stemmers, such as the Porter, Lancaster and Snowball stemmers.

In [30]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

example_doc = ['play','played','playing']
ps = PorterStemmer()
for t in example_doc:
    print(ps.stem(t))

print(SnowballStemmer('english').stem('plays'))

# Lemmatization maps irregular forms to their dictionary form
print(WordNetLemmatizer().lemmatize('mice'))

play
play
play
play
mouse
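
To make the difference between the two approaches concrete, here is a deliberately tiny sketch. `toy_stem` strips a few suffixes by rule and `toy_lemmatize` looks words up in a hand-made table; both are hypothetical helpers and far simpler than the Porter stemmer or the WordNet lemmatizer.

```python
def toy_stem(word):
    # Rule-based: strip the first matching suffix, keeping at least 3 letters
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Lookup-based: irregular forms need a vocabulary (tiny hand-made table here)
TOY_LEMMAS = {"mice": "mouse", "better": "good"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["play", "played", "playing", "plays"]:
    print(toy_stem(w))
print(toy_stem("mice"))       # no suffix rule applies, so the word is unchanged
print(toy_lemmatize("mice"))  # the lookup table maps it to its lemma
```

Suffix rules handle regular inflections such as "played", but irregular forms like "mice" can only be resolved with a vocabulary.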


# An interesting example: Sentiment Analysis with Python NLTK Text Classification

In [None]:
# nltk.download()
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
stop_words_set = set(stopwords.words('english'))
 
def extractFeature(paragraph):
    feature_dict = dict()
    words = word_tokenize(paragraph)
    for word in words:
        word = word.lower()
        # Keep alphabetic tokens that are not stop words
        if word not in stop_words_set and word.isalpha():
            feature_dict[word] = True
    return feature_dict
 
training_data = []
for fileid in movie_reviews.fileids():
    feature_pos = extractFeature(movie_reviews.raw(fileid))
    training_data.append((feature_pos, fileid[:3]))

In [102]:
# nltk.download()
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
stop_words_set = set(stopwords.words('english'))
 
def getFeatures(text):
    feature_dict = dict()
    words = word_tokenize(text)
    for token in words:
        token = token.lower()
        if token not in stop_words_set and token.isalpha():
            feature_dict[token] = True
    return feature_dict
 
train_data = []
for fileid in movie_reviews.fileids():
    features = getFeatures(movie_reviews.raw(fileid))
    # File IDs start with 'neg/' or 'pos/', so the first three characters are the label
    train_data.append((features, fileid[:3]))

In [103]:
train_data[0]

({u'accident': True,
  u'actors': True,
  u'actually': True,
  u'ago': True,
  u'also': True,
  u'although': True,
  u'always': True,
  u'american': True,
  u'apparently': True,
  u'apparitions': True,
  u'applaud': True,
  u'arrow': True,
  u'assuming': True,
  u'attempt': True,
  u'audience': True,
  u'away': True,
  u'back': True,
  u'bad': True,
  u'beauty': True,
  u'bentley': True,
  u'big': True,
  u'biggest': True,
  u'bit': True,
  u'blair': True,
  u'bottom': True,
  u'break': True,
  u'came': True,
  u'character': True,
  u'characters': True,
  u'chase': True,
  u'chasing': True,
  u'chopped': True,
  u'church': True,
  u'clue': True,
  u'coming': True,
  u'completely': True,
  u'concept': True,
  u'confusing': True,
  u'continues': True,
  u'cool': True,
  u'correctly': True,
  u'couples': True,
  u'craziness': True,
  u'critique': True,
  u'crow': True,
  u'dead': True,
  u'deal': True,
  u'decent': True,
  u'decided': True,
  u'despite': True,
  u'dies': True,
  u'differe

In [104]:
train_data[-1]

({u'accident': True,
  u'actions': True,
  u'address': True,
  u'airplane': True,
  u'almost': True,
  u'also': True,
  u'although': True,
  u'amazing': True,
  u'americans': True,
  u'andrew': True,
  u'announcer': True,
  u'another': True,
  u'answer': True,
  u'apart': True,
  u'apparently': True,
  u'appears': True,
  u'around': True,
  u'aroused': True,
  u'artificial': True,
  u'asked': True,
  u'attack': True,
  u'awakening': True,
  u'back': True,
  u'beautiful': True,
  u'became': True,
  u'becomes': True,
  u'behind': True,
  u'best': True,
  u'better': True,
  u'big': True,
  u'breaks': True,
  u'brings': True,
  u'brisk': True,
  u'building': True,
  u'built': True,
  u'burbank': True,
  u'california': True,
  u'came': True,
  u'cameo': True,
  u'camera': True,
  u'cameras': True,
  u'capitalized': True,
  u'captive': True,
  u'car': True,
  u'cards': True,
  u'carefully': True,
  u'carrey': True,
  u'carrying': True,
  u'center': True,
  u'certainly': True,
  u'change': Tr

In [112]:
fit = NaiveBayesClassifier.train(train_data)
# Accuracy is measured on the training data itself, so it is an optimistic estimate
accuracy = nltk.classify.util.accuracy(fit, train_data)
print(accuracy)

0.9625
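
The accuracy above is computed on the same reviews the classifier was trained on. A more honest estimate holds out part of the data for testing. The helper below is a hedged sketch of such a split; `split_train_test`, `test_fraction` and `seed` are hypothetical names, and the placeholder list stands in for the notebook's `train_data`.

```python
import random

def split_train_test(data, test_fraction=0.2, seed=0):
    # Shuffle a copy so the caller's list is left untouched
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test items become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder (features, label) pairs standing in for train_data
labeled = [({"word%d" % i: True}, "pos" if i % 2 else "neg") for i in range(10)]
train_part, test_part = split_train_test(labeled)
print("%d %d" % (len(train_part), len(test_part)))
```

With the real data, one would train with `NaiveBayesClassifier.train(train_part)` and evaluate with `nltk.classify.util.accuracy(fit, test_part)`.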


In [114]:
test_set1 = 'Such a bad movie! Terrible!'
feature_test1 = getFeatures(test_set1)
fit.classify(feature_test1)

u'neg'

In [115]:
test_set2 = 'I love it!'
feature_test2 = getFeatures(test_set2)
fit.classify(feature_test2)

u'pos'

Note: the model classifies the review 'Such a bad movie! Terrible!' as negative and 'I love it!' as positive.