# Natural Language Processing (NLP) Tutorial

In this tutorial, we will cover natural language processing (NLP) in Python using NLTK, a popular Python library for NLP.

So what is NLP, and what are the benefits of learning it?

## What is NLP?

In short, natural language processing (NLP) is about developing applications and services that are able to understand human languages.

Practical examples of NLP include recognizing that different words are synonyms and writing complete, grammatically correct sentences and paragraphs. Beyond these examples, the same ideas have many industrial applications and benefits.

## Benefits of NLP

Blogs, social websites, and web pages generate millions of gigabytes of data every day.

Many companies gather this data to understand users and their interests, then report their findings to other companies so that they can adjust their plans.

## NLP Implementations

These are some successful implementations of natural language processing (NLP):
- **Search engines** like Google, Yahoo, etc.
- **Social websites feeds** like Facebook news feed.
- **Speech engines** like Apple Siri.
- **Spam filters** like Google spam filters.

## NLP Libraries

There are many open source natural language processing (NLP) libraries. These are some of them:

- [Natural language toolkit (NLTK)](https://www.nltk.org/) (Python)
- [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/index.html) (Python)
- [spaCy](https://spacy.io/) (Python)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (Java)
- [Apache OpenNLP](https://opennlp.apache.org/) (Java)

In this tutorial, we will use the Python NLTK library.

## Install NLTK

In a Jupyter notebook, you can install it with the following command (run it only once):
```python
!pip install -U NLTK
```
Download the NLTK data that some NLP functions depend on (run it only once):
```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that recent NLTK releases look for instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')
```

## Text data overview

The SMS Spam Collection v.1 is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

The files contain one message per line. Each line consists of two columns separated by a `TAB` (`\t`): the label (ham or spam) followed by the raw text. Here are some examples:

```
ham   What you doing?how are you?
ham   dun say so early hor... U c already then say...
ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
spam   FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now!
spam   Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia?
```
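
The tab-separated layout described above can also be parsed with plain Python. The sketch below is only an illustration: `parse_line` is a hypothetical helper, and the two sample lines are copied from the examples above with an explicit `\t` between the columns.

```python
def parse_line(line: str) -> tuple:
    """Split one raw line into (label, text) at the first TAB only."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text


# Sample lines taken from the examples above, with the TAB written as \t.
sample_lines = [
    "ham\tWhat you doing?how are you?\n",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia?\n",
]

records = [parse_line(line) for line in sample_lines]
for label, text in records:
    print(f"{label}: {text[:30]}")
```

Splitting at the first TAB only keeps the message intact even if the text itself happens to contain a tab character.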

In [None]:
import pandas as pd

# read training data
train_file = "https://github.com/liuhoward/teaching/raw/master/big_data/smsspam/SMSSpamCollection.train"
train_data = pd.read_csv(train_file, sep='\t', header=None, names=['label', 'text'])

print(f'num train records: {len(train_data)}')
train_data.head()

## Tokenize Text

In [None]:
# simple example: split text on spaces
text = "What you doing?how are you?"
tokens = text.split(' ')
print(tokens)

We saw how to split the text into tokens using the `split` function. Now we will see how to tokenize the text using NLTK.

In [None]:
# simple example to tokenize using NLTK
from nltk.tokenize import word_tokenize

text = "What you doing?how are you?"
# use lower case
text = text.lower()
tokens = word_tokenize(text)
print(tokens)
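
To make the difference between the two approaches concrete without any NLTK data, here is a minimal sketch that uses a small regular expression as a stand-in for a real tokenizer. The pattern is an assumption chosen for illustration; it is not how `word_tokenize` works internally and it will handle contractions and abbreviations less carefully.

```python
import re

demo_text = "What you doing?how are you?"

# Whitespace splitting leaves punctuation glued to the neighbouring words.
space_tokens = demo_text.split(" ")

# Illustrative pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", demo_text.lower())

print(space_tokens)  # ['What', 'you', 'doing?how', 'are', 'you?']
print(regex_tokens)  # ['what', 'you', 'doing', '?', 'how', 'are', 'you', '?']
```

The whitespace version treats `doing?how` as a single token, which is why a proper tokenizer matters for messy SMS text.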

In [None]:
# a simple example to calculate frequency
import nltk

freq = nltk.FreqDist(tokens)
print(freq.items())
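
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting idea can be sketched with the standard library alone. The token list below is a hand-written stand-in for the output of the tokenizer.

```python
from collections import Counter

# Hand-written tokens standing in for the tokenizer output.
demo_tokens = ["what", "you", "doing", "?", "how", "are", "you", "?"]

demo_freq = Counter(demo_tokens)

# most_common(n) returns the n most frequent tokens with their counts.
top_two = demo_freq.most_common(2)
print(top_two)  # [('you', 2), ('?', 2)]
```

Sorting by frequency like this is usually easier to read than printing all items.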

In [None]:
# tokenize training text
train_texts = train_data['text']
print(len(train_texts))
train_text_tokens = [word_tokenize(text.lower()) for text in train_texts]
# print the first example
print(train_text_tokens[0])

## Remove stopwords
Some words, such as `the`, `of`, `a`, and `an`, occur very often but carry little meaning on their own. These are called stop words.  
Stop words are generally removed so that they do not affect our results.  
NLTK ships with stop word lists for many languages.

In [None]:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
print(stopwords_list)

In [None]:
stopwords_set = set(stopwords_list)

# remove stopwords in training tokens
train_clean_tokens = list()
for token_list in train_text_tokens:
    new_token_list = list()
    for token in token_list:
        if token in stopwords_set:
            continue
        new_token_list.append(token)
        
    train_clean_tokens.append(new_token_list)
# print first record
print(train_clean_tokens[0])
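
The nested loop above can also be written as a nested list comprehension. The sketch below uses a tiny hand-written stop word set (`demo_stopwords`, a hypothetical stand-in for the NLTK list) so that it runs without any downloads.

```python
# Hypothetical mini stop word set, used only to keep the example self-contained.
demo_stopwords = {"you", "are", "how", "what", "the", "a"}

demo_token_lists = [
    ["what", "you", "doing", "?", "how", "are", "you", "?"],
    ["win", "a", "super", "sony", "dvd", "recorder"],
]

# Keep every token that is not in the stop word set, one list per message.
demo_clean_lists = [
    [token for token in token_list if token not in demo_stopwords]
    for token_list in demo_token_lists
]
print(demo_clean_lists)
# [['doing', '?', '?'], ['win', 'super', 'sony', 'dvd', 'recorder']]
```

Note that punctuation tokens such as `?` survive, because stop word lists contain words only. Whether to filter punctuation as well is a separate decision.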

## Word Stemming

Word stemming means removing affixes from a word to return its root form. For example, the stem of the word `working` is `work`.  
There are many stemming algorithms, but one of the most widely used is the **Porter stemming algorithm**.

In [None]:
# a simple example
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('increases'))
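
One practical effect of stemming is that several inflected forms collapse into a single token, which shrinks the vocabulary the classifier has to learn. A minimal sketch, using a hand-picked word list chosen for illustration:

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Four surface forms of the same word, picked for illustration.
demo_words = ["work", "works", "worked", "working"]
demo_stems = [demo_stemmer.stem(word) for word in demo_words]

print(demo_stems)
# Number of distinct tokens before and after stemming.
print(len(set(demo_words)), "->", len(set(demo_stems)))
```

The stemmer itself needs no downloaded NLTK data, so this runs right after installation.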

In [None]:
# stemming for training data
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

train_stem_tokens = list()
for token_list in train_clean_tokens:
    new_token_list = [stemmer.stem(token)  for token in token_list]
    train_stem_tokens.append(new_token_list)

print(train_stem_tokens[0])

## Lemmatization

Lemmatization is similar to stemming, but the difference is that the result of lemmatization is a real word.

In [None]:
# a simple example to compare stemming & lemmatization
print(stemmer.stem('increases'))

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))

In [None]:
# more examples for lemmatization
print(lemmatizer.lemmatize('playing', pos="v"))  # treat 'playing' as a verb
print(lemmatizer.lemmatize('playing', pos="n"))  # treat 'playing' as a noun
print(lemmatizer.lemmatize('playing', pos="a"))  # treat 'playing' as an adjective
print(lemmatizer.lemmatize('playing', pos="r"))  # treat 'playing' as an adverb

## Read & Preprocess text data for testing dataset

Read the testing dataset into pandas. This time, we combine stop word removal and stemming in a single loop.

In [None]:
# read test data
test_file = "https://github.com/liuhoward/teaching/raw/master/big_data/smsspam/SMSSpamCollection.test"
test_data = pd.read_csv(test_file, sep='\t', header=None, names=['label', 'text'])

print(f'num test records: {len(test_data)}')

test_texts = test_data['text']

In [None]:
# tokenize testing text
test_text_tokens = [word_tokenize(text.lower()) for text in test_texts]

test_stem_tokens = list()
for token_list in test_text_tokens:
    new_token_list = list()
    for token in token_list:
        # ignore stopwords
        if token in stopwords_set:
            continue
        # stemming
        new_token = stemmer.stem(token)
        new_token_list.append(new_token)
        
    test_stem_tokens.append(new_token_list)

## Generate features

Classifiers need numeric input, so we convert each token sequence into a numeric feature vector.

In [None]:
# token list to token dict
train_tokens = [nltk.FreqDist(token_list) for token_list in train_stem_tokens]
test_tokens = [nltk.FreqDist(token_list) for token_list in test_stem_tokens]

# token dict to vector
from sklearn.feature_extraction import DictVectorizer
# define vectorizer
feature_vectorizer = DictVectorizer()
# learn token indices from training tokens
feature_vectorizer.fit(train_tokens)

In [None]:
import numpy as np

# generate feature vector for training:
train_features = feature_vectorizer.transform(train_tokens)
print(f'train features shape: {np.shape(train_features)}')
# generate feature vector for testing:
test_features = feature_vectorizer.transform(test_tokens)
print(f'test features shape: {np.shape(test_features)}')
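
To see what the vectorizer does on a small scale, the sketch below fits `DictVectorizer` on two toy count dictionaries (hypothetical stand-ins for the `FreqDist` objects) and then transforms a test dictionary that contains a token never seen in training.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy token counts standing in for the FreqDist objects used above.
demo_train_counts = [{"free": 2, "call": 1}, {"call": 1, "home": 1}]
demo_test_counts = [{"free": 1, "prize": 3}]  # "prize" never appears in training

demo_vectorizer = DictVectorizer()
demo_vectorizer.fit(demo_train_counts)

# Column order learned from the training data (sorted by token by default).
print(demo_vectorizer.feature_names_)

# transform() returns a sparse matrix; toarray() makes it easy to inspect.
demo_train_matrix = demo_vectorizer.transform(demo_train_counts).toarray()
demo_test_matrix = demo_vectorizer.transform(demo_test_counts).toarray()
print(demo_train_matrix)
print(demo_test_matrix)
```

Each column corresponds to one training token. The unseen token `prize` is silently dropped from the test vector, which is why the vectorizer is fitted on the training data only and reused for the test data.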

## Generate labels

Convert labels (ham & spam) to numeric indices (0, 1) for classifiers.

In [None]:
# get raw labels for training, testing data

train_labels = train_data['label']

test_labels = test_data['label']


In [None]:
from sklearn.preprocessing import LabelEncoder

# define label encoder
label_encoder = LabelEncoder()

# learn encoding of labels
label_encoder.fit(train_labels)


In [None]:
# convert training labels into indices
train_target = label_encoder.transform(train_labels)
# shape of training label indices
print(np.shape(train_target))
# first 20 label indices
print(train_target[0:20])

# pay attention to unbalanced classes
values, counts = np.unique(train_target, return_counts=True)
print(f'unique values: {values}')
print(f'unique values frequency: {counts}')

# convert testing labels into indices
test_target = label_encoder.transform(test_labels)
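
A small, self-contained sketch of how `LabelEncoder` behaves, using a hand-written label list. It shows the learned class order, the round trip back to the original strings, and what happens with a label that was not present during fitting.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration.
demo_labels = ["ham", "spam", "ham", "ham"]

demo_encoder = LabelEncoder()
demo_encoder.fit(demo_labels)

# Classes are stored in sorted order; the position is the numeric index.
print(demo_encoder.classes_.tolist())

demo_encoded = demo_encoder.transform(demo_labels)
print(demo_encoded.tolist())

# inverse_transform maps indices back to the original label strings.
demo_decoded = demo_encoder.inverse_transform(demo_encoded)
print(demo_decoded.tolist())

# A label that was not seen during fit() raises a ValueError.
try:
    demo_encoder.transform(["junk"])
except ValueError as err:
    print(f"unseen label rejected: {err}")
```

Because the classes are sorted, `ham` maps to 0 and `spam` maps to 1 here.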

## Classifier

We use a logistic regression classifier. You can also try other classifiers, such as Naive Bayes or SVM.

In [None]:
from sklearn.linear_model import LogisticRegression

# define classifier
# what if we remove class_weight?
clf = LogisticRegression(solver='lbfgs', class_weight='balanced')

clf.fit(X=train_features, y=train_target)
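
As background for the `class_weight` question in the cell above: scikit-learn documents the `'balanced'` setting as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula by hand to a hypothetical, deliberately unbalanced label list; it does not use the tutorial's actual class counts.

```python
from collections import Counter

# Hypothetical unbalanced labels: 0 = ham, 1 = spam.
demo_target = [0] * 8 + [1] * 2

demo_counts = Counter(demo_target)
n_samples = len(demo_target)
n_classes = len(demo_counts)

# Documented 'balanced' heuristic: n_samples / (n_classes * count_of_class).
demo_weights = {
    label: n_samples / (n_classes * count)
    for label, count in demo_counts.items()
}
print(demo_weights)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so errors on spam messages count more during training. Without the weights, a classifier can reach high accuracy while missing many of the rare spam messages.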

In [None]:
# get predicted label indices of testing data
test_pred = clf.predict(X=test_features)

print(np.shape(test_pred))
print(test_pred[0:20])

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

# get spam label index
spam_index = int(label_encoder.transform(['spam'])[0])
print(f'spam index is {spam_index}')

accuracy = accuracy_score(test_target, test_pred)
f1 = f1_score(test_target, test_pred, pos_label=spam_index)
precision = precision_score(test_target, test_pred, pos_label=spam_index)
recall = recall_score(test_target, test_pred, pos_label=spam_index)

print(f'accuracy: {accuracy}')
print(f'precision: {precision}')
print(f'recall: {recall}')
print(f'f1 score: {f1}')
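
To make the four metrics easier to interpret, the sketch below computes them by hand from a small set of hypothetical labels and predictions, treating spam (1) as the positive class. The numbers are made up for illustration and are not results from this tutorial's model.

```python
# Hypothetical true and predicted labels: 1 = spam, 0 = ham.
demo_true = [1, 1, 1, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(demo_true, demo_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that was missed

demo_accuracy = sum(t == p for t, p in pairs) / len(pairs)
demo_precision = tp / (tp + fp)  # of the messages flagged as spam, how many are spam
demo_recall = tp / (tp + fn)     # of the real spam messages, how many were flagged
demo_f1 = 2 * demo_precision * demo_recall / (demo_precision + demo_recall)

print(f"accuracy: {demo_accuracy:.3f}")
print(f"precision: {demo_precision:.3f}")
print(f"recall: {demo_recall:.3f}")
print(f"f1 score: {demo_f1:.3f}")
```

With unbalanced classes, precision, recall, and F1 on the spam class are usually more informative than accuracy alone.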


# Q & A