Some popular text processing tasks:

* **Language Translation** - Translation of a sentence from one language to another
* **Sentiment Analysis** - To determine whether the sentiment towards any topic or product is positive, negative or neutral, based on a corpus of text
* **Spam Filtering** - To detect unsolicited and unwanted email/messages

In this notebook, we'll discuss the steps involved in text processing.

# Data Preprocessing
***
The data preprocessing steps could include:
* **Tokenization** - Converting sentences to words
* Removing unnecessary punctuation and tags
* Removing stop words
* **Stemming** - Removing inflection by dropping unnecessary characters (usually a suffix)
* **Lemmatization** - Removing inflection by determining the part of speech and using a detailed database of the language
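
As a rough illustration of the first three steps, here is a minimal sketch that uses only the Python standard library. The regular expressions, the tiny `STOP_WORDS` set and the `preprocess` helper are assumptions made for this example; a real pipeline would normally rely on a library's tokenizer and stop word list.

```python
import re

# A tiny, hand-picked stop word set for illustration only (hypothetical;
# real lists contain well over a hundred entries).
STOP_WORDS = {"the", "a", "an", "is", "over", "of", "and"}


def preprocess(sentence):
    """Strip tags, tokenize, and drop stop words from one sentence."""
    # Remove anything that looks like an HTML tag, e.g. <br />.
    no_tags = re.sub(r"<[^>]+>", " ", sentence)
    # Tokenize: keep runs of letters, digits and apostrophes; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z0-9']+", no_tags)
    # Remove stop words, comparing in lowercase.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(preprocess("The quick brown fox<br /> jumps over the lazy dog!"))
```

Stemming and lemmatization are left out here because they need language-specific rules or a dictionary.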

Stemming is the poor man's lemmatization: it is cheaper to compute, but the result is not always a real word.

```text
The stemmed form of studies is: studi
The stemmed form of studying is: study

The lemmatized form of studies is: study
The lemmatized form of studying is: study
```

We can use the `nltk` library for much of this preprocessing.

## Tokenization
***

In [1]:
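import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK tokenizer data; if it is missing,
# download it with nltk.download before running this cell.
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
print(tokens)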

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


## Removing Stopwords
***
We can use `nltk` to remove stop words (common words that carry little semantic value).

If the stop word list for `nltk` has not been downloaded yet, run:
```python
nltk.download("stopwords")
```

In [2]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
# The stop word list is lowercase, so compare lowercased tokens;
# otherwise capitalized words such as "The" are not removed.
tokens = [token for token in tokens if token.lower() not in stop_words]
print(tokens)

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']


## Stemming
***
`nltk` also provides several stemmer interfaces, such as the Porter, Lancaster and Snowball stemmers:

In [3]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


# Feature Extraction
***
In text processing, words represent discrete, categorical features. How do we encode this categorical data so that it is ready to be used by the algorithms? One of the simplest techniques is the **bag of words** featurization.

## Bag of Words (BOW)
***
We make a **list of unique words** in the corpus, called the **vocabulary**. Each word in the vocabulary gets its own basis vector. A sentence, or each document in the corpus, is then represented by the sum of the basis vectors of the words appearing in it, which amounts to a vector of word counts.
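
The construction above can be sketched in a few lines of plain Python. The two-sentence corpus and the `bag_of_words` helper are made up for this illustration, and splitting on whitespace stands in for a proper tokenizer.

```python
from collections import Counter

# Toy corpus, invented for this example.
corpus = [
    "the dog chased the cat",
    "the cat slept",
]

# The vocabulary is the sorted list of unique words in the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def bag_of_words(document):
    """Return the count vector of `document` over `vocabulary`."""
    counts = Counter(document.split())
    # Summing one-hot basis vectors is the same as counting each word.
    return [counts[word] for word in vocabulary]


print(vocabulary)
for doc in corpus:
    print(bag_of_words(doc))
```

Each position in the output vector corresponds to one vocabulary word, so every document gets a vector of the same length regardless of how many words it contains.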

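Raw counts give the most weight to words that are frequent in every document, even though such words say little about any one of them. This leads us to the **Term Frequency-Inverse Document Frequency (TF-IDF)** technique: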

## TF-IDF
***
First, let us clarify what is meant by a document. A corpus is made up of many documents, so if your corpus is a set of tweets, each document is a tweet. We can then ask for the word vector of an entire tweet, which would be some function of the word vectors of its constituent words.

$$\textrm{Term Frequency (TF)} = \frac{\textrm{number of times token appears in single document}}{\textrm{number of tokens in single document}}$$

$$\textrm{Inverse Document Frequency (IDF)} = \log\left(\frac{\textrm{total number of documents}}{\textrm{number of documents this token appears in}}\right)$$

$$\textrm{TF-IDF} = (\textrm{TF})(\textrm{IDF})$$

Here is an example of calculating the TF-IDF of a term in a document:

```text
tweet_one = "This is a beautiful beautiful day day day day day"
tweet_two = "This is a beautiful night night"
```

Then we would have that:
```text
TF("beautiful", tweet_one) = 2/10
TF("day", tweet_one) = 5/10
IDF("beautiful") = log10(2/2) = 0
IDF("day") = log10(2/1) ≈ 0.30

TF_IDF("beautiful", tweet_one) = (2/10)(0) = 0
TF_IDF("day", tweet_one) = (5/10)(0.30) = 0.15
```

So we see that for the first tweet, TF-IDF assigns zero weight to the word "beautiful" but a greater weight to "day". "beautiful" has no discriminating power because it appears in both documents, whereas "day" is an important word for `tweet_one` in the context of the entire corpus (it appears in only one of the two tweets).
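
The worked example can be checked with a short standard-library sketch. The helper names `tf`, `idf` and `tf_idf` are chosen for this illustration, whitespace splitting stands in for tokenization, and the logarithm is assumed to be base 10, which matches the value of about 0.3 used above.

```python
import math

tweet_one = "This is a beautiful beautiful day day day day day"
tweet_two = "This is a beautiful night night"
corpus = [tweet.split() for tweet in (tweet_one, tweet_two)]


def tf(token, document):
    """Share of the document's tokens that equal `token`."""
    return document.count(token) / len(document)


def idf(token, documents):
    """Base-10 log of (number of documents / documents containing `token`)."""
    containing = sum(1 for doc in documents if token in doc)
    return math.log10(len(documents) / containing)


def tf_idf(token, document, documents):
    """Product of term frequency and inverse document frequency."""
    return tf(token, document) * idf(token, documents)


for token in ("beautiful", "day"):
    print(token, round(tf_idf(token, corpus[0], corpus), 2))
```

Note that `idf` as written divides by zero for a token that appears in no document, so it should only be called on tokens from the corpus.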

`scikit-learn` provides efficient tools for computing the TF-IDF of a corpus.

One of the major disadvantages of the bag of words featurization is that it discards all information contained in the order of the words.
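
A two-sentence example, invented for this illustration, makes the loss of word order concrete: the sentences mean opposite things but have the same word counts.

```python
from collections import Counter

first = "dog bites man"
second = "man bites dog"

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are identical.
print(Counter(first.split()) == Counter(second.split()))  # prints True
```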

To address this problem, we can use an approach called **word embedding**.

## Word Embedding
***
A word embedding is a numerical representation of text where words that have similar semantic meaning are geometrically closer in their semantic vector space. 
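
Geometric closeness is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; the values are hypothetical and do not come from any trained model, where embeddings typically have tens to hundreds of dimensions.

```python
import math

# Hand-made vectors for illustration only (hypothetical values).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Related words should score closer to 1 than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```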

### Word2Vec
***
Word2Vec takes a corpus of text and produces a vector space that has the property that vectors that are geometrically close also share common semantic contexts in the corpus. 

### GloVe
***
The **Global Vectors for Word Representation** (GloVe) algorithm is an extension to the Word2Vec model. GloVe constructs a co-occurrence matrix over the whole text corpus. The entries of this matrix record how often a given token appears in the context of every other word in the vocabulary, from which co-occurrence probabilities are estimated. This relies on the **distributional hypothesis**: words that appear in similar contexts tend to have similar meanings.
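
The co-occurrence counting step can be sketched as follows. The two-sentence corpus and the `window` size are assumptions made for this example, and the sketch covers only the counting, not the model that GloVe then fits to these statistics.

```python
from collections import defaultdict

# Toy corpus, invented for this example.
corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
]
window = 1  # hypothetical context size: one token on each side

# cooccurrence[w][c] counts how often c appears within `window` tokens of w.
cooccurrence = defaultdict(lambda: defaultdict(int))
for tokens in corpus:
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[word][tokens[j]] += 1

# Normalizing a row gives an estimate of P(context | word).
row = cooccurrence["is"]
total = sum(row.values())
print({context: count / total for context, count in row.items()})
```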

# Choosing ML Algorithms
***
Classical ML approaches such as Naive Bayes or support vector machines are very popular for tasks like spam filtering. However, deep learning techniques, combined with learned word embeddings, are taking the NLP stage by storm.
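
To give a feel for the classical approach, here is a minimal multinomial Naive Bayes sketch for spam filtering in plain Python. The four training messages, the `ham` label for non-spam and the `classify` helper are all made up for this illustration; in practice one would use a library implementation and a real dataset.

```python
import math
from collections import Counter

# Tiny made-up training set for illustration only.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}


def classify(text):
    """Return the label with the highest log posterior under Naive Bayes."""
    scores = {}
    for label, counts in word_counts.items():
        # Log prior: share of training documents with this label.
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        total = sum(counts.values())
        for word in text.split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((counts[word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


print(classify("free money"))
print(classify("noon meeting"))
```

Working with log probabilities avoids numerical underflow when many word probabilities are multiplied together.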

# Example: IMDB Movie Review Sentiment Analyzer
***
We will now build a sentiment analyzer over the IMDB movie review dataset. 

We will be performing binary classification (negative or positive reviews).

Our dataset can be downloaded [here](http://ai.stanford.edu/~amaas/data/sentiment/).

The dataset contains 25000 training reviews and 25000 testing reviews. 

Each review is stored as free text in its own file, so we read the files as plain text. Reading them with `pd.read_csv` fails with `ParserError: Error tokenizing data. C error: EOF inside string starting at row 0`, because the reviews are not valid CSV.

In [43]:
import os

import pandas as pd

folder = "data/imdb_movie_reviews/"
labels = {"pos": 1, "neg": 0}

rows = []
for dataset in ["test", "train"]:
    for polarity in ["pos", "neg"]:
        directory = os.path.join(folder, dataset, polarity)
        for file in os.listdir(directory):
            # Each file holds one free-text review, so read it as plain text.
            with open(os.path.join(directory, file), encoding="utf8") as f:
                rows.append({"review": f.read(), "sentiment": labels[polarity]})

# Build the DataFrame once from the collected rows. DataFrame.append returned
# a new frame instead of modifying `data` in place, and was removed in pandas 2.0.
data = pd.DataFrame(rows, columns=["review", "sentiment"])
print(data)