# Introduction to Natural Language Processing

**Objective:** A general introduction to basic NLP methods, intended as a foundation for further self-study and more advanced NLP curricula. This notebook lays the groundwork for techniques such as text processing, information retrieval, and text classification.

- The Reddit dataset we will use today comes from Google BigQuery. You can find it [here](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2018_05?pli=1).
 - The data is public, but you need an active Google Cloud Platform account to access it.
- The original dataset is very large, so we sampled posts from the top 10 subreddits.

- We will also learn the following Python NLP packages along the way:

 - [NLTK](http://www.nltk.org/) - a very popular package for doing NLP in Python

 - [Textblob](https://textblob.readthedocs.io/en/dev/) - similar to NLTK, but provides a higher-level API that is easier to use

 - [WordCloud](https://github.com/amueller/word_cloud) - a package for generating word clouds in Python

## Prerequisite

- Open your **Terminal/Anaconda Prompt**, cd to the lecture code folder and run the following command:
 - `pip install -r requirements.txt`

- After installing all the required packages, run the following command:
 - `python -m textblob.download_corpora`
 
- Restart this Jupyter notebook.

In [None]:
import nltk

# Uncomment the following line the first time you run the code
# nltk.download('stopwords')
# nltk.download('wordnet')

## Load Dataset

In [None]:
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/nycdsabt01/reddit_top10.csv')

- It is always a good idea to check the shape of the dataframe and column types before you run any type of operation.

In [None]:
df.shape

In [None]:
df.dtypes

- When you first read in a dataset, I recommend using `df.sample()` rather than `df.head()`. The first few rows may look fine while missing values or mixed types appear further down a column, so a random sample gives a better picture of the whole dataset.

In [None]:
df.sample(10)

- `selftext` is the raw text of each Reddit post. Take a look at the column: it contains missing values as well as `[deleted]` and `[removed]` placeholders, none of which should be treated as valid text.
- We need to clean the text before we can analyze it further.

In [None]:
# Fill na with empty string
df['selftext'] = df['selftext'].fillna('')
# Replace `removed` and `deleted` with empty string
tbr = ['[removed]', '[deleted]']
df['selftext'] = df['selftext'].apply(lambda x: '' if x in tbr else x)

- After cleaning the data, about 88% of the values in our `selftext` column are just empty strings.
- It therefore makes sense to concatenate each post's text with its title.

In [None]:
print(sum(df['selftext'] == '') / df.shape[0])

In [None]:
df['selftext'] = df['title'] + ' ' + df['selftext']

In [None]:
df.sample(10)

## Preprocessing

- Convert all the text to lowercase, which avoids having multiple copies of the same word.
- Remove URLs from the text.
- Collapse every run of whitespace into a single space.
- Remove rows whose text is empty from the dataframe.

In [None]:
import re

# Convert all the strings to lowercase
df['selftext'] = df['selftext'].str.lower()
# \S* matches any run of non-whitespace characters, so this removes each URL
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'http\S*', '', x))
# \s+ matches any run of whitespace (spaces, \n, \r, \t); strip() drops leftover edge spaces
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
# We don't want empty strings in our text
df = df.loc[df['selftext'] != ""]
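
- To see what these steps do to a single string, here is a minimal sketch on a made-up post. The function name `preprocess` and the text `toy_post` are hypothetical and only for illustration; the trailing `strip()` guards against leftover spaces at the edges.

```python
import re

def preprocess(text):
    """Lowercase, drop URLs, and collapse whitespace in one string."""
    text = text.lower()
    # Remove anything that starts with "http" up to the next whitespace
    text = re.sub(r'http\S*', '', text)
    # Collapse every run of whitespace (spaces, newlines, tabs) into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

toy_post = "Check THIS out:\n https://example.com/page  it's   great"
cleaned = preprocess(toy_post)
print(cleaned)
```

- A string that contains nothing but a URL comes back empty, which is why the empty rows are filtered out last.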

- Let's take a look at the dataframe after preprocessing.

In [None]:
df.sample(10)

## Text Processing Steps and Methods

- Before text can be used in machine-assisted analysis or machine learning methods, it must be converted into a form the model can interpret. The specific steps are:
 - Filtering
 - Tokenization
 - Stemming
 - Lemmatization

## Filtering

- The first step is to remove punctuation, which for frequency-based analyses like the ones in this notebook usually adds little information. Removing it also reduces the size of the training data.

In [None]:
import re
# [^\w\s] matches any character that is neither a word character nor whitespace
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

- Sentences often contain words that carry little meaning for text mining tasks such as topic modeling or word frequency analysis. Examples include "the", "is", etc. Collectively, these are known as "stopwords".
- When mining for information, consider whether your method calls for removing stopwords (word clouds, for example, do). To illustrate, we will use the stopwords corpus from nltk.
- Note that resources for working with text itself are usually found under `nltk.corpus`. "Corpus" is the linguistics term for a structured set of texts used for statistical study, so be mindful of this vocabulary.
- The stopwords from nltk are just a Python list, so you can easily append words that are too common within the topic of your text. For example, "computer" would be a stopword in a corpus largely dealing with data science.

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
print(stop)

In [None]:
stop_set = set(stop)  # set membership checks are much faster than list lookups
df['selftext'] = df['selftext'].apply(lambda text: " ".join(word for word in text.split() if word not in stop_set))
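
- Extending the stopword list with domain-specific words can be sketched as follows. The lists `base_stopwords` and `domain_stopwords` are small hypothetical stand-ins; in this notebook the base list would come from `stopwords.words('english')`.

```python
# Hypothetical stand-in for the nltk list, kept tiny for illustration
base_stopwords = ['the', 'is', 'a', 'in', 'of']
# Words that are too common to be informative in this (made-up) corpus
domain_stopwords = ['computer']

# Combine both lists into a set for fast membership checks
custom_stop = set(base_stopwords) | set(domain_stopwords)

def remove_stopwords(text, stop_words):
    """Return the text with every whitespace-separated stopword removed."""
    return ' '.join(word for word in text.split() if word not in stop_words)

toy_text = "the computer is a tool in the study of language"
filtered = remove_stopwords(toy_text, custom_stop)
print(filtered)
```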

## Tokenization 

- Tokenization is the act of splitting text into a sequence of tokens, typically words. In this example, we try a simple tokenization method using the standard `split`.

In [None]:
sample_text = "This is a toy example. Illustrate this example below."
sample_tokens = sample_text.split()
print(sample_tokens)

- Did you notice something? While we have the tokens, "example" and "example." are treated as different tokens. As an NLP data scientist, you must decide whether to distinguish the two.

- Note that various Python packages, such as nltk, by default tokenize "." as a separate token, giving it its own special meaning. This is illustrated below:

In [None]:
from nltk.tokenize import word_tokenize 
word_tokenize(sample_text)

- However, TextBlob's `words` property drops punctuation such as "." altogether.

In [None]:
from textblob import TextBlob
TextBlob(sample_text).words
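
- Both behaviours can be approximated with the standard `re` module. This is only a rough sketch of the two choices, not how nltk or TextBlob tokenize internally; the variable names are hypothetical.

```python
import re

toy_sentence = "This is a toy example. Illustrate this example below."

# Keep only runs of word characters, so punctuation is dropped
words_only = re.findall(r"\w+", toy_sentence)

# Keep words and also emit each punctuation mark as its own token
words_and_punct = re.findall(r"\w+|[^\w\s]", toy_sentence)

print(words_only)
print(words_and_punct)
```

- With either pattern, "example" now shows up as the same token in both sentences.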

## Stemming and Lemmatization

- Many English words appear in several inflected forms that share the same core meaning. There are two main methods for recognizing forms such as "strike, striking, struck" as the same word.

- Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach.

- The most common stemming algorithms are:
 - [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) (the older traditional method)
 - [Lancaster Stemmer](http://textanalysisonline.com/nltk-lancaster-stemmer) (a more aggressive modern stemmer)

- Stemming and lemmatization can both be done with hand-written rules using creative regex, but for a practical demo in this notebook we will apply the `PorterStemmer` from nltk to the example below.

In [None]:
nonprocess_text = "I am writing a Python string"

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmed_text = ' '.join([stemmer.stem(word) for word in nonprocess_text.split()])

- Note! This is more robust than a naive regex rule: as the output below shows, "writing" is converted to "write", but "string" isn't converted to "stre".

In [None]:
print(stemmed_text)
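
- For comparison, here is a sketch of the kind of naive regex rule mentioned above. The function `naive_stem` is hypothetical and assumes a single rule that replaces a trailing "ing" with "e".

```python
import re

def naive_stem(word):
    """Replace a trailing 'ing' with 'e', with no further checks."""
    return re.sub(r'ing$', 'e', word)

naive_text = ' '.join(naive_stem(word) for word in "I am writing a Python string".split())
print(naive_text)
```

- The rule handles "writing" but mangles "string", because it never checks whether what remains is a plausible stem.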

- Unlike stemming, lemmatization tries to identify the root word (the lemma) by looking it up in a dictionary corpus. In essence, you could replicate the effect manually by implementing a look-up after parsing a text. Because the result is always a real word, we usually prefer lemmatization over stemming.

- There are various dictionaries one can base lemmatization on. NLTK's [wordnet](http://wordnet.princeton.edu/) is powerful enough to handle most lemmatization tasks. We'll examine a few examples below.

In [None]:
from nltk.stem import WordNetLemmatizer

lemztr = WordNetLemmatizer()

In [None]:
lemztr.lemmatize('feet')

- Note that the lemmatizer returns the input unchanged if the word isn't found in the dictionary.

In [None]:
lemztr.lemmatize('abacadabradoo')
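
- The look-up idea can be sketched with a plain dictionary. `LEMMA_TABLE` is a tiny hypothetical table, not WordNet data, and `lookup_lemma` is an illustrative helper rather than how `WordNetLemmatizer` is implemented.

```python
# Hypothetical miniature lemma dictionary, for illustration only
LEMMA_TABLE = {
    'feet': 'foot',
    'mice': 'mouse',
    'struck': 'strike',
    'striking': 'strike',
}

def lookup_lemma(word, table=LEMMA_TABLE):
    """Return the lemma if the word is in the table, else the word itself."""
    return table.get(word.lower(), word)

print(lookup_lemma('feet'))
print(lookup_lemma('abacadabradoo'))
```

- As with the lemmatizer above, an unknown word falls through unchanged.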

## N-grams

- N-grams are sequences of N consecutive words. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3), and so on can also be used.

- Unigrams usually carry less information than bigrams and trigrams. The basic principle behind n-grams is that they capture language structure, such as which letter or word is likely to follow a given one.

- The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

- Google hosts its n-gram corpora on [AWS S3](https://aws.amazon.com/datasets/google-books-ngrams/) for free.
- The dataset is about 2.2TB. You might consider fetching it with this [Python downloader](https://github.com/dimazest/google-ngram-downloader).

## N-grams - Example

In [None]:
# Use positional indexing: the label 10 may have been dropped during preprocessing
TextBlob(df['selftext'].iloc[10]).ngrams(2)

- You can easily implement an n-gram function in native Python; it is a common NLP interview question.

In [None]:
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))
find_ngrams(input_list, 2)
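
- To illustrate how n-grams capture which word is likely to follow a given one, here is a minimal sketch that counts bigrams with `collections.Counter`. The sentence in `tokens` is made up for the example.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Count which words follow "the" in this toy text
followers = Counter(second for first, second in bigrams if first == 'the')
print(followers.most_common(1))
```

- In a real corpus the same counts, normalized into probabilities, form the basis of a simple bigram language model.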

## Word Cloud

In [None]:
from wordcloud import WordCloud

In [None]:
wc = WordCloud(background_color="white", max_words=2000)
# generate word cloud
wc.generate(' '.join(df['selftext']))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

- This wordcloud is generated using all the text data. However, it makes more sense to have a separate wordcloud for each individual subreddit.
- If you find any frequent word that doesn't contain useful information, you should consider adding it to your stopword list.
- You can find more examples in the [documentation](http://amueller.github.io/word_cloud/auto_examples/index.html) and in this [blog post](http://minimaxir.com/2016/05/wordclouds/).

In [None]:
# Create the figure first so that figsize applies to the word cloud plot
plt.figure(figsize=(4, 3))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
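
- One way to look for stopword candidates without reading them off the image is to count word frequencies directly. This is a sketch on made-up posts; `toy_posts` stands in for the cleaned `selftext` column.

```python
from collections import Counter

# Hypothetical cleaned posts, standing in for df['selftext']
toy_posts = ["like new phone like", "like battery died", "like phone life"]

# Count every whitespace-separated word across all posts
word_counts = Counter(word for post in toy_posts for word in post.split())

# The most frequent words are the first candidates to inspect
top_words = word_counts.most_common(2)
print(top_words)
```

- Here "like" dominates the counts, so it would be a natural candidate for the stopword list if it carries no useful information for the analysis.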

## Sentiment Analysis

- Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify affective states and subjective information.

- Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity of, or emotional reaction to, a document.

- Today we will just call the [sentiment analysis API](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis) from TextBlob and treat it as a black box, as we haven't talked about machine learning yet.

- We want to apply the function to the text column of the dataframe and generate two new columns called `polarity` and `subjectivity`. The process takes a long time, so we apply it only to a sample of the dataset.

In [None]:
sample_size = 10000

def sentiment_func(x):
    sentiment = TextBlob(x['selftext'])
    x['polarity'] = sentiment.polarity
    x['subjectivity'] = sentiment.subjectivity
    return x

sample = df.sample(sample_size).apply(sentiment_func, axis=1)

In [None]:
sample.head(10)

- You might then be interested in the relationship between polarity and the number of upvotes.

In [None]:
sample.plot.scatter('ups', 'polarity')
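
- To build some intuition for what a polarity score is, here is a toy lexicon-based scorer. It is an assumption-laden sketch and not TextBlob's actual implementation: `TOY_LEXICON` and `toy_polarity` are hypothetical, and the scores are invented. Like TextBlob's polarity, the result lies between -1 (negative) and 1 (positive).

```python
# Hypothetical word scores, invented for illustration
TOY_LEXICON = {'great': 1.0, 'good': 0.5, 'bad': -0.5, 'terrible': -1.0}

def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[word] for word in text.lower().split() if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("a great day"))
print(toy_polarity("great phone but terrible battery"))
```

- A mixed post averages out to a neutral score, which is one reason polarity alone can hide strong opinions.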

# Recommended Resources:

Other advanced Python libraries that we will cover in future lectures:

[spaCy](https://spacy.io/)

[Gensim](https://radimrehurek.com/gensim/)