# JumpStart: Natural Language Processing

**Objective:** A general introduction to basic NLP methods, intended as a foundation for further self-study and more advanced NLP curricula. It lays the groundwork for techniques such as text processing, information retrieval, and text classification.

- The Reddit dataset we will use today comes from Google BigQuery. You can find it [here](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2018_05?pli=1).
 - The data is public, but you need an active Google Cloud Platform account to access it.
- The original data is very large, so we sampled posts from the top 10 subreddits.

- We will also learn to use the following Python NLP packages along the way:

 - [NLTK](http://www.nltk.org/) - a very popular package for doing NLP in Python

 - [TextBlob](https://textblob.readthedocs.io/en/dev/) - similar to NLTK, but provides a higher-level API that is easier to use.

 - [WordCloud](https://github.com/amueller/word_cloud) - a package for generating word clouds in Python

## Prerequisite

- Open your **Terminal/Anaconda Prompt**, cd to the lecture code folder and run the following command:
 - `pip install -r requirements.txt`

- After installing all the required packages, run the following command:
 - `python -m textblob.download_corpora`
 
- Restart this Jupyter notebook.

In [2]:
import nltk

# Download the required corpora (only needed the first time you run the code)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ktread/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /Users/ktread/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

## Load Dataset

In [3]:
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/nycdsabt01/reddit_top10.csv')

- It is always a good idea to check the shape of the dataframe and column types before you run any type of operation.

In [4]:
df.shape

(737339, 11)

In [5]:
df.dtypes

created_utc      int64
subreddit       object
author          object
domain          object
url             object
num_comments     int64
score            int64
ups              int64
downs            int64
title           object
selftext        object
dtype: object

In [7]:
df.describe()

| | created_utc | num_comments | score | ups | downs |
|---|---|---|---|---|---|
| count | 737339.0 | 737339.0 | 737339.0 | 737339.0 | 737339.0 |
| mean | 1526486000.0 | 6.935 | 52.798143 | 52.798143 | 0.0 |
| std | 734279.5 | 151.224431 | 721.943659 | 721.943659 | 0.0 |
| min | 1525306000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1525723000.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 50% | 1526487000.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 75% | 1527045000.0 | 3.0 | 4.0 | 4.0 | 0.0 |
| max | 1527725000.0 | 46164.0 | 100389.0 | 100389.0 | 0.0 |


- When you first read in a dataset, I recommend using `df.sample()` rather than `df.head()`. The first few rows may look fine while missing values or mixed types hide further down a column, so a random sample gives a better picture of the whole dataset.

In [8]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
403317,1525489748,Ice_Poseidon,lilpump6969,self.Ice_Poseidon,https://www.reddit.com/r/Ice_Poseidon/comments...,2,13,13,0,CAN WE JUST LEAVE ALREADY IM TIRED OF WATCHING...,
612083,1527716193,FortNiteBR,X-IS-DEATH1,self.FortNiteBR,https://www.reddit.com/r/FortNiteBR/comments/8...,0,0,0,0,I managed to defy gravity with the shopping cart,I was just going up a hill and I started drivi...
535475,1525667785,FortNiteBR,BroncoBoy48,self.FortNiteBR,https://www.reddit.com/r/FortNiteBR/comments/8...,0,1,1,0,[Suggestion] Why don’t they add refunding?,[removed]
177433,1526239390,ACCIDENTAL_HAIKU_BOT,ACCIDENTAL_HAIKU_BOT,self.ACCIDENTAL_HAIKU_BOT,https://www.reddit.com/r/ACCIDENTAL_HAIKU_BOT/...,0,1,1,0,/u/DarkFERDY's accidental haiku in /r/BABYMETAL,How do I get to \n the thread I kinda s...
314754,1526459067,AskReddit,SANDRA254,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8j...,1,1,1,0,So what do you guys do when it comes to findin...,[removed]
210107,1527440634,RocketLeagueExchange,WhatIsHam,self.RocketLeagueExchange,https://www.reddit.com/r/RocketLeagueExchange/...,0,1,1,0,[PC] [H] Striker White Lonewolf [W] Pricecheck...,add me or comment :)
89940,1527099036,newsbotbot,-en-,mobile.twitter.com,https://mobile.twitter.com/FT/status/999351367...,0,1,1,0,@FT: Fed signals concern over trade tension ht...,
694450,1526870824,Showerthoughts,[deleted],self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/commen...,1,1,1,0,So we’re all just still cool with referring to...,[removed]
25073,1526721329,AskReddit,[deleted],self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8k...,0,1,1,0,"Instructors of Reddit, what is the worst case ...",[removed]
605025,1527012582,Ice_Poseidon,[deleted],self.Ice_Poseidon,https://www.reddit.com/r/Ice_Poseidon/comments...,3,1,1,0,Mexican Andy can’t have Brandon on stream if h...,[deleted]


- `selftext` is the raw text of each Reddit post. But take a look at the column: it contains missing values as well as the placeholders `[deleted]` and `[removed]`, none of which should be treated as valid text.
- We need to clean the text before we can analyze it further.

In [10]:
# Fill na with empty string
df['selftext'] = df['selftext'].fillna('')
# Replace the `[removed]` and `[deleted]` placeholders with an empty string
tbr = ['[removed]', '[deleted]']
df['selftext'] = df['selftext'].apply(lambda x: '' if x in tbr else x)

- After cleaning the data, about 88% of the values in the `selftext` column are empty strings.
- It therefore makes sense to concatenate each post's title with its text.

In [11]:
print(sum(df['selftext'] == '') / df.shape[0])

0.8806152936437649


In [12]:
df['selftext'] = df['title'] + ' ' + df['selftext']

In [14]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
145107,1526083203,Showerthoughts,HumanToes,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/commen...,0,10,10,0,Your parents probably remember remember your c...,Your parents probably remember remember your c...
182011,1526239388,AskReddit,CoolGuess,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8j...,4,0,0,0,"If Facebook launches its own crypto currency, ...","If Facebook launches its own crypto currency, ..."
613306,1527646507,ACCIDENTAL_HAIKU_BOT,ACCIDENTAL_HAIKU_BOT,self.ACCIDENTAL_HAIKU_BOT,https://www.reddit.com/r/ACCIDENTAL_HAIKU_BOT/...,0,1,1,0,/u/iobraska's accidental haiku in /r/Catholicism,/u/iobraska's accidental haiku in /r/Catholici...
628921,1527694974,The_Donald,KrakNup,i.magaimg.net,http://i.magaimg.net/img/3f03.png,3,129,129,0,She has family connections to Bill Ayers (Weat...,She has family connections to Bill Ayers (Weat...
537535,1525916069,FortNiteBR,BazookaBrawkler,self.FortNiteBR,https://www.reddit.com/r/FortNiteBR/comments/8...,7,1,1,0,Thanos fortnite,Thanos fortnite the new gamemode. fun but when...
438965,1526785959,RocketLeagueExchange,SkyReveal,self.RocketLeagueExchange,https://www.reddit.com/r/RocketLeagueExchange/...,4,4,4,0,☁️[Xbox][H] Black Veloce &amp; Party Time [W] ...,☁️[Xbox][H] Black Veloce &amp; Party Time [W] ...
411773,1525529734,AutoNewspaper,AutoNewspaperAdmin,reuters.com,https://www.reuters.com/article/us-australia-k...,0,1,1,0,[Offbeat] - Carrot-addicted kangaroos hopping ...,[Offbeat] - Carrot-addicted kangaroos hopping ...
447237,1526827070,AskReddit,[deleted],self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8k...,2,2,2,0,What's the best advice you've ever received fo...,What's the best advice you've ever received fo...
142574,1526114645,AskReddit,garrusnogarrus,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8i...,14,2,2,0,"Reddit, what tabs have you had open for a long...","Reddit, what tabs have you had open for a long..."
430534,1526782851,AskReddit,vulcan5301,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8k...,23,2,2,0,"What cheap and inexpensive gifts can a broke, ...","What cheap and inexpensive gifts can a broke, ..."


## Preprocessing

- Convert all the text to lowercase, which avoids having multiple copies of the same word.
- Remove URLs from the text.
- Collapse every run of whitespace into a single space.

In [20]:
import re

# Convert all strings to lowercase
df['selftext'] = df['selftext'].str.lower()
# Remove URLs: 'http' followed by any run of non-whitespace characters (\S*)
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'http\S*', '', x))
# Collapse runs of whitespace (\s+ matches spaces, \n, \r, \t) into one space
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'\s+', ' ', x))
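To see what these three steps do without loading the full dataframe, here is a minimal sketch that applies them to a single made-up string. The helper name `normalize_text` and the example URL are assumptions for illustration, and the final `strip()` is an extra step that the cell above does not perform.

```python
import re

def normalize_text(text):
    """Lowercase, strip URLs, and collapse whitespace in a single string."""
    text = text.lower()
    # Drop anything that starts with "http" up to the next whitespace character
    text = re.sub(r'http\S*', '', text)
    # Collapse runs of spaces, tabs, and newlines into one space
    text = re.sub(r'\s+', ' ', text)
    # Extra step (not in the notebook cell): trim leading/trailing spaces
    return text.strip()

example = "Check THIS out:\n\thttps://example.com/page  it is   great"
print(normalize_text(example))  # check this out: it is great
```

Note that the URL is removed before the whitespace is collapsed, so the gap it leaves behind is cleaned up in the same pass.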

- Let's take a look at the dataframe after preprocessing.

In [21]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
604207,1526955785,The_Donald,[deleted],foxnews.com,http://www.foxnews.com/politics/2018/05/21/hou...,3,11,11,0,House Republicans to call for second special c...,house republicans call second special counsel ...
667500,1526914549,AskReddit,PanicAtTheMetro,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8l...,16,6,6,0,What are you excited for in the future?,what excited future?
545981,1525949838,AskReddit,NRDL,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8i...,21,3,3,0,What's legitimately overrated in your experience?,what's legitimately overrated experience?
481297,1527364101,newsbotbot,-en-,mobile.twitter.com,https://mobile.twitter.com/BW/status/100046306...,0,1,1,0,@BW: Europe is ready to move on from Brexit ht...,@bw: europe ready move brexit
230012,1527439922,AskReddit,[deleted],self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8m...,2,0,0,0,Would you give your husband anal or anal oral ...,would give husband anal anal oral asked it? wh...
158158,1526088926,newsbotbot,-en-,twitter.com,https://twitter.com/APWestRegion/status/995108...,0,1,1,0,@AP: RT @APWestRegion: Parts of Hawaii's Big I...,@ap: rt @apwestregion: parts hawaii's big isla...
297924,1526469265,AutoNewspaper,AutoNewspaperAdmin,miamiherald.com,http://www.miamiherald.com/news/business/artic...,0,1,1,0,[World] - Bank of England official apologizes ...,[world] - bank england official apologizes sex...
517748,1525736710,Ice_Poseidon,maddogumadcunt,i.redd.it,https://i.redd.it/h29c3fx1tiw01.jpg,3,7,7,0,&lt;---- This many people think Scam Pepper sh...,&lt;---- this many people think scam pepper ba...
411217,1525516941,AutoNewspaper,AutoNewspaperAdmin,nytimes.com,https://www.nytimes.com/2018/05/05/style/steph...,0,1,1,0,[Lifestyle] - The Man Who Bought New York | NY...,[lifestyle] - the man who bought new york | ny...
299559,1526485582,AutoNewspaper,AutoNewspaperAdmin,rt.com,https://www.rt.com/usa/426907-trump-tower-meet...,0,1,1,0,[US] - Trump Tower meeting: Senate panel relea...,[us] - trump tower meeting: senate panel relea...


## Text Processing Steps and Methods

- Before we start using machine learning methods on our text, there are some steps that we first want to perform so that our text is in a format that our model can interpret.
- These steps include:
 - Filtering
 - Tokenization
 - Stemming
 - Lemmatization

## Filtering

- The first step is to remove punctuation, since it carries little information for the word-level analysis in this notebook. Removing every instance of it also reduces the size of the training data.

In [22]:
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
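The pattern `[^\w\s]` has two side effects worth knowing about, which the following sketch shows on a made-up string. The helper name `strip_punctuation` is hypothetical; the regex is the same one used in the cell above.

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text)

print(strip_punctuation("you're late, me_irl!"))  # youre late me_irl
```

Underscores survive because `\w` includes them, which is why tokens such as `me_irl` remain intact. Apostrophes are removed, so a contraction like "you're" becomes "youre" and will no longer match a stopword entry spelled "you're" if stopwords are filtered afterwards.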

- When examining a text, there are often words in a sentence that hold no meaning for data mining operations such as topic modeling or word frequency analysis.
    - Examples include "the", "is", etc. Collectively, these are known as "stopwords".
- When mining for certain information, consider whether your method should remove stopwords (for word clouds, for example, it should). To illustrate, we will use the stopwords corpus from nltk.
- Note that the text resources themselves are usually found under nltk.corpus. "Corpus" is the linguistics term for a set of structured texts used for statistical study, so be mindful of this vocabulary.
- The stopword list from nltk is just a Python list, so you can easily append more stopwords to it. For example, "computer" could be a stopword in a corpus that deals largely with data science.
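As a rough sketch of the idea, the snippet below filters a sentence against a stopword collection and then extends that collection with a domain-specific word. The tiny `base_stopwords` set is a hand-written stand-in for the NLTK list (an assumption made so that the example runs without any downloads), and `remove_stopwords` is a hypothetical helper name.

```python
# A tiny hand-written stopword set, standing in for the NLTK list (assumption)
base_stopwords = {"the", "is", "a", "of", "in", "and", "to"}

def remove_stopwords(text, stopwords):
    # Keep only the tokens that are not in the stopword collection
    return " ".join(word for word in text.split() if word not in stopwords)

sentence = "the computer is a useful tool in data science"
print(remove_stopwords(sentence, base_stopwords))
# computer useful tool data science

# Extend the set with a domain-specific stopword
custom_stopwords = base_stopwords | {"computer"}
print(remove_stopwords(sentence, custom_stopwords))
# useful tool data science
```

With the NLTK list, the same extension would be a plain `append` or `extend` on the Python list.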

In [23]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [24]:
stop_set = set(stop)  # a set makes each membership test O(1)
df['selftext'] = df['selftext'].apply(lambda text: " ".join(word for word in text.split() if word not in stop_set))
print(df[['selftext']])

                                                 selftext
0                                                   meirl
1            me_irl royal wedding going im browsing memes
2                                                     irl
3                                  irl haha lol post text
4       whats go activity airport youre plane delayed ...
5       anyone else stoked tonights battle bots whos f...
6                                     reddit country suck
7       redditors mental disabilities motivation go li...
8       whats scariest paranormal experience ever scep...
9       show called mildly interesting people would it...
10      african americans reddit irritating funny stup...
11      something common knowledge took embarrassingly...
12      jobs version classic haha item scan guess must...
13                                    think royals better
14                         people reddit cynical bastards
15      daily cosmetic sales 19 may weekly cosmetics a...
16      idea c

## Tokenization 

- Tokenization is the act of splitting text into a sequence of words. In this example, we will try a simplistic tokenization method below using the standard split.

In [27]:
sample_text = "This is a toy example. Illustrate this example below."
sample_tokens = sample_text.split()
print(sample_tokens)

['This', 'is', 'a', 'toy', 'example.', 'Illustrate', 'this', 'example', 'below.']


- Did you notice something? Although we have the tokens, "example" and "example." are treated as different tokens. As an NLP practitioner, you must decide whether to distinguish the two.

- Note that various Python packages, such as nltk, by default tokenize "." as a separate token so that it keeps its own meaning. This is illustrated below:

In [29]:
import nltk
# Newer NLTK releases may need nltk.download('punkt_tab') instead
nltk.download('punkt')
from nltk.tokenize import word_tokenize
word_tokenize(sample_text)

[nltk_data] Downloading package punkt to /Users/ktread/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This',
 'is',
 'a',
 'toy',
 'example',
 '.',
 'Illustrate',
 'this',
 'example',
 'below',
 '.']

- In contrast, TextBlob's `words` property drops punctuation, so "." does not appear as a token.

In [35]:
from textblob import TextBlob
TextBlob(sample_text).words

WordList(['This', 'is', 'a', 'toy', 'example', 'Illustrate', 'this', 'example', 'below'])
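The two behaviours seen in this section, keeping "." as its own token or dropping it, can be approximated with a regular expression. This is only a sketch: `simple_tokenize` is a hypothetical helper, and it is not how NLTK or TextBlob tokenize internally (contractions such as "don't", for instance, are split differently).

```python
import re

def simple_tokenize(text, keep_punctuation=True):
    if keep_punctuation:
        # A run of word characters, or any single non-space, non-word character
        return re.findall(r"\w+|[^\w\s]", text)
    # Word characters only: punctuation is dropped
    return re.findall(r"\w+", text)

sample_text = "This is a toy example. Illustrate this example below."
print(simple_tokenize(sample_text))
print(simple_tokenize(sample_text, keep_punctuation=False))
```

In both modes "example" and "example." now map to the same word token, which is usually what you want for word counts.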

## Stemming and Lemmatization

- Many English words are inflected forms of the same root. There are two main methods for tasks such as recognizing "strike, striking, struck" as the same word.

- Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach.

- The most common stemming algorithms are:
 - [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) (the older traditional method)
 - [Lancaster Stemmer](http://textanalysisonline.com/nltk-lancaster-stemmer) (a more aggressive modern stemmer)

- Both stemming and lemmatization can be done with hand-written rules using creative regexes, but for a practical demo in this notebook we will apply the PorterStemmer from nltk to the example below.

In [38]:
nonprocess_text = "I am writing a Python string"

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [39]:
stemmed_text = ' '.join([stemmer.stem(word) for word in nonprocess_text.split()])
print(stemmed_text)

I am write a python string
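For comparison, here is what a single naive regex rule does to the same sentence. The rule (replace a trailing "ing" with "e") and the helper name `naive_stem` are hypothetical, chosen only to show how a hand-written rule can overreach.

```python
import re

def naive_stem(word):
    # Single hypothetical rule: replace a trailing "ing" with "e"
    return re.sub(r'ing$', 'e', word)

text = "I am writing a Python string"
print(' '.join(naive_stem(word) for word in text.split()))
# I am write a Python stre
```

The rule fixes "writing" but also mangles "string", because it has no notion of which words actually carry the suffix.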


- Note: The Porter stemmer is more robust than a naive regex rule. Here "writing" is converted to "write", but "string" isn't converted to "stre".

- Unlike stemming, lemmatization tries to identify the root word (lemma) of each token based on a dictionary corpus. In essence, you could replicate the effect manually by implementing a look-up method after parsing a text. Because it maps words to real dictionary entries rather than chopping off suffixes, we usually prefer lemmatization over stemming.

- There are various dictionaries on which to base lemmatization. [WordNet](http://wordnet.princeton.edu/), which NLTK provides access to, is powerful enough to handle most lemmatization tasks. We'll examine a few examples below.

In [None]:
from nltk import WordNetLemmatizer

lemztr = WordNetLemmatizer()

In [None]:
lemztr.lemmatize('feet')

- Note that the lemmatizer returns the input string unchanged if the word isn't found in the dictionary.

In [None]:
lemztr.lemmatize('abacadabradoo')
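The look-up view of lemmatization described above can be sketched with a plain dictionary. The four entries in `LEMMA_TABLE` are made up for illustration and bear no relation to how WordNet stores its data; the point is only the fallback behaviour for unknown words.

```python
# Hypothetical miniature dictionary mapping inflected forms to lemmas
LEMMA_TABLE = {
    "feet": "foot",
    "geese": "goose",
    "struck": "strike",
    "striking": "strike",
}

def lookup_lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary
    return LEMMA_TABLE.get(word.lower(), word)

print([lookup_lemmatize(w) for w in ["feet", "struck", "abacadabradoo"]])
# ['foot', 'strike', 'abacadabradoo']
```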

## N-grams

- N-grams are contiguous sequences of N words (or characters) taken from a text. N-grams with N=1 are called unigrams; bigrams (N=2), trigrams (N=3), and so on can also be used.

- Unigrams do not usually contain as much information as compared to bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. 

- The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

- Google hosts its n-gram corpora on [AWS S3](https://aws.amazon.com/datasets/google-books-ngrams/) for free. 
- The size of the files is about 2.2TB. You might consider using this [Python downloader](https://github.com/dimazest/google-ngram-downloader).

## N-grams - Example

In [None]:
TextBlob(df['selftext'][5]).ngrams(2)

- You can easily implement an n-gram function in native Python. It is a common NLP interview question.

In [None]:
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))
find_ngrams(input_list, 3)
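Bigrams also give a simple way to see the "which word is likely to follow" idea mentioned earlier. The sketch below counts, for each word, the words that come right after it; the helper name `next_word_counts` and the toy sentence are assumptions for illustration.

```python
from collections import Counter, defaultdict

def next_word_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    # zip(tokens, tokens[1:]) yields the bigrams of the token list
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

tokens = "the cat sat on the mat and the cat slept".split()
counts = next_word_counts(tokens)
print(counts["the"].most_common(1))  # [('cat', 2)]
```

On a real corpus, these counts are the raw material for a basic next-word predictor.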

## Word Cloud

In [None]:
from wordcloud import WordCloud

In [None]:
wc = WordCloud(background_color="white", max_words=2000, width=800, height=400)
# generate word cloud
wc.generate(' '.join(df['selftext']))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

- This wordcloud is generated using all the text data. However, it makes more sense to have a separate wordcloud for each individual subreddit.
- If you find any frequent word that doesn't contain useful information, you should consider adding it to your stopword list.
- You can find more examples in the [documentation](http://amueller.github.io/word_cloud/auto_examples/index.html) and in this [blog post](http://minimaxir.com/2016/05/wordclouds/).

In [None]:
# show
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
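One possible starting point for per-subreddit word clouds is to compute word frequencies per group first. The sketch below does this in plain Python on made-up rows; `frequencies_by_group` is a hypothetical helper, and in the notebook the pairs would come from the `subreddit` and `selftext` columns.

```python
from collections import Counter, defaultdict

def frequencies_by_group(rows):
    """rows: iterable of (group, text) pairs -> {group: Counter of words}."""
    grouped = defaultdict(Counter)
    for group, text in rows:
        grouped[group].update(text.split())
    return grouped

# Made-up rows for illustration only
rows = [
    ("AskReddit", "best advice ever received"),
    ("AskReddit", "best cheap gifts"),
    ("FortNiteBR", "thanos fortnite new gamemode"),
]
freqs = frequencies_by_group(rows)
print(freqs["AskReddit"].most_common(1))  # [('best', 2)]
```

Each per-group counter is a word-to-frequency mapping, so it should be usable with `WordCloud.generate_from_frequencies` to draw one cloud per subreddit.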

## Sentiment Analysis

- Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify emotional states and subjective information.

- Using sentiment analysis, we can gain information about the attitude of the speaker or writer of text with respect to the topic. 

- Today we will just call the [sentiment analysis API](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis) from TextBlob and treat it as a black box, as we haven't talked about machine learning yet.

- We want to apply the function to the text column of the dataframe and generate two new columns called polarity and subjectivity. The process takes a long time, so we will apply it to a sample of the dataset.
    - Polarity is a float in the range [-1.0, 1.0], from most negative to most positive.
    - Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

- Let's use sentiment analysis to examine the relationship between polarity and the number of upvotes.
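To build some intuition for what a polarity score can mean, here is a toy lexicon-based scorer. It is emphatically not TextBlob's implementation: the four word scores in `TOY_LEXICON` are invented, and `toy_polarity` is a hypothetical helper that simply averages the scores of the words it recognizes.

```python
# Hypothetical word scores in [-1, 1]; a real lexicon is far larger
TOY_LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def toy_polarity(text):
    # Average the scores of the words found in the lexicon
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    # Texts with no known words are treated as neutral
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("a great game with a bad ending"))
print(toy_polarity("nothing to score here"))
```

The first sentence mixes one positive and one negative word and lands slightly above zero; the second contains no scored words and comes out neutral.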

In [None]:
# keep only posts that have more than 100 upvotes
sa_df = df.loc[df.ups > 100]

In [None]:
sample_size = 10000

def sentiment_func(x):
    sentiment = TextBlob(x['selftext'])
    x['polarity'] = sentiment.polarity
    x['subjectivity'] = sentiment.subjectivity
    return x

sample = sa_df.sample(sample_size).apply(sentiment_func, axis=1)

In [None]:
sample.plot.scatter('ups', 'polarity')

## Recommended Resources

Other, more advanced Python libraries that we will cover in future lectures:

[spaCy](https://spacy.io/)

[Gensim](https://radimrehurek.com/gensim/)