# JumpStart: Natural Language Processing

**Objective:** A general introduction to basic NLP methods, intended as a foundation for further self-study and more advanced NLP courses. This notebook lays the groundwork for techniques such as text processing, information retrieval, and text classification.

- The Reddit dataset we will use today comes from Google BigQuery. You can find it [here](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2018_05?pli=1).
 - The data is public, but you need an active Google Cloud Platform account to access it.
- The original dataset is huge, so we sampled posts from the top 10 subreddits.

- We will also learn the following Python NLP packages along the way:

 - [NLTK](http://www.nltk.org/) - a very popular package for doing NLP in Python

 - [TextBlob](https://textblob.readthedocs.io/en/dev/) - similar to NLTK but provides a higher-level, easier-to-use API

 - [WordCloud](https://github.com/amueller/word_cloud) - a package for generating word clouds in Python

## Prerequisite

- Open your **Terminal/Anaconda Prompt**, cd to the lecture code folder and run the following command:
 - `pip install -r requirements.txt`

- After installing all the required packages, run the following command:
 - `python -m textblob.download_corpora`
 
- Restart this Jupyter notebook.

In [1]:
import nltk

# Uncomment the following lines the first time you run the code
#nltk.download('stopwords')
#nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/neerajsomani/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/neerajsomani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load Dataset

In [2]:
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/nycdsabt01/reddit_top10.csv')

- It is always a good idea to check the shape of the dataframe and column types before you run any type of operation.

In [3]:
df.shape

(737339, 11)

In [4]:
df.dtypes

created_utc      int64
subreddit       object
author          object
domain          object
url             object
num_comments     int64
score            int64
ups              int64
downs            int64
title           object
selftext        object
dtype: object

- When you first read in a dataset, I recommend using `df.sample()` rather than `df.head()`. The first few rows may look fine while missing values or mixed types hide further down a column, so a random sample gives a better picture of the whole dataset.

In [5]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
718669,1525321842,AutoNewspaper,AutoNewspaperAdmin,miamiherald.com,http://www.miamiherald.com/news/business/artic...,0,1,1,0,[Business] - NY state starts 3rd round of dron...,
298185,1526472466,AutoNewspaper,AutoNewspaperAdmin,theguardian.com,https://www.theguardian.com/environment/2018/m...,0,2,2,0,[Environment] - One man's race to capture the ...,
133116,1525471275,RocketLeagueExchange,[deleted],self.RocketLeagueExchange,https://www.reddit.com/r/RocketLeagueExchange/...,4,1,1,0,[Xbox][H]nc import[W]offers,[deleted]
710626,1525349528,The_Donald,AdolphEinstien,zerohedge.com,https://www.zerohedge.com/news/2018-05-02/hill...,13,153,153,0,Hillary Clinton Now Blaming Socialist Democrat...,
468816,1527306636,Ice_Poseidon,MagnoliaKing,self.Ice_Poseidon,https://www.reddit.com/r/Ice_Poseidon/comments...,1,5,5,0,Asian Andy trying to appeal to his 10 yr old f...,
93532,1527104168,Showerthoughts,[deleted],self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/commen...,3,0,0,0,Raccoons are like a cross between a dog and a ...,[deleted]
429701,1526852189,ACCIDENTAL_HAIKU_BOT,ACCIDENTAL_HAIKU_BOT,self.ACCIDENTAL_HAIKU_BOT,https://www.reddit.com/r/ACCIDENTAL_HAIKU_BOT/...,0,1,1,0,/u/JunglePygmy's accidental haiku in /r/advent...,Can we PLEASE get a \n gigantic gorgeou...
219413,1527451868,newsbotbot,-en-,mobile.twitter.com,https://mobile.twitter.com/CBSNews/status/1000...,0,1,1,0,"@CBSNews: ""I'm just overwhelmed with gratitude...",
329256,1527520452,FortNiteBR,iOnlyReadPussy,v.redd.it,https://v.redd.it/n4d9p1ot4m011,26,20,20,0,"Yall underestimating us, mobile players too much",
556179,1525958693,Ice_Poseidon,PrlvateClient,gyazo.com,https://gyazo.com/f34e28e93a66b351b27b8aea04c4...,0,1,1,0,TTD,


- `selftext` is the raw text of each Reddit post, but take a look at the column: it contains missing values as well as the placeholders `[deleted]` and `[removed]`, none of which should be treated as valid text.
- We need to clean the text before we can further analyze it.

In [6]:
# Fill na with empty string
df['selftext'] = df['selftext'].fillna('')
# Replace `removed` and `deleted` with empty string
tbr = ['[removed]', '[deleted]']
df['selftext'] = df['selftext'].apply(lambda x: '' if x in tbr else x)

- After cleaning, about 88% of the values in the `selftext` column are empty strings.
- Since most posts carry their content in the title alone, it makes sense to concatenate each post's title with its text.

In [7]:
print(sum(df['selftext'] == '') / df.shape[0])

0.8806152936437649


In [8]:
df['selftext'] = df['title'] + ' ' + df['selftext']

In [9]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
666144,1526876429,AskReddit,ThinkerPlus,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8k...,14,1,1,0,How do you turn a Winnebago into a Sex Winnebago?,How do you turn a Winnebago into a Sex Winneba...
221554,1527432621,Ice_Poseidon,BeardedDelight,self.Ice_Poseidon,https://www.reddit.com/r/Ice_Poseidon/comments...,1,2,2,0,Look at this 8AM content for the EU fags God b...,Look at this 8AM content for the EU fags God b...
70758,1527102344,AskReddit,Desmoire,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8l...,4,2,2,0,What is the pettiest thing you have done (or h...,What is the pettiest thing you have done (or h...
99974,1527119675,AskReddit,DEWFOUR,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8l...,1,1,1,0,"What food, unique to your country, should food...","What food, unique to your country, should food..."
214828,1527452366,AskReddit,zaphodsheads,self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8m...,0,1,1,0,"Fakers of reddit, what happened when the docto...","Fakers of reddit, what happened when the docto..."
266370,1526407089,AutoNewspaper,AutoNewspaperAdmin,csmonitor.com,https://www.csmonitor.com/USA/Politics/2018/05...,0,1,1,0,[National] - Legislators work to shift culture...,[National] - Legislators work to shift culture...
194395,1526248395,AutoNewspaper,AutoNewspaperAdmin,tampabay.com,http://www.tampabay.com/news/business/Trump-pl...,0,1,1,0,[Local] - Trump pledges to help Chinese phone ...,[Local] - Trump pledges to help Chinese phone ...
13317,1526728874,Ice_Poseidon,zorroisreal,imgur.com,https://imgur.com/vXTtGrB,3,176,176,0,Petition for Brother Harry to be on the Next R...,Petition for Brother Harry to be on the Next R...
542303,1525912204,RocketLeagueExchange,pcpresident2016,self.RocketLeagueExchange,https://www.reddit.com/r/RocketLeagueExchange/...,5,0,0,0,[xbox] [H] Triumph crates [W] unpainted exotic...,[xbox] [H] Triumph crates [W] unpainted exotic...
393548,1525542530,me_irl,MeWhoBelievesInYou,i.redd.it,https://i.redd.it/0ak0rt5or2w01.jpg,0,6,6,0,me_irl,me_irl


## Preprocessing

- Convert all the text to lowercase so that the same word in different cases (e.g. "Reddit" and "reddit") is not counted as separate words.
- Remove URLs from the text.
- Collapse each run of whitespace (spaces, tabs, newlines) into a single space.

In [10]:
import re

# Convert all strings to lowercase
df['selftext'] = df['selftext'].str.lower()
# Remove URLs: "http" followed by \S* (zero or more non-whitespace characters)
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'http\S*', '', x))
# \s+ matches one or more whitespace characters (space, \n, \r, \t)
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'\s+', ' ', x))
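
To see what these three steps do without running them on the full dataframe, here is a minimal sketch that applies them to a single string. The helper name `preprocess` and the example text are made up for illustration; the regular expressions are the same ones used in the cell above.

```python
import re

def preprocess(text):
    """Lowercase, drop URLs, and collapse whitespace (the three steps listed above)."""
    text = text.lower()
    # Remove anything that starts with "http" up to the next whitespace
    text = re.sub(r'http\S*', '', text)
    # Collapse runs of spaces, tabs, and newlines into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Hypothetical post text containing mixed case, a newline, a URL, and extra spaces
example = "Check THIS out:\nhttps://example.com/post?id=1   it's great"
print(preprocess(example))  # check this out: it's great
```

Note that the URL is removed before whitespace is collapsed, so the gap it leaves behind is cleaned up by the last step.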

- Let's take a look at the dataframe after preprocessing.

In [11]:
df.sample(10)

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext
628114,1527647133,The_Donald,macredsmile,youtube.com,https://www.youtube.com/watch?v=mD6Iqxegug4,5,212,212,0,VICTORY: Tommy Robinson Reporting Restrictions...,victory: tommy robinson reporting restrictions...
724680,1525384876,AutoNewspaper,AutoNewspaperAdmin,ktbs.com,https://www.ktbs.com/news/arklatex-indepth/ark...,0,1,1,0,[National] - ArkLaTex real estate sales buckin...,[national] - arklatex real estate sales buckin...
476371,1527306533,AutoNewspaper,AutoNewspaperAdmin,nbcnews.com,https://www.nbcnews.com/news/us-news/jury-reco...,0,1,1,0,[Business] - Jury recommends $25M in Johnson &...,[business] - jury recommends $25m in johnson &...
713934,1525372693,newsbotbot,-en-,twitter.com,https://twitter.com/AFP/status/992109284447010817,0,1,1,0,@AFP: #WorldPressFreedomDay Journalists march ...,@afp: #worldpressfreedomday journalists march ...
196870,1526192219,FortNiteBR,Insonarc,v.redd.it,https://v.redd.it/r545qoq9fkx01,45,945,945,0,"""No Skins"" are the nicest skins","""no skins"" are the nicest skins"
551615,1525981136,The_Donald,Ta_Ta_Toothie,twitter.com,https://twitter.com/funder/status/994637881753...,1,1,1,0,BREAKING: Nunes and staff now under investigat...,breaking: nunes and staff now under investigat...
99823,1527115407,AskReddit,[deleted],self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8l...,1,1,1,0,"Kids these days, what new music should I be li...","kids these days, what new music should i be li..."
648944,1527717254,AskReddit,[deleted],self.AskReddit,https://www.reddit.com/r/AskReddit/comments/8n...,6,1,1,0,If you were forced to cut off one part of your...,if you were forced to cut off one part of your...
17575,1526759295,AutoNewspaper,AutoNewspaperAdmin,france24.com,http://www.france24.com/en/20180519-halep-down...,0,1,1,0,[World] - Halep downs Sharapova to set up Rome...,[world] - halep downs sharapova to set up rome...
651875,1527656072,Ice_Poseidon,[deleted],self.Ice_Poseidon,https://www.reddit.com/r/Ice_Poseidon/comments...,0,1,1,0,Walking up to a drug deal LOL,walking up to a drug deal lol


## Text Processing Steps and Methods

- Before we start using machine learning methods on our text, there are some steps that we first want to perform so that our text is in a format that our model can interpret.
- These steps include:
 - Filtering
 - Tokenization
 - Stemming
 - Lemmatization

## Filtering

- The first step is to remove punctuation. For the word-level analyses in this notebook it adds little information, so removing every instance of it helps us reduce the size of the training data.

In [12]:
df['selftext'] = df['selftext'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

- When examining a text, there are often words in a sentence that carry little meaning for data mining operations such as topic modeling or word frequency. 
    - Examples include "the", "is", etc. Collectively, these are known as "stopwords". 
- When mining for certain information, consider whether your method benefits from removing stopwords (word clouds, for example, usually do). To illustrate, we will use the stopwords corpus from nltk. 
- Note that nltk's bundled text resources are found under `nltk.corpus`. "Corpus" is the linguistics term for a structured set of texts used for statistical study, so be mindful of this vocabulary.
- The stopwords from nltk are returned as a plain Python list, so you can easily append more words to it. For example, "computer" could be treated as a stopword in a corpus that deals largely with data science.

In [13]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [14]:
stop_set = set(stop)  # set membership tests are much faster than scanning a list
df['selftext'] = df['selftext'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_set))
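
The sketch below illustrates two points from this section on a toy scale: extending the stopword list with a domain-specific word, and how the order of the filtering steps matters. The short `stop` list here is a hypothetical stand-in for `stopwords.words('english')`, and `remove_stopwords` is a helper name invented for this example.

```python
import re

# Tiny stand-in for stopwords.words('english'), for illustration only
stop = ['i', 'the', 'is', 'a', "you're", 'this']

# The list is plain Python, so domain-specific words can simply be appended
stop.extend(['computer'])

# Convert to a set once so each membership test is fast
stop_set = set(stop)

def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word not in stop_set)

text = "you're saying this computer is the best"
print(remove_stopwords(text))       # saying best

# Stripping punctuation first turns "you're" into "youre", which no longer matches
no_punct = re.sub(r'[^\w\s]', '', text)
print(remove_stopwords(no_punct))   # youre saying best
```

Because this notebook removes punctuation before stopwords, contractions in the stopword list will not match. If that matters for your analysis, one option is to remove stopwords first, or to strip punctuation from the stopword list as well.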

## Tokenization 

- Tokenization is the act of splitting text into a sequence of words. In this example, we will try a simplistic tokenization method below using the standard split.

In [15]:
sample_text = "This is a toy example. Illustrate this example below."
sample_tokens = sample_text.split()
print(sample_tokens)

['This', 'is', 'a', 'toy', 'example.', 'Illustrate', 'this', 'example', 'below.']


- Did you notice something? While we have the tokens, "example" and "example." are treated as different tokens. As an NLP data scientist, you must decide whether to distinguish the two.

- Note that some Python packages, such as nltk, tokenize "." as a separate token by default, giving it its own meaning. This is illustrated below:

In [16]:
from nltk.tokenize import word_tokenize 
word_tokenize(sample_text)

['This',
 'is',
 'a',
 'toy',
 'example',
 '.',
 'Illustrate',
 'this',
 'example',
 'below',
 '.']

- TextBlob's `words` property, however, drops punctuation such as "." altogether.

In [17]:
from textblob import TextBlob
TextBlob(sample_text).words

WordList(['This', 'is', 'a', 'toy', 'example', 'Illustrate', 'this', 'example', 'below'])
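
For comparison, both behaviours can be approximated with the standard library alone. This is only a rough sketch using two simple regular expressions, not how nltk or TextBlob tokenize internally.

```python
import re

sample_text = "This is a toy example. Illustrate this example below."

# Keep each punctuation mark as its own token
with_punct = re.findall(r"\w+|[^\w\s]", sample_text)
# Keep only runs of word characters, dropping punctuation entirely
words_only = re.findall(r"\w+", sample_text)

print(with_punct)
print(words_only)
```

On this sentence the first pattern yields 11 tokens (including the two periods) and the second yields 9, and in both cases "example" appears as a single token type.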

## Stemming and Lemmatization

- Many English words are inflected forms of the same root. There are two main methods for recognizing forms such as "strike, striking, struck" as the same word: stemming and lemmatization.

- Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach.

- The most common stemming algorithms are:
 - [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) (the older traditional method)
 - [Lancaster Stemmer](http://textanalysisonline.com/nltk-lancaster-stemmer) (a more aggressive modern stemmer)

- Stemming and lemmatization can both be done with hand-written rules using creative regex, but for a practical demo in this notebook we will apply the `PorterStemmer` from nltk to the example below.

In [None]:
nonprocess_text = "I am writing a Python string"

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmed_text = ' '.join([stemmer.stem(word) for word in nonprocess_text.split()])
print(stemmed_text)

- Note: this is more robust than a naive regex rule. Here "writing" is converted to "write", while "string" is left intact rather than being mangled into "str" or "stre".
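
To make the contrast concrete, here is a deliberately naive regex stemmer. The function name `naive_stem` and its single rule are invented for this illustration. It strips a trailing "ing", "ly", or "s" with no further checks, whereas the Porter algorithm only removes "ing" when the remaining stem still contains a vowel.

```python
import re

def naive_stem(word):
    # Strip one trailing suffix without checking what is left behind
    return re.sub(r'(ing|ly|s)$', '', word)

for word in ["writing", "string", "quickly", "bus"]:
    print(word, '->', naive_stem(word))
# writing -> writ
# string -> str
# quickly -> quick
# bus -> bu
```

The rule happens to work for "quickly" but damages "string" and "bus", which is why practical stemmers add conditions on the stem that remains.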

- Unlike stemming, lemmatization maps each word to its dictionary form (its lemma) by consulting a dictionary corpus. In essence, you could replicate the effect manually with a look-up after parsing a text. Because it returns dictionary words rather than truncated stems, we usually prefer lemmatization over stemming.

- There are various dictionaries that lemmatization can be based on. NLTK's [wordnet](http://wordnet.princeton.edu/) is powerful enough to handle most lemmatization tasks. We'll examine a few examples below.

In [None]:
from nltk.stem import WordNetLemmatizer

lemztr = WordNetLemmatizer()

In [None]:
lemztr.lemmatize('feet')

- Note that the lemmatizer returns the input string unchanged if the word isn't found in the dictionary.

In [None]:
lemztr.lemmatize('abacadabradoo')
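
The look-up idea described above can be sketched in a few lines. The dictionary `LEMMA_LOOKUP` and the function `lookup_lemmatize` are hypothetical and hold only a handful of entries; a real lemmatizer relies on a full resource such as WordNet.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma
LEMMA_LOOKUP = {
    'feet': 'foot',
    'striking': 'strike',
    'struck': 'strike',
    'mice': 'mouse',
}

def lookup_lemmatize(word):
    # Fall back to the original word when it is not in the dictionary
    return LEMMA_LOOKUP.get(word.lower(), word)

print(lookup_lemmatize('feet'))           # foot
print(lookup_lemmatize('struck'))         # strike
print(lookup_lemmatize('abacadabradoo'))  # abacadabradoo
```

When using the real `WordNetLemmatizer`, keep in mind that `lemmatize` also accepts a `pos` argument and treats words as nouns by default, so verb forms generally need `pos='v'` to be reduced.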

## N-grams

- N-grams are contiguous sequences of N words (or characters) taken from a text. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

- Unigrams do not usually contain as much information as compared to bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. 

- The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

- Google hosts its n-gram corpora on [AWS S3](https://aws.amazon.com/datasets/google-books-ngrams/) for free. 
- The dataset is about 2.2TB in size, so rather than downloading it whole you might consider using a [Python API](https://github.com/dimazest/google-ngram-downloader).

## N-grams - Example

In [None]:
TextBlob(df['selftext'][5]).ngrams(2)

- You can easily implement an n-gram function in plain Python - it is a common NLP interview question.

In [None]:
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

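def find_ngrams(input_list, n):
    # Zip the list with copies of itself shifted by 0..n-1 positions;
    # zip stops at the shortest copy, so every tuple has exactly n items
    return list(zip(*[input_list[i:] for i in range(n)]))

find_ngrams(input_list, 3)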
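
As a small illustration of how n-grams capture which word tends to follow another, the sketch below counts bigrams in a made-up sentence with `collections.Counter`. It repeats the `find_ngrams` helper so that it runs on its own; the sentence and variable names are invented for this example.

```python
from collections import Counter

def find_ngrams(tokens, n):
    return list(zip(*[tokens[i:] for i in range(n)]))

# Toy sentence, already lowercased and tokenized by whitespace
tokens = "the cat sat on the mat and the cat slept".split()

# Count how often each bigram occurs
bigram_counts = Counter(find_ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]

# Count which words follow "the"
followers = Counter(nxt for prev, nxt in find_ngrams(tokens, 2) if prev == 'the')
print(followers.most_common())       # [('cat', 2), ('mat', 1)]
```

Counts like these are the starting point for simple language models: after "the", this toy corpus suggests "cat" is twice as likely as "mat".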

## Word Cloud

In [None]:
from wordcloud import WordCloud

In [None]:
wc = WordCloud(background_color="white", max_words=2000, width=800, height=400)
# generate word cloud
wc.generate(' '.join(df['selftext']))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

- This wordcloud is generated using all the text data. However, it makes more sense to have a separate wordcloud for each individual subreddit.
- If you find any frequent word that doesn't contain useful information, you should consider adding it to your stopword list.
- You can find more examples in the [documentation](http://amueller.github.io/word_cloud/auto_examples/index.html) and in this [blog post](http://minimaxir.com/2016/05/wordclouds/).

In [None]:
# show
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
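
The per-subreddit idea mentioned above starts with grouping the text by subreddit and counting words within each group. The sketch below does this with the standard library only, on a few made-up rows standing in for `df[['subreddit', 'selftext']]`; the variable names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (subreddit, cleaned text) pairs standing in for the dataframe rows
posts = [
    ('AskReddit', 'best food country'),
    ('AskReddit', 'best music kids'),
    ('FortNiteBR', 'skins nicest skins'),
]

# One word-frequency Counter per subreddit
freq_by_subreddit = defaultdict(Counter)
for subreddit, text in posts:
    freq_by_subreddit[subreddit].update(text.split())

for subreddit, counts in freq_by_subreddit.items():
    print(subreddit, counts.most_common(2))
    # With the wordcloud package installed, each Counter could be drawn with
    # WordCloud(background_color="white").generate_from_frequencies(counts)
```

Looking at the top counts per group is also a quick way to spot uninformative frequent words that are worth adding to the stopword list.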

## Sentiment Analysis

- Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify emotional states and subjective information.

- Using sentiment analysis, we can gain information about the attitude of the speaker or writer of text with respect to the topic. 

- Today we will just call the [sentiment analysis API](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis) from TextBlob and treat it as a black box, as we haven't talked about machine learning yet.

- We want to apply the function to the text column of the dataframe and generate two new columns called polarity and subjectivity. The process takes a long time, so we will apply it to a sample of the dataset.
    - Polarity measures how negative or positive the text is, on a scale from -1.0 to 1.0.
    - Subjectivity measures how subjective the text is, from 0.0 (very objective) to 1.0 (very subjective).

- Let's use sentiment analysis to analyze the relationship between polarity and number of upvotes.
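
To give some intuition for what a polarity score is, here is a toy lexicon-based scorer. It is not TextBlob's implementation, which is considerably more sophisticated; the word scores in `TOY_LEXICON` and the function `toy_polarity` are invented for this sketch.

```python
# Hypothetical word scores in [-1, 1]; real sentiment lexicons are far larger
TOY_LEXICON = {'great': 0.8, 'love': 0.5, 'bad': -0.7, 'terrible': -1.0}

def toy_polarity(text):
    # Average the scores of the words we recognize; 0.0 means no signal
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for text in ["great game love it", "this is bad", "walking to the store"]:
    print(text, '->', round(toy_polarity(text), 2))
```

Texts with no recognized words score 0.0 here, which is worth remembering when reading a scatter plot of polarity: a large cluster at zero may mean "no signal found" rather than "neutral opinion".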

In [None]:
# keep only posts with more than 100 upvotes
sa_df = df.loc[df.ups > 100]

In [None]:
sample_size = 10000

def sentiment_func(x):
    sentiment = TextBlob(x['selftext'])
    x['polarity'] = sentiment.polarity
    x['subjectivity'] = sentiment.subjectivity
    return x

sample = sa_df.sample(sample_size).apply(sentiment_func, axis=1)

In [None]:
sample.plot.scatter('ups', 'polarity')

## Recommended Resources

Other advanced Python libraries that we will cover in future lectures:

[spaCy](https://spacy.io/)

[Gensim](https://radimrehurek.com/gensim/)