# Data Programming in Python | BAIS:6040
# Text Processing with NLTK and TextBlob

Instructor: Jeff Hendricks 

Topics to be covered:
- Text processing with NLTK and text mining with TextBlob (+ exercises)
- Adding new columns (+ exercises)
- Popular keywords
- Searching for records containing a certain string

References: 
- NLTK official website (http://www.nltk.org/)
- Pandas official website (http://pandas.pydata.org/) 
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- TextBlob official website (https://textblob.readthedocs.io/en/dev/)

## Prerequisites

In [None]:
# !pip install nltk==3.5
# !pip install textblob
# !python -m nltk.downloader stopwords
# !python -m textblob.download_corpora

## Importing modules

In [None]:
from collections import Counter              # word counting
import nltk                                  # text processing
import pandas as pd                          # handling Pandas dataframes
from textblob import TextBlob                # sentiment analysis, language detection, and translation

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('brown')
nltk.download('stopwords')
pd.set_option('display.max_colwidth', 200)   # set the maximum column width to 200

## Text processing with NLTK

In [None]:
text = "I am pleased to announce the launch of http://greatoutdoors.com . This new platform will allow all who love hiking to find enjoyable trails. This has been a priority of mine and I’m pleased to share that it is up and running! #KeepAmericaHiking"
text

In [None]:
sents = nltk.sent_tokenize(text)       # Tokenize text into sentences.

In [None]:
sents

In [None]:
tokens = nltk.word_tokenize(text)        # Tokenize text into words.

In [None]:
tokens

In [None]:
tagged_tokens = nltk.pos_tag(tokens)          # Tag each token with its part-of-speech tag.

Note that the argument of the <b>pos_tag</b> function is a list of tokens, not raw text.
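
To see why the input type matters, recall that iterating over a Python string yields single characters rather than words. The sketch below uses only built-ins, and `str.split` is a crude stand-in for `nltk.word_tokenize` (an assumption made so the cell runs without NLTK data). Depending on the NLTK version, passing raw text to `pos_tag` may either raise an error or tag the characters one by one, so tokenizing first is the safe habit.

```python
raw_text = "Hiking is fun"

# A string is a sequence of characters, so code that loops over its
# input would see letters and spaces instead of words.
as_characters = list(raw_text)

# Splitting on whitespace is a crude stand-in for nltk.word_tokenize.
as_tokens = raw_text.split()

print(as_characters[:6])   # the first six items are single letters
print(as_tokens)           # three word-level tokens
```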

In [None]:
tagged_tokens

Penn Part of Speech Tags: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html
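
Penn tags that belong to the same word class share a prefix: noun tags start with `NN`, verb tags with `VB`, and adjective tags with `JJ`. The following is a minimal sketch of grouping tags by prefix. The helper name `tag_family` and the sample pairs are hypothetical, and the pairs are hand-written for illustration, not real `pos_tag` output.

```python
def tag_family(tag):
    """Map a Penn Treebank tag to a coarse word class by its prefix."""
    if tag.startswith("NN"):
        return "noun"        # NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "verb"        # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "adjective"   # JJ, JJR, JJS
    return "other"

# Hand-written (token, tag) pairs for illustration only.
sample = [("trails", "NNS"), ("love", "VBP"), ("enjoyable", "JJ"), ("the", "DT")]
families = [(token, tag_family(tag)) for token, tag in sample]
print(families)
```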

In [None]:
stemmer = nltk.stem.SnowballStemmer("english")

In [None]:
stemmer.stem("car")

In [None]:
stemmer.stem("cars")

In [None]:
stemmer.stem("say")

In [None]:
stemmer.stem("saying")

In [None]:
stemmer.stem("said")

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
lemmatizer.lemmatize("cars", pos="n")

In [None]:
lemmatizer.lemmatize("saying", pos="v")

In [None]:
lemmatizer.lemmatize("said", pos="v")

In [None]:
lemmatizer.lemmatize("am", pos="v")

In [None]:
lemmatizer.lemmatize("are", pos="v")

In [None]:
lemmatizer.lemmatize("is", pos="v")

## Text mining with TextBlob

In [None]:
text = "It's just awesome!"

In [None]:
tb = TextBlob(text)

In [None]:
tb.sentiment                # polarity between -1 (negative) and 1 (positive) 
                            # subjectivity between 0 (objective) and 1 (subjective)

In [None]:
tb.sentiment.polarity

In [None]:
tb.sentiment.subjectivity

In [None]:
text = "It's Friday."

In [None]:
tb = TextBlob(text)
tb.sentiment

In [None]:
text = "Machine Learning is a branch of Artificial Intelligence in computer science."

In [None]:
tb = TextBlob(text)

In [None]:
tb.noun_phrases

In [None]:
tb.detect_language()

In [None]:
text_es = tb.translate(to="es")
text_es

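TextBlob's language detection and translation are powered by the Google Translate API, so these cells need an internet connection. Note that <b>detect_language</b> and <b>translate</b> were deprecated in TextBlob 0.16.0 and removed in later releases, so the cells in this section only run on older TextBlob versions.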

In [None]:
text_de = tb.translate(to="de")
text_de

In [None]:
tb = TextBlob(text_es.string)
tb.detect_language()

In [None]:
tb = TextBlob(text_de.string)
tb.detect_language()

## Exercises for text processing and mining (9 questions)

In [None]:
text = "When our Country had no debt and built everything from Highways to the Military with CASH, we had a big system of Tariffs. Now we allow other countries to steal our wealth, treasure, and jobs - But no more! The USA is doing great, with unlimited upside into the future!"
text

1\. Using NLTK, get a list <i>sents</i> of sentences in <i>text</i>.

In [None]:
# Your answer here


2\. Using NLTK, get a list <i>tokens</i> of words in <i>text</i>.

In [None]:
# Your answer here


3\. Using NLTK, get a list <i>tagged_tokens</i> of tagged tokens in <i>text</i>.

In [None]:
# Your answer here


4\. Get a new list of the nouns in <i>tagged_tokens</i>, which are tagged with 'NN'.

In [None]:
# Your answer here


5\. Get a new list of the verbs in <i>tagged_tokens</i>, which are tagged with 'VB', 'VBD', 'VBG', 'VBN', 'VBP', or 'VBZ'.

In [None]:
# Your answer here


6\. Using TextBlob, get a new list of the polarities of the sentences in <i>text</i>.

In [None]:
# Your answer here


7\. Using TextBlob, get a new list of the subjectivities of the sentences in <i>text</i>.

In [None]:
# Your answer here


8\. Using TextBlob, get a new list of the Chinese sentences translated from the sentences in <i>text</i>.

In [None]:
# Your answer here


9\. Using TextBlob, get a new list of the detected languages of the sentences in <i>text</i>.

In [None]:
# Your answer here


## Loading a CSV file into a Pandas dataframe

In [None]:
df = pd.read_csv("data/imdb_data.csv", sep=",", header=0)   # the first row of the file is the header
df.columns = ['text']                                       # rename the single column to 'text'

In [None]:
df.head()

## Deriving a new column from an existing column

The raw dataset itself does not always come with all the information you may need. In many cases, you will have to derive new columns from a set of existing columns. 

### Adding a new column for text length (text ➔ text_length) 

In [None]:
df.text[:5]

In [None]:
df["text_length"] = df.text.apply(lambda x: len(x))

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

When using the <b>apply</b> method, pay attention to the existing column to be used (<i>text</i>) and the new column to be added (<i>text_length</i>).
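
As a small self-contained illustration of this pattern, the sketch below derives a length column on a toy dataframe (the rows are hypothetical stand-ins for the review data). It also checks that the vectorized `str.len()` accessor gives the same values, which may be a convenient alternative when the function is a simple string operation.

```python
import pandas as pd

# Tiny stand-in for the review data (hypothetical rows).
demo = pd.DataFrame({"text": ["Great movie!", "Too long.", "Loved it"]})

# apply calls the function once per value of the existing column.
demo["text_length"] = demo["text"].apply(len)

# The vectorized string accessor produces the same lengths.
same = demo["text_length"].tolist() == demo["text"].str.len().tolist()

print(demo)
print(same)
```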

In [None]:
df.columns

In [None]:
df[["text", "text_length"]].head()

### Adding a new column for tokens (text ➔ tokens)

In [None]:
df["tokens"] = df.text.apply(lambda x: nltk.word_tokenize(x))

In [None]:
df.columns

In [None]:
df[["text", "tokens"]].head()

### Adding a new column for tagged tokens (tokens ➔ tagged_tokens)

In [None]:
df["tagged_tokens"] = df.tokens.apply(lambda x: nltk.pos_tag(x))

In [None]:
df.columns

In [None]:
df[["text", "tokens", "tagged_tokens"]].head()

### Adding new columns for sentiment (text ➔ polarity, subjectivity)

- Polarity between -1 (negative) and 1 (positive)
- Subjectivity between 0 (objective) and 1 (subjective)

In [None]:
df["polarity"] = df.text.apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df.text.apply(lambda x: TextBlob(x).sentiment.subjectivity)
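
The cell above scores every review twice, once per column. One possible alternative is to score each row once and then split the result into two columns. The sketch below shows that pattern with a hypothetical stand-in function, `toy_sentiment`, so that it runs without TextBlob. Its scores are made up and are not TextBlob's.

```python
import pandas as pd

def toy_sentiment(text):
    """Hypothetical stand-in for TextBlob(text).sentiment.

    Returns a (polarity, subjectivity) pair with made-up values.
    """
    polarity = 1.0 if "good" in text.lower() else 0.0
    subjectivity = 1.0 if polarity else 0.0
    return polarity, subjectivity

demo = pd.DataFrame({"text": ["A good film", "It runs two hours"]})

# Score each row once, then split the resulting pairs into two columns.
scores = demo["text"].apply(toy_sentiment)
demo["polarity"] = scores.apply(lambda pair: pair[0])
demo["subjectivity"] = scores.apply(lambda pair: pair[1])

print(demo)
```

With TextBlob installed, replacing `toy_sentiment` with `lambda x: TextBlob(x).sentiment` should fit the same pattern, since `sentiment` is a named tuple of polarity and subjectivity.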

In [None]:
df.columns

In [None]:
df[["text", "polarity", "subjectivity"]].head()

In [None]:
df[(df.polarity > 0.4) | (df.polarity < -0.4)][["text", "polarity"]].head()
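
The filter above combines two boolean conditions. The sketch below reproduces it on a toy dataframe with hypothetical polarity values and shows that, for a symmetric threshold, an absolute-value test selects the same rows.

```python
import pandas as pd

demo = pd.DataFrame({"polarity": [0.8, 0.1, -0.6, -0.2]})

# Each comparison yields a boolean Series; | combines them element-wise.
# The parentheses are required because | binds tighter than > and <.
strong = (demo.polarity > 0.4) | (demo.polarity < -0.4)

# For this symmetric threshold, an absolute-value test selects the same rows.
strong_abs = demo.polarity.abs() > 0.4

print(demo[strong])
```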

## Exercises for adding new columns (2 questions)

Let's continue to use the dataframe <i>df</i> from the movie review dataset.

In [None]:
df.head()

1\. Add a new column named <i>text_short</i> that contains only the first ten characters of the column <i>text</i>.

In [None]:
# Your answer here


2\. Add a new column <i>lang</i> that contains the language of the column <i>text</i>.

In [None]:
# Your answer here


## Popular keywords

In [None]:
df.tagged_tokens[:5]

In [None]:
###################################################################################
# The 'counter' object will have all the word count information. 
###################################################################################

counter = Counter()

for pos in df.tagged_tokens:
    
    ###################################################################
    # Use a set to remove duplicate words in a review.
    # This way, each word is counted at most once per review,
    # even if it appears multiple times in that review.
    ###################################################################
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        word_set.add(word)
            
    counter.update(word_set)

collections.Counter: https://docs.python.org/3/library/collections.html#collections.Counter
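
To make the effect of the set concrete, the sketch below counts the same toy reviews in two ways: by raw occurrences and by the number of reviews that contain each word. The token lists are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized reviews; "good" appears twice in the first one.
reviews = [["good", "good", "plot"], ["good", "cast"]]

raw_counts = Counter()   # counts every occurrence
doc_counts = Counter()   # counts each word at most once per review

for tokens in reviews:
    raw_counts.update(tokens)
    doc_counts.update(set(tokens))

print(raw_counts["good"])   # 3 occurrences in total
print(doc_counts["good"])   # appears in 2 reviews
```

Counting per review keeps one long, repetitive review from dominating the keyword ranking.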

In [None]:
counter.most_common(30)           # Show the top-n most popular words in 'counter'.

collections.Counter.most_common: https://docs.python.org/3/library/collections.html#collections.Counter.most_common
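
`most_common(n)` returns a list of `(word, count)` pairs ordered from the highest count to the lowest. A minimal sketch with hypothetical counts shows the shape of the result and how to pull out just the words:

```python
from collections import Counter

# Hypothetical counts for three words.
demo_counter = Counter({"movie": 5, "film": 3, "plot": 1})

top_two = demo_counter.most_common(2)           # list of (word, count) pairs
top_words = [word for word, count in top_two]   # keep only the words

print(top_two)
print(top_words)
```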

In [None]:
global_stopwords = nltk.corpus.stopwords.words("english") # stopwords based on the English stopwords provided by NLTK
global_stopwords[:30]

Stopwords such as "the", "is", and "and" carry little meaning for keyword analysis, so they are usually removed before counting.
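
The filtering idea can be sketched on a single toy sentence. The stopword set below is a tiny hypothetical sample, and the NLTK list is much longer.

```python
# A tiny hypothetical stopword set; the NLTK list is much longer.
demo_stopwords = {"the", "is", "a", "and"}

tokens = ["The", "plot", "is", "a", "mess", "and", "the", "cast", "shines"]

# Lowercase before the membership test, because the stopwords are lowercase.
keywords = [t.lower() for t in tokens if t.lower() not in demo_stopwords]

print(keywords)
```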

In [None]:
counter = Counter()

for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ###################################
        # Check if the word is a stopword.
        ###################################
        if word in global_stopwords:
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

In [None]:
local_stopwords = [":", ";", ",", ".", "...", "!", "-", "#", "(", ")", "@", "&", "%", "'", "’", "“", "”", 
                   "amp", "https", "rt", "t…"]

stopword_set = set(global_stopwords + local_stopwords)   # build once; set lookups are fast

counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ##########################################################
        # Check if the word is either a global or a local stopword.
        ##########################################################
        if word in stopword_set:
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

### Popular adjectives 

In [None]:
stopword_set = set(global_stopwords + local_stopwords)   # build once; set lookups are fast

counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        tag = t[1]
        
        if word in stopword_set:
            continue
            
        ##########################################################
        # Keep the word only if its tag marks an adjective (JJ, JJR, JJS).
        ##########################################################
        if tag.startswith("JJ"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

### Popular verbs

In [None]:
stopword_set = set(global_stopwords + local_stopwords)   # build once; set lookups are fast

counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        tag = t[1]
        
        if word in stopword_set:
            continue
            
        ##########################################################
        # Keep the word only if its tag marks a verb (VB, VBD, VBG, VBN, VBP, VBZ).
        ##########################################################
        if tag.startswith("VB"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

## Searching for records containing a certain string

In [None]:
target_string = "movie"

In [None]:
mask = df.text.str.contains(target_string, case=False, regex=False)
mask

pandas.Series.str.contains: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
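
The sketch below runs the same kind of search on a toy dataframe (hypothetical rows) to show what the `case` and `regex` arguments do and how `~` inverts the mask.

```python
import pandas as pd

demo = pd.DataFrame({"text": ["A MOVIE to remember", "Great film", "movie?"]})

# case=False ignores letter case; regex=False treats the pattern as plain text,
# so characters such as "?" or "." are matched literally.
mask = demo.text.str.contains("movie", case=False, regex=False)

matches = demo[mask]        # rows containing the string
non_matches = demo[~mask]   # ~ inverts the boolean mask

print(matches)
print(non_matches)
```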

In [None]:
df[mask][["text"]].head()

In [None]:
df[~mask][["text"]].head()