# Data Programming in Python | BAIS:6040
# Module 7. Text Processing with NLTK and TextBlob

Written by Kang-Pyo Lee 

Topics to be covered:
- Text processing with NLTK and text mining with TextBlob (+ exercises)
- Adding new columns (+ exercises)
- Searching for rows containing a search term

## Installing Packages

In [None]:
# ! pip install --user --upgrade nltk textblob

## Text Processing with NLTK

In [None]:
text = "Some people did not think it was possible to produce a #COVID19 vaccine so quickly, but it was. Now some people say that vaccinating the world is not possible. They’re wrong. As Nelson Mandela, Madiba, said; it always seems impossible until it’s done."
text

Suppose you would like to identify sentences in a string. 

In [None]:
text.split(".")

Simply splitting text on periods has two problems:
- A sentence does not always end with a period. For example, it can also end with a question mark or an exclamation mark.
- A period does not always mark the end of a sentence, for example, in Ms., Dr., or 3.14.
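Both problems can be seen in a minimal sketch that uses only built-in string methods. The sample sentence and the variable names are made up for illustration.

```python
# A made-up sample with an abbreviation, a decimal number, and non-period endings
sample = "Dr. Smith measured 3.14 meters. Is that right? Yes!"

# Split on periods, then strip whitespace and drop empty pieces
pieces = [piece.strip() for piece in sample.split(".") if piece.strip()]

for piece in pieces:
    print(repr(piece))
```

The abbreviation and the decimal number are cut in the middle, while the question and the exclamation stay glued together as one piece.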

Now suppose you would also like to identify words in a string. 

In [None]:
text.split(" ") 

Splitting text on spaces generally works well, except that punctuation such as commas, periods, and 's stays attached to the neighboring word.
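The sketch below contrasts splitting on spaces with a rough regular-expression tokenizer. The regular expression is an assumption chosen for illustration; it is not how NLTK tokenizes, and it handles contractions crudely.

```python
import re

# A made-up sample with a comma, a contraction, and a final period
sample = "Wait, it's done."

# Splitting on spaces keeps punctuation attached to the words
space_tokens = sample.split(" ")

# Rough alternative: runs of word characters, or single punctuation characters
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(space_tokens)
print(regex_tokens)
```

The regular expression separates the comma and the period, but it also breaks the contraction into three pieces, which is one reason to prefer a dedicated tokenizer.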

### Sentence Tokenization

In [None]:
import nltk

# nltk.download('all')

Natural Language Toolkit (NLTK): https://www.nltk.org/

In [None]:
sents = nltk.sent_tokenize(text)       # Tokenize text into sentences
sents

In [None]:
len(sents)

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCES"
nltk.sent_tokenize(your_text)

### Word Tokenization

In [None]:
words = nltk.word_tokenize(text)        # Tokenize text into words
words

Note that each punctuation character is also treated as a token. 
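If only word tokens are needed, punctuation tokens can be filtered out afterwards. A minimal sketch, assuming a hand-written token list that stands in for tokenizer output; `is_punctuation` is a hypothetical helper, not an NLTK function.

```python
import string

# Hand-written tokens standing in for tokenizer output (not actual NLTK output)
sample_tokens = ["It", "was", "possible", ",", "but", "hard", "."]

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Keep only the tokens that are not made up entirely of punctuation
word_only = [t for t in sample_tokens if not is_punctuation(t)]
print(word_only)
```

`string.punctuation` covers ASCII characters only, so typographic characters such as the curly apostrophe would need to be added separately.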

In [None]:
len(words)

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
nltk.word_tokenize(your_text)

### Part-of-Speech (PoS) Tagging

Part-of-speech tagging is the process of marking each word in a text with its part of speech, such as noun, verb, adjective, or adverb.

Part-of-speech tagging on Wikipedia: https://en.wikipedia.org/wiki/Part-of-speech_tagging

In [None]:
words

In [None]:
tagged_words = nltk.pos_tag(words)
tagged_words

Note that the argument of the <b>pos_tag</b> function is a list of words, not raw text.

Penn Part of Speech Tags: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html
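Related Penn tags share a prefix: noun tags start with NN, verb tags with VB, and adjective tags with JJ. The sketch below counts tags by that two-letter prefix. The (word, tag) pairs are hand-written for illustration and are not actual tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the same shape that nltk.pos_tag returns
sample_tagged = [
    ("people", "NNS"), ("think", "VB"), ("possible", "JJ"),
    ("vaccine", "NN"), ("Mandela", "NNP"), ("said", "VBD"),
]

# Collapse fine-grained Penn tags to their two-letter prefix (NN, VB, JJ, ...)
coarse_counts = Counter(tag[:2] for _, tag in sample_tagged)
print(coarse_counts.most_common())
```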

In [None]:
[word for word, tag in tagged_words if tag in ["NN", "NNS"]]      # Noun, singular or mass; Noun, plural

In [None]:
[word for word, tag in tagged_words if tag in ["NNP", "NNPS"]]    # Proper noun, singular; Proper noun, plural

In [None]:
[word for word, tag in tagged_words if tag.startswith("NN")]

In [None]:
[word for word, tag in tagged_words if tag.startswith("JJ")]

In [None]:
[word for word, tag in tagged_words if tag.startswith("VB")]

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
nltk.pos_tag(nltk.word_tokenize(your_text))

### Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. A computer program that stems words is called a stemming program, stemming algorithm, or just stemmer.
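To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Snowball algorithm; `naive_stem` and its suffix list are assumptions made up for illustration.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

for w in ["car", "cars", "saying", "boxes", "said"]:
    print(w, "->", naive_stem(w))
```

Regular inflections are reduced to a common stem, while an irregular form such as "said" is left unchanged because no suffix rule applies to it.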

Stemming on Wikipedia: https://en.wikipedia.org/wiki/Stemming

In [None]:
stemmer = nltk.stem.SnowballStemmer("english")

In [None]:
stemmer.stem("car")

In [None]:
stemmer.stem("cars")

In [None]:
stemmer.stem("say")

In [None]:
stemmer.stem("saying")

In [None]:
stemmer.stem("said")

In [None]:
words

In [None]:
[(word, stemmer.stem(word)) for word in words]

In [None]:
your_word = "TRY_YOUR_OWN_WORD"
stemmer.stem(your_word)

<hr>

## Text Mining with TextBlob

TextBlob: https://textblob.readthedocs.io/

In [None]:
from textblob import TextBlob

### Sentiment Analysis

In [None]:
from IPython.display import Image
Image("classdata/images/sentiment_analysis.png")

In [None]:
text = "It's just awesome!"

In [None]:
tb = TextBlob(text)

In [None]:
tb.sentiment

In [None]:
tb.sentiment.polarity

In [None]:
tb.sentiment.subjectivity

In [None]:
text = "It's Friday."

In [None]:
tb = TextBlob(text)
tb.sentiment

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
tb = TextBlob(your_text)
tb.sentiment

### Noun Phrase Extraction

In [None]:
text = "Machine Learning is a branch of Artificial Intelligence in computer science."

In [None]:
tb = TextBlob(text)

In [None]:
tb.noun_phrases

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
tb = TextBlob(your_text)
tb.noun_phrases

### Language Detection

In [None]:
text

In [None]:
tb = TextBlob(text)
tb.detect_language()

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
tb = TextBlob(your_text)
tb.detect_language()

### Language Translation

In [None]:
tb = TextBlob(text)
tb.translate(to="es")

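Language detection and translation in TextBlob are powered by the Google Translate API (https://developers.google.com/translate/), which means there is a limit on the number of API calls. Both `detect_language()` and `translate()` were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so these cells run only with an older TextBlob release.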
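Because the number of API calls is limited, one option is to cache results so that the same text is never sent twice. A minimal sketch, assuming an in-memory cache is enough; `make_cached` and `fake_translate` are hypothetical names, and the stand-in function does not call any real service.

```python
from functools import lru_cache

def make_cached(translate_fn, maxsize=1024):
    """Wrap a (text, target_language) -> str function with an in-memory cache."""
    @lru_cache(maxsize=maxsize)
    def cached(text, to):
        return translate_fn(text, to)
    return cached

# Hypothetical stand-in for a real translation call; it only records invocations
calls = []

def fake_translate(text, to):
    calls.append((text, to))
    return f"[{to}] {text}"

translate_cached = make_cached(fake_translate)
translate_cached("Good morning", "es")
translate_cached("Good morning", "es")   # served from the cache, no new call
translate_cached("Good morning", "fr")   # different target language, new call
print(len(calls))
```

With a TextBlob release that still provides `translate`, the stand-in could be replaced by a small function that builds a `TextBlob` and returns the translated string.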

In [None]:
tb = TextBlob(text)
tb.translate(to="zh-CN")

In [None]:
your_text = "TRY_YOUR_OWN_SENTENCE"
tb = TextBlob(your_text)
tb.translate(to="LANGUAGE_CODE")

## Exercises for Text Processing and Mining

<hr>

## Adding New Columns to a Dataframe

The raw dataset itself does not always come with all the information you may need. In many cases, you will have to derive new columns from existing columns. 

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)   # set the maximum column width to 150

df = pd.read_csv("classdata/timeline_cnnbrk.csv", sep="\t")
df

CNN Breaking News on Twitter: https://twitter.com/cnnbrk

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.tail()

### Adding a New Column Representing the Length of the Text (text ➔ text_length) 

In [None]:
Image("classdata/images/dataframe.png")

In [None]:
df["text_length"] = df.text.apply(lambda x: len(x))

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

When using the <b>apply</b> method, pay attention to the existing column to be used (e.g., `text`) and the new column to be added (e.g., `text_length`).
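The pattern can be tried on a tiny made-up dataframe first. The sketch below assumes pandas is installed; `demo` is a hypothetical frame whose column name mirrors the one used in this module.

```python
import pandas as pd

# Tiny made-up frame; the column name mirrors the one used in this module
demo = pd.DataFrame({"text": ["Hello", "Hello world"]})

# New column on the left, existing column on the right
demo["text_length"] = demo["text"].apply(len)

# The vectorized string accessor gives the same result
demo["text_length_str"] = demo["text"].str.len()

print(demo)
```

Passing `len` directly is equivalent to `lambda x: len(x)`; the lambda form is useful when the function needs extra arguments or more than one step.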

In [None]:
df[["text", "text_length"]]

In [None]:
df.columns

### Adding a New Column Representing Tokens of the Text (text ➔ tokens)

In [None]:
df["tokens"] = df.text.apply(lambda x: nltk.word_tokenize(x))

In [None]:
df[["text", "tokens"]]

In [None]:
df.columns

### Adding a New Column Representing the Tagged Tokens (tokens ➔ tagged_tokens)

In [None]:
df["tagged_tokens"] = df.tokens.apply(lambda x: nltk.pos_tag(x))

In [None]:
df[["text", "tokens", "tagged_tokens"]]

In [None]:
df.columns

### Adding New Columns Representing Tweet Sentiment (text ➔ polarity, subjectivity)

In [None]:
df["polarity"] = df.text.apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df.text.apply(lambda x: TextBlob(x).sentiment.subjectivity)

In [None]:
df[["text", "polarity", "subjectivity"]]

In [None]:
df.columns

In [None]:
df[df.polarity > 0.7][["text", "polarity"]]

In [None]:
df[(df.polarity > 0.7) | (df.polarity < -0.7)][["text", "polarity"]]
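The parentheses around each condition are required because `|` binds more tightly than the comparison operators. The sketch below shows the same filter on made-up polarity values (not TextBlob output), together with an equivalent form based on the absolute value.

```python
import pandas as pd

# Made-up polarity values for illustration, not TextBlob output
demo_scores = pd.DataFrame({
    "text": ["great", "fine", "awful", "okay"],
    "polarity": [0.8, 0.2, -0.9, 0.0],
})

# Two conditions combined with | (element-wise OR); each needs its own parentheses
either_extreme = (demo_scores["polarity"] > 0.7) | (demo_scores["polarity"] < -0.7)

# Equivalent single condition on the magnitude of the polarity
by_magnitude = demo_scores["polarity"].abs() > 0.7

strong = demo_scores[either_extreme]
print(strong)
```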

## Searching for Rows Containing a Search Term

In [None]:
search_term = "covid"

In [None]:
mask = df.text.str.contains(search_term, case=False)
mask

pandas.Series.str.contains: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
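By default, `str.contains` interprets the search term as a regular expression and returns a missing value for rows without text. A minimal sketch on made-up rows, showing the `na` and `regex` parameters; `demo_rows` and the mask names are hypothetical.

```python
import pandas as pd

# Made-up rows, including one missing value
demo_rows = pd.DataFrame({
    "text": ["COVID-19 cases rise", "Markets close higher", None, "Price up $5 (est.)"]
})

# Case-insensitive search; na=False treats missing text as "no match"
covid_mask = demo_rows["text"].str.contains("covid", case=False, na=False)

# regex=False searches for the literal characters, including the parentheses
literal_mask = demo_rows["text"].str.contains("(est.)", regex=False, na=False)

print(demo_rows[covid_mask])
print(demo_rows[literal_mask])
```

Setting `na=False` keeps the mask purely Boolean, so it can be used directly for row selection and negated with `~`.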

In [None]:
df[mask][["created_at", "text"]]

In [None]:
df[~mask][["created_at", "text"]]

## Exercises for Adding New Columns