# Text Analytics | BAIS:6100
# Module 3: Basic Natural Language Processing 


Instructor: Kang-Pyo Lee 

Topics to be covered:

- Basic NLP Techniques Using NLTK
    - Sentence Tokenization
    - Word Tokenization
    - Part-of-Speech (PoS) Tagging
    - Stemming
    - Lemmatization
    - N-grams

- NLP Techniques Using TextBlob
    - Sentiment Analysis
    - Noun Phrase Extraction
    - Language Detection/Translation

- Gender Prediction

## Install packages

In [None]:
# ! pip install --user --upgrade gender-guesser nltk textblob

## Basic NLP Techniques Using NLTK

In [None]:
text = "Some people did not think it was possible to produce a #COVID19 vaccine so quickly, but it was. Now some people say that vaccinating the world is not possible. They’re wrong. As Nelson Mandela, Madiba, said; it always seems impossible until it’s done."
text

Suppose you would like to identify sentences in a string. 

In [None]:
text.split(".")

Two problems with simply splitting text on a period: 
- A sentence does not always end with a period. For example, it can also end with a question mark or an exclamation mark.
- A period does not always mark the end of a sentence, for example, in Ms., Dr., or 3.14.
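
Both problems can be reproduced with the standard library alone. The sketch below uses a made-up sample sentence (not the text above) and a simple regular expression that is only an illustration, not what NLTK does internally. Splitting on `.`, `!`, or `?` followed by whitespace fixes the first problem but still breaks on the abbreviation.

```python
import re

# Made-up sample containing an abbreviation, a decimal, and non-period endings.
sample = "Dr. Lee paid $3.14 for coffee. Was it worth it? Yes!"

# Naive approach: split on every period.
naive = sample.split(".")

# Slightly better: split on whitespace that follows '.', '!' or '?'.
regex_split = re.split(r"(?<=[.!?])\s+", sample)

print(naive)        # the decimal 3.14 and "Dr." are both cut apart
print(regex_split)  # '?' and '!' now end sentences, but "Dr." is still cut off
```

The regular expression keeps `3.14` intact because its period is not followed by whitespace, yet it still treats `Dr.` as a sentence of its own, which is why a trained sentence tokenizer is preferable.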

Now suppose you would also like to identify words in a string. 

In [None]:
text.split(" ")     # Punctuation stays attached: 'was.', 'possible.', 'wrong.', 'said;', 'done.'

Splitting text on spaces generally works, except that punctuation such as commas, periods, and 's stays attached to the neighboring word. 
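
As a rough illustration of the difference, the following sketch compares splitting on spaces with a small regular expression that emits each punctuation character as its own token. The sample string and the pattern are assumptions made for this example. The pattern splits contractions crudely, so it is not a substitute for a dedicated tokenizer.

```python
import re

# Made-up sample with a semicolon, a contraction, and a final period.
sample = "As Mandela said; it's done."

# Splitting on spaces leaves punctuation glued to the words.
space_tokens = sample.split(" ")

# Runs of word characters, or any single non-space, non-word character.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(space_tokens)
print(regex_tokens)
```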

### Sentence Tokenization

In [None]:
import nltk

# nltk.download('all')

Natural Language Toolkit (NLTK): https://www.nltk.org/

In [None]:
sents = nltk.sent_tokenize(text)       # Tokenize text into sentences

In [None]:
sents

In [None]:
len(sents)

### Word Tokenization

In [None]:
words = nltk.word_tokenize(text)        # Tokenize text into words

In [None]:
words

Note that each punctuation character is also treated as a token. 

In [None]:
len(words)

### Part-of-Speech (PoS) Tagging

Part-of-speech tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Part-of-speech tagging on Wikipedia: https://en.wikipedia.org/wiki/Part-of-speech_tagging

In [None]:
tagged_words = nltk.pos_tag(words)          # Tag each word with its part-of-speech tag

Note that the argument of the <b>pos_tag</b> function is a list of words, not raw text. 

In [None]:
tagged_words

Penn Part of Speech Tags: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html

In [None]:
[word for word, tag in tagged_words if tag in ["NN", "NNS"]]   # Noun, singular or mass; Noun, plural

In [None]:
[word for word, tag in tagged_words if tag in ["NNP", "NNPS"]]  # Proper noun, singular; Proper noun, plural

In [None]:
[word for word, tag in tagged_words if tag.startswith("NN")]

In [None]:
[word for word, tag in tagged_words if tag.startswith("JJ")]

In [None]:
[word for word, tag in tagged_words if tag.startswith("VB")]
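
The prefix filters above can be generalized by grouping all words by the first two characters of their tag. This is a minimal sketch: `group_by_tag_prefix` is a hypothetical helper, and `sample_tagged` is a hand-written list of (word, tag) pairs in the Penn tagset, not actual tagger output.

```python
from collections import defaultdict

def group_by_tag_prefix(tagged_words, prefix_length=2):
    """Group words by the leading characters of their PoS tag (e.g. NN, VB, JJ)."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag[:prefix_length]].append(word)
    return dict(groups)

# Hand-written sample for illustration only (not produced by nltk.pos_tag).
sample_tagged = [
    ("people", "NNS"), ("produce", "VB"), ("vaccine", "NN"),
    ("Nelson", "NNP"), ("possible", "JJ"), ("seems", "VBZ"),
]

groups = group_by_tag_prefix(sample_tagged)
print(groups)
```

Passing the `tagged_words` list from the cells above instead of `sample_tagged` should group the real tagger output in the same way.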

### Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.

Stemming on Wikipedia: https://en.wikipedia.org/wiki/Stemming
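
To make the idea concrete, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch, not the Snowball algorithm used below; the function name, the suffix list, and the minimum stem length of 3 are all assumptions chosen for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "s"), min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["car", "cars", "say", "saying", "said"]:
    print(word, "->", toy_stem(word))
```

Rule-based stripping handles regular inflections such as "cars" and "saying" but leaves the irregular form "said" unchanged, which is one motivation for lemmatization.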

In [None]:
stemmer = nltk.stem.SnowballStemmer("english")

In [None]:
stemmer.stem("car")

In [None]:
stemmer.stem("cars")

In [None]:
stemmer.stem("say")

In [None]:
stemmer.stem("saying")

In [None]:
stemmer.stem("said")

In [None]:
[(word, stemmer.stem(word)) for word in words]

### Lemmatization

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

Lemmatization on Wikipedia: https://en.wikipedia.org/wiki/Lemmatisation
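
In contrast to rule-based stemming, lemmatization relies on a dictionary lookup. The toy sketch below uses a tiny hand-written table keyed by (word, part of speech). It is an assumption made purely for illustration and does not reflect WordNet's actual contents or coverage.

```python
# Hand-written toy dictionary: (inflected form, part of speech) -> lemma.
TOY_LEMMAS = {
    ("cars", "n"): "car",
    ("saying", "v"): "say",
    ("said", "v"): "say",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma from the toy table, or the word itself if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("said", pos="v"))    # found as a verb
print(toy_lemmatize("said"))             # not listed as a noun, returned unchanged
print(toy_lemmatize("better", pos="a"))  # irregular adjective form
```

The same word can map to different lemmas depending on its part of speech, so the lookup key includes both.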

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

WordNet (https://wordnet.princeton.edu/) is a large lexical database of English developed at Princeton University. 

In [None]:
lemmatizer.lemmatize("cars", pos="n")

The `lemmatize` method takes a part-of-speech parameter called `pos`. If not supplied, it defaults to `'n'`, which means noun.
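
Because the tagger produces Penn tags while `pos` expects a single letter, a small mapping is often convenient when lemmatizing tagged words. The helper below is a hypothetical sketch (`penn_to_wordnet_pos` is not part of NLTK) that maps tag prefixes to the letters `'a'`, `'v'`, `'r'`, and `'n'` and falls back to noun for everything else.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a single-letter pos value (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    if tag.startswith("NN"):
        return "n"   # noun
    return default   # anything else falls back to the default

for tag in ["NNS", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With the objects defined in this notebook, one possible use is `[lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag)) for word, tag in tagged_words]`.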

In [None]:
lemmatizer.lemmatize("saying", pos="v")

In [None]:
lemmatizer.lemmatize("said", pos="v")

In [None]:
lemmatizer.lemmatize("am", pos="v")

In [None]:
lemmatizer.lemmatize("are", pos="v")

In [None]:
lemmatizer.lemmatize("is", pos="v")

In [None]:
lemmatizer.lemmatize("better", pos="a")

In [None]:
lemmatizer.lemmatize("best", pos="a")

### N-grams

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. 

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram".

N-gram on Wikipedia: https://en.wikipedia.org/wiki/N-gram
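
The definition can be sketched in a few lines of plain Python by zipping a token list with shifted copies of itself. `make_ngrams` is a hypothetical helper written for this illustration, and the token list is a short hand-picked excerpt.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token sequences as a list of tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["it", "always", "seems", "impossible"]

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so the four tokens here produce three bigrams and two trigrams.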

In [None]:
from IPython.display import Image
Image("classdata/images/ngram.png")

N-gram models are known to be effective in capturing phrases of multiple words.

In [None]:
from nltk.util import ngrams

In [None]:
ngrams(words, 1)

In [None]:
list(ngrams(words, 1))

Unigrams are practically the same as word tokens. 

In [None]:
list(ngrams(words, 2))

In [None]:
[" ".join(gram) for gram in ngrams(words, 2)]

In [None]:
[" ".join(gram) for gram in ngrams(words, 3)]

Note that while n-grams can capture meaningful phrases in text, they can also generate a lot of noise. 
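
One simple way to reduce this noise, shown here only as a sketch, is to discard n-grams that contain punctuation tokens. The token list is a hand-written excerpt resembling the tokenized text, and filtering with `str.isalpha` is an assumption of this example rather than a standard NLTK step.

```python
# Hand-written token excerpt, with punctuation as separate tokens.
tokens = ["As", "Nelson", "Mandela", ",", "Madiba", ",", "said", ";"]

# All bigrams, including those that span punctuation.
bigrams = list(zip(tokens, tokens[1:]))

# Keep only bigrams in which every token is purely alphabetic.
clean_bigrams = [gram for gram in bigrams if all(tok.isalpha() for tok in gram)]

print(len(bigrams), "bigrams in total")
print(clean_bigrams)
```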

## NLP Techniques Using TextBlob

TextBlob: https://textblob.readthedocs.io/

In [None]:
from textblob import TextBlob

### Sentiment Analysis

In [None]:
Image("classdata/images/sentiment_analysis.png")

In [None]:
text = "It's just awesome!"

In [None]:
tb = TextBlob(text)

In [None]:
tb.sentiment

In [None]:
tb.sentiment.polarity

In [None]:
tb.sentiment.subjectivity
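
TextBlob reports polarity in the range [-1.0, 1.0] and subjectivity in the range [0.0, 1.0]. When a categorical label is more convenient than a score, the polarity can be bucketed as in the sketch below. `label_polarity` is a hypothetical helper, and the threshold of 0.1 is an arbitrary assumption that should be tuned for the data at hand.

```python
def label_polarity(polarity, threshold=0.1):
    """Convert a polarity score in [-1.0, 1.0] into a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for score in [1.0, 0.0, -0.5]:
    print(score, "->", label_polarity(score))
```

With the objects above, one possible call is `label_polarity(tb.sentiment.polarity)`.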

In [None]:
text = "It's Friday."

In [None]:
tb = TextBlob(text)
tb.sentiment

In [None]:
text = "TRY_YOUR_OWN_SENTENCE"
tb = TextBlob(text)
tb.sentiment

### Noun Phrase Extraction

In [None]:
text = "Machine Learning is a branch of Artificial Intelligence in computer science."

In [None]:
tb = TextBlob(text)

In [None]:
tb.noun_phrases

### Language Detection

In [None]:
tb.detect_language()

### Language Translation

In [None]:
tb.translate(to="es")

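TextBlob's language detection and translation are powered by the Google Translate API (https://developers.google.com/translate/), which means there is a limit on the number of API calls. Note that `detect_language` and `translate` were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so the cells in these two sections require an older TextBlob release. 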

In [None]:
tb.translate(to="zh-CN")

## Gender Prediction

In [None]:
import gender_guesser.detector as gender

gender-guesser: https://pypi.python.org/pypi/gender-guesser

In [None]:
d = gender.Detector(case_sensitive=False)

In [None]:
d.get_gender("Bob"), d.get_gender("Sally"), d.get_gender("Pauley"), d.get_gender("Jamie"), d.get_gender("Iowa")

The gender-guesser returns six types of gender: 
- female
- male
- mostly_female
- mostly_male
- unknown (name not found)
- andy (androgynous)

In [None]:
d.get_gender("YOUR_FIRST_NAME")

In [None]:
def predict_gender(detector, name):
    if len(name.split()) == 0:
        return "unknown"
    
    first_name = name.split()[0]               # Take the first name.
    
    ######################################################################
    # Apply some heuristics (any other good ideas?)
    ######################################################################
    if first_name.startswith(("Ms", "Mrs", "Miss")):   # startswith accepts a tuple of prefixes
        return "female"
    if first_name.startswith("Mr"):
        return "male"
    
    user_gender = detector.get_gender(first_name)
    
    ######################################################################
    # For simplicity, treat mostly_female as female and mostly_male as male.
    ######################################################################
    if user_gender == "mostly_female":
        return "female"
    elif user_gender == "mostly_male":
        return "male"
    
    return user_gender
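
As one possible answer to the "any other good ideas?" prompt in the function above, honorifics could be stripped before the first name is looked up, so that a name such as "Dr. Jane Doe" is not treated as having the first name "Dr.". The sketch below is a hypothetical helper. The `HONORIFICS` set and the function name are assumptions for illustration, and the set is far from exhaustive.

```python
# Assumed, non-exhaustive set of honorifics (lowercase, without trailing period).
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "prof"}

def first_given_name(name):
    """Return the first token that is not an honorific, or '' if none remains."""
    tokens = name.split()
    while tokens and tokens[0].rstrip(".").lower() in HONORIFICS:
        tokens.pop(0)
    return tokens[0] if tokens else ""

for name in ["Dr. Jane Doe", "Mr. Bill Gates", "Melinda Gates", ""]:
    print(repr(name), "->", repr(first_given_name(name)))
```

The returned token could then be passed to `d.get_gender`, while the title itself remains available for the Ms/Mrs/Miss/Mr heuristic.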

In [None]:
predict_gender(d, "Bill Gates")

In [None]:
predict_gender(d, "Melinda Gates")

## Exercises - Basic NLP Techniques Using NLTK and TextBlob