## MIS780 - Advanced Artificial Intelligence for Business

## Week 2 - Part 1: Natural Language Processing

In this session, you will get familiar with the Python Natural Language Toolkit (NLTK) and perform various text processing operations.

## Table of Contents
   
1. [Text Processing](#cell_Processing)
    - [Tokenization](#cell_Tokenization)
    - [Stemming](#cell_Stemmers)
    - [Sentence Segmentation](#cell_Sentence)
    - [Stopwords Removal](#cell_Stopwords)
    - [From Lists to String](#cell_Lists)
    
    
2. [Tagging Words](#cell_Tagging)
    - [Using a Tagger](#cell_Tagger)
    - [Representing Tagged Tokens](#cell_Representing)

<a id = "cell_Processing"></a>
### **1. Text Processing**

In [24]:
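# Load NLTK and download the resources used in this notebook
import nltk
nltk.download('punkt')                           # tokenizer models
nltk.download('punkt_tab')                       # tokenizer tables used by newer NLTK releases
nltk.download('stopwords')                       # stopword lists
nltk.download('averaged_perceptron_tagger')      # POS tagger model
nltk.download('averaged_perceptron_tagger_eng')  # name newer NLTK releases use for the POS tagger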


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<a id = "cell_Tokenization"></a>
###  Tokenization

First, we need to define the data we will use in this section:

In [25]:
raw = '''Online activities such as articles, website text, blog posts, \
social media posts are generating unstructured textual data. \
Corporate and business need to analyze textual data to understand \
customer activities, opinion, and feedback to successfully derive their business.'''
print(raw)

Online activities such as articles, website text, blog posts, social media posts are generating unstructured textual data. Corporate and business need to analyze textual data to understand customer activities, opinion, and feedback to successfully derive their business.


In [None]:
# check the datatype of raw
type(raw)

str

Split text into tokens:

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(raw)
print(tokens)

['Online', 'activities', 'such', 'as', 'articles', ',', 'website', 'text', ',', 'blog', 'posts', ',', 'social', 'media', 'posts', 'are', 'generating', 'unstructured', 'textual', 'data', '.', 'Corporate', 'and', 'business', 'need', 'to', 'analyze', 'textual', 'data', 'to', 'understand', 'customer', 'activities', ',', 'opinion', ',', 'and', 'feedback', 'to', 'successfully', 'derive', 'their', 'business', '.']
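The token list above still contains punctuation and mixed case. The following is a minimal sketch of a common follow-up step, assuming you only want lowercase alphabetic tokens for later counting. The sample list is a short hand-copied excerpt of the tokens above, and the helper name `normalize` is illustrative, not part of NLTK.

```python
# Short excerpt of the token list printed above (copied by hand so the example is self-contained).
sample_tokens = ['Online', 'activities', 'such', 'as', 'articles', ',',
                 'website', 'text', ',', 'blog', 'posts', '.']

def normalize(tokens):
    """Keep purely alphabetic tokens and convert them to lowercase."""
    return [t.lower() for t in tokens if t.isalpha()]

words = normalize(sample_tokens)
print(words)
```

Note that `str.isalpha()` also discards tokens such as numbers or hyphenated words, so this filter may be too aggressive for some texts.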


<a id = "cell_Stemmers"></a>
### Stemming

Stemming reduces a word to its stem by stripping affixes, mainly suffixes. The stem is not always a dictionary word (for example, `business` becomes `busi`), which distinguishes it from a lemma, the dictionary form of a word. We use NLTK's `PorterStemmer` class to perform stemming:

In [None]:
from nltk import PorterStemmer

porter = PorterStemmer()
[porter.stem(t) for t in tokens]

['onlin',
 'activ',
 'such',
 'as',
 'articl',
 ',',
 'websit',
 'text',
 ',',
 'blog',
 'post',
 ',',
 'social',
 'media',
 'post',
 'are',
 'gener',
 'unstructur',
 'textual',
 'data',
 '.',
 'corpor',
 'and',
 'busi',
 'need',
 'to',
 'analyz',
 'textual',
 'data',
 'to',
 'understand',
 'custom',
 'activ',
 ',',
 'opinion',
 ',',
 'and',
 'feedback',
 'to',
 'success',
 'deriv',
 'their',
 'busi',
 '.']
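Because different surface forms collapse to the same stem, stems are convenient for counting. The sketch below assumes NLTK is installed (the Porter stemmer needs no extra data download). The word list is a small illustrative sample, partly taken from the tokens above.

```python
from collections import Counter
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative sample: several forms of "post" plus tokens from the example text.
sample = ['post', 'posts', 'posting', 'activities', 'business', 'business']

# Count how often each stem occurs, regardless of the original word form.
stem_counts = Counter(porter.stem(w) for w in sample)
print(stem_counts.most_common())
```

Counting stems instead of raw tokens groups related forms together, at the cost of stems that are sometimes hard to read.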

<a id = "cell_Sentence"></a>
### Sentence Segmentation

Raw text can be split into sentences using the `sent_tokenize()` function.

In [None]:
from nltk.tokenize import sent_tokenize

sents =  sent_tokenize(raw)
[s for s in sents]

['Online activities such as articles, website text, blog posts, social media posts are generating unstructured textual data.',
 'Corporate and business need to analyze textual data to understand customer activities, opinion, and feedback to successfully derive their business.']
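To see why a trained segmenter is useful, compare it with a naive rule. The sketch below is plain Python, not NLTK: it splits after `.`, `!` or `?` when followed by whitespace. The function name `naive_sent_split` and both example strings are illustrative assumptions.

```python
import re

def naive_sent_split(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

simple = "Online posts generate textual data. Businesses need to analyze it."
tricky = "Dr. Smith reviewed the feedback. The results were positive."

simple_sents = naive_sent_split(simple)  # splits cleanly into two sentences
tricky_sents = naive_sent_split(tricky)  # the abbreviation "Dr." triggers an extra split

print(simple_sents)
print(tricky_sents)
```

The pre-trained Punkt model behind `sent_tokenize()` is designed to handle many abbreviations of this kind, which is why it is usually preferred over a hand-written rule.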

<a id = "cell_Stopwords"></a>
### Stopwords Removal

Stopwords are considered noise in the text. Text may contain stopwords such as *is*, *am*, *are*, *this*, *a*, *an*, and *the*.

To remove stopwords with NLTK, load the list of stopwords and filter them out of your list of tokens.

In [None]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'further', 'should', "wouldn't", 'is', 'there', 'for', 'we', "couldn't", 'isn', 'once', 'again', 'didn', 'yours', 'being', 'he', "isn't", 'mightn', 'weren', 'who', "he'd", 'own', 'with', 'a', 'into', 'now', 'it', 'can', "shan't", 'up', 'yourselves', 'here', 'that', 'y', 'were', 'shan', 'the', 'under', 'hadn', 'why', 'over', 'of', 'some', 'these', 'won', "we're", 'not', 'hasn', "it'd", 'other', 'them', 'too', 'your', 'i', 'most', "doesn't", 'me', "shouldn't", "we've", "needn't", 'wasn', 'no', 'will', 'ours', 'be', 'my', 'ourselves', 'aren', 'each', "haven't", 'at', 'our', "she'll", 'an', "you've", 'don', 'they', 'doing', 'herself', "i'm", 'ma', 'mustn', 'where', 'only', 'few', 'haven', 'below', 'same', "it'll", "they've", "hadn't", 'how', 'd', 'this', 'all', 'did', "you're", 'but', "they're", 'his', 'very', 'its', "he's", 'those', 'am', "mustn't", 'have', 'nor', 'has', 'before', 'had', 'shouldn', 'been', 'through', 'm', "she's", 'or', 'from', 'doesn', 'was', "you'll", "weren't", 'havin

In [None]:
filtered_sent=[]
for w in tokens:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokens)
print("Filtered Sentence:",filtered_sent)

Tokenized Sentence: ['Online', 'activities', 'such', 'as', 'articles', ',', 'website', 'text', ',', 'blog', 'posts', ',', 'social', 'media', 'posts', 'are', 'generating', 'unstructured', 'textual', 'data', '.', 'Corporate', 'and', 'business', 'need', 'to', 'analyze', 'textual', 'data', 'to', 'understand', 'customer', 'activities', ',', 'opinion', ',', 'and', 'feedback', 'to', 'successfully', 'derive', 'their', 'business', '.']
Filtered Sentence: ['Online', 'activities', 'articles', ',', 'website', 'text', ',', 'blog', 'posts', ',', 'social', 'media', 'posts', 'generating', 'unstructured', 'textual', 'data', '.', 'Corporate', 'business', 'need', 'analyze', 'textual', 'data', 'understand', 'customer', 'activities', ',', 'opinion', ',', 'feedback', 'successfully', 'derive', 'business', '.']
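The membership test above is case-sensitive, and NLTK's English stopword list is lowercase, so a capitalised token such as `The` would not be removed. Below is a minimal sketch of a case-insensitive filter. The small `stop_words` set and the token list are hand-written stand-ins so the example runs on its own; they are not NLTK's data.

```python
# Hand-written stand-in for set(stopwords.words("english")).
stop_words = {'the', 'are', 'to', 'and', 'their', 'such', 'as'}

# Illustrative tokens, including a capitalised stopword.
tokens = ['The', 'posts', 'are', 'generating', 'data', 'and', 'feedback']

# Case-sensitive check, as in the loop above: 'The' slips through.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercase each token before the lookup to catch capitalised stopwords too.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```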


<a id = "cell_Lists"></a>
### From Lists to String

Convert the list of tokens back into a string. `' '.join(tokens)` takes all the items in `tokens` and concatenates them into one big string, using `' '` as a separator between the items.

In [None]:
print(tokens)
' '.join(tokens)

['Online', 'activities', 'such', 'as', 'articles', ',', 'website', 'text', ',', 'blog', 'posts', ',', 'social', 'media', 'posts', 'are', 'generating', 'unstructured', 'textual', 'data', '.', 'Corporate', 'and', 'business', 'need', 'to', 'analyze', 'textual', 'data', 'to', 'understand', 'customer', 'activities', ',', 'opinion', ',', 'and', 'feedback', 'to', 'successfully', 'derive', 'their', 'business', '.']


'Online activities such as articles , website text , blog posts , social media posts are generating unstructured textual data . Corporate and business need to analyze textual data to understand customer activities , opinion , and feedback to successfully derive their business .'
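Joining with a plain space leaves a space before every punctuation mark, as the output above shows. The sketch below shows one way to tidy this with a regular expression. The pattern covers only a few common punctuation marks and is an illustrative assumption, not an NLTK feature.

```python
import re

# Short excerpt of the token list (copied by hand so the example is self-contained).
tokens = ['Online', 'activities', 'such', 'as', 'articles', ',', 'website', 'text', '.']

joined = ' '.join(tokens)

# Remove whitespace that directly precedes common punctuation marks.
tidy = re.sub(r'\s+([,.;:!?])', r'\1', joined)

print(joined)
print(tidy)
```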

<a id = "cell_Tagging"></a>
### **2. Tagging Words**

The process of classifying words into their **parts-of-speech** and labeling them accordingly is known as **part-of-speech tagging**, **POS tagging**, or simply **tagging**. Parts-of-speech are also known as **lexical categories**. The collection of tags used for a particular task is known as a **tagset**.

<a id = "cell_Tagger"></a>
### Using a Tagger

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word:

In [26]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Here we see that `and` is **CC**, a coordinating conjunction; `now` and `completely` are **RB**, or adverbs; `for` is **IN**, a preposition; `something` is **NN**, a noun; and `different` is **JJ**, an adjective.

Let’s look at another example, this time including some homonyms:

In [None]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
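Since the tagger returns a plain list of `(word, tag)` tuples, ordinary Python is enough to query it. The sketch below uses the tagged output above, copied by hand so it runs without the tagger model. The variable names are illustrative.

```python
from collections import Counter

# Tagged sentence copied from the output above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Words tagged as nouns (Penn Treebank noun tags start with 'NN').
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# All tags assigned to the word form 'refuse'.
refuse_tags = {tag for word, tag in tagged if word == 'refuse'}

print(nouns)
print(tag_counts.most_common(3))
print(refuse_tags)
```

The last query makes the ambiguity explicit: the same spelling `refuse` receives both a verb tag and a noun tag depending on its position in the sentence.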

<a id = "cell_Representing"></a>
### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token.

In [None]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')

In [None]:
tagged_token[0]

'fly'

In [None]:
tagged_token[1]

'NN'

We can construct a list of tagged tokens directly from a string:

In [None]:
sent = '''The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN \
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC \
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS \
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB \
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT \
interest/NN of/IN both/ABX governments/NNS ''/'' ./.'''
sent.split()

['The/AT',
 'grand/JJ',
 'jury/NN',
 'commented/VBD',
 'on/IN',
 'a/AT',
 'number/NN',
 'of/IN',
 'other/AP',
 'topics/NNS',
 ',/,',
 'AMONG/IN',
 'them/PPO',
 'the/AT',
 'Atlanta/NP',
 'and/CC',
 'Fulton/NP-tl',
 'County/NN-tl',
 'purchasing/VBG',
 'departments/NNS',
 'which/WDT',
 'it/PPS',
 'said/VBD',
 '``/``',
 'ARE/BER',
 'well/QL',
 'operated/VBN',
 'and/CC',
 'follow/VB',
 'generally/RB',
 'accepted/VBN',
 'practices/NNS',
 'which/WDT',
 'inure/VB',
 'to/IN',
 'the/AT',
 'best/JJT',
 'interest/NN',
 'of/IN',
 'both/ABX',
 'governments/NNS',
 "''/''",
 './.']

In [None]:
[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]
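Notice that the tags `NP-tl` and `NN-tl` come back as `NP-TL` and `NN-TL`: `str2tuple()` uppercases the tag. The following pure-Python sketch approximates that behaviour for illustration. It is not NLTK's source code, and the name `split_tagged` is hypothetical.

```python
def split_tagged(s, sep='/'):
    """Split 'word/TAG' at the last separator and uppercase the tag."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (s, None)
    return (word, tag.upper())

print(split_tagged('fly/NN'))
print(split_tagged('Fulton/NP-tl'))
print(split_tagged("''/''"))
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a `/` intact.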

### References:

- Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media, Sebastopol, CA. https://www.oreilly.com/library/view/natural-language-processing/9780596803346/