# Natural Language Processing

By:
<br> Ade Satya Wahana
<br> I Gede Yudi Paramartha

## The Nature of Language Processing

**Computers cannot directly understand text like humans.** 
For example, humans can automatically break down sentences into units of meaning, but computers cannot. Therefore, **text data must be processed before** various text analysis algorithms can use it.

Text data usually appears in an **unstructured form**. Text data exists **in the wild** and has not been converted into a structured format, like a spreadsheet. Therefore, it has to be manipulated and converted into a proper structured and numerical format consumable by text analysis algorithms, which is referred to as **text pre-processing.**
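
As a rough illustration of what "structured and numerical" can mean, the sketch below turns two toy sentences into rows of word counts that share one vocabulary. The sentences and variable names are made up for this example, and whitespace splitting stands in for the real pre-processing steps.

```python
from collections import Counter

# Two toy documents (illustrative only, not taken from any corpus).
docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize naively on whitespace and count the tokens in each document.
counts = [Counter(doc.split()) for doc in docs]

# Build a sorted vocabulary so every document shares the same columns.
vocab = sorted(set(word for c in counts for word in c))

# Each document becomes a row of counts aligned with the vocabulary.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row can now be fed to an algorithm that expects numbers rather than raw text.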

## Source of Text

The majority of text data appears in everyday sources such as:
1. Books
2. Newspapers
3. Magazines
4. Emails
5. Blogs
6. Tweets

## Accessing Various Text Resources

**Existing data repositories**
<br> Most of these contain corpora that have either been pre-processed into a specific format that downstream text analysis algorithms can consume directly, or been manually annotated.
1. __[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)__ contains 30 corpora that can be used in text mining tasks, such as regression, clustering, and classification.
2. __[Linguistic Data Consortium](https://www.ldc.upenn.edu/)__ contains corpora mainly used in various natural language processing tasks, such as parsing, acoustic analysis, phonological analysis, etc. One disadvantage of using the LDC is that its corpora are not free: users have to buy a license in order to use them.

__[NLTK](http://www.nltk.org/howto/corpus.html)__
<br> A language toolkit that also includes a diverse set of corpora and lexical resources.

1. Plain text corpora. __[The Gutenberg Corpus](https://github.com/aparrish/gutenberg-poetry-corpus)__  contains thousands of books.
2. Tagged Corpora. __[The Brown Corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html)__ is annotated with part-of-speech tags. Each word is now paired with its part-of-speech tag. You can retrieve words as (word, tag) tuples, rather than just bare word strings.
3. Chunked Corpora. __[The CoNLL corpora](https://www.kaggle.com/nltkdata/conll-corpora)__ include phrasal chunks (CoNLL 2000) and named entity chunks (CoNLL 2002).
4. Parsed Corpora. __[The Treebank corpora](https://www.aclweb.org/anthology/J93-2004.pdf)__ provide a syntactic parse for each sentence, like the Penn Treebank based on Wall Street Journal samples.
5. Word List and Lexicons. __[WordNet](https://wordnet.princeton.edu/)__ is a large lexical database of English, where nouns, verbs, adjectives and adverbs are organized into interlinked synsets (i.e., sets of synonyms).
6. Categorized Corpora. __[The Reuters corpus](https://trec.nist.gov/data/reuters/reuters.html)__ is a corpus of Reuters news stories for use in developing text analysis algorithms.

**Web**
<br> The largest source for getting text data is the Web. Text can be extracted from webpages or be retrieved via various APIs.
1. Wikipedia articles.
2. Tweets, which allow people to communicate with short messages (originally limited to 140 characters). Fortunately, Twitter provides a well-documented API that we can use to retrieve tweets of interest.
3. Other text data can be scraped from the Internet, for example from webpages. Here is a __[YouTube](https://www.youtube.com/watch?v=3xQTJi2tqgk)__ video on scraping websites with Python.
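
To give a feel for what extracting text from a webpage involves, here is a minimal sketch that pulls the visible text out of an HTML string using only the standard library. The class name `TextExtractor` and the sample page are made up for this example; a real scraper would first download the page and would usually rely on a dedicated library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

# A tiny hand-written page used in place of a downloaded one.
page = ("<html><head><style>p {color: red;}</style></head>"
        "<body><h1>Hello</h1><p>Text &amp; data</p></body></html>")

extractor = TextExtractor()
extractor.feed(page)
extractor.close()
page_text = " ".join(extractor.parts)
print(page_text)  # Hello Text & data
```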

## How to install NLTK

**Using Pip**
![NLTK_pip.png](attachment:NLTK_pip.png)

**Using Anaconda**

![NLTK_conda.png](attachment:NLTK_conda.png)
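
After running the install command (usually `pip install nltk` with pip, or `conda install nltk` with Anaconda), it can be useful to confirm that the package is visible to the current Python environment. The helper below is a small sketch written for this purpose; `is_installed` is a name chosen here, not part of NLTK.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None

print("nltk installed:", is_installed("nltk"))
```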

## Basic String

Textual data in Python is handled with str objects, or strings. 

Strings are immutable sequences of Unicode code points. String literals are written in a variety of ways:
![str.png](attachment:str.png)
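
The snippet below is a short sketch of the common ways to write string literals and of what immutability means in practice. The variable names are chosen for this example only.

```python
# Three common ways to write string literals.
single = 'allows embedded "double" quotes'
double = "allows embedded 'single' quotes"
triple = """spans
two lines"""

# Strings are immutable: assigning to a position raises TypeError.
try:
    single[0] = 'A'
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string really means building a new one.
changed = 'A' + single[1:]

print(mutated)  # False
print(changed)  # Allows embedded "double" quotes
```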

In [1]:
# Create a new variable named Name

Name = 'Ade Satya Wahana'

In [2]:
# Check the type of the variable Name

type(Name)

str

> **Case study**

> **Split**. Splits a string into a list.

In [3]:
# Split the string in Name using split()

Name.split()

['Ade', 'Satya', 'Wahana']

In [4]:
# Create a new variable named Date_of_birth

Date_of_birth = 'Jakarta, 6 Januari 2020'

In [5]:
# Split the string in Date_of_birth using split()
# Note the result of split()

Date_of_birth.split()

['Jakarta,', '6', 'Januari', '2020']

In [6]:
# Split the string in Date_of_birth using split()
# Pass a separator so that the string is split on the comma (,)
# Note the result and compare it with the previous split() output


Date_of_birth.split(',')

['Jakarta', ' 6 Januari 2020']

In [8]:
# Split the string in Date_of_birth using split()
# Limit the number of splits performed with the maxsplit parameter
# Note the result

Date_of_birth.split(maxsplit=2)

['Jakarta,', '6', 'Januari 2020']

> **Lowercase, uppercase, etc.**

In [9]:
# Convert every character in Name to lowercase

Name.lower()

'ade satya wahana'

In [10]:
# Convert every character in Name to uppercase

Name.upper()

'ADE SATYA WAHANA'

In [11]:
# Capitalize only the first character of the string in Name

Name.capitalize()

'Ade satya wahana'

In [12]:
# Capitalize the first letter of every word in Name

Name.title()

'Ade Satya Wahana'

>> **Quiz**. Split the sentence into words and capitalize each word.

In [13]:
# Create a new variable named sentence

sentence = 'saYa Ade Satya wahana, pegawai Itjen KEMENKEU'

In [16]:
# Combine title() and split()

sentence.title().split()

['Saya', 'Ade', 'Satya', 'Wahana,', 'Pegawai', 'Itjen', 'Kemenkeu']

> **Count**. Count the total number of words in a string, either with the built-in split() method or with a regular expression.

In [17]:
# Using split()

len(sentence.split())

7

In [18]:
# Import the regex module
# re is part of the Python standard library, so no installation is needed

import re

# r'\w+' matches each run of word characters (letters, digits, underscore)
len(re.findall(r'\w+', sentence))

7

> **List**. A list is written as comma-separated items enclosed in square brackets, and is usually assigned to a variable with the equals sign.

In [19]:
# Create a new variable named list1

list1 = ['Call', 'me', 'Ishmael', '.']

In [20]:
len(list1)

4

In [21]:
# Create a new variable named list2

list2 =['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

In [22]:
len(list2)

11

>> **Concatenation**. It combines lists into a single list. We can concatenate sentences to build up a text.

In [23]:
# Concatenate list1 and list2

new_list = list1 + list2

In [24]:
new_list

['Call',
 'me',
 'Ishmael',
 '.',
 'The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.']

In [25]:
len(new_list)

15

>> **Appending**. What if we want to add a single item to a list? This is known as appending. When we
append() to a list, the list itself is updated as a result of the operation.

In [26]:
# Append the string 'Next' to list1

list1.append('Next')

In [27]:
list1

['Call', 'me', 'Ishmael', '.', 'Next']

>> **Extend**. Iterates over its argument and adds each element to the list. The length of the list increases by the number of elements in its argument.

In [28]:
# Extend list1 with a list containing the string 'Friday'
# Note the difference between append and extend

list1.extend(['Friday'])

In [29]:
list1

['Call', 'me', 'Ishmael', '.', 'Next', 'Friday']
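
The difference between the two methods is easiest to see when the argument is itself a list. The sketch below uses two throwaway lists, `words_a` and `words_b`, created only for this comparison.

```python
words_a = ['Call', 'me']
words_b = ['Call', 'me']

words_a.append(['Ishmael', '.'])  # adds the whole list as ONE element
words_b.extend(['Ishmael', '.'])  # adds each element separately

print(words_a)  # ['Call', 'me', ['Ishmael', '.']]
print(words_b)  # ['Call', 'me', 'Ishmael', '.']
print(len(words_a), len(words_b))  # 3 4
```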

>> **Indexing**. We can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item’s index.

In [30]:
# Python indexing starts at 0, not 1
# Get the item at index [2] of list1

list1[2]

'Ishmael'

In [32]:
# Indexing also works with a range (a slice)
# list1[0:3] returns the items from index 0 up to, but not including, index 3
# Note that the end index is exclusive
# so the result contains the items at index 0, 1, and 2

list1[0:3]

['Call', 'me', 'Ishmael']

>> **Sorted**. Sorts the elements of a given iterable in a specific order (either ascending or descending) and returns the sorted iterable as a list.

In [33]:
sorted(list1)

['.', 'Call', 'Friday', 'Ishmael', 'Next', 'me']
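
Note that the default ordering places uppercase letters before lowercase ones, which is why 'me' comes last above. The sketch below shows two optional arguments of `sorted()` that are often handy with text; the list `tokens` is just a copy of the values above.

```python
tokens = ['.', 'Call', 'Friday', 'Ishmael', 'Next', 'me']

# key=str.lower compares the lowercase form, giving a case-insensitive order.
case_insensitive = sorted(tokens, key=str.lower)

# reverse=True returns the default order from last to first.
descending = sorted(tokens, reverse=True)

print(case_insensitive)  # ['.', 'Call', 'Friday', 'Ishmael', 'me', 'Next']
print(descending)        # ['me', 'Next', 'Ishmael', 'Friday', 'Call', '.']
```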

> __[Regular Expression](https://regex101.com/)__ (__[Regex](https://regexr.com/)__). A sequence of characters that defines a search pattern.

Basic tokens:

![regex.png](attachment:regex.png)

In [38]:
# Create a new variable named Phone_number

Phone_number = 'i Gede Yudi Paramartha, 091-999-777'

In [39]:
# Create a pattern that captures only the name

pattern_name = re.compile(r'[a-zA-Z]+')

In [40]:
# pattern_name is already compiled, so call findall() on it directly
pattern_name.findall(Phone_number)

['i', 'Gede', 'Yudi', 'Paramartha']

In [41]:
# Create a pattern that captures the digits of the phone number

pattern_phone_number = re.compile(r'\d+')

In [42]:
# pattern_phone_number is already compiled, so call findall() on it directly
pattern_phone_number.findall(Phone_number)

['091', '999', '777']
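
Instead of collecting the pieces separately, a single pattern with named groups can capture the name and the whole phone number in one pass. This is a sketch that assumes the record always has the form "name, ddd-ddd-ddd"; the group names `name` and `phone` are chosen for this example.

```python
import re

record = 'i Gede Yudi Paramartha, 091-999-777'

# (?P<name>...) captures letters and spaces before the comma;
# (?P<phone>...) captures three groups of three digits joined by hyphens.
pattern = re.compile(r'(?P<name>[A-Za-z ]+),\s*(?P<phone>\d{3}-\d{3}-\d{3})')

match = pattern.search(record)
if match is not None:
    name = match.group('name').strip()
    phone = match.group('phone')
    print(name)   # i Gede Yudi Paramartha
    print(phone)  # 091-999-777
```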

## Basic Steps of Pre-Processing Text

<br>

The possible steps of text pre-processing are nearly the same for all text analysis tasks, though which pre-processing steps are chosen depends on the specific task. The basic steps are as follows:
1. Text Cleaning
2. Sentence Segmentation
3. Tokenization
4. POS Tag
5. Case Normalization
6. Removing Stop words
7. Stemming and Lemmatization

**1. Text Data Cleaning**. We need to learn how to work with unstructured data to be able to extract relevant information from it and make it useful.

- Clear out HTML characters. Many HTML entities, such as `&apos;`, `&amp;`, and `&lt;`, can be found in data collected from the web. We need to remove or decode them.
- Removing URLs, hashtags, and styles. A Twitter dataset may contain hyperlinks, hashtags, or style markers such as retweet text. These usually provide no relevant information and can be removed. For hashtags, only the hash sign '#' is removed.
- Contraction replacement. The text might contain apostrophes used for contractions, for example "didn't" for "did not". Contractions can obscure the sense of a word or sentence, so we expand them into their standard forms.

**There is no single prescribed method for performing these steps.**

> **Case study**

> **Clear out HTML characters** using Python's `html` module and the BeautifulSoup library.

In [43]:
# Create a new variable named tweet
tweet = "1.2.3.4. I enjoyd the event which took place yesteday &amp; I luvd it ! The link to the show is http://t.co/4ftYom0i It is awsome you'll luv it #HadFun #Enjoyed BFN GN"

In [44]:
# Import the html module (part of the Python standard library, so no installation is needed)
import html

# Use html.unescape() to convert HTML entities back to plain characters
clean_tweet_html = html.unescape(tweet)

In [45]:
clean_tweet_html

"1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is http://t.co/4ftYom0i It is awsome you'll luv it #HadFun #Enjoyed BFN GN"

In [46]:
# Make sure the BeautifulSoup library (package beautifulsoup4) is installed
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Use BeautifulSoup to remove the HTML characters
clean_tweet_bs4 = BeautifulSoup(tweet, "html.parser").text

In [47]:
clean_tweet_bs4

"1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is http://t.co/4ftYom0i It is awsome you'll luv it #HadFun #Enjoyed BFN GN"

In [None]:
# Compare the two results: is there any difference?

> **Removing URLs, Hashtags and Styles** using Regular Expression (Regex).

In [48]:
tweet

"1.2.3.4. I enjoyd the event which took place yesteday &amp; I luvd it ! The link to the show is http://t.co/4ftYom0i It is awsome you'll luv it #HadFun #Enjoyed BFN GN"

In [49]:
# Import the regex module
# re is part of the Python standard library
import re

# Remove URLs that start with http:// or https://
clean_tweet_url = re.sub(r'https?://\S+', '', clean_tweet_bs4)

In [50]:
clean_tweet_url

"1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is  It is awsome you'll luv it #HadFun #Enjoyed BFN GN"

In [51]:
clean_tweet_hastag = re.sub(r'#', '', clean_tweet_url)

In [52]:
clean_tweet_hastag

"1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is  It is awsome you'll luv it HadFun Enjoyed BFN GN"

> **Contraction replacement** using a dictionary.

In [None]:
# The idea is to build a dictionary that maps each contraction to its replacement
# Such a dictionary can take many forms:
# a simple Python dictionary, an Excel or CSV file, and so on.
# Python lets us read all of these formats.

In [53]:
clean_tweet_hastag

"1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is  It is awsome you'll luv it HadFun Enjoyed BFN GN"

In [56]:
# Create a simple dictionary named tweet_dict
tweet_dict = {"n't":" not",
              "'m":" am",
              "'ll":" will", 
              "'d":" would",
              "'ve":" have",
              "'re":" are"}

# Apply the dictionary: start from the cleaned tweet and replace each contraction in turn,
# so that every replacement is kept rather than only the last one
clean_tweet_dict = clean_tweet_hastag
for key, value in tweet_dict.items():
    clean_tweet_dict = clean_tweet_dict.replace(key, value)

In [57]:
clean_tweet_dict

'1.2.3.4. I enjoyd the event which took place yesteday & I luvd it ! The link to the show is  It is awsome you will luv it HadFun Enjoyed BFN GN'
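
The three cleaning steps above can be chained into one reusable function. The following is a minimal sketch under a few assumptions: the name `clean_tweet`, the constant `CONTRACTIONS`, and the sample text are chosen for this example, and the final whitespace clean-up is an extra step that the case study does not perform.

```python
import html
import re

# Same contraction endings as in the case study above.
CONTRACTIONS = {"n't": " not", "'m": " am", "'ll": " will",
                "'d": " would", "'ve": " have", "'re": " are"}

def clean_tweet(text, contractions=CONTRACTIONS):
    """Decode HTML entities, drop URLs and '#', and expand contractions."""
    text = html.unescape(text)                # &amp; -> &
    text = re.sub(r'https?://\S+', '', text)  # remove URLs
    text = text.replace('#', '')              # keep the hashtag word, drop '#'
    for short, full in contractions.items():
        text = text.replace(short, full)      # you'll -> you will
    # Extra step (not in the case study): collapse repeated whitespace.
    return re.sub(r'\s+', ' ', text).strip()

sample = "I luvd it &amp; you'll luv it too http://t.co/4ftYom0i #HadFun"
cleaned = clean_tweet(sample)
print(cleaned)  # I luvd it & you will luv it too HadFun
```

Plain string replacement is a blunt tool: an ending such as `'d` can stand for "would" or "had", so a real project may need a richer contraction list.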

**2. Sentence Segmentation**

Sentence segmentation is also known as sentence boundary disambiguation or sentence boundary detection.

The following is the Wikipedia definition of sentence boundary disambiguation:
> Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

The accuracy of an SBD system directly affects the performance of the downstream tools that rely on sentence-level input.

> **Punkt Sentence Tokenizer**. The NLTK's [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html) was designed to split text into sentences "by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences." It contains a pre-trained sentence tokenizer for English.

> **Case Study**

In [58]:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [59]:
print(sent_detector)

<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x000002AA475B1D88>


In [60]:
text1 = '''And so it turned out; Mr. Hosea Hussey being from home, but leaving 
Mrs. Hussey entirely competent to attend to all his affairs. Upon making known our desires 
for a supper and a bed, Mrs. Hussey, postponing further scolding for the present, ushered us 
into a little room, and seating us at a table spread with the relics of a recently concluded repast, 
turned round to us and said—"Clam or Cod?"'''

print('\n-----\n'.join(sent_detector.tokenize(text1.strip())))

And so it turned out; Mr. Hosea Hussey being from home, but leaving 
Mrs. Hussey entirely competent to attend to all his affairs.
-----
Upon making known our desires 
for a supper and a bed, Mrs. Hussey, postponing further scolding for the present, ushered us 
into a little room, and seating us at a table spread with the relics of a recently concluded repast, 
turned round to us and said—"Clam or Cod?"


In [61]:
text2 = '''A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"'''
print('\n-----\n'.join(sent_detector.tokenize(text2.strip())))

A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?"
-----
says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
-----
Hussey?"
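
In the output above, the tokenizer breaks the text after "Mrs." because the period is ambiguous. One simple way to see how a list of known abbreviations helps is the hand-rolled splitter below. It is only a sketch: `split_sentences` and the `ABBREVIATIONS` set are made up for this example and are not how Punkt works internally.

```python
# Hypothetical, hand-picked abbreviation list (lowercase).
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}

def split_sentences(text):
    """Split on tokens ending in . ? or !, unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        bare = token.rstrip('"\'')  # ignore closing quotes after the punctuation
        ends_sentence = bare.endswith(('.', '?', '!'))
        if ends_sentence and bare.lower() not in ABBREVIATIONS:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

sample = 'A clam for supper? Is that what you mean, Mrs. Hussey? It is cold.'
for sentence in split_sentences(sample):
    print(sentence)
# A clam for supper?
# Is that what you mean, Mrs. Hussey?
# It is cold.
```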


**3. Tokenization**

The process of breaking a stream of text into tokens is often referred to as **tokenization**.
For example, a tokenizer turns a string such as 
```
    A data wrangler is the person performing the wrangling tasks.
```
into a sequence of tokens such as
```
    "A" "data" "wrangler" "is" "the" "person" "performing" "the" "wrangling" "tasks"
```

There is no single right way to do tokenization. It completely **depends on the corpus and the text analysis task you are going to perform**.
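
As one possible design among many, the sketch below tokenizes with a single regular expression that keeps hyphenated words and apostrophes together and emits each punctuation mark as its own token. The pattern and the name `simple_tokenize` are assumptions made for this example; unlike the sample output above, this version keeps the final period as a token.

```python
import re

# A word (optionally continued by -part or 'part), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("A data wrangler is the person performing the wrangling tasks."))
# ['A', 'data', 'wrangler', 'is', 'the', 'person', 'performing', 'the', 'wrangling', 'tasks', '.']
```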

> **Case study**

In [62]:
# Create a variable named raw
# The variable holds a string
# Any variable name would work; it does not have to be 'raw'
raw = """The GSO finace group in  U.S.A. provided Cole with about US$40,000,555.4 in funding, 
which accounts for 35.3% of Cole's revenue (i.e., AUD113.3m), 
as the ASX-listed firm battles for its survival. 
Mr. Johnson said GSO's recapitalisation meant "the current shares are worthless"."""

> **Split method**. The simplest approach might be using Python's string function split(). This function returns a list of tokens in the string.

In [63]:
# Split method
# Create a new variable named split_tokens
split_tokens = raw.split()
print(split_tokens)

['The', 'GSO', 'finace', 'group', 'in', 'U.S.A.', 'provided', 'Cole', 'with', 'about', 'US$40,000,555.4', 'in', 'funding,', 'which', 'accounts', 'for', '35.3%', 'of', "Cole's", 'revenue', '(i.e.,', 'AUD113.3m),', 'as', 'the', 'ASX-listed', 'firm', 'battles', 'for', 'its', 'survival.', 'Mr.', 'Johnson', 'said', "GSO's", 'recapitalisation', 'meant', '"the', 'current', 'shares', 'are', 'worthless".']


In [64]:
# Note that split_tokens, the result of the split, is a list
# To print the list members one by one, we can use a for loop
for word in split_tokens:
    print(word)

The
GSO
finace
group
in
U.S.A.
provided
Cole
with
about
US$40,000,555.4
in
funding,
which
accounts
for
35.3%
of
Cole's
revenue
(i.e.,
AUD113.3m),
as
the
ASX-listed
firm
battles
for
its
survival.
Mr.
Johnson
said
GSO's
recapitalisation
meant
"the
current
shares
are
worthless".


> **Regex method**. We can combine the split method with a regular expression, or use RegexpTokenizer from NLTK.

In [65]:
# Combine the split method with regex
# re is part of the Python standard library
# Import the regex module
import re

# Create a new variable named regex_tokens
regex_tokens = re.split(r"\s+", raw)

# r"\s+" is a regex that splits raw into words wherever one or more whitespace characters occur
# Regex is flexible, and the right pattern
# depends on our needs and on our understanding of the problem at hand

In [66]:
regex_tokens

['The',
 'GSO',
 'finace',
 'group',
 'in',
 'U.S.A.',
 'provided',
 'Cole',
 'with',
 'about',
 'US$40,000,555.4',
 'in',
 'funding,',
 'which',
 'accounts',
 'for',
 '35.3%',
 'of',
 "Cole's",
 'revenue',
 '(i.e.,',
 'AUD113.3m),',
 'as',
 'the',
 'ASX-listed',
 'firm',
 'battles',
 'for',
 'its',
 'survival.',
 'Mr.',
 'Johnson',
 'said',
 "GSO's",
 'recapitalisation',
 'meant',
 '"the',
 'current',
 'shares',
 'are',
 'worthless".']

In [67]:
# Regex using RegexpTokenizer
# Import RegexpTokenizer

from nltk.tokenize import RegexpTokenizer

In [68]:
# To inspect the parameters of RegexpTokenizer(), press Shift + Tab in Jupyter
# Criterion for RegexpTokenizer:
# same as the previous regex method, whitespace is used as the separator
tokenizer = RegexpTokenizer(r"\s+", gaps=True)

# Create a new variable named regexptokenizer_tokens
regexptokenizer_tokens = tokenizer.tokenize(raw)

In [69]:
regexptokenizer_tokens

['The',
 'GSO',
 'finace',
 'group',
 'in',
 'U.S.A.',
 'provided',
 'Cole',
 'with',
 'about',
 'US$40,000,555.4',
 'in',
 'funding,',
 'which',
 'accounts',
 'for',
 '35.3%',
 'of',
 "Cole's",
 'revenue',
 '(i.e.,',
 'AUD113.3m),',
 'as',
 'the',
 'ASX-listed',
 'firm',
 'battles',
 'for',
 'its',
 'survival.',
 'Mr.',
 'Johnson',
 'said',
 "GSO's",
 'recapitalisation',
 'meant',
 '"the',
 'current',
 'shares',
 'are',
 'worthless".']

In [None]:
# We can explore the other tokenizers in nltk.tokenize
# Link:
# https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize

> **Function method**. We can also wrap NLTK's `word_tokenize` in a small reusable function, so the same tokenization can be applied to any input string. Unlike whitespace splitting, `word_tokenize` separates punctuation, currency symbols, and possessive endings into their own tokens.

In [70]:
import nltk

In [71]:
def tokenizeRawData(data):
    """
        Tokenize the input data.
        The input is a string; in this example it is the variable raw.
    """
    # Tokenize with NLTK's word_tokenize
    function_tokens = nltk.tokenize.word_tokenize(data)
    return function_tokens

In [72]:
# Apply the function to the variable raw
tokenizeRawData(raw)

['The',
 'GSO',
 'finace',
 'group',
 'in',
 'U.S.A.',
 'provided',
 'Cole',
 'with',
 'about',
 'US',
 '$',
 '40,000,555.4',
 'in',
 'funding',
 ',',
 'which',
 'accounts',
 'for',
 '35.3',
 '%',
 'of',
 'Cole',
 "'s",
 'revenue',
 '(',
 'i.e.',
 ',',
 'AUD113.3m',
 ')',
 ',',
 'as',
 'the',
 'ASX-listed',
 'firm',
 'battles',
 'for',
 'its',
 'survival',
 '.',
 'Mr.',
 'Johnson',
 'said',
 'GSO',
 "'s",
 'recapitalisation',
 'meant',
 '``',
 'the',
 'current',
 'shares',
 'are',
 'worthless',
 "''",
 '.']

> **Quiz for tokenization**. Count word frequencies using word_tokenize.

In [73]:
# Use FreqDist from nltk.probability
# combined with word_tokenize from nltk.tokenize

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist


fdist = FreqDist()
for word in word_tokenize(raw):
    fdist[word] += 1

In [75]:
fdist

FreqDist({',': 3, 'GSO': 2, 'in': 2, 'Cole': 2, 'for': 2, "'s": 2, 'the': 2, '.': 2, 'The': 1, 'finace': 1, ...})

> **N-grams**. Besides unigrams that we have been working on so far,
N-grams of texts are also extensively used in various text analysis tasks.
They are basically contiguous sequences of `n` words from a given sequence of text.
When computing the n-grams you typically move a fixed size window of size n
words forward.

For example, for the sentence
```
'Laughter is like a windshield wiper'.
```
if N = 2 (known as bigrams), the n-grams would be:
```
'Laughter is', 'is like', 'like a', 'a windshield', 'windshield wiper'.
```
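
The sliding window described above can be written in a few lines of plain Python. This is a sketch for illustration; `word_ngrams` is a name chosen here, and whitespace splitting stands in for a proper tokenizer.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'Laughter is like a windshield wiper'.split()

bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
# ['Laughter is', 'is like', 'like a', 'a windshield', 'windshield wiper']
print(len(trigrams))  # 4
```

A sequence of k tokens yields k - n + 1 n-grams, so a text shorter than n tokens yields none.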

In [76]:
# Import ngrams from nltk.util
from nltk.util import ngrams

# Define a function that extracts n-grams
def extract_ngrams(data, num):
    """
        Extract the n-grams of a string.
        data is the input string (in this example, the variable raw); num is the n in n-gram.
    """
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]

In [81]:
# Example of n-grams
# n can be set to 1, 2, 3, and so on.
# It is worth reading the research on how the choice of n affects results.

# Bigram, n=2
extract_ngrams(raw, 2)

['The GSO',
 'GSO finace',
 'finace group',
 'group in',
 'in U.S.A.',
 'U.S.A. provided',
 'provided Cole',
 'Cole with',
 'with about',
 'about US',
 'US $',
 '$ 40,000,555.4',
 '40,000,555.4 in',
 'in funding',
 'funding ,',
 ', which',
 'which accounts',
 'accounts for',
 'for 35.3',
 '35.3 %',
 '% of',
 'of Cole',
 "Cole 's",
 "'s revenue",
 'revenue (',
 '( i.e.',
 'i.e. ,',
 ', AUD113.3m',
 'AUD113.3m )',
 ') ,',
 ', as',
 'as the',
 'the ASX-listed',
 'ASX-listed firm',
 'firm battles',
 'battles for',
 'for its',
 'its survival',
 'survival .',
 '. Mr.',
 'Mr. Johnson',
 'Johnson said',
 'said GSO',
 "GSO 's",
 "'s recapitalisation",
 'recapitalisation meant',
 'meant ``',
 '`` the',
 'the current',
 'current shares',
 'shares are',
 'are worthless',
 "worthless ''",
 "'' ."]

**4. POS Tag** 

The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags
used for a particular task is known as a tagset.

> **Case study**

> **POS tagging using NLTK**

In [82]:
# Use the pos_tag model from nltk
# Create a new variable named tagged_word
tagged_word = nltk.tag.pos_tag(tokenizeRawData(raw))

In [83]:
# Display the variable tagged_word
tagged_word

[('The', 'DT'),
 ('GSO', 'NNP'),
 ('finace', 'NN'),
 ('group', 'NN'),
 ('in', 'IN'),
 ('U.S.A.', 'NNP'),
 ('provided', 'VBD'),
 ('Cole', 'NNP'),
 ('with', 'IN'),
 ('about', 'IN'),
 ('US', 'NNP'),
 ('$', '$'),
 ('40,000,555.4', 'CD'),
 ('in', 'IN'),
 ('funding', 'NN'),
 (',', ','),
 ('which', 'WDT'),
 ('accounts', 'VBZ'),
 ('for', 'IN'),
 ('35.3', 'CD'),
 ('%', 'NN'),
 ('of', 'IN'),
 ('Cole', 'NNP'),
 ("'s", 'POS'),
 ('revenue', 'NN'),
 ('(', '('),
 ('i.e.', 'JJ'),
 (',', ','),
 ('AUD113.3m', 'NNP'),
 (')', ')'),
 (',', ','),
 ('as', 'IN'),
 ('the', 'DT'),
 ('ASX-listed', 'JJ'),
 ('firm', 'NN'),
 ('battles', 'VBZ'),
 ('for', 'IN'),
 ('its', 'PRP$'),
 ('survival', 'NN'),
 ('.', '.'),
 ('Mr.', 'NNP'),
 ('Johnson', 'NNP'),
 ('said', 'VBD'),
 ('GSO', 'NNP'),
 ("'s", 'POS'),
 ('recapitalisation', 'NN'),
 ('meant', 'VBD'),
 ('``', '``'),
 ('the', 'DT'),
 ('current', 'JJ'),
 ('shares', 'NNS'),
 ('are', 'VBP'),
 ('worthless', 'JJ'),
 ("''", "''"),
 ('.', '.')]

In [84]:
# Keep only the words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
# Create a new variable named filter_tag
filter_tag = [word for word, tag in tagged_word if tag.startswith('NN')]

In [85]:
# Display the variable filter_tag
filter_tag

['GSO',
 'finace',
 'group',
 'U.S.A.',
 'Cole',
 'US',
 'funding',
 '%',
 'Cole',
 'revenue',
 'AUD113.3m',
 'firm',
 'survival',
 'Mr.',
 'Johnson',
 'GSO',
 'recapitalisation',
 'shares']
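
Because the filter uses `startswith('NN')`, it keeps every noun tag, not only `NN`. The sketch below contrasts the two filters and also groups words by tag. To stay runnable without the tagger, it uses a short hand-copied excerpt of the tagged output above; `tagged_sample` is a name chosen for this example.

```python
from collections import defaultdict

# A short excerpt copied from the tagged output above.
tagged_sample = [('The', 'DT'), ('GSO', 'NNP'), ('finace', 'NN'), ('group', 'NN'),
                 ('provided', 'VBD'), ('shares', 'NNS'), ('are', 'VBP')]

# Exact match keeps only singular common nouns.
exact_nn = [word for word, tag in tagged_sample if tag == 'NN']

# Prefix match keeps all noun tags: NN, NNS, NNP, NNPS.
any_noun = [word for word, tag in tagged_sample if tag.startswith('NN')]

# Group the words under each tag.
words_by_tag = defaultdict(list)
for word, tag in tagged_sample:
    words_by_tag[tag].append(word)

print(exact_nn)            # ['finace', 'group']
print(any_noun)            # ['GSO', 'finace', 'group', 'shares']
print(words_by_tag['NN'])  # ['finace', 'group']
```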

In [None]:
# To understand what the tags mean,
# see the following links:
#https://stackoverflow.com/questions/29332851/what-does-nn-vbd-in-dt-nns-rb-means-in-nltk
#https://cs.nyu.edu/grishman/jet/guide/PennPOS.html

> **Abbreviation**. To deal with abbreviations, one approach is to maintain a look-up list of known abbreviations during tokenization. Another approach aims for smart tokenization. Here we will show you how to use regular expressions to cover most but not all abbreviations.

In [86]:
# Find abbreviations that follow the pattern: single letters (upper- or lowercase), each followed by a period.

tokenizer = RegexpTokenizer(r"(?:[a-zA-Z]\.)+")
tokenizer.tokenize(raw)

['U.S.A.', 'i.e.', 'l.', 'r.']

In [87]:
# Find abbreviation candidates that follow the pattern: a word of two or more letters ending with a period.

tokenizer = RegexpTokenizer(r"[a-zA-Z]{2,}\.")
tokenizer.tokenize(raw)

['survival.', 'Mr.']
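
The regular expressions above also pick up false positives such as 'survival.', 'l.', and 'r.'. The look-up list approach mentioned earlier avoids this by separating a trailing period only when the token is not a known abbreviation. The sketch below assumes a small hand-made list; `KNOWN_ABBREVIATIONS` and `split_trailing_period` are names invented for this example.

```python
# Hypothetical look-up list; a real one would be much longer.
KNOWN_ABBREVIATIONS = {'Mr.', 'Mrs.', 'U.S.A.', 'i.e.', 'e.g.'}

def split_trailing_period(token):
    """Detach a final period unless the token is a known abbreviation."""
    if len(token) > 1 and token.endswith('.') and token not in KNOWN_ABBREVIATIONS:
        return [token[:-1], '.']
    return [token]

text = 'Mr. Johnson moved to the U.S.A. for its survival.'

tokens = []
for raw_token in text.split():
    tokens.extend(split_trailing_period(raw_token))

print(tokens)
# ['Mr.', 'Johnson', 'moved', 'to', 'the', 'U.S.A.', 'for', 'its', 'survival', '.']
```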

**5. Case Normalization**

After word tokenization, you may find that words can contain either upper- or lowercase letters.

For example, you might have "data" and "Data" appearing in the same text.
Should one treat them as two different words or as the same word?

Most English texts are written in mixed case. 
In other words, a text can contain both upper- and lowercase letters.
Capitalization helps readers differentiate, for example, between nouns and proper nouns.
In many circumstances, however, an uppercase word should be treated no differently from its lowercase form, whether it appears within a single document or across a corpus.

Therefore, a common strategy is **to reduce all letters in a word to lower case.**
It is very simple to do so.
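
The effect on word counts can be seen in the short sketch below, where three spellings of the same word collapse into one entry after lowercasing. The token list is made up for this example. The last lines show `str.casefold()`, a stricter variant of `lower()` that can matter for some non-English text.

```python
from collections import Counter

# Made-up tokens containing three spellings of the same word.
tokens = ['Data', 'wrangling', 'cleans', 'data', 'before', 'DATA', 'analysis']

raw_counts = Counter(tokens)
normalized_counts = Counter(token.lower() for token in tokens)

print(raw_counts['data'])         # 1
print(normalized_counts['data'])  # 3

# casefold() goes further than lower() for some characters.
print('Straße'.lower())     # straße
print('Straße'.casefold())  # strasse
```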

> **Case study**

> **islower()**. This method returns True if all cased characters in the string are lowercase and there is at least one cased character; otherwise it returns False.

In [88]:
raw.islower()
# This means that raw contains at least one character that is not lowercase

False

> **lower()**. Returns a new string in which all cased characters are converted to lowercase; the original string is left unchanged.

In [89]:
raw.lower()

'the gso finace group in  u.s.a. provided cole with about us$40,000,555.4 in funding, \nwhich accounts for 35.3% of cole\'s revenue (i.e., aud113.3m), \nas the asx-listed firm battles for its survival. \nmr. johnson said gso\'s recapitalisation meant "the current shares are worthless".'

> **lower() combined with tokenization**. Create a new variable that contains the lowercase tokens.

In [90]:
# Create a new variable named lowercase_tokens
# token.lower() applies lower() to each word (token) in regexptokenizer_tokens (the variable created earlier) on every iteration
lowercase_tokens = [token.lower() for token in regexptokenizer_tokens]

In [91]:
lowercase_tokens

['the',
 'gso',
 'finace',
 'group',
 'in',
 'u.s.a.',
 'provided',
 'cole',
 'with',
 'about',
 'us$40,000,555.4',
 'in',
 'funding,',
 'which',
 'accounts',
 'for',
 '35.3%',
 'of',
 "cole's",
 'revenue',
 '(i.e.,',
 'aud113.3m),',
 'as',
 'the',
 'asx-listed',
 'firm',
 'battles',
 'for',
 'its',
 'survival.',
 'mr.',
 'johnson',
 'said',
 "gso's",
 'recapitalisation',
 'meant',
 '"the',
 'current',
 'shares',
 'are',
 'worthless".']

**6. Removing Stop Words**

[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are words that are extremely common and carry little lexical content. For many NLP and text mining tasks, it is useful to remove stopwords in order to save storage space 
and speed up processing, and the process of removing these words is usually called “stopping.”
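
To make the "stopping" step concrete, the sketch below filters a token list against a stop word set and then extends the set with one extra word. The set `STOP_WORDS` is a tiny hand-picked subset used only for illustration, and `remove_stop_words` is a name chosen for this example; in practice the set would come from a library list.

```python
# Tiny hand-picked subset, for illustration only.
STOP_WORDS = {'the', 'is', 'a', 'of', 'for', 'in', 'and', 'to'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = 'The firm battles for its survival in the market'.split()

filtered = remove_stop_words(tokens)

# Extending the list: add 'its' as an extra stop word.
custom_stop_words = STOP_WORDS | {'its'}
filtered_custom = remove_stop_words(tokens, custom_stop_words)

print(filtered)         # ['firm', 'battles', 'its', 'survival', 'market']
print(filtered_custom)  # ['firm', 'battles', 'survival', 'market']
```

Using a set rather than a list keeps each membership check fast, which matters when filtering a large corpus.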

> **Case study**

> **Stopwords for English words**. We can use the stopwords module from nltk.corpus. We can also extend its list by adding extra words.

In [92]:
# NLTK ships a corpus of English words that are commonly removed as stop words
# These are typically pronouns, conjunctions, and similar function words
# The module is named 'stopwords'

from nltk.corpus import stopwords

# Create a new variable named english_stopwords
english_stopwords = stopwords.words('english')

In [93]:
english_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [94]:
# The stopwords module supports many languages,
# not only English
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [95]:
indonesian_stopwords = stopwords.words('indonesian')

In [98]:
indonesian_stopwords

['ada',
 'adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut'

> **Stopwords for Indonesian words**. The Sastrawi library ships its own Indonesian stopword list, available through StopWordRemoverFactory from Sastrawi.StopWordRemover.StopWordRemoverFactory.

In [99]:
# Sastrawi also provides an Indonesian stopword module named 'StopWordRemoverFactory'
# make sure the package is installed first

# install the package
# https://pypi.org/project/Sastrawi/
# pip install sastrawi
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# instantiate the factory
factory = StopWordRemoverFactory()

# create a new variable named indonesian_stopwords_sastrawi
indonesian_stopwords_sastrawi = factory.get_stop_words()

In [100]:
indonesian_stopwords_sastrawi

['yang',
 'untuk',
 'pada',
 'ke',
 'para',
 'namun',
 'menurut',
 'antara',
 'dia',
 'dua',
 'ia',
 'seperti',
 'jika',
 'jika',
 'sehingga',
 'kembali',
 'dan',
 'tidak',
 'ini',
 'karena',
 'kepada',
 'oleh',
 'saat',
 'harus',
 'sementara',
 'setelah',
 'belum',
 'kami',
 'sekitar',
 'bagi',
 'serta',
 'di',
 'dari',
 'telah',
 'sebagai',
 'masih',
 'hal',
 'ketika',
 'adalah',
 'itu',
 'dalam',
 'bisa',
 'bahwa',
 'atau',
 'hanya',
 'kita',
 'dengan',
 'akan',
 'juga',
 'ada',
 'mereka',
 'sudah',
 'saya',
 'terhadap',
 'secara',
 'agar',
 'lain',
 'anda',
 'begitu',
 'mengapa',
 'kenapa',
 'yaitu',
 'yakni',
 'daripada',
 'itulah',
 'lagi',
 'maka',
 'tentang',
 'demi',
 'dimana',
 'kemana',
 'pula',
 'sambil',
 'sebelum',
 'sesudah',
 'supaya',
 'guna',
 'kah',
 'pun',
 'sampai',
 'sedangkan',
 'selagi',
 'sementara',
 'tetapi',
 'apakah',
 'kecuali',
 'sebab',
 'selain',
 'seolah',
 'seraya',
 'seterusnya',
 'tanpa',
 'agak',
 'boleh',
 'dapat',
 'dsb',
 'dst',
 'dll',
 'dahulu

In [101]:
len(indonesian_stopwords_sastrawi)

126
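
The list printed above contains some entries more than once ('jika' and 'sementara' each appear twice), so its length overstates the number of distinct stopwords. The following is a minimal sketch of how such repeats can be spotted. It uses a short stand-in list rather than the real Sastrawi output, and the names sample_stopwords_id and repeated_words are hypothetical.

```python
from collections import Counter

# short stand-in list (assumed for illustration, not the full Sastrawi list)
sample_stopwords_id = ['yang', 'untuk', 'jika', 'jika', 'sementara', 'tetapi', 'sementara']

# count how often each word occurs and keep the ones that occur more than once
counts = Counter(sample_stopwords_id)
repeated_words = [word for word, count in counts.items() if count > 1]

print(repeated_words)
# length of the raw list versus the number of distinct words
print(len(sample_stopwords_id), len(set(sample_stopwords_id)))
```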

> **Combined modules**. We can extend the stopword list by combining the lists from several modules or by adding words manually.

In [102]:
list_extend = ['namun', 'jika', 'sedang', 'sedang']

In [103]:
# apply extend()
# combine english_stopwords with list_extend (extend() modifies english_stopwords in place)

english_stopwords.extend(list_extend)

In [104]:
len(english_stopwords)

# note the length of the list after extend()

183

In [105]:
new_stopwords = list(set(english_stopwords))

In [106]:
len(new_stopwords)

# note the length of new_stopwords after passing the extended list through set(), which drops the duplicate entry

182
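
The extend() and set() steps above can be folded into one helper. The following is a minimal sketch, not part of the original notebook: merge_stopwords is a hypothetical helper name, and the two short lists stand in for the real NLTK and manually entered lists. It relies on dict.fromkeys so that duplicates are dropped while the first-seen order is kept, which set() does not guarantee, and it leaves the input lists unmodified.

```python
def merge_stopwords(*word_lists):
    """Merge several stopword lists, dropping duplicates but keeping first-seen order."""
    merged = {}
    for words in word_lists:
        # dict keys are unique and keep insertion order (Python 3.7+)
        merged.update(dict.fromkeys(words))
    return list(merged)


# small stand-in lists (assumed for illustration)
base_stopwords = ['the', 'and', 'of']
extra_stopwords = ['namun', 'jika', 'sedang', 'sedang', 'the']

combined_stopwords = merge_stopwords(base_stopwords, extra_stopwords)
print(combined_stopwords)
print(len(combined_stopwords))
```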

> **Implementation of stopwords**.

In [107]:
# quick recap:
# the raw text has already been tokenized and lowercased into lowercase_tokens
# the stopwords are lowercase as well,
# so tokens and stopwords can be matched directly

# create an empty list
# to hold the words that remain after stopword filtering
filtered_word = []

# loop over the tokens and keep those that are not stopwords
for word in lowercase_tokens: 
    if word not in english_stopwords: 
        filtered_word.append(word)

In [108]:
# compare the token count before and after stopword removal
len(lowercase_tokens)

41

In [109]:
len(filtered_word)

28

In [110]:
filtered_word

['gso',
 'finace',
 'group',
 'u.s.a.',
 'provided',
 'cole',
 'us$40,000,555.4',
 'funding,',
 'accounts',
 '35.3%',
 "cole's",
 'revenue',
 '(i.e.,',
 'aud113.3m),',
 'asx-listed',
 'firm',
 'battles',
 'survival.',
 'mr.',
 'johnson',
 'said',
 "gso's",
 'recapitalisation',
 'meant',
 '"the',
 'current',
 'shares',
 'worthless".']
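
The filtering loop can also be written as a list comprehension over a set, which makes each membership test constant-time on average instead of scanning a list. The following is a minimal sketch with assumed stand-in data: sample_tokens and sample_stopwords are hypothetical and much shorter than lowercase_tokens and english_stopwords.

```python
# stand-in data (assumed for illustration)
sample_tokens = ['the', 'gso', 'finace', 'group', 'in', 'u.s.a.',
                 'provided', 'cole', 'with', 'about']
sample_stopwords = ['the', 'in', 'with', 'about']

# convert the stopword list to a set once, then filter in a single pass
stopword_set = set(sample_stopwords)
filtered_sample = [word for word in sample_tokens if word not in stopword_set]

print(filtered_sample)
```

Matching is exact, so a token that still carries punctuation, such as '"the' in the output above, is not recognised as the stopword 'the' and stays in the result.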

**Stemming and Lemmatization**

Another question in text pre-processing is whether we want to keep word forms like "educate", "educated", "educating", 
and "educates" separate or to collapse them. Grouping such forms together and working in terms of their base form is 
usually known as stemming or lemmatization.

Typically the stemming process includes the identification and removal of prefixes, suffixes, and pluralisation, 
and leaves you with a stem.

Lemmatization is a more advanced form of stemming that makes use of, for example, the context surrounding the words, 
an existing vocabulary, morphological analysis of words and other grammatical information (e.g., part-of-speech tags) 
to determine the basic or dictionary form of a word, which is known as the lemma.
See Wikipedia entries for [stemming](https://en.wikipedia.org/wiki/Stemming) 
and [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation).
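
The difference between the two approaches can be illustrated with a toy sketch. Everything in it is assumed for illustration: toy_stem strips a few hard-coded suffixes, and TOY_LEMMAS is a tiny hand-written dictionary keyed by word and part-of-speech tag. Neither is the algorithm used by NLTK.

```python
SUFFIXES = ('ing', 'ed', 'es', 's')  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# hand-written (word, part-of-speech) -> lemma table, assumed for illustration
TOY_LEMMAS = {
    ('educated', 'v'): 'educate',
    ('educating', 'v'): 'educate',
    ('educates', 'v'): 'educate',
    ('meeting', 'v'): 'meet',
    ('meeting', 'n'): 'meeting',
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


for word in ['educated', 'educating', 'educates']:
    print(word, '->', toy_stem(word), '|', toy_lemmatize(word, 'v'))

# the same surface form can map to different lemmas depending on its part of speech
print(toy_lemmatize('meeting', 'v'), toy_lemmatize('meeting', 'n'))
```

The stem 'educat' is not a dictionary word, while the lemma 'educate' is; and only the lemmatizer can tell the verb 'meeting' from the noun 'meeting', because it is given the part-of-speech tag.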

Stemming and lemmatization are basic text pre-processing methods for texts in languages such as English, French, 
and German. 
In English, nouns are inflected for the plural, verbs are inflected for the various tenses, and adjectives are 
inflected for the comparative and superlative. 

For example,
* watch &#8594; watches
* party &#8594; parties
* carry &#8594; carrying
* love &#8594; loving
* stop &#8594; stopped
* wet &#8594; wetter
* fat &#8594; fattest
* die &#8594; dying
* meet &#8594; meeting

In morphology, the derivation process creates a new word out of an existing one, often by adding either 
a prefix or a suffix. It brings considerable semantic changes to the word, and the word class often changes as well.

For example,
* dark &#8594; darkness
* agree &#8594; agreement
* friend &#8594; friendship
* derivation &#8594; derivational

NLTK provides interfaces to several well-known stemmers, such as

* Porter Stemmer, which is based on 
[The Porter Stemming Algorithm](http://tartarus.org/martin/PorterStemmer/)
* Lancaster Stemmer, which is based on 
[The Lancaster Stemming Algorithm](http://delivery.acm.org/10.1145/110000/101310/p56-paice.pdf?ip=130.194.73.168&id=101310&acc=ACTIVE%20SERVICE&key=65D80644F295BC0D%2E54DA4E88E6052E5D%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=586402953&CFTOKEN=41173049&__acm__=1456460730_26a9cd5f8f70e5d3e101f527c10e1a82),
* Snowball Stemmer, which is based on [the Snowball Stemming Algorithm](http://snowball.tartarus.org/)

> **Case study**

> **Porter Stemmer**

In [111]:
# import PorterStemmer from nltk.stem
from nltk.stem import PorterStemmer

# apply PorterStemmer to every token
porter_stemmer = PorterStemmer()
['{0} -> {1}'.format(word, porter_stemmer.stem(word)) for word in lowercase_tokens]

['the -> the',
 'gso -> gso',
 'finace -> finac',
 'group -> group',
 'in -> in',
 'u.s.a. -> u.s.a.',
 'provided -> provid',
 'cole -> cole',
 'with -> with',
 'about -> about',
 'us$40,000,555.4 -> us$40,000,555.4',
 'in -> in',
 'funding, -> funding,',
 'which -> which',
 'accounts -> account',
 'for -> for',
 '35.3% -> 35.3%',
 'of -> of',
 "cole's -> cole'",
 'revenue -> revenu',
 '(i.e., -> (i.e.,',
 'aud113.3m), -> aud113.3m),',
 'as -> as',
 'the -> the',
 'asx-listed -> asx-list',
 'firm -> firm',
 'battles -> battl',
 'for -> for',
 'its -> it',
 'survival. -> survival.',
 'mr. -> mr.',
 'johnson -> johnson',
 'said -> said',
 "gso's -> gso'",
 'recapitalisation -> recapitalis',
 'meant -> meant',
 '"the -> "the',
 'current -> current',
 'shares -> share',
 'are -> are',
 'worthless". -> worthless".']

> **Lancaster Stemmer**

In [112]:
# import LancasterStemmer from nltk.stem
from nltk.stem import LancasterStemmer

# apply LancasterStemmer to every token
lacancaster_stemmer = LancasterStemmer()
['{0} -> {1}'.format(word, lacancaster_stemmer.stem(word)) for word in lowercase_tokens]

['the -> the',
 'gso -> gso',
 'finace -> finac',
 'group -> group',
 'in -> in',
 'u.s.a. -> u.s.a.',
 'provided -> provid',
 'cole -> col',
 'with -> with',
 'about -> about',
 'us$40,000,555.4 -> us$40,000,555.4',
 'in -> in',
 'funding, -> funding,',
 'which -> which',
 'accounts -> account',
 'for -> for',
 '35.3% -> 35.3%',
 'of -> of',
 "cole's -> cole's",
 'revenue -> revenu',
 '(i.e., -> (i.e.,',
 'aud113.3m), -> aud113.3m),',
 'as -> as',
 'the -> the',
 'asx-listed -> asx-listed',
 'firm -> firm',
 'battles -> battl',
 'for -> for',
 'its -> it',
 'survival. -> survival.',
 'mr. -> mr.',
 'johnson -> johnson',
 'said -> said',
 "gso's -> gso's",
 'recapitalisation -> recapit',
 'meant -> meant',
 '"the -> "the',
 'current -> cur',
 'shares -> shar',
 'are -> ar',
 'worthless". -> worthless".']

In [None]:
# other stemming methods are worth trying as well
# see the link
# http://www.nltk.org/api/nltk.stem.html
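
The Snowball stemmer, listed earlier among the stemmers NLTK provides, can be tried in the same way. The following is a minimal sketch, assuming NLTK is installed: sample_words is a hypothetical word list chosen for illustration, and SnowballStemmer takes the language name as its argument.

```python
from nltk.stem import SnowballStemmer

# the Snowball stemmer is created per language
snowball_stemmer = SnowballStemmer('english')

# hypothetical sample words, chosen for illustration
sample_words = ['running', 'provided', 'battles']
snowball_results = {word: snowball_stemmer.stem(word) for word in sample_words}
print(snowball_results)

# languages supported by the Snowball stemmer in NLTK
print(SnowballStemmer.languages)
```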

> **Sastrawi Stemmer**

In [113]:
article = "Toyota dan Daihatsu resmi memperkenalkan Toyota Raize dan Daihatsu Rocky. Keduanya mengumumkan bentuk kolaborasi baru dengan meluncurkan Raize dan Rocky. President of Daihatsu Motor Company, Soichiro Okudaira, mengatakan Daihatsu Rocky dan Toyota Raize akan diproduksi di pabrik PT Astra Daihatsu Motor. Mobil ini akan tersedia di showroom mulai 30 April mendatang."

In [114]:
# make sure the Sastrawi library is installed
# import StemmerFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# create the stemmer through StemmerFactory
factory = StemmerFactory()
sastrawi_stemmer = factory.create_stemmer()

# show the results
['{0} -> {1}'.format(word, sastrawi_stemmer.stem(word)) for word in article.split()]

['Toyota -> toyota',
 'dan -> dan',
 'Daihatsu -> daihatsu',
 'resmi -> resmi',
 'memperkenalkan -> kenal',
 'Toyota -> toyota',
 'Raize -> raize',
 'dan -> dan',
 'Daihatsu -> daihatsu',
 'Rocky. -> rocky',
 'Keduanya -> dua',
 'mengumumkan -> umum',
 'bentuk -> bentuk',
 'kolaborasi -> kolaborasi',
 'baru -> baru',
 'dengan -> dengan',
 'meluncurkan -> luncur',
 'Raize -> raize',
 'dan -> dan',
 'Rocky. -> rocky',
 'President -> president',
 'of -> of',
 'Daihatsu -> daihatsu',
 'Motor -> motor',
 'Company, -> company',
 'Soichiro -> soichiro',
 'Okudaira, -> okudaira',
 'mengatakan -> kata',
 'Daihatsu -> daihatsu',
 'Rocky -> rocky',
 'dan -> dan',
 'Toyota -> toyota',
 'Raize -> raize',
 'akan -> akan',
 'diproduksi -> produksi',
 'di -> di',
 'pabrik -> pabrik',
 'PT -> pt',
 'Astra -> astra',
 'Daihatsu -> daihatsu',
 'Motor. -> motor',
 'Mobil -> mobil',
 'ini -> ini',
 'akan -> akan',
 'tersedia -> sedia',
 'di -> di',
 'showroom -> showroom',
 'mulai -> mulai',
 '30 -> 30',
 