<a href="https://colab.research.google.com/github/dhitology/tm-python/blob/master/001_Text_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Muhammad Apriandito - Technaut Education*


---






# **Text Preprocessing**

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

## **English Text Preprocessing**

Here we preprocess the sample sentence below.

In [None]:
# Input English Text
text_en = 'The death toll from the coronavirus has reached 28 in South Korea with 600 newly confirmed cases, raising the national tally to 4,812 cases, the South Korean Centers for Disease Control and Prevention (KCDC) said in a news release Tuesday.'
text_en

### **Remove Symbol and Character**

Remove punctuation symbols from the text and convert it to lowercase.

In [None]:
# Import Library
import string 

# Remove Symbol and Character
text_en_nosymbol = text_en.translate(str.maketrans('','',string.punctuation)).lower()
text_en_nosymbol
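
As a minimal sketch of what the cell above does, the snippet below applies the same `str.maketrans` / `translate` combination to a short sample string. The `sample` text is an illustrative fragment, not part of the original notebook.

```python
import string

# str.maketrans('', '', chars) builds a table that maps every character
# in `chars` to None, so translate() deletes those characters.
table = str.maketrans('', '', string.punctuation)

# Illustrative sample (an assumption, shortened from the input text)
sample = 'raising the national tally to 4,812 cases (KCDC).'
cleaned = sample.translate(table).lower()
print(cleaned)  # raising the national tally to 4812 cases kcdc

# Hyphens are punctuation too, so hyphenated words get merged
print('state-of-the-art'.translate(table))  # stateoftheart
```

Note that the comma inside `4,812` is removed as well, so the number becomes `4812`, and hyphenated words are joined. Whether that is acceptable depends on the downstream task.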

###  **Tokenization**

Tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc. We can do sentence tokenization and word tokenization.
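
To make the idea concrete, here is a rough sketch of both kinds of tokenization using only the standard `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are far simpler than those of a trained tokenizer such as NLTK's.

```python
import re

def simple_sent_tokenize(text):
    # Split at whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r'\w+|[^\w\s]', text)

doc = 'Cases rose again. Officials said more tests are planned!'
sentences = simple_sent_tokenize(doc)
words = simple_word_tokenize(sentences[0])

print(sentences)  # ['Cases rose again.', 'Officials said more tests are planned!']
print(words)      # ['Cases', 'rose', 'again', '.']
```

A rule this naive breaks on abbreviations such as "Dr." or "U.S.", which is one reason to prefer a library tokenizer in practice.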

#### **Sentence Tokenization**

In [None]:
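# Import Module
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt tokenizer data (newer NLTK releases use 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize Sentence (use the original text: sentence boundaries rely on punctuation)
text_en_tokenizeds = sent_tokenize(text_en)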

# Show Tokenized Sentence
text_en_tokenizeds

#### **Word Tokenization**

In [None]:
# Word Tokenization

# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_en_tokenizedw = word_tokenize(text_en_nosymbol)

# Show Tokenized Words
text_en_tokenizedw

### **Remove Stopword**

Stop words are usually the most common words in a language, typically short function words such as *the*, *is*, *at*, *which*, and *on*. They are often removed because they carry little meaning on their own. Removing them can cause problems, however, when searching for phrases that include them, particularly names such as "The Who", "The The", or "Take That".

In [None]:
# Download English Stopwords from NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Show Stopwords
stopwords_en = stopwords.words('english')
stopwords_en

In [None]:
# Removing Stopwords
text_en_filtered =[]
for w in text_en_tokenizedw:
    if w not in stopwords_en:
        text_en_filtered.append(w)

# Show Tokenized vs Filtered
print('Tokenized:',text_en_tokenizedw)
print('Filtered:',text_en_filtered)
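
The same filtering can be sketched as a small reusable function. The `toy_stopwords` set and the `remove_stopwords` helper below are hypothetical and only for illustration; the set is not NLTK's list. Using a `set` keeps each membership test fast.

```python
# Hypothetical, tiny stop-word set for illustration only (not NLTK's list)
toy_stopwords = {'the', 'is', 'at', 'which', 'on', 'in', 'a'}

def remove_stopwords(tokens, stopwords):
    # Keep a token only if its lowercase form is not a stop word.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ['the', 'death', 'toll', 'is', 'rising', 'in', 'south', 'korea']
filtered = remove_stopwords(tokens, toy_stopwords)
print(filtered)  # ['death', 'toll', 'rising', 'south', 'korea']

# The pitfall with names: the band name loses its first word
band = remove_stopwords(['The', 'Who'], toy_stopwords)
print(band)  # ['Who']
```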

### **Text Normalization**

Text normalization handles another type of noise in the text: different surface forms of the same word. For example, *connection*, *connected*, and *connecting* all reduce to the common word "connect". Normalization reduces inflected and derivationally related forms of a word to a common root.

Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing. 
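
As a toy illustration of the "connect" example, the sketch below strips a few hand-picked suffixes. The `SUFFIXES` tuple and `toy_stem` function are assumptions made for this example. This is not the Porter algorithm, which applies a much richer set of ordered rules.

```python
# Hand-picked suffixes for this example only; checked in this order
SUFFIXES = ('ing', 'ion', 'ed', 's')

def toy_stem(word, min_stem=3):
    # Strip the first matching suffix, keeping at least `min_stem` characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

variants = ['connection', 'connected', 'connecting', 'connects']
stems = [toy_stem(w) for w in variants]
print(stems)  # ['connect', 'connect', 'connect', 'connect']

# Crude rules can produce non-words
print(toy_stem('parties'))  # partie
```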

#### **Stemming**

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. The stem does not have to be a valid word. Example: consulting -> consult, parties -> parti

In [None]:
# Import Modules
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Set Stemming Function
stemmer = PorterStemmer()
text_en_stemmed =[]
for i in text_en_filtered:
    text_en_stemmed.append(stemmer.stem(i))

# Show
print('Filtered:',text_en_filtered)
print('Stemmed:',text_en_stemmed)

#### **Lemmatization**

Lemmatization, unlike stemming, reduces inflected words to a root form (the lemma) that is a valid word in the language. Example: better -> good, parties -> party

*Let's compare stemming vs lemmatization!*

In [None]:
# Import Modules
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Try a Sample Word
word = 'flying'
print('Stemmed:',stemmer.stem(word))
print('Lemmatized:',lemmatizer.lemmatize(word,'v'))
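
The difference between the two approaches can also be sketched without any library: a stemmer applies string rules, while a lemmatizer looks words up in a vocabulary. The `LEMMAS` dictionary and both helper functions are hypothetical stand-ins. A real lemmatizer such as WordNet's relies on a far larger lexicon and on part-of-speech information.

```python
# Hypothetical mini-dictionary standing in for a real lexicon
LEMMAS = {
    'flying': 'fly', 'flies': 'fly', 'flew': 'fly',
    'parties': 'party', 'better': 'good',
}

def toy_lemmatize(word):
    # Dictionary lookup: unknown words are returned unchanged.
    return LEMMAS.get(word, word)

def strip_ing(word):
    # Rule-based stand-in for a stemmer: only removes a trailing 'ing'.
    return word[:-3] if word.endswith('ing') else word

for w in ['flying', 'flew', 'parties']:
    print(w, '->', strip_ing(w), '|', toy_lemmatize(w))
# flying -> fly | fly
# flew -> flew | fly
# parties -> parties | party
```

Irregular forms such as "flew" are where lookup-based lemmatization helps most, since no suffix rule can recover "fly" from it.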

*Lemmatizing the text*

In [None]:
# Import Modules
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Set Lemmatization Function
lemmatizer = WordNetLemmatizer()
text_en_lemmatized = [lemmatizer.lemmatize(i, pos='v') for i in text_en_filtered]

# Show
print('Filtered:',text_en_filtered)
print('Stemmed:',text_en_stemmed)
print('Lemmatized:', text_en_lemmatized)

## **Indonesian Text Preprocessing**

In [None]:
# Input Text (Indonesian)
text_id = 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9 - maret - 2020.'
text_id

### **Remove Symbol and Character**

In [None]:
# Import Library
import string 

# Remove Symbol
text_id_nosymbol = text_id.translate(str.maketrans('','',string.punctuation)).lower()
text_id_nosymbol

### **Word Tokenization**

In [None]:
# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_id_tokenized = word_tokenize(text_id_nosymbol)

# Show
text_id_tokenized

### **Remove Stopword**

In [None]:
# Download Indonesian Stopwords from NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_id = stopwords.words('indonesian')

# Show Stopwords
stopwords_id

In [None]:
# Removing Stopword
text_id_filtered = []
for w in text_id_tokenized:
    if w not in stopwords_id:
        text_id_filtered.append(w)

# Show
print('Tokenized:',text_id_tokenized)
print('Filtered:',text_id_filtered)

### **Text Normalization**

In [None]:
# Install Library for Text Normalization
! pip install sastrawi

In [None]:
# Import Library
import re

#### **Stemming**

In [None]:
# Import Module
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Create Stemmer Function
stemmer_id = StemmerFactory().create_stemmer()
text_id_stemmed = [stemmer_id.stem(i) for i in text_id_filtered]

# Show
print('Filtered:', text_id_filtered)
print('Stemmed:', text_id_stemmed)
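
Because the English and Indonesian sections follow the same steps, they could be wrapped in one helper. The sketch below is a simplified, standard-library-only version: `preprocess` and `toy_stopwords_id` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the stop-word set is a three-word sample rather than NLTK's Indonesian list.

```python
import string

def preprocess(text, stopwords, stem=None):
    # 1. Remove punctuation and lowercase
    table = str.maketrans('', '', string.punctuation)
    # 2. Tokenize (plain whitespace split as a stand-in for word_tokenize)
    tokens = text.translate(table).lower().split()
    # 3. Remove stop words
    tokens = [t for t in tokens if t not in stopwords]
    # 4. Optionally normalize each token with the supplied function
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens

# Hypothetical three-word sample, not the full Indonesian stop-word list
toy_stopwords_id = {'untuk', 'kepada', 'pada'}
text = 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9 - maret - 2020.'
tokens = preprocess(text, toy_stopwords_id)
print(tokens)
```

In the notebook's setting, one would presumably pass `stopwords_id` and `stem=stemmer_id.stem` for Indonesian, or `stopwords_en` and `stem=stemmer.stem` for English.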