# Reviews Data Processing

written by: Muhammad Angga Muttaqien | muha.muttaqien@gmail.com

## Data Preparation

In [1]:
import os
import re, string, unicodedata
import nltk
import Sastrawi
import contractions
import inflect
import itertools
import pandas as pd
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

# for processing indonesian text
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

In [2]:
# my own stemmer
from stemmer import IndonesianStemmer
from stemmer import EnglishStemmer

#### XML Processing

In [3]:
import xml.etree.ElementTree as et

In [4]:
tree = et.parse('./datasets/training_set.xml')
root = tree.getroot()

In [5]:
reviews_corpus = []

# grab the text of every review in the XML file
for review in root.findall('review'):
    rid = review.get('rid')
    text = review.find('text').text

    # aspect annotations are read here but not used yet; only the raw text is kept
    for aspects in review.findall('aspects'):
        aspects_id = aspects.get('id')  # renamed from `id` to avoid shadowing the builtin
        for aspect in aspects.findall('aspect'):
            category = aspect.get('category')
            polarity = aspect.get('polarity')

            # text_corpus = aspects_id + " | " + category + " - " + polarity + " | " + text
            # reviews_corpus.append(text_corpus)

    reviews_corpus.append(text)

In [6]:
reviews_corpus = reviews_corpus[0:50]

In [7]:
def display_reviews(reviews_corpus):
    for id, review in enumerate(reviews_corpus):
        print("{}) {}\n".format(id+1, review))
        
display_reviews(reviews_corpus)

1) I love the concept. I feel like in swiss traditional market. The place is amazing. The food is awesome. But, in my opinion, they need to make a change/rotation in menu or even new menu. I choose this place for lunch frequently. Sometimes I feel bored with the menu.  Overall, thanks Marche for the delicious food, also the nice place.

2) Sengaja macet2an kesini cuman buat nyobain nasi goreng cakalang yang orang2 bilang enak. Dan emang beneran enak sih nasi gorengnya wkkw suasana nya juga enak buat makan ramai2 gitu.

3) Suka sama bebek ini karna dulu d ajak tmn makan di sini, ehh malah jd ketagihan sama dagingnya yg empuk dan sambel mentah nya yg dasyatttt    Dulu tempatnya masih tenda, sekarang udh ada kiosnya, kursinya lumayan banyak ada toilet nya juga..    Kalo makan bebek ini selalu order dua bebek, nasi uduk, sate rempela, sambel mentah ekstra pedas dan es teh manis, sambel mentah nya bisa request pedasnya..

4) Very good and very delish!!! Gokils deh enaknya... Highly Recommen

#### Splitting English and Indonesian training data

In [8]:
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

# manually added Indonesian words that must not be counted as English
added_vocab = ['tau', 'gue', 'saya', 'baru', 'gila', 'ga', 'paling', 'yang']

en_reviews_corpus = []
id_reviews_corpus = []
for review in reviews_corpus:
    tokens = word_tokenize(review)

    # heuristic: classify the whole review by the language of its first token
    first = tokens[0].lower() if tokens else ""
    if first in english_vocab and first not in added_vocab:
        en_reviews_corpus.append(" ".join(tokens))
    else:
        id_reviews_corpus.append(" ".join(tokens))
        
print("Total training data: ", len(reviews_corpus))
print("English reviews: ", len(en_reviews_corpus))
print("Indonesian reviews:", len(id_reviews_corpus))

Total training data:  50
English reviews:  22
Indonesian reviews: 28


In [9]:
en_processed_reviews = []
id_processed_reviews = []

## Text Preprocessing

Text is the most unstructured form of available data: it contains many kinds of noise and cannot be analyzed directly without preprocessing. The whole process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It consists mainly of three steps:

1. Noise Removal
2. Lexicon Normalization
3. Object Standardization

#### 1. Noise removal

Any piece of text that is not relevant to the context of the data and the end output can be treated as noise. Examples include language stopwords (commonly used words such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step removes all such noisy entities from the text.
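As a minimal sketch of the idea, the snippet below strips URLs, mentions, hashtags and stopwords using only the standard library. The stopword set, the regular expressions and the function name `remove_noise` are assumptions made for this illustration; the notebook itself relies on the NLTK and Sastrawi stopword lists.

```python
import re

# Toy stopword set for illustration only; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"is", "am", "the", "of", "in", "a", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
ENTITY_RE = re.compile(r"[@#]\w+")              # mentions and hashtags


def remove_noise(text, stopwords=TOY_STOPWORDS):
    """Drop URLs, social media entities and stopwords from a text."""
    text = URL_RE.sub(" ", text)
    text = ENTITY_RE.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in stopwords]
    return " ".join(tokens)


print(remove_noise("The food at @marche is great #lunch see https://example.com/menu"))
# food great see
```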

In [10]:
# en stopwords
en_stopwords = stopwords.words('english')

# id stopwords
factory = StopWordRemoverFactory()
id_stopwords_remover = factory.create_stop_word_remover()

##### English

In [11]:
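for id, review in enumerate(en_reviews_corpus):
    tokens = word_tokenize(review)

    # keep tokens that are not stopwords, are longer than 2 characters, and are in the English vocabulary
    review_list = [i.lower() for i in tokens if i not in en_stopwords and len(i) > 2 and i in english_vocab]
    review_arr = " ".join(review_list)
    # NOTE: the unfiltered review is appended, so review_arr is never used and the
    # English reviews keep their stopwords; append review_arr instead to apply the filter
    en_processed_reviews.append(review)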

##### Indo

In [12]:
for id, review in enumerate(id_reviews_corpus):
    id_processed_reviews.append(id_stopwords_remover.remove(review))

#### 2. Lexicon Normalization

Another type of textual noise comes from the multiple surface forms a single word can take. For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”. Although their meanings differ, they are contextually similar. This step converts all variants of a word into a normalized form (also known as the lemma). Normalization is a pivotal step in feature engineering for text because it reduces high-dimensional features (N different features) to a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

1. Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
2. Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of word forms) and morphological analysis (word structure and grammar relations).
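The difference between the two practices can be sketched with a toy example. The suffix list, the lemma dictionary and both function names below are invented for this illustration; they are not the NLTK or Sastrawi implementations used in the cells that follow.

```python
# Toy rules and dictionary, assumed for illustration only.
SUFFIXES = ("ing", "ly", "es", "ed", "s")
LEMMA_DICT = {
    "played": "play", "playing": "play", "plays": "play",
    "better": "good", "was": "be",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms."""
    return LEMMA_DICT.get(word, word)


for w in ["playing", "plays", "played", "better", "was"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based stemmer handles the regular forms of “play” but leaves “better” and “was” untouched, while the dictionary lookup maps them to “good” and “be”.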

##### English

In [13]:
joint_review = []
lem = WordNetLemmatizer()
stem = LancasterStemmer()

for id, review in enumerate(en_processed_reviews):
    tokens = word_tokenize(review)

    joint_token = []
    for token in tokens:
        # lemmatize each token as a verb (pos="v")
        token = lem.lemmatize(token, "v")
        # token = stem.stem(token)
        joint_token.append(token)

    # join once per review, after all tokens are lemmatized
    joint_token_str = " ".join(joint_token)
    joint_review.append(joint_token_str)

en_processed_reviews = joint_review

##### Indo

In [14]:
joint_review = []
factory = StemmerFactory()
stemmer = factory.create_stemmer()

for id, review in enumerate(id_processed_reviews):
    print("Input: %s\n"%review)
    joint_token_str = stemmer.stem(review)
    print("Output: %s\n"%joint_token_str)

    joint_review.append(joint_token_str)

id_processed_reviews = joint_review

Input: Sengaja macet2an kesini cuman buat nyobain nasi goreng cakalang orang2 bilang enak . Dan emang beneran enak sih nasi gorengnya wkkw suasana nya enak buat makan ramai2 gitu .

Output: sengaja macet2an kesini cuman buat nyobain nasi goreng cakalang orang2 bilang enak dan emang beneran enak sih nasi goreng wkkw suasana nya enak buat makan ramai2 gitu

Input: Suka sama bebek karna dulu d ajak tmn makan sini , ehh malah jd ketagihan sama dagingnya yg empuk sambel mentah nya yg dasyatttt Dulu tempatnya tenda , sekarang udh kiosnya , kursinya lumayan banyak toilet nya juga.. Kalo makan bebek selalu order bebek , nasi uduk , sate rempela , sambel mentah ekstra pedas es teh manis , sambel mentah nya request pedasnya..

Output: suka sama bebek karna dulu d ajak tmn makan sini ehh malah jd tagih sama daging yg empuk sambel mentah nya yg dasyatttt dulu tempat tenda sekarang udh kios kursi lumayan banyak toilet nya juga kalo makan bebek selalu order bebek nasi uduk sate rempela sambel mentah

Output: datang sini cuma makan dimsum sudah keburu makan tempat entah hari rasa dimsumnya kurang oke atau mungkin benar perut sudah kenyang cara seluruh sih boleh lah dimsum sini jadi sasar brunch ngemil iseng tapi seperti saing lebih berat suka masakan sini restoran benar luar biasa sejak zaman tempo dulu sekarang restoran tetap terus jaya ciri khas makan lezat kualitas bagus memang harga makan sini mahal pas masakan nya pas harga nya suka 3

Input: Shaokaonya enak , bumbunya pas dibakarnya pas ga terlalu kering.. Harga ga terlalu mahal , plus sering ad diskon.. very recommended..

Output: shaokaonya enak bumbu pas bakar pas ga terlalu kering harga ga terlalu mahal plus sering ad diskon very recommended

Input: Martabak Pecenongan 65A never disappoints me . Adonan tebal maupun yg tipis enak '' '' ! Topingnya dikasih melimpah . Harganya pas ! Pokoknya the best martabak maker deh . Tempatnya bersih . Penyajiannya oke aka higenis . Mas mas nya ramah . Plus menampung kritik saran instanya

Output: pilih tepat klo mau makan steak enak harga pas kantong yep joni steak tau gw sh 2 cabang jkt satu pasar baru satu lg gajah mada pertama x makan sini ajak temen mantab luar biasa dah layan cepat rasa maknyus harga pas kantong apalagi ukur steak tenderloin banyak itu yah mungkin banyak steak2 laennya yg lebih enak tp klo mau milih yg enak value for money coba joni steak favorite menu makan sini salmon steaknya rasa ga amis2nya gt fresh sih tambah sambal bangkok selada yg sangat yummy jaman pertama x makan sini saji pake piring hotplate klo sekarang2 disajiin piring biasa yaahh tp tetep oke

Input: Pulang kantor daerah menteng/cikini trus bingung mau makan . Liat Zomato katanya disini salah satu recommended . So here I was . Tempatnya lumayan lucu indoor outdoor ( smoking area ) . They have all-day breakfast menu plus many selection of pastries ( quiche , ) . Pesen aglio olio with shrimp mushroom truffle pizza . WOW surprisingly enak enak lho ! ! Pizzanya ! Truffle-nya pas over po

#### 3. Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.

Examples include acronyms, hashtags with attached words, and colloquial slang. This type of noise can be fixed with regular expressions and manually prepared data dictionaries. The code below uses a dictionary lookup method to replace such tokens in a text.
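A dictionary lookup of this kind can be sketched as follows. The table `SLANG_DICT` and the function `replace_slang` are hypothetical; the entries are a few common Indonesian chat abbreviations chosen for illustration and are not part of the notebook.

```python
# Hypothetical lookup table: abbreviation -> standard form (illustrative entries only).
SLANG_DICT = {
    "yg": "yang",
    "jd": "jadi",
    "tmn": "teman",
    "udh": "sudah",
}


def replace_slang(text, lookup=SLANG_DICT):
    """Replace every whitespace-separated token found in the lookup table."""
    return " ".join(lookup.get(tok.lower(), tok) for tok in text.split())


print(replace_slang("dagingnya yg empuk udh jd favorit"))
# dagingnya yang empuk sudah jadi favorit
```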

1) Handling Apostrophes

To avoid ambiguity in word sense, it is recommended to keep the text well structured and to abide by the rules of a context-free grammar. When apostrophes are used, the chance of ambiguity increases.
For example, “it’s” is a contraction of “it is” or “it has”. All apostrophe forms should be converted into standard lexicons.

##### English

In [15]:
# word_tokenize splits "don't" into "do" and "n't", so "n't" is mapped as well
appostrophes_dict = {"'s": "is", "'re": "are", "'m": "am", "'ve": "have", "'d": "would", "'ll": "will", "'t": "not", "n't": "not"}

In [16]:
joint_review = []

for id, review in enumerate(en_processed_reviews):
    tokens = word_tokenize(review)
    
    joint_token = [appostrophes_dict[token] if token in appostrophes_dict else token for token in tokens]
    joint_token_str = " ".join(joint_token)
    
    joint_review.append(joint_token_str)
    
en_processed_reviews = joint_review

##### Indo

Nothing. Indonesian has no apostrophe contractions of this kind.

2) Removal of Punctuations

Punctuation marks should be handled according to their importance. For example, “.”, “,” and “?” are important punctuation marks that should be retained, while others need to be removed.
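A minimal sketch of selective punctuation removal, assuming that “.”, “,” and “?” are the marks to keep. The names `KEEP`, `REMOVE` and `strip_punctuation` are introduced here for illustration, and the approach (`str.translate`) differs from the character loop used in the cells below.

```python
import string

KEEP = ".,?"  # marks assumed to be worth retaining
REMOVE = "".join(ch for ch in string.punctuation if ch not in KEEP)
TABLE = str.maketrans("", "", REMOVE)  # maps every removed mark to None


def strip_punctuation(text):
    """Delete all ASCII punctuation except the marks listed in KEEP."""
    return text.translate(TABLE)


print(strip_punctuation("Very good!!! (really), isn't it?"))
# Very good really, isnt it?
```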

##### English

In [17]:
joint_review = []
punctuations_dict = r'''!()-[]{};:'"\,<>./@#$%^&*_~''' # characters to strip; note that "?" is not in the list

for id, review in enumerate(en_processed_reviews):

    processed_review = ""
    for token in review:
        if token not in punctuations_dict:
            processed_review = processed_review + token

    joint_review.append(processed_review)

en_processed_reviews = joint_review

##### Indo

In [18]:
joint_review = []
punctuations_dict = r'''!()-[]{};:'"\,<>./@#$%^&*_~''' # characters to strip; note that "?" is not in the list

for id, review in enumerate(id_processed_reviews):

    processed_review = ""
    for token in review:
        if token not in punctuations_dict:
            processed_review = processed_review + token

    joint_review.append(processed_review)

id_processed_reviews = joint_review

3) Removal of whitespace noise

Unneeded whitespace inside sentences should be removed, as in "because it is right  ." or "Besides  , there is...".
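One way to handle both examples is a pair of regular expressions, sketched below. The function name `fix_spacing` is an assumption for this illustration; unlike a plain `str.replace`, the pattern also covers runs of several spaces.

```python
import re


def fix_spacing(text):
    """Remove whitespace before punctuation and collapse repeated whitespace."""
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # "right  ." -> "right."
    text = re.sub(r"\s{2,}", " ", text)         # "is   more" -> "is more"
    return text.strip()


print(fix_spacing("because it is right  ."))
# because it is right.
print(fix_spacing("Besides  , there is   more"))
# Besides, there is more
```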

##### English

In [19]:
joint_review = []

for id, review in enumerate(en_processed_reviews):
    processed_review = review.lower()
    processed_review = processed_review.replace(" .", ".")
    processed_review = processed_review.replace(" ,", ",")
    
    joint_review.append(processed_review)

en_processed_reviews = joint_review

##### Indo

In [20]:
joint_review = []

for id, review in enumerate(id_processed_reviews):
    processed_review = review.replace(" .", ".")
    processed_review = processed_review.replace(" ,", ",")
    
    joint_review.append(processed_review)

id_processed_reviews = joint_review

4) Standardizing words

Sometimes words are not written in their proper form. For example: “I looooveee you” should be “I love you”. Simple rules and regular expressions can help solve these cases.
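The rule can be sketched with regular expressions: first cap every run of a repeated character at two, and if the result is still not a known word, collapse the runs to a single character. The tiny vocabulary `TOY_VOCAB` and the function `squeeze` are assumptions for this illustration; the cells below apply the same idea with `itertools.groupby` and the NLTK word list.

```python
import re

# Tiny stand-in vocabulary, assumed for illustration only.
TOY_VOCAB = {"i", "love", "you", "good", "food"}


def squeeze(word, vocab=TOY_VOCAB):
    """Normalize elongated words such as 'looooveee'."""
    if word.lower() in vocab:
        return word
    capped = re.sub(r"(.)\1{2,}", r"\1\1", word)  # runs of 3+ become 2
    if capped.lower() in vocab:
        return capped
    return re.sub(r"(.)\1+", r"\1", word)         # otherwise collapse runs to 1


print(" ".join(squeeze(w) for w in "I looooveee you".split()))
# I love you
```

As with the notebook's approach, a word that is legitimately spelled with a double letter but missing from the vocabulary is also collapsed, so the quality of the result depends on the vocabulary used.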

##### English

In [21]:
joint_review = []

for id, review in enumerate(en_processed_reviews):
    review = ''.join(''.join(s)[:2] for _, s in itertools.groupby(review))
    tokens = word_tokenize(review)
    
    joint_token = []
    for token in tokens:
        if token not in english_vocab:
            token = ''.join(''.join(s)[:1] for _, s in itertools.groupby(token))
    
        joint_token.append(token)
    
    joint_token_str = " ".join(joint_token)   
    joint_review.append(joint_token_str)

en_processed_reviews = joint_review

##### Indo

In [22]:
joint_review = []

for id, review in enumerate(id_processed_reviews):
    review = ''.join(''.join(s)[:2] for _, s in itertools.groupby(review))
    tokens = word_tokenize(review)
    
    joint_token = []
    for token in tokens:
        if token not in english_vocab:
            token = ''.join(''.join(s)[:1] for _, s in itertools.groupby(token))
    
        joint_token.append(token)
    
    joint_token_str = " ".join(joint_token)   
    joint_review.append(joint_token_str)

id_processed_reviews = joint_review

#### Display both processed training sets

In [23]:
# display the final processed text for English
display_reviews(en_processed_reviews)

1) i love the concept i feel like in swiss traditional market the place be amaze the food be awesome but in my opinion they need to make a changerotation in menu or even new menu i choose this place for lunch frequently sometimes i feel bore with the menu overall thank marche for the delicious food also the nice place

2) very good and very delish gokils deh enaknya highly recomended gyutan semur iga sapi ayam pangang very good and all the deserts also very good

3) best place to date someone good ambiance nice interior decent price use to be best hamburger but my favorite be their alfredo or carbonara prince house seharusnya saya rate 50 its just their affordable waffle no longer there the price be not worth anymore

4) cheese cake nya juara lembut tempatnya enak cozy parkiranya pun lumayan luas

5) been here twice and waktu itu gue ke sini pas udah rada malem as you know wargih ini selalu rame gapernah sepi

6) great concept relatable with its name marche the food varieties arent tha

In [24]:
# display the final processed text for Indonesian
display_reviews(id_processed_reviews)

1) sengaja macet2an kesini cuman buat nyobain nasi goreng cakalang orang2 bilang enak dan emang beneran enak sih nasi goreng wkw suasana nya enak buat makan ramai2 gitu

2) suka sama bebek karna dulu d ajak tmn makan sini eh malah jd tagih sama daging yg empuk sambel mentah nya yg dasyat dulu tempat tenda sekarang udh kios kursi lumayan banyak toilet nya juga kalo makan bebek selalu order bebek nasi uduk sate rempela sambel mentah ekstra pedas es teh manis sambel mentah nya request pedas

3) tempat dessert kelapa fresh banget kalo kesini paling suka beli coco pouchnya ngobrol sama temen harga nya sahabat banget lumayan banyak isi coco pouch nya paling suka coco pouch rasa honeydew asa seger banget bener energy potion

4) tryin menya sakura ramen for the first time with bunch of my friends first we re bit doubting but the sign in front of resto quite big sayin that japan no 1 ramen so we decided to give it a try most of us tryin the tonkotsu ramen the spicy one the broth was quite thick

## Text to Features

**Text to Features (Feature Engineering on text data)**

1) Syntactical Parsing
- Dependency Grammar
- Part of Speech Tagging

2) Entity Parsing
- Phrase Detection
- Named Entity Recognition
- Topic Modelling
- N-Grams

3) Statistical features
- TF-IDF
- Frequency / Density Features
- Readability Features

4) Word Embeddings
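
Two of the features listed above, N-grams and TF-IDF, can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are invented here, tokens are split on whitespace, and the weighting is the basic variant tf × log(N / df), one of several in common use.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Score each term of each document with tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the food is awesome", "the place is amazing", "nasi goreng enak"]
print(ngrams(docs[0].split(), 2))
print(tf_idf(docs)[0])
```

Terms that occur in every document receive a score of zero under this variant, which is why smoothed versions of the IDF term are often preferred in practice.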

## RNN - Text Classification