# NLP (Text) Assignment

Aspect Based Sentiment Analysis on Trip(?) Review

by:
- 13515035 - Oktavianus Handika
- 13515075 - Adrian Mulyana Nugraha

In [4]:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import time
import pandas as pd

#Read the train dataset from csv file
train = pd.read_csv("train.csv", usecols=['Description'])
train

| | Description |
|---|---|
| 0 | The room was kind of clean but had a VERY stro... |
| 1 | I stayed at the Crown Plaza April -- - April -... |
| 2 | I booked this hotel through Hotwire at the low... |
| 3 | Stayed here with husband and sons on the way t... |
| 4 | My girlfriends and I stayed here to celebrate ... |
| 5 | We had - rooms. One was very nice and clearly ... |
| 6 | My husband and I have stayed in this hotel a f... |
| 7 | My wife & I stayed in this glorious city a whi... |
| 8 | My boyfriend and I stayed at the Fairmont on a... |
| 9 | Wonderful staff, great location, but it was de... |


# System Architecture (Modules)

The system uses the scikit-learn library for every stage: preprocessing, feature extraction, and finally the classification itself.

## Preprocessing

Preprocessing in this system consists of tokenization only, applied to the train dataset before feature extraction. Tokenization is done with scikit-learn's `CountVectorizer` class, which converts the documents of the train dataset into a matrix of token counts.
To see the full vocabulary learned from the documents, we read the *vocabulary_* attribute of the fitted vectorizer. It is a dictionary that maps each token to its column index in the count matrix.
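
As a small illustration of the tokenization step, the sketch below fits a `CountVectorizer` on two made-up messages and inspects the result. The messages and the variable names `docs`, `vectorizer`, `counts`, and `vocab` are assumptions for this example and are not taken from the train dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, used for illustration only.
docs = ["promo paket flash", "paket data promo murah"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x tokens

# vocabulary_ is a dict attribute (token -> column index), not a method.
vocab = vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Each row holds the token counts of one document.
print(counts.toarray())
```

The column indices follow the alphabetical order of the tokens, which is why the numbers printed by `vocabulary_` are column positions rather than frequencies.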

## Feature Extraction

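After preprocessing the dataset into a matrix of token counts, we still have to do feature extraction to reduce the influence of tokens that are not very meaningful. We use TF-IDF *(Term Frequency - Inverse Document Frequency)* through scikit-learn's `TfidfTransformer`. TF-IDF weights each word by how often it appears within a document and downscales words that appear in many documents; it reweights tokens rather than removing them. Note that the code in this notebook creates the transformer with `use_idf=False`, so only the normalized term-frequency part is applied and the inverse-document-frequency downscaling is switched off.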
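
To make the effect of the `use_idf` flag concrete, here is a minimal sketch that compares the two settings on made-up messages. The messages and all variable names are assumptions for this example. Both settings use the default L2 row normalization of `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy messages: "promo" and "paket" occur in two documents,
# "flash" occurs in only one.
docs = ["promo paket flash", "paket data promo murah", "halo apa kabar"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Term frequency only (the setting used in this notebook).
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
# Term frequency multiplied by inverse document frequency.
tf_idf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

flash = vectorizer.vocabulary_["flash"]
promo = vectorizer.vocabulary_["promo"]

# Weights of "flash" and "promo" in the first document under each setting.
print("tf only:", tf_only[0, flash], tf_only[0, promo])
print("tf-idf :", tf_idf[0, flash], tf_idf[0, promo])
```

With `use_idf=False` the two tokens get the same weight in the first document because each occurs once. With `use_idf=True` the rarer token "flash" gets a larger weight than "promo".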

## Dataset Training

After the preprocessing and feature extraction steps, the dataset is used to train several learning algorithms from the library, which classify whether a text is spam ('Spam') or not spam ('Ham'). The result depends on the training dataset, the text processing, and the machine learning algorithm used.
The trained classifiers are then tested on a few sample SMS texts.
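
The three stages described above can also be chained into a single scikit-learn pipeline, so that the same tokenization and weighting are applied to both training and test texts. The following is only a sketch under stated assumptions: the four training messages, their labels, and the names `texts`, `labels`, `model`, and `prediction` are hypothetical and do not come from the train dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration only (0 = Ham, 1 = Spam).
texts = [
    "promo gratis pulsa",
    "promo hadiah gratis",
    "halo apa kabar",
    "kirim pulsa ya",
]
labels = [1, 1, 0, 0]

# Tokenize -> weight by term frequency -> classify.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(use_idf=False),
    MultinomialNB(),
)
model.fit(texts, labels)

# New texts go through the same fitted steps automatically.
prediction = model.predict(["promo gratis", "halo apa kabar"])
print(prediction)
```

The notebook code below keeps the steps separate instead, which makes it easy to reuse the same feature matrix for several classifiers.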

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

classifier_label = ['Ham','Spam']
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train['Teks'].values)
tf_transformer = TfidfTransformer(use_idf=False).fit(train_counts)
train_tf = tf_transformer.transform(train_counts)

print(train_tf.shape)
print(count_vect.vocabulary_)

(1143, 4951)
{'promo': 3748, 'beli': 1058, 'paket': 3465, 'flash': 1856, 'mulai': 3193, '1gb': 318, 'di': 1540, 'my': 3211, 'telkomsel': 4472, 'app': 885, 'dpt': 1737, 'extra': 1827, 'kuota': 2714, '2gb': 431, '4g': 549, 'lte': 2839, 'dan': 1465, 'nelpon': 3253, 'hingga': 2132, '100mnt': 232, '1hr': 319, 'buruan': 1302, 'cek': 1347, 'tsel': 4689, 'me': 2969, 'mytsel1': 3215, 'gb': 1942, '30': 439, 'hari': 2084, 'hanya': 2073, 'rp': 3921, '35': 474, 'ribu': 3898, 'spesial': 4319, 'buat': 1272, 'anda': 844, 'yang': 4920, 'terpilih': 4532, 'aktifkan': 801, 'sekarang': 4102, 'juga': 2399, '550': 579, '905': 692, 'sd': 4053, 'nov': 3347, '2015': 338, '2016': 339, '07': 53, '08': 55, '11': 244, '47': 536, 'plg': 3669, 'yth': 4939, 'sisa': 4237, '478kb': 541, 'download': 1733, 'mytelkomsel': 3214, 'apps': 889, 'http': 2161, 'utk': 4794, 'atau': 930, 'hub': 2163, '363': 482, '29': 403, '7160kb': 624, '5gb': 590, '55': 578, '907': 693, 'skb': 4247, 'lagi': 2733, 'ekstra': 1792, 'pulsa': 3778, '

In [15]:
#Testing the learning method with some input
test = ['Selamat! Anda mendapatkan uang sebesar 100 juta rupiah. Untuk informasi lebih lanjut, ' + 
        'silakan hubungi nomor berikut +628654321234', #spam
        'Ma, boleh transfer pulsa dulu ke nomor ini? Aku belum bisa isi ulang masih di kampus dulu sekarang', #ham
        'aaa', #ham
        'Transfer saldonya ke rekening ini ya 542 098 7543', #spam
        'Tolong kirim fotocopy KTP dan KK ke email berikut', #ham
        'Registrasi kartumu segera sebelum 1 Oktober 2019', #ham
        'Halo, ada yang bisa dibantu?', #ham
       ]
test_count = count_vect.transform(test)
test_tfidf = tf_transformer.transform(test_count)

In [16]:
from sklearn.naive_bayes import MultinomialNB
classifier_NB = MultinomialNB().fit(train_tf,train.label)

test_predict = classifier_NB.predict(test_tfidf)

for doc, category in zip(test, test_predict):
    print('%r => %s' % (doc, classifier_label[category]))

'Selamat! Anda mendapatkan uang sebesar 100 juta rupiah. Untuk informasi lebih lanjut, silakan hubungi nomor berikut +628654321234' => Spam
'Ma, boleh transfer pulsa dulu ke nomor ini? Aku belum bisa isi ulang masih di kampus dulu sekarang' => Ham
'aaa' => Ham
'Transfer saldonya ke rekening ini ya 542 098 7543' => Ham
'Tolong kirim fotocopy KTP dan KK ke email berikut' => Ham
'Registrasi kartumu segera sebelum 1 Oktober 2019' => Ham
'Halo, ada yang bisa dibantu?' => Ham


In [17]:
from sklearn.ensemble import RandomForestClassifier
classifier_RF = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0).fit(train_tf,train.label)

test_predict = classifier_RF.predict(test_tfidf)

for doc, category in zip(test, test_predict):
    print('%r => %s' % (doc, classifier_label[category]))

'Selamat! Anda mendapatkan uang sebesar 100 juta rupiah. Untuk informasi lebih lanjut, silakan hubungi nomor berikut +628654321234' => Ham
'Ma, boleh transfer pulsa dulu ke nomor ini? Aku belum bisa isi ulang masih di kampus dulu sekarang' => Ham
'aaa' => Ham
'Transfer saldonya ke rekening ini ya 542 098 7543' => Ham
'Tolong kirim fotocopy KTP dan KK ke email berikut' => Ham
'Registrasi kartumu segera sebelum 1 Oktober 2019' => Ham
'Halo, ada yang bisa dibantu?' => Ham


In [18]:
from sklearn.svm import LinearSVC
classifier_SVC = LinearSVC().fit(train_tf,train.label)

test_predict = classifier_SVC.predict(test_tfidf)

for doc, category in zip(test, test_predict):
    print('%r => %s' % (doc, classifier_label[category]))

'Selamat! Anda mendapatkan uang sebesar 100 juta rupiah. Untuk informasi lebih lanjut, silakan hubungi nomor berikut +628654321234' => Spam
'Ma, boleh transfer pulsa dulu ke nomor ini? Aku belum bisa isi ulang masih di kampus dulu sekarang' => Spam
'aaa' => Ham
'Transfer saldonya ke rekening ini ya 542 098 7543' => Ham
'Tolong kirim fotocopy KTP dan KK ke email berikut' => Spam
'Registrasi kartumu segera sebelum 1 Oktober 2019' => Ham
'Halo, ada yang bisa dibantu?' => Ham


# Analysis

There are 3 algorithms used in this program:
  - Multinomial Naive Bayes Classifier
  - Random Forest Classifier
  - Linear SVM (Support Vector Classifier)
  
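Comparing the 3 algorithms on the 7 test messages above, Multinomial Naive Bayes achieves 85.7% accuracy (6/7 correct), Random Forest 71.4% (5/7 correct), and Linear SVM 57.1% (4/7 correct). Random Forest labels every message as Ham, so its 5 correct answers are exactly the 5 ham messages. With only 7 test messages, these figures give a rough indication rather than a reliable ranking.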
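
The accuracy figures can be recomputed from the printed predictions. In the sketch below, the expected labels come from the inline comments in the test list and the predicted labels from the three outputs above, encoded as 0 = Ham and 1 = Spam to match `classifier_label`. The helper `accuracy` and the variable names are assumptions for this example.

```python
# 0 = Ham, 1 = Spam, matching classifier_label = ['Ham', 'Spam'].
expected = [1, 0, 0, 1, 0, 0, 0]  # labels from the comments in the test list

# Predictions transcribed from the printed outputs of the three classifiers.
predicted = {
    "Multinomial Naive Bayes": [1, 0, 0, 0, 0, 0, 0],
    "Random Forest": [0, 0, 0, 0, 0, 0, 0],
    "Linear SVM": [1, 1, 0, 0, 1, 0, 0],
}


def accuracy(y_true, y_pred):
    """Return (number of correct predictions, fraction correct)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct, correct / len(y_true)


results = {name: accuracy(expected, y_pred) for name, y_pred in predicted.items()}
for name, (correct, acc) in results.items():
    print(f"{name}: {correct}/{len(expected)} = {acc:.1%}")
```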

The differences follow from the nature of each algorithm. Multinomial Naive Bayes estimates, for each category, how likely each word of the tokenized vocabulary is to appear. The more words of a test case that are typical of a category, the higher the probability that the test case is assigned to that category.
Random Forest Classifier combines multiple decision trees, each trained on a bootstrap sample of the data, that is, a random subset drawn with replacement.
Linear SVM separates the 2 classes (spam and ham) with a linear decision boundary in the vector space, but some messages may lie on the wrong side of that boundary, away from the other messages of their class.
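
The per-category word likelihoods that Multinomial Naive Bayes relies on can be inspected directly. The sketch below is an illustration on hypothetical toy messages, not on the train dataset. It uses raw token counts and the default smoothing (`alpha=1.0`), and reads the fitted `feature_log_prob_` attribute, which stores the log-probability of each token given each class.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data for illustration only (0 = Ham, 1 = Spam).
texts = [
    "promo gratis pulsa",
    "promo hadiah gratis",
    "halo apa kabar",
    "kirim pulsa ya",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

# Row 0 of feature_log_prob_ belongs to class 0 (Ham), row 1 to class 1 (Spam).
promo = vectorizer.vocabulary_["promo"]
p_promo_ham, p_promo_spam = np.exp(nb.feature_log_prob_[:, promo])
print("P(promo | Ham) =", p_promo_ham)
print("P(promo | Spam) =", p_promo_spam)

# A message made of spam-typical words is assigned to the Spam class.
label = nb.predict(vectorizer.transform(["promo gratis"]))[0]
print(label)
```

In this toy setup "promo" occurs twice in the spam messages and never in the ham messages, so its smoothed likelihood is (2 + 1) / (6 + 9) for Spam and (0 + 1) / (6 + 9) for Ham, where 6 is the token count per class and 9 is the vocabulary size.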