# Load Data

## Connect to Google Drive

First things first: connect this Google Colab project to Google Drive.

Run the code below to connect them.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Load the Data from *.csv* Source

Here we use the **Hotel Reviews** data from Kaggle as our dataset. You can download it [here](https://www.kaggle.com/anu0012/hotel-review). After that, upload the .csv file to any directory in your personal Google Drive.
<br/>

The hotel review data from the link above consists of *train.csv* and *test.csv*. In this project we will use only the **former**, *train.csv*.
<br/>
<br/>

To access the data, copy the path of the .csv file, pass it to `pd.read_csv`, and store the resulting DataFrame in a variable called `data`, as in the code below.


In [None]:
import pandas as pd

# Raihan's Google Drive directory
# data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datasets/hotelReview_train.csv')

# Ibnu's Google Drive directory
# data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Hotel Review/train.csv')

# Local directory
data = pd.read_csv('hotel_review_train.csv')

## Displaying the Data

After the data has been successfully read, we can display different aspects of it programmatically.

The snippet below outputs the shape of the data as a (rows, columns) tuple.

In [None]:
data.shape

(38932, 5)

The snippet below outputs `n` random rows (here `n = 5`).

In [None]:
data.sample(5)

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
1642,id11968,My husband and I plus two of our friends (anot...,Firefox,Mobile,not happy
22265,id32591,"After reading reviews, I thought this was goin...",Mozilla Firefox,Mobile,happy
1932,id12258,"This is a reliable, consistent hotel. Not spec...",Google Chrome,Desktop,happy
27233,id37559,This hotel is located right next to a trolley ...,Google Chrome,Desktop,not happy
18930,id29256,This hotel is the perfect place to stay while ...,IE,Tablet,happy


The snippet below outputs summary statistics for each column.

In [None]:
data.describe()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
count,38932,38932,38932,38932,38932
unique,38932,38932,11,3,2
top,id24260,My husband and I love to stay at the JW Marrio...,Firefox,Desktop,happy
freq,1,1,7367,15026,26521


The snippet below outputs how many rows fall under each target value.

In [None]:
data['Is_Response'].value_counts()

happy        26521
not happy    12411
Name: Is_Response, dtype: int64
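
The counts above show that the two classes are not balanced. As a minimal sketch, the snippet below rebuilds a label column from those reported counts (a stand-in for the real `Is_Response` column) and uses `value_counts(normalize=True)` to express them as shares, which come out at roughly 68% "happy" and 32% "not happy".

```python
import pandas as pd

# Stand-in for data['Is_Response'], rebuilt from the counts reported above
labels = pd.Series(['happy'] * 26521 + ['not happy'] * 12411)

# normalize=True returns the share of each value instead of the raw count
shares = labels.value_counts(normalize=True)

print(shares.round(3))
```

This share is a useful baseline: a model that always answers "happy" would already be right about 68% of the time.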

In this project, we'll use only the `Description` and `Is_Response` columns.

Later on, we'll store the cleaned `Description` text in a variable named `attribute` and the `Is_Response` values in a variable named `target`.

# Preprocessing


## Column Handling

First we will get rid of the columns that are irrelevant to this project's sentiment analysis: `User_ID`, `Browser_Used`, and `Device_Used`.

In [None]:
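# errors = 'ignore' keeps this cell safe to re-run: without it, a second run
# raises a KeyError because the columns have already been dropped
data.drop(columns = ['User_ID', 'Browser_Used', 'Device_Used'], inplace = True, errors = 'ignore')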

Next we will change the `Is_Response` column values from "happy" and "not happy" to "positive" and "negative".

In [None]:
data['Is_Response'] = data['Is_Response'].map({'happy' : 'positive', 'not happy' : 'negative'})

data.sample(3)

Unnamed: 0,Description,Is_Response
32912,"Positives: Great Building, Artwork, uniforms, ...",negative
17626,This place is ok. Service is what you'd expect...,negative
16888,My husband booked for us to go to NY for my --...,positive
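
One thing to keep in mind with `map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses a small invented Series (not the real data) to show how such unmapped values can be counted after the mapping.

```python
import pandas as pd

# Invented example values; 'Happy' (capitalised) is deliberately not in the mapping
responses = pd.Series(['happy', 'not happy', 'Happy'])

mapped = responses.map({'happy': 'positive', 'not happy': 'negative'})

# Values without a matching dictionary key come back as NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

On the real data, the same `isna().sum()` check on `data['Is_Response']` should return 0, since `value_counts()` above reported only the two expected values.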


## Text Cleaning

We will clean the text by removing punctuation. In addition, this step removes Twitter usernames (@username...) and website links (http... and www...). All of this is done with regular expressions that search for the matching text.


In [None]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
twitter_handle = r'@[A-Za-z0-9_]+'                         # matches Twitter handles (@username)
url_handle = r'http[^ ]+'                                  # matches URLs that start with 'http' (http:// or https://)
combined_handle = r'|'.join((twitter_handle, url_handle))  # one pattern that matches either of the two above
www_handle = r'www\.[^ ]+'                                 # matches URLs that start with 'www.' (dot escaped so it is literal)
punctuation_handle = r'\W+'                                # matches runs of non-word characters (punctuation, whitespace)
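
To see what each pattern picks up, here is a small sketch that applies them to an invented sample sentence. The patterns are redefined inside the snippet (with the dot after `www` escaped) so that it runs on its own; the sample text and the variable names are assumptions made for illustration.

```python
import re

# Patterns redefined here so the snippet is self-contained
twitter_handle = r'@[A-Za-z0-9_]+'
url_handle = r'http[^ ]+'
combined_handle = r'|'.join((twitter_handle, url_handle))
www_handle = r'www\.[^ ]+'
punctuation_handle = r'\W+'

# Invented sample text
sample = "Great stay @hotel_fan see https://example.com or www.example.com !!!"

# What each pattern matches on its own
handles = re.findall(twitter_handle, sample)
urls = re.findall(url_handle, sample)
www_links = re.findall(www_handle, sample)

# The same order of operations the cleaning step uses:
# handles and http links, then www links, lowercase, then punctuation
cleaned = re.sub(combined_handle, '', sample)
cleaned = re.sub(www_handle, '', cleaned).lower()
cleaned = re.sub(punctuation_handle, ' ', cleaned).strip()

print(handles, urls, www_links)
print(cleaned)  # great stay see or
```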


We will also get rid of "stopwords". Stopwords are the most common words in a language, which add little semantic meaning to a sentence. There is no single universal list of stopwords used by all natural language processing tools, and indeed not all tools even use such a list.

<img height="300" src="https://onlinemediamasters.com/wp-content/uploads/2015/11/Stop-Words.jpg" alt="Stop Words" />

The stopword list we use is in English and can be downloaded [here](http://xpo6.com/download-stop-word-list/).

Download the "Text file of stop words for download", then add the word "stopword" as its first line. The idea is to make the `read_csv` function read the file as a single column with a header named "stopword".

Then upload it somewhere on your Google Drive. 

In [None]:
# Ibnu's Google Drive directory
# stopwords = set(pd.read_csv('/content/drive/My Drive/Colab Notebooks/Hotel Review/stop-word-list.txt', sep='\n', header=0).stopword)

# Raihan's Google Drive directory
# stopwords = set(pd.read_csv('/content/drive/My Drive/Colab Notebooks/Stopword/stopword_en.txt', sep='\n', header=0).stopword)

# Local directory
stopwords = set(pd.read_csv('stopword_en.txt', sep='\n', header=0).stopword)
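
If your pandas version does not accept `sep='\n'` in `read_csv` (an assumption worth checking against your installation), the same set can be built with plain Python file reading. This is only a sketch: `load_stopwords` is a hypothetical helper name, and the demo writes a tiny temporary file instead of using the real stopword list.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stopword per line, skipping blank lines and the optional 'stopword' header."""
    with open(path, encoding='utf-8') as handle:
        words = {line.strip().lower() for line in handle if line.strip()}
    # Remove the header line if it was added for read_csv
    words.discard('stopword')
    return words

# Demo with a small temporary file that mimics the expected layout
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'stopword_en.txt')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        handle.write('stopword\na\nabove\nif\n')
    demo_stopwords = load_stopwords(demo_path)

print(sorted(demo_stopwords))  # ['a', 'above', 'if']
```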

Define a function called `process_text` to process the text using the methods listed above. 

In [None]:
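def process_text(text):
    # Strip any HTML markup and keep only the visible text
    souped = BeautifulSoup(text, 'lxml').get_text()

    try:
        # Byte strings (Python 2): decode, drop the BOM, replace undecodable characters
        text = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        # Python 3 strings are already Unicode and have no decode() method
        text = souped

    # Remove Twitter handles and URLs, lowercase, then turn punctuation into spaces
    text = re.sub(combined_handle, '', text)
    text = re.sub(www_handle, '', text).lower()
    cleaned_text = re.sub(punctuation_handle, ' ', text)

    # Drop stopwords
    cleaned_text = ' '.join([word for word in cleaned_text.split() if word not in stopwords])

    # Tokenize and drop single-character tokens
    return ' '.join([word for word in tokenizer.tokenize(cleaned_text) if len(word) > 1]).strip()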

Below is an input-based example to test the above text cleaning method. Try it~

In [None]:
example_text = "hahaha if above a ----'-' www.adasd apakah SAYA ingin pergi pada tanggal 15 bulan februari besok ? tidak karena hari kemarin @twitter suka main https://www.twitter.com"

process_text(example_text)

'hahaha apakah saya ingin pergi pada tanggal 15 bulan februari besok tidak karena hari kemarin suka main'

Then we will create a new column in our data named `clean_text` to store the cleaned text. 

We will process every row of the `Description` column, which holds the raw text from the .csv data, and then concatenate the new `clean_text` column to the original DataFrame.

In [None]:
cleaned_text = []

for text in data.Description:
    cleaned_text.append(process_text(text))

clean_text = pd.DataFrame({'clean_text' : cleaned_text})
data = pd.concat([data, clean_text], axis = 1)

data.sample(5)

Unnamed: 0,Description,Is_Response,clean_text
2979,"I visited Washington, DC for one night-one day...",negative,visited washington dc night day march chose ho...
32090,The Peninsula Hotel remains my favorite city h...,positive,peninsula hotel remains favorite city hotel ve...
9157,We had a delightful stay at this hotel. The lo...,positive,delightful stay hotel location walking distanc...
31880,"I came to the Sax last month for a conference,...",positive,came sax month conference did fine job room sp...
26662,I had the most wonderful stay at The Parc --. ...,positive,wonderful stay parc recommended sister amazing...
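
`pd.concat(..., axis = 1)` lines rows up by index label, which works here because both frames use the default 0..n-1 index. As a hedged alternative sketch, `Series.apply` writes the new column directly and keeps whatever index the DataFrame has. The `toy_clean` function and the two-row frame below are invented stand-ins for `process_text` and the real data.

```python
import pandas as pd

def toy_clean(text):
    # Stand-in for process_text: lowercase and collapse whitespace
    return ' '.join(text.lower().split())

# Invented rows with a non-default index to show that alignment is preserved
toy = pd.DataFrame({'Description': ['Great  STAY', 'Noisy Room ']}, index=[10, 20])

# apply() runs the function on every row and returns a Series with the same index
toy['clean_text'] = toy['Description'].apply(toy_clean)

print(toy['clean_text'].tolist())  # ['great stay', 'noisy room']
```

With the real data, the equivalent line would be `data['clean_text'] = data['Description'].apply(process_text)`.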


## Splitting Train Data

Here we are going to set the variable `attribute` to hold the cleaned hotel review texts, and the variable `target` to hold the sentiment [ positive ; negative ] of each hotel review.

In [None]:
from sklearn.model_selection import train_test_split

attribute = data.clean_text
target = data.Is_Response

We will split the entire data set into four variables: `attribute_train`, `attribute_test`, `target_train`, and `target_test`, with a ratio of 9:1 ( train : test ).

The ratio is passed as `test_size = 0.1`, which tells `train_test_split` to hold out 10% of the whole data set as test data.

After that, we display the four variables to see how much data is distributed amongst the variables.

In [None]:
attribute_train, attribute_test, target_train, target_test = train_test_split(attribute, target, test_size = 0.1, random_state = 225)

print('attribute_train :', len(attribute_train))
print('attribute_test  :', len(attribute_test))
print('target_train :', len(target_train))
print('target_test  :', len(target_test))

attribute_train : 35038
attribute_test  : 3894
target_train : 35038
target_test  : 3894
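
The sizes above follow from the row count: with a fractional `test_size`, scikit-learn rounds the test share up, so 10% of 38932 rows becomes 3894 test rows and the remaining 35038 go to training. The sketch below reproduces that arithmetic on a list of row numbers, used here as a stand-in for the real texts.

```python
import math
from sklearn.model_selection import train_test_split

n_rows = 38932                 # number of rows reported by data.shape
row_ids = list(range(n_rows))  # stand-in for the real review texts

train_ids, test_ids = train_test_split(row_ids, test_size=0.1, random_state=225)

# The test size is the fraction of the whole data set, rounded up
expected_test = math.ceil(0.1 * n_rows)

print(len(train_ids), len(test_ids))  # 35038 3894
```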


# Training

## Defining the Model

We will build the model for this project with **TF-IDF** as the vectorizer and **Logistic Regression** as the classifier.

We choose so because it is ...  *(insert reason here)*

Other options for vectorizers are `CountVectorizer` and `HashingVectorizer`. As for classifiers, the options include:

1.   sklearn.ensemble `RandomForestClassifier`,
2.   sklearn.naive_bayes `BernoulliNB`,
3.   sklearn.svm `SVC`
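
As an illustration of how one of these alternatives could be swapped in, the sketch below pairs `CountVectorizer` with `BernoulliNB` on a handful of invented reviews. The texts, labels, and the name `alt_model` are assumptions for this example only, and the notebook itself continues with TF-IDF and Logistic Regression.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Invented toy reviews, for illustration only
texts = [
    "great hotel lovely staff",
    "wonderful stay clean room",
    "dirty room rude staff",
    "terrible stay awful hotel",
]
labels = ["positive", "positive", "negative", "negative"]

# binary=True records word presence/absence, which suits BernoulliNB
alt_model = Pipeline([
    ('vectorizer', CountVectorizer(binary=True)),
    ('classifier', BernoulliNB()),
])
alt_model.fit(texts, labels)

alt_prediction = alt_model.predict(["clean room and lovely staff"])
print(alt_prediction)
```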

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tvec = TfidfVectorizer()
clf2 = LogisticRegression()

## Create Model Pipeline

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Here, the steps are our vectorizer and classifier.

In [None]:
from sklearn.pipeline import Pipeline

model = Pipeline([('vectorizer',tvec)
                 ,('classifier',clf2)])

model.fit(attribute_train, target_train)



Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inter
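
To illustrate what "cross-validated together while setting different parameters" can look like, here is a minimal sketch that runs `GridSearchCV` over a pipeline with the same two step names. The toy reviews, the parameter grid, and the names `toy_model` and `search` are assumptions for this example; parameters of a step are addressed as `<step name>__<parameter>`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy reviews, for illustration only
texts = [
    "great hotel lovely staff", "wonderful stay clean room",
    "friendly service great location", "lovely room wonderful view",
    "dirty room rude staff", "terrible stay awful hotel",
    "noisy room bad service", "awful location dirty bathroom",
]
labels = ["positive"] * 4 + ["negative"] * 4

toy_model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

# Each key is '<step name>__<parameter name>'
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1.0, 10.0],
}

# Every parameter combination is refit, vectorizer included, on each of the 2 folds
search = GridSearchCV(toy_model, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```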

Below is a phrase to be used as an example to test the model above, which outputs the verdict of what the predicted sentiment is. Try it~

[Here](https://www.tripadvisor.com/Hotel_Review-g152515-d503041-Reviews-Hotel_Riu_Palace_Cabo_San_Lucas-Cabo_San_Lucas_Los_Cabos_Baja_California.html#REVIEWS) is a source of more hotel reviews from Trip Advisor that you can copy and paste into the variable `example_text`.

In [None]:
example_text = ["I'm very happy now"]
example_result = model.predict(example_text)

print(example_result)

['positive']


# Testing

## Test with attribute_test

We will run the model on `attribute_test` and then compare its predictions with the actual labels in `target_test`.

After that, we display the *confusion matrix*, also known as an error matrix: a table layout that allows visualization of the performance of a classification algorithm.

<img height="200" src="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/60900/versions/13/screenshot.png" alt="Confusion Matrix" />

In [None]:
from sklearn.metrics import confusion_matrix

verdict = model.predict(attribute_test)

# scikit-learn expects (y_true, y_pred): rows are actual labels, columns are predictions
confusion_matrix(target_test, verdict)

array([[ 989,  334],
       [ 147, 2424]])
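
scikit-learn documents the call as `confusion_matrix(y_true, y_pred)`, with one row per actual label and one column per predicted label, in sorted label order unless `labels` is given. The sketch below shows how to read the four cells on five invented label pairs (not the real test set).

```python
from sklearn.metrics import confusion_matrix

# Invented labels, for illustration only
y_true = ['negative', 'negative', 'positive', 'positive', 'positive']
y_pred = ['negative', 'positive', 'positive', 'positive', 'negative']

# Fix the label order explicitly so the rows and columns are unambiguous
matrix = confusion_matrix(y_true, y_pred, labels=['negative', 'positive'])

# Treating 'positive' as the positive class:
# row 0 = actual negative, row 1 = actual positive
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tn, fp, fn, tp)  # 1 1 1 2
```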

Display the accuracy, precision, and recall obtained by comparing the predictions in `verdict` with the actual labels in `target_test`.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# scikit-learn metrics expect (y_true, y_pred), in that order
print("Accuracy  : %.4f" % accuracy_score(target_test, verdict))
print("Precision : %.4f" % precision_score(target_test, verdict, average = 'weighted'))
print("Recall    : %.4f" % recall_score(target_test, verdict, average = 'weighted'))

Accuracy  : 0.8765
Precision : 0.8761
Recall    : 0.8765
