# Load Data

## Connect to Google Drive

First things first: connect this Google Colab project to Google Drive so that the data file is reachable from the notebook.

The cell below then loads `train.csv` into a pandas DataFrame.

In [25]:
import pandas as pd

data = pd.read_csv('train.csv')
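
The cell above only reads `train.csv` from the working directory; the Drive connection itself is not shown. A minimal sketch of that step, assuming the notebook runs in Google Colab and that the conventional mount point `/content/drive` is acceptable (the helper name `mount_drive` is hypothetical, not part of the original notebook):

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


if not mount_drive():
    print("Not running in Colab; skipping the Drive mount.")
```

After mounting, the CSV path would normally point somewhere inside the mounted folder; the exact location is not given in this notebook.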

## Displaying the Data

After the data has been successfully read, we can inspect different aspects of it programmatically.

The snippet below outputs the number of rows and columns as a `(rows, columns)` tuple.

In [26]:
data.shape

(38932, 5)

The snippet below outputs `n` randomly selected rows (here, `n = 5`).

In [27]:
data.sample(5)

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
26002,id36328,Interesting Hotel ... The location and views o...,Edge,Desktop,not happy
38708,id49034,great location \nbest service\nthe suite are v...,Google Chrome,Mobile,happy
25695,id36021,Lucky to have stayed there during our visit to...,Mozilla Firefox,Mobile,happy
38818,id49144,After reading a few of the reviews of this hot...,InternetExplorer,Mobile,happy
29811,id40137,We love boutique hotels until we stayed at the...,Firefox,Desktop,not happy


The snippet below outputs summary statistics for each column.

In [28]:
data.describe()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
count,38932,38932,38932,38932,38932
unique,38932,38932,11,3,2
top,id24042,After a disastourous check in I.E.given the wr...,Firefox,Desktop,happy
freq,1,1,7367,15026,26521


The snippet below outputs the count of each target value.

In [29]:
data['Is_Response'].value_counts()

happy        26521
not happy    12411
Name: Is_Response, dtype: int64
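
The counts above show that `happy` reviews outnumber `not happy` reviews by roughly two to one. As a small sketch, the same check could be expressed as proportions; the toy frame below stands in for `train.csv` and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the real data: three "happy" rows and one "not happy" row.
toy = pd.DataFrame({"Is_Response": ["happy", "happy", "happy", "not happy"]})

# normalize=True returns class shares instead of raw counts.
shares = toy["Is_Response"].value_counts(normalize=True)
print(shares)
```

Knowing the class shares up front helps when interpreting accuracy later, since always predicting the majority class already scores the majority share.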

In this project, we'll use only the `Description` and `Is_Response` columns.

Later on, we'll store the review text in a variable named `attribute` and the `Is_Response` labels in a variable named `target`.

# Preprocessing


## Column Handling

First, we drop the columns that are irrelevant to this project's sentiment analysis: `User_ID`, `Browser_Used`, and `Device_Used`.

In [30]:
data.drop(columns = ['User_ID', 'Browser_Used', 'Device_Used'], inplace = True)

Next, we change the `Is_Response` column values from "happy" and "not happy" to "positive" and "negative".

In [31]:
data['Is_Response'] = data['Is_Response'].map({'happy' : 'positive', 'not happy' : 'negative'})

data.sample(3)

Unnamed: 0,Description,Is_Response
29121,"Very nice and excellent located hotel, really ...",positive
13065,"My kids, my dog and myself stayed here while m...",negative
35033,Very comfortable clean rooms at a very reasona...,positive
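
One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. A short sketch of how unmapped labels could be detected, using a toy series with a deliberately misspelled label (the toy values are assumptions, not taken from the data set):

```python
import pandas as pd

# The last value is a made-up typo to show what happens to unknown labels.
labels = pd.Series(["happy", "not happy", "Happy "])
mapping = {"happy": "positive", "not happy": "negative"}

mapped = labels.map(mapping)

# Values absent from the mapping turn into NaN; list the originals that failed.
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

For the data set used here, `value_counts()` showed only the two expected labels, so no `NaN` values would be expected after the mapping.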


## Text Cleaning

We clean the text by removing punctuation. This step also removes Twitter usernames (`@username`) and website links (those starting with `http` or `www.`). The matching text is located with regular expressions.


In [32]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
twitter_handle = r'@[A-Za-z0-9_]+'                         # matches Twitter handles (@username)
url_handle = r'http[^ ]+'                                  # matches URLs that start with 'http' (covers http:// and https://)
combined_handle = r'|'.join((twitter_handle, url_handle))  # single pattern matching either of the two above
www_handle = r'www\.[^ ]+'                                 # matches URLs that start with 'www.' (dot escaped so it is literal)
punctuation_handle = r'\W+'                                # matches runs of non-word characters (punctuation, whitespace)
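
To see what each pattern does, the sketch below applies them one after another to a made-up sentence. It uses only the standard library, and it escapes the dot in the `www.` pattern, which is an assumption about the intended behavior:

```python
import re

twitter_handle = r'@[A-Za-z0-9_]+'
url_handle = r'http[^ ]+'
combined_handle = r'|'.join((twitter_handle, url_handle))
www_handle = r'www\.[^ ]+'
punctuation_handle = r'\W+'

# Made-up sample text containing a handle, two kinds of links, and punctuation.
sample = "Loved it @hotel_fan see http://example.com or www.example.com !!!"

step1 = re.sub(combined_handle, '', sample)             # drop handles and http links
step2 = re.sub(www_handle, '', step1).lower()           # drop www links, then lowercase
step3 = re.sub(punctuation_handle, ' ', step2).strip()  # punctuation runs become one space
print(step3)  # loved it see or
```

Because `\W+` also matches whitespace, the double spaces left behind by the earlier removals collapse into single spaces in the last step.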

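In [34]:
# Load the stop words (a file with a `stopword` header and one word per line) into a set for fast lookups
stopwords = set(pd.read_csv('stopword_en.txt', sep='\n', header=0).stopword)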
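
The cell above relies on pandas accepting `sep='\n'`, which may not hold for every pandas version. A dependency-free sketch of the same idea follows, assuming the file has a `stopword` header line followed by one word per line (inferred from `header=0` and `.stopword`). The helper name `load_stopwords` and the temporary file are hypothetical and only serve the demonstration:

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line, skipping the first (header) line."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle]
    return {word for word in lines[1:] if word}


# Throwaway file standing in for stopword_en.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("stopword\nif\nabove\na\n")

demo_stopwords = load_stopwords(tmp.name)
os.remove(tmp.name)
print(sorted(demo_stopwords))  # ['a', 'above', 'if']
```

A set is a reasonable container here because the cleaning function tests membership once per word.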

Define a function called `process_text` that applies the cleaning steps listed above.

In [35]:
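def process_text(text):
    # Strip any HTML markup and keep only the visible text
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()

    # Python 2 leftover: `str` has no `.decode()` in Python 3, so the fallback branch is used there
    try:
        text = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        text = souped

    # Remove handles and URLs, lowercase, then replace punctuation runs with spaces
    cleaned_text = re.sub(punctuation_handle, " ", re.sub(www_handle, '', re.sub(combined_handle, '', text)).lower())
    # Drop stop words
    cleaned_text = ' '.join([word for word in cleaned_text.split() if word not in stopwords])

    # Tokenize and keep only tokens longer than one character
    return " ".join([word for word in tokenizer.tokenize(cleaned_text) if len(word) > 1]).strip()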

Below is an example input for testing the text-cleaning function above. Try it with your own text.

In [36]:
example_text = "hahaha if above a ----'-' www.adasd apakah SAYA ingin pergi pada tanggal 15 bulan februari besok ? tidak karena hari kemarin @twitter suka main https://www.twitter.com"

process_text(example_text)

'hahaha apakah saya ingin pergi pada tanggal 15 bulan februari besok tidak karena hari kemarin suka main'

Next, we create a new column in our data named `clean_text` to store the cleaned text.

We process every row of the `Description` column, which holds the raw text from the .csv data, and then concatenate the new `clean_text` column to the original data.

In [37]:
cleaned_text = []

for text in data.Description:
    cleaned_text.append(process_text(text))

clean_text = pd.DataFrame({'clean_text' : cleaned_text})
data = pd.concat([data, clean_text], axis = 1)

data.sample(5)

Unnamed: 0,Description,Is_Response,clean_text
28690,This was a great hotel....Marriott does not di...,positive,great hotel marriott does disappoint adults sm...
9631,We stayed at this hotel in April ---- and had ...,negative,stayed hotel april booked hotwire website chos...
3220,We stayed for a few nights on business. We lov...,positive,stayed nights business loved don compare ve in...
32532,Surprised with the quality of the room given t...,positive,surprised quality room given rates compared ho...
4970,We stayed at the Raffaello on our way to O'Har...,positive,stayed raffaello way hare party adult children...
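
The loop-and-concatenate approach works here because the DataFrame still has its default row index. As an alternative sketch, assigning the result of `Series.apply` keeps each cleaned text on the same row regardless of the index. The function `toy_clean` is a hypothetical stand-in for `process_text`, and the toy rows are made up:

```python
import pandas as pd


def toy_clean(text):
    # Stand-in for process_text: lowercase and collapse whitespace only.
    return " ".join(text.lower().split())


# A non-default index, as would result from filtering rows beforehand.
toy = pd.DataFrame({"Description": ["Great  HOTEL", "Bad   service"]},
                   index=[10, 20])

# Column assignment aligns on the existing index, so rows stay matched.
toy["clean_text"] = toy["Description"].apply(toy_clean)
print(toy)
```

With `pd.concat(..., axis = 1)`, a freshly built DataFrame gets the index 0, 1, ..., which would not line up with rows indexed 10 and 20.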


## Splitting Train Data

Here we set the variable `attribute` to hold the cleaned review texts and the variable `target` to hold each review's label (positive or negative).

In [38]:
from sklearn.model_selection import train_test_split

attribute = data.clean_text
target = data.Is_Response

We split the entire data set into four variables, `attribute_train`, `attribute_test`, `target_train`, and `target_test`, with a train-to-test ratio of 9:1.

The ratio is passed as `test_size = 0.1`, which reserves 10% of the whole data set for testing.

After that, we print the length of each of the four variables to see how the data is distributed among them.

In [39]:
attribute_train, attribute_test, target_train, target_test = train_test_split(attribute, target, test_size = 0.1, random_state = 225)

print('attribute_train :', len(attribute_train))
print('attribute_test  :', len(attribute_test))
print('target_train :', len(target_train))
print('target_test  :', len(target_test))

attribute_train : 35038
attribute_test  : 3894
target_train : 35038
target_test  : 3894
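
Since the two classes are not equally frequent, one option worth considering is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses the `stratify` argument of `train_test_split` on made-up data; it is an illustration, not what the notebook's split above does:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 70 positive and 30 negative reviews.
texts = [f"review {i}" for i in range(100)]
labels = ["positive"] * 70 + ["negative"] * 30

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=225, stratify=labels
)

print(len(x_train), len(x_test))  # 90 10
print(Counter(y_test))            # 7 positive and 3 negative
```

Without `stratify`, the class mix of a small test set can drift from the overall mix purely by chance.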


# Training

## Defining the Model

We will build the model for this project using **TF-IDF** as the vectorizer and **Logistic Regression** as the classifier.

We choose so because it is ...  *(insert reason here)*

Other vectorizer options are `CountVectorizer` and `HashingVectorizer`. Classifier alternatives include:

1.   sklearn.ensemble `RandomForestClassifier`,
2.   sklearn.naive_bayes `BernoulliNB`,
3.   sklearn.svm `SVC`
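
As a rough sketch of how one of these alternatives could be swapped in, the example below pairs `CountVectorizer` with `BernoulliNB` inside a scikit-learn `Pipeline`. The four training sentences are made up, so the prediction only demonstrates that the pieces fit together:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, two examples per class.
texts = ["great clean room", "friendly staff great stay",
         "dirty room rude staff", "terrible noisy dirty"]
labels = ["positive", "positive", "negative", "negative"]

alt_model = Pipeline([
    ("vectorizer", CountVectorizer()),  # raw term counts instead of TF-IDF
    ("classifier", BernoulliNB()),      # Naive Bayes over binarized features
])
alt_model.fit(texts, labels)

prediction = alt_model.predict(["great friendly staff"])
print(prediction)
```

Because every combination exposes the same `fit` and `predict` interface, the rest of the notebook would not need to change when a different pair is tried.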

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tvec = TfidfVectorizer()
clf2 = LogisticRegression()

## Create Model Pipeline

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Here, the steps are our vectorizer and classifier.

In [41]:
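from sklearn.pipeline import Pipeline

# Chain the vectorizer and the classifier so they are fitted and applied together
model = Pipeline([
    ('vectorizer', tvec),
    ('classifier', clf2),
])

# Fit both steps on the training split
model.fit(attribute_train, target_train)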



Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inter
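
The pipeline description mentions cross-validating the steps together while setting different parameters. A minimal sketch of what that could look like with `GridSearchCV` follows; the toy sentences, the parameter grid, and `cv=2` are all assumptions chosen to keep the example small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up training set, four examples per class.
texts = ["great clean room", "friendly staff great stay",
         "lovely view nice staff", "comfortable bed great location",
         "dirty room rude staff", "terrible noisy dirty",
         "awful smell bad service", "rude staff broken shower"]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# Step parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so vocabulary from the validation fold does not leak into training.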

In [51]:
import speech_recognition as sr 
import os 
from pydub import AudioSegment
from pydub.silence import split_on_silence

r = sr.Recognizer()
with sr.Microphone() as source:
    # read the audio data from the default microphone
    audio_data = r.record(source, duration=5)
    print("Recognizing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    
print(text)

Recognizing...
service good service


In [53]:
example_text = [text]
example_result = model.predict(example_text)
print(example_result)

['positive']


In [63]:
import numpy as np

In [59]:
ts = example_result.tobytes()  # raw bytes of the array, not the label text; superseded by the next cell

In [64]:
ts = np.array_str(example_result)
print("The string representation of input array : ", ts)
print(type(ts))


The string representation of input array :  ['positive']
<class 'str'>
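
Note that `np.array_str` keeps the brackets and quotes, so the string is `['positive']` rather than the bare label. If only the label should be spoken later, one possible approach is to take the first element of the prediction array. The array below is a stand-in for the output of `model.predict`:

```python
import numpy as np

# Stand-in for the array returned by model.predict([...]).
example_result = np.array(["positive"])

as_repr = np.array_str(example_result)  # includes brackets and quotes
label = str(example_result[0])          # just the label text

print(as_repr)  # ['positive']
print(label)    # positive
```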


In [65]:
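from gtts import gTTS
import os

# Language in which to synthesize the speech
language = 'en'

# Convert the text to speech at normal speed
myobj = gTTS(text=ts, lang=language, slow=False)

# Save the synthesized audio to an MP3 file
myobj.save("demo.mp3")

# Play the saved file (requires the mpg321 command-line player)
os.system("mpg321 demo.mp3")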


1
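
The `1` above is the value returned by `os.system`, and a nonzero value generally signals that the command did not succeed. One possible cause, which is an assumption here, is that `mpg321` is not installed. A small sketch of a more defensive playback helper follows (the name `play_mp3` is hypothetical):

```python
import shutil
import subprocess


def play_mp3(path, player="mpg321"):
    """Play an MP3 with a command-line player; return True if it exited cleanly."""
    executable = shutil.which(player)
    if executable is None:
        print(f"{player} was not found on PATH; skipping playback.")
        return False
    completed = subprocess.run([executable, path])
    return completed.returncode == 0


# A player name that should not exist, to show the fallback path.
played = play_mp3("demo.mp3", player="no-such-player-for-this-demo")
print(played)  # False
```

Checking for the executable first turns a silent failure into an explicit message.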

# Testing

In [54]:
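from sklearn.metrics import confusion_matrix

# Predict labels for the held-out test split
verdict = model.predict(attribute_test)

# Predictions are passed first here, so rows are predicted labels and columns are true labels
# (scikit-learn's documented order is confusion_matrix(y_true, y_pred))
confusion_matrix(verdict, target_test)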

array([[ 989,  147],
       [ 334, 2424]], dtype=int64)
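
scikit-learn documents the signature as `confusion_matrix(y_true, y_pred)`, with rows for true labels and columns for predicted labels. The call above passes the predictions first, so its matrix is the transpose of that convention. The toy example below, with made-up labels, shows the effect of the argument order:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three true negatives-class items, two true positive-class items.
y_true = ["negative", "negative", "negative", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "positive"]
labels = ["negative", "positive"]

# Documented order: rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Swapped order: the result is the transpose.
cm_swapped = confusion_matrix(y_pred, y_true, labels=labels)

print(cm)          # [[1 2] [0 2]]
print(cm_swapped)  # [[1 0] [2 2]]
```

Passing `labels` explicitly also pins down which class occupies the first row and column.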

Display the accuracy, precision, and recall obtained by comparing the predictions in `verdict` with the actual labels in `target_test`.

In [55]:
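from sklearn.metrics import accuracy_score, precision_score, recall_score

# Note: scikit-learn documents the argument order as (y_true, y_pred). Accuracy is symmetric,
# but with predictions passed first, per-class precision and recall are interchanged
# and the weighting uses the predicted-label counts.
print("Accuracy : ", accuracy_score(verdict, target_test))
print("Precision : ", precision_score(verdict, target_test, average = 'weighted'))
print("Recall : ", recall_score(verdict, target_test, average = 'weighted'))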

Accuracy :  0.8764766307139189
Precision :  0.8858545002516267
Recall :  0.8764766307139189
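
Weighted averages can hide how each class performs on its own. As a hedged sketch, per-class precision and recall can be obtained with the `pos_label` argument, passing the true labels first as scikit-learn documents. The labels below are made up and unrelated to the results above:

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["negative", "negative", "negative", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "positive"]

# Of the four items predicted "positive", two really are positive.
precision = precision_score(y_true, y_pred, pos_label="positive")

# Both truly positive items were predicted "positive".
recall = recall_score(y_true, y_pred, pos_label="positive")

print(precision)  # 0.5
print(recall)     # 1.0
```

Repeating the calls with `pos_label="negative"` gives the corresponding values for the other class.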
