##### Title: Exercise 10.2
##### Author: Jerock Kalala
##### Date: February 19th 2023
##### Modified By: --
##### Exercise: Sentiment Analysis


Using the hotel reviews dataset, create a sentiment analysis model using at least one of the methods described this week (you’re welcome to create more than one). Be sure to have three data slices - train, validation, and test as specified in the text.
Note that the "sentiment" is expressed in the Is_Response column, which should be label encoded (0=happy, 1=not happy). If you're not familiar with label encoding, check this out:

1. Import dependencies and data

In [25]:
import pandas as pd
import spacy
#import text_normalizer as tn

df = pd.read_csv("E:\\Bellevue\\Winter_2022\\DSC 360 Data Mining Text Analytics an\\Week_10\\hotel-reviews.csv", encoding='ISO-8859-1')
df.head()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


In [26]:
df['Is_Response'].value_counts()

happy        26521
not happy    12411
Name: Is_Response, dtype: int64

2. Preprocessing

Let's drop the columns that are irrelevant to this project's sentiment analysis: User_ID, Browser_Used, and Device_Used.

In [27]:
df.drop(columns = ['User_ID', 'Browser_Used', 'Device_Used'], inplace = True)

Next, we change the Is_Response column values from "happy" and "not happy" to "positive" and "negative".

In [28]:
df['Is_Response'] = df['Is_Response'].map({'happy' : 'positive', 'not happy' : 'negative'})
df.sample(3)

Unnamed: 0,Description,Is_Response
35011,I just returned from a --day stay at the Sofit...,positive
11733,Stayed here on December --th for - night befor...,positive
31015,The hotel is situated in the heart of the busi...,positive
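
The assignment text asks for Is_Response to be label encoded (0 = happy, 1 = not happy), while the cell above keeps string labels. A minimal sketch of the numeric encoding follows. The DataFrame `toy`, the dictionary `label_map`, and the `label` column are hypothetical names used only for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the hotel reviews data (not the real file).
toy = pd.DataFrame({
    "Description": ["Lovely stay", "Dirty room", "Great staff"],
    "Is_Response": ["happy", "not happy", "happy"],
})

# Encoding requested by the assignment: 0 = happy, 1 = not happy.
label_map = {"happy": 0, "not happy": 1}
toy["label"] = toy["Is_Response"].map(label_map)

# Any value missing from label_map would become NaN, so check for that.
assert toy["label"].notna().all()

print(toy["label"].tolist())
```

Scikit-learn classifiers accept either string or integer targets, so the choice mainly affects how the predictions are reported.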


3. Text Cleaning

We will clean the text by removing Twitter handles, URLs, punctuation, stop words, and single-character tokens.

In [32]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
twitter_handle = r'@[A-Za-z0-9_]+'                         # Twitter handles (@username)
url_handle = r'http[^ ]+'                                  # URLs that start with 'http' or 'https'
combined_handle = r'|'.join((twitter_handle, url_handle))  # match either a handle or a URL
www_handle = r'www\.[^ ]+'                                 # URLs that start with 'www.' (dot escaped)
punctuation_handle = r'\W+'                                # runs of non-word characters (punctuation, symbols, whitespace)

In [33]:
#Load the stop words from a local text file
stopwords = pd.read_csv('stopwords.txt')

In [34]:
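def process_text(text):
    # Strip any HTML markup and keep only the visible text
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()

    # get_text() returns a str, which has no .decode() in Python 3,
    # so the fallback branch is the one that runs
    try:
        text = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except AttributeError:
        text = souped

    # Remove handles and URLs, lowercase, then replace non-word characters with spaces
    cleaned_text = re.sub(punctuation_handle, " ",(re.sub(www_handle, '', re.sub(combined_handle, '', text)).lower()))
    # Filter against the stop-word list (note: `in` on a DataFrame tests column labels)
    cleaned_text = ' '.join([word for word in cleaned_text.split() if word not in stopwords])

    # Tokenize and discard single-character tokens
    return (" ".join([word for word in tokenizer.tokenize(cleaned_text) if len(word) > 1])).strip()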

In [35]:
#Let's test our text cleaning method with the input below

example_text = "hahaha if above a ----'-' www.adasd apakah SAYA ingin pergi pada tanggal 15 bulan februari besok ? tidak karena hari kemarin @twitter suka main https://www.twitter.com"

process_text(example_text)



'hahaha if above apakah saya ingin pergi pada tanggal 15 bulan februari besok tidak karena hari kemarin suka main'
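
The cleaned output above still contains "if" and "above", which most English stop-word lists include. One plausible explanation is that `word not in stopwords` is evaluated against a DataFrame, where `in` tests column labels rather than cell values. The sketch below shows the difference and a set-based alternative. It assumes stopwords.txt holds one word per line (the file format is not shown in the notebook); `fake_file`, `stop_df`, `stop_set`, and `drop_stop_words` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical stand-in for stopwords.txt, assuming one word per line.
fake_file = io.StringIO("the\nif\nabove\nis\n")

# header=None keeps the first word as data instead of using it as a column name;
# keep_default_na=False stops words such as "null" from being read as missing.
stop_df = pd.read_csv(fake_file, header=None, names=["word"], keep_default_na=False)

# `in` on a DataFrame checks column labels, not the cell values.
print("if" in stop_df)

# A set of the values gives the membership test the filter needs.
stop_set = set(stop_df["word"].str.lower())

def drop_stop_words(text):
    """Return text with every word found in stop_set removed."""
    return " ".join(w for w in text.split() if w not in stop_set)

print(drop_stop_words("hahaha if above apakah saya"))
```

Switching the notebook to a set like this would change the cleaned text and therefore the scores reported later, so the downstream cells would need to be rerun.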

In [36]:
# Apply the cleaning function to every review and store the result in a new column
df['clean_text'] = df['Description'].apply(process_text)

df.sample(5)



Unnamed: 0,Description,Is_Response,clean_text
23593,The location is perfect. Many restaurants and ...,positive,the location is perfect many restaurants and a...
8248,LOCATION LOCATION LOCATION\r\nThe hotel is - b...,negative,location location location the hotel is blocks...
23204,We asked for and got a -nd floor room at the r...,positive,we asked for and got nd floor room at the rear...
33324,We stayed here for - nights. Our wedding night...,positive,we stayed here for nights our wedding night an...
8276,I stayed at The Chandler Inn (Boston) Dec.-- -...,positive,stayed at the chandler inn boston dec it was e...


4. Splitting the Data into Train and Test Sets

In [38]:
from sklearn.model_selection import train_test_split

x = df.clean_text
y = df.Is_Response

In [54]:
#Split the data set into train and test portions, which gives four variables

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state = 225)

print('Attribute or x_train :', len(x_train))
print('Attribute or x_test  :', len(x_test))
print('Target    or y_train :', len(y_train))
print('Target    or y_test  :', len(y_test))

Attribute or x_train : 35038
Attribute or x_test  : 3894
Target    or y_train : 35038
Target    or y_test  : 3894
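
The assignment asks for three slices (train, validation, and test), whereas the split above produces two. A minimal sketch of one way to get three is to call `train_test_split` twice. The toy lists `texts` and `labels`, the choice to make the validation set the same size as the test set, and the use of `stratify` are assumptions for illustration, not taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df.clean_text and df.Is_Response.
texts = [f"review {i}" for i in range(100)]
labels = ["positive" if i % 3 else "negative" for i in range(100)]

# First hold out 10% as the test set, as in the cell above.
x_rest, x_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=225, stratify=labels
)

# Then carve a validation set of the same size out of the remainder.
# An integer test_size is read as an absolute number of samples.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=len(x_test), random_state=225, stratify=y_rest
)

print(len(x_train), len(x_val), len(x_test))
```

The validation slice would then be used for model selection and tuning, leaving the test slice untouched until the final evaluation.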


5. Training (Model)

We will train the model with a pipeline that vectorizes the text using TF-IDF and classifies it using logistic regression.

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tvector = TfidfVectorizer()
classif = LogisticRegression()

In [55]:
#Create Model Pipeline

from sklearn.pipeline import Pipeline

model = Pipeline([('Vectorizer',tvector),('Classifier',classif)])

model.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
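
The warning above says the solver stopped at its iteration limit before converging and suggests raising `max_iter`. A minimal sketch of that suggestion on toy data follows. The value 1000, the toy reviews, and the name `toy_model` are illustrative assumptions, not settings from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews; the real notebook fits on x_train and y_train.
toy_texts = [
    "great room and friendly staff",
    "lovely clean hotel great location",
    "dirty room and rude staff",
    "terrible smell and noisy location",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# max_iter=1000 is an illustrative value, not one taken from the notebook.
toy_model = Pipeline([
    ("Vectorizer", TfidfVectorizer()),
    ("Classifier", LogisticRegression(max_iter=1000)),
])
toy_model.fit(toy_texts, toy_labels)

# n_iter_ reports how many iterations the solver actually used.
clf = toy_model.named_steps["Classifier"]
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter` and no warning appears, the solver converged. Whether a higher limit changes the scores on the real data would have to be checked by rerunning the notebook.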


6. Testing

- Below is an example phrase used to test the model above; the model outputs its predicted sentiment.

In [56]:
example_text = ["This was an amazing place. I'm very happy now"]
example_result = model.predict(example_text)

print(example_result)

['positive']


- We will now run the model on x_test.

In [50]:
res = model.predict(x_test)
print(res[:20])

['positive' 'positive' 'positive' 'positive' 'positive' 'positive'
 'positive' 'positive' 'negative' 'positive' 'negative' 'positive'
 'positive' 'positive' 'positive' 'positive' 'positive' 'positive'
 'positive' 'positive']


7. Evaluation

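A confusion matrix, also known as an error matrix, is a table layout that visualizes the performance of a classification algorithm. Because the predictions are passed as the first argument below, the rows correspond to predicted labels and the columns to actual labels.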

In [46]:
from sklearn.metrics import confusion_matrix

verdict = model.predict(x_test)

confusion_matrix(verdict, y_test)

array([[1023,  160],
       [ 300, 2411]], dtype=int64)

Display the accuracy, precision, and recall obtained by comparing the predictions in verdict with the actual labels in y_test.

In [47]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy : ", accuracy_score(verdict, y_test))
print("Precision : ", precision_score(verdict, y_test, average = 'weighted'))
print("Recall : ", recall_score(verdict, y_test, average = 'weighted'))

Accuracy :  0.881869542886492
Precision :  0.8877846606422838
Recall :  0.881869542886492
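
The weighted scores above fold both classes into one number, and the metric functions were called with the predictions first, whereas scikit-learn documents the order as `(y_true, y_pred)`. As a rough cross-check, per-class precision and recall can be derived directly from the confusion matrix counts. This sketch assumes the label order is negative then positive (scikit-learn sorts labels) and that rows in the printed matrix are predictions; the variable names are hypothetical.

```python
import numpy as np

# Counts from the matrix above, where rows are predictions and columns are
# true labels because the predictions were passed as the first argument.
pred_by_true = np.array([[1023, 160],
                         [300, 2411]])

# Transpose to scikit-learn's convention: rows = true, columns = predicted.
cm = pred_by_true.T
labels = ["negative", "positive"]

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct / all predicted as the class
recall = true_pos / cm.sum(axis=1)      # correct / all truly in the class
accuracy = true_pos.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Looking at each class separately shows how the model does on the smaller "negative" class, which a single weighted figure can hide.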
