# FIT5120 - Industry Experience Studio Project S1 2022

### Project Name: HOTEL REVIEW ASSISTANT
### Task Name: Data Visualization - Iteration 2 - Bubble Chart


Team information
- Team Name: AntiFake
- Team Number: TA 36

Date: 02/05/2022

Version: 1.0

Programming Language: Python 3.8 and Jupyter notebook

Python libraries used:
- pandas (for data manipulation and analysis)
- numpy (for building the fake detection algorithm)
- re (for data extraction)
- googletrans (for translating non-English reviews)
- os (for file path handling)
- nltk (for natural language processing)
- matplotlib (for data visualization)
- textblob (for processing textual data)
- wordcloud (imported for word cloud generation)
- transformers (for the pre-trained sentiment classification model)

## Table of Contents

* [1. Import Library](#sec_1)
* [2. Data Wrangling](#sec_2)
* [3. Natural Language Processing](#sec_3)
* [4. Word Frequency](#sec_4)
* [5. Sentiment Classification](#sec_5)
* [6. Preprocessing Data for Visualization](#sec_6)

### 1. Import Library

In [163]:
import pandas as pd
import numpy as np
import re
from googletrans import Translator
from os import path
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
import matplotlib.pyplot as plt
from textblob import TextBlob

### 2. Data Wrangling 

In [164]:
df_review = pd.read_csv('reviews.csv')
df_listing = pd.read_csv('listings.csv')
df_listing = df_listing.loc[~df_listing.name.isnull()] # remove null in names
df_listing = df_listing.rename(columns={'id':'listing_id'})
# Merge the listing and review dataset
df_all = df_review.merge(df_listing, on='listing_id', how='left')
words = df_all[['listing_id','name', 'comments']].drop_duplicates()
words

index,listing_id,name,comments
0,9835,Beautiful Room & House,"Very hospitable, much appreciated.\r<br/>"
1,9835,Beautiful Room & House,A beautiful house in a lovely quiet neighbourh...
2,9835,Beautiful Room & House,This was my first time using airbnb and it was...
3,9835,Beautiful Room & House,I was visiting Melbourne to spend time with my...
4,12936,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,Perfect apartment in a perfect location!!!! \r...
...,...,...,...
466954,53612627,Lovely 2-Bedroom Luxuary Apartment with Proper...,- Excellent communications <br/>- I was given ...
466955,53612891,2 Bedroom · 2 Bedroom · MC Magpie Cottage,"Accomodation was clean, in a good location and..."
466956,53613335,Fantastic view 2BR1BA apt in Melbourne,one of the best views we've had in the city. w...
466957,53645767,Private Bedroom in North Melbourne,I booked Hisham's place at last moment at nigh...


In [165]:
# Aggregating the reviews by listing id
df_aslist = words.groupby('listing_id').aggregate(lambda x: list(x)).reset_index()
df_aslist.loc[:, 'new_name'] = df_aslist.name.map(lambda x: x[0])
# Drop the duplicate column
df_aslist = df_aslist.drop(['name'],axis = 1)
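The merge and aggregation steps above can be checked on a toy dataset. This is a minimal sketch: the two small DataFrames are hypothetical stand-ins for `reviews.csv` and `listings.csv`, and the names `reviews`, `listings`, `merged` and `grouped` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for reviews.csv and listings.csv (hypothetical values)
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": ["Great host", "Very clean", "Noisy street"],
})
listings = pd.DataFrame({"listing_id": [1, 2], "name": ["Loft", "Studio"]})

# Left merge keeps every review and attaches the listing name
merged = reviews.merge(listings, on="listing_id", how="left")

# Collect all values per listing into lists, then keep a single name
grouped = merged.groupby("listing_id").aggregate(lambda x: list(x)).reset_index()
grouped["new_name"] = grouped["name"].map(lambda names: names[0])
grouped = grouped.drop(columns=["name"])
print(grouped)
```

Each listing ends up as one row whose `comments` cell holds the list of all its reviews, which is the shape the later cells rely on.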

### 3. Natural Language Processing

In [46]:
# Join all reviews of each listing into a single comma-separated string
df_aslist['cleaned_comments'] = df_aslist['comments'].map(
    lambda comments: ','.join(str(v) for v in comments)
)
# Display the result
df_aslist

index,listing_id,comments,new_name,cleaned_comments
0,9835,"[Very hospitable, much appreciated.\r<br/>, A ...",Beautiful Room & House,"Very hospitable, much appreciated.\r<br/>,A be..."
1,12936,[Perfect apartment in a perfect location!!!! \...,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,Perfect apartment in a perfect location!!!! \r...
2,33111,"[Paul is a lovely guy, very helpful and friend...",Million Dollar Views Over Melbourne,"Paul is a lovely guy, very helpful and friendl..."
3,38271,[Darly and Dee were very very friendly and nic...,Melbourne - Old Trafford Apartment,Darly and Dee were very very friendly and nice...
4,41836,[Thanks for Diana\r<br/>She is a great host\r<...,CLOSE TO CITY & MELBOURNE AIRPORT,Thanks for Diana\r<br/>She is a great host\r<b...
...,...,...,...,...
13616,53612627,[- Excellent communications <br/>- I was given...,Lovely 2-Bedroom Luxuary Apartment with Proper...,- Excellent communications <br/>- I was given ...
13617,53612891,"[Accomodation was clean, in a good location an...",2 Bedroom · 2 Bedroom · MC Magpie Cottage,"Accomodation was clean, in a good location and..."
13618,53613335,[one of the best views we've had in the city. ...,Fantastic view 2BR1BA apt in Melbourne,one of the best views we've had in the city. w...
13619,53645767,[I booked Hisham's place at last moment at nig...,Private Bedroom in North Melbourne,I booked Hisham's place at last moment at nigh...


In [52]:
# Keep only the adjectives (POS tag "JJ") from the text
def get_adjectives(text):
    blob = TextBlob(text)
    return [ word for (word,tag) in blob.tags if tag == "JJ"]
# Add into new column
df_aslist['adjectives'] = df_aslist['cleaned_comments'].apply(get_adjectives)

In [93]:
# Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# For each listing, lower-case its adjectives and drop English stop words
l = []
for adjectives in df_aslist['adjectives']:
    property_word = []
    for w in adjectives:
        try:
            tokenized = word_tokenize(str(w))
            filtered_list = [word.lower() for word in tokenized if word.lower() not in stop_words]
            # stemmed_words = [stemmer.stem(word) for word in filtered_list]
            # lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
            property_word.append(filtered_list)
        except Exception:
            # Append a one-item list so the flattening step keeps the placeholder whole
            property_word.append(['not available'])
    # Flatten the per-word token lists into one list per listing
    property_word = [word for word_l in property_word for word in word_l]
    l.append(property_word)

### 4. Word Frequency

In [94]:
from nltk import FreqDist
import string

# Extra tokens to drop: markup residue, place names, and uninformative words
custom_stopwords = ['br/', 'mel', 'st', 'kilda', 'place', 'stay', 'us', 'would', 'even', 'made', "'s"]

freq = []
# Keep the meaningful words of each listing and count the 30 most common
for words_in_listing in l:
    meaningful_words = [word for word in words_in_listing if word.lower() not in custom_stopwords]
    meaningful_words = [word for word in meaningful_words if word.lower() not in stop_words]
    meaningful_words = [word for word in meaningful_words if word.lower() not in string.punctuation]
    frequency_distribution = FreqDist(meaningful_words)
    freq.append(frequency_distribution.most_common(30))
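`FreqDist` is a subclass of `collections.Counter`, so the filter-and-count step above can be tried without NLTK. The sketch below assumes a hypothetical word list and tiny stand-in stop-word sets; `demo_words`, `demo_stop_words` and `demo_custom` are illustrative names, not part of the notebook.

```python
from collections import Counter
import string

# Hypothetical adjectives for one listing, plus small stand-in filter sets
demo_words = ["great", "clean", "great", "stay", "!", "the", "quiet", "great", "clean"]
demo_stop_words = {"the", "a", "is"}
demo_custom = {"stay", "place"}

# Apply the same three filters as the cell above, in a single pass
kept = [
    w for w in demo_words
    if w.lower() not in demo_custom
    and w.lower() not in demo_stop_words
    and w not in string.punctuation
]

# most_common(n) returns (word, count) pairs sorted by descending count
top_words = Counter(kept).most_common(2)
print(top_words)  # [('great', 3), ('clean', 2)]
```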

### 5. Sentiment Classification

In [95]:
# Import pre-trained NLP model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Use a pipeline to classify text with the pre-trained model
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
generator = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

In [96]:
# Build the function that classifies the sentiment type of a piece of text
# Note: the data is grouped by listing ID and has missing values
def get_stars(message, generator):
    """
    Return the sentiment label (string) predicted for `message`.
    Only the first 128 characters are passed to the model.
    """
    message = message[:128]
    result = generator(message)[0]
    return result['label']

In [None]:
# Iterate over each listing's sentiment vocabulary and get each word's sentiment type
all_words_list = []
for top_words in freq:
    word_sentiment_list = []
    for word, _count in top_words:
        word_sentiment = get_stars(word[:10], generator)
        word_sentiment_list.append(word_sentiment)
    all_words_list.append(word_sentiment_list)
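The same adjectives (for example "great" or "clean") appear in the top-30 list of many listings, so each distinct word only needs to be classified once. The sketch below shows one possible way to do that with `functools.lru_cache`. `fake_generator` is a hypothetical stand-in that mimics the output shape of a text-classification pipeline (a list holding one dict with `label` and `score`); its labels are made up and are not model output.

```python
from functools import lru_cache

calls = []  # records every word that actually reaches the (fake) model

def fake_generator(text):
    """Hypothetical stand-in for the pipeline: returns [{'label': ..., 'score': ...}]."""
    calls.append(text)
    label = "POS" if text in {"great", "lovely"} else "NEU"
    return [{"label": label, "score": 1.0}]

@lru_cache(maxsize=None)
def cached_label(word):
    # Classify each distinct word once; repeats are served from the cache
    return fake_generator(word)[0]["label"]

# Hypothetical top-word lists for two listings, in the same shape as `freq`
demo_freq = [[("great", 3), ("quiet", 1)], [("great", 22), ("lovely", 2)]]
demo_labels = [[cached_label(word) for word, _count in top] for top in demo_freq]
print(demo_labels)
print(len(calls))  # number of calls that reached the fake model
```

With 13,621 listings and up to 30 words each, caching could cut the number of model calls substantially, although the actual saving depends on how much the vocabularies overlap.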

In [109]:
# Validate the result
all_words_list
len(freq)

13621

### 6. Preprocessing Data for Visualization

In [118]:
# Rename the classification labels
label_names = {'POS': 'Positive', 'NEU': 'Neutral', 'NEG': 'Negative'}
all_words_list = [
    [label_names.get(label, label) for label in labels]
    for labels in all_words_list
]

In [126]:
# Combine each (word, count) pair with its classification label
all_new_list = []
for top_words, labels in zip(freq, all_words_list):
    all_new_list.append([(*pair, label) for pair, label in zip(top_words, labels)])
# Display the result
all_new_list

[[('lovely', 3, 'Positive'),
  ('comfortable', 2, 'Neutral'),
  ('great', 2, 'Positive'),
  ('pleasant', 2, 'Positive'),
  ('hospitable', 1, 'Positive'),
  ('quiet', 1, 'Neutral'),
  ('quick', 1, 'Neutral'),
  ('rate', 1, 'Neutral'),
  ('reasonable', 1, 'Neutral'),
  ('welcome', 1, 'Positive'),
  ('first', 1, 'Neutral'),
  ('much', 1, 'Neutral'),
  ('easy', 1, 'Neutral'),
  ('new', 1, 'Neutral'),
  ('pleased', 1, 'Neutral'),
  ('spacious', 1, 'Neutral'),
  ('able', 1, 'Neutral'),
  ('short', 1, 'Neutral'),
  ('fast', 1, 'Neutral'),
  ('nice', 1, 'Positive')],
 [('great', 22, 'Positive'),
  ('nice', 7, 'Positive'),
  ('good', 7, 'Positive'),
  ('clean', 6, 'Neutral'),
  ('perfect', 5, 'Positive'),
  ('excellent', 5, 'Positive'),
  ('helpful', 4, 'Positive'),
  ('responsive', 4, 'Neutral'),
  ('happy', 3, 'Positive'),
  ('little', 3, 'Neutral'),
  ('fantastic', 3, 'Positive'),
  ('easy', 3, 'Neutral'),
  ('close', 3, 'Neutral'),
  ('quick', 3, 'Neutral'),
  ('comfortable', 3, 'Neutral'),

In [134]:
# Generate the key-value pairs for word frequency
bubble_list = []
for listing_words in all_new_list:
    all_list = []
    for word, count, label in listing_words:
        all_list.append({
            'name': label,
            'data': [{'name': word, 'value': count}],
        })
    bubble_list.append(all_list)
# Display the result
bubble_list

[[{'name': 'Positive', 'data': [{'name': 'lovely', 'value': 3}]},
  {'name': 'Neutral', 'data': [{'name': 'comfortable', 'value': 2}]},
  {'name': 'Positive', 'data': [{'name': 'great', 'value': 2}]},
  {'name': 'Positive', 'data': [{'name': 'pleasant', 'value': 2}]},
  {'name': 'Positive', 'data': [{'name': 'hospitable', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'quiet', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'quick', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'rate', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'reasonable', 'value': 1}]},
  {'name': 'Positive', 'data': [{'name': 'welcome', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'first', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'much', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'easy', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'new', 'value': 1}]},
  {'name': 'Neutral', 'data': [{'name': 'pleased', 'value': 1}]},
  {'name': 'Neut

In [162]:

# Group each listing's sentiment vocabulary by classification label
final_bubble = []
for listing_entries in bubble_list:
    # Insertion order fixes the output order: Positive, Negative, Neutral
    grouped = {'Positive': [], 'Negative': [], 'Neutral': []}
    for entry in listing_entries:
        if entry['name'] in grouped:
            grouped[entry['name']].append(entry['data'][0])
    final_bubble.append([{'name': label, 'data': data} for label, data in grouped.items()])
# Display the result
final_bubble

[[{'name': 'Positive',
   'data': [{'name': 'lovely', 'value': 3},
    {'name': 'great', 'value': 2},
    {'name': 'pleasant', 'value': 2},
    {'name': 'hospitable', 'value': 1},
    {'name': 'welcome', 'value': 1},
    {'name': 'nice', 'value': 1}]},
  {'name': 'Negative', 'data': []},
  {'name': 'Neutral',
   'data': [{'name': 'comfortable', 'value': 2},
    {'name': 'quiet', 'value': 1},
    {'name': 'quick', 'value': 1},
    {'name': 'rate', 'value': 1},
    {'name': 'reasonable', 'value': 1},
    {'name': 'first', 'value': 1},
    {'name': 'much', 'value': 1},
    {'name': 'easy', 'value': 1},
    {'name': 'new', 'value': 1},
    {'name': 'pleased', 'value': 1},
    {'name': 'spacious', 'value': 1},
    {'name': 'able', 'value': 1},
    {'name': 'short', 'value': 1},
    {'name': 'fast', 'value': 1}]}],
 [{'name': 'Positive',
   'data': [{'name': 'great', 'value': 22},
    {'name': 'nice', 'value': 7},
    {'name': 'good', 'value': 7},
    {'name': 'perfect', 'value': 5},
    {'n

In [168]:
# Adding to the dataframe
df_aslist['bubble_freq'] = final_bubble
df_aslist

index,listing_id,comments,new_name,bubble_freq
0,9835,"[Very hospitable, much appreciated.\r<br/>, A ...",Beautiful Room & House,"[{'name': 'Positive', 'data': [{'name': 'lovel..."
1,12936,[Perfect apartment in a perfect location!!!! \...,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,"[{'name': 'Positive', 'data': [{'name': 'great..."
2,33111,"[Paul is a lovely guy, very helpful and friend...",Million Dollar Views Over Melbourne,"[{'name': 'Positive', 'data': [{'name': 'lovel..."
3,38271,[Darly and Dee were very very friendly and nic...,Melbourne - Old Trafford Apartment,"[{'name': 'Positive', 'data': [{'name': 'great..."
4,41836,[Thanks for Diana\r<br/>She is a great host\r<...,CLOSE TO CITY & MELBOURNE AIRPORT,"[{'name': 'Positive', 'data': [{'name': 'nice'..."
...,...,...,...,...
13616,53612627,[- Excellent communications <br/>- I was given...,Lovely 2-Bedroom Luxuary Apartment with Proper...,"[{'name': 'Positive', 'data': []}, {'name': 'N..."
13617,53612891,"[Accomodation was clean, in a good location an...",2 Bedroom · 2 Bedroom · MC Magpie Cottage,"[{'name': 'Positive', 'data': [{'name': 'good'..."
13618,53613335,[one of the best views we've had in the city. ...,Fantastic view 2BR1BA apt in Melbourne,"[{'name': 'Positive', 'data': []}, {'name': 'N..."
13619,53645767,[I booked Hisham's place at last moment at nig...,Private Bedroom in North Melbourne,"[{'name': 'Positive', 'data': [{'name': 'good'..."


In [169]:
# Generate the output file
df_aslist.to_csv('bubble_chart.csv', index=False)
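A column holding Python lists is written to CSV as the list's string representation, so it comes back as plain text when the file is read again. The sketch below shows one way to recover the structure with `ast.literal_eval`. It uses the standard `csv` module and an in-memory buffer in place of `bubble_chart.csv`, and the sample row is illustrative.

```python
import ast
import csv
import io

# One illustrative bubble_freq value in the same shape as final_bubble entries
sample = [{"name": "Positive", "data": [{"name": "lovely", "value": 3}]}]

# Write a header and one row; csv.writer stores the list as str(sample)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["listing_id", "bubble_freq"])
writer.writerow([9835, sample])

# Read the row back: the cell is now a string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["bubble_freq"]).__name__)

# literal_eval safely parses the Python-literal text back into a list of dicts
restored = ast.literal_eval(row["bubble_freq"])
print(restored == sample)
```

If the consumer of the file is a JavaScript charting front end, serialising the column with `json.dumps` before export may be a more convenient alternative, since the single-quoted repr is not valid JSON.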