# FIT5120 - Industry Experience Studio Project  S1 2022

### Project Name: HOTEL REVIEW ASSISTANT
### Task Name: Data Visualization - Iteration 2 - Wordcloud



Team information
- Team Name: AntiFake
- Team Number: TA 36

Date: 02/05/2022

Version: 1.0

Programming Language: Python 3.8 and Jupyter notebook

Python Libraries used:
- pandas (for data manipulation and analysis)
- numpy (for building the fake detection algorithm)
- re (for data extraction)
- googletrans (for translating non-English reviews)
- os (for file path handling)
- nltk (for natural language processing)
- matplotlib (for data visualization)
- textblob (for processing textual data)

## Table of Contents

* [1. Import Library](#sec_1)
* [2. Data Wrangling](#sec_2)
* [3. Tokenization](#sec_3)
* [4. Calculating Word Frequency](#sec_4)
* [5. Data Formatting and Storing](#sec_5)

### 1. Import Library

In [1]:
import pandas as pd
import numpy as np
import re
from googletrans import Translator
from os import path
import nltk
import matplotlib.pyplot as plt
from textblob import TextBlob

### 2. Data Wrangling 

In [2]:
df_review = pd.read_csv('reviews.csv')
df_listing = pd.read_csv('listings.csv')
df_listing = df_listing.loc[~df_listing.name.isnull()] # remove null in names
df_listing = df_listing.rename(columns={'id':'listing_id'})
# Merge the listing and review dataset
df_all = df_review.merge(df_listing, on='listing_id', how='left')
words = df_all[['listing_id','name', 'comments']].drop_duplicates()
words

Unnamed: 0,listing_id,name,comments
0,9835,Beautiful Room & House,"Very hospitable, much appreciated.\r<br/>"
1,9835,Beautiful Room & House,A beautiful house in a lovely quiet neighbourh...
2,9835,Beautiful Room & House,This was my first time using airbnb and it was...
3,9835,Beautiful Room & House,I was visiting Melbourne to spend time with my...
4,12936,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,Perfect apartment in a perfect location!!!! \r...
...,...,...,...
466954,53612627,Lovely 2-Bedroom Luxuary Apartment with Proper...,- Excellent communications <br/>- I was given ...
466955,53612891,2 Bedroom · 2 Bedroom · MC Magpie Cottage,"Accomodation was clean, in a good location and..."
466956,53613335,Fantastic view 2BR1BA apt in Melbourne,one of the best views we've had in the city. w...
466957,53645767,Private Bedroom in North Melbourne,I booked Hisham's place at last moment at nigh...


In [42]:
# Aggregating the reviews by listing id
df_aslist = words.groupby('listing_id').aggregate(lambda x: list(x)).reset_index()
df_aslist.loc[:, 'new_name'] = df_aslist.name.map(lambda x: x[0])
df_aslist = df_aslist.drop(['name'],axis = 1)

In [43]:
# Display the cleaned data frame
df_aslist

Unnamed: 0,listing_id,comments,new_name
0,9835,"[Very hospitable, much appreciated.\r<br/>, A ...",Beautiful Room & House
1,12936,[Perfect apartment in a perfect location!!!! \...,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC
2,33111,"[Paul is a lovely guy, very helpful and friend...",Million Dollar Views Over Melbourne
3,38271,[Darly and Dee were very very friendly and nic...,Melbourne - Old Trafford Apartment
4,41836,[Thanks for Diana\r<br/>She is a great host\r<...,CLOSE TO CITY & MELBOURNE AIRPORT
...,...,...,...
13616,53612627,[- Excellent communications <br/>- I was given...,Lovely 2-Bedroom Luxuary Apartment with Proper...
13617,53612891,"[Accomodation was clean, in a good location an...",2 Bedroom · 2 Bedroom · MC Magpie Cottage
13618,53613335,[one of the best views we've had in the city. ...,Fantastic view 2BR1BA apt in Melbourne
13619,53645767,[I booked Hisham's place at last moment at nig...,Private Bedroom in North Melbourne
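The aggregation above can also be expressed with pandas named aggregation, which collects the reviews and picks one name per listing in a single step. The following is a minimal sketch on invented rows: `toy_words` and `toy_aslist` are hypothetical names, and only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the `words` frame; the rows are invented for illustration.
toy_words = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "name": ["Room A", "Room A", "Flat B"],
    "comments": ["Very clean", "Great host", "Nice view"],
})

# Collect every review of a listing into one list and keep the first name seen.
toy_aslist = (
    toy_words.groupby("listing_id")
    .agg(comments=("comments", list), new_name=("name", "first"))
    .reset_index()
)
print(toy_aslist)
```

This avoids building a list-valued `name` column only to take its first element afterwards.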


### 3. Tokenization

In [44]:
# Join each listing's reviews into a single comma-separated string
liii = [','.join(str(v) for v in comments) for comments in df_aslist['comments']]

In [46]:
# Assign to the new column and display the result
df_aslist['cleaned_comments'] = liii
df_aslist

Unnamed: 0,listing_id,comments,new_name,cleaned_comments
0,9835,"[Very hospitable, much appreciated.\r<br/>, A ...",Beautiful Room & House,"Very hospitable, much appreciated.\r<br/>,A be..."
1,12936,[Perfect apartment in a perfect location!!!! \...,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,Perfect apartment in a perfect location!!!! \r...
2,33111,"[Paul is a lovely guy, very helpful and friend...",Million Dollar Views Over Melbourne,"Paul is a lovely guy, very helpful and friendl..."
3,38271,[Darly and Dee were very very friendly and nic...,Melbourne - Old Trafford Apartment,Darly and Dee were very very friendly and nice...
4,41836,[Thanks for Diana\r<br/>She is a great host\r<...,CLOSE TO CITY & MELBOURNE AIRPORT,Thanks for Diana\r<br/>She is a great host\r<b...
...,...,...,...,...
13616,53612627,[- Excellent communications <br/>- I was given...,Lovely 2-Bedroom Luxuary Apartment with Proper...,- Excellent communications <br/>- I was given ...
13617,53612891,"[Accomodation was clean, in a good location an...",2 Bedroom · 2 Bedroom · MC Magpie Cottage,"Accomodation was clean, in a good location and..."
13618,53613335,[one of the best views we've had in the city. ...,Fantastic view 2BR1BA apt in Melbourne,one of the best views we've had in the city. w...
13619,53645767,[I booked Hisham's place at last moment at nig...,Private Bedroom in North Melbourne,I booked Hisham's place at last moment at nigh...


In [52]:
# Keep only the adjectives (POS tag "JJ") in each listing's reviews
def get_adjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag == "JJ"]

# Store the adjectives in a new column
df_aslist['adjectives'] = df_aslist['cleaned_comments'].apply(get_adjectives)
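In the Penn Treebank tag set, `JJ` marks base-form adjectives only, while comparatives and superlatives carry `JJR` and `JJS`. Assuming the tagger emits Penn Treebank tags, the filter above would therefore drop words such as "better" or "cleanest". The sketch below shows a broader filter on hand-tagged pairs, so it runs without any corpora; `filter_adjectives`, `ADJECTIVE_TAGS`, and `sample_tags` are hypothetical names that are not part of the notebook.

```python
# Penn Treebank adjective tags: base form, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def filter_adjectives(tagged_words, tags=ADJECTIVE_TAGS):
    """Return the lower-cased words whose POS tag is in `tags`."""
    return [word.lower() for word, tag in tagged_words if tag in tags]

# Hand-tagged example in the same (word, tag) shape as `TextBlob(text).tags`.
sample_tags = [
    ("Lovely", "JJ"), ("house", "NN"), ("better", "JJR"),
    ("than", "IN"), ("the", "DT"), ("cleanest", "JJS"), ("hotel", "NN"),
]

# All three adjective forms.
print(filter_adjectives(sample_tags))
# Base form only, matching the `tag == "JJ"` test used in the notebook.
print(filter_adjectives(sample_tags, tags={"JJ"}))
```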

In [28]:
# Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

l = []
for i in df_aslist['adjectives']:
    property_word = []
    for w in i:
        try:
            tokenized = word_tokenize(str(w))
            # Lower-case each token and drop English stop words
            filtered_list = [word.lower() for word in tokenized if word.lower() not in stop_words]
#             stemmed_words = [stemmer.stem(word) for word in filtered_list]
#             lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
            property_word.append(filtered_list)
        except Exception:
            # Append a one-item list so the flattening step keeps the placeholder whole
            property_word.append(['not available'])

    # Flatten the per-word token lists into one list per listing
    property_word = [word for word_l in property_word for word in word_l]
    l.append(property_word)
    


### 4. Calculating Word Frequency

In [14]:
from nltk import FreqDist
import string

# Custom noise tokens to drop: HTML residue, place-name fragments, and filler words
noise_words = ['br/', 'mel', 'st', 'kilda', 'place', 'stay', 'us', 'would', 'even', 'made', "'s"]

freq = []
# Keep the meaningful words of each listing and count the 30 most common
for i in l:
    meaningful_words = [word for word in i if word.lower() not in noise_words]
    meaningful_words = [word for word in meaningful_words if word.lower() not in stop_words]
    meaningful_words = [word for word in meaningful_words if word.lower() not in string.punctuation]
    frequency_distribution = FreqDist(meaningful_words)
    freq.append(frequency_distribution.most_common(30))
# Display the result
freq

[[('house', 4),
  ('lovely', 4),
  ('walk', 4),
  ('manju', 4),
  ('much', 2),
  ('neighbourhood', 2),
  ('around', 2),
  ('away', 2),
  ('comfortable', 2),
  ('time', 2),
  ('great', 2),
  ('accommodation', 2),
  ('needed', 2),
  ('find', 2),
  ('room', 2),
  ('host', 2),
  ('pleasant', 2),
  ('bus', 2),
  ('hospitable', 1),
  ('appreciated', 1),
  ('beautiful', 1),
  ('quiet', 1),
  ('5', 1),
  ('minute', 1),
  ('seminar', 1),
  ('venue', 1),
  ('manningham', 1),
  ('hotel.nice', 1),
  ('parks', 1),
  ('quick', 1)],
 [('great', 39),
  ('location', 27),
  ('apartment', 21),
  ('beach', 10),
  ('well', 10),
  ('vince', 9),
  ('nice', 9),
  ('frank', 8),
  ('everything', 7),
  ('hosts', 7),
  ('good', 7),
  ('clean', 7),
  ('perfect', 6),
  ('close', 6),
  ('helpful', 5),
  ('responsive', 5),
  ('lovely', 5),
  ('easy', 5),
  ('space', 5),
  ('restaurants', 5),
  ('tram', 5),
  ('excellent', 5),
  ('walk', 5),
  ('bars', 4),
  ('highly', 4),
  ('recommend', 4),
  ('little', 4),
  ('coup
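
The counting step can be tried without the NLTK corpora by using `collections.Counter`, which `FreqDist` subclasses. This is a minimal sketch on invented tokens: `top_words`, `NOISE_TOKENS`, and `TOY_STOP_WORDS` are hypothetical names, and the tiny stop-word set only stands in for NLTK's English list.

```python
from collections import Counter
import string

NOISE_TOKENS = {"br/", "place", "stay"}        # subset of the notebook's custom list
TOY_STOP_WORDS = {"the", "was", "and", "a"}    # tiny stand-in for NLTK's English stop words
PUNCTUATION = set(string.punctuation)

def top_words(tokens, n=30):
    """Lower-case tokens, drop noise, stop words and punctuation, then count."""
    kept = [
        t.lower() for t in tokens
        if t.lower() not in NOISE_TOKENS
        and t.lower() not in TOY_STOP_WORDS
        and t not in PUNCTUATION
    ]
    return Counter(kept).most_common(n)

tokens = ["Great", "place", ",", "great", "host", "and", "a",
          "lovely", "room", "br/", "lovely", "great"]
print(top_words(tokens, n=2))  # [('great', 3), ('lovely', 2)]
```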

### 5. Data Formatting and Storing

In [None]:
# Keys of each stored record: the word ("x") and its count ("value")
keys = ["x", "value"]

# Turn each (word, count) pair into a {"x": word, "value": count} dict, per listing
all_list = [[dict(zip(keys, pair)) for pair in x] for x in freq]

In [48]:
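# Attach the word cloud data and restore the listing name column
# (`name` was dropped earlier, so rename `new_name` instead of mapping `name` again)
df_aslist['wordcloud'] = all_list
df_aslist = df_aslist.rename(columns={'new_name': 'name'})
df_aslist = df_aslist[['listing_id', 'comments', 'wordcloud', 'name']]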
df_aslist

Unnamed: 0,listing_id,comments,wordcloud,name
0,9835,"[Very hospitable, much appreciated.\r<br/>, A ...","[{'x': 'house', 'value': 4}, {'x': 'lovely', '...",Beautiful Room & House
1,12936,[Perfect apartment in a perfect location!!!! \...,"[{'x': 'great', 'value': 39}, {'x': 'location'...",St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC
2,33111,"[Paul is a lovely guy, very helpful and friend...","[{'x': 'paul', 'value': 3}, {'x': 'brenda', 'v...",Million Dollar Views Over Melbourne
3,38271,[Darly and Dee were very very friendly and nic...,"[{'x': 'dee', 'value': 67}, {'x': 'daryl', 'va...",Melbourne - Old Trafford Apartment
4,41836,[Thanks for Diana\r<br/>She is a great host\r<...,"[{'x': 'diana', 'value': 150}, {'x': 'rob', 'v...",CLOSE TO CITY & MELBOURNE AIRPORT
...,...,...,...,...
13616,53612627,[- Excellent communications <br/>- I was given...,"[{'x': 'excellent', 'value': 1}, {'x': 'commun...",Lovely 2-Bedroom Luxuary Apartment with Proper...
13617,53612891,"[Accomodation was clean, in a good location an...","[{'x': 'accomodation', 'value': 1}, {'x': 'cle...",2 Bedroom · 2 Bedroom · MC Magpie Cottage
13618,53613335,[one of the best views we've had in the city. ...,"[{'x': ''ve', 'value': 2}, {'x': 'better', 'va...",Fantastic view 2BR1BA apt in Melbourne
13619,53645767,[I booked Hisham's place at last moment at nig...,"[{'x': 'hisham', 'value': 3}, {'x': 'room', 'v...",Private Bedroom in North Melbourne


In [None]:
# Generate the output file
df_aslist.to_csv('wordcloud.csv', index=False)
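
Because each `wordcloud` cell holds a list of dicts, `to_csv` writes it as the Python string form of that list, so a consumer has to parse the text back into objects. The sketch below uses only the standard library and an invented cell value; it shows reading such a string with `ast.literal_eval`, and, as an alternative not used in the notebook, serialising to JSON before saving.

```python
import ast
import json

# Invented cell value in the same shape as one `wordcloud` entry.
cell = [{"x": "house", "value": 4}, {"x": "lovely", "value": 4}]

# What ends up in the CSV cell: the Python string form, with single quotes.
as_written = str(cell)

# literal_eval safely parses Python literals back into objects.
restored = ast.literal_eval(as_written)

# Alternative: serialise to JSON first so non-Python consumers can parse it.
as_json = json.dumps(cell)
restored_from_json = json.loads(as_json)

print(as_written)
print(as_json)
```

Storing JSON may be the more convenient choice if the file is read by a web front end, since the single-quoted Python form is not valid JSON.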