##### Text mining examples using online hotel reviews data
## Data Preparation
(c) Nuno Antonio 2019-2021

## Packages required for this set of exercises:
- **Pandas**: Install via Anaconda Navigator Environments tab or using *pip install pandas*
- **Numpy**: Install via Anaconda Navigator Environments tab or using *pip install numpy*
- **Scikit-learn**: Install via Anaconda Navigator Environments tab or using *pip install scikit-learn*
- **Matplotlib**: Install via Anaconda Navigator Environments tab or using *pip install matplotlib*
- **Seaborn**: Install via Anaconda Navigator Environments tab or using *pip install seaborn*
- **NLTK**: Install via Anaconda Navigator Environments tab or using *pip install nltk*
- **NLTK data**: After installing NLTK, open a notebook and run:<br>
*import nltk<br>
nltk.download()*
- **BeautifulSoup4**: Install via Anaconda Navigator Environments tab or using *pip install beautifulsoup4*
- **WordCloud**: Install via Anaconda Navigator Environments tab or using *pip install wordcloud*
- **Spacy**: Install via Anaconda Navigator Environments tab or using *pip install spacy*<br>
   To install Spanish models: *python -m spacy download es_core_news_sm*<br>
   To install Portuguese models: *python -m spacy download pt_core_news_sm*<br>
   To install English models: *python -m spacy download en_core_web_sm*<br>
   To install other languages models see: https://spacy.io/usage/models#languages<br>
- **Rake-NLTK**: Install via Anaconda Navigator Environments tab or using *pip install rake-nltk*
- **VaderSentiment**: Install via Anaconda Navigator Environments tab or using *pip install vaderSentiment*
- **Gensim**: Install via Anaconda Navigator Environments tab or using *pip install gensim*
- **pyLDAvis**: Install via Anaconda Navigator Environments tab or using *pip install pyldavis*
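
Before running the exercises, it can help to confirm that the packages listed above are importable in the active environment. The sketch below is one way to do that with the standard library only; the list of import names is an assumption based on how each package is usually imported (for example, `sklearn` for scikit-learn and `bs4` for BeautifulSoup4).

```python
import importlib.util

# Import names (not pip names) of the packages listed above.
# These names are assumptions based on common usage of each package.
REQUIRED_MODULES = [
    "pandas", "numpy", "sklearn", "matplotlib", "seaborn", "nltk", "bs4",
    "wordcloud", "spacy", "rake_nltk", "vaderSentiment", "gensim", "pyLDAvis",
]

def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", missing if missing else "none")
```

Any name reported as missing can then be installed with the matching *pip install* command from the list.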

In [2]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 

In [3]:
# Load dataset
dtypes = {'RevID':'category','Source':'category','HotelID':'category',
  'HotelType':'category','HotelStars':'category','ObsDateGlobalRating':'float64',
  'Language':'category','RevUserName':'category','RevUserLocation':'category','RevOverallRating':'float64'}
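# read_csv already returns a DataFrame, so no pd.DataFrame() wrapper is needed.
# on_bad_lines='warn' (pandas >= 1.3) replaces the removed error_bad_lines=False:
# malformed lines are skipped and reported.
ds = pd.read_csv("HotelOnlineReviews.txt", sep="|",
  on_bad_lines='warn', dtype=dtypes, decimal=',')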
ds.head()

b'Skipping line 12799: expected 21 fields, saw 23\n'
b'Skipping line 37247: expected 21 fields, saw 22\n'


Unnamed: 0,RevID,Source,HotelID,HotelType,HotelStars,HotelRooms,ObsDate,ObsDateTotalReviewsOnSite,ObsDateGlobalRating,RevDescription,...,RevUsername,RevUserLocation,RevPublishedDate,RevOverallRating,RevLocationRating,RevSleepQualityRating,RevRoomsRating,RevServiceRating,RevValueRating,RevCleanlinessRating
0,B377,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Hotel is so centrally located with bars and...,...,John,Ireland,2015-11-12,10.0,-1,-1,-1,-1,-1,-1
1,B378,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Room were big & comfortable.þ Very noisy at ni...,...,David,United Kingdom,2015-11-09,5.8,-1,-1,-1,-1,-1,-1
2,B379,Booking,1,Resort,4,158,2015-11-29,1092,8.8,right near the beach & breakfast good!þ Noisy ...,...,Dawn,Portugal,2015-11-07,7.5,-1,-1,-1,-1,-1,-1
3,B380,Booking,1,Resort,4,158,2015-11-29,1092,8.8,"The location is the best in Portimao, just acr...",...,Ron,USA,2015-10-29,9.2,-1,-1,-1,-1,-1,-1
4,B381,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Hotel was clean and the staff helpful and frie...,...,Pat,Spain,2015-10-27,7.1,-1,-1,-1,-1,-1,-1
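
The load step reports lines that were skipped because they contain more fields than the header. To inspect such lines instead of only dropping them, one option is to count the fields per row with the `csv` module. This is a minimal sketch: the helper name `find_bad_lines` is hypothetical, and the in-memory sample is made-up data standing in for the real file.

```python
import csv
import io

def find_bad_lines(handle, delimiter="|"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))  # the first row is the header
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]

# Tiny in-memory stand-in for the real file (made-up rows, not actual data)
sample = io.StringIO("RevID|Source|Text\nX1|Booking|fine\nX2|Booking|extra|pipe\n")
bad = find_bad_lines(sample)
print(bad)
```

With the real dataset, the same function could be called on a handle such as `open("HotelOnlineReviews.txt", newline="")`. An unescaped `|` inside the review text is a common cause of extra fields.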


In [4]:
# Describe dataset
ds.describe()

Unnamed: 0,HotelRooms,ObsDateTotalReviewsOnSite,ObsDateGlobalRating,RevOverallRating,RevLocationRating,RevSleepQualityRating,RevRoomsRating,RevServiceRating,RevValueRating,RevCleanlinessRating
count,39421.0,39421.0,39421.0,39421.0,39421.0,39421.0,39421.0,39421.0,39421.0,39421.0
mean,162.889145,1735.951599,7.234961,7.14079,-0.671622,-0.583192,-0.691789,-0.533548,-0.67842,-0.691662
std,97.114807,1489.900588,2.084053,2.353779,0.578889,0.908503,0.46176,1.061298,0.538623,0.462636
min,17.0,3.0,3.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,70.0,648.0,4.5,5.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,152.0,1258.0,8.2,7.9,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
75%,224.0,2382.0,8.8,9.2,0.0,0.0,0.0,0.0,0.0,0.0
max,369.0,6668.0,10.0,10.0,5.0,5.0,0.0,5.0,5.0,5.0


In [5]:
# View text of review with ID T9909
# Select the review with a boolean mask and take the first match
reviewText = ds.loc[ds['RevID'] == 'T9909', 'RevDescription'].iloc[0]
print("Full review:\n",reviewText)

Full review:
 My husband &amp; I stayed here for 10 days. Hotel arranged for airport pick up prior to our arrival. Staff was friendly and helpful. Lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. Base level rooms are VERY small (read size when booking). Bathroom had tub which is a nice plus. Linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. Outside noise (trash trucks) awoke me two separate nights. Breakfast offering was plentiful and good, bar drinks also good. We did have one problem during our stay. One night when we arrived to our room our door was open. Nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. Staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) &amp; housekeeping had entered our room and thus they were unable to determine who left the door open

In [6]:
# Remove HTML
from bs4 import BeautifulSoup
rawText = BeautifulSoup(reviewText,'html.parser').get_text()
print("Full review without HTML:\n",rawText) # looks similar, but HTML entities such as "&amp;" are now decoded to "&"


Full review without HTML:
 My husband & I stayed here for 10 days. Hotel arranged for airport pick up prior to our arrival. Staff was friendly and helpful. Lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. Base level rooms are VERY small (read size when booking). Bathroom had tub which is a nice plus. Linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. Outside noise (trash trucks) awoke me two separate nights. Breakfast offering was plentiful and good, bar drinks also good. We did have one problem during our stay. One night when we arrived to our room our door was open. Nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. Staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) & housekeeping had entered our room and thus they were unable to determine who left the door
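
In this review, the only visible effect of the HTML removal is that the entity `&amp;` became `&`. When a text contains entities but no tags, the standard library's `html.unescape` gives the same decoding without a parser. The sketch below uses a shortened copy of the review's first sentence.

```python
import html

# Shortened excerpt of the review text shown above
snippet = "My husband &amp; I stayed here for 10 days."

# Decode HTML entities (e.g. &amp; -> &) without parsing any tags
decoded = html.unescape(snippet)
print(decoded)
```

BeautifulSoup remains the safer choice when reviews may also contain real tags such as `<br>`, since `html.unescape` leaves tags untouched.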

In [7]:
# Normalize case
normalizedText = rawText.lower()
print("Normalized text:\n",normalizedText)

Normalized text:
 my husband & i stayed here for 10 days. hotel arranged for airport pick up prior to our arrival. staff was friendly and helpful. lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. base level rooms are very small (read size when booking). bathroom had tub which is a nice plus. linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. outside noise (trash trucks) awoke me two separate nights. breakfast offering was plentiful and good, bar drinks also good. we did have one problem during our stay. one night when we arrived to our room our door was open. nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) & housekeeping had entered our room and thus they were unable to determine who left the door open. fo

In [8]:
# Remove certain classes of characters (using "regular expressions" - regex)
# To test regex online: https://regex101.com
# Regex in more detail: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/
import re
charsToRemove = r'[?.!;",()&]'  # character class: punctuation and other characters to remove
textWOPunctuation = re.sub(charsToRemove, '', normalizedText)
print("Text without punctuation:\n",textWOPunctuation)

Text without punctuation:
 my husband  i stayed here for 10 days hotel arranged for airport pick up prior to our arrival staff was friendly and helpful lisbon is a great walking city and the location is very central easy walk to sites trams etc base level rooms are very small read size when booking bathroom had tub which is a nice plus linens were good mattress quite comfortable pillows a little too hard for our preference outside noise trash trucks awoke me two separate nights breakfast offering was plentiful and good bar drinks also good we did have one problem during our stay one night when we arrived to our room our door was open nightly treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat staff investigated and we were told by reviewing our lock personnel from the restaurant treat  housekeeping had entered our room and thus they were unable to determine who left the door open fortunately nothing was s
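
Note that removing `&` leaves a double space in the output ("my husband  i"). A possible refinement, sketched below, replaces each punctuation character with a space and then collapses runs of whitespace. The function name `strip_punctuation` is hypothetical, and the character set is the same one used in the cell above.

```python
import re

PUNCTUATION = re.compile(r'[?.!;",()&]')  # same characters as in the cell above
WHITESPACE = re.compile(r"\s+")

def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    without = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", without).strip()

cleaned = strip_punctuation("my husband & i stayed here. that nights' treat")
print(cleaned)
```

Substituting a space rather than an empty string also keeps words apart when the reviewer omitted the space after a comma or period.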

In [9]:
# Word tokenization - Break text into words
from nltk.tokenize import word_tokenize
tokenizedText = word_tokenize(textWOPunctuation, language='english')
print("List of words:\n",tokenizedText)

List of words:
 ['my', 'husband', 'i', 'stayed', 'here', 'for', '10', 'days', 'hotel', 'arranged', 'for', 'airport', 'pick', 'up', 'prior', 'to', 'our', 'arrival', 'staff', 'was', 'friendly', 'and', 'helpful', 'lisbon', 'is', 'a', 'great', 'walking', 'city', 'and', 'the', 'location', 'is', 'very', 'central', 'easy', 'walk', 'to', 'sites', 'trams', 'etc', 'base', 'level', 'rooms', 'are', 'very', 'small', 'read', 'size', 'when', 'booking', 'bathroom', 'had', 'tub', 'which', 'is', 'a', 'nice', 'plus', 'linens', 'were', 'good', 'mattress', 'quite', 'comfortable', 'pillows', 'a', 'little', 'too', 'hard', 'for', 'our', 'preference', 'outside', 'noise', 'trash', 'trucks', 'awoke', 'me', 'two', 'separate', 'nights', 'breakfast', 'offering', 'was', 'plentiful', 'and', 'good', 'bar', 'drinks', 'also', 'good', 'we', 'did', 'have', 'one', 'problem', 'during', 'our', 'stay', 'one', 'night', 'when', 'we', 'arrived', 'to', 'our', 'room', 'our', 'door', 'was', 'open', 'nightly', 'treats', 'are', 'deli

In [10]:
# Remove stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
#print(stop_words) # uncomment to see the list of English stopwords
cleanedText = []
for t in tokenizedText:
    if t not in stop_words:
        cleanedText.append(t)
print("Text without stopwords:\n",cleanedText) 

Text without stopwords:
 ['husband', 'stayed', '10', 'days', 'hotel', 'arranged', 'airport', 'pick', 'prior', 'arrival', 'staff', 'friendly', 'helpful', 'lisbon', 'great', 'walking', 'city', 'location', 'central', 'easy', 'walk', 'sites', 'trams', 'etc', 'base', 'level', 'rooms', 'small', 'read', 'size', 'booking', 'bathroom', 'tub', 'nice', 'plus', 'linens', 'good', 'mattress', 'quite', 'comfortable', 'pillows', 'little', 'hard', 'preference', 'outside', 'noise', 'trash', 'trucks', 'awoke', 'two', 'separate', 'nights', 'breakfast', 'offering', 'plentiful', 'good', 'bar', 'drinks', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arrived', 'room', 'door', 'open', 'nightly', 'treats', 'delivered', 'suspected', 'door', 'inadvertently', 'left', 'fully', 'closed', 'person', 'delivered', 'nights', "'", 'treat', 'staff', 'investigated', 'told', 'reviewing', 'lock', 'personnel', 'restaurant', 'treat', 'housekeeping', 'entered', 'room', 'thus', 'unable', 'determine', 'left', 'door', 
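
The output above still contains a stray `"'"` token, left over from "nights'". One way to drop such tokens in the same pass is to require at least one alphanumeric character per token. This is a sketch with a small illustrative stopword set (an assumption for the example; the notebook uses NLTK's full English list), and `clean_tokens` is a hypothetical helper name.

```python
# Small illustrative stopword set; the notebook uses NLTK's full English list
SAMPLE_STOP_WORDS = {"my", "i", "for", "the", "that"}

def clean_tokens(tokens, stop_words):
    """Keep tokens that are not stopwords and contain a letter or digit."""
    return [
        t for t in tokens
        if t not in stop_words and any(ch.isalnum() for ch in t)
    ]

tokens = ["my", "husband", "i", "stayed", "10", "days", "nights", "'", "treat"]
filtered = clean_tokens(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

Passing `stop_words` from the cell above as the second argument would apply the same filter with the full NLTK list.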

In [11]:
# Create a string from all the words
newText = ' '.join(cleanedText)  # join the words with single spaces
print("Full sentence with changes so far:\n", newText)

Full sentence with changes so far:
 husband stayed 10 days hotel arranged airport pick prior arrival staff friendly helpful lisbon great walking city location central easy walk sites trams etc base level rooms small read size booking bathroom tub nice plus linens good mattress quite comfortable pillows little hard preference outside noise trash trucks awoke two separate nights breakfast offering plentiful good bar drinks also good one problem stay one night arrived room door open nightly treats delivered suspected door inadvertently left fully closed person delivered nights ' treat staff investigated told reviewing lock personnel restaurant treat housekeeping entered room thus unable determine left door open fortunately nothing stolen deeply unsettling offer upgrade us suite unused gm offered declined surprised couple nights later ran gm hotel bar pick bar tab would likely stay hotel recommend based experience


In [12]:
# Lexicon normalization - Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
lemmatizedText= []
for t in cleanedText:
    lemWord = lem.lemmatize(t) # no POS tag is passed, so every word is treated as a noun
    lemmatizedText.append(lemWord)
print("Lemmatized text :\n",lemmatizedText) 

Lemmatized text :
 ['husband', 'stayed', '10', 'day', 'hotel', 'arranged', 'airport', 'pick', 'prior', 'arrival', 'staff', 'friendly', 'helpful', 'lisbon', 'great', 'walking', 'city', 'location', 'central', 'easy', 'walk', 'site', 'tram', 'etc', 'base', 'level', 'room', 'small', 'read', 'size', 'booking', 'bathroom', 'tub', 'nice', 'plus', 'linen', 'good', 'mattress', 'quite', 'comfortable', 'pillow', 'little', 'hard', 'preference', 'outside', 'noise', 'trash', 'truck', 'awoke', 'two', 'separate', 'night', 'breakfast', 'offering', 'plentiful', 'good', 'bar', 'drink', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arrived', 'room', 'door', 'open', 'nightly', 'treat', 'delivered', 'suspected', 'door', 'inadvertently', 'left', 'fully', 'closed', 'person', 'delivered', 'night', "'", 'treat', 'staff', 'investigated', 'told', 'reviewing', 'lock', 'personnel', 'restaurant', 'treat', 'housekeeping', 'entered', 'room', 'thus', 'unable', 'determine', 'left', 'door', 'open', 'fortunat
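
Because every word was treated as a noun, verb forms such as "stayed" and "arranged" are unchanged in the output above. A common refinement is to tag each token first and pass a WordNet part of speech to `lemmatize`. The sketch below is one way to do that: both helper names are hypothetical, and `nltk.pos_tag` assumes the NLTK tagger resource has been downloaded (its exact name varies between NLTK versions).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for all other tags

def lemmatize_with_pos(tokens):
    """Lemmatize tokens using the POS tag predicted by nltk.pos_tag."""
    # Imported here so the mapping above can be used without NLTK installed
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer
    lem = WordNetLemmatizer()
    return [lem.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in nltk.pos_tag(tokens)]
```

Calling `lemmatize_with_pos(cleanedText)` would then lemmatize words tagged as verbs with the verb rules. Tagging tends to be more reliable on the full tokenized sentence than on text with stopwords already removed, so the order of these steps is a design choice worth testing.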

In [13]:
# Lexicon normalization - Stemming
from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()
stemmedText = []
for t in cleanedText:
    stemmedWord = stem.stem(t)
    stemmedText.append(stemmedWord)
print("Stemmed text :\n",stemmedText) 

Stemmed text :
 ['husband', 'stay', '10', 'day', 'hotel', 'arrang', 'airport', 'pick', 'prior', 'arriv', 'staff', 'friendli', 'help', 'lisbon', 'great', 'walk', 'citi', 'locat', 'central', 'easi', 'walk', 'site', 'tram', 'etc', 'base', 'level', 'room', 'small', 'read', 'size', 'book', 'bathroom', 'tub', 'nice', 'plu', 'linen', 'good', 'mattress', 'quit', 'comfort', 'pillow', 'littl', 'hard', 'prefer', 'outsid', 'nois', 'trash', 'truck', 'awok', 'two', 'separ', 'night', 'breakfast', 'offer', 'plenti', 'good', 'bar', 'drink', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arriv', 'room', 'door', 'open', 'nightli', 'treat', 'deliv', 'suspect', 'door', 'inadvert', 'left', 'fulli', 'close', 'person', 'deliv', 'night', "'", 'treat', 'staff', 'investig', 'told', 'review', 'lock', 'personnel', 'restaur', 'treat', 'housekeep', 'enter', 'room', 'thu', 'unabl', 'determin', 'left', 'door', 'open', 'fortun', 'noth', 'stolen', 'deepli', 'unsettl', 'offer', 'upgrad', 'us', 'suit', 'unus',