# Text Mining
### (c) Nuno António 2021

## Packages installation

Packages required for this set of exercises:
- **NLTK**: Install via Anaconda Navigator Environments tab or using *pip install nltk* or using *conda install -c anaconda nltk*
- **NLTK data**: After installing NLTK, open a notebook and run:<br>
*import nltk<br>
nltk.download()*
- **BeautifulSoup4**: Install via Anaconda Navigator Environments tab or using *pip install beautifulsoup4* or using *conda install -c anaconda beautifulsoup4*
- **WordCloud**: Install via Anaconda Navigator Environments tab or using *pip install wordcloud*  or using *conda install -c conda-forge wordcloud*
- **spaCy**: Install via Anaconda Navigator Environments tab or using *pip install spacy* or *conda install -c conda-forge spacy*<br>
   To install Spanish models: *python -m spacy download es_core_news_sm*<br>
   To install Portuguese models: *python -m spacy download pt_core_news_sm*<br>
   To install English models: *python -m spacy download en_core_web_sm*<br>
   For models in other languages, see: https://spacy.io/usage/models#languages<br>
- **Rake-NLTK**: Install via Anaconda Navigator Environments tab or using *pip install rake-nltk* or  *conda install -c conda-forge rake_nltk*
- **VaderSentiment**: Install via Anaconda Navigator Environments tab or using *pip install vaderSentiment* or *conda install -c conda-forge vadersentiment*
- **Vader-multi**: Install via Anaconda Navigator Environments tab or using *pip install vader-multi* or *conda install -c conda-forge vader-multi*
- **Gensim**: Install via Anaconda Navigator Environments tab or using *pip install gensim* or *conda install -c anaconda gensim*
- **pyLDAvis**: Install via Anaconda Navigator Environments tab or using *pip install pyldavis* or *conda install -c conda-forge pyldavis*
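
Before starting, it can help to confirm which of these packages are importable in the active environment. The sketch below is only a convenience check, not part of the original exercises. The import names in `required` are assumptions based on how these packages are usually imported (for instance, `beautifulsoup4` is imported as `bs4`), and `missing_packages` is a hypothetical helper.

```python
# Check which of the packages listed above can be imported in this environment.
from importlib.util import find_spec

# Assumed import names: they can differ from the install names,
# e.g. beautifulsoup4 -> bs4 and rake-nltk -> rake_nltk.
required = ["nltk", "bs4", "wordcloud", "spacy", "rake_nltk",
            "vaderSentiment", "gensim", "pyLDAvis"]

def missing_packages(names):
    """Return the names whose top-level module cannot be found."""
    return [name for name in names if find_spec(name) is None]

print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; anything listed still needs to be installed with one of the commands above.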

## Data preprocessing

In [1]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 

In [2]:
# Load dataset
dtypes = {'RevID':'category','Source':'category','HotelID':'category',
  'HotelType':'category','HotelStars':'category','ObsDateGlobalRating':'float64',
  'Language':'category','RevUserName':'category','RevUserLocation':'category','RevOverallRating':'float64'}
ds = pd.read_csv("HotelOnlineReviews.txt", sep="|",
  on_bad_lines='warn',  # pandas >= 1.3; replaces the removed error_bad_lines=False
  dtype=dtypes, decimal=',', index_col='RevID')

b'Skipping line 12799: expected 21 fields, saw 23\n'
b'Skipping line 37247: expected 21 fields, saw 22\n'
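
The messages above mean that two rows had more fields than the header and were dropped. To inspect such rows before loading, one option is to count fields per line with the standard `csv` module. This is a minimal sketch: `find_bad_lines` is a hypothetical helper, the sample rows are made up, and the `csv` module's quoting rules may not match the pandas parser exactly.

```python
import csv
import io

def find_bad_lines(handle, sep="|"):
    """Return (line_number, field_count) for rows whose field count
    differs from the header's field count."""
    reader = csv.reader(handle, delimiter=sep)
    expected = len(next(reader))  # the header defines the expected width
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]

# Tiny in-memory stand-in for the real file (made-up rows)
sample = io.StringIO(
    "RevID|Source|Rating\n"
    "B1|Booking|8.8\n"
    "B2|Booking|7.5|extra\n"
    "B3|Booking|9.0\n"
)
print(find_bad_lines(sample))
```

With the real file, the same function could be called on `open("HotelOnlineReviews.txt", newline="")` to list the offending line numbers.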


In [3]:
# Check first rows
ds.head()

RevID,Source,HotelID,HotelType,HotelStars,HotelRooms,ObsDate,ObsDateTotalReviewsOnSite,ObsDateGlobalRating,RevDescription,Language,RevUsername,RevUserLocation,RevPublishedDate,RevOverallRating,RevLocationRating,RevSleepQualityRating,RevRoomsRating,RevServiceRating,RevValueRating,RevCleanlinessRating
B377,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Hotel is so centrally located with bars and...,English,John,Ireland,2015-11-12,10.0,-1,-1,-1,-1,-1,-1
B378,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Room were big & comfortable.þ Very noisy at ni...,English,David,United Kingdom,2015-11-09,5.8,-1,-1,-1,-1,-1,-1
B379,Booking,1,Resort,4,158,2015-11-29,1092,8.8,right near the beach & breakfast good!þ Noisy ...,English,Dawn,Portugal,2015-11-07,7.5,-1,-1,-1,-1,-1,-1
B380,Booking,1,Resort,4,158,2015-11-29,1092,8.8,"The location is the best in Portimao, just acr...",English,Ron,USA,2015-10-29,9.2,-1,-1,-1,-1,-1,-1
B381,Booking,1,Resort,4,158,2015-11-29,1092,8.8,Hotel was clean and the staff helpful and frie...,English,Pat,Spain,2015-10-27,7.1,-1,-1,-1,-1,-1,-1


In [4]:
# Describe dataset
summary=ds.describe(include='all')
summary=summary.transpose()
summary.head(len(summary))

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Source,39421,2.0,Booking,27271.0,,,,,,,
HotelID,39421,64.0,23,2640.0,,,,,,,
HotelType,39421,2.0,City,23854.0,,,,,,,
HotelStars,39421,4.0,4,23694.0,,,,,,,
HotelRooms,39421,,,,162.889,97.1148,17.0,70.0,152.0,224.0,369.0
ObsDate,39421,375.0,2016-04-24,1883.0,,,,,,,
ObsDateTotalReviewsOnSite,39421,,,,1735.95,1489.9,3.0,648.0,1258.0,2382.0,6668.0
ObsDateGlobalRating,39421,,,,7.23496,2.08405,3.0,4.5,8.2,8.8,10.0
RevDescription,39421,30470.0,þ,8451.0,,,,,,,
Language,39421,3.0,English,18848.0,,,,,,,


In [5]:
# View text of review with ID T9909
reviewText = ds.RevDescription['T9909']
print("Full review:\n",reviewText)
# There is HTML after "My husband" and after "restaurant (treat)"
# &amp; is the HTML encoding for "&"

Full review:
 My husband &amp; I stayed here for 10 days. Hotel arranged for airport pick up prior to our arrival. Staff was friendly and helpful. Lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. Base level rooms are VERY small (read size when booking). Bathroom had tub which is a nice plus. Linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. Outside noise (trash trucks) awoke me two separate nights. Breakfast offering was plentiful and good, bar drinks also good. We did have one problem during our stay. One night when we arrived to our room our door was open. Nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. Staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) &amp; housekeeping had entered our room and thus they were unable to determine who left the door open

In [6]:
# Let's use BeautifulSoup to remove HTML
from bs4 import BeautifulSoup

In [7]:
# Let's now remove the HTML
rawText = BeautifulSoup(reviewText,'html.parser').get_text()
print("Full review without HTML:\n",rawText)
# Compare with the original: "&amp;" is now "&" and tags such as "<br>" are gone

Full review without HTML:
 My husband & I stayed here for 10 days. Hotel arranged for airport pick up prior to our arrival. Staff was friendly and helpful. Lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. Base level rooms are VERY small (read size when booking). Bathroom had tub which is a nice plus. Linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. Outside noise (trash trucks) awoke me two separate nights. Breakfast offering was plentiful and good, bar drinks also good. We did have one problem during our stay. One night when we arrived to our room our door was open. Nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. Staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) & housekeeping had entered our room and thus they were unable to determine who left the door
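
When BeautifulSoup is not available, a rough fallback can be built from the standard library. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `strip_html` is a hypothetical helper, and regex-based tag removal is fragile on malformed markup, so BeautifulSoup remains the more robust choice.

```python
import html
import re

def strip_html(text):
    """Replace anything that looks like a tag with a space,
    then decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)

print(strip_html("My husband &amp; I stayed here.<br>Staff was friendly."))
```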

In [8]:
# Normalize case
# The usual approach is to convert everything to lower case
normalizedText = rawText.lower()
print("Normalized text:\n",normalizedText)

Normalized text:
 my husband & i stayed here for 10 days. hotel arranged for airport pick up prior to our arrival. staff was friendly and helpful. lisbon is a great "walking city" and the location is very central; easy walk to sites, trams, etc. base level rooms are very small (read size when booking). bathroom had tub which is a nice plus. linens were good, mattress quite comfortable, pillows a little too "hard" for our preference. outside noise (trash trucks) awoke me two separate nights. breakfast offering was plentiful and good, bar drinks also good. we did have one problem during our stay. one night when we arrived to our room our door was open. nightly, treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat. staff investigated and we were told by reviewing our lock personnel from the restaurant (treat) & housekeeping had entered our room and thus they were unable to determine who left the door open. fo
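
Since the dataset contains reviews in more than one language, it may be worth knowing that Python also offers `str.casefold()`, a more aggressive form of lower-casing intended for caseless comparison. The sketch below only contrasts the two; the sample strings are made up and `normalize_case` is a hypothetical helper.

```python
def normalize_case(text):
    """Case-fold text for caseless matching (stronger than lower())."""
    return text.casefold()

# Made-up samples: lower() keeps "ß", casefold() expands it to "ss"
for s in ["VERY Small", "Straße", "ÓTIMO"]:
    print(s.lower(), "|", normalize_case(s))
```

For English text the two methods give the same result, so `lower()` as used above is sufficient there.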

In [9]:
# Remove certain classes of characters (using "regular expressions" - regex)
# For that let's use RE
import re
# To test regex online: https://regex101.com
# Regex in more detail: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

In [10]:
# Remove punctuation and other characters such as "&" and digits
charsToRemove = r'[?.!;",()&0-9]'
textWOPunctuation = re.sub(charsToRemove, '', normalizedText)
print("Text without punctuation:\n",textWOPunctuation)

Text without punctuation:
 my husband  i stayed here for  days hotel arranged for airport pick up prior to our arrival staff was friendly and helpful lisbon is a great walking city and the location is very central easy walk to sites trams etc base level rooms are very small read size when booking bathroom had tub which is a nice plus linens were good mattress quite comfortable pillows a little too hard for our preference outside noise trash trucks awoke me two separate nights breakfast offering was plentiful and good bar drinks also good we did have one problem during our stay one night when we arrived to our room our door was open nightly treats are delivered and we suspected the door was inadvertently left not fully closed by the person who delivered that nights' treat staff investigated and we were told by reviewing our lock personnel from the restaurant treat  housekeeping had entered our room and thus they were unable to determine who left the door open fortunately nothing was sto
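
Removing characters leaves double spaces behind (for example "husband  i" and "for  days" above). The tokenizer used later ignores them, but if the cleaned string itself is kept, the whitespace can be collapsed. A minimal sketch, with `collapse_spaces` as a hypothetical helper:

```python
import re

def collapse_spaces(text):
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_spaces("my husband  i stayed here for  days"))
```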

In [11]:
# Now, let's remove stop words
# Let's start by getting the package and checking them
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'more', 'this', "doesn't", 'on', 'be', 'i', 'where', "couldn't", "hasn't", 'aren', 'when', 'you', 'has', 'by', 'we', 'yourself', 've', "shouldn't", 'before', 'from', 'don', 'won', 'hers', 'own', 'some', 'yourselves', 'all', 'its', 'y', 'until', "won't", 'myself', 're', "that'll", "should've", 'haven', 'yours', "mightn't", "you've", 'should', 'as', 'under', 'them', 'hadn', 'why', 'so', 'very', 'which', 'against', 'while', 'not', 'only', "wasn't", 'doing', 'ours', 'because', "weren't", 'an', 'ma', 'll', 'his', 'she', 'now', 'have', "she's", 'down', 'in', 'both', 'shouldn', 'there', 'that', 'then', 'up', 'herself', 'few', 'once', 'too', 'will', 'been', "shan't", 'into', 'their', "aren't", 'and', "needn't", 'whom', 'ain', 'just', "didn't", 'is', 'at', 'after', 'was', 'other', 'nor', 'theirs', 'each', 'needn', 'the', 'isn', 't', 'mustn', 'what', 'does', 'out', 'to', "isn't", "wouldn't", "you're", 'had', 'who', 'if', 'weren', "it's", 'further', 'such', 'or', 'how', 'can', 'himself', 'during

In [12]:
# First, we need to tokenize the text, i.e. break it into words
from nltk.tokenize import word_tokenize
tokenizedText = word_tokenize(textWOPunctuation, language='english')
print("List of words:\n",tokenizedText)

List of words:
 ['my', 'husband', 'i', 'stayed', 'here', 'for', 'days', 'hotel', 'arranged', 'for', 'airport', 'pick', 'up', 'prior', 'to', 'our', 'arrival', 'staff', 'was', 'friendly', 'and', 'helpful', 'lisbon', 'is', 'a', 'great', 'walking', 'city', 'and', 'the', 'location', 'is', 'very', 'central', 'easy', 'walk', 'to', 'sites', 'trams', 'etc', 'base', 'level', 'rooms', 'are', 'very', 'small', 'read', 'size', 'when', 'booking', 'bathroom', 'had', 'tub', 'which', 'is', 'a', 'nice', 'plus', 'linens', 'were', 'good', 'mattress', 'quite', 'comfortable', 'pillows', 'a', 'little', 'too', 'hard', 'for', 'our', 'preference', 'outside', 'noise', 'trash', 'trucks', 'awoke', 'me', 'two', 'separate', 'nights', 'breakfast', 'offering', 'was', 'plentiful', 'and', 'good', 'bar', 'drinks', 'also', 'good', 'we', 'did', 'have', 'one', 'problem', 'during', 'our', 'stay', 'one', 'night', 'when', 'we', 'arrived', 'to', 'our', 'room', 'our', 'door', 'was', 'open', 'nightly', 'treats', 'are', 'delivered'

In [13]:
# Let's create a list with all words that are not part of the stop words list
cleanedText = []
for t in tokenizedText:
    if t not in stop_words:
        cleanedText.append(t)
print("Text without stopwords:\n",cleanedText) 

Text without stopwords:
 ['husband', 'stayed', 'days', 'hotel', 'arranged', 'airport', 'pick', 'prior', 'arrival', 'staff', 'friendly', 'helpful', 'lisbon', 'great', 'walking', 'city', 'location', 'central', 'easy', 'walk', 'sites', 'trams', 'etc', 'base', 'level', 'rooms', 'small', 'read', 'size', 'booking', 'bathroom', 'tub', 'nice', 'plus', 'linens', 'good', 'mattress', 'quite', 'comfortable', 'pillows', 'little', 'hard', 'preference', 'outside', 'noise', 'trash', 'trucks', 'awoke', 'two', 'separate', 'nights', 'breakfast', 'offering', 'plentiful', 'good', 'bar', 'drinks', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arrived', 'room', 'door', 'open', 'nightly', 'treats', 'delivered', 'suspected', 'door', 'inadvertently', 'left', 'fully', 'closed', 'person', 'delivered', 'nights', "'", 'treat', 'staff', 'investigated', 'told', 'reviewing', 'lock', 'personnel', 'restaurant', 'treat', 'housekeeping', 'entered', 'room', 'thus', 'unable', 'determine', 'left', 'door', 'open'
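
Note that the default list also removes negations: "left not fully closed" became "left fully closed" above, which can matter for later sentiment analysis. One possible adjustment is to exempt selected words from removal. The sketch below uses a small illustrative stop-word set rather than the full NLTK list, and `remove_stop_words` is a hypothetical helper.

```python
def remove_stop_words(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words, except those listed in keep."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t not in effective]

# Small illustrative subset of the NLTK English stop words
demo_stop_words = {"the", "was", "not", "by", "and", "we"}
tokens = ["the", "door", "was", "left", "not", "fully", "closed"]

print(remove_stop_words(tokens, demo_stop_words))                # negation dropped
print(remove_stop_words(tokens, demo_stop_words, keep={"not"}))  # negation kept
```

The same call works with the `stop_words` set loaded from NLTK in place of `demo_stop_words`.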

In [14]:
# Now, let's concatenate it again to a sentence
newText = ' '.join(cleanedText)  # join the tokens with single spaces
print("Full sentence with changes so far:\n", newText)

Full sentence with changes so far:
 husband stayed days hotel arranged airport pick prior arrival staff friendly helpful lisbon great walking city location central easy walk sites trams etc base level rooms small read size booking bathroom tub nice plus linens good mattress quite comfortable pillows little hard preference outside noise trash trucks awoke two separate nights breakfast offering plentiful good bar drinks also good one problem stay one night arrived room door open nightly treats delivered suspected door inadvertently left fully closed person delivered nights ' treat staff investigated told reviewing lock personnel restaurant treat housekeeping entered room thus unable determine left door open fortunately nothing stolen deeply unsettling offer upgrade us suite unused gm offered declined surprised couple nights later ran gm hotel bar pick bar tab would likely stay hotel recommend based experience


In [15]:
# Terms often need to be replaced by a single canonical term
# E.g., "wi-fi", "wifi" and "internet"
# In this example: replace "tub" with "bathtub"
newText = (" "+newText+" ").replace(" tub "," bathtub ")
print("Text after the replacements:\n", newText)

Text after the replacements:
  husband stayed days hotel arranged airport pick prior arrival staff friendly helpful lisbon great walking city location central easy walk sites trams etc base level rooms small read size booking bathroom bathtub nice plus linens good mattress quite comfortable pillows little hard preference outside noise trash trucks awoke two separate nights breakfast offering plentiful good bar drinks also good one problem stay one night arrived room door open nightly treats delivered suspected door inadvertently left fully closed person delivered nights ' treat staff investigated told reviewing lock personnel restaurant treat housekeeping entered room thus unable determine left door open fortunately nothing stolen deeply unsettling offer upgrade us suite unused gm offered declined surprised couple nights later ran gm hotel bar pick bar tab would likely stay hotel recommend based experience 
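
When several terms have to be mapped at once, working on the token list with a dictionary avoids chaining many `replace` calls and the padding with spaces. This is a sketch only: the `replacements` mapping is a made-up example and `replace_terms` is a hypothetical helper.

```python
def replace_terms(tokens, replacements):
    """Map each token to its canonical term, leaving unknown tokens unchanged."""
    return [replacements.get(t, t) for t in tokens]

# Hypothetical mapping of variants to one canonical term
replacements = {"tub": "bathtub", "wifi": "internet", "wi-fi": "internet"}

print(replace_terms(["bathroom", "tub", "free", "wifi"], replacements))
```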


In [16]:
# Now, let's lemmatize the text, i.e. reduce each term to its dictionary form (lemma)
# Let's start by loading the necessary package
from nltk.stem.wordnet import WordNetLemmatizer

In [17]:
# We are using the tokenized text
lem = WordNetLemmatizer()
lemmatizedText= []
for t in cleanedText:
    lemWord = lem.lemmatize(t)
    lemmatizedText.append(lemWord)
print("Lemmatized text :\n",lemmatizedText)
# Not much of a difference, so let's look at stemming

Lemmatized text :
 ['husband', 'stayed', 'day', 'hotel', 'arranged', 'airport', 'pick', 'prior', 'arrival', 'staff', 'friendly', 'helpful', 'lisbon', 'great', 'walking', 'city', 'location', 'central', 'easy', 'walk', 'site', 'tram', 'etc', 'base', 'level', 'room', 'small', 'read', 'size', 'booking', 'bathroom', 'tub', 'nice', 'plus', 'linen', 'good', 'mattress', 'quite', 'comfortable', 'pillow', 'little', 'hard', 'preference', 'outside', 'noise', 'trash', 'truck', 'awoke', 'two', 'separate', 'night', 'breakfast', 'offering', 'plentiful', 'good', 'bar', 'drink', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arrived', 'room', 'door', 'open', 'nightly', 'treat', 'delivered', 'suspected', 'door', 'inadvertently', 'left', 'fully', 'closed', 'person', 'delivered', 'night', "'", 'treat', 'staff', 'investigated', 'told', 'reviewing', 'lock', 'personnel', 'restaurant', 'treat', 'housekeeping', 'entered', 'room', 'thus', 'unable', 'determine', 'left', 'door', 'open', 'fortunately', 

In [18]:
# Load the necessary package
from nltk.stem.porter import PorterStemmer

In [19]:
# Check the differences after this
# Verbs like "stayed -> stay" or "arranged -> arrang"
stem = PorterStemmer()
stemmedText = []
for t in cleanedText:
    stemmedWord = stem.stem(t)
    stemmedText.append(stemmedWord)
print("Stemmed text :\n",stemmedText) 

Stemmed text :
 ['husband', 'stay', 'day', 'hotel', 'arrang', 'airport', 'pick', 'prior', 'arriv', 'staff', 'friendli', 'help', 'lisbon', 'great', 'walk', 'citi', 'locat', 'central', 'easi', 'walk', 'site', 'tram', 'etc', 'base', 'level', 'room', 'small', 'read', 'size', 'book', 'bathroom', 'tub', 'nice', 'plu', 'linen', 'good', 'mattress', 'quit', 'comfort', 'pillow', 'littl', 'hard', 'prefer', 'outsid', 'nois', 'trash', 'truck', 'awok', 'two', 'separ', 'night', 'breakfast', 'offer', 'plenti', 'good', 'bar', 'drink', 'also', 'good', 'one', 'problem', 'stay', 'one', 'night', 'arriv', 'room', 'door', 'open', 'nightli', 'treat', 'deliv', 'suspect', 'door', 'inadvert', 'left', 'fulli', 'close', 'person', 'deliv', 'night', "'", 'treat', 'staff', 'investig', 'told', 'review', 'lock', 'personnel', 'restaur', 'treat', 'housekeep', 'enter', 'room', 'thu', 'unabl', 'determin', 'left', 'door', 'open', 'fortun', 'noth', 'stolen', 'deepli', 'unsettl', 'offer', 'upgrad', 'us', 'suit', 'unus', 'gm',
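
Stemming works by stripping suffixes with fixed rules, without consulting a dictionary, which is why results such as "arrang" or "citi" are not real words. The toy function below illustrates the idea only. It is a deliberately simplified sketch and not the Porter algorithm, which applies many more rules in several passes; `toy_stem` and its suffix list are hypothetical.

```python
def toy_stem(word, suffixes=("ing", "ed", "ly", "s")):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["stayed", "arranged", "nights", "walking", "bar"]:
    print(w, "->", toy_stem(w))
```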