##### Social Media Analytics
### Introduction to Text Mining
## Named Entity Recognition
(c) Nuno Antonio 2019-2022 v1.02

### Initial setup

In [1]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 
import re
from bs4 import BeautifulSoup
import spacy
from spacy import displacy
from collections import Counter

In [2]:
# Load dataset
dtypes = {'RevID':'category','Source':'category','HotelID':'category',
  'HotelType':'category','HotelStars':'category','ObsDateGlobalRating':'float64',
  'Language':'category','RevUserName':'category','RevUserLocation':'category','RevOverallRating':'float64'}
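# on_bad_lines='warn' replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0):
# malformed rows are skipped and reported. read_csv already returns a DataFrame.
ds = pd.read_csv("HotelOnlineReviews.txt", sep="|",
  on_bad_lines='warn', dtype=dtypes, decimal=',', index_col='RevID')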



b'Skipping line 12799: expected 21 fields, saw 23\n'
b'Skipping line 37247: expected 21 fields, saw 22\n'


In [3]:
# Drop non-English reviews
ds = ds.drop(ds[ds.Language!='English'].index)

### Functions

In [4]:
# Text preprocessing
def textPreProcess(rawText, removeHTML=True,
                   charsToRemove=r'\?|\.|\!|\;|\"|\,|\(|\)|\&|\:|\-',
                   removeNumbers=True, removeLineBreaks=False,
                   specialCharsToRemove=r'[^\x00-\xfd]',
                   convertToLower=True, removeConsecutiveSpaces=True):
    # Leave non-string values (e.g. NaN for missing reviews) untouched
    if not isinstance(rawText, str):
        return rawText
    procText = rawText
        
    # Remove HTML
    if removeHTML:
        procText = BeautifulSoup(procText,'html.parser').get_text()

    # Remove punctuation and other special characters
    if len(charsToRemove)>0:
        procText = re.sub(charsToRemove,' ',procText)

    # Remove numbers
    if removeNumbers:
        procText = re.sub(r'\d+',' ',procText)

    # Remove line breaks
    if removeLineBreaks:
        procText = procText.replace('\n',' ').replace('\r', '')

    # Remove special characters
    if len(specialCharsToRemove)>0:
        procText = re.sub(specialCharsToRemove,' ',procText)

    # Normalize to lower case
    if convertToLower:
        procText = procText.lower() 

    # Replace multiple consecutive spaces with just one space
    if removeConsecutiveSpaces:
        procText = re.sub(' +', ' ', procText)

    return procText
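
To make the order of the regex steps concrete, here is a minimal sketch that applies the same substitutions by hand, using only the standard library. The sample sentence is hypothetical and not taken from the dataset, and the HTML and special-character steps are left out for brevity.

```python
import re

# Hypothetical sample review, not taken from the dataset
sample = "Great  location!! Only 5 min. from the beach (really)."

# Same punctuation alternatives as the default charsToRemove
punctuation = r'\?|\.|\!|\;|\"|\,|\(|\)|\&|\:|\-'

step1 = re.sub(punctuation, ' ', sample)   # punctuation -> spaces
step2 = re.sub(r'\d+', ' ', step1)         # digits -> spaces
step3 = step2.lower()                      # normalize case
step4 = re.sub(' +', ' ', step3)           # collapse runs of spaces

print(repr(step4))
# 'great location only min from the beach really '
```

Note the trailing space left behind by the final period: collapsing consecutive spaces does not trim the ends, so a separate `strip()` is still needed before testing for empty text.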

### Analysis

In [5]:
# Create a dataframe with only the review description, keeping punctuation and numbers
processedReviews = pd.DataFrame(
    data=ds.RevDescription.apply(textPreProcess, charsToRemove='', removeLineBreaks=False, removeNumbers=False).values,
    index=ds.index, columns=['PreProcessedText'])



In [6]:
# Remove rows with empty text
processedReviews.PreProcessedText = processedReviews.PreProcessedText.str.strip()
processedReviews = processedReviews[processedReviews.PreProcessedText != '']

In [7]:
# Load Spacy English model
nlp = spacy.load("en_core_web_sm")

In [8]:
# Check entities in review 
print(processedReviews['PreProcessedText']['B24307'])
doc = nlp(processedReviews['PreProcessedText']['B24307'])
print([(X.text, X.label_) for X in doc.ents])

everything! the location is perfect, and the view from the room was sensational! we loved the kind and thoughtful hotel staff at the front desk, especially jorge. awesome breakfast. great hotel!
[]


In [9]:
# Check entities in review 
print(processedReviews['PreProcessedText']['B21450'])
doc = nlp(processedReviews['PreProcessedText']['B21450'])
print([(X.text, X.label_) for X in doc.ents])

the hotel is very well located as it is at walking distance from the metro that takes you to the city center (this is like literally across the street), to the airport (7 min walking) and also there is a bus that can take you directly to belem (very nice area to visit). the area is quiet, there is a small market near by, a car rental firm (europcar) and el corte ingles for the shopping passionate. the breakfast is really good value for money as it can satisfy both salty and sweet preferences. the room was cleaned every day and it was comfortable. the staff is very friendly and provided us with directions and help each time we requested it.
[('metro', 'FAC'), ('7 min', 'QUANTITY'), ('belem', 'GPE'), ('el corte', 'ORG'), ('every day', 'DATE')]


In [10]:
# Check entities in review
print(processedReviews['PreProcessedText']['T13867'])
doc = nlp(processedReviews['PreProcessedText']['T13867'])
print([(X.text, X.label_) for X in doc.ents])

just returned from our second trip in 3 years to this hotel. wonderful location, very nicely kept grounds and our 2 bed villa (superior), was significantly better than the standard villa we stayed in last time and definitely worth the extra.we stayed half board with 3 children (15, 12 and 6) and the food was excellent. just the odd day out of the 10 we stayed did we have to queue for a short while for breakfast, this coincided with the hotel getting busy with the portuguese holidays starting in august. we even had the luxury of english sausages and bacon this time, which pleased the children!! restaurant staff were very obliging. alvor is a 10 min walk or 5 euros in taxi and still retains its charm. we used the water taxi to lagos fromthe harbour one day which was an enjoyable trip. a couple of rounds of golf at the very close by pestana alto course (7 euros in a taxi) was also very enjoyable and quite reasonable on their summer deal. the evening entertainment inthe hotel is very under

In [11]:
# Count the labels
labels = [x.label_ for x in doc.ents]
Counter(labels)

Counter({'ORDINAL': 1,
         'DATE': 5,
         'CARDINAL': 6,
         'ORG': 2,
         'NORP': 1,
         'LANGUAGE': 1,
         'QUANTITY': 1,
         'TIME': 1})
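
Counting entity labels and counting entity texts are two different tallies over the same entities. The sketch below illustrates the difference with `collections.Counter` on a hypothetical list of `(text, label)` pairs that stands in for `doc.ents`; the pairs are made up for illustration and are not the actual spaCy output.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for doc.ents
entities = [
    ("second", "ORDINAL"),
    ("3 years", "DATE"),
    ("10", "CARDINAL"),
    ("august", "DATE"),
    ("10", "CARDINAL"),
    ("6", "CARDINAL"),
]

# Tally by label: how many entities of each type were found
label_counts = Counter(label for _, label in entities)

# Tally by text: which surface strings occur most often
text_counts = Counter(text for text, _ in entities)

print(label_counts.most_common(2))  # [('CARDINAL', 3), ('DATE', 2)]
print(text_counts.most_common(1))   # [('10', 2)]
```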

In [12]:
# Show the 3 most frequent entity texts (not labels)
entity_texts = [x.text for x in doc.ents]
Counter(entity_texts).most_common(3)

[('10', 2), ('second', 1), ('3 years', 1)]

In [13]:
# Entities visualization
displacy.render(doc, jupyter=True, style='ent')

In [14]:
# For example, if our objective were to understand what guests say about the staff's language skills, we could look for reviews that mention languages
counter=0   # to stop after x matches, for demonstration speed
annReviews=[]
for r in processedReviews['PreProcessedText']:
  doc = nlp(r)
  for i in doc.ents:
      if i.label_=='LANGUAGE':
          annReviews.append(r)
          counter = counter + 1
          break
  if counter>=3:    # Stop after the first three reviews have been found
      break

annReviews

['location right in front of the beach 1. very basic level of service. the hotel was not made with a sole, just to provide minimum facilities required from a 4 start to be raited like that. that is why the overall impression was much worse as compared to a number of boutique 3 stars i stayed along my way in spain. 2. the staff ranged from arrogant to incapable to speak other languages except from spanish, you do not feel hosted while your stay',
 'very nice and friendly stuff with very good level of english. super location.',
 'friendly staff and close to the beach.good sized apartments.close to three /four bar/restaraunts the swimming pool was unusable.two english channels on the tv news only']
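
The "stop after the first few matches" pattern used above can also be written with a lazy generator and `itertools.islice`, so that no text is examined once enough matches have been found. This is only a sketch: `first_matches` and `mentions_language` are hypothetical helpers, and the keyword check is a stand-in for testing whether `nlp(text).ents` contains a `LANGUAGE` entity. The sample reviews are invented.

```python
from itertools import islice

def first_matches(texts, predicate, limit):
    """Return the first `limit` texts for which predicate(text) is true."""
    matching = (t for t in texts if predicate(t))  # lazy: nothing evaluated yet
    return list(islice(matching, limit))           # stops after `limit` hits

checked = []  # records which texts were actually examined

def mentions_language(text):
    # Hypothetical stand-in for: any(e.label_ == 'LANGUAGE' for e in nlp(text).ents)
    checked.append(text)
    return any(word in text.split() for word in ("english", "spanish"))

reviews = [
    "great breakfast",
    "staff spoke good english",
    "close to the beach",
    "menus only in spanish",
    "english channels on tv",
    "nice pool",
]

found = first_matches(reviews, mentions_language, limit=2)
print(found)
# ['staff spoke good english', 'menus only in spanish']
print(len(checked))
# 4
```

Only the first four reviews are examined, which matters when the predicate is an expensive model call.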