# Evaluated Exercise IV
## Part 1: Topic Model based on customer reviews

Analyze the customer reviews and identify the relevant topics they contain, so that an automated reporting system can be implemented to analyze incoming reviews in close to real time.
- What is the optimal number of topics?
- Please describe the topics

### Data IO and Package Import

#### Package Handling

In [None]:
# Dataframes
import pandas as pd

# Text Cleaning
import re 
# import nltk and spacy
import nltk
import spacy

# Gensim to do the LDA models (topicmodels)
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Pretty-printing of nested Python objects (lists of lists, tuples, ...)
from pprint import pprint


In [None]:
# Load the English language model from spaCy.
# Note: the "en" shortcut only exists in spaCy 2.x; in spaCy 3.x load "en_core_web_sm" instead.
nlp = spacy.load("en")

#### Data I/O

In [None]:
url = r'https://raw.githubusercontent.com/jandroi/3_2_BD/main/hotelSatisfaction_English.csv'
df = pd.read_csv(url, engine='python', encoding='latin-1')

In [None]:
df.iloc[260:264]

Unnamed: 0,Comments,OverallSentiment
260,ÊEverything costs extra unfortunately.,negative
261,lack of sports program,negative
262,no sports facilities,negative
263,the food could be improved in terms of quality.,negative


In [None]:
# Shape of the dataset: how many rows (reviews) and how many columns?
print("Shape of the dataset:", df.shape)
# How many different sentiment labels are there, and which ones?
print("Number of different sentiment labels: ", str(len(df['OverallSentiment'].unique())))
print(df['OverallSentiment'].unique())
# Show the first ten rows
df.head(10)

Shape of the dataset: (860, 2)
Number of different sentiment labels:  2
['positive' 'negative']


Unnamed: 0,Comments,OverallSentiment
0,Rooms were clean.,positive
1,Excellent value for money,positive
2,Parking too small. No free wifi in rooms. No c...,negative
3,"Comfortable rooms, outstanding breakfast, nice...",positive
4,Quiet location right on the beach.,positive
5,Pleasant service,positive
6,View on the beautiful countryside and the sea.,positive
7,Beautiful location,positive
8,"The service, the cleanness and neatness of ro...",positive
9,Outstanding services,positive


## Stopwords

In [None]:
# Download stopwords from nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')

In [None]:
# User-defined stopwords, if needed (matched against the surface forms, i.e. before lemmatization)
stop_words.extend(['room'])

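## Text Cleaning

We clean the reviews with regular expressions (package `re`). Because `re.sub` works on one string at a time, we first convert the `Comments` column into a Python list and then apply it to each element.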

In [None]:
data = df['Comments'].values.tolist()
data[260]

'ÊEverything costs extra unfortunately.'

In [None]:
# Replace every run of characters other than letters, digits, ':' and ',' with a single space
data = [re.sub("[^0-9a-zA-Z:,]+", ' ', sent) for sent in data]
data[260]

' Everything costs extra unfortunately '

In [None]:
# Collapse runs of whitespace (including newlines) into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]


In [None]:
# Print out some texts
pprint(data[1:2])

['Excellent value for money']
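
The two `re.sub` steps above can be folded into one helper. The following is a minimal sketch and not part of the original notebook: the name `clean_review` and the final `strip()` are assumptions added for illustration. Whitespace is not in the allowed character class, so the first substitution already collapses whitespace runs; the second pass is kept only to mirror the cells above.

```python
import re

# Hypothetical helper (not in the original notebook) combining both cleaning steps.
def clean_review(text):
    # Replace every run of characters other than letters, digits, ':' and ',' with a space
    text = re.sub(r"[^0-9a-zA-Z:,]+", " ", text)
    # Collapse any remaining whitespace runs (mirrors the second notebook cell)
    text = re.sub(r"\s+", " ", text)
    # Assumption: also trim the leading/trailing space that the notebook leaves in place
    return text.strip()

cleaned = clean_review("ÊEverything costs extra unfortunately.")
print(cleaned)
```

In the notebook this could be applied as `data = [clean_review(sent) for sent in data]`, which replaces both list comprehensions.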


## Tokenize words and clean-up text

A few possibilities here:
- spaCy
- Gensim `simple_preprocess`
- nltk
- ...

In [None]:
def sent_to_words(sentences):
  # Lowercase and tokenize each review; deacc=True strips accent marks
  for sentence in sentences:
    yield simple_preprocess(str(sentence), deacc=True)

In [None]:
# Tokenize all reviews
data_words = list(sent_to_words(data))

In [None]:
pprint(data_words[260])

['everything', 'costs', 'extra', 'unfortunately']
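
To make the tokenization step less of a black box, here is a rough standard-library approximation of what it does: strip accents, lowercase, extract runs of letters, and drop very short or very long tokens. This is a sketch under assumptions, not gensim's implementation; the name `simple_tokenize` is hypothetical, and the length limits 2 and 15 are the defaults of `simple_preprocess`.

```python
import re
import unicodedata

# Rough, hypothetical approximation of the tokenization step; NOT gensim's implementation.
def simple_tokenize(text, min_len=2, max_len=15):
    # Strip accent marks (roughly what deacc=True does)
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase and pull out runs of letters (no digits, no underscores)
    tokens = re.findall(r"[^\W\d_]+", no_accents.lower())
    # Keep only tokens within the length limits
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

print(simple_tokenize(" Everything costs extra unfortunately "))
```

Edge cases may differ from gensim, so the sketch is meant for understanding the step rather than as a replacement.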


## Remove StopWords

In [None]:
# Remove stopwords from the already tokenized reviews
def remove_stopwords(texts):
  return [[word for word in doc if word not in stop_words] for doc in texts]

In [None]:
## Call the function:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

In [None]:
print(data_words_nostops[:1])
print(data_words_nostops[260])

[['rooms', 'clean']]
['everything', 'costs', 'extra', 'unfortunately']


## Lemmatization


In [None]:
# Lemmatize each review and keep only tokens with the allowed part-of-speech tags
def lemmatization(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
  # Empty list for the results
  texts_out = []
  for sent in texts:
    # Join the tokens back into one string so that spaCy can parse the review
    doc = nlp(" ".join(sent))
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_out

In [None]:
# Do lemmatization keeping only noun, adj, verb, adv:
data_lemmatized = lemmatization(texts = data_words_nostops)

In [None]:
# Print the third review and review 260
pprint(data_lemmatized[2:3])
pprint(data_lemmatized[260])

[['park', 'small', 'free', 'wifi', 'room', 'crib', 'child']]
['cost', 'extra', 'unfortunately']
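
The lemmatized output above still contains 'room', although 'room' was added to `stop_words`: the stopword filter ran on the surface form 'rooms', which only becomes 'room' during lemmatization. If the intent is to drop the lemma as well, one option is a second filter pass after lemmatization. The sketch below uses small hard-coded lists as stand-ins for `stop_words` and `data_lemmatized`; the function name is hypothetical.

```python
# Hypothetical second filter pass, applied to lemmas instead of surface forms.
def remove_stopword_lemmas(lemmatized_texts, stop_words):
    stop_set = set(stop_words)  # set lookup is faster than scanning a list
    return [[lemma for lemma in doc if lemma not in stop_set] for doc in lemmatized_texts]

# Toy stand-ins for the notebook's variables
toy_stop_words = ["room"]
toy_lemmatized = [["room", "clean"], ["cost", "extra", "unfortunately"]]
filtered = remove_stopword_lemmas(toy_lemmatized, toy_stop_words)
print(filtered)
```

Whether 'room' should be removed at all is a modeling choice; in hotel reviews it may carry little information, but it can also anchor a room-related topic.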


## Create the Dictionary and Corpus needed for Topic Modeling

Inputs of an LDA model:
- `dictionary` (id2word): like a real dictionary, but a mapping from integer ids to words: 0: clean, 1: room, 2: excellent, ...
- `corpus`: the collection of texts in bag-of-words form, one list of (word id, count) pairs per document: [[(0, 1), (1, 1)], [(2, 1), ...], ...], similar to the key-value pairs of a MapReduce word count

The computer does not work with the text itself but with this numeric representation, i.e. which word ids occur together in which documents.
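
The idea behind the two inputs can be sketched with the standard library alone. The functions `build_dictionary` and `doc2bow` below are hypothetical stand-ins written for illustration, not gensim's code; assigning ids in sorted order within each document is an assumption chosen so that the toy result lines up with the ids used in this notebook.

```python
from collections import Counter

# Hypothetical stand-in for a dictionary: maps each token to an integer id.
def build_dictionary(texts):
    token2id = {}
    for tokens in texts:
        # New tokens of a document receive ids in alphabetical order
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Hypothetical stand-in for the bag-of-words conversion: (token id, count) pairs.
def doc2bow(tokens, token2id):
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [["room", "clean"], ["excellent", "value", "money"]]
toy_token2id = build_dictionary(toy_texts)
toy_corpus = [doc2bow(tokens, toy_token2id) for tokens in toy_texts]
print(toy_token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only which words occur in a document, and how often, is kept.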



In [None]:
# Create dictionary:
id2word = corpora.Dictionary(data_lemmatized)

In [None]:
print(id2word)

Dictionary(571 unique tokens: ['clean', 'room', 'excellent', 'money', 'value']...)


In [None]:
# Create the corpus
texts = data_lemmatized

In [None]:
texts[:20]

[['room', 'clean'],
 ['excellent', 'value', 'money'],
 ['park', 'small', 'free', 'wifi', 'room', 'crib', 'child'],
 ['comfortable', 'room', 'outstanding', 'breakfast', 'service'],
 ['quiet', 'location', 'beach'],
 ['pleasant', 'service'],
 ['beautiful', 'countryside'],
 ['beautiful', 'location'],
 ['rich', 'imaginative', 'breakfast', 'buffet'],
 ['outstanding', 'service'],
 ['excellent',
  'diverse',
  'excursion',
  'offering',
  'vary',
  'buffet',
  'friendly',
  'service',
  'staff'],
 ['friendly', 'service', 'attentive', 'staff'],
 ['hotel', 'staff', 'always', 'courteous'],
 ['monotonous', 'morning', 'offer', 'fresh', 'fruit'],
 [],
 ['small'],
 ['wonderful', 'nice'],
 ['quiet'],
 ['expensive'],
 ['buffet', 'luxurious', 'bathroom', 'staff', 'tour', 'guide']]

In [None]:
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

In [None]:
pprint(corpus)

[[(0, 1), (1, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(1, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(15, 1), (16, 1), (17, 1)],
 [(14, 1), (18, 1)],
 [(19, 1), (20, 1)],
 [(16, 1), (19, 1)],
 [(11, 1), (21, 1), (22, 1), (23, 1)],
 [(13, 1), (14, 1)],
 [(2, 1),
  (14, 1),
  (21, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(14, 1), (26, 1), (28, 1), (30, 1)],
 [(28, 1), (31, 1), (32, 1), (33, 1)],
 [(34, 1), (35, 1), (36, 1), (37, 1), (38, 1)],
 [],
 [(9, 1)],
 [(39, 1), (40, 1)],
 [(17, 1)],
 [(41, 1)],
 [(21, 1), (28, 1), (42, 1), (43, 1), (44, 1), (45, 1)],
 [(2, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1)],
 [(51, 1), (52, 1)],
 [(14, 1), (26, 1), (33, 1), (53, 1), (54, 1), (55, 1)],
 [(9, 1), (56, 1), (57, 1), (58, 1), (59, 1)],
 [(16, 1), (17, 1), (60, 1)],
 [(60, 1), (61, 1)],
 [(1, 1), (21, 1), (28, 1), (30, 1), (39, 1), (62, 1), (63, 1)],
 [(14, 1), (39, 1), (64, 1), (65, 1), (66, 2)],
 [(11,

In [None]:
# Use dictionary to get a human readable output:
id2word[1]

'room'

In [None]:
# Combine information in corpus and dictionary
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[0:2]]

[[('clean', 1), ('room', 1)], [('excellent', 1), ('money', 1), ('value', 1)]]

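## Build the Topic Model

What do we need to run a topic model?
- dictionary
- corpus
- number of topics
   
**Hyperparameter tuning:**
- `alpha` and `eta`: the Dirichlet priors on the document-topic and topic-word distributions; they affect the sparsity of the topics.
  
**More parameters:**
- `chunksize`: number of documents used in each training chunk
- `update_every`: how often the model parameters are updated (0 for batch learning, 1 or more for online learning)
- `passes`: total number of training passes over the corpus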
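
To get a feeling for how `alpha` affects sparsity, one can sample document-topic mixtures from a symmetric Dirichlet prior and look at the share of the largest topic. The sketch below uses only the standard library; the function names and the values 0.1 and 10.0 are illustrative assumptions, not the values that `alpha = 'auto'` learns for this dataset.

```python
import random

# Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics
# by normalizing independent Gamma(alpha, 1) draws.
def sample_dirichlet(alpha, k, rng):
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

# Average share of the dominant topic over n sampled mixtures
def mean_max_share(alpha, k=7, n=2000, seed=100):
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

sparse_share = mean_max_share(0.1)   # small alpha: a few topics dominate each document
dense_share = mean_max_share(10.0)   # large alpha: topics are mixed more evenly
print(sparse_share, dense_share)
```

A small `alpha` pushes each document toward one or two dominant topics, while a large `alpha` spreads the weight over many topics; `eta` plays the same role for the words within a topic.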

In [None]:
# Build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                            num_topics = 7, 
                                            id2word = id2word, 
                                            random_state = 100, 
                                            update_every = 1,
                                            chunksize = 100,
                                            passes = 10,
                                            alpha = 'auto',
                                            per_word_topics = True)


In [None]:
pprint(lda_model.print_topics())

[(0,
  '0.096*"location" + 0.077*"excellent" + 0.039*"perfect" + 0.035*"poor" + '
  '0.032*"back" + 0.031*"dirty" + 0.030*"come" + 0.023*"never" + '
  '0.020*"beautiful" + 0.020*"definitely"'),
 (1,
  '0.143*"bed" + 0.095*"clean" + 0.088*"loved" + 0.031*"expensive" + '
  '0.030*"access" + 0.026*"well" + 0.020*"old" + 0.017*"guest" + '
  '0.011*"uncomfortable" + 0.011*"early"'),
 (2,
  '0.166*"good" + 0.069*"internet" + 0.048*"free" + 0.045*"price" + '
  '0.045*"small" + 0.044*"food" + 0.032*"restaurant" + 0.023*"soundproof" + '
  '0.016*"ratio" + 0.016*"performance"'),
 (3,
  '0.190*"staff" + 0.124*"friendly" + 0.113*"great" + 0.064*"hotel" + '
  '0.050*"helpful" + 0.026*"breakfast" + 0.026*"stay" + 0.021*"desk" + '
  '0.019*"front" + 0.015*"average"'),
 (4,
  '0.138*"service" + 0.091*"nice" + 0.079*"check" + 0.046*"upgrade" + '
  '0.034*"wonderful" + 0.026*"people" + 0.021*"reception" + 0.021*"fast" + '
  '0.020*"quick" + 0.020*"expect"'),
 (5,
  '0.148*"room" + 0.059*"bad" + 0.050*"c

## Expand the toolbox: Better visualization of the topics and the corresponding words!

Let us use the `pyLDAvis` package here! It offers an interactive way to investigate the model.

In [None]:
# pip or apt: https://askubuntu.com/questions/431780/apt-get-install-vs-pip-install
!pip install pyLDAvis==2.1.2



In [None]:
# Visualize the results - to decide how many topics to use and to find
#  names for the topics
# Note: in pyLDAvis 3.x this submodule is called pyLDAvis.gensim_models
import pyLDAvis
import pyLDAvis.gensim

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook() # initialize
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)


In [None]:
pyLDAvis.enable_notebook(local=True) # initialize

In [None]:
vis

In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5944439327238984


## Searching for optimal Number of topics

In [None]:
# Suppress deprecation warnings raised by the model and gensim
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:

# Train one LDA model per candidate number of topics (1 to 20) and print its c_v coherence
def topic_search():
  for i in range(1,21,1):
    lda_model = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                                num_topics = i, 
                                                id2word = id2word, 
                                                random_state = 100, 
                                                update_every = 1,
                                                chunksize = 100,
                                                passes = 10,
                                                alpha = 'auto',
                                                per_word_topics = True)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print(i,' topic(s): ', 'Coherence Score: ', coherence_lda)



In [None]:
topic_search()

1  topic(s):  Coherence Score:  0.5151070096624618
2  topic(s):  Coherence Score:  0.5416700288381515
3  topic(s):  Coherence Score:  0.6281943634639034
4  topic(s):  Coherence Score:  0.5880355359177367
5  topic(s):  Coherence Score:  0.5944439327238984
6  topic(s):  Coherence Score:  0.6244130509968315
7  topic(s):  Coherence Score:  0.6331784064582188
8  topic(s):  Coherence Score:  0.6157365664394723
9  topic(s):  Coherence Score:  0.6092174991341966
10  topic(s):  Coherence Score:  0.6062446838004246
11  topic(s):  Coherence Score:  0.5967855395767143
12  topic(s):  Coherence Score:  0.5974058452796257
13  topic(s):  Coherence Score:  0.5869982113791777
14  topic(s):  Coherence Score:  0.5816334829440573
15  topic(s):  Coherence Score:  0.5815674224128401
16  topic(s):  Coherence Score:  0.576720575791533
17  topic(s):  Coherence Score:  0.5914249289767679
18  topic(s):  Coherence Score:  0.5884593750305578
19  topic(s):  Coherence Score:  0.5819099463609031
20  topic(s):  Coheren
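
One simple way to read the scores above is to rank the candidates by coherence. The sketch below copies the printed values into a dictionary (the line for 20 topics is cut off, so it is left out); the variable names are assumptions for illustration.

```python
# c_v coherence scores copied from the printed output above (20 topics omitted: line is cut off)
coherence_by_k = {
    1: 0.5151070096624618, 2: 0.5416700288381515, 3: 0.6281943634639034,
    4: 0.5880355359177367, 5: 0.5944439327238984, 6: 0.6244130509968315,
    7: 0.6331784064582188, 8: 0.6157365664394723, 9: 0.6092174991341966,
    10: 0.6062446838004246, 11: 0.5967855395767143, 12: 0.5974058452796257,
    13: 0.5869982113791777, 14: 0.5816334829440573, 15: 0.5815674224128401,
    16: 0.576720575791533, 17: 0.5914249289767679, 18: 0.5884593750305578,
    19: 0.5819099463609031,
}

# Number of topics with the highest coherence
best_k = max(coherence_by_k, key=coherence_by_k.get)
# Full ranking, useful for inspecting the runners-up
ranking = sorted(coherence_by_k.items(), key=lambda kv: kv[1], reverse=True)
print(best_k)
print(ranking[:3])
```

The top candidates lie close together, and all scores come from a single `random_state`, so the ranking should be treated as a hint rather than a final answer; inspecting the topics for the best few candidates (for example in pyLDAvis) is a sensible complement.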