In [117]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import gensim.downloader as api
import numpy as np

## Task 3 - Feature Extraction
First we will load the ground truth reviews dataset.

In [3]:
reviews = pd.read_csv("ground_truth_reviews.csv")
reviews.head()

| Unnamed: 0 | review_id | location_id | hotel_name | city | review | rating | ground_truth_sentiment |
|---|---|---|---|---|---|---|---|
| 0 | 1016464488 | 11953119 | Nh Collection Colombo | Colombo | good stay found lighters toilet paper rolls no... | 1 | 1 |
| 1 | 1016435128 | 11953119 | Nh Collection Colombo | Colombo | definitely recommend hotel excellent food good... | 5 | 1 |
| 2 | 1016307864 | 11953119 | Nh Collection Colombo | Colombo | wonderful stay comfortable staycooperative sta... | 5 | 1 |
| 3 | 1016165618 | 11953119 | Nh Collection Colombo | Colombo | favorite 4 star hotel colombo live new york ar... | 5 | 1 |
| 4 | 1015472232 | 11953119 | Nh Collection Colombo | Colombo | excellent food stay excellent food especially ... | 5 | 1 |


### 3.1. Bag of Words (BoW)
Here, we will create a Bag of Words (BoW) representation of the reviews. 
This involves tokenizing the text and creating a matrix where each row corresponds to a review and each column corresponds to a word in the vocabulary.
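
To make the idea concrete, here is a minimal pure-Python sketch of what a bag-of-words matrix contains. The two toy reviews and the helper name `toy_bow` are made up for illustration; the regular expression mirrors `CountVectorizer`'s default `token_pattern`, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def toy_bow(docs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary term, holding that term's count in this document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical toy reviews, not taken from the dataset.
docs = ["good stay good food", "a clean room"]
vocabulary, rows = toy_bow(docs)
print(vocabulary)  # ['clean', 'food', 'good', 'room', 'stay']
print(rows)        # [[0, 1, 2, 0, 1], [1, 0, 0, 1, 0]]
```

Note that the single-character token "a" does not make it into the vocabulary, and that word order within each review is discarded.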

In [4]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews["review"])

In [5]:
vocabulary = vectorizer.get_feature_names_out()
print(f"Size of BoW vocabulary: {len(vocabulary)}")

Size of BoW vocabulary: 17968


The vocabulary extracted from the reviews contains 17,968 unique tokens. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters.

In [6]:
bow_matrix = pd.DataFrame(X.toarray(), columns=vocabulary)
print(f"Shape of the BoW matrix: {bow_matrix.shape}")

Shape of the BoW matrix: (5186, 17968)


The shape of the BoW matrix is (5186, 17968), meaning there are 5186 reviews, and each vector has 17968 features corresponding to the unique words in the vocabulary.

We can even print the first row of the BoW matrix to see how it looks.

In [108]:
print(bow_matrix.iloc[0])

000                   0
01                    0
0111and               0
0120                  0
0130                  0
                     ..
顶楼还有个游泳池不过没来得及享受一下    0
𝐄𝐚𝐬𝐡𝐚𝐧𝐢               0
𝐆𝐮𝐞𝐬𝐭                 0
𝐑𝐞𝐥𝐚𝐭𝐢𝐨𝐧𝐬             0
𝐢𝐧                    0
Name: 0, Length: 17968, dtype: int64


Let's list the words that actually occur in the first review, i.e. the nonzero entries of its row.

In [111]:
print(bow_matrix.iloc[0][bow_matrix.iloc[0] > 0])

beds        1
booked      1
even        1
found       1
give        1
good        1
lighters    1
non         1
paper       1
rolls       1
room        1
smoking     1
stay        1
though      1
toilet      1
twin        1
us          1
Name: 0, dtype: int64


### 3.2. Term Frequency-Inverse Document Frequency (TF-IDF)
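Here, we will create a Term Frequency-Inverse Document Frequency (TF-IDF) representation of the reviews. TF-IDF scales each term count by how rare the term is across all reviews, so words that appear in most reviews contribute less than distinctive ones.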
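
As a rough sketch of the arithmetic, assuming `TfidfVectorizer`'s defaults (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency): the weight of term `t` in a review is its count multiplied by `ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of reviews and `df(t)` the number of reviews containing `t`, and each row is then scaled to unit length. The toy corpus and the helper `toy_tfidf_row` below are hypothetical.

```python
import math
from collections import Counter

def toy_tfidf_row(doc_tokens, corpus_tokens):
    """TF-IDF weights for one document, using smoothed IDF and L2 normalisation."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in corpus_tokens for term in set(tokens))
    counts = Counter(doc_tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    # Scale the row so that its Euclidean length is 1.
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

# Hypothetical toy corpus: "good" is in every document, "lighters" in only one.
corpus = [["good", "stay"], ["good", "food"], ["good", "lighters"]]
weights = toy_tfidf_row(corpus[2], corpus)
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

Both terms occur once in the third toy document, yet "lighters" receives the larger weight because it appears in fewer documents.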

In [7]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews['review'])

In [8]:
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"Size of TF-IDF vocabulary: {len(feature_names)}")

Size of TF-IDF vocabulary: 17968


The vocabulary again contains 17,968 unique tokens, because `TfidfVectorizer` uses the same default tokenization as `CountVectorizer`.

In [9]:
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(f"Shape of the TF-IDF matrix: {tfidf_matrix_df.shape}")

Shape of the TF-IDF matrix: (5186, 17968)


The shape of the TF-IDF matrix is also (5186, 17968), meaning there are 5186 reviews, and each vector has 17968 features corresponding to the unique words in the vocabulary.

Once again, we can print the first row of the TF-IDF matrix to see how it looks.

In [109]:
print(tfidf_matrix_df.iloc[0])

000                   0.0
01                    0.0
0111and               0.0
0120                  0.0
0130                  0.0
                     ... 
顶楼还有个游泳池不过没来得及享受一下    0.0
𝐄𝐚𝐬𝐡𝐚𝐧𝐢               0.0
𝐆𝐮𝐞𝐬𝐭                 0.0
𝐑𝐞𝐥𝐚𝐭𝐢𝐨𝐧𝐬             0.0
𝐢𝐧                    0.0
Name: 0, Length: 17968, dtype: float64


Let's check for some words from the first review that are present in the TF-IDF matrix.

In [112]:
print(tfidf_matrix_df.iloc[0][tfidf_matrix_df.iloc[0] > 0])

beds        0.203157
booked      0.181399
even        0.144703
found       0.203431
give        0.205972
good        0.092402
lighters    0.406354
non         0.280659
paper       0.296387
rolls       0.342779
room        0.095741
smoking     0.374567
stay        0.084388
though      0.192924
toilet      0.232811
twin        0.320513
us          0.112083
Name: 0, dtype: float64


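Every word above occurs exactly once in the first review, so the differences in score come from the inverse document frequency alone: rare words such as "lighters" (0.406) score far higher than common ones such as "stay" (0.084). We can also list the top 10 words with the highest TF-IDF scores in the first review.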

In [114]:
top_tfidf_words = tfidf_matrix_df.iloc[0].nlargest(10)
print("Top 10 words with highest TF-IDF scores in the first review:")
print(top_tfidf_words)

Top 10 words with highest TF-IDF scores in the first review:
lighters    0.406354
smoking     0.374567
rolls       0.342779
twin        0.320513
paper       0.296387
non         0.280659
toilet      0.232811
give        0.205972
found       0.203431
beds        0.203157
Name: 0, dtype: float64


### 3.3. Word2Vec
Here, we will create a Word2Vec model using the reviews.

First, we need to tokenize the reviews into words.

In [10]:
tokenized_reviews = [word_tokenize(review.lower()) for review in reviews['review']]

Now we can train a Word2Vec model on the tokenized reviews. We will use the skip-gram architecture (`sg=1`) with a vector size of 500 and a window size of 100, and set the minimum count to 0 so that every word is kept in the vocabulary.

In [59]:
w2v_model = Word2Vec(sentences=tokenized_reviews, vector_size=500, window=100, min_count=0, workers=8, sg=1)
w2v_model.save("word2vec.model")
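
To see what `window` controls, the sketch below enumerates the (center, context) pairs that skip-gram training can draw from one tokenized review. It is a simplification: `skipgram_pairs` is a hypothetical helper, and gensim also randomly shrinks the effective window at each position by default. The point is that with `window=100`, any review shorter than the window pairs every word with every other word in it.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs that lie within `window` positions of each other."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]

# Hypothetical four-token review.
tokens = ["good", "stay", "found", "lighters"]
narrow = list(skipgram_pairs(tokens, window=1))
wide = list(skipgram_pairs(tokens, window=100))
print(len(narrow))  # 6
print(len(wide))    # 12
```

A very wide window therefore makes the embeddings reflect review-level co-occurrence rather than local word context.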

In [63]:
print(f"Shape of the Word2Vec matrix: {w2v_model.wv.vectors.shape}")

Shape of the Word2Vec matrix: (17998, 500)


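We can see that the Word2Vec model has a shape of (17998, 500), meaning there are 17,998 unique words in the vocabulary, and each word is represented by a 500-dimensional vector. The vocabulary is slightly larger than the 17,968 terms found by the BoW and TF-IDF vectorizers because `word_tokenize` and `CountVectorizer` tokenize text differently (for example, `CountVectorizer` drops single-character tokens by default).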

We can also check the vector value for a specific word, such as "bed".

In [54]:
bed_vector = w2v_model.wv['bed']
print(f"Vector for 'bed': {bed_vector}")

Vector for 'bed': [ 0.01328971  0.07161278  0.06789137 -0.04979965 -0.2273252  -0.22884874
  0.23627952  0.35391557  0.08354398 -0.06751227 -0.553474   -0.08800868
  0.04060964 -0.10346626 -0.16143037  0.0650394   0.17125958 -0.22617334
  0.07321496 -0.2670425   0.45155233  0.08253594  0.00937192 -0.04241944
 -0.3494298  -0.15199725 -0.32324192 -0.11509268  0.26708007 -0.1962376
  0.5659793  -0.05382812  0.48235992 -0.10746121 -0.31623477 -0.00499177
  0.03613308  0.11710807  0.07628288 -0.35120985  0.55010337 -0.3976128
  0.06993041 -0.3392599   0.23556261  0.24872307 -0.01639684  0.2988197
  0.1873349   0.37989455  0.26626685  0.07608344  0.10935228 -0.03928052
 -0.2505354   0.27249217 -0.17389402  0.19608381 -0.16193092 -0.11035869
 -0.11730596  0.02260891  0.09064058 -0.36476663 -0.11571939  0.3008439
  0.12167478  0.00809896 -0.14899743  0.1450408  -0.05531208  0.06742488
  0.05966388 -0.08510252  0.17568527 -0.44088602  0.06441578  0.09202424
 -0.08971269  0.0678653  -0.41538003 

We can also find the most similar words to "bed" using the Word2Vec model.

In [55]:
similar_words = w2v_model.wv.most_similar('bed')
print(f"Most similar words to 'bed': {similar_words}")

Most similar words to 'bed': [('bathroom', 0.7106977701187134), ('deet', 0.6923396587371826), ('drenching', 0.6879812479019165), ('hacking', 0.6844325661659241), ('barren', 0.6829866766929626), ('ragged', 0.6811744570732117), ('partially', 0.679260790348053), ('squeezed', 0.6706869602203369), ('classed', 0.6693724393844604), ('hte', 0.6689077019691467)]
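
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of that measure, using made-up 3-dimensional vectors rather than the trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, chosen only to illustrate the computation.
bed = np.array([1.0, 2.0, 0.0])
bathroom = np.array([2.0, 4.0, 0.0])  # points in the same direction as `bed`
pool = np.array([0.0, 0.0, 3.0])      # orthogonal to `bed`

print(cosine_similarity(bed, bathroom))  # approximately 1
print(cosine_similarity(bed, pool))      # 0.0
```

Because only the direction matters, `bathroom` is maximally similar to `bed` here even though it is twice as long.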


We can also perform analogy tasks using the Word2Vec model. For example, we can find a word that is to "colombo" as "galle" is to "city".

In [56]:
analogy_result = w2v_model.wv.most_similar(positive=['colombo', 'galle'], negative=['city'], topn=1)
print(f"Analogy result for 'colombo' - 'city' + 'galle': {analogy_result}")

Analogy result for 'colombo' - 'city' + 'galle': [('fort', 0.5598229765892029)]


Here, we can see that the model determines that Colombo - City + Galle = Fort, which makes intuitive sense: Galle is best known for its fort, and Colombo also has a district called Fort.
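
An analogy query of this form adds the unit-normalized `positive` vectors, subtracts the `negative` ones, and returns the nearest remaining word by cosine similarity, excluding the query words themselves. The sketch below reproduces that logic with NumPy on a tiny made-up embedding table; `toy_vectors`, `unit`, and `analogy` are hypothetical names, and the numbers are not from the trained model.

```python
import numpy as np

# Hypothetical 3-d embeddings: axes loosely mean (royalty, male, female).
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors):
    """Return (word, cosine) closest to sum(positive) - sum(negative), excluding inputs."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    candidates = {
        word: float(np.dot(unit(query), unit(vec)))
        for word, vec in vectors.items()
        if word not in positive + negative
    }
    return max(candidates.items(), key=lambda item: item[1])

best_word, score = analogy(["king", "woman"], ["man"], toy_vectors)
print(best_word, round(score, 3))
```

Excluding the query words matters: without that step, the nearest neighbour of the query vector is often one of the input words.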

We can also perform other analogy tasks, such as finding a word that is to "bed" as "internet" is to "sleep".

In [115]:
analogy_result = w2v_model.wv.most_similar(positive=['bed', 'internet'], negative=['sleep'], topn=1)
print(f"Analogy result for 'bed' - 'sleep' + 'internet': {analogy_result}")

Analogy result for 'bed' - 'sleep' + 'internet': [('connection', 0.5347010493278503)]


Here, it determined that Bed - Sleep + Internet = Connection, which also makes sense.

Let's check another analogy task, such as finding a word that is to "bed" as "water" is to "pillow".

In [58]:
analogy_result = w2v_model.wv.most_similar(positive=['bed', 'water'], negative=['pillow'], topn=1)
print(f"Analogy result for 'bed' - 'pillow' + 'water': {analogy_result}")

Analogy result for 'bed' - 'pillow' + 'water': [('hot', 0.5803027153015137)]


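Here, it determined that Bed - Pillow + Water = Hot. This is not a meaningful analogy; it most likely reflects how often "hot" and "water" appear together in the reviews, and it highlights the limitations of the model in understanding certain relationships.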

We can also test a well-known analogy that is unlikely to be well represented in a dataset of hotel reviews.

In [92]:
result = w2v_model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"Analogy result for 'king' - 'man' + 'woman': {result}")

Analogy result for 'king' - 'man' + 'woman': [('tomato', 0.6332396864891052)]


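Here, the model returns "tomato", which is not a meaningful answer. Words such as "king", "man", and "woman" are probably too rare, or used too differently, in hotel reviews for the model to learn this relationship, which highlights how poorly embeddings trained on a small, domain-specific dataset generalize.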

Finally, we can vectorize the reviews using the Word2Vec model by averaging the word vectors for each review.

In [118]:
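def get_review_vector(review, model):
    # Tokenize exactly as during training so tokens match the vocabulary.
    tokens = word_tokenize(review.lower())
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:  # empty review: avoid dividing by zero
        return np.zeros(model.wv.vector_size, dtype=np.float32)
    # Mean of the known word vectors (every token is known when min_count=0).
    return np.mean(vectors, axis=0)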

w2v_review_vectors = reviews['review'].apply(lambda x: get_review_vector(x, w2v_model))
w2v_review_vectors = np.vstack(w2v_review_vectors.values)
w2v_review_vectors = pd.DataFrame(w2v_review_vectors)
print(f"Shape of the Word2Vec review vectors: {w2v_review_vectors.shape}")

Shape of the Word2Vec review vectors: (5186, 500)


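We can see that the dataset now consists of 5186 reviews, and each review is represented by a 500-dimensional vector. We can also print the first embedding dimension across all reviews. Note that `w2v_review_vectors[0]` selects column 0 of the DataFrame, which is why the output has 5186 entries; `w2v_review_vectors.iloc[0]` would return the first review's 500-dimensional vector instead.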

In [119]:
print(w2v_review_vectors[0])

0       0.059414
1      -0.000099
2      -0.005535
3       0.031925
4       0.021695
          ...   
5181    0.040219
5182    0.010806
5183   -0.013544
5184    0.013583
5185   -0.008888
Name: 0, Length: 5186, dtype: float32


### 3.4. GloVe

Here, we will use the GloVe model to create word embeddings for the reviews.

In [79]:
glove_model = api.load('glove-wiki-gigaword-100')

In [80]:
print(f"Shape of the GloVe model: {glove_model.vectors.shape}")

Shape of the GloVe model: (400000, 100)


We can see that the GloVe model has a shape of (400000, 100), meaning there are 400,000 unique words in the vocabulary, and each word is represented by a 100-dimensional vector.

We can check the vector value for a specific word, such as "bed".

In [81]:
bed_vector_glove = glove_model['bed']
print(f"Vector for 'bed' in GloVe: {bed_vector_glove}")

Vector for 'bed' in GloVe: [-0.83528    0.57023    0.19219   -0.025946  -0.50039    0.36531
  0.16811    0.98349   -0.16987   -0.40123    0.82593    0.77665
  0.30743    1.1451     1.0567    -0.46868   -0.48286   -0.26397
 -0.14814   -0.82403   -0.31156    0.56133   -0.12384   -0.054355
  0.42796    0.38446   -0.38117   -0.53408   -0.34122    0.15891
  0.30952   -0.16873    0.36541    0.035137  -0.0095616  0.79946
 -0.50871   -0.031998   0.95187   -0.56081    0.23932    0.014487
 -0.15236   -0.73492   -0.24992    0.36617   -1.2171     0.42764
  0.47683   -0.28761   -0.46801   -0.44108    0.641      1.0611
 -0.14081   -1.885     -0.20596   -0.087876   1.2402     0.13708
  0.40899    0.5898    -0.14214   -0.13007    0.31583    0.60933
  1.0137    -0.31204   -0.34454   -0.45771   -0.26633    0.067238
  0.6028    -0.21555    0.27647    0.51912    0.33038   -0.1537
  0.36153    0.14506    0.021452   0.71533    0.33153    0.32344
 -0.41465   -0.45996    0.2721    -0.37398   -0.8419     0.427

We can also find the most similar words to "bed" using the GloVe model.

In [82]:
similar_words_glove = glove_model.most_similar('bed', topn=5)
print(f"Most similar words to 'bed' in GloVe: {similar_words_glove}")

Most similar words to 'bed' in GloVe: [('beds', 0.7630817890167236), ('sleeping', 0.7616754770278931), ('room', 0.7250887155532837), ('bedroom', 0.6915374994277954), ('mattress', 0.6799734830856323)]


Similar to before, we can perform analogy tasks using the GloVe model. For example, we can find a word that is to "colombo" as "galle" is to "city".

In [85]:
result = glove_model.most_similar(positive=['colombo', 'galle'], negative=['city'], topn=1)
print(f"Analogy result for 'colombo' - 'city' + 'galle' in GloVe: {result}")

Analogy result for 'colombo' - 'city' + 'galle' in GloVe: [('kandy', 0.5980703830718994)]


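Here, we see that the model determines that Colombo - City + Galle = Kandy. Kandy is another Sri Lankan city, so the answer reflects general geographic knowledge, but it misses the hotel-specific association that the Word2Vec model learned from the reviews. This highlights the limitations of a general-purpose pretrained model when applied to a specific domain.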

In [86]:
result = glove_model.most_similar(positive=['bed', 'internet'], negative=['sleep'], topn=1)
print(f"Analogy result for 'bed' - 'sleep' + 'internet' in GloVe: {result}")

Analogy result for 'bed' - 'sleep' + 'internet' in GloVe: [('web', 0.7473989725112915)]


For more generic relationships, however, the model performs better: it determines that Bed - Sleep + Internet = Web, which makes sense.

In [89]:
result = glove_model.most_similar(positive=['bed', 'water'], negative=['pillow'], topn=1)
print(f"Analogy result for 'bed' - 'pillow' + 'water' in GloVe: {result}")

Analogy result for 'bed' - 'pillow' + 'water' in GloVe: [('electricity', 0.651475727558136)]


Here, it determined that Bed - Pillow + Water = Electricity. As with the Word2Vec model, the result is not a meaningful analogy, which again highlights the limitations of the model in understanding certain relationships.

In [94]:
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"Analogy result for 'king' - 'man' + 'woman' in GloVe: {result}")

Analogy result for 'king' - 'man' + 'woman' in GloVe: [('queen', 0.7698540091514587)]


This model does, however, recover the classic King - Man + Woman = Queen analogy, which the Word2Vec model trained on the reviews could not.

Finally, we can vectorize the reviews using the GloVe model by averaging the word vectors for each review.

In [120]:
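def get_glove_review_vector(review, model):
    tokens = word_tokenize(review.lower())
    # Sum the vectors of the tokens GloVe knows. Out-of-vocabulary tokens are
    # skipped in the sum but still counted in len(tokens).
    vector = sum(model[token] for token in tokens if token in model) / len(tokens)
    return vector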

glove_review_vectors = reviews['review'].apply(lambda x: get_glove_review_vector(x, glove_model))
glove_review_vectors = np.vstack(glove_review_vectors.values)
glove_review_vectors = pd.DataFrame(glove_review_vectors)
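
Note that `get_glove_review_vector` divides by `len(tokens)`, so tokens missing from the GloVe vocabulary are left out of the sum but still counted in the denominator, which pulls the average toward zero; an empty review would also raise a `ZeroDivisionError`. One possible variant, sketched here with a plain dictionary standing in for the embedding model (`mean_known_vectors` and `toy_embeddings` are hypothetical), averages only over in-vocabulary tokens and falls back to a zero vector.

```python
import numpy as np

def mean_known_vectors(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Hypothetical 2-d embeddings standing in for a pretrained model.
toy_embeddings = {
    "good": np.array([1.0, 0.0], dtype=np.float32),
    "stay": np.array([0.0, 1.0], dtype=np.float32),
}

# "staycooperative" is out of vocabulary and does not affect the mean.
with_oov = mean_known_vectors(["good", "stay", "staycooperative"], toy_embeddings, dim=2)
empty = mean_known_vectors([], toy_embeddings, dim=2)
print(with_oov)  # [0.5 0.5]
print(empty)     # [0. 0.]
```

Switching to this variant would change the values printed below, since those outputs were produced with the `len(tokens)` denominator.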

In [121]:
print(glove_review_vectors[0])

0      -0.188056
1      -0.172444
2      -0.217831
3      -0.092853
4      -0.272053
          ...   
5181   -0.101548
5182   -0.089383
5183   -0.112040
5184   -0.133328
5185   -0.156379
Name: 0, Length: 5186, dtype: float32


In [122]:
print(f"Shape of the GloVe review vectors: {glove_review_vectors.shape}")

Shape of the GloVe review vectors: (5186, 100)


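Here, we can see that the dataset now consists of 5186 reviews, and each review is represented by a 100-dimensional vector. Note that the `glove_review_vectors[0]` printed above is column 0 of the DataFrame, i.e. the first embedding dimension across all reviews; `glove_review_vectors.iloc[0]` would return the first review's vector.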

Finally, we can save all the feature matrices to CSV files for further use.

In [123]:
feature_matrices = {
    'bow': bow_matrix,
    'tfidf': tfidf_matrix_df,
    'word2vec': w2v_review_vectors,
    'glove': glove_review_vectors
}

In [124]:
for name, matrix in feature_matrices.items():
    print(f"Saving {name} feature matrix...")
    matrix.to_csv(f"feature_matrix_{name}.csv", index=False)

Saving bow feature matrix...
Saving tfidf feature matrix...
Saving word2vec feature matrix...
Saving glove feature matrix...
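
The BoW and TF-IDF matrices are mostly zeros, so writing them as dense CSV files (5186 × 17968 cells each) produces very large files. One alternative, sketched here under the assumption that SciPy is available, is to keep the sparse matrix returned by `fit_transform` and store it with `scipy.sparse.save_npz`. The toy matrix and the file name are placeholders.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy stand-in for the sparse matrix returned by vectorizer.fit_transform(...).
toy_matrix = sparse.csr_matrix(np.array([[0, 1, 2, 0], [1, 0, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feature_matrix_bow.npz")  # placeholder file name
    sparse.save_npz(path, toy_matrix)   # stores only the non-zero entries
    restored = sparse.load_npz(path)    # fully loaded before the directory is removed

print(restored.toarray())
```

The `.npz` file holds no column names, so the vocabulary from `get_feature_names_out()` would need to be saved separately.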
