There are 1000 reviews for restaurants and films in a collection under the IA3-2.csv file. All of those reviews are saved as text files. In this assignment, you are required to investigate the topics of those reviews.

In [35]:
import numpy as np  # used later for np.unique
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.decomposition import LatentDirichletAllocation

In [219]:
df = pd.read_excel('IA3-2.xlsx')
df.head()

Unnamed: 0,id,review,label
0,1,About the shop: There is a restaurant in Soi L...,restaurant
1,2,About the shop: Through this store for about t...,restaurant
2,3,Roast Coffee &amp; Eatery is a restaurant loca...,restaurant
3,4,Eat from the children. The shop is opposite. P...,restaurant
4,5,The Ak 1 shop at another branch tastes the sam...,restaurant


## 1. Transform those reviews into a term-document matrix, lemmatize all the words, remove the stop-words and punctuations, set the minimal document frequency for each term to be 5 and include 2-gram.

In [220]:
# lemmatize, keep alphabetic tokens only (drops punctuation and numbers), remove stop-words
lemmatizer = nltk.stem.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once instead of once per token
processed_review = []
for doc in df['review']:
    tokens = nltk.word_tokenize(doc.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    processed_review.append(" ".join(tokens))

In [221]:
# Build a TF-IDF-weighted document-term matrix with unigrams and 2-grams,
# keeping only terms that appear in at least 5 documents
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(processed_review)
print(X.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.13698245 ... 0.         0.         0.        ]]
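
To make the effect of `ngram_range=(1,2)` and `min_df` concrete, the following is a minimal pure-Python sketch of the vocabulary-selection step only (no TF-IDF weighting). The helper names `ngrams` and `build_vocabulary` and the toy corpus are hypothetical; this is an illustration, not how scikit-learn implements it internally.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams (space-joined) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, min_df=2):
    """Keep the n-grams that occur in at least min_df different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.split())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Toy corpus (hypothetical, not taken from the review data)
toy_docs = [
    "good food good price",
    "good food slow service",
    "slow service bad film",
]
vocab = build_vocabulary(toy_docs, min_df=2)
print(vocab)
```

Terms such as `price` or `bad film` occur in only one toy document, so they are dropped, while the 2-gram `good food` survives because it appears in two documents.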


## 2. Use the LDA model to extract the topics of each document. In particular, we assume there are 6 topics.

In [222]:
lda = LatentDirichletAllocation(n_components=6).fit(X)
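
The fit above does not fix a seed, so the topic numbering can change from one run to the next. As a hedged sketch (toy corpus and the helper `fit_topics` are hypothetical, and `CountVectorizer` is used here only to keep the example small), passing `random_state` to `LatentDirichletAllocation` makes repeated fits return the same result:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the review data)
toy_docs = [
    "delicious food good restaurant",
    "good food eat restaurant",
    "film love story actor",
    "love film great actor",
]
counts = CountVectorizer().fit_transform(toy_docs)

def fit_topics(seed):
    """Fit a 2-topic LDA with a fixed seed and return the document-topic matrix."""
    lda_toy = LatentDirichletAllocation(n_components=2, random_state=seed)
    return lda_toy.fit_transform(counts)

run_a = fit_topics(0)
run_b = fit_topics(0)
print(np.allclose(run_a, run_b))  # same seed, same document-topic matrix
```

With a fixed seed, the topic numbers quoted in the written answers keep matching the printed output after the notebook is re-run.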

In [223]:
lda.components_  # components_[i, j]: unnormalized weight (pseudo-count) of term j in topic i

array([[1.28920483, 1.36559168, 2.19168414, ..., 0.35056112, 2.3615017 ,
        0.21554325],
       [0.16667049, 0.16666892, 0.1666688 , ..., 0.16667155, 0.16668615,
        0.16667888],
       [0.16667054, 0.16666895, 0.16666883, ..., 0.16667162, 0.1666864 ,
        0.16667904],
       [0.16667054, 0.16666895, 0.16666883, ..., 0.16667162, 0.16668641,
        0.16667904],
       [0.1666701 , 0.1666674 , 0.16668607, ..., 0.16687853, 0.24384703,
        4.41711932],
       [0.16667053, 0.16666894, 0.16666883, ..., 0.1666716 , 0.16668635,
        0.16667901]])

In [224]:
lda.components_.shape

(6, 6258)
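
The rows of `components_` are unnormalized weights, not probabilities. A small sketch of turning them into per-topic term distributions follows; the 2 x 4 array is a hypothetical stand-in for `lda.components_`, used only so the example runs on its own.

```python
import numpy as np

# Hypothetical components_-like array: 2 topics x 4 terms of unnormalized weights
components = np.array([[2.0, 1.0, 0.5, 0.5],
                       [0.2, 0.2, 0.2, 3.4]])

# Divide each row by its sum so every topic becomes a probability distribution over terms
topic_term_probs = components / components.sum(axis=1, keepdims=True)
print(topic_term_probs.round(3))
```

Normalizing does not change the ranking of terms within a topic, so the top-5 terms found with `argsort` are the same either way.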

In [225]:
terms = vectorizer.get_feature_names_out()

## 3. Report the topic distribution and the top-2 topics of the first 10 restaurant reviews (id = [1:10]) and the first 10 movie reviews (id = [501:510]).

In [226]:
# First 10 restaurant reviews
lda.transform(X[0:10])

array([[0.01567623, 0.15112103, 0.01558031, 0.01558032, 0.7864618 ,
        0.01558031],
       [0.01759481, 0.18356539, 0.01576118, 0.01576118, 0.75152571,
        0.01579174],
       [0.01564439, 0.01539142, 0.01539131, 0.01539131, 0.92279027,
        0.0153913 ],
       [0.24464539, 0.01947687, 0.01947691, 0.01947691, 0.67744704,
        0.0194769 ],
       [0.19124413, 0.02767625, 0.0276763 , 0.02767631, 0.69805072,
        0.02767629],
       [0.01676469, 0.01647329, 0.01647172, 0.01647172, 0.91734687,
        0.01647171],
       [0.01884125, 0.01820465, 0.01820468, 0.01820468, 0.90834007,
        0.01820467],
       [0.01927809, 0.01921875, 0.01921877, 0.01921877, 0.90384684,
        0.01921877],
       [0.04475409, 0.04472161, 0.04472162, 0.04472162, 0.77635944,
        0.04472162],
       [0.04110911, 0.04050126, 0.04050136, 0.04050136, 0.79688477,
        0.04050214]])

In [227]:
# Topic distribution of each review: topic probabilities rounded to 3 decimals
doc_topics = lda.transform(X[0:10])  # compute once, reuse inside the loop
for i in range(10):
    print('Review ID %d' % (i + 1))
    for j in range(6):
        print('Topic ' + str(j) + ':', round(doc_topics[i][j], 3), ',', end=' ')
    print()
    print('-----------')

Review ID 1
Topic 0: 0.016 , Topic 1: 0.151 , Topic 2: 0.016 , Topic 3: 0.016 , Topic 4: 0.786 , Topic 5: 0.016 , 
-----------
Review ID 2
Topic 0: 0.018 , Topic 1: 0.184 , Topic 2: 0.016 , Topic 3: 0.016 , Topic 4: 0.752 , Topic 5: 0.016 , 
-----------
Review ID 3
Topic 0: 0.016 , Topic 1: 0.015 , Topic 2: 0.015 , Topic 3: 0.015 , Topic 4: 0.923 , Topic 5: 0.015 , 
-----------
Review ID 4
Topic 0: 0.245 , Topic 1: 0.019 , Topic 2: 0.019 , Topic 3: 0.019 , Topic 4: 0.677 , Topic 5: 0.019 , 
-----------
Review ID 5
Topic 0: 0.191 , Topic 1: 0.028 , Topic 2: 0.028 , Topic 3: 0.028 , Topic 4: 0.698 , Topic 5: 0.028 , 
-----------
Review ID 6
Topic 0: 0.017 , Topic 1: 0.016 , Topic 2: 0.016 , Topic 3: 0.016 , Topic 4: 0.917 , Topic 5: 0.016 , 
-----------
Review ID 7
Topic 0: 0.019 , Topic 1: 0.018 , Topic 2: 0.018 , Topic 3: 0.018 , Topic 4: 0.908 , Topic 5: 0.018 , 
-----------
Review ID 8
Topic 0: 0.019 , Topic 1: 0.019 , Topic 2: 0.019 , Topic 3: 0.019 , Topic 4: 0.904 , Topic 5: 0.019

In [228]:
# Top 2 topics of each review (reported by topic number)
doc_topics = lda.transform(X[0:10])
for i in range(10):
    print('Review ID %d' % (i + 1))
    t2, t1 = doc_topics[i].argsort()[-2:]  # argsort is ascending: the last index is the largest
    print('Top 1:', t1, 'Top 2:', t2)
    print('-----------')

Review ID 1
Top 1: 4 Top 2: 1
-----------
Review ID 2
Top 1: 4 Top 2: 1
-----------
Review ID 3
Top 1: 4 Top 2: 0
-----------
Review ID 4
Top 1: 4 Top 2: 0
-----------
Review ID 5
Top 1: 4 Top 2: 0
-----------
Review ID 6
Top 1: 4 Top 2: 0
-----------
Review ID 7
Top 1: 4 Top 2: 0
-----------
Review ID 8
Top 1: 4 Top 2: 0
-----------
Review ID 9
Top 1: 4 Top 2: 0
-----------
Review ID 10
Top 1: 4 Top 2: 0
-----------


In [229]:
# First 10 movie reviews
doc_topics = lda.transform(X[500:510])  # compute once, reuse inside the loop

# Topic distribution of each review: topic probabilities rounded to 3 decimals
for i in range(10):
    print('Review ID %d' % (i + 501))
    for j in range(6):
        print('Topic ' + str(j) + ':', round(doc_topics[i][j], 3), ',', end=' ')
    print()
    print('-----------')

Review ID 501
Topic 0: 0.918 , Topic 1: 0.016 , Topic 2: 0.016 , Topic 3: 0.016 , Topic 4: 0.016 , Topic 5: 0.016 , 
-----------
Review ID 502
Topic 0: 0.927 , Topic 1: 0.012 , Topic 2: 0.012 , Topic 3: 0.012 , Topic 4: 0.024 , Topic 5: 0.012 , 
-----------
Review ID 503
Topic 0: 0.944 , Topic 1: 0.011 , Topic 2: 0.011 , Topic 3: 0.011 , Topic 4: 0.011 , Topic 5: 0.011 , 
-----------
Review ID 504
Topic 0: 0.862 , Topic 1: 0.028 , Topic 2: 0.028 , Topic 3: 0.028 , Topic 4: 0.028 , Topic 5: 0.028 , 
-----------
Review ID 505
Topic 0: 0.924 , Topic 1: 0.015 , Topic 2: 0.015 , Topic 3: 0.015 , Topic 4: 0.015 , Topic 5: 0.015 , 
-----------
Review ID 506
Topic 0: 0.915 , Topic 1: 0.017 , Topic 2: 0.017 , Topic 3: 0.017 , Topic 4: 0.017 , Topic 5: 0.017 , 
-----------
Review ID 507
Topic 0: 0.91 , Topic 1: 0.018 , Topic 2: 0.018 , Topic 3: 0.018 , Topic 4: 0.018 , Topic 5: 0.018 , 
-----------
Review ID 508
Topic 0: 0.857 , Topic 1: 0.029 , Topic 2: 0.029 , Topic 3: 0.029 , Topic 4: 0.029 ,

In [230]:
# Top 2 topics of each review (reported by topic number)
doc_topics = lda.transform(X[500:510])
for i in range(10):
    print('Review ID %d' % (i + 501))
    t2, t1 = doc_topics[i].argsort()[-2:]  # argsort is ascending: the last index is the largest
    print('Top 1:', t1, 'Top 2:', t2)
    print('-----------')

Review ID 501
Top 1: 0 Top 2: 4
-----------
Review ID 502
Top 1: 0 Top 2: 4
-----------
Review ID 503
Top 1: 0 Top 2: 4
-----------
Review ID 504
Top 1: 0 Top 2: 4
-----------
Review ID 505
Top 1: 0 Top 2: 4
-----------
Review ID 506
Top 1: 0 Top 2: 4
-----------
Review ID 507
Top 1: 0 Top 2: 4
-----------
Review ID 508
Top 1: 0 Top 2: 4
-----------
Review ID 509
Top 1: 0 Top 2: 4
-----------
Review ID 510
Top 1: 0 Top 2: 4
-----------


- Note: 
I used argsort to find the top two topics of a given review, but when an array contains tied elements, argsort( ) returns the tied indices in an arbitrary order, so the reported ranking could be ambiguous.  
  
    When checking the array of each review, I noticed that some elements look tied when displayed to eight decimals. Take id 1 for example; in another run (the topic numbering differs between runs), the array was:  [[0.01558005, 0.78980687, 0.14787295, 0.01558005, 0.01558005, 0.01558003]]  
 
    Since three elements (0.01558005) appear to be tied with one another, I used the unique( ) function to check whether they are really tied or only look equal because of the array display precision.
 
    It turned out that even though some elements look equal when displayed to eight decimals, they are not actually the same. Therefore, using argsort( ) in the code above to find the top two topics is still valid.

In [231]:
for i in range(0,10):
    print(len(np.unique(lda.transform(X[i]))))

6
6
6
6
6
6
6
6
6
6


In [232]:
for i in range(500,510):
    print(len(np.unique(lda.transform(X[i]))))

6
6
6
6
6
6
6
6
6
6
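
The uniqueness check above confirms there are no exact ties in these 20 reviews. As an alternative sketch, assuming one wants a ranking that stays deterministic even when exact ties do occur, a stable sort can be used so that ties are always broken by the lower topic index. The helper name `top_k_topics` is hypothetical.

```python
import numpy as np

def top_k_topics(distribution, k=2):
    """Indices of the k largest entries; ties are broken by the lower topic index."""
    distribution = np.asarray(distribution)
    # Negate so that an ascending stable sort yields descending probabilities
    # while keeping equal values in their original (index) order.
    order = np.argsort(-distribution, kind="stable")
    return order[:k].tolist()

# Exact tie between topics 1 and 2: the lower index is ranked first
print(top_k_topics([0.1, 0.4, 0.4, 0.1]))

# Distribution of review 1 as printed earlier in this notebook
print(top_k_topics([0.01567623, 0.15112103, 0.01558031,
                    0.01558032, 0.7864618, 0.01558031]))
```

For review 1 this agrees with the result reported above (topic 4 first, topic 1 second).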


## 4. Find the top-5 terms (terms with the top-5 highest weights) for each of the 6 topics. Based on those terms, describe what those topics are about. 

In [233]:
for i, topic in enumerate(lda.components_):
    print('Topic %d:' % i )
    # for each topic, get the top 5 terms
    for j in topic.argsort()[:-6:-1]:
        print(terms[j])
    print('--------------')

Topic 0:
quot
film
wa
love
ha
--------------
Topic 1:
gt good
berry
exceeded
win competition
average price
--------------
Topic 2:
jia
fine
live
star
shanghai
--------------
Topic 3:
fifth
thonglor
located soi
forget
located
--------------
Topic 4:
delicious
food
eat
restaurant
good
--------------
Topic 5:
town
old one
describe
typical
south
--------------


## 5. Based on finding in 3 and 4, describe what review 1 [ID=1] and review 501 [ID=501] are about? 

- Review 1
    Top topics for review 1 are topic 4 (~0.8) and topic 1 (~0.15). It is clear that this review is talking about how good the restaurant or its food was. In particular, the top terms of these two topics include 'good', 'eat', 'delicious', 'food', 'restaurant', and 'berry'.
 
- Review 501:
     Top topic for review 501 is topic 0 (~0.9). Each of the other topics accounts for only about 1.6%. This review's main idea is not as clear as that of review 1. The top terms of topic 0 tell us that it is a review of a 'film', and 'love' may indicate a positive attitude toward the film. 'wa' and 'ha' are most likely 'was' and 'has' with the final 's' stripped by the lemmatizer, and 'quot' is probably left over from the HTML entity &quot; in the raw text, so these three terms do not provide much information here.
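
Terms such as 'quot' and 'gt good' among the top terms, together with the `&amp;` visible in the data preview, suggest that the raw reviews contain HTML entities. A minimal sketch, assuming that is the case, of decoding them with the standard-library `html.unescape` before tokenizing. The `simple_tokens` helper is a hypothetical stand-in for the `nltk.word_tokenize` plus `isalpha()` steps, and the sample sentence is made up.

```python
import html
import re

def simple_tokens(text):
    """Stand-in tokenizer (an assumption, not nltk.word_tokenize): lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical review text containing HTML entities
raw = "He called it &quot;good&quot; &amp; worth the price"

tokens_raw = simple_tokens(raw)                    # entity names leak in as tokens
tokens_clean = simple_tokens(html.unescape(raw))   # entities decoded to punctuation first

print(tokens_raw)
print(tokens_clean)
```

Applied to `df['review']` before the lemmatization step, this kind of decoding would likely keep entity names out of the vocabulary, although the topics would then need to be refit and reinterpreted.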