# Unsupervised Topic Modeling

## Loading LDA model and making predictions

In the previous notebook, we trained an LDA model with suitable hyperparameters. Now, we will make predictions for each review; a review can have multiple topics. These are not the final predictions, because we were also given an evaluation_labels.md file containing 12 topic definitions that the reviews must be mapped to.

In [1]:
import gensim
from gensim.utils import simple_preprocess
%store -r stop_words
%store -r bigram_mod
%store -r id2word
%store -r nlp

# loading our trained LDA model
lda_model = gensim.models.LdaMulticore.load('lda.model')



In [2]:
import re
# Function to assign topics to reviews

def topic_assign(text):
  
  # Basic preprocessing to make data suitable to feed into LDA model  
  text = re.sub(r'[,.!?]', ' ', text)  # raw string avoids an invalid-escape warning
  text = text.lower()

  text = [word for word in simple_preprocess(text) if word not in stop_words]
  text = bigram_mod[text]

  allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
  text = nlp(" ".join(text)) 
  text = [token.lemma_ for token in text if token.pos_ in allowed_postags]

  text = id2word.doc2bow(text)

  # The review is assigned a topic only if the probability of that topic is greater than 0.2
  topics = lda_model.get_document_topics(text, minimum_probability=0.2) 

  # Each entry is a (topic_id, probability) pair; keep only the topic ids
  topic_list = [topic_id for topic_id, _ in topics]

  return topic_list  
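
To make the thresholding step concrete, here is a minimal sketch that mimics the filtering on a hand-written topic distribution. The helper name `filter_topics` and the sample probabilities are hypothetical; in the function above, gensim does comparable filtering itself when `minimum_probability` is passed to `get_document_topics`.

```python
def filter_topics(distribution, threshold=0.2):
    """Return the ids of topics whose probability exceeds the threshold."""
    return [topic_id for topic_id, prob in distribution if prob > threshold]


# Hypothetical (topic_id, probability) pairs for a single review
sample_distribution = [(0, 0.55), (1, 0.25), (3, 0.15), (7, 0.05)]

assigned = filter_topics(sample_distribution)
print(assigned)
```

With this threshold, a review can receive several topics, a single topic, or none at all if no topic is dominant enough.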

In [3]:
import pandas as pd

df = pd.read_csv('./data.csv')

texts = list(df['text'])
topics = []

for text in texts:
  top = topic_assign(text)
  topics.append(top)

output = pd.DataFrame({'text':texts, 'topics':topics})

print(output.describe())
print(output.head(5))

                                                     text topics
count                                               10132  10132
unique                                              10132     77
top     Tires where delivered to the garage of my choi...    [0]
freq                                                    1   5169
                                                text topics
0  Tires where delivered to the garage of my choi...    [1]
1  Easy Tyre Selection Process, Competitive Prici...    [0]
2         Very easy to use and good value for money.    [0]
3              Really easy and convenient to arrange    [0]
4  It was so easy to select tyre sizes and arrang...    [0]
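
The describe() summary above counts distinct combinations of topics (77 of them), not how often each individual topic occurs. A small sketch of counting individual topics with `collections.Counter` follows; the `sample_topics` list is made-up data standing in for the `topics` column.

```python
from collections import Counter

# Hypothetical topic lists for five reviews (stand-in for output['topics'])
sample_topics = [[1], [0], [0], [0, 3], [0, 1]]

# Flatten the per-review lists and count each topic id
topic_counts = Counter(topic for review in sample_topics for topic in review)

for topic_id, count in topic_counts.most_common():
    print(topic_id, count)
```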


We've assigned the LDA-predicted topics to each review. Now, we can map these topics to the desired topic definitions based on the keywords they cover. All we have to do is see which labels from evaluation_labels.md are covered by each LDA topic.

NOTE: There can be an overlap in topics, and this is perfectly okay. The mapping is done manually for each of the 10 topics generated by our LDA model. I think this makes it slightly easier to get more meaningful predictions.

Also, we can quickly check whether there are any null values. If there are, we shall discard them.

In [4]:
print(output.isnull().sum())

text      0
topics    0
dtype: int64


## Manual assignment of topics 
Let's take a look at the topics provided to us in the evaluation_labels.md file. We will then map these to our LDA generated topics. 

In [5]:
# Use a context manager so the file is closed after reading
with open('./evaluation_labels.md', 'r') as f:
    content = f.read()

print(content)


## General
The dataset provided to you has over 10000 reviews(documents) that belongs to the domain of tyre change and repair service.

## Topic definitions
- **value for money** 0: This topic refers to documents/reviews that mention the value for money for the service provided. Is it cost effective, is it expensive and so on. 
- **garage service** 1: This topic refers to documents/reviews that talk about the service offered by a garage. If the service was quick, efficient, customers satisfaction with the experience at the garage and so on.
- **ease of booking** 2: This topic refers to documents/reviews that talk about the ease of scheduling a service or appointment with the garage.
- **tyre quality** 3: This topic refers to documents/reviews that talk about the quality of tyres provided by the vendor
- **mobile fitter** 4: This topic refers to documents/reviews that talk about the quality of service provided by mobile fitters. Mobile fitters are tyre mechanics who can commute to and 

In [6]:
# Storing LDA topics in a csv file

topics = lda_model.print_topics()

# print_topics() already returns a list of (topic_id, keywords) tuples
td = pd.DataFrame({'topic': topics})
td.to_csv('./topics.csv')
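
When mapping topics by hand, it can help to split each topic string into its keywords and weights first. The sketch below is one possible way to do that with a regular expression; `parse_topic` is a hypothetical helper, and the sample string contains only the first two terms of topic 0.

```python
import re


def parse_topic(topic_string):
    """Split a string like '0.078*"service" + 0.059*"good"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Shortened example in the (topic_id, keywords) shape returned by print_topics()
sample_topic = (0, '0.078*"service" + 0.059*"good"')

topic_id, topic_string = sample_topic
keywords = parse_topic(topic_string)
print(topic_id, keywords)
```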

After manually mapping the topics, an extra column was added to the csv file to hold the labels (such as 'garage service', 'delivery punctuality', etc.) that each topic covers. This is what the corresponding labels for each topic look like.

In [7]:
mapped_topics = pd.read_csv('./mapped_topics.csv')

print(mapped_topics.head(10))

   Unnamed: 0                                              topic  \
0           0  (0, '0.078*"service" + 0.059*"good" + 0.058*"p...   
1           1  (1, '0.057*"tyre" + 0.038*"time" + 0.031*"fit"...   
2           2  (2, '0.043*"start" + 0.042*"finish" + 0.026*"e...   
3           3  (3, '0.048*"staff" + 0.041*"helpful" + 0.027*"...   
4           4  (4, '0.020*"anywhere_else" + 0.011*"wide" + 0....   
5           5  (5, '0.046*"tyre" + 0.018*"car" + 0.011*"order...   
6           6  (6, '0.027*"smoothly" + 0.011*"efficiently" + ...   
7           7  (7, '0.028*"day" + 0.022*"email" + 0.016*"call...   
8           8  (8, '0.009*"painless" + 0.003*"turnaround" + 0...   
9           9  (9, '0.016*"charge" + 0.009*"torque" + 0.008*"...   

            labels  
0     [0, 1, 2, 3]  
1     [4, 6, 7, 9]  
2  [3, 4, 6, 7, 9]  
3        [1, 2, 4]  
4           [3, 5]  
5           [3, 8]  
6    [2, 4, 8, 10]  
7      [6, 10, 11]  
8       [0, 5, 10]  
9              [3]  


Let's make a dictionary to map the labels to their corresponding topic definitions as provided.

In [8]:
# Creating dictionary to map topic definitions to labels
topic_defs = ['value for money', 'garage service', 'ease of booking', 'tyre quality', 'mobile fitter', 'location', 'length of fitting', 'delivery punctuality', 'booking confusion', 'wait time', 'discounts', 'change of date']
topic_labels = list(range(12))
dict_topics = dict(zip(topic_defs, topic_labels))

# Creating dictionary to map LDA topics to topic definition labels
lda_topics = list(range(10))
lda_topic_labels = list(mapped_topics['labels'])
dict_mapped_topics = dict(zip(lda_topics, lda_topic_labels))

print(dict_topics)
print(dict_mapped_topics)

{'value for money': 0, 'garage service': 1, 'ease of booking': 2, 'tyre quality': 3, 'mobile fitter': 4, 'location': 5, 'length of fitting': 6, 'delivery punctuality': 7, 'booking confusion': 8, 'wait time': 9, 'discounts': 10, 'change of date': 11}
{0: '[0, 1, 2, 3]', 1: '[4, 6, 7, 9]', 2: '[3, 4, 6, 7, 9]', 3: '[1, 2, 4]', 4: '[3, 5]', 5: '[3, 8]', 6: '[2, 4, 8, 10]', 7: '[6, 10, 11]', 8: '[0, 5, 10]', 9: '[3]'}
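
Note that the second dictionary holds each label list as a string (for example '[0, 1, 2, 3]'), because the labels column is read back from the csv as text. A minimal sketch of turning such strings into real lists of integers with the standard library's `ast.literal_eval` is shown below; `parse_labels` is a hypothetical helper, and only two of the ten entries are reproduced.

```python
import ast


def parse_labels(raw):
    """Convert a string such as '[0, 1, 2, 3]' into a list of ints."""
    return ast.literal_eval(raw)


# Two entries copied from the printed dictionary, still in string form
raw_mapping = {0: '[0, 1, 2, 3]', 9: '[3]'}

parsed_mapping = {topic: parse_labels(labels) for topic, labels in raw_mapping.items()}
print(parsed_mapping)
```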


Now, let's assign the topics to the given topic definitions for each review.

In [9]:
# Function to convert topic label to topic definition
def convert_topic(topic):
    pred = list(dict_topics.keys())[list(dict_topics.values()).index(topic)]
    
    return pred
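
The function above rebuilds two lists on every call to find the key for a given label. An alternative sketch, assuming the labels are unique, is to invert the dictionary once and then do a direct lookup. The names `label_to_def` and `convert_topic_fast` are hypothetical, and only three of the twelve definitions are reproduced.

```python
# Subset of dict_topics, reproduced here so the example is self-contained
dict_topics = {'value for money': 0, 'garage service': 1, 'ease of booking': 2}

# Invert once: label -> topic definition
label_to_def = {label: name for name, label in dict_topics.items()}


def convert_topic_fast(topic):
    """Look up the topic definition for a numeric label."""
    return label_to_def[topic]


print(convert_topic_fast(1))
```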

In [10]:
# We now assign topic definitions to each review

top_defs = output['topics'].tolist()

top_defs_words = []

for i in top_defs: # Traverse through every review
    words = []
    for j in i: # Traverse through LDA predicted topics for each review
        words.append(convert_topic(j)) # append the topic definition for the corresponding label
    top_defs_words.append(words) # append the entire set of topic definitions to the review
    
print(top_defs_words[0:2]) # Checking the topic definitions for the first 2 reviews    

[['garage service'], ['value for money']]
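
The loop above passes each LDA topic id straight to convert_topic, so LDA topic 1 is read as label 1 ('garage service') and the manual mapping in dict_mapped_topics is not consulted. If the intent is to route predictions through that mapping, one possible sketch is given below. It assumes the label strings have already been parsed into lists of integers; `expand_topics` is a hypothetical helper, and only the first two mapping entries are reproduced.

```python
topic_defs = ['value for money', 'garage service', 'ease of booking', 'tyre quality',
              'mobile fitter', 'location', 'length of fitting', 'delivery punctuality',
              'booking confusion', 'wait time', 'discounts', 'change of date']

# Assumed: labels already parsed from strings into lists of ints
dict_mapped_topics = {0: [0, 1, 2, 3], 1: [4, 6, 7, 9]}


def expand_topics(lda_topic_ids):
    """Map LDA topic ids to topic definitions via the manual mapping."""
    names = []
    for lda_id in lda_topic_ids:
        for label in dict_mapped_topics[lda_id]:
            name = topic_defs[label]
            if name not in names:  # skip duplicates when mappings overlap
                names.append(name)
    return names


print(expand_topics([1]))
```

Under this approach a review assigned a single LDA topic would receive every definition mapped to that topic, so the results would differ from the output shown above.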


## Final Predictions
We can now store our predictions in a dataframe which we can then write to a csv.

In [11]:
final_data = pd.DataFrame({'text':texts, 'topic definitons':top_defs_words})

print(final_data.head(10))

                                                text  \
0  Tires where delivered to the garage of my choi...   
1  Easy Tyre Selection Process, Competitive Prici...   
2         Very easy to use and good value for money.   
3              Really easy and convenient to arrange   
4  It was so easy to select tyre sizes and arrang...   
5  service was excellent. Only slight downside wa...   
6  User friendly Website. Competitive Prices. Goo...   
7                       Excellent prices and service   
8  It was very straightforward and the garage was...   
9                               Use of local garage.   

                    topic definitons  
0                   [garage service]  
1                  [value for money]  
2                  [value for money]  
3                  [value for money]  
4                  [value for money]  
5  [value for money, garage service]  
6                  [value for money]  
7                  [value for money]  
8  [value for money, garage serv