Using the Latent Dirichlet Allocation (LDA) model from the gensim module, I extracted the main topics, each expressed as a group of words, that employees used when describing what it is like to work at their organization, in this case Amazon.

# Part 1: Scraping the webpage and extracting the reviews from the employees of Amazon

In [42]:
#importing all the necessary libraries 

import requests
import json
import gensim
import re
import string
from pprint import pprint
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob
from gensim import corpora, models

Suppose we only want the reviews that employees posted in January 2022. The month and year are hard-coded in the function below, which extracts the data from the site's JSON endpoint.

In [28]:
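#Function that takes a url as input and returns a list of lists containing
#the Pros ['LikesText'], Cons ['DisLikesText'] and Date ['Modified'] of each January review.
#It returns 0 instead as soon as a review older than 2022 is reached.

def MyFunc1(url):
    response = requests.get(url)
    payload = response.json()    #named payload so the imported json module is not shadowed
    data = payload["data"]["reviews"]

    MyData = []
    for i in data:
        dt = i["Modified"]
        year = int(dt[0:4])      #the timestamp starts with 'YYYY-MM'
        month = int(dt[5:7])
        if year < 2022:
            return 0             #signals the caller to stop requesting pages
        if month == 1:
            MyData.append([i['LikesText'], i["DisLikesText"], i["Modified"]])
    return MyData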
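
The slicing above relies on the timestamp always starting with `YYYY-MM`. A slightly more explicit alternative, sketched below, parses the whole timestamp with the standard library. It assumes the `Modified` field has the `YYYY-MM-DD HH:MM:SS` shape seen in the saved reviews; the helper name `is_target_month` and the second and third sample timestamps are made up for illustration.

```python
from datetime import datetime

def is_target_month(modified, year=2022, month=1):
    """Return True when a 'YYYY-MM-DD HH:MM:SS' timestamp falls in the given year and month."""
    stamp = datetime.strptime(modified, "%Y-%m-%d %H:%M:%S")
    return stamp.year == year and stamp.month == month

# The first timestamp comes from the scraped reviews; the other two are illustrative.
samples = ["2022-01-31 23:39:41", "2022-02-01 00:05:10", "2021-12-30 10:00:00"]
january_only = [s for s in samples if is_target_month(s)]
print(january_only)
```

Unlike the month == 1 check on its own, this version also pins the year, so a January review from a later year would not slip through.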

In [29]:
#Calling the function page by page until it signals that reviews older than 2022 were reached

final_data = []
num_pages = 120
for i in range(num_pages):
    url = "https://www.ambitionbox.com/api/v2/reviews/data/114?page="+str(i+1)+"&sort=recent"
    L = MyFunc1(url)
    if L == 0:
        break
    final_data.extend(L)
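
One caveat with the 0 sentinel: if a page contains both January 2022 reviews and older ones, the January reviews already collected from that page are thrown away when the function returns 0. A minimal sketch of one way around this is to return the rows together with a stop flag. The names `january_rows`, `collect` and `fake_pages` are hypothetical, and the pages are stubbed with made-up data so that the example runs without network access.

```python
def january_rows(reviews):
    """Return (rows, reached_older_year) for one page of review dicts."""
    rows = []
    for r in reviews:
        modified = r["Modified"]
        year, month = int(modified[0:4]), int(modified[5:7])
        if year < 2022:
            return rows, True    # keep what was collected and tell the caller to stop
        if month == 1:
            rows.append([r["LikesText"], r["DisLikesText"], modified])
    return rows, False

def collect(fetch_page, max_pages):
    """Walk the pages in order and stop once an older year has been seen."""
    collected = []
    for page in range(1, max_pages + 1):
        rows, done = january_rows(fetch_page(page))
        collected.extend(rows)
        if done:
            break
    return collected

# Made-up pages standing in for the API responses (most recent first).
fake_pages = {
    1: [{"LikesText": "a", "DisLikesText": "b", "Modified": "2022-02-01 09:00:00"},
        {"LikesText": "c", "DisLikesText": "d", "Modified": "2022-01-31 23:39:41"}],
    2: [{"LikesText": "e", "DisLikesText": "f", "Modified": "2022-01-01 08:00:00"},
        {"LikesText": "g", "DisLikesText": "h", "Modified": "2021-12-31 22:00:00"}],
    3: [{"LikesText": "i", "DisLikesText": "j", "Modified": "2021-12-30 10:00:00"}],
}
result = collect(lambda page: fake_pages[page], max_pages=3)
print(len(result))
```

Here the January review on page 2 is kept even though that page also contains a 2021 review, and page 3 is never requested.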

In [30]:
len(final_data)

451

# Part 2: Creating a CSV file and analysing the reviews. 

In [31]:
#Converting our list to a csv file so that the main topics can be extracted in Part 2.

import pandas as pd
headerList = ['Pros', 'Cons', 'Date']

raw_data = pd.DataFrame(final_data)
raw_data.to_csv('reviews.csv', index=False, header=headerList)
raw_data = pd.read_csv('reviews.csv')
raw_data.head()

|   | Pros | Cons | Date |
|---|------|------|------|
| 0 | Everything work culture, seniors they are very... | Nothing this company is the best.i loved worki... | 2022-01-31 23:39:41 |
| 1 | Work Life balance and career growth | Nothing | 2022-01-31 23:20:21 |
| 2 | Good company for skill development | Nothing\r\n | 2022-01-31 23:11:58 |
| 3 | Great company, terrible process | They only accept what their reports generated ... | 2022-01-31 21:41:40 |
| 4 | Work life balance and work culture is somethin... | Salary is decent but career growth is comparat... | 2022-01-31 19:50:00 |


In [32]:
data = raw_data.Pros.values.tolist()

In [33]:


stopWords = set(stopwords.words('english'))

#Extra stopwords: filler words and stray single letters left over after cleaning
stopWords.update(["iam", "nan", "na", "nothing", "b","c","e","f","g","h","j","k","l","n","p","q","r","u","v","w","x","z"])

lemmatizer = WordNetLemmatizer()

def clean_txt(txt):
    corpus = []
    for t in txt:
        #First the text is lowercased, then anything other than letters is replaced by a space,
        #and finally leading and trailing whitespace (including newlines) is stripped.
        review = str(t).lower()
        review = re.sub('[^a-zA-Z]', ' ', review)
        review = review.strip()

        #Now tokenize the review, remove stopwords, and lemmatize the remaining words as nouns.
        #Punctuation has already been removed by the substitution above.
        w_txt = word_tokenize(review)
        tokens = [ww for ww in w_txt if ww not in stopWords]
        review = [lemmatizer.lemmatize(word, pos = 'n') for word in tokens]
        corpus.append(review)

    return corpus

clean = clean_txt(data)

In [48]:
corp = corpora.Dictionary(clean)   #maps each unique token to an integer id

#A bag-of-words representation of every review, needed for the LDA model later on

corpus = [corp.doc2bow(ww) for ww in clean]
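
To make the bag-of-words step concrete, here is a small standard-library sketch of the same idea: every review becomes a list of (token id, count) pairs, which is the shape that doc2bow returns. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, the two toy reviews are made up, and ids are assigned in first-seen order, which need not match the numbering gensim chooses.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Two toy reviews, already cleaned and tokenized.
docs = [["work", "life", "balance"], ["good", "work", "culture", "work"]]
vocab = build_vocab(docs)
bow = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bow)
```

Word order is discarded in this representation; only how often each token occurs in a review is kept, which is all the LDA model looks at.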

In [55]:
lda_model =  models.LdaModel(corpus, num_topics=3, id2word=corp, passes=4)

In [56]:
pprint(lda_model.print_topics())

[(0,
  '0.071*"good" + 0.068*"work" + 0.046*"culture" + 0.035*"amazon" + '
  '0.032*"company" + 0.018*"great" + 0.015*"management" + 0.014*"environment" '
  '+ 0.013*"like" + 0.012*"best"'),
 (1,
  '0.060*"work" + 0.028*"amazon" + 0.026*"culture" + 0.022*"job" + '
  '0.020*"company" + 0.018*"security" + 0.017*"policy" + 0.015*"employee" + '
  '0.014*"best" + 0.013*"benefit"'),
 (2,
  '0.020*"job" + 0.019*"work" + 0.019*"nice" + 0.017*"experience" + '
  '0.015*"working" + 0.014*"amazon" + 0.010*"place" + 0.010*"customer" + '
  '0.009*"time" + 0.009*"help"')]


Words such as culture, environment, management, security, benefit and help point to the pros that employees associate with working at Amazon. At the same time, some words (good, nice, work, best) carry little meaning on their own, although you can read between the lines: they most likely describe the workplace or the environment as good, nice or the best.

To get around this issue, we can use bigrams or trigrams, which keep neighbouring words together and so give these words some context.
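
As a rough illustration of the bigram idea, the sketch below joins adjacent tokens and keeps only the pairs that occur at least twice. It uses plain Python rather than gensim (gensim also ships a Phrases model for this purpose), the helper names `bigrams` and `frequent_bigrams` and the `min_count` threshold are assumptions, and the three toy reviews are made up.

```python
from collections import Counter

def bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

def frequent_bigrams(docs, min_count=2):
    """Count bigrams over all documents and keep those seen at least min_count times."""
    counts = Counter(bg for doc in docs for bg in bigrams(doc))
    return {bg: n for bg, n in counts.items() if n >= min_count}

# Toy reviews, already cleaned and tokenized.
docs = [
    ["work", "life", "balance", "career", "growth"],
    ["good", "work", "culture"],
    ["work", "life", "balance", "good"],
]
common = frequent_bigrams(docs)
print(common)
```

Frequent pairs such as work_life could then be added to each review's token list before the dictionary is built, so that the topic model can treat them as single terms.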

# Now the same analysis, but for the Cons

In [37]:
cons = raw_data.Cons.values.tolist()

In [38]:
clean_cons = list(clean_txt(cons))

In [51]:
corp_cons = corpora.Dictionary(clean_cons)

#A bag of words model necessary for our LDA model later on

corpus_cons = [corp_cons.doc2bow(ww) for ww in clean_cons]

In [60]:
lda_model_cons =  models.LdaModel(corpus_cons, num_topics=3,id2word=corp_cons, passes=4)

In [61]:
pprint(lda_model_cons.print_topics())

[(0,
  '0.028*"job" + 0.018*"company" + 0.018*"management" + 0.016*"work" + '
  '0.011*"politics" + 0.011*"employee" + 0.011*"even" + 0.010*"best" + '
  '0.010*"contract" + 0.009*"permanent"'),
 (1,
  '0.034*"good" + 0.022*"dislike" + 0.022*"amazon" + 0.019*"work" + '
  '0.017*"everything" + 0.014*"company" + 0.011*"thing" + 0.011*"manager" + '
  '0.011*"pressure" + 0.009*"shift"'),
 (2,
  '0.043*"growth" + 0.032*"work" + 0.023*"slow" + 0.020*"career" + '
  '0.019*"salary" + 0.015*"working" + 0.014*"life" + 0.013*"time" + '
  '0.013*"opportunity" + 0.013*"balance"')]


Some words give a clear idea of the cons of the workplace: politics, which implies office politics; shift, which might refer to work shifts; and salary, which employees might consider too low.

As with the Pros, words such as life and balance probably refer to work-life balance, while slow might indicate the pace of one's workflow. Here too, I believe n-grams would help resolve the ambiguity.
