# Text Analysis on Glassdoor Reviews

<span>
    <img src="../images/glassdoor_logo.png" height="100" width="100">
    <img src="../images/xero_logo.png" height="100" width="100">
</span>

In [1]:
### Packages ###
library(xml2)       #convert to XML document: read_html()
library(rvest)      #scrape
library(purrr)      #iterate scraping by map_df()


Attaching package: 'purrr'

The following object is masked from 'package:rvest':

    pluck



In [2]:
## Set URL details
company <- "Xero-Reviews-E318448"   #Change this value to scrape a different company
baseurl <- "https://www.glassdoor.com/Reviews/"
sort <- ".htm?sort.sortType=RD&sort.ascending=true"
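
To make the URL scheme concrete, the sketch below assembles the address of a given results page from the three pieces defined above. The helper name `page_url` is hypothetical (it is not part of this notebook); the `_P<i>` page suffix is the one used in the scraping loop further down.

```r
# Pieces of the review URL, copied from the cell above
company <- "Xero-Reviews-E318448"
baseurl <- "https://www.glassdoor.com/Reviews/"
sort    <- ".htm?sort.sortType=RD&sort.ascending=true"

# Hypothetical helper: build the URL of results page i
page_url <- function(i) {
  paste0(baseurl, company, "_P", i, sort)
}

page_url(2)
```

Printing one or two of these URLs and opening them in a browser is a quick way to confirm the pattern before scraping every page.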

In [3]:
# Read the total number of reviews; it determines how many result pages to iterate over.
totalReviews <- read_html(paste(baseurl, company, sort, sep = "")) %>%
  html_nodes("h2.col-6") %>%
  html_text() %>%
  sub(".*?([0-9]+).*", "\\1", .) %>%  # remove text from string and retain the total review value
  as.integer()
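
The `sub()` pattern above keeps only the first run of digits, so a heading with a thousands separator (for a company with more reviews) should be cut off at the comma. Below is a minimal sketch of a more tolerant parser. The function name `parse_review_count` and the sample headings are made up for illustration and are not taken from Glassdoor's markup.

```r
# Hypothetical helper: pull the review count out of a heading string
parse_review_count <- function(heading) {
  # First run of digits, allowing commas inside it (e.g. "1,234")
  m <- regmatches(heading, regexpr("[0-9][0-9,]*", heading))
  # Drop the commas before converting to an integer
  as.integer(gsub(",", "", m))
}

small <- parse_review_count("48 Reviews")
large <- parse_review_count("Showing 1,234 Reviews")
```

Here `small` is `48L` and `large` is `1234L`. If the heading contains no digits at all, the function returns `integer(0)`, which is worth checking for before computing the page count.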

In [6]:
totalReviews

In [7]:
maxresults <- as.integer(ceiling(totalReviews/10))     #10 reviews per page, round up to whole number
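
As a quick check of the page arithmetic, here is the same calculation on a few made-up totals (these are illustrative values, not the real review count). Any remainder after dividing by 10 still needs its own page, which is why `ceiling()` is used rather than rounding.

```r
reviews_per_page <- 10               # assumption carried over from the cell above
example_totals   <- c(320, 321, 330) # illustrative totals only

pages <- as.integer(ceiling(example_totals / reviews_per_page))
pages
# 32 33 33
```

The progress output of the scraping loop below stops at `P 33`, which is consistent with a total somewhere between 321 and 330 reviews.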

In [9]:
maxresults

In [10]:
### Create df by scraping: Date, Summary, Rating, Title, Pros, Cons
### Glassdoor exposes more fields than the ones listed here; add selectors for any others you need
df <- map_df(seq_len(maxresults), function(i) {

  Sys.sleep(2)      #Time delay helps avoid errors when scraping through pages
  cat("P",i," ")    #Progress indicator showing which page is currently being scraped

  pg <- read_html(paste(baseurl, company, "_P", i, sort, sep=""))
  data.frame(rev.date = html_text(html_nodes(pg, ".date.subtle.small, .featuredFlag")),
             rev.sum = html_text(html_nodes(pg, ".reviewLink .summary:not([class*='hidden'])")),
             rev.rating = html_attr(html_nodes(pg, ".gdStars.gdRatings.sm .rating .value-title"), "title"),
             rev.title = html_text(html_nodes(pg, ".authorInfo")),
             rev.pros = html_text(html_nodes(pg, ".mt-md:nth-child(1) .strong+ p")),
             rev.cons = html_text(html_nodes(pg, ".mt-md:nth-child(2) .strong+ p")),
             stringsAsFactors=FALSE)
})

P 1  P 2  P 3  P 4  P 5  P 6  P 7  P 8  P 9  P 10  P 11  P 12  P 13  P 14  P 15  P 16  P 17  P 18  P 19  P 20  P 21  P 22  P 23  P 24  P 25  P 26  P 27  P 28  P 29  P 30  P 31  P 32  P 33  
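
The fixed two-second delay reduces request errors, but a single failed request would still abort the whole loop. One possible safeguard, sketched here under my own assumptions, is to retry a failed request a few times before giving up. `with_retry` and `flaky_fetch` are hypothetical names, and `flaky_fetch` only simulates a failing page request so the example runs without network access.

```r
# Hypothetical helper: call f() up to `tries` times, pausing between attempts
with_retry <- function(f, tries = 3, wait = 2) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < tries) Sys.sleep(wait)  # no pause after the last attempt
  }
  NULL  # every attempt failed
}

# Stand-in for a flaky page request: fails twice, then succeeds
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "page content"
}

page <- with_retry(flaky_fetch, tries = 3, wait = 0)
```

In the scraping loop this could wrap the request, for example `pg <- with_retry(function() read_html(url))`, with the page skipped when the result is `NULL`.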

In [11]:
head(df)

| rev.date | rev.sum | rev.rating | rev.title | rev.pros | rev.cons |
|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| May 3, 2013 | Amazingly refreshing culture. Smart, Real genuine people,loving every minute!!! | 5.0 | Current Employee - Anonymous Employee | Awesome culture, amazing people, great product - were just getting started!! | None , seriously don't change a thing |
| Jun 26, 2013 | Newly hired, couldn't be happier with the opportunity | 5.0 | Current Employee - Direct Sales in San Francisco, CA | Great Interview experience from start to finish | No cons so far, everything is great so far |
| Jul 12, 2013 | Great place to work, room for growth and AWESOME culture | 4.0 | Current Employee - Account Manager in San Francisco, CA | The culture is awesome and I love how we really protect our brand and care about how we are perceived to the public and ensure its a positive experience for clients. | As we grow its been difficult to maintain the close knit feel, this is part of the growth process, so if all employees keep that as a priority we will do great! |
| Jul 24, 2013 | Best company to work for in the Bay Area! | 5.0 | Current Employee - Anonymous Employee in San Francisco, CA | Inspiring, compassionate and open-minded leadership. Awesome product. Where else can you speak with awesome coworkers across 4 global timezones in one day? | Wish I had found Xero sooner. |
| Sep 10, 2013 | New Hire - Denver, CO | 5.0 | Current Employee - Customer Support Specialist in Denver, CO | I recently completed the interview process and am excited for my start date to come. I can honestly say that I've never been more excited to start a new job. It is clear that their philosophy is happy employees make great employees. I could see this in the people that I had interviewed with. They were interested in finding out who I am and what I have to offer, as opposed to how well I can answer obscure interview questions.... | None that I'm aware of. |
| Nov 28, 2013 | Beautiful Accounting Software, Beautiful Culture & A Beautiful Career! | 5.0 | Current Employee - Customer Experience Specialist in Milton Keynes, England | Amazing culture. Working with like-minded people. Fresh challenges every day. Ideas are always welcomed. Great career progression. Huge growth. Plenty of opportunities. Awesome product. Global team. "Choose a job you love, and you will never have to work a day in your life." - Confucius (Think he must of worked for Xero) | In a year and a half, I am yet to find one. |


`head()` shows the first few rows of `df`, which is a quick way to check that the scraped data is what we intended to collect.

In [12]:
nrow(df)

In [13]:
### Save df in CSV
write.csv(df, "../data/xero-glassdoor-output.csv")
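
By default `write.csv()` also writes the row names, which show up as an extra unnamed index column when the file is loaded in another tool. The sketch below shows the difference on a toy data frame (the values are made up and stand in for the scraped reviews); whether to drop the column is a matter of preference.

```r
# Toy data frame standing in for the scraped reviews
toy <- data.frame(rev.rating = c("5.0", "4.0"),
                  rev.pros   = c("Great culture", "Good product"),
                  stringsAsFactors = FALSE)

with_rownames    <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(toy, with_rownames)                        # default: row names become the first column
write.csv(toy, without_rownames, row.names = FALSE)  # only the data columns are written

names(read.csv(with_rownames))     # "X" "rev.rating" "rev.pros"
names(read.csv(without_rownames))  # "rev.rating" "rev.pros"
```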

Instead of cleaning and visualising the data in R, I used a BI tool that I learned recently.
PowerBI is easy to learn and highly interactive, and it lets you do most of this work with little or no code.

I loaded the CSV file into PowerBI.

<p>
    <img src="../images/Pros-Cons.jpg" height="455" width="934">
</p>

<p>
    <img src="../images/Line-Graph.jpg" height="300" width="615">
</p>

<p>
    <img src="../images/Bar-Graph.jpg" height="300" width="615">
</p>

PowerBI: [Xero Review Dashboard Link Here](https://app.powerbi.com/view?r=eyJrIjoiYjkyYmE4MGItYmQyZC00MGJmLWIzNmMtYTY1MjI5NTcyMmQ0IiwidCI6IjllMWI3ZTBlLTliZjUtNGJhOC1iOWFlLWNkODkzMDI4NmZjZSJ9)