# Scraping Workplace Reviews and Text Classification through an API

The other day I was talking to a contractor, who enthusiastically told me about applying sentiment analysis to the stock market. Inspired, I decided to do a little analysis of my own. Rather than stocks, though, I'll analyze reviews of the Texas Workforce Commission.

To get reviews, I'll scrape data from the [Consumer Affairs](https://www.consumeraffairs.com/) website.

Let's get started!

## Scraping Reviews

To scrape data from the website, I'll need the requests library to send HTTP requests, and the BeautifulSoup library to parse the content of the response.

In [1]:
import requests
from bs4 import BeautifulSoup

For each of the 7 pages of reviews, I'll request the URL, then parse the HTML content.

Below is the content of the first page.

In [2]:
url = 'https://www.consumeraffairs.com/employment/tx_work.html?page='
response = requests.get(url+"1")
content = response.content
parser = BeautifulSoup(content,'html.parser')
print(parser)

<!DOCTYPE doctype html>
<!--[if IE]><![endif]--><!--[if lt IE 7 ]><html lang="en" class="no-js ie6"><![endif]--><!--[if IE 7 ]><html lang="en" class="no-js ie7"><![endif]--><!--[if IE 8 ]><html lang="en" class="no-js ie8"><![endif]--><!--[if IE 9 ]><html lang="en" class="no-js ie9"><![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html class="no-js" lang="en"><!--<![endif]--><head><!-- Google Tag Manager --><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.defer=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-WSBZRR');</script><!-- End Google Tag Manager --><meta content="IE=edge" http-equiv="X-UA-Compatible"/><link href="//media.consumeraffairs.com/static/manifest.85a14cf9e066.json" rel="manifest"/><meta charset="utf-8"><script type="text/javascript">(window.NR

Clear, right? Just kidding. Stare at this for too long and you'll go blind.

To find the reviews, I opened the webpage in Google Chrome, right-clicked the page, then clicked "Inspect" to see the CSS class of the HTML tag I wanted.

The CSS class is "ca-txt-bd-2", which I'll pass to the select method of the parser as the selector ".ca-txt-bd-2". The select method then returns a list of all the HTML elements with that CSS class.

Below is the text from the first item in the list.

In [3]:
review_box = parser.select(".ca-txt-bd-2")
review_box[0].get_text()

'Page 1\n        \n        Reviews 1 - 30\n    '
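
So the first element is the pagination header, not a review, which is why it gets skipped later on. As a rough sketch, assuming every page's header follows the same layout as the sample above, a regular expression can tell a header apart from a review. The parse_header name is a hypothetical helper of mine, not something from the libraries used here.

```python
import re

# Sample copied from the output above; other pages are assumed to look the same.
header = 'Page 1\n        \n        Reviews 1 - 30\n    '

def parse_header(text):
    """Return (page, first_review, last_review), or None if text is not a header."""
    match = re.search(r"Page\s+(\d+)\s+Reviews\s+(\d+)\s*-\s*(\d+)", text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())

print(parse_header(header))  # (1, 1, 30)
```

The header also shows that each page holds up to 30 reviews.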

And here's the text of the second item in the list.

In [4]:
review_box[1].get_text()

"Original review: March 14, 2019I put in my unemployment as I was supposed to, did job searches, filled out their dumb** forms. Was denied for a month, then I put in the second round. It took I got my check 1 and a half weeks later. Then the week after that I get a notice saying I had to pay back everything plus 150 dollars due to overpayment because it was sent to Oklahoma, I had it sent to Oklahoma due to a family issue I had to tend to. I live in Texas, have a Texas license, was outta town a week and half, still did my online job searches. Now I have to pay back money plus extra just because, it's **. I don't even have a job to pay it back, they even said on the notice ignoring could result in charges and a jail sentence.\n    (adsbygoogle = window.adsbygoogle || []).push({\n        params: {\n            google_ad_channel: 9398056344,\n            google_ad_client: 'ca-pub-0200629403145096',\n            google_ad_type: 'text',\n            google_override_format: true,\n          

I'll need to remove the "(adsbygoogle" portion of the text, since it is not part of the review.

Let's make a function that does this.

In [5]:
def remove_ad(text):
    index = text.find("(adsbygoogle")
    if index != -1:
        return text[0:(index-1)]
    else:
        return text

Testing said function.

In [6]:
remove_ad(review_box[1].get_text())

"Original review: March 14, 2019I put in my unemployment as I was supposed to, did job searches, filled out their dumb** forms. Was denied for a month, then I put in the second round. It took I got my check 1 and a half weeks later. Then the week after that I get a notice saying I had to pay back everything plus 150 dollars due to overpayment because it was sent to Oklahoma, I had it sent to Oklahoma due to a family issue I had to tend to. I live in Texas, have a Texas license, was outta town a week and half, still did my online job searches. Now I have to pay back money plus extra just because, it's **. I don't even have a job to pay it back, they even said on the notice ignoring could result in charges and a jail sentence.\n   "

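One thing to note about remove_ad: the slice stops at index-1, so it also drops the character just before the marker (a space on these pages), and it would not cut anything useful if the marker happened to be the very first character. Here is a slightly stricter variant as a sketch. The name remove_ad_strict and the sample string are made up for illustration.

```python
def remove_ad_strict(text, marker="(adsbygoogle"):
    """Keep only the text before the ad marker, trimmed of surrounding whitespace."""
    # partition returns the whole string as the first item when the marker is absent.
    before_marker = text.partition(marker)[0]
    return before_marker.strip()

# Made-up sample that mimics the shape of the scraped text.
sample = "Great service.\n    (adsbygoogle = window.adsbygoogle || []).push({})"
print(repr(remove_ad_strict(sample)))  # 'Great service.'
```

Stripping the whitespace here also gets rid of the trailing newline and spaces seen in the output above.
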
OK, my strategy is to build a list of reviews. For each page, I'll parse the HTML content, select the elements with the review CSS class, skip the first element (the page header), then remove the ad from each review's text and append the text to the list.

In [7]:
reviews = []
for i in range(1,8):
    response = requests.get(url+str(i))
    content = response.content
    parser = BeautifulSoup(content,'html.parser')
    review_box = parser.select(".ca-txt-bd-2")
    for j in range(1,len(review_box)):
        text = remove_ad(review_box[j].get_text())
        reviews.append(text)
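
Each collected string still starts with "Original review:" and a date fused directly onto the first word of the review. If the date should be kept separate, something like the following sketch could work. It assumes the prefix always has the form "Original review: Month D, YYYY", which I have only checked against the samples shown in this post, and split_review is a hypothetical helper.

```python
import re

# Assumed prefix format: "Original review: March 14, 2019" followed directly by the body.
PREFIX = re.compile(r"^Original review:\s*([A-Z][a-z]+ \d{1,2}, \d{4})(.*)$", re.DOTALL)

def split_review(text):
    """Return (date_string, body); date_string is None if the prefix is missing."""
    cleaned = text.strip()
    match = PREFIX.match(cleaned)
    if match is None:
        return None, cleaned
    return match.group(1), match.group(2).strip()

date, body = split_review("Original review: March 14, 2019I put in my unemployment.\n   ")
print(date)  # March 14, 2019
print(body)  # I put in my unemployment.
```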

To check that the loop worked, I'll confirm that the list holds 183 reviews, which is the number of reviews displayed on the first page of the site.

In [8]:
len(reviews)

183

Yay! On to the API requests!

## Sending Requests to text-processing.com API

I found [this](https://text-processing.com/demo/sentiment/) website, which demos sentiment analysis using NLTK 2.0.4. The demo takes text as input and assigns it one of three classes: positive, negative, or neutral.

Upon further reading, I found that the website has a fairly simple API, with 1,000 free requests per day per IP. For each request sent, the API returns a JSON dictionary with the keys probability and label. The value for the probability key is another dictionary with the keys pos, neg, and neutral, whose values are the probabilities for each text class. The value for the label key is the predicted text class.

To demonstrate, I'll send a request for the first review and get the JSON dictionary.

In [9]:
data = { "text" : reviews[0] }
response = requests.post('http://text-processing.com/api/sentiment/', data=data)
response.json()

{'label': 'neutral',
 'probability': {'neg': 0.8254697386652051,
  'neutral': 0.905510239593793,
  'pos': 0.17453026133479488}}
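
The three numbers don't sum to 1, which looks odd at first. As I understand the API documentation, neg and pos add up to 1 between them, while neutral is a separate score: if neutral is above 0.5 the label is neutral, otherwise the label is whichever of pos and neg is larger. A small sketch of that rule, applied to the response above (label_from_probabilities is my own hypothetical helper, not part of the API):

```python
def label_from_probabilities(probability):
    """Apply the labeling rule as I read it in the API docs (an assumption)."""
    if probability["neutral"] > 0.5:
        return "neutral"
    return "pos" if probability["pos"] > probability["neg"] else "neg"

# Values copied from the response shown above.
probability = {
    "neg": 0.8254697386652051,
    "neutral": 0.905510239593793,
    "pos": 0.17453026133479488,
}
print(label_from_probabilities(probability))              # neutral
print(round(probability["neg"] + probability["pos"], 6))  # 1.0
```

So a review can be labeled neutral and still lean heavily negative, as this one does.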

Easy, right? Now, I just gotta do this 183 more times.

My plan of attack is to loop through the list of reviews, and store all the data in a dictionary of dictionaries.

Let's do that.

In [10]:
review_dict = {}
for i in range(len(reviews)):
    review_dict[i] = {}
    review_dict[i]['text'] = reviews[i]
    data = { 'text' : reviews[i] }
    response = requests.post('http://text-processing.com/api/sentiment/', data=data)
    json_dict = response.json()
    review_dict[i]['label'] = json_dict['label']
    review_dict[i]['neg_prob'] = json_dict['probability']['neg']
    review_dict[i]['neutral_prob'] = json_dict['probability']['neutral']
    review_dict[i]['pos_prob'] = json_dict['probability']['pos']

Here's what the resulting dictionary looks like.

In [11]:
print(review_dict)
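
Each entry in the dictionary has five keys. As a sketch of the shape of one entry, here is the same flattening step pulled into a small function and run on the single response shown earlier. The to_record name and the stand-in review text are mine, for illustration only.

```python
def to_record(text, json_dict):
    """Flatten one API response into the per-review dictionary described above."""
    probability = json_dict["probability"]
    return {
        "text": text,
        "label": json_dict["label"],
        "neg_prob": probability["neg"],
        "neutral_prob": probability["neutral"],
        "pos_prob": probability["pos"],
    }

# Response copied from the single request shown earlier; the review text is a stand-in.
sample_response = {
    "label": "neutral",
    "probability": {
        "neg": 0.8254697386652051,
        "neutral": 0.905510239593793,
        "pos": 0.17453026133479488,
    },
}
stand_in_reviews = ["stand-in review text"]
review_dict = {i: to_record(text, sample_response) for i, text in enumerate(stand_in_reviews)}
print(review_dict[0])
```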



## Identifying the Best Review

There are 183 reviews. Of the few that I looked at, all seemed pretty negative. Let's see which review was the most positive.

I'll loop through the keys of the dictionary and find the key with the highest positive probability.

In [12]:
max_prob = review_dict[0]['pos_prob']
best_index = 0  # named best_index so the built-in max() is not shadowed
for i in range(1,len(reviews)):
    prob = review_dict[i]['pos_prob']
    if prob > max_prob:
        max_prob = prob
        best_index = i

Drum roll please...

And the best review is...

In [13]:
review_dict[best_index]['text']

"Original review: May 1, 2015Says they're going to send you a check in two, and then sends a letter asking for more information. This cycle has happened twice. Children are starving, and people are suffering because TX can't get its act together. People are entitled to unemployment compensation after losing a job. That's what we get for living in Texas, I guess. Well, not anymore. TX want to play games with people's fundamental rights. We will be moving to a state that respects and treats their workforce with dignity. Thanks, TX... for teaching us how not to act."

...
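
As an aside, the search for the highest positive probability can also be written with Python's built-in max and a key function. A minimal sketch with made-up scores (the real dictionary has 183 entries):

```python
# Made-up scores for illustration only.
review_dict = {
    0: {"text": "first", "pos_prob": 0.17},
    1: {"text": "second", "pos_prob": 0.62},
    2: {"text": "third", "pos_prob": 0.40},
}

# max iterates over the dictionary's keys and compares them by pos_prob.
best_index = max(review_dict, key=lambda i: review_dict[i]["pos_prob"])
print(best_index, review_dict[best_index]["text"])  # 1 second
```

It relies on the built-in max, so it only works in a session where that name has not been reassigned. On ties it returns the first key with the top score, the same as a loop with a strict greater-than comparison.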

## Conclusion

The reviews of the Texas Workforce Commission are pretty negative.

I had some fun scraping reviews from a website and sending HTTP requests to an API.

Thanks for reading.