# NLP Project: Predicting Glassdoor Ratings

**Overview:**

In this project, company reviews (pros and cons) from Glassdoor are analyzed and modeled using natural language processing techniques. Reviews are scraped for Glassdoor's most popular companies, and the companies are grouped as 3-star or 4-star. After the data is acquired, the review text is cleaned and prepared for analysis. Statistical analysis is then used to identify words and phrases that distinguish 3-star from 4-star companies. Lastly, NLP techniques and classification models are used to predict whether a company is 3-star or 4-star based on its reviews.

**Goals:**
- identify words and phrases that suggest whether a company is 3-star or 4-star.
    - ideally, these words will represent positive and negative things about working for the company whether it be pay, work-life balance, or management.
- build a machine learning classification model that will accurately predict whether a company is 3-star or 4-star.

In [1]:
# import libraries
import pandas as pd
%config InlineBackend.figure_format = 'retina'

import wrangle as w

## Data Wrangling

Data Acquisition: Selenium is used to scrape and store data on the top 1,000 companies.
1. First, company names, ratings, CEO approval, recommendation percentages, and review URLs are scraped.
2. The 100 most recent company reviews (pros and cons) are then scraped using the review URLs.


In [13]:
pd.read_csv('data/glassdoor_reviews.csv').head(3)

| | url | pros | cons | name | rating | ceo_approval | recommended |
|---|---|---|---|---|---|---|---|
| 0 | https://www.glassdoor.com/Reviews/Amazon-Revie... | Gain useful experience and great benefits\nRea... | Not much room for advancement\nYou have to be ... | Amazon | 3.7 | 71.0 | 69.0 |
| 1 | https://www.glassdoor.com/Reviews/Deloitte-Rev... | Well, it's BigD isn't it? Everyone on the ball... | Not really a con, but a very large, structured... | Deloitte | 4.0 | 87.0 | 78.0 |
| 2 | https://www.glassdoor.com/Reviews/Walmart-Revi... | good employees good working emviromet\nAdvance... | log time stading on feet\nUnderstaffing issues... | Walmart | 3.3 | 59.0 | 55.0 |


Data Preparation:
- Review text is lowercased, Unicode-normalized, tokenized, and lemmatized.
- Common stopwords are removed, as are duplicate companies and companies with missing values.
- The target, `rating`, is binned into 3-star and 4-star companies.
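
The preparation steps above can be sketched roughly as follows. This is a minimal illustration and not the code in `wrangle.py`: the function names, the tiny stopword list, and the 4.0 cutoff for binning are all assumptions, and lemmatization is left out because it needs an external library such as NLTK.

```python
import re
import unicodedata

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "as", "at", "is", "the", "to", "you"}


def clean_text(text):
    """Lowercase, strip non-ASCII characters, and drop punctuation."""
    text = text.lower()
    # NFKD splits accented characters into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Keep only lowercase letters, digits, and whitespace.
    return re.sub(r"[^a-z0-9\s]", "", text)


def tokenize_and_filter(text):
    """Split on whitespace and remove stopwords (lemmatization omitted here)."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def bin_rating(rating):
    """Bin a numeric rating; the 4.0 cutoff is an assumed threshold."""
    return "Four" if rating >= 4.0 else "Three"


# Hypothetical review text used only to exercise the helpers.
review = "Company values you as an employee\nGreat café!"
tokens = tokenize_and_filter(clean_text(review))
print(tokens)
print(bin_rating(3.9), bin_rating(4.1))
```

Keeping cleaning, token filtering, and binning as separate small functions makes each step easy to check on a single review before applying it to the full DataFrame.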

In [3]:
original, uni_count_vect, bi_count_vect, tri_count_vect = w.wrangle_glassdoor()
train, val, test = original

In [12]:
train.head(3)

| url | pros | cons | name | rating | ceo_approval | friend_recommendation | pros_cleaned | pros_lemmatized | cons_cleaned | cons_lemmatized | binned_rating | binned_rating_int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://www.glassdoor.com/Reviews/Alight-Solutions-Reviews-E1729719.htm | Company values you as an employee\nAlight is b... | High call volume sometimes during busy season\... | Alight Solutions | 3.5 | 71.0 | 64.0 | company values you as an employee\nalight is b... | company value employee alight far great compan... | high call volume sometimes during busy season\... | high call volume sometimes busy season larger ... | Three | 3 |
| https://www.glassdoor.com/Reviews/eBay-Reviews-E7853.htm | I never expected working at a large company to... | I haven't found any cons yet!\nAlthough the in... | eBay | 4.1 | 84.0 | 80.0 | i never expected working at a large company to... | never expected working large company like ever... | i haven ' t found any cons yet\nalthough the i... | ' found con yet although initiative really foc... | Four | 4 |
| https://www.glassdoor.com/Reviews/Trane-Technologies-Reviews-E349.htm | - Company is socially-minded and progressive, ... | - A few people are the over-promise, under-del... | Trane Technologies | 3.9 | 81.0 | 75.0 | company is sociallyminded and progressive whic... | company sociallyminded progressive make feel g... | a few people are the overpromise underdeliver ... | people overpromise underdeliver type thats unc... | Three | 3 |


## Exploratory Analysis

Only the training set is explored, so that the validation and test sets do not bias modeling decisions.

***Do 4-star companies have a significantly different CEO approval rating than 3-star companies?***

***Are 4-star companies more likely to be recommended than 3-star companies?***
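
One way to approach both questions is a permutation test on the difference in group means. The sketch below is a standard-library stand-in, not the test used in this project (a t-test from `scipy.stats` would be a common alternative). The function name `permutation_test` is hypothetical, and the numbers are made-up toy values, not project data.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, one_sided=False, seed=0):
    """Estimate a p-value for the difference in means (group_a minus group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values, then re-split.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if one_sided:
            # One-sided: is group_a's mean greater than group_b's?
            if diff >= observed:
                extreme += 1
        elif abs(diff) >= abs(observed):
            # Two-sided: is there any difference in means?
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up toy values standing in for a percentage column, split by binned rating.
four_star = [84.0, 87.0, 90.0, 88.0, 91.0]
three_star = [71.0, 59.0, 81.0, 65.0, 70.0]

diff_two_sided, p_two_sided = permutation_test(four_star, three_star)
diff_one_sided, p_one_sided = permutation_test(four_star, three_star, one_sided=True)
print(f"difference in means: {diff_two_sided:.1f}")
print(f"two-sided p-value: {p_two_sided:.4f}, one-sided p-value: {p_one_sided:.4f}")
```

The two-sided form matches the "significantly different" question about CEO approval, while the one-sided form matches the "more likely to be recommended" question. In the notebook, the two lists would presumably come from `train` filtered on `binned_rating`.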

## Modeling

## Conclusion

**Insights**

**Modeling**

**Recommendations**

**Next Steps**
