# **Update Project Notebook**

This notebook is meant to be run every month to update our datasets with the latest published reviews and to collect reviews for any new companies we want to study.

## **0) Imports**

In [1]:
!git clone https://github.com/pentagramswheel/DataX15.git

import sys
root_path = 'DataX15/Final Project/'
sys.path.append(root_path + 'run-to-update')
sys.path.append(root_path + 'topic-detection')

!pip install transformers
import pandas as pd
import numpy as np
import json
import notebook_script
import social_classification

Cloning into 'DataX15'...
remote: Enumerating objects: 217, done.
remote: Counting objects: 100% (217/217), done.
remote: Compressing objects: 100% (171/171), done.
remote: Total 217 (delta 104), reused 74 (delta 35), pack-reused 0
Receiving objects: 100% (217/217), 6.11 MiB | 3.70 MiB/s, done.
Resolving deltas: 100% (104/104), done.
Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.0-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

To run the scraper, you need to log in to your Glassdoor account. Either upload a secret.json file to the run-to-update subfolder, with the content {"username": "your_username", "password": "your_password"},

or use the create_credentials() function.

In [None]:
your_username = 'put_your_glassdoor_username'
your_password = 'put_your_glassdoor_password' # cybersecurity <3
notebook_script.create_credentials(your_username, your_password, root_path)
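
For reference, the secret.json file described above can also be written by hand. The snippet below is a minimal sketch of that, not the actual body of create_credentials(): the helper name `write_secret` is hypothetical, and the demo writes to a temporary folder rather than to the run-to-update subfolder.

```python
import json
import os
import tempfile

def write_secret(folder, username, password):
    """Write the credentials to <folder>/secret.json and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "secret.json")
    with open(path, "w") as f:
        json.dump({"username": username, "password": password}, f)
    return path

# Demo in a temporary folder; in this notebook the target folder would be
# root_path + 'run-to-update'.
demo_path = write_secret(tempfile.mkdtemp(),
                         "put_your_glassdoor_username",
                         "put_your_glassdoor_password")

# Read the file back to check that it has the expected two keys.
with open(demo_path) as f:
    demo_secret = json.load(f)
print(sorted(demo_secret))
```

Whichever option you use, keep secret.json out of any commit you push back to the repository.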

## **1) Open data**

In [2]:
companies = pd.read_csv(root_path + 'datasets/studied_companies.csv', sep = ';')
all_reviews = pd.read_csv(root_path + 'datasets/all_reviews.csv', sep = ';')

display(companies.head())
display(all_reviews.head())

Unnamed: 0,Company,URL,Latest review
0,Aramco,https://www.glassdoor.com/Reviews/Saudi-Aramco...,2021-10-22
1,BP,https://www.glassdoor.com/Reviews/BP-Reviews-E...,2021-10-16
2,Chevron,https://www.glassdoor.com/Reviews/Chevron-Revi...,2021-10-15
3,ConocoPhillips,https://www.glassdoor.com/Reviews/ConocoPhilli...,2021-10-26
4,DTE,https://www.glassdoor.com/Reviews/DTE-Energy-R...,2021-10-20


Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons
0,ExxonMobil,2021-05-18,IT Analyst,"Current Employee, more than 1 year",Great Company Overall,Great work environment Great benefits Pretty g...,I have not experienced anything negative so fa...,0.9509,-0.8689
1,ExxonMobil,2021-09-04,R&D Manager,Former Employee,working on energy R&D,"Outstanding colleagues, working on high impact...",Difficult industry business environment curren...,0.3182,-0.6486
2,ExxonMobil,2021-10-16,Chemical Technician,"Current Employee, more than 3 years",Flexibility,The flexibility and the nature of working ther...,No downside. PERIOD. Such a great place to joi...,0.8553,-0.9305
3,ExxonMobil,2021-10-15,Anonymous,"Current Employee, more than 10 years",I can only be thankful,I am achieving my dreams in partnership with t...,"It is hard times right now. But for me, it's w...",0.7715,-0.9072
4,ExxonMobil,2021-10-13,Engineer,Former Employee,Decent company to work for,"Competitive pay, structured benefits, and job ...",Even if you worked your tail off the whole yea...,0.7003,-0.7184


## **2) Update existing reviews datasets**

In [3]:
notebook_script.update_reviews_studied_companies()
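
The internals of update_reviews_studied_companies() are not shown in this notebook. As a rough sketch of what an incremental update could look like, the snippet below keeps only the scraped reviews that are newer than each company's `Latest review` date. The helper `keep_new_reviews` and the toy data are hypothetical; only the column names come from the tables above.

```python
import pandas as pd

def keep_new_reviews(scraped, companies):
    """Keep scraped reviews strictly newer than each company's 'Latest review' date."""
    latest = companies.set_index("Company")["Latest review"]
    cutoff = scraped["Company"].map(latest)  # per-row cutoff date
    return scraped[scraped["date"] > cutoff].reset_index(drop=True)

# Toy data reusing the column names of studied_companies.csv and all_reviews.csv
companies_demo = pd.DataFrame({
    "Company": ["BP", "DTE"],
    "Latest review": pd.to_datetime(["2021-10-16", "2021-10-20"]),
})
scraped_demo = pd.DataFrame({
    "Company": ["BP", "BP", "DTE"],
    "date": pd.to_datetime(["2021-10-10", "2021-11-02", "2021-10-20"]),
    "pros": ["a", "b", "c"],
})

new_reviews = keep_new_reviews(scraped_demo, companies_demo)
print(new_reviews)
```

Only the BP review dated 2021-11-02 survives in this toy example, since the other two are not newer than the stored cutoff dates.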

## **3) Get new companies' reviews**

This section collects reviews for new companies that are not yet in our datasets.

In [4]:
min_date = '2021-12-01' # Select the date from which you want reviews for the new companies
notebook_script.get_reviews_new_companies(min_date)

'Done'

## **4) Update all_reviews.csv file (gather reviews, classification and sentiment)**

In [5]:
all_reviews = notebook_script.assemble_all_reviews(root_path) # assemble all the reviews, clean them and save them in all_reviews.csv
display(all_reviews.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons
0,Shell,2021-10-22,Business Development Manager,Current Employee,Great company to work for,"Work life balance, total compensation.",Process stifles progress. Uncertain future str...
1,Shell,2021-10-19,Business Analyst,Current Employee,Great company,A lot of room for growth.,Don't have any cons at the moment.. Be the fir...
2,Shell,2021-10-17,Cashier,Former Employee,i loved it here,you are never alone here.,some customers are very rude' but thats anywhe...
3,Shell,2021-10-14,Rotational Analyst,Current Employee,Great Compensation,Compensation. Benefits. Work life balance. 4/8...,Bureaucracy. Slow to change. Hard to find info...
4,Shell,2021-10-12,Account Manager,Former Employee,organizer,"was fun, organized, clean, nice.","was tiring, alot to do, and was time consuming..."


In [6]:
all_reviews = notebook_script.predict_sentiment(all_reviews) # predicts 'pros' and 'cons' sentiment and saves the updated all_reviews.csv
display(all_reviews.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons
0,Shell,2021-10-22,Business Development Manager,Current Employee,Great company to work for,"Work life balance, total compensation.",Process stifles progress. Uncertain future str...,49.593538,-8.705231
1,Shell,2021-10-19,Business Analyst,Current Employee,Great company,A lot of room for growth.,Don't have any cons at the moment.. Be the fir...,69.113963,-10.462674
2,Shell,2021-10-17,Cashier,Former Employee,i loved it here,you are never alone here.,some customers are very rude' but thats anywhe...,59.190143,-7.178357
3,Shell,2021-10-14,Rotational Analyst,Current Employee,Great Compensation,Compensation. Benefits. Work life balance. 4/8...,Bureaucracy. Slow to change. Hard to find info...,69.113963,-11.881949
4,Shell,2021-10-12,Account Manager,Former Employee,organizer,"was fun, organized, clean, nice.","was tiring, alot to do, and was time consuming...",92.111049,-10.462674


#### **1. BERT predictions**

In [7]:
# CAREFUL! This cell takes a very long time to run, even for a single call to social_classification.predict_class() in the loop
# To see it run, first try it with overwrite = False, a single nature_review element, a single Social_criteria element, and all_reviews.head().copy()


run_this_cell = False # set to True only if you really want to run this code
overwrite = False # if True, overwrites the file all_reviews_S.csv containing all the social predictions

if run_this_cell:
  nature_review = ['pros', 'cons']

  Social_criteria = ['Insurance', 'Safety', 'Balance',
        'Retirement', 'Racism', 'Sexism', 'Ageism',
        'Benefits', 'Resources', 'Opportunities', 'Privacy',
        'Culture']

  all_reviews_S = all_reviews.copy() #.head() for test

  for nature in nature_review:
    for criteria in Social_criteria:
      all_reviews_S = social_classification.predict_class(all_reviews_S, criteria, nature, root_path)

  if overwrite:
    all_reviews_S.to_csv(root_path + 'datasets/all_reviews_S.csv', index = False, sep = ';')
  display(all_reviews_S)


#### **2. Keywords predictions**

In [15]:
with open(root_path + 'topic-detection/social-models/social_keywords.json') as data_file:
  social_keywords = json.load(data_file)

print(social_keywords)

all_reviews_S = notebook_script.keywords_topic_detection(all_reviews, social_keywords)
display(all_reviews_S.head())

{'insurance': ['insur', 'health', 'coverage', 'sick', 'medical'], 'safety': ['safe', 'drug', 'alcohol', 'violence', 'violent', 'hazard', 'working conditions'], 'balance': ['work life balance', 'worklife balance', 'work-life balance', 'work/life balance', 'work and life balance', 'burnout', 'burn out', 'stress', 'time management'], 'retirement': ['retire', '10-99R', 'saving', 'long-term', 'long term'], 'culture': ['culture', 'people', 'colleague', 'value', 'trust', 'atmosphere', 'collaborat'], 'racism': ['racis', 'my race', 'his race', 'her race', 'their race', 'prejudice', 'racial', 'black', 'white', 'indian', 'Indian', 'asian', 'Asian', 'minorit'], 'sexism': ['gender', 'male', 'female'], 'ageism': ['age', 'retire'], 'benefits': ['benefit', 'cash', 'pay', 'compensat', 'salar', 'time off', 'day off', 'days off', 'bonus'], 'opportunities': ['opportunit', 'project', 'collaborat', 'grow', 'skill', 'advanc', 'dream'], 'privacy': ['priva', 'personal', 'bag', 'clothe'], 'resources': ['resourc

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons,insurance_pros,insurance_cons,safety_pros,safety_cons,balance_pros,balance_cons,retirement_pros,retirement_cons,culture_pros,culture_cons,racism_pros,racism_cons,sexism_pros,sexism_cons,ageism_pros,ageism_cons,benefits_pros,benefits_cons,opportunities_pros,opportunities_cons,privacy_pros,privacy_cons,resources_pros,resources_cons
0,Shell,2021-10-22,Business Development Manager,Current Employee,Great company to work for,"Work life balance, total compensation.",Process stifles progress. Uncertain future str...,49.593538,-8.705231,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Shell,2021-10-19,Business Analyst,Current Employee,Great company,A lot of room for growth.,Don't have any cons at the moment.. Be the fir...,69.113963,-10.462674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,Shell,2021-10-17,Cashier,Former Employee,i loved it here,you are never alone here.,some customers are very rude' but thats anywhe...,59.190143,-7.178357,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Shell,2021-10-14,Rotational Analyst,Current Employee,Great Compensation,Compensation. Benefits. Work life balance. 4/8...,Bureaucracy. Slow to change. Hard to find info...,69.113963,-11.881949,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
4,Shell,2021-10-12,Account Manager,Former Employee,organizer,"was fun, organized, clean, nice.","was tiring, alot to do, and was time consuming...",92.111049,-10.462674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
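
The 0/1 columns above are consistent with a simple substring match of each keyword stem against the review text. The snippet below is a minimal sketch of that idea, not the actual code of keywords_topic_detection(): the helper `flag_topics` is hypothetical, case-insensitive matching is an assumption, and the keyword dictionary is a small subset of the one printed above.

```python
import pandas as pd

def flag_topics(reviews, keywords, text_columns=("pros", "cons")):
    """Add one 0/1 column per (topic, text column), e.g. 'balance_pros'."""
    flagged = reviews.copy()
    for topic, stems in keywords.items():
        for col in text_columns:
            text = flagged[col].fillna("").str.lower()
            # A review is flagged if any keyword stem occurs as a substring
            hits = text.apply(lambda t: any(stem.lower() in t for stem in stems))
            flagged[f"{topic}_{col}"] = hits.astype(int)
    return flagged

# Subset of the keywords printed above
keywords_demo = {
    "balance": ["work life balance", "burnout", "stress"],
    "benefits": ["benefit", "pay", "compensat"],
    "opportunities": ["opportunit", "grow"],
}
# 'pros' texts taken from the first two rows above; 'cons' texts are toy values
reviews_demo = pd.DataFrame({
    "pros": ["Work life balance, total compensation.", "A lot of room for growth."],
    "cons": ["Long hours.", "None."],
})

flagged_demo = flag_topics(reviews_demo, keywords_demo)
print(flagged_demo.drop(columns=["pros", "cons"]))
```

Substring matching is cheap but coarse: in a sketch like this, a short stem such as 'age' would also match 'manager', so short keywords deserve a manual check.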


## **5) Update Tableau input**

In [16]:
all_reviews_S = pd.read_csv(root_path + 'datasets/all_reviews_S.csv', sep = ';')
display(all_reviews_S.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons,insurance_pros,insurance_cons,safety_pros,safety_cons,balance_pros,balance_cons,retirement_pros,retirement_cons,culture_pros,culture_cons,racism_pros,racism_cons,sexism_pros,sexism_cons,ageism_pros,ageism_cons,benefits_pros,benefits_cons,opportunities_pros,opportunities_cons,privacy_pros,privacy_cons,resources_pros,resources_cons
0,Shell,2021-10-22,Business Development Manager,Current Employee,Great company to work for,"Work life balance, total compensation.",Process stifles progress. Uncertain future str...,49.593538,-8.705231,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Shell,2021-10-19,Business Analyst,Current Employee,Great company,A lot of room for growth.,Don't have any cons at the moment.. Be the fir...,69.113963,-10.462674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,Shell,2021-10-17,Cashier,Former Employee,i loved it here,you are never alone here.,some customers are very rude' but thats anywhe...,59.190143,-7.178357,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Shell,2021-10-14,Rotational Analyst,Current Employee,Great Compensation,Compensation. Benefits. Work life balance. 4/8...,Bureaucracy. Slow to change. Hard to find info...,69.113963,-11.881949,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
4,Shell,2021-10-12,Account Manager,Former Employee,organizer,"was fun, organized, clean, nice.","was tiring, alot to do, and was time consuming...",92.111049,-10.462674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### **1. Table, Bar and 2x2 Tableau dashboards input**



In [17]:
Social_criteria = ['Insurance', 'Safety', 'Balance',
       'Retirement', 'Racism', 'Sexism', 'Ageism',
       'Benefits', 'Resources', 'Opportunities', 'Privacy',
       'Culture']

aggregated_social_scores = notebook_script.aggregate_company_results(all_reviews_S, Social_criteria)
display(aggregated_social_scores.head())

Tableau_input_general, Tableau_input_pros, Tableau_input_cons = notebook_script.generate_tableau_inputs(aggregated_social_scores, Social_criteria, root_path)
display(Tableau_input_general.head())

Unnamed: 0,Insurance_mean,Insurance_count,Safety_mean,Safety_count,Balance_mean,Balance_count,Retirement_mean,Retirement_count,Racism_mean,Racism_count,Sexism_mean,Sexism_count,Ageism_mean,Ageism_count,Benefits_mean,Benefits_count,Resources_mean,Resources_count,Opportunities_mean,Opportunities_count,Privacy_mean,Privacy_count,Culture_mean,Culture_count,Total_mean,Total_count,Insurance_pros_mean,Insurance_pros_count,Safety_pros_mean,Safety_pros_count,Balance_pros_mean,Balance_pros_count,Retirement_pros_mean,Retirement_pros_count,Racism_pros_mean,Racism_pros_count,Sexism_pros_mean,Sexism_pros_count,Ageism_pros_mean,Ageism_pros_count,Benefits_pros_mean,Benefits_pros_count,Resources_pros_mean,Resources_pros_count,Opportunities_pros_mean,Opportunities_pros_count,Privacy_pros_mean,Privacy_pros_count,Culture_pros_mean,Culture_pros_count,Total_pros_mean,Total_pros_count,Insurance_cons_mean,Insurance_cons_count,Safety_cons_mean,Safety_cons_count,Balance_cons_mean,Balance_cons_count,Retirement_cons_mean,Retirement_cons_count,Racism_cons_mean,Racism_cons_count,Sexism_cons_mean,Sexism_cons_count,Ageism_cons_mean,Ageism_cons_count,Benefits_cons_mean,Benefits_cons_count,Resources_cons_mean,Resources_cons_count,Opportunities_cons_mean,Opportunities_cons_count,Privacy_cons_mean,Privacy_cons_count,Culture_cons_mean,Culture_cons_count,Total_cons_mean,Total_cons_count
Shell,31,21,100,30,87,89,7,10,0,9,0,2,91,271,100,328,100,227,100,142,88,14,100,222,15,1220,75,12,100,18,75,63,47,4,100,2,0,0,100,52,100,253,100,186,100,87,99,3,100,155,8,610,-80,9,-18,12,-53,26,-72,6,-100,7,-100,2,-16,219,-31,75,-25,41,0,55,-10,11,-23,67,-6,610
PG&E,68,18,62,25,44,46,67,12,56,9,36,3,69,133,31,238,25,201,22,80,35,4,36,135,55,754,46,16,72,15,22,29,18,10,60,1,0,0,21,40,19,202,13,172,32,55,43,1,35,90,52,377,-17,2,-26,10,-21,17,-27,2,-23,8,-40,3,-22,93,-8,36,-22,29,-40,25,-60,3,-21,45,-7,377
DTE,18,5,1,17,74,30,27,6,38,2,100,2,88,87,13,83,17,68,48,49,71,2,44,80,72,378,0,4,0,12,61,17,6,2,0,0,100,1,20,19,0,67,1,57,54,32,0,0,45,41,63,189,-11,1,0,5,-14,13,-23,4,-36,2,0,1,0,68,0,16,0,11,-23,17,0,2,-2,39,0,189
Phillips 66,23,15,57,15,24,36,85,13,78,3,27,4,85,123,32,196,14,130,31,71,95,4,31,114,32,498,64,7,56,9,11,12,61,7,30,1,0,0,55,29,26,150,13,104,42,42,59,2,38,66,25,249,-52,8,-7,6,-19,24,-26,6,-10,2,-49,4,-14,94,-17,46,-64,26,-24,29,-15,2,-23,48,-5,249
Schlumberger,64,24,37,32,0,151,0,14,56,6,44,5,35,259,26,460,23,460,9,284,15,23,16,276,0,1978,70,19,66,23,76,32,68,6,0,0,0,2,47,83,29,371,22,400,35,211,36,7,43,196,86,989,-94,5,-100,9,-54,119,-100,8,-14,6,-44,3,-72,176,-67,89,-100,60,-100,73,-80,16,-100,80,-100,989


Unnamed: 0,Company,Criteria 1,Score 1,Count 1,Criteria 2,Score 2,Count 2,Rank
0,Shell,Insurance,31,21,Insurance,31,21,1
1,Shell,Insurance,31,21,Safety,100,30,1
2,Shell,Insurance,31,21,Balance,87,89,1
3,Shell,Insurance,31,21,Retirement,7,10,1
4,Shell,Insurance,31,21,Racism,0,9,1
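
The aggregated table holds a `_mean` and a `_count` column per criterion and per company. How aggregate_company_results() computes and scales these values is not shown here, so the snippet below is only a sketch under a simple assumption: the count is the number of reviews flagged for a criterion, and the mean is the average sentiment score of those reviews. The helper `aggregate_pros` and the toy data are hypothetical, and only the 'pros' side is covered.

```python
import pandas as pd

def aggregate_pros(reviews, criteria):
    """Per company: count of flagged 'pros' reviews and their mean score_pros."""
    rows = {}
    for company, group in reviews.groupby("Company"):
        row = {}
        for criterion in criteria:
            flagged = group[group[f"{criterion.lower()}_pros"] == 1]
            row[f"{criterion}_pros_count"] = len(flagged)
            # Assumption: a criterion with no flagged review gets a mean of 0
            row[f"{criterion}_pros_mean"] = (
                flagged["score_pros"].mean() if len(flagged) else 0.0
            )
        rows[company] = row
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data with the same column naming as all_reviews_S
reviews_demo = pd.DataFrame({
    "Company": ["Shell", "Shell", "Shell", "DTE"],
    "score_pros": [50.0, 70.0, 60.0, 80.0],
    "balance_pros": [1, 1, 0, 0],
    "benefits_pros": [1, 0, 0, 1],
})

agg_demo = aggregate_pros(reviews_demo, ["Balance", "Benefits"])
print(agg_demo)
```

The real output above uses integer scores between -100 and 100, so the actual function likely applies some rescaling that this sketch does not attempt to reproduce.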


#### **2. Timeline Tableau dashboards input**

In [19]:
Tableau_timeline_S = notebook_script.generate_tableau_timeline(all_reviews_S, root_path)
display(Tableau_timeline_S.head())

Unnamed: 0,Company,Date,Score,Review,Criteria
42,Shell,2021-09-22,78.833274,"401K match, Time Off, Health Benefits, Fitness...",Insurance
117,Shell,2021-07-17,97.300475,Pension. 10% 401k after 9 years. Medical for f...,Insurance
127,Shell,2021-07-08,72.110026,"competitive pay, pension, 401k contribution, d...",Insurance
159,Shell,2021-05-27,99.723912,Best in Industry salary and benefits including...,Insurance
190,Shell,2021-06-08,78.833274,Excellent coverage options for a family.,Insurance
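
The timeline table is in long format: one row per (review, criterion) pair, with the original row index kept. The snippet below sketches one way to build such a table from the 0/1 topic columns. It is an illustration under assumptions, not the code of generate_tableau_timeline(): the helper `to_timeline` and the toy data are hypothetical, and only the 'pros' text is used.

```python
import pandas as pd

TIMELINE_COLUMNS = ["Company", "Date", "Score", "Review", "Criteria"]

def to_timeline(reviews, criteria, nature="pros"):
    """Stack the reviews flagged for each criterion into one long table."""
    frames = []
    for criterion in criteria:
        flagged = reviews[reviews[f"{criterion.lower()}_{nature}"] == 1]
        part = flagged[["Company", "date", f"score_{nature}", nature]].copy()
        part.columns = TIMELINE_COLUMNS[:4]
        part["Criteria"] = criterion
        if not part.empty:
            frames.append(part)
    if not frames:
        return pd.DataFrame(columns=TIMELINE_COLUMNS)
    return pd.concat(frames)  # original row index is preserved

# Toy data with the same column naming as all_reviews_S
reviews_demo = pd.DataFrame({
    "Company": ["Shell", "Shell", "DTE"],
    "date": ["2021-09-22", "2021-07-17", "2021-06-08"],
    "pros": ["Good health plan.", "Nice bonus.", "Medical coverage and pay."],
    "score_pros": [78.8, 97.3, 72.1],
    "insurance_pros": [1, 0, 1],
    "benefits_pros": [0, 1, 1],
})

timeline_demo = to_timeline(reviews_demo, ["Insurance", "Benefits"])
print(timeline_demo)
```

A review flagged for several criteria appears once per criterion, which is why the same index can occur more than once in the result.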


## **6) Push the work to git repo**

In [None]:
# Commit the updated files inside the cloned DataX15 folder first, then push the master branch
!cd DataX15 && git push https://github.com/pentagramswheel/DataX15.git master