# **Update Project Notebook**

Run this notebook every month to update our datasets with the latest published reviews and to collect reviews for any new companies we want to study.

## **0) Imports**

In [1]:
#!git clone https://github.com/pentagramswheel/DataX15.git

import sys
root_path = 'DataX15/Final Project/'
sys.path.append(root_path + 'run-to-update')
sys.path.append(root_path + 'topic-detection')

!pip install transformers
import pandas as pd
import numpy as np
import notebook_script
import social_classification

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To run the scraper, you need to log in to your Glassdoor account. Either upload a secret.json file to the run-to-update subfolder with the following content: {"username": "your_username", "password": "your_password"}

or use the create_credentials() function.

In [None]:
your_username = 'put_your_glassdoor_username'
your_password = 'put_your_glassdoor_password' # never commit real credentials to the repo
notebook_script.create_credentials(your_username, your_password, root_path)
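
As a minimal sketch of the manual alternative, the snippet below writes a `secret.json` with the two keys described above. The helper name `write_secret` and the temporary demo folder are assumptions for illustration; this is not the code of `create_credentials()`.

```python
import json
import tempfile
from pathlib import Path

def write_secret(username, password, folder):
    """Write Glassdoor credentials to <folder>/secret.json and return the path."""
    path = Path(folder) / "secret.json"
    path.write_text(json.dumps({"username": username, "password": password}))
    return path

# Demo in a throwaway folder (assumption); in the project the target
# folder would be root_path + 'run-to-update'.
demo_folder = tempfile.mkdtemp()
secret_path = write_secret("my_username", "my_password", demo_folder)

# Read the file back to confirm it is valid JSON with both keys.
loaded = json.loads(secret_path.read_text())
```

Whichever way the file is created, it should stay out of version control.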

## **1) Open data**

In [None]:
companies = pd.read_csv(root_path + 'datasets/studied_companies.csv', sep = ';')
all_reviews = pd.read_csv(root_path + 'datasets/all_reviews.csv', sep = ';')

display(companies.head())
display(all_reviews.head())

Unnamed: 0,Company,URL,Latest review
0,Aramco,https://www.glassdoor.com/Reviews/Saudi-Aramco...,2021-10-22
1,BP,https://www.glassdoor.com/Reviews/BP-Reviews-E...,2021-10-16
2,Chevron,https://www.glassdoor.com/Reviews/Chevron-Revi...,2021-10-15
3,ConocoPhillips,https://www.glassdoor.com/Reviews/ConocoPhilli...,2021-10-26
4,DTE,https://www.glassdoor.com/Reviews/DTE-Energy-R...,2021-10-20


Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons
0,ExxonMobil,2021-05-18,IT Analyst,"Current Employee, more than 1 year",Great Company Overall,Great work environment Great benefits Pretty g...,I have not experienced anything negative so fa...,0.9509,-0.8689
1,ExxonMobil,2021-09-04,R&D Manager,Former Employee,working on energy R&D,"Outstanding colleagues, working on high impact...",Difficult industry business environment curren...,0.3182,-0.6486
2,ExxonMobil,2021-10-16,Chemical Technician,"Current Employee, more than 3 years",Flexibility,The flexibility and the nature of working ther...,No downside. PERIOD. Such a great place to joi...,0.8553,-0.9305
3,ExxonMobil,2021-10-15,Anonymous,"Current Employee, more than 10 years",I can only be thankful,I am achieving my dreams in partnership with t...,"It is hard times right now. But for me, it's w...",0.7715,-0.9072
4,ExxonMobil,2021-10-13,Engineer,Former Employee,Decent company to work for,"Competitive pay, structured benefits, and job ...",Even if you worked your tail off the whole yea...,0.7003,-0.7184
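
The `Unnamed: 0` column in both previews is the DataFrame index that was written to the CSV when the file was saved. A small sketch of reading such a file so that the index and the dates are restored; the inline sample data and the helper name `load_companies` are assumptions for illustration only.

```python
import io
import pandas as pd

# Tiny stand-in for studied_companies.csv (hypothetical URLs).
sample_csv = (
    ";Company;URL;Latest review\n"
    "0;Aramco;https://example.com/aramco;2021-10-22\n"
    "1;BP;https://example.com/bp;2021-10-16\n"
)

def load_companies(source):
    """Read a ';'-separated companies file, using the first column as the index."""
    return pd.read_csv(source, sep=';', index_col=0, parse_dates=['Latest review'])

companies_demo = load_companies(io.StringIO(sample_csv))
```

With `index_col=0` the saved index no longer shows up as an extra column, and `parse_dates` lets the `Latest review` values be compared as dates rather than strings.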


## **2) Update existing reviews datasets**

In [None]:
notebook_script.update_reviews_studied_companies()
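
The internals of `update_reviews_studied_companies()` are not shown in this notebook. As a rough sketch of the idea of an incremental update, and assuming it relies on the `Latest review` column of `studied_companies.csv`, the hypothetical helper below keeps only reviews dated strictly after each company's last recorded review.

```python
import pandas as pd

def keep_new_reviews(scraped, companies):
    """Return rows of `scraped` dated strictly after each company's 'Latest review'."""
    latest = companies.set_index('Company')['Latest review']
    # Look up the cutoff date of each review's company (NaN if the company is unknown).
    cutoff = pd.to_datetime(scraped['Company'].map(latest))
    return scraped[pd.to_datetime(scraped['date']) > cutoff]

# Illustrative data only.
companies_demo = pd.DataFrame({
    'Company': ['BP', 'DTE'],
    'Latest review': ['2021-10-16', '2021-10-20'],
})
scraped_demo = pd.DataFrame({
    'Company': ['BP', 'BP', 'DTE'],
    'date': ['2021-10-15', '2021-10-18', '2021-10-20'],
})
new_reviews = keep_new_reviews(scraped_demo, companies_demo)
```

Reviews for companies missing from `companies` are dropped by this sketch, because a comparison against a missing cutoff is false.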

## **3) Get new companies' reviews**

This section fetches reviews for new companies that are not yet in our datasets.

In [None]:
min_date = '2021-10-22' # Select the date from which you want reviews for the new companies
notebook_script.get_reviews_new_companies(min_date)

'Done'
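
Since a scraping run is slow, it can help to check `min_date` before launching it. A minimal sketch, assuming the function expects the year-month-day format used above; `parse_min_date` is a hypothetical helper and not part of `notebook_script`.

```python
from datetime import datetime

def parse_min_date(min_date):
    """Return min_date as a date, raising ValueError if it cannot be parsed as year-month-day."""
    return datetime.strptime(min_date, "%Y-%m-%d").date()

parsed = parse_min_date("2021-10-22")

# A date written in another order is rejected.
try:
    parse_min_date("22/10/2021")
    rejected = False
except ValueError:
    rejected = True
```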

## **4) Update all_reviews.csv file (gather reviews, classification and sentiment)**

In [5]:
all_reviews = notebook_script.assemble_all_reviews(root_path) # assemble all the reviews, clean them and save them in all_reviews.csv
display(all_reviews.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons
0,Marathon,2021-10-21,Safety Professional,Current Employee,Great place to work,They take care of their employees especially d...,Becoming more focused on cost and budget. Be t...
1,Marathon,2021-10-20,Senior Project Engineer,Former Employee,Good company with bad culture,"Great pay and benefits compared to US average,...",The culture is in the gutter. Used to be a gre...
2,Marathon,2021-09-20,Advanced Business Analyst,Current Employee,Good Company,"Coworkers, Benefits, Challenging Work Environm...","Old world culture, job advancement, employee d..."
3,Marathon,2021-10-11,Trader,Current Employee,Great Benefits,Great benefits and good people.,have started to offer flex schedules as trial ...
4,Marathon,2021-06-27,Pipeline Project Engineer,Former Employee,Very generous and professionally run company,Lots of training to management and it shows. T...,Typical large corporation with bloated bureauc...
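
The assembling step can be pictured as concatenating the per-company review tables and removing duplicates. The sketch below illustrates that idea under assumptions: the helper name `assemble`, the choice of duplicate key, and the sample rows are hypothetical, and the real `assemble_all_reviews()` may clean the data differently.

```python
import pandas as pd

def assemble(frames):
    """Concatenate review tables and drop reviews that appear more than once."""
    combined = pd.concat(frames, ignore_index=True)
    key = ["Company", "date", "review_title", "pros", "cons"]
    return combined.drop_duplicates(subset=key).reset_index(drop=True)

# Illustrative rows only; the second frame repeats one Marathon review.
first = pd.DataFrame({
    "Company": ["Marathon", "Marathon"],
    "date": ["2021-10-21", "2021-10-20"],
    "review_title": ["Great place to work", "Good company with bad culture"],
    "pros": ["pros a", "pros b"],
    "cons": ["cons a", "cons b"],
})
second = pd.DataFrame({
    "Company": ["Marathon", "Valero"],
    "date": ["2021-10-20", "2021-10-19"],
    "review_title": ["Good company with bad culture", "Solid employer"],
    "pros": ["pros b", "pros c"],
    "cons": ["cons b", "cons c"],
})
assembled = assemble([first, second])
```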


In [6]:
all_reviews = notebook_script.predict_sentiment(all_reviews) # predicts 'pros' and 'cons' sentiment and saves the updated all_reviews.csv
display(all_reviews.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons
0,Marathon,2021-10-21,Safety Professional,Current Employee,Great place to work,They take care of their employees especially d...,Becoming more focused on cost and budget. Be t...,0.4939,-0.8655
1,Marathon,2021-10-20,Senior Project Engineer,Former Employee,Good company with bad culture,"Great pay and benefits compared to US average,...",The culture is in the gutter. Used to be a gre...,0.9694,-0.9568
2,Marathon,2021-09-20,Advanced Business Analyst,Current Employee,Good Company,"Coworkers, Benefits, Challenging Work Environm...","Old world culture, job advancement, employee d...",0.4939,-0.9201
3,Marathon,2021-10-11,Trader,Current Employee,Great Benefits,Great benefits and good people.,have started to offer flex schedules as trial ...,0.8625,-0.91
4,Marathon,2021-06-27,Pipeline Project Engineer,Former Employee,Very generous and professionally run company,Lots of training to management and it shows. T...,Typical large corporation with bloated bureauc...,0.8225,-0.9091
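
A quick sanity check on the new columns can catch a failed prediction run early. The sketch below assumes the scores are meant to lie in [-1, 1], which matches the values displayed above but is not confirmed by the source; `scores_in_range` is a hypothetical helper.

```python
import pandas as pd

# First two rows of the preview above, score columns only.
scores = pd.DataFrame({
    "Company": ["Marathon", "Marathon"],
    "score_pros": [0.4939, 0.9694],
    "score_cons": [-0.8655, -0.9568],
})

def scores_in_range(frame, low=-1.0, high=1.0):
    """Return True if every score_pros and score_cons value lies in [low, high]."""
    cols = ["score_pros", "score_cons"]
    return bool(frame[cols].apply(lambda s: s.between(low, high)).all().all())

in_range = scores_in_range(scores)

# Average sentiment per company, one row per company.
company_means = scores.groupby("Company")[["score_pros", "score_cons"]].mean()
```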


In [10]:
# CAREFUL! This cell takes a very long time to run, even for a single call to social_classification.predict_class() in the loop.
# To see it run, first try it with overwrite = False, a single nature_review element, a single Social_criteria element, and all_reviews.head().copy().


run_this_cell = False # set to True only if you really want to run this code
overwrite = False # set to True to overwrite all_reviews_S.csv, which contains all the social predictions

if run_this_cell:
  nature_review = ['pros', 'cons']

  Social_criteria = ['Insurance', 'Safety', 'Balance',
        'Retirement', 'Racism', 'Sexism', 'Ageism',
        'Benefits', 'Resources', 'Opportunities', 'Privacy',
        'Culture']

  all_reviews_S = all_reviews.copy() # use all_reviews.head().copy() for a quick test

  for nature in nature_review:
    for criteria in Social_criteria:
      all_reviews_S = social_classification.predict_class(all_reviews_S, criteria, nature, root_path)

  if overwrite:
    all_reviews_S.to_csv(root_path + 'datasets/all_reviews_S.csv', index = False, sep = ';')
  display(all_reviews_S)
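
The nested loop adds one column per (nature, criterion) pair. Judging from the column names in `all_reviews_S.csv` shown in section 5, these are named `<criterion in lowercase>_<nature>`. The sketch below lists the expected names so a partial run can be checked; `missing_columns` is a hypothetical helper.

```python
nature_review = ['pros', 'cons']
Social_criteria = ['Insurance', 'Safety', 'Balance',
                   'Retirement', 'Racism', 'Sexism', 'Ageism',
                   'Benefits', 'Resources', 'Opportunities', 'Privacy',
                   'Culture']

# Same iteration order as the loop in the cell: all 'pros' columns, then all 'cons'.
expected_columns = [f"{criteria.lower()}_{nature}"
                    for nature in nature_review
                    for criteria in Social_criteria]

def missing_columns(existing):
    """Return the expected prediction columns that are absent from `existing`."""
    return [name for name in expected_columns if name not in set(existing)]

# Example: only the 'pros' half has been predicted so far.
still_missing = missing_columns(expected_columns[:12])
```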


## **5) Update Tableau input**

In [None]:
all_reviews_S = pd.read_csv(root_path + 'datasets/all_reviews_S.csv', sep = ';')
display(all_reviews_S.head())

Unnamed: 0,Company,date,employee_title,employee_status,review_title,pros,cons,score_pros,score_cons,insurance_pros,safety_pros,balance_pros,retirement_pros,racism_pros,sexism_pros,ageism_pros,benefits_pros,resources_pros,opportunities_pros,privacy_pros,culture_pros,insurance_cons,safety_cons,balance_cons,retirement_cons,racism_cons,sexism_cons,ageism_cons,benefits_cons,resources_cons,opportunities_cons,privacy_cons,culture_cons
0,ExxonMobil,2021-05-18,IT Analyst,"Current Employee, more than 1 year",Great Company Overall,Great work environment Great benefits Pretty g...,I have not experienced anything negative so fa...,0.9509,-0.8689,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0
1,ExxonMobil,2021-09-04,R&D Manager,Former Employee,working on energy R&D,"Outstanding colleagues, working on high impact...",Difficult industry business environment curren...,0.3182,-0.6486,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0
2,ExxonMobil,2021-10-16,Chemical Technician,"Current Employee, more than 3 years",Flexibility,The flexibility and the nature of working ther...,No downside. PERIOD. Such a great place to joi...,0.8553,-0.9305,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,0
3,ExxonMobil,2021-10-15,Anonymous,"Current Employee, more than 10 years",I can only be thankful,I am achieving my dreams in partnership with t...,"It is hard times right now. But for me, it's w...",0.7715,-0.9072,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,1,1,0,0
4,ExxonMobil,2021-10-13,Engineer,Former Employee,Decent company to work for,"Competitive pay, structured benefits, and job ...",Even if you worked your tail off the whole yea...,0.7003,-0.7184,0,1,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0


In [None]:
Social_criteria = ['Insurance', 'Safety', 'Balance',
       'Retirement', 'Racism', 'Sexism', 'Ageism',
       'Benefits', 'Resources', 'Opportunities', 'Privacy',
       'Culture']

aggregated_social_scores = notebook_script.aggregate_company_results(all_reviews_S, Social_criteria)
display(aggregated_social_scores.head())

Tableau_input_general, Tableau_input_pros, Tableau_input_cons = notebook_script.generate_tableau_inputs(aggregated_social_scores, Social_criteria)
display(Tableau_input_general.head())

Unnamed: 0,Insurance_mean,Insurance_count,Safety_mean,Safety_count,Balance_mean,Balance_count,Retirement_mean,Retirement_count,Racism_mean,Racism_count,Sexism_mean,Sexism_count,Ageism_mean,Ageism_count,Benefits_mean,Benefits_count,Resources_mean,Resources_count,Opportunities_mean,Opportunities_count,Privacy_mean,Privacy_count,Culture_mean,Culture_count,Total_mean,Total_count,Insurance_pros_mean,Insurance_pros_count,Safety_pros_mean,Safety_pros_count,Balance_pros_mean,Balance_pros_count,Retirement_pros_mean,Retirement_pros_count,Racism_pros_mean,Racism_pros_count,Sexism_pros_mean,Sexism_pros_count,Ageism_pros_mean,Ageism_pros_count,Benefits_pros_mean,Benefits_pros_count,Resources_pros_mean,Resources_pros_count,Opportunities_pros_mean,Opportunities_pros_count,Privacy_pros_mean,Privacy_pros_count,Culture_pros_mean,Culture_pros_count,Total_pros_mean,Total_pros_count,Insurance_cons_mean,Insurance_cons_count,Safety_cons_mean,Safety_cons_count,Balance_cons_mean,Balance_cons_count,Retirement_cons_mean,Retirement_cons_count,Racism_cons_mean,Racism_cons_count,Sexism_cons_mean,Sexism_cons_count,Ageism_cons_mean,Ageism_cons_count,Benefits_cons_mean,Benefits_cons_count,Resources_cons_mean,Resources_cons_count,Opportunities_cons_mean,Opportunities_cons_count,Privacy_cons_mean,Privacy_cons_count,Culture_cons_mean,Culture_cons_count,Total_cons_mean,Total_cons_count
ExxonMobil,9,101,22,371,6,340,43,43,0,53,41,21,41,22,15,338,14,414,14,357,5,218,0,307,10,1000,44,48,38,182,6,169,57,24,2,17,34,12,14,12,30,172,26,202,31,177,31,105,5,147,18,500,-98,53,-83,189,-80,171,-80,19,-88,36,-67,9,-54,10,-87,166,-93,212,-88,180,-87,113,-94,160,-86,500
Phillips66,62,48,29,197,29,172,50,25,64,26,99,7,0,9,33,182,39,208,25,163,29,100,72,145,33,500,83,29,66,97,75,81,100,13,61,16,44,5,2,2,65,96,68,108,57,82,78,49,89,86,75,250,-87,19,-96,100,-96,91,-100,12,-100,10,-69,2,-94,7,-98,86,-100,100,-98,81,-95,51,-95,59,-100,250
Conocophilips,100,189,100,773,100,672,57,109,92,116,99,38,100,42,100,633,100,845,96,719,100,403,98,584,100,2000,100,93,100,397,100,350,46,48,50,56,76,19,94,25,100,316,100,415,78,360,100,199,94,274,100,1000,-3,96,0,376,-3,322,-15,61,0,60,-55,19,-2,17,-3,317,-14,430,0,359,-4,204,-17,310,-1,1000
Schlumberger,29,181,53,779,45,685,43,105,78,138,54,42,63,44,56,649,48,847,48,703,32,424,47,640,46,1980,0,76,27,403,12,339,26,52,24,72,23,23,44,21,30,324,10,425,10,349,5,215,15,318,12,990,-22,105,-16,376,-12,346,-32,53,-2,66,-38,19,-20,23,-12,325,-18,422,-11,354,-21,209,-29,322,-13,990
Valero,35,41,57,202,48,184,63,26,99,34,85,10,41,11,45,156,62,228,57,165,43,96,76,148,53,480,83,19,83,101,75,88,88,16,99,17,73,5,73,4,60,81,84,115,84,83,63,46,78,86,75,240,-87,22,-63,101,-64,96,-88,10,-37,17,-61,5,-69,7,-66,75,-68,113,-71,82,-55,50,-68,62,-62,240
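
In the table above, each general `_count` equals the sum of the matching `_pros_count` and `_cons_count` (for ExxonMobil, Insurance: 48 + 53 = 101). The sketch below reproduces only these counts from the 0/1 flag columns; the helper name and sample data are assumptions, and the `_mean` columns are left out because their formula is not shown in this notebook.

```python
import pandas as pd

def count_flagged_reviews(reviews, criteria):
    """Count, per company, the reviews flagged for each criterion in pros, cons, and both."""
    flag_columns = [f"{name.lower()}_{nature}"
                    for name in criteria for nature in ("pros", "cons")]
    sums = reviews.groupby("Company")[flag_columns].sum()
    counts = pd.DataFrame(index=sums.index)
    for name in criteria:
        key = name.lower()
        counts[f"{name}_pros_count"] = sums[f"{key}_pros"]
        counts[f"{name}_cons_count"] = sums[f"{key}_cons"]
        counts[f"{name}_count"] = sums[f"{key}_pros"] + sums[f"{key}_cons"]
    return counts

# Illustrative flags only.
demo = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "safety_pros": [1, 0, 1],
    "safety_cons": [1, 1, 0],
})
demo_counts = count_flagged_reviews(demo, ["Safety"])
```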


Unnamed: 0,Company,Criteria 1,Score 1,Count 1,Criteria 2,Score 2,Count 2,Rank
0,ExxonMobil,Insurance,9,101,Insurance,9,101,16
1,ExxonMobil,Insurance,9,101,Safety,22,371,16
2,ExxonMobil,Insurance,9,101,Balance,6,340,16
3,ExxonMobil,Insurance,9,101,Retirement,43,43,16
4,ExxonMobil,Insurance,9,101,Racism,0,53,16
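
The Tableau input above pairs every criterion with every criterion (including itself) for each company. A minimal sketch of that cross-pairing, using three ExxonMobil values from the aggregated table; `pair_criteria` is a hypothetical helper, and the `Rank` column is omitted because its definition is not shown here.

```python
from itertools import product
import pandas as pd

def pair_criteria(company, scores):
    """Build one row per ordered pair of criteria; `scores` maps criterion -> (score, count)."""
    rows = []
    for first, second in product(scores, repeat=2):
        rows.append({
            "Company": company,
            "Criteria 1": first,
            "Score 1": scores[first][0],
            "Count 1": scores[first][1],
            "Criteria 2": second,
            "Score 2": scores[second][0],
            "Count 2": scores[second][1],
        })
    return pd.DataFrame(rows)

exxon_scores = {"Insurance": (9, 101), "Safety": (22, 371), "Balance": (6, 340)}
pairs = pair_criteria("ExxonMobil", exxon_scores)
```

With the 12 criteria used in this notebook, this layout gives 144 rows per company.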


## **6) Push the work to git repo**

In [None]:
# Run git from inside the cloned repo; stage only the datasets so that secret.json is never committed
!git -C DataX15 add "Final Project/datasets" && git -C DataX15 commit -m "Monthly reviews update" && git -C DataX15 push origin master
