<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Four-Day Work Week

## Background

The four-day work week has been discussed since the 1970s, and studies have reported that productivity and profits increase when a reduced work week is implemented and employees respond positively to it. However, the four-day work week has not taken off; one main problem is coordinating with a world that still runs on five-day work weeks ([source](https://doi.org/10.2307/41164622)).

In recent years, momentum has been building around the globe for a four-day work week. The Covid-19 pandemic has shifted workforce perspectives on work, and employees are calling for a shorter, more flexible work week as well as better work-life balance and mental health, according to Adecco Group research ([source](https://www.adeccogroup.com/future-of-work/latest-research/reset-normal/#download-the-global-report)).

The idea of a four-day work week is that employees work four days a week, for a total of 32 hours, while keeping the same salary, benefits, and workload. Iceland, New Zealand, and Japan have shifted to a four-day work week scheme and reported higher productivity and better work-life balance ([source](https://sloanreview.mit.edu/article/what-does-the-four-day-workweek-mean-for-the-future-of-work/)). Belgium is one of the most recent countries to offer employees a four-day work week as part of changes to its labour laws in the post-Covid era ([source](https://www.straitstimes.com/world/europe/belgium-permits-four-day-week-to-boost-work-flexibility-post-covid-19)), and the UK started piloting the four-day week in June 2022 ([source](https://www.theguardian.com/business/2022/jun/06/thousands-workers-worlds-biggest-trial-four-day-week)).

In Singapore, a recent Indeed survey shows that 88% of the Singaporean employees surveyed support a four-day work week with the same pay. Employees prioritise family, physical health, and relaxation, and look for better work-life balance with increased flexibility, better financial compensation, and a less stressful workplace ([source](https://www.humanresourcesonline.net/88-of-singapore-employees-surveyed-support-a-four-day-workweek-with-the-same-pay)). During the pandemic, companies rolled out flexible work arrangements, including a shorter work week. However, employees worry that a four-day work week would mean clocking longer hours in a compressed work week and that company performance would suffer ([source](https://www.straitstimes.com/singapore/jobs/workers-worried-4-day-week-will-mean-longer-hours-for-them-survey)). There are also concerns about the rise of mental health issues at work ([source](https://www.straitstimes.com/singapore/health/growing-focus-on-mental-health-at-workplace-as-covid-19-pandemic-takes-toll)) and the Great Resignation wave ([source](https://www.channelnewsasia.com/commentary/great-resignation-wave-quit-find-job-employer-boss-pay-mental-health-2386761)) in Singapore.


## Problem Statement

The Covid-19 pandemic has sparked the debate over the four-day work week globally with employees in search of better work-life balance and mental health. With more countries reporting better employee well-being and higher productivity with the four-day work week scheme, should Singapore join the trend and introduce a four-day work week trial to reduce mental health issues in the workplace and retain or attract talent? With the pandemic altering the way the world works, the Ministry of Manpower (MOM) has assigned its data scientists to analyze the data and make recommendations regarding the possibility of a four-day week in Singapore. The analysis includes:

- Examining employee sentiments about a four-day work week
- Assessing whether a four-day work week would affect company productivity
- Building a model that predicts company productivity with at least 80% accuracy
- Identifying the top 3 factors that significantly impact company productivity


## Methodology

1. Scrape Glassdoor company reviews data using Selenium and Beautiful Soup.

2. Data cleaning, text preprocessing (removing punctuation and stopwords, tokenizing, and lemmatizing the text), and visualization.

3. Perform sentiment analysis on the Glassdoor company reviews.

4. Perform modeling using PyCaret and identify the top 3 important features.

5. Perform prediction on the test dataset.
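
As a rough illustration of the text preprocessing in step 2, the sketch below lowercases a review, strips punctuation, tokenizes on whitespace, and drops stopwords. It is a minimal standard-library sketch, not the code used in Notebook 2: the `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for illustration, and lemmatization is left out because it needs an external library.

```python
import string

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "with"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    cleaned = text.lower().translate(PUNCT_TO_SPACE)
    return [token for token in cleaned.split() if token not in STOPWORDS]


print(preprocess("Great life-work balance with 4 days a week!"))
# ['great', 'life', 'work', 'balance', '4', 'days', 'week']
```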

Our definition of a good model is one that has at least 80% accuracy and an F1 score above 0.8. The F1 score is our primary evaluation metric.
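
To make the acceptance criterion concrete, the sketch below computes accuracy and F1 from confusion-matrix counts and checks them against the two thresholds. It assumes a binary classification setup, which this notebook does not specify; the function names and the example counts are hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, f1


def meets_target(tp, fp, fn, tn, min_accuracy=0.80, min_f1=0.80):
    """Good model: accuracy of at least 80% and an F1 score above 0.8."""
    accuracy, f1 = accuracy_and_f1(tp, fp, fn, tn)
    return accuracy >= min_accuracy and f1 > min_f1


# Hypothetical counts, for illustration only.
accuracy, f1 = accuracy_and_f1(tp=80, fp=10, fn=20, tn=90)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.850, f1=0.842
print(meets_target(tp=80, fp=10, fn=20, tn=90))  # True
```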

Notebook 1 (current notebook) covers data collection. In this notebook, we use Selenium to scrape Glassdoor reviews of companies that have adopted a four-day work week.

Notebook 2 covers data cleaning, sentiment analysis, and insights from visualizations.

Notebook 3 covers modeling and provides recommendations, limitations, and future plans.

## Import Python Modules

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException

import time
import re
from urllib.request import urlopen
import json
from pandas import json_normalize  # public location since pandas 1.0 (pandas.io.json.json_normalize is deprecated)
import pandas as pd, numpy as np
from bs4 import BeautifulSoup as bs

import warnings
warnings.simplefilter(action='ignore')

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Extract Data

### Log In to Glassdoor

In [2]:
# Specify the login credential values for the following variables and then run all the cells
g_username = ''
g_password = ''
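
Instead of typing credentials into the notebook, they could be read from environment variables so they never end up in a saved or shared copy. This is only a sketch: the variable names `GLASSDOOR_USERNAME` and `GLASSDOOR_PASSWORD` and the `load_credentials` helper are assumptions, not part of the original notebook.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables."""
    username = env.get("GLASSDOOR_USERNAME", "")
    password = env.get("GLASSDOOR_PASSWORD", "")
    if not username or not password:
        raise RuntimeError(
            "Set GLASSDOOR_USERNAME and GLASSDOOR_PASSWORD before running the notebook."
        )
    return username, password


# Usage in the notebook (commented out so the sketch runs without the variables set):
# g_username, g_password = load_credentials()
```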

In [3]:
from selenium.webdriver.firefox.options import Options as FirefoxOptions

options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

driver.get("https://www.glassdoor.sg/index.htm")

In [4]:
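# Wait until the sign-in button is clickable, then click it.
# The absolute XPath reflects the page layout at scraping time and may need updating.
sign_up = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[2]/div/div/div/div/div[1]/article/header/nav/div[2]/div/div/div/button')))
sign_up.click()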

In [5]:
glassdoor_username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
glassdoor_password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

In [6]:
glassdoor_username.clear()
glassdoor_username.send_keys(g_username)
glassdoor_password.clear()
glassdoor_password.send_keys(g_password)

In [7]:
# Wait until the login button is clickable, then submit the form.
login_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[8]/div/div/div[2]/div[2]/div[2]/div/div/form/div[3]/button/span')))
login_button.click()

### Setup Companies List

In [8]:
# Set up the companies list (full list)
companies_review_fullList = {'Administrate': 'Administrate-Reviews-E936767.htm', 
                         'Advice Direct Scotland': 'Advice-Direct-Scotland-Reviews-E3083697.htm',
                         'Applied Reviews': 'Applied-Reviews-E2339980.htm',
                         'Arken Legal': 'Arken-legal-Reviews-E6996538.htm',
                         'Atom Bank': 'Atom-Bank-Reviews-E1354887.htm',
                         '3D Issue':'3D-Issue-Reviews-E1944114.htm',
                         '448 Studio': '448-Studio-Reviews-E6471702.htm',
                         'Abstract': 'Abstract-Reviews-E2010334.htm',
                         'ALAMI': 'ALAMI-Sharia-Reviews-E4787557.htm',
                         'Awin': 'Awin-Reviews-E38054.htm',
                         'Backbone': 'Backbone-CO-Reviews-E2285766.htm',
                         'Big Potato Games': 'Big-Potato-Reviews-E1608222.htm',
                         'Bijles Aan Huis': 'Bijles-Aan-Huis-Reviews-E4677512.htm',
                         'Bit.io': 'bit-io-Reviews-E3210006.htm',
                         'Blackbird Interactive': 'Blackbird-Interactive-Reviews-E660577.htm',
                         'Blink SEO': 'Blink-SEO-Reviews-E3314924.htm',
                         'Bolt': 'Bolt-Reviews-E1181527.htm',
                         'Boulder County': 'Boulder-County-Reviews-E282126.htm',
                         'Brett Nicholls Associates': 'Brett-Nicholls-Associates-Reviews-E3141340.htm',
                         'Buffer': 'Buffer-Reviews-E941992.htm',
                         'Bunny Studio': 'Bunny-Studio-Reviews-E2135659.htm',
                         'ByteChek': 'ByteChek-Reviews-E6238945.htm',
                         'Charlton Morris': 'Charlton-Morris-Reviews-E1881577.htm',
                         'Chief Nation': 'Chief-Nation-Reviews-E1703171.htm',
                         'Chordify': 'Chordify-Reviews-E5801829.htm',
                         'CIB Group': 'The-CIB-Group-Reviews-E1356135.htm',
                         'City to Sea': 'City-to-Sea-Reviews-E3086213.htm',
                         'Civo': 'Civo-Reviews-E5464787.htm',
                         'Close': 'Close-Reviews-E1155591.htm',
                         'CMG Technologies': 'CMG-Technologies-Reviews-E4183119.htm',
                         'Cockroach Labs': 'Cockroach-Labs-Reviews-E1168502.htm',
                         'Coconut Software': 'Coconut-Software-Reviews-E1821886.htm',
                         'Commission Factory': 'Commission-Factory-Reviews-E1336191.htm',
                         'Cosmic': 'Cosmic-CA-Reviews-E2266739.htm',
                         'D’Youville University': 'D-Youville-College-Reviews-E128529.htm',
                         'Dassana': 'Dassana-Reviews-E5403581.htm', 
                         'Daye': 'Daye-Reviews-E3195974.htm',
                         'Deedmob': 'Deedmob-Reviews-E2257453.htm',
                         'Desigual': 'Desigual-Reviews-E385914.htm',
                         'DHIS2': 'DHIS2-Reviews-E6625379.htm',
                         'Diesdas Digital': 'diesdas-digital-Reviews-E2988941.htm',
                         'Digible': 'Digible-Reviews-E3625748.htm',
                         'DNSFilter': 'DNSFilter-Reviews-E2146068.htm',
                         'Eduwill': 'Eduwill-Reviews-E5360229.htm',
                         'eFileCabinet': 'EFileCabinet-Reviews-E323527.htm',
                         'Eidos-Montréal': 'Eidos-Montreal-Reviews-E456207.htm',
                         'Elektra Lighting': 'Elektra-Lighting-Reviews-E3658503.htm',
                         'Elephant Ventures': 'Elephant-Ventures-Reviews-E844433.htm',
                         'Emtrain': 'Emtrain-Reviews-E916115.htm',
                         'Evolved Search': 'Evolved-Search-Reviews-E2558249.htm',
                         'Fast Retailing': 'Fast-Retailing-Reviews-E12005.htm',
                         'Feathr': 'Feathr-Reviews-E1394102.htm',
                         'Formedix': 'Formedix-Reviews-E687727.htm',
                         'FundersClub': 'FundersClub-Reviews-E1409640.htm',
                         'G2i': 'G2i-Reviews-E4414795.htm',
                         'Galt Pharmaceuticals': 'Galt-Pharmaceuticals-Reviews-E1925821.htm',
                         'GoLinks': 'GoLinks-Enterprises-Reviews-E3300549.htm', 
                         'Grounded Packaging': 'Grounded-Packaging-Reviews-E5580592.htm',
                         'Groupe LDLC': 'Groupe-LDLC-Reviews-E4529434.htm',
                         'Headspace': 'Headspace-Reviews-E984335.htm',
                         'Healthwise': 'Healthwise-Reviews-E103390.htm',
                         'Hitachi': 'Hitachi-Reviews-E3525.htm',
                         'ICE Group': 'ICE-Group-Reviews-E1535608.htm',
                         'ihorizon': 'ihorizon-Reviews-E1257740.htm',
                         'InDebted': 'InDebted-Reviews-E2351821.htm',
                         'IriusRisk': 'IriusRisk-Reviews-E2008259.htm',
                         'JAYU': 'JAYU-Reviews-E2531471.htm',
                         'Jexo': 'Jexo-Reviews-E5803822.htm',
                         'JMK Solicitors': 'JMK-Solicitors-Reviews-E4396998.htm',
                         'Justuno': 'Justuno-Reviews-E856089.htm',
                         'Kickstarter': 'Kickstarter-Reviews-E491996.htm',
                         'KPMG': 'KPMG-US-Reviews-E4930476.htm',
                         'Luscii': 'Luscii-Reviews-E2527503.htm', 
                         'Maaemo': 'Maaemo-Reviews-E6064682.htm',
                         'Marketing Signals': 'Marketing-Signals-Reviews-E1517959.htm',
                         'MeiliSearch': 'MeiliSearch-Reviews-E4147421.htm',
                         'Mizuho Financial Group': 'Mizuho-Financial-Group-Reviews-E42320.htm',
                         'Modo25': 'Modo25-Reviews-E3424273.htm',
                         'Monograph': 'Monograph-Reviews-E5844012.htm',
                         'Morrisons': 'Morrisons-Reviews-E10276.htm', 
                         'MRL Consulting Group': 'MRL-Consulting-Group-Reviews-E667599.htm',
                         'Nexton': 'Nexton-Reviews-E2903853.htm',
                         'Nomad': 'Nomad-Reviews-E758185.htm', 
                         'Officely': 'Officely-Reviews-E6073767.htm',
                         'Our Community': 'Our-Community-Reviews-E5628053.htm', 
                         'Panasonic': 'Panasonic-Reviews-E4279.htm',
                         'PDQ.com': 'PDQ-com-Reviews-E2113848.htm',
                         'Perpetual Guardian': 'Perpetual-Guardian-Reviews-E4212325.htm',
                         'Piktochart': 'Piktochart-Reviews-E1000937.htm',
                         'Polar': 'Polar-Reviews-E476124.htm',
                         'Praytell': 'Praytell-Strategy-Reviews-E1000535.htm',
                         'Procurify': 'Procurify-Reviews-E828973.htm',
                         'Qwick': 'Qwick-Reviews-E2285211.htm',
                         'Reboot': 'Reboot-Online-Marketing-Reviews-E2368554.htm',
                         'Reflect Digital': 'Reflect-Digital-Reviews-E3130444.htm',
                         'REM Web Solutions': 'REM-Web-Solutions-Reviews-E2430696.htm',
                         'Seed&Spark': 'Seed-and-Spark-Reviews-E2525694.htm',
                         'SEOMG!': 'SEOMG-Reviews-E4908851.htm',
                         'Signifyd': 'Signifyd-Reviews-E776012.htm',
                         'Smalls': 'Smalls-Reviews-E3266103.htm',
                         'Spin Brands': 'Spin-Brands-Reviews-E2183919.htm',
                         'Starship': 'Starship-Reviews-E2308180.htm',
                         'Stora': 'Stora-Reviews-E5650944.htm',
                         'streamGo': 'streamGo-Reviews-E4495328.htm',
                         'Swash Labs': 'Swash-Labs-Reviews-E2297615.htm',
                         'Talewind': 'Talewind-Reviews-E5554851.htm',
                         'Telefónica': 'Telefónica-Reviews-E3511.htm',
                         'The Curve Group': 'The-Curve-Group-Reviews-E625753.htm',
                         'The Wanderlust Group': 'The-Wanderlust-Group-Reviews-E5618947.htm',
                         'Think Productive': 'Think-Productive-Reviews-E5481334.htm',
                         'thredUP': 'thredUP-Reviews-E447264.htm', 
                         'THRYVE': 'thryve-Reviews-E2192049.htm',
                         'Toshiba': 'Toshiba-Reviews-E3543.htm',
                         'Treehouse': 'Treehouse-Oregon-Reviews-E748187.htm',
                         'Twistcode': 'Twistcode-Technologies-Reviews-E3129268.htm',
                         'Uncharted': 'Uncharted-Reviews-E633539.htm',
                         'Unito': 'Unito-Reviews-E1326301.htm',
                         'Uplevel': 'Uplevel-WA-Reviews-E7149257.htm',
                         'Uplift': 'Uplift-ltd-Reviews-E4631378.htm',
                         'UsabilityHub': 'UsabilityHub-Reviews-E2439999.htm',
                         'Venture Stream': 'Venture-Stream-Reviews-E2185910.htm',
                         'Volt Athletics': 'Volt-Athletics-Reviews-E1698590.htm',
                         'WANdisco': 'WANdisco-Reviews-E582936.htm',
                         'Welcome to the Jungle': 'Welcome-to-the-Jungle-Reviews-E2888749.htm',
                         'Whyfield': 'WHYFIELD-Reviews-E5197627.htm',
                         'Wildbit': 'Wildbit-Reviews-E717954.htm',
                         'Wonde': 'Wonde-Reviews-E2142305.htm',
                         'Wonderlic': 'Wonderlic-Reviews-E259986.htm',
                         'Shopify': 'Shopify-Reviews-E675933.htm',
                         'Canon': 'Canon-EMEA-Reviews-E238656.htm',                             
                         'Hutch': 'Hutch-Games-Reviews-E3286494.htm',
                         'Highfield Professional Solutions Ltd': 'Highfield-Professional-Solutions-Reviews-E729949.htm',                             
                         'Gungho Marketing': 'Gungho-Marketing-Reviews-E3164713.htm',
                         'HearFocus': 'Hearfocus-Reviews-E2594463.htm',
                         'Tulip': 'Tulip-Retail-Reviews-E930460.htm'
                            }

print(companies_review_fullList)

{'Administrate': 'Administrate-Reviews-E936767.htm', 'Advice Direct Scotland': 'Advice-Direct-Scotland-Reviews-E3083697.htm', 'Applied Reviews': 'Applied-Reviews-E2339980.htm', 'Arken Legal': 'Arken-legal-Reviews-E6996538.htm', 'Atom Bank': 'Atom-Bank-Reviews-E1354887.htm', '3D Issue': '3D-Issue-Reviews-E1944114.htm', '448 Studio': '448-Studio-Reviews-E6471702.htm', 'Abstract': 'Abstract-Reviews-E2010334.htm', 'ALAMI': 'ALAMI-Sharia-Reviews-E4787557.htm', 'Awin': 'Awin-Reviews-E38054.htm', 'Backbone': 'Backbone-CO-Reviews-E2285766.htm', 'Big Potato Games': 'Big-Potato-Reviews-E1608222.htm', 'Bijles Aan Huis': 'Bijles-Aan-Huis-Reviews-E4677512.htm', 'Bit.io': 'bit-io-Reviews-E3210006.htm', 'Blackbird Interactive': 'Blackbird-Interactive-Reviews-E660577.htm', 'Blink SEO': 'Blink-SEO-Reviews-E3314924.htm', 'Bolt': 'Bolt-Reviews-E1181527.htm', 'Boulder County': 'Boulder-County-Reviews-E282126.htm', 'Brett Nicholls Associates': 'Brett-Nicholls-Associates-Reviews-E3141340.htm', 'Buffer': 'Bu

### Extract Glassdoor Review Information

In [9]:
# Extract review header, rating, pro, con and date/author

# How to locate each field: (tag name, attributes).
# The CSS class names are the ones observed at scraping time and may change.
FIELDS = {
    'review_header': ('h2', {'class': 'mb-xxsm mt-0 css-93svrw el6ke055'}),
    'rating': ('span', {'class': 'ratingNumber mr-xsm'}),
    'pro': (None, {'data-test': 'pros'}),
    'con': (None, {'data-test': 'cons'}),
    'date_author': ('span', {'class': 'authorJobTitle middle common__EiReviewDetailsStyle__newGrey'}),
}

# Collect rows in plain lists; DataFrame.append was removed in pandas 2.0
rows = {field: [] for field in FIELDS}

def extract_fields(company, soup):
    """Append every match of each field on the parsed page to `rows`."""
    for field, (tag, attrs) in FIELDS.items():
        for element in soup.find_all(tag, attrs=attrs):
            rows[field].append({'company': str(company), field: str(element.text)})

# Loop for each company in the companies list
for company, url in companies_review_fullList.items():

    driver.get('https://www.glassdoor.sg/Reviews/' + url)
    data = bs(driver.page_source, 'html.parser')

    print(f'Extract review information for {company}')

    # Retrieve the pagination object
    pagination = data.find(attrs={'data-test': 'pagination'})

    if pagination is not None:
        # Keep only the numbered page links; this omits items such as Next
        page_urls = [page.get('href') for page in pagination.find_all('a') if page.text.isdigit()]

        # Visit each page and extract its reviews
        for page_url in page_urls:
            driver.get('https://www.glassdoor.sg' + page_url)
            extract_fields(company, bs(driver.page_source, 'html.parser'))
    else:
        # No pagination: all reviews are on the first page
        extract_fields(company, data)

    # Set a timer before next iteration
    time.sleep(10)

# Build one dataframe per field from the collected rows
df_review_header = pd.DataFrame(rows['review_header'], columns=['company', 'review_header'])
df_rating = pd.DataFrame(rows['rating'], columns=['company', 'rating'])
df_pro = pd.DataFrame(rows['pro'], columns=['company', 'pro'])
df_con = pd.DataFrame(rows['con'], columns=['company', 'con'])
df_date_author = pd.DataFrame(rows['date_author'], columns=['company', 'date_author'])

Extract review information for Administrate
Extract review information for Advice Direct Scotland
Extract review information for Applied Reviews
Extract review information for Arken Legal
Extract review information for Atom Bank
Extract review information for 3D Issue
Extract review information for 448 Studio
Extract review information for Abstract
Extract review information for ALAMI
Extract review information for Awin
Extract review information for Backbone
Extract review information for Big Potato Games
Extract review information for Bijles Aan Huis
Extract review information for Bit.io
Extract review information for Blackbird Interactive
Extract review information for Blink SEO
Extract review information for Bolt
Extract review information for Boulder County
Extract review information for Brett Nicholls Associates
Extract review information for Buffer
Extract review information for Bunny Studio
Extract review information for ByteChek
Extract review information for Charlton Morris
E

In [10]:
df_review_header.head()

Unnamed: 0,company,review_header
0,Administrate,Ultimate Human Organization
1,Administrate,Great!
2,Administrate,"Great culture, good balance."
3,Administrate,Great company culture and life work balance.
4,Administrate,Awesome team


In [11]:
df_rating.head()

Unnamed: 0,company,rating
0,Administrate,5.0
1,Administrate,4.0
2,Administrate,5.0
3,Administrate,5.0
4,Administrate,5.0


In [12]:
df_pro.head()

Unnamed: 0,company,pro
0,Administrate,This is a start with an interesting product! \r\nSome pros:\r\n\r\n-4 Day Work Week\r\n-Good Benefits\r\n-Helpful teammates
1,Administrate,- work life balance is amazing\r\n- slowly getting out of 'start-up' phase and maturing quickly
2,Administrate,Four day week. Great gear. Career progression.
3,Administrate,"* Great life-work balance with 4 days a week and generous holidays allocation.\n* Friendly and open company culture that focuses on making it a great place to work for any humans.\n* Administrate does not focus on short term profits, but on long term customer relations and sustainable development.\n* Flat leadership structure.\n* Transparent company-wide decision making.\n* Modern and varied technologies, so it's fun to code."
4,Administrate,Caring people doing awesome work. 4 day work week!


In [13]:
df_con.head()

Unnamed: 0,company,con
0,Administrate,The company is still growing so certain procedures as not concrete.
1,Administrate,- not quite a mature yet
2,Administrate,"It could be high pressure at times, but that's startup!"
3,Administrate,"* No private healthcare.\n* Going through some teething pains as the company switched it's business operation model recently. Outlook looks positive, but we're not 100% there.\n* Early days of changing the way we approach software development to make it more sustainable. I like the idea, but I'm not completely confident this will be successfully applied.\n* Some amount of random, hectic work resulting from issues with communication with the customer and churn rate among management. The bad part is that it makes the projects less fun to work on, the good part is that issue is identified and acknowledged by the management, so there is a good chance it will improve in time. Also, this is very team dependant."
4,Administrate,Perks taken for granted and can create a sense of entitlement amongst some.


In [14]:
df_date_author.head()

Unnamed: 0,company,date_author
0,Administrate,5 May 2022 - Anonymous Employee
1,Administrate,28 Feb 2022 - Anonymous Employee
2,Administrate,15 Jan 2022 - Marketing Director
3,Administrate,16 Aug 2021 - Software Engineer
4,Administrate,21 Jul 2021 - Commercial
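
The `date_author` column packs two pieces of information into one string. A minimal sketch of how it might be split later, assuming every value follows the `day month year - job title` pattern seen in the sample rows above (other rows may differ); the helper name `split_date_author` is hypothetical.

```python
from datetime import datetime


def split_date_author(value):
    """Split '5 May 2022 - Anonymous Employee' into (date, job title)."""
    # Split on the first ' - ' only, so job titles containing ' - ' stay intact.
    date_text, _, job_title = value.partition(" - ")
    review_date = datetime.strptime(date_text.strip(), "%d %b %Y").date()
    return review_date, job_title.strip()


print(split_date_author("5 May 2022 - Anonymous Employee"))
# (datetime.date(2022, 5, 5), 'Anonymous Employee')
```

Note that `%b` matches abbreviated month names in the current locale, so the sketch assumes an English locale.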


### Combine Dataframes

In [15]:
# Combine df_review_header, df_rating, df_pro, df_con, df_date_author into 1 dataframe
company_review_df = pd.merge(df_review_header, df_rating, left_index=True, right_index=True, suffixes=('', '_remove'))
company_review_df = pd.merge(company_review_df, df_pro, left_index=True, right_index=True, suffixes=('', '_remove'))
company_review_df = pd.merge(company_review_df, df_con, left_index=True, right_index=True, suffixes=('', '_remove'))
company_review_df = pd.merge(company_review_df, df_date_author, left_index=True, right_index=True, suffixes=('', '_remove'))

# Remove the duplicate columns
company_review_df.drop([i for i in company_review_df if 'remove' in i], axis=1, inplace=True)

company_review_df.head()

Unnamed: 0,company,review_header,rating,pro,con,date_author
0,Administrate,Ultimate Human Organization,5.0,This is a start with an interesting product! \r\nSome pros:\r\n\r\n-4 Day Work Week\r\n-Good Benefits\r\n-Helpful teammates,The company is still growing so certain procedures as not concrete.,5 May 2022 - Anonymous Employee
1,Administrate,Great!,4.0,- work life balance is amazing\r\n- slowly getting out of 'start-up' phase and maturing quickly,- not quite a mature yet,28 Feb 2022 - Anonymous Employee
2,Administrate,"Great culture, good balance.",5.0,Four day week. Great gear. Career progression.,"It could be high pressure at times, but that's startup!",15 Jan 2022 - Marketing Director
3,Administrate,Great company culture and life work balance.,5.0,"* Great life-work balance with 4 days a week and generous holidays allocation.\n* Friendly and open company culture that focuses on making it a great place to work for any humans.\n* Administrate does not focus on short term profits, but on long term customer relations and sustainable development.\n* Flat leadership structure.\n* Transparent company-wide decision making.\n* Modern and varied technologies, so it's fun to code.","* No private healthcare.\n* Going through some teething pains as the company switched it's business operation model recently. Outlook looks positive, but we're not 100% there.\n* Early days of changing the way we approach software development to make it more sustainable. I like the idea, but I'm not completely confident this will be successfully applied.\n* Some amount of random, hectic work resulting from issues with communication with the customer and churn rate among management. The bad part is that it makes the projects less fun to work on, the good part is that issue is identified and acknowledged by the management, so there is a good chance it will improve in time. Also, this is very team dependant.",16 Aug 2021 - Software Engineer
4,Administrate,Awesome team,5.0,Caring people doing awesome work. 4 day work week!,Perks taken for granted and can create a sense of entitlement amongst some.,21 Jul 2021 - Commercial
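
The merges above pair rows purely by position, which only works if every page yielded the same number of headers, ratings, pros, cons, and dates. The sketch below shows one way such an assumption could be checked before combining the fields. It uses plain Python lists rather than the dataframes above, and `combine_fields` is a hypothetical helper, not part of the notebook.

```python
def combine_fields(**columns):
    """Zip equally long field lists into one record per review.

    Raises ValueError if the lists differ in length, because positional
    pairing would then attach fields to the wrong review.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Field counts differ, rows would be misaligned: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Values taken from the first two sample rows shown above.
records = combine_fields(
    review_header=["Ultimate Human Organization", "Great!"],
    rating=["5.0", "4.0"],
)
print(records[1])  # {'review_header': 'Great!', 'rating': '4.0'}
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)`.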


In [16]:
company_review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2688 entries, 0 to 2687
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   company        2688 non-null   object
 1   review_header  2688 non-null   object
 2   rating         2688 non-null   object
 3   pro            2688 non-null   object
 4   con            2688 non-null   object
 5   date_author    2688 non-null   object
dtypes: object(6)
memory usage: 126.1+ KB


In [17]:
company_review_df['company'].nunique()

114

In [18]:
company_review_df['company'].unique()

array(['Administrate', 'Advice Direct Scotland', 'Applied Reviews',
       'Arken Legal', 'Atom Bank', '3D Issue', '448 Studio', 'Abstract',
       'ALAMI', 'Awin', 'Backbone', 'Big Potato Games', 'Bijles Aan Huis',
       'Bit.io', 'Blackbird Interactive', 'Blink SEO', 'Bolt',
       'Boulder County', 'Buffer', 'Bunny Studio', 'ByteChek',
       'Charlton Morris', 'Chief Nation', 'City to Sea', 'Civo', 'Close',
       'CMG Technologies', 'Cockroach Labs', 'Coconut Software',
       'Commission Factory', 'Cosmic', 'D’Youville University', 'Dassana',
       'Daye', 'Deedmob', 'Desigual', 'Diesdas Digital', 'Digible',
       'DNSFilter', 'Eduwill', 'eFileCabinet', 'Eidos-Montréal',
       'Elephant Ventures', 'Emtrain', 'Evolved Search', 'Fast Retailing',
       'Feathr', 'Formedix', 'G2i', 'Galt Pharmaceuticals', 'GoLinks',
       'Headspace', 'Healthwise', 'Hitachi', 'ICE Group', 'ihorizon',
       'InDebted', 'IriusRisk', 'JMK Solicitors', 'Justuno',
       'Kickstarter', 'KPMG', 'Mar

## Export Dataset to CSV File

In [19]:
# Export dataset to csv
company_review_df.to_csv('../data/company_reviews.csv', index=False)

**Continue in Notebook 2**