# Data Cleanup
Now let's begin the process of cleaning up the string data we scraped from Glassdoor.

In [2]:
# libraries needed
import pandas as pd
import numpy as np
from datetime import datetime

pd.set_option('display.max_rows', 100)
pd.options.mode.chained_assignment = None

In [3]:
# get some information on the saved data
file_name = r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\AV_eng_data.csv" # enter the filepath between the quotes
data = pd.read_csv(file_name)
data.head(10)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue
0,Audio Visual System Design Engineer,$61K,$100K,"$80,500 /yr (est.)",,CCS,"Denver, CO",,,,,,
1,Audio Visual Design Engineer,$85K,$110K,"$97,500 /yr (est.)",4.3,AV-Worx\n4.3,"West Palm Beach, FL",1 to 50 Employees,2014.0,Company - Private,Business Consulting,Management & Consulting,$5 to $10 million (USD)
2,Audio Visual Systems Field Engineer,$48K,$83K,"$63,005 /yr (est.)",4.3,System Source\n4.3,"Hunt Valley, MD",51 to 200 Employees,1981.0,Company - Private,Information Technology Support Services,Information Technology,$10 to $25 million (USD)
3,(NY) Audio/Visual Design Engineer,$50K,$98K,"$69,978 /yr (est.)",3.8,A-V Services Inc.\n3.8,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable
4,Audio Visual Systems Engineer,,,,,Camera Corner / Connecting Point,"Green Bay, WI",,,,,,
5,Audio Visual Engineer,$44K,$103K,"$67,259 /yr (est.)",3.1,JVN Systems Inc.\n3.1,"Deer Park, NY",1 to 50 Employees,,Company - Private,,,$5 to $10 million (USD)
6,Audio Visual Design Engineer,$65K,$90K,"$77,500 /yr (est.)",,Technology Providers Inc.,"Gilbert, AZ",,,,,,
7,Audio Visual Design Engineer/Estimator,$85K,$105K,"$95,000 /yr (est.)",3.2,Network Cabling Services (NCS)\n3.2,"Dallas, TX",201 to 500 Employees,,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable
8,Audio Visual Systems Engineer,,,,,ACP CreativIT dba Arlington Computer Products,"Buffalo Grove, IL",,,,,,
9,"Pre-Sales Design Engineer, Audio Visual Remote",$56K,$102K,"$75,493 /yr (est.)",3.6,Johnson Controls\n3.6,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD)


In [24]:
#We see some NaN values, so let's confirm they are recognized as nulls
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          1000 non-null   object 
 1   Salary Minimum     816 non-null    object 
 2   Salary Maximum     816 non-null    object 
 3   Salary Average     848 non-null    object 
 4   Rating             817 non-null    float64
 5   Company Name       1000 non-null   object 
 6   Location           1000 non-null   object 
 7   Size               857 non-null    object 
 8   Founded            646 non-null    float64
 9   Type of ownership  857 non-null    object 
 10  Industry           712 non-null    object 
 11  Sector             712 non-null    object 
 12  Revenue            857 non-null    object 
dtypes: float64(2), object(11)
memory usage: 101.7+ KB


Above, we can see that some columns, such as 'Job Title', are fully populated, while others contain nulls. I also see some blank values where there should be NaNs, so I will run through the data and replace empty cells with NaN. From there, I want to convert the year founded into company age in years, clean up the duplicates I see, strip the non-numeric characters from the salary columns and convert them to floats, and remove the trailing characters at the end of the company names (a newline followed by the rating). It would also be useful to split the location into city and state. I may simplify the type of ownership to just Private vs. Public, but I'll review the values first to confirm. Finally, I will have to review the revenue data, convert the Unknown / Non-Applicable entries to nulls, and determine whether the information can be used. **A significant and fun list!**

In [25]:
# Replace empty cells with NaN
# r = raw string, ^ = start of string, \s* = zero or more whitespace characters, $ = end of string
data = data.replace(r'^\s*$', np.nan, regex=True) # we use regex to check the cell expression and see if it matches the input
data.head(10)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue
0,Audio Visual System Design Engineer,$61K,$100K,"$80,500 /yr (est.)",,CCS,"Denver, CO",,,,,,
1,Audio Visual Design Engineer,$85K,$110K,"$97,500 /yr (est.)",4.3,AV-Worx\n4.3,"West Palm Beach, FL",1 to 50 Employees,2014.0,Company - Private,Business Consulting,Management & Consulting,$5 to $10 million (USD)
2,Audio Visual Systems Field Engineer,$48K,$83K,"$63,005 /yr (est.)",4.3,System Source\n4.3,"Hunt Valley, MD",51 to 200 Employees,1981.0,Company - Private,Information Technology Support Services,Information Technology,$10 to $25 million (USD)
3,(NY) Audio/Visual Design Engineer,$50K,$98K,"$69,978 /yr (est.)",3.8,A-V Services Inc.\n3.8,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable
4,Audio Visual Systems Engineer,,,,,Camera Corner / Connecting Point,"Green Bay, WI",,,,,,
5,Audio Visual Engineer,$44K,$103K,"$67,259 /yr (est.)",3.1,JVN Systems Inc.\n3.1,"Deer Park, NY",1 to 50 Employees,,Company - Private,,,$5 to $10 million (USD)
6,Audio Visual Design Engineer,$65K,$90K,"$77,500 /yr (est.)",,Technology Providers Inc.,"Gilbert, AZ",,,,,,
7,Audio Visual Design Engineer/Estimator,$85K,$105K,"$95,000 /yr (est.)",3.2,Network Cabling Services (NCS)\n3.2,"Dallas, TX",201 to 500 Employees,,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable
8,Audio Visual Systems Engineer,,,,,ACP CreativIT dba Arlington Computer Products,"Buffalo Grove, IL",,,,,,
9,"Pre-Sales Design Engineer, Audio Visual Remote",$56K,$102K,"$75,493 /yr (est.)",3.6,Johnson Controls\n3.6,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD)


In [26]:
# checking columns for nulls
data.isnull().sum()

Job Title              0
Salary Minimum       184
Salary Maximum       184
Salary Average       152
Rating               183
Company Name           0
Location               0
Size                 143
Founded              354
Type of ownership    143
Industry             288
Sector               288
Revenue              143
dtype: int64
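
The blank-cell replacement and null count above can be sketched on a small example. The `toy` frame and its values are hypothetical stand-ins for the scraped data; only `DataFrame.replace` and `isnull().sum()` mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: it contains empty and whitespace-only cells
toy = pd.DataFrame({
    "Size": ["1 to 50 Employees", "", "   "],
    "Revenue": ["Unknown / Non-Applicable", " ", "$10+ billion (USD)"],
})

# ^\s*$ matches cells that are empty or contain only whitespace
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)

# Count the nulls per column, as in the cell above
null_counts = toy_clean.isnull().sum()
print(null_counts)
```

Note that a string such as "Unknown / Non-Applicable" is not matched by the pattern, so it still counts as a populated cell.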

In [34]:
# Remove rows where the salary is null, since they don't help us
data_cleaned = data.dropna(axis=0,subset=['Salary Average', 'Salary Minimum'])
data_cleaned.isnull().sum()

Job Title              0
Salary Minimum         0
Salary Maximum         0
Salary Average         0
Rating               131
Company Name           0
Location               0
Size                  91
Founded              269
Type of ownership     91
Industry             204
Sector               204
Revenue               91
dtype: int64

In [35]:
# separate hourly rows from salary rows
data_cleaned = pd.DataFrame(data = data_cleaned) # convert the slice to a pandas dataframe to work with it
data_cleaned['Average Hourly Rate'] = data_cleaned["Salary Average"].apply(lambda x: 1 if '/hr' in x.lower() else 0)
data_cleaned = data_cleaned.reset_index(drop=True)
data_cleaned.head(20)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate
0,Audio Visual System Design Engineer,$61K,$100K,"$80,500 /yr (est.)",,CCS,"Denver, CO",,,,,,,0
1,Audio Visual Design Engineer,$85K,$110K,"$97,500 /yr (est.)",4.3,AV-Worx\n4.3,"West Palm Beach, FL",1 to 50 Employees,2014.0,Company - Private,Business Consulting,Management & Consulting,$5 to $10 million (USD),0
2,Audio Visual Systems Field Engineer,$48K,$83K,"$63,005 /yr (est.)",4.3,System Source\n4.3,"Hunt Valley, MD",51 to 200 Employees,1981.0,Company - Private,Information Technology Support Services,Information Technology,$10 to $25 million (USD),0
3,(NY) Audio/Visual Design Engineer,$50K,$98K,"$69,978 /yr (est.)",3.8,A-V Services Inc.\n3.8,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0
4,Audio Visual Engineer,$44K,$103K,"$67,259 /yr (est.)",3.1,JVN Systems Inc.\n3.1,"Deer Park, NY",1 to 50 Employees,,Company - Private,,,$5 to $10 million (USD),0
5,Audio Visual Design Engineer,$65K,$90K,"$77,500 /yr (est.)",,Technology Providers Inc.,"Gilbert, AZ",,,,,,,0
6,Audio Visual Design Engineer/Estimator,$85K,$105K,"$95,000 /yr (est.)",3.2,Network Cabling Services (NCS)\n3.2,"Dallas, TX",201 to 500 Employees,,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0
7,"Pre-Sales Design Engineer, Audio Visual Remote",$56K,$102K,"$75,493 /yr (est.)",3.6,Johnson Controls\n3.6,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD),0
8,Audio Visual Sales Engineer,$70K,$80K,"$75,000 /yr (est.)",,Vario,Remote,,,,,,,0
9,Audio Visual Design Sales Engineer,$65K,$85K,"$75,000 /yr (est.)",3.6,Spectra Audio Design Group\n3.6,"New York, NY",1 to 50 Employees,,Company - Private,Advertising & Public Relations,Media & Communication,$5 to $10 million (USD),0


We can see that there appears to be average hourly rate data in here. Let's flag those rows so we can compare the hourly rates with the yearly salaries when we analyze the data.
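
As a rough sketch, the hourly flag can also be built without a lambda by using the pandas string accessor. The `sample` values below are assumed examples of the 'Salary Average' format, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same style as the 'Salary Average' column
sample = pd.Series(["$80,500 /yr (est.)", "$30.00 /hr (est.)", "$63,005 /yr (est.)"])

# 1 where the string mentions '/hr' (case-insensitive), else 0
is_hourly = sample.str.lower().str.contains("/hr", regex=False).astype(int)
print(is_hourly.tolist())
```

This relies on the column having no nulls, which holds here because rows without a salary were dropped first.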

In [36]:
# clean up the salary min/max/average strings so they contain only numeric characters, ready to convert to float
# remove '$', ',', 'K' and the '/yr (est.)' suffix

# First, let's clean up the average salary
salary_avg = data_cleaned['Salary Average'].apply(lambda x: x.split("/")[0])
salary_avg = salary_avg.apply(lambda x: x.replace('$', '').replace(',',''))
data_cleaned['Salary Average'] = salary_avg

#Now the minimum salary
salary_min = data_cleaned['Salary Minimum'].apply(lambda x:x.replace('$', '').replace('K','').replace('/hr', '').replace('/mo', ''))
data_cleaned['Salary Minimum']=salary_min


#Now the maximum salary
salary_max = data_cleaned['Salary Maximum'].apply(lambda x:x.replace('$', '').replace('K','').replace('/mo', ''))
data_cleaned['Salary Maximum']=salary_max
data_cleaned.tail(10)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate
806,"Audio Visual Technician (AV) - Santa Clara, CA",44.0,98.0,65414.0,3.7,Black Box\n3.7,"Santa Clara, CA",1001 to 5000 Employees,1975.0,Company - Private,Information Technology Support Services,Information Technology,$500 million to $1 billion (USD),0
807,Audio Visual Design Engineer,44.0,75.0,57568.0,3.7,"Conference Technologies, Inc.\n3.7","Atlanta, GA",201 to 500 Employees,1988.0,Company - Private,Telecommunications Services,Telecommunications,$50 to $100 million (USD),0
808,(NY) Audio/Visual Design Engineer,50.0,98.0,69978.0,3.8,A-V Services Inc.\n3.8,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0
809,Systems Engineer - Audio/Visual Integration,63.0,130.0,90706.0,2.9,Bluum\n2.9,"New York, NY",501 to 1000 Employees,1946.0,Company - Private,Information Technology Support Services,Information Technology,Unknown / Non-Applicable,0
810,Audio Visual Field Engineer,20.0,40.0,30.0,,"Vistacom, Inc",Pennsylvania,,,,,,,1
811,Audio Visual Field Engineer,20.0,40.0,30.0,,"Vistacom, Inc",Pennsylvania,,,,,,,1
812,"Pre-Sales Design Engineer, Audio Visual Remote",56.0,102.0,75493.0,3.6,Johnson Controls\n3.6,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD),0
813,Audio Visual Sales Engineer,70.0,80.0,75000.0,,Vario,Remote,,,,,,,0
814,Audio Visual Systems Engineer,70.0,70.0,70000.0,4.2,The Mom Project\n4.2,"New York, NY",51 to 200 Employees,2016.0,Company - Private,HR Consulting,Human Resources & Staffing,Unknown / Non-Applicable,0
815,Lead Audio Visual Field Technician,35.0,76.0,51724.0,3.4,EOS\n3.4,"New York, NY",501 to 1000 Employees,2008.0,Company - Private,Information Technology Support Services,Information Technology,Unknown / Non-Applicable,0
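
The string cleanup above could also be written with vectorized string methods followed by `pd.to_numeric`, which makes the conversion to numbers explicit. This is a minimal sketch on a hypothetical two-row frame (`raw`), assuming the same '$61K' and '$80,500 /yr (est.)' formats seen in the first rows of the data.

```python
import pandas as pd

# Hypothetical rows in the same format as the scraped salary columns
raw = pd.DataFrame({
    "Salary Minimum": ["$61K", "$85K"],
    "Salary Maximum": ["$100K", "$110K"],
    "Salary Average": ["$80,500 /yr (est.)", "$97,500 /yr (est.)"],
})

cleaned = raw.copy()

# Keep the part before the first '/', then drop '$' and ','
cleaned["Salary Average"] = (
    raw["Salary Average"]
    .str.split("/").str[0]
    .str.replace(r"[$,]", "", regex=True)
    .str.strip()
)

# Min/max are given in thousands, e.g. '$61K' -> '61'
for col in ["Salary Minimum", "Salary Maximum"]:
    cleaned[col] = raw[col].str.replace(r"[$K]", "", regex=True)

# Convert every cleaned column from string to a numeric dtype
cleaned = cleaned.apply(pd.to_numeric)
print(cleaned.dtypes)
```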


Now let's convert the hourly values to a yearly salary.

In [39]:
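# convert hourly rates ($/hr) to a yearly estimate ($/yr), assuming 40 hours/week and 52 weeks/year
def hr_to_year(i):
    rate = int(float(i)) # parse the numeric string and truncate it to whole dollars
    salary = rate*40*52 # 40 hours/week * 52 weeks/year
    #print("A rate of $", rate,"/hr will be a salary of $", salary,"/yr.")
    return salary
# values that contain a decimal point are treated as hourly rates and converted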
data_cleaned["Salary Maximum"] = data_cleaned["Salary Maximum"].astype(str).apply(lambda x: hr_to_year(x)/1000 if x.find('.') != -1 else x)
data_cleaned["Salary Minimum"] = data_cleaned["Salary Minimum"].astype(str).apply(lambda x: hr_to_year(x)/1000 if x.find('.') != -1 else x)
data_cleaned["Converted Salary"] = data_cleaned["Salary Average"].apply(lambda x: hr_to_year(x) if x.find('.') != -1 else x)
data_cleaned[["Salary Average", "Salary Minimum", 'Salary Maximum', "Converted Salary"]].tail(10)


Unnamed: 0,Salary Average,Salary Minimum,Salary Maximum,Converted Salary
806,65414.0,44.0,98.0,65414
807,57568.0,44.0,75.0,57568
808,69978.0,50.0,98.0,69978
809,90706.0,63.0,130.0,90706
810,30.0,41.6,83.2,62400
811,30.0,41.6,83.2,62400
812,75493.0,56.0,102.0,75493
813,75000.0,70.0,80.0,75000
814,70000.0,70.0,70.0,70000
815,51724.0,35.0,76.0,51724
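
To make the arithmetic concrete: an average of $30/hr becomes 30 × 40 × 52 = $62,400/yr, and the $20 to $40 hourly range becomes 41.6 to 83.2 in thousands of dollars, matching rows 810 and 811 above. The sketch below reproduces this on a hypothetical two-row frame. It selects rows with the 'Average Hourly Rate' flag rather than by looking for a decimal point, which is a swapped-in approach and not what the cell above does.

```python
import numpy as np
import pandas as pd

HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

# Hypothetical rows: one salaried, one hourly (flag = 1)
df = pd.DataFrame({
    "Salary Minimum": [44.0, 20.0],     # thousands of dollars, or $/hr when flagged
    "Salary Maximum": [75.0, 40.0],
    "Salary Average": [57568.0, 30.0],  # dollars, or $/hr when flagged
    "Average Hourly Rate": [0, 1],
})

hourly = df["Average Hourly Rate"] == 1
hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR  # 2080

# Min/max are stored in thousands, so divide the yearly figure by 1000
for col in ["Salary Minimum", "Salary Maximum"]:
    df[col] = np.where(hourly, df[col] * hours_per_year / 1000, df[col])

# The average is stored in dollars, so no scaling is needed
df["Converted Salary"] = np.where(
    hourly, df["Salary Average"] * hours_per_year, df["Salary Average"]
)
print(df)
```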


With the hourly rate converted to a yearly salary and a column marking which rows had an hourly value, we can now move on to some simpler tasks, such as converting data types and cleaning up some strings.

In [40]:
# remove the trailing newline and rating (e.g. '\n4.3') from Company Name strings
data_cleaned['Company Name'] = data_cleaned["Company Name"].apply(lambda x: x.split('\n', 1)[0] if x.find('\n') != -1 else x)
data_cleaned['Company Name'].head(15)


0                                CCS
1                            AV-Worx
2                      System Source
3                  A-V Services Inc.
4                   JVN Systems Inc.
5          Technology Providers Inc.
6     Network Cabling Services (NCS)
7                   Johnson Controls
8                              Vario
9         Spectra Audio Design Group
10                     Vistacom, Inc
11                             Bluum
12                   The Mom Project
13                       Robert Half
14            TM Technology Partners
Name: Company Name, dtype: object
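
The same cleanup can be expressed with the pandas string accessor. A small sketch, using a hypothetical `names` series in the format shown earlier (company name, newline, rating):

```python
import pandas as pd

# Hypothetical values: two names carry a trailing rating, one does not
names = pd.Series(["AV-Worx\n4.3", "CCS", "Johnson Controls\n3.6"])

# Keep only the text before the first newline; names without one are unchanged
company = names.str.split("\n", n=1).str[0]
print(company.tolist())
```

Splitting on the newline, instead of slicing off a fixed number of characters, keeps names intact when a company has no rating.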

In [41]:
# split location to city and state


#str(string_check.iloc[0]).split(',')[1]
#a = 0
#for i in DA_data_cleaned["Location"]:
#    if ',' in str(DA_data_cleaned["Location"].iloc[i]):
#        DA_data_cleaned["City"] = str(DA_data_cleaned["Location"].iloc[i]).split(',')[0]
#        DA_data_cleaned["State"] = str(DA_data_cleaned["Location"].iloc[i]).split(',')[1]
#        a += 1
#    else:
#        DA_data_cleaned["City"] = "Remote"
#        DA_data_cleaned["State"] = "Remote"
#        a +=1

data_cleaned["City"] = data_cleaned["Location"].apply(lambda x: x.split(', ')[0])
data_cleaned["State"] = data_cleaned["Location"].apply(lambda x: x.split(',')[-1]) # comma-free values such as "Remote" pass through unchanged and are normalized below
data_cleaned["State"] = data_cleaned["State"].apply(lambda x: x.strip() if x.strip().lower() != 'manhattan' else 'NY') # edge case, comment out if needed
data_cleaned["State"] = data_cleaned["State"].apply(lambda x: "PA" if x.strip().lower() == 'pennsylvania' else x)# edge case, comment out if needed
data_cleaned["State"] = data_cleaned["State"].apply(lambda x: "NY" if x.strip().lower() == 'new york state' else x)# edge case, comment out if needed
data_cleaned["State"] = data_cleaned["State"].apply(lambda x: "Remote" if x.strip().lower() == 'united states' else x)# edge case, comment out if needed
data_cleaned['State'].value_counts()


NY        233
CA         91
Remote     57
TX         47
FL         46
GA         43
WA         42
NJ         32
ID         30
DC         29
PA         29
MI         29
MN         20
CO         19
HI         17
AZ         16
CT         11
MA          7
MD          3
OK          3
IL          2
MO          2
OH          2
NV          2
LA          2
UT          1
VA          1
Name: State, dtype: int64
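
One way to keep the edge cases in a single place is a lookup dictionary. This is a sketch only: `STATE_FIXES` and `split_location` are hypothetical names, and the mapping simply restates the four edge cases handled above.

```python
import pandas as pd

# Hypothetical lookup restating the edge cases handled above
STATE_FIXES = {
    "manhattan": "NY",
    "pennsylvania": "PA",
    "new york state": "NY",
    "united states": "Remote",
}

def split_location(location):
    """Return (city, state); a comma-free value is used for both parts."""
    parts = [part.strip() for part in location.split(",")]
    city, state = parts[0], parts[-1]
    return city, STATE_FIXES.get(state.lower(), state)

locations = pd.Series(["Denver, CO", "Remote", "Pennsylvania", "United States"])
city_state = pd.DataFrame(
    locations.map(split_location).tolist(), columns=["City", "State"]
)
print(city_state)
```

New edge cases then only need a new dictionary entry rather than another `apply` line.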

In [42]:
# convert year founded to years in existence
currentyear = datetime.now().year
# NaN never compares equal to itself, so an `x == np.nan` check cannot flag missing years; they are left as NaN here
data_cleaned['Company Age (years)'] = currentyear - data_cleaned["Founded"]
data_cleaned

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate,Converted Salary,City,State,Company Age (years)
0,Audio Visual System Design Engineer,61.0,100.0,80500.0,,CCS,"Denver, CO",,,,,,,0,80500,Denver,CO,
1,Audio Visual Design Engineer,85.0,110.0,97500.0,4.3,AV-Worx,"West Palm Beach, FL",1 to 50 Employees,2014.0,Company - Private,Business Consulting,Management & Consulting,$5 to $10 million (USD),0,97500,West Palm Beach,FL,8.0
2,Audio Visual Systems Field Engineer,48.0,83.0,63005.0,4.3,System Source,"Hunt Valley, MD",51 to 200 Employees,1981.0,Company - Private,Information Technology Support Services,Information Technology,$10 to $25 million (USD),0,63005,Hunt Valley,MD,41.0
3,(NY) Audio/Visual Design Engineer,50.0,98.0,69978.0,3.8,A-V Services Inc.,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0,69978,New York,NY,62.0
4,Audio Visual Engineer,44.0,103.0,67259.0,3.1,JVN Systems Inc.,"Deer Park, NY",1 to 50 Employees,,Company - Private,,,$5 to $10 million (USD),0,67259,Deer Park,NY,
5,Audio Visual Design Engineer,65.0,90.0,77500.0,,Technology Providers Inc.,"Gilbert, AZ",,,,,,,0,77500,Gilbert,AZ,
6,Audio Visual Design Engineer/Estimator,85.0,105.0,95000.0,3.2,Network Cabling Services (NCS),"Dallas, TX",201 to 500 Employees,,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0,95000,Dallas,TX,
7,"Pre-Sales Design Engineer, Audio Visual Remote",56.0,102.0,75493.0,3.6,Johnson Controls,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD),0,75493,Roswell,GA,137.0
8,Audio Visual Sales Engineer,70.0,80.0,75000.0,,Vario,Remote,,,,,,,0,75000,Remote,Remote,
9,Audio Visual Design Sales Engineer,65.0,85.0,75000.0,3.6,Spectra Audio Design Group,"New York, NY",1 to 50 Employees,,Company - Private,Advertising & Public Relations,Media & Communication,$5 to $10 million (USD),0,75000,New York,NY,
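
The blank ages above come from how NaN behaves in comparisons and arithmetic. A short sketch on a hypothetical `founded` series shows this, along with an optional -1 sentinel for unknown ages (whether a sentinel is wanted is an assumption, not something the output above uses):

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical founding years, one of them missing
founded = pd.Series([2014.0, np.nan, 1885.0])
current_year = datetime.now().year

# NaN is never equal to anything, including itself
nan_equals_itself = np.nan == np.nan

# Subtraction propagates NaN, so a missing year gives a missing age
age = current_year - founded

# Optional: mark unknown ages with a -1 sentinel instead of NaN
age_flagged = age.fillna(-1)
print(age_flagged.tolist())
```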


In [43]:
# Group jobs under archetypes (junior vs. senior, engineer vs. technician)
pd.set_option('display.max_rows', None)
data_cleaned["Job Title"].value_counts() # count how often each job title occurs

Audio Visual Engineer                                                      62
Audio Visual Design Engineer                                               61
Audio Visual Technician                                                    44
Audio Visual Sales Engineer                                                38
Systems Engineer - Audio/Visual Integration                                32
Pre-Sales Design Engineer, Audio Visual Remote                             31
Senior Audio Visual Systems Engineer                                       30
(NY) Audio/Visual Design Engineer                                          29
Audio Visual Technician (Sr. Engineer)                                     29
AUDIO VIDEO SYSTEMS ENGINEER/Jr. CRESTRON PROGRAMMER                       29
Audio Visual Engineer - Information Technology Consultant - Career         28
Sr. Network Technician-Audio Visual 1                                      26
Audio Visual (AV) Engineer                                      

We can see from above that there are already some redundancies due to small variations in the titles (e.g. "Audio Visual Engineer" vs. "Audio Visual (AV) Engineer"). Let's group the jobs together with a function that searches the titles and combines everything under similar banners (engineer, technician, etc.), plus a second function that flags seniority. Ken Jee created some nice functions that will serve us well, so if you wish to see more, here is the link: https://youtu.be/QWgg4w1SpJ8.

In [44]:
#define functions to bin jobs into groups
def title_condencer(title):
    if 'engineer' in title.lower():
        return 'engineer'
    elif 'technician' in title.lower():
        return 'technician'
    else:
        return 'Unbinned'

#identify if there is a seniority or level flag
def seniority(title):
    title = title.lower()
    # substring matches: 'sr' and 'jr' also cover 'sr.' and 'jr.'
    if any(flag in title for flag in ('sr', 'senior', 'lead', 'principal', 'iii')):
        return 'senior'
    elif any(flag in title for flag in ('jr', 'junior')):
        return 'junior'
    else:
        return 'na'

In [45]:
#Check binning
data_cleaned['Job Title'] = data_cleaned['Job Title'].values.astype(str)
data_cleaned['Title Grouping'] = data_cleaned['Job Title'].apply(title_condencer)
data_cleaned['Title Grouping'].value_counts()

engineer      681
technician    115
Unbinned       20
Name: Title Grouping, dtype: int64

In [46]:
#check seniority level
data_cleaned['Seniority Level'] = data_cleaned['Job Title'].apply(seniority)
data_cleaned['Seniority Level'].value_counts()

na        608
senior    179
junior     29
Name: Seniority Level, dtype: int64
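
Because the seniority check uses plain substring matching, a keyword like 'lead' would also match a longer word such as 'leadership'. A possible refinement, sketched below under the assumption that whole-word matching is what is wanted, uses regular expressions with word boundaries. The function name `seniority_by_word` is hypothetical and not part of the notebook.

```python
import re

# \b marks a word boundary, so 'lead' matches 'Lead' but not 'Leadership'
SENIOR = re.compile(r"\b(sr|senior|lead|principal|iii)\b", re.IGNORECASE)
JUNIOR = re.compile(r"\b(jr|junior)\b", re.IGNORECASE)

def seniority_by_word(title):
    """Classify a job title as 'senior', 'junior', or 'na' by whole-word match."""
    if SENIOR.search(title):
        return "senior"
    if JUNIOR.search(title):
        return "junior"
    return "na"

titles = [
    "Audio Visual Technician (Sr. Engineer)",
    "AUDIO VIDEO SYSTEMS ENGINEER/Jr. CRESTRON PROGRAMMER",
    "Lead Audio Visual Field Technician",
    "Audio Visual Engineer",
]
levels = [seniority_by_word(t) for t in titles]
print(levels)
```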

In [47]:
#input the filename you wish to save the information as
file = r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\AV_eng_data_cleaned.csv" #place the filepath between the quotes

data_cleaned.to_csv(file, index = False)

In [None]:
# check for string answers for Ownership column - to do

In [None]:
# Replace string "unknown/Non-applicable" in revenue with NaN - to do

In [None]:
# remove texts from revenue and convert revenue range to an average revenue as an int/float
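
For the two revenue to-dos, one possible approach is sketched below. It is not part of the finished notebook: `revenue_midpoint`, `UNITS`, and `AMOUNT` are hypothetical names, and the sketch assumes the revenue strings follow the formats seen in the rows above ("$5 to $10 million (USD)", "$500 million to $1 billion (USD)", "$10+ billion (USD)", "Unknown / Non-Applicable").

```python
import re

import numpy as np

UNITS = {"million": 1e6, "billion": 1e9}
# A dollar amount, an optional '+', and an optional unit word
AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)\+?\s*(million|billion)?")

def revenue_midpoint(text):
    """Return the midpoint of a revenue range in USD, or NaN if nothing parses."""
    if not isinstance(text, str):
        return np.nan  # already a null
    found = AMOUNT.findall(text)
    if not found:
        return np.nan  # e.g. 'Unknown / Non-Applicable'
    # An amount without its own unit borrows the unit of the last amount
    default_unit = found[-1][1]
    values = [float(num) * UNITS.get(unit or default_unit, 1) for num, unit in found]
    return sum(values) / len(values)

examples = [
    "$5 to $10 million (USD)",
    "$500 million to $1 billion (USD)",
    "$10+ billion (USD)",
    "Unknown / Non-Applicable",
]
midpoints = [revenue_midpoint(e) for e in examples]
print(midpoints)
```

An open-ended value such as "$10+ billion (USD)" has no upper bound, so this sketch simply returns the lower bound. Whether that is acceptable depends on how the revenue column will be used.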