# Data Cleanup
Now let's begin the process of cleaning up the string data we scraped from Glassdoor.

In [125]:
# libraries needed
import pandas as pd
import numpy as np
from datetime import datetime

pd.set_option('display.max_rows', 100)
pd.options.mode.chained_assignment = None

In [126]:
# get some information on the saved data
file_name = r"" # enter the filepath between the quotes
data = pd.read_csv(file_name)
data.head(10)

FileNotFoundError: [Errno 2] No such file or directory: ''

In [127]:
#We see some NaN values, so let's confirm they are recognized as nulls
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          700 non-null    object 
 1   Salary Minimum     630 non-null    object 
 2   Salary Maximum     630 non-null    object 
 3   Salary Average     630 non-null    object 
 4   Rating             657 non-null    float64
 5   Company Name       700 non-null    object 
 6   Location           700 non-null    object 
 7   Size               670 non-null    object 
 8   Founded            564 non-null    float64
 9   Type of ownership  670 non-null    object 
 10  Industry           638 non-null    object 
 11  Sector             638 non-null    object 
 12  Revenue            670 non-null    object 
dtypes: float64(2), object(11)
memory usage: 71.2+ KB


Above, we can see that some columns are fully populated, such as 'Job Title', but others have nulls. I also see some blank values where there should be NaNs, so I will run through the data and replace empty cells with NaN. Beyond that, I want to convert the year founded into years in existence, remove the duplicates I see, strip the non-numeric characters from the salary columns and convert them to floats, and clean up the trailing characters at the end of the company names (a \n followed by the rating). It would be useful to split the location into city and state. I may want to reduce the type-of-ownership strings to just Private vs Public, but I'll review the values first to confirm. Finally, I will have to review the revenue data and convert the Unknown / Non-Applicable entries into nulls, then determine whether the information can be used. **A significant and fun list!**
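Duplicate removal, one of the items in the list above, could be sketched as follows. This is a minimal example on a toy frame whose values are illustrative only; it assumes that a duplicate means a row identical in every column, which may be too strict or too loose for the real scrape.

```python
import pandas as pd

# Toy frame standing in for the scraped data (illustrative values, not real rows)
toy = pd.DataFrame({
    "Job Title": ["Data Analyst", "Data Analyst", "IT Data Analyst"],
    "Company Name": ["KEYENCE", "KEYENCE", "Peraton"],
    "Location": ["Itasca, IL", "Itasca, IL", "Remote"],
})

# Keep the first copy of each fully identical row, then renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "rows ->", len(deduped), "rows")
```

If the same posting can appear with small differences (for example a changed rating), passing a `subset` of columns to `drop_duplicates` would be one way to loosen the match.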

In [128]:
#Replace empty cells with NaN
#r = raw string. ^ = start of string, $ = end of string, \s* = zero or more whitespace characters (so empty and whitespace-only cells both match)
data = data.replace(r'^\s*$', np.nan, regex=True) # we use regex to check the cell expression and see if it matches the input
data.head(10)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue
0,IT-Data Analyst,$45K,$82K,"$61,021 /yr (est.)",3.9,Federated Mutual Insurance Company\n3.9,"Owatonna, MN",1001 to 5000 Employees,1904.0,Company - Private,Insurance Carriers,Insurance,$1 to $2 billion (USD)
1,Research and Data Analyst,$51K,$91K,"$67,848 /yr (est.)",4.1,Association of American Medical Colleges\n4.1,"Washington, DC",501 to 1000 Employees,1876.0,Nonprofit Organization,Health Care Services & Hospitals,Healthcare,$100 to $500 million (USD)
2,Jr. Data Analyst,$34.00 /hr,$36.00,$35.00 /hr (est.),4.9,Spartan Capital Group LLC\n4.9,Remote,1 to 50 Employees,2018.0,Company - Private,Business Consulting,Management & Consulting,$1 to $5 million (USD)
3,Data Analyst + Apprentice (Entry-Level),$35K,$45K,"$40,000 /yr (est.)",3.3,New Apprenticeship\n3.3,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable
4,Data Analyst,$90K,$110K,"$100,000 /yr (est.)",4.6,Store Space Self Storage\n4.6,"Greenwood Village, CO",201 to 500 Employees,2017.0,Company - Private,Real Estate,Real Estate,Unknown / Non-Applicable
5,Business Data Analyst,,,,3.7,Richardson Electronics Ltd\n3.7,"Lafox, IL",201 to 500 Employees,1947.0,Company - Public,Wholesale,Retail & Wholesale,$100 to $500 million (USD)
6,Data Modelling Analyst,$40K,$85K,"$58,299 /yr (est.)",3.9,Aurora Energy Research Limited\n3.9,"Austin, TX",201 to 500 Employees,2013.0,Company - Private,Energy & Utilities,"Energy, Mining & Utilities",Unknown / Non-Applicable
7,Data Analyst,$97K,$135K,"$116,000 /yr (est.)",3.6,KEYENCE\n3.6,"Itasca, IL",5001 to 10000 Employees,1974.0,Company - Public,Machinery Manufacturing,Manufacturing,$5 to $10 million (USD)
8,Cybersecurity Analyst II (Remote),$100K,$180K,"$140,000 /yr (est.)",3.8,The Home Depot\n3.8,"Atlanta, GA",10000+ Employees,1978.0,Company - Public,Home Furniture & Housewares Stores,Retail & Wholesale,$10+ billion (USD)
9,Data Analyst,$80K,$100K,"$90,000 /yr (est.)",3.1,Entourage Freight Solutions\n3.1,"Westerville, OH",201 to 500 Employees,1985.0,Company - Private,Food & Beverage Manufacturing,Manufacturing,Unknown / Non-Applicable
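To see what the `^\s*$` pattern does and does not catch, here is a small sketch on a toy frame. The column names are borrowed from the data above, but the cell values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one empty string, one spaces-only cell, one tab-only cell
toy = pd.DataFrame({
    "Size": ["1 to 50 Employees", "", "   "],
    "Revenue": ["Unknown / Non-Applicable", "\t", "$1 to $5 million (USD)"],
})

# ^\s*$ matches cells that are empty or contain only whitespace
blank_as_nan = toy.replace(r'^\s*$', np.nan, regex=True)
null_counts = blank_as_nan.isnull().sum()
print(null_counts)
```

Note that a string such as "Unknown / Non-Applicable" is left untouched, since it contains non-whitespace characters; it has to be handled separately.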


In [129]:
# checking columns for nulls
data.isnull().sum()

Job Title              0
Salary Minimum        70
Salary Maximum        70
Salary Average        70
Rating                43
Company Name           0
Location               0
Size                  30
Founded              136
Type of ownership     30
Industry              62
Sector                62
Revenue               30
dtype: int64

In [130]:
#Remove rows where the salary is null, since they don't help us
data_cleaned = data.dropna(axis=0,subset=['Salary Average', 'Salary Minimum'])
data_cleaned.isnull().sum()

Job Title              0
Salary Minimum         0
Salary Maximum         0
Salary Average         0
Rating                40
Company Name           0
Location               0
Size                  27
Founded              131
Type of ownership     27
Industry              58
Sector                58
Revenue               27
dtype: int64

In [131]:
# separate hourly rows from salary rows
data_cleaned = pd.DataFrame(data = data_cleaned) # convert the slice to a pandas dataframe to work with it
data_cleaned['Average Hourly Rate'] = data_cleaned["Salary Average"].apply(lambda x: 1 if '/hr' in x.lower() else 0)
data_cleaned = data_cleaned.reset_index(drop=True)
data_cleaned.head(20)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate
0,IT-Data Analyst,$45K,$82K,"$61,021 /yr (est.)",3.9,Federated Mutual Insurance Company\n3.9,"Owatonna, MN",1001 to 5000 Employees,1904.0,Company - Private,Insurance Carriers,Insurance,$1 to $2 billion (USD),0
1,Research and Data Analyst,$51K,$91K,"$67,848 /yr (est.)",4.1,Association of American Medical Colleges\n4.1,"Washington, DC",501 to 1000 Employees,1876.0,Nonprofit Organization,Health Care Services & Hospitals,Healthcare,$100 to $500 million (USD),0
2,Jr. Data Analyst,$34.00 /hr,$36.00,$35.00 /hr (est.),4.9,Spartan Capital Group LLC\n4.9,Remote,1 to 50 Employees,2018.0,Company - Private,Business Consulting,Management & Consulting,$1 to $5 million (USD),1
3,Data Analyst + Apprentice (Entry-Level),$35K,$45K,"$40,000 /yr (est.)",3.3,New Apprenticeship\n3.3,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable,0
4,Data Analyst,$90K,$110K,"$100,000 /yr (est.)",4.6,Store Space Self Storage\n4.6,"Greenwood Village, CO",201 to 500 Employees,2017.0,Company - Private,Real Estate,Real Estate,Unknown / Non-Applicable,0
5,Data Modelling Analyst,$40K,$85K,"$58,299 /yr (est.)",3.9,Aurora Energy Research Limited\n3.9,"Austin, TX",201 to 500 Employees,2013.0,Company - Private,Energy & Utilities,"Energy, Mining & Utilities",Unknown / Non-Applicable,0
6,Data Analyst,$97K,$135K,"$116,000 /yr (est.)",3.6,KEYENCE\n3.6,"Itasca, IL",5001 to 10000 Employees,1974.0,Company - Public,Machinery Manufacturing,Manufacturing,$5 to $10 million (USD),0
7,Cybersecurity Analyst II (Remote),$100K,$180K,"$140,000 /yr (est.)",3.8,The Home Depot\n3.8,"Atlanta, GA",10000+ Employees,1978.0,Company - Public,Home Furniture & Housewares Stores,Retail & Wholesale,$10+ billion (USD),0
8,Data Analyst,$80K,$100K,"$90,000 /yr (est.)",3.1,Entourage Freight Solutions\n3.1,"Westerville, OH",201 to 500 Employees,1985.0,Company - Private,Food & Beverage Manufacturing,Manufacturing,Unknown / Non-Applicable,0
9,IT Data Analyst,$62K,$100K,"$81,000 /yr (est.)",3.0,"DTRIC Insurance Company, Limited\n3.0","Honolulu, HI",51 to 200 Employees,1992.0,Company - Public,Insurance Carriers,Insurance,$10 to $25 million (USD),0


We can see that some rows hold an average hourly rate rather than a yearly salary. Let's flag those rows so we can compare the hourly rates to the yearly salaries given when we analyze the data.
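The same flag can also be built without a lambda. This is a sketch on a toy Series (the strings mirror the format shown above) using pandas' vectorized string methods:

```python
import pandas as pd

salary_avg = pd.Series([
    "$61,021 /yr (est.)",
    "$35.00 /hr (est.)",
    "$100,000 /yr (est.)",
])

# 1 where the string mentions an hourly rate, 0 otherwise
is_hourly = salary_avg.str.contains("/hr", case=False, regex=False).astype(int)
print(is_hourly.tolist())
```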

In [132]:
#clean up Salary min/max/average strings so that only the numeric part remains
# remove '$', ',', 'K', '/hr' and the trailing '/yr (est.)'; the values stay strings for now
# note: once the 'K' is stripped, yearly minimum/maximum values are in thousands of dollars

#First, let's clean up the Average Salary
salary_avg = data_cleaned['Salary Average'].apply(lambda x: x.split("/")[0].strip())
salary_avg = salary_avg.apply(lambda x: x.replace('$', '').replace(',',''))
data_cleaned['Salary Average'] = salary_avg

#Now the minimum salary
salary_min = data_cleaned['Salary Minimum'].apply(lambda x:x.replace('$', '').replace('K','').replace('/hr', ''))
data_cleaned['Salary Minimum']=salary_min


#Now the maximum salary
salary_max = data_cleaned['Salary Maximum'].apply(lambda x:x.replace('$', '').replace('K',''))
data_cleaned['Salary Maximum']=salary_max
data_cleaned.tail(10)

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate
620,Underwriting Data Analyst I,47.0,78.0,60754.0,3.7,DentaQuest\n3.7,"Boston, MA",1001 to 5000 Employees,2001.0,Company - Private,Insurance Carriers,Insurance,$2 to $5 billion (USD),0
621,IT-Data Analyst,45.0,82.0,61021.0,3.9,Federated Mutual Insurance Company\n3.9,"Owatonna, MN",1001 to 5000 Employees,1904.0,Company - Private,Insurance Carriers,Insurance,$1 to $2 billion (USD),0
622,IT-Data Analyst,45.0,82.0,61021.0,3.9,Federated Mutual Insurance Company\n3.9,"Owatonna, MN",1001 to 5000 Employees,1904.0,Company - Private,Insurance Carriers,Insurance,$1 to $2 billion (USD),0
623,Research and Data Analyst,51.0,91.0,67848.0,4.1,Association of American Medical Colleges\n4.1,"Washington, DC",501 to 1000 Employees,1876.0,Nonprofit Organization,Health Care Services & Hospitals,Healthcare,$100 to $500 million (USD),0
624,Jr. Data Analyst,34.0,36.0,35.0,4.9,Spartan Capital Group LLC\n4.9,Remote,1 to 50 Employees,2018.0,Company - Private,Business Consulting,Management & Consulting,$1 to $5 million (USD),1
625,Data Analyst,90.0,110.0,100000.0,4.6,Store Space Self Storage\n4.6,"Greenwood Village, CO",201 to 500 Employees,2017.0,Company - Private,Real Estate,Real Estate,Unknown / Non-Applicable,0
626,Data Modelling Analyst,40.0,85.0,58299.0,3.9,Aurora Energy Research Limited\n3.9,"Austin, TX",201 to 500 Employees,2013.0,Company - Private,Energy & Utilities,"Energy, Mining & Utilities",Unknown / Non-Applicable,0
627,Data Analyst + Apprentice (Entry-Level),35.0,45.0,40000.0,3.3,New Apprenticeship\n3.3,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable,0
628,Data Analyst,97.0,135.0,116000.0,3.6,KEYENCE\n3.6,"Itasca, IL",5001 to 10000 Employees,1974.0,Company - Public,Machinery Manufacturing,Manufacturing,$5 to $10 million (USD),0
629,Cybersecurity Analyst II (Remote),100.0,180.0,140000.0,3.8,The Home Depot\n3.8,"Atlanta, GA",10000+ Employees,1978.0,Company - Public,Home Furniture & Housewares Stores,Retail & Wholesale,$10+ billion (USD),0
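After this cleanup, a yearly minimum such as $45K is stored as 45 while the average is stored in whole dollars. If the bounds should share the average's scale, one possible approach is sketched below. `parse_bound` is a hypothetical helper, not part of the notebook, and it assumes the only formats are the ones visible in the rows above.

```python
import pandas as pd

def parse_bound(text):
    """Turn strings like '$45K' or '$34.00 /hr' into a float dollar amount."""
    cleaned = text.replace("$", "").replace(",", "").replace("/hr", "").strip()
    if cleaned.upper().endswith("K"):
        # 'K' means thousands of dollars per year
        return float(cleaned[:-1]) * 1000
    # no 'K' suffix: treat the number as dollars (hourly rows in this data)
    return float(cleaned)

bounds = pd.Series(["$45K", "$34.00 /hr", "$36.00"]).apply(parse_bound)
print(bounds.tolist())
```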


Now let's convert the hourly values to a yearly salary.

In [133]:
# convert hourly pay to a yearly salary, assuming 40 hours/week and 52 weeks/year,
# and store the yearly estimate in a new column
def hr_to_year(rate):
    rate = float(rate) # convert the string to a float, keeping any cents
    return int(round(rate * 40 * 52))

# hourly values are the ones that still contain a decimal point (e.g. 35.00)
data_cleaned["Converted Salary"] = data_cleaned["Salary Average"].apply(lambda x: hr_to_year(x) if x.find('.') != -1 else x)
data_cleaned[["Salary Average","Converted Salary"]].head(10)


Unnamed: 0,Salary Average,Converted Salary
0,61021.0,61021
1,67848.0,67848
2,35.0,72800
3,40000.0,40000
4,100000.0,100000
5,58299.0,58299
6,116000.0,116000
7,140000.0,140000
8,90000.0,90000
9,81000.0,81000


With the hourly rate converted to a yearly salary and a column marking which rows had an hourly value, we can now move on to some simpler tasks, such as converting data types and cleaning up some strings.
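As a worked check of the arithmetic: $35.00/hr × 40 hours × 52 weeks = $72,800/yr. The sketch below does the same conversion using the 'Average Hourly Rate' flag instead of looking for a decimal point. The toy values and the two constants are assumptions for illustration (a full-time schedule with no unpaid weeks).

```python
import pandas as pd

HOURS_PER_WEEK = 40   # assumption: full-time schedule
WEEKS_PER_YEAR = 52   # assumption: no unpaid weeks

toy = pd.DataFrame({
    "Salary Average": ["61021", "35.00", "100000"],
    "Average Hourly Rate": [0, 1, 0],
})

amount = toy["Salary Average"].astype(float)
hourly = toy["Average Hourly Rate"] == 1

# Scale only the rows flagged as hourly; yearly figures pass through unchanged
toy["Converted Salary"] = amount.where(~hourly, amount * HOURS_PER_WEEK * WEEKS_PER_YEAR)
print(toy)
```

Relying on the flag rather than on the presence of a '.' keeps the conversion correct even if a yearly figure ever arrives with decimals.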

In [134]:
# strip the rating suffix (the newline and everything after it) from Company Name strings
data_cleaned['Company Name'] = data_cleaned["Company Name"].apply(lambda x: x.split('\n', 1)[0] if x.find('\n') != -1 else x)
data_cleaned['Company Name'].head(15)


0           Federated Mutual Insurance Company
1     Association of American Medical Colleges
2                    Spartan Capital Group LLC
3                           New Apprenticeship
4                     Store Space Self Storage
5               Aurora Energy Research Limited
6                                      KEYENCE
7                               The Home Depot
8                  Entourage Freight Solutions
9             DTRIC Insurance Company, Limited
10                               meadows group
11                         Driven Brands, Inc.
12                             G2 Secure Staff
13                                     Peraton
14              Amica Mutual Insurance Company
Name: Company Name, dtype: object

In [135]:
# split location to city and state


#str(string_check.iloc[0]).split(',')[1]
#a = 0
#for i in DA_data_cleaned["Location"]:
#    if ',' in str(DA_data_cleaned["Location"].iloc[i]):
#        DA_data_cleaned["City"] = str(DA_data_cleaned["Location"].iloc[i]).split(',')[0]
#        DA_data_cleaned["State"] = str(DA_data_cleaned["Location"].iloc[i]).split(',')[1]
#        a += 1
#    else:
#        DA_data_cleaned["City"] = "Remote"
#        DA_data_cleaned["State"] = "Remote"
#        a +=1

data_cleaned["City"] = data_cleaned["Location"].apply(lambda x: x.split(', ')[0])
# the last comma-separated piece is the state; locations without a comma (e.g. "Remote") are kept as-is
data_cleaned["State"] = data_cleaned["Location"].apply(lambda x: x.split(',')[-1].strip())
# some listings give "Manhattan" where the state should be
data_cleaned["State"] = data_cleaned["State"].apply(lambda x: 'NY' if x.lower() == 'manhattan' else x)
data_cleaned['State'].value_counts()


Remote    87
NY        74
CO        71
IL        54
CA        45
TX        38
PA        32
VA        32
GA        29
HI        25
FL        16
MO        15
OH        15
NJ        14
DC        13
MN        11
TN         9
NC         9
MA         8
MI         7
AZ         6
RI         6
NE         5
IN         3
UT         2
SC         2
CT         1
MD         1
Name: State, dtype: int64

In [136]:
# convert year founded to years in existence
currentyear = datetime.now().year
# rows with no founding year stay NaN (NaN propagates through the subtraction)
data_cleaned['Company Age (years)'] = currentyear - data_cleaned["Founded"]
data_cleaned

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate,Converted Salary,City,State,Company Age (years)
0,IT-Data Analyst,45,82,61021,3.9,Federated Mutual Insurance Company,"Owatonna, MN",1001 to 5000 Employees,1904.0,Company - Private,Insurance Carriers,Insurance,$1 to $2 billion (USD),0,61021,Owatonna,MN,118.0
1,Research and Data Analyst,51,91,67848,4.1,Association of American Medical Colleges,"Washington, DC",501 to 1000 Employees,1876.0,Nonprofit Organization,Health Care Services & Hospitals,Healthcare,$100 to $500 million (USD),0,67848,Washington,DC,146.0
2,Jr. Data Analyst,34.00,36.00,35.00,4.9,Spartan Capital Group LLC,Remote,1 to 50 Employees,2018.0,Company - Private,Business Consulting,Management & Consulting,$1 to $5 million (USD),1,72800,Remote,Remote,4.0
3,Data Analyst + Apprentice (Entry-Level),35,45,40000,3.3,New Apprenticeship,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable,0,40000,Raleigh,NC,
4,Data Analyst,90,110,100000,4.6,Store Space Self Storage,"Greenwood Village, CO",201 to 500 Employees,2017.0,Company - Private,Real Estate,Real Estate,Unknown / Non-Applicable,0,100000,Greenwood Village,CO,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
625,Data Analyst,90,110,100000,4.6,Store Space Self Storage,"Greenwood Village, CO",201 to 500 Employees,2017.0,Company - Private,Real Estate,Real Estate,Unknown / Non-Applicable,0,100000,Greenwood Village,CO,5.0
626,Data Modelling Analyst,40,85,58299,3.9,Aurora Energy Research Limited,"Austin, TX",201 to 500 Employees,2013.0,Company - Private,Energy & Utilities,"Energy, Mining & Utilities",Unknown / Non-Applicable,0,58299,Austin,TX,9.0
627,Data Analyst + Apprentice (Entry-Level),35,45,40000,3.3,New Apprenticeship,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable,0,40000,Raleigh,NC,
628,Data Analyst,97,135,116000,3.6,KEYENCE,"Itasca, IL",5001 to 10000 Employees,1974.0,Company - Public,Machinery Manufacturing,Manufacturing,$5 to $10 million (USD),0,116000,Itasca,IL,48.0
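Missing founding years show up as blank ages in the table above because NaN propagates through arithmetic and never compares equal to anything, itself included. A minimal sketch on a toy Series follows. The reference year is fixed only to make the example reproducible, and the -1 sentinel is one optional convention, not a requirement.

```python
import numpy as np
import pandas as pd

founded = pd.Series([1904.0, np.nan, 2017.0])
reference_year = 2022  # fixed here so the example is reproducible

# NaN == NaN is False, so an equality test cannot detect missing values
print(np.nan == np.nan)

age = reference_year - founded      # missing years stay missing
age_with_flag = age.fillna(-1)      # optional sentinel for "unknown"
print(age_with_flag.tolist())
```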


In [137]:
# Group jobs under archetypes (junior vs. senior, analyst vs. business analyst)
data_cleaned["Job Title"].value_counts() # count how often each job title occurs

Data Analyst                                                                 150
Jr. Data Analyst                                                              30
Data Analyst I                                                                27
IT Data Analyst                                                               22
Data Analyst - Merchandising (Remote)                                         20
Data Analyst (remote)                                                         20
Cybersecurity Analyst II (Remote)                                             20
Data Analyst (focus on PowerBI and DAX)                                       18
Senior Data Analyst                                                           15
Remote Data Manipulation Analyst                                              14
Junior Data Analyst                                                           14
Strategy and Data Analyst, Sustainability                                     12
Data and Systems Analyst    

We can see from above that there are already some redundancies due to small differences in the titles (e.g. 'Jr. Data Analyst' and 'Junior Data Analyst'). Let's group the jobs together with a function that searches the titles and combines everything under similar banners (manager, analyst, specialist, etc.). Ken Jee created some nice functions that will serve us well, so if you wish to see more, here is the link: https://youtu.be/QWgg4w1SpJ8.

In [138]:
#define functions to bin jobs into groups
def title_condencer(title):
    title = title.lower() # lowercase once so every check is case-insensitive
    if 'scientist' in title:
        return 'data scientist'
    elif 'data engineer' in title:
        return 'data engineer'
    elif 'machine learning' in title:
        return 'machine learning'
    elif 'analyst' in title:
        return 'analyst'
    elif 'manager' in title:
        return 'manager'
    elif 'specialist' in title:
        return 'specialist'
    elif 'business' in title:
        return 'business-based'
    else:
        return 'Unbinned'

#identify if there is a seniority or level flag
def seniority(title):
    title = title.lower()
    # 'sr' and 'jr' also cover 'sr.' and 'jr.'
    if any(flag in title for flag in ('sr', 'senior', 'lead', 'principal', 'iii')):
        return 'senior'
    elif 'jr' in title or 'junior' in title:
        return 'junior'
    else:
        return 'na'

In [139]:
#Check binning
data_cleaned['Job Title'] = data_cleaned['Job Title'].values.astype(str)
data_cleaned['Title Grouping'] = data_cleaned['Job Title'].apply(title_condencer)
data_cleaned['Title Grouping'].value_counts()

analyst       627
specialist      2
Unbinned        1
Name: Title Grouping, dtype: int64

In [140]:
#check seniority level
data_cleaned['Seniority Level'] = data_cleaned['Job Title'].apply(seniority)
data_cleaned['Seniority Level'].value_counts()

na        531
junior     51
senior     48
Name: Seniority Level, dtype: int64
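Because `seniority` looks for substrings, a flag such as 'lead' would also match a title containing 'leadership'. One possible refinement is to match whole words only, as sketched below. `seniority_by_word` is a hypothetical variant, and the 'Leadership' title in the usage lines is made up for illustration.

```python
import re

# \b marks a word boundary, so 'lead' no longer matches inside 'leadership'
SENIOR = re.compile(r"\b(sr|senior|lead|principal|iii)\b", re.IGNORECASE)
JUNIOR = re.compile(r"\b(jr|junior)\b", re.IGNORECASE)

def seniority_by_word(title):
    """Classify a job title by whole-word seniority flags."""
    if SENIOR.search(title):
        return "senior"
    if JUNIOR.search(title):
        return "junior"
    return "na"

for title in ["Sr. Data Analyst", "Jr. Data Analyst", "Data Analyst III",
              "Data Analyst, Leadership Development"]:
    print(title, "->", seniority_by_word(title))
```

Whether this changes any counts depends on the titles actually present in the data.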

In [142]:
#input the filename you wish to save the information as
file = r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\DA_data_cleaned.csv" #place the filepath between the quotes

data_cleaned.to_csv(file, index = False)

In [None]:
# check for string answers for Ownership column - to do

In [None]:
# Replace string "unknown/Non-applicable" in revenue with NaN - to do

In [None]:
# remove texts from revenue and convert revenue range to an average revenue as an int/float
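The two revenue to-do items could be approached along these lines. This is only a sketch: `revenue_midpoint` is a hypothetical helper, it assumes revenue strings follow the formats visible in the rows above ('$1 to $2 billion (USD)', '$10+ billion (USD)', 'Unknown / Non-Applicable'), and it treats an open-ended range such as '$10+ billion' as its lower bound, which understates the true figure.

```python
import re
import numpy as np
import pandas as pd

SCALE = {"million": 1e6, "billion": 1e9}
# a dollar amount, an optional '+', and an optional unit right after it
PAIR = re.compile(r"\$(\d+(?:\.\d+)?)\+?\s*(million|billion)?")

def revenue_midpoint(text):
    """Midpoint in USD of a revenue range string; NaN if it cannot be parsed."""
    if not isinstance(text, str):
        return np.nan
    values, unit = [], None
    # walk right to left so a number without its own unit borrows the one after it
    for number, found_unit in reversed(PAIR.findall(text)):
        unit = found_unit or unit
        if unit is None:
            return np.nan
        values.append(float(number) * SCALE[unit])
    return sum(values) / len(values) if values else np.nan

revenue = pd.Series([
    "$1 to $2 billion (USD)",
    "Unknown / Non-Applicable",
    "$100 to $500 million (USD)",
    "$10+ billion (USD)",
])

# First turn the placeholder into a real null, then average each range
revenue_avg = revenue.replace("Unknown / Non-Applicable", np.nan).apply(revenue_midpoint)
print(revenue_avg.tolist())
```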