# Scraping salary data for veterinary clinical pathologists from Glassdoor
My wife is a veterinary clinical pathologist and is very interested in determining the range of salaries for her profession. One way of doing that is web scraping to acquire a dataset large enough to identify trends, such as salary by city/state or by area of focus. Glassdoor is a good resource for this: it is widely used, and I can easily change the search keywords to compare against broader job categories.

For this project, I will pull together some code to generate a dataframe to analyze. For data collection, I will scrape Glassdoor using code prepared by Omer Sakarya (https://github.com/arapfaik/scraping-glassdoor-selenium). I modified the code only slightly and plan to fork it on GitHub.

In [1]:
#import libraries and scripts
import sys
sys.path.append(r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Python_scripts")
import pandas as pd
import glassdoor_scraper_mod_v2 as gsm2

import numpy as np

In [2]:
# run the scraper on Glassdoor and save the output as a .csv

path = 'C:/Program Files/Chrome_driver/chromedriver'
data = gsm2.get_jobs("Data Analyst", 1200, False, path, 2)
data.to_csv(r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\DA_raw_data", index = False) # input your preferred path


There are  60 Data Anal Jobs  for the input keyword(s).
Populating button list
Progress: 0/1200
Attempting to gather information
Looking for Information Container
Progress: 1/1200
StaleElementReferenceException while trying to click...seeking element again.
Attempting to gather information
Looking for Information Container
Progress: 2/1200
StaleElementReferenceException while trying to click...seeking element again.
Attempting to gather information
Looking for Information Container
Progress: 3/1200
StaleElementReferenceException while trying to click...seeking element again.
Attempting to gather information
Looking for Information Container
Progress: 4/1200
StaleElementReferenceException while trying to click...seeking element again.
Attempting to gather information
Looking for Information Container
Progress: 5/1200
StaleElementReferenceException while trying to click...seeking element again.
Attempting to gather information
Looking for Information Container
Progress: 6/1200
StaleEleme

ElementClickInterceptedException: Message: element click intercepted: Element <li class="react-job-listing css-bkasv9 eigr9kq0" data-brandviews="BRAND:n=jsearch-job-listing:eid=4927757:jlid=1007945081820" data-id="1007945081820" data-adv-type="GENERAL" data-is-organic-job="false" data-ad-order-id="1110586" data-sgoc-id="1019" data-is-easy-apply="false" data-normalize-job-title="Entry Level Data Analyst" data-job-loc="Raleigh, NC" data-job-loc-id="1138960" data-job-loc-type="C" data-selected="true" data-test="jobListing" style="" data-triggered-brandview="">...</li> is not clickable at point (186, 205). Other element would receive the click: <span class="css-1buaf54 pr-xxsm css-iii9i8 e1rrn5ka4">...</span>
  (Session info: chrome=102.0.5005.115)
Stacktrace:
Backtrace:
	Ordinal0 [0x0066D953+2414931]
	Ordinal0 [0x005FF5E1+1963489]
	Ordinal0 [0x004EC6B8+837304]
	Ordinal0 [0x0051FC27+1047591]
	Ordinal0 [0x0051DC08+1039368]
	Ordinal0 [0x0051B90B+1030411]
	Ordinal0 [0x0051A659+1025625]
	Ordinal0 [0x00510293+983699]
	Ordinal0 [0x0053449C+1131676]
	Ordinal0 [0x0050FC74+982132]
	Ordinal0 [0x005346B4+1132212]
	Ordinal0 [0x00544812+1198098]
	Ordinal0 [0x005342B6+1131190]
	Ordinal0 [0x0050E860+976992]
	Ordinal0 [0x0050F756+980822]
	GetHandleVerifier [0x008DCC62+2510274]
	GetHandleVerifier [0x008CF760+2455744]
	GetHandleVerifier [0x006FEABA+551962]
	GetHandleVerifier [0x006FD916+547446]
	Ordinal0 [0x00605F3B+1990459]
	Ordinal0 [0x0060A898+2009240]
	Ordinal0 [0x0060A985+2009477]
	Ordinal0 [0x00613AD1+2046673]
	BaseThreadInitThunk [0x7597FA29+25]
	RtlGetAppContainerNamedObjectPath [0x77547A9E+286]
	RtlGetAppContainerNamedObjectPath [0x77547A6E+238]
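
The log above shows the scraper re-seeking elements after each `StaleElementReferenceException` and then stopping on an `ElementClickInterceptedException`. One possible way to make clicks more tolerant is a small retry wrapper. The sketch below is not part of `glassdoor_scraper_mod_v2`: `click_with_retry`, its parameters, and the fake exception are hypothetical names, and it avoids Selenium imports so it can run on its own. With Selenium, one would presumably pass `(StaleElementReferenceException, ElementClickInterceptedException)` from `selenium.common.exceptions` as `retry_on`.

```python
import time


def click_with_retry(action, retry_on, attempts=3, wait_seconds=0.0):
    """Call action() until it succeeds or the attempts run out.

    action   -- zero-argument callable, e.g. a function that re-locates
                an element and clicks it (hypothetical usage)
    retry_on -- tuple of exception classes that should trigger a retry
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {type(error).__name__}")
            if attempt < attempts:
                time.sleep(wait_seconds)  # give the page time to settle
    # Every attempt failed, so surface the last error to the caller.
    raise last_error


# Stand-in for a flaky click: fails twice, then succeeds.
class FakeStaleElement(Exception):
    pass


calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleElement("element went stale")
    return "clicked"


result = click_with_retry(flaky_click, retry_on=(FakeStaleElement,))
```

A retry only helps with transient failures. If another element permanently covers the listing, as the intercepted-click message suggests, the overlay itself would still need to be dismissed or scrolled out of the way.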


In [3]:
#Controls how many rows pandas displays.
#Set the value to None to view the entire dataframe, or to an int of your choice to limit the number of rows shown
pd.set_option('display.max_rows',20)
data.dtypes

Job Title            object
Salary Minimum       object
Salary Maximum       object
Salary Average       object
Rating               object
Company Name         object
Location             object
Size                 object
Founded              object
Type of ownership    object
Industry             object
Sector               object
Revenue              object
dtype: object

In [4]:
#get some information on the dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Job Title          1000 non-null   object
 1   Salary Minimum     1000 non-null   object
 2   Salary Maximum     1000 non-null   object
 3   Salary Average     1000 non-null   object
 4   Rating             1000 non-null   object
 5   Company Name       1000 non-null   object
 6   Location           1000 non-null   object
 7   Size               1000 non-null   object
 8   Founded            1000 non-null   object
 9   Type of ownership  1000 non-null   object
 10  Industry           1000 non-null   object
 11  Sector             1000 non-null   object
 12  Revenue            1000 non-null   object
dtypes: object(13)
memory usage: 101.7+ KB


In [5]:
#Replace empty (whitespace-only) cells and the -1 placeholder with NaN
#Note: replace(-1, ...) matches the integer -1, not the string "-1"
data_blanked = data.replace(r'^\s*$', np.nan, regex=True) 
data_nulled = data_blanked.replace(-1, np.nan)
data_nulled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Job Title          1000 non-null   object
 1   Salary Minimum     931 non-null    object
 2   Salary Maximum     931 non-null    object
 3   Salary Average     931 non-null    object
 4   Rating             866 non-null    object
 5   Company Name       1000 non-null   object
 6   Location           1000 non-null   object
 7   Size               894 non-null    object
 8   Founded            708 non-null    object
 9   Type of ownership  894 non-null    object
 10  Industry           757 non-null    object
 11  Sector             757 non-null    object
 12  Revenue            894 non-null    object
dtypes: object(13)
memory usage: 101.7+ KB
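
To illustrate what the two `replace` calls above do and do not catch, here is a minimal sketch on a toy frame. The values in `toy` are made up for illustration and are not taken from the scraped data. It assumes the columns are `object` dtype holding a mix of strings and the integer placeholder `-1`.

```python
import numpy as np
import pandas as pd

# Toy data (invented): a mix of real values, blanks, and -1 placeholders.
toy = pd.DataFrame({
    "Rating": ["4.3", -1, "-1", "  "],
    "Size": ["1 to 50 Employees", "", -1, "51 to 200 Employees"],
})

# Step 1: empty or whitespace-only strings become NaN.
blanked = toy.replace(r"^\s*$", np.nan, regex=True)

# Step 2: the integer -1 becomes NaN; the string "-1" is left alone.
nulled = blanked.replace(-1, np.nan)

# Variant: treat both the integer and the string form as missing.
also_strings = blanked.replace([-1, "-1"], np.nan)

print(nulled["Rating"].isna().sum())        # missing count after step 2
print(also_strings["Rating"].isna().sum())  # missing count with the variant
```

If the scraper ever stores the placeholder as text, the list form in `also_strings` is one way to cover both cases.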


In [9]:
#pickle the data to view at a later time
#dataframe.to_pickle("path_to_file.pkl")

data_nulled.to_pickle(r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\Data_analyst_dataset.pkl") # input your preferred path

#dataframe.to_csv("path_to_file.csv")
data_nulled.to_csv(r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\Data_analyst_dataset.csv", index = False) # input your preferred path

In [10]:
#load the saved csv back into a dataframe
DA_data = pd.read_csv(r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\Data_analyst_dataset.csv")
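
Note that `read_csv` infers column dtypes afresh, so a frame reloaded from CSV can have different dtypes than the one that was saved. A small sketch of that effect, using an in-memory buffer instead of a file and invented toy values, assuming the numeric-looking column was stored as text before saving:

```python
import io

import numpy as np
import pandas as pd

# Toy frame (invented values): numbers stored as text in an object column.
toy = pd.DataFrame({
    "Rating": pd.Series(["4.3", np.nan, "3.8"], dtype=object),
    "Size": pd.Series(["1 to 50 Employees", np.nan, "201 to 500 Employees"], dtype=object),
})

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy["Rating"].dtype)       # dtype before the round trip
print(reloaded["Rating"].dtype)  # dtype inferred by read_csv
```

A pickle round trip keeps the original dtypes, which is one reason to save both formats.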

In [11]:
#view data
DA_data.head()

|  | Job Title | Salary Minimum | Salary Maximum | Salary Average | Rating | Company Name | Location | Size | Founded | Type of ownership | Industry | Sector | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Audio Visual System Design Engineer | $61K | $100K | $80,500 /yr (est.) | NaN | CCS | Denver, CO | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Audio Visual Design Engineer | $85K | $110K | $97,500 /yr (est.) | 4.3 | AV-Worx\n4.3 | West Palm Beach, FL | 1 to 50 Employees | 2014.0 | Company - Private | Business Consulting | Management & Consulting | $5 to $10 million (USD) |
| 2 | Audio Visual Systems Field Engineer | $48K | $83K | $63,005 /yr (est.) | 4.3 | System Source\n4.3 | Hunt Valley, MD | 51 to 200 Employees | 1981.0 | Company - Private | Information Technology Support Services | Information Technology | $10 to $25 million (USD) |
| 3 | (NY) Audio/Visual Design Engineer | $50K | $98K | $69,978 /yr (est.) | 3.8 | A-V Services Inc.\n3.8 | New York, NY | 201 to 500 Employees | 1960.0 | Company - Private | Telecommunications Services | Telecommunications | Unknown / Non-Applicable |
| 4 | Audio Visual Systems Engineer | NaN | NaN | NaN | NaN | Camera Corner / Connecting Point | Green Bay, WI | NaN | NaN | NaN | NaN | NaN | NaN |
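
Since one goal is to compare salaries by city/state, the `Location` values shown above (for example `Denver, CO`) would need to be split into separate parts. A minimal sketch, assuming locations follow a "City, ST" pattern. `split_location` is a hypothetical helper, and the `"Remote"` entry is an invented example of a value without a state.

```python
def split_location(location):
    """Split 'City, ST' into (city, state); state is None when absent."""
    city, sep, state = location.rpartition(",")
    if not sep:
        # No comma found, e.g. "Remote": keep the whole text as the city.
        return location.strip(), None
    return city.strip(), state.strip()


locations = ["Denver, CO", "West Palm Beach, FL", "Hunt Valley, MD", "Remote"]
pairs = [split_location(loc) for loc in locations]

for city, state in pairs:
    print(city, "|", state)
```

Splitting on the last comma keeps city names that themselves contain commas intact.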


In [29]:
DA_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          1200 non-null   object 
 1   Salary Minimum     1085 non-null   object 
 2   Salary Maximum     1085 non-null   object 
 3   Salary Average     1085 non-null   object 
 4   Rating             908 non-null    float64
 5   Company Name       1200 non-null   object 
 6   Location           1200 non-null   object 
 7   Size               983 non-null    object 
 8   Founded            761 non-null    float64
 9   Type of ownership  983 non-null    object 
 10  Industry           791 non-null    object 
 11  Sector             791 non-null    object 
 12  Revenue            983 non-null    object 
dtypes: float64(2), object(11)
memory usage: 122.0+ KB
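
The three salary columns are still `object` dtype, holding text such as `$61K` or `$80,500 /yr (est.)`, so they would need converting to numbers before any salary analysis. A minimal sketch, assuming those two formats are the only ones present (hourly rates, for instance, are not handled). `parse_salary` is a hypothetical helper, not part of the scraper.

```python
import re


def parse_salary(text):
    """Convert '$61K' or '$80,500 /yr (est.)' to a float in dollars.

    Returns None for missing values or text without a dollar amount.
    """
    if not isinstance(text, str):
        return None  # NaN (a float) or None from an empty cell
    match = re.search(r"\$\s*([\d,]+(?:\.\d+)?)\s*([Kk])?", text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):
        amount *= 1_000  # a 'K' suffix means thousands
    return amount


samples = ["$61K", "$100K", "$80,500 /yr (est.)", None]
parsed = [parse_salary(s) for s in samples]
print(parsed)
```

Applied to the dataframe, this could look like `DA_data["Salary Minimum"].map(parse_salary)`, with missing entries then coming back as nulls.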
