# Scraping salary data for veterinary clinical pathologists from Glassdoor
My wife is a veterinary clinical pathologist and is very interested in determining the range of salaries for her profession. One way of doing that is web scraping to acquire a dataset large enough to identify trends, such as salary by city/state or by area of focus. Glassdoor is a good resource for this: it is widely used, and I can easily change the search keywords to compare broader job categories.

For this project, I will piece together some code to generate a dataframe to analyze. For data collection, I will scrape Glassdoor using code prepared by Omer Sakarya (https://github.com/arapfaik/scraping-glassdoor-selenium). I only slightly modified the code and plan to fork it on GitHub.

In [1]:
#import libraries and scripts

import pandas as pd
import glassdoor_scraper_mod_v2 as gsm2

import numpy as np

In [5]:
# running scraper on glassdoor



path = 'C:/Program Files/Chrome_driver/chromedriver'
data = gsm2.get_jobs("Veterinary Clinical Pa", 100, False, path, 3)

There are  186 Veterinary Clinical P Jobs  for the input keyword(s).
Progress: 0/100
StaleElementReferenceException while trying to click...seeking element again.
Populating new list
Looking for Information Container
The minimum pay range is: $81K 
The maximum pay range is: $225K
The average pay range is: $134,906 /yr (est.)
The rating is: 3.4
Progress: 1/100
StaleElementReferenceException while trying to click...seeking element again.
Populating new list
Looking for Information Container
The minimum pay range is: $99K 
The maximum pay range is: $197K
The average pay range is: $139,865 /yr (est.)
The rating is: 4.1
Progress: 2/100
StaleElementReferenceException while trying to click...seeking element again.
Populating new list
Looking for Information Container
The minimum pay range is: $74K 
The maximum pay range is: $172K
The average pay range is: $113,205 /yr (est.)
The rating is: 4.0
Progress: 3/100
StaleElementReferenceException while trying to click...seeking element again.
Popula

IndexError: list index out of range
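
The run above stops with an `IndexError` after only a few listings, right after repeated `StaleElementReferenceException` messages. One common way to make a scraping step more tolerant is to retry it a few times before giving up. The helper below is only a minimal sketch of that idea: `with_retries` is a hypothetical name and is not part of the scraper module, and which exceptions are worth retrying is an assumption (in Selenium code, `retry_on` would typically include `StaleElementReferenceException`).

```python
import time


def with_retries(action, attempts=3, delay=0.5, retry_on=(Exception,)):
    """Call action() up to `attempts` times, retrying on the given exceptions.

    Hypothetical helper for illustration; not part of the Glassdoor scraper.
    Returns the first successful result, or re-raises the last error.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as err:
            last_error = err
            # Pause briefly so the page has a chance to finish re-rendering.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

A step that looks up an element and reads its text could then be wrapped as `with_retries(lambda: read_salary(driver), retry_on=(IndexError,))`, where `read_salary` stands in for whatever function performs that lookup.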

In [8]:
#Lets you view the entire dataframe.
#You can change "None" to an int of your choice to limit the number of rows shown
pd.set_option('display.max_rows',None)
data

Unnamed: 0,Job Title,Salary AND Provider,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue
0,Scientist,$70K (Employer est.),$70K,$70K,"$70,000 /yr (est.)",-1.0,HCW Biologics Inc.,"Miramar, FL",-1,-1,-1,-1,-1,-1
1,Postdoctoral Research Fellow,$70K (Employer est.),$55K,$82K,"$68,429 /yr (est.)",4.1,University of California - San Francisco\n4.1,"San Francisco, CA",10000+ Employees,1864,College / University,Colleges & Universities,Education,$25 to $50 million (USD)
2,Scientist I-II,$70K (Employer est.),-1,-1,-1,-1.0,Heliae Development,"Gilbert, AZ",-1,-1,-1,-1,-1,-1
3,Scientist - Pharmacokinetics,$70K (Employer est.),$57K,$124K,"$83,688 /yr (est.)",4.0,Absorption Systems\n4.0,"Exton, PA",1 to 50 Employees,-1,Company - Public,-1,-1,Unknown / Non-Applicable
4,Scientist - Drug Development,$70K (Employer est.),$84K,$175K,"$121,166 /yr (est.)",4.3,"Systems Planning and Analysis, Inc.\n4.3","Lorton, VA",1001 to 5000 Employees,1972,Company - Private,Aerospace & Defense,Aerospace & Defense,$100 to $500 million (USD)
5,Scientist I,$70K (Employer est.),$90K,$100K,"$95,000 /yr (est.)",-1.0,Lumos Diagnostics,"Carlsbad, CA",-1,-1,-1,-1,-1,-1
6,Assistant Laboratory Manager,$70K (Employer est.),$65K,$75K,"$70,000 /yr (est.)",3.6,KAM Consultants Corp.\n3.6,"Long Island City, NY",1 to 50 Employees,-1,Company - Private,-1,-1,$1 to $5 million (USD)
7,"Manufacturing, Process Development Scientist, ...",$70K (Employer est.),$58K,$107K,"$78,569 /yr (est.)",4.2,"Agilent Technologies, Inc.\n4.2","Kirkland, WA",10000+ Employees,1999,Company - Public,Biotech & Pharmaceuticals,Pharmaceutical & Biotechnology,$2 to $5 billion (USD)
8,"Senior Research Associate/Junior Scientist, Bi...",$70K (Employer est.),$83K,$83K,"$82,500 /yr (est.)",-1.0,GALY,"Sunderland, MA",1 to 50 Employees,2019,Company - Private,Biotech & Pharmaceuticals,Pharmaceutical & Biotechnology,Unknown / Non-Applicable
9,"Scientist/Sr Scientist, Protein Chemist",$70K (Employer est.),$95K,$165K,"$130,000 /yr (est.)",-1.0,"Crossbow Therapeutics, Inc","Cambridge, MA",-1,-1,-1,-1,-1,-1
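
In the table above, several `Company Name` values carry the rating as a suffix (for example `Absorption Systems\n4.0`), duplicating the `Rating` column. A small cleanup function can strip it. This is a sketch that assumes the `\n` shown in the output is a real newline character inside the string; `clean_company_name` is a name introduced here for illustration.

```python
def clean_company_name(raw):
    """Drop the trailing rating that follows a newline in the company name.

    Non-string values (such as the -1 placeholder) are returned unchanged.
    """
    if not isinstance(raw, str):
        return raw
    # Keep only the text before the first newline, then trim whitespace.
    return raw.split("\n", 1)[0].strip()
```

It could be applied column-wise with `data["Company Name"].map(clean_company_name)`.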


In [7]:
#get some information on the dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Job Title            10 non-null     object
 1   Salary AND Provider  10 non-null     object
 2   Salary Minimum       10 non-null     object
 3   Salary Maximum       10 non-null     object
 4   Salary Average       10 non-null     object
 5   Rating               10 non-null     object
 6   Company Name         10 non-null     object
 7   Location             10 non-null     object
 8   Size                 10 non-null     object
 9   Founded              10 non-null     object
 10  Type of ownership    10 non-null     object
 11  Industry             10 non-null     object
 12  Sector               10 non-null     object
 13  Revenue              10 non-null     object
dtypes: object(14)
memory usage: 1.2+ KB
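
`data.info()` reports every column as `object`, so the salary columns are still strings such as `$70K` or `$68,429 /yr (est.)` and cannot be averaged or plotted directly. The sketch below shows one possible way to turn those strings into numbers. The function names `parse_amount` and `pay_period` are introduced here for illustration, and the regular expression only covers the formats visible in this notebook's output.

```python
import math
import re

# Matches "$81K", "$134,906", or "$20.50": a dollar sign, digits with optional
# thousands separators and decimals, and an optional "K" multiplier.
_AMOUNT_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)\s*([Kk])?")


def parse_amount(text):
    """Return the dollar amount in `text` as a float, or NaN if none is found."""
    if not isinstance(text, str):
        return math.nan  # covers the -1 placeholder and NaN values
    match = _AMOUNT_RE.search(text)
    if match is None:
        return math.nan
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):
        amount *= 1000
    return amount


def pay_period(text):
    """Return "hr" or "yr" if the string states a pay period, else None."""
    if not isinstance(text, str):
        return None
    if "/hr" in text:
        return "hr"
    if "/yr" in text:
        return "yr"
    return None
```

Applied with something like `data["Salary Minimum"].map(parse_amount)`, this would give a numeric column. Hourly and yearly figures should be kept apart (or converted) before comparing them, which is what `pay_period` is meant to help with.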


In [38]:
#Replace the scraper's -1 placeholder (numeric or string) with NaN
data_nulled = data.replace([-1, "-1"], np.nan)
data_nulled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 670 entries, 0 to 669
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Job Title          670 non-null    object
 1   Salary Minimum     563 non-null    object
 2   Salary Maximum     563 non-null    object
 3   Salary Average     563 non-null    object
 4   Rating             577 non-null    object
 5   Company Name       670 non-null    object
 6   Location           670 non-null    object
 7   Size               627 non-null    object
 8   Founded            510 non-null    object
 9   Type of ownership  627 non-null    object
 10  Industry           564 non-null    object
 11  Sector             564 non-null    object
 12  Revenue            627 non-null    object
dtypes: object(13)
memory usage: 68.2+ KB


In [39]:
#pickle the data to view at a later time
#dataframe.to_pickle("path_to_file.pkl")
data_nulled.to_pickle("C:/Users/Tineash/Data_analyst_dataset.pkl")

In [40]:
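#load the pickled data back into a dataframe
#(the relative path assumes the working directory is C:/Users/Tineash, where the file was saved)
DA_data = pd.read_pickle("./Data_analyst_dataset.pkl")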

In [41]:
#view pickled data
DA_data

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue
0,Data Analyst,,,$60.00 /hr (est.),4.1,CRG\n4.1,"Macungie, PA",51 to 200 Employees,1994.0,Company - Private,HR Consulting,Human Resources & Staffing,Unknown / Non-Applicable
1,Data Analyst + Apprentice (Entry-Level),$35K,$45K,"$40,000 /yr (est.)",3.3,New Apprenticeship\n3.3,"Raleigh, NC",1 to 50 Employees,,Company - Private,,,Unknown / Non-Applicable
2,Performance Data Analyst,$125K,$150K,"$137,500 /yr (est.)",,Bellwether Staffing Solutions,Remote,1 to 50 Employees,2001.0,Company - Public,HR Consulting,Human Resources & Staffing,Unknown / Non-Applicable
3,"Data Analyst, Snap",$53K,$108K,"$75,304 /yr (est.)",3.9,Jellysmack\n3.9,"Los Angeles, CA",501 to 1000 Employees,2016.0,Company - Private,Internet & Web Services,Information Technology,Unknown / Non-Applicable
4,Data Analyst,$95K,$100K,"$97,500 /yr (est.)",3.0,G2 Secure Staff\n3.0,"Irving, TX",1001 to 5000 Employees,2005.0,Company - Private,"Airlines, Airports & Air Transportation",Transportation & Logistics,$100 to $500 million (USD)
5,Product Data Analyst,$105K,$125K,"$115,000 /yr (est.)",5.0,Underdog Fantasy\n5.0,Remote,51 to 200 Employees,2020.0,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",Unknown / Non-Applicable
6,Data Analyst,$90K,$90K,"$90,000 /yr (est.)",,"innoVet Health, LLC",Remote,,,,,,
7,Data Analyst - Payment Reconciliation Focused,$20.50 /hr,$20.50,$20.50 /hr (est.),3.7,DentaQuest\n3.7,Remote,1001 to 5000 Employees,2001.0,Company - Private,Insurance Carriers,Insurance,$2 to $5 billion (USD)
8,Data Analyst/Insights Analyst (100% Remote),$31.07 /hr,$31.07,$31.07 /hr (est.),4.3,The Mom Project\n4.3,Remote,51 to 200 Employees,2016.0,Company - Private,HR Consulting,Human Resources & Staffing,Unknown / Non-Applicable
9,Vulnerability Data Entry Analyst,,,,1.0,Defiant\n1.0,Remote,1 to 50 Employees,,Company - Public,,,Unknown / Non-Applicable
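
Since one goal is to compare salaries by city and state, the `Location` column needs to be split. Values in the output above look like either `City, ST` or `Remote`. The sketch below handles just those two shapes; `split_location` is a name introduced here, and other formats (for example a state name alone) would fall into the no-comma branch.

```python
def split_location(location):
    """Split "City, ST" into (city, state).

    Values without a comma, such as "Remote", are returned as (value, None).
    Non-string values give (None, None).
    """
    if not isinstance(location, str):
        return (None, None)
    if "," in location:
        # Split on the last comma so city names containing commas stay intact.
        city, state = location.rsplit(",", 1)
        return (city.strip(), state.strip())
    return (location.strip(), None)
```

The resulting pairs could be expanded into two new columns, after which grouping by state becomes straightforward.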
