# Motivation

Going through job boards is tedious: type in the URL, browse through crappy LinkedIn (LI) search results, manually read each job description to see whether it is a match, and then apply.

To bypass this tedious, laborious workflow, automate it by:
1. scraping particular websites (or the wider web) for specific job titles or job descriptions that call for particular experience/skills
1. advising on which parts of the resume need tweaking to tailor it to a job application
1. auto-applying to jobs
1. keeping a tracking system for jobs

# (1) Relevant jobs

* [Sonara](https://app.sonara.ai/) - very relevant; similar to jobs I would actually apply for
* [Simplify](https://simplify.jobs/) - compared to Sonara, the search sticks very strictly to my criteria, e.g. it keeps returning the same companies, and within those companies it offers roles that are tangential to my preferences and not at my seniority level. Will use this as an aggregator for companies of interest
* [Massive](https://usemassive.com/) - huge number of results that take some time to sift through manually, but the results are diverse; it surfaces companies I would not usually look at

### Given that there are some gaps in the results, it would be good to build my own scraper, since it may cover some preferences that the job post aggregators above miss.

References:
* https://www.chrislovejoy.me/job-scraper
* https://www.google.com/search?q=scrape+web+to+get+job+openings+using+key+words+python

In [1]:
# Thank you: https://github.com/smortezah/portfolio/blob/main/scrape/jobinventory.com/tutorial.ipynb

!pip install --quiet requests beautifulsoup4 pandas

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
pd.set_option('display.max_colwidth', None)

# # Define the search query and location
# search_query = "data analyst"
# location = "San Francisco, CA"

# # Construct the URL
# url = f"http://www.jobinventory.com/search?q={search_query}&l={location}"

# # Send a GET request to the URL
# response = requests.get(url, headers = {
#     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"})

# # Parse the HTML content using BeautifulSoup
# soup = BeautifulSoup(response.content, "html.parser")

# # Find all the job listings on the page
# job_listings = soup.find_all("li", class_="resultBlock")

# # Define empty lists to store the job details
# titles = []
# companies = []
# locations = []
# descriptions = []

# # Loop through each job listing and extract the relevant details
# for job in job_listings:
#     title = job.find("div", class_="title").text.strip()
#     company = job.find("span", class_="company").text.strip()
#     location = (
#         job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
#     )
#     description = job.find("div", class_="description").text.strip()

#     titles.append(title)
#     companies.append(company)
#     locations.append(location)
#     descriptions.append(description)

# # Clean up the job descriptions using regular expressions
# regex = re.compile(r"\s+")
# clean_descriptions = [regex.sub(" ", d).split(" - ")[1] for d in descriptions]

# # Create a Pandas DataFrame to store the job details
# df = pd.DataFrame(
#     {
#         "Title": titles,
#         "Company": companies,
#         "Location": locations,
#         "Description": clean_descriptions,
#     }
# )

# # Export the DataFrame to a CSV file
# df.to_csv("job_listings.csv", index=False)

# print("Scraping complete! The results are saved in 'job_listings.csv'.")

# df
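
The description clean-up in the commented-out cell above indexes `split(" - ")[1]`, which raises `IndexError` for any description that has no `" - "` separator. A minimal sketch of a more forgiving version follows. The helper name `clean_description` and the sample strings are made up for illustration, and it keeps everything after the first separator rather than only the second segment:

```python
import re

_WHITESPACE = re.compile(r"\s+")


def clean_description(raw: str) -> str:
    """Collapse runs of whitespace and drop a leading 'prefix - ' segment if present."""
    text = _WHITESPACE.sub(" ", raw).strip()
    # str.partition never raises: when " - " is absent, `sep` is an empty string
    _prefix, sep, rest = text.partition(" - ")
    return rest if sep else text


# Hypothetical sample strings, not real scraped data
print(clean_description("Acme Corp  -  Analyze\n sales   data"))
print(clean_description("No separator here"))
```

Descriptions without a separator are returned whole instead of crashing the list comprehension.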

In [3]:
# response.text

_______________

# Trying to scrape LinkedIn

## Motivation:
* I want to create my own compilation of job opportunities for a particular location
* I want to analyze new trends and salaries in a particular domain

In [4]:
# Thank you: https://www.scrapingdog.com/blog/scrape-linkedin-jobs/#Complete_Code

import math

all_job_ids=[]
job_info={}
final_job_data=[]

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"}
#target_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Python%20%28Programming%20Language%29&location=Las%20Vegas%2C%20Nevada%2C%20United%20States&geoId=100293800&currentJobId=3415227738&start={}'

## The commented-out URL above just shows the search results with a specific job ID (currentJobId) at the top. What is rendered in the browser is only the first 25 results (first page).
## Also note: the jobs-guest/jobs/api/seeMoreJobPostings endpoint is what makes the rendering completely different from what is seen on the LI site.
## The `start={}` placeholder is filled in per page by the loop in the next cell.
target_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=data%20analyst&location=San%20Francisco%20Bay%20Area&geoId=90000084&start={}'
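
Since each page holds 25 results, the per-page URLs can be generated up front. This is a rough sketch using only the standard library. It assumes that `start` is a result offset (0, 25, 50, ...), which these notes do not verify, and the names `BASE_URL`, `PAGE_SIZE` and `page_urls` are made up for illustration:

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
PAGE_SIZE = 25  # results per page, as noted above


def page_urls(keywords: str, location: str, geo_id: str, total_results: int):
    """Yield one search URL per page, stepping `start` by PAGE_SIZE (assumed offset)."""
    for start in range(0, total_results, PAGE_SIZE):
        params = {
            "keywords": keywords,
            "location": location,
            "geoId": geo_id,
            "start": start,
        }
        # quote_via=quote encodes spaces as %20, matching the URL used in this notebook
        yield f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


urls = list(page_urls("data analyst", "San Francisco Bay Area", "90000084", 800))
print(len(urls))
print(urls[1])
```

With 800 results this yields 32 URLs, the same page count as `math.ceil(800/25)`.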

In [5]:
%%time
for i in range(0, 1):  # math.ceil(800/25) = 32 pages in total; only the first page is fetched here
    # `start` is assumed to be a result offset, so page i begins at result i * 25
    res = requests.get(target_url.format(i * 25), headers=headers)  # one page of up to 25 job cards
    soup = BeautifulSoup(res.text, 'html.parser')  # parse the HTML response
    alljobs_on_this_page = soup.find_all("li")  # each job listing sits in an <li>...</li> block
    for job in alljobs_on_this_page:  # up to 25 listings per page
        card = job.find("div", {"class": "base-card"})
        if card is None or not card.get('data-entity-urn'):
            continue  # skip <li> elements that are not job cards
        jobid = card.get('data-entity-urn').split(":")[3]  # urn:li:jobPosting:<id>
        all_job_ids.append(jobid)  # running list of all job IDs (800 if every page is fetched)

# Separate name so that re-running this cell does not overwrite the search URL
posting_url = 'https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{}'
for jobid in all_job_ids:
    resp = requests.get(posting_url.format(jobid), headers=headers)  # fetch one posting's details
    soup = BeautifulSoup(resp.text, 'html.parser')  # parse the response to pull out individual fields
    job_info = {}
    try:
        job_info["company"] = soup.find("div", {"class": "top-card-layout__card"}).find("a").find("img").get("alt")
    except AttributeError:  # element missing from this posting
        job_info["company"] = None
    try:
        job_info["job-title"] = soup.find("div", {"class": "top-card-layout__entity-info"}).find("a").text.strip()
    except AttributeError:
        job_info["job-title"] = None
    try:
        job_info["level"] = soup.find("ul", {"class": "description__job-criteria-list"}).find("li").text.replace("Seniority level", "").strip()
    except AttributeError:
        job_info["level"] = None
    try:
        # NOTE: this returns the id of a <code> element (e.g. "applyUrl"), not a URL; see the output below
        job_info["link"] = soup.find("div", {"class": "top-card-layout__cta-container"}).find("code").get('id')
    except AttributeError:
        job_info["link"] = None

    final_job_data.append(job_info)

df = pd.DataFrame(final_job_data)
df.to_csv('linkedinjobs.csv', index=False, encoding='utf-8')
df

CPU times: user 902 ms, sys: 31.7 ms, total: 934 ms
Wall time: 6.15 s


Unnamed: 0,company,job-title,level,link
0,Zortech Solutions,"Data Analyst-SQL, Tableau",Associate,i18n_save_job_form_email_check_error
1,Zortech Solutions,"Data Analyst-SQL, Python, Visualization tool",Associate,i18n_save_job_form_email_check_error
2,WinMax,Data Analyst II,Associate,i18n_save_job_form_email_check_error
3,Mission Neighborhood Health Center,Data Analyst,Entry level,i18n_save_job_form_email_check_error
4,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
5,ClifyX,Data Analysts,Entry level,applyUrl
6,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
7,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
8,WinMax,Data Analyst IV,Associate,i18n_save_job_form_email_check_error
9,LeadStack Inc.,Data Analyst,Entry level,applyUrl
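
Rows 4, 6 and 7 of the output above are identical in every column shown. If such repeats come from the same posting ID appearing more than once in the search results (not verified here), the ID list could be de-duplicated before the detail requests are made. A small sketch, with a hypothetical helper name and made-up IDs:

```python
def unique_in_order(job_ids):
    """Return job_ids without repeats, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(job_ids))


# Hypothetical IDs for illustration only; real ones come from data-entity-urn
sample_ids = ["111", "222", "111", "333", "222"]
print(unique_in_order(sample_ids))  # ['111', '222', '333']
```

This also saves one HTTP request per repeated ID.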


In [6]:
df = df.replace('\n',' ', regex=True)
df

Unnamed: 0,company,job-title,level,link
0,Zortech Solutions,"Data Analyst-SQL, Tableau",Associate,i18n_save_job_form_email_check_error
1,Zortech Solutions,"Data Analyst-SQL, Python, Visualization tool",Associate,i18n_save_job_form_email_check_error
2,WinMax,Data Analyst II,Associate,i18n_save_job_form_email_check_error
3,Mission Neighborhood Health Center,Data Analyst,Entry level,i18n_save_job_form_email_check_error
4,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
5,ClifyX,Data Analysts,Entry level,applyUrl
6,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
7,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error
8,WinMax,Data Analyst IV,Associate,i18n_save_job_form_email_check_error
9,LeadStack Inc.,Data Analyst,Entry level,applyUrl


____
# Testing...

In [7]:
# get duckdb to check out what is returned
# compare this to the actual website results

!pip install duckdb --quiet
!pip install jupysql --quiet

In [8]:
import duckdb

%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb



In [9]:
# keep all columns and add the first 20 characters of `link` as `test`
query = """
select *
, link[:20] as test
from df
"""
test = duckdb.query(query).df()
test

Unnamed: 0,company,job-title,level,link,test
0,Zortech Solutions,"Data Analyst-SQL, Tableau",Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
1,Zortech Solutions,"Data Analyst-SQL, Python, Visualization tool",Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
2,WinMax,Data Analyst II,Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
3,Mission Neighborhood Health Center,Data Analyst,Entry level,i18n_save_job_form_email_check_error,i18n_save_job_form_e
4,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
5,ClifyX,Data Analysts,Entry level,applyUrl,applyUrl
6,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
7,WinMax,Data Analyst III,Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
8,WinMax,Data Analyst IV,Associate,i18n_save_job_form_email_check_error,i18n_save_job_form_e
9,LeadStack Inc.,Data Analyst,Entry level,applyUrl,applyUrl


In [10]:
test['test'].value_counts()

test
i18n_save_job_form_e    14
applyUrl                10
Name: count, dtype: int64

In [11]:
# def make_clickable(val):
#     # target _blank to open new window
#     return '<a href="{}">{}</a>'.format(val,val)
    

# test.style.format({'link': make_clickable})
# test

In [12]:

# .find("a", {"class": "base-card__full-link"}).get('href')
# .find("div",{"class":"base-card"}).get('data-entity-urn').split(":")[3] 

In [13]:
# Python3
from bs4 import BeautifulSoup

# html = '''<a href="https://some_url.com">next</a>
# <span class="class">
# <a href="https://some_other_url.com">another_url</a></span>'''

# html = '''

    
#       <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card" data-entity-urn="urn:li:jobPosting:3784593580" data-search-id="kDhajo0kExUk+EPWD98Yfg==" data-tracking-id="SdqDqyKKZFYJBsYj34wJww==" data-column="1" data-row="1">
        

#         <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/jobs/view/data-analyst-lyft-business-at-lyft-3784593580?refId=kDhajo0kExUk%2BEPWD98Yfg%3D%3D&amp;trackingId=SdqDqyKKZFYJBsYj34wJww%3D%3D&amp;position=1&amp;pageNum=0&amp;trk=public_jobs_jserp-result_search-card" data-tracking-control-name="public_jobs_jserp-result_search-card" data-tracking-client-ingraph="" data-tracking-will-navigate="">
          
#           <span class="sr-only">
              
        
#         Data Analyst, Lyft Business
      
      
#           </span>
#         </a>

      
        
#     <div class="search-entity-media">
        
#       <img class="artdeco-entity-image artdeco-entity-image--square-4
#           " data-delayed-url="https://media.licdn.com/dms/image/C560BAQFoMDej0VdZVA/company-logo_100_100/0/1630565402130/lyft_logo?e=1714608000&amp;v=beta&amp;t=j2dAJplJynj6e-gEFqeYAF1HshTmG0VXbjV0jyIwJ8I" data-ghost-classes="artdeco-entity-image--ghost" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9a9u41thxt325ucfh5z8ga4m8" alt="">
  
#     </div>
  

#         <div class="base-search-card__info">
#           <h3 class="base-search-card__title">
            
#         Data Analyst, Lyft Business
      
#           </h3>

#             <h4 class="base-search-card__subtitle">
              
#           <a class="hidden-nested-link" data-tracking-client-ingraph="" data-tracking-control-name="public_jobs_jserp-result_job-search-card-subtitle" data-tracking-will-navigate="" href="https://www.linkedin.com/company/lyft?trk=public_jobs_jserp-result_job-search-card-subtitle">
#             Lyft
#           </a>
      
#             </h4>

# <!---->
#             <div class="base-search-card__metadata">
              
#           <span class="job-search-card__location">
#             San Francisco, CA
#           </span>

# <!---->
#           <div class="job-search-card__benefits">
    

#     <div class="result-benefits">
#       <icon class="result-benefits__icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/1pwj7aot6lxrfhr9sdta6cnxw" data-svg-class-name="result-benefits__icon-svg"></icon>
#       <span class="result-benefits__text">
#          Actively Hiring
# <!---->      </span>
#     </div>
  
          
#           </div>

# <!---->
#           <time class="job-search-card__listdate" datetime="2024-01-26">
            

#       4 days ago
  
#           </time>

# <!---->      
#             </div>
#         </div>
# <!---->      
    
#       </div>
# '''

# soup = BeautifulSoup(html, "html.parser")

# for a in soup.find_all('a', href=True):
#     print("Found the URL:", a['href'])
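
The card markup pasted above already carries both the job ID (`data-entity-urn`) and a real posting URL (`a.base-card__full-link`), so both can be read from the search results without a second request. Below is a minimal sketch run against a trimmed copy of that card. The helper names `text_of` and `parse_card` are made up, and dropping the query string from the link is a choice made here, not something the notebook does:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the search-result card shown above (most attributes removed)
CARD_HTML = """
<li>
  <div class="base-card job-search-card" data-entity-urn="urn:li:jobPosting:3784593580">
    <a class="base-card__full-link" href="https://www.linkedin.com/jobs/view/data-analyst-lyft-business-at-lyft-3784593580?position=1&amp;pageNum=0">
      <span class="sr-only">
        Data Analyst, Lyft Business
      </span>
    </a>
    <h4 class="base-search-card__subtitle">
      <a class="hidden-nested-link" href="https://www.linkedin.com/company/lyft">Lyft</a>
    </h4>
    <span class="job-search-card__location">San Francisco, CA</span>
  </div>
</li>
"""


def text_of(node):
    """Return stripped text for a tag, or None when the tag was not found."""
    return node.get_text(strip=True) if node is not None else None


def parse_card(li):
    """Extract ID, title, company, location and link from one <li> job card."""
    card = li.find("div", class_="base-card")
    if card is None or not card.get("data-entity-urn"):
        return None  # not a job card
    link = card.find("a", class_="base-card__full-link")
    href = link.get("href") if link is not None else None
    return {
        "job_id": card["data-entity-urn"].split(":")[-1],  # urn:li:jobPosting:<id>
        "job-title": text_of(card.find("span", class_="sr-only")),
        "company": text_of(card.find("h4", class_="base-search-card__subtitle")),
        "location": text_of(card.find("span", class_="job-search-card__location")),
        # keep only the part before "?" to drop tracking parameters
        "link": href.split("?")[0] if href else None,
    }


soup = BeautifulSoup(CARD_HTML, "html.parser")
jobs = [job for job in (parse_card(li) for li in soup.find_all("li")) if job]
print(jobs[0])
```

The class names are copied from the pasted markup and may change whenever the site's HTML changes.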

_____

In [14]:
# Trying Py Pkg: linkedin-jobs-scraper
# Thank you: https://github.com/spinlud/py-linkedin-jobs-scraper

# !pip install linkedin-jobs-scraper --quiet

In [15]:
# import logging
# from linkedin_jobs_scraper import LinkedinScraper
# from linkedin_jobs_scraper.events import Events, EventData, EventMetrics
# from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
# from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters, \
#     OnSiteOrRemoteFilters

# # Change root logger level (default is WARN)
# logging.basicConfig(level=logging.INFO)


# # Fired once for each successfully processed job
# def on_data(data: EventData):
#     print('[ON_DATA]', data.title, data.company, data.company_link, data.date, data.link, data.insights,
#           len(data.description))


# # Fired once for each page (25 jobs)
# def on_metrics(metrics: EventMetrics):
#     print('[ON_METRICS]', str(metrics))


# def on_error(error):
#     print('[ON_ERROR]', error)


# def on_end():
#     print('[ON_END]')


# scraper = LinkedinScraper(
#     chrome_executable_path=None,  # Custom Chrome executable path (e.g. /foo/bar/bin/chromedriver)
#     chrome_binary_location=None,  # Custom path to Chrome/Chromium binary (e.g. /foo/bar/chrome-mac/Chromium.app/Contents/MacOS/Chromium)
#     chrome_options=None,  # Custom Chrome options here
#     headless=True,  # Overrides headless mode only if chrome_options is None
#     max_workers=1,  # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
#     slow_mo=0.5,  # Slow down the scraper to avoid 'Too many requests 429' errors (in seconds)
#     page_load_timeout=40  # Page load timeout (in seconds)    
# )

# # Add event listeners
# scraper.on(Events.DATA, on_data)
# scraper.on(Events.ERROR, on_error)
# scraper.on(Events.END, on_end)

# queries = [
#     Query(
#         options=QueryOptions(
#             limit=27  # Limit the number of jobs to scrape.            
#         )
#     ),
#     Query(
#         query='Engineer',
#         options=QueryOptions(
#             locations=['United States', 'Europe'],
#             apply_link=True,  # Try to extract apply link (easy applies are skipped). If set to True, scraping is slower because an additional page must be navigated. Default to False.
#             skip_promoted_jobs=True,  # Skip promoted jobs. Default to False.
#             page_offset=2,  # How many pages to skip
#             limit=5,
#             filters=QueryFilters(
#                 company_jobs_url='https://www.linkedin.com/jobs/search/?f_C=1441%2C17876832%2C791962%2C2374003%2C18950635%2C16140%2C10440912&geoId=92000000',  # Filter by companies.                
#                 relevance=RelevanceFilters.RECENT,
#                 time=TimeFilters.MONTH,
#                 type=[TypeFilters.FULL_TIME, TypeFilters.INTERNSHIP],
#                 on_site_or_remote=[OnSiteOrRemoteFilters.REMOTE],
#                 experience=[ExperienceLevelFilters.MID_SENIOR]
#             )
#         )
#     ),
# ]

# scraper.run(queries)

___

In [16]:
# Trying Jobspy: https://github.com/Bunsly/JobSpy
# !pip install python-jobspy

In [17]:
# from jobspy import scrape_jobs

# jobs = scrape_jobs(
#     site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
#     search_term="software engineer",
#     location="Dallas, TX",
#     results_wanted=10,
#     country_indeed='USA'  # only needed for indeed / glassdoor
# )
# print(f"Found {len(jobs)} jobs")
# jobs.to_csv("jobs.csv", index=False)  # or jobs.to_excel("jobs.xlsx", index=False)
# jobs.head()