# Job descriptions

Given a particular company's career page:
- Download and cache all open job descriptions
- Feed all job descriptions through an LLM extractor
- Standardize the job description information

Experiment ideas:
- Try using a LangSmith dataset and experiment
- Try building tools for the popular applicant tracking systems (ATSs) to best handle their data formats
- Extend the scraping code to pick up a career page dynamically and switch to ATS-specific parsers
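
One way to approach the ATS-specific dispatch idea is to key off the hostname of the careers URL. The sketch below is only an illustration: `ATS_BY_HOST` and `detect_ats` are hypothetical names, and the host list covers just the three ATSs tried in this notebook. Companies that serve their careers page from a custom domain would need a different detection strategy.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames taken from the URLs used in this notebook. Other ATSs or
# custom career domains would need their own entries (assumption).
ATS_BY_HOST = {
    "apply.workable.com": "workable",
    "jobs.ashbyhq.com": "ashby",
    "careers.smartrecruiters.com": "smartrecruiters",
    "jobs.smartrecruiters.com": "smartrecruiters",
}


def detect_ats(url: str) -> Optional[str]:
    """Return a short ATS name for a careers or job URL, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    return ATS_BY_HOST.get(host)
```

The returned name could then be used to look up the matching parser function, with `None` falling back to a generic extractor.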


In [1]:
from core import init

init()

In [16]:
from utils.scrape import request_article, response_to_article
from bs4 import BeautifulSoup
from scrapfly import (
    ScrapeApiResponse,
    ScrapeConfig,
    ScrapflyClient,
    ScrapflyScrapeError,
    ScrapflyError,
)
import os

# TODO: Consider increasing max_concurrency from 1 to 5 here
SCRAPFLY = ScrapflyClient(key=os.environ.get("SCRAPFLY_KEY"))

BASE_CONFIG = {
    "asp": True,
    "render_js": True,
    "country": "us",
    # "proxy_pool": "public_residential_pool",
    "retry": False,
    # Cache for 1 week (max TTL in Scrapfly)
    "cache": True,
    "cache_ttl": 604800,
    # TO CONSIDER
    # session = value (this will reuse the same machine for subsequent requests due to sticky_proxy, but disables caching)
    # render_js is already enabled above; it renders the page in a headless browser, which sometimes helps
}


careers_url = "https://apply.workable.com/gable/"

response = await SCRAPFLY.async_scrape(ScrapeConfig(careers_url, **BASE_CONFIG))
response

# response = request_article(careers_url)
# soup = BeautifulSoup(response.text, 'html.parser')
# article = response_to_article(response)

# response.status_code, article.text

<scrapfly.api_response.ScrapeApiResponse at 0x7fafaca84d10>

In [24]:
soup = BeautifulSoup(response.content, 'html.parser')
# article = response_to_article(response)

# response.status_code, article.text

job_list_div = soup.find('ul', {'data-ui': 'list'})
# print(job_list_div)

# Find all links that match a pattern like https://apply.workable.com/gable/j/27509D233B/
job_links = job_list_div.find_all('a', href=True)
job_links = [link for link in job_links if '/j/' in link['href']]

for job_link in job_links:
    print(job_link)

job_description_urls = [f"https://apply.workable.com{job_link['href']}" for job_link in job_links]
job_description_urls

<a aria-labelledby="27509D233B_title 27509D233B_posted_on 27509D233B_locations" class="styles--1OnOt" href="/gable/j/27509D233B/"></a>
<a aria-labelledby="E176E721AD_title E176E721AD_posted_on E176E721AD_locations" class="styles--1OnOt" href="/gable/j/E176E721AD/"></a>
<a aria-labelledby="CE8AC0E609_title CE8AC0E609_posted_on CE8AC0E609_department CE8AC0E609_locations" class="styles--1OnOt" href="/gable/j/CE8AC0E609/"></a>
<a aria-labelledby="99A4F53602_title 99A4F53602_posted_on 99A4F53602_locations" class="styles--1OnOt" href="/gable/j/99A4F53602/"></a>
<a aria-labelledby="1A48FFBF3B_title 1A48FFBF3B_posted_on 1A48FFBF3B_department 1A48FFBF3B_locations" class="styles--1OnOt" href="/gable/j/1A48FFBF3B/"></a>


['https://apply.workable.com/gable/j/27509D233B/',
 'https://apply.workable.com/gable/j/E176E721AD/',
 'https://apply.workable.com/gable/j/CE8AC0E609/',
 'https://apply.workable.com/gable/j/99A4F53602/',
 'https://apply.workable.com/gable/j/1A48FFBF3B/']
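
The f-string above works because Workable returns root-relative hrefs. A slightly more general sketch, assuming only the standard library, resolves each href against the page URL with `urljoin`, so relative and already-absolute links are handled the same way. `absolute_job_urls` and its `must_contain` parameter are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin


def absolute_job_urls(page_url, hrefs, must_contain=None):
    """Resolve hrefs against page_url, optionally keeping only matching paths."""
    urls = []
    for href in hrefs:
        if must_contain and must_contain not in href:
            continue
        url = urljoin(page_url, href)
        if url not in urls:  # drop duplicates, keep first-seen order
            urls.append(url)
    return urls
```

For the Workable page this could be called as `absolute_job_urls(careers_url, [a["href"] for a in job_links], must_contain="/j/")`.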

In [25]:
jd_responses = []

async for result in SCRAPFLY.concurrent_scrape([ScrapeConfig(url, **BASE_CONFIG) for url in job_description_urls], concurrency=2):
    if not isinstance(result, ScrapflyScrapeError):
        jd_responses.append(result)
    else:
        print(
            f"failed to scrape {result.api_response.config['url']}, got: {result.message}"
        )
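
Because `BASE_CONFIG` sets `"retry": False`, a transient failure such as a proxy timeout is only printed by the loop above and the page is lost. A minimal sketch of a client-side retry with exponential backoff follows. `retry_async` is a hypothetical helper, not part of the Scrapfly SDK, and it assumes the wrapped call signals failure by raising an exception.

```python
import asyncio
import random


async def retry_async(make_call, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call() until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1x, 2x, 4x, ...) with up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * (1 + 0.1 * random.random()))
```

In this notebook it might be used as `await retry_async(lambda: SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG)), retry_on=(ScrapflyError,))`, assuming `async_scrape` raises a `ScrapflyError` subclass on failure. Setting `"retry": True` in the config is the other option to evaluate.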


In [29]:
soup = BeautifulSoup(jd_responses[0].content, 'html.parser')
soup.find("main").text

"Share this job\xa0SVGs not supported by this browser.DescriptionAbout us:Gable.ai is a Seattle-based startup revolutionizing the data industry. Through our data communication, change management, and collaboration platform, we empower developers to build and manage data assets, bridging the gap between data producers and consumers to upscale data quality. Fresh out of stealth mode and backed by prominent venture partners, our mission is to reshape data management by fostering collaboration and innovation. Join us in transforming the landscape of the data industry!As a Static Code Analysis Expert at Gable.ai, you will be at the forefront of developing and integrating static code analysis tools that are core to our product offerings. Your role will involve designing, implementing, and maintaining static analysis tools and features that help improve the quality, security, and maintainability of our clients' codebases. You will work closely with our engineering team to integrate these tool

In [34]:
# Ashby

ashby_url = "https://jobs.ashbyhq.com/Abridge"
# ashby-job-posting-brief-list > a

# example JD: https://jobs.ashbyhq.com/Abridge/77e38354-bf42-42de-b404-ed2648414d23
# div#overview

response = await SCRAPFLY.async_scrape(ScrapeConfig(ashby_url, **BASE_CONFIG))
soup = BeautifulSoup(response.content, 'html.parser')

ashby_job_description_urls = []
job_divs = soup.find_all('div', {'class': 'ashby-job-posting-brief-list'})
for job_div in job_divs:
    job_links = job_div.find_all('a', href=True)
    ashby_job_description_urls.extend([job_link['href'] for job_link in job_links])

ashby_job_description_urls = [f"https://jobs.ashbyhq.com{url}" for url in ashby_job_description_urls]
print(ashby_job_description_urls)

['https://jobs.ashbyhq.com/Abridge/c37f7f5c-ec63-4983-8f3f-b13bb85e088d', 'https://jobs.ashbyhq.com/Abridge/77e38354-bf42-42de-b404-ed2648414d23', 'https://jobs.ashbyhq.com/Abridge/0481e7b5-7252-472d-b8be-63d347bd2198', 'https://jobs.ashbyhq.com/Abridge/d980a314-1c5f-422e-99f9-d36bda21f49d', 'https://jobs.ashbyhq.com/Abridge/52f68350-2209-4327-bd4d-63eba4a564d5', 'https://jobs.ashbyhq.com/Abridge/f71cb8cf-d160-478c-8391-f7db60582e5e', 'https://jobs.ashbyhq.com/Abridge/25bfeaa6-7d0f-4026-85aa-cdab2aa5b725', 'https://jobs.ashbyhq.com/Abridge/47b16b43-be73-4a97-bb42-79b47d0feb92', 'https://jobs.ashbyhq.com/Abridge/e9af8bf2-21c6-458d-adc8-b0d59d6a9061', 'https://jobs.ashbyhq.com/Abridge/7a28a84b-6756-4fe6-8af9-501fc8772a62', 'https://jobs.ashbyhq.com/Abridge/a8a2b7af-992c-4121-b12c-81149e871469', 'https://jobs.ashbyhq.com/Abridge/03699ed8-5cf5-4917-96a6-101f15a653e5', 'https://jobs.ashbyhq.com/Abridge/215c2503-ccd8-492c-a916-eca356f7de00', 'https://jobs.ashbyhq.com/Abridge/1235179f-c64a-4a

In [35]:
ashby_jd_responses = []

async for result in SCRAPFLY.concurrent_scrape(
    [ScrapeConfig(url, **BASE_CONFIG) for url in ashby_job_description_urls], concurrency=2
):
    if not isinstance(result, ScrapflyScrapeError):
        ashby_jd_responses.append(result)
    else:
        print(f"failed to scrape {result.api_response.config['url']}, got: {result.message}")


soup = BeautifulSoup(ashby_jd_responses[0].content, 'html.parser')
soup.find("div", {"id": "overview"}).text

CRITICAL:root:<-- 422 | ERR::PROXY::TIMEOUT - Proxy connection or website was too slow and timeout - Proxy or website do not respond after 15s - Check if the website is online or geoblocking, if you are using session, rotate it.. Checkout the related doc: https://scrapfly.io/docs/scrape-api/understand-timeout


"Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most—their patients.Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real-time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense.The RoleAs a user researcher at Abridge, you’ll play a pivotal role in

In [None]:


print(soup.find("div", {"id": "overview"}).text)

Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation efficiencies while enabling clinicians to focus on what matters most—their patients.Our enterprise-grade technology transforms patient-clinician conversations into structured clinical notes in real-time, with deep EMR integrations. Powered by Linked Evidence and our purpose-built, auditable AI, we are the only company that maps AI-generated summaries to ground truth, helping providers quickly trust and verify the output. As pioneers in generative AI for healthcare, we are setting the industry standards for the responsible deployment of AI across health systems.We are a growing team of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers working together to empower people and make care make more sense.The RoleAs a user researcher at Abridge, you’ll play a pivotal role in 
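
The extracted text runs paragraphs together ("patients.Our", "The RoleAs"), and the Workable text earlier also contains a non-breaking space (`\xa0`). The missing breaks come from `.text` joining nodes with no separator; BeautifulSoup's `get_text(separator="\n", strip=True)` is one option to try for that. The whitespace cleanup itself can be sketched with the standard library alone. `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace non-breaking spaces and collapse whitespace runs to one space."""
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()
```

Note that this only cleans up whitespace that is present; it cannot restore paragraph breaks that were already lost during extraction.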

In [42]:
# smartrecruiters
sr_url = "https://careers.smartrecruiters.com/Logic2020Inc"
# #st-openings > a

# example: https://jobs.smartrecruiters.com/Logic2020Inc/744000027953455-senior-business-development-executive
# main

response = await SCRAPFLY.async_scrape(ScrapeConfig(sr_url, **BASE_CONFIG))
soup = BeautifulSoup(response.content, 'html.parser')

sr_job_description_urls = []
job_div = soup.find('section', {'id': 'st-openings'})
job_links = job_div.find_all('a', {"class": "link--block"}, href=True)
sr_job_description_urls.extend([job_link['href'] for job_link in job_links])

# SmartRecruiters hrefs are already absolute URLs, so no base URL needs to be prepended
print("Found job links: ", sr_job_description_urls)

sr_jd_responses = []

async for result in SCRAPFLY.concurrent_scrape(
    [ScrapeConfig(url, **BASE_CONFIG) for url in sr_job_description_urls], concurrency=2
):
    if not isinstance(result, ScrapflyScrapeError):
        sr_jd_responses.append(result)
    else:
        print(f"failed to scrape {result.api_response.config['url']}, got: {result.message}")

soup = BeautifulSoup(sr_jd_responses[0].content, 'html.parser')
print("Job description: ", soup.find("main").text)

Found job links:  ['https://jobs.smartrecruiters.com/Logic2020Inc/744000027953455-senior-business-development-executive', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000025287191-sr-business-development-executive', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000025114686-sr-business-development-executive', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000025115865-sr-business-development-executive', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000024645686-manager-sap-s-4hana-functional-analyst', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000014624384-consulting-manager-energy-utilities', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000014623806-consulting-manager-energy-utilities', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000011980069-senior-consultant-energy-utilities', 'https://jobs.smartrecruiters.com/Logic2020Inc/744000009701914-consultant-strategy-operations']
Job description:  Senior Consultant - Energy & UtilitiesFull-timePractice Area:

In [52]:
sr_jd_responses[0].context["url"]

def parse_smartrecruiters_response(response: ScrapeApiResponse):
    soup = BeautifulSoup(response.content, 'html.parser')
    return {
        "url": response.context["url"],
        "job_description": soup.find("main").text,
    }

def parse_workable_response(response: ScrapeApiResponse):
    soup = BeautifulSoup(response.content, 'html.parser')
    return {
        "url": response.context["url"],
        "job_description": soup.find("main").text,
    }

def parse_ashby_response(response: ScrapeApiResponse):
    soup = BeautifulSoup(response.content, 'html.parser')
    return {
        "url": response.context["url"],
        "job_description": soup.find("div", {"id": "overview"}).text,
    }

import pandas as pd

job_descriptions = [
    parse_smartrecruiters_response(response) for response in sr_jd_responses
] + [
    parse_workable_response(response) for response in jd_responses if not isinstance(response, ScrapflyError)
] + [
    parse_ashby_response(response) for response in ashby_jd_responses if not isinstance(response, ScrapflyError)
]

data = pd.DataFrame(job_descriptions)
data

| | url | job_description |
|---|---|---|
| 0 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Senior Consultant - Energy & UtilitiesFull-tim... |
| 1 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Consultant - Strategy & OperationsFull-timePra... |
| 2 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Consulting Manager - Energy & UtilitiesFull-ti... |
| 3 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Consulting Manager - Energy & UtilitiesFull-ti... |
| 4 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Manager - SAP S/4HANA Functional AnalystFull-t... |
| 5 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Sr. Business Development ExecutiveFull-timePra... |
| 6 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Sr. Business Development ExecutiveFull-timePra... |
| 7 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Sr. Business Development ExecutiveFull-timePra... |
| 8 | https://jobs.smartrecruiters.com/Logic2020Inc/... | Senior Business Development ExecutiveFull-time... |
| 9 | https://apply.workable.com/gable/j/1A48FFBF3B/ | Share this job SVGs not supported by this brow... |


In [53]:
data.to_csv("scraped_job_descriptions.csv", index=False)
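
The CSV keeps only the extracted text, and Scrapfly's cache expires after at most a week (see `cache_ttl` above). If the raw HTML should survive longer, for example to re-run parsers later, a small on-disk cache keyed by URL is one option. This is a sketch under assumptions: `cache_path` and `load_or_fetch` are hypothetical helpers, and `fetch` stands in for whatever function returns the page HTML as a string.

```python
import hashlib
from pathlib import Path


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def load_or_fetch(url: str, fetch, cache_dir: Path) -> str:
    """Return cached HTML for url, calling fetch(url) only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Hashing the URL avoids file-name problems with slashes and query strings, at the cost of file names that are not human-readable.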

In [58]:
from markdownify import markdownify

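# Pass the parser explicitly, as in the other cells, to avoid BeautifulSoup's "no parser specified" warning
example_html = BeautifulSoup(sr_jd_responses[0].content, 'html.parser').find("main").prettify()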

print(markdownify(example_html, heading_style="atx"))


# Senior Consultant - Energy & Utilities


* Full-time




* Practice Area: Grid Operations
* Level: IV
* Years of Experience: 2-4

## Company Description


We’re a eight-time “Best Company to Work For,” where intelligent, talented people come together to do outstanding work—and have a lot of fun while they’re at it. We offer a solution-focused environment full of collaboration and dedication, to our goals and to each other. You’ll have the opportunity to drive your own success in a supportive, globally connected environment. From advanced tools and technology to an immersive company culture, working at Logic20/20 means working on the leading edge, with a community of the right people around you.



## Job Description


* **Process Engineering:**
  + Conduct in-depth process analysis and assessments to identify inefficiencies, bottlenecks, and opportunities for improvement.
  + Develop and document detailed process maps, workflows, and procedures.
  + Design and implement process opti