# Job Listings API Example 
This pipeline shows how to use Indexical's `navigate`, `navigate-many`, and `extract` actions to find a company's careers page from its landing page, open each individual job listing, and pull the `job_title`, `job_description`, `salary`, and whether the job is `remote` for each role.

### Setup
In this section, we'll set up the environment for Indexical's API by importing the necessary libraries and saving the API key. Start by entering your API key below. You can generate an API key from the Indexical console by selecting `Keys` and then clicking `New API Key`.

In [25]:
import requests
import json


In [26]:

API_KEY="<REPLACE WITH YOUR API_KEY>"  
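If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `INDEXICAL_API_KEY` are assumptions made for this example, not something Indexical defines.

```python
import os


def load_api_key(env_var="INDEXICAL_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

With the variable exported in your shell, `API_KEY = load_api_key()` can replace the hard-coded assignment above.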

### Add Pipeline
To create scalable data extraction workflows, Indexical allows you to construct a pipeline of high-level steps that describe how to go from a URL / query to clean, structured data. Each step contains an `action` that is tied to a task-specific agent on our end, as well as a natural language `goal` to instruct the agent. We'll start by importing the `job_listings.json` file as our pipeline and printing it below. 

This pipeline will find a company's careers page from the landing page, open up each individual job listing, and pull the `job_title`, `job_description`, `salary`, and whether the job is `remote` for each role. Indexical uses [JSON Schema](https://json-schema.org/) to describe which fields of data to pull. Note that you can edit the schema to pull more or fewer fields based on your use case. For each field, you should provide the name, the field type, and a natural language description that helps the LLM identify what information to pull from the page.
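As a rough illustration of the field format, the sketch below builds an `extract` step in Python instead of editing the JSON file by hand. The `field` helper and the wording of the `salary` description are hypothetical. The `company` entry, the `"$mainContent"` value, and the `goal` are copied from the `job_listings` pipeline in this example. Only the `"string"` type is used here because it is the only type name visible in this example.

```python
import json


def field(description, field_type="string"):
    """Build one schema entry in the same shape as the `company` field.

    Hypothetical helper for this example; it is not part of the Indexical API.
    """
    return {"description": description, "type": field_type}


schema = {
    "job_description": "$mainContent",  # value taken from job_listings.json
    "company": field("the company posting the role"),
    # Illustrative description; adjust the wording to your use case.
    "salary": field("the salary or salary range stated in the listing, if any"),
}

extract_step = {
    "action": "extract",
    "goal": "Structure the information about the current job listing.",
    "schema": schema,
}

print(json.dumps(extract_step, indent=4))
```

Writing the description as a full phrase matters more than the helper itself, since the description is what guides the model toward the right text on the page.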


In [27]:
file_path="pipelines/job_listings.json"
with open(file_path, 'r') as file:
    pipeline_steps = json.load(file)
    print(json.dumps(pipeline_steps, indent=4))


{
    "name": "job_listings",
    "steps": [
        {
            "action": "navigate",
            "goal": "Find a page listing open jobs/positions at the company. Make sure at least 2 job postings are listed on the page"
        },
        {
            "action": "navigate-many",
            "goal": "Find the dedicated pages for each individual job listings, with all the details.",
            "retries": 2,
            "validation": {
                "size": {
                    "min": 2
                }
            }
        },
        {
            "action": "extract",
            "goal": "Structure the information about the current job listing.",
            "schema": {
                "job_description": "$mainContent",
                "company": {
                    "description": "the company posting the role",
                    "type": "string"
                },
                "remote": {
                    "description": "Whether this job is fully remote or it require

Now, we'll call Indexical's `pipelines` API endpoint to save that pipeline for use in future data extraction workflows. Each API request requires an `x-api-key` header, and the JSON pipeline is sent as the request body.

In [28]:
response = requests.post("https://app.indexical.dev/pipelines",
                          headers={'x-api-key': API_KEY}, 
                          json=pipeline_steps)

if response.status_code == 200: 
    # Convert the response to JSON format
    data = response.json()
else:
    print(response.status_code)
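The cell above prints only the status code when the upload fails. A minimal sketch of a stricter variant is shown below, assuming you want a failed upload to stop the notebook with the response body in the error message. The function name `save_pipeline` and the 30-second timeout are choices made for this example. The `post` argument is any callable with the same signature as `requests.post`, which keeps the helper easy to exercise without a network connection.

```python
def save_pipeline(post, pipeline, api_key,
                  url="https://app.indexical.dev/pipelines"):
    """POST `pipeline` to `url` and return the decoded JSON body.

    `post` is expected to behave like `requests.post`. The helper name
    and the timeout value are assumptions made for this sketch.
    """
    response = post(url, headers={"x-api-key": api_key},
                    json=pipeline, timeout=30)
    if response.status_code != 200:
        # Include the start of the body so the cause is visible in the traceback.
        raise RuntimeError(
            f"Pipeline upload failed: HTTP {response.status_code}: "
            f"{response.text[:200]}"
        )
    return response.json()
```

In this notebook it would be called as `data = save_pipeline(requests.post, pipeline_steps, API_KEY)`.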

### Run data extraction job
Once you've saved your pipeline, you can run it on any set of websites / queries. Indexical will use AI to map the pipeline to the relevant selectors and information on each page, gathering and transforming each website into a clean, standardized schema of your choosing.

In this case, we'll run the `job_listings` pipeline on 2 websites supplied at run time.

In [29]:
response = requests.post("https://app.indexical.dev/runs",
                          headers={'x-api-key': API_KEY}, 
                          json={
                              "name" : "job_listings", 
                              "urls" : ["https://loop.com/", "https://www.merge.dev/"], 
                              "proxiesEnabled" : True
                          })



By default, the `runs` endpoint runs the data extraction pipeline asynchronously, and the results become available later (either for download on the developer console or transmitted to a subscriber URL via webhooks). The immediate response contains both the pipeline ID and the run ID, which appear as the `pipeline` and `id` keys in the output below.

In [30]:
if response.status_code == 200: 
    # Convert the response to JSON format
    data = response.json()
    run_id = data['id']
    print(data)
else:
    print(response.status_code)

{'pipeline': 1017, 'id': 1553}


### Getting Results
To get the results programmatically, you can either use [webhooks](https://docs.indexical.dev/runs) or use the `outputs` endpoint. Simply call `https://app.indexical.dev/runs/:runId/outputs` with the `runID` returned by the `runs` endpoint.  

In [37]:
output_endpoint = "https://app.indexical.dev/runs/" + str(run_id) + "/outputs"
response = requests.get(output_endpoint,
                          headers={'x-api-key': API_KEY})




By default, the results will be a JSON document with 2 keys: `results` and `errors`. Each key contains an array of entries. Each result has a `seed` query/URL, the final `url` from which Indexical extracted the information, and a `data` key holding all of the data specified in the pipeline. If Indexical was not able to find information on the page that maps to a specific field, it will either return a null value for that key or omit the key entirely.
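Because a field may be null or missing, it helps to read the `data` dictionary defensively. The sketch below flattens the response into one row per job listing. The helper `flatten_results` is hypothetical, and `sample` is a shortened stand-in shaped like the output of this run rather than a full API response.

```python
def flatten_results(payload):
    """Return one flat dict per result: the extracted fields plus `seed` and `url`."""
    rows = []
    for result in payload.get("results", []):
        data = result.get("data") or {}  # tolerate a missing or null `data`
        rows.append({**data, "seed": result.get("seed"), "url": result.get("url")})
    return rows


# Abbreviated sample in the same shape as the `outputs` response.
sample = {
    "results": [
        {
            "seed": "https://loop.com/",
            "url": "https://boards.greenhouse.io/loop/jobs/5224737004",
            "data": {"title": "Business Analyst", "remote": False, "company": "Loop"},
        }
    ],
    "errors": [],
}

for row in flatten_results(sample):
    # .get() returns None for fields that were not extracted, such as salary here.
    print(row.get("title"), "|", row.get("company"), "|", row.get("salary"))
```

The same function can be applied to the `data` variable parsed in the next cell, and the resulting list of dictionaries is convenient to pass to a CSV writer or a dataframe.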

In [38]:
if response.status_code == 200: 
    # Convert the response to JSON format
    data = response.json()
    print(json.dumps(data, indent=4))
else:
    print(response.status_code)

{
    "results": [
        {
            "id": 7594536,
            "pipeline": 1017,
            "run": 1553,
            "seed": "https://loop.com/",
            "url": "https://boards.greenhouse.io/loop/jobs/5224737004",
            "data": {
                "title": "Business Analyst",
                "remote": false,
                "company": "Loop",
                "summary": "Loop is hiring an analyst to support the business as it grows. This role will work directly with customer success, product, engineering and business operations to build key reporting, validate Loop\u2019s technology outputs, and support client demos.",
                "job_description": "About Loop\n\nLoop is on a mission to unlock profits trapped in the supply chain\n[https://loop.com/article/unlock-profit-trapped-in-your-supply-chain] and lower\ncosts for consumers. Bad data and inefficient workflows create friction that\nlimits working capital and raises costs for every supply chain stakeholder.\n\nLoop