# Crawl Site API Example 
This pipeline shows how to use Indexical's `crawl` action to crawl through a domain and scrape each page. 

### Setup
In this section, we'll set up the environment for Indexical's API by importing the necessary libraries and saving the API key. Enter your API key below; you can generate one from the Indexical console by selecting `Keys` and then `New API Key`.

In [5]:
import requests
import json


In [6]:

API_KEY="<REPLACE WITH YOUR API_KEY>"  
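
A key pasted into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The variable name `INDEXICAL_API_KEY` and the helper `load_api_key` are illustrative assumptions, not part of Indexical's API.

```python
import os

# The placeholder string used in the cell above; treated as "no key set".
PLACEHOLDER = "<REPLACE WITH YOUR API_KEY>"


def load_api_key(env_var="INDEXICAL_API_KEY"):
    """Return the API key stored in an environment variable.

    The default variable name is a hypothetical choice for this sketch.
    Raises RuntimeError if the variable is unset, empty, or still the placeholder.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

With the variable set in your shell, `API_KEY = load_api_key()` would replace the hard-coded assignment.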

### Add Pipeline
To create scalable data extraction workflows, Indexical allows you to construct a pipeline of high-level steps that describe how to go from a URL / query to clean, structured data. Each step contains an `action` that is tied to a task-specific agent on our end, as well as a natural language `goal` to instruct the agent. We'll start by importing the `crawl_site.json` file as our pipeline and printing it below. 

This pipeline crawls the first `10` URLs on each website and downloads the main content of each page. You can adjust the number of pages scraped or remove the limit entirely; without a limit, Indexical crawls every page in the domain. To crawl only a subset of the pages, supply a regex in `pattern`, and Indexical will only open links that match it.


In [7]:
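# Load the pipeline definition from disk
file_path = "pipelines/crawl_site.json"
with open(file_path, "r") as file:
    pipeline_steps = json.load(file)

# Pretty-print the pipeline so the steps are easy to inspect
print(json.dumps(pipeline_steps, indent=4))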


{
    "name": "crawl_site",
    "steps": [
        {
            "action": "crawl",
            "limit": 10
        },
        {
            "goal": "extract the following information",
            "action": "extract",
            "schema": {
                "content": "$mainContent"
            }
        }
    ]
}
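
The same pipeline can also be assembled in Python rather than loaded from a file, which makes the page limit and link filter easy to vary. The sketch below rests on assumptions: `build_crawl_pipeline` is a hypothetical helper, and placing `pattern` next to `limit` on the `crawl` step is inferred from the description above, not confirmed against the API reference.

```python
import re


def build_crawl_pipeline(name="crawl_site", limit=10, pattern=None):
    """Build a pipeline dict shaped like the crawl_site.json printed above.

    limit=None omits the key, which per the text above removes the page limit.
    pattern is an optional regex; its position on the crawl step is an assumption.
    """
    crawl_step = {"action": "crawl"}
    if limit is not None:
        crawl_step["limit"] = limit
    if pattern is not None:
        # Catch typos early. This checks Python's regex syntax only;
        # the dialect Indexical accepts may differ.
        re.compile(pattern)
        crawl_step["pattern"] = pattern

    extract_step = {
        "goal": "extract the following information",
        "action": "extract",
        "schema": {"content": "$mainContent"},
    }
    return {"name": name, "steps": [crawl_step, extract_step]}
```

For example, `build_crawl_pipeline(limit=None, pattern=r"/subjects/")` would describe an unlimited crawl restricted to links containing `/subjects/`, under the assumptions stated above.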


Now, we'll call Indexical's `pipelines` API endpoint to save that pipeline for use in future data extraction workflows. Each API request requires an `x-api-key` header, and the JSON pipeline is sent as the request body.

In [8]:
response = requests.post("https://app.indexical.dev/pipelines",
                          headers={'x-api-key': API_KEY}, 
                          json=pipeline_steps)

if response.status_code == 200:
    # Parse the JSON body returned for the saved pipeline
    data = response.json()
else:
    # Show the status and body so a failed save is easy to diagnose
    print(response.status_code, response.text)
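
Every call in this notebook repeats the same header and status check, so the pattern can be factored into one helper. This is a sketch, not Indexical's client: `post_json` is a hypothetical name, and the HTTP client is passed in as `http` so the helper works with the `requests` module, a `requests.Session`, or a stub.

```python
def post_json(http, url, api_key, payload):
    """POST payload as JSON with the x-api-key header and return the parsed body.

    http is any object with a requests-style post(url, headers=..., json=...)
    method, for example the requests module itself.
    Raises RuntimeError on any status other than 200, matching the check above.
    """
    response = http.post(url, headers={"x-api-key": api_key}, json=payload)
    if response.status_code != 200:
        raise RuntimeError(
            f"POST {url} failed with status {response.status_code}: {response.text}"
        )
    return response.json()
```

In this notebook the call would look like `post_json(requests, "https://app.indexical.dev/pipelines", API_KEY, pipeline_steps)`.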

### Run data extraction job
Once you've saved your pipeline, you can run it on any set of websites / queries. Indexical uses AI to map the pipeline to the relevant selectors and information on each page, gathering and transforming each website's content into a clean, standardized schema of your choosing.

In this case, we'll run the `crawl_site` pipeline and pass in 2 websites at run time.

In [9]:
response = requests.post("https://app.indexical.dev/runs",
                          headers={'x-api-key': API_KEY}, 
                          json={
                              "name" : "crawl_site", 
                              "urls" : ["https://www.docwirenews.com/", "https://www.nature.com/latest-news"]
                          })



By default, the `runs` endpoint runs the data extraction pipeline and returns the results asynchronously (either available for download on the developer console or transmitted to a subscriber URL via webhooks). The endpoint's immediate response contains the pipeline ID (`pipeline`) and the run ID (`id`).

In [12]:
if response.status_code == 200: 
    # Convert the response to JSON format
    data = response.json()
    run_id = data['id']
    print(data)
else:
    print(response.status_code)

{'pipeline': 1015, 'id': 1539}


### Getting Results
To get the results programmatically, you can either use [webhooks](https://docs.indexical.dev/runs) or use the `outputs` endpoint. Simply call `https://app.indexical.dev/runs/:runId/outputs` with the `runID` returned by the `runs` endpoint.  

In [14]:
output_endpoint = "https://app.indexical.dev/runs/" + str(run_id) + "/outputs"
response = requests.get(output_endpoint,
                          headers={'x-api-key': API_KEY})
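
Because runs complete asynchronously, a single request made right after starting the run may come back before any output exists. A minimal polling sketch follows. It assumes the `outputs` endpoint returns empty `results` and `errors` arrays until the run has produced something, which this notebook does not confirm, and `wait_for_outputs` is a hypothetical helper. A non-empty response may also be partial if the crawl is still in progress.

```python
import time


def wait_for_outputs(fetch, max_attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it returns non-empty results or errors.

    fetch is a zero-argument callable returning the parsed JSON body of the
    outputs endpoint, or None when the request did not succeed.
    Raises TimeoutError if nothing arrives within max_attempts calls.
    """
    for attempt in range(max_attempts):
        body = fetch()
        if body and (body.get("results") or body.get("errors")):
            return body
        # Wait between attempts, but not after the final one
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"No outputs after {max_attempts} attempts")


# Possible usage with the variables defined in this notebook:
#
# def fetch():
#     r = requests.get(output_endpoint, headers={'x-api-key': API_KEY})
#     return r.json() if r.status_code == 200 else None
#
# data = wait_for_outputs(fetch)
```

For production workloads, the webhook delivery mentioned in this notebook avoids polling altogether.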




By default, the response is a JSON object with 2 keys, `results` and `errors`, each containing an array. Each result has a `seed` query/URL, the final `url` from which Indexical extracted the information, and a `data` key holding all of the data specified in the pipeline. If Indexical cannot find information on the page that maps to a specific element, it will either return `NULL` or omit that key.
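
Since a crawl fans out from each seed to many pages, it can help to group the results by `seed` before inspecting them. The sketch below assumes only the `seed`, `url`, and `data` keys described above; `pages_by_seed` is a hypothetical helper, and it treats a missing or null `content` as empty.

```python
from collections import defaultdict


def pages_by_seed(outputs):
    """Group crawled pages by the seed URL that led to them.

    Returns {seed: [{"url": ..., "chars": ...}, ...]}, where chars is the
    length of the extracted content (0 when the key is missing or null).
    """
    grouped = defaultdict(list)
    for result in outputs.get("results", []):
        data = result.get("data") or {}
        content = data.get("content")
        grouped[result.get("seed")].append(
            {"url": result.get("url"), "chars": len(content) if content else 0}
        )
    return dict(grouped)
```

Calling `pages_by_seed(data)` on the parsed response gives a quick view of how many pages each seed produced and which ones came back empty.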

In [15]:
if response.status_code == 200: 
    # Convert the response to JSON format
    data = response.json()
    print(json.dumps(data, indent=4))
else:
    print(response.status_code)

{
    "results": [
        {
            "id": 7594247,
            "pipeline": 1015,
            "run": 1539,
            "seed": "https://www.nature.com/latest-news",
            "url": "https://www.nature.com/subjects",
            "data": {
                "content": "[/static/images/hero/subjects-hero.jpg]\n\n\nLATEST RESEARCH AND NEWS BY SUBJECT\n\nLearn about the latest research, reviews and news from across all of the Nature\njournals by subject\n\n\nFind a subject\nSearch\n\n\n\nBIOLOGICAL SCIENCES\n\n * Biochemistry [https://www.nature.com/subjects/biochemistry]\n * Biological techniques [https://www.nature.com/subjects/biological-techniques]\n * Biophysics [https://www.nature.com/subjects/biophysics]\n * Biotechnology [https://www.nature.com/subjects/biotechnology]\n * Cancer [https://www.nature.com/subjects/cancer]\n * Cell biology [https://www.nature.com/subjects/cell-biology]\n * Chemical biology [https://www.nature.com/subjects/chemical-biology]\n * Computational biology