<div align="center" id="top">
<img src="https://socialify.git.ci/julep-ai/julep/image?description=1&descriptionEditable=Serverless%20AI%20Workflows%20for%20Data%20%26%20ML%20Teams&font=Source%20Code%20Pro&logo=https%3A%2F%2Fraw.githubusercontent.com%2Fjulep-ai%2Fjulep%2Fdev%2F.github%2Fjulep-logo.svg&owner=1&forks=1&pattern=Solid&stargazers=1&theme=Auto" alt="julep" />

<br>
  <p>
    <a href="https://dashboard.julep.ai">
      <img src="https://img.shields.io/badge/Get_API_Key-FF5733?style=logo=" alt="Get API Key" height="28">
    </a>
    <span>&nbsp;</span>
    <a href="https://docs.julep.ai">
      <img src="https://img.shields.io/badge/Documentation-4B32C3?style=logo=gitbook&logoColor=white" alt="Documentation" height="28">
    </a>
  </p>
  <p>
   <a href="https://www.npmjs.com/package/@julep/sdk"><img src="https://img.shields.io/npm/v/%40julep%2Fsdk?style=social&amp;logo=npm&amp;link=https%3A%2F%2Fwww.npmjs.com%2Fpackage%2F%40julep%2Fsdk" alt="NPM Version" height="28"></a>
    <span>&nbsp;</span>
    <a href="https://pypi.org/project/julep"><img src="https://img.shields.io/pypi/v/julep?style=social&amp;logo=python&amp;label=PyPI&amp;link=https%3A%2F%2Fpypi.org%2Fproject%2Fjulep" alt="PyPI - Version" height="28"></a>
    <span>&nbsp;</span>
    <a href="https://hub.docker.com/u/julepai"><img src="https://img.shields.io/docker/v/julepai/agents-api?sort=semver&amp;style=social&amp;logo=docker&amp;link=https%3A%2F%2Fhub.docker.com%2Fu%2Fjulepai" alt="Docker Image Version" height="28"></a>
    <span>&nbsp;</span>
    <a href="https://choosealicense.com/licenses/apache/"><img src="https://img.shields.io/github/license/julep-ai/julep" alt="GitHub License" height="28"></a>
  </p>
  
  <h3>
    <a href="https://discord.com/invite/JTSBGRZrzj" rel="dofollow">Discord</a>
    ·
    <a href="https://x.com/julep_ai" rel="dofollow">𝕏</a>
    ·
    <a href="https://www.linkedin.com/company/julep-ai" rel="dofollow">LinkedIn</a>
  </h3>
</div>

## Task Definition: Spider Crawler Integration

### Overview

This task uses the Spider `integration` tool together with a prompt step to crawl the website at a given URL and then summarize the results.

### Task Tools:

**Spider Crawler**: An `integration` type tool that can crawl the web and extract data from a given URL.

### Task Input:

**url**: The URL of the website to crawl.

### Task Output:

**output**: A short summary of the crawled content. The crawling step itself produces a dictionary whose `documents` key holds the data extracted from the given URL; check the transition output below for its detailed shape.

### Task Flow

1. **Input**: The user provides a URL to crawl.

2. **Spider Tool Integration**: The `spider_crawler` tool is called to crawl the web and extract data from the given URL.

3. **Prompt Step**: The prompt step is used to create a summary of the results from the spider tool.

4. **Output**: The final output is the summary of the results from the spider tool.

```plaintext
+----------+     +-------------+     +------------+     +-----------+
|  User    |     |   Spider    |     |   Prompt   |     |  Output   |
|  Input   | --> |   Crawler   | --> |   Step     | --> |   Step    |
| (URL)    |     |             |     |            |     | Output    |
+----------+     +-------------+     +------------+     +-----------+
      |                |                  |                  |
      |                |                  |                  |
      v                v                  v                  v
   "https://spider.cloud"   Extract data   Create summary   "Here are the
                            from URL       of results        results from the
                                                             spider tool"
```


## Implementation

To recreate the notebook and see the code implementation for this task, you can access the Google Colab notebook using the link below:

<a target="_blank" href="https://colab.research.google.com/github/julep-ai/julep/blob/dev/cookbooks/01-website-crawler.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Additional Information

For more details about the task or if you have any questions, please don't hesitate to contact the author:

**Author:** Julep AI  
**Contact:** [hey@julep.ai](mailto:hey@julep.ai) or  <a href="https://discord.com/invite/JTSBGRZrzj" rel="dofollow">Discord</a>

## Installing the Julep Client

In [15]:
!pip install --upgrade julep --quiet

#### NOTE:

- UUIDs are generated for both the agent and task to uniquely identify them within the system.
- Once created, these UUIDs should remain unchanged for simplicity.
- Altering a UUID will result in the system treating it as a new agent or task.
- If a UUID is changed, the original agent or task will continue to exist in the system alongside the new one.

In [2]:
import uuid

# NOTE: these UUIDs let us use the `create_or_update` methods instead of
# the `create` methods, so that new resources are not created every time a cell is run.
AGENT_UUID = uuid.uuid4()
TASK_UUID = uuid.uuid4()
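
The cell above draws random UUIDs, so re-running it after a kernel restart produces new identifiers. One possible way to keep them unchanged, as the notes recommend, is to derive them deterministically with `uuid.uuid5`. This is only a sketch and not part of the original notebook; the name strings are hypothetical placeholders.

```python
import uuid

# Hypothetical name strings; any fixed, unique strings would work.
AGENT_NAME = "cookbook-01-website-crawler/agent"
TASK_NAME = "cookbook-01-website-crawler/task"

# uuid5 hashes a namespace and a name, so the same inputs always
# produce the same UUID, unlike uuid4, which is random on every call.
STABLE_AGENT_UUID = uuid.uuid5(uuid.NAMESPACE_URL, AGENT_NAME)
STABLE_TASK_UUID = uuid.uuid5(uuid.NAMESPACE_URL, TASK_NAME)

print(STABLE_AGENT_UUID, STABLE_TASK_UUID)
```

Hard-coding two UUID strings would serve the same purpose; the point is only that the values stay the same between runs.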

## Creating Julep Client with the API Key

Get your API key from the [Julep dashboard](https://dashboard.julep.ai/).

In [3]:
from julep import Client
import os

JULEP_API_KEY = "YOUR_JULEP_API_KEY"

# Create a Julep client
client = Client(api_key=JULEP_API_KEY, environment="production")
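
To avoid pasting the key into the notebook, it could instead be read from an environment variable. The helper below is a hypothetical sketch, and the variable name `JULEP_API_KEY` is an assumption rather than something the SDK requires.

```python
import os


def load_api_key(var_name: str = "JULEP_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The result would then be passed to the client in place of the literal string, for example `Client(api_key=load_api_key(), environment="production")`.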

### Creating an "agent"

An agent is the object to which LLM settings, such as the model and temperature, as well as tools, are scoped.

To learn more about the agent, please refer to the Agent section in [Julep Concepts](https://docs.julep.ai/docs/concepts/agents).

In [4]:
# Create agent
agent = client.agents.create_or_update(
    agent_id=AGENT_UUID,
    name="Spiderman",
    about="AI that can crawl the web and extract data",
    model="gpt-4o",
)

### Defining a Task

Tasks in Julep are GitHub Actions-style workflows that define long-running, multi-step actions.

You can use them to conduct complex actions by defining them step-by-step.

To learn more about tasks, please refer to the `Tasks` section in [Julep Concepts](https://docs.julep.ai/docs/concepts/tasks).

In [6]:
import yaml

SPIDER_API_KEY = "YOUR_SPIDER_API_KEY"

# Define the task
task_def = yaml.safe_load(f"""       
# yaml-language-server: $schema=https://raw.githubusercontent.com/julep-ai/julep/refs/heads/dev/schemas/create_task_request.json                
name: Julep Crawling Task
description: Crawl a website and create a summary of the results

########################################################
####################### INPUT SCHEMA ###################
########################################################

input_schema:
  type: object
  properties:
    url:
      type: string
      description: The URL of the website to crawl

########################################################
####################### TOOLS ##########################
########################################################

# Define the tools that the task will use in this workflow
tools:
- name: spider_crawler
  type: integration
  integration:
    provider: spider
    setup:
      spider_api_key: "{SPIDER_API_KEY}"

########################################################
####################### MAIN WORKFLOW ##################
########################################################

main:

# Step 0: Call the spider_crawler tool with the url input
- tool: spider_crawler
  arguments:
    url: $_['url'] # You can also use 'steps[0].input.url'
    params:
      request: smart_mode
      limit: "1"
      return_format: markdown
      proxy_enabled: $ True
      filter_output_images: $ True
      filter_output_svg: $ True
      readability: $ True

# Step 1: Evaluate step to join the content of the crawled pages into a single string
- evaluate:
    documents: |
      $ " ".join(
      list(
        page['content'] for page in _['result']
        )
      )
      
# Step 2: Prompt step to create a summary of the results
- prompt: |
    $ f'''
    You are {{agent.about}}
    I have given you this url: {{steps[0].input.url}}
    And you have crawled that website. Here are the results you found:
    {{_['documents']}}
    I want you to create a short summary (no longer than 100 words) of the results you found while crawling that website.
    '''
    unwrap: True
""")

<span style="color:olive;">Notes:</span>
- The `unwrap: True` in the prompt step makes the step return only the message content (`choices[0].message.content`) instead of the full model output.
- The `$` sign marks a value as a Python expression rather than a plain string.
- The `_` refers to the output of the previous step.
- The `steps[index].input` refers to the input of the step at `index`.
- The `steps[index].output` refers to the output of the step at `index`.
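
To make the role of `_` concrete, the expression from the evaluate step can be run locally in plain Python. The dictionary below is a made-up stand-in for the previous step's output, so its contents are an assumption; the real payload may carry more fields.

```python
# Hypothetical stand-in for the output of the previous (crawling) step.
_ = {
    "result": [
        {"content": "First page text."},
        {"content": "Second page text."},
    ]
}

# Same expression as in the evaluate step, minus the leading `$`.
documents = " ".join(list(page["content"] for page in _["result"]))

print(documents)  # First page text. Second page text.
```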

In [7]:
# creating the task object
task = client.tasks.create_or_update(
    task_id=TASK_UUID,
    agent_id=AGENT_UUID,
    **task_def
)
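
One detail of the task definition is easy to miss: the YAML is written inside a Python f-string. Single braces such as `{SPIDER_API_KEY}` are filled in by Python when the cell runs, while doubled braces such as `{{agent.about}}` are escaped and reach the task as single braces, presumably so that they are evaluated later as part of the `$` expression. The short sketch below shows only the Python side of this behavior.

```python
SPIDER_API_KEY = "YOUR_SPIDER_API_KEY"

# Inside an f-string, single braces are interpolated by Python right away,
# while doubled braces are escaped and come out as literal single braces.
snippet = f"""
spider_api_key: "{SPIDER_API_KEY}"
prompt: You are {{agent.about}}
"""

print(snippet)
```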

### Creating an Execution

An execution is a single run of a task with a specific set of inputs.

To learn more about executions, please refer to the `Executions` section in [Julep Concepts](https://docs.julep.ai/docs/concepts/execution).

In [8]:
# creating an execution object
execution = client.executions.create(
    task_id=TASK_UUID,
    input={
        "url": "https://spider.cloud"
    }
)
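
Since the input schema only requires `url` to be a string, a basic sanity check before creating the execution can catch typos early. `build_input` is a hypothetical helper, not part of the Julep SDK.

```python
from urllib.parse import urlparse


def build_input(url: str) -> dict:
    """Return an execution input dict after a basic URL sanity check."""
    parsed = urlparse(url)
    # Require an absolute http(s) URL with a host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    return {"url": url}
```

The returned dictionary has the same shape as the `input` argument used in the cell above.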

## Checking execution details and output

There are multiple ways to get the execution details and the output:

1. **Get Execution Details**: This method retrieves the details of the execution, including the output of the last transition that took place.

2. **List Transitions**: This method lists every task step that has been executed so far. Transitions are returned in reverse chronological order, so for a successful execution the final output is the output of the first item in the list, which should have a type of `finish`.
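
The second option can be sketched offline as follows. The transition objects here are fakes built with `SimpleNamespace`; only the `type` and `output` attributes used later in this notebook are assumed, and `final_output` is a hypothetical helper.

```python
from types import SimpleNamespace


def final_output(transitions):
    """Return the output of the `finish` transition, or None if absent."""
    for transition in transitions:
        if transition.type == "finish":
            return transition.output
    return None


# Hypothetical transitions, newest first, mimicking the list order.
fake = [
    SimpleNamespace(type="finish", output="summary text"),
    SimpleNamespace(type="step", output={"documents": "..."}),
    SimpleNamespace(type="init", output={"url": "https://spider.cloud"}),
]

print(final_output(fake))  # summary text
```

Searching by type instead of indexing the first item also handles an execution that has not finished yet, in which case the helper returns `None`.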


<span style="color:olive;">Note: You need to wait for a few seconds for the execution to complete before you can get the final output, so feel free to run the following cells multiple times until you get the final output.</span>
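
Instead of re-running the cells by hand, the waiting can be automated with a small polling loop. This is a sketch under assumptions: the terminal status names below and the `status` attribute should be checked against the SDK version in use, and `wait_for_execution` is a hypothetical helper.

```python
import time

# Assumed terminal status names; verify them against the SDK in use.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_execution(fetch, interval=2.0, timeout=120.0):
    """Call `fetch()` until the returned object reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        execution = fetch()
        if execution.status in TERMINAL_STATUSES:
            return execution
        if time.monotonic() >= deadline:
            raise TimeoutError("Execution did not finish in time.")
        time.sleep(interval)
```

With the client from this notebook, `fetch` could be `lambda: client.executions.get(execution_id)`, where `execution_id` holds the ID of the execution created earlier.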


In [9]:
# Get execution details
execution = client.executions.get(execution.id)
# Print the output
print(execution.output)

Spider.cloud is a leading web crawling tool designed for AI applications, offering high-speed, scalable, and cost-effective data collection solutions. Built in Rust, it can crawl over 20,000 pages in seconds, making it significantly faster and cheaper than traditional scrapers. Spider supports various data formats, including LLM-ready markdown, and integrates seamlessly with major AI tools. It offers features like auto proxy rotations, custom browser scripting, and caching to enhance performance. Users can start with $200 in credits and explore features through a free trial. Spider is trusted by tech businesses worldwide for insightful data solutions.


In [10]:
# Lists all the task steps that have been executed up to this point in time
transitions = client.executions.transitions.list(execution_id=execution.id).items

# Transitions are retrieved in reverse chronological order
for transition in reversed(transitions):
    print("Transition type: ", transition.type)
    print("Transition output: ", transition.output)
    print("-"*50)

Transition type:  init
Transition output:  {'url': 'https://spider.cloud'}
--------------------------------------------------
Transition type:  step
Transition output:  {'documents': [{'id': None, 'metadata': {'description': 'Experience cutting-edge web crawling with unparalleled speeds, perfect for LLMs, Machine Learning, and Artificial Intelligence. The fastest and most efficient web scraper tailored for AI applications.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 10031, 'keywords': ['AI agent stack', 'AWS infrastructure reduced', 'Auto Proxy rotations', 'Comprehensive Data Curation', 'Concurrent Streaming Save time', 'Data Collecting Projects Today Jumpstart web crawling', 'FAQ Frequently asked questions', 'Fastest Web Crawler', 'Latest sports news', 'Multiple response formats', 'Open Source Spider engine', 'Performance Tuned Spider', 'Seamless Integrations Seamlessly integrate Spider', 'Smart Mode Spider dynamically switches', 'Spider accurately crawls', 'Spide

### Running the same task with a different URL

We will reuse the same code to run the task again, this time with a different URL.

In [11]:
execution = client.executions.create(
    task_id=TASK_UUID,
    input={
        "url": "https://www.harvard.edu/"
    }
)

In [13]:
execution = client.executions.get(execution.id)
print("\n".join(execution.output.split(". ")))

Harvard University's website emphasizes its commitment to excellence in teaching, learning, and research
It highlights initiatives related to food, including nutrition, sustainability, and healthful eating
The site features experts like Christina Warinner and Leah Penniman, and initiatives like the Harvard Food Systems Initiative and Food Literacy Project
It explores topics such as junk food cravings, vegan diets, and the impact of avocados on heart disease
The site also showcases Harvard's efforts in sustainable food practices, food donation programs, and educational resources like free online cooking courses
Additionally, it highlights the contributions of chefs within the Harvard community.


<span style="color:olive;">Note: you can get the output of the crawling step by accessing the corresponding transition's output from the transitions list.</span>

Example:

In [14]:
transitions = client.executions.transitions.list(execution_id=execution.id).items

transitions[1].output

{'documents': [{'id': None,
   'metadata': {'description': 'Harvard University is devoted to excellence in teaching, learning, and research, and to developing leaders who make a difference globally.',
    'domain': 'www.harvard.edu',
    'extracted_data': None,
    'file_size': 15310,
    'keywords': ['Business School podcast',
     'Christina Warinner Christina co-authored',
     'Dental Medicine shares advice',
     'Dining Services Learnings Report Learn',
     'Dining Services team',
     'Education alum started Bite Sized Education',
     'Expand Image Joanne Chang Stephanie Mitchell',
     'Expand Image Julia Child Paul Child Julia Child',
     'Expand Image Ludger Wessels Chef Wessels',
     'Expand Image Nick DiGiovanni Kris Snibbe',
     'Expand Image Nisha Vora Photo',
     'Flour Bakery owner Joanne Chang',
     'Food Donation Program',
     'Food Food nourishes',
     'Food Law',
     'Food Literacy Project',
     'Food Literacy Project hosts',
     'Free cooking courses Le