
feat: Add Apify integration #998

Merged · 3 commits merged on May 15, 2023
48 changes: 48 additions & 0 deletions docs/docs/ecosystem/apify.md
@@ -0,0 +1,48 @@
# Apify

This page covers how to use [Apify](https://apify.com) within LangChain.

## Overview

Apify is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.

[![Apify Actors](/img/ApifyActors.png)](https://apify.com/store)

This integration enables you to run Actors on the Apify platform and load their results into LangChain to feed your vector
indexes with documents and data from the web, e.g. to generate answers from websites with documentation,
blogs, or knowledge bases.

## Installation and Setup

- Install the [Apify API client](https://npmjs.com/package/apify-client) using your favorite package manager:

```bash npm2yarn
npm install apify-client
```

- Get your [Apify API token](https://console.apify.com/account/integrations) and either set it as
an environment variable (`APIFY_API_TOKEN`) or pass it to the `ApifyWrapper` in the constructor.
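As a quick sketch of the two options (the fallback to the `APIFY_API_TOKEN` environment variable when no token is passed to the constructor is an assumption based on the description above):

```ts
// Option 1: set the token in the environment before constructing the wrapper.
// (Assumed: ApifyWrapper reads APIFY_API_TOKEN when no token is passed.)
process.env.APIFY_API_TOKEN = "your-apify-api-token";

// Option 2: pass the token to the constructor explicitly, e.g.:
// const apify = new ApifyWrapper("your-apify-api-token");
```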

## Wrappers

### Utility

You can use the `ApifyWrapper` to run Actors on the Apify platform.

```ts
import { ApifyWrapper } from "langchain/tools";
```

For a more detailed walkthrough of this wrapper, see [this guide](../modules/agents/tools/integrations/apify.md).

### Loader

You can also use our `ApifyDatasetLoader` to load data from an Apify dataset.

```ts
import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
```

For a more detailed walkthrough of this loader, see [this guide](../modules/indexes/document_loaders/examples/web_loaders/apify_dataset.md).
94 changes: 94 additions & 0 deletions docs/docs/modules/agents/tools/integrations/apify.md
@@ -0,0 +1,94 @@
# Apify

This guide shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain
to run Actors and feed their results into LangChain.

## Overview

[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various web scraping, crawling, and data extraction use cases.
For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, etc.

In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,
and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.

## Setup

```bash npm2yarn
npm install apify-client
```

First, import `ApifyWrapper` and some other classes into your source code:

```ts
import { OpenAI } from "langchain/llms/openai";
import { RetrievalQAChain } from "langchain/chains";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { ApifyWrapper } from "langchain/tools";
import { Document } from "langchain/document";
```

Initialize the wrapper with your [Apify API token](https://console.apify.com/account/integrations) and, for the purposes of this example, also initialize the OpenAI model with your OpenAI API key:

```ts
const OPENAI_API_KEY = "Your OpenAI API key";
const APIFY_API_TOKEN = "Your Apify API token";

const model = new OpenAI({ openAIApiKey: OPENAI_API_KEY });
const apify = new ApifyWrapper(APIFY_API_TOKEN);
```

Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.

Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this guide](../../../indexes/document_loaders/examples/web_loaders/apify_dataset.md). In that guide, you'll also find the explanation of the `datasetMappingFunction`, which is used to map fields from the Apify dataset records to LangChain `Document` fields.

```ts
const loader = await apify.callActor(
"apify/website-content-crawler",
{ startUrls: [{ url: "https://js.langchain.com/docs/" }] },
(item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
})
);
const docs = await loader.load();
```

Initialize the vector index from the crawled documents:

```ts
const vectorStore = await HNSWLib.fromDocuments(
docs,
new OpenAIEmbeddings({ openAIApiKey: OPENAI_API_KEY })
);
```

Next, create the retrieval chain and enter a query:

```ts
const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever(), {
returnSourceDocuments: true,
});
const res = await chain.call({ query: "What is LangChain?" });
```

And finally, output the results:

```ts
console.log(res.text);
console.log(res.sourceDocuments.map((d) => d.metadata.source));
```

```
LangChain is a framework for developing applications powered by language models.
[
'https://js.langchain.com/docs/',
'https://js.langchain.com/docs/modules/chains/',
'https://js.langchain.com/docs/modules/chains/llmchain/',
'https://js.langchain.com/docs/category/functions-4'
]
```
2 changes: 2 additions & 0 deletions docs/docs/modules/agents/tools/integrations/index.mdx
@@ -10,6 +10,7 @@ import DocCardList from "@theme/DocCardList";

LangChain provides the following tools you can use out of the box:

- [`ApifyWrapper`][Apify] - A wrapper around the Apify platform. Useful for web scraping and data extraction.
- [`AWSLambda`][AWSLambda] - A wrapper around the AWS Lambda API, invoked via the Amazon Web Services Node.js SDK. Useful for invoking serverless functions with any behavior which you need to provide to an Agent.
- [`BingSerpAPI`][BingSerpAPI] - A wrapper around the Bing Search API. Useful for when you need to answer questions about current events. Input should be a search query.
- [`Calculator`][Calculator] - Useful for getting the result of a math expression. The input to this tool should be a valid mathematical expression that could be executed by a simple calculator.
@@ -21,6 +22,7 @@ LangChain provides the following tools you can use out of the box:
- [`VectorStoreQATool`][VectorStoreQATool] - Useful for retrieving relevant text data from a vector store.
- [`ZapierNLARunAction`][ZapierNLARunAction] - A wrapper around the Zapier NLP API. Useful for triggering Zapier actions with a natural language input. Best when used in a [`ZapierToolkit`][ZapierToolkit].

[Apify]: ./apify
[AWSLambda]: /docs/api/tools_aws_lambda/classes/AWSLambda
[BingSerpAPI]: /docs/api/tools/classes/BingSerpAPI
[Calculator]: /docs/api/tools_calculator/classes/Calculator
@@ -0,0 +1,117 @@
# Apify Dataset

This guide shows how to use the [Apify integration](../../../../../ecosystem/apify.md) for LangChain
to load documents from an Apify Dataset.

## Overview

[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various web scraping, crawling, and data extraction use cases.

This guide shows how to load documents
from an [Apify Dataset](https://docs.apify.com/platform/storage/dataset)—a scalable, append-only
storage built for structured web scraping results,
such as a list of products or Google SERPs, which can then be exported to various
formats like JSON, CSV, or Excel.

Datasets are typically used to save the results of Actors.
For example, the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor
deeply crawls websites such as documentation, knowledge bases, help centers, or blogs,
and then stores the text content of webpages into a dataset,
from which you can feed the documents into a vector index and answer questions from it.
For details, see [this guide](../../../../agents/tools/integrations/apify.md).

## Setup

```bash npm2yarn
npm install apify-client
```

You need to have an existing dataset on the Apify platform. If you don't have one, please first check out [this guide](../../../../agents/tools/integrations/apify.md) on how to use Apify to extract content from documentation, knowledge bases, help centers, or blogs.

First, import `ApifyDatasetLoader` into your source code:

```ts
import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { Document } from "langchain/document";
```

Then provide a function that maps Apify dataset record fields to LangChain `Document` format.

For example, if your dataset items are structured like this:

```json
{
"url": "https://apify.com",
"text": "Apify is the best web scraping and automation platform."
}
```

The mapping function in the code below converts them to LangChain `Document` format, so that you can use them further with any LLM (e.g. for question answering).

```ts
const loader = new ApifyDatasetLoader(
"your-dataset-id",
(item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
})
);
const docs = await loader.load();
```
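To see the mapping step in isolation, here is a self-contained sketch that applies the same logic to the sample item above. It uses a minimal local `Doc` type in place of LangChain's `Document`, purely for illustration:

```ts
// Minimal stand-in for LangChain's Document, for illustration only.
type Doc = { pageContent: string; metadata: { source: unknown } };

// The same mapping logic as in the loader above.
const datasetMappingFunction = (item: Record<string, unknown>): Doc => ({
  pageContent: (item.text || "") as string,
  metadata: { source: item.url },
});

const doc = datasetMappingFunction({
  url: "https://apify.com",
  text: "Apify is the best web scraping and automation platform.",
});
// doc.pageContent holds the item's text; doc.metadata.source holds its URL.
// Items with no `text` field map to an empty pageContent.
```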

## An example with question answering

In this example, we use data from a dataset to answer a question.

```ts
import { OpenAI } from "langchain/llms/openai";
import { RetrievalQAChain } from "langchain/chains";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { Document } from "langchain/document";

const OPENAI_API_KEY = "Your OpenAI API key";

// Initialize the LLM to use to answer the question.
const model = new OpenAI({ openAIApiKey: OPENAI_API_KEY });
// Load the data from Apify Dataset
const loader = new ApifyDatasetLoader(
"your-dataset-id",
(item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
})
);
const docs = await loader.load();
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(
docs,
new OpenAIEmbeddings({ openAIApiKey: OPENAI_API_KEY })
);

// Create a chain that uses the OpenAI LLM and HNSWLib vector store.
const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever(), {
returnSourceDocuments: true,
});
const res = await chain.call({
query: "What is Apify?",
});
// Output the results
console.log(res.text);
console.log(res.sourceDocuments.map((d: Document) => d.metadata.source));
```

```
Apify is a cloud platform that helps you build reliable web scrapers, fast, and automate anything you can do manually in a web browser.
[
'https://docs.apify.com/platform',
'https://docs.apify.com/platform/integrations',
'https://docs.apify.com/platform/actors/publishing/monetize',
'https://docs.apify.com/platform/security'
]
```
Binary file added docs/static/img/ApifyActors.png
3 changes: 3 additions & 0 deletions langchain/.gitignore
@@ -130,6 +130,9 @@ document_loaders.d.ts
document_loaders/base.cjs
document_loaders/base.js
document_loaders/base.d.ts
document_loaders/web/apify_dataset.cjs
document_loaders/web/apify_dataset.js
document_loaders/web/apify_dataset.d.ts
document_loaders/web/cheerio.cjs
document_loaders/web/cheerio.js
document_loaders/web/cheerio.d.ts
13 changes: 13 additions & 0 deletions langchain/package.json
@@ -142,6 +142,9 @@
"document_loaders/base.cjs",
"document_loaders/base.js",
"document_loaders/base.d.ts",
"document_loaders/web/apify_dataset.cjs",
"document_loaders/web/apify_dataset.js",
"document_loaders/web/apify_dataset.d.ts",
"document_loaders/web/cheerio.cjs",
"document_loaders/web/cheerio.js",
"document_loaders/web/cheerio.d.ts",
@@ -345,6 +348,7 @@
"@types/uuid": "^9",
"@typescript-eslint/eslint-plugin": "^5.58.0",
"@typescript-eslint/parser": "^5.58.0",
"apify-client": "^2.7.1",
"axios": "^0.26.0",
"cheerio": "^1.0.0-rc.12",
"chromadb": "^1.4.0",
@@ -394,6 +398,7 @@
"@tensorflow/tfjs-converter": "*",
"@tensorflow/tfjs-core": "*",
"@zilliz/milvus2-sdk-node": "^2.2.0",
"apify-client": "^2.7.1",
"axios": "*",
"cheerio": "^1.0.0-rc.12",
"chromadb": "^1.4.0",
@@ -453,6 +458,9 @@
"@zilliz/milvus2-sdk-node": {
"optional": true
},
"apify-client": {
"optional": true
},
"axios": {
"optional": true
},
@@ -778,6 +786,11 @@
"import": "./document_loaders/base.js",
"require": "./document_loaders/base.cjs"
},
"./document_loaders/web/apify_dataset": {
"types": "./document_loaders/web/apify_dataset.d.ts",
"import": "./document_loaders/web/apify_dataset.js",
"require": "./document_loaders/web/apify_dataset.cjs"
},
"./document_loaders/web/cheerio": {
"types": "./document_loaders/web/cheerio.d.ts",
"import": "./document_loaders/web/cheerio.js",
1 change: 1 addition & 0 deletions langchain/scripts/create-entrypoints.js
@@ -64,6 +64,7 @@ const entrypoints = {
// document_loaders
document_loaders: "document_loaders/index",
"document_loaders/base": "document_loaders/base",
"document_loaders/web/apify_dataset": "document_loaders/web/apify_dataset",
"document_loaders/web/cheerio": "document_loaders/web/cheerio",
"document_loaders/web/puppeteer": "document_loaders/web/puppeteer",
"document_loaders/web/playwright": "document_loaders/web/playwright",
1 change: 1 addition & 0 deletions langchain/src/document_loaders/index.ts
@@ -10,6 +10,7 @@ export { CollegeConfidentialLoader } from "./web/college_confidential.js";
export { GitbookLoader } from "./web/gitbook.js";
export { HNLoader } from "./web/hn.js";
export { IMSDBLoader } from "./web/imsdb.js";
export { ApifyDatasetLoader } from "./web/apify_dataset.js";
export { DirectoryLoader, UnknownHandling } from "./fs/directory.js";
export { SRTLoader } from "./fs/srt.js";
export { PDFLoader } from "./fs/pdf.js";
45 changes: 45 additions & 0 deletions langchain/src/document_loaders/web/apify_dataset.ts
@@ -0,0 +1,45 @@
import { Document } from "../../document.js";
import { BaseDocumentLoader, DocumentLoader } from "../base.js";

export class ApifyDatasetLoader
extends BaseDocumentLoader
implements DocumentLoader
{
protected datasetId: string;

protected datasetMappingFunction: (
item: Record<string | number, unknown>
) => Document;

constructor(
datasetId: string,
datasetMappingFunction: (item: Record<string | number, unknown>) => Document
) {
super();
this.datasetId = datasetId;
this.datasetMappingFunction = datasetMappingFunction;
}

static async imports(): Promise<{
ApifyClientClass: typeof import("apify-client").ApifyClient;
}> {
try {
const { ApifyClient } = await import("apify-client");
return { ApifyClientClass: ApifyClient };
} catch (e) {
throw new Error(
"Please install apify-client as a dependency with, e.g. `yarn add apify-client`"
);
}
}

async load(): Promise<Document[]> {
const { ApifyClientClass } = await ApifyDatasetLoader.imports();
const apifyClient = new ApifyClientClass();

const datasetItems = (
await apifyClient.dataset(this.datasetId).listItems({ clean: true })
).items;
return datasetItems.map(this.datasetMappingFunction);
}
}