<a href="https://colab.research.google.com/github/graphlit/graphlit-samples/blob/main/python/Notebook%20Examples/Graphlit_2024_09_02_Scrape_Website.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Description**

This example shows how to scrape a website and extract its pages as Markdown. It uses the `allowedPaths` property to restrict the crawl to URLs whose path contains the word `graphlit`.
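As a rough illustration of what the filter does, the sketch below applies the same regular expression the notebook passes to `allowedPaths` against a few sample URLs. It assumes the pattern is evaluated against the URL path only; that is an assumption about Graphlit's behavior, and the `is_allowed` helper and the two non-matching URLs are hypothetical.

```python
import re
from urllib.parse import urlparse

# Same pattern the notebook passes as allowed_paths.
ALLOWED_PATH = re.compile(r"^/blog/.*graphlit.*$")

def is_allowed(url: str) -> bool:
    """Return True if the URL's path matches the allowed-path pattern.

    Assumption: the pattern is matched against the path component only.
    """
    path = urlparse(url).path
    return ALLOWED_PATH.match(path) is not None

urls = [
    "https://www.graphlit.com/blog/langchain-vs-graphlit",
    "https://www.graphlit.com/blog/some-other-post",  # hypothetical URL
    "https://www.graphlit.com/pricing",               # hypothetical URL
]
allowed = [u for u in urls if is_allowed(u)]
```

Only the first URL ends up in `allowed`, because it is the only one whose path both starts with `/blog/` and contains `graphlit`.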

**Requirements**

Prior to running this notebook, you will need to [sign up](https://docs.graphlit.dev/getting-started/signup) for Graphlit, and [create a project](https://docs.graphlit.dev/getting-started/create-project).

You will need the Graphlit organization ID, preview environment ID and JWT secret from your created project.

Assign these properties as Colab secrets: GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID and GRAPHLIT_JWT_SECRET.


---

Install Graphlit Python client SDK

In [1]:
!pip install --upgrade graphlit-client

Collecting graphlit-client
  Downloading graphlit_client-1.0.20240903001-py3-none-any.whl.metadata (2.7 kB)
Collecting httpx (from graphlit-client)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting websockets (from graphlit-client)
  Downloading websockets-13.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting httpcore==1.* (from httpx->graphlit-client)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx->graphlit-client)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading graphlit_client-1.0.20240903001-py3-none-any.whl (197 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)

Initialize Graphlit

In [2]:
import os
from google.colab import userdata
from graphlit import Graphlit
from graphlit_api import input_types, enums, exceptions

os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

graphlit = Graphlit()

Define Graphlit helper functions

In [3]:
from typing import List, Optional

async def create_feed(uri: str, allowed_paths: Optional[List[str]] = None):
    """Create a web feed for `uri` and return its ID, or None on failure."""
    if graphlit.client is None:
        return None

    # Named feed_input to avoid shadowing the built-in input()
    feed_input = input_types.FeedInput(
        name=uri,
        type=enums.FeedTypes.WEB,
        web=input_types.WebFeedPropertiesInput(
            uri=uri,
            allowedPaths=allowed_paths,
            readLimit=5 # limiting to 5 pages from website
        )
    )

    try:
        response = await graphlit.client.create_feed(feed_input)

        return response.create_feed.id if response.create_feed is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def is_feed_done(feed_id: str):
    if graphlit.client is None:
        return None

    response = await graphlit.client.is_feed_done(feed_id)

    return response.is_feed_done.result if response.is_feed_done is not None else None

async def query_contents(feed_id: str):
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.query_contents(
            filter=input_types.ContentFilter(
                feeds=[
                    input_types.EntityReferenceFilter(
                        id=feed_id
                    )
                ]
            )
        )

        return response.contents.results if response.contents is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def delete_all_feeds():
    if graphlit.client is None:
        return

    _ = await graphlit.client.delete_all_feeds(is_synchronous=True)


Execute Graphlit example

In [4]:
from IPython.display import display, Markdown
import time

# Remove any existing feeds; only needed for notebook example
await delete_all_feeds()

print('Deleted all feeds.')

# Find URLs with the word 'graphlit' in them.
feed_id = await create_feed(uri="https://www.graphlit.com/blog", allowed_paths=["^/blog/.*graphlit.*$"])

if feed_id is not None:
    print(f'Created feed [{feed_id}].')

    # Wait for feed to complete, since ingestion happens asynchronously
    done = False
    time.sleep(5)
    while not done:
        done = await is_feed_done(feed_id)

        if not done:
            time.sleep(2)

    print(f'Completed feed [{feed_id}].')

    # Query contents by feed
    contents = await query_contents(feed_id)

    if contents is not None:
        for content in contents:
            if content is not None:
                display(Markdown(f'# Webpage: {content.uri}:\n{content.markdown}'))

Deleted all feeds.
Created feed [baf4235a-c1f3-4740-a0aa-ce8f0e4cf3b3].
Completed feed [baf4235a-c1f3-4740-a0aa-ce8f0e4cf3b3].


# Webpage: https://www.graphlit.com/blog/building-a-conversational-slack-bot-with-graphlit:
Building a Conversational Slack Bot with Graphlit

Kaushil Kundalia

March 23, 2024

Guest Author: Kaushil Kundalia ( kaushil.kundalia@gmail.com )

Prerequisites:

Slack workspace with admin privileges

Ngrok installed

Install all the requirements

Basic understanding of Flask

Create a Graphlit project

Source Code: The code for this tutorial can be found here on GitHub.

Step 1: Setting up your backend

Creating a Flask Server

To build a Slack bot, we will have a backend server that receives events (aka messages) from Slack, processes each event (by calling Graphlit APIs) and responds by sending a message on Slack. We will use Flask and Python to build our backend server. Let’s first create a basic Flask server:

    from flask import Flask, request, jsonify

    flask_app = Flask(__name__)

    @flask_app.route("/slack-incoming", methods=["POST"])
    def slack_challenge():
        event_data = request.json
        if "challenge" in event_data:
            # Verification challenge to confirm the endpoint
            return jsonify({'challenge': event_data['challenge']})

    if __name__ == "__main__":
        flask_app.run(port=5000, debug=True)

Let us break this down. We define a route /slack-incoming that listens for POST requests. This route is intended to handle incoming events from Slack. Later in the tutorial, we will tell Slack about this endpoint and it will hit this endpoint with an HTTP POST request whenever it receives a new event.

But before that, we need to make sure our endpoint is verifiable by Slack. Slack sends a verification token (challenge) to the specified endpoint to confirm that the server is prepared to receive events.

In the provided code, when our Flask application receives a POST request with this challenge token, it simply reads the token from the incoming JSON payload and responds back with the same token encapsulated in a JSON response, thereby verifying the endpoint's authenticity to Slack.
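The echo step can be isolated from Flask to make it easier to see and test. The following is a minimal sketch; the `build_challenge_response` helper name and the sample challenge value are hypothetical, while the `url_verification` payload shape follows Slack's Events API.

```python
from typing import Optional

def build_challenge_response(payload: dict) -> Optional[dict]:
    """Return the body to echo back for a URL-verification request, else None."""
    if "challenge" in payload:
        return {"challenge": payload["challenge"]}
    return None

# Shape of the payload Slack sends when verifying an endpoint (sample value).
verification = {"type": "url_verification", "challenge": "abc123"}
reply = build_challenge_response(verification)

# A regular event carries no "challenge" key, so nothing is echoed.
no_reply = build_challenge_response({"type": "event_callback", "event": {}})
```

In the Flask route, the returned dictionary would be passed to `jsonify`, exactly as the tutorial code does inline.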

We can start our server by simply running this as a Python script:

Exposing it to the outside world

Since this is a locally hosted app, our backend service will not be reachable from the outside world. To solve this, we create a secure tunnel from the public internet to the local server running on your machine using a tool called ngrok.

Since our Flask server is running on port 5000, we run:

    ngrok http 5000

Here, ngrok exposes your local Flask server to the internet by providing a publicly accessible URL. This URL can then be used to configure Slack's Event Subscriptions, allowing Slack to send event notifications, such as new messages, to your /slack-incoming endpoint.

Now that our backend is ready, we move over to the next step of setting up our Slack App.

Step 2: Setting up Slack App

Creating a Slack App

Our first step is to create a Slack bot. To do this, head over to Slack API and click on “Create New App”.

And now select “From Scratch”:

Enter your app’s name (which is “Graphlit Bot” here) and select a workspace. Hit “Create App”:

Permissions

The next step is to grant the permissions that allow our Graphlit-powered app to communicate with Slack via our newly created app. Head over to “OAuth and Permissions”, and add the following “Bot Token Scopes”:

Enabling Events

The next step is to tell Slack about the endpoint to which it will make a POST request whenever an event occurs. Here, we will use the URL that ngrok created. First turn on Event Subscriptions and paste the URL (make sure you include the /slack-incoming endpoint).

Once the URL is verified, we will subscribe to new messages coming into a channel. To do this, subscribe to the message.channels event.

Hit “Save Changes”

This finishes the Slack App setup. Now we will install the app to our workspace.

Installing to Slack Workspace

Navigate to “Basic Information” section and click on “Install your app”.

Once the installation is complete, you should be able to see “Graphlit Bot” under the “Your apps” section on Slack.

Creating Slack Channel

Finally, let’s create a new channel on Slack and add our bot to it. We will use this channel to run our conversational bot. Here I’ve created a new channel named #graphlit-conversation and added Graphlit Bot.

Voila!! This concludes our Slack setup. Now let’s dive into the interesting part.

Step 3: Getting started with Graphlit

Getting started with Graphlit is straightforward and easy. Let’s start by creating a new project.

Creating Project on Graphlit

Head over to the Graphlit Developer Portal and create a new project. You can read more information about creating a project here.

We are naming our project “My Project”.

Once you create a project, Graphlit provides an API endpoint on which you can make GraphQL queries. Graphlit provides an API that does all the heavy lifting work behind building an LLM application.

This means that as a developer, you do not need to worry about managing vector databases, generating embeddings, integrating with external data sources, building wrappers over LLM models, etc--Graphlit abstracts away all this.

That’s all the setup you need to get started with Graphlit. Let’s start building our chatbot.

Ingesting Slack messages as Feed

A feed in Graphlit allows you to ingest content in bulk into your Graphlit project. Feeds support ingestion of multiple types of data such as PDFs, messages, images, audio, video, and even RSS or Reddit posts. Here we will use the Slack feed to ingest messages from Slack into our project, and we will schedule our feed to pull new messages from Slack every minute. We will use this content to create a conversation over it in the subsequent step. But for now, let’s focus on how to create a feed.

In Graphlit, content refers to any form of complex or unstructured data, such as PDFs, images, Slack messages, Word documents, etc.

We will use the createFeed mutation via the API Explorer. API Explorer provides an in-browser IDE within your Graphlit project that you can use to run your GraphQL queries or mutations.

Request

    mutation CreateFeed($feed: FeedInput!) {
      createFeed(feed: $feed) {
        id
        name
        state
        type
      }
    }

Variables

    {
      "feed": {
        "type": "SLACK",
        "slack": {
          "token": "xoxb-your-token",
          "channel": "graphlit-conversation"
        },
        "schedulePolicy": {
          "recurrenceType": "REPEAT",
          "repeatInterval": "PT1M"
        },
        "name": "Slack Feed"
      }
    }

Response

    {
      "data": {
        "createFeed": {
          "id": "5d3c3d7d-8358-4365-9afa-f4c41bf76b1d",
          "name": "Slack Feed",
          "state": "ENABLED",
          "type": "SLACK"
        }
      }
    }

Let us jump over to Slack to see our feed work in action.

Request

    query Feed($feedId: ID!) {
      feed(id: $feedId) {
        name
        contents {
          text
        }
        slack {
          channel
        }
      }
    }

Variables

    {
      "feedId": "55974d0f-32dd-4e65-b0d1-2878b178ea28"
    }

Response

    {
      "data": {
        "feed": {
          "name": "Slack Feed",
          "contents": [
            {
              "text": "Slack Message:\n- From: Kaushil Kundalia\n- Created at 3/24/2024 5:39:21 AM UTC\nHello. This is the first message that the Slack Feed will read."
            }
          ],
          "slack": {
            "channel": "graphlit-conversation"
          }
        }
      }
    }

Perfect, now that we have data coming in, we will use this to build our conversational chatbot.

Creating Conversation

Conversation on Graphlit is a data model that lets you build chatbot based applications.

Internally a conversation does the following:

When you ingest content, Graphlit will internally create a knowledge graph on it.

A conversation runs over content, based on the optional filter provided with the CreateConversation mutation. If no filter is provided, the conversation runs over all content in your project. It then uses the knowledge graph, which makes it easy to converse about the filtered content.

You can prompt a conversation; i.e., you can give a message to a conversation and it will search for relevant content from the knowledge graph, pass it to an LLM, and generate a response.

Each time you prompt a conversation, it will add 2 new messages to its knowledge graph (the user message and assistant message) which updates your context.

You can continue a conversation by specifying a conversation id.

Hence using Conversation abstracts away the process of generating vector embeddings, storing to a vector database, running similarity search, etc.

To create a conversation, you can run the createConversation mutation. Graphlit will use Azure OpenAI GPT-3.5 Turbo 16k by default to complete the conversation prompts, but you can optionally provide a Specification when creating the conversation and select any model from OpenAI, Anthropic, Mistral, etc. Notice that we’re using a feeds filter here. This tells Graphlit which content to converse over, i.e., the Slack feed.

Request

    mutation CreateConversation($conversation: ConversationInput!) {
      createConversation(conversation: $conversation) {
        owner {
          id
        }
        name
        id
      }
    }

Variables

    {
      "conversation": {
        "name": "Slack Conversation",
        "filter": {
          "feeds": [
            { "id": "55974d0f-32dd-4e65-b0d1-2878b178ea28" }
          ]
        }
      }
    }

Response

    {
      "data": {
        "createConversation": {
          "owner": {
            "id": "3e6e8313-421c-4dc6-b50d-c6e3c81ff2b3"
          },
          "name": "Slack Conversation",
          "id": "354f7085-3505-4481-a3b4-f2008c08df11"
        }
      }
    }

Again, we will use the API Explorer to create a conversation and then prompt over that conversation by making calls using Python.

Note the conversation ID as it will be used later.

Step 4: Bringing it all together

Let us recap what we did so far:

Created a Python backend that can receive slack messages

Created a Slack Bot and added it to a channel

Created a Conversation on Graphlit that we will use to build the chatbot

Now let’s glue it all together in our backend.

Reading environment variables

Create a file .env and provide the following variables:

    SLACK_BOT_TOKEN = "REDACTED"
    SLACK_CHANNEL = "graphlit-conversation"
    SLACK_SIGNING_SECRET = "REDACTED"
    SLACK_BOT_USER = "Graphlit Bot"
    GRAPHLIT_ORG_ID = "REDACTED"
    GRAPHLIT_ENV_ID = "REDACTED"
    GRAPHLIT_SECRET_KEY = "REDACTED"
    GRAPHLIT_URL = "https://data-scus.graphlit.io/api/v1/graphql"
    GRAPHLIT_CONVERSATION_ID = "REDACTED"

Import these in the Flask application app.py:

    import os
    from dotenv import load_dotenv

    # Load the variables defined in .env into the process environment
    load_dotenv()

    slack_token = os.getenv("SLACK_BOT_TOKEN")
    slack_channel = os.getenv("SLACK_CHANNEL")
    signing_secret = os.getenv("SLACK_SIGNING_SECRET")
    slack_bot_user = os.getenv("SLACK_BOT_USER")
    graphlit_organization_id = os.getenv("GRAPHLIT_ORG_ID")
    graphlit_environment_id = os.getenv("GRAPHLIT_ENV_ID")
    graphlit_secret_key = os.getenv("GRAPHLIT_SECRET_KEY")
    graphlit_url = os.getenv("GRAPHLIT_URL")
    graphlit_conversation_id = os.getenv("GRAPHLIT_CONVERSATION_ID")

Authenticating Graphlit API

Graphlit uses JWT based authentication for its API. To create a JWT, you can add the following function in app.py .

    import datetime
    import jwt  # PyJWT

    def get_graphlit_token(organization_id, environment_id, secret_key,
                           issuer="graphlit", audience="https://portal.graphlit.io",
                           role="Owner", expiration_hours=1) -> str:
        expiration = datetime.datetime.utcnow() + datetime.timedelta(hours=expiration_hours)

        # Define the payload
        payload = {
            "https://graphlit.io/jwt/claims": {
                "x-graphlit-environment-id": environment_id,
                "x-graphlit-organization-id": organization_id,
                "x-graphlit-role": role,
            },
            "exp": expiration,
            "iss": issuer,
            "aud": audience,
        }

        # Sign the JWT
        token = jwt.encode(payload, secret_key, algorithm="HS256")

        # Verify the token
        try:
            decoded = jwt.decode(token, secret_key, algorithms=["HS256"], audience=audience)
            print(decoded)
        except jwt.ExpiredSignatureError as ex:
            print("Error: Token has expired")
            raise ex
        except jwt.InvalidTokenError as ex:
            print("Error: Invalid token")
            raise ex

        return token

This function returns an HS256 encoded token string, and this token will be passed in the header while making any request to the Graphlit APIs.
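To make the structure of an HS256 token concrete, here is a minimal standard-library sketch of what signing and verifying amounts to: two Base64url-encoded JSON segments plus an HMAC-SHA256 signature over them. It is an illustration only, not a replacement for PyJWT, and it does not validate `exp` or `aud`. The helper names, the secret, and the ID values are placeholders.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def sign_hs256(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signing with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"

def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature and compare in constant time (signature check only)."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return hmac.compare_digest(b64url(expected), signature)

# Placeholder claim values; the claim names mirror the tutorial's payload.
payload = {
    "https://graphlit.io/jwt/claims": {
        "x-graphlit-environment-id": "example-environment-id",
        "x-graphlit-organization-id": "example-organization-id",
        "x-graphlit-role": "Owner",
    },
    "exp": int(time.time()) + 3600,
    "iss": "graphlit",
    "aud": "https://portal.graphlit.io",
}
token = sign_hs256(payload, "example-secret")
```

The resulting string has three dot-separated segments, and it only verifies against the secret it was signed with.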

Using PromptConversation API

The PromptConversation API facilitates the creation and management of conversational chatbots. PromptConversation API expects a user prompt and returns with essential details, including the LLM response, conversation ID, messages count etc.

Upon receiving a request, the API will fetch relevant data by performing a similarity search on content and past conversation, and generate a response through LLM by using the fetched data. Besides, it will also store the user prompt and the LLM response in embeddings which can be queried upon by future prompts.

Here, we will use gql, which is a Python GraphQL client. We wrap calls to Graphlit inside a graphlit_request function that takes a user prompt string as input, calls the PromptConversation API, and returns the API response, from which we later extract the LLM message.

    from gql import gql, Client
    from gql.transport.requests import RequestsHTTPTransport

    # Authenticate once and reuse the client for every request
    token = get_graphlit_token(graphlit_organization_id, graphlit_environment_id, graphlit_secret_key)
    transport = RequestsHTTPTransport(url=graphlit_url, headers={"Authorization": f"Bearer {token}"})
    gql_client = Client(transport=transport)

    def graphlit_request(prompt: str) -> dict:
        query = gql("""
            mutation PromptConversation($prompt: String!, $promptConversationId: ID) {
              promptConversation(prompt: $prompt, id: $promptConversationId) {
                message {
                  message
                }
                messageCount
                conversation {
                  id
                }
              }
            }
        """)
        variables = {"prompt": prompt, "promptConversationId": graphlit_conversation_id}
        return gql_client.execute(query, variable_values=variables)

Breaking this down:

We fetch a Graphlit token by calling get_graphlit_token, and then use it to initialize a GraphQL Client.

Query and Variables:

As seen in the query, the PromptConversation mutation expects 2 arguments ($prompt: String!, $promptConversationId: ID). The $prompt is the user message coming from Slack and the $promptConversationId is the conversation ID received as a response after we created a conversation. Optionally you can also query Conversations.

The API will respond with the following fields mentioned in the mutation:

    {
      message {
        message
      }
      messageCount
      conversation {
        id
      }
    }

Payload and Request: We wrap the query and variables inside payload and execute the query.

Response: A sample Graphlit API response in this case would look like this:

    {
      "promptConversation": {
        "message": {
          "message": "This is LLM Response"
        },
        "messageCount": 16,
        "conversation": {
          "id": "5363a7ea-bf63-477a-b625-84f9ec9d9e2b"
        }
      }
    }

We will extract the LLM response message from the query response in the Flask function:

    response = graphlit_request(text)
    message = response.get("promptConversation").get("message").get("message")
    print(message)
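The chained `.get()` calls raise an `AttributeError` if any intermediate level is missing or `None`. A more defensive extraction could look like the sketch below; the `extract_message` helper is hypothetical and not part of the tutorial code.

```python
from typing import Any, Optional

def extract_message(response: Optional[dict]) -> Optional[str]:
    """Walk promptConversation -> message -> message, returning None if any level is missing."""
    node: Any = response
    for key in ("promptConversation", "message", "message"):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node if isinstance(node, str) else None

# Sample response copied from the tutorial.
sample = {
    "promptConversation": {
        "message": {"message": "This is LLM Response"},
        "messageCount": 16,
        "conversation": {"id": "5363a7ea-bf63-477a-b625-84f9ec9d9e2b"},
    }
}
text = extract_message(sample)
```

With this helper, the caller can fall back to a default reply when the result is `None` instead of relying on a broad `except`.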

Bringing it all together

Now that our backend is able to authenticate and call the Graphlit API, we can connect it with our slack_challenge() function to handle incoming messages. Modify the function in app.py as given below:

    @flask_app.route("/slack-incoming", methods=["POST"])
    def slack_challenge():
        event_data = request.json
        slack_user_id = slack_client.api_call("auth.test")["user_id"]

        if "challenge" in event_data:
            # Verification challenge to confirm the endpoint
            return jsonify({'challenge': event_data['challenge']})
        elif "event" in event_data:
            event = event_data['event']
            print(f'event type got it: {event.get("type")}')

            # Handle message events
            if event.get("type") == "message" and "subtype" not in event:
                # Process the message
                user = event["user"]
                text = event["text"]
                try:
                    response = graphlit_request(text)
                    message = response.get("promptConversation").get("message").get("message")
                    print(message)
                except Exception as ex:
                    # If something goes wrong, respond accordingly.
                    message = "I'm sorry something went wrong internally. Please try again."

                # Send the response to Slack, unless the message came from the bot itself
                if user != slack_user_id:
                    slack_client.chat_postMessage(channel=slack_channel, text=message)

        return "OK", 200

The code checks if the incoming payload contains an "event" key, signifying an event notification from Slack. For message events without subtypes (i.e., standard messages, not updates or deletions), we extract user and text from the message and pass the text to graphlit_request to get a response. The response from this operation is then conditionally posted back into the Slack channel, provided the message wasn't sent by the bot itself (thereby preventing the bot from responding to its own messages).
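The filtering rules described above can be pulled into a small pure function, which also lets the handler skip the Graphlit call entirely for events it will not answer. This is a sketch; the `should_reply` name and the sample user IDs are hypothetical.

```python
def should_reply(event: dict, bot_user_id: str) -> bool:
    """Reply only to plain user messages that were not sent by the bot itself."""
    if event.get("type") != "message":
        return False  # not a message event
    if "subtype" in event:
        return False  # edits, deletions, joins, and similar
    return event.get("user") != bot_user_id

BOT_ID = "U_BOT"  # placeholder bot user ID

plain = {"type": "message", "user": "U_HUMAN", "text": "hello"}
edited = {"type": "message", "subtype": "message_changed", "user": "U_HUMAN"}
own = {"type": "message", "user": BOT_ID, "text": "hi there"}
```

Calling `should_reply` before `graphlit_request` would avoid prompting the conversation with the bot's own messages, which the inline version still does before discarding the result.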

Result

And there we have it!!! Let’s head over to Slack and see the symphony in action.

We now have a production-ready Slack bot that can listen to, remember and respond to conversation messages.

Summary

Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.



# Webpage: https://www.graphlit.com/blog/build-ai-applications-with-next-js-vercel-and-graphlit:
Build LLM-driven applications with Next.js, Vercel and Graphlit

Kirk Marple

July 30, 2024

We have built three new sample applications, which show the capabilities of Graphlit integrated with Next.js, and deployable on Vercel.

Each of these applications uses the Graphlit Node.js SDK (NPM) for integration with the Graphlit Platform API.

    import { Graphlit } from 'graphlit-client';

Also, each is deployable to Vercel, via the Deploy button on the README page of each GitHub repository.

Chat Application [GitHub]:

    git clone git@github.com:graphlit/graphlit-samples.git
    cd nextjs/chat

Similar to ChatGPT, Graphlit supports Retrieval Augmented Generation (RAG) conversations over ingested content.

In this application, you can upload one or more files to your Graphlit project, and then prompt a chat conversation to ask questions or summarize the file contents.

Files will be read from the local filesystem, and Base64-encoded before being sent to the Graphlit API. Ingestion runs synchronously because the isSynchronous parameter is set to true, so the API route waits until the file has completed the ingestion workflow.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Process each file by ingesting it into the Graphlit client
    const results = await Promise.allSettled(
      data.files.map(({ name, base64, mimeType }) => {
        return client.ingestEncodedFile(name, base64, mimeType, undefined, true);
      })
    );
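For readers following along in Python, as in the rest of this notebook, the Base64 step on its own might be sketched as follows. The `encode_file` helper and the sample file are hypothetical; the dictionary keys simply mirror the `name`, `base64` and `mimeType` fields used in the JavaScript snippet above, and no Graphlit call is made.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

def encode_file(path: Path) -> dict:
    """Read a local file and return its name, Base64 text, and guessed MIME type."""
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        "base64": base64.b64encode(data).decode("ascii"),
        # Fall back to a generic binary type when the extension is unknown
        "mimeType": mime_type or "application/octet-stream",
    }

# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("hello graphlit", encoding="utf-8")
    encoded = encode_file(sample)
```

Base64 turns arbitrary bytes into ASCII text, which is what allows binary files to travel inside a JSON or GraphQL request body.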

Then, when the user enters a prompt, it will be sent to the default LLM (OpenAI GPT-4o, at the time of publishing) for completion.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Send the prompt to the conversation
    const promptResults = await client.promptConversation(data.prompt, data.conversationId);

Previous conversations can be queried, and as a conversation is selected, the application loads the previous messages.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Query the Graphlit client for conversations
    const conversationResults = await client.queryConversations();

    // Extract the conversations from the results
    const response = conversationResults.conversations;

Web Extraction Application [GitHub]:

    git clone git@github.com:graphlit/graphlit-samples.git
    cd web-extraction

Graphlit can be used to scrape webpages or crawl websites and extract text, even without using the text for a RAG conversation.

Scrape

This sample application demonstrates how to scrape a webpage by URL, and then display the extracted Markdown text and structured JSON output.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Ingest the URI into the Graphlit client
    const response = await client.ingestUri(data.uri, undefined, undefined, true);

Crawl

This sample application also demonstrates how to crawl a website by URL, walking the pages via sitemap.  The application then displays the extracted Markdown text and structured JSON output of all crawled pages.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Create a new feed for the specified URI
    const response = await client.createFeed({
      name: data.uri,
      type: FeedTypes.Web,
      web: {
        uri: data.uri,
        readLimit: data.limit,
      },
    });

File Extraction Application [GitHub]:

    git clone git@github.com:graphlit/graphlit-samples.git
    cd file-extraction

In addition to extracting text from webpages, Graphlit supports extracting text from documents, such as PDFs and Word documents.

Similar to the Chat sample application, the File Extraction application demonstrates how to upload a local file. Once ingested, the application displays the extracted Markdown text and structured JSON output.

    // Initialize the Graphlit client
    const client = new Graphlit();

    // Ingest each file and handle results
    const results = await Promise.allSettled(
      data.files.map(({ name, base64, mimeType }) =>
        client.ingestEncodedFile(name, base64, mimeType, undefined, true)
      )
    );

Summary

Please email any questions on this article or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.



# Webpage: https://www.graphlit.com/blog/langchain-vs-graphlit:
Langchain vs Graphlit: Building a Q&A System

Archana Vaidheeswaran

May 20, 2024

Imagine you're a developer tasked with building an AI-powered application that can answer questions based on information from a specific website. Chances are, you've come across Langchain - a popular library known for its versatility and wide range of capabilities. With Langchain, you can easily integrate various AI components into your application, from data ingestion and processing to model training and inference. It's a go-to choice for many developers looking to create chatbots, RAG systems, or AI agent applications.

However, there's another tool worth exploring: Graphlit. Graphlit offers a compelling alternative for developers seeking a more streamlined approach to building AI applications. In this article, we'll dive into the key differences between Langchain and Graphlit, exploring their respective strengths and trade-offs, and helping you make an informed decision for your next AI project.

Overview of LangChain and Graphlit

At its core, Langchain follows a modular architecture, providing a set of loosely coupled components that developers can mix and match to create custom AI pipelines. This allows for great flexibility and control over every aspect of the application, from data loading and preprocessing to model selection and deployment. Langchain's ecosystem includes modules for web scraping, text splitting, embeddings, vectorstores, and integrations with various LLM APIs and LLM services.

In contrast, Graphlit takes a more opinionated and integrated approach, offering a unified platform that abstracts away many of the low-level details and vendor wiring you would have to do with LangChain. With Graphlit, developers can quickly set up data feeds, define AI workflows, and configure models using high-level APIs. The platform handles the underlying data processing, storage, and model orchestration, allowing developers to focus on the application logic and user experience.

While Langchain's modular design offers many customization options, it also means that developers need to have a deep understanding of the various components and how they interact. This can lead to a steeper learning curve and longer development cycles, especially for complex applications. Graphlit, on the other hand, prioritizes simplicity and ease of use, trading off some flexibility for a more manageable and productive developer experience.

Throughout this article, we'll explore these differences in greater detail, examining how Langchain and Graphlit approach data ingestion, processing, storage, and retrieval, as well as their integration capabilities with AI models and external services. We'll provide concrete examples and code snippets to illustrate the key concepts and help you understand the implications for your specific use case.

Whether you're a seasoned AI developer looking for a more efficient workflow or a newcomer seeking a gentle introduction to building AI applications, this comparative analysis of Langchain and Graphlit will provide valuable insights and guidance. By the end of the article, you'll have a clear understanding of the strengths and limitations of each library, empowering you to make the best choice for your next AI project.

Building a Question-Answering System with Langchain

Let's build a question-answering system using Langchain. We'll go through the code step by step, explaining each component and how it fits together to create the final application.

Step 1: Loading and Processing Data

The first step is to load and preprocess the data that our system will use to answer questions. In this example, we're using a single web page as our data source.

    import bs4
    from langchain_community.document_loaders import WebBaseLoader

    # Only keep the post title, header, and body from the page
    loader = WebBaseLoader(
        web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
        bs_kwargs=dict(
            parse_only=bs4.SoupStrainer(
                class_=("post-content", "post-title", "post-header")
            )
        ),
    )
    docs = loader.load()

Here, we use Langchain's WebBaseLoader to fetch the content of the specified web page. The bs_kwargs parameter allows us to select specific elements from the page using BeautifulSoup's SoupStrainer. In this case, we're interested in the elements with classes "post-content", "post-title", and "post-header".

Next, we split the loaded documents into chunks using the RecursiveCharacterTextSplitter:

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)

This splitter breaks down the text into smaller chunks of a specified size (1000 characters) with some overlap between chunks (200 characters) to maintain context.
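To see what chunk size and overlap mean in practice, here is a deliberately simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the `split_with_overlap` function is a hypothetical illustration of the two parameters only.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size windows, each sharing chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 2,500 characters of sample text
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
```

For this 2,500-character input, the windows start at offsets 0, 800 and 1600, so the last 200 characters of each chunk reappear at the start of the next one.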

Developers have several options for hosting the data loading and processing components. They can run these tasks on their local machine separately, use a cloud-based virtual machine (like AWS EC2 or Google Compute Engine), or even employ serverless functions (such as AWS Lambda or Google Cloud Functions) for a more scalable and cost-effective solution.

However, each of these options comes with its own set of challenges. Running on a local machine may not be suitable for production environments, and developers need to ensure proper security measures are in place. Cloud-based VMs require managing infrastructure, scaling, and maintenance. Serverless functions have limitations in terms of execution time and memory, which may not be suitable for all use cases.

Step 2: Creating Vector Embeddings and Storing in a Vector Store

To enable efficient retrieval of relevant chunks based on a given query, we need to convert the text chunks into vector embeddings and store them in a vectorstore.

    # Import paths assume a recent LangChain package layout.
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

Langchain provides various embedding and vectorstore implementations. In this example, we use `OpenAIEmbeddings` to generate embeddings and `Chroma` as our vectorstore. The `from_documents` method takes care of creating embeddings for each chunk and storing them in the vectorstore.

Vectorstores are specialized databases designed to store and efficiently retrieve vector embeddings. Popular options include Pinecone, Weaviate, Qdrant, and Milvus. These vector stores can be hosted on their respective managed platforms (like Pinecone Cloud) or self-hosted on your infrastructure.
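As a rough illustration of what a vectorstore does, the sketch below stores texts alongside vectors and returns the closest matches by cosine similarity. Everything in it is a toy stand-in: `embed` is a bag-of-words counter rather than a learned embedding model, and `ToyVectorStore` is a hypothetical class, not the API of Chroma or any product named above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (not a learned model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self):
        self._items = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vector = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("task decomposition breaks a large task into smaller steps")
store.add("vector stores index embeddings for similarity search")
store.add("agents can call external tools")
top = store.search("what is task decomposition")
print(top)
```

Production vector stores replace the linear scan with approximate nearest-neighbour indexes, which is a large part of why they are operated as separate services.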

Choosing and managing a vector store adds another layer of complexity to the application architecture. Developers must consider factors like scalability, performance, cost, and integration with other components. Each vectorstore has its own API, query language, and deployment process, which can be time-consuming to learn and implement.

Step 3: Setting Up the Retrieval and Generation Pipeline

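With our data processed and stored, we can now set up the pipeline for retrieving relevant chunks based on a query and generating an answer using a language model.

    # Import paths assume a recent LangChain package layout.
    from langchain import hub
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnablePassthrough

    # `llm` and `format_docs` are assumed to be defined already; both are described below.
    retriever = vectorstore.as_retriever()
    prompt = hub.pull("rlm/rag-prompt")

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )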

The pipeline consists of several components:

`retriever`: Retrieves the most relevant chunks from the vectorstore based on the query.

`format_docs`: A custom function that concatenates the retrieved chunks into a single string.

`prompt`: A pre-defined prompt template that takes the context (retrieved chunks) and the question as inputs.

`llm`: The language model (in this case, gpt-3.5-turbo-0125) that generates the answer based on the formatted prompt.

`StrOutputParser`: Parses the generated output into a string.
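The `format_docs` helper is described above but not shown. A minimal sketch follows, assuming each retrieved chunk exposes its text as `page_content`, as LangChain documents do. The small `Document` dataclass is a stand-in so the example runs without LangChain installed.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Minimal stand-in for a retrieved document; only page_content is used here."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Document("Chunk one."), Document("Chunk two.")]
context = format_docs(retrieved)
print(context)
```

The resulting string is what fills the context slot of the prompt template.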

Language models like GPT-3.5 and GPT-4 are typically hosted by their respective providers (OpenAI, Anthropic, etc.) and accessed via APIs. Developers need to manage API keys, rate limits, and billing. Some models can also be self-hosted using open-source implementations like GPT-J or BLOOM, but this requires significant computational resources and expertise.

Integrating with external LLM APIs adds another point of failure and dependency to the application. Developers need to handle authentication, error handling, and API versioning. Self-hosting models is not practical for most use cases due to the high costs and maintenance overhead.
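One common way to soften transient API failures is to retry with exponential backoff. The sketch below is generic and not tied to any provider's SDK: `call_with_retries` and `flaky_llm_call` are hypothetical names, and catching every exception is a simplification that real code would narrow to the provider's transient error types.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_llm_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "answer"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_llm_call, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect.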

Step 4: Querying the System

Finally, we can use the rag_chain to ask questions and get answers based on the information in the processed web page.

    rag_chain.invoke("What is Task Decomposition?")

And there you have it! A complete question-answering system built with Langchain. The modular architecture allows for easy customization and experimentation with different components, while the high-level APIs abstract away much of the complexity.

However, as we've seen, building an AI application with Langchain involves managing various components across different platforms and providers. This can add significant overhead and complexity, especially for developers who are new to this ecosystem, which is why we will be looking at Graphlit. Graphlit offers a unified platform that abstracts away many of these concerns. By providing a fully managed, end-to-end solution for building applications, it allows developers to focus on their core application logic rather than on infrastructure, integration, and management challenges.

With a Langchain-based stack, developers need to consider factors like:

Infrastructure management: Provisioning and scaling VMs, containers, or serverless functions for data processing and model hosting.

Integration: Ensuring smooth communication and data flow between different components, APIs, and services.

Security: Managing API keys, access controls, and data encryption across multiple platforms.

Monitoring and logging: Setting up centralized monitoring and logging to troubleshoot issues and monitor performance.

Cost optimization: Analyzing and optimizing costs across various services, considering factors like data transfer, API calls, and compute resources.

In the next section, we'll explore how Graphlit approaches the same task and compare the two libraries regarding ease of use, flexibility, and performance.

#### Building a Question-Answering System with Graphlit

Step 1: Creating a Feed

The first step is to create a feed from the web page we want to use as our data source. Graphlit provides a high-level `create_feed` function that takes care of fetching and processing the web content.

    async def create_feed(graphlit, uri):
        input = FeedInput(
            name=uri,
            type=FeedTypes.WEB,
            web=WebFeedPropertiesInput(uri=uri),
        )

        try:
            response = await graphlit.client.create_feed(input)
            feed_id = response.create_feed.id
        except GraphQLClientError as e:
            return None, str(e)

        return feed_id, None

We simply provide the URI of the web page, and Graphlit handles the rest, returning a `feed_id` that we can use to reference the feed later.

Step 2: Creating a Specification

Next, we create a specification that defines how we want to query and process the data in our feed. This includes specifying the search type, model to use for question-answering, and other parameters.

    async def create_specification(graphlit):
        input = SpecificationInput(
            name="Summarization",
            type=SpecificationTypes.COMPLETION,
            serviceType=ModelServiceTypes.ANTHROPIC,
            searchType=SearchTypes.VECTOR,
            anthropic=AnthropicModelPropertiesInput(
                model=AnthropicModels.CLAUDE_3_HAIKU,
                temperature=0.1,
                probability=0.2,
                completionTokenLimit=2048,
            ),
        )

        try:
            response = await graphlit.client.create_specification(input)
            spec_id = response.create_specification.id
        except GraphQLClientError as e:
            return None, str(e)

        return spec_id, None

Graphlit provides a declarative way to define the specification, which makes it easy to understand and modify.

Step 3: Creating a Conversation and Querying

Finally, we create a conversation using the specification and query the data in our feed.

    async def create_conversation(graphlit, spec_id):
        input = ConversationInput(
            name="Conversation",
            specification=EntityReferenceInput(id=spec_id),
        )

        try:
            response = await graphlit.client.create_conversation(input)
            conv_id = response.create_conversation.id
        except GraphQLClientError as e:
            return None, str(e)

        return conv_id, None


    async def prompt_conversation(graphlit, conv_id, prompt):
        response = await graphlit.client.prompt_conversation(prompt, conv_id)

        message = response.prompt_conversation.message.message
        citations = response.prompt_conversation.message.citations

        return message, citations

We create a conversation with a name and the specification ID and then use the prompt_conversation function to send queries and receive responses.
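The helpers above each return a `(value, error)` pair, so a caller has to check for an error after every step. The sketch below shows one way to chain them. `answer_question` and the `stub_*` coroutines are hypothetical: the stubs stand in for the article's functions so the example runs without a Graphlit account, and the sketch does not address waiting for a feed to finish processing before prompting.

```python
import asyncio


async def answer_question(graphlit, uri, question, helpers):
    """Chain the feed, specification, and conversation steps, stopping at the first error."""
    # The feed ID is not passed on; the feed only needs to exist before prompting.
    _feed_id, error = await helpers["create_feed"](graphlit, uri)
    if error is not None:
        raise RuntimeError(f"create_feed failed: {error}")

    spec_id, error = await helpers["create_specification"](graphlit)
    if error is not None:
        raise RuntimeError(f"create_specification failed: {error}")

    conv_id, error = await helpers["create_conversation"](graphlit, spec_id)
    if error is not None:
        raise RuntimeError(f"create_conversation failed: {error}")

    return await helpers["prompt_conversation"](graphlit, conv_id, question)


# Stand-in helpers with made-up return values (hypothetical, for illustration only).
async def stub_create_feed(graphlit, uri):
    return "feed-1", None


async def stub_create_specification(graphlit):
    return "spec-1", None


async def stub_create_conversation(graphlit, spec_id):
    return f"conv-for-{spec_id}", None


async def stub_prompt_conversation(graphlit, conv_id, prompt):
    return f"[{conv_id}] stub answer to: {prompt}", []


stub_helpers = {
    "create_feed": stub_create_feed,
    "create_specification": stub_create_specification,
    "create_conversation": stub_create_conversation,
    "prompt_conversation": stub_prompt_conversation,
}

message, citations = asyncio.run(
    answer_question(None, "https://example.com/post", "What is Task Decomposition?", stub_helpers)
)
print(message)
```

With real credentials, the dictionary would map to the article's own functions and `graphlit` would be an initialized client. In a notebook, `await answer_question(...)` replaces the `asyncio.run(...)` call.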

And that's it! With just a few lines of code and a streamlined architecture, we have a fully functional question-answering system using Graphlit.

Compared to Langchain, Graphlit's architecture is more compact and abstracts away many of the low-level details. We don't need to worry about chunking text, creating embeddings, or managing a vector store - Graphlit handles all of that internally.

This makes Graphlit an attractive option for developers who want a simpler, more integrated solution for building AI applications. While it may not offer the same level of customization as Langchain, it provides a powerful set of high-level APIs and a managed infrastructure that can accelerate development and reduce operational complexity.

Comparison

As we can see, both Langchain and Graphlit provide powerful tools for building AI applications, but they take different approaches in their architecture and design.

Langchain offers a modular framework where developers have fine-grained control over each component of the pipeline, from data ingestion and processing to retrieval and generation. This flexibility allows for extensive customization and optimization, but it also requires a deeper understanding of the underlying technologies and how they fit together.

On the other hand, Graphlit provides a more integrated and abstracted solution, where many of the low-level details are handled automatically by the platform. Developers can focus on defining high-level specifications and workflows, and Graphlit takes care of the rest. This simplifies the development process and reduces the operational complexity, making it an attractive option for teams looking to quickly prototype and deploy AI applications.

Ultimately, the choice between Langchain and Graphlit depends on the specific needs and requirements of the project. If maximum flexibility and customization are paramount, Langchain may be the better choice. But if simplicity, ease of use, and rapid development are the top priorities, Graphlit offers a compelling alternative.

#### Appendix

Step 1a: Creating a Workflow (Optional)

Graphlit allows you to define workflows for additional data processing, such as entity extraction. This step is optional, but it demonstrates the flexibility of Graphlit's architecture.

    async def create_workflow(graphlit):
        input = WorkflowInput(
            name="Azure Cognitive Services",
            extraction=ExtractionWorkflowStageInput(
                jobs=[
                    ExtractionWorkflowJobInput(
                        connector=EntityExtractionConnectorInput(
                            type=EntityExtractionServiceTypes.AZURE_COGNITIVE_SERVICES_TEXT,
                        )
                    )
                ]
            ),
        )

        try:
            response = await graphlit.client.create_workflow(input)
            workflow_id = response.create_workflow.id
        except GraphQLClientError as e:
            return None, str(e)

        return workflow_id, None

Here, we define a workflow that uses Azure Cognitive Services for entity extraction. Graphlit takes care of orchestrating the workflow and integrating with the external service.

Summary

Please email any questions on this article or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.



# Webpage: https://www.graphlit.com/blog/diving-into-open-source-with-graphlit:
Diving Into Open Source with Graphlit: A Beginner's Guide to Contributing and Collaborating

Archana Vaidheeswaran

April 1, 2024

Hello there, aspiring contributors!

Welcome to the vibrant world of open-source software (OSS). If you've ever used a web browser like Firefox or a programming language like Python, you've benefited from open-source software. This software lives and evolves in open-source repositories, and they're a testament to collaborative innovation.

But what does contributing involve, and why should you consider it?

The Advantages of Contributing to Open Source

Contributing to open-source projects is an investment in your skill set. Each project has unique challenges, pushing you to adapt and enhance your technical abilities. From writing cleaner code to learning new programming languages, the learning curve is steep but rewarding. You’ll find yourself picking up best practices and contributing some of your own.

But it's not just about code. It's about the people behind the code. Joining an open-source project plunges you into a vibrant community of like-minded individuals. Collaboration is the heart of open source. You'll work together to solve problems, review each other's code, and share knowledge. This is where innovation thrives, as diverse ideas come together to shape the future of technology.

Moreover, your contributions are a visible testament to your skills, making your public portfolio on platforms like GitHub reflect your expertise and dedication. It’s tangible evidence for potential employers or collaborators of what you can do and how you do it. This isn’t just about showing off your coding chops; it’s about demonstrating your ability to engage with complex projects and see them through to completion.

Exploring the OpenAI Python Client Project for Open-Source Contributions

The OpenAI Python client project presents an exemplary opportunity for open-source contributions. As a repository of Python examples and tools for interacting with OpenAI services, it's a cornerstone for developers looking to integrate AI into their applications. The repository offers a peek into the application of AI in various domains and provides a practical guide for implementing complex AI models.

This repository is noteworthy due to its emphasis on providing Python developers with the tools and examples needed to leverage OpenAI's capabilities. It opens up possibilities for automating tasks, analyzing data, and integrating AI into software solutions.

For beginners, the OpenAI Python client project offers a rich environment in which to learn and contribute to AI-based systems. With detailed documentation, active issues, and pull requests, new contributors can find clear pathways to making meaningful contributions. The diversity of tasks, from refining code samples to improving integration tools, allows a broad range of skill sets to get involved.

Navigating Open Source Repositories Like OpenAI-Python

Open-source repositories are hubs where developers collaborate to build, improve, and maintain software. These platforms host the source code and manage the collaborative process, including issue tracking, feature requests, and code review.

Choosing a repository for contributions should be strategic. Projects like OpenAI-Python are ideal as they offer a mix of technical challenges and the chance to contribute to widely impactful tools. Look for active repositories with a clear contribution guide, and consider the support provided for new contributors.

For first-timers eager to contribute to projects like the OpenAI Python client, here are a few actionable steps:

Familiarize: Start by understanding the project's goals and technology stack.

Identify: Look for issues labeled as 'good first issue' or 'help wanted' to find newcomer-friendly tasks.

Learn: Review the project's contribution guidelines and adhere to the coding standards and practices.

Engage: Don’t hesitate to ask questions in the community or offer help in ongoing discussions.

Overcoming Information Overload

When first exploring the intricate web of open-source projects like the OpenAI Python client, it's easy to feel swamped by the sheer volume of information. The endless stream of issues, commits, and technical discussions can intimidate even the most determined newcomer. It's normal to feel like you're facing a firehose of data, but don't let this deter you. The key is to start small: focus on one area, such as documentation or a particular issue that matches your skills or interests. Use filtering tools to narrow down the issues and discussions relevant to you, and don't be afraid to ask questions. Open-source communities are built on collaboration, and more often than not, you'll find veteran contributors eager to guide you.

RAG applications might hold the answer to information overload

RAG (Retrieval-Augmented Generation)-based applications are powerful tools for managing and synthesizing large amounts of information, making them particularly useful for navigating the dense and often overwhelming landscape of open-source software repositories.

Here's why they stand out:

Contextual Understanding: RAG systems are adept at understanding the context behind queries, which means they can provide more relevant and precise summaries or categorizations of complex issues and discussions within repositories.

Efficiency: They significantly reduce the time you'd otherwise spend manually sifting through issues and documentation. By quickly aggregating the main topics or problems in a repository, RAG-based tools help you identify the areas where you can contribute effectively.

Improved Focus: By presenting a consolidated view of the repository's current state, these applications help maintain your focus on specific areas without getting sidetracked by the sheer volume of information available.

Learning Curve: Understanding the underlying patterns and common themes in a repository's issues can be educational for beginners. RAG applications facilitate this learning by highlighting recurrent topics, which can help new contributors better understand the project's needs.

Collaboration: These tools foster better collaboration by clearly delineating problem areas or active discussions so newcomers can easily see where they might jump in and offer help or seek guidance.

Prioritization: RAG-based applications often rank issues by relevance or frequency, helping maintainers and contributors prioritize their efforts and tackle the most pressing tasks.

By employing a RAG-based approach, newcomers and experienced developers alike can navigate open-source repositories more confidently and contribute more meaningfully, turning the flood of information into a navigable stream.

Graphlit as Your Open Source Compass

Enter Graphlit — a tool designed to sift through the complexities of repository issues, making it easier to digest and navigate the project you're interested in. This tool is a boon for tackling the challenge of where to begin. By generating reports on recent GitHub issues, Graphlit RAG helps identify recurring themes, enabling you to understand the broader workstreams at a glance.

Take the OpenAI Python client project, for example. Using Graphlit, one can quickly discern that the repository has ongoing discussions on dependency management, compatibility and portability, performance optimization, documentation, and bug fixes. This organized view allows you to pinpoint where your contributions could be most useful. Whether you are adding a dependency or enhancing the documentation, Graphlit RAG transforms what initially seems like an insurmountable volume of information into an actionable roadmap for your open-source journey.

Before diving into Graphlit, there's an important step to ensure you can seamlessly navigate the GitHub repositories. You must generate a Personal Access Token (PAT) on GitHub. This token acts as your digital key, granting you the necessary permissions to interact with the repository beyond what's available to a general viewer. With a PAT, you can use Graphlit to its full potential, allowing the tool to access details on issues, pull requests, and other repository data that may be restricted.
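To illustrate what the token does, independently of Graphlit, the sketch below builds an authenticated request against GitHub's REST endpoint for listing a repository's issues. The `build_issues_request` helper is a hypothetical name, the token is a placeholder, and nothing is sent over the network until the request is passed to `urllib.request.urlopen`.

```python
import urllib.request


def build_issues_request(owner, repo, token):
    """Build (but do not send) a GitHub REST API request that lists open issues."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues?state=open&per_page=20"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",       # the PAT authenticates the call
            "Accept": "application/vnd.github+json",  # GitHub's JSON media type
        },
    )


request = build_issues_request("openai", "openai-python", "<your-pat-here>")
print(request.full_url)
# To send it: urllib.request.urlopen(request), then JSON-decode the response body.
```

Keep the token out of source control; reading it from an environment variable or a secrets store is the usual approach.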

With your PAT in hand, you're set to use Graphlit to explore the repositories you're interested in.

Graphlit RAG offers a distilled view of the issues within a repository which can significantly aid in understanding and contributing to complex projects like OpenAI's Python repository.

Consider an issue categorized under "Library Usage and Configuration," which reflects common hurdles users may encounter when interacting with the library. Graphlit points out GitHub Issue #3, titled "AttributeError: partially initialized module 'openai' has no attribute 'Completion'." This indicates users are having trouble accessing certain features due to initialization errors in the library.

Understanding this issue through Graphlit, a contributor could:

Investigate the reported AttributeError to understand why the 'openai' module isn't fully initializing.

Review the initialization process and identify gaps that might lead to incomplete module loading.

Clone the repository and replicate the issue in a local development environment.

Implement fixes to ensure complete and correct initialization of the 'openai' module.

Thoroughly test the changes across different environments to ensure that the 'Completion' attribute is consistently accessible.

Update the documentation to assist users with troubleshooting similar issues in the future.

Prepare and submit a pull request detailing the problem, the implemented fix, and the testing conducted to validate the changes.
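When investigating a "partially initialized module" error, one quick check is worth doing first. In Python generally, this message often appears when a local file or folder has the same name as the package being imported, so the script ends up importing itself. Whether that is the cause of the issue cited here is an assumption the source does not confirm. The hypothetical `find_shadowing_files` helper below looks for such a name clash in a project directory.

```python
import tempfile
from pathlib import Path


def find_shadowing_files(directory, package_name="openai"):
    """Return local files or folders that would shadow `package_name` on import."""
    root = Path(directory)
    candidates = [root / f"{package_name}.py", root / package_name]
    return [path for path in candidates if path.exists()]


# Demonstrate with a throwaway project directory.
with tempfile.TemporaryDirectory() as project_dir:
    before = find_shadowing_files(project_dir)
    (Path(project_dir) / "openai.py").write_text("import openai\n")
    after = [path.name for path in find_shadowing_files(project_dir)]

print(before, after)
```

If the check finds a clash, renaming the local file is usually enough. If it finds nothing, the remaining steps in the list still apply.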

Following these steps, including generating your PAT and using Graphlit, not only simplifies the process of finding relevant issues but also gives you a clear path for addressing them, making the journey into open-source contribution more approachable and structured.

Remember to be patient; maintainers are often busy but will provide feedback as soon as possible.

Stepping into the world of open-source can be the start of an incredibly rewarding journey. Regardless of size, every contribution is a valuable step toward collective progress. As you prepare to make your mark, take heart in the knowledge that every expert was once a beginner. So go ahead, take the leap, and join the collaborative symphony of open-source. Who knows? Your code could be the next puzzle piece in a project that changes the world. Happy coding!




# Webpage: https://www.graphlit.com/blog/pdf-ingestion-using-graphlit:
PDF Ingestion Using Graphlit

Archana Vaidheeswaran

June 23, 2024

Introduction

Navigating the complex structure of PDF documents to extract usable data poses a significant challenge because of their non-uniform format and embedded elements such as images, plots, and tables. In this article, we will talk about the challenges of extracting PDF data and how you can use Graphlit to ingest PDF data and use an LLM to unlock interactive question-answering capabilities.

Challenges in PDF Data Extraction

PDFs are one of the most common document formats used across various industries due to their ability to maintain a consistent layout across different devices. However, the very features that make PDFs so versatile also create significant challenges for data extraction:

Non-Standard Layouts: PDFs often contain complex layouts with multicolumn texts, sidebars, and mixed content types like text, images, and tables interspersed. This variability can confuse tools that extract plain text, leading to incomplete or inaccurate data retrieval.

Embedded Content: Text in PDFs may be embedded as images, especially in scanned documents, making it inaccessible for text extraction tools without optical character recognition (OCR) capabilities.

Inconsistent Text Flow: The logical reading order in a PDF might not match the visual presentation. For example, columns on a page might be read left to right when they are intended to be read from top to bottom, which can disrupt data extraction accuracy.

Font and Encoding Issues: PDFs allow for embedding custom fonts, which might not be recognized by standard PDF readers or extraction tools, resulting in garbled or missing text during extraction.

Metadata and Security Features: PDFs can contain metadata and may be secured with encryption or usage restrictions, complicating data extraction without appropriate permissions or tools.

How PDF Data Extraction Works

To overcome these challenges, effective PDF data extraction employs a variety of techniques and technologies, typically involving several stages:

Pre-processing:

Normalization: Converts all data into a uniform format.

Image Pre-processing: To improve OCR results for scanned documents, image quality enhancement techniques such as de-skewing, noise reduction, and contrast adjustment are applied.

Text Extraction:

OCR: Used for scanned PDFs to convert image-based content into selectable and searchable text.

Text Recognition: Tools parse the text layer of digital PDFs to extract readable content. This involves understanding the PDF's internal structure to differentiate between actual text and graphical elements.

Structural Analysis:

Layout Parsing: Identifies the structural elements of a PDF, such as columns, headers, footers, and paragraphs. This is crucial for maintaining the logical flow of content.

Table and Graph Detection: Specialized algorithms detect and reconstruct tables and charts to preserve data integrity and relationships.

Content Post-processing:

Data Cleansing: Removes any artifacts or errors introduced during OCR or text extraction.

Validation: Ensures the extracted data matches expected formats or schemas, using techniques like pattern recognition or checksums.

Semantic Enrichment:

Entity Recognition: Identifies and classifies entities such as dates, names, and locations within the text.

Contextual Analysis: Tools apply natural language processing to infer context and meaning from the text, enhancing the richness of the extracted data.

Integration and Output:

Data Structuring: The extracted information is structured into a format suitable for further analysis or storage, such as JSON or XML, or loaded directly into a vector database.

Output Delivery: Data is made accessible to users or downstream applications, often through APIs or direct export options.
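The content post-processing stage described above can be sketched in a few lines of Python. The rules below are illustrative assumptions rather than what any particular extraction tool does: `clean_extracted_text` re-joins words hyphenated across line breaks and normalizes whitespace, and `is_valid_date_field` validates a field against an ISO-style date pattern.

```python
import re


def clean_extracted_text(raw):
    """Remove common extraction artifacts from text (simplified sketch)."""
    # Re-join words split by a hyphen at a line break. This can also merge
    # genuine hyphenated compounds that happen to break across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    return text.strip()


DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date_field(value):
    """Check that a field looks like YYYY-MM-DD (format only, not calendar validity)."""
    return DATE_PATTERN.fullmatch(value) is not None


raw = "Atten-\ntion   is all \n you need.  "
cleaned = clean_extracted_text(raw)
print(repr(cleaned))
print(is_valid_date_field("2024-06-23"), is_valid_date_field("June 23, 2024"))
```

Real pipelines layer many more rules on top, which is part of the effort a managed ingestion service takes off your hands.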

PDF extraction involves multiple steps, such as document conversion, layout analysis, text extraction, and data parsing. Tackling these steps by yourself can be overwhelming and divert your focus from building your core application and serving your users. Graphlit simplifies the entire process of PDF ingestion and data extraction by providing an easy API.

To get started with Graphlit, all you need is the URL of the PDF you want to ingest and a Graphlit account. With just a few lines of code, you can integrate Graphlit into your application and start processing PDFs effortlessly.

Using Graphlit to Ingest PDF Data

There are a few ways we can ingest PDF data into Graphlit. The easiest way to ingest PDFs is to use the `ingestUri` mutation. You can also try this out in the API Explorer.

You will need to specify the URI for the PDF. In my example, I am using the arXiv link for the “Attention Is All You Need” paper.

Mutation:

    mutation IngestUri($uri: URL!) {
      ingestUri(uri: $uri) {
        name
        id
      }
    }

Variables:

    {
      "uri": "https://arxiv.org/pdf/1706.03762"
    }

Response:

    {
      "data": {
        "ingestUri": {
          "name": "1706.03762",
          "id": "c701c43b-68b5-4be1-8d15-347ccd3594b0"
        }
      }
    }
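If you want to send this mutation from your own code rather than the API Explorer, the usual GraphQL-over-HTTP convention is to POST a JSON body with `query` and `variables` keys. The sketch below only builds that body. The endpoint URL and the authorization header are deliberately left out because they are not given in this article, and `build_ingest_payload` is a hypothetical helper name.

```python
import json

INGEST_URI_MUTATION = """
mutation IngestUri($uri: URL!) {
  ingestUri(uri: $uri) {
    name
    id
  }
}
"""


def build_ingest_payload(uri):
    """Serialize the mutation and its variables as a GraphQL-over-HTTP JSON body."""
    return json.dumps({"query": INGEST_URI_MUTATION, "variables": {"uri": uri}})


payload = build_ingest_payload("https://arxiv.org/pdf/1706.03762")
decoded = json.loads(payload)
print(decoded["variables"])
```

Keeping the URI in `variables`, rather than formatting it into the query string, avoids quoting problems and matches how the mutation is declared.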

And that’s it! You can now start a conversation in Graphlit, or use the QueryContents mutation to start a chat, and Graphlit will inject relevant context from this PDF into your chat context.

You can also use Graphlit’s Python API to ingest the PDF. To do that, you will first need to install the graphlit-client package:

    !pip install graphlit-client -q

Next, you need to initialize your Graphlit client with your organization ID, environment ID, and JWT secret. You will find these values in your project settings in the Graphlit portal.

    from graphlit import Graphlit

    graphlit = Graphlit(
        organization_id="<your-org-id-here>",
        environment_id="<your-env-id-here>",
        jwt_secret="<your-jwt-secret-here>",
    )

Then, you can use the ingest_uri method to ingest your PDF.

    uri = "https://arxiv.org/pdf/1706.03762"

    response = await graphlit.client.ingest_uri(uri, is_synchronous=True)
    response.ingest_uri.id

However, the true power of Graphlit comes from the external connectors that it supports. For PDF files, a popular method for higher-quality document extraction is to use Azure AI Document Intelligence.

This can smartly ingest the PDF, taking into account the document’s layout, including titles, paragraphs, and tables. It also supports some common document formats, such as US tax forms, ID documents, and credit cards, among others. You can integrate Azure AI Document Intelligence into your ingestion pipeline using workflows.

Let’s see how to do that:

    workflow_input = WorkflowInput(
        name="Azure AI Document Intelligence",
        preparation=PreparationWorkflowStageInput(
            jobs=[
                PreparationWorkflowJobInput(
                    connector=FilePreparationConnectorInput(
                        type=FilePreparationServiceTypes.AZURE_DOCUMENT_INTELLIGENCE,
                        azureDocument=AzureDocumentPreparationPropertiesInput(
                            model=AzureDocumentIntelligenceModels.LAYOUT
                        ),
                    )
                )
            ]
        ),
    )

    # Create the workflow, passing the configuration defined above.
    response = await graphlit.client.create_workflow(workflow_input)
    workflow_id = response.create_workflow.id

    # Ingest the PDF through the new workflow.
    response = await graphlit.client.ingest_uri(
        uri, is_synchronous=True, workflow=EntityReferenceInput(id=workflow_id)
    )

First, we create a WorkflowInput object named workflow_input. This object represents the configuration for the workflow. Inside the WorkflowInput, we specify the name of the workflow as "Azure AI Document Intelligence". We then define the PreparationWorkflowStageInput.

Within the preparation stage, we specify a list of jobs using PreparationWorkflowJobInput. In this case, we have a single job. The job uses a FilePreparationConnectorInput to specify the type of file preparation service to use.

Here, we set the type to FilePreparationServiceTypes.AZURE_DOCUMENT_INTELLIGENCE. We also provide some additional properties for Azure Document Intelligence using AzureDocumentPreparationPropertiesInput.

In this example, we set the model to AzureDocumentIntelligenceModels.LAYOUT, which means we want to extract the layout information from the documents. After configuring the workflow input, we make a call to graphlit.client.create_workflow(), passing the workflow_input as a parameter. This creates the workflow in the Graphlit system.

Remember to store the ID of the newly created workflow. Finally, we use the ingest_uri method to ingest a document specified by the uri variable. We also provide the workflow_id as a reference to the workflow we created earlier.

By using Graphlit and Azure AI Document Intelligence, you can automate document processing tasks and extract valuable information from your documents efficiently. Graphlit handles all the heavy lifting for you, including document conversion, layout analysis, text extraction, and data parsing. It provides a streamlined and efficient workflow that abstracts away the complexities, allowing you to concentrate on what truly matters: developing your application and delivering value to your users.


