Commit

…into STAGING
prakriti-solankey committed Jul 2, 2024
2 parents 31be778 + 65bd2c4 commit be2e2ea
Showing 96 changed files with 3,532 additions and 1,412 deletions.
Binary file removed POC_Documents/V1/From Local to Global.docx
Binary file not shown.
Binary file not shown.
Binary file removed POC_Documents/V1/propsed chatbot architecture.jpg
Binary file not shown.
162 changes: 74 additions & 88 deletions README.md
@@ -1,15 +1,34 @@

# Knowledge Graph Builder App
This application is designed to convert PDF documents into a knowledge graph stored in Neo4j. It utilizes the power of OpenAI's GPT/Diffbot LLM(Large language model) to extract nodes, relationships and properties from the text content of the PDF and then organizes them into a structured knowledge graph using Langchain framework.
Files can be uploaded from local machine or S3 bucket and then LLM model can be chosen to create the knowledge graph.

### Getting started
Creating knowledge graphs from unstructured data


# LLM Graph Builder

![Python](https://img.shields.io/badge/Python-yellow)
![FastAPI](https://img.shields.io/badge/FastAPI-green)
![React](https://img.shields.io/badge/React-blue)

## Overview
This application turns unstructured data (PDFs, docs, text files, YouTube videos, web pages, etc.) into a knowledge graph stored in Neo4j. It uses large language models (OpenAI, Gemini, etc.) to extract nodes, relationships and their properties from the text, and builds a structured knowledge graph with the LangChain framework.

Upload files from your local machine, a GCS or S3 bucket, or web sources, choose an LLM, and generate the knowledge graph.

## Key Features
- **Knowledge Graph Creation**: Transform unstructured data into structured knowledge graphs using LLMs.
- **Providing Schema**: Provide your own custom schema or use an existing schema in settings to generate the graph.
- **View Graph**: View the graph for one source or several sources at a time in Bloom.
- **Chat with Data**: Interact with your data in a Neo4j database through conversational queries, and retrieve metadata about the sources behind each response.

## Getting started

:warning: You will need to have a Neo4j Database V5.15 or later with [APOC installed](https://neo4j.com/docs/apoc/current/installation/) to use this Knowledge Graph Builder.
You can use any [Neo4j Aura database](https://neo4j.com/aura/) (including the free database).
If you are using Neo4j Desktop, you cannot use docker-compose and must instead follow the [separate deployment of backend and frontend section](#running-backend-and-frontend-separately-dev-environment). :warning:
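
The version and APOC requirements above can be checked before starting the app. The following is a minimal sketch, assuming the official `neo4j` Python driver (5.x) is installed. The helper names `meets_minimum` and `check_database` are hypothetical and are not part of this repository.

```python
import re


def meets_minimum(version: str, minimum: tuple = (5, 15)) -> bool:
    """Return True if a Neo4j version string such as '5.15.0' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_database(uri: str, username: str, password: str) -> None:
    """Connect and report the server version and the installed APOC version."""
    # Imported lazily so the version helper works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        records, _, _ = driver.execute_query(
            "CALL dbms.components() YIELD name, versions "
            "RETURN name, versions[0] AS version"
        )
        version = records[0]["version"]
        status = "OK" if meets_minimum(version) else "too old"
        print(f"Neo4j {version}: {status}")
        # apoc.version() only resolves when the APOC plugin is installed;
        # otherwise this query raises an error from the server.
        records, _, _ = driver.execute_query("RETURN apoc.version() AS apoc")
        print(f"APOC {records[0]['apoc']} installed")
```

Calling `check_database(uri, username, password)` with your own connection details prints the detected versions, or raises if APOC is missing.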

### Deploy locally

## Deployment
### Local deployment
#### Running through docker-compose
By default only OpenAI and Diffbot are enabled since Gemini requires extra GCP configurations.

@@ -21,13 +40,13 @@ DIFFBOT_API_KEY="your-diffbot-key"

If you only want OpenAI:
```env
LLM_MODELS="OpenAI GPT 3.5,OpenAI GPT 4o"
LLM_MODELS="gpt-3.5,gpt-4o"
OPENAI_API_KEY="your-openai-key"
```

If you only want Diffbot:
```env
LLM_MODELS="Diffbot"
LLM_MODELS="diffbot"
DIFFBOT_API_KEY="your-diffbot-key"
```
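
`LLM_MODELS` is a comma-separated list, as the values above show. The following sketch illustrates how such a value could be split into individual model names. The function `parse_models` is hypothetical; the app's actual parsing code is not part of this diff.

```python
import os


def parse_models(raw: str) -> list[str]:
    """Split a comma-separated LLM_MODELS value into trimmed, non-empty names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


# Fall back to the default listed in the README's ENV table.
models = parse_models(os.environ.get("LLM_MODELS", "diffbot,gpt-3.5,gpt-4o"))
print(models)
```

Trimming whitespace and dropping empty entries makes the value tolerant of a trailing comma or spaces after commas.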

@@ -36,16 +55,16 @@ You can then run Docker Compose to build and start all components:
docker-compose up --build
```

##### Additional configs
#### Additional configs

By default, the input sources will be: Local files, Youtube, Wikipedia and AWS S3. As this default config is applied:
By default, the input sources are local files, YouTube, Wikipedia, AWS S3 and web pages, because this default config is applied:
```env
REACT_APP_SOURCES="local,youtube,wiki,s3"
REACT_APP_SOURCES="local,youtube,wiki,s3,web"
```

If you also want the Google Cloud Storage (GCS) integration, add `gcs` and your Google client ID:
```env
REACT_APP_SOURCES="local,youtube,wiki,s3,gcs"
REACT_APP_SOURCES="local,youtube,wiki,s3,gcs,web"
GOOGLE_CLIENT_ID="xxxx"
```
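
The rule above, that `gcs` needs a Google client ID, can be expressed as a small configuration check. This is a sketch under assumptions: `validate_sources` is a hypothetical helper, and the set of known source names is taken only from the values shown in this README.

```python
# Source names that appear in the README's REACT_APP_SOURCES examples.
KNOWN_SOURCES = {"local", "youtube", "wiki", "s3", "gcs", "web"}


def validate_sources(raw_sources: str, google_client_id: str = "") -> list[str]:
    """Parse REACT_APP_SOURCES and check that 'gcs' comes with a client ID."""
    sources = [s.strip() for s in raw_sources.split(",") if s.strip()]
    unknown = [s for s in sources if s not in KNOWN_SOURCES]
    if unknown:
        raise ValueError(f"Unknown sources: {', '.join(unknown)}")
    if "gcs" in sources and not google_client_id:
        raise ValueError("The 'gcs' source needs GOOGLE_CLIENT_ID to be set")
    return sources
```

For example, `validate_sources("local,youtube,wiki,s3,gcs,web", "xxxx")` returns the six source names, while the same call without a client ID raises a `ValueError`.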

@@ -76,7 +95,24 @@ Alternatively, you can run the backend and frontend separately:
pip install -r requirements.txt
uvicorn score:app --reload
```
### ENV
### Deploy in Cloud
To deploy the app and packages on Google Cloud Platform, run the following commands to deploy to Google Cloud Run:
```bash
# Frontend deploy: run from the frontend directory.
# The flags answer the prompts for source location, region and unauthenticated access.
gcloud run deploy --source . --region us-central1 --allow-unauthenticated
```
```bash
# Backend deploy: run from the backend directory and fill in each value after the "=".
gcloud run deploy --source . --region us-central1 --allow-unauthenticated \
  --set-env-vars "OPENAI_API_KEY=,DIFFBOT_API_KEY=,NEO4J_URI=,NEO4J_PASSWORD=,NEO4J_USERNAME="
```

## ENV
| Env Variable Name | Mandatory/Optional | Default Value | Description |
|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
| OPENAI_API_KEY | Mandatory | | API key for OpenAI |
@@ -86,7 +122,7 @@ Alternatively, you can run the backend and frontend separately:
| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
| GEMINI_ENABLED | Optional | False | Flag to enable Gemini |
| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs |
| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 6 | Number of chunks to combine when processing embeddings |
| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 5 | Number of chunks to combine when processing embeddings |
| UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before updating progress |
| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database |
| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database |
@@ -98,86 +134,36 @@ Alternatively, you can run the backend and frontend separately:
| BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API |
| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available |
| LLM_MODELS | Optional | Diffbot,OpenAI GPT 3.5,OpenAI GPT 4o | Models available for selection on the frontend, used for entities extraction and Q&A Chatbot |
| LLM_MODELS | Optional | diffbot,gpt-3.5,gpt-4o | Models available for selection on the frontend, used for entities extraction and Q&A Chatbot |
| ENV | Optional | DEV | Environment variable for the app |
| TIME_PER_CHUNK | Optional | 4 | Time per chunk for processing |
| CHUNK_SIZE | Optional | 5242880 | Size of each chunk for processing |
| CHUNK_SIZE | Optional | 5242880 | Size of each chunk of file for upload |
| GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
| GCS_FILE_CACHE | Optional | False | If set to True, will save the files to process into GCS. If set to False, will save the files locally |
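
A few of the defaults in the table can be read from the environment as sketched below. This is an illustration only: `env_value`, `env_bool`, `load_settings` and `upload_chunk_count` are hypothetical helpers, not the backend's actual code. The `CHUNK_SIZE` default of 5242880 bytes is 5 × 1024 × 1024, i.e. 5 MiB per upload chunk.

```python
import math
import os


def env_value(name: str, default: str) -> str:
    """Return the variable's value, treating unset and empty as 'use default'."""
    return os.environ.get(name) or default


def env_bool(name: str, default: bool = False) -> bool:
    """Interpret 'True'/'False' style flags case-insensitively."""
    return env_value(name, str(default)).strip().lower() == "true"


def load_settings() -> dict:
    """Read selected settings, falling back to the defaults from the ENV table."""
    return {
        "KNN_MIN_SCORE": float(env_value("KNN_MIN_SCORE", "0.94")),
        "NUMBER_OF_CHUNKS_TO_COMBINE": int(env_value("NUMBER_OF_CHUNKS_TO_COMBINE", "5")),
        "UPDATE_GRAPH_CHUNKS_PROCESSED": int(env_value("UPDATE_GRAPH_CHUNKS_PROCESSED", "20")),
        "CHUNK_SIZE": int(env_value("CHUNK_SIZE", "5242880")),
        "GCS_FILE_CACHE": env_bool("GCS_FILE_CACHE"),
    }


def upload_chunk_count(file_size_bytes: int, chunk_size: int = 5242880) -> int:
    """Number of upload chunks needed for a file of the given size."""
    return math.ceil(file_size_bytes / chunk_size)
```

With the default chunk size, a 12 MiB file would be uploaded in `upload_chunk_count(12 * 1024 * 1024)` chunks, which is three.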


###
To deploy the app and packages on Google Cloud Platform, run the following command on google cloud run:
```bash
# Frontend deploy
gcloud run deploy
source location current directory > Frontend
region : 32 [us-central 1]
Allow unauthenticated request : Yes
```
```bash
# Backend deploy
gcloud run deploy --set-env-vars "OPENAI_API_KEY = " --set-env-vars "DIFFBOT_API_KEY = " --set-env-vars "NEO4J_URI = " --set-env-vars "NEO4J_PASSWORD = " --set-env-vars "NEO4J_USERNAME = "
source location current directory > Backend
region : 32 [us-central 1]
Allow unauthenticated request : Yes
```
### Features
- **PDF Upload**: Users can upload PDF documents using the Drop Zone.
- **S3 Bucket Integration**: Users can also specify PDF documents stored in an S3 bucket for processing.
- **Knowledge Graph Generation**: The application employs OpenAI/Diffbot's LLM to extract relevant information from the PDFs and construct a knowledge graph.
- **Neo4j Integration**: The extracted nodes and relationships are stored in a Neo4j database for easy visualization and querying.
- **Grid View of source node files with** : Name,Type,Size,Nodes,Relations,Duration,Status,Source,Model
## Functions/Modules
#### extract_graph_from_file(uri, userName, password, file_path, model):
Extracts nodes , relationships and properties from a PDF file leveraging LLM models.
Args:
uri: URI of the graph to extract
userName: Username to use for graph creation ( if None will use username from config file )
password: Password to use for graph creation ( if None will use password from config file )
file: File object containing the PDF file path to be used
model: Type of model to use ('Gemini Pro' or 'Diffbot')
Returns:
Json response to API with fileName, nodeCount, relationshipCount, processingTime,
status and model as attributes.
<img width="692" alt="neoooo" src="https://github.com/neo4j-labs/llm-graph-builder/assets/118245454/01e731df-b565-4f4f-b577-c47e39dd1748">
#### create_source_node_graph(uri, userName, password, file):
Creates a source node in Neo4jGraph and sets properties.
Args:
uri: URI of Graph Service to connect to
userName: Username to connect to Graph Service with ( default : None )
password: Password to connect to Graph Service with ( default : None )
file: File object with information about file to be added
Returns:
Success or Failure message of node creation
<img width="958" alt="neo_workspace" src="https://github.com/neo4j-labs/llm-graph-builder/assets/118245454/f2eb11cd-718c-453e-bec9-11410ec6e45d">
#### get_source_list_from_graph():
Returns a list of file sources in the database by querying the graph and
sorting the list by the last updated date.
<img width="822" alt="get_source" src="https://github.com/neo4j-labs/llm-graph-builder/assets/118245454/1d8c7a86-6f10-4916-a4c1-8fdd9f312bcc">
#### Chunk nodes and embeddings creation in Neo4j
<img width="926" alt="chunking" src="https://github.com/neo4j-labs/llm-graph-builder/assets/118245454/4d61479c-e5e9-415e-954e-3edf6a773e72">
## Application Walkthrough
https://github.com/neo4j-labs/llm-graph-builder/assets/121786590/b725a503-6ade-46d2-9e70-61d57443c311

## Usage
1. Connect to a Neo4j Aura instance by passing the URI and password or by using a Neo4j credentials file.
2. Choose your source from the list of unstructured sources to create the graph.
3. If required, change the LLM used to generate the graph from the drop-down.
4. Optionally, define a schema (node and relationship labels) in the entity graph extraction settings.
5. Either select multiple files and click 'Generate Graph', or all files in 'New' status will be processed for graph creation.
6. Inspect the graph for individual files using 'View' in the grid, or select one or more files and click 'Preview Graph'.
7. Ask the chatbot questions about the processed/completed sources, and get detailed information about the answers generated by the LLM.

## Links
The Public [ Google cloud Run URL](https://devfrontend-dcavk67s4a-uc.a.run.app).
[Workspace URL](https://workspace-preview.neo4j.io/workspace)

[LLM Knowledge Graph Builder Application](https://llm-graph-builder.neo4jlabs.com/)

[Neo4j Workspace](https://workspace-preview.neo4j.io/workspace/query)

## Reference

[Demo of application](https://www.youtube.com/watch?v=LlNy5VmV290)

## Contact
For any inquiries or support, feel free to raise a [GitHub issue](https://github.com/neo4j-labs/llm-graph-builder/issues).


## Happy Graph Building!
4 changes: 3 additions & 1 deletion backend/example.env
@@ -20,4 +20,6 @@ LANGCHAIN_API_KEY = ""
LANGCHAIN_PROJECT = ""
LANGCHAIN_TRACING_V2 = ""
LANGCHAIN_ENDPOINT = ""
GCS_FILE_CACHE = "" #save the file into GCS or local, should be True or False
GCS_FILE_CACHE = "" #save the file into GCS or local, should be True or False
NEO4J_USER_AGENT = ""
ENABLE_USER_AGENT = ""