Showing with 341 additions and 139 deletions.
  1. +16 −7 docs/components/data-sources/csv.mdx
  2. +25 −22 docs/components/data-sources/pdf-file.mdx
  3. +43 −0 docs/components/llms.mdx
  4. +67 −0 docs/examples/slack-AI.mdx
  5. +1 −1 docs/get-started/quickstart.mdx
  6. BIN docs/images/slack-ai.png
  7. +2 −1 docs/mint.json
  8. +1 −1 embedchain-js/README.md
  9. +9 −4 embedchain/app.py
  10. +6 −5 embedchain/chunkers/base_chunker.py
  11. +1 −1 embedchain/client.py
  12. +3 −2 embedchain/config/add_config.py
  13. +6 −3 embedchain/config/cache_config.py
  14. +16 −5 embedchain/config/llm/base.py
  15. +2 −2 embedchain/config/vectordb/qdrant.py
  16. +1 −1 embedchain/config/vectordb/zilliz.py
  17. +2 −1 embedchain/data_formatter/data_formatter.py
  18. +7 −9 embedchain/embedchain.py
  19. +3 −3 embedchain/embedder/base.py
  20. +2 −2 embedchain/embedder/google.py
  21. +1 −1 embedchain/helpers/json_serializable.py
  22. +8 −7 embedchain/llm/base.py
  23. +1 −1 embedchain/llm/google.py
  24. +21 −1 embedchain/llm/huggingface.py
  25. +2 −1 embedchain/llm/ollama.py
  26. +1 −1 embedchain/loaders/base_loader.py
  27. +1 −1 embedchain/loaders/directory_loader.py
  28. +2 −1 embedchain/loaders/docs_site_loader.py
  29. +19 −15 embedchain/loaders/github.py
  30. +2 −1 embedchain/loaders/image.py
  31. +3 −2 embedchain/loaders/json.py
  32. +2 −1 embedchain/loaders/mysql.py
  33. +1 −1 embedchain/loaders/openapi.py
  34. +2 −2 embedchain/loaders/postgres.py
  35. +2 −1 embedchain/loaders/slack.py
  36. +1 −1 embedchain/loaders/unstructured_file.py
  37. +5 −4 embedchain/loaders/web_page.py
  38. +4 −2 embedchain/memory/base.py
  39. +2 −2 embedchain/memory/message.py
  40. +1 −1 embedchain/memory/utils.py
  41. +5 −3 embedchain/store/assistants.py
  42. +2 −1 embedchain/telemetry/posthog.py
  43. +3 −3 embedchain/utils/misc.py
  44. +1 −1 embedchain/vectordb/base.py
  45. +4 −2 embedchain/vectordb/chroma.py
  46. +2 −2 embedchain/vectordb/elasticsearch.py
  47. +1 −1 embedchain/vectordb/opensearch.py
  48. +2 −1 embedchain/vectordb/weaviate.py
  49. +5 −6 embedchain/vectordb/zilliz.py
  50. +1 −1 pyproject.toml
  51. +2 −1 tests/chunkers/test_text.py
  52. +19 −0 tests/llm/test_huggingface.py
23 changes: 16 additions & 7 deletions docs/components/data-sources/csv.mdx
@@ -2,18 +2,27 @@
title: '📊 CSV'
---

To add any csv file, use the data_type as `csv`. `csv` allows remote urls and conventional file paths. Headers are included for each line, so if you have an `age` column, `18` will be added as `age: 18`. Eg:
You can load any CSV file from your local file system or from a URL. Headers are included for each line, so if you have an `age` column, `18` will be added as `age: 18`.

## Usage

### Load from a local file

```python
from embedchain import App
app = App()
app.add('/path/to/file.csv', data_type='csv')
```

### Load from URL

```python
from embedchain import App
app = App()
app.add('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv', data_type="csv")
# Or add using the local file path
# app.add('/path/to/file.csv', data_type="csv")

app.query("Summarize the air travel data")
# Answer: The air travel data shows the number of flights for the months of July in the years 1958, 1959, and 1960. In July 1958, there were 491 flights, in July 1959 there were 548 flights, and in July 1960 there were 622 flights.
```

Note: There is a size limit allowed for csv file beyond which it can throw error. This limit is set by the LLMs. Please consider chunking large csv files into smaller csv files.
<Note>
There is a size limit for CSV files, beyond which the loader may throw an error. This limit is set by the LLM. Please consider splitting large CSV files into smaller ones.
</Note>
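If you do need to split a large file first, a minimal sketch using pandas (an assumption; it is not required by Embedchain, and `large.csv` is just an illustrative file name) might look like this:

```python
import pandas as pd

# Split a hypothetical `large.csv` into smaller files of 1000 rows each,
# then add each part individually with app.add(..., data_type="csv").
for i, part in enumerate(pd.read_csv("large.csv", chunksize=1000)):
    part.to_csv(f"large_part_{i}.csv", index=False)
```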

47 changes: 25 additions & 22 deletions docs/components/data-sources/pdf-file.mdx
@@ -1,14 +1,31 @@
---
title: '📰 PDF file'
title: '📰 PDF'
---

To add any pdf file, use the data_type as `pdf_file`. Eg:
You can load any PDF file from your local file system or from a URL.

## Setup

Install the `pypdf` package, which is used for parsing PDF files:

```bash
pip install pypdf
```

## Usage

### Load from a local file

```python
from embedchain import App

app = App()
app.add('/path/to/file.pdf', data_type='pdf_file')
```

### Load from URL

```python
from embedchain import App
app = App()
app.add('https://arxiv.org/pdf/1706.03762.pdf', data_type='pdf_file')
app.query("What is the paper 'attention is all you need' about?", citations=True)
# Answer: The paper "Attention Is All You Need" proposes a new network architecture called the Transformer, which is based solely on attention mechanisms. It suggests that complex recurrent or convolutional neural networks can be replaced with a simpler architecture that connects the encoder and decoder through attention. The paper discusses how this approach can improve sequence transduction models, such as neural machine translation.
@@ -23,25 +40,11 @@ app.query("What is the paper 'attention is all you need' about?", citations=True
# ...
# }
# ),
# (
# 'Attention Visualizations Input ...',
# {
# 'page': 12,
# 'url': 'https://arxiv.org/pdf/1706.03762.pdf',
# 'score': 0.41679039679873736,
# ...
# }
# ),
# (
# 'sequence learning ...',
# {
# 'page': 10,
# 'url': 'https://arxiv.org/pdf/1706.03762.pdf',
# 'score': 0.4188303600897153,
# ...
# }
# )
# ]
```

Note that we do not support password protected pdfs.
We also store the page number under the key `page` with each chunk, which helps you understand where the answer is coming from. You can fetch the `page` key from the chunk metadata during retrieval (refer to the example given above).
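For example, a minimal sketch of reading that metadata, assuming the `(answer, sources)` tuple format shown in the commented output above:

```python
answer, sources = app.query(
    "What is the paper 'attention is all you need' about?", citations=True
)
for chunk, metadata in sources:
    # each retrieved chunk carries the page and source url it came from
    print(metadata["page"], metadata["url"])
```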

<Note>
Note that we do not support password-protected PDF files.
</Note>
43 changes: 43 additions & 0 deletions docs/components/llms.mdx
@@ -494,6 +494,49 @@ llm:
```
</CodeGroup>

### Custom Endpoints


You can also use [Hugging Face Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index#-inference-endpoints) to access custom endpoints. First, set the `HUGGINGFACE_ACCESS_TOKEN` as above.

Then, load the app using the config yaml file:

<CodeGroup>

```python main.py
import os
from embedchain import App
os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "xxx"
# load llm configuration from config.yaml file
app = App.from_config(config_path="config.yaml")
```

```yaml config.yaml
llm:
  provider: huggingface
  config:
    endpoint: https://api-inference.huggingface.co/models/gpt2 # replace with your personal endpoint
```
</CodeGroup>

If your endpoint requires additional parameters, you can pass them in the `model_kwargs` field:

```yaml
llm:
  provider: huggingface
  config:
    endpoint: <YOUR_ENDPOINT_URL_HERE>
    model_kwargs:
      max_new_tokens: 100
      temperature: 0.5
```

Currently, only the `text-generation` and `text2text-generation` tasks are supported [[ref](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html?highlight=huggingfaceendpoint#)].

See LangChain's [Hugging Face endpoint](https://python.langchain.com/docs/integrations/chat/huggingface#huggingfaceendpoint) documentation for more information.
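If you prefer to keep the configuration in Python, a rough equivalent sketch (assuming `App.from_config` accepts an inline `config` dict; the endpoint URL is a placeholder) is:

```python
import os
from embedchain import App

os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "xxx"

# Same settings as the yaml above, expressed as a Python dict.
config = {
    "llm": {
        "provider": "huggingface",
        "config": {
            "endpoint": "<YOUR_ENDPOINT_URL_HERE>",  # placeholder
            "model_kwargs": {"max_new_tokens": 100, "temperature": 0.5},
        },
    }
}
app = App.from_config(config=config)
```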

## Llama2

Llama2 is integrated through [Replicate](https://replicate.com/). Set the `REPLICATE_API_TOKEN` environment variable, which you can obtain from [their platform](https://replicate.com/account/api-tokens).
67 changes: 67 additions & 0 deletions docs/examples/slack-AI.mdx
@@ -0,0 +1,67 @@
[Embedchain Examples Repo](https://github.com/embedchain/examples) contains the code for building your own Slack AI to chat with the unstructured data in your Slack channels.

![Slack AI Demo](/images/slack-ai.png)

## Getting started

Creating a Slack AI involves three steps:

* Create a Slack user token
* Set environment variables
* Run the app locally

### Step 1: Create Slack user token

Follow the steps below to fetch your Slack user token, which is used to pull data through the Slack APIs:

1. Create a workspace on Slack if you don’t have one already by clicking [here](https://slack.com/intl/en-in/).
2. Create a new App on your Slack account by going [here](https://api.slack.com/apps).
3. Select `From Scratch`, then enter the App Name and select your workspace.
4. Navigate to the `OAuth & Permissions` tab in the left sidebar and go to the `Scopes` section. Add the following scopes under `User Token Scopes`:

```
# The following scopes are needed for reading channel history
channels:history
channels:read
# The following scopes are needed to fetch the list of channels from Slack
groups:read
mpim:read
im:read
```

5. Click the `Install to Workspace` button under the `OAuth Tokens for Your Workspace` section on the same page and install the app in your Slack workspace.
6. After installing the app, you will see the `User OAuth Token`. Save that token, as you will need to configure it as `SLACK_USER_TOKEN` for this demo; a quick way to verify it is shown in the sketch below.
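As a quick sanity check, you can verify the token before wiring it into the demo. This sketch assumes the `slack_sdk` package (`pip install slack-sdk`) and that `SLACK_USER_TOKEN` is exported in your shell:

```python
import os

from slack_sdk import WebClient

# Confirms the token is valid for your workspace.
client = WebClient(token=os.environ["SLACK_USER_TOKEN"])
response = client.auth_test()  # raises SlackApiError if the token is invalid
print(response["team"], response["user"])
```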

### Step 2: Set environment variables

Navigate to the `api` folder and set your `HUGGINGFACE_ACCESS_TOKEN` and `SLACK_USER_TOKEN` in the `.env.example` file. Then rename `.env.example` to `.env`.


<Note>
By default, we use the `Mixtral` model from Hugging Face. However, if you prefer to use an OpenAI model, set `OPENAI_API_KEY` instead of `HUGGINGFACE_ACCESS_TOKEN` (along with `SLACK_USER_TOKEN`) in the `.env` file, and update the code in the `api/utils/app.py` file to use the OpenAI model instead of the Hugging Face model.
</Note>
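As a rough sketch of that swap (the exact contents of `api/utils/app.py` may differ; the provider and model values below are illustrative and follow Embedchain's standard llm config):

```python
from embedchain import App

# Hypothetical replacement for the Hugging Face app configuration.
app = App.from_config(
    config={
        "llm": {
            "provider": "openai",
            "config": {"model": "gpt-4", "temperature": 0.5},
        }
    }
)
```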

### Step 3: Run app locally

Follow the instructions below to run the app locally, based on your development setup (with or without Docker):

#### With docker

```bash
docker-compose build
ec start --docker
```

#### Without docker

```bash
ec install-reqs
ec start
```

Finally, you will have the Slack AI frontend running on http://localhost:3000. You can also access the REST APIs on http://localhost:8000.

## Credits

This demo was built using Embedchain's [full stack demo template](https://docs.embedchain.ai/get-started/full-stack). Follow the instructions [given here](https://docs.embedchain.ai/get-started/full-stack) to create your own full-stack RAG application.
2 changes: 1 addition & 1 deletion docs/get-started/quickstart.mdx
@@ -5,7 +5,7 @@ description: '💡 Create a RAG app on your own data in a minute'

## Installation

First install the python package.
First install the Python package:

```bash
pip install embedchain
Binary file added docs/images/slack-ai.png
3 changes: 2 additions & 1 deletion docs/mint.json
@@ -175,7 +175,8 @@
"examples/full_stack",
"examples/openai-assistant",
"examples/opensource-assistant",
"examples/nextjs-assistant"
"examples/nextjs-assistant",
"examples/slack-AI"
]
},
{
2 changes: 1 addition & 1 deletion embedchain-js/README.md
@@ -178,7 +178,7 @@ await app.addLocal("qna_pair", ["Question", "Answer"]);

## Testing

Before you consume valueable tokens, you should make sure that the embedding you have done works and that it's receiving the correct document from the database.
Before you consume valuable tokens, you should make sure that the embedding you have done works and that it's receiving the correct document from the database.

For this you can use the `dryRun` method.

13 changes: 9 additions & 4 deletions embedchain/app.py
@@ -9,9 +9,14 @@
import requests
import yaml

from embedchain.cache import (Config, ExactMatchEvaluation,
SearchDistanceEvaluation, cache,
gptcache_data_manager, gptcache_pre_function)
from embedchain.cache import (
Config,
ExactMatchEvaluation,
SearchDistanceEvaluation,
cache,
gptcache_data_manager,
gptcache_pre_function,
)
from embedchain.client import Client
from embedchain.config import AppConfig, CacheConfig, ChunkerConfig
from embedchain.constants import SQLITE_PATH
@@ -27,7 +32,7 @@
from embedchain.vectordb.base import BaseVectorDB
from embedchain.vectordb.chroma import ChromaDB

# Setup the user directory if doesn't exist already
# Set up the user directory if it doesn't exist already
Client.setup_dir()


11 changes: 6 additions & 5 deletions embedchain/chunkers/base_chunker.py
@@ -17,15 +17,15 @@ def create_chunks(self, loader, src, app_id=None, config: Optional[ChunkerConfig
"""
Loads data and chunks it.
:param loader: The loader which's `load_data` method is used to create
:param loader: The loader whose `load_data` method is used to create
the raw data.
:param src: The data to be handled by the loader. Can be a URL for
remote sources or local content for local loaders.
:param app_id: App id used to generate the doc_id.
"""
documents = []
chunk_ids = []
idMap = {}
id_map = {}
min_chunk_size = config.min_chunk_size if config is not None else 1
logging.info(f"[INFO] Skipping chunks smaller than {min_chunk_size} characters")
data_result = loader.load_data(src)
@@ -49,8 +49,8 @@ def create_chunks(self, loader, src, app_id=None, config: Optional[ChunkerConfig
for chunk in chunks:
chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
chunk_id = f"{app_id}--{chunk_id}" if app_id is not None else chunk_id
if idMap.get(chunk_id) is None and len(chunk) >= min_chunk_size:
idMap[chunk_id] = True
if id_map.get(chunk_id) is None and len(chunk) >= min_chunk_size:
id_map[chunk_id] = True
chunk_ids.append(chunk_id)
documents.append(chunk)
metadatas.append(meta_data)
@@ -77,5 +77,6 @@ def set_data_type(self, data_type: DataType):

# TODO: This should be done during initialization. This means it has to be done in the child classes.

def get_word_count(self, documents):
@staticmethod
def get_word_count(documents) -> int:
return sum([len(document.split(" ")) for document in documents])
2 changes: 1 addition & 1 deletion embedchain/client.py
@@ -31,7 +31,7 @@ def __init__(self, api_key=None, host="https://apiv2.embedchain.ai"):
)

@classmethod
def setup_dir(self):
def setup_dir(cls):
"""
Loads the user id from the config file if it exists, otherwise generates a new
one and saves it to the config file.
5 changes: 3 additions & 2 deletions embedchain/config/add_config.py
@@ -26,7 +26,7 @@ def __init__(
if self.min_chunk_size >= self.chunk_size:
raise ValueError(f"min_chunk_size {min_chunk_size} should be less than chunk_size {chunk_size}")
if self.min_chunk_size < self.chunk_overlap:
logging.warn(
logging.warning(
f"min_chunk_size {min_chunk_size} should be greater than chunk_overlap {chunk_overlap}, otherwise it is redundant." # noqa:E501
)

@@ -35,7 +35,8 @@ def __init__(
else:
self.length_function = length_function if length_function else len

def load_func(self, dotpath: str):
@staticmethod
def load_func(dotpath: str):
if "." not in dotpath:
return getattr(builtins, dotpath)
else:
9 changes: 6 additions & 3 deletions embedchain/config/cache_config.py
@@ -10,12 +10,12 @@ class CacheSimilarityEvalConfig(BaseConfig):
This is the evaluator to compare two embeddings according to their distance computed in embedding retrieval stage.
In the retrieval stage, `search_result` is the distance used for approximate nearest neighbor search and have been
put into `cache_dict`. `max_distance` is used to bound this distance to make it between [0-`max_distance`].
`positive` is used to indicate this distance is directly proportional to the similarity of two entites.
If `positive` is set `False`, `max_distance` will be used to substract this distance to get the final score.
`positive` is used to indicate this distance is directly proportional to the similarity of two entities.
If `positive` is set `False`, `max_distance` will be used to subtract this distance to get the final score.
:param max_distance: the bound of maximum distance.
:type max_distance: float
:param positive: if the larger distance indicates more similar of two entities, It is True. Otherwise it is False.
:param positive: if the larger distance indicates more similar of two entities, It is True. Otherwise, it is False.
:type positive: bool
"""

@@ -29,6 +29,7 @@ def __init__(
self.max_distance = max_distance
self.positive = positive

@staticmethod
def from_config(config: Optional[Dict[str, Any]]):
if config is None:
return CacheSimilarityEvalConfig()
@@ -63,6 +64,7 @@ def __init__(
self.similarity_threshold = similarity_threshold
self.auto_flush = auto_flush

@staticmethod
def from_config(config: Optional[Dict[str, Any]]):
if config is None:
return CacheInitConfig()
@@ -83,6 +85,7 @@ def __init__(
self.similarity_eval_config = similarity_eval_config
self.init_config = init_config

@staticmethod
def from_config(config: Optional[Dict[str, Any]]):
if config is None:
return CacheConfig()