Feat/chroma #213

Open: wants to merge 2 commits into base: main
5 changes: 1 addition & 4 deletions .env.example
@@ -1,6 +1,3 @@
OPENAI_API_KEY=

# Update these with your Supabase details from your project settings > API and dashboard settings
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=
PINECONE_INDEX_NAME=
COLLECTION_NAME=
52 changes: 18 additions & 34 deletions README.md
@@ -1,12 +1,8 @@
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files
# GPT-4, LangChain & Chroma - Create a ChatGPT Chatbot for Your PDF Files

Use the new GPT-4 API to build a ChatGPT chatbot for multiple large PDF files.

Tech stack used includes LangChain, Pinecone, Typescript, Openai, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vectorstore for storing embeddings and your PDF in text to later retrieve similar docs.

[Tutorial video](https://www.youtube.com/watch?v=ih9PBGVVOO4)

[Join the discord if you have questions](https://discord.gg/E4Mc77qwjm)
The tech stack includes LangChain, Chroma, TypeScript, OpenAI, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Chroma is a vector store that holds the embeddings and text of your PDFs so that similar documents can be retrieved later.

The visual guide of this repo and tutorial is in the `visual guide` folder.

@@ -16,14 +12,15 @@ Prelude: Please make sure you have already downloaded node on your system and th

## Development

1. Clone the repo or download the ZIP
1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) for your platform.

2. Clone the repo or download the ZIP

```
git clone [github https url]
```


2. Install packages
3. Install packages

First run `npm install yarn -g` to install yarn globally (if you haven't already).

@@ -32,30 +29,32 @@ Then run:
```
yarn install
```

After installation, you should now see a `node_modules` folder.

3. Set up your `.env` file
4. Set up your `.env` file

- Copy `.env.example` into `.env`
Your `.env` file should look like this:

```
OPENAI_API_KEY=

PINECONE_API_KEY=
PINECONE_ENVIRONMENT=

PINECONE_INDEX_NAME=
COLLECTION_NAME=

```

- Visit [openai](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to retrieve API keys and insert into your `.env` file.
- Visit [pinecone](https://pinecone.io/) to create and retrieve your API keys, and also retrieve your environment and index name from the dashboard.

4. In the `config` folder, replace the `PINECONE_NAME_SPACE` with a `namespace` where you'd like to store your embeddings on Pinecone when you run `npm run ingest`. This namespace will later be used for queries and retrieval.
- Choose a collection name where you'd like to store your embeddings in Chroma. This collection will later be used for queries and retrieval.
- [Chroma details](https://docs.trychroma.com/getting-started)

5. In `utils/makechain.ts`, change the `QA_PROMPT` for your own use case. Change `modelName` in `new OpenAI` to `gpt-4` if you have access to the `gpt-4` API. Verify outside this repo that you have access to the `gpt-4` API; otherwise the application will not work.

6. In a new terminal window, run Chroma in a Docker container:

```
docker run -p 8000:8000 ghcr.io/chroma-core/chroma:0.3.21
```

## Convert your PDF files to embeddings

**This repo can load multiple PDF files**
@@ -64,11 +63,9 @@ PINECONE_INDEX_NAME=

2. Run the script `npm run ingest` to 'ingest' and embed your docs. If you run into errors, see the Troubleshooting section below.

3. Check Pinecone dashboard to verify your namespace and vectors have been added.

## Run the app

Once you've verified that the embeddings and content have been successfully added to your Pinecone, you can run the app `npm run dev` to launch the local dev environment, and then type a question in the chat interface.
Once you've verified that the embeddings and content have been successfully added to the Chroma database, run the app with `npm run dev` to launch the local dev environment, and then type a question in the chat interface.

## Troubleshooting

@@ -79,21 +76,8 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
- Make sure you're running the latest Node version. Run `node -v`
- Try a different PDF or convert your PDF to text first. It's possible your PDF is corrupted, scanned, or requires OCR to convert to text.
- `console.log` the `env` variables and make sure they are exposed.
- Make sure you're using the same versions of LangChain and Pinecone as this repo.
- Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
- If you change `modelName` in `OpenAI`, make sure you have access to the api for the appropriate model.
- Make sure you have enough OpenAI credits and a valid card on your billing account.
- Check that you don't have multiple OpenAI API keys in your global environment. If you do, the value in the project's local `.env` file will be overridden by the system `env` variable.
- Try to hard code your API keys into the `process.env` variables if there are still issues.
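
The key-precedence issue above can be illustrated with a minimal sketch. It assumes the default behavior of dotenv-style loaders, where a variable that is already set in the system environment is not replaced by the value in the project's `.env` file. `resolveEnv` and the `Env` type are hypothetical names used only for illustration; they are not part of this repo.

```typescript
// Hypothetical model of how a dotenv-style loader combines variables.
// Assumption: values already set in the system environment win over .env values.
type Env = { [key: string]: string | undefined };

function resolveEnv(systemEnv: Env, fileEnv: Env): Env {
  const resolved: Env = {};
  // Start from the values read out of the project's .env file.
  for (const key in fileEnv) {
    if (Object.prototype.hasOwnProperty.call(fileEnv, key)) {
      resolved[key] = fileEnv[key];
    }
  }
  // Anything defined in the system environment takes precedence.
  for (const key in systemEnv) {
    if (
      Object.prototype.hasOwnProperty.call(systemEnv, key) &&
      systemEnv[key] !== undefined
    ) {
      resolved[key] = systemEnv[key];
    }
  }
  return resolved;
}
```

If a stale `OPENAI_API_KEY` is exported in your shell profile, the app would see that value rather than the one in `.env`, which is why unsetting the global variable is worth trying first.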

**Pinecone errors**

- Make sure your pinecone dashboard `environment` and `index` matches the one in the `pinecone.ts` and `.env` files.
- Check that you've set the vector dimensions to `1536`.
- Make sure your pinecone namespace is in lowercase.
- Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter before 7 days.
- Retry from scratch with a new Pinecone project, index, and cloned repo.

## Credit

Frontend of this repo is inspired by [langchain-chat-nextjs](https://github.com/zahidkhawaja/langchain-chat-nextjs)
7 changes: 7 additions & 0 deletions config/chroma.ts
@@ -0,0 +1,7 @@
if (!process.env.COLLECTION_NAME) {
throw new Error('Missing collection name in .env file');
}

const COLLECTION_NAME = process.env.COLLECTION_NAME ?? '';

export { COLLECTION_NAME };
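
As a possible variation on the check in `config/chroma.ts`, the sketch below folds the validation and the read into one helper, so the `?? ''` fallback is no longer needed. `requireEnv` is a hypothetical helper, not something this PR adds, and it takes the environment as a parameter only so that the example stays self-contained.

```typescript
// Hypothetical helper (not part of this PR): return a required variable
// or fail fast with a message that names the missing key.
function requireEnv(
  env: { [key: string]: string | undefined },
  name: string,
): string {
  const value = env[name];
  // Treat an unset variable and a blank one (e.g. `COLLECTION_NAME=`) the same way.
  if (value === undefined || value.trim() === '') {
    throw new Error('Missing ' + name + ' in .env file');
  }
  return value;
}
```

In the config file this could be used as `const COLLECTION_NAME = requireEnv(process.env, 'COLLECTION_NAME');`. Because the return type is `string`, no fallback value is required afterwards.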
13 changes: 0 additions & 13 deletions config/pinecone.ts

This file was deleted.

5 changes: 3 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "gpt4-langchain-pdf-chatbot",
"version": "0.1.0",
"version": "0.2.0",
"private": true,
"license": "MIT",
"author": "Mayooear<twitter:@mayowaoshin>",
@@ -18,6 +18,7 @@
"@microsoft/fetch-event-source": "^2.0.1",
"@pinecone-database/pinecone": "0.0.12",
"@radix-ui/react-accordion": "^1.1.1",
"chromadb": "1.4.1",
"clsx": "^1.2.1",
"dotenv": "^16.0.3",
"langchain": "0.0.55",
@@ -49,7 +50,7 @@
"keywords": [
"starter",
"gpt4",
"pinecone",
"chroma",
"typescript",
"nextjs",
"langchain",
13 changes: 4 additions & 9 deletions pages/api/chat.ts
@@ -1,9 +1,8 @@
import type { NextApiRequest, NextApiResponse } from 'next';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { makeChain } from '@/utils/makechain';
import { pinecone } from '@/utils/pinecone-client';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { COLLECTION_NAME } from '@/config/chroma';
import { Chroma } from 'langchain/vectorstores/chroma';

export default async function handler(
req: NextApiRequest,
@@ -26,15 +25,11 @@ export default async function handler(
const sanitizedQuestion = question.trim().replaceAll('\n', ' ');

try {
const index = pinecone.Index(PINECONE_INDEX_NAME);

/* create vectorstore*/
const vectorStore = await PineconeStore.fromExistingIndex(
const vectorStore = await Chroma.fromExistingCollection(
new OpenAIEmbeddings({}),
{
pineconeIndex: index,
textKey: 'text',
namespace: PINECONE_NAME_SPACE, //namespace comes from your config folder
collectionName: COLLECTION_NAME,
},
);
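
For reference, the question clean-up performed at the top of this hunk can be isolated as a small function. This is a sketch only: `sanitizeQuestion` is a hypothetical name, and it uses a global regular expression in place of `replaceAll`, which should give the same result for newline characters while also working on older compile targets.

```typescript
// Hypothetical extraction of the handler's clean-up step.
function sanitizeQuestion(question: string): string {
  // Trim surrounding whitespace, then replace each newline with one space.
  return question.trim().replace(/\n/g, ' ');
}
```

Note that each newline becomes its own space, so a blank line inside the question yields two consecutive spaces rather than one.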

3 changes: 2 additions & 1 deletion pages/index.tsx
@@ -262,7 +262,8 @@ export default function Home() {
</div>
<footer className="m-auto p-4">
<a href="https://twitter.com/mayowaoshin">
Powered by LangChainAI. Demo built by Mayo (Twitter: @mayowaoshin).
Powered by LangChainAI and Chroma. Demo built by Mayo (Twitter:
@mayowaoshin).
</a>
</footer>
</Layout>
23 changes: 13 additions & 10 deletions scripts/ingest-data.ts
@@ -1,10 +1,9 @@
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { CustomPDFLoader } from '@/utils/customPDFLoader';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { Chroma } from 'langchain/vectorstores/chroma';
import { COLLECTION_NAME } from '@/config/chroma';

/* Name of directory to retrieve your files from */
const filePath = 'docs';
@@ -31,14 +30,18 @@ export const run = async () => {
console.log('creating vector store...');
/*create and store the embeddings in the vectorStore*/
const embeddings = new OpenAIEmbeddings();
const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name

//embed the PDF documents
await PineconeStore.fromDocuments(docs, embeddings, {
pineconeIndex: index,
namespace: PINECONE_NAME_SPACE,
textKey: 'text',
});
const chroma = new Chroma(embeddings, { collectionName: COLLECTION_NAME });
await chroma.index?.reset();

// Ingest documents in batches of 100

for (let i = 0; i < docs.length; i += 100) {
const batch = docs.slice(i, i + 100);
await Chroma.fromDocuments(batch, embeddings, {
collectionName: COLLECTION_NAME,
});
}
} catch (error) {
console.log('error', error);
throw new Error('Failed to ingest your data');
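
The batching loop in this hunk slices `docs` into groups of 100 before each `Chroma.fromDocuments` call. A minimal sketch of the same slicing as a reusable helper follows; `toBatches` is a hypothetical name, and the batch size of 100 is simply the value used in the diff.

```typescript
// Hypothetical helper: split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error('Batch size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    // slice() clamps the end index, so the final batch may be shorter.
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With a helper like this, the ingest script could iterate `for (const batch of toBatches(docs, 100))` and await the store call on each batch, keeping the slicing arithmetic out of the ingestion logic.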
4 changes: 2 additions & 2 deletions utils/makechain.ts
@@ -1,6 +1,6 @@
import { OpenAI } from 'langchain/llms/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { ConversationalRetrievalQAChain } from 'langchain/chains';
import { Chroma } from 'langchain/vectorstores/chroma';

const CONDENSE_PROMPT = `Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

@@ -18,7 +18,7 @@ If the question is not related to the context, politely respond that you are tun
Question: {question}
Helpful answer in markdown:`;

export const makeChain = (vectorstore: PineconeStore) => {
export const makeChain = (vectorstore: Chroma) => {
const model = new OpenAI({
temperature: 0, // increase temperature to get more creative answers
modelName: 'gpt-3.5-turbo', //change this to gpt-4 if you have access
23 changes: 0 additions & 23 deletions utils/pinecone-client.ts

This file was deleted.

7 changes: 7 additions & 0 deletions yarn.lock
@@ -881,6 +881,13 @@ chokidar@^3.5.3:
optionalDependencies:
fsevents "~2.3.2"

chromadb@1.4.1:
version "1.4.1"
resolved "https://registry.yarnpkg.com/chromadb/-/chromadb-1.4.1.tgz#a81a826956051617fdd25299fc5d3132bcb9ebd6"
integrity sha512-vRcig4CJxJXs++cKMt9tHmk9YjQprxzLK9sVYD6iXfqRJBoXeoFzk/RS95Dz1J6/7aSfBwDsyx3AE2ePP9FnYA==
dependencies:
axios "^0.26.0"

client-only@0.0.1:
version "0.0.1"
resolved "https://registry.yarnpkg.com/client-only/-/client-only-0.0.1.tgz#38bba5d403c41ab150bff64a95c85013cf73bca1"