
-----

# **Chromadb Vector Database**


-----

### **1. Install Chromadb and other Dependencies**

In [18]:
!pip install chromadb openai langchain langchain_openai langchain_community tiktoken

Collecting langchain_openai
  Downloading langchain_openai-0.2.2-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_openai-0.2.2-py3-none-any.whl (49 kB)
Installing collected packages: langchain_openai
Successfully installed langchain_openai-0.2.2


In [2]:
!pip show chromadb

Name: chromadb
Version: 0.5.13
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, httpx, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, rich, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


#### **Import Libraries**

In [45]:
import os
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import OpenAI
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

### **2. Set up OpenAI API Key**

In [30]:
os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
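
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key has been exported in the environment beforehand, a small check can fail early when the key is missing or still set to the placeholder. `read_api_key` and the demo values are hypothetical names used only for illustration.

```python
import os

PLACEHOLDER = "YOUR_API_KEY"  # the placeholder string used in this notebook


def read_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or a placeholder."""
    env = os.environ if env is None else env
    key = env.get(name, "")
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key


# Demonstrate with plain dictionaries instead of the real environment.
try:
    read_api_key({"OPENAI_API_KEY": PLACEHOLDER})
    rejected = False
except RuntimeError:
    rejected = True

demo_key = read_api_key({"OPENAI_API_KEY": "demo-key-123"})  # made-up value
print(rejected, demo_key)
```

Calling `read_api_key()` with no arguments reads from `os.environ`, which is where the LangChain OpenAI classes used later look for `OPENAI_API_KEY` by default.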

### **3. Download Dataset**

In [5]:
# Use wget to download a file from a specified URL
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip

Explanation of options:
-  -q: This option enables "quiet" mode, so the command runs without displaying progress or error messages.
-  The URL points to a ZIP file hosted on Dropbox.

### **4. Unzip Dataset**

In [6]:
# Use unzip to extract files from a ZIP archive
!unzip -q new_articles.zip -d new_articles

Explanation of options:
-  -q: This option enables "quiet" mode, which suppresses output messages during the extraction process.
-  new_articles.zip: This is the name of the ZIP file to be extracted.
-  -d new_articles: This option specifies the destination directory where the extracted files will be placed. In this case, the files will be extracted into a folder named 'new_articles'.

## **5. Load data**

In [7]:
# Initialize a DirectoryLoader (imported above) to load text files from a specified directory
loader = DirectoryLoader(
    "/content/new_articles/",  # The path to the directory containing the text files
    glob="./*.txt",           # A glob pattern to match all .txt files in the directory
    loader_cls=TextLoader     # The class to use for loading the text files (TextLoader)
)

# Explanation of parameters:
# - "/content/new_articles/": This is the directory where the text files are located.
# - glob="./*.txt": This pattern indicates that only files with a .txt extension should be loaded.
# - loader_cls=TextLoader: This specifies that the TextLoader class will be used to handle the loading of the text files.
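
To see which files a pattern like `*.txt` selects, the following minimal sketch builds a throwaway directory and applies the pattern with `pathlib`. The file names are made up for illustration, and the sketch assumes non-recursive matching, as the pattern above suggests. It does not call `DirectoryLoader` itself.

```python
import tempfile
from pathlib import Path


def match_text_files(directory, pattern="*.txt"):
    """Return the sorted names of files directly inside `directory` that match `pattern`."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())


# Build a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "article-1.txt").write_text("first article")
    (root / "article-2.txt").write_text("second article")
    (root / "notes.md").write_text("not a .txt file")
    (root / "nested").mkdir()
    (root / "nested" / "article-3.txt").write_text("inside a subfolder")

    matched = match_text_files(root)

print(matched)
```

Only the two `.txt` files at the top level are matched. The Markdown file and the file in the subfolder are skipped.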

In [8]:
document = loader.load()

In [10]:
# document  # Uncomment to inspect the loaded documents

## **6. Chunking the Data**

In [12]:
# Initialize a RecursiveCharacterTextSplitter instance
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # The maximum size of each chunk of text (in characters)
    chunk_overlap=200    # The number of overlapping characters between consecutive chunks
)

# Split an input document into smaller chunks using the text splitter
chunked_document = text_splitter.split_documents(document)

#### **Explanation:**

1. **RecursiveCharacterTextSplitter Initialization**:
   - `text_splitter`: This variable holds an instance of the `RecursiveCharacterTextSplitter` class.
   - **Parameters**:
     - `chunk_size=1000`: This specifies that each chunk of text should ideally contain up to 1000 characters.
     - `chunk_overlap=200`: This indicates that each consecutive chunk will overlap by 200 characters. This overlap can help maintain context between chunks when processing or analyzing the text.

2. **Splitting the Document**:
   - `chunked_document`: This variable stores the output of the `split_documents` method, which is called on the `text_splitter` instance.
   - `document`: This is the list of `Document` objects returned by `loader.load()`. The method splits each one and returns a list of smaller `Document` chunks based on the specified size and overlap. Each chunk keeps the `source` metadata of the file it came from.
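
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler fixed-window splitter. This is only a sketch of the overlap idea. It is not how `RecursiveCharacterTextSplitter` works internally, since that class also tries to break on separators such as paragraphs and sentences. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With a window of 10 and an overlap of 4, each chunk starts with the last four characters of the previous one, which is the same role the 200-character overlap plays for the 1000-character chunks above.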

#### **Let's Check First Chunk**

In [14]:
chunked_document[0]

Document(metadata={'source': '/content/new_articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt'}, page_content='Slack has evolved from a pure communications platform to one that enables companies to link directly to enterprise applications without having to resort to dreaded task switching. Today, at the Salesforce World Tour event in NYC, the company announced the next step in its platform’s evolution where it will be putting AI at the forefront of the user experience, making it easier to get information and build workflows.\n\nIt’s important to note that these are announcements, and many of these features are not available yet.\n\nRob Seaman says that rather than slapping on an AI cover, they are working to incorporate it in a variety of ways across the platform. That started last month with a small step, a partnership with OpenAI to bring a ChatGPT app into Slack, the first piece of a much broader vision for AI on the platform. That part is in beta at

#### **Check Length of chunked_document**

In [13]:
len(chunked_document)

233

## **7. Creating Embeddings of chunked_document**

In [32]:
embedding = OpenAIEmbeddings()
embedding

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7fac9d86ef80>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7fac957f0f10>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

## **8. Creating Local Vector Database**

In [24]:
# Define a variable to specify the directory where data will be persisted (saved)
persist_directory = "db"

# Explanation:
# - persist_directory: This variable holds the name of the directory ("db")
#   where Chroma stores its database files (a SQLite file and the index
#   files), so the vector database can be reloaded in a later session.

In [33]:
# Create a vector database instance using Chroma
vectordb = Chroma.from_documents(
    documents=chunked_document,          # The chunked documents to add to the vector database
    embedding=embedding,                 # The embedding model used to convert documents into vector representations
    persist_directory=persist_directory  # The directory where the vector database will be stored persistently
)

# Explanation of parameters:
# - documents=chunked_document: This argument specifies the input documents (chunked_document)
#   that will be processed and stored in the vector database.
# - embedding=embedding: This argument refers to the embedding model that
#   will transform the documents into vectors, allowing for efficient
#   similarity searches and retrieval.
# - persist_directory=persist_directory: This indicates where the
#   vector database will be saved, making it possible to retrieve the data
#   later without needing to recreate it.

In [34]:
# Save the current state of the vector database to the specified directory.
# On Chroma 0.4 and later (this notebook uses 0.5.13), data is written to
# persist_directory automatically and LangChain marks persist() as deprecated,
# so the call is kept here only for compatibility with older versions.
vectordb.persist()

# Release the reference to the vector database instance
vectordb = None

# Explanation:
# - vectordb.persist(): On Chroma versions before 0.4, this method call writes
#   the vector database to disk so that it can be reloaded later with the
#   same state. On newer versions the data is already persisted, and the call
#   only triggers a deprecation warning.
# - vectordb = None: This line sets the variable `vectordb` to None,
#   releasing the reference to the vector database instance. This can help
#   free up memory and makes it clear that the next step reloads the
#   database from disk rather than reusing the in-memory object.


## **9. Load the Vector Database**

In [35]:
# Load the persisted Chroma vector database from disk
vectordb = Chroma(
    persist_directory=persist_directory,  # The directory that holds the previously saved vector database
    embedding_function=embedding          # The embedding model used to embed queries and any new documents
)

# Explanation of parameters:
# - persist_directory=persist_directory: This argument specifies the location
#   (set by the variable `persist_directory`) from which the stored vector
#   database is loaded and where any later changes are saved.
# - embedding_function=embedding: This argument indicates the embedding function
#   that is applied to queries (and to any documents added later). It should be
#   the same model that produced the stored vectors, otherwise similarity
#   searches compare vectors from different embedding spaces.


## **10. Retrieve the Data from Vector Database**

In [36]:
# Create a retriever instance from the vector database
retriever = vectordb.as_retriever()

# Explanation:
# - vectordb.as_retriever(): This method converts the vector database instance
#   (`vectordb`) into a retriever object. A retriever is designed to efficiently
#   fetch relevant documents or data points from the vector database based on
#   similarity search or other retrieval mechanisms.
# - retriever: This variable will hold the retriever instance, which can be
#   used later to query the vector database for information based on input
#   queries or vectors.

In [41]:
retriever.search_type

'similarity'
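
A `'similarity'` search ranks the stored chunks by how close their embeddings are to the embedding of the query and returns the closest ones. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings come from the embedding model, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs whose vectors are most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector store": (chunk text, made-up embedding)
store = [
    ("Microsoft investment", [0.9, 0.1, 0.0]),
    ("Slack AI features", [0.1, 0.9, 0.1]),
    ("Pando funding round", [0.7, 0.0, 0.6]),
]

query_vec = [1.0, 0.0, 0.1]  # made-up embedding of the query
results = top_k(query_vec, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The chunk whose vector points in nearly the same direction as the query comes first, and the unrelated chunk is left out of the top two.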

In [37]:
# Retrieve relevant documents from the vector database based on a query
docs = retriever.invoke("How much money did Microsoft raise?")

# Explanation:
# - retriever.invoke(): This method queries the vector database using the
#   provided string as input. It replaces the deprecated
#   get_relevant_documents() method.
# - "How much money did Microsoft raise?": This is the query for which relevant
#   documents are being sought. The retriever searches the database for the
#   chunks whose embeddings are most similar to the embedding of this question.
# - docs: This variable stores the list of documents returned by the
#   retriever. These documents can be used for further processing or analysis.


In [40]:
print(docs)

# Explanation:
# - print(docs): This line outputs the list of relevant documents to the console
#   or standard output.

[Document(metadata={'source': '/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}, page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”'), Document(metadata={'source': '/content/new_articles/05

In [39]:
# Print the first document retrieved from the list of relevant documents
print(docs[0])

# Explanation:
# - docs[0]: This accesses the first document in the list of relevant documents
#   stored in the variable `docs`.
#   It is assumed that `docs` contains multiple documents, and the first one
#   is selected for display.
# - print(): This function outputs the content of the first document to the
#   console or standard output, allowing the user to see the retrieved
#   information related to the query.

page_content='April 28, 2023

VC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.

April 25, 2023

Called ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”' metadata={'source': '/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}


## **11. Define Retrieval Chain**

In [55]:
llm = OpenAI()

In [56]:
# Create a question-answering chain using a retriever and a language model
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                         # Reuse the OpenAI language model created in the previous cell
    chain_type="stuff",              # "stuff" places all retrieved documents into a single prompt
    retriever=retriever,             # Pass the retriever instance that will fetch relevant documents based on queries
    return_source_documents=True     # Indicate that the source documents used to generate the answer should also be returned
)

# Explanation of parameters:
# - llm=llm: This passes the language model that will read the retrieved
#   documents and generate answers to the questions.
# - chain_type="stuff": This specifies how the retrieved documents are combined
#   for answering. With "stuff", all retrieved documents are inserted into one
#   prompt together with the question, so they must fit in the model's
#   context window.
# - retriever=retriever: This argument connects the QA chain to the retriever
#   that will supply the relevant documents needed to answer the user's queries.
# - return_source_documents=True: This option ensures that the documents used
#   to construct the answer are also returned, which can be useful for reference
#   or verification.
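
The "stuff" strategy can be pictured as plain string assembly: every retrieved chunk is placed into one prompt, followed by the question. The sketch below shows that idea only. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not the template LangChain uses.

```python
def build_stuff_prompt(question, documents):
    """Place every retrieved document into a single prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened stand-ins for retrieved chunks.
retrieved = [
    "The size of Microsoft's investment is believed to be around $10 billion.",
    "Altogether the VCs have put in just over $300 million.",
]

prompt = build_stuff_prompt("How much money did Microsoft raise?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach is simple and needs a single model call, but the combined text has to fit within the model's context window.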

In [57]:
# Define a function to process the response from the language model
def process_llm_response(llm_response):
    # Print the main result or answer from the language model's response
    print(llm_response['result'])

    # Print a header for the sources section
    print('\n\nSources:')

    # Iterate over each source document in the response
    for source in llm_response["source_documents"]:
        # Print the source information from the metadata of each document
        print(source.metadata['source'])

# Explanation:
# - llm_response: This parameter is expected to be a dictionary containing
#   the response from the language model, including the result and source documents.
# - print(llm_response['result']): This line retrieves and prints the main
#   result or answer generated by the language model.
# - print('\n\nSources:'): This line outputs a header to indicate the beginning
#   of the sources section, providing clarity to the user.
# - for source in llm_response["source_documents"]: This loop iterates over
#   the list of source documents contained in the response.
# - print(source.metadata['source']): This line retrieves and prints the
#   source information from the metadata of each document, which can help
#   users trace back the information or verify its origin.
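
When several retrieved chunks come from the same file, the loop above prints that file path once per chunk. A minimal sketch of an order-preserving de-duplication step follows. `unique_sources` is a hypothetical helper, and the `SimpleNamespace` objects with made-up paths merely stand in for LangChain `Document` instances.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each document's 'source' path once, in order of first appearance."""
    seen = set()
    ordered = []
    for doc in source_documents:
        source = doc.metadata["source"]
        if source not in seen:
            seen.add(source)
            ordered.append(source)
    return ordered


# Stand-ins for retrieved documents; only the metadata attribute is needed here.
fake_docs = [
    SimpleNamespace(metadata={"source": "articles/a.txt"}),
    SimpleNamespace(metadata={"source": "articles/b.txt"}),
    SimpleNamespace(metadata={"source": "articles/a.txt"}),
]

sources = unique_sources(fake_docs)
print(sources)
```

Inside `process_llm_response`, the same helper could be applied to `llm_response["source_documents"]` before printing.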

In [58]:
# Define the query to be asked to the question-answering chain
query = "How much money did Microsoft raise?"

# Execute the question-answering chain with the specified query
llm_response = qa_chain.invoke({"query": query})

# Process and display the response from the language model
process_llm_response(llm_response)

# Explanation:
# - query: This variable holds the string representing the question we want
#   to ask the language model through the QA chain.
# - llm_response = qa_chain.invoke({"query": query}): This line runs the
#   `qa_chain` on the query. The chain retrieves relevant documents and
#   generates an answer. The result is a dictionary stored in `llm_response`.
#   Calling the chain directly as qa_chain(query) is deprecated.
# - process_llm_response(llm_response): This function call takes the response
#   from the language model and displays the answer and the sources of
#   information used to generate that answer.


Microsoft raised $10 billion.


Sources:
/content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
/content/new_articles/05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt
/content/new_articles/05-07-3one4-capital-driven-by-contrarian-bets-raises-200-million-new-fund.txt
/content/new_articles/05-07-fintech-space-continues-to-be-competitive-and-drama-filled.txt


In [59]:
query = "What is the news about Pando?"
llm_response = qa_chain.invoke({"query": query})
process_llm_response(llm_response)

 Pando, a startup developing fulfillment management technologies, has raised $30 million in funding and plans to use the capital to expand its global sales, marketing, and delivery capabilities. The company also plans to explore strategic partnerships and acquisitions. Pando's platform offers customizable tools and apps, as well as no-code capabilities, and uses algorithms and machine learning for supply chain predictions. The company faces competition from other vendors such as Altana and Everstream.


Sources:
/content/new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
/content/new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
/content/new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
/content/new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


## **12. Back Up, Delete, and Restore the Database**

In [60]:
# Create a ZIP archive of the specified directory
!zip -r db.zip ./db

# Explanation of options:
# - zip: This command is used to create a ZIP archive.
# - -r: This option stands for "recursive," meaning that all files and
#       subdirectories within the specified directory will be included in
#       the ZIP archive.
# - db.zip: This is the name of the resulting ZIP file that will be created.
#           It will contain the contents of the specified directory.
# - ./db: This specifies the directory to be compressed into the ZIP file.
#         In this case, it is the `db` directory located in the current working directory.

  adding: db/ (stored 0%)
  adding: db/chroma.sqlite3 (deflated 45%)
  adding: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/ (stored 0%)
  adding: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/data_level0.bin (deflated 100%)
  adding: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/header.bin (deflated 61%)
  adding: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/length.bin (deflated 88%)
  adding: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/link_lists.bin (stored 0%)


In [61]:
# Clean up resources by deleting the vector database collection
vectordb.delete_collection()  # Remove the entire collection from the vector database

# Write the deletion to disk (only needed on Chroma versions before 0.4;
# newer versions save changes automatically)
vectordb.persist()

# Delete the associated directory from the file system
!rm -rf db/  # Remove the 'db' directory and all its contents forcefully and recursively

# Explanation:
# - vectordb.delete_collection(): This method call deletes the collection
#   within the vector database, freeing up resources and clearing stored data.
# - vectordb.persist(): This is called after the deletion so that the change
#   is written to disk. On Chroma 0.4 and later, changes are saved
#   automatically and the call is deprecated.
# - !rm -rf db/: This command uses the `rm` utility to remove the `db`
#   directory. The options `-r` and `-f` stand for recursive and force,
#   respectively, meaning that all files and subdirectories in `db` will be
#   deleted without prompting for confirmation. The backup in db.zip is
#   not affected.

In [62]:
# Extract the contents of the ZIP archive
!unzip db.zip

# Explanation:
# - !unzip: This command is used to extract files from a ZIP archive.
# - db.zip: This specifies the name of the ZIP file to be extracted.
#   The contents of this file will be unpacked into the current working directory.

Archive:  db.zip
   creating: db/
  inflating: db/chroma.sqlite3       
   creating: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/
  inflating: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/data_level0.bin  
  inflating: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/header.bin  
  inflating: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/length.bin  
 extracting: db/3f24a27f-909d-4e42-8fe3-a43da163cc17/link_lists.bin  
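
The zip, delete, and unzip round trip above can also be done from Python with the standard library's `shutil` module. The sketch below runs against a temporary directory that contains a placeholder file instead of a real Chroma database, so the file name and its contents are assumptions for illustration only.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    db_dir = work / "db"
    db_dir.mkdir()
    (db_dir / "placeholder.bin").write_bytes(b"vector data")  # stand-in for database files

    # Like `zip -r db.zip ./db`: archive the folder itself so that paths start with "db/".
    archive_path = shutil.make_archive(
        str(work / "db"), "zip", root_dir=str(work), base_dir="db"
    )

    # Like `rm -rf db/`: remove the directory and everything inside it.
    shutil.rmtree(db_dir)
    deleted = not db_dir.exists()

    # Like `unzip db.zip`: extract the archive back into the working directory.
    shutil.unpack_archive(archive_path, extract_dir=str(work))
    restored = (db_dir / "placeholder.bin").read_bytes()

print(deleted, restored)
```

After the archive is unpacked, the `db` folder is back in place with the same contents, which is what allows the persisted vector database to be loaded again with `Chroma(persist_directory=...)`.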


-------------------
