## Introduction

Large Language Models (LLMs) acquire a strong grasp of language during training. This allows them to generate human-like text and to build powerful representations of text data. We have previously covered using LangChain with LLMs to write content in hands-on projects.

This post focuses on using language models to create vector representations (embeddings) of a corpus. These representations will power a chat application that can answer questions about any text by finding the data points closest to the query. The project focuses on answering questions from text files in GitHub repositories, such as .md and .txt files. We start by collecting data from a GitHub repository and converting it into embeddings. These embeddings are saved to Activeloop's Deep Lake vector database for quick and easy access. The Deep Lake retriever then finds related files based on the user's query and provides them as context to the model. Finally, the model uses the provided information to answer the question as well as it can.

## What is Deep Lake?

Deep Lake is a vector database that provides multimodal storage for all data types (including but not limited to PDF, audio, and video files) alongside their vectorized representations. The service eliminates the need to build data infrastructure when handling high-dimensional tensors. It also offers a wide range of features such as visualization, parallel computation, data versioning, integration with major AI frameworks, and, most importantly, embedding search. Supported vector operations like cosine_similarity allow us to find related points in an embedding space.
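
To make the idea of "finding related points" concrete, here is a minimal sketch of cosine similarity and a nearest-neighbour lookup in plain Python. The function names and the toy 3-dimensional vectors are hypothetical and chosen for illustration only. Real embeddings have many more dimensions, and Deep Lake performs this search internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)

def nearest(query, vectors):
    """Index of the stored vector most similar to the query."""
    return max(range(len(vectors)),
               key=lambda i: cosine_similarity(query, vectors[i]))

# Toy "embeddings" (hypothetical values, not produced by a real model)
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
best = nearest(query, stored)  # the first vector points in almost the same direction
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing text embeddings.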


The rest of the article is based on code from ["Chat with Github Repo"](https://github.com/peterw/Chat-with-Github-Repo/) and is organized as follows:

1) Processing the Files 

2) Saving the Embedding 

3) Retrieving from Database 

4) Creating an Interface

## Processing the Repository Files

To access files from the target repository, the script clones the repository to your computer, placing the files in a folder named "repos". Once the files are downloaded, we only need to walk through the directory to create a list of files. It is possible to filter out specific extensions or environment-related items.

In [None]:
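import os
from langchain.document_loaders import TextLoader

root_dir = "./path/to/cloned/repository"
docs = []
file_extensions = []  # e.g. [".md", ".txt"]; an empty list keeps every file

# Walk the cloned repository and load every file that passes the filter
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        file_path = os.path.join(dirpath, file)

        # Skip files whose extension is not in the allow-list
        if file_extensions and os.path.splitext(file)[1] not in file_extensions:
            continue

        loader = TextLoader(file_path, encoding="utf-8")
        docs.extend(loader.load_and_split())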

The sample code above loads all the files in the repository into a list of documents. It is possible to filter by extension, for example file_extensions=['.md', '.txt'] to focus only on Markdown and text files. The original implementation has more filters and a fail-safe approach; please refer to the [complete code](https://github.com/peterw/Chat-with-Github-Repo/blob/main/src/utils/process.py#L20).
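
As an illustration of what additional filters might look like, the sketch below combines the extension check with a list of directories to skip. The `should_include` helper and the `IGNORED_DIRS` defaults are hypothetical and are not taken from the original repository.

```python
import os

# Hypothetical defaults for this sketch, not the original project's list
IGNORED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}

def should_include(file_path, extensions=None, ignored_dirs=IGNORED_DIRS):
    """Return True if the file passes the directory and extension filters."""
    parts = os.path.normpath(file_path).split(os.sep)
    # Skip anything located inside an ignored directory
    if any(part in ignored_dirs for part in parts[:-1]):
        return False
    # An empty or missing extension list means "accept every extension"
    if extensions and os.path.splitext(file_path)[1].lower() not in extensions:
        return False
    return True
```

Inside the `os.walk` loop, a check such as `if not should_include(file_path, file_extensions): continue` could replace the extension test.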

Now that the documents have been loaded, the split_documents method of LangChain's CharacterTextSplitter class splits their contents into chunks of up to 1000 characters.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
splitted_text = text_splitter.split_documents(docs)

The splitted_text variable holds the text chunks, which are ready to be converted to embeddings.
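
To show what chunk_size and chunk_overlap mean, here is a simplified, self-contained sketch of fixed-size chunking. It is not LangChain's algorithm: as far as we know, CharacterTextSplitter first splits on a separator and then merges pieces up to the size limit, while this hypothetical `chunk_text` helper simply slides a window over the characters.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

lengths = [len(c) for c in chunk_text("a" * 2500)]
```

With no overlap, a 2500-character text yields chunks of 1000, 1000, and 500 characters. A non-zero overlap repeats the tail of each chunk at the start of the next one, which can help when an answer spans a chunk boundary.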

## Saving the Embeddings

Let's create the database before converting the text to embeddings. This is where the integration between LangChain and Deep Lake comes in handy! We initialize the database in the cloud using the hub://... path format and LangChain's OpenAIEmbeddings() as the embedding function. The Deep Lake library will iterate through the content and generate the embeddings automatically.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have
# your OpenAI key saved in the "OPENAI_API_KEY" environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_chat_with_gh"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
db.add_documents(splitted_text)

## Retrieving from Database

The final step is to code the process that answers user questions based on information from the database. Again, the integration of LangChain and Deep Lake greatly simplifies the process. We need 1) a retriever object that accesses the Deep Lake database, created with the .as_retriever() method, and 2) a conversational model like ChatGPT, created with the ChatOpenAI() class.

Finally, LangChain's RetrievalQA chain ties everything together! It uses the user's input as the prompt while including the results from the database as context, so the ChatGPT model can find the correct answer in the provided context. Note that the retriever is configured to collect the instances most closely related to the user's query using cosine similarity.
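
Conceptually, combining the question with the retrieved context amounts to building one prompt string. The sketch below illustrates the idea only: the `build_prompt` helper and the template wording are hypothetical and are not the prompt LangChain actually uses.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the repository's name?",
    ["# Chat-with-Github-Repo", "This project contains two scripts."],
)
```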

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create a retriever from the DeepLake instance
retriever = db.as_retriever()

# Set the search parameters for the retriever
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["fetch_k"] = 100
retriever.search_kwargs["maximal_marginal_relevance"] = True
retriever.search_kwargs["k"] = 10

# Create a ChatOpenAI model instance
model = ChatOpenAI()

# Create a RetrievalQA instance from the model and retriever
qa = RetrievalQA.from_llm(model, retriever=retriever)

# Return the result of the query
qa.run("What is the repository's name?")
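
The fetch_k, k, and maximal_marginal_relevance settings used with the retriever can be understood through the following sketch of maximal marginal relevance (MMR). It shows one common formulation in plain Python. The `mmr` function, its `lambda_mult` parameter, and the toy vectors are assumptions made for illustration, and the Deep Lake and LangChain implementations may differ in detail.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query, candidates, k=10, fetch_k=100, lambda_mult=0.5):
    """Return indices of up to k candidates, balancing relevance and diversity."""
    # Step 1: keep the fetch_k candidates most similar to the query
    relevance = [cosine(query, c) for c in candidates]
    pool = sorted(range(len(candidates)),
                  key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected = []
    # Step 2: greedily pick the candidate with the best trade-off between
    # similarity to the query and similarity to what was already picked
    while pool and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

def unit(deg):
    """2-D unit vector at the given angle (toy stand-in for an embedding)."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

# Candidates 0 and 1 are near-duplicates; candidate 2 is less similar but different
candidates = [unit(10), unit(12), unit(-30)]
diverse = mmr([1.0, 0.0], candidates, k=2)
plain = mmr([1.0, 0.0], candidates, k=2, lambda_mult=1.0)
```

In this toy example, plain similarity ranking (lambda_mult=1.0) returns the two near-duplicates, while MMR replaces the second one with the more distinct candidate. That is the motivation for fetching a larger pool (fetch_k) and returning only k diverse results.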

## Create an Interface

Creating a user interface (UI) that makes the bot accessible through a web browser is an optional but important step. It lets users interact with the application easily, even without any programming expertise. This repository uses Streamlit, a quick and easy way to build and deploy an application for free, which I have also used in [earlier projects](https://livingdatalab.com/projects.html). It provides a wide range of utilities that remove the need to build a separate backend or frontend for a web application.

We need to install the library and its chat component using pip. We recommend installing the latest version of each library. The provided code was tested with streamlit and streamlit-chat versions 2023.6.21 and 20230314, respectively.

In [None]:
!pip install streamlit streamlit_chat

The [API documentation page](https://seldonia.notion.site/Chat-with-a-GitHub-Repository-26669f0e8d634ecf9ece0fb7d2cf2b78) provides a complete list of available widgets that can be used in your application. We need a simple UI that accepts user input and displays the conversation in a chat-like interface. Fortunately, Streamlit offers both.

In [None]:
import streamlit as st
from streamlit_chat import message

# Set the title for the Streamlit app
st.title("Chat with GitHub Repository")

# Initialize the session state for placeholder messages.
if "generated" not in st.session_state:
	st.session_state["generated"] = ["i am ready to help you ser"]

if "past" not in st.session_state:
	st.session_state["past"] = ["hello"]

# A text input field to receive user queries
user_input = st.text_input("", key="input")

# Search the database and add the responses to state
if user_input:
	output = qa.run(user_input)
	st.session_state.past.append(user_input)
	st.session_state.generated.append(output)

# Create the conversational UI using the previous states
if st.session_state["generated"]:
	for i in range(len(st.session_state["generated"])):
		message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
		message(st.session_state["generated"][i], key=str(i))

The code above is straightforward. We call st.text_input() to create a text input for user queries. The query is passed to the previously declared RetrievalQA object, and the results are displayed using the message component. Store the code in a Python file (for example, chat.py) and run the following command to see the interface locally.

**streamlit run ./chat.py**

## Putting Everything Together

As mentioned before, the code for this lesson is available in ["Chat with GitHub Repo"](https://github.com/peterw/Chat-with-Github-Repo). You can get it up and running in three steps. First, clone the repository and install the required libraries using pip.

In [None]:
!git clone https://github.com/peterw/Chat-with-Github-Repo.git
# Use %cd rather than !cd so the directory change persists across notebook cells
%cd Chat-with-Github-Repo

!pip install -r requirements.txt

Second, rename the environment file from .env.example to .env and fill in the API keys. You must have accounts with both OpenAI and Activeloop.

In [None]:
!cp .env.example .env

# OPENAI_API_KEY=your_openai_api_key
# ACTIVELOOP_TOKEN=your_activeloop_api_token
# ACTIVELOOP_USERNAME=your_activeloop_username
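
Before running the scripts, it can help to verify that the keys are actually visible to Python. The following is a minimal sketch with a hypothetical `missing_env_vars` helper. It assumes the variables have already been loaded into the process environment (for example, exported in the shell or loaded by a tool such as python-dotenv), since a .env file is not read automatically.

```python
import os

# Variable names taken from the .env template above
REQUIRED_VARS = ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN", "ACTIVELOOP_USERNAME")

def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```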

Lastly, use the process command to read the contents of any repository and store them in Deep Lake by passing the repository URL to the --repo-url argument.

In [None]:
!python src/main.py process --repo-url https://github.com/username/repo_name

**Be aware of the costs associated with generating embeddings using the OpenAI API. It is better to start with a smaller repository, which needs fewer resources and is processed faster.**
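
For a rough sense of scale before processing a repository, a back-of-the-envelope estimate such as the one below may help. Everything here is an assumption: the `estimate_embedding_cost` helper is hypothetical, the ratio of about four characters per token is only a common rule of thumb for English text, and the price must be taken from OpenAI's current pricing page (the value in the example is a placeholder).

```python
def estimate_embedding_cost(texts, price_per_1k_tokens, chars_per_token=4.0):
    """Rough cost estimate; the ratio and the price are caller-supplied assumptions."""
    total_chars = sum(len(t) for t in texts)
    estimated_tokens = total_chars / chars_per_token
    return estimated_tokens / 1000 * price_per_1k_tokens

# 200 chunks of 1000 characters each, at a placeholder price of 0.0001 per 1K tokens
cost = estimate_embedding_cost(["x" * 1000] * 200, price_per_1k_tokens=0.0001)
```

Under these assumptions, 200,000 characters correspond to roughly 50,000 tokens. The actual token count depends on the tokenizer and the content.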

Then run the chat interface using the chat command followed by the dataset name. It is the same as repo_name in the sample above. You can also find the dataset name by logging in to the Deep Lake dashboard.

In [None]:
!python src/main.py chat --activeloop-dataset-name <dataset_name>

The application will be accessible in a browser at http://localhost:8501 or the next available port (as demonstrated in the image below). Please read the [complete instructions](https://github.com/peterw/Chat-with-Github-Repo/tree/main#setup) for more information, such as filtering a repository's content by file extension.

## Conclusion

We have broken down the important sections of the "Chat with GitHub Repo" repository to learn how to build a chatbot with a user interface. You learned how to use the Deep Lake database to store high-dimensional embeddings and query them using similarity functions such as cosine similarity. Its integration with the LangChain library provides easy-to-use APIs for storing and retrieving data. Finally, we created the user interface with the Streamlit library to make the bot accessible to everyone.

## Acknowledgements

I'd like to express my thanks to the wonderful [LangChain & Vector Databases in Production Course](https://learn.activeloop.ai/courses/langchain) by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.