<a href="https://colab.research.google.com/github/mclausaudio/ai-ml-experiments-notebooks/blob/main/Talk_To_Repo_by_Michael_Claus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks you through using LangChain and Pinecone DB to "talk" to a GitHub repo.

Read more about this project on my blog:

https://medium.com/@michaelclaus/my-attempt-to-talk-to-a-github-repo-using-python-langchain-pinecone-db-and-openai-73b6df90d0e9

## Install dependencies
Time to `pip install` the packages we need.

In [None]:
!pip install langchain
!pip install pinecone-client
!pip install openai
!pip install tiktoken

## Add your OpenAI key
Set your OpenAI API key in the cell below.

In [None]:
import os

# Add your OpenAI key here
os.environ["OPENAI_API_KEY"] = ""
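
If you would rather not paste the key into the notebook itself, here is a minimal sketch of an alternative. It assumes you are happy to type the key at a prompt; `ensure_api_key` and its `prompt_fn` parameter are hypothetical names I made up for this example, not part of any library.

```python
import os
from getpass import getpass

def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output.
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} is required")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` once before the later cells should leave `os.environ["OPENAI_API_KEY"]` populated the same way the cell above does.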

## Clone the repo you want to talk to
Clone your repo into this project directory.  Change the URL below to the repo you would like to talk to.  The repo will be downloaded into a directory called `codebase`.

In [None]:
!git clone https://github.com/mclausaudio/notion-personal-assistant-ai codebase

## Convert repo to text, prepare it to be vectorized
The cell below loops over the repo you just cloned, converts every file to a `.txt` file, and places the results in a new directory called `converted_codebase`.  `converted_codebase` is essentially a replica of `codebase`, except that every file is a `.txt` file.  The original extension is preserved and `.txt` is appended to it.  For example, `index.js` becomes `index.js.txt`.

In [None]:
def convert_files_to_txt(src_dir, dst_dir):
    # If the destination directory does not exist, create it.
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)

    for root, dirs, files in os.walk(src_dir):
        for file in files:
            file_path = os.path.join(root, file)
            rel_path = os.path.relpath(file_path, src_dir)  # get the relative path to preserve directory structure

            # Create the same directory structure in the new directory
            new_root = os.path.join(dst_dir, os.path.dirname(rel_path))
            os.makedirs(new_root, exist_ok=True)

            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    data = f.read()
            except UnicodeDecodeError:
                # Fall back to latin-1.  It maps every byte to a character, so this
                # read cannot raise UnicodeDecodeError (binary files come through as noise).
                with open(file_path, 'r', encoding='latin-1') as f:
                    data = f.read()

            # Create a new file path with .txt extension
            new_file_path = os.path.join(new_root, file + '.txt')
            with open(new_file_path, 'w', encoding='utf-8') as f:
                f.write(data)

# Call the function with the source and destination directory paths
convert_files_to_txt('/content/codebase', '/content/converted_codebase')
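
One thing worth knowing about the cell above: `os.walk` also visits the `.git` directory that `git clone` creates, so Git's internal files get copied into `converted_codebase` too. Below is a minimal sketch of one way to skip them. `SKIP_DIRS` and `iter_source_files` are hypothetical names for this example, and the set of directories to skip is my assumption; adjust it for your repo.

```python
import os

# Hypothetical default: directory names to skip while walking the clone.
SKIP_DIRS = {".git"}

def iter_source_files(src_dir, skip_dirs=SKIP_DIRS):
    """Yield file paths (relative to src_dir) outside the skipped directories."""
    for root, dirs, files in os.walk(src_dir):
        # Editing `dirs` in place stops os.walk from descending into those folders.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in sorted(files):
            yield os.path.relpath(os.path.join(root, name), src_dir)
```

The same in-place `dirs[:] = ...` line could be dropped into the `os.walk` loop of `convert_files_to_txt` if you want the conversion itself to ignore those folders.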


## Pinecone DB time!
Next we loop over `converted_codebase`, load each file, split its contents into chunks, create a vector representation (an embedding) of each chunk, and write the results into Pinecone DB.

Check your Pinecone DB index first.  If it already has vectors in it (because you've already run this cell), you don't need to run it again.  If you want to try a new `chunk_size` or `chunk_overlap` value, you can delete and recreate your index.  You could also keep adding more vectors to the same index, although I suspect that would make retrieval worse (I could be wrong).
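
The "don't run it twice" check can be expressed as a tiny guard. This is only a sketch: `should_ingest` and its `force` flag are hypothetical names, and the commented usage assumes the pinecone-client version this notebook installs exposes the count as `total_vector_count` on the index stats.

```python
def should_ingest(total_vector_count, force=False):
    """Run ingestion only when the index is empty, unless explicitly forced."""
    return force or total_vector_count == 0

# Assumed usage, after pinecone.init(...) in the cell below:
#   stats = pinecone.Index(PINECONE_INDEX_NAME).describe_index_stats()
#   if should_ingest(stats.total_vector_count):
#       ingest_files('/content/converted_codebase')
```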

### Important Notes
- When you configure your Pinecone DB index, be sure to set `Dimensions` to `1536`.  This matches the embedding size listed in the [OpenAI docs](https://platform.openai.com/docs/guides/embeddings/second-generation-models).
- Experiment with `chunk_size` and `chunk_overlap`.  Smaller chunks give more granular results, at the cost of more documents and a longer write to the DB.
- Useful docs: https://python.langchain.com/docs/modules/data_connection/
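
To get a feel for how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as newlines first); `chunk_text` is a hypothetical helper that only illustrates the size/overlap trade-off.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With this sketch, 1,000 characters at `chunk_size=100` produce 10 chunks with no overlap and 19 chunks with `chunk_overlap=50`, which is why more overlap means more vectors to embed and write.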

In [None]:
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
import pinecone

PINECONE_API_KEY = ""
PINECONE_ENVIRONMENT = ""
PINECONE_INDEX_NAME = ""

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

def ingest_files(src_dir):
    loader = DirectoryLoader(src_dir, show_progress=True, loader_cls=TextLoader)
    repo_files = loader.load()
    print(f"Number of files loaded: {len(repo_files)}")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
    documents = text_splitter.split_documents(documents=repo_files)
    print(f"Number of documents : {len(documents)}")
    for doc in documents:
        # Strip only the trailing ".txt" added during conversion, so a file such as
        # `requirements.txt` keeps its real name instead of losing every ".txt".
        source = doc.metadata["source"]
        if source.endswith(".txt"):
            source = source[:-len(".txt")]
        doc.metadata["source"] = source

    print(f"Going to insert {len(documents)} documents into Pinecone")
    embeddings = OpenAIEmbeddings()
    Pinecone.from_documents(documents, embeddings, index_name=PINECONE_INDEX_NAME)
    print("Done inserting into Pinecone")




ingest_files('/content/converted_codebase')



## Embed Pinecone DB Index with LangChain's RetrievalQA Chain
We set up the LLM so we can talk to it.  I tried both the `RetrievalQA` chain and the `RetrievalQAWithSourcesChain`.  I liked the latter because it cites its sources.

In [None]:
# from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI

# Connect to the index that was populated in the previous section
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
docsearch = Pinecone.from_existing_index(PINECONE_INDEX_NAME, embeddings)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever(), input_key="question")
chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
template = None
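
The `chain_type="stuff"` argument above means every retrieved chunk is placed ("stuffed") into a single prompt along with the question. Here is a rough sketch of that idea. It is not LangChain's actual prompt: `build_stuff_prompt`, the dict keys, and the wording are all made up for illustration.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk, tagged with its source, ahead of the question."""
    context = "\n\n".join(
        f"Content: {chunk['text']}\nSource: {chunk['source']}" for chunk in chunks
    )
    return f"{context}\n\nQuestion: {question}\nAnswer (cite sources):"
```

Because everything goes into one prompt, the total size of the retrieved chunks has to fit in the model's context window, which is one more reason `chunk_size` matters.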

## Set up a prompt template (optional)
We will set up a prompt template to wrap the user's input.

### Important note:
I found that this actually made the LLM's responses WORSE!  I would recommend that you don't run the cell below.

In [None]:
from langchain import PromptTemplate


template = """
You are a senior engineer who knows everything about the embedded repo. Your job is to act as a git repository assistant.
You will answer the user's questions in as much detail as possible using only information from the embedded GitHub repo.
You are a programmer, so you are able to provide insight into how the repo works and could be updated, if the user asks those types of questions.
You must provide as much detail as possible when answering the question and you must only reference files and information contained within the repo.

Please answer the user's question: {question}
"""
prompt = PromptTemplate.from_template(template)

## Talk to the repo!
Ask it some questions by updating `query`.

Note: If you switched the chain cell above to use `RetrievalQA`, you will need to update the cell below accordingly, as the two chains use different keys for their inputs and outputs.

In [None]:
# Update the value of `query` with your question!
query = "Which frontend file handles the user's text input?  And how can I make the text area's background blue?"

if template is not None and prompt:
  optimized_prompt = prompt.format(question=query)
  result = chain({"question": optimized_prompt}, return_only_outputs=True)
else:
  result = chain({"question": query}, return_only_outputs=True)

print(result)
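
`print(result)` dumps the raw dict. A small helper can pull out the answer whichever chain you used. This is a sketch based on my understanding that `RetrievalQAWithSourcesChain` returns `answer` and `sources` keys while `RetrievalQA` returns `result`; `format_result` is a hypothetical name, and splitting the sources on commas is an assumption about how the model formats them.

```python
def format_result(result):
    """Return (answer, sources) from either chain's output dict."""
    # RetrievalQAWithSourcesChain uses "answer"; RetrievalQA uses "result".
    answer = result.get("answer", result.get("result", ""))
    raw_sources = result.get("sources", "")
    sources = [s.strip() for s in raw_sources.split(",") if s.strip()]
    return answer.strip(), sources
```

For example, `answer, sources = format_result(result)` followed by `print(answer)` and `print(sources)` gives a more readable view than the raw dict.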


## Helper function (you can ignore the cell below)
Below is just a little helper code cell.  In case you want to delete some directories, comment lines in or out as needed and run it.

In [None]:
# !rm -r /content/codebase
!rm -r /content/converted_codebase/
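
If you prefer plain Python over shell commands, here is a minimal equivalent that does not complain when the directory is already gone. `remove_dir` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

def remove_dir(path):
    """Delete a directory tree if it exists; return True if something was removed."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

For example, `remove_dir('/content/converted_codebase')` mirrors the `rm -r` line above.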
