## Using GitLoader (Langchain) to load git repos.

In [1]:
from langchain.document_loaders import GitLoader
from langchain.schema import Document
import json
from typing import Iterable

## TODO
- how to preprocess code files?


## Tips from [here](https://towardsdatascience.com/10-ways-to-improve-the-performance-of-retrieval-augmented-generation-systems-5fa2cee7cd5c)
- Clean your data. Keep content only and strip extraneous tags.
- Explore different index types. For example, use keyword-based search for some items and a LlamaIndex-based index for others.
- Experiment with the chunking approach. Write each strategy as its own function, then combine them in a main file and loop over them for experiments.
- Play around with your base prompt. This can also be run as an experiment. Refer to the prompt guide [here](https://arxiv.org/abs/2302.11382)
- Do metadata filtering. GitLoader currently doesn't record dates, but there should be a way to obtain the last-modified time.
- Use query routing. It's useful to have more than one index, e.g. one that handles summaries and one that answers flow questions. Alternatively, build two separate chatbots.
- Look into reranking. It is one solution to the discrepancy between similarity and relevance.
- Look into transforming the user's query further. [HyDE](https://github.com/texttron/hyde) takes a query, generates a hypothetical response, and then uses both for embedding lookup. This can improve performance. [*Note that they use GPT-3, but we can simply prompt the LLM one more time.*]
- Fine-tune your **embedding model**. The embedding model might not be well suited to our target domain, so fine-tuning it first can improve retrieval.
- Use LLM evaluation frameworks: Langchain-Evaluate, RAGAS
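
The HyDE tip above can be sketched without committing to a particular LLM or embedding model. This is a minimal sketch, not the HyDE reference implementation: `generate` and `embed` are hypothetical callables you would supply (one extra LLM prompt and your embedding model), and averaging the two vectors is an assumed, simple way to "use both" for lookup.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],   # hypothetical: wraps one extra LLM call
    embed: Callable[[str], Vector],   # hypothetical: wraps the embedding model
) -> Vector:
    """Average the embeddings of the query and a generated hypothetical answer."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    vectors = [embed(query), embed(hypothetical)]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, doc_vecs: Sequence[Vector], k: int = 3) -> List[int]:
    """Return the indices of the k document vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the vector store would do the nearest-neighbour search; `top_k` is only here so the sketch runs on its own.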

In [2]:
# Loading every file in the repo pulls in too much.
# The plan is to focus on Tcl and C++ files; for now only Markdown is loaded.

# cpp_exts = ['.cpp', '.cc', '.c++',
#             '.hpp', '.hh', '.h++', '.h']
# other_exts = ['.tcl', '.i', '.py', '.md']
# exts = cpp_exts + other_exts
exts = ['.md']

loader = GitLoader(
    clone_url="https://github.com/The-OpenROAD-Project/OpenROAD",
    repo_path="./data",
    branch="master",
    file_filter = lambda file_path: any(file_path.endswith(ext) for ext in exts)
)

data = loader.load()
data[0]



In [46]:
listA = ["apple", "banana", "orange", "grape"]
listB = ["app", "ran", "ra", "ora"]

# Remove elements from listA that contain strings from any element in listB
filtered_listA = [x for x in listA if not any(y in x for y in listB)]

print(filtered_listA)


['banana']


In [70]:
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter, MarkdownTextSplitter
import re

def remove_text(text):
    # remove code blocks ```
    text = re.sub(r"```.*?```", '', text, flags=re.DOTALL)

    # remove urls
    text = re.sub(r"https?://\S+|www\.\S+", '', text)
    return text
    
# MD splits
MD_header_split = True
if MD_header_split: 
    headers_to_split_on = [("#"*i, f"Header {i}") for i in range(1, 7)]  # Markdown defines header levels 1-6
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
else:
    markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=15)

# Recursive splitting
chunk_size = 250
chunk_overlap = 20
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Use MD Headers metadata to remove useless info. 
remove_headers_1 = ["Contributor Covenant Code of Conduct"]
remove_headers_2 = ["Authors", "Commands"]

final = []
for source_doc in data:
    content, metadata = source_doc.page_content, source_doc.metadata
    content = remove_text(content)
    if MD_header_split: 
        sections = markdown_splitter.split_text(content)
        for section in sections:
            # Skip sections whose headers mark them as boilerplate.
            if any(s in section.metadata.get('Header 1', '') for s in remove_headers_1):
                continue
            if any(s in section.metadata.get('Header 2', '') for s in remove_headers_2):
                continue
            # Carry the file-level metadata (source, file_path, ...) onto each section.
            section.metadata.update(metadata)
            final.append(section)
    else: 
        # Pass the file-level metadata along so these chunks keep their source too.
        docs = markdown_splitter.create_documents([content], metadatas=[metadata])
        final.extend(docs)
        
if MD_header_split:
    final = text_splitter.split_documents(final)

len(final)

736

In [71]:
for i in final:
    print(i) 
    print()


page_content='[![Build Status](\n[![Coverity Scan Status](\n[![Documentation Status](\n[![CII Best Practices](' metadata={'Header 1': 'OpenROAD', 'source': 'README.md', 'file_path': 'README.md', 'file_name': 'README.md', 'file_type': '.md'}

page_content='OpenROAD is the leading open-source, foundational application for\nsemiconductor digital design. The OpenROAD flow delivers an\nAutonomous, No-Human-In-Loop (NHIL) flow, 24 hour turnaround from' metadata={'Header 1': 'OpenROAD', 'Header 2': 'About OpenROAD', 'source': 'README.md', 'file_path': 'README.md', 'file_name': 'README.md', 'file_type': '.md'}

page_content='RTL-GDSII for rapid design exploration and physical design implementation.' metadata={'Header 1': 'OpenROAD', 'Header 2': 'About OpenROAD', 'source': 'README.md', 'file_path': 'README.md', 'file_name': 'README.md', 'file_type': '.md'}

page_content='[OpenROAD]( eliminates the barriers\nof cost, schedule risk and uncertainty in hardware design to promote\nopen access to rap

In [72]:
# Code is adapted from https://github.com/langchain-ai/langchain/issues/3016
def save_docs_to_jsonl(array:Iterable[Document], file_path:str)->None:
    with open(file_path, 'w') as jsonl_file:
        for doc in array:
            jsonl_file.write(doc.json() + '\n')

def load_docs_from_jsonl(file_path)->Iterable[Document]:
    array = []
    with open(file_path, 'r') as jsonl_file:
        for line in jsonl_file:
            data = json.loads(line)
            obj = Document(**data)
            array.append(obj)
    return array
    
save_docs_to_jsonl(final,'tempdata/data.jsonl')
final2=load_docs_from_jsonl('tempdata/data.jsonl')
assert len(final) == len(final2)


## Chunking Strategies (from Pinecone) [link](https://www.pinecone.io/learn/chunking-strategies)
- Fixed-size chunking. The most straightforward option is LangChain's `CharacterTextSplitter`.
- Content-aware chunking. Take advantage of the nature of the content and apply more sophisticated chunking.
1. Naive splitting: split by `.`, `\n`
```python
text = "..." # your text
docs = text.split(".")
```

2. NLTK
```python
text = "..." # your text
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)
```

3. spaCy
```python
text = "..." # your text
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpacyTextSplitter()
docs = text_splitter.split_text(text)
```

- Recursive Chunking: divide the input text into smaller chunks hierarchically and iteratively.
```python
text = "..." # your text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=256,
    chunk_overlap=20
)

docs = text_splitter.create_documents([text])
```

- Specialized Chunking: Markdown and LaTeX
```python
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
```

## Best chunk size
- Preprocess your data: remove tags and noise.
- Select a range of chunk sizes and set up an evaluation harness.
- A question bank of about 75 questions is a good size. [link](https://www.mattambrogi.com/posts/chunk-size-matters/)
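
The steps above can be sketched as a small sweep over chunk sizes. This is a toy harness under stated assumptions, not a real evaluation: `fixed_size_chunks` is a plain-Python stand-in for a fixed-size splitter, retrieval is approximated by word overlap instead of embeddings, and the question bank is assumed to be a list of `(question, expected_answer_substring)` pairs. All function names here are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def word_overlap(question: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared by question and chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def hit_rate(chunks: Sequence[str], question_bank: Sequence[Tuple[str, str]]) -> float:
    """Fraction of questions whose top-scoring chunk contains the expected answer."""
    if not question_bank:
        raise ValueError("question_bank must not be empty")
    hits = 0
    for question, answer in question_bank:
        best = max(chunks, key=lambda c: word_overlap(question, c))
        hits += answer.lower() in best.lower()
    return hits / len(question_bank)


def sweep_chunk_sizes(
    text: str,
    question_bank: Sequence[Tuple[str, str]],
    chunk_sizes: Sequence[int],
    chunk_overlap: int = 20,
) -> Dict[int, float]:
    """Score each candidate chunk size with the same question bank."""
    return {
        size: hit_rate(fixed_size_chunks(text, size, chunk_overlap), question_bank)
        for size in chunk_sizes
    }
```

Swapping `fixed_size_chunks` for one of the LangChain splitters and `word_overlap` for the actual retriever keeps the loop the same, so each chunking strategy can be compared on equal footing.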