# Using GithubReader for Accessing Files from GitHub
In this notebook, we demonstrate how to use the Indox GitHub reader (the `GithubClient` and `GithubRepositoryReader` classes) to access files from GitHub repositories. The reader interacts with the GitHub API to retrieve file content and metadata, which is useful for applications such as research and question-answering systems.

To begin, have your GitHub personal access token ready. The token authenticates your requests to the GitHub API and is required for accessing private repositories.

Also set up your API keys as environment variables using the dotenv library instead of hardcoding them in the notebook. This notebook loads them from a file named `api.env`, which should define both `HUGGINGFACE_API_KEY` (for the Hugging Face models) and `github_token`. Keeping these values out of the codebase improves security and maintainability.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/github_chroma_recursive.ipynb)

In [None]:
!pip install indox
!pip install chromadb
!pip install pygithub
!pip install sentence_transformers

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indox`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indox
```
2. **Activate the virtual environment:**
```bash
indox\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
   ```bash
   python3 -m venv indox
   ```

2. **Activate the virtual environment:**
   ```bash
   source indox/bin/activate
   ```
   
### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


## Import Essential Libraries

Next, we import the essential components for our Indox question-answering system:

- `IndoxRetrievalAugmentation`: The main entry point of Indox. It connects the vector store and the language model for retrieval-augmented question answering.
- `HuggingFaceModel`: The Indox wrapper for language models accessed through the Hugging Face API. In this notebook it runs `mistralai/Mistral-7B-Instruct-v0.2` as the QA model.
- `HuggingFaceEmbedding`: Uses a Hugging Face embedding model to turn text into vectors that capture its contextual meaning.
- `RecursiveCharacterTextSplitter`: Recursively divides large text documents into smaller chunks based on character length and separator boundaries, so that each chunk stays as coherent as possible.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv('api.env')

HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
github_token = os.environ['github_token']
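
Reading keys with `os.environ[...]` raises a bare `KeyError` for the first missing variable only. As a minimal sketch, the helper below (`require_env` is a hypothetical name, not part of Indox or dotenv) checks all required variables at once and reports every missing one. It accepts any mapping, so the example uses an in-memory dictionary instead of the real environment.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, or raise listing all missing ones.

    Hypothetical helper for illustration; not part of Indox or python-dotenv.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Example with placeholder values in a plain dict instead of the real environment
fake_env = {"HUGGINGFACE_API_KEY": "hf_example", "github_token": "ghp_example"}
settings = require_env(["HUGGINGFACE_API_KEY", "github_token"], environ=fake_env)
```

In the notebook itself, the same call without the `environ` argument would read from `os.environ` after `load_dotenv('api.env')` has run.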

In [2]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

[32mINFO[0m: [1mIndoxRetrievalAugmentation initialized[0m

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            


## Building the GithubReader System and Initializing Models

Next, we will build our GithubReader system and initialize the necessary models for processing GitHub repository content. This setup will enable us to effectively retrieve and handle files from GitHub, leveraging these models to support various research and question-answering tasks.


In [3]:
from indox.llms import HuggingFaceModel
from indox.embeddings import HuggingFaceEmbedding

mistral_qa = HuggingFaceModel(api_key=HUGGINGFACE_API_KEY, model="mistralai/Mistral-7B-Instruct-v0.2")
embed = HuggingFaceEmbedding(api_key=HUGGINGFACE_API_KEY, model="multi-qa-mpnet-base-cos-v1")

[32mINFO[0m: [1mInitializing HuggingFaceModel with model: mistralai/Mistral-7B-Instruct-v0.2[0m
[32mINFO[0m: [1mHuggingFaceModel initialized successfully[0m
[32mINFO[0m: [1mInitialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1[0m


## Setting Up the GithubReader for Retrieving Repository Content
To demonstrate the GitHub reader and its integration with `Indox`, we use a sample GitHub repository, `osllmai/indoxjudge`. The reader is configured to include only Markdown (`.md`) files from the `docs` directory, and the content is loaded from the `main` branch. These documents serve as the reference data for testing and evaluation.

In [4]:
from indox.data_connectors import GithubClient, GithubRepositoryReader

github_client = GithubClient(github_token=github_token)

repo_reader = GithubRepositoryReader(
    github_client=github_client,
    owner="osllmai",
    repo="indoxjudge",
    filter_directories=(["docs"], GithubRepositoryReader.FilterType.INCLUDE),
    filter_file_extensions=([".md"], GithubRepositoryReader.FilterType.INCLUDE)
)

documents = repo_reader.load_content(branch="main")
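
The two `INCLUDE` filters above restrict which repository paths are read. The sketch below illustrates that idea on a plain list of paths. It is an assumption about the filter semantics, not the reader's actual implementation, and `keep_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def keep_path(path, include_dirs, include_exts):
    """Keep a path only if it lies under an included directory and has an included extension.

    Illustrative sketch of INCLUDE-style filtering; the real reader may match differently.
    """
    parts = PurePosixPath(path).parts
    in_dir = any(
        parts[: len(PurePosixPath(d).parts)] == PurePosixPath(d).parts
        for d in include_dirs
    )
    return in_dir and PurePosixPath(path).suffix in include_exts


# Made-up paths for illustration
paths = ["docs/guide.md", "docs/img/logo.png", "README.md", "src/docs.md"]
kept = [p for p in paths if keep_path(p, ["docs"], [".md"])]
```

With these made-up paths, only the Markdown file inside `docs` passes both filters.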

In [5]:
content = documents


In [6]:
content

['# Branch Naming and Pull Request Guidelines for the Team\n\n### Note 1: Branch Naming\n\nPay attention to the type of task assigned to you. Is it a feature, a bug, or a refactor?\n\n- If it\'s a bug: The branch name should start with the word "issue".\n- If it\'s a feature: The branch name should start with the word "feature".\n- If it\'s a refactor: The branch name should start with the word "refactor".\n- If it\'s for documentation : The branch name should start with the word "docs".\n### Note 2: Creating a Pull Request\n\nFor every branch you create, you need to make a pull request at the end of development. However, there are some rules:\n\n1. Ensure your code adheres to a set of technical guidelines before creating the pull request. This includes following coding standards and running all necessary tests.\n2. Write detailed descriptions for the pull request. This should include  an explanation of the issue solved and what you did.\n3. Limit your changes to no more than 10 files 
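
Displaying the full list can produce very long output. As a small sketch, assuming the loaded documents are plain strings as in the output above, a helper like the following (`preview` is a hypothetical name) shows one short line per document instead.

```python
def preview(docs, width=60):
    """Return one summary line per document: index, length, and the first line of text."""
    lines = []
    for i, doc in enumerate(docs):
        stripped = doc.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        lines.append(f"[{i}] {len(doc)} chars: {first_line[:width]}")
    return lines


# Made-up documents for illustration
sample_docs = ["# Title one\n\nBody text.", "# Title two\nMore text."]
for line in preview(sample_docs):
    print(line)
```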

## Splitting Content into Manageable Chunks
We use the `RecursiveCharacterTextSplitter` to divide the retrieved content into smaller, coherent chunks. This approach ensures that each segment maintains contextual integrity while managing large volumes of text.

In [7]:
from indox.splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(400,20)
content_chunks = splitter.split_text(content)
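
To make the idea of recursive splitting concrete, here is a simplified sketch written from scratch. It is not the Indox implementation: it tries coarse separators first (paragraph breaks), falls back to finer ones (lines, then words), and only cuts mid-word as a last resort. Chunk overlap, which the second argument above presumably controls, is omitted for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration of recursive character splitting (no overlap).
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaa bbb\n\nccc ddd eee", 7)
```

In this toy input the paragraph break is used first, and the second paragraph is then split on spaces because it does not fit in one chunk.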

## Storing and Indexing Content with Chroma
We use the `Chroma` vector store from the `indox.vector_stores` module to store and index the content chunks. By creating a collection named "sample" and applying an embedding function (`embed`), we convert each chunk into a vector for efficient retrieval. The `add` method then adds these vectors to the database, enabling scalable and effective search for question-answering tasks.

In [8]:
from indox.vector_stores import Chroma
db = Chroma(collection_name="sample",embedding_function=embed)
db.add(docs=content_chunks)

[32mINFO[0m: [1mStoring documents in the vector store[0m
[32mINFO[0m: [1mEmbedding documents[0m
[32mINFO[0m: [1mStarting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)[0m
[32mINFO[0m: [1mDocument added successfully to the vector store.[0m
[32mINFO[0m: [1mDocuments stored successfully[0m


## Querying GitHub Repository Data with Indox

With our `GithubReader` system and `Indox` setup complete, we are ready to test it using a sample query. This test will demonstrate how effectively our system can retrieve and process information from files in the GitHub repository.

We’ll use a sample query to evaluate our system:

- **Query**: "What are the guidelines for creating a pull request?"

This question will be processed by the `GithubReader` and `Indox` system to retrieve relevant files and generate a precise response based on the repository content.

Let’s test our setup with this query.

In [9]:
query = "What are the guidelines for creating a pull request?"
retriever = indox.QuestionAnswer(vector_database=db, llm=mistral_qa, top_k=1)

We use the `invoke` method to get a response from the system. It embeds the query, retrieves the most relevant chunk from the vector store (here `top_k=1`), and passes that chunk to the QA model to generate an answer.

The retrieved text remains available afterwards as `retriever.context`, so we can check both the answer and the repository content it was based on.

In [10]:
answer = retriever.invoke(query)
context = retriever.context

[32mINFO[0m: [1mRetrieving context and scores from the vector database[0m
[32mINFO[0m: [1mEmbedding documents[0m
[32mINFO[0m: [1mStarting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)[0m
[32mINFO[0m: [1mGenerating answer without document relevancy filter[0m
[32mINFO[0m: [1mAnswering question[0m
[32mINFO[0m: [1mSending request to Hugging Face API[0m
[32mINFO[0m: [1mReceived successful response from Hugging Face API[0m
[32mINFO[0m: [1mQuery answered successfully[0m


In [11]:
answer

'To create a pull request, ensure that your code follows a set of technical guidelines beforehand. This includes adhering to coding standards and passing all necessary tests. Additionally, write detailed descriptions for the pull request, explaining the issue solved and what actions you took.'