# QA Chatbot over GitHub Repo

This project builds a question-answering chatbot that finds answers in a GitHub repository’s text files, such as `.md` and `.txt` files.

STEPS:
* [1. Processing Files](#processing)
* [2. Saving Embeddings](#saving)
* [3. Retrieving from Database](#retrieving)
* [4. Creating an Interface](#interface)

RESOURCES:
- [GitHub repo with the full code](https://github.com/iryna-savchuk/Chat-with-Github-Repo/tree/main)
- [Streamlit website](https://streamlit.io/)
- [Streamlit API documentation](https://docs.streamlit.io/library/api-reference)
- [How to deploy Streamlit-based app](https://docs.streamlit.io/library/get-started/create-an-app#share-your-app)

In [1]:
import sys, os
sys.path.append('..')
from keys import OPENAI_API_KEY, ACTIVELOOP_TOKEN

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

ACTIVELOOP_USERNAME='iryna'

<hr>
<a class="anchor" id="processing">
    
## 1. Processing Files
    
</a>

To start building the chatbot, the following repository was cloned to the local machine: https://github.com/huggingface/transformers

In [2]:
from langchain.document_loaders import TextLoader

# Path to the cloned repo 
root_dir = "../../transformers"
docs = []
file_extensions = ['.md', '.txt']

for dirpath, dirnames, filenames in os.walk(root_dir):
    
    for file in filenames:
        file_path = os.path.join(dirpath, file)
        if file_extensions and os.path.splitext(file)[1] not in file_extensions:
            continue   
        loader = TextLoader(file_path, encoding="utf-8")
        docs.extend(loader.load_and_split())

In [3]:
len(docs)

2191
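
The extension filter used in the loop above can be checked on its own before any documents are loaded. The following is a minimal sketch using only the standard library. The helper name `find_text_files` and the temporary directory tree are illustrative assumptions, not part of the original notebook, and unlike the loop above this sketch compares extensions case-insensitively.

```python
import os
import tempfile

def find_text_files(root_dir, file_extensions=(".md", ".txt")):
    """Return sorted paths under root_dir whose extension is in file_extensions."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # os.path.splitext("README.md") gives ("README", ".md")
            if os.path.splitext(name)[1].lower() in file_extensions:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# Self-contained demo on a throwaway directory tree (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "docs"))
    for rel in ("README.md", "setup.py", os.path.join("docs", "notes.txt")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as fh:
            fh.write("sample")
    found = [os.path.relpath(p, tmp) for p in find_text_files(tmp)]
    print(found)
```

Running a check like this first makes it easy to see how many files will be loaded, and to confirm that source files such as `setup.py` are skipped.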

In [4]:
docs[0]



In [8]:
# Keep only the first 100 docs
docs = docs[:100]

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, length_function=len)
splitted_text = text_splitter.split_documents(docs)

In [10]:
len(splitted_text)

492

In [11]:
# Print the content of the first chunk
print(splitted_text[0].page_content)

# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:


In [12]:
print(splitted_text[1].page_content)

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
  community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
  any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
  without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities
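
The two chunks printed above share the lines "Examples of behavior that contributes to a positive environment for our community include:", which is the effect of `chunk_overlap`. The sketch below illustrates the overlap idea with plain fixed-size character windows. It is a simplification and an assumption for illustration only: `RecursiveCharacterTextSplitter` prefers to split at separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Synthetic 2500-character sample text
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.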


<hr>
<a class="anchor" id="saving">
    
## 2. Saving Embeddings
    
</a>

In [13]:
# Create database on Activeloop DataLake cloud
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_org_id = ACTIVELOOP_USERNAME
my_activeloop_dataset_name = "chat_over_github"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!




In [14]:
# Add the chunks to the database (embeddings are computed automatically via the embedding function passed to DeepLake)
db.add_documents(splitted_text)


Dataset(path='hub://iryna/chat_over_github', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (492, 1536)  float32   None   
    id        text      (492, 1)      str     None   
 metadata     json      (492, 1)      str     None   
   text       text      (492, 1)      str     None   


 

['850cff80-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d0e8a-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d0ee4-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d0f20-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d0f52-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d0f84-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d10d8-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d110a-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d11aa-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d11dc-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d120e-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1236-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1268-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1290-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d12c2-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d12ea-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d131c-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1344-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1376-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d139e-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d13d0-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d13f8-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d142a-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1452-3fc3-11ee-9301-12ee7aa5dbdc',
 '850d1484-3fc3-

<hr>
<a class="anchor" id="retrieving">
    
## 3. Retrieving from Database
    
</a>

In [15]:
# Create a retriever from the DeepLake instance
retriever = db.as_retriever()

In [16]:
# Set the search parameters for the retriever
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["fetch_k"] = 100
retriever.search_kwargs["maximal_marginal_relevance"] = True
retriever.search_kwargs["k"] = 10
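
With these settings, the retriever fetches `fetch_k` candidates by cosine similarity and then uses maximal marginal relevance (MMR) to pick `k` of them that are relevant but not redundant. The sketch below shows the general MMR idea in pure Python on toy 2-D vectors. It is not Deep Lake's implementation, and the function names and the `lambda_mult` trade-off weight are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 1.0]
candidates = [
    [1.0, 0.9],  # most similar to the query
    [1.0, 0.8],  # near-duplicate of candidate 0
    [0.0, 1.0],  # less similar, but different from candidate 0
]

# Plain top-k by similarity keeps both near-duplicates
top_k = sorted(range(len(candidates)),
               key=lambda i: cosine(query, candidates[i]), reverse=True)[:2]
picked = mmr(query, candidates, k=2)
print(top_k, picked)
```

In this toy example the plain similarity ranking returns the two near-duplicates, while MMR swaps the second one for the more distinct vector.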

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Create a ChatOpenAI model instance
model = ChatOpenAI()

# Create a RetrievalQA instance from the model and retriever
qa = RetrievalQA.from_llm(model, retriever=retriever)
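
Conceptually, a retrieval QA chain passes the retrieved chunks to the model together with the question. The sketch below shows that idea with plain strings. The prompt wording, the `build_prompt` helper, and the `max_chars` budget are all hypothetical and do not reproduce LangChain's actual prompt template.

```python
def build_prompt(question, chunks, max_chars=3000):
    """Join retrieved chunks into a context block, then append the question."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop before exceeding the (hypothetical) context budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-ins for chunks returned by the retriever
retrieved = [
    "Examples of unacceptable behavior include:",
    "* Public or private harassment",
]
prompt = build_prompt("What are examples of unacceptable actions?", retrieved)
print(prompt)
```

Because the model only sees what fits into the prompt, the retriever's `k` and the chunk size together determine how much of the repository can inform each answer.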

In [18]:
# Return the result of the query
qa.run("What are examples of unacceptable actions?")

"Examples of unacceptable actions include:\n\n- The use of sexualized language or imagery, and sexual attention or advances of any kind\n- Trolling, insulting or derogatory comments, and personal or political attacks\n- Public or private harassment\n- Publishing others' private information, such as a physical or email address, without their explicit permission\n- Other conduct which could reasonably be considered inappropriate in a professional setting"

<hr>
<a class="anchor" id="interface">
    
## 4. Creating an Interface
    
</a>

Creating a web UI for the bot is optional, but it makes the application far more useful: users can interact with it from a browser without any programming expertise.
 
A fast and easy way to build and deploy an application for free is the [Streamlit](https://streamlit.io/) platform. It provides a wide range of widgets for building applications and has comprehensive [API documentation](https://docs.streamlit.io/library/api-reference). We need a simple UI that accepts user input and displays the conversation in a chat-like interface, and Streamlit, together with the `streamlit_chat` component, supports both.

In [19]:
# pip install streamlit streamlit_chat

In [20]:
"""
import streamlit as st
from streamlit_chat import message

# NOTE: `qa` is the RetrievalQA chain from section 3; it must be created
# in this script (or imported) before the app can answer queries.

# Set the title for the Streamlit app
st.title("Chat with GitHub Repository")

# Initialize the session state with placeholder messages
if "generated" not in st.session_state:
    st.session_state["generated"] = ["I am ready to help you"]

if "past" not in st.session_state:
    st.session_state["past"] = ["hello"]

# A text input field to receive user queries
user_input = st.text_input("", key="input")

# Search the database and add the response to the session state
if user_input:
    output = qa.run(user_input)
    st.session_state.past.append(user_input)
    st.session_state.generated.append(output)

# Create the conversational UI from the stored messages
if st.session_state["generated"]:
    for i in range(len(st.session_state["generated"])):
        message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
        message(st.session_state["generated"][i], key=str(i))
"""
print("")




The code above is straightforward. We call `st.text_input()` to create a text input for user queries. Each query is passed to the previously created `RetrievalQA` object, and the results are displayed using the `message` component.
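
The history handling can be tried without Streamlit. In the sketch below, a plain dictionary stands in for `st.session_state` and a stub function stands in for `qa.run`. Both stand-ins, and the helper names `handle_input` and `render`, are assumptions made for illustration.

```python
def fake_qa(question):
    """Hypothetical stand-in for qa.run; returns a canned reply."""
    return f"(answer to: {question})"

def handle_input(state, user_input, answer_fn):
    """Append a question and its answer to the two parallel history lists."""
    if user_input:  # an empty text box yields "", which is skipped
        state["past"].append(user_input)
        state["generated"].append(answer_fn(user_input))
    return state

def render(state):
    """Interleave questions and answers in display order."""
    lines = []
    for question, answer in zip(state["past"], state["generated"]):
        lines.append(("user", question))
        lines.append(("bot", answer))
    return lines

# Plain dict standing in for st.session_state, seeded with placeholder messages
state = {"past": ["hello"], "generated": ["ready"]}
handle_input(state, "", fake_qa)                      # ignored: empty input
handle_input(state, "What is a tokenizer?", fake_qa)  # recorded
for role, text in render(state):
    print(f"{role}: {text}")
```

Keeping the two lists the same length is what lets the UI loop pair each question with its answer by index.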

Store the Python code above in a `.py` file (for example, I created `qa-chat-github.py` in the current directory) and run the following command to see the interface locally:
> streamlit run ./qa-chat-github.py

The application will be accessible in a web browser at http://localhost:8501 or http://localhost:8502 (or the next available port).

The application UI is demonstrated in the image below:
<img src="../img/qa-chat-github-ui.png" alt="qa-chat-github-ui" width="50%">

To make the application accessible over the web, refer to the documentation on how to [deploy](https://docs.streamlit.io/library/get-started/create-an-app#share-your-app) the application.