# Creating a Voice Assistant for a Knowledge Base

* [1. Sourcing Content from Hugging Face Hub](#sourcing)
* [2. Embedding and storing in Deep Lake](#storing)

The main purpose here is to create a voice assistant that can efficiently navigate a knowledge base, providing precise and timely responses to a user's queries.

Inspired by the [JarvisBase GitHub repository](https://github.com/peterw/JarvisBase).

In [1]:
import sys, os
sys.path.append('..')

from keys import OPENAI_API_KEY, ACTIVELOOP_TOKEN, ELEVEN_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN
os.environ["ELEVEN_API_KEY"] = ELEVEN_API_KEY

Main stages:
1. Transcribe voice input into text: automatic speech recognition (ASR) with OpenAI's Whisper.
2. Generate a response to the question:
    - load the vector database, which stores the relevant documents
    - retrieve the documents and pass them, together with the question, to the LLM
    - the LLM generates the response from the retrieved documents
3. Generate voice output with Eleven Labs.
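
The three stages can be sketched as one function that chains them together. This is only an illustration of the data flow: `answer_voice_query` and its four callable parameters are hypothetical names, and the lambdas are stand-ins for Whisper, the vector database, the LLM, and Eleven Labs so the sketch runs without API keys.

```python
from typing import Callable, List


def answer_voice_query(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one voice query through the three main stages (hypothetical helper)."""
    question = transcribe(audio)            # stage 1: speech to text
    documents = retrieve(question)          # stage 2: look up relevant documents
    answer = generate(question, documents)  # stage 2: LLM answers from the documents
    return synthesize(answer)               # stage 3: text to speech


# Stand-in callables (assumptions for illustration, not real API calls).
reply = answer_voice_query(
    b"fake-audio",
    transcribe=lambda audio: "How do I upload a file?",
    retrieve=lambda question: ["Upload guide text"],
    generate=lambda question, docs: f"Based on {len(docs)} document(s): see the upload guide.",
    synthesize=lambda text: text.encode("utf-8"),
)
print(reply.decode("utf-8"))
```

Passing the stages in as callables keeps each one replaceable, which makes it easy to test the flow before wiring in the real services.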

<hr>
<a class="anchor" id="sourcing">
    
## 1. Sourcing Content from Hugging Face Hub
    
</a>

The knowledge base for our voice assistant will be built from documentation articles on the Hugging Face Hub. We'll scrape the web pages to collect the documents.

In [2]:
# Import necessary modules
import os
import requests
from bs4 import BeautifulSoup
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import re

# Set up the path for DeepLake (a vector database)
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "langchain_course_jarvis_assistant"
dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'

# Set up an OpenAIEmbeddings instance (the embedding model is selected with the `model` argument)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
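
The Deep Lake path has the form `hub://<organization id>/<dataset name>`. A minimal sketch of building it from the two values defined in the cell; `build_dataset_path` is a hypothetical helper, not part of Deep Lake or LangChain.

```python
def build_dataset_path(org_id: str, dataset_name: str) -> str:
    """Return a Deep Lake path of the form hub://<org_id>/<dataset_name> (hypothetical helper)."""
    if not org_id or not dataset_name:
        raise ValueError("org_id and dataset_name must both be non-empty")
    return f"hub://{org_id}/{dataset_name}"


dataset_path = build_dataset_path("iryna", "langchain_course_jarvis_assistant")
print(dataset_path)
```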

<hr>
<a class="anchor" id="scrape">
    
### 1.1. Scrape content
    
</a>

In [4]:
def get_documentation_urls():
    # List of relative URLs for the desired Hugging Face documentation pages
    return [
        '/docs/huggingface_hub/guides/overview',
        '/docs/huggingface_hub/guides/download',
        '/docs/huggingface_hub/guides/upload',
        '/docs/huggingface_hub/guides/hf_file_system',
        '/docs/huggingface_hub/guides/repository',
        '/docs/huggingface_hub/guides/search',
    ]


def construct_full_url(base_url, relative_url):
    # Construct the full URL by appending the relative URL to the base URL
    return base_url + relative_url


def scrape_page_content(url):
    # Send a GET request to the URL and fail early on HTTP errors or a hung connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Parse the HTML response using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the desired content from the page (in this case, the body text)
    text = soup.body.text.strip()
    # Remove control characters and characters in the \x7f-\xff range
    # (tabs, newlines and characters above \xff are kept)
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Collapse runs of whitespace and newlines into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def scrape_all_content(base_url, relative_urls, filename):
    # Loop through the list of URLs, scrape content and add it to the content list
    content = []
    for relative_url in relative_urls:
        full_url = construct_full_url(base_url, relative_url)
        scraped_content = scrape_page_content(full_url)
        content.append(scraped_content.rstrip('\n'))

    # Write the scraped content to a file
    with open(filename, 'w', encoding='utf-8') as file:
        for item in content:
            file.write("%s\n" % item)
    
    return content
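
To see what the two regular expressions in `scrape_page_content` actually do, here is a small sketch that applies them to a plain string, without any network request. `clean_text` and the sample string are assumptions made for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same two substitutions as scrape_page_content to a plain string."""
    # Drop control characters and everything in the \x7f-\xff range
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# NUL byte, non-breaking space (\xa0), blank lines, and an arrow (U+2192)
sample = "Hugging\x00 Face\xa0Hub \n\n guides \u2192 upload"
cleaned = clean_text(sample)
print(cleaned)
```

The NUL byte and the non-breaking space are removed (so "Face" and "Hub" run together), the blank lines collapse to a single space, and the arrow survives because it lies above `\xff`. The first pattern therefore strips only part of the non-ASCII range, not all of it.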

<hr>
<a class="anchor" id="load_split">
    
### 1.2. Loading and splitting texts
    
</a>

In [6]:
# Define a function to load documents from a file
def load_docs(root_dir, filename):
    # Create an empty list to hold the documents
    docs = []
    try:
        # Load the file using the TextLoader class and UTF-8 encoding
        loader = TextLoader(os.path.join(root_dir, filename), encoding='utf-8')
        # Split the loaded file into separate documents and add them to the list of documents
        docs.extend(loader.load_and_split())
    except Exception as e:
        # Report the failure instead of hiding it; an empty list is returned in this case
        print(f"Could not load {filename}: {e}")
    # Return the list of documents
    return docs
  
    
def split_docs(docs):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(docs)
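
The idea behind separator-based chunking can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `split_on_separator` is a hypothetical function, and the default `"\n\n"` separator and the rule of keeping oversized pieces whole are assumptions about how such a splitter typically behaves.

```python
from typing import List


def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 1000) -> List[str]:
    """Greedy chunking: split on the separator, then merge pieces up to chunk_size."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Keep growing the current chunk (an oversized single piece is kept whole)
            current = candidate
        else:
            # Adding this piece would exceed chunk_size, so start a new chunk
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraphs = "a" * 600 + "\n\n" + "b" * 600 + "\n\n" + "c" * 300
chunks = split_on_separator(paragraphs)
print([len(chunk) for chunk in chunks])

one_line = "word " * 500  # 2500 characters with no blank-line separator
one_line_chunks = split_on_separator(one_line)
print(len(one_line_chunks))
```

The second case matters here because the scraper collapses all whitespace into single spaces, so the saved text has no blank lines left. With a splitter of this kind, such text would stay in one piece longer than `chunk_size`, which is worth checking on the real output of `split_docs`.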

<hr>
<a class="anchor" id="storing">
    
## 2. Embedding and storing in Deep Lake
    
</a>

- [Eleven Labs Website](https://elevenlabs.io/)
- [Eleven Labs API documentation](https://api.elevenlabs.io/docs)