# Use LangChain, GPT and Deep Lake to work with a code base
In this tutorial, we use LangChain and Deep Lake with GPT to analyze the code base of LangChain itself.

## Design

1. Prepare data:
   1. Load all Python project files using the `langchain.document_loaders.TextLoader`. We will call these files the **documents**.
   2. Split all documents into chunks using the `langchain.text_splitter.CharacterTextSplitter`.
   3. Embed the chunks and upload them into Deep Lake using `langchain.embeddings.openai.OpenAIEmbeddings` and `langchain.vectorstores.DeepLake`.
2. Question-Answering:
   1. Build a chain from `langchain.chat_models.ChatOpenAI` and `langchain.chains.ConversationalRetrievalChain`
   2. Prepare questions.
   3. Get answers by running the chain.


## Implementation

### Integration preparations

We need to set up keys for external services and install the necessary Python libraries.

In [3]:
#!python3 -m pip install --upgrade langchain deeplake openai

Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate.

For the full Deep Lake documentation, see https://docs.activeloop.ai/; for the API reference, see https://docs.deeplake.ai/en/latest/

In [1]:
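import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY from a local .env file, if one is present
load_dotenv()

# Confirm the key is set without echoing it into the notebook output.
# To enter it manually instead: os.environ['OPENAI_API_KEY'] = getpass.getpass()
'OPENAI_API_KEY' in os.environ

True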
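
If you want to confirm which key is loaded without exposing it in saved notebook output, one option is to print only a masked form. This is a minimal sketch; `mask_secret` is a hypothetical helper written for this example and is not part of LangChain or python-dotenv.

```python
import os


def mask_secret(value, visible=4):
    """Return the secret with all but the last `visible` characters replaced by '*'."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Shows only the trailing characters of the key, or "<not set>" if it is missing
print(mask_secret(os.environ.get("OPENAI_API_KEY")))
```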

Authenticate with Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at [app.activeloop.ai](https://app.activeloop.ai).

In [6]:
# os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')



### Prepare data 

Load all repository files. Here we assume this notebook was downloaded as part of a fork of the `langchain` repo and that we work with the Python files of that repo.

If you want to use files from a different repo, change `root_dir` to the root directory of your repo.

In [2]:
from langchain.document_loaders import TextLoader

# root_dir = '../../../..'
root_dir = '../data'

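docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        # Keep only Python sources and skip anything inside a virtual environment
        if file.endswith('.py') and 'venv/' not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
                # load_and_split() already applies the default text splitter,
                # so `docs` holds pieces of files rather than whole files
                docs.extend(loader.load_and_split())
            except Exception:
                # Files that cannot be read (e.g. not valid UTF-8) are skipped silently
                pass
print(f'{len(docs)}')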

2470
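
The `'venv/' not in dirpath` test above still walks the whole virtual environment before discarding its files. A possible alternative is to prune unwanted directories during the walk. The sketch below uses only the standard library; `find_python_files` and its default `skip_dirs` list are illustrative assumptions, not part of the original notebook.

```python
import os


def find_python_files(root_dir, skip_dirs=("venv", ".venv", ".git", "__pycache__")):
    """Return the sorted paths of all .py files under root_dir, ignoring skip_dirs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Editing dirnames in place stops os.walk from descending into those directories
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if name.endswith(".py"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Each returned path could then be passed to `TextLoader` exactly as in the loop above.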


Then, chunk the files.

In [3]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)}")

Created a chunk of size 1109, which is longer than the specified 1000
Created a chunk of size 1194, which is longer than the specified 1000
Created a chunk of size 3122, which is longer than the specified 1000
Created a chunk of size 1111, which is longer than the specified 1000
Created a chunk of size 3569, which is longer than the specified 1000
Created a chunk of size 1121, which is longer than the specified 1000
Created a chunk of size 1155, which is longer than the specified 1000
Created a chunk of size 1062, which is longer than the specified 1000
Created a chunk of size 1033, which is longer than the specified 1000
Created a chunk of size 2432, which is longer than the specified 1000
Created a chunk of size 2357, which is longer than the specified 1000
Created a chunk of size 3133, which is longer than the specified 1000
Created a chunk of size 2082, which is longer than the specified 1000
Created a chunk of size 1003, which is longer than the specified 1000
Created a chunk of s

7114
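
The "longer than the specified 1000" warnings appear because `CharacterTextSplitter` splits only on its separator (a blank line by default) and then merges the resulting pieces up to `chunk_size`; a single piece with no separator inside it cannot be cut further. The following is a simplified sketch of that merge strategy in plain Python, written for illustration and not LangChain's actual implementation; `merge_pieces` is a hypothetical name.

```python
def merge_pieces(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A piece that is itself longer than chunk_size is emitted unchanged,
    which is the situation the splitter warns about.
    """
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, current, total = [], [], 0  # total == len(separator.join(current))
    for piece in pieces:
        added = len(piece) + (sep_len if current else 0)
        if current and total + added > chunk_size:
            chunks.append(separator.join(current))
            # Keep a tail of at most chunk_overlap characters that still leaves room for `piece`
            while current and (total > chunk_overlap
                               or total + len(piece) + sep_len > chunk_size):
                removed = current.pop(0)
                total -= len(removed) + (sep_len if current else 0)
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\n" + "d" * 30
chunks = merge_pieces(sample, chunk_size=12, chunk_overlap=4)
oversized = [c for c in chunks if len(c) > 12]
print(len(chunks), "chunks,", len(oversized), "oversized")
```

For source code, where long functions often contain no blank lines, a splitter that falls back to finer separators may produce fewer oversized chunks.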


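Then embed the chunks and upload them to Deep Lake. In this run the Deep Lake upload is commented out and the chunks are indexed in a local FAISS vector store instead.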

This can take several minutes. 

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(disallowed_special=())
embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special=set(), chunk_size=1000, max_retries=6, request_timeout=None, headers=None)

In [7]:
# from langchain.vectorstores import DeepLake

# db = DeepLake.from_documents(texts, embeddings, dataset_path=f"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code")
# db

In [8]:
from langchain.vectorstores.faiss import FAISS

# embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(texts, embeddings)
vectorstore.save_local("vectorstore.faiss", index_name="code")

### Question Answering
First load the dataset, then construct the retriever, and finally construct the conversational chain.

In [9]:
# db = DeepLake(dataset_path=f"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code", read_only=True, embedding_function=embeddings)

In [10]:
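retriever = vectorstore.as_retriever()
# 'distance_metric' and 'maximal_marginal_relevance' are Deep Lake search options kept from
# the Deep Lake version of this notebook; they may have no effect on the FAISS retriever used here
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 20
retriever.search_kwargs['maximal_marginal_relevance'] = True
# Number of chunks returned for each question
retriever.search_kwargs['k'] = 20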
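
To make the `fetch_k` and `k` settings more concrete: maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity, then keeps `k` of them while penalizing candidates that are too similar to those already chosen. The sketch below is a plain-Python illustration of that selection rule under stated assumptions; `mmr_select`, `cosine` and the `lambda_mult` weighting are names chosen for this example and do not reproduce any vector store's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return the indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0.0 when none is selected)
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
cands = [[1.0, 0.1], [1.0, 0.12], [0.5, -0.5]]
print(mmr_select([1.0, 0.0], cands, k=2))
```

Note that when `k` equals `fetch_k`, as in the cell above, every fetched candidate is returned, so MMR can only change the order of the results, not which chunks are selected.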

You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter).

In [18]:
def filter(x):
    # filter based on source code
    if 'something' in x['text'].data()['value']:
        return False
    
    # filter based on path e.g. extension
    metadata =  x['metadata'].data()['value']
    return 'only_this' in metadata['source'] or 'also_that' in metadata['source']

### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
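
Before enabling a filter like this, it can help to check the predicate against hand-made samples. The sketch below mimics only the access pattern used above (`x['text'].data()['value']` and `x['metadata'].data()['value']`); `FakeTensor` and `make_sample` are hypothetical stand-ins, not Deep Lake classes, and the predicate is renamed `keep_sample` to avoid shadowing Python's built-in `filter`.

```python
class FakeTensor:
    """Hypothetical stand-in exposing the .data()['value'] access used by the filter."""

    def __init__(self, value):
        self._value = value

    def data(self):
        return {"value": self._value}


def make_sample(text, source):
    """Build a fake sample with the two fields the filter reads."""
    return {"text": FakeTensor(text), "metadata": FakeTensor({"source": source})}


def keep_sample(x):
    # Filter based on source code
    if "something" in x["text"].data()["value"]:
        return False
    # Filter based on path, e.g. extension
    source = x["metadata"].data()["value"]["source"]
    return "only_this" in source or "also_that" in source


print(keep_sample(make_sample("def f(): pass", "pkg/only_this/a.py")))
```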

In [25]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-4') # 'ada' 'gpt-3.5-turbo' 'gpt-4',
qa = ConversationalRetrievalChain.from_llm(model,
                                           retriever=retriever,
                                           return_source_documents=True,
                                           verbose=True)

In [32]:
qa.input_keys
qa.output_keys
qa.output_key

['question', 'chat_history']

['answer', 'source_documents']

'answer'

In [18]:
from langchain.chains import RetrievalQA

qa2 = RetrievalQA.from_chain_type(llm=model, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True, 
                                  verbose=True)
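
The `"stuff"` chain type places all retrieved chunks into a single prompt. The verbose output later in this notebook shows the resulting shape: a system instruction, a line of dashes, then the chunk texts. The sketch below rebuilds that shape in plain Python for illustration; the blank-line separator between chunks and the name `build_stuff_prompt` are assumptions, not taken from the library.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the users question. \n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_stuff_prompt(chunks, separator="\n\n"):
    """Join every retrieved chunk into one context block below the system instruction."""
    return SYSTEM_TEMPLATE.format(context=separator.join(chunks))


prompt = build_stuff_prompt(["class A: ...", "class B(A): ..."])
print(len(prompt), "characters")
```

Because every chunk lands in one prompt, 20 chunks of roughly 1000 characters each add up to about 20,000 characters of context, which is worth keeping in mind when choosing `k` and the model.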


In [34]:
qa2.input_keys
qa2.input_key
qa2.output_keys
qa2.output_key

['query']

'query'

['result', 'source_documents']

'result'
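
The two chains use different keys: `qa` takes `question` and `chat_history` and returns `answer`, while `qa2` takes `query` and returns `result`. A small adapter can hide that difference. This is a sketch that relies only on the `input_keys` attribute and the dictionary-style call shown in this notebook; `ask` is a hypothetical helper name.

```python
def ask(chain, question, chat_history=None):
    """Call either chain with the keys it expects and return (answer, source_documents)."""
    if "chat_history" in chain.input_keys:
        # ConversationalRetrievalChain-style interface
        result = chain({"question": question, "chat_history": chat_history or []})
        return result["answer"], result.get("source_documents", [])
    # RetrievalQA-style interface
    result = chain({"query": question})
    return result["result"], result.get("source_documents", [])
```

With the chains defined above, `ask(qa2, "What is the langchain class hierarchy?")` and `ask(qa, "What is the langchain class hierarchy?", chat_history)` would both return an answer string plus the list of source documents.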

In [27]:
def ask_question(qa, question):
    result = qa({"query": question})
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['result']} \n")
    print(f"**Sources**: {result['source_documents']} \n")
    for i, s in enumerate(result['source_documents']):
        print(f"{i}: {s.metadata['source']} \n")
        print(f"{s.page_content} \n")
        print()

ask_question(qa2, "What is the langchain class hierarchy?")



> Entering new RetrievalQA chain...

> Finished chain.
-> **Question**: What is the langchain class hierarchy? 

**Answer**: The LangChain class hierarchy includes various classes and components used to create, manage, and interact with language models, chains, prompts, tools, and other utilities. Here is an overview of the class hierarchy:

- Chain (Base class for all chains)
  - LLMChain (Language model chain)
    - TaskCreationChain
    - TaskPrioritizationChain
  - LLMBashChain (Language model bash chain)
  - LLMCheckerChain (Language model checker chain)
  - LLMMathChain (Language model math chain)
  - PALChain (PAL chain)
  - QAWithSourcesChain (Question answering with sources chain)
    - VectorDBQAWithSourcesChain
  - VectorDBQA (Retrieval QA chain)
  - SQLDatabaseChain (SQL database chain)
  - APIChain (API chain)
    - OpenAPIEndpointChain
  - AnalyzeDocumentChain (Analyze document chain)
  - ConstitutionalChain (Constitutional chain)
  - ConversationChain (Conversation chain)
  - RouterChain (Router chain)
    - LLMRouterChain
    - MultiRouteChain
    - MultiPromptChain
    - MultiRetrievalQAChain
  - GraphQAChain (Graph QA chain)
    - GraphCypherQAChain
  - HypotheticalDocumentEmbedder (Hyde chain)
  - SequentialChain (Sequential chain)

- BaseLanguageModel (Base class for language models)
  - AI21
  - AlephAlpha
  - Anthropic
  - Anyscale
  - Banana
  - Beam
  - Bedrock
  - CerebriumAI
  - Cohere
  - CTransformers
  - Databricks
  - DeepInfra
  - FakeListLLM
  - ForefrontAI
  - GooglePalm
  - GooseAI
  - GPT4All
  - HuggingFaceEndpoint
  - HuggingFaceHub
  - HuggingFacePipeline
  - HuggingFaceTextGenInference
  - HumanInputLLM
  - LlamaCpp
  - Modal
  - MosaicML
  - NLPCloud
  - OpenAI
  - OpenAIChat
  - OpenLM
  - Petals
  - PipelineAI
  - PredictionGuard
  - PromptLayerOpenAI
  - PromptLayerOpenAIChat
  - Replicate
  - RWKV
  - SagemakerEndpoint
  - SelfHostedPipeline
  - SelfHostedHuggingFaceLLM
  - StochasticAI
  - VertexAI
  - Writer

- BasePromptTemplate (Base class for prompt templates)
  - PromptTemplate
  - FewShotPromptTemplate

This hierarchy outlines the main classes used to build and manage various chains, language models, and prompts within the LangChain library. Note that there are many other classes and utility functions that are not included in this hierarchy, as they support these main classes in specific tasks and functionalities. 


In [37]:

def chat(qa, question, chat_history):
    # ConversationalRetrievalChain takes 'question' and 'chat_history' and returns 'answer'
    result = qa({"question": question, "chat_history": chat_history})
    # Record the turn so that follow-up questions can refer back to it
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
    for i, s in enumerate(result['source_documents']):
        print(f"{i}: {s.metadata['source']} \n")
    return result

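In [42]:
# `get_chat_history` is an optional formatting callable that defaults to None, so
# calling `qa.get_chat_history()` raises the TypeError below; inspect it without calling it
qa.get_chat_history
qa.input_keys
qa.output_keys

TypeError: 'NoneType' object is not callable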
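
`get_chat_history` is meant to hold a function that turns the list of `(question, answer)` tuples into the text used when a follow-up question is rephrased. Since the answers in this notebook are long, a custom formatter that truncates them may keep that text manageable. The sketch below assumes the tuple format used for `chat_history` in this notebook; the `Human:`/`Assistant:` labels, the truncation limit and the name `format_chat_history` are illustrative choices.

```python
def format_chat_history(chat_history, max_answer_chars=500):
    """Render (question, answer) turns as text, truncating long answers."""
    lines = []
    for question, answer in chat_history:
        if len(answer) > max_answer_chars:
            answer = answer[:max_answer_chars] + "..."
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


print(format_chat_history([("What is a Chain?", "A Chain is a base class.")]))
```

Such a function could presumably be supplied when building the chain, for example through a `get_chat_history` argument to `ConversationalRetrievalChain.from_llm`; check the signature in your installed LangChain version before relying on it.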

In [44]:
result["chat_history"]
chat_history

[('What is the langchain class hierarchy?',
  "The LangChain class hierarchy consists of various classes and modules, organized mainly into Agents, Chains, LLMs (Language Models), Prompts, Tools, and Utilities. Here's an overview of the main classes in each category:\n\n1. Agents:\n   - MRKLChain\n   - ReActChain\n   - SelfAskWithSearchChain\n\n2. Chains:\n   - Base Chains:\n     - Chain\n     - APIChain\n     - LLMChain\n     - LLMBashChain\n     - LLMCheckerChain\n     - LLMMathChain\n     - PALChain\n     - QAWithSourcesChain\n     - SQLDatabaseChain\n     - VectorDBQA\n     - VectorDBQAWithSourcesChain\n   - Conversational Chains:\n     - ConversationChain\n     - TaskCreationChain\n     - TaskPrioritizationChain\n   - Specialized Chains:\n     - ConstitutionalChain\n     - ChatVectorDBChain\n     - ConversationalRetrievalChain\n     - RouterChain\n     - MultiRouteChain\n     - MultiPromptChain\n     - MultiRetrievalQAChain\n     - LLMRouterChain\n\n3. LLMs (Language Models):\n   

In [39]:
questions = [
    "What is the langchain class hierarchy?",
    # "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the users question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
from langchain.cache import BaseCache
from langchain.chains import (
    ConversationChain,
    LLMBashChain,
    LLMChain,
    LLMCheckerChain,
    LLMMathChain,
    PALChain,
    QAWithSourcesChain,
    SQLDatabaseChain,
    VectorDBQA,
    VectorDBQAWithSourcesChain,
)
from langchain.docstore import InMemoryDocstore, Wikipedia
from langchain.llms import (
    Anthropic,
    Banana,
    CerebriumAI,
    Cohere,
    ForefrontAI,
    GooseAI,
    HuggingFaceHub,
    HuggingFaceTextGenInference,
    LlamaCpp,
    Modal,
    OpenAI,
    Petals,
    PipelineAI,
    SagemakerEndpoint,
    StochasticAI,


-> **Question**: What is the langchain class hierarchy? 

**Answer**: The LangChain class hierarchy consists of various classes and modules, organized mainly into Agents, Chains, LLMs (Language Models), Prompts, Tools, and Utilities. Here's an overview of the main classes in each category:

1. Agents:
   - `MRKLChain`
   - `ReActChain`
   - `SelfAskWithSearchChain`

2. Chains:
   - Base Chains:
     - `Chain`
     - `APIChain`
     - `LLMChain`
     - `LLMBashChain`
     - `LLMCheckerChain`
     - `LLMMathChain`
     - `PALChain`
     - `QAWithSourcesChain`
     - `SQLDatabaseChain`
     - `VectorDBQA`
     - `VectorDBQAWithSourcesChain`
   - Conversational Chains:
     - `ConversationChain`
     - `TaskCreationChain`
     - `TaskPrioritizationChain`
   - Specialized Chains:
     - `ConstitutionalChain`
     - `ChatVectorDBChain`
     - `ConversationalRetrievalChain`
     - `RouterChain`
     - `MultiRouteChain`
     - `MultiPromptChain`
     - `MultiRetrievalQAChain`
     - `LLMRouterChain`

3. LLMs (Language Models):
   - AI21
   - AlephAlpha
   - Anthropic
   - Anyscale
   - Banana
   - Beam
   - Bedrock
   - CerebriumAI
   - Cohere
   - CTransformers
   - Databricks
   - DeepInfra
   - FakeListLLM
   - ForefrontAI
   - GooglePalm
   - GooseAI
   - GPT4All
   - HuggingFaceEndpoint
   - HuggingFaceHub
   - HuggingFacePipeline
   - HuggingFaceTextGenInference
   - HumanInputLLM
   - LlamaCpp
   - Modal
   - MosaicML
   - NLPCloud
   - OpenAI
   - OpenAIChat
   - OpenLM
   - Petals
   - PipelineAI
   - Replicate
   - RWKV
   - SagemakerEndpoint
   - StochasticAI
   - VertexAI
   - Writer

4. Prompts:
   - `BasePromptTemplate`
   - `FewShotPromptTemplate`
   - `Prompt`
   - `PromptTemplate`

5. Tools:
   - `ArxivQueryRun`
   - `BingSearchRun`
   - `DuckDuckGoSearchRun`
   - `GoogleSearchResults`
   - `GoogleSearchRun`
   - `GoogleSerperResults`
   - `GoogleSerperRun`
   - `HumanInputRun`
   - `MetaphorSearchResults`
   - `SearxSearchResults`
   - `SearxSearchRun`
   - `WikipediaQueryRun`
   - `WolframAlphaQueryRun`
   - `OpenWeatherMapQueryRun`

6. Utilities:
   - `ArxivAPIWrapper`
   - `BingSearchAPIWrapper`
   - `DuckDuckGoSearchAPIWrapper`
   - `GoogleSearchAPIWrapper`
   - `GoogleSerperAPIWrapper`
   - `MetaphorSearchAPIWrapper`
   - `LambdaWrapper`
   - `GraphQLAPIWrapper`
   - `SearxSearchWrapper`
   - `SerpAPIWrapper`
   - `TwilioAPIWrapper`
   - `WikipediaAPIWrapper`
   - `WolframAlphaAPIWrapper`
   - `OpenWeatherMapAPIWrapper`

Please note that this list is not exhaustive and only includes the main classes for each category. There might be other classes and subclasses within the LangChain library. 


In [40]:
result

{'question': 'What is the langchain class hierarchy?',
 'chat_history': [('What is the langchain class hierarchy?',
   "The LangChain class hierarchy consists of various classes and modules, organized mainly into Agents, Chains, LLMs (Language Models), Prompts, Tools, and Utilities. Here's an overview of the main classes in each category:\n\n1. Agents:\n   - MRKLChain\n   - ReActChain\n   - SelfAskWithSearchChain\n\n2. Chains:\n   - Base Chains:\n     - Chain\n     - APIChain\n     - LLMChain\n     - LLMBashChain\n     - LLMCheckerChain\n     - LLMMathChain\n     - PALChain\n     - QAWithSourcesChain\n     - SQLDatabaseChain\n     - VectorDBQA\n     - VectorDBQAWithSourcesChain\n   - Conversational Chains:\n     - ConversationChain\n     - TaskCreationChain\n     - TaskPrioritizationChain\n   - Specialized Chains:\n     - ConstitutionalChain\n     - ChatVectorDBChain\n     - ConversationalRetrievalChain\n     - RouterChain\n     - MultiRouteChain\n     - MultiPromptChain\n     - MultiR

-> **Question**: What is the class hierarchy? 

**Answer**: There are several class hierarchies in the provided code, so I'll list a few:

1. `BaseModel` -> `ConstitutionalPrinciple`: `ConstitutionalPrinciple` is a subclass of `BaseModel`.
2. `BasePromptTemplate` -> `StringPromptTemplate`, `AIMessagePromptTemplate`, `BaseChatPromptTemplate`, `ChatMessagePromptTemplate`, `ChatPromptTemplate`, `HumanMessagePromptTemplate`, `MessagesPlaceholder`, `SystemMessagePromptTemplate`, `FewShotPromptTemplate`, `FewShotPromptWithTemplates`, `Prompt`, `PromptTemplate`: All of these classes are subclasses of `BasePromptTemplate`.
3. `APIChain`, `Chain`, `MapReduceDocumentsChain`, `MapRerankDocumentsChain`, `RefineDocumentsChain`, `StuffDocumentsChain`, `HypotheticalDocumentEmbedder`, `LLMChain`, `LLMBashChain`, `LLMCheckerChain`, `LLMMathChain`, `LLMRequestsChain`, `PALChain`, `QAWithSourcesChain`, `VectorDBQAWithSourcesChain`, `VectorDBQA`, `SQLDatabaseChain`: All of these classes are subclasses of `Chain`.
4. `BaseLoader`: `BaseLoader` is a subclass of `ABC`.
5. `BaseTracer` -> `ChainRun`, `LLMRun`, `SharedTracer`, `ToolRun`, `Tracer`, `TracerException`, `TracerSession`: All of these classes are subclasses of `BaseTracer`.
6. `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, `CohereEmbeddings`, `JinaEmbeddings`, `LlamaCppEmbeddings`, `HuggingFaceHubEmbeddings`, `TensorflowHubEmbeddings`, `SagemakerEndpointEmbeddings`, `HuggingFaceInstructEmbeddings`, `SelfHostedEmbeddings`, `SelfHostedHuggingFaceEmbeddings`, `SelfHostedHuggingFaceInstructEmbeddings`, `FakeEmbeddings`, `AlephAlphaAsymmetricSemanticEmbedding`, `AlephAlphaSymmetricSemanticEmbedding`: All of these classes are subclasses of `BaseLLM`. 


-> **Question**: What classes are derived from the Chain class? 

**Answer**: There are multiple classes that are derived from the Chain class. Some of them are:
- APIChain
- AnalyzeDocumentChain
- ChatVectorDBChain
- CombineDocumentsChain
- ConstitutionalChain
- ConversationChain
- GraphQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMCheckerChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAPIEndpointChain
- PALChain
- QAWithSourcesChain
- RetrievalQA
- RetrievalQAWithSourcesChain
- SequentialChain
- SQLDatabaseChain
- TransformChain
- VectorDBQA
- VectorDBQAWithSourcesChain

There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.


-> **Question**: What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests? 

**Answer**: All classes and functions in the `./langchain/utilities/` folder seem to have unit tests written for them. 
