# Using LangChain, GPT, and Deep Lake to Work with a Code Base
In this tutorial, we use LangChain + Deep Lake together with GPT to analyze the code base of LangChain itself.

## Design

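1. Prepare data:
   1. Load all Python project files using `langchain.document_loaders.TextLoader`. We call these files the "documents".
   2. Split all documents into chunks using `langchain.text_splitter.CharacterTextSplitter`.
   3. Embed the chunks and upload them to Deep Lake using `langchain.embeddings.openai.OpenAIEmbeddings` and `langchain.vectorstores.DeepLake`.
2. Question answering:
   1. Build a chain from `langchain.chat_models.ChatOpenAI` and `langchain.chains.ConversationalRetrievalChain`.
   2. Prepare the questions.
   3. Run the chain to get the answers.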

## Implementation

### Integration preparations

We need to set up keys for external services and install the necessary Python libraries.

In [20]:
#!python3 -m pip install --upgrade langchain deeplake openai python-dotenv

Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate.

For full documentation of Deep Lake, see https://docs.activeloop.ai/ and the API reference at https://docs.deeplake.ai/en/latest/.

Authenticate with Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at [app.activeloop.ai](https://app.activeloop.ai).

In [24]:
import os
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings

load_dotenv()
org = os.getenv('ACTIVELOOP_ORG')
os.environ['ACTIVELOOP_TOKEN'] = os.getenv('ACTIVELOOP_TOKEN')
os.environ['ACTIVELOOP_ORG'] = org

embeddings = OpenAIEmbeddings(disallowed_special=())
dataset_path = 'hub://' + org + '/langchain-code'
print(f"Dataset path: {dataset_path}")
embeddings

Dataset path: hub://gnehcgnaw/langchain-code


OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='2022-12-01', openai_api_base=None, openai_api_type=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special=set(), chunk_size=1000, max_retries=6)
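
`os.getenv` returns `None` when a variable is missing, in which case the assignments to `os.environ` and the string concatenation for `dataset_path` in the cell above fail with a `TypeError` that does not name the missing variable. A minimal sketch of a guard follows; `require_env` is a hypothetical helper name introduced here, not part of LangChain or Deep Lake.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper: it only wraps os.getenv with an explicit check.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, `org = require_env('ACTIVELOOP_ORG')` would stop early with a readable message instead of failing later in the cell.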

### Prepare data 

Load all repository files. Here we assume this notebook was downloaded as part of a langchain fork and that we work with the Python files of the `langchain` repo.

If you want to use files from a different repo, change `root_dir` to the root directory of your repo.

In [22]:
from langchain.document_loaders import TextLoader

root_dir = '../../../..'

docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        if file.endswith('.py') and '/.venv/' not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
                docs.extend(loader.load_and_split())
            except Exception:
                # Skip files that cannot be read or decoded as UTF-8
                pass
print(f'{len(docs)}')
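
The substring test `'/.venv/' not in dirpath` assumes POSIX path separators, and it does not match files that sit directly inside `.venv`, because `dirpath` then ends in `/.venv` with no trailing slash. One possible alternative, sketched below with only the standard library, prunes excluded directories from the walk itself. The function name `iter_python_files` and the `excluded_dirs` parameter are illustrative assumptions.

```python
import os


def iter_python_files(root_dir, excluded_dirs=(".venv",)):
    """Yield paths of .py files under root_dir, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if name.endswith(".py"):
                yield os.path.join(dirpath, name)
```

Each yielded path could then be passed to `TextLoader` exactly as in the loop above.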


Then, chunk the files.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)}")
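
To illustrate what separator-based chunking does, here is a simplified sketch, not the library's code. It assumes a blank-line separator and no overlap, matching `chunk_overlap=0` above; `split_text` is a hypothetical name. Pieces are merged greedily until adding the next one would exceed `chunk_size`, and a single piece longer than `chunk_size` is kept whole.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("aaa\n\nbbb\n\ncccc", chunk_size=8))
```

A smaller `chunk_size` yields more, shorter chunks and therefore more embedding calls in the next step.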

Then embed the chunks and upload them to Deep Lake.

This can take several minutes.

In [25]:
# Possible issue: https://github.com/hwchase17/langchain/issues/923
# Workaround: https://github.com/shashnkvats/PdfPal/issues/1
from langchain.vectorstores import DeepLake
# The Deep Lake token expires after 1 day by default and then has to be regenerated
DeepLake.force_delete_by_path(dataset_path)
db = DeepLake.from_documents(texts, embeddings, dataset_path=dataset_path)
db

 

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/gnehcgnaw/langchain-code

hub://gnehcgnaw/langchain-code loaded successfully.

Evaluating ingest: 61%|██████▏   | 49/80 [08:56<05:56
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: The server is currently overloaded with other requests. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists..
Evaluating ingest: 100%|██████████| 80/80 [17:16<00:00
 

Dataset(path='hub://gnehcgnaw/langchain-code', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (81406, 1536)  float32   None   
    ids      text     (81406, 1)      str     None   
 metadata    json     (81406, 1)      str     None   
   text      text     (81406, 1)      str     None   


<langchain.vectorstores.deeplake.DeepLake at 0x7feb53bdff10>

### Question Answering
First load the dataset, construct the retriever, and then build the conversational chain.

In [26]:
db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)

 



In [27]:
# as_retriever() wraps the vector store in a retriever; search options are set through search_kwargs
retriever = db.as_retriever()
# distance_metric: similarity measure used for the search; 'cos' selects cosine similarity
retriever.search_kwargs['distance_metric'] = 'cos'
# fetch_k: number of candidate documents fetched before MMR re-ranking
retriever.search_kwargs['fetch_k'] = 20
# maximal_marginal_relevance: use the MMR algorithm, which re-ranks candidates to balance relevance and diversity
retriever.search_kwargs['maximal_marginal_relevance'] = True
# k: number of documents finally returned
retriever.search_kwargs['k'] = 20
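
To make the roles of `fetch_k` and `k` concrete, here is a minimal, dependency-free sketch of maximal marginal relevance over toy 2-D vectors. The names `cosine` and `mmr` and the `lambda_mult` trade-off parameter are illustrative assumptions, not the actual Deep Lake or LangChain implementation. The `candidates` list stands in for the `fetch_k` nearest chunks, and `k` is how many of them are kept.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8]]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # favors diversity
```

Note that when `k` equals `fetch_k`, as in the cell above, every fetched candidate is returned, so MMR can only change their order.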

You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter).

In [28]:
# This filter drops documents whose text contains 'something' and keeps only documents whose source path contains 'only_this' or 'also_that'
def filter(x):
    # filter based on source code
    if 'something' in x['text'].data()['value']:
        return False
    
    # filter based on path e.g. extension
    metadata =  x['metadata'].data()['value']
    return 'only_this' in metadata['source'] or 'also_that' in metadata['source']

### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
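
The keep-or-drop semantics of the filter can be checked without a dataset. The sketch below applies the same conditions to plain dictionaries; `keep_sample` and the sample records are hypothetical and only mirror the logic of the function above, where returning `True` keeps a document.

```python
def keep_sample(text: str, source: str) -> bool:
    """Mirror of the filter logic above: True keeps the document, False drops it."""
    if "something" in text:
        return False
    return "only_this" in source or "also_that" in source


# Hypothetical records standing in for dataset samples.
samples = [
    {"text": "def f(): pass", "source": "src/only_this/a.py"},
    {"text": "something here", "source": "src/only_this/b.py"},
    {"text": "def g(): pass", "source": "src/other/c.py"},
]
kept = [s["source"] for s in samples if keep_sample(s["text"], s["source"])]
print(kept)
```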

In [29]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model='gpt-3.5-turbo')  # or another chat model such as 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [31]:
questions = [
    "How do I use AIMessagePromptTemplate?",
    # "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
]
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
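
The loop above passes the growing `chat_history` into every call, so each question is answered in the context of the earlier turns. A small sketch of that pattern with a stand-in chain follows; `ask_all` and `stub_chain` are hypothetical names, and the stub only reports how many earlier turns it received rather than calling a model.

```python
def ask_all(chain, questions):
    """Ask questions in order, feeding earlier (question, answer) pairs back in."""
    chat_history = []
    for question in questions:
        result = chain({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


def stub_chain(inputs):
    # Hypothetical stand-in for the real chain: no model call is made.
    turns = len(inputs["chat_history"])
    return {"answer": f"{inputs['question']} (after {turns} earlier turns)"}


for question, answer in ask_all(stub_chain, ["first", "second"]):
    print(question, "->", answer)
```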


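-> **Question**: How do I use AIMessagePromptTemplate?

**Answer**: AIMessagePromptTemplate is a template class for generating AI messages. You can use it to create a ChatPromptTemplate that contains an AI message. It is used much like the other PromptTemplate classes: you pass in a string template for the AI message, and the template may contain variables. When you call the format method, you supply values for those variables to produce the final AI message.

For example, suppose you have a ChatPromptTemplate that contains an AI message. You can use AIMessagePromptTemplate to generate the AI message for that template: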

```
from langchain.prompts.chat import AIMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

prompt_template = ChatPromptTemplate(
    [
        HumanMessagePromptTemplate("What's your favorite color?"),
        AIMessagePromptTemplate("My favorite color is {color}.")
    ]
)

prompt = prompt_template.format(color="blue")
```

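In this example, AIMessagePromptTemplate generates the AI message, which contains the variable {color}. When calling the format method, you must pass a value for the color parameter so that the variable is replaced with the actual value.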



-> **Question**: What is the class hierarchy? 

**Answer**: There are several class hierarchies in the provided code, so I'll list a few:

1. `BaseModel` -> `ConstitutionalPrinciple`: `ConstitutionalPrinciple` is a subclass of `BaseModel`.
2. `BasePromptTemplate` -> `StringPromptTemplate`, `AIMessagePromptTemplate`, `BaseChatPromptTemplate`, `ChatMessagePromptTemplate`, `ChatPromptTemplate`, `HumanMessagePromptTemplate`, `MessagesPlaceholder`, `SystemMessagePromptTemplate`, `FewShotPromptTemplate`, `FewShotPromptWithTemplates`, `Prompt`, `PromptTemplate`: All of these classes are subclasses of `BasePromptTemplate`.
3. `APIChain`, `Chain`, `MapReduceDocumentsChain`, `MapRerankDocumentsChain`, `RefineDocumentsChain`, `StuffDocumentsChain`, `HypotheticalDocumentEmbedder`, `LLMChain`, `LLMBashChain`, `LLMCheckerChain`, `LLMMathChain`, `LLMRequestsChain`, `PALChain`, `QAWithSourcesChain`, `VectorDBQAWithSourcesChain`, `VectorDBQA`, `SQLDatabaseChain`: All of these classes are subclasses of `Chain`.
4. `BaseLoader`: `BaseLoader` is a subclass of `ABC`.
5. `BaseTracer` -> `ChainRun`, `LLMRun`, `SharedTracer`, `ToolRun`, `Tracer`, `TracerException`, `TracerSession`: All of these classes are subclasses of `BaseTracer`.
6. `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, `CohereEmbeddings`, `JinaEmbeddings`, `LlamaCppEmbeddings`, `HuggingFaceHubEmbeddings`, `TensorflowHubEmbeddings`, `SagemakerEndpointEmbeddings`, `HuggingFaceInstructEmbeddings`, `SelfHostedEmbeddings`, `SelfHostedHuggingFaceEmbeddings`, `SelfHostedHuggingFaceInstructEmbeddings`, `FakeEmbeddings`, `AlephAlphaAsymmetricSemanticEmbedding`, `AlephAlphaSymmetricSemanticEmbedding`: All of these classes are subclasses of `BaseLLM`. 


-> **Question**: What classes are derived from the Chain class? 

**Answer**: There are multiple classes that are derived from the Chain class. Some of them are:
- APIChain
- AnalyzeDocumentChain
- ChatVectorDBChain
- CombineDocumentsChain
- ConstitutionalChain
- ConversationChain
- GraphQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMCheckerChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAPIEndpointChain
- PALChain
- QAWithSourcesChain
- RetrievalQA
- RetrievalQAWithSourcesChain
- SequentialChain
- SQLDatabaseChain
- TransformChain
- VectorDBQA
- VectorDBQAWithSourcesChain

There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.


-> **Question**: What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests? 

**Answer**: All classes and functions in the `./langchain/utilities/` folder seem to have unit tests written for them. 
