## Get LCEL-related questions from `chat-langchain`

This notebook extracts chat history from the `chat-langchain` LangSmith project and uses it to collect LCEL-related questions.

See [here](https://raw.githubusercontent.com/hinthornw/lspopscripts/main/download_runs.py) if you want code to get full traces.

Set the correct `LANGCHAIN_API_KEY` for `chat-langchain`.
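
A minimal sketch of checking that the key is present before the client is created. The helper name `require_env` is hypothetical and not part of the original notebook; it only inspects the environment and does not validate the key against LangSmith.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise a clear error."""
    # `env` is injectable so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return value


# Intended use in this notebook (left commented so the sketch runs anywhere):
# require_env("LANGCHAIN_API_KEY")
```

Failing early here avoids waiting on a long download only to hit an authentication error.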

## Get Questions


In [24]:
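import csv
import datetime
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice

import langsmith
from tqdm import tqdm

client = langsmith.Client()

# Defined before download_data because it is used as a default argument value.
yesterday = datetime.datetime.now() - datetime.timedelta(days=1)

def download_data(
    project_name: str,
    nested: bool = False,
    since: datetime.datetime = yesterday,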
    exclude_followups: bool = True,
    filename: str = "fetched_data.csv",
):
    """
    Downloads and saves data from Langsmith runs to a CSV file.

    This function retrieves run data from the Langsmith project specified by 'project_name'.
    It extracts 'question' and 'output' from each run's inputs and outputs, respectively,
    and saves them into a CSV file. The function can handle both nested and non-nested runs.
    Follow-up runs can be excluded if desired.

    Parameters:
    project_name (str): The name of the Langsmith project to retrieve data from.
    nested (bool): Set to True to handle nested runs; False by default.
    since (datetime): The start time from which to retrieve runs; defaults to yesterday.
    exclude_followups (bool): Set to True to exclude follow-up runs; True by default.
    filename (str): The name of the file to save the data to; defaults to 'fetched_data.csv'.
    """
    traces = client.list_runs(
        project_name=project_name, start_time=since, execution_order=1
    )
    batch_size = 10
    executor = ThreadPoolExecutor(max_workers=batch_size) if nested else None

    with open(filename, 'w', newline='', encoding='utf-8') as file_handle:
        csv_writer = csv.writer(file_handle)
        # Write the header
        csv_writer.writerow(['question', 'output'])

        try:
            if nested:
                pbar = tqdm()
                while True:
                    batch = list(islice(traces, batch_size))
                    if not batch:
                        break
                    futures = [
                        executor.submit(client.read_run, run.id, load_child_runs=True)
                        for run in batch
                    ]
                    for future in as_completed(futures):
                        loaded_run = future.result()
                        loaded_run_json = json.loads(loaded_run.json())
                        question = loaded_run_json['inputs'].get('question', '')
                        # 'outputs' can be None (e.g. for a run that errored).
                        output = (loaded_run_json.get('outputs') or {}).get('output', '')
                        csv_writer.writerow([question, output])
                    pbar.update(len(batch))
            else:
                for run in tqdm(traces):
                    if exclude_followups and run.inputs.get("chat_history"):
                        continue
                    run_json = json.loads(run.json())
                    question = run_json['inputs'].get('question', '')
                    # 'outputs' can be None (e.g. for a run that errored).
                    output = (run_json.get('outputs') or {}).get('output', '')
                    csv_writer.writerow([question, output])

        finally:
            if executor:
                executor.shutdown()
    
    print(f"Saved to {filename}")

# Call the function
yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
window_30_day = datetime.datetime.now() - datetime.timedelta(days=30)
download_data(project_name="chat-langchain",
              since=window_30_day)

152090it [42:48, 59.22it/s]

Saved to fetched_data.csv
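
The nested branch of `download_data` drains the run iterator in fixed-size batches with `islice`. A minimal sketch of that batching pattern on its own, using a plain range in place of LangSmith runs; the helper name `batched` is hypothetical.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items until the iterator is exhausted."""
    # iter() matters: islice on a list would restart from the beginning each time.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In the notebook, `client.list_runs` already returns an iterator, so `islice(traces, batch_size)` keeps advancing through it on every call.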





## Read Extracted QA Pairs

In [1]:
import pandas as pd
filename = 'fetched_data.csv'
df = pd.read_csv(filename)

## Filter for LCEL in the question

In [3]:
from langchain.schema import Document

search_term = 'LCEL'
filtered_df = df[df['question'].str.contains(search_term, case=False, na=False)]

# Drop duplicate questions; copy so a 'cluster' column can be added
# later without pandas raising a SettingWithCopyWarning
unique_questions_df = filtered_df.drop_duplicates(subset='question').copy()

# Extract the 'question' column and convert it to a list
unique_questions = unique_questions_df['question'].tolist()
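
Note that `drop_duplicates` compares exact strings, so questions that differ only in case or spacing are all kept. If that matters, a looser de-duplication could look like the sketch below. It is an optional extra step, not something the original notebook does, and `dedupe_normalized` is a hypothetical helper name.

```python
def dedupe_normalized(questions):
    """Keep the first occurrence of each question, ignoring case and extra whitespace."""
    seen = set()
    kept = []
    for question in questions:
        # Normalize only for comparison; the original text is what gets kept.
        key = " ".join(question.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(question)
    return kept


sample = ["What is LCEL?", "what is  lcel?", "How do I stream with LCEL?"]
print(dedupe_normalized(sample))
# ['What is LCEL?', 'How do I stream with LCEL?']
```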

## Cluster

Some of the questions are highly verbose and contain large code blocks.

Let's cluster the questions so that these verbose ones are grouped together (and can be more easily ignored).

In [None]:
# Embed and cluster 

from langchain.embeddings.openai import OpenAIEmbeddings
embd = OpenAIEmbeddings()
question_embeddings = embd.embed_documents(unique_questions)

from sklearn.cluster import KMeans
clustering_model = KMeans(n_clusters=5, random_state=0)
clusters = clustering_model.fit_predict(question_embeddings)
unique_questions_df['cluster'] = clusters

def fmt_qus(df):

    unique_questions = df['question'].tolist()
    formatted_unique_questions = '--- --- \n --- --- '.join(unique_questions)
    return formatted_unique_questions

# Get unique values in the 'cluster' column
all_clusters = unique_questions_df['cluster'].unique()

# Process each cluster
cluster_context=[]
for i in all_clusters:
    df_cluster = unique_questions_df[unique_questions_df['cluster'] == i]
    formatted_questions = fmt_qus(df_cluster)
    cluster_context.append(formatted_questions)
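
One way to spot which cluster collected the verbose, code-heavy questions is to compare question length per cluster. This is a sketch under that assumption (length as a proxy for verbosity), and `median_length_by_cluster` is a hypothetical helper, not part of the original notebook.

```python
from collections import defaultdict
from statistics import median


def median_length_by_cluster(questions, labels):
    """Map each cluster label to the median character length of its questions."""
    lengths = defaultdict(list)
    for question, label in zip(questions, labels):
        lengths[label].append(len(question))
    return {label: median(values) for label, values in lengths.items()}


# Toy data: cluster 1 holds the long questions.
toy_questions = ["What is LCEL?", "x" * 500, "Can I use agents?", "y" * 700]
toy_labels = [0, 1, 0, 1]
stats = median_length_by_cluster(toy_questions, toy_labels)
print(stats)
```

With the notebook's variables this would be called as `median_length_by_cluster(unique_questions, clusters)`; a cluster whose median is far above the others is a likely candidate to ignore.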

## Summarize

Summarize major question themes in each cluster.

This isolates lower-quality, more verbose questions into their own cluster, limiting pollution of the overall themes.

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Prompt template
template = """Here is a set of questions input to LangChain QA system. \n

They are related to LCEL, LangChain Expression Language. \n

Reason about the questions, first. \n

Then, give me a list of the top 10 question themes.

Give me one representative question per theme.

Questions:
{context}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")
chain = prompt | model | StrOutputParser()

answers = []
for c in cluster_context:
    answers.append(chain.invoke({"context":c}))
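
If a cluster's joined context turns out to be too long for the model, one option is to cap the length of each question before formatting it. The sketch below is an assumption about how that could be done, not a step in the original notebook; `truncate_questions` and the 500-character default are arbitrary, hypothetical choices.

```python
def truncate_questions(questions, max_chars=500):
    """Shorten each question to at most max_chars characters, marking cuts with ' ...'."""
    return [
        question if len(question) <= max_chars else question[:max_chars] + " ..."
        for question in questions
    ]


shortened = truncate_questions(["What is LCEL?", "a" * 10], max_chars=4)
print(shortened)  # ['What ...', 'aaaa ...']
```

It could be applied to the question list before the questions are joined into a single context string.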

## Themes

We can look at the group summaries in LangSmith.

* https://smith.langchain.com/public/69d0b729-cd8c-4d4b-859d-6e5ee683fc7a/r

```
1. **Basic Understanding of LCEL**
   - What is LCEL?

2. **LCEL Integration with Agents**
   - Can I use agents with LCEL?

3. **LCEL Coding and Implementation Examples**
   - Code me a question answering example with LCEL.

4. **LCEL with Memory and Storage**
   - How to use VectorStoreRetrieverMemory in LCEL?

5. **LCEL Configuration and Settings**
   - How to set verbose true for LCEL?

6. **LCEL with Retrieval-Augmented Generation (RAG)**
   - Can you give me an example to run a simple RAG using LCEL in Python?

7. **LCEL Asynchronous Operations**
   - LCEL 异步invoke (LCEL asynchronous invoke)

8. **LCEL Error Handling and Debugging**
   - How can I get the finish reason using LCEL?

9. **LCEL with Multiple Inputs and Variables**
   - How to use multiple partial variables in LCEL?

10. **LCEL Advanced Features and Customization**
   - Can LCEL execute custom python functions?
```

* https://smith.langchain.com/public/b3aba7b6-e877-4d99-bede-e607138fe171/r

```
1. **Parallel and Asynchronous Execution**: Questions about running multiple chains in parallel or asynchronously.
   - Representative question: "I want to run three chains in parallel. They share the same input variables, but produce different output objects. How do I do this with LCEL?"

2. **Custom Functions and Configurations**: How to include custom functions or add configurable fields to a chain.
   - Representative question: "How to include a custom function as part of an LCEL chain?"

3. **Memory Management**: Questions about how memory is handled within chains, including buffer memory and conversation memory.
   - Representative question: "I have a LCEL chain with e.g. buffer memory, and I serve it via Langserve. When is the memory reset? Do all API calls use the same memory under the hood?"

4. **Chain Composition and Modularity**: How to compose chains from multiple components or steps, and how to pass data between them.
   - Representative question: "How can I connect several chains, i.e. the output of the former chain is the input of the latter chain? Can I achieve this through LCEL?"

5. **Error Handling and Retries**: How to handle errors and implement retries within a chain.
   - Representative question: "How can I use a RetryWithErrorOutputParser in a LCEL chain?"

6. **Retrieval and Querying**: Questions about setting up retrieval chains, including those with specific querying capabilities.
   - Representative question: "How to create a Retrieval QA chain with streaming, using LCEL?"

7. **Verbose and Debugging**: How to enable verbose output or debugging within a chain.
   - Representative question: "How to set verbose True in LangChain Expression Language (LCEL)?"

8. **Integration with External Services**: Questions about integrating LCEL chains with external services or databases.
   - Representative question: "I need a LCEL chain that takes a YouTube link and transcribes it with Whisper."

9. **Chain Customization and Enhancement**: How to enhance chains with additional features like callbacks, custom parsers, or specific output formatting.
   - Representative question: "How do I pass a pre-written history variable into my LCEL chain?"

10. **Understanding LCEL Fundamentals**: Basic questions about what LCEL is and how to use it effectively.
    - Representative question: "What is LangChain Expression Language (LCEL)?"
```

* https://smith.langchain.com/public/3346bc94-146e-451a-9af4-7b7fe07d9a84/r

```
1. **LCEL Chain Construction**: How to build and structure chains using LCEL components.
   - Representative question: "Give an example of an LCEL chain with LLMSingleActionAgent and AgentExecutor."

2. **Output Parsing and Formatting**: How to parse and format the output from LCEL chains.
   - Representative question: "What LangChain tool can I use to parse this output into a single message?"

3. **Component Ordering and Interaction**: Understanding the order and interaction between components in an LCEL chain.
   - Representative question: "When using LCEL, is the order of the chained components arbitrary?"

4. **Custom Agents and Tools Integration**: How to integrate custom agents and tools within an LCEL chain.
   - Representative question: "I would like to use my own custom agent in an LCEL chain. How do I build this chain?"

5. **Conditional Logic and Prompts**: Implementing conditional logic and handling prompts in LCEL.
   - Representative question: "How to conditionally choose between prompts in LCEL."

6. **Memory and Conversation History**: Utilizing memory and conversation history within LCEL chains.
   - Representative question: "Conversation chain with memory using LCEL."

7. **Runnable and Agent Configuration**: Configuring and using Runnables and agents in LCEL.
   - Representative question: "How do I configure ReAct agent 'Thought' with custom OutputParser and Custom Agent, using LCEL?"

8. **LCEL Syntax and Expressions**: Understanding and using the syntax and expressions specific to LCEL.
   - Representative question: "Can I create an LCEL chain with prompt templates having no input variables?"

9. **LCEL with Specific Models and Tools**: Using LCEL with specific models like GPT-4 and tools like vectorstore retrievers.
   - Representative question: "Can you show me an example of an agent using gpt4 with a web search tool and memory? All using LCEL."

10. **LCEL in Different Environments and Applications**: Applying LCEL in various environments and for different types of applications.
    - Representative question: "Provide LCEL code for a simple chat app using Azure OpenAI."
```

* https://smith.langchain.com/public/47648f18-b543-477c-91cf-84d612eb6810/r

```
1. **Error Handling in LCEL Chains**
   - Representative Question: "This code gives me this error TypeError: Expected a Runnable, callable or dict. Instead got an unsupported type: <class 'str'>"

2. **Integration of LangChain with AI Models**
   - Representative Question: "Create the LCEL chain using ChatOpenAI with a specific model and temperature settings."

3. **PDF Processing with PyMuPDF**
   - Representative Question: "Convert a PDF page to a pixmap using the PyMuPDF library."

4. **Base64 Encoding of Images**
   - Representative Question: "Encode a pixmap to a base64 string for image processing."

5. **Template Formatting and Data Injection**
   - Representative Question: "Define the prompt templates and format them with dynamic data for the AI model."

6. **AI-Assisted Data Interpretation**
   - Representative Question: "Use the AI model to assist in marking images using a provided mark scheme."

7. **File I/O Operations**
   - Representative Question: "Write the results of the LCEL chain to a file."

8. **Debugging Lambda Functions in LCEL**
   - Representative Question: "Change the RunnableLambda to RunnablePassthrough from the start of the template."

9. **Understanding LCEL Chain Outputs**
   - Representative Question: "What will be the type of marking_output and how to make it a string?"

10. **Correct Usage of LCEL Components**
    - Representative Question: "The output of the LCEL chain isn't a string; find a way to fix it."
```

We do some manual curation, and put our final question set into `eval/eval.csv`.