prompt docs #793

Merged
merged 17 commits into from
Jul 5, 2024
@@ -1,7 +1,101 @@
---
title: ""
description: "Active prompting is a method used to identify the most effective examples for human annotation."
keywords: ""
---

When we have a large pool of unlabeled examples that could be used in a prompt, how should we decide which examples to manually label?

Active prompting is a method used to identify the most effective examples for human annotation. The process involves four key steps:

1. **Uncertainty Estimation**: Assess the uncertainty of the LLM's predictions on each possible example
2. **Selection**: Choose the most uncertain examples for human annotation
3. **Annotation**: Have humans label the selected examples
4. **Inference**: Use the newly labeled data to improve the LLM's performance

## Uncertainty Estimation

In this step, we define an unsupervised method to measure the uncertainty of an LLM in answering a given example.

!!! example "Uncertainty Estimation Example"

    Let's say we ask an LLM the following query:

    > query = "Classify the sentiment of this sentence as positive or negative: I am very excited today."

    and the LLM returns:

    > response = "positive"

The goal of uncertainty estimation is to answer: **How sure is the LLM in this response?**

To do this, we query the LLM with the same example _k_ times. Then, we measure how dissimilar the _k_ responses are from one another. Three possible metrics<sup><a href="https://arxiv.org/abs/2302.12246">1</a></sup> are:

1. Disagreement
2. Entropy
3. Variance

Below is an example of uncertainty estimation for a single input example using the disagreement uncertainty metric.

```python
import instructor
from pydantic import BaseModel
from openai import OpenAI


class Response(BaseModel):
    height: int


client = instructor.from_openai(OpenAI())


def query_llm():
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Response,
        messages=[
            {
                "role": "user",
                "content": "How tall is the Empire State Building in meters?",
            }
        ],
    )


def calculate_disagreement(responses):
    # Disagreement = number of unique responses h divided by the number of queries k
    unique_responses = set(responses)
    h = len(unique_responses)
    return h / k


if __name__ == "__main__":
    k = 5  # (1)!
    responses = [query_llm() for _ in range(k)]  # Query the LLM k times
    for response in responses:
        print(response)
        #> height=443
        #> height=443
        #> height=443
        #> height=443
        #> height=381

    print(
        calculate_disagreement([response.height for response in responses])
    )  # Calculate the uncertainty metric
    #> 0.4
```

1. _k_ is the number of times to query the LLM with a single unlabeled example

This process will then be repeated for all unlabeled examples.
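
The example above uses the disagreement metric. As a rough sketch of the other two metrics, the entropy of the empirical answer distribution and the variance of numeric answers can be computed from the same _k_ responses; the helper names below are illustrative and the paper may normalize these quantities differently.

```python
import math
from collections import Counter


def calculate_entropy(responses, k):
    # Natural-log entropy of the empirical answer distribution:
    # answers spread more evenly across values -> higher entropy -> higher uncertainty
    counts = Counter(responses)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def calculate_variance(responses, k):
    # Variance of the k responses (only meaningful for numeric answers)
    mean = sum(responses) / k
    return sum((r - mean) ** 2 for r in responses) / k


responses = [443, 443, 443, 443, 381]  # the heights returned in the example above
print(round(calculate_entropy(responses, k=5), 2))
#> 0.5
print(round(calculate_variance(responses, k=5), 2))
#> 615.04
```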

## Selection & Annotation

Once we have a set of examples and their uncertainties, we can select _n_ of them to be annotated by humans. Here, we choose the examples with the highest uncertainties.
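
As a minimal sketch of this step, assuming each unlabeled example has already been scored (the `uncertainty_by_example` values below are illustrative):

```python
def select_most_uncertain(uncertainty_by_example: dict[str, float], n: int) -> list[str]:
    # Rank examples from most to least uncertain and keep the top n for human annotation
    ranked = sorted(uncertainty_by_example, key=uncertainty_by_example.get, reverse=True)
    return ranked[:n]


uncertainty_by_example = {
    "Classify: I am very excited today.": 0.2,
    "Classify: The service was fine, I guess.": 0.8,
    "Classify: Worst purchase I've ever made.": 0.2,
}
print(select_most_uncertain(uncertainty_by_example, n=1))
#> ['Classify: The service was fine, I guess.']
```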

## Inference

Now, each time the LLM is prompted, we can include the newly annotated examples as few-shot demonstrations, as in the sketch below.
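
Below is a minimal sketch of this step, assuming the human annotations are stored as simple question/label pairs; the `SentimentLabel` model and the example data are illustrative.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class SentimentLabel(BaseModel):  # illustrative response model
    label: str


# Illustrative human annotations produced in the previous step
annotated_examples = [
    {"question": "Classify: The service was fine, I guess.", "label": "positive"},
    {"question": "Classify: Worst purchase I've ever made.", "label": "negative"},
]

few_shot_block = "\n".join(
    f"Q: {ex['question']}\nA: {ex['label']}" for ex in annotated_examples
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_model=SentimentLabel,
    messages=[
        {
            "role": "user",
            "content": f"{few_shot_block}\nQ: Classify: I am very excited today.\nA:",
        }
    ],
)
print(response.label)
```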

## References

<sup id="ref-1">1</sup>: [Active Prompting with Chain-of-Thought for Large Language Models](https://arxiv.org/abs/2302.12246)

<sup id="ref-asterisk">\*</sup>: [The Prompt Report: A Systematic Survey of Prompting Techniques](https://arxiv.org/abs/2406.06608)
@@ -1,7 +1,150 @@
---
title: ""
description: "Automate few-shot chain of thought to choose diverse examples"
keywords: ""
---

How can we improve the performance of few-shot CoT?

While few-shot CoT reasoning is effective, it relies on manually crafted examples. Furthermore, choosing diverse examples has been shown to reduce reasoning errors from CoT.

Here, we automate the selection of diverse examples for CoT. Given a list of potential examples:

1. **Cluster**: Cluster potential examples
2. **Sample**: For each cluster,
    1. Sort examples by distance from the cluster center
    2. Select the first example that meets a predefined selection criterion
3. **Prompt**: Incorporate the chosen questions from each cluster as examples in the LLM prompt

!!! info

    A sample selection criterion could be limiting the number of reasoning steps to a maximum of 5, which encourages sampling examples with simpler rationales.

```python hl_lines="72 75 106"
import instructor
import numpy as np
from openai import OpenAI
from pydantic import BaseModel
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

client = instructor.patch(OpenAI())
NUM_CLUSTERS = 2


class Example(BaseModel):
    question: str
    reasoning_steps: list[str]


class FinalAnswer(BaseModel):
    reasoning_steps: list[str]
    answer: int


def cluster_and_sort(questions, n_clusters=NUM_CLUSTERS):
    # Cluster
    embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(questions)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit(embeddings)

    # Sort
    sorted_clusters = [[] for _ in range(kmeans.n_clusters)]
    for question, embedding, label in zip(questions, embeddings, kmeans.labels_):
        center = kmeans.cluster_centers_[label]
        distance = np.linalg.norm(embedding - center)
        sorted_clusters[label].append((distance, question))
    for cluster in sorted_clusters:
        cluster.sort()  # Sort by distance

    return sorted_clusters


def sample(cluster):
    for _, question in cluster:  # Each entry is a (distance, question) pair
        response = client.chat.completions.create(
            model="gpt-4o",
            response_model=Example,
            messages=[
                {
                    "role": "system",
                    "content": "You are an AI assistant that generates step-by-step reasoning for mathematical questions.",
                },
                {
                    "role": "user",
                    "content": f"Q: {question}\nA: Let's think step by step.",
                },
            ],
        )
        if (
            len(response.reasoning_steps) <= 5
        ):  # If we satisfy the selection criteria, we've found our question for this cluster
            return response


if __name__ == "__main__":
    questions = [
        "How many apples are left if you have 10 apples and eat 3?",
        "What's the sum of 5 and 7?",
        "If you have 15 candies and give 6 to your friend, how many do you have left?",
        "What's 8 plus 4?",
        "You start with 20 stickers and use 8. How many stickers remain?",
        "Calculate 6 added to 9.",
    ]

    # Cluster and sort the questions
    sorted_clusters = cluster_and_sort(questions)

    # Sample questions that match selection criteria for each cluster
    selected_examples = [sample(cluster) for cluster in sorted_clusters]
    print(selected_examples)
    """
    [
        Example(
            question='If you have 15 candies and give 6 to your friend, how many do you have left?',
            reasoning_steps=[
                'Start with the total number of candies you have, which is 15.',
                'Subtract the number of candies you give to your friend, which is 6, from the total candies.',
                '15 - 6 = 9, so you are left with 9 candies.',
            ],
        ),
        Example(
            question="What's the sum of 5 and 7?",
            reasoning_steps=[
                'Identify the numbers to be added: 5 and 7.',
                'Perform the addition: 5 + 7.',
                'The sum is 12.',
            ],
        ),
    ]
    """

    # Use selected questions as examples for the LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        response_model=FinalAnswer,
        messages=[
            {
                "role": "user",
                "content": f"""
                {selected_examples}
                If there are 10 books in my bag and I read 8 of them, how many books do I have left? Let's think step by step.
                """,
            }
        ],
    )

    print(response.reasoning_steps)
    """
    [
        'Start with the total number of books in the bag, which is 10.',
        "Subtract the number of books you've read, which is 8, from the total books.",
        '10 - 8 = 2, so you have 2 books left.',
    ]
    """
    print(response.answer)
    #> 2
```

### References

<sup id="ref-1">1</sup>: [Automatic Chain of Thought Prompting in Large Language Models](https://arxiv.org/abs/2210.03493)

<sup id="ref-asterisk">\*</sup>: [The Prompt Report: A Systematic Survey of Prompting Techniques](https://arxiv.org/abs/2406.06608)