Notebook to experiment with the following end to end process: from dataset+task in NL, to typology based diagram, to design recommendation.

In [1]:
from langchain_community.llms import Ollama
from langchain.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import Ollama
from langchain.text_splitter import TokenTextSplitter
from pprint import pprint

from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

from langchain.schema.runnable import RunnableMap
from langchain_core.prompts import PromptTemplate

In [2]:
model = Ollama(model="llama3:8b", temperature=0)

## Determine the end goal/decision
Input: dataset description

Output: The decision task

In [3]:
documents = []
documents.extend(PyPDFLoader("docs/dm.pdf").load())

In [4]:
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=25)
docs = text_splitter.split_documents(documents)
len(docs)

260

In [5]:
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

persist_directory = 'docs/chroma/'
!rm -rf docs/chroma  # remove old database files if any

vectordb = Chroma.from_documents( # had an error previuously, downgraded to chromadb version 0.4.3 using command: pip install chromadb==0.4.3. See https://github.com/zylon-ai/private-gpt/issues/1012
    documents=docs,
    embedding=hf,
    persist_directory=persist_directory,
)
retriever = vectordb.as_retriever()

print(vectordb._collection.count())

  from tqdm.autonotebook import tqdm, trange


260


In [6]:
nl_to_typology_goal_template = """Imagine you are a visualization designer who wants to understand what are the decisions an expert in embryology and in vitro fertilization would make when designing a visualization.
You are tasked with taking the dataset and the task the expert is trying to accomplish and translating the task into one of the decision making tasks that appear in Typology of
Decision-Making Tasks for Visualization paper.

The three possible decision tasks are: CHOOSE, ACTIVATE and CREATE. Give a brief explanation of the decision making task you chose and why you think it is the most appropriate for the task at hand.
When providing reasons, give explanations that relate to the definitions of the three tasks as described in the Typology of Decision-Making Tasks for Visualization paper.
The dataset description, task description, and typology of decision making tasks paper are given below. 

Relevant context from the Typology of Decision-Making Tasks for Visualization paper: {context}
Data Description: {data_description}
Task Description: {task_description}
"""
# prompt = ChatPromptTemplate.from_template(template)
nl_to_typology_goal_prompt_template = PromptTemplate.from_template(nl_to_typology_goal_template)

In [7]:
dm_task_definitions_question = "What are the three decision making tasks in my typology?"
retriever.get_relevant_documents(dm_task_definitions_question)

  warn_deprecated(


[Document(page_content=' a typology for decision-making tasks in visualiza-\ntion, addressing the limitations of existing taxonomies. Built upon prior\nresearch and informed by design goals derived from a thorough liter-\nature review, the typology comprises three decision tasks: CHOOSE,\nACTIV ATE, and CREATE. These tasks allow for the representation\nof complex decision-making structures, as they can be composed or\ndecomposed into other tasks. The typology demonstrates completeness,', metadata={'page': 8, 'source': 'docs/dm.pdf'}),
 Document(page_content=' real-world visualization\nsystems.\n4.1 Decision-Making Tasks\nOur typology consists of three tasks derived from the scientific\nliterature [27, 28] : CHOOSE, ACTIV ATE, and CREATE. Each task is\na function that represents a specific and distinct decision problem. The\ntype of the inputs to these functions does not change the core process\nof the decision task. Some of the key differences between the tasks are\nthe unique transfor

In [8]:
nl_to_typology_goal_chain = RunnableMap({
    "context": lambda x: retriever.get_relevant_documents(dm_task_definitions_question),
    "data_description": lambda x: x["data"],
    "task_description": lambda x: x["task"]
}) | nl_to_typology_goal_prompt_template | model

In [9]:
nl_to_typology_goal_chain_output =  nl_to_typology_goal_chain.invoke({"data": "tabular data where each row is a patient, and the associated levels of age, bmi amh and afc at the time of the Egg Retrieval Procedure.",
              "task": "understand how the medication dose varies with the following patient parameters: age, bmi amh and afc then recommend a dosage for the current patient."
              })
print(nl_to_typology_goal_chain_output)

After analyzing the dataset and task description, I believe that the most appropriate decision-making task is ACTIVATE.

The ACTIVATE task represents a decision where options are evaluated, and only those that meet or exceed a threshold are returned. In this case, the expert in embryology and in vitro fertilization needs to evaluate the medication dose based on various patient parameters (age, BMI, AMH, and AFC) and recommend a dosage for the current patient.

The task requires evaluating options (different medication doses) against specific criteria (patient parameters), and only those that meet or exceed a certain threshold (optimal dosage) are returned. This process involves filtering out suboptimal options based on the evaluation of the patient's characteristics, which aligns with the ACTIVATE task definition.

In contrast, the CHOOSE task would require selecting one option from a set of available options, which might not accurately capture the complexity of evaluating multiple pat

## Expand the decision tasks iteratively
Input: The  decision task.

Ouput: a typology based diagram.

In [10]:
decision_task = model.invoke("What is the decision task the following text is describing? Answer with one of these three words and nothing else: CHOOSE, ACTIVATE or CREATE. \n" 
                             + nl_to_typology_goal_chain_output)

decision_task

'ACTIVATE'

In [11]:
decomposition_context_question = "How do you decompose decision making tasks?"
retriever.get_relevant_documents(decomposition_context_question)

[Document(page_content=', during our interview study, we observed a preference among\nparticipants for a top-down approach (see Section 7). The participantsstarted with high-level decision tasks and recursively decomposed them\ninto sub-tasks until they achieved the desired level of granularity. In the\ncase studies below, we demonstrate how these decision-task hierarchies\ncan be created through a series of decompositions.\n6 C ASESTUDIES\nTo illustrate the composability andh', metadata={'page': 4, 'source': 'docs/dm.pdf'}),
 Document(page_content=' of real-world decision-making\nproblems, as our tasks can be composed or decomposed into other\ntasks.arXiv:2404.08812v2  [cs.HC]  22 Apr 2024', metadata={'page': 0, 'source': 'docs/dm.pdf'}),
 Document(page_content=' This involved dissecting the decision processes, identifying key\ncomponents, and highlighting other properties relevant to decision-\nmaking, as outlined in subsection 3.3.\nFollowing this, we conducted collaborative working

In [27]:
from langchain.output_parsers import CommaSeparatedListOutputParser
csv_output_parser = CommaSeparatedListOutputParser()
format_instructions = csv_output_parser.get_format_instructions()

From previous prompt: 
V2
To describe the decomposition, you should enumerate each decision task, describe it, and explain why it is part of the decomposition.
Also, you should explain how the information flows from one decision to the next by constructing a node-link diagram.
There might be back loops, where one decision task influences a previous decision task.
Note that the decision goal is also part of this description, and you should include how the information flows in the goal decision task.
The context, decomposition instructions, and decision goal are given below.


V1
For that, describe the node-link diagram of the decision making process in a couple of tables in CSV format, where each node is a decision task and each edge is a connection between two decision tasks.
The first table has the following columns: node id, node type (CHOOSE, ACTIVATE, or CREATE), that is, the name of the decision task, and the description of the decision task. The node id starts at 1, and increments by 1 for each decision task.
The second table has the following columns: Source, Target, and the information being passed from one decition to the other. 
The Source and Target columns should contain the ids of the decision tasks, names of the decision tasks, and the Description column should contain a brief explanation of the connection between the two decision tasks.
Add to the nodes and edges tables the decision goal, that is, how do the subtasks relate to the decision goal.
Answer with the node and edge tables and nothing else.

In [113]:
typology_goal_to_nl_diagram_template = """
Imagine you are a visualization designer who wants to understand what are the decisions an expert in embryology and in vitro fertilization would make when recommending medication for ovarian stimulation.
You are tasked with taking the decision goal explained as context below, and expanding it according to the decomposition instructions, also given below. 
The decision goal should be decomposed
in {number_of_subtasks} decision subtasks. 
Each subtask should be also one of the three decision making tasks that appear in Typology of Decision-Making Tasks for Visualization paper (CHOOSE, ACTIVATE, CREATE).

For that, describe the node-link diagram of the decision making process in a couple of tables, where each node is a decision task and each edge is a connection between two decision tasks.
The first table has the following columns: node id, node type (CHOOSE, ACTIVATE, or CREATE), that is, the name of the decision task, and the description of the decision task. The node id starts at 1, and increments by 1 for each decision task.
The second table has the following columns: Source, Target, and the information being passed from one decition to the other. 
The Source and Target columns should contain the ids of the decision tasks, names of the decision tasks, and the Description column should contain a brief explanation of the connection between the two decision tasks.
Add to the nodes and edges tables the decision goal, that is, how do the subtasks relate to the decision goal.
Answer with the node and edge tables and nothing else.


Relevant context from the Typology of Decision-Making Tasks for Visualization paper: {decision_task_definitions}
Relevant decomposition instructions from the Typology of Decision-Making Tasks for Visualization paper: {decomposition_context}
Decision Goal Description: {decision_goal_description}
"""
# prompt = ChatPromptTemplate.from_template(template)
typology_goal_to_nl_diagram_prompt_template = PromptTemplate.from_template(template=typology_goal_to_nl_diagram_template,
                                                                        # input_variables=["number_of_subtasks", "decision_goal_description"],
                                                                        # partial_variables={"format_instructions": format_instructions}
                                                                        )

In [114]:
typology_goal_to_nl_diagram_chain = RunnableMap({
    "number_of_subtasks": lambda x: x["number_of_subtasks"],
    "decision_task_definitions": lambda x: retriever.get_relevant_documents(dm_task_definitions_question),
    "decomposition_context": lambda x: retriever.get_relevant_documents(decomposition_context_question),
    "decision_goal_description": lambda x: x["decision_goal_description"]
}) | typology_goal_to_nl_diagram_prompt_template | model

In [115]:
typology_goal_to_nl_diagram_chain_output =  typology_goal_to_nl_diagram_chain.invoke({"number_of_subtasks": 3,
              "decision_goal_description": nl_to_typology_goal_chain_output
              })
print(typology_goal_to_nl_diagram_chain_output)

Here are the node-link diagram tables:

**Nodes Table**

| Node ID | Node Type | Description |
| --- | --- | --- |
| 1 | ACTIVATE | Evaluate options (different medication doses) against specific criteria (patient parameters) and recommend a dosage for the current patient. |
| 2 | CHOOSE | Select one option from a set of available options based on patient characteristics. |
| 3 | CREATE | Generate new information or assemble existing information to make a decision about medication dose recommendation. |

**Edges Table**

| Source | Target | Description |
| --- | --- | --- |
| 1 | 2 | Filter out suboptimal options based on evaluation of patient's characteristics, and only recommend optimal dosage. |
| 1 | 3 | Not applicable - expert already has the dataset and needs to evaluate it based on specific criteria. |

**Decision Goal**

The decision goal is to ACTIVATE a recommendation for medication dose based on patient parameters (age, BMI, AMH, and AFC). The subtasks are:

* Node 1: ACTIVAT

Note to self: this can be improved by adding memory to the llm, that is, I shouldn't have to pass it the previous output, to every new prompt template, but I should just use a template.
See the memory module in this course: https://learn.deeplearning.ai/courses/langchain/lesson/3/memory

From natural language to pandas diagram. See example of how to format the graph in a table here: https://towardsdatascience.com/visualizing-networks-in-python-d70f4cbeb259

In [116]:
# from langchain.output_parsers import CommaSeparatedListOutputParser

# csv_output_parser = CommaSeparatedListOutputParser()

# nodes_template = """
# Convert the decision process described below to a table in CSV format with 3 columns: 'decision_id', 'decision_type' and 'decision_description'. 
# 'decision_id' is the id of the decision, which starts at 1 and increments by 1 for each decision. 
# 'decision_type' is one of the three decision types: CHOOSE, ACTIVATE, CREATE. 
# 'decision_description' is the description of the decision. 
# Don't forget to add the names of the columns.
# Please don't add an introductory sentence introducing the table in CSV format.

# Here is the context for the decision process: {decision_process}

# """
# format_instructions = csv_output_parser.get_format_instructions()
# nodes_prompt = PromptTemplate(
#     template=nodes_template,
#     input_variables=["decision_process"],
#     # partial_variables={"format_instructions": format_instructions},
# )

# nodes_chain = nodes_prompt | model | csv_output_parser

In [117]:
# nodes_chain.invoke({"decision_process": typology_goal_to_nl_diagram_chain_output})

In [118]:
# edges_prompt = PromptTemplate(
#     template="Convert the information flow in the decision process described below to a table in CSV format with 3 columns: 'source', 'target', and 'description'. The 'source' is the decision task that generates the information, 'target' is the decision task that receives that information, and 'description' is the description on how the information flows from one decision task to the next. Don't forget to add the names of the columns. \n{decision_process}\n{format_instructions}\n Please don't add an introductory sentence introducing the table in CSV format.",
#     input_variables=["decision_process"],
#     partial_variables={"format_instructions": format_instructions},
# )

# edges_chain = edges_prompt | model | csv_output_parser

In [119]:
# edges_csv = edges_chain.invoke({"decision_process": typology_goal_to_nl_diagram_chain_output})
# edges_csv

In [123]:
# testing the json parser
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI
from typing import List

# Define your desired data structure.
# took the structure from here: https://gist.github.com/mbostock/4062045
class Node(BaseModel):
    node_id: int = Field(description="this field is called 'id', and it is the id of the node in the graph, starting at 1.")
    node_type: dict = Field(description="this field is called 'node_type'. The type of the node is one of the decision making tasks: CHOOSE, ACTIVATE or CREATE")
    node_description: dict = Field(description="this field is called 'description'. This is the description of the node, that is, the description of the decision making task")

class Edge(BaseModel):
    source: dict = Field(description="this field is called 'source', and it is the id of the source node in the graph")
    target: dict = Field(description="this field is called 'target', and it is the id of the target node in the graph")
    description: dict = Field(description="This is the description of the edge, that is, the description of the information flow from one node to the next")

class Graph(BaseModel):
    nodes: List[Node] = Field(description="list of nodes in the graph")
    edges: List[Edge] = Field(description="list of edges in the graph")



In [124]:
# And a query intented to prompt a language model to populate the data structure.
query = '''From the decision process described below, extract a node-link diagram where the nodes are decision nodes (CHOOSE, ACTIVATE or CREATE), 
and the edges represent the information flow between the decision nodes.
Make sure that the nodes have an id and that the edges use the node ids. 
The decision process is described below in two parts: as a table in CSV format of nodes and as a table in CSV format of edges: \n ''' + typology_goal_to_diagram_chain_output

# Set up a parser + inject instructions into the prompt template.
parser = JsonOutputParser(pydantic_object=Graph)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

output_json = chain.invoke({"query": query})

In [125]:
output_json

{'nodes': [{'node_id': 1,
   'node_type': {'type': 'CHOOSE'},
   'node_description': {'description': 'Identify relevant patient parameters (age, BMI, AMH, and AFC) that affect medication dose.'}},
  {'node_id': 2,
   'node_type': {'type': 'ACTIVATE'},
   'node_description': {'description': 'Evaluate different medication doses based on the chosen patient parameters and recommend a dosage for the current patient.'}}],
 'edges': [{'source': 1,
   'target': 2,
   'description': {'description': 'Information flows from this decision to the next step, as the chosen patient parameters will be used to evaluate different medication doses.'}}]}

In [103]:
output_json['edges']

[{'source': 1,
  'target': 2,
  'description': 'Information flows from this decision to the next step, as the chosen patient parameters will be used to evaluate different medication doses.'}]

In [121]:
import pandas as pd

nodes_df = pd.read_json(json.dumps(output_json['properties']['nodes']['items']))
nodes_df

  nodes_df = pd.read_json(json.dumps(output_json['properties']['nodes']['items']))


ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

In [98]:
edges_df = pd.read_json(json.dumps(output_json['properties']['links']['items']))
edges_df

  edges_df = pd.read_json(json.dumps(output_json['properties']['links']['items']))


Unnamed: 0,source,target,description
0,1,2,The CHOOSE task provides the relevant patient ...


In [105]:
G = nx.from_pandas_edgelist(edges_df, 'source', 'target', 'description')

In [106]:
node_attributes = nodes_df.set_index('id')['type'].to_dict()
nx.set_node_attributes(G, node_attributes, 'node_type')

In [108]:
G[1]

AtlasView({2: {'description': 'The CHOOSE task provides the relevant patient parameters, which are used to evaluate medication options in the ACTIVATE task.'}})