# Web endpoints

Now that we've designed a simple retrieval chain, let's look at what it would take to productionize it as a streaming chat endpoint!

We'll cover how the chain interacts with native web primitives like `fetch` and `Response`, and show how to handle separate chat sessions.

We'll pick up where we left off in the last lesson with loading and splitting our CS229 transcript into a vectorstore:

In [1]:
import "dotenv/config";

[Module: null prototype] { default: {} }

In [2]:
import { 
  loadAndSplitChunks, 
  initializeVectorstoreWithDocuments 
} from "./lib/helpers.ts";

const splitDocs = await loadAndSplitChunks({
  chunkSize: 1536,
  chunkOverlap: 128,
});

const vectorstore = await initializeVectorstoreWithDocuments({
  documents: splitDocs,
});

const retriever = vectorstore.asRetriever();

Let's load the pieces of our conversational retrieval chain together.

In [3]:
import { 
  createDocumentRetrievalChain, 
  createRephraseQuestionChain 
} from "./lib/helpers.ts";

const documentRetrievalChain = createDocumentRetrievalChain();
const rephraseQuestionChain = createRephraseQuestionChain();

In [4]:
import { ChatPromptTemplate, MessagesPlaceholder } from "langchain/prompts";

const ANSWER_CHAIN_SYSTEM_TEMPLATE = `You are an experienced researcher,
expert at interpreting and answering questions based on provided sources.
Using the below provided context and chat history, 
answer the user's question to the best of your ability
using only the resources provided. Be verbose!

<context>
{context}
</context>`;

const HUMAN_MESSAGE_TEMPLATE = 
  `Now, answer this question using the previous context and chat history:
  
  {standalone_question}`;

const answerGenerationChainPrompt = ChatPromptTemplate.fromMessages([
  ["system", ANSWER_CHAIN_SYSTEM_TEMPLATE],
  new MessagesPlaceholder("history"),
  [
    "human", 
    HUMAN_MESSAGE_TEMPLATE
  ]
]);


Before we assemble all the pieces, note that the native web `Response` objects used to return data in popular frameworks like Next.js accept a `ReadableStream` that emits bytes. Previously, our chain output string chunks using `StringOutputParser`, but it would be more convenient to emit bytes so that we could pass our LangChain stream straight to the response.
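
To make that gap concrete, here is a minimal sketch of doing the string-to-bytes conversion by hand, using only web-standard APIs. `stringChunksToByteStream` and `fakeStringStream` are hypothetical names used for illustration; neither is part of LangChain.

```typescript
// Hypothetical helper: wraps an async iterable of string chunks in a byte stream.
function stringChunksToByteStream(
  chunks: AsyncIterable<string>,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    // Called whenever the consumer is ready for more data.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // Called if the consumer stops reading early (e.g. the client disconnects).
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Stand-in for a chain that streams string chunks.
async function* fakeStringStream(): AsyncGenerator<string> {
  yield "Hello, ";
  yield "world!";
}
```

Passing `stringChunksToByteStream(fakeStringStream())` to `new Response(...)` would then stream the text to a client as bytes.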

Fortunately, LangChain provides an `HttpResponseOutputParser` that parses chat output into chunks of bytes that match either `text/plain` or `text/event-stream` content types! To use it, let's construct our conversational retrieval chain as before, but skip the final `StringOutputParser`:

In [5]:
import { 
  RunnablePassthrough, 
  RunnableSequence 
} from "langchain/runnables";
import { ChatOpenAI } from "langchain/chat_models/openai";

const conversationalRetrievalChain = RunnableSequence.from([
  RunnablePassthrough.assign({
    standalone_question: rephraseQuestionChain,
  }),
  RunnablePassthrough.assign({
    context: documentRetrievalChain,
  }),
  answerGenerationChainPrompt,
  new ChatOpenAI({ modelName: "gpt-3.5-turbo-1106" }),
]);

Then, we'll create an `HttpResponseOutputParser` and pipe the `RunnableWithMessageHistory` into it:

In [6]:
import { HttpResponseOutputParser } from "langchain/output_parsers";
import { RunnableWithMessageHistory } from "langchain/runnables"; 
import { ChatMessageHistory } from "langchain/stores/message/in_memory";

// "text/event-stream" is also supported
const httpResponseOutputParser = new HttpResponseOutputParser({
  contentType: "text/plain"
});

const messageHistories = {};

const finalRetrievalChain = new RunnableWithMessageHistory({
  runnable: conversationalRetrievalChain,
  // sessionId arrives via `configurable.sessionId` when the chain is called;
  // our endpoint reads it from the request body.
  getMessageHistory: (sessionId) => {
    if (sessionId in messageHistories) {
      return messageHistories[sessionId];
    }
    const newChatSessionHistory = new ChatMessageHistory();
    messageHistories[sessionId] = newChatSessionHistory;
    return newChatSessionHistory;
  },
  inputMessagesKey: "question",
  historyMessagesKey: "history",
}).pipe(httpResponseOutputParser);

We don't put the `HttpResponseOutputParser` directly inside `conversationalRetrievalChain` because `RunnableWithMessageHistory` stores the aggregated output of its runnable in the `ChatMessageHistory`, and it requires that final output to be either a string or a `ChatMessage` rather than bytes.
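
As a rough mental model only (this is not LangChain's actual implementation), the history layer behaves like a wrapper that forwards each chunk while concatenating them, then saves the combined result once the stream ends. Concatenating string chunks is well defined, which is why the byte encoding has to happen after this step. `streamAndRecord` and `fakeModelStream` are hypothetical names.

```typescript
// Hypothetical wrapper: passes chunks through while accumulating the full text.
async function* streamAndRecord(
  chunks: AsyncIterable<string>,
  onComplete: (fullText: string) => void,
): AsyncGenerator<string> {
  let aggregated = "";
  for await (const chunk of chunks) {
    aggregated += chunk;
    yield chunk;
  }
  // Only reached when the source stream finishes normally.
  onComplete(aggregated);
}

// Stand-in for a model that streams its answer in pieces.
async function* fakeModelStream(): AsyncGenerator<string> {
  yield "Linear ";
  yield "algebra ";
  yield "helps.";
}
```

A caller would iterate over `streamAndRecord(fakeModelStream(), save)` to forward chunks, and `save` would receive the complete answer once the stream is exhausted.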

You might also notice that our `getMessageHistory` function now creates a separate `ChatMessageHistory` object for each `sessionId` it receives, instead of reusing a single one as before. This lets us tie each `sessionId` to an individual conversation and load the right history as requests come in. For more durable persistence, you'll want to use an integration to store these histories.
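
The lookup-or-create pattern can be sketched on its own, without any LangChain types. `InMemoryHistory` is a hypothetical stand-in for `ChatMessageHistory`, and `getOrCreateHistory` is an illustrative name. This sketch uses a `Map` rather than a plain object, which avoids accidental matches on inherited property names such as `"constructor"`.

```typescript
// Hypothetical stand-in for ChatMessageHistory: just an ordered list of messages.
class InMemoryHistory {
  readonly messages: string[] = [];

  addMessage(message: string): void {
    this.messages.push(message);
  }
}

// One history per session id, kept in memory for the lifetime of the process.
const histories = new Map<string, InMemoryHistory>();

// Returns the existing history for a session, creating one on first use.
function getOrCreateHistory(sessionId: string): InMemoryHistory {
  let history = histories.get(sessionId);
  if (history === undefined) {
    history = new InMemoryHistory();
    histories.set(sessionId, history);
  }
  return history;
}

getOrCreateHistory("1").addMessage("What are the prerequisites?");
console.log(getOrCreateHistory("1").messages.length); // 1
console.log(getOrCreateHistory("2").messages.length); // 0
```

Because everything lives in process memory, all sessions are lost on restart, which is the gap a storage integration would fill.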

Great! Let's set up a simple server with a handler that calls our chain and see if we can get a streaming response. We'll populate the input question and the session id from the body parameters. Since this notebook runs on Deno, we use Deno's built-in `Deno.serve`, but the general concept carries over to many JS frameworks.

Also, in a true production deployment, you'd likely want to add authentication and input validation via middleware, but we'll skip that here for simplicity:

In [7]:
const port = 8080;

const handler = async (request: Request): Promise<Response> => {
  const body = await request.json();
  const stream = await finalRetrievalChain.stream({
    question: body.question
  }, { configurable: { sessionId: body.session_id } });

  return new Response(stream, { 
    status: 200,
    headers: {
      "Content-Type": "text/plain"
    },
  });
};

console.log(`HTTP server is running! Access it at: http://localhost:${port}/`);
Deno.serve({ port }, handler);

HTTP server is running! Access it at: http://localhost:8080/
Listening on http://localhost:8080/


{
  finished: Promise { <pending> },
  shutdown: [AsyncFunction: shutdown],
  ref: [Function: ref],
  unref: [Function: unref]
}
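
The handler above trusts the request body as-is. As a minimal sketch of the input validation mentioned earlier, the following checks the two body fields before any chain would run. `ChatRequestBody`, `parseChatRequestBody`, and `validatingHandler` are hypothetical names, and the handler only echoes its input instead of calling the chain.

```typescript
// Hypothetical request shape matching the body fields read by the handler above.
interface ChatRequestBody {
  question: string;
  session_id: string;
}

// Narrows an unknown JSON payload to ChatRequestBody, or returns null if invalid.
function parseChatRequestBody(payload: unknown): ChatRequestBody | null {
  if (typeof payload !== "object" || payload === null) return null;
  const { question, session_id } = payload as Record<string, unknown>;
  if (typeof question !== "string" || question.trim() === "") return null;
  if (typeof session_id !== "string" || session_id === "") return null;
  return { question, session_id };
}

// Example handler that rejects malformed input with a 400 response.
async function validatingHandler(request: Request): Promise<Response> {
  let payload: unknown;
  try {
    payload = await request.json();
  } catch {
    return new Response("Body must be valid JSON", { status: 400 });
  }
  const body = parseChatRequestBody(payload);
  if (body === null) {
    return new Response("Expected string fields `question` and `session_id`", {
      status: 400,
    });
  }
  // A real handler would stream the chain output here; this sketch just echoes.
  return new Response(`OK: session ${body.session_id}`, { status: 200 });
}
```

Since a handler is just a function from `Request` to `Response`, it can be exercised without starting a server by passing it a `new Request(...)` directly.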

Let's make a quick helper function to make handling the response stream in the client a bit nicer:

In [8]:
const decoder = new TextDecoder();

// readChunks() reads from the provided reader and yields the results into an async iterable
function readChunks(reader) {
  return {
    async* [Symbol.asyncIterator]() {
      let readResult = await reader.read();
      while (!readResult.done) {
        // { stream: true } keeps multi-byte characters intact across chunk boundaries
        yield decoder.decode(readResult.value, { stream: true });
        readResult = await reader.read();
      }
    },
  };
}
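
One subtlety when decoding a stream chunk by chunk: network chunks are split at arbitrary byte offsets, so a multi-byte UTF-8 character can land half in one chunk and half in the next. The sketch below shows the difference the `stream: true` option of `TextDecoder.decode` makes. The byte arrays are made-up sample data.

```typescript
// "é" is two bytes in UTF-8 (0xC3 0xA9). Split it across two chunks.
const firstChunk = new Uint8Array([0x63, 0x61, 0x66, 0xc3]); // "caf" + first byte of "é"
const secondChunk = new Uint8Array([0xa9]); // second byte of "é"

// Decoding each chunk independently corrupts the split character.
const naiveDecoder = new TextDecoder();
const naive = naiveDecoder.decode(firstChunk) + naiveDecoder.decode(secondChunk);

// With { stream: true }, the decoder buffers the incomplete byte sequence.
const streamingDecoder = new TextDecoder();
const streamed =
  streamingDecoder.decode(firstChunk, { stream: true }) +
  streamingDecoder.decode(secondChunk, { stream: true }) +
  streamingDecoder.decode(); // flush anything still buffered

console.log(naive === "café"); // false
console.log(streamed === "café"); // true
```

For plain ASCII answers the two approaches behave identically, so the difference only shows up with accented characters, emoji, or non-Latin scripts.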

And now let's try calling our endpoint!

We use a sleep function at the end because of the limitations of running a server within a notebook: we want to make sure the request finishes before the cell stops executing.

In [9]:
const sleep = async () => {
  return new Promise((resolve) => setTimeout(resolve, 500));
}

const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "What are the prerequisites for this course?",
    session_id: "1", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

TypeError: request or response body error: error reading a body from connection: unexpected EOF during chunk size line

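When the request completes successfully, we get a streamed string response, printed one `CHUNK:` line at a time. In the run captured above, the connection closed before the body was fully read, so the cell shows a `TypeError` instead.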

Now, let's test the memory by asking a followup:

In [None]:
const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "Can you list them in bullet point format?",
    session_id: "1", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

CHUNK: Based on the chat hist
CHUNK: ory provided, it seems like
CHUNK:  the user is asking 
CHUNK: for a specific list. Fr
CHUNK: om the conve
CHUNK: rsation and documen
CHUNK: ts provided, it is not cle
CHUNK: ar what exact list
CHUNK:  the user is referring to. Howe
CHUNK: ver, I can make
CHUNK:  an assump
CHUNK: tion based on the context.

Assuming t
CHUNK: he user is referring to a list of topics discussed i
CHUNK: n the conversation and documents provided, the bullet point
CHUNK:  format could look
CHUNK:  like this:

- Online Resources:
CHUNK:  This section seems t
CHUNK: o be mentioned on the thi
CHUNK: rd page of a
CHUNK:  course handou
CHUNK: t and may contain valu
CHUNK: able info
CHUNK: rmation related
CHUNK:  to the co
CHUNK: urse.
- Learning Algo
CHUNK: rithms: This was a topic mentioned 
CHUNK: in the chat history where the instructor talks about teaching machine lear
CHUNK: ning algorithms to enable a car
CHUNK:  to drive off ro
CHUNK: ads at high speeds or a 
CHUNK: ro

Sweet! Let's try again with a different `sessionId`. We expect a completely fresh conversation with no memory of the previous one.

In [None]:
const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "What did I just ask you?",
    session_id: "2", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

CHUNK: Based on the p
CHUNK: rovided chat histor
CHUNK: y and context, I can't
CHUNK:  determine
CHUNK:  the exact question
CHUNK:  you asked 
CHUNK: that prompted th
CHUNK: e response by the in
CHUNK: structor, A
CHUNK: ndrew Ng. The conver
CHUNK: sation primar
CHUNK: ily revolve
CHUNK: s around various 
CHUNK: topics, including
CHUNK:  forming study groups,
CHUNK:  course resources,
CHUNK:  and the diverse ba
CHUNK: ckgrounds of the 
CHUNK: students in the class. It seems that your question may not be captured in the provided context.

If you have a specific question in mind that you'd like assistance with, please feel free to provide additional details or directly ask the question, and I'll be happy to help!


An entirely new session! While the current version of the Deno kernel can't render a frontend in the notebook, you could update a frontend component with the content of the stream to create a responsive chat experience.
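
As a minimal sketch of that idea, the helper below reads a response body and reports the accumulated text after every chunk, so a UI could re-render the message as it grows. `streamIntoMessage` and its `onUpdate` callback are hypothetical names, not part of LangChain or any frontend framework.

```typescript
// Hypothetical helper: reads a byte stream and reports the text received so far.
async function streamIntoMessage(
  body: ReadableStream<Uint8Array>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let text = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    text += decoder.decode(value, { stream: true });
    onUpdate(text);
  }
  // Flush any bytes the decoder was still holding on to.
  const tail = decoder.decode();
  if (tail !== "") {
    text += tail;
    onUpdate(text);
  }
  return text;
}
```

In a frontend, `onUpdate` would typically set some component state, for example by passing a state setter from whichever UI library is in use, and `body` would be the `response.body` returned by `fetch`.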