# Web endpoints

Now that we've designed a simple retrieval chain, let's look at what it would take to productionize it as a streaming chat endpoint!

We'll go over how the chain interacts with native web primitives like `fetch` and `Response`, and show how to handle separate chat sessions.

We'll pick up where we left off in the last lesson with loading and splitting our CS229 transcript into a vectorstore:

In [1]:
import "dotenv/config";

[Module: null prototype] { default: {} }

In [2]:
import { 
  loadAndSplitChunks, 
  initializeVectorstoreWithDocuments 
} from "./lib/helpers.ts";

const splitDocs = await loadAndSplitChunks({
  chunkSize: 1536,
  chunkOverlap: 128,
});

const vectorstore = await initializeVectorstoreWithDocuments({
  documents: splitDocs,
});

const retriever = vectorstore.asRetriever();

Let's load the pieces of our conversational retrieval chain together.

In [3]:
import { 
  createDocumentRetrievalChain, 
  createRephraseQuestionChain 
} from "./lib/helpers.ts";

const documentRetrievalChain = createDocumentRetrievalChain();
const rephraseQuestionChain = createRephraseQuestionChain();

In [4]:
import { ChatPromptTemplate, MessagesPlaceholder } from "langchain/prompts";

const ANSWER_CHAIN_SYSTEM_TEMPLATE = `You are an experienced researcher,
expert at interpreting and answering questions based on provided sources.
Using the below provided context and chat history, 
answer the user's question to the best of your ability
using only the resources provided. Be verbose!

<context>
{context}
</context>`;

const HUMAN_MESSAGE_TEMPLATE = 
  `Now, answer this question using the previous context and chat history:
  
  {standalone_question}`;

const answerGenerationChainPrompt = ChatPromptTemplate.fromMessages([
  ["system", ANSWER_CHAIN_SYSTEM_TEMPLATE],
  new MessagesPlaceholder("history"),
  [
    "human", 
    HUMAN_MESSAGE_TEMPLATE
  ]
]);


Before we assemble all the pieces, note that the native web `Response` objects used to return data in popular frameworks like Next.js accept a `ReadableStream` that emits bytes. Previously, our chain output string chunks using `StringOutputParser`, but it would be more convenient to emit bytes so that we can pass our LangChain stream straight to the response.
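To make the gap concrete, here is a minimal sketch of doing that conversion by hand with only web primitives. The generator `exampleStringChunks` is a hypothetical stand-in for a chain that streams strings, and `toByteStream` is an illustrative helper name, not part of LangChain:

```typescript
// Hypothetical stand-in for a chain that streams string chunks.
async function* exampleStringChunks(): AsyncGenerator<string> {
  yield "Hello, ";
  yield "streaming ";
  yield "world!";
}

// Wrap an async iterable of strings in a ReadableStream of UTF-8 bytes,
// which is the shape a native Response body expects.
function toByteStream(
  chunks: AsyncIterable<string>,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // Stop the underlying generator if the consumer cancels the stream.
    async cancel() {
      await iterator.return?.();
    },
  });
}

// The byte stream can be handed directly to a native Response.
const exampleResponse = new Response(toByteStream(exampleStringChunks()), {
  headers: { "Content-Type": "text/plain; charset=utf-8" },
});
```

This works, but it is boilerplate we would have to repeat for every endpoint.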

Fortunately, LangChain provides an `HttpResponseOutputParser` that parses chat output into chunks of bytes that match either `text/plain` or `text/event-stream` content types! To use it, let's construct our conversational retrieval chain as before, but skip the final `StringOutputParser`:

In [5]:
import { 
  RunnablePassthrough, 
  RunnableSequence 
} from "langchain/runnables";
import { ChatOpenAI } from "langchain/chat_models/openai";

const conversationalRetrievalChain = RunnableSequence.from([
  RunnablePassthrough.assign({
    standalone_question: rephraseQuestionChain,
  }),
  RunnablePassthrough.assign({
    context: documentRetrievalChain,
  }),
  answerGenerationChainPrompt,
  new ChatOpenAI({ modelName: "gpt-3.5-turbo-1106" }),
]);

Then, we'll create an `HttpResponseOutputParser` and pipe the `RunnableWithMessageHistory` into it:

In [6]:
import { HttpResponseOutputParser } from "langchain/output_parsers";
import { RunnableWithMessageHistory } from "langchain/runnables"; 
import { ChatMessageHistory } from "langchain/stores/message/in_memory";

// "text/event-stream" is also supported
const httpResponseOutputParser = new HttpResponseOutputParser({
  contentType: "text/plain"
});

const messageHistoryMap = new Map();

const finalRetrievalChain = new RunnableWithMessageHistory({
  runnable: conversationalRetrievalChain,
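  // sessionId comes from the request body of our endpoint and is passed in
  // through the `configurable` field when the chain is called
  getMessageHistory: (sessionId) => {
    // Use .has() here: the `in` operator checks object properties, not Map entries
    if (messageHistoryMap.has(sessionId)) {
      return messageHistoryMap.get(sessionId);
    }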
    const newChatSessionHistory = new ChatMessageHistory();
    messageHistoryMap.set(sessionId, newChatSessionHistory);
    return newChatSessionHistory;
  },
  inputMessagesKey: "question",
  historyMessagesKey: "history",
}).pipe(httpResponseOutputParser);

We don't put the `HttpResponseOutputParser` directly in the `conversationalRetrievalChain` because `RunnableWithMessageHistory` stores the aggregated output of its runnable in the `ChatMessageHistory`, and it requires that final output to be either a string or a `ChatMessage` rather than bytes.

You might also notice that our `getMessageHistory` function creates a new `ChatMessageHistory` object for each new `sessionId` instead of reusing a single one as before. This lets us tie each `sessionId` to an individual conversation and load the right history as requests come in later. For more advanced persistence, you'll want to use an integration to store these histories.
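The get-or-create pattern behind per-session histories can be sketched on its own. In this sketch, `ExampleHistory` is a hypothetical stand-in for `ChatMessageHistory`, and the other names are illustrative assumptions as well:

```typescript
// Hypothetical minimal history type, standing in for ChatMessageHistory.
class ExampleHistory {
  messages: string[] = [];
}

const exampleHistories = new Map<string, ExampleHistory>();

// Return the existing history for a session, creating one on first use.
function getOrCreateHistory(sessionId: string): ExampleHistory {
  let history = exampleHistories.get(sessionId);
  if (history === undefined) {
    history = new ExampleHistory();
    exampleHistories.set(sessionId, history);
  }
  return history;
}

getOrCreateHistory("1").messages.push("What are the prerequisites?");
getOrCreateHistory("2").messages.push("An unrelated question");

// Each session id maps to its own history object.
console.log(getOrCreateHistory("1").messages.length); // 1
console.log(getOrCreateHistory("1") === getOrCreateHistory("2")); // false

// The `in` operator inspects object properties, not Map entries.
console.log("1" in exampleHistories); // false
```

Looking entries up with `get` or `has` matters here: a lookup that misses would silently create a fresh, empty history on every request.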

Great! Let's set up a simple server with a handler that calls our chain and see if we can get a streaming response. We'll populate the input question and the session id from the request body. Since this notebook runs on Deno, we use Deno's built-in HTTP server, `Deno.serve`, but the general concept is shared by many JS frameworks.

Also, in a true production deployment, you'd likely want to set up authentication/input validation via some middleware, but we'll skip that for simplicity for now:

In [7]:
const port = 8080;

const handler = async (request: Request): Promise<Response> => {
  const body = await request.json();
  const stream = await finalRetrievalChain.stream({
    question: body.question
  }, { configurable: { sessionId: body.session_id } });

  return new Response(stream, { 
    status: 200,
    headers: {
      "Content-Type": "text/plain"
    },
  });
};

console.log(`HTTP server is running! Access it at: http://localhost:${port}/`);
Deno.serve({ port }, handler);

HTTP server is running! Access it at: http://localhost:8080/
Listening on http://localhost:8080/


{
  finished: Promise { <pending> },
  shutdown: [AsyncFunction: shutdown],
  ref: [Function: ref],
  unref: [Function: unref]
}
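The handler above trusts the request body as-is. As a rough sketch of the input validation mentioned earlier, the helpers below check the body before it reaches the chain. The names `ChatRequestBody`, `parseChatRequestBody`, `badRequest`, and `validateRequest`, as well as the error format, are assumptions for illustration:

```typescript
// Hypothetical shape of the JSON body the handler expects.
interface ChatRequestBody {
  question: string;
  session_id: string;
}

// Narrow an unknown JSON value to ChatRequestBody, or return null if invalid.
function parseChatRequestBody(value: unknown): ChatRequestBody | null {
  if (typeof value !== "object" || value === null) return null;
  const { question, session_id } = value as Record<string, unknown>;
  if (typeof question !== "string" || question.trim() === "") return null;
  if (typeof session_id !== "string" || session_id === "") return null;
  return { question, session_id };
}

// Build a 400 response for a malformed body (assumed error format).
function badRequest(message: string): Response {
  return new Response(JSON.stringify({ error: message }), {
    status: 400,
    headers: { "Content-Type": "application/json" },
  });
}

// Resolve to a validated body, or to an error Response to return as-is.
async function validateRequest(
  request: Request,
): Promise<ChatRequestBody | Response> {
  // Treat unparseable JSON the same as a missing body.
  const json = await request.json().catch(() => null);
  return (
    parseChatRequestBody(json) ??
      badRequest("Expected JSON with string fields `question` and `session_id`.")
  );
}
```

A handler could call `validateRequest` first and return the result immediately when it is a `Response`, only streaming from the chain otherwise.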

Let's make a quick helper function to make handling the response stream in the client a bit nicer:

In [8]:
const decoder = new TextDecoder();

// readChunks() reads from the provided reader and yields the results into an async iterable
function readChunks(reader) {
  return {
    async* [Symbol.asyncIterator]() {
      let readResult = await reader.read();
      while (!readResult.done) {
        yield decoder.decode(readResult.value);
        readResult = await reader.read();
      }
    },
  };
}
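One subtlety when decoding streamed bytes: a multi-byte UTF-8 character can be split across two chunks. Passing `{ stream: true }` to `TextDecoder.decode` makes the decoder hold the incomplete bytes until the rest arrive. The sketch below is a self-contained variant of the helper that exercises this against an in-memory stream rather than a server; `readTextChunks`, `collectText`, and the sample bytes are illustrative assumptions:

```typescript
// Async generator variant of the helper: decodes byte chunks as they arrive.
async function* readTextChunks(
  reader: ReadableStreamDefaultReader<Uint8Array>,
): AsyncGenerator<string> {
  const streamingDecoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // { stream: true } buffers an incomplete multi-byte character
    // until the rest of its bytes arrive in a later chunk.
    yield streamingDecoder.decode(value, { stream: true });
  }
  // Flush anything still buffered once the stream ends.
  const tail = streamingDecoder.decode();
  if (tail) yield tail;
}

// "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
const splitBytes = [
  new Uint8Array([0x63, 0x61, 0x66, 0xc3]), // "caf" plus the first byte of "é"
  new Uint8Array([0xa9, 0x21]), // the second byte of "é", then "!"
];

const exampleByteStream = new ReadableStream<Uint8Array>({
  start(controller) {
    for (const bytes of splitBytes) controller.enqueue(bytes);
    controller.close();
  },
});

// Concatenate every decoded chunk from a byte stream into one string.
async function collectText(
  stream: ReadableStream<Uint8Array>,
): Promise<string> {
  let text = "";
  for await (const chunk of readTextChunks(stream.getReader())) {
    text += chunk;
  }
  return text;
}
```

Calling `collectText(exampleByteStream)` resolves to `"café!"`. Decoding each chunk independently would instead produce replacement characters where the split character should be.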

And now let's try calling our endpoint!

We use a sleep function at the end because of the limitations of running a server within a notebook: we want to make sure the request finishes before the cell stops executing.

In [9]:
const sleep = async () => {
  return new Promise((resolve) => setTimeout(resolve, 500));
}

const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "What are the prerequisites for this course?",
    session_id: "1", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

CHUNK: The course has speci
CHUNK: fic requirements 
CHUNK: for the students. The instruct
CHUNK: or, Andrew Ng, 
CHUNK: mentioned that the course w
CHUNK: ill not be very p
CHUNK: rogramming in
CHUNK: tensive, but it will invol
CHUNK: ve some
CHUNK:  programming, mostly us
CHUNK: ing MATLAB or Octave.
CHUNK:  He also assumes fami
CHUNK: liarity wit
CHUNK: h basic probability 
CHUNK: and statistics, su
CHUNK: ggesting 
CHUNK: that most undergr
CHUNK: aduate st
CHUNK: atistics classes like 
CHUNK: Stat 116 at
CHUNK:  Stanford will be more
CHUNK:  than enough to mee
CHUNK: t this requirement. Additionally,
CHUNK:  students are expect
CHUNK: ed to have an underst
CHUNK: anding of
CHUNK:  basic linear algebra, 
CHUNK: which can be ac
CHUNK: quired through 
CHUNK: courses like Math 51, 
CHUNK: 103, Math 11
CHUNK: 3, or CS205 at Stanford
CHUNK: .

The instructor also makes
CHUNK:  it clear that students sh
CHUNK: ould be 
CHUNK: familiar with concepts such as random variables,
CHUNK:  expect

We can see that we get a streamed string response.

Now, let's test the memory by asking a followup:

In [10]:
const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "Can you list them in bullet point format?",
    session_id: "1", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

CHUNK: Based on the provided con
CHUNK: text, the reques
CHUNK: t for listing 
CHUNK: in bullet poin
CHUNK: t format is 
CHUNK: unclear. Howe
CHUNK: ver, if we consider
CHUNK:  the prior conversatio
CHUNK: ns and the conten
CHUNK: t of the documen
CHUNK: ts, it seems likely t
CHUNK: hat the user is a
CHUNK: sking for a list
CHUNK:  related to a spe
CHUNK: cific top
CHUNK: ic or info
CHUNK: rmation. Without more
CHUNK:  clarity
CHUNK:  on the specific list the user is 
CHUNK: requesting, I'm
CHUNK:  unable to provide
CHUNK:  a precise bullet-point list without m
CHUNK: aking assu
CHUNK: mptions.

If the user 
CHUNK: could provide
CHUNK:  more details about the su
CHUNK: bject or conte
CHUNK: xt they are refer
CHUNK: ring to, I would be h
CHUNK: appy to assist in creating
CHUNK:  a bullet-point list. This 
CHUNK: could include a list of
CHUNK:  course topics, examp
CHUNK: les of machine learning projects, a list of students' academic backgrounds in a classroom, or a list of reminders giv

Sweet! Let's try again with a different `sessionId`. We expect to see an entirely new conversation with no prior history.

In [11]:
const response = await fetch("http://localhost:8080/", {
  method: "POST",
  headers: {
    "content-type": "application/json",
  },
  body: JSON.stringify({
    question: "What did I just ask you?",
    session_id: "2", // Should randomly generate/assign
  })
});

// response.body is a ReadableStream
const reader = response.body?.getReader();

for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", chunk);
}

await sleep();

CHUNK: Based on the context an
CHUNK: d chat history pr
CHUNK: ovided, the previ
CHUNK: ous question you asked r
CHUNK: evolved around the diversity of the students in the class. Specifically, you were inquiring about the backgrounds of the students in the class and where they were from. This demonstrates a curiosity about the diversity of the audience and the various disciplines they represent.


An entirely new session! While the current version of the Deno kernel can't render a frontend in the notebook, you could update a frontend component with the content of the stream to create a responsive chat experience.
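As a rough sketch of what that client code might look like, the helper below posts a question and reports the growing answer through a callback each time a chunk arrives. The function name `streamAnswer`, the `onUpdate` callback, and the injectable `fetchImpl` parameter are assumptions for illustration, not part of any framework:

```typescript
// Hypothetical client helper: POSTs a question and reports the growing
// answer text to `onUpdate` each time a new chunk arrives.
async function streamAnswer(
  url: string,
  question: string,
  sessionId: string,
  onUpdate: (answerSoFar: string) => void,
  fetchImpl: typeof fetch = fetch, // injectable so it can be swapped in tests
): Promise<string> {
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ question, session_id: sessionId }),
  });
  if (!res.ok || res.body === null) {
    throw new Error(`Request failed with status ${res.status}`);
  }

  const reader = res.body.getReader();
  const textDecoder = new TextDecoder();
  let answer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    answer += textDecoder.decode(value, { stream: true });
    onUpdate(answer); // e.g. set a UI component's state here
  }
  return answer;
}

// Example usage (assumes the server from this lesson is running):
// const sessionId = crypto.randomUUID(); // one way to assign a fresh session id
// await streamAnswer("http://localhost:8080/", "What is this course about?",
//   sessionId, (text) => console.log(text));
```

In a UI framework, `onUpdate` would typically set the state of the message component, so the answer appears to type itself out as chunks arrive.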