## Lesson 4: Form Parsing

**Lesson objective**: Incorporate form parsing into the workflow

In the previous lesson, you used LlamaParse to parse a resume and gave it parsing instructions. You'll do that again here, but with more advanced instructions: LlamaParse will read an application form and list the fields that need to be filled in, and an LLM will then convert that list into a JSON object. You will then incorporate these steps into the workflow you started building in the previous lesson.


In [11]:
import json
import os

from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.core.workflow import (
    StartEvent,
    StopEvent,
    Workflow,
    step,
    Event,
    Context,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.utils.workflow import draw_all_possible_flows
from llama_parse import LlamaParse, ResultType

In [2]:
from dotenv import load_dotenv

load_dotenv()

openai_api_key = os.environ["OPENAI_API_KEY"]
llama_cloud_api_key = os.environ["LLAMA_CLOUD_API_KEY"]

In [3]:
import nest_asyncio

nest_asyncio.apply()

### Parsing an Application Form with LlamaParse


In [None]:
parser = LlamaParse(
    api_key=llama_cloud_api_key,
    # base_url=os.getenv("LLAMA_CLOUD_BASE_URL"),
    result_type=ResultType.MD,
    user_prompt="This is a job application form."
    " Create a list of all the fields that need to be filled in."
    " Return a bulleted list of the fields ONLY.",
)

In [5]:
result = parser.load_data(file_path="./data/fake_application_form.pdf")[0]

print(result.text)

Started parsing the file under job_id 5c6e4cc6-2c32-465a-99fa-39a653e359bc
- First Name
- Last Name
- Email
- Phone
- LinkedIn
- Project Portfolio
- Degree
- Graduation Date
- Current Job Title
- Current Employer
- Technical Skills
- Describe why you’re a good fit for this position
- Do you have 5 years of experience in React?
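
Because the parser returns a plain Markdown bullet list here, the field names could also be recovered without a second LLM call. The block below is a minimal sketch of that idea and is not part of the lesson's workflow: `extract_bullet_fields` is a hypothetical helper, and it assumes every field sits on its own line starting with `- ` or `* `.

```python
import json


def extract_bullet_fields(markdown_text: str) -> list[str]:
    """Collect the text of each Markdown bullet line (hypothetical helper)."""
    fields = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        # Assumption: bullets start with "- " or "* "; other lines are ignored.
        if stripped.startswith(("- ", "* ")):
            fields.append(stripped[2:].strip())
    return fields


sample = """- First Name
- Last Name
- Email
"""

fields_json = json.dumps({"fields": extract_bullet_fields(sample)})
print(fields_json)
```

The LLM-based conversion used in the rest of this lesson is more forgiving when the parser's output is less regular than a clean bullet list.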


In [6]:
llm = OpenAI(
    model="gpt-4.1-nano",
    api_key=openai_api_key,
    api_base=os.environ["OPENAI_API_BASE"],
    temperature=0.5,
)

In [7]:
raw_json = llm.complete(
    prompt=f"""
    This is a parsed form.
    Convert it into a JSON object containing only the list 
    of fields to be filled in, in the form {{ fields: [...] }}. 
    <form>{result.text}</form>. 
    Return JSON ONLY, no markdown."""
)

print(raw_json.text)

{"fields": ["First Name", "Last Name", "Email", "Phone", "LinkedIn", "Project Portfolio", "Degree", "Graduation Date", "Current Job Title", "Current Employer", "Technical Skills", "Describe why you’re a good fit for this position", "Do you have 5 years of experience in React?"]}


In [8]:
fields = json.loads(raw_json.text)["fields"]

for field in fields:
    print(field)

First Name
Last Name
Email
Phone
LinkedIn
Project Portfolio
Degree
Graduation Date
Current Job Title
Current Employer
Technical Skills
Describe why you’re a good fit for this position
Do you have 5 years of experience in React?
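
The `json.loads` call above works only if the model returns bare JSON in the expected shape. A model may occasionally wrap its reply in a Markdown code fence or return a different structure, so a more defensive parse can help. The following is a sketch under that assumption; `parse_fields` is a hypothetical helper, not something the lesson's libraries provide.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable


def parse_fields(raw_text: str) -> list[str]:
    """Parse an LLM reply expected to hold {"fields": [...]} (hypothetical helper)."""
    text = raw_text.strip()
    # Assumption: a fenced reply has the fence on its first and last lines.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    fields = data.get("fields") if isinstance(data, dict) else None
    if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
        raise ValueError('Expected a JSON object of the form {"fields": [...]}')
    return fields


plain = '{"fields": ["First Name", "Email"]}'
fenced = FENCE + "json\n" + '{"fields": ["Email"]}' + "\n" + FENCE

print(parse_fields(plain))
print(parse_fields(fenced))
```

Raising an error on an unexpected shape makes a bad model reply fail at the parsing step, rather than later when the fields are turned into queries.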


### Adding a Form Parser to the Workflow (first update)

In [9]:
class ParseFormEvent(Event):
    application_form: str


class QueryEvent(Event):
    query: str

In [13]:
class RAGWorkflow(Workflow):

    storage_dir = "./storage"
    llm: OpenAI
    query_engine: BaseQueryEngine

    @step
    async def set_up(self, ctx: Context, ev: StartEvent) -> ParseFormEvent:

        if not ev.resume_file:
            raise ValueError("No resume file provided")

        if not ev.application_form:
            raise ValueError("No application form provided")

        # define the LLM to work with
        self.llm = OpenAI(
            model="gpt-4.1-nano",
            api_key=openai_api_key,
            api_base=os.environ["OPENAI_API_BASE"],
            temperature=0.5,
        )

        # ingest the data and set up the query engine
        if os.path.exists(self.storage_dir):
            # you've already ingested the resume document
            storage_context = StorageContext.from_defaults(persist_dir=self.storage_dir)
            index = load_index_from_storage(
                storage_context=storage_context,
                embed_model=OpenAIEmbedding(
                    model_name="Cohere-embed-v3-english",
                    api_key=openai_api_key,
                    api_base=os.environ["OPENAI_API_BASE"],
                ),
            )
        else:
            # parse and load the resume document
            documents = LlamaParse(
                api_key=llama_cloud_api_key,
                # base_url=os.getenv("LLAMA_CLOUD_BASE_URL"),
                result_type=ResultType.MD,
                user_prompt="This is a resume, gather related facts together and format it as bullet points with headers",
            ).load_data(ev.resume_file)

            # embed and index the documents
            index = VectorStoreIndex.from_documents(
                documents,
                embed_model=OpenAIEmbedding(
                    model_name="Cohere-embed-v3-english",
                    api_key=openai_api_key,
                    api_base=os.environ["OPENAI_API_BASE"],
                ),
            )
            index.storage_context.persist(persist_dir=self.storage_dir)

        # create a query engine
        self.query_engine = index.as_query_engine(llm=self.llm, similarity_top_k=5)

        # you no longer need a query to be passed in,
        # you'll be generating the queries instead
        # let's pass the application form to a new step to parse it
        return ParseFormEvent(application_form=ev.application_form)

    @step
    async def parse_form(self, ctx: Context, ev: ParseFormEvent) -> QueryEvent:
        parser = LlamaParse(
            api_key=llama_cloud_api_key,
            # base_url=os.getenv("LLAMA_CLOUD_BASE_URL"),
            result_type=ResultType.MD,
            user_prompt="This is a job application form."
            " Create a list of all the fields that need to be filled in."
            " Return a bulleted list of the fields ONLY.",
        )

        # get the LLM to convert the parsed form into JSON
        result = parser.load_data(ev.application_form)[0]
        raw_json = self.llm.complete(
            prompt=f"""
            This is a parsed form. 
            Convert it into a JSON object containing only the list 
            of fields to be filled in, in the form {{ fields: [...] }}. 
            <form>{result.text}</form>. 
            Return JSON ONLY, no markdown.
            """
        )

        fields = json.loads(raw_json.text)["fields"]

        for field in fields:
            print(field)

        # Placeholder: end the run here for now. The next section replaces
        # this with one QueryEvent per field (hence the QueryEvent return
        # annotation on this step).
        return StopEvent(result="Dummy event")

    # will be edited in the next section
    @step
    async def ask_question(self, ctx: Context, ev: QueryEvent) -> StopEvent:
        response = self.query_engine.query(
            f"This is a question about the specific resume we have in our database: {ev.query}"
        )
        return StopEvent(result=response.response)

In [14]:
workflow = RAGWorkflow(timeout=60, verbose=False)
result = await workflow.run(
    resume_file="./data/fake_resume.pdf",
    application_form="./data/fake_application_form.pdf",
)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage/index_store.json.
Started parsing the file under job_id 507e6a5a-137f-43f5-81a2-adf89251d9c3
Position
First Name
Last Name
Email
Phone
Linkedin
Project Portfolio
Degree
Graduation Date
Current Job Title
Current Employer
Technical Skills
Describe why you’re a good fit for this position
Do you have 5 years of experience in React?


### Generating Questions (second update)

Your workflow now knows which fields it needs answers for.
In this iteration, you'll fire off one `QueryEvent` per field so that the queries run concurrently (concurrent steps were covered in Lesson 2).

The changes you're going to make are:
* Generate a `QueryEvent` for each of the questions you pulled out of the form
* Create a `fill_in_application` step which will take all the responses to the questions and aggregate them into a coherent response
* Add a `ResponseEvent` to pass the results of queries to `fill_in_application`
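
The fan-out and fan-in shape of these changes can be sketched with plain `asyncio`, independent of the Workflow API. This is only an illustration of the pattern: `answer_field` and `fill_in` are hypothetical stand-ins, and the workflow in this lesson achieves the same effect with events (`ctx.send_event` and `ctx.collect_events`) rather than with `asyncio.gather`.

```python
import asyncio


async def answer_field(field: str) -> tuple[str, str]:
    """Stand-in for one query against the resume index (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work such as an LLM call
    return field, f"answer for {field}"


async def fill_in(fields: list[str]) -> dict[str, str]:
    # Fan out: start one task per field so the queries overlap in time.
    tasks = [asyncio.create_task(answer_field(f)) for f in fields]
    # Fan in: wait until every task has finished, then aggregate the results.
    results = await asyncio.gather(*tasks)
    return dict(results)


# In a notebook, this relies on nest_asyncio.apply() having been called,
# as it is earlier in this lesson; alternatively, use `await fill_in(...)`.
answers = asyncio.run(fill_in(["First Name", "Email", "Degree"]))
print(answers)
```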

In [15]:
class ParseFormEvent(Event):
    application_form: str


class QueryEvent(Event):
    query: str
    field: str


# new!
class ResponseEvent(Event):
    field: str
    response: str

In [None]:
class RAGWorkflow(Workflow):

    storage_dir: str = "./storage"
    llm: OpenAI
    query_engine: BaseQueryEngine

    @step
    async def set_up(self, ctx: Context, ev: StartEvent) -> ParseFormEvent:
        if not ev.resume_file:
            raise ValueError("No resume file provided")

        if not ev.application_form:
            raise ValueError("No application form provided")

        # define the LLM to work with
        self.llm = OpenAI(
            model="gpt-4.1-nano",
            api_key=openai_api_key,
            api_base=os.environ["OPENAI_API_BASE"],
            temperature=0.5,
        )

        # ingest the data and set up the query engine
        if os.path.exists(self.storage_dir):
            # you've already ingested the resume document
            storage_context = StorageContext.from_defaults(persist_dir=self.storage_dir)
            index = load_index_from_storage(
                storage_context=storage_context,
                embed_model=OpenAIEmbedding(
                    model_name="Cohere-embed-v3-english",
                    api_key=openai_api_key,
                    api_base=os.environ["OPENAI_API_BASE"],
                ),
            )
        else:
            # parse and load the resume document
            documents = LlamaParse(
                api_key=llama_cloud_api_key,
                # base_url=os.getenv("LLAMA_CLOUD_BASE_URL"),
                result_type=ResultType.MD,
                user_prompt="This is a resume, gather related facts together and format it as bullet points with headers",
            ).load_data(ev.resume_file)

            # embed and index the documents
            index = VectorStoreIndex.from_documents(
                documents=documents,
                embed_model=OpenAIEmbedding(
                    model_name="Cohere-embed-v3-english",
                    api_key=openai_api_key,
                    api_base=os.environ["OPENAI_API_BASE"],
                ),
            )
            index.storage_context.persist(persist_dir=self.storage_dir)

        # create a query engine
        self.query_engine = index.as_query_engine(llm=self.llm, similarity_top_k=5)

        # you no longer need a query to be passed in,
        # you'll be generating the queries instead
        # let's pass the application form to a new step to parse it
        return ParseFormEvent(application_form=ev.application_form)

    @step
    async def parse_form(self, ctx: Context, ev: ParseFormEvent) -> QueryEvent:
        parser = LlamaParse(
            api_key=llama_cloud_api_key,
            # base_url=os.getenv("LLAMA_CLOUD_BASE_URL"),
            result_type=ResultType.MD,
            user_prompt="This is a job application form."
            " Create a list of all the fields that need to be filled in."
            " Return a bulleted list of the fields ONLY.",
        )

        # get the LLM to convert the parsed form into JSON
        result = parser.load_data(ev.application_form)[0]
        raw_json = self.llm.complete(
            f"""
            This is a parsed form. 
            Convert it into a JSON object containing only the list 
            of fields to be filled in, in the form {{ fields: [...] }}. 
            <form>{result.text}</form>. 
            Return JSON ONLY, no markdown.
            """
        )
        fields = json.loads(raw_json.text)["fields"]

        # new!
        # store the number of fields first, so fill_in_application knows
        # how many responses to wait for before any of them can arrive
        await ctx.store.set("total_fields", len(fields))

        # generate one query for each of the fields, and fire them off
        for field in fields:
            ctx.send_event(
                QueryEvent(
                    field=field,
                    query=f"How would you answer this question about the candidate? {field}",
                )
            )

    @step
    async def ask_question(self, ctx: Context, ev: QueryEvent) -> ResponseEvent:
        response = self.query_engine.query(
            f"This is a question about the specific resume we have in our database: {ev.query}"
        )
        return ResponseEvent(field=ev.field, response=response.response)

    # new!
    @step
    async def fill_in_application(
        self, ctx: Context, ev: ResponseEvent
    ) -> StopEvent | None:
        # get the total number of fields to wait for
        total_fields = await ctx.store.get("total_fields")

        responses = ctx.collect_events(ev, [ResponseEvent] * total_fields)
        if responses is None:
            return None  # do nothing if there's nothing to do yet

        # we've got all the responses!
        response_list = "\n".join(
            f"Field: {r.field}\nResponse: {r.response}" for r in responses
        )

        result = self.llm.complete(
            f"""
            You are given a list of fields in an application form and responses to
            questions about those fields from a resume. Combine the two into a list of
            fields and succinct, factual answers to fill in those fields.

            <responses>
            {response_list}
            </responses>
            """
        )
        return StopEvent(result=result)

In [17]:
workflow = RAGWorkflow(timeout=120, verbose=False)

result = await workflow.run(
    resume_file="./data/fake_resume.pdf",
    application_form="./data/fake_application_form.pdf"
)

print("-=-=-=-=- Result -=-=-=-=-")
print(result)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage/index_store.json.
Started parsing the file under job_id ade3e40c-083b-49e6-ab48-4002476ee3f6




-=-=-=-=- Result -=-=-=-=-
- **Position:** Full Stack Developer with extensive experience in developing and maintaining web applications, leading technical teams, and implementing scalable solutions. Expertise in frontend frameworks like React.js, Vue.js, and Next.js; backend technologies including Node.js, Django, and cloud services. Skilled in CI/CD, performance optimization, and accessibility. Proven track record of mentoring junior developers and improving code quality.

- **First Name:** Sarah

- **Last Name:** Chen

- **Email:** sarah.chen@email.com

- **Phone:** No phone number provided; contact includes email, LinkedIn, GitHub, and portfolio website.

- **LinkedIn:** linkedin.com/in/sarahchen

- **Project Portfolio:** Includes EcoTrack, a full-stack carbon footprint tracking app with machine learning; ChatFlow, a real-time, end-to-end encrypted chat app serving over 5,000 monthly users; demonstrates building scalable, user-focused applications with modern technologies; recogniz
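
The `fill_in_application` step is triggered once per `ResponseEvent`, but it only produces a result after the last one arrives. The buffering idea behind that can be sketched in plain Python. This is a hypothetical illustration, not how the library implements `collect_events`; `Response` and `Collector` are made-up names, and the two sample values are taken from the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Stand-in for the lesson's ResponseEvent (hypothetical)."""

    field: str
    response: str


class Collector:
    """Buffers responses until the expected number has arrived (illustration only)."""

    def __init__(self, expected: int) -> None:
        self.expected = expected
        self._buffer: list[Response] = []

    def collect(self, ev: Response) -> Optional[list[Response]]:
        self._buffer.append(ev)
        if len(self._buffer) < self.expected:
            return None  # still waiting for more responses
        # Everything has arrived: hand over the batch and reset the buffer.
        ready, self._buffer = self._buffer, []
        return ready


collector = Collector(expected=2)
first = collector.collect(Response("First Name", "Sarah"))
second = collector.collect(Response("Last Name", "Chen"))

summary = "\n".join(f"Field: {r.field}\nResponse: {r.response}" for r in second)
print(first)
print(summary)
```

The first call returns `None`, which mirrors the early `return None` in the step; only the final call returns the full batch that is then joined into the prompt.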

### Workflow Visualization

In [18]:
WORKFLOW_FILE = "./workflows/form_parsing_workflow.html"
draw_all_possible_flows(workflow, filename=WORKFLOW_FILE)

./workflows/form_parsing_workflow.html


Your workflow now takes every field in the form and generates a plausible answer for each one. There are a couple of fields where I think it can do better (for example, the Phone answer above is a sentence rather than a phone number), and in the next lesson you'll add the ability to give that feedback to the agent and get it to try again.