# Essay grading system

Essay grading is a time-intensive task that requires evaluating multiple dimensions of writing quality - from basic grammar and structure to more nuanced aspects like relevance and analytical depth. While human graders bring invaluable expertise to this process, they can also introduce inconsistency and are limited by time constraints when dealing with large volumes of essays.

This notebook demonstrates how to build an intelligent essay grading agent that leverages LLMs and graph-based workflows to automate the evaluation process. Rather than treating essay grading as a single monolithic task, we break it down into distinct evaluation components - relevance, grammar, structure, and depth of analysis - each handled by specialized functions within a conditional workflow.

Our goal is to simulate a nuanced, modular grading process that mirrors how a human evaluator might score essays, built on graph-based logic in which the grading path adapts to intermediate results.

In [1]:
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
import os
from dotenv import load_dotenv
import re
from IPython.display import display, Image
from langchain_core.runnables.graph import MermaidDrawMethod

# Load environment variables from .env file
load_dotenv()

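# Fail early with a clear message if the OpenAI API key is missing
# (assigning a missing key's None value to os.environ would raise a less helpful TypeError)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file or environment.")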

### State definition
The core of our grading system relies on maintaining state throughout the evaluation process. We define a structured state object that tracks both the essay content and all scoring components as the essay moves through different evaluation stages.

In [2]:
# Define the shared state that flows through the graph
class State(TypedDict):
    """Represents the state of the essay grading process."""
    essay: str  # The original essay text to be graded
    relevance_score: float  # Score for topic relevance (0-1)
    grammar_score: float  # Score for grammar and language usage (0-1)
    structure_score: float  # Score for essay organization and flow (0-1)
    depth_score: float  # Score for analytical depth and insight (0-1)
    final_score: float  # Weighted final score combining all components

The `State` class serves as our data container throughout the grading workflow. By using `TypedDict`, we get the benefits of type hints while maintaining the dictionary-like access patterns that work well with LangGraph. Each score field is initialized and updated as the essay progresses through different evaluation stages, creating a comprehensive assessment profile.
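To make the dictionary-like behavior concrete, here is a minimal sketch. It assumes a stand-in class named `DemoState` (a hypothetical copy of `State`, renamed so it does not shadow the notebook's class) and a placeholder essay string. It shows that a `TypedDict` instance is an ordinary `dict` at runtime, so the type hints guide tooling but are not enforced.

```python
from typing import TypedDict, get_type_hints


class DemoState(TypedDict):
    """Hypothetical copy of the notebook's State, repeated so this cell runs alone."""
    essay: str
    relevance_score: float
    grammar_score: float
    structure_score: float
    depth_score: float
    final_score: float


# Calling a TypedDict class builds a plain dict; no validation happens at runtime.
state = DemoState(
    essay="A short placeholder essay.",  # illustrative value, not from the notebook
    relevance_score=0.0,
    grammar_score=0.0,
    structure_score=0.0,
    depth_score=0.0,
    final_score=0.0,
)
print(type(state) is dict)

# Nodes read and write fields with normal key access.
state["relevance_score"] = 0.8
print(state["relevance_score"])

# The annotations remain available to static type checkers and other tooling.
print(sorted(get_type_hints(DemoState)))
```

Because nothing is validated at runtime, a node that writes a misspelled key or a non-numeric value would only be caught by a static type checker, not by Python itself.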

### Language model initialization
We initialize the OpenAI model that will serve as the "expert grader" for each component of our evaluation process.

In [3]:
# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18")

This model is called multiple times during grading: once for each evaluation component that the workflow reaches.

### Define node functions for grading
We will create specialized functions for each stage of the essay evaluation. In LangGraph, each step of the process is encapsulated in a node that performs a specific task, here scoring relevance, grammar, structure, or depth. Each function returns the updated state.

- Each function takes the current State dictionary, performs a specific analysis on the essay text (e.g. grammar, structure), and updates the state with a corresponding score between 0 and 1.
- Before each grading function can assign a score, we need to reliably extract the score from the language model's response. For this, we introduce a helper function, `extract_score`, which uses a regular expression to parse numeric scores prefixed by `Score:`.
- Finally, we implement a function to compute the final weighted score, combining the individual scores using a custom weighting scheme: relevance and depth are weighted more heavily, as they often reflect higher-order thinking and topic engagement.

In [4]:
# Helper function to extract a numeric score from the LLM's response
def extract_score(content: str) -> float:
    """Extract the numeric score from the LLM's response."""
    # Regex pattern looks for a float or integer following "Score:"
    match = re.search(r'Score:\s*(\d+(\.\d+)?)', content)
    if match:
        return float(match.group(1))
    # If no valid score is found, raise an error so the calling function can fall back to a default.
    raise ValueError(f"Could not extract score from: {content}")

# Relevance scoring function
def check_relevance(state: State) -> State:
    """Check the relevance of the essay."""
    prompt = ChatPromptTemplate.from_template(
        "Analyze the relevance of the following essay to the given topic. "
        "Provide a relevance score between 0 and 1. "
        "Your response should start with 'Score: ' followed by the numeric score, "
        "then provide your explanation.\n\nEssay: {essay}"
    )
    # Pass the formatted essay into the LLM and get its response
    result = llm.invoke(prompt.format(essay=state["essay"]))
    try:
        # Try to extract and save the relevance score
        state["relevance_score"] = extract_score(result.content)
    except ValueError as e:
        # Fallback in case of malformed or missing score
        print(f"Error in check_relevance: {e}")
        state["relevance_score"] = 0.0
    return state

# Grammar scoring function
def check_grammar(state: State) -> State:
    """Check the grammar of the essay."""
    prompt = ChatPromptTemplate.from_template(
        "Analyze the grammar and language usage in the following essay. "
        "Provide a grammar score between 0 and 1. "
        "Your response should start with 'Score: ' followed by the numeric score, "
        "then provide your explanation.\n\nEssay: {essay}"
    )
    result = llm.invoke(prompt.format(essay=state["essay"]))
    try:
        state["grammar_score"] = extract_score(result.content)
    except ValueError as e:
        print(f"Error in check_grammar: {e}")
        state["grammar_score"] = 0.0
    return state

# Structure analysis function
def analyze_structure(state: State) -> State:
    """Analyze the structure of the essay."""
    prompt = ChatPromptTemplate.from_template(
        "Analyze the structure of the following essay. "
        "Provide a structure score between 0 and 1. "
        "Your response should start with 'Score: ' followed by the numeric score, "
        "then provide your explanation.\n\nEssay: {essay}"
    )
    result = llm.invoke(prompt.format(essay=state["essay"]))
    try:
        state["structure_score"] = extract_score(result.content)
    except ValueError as e:
        print(f"Error in analyze_structure: {e}")
        state["structure_score"] = 0.0
    return state

# Depth of analysis evaluation
def evaluate_depth(state: State) -> State:
    """Evaluate the depth of analysis in the essay."""
    prompt = ChatPromptTemplate.from_template(
        "Evaluate the depth of analysis in the following essay. "
        "Provide a depth score between 0 and 1. "
        "Your response should start with 'Score: ' followed by the numeric score, "
        "then provide your explanation.\n\nEssay: {essay}"
    )
    result = llm.invoke(prompt.format(essay=state["essay"]))
    try:
        state["depth_score"] = extract_score(result.content)
    except ValueError as e:
        print(f"Error in evaluate_depth: {e}")
        state["depth_score"] = 0.0
    return state

# Aggregate scores into a single weighted final score
def calculate_final_score(state: State) -> State:
    """Calculate the final score based on individual component scores."""
    # Weight distribution: Relevance (30%), Grammar (20%), Structure (20%), Depth (30%)
    state["final_score"] = (
        state["relevance_score"] * 0.3 +
        state["grammar_score"] * 0.2 +
        state["structure_score"] * 0.2 +
        state["depth_score"] * 0.3
    )
    return state

Here, we define all the node functions that form the building blocks of our grading pipeline. Each function takes in a shared `state` dictionary, evaluates one aspect of the essay using a prompt fed into an LLM, and updates the respective score field in the state.
- A template prompt is used to instruct the LLM to generate a score and a short justification.
- The score is extracted using a regular expression that parses the model’s response text for the format `"Score: X"`.
- If parsing fails (e.g., malformed output), the system defaults that score to `0.0` but does not stop the overall grading process.
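- The `calculate_final_score` function combines the component scores as a weighted sum, with relevance and depth (0.3 each) weighted more heavily than grammar and structure (0.2 each). Any component that was not evaluated, or whose score could not be parsed, stays at `0.0` and therefore lowers the final score. These weights can be adjusted based on specific grading requirements or educational contexts.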
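The parsing and fallback behavior described in the list above can be tried in isolation, without calling the model. The following is a sketch rather than the notebook's implementation: `parse_score` is a hypothetical helper that folds the regular expression and the `0.0` fallback into one function, and the clamping to the range 0 to 1 is an added assumption, since the prompts ask for that range but the original regex does not enforce it.

```python
import re

# Same idea as the notebook's pattern: a float or integer following "Score:"
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_score(content: str, default: float = 0.0) -> float:
    """Return the number after 'Score:', clamped to [0, 1], or `default` if absent."""
    match = SCORE_PATTERN.search(content)
    if match is None:
        return default
    # Clamping is an assumption added here; the notebook accepts any parsed value.
    return min(1.0, max(0.0, float(match.group(1))))


print(parse_score("Score: 0.85\nThe essay stays on topic."))  # 0.85
print(parse_score("Score: 1"))                                 # 1.0
print(parse_score("Score: 7 out of 10"))                       # 1.0 (clamped)
print(parse_score("I would rate this essay highly."))          # 0.0 (fallback)
```

The third case shows why a range check may be worth considering: if the model answers on a 10-point scale, an unclamped value of 7 would dominate the weighted final score.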

### Workflow definition and graph construction
Now that we have all our grading functions, we need to orchestrate them into an intelligent workflow. This section creates the graph structure that defines how essays flow through different evaluation stages based on conditional logic. Instead of evaluating every essay in the exact same way, we want to dynamically adapt the evaluation path based on intermediate results.

In [5]:
# Create the graph by specifying the state type it operates on
workflow = StateGraph(State)

# Add each grading function as a node in the workflow graph
workflow.add_node("check_relevance", check_relevance)
workflow.add_node("check_grammar", check_grammar)
workflow.add_node("analyze_structure", analyze_structure)
workflow.add_node("evaluate_depth", evaluate_depth)
workflow.add_node("calculate_final_score", calculate_final_score)

# Define conditional edges that determine workflow path based on scores
# If relevance is too low (≤0.5), skip detailed analysis and go straight to final scoring
workflow.add_conditional_edges(
    "check_relevance",
    lambda x: "check_grammar" if x["relevance_score"] > 0.5 else "calculate_final_score"
)
# If grammar is poor (≤0.6), skip structure analysis
workflow.add_conditional_edges(
    "check_grammar",
    lambda x: "analyze_structure" if x["grammar_score"] > 0.6 else "calculate_final_score"
)
# If structure is poor (≤0.7), skip depth analysis
workflow.add_conditional_edges(
    "analyze_structure",
    lambda x: "evaluate_depth" if x["structure_score"] > 0.7 else "calculate_final_score"
)
# After depth evaluation, always proceed to final scoring (unconditional, so a plain edge suffices)
workflow.add_edge("evaluate_depth", "calculate_final_score")

# Set the entry point for the workflow
workflow.set_entry_point("check_relevance")

# Connect final scoring to the end of the workflow
workflow.add_edge("calculate_final_score", END)

# Compile the graph into an executable application
app = workflow.compile()

This workflow definition creates an intelligent grading system that can adapt its evaluation depth based on interim results. The conditional logic implements a practical approach: if an essay fails at a fundamental level (poor relevance), there is less value in detailed structural or depth analysis. The threshold values (0.5 for relevance, 0.6 for grammar, 0.7 for structure) can be adjusted based on grading standards and requirements.

This logic is expressed using `add_conditional_edges`, which accepts a lambda function that inspects the current state and decides what node to go to next. Each evaluation step updates the shared `state`, which makes the logic clean and easy to trace.

The graph structure allows for efficient processing: high-quality essays go through all evaluation stages, while essays with fundamental issues exit early and receive a faster assessment. Stages that are skipped keep their initial score of `0.0`, so an early exit also lowers the final score. This mirrors how experienced human graders might triage essays during the grading process.
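The routing can be simulated in plain Python to see which nodes a given essay would visit and what final score results. This is a sketch under stated assumptions: `STAGES`, `WEIGHTS`, and `trace_path` are hypothetical names introduced for illustration, the thresholds and weights are copied from the notebook's code, and no LLM or LangGraph call is made. The scores each node "would" assign are supplied by hand.

```python
from typing import Dict, List, Optional, Tuple

# (node name, score field it writes, threshold that must be exceeded to continue)
STAGES: List[Tuple[str, str, Optional[float]]] = [
    ("check_relevance", "relevance_score", 0.5),
    ("check_grammar", "grammar_score", 0.6),
    ("analyze_structure", "structure_score", 0.7),
    ("evaluate_depth", "depth_score", None),  # always proceeds to final scoring
]

# Weights used by calculate_final_score
WEIGHTS = {
    "relevance_score": 0.3,
    "grammar_score": 0.2,
    "structure_score": 0.2,
    "depth_score": 0.3,
}


def trace_path(would_score: Dict[str, float]) -> Tuple[List[str], float]:
    """Return the visited nodes and final score for hand-supplied component scores."""
    state = {field: 0.0 for field in WEIGHTS}  # same defaults as the initial state
    visited = []
    for node, field, threshold in STAGES:
        visited.append(node)
        state[field] = would_score[field]
        if threshold is not None and state[field] <= threshold:
            break  # gate not passed: remaining evaluation nodes are skipped
    visited.append("calculate_final_score")
    final = sum(state[field] * weight for field, weight in WEIGHTS.items())
    return visited, round(final, 2)


strong = {"relevance_score": 1.0, "grammar_score": 0.9,
          "structure_score": 0.9, "depth_score": 0.7}
weak_grammar = {"relevance_score": 0.9, "grammar_score": 0.5,
                "structure_score": 0.9, "depth_score": 0.9}

print(trace_path(strong))
print(trace_path(weak_grammar))
```

With these inputs the first essay visits every node and scores 0.87, while the second stops after `check_grammar` and scores 0.37, because its structure and depth are never evaluated and remain at zero.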

### Essay grading function
Now that we have defined the full grading pipeline using LangGraph, including the evaluation nodes and the adaptive workflow, it is time to wrap the whole system into a single function: `grade_essay`.

The purpose of this function is to provide a simple, clean interface to pass in an essay and receive a fully evaluated score breakdown. Internally, it constructs the initial state (with default scores), triggers the LangGraph `app` we previously compiled, and returns the final evaluated state.

In [7]:
def grade_essay(essay: str) -> dict:
    """Grade the given essay using the defined workflow."""
    # Initialize the state with the essay and zero scores
    # All score fields start at 0.0 and will be updated as the workflow progresses.
    initial_state = State(
        essay=essay,
        relevance_score=0.0,
        grammar_score=0.0,
        structure_score=0.0,
        depth_score=0.0,
        final_score=0.0
    )

    # Execute the grading workflow - invoke the LangGraph application with the initial state
    result = app.invoke(initial_state)
    return result

This function essentially acts as the entry point for the grading engine. It packages the input essay into a `State` object, initializes all score fields to 0, and runs it through the previously compiled LangGraph workflow.

### Sample essay
To demonstrate our grading system, we will use a sample essay that covers a relevant topic with reasonable structure and analysis. This allows us to see how the system evaluates a typical academic essay.

In [8]:
sample_essay = """
    The Impact of Artificial Intelligence on Modern Society

    Artificial Intelligence (AI) has become an integral part of our daily lives,
    revolutionizing various sectors including healthcare, finance, and transportation.
    This essay explores the profound effects of AI on modern society, discussing both
    its benefits and potential challenges.

    One of the most significant impacts of AI is in the healthcare industry.
    AI-powered diagnostic tools can analyze medical images with high accuracy,
    often surpassing human capabilities. This leads to earlier detection of diseases
    and more effective treatment plans. Moreover, AI algorithms can process vast
    amounts of medical data to identify patterns and insights that might escape
    human observation, potentially leading to breakthroughs in drug discovery and
    personalized medicine.

    In the financial sector, AI has transformed the way transactions are processed
    and monitored. Machine learning algorithms can detect fraudulent activities in
    real-time, enhancing security for consumers and institutions alike. Robo-advisors
    use AI to provide personalized investment advice, democratizing access to
    financial planning services.

    The transportation industry is another area where AI is making significant strides.
    Self-driving cars, powered by complex AI systems, promise to reduce accidents
    caused by human error and provide mobility solutions for those unable to drive.
    In logistics, AI optimizes routing and inventory management, leading to more
    efficient supply chains and reduced environmental impact.

    However, the rapid advancement of AI also presents challenges. There are concerns
    about job displacement as AI systems become capable of performing tasks
    traditionally done by humans. This raises questions about the need for retraining
    and reskilling the workforce to adapt to an AI-driven economy.

    Privacy and ethical concerns also arise with the increasing use of AI. The vast
    amount of data required to train AI systems raises questions about data privacy
    and consent. Additionally, there are ongoing debates about the potential biases
    in AI algorithms and the need for transparent and accountable AI systems.

    In conclusion, while AI offers tremendous benefits and has the potential to solve
    some of humanity's most pressing challenges, it also requires careful consideration
    of its societal implications. As we continue to integrate AI into various aspects
    of our lives, it is crucial to strike a balance between technological advancement
    and ethical considerations, ensuring that the benefits of AI are distributed
    equitably across society.
    """

### Grading the sample essay
Finally, we will run our grading system on the sample essay and display the comprehensive results, showing both individual component scores and the final weighted grade.


In [9]:
# Grade the sample essay
result = grade_essay(sample_essay)

# Display the results
print(f"Final Essay Score: {result['final_score']:.2f}\n")
print(f"Relevance Score: {result['relevance_score']:.2f}")
print(f"Grammar Score: {result['grammar_score']:.2f}")
print(f"Structure Score: {result['structure_score']:.2f}")
print(f"Depth Score: {result['depth_score']:.2f}")

Final Essay Score: 0.87

Relevance Score: 1.00
Grammar Score: 0.90
Structure Score: 0.90
Depth Score: 0.70


This demonstration shows the grading output, providing both the final composite score and the breakdown of individual component scores. This transparency is crucial for educational applications, as it allows students and instructors to understand which aspects of the essay were strong and which areas might need improvement.
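As one possible way to turn the breakdown into feedback, the sketch below picks out the lowest-scoring component. `COMPONENTS` and `weakest_component` are hypothetical names introduced for illustration, and `sample_result` simply repeats the scores printed above rather than calling the grader.

```python
from typing import Dict

# Human-readable labels for the score fields defined in State
COMPONENTS = {
    "relevance_score": "Relevance",
    "grammar_score": "Grammar",
    "structure_score": "Structure",
    "depth_score": "Depth",
}


def weakest_component(result: Dict[str, float]) -> str:
    """Return the label of the lowest-scoring component (first one wins on ties)."""
    field = min(COMPONENTS, key=lambda name: result[name])
    return COMPONENTS[field]


# Scores copied from the sample run above
sample_result = {
    "relevance_score": 1.00,
    "grammar_score": 0.90,
    "structure_score": 0.90,
    "depth_score": 0.70,
    "final_score": 0.87,
}

print(f"Weakest area: {weakest_component(sample_result)}")  # Weakest area: Depth
```

Note that a component skipped by the conditional routing stays at `0.0` and would be reported as the weakest area, so a real report would probably need to distinguish "not evaluated" from "scored zero".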