Skip to content

Multi-strategy parser for messy, real-world data. Built to handle LLM responses with broken JSON, markdown wrappers, type mismatches, and inconsistent formatting.

License

Notifications You must be signed in to change notification settings

microagents/tryparse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tryparse

Multi-strategy parser for messy, real-world data. Built to handle LLM responses with broken JSON, markdown wrappers, type mismatches, and inconsistent formatting.

Quick Start

[dependencies]
tryparse = "0.3"
serde = { version = "1.0", features = ["derive"] }
use tryparse::parse;
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct User {
    name: String,
    age: u32,
}

fn main() {
    // Handles markdown wrappers, trailing commas, unquoted keys, type coercion
    let messy_input = r#"
    Here's your data:
    ```json
    {
      name: "Alice",
      age: "30",
    }
    ```
    "#;

    let user: User = parse(messy_input).unwrap();
    println!("{:?}", user); // User { name: "Alice", age: 30 }
}

Core Features

With serde::Deserialize

Basic type coercion works out of the box:

use tryparse::parse;
use serde::Deserialize;

#[derive(Deserialize)]
struct Data {
    count: i64,      // "42" → 42
    price: f64,      // "3.14" → 3.14
    active: bool,    // "true" → true
    tags: Vec<String>, // "tag" → ["tag"]
}

let data: Data = parse(r#"{"count": "42", "price": "3.14", "active": "true", "tags": "tag"}"#).unwrap();

With LlmDeserialize (derive feature)

Advanced features require the derive feature:

[dependencies]
tryparse = { version = "0.3", features = ["derive"] }
tryparse-derive = "0.3"

Fuzzy field matching - Handles different naming conventions:

use tryparse::parse_llm;
use tryparse_derive::LlmDeserialize;

#[derive(Debug, LlmDeserialize)]
struct Config {
    user_name: String,  // Matches: userName, UserName, user-name, user.name, USER_NAME
    max_count: i64,
}

let data: Config = parse_llm(r#"{"userName": "Alice", "maxCount": 30}"#).unwrap();

Enum fuzzy matching - Case-insensitive, partial matches:

#[derive(Debug, LlmDeserialize)]
enum Status {
    InProgress,  // Matches: "in_progress", "in-progress", "inprogress", "in progress"
    Completed,   // Matches: "complete", "COMPLETED", "done"
    Cancelled,
}

Union types - Automatically picks the best variant:

#[derive(Debug, LlmDeserialize)]
#[llm(union)]
enum Value {
    Number(i64),
    Text(String),
    List(Vec<String>),
}

// Parses as Number(42)
let v1: Value = parse_llm("42").unwrap();

// Parses as Text("hello")
let v2: Value = parse_llm(r#""hello""#).unwrap();

// Parses as List(...)
let v3: Value = parse_llm(r#"["a", "b"]"#).unwrap();

Implied key - Single-field structs unwrap values:

#[derive(Debug, LlmDeserialize)]
struct Wrapper {
    data: String,
}

// Direct string wraps into the single field
let w: Wrapper = parse_llm(r#""hello world""#).unwrap();
assert_eq!(w.data, "hello world");

API Reference

Basic Parsing

// Parse with serde::Deserialize
fn parse<T: DeserializeOwned>(input: &str) -> Result<T>

// Parse with serde::Deserialize, get all candidates
fn parse_with_candidates<T: DeserializeOwned>(input: &str) -> Result<(T, Vec<FlexValue>)>

// Parse with custom parser configuration
fn parse_with_parser<T: DeserializeOwned>(input: &str, parser: &FlexibleParser) -> Result<T>

Advanced Parsing (requires derive feature)

// Parse with LlmDeserialize trait (fuzzy matching, unions, etc.)
fn parse_llm<T: LlmDeserialize>(input: &str) -> Result<T>

// Parse with LlmDeserialize, get all candidates
fn parse_llm_with_candidates<T: LlmDeserialize>(input: &str) -> Result<(T, Vec<FlexValue>)>

Utilities

// Score a candidate (lower is better)
fn score_candidate(candidate: &FlexValue) -> u32

// Rank candidates by score
fn rank_candidates(candidates: &mut [FlexValue])

// Get the best candidate
fn best_candidate(candidates: &[FlexValue]) -> Option<&FlexValue>

How It Works

1. Multi-Stage Parsing Pipeline

Input String
    ↓
┌──────────────────────────────────┐
│ Pre-Processing                   │
│ • Remove BOM, zero-width chars   │
│ • Fix excessive nesting (>50)    │
│ • Normalize backslashes          │
└────────────┬─────────────────────┘
             ↓
┌──────────────────────────────────┐
│ Strategy Execution (parallel)    │
│ • DirectJson        (priority 1) │
│ • Markdown          (priority 2) │
│ • YAML              (priority 15)│
│ • JsonFixer         (priority 20)│
│ • Heuristic         (priority 30)│
│                                  │
│ → Produces Vec<FlexValue>        │
└────────────┬─────────────────────┘
             ↓
┌──────────────────────────────────┐
│ Scoring & Ranking                │
│ • Base score by source           │
│ • Transformation penalties       │
│ • Confidence adjustment          │
│ • Sort ascending (best first)    │
└────────────┬─────────────────────┘
             ↓
┌──────────────────────────────────┐
│ Deserialization                  │
│ • Try candidates in order        │
│ • Apply type coercion            │
│ • Track transformations          │
│ • Return first success           │
└──────────────────────────────────┘

2. Parsing Strategies

Strategy Priority Description
DirectJson 1 Direct serde_json::from_str(). Fastest path for valid JSON.
Markdown 2 Extracts from markdown code blocks. Scores by keywords, position, size.
YAML 15 Parses YAML, converts to JSON. Requires yaml feature.
JsonFixer 20 Fixes common JSON errors (see below).
Heuristic 30 Pattern-based extraction from prose. Last resort.

3. JSON Fixes Applied

The JsonFixer strategy handles:

  • Trailing commas: {"a": 1,}{"a": 1}
  • Unquoted keys: {name: "x"}{"name": "x"}
  • Single quotes: {'a': 1}{"a": 1}
  • Missing commas: {"a":1 "b":2}{"a":1,"b":2}
  • Unclosed braces/brackets: {"a": 1{"a": 1}
  • Comments: {"a": 1 /* comment */}{"a": 1}
  • Smart quotes: {"a": "value"}{"a": "value"}
  • Double-escaped JSON: "{\"a\":1}"{"a":1}
  • Template literals: {`key`: "value"}{"key": "value"}
  • Hex numbers: {"a": 0xFF}{"a": 255}
  • Unescaped newlines in strings
  • JavaScript functions: Removed entirely

4. Type Coercion

Applied during deserialization (works with both Deserialize and LlmDeserialize):

From To Example
String Number "42"42
String Bool "true"true
Number String 42"42"
Float Int 42.042
Single Array "item"["item"]

5. Field Matching (LlmDeserialize only)

Normalizes field names to snake_case and matches case-insensitively:

Struct Field Matches JSON Keys
user_name userName, UserName, user-name, user.name, USER_NAME, username
max_count maxCount, MaxCount, max-count, max.count, MAX_COUNT

Note: Does not handle acronyms perfectly. XMLParser becomes x_m_l_parser not xml_parser.

6. Scoring System

Base Scores (by source):

  • Direct JSON: 0
  • Markdown: 10
  • YAML: 15
  • Fixed JSON: 20 + (5 × number of fixes)
  • Heuristic: 50

Transformation Penalties:

  • String→Number: +2
  • Float→Int: +3
  • Field rename: +4
  • Single→Array: +5
  • Default inserted: +50

Confidence Modifier:

  • Each transformation reduces confidence by 5%
  • Final score += (1.0 - confidence) × 100

Lower scores win. Direct JSON with no coercion scores 0 (best possible).

Examples

Handling Markdown Responses

let llm_output = r#"
Sure! Here's the user data:

```json
{
  "name": "Alice",
  "age": 30,
  "email": "alice@example.com"
}

Let me know if you need anything else! "#;

#[derive(Deserialize, Debug)] struct User { name: String, age: i64, email: String, }

let user: User = parse(llm_output).unwrap();


### Inspecting Parse Candidates

```rust
use tryparse::{parse_with_candidates, scoring::score_candidate};

let (result, candidates) = parse_with_candidates::<User>(messy_input).unwrap();

println!("Best result: {:?}", result);
println!("\nAll candidates:");
for (i, candidate) in candidates.iter().enumerate() {
    println!("  {}: {:?} (score: {})",
        i,
        candidate.source(),
        score_candidate(candidate)
    );

    // Inspect transformations
    for t in candidate.transformations() {
        println!("    - {:?}", t);
    }
}

Custom Parser Configuration

use tryparse::parser::{FlexibleParser, strategies::*};

let parser = FlexibleParser::new()
    .with_strategy(DirectJsonStrategy)
    .with_strategy(MarkdownStrategy::new())
    .with_strategy(JsonFixerStrategy::new());

let data: User = parse_with_parser(input, &parser).unwrap();

Complex Nested Structures

use std::collections::HashMap;

#[derive(Debug, LlmDeserialize)]
struct Project {
    name: String,
    owner: User,
    status: Status,
    tags: Vec<String>,
    metadata: HashMap<String, String>,
}

let project: Project = parse_llm(complex_json).unwrap();

Union Types with Scoring

#[derive(Debug, LlmDeserialize)]
#[llm(union)]
enum Response {
    Success { data: User },
    Error { message: String, code: i64 },
    Pending { estimated_time: i64 },
}

// Automatically picks the variant that best matches the structure
let response: Response = parse_llm(api_response).unwrap();

match response {
    Response::Success { data } => println!("User: {:?}", data),
    Response::Error { message, code } => eprintln!("Error {}: {}", code, message),
    Response::Pending { estimated_time } => println!("Wait {}s", estimated_time),
}

Feature Flags

# Default: includes markdown and yaml
[dependencies]
tryparse = "0.3"

# Minimal build (core JSON parsing only)
tryparse = { version = "0.3", default-features = false }

# With derive macros for LlmDeserialize
tryparse = { version = "0.3", features = ["derive"] }

# All features
tryparse = { version = "0.3", features = ["derive", "markdown", "yaml"] }

Available features:

  • markdown (default) - Markdown code block extraction
  • yaml (default) - YAML parsing support
  • derive - Derive macro for LlmDeserialize (fuzzy field/enum matching, union types)

Testing

# Unit tests (227 tests in lib)
cargo test --lib

# All tests (lib + integration + doc tests)
cargo test --all-features

# Minimal build tests
cargo test --no-default-features

# Run specific test
cargo test --test integration_test -- --nocapture

Performance Considerations

  • Parsing is synchronous: No async/await support
  • Memory overhead: Tracks all parsing candidates and transformations
  • Strategy execution: Some strategies run in parallel
  • Regex compilation: Expensive regexes are compiled lazily and cached
  • Best for: <1MB inputs, occasional parsing (not high-frequency loops)

Optimizations:

  • Direct JSON (valid JSON) takes the fastest path
  • Failed strategies short-circuit early
  • Scoring is lazy (only computed when needed)
  • Copy semantics for small types (enums, etc.)

Debugging

Enable detailed logging:

env_logger::init();
std::env::set_var("RUST_LOG", "tryparse=debug");

let result = parse::<User>(input);

Inspect what went wrong:

match parse::<User>(input) {
    Ok(user) => println!("Success: {:?}", user),
    Err(e) => {
        eprintln!("Parse failed: {:?}", e);

        // Try getting candidates to see what was parsed
        if let Ok((_, candidates)) = parse_with_candidates::<User>(input) {
            for c in candidates {
                eprintln!("Candidate: {:?}", c.value);
            }
        }
    }
}

Known Limitations

  1. Synchronous only - No async parsing
  2. No streaming - Requires complete input string
  3. Memory overhead - Tracks all candidates and transformations
  4. Acronym handling - XMLParserx_m_l_parser (not xml_parser)
  5. Best-effort parsing - May produce unexpected results on ambiguous input
  6. No custom deserializers - Can't implement custom Deserialize logic for fields

Contributing

Requirements:

  • Rust 1.85.0+
  • Run cargo fmt before committing
  • Pass cargo clippy --all-targets --all-features
  • All tests must pass: cargo test --all-features

Pull request checklist:

  1. Clear description of what and why
  2. Tests for new functionality
  3. Update README if API changes
  4. No clippy warnings
  5. All existing tests pass

License

Apache-2.0

Credits

Parsing algorithms inspired by BAML's Schema-Aligned Parsing.

About

Multi-strategy parser for messy, real-world data. Built to handle LLM responses with broken JSON, markdown wrappers, type mismatches, and inconsistent formatting.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages