Skip to content

andrusha/sqlite-parser-nom

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sqlite-parser-nom

SQLite binary database format parser.

Homonym libraries:

Usage

In your Cargo.toml:

[dependencies]
sqlite-parser-nom = "1.0.0"

Lazily parse the file

Load and parse file in memory:

use sqlite_parser_nom::Reader;
use sqlite_parser_nom::error;

fn main() -> Result<(), error::SQLiteError> {
    let reader = Reader::open_mmap("sample/sakila.db")?;
    println!("{}", reader.header.db_size);

    Ok(())
}

Parse a slice

You can also use parsers directly

use nom::Finish;
use sqlite_parser_nom::parser;
use sqlite_parser_nom::model;
use sqlite_parser_nom::error;

fn do_something_with_page(i: &[u8]) -> Result<model::Page, error::SQLiteError> {
    let (_, page) = parser::page(i)
        .finish()
        // the cast is necessary here, so the error could outlive the input 
        .map_err(|e| nom::error::Error {
            code: e.code,
            input: error::OwnedBytes(e.input.to_owned()),
        })?;

    Ok(page)
}

Check the documentation and parser to chose correct parser for your task.

SQLite format specification

References:

Physical structure

Database file

+---+-------+-----------+-----------+-----------+
| h |       |           |           |           |
| e |       |           |           |           |
| a | root  |   page 2  |    ...    |   page N  |
| d | page  |           |           |           |
| e |       |           |           |           |
| r |       |           |           |           |
+---+-------+-----------+-----------+-----------+
            ^           ^           ^
< page size | page size | page size | page size >
  • The SQLite database file is divided into equally-sized pages
    • Page size is defined within the header
    • Root page includes file header, but together still fits page size
    • All pages, including root page, count internal offsets from the beginning of the page itself
    • Pages are referenced by the number, therefore their position in the binary file can be computed

Page

+---------------------+
|     page header     |
+---------------------+
| cell pointer array  |
+---------------------+
|                     |
|  unallocated space  |
|                     |
+-------+-------------+
|cell N |free block   |
+-------+------+------+
|cell 5 |cell 4|cell 3|
+-------+----+-+------+
|   cell 2   | cell 1 |
+------------+--------+

Page types:

  • Both Index and Table pages in their Interior and Leaf flavour have the same structure, but differ in the header and some extra fields
    • See models for exact definition and BTree section for logic
  • Overflow page just has 0x00 in header and the rest is payload
  • Locking page is empty page in databases > 1gb at 1,073,741,824 offset
  • Pointer page exists in autovacuumed DBs and contains pointers to reorganized pages
  • Free blocks are stored in free-list and are not nulled, they might contain data, which was supposed to be removed

Page structure:

  • Cell pointer array grows from the top of the page to the bottom
    • Pointers are byte offsets within the page
  • Cells grow from the bottom of the page to the top

Cell

+-----------+--------+--------+--------+-----+--------+-----------+-----+-----------+
|Payload    |        | Header | Serial |     | Serial | Data Cell |     | Data Cell |
|(Cell Size)|  ...   |  Size  | Type 1 | ... | Type N | Column 1  | ... | Column N  |
+-----------+--------+--------+--------+-----+--------+-----------+-----+-----------+
                     |                                |
<    cell header     ^      record header             ^        table row data       >
                     <                            cell size                         >
  • This structure with some amendments applies to Table Leaf, Index Leaf and Interior pages
  • Table Interior page contains pointers to other pages and corresponding rowid
  • Header and auxillary values within the cell are mostly as varint
  • Serial types correspond to the data payloads and contain size information within them