Check word lexemes without temp buffer

isagalaev committed Sep 25, 2015
1 parent fa6abc0 commit 7e341b3ee1b605415dcdb47055b944118e86e19e
Showing with 18 additions and 26 deletions.
  1. +18 −26 src/lexer.rs
@@ -14,14 +14,6 @@ fn is_whitespace(value: u8) -> bool {
     }
 }
-#[inline(always)]
-fn is_word(value: u8) -> bool {
-    match value {
-        b'a' ... b'z' => true,
-        _ => false,
-    }
-}
 fn is_number(value: u8) -> bool {
     match value {
         b'+' | b'-' | b'.' | b'0' ... b'9' | b'E' | b'e' => true,
@@ -160,20 +152,18 @@ impl<T: io::Read> Lexer<T> {
         Ok(result)
     }
-    fn consume_word(&mut self) -> Result<Vec<u8>> {
-        let mut result = vec![];
-        loop {
-            let start = self.pos;
-            while self.pos < self.len && is_word(self.buf[self.pos]) {
-                self.pos += 1;
+    fn check_word(&mut self, expected: &[u8]) -> Result<()> {
+        let mut iter = expected.iter();
+        while let Some(byte) = iter.next() {
+            if let Buffer::Empty = try!(self.ensure_buffer()) {
+                return Err(Error::Unknown(b"".to_vec()))
             }
-            result.extend(self.buf[start..self.pos].iter().cloned());
-            match try!(self.ensure_buffer()) {
-                Buffer::Reset => (), // continue
-                _ => break,
+            if self.buf[self.pos] != *byte {
+                return Err(Error::Unknown(self.buf[self.pos..self.pos + 1].to_vec()))
             }
+            self.pos += 1;
         }
-        Ok(result)
+        Ok(())
     }
     fn consume_number(&mut self) -> Result<Vec<u8>> {
@@ -207,13 +197,15 @@ impl<T: io::Read> Iterator for Lexer<T> {
         Some(Ok(if self.buf[self.pos] == b'"' {
             Lexeme::String(itry!(self.consume_string()))
-        } else if is_word(self.buf[self.pos]) {
-            match &itry!(self.consume_word())[..] {
-                b"true" => Lexeme::Boolean(true),
-                b"false" => Lexeme::Boolean(false),
-                b"null" => Lexeme::Null,
-                s @ _ => return Some(Err(Error::Unknown(s.to_owned()))),
-            }
+        } else if self.buf[self.pos] == b't' {
+            itry!(self.check_word(b"true"));
+            Lexeme::Boolean(true)
+        } else if self.buf[self.pos] == b'f' {
+            itry!(self.check_word(b"false"));
+            Lexeme::Boolean(false)
+        } else if self.buf[self.pos] == b'n' {
+            itry!(self.check_word(b"null"));
+            Lexeme::Null
         } else if is_number(self.buf[self.pos]) {
             let buffer = itry!(self.consume_number());
             let s = unsafe { str::from_utf8_unchecked(&buffer[..]) };
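
For readers who want to try the idea outside the lexer, here is a minimal self-contained sketch of the same byte-by-byte check over a refillable buffer. It is not the project's code: `ChunkedInput`, its fields, and the `bool` return values are hypothetical stand-ins for `Lexer`, `Buffer`, and `Error::Unknown`, and it is written in current Rust (`?` rather than `try!`).

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the lexer: a reader plus a small refill buffer.
struct ChunkedInput<R: Read> {
    reader: R,
    buf: Vec<u8>,
    pos: usize,
    len: usize,
}

impl<R: Read> ChunkedInput<R> {
    /// `capacity` must be greater than zero, or every read looks like end of input.
    fn new(reader: R, capacity: usize) -> Self {
        ChunkedInput { reader, buf: vec![0; capacity], pos: 0, len: 0 }
    }

    /// Refill once the current chunk is exhausted; returns false at end of input.
    fn ensure_buffer(&mut self) -> io::Result<bool> {
        if self.pos < self.len {
            return Ok(true);
        }
        self.len = self.reader.read(&mut self.buf)?;
        self.pos = 0;
        Ok(self.len > 0)
    }

    /// Compare the input against `expected` one byte at a time, refilling as needed.
    /// Ok(true) on a match; Ok(false) on a mismatch or premature end of input.
    fn check_word(&mut self, expected: &[u8]) -> io::Result<bool> {
        for &byte in expected {
            if !self.ensure_buffer()? || self.buf[self.pos] != byte {
                return Ok(false);
            }
            self.pos += 1;
        }
        Ok(true)
    }
}
```

With a two-byte buffer, `ChunkedInput::new(&b"true,"[..], 2).check_word(b"true")` has to refill in the middle of the word and still matches, which is the case a temporary buffer was previously needed for.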

5 comments on commit 7e341b3

Suor replied Nov 13, 2015

Is it faster? Also, checking the length up front could be more efficient than advancing the iterator for each symbol. It's a little C-style, but will probably be less code.
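
A rough sketch of what the length-first variant could look like, assuming the whole word is already sitting in the buffer. `check_word_buffered` is a hypothetical helper, not something from the repository; the only library calls are `slice::get` and `slice::starts_with`.

```rust
/// Hypothetical helper: check `expected` against the buffer in one comparison.
/// Returns None when fewer than `expected.len()` bytes are buffered after `pos`,
/// in which case the caller would have to refill and try again.
fn check_word_buffered(buf: &[u8], pos: usize, expected: &[u8]) -> Option<bool> {
    let rest = buf.get(pos..)?;
    if rest.len() < expected.len() {
        return None;
    }
    Some(rest.starts_with(expected))
}
```

The `None` branch is the catch: when a word straddles a buffer refill, the length check alone cannot decide, so some fallback path is still needed.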

Owner

isagalaev replied Nov 13, 2015

It's definitely faster. Copying a string is: iteration + copy + another iteration to check. Here it's only a single iteration. Creating an iterator in Rust is pretty much free (like, allocating a very small struct on the stack), and walking it is exactly the same as for-looping.

And I don't know the length yet. Because there's no separate pass to find the whole lexeme in the input buffer, I check characters right at the same time as I walk the buffer for the first time.
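
To make the comparison concrete, here is a simplified in-memory sketch of the two shapes. Both functions are hypothetical illustrations rather than the lexer's code: they work on a plain slice and leave out buffer refills and error values.

```rust
/// Old shape: copy the run of lowercase letters into a temporary Vec, then match it.
fn classify_with_copy(input: &[u8]) -> Option<&'static str> {
    let word: Vec<u8> = input
        .iter()
        .cloned()
        .take_while(|b| b.is_ascii_lowercase())
        .collect();
    match &word[..] {
        b"true" => Some("true"),
        b"false" => Some("false"),
        b"null" => Some("null"),
        _ => None,
    }
}

/// New shape: pick the candidate word from the first byte, then compare in place.
fn classify_in_place(input: &[u8]) -> Option<&'static str> {
    let expected: &'static str = match *input.first()? {
        b't' => "true",
        b'f' => "false",
        b'n' => "null",
        _ => return None,
    };
    let mut bytes = input.iter();
    for want in expected.bytes() {
        // Stop at the first mismatch or at the end of the input.
        if bytes.next() != Some(&want) {
            return None;
        }
    }
    Some(expected)
}
```

One difference in these sketches: the in-place version stops after the expected word, so an input like `truex` yields `true` and leaves `x` for the next lexeme, while the copying version rejects the whole run.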

Owner

isagalaev replied Nov 13, 2015

The difference wasn't huge anyway (0.140 → 0.136) because apparently I only have 100 "true" values in the entire 18 MB JSON :-)

Suor replied Nov 13, 2015

Owner

isagalaev replied Nov 13, 2015

I've got four :-) But this project is not intended for production anyway, it's just a learning test bed.
