A simple lexical analyzer written in Ruby

lexeme

A simple lexical analyzer for programming and human languages.

Install

There are two easy ways to get lexeme on your box: install the Ruby gem or download the source.

gem install lexeme

or just go to http://www.vladimirivic.com/lexeme/ and download the latest version archive.

Usage

Look in the examples directory for a quick example of how the library can be used to tokenize mathematical expressions such as 1 + 3 - sin(0)/cos(1) * pow(6). Since tokenizing mathematical expressions alone is not enough for a modern programming language, a better demonstration is tokenizing pseudo code.
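To give a feel for what tokenizing such an expression involves, here is a minimal standalone sketch in plain Ruby. It does not use Lexeme and is not the code from the examples directory; the names MATH_LEXEME and split_expression are made up for illustration, and the pattern only covers numbers, identifiers, the four arithmetic operators and parentheses.

```ruby
# Standalone sketch: split a math expression into lexemes with plain Ruby.
# Not Lexeme's implementation; MATH_LEXEME and split_expression are
# hypothetical names used only for this illustration.
MATH_LEXEME = /\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*\/()]/

# Returns the lexemes in source order; whitespace is skipped because
# no alternative in the pattern matches it.
def split_expression(source)
  source.scan(MATH_LEXEME)
end

p split_expression('1 + 3 - sin(0)/cos(1) * pow(6)')
```

Each returned string would still need to be classified (number, function name, operator), which is the part a ruleset like the one below takes care of.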

Let's say we have a source code of some pseudo program and we save it in a file named pseudo-code.src:

func hello_world
  x = 1
  y = x + 2
  print "Hello"
fin

This language uses a handful of lexemes, so we define them as part of the lexer's ruleset. To keep things as simple as possible, I'll place the language definition and the lexical analyzer call in the same file. Ideally, the language definition is something you would write and include separately.

Our Ruby code should look like this:

require 'lexeme'

lexer = Lexeme.define do
  token :EQ       => /^=$/
  token :PLUS     => /^\+$/
  token :MINUS    => /^\-$/
  token :MULTI    => /^\*$/
  token :DIV      => /^\/$/
  token :NUMBER   => /^\d+\.?\d*$/
  token :RESERVED => /^(fin|print|func)$/
  token :STRING   => /^".*"$/
  token :ID       => /^[\w_"]+$/ 
end

tokens = lexer.analyze do 
  from_file 'pseudo-code.src'
end

tokens.each do |t|
  puts "#{t.line} => #{t.name}: #{t.value}"
end

Once run, the code above should output (line => token_id: token_value):

1 => RESERVED: func
1 => ID: hello_world
2 => ID: x
2 => EQ: =
2 => NUMBER: 1
3 => ID: y
3 => EQ: =
3 => ID: x
3 => PLUS: +
3 => NUMBER: 2
4 => RESERVED: print
4 => STRING: "Hello"
5 => RESERVED: fin
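Several of these patterns overlap: func matches both RESERVED and ID, and "Hello" matches both STRING and ID. The sketch below shows one common way to resolve that, an ordered rule table where the first matching rule wins. It is plain Ruby with patterns adapted from the ruleset above; RULES and classify are hypothetical names, and first-match-wins is an assumption for illustration, not a statement about how Lexeme resolves overlaps internally.

```ruby
# Sketch only: an ordered rule table where the first match wins.
# This is an assumption about how overlapping rules can be resolved,
# not Lexeme's confirmed behavior. RULES and classify are hypothetical.
RULES = {
  EQ:       /^=$/,
  PLUS:     /^\+$/,
  NUMBER:   /^\d+\.?\d?$/,
  RESERVED: /^(fin|print|func)$/,
  STRING:   /^".*"$/,
  ID:       /^[\w"]+$/
}.freeze

# Returns the name of the first rule whose pattern matches, or nil.
# Ruby hashes keep insertion order, so RESERVED and STRING are tried before ID.
def classify(lexeme)
  RULES.each { |name, pattern| return name if pattern.match?(lexeme) }
  nil
end

%w[func hello_world = 1 "Hello"].each do |lexeme|
  puts "#{classify(lexeme)}: #{lexeme}"
end
```

With this ordering, moving ID to the top of the table would turn every reserved word and string into an ID, so under this assumption the more specific rules belong first.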

Natural languages

Lexeme can also be used for natural language processing. Here's a quick example of how to do it.

require 'lexeme'

puts "Greetings from Los Angeles!!".tokenize

Running this code will produce:

[WORD: Greetings, WORD: from, WORD: Los, WORD: Angeles, EXCL: !, EXCL: !]

A more advanced example with customized syntactic rules would look something like this:

require 'lexeme'

lexer = Lexeme.define do
  token :STOP     =>   /^\.$/
  token :COMMA    =>   /^,$/
  token :QUES     =>   /^\?$/
  token :EXCLAM   =>   /^!$/
  token :QUOT     =>   /^"$/
  token :APOS     =>   /^'$/
  token :WORD     =>   /^[\w\-]+$/
end 

tokens = lexer.analyze do
  from_string 'Hello! My name is Inigo Montoya. You killed my father. Prepare to die.'
end

tokens.each do |t|
  puts "#{t.name}: #{t.value}"
end

This will output:

WORD: Hello
EXCLAM: !
WORD: My
WORD: name
WORD: is
WORD: Inigo
WORD: Montoya
STOP: .
WORD: You
WORD: killed
WORD: my
WORD: father
STOP: .
WORD: Prepare
WORD: to
WORD: die
STOP: .

Change log

Version 0.0.5

  • Added line number to each token object. Useful for hinting errors to the user (thanks Rick).

    tokens.each do |token|
      puts "#{token.line} => #{token.name}: #{token.value}"
    end
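As a rough illustration of the error-hinting use case, the sketch below formats a message that points at the offending line. It is self-contained: Token is a hypothetical Struct standing in for a token object with line, name and value readers, and error_hint is a made-up helper, not part of Lexeme.

```ruby
# Hypothetical stand-in for a token object; Lexeme's own class is not used here.
Token = Struct.new(:line, :name, :value)

# Builds a human-readable hint that points at the token's line.
# error_hint is an illustrative helper, not a Lexeme method.
def error_hint(token, message)
  "line #{token.line}: #{message} (#{token.name} #{token.value.inspect})"
end

bad = Token.new(4, :ID, 'prnt')
puts error_hint(bad, 'unknown command')
```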

Contributing

Any help on this project is very welcome. Please feel free to fork, modify, and make pull requests. Also check the TODO file if it is present in the repository.

Author

Lexeme was written by Vladimir Ivic (vladimir.ivic at icloud.com) and is released under the MIT license.