analects.rb

Public datasets on the Chinese language, accessible from Ruby

Download the data

With Rake

# Rakefile
require 'analects/rake_tasks'

Analects.init_rake_tasks do
  data_dir '/tmp/analects' # defaults to ~/.analects

  task :import_cedict do
    library.cedict.each do |entry|
      # ..
    end
  end
end

Then run the download tasks from the shell:

rake analects:download:all        # download all sources
rake analects:download:cedict     # download CC-CEDICT
rake analects:download:chise_ids  # download Chise-IDS
rake analects:download:hsk        # download HSK data
rake analects:download:unihan     # download Unihan database

Or from Ruby

analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.retrieve
analects.chise_ids.retrieve

Use the data

analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.take(3)
# => [["AA制", "AA制", "A A zhi4", "/to split the bill/to go Dutch/"], ["A咖", "A咖", "A ka1", "/class \"A\"/top grade/"], ["A片", "A片", "A pian4", "/adult movie/pornography/"]]
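Each row in the output above appears to follow the CC-CEDICT field order: traditional form, simplified form, pinyin, then slash-delimited definitions. A minimal sketch of turning such rows into something easier to query, assuming that order holds. `CedictEntry` and `parse_cedict_entry` are hypothetical helpers, not part of Analects, and the rows are copied from the sample output rather than read from the library:

```ruby
# Rows copied from the sample output above; in real use they would come
# from analects.cedict.
SAMPLE_ROWS = [
  ["AA制", "AA制", "A A zhi4", "/to split the bill/to go Dutch/"],
  ["A咖", "A咖", "A ka1", "/class \"A\"/top grade/"],
  ["A片", "A片", "A pian4", "/adult movie/pornography/"]
]

# Hypothetical value object, assuming the CC-CEDICT field order.
CedictEntry = Struct.new(:traditional, :simplified, :pinyin, :definitions)

def parse_cedict_entry(row)
  traditional, simplified, pinyin, raw_definitions = row
  # "/a/b/" splits into ["", "a", "b"]; drop the empty leading piece.
  definitions = raw_definitions.split("/").reject(&:empty?)
  CedictEntry.new(traditional, simplified, pinyin, definitions)
end

parsed = SAMPLE_ROWS.map { |row| parse_cedict_entry(row) }

parsed.each do |entry|
  puts "#{entry.simplified} [#{entry.pinyin}]: #{entry.definitions.join('; ')}"
end
```

Splitting the definitions up front keeps later lookups (for example, searching for an English gloss) free of string handling.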

analects.chise_ids.to_a.sample(3)
# => [["U+59BF", "妿", "⿱加女"], ["U-0002441B", "𤐛", "⿰火閙"], ["U+83A1", "莡", "⿱艹足"]]
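The rows above pair a codepoint label and a character with an ideographic description sequence (IDS). As a rough sketch of working with them, the code below parses the label into an integer and lists the components of a sequence by dropping the layout operators such as ⿰ and ⿱. `parse_codepoint` and `ids_components` are hypothetical helpers, not part of Analects. Real Chise-IDS data may also contain entity references for unencoded components, which this sketch does not handle:

```ruby
# The Ideographic Description Characters block spans U+2FF0..U+2FFF.
IDC_RANGE = (0x2FF0..0x2FFF).freeze

# "U+59BF" or "U-0002441B" -> Integer codepoint
def parse_codepoint(label)
  label.sub(/\AU[+-]/, "").hex
end

# Drop the layout operators and keep the component characters.
def ids_components(ids)
  ids.each_char.reject { |ch| IDC_RANGE.cover?(ch.ord) }
end

# Rows copied from the sample output above.
rows = [
  ["U+59BF", "妿", "⿱加女"],
  ["U-0002441B", "𤐛", "⿰火閙"],
  ["U+83A1", "莡", "⿱艹足"]
]

rows.each do |label, char, ids|
  components = ids_components(ids).join(" + ")
  puts format("%s (U+%04X) = %s", char, parse_codepoint(label), components)
end
```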

Other stuff

Analects wraps RMMSeg for easy segmentation of Chinese text:

Analects::Tokenizer.new.tokenize("为待那个朋友拿哟出来,咿呀噢哎…")
# => ["为", "待", "那个", "朋友", "拿", "哟", "出来", ",", "咿", "呀", "噢", "哎", "…"]
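The tokenizer output is a plain array of strings, so ordinary Ruby collection methods apply to it. A small sketch, using the token list above as literal input (so it runs without Analects installed), that drops punctuation tokens and groups the remaining words by length:

```ruby
# Tokens copied from the sample output above.
tokens = ["为", "待", "那个", "朋友", "拿", "哟", "出来", ",", "咿", "呀", "噢", "哎", "…"]

# Keep only tokens that contain at least one Han character;
# this filters out punctuation such as "," and "…".
words = tokens.select { |token| token.match?(/\p{Han}/) }

# Group the words by character count to separate single-character
# tokens from multi-character words.
by_length = words.group_by(&:length)

puts "words: #{words.length}"
puts "two-character words: #{by_length.fetch(2, []).join(' ')}"
```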

If you have Chinese text in GB or Big5 encoding, you can do things like this:

Analects::Encoding.valid_cjk(str)
Analects::Encoding.from_gb(str)   # returns UTF-8
Analects::Encoding.from_big5(str) # returns UTF-8
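For comparison, plain Ruby can perform this kind of conversion with `String#encode`. The sketch below is not how Analects implements `from_gb` or `from_big5` (their internals are not shown here). The helper names are hypothetical, and GBK is assumed as the concrete GB variant:

```ruby
# Hypothetical helpers built on String#encode(destination, source).
def gbk_to_utf8(bytes)
  bytes.encode("UTF-8", "GBK")
end

def big5_to_utf8(bytes)
  bytes.encode("UTF-8", "Big5")
end

# Raw bytes for "中文" in each legacy encoding.
gbk_bytes  = [0xD6, 0xD0, 0xCE, 0xC4].pack("C*")
big5_bytes = [0xA4, 0xA4, 0xA4, 0xE5].pack("C*")

puts gbk_to_utf8(gbk_bytes)
puts big5_to_utf8(big5_bytes)
```

With this approach, bytes that are not valid in the source encoding raise `Encoding::InvalidByteSequenceError`, so untrusted input may need a `rescue` or the `invalid: :replace` option.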

License

Copyright ⓒ Arne Brasseur 2012-2014

Licensed under the GPLv3
