Skip to content

a Lucene/Solr filter and filter factory to fold certain CJK characters to improve recall. For example, it converts some modern Japanese Kanji characters to their traditional equivalents (when the modern Kanji doesn't map to the simplified Han character).

License

Notifications You must be signed in to change notification settings

pulibrary/CJKFoldingFilter

 
 

Repository files navigation

CJKFoldingFilter

<img src=“https://secure.travis-ci.org/solrmarc/CJKFoldingFilter.png?branch=master” alt=“Build Status” />

This is a Lucene filter and filter factory (see lucene.apache.org ) to fold certain CJK characters to improve recall. You should put it in your analysis chain BEFORE ICUTransforms from Traditional->Simplified Han, as it converts modern Japanese Kanji to their traditional equivalents.

Usage

  • clone the project

git clone git://github.com/solrmarc/CJKFoldingFilter.git
  • run the jar ant task

ant jar
  • put the CJKFoldingFilter.jar file found in the dist directory into your Solr lib directory

  • utilize the Solr CJKFoldingFilterFactory in your schema.xml file.

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
  </analyzer>
</fieldType>

Contributing

  1. Fork it

  2. Create your feature branch (‘git checkout -b my-new-feature`)

  3. Commit your changes (‘git commit -am ’Added some feature’‘)

  4. Push to the branch (‘git push origin my-new-feature`)

  5. Create new Pull Request

About

a Lucene/Solr filter and filter factory to fold certain CJK characters to improve recall. For example, it converts some modern Japanese Kanji characters to their traditional equivalents (when the modern Kanji doesn't map to the simplified Han character).

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 100.0%