elasticsearch-analysis-texsmart

TexSmart Analyzer for Elasticsearch

This analyzer is built on TexSmart, the Chinese text-understanding toolkit from Tencent AI Lab, and provides a Chinese word-segmentation plugin for Elasticsearch.

🚩 Changelog:

  1. Adapted to Elasticsearch 7.x and TexSmart 0.1.3

Version mapping

| Plugin version | Elastic version |
| -------------- | --------------- |
| master         | 7.x             |
| 7.6.2          | 7.6.2           |
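
To pick the matching release, you can check which version a node is running (a quick sketch; it assumes Elasticsearch is listening on localhost:9200):

```sh
# the root endpoint reports the server version as version.number
curl -s http://localhost:9200 | grep '"number"'
```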

Installation

1. Download and install the plugin release that matches your Elasticsearch version

Installation options:

Option 1

a. Download the matching release package; the latest release can be downloaded from GitHub (link: https://github.com/koios-sh/elasticsearch-analysis-texsmart/releases/download/v7.6.2/elasticsearch-analysis-texsmart-7.6.2.zip)

b. Run the following command to install, where PATH is the absolute path to the plugin package:

./bin/elasticsearch-plugin install file://${PATH}
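
For example (a sketch; the zip location below is hypothetical, and the variable is deliberately not named PATH, since assigning PATH in the shell would clobber the executable search path):

```sh
# install the plugin from a local zip; PLUGIN_ZIP is a placeholder path
PLUGIN_ZIP=/tmp/elasticsearch-analysis-texsmart-7.6.2.zip
./bin/elasticsearch-plugin install "file://${PLUGIN_ZIP}"
```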

Option 2

a. Install it directly from the release URL with the elasticsearch-plugin script:

./bin/elasticsearch-plugin install https://github.com/koios-sh/elasticsearch-analysis-texsmart/releases/download/v7.6.2/elasticsearch-analysis-texsmart-7.6.2.zip

Option 3

a. Build the plugin: sh build.sh

b. Run the following command to install, where PATH is the absolute path to the plugin package:

./bin/elasticsearch-plugin install file://${PATH}
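
Put together, a full build-and-install run might look like this (a sketch; the location of the zip produced by build.sh is an assumption, and the final list command simply confirms the plugin registered):

```sh
# build, install from the produced zip, then verify the plugin is registered
sh build.sh
./bin/elasticsearch-plugin install "file:///absolute/path/to/elasticsearch-analysis-texsmart-7.6.2.zip"
./bin/elasticsearch-plugin list | grep texsmart
```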

2. Install the data package

The release package does not include the TexSmart data package; to download the full data package, see the TexSmart Release page.

Data package directory: /etc/elasticsearch/texsmart/data. You can change the path value in config/texsmart.properties to adjust the data location.
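
For example (a sketch; the path key comes from the note above, while the location of texsmart.properties shown here is an assumption that depends on your installation):

```sh
# point the plugin at a different data directory by rewriting the `path` key;
# TEXSMART_CONF is a placeholder for wherever texsmart.properties lives
TEXSMART_CONF=/etc/elasticsearch/texsmart/config/texsmart.properties
sudo sed -i 's|^path=.*|path=/mnt/texsmart/data|' "${TEXSMART_CONF}"
```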

3. Install libtencent_ai_texsmart.so

cp libtencent_ai_texsmart.so /usr/lib64 && chmod 777 /usr/lib64/libtencent_ai_texsmart.so

Note: the changes above must be made on every node.
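
A sketch for rolling the library out to a whole cluster (the host names are placeholders; the copy and chmod mirror the command above):

```sh
# repeat the shared-library install on every node over ssh
for host in es-node-1 es-node-2 es-node-3; do
  scp libtencent_ai_texsmart.so "${host}:/tmp/"
  ssh "${host}" 'sudo cp /tmp/libtencent_ai_texsmart.so /usr/lib64 && sudo chmod 777 /usr/lib64/libtencent_ai_texsmart.so'
done
```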

Available tokenizers

texsmart: TexSmart default segmentation

texsmart_standard: standard segmentation

texsmart_index: index-mode segmentation
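
A quick way to compare them is to run the same sentence through each tokenizer via the _analyze API (a sketch using curl; like the example below, it assumes an index named test exists):

```sh
# print the token list each tokenizer produces for one sentence
for tok in texsmart texsmart_standard texsmart_index; do
  echo "== ${tok} =="
  curl -s -H 'Content-Type: application/json' \
    "http://localhost:9200/test/_analyze" \
    -d "{\"tokenizer\": \"${tok}\", \"text\": \"2020年,空调市场\"}"
  echo
done
```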

Example

```
POST http://localhost:9200/test/_analyze
{
    "text": "2020年,空调市场“冷风吹过”",
    "tokenizer": "texsmart_standard"
}
```

Response:

```
{
    "tokens": [
        {
            "token": "2020",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CD",
            "position": 0
        },
        {
            "token": "年",
            "start_offset": 4,
            "end_offset": 5,
            "type": "M",
            "position": 1
        },
        {
            "token": ",",
            "start_offset": 5,
            "end_offset": 6,
            "type": "PU",
            "position": 2
        },
        {
            "token": "空调",
            "start_offset": 6,
            "end_offset": 8,
            "type": "NN",
            "position": 3
        },
        {
            "token": "市场",
            "start_offset": 8,
            "end_offset": 10,
            "type": "NN",
            "position": 4
        },
        {
            "token": "“",
            "start_offset": 10,
            "end_offset": 11,
            "type": "PU",
            "position": 5
        },
        {
            "token": "冷风",
            "start_offset": 11,
            "end_offset": 13,
            "type": "NN",
            "position": 6
        },
        {
            "token": "吹过",
            "start_offset": 13,
            "end_offset": 15,
            "type": "VV",
            "position": 7
        },
        {
            "token": "”",
            "start_offset": 15,
            "end_offset": 16,
            "type": "PU",
            "position": 8
        }
    ]
}
```
  • Make sure dictionary files are encoded in UTF-8
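
If a dictionary file arrives in a legacy encoding, converting it is straightforward (a sketch; the source encoding and file names are placeholders):

```sh
# re-encode a dictionary file to UTF-8 (source encoding assumed to be GBK)
iconv -f GBK -t UTF-8 my_dict.txt -o my_dict.utf8.txt
```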

Custom tokenizer configuration

In addition to the tokenization modes above, TexSmart offers a range of segmentation options, which the plugin exposes as tokenizer settings. You can use the following settings to define your own tokenizer:

| Config                 | Description                                            |
| ---------------------- | ------------------------------------------------------ |
| enable_index_mode      | whether to use index-mode segmentation                 |
| enable_stop_dictionary | whether to enable the stop-word dictionary             |
| enable_offset          | whether to compute offsets                             |
| enable_pos_alg         | POS-tagging algorithm (log_linear (default), crf, dnn) |
| enable_ner_alg         | NER algorithm (crf (default), dnn)                     |

Note: to filter Chinese and English punctuation with these settings, enable_stop_dictionary must be set to true.

For example:

```
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_texsmart_analyzer": {
          "tokenizer": "my_texsmart"
        }
      },
      "tokenizer": {
        "my_texsmart": {
          "type": "texsmart",
          "enable_stop_dictionary": true,
          "enable_pos_alg": "log_linear",
          "enable_ner_alg": "crf"
        }
      }
    }
  }
}
```

```
POST test/_analyze
{
    "text": "2020年,空调市场“冷风吹过”",
    "analyzer": "my_texsmart_analyzer"
}
```

Result:

```
{
    "tokens": [
        {
            "token": "2020",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CD",
            "position": 0
        },
        {
            "token": "年",
            "start_offset": 4,
            "end_offset": 5,
            "type": "M",
            "position": 1
        },
        {
            "token": "空调",
            "start_offset": 6,
            "end_offset": 8,
            "type": "NN",
            "position": 2
        },
        {
            "token": "市场",
            "start_offset": 8,
            "end_offset": 10,
            "type": "NN",
            "position": 3
        },
        {
            "token": "冷风",
            "start_offset": 11,
            "end_offset": 13,
            "type": "NN",
            "position": 4
        },
        {
            "token": "吹过",
            "start_offset": 13,
            "end_offset": 15,
            "type": "VV",
            "position": 5
        }
    ]
}
```

Additional notes

1. TexSmart does not yet officially support hot-loading updated dictionaries; reportedly the next release will. Following the analysis-hanlp plugin, this codebase already integrates remote dictionaries and dynamic segmentation updates; the feature will be enabled once Tencent ships the official update.

🚩 References: TexSmart, analysis-hanlp