This repository has been archived by the owner on Oct 13, 2021. It is now read-only.

TokenLoc out of length #123

Open
1 of 3 tasks
JabinGP opened this issue Dec 20, 2020 · 1 comment

Comments

JabinGP commented Dec 20, 2020

  • Riot version (or commit ref): 20201013133145-f4c30acb3704
  • Go version: go version go1.14.5 darwin/amd64
  • Operating system and bit: macOS 10.15.6
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
  • Provide example code:
package main

import (
	"log"

	"github.com/go-ego/riot"
	"github.com/go-ego/riot/types"
)

var (
	searcher = riot.Engine{}
)

func init() {
	initSearcher()
	initIndex()
}

func initSearcher() {
	searcher.Init(types.EngineOpts{
		Using:   3,
		GseDict: "zh",
		IndexerOpts: &types.IndexerOpts{
			IndexType: types.LocsIndex,
		},
	})
}

func initIndex() {
	docID := "1"
	content := "验证账户权限 运行一些简单的指令来验证账户的有效性 > show dbs admin 0.000GB config 0.000GB local 0.000GB > show users { \"_id\" : \"admin.admin\", \"userId\" : UUID(\"dc5760ea-c8c1-4f40-af5b-7d9d53779842\"), \"user\" : \"admin\", \"db\" : \"admin\", \"roles\" : [ { \"role\" : \"userAdminAnyDatabase\", \"db\" : \"admin\" } ], \"mechanisms\" : [ \"SCRAM-SHA-1\", \"SCRAM-SHA-256\" ] } "
	searcher.Index(docID,
		types.DocData{Content: content},
	)
	searcher.Flush()
}

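func main() {
	keyword := "t"

	res := searcher.SearchDoc(types.SearchReq{Text: keyword})

	// Guard against an empty result before indexing into res.Docs.
	if len(res.Docs) == 0 {
		log.Fatal("no documents matched")
	}

	// Compare the reported token location with the content length in bytes.
	log.Println("TokenLocs = ", res.Docs[0].TokenLocs)
	log.Println("len(content) = ", len(res.Docs[0].Content))
}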
  • Log gist:
    2020/12/20 13:39:33 Load the gse dictionary: "/Users/jabin/go/pkg/mod/github.com/go-ego/gse@v0.50.3/data/dict/dictionary.txt"
    2020/12/20 13:39:34 Gse dictionary loaded finished.
    2020/12/20 13:39:34 Check virtualMemory...
    2020/12/20 13:39:34 Total: 17179869184, Free: 15147008, UsedPercent: 64.184594%
    2020/12/20 13:39:34 TokenLocs = [[495]]
    2020/12/20 13:39:34 len(content) = 376

Description

The first TokenLoc is 495, which is greater than len(content) (376), so the reported location lies outside the indexed content.
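
A minimal sketch of how such out-of-range locations could be detected before they are used, for example for highlighting. It assumes that TokenLocs values are meant to be byte offsets into Content, which the issue does not confirm. The helper `outOfRange` is hypothetical and not part of riot; the input values are the ones from the log above.

```go
package main

import "fmt"

// outOfRange returns every offset in locs that is not a valid byte index
// into a string of length n. Hypothetical helper, not part of riot.
// Assumption: the offsets are byte positions into the document content.
func outOfRange(locs [][]int, n int) []int {
	var bad []int
	for _, group := range locs {
		for _, loc := range group {
			if loc < 0 || loc >= n {
				bad = append(bad, loc)
			}
		}
	}
	return bad
}

func main() {
	// Values taken from the log: TokenLocs = [[495]], len(content) = 376.
	fmt.Println(outOfRange([][]int{{495}}, 376))
}
```

With the values from the log, the helper reports 495 as invalid, since no byte index into a 376-byte string can be that large.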

@stuchilde

Maybe this is because of the difference between Chinese characters and English letters.
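
To illustrate the difference mentioned here, the sketch below compares the two ways of measuring a Go string: `len` counts UTF-8 bytes, while `utf8.RuneCountInString` counts characters. The sample string is a short prefix in the style of the indexed content, and `byteAndRuneLen` is a hypothetical helper. It only shows how the two counts diverge for mixed Chinese and ASCII text; it does not reproduce how riot computes TokenLocs. Since `len(content)` is already the larger of the two counts, this difference alone may not explain a location of 495.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the UTF-8 byte length and the rune count of s.
// Hypothetical helper for illustration only.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Six Chinese characters (3 bytes each in UTF-8) followed by 9 ASCII bytes.
	s := "验证账户权限 show dbs"
	b, r := byteAndRuneLen(s)
	fmt.Println("bytes =", b, "runes =", r)
}
```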
