Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: any plan on implementing the bm25 algorithm ? #181

Open
CocaineCong opened this issue Oct 30, 2023 · 2 comments
Open

feature: any plan on implementing the bm25 algorithm ? #181

CocaineCong opened this issue Oct 30, 2023 · 2 comments

Comments

@CocaineCong
Copy link
Contributor

Description

I see the bm25.go file in path(hmm/bm25/bm25.go), so I wanna ask author any plan on bm25 ? 😃

If author had the plan on implementing the bm25, I want to make it. 🫡

image
@vcaesar vcaesar added this to the v0.90.0 milestone Oct 30, 2023
@vcaesar
Copy link
Member

vcaesar commented Oct 30, 2023

Emm, A lot of features are planned, but my time is limited, and there any contributions welcome.

@CocaineCong
Copy link
Contributor Author

CocaineCong commented Nov 1, 2023

Hey, @vcaesar there is my plan on the bm25 algorithm. 🧑🏻‍💻

image

First, in the file pathhmm/idf/tag_extracker.go, I think the struct of Segment should be extracted to the new file.

type Segment struct {
	text   string
	weight float64
}

because this structure of segment should be in the state of being called at all times. if in tag_extracker, maybe there will be have some cycle import trouble. 🥲

Secondly, still in the file hmm/idf/tag_extracker.go , the struct of TagExtracter , the field of Idf should be abstracted into a relevance computation to implementing more algorithm such as IDF, TFIDF, BM25 .

before:

type TagExtracter struct {
	seg gse.Segmenter

	Idf      *Idf
	stopWord *StopWord
}

after:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

what's more, the field of stopWord should in the struct of Relevance. since this field is used in the word splitting, it should be used in the struct of relevance.

before:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

after:

type IDF struct {
	median float64
	freqs []float64
	Base
}

type BM25 struct {
	K1 float64
	N float64
	Base
}

type Base struct {
	// loading some stop words
	StopWord *stop_word.StopWord

	// loading segmenter for cut word
	Seg gse.Segmenter
}

And then, I want to implementing the Relevance by Strategy Pattern.

such as :

// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
	// AddToken add text, frequency, position on obj
	AddToken(text string, freq float64, pos ...string) error

	// LoadDict load file from incoming parameters,
	// if incoming params no exist, will load file from default file path
	LoadDict(files ...string) error

	// LoadDictStr loading dict file by file path
	LoadDictStr(pathStr string) error

	// LoadStopWord loading word file by filename
	LoadStopWord(fileName ...string) error

	// Freq find the frequency, position, existence information of the key
	Freq(key string) (float64, string, bool)

	// TotalFreq the total number of tokens in the dictionary
	TotalFreq() float64

	// FreqMap get frequency map
	// key: word, value: frequency
	FreqMap(text string) map[string]float64

	// ConstructSeg return the segment with weight
	ConstructSeg(text string) segment.Segments
}

default IDF:

func NewIDF() Relevance {
	idf := &IDF{
		freqs: make([]float64, 0),
	}
	idf.StopWord = stop_word.NewStopWord()
	return Relevance(idf)
}

implement the interface function

// AddToken add a new word with IDF into the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
	err := i.Seg.AddToken(text, freq, pos...)

	i.freqs = append(i.freqs, freq)
	sort.Float64s(i.freqs)
	i.median = i.freqs[len(i.freqs)/2]
	return err
}

// LoadDict load the idf dictionary
func (i *IDF) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = i.Seg.GetIdfPath(files...)
	}

	return i.Seg.LoadDict(files...)
}

// Freq return the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
	return i.Seg.Find(key)
}

....
image

any problem about that? if no big problem, I'll pr bit by bit to make it possible. 🫡

  • extracting stop word
  • extracting Segment with weight
  • extracting extracker
  • converging idf,bm25 in relevance
  • stop word should work and in relevance algorithm struct
  • relevance algorithm implement by Strategy Pattern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants