feature: any plan on implementing the bm25 algorithm ? #181

CocaineCong · 2023-10-30T14:42:55Z

Description

I noticed the bm25.go file at hmm/bm25/bm25.go, so I want to ask the author: is there any plan for BM25? 😃

If the author plans to implement BM25, I would like to work on it. 🫡


vcaesar · 2023-10-30T15:08:21Z

Emm, a lot of features are planned, but my time is limited, so any contributions are welcome.

CocaineCong · 2023-11-01T16:46:31Z

Hey @vcaesar, here is my plan for the BM25 algorithm. 🧑🏻‍💻

First, in the file hmm/idf/tag_extracker.go, I think the Segment struct should be extracted into a new file.

type Segment struct {
	text   string
	weight float64
}

This is because the Segment struct needs to be usable from every package. If it stays in tag_extracker.go, it may cause import cycles. 🥲
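A minimal sketch of what such a standalone segment file could look like. The accessor methods, the `NewSegment` constructor, and the `TopK` helper are assumptions for illustration, not part of the proposal; only the two fields come from the struct above, and `Segments` mirrors the `segment.Segments` name used later in the interface.

```go
package main

import "sort"

// Segment pairs a piece of text with its relevance weight.
// The fields match the struct in hmm/idf/tag_extracker.go.
type Segment struct {
	text   string
	weight float64
}

// NewSegment is a hypothetical constructor so other packages can build
// segments without touching the unexported fields.
func NewSegment(text string, weight float64) Segment {
	return Segment{text: text, weight: weight}
}

// Text returns the segment text.
func (s Segment) Text() string { return s.text }

// Weight returns the segment weight.
func (s Segment) Weight() float64 { return s.weight }

// Segments implements sort.Interface: highest weight first,
// ties broken by text so the order is deterministic.
type Segments []Segment

func (ss Segments) Len() int      { return len(ss) }
func (ss Segments) Swap(i, j int) { ss[i], ss[j] = ss[j], ss[i] }
func (ss Segments) Less(i, j int) bool {
	if ss[i].weight == ss[j].weight {
		return ss[i].text < ss[j].text
	}
	return ss[i].weight > ss[j].weight
}

// TopK returns up to k segments with the highest weight.
// It sorts a copy, so the caller's slice is left untouched.
func TopK(ss Segments, k int) Segments {
	sorted := make(Segments, len(ss))
	copy(sorted, ss)
	sort.Sort(sorted)
	if k < 0 {
		k = 0
	}
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```

Because this file depends only on the standard library, both the extractor and any relevance package could import it without importing each other.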

Secondly, still in hmm/idf/tag_extracker.go, the Idf field of the TagExtracter struct should be abstracted into a relevance computation, so that more algorithms such as IDF, TF-IDF, and BM25 can be implemented.

before:

type TagExtracter struct {
	seg gse.Segmenter

	Idf      *Idf
	stopWord *StopWord
}

after:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

What's more, the stopWord field should move into the Relevance implementation structs. Since it is used during word segmentation, it belongs next to the segmenter inside those structs.

before:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

after:

type IDF struct {
	median float64
	freqs []float64
	Base
}

type BM25 struct {
	K1 float64
	N float64
	Base
}

type Base struct {
	// loading some stop words
	StopWord *stop_word.StopWord

	// loading segmenter for cut word
	Seg gse.Segmenter
}
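To illustrate how embedding Base shares the stop words and the segmenter between algorithms, here is a self-contained sketch. The `StopWord` type, its `IsStopWord` method, the `Cut` function field, and the `Tokens` helper are stand-ins invented for this example; the proposal itself stores a `*stop_word.StopWord` and a `gse.Segmenter`.

```go
package main

import "strings"

// StopWord is a stand-in for the proposed stop_word.StopWord.
type StopWord struct {
	words map[string]bool
}

// NewStopWord builds a stop word set from the given words.
func NewStopWord(words ...string) *StopWord {
	sw := &StopWord{words: make(map[string]bool, len(words))}
	for _, w := range words {
		sw.words[w] = true
	}
	return sw
}

// IsStopWord reports whether w is in the set.
func (s *StopWord) IsStopWord(w string) bool { return s.words[w] }

// Base holds what every relevance algorithm needs.
// Cut replaces gse.Segmenter here (assumption): it splits on whitespace.
type Base struct {
	StopWord *StopWord
	Cut      func(text string) []string
}

// NewBase is a hypothetical constructor for this sketch.
func NewBase(stop ...string) Base {
	return Base{StopWord: NewStopWord(stop...), Cut: strings.Fields}
}

// Tokens cuts the text and drops stop words.
func (b *Base) Tokens(text string) []string {
	var out []string
	for _, w := range b.Cut(text) {
		if !b.StopWord.IsStopWord(w) {
			out = append(out, w)
		}
	}
	return out
}

// IDF and BM25 embed Base, so Tokens is promoted to both of them.
type IDF struct {
	median float64
	freqs  []float64
	Base
}

type BM25 struct {
	K1 float64
	N  float64
	Base
}
```

With this layout, `BM25{Base: NewBase("the")}.Tokens("the quick fox")` and the same call on an `IDF` value go through identical code, which is the point of moving stopWord out of TagExtracter.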

Then, I want to implement Relevance using the Strategy Pattern.

For example:

// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
	// AddToken add text, frequency, position on obj
	AddToken(text string, freq float64, pos ...string) error

	// LoadDict load file from incoming parameters,
	// if incoming params no exist, will load file from default file path
	LoadDict(files ...string) error

	// LoadDictStr loading dict file by file path
	LoadDictStr(pathStr string) error

	// LoadStopWord loading word file by filename
	LoadStopWord(fileName ...string) error

	// Freq find the frequency, position, existence information of the key
	Freq(key string) (float64, string, bool)

	// TotalFreq the total number of tokens in the dictionary
	TotalFreq() float64

	// FreqMap get frequency map
	// key: word, value: frequency
	FreqMap(text string) map[string]float64

	// ConstructSeg return the segment with weight
	ConstructSeg(text string) segment.Segments
}
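As a rough illustration of the Strategy Pattern here, the sketch below reduces the interface to a single scoring method and swaps two implementations behind it. `Scorer`, `TermStats`, `Extractor`, and `Weigh` are hypothetical names, and the `B` and `AvgDocLen` fields are assumptions added because the usual Okapi BM25 formula needs them; the proposed BM25 struct only lists K1 and N. The TF-IDF variant uses one common smoothed IDF, not necessarily the one the library uses.

```go
package main

import "math"

// TermStats carries the per-term numbers a scorer needs (hypothetical type).
type TermStats struct {
	TF      float64 // occurrences of the term in the document
	DocFreq float64 // number of documents containing the term
	DocLen  float64 // length of the document in tokens
}

// Scorer is a reduced stand-in for the proposed Relevance interface.
type Scorer interface {
	Score(t TermStats) float64
}

// TFIDF scores with tf * (ln((1+N)/(1+df)) + 1), a smoothed variant.
type TFIDF struct {
	N float64 // number of documents in the corpus
}

func (s TFIDF) Score(t TermStats) float64 {
	idf := math.Log((1+s.N)/(1+t.DocFreq)) + 1
	return t.TF * idf
}

// BM25 scores with the usual Okapi BM25 term formula.
type BM25 struct {
	K1        float64 // term frequency saturation
	B         float64 // length normalization strength (assumption)
	N         float64 // number of documents in the corpus
	AvgDocLen float64 // average document length (assumption)
}

func (s BM25) Score(t TermStats) float64 {
	idf := math.Log(1 + (s.N-t.DocFreq+0.5)/(t.DocFreq+0.5))
	norm := t.TF + s.K1*(1-s.B+s.B*t.DocLen/s.AvgDocLen)
	return idf * t.TF * (s.K1 + 1) / norm
}

// Extractor depends only on the interface, so the algorithm can be swapped.
type Extractor struct {
	Relevance Scorer
}

// Weigh scores every term with whichever strategy is plugged in.
func (e Extractor) Weigh(terms map[string]TermStats) map[string]float64 {
	out := make(map[string]float64, len(terms))
	for term, st := range terms {
		out[term] = e.Relevance.Score(st)
	}
	return out
}
```

Switching algorithms is then a one-field change, for example `e.Relevance = TFIDF{N: 10}`, with no edits to the extractor itself.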

Default IDF:

func NewIDF() Relevance {
	idf := &IDF{
		freqs: make([]float64, 0),
	}
	idf.StopWord = stop_word.NewStopWord()
	return Relevance(idf)
}

Implementing the interface functions:

// AddToken add a new word with IDF into the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
	err := i.Seg.AddToken(text, freq, pos...)

	i.freqs = append(i.freqs, freq)
	sort.Float64s(i.freqs)
	i.median = i.freqs[len(i.freqs)/2]
	return err
}

// LoadDict load the idf dictionary
func (i *IDF) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = i.Seg.GetIdfPath(files...)
	}

	return i.Seg.LoadDict(files...)
}

// Freq return the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
	return i.Seg.Find(key)
}

....
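The AddToken above re-sorts the whole freqs slice on every call, which is correct but repeats work when many tokens are loaded. One possible alternative, sketched here with a hypothetical `freqIndex` type, keeps the slice sorted by inserting each value at its binary-search position. It picks the same element (`freqs[len(freqs)/2]`) as the median.

```go
package main

import "sort"

// freqIndex keeps frequencies in ascending order (hypothetical helper).
type freqIndex struct {
	freqs []float64
}

// Add inserts freq at its sorted position instead of re-sorting.
func (f *freqIndex) Add(freq float64) {
	i := sort.SearchFloat64s(f.freqs, freq) // first index with value >= freq
	f.freqs = append(f.freqs, 0)            // grow by one slot
	copy(f.freqs[i+1:], f.freqs[i:])        // shift the tail right
	f.freqs[i] = freq
}

// Median returns freqs[len/2], the same choice as in AddToken above.
// The second result is false when no frequency has been added yet.
func (f *freqIndex) Median() (float64, bool) {
	if len(f.freqs) == 0 {
		return 0, false
	}
	return f.freqs[len(f.freqs)/2], true
}
```

Whether this matters in practice depends on dictionary size; for small dictionaries the original sort-on-insert is simpler and probably fine.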

Any problems with this? If there are no big issues, I'll open PRs bit by bit to make it happen. 🫡

- extract the stop word module
- extract Segment with weight
- extract the extractor
- converge idf and bm25 into relevance
- make stop word part of the relevance algorithm structs
- implement the relevance algorithms with the Strategy Pattern

vcaesar added the enhancement label Oct 30, 2023

vcaesar added this to the v0.90.0 milestone Oct 30, 2023

vcaesar added the Proposal label Oct 30, 2023

This was referenced Nov 5, 2023

refactor: extract the segment with weight module #182

Merged

feat: refactor idf module, implementing tfidf & bm25 in TagExtracter by strategy pattern #183

Open
