Nogosari - NLP for Bahasa

An NLP package for Bahasa, based on go-sastrawi (https://github.com/RadhiFadlillah/go-sastrawi), with several modifications that make it procedural and additional features for advanced functionality. Note that symbol removal is disabled by default. Instead, the package only cleans up minor break and period marks: a comma followed by a space, a semicolon followed by a space, a period followed by a space, and a period at the end of a sentence. A full tokenize function is still available.
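The default cleanup described above could be sketched roughly as follows. This is not the package's actual implementation: the helper name `cleanBreaks`, the regular expression, and the choice to replace each mark with a single space are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// softBreaks matches a comma, semicolon, or period followed by a space,
// or a period at the very end of the text.
// Assumption: this pattern only approximates the behavior described in the README.
var softBreaks = regexp.MustCompile(`[,;.] |\.$`)

// cleanBreaks is a hypothetical helper, not part of nogosari.
// It replaces every matched mark with one space and trims the result.
func cleanBreaks(s string) string {
	return strings.TrimSpace(softBreaks.ReplaceAllString(s, " "))
}

func main() {
	fmt.Println(cleanBreaks("satu, dua; tiga. empat."))
}
```

A period that is not followed by a space and is not at the end of the text, as in "1.2", is left untouched under these assumptions.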

Basic Concept

There are two things we need to understand:

The dictionary is a list of indexed words used for reference.
A word can have more than one function (also called its part or position), depending on the structure and context of the sentence or phrase. It is therefore convenient to store this information in a uint variable used as a set of binary flags, where each bit marks whether the word has a given word-function (true/false).

Given these two points, the dictionary can be built as a map, with the word (string) as the key and its word-functions (uint16) as the value.

There are 10 word-functions, which can be represented by a 10-bit binary value. In order, they are: Noun, Pronoun, Verb, Adj, Adverb, Conjunction, Preposition, Interjection, Numeric, Articula

Note that word-functions are meant for recognizing sentence or phrase structure. Basic tasks such as stemming do not use them.
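The encoding described above can be sketched with Go bit flags. The type name `WordFunc`, the `Has` method, and the bit order (Noun as the lowest bit) are assumptions for illustration; the package's own identifiers and bit layout may differ. The tags given to the sample words are arbitrary, not linguistic claims.

```go
package main

import "fmt"

// WordFunc is a hypothetical bit set of word-functions stored in a uint16.
type WordFunc uint16

// One bit per word-function, in the order listed in the README.
// Assumption: Noun occupies the lowest bit.
const (
	Noun WordFunc = 1 << iota
	Pronoun
	Verb
	Adj
	Adverb
	Conjunction
	Preposition
	Interjection
	Numeric
	Articula
)

// Has reports whether every bit in f is also set in w.
func (w WordFunc) Has(f WordFunc) bool { return w&f == f }

func main() {
	// Dictionary: word as the key, combined word-functions as the value.
	dict := map[string]WordFunc{
		"lorem": Noun,
		"ipsum": Noun | Verb, // a word with two functions
	}

	w := dict["ipsum"]
	fmt.Printf("%010b noun=%t adj=%t\n", w, w.Has(Noun), w.Has(Adj))
}
```

Storing the flags in one integer keeps each dictionary entry small, and testing for a function is a single bitwise AND.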

Usage

Installation

go get github.com/karincake/nogosari

Create a reference and use it in the stemmer

package main

import (
    "fmt"

    "github.com/karincake/nogosari"
)

func main() {
    // Build the word reference (dictionary) used by the stemmer.
    // A simpler way is to keep the words in a slice of strings,
    // then loop over it to fill the map.
    ref := map[string]struct{}{
        "lorem": {},
        "ipsum": {},
        "dolor": {},
        "sit":   {},
        "amet":  {},
    }

    // The string to stem
    s := "Lorem ipsum adalah sebuah teks yang digunakan untuk uji coba"

    // Process it
    ss := nogosari.Stem(s, ref)

    // Print the result so the variable is used
    fmt.Println(ss)
}
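The comment in the example above mentions building the reference from a list of strings. A minimal sketch of that idea follows; `buildRef` is a hypothetical helper and not part of nogosari.

```go
package main

import "fmt"

// buildRef is a hypothetical helper that turns a slice of words into
// the map[string]struct{} form used as the word reference.
func buildRef(words []string) map[string]struct{} {
	ref := make(map[string]struct{}, len(words))
	for _, w := range words {
		ref[w] = struct{}{} // the empty struct takes no memory; only the key matters
	}
	return ref
}

func main() {
	ref := buildRef([]string{"lorem", "ipsum", "dolor", "sit", "amet"})

	_, ok := ref["ipsum"]
	fmt.Println(len(ref), ok)
}
```

The resulting map can then be passed as the second argument to the stemmer in place of the literal shown earlier. Duplicate words in the slice collapse into a single key.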
