This repository has been archived by the owner on May 14, 2023. It is now read-only.

Add modules
jdkato committed Jun 16, 2020
1 parent 0251669 commit 7be755d
Showing 3 changed files with 120 additions and 177 deletions.
260 changes: 83 additions & 177 deletions README.md
@@ -1,149 +1,92 @@
# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![fuzzit](https://app.fuzzit.dev/badge?org_id=prose=master)](https://fuzzit.dev) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/gopkg.in/jdkato/prose.v2) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=v2) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)
# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![Build status](https://ci.appveyor.com/api/projects/status/24bepq85nnnk4scr/branch/master?svg=true)](https://ci.appveyor.com/project/jdkato/prose/branch/master) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/github.com/jdkato/prose) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

`prose` is a natural language processing library (English only) in *pure Go*. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary on the library's performance here: [Introducing `prose` v2.0.0: Bringing NLP *to Go*](https://medium.com/@errata.ai/introducing-prose-v2-0-0-bringing-nlp-to-go-a1f0c121e4a5).
`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.

> **NOTE**: If you're looking for v1.0.0's README, you can still [find it here](https://github.com/jdkato/prose/blob/v1/README.md).
See the [GoDoc documentation](https://godoc.org/github.com/jdkato/prose) for more information.

## Installation
## Install

```console
$ go get gopkg.in/jdkato/prose.v2
$ go get github.com/jdkato/prose/...
```

> **NOTE**: When using some vendoring tools, such as `govendor`, you may need to include the `github.com/jdkato/prose/internal/` package in addition to the core package(s). See [#14](https://github.com/jdkato/prose/issues/14) for more information.
## Usage

### Contents

* [Overview](#overview)
* [Tokenizing](#tokenizing)
* [Segmenting](#segmenting)
* [Tagging](#tagging)
* [NER](#ner)
* [Tokenizing](#tokenizing-godoc)
* [Tagging](#tagging-godoc)
* [Transforming](#transforming-godoc)
* [Summarizing](#summarizing-godoc)
* [Chunking](#chunking-godoc)
* [License](#license)


### Overview
### Tokenizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/tokenize))

Word, sentence, and regexp tokenizers are available. Every tokenizer implements the [same interface](https://godoc.org/github.com/jdkato/prose/tokenize#ProseTokenizer), which makes it easy to customize tokenization in other parts of the library.

```go
package main

import (
"fmt"
"log"

"gopkg.in/jdkato/prose.v2"
"github.com/jdkato/prose/tokenize"
)

func main() {
// Create a new document with the default configuration:
doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
if err != nil {
log.Fatal(err)
}

// Iterate over the doc's tokens:
for _, tok := range doc.Tokens() {
fmt.Println(tok.Text, tok.Tag, tok.Label)
// Go NNP B-GPE
// is VBZ O
// an DT O
// ...
}

// Iterate over the doc's named-entities:
for _, ent := range doc.Entities() {
fmt.Println(ent.Text, ent.Label)
// Go GPE
// Google GPE
}

// Iterate over the doc's sentences:
for _, sent := range doc.Sentences() {
fmt.Println(sent.Text)
// Go is an open-source programming language created at Google.
text := "They'll save and invest more."
tokenizer := tokenize.NewTreebankWordTokenizer()
for _, word := range tokenizer.Tokenize(text) {
// [They 'll save and invest more .]
fmt.Println(word)
}
}
```
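Because the tokenizers share one interface, a custom tokenizer can be passed wherever a tokenizer is expected. The following is a minimal, standard-library-only sketch of that idea. The `Tokenizer` interface is declared locally and assumes a single `Tokenize(text string) []string` method; it is not imported from `prose`. `punctSplitter` and `countTokens` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenizer is a locally declared stand-in for a single-method tokenizer
// interface (an assumption for this sketch, not the library's own type).
type Tokenizer interface {
	Tokenize(text string) []string
}

// punctSplitter is a hypothetical tokenizer: it splits on whitespace and
// emits every punctuation rune as its own token.
type punctSplitter struct{}

func (punctSplitter) Tokenize(text string) []string {
	var tokens []string
	var cur strings.Builder
	// flush appends the word collected so far, if any.
	flush := func() {
		if cur.Len() > 0 {
			tokens = append(tokens, cur.String())
			cur.Reset()
		}
	}
	for _, r := range text {
		switch {
		case unicode.IsSpace(r):
			flush()
		case unicode.IsPunct(r):
			flush()
			tokens = append(tokens, string(r))
		default:
			cur.WriteRune(r)
		}
	}
	flush()
	return tokens
}

// countTokens accepts any Tokenizer, so implementations can be swapped
// without changing the caller.
func countTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = punctSplitter{}
	text := "They'll save and invest more."
	fmt.Println(t.Tokenize(text))
	fmt.Println(countTokens(t, text))
}
```

Code that depends only on the interface, like `countTokens` here, does not need to know which tokenizer it was given.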

The document-creation process adheres to the following sequence of steps:

```text
tokenization -> POS tagging -> NE extraction
\
segmentation
```

Each step may be disabled (assuming later steps aren't required) by passing the appropriate [*functional option*](https://godoc.org/gopkg.in/jdkato/prose.v2#DocOpt). To disable named-entity extraction, for example, you'd do the following:
### Tagging ([GoDoc](https://godoc.org/github.com/jdkato/prose/tag))

```go
doc, err := prose.NewDocument(
"Go is an open-source programming language created at Google.",
prose.WithExtraction(false))
```

### Tokenizing
The `tag` package includes a port of Textblob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:

`prose` includes a tokenizer capable of handling modern text, including the non-word character spans shown below.

| Type | Example |
|-----------------|-----------------------------------|
| Email addresses | `Jane.Doe@example.com` |
| Hashtags | `#trending` |
| Mentions | `@jdkato` |
| URLs | `https://github.com/jdkato/prose` |
| Emoticons | `:-)`, `>:(`, `o_0`, etc. |
| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK | 0.893 | 7.224 |
| `prose` | 0.961 | 2.538 |

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)

```go
package main

import (
"fmt"
"log"

"gopkg.in/jdkato/prose.v2"
"github.com/jdkato/prose/tag"
"github.com/jdkato/prose/tokenize"
)

func main() {
// Create a new document with the default configuration:
doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
if err != nil {
log.Fatal(err)
}
text := "A fast and accurate part-of-speech tagger for Golang."
words := tokenize.NewTreebankWordTokenizer().Tokenize(text)

// Iterate over the doc's tokens:
for _, tok := range doc.Tokens() {
tagger := tag.NewPerceptronTagger()
for _, tok := range tagger.Tag(words) {
fmt.Println(tok.Text, tok.Tag)
// @jdkato NN
// , ,
// go VB
// to TO
// http://example.com NN
// thanks NNS
// :) SYM
// . .
}
}
```

### Segmenting
### Transforming ([GoDoc](https://godoc.org/github.com/jdkato/prose/transform))

`prose` includes one of the most accurate sentence segmenters available according to the [Golden Rules](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms) created by the developers of the `pragmatic_segmenter`.
The `transform` package implements a number of functions for changing the case of strings, including `Title`, `Snake`, `Pascal`, and `Camel`.

| Name | Language | License | GRS (English) | GRS (Other) | Speed† |
|---------------------|----------|-----------|----------------|-------------|----------|
| Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
| prose | Go | MIT | 73.07% (38/52) | N/A | 0.96 s |
| TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
| OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
| Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
| Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
| Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
| SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
| Scalpel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
Additionally, unlike `strings.Title`, `transform.Title` adheres to common title-case guidelines, including styles for both the [AP Stylebook](https://www.apstylebook.com/) and [The Chicago Manual of Style](http://www.chicagomanualofstyle.org/home.html). You can also add your own custom style by defining an [`IgnoreFunc`](https://godoc.org/github.com/jdkato/prose/transform#IgnoreFunc) callback.

> † The original tests were performed using a *MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5*, while `prose` was timed using a *MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3*.
Inspiration and test data taken from [python-titlecase](https://github.com/ppannuto/python-titlecase) and [to-title-case](https://github.com/gouch/to-title-case).

```go
package main
@@ -152,107 +95,70 @@ import (
"fmt"
"strings"

"gopkg.in/jdkato/prose.v2"
"github.com/jdkato/prose/transform"
)

func main() {
// Create a new document with the default configuration:
doc, _ := prose.NewDocument(strings.Join([]string{
"I can see Mt. Fuji from here.",
"St. Michael's Church is on 5th st. near the light."}, " "))

// Iterate over the doc's sentences:
sents := doc.Sentences()
fmt.Println(len(sents)) // 2
for _, sent := range sents {
fmt.Println(sent.Text)
// I can see Mt. Fuji from here.
// St. Michael's Church is on 5th st. near the light.
}
text := "the last of the mohicans"
tc := transform.NewTitleConverter(transform.APStyle)
fmt.Println(strings.Title(text)) // The Last Of The Mohicans
fmt.Println(tc.Title(text)) // The Last of the Mohicans
}
```
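To illustrate how a callback can drive a custom title style, here is a rough, standard-library-only sketch. The `ignoreFunc` type, the `smallWords` list, and the `titleCase` helper are all hypothetical and written for this example; they are not the `transform` package's implementation, and the word list is deliberately incomplete rather than taken from any style guide.

```go
package main

import (
	"fmt"
	"strings"
)

// ignoreFunc is a hypothetical callback: it reports whether a word should be
// left lowercase. firstOrLast is true for the first and last word of a title.
type ignoreFunc func(word string, firstOrLast bool) bool

// smallWords is an illustrative, incomplete list of words to keep lowercase.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "of": true, "the": true, "to": true,
}

// skipSmallWords keeps small words lowercase unless they open or close the title.
func skipSmallWords(word string, firstOrLast bool) bool {
	return !firstOrLast && smallWords[strings.ToLower(word)]
}

// titleCase capitalizes every whitespace-separated word that the callback
// does not ask to ignore.
func titleCase(text string, ignore ignoreFunc) string {
	words := strings.Fields(text)
	for i, w := range words {
		edge := i == 0 || i == len(words)-1
		if ignore(w, edge) {
			words[i] = strings.ToLower(w)
			continue
		}
		r := []rune(w) // strings.Fields never yields empty words
		words[i] = strings.ToUpper(string(r[0])) + strings.ToLower(string(r[1:]))
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(titleCase("the last of the mohicans", skipSmallWords))
	// The Last of the Mohicans
}
```

Passing a different callback changes the style without touching `titleCase` itself.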

### Tagging
### Summarizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/summarize))

`prose` includes a tagger based on Textblob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:
The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available due to its reliance on legitimate tokenizers (whereas others, like [readability-score](https://github.com/DaveChild/Text-Statistics/blob/master/src/DaveChild/TextStatistics/Text.php#L308), rely on naive regular expressions).

| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK | 0.893 | 7.224 |
| `prose` | 0.961 | 2.538 |
It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)
```go
package main

import (
"fmt"

"github.com/jdkato/prose/summarize"
)

func main() {
doc := summarize.NewDocument("This is some interesting text.")
fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
```
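For context on what a readability statistic involves, the sketch below computes the Flesch-Kincaid grade level from its standard formula, `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`. It uses only the standard library and deliberately naive counting (whitespace-split words, vowel-run syllables, periods as sentence ends), which is the kind of approximation a real tokenizer is meant to improve on. `gradeLevel` and `countSyllables` are hypothetical helpers, not part of `summarize`.

```go
package main

import (
	"fmt"
	"strings"
)

// gradeLevel applies the standard Flesch-Kincaid grade-level formula to
// precomputed counts. It returns 0 when there is nothing to measure.
func gradeLevel(words, sentences, syllables int) float64 {
	if words == 0 || sentences == 0 {
		return 0
	}
	w := float64(words)
	return 0.39*(w/float64(sentences)) + 11.8*(float64(syllables)/w) - 15.59
}

// countSyllables is a naive heuristic: it counts runs of consecutive vowels
// and reports at least one syllable per word.
func countSyllables(word string) int {
	count := 0
	prevVowel := false
	for _, r := range strings.ToLower(word) {
		isVowel := strings.ContainsRune("aeiouy", r)
		if isVowel && !prevVowel {
			count++
		}
		prevVowel = isVowel
	}
	if count == 0 {
		count = 1
	}
	return count
}

func main() {
	text := "This is some interesting text."
	words := strings.Fields(text)
	syllables := 0
	for _, w := range words {
		syllables += countSyllables(w)
	}
	sentences := strings.Count(text, ".") // naive sentence count
	fmt.Printf("%.2f\n", gradeLevel(len(words), sentences, syllables))
}
```

The score is only as good as the word, sentence, and syllable counts fed into it, which is why the quality of tokenization matters for these statistics.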

### Chunking ([GoDoc](https://godoc.org/github.com/jdkato/prose/chunk))

The full list of supported POS tags is given below.

| TAG | DESCRIPTION |
|------------|-------------------------------------------|
| `(` | left round bracket |
| `)` | right round bracket |
| `,` | comma |
| `:` | colon |
| `.` | period |
| `''` | closing quotation mark |
| ``` `` ``` | opening quotation mark |
| `#` | number sign |
| `$` | currency |
| `CC` | conjunction, coordinating |
| `CD` | cardinal number |
| `DT` | determiner |
| `EX` | existential there |
| `FW` | foreign word |
| `IN` | conjunction, subordinating or preposition |
| `JJ` | adjective |
| `JJR` | adjective, comparative |
| `JJS` | adjective, superlative |
| `LS` | list item marker |
| `MD` | verb, modal auxiliary |
| `NN` | noun, singular or mass |
| `NNP` | noun, proper singular |
| `NNPS` | noun, proper plural |
| `NNS` | noun, plural |
| `PDT` | predeterminer |
| `POS` | possessive ending |
| `PRP` | pronoun, personal |
| `PRP$` | pronoun, possessive |
| `RB` | adverb |
| `RBR` | adverb, comparative |
| `RBS` | adverb, superlative |
| `RP` | adverb, particle |
| `SYM` | symbol |
| `TO` | infinitival to |
| `UH` | interjection |
| `VB` | verb, base form |
| `VBD` | verb, past tense |
| `VBG` | verb, gerund or present participle |
| `VBN` | verb, past participle |
| `VBP` | verb, non-3rd person singular present |
| `VBZ` | verb, 3rd person singular present |
| `WDT` | wh-determiner |
| `WP` | wh-pronoun, personal |
| `WP$` | wh-pronoun, possessive |
| `WRB` | wh-adverb |

### NER

`prose` v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geographical/political entities (`GPE`) by default.
The `chunk` package implements named-entity extraction: given pre-tagged input and a regular expression describing the chunks you're looking for, it returns the matching chunks.

```go
package main

import (
"gopkg.in/jdkato/prose.v2"
"fmt"

"github.com/jdkato/prose/chunk"
"github.com/jdkato/prose/tag"
"github.com/jdkato/prose/tokenize"
)

func main() {
doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
for _, ent := range doc.Entities() {
fmt.Println(ent.Text, ent.Label)
// Lebron James PERSON
// Los Angeles GPE
words := tokenize.TextToWords("Go is an open source programming language created at Google.")
regex := chunk.TreebankNamedEntities

tagger := tag.NewPerceptronTagger()
for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
fmt.Println(entity) // [Go Google]
}
}
```
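The general idea of matching a regular expression against a sequence of POS tags can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the `chunk` package's implementation: `taggedToken`, `properNouns`, and `chunk` are hypothetical names, the tags are assigned by hand rather than produced by a tagger, and the pattern only looks for runs of `NNP` tags.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// taggedToken pairs a word with a hand-assigned POS tag (hypothetical type).
type taggedToken struct {
	Text, Tag string
}

// properNouns matches one or more consecutive NNP tags in a space-delimited
// tag string such as "NNP NNP VBZ ".
var properNouns = regexp.MustCompile(`(?:NNP )+`)

// chunk joins the tags into one string, runs the pattern over it, and maps
// each match back to the tokens it covers.
func chunk(tokens []taggedToken, pattern *regexp.Regexp) []string {
	var tags strings.Builder
	boundary := map[int]int{} // byte offset in the tag string -> token index
	for i, tok := range tokens {
		boundary[tags.Len()] = i
		tags.WriteString(tok.Tag)
		tags.WriteByte(' ')
	}
	boundary[tags.Len()] = len(tokens)

	var chunks []string
	for _, loc := range pattern.FindAllStringIndex(tags.String(), -1) {
		start, okStart := boundary[loc[0]]
		end, okEnd := boundary[loc[1]]
		if !okStart || !okEnd || start >= end {
			continue // match does not line up with token boundaries
		}
		words := make([]string, 0, end-start)
		for _, tok := range tokens[start:end] {
			words = append(words, tok.Text)
		}
		chunks = append(chunks, strings.Join(words, " "))
	}
	return chunks
}

func main() {
	tokens := []taggedToken{
		{"Lebron", "NNP"}, {"James", "NNP"}, {"plays", "VBZ"},
		{"basketball", "NN"}, {"in", "IN"},
		{"Los", "NNP"}, {"Angeles", "NNP"}, {".", "."},
	}
	for _, c := range chunk(tokens, properNouns) {
		fmt.Println(c)
	}
}
```

Working on the tag string rather than the raw text is what lets a single pattern describe a grammatical shape, such as "a run of proper nouns", independently of the words involved.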

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See [Prodigy + `prose`: Radically efficient machine teaching *in Go*](https://medium.com/@errata.ai/prodigy-prose-radically-efficient-machine-teaching-in-go-93389bf2d772) for a tutorial.
## License

If not otherwise specified (see below), the source files are distributed under the MIT License found in the [LICENSE](https://github.com/jdkato/prose/blob/master/LICENSE) file.

Additionally, the following files contain their own license information:

- [`tag/aptag.go`](https://github.com/jdkato/prose/blob/master/tag/aptag.go): MIT © Matthew Honnibal.
- [`tokenize/punkt.go`](https://github.com/jdkato/prose/blob/master/tokenize/punkt.go): MIT © Eric Bower.
- [`tokenize/pragmatic.go`](https://github.com/jdkato/prose/blob/master/tokenize/pragmatic.go): MIT © Kevin S. Dias.
11 changes: 11 additions & 0 deletions go.mod
@@ -0,0 +1,11 @@
module github.com/jdkato/prose

go 1.13

require (
github.com/montanaflynn/stats v0.6.3
github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d
github.com/stretchr/testify v1.6.1
github.com/urfave/cli v1.22.4
gopkg.in/neurosnap/sentences.v1 v1.0.6
)
26 changes: 26 additions & 0 deletions go.sum
@@ -0,0 +1,26 @@
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d h1:U+s90UTSYgptZMwQh2aRr3LuazLJIa+Pg3Kc1ylSYVY=
github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
github.com/davecgh/go-spew v1.1.0 h1:ZDRjVQ15GmhC3fiQ8ni8+OwkZQO4DARzQgrnXU1Liz8=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/montanaflynn/stats v0.6.3 h1:F8446DrvIF5V5smZfZ8K9nrmmix0AFgevPdLruGOmzk=
github.com/montanaflynn/stats v0.6.3/go.mod h1:wL8QJuTMNUDYhXwkmfOly8iTdp5TEcJFWZD2D7SIkUc=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/russross/blackfriday/v2 v2.0.1 h1:lPqVAte+HuHNfhJ/0LC98ESWRz8afy9tM/0RK8m9o+Q=
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d h1:rUbV6LJa5RXK3jT/4jnJUz3UkrXzW6cqB+n9Fkbv9jY=
github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d/go.mod h1:2htx6lmL0NGLHlO8ZCf+lQBGBHIbEujyywxJArf+2Yc=
github.com/shurcooL/sanitized_anchor_name v1.0.0 h1:PdmoCO6wvbs+7yrJyMORt4/BmY5IYyJwS/kOiWx8mHo=
github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.6.1 h1:hDPOHmpOpP40lSULcqw7IrRb/u7w6RpDC9399XyoNd0=
github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/urfave/cli v1.22.4 h1:u7tSpNPPswAFymm8IehJhy4uJMlUuU/GmqSkvJ1InXA=
github.com/urfave/cli v1.22.4/go.mod h1:Gos4lmkARVdJ6EkW0WaNv/tZAAMe9V7XWyB60NtXRu0=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/neurosnap/sentences.v1 v1.0.6 h1:v7ElyP020iEZQONyLld3fHILHWOPs+ntzuQTNPkul8E=
gopkg.in/neurosnap/sentences.v1 v1.0.6/go.mod h1:YlK+SN+fLQZj+kY3r8DkGDhDr91+S3JmTb5LSxFRQo0=
gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c h1:dUUwHk2QECo/6vqA44rthZ8ie2QXMNeKRTHCNY2nXvo=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
