Add modules

jdkato · Jun 16, 2020 · 7be755d
1 parent 0251669
Showing 3 changed files with 120 additions and 177 deletions.
diff --git a/README.md b/README.md
@@ -1,149 +1,92 @@
-# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![fuzzit](https://app.fuzzit.dev/badge?org_id=prose=master)](https://fuzzit.dev) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/gopkg.in/jdkato/prose.v2) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=v2) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)
+# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![Build status](https://ci.appveyor.com/api/projects/status/24bepq85nnnk4scr/branch/master?svg=true)](https://ci.appveyor.com/project/jdkato/prose/branch/master) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/github.com/jdkato/prose) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)
 
-`prose` is a natural language processing library (English only) in *pure Go*. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
 
-You can find a more detailed summary on the library's performance here: [Introducing `prose` v2.0.0: Bringing NLP *to Go*](https://medium.com/@errata.ai/introducing-prose-v2-0-0-bringing-nlp-to-go-a1f0c121e4a5).
+`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.
 
-> **NOTE**: If you're looking for v1.0.0's README, you can still [find it here](https://github.com/jdkato/prose/blob/v1/README.md).
+See the [GoDoc documentation](https://godoc.org/github.com/jdkato/prose) for more information.
 
-## Installation
+## Install
 
 ```console
-$ go get gopkg.in/jdkato/prose.v2
+$ go get github.com/jdkato/prose/...
 ```
 
+> **NOTE**: When using some vendoring tools, such as `govendor`, you may need to include the `github.com/jdkato/prose/internal/` package in addition to the core package(s). See [#14](https://github.com/jdkato/prose/issues/14) for more information.
+
 ## Usage
 
 ### Contents
 
-* [Overview](#overview)
-* [Tokenizing](#tokenizing)
-* [Segmenting](#segmenting)
-* [Tagging](#tagging)
-* [NER](#ner)
+* [Tokenizing](#tokenizing-godoc)
+* [Tagging](#tagging-godoc)
+* [Transforming](#transforming-godoc)
+* [Summarizing](#summarizing-godoc)
+* [Chunking](#chunking-godoc)
+* [License](#license)
+
 
-### Overview
+### Tokenizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/tokenize))
 
+Word, sentence, and regexp tokenizers are available. Every tokenizer implements the [same interface](https://godoc.org/github.com/jdkato/prose/tokenize#ProseTokenizer), which makes it easy to customize tokenization in other parts of the library.
 
 ```go
 package main
 
 import (
     "fmt"
-    "log"
 
-    "gopkg.in/jdkato/prose.v2"
+    "github.com/jdkato/prose/tokenize"
 )
 
 func main() {
-    // Create a new document with the default configuration:
-    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
-    if err != nil {
-        log.Fatal(err)
-    }
-
-    // Iterate over the doc's tokens:
-    for _, tok := range doc.Tokens() {
-        fmt.Println(tok.Text, tok.Tag, tok.Label)
-        // Go NNP B-GPE
-        // is VBZ O
-        // an DT O
-        // ...
-    }
-
-    // Iterate over the doc's named-entities:
-    for _, ent := range doc.Entities() {
-        fmt.Println(ent.Text, ent.Label)
-        // Go GPE
-        // Google GPE
-    }
-
-    // Iterate over the doc's sentences:
-    for _, sent := range doc.Sentences() {
-        fmt.Println(sent.Text)
-        // Go is an open-source programming language created at Google.
+    text := "They'll save and invest more."
+    tokenizer := tokenize.NewTreebankWordTokenizer()
+    for _, word := range tokenizer.Tokenize(text) {
+        // Prints one token per line: They, 'll, save, and, invest, more, .
+        fmt.Println(word)
     }
 }
 ```
 
-The document-creation process adheres to the following sequence of steps:
-
-```text
-tokenization -> POS tagging -> NE extraction
-            \
-             segmentation
-```
-
-Each step may be disabled (assuming later steps aren't required) by passing the appropriate [*functional option*](https://godoc.org/gopkg.in/jdkato/prose.v2#DocOpt). To disable named-entity extraction, for example, you'd do the following:
+### Tagging ([GoDoc](https://godoc.org/github.com/jdkato/prose/tag))
 
-```go
-doc, err := prose.NewDocument(
-        "Go is an open-source programming language created at Google.",
-        prose.WithExtraction(false))
-```
-
-### Tokenizing
+The `tag` package includes a port of Textblob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:
 
-`prose` includes a tokenizer capable of handling modern text, including the non-word character spans shown below.
-
-| Type            | Example                           |
-|-----------------|-----------------------------------|
-| Email addresses | `Jane.Doe@example.com`            |
-| Hashtags        | `#trending`                       |
-| Mentions        | `@jdkato`                         |
-| URLs            | `https://github.com/jdkato/prose` |
-| Emoticons       | `:-)`, `>:(`, `o_0`, etc.         |
+| Library | Accuracy | 5-Run Average (sec) |
+|:--------|---------:|--------------------:|
+| NLTK    |    0.893 |               7.224 |
+| `prose` |    0.961 |               2.538 |
 
+(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)
 
 ```go
 package main
 
 import (
     "fmt"
-    "log"
 
-    "gopkg.in/jdkato/prose.v2"
+    "github.com/jdkato/prose/tag"
+    "github.com/jdkato/prose/tokenize"
 )
 
 func main() {
-    // Create a new document with the default configuration:
-    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
-    if err != nil {
-        log.Fatal(err)
-    }
+    text := "A fast and accurate part-of-speech tagger for Golang."
+    words := tokenize.NewTreebankWordTokenizer().Tokenize(text)
 
-    // Iterate over the doc's tokens:
-    for _, tok := range doc.Tokens() {
+    tagger := tag.NewPerceptronTagger()
+    for _, tok := range tagger.Tag(words) {
         fmt.Println(tok.Text, tok.Tag)
-        // @jdkato NN
-        // , ,
-        // go VB
-        // to TO
-        // http://example.com NN
-        // thanks NNS
-        // :) SYM
-        // . .
     }
 }
 ```
 
-### Segmenting
+### Transforming ([GoDoc](https://godoc.org/github.com/jdkato/prose/transform))
 
-`prose` includes one of the most accurate sentence segmenters available according to the [Golden Rules](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms) created by the developers of the `pragmatic_segmenter`.
+The `transform` package implements a number of functions for changing the case of strings, including `Title`, `Snake`, `Pascal`, and `Camel`.
 
-| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†   |
-|---------------------|----------|-----------|----------------|-------------|----------|
-| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s   |
-| prose               | Go       | MIT       | 73.07% (38/52) | N/A         | 0.96 s   |
-| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s  |
-| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s   |
-| Standford CoreNLP   | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s   |
-| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A      |
-| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s   |
-| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s   |
-| Scapel              | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s   |
+Additionally, unlike `strings.Title`, `transform.Title` adheres to common title-casing guidelines, including styles for both the [AP Stylebook](https://www.apstylebook.com/) and [The Chicago Manual of Style](http://www.chicagomanualofstyle.org/home.html). You can also add your own custom style by defining an [`IgnoreFunc`](https://godoc.org/github.com/jdkato/prose/transform#IgnoreFunc) callback.
 
-> † The original tests were performed using a *MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5*, while `prose` was timed using a *MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3*.
+Inspiration and test data taken from [python-titlecase](https://github.com/ppannuto/python-titlecase) and [to-title-case](https://github.com/gouch/to-title-case).
 
 ```go
 package main
@@ -152,107 +95,70 @@ import (
     "fmt"
     "strings"
 
-    "gopkg.in/jdkato/prose.v2"
+    "github.com/jdkato/prose/transform"
 )
 
 func main() {
-    // Create a new document with the default configuration:
-    doc, _ := prose.NewDocument(strings.Join([]string{
-        "I can see Mt. Fuji from here.",
-        "St. Michael's Church is on 5th st. near the light."}, " "))
-
-    // Iterate over the doc's sentences:
-    sents := doc.Sentences()
-    fmt.Println(len(sents)) // 2
-    for _, sent := range sents {
-        fmt.Println(sent.Text)
-        // I can see Mt. Fuji from here.
-        // St. Michael's Church is on 5th st. near the light.
-    }
+    text := "the last of the mohicans"
+    tc := transform.NewTitleConverter(transform.APStyle)
+    fmt.Println(strings.Title(text)) // The Last Of The Mohicans
+    fmt.Println(tc.Title(text))      // The Last of the Mohicans
 }
 ```
 
-### Tagging
+### Summarizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/summarize))
 
-`prose` includes a tagger based on Textblob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:
+The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available due to its reliance on legitimate tokenizers (whereas others, like [readability-score](https://github.com/DaveChild/Text-Statistics/blob/master/src/DaveChild/TextStatistics/Text.php#L308), rely on naive regular expressions).
 
-| Library | Accuracy | 5-Run Average (sec) |
-|:--------|---------:|--------------------:|
-| NLTK    |    0.893 |               7.224 |
-| `prose` |    0.961 |               2.538 |
+It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.
 
-(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)
+```go
+package main
+
+import (
+    "fmt"
+
+    "github.com/jdkato/prose/summarize"
+)
+
+func main() {
+    doc := summarize.NewDocument("This is some interesting text.")
+    fmt.Println(doc.SMOG(), doc.FleschKincaid())
+}
+```
+
+### Chunking ([GoDoc](https://godoc.org/github.com/jdkato/prose/chunk))
 
-The full list of supported POS tags is given below.
-
-| TAG        | DESCRIPTION                               |
-|------------|-------------------------------------------|
-| `(`        | left round bracket                        |
-| `)`        | right round bracket                       |
-| `,`        | comma                                     |
-| `:`        | colon                                     |
-| `.`        | period                                    |
-| `''`       | closing quotation mark                    |
-| ``` `` ``` | opening quotation mark                    |
-| `#`        | number sign                               |
-| `$`        | currency                                  |
-| `CC`       | conjunction, coordinating                 |
-| `CD`       | cardinal number                           |
-| `DT`       | determiner                                |
-| `EX`       | existential there                         |
-| `FW`       | foreign word                              |
-| `IN`       | conjunction, subordinating or preposition |
-| `JJ`       | adjective                                 |
-| `JJR`      | adjective, comparative                    |
-| `JJS`      | adjective, superlative                    |
-| `LS`       | list item marker                          |
-| `MD`       | verb, modal auxiliary                     |
-| `NN`       | noun, singular or mass                    |
-| `NNP`      | noun, proper singular                     |
-| `NNPS`     | noun, proper plural                       |
-| `NNS`      | noun, plural                              |
-| `PDT`      | predeterminer                             |
-| `POS`      | possessive ending                         |
-| `PRP`      | pronoun, personal                         |
-| `PRP$`     | pronoun, possessive                       |
-| `RB`       | adverb                                    |
-| `RBR`      | adverb, comparative                       |
-| `RBS`      | adverb, superlative                       |
-| `RP`       | adverb, particle                          |
-| `SYM`      | symbol                                    |
-| `TO`       | infinitival to                            |
-| `UH`       | interjection                              |
-| `VB`       | verb, base form                           |
-| `VBD`      | verb, past tense                          |
-| `VBG`      | verb, gerund or present participle        |
-| `VBN`      | verb, past participle                     |
-| `VBP`      | verb, non-3rd person singular present     |
-| `VBZ`      | verb, 3rd person singular present         |
-| `WDT`      | wh-determiner                             |
-| `WP`       | wh-pronoun, personal                      |
-| `WP$`      | wh-pronoun, possessive                    |
-| `WRB`      | wh-adverb                                 |
-
-### NER
-
-`prose` v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geographical/political Entities (`GPE`) by default.
+The `chunk` package implements named-entity extraction using a regular expression indicating what chunks you're looking for and pre-tagged input.
 
 ```go
 package main
 
 import (
-    "gopkg.in/jdkato/prose.v2"
+    "fmt"
+
+    "github.com/jdkato/prose/chunk"
+    "github.com/jdkato/prose/tag"
+    "github.com/jdkato/prose/tokenize"
 )
 
 func main() {
-    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
-    for _, ent := range doc.Entities() {
-        fmt.Println(ent.Text, ent.Label)
-        // Lebron James PERSON
-        // Los Angeles GPE
+    words := tokenize.TextToWords("Go is an open source programming language created at Google.")
+    regex := chunk.TreebankNamedEntities
+
+    tagger := tag.NewPerceptronTagger()
+    for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
+        fmt.Println(entity) // Prints each entity on its own line: Go, then Google
     }
 }
 ```
 
-However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See [Prodigy + `prose`: Radically efficient machine teaching *in Go*](https://medium.com/@errata.ai/prodigy-prose-radically-efficient-machine-teaching-in-go-93389bf2d772) for a tutorial.
+## License
+
+If not otherwise specified (see below), the source files are distributed under the MIT License found in the [LICENSE](https://github.com/jdkato/prose/blob/master/LICENSE) file.
+
+Additionally, the following files contain their own license information:
 
+- [`tag/aptag.go`](https://github.com/jdkato/prose/blob/master/tag/aptag.go): MIT © Matthew Honnibal.
+- [`tokenize/punkt.go`](https://github.com/jdkato/prose/blob/master/tokenize/punkt.go): MIT © Eric Bower.
+- [`tokenize/pragmatic.go`](https://github.com/jdkato/prose/blob/master/tokenize/pragmatic.go): MIT © Kevin S. Dias.
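
The Transforming section of the README above describes title casing that leaves certain words lowercase, with a callback deciding which ones. A minimal standard-library sketch of that idea might look like the following. The names `ignoreFunc`, `skipSmallWords`, and `titleCase`, along with the word list, are hypothetical and chosen for illustration only; they are not the `transform` package's API and do not reproduce the AP or Chicago rules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// ignoreFunc is a hypothetical callback that reports whether a word should be
// left lowercase. It is NOT the signature of the prose library's IgnoreFunc.
type ignoreFunc func(word string, firstOrLast bool) bool

// smallWords is an illustrative, incomplete list of words that title-case
// styles commonly leave lowercase.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "at": true, "by": true, "for": true,
	"in": true, "of": true, "on": true, "or": true, "the": true, "to": true,
}

// skipSmallWords ignores small words unless they begin or end the title.
func skipSmallWords(word string, firstOrLast bool) bool {
	return !firstOrLast && smallWords[word]
}

// titleCase lowercases the text, splits it on whitespace, and capitalizes the
// first rune of every word that the callback does not ask to ignore.
func titleCase(text string, ignore ignoreFunc) string {
	words := strings.Fields(strings.ToLower(text))
	for i, w := range words {
		if ignore(w, i == 0 || i == len(words)-1) {
			continue
		}
		r := []rune(w)
		r[0] = unicode.ToUpper(r[0])
		words[i] = string(r)
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(titleCase("the last of the mohicans", skipSmallWords))
	// The Last of the Mohicans
}
```

Passing the rule in as a function keeps the casing loop independent of any one style guide, which is presumably the motivation for a callback-based design.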
diff --git a/go.mod b/go.mod
@@ -0,0 +1,11 @@
+module github.com/jdkato/prose
+
+go 1.13
+
+require (
+	github.com/montanaflynn/stats v0.6.3
+	github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d
+	github.com/stretchr/testify v1.6.1
+	github.com/urfave/cli v1.22.4
+	gopkg.in/neurosnap/sentences.v1 v1.0.6
+)
diff --git a/go.sum b/go.sum
@@ -0,0 +1,26 @@
+github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
+github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d h1:U+s90UTSYgptZMwQh2aRr3LuazLJIa+Pg3Kc1ylSYVY=
+github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
+github.com/davecgh/go-spew v1.1.0 h1:ZDRjVQ15GmhC3fiQ8ni8+OwkZQO4DARzQgrnXU1Liz8=
+github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
+github.com/montanaflynn/stats v0.6.3 h1:F8446DrvIF5V5smZfZ8K9nrmmix0AFgevPdLruGOmzk=
+github.com/montanaflynn/stats v0.6.3/go.mod h1:wL8QJuTMNUDYhXwkmfOly8iTdp5TEcJFWZD2D7SIkUc=
+github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
+github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
+github.com/russross/blackfriday/v2 v2.0.1 h1:lPqVAte+HuHNfhJ/0LC98ESWRz8afy9tM/0RK8m9o+Q=
+github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
+github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d h1:rUbV6LJa5RXK3jT/4jnJUz3UkrXzW6cqB+n9Fkbv9jY=
+github.com/shogo82148/go-shuffle v0.0.0-20180218125048-27e6095f230d/go.mod h1:2htx6lmL0NGLHlO8ZCf+lQBGBHIbEujyywxJArf+2Yc=
+github.com/shurcooL/sanitized_anchor_name v1.0.0 h1:PdmoCO6wvbs+7yrJyMORt4/BmY5IYyJwS/kOiWx8mHo=
+github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
+github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
+github.com/stretchr/testify v1.6.1 h1:hDPOHmpOpP40lSULcqw7IrRb/u7w6RpDC9399XyoNd0=
+github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
+github.com/urfave/cli v1.22.4 h1:u7tSpNPPswAFymm8IehJhy4uJMlUuU/GmqSkvJ1InXA=
+github.com/urfave/cli v1.22.4/go.mod h1:Gos4lmkARVdJ6EkW0WaNv/tZAAMe9V7XWyB60NtXRu0=
+gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/neurosnap/sentences.v1 v1.0.6 h1:v7ElyP020iEZQONyLld3fHILHWOPs+ntzuQTNPkul8E=
+gopkg.in/neurosnap/sentences.v1 v1.0.6/go.mod h1:YlK+SN+fLQZj+kY3r8DkGDhDr91+S3JmTb5LSxFRQo0=
+gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
+gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c h1:dUUwHk2QECo/6vqA44rthZ8ie2QXMNeKRTHCNY2nXvo=
+gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=