lib-text

A little text processing library for Scala.

Overview

lib-text is a small text processing library that supports language identification, tokenization, and stopword filtering, and provides some useful helper functions. The tokenizer is tuned for text conventions common in social media such as Twitter, and handles URLs, emoji, hashtags, email addresses, and @-mentions cleanly. Stopword filtering is currently supported for:

  • German
  • English
  • Spanish
  • French
  • Indonesian
  • Japanese
  • Malay
  • Dutch
  • Portuguese
  • Swedish
  • Turkish
  • Arabic

More to come.
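
To give a feel for what social-media-aware tokenization means, here is a minimal standalone sketch using only the Scala standard library. It is not lib-text's implementation: the `ToyTokenizer` object and its regular expression are hypothetical, ASCII-oriented, and leave out emoji handling. It only shows the general idea of matching URLs, email addresses, @-mentions, and hashtags as whole tokens before falling back to words and punctuation.

```scala
object ToyTokenizer {
  // Hypothetical pattern, not taken from lib-text. Alternatives are tried
  // left to right, so URLs, emails, mentions and hashtags are matched
  // before the generic word and punctuation patterns.
  private val Token = (
    """https?://\S+""" +                    // URLs
    """|[\w.+-]+@[\w-]+(?:\.[\w-]+)+""" +   // email addresses
    """|[@#]\w+""" +                        // @-mentions and hashtags
    """|\w+(?:'\w+)*""" +                   // words, including contractions
    """|[^\s\w]"""                          // any other single non-space symbol
  ).r

  // Returns the matched tokens in the order they appear in the text.
  def tokenize(text: String): Vector[String] =
    Token.findAllIn(text).toVector
}
```

With this sketch, `ToyTokenizer.tokenize("Ping @alice about #scala!")` keeps `@alice` and `#scala` intact and splits off the trailing `!` as its own token.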

Usage

Add the resolver and the dependency to your sbt build:

resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"

libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3"

Example

import com.peoplepattern.text.Implicits._

val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"

txt.lang
// Some(en)

txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)

txt.terms
// Set(print, personalised, black, copy, friday, time)

txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)

txt.termBigrams
// Set(black friday, personalised print)
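
The relationship between tokens, terms, and term bigrams in the output above can be sketched in plain Scala. This is an illustration only, not the library's code: the `ToyTerms` object, its tiny stopword list, and the rule that a bigram is a pair of adjacent tokens that are both terms are assumptions inferred from the example output.

```scala
object ToyTerms {
  // Hypothetical, deliberately tiny stopword list for this one example;
  // a real stopword list is far larger.
  val stopwords: Set[String] =
    Set("did", "you", "get", "your", "with", "of", "on", "if", "not", "still")

  // Assumed rule: a term is a non-empty, purely alphabetic token that is
  // not a stopword. Hashtags, URLs and punctuation are therefore excluded.
  private def isTerm(token: String): Boolean =
    token.nonEmpty && token.forall(_.isLetter) && !stopwords(token)

  // Lowercase every token, then keep only the terms.
  def terms(tokens: Seq[String]): Set[String] =
    tokens.map(_.toLowerCase).filter(isTerm).toSet

  // Assumed rule: a bigram is a pair of adjacent tokens that are both terms.
  def termBigrams(tokens: Seq[String]): Set[String] = {
    val lower = tokens.map(_.toLowerCase)
    lower.zip(lower.drop(1)).collect {
      case (a, b) if isTerm(a) && isTerm(b) => s"$a $b"
    }.toSet
  }
}
```

Applied to the token vector shown for `txt.tokens`, this sketch yields the same six terms and two bigrams as the example, but only because the stopword list was chosen to fit that sentence.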

License

lib-text is open source and licensed under the Apache License 2.0.

Acknowledgements

Developed with ❤️ at People Pattern Corporation
