jaju/duckling-clj

Duckling?

A Clojure library, originally created by Wit.ai, for identifying semantic content in text, such as times, measurements and currencies.

Wit.ai decided to abandon the Clojure version in favour of a Haskell version.

Introduction

https://clojars.org/org.msync/duckling/latest-version.svg

org.msync/duckling-clj is a Clojure library that parses text into structured data:

(parse "the first Tuesday of October")
; =>
{:value "2014-10-07T00:00:00.000-07:00"
 :grain :day}

For as long as the Clojure version hosted by Wit.ai remains online, you can try it out at https://duckling.wit.ai

The original blog post announcement can be found at https://wit.ai/blog/2014/10/01/open-source-parser-duckling for more context.

Quick Start

To use Duckling in your project, you just need two functions: `load!` to load the default configuration, and `parse` to parse a string.

(ns myproject.core
  (:require [duckling.core :as p]))

(p/load!) ;; Load all languages

(p/parse :en$core ;; core configuration for English ; see also :fr$core, :es$core, :zh$core
         "wake me up the last Monday of January 2015 at 6am"
         [:time]) ;; We are interested in :time expressions only ; see also :duration, :temperature, etc.

;; => [{:dim :time
;;      :start 15
;;      :end 49
;;      :value {:type "value", :value "2015-01-26T06:00:00.000-02:00", :grain :hour}
;;      :body "last Monday of January 2015 at 6am"}]

See the old hosted documentation at https://duckling.wit.ai for more information.

Working with Duckling

Duckling supports multiple languages. For reference, the current list is:

;; In the duckling.core namespace
(available-languages)
#{"nl"
  "pt"
  "en"
  "zh"
  "ro"
  "tr"
  "it"
  "vi"
  "id"
  "uk"
  "pl"
  "my"
  "sv"
  "hr"
  "fr"
  "da"
  "de"
  "nb"
  "ru"
  "ga"
  "es"
  "ja"
  "et"
  "ar"
  "ko"
  "he"}

Before you can use duckling, you need to load the corpora for the relevant languages - datasets that contain examples and rules from which duckling learns how to generalize. For example, to load the English and French data, run the following:

(load! ["en" "fr"])
{:en$core
 (:phone-number
  :number
  :distance
  :volume
  :time
  :temperature
  :url
  :email
  :timezone
  :leven-unit
  :leven-product
  :unit
  :quantity
  :amount-of-money
  :ordinal
  :unit-of-duration
  :cycle
  :duration),
 :fr$core
 (:phone-number
  :time
  :number
  :distance
  :volume
  :temperature
  :url
  :email
  :timezone
  :leven-unit
  :leven-product
  :quantity
  :unit
  :amount-of-money
  :unit-of-duration
  :cycle
  :duration
  :ordinal)}

As you may have noticed, there is support for identifying structured information such as time, money, phone numbers and temperature. The English data is now available under the key :en$core, and the French data under the key :fr$core.
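The map printed by load! can be inspected before parsing. The following is a minimal sketch that works on a shortened copy of that output rather than calling duckling. The helpers module-key and supports? are hypothetical names, not part of the library, and the "language code plus $core" key pattern is inferred from the output above.

```clojure
;; Sample data: a shortened copy of the map printed by load! above, not a live call.
(def loaded
  {:en$core '(:phone-number :number :distance :time :duration)
   :fr$core '(:phone-number :time :number :distance :duration)})

(defn module-key
  "Hypothetical helper: builds the key of a language's core module."
  [lang]
  (keyword (str lang "$core")))

(defn supports?
  "Hypothetical helper: true when dim is listed for the core module of lang."
  [loaded lang dim]
  (boolean (some #{dim} (get loaded (module-key lang)))))

(module-key "en")              ;; => :en$core
(supports? loaded "en" :time)  ;; => true
(supports? loaded "de" :time)  ;; => false, "de" is not in the sample map
```

A check like this lets calling code fail early when a language or dimension was never loaded.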

To parse a sentence in a known language, call the parse function with the matching language key. For example:

(parse :en$core "Meet me at 8")
({:dim :number,
  :body "8",
  :value {:type "value", :value 8},
  :start 11,
  :end 12}
 {:dim :distance,
  :body "8",
  :value {:type "value", :value 8},
  :start 11,
  :end 12,
  :latent true}
 {:dim :volume,
  :body "8",
  :value {:type "value", :value 8},
  :start 11,
  :end 12,
  :latent true}
 {:dim :temperature,
  :body "8",
  :value {:type "value", :value 8},
  :start 11,
  :end 12,
  :latent true}
 {:dim :time,
  :body "at 8",
  :value
  {:type "value",
   :value "2021-04-17T08:00:00.000+05:30",
   :grain :hour,
   :values
   ({:type "value",
     :value "2021-04-17T08:00:00.000+05:30",
     :grain :hour}
    {:type "value",
     :value "2021-04-17T20:00:00.000+05:30",
     :grain :hour}
    {:type "value",
     :value "2021-04-18T08:00:00.000+05:30",
     :grain :hour})},
  :start 8,
  :end 12})

The returned sequence holds several possible interpretations, one map per interpretation, and the caller should pick the most appropriate one. The type of the value - the dimension - is given under the :dim key. Interpretations that duckling is less confident about carry a :latent true flag; the ones it is more confident about have no :latent key. So, in the above example, :number and :time are the most confident interpretations.
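Picking the confident interpretations can be done with plain sequence functions. This is a small sketch over a trimmed copy of the result above, not a live call. The helper name confident is hypothetical.

```clojure
;; Sample data: a trimmed copy of the result shown above, not a live call.
(def results
  [{:dim :number, :body "8", :value {:type "value", :value 8}, :start 11, :end 12}
   {:dim :distance, :body "8", :value {:type "value", :value 8}, :start 11, :end 12, :latent true}
   {:dim :temperature, :body "8", :value {:type "value", :value 8}, :start 11, :end 12, :latent true}
   {:dim :time, :body "at 8",
    :value {:type "value", :value "2021-04-17T08:00:00.000+05:30", :grain :hour}
    :start 8, :end 12}])

(defn confident
  "Hypothetical helper: drops every interpretation that has a truthy :latent flag."
  [parsed]
  (remove :latent parsed))

(def confident-dims (mapv :dim (confident results)))
;; confident-dims => [:number :time]
```

Using the keyword :latent as the predicate works because a missing key yields nil, so maps without the flag are kept.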

If you already know which dimension you want to extract, you can specify it:

(parse :en$core "Meet me at 8" [:time])
({:dim :time,
  :body "at 8",
  :value
  {:type "value",
   :value "2021-04-17T20:00:00.000+05:30",
   :grain :hour,
   :values
   ({:type "value",
     :value "2021-04-17T20:00:00.000+05:30",
     :grain :hour}
    {:type "value",
     :value "2021-04-18T08:00:00.000+05:30",
     :grain :hour}
    {:type "value",
     :value "2021-04-18T20:00:00.000+05:30",
     :grain :hour})},
  :start 8,
  :end 12})

Notice that the results are contextual - they depend on the time at which parse was called. In the above example, 8 was interpreted as the nearest upcoming times at which the clock shows 8 - both PM and AM.
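The alternative readings sit under the nested :values key of a time result. Here is a minimal sketch for pulling them out, using a copy of the result above as sample data. The helpers candidate-times and hour-of are hypothetical, and the sketch assumes the timestamps are ISO-8601 strings with an offset, as they appear in the output.

```clojure
;; Sample data copied from the result above, not a live call.
(def time-result
  {:dim :time
   :body "at 8"
   :value {:type "value"
           :value "2021-04-17T20:00:00.000+05:30"
           :grain :hour
           :values [{:type "value" :value "2021-04-17T20:00:00.000+05:30" :grain :hour}
                    {:type "value" :value "2021-04-18T08:00:00.000+05:30" :grain :hour}
                    {:type "value" :value "2021-04-18T20:00:00.000+05:30" :grain :hour}]}
   :start 8
   :end 12})

(defn candidate-times
  "Hypothetical helper: the timestamp strings of all alternative readings."
  [result]
  (mapv :value (get-in result [:value :values])))

(defn hour-of
  "Hypothetical helper: local hour (0-23) of an ISO-8601 offset date-time string."
  [^String iso]
  (.getHour (java.time.OffsetDateTime/parse iso)))

(def candidate-hours (mapv hour-of (candidate-times time-result)))
;; candidate-hours => [20 8 20]

;; Pick the first candidate that falls before noon.
(def first-morning
  (first (filter #(< (hour-of %) 12) (candidate-times time-result))))
;; first-morning => "2021-04-18T08:00:00.000+05:30"
```

A caller that knows, for instance, that meetings happen in the morning can use such a filter instead of taking the top-level :value.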

You can also supply a context map whose :reference-time is used as "now" while parsing.

(require '[duckling.time.obj :as time])
(parse :en$core "Meet me at 8" [:time] {:reference-time (time/t 0 2020 4 1)})
({:dim :time,
  :body "at 8",
  :value
  {:type "value",
   :value "2020-04-01T08:00:00.000Z",
   :grain :hour,
   :values
   ({:type "value", :value "2020-04-01T08:00:00.000Z", :grain :hour}
    {:type "value", :value "2020-04-01T20:00:00.000Z", :grain :hour}
    {:type "value", :value "2020-04-02T08:00:00.000Z", :grain :hour})},
  :start 8,
  :end 12})

Another interesting example is the following - duckling can take other signals into account, like the word tomorrow below.

(parse :en$core "Meet me tomorrow at 8" [:time] {:reference-time (time/t 0 2020 4 1)})
({:dim :time,
  :body "tomorrow at 8",
  :value
  {:type "value",
   :value "2020-04-02T08:00:00.000Z",
   :grain :hour,
   :values
   ({:type "value", :value "2020-04-02T08:00:00.000Z", :grain :hour}
    {:type "value", :value "2020-04-02T20:00:00.000Z", :grain :hour})},
  :start 8,
  :end 21})
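In the results above, :start and :end look like character offsets into the input string, with :end exclusive. That reading is inferred from the printed outputs, not from documented behaviour. Under that assumption, the matched text can be recovered with subs. The helper matched-text is a hypothetical name.

```clojure
;; Sample data: the input and a trimmed copy of the result above, not a live call.
(def input "Meet me tomorrow at 8")

(def result
  {:dim :time, :body "tomorrow at 8", :start 8, :end 21})

(defn matched-text
  "Hypothetical helper: the slice of s covered by a result, assuming :start is
  an inclusive and :end an exclusive character offset."
  [s {:keys [start end]}]
  (subs s start end))

(matched-text input result)
;; => "tomorrow at 8"

(= (:body result) (matched-text input result))
;; => true
```

The offsets are handy for highlighting the recognized span in the original text.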
