Skip to content

go parser for human readable dates ported from the dateparser python package

License

Notifications You must be signed in to change notification settings

markusmobius/go-dateparser

Repository files navigation

Go-DateParser Go Reference

This package provides functionality to easily parse localized dates in almost any string formats commonly found on web pages. This is Go port from the Python library with the same name.

To use it, install the package inside your project:

go get -u -v github.com/markusmobius/go-dateparser

Table of Contents

1. Features

  • Generic parsing of dates in over 200 language locales plus numerous formats in a language agnostic fashion.
  • Generic parsing of relative dates like: "1 min ago", "2 weeks ago", "3 months, 1 week and 1 day ago", "in 2 days", "tomorrow".
  • Generic parsing of dates with time zones abbreviations or UTC offsets like: "August 14, 2015 EST", "July 4, 2013 PST", "21 July 2013 10:15 pm +0500".
  • Date lookup in longer texts.
  • Support for non-Gregorian calendar systems. See Supported Calendars.
  • Extensive test coverage.

2. Status

This package is up to date with the original dateparser until commit 748e48a (several commits after v1.2.0). There are several behavior and implementation differences between this port and the original:

  • In Python, timezone data is not included in date and time objects. Meanwhile in Go timezone data is required.
  • Regex in Go is pretty slow, so in this port whenever possible we use basic strings or runes operations instead of regex to improve the performance.

3. Common Use Cases

dateparser can be used for many different purposes, but it stands out when it comes to:

Consuming data from different sources:

  • Scraping: extract dates from different places with several different formats and languages
  • IoT: consuming data coming from different sources with different date formats
  • Tooling: consuming dates from different logs / sources
  • Format transformations: when transforming dates coming from different files (PDF, CSV, etc.) to other formats (database, etc).

Offering natural interaction with users:

  • Tooling and CLI: allow users to write 3 days ago to retrieve information.
  • Search engine: allow people to search by date in an easiest / natural format.
  • Bots: allow users to interact with a bot easily

4. Usage

The most straightforward way is to use the dateparser.Parse function that wraps around most of the functionality:

dt, err := dateparser.Parse(nil, "6 July 2020")
// locale: en, time: 2020-07-06 00:00:00, confidence: Day

You can also extract dates from longer strings of text by using dateparser.Search. The extracted dates are returned as list of SearchResult containing data of the extracted date and its original string:

Support for searching dates is really limited and needs a lot of improvement.

_, dates, _ := dps.Search(nil, "The client arrived to the office for the first time "+
	"in March 3rd, 2004 and got serviced, after a couple of months, on May 6th 2004, "+
	"the customer returned indicating a defect on the part")

// Output, formatted:
// "in March 3rd, 2004 and" => time: 2004-03-03 00:00:00, confidence: Day
// "on May 6th 2004" => time: 2004-05-06 00:00:00, confidence: Day

If you need more control over what is being parsed, check the documentation for Configuration and Parser.

5. Supported Languages and Locales

You can check the supported locales by checking out this code. In total there are almost 300 locales across 205 languages.

6. Supported Patterns

There are several patterns that can be parsed by dateparser:

  • Timestamp
  • Relative date
  • Absolute date
  • Custom specified formats

6.1. Timestamp

Timestamp is the date time string in Unix time format:

dt, _ := dps.Parse(nil, "1570308760263")
// time: 2019-10-05 20:52:40 +0000, confidence: Day

6.2. Relative Date

Relative date is pattern where the date is specified in relative value like "1 week ago" or "5 years ago". Dateparser will try to parse a date from the given string, attempting to detect the language for each time:

// Current time is 2015-06-01 00:00:00
cfg := &dps.Configuration{
	CurrentTime: time.Date(2015, 6, 1, 0, 0, 0, 0, time.UTC),
}

var dt date.Date
dt, _ = dps.Parse(cfg, "1 hour ago")
// time: 2015-05-31 23:00:00 UTC, confidence: Day

dt, _ = dps.Parse(cfg, "Il ya 2 heures") // French (2 hours ago)
// time: 2015-05-31 22:00:00 UTC, confidence: Day

dt, _ = dps.Parse(cfg, "1 anno 2 mesi") // Italian (1 year 2 months)
// time: 2014-04-01 00:00:00 UTC, confidence: Month

dt, _ = dps.Parse(cfg, "yaklaşık 23 saat önce") // Turkish (23 hours ago)
// time: 2015-05-31 01:00:00 UTC, confidence: Day

dt, _ = dps.Parse(cfg, "Hace una semana") // Spanish (a week ago)
// time: 2015-05-25 00:00:00 UTC, confidence: Day

dt, _ = dps.Parse(cfg, "2小时前") // Chinese (2 hours ago)
// time: 2015-05-31 22:00:00 UTC, confidence: Day

6.3. Absolute Date

Absolute date is pattern where the date (and time) is explicitly mentioned, e.g. "12 August 2021" or "23 January, 15:05". Like the relative time, dateparser will try to detect the language for each time:

dt, _ = dps.Parse(nil, "12/12/12")
// time: 2012-12-12 00:00:00 UTC, confidence: Day

dt, _ = dps.Parse(nil, "Fri, 12 Dec 2014 10:55:50")
// time: 2014-12-12 10:55:50 UTC, confidence: Day

dt, _ = dps.Parse(nil, "Martes 21 de Octubre de 2014") // Spanish (Tuesday 21 October 2014)
// time: 2014-10-21 00:00:00 UTC, confidence: Day

dt, _ = dps.Parse(nil, "Le 11 Décembre 2014 à 09:00") // French (11 December 2014 at 09:00)
// time: 2014-12-11 09:00:00 UTC, confidence: Day

dt, _ = dps.Parse(nil, "13 января 2015 г. в 13:34") // Russian (13 January 2015 at 13:34)
// time: 2015-01-13 13:34:00 UTC, confidence: Day

dt, _ = dps.Parse(nil, "1 เดือนตุลาคม 2005, 1:00 AM") // Thai (1 October 2005, 1:00 AM)
// time: 2005-10-01 01:00:00 UTC, confidence: Day

6.4. Custom Specified Formats

If you know the possible formats of the dates, you can specify it yourself using the Go time's layout:

dt, _ = dps.Parse(nil, "22 Décembre 2010", "02 January 2006")
// time: 2010-12-22 00:00:00 UTC, confidence: Day

7. Supported Calendars

Apart from the Georgian calendar, dateparser also supports the Persian Jalali calendar and the Hijri calendar which can be used by calling dateparser.ParseJalali and dateparser.ParseHijri.

Persian Jalali calendar is also often called Solar Hijri calendar and commonly used in Iran and Afghanistan.

Hijri calendar is a Lunar calendar which commonly used in Islamic countries. Some only use it for religious purposes (e.g. calculating when Ramadhan or Hajj started) while some also use it for both administrative purposes.

There are several variations of Hijri calendar, e.g. Tabular and Umm al-Qura. In dateparser we use Umm al-Qura calendar since it's the one that apparently mostly used by Islamic world.

dt, _ = dps.ParseJalali(nil, "جمعه سی ام اسفند ۱۳۸۷")
// time: 2009-03-20 00:00:00, confidence: Day

dt, _ = dps.ParseHijri(nil, "17-01-1437 هـ 08:30 مساءً")
// time: 2015-10-30 20:30:00, confidence: Day

Note: Hijri and Jalali parser on support the absolute date pattern, so relative and timestamp pattern won't work.

8. Language Based Date Order

By default dateparser will use date order from the detected language:

dt, _ = dps.Parse(nil, "02-03-2016") // assumes english language, uses MDY date order
// time: 2016-02-03 00:00:00 UTC, confidence: Day

dt, _ = dps.Parse(nil, "le 02-03-2016") // detects french, uses DMY date order
// time: 2016-03-02 00:00:00 UTC, confidence: Day

However, ordering is not locale based. So, since most English speakers come from America which uses MDY order, it will be used as default order. That's why you can't expect to use DMY order for UK/Australia English by default.

To solve this issue, you can explicitly specify the date order in Configuration:

cfg = &dps.Configuration{DateOrder: dps.DMY}
dt, _ = dps.Parse(cfg, "02-03-2016")
// time: 2016-03-02 00:00:00 UTC, confidence: Day

You can also specify the date order only for a specific locale:

cfg = &dps.Configuration{
	DateOrder: func(locale string) string {
		if locale == "en" {
			return "DMY"
		}
		return dps.DefaultDateOrder(locale)
	},
}

dt, _ = dps.Parse(cfg, "02-03-2016") // english language, now use DMY date order
// locale: en, time: 2016-03-02 00:00:00 UTC, confidence: Day

dt, _ = dps.Parse(cfg, "miy 02-03-2016") // philippines, keep using MDY date order
// locale: fil, time: 2016-02-03 00:00:00 UTC, confidence: Day

9. Timezone And UTC Offset

By default, dateparser uses the timezone that is present in the date string. If the timezone is not present, then it will use timezone from the CurrentTime:

// Use case: I'm from Jakarta, parsing London's news article.
// So, my current time use GMT+7, however the parser's default timezone is GMT.
london, _ := time.LoadLocation("Europe/London")
jakarta, _ := time.LoadLocation("Asia/Jakarta")

cfg := &dps.Configuration{
	DefaultTimezone: london,
	CurrentTime:     time.Date(2015, 6, 1, 12, 0, 0, 0, jakarta),
}

// Here the timezone is explicitly mentioned in string
dt, _ = dps.Parse(cfg, "January 12, 2012 10:00 PM EST")
// time: 2012-01-12 22:00:00 EST, confidence: Day

dt, _ = dps.Parse(cfg, "January 12, 2012 10:00 PM -0500")
// time: 2012-01-12 22:00:00 UTC-05:00, confidence: Day

dt, _ = dps.Parse(cfg, "2 hours ago EST")
// time: 2015-05-31 22:00:00 EST, confidence: Day

dt, _ = dps.Parse(cfg, "2 hours ago -0500")
// time: 2015-05-31 22:00:00 UTC-05:00, confidence: Day

// Now timezone is not specified, so parser will use `DefaultTimezone`
dt, _ = dps.Parse(cfg, "January 12, 2012 10:00 PM")
// time: 2012-01-12 22:00:00 GMT, confidence: Day

dt, _ = dps.Parse(cfg, "2 hours ago")
// time: 2015-06-01 04:00:00 BST, confidence: Day
// notice the timezone become BST, because we move back to Daylight Saving Time.

// Now we remove the default timezone, so it use timezone from `CurrentTime`
cfg.DefaultTimezone = nil

dt, _ = dps.Parse(cfg, "January 12, 2012 10:00 PM")
// time: 2012-01-12 22:00:00 WIB, confidence: Day

dt, _ = dps.Parse(cfg, "2 hours ago")
// time: 2015-06-01 10:00:00 WIB, confidence: Day

10. Incomplete Date Handling

By default dateparser will try to fill incomplete dates with value from current time:

// Current time is 2015-07-31 12:00:00 UTC
cfg := &dps.Configuration{
	CurrentTime: time.Date(2015, 7, 31, 12, 0, 0, 0, time.UTC),
}

dt, _ = dps.Parse(cfg, "December 2015")
// time: 2015-12-31 00:00:00 UTC (day from current time)

dt, _ = dps.Parse(cfg, "February 2020")
// time: 2020-02-29 00:00:00 UTC (day from current time, corrected for leap year)

dt, _ = dps.Parse(cfg, "December")
// time: 2015-12-31 00:00:00 UTC (year and day from current time)

dt, _ = dps.Parse(cfg, "2015")
// time: 2015-07-31 00:00:00 UTC (day and month from current time)

dt, _ = dps.Parse(cfg, "Sunday")
// time: 2015-07-26 00:00:00 UTC (the closest Sunday from current time)

You can change the behavior by using PreferredMonthOfYear, PreferredDayOfMonth and PreferredDateSource in Configuration:

// Current time is 2015-07-10 12:00:00 UTC

cfg.PreferredDayOfMonth = dps.Current
dt, _ = dps.Parse(cfg, "December 2015")
// time: 2015-12-10 00:00:00 UTC

cfg.PreferredDayOfMonth = dps.First
dt, _ = dps.Parse(cfg, "December 2015")
// time: 2015-12-01 00:00:00 UTC

cfg.PreferredDayOfMonth = dps.Last
dt, _ = dps.Parse(cfg, "December 2015")
// time: 2015-12-31 00:00:00 UTC
// Current time is 2015-10-10 12:00:00 UTC

cfg.PreferredDayOfMonth = dps.Current
cfg.PreferredMonthOfYear = dps.CurrentMonth
dt, _ = dps.Parse(cfg, "2020")
// time: 2020-10-10 00:00:00 UTC

cfg.PreferredDayOfMonth = dps.Last
cfg.PreferredMonthOfYear = dps.FirstMonth
dt, _ = dps.Parse(cfg, "2020")
// time: 2020-01-31 00:00:00 UTC

cfg.PreferredDayOfMonth = dps.First
cfg.PreferredMonthOfYear = dps.LastMonth
dt, _ = dps.Parse(cfg, "2015")
// time: 2015-12-01 00:00:00 UTC
// Current time is 2015-10-10 12:00:00 UTC

cfg.PreferredDateSource = dps.CurrentPeriod
dt, _ = dps.Parse(cfg, "March")
// time: 2015-03-10 00:00:00 UTC

cfg.PreferredDateSource = dps.Future
dt, _ = dps.Parse(cfg, "March")
// time: 2016-03-10 00:00:00 UTC

cfg.PreferredDateSource = dps.Past
dt, _ = dps.Parse(cfg, "August")
// time: 2015-08-10 00:00:00 UTC

You can also ignore parsing incomplete dates altogether by setting StrictParsing in config:

cfg = &dps.Configuration{StrictParsing: true}
dt, _ = dps.Parse(cfg, "March")
fmt.Println(dt.IsZero()) // true

11. Custom Language Detector

Dateparser allows to use a language detection behavior by using the DetectLanguagesFunction in the Parser. Using this, you can use any language detectors that you want. However since language detection often fail or inaccurate for short strings, it's highly recommended to use DetectLanguagesFunction while specifying DefaultLanguages in Configuration.

Here is an example on how to use lingua-go as detector:

package main

import (
	"fmt"
	"strings"

	dps "github.com/markusmobius/go-dateparser"
	"github.com/markusmobius/go-dateparser/date"
	"github.com/pemistahl/lingua-go"
)

func main() {
	// Prepare detector
	detector := lingua.
		NewLanguageDetectorBuilder().
		FromAllLanguages().
		WithPreloadedLanguageModels().
		Build()

	// Create custom Parser
	p := dps.Parser{
		DetectLanguagesFunction: func(s string) []string {
			var languages []string

			candidates := detector.ComputeLanguageConfidenceValues(s)
			for _, c := range candidates {
				isoCode := c.Language().IsoCode639_1().String()
				isoCode = strings.ToLower(isoCode)

				if c.Value() >= 0.9 && dps.IsKnownLocale(isoCode) {
					languages = append(languages, isoCode)
				}
			}

			return languages
		},
	}

	// Use the custom Parser to parse
	dt, _ := p.Parse(nil, "Sabtu, 13 Maret 2021")
	printDate(dt) // locale: id, time: 2021-03-13 00:00:00 UTC, confidence: Day

	// It will fail for short strings
	dt, _ = p.Parse(nil, "13 Maret 2021")
	fmt.Println(dt.IsZero()) // true

	// So we need to specify default languages
	cfg := &dps.Configuration{DefaultLanguages: []string{"id"}}
	dt, _ = p.Parse(cfg, "13 Maret 2021")
	printDate(dt) // locale: id, time: 2021-03-13 00:00:00 UTC, confidence: Day
}

func printDate(dt date.Date) {
	fmt.Printf("locale: %s, time: %s, confidence: %s\n",
		dt.Locale, dt.Time.Format("2006-01-02 15:04:05 MST"), dt.Period)
}

12. Handling False Positives

By default dateparser will do its best to return a date, dealing with multiple formats and different locales. For that reason it is important that the input is a valid date, otherwise it could return false positives.

To reduce the possibility of receiving false positives, make sure that:

  • The input string is a valid date and it doesn't contain any other words or numbers.
  • If you know the languages or locales beforehand, you can specify them in the Configuration.

Besides those, you also could exclude any of the parser method (timestamp, relative-time...) or change the order in which they are executed.

13. Performance

This library heavily uses regular expression for various purposes, from detecting language, extracting relevant content, and translating the date value. Unfortunately, as commonly known, Go's regular expression is pretty slow. This is because:

  • The regex engine in other language usually implemented in C, while in Go it's implemented from scratch in Go language. As expected, C implementation is still faster than Go's.
  • Since Go is usually used for web service, its regex is designed to finish in time linear to the length of the input, which useful for protecting server from ReDoS attack. However, this comes with performance cost.

If you want to parse a huge amount of data, it would be preferrable to have a better performance. So, this package provides C++ re2 as an alternative regex engine using binding from go-re2. To activate it, you can build your app using tag re2_wasm or re2_cgo, for example:

go build -tags re2_cgo .

More detailed instructions in how to prepare your system for compiling with cgo are provided below.

When using re2_wasm tag, it will make your app uses re2 that packaged as WebAssembly module so it should be runnable even without cgo. However, if your input is too small, it might be even slower than using Go's standard regex engine.

When using re2_cgo tag, it will make your app uses re2 library that wrapped using cgo. It's a lot faster than Go's standard regex and re2_wasm, however to use it cgo must be available and re2 should be installed in your system.

In my test, re2_cgo will always be faster than the standard library, however re2_wasm only faster than standard regex when processing more than 10,000 short strings (e.g. "26 gennaio 2014"). So, when possible you might be better using re2_cgo. You could run scripts/speedtest to test it yourself.

Do note that this alternative regex engine is experimental, so use on your own risk.

Compiling with cgo under Linux

On Ubuntu install the gcc tool chain and the re2 library as follows:

sudo apt install build-essential
sudo apt-get install -y libre2-dev

Compiling with cgo under Windows

On Windows start by installing MSYS2. Then open the MINGW64 terminal and install the gcc toolchain and re2 via pacman:

pacman -S mingw-w64-x86_64-gcc
pacman -S mingw-w64-x86_64-re2

If you want to run the resulting exe program outside the MINGW64 terminal you need to add a path to the MinGW-w64 libraries to the PATH environmental variable (adjust as needed for your system):

SET PATH=C:\msys64\mingw64\bin;%PATH%

14. License

Just like the original, this package is licensed under BSD-3 License.

About

go parser for human readable dates ported from the dateparser python package

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages