Overview

The purpose of this module is to efficiently find dates inside text, in almost any format. For example:

text = """
                         24/5/2017
   Dear Morty,
   I will be visiting New York between December 22nd, 2017 and January 1, 2018.
   Yours,
   Ben
"""

from date_detector import Parser
parser = Parser()
for match in parser.parse(text):
     print(match)

>>> Match(date=datetime.date(2017, 5, 24), offset=26, text='24/5/2017')
>>> Match(date=datetime.date(2017, 12, 22), offset=90, text='December 22nd, 2017')
>>> Match(date=datetime.date(2018, 1, 1), offset=114, text='January 1, 2018')

How does it work?

The text is broken up into tokens, which are sequences of characters from a single type: digits, letters, whitespace or other. The algorithm then tries to find sequences of tokens which might be a part of a date, for example 2017, 09 or December. Any sequence that can be interpreted as a valid date is returned. Some sequences can be interpreted as as several different dates, in which case they are all returned (for example: 01/02/03).

Similar projects

datefinder (https://github.com/akoumjian/datefinder)
dateparser (https://github.com/scrapinghub/dateparser)
date-extractor (https://github.com/DanielJDufour/date-extractor)
parsedatetime (https://github.com/bear/parsedatetime)
python-natty (https://github.com/eadmundo/python-natty)

Usage

To look for dates in a text, first construct a Parser:

from date_detector import Parser
parser = Parser()

Then use the parse method to get a generator returning Match objects. Each match has three fields: date, offset, and text.

for match in parser.parse(text):
    # Do something with match.date

Parser options

When constructing a Parser instance, you can pass several options:

dictionaries: a list of language codes of dictionaries to use (default: ["en"]). See below for more information about dictionaries.
month_before_day: whether to prefer M/D/Y dates (American) over D/M/Y (default: False).
min_date: the minimal date to consider (default: 1950-01-01).
max_date: the maximal date to consider (default: 2049-12-31).
tokenizer_class: the class to use for tokenizing text (default: Tokenizer)

Language Dictionaries

Currently the following languages are supported:

English (en)
Hebrew (he)

To support additional languages, dictionary files need to be added. They should be located under the date_detector/dictionaries directory. Take a look at the existing dictionaries to see how they are formatted.

Note: dictionaries are case-insensitive.

Contributing

After checking out the code, build the project by running the following commands:

easy_install -U infi.projector
projector devenv build --use-isolated-python

Running tests

To run the tests:

cd src
../bin/nosetests

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
src		src
.gitignore		.gitignore
MANIFEST.in		MANIFEST.in
README.md		README.md
buildout.cfg		buildout.cfg
setup.in		setup.in

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

.gitignore

.gitignore

MANIFEST.in

MANIFEST.in

README.md

README.md

buildout.cfg

buildout.cfg

setup.in

setup.in

Repository files navigation

Overview

How does it work?

Similar projects

Usage

Parser options

Language Dictionaries

Contributing

Running tests

About

Releases

Packages

Languages

ishirav/date-detector

Folders and files

Latest commit

History

Repository files navigation

Overview

How does it work?

Similar projects

Usage

Parser options

Language Dictionaries

Contributing

Running tests

About

Topics

Resources

Stars

Watchers

Forks

Languages