quick and dirty entity recognition
Cirrus attempts to parse arbitrary natural-language text into a Go value using dictionary matching. At the moment it doesn't really do much, and shouldn't be used by anybody.
Cirrus tries to parse tokens into various values based on pretty simple heuristics like capitalization and the presence of a dollar sign. Some entities, like cardinals ("one", "many", "ten", etc.), have to be matched against every token.
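To make the heuristics concrete, here is a minimal sketch of a first-pass token check. The `Kind` type, its constants, and `guessKind` are hypothetical names made up for illustration; they are not taken from Cirrus itself.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Kind is a hypothetical label for a token; Cirrus's real types may differ.
type Kind string

const (
	KindValue   Kind = "value"   // e.g. "$20"
	KindProper  Kind = "proper"  // capitalized word, a candidate name or place
	KindUnknown Kind = "unknown" // would need a dictionary lookup
)

// guessKind applies two cheap checks: a leading dollar sign and an
// uppercase first letter. Everything else falls through to KindUnknown.
func guessKind(tok string) Kind {
	if strings.HasPrefix(tok, "$") && len(tok) > 1 {
		return KindValue
	}
	if r, _ := utf8.DecodeRuneInString(tok); unicode.IsUpper(r) {
		return KindProper
	}
	return KindUnknown
}

func main() {
	for _, tok := range strings.Fields("Charles paid $20 yesterday") {
		fmt.Printf("%s -> %s\n", tok, guessKind(tok))
	}
}
```

Tokens that come back as `unknown` are the ones that would still need the per-token dictionary match described above.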
- grouping of sequential entities
- e.g. sequential cardinals can be grouped: `two dozen` becomes `Result{Value: 24}`
- text classification
- use wordnet?
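The cardinal grouping item could look roughly like the sketch below. It assumes a `Result` struct with an integer `Value` field (only the `Result{Value: 24}` shape appears in this README) and uses a hypothetical `groupCardinals` helper with small hand-written dictionaries.

```go
package main

import (
	"fmt"
	"strings"
)

// Result mirrors the Result{Value: 24} shape; the int field type is an assumption.
type Result struct {
	Value int
}

// small holds words that add to the running group.
var small = map[string]int{
	"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
	"seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
	"twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
	"sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
	"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
	"seventy": 70, "eighty": 80, "ninety": 90,
}

// multipliers scale the running group in place ("three hundred", "two dozen").
var multipliers = map[string]int{"dozen": 12, "hundred": 100}

// scales close the running group and add it to the total.
var scales = map[string]int{"thousand": 1000, "million": 1000000}

// groupCardinals folds a run of cardinal words into one Result.
// It reports false if any word is not in the dictionaries.
func groupCardinals(text string) (Result, bool) {
	words := strings.Fields(strings.ToLower(text))
	if len(words) == 0 {
		return Result{}, false
	}
	total, group := 0, 0
	for _, w := range words {
		if v, ok := small[w]; ok {
			group += v
		} else if m, ok := multipliers[w]; ok {
			if group == 0 {
				group = 1 // bare "dozen" or "hundred"
			}
			group *= m
		} else if s, ok := scales[w]; ok {
			if group == 0 {
				group = 1 // bare "thousand"
			}
			total += group * s
			group = 0
		} else {
			return Result{}, false
		}
	}
	return Result{Value: total + group}, true
}

func main() {
	r, ok := groupCardinals("two dozen")
	fmt.Printf("%+v %v\n", r, ok)
}
```

With these dictionaries, `two dozen` folds to 24 and `two thousand three hundred seventy five` folds to 2375. The sketch does not validate word order, so a nonsensical run like `five seventy` is still summed.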
`2021-10-22` is a `date`
`$20` is a `value` in `USD`
`20mph` is a `value` in `miles per hour`
`Hong Kong` is a `city`
`Charles Dickens` is a `person`
`Microsoft` or `MSFT` is a `company`
`Australia` is a `country`
1.2
10e-3
1.00001
20E10
15
2/3
one
two thousand three hundred seventy five
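The purely numeric forms above can mostly be handled by the standard library. The sketch below is one possible approach, not Cirrus's actual code: a hypothetical `parseNumber` tries `strconv.ParseFloat` first and falls back to `big.Rat` for fractions such as `2/3`.

```go
package main

import (
	"fmt"
	"math/big"
	"strconv"
)

// parseNumber is a hypothetical helper that returns the token as a float64.
// Decimals, integers, and exponent forms go through strconv.ParseFloat;
// "a/b" fractions go through big.Rat.
func parseNumber(tok string) (float64, bool) {
	if f, err := strconv.ParseFloat(tok, 64); err == nil {
		return f, true
	}
	if r, ok := new(big.Rat).SetString(tok); ok {
		f, _ := r.Float64() // second value reports exactness; ignored here
		return f, true
	}
	return 0, false
}

func main() {
	for _, tok := range []string{"1.2", "10e-3", "1.00001", "20E10", "15", "2/3", "one"} {
		f, ok := parseNumber(tok)
		fmt.Println(tok, f, ok)
	}
}
```

Note that `strconv.ParseFloat` also accepts strings like `Inf` and `NaN`, which a real tokenizer would probably want to reject before calling it. Word forms such as `one` are not covered here and need the dictionary route.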
Christmas eve
10/22/15
10.22.15
2005
december 5 2005 // variations thereof
12:10
ten minutes and five seconds
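For the fixed-format dates and times, one simple option is to try a list of `time.Parse` layouts in order. The sketch below assumes that approach; the `layouts` slice, its ordering, and the month/day/year reading of `10/22/15` are all assumptions, and `parseDate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// layouts lists candidate formats in Go's reference-time notation.
// The set and its order are assumptions for this sketch.
var layouts = []string{
	"2006-01-02",     // 2021-10-22
	"01/02/06",       // 10/22/15, read as month/day/year
	"01.02.06",       // 10.22.15
	"January 2 2006", // december 5 2005
	"2006",           // 2005
	"15:04",          // 12:10
}

// parseDate returns the first successful parse, or false if no layout fits.
func parseDate(s string) (time.Time, bool) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

func main() {
	for _, s := range []string{"2021-10-22", "10/22/15", "2005", "12:10", "Christmas eve"} {
		t, ok := parseDate(s)
		fmt.Println(s, t, ok)
	}
}
```

Named dates like `Christmas eve` and spoken durations like `ten minutes and five seconds` do not fit any layout and would need dictionary matching instead.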
- probably best to use an established regexp for this
- best to use an established regexp or std's `url.Parse`
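A minimal sketch of the `url.Parse` route follows. Because `url.Parse` is lenient and accepts most bare words as relative paths, the hypothetical `looksLikeURL` helper also requires a scheme and a host; that extra rule is an assumption, not something Cirrus is known to do.

```go
package main

import (
	"fmt"
	"net/url"
)

// looksLikeURL is a hypothetical helper: a token counts as a URL only if
// it parses without error and has both a scheme and a host.
func looksLikeURL(tok string) bool {
	u, err := url.Parse(tok)
	return err == nil && u.Scheme != "" && u.Host != ""
}

func main() {
	for _, tok := range []string{"https://example.com/a?b=c", "example.com", "john"} {
		fmt.Println(tok, looksLikeURL(tok))
	}
}
```

This rule rejects scheme-less input such as `example.com`, so a regexp may still be the better fit if those should count.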
| name | age |
| john | 2 |
| jane | 12 |
name,age
john,2
jane,12
name age
"john" 2
"jane" 12
name age
john 2
jane 12
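Of the tabular forms above, the comma-separated one maps directly onto the standard library. The sketch below reads it with `encoding/csv` into one map per row; `parseCSV` and the row-map representation are assumptions for illustration only.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseCSV is a hypothetical helper that reads comma-separated text and
// returns one map per data row, keyed by the header names.
func parseCSV(text string) ([]map[string]string, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		// By default the reader rejects rows whose field count differs
		// from the first row, so rec[i] is in range.
		row := make(map[string]string, len(header))
		for i, name := range header {
			row[name] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	rows, err := parseCSV("name,age\njohn,2\njane,12\n")
	fmt.Println(rows, err)
}
```

The whitespace-separated variants could be split with `strings.Fields` per line in the same way, with quoted values needing an extra unquoting step.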