Skip to content

Latest commit

 

History

History
657 lines (591 loc) · 15.5 KB

builtin-entity-extraction.md

File metadata and controls

657 lines (591 loc) · 15.5 KB

Builtin Entity Extraction

The NER Manager includes by default a builtin entity extraction with different bundles available for different languages. The entity extraction is done even if the utterance is not matched to an intent.

Builtin English French Spanish Portuguese Chinese Japanese Other
Email X X X X X X X
Ip X X X X X X X
Hashtag X X X X X X X
Phone Number X X X X X X X
URL X X X X X X X
Number X X X X X X see 1
Ordinal X X X X X X
Percentage X X X X X X see 2
Dimension X X X X X X see 3
Age X X X X X X
Currency X X X X X X
Date X X X X see 4 see 4 see 4
Duration X
  • 1: Only for non text numbers
  • 2: Only for % symbol non text numbers
  • 3: Only for dimension acronyms (km, s, km/h...) non text numbers
  • 4: Only dd/MM/yyyy formats or similars, non text

Email Extraction

It can identify and extract valid emails accounts, this works for any language.

"utterance": "My email is something@somehost.com please write me",
"entities": [
  {
    "start": 12,
    "end": 33,
    "len": 22,
    "accuracy": 0.95,
    "sourceText": "something@somehost.com",
    "utteranceText": "something@somehost.com",
    "entity": "email",
    "resolution": {
      "value": "something@somehost.com"
    }
  }
]

IP Extraction

It can identify and extract valid IPv4 and IPv6 addresses, this works for any language.

"utterance": "My ip is 8.8.8.8",
"entities": [
  {
    "start": 9,
    "end": 15,
    "len": 7,
    "accuracy": 0.95,
    "sourceText": "8.8.8.8",
    "utteranceText": "8.8.8.8",
    "entity": "ip",
    "resolution": {
      "value": "8.8.8.8",
      "type": "ipv4"
    }
  }
]

"utterance": "My ip is ABEF:452::FE10",
"entities": [
  {
    "start": 9,
    "end": 22,
    "len": 14,
    "accuracy": 0.95,
    "sourceText": "ABEF:452::FE10",
    "utteranceText": "ABEF:452::FE10",
    "entity": "ip",
    "resolution": {
      "value": "ABEF:452::FE10",
      "type": "ipv6"
    }
  }
]

Hashtag Extraction

It can identify and extract hashtags from the utterances, this works for any language.

"utterance": "Open source is great! #proudtobeaxa",
"entities": [
  {
    "start": 22,
    "end": 34,
    "len": 13,
    "accuracy": 0.95,
    "sourceText": "#proudtobeaxa",
    "utteranceText": "#proudtobeaxa",
    "entity": "hashtag",
    "resolution": {
      "value": "#proudtobeaxa"
    }
  }
]

Phone Number Extraction

It can identify and extract phone numbers from the utterances, this works for any language.

"utterance": "So here is my number +1 541-754-3010 callme maybe",
"entities": [
  {
    "start": 21,
    "end": 35,
    "len": 15,
    "accuracy": 0.95,
    "sourceText": "+1 541-754-3010",
    "utteranceText": "+1 541-754-3010",
    "entity": "phonenumber",
    "resolution": {
      "value": "+1 541-754-3010"
    }
  }
]

URL Extraction

It can identify and extract phone URLs from the utterances, this works for any language.

"utterance": "The url is https://something.com",
"entities": [
  {
    "start": 11,
    "end": 31,
    "len": 21,
    "accuracy": 0.95,
    "sourceText": "https://something.com",
    "utteranceText": "https://something.com",
    "entity": "url",
    "resolution": {
      "value": "https://something.com"
    }
  }
]

Number Extraction

It can identify and extract numbers. This works for any language, and the numbers can be integer or floats.

"utterance": "This is 12",
"entities": [
  {
    "start": 8,
    "end": 9,
    "len": 2,
    "accuracy": 0.95,
    "sourceText": "12",
    "utteranceText": "12",
    "entity": "number",
    "resolution": {
      "strValue": "12",
      "value": 12,
      "subtype": "integer"
    }
  }
]

The numbers can be also be text written, but this only works for: English, French, Spanish and Portuguese.

"utterance": "This is twelve",
"entities": [
  {
    "start": 8,
    "end": 13,
    "len": 6,
    "accuracy": 0.95,
    "sourceText": "twelve",
    "utteranceText": "twelve",
    "entity": "number",
    "resolution": {
      "strValue": "12",
      "value": 12,
      "subtype": "integer"
    }
  }
]

The text feature also works for fractions.

"utterance": "one over 3",
"entities": [
  {
    "start": 0,
    "end": 9,
    "len": 10,
    "accuracy": 0.95,
    "sourceText": "one over 3",
    "utteranceText": "one over 3",
    "entity": "number",
    "resolution": {
      "strValue": "0.333333333333333",
      "value": 0.333333333333333,
      "subtype": "float"
    }
  }
]

Ordinal Extraction

It can identify and extract numbers. This works only for English, Spanish, French and Portuguese.

"utterance": "He was 2nd",
"entities": [
  {
    "start": 7,
    "end": 9,
    "len": 3,
    "accuracy": 0.95,
    "sourceText": "2nd",
    "utteranceText": "2nd",
    "entity": "ordinal",
    "resolution": {
      "strValue": "2",
      "value": 2,
      "subtype": "integer"
    }
  }
]

The numbers can be written by text.

"utterance": "one hundred twenty fifth",
"entities": [
  {
    "start": 0,
    "end": 23,
    "len": 24,
    "accuracy": 0.95,
    "sourceText": "one hundred twenty fifth",
    "utteranceText": "one hundred twenty fifth",
    "entity": "ordinal",
    "resolution": {
      "strValue": "125",
      "value": 125,
      "subtype": "integer"
    }
  }
]

Percentage Extraction

It can identify and extract percentages. If the percentage is indicated with the symbol % it works for any language.

"utterance": "68.2%",
"entities": [
  {
    "start": 0,
    "end": 4,
    "len": 5,
    "accuracy": 0.95,
    "sourceText": "68.2%",
    "utteranceText": "68.2%",
    "entity": "percentage",
    "resolution": {
      "strValue": "68.2%",
      "value": 68.2,
      "subtype": "float"
    }
  }
]

The percentage can be indicated by text, but it only works for English, French, Spanish and Portuguese.

"utterance": "68.2 percent",
"entities": [
  {
    "start": 0,
    "end": 11,
    "len": 12,
    "accuracy": 0.95,
    "sourceText": "68.2 percent",
    "utteranceText": "68.2 percent",
    "entity": "percentage",
    "resolution": {
      "strValue": "68.2%",
      "value": 68.2,
      "subtype": "float"
    }
  }
]

It can understand text numbers but only works for English, French, Spanish and Portuguese.

"utterance": "thirty five percentage",
"entities": [
  {
    "start": 0,
    "end": 21,
    "len": 22,
    "accuracy": 0.95,
    "sourceText": "thirty five percentage",
    "utteranceText": "thirty five percentage",
    "entity": "percentage",
    "resolution": {
      "strValue": "35%",
      "value": 35,
      "subtype": "integer"
    }
  }
]

Dimension Extraction

It can identify and extract different dimensions, like length, distance, speed, volume, area,... If the international acronym of the dimension is used then it works in any language.

"utterance": "120km",
"entities": [
  {
    "start": 0,
    "end": 4,
    "len": 5,
    "accuracy": 0.95,
    "sourceText": "120km",
    "utteranceText": "120km",
    "entity": "dimension",
    "resolution": {
      "strValue": "120",
      "value": 120,
      "unit": "Kilometer",
      "localeUnit": "Kilometer"
    }
  }
]

In instead of the acronym, the text of the dimension is used in a language, then it works in English, French, Spanish and Portuguese.

"utterance": "Está a 325 kilómetros de Bucarest",
"entities": [
  {
    "start": 7,
    "end": 20,
    "len": 14,
    "accuracy": 0.95,
    "sourceText": "325 kilómetros",
    "utteranceText": "325 kilómetros",
    "entity": "dimension",
    "resolution": {
      "strValue": "325",
      "value": 325,
      "unit": "Kilometer",
      "localeUnit": "Kilómetro"
    }
  }
]

Age Extraction

It can identify and extract ages. It works in English, French, Spanish and Portuguese. Take into account that several ways to say an age can be also confused with a duraction ("It will be 10 years" can be an age or a duration), so two overlaped entities, one age and one duration, can be returned.

"utterance": "This saga is ten years old",
"entities": [
  {
    "start": 13,
    "end": 25,
    "len": 13,
    "accuracy": 0.95,
    "sourceText": "ten years old",
    "utteranceText": "ten years old",
    "entity": "age",
    "resolution": {
      "strValue": "10",
      "value": 10,
      "unit": "Year",
      "localeUnit": "Year"
    }
  },
  {
    "start": 13,
    "end": 21,
    "len": 9,
    "accuracy": 0.95,
    "sourceText": "ten years",
    "utteranceText": "ten years",
    "entity": "duration",
    "resolution": {
      "values": [
        {
          "timex": "P10Y",
          "type": "duration",
          "value": "315360000"
        }
      ]
    }
  }
]

Currency Extraction

It can identify and extract currency values. It works in English, French, Spanish and Portuguese.

"utterance": "420 million finnish markka",
"entities": [
  {
    "start": 0,
    "end": 25,
    "len": 26,
    "accuracy": 0.95,
    "sourceText": "420 million finnish markka",
    "utteranceText": "420 million finnish markka",
    "entity": "currency",
    "resolution": {
      "strValue": "420000000",
      "value": 420000000,
      "unit": "Finnish markka",
      "localeUnit": "Finnish markka"
    }
  }
]

It the used language is not english, the localeUnit contains the locale name of the currency.

"utterance": "420 millones de marcos finlandeses",
"entities": [
  {
    "start": 0,
    "end": 33,
    "len": 34,
    "accuracy": 0.95,
    "sourceText": "420 millones de marcos finlandeses",
    "utteranceText": "420 millones de marcos finlandeses",
    "entity": "currency",
    "resolution": {
      "strValue": "420000000",
      "value": 420000000,
      "unit": "Finnish markka",
      "localeUnit": "Marco finlandés"
    }
  }
]

Date Extraction

It can identify and extract dates, if provided in numeric format can work in any language, but take into account that the localization also affect to the date format.

"utterance": "28/10/2018",
"entities": [
  {
    "start": 0,
    "end": 9,
    "len": 10,
    "accuracy": 0.95,
    "sourceText": "28/10/2018",
    "utteranceText": "28/10/2018",
    "entity": "date",
    "resolution": {
      "type": "date",
      "timex": "2018-10-28",
      "strValue": "2018-10-28",
      "date": "2018-10-28T00:00:00.000Z"
    }
  }
]

It can understand written date formats in English, French, Spanish and Portuguese.

"utterance": "Volveré el 12 de enero del 2019",
"entities": [
  {
    "start": 11,
    "end": 30,
    "len": 20,
    "accuracy": 0.95,
    "sourceText": "12 de enero del 2019",
    "utteranceText": "12 de enero del 2019",
    "entity": "date",
    "resolution": {
      "type": "date",
      "timex": "2019-01-12",
      "strValue": "2019-01-12",
      "date": "2019-01-12T00:00:00.000Z"
    }
  }
]

It can understand partial dates. Then the timex contains the resolution, example, if I provide the day but not the month neither the year, then both year and month will be filled with X. Also, in this case, two possible dates will be returned: the past and the future. Also take into account that in cases like that, the resolution can also include a number, like in this example:

"utterance": "I'll go back on 15",
"entities": [
  {
    "start": 16,
    "end": 17,
    "len": 2,
    "accuracy": 0.95,
    "sourceText": "15",
    "utteranceText": "15",
    "entity": "number",
    "resolution": {
      "strValue": "15",
      "value": 15,
      "subtype": "integer"
    }
  },
  {
    "start": 16,
    "end": 17,
    "len": 2,
    "accuracy": 0.95,
    "sourceText": "15",
    "utteranceText": "15",
    "entity": "date",
    "resolution": {
      "type": "interval",
      "timex": "XXXX-XX-15",
      "strPastValue": "2018-08-15",
      "pastDate": "2018-08-15T00:00:00.000Z",
      "strFutureValue": "2018-09-15",
      "futureDate": "2018-09-15T00:00:00.000Z"
    }
  }
]

When the grain resolution is not a day, it can be resolved not only with a past and future date, but also each date is an interval. Example: if we are resolving a date that is a month, like January, it will return the past and future januaries, but also each january is an interval from the day 1 of January until the day 1 of February, like in this example:

"utterance": "I'll be out in Jan",
"entities": [
  {
    "start": 15,
    "end": 17,
    "len": 3,
    "accuracy": 0.95,
    "sourceText": "Jan",
    "utteranceText": "Jan",
    "entity": "daterange",
    "resolution": {
      "type": "interval",
      "timex": "XXXX-01",
      "strPastStartValue": "2018-01-01",
      "pastStartDate": "2018-01-01T00:00:00.000Z",
      "strPastEndValue": "2018-02-01",
      "pastEndDate": "2018-02-01T00:00:00.000Z",
      "strFutureStartValue": "2019-01-01",
      "futureStartDate": "2019-01-01T00:00:00.000Z",
      "strFutureEndValue": "2019-02-01",
      "futureEndDate": "2019-02-01T00:00:00.000Z"
    }
  }
]

It also identifies expecial dates, like Christmas:

"utterance": "I will return in Christmas",
"entities": [
  {
    "start": 17,
    "end": 25,
    "len": 9,
    "accuracy": 0.95,
    "sourceText": "Christmas",
    "utteranceText": "Christmas",
    "entity": "date",
    "resolution": {
      "type": "interval",
      "timex": "XXXX-12-25",
      "strPastValue": "2017-12-25",
      "pastDate": "2017-12-25T00:00:00.000Z",
      "strFutureValue": "2018-12-25",
      "futureDate": "2018-12-25T00:00:00.000Z"
    }
  }
]

Duration Extraction

It can identify and extract duration intervals. It works currently in English only. The resolution is done in seconds, with a timex indicator. Example: "It will take me 5 minutes" the timex is "PT5M" meaning "Present Time 5 Minutes".

"utterance": "It will take me 5 minutes",
"entities": [
  {
    "start": 13,
    "end": 21,
    "len": 9,
    "accuracy": 0.95,
    "sourceText": "5 minutes",
    "utteranceText": "5 minutes",
    "entity": "duration",
    "resolution": {
      "values": [
        {
          "timex": "PT5M",
          "type": "duration",
          "value": "300"
        }
      ]
    }
  }
]