Named Entity Recognition for chatbots
Chatbot NER is an open-source framework custom-built to support entity recognition in text messages. After thorough research on existing NER systems, the team at Haptik felt a strong need to build a framework that is tailored for Conversational AI and also supports Indian languages. Chatbot NER currently supports English, Hindi, Gujarati, Marathi, Bengali and Tamil, as well as their code-mixed forms. The framework uses common patterns along with a few NLP techniques to extract the necessary entities from languages with sparse data. The API structure of Chatbot NER is designed with usability for Conversational AI applications in mind. The team at Haptik is continuously working towards porting this framework to all Indian languages and their respective local dialects.
Detailed documentation on how to set up Chatbot NER on your system using Docker is available here.
|Entity type||Code reference||Description||Example||Supported languages (ISO 639-1 code)|
|Time||TimeDetector||Detect time from given text.||tomorrow morning at 5, कल सुबह ५ बजे, kal subah 5 baje||'en', 'hi', 'gu', 'bn', 'mr', 'ta'|
|Date||DateAdvancedDetector||Detect date from given text||next monday, agle somvar, अगले सोमवार||'en', 'hi', 'gu', 'bn', 'mr', 'ta'|
|Number||NumberDetector||Detect number and respective units in given text||50 rs per person, ५ किलो चावल, मुझे एक लीटर ऑइल चाहिए||'en', 'hi', 'gu', 'bn', 'mr', 'ta'|
|Phone number||PhoneDetector||Detect phone number in given text||9833530536, +91 9833530536, ९८३३४३०५३५||'en', 'hi', 'gu', 'bn', 'mr', 'ta'|
|Email||EmailDetector||Detect email in given text||firstname.lastname@example.org||'en'|
|Text||TextDetector||Detect custom entities in text string using full text search in Datastore or based on contextual model||Order me a pizza, मुंबई में मौसम कैसा है||Search supported for 'en', 'hi', 'gu', 'bn', 'mr', 'ta', Contextual model supported for 'en' only|
|PNR||PNRDetector||Detect PNR (serial) codes in given text.||My flight PNR is 4SGX3E||'en'|
|regex||RegexDetector||Detect entities using custom regex patterns||My flight PNR is 4SGX3E||NA|
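To make the regex row concrete, here is a minimal sketch of pattern-based detection using only Python's standard `re` module. It does not use the actual `RegexDetector` class. The function name `detect_with_regex` and the PNR-like pattern (six uppercase letters or digits, with at least one of each) are assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the project's own rule):
# exactly six uppercase letters/digits containing at least one letter
# and at least one digit.
PNR_PATTERN = re.compile(r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*\d)[A-Z0-9]{6}\b")


def detect_with_regex(text, pattern):
    """Return (value, start, end) for every non-overlapping match of pattern."""
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


matches = detect_with_regex("My flight PNR is 4SGX3E", PNR_PATTERN)
print(matches)
```

Returning the character span together with the value lets a caller highlight or strip the matched text from the original message.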
There are other custom detectors, such as city, budget and shopping size, which are derived from the primary detectors listed above. They currently support English only and are limited to Indian users. We are restructuring them to scale across languages and geographies, and their current versions might be deprecated in the future. For applications already in production, we therefore recommend using only the primary detectors listed in the table above.
Detailed documentation of the APIs for all entity types is available here. The current API structure is built for easy access from conversational AI applications, but it can be used for other applications as well.
In any conversational AI application, there are several entities to be identified, and the detection logic for one entity might differ from that for another. We have organised this repository as shown below.
We have classified entities into four main types: numeral, pattern, temporal and textual.
- numeral: Contains all the entities that deal with numerals or numbers, for example number detection, budget detection and size detection.
- pattern: Contains all the detection logic where identification can be done using patterns or regular expressions, for example email, phone_number and pnr.
- temporal: Contains the detection logic for time and date.
- textual: Identifies entities by looking them up in a dictionary. This mainly covers detection of text (such as cuisine, dish or restaurant), names of cities, the location of a user, etc.
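The sketch below shows one way to group detectors under these four categories in plain Python. It is an illustration only, not the repository's actual layout or API: the registry `DETECTORS`, the function names and the tiny cuisine dictionary are all hypothetical.

```python
import re


# Hypothetical detectors, one per category, kept deliberately simple.
def detect_number(text):
    """numeral: find standalone integers."""
    return [int(tok) for tok in re.findall(r"\b\d+\b", text)]


def detect_email(text):
    """pattern: find email-like strings with a regular expression."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


def detect_cuisine(text, dictionary=("pizza", "biryani", "dosa")):
    """textual: look up known terms from a small, made-up dictionary."""
    lowered = text.lower()
    return [term for term in dictionary if term in lowered]


# Registry keyed by the four entity categories named above.
DETECTORS = {
    "numeral": {"number": detect_number},
    "pattern": {"email": detect_email},
    "temporal": {},  # date/time detectors would be registered here
    "textual": {"cuisine": detect_cuisine},
}


def run_detectors(text):
    """Run every registered detector and keep only non-empty results."""
    results = {}
    for category, detectors in DETECTORS.items():
        for name, detector in detectors.items():
            found = detector(text)
            if found:
                results[(category, name)] = found
    return results


print(run_detectors("Order 2 pizza and mail the bill to a.b@example.org"))
```

Keeping each category's logic behind the same call shape means a new detector only needs to be registered under the right category.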
The numeral, temporal and pattern types have been moved to ner_v2 for language portability and more flexible detection logic. In ner_v1, only the text entity currently has language support. We will move it to ner_v2 without any major API changes.
Currently, you can contribute to ner_v2 in Chatbot NER either by adding training data or by contributing detection patterns in the form of regular expressions. We will work on removing a few architectural limitations, which will ease the process of adding ML models and new entities in the future.
- Adding Training Data: You can significantly improve the detection capabilities of Chatbot NER simply by adding data to csv files. For example, date detection in Hindi and Hinglish can be improved by adding data to the csv files mentioned in the image below. Refer to the documentation for date, time and numbers respectively if you wish to contribute.
- Adding Detection Pattern: You can add custom patterns for different languages by writing simple functions. An example of adding a custom pattern for detecting the number of people can be found here.
Please refer to the general steps for contribution, approval and coding guidelines mentioned here.
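To illustrate why adding rows to a csv file can improve detection, here is a minimal sketch of a data-driven lookup for relative dates. The csv layout (`variants`, `offset_days`), the sample rows and the function names are assumptions for this example and do not reflect the actual files in the repository. It also simplifies "kal" to mean tomorrow, although the word can mean yesterday as well.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical csv layout: pipe-separated spelling variants and a day offset.
# Real data files in the repository may use different columns.
SAMPLE_CSV = """variants,offset_days
aaj|आज,0
kal|कल,1
"""


def load_relative_dates(csv_text):
    """Build a {variant: offset_days} lookup from csv text."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for variant in row["variants"].split("|"):
            lookup[variant.strip().lower()] = int(row["offset_days"])
    return lookup


def detect_relative_date(text, lookup, today):
    """Return the date for the first known token in text, or None."""
    for token in text.lower().split():
        if token in lookup:
            return today + timedelta(days=lookup[token])
    return None


lookup = load_relative_dates(SAMPLE_CSV)
print(detect_relative_date("kal subah 5 baje", lookup, date(2024, 1, 1)))
```

With this structure, supporting a new spelling or script only requires a new entry in the data, with no change to the detection code.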