**1- Creating a json file from example training sentences, given corresponding intents and entities**

This piece of code takes training data in the format specified below and converts it into json text in the desired training format.

Imagine we have a training dataset consisting of three lines as shown below:

Example Sentence                       |  Intent             | Entity Word      | Entity Class 
 --- | --- | --- | ---|
"hey"                                  |  "greet"            |                  |              
"show me a mexican place in the center"|  "restaurant_search"| "mexican"        | "cuisine"    
"show me a mexican place in the center"|  "restaurant_search"| "center"         | "area"       

Use the merge_json function to convert this into the desired json format for training purposes. The conversion of the table given above can be executed as follows: <br> <br>
merge_json(sentence_list = ["hey","show me a mexican place in the center"],<br>
           &ensp;&ensp;&ensp;&ensp;&ensp; intent_list =["greet","restaurant_search"], <br>
           &ensp;&ensp;&ensp;&ensp;&ensp; entity_words_list = **[**&nbsp;[&nbsp;], ["center", "mexican"]&nbsp;**]**,<br>
           &ensp;&ensp;&ensp;&ensp;&ensp; entities_list = **[**&nbsp;[&nbsp;],["area", "cuisine"]&nbsp;**]**&nbsp;) <br> 
           <br> 
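Don't forget to provide your lists in the correct order: all four lists must have the same length, the i-th entry of each list describes the i-th sentence, and each entity word must sit at the same position as its entity class.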
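
If the training data is kept as one row per (sentence, entity) pair, as in the table above, the four lists can be assembled by grouping the rows by sentence. The following is only a minimal sketch: `rows_to_lists` is a hypothetical helper that is not part of this notebook, and it assumes that an empty entity word marks a sentence without entities and that a repeated sentence always carries the same intent.

```python
from collections import OrderedDict

# Rows of the table above: (sentence, intent, entity word, entity class).
# An empty entity word marks a sentence without entities (assumption).
rows = [
    ("hey", "greet", "", ""),
    ("show me a mexican place in the center", "restaurant_search", "mexican", "cuisine"),
    ("show me a mexican place in the center", "restaurant_search", "center", "area"),
]

def rows_to_lists(rows):
    """Group table rows by sentence into the four parallel lists merge_json expects."""
    grouped = OrderedDict()  # sentence -> (intent, words, classes), in first-seen order
    for sentence, intent, word, entity_class in rows:
        entry = grouped.setdefault(sentence, (intent, [], []))
        if word:
            entry[1].append(word)
            entry[2].append(entity_class)
    sentence_list = list(grouped.keys())
    intent_list = [grouped[s][0] for s in sentence_list]
    entity_words_list = [grouped[s][1] for s in sentence_list]
    entities_list = [grouped[s][2] for s in sentence_list]
    return sentence_list, intent_list, entity_words_list, entities_list

sentence_list, intent_list, entity_words_list, entities_list = rows_to_lists(rows)
```

Words and classes are appended together, so they stay aligned by position whatever order the rows arrive in.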

In [12]:
def json_format(training_sentence, intent, entity_words, entity_classes):
    # builds the json text of a single training example
    json_string = "{\"text\": \"" + training_sentence + "\", " + "\"intent\": \"" + intent + "\", "
    entity_string = "\"entities\": ["
    if len(entity_words) > 0:
        # pair each word with its class by position; entity_words.index(word) returns
        # the first match and would mislabel a word that is listed twice
        for word, entity in zip(entity_words, entity_classes):
            # find() returns the first occurrence of the word in the sentence (-1 if absent)
            starting_index = training_sentence.lower().find(word.lower())
            ending_index = starting_index + len(word.lower())
            entity_string = entity_string + " { \"start\": " + str(starting_index) + ", \"end\": " +str(ending_index)+\
            ", \"value\": \"" + word + "\", \"entity\": \"" + entity + "\" },"
        entity_string = entity_string[:-1] + " ] }"
    else:
        entity_string += "] }"
    return json_string + entity_string
        
def merge_json(sentence_list, intent_list, entity_words_list, entities_list):
    # all list inputs must have the same length. look at the example below
    begin_string = "{ \"rasa_nlu_data\": { \"common_examples\": [ "
    end_string = "] } }"
    number_of_sentences = len(sentence_list)
    for index in range(number_of_sentences):
        begin_string += json_format(sentence_list[index], intent_list[index], \
                                    entity_words_list[index], entities_list[index])
        begin_string += ", "
    begin_string = begin_string[:-2] + end_string
    return begin_string

Example of a desired json file text:

In [13]:
merge_json(sentence_list = ["hey", "give a definition for noun in the English grammar of noun"],\
           intent_list =["greet", "grammar"],\
           entity_words_list = [[], ["definition", "noun"]],\
           entities_list = [[],["function", "subtopic"]])

'{ "rasa_nlu_data": { "common_examples": [ {"text": "hey", "intent": "greet", "entities": [] }, {"text": "give a definition for noun in the English grammar of noun", "intent": "grammar", "entities": [ { "start": 7, "end": 17, "value": "definition", "entity": "function" }, { "start": 22, "end": 26, "value": "noun", "entity": "subtopic" } ] }] } }'
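
Because the json text is assembled by string concatenation, a sentence that contains a double quote or a backslash would produce invalid json. One possible alternative, sketched below, is to build plain dictionaries and let the standard `json` module do the serialization. `build_training_json` is a hypothetical name, not a function defined in this notebook; it keeps the same first-occurrence lookup as `json_format`.

```python
import json

def build_training_json(sentence_list, intent_list, entity_words_list, entities_list):
    """Hypothetical alternative to merge_json: build dicts, then serialize them."""
    examples = []
    for text, intent, words, classes in zip(sentence_list, intent_list,
                                            entity_words_list, entities_list):
        entities = []
        for word, entity_class in zip(words, classes):
            start = text.lower().find(word.lower())  # first occurrence, as in json_format
            entities.append({"start": start, "end": start + len(word),
                             "value": word, "entity": entity_class})
        examples.append({"text": text, "intent": intent, "entities": entities})
    return json.dumps({"rasa_nlu_data": {"common_examples": examples}})

json_text = build_training_json(
    ["hey", 'say "hi" to a noun'],
    ["greet", "grammar"],
    [[], ["noun"]],
    [[], ["topic"]])

# json.loads succeeds even though the second sentence contains double quotes
parsed = json.loads(json_text)
```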

**2- Creating a json File for Training Grammar Related Questions**

In the grammar section, we'd like to have three different entities:<br>
1- 'topic', i.e. the general area of grammar we are talking about. Some examples are nouns, verbs,...<br>
2- 'subtopic', i.e. the part of the mentioned topic we are referring to. For instance, if the topic is noun, we might be talking about 'gendered nouns' or 'singular nouns' in particular.<br>
3- 'function', i.e. the kind of information we provide about a particular (topic, subtopic) pair. There are only two distinct values for this entity: 'description' and 'example'. As the names suggest, 'description' gives general information and rules about the topic, whereas 'example' provides examples.<br>

In [14]:
'''for the grammar database, all the examples will have the same intent (= 'grammar'), 
therefore we don't need to worry about it'''

# first provide all the names(topics) under entity = topic: 
entity1_list = ["noun", "pronoun", "adjective", "adverb", "determiner", "verb", "relative clause"]

# then we create a dictionary which maps each topic name to a list holding the name itself and its plural form
entity1_dict = {}
for item in entity1_list:
    dummy_list = []
    dummy_list.append(item)
    dummy_list.append(item+"s")
    entity1_dict[item] = dummy_list

# we can see the topic dictionary as follows:
entity1_dict

{'adjective': ['adjective', 'adjectives'],
 'adverb': ['adverb', 'adverbs'],
 'determiner': ['determiner', 'determiners'],
 'noun': ['noun', 'nouns'],
 'pronoun': ['pronoun', 'pronouns'],
 'relative clause': ['relative clause', 'relative clauses'],
 'verb': ['verb', 'verbs']}
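
For reference, the same mapping can be written more compactly as a dictionary comprehension. This is only an equivalent sketch of the loop above; it relies on the same simplification that every topic name forms its plural by appending "s", which holds for the seven topics listed here.

```python
entity1_list = ["noun", "pronoun", "adjective", "adverb", "determiner", "verb", "relative clause"]

# each topic maps to [singular form, plural form]
entity1_dict = {topic: [topic, topic + "s"] for topic in entity1_list}
```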

In [15]:
# then in the second part, we need to show all the subtopics that fall under a given topic. we use a dictionary for
# the mapping between topics and subtopics:

entity1map2_dict = {"noun" : ["gendered nouns", "singular nouns", "plural nouns", "countable nouns", \
                              "uncountable nouns", "possessive nouns"],
                "pronoun" : ["definite pronouns", "indefinite pronouns", "compound pronouns", "possessive pronouns"],
              "adjective" : ["placing adjectives", "order of adjectives", "comparative adjectives", \
                        "superlative adjectives", "adjective comparison", "adjective equality", "adjective inequality" ],\
            "adverb" : ["adverbs of place", "adverbs of time", "adverbs of manner", "adverbs of degree", \
                       "adverbs of certainty", "relative adverbs", "interrogative adverbs", "adverbs from adjectives",\
                       "comparative adverbs", "superlative adverbs"],\
                "determiner" : ["definite article", "indefinite article", "demonstrative", "possessive determiner",\
                               "quantifier", "few", "little", "many", "some", "numbers", "distributive", \
                                "difference words", "pre-determiners"],\
                    "verb" : ["present simple", "present continuous", "past simple", "past continuous", \
                              "present perfect", "present perfect continuous", "past perfect", \
                              "past perfect continuous", "future perfect", "future perfect continuous", "future simple",\
                              "future continuous", "zero conditional", "type 1 conditional", "type 2 conditional", \
                             "type 3 conditional", "mixed conditional", "gerund", "present participle", "infinitive",\
                             "passive voice", "contraction", "passive infinitive", "regularity"],\
                    "relative clause" :["sentence formation"]    }

In [16]:
# after obtaining the topic - subtopic mapping, we also need to provide all possible alternative names for each 
# subtopic. for instance 'past simple' can be referred to as 'past tense', 'past simple tense' or 'simple past tense' as well.

entity2_dict = {\
'gendered nouns' : ["gender", "genders", "feminine", "masculine", "neutral"],\
'singular nouns' : ["singular", "singulars", "singular noun", "plurality"],\
'plural nouns' : ["plural", "plurals", "plural noun", "plurality"],\
'countable nouns' : ["countable", "countability", "countables", "countable noun"],\
'uncountable nouns' : ["uncountable", "uncountability", "uncountables", "uncountable noun"],\
'possessive nouns' : ["possessive noun", "ownership", "apostrophe"],\
'definite pronouns' : ["definite pronoun", "definitive pronoun", "definite"],\
'indefinite pronouns' : ["indefinite pronoun", "indefinitive pronoun", "indefinite"],\
'compound pronouns' : ["compound pronoun", "compound"],\
'possessive pronouns' : ["possessive pronoun", "possessives", "possessive", "possessive determiner", \
                         "possessive determiners", "object pronoun", "subject pronoun", "object pronouns",\
                         "subject pronouns"],\
'placing adjectives' : ["adjective placement", "placing"],\
'order of adjectives' : ["order of adjective", "adjective order", "order"],\
'comparative adjectives' : ["comparative", "comparatives", "comparative adjective"],\
'superlative adjectives' : ["superlative", "superlatives", "superlative adjective"],\
'adjective comparison' : ["comparing adjective", "comparison of adjectives", "comparison"],\
'adjective equality' : ["equality", "equal", "equal adjectives"],\
'adjective inequality' : ["inequality", "unequal", "unequal adjectives"],\
'adverbs of place' : ["place adverbs", "place adverb", "place"],\
'adverbs of time' : ["time adverbs", "time adverb", "time"],\
'adverbs of manner' : ["manner adverbs", "manner adverb", "manner"],\
'adverbs of degree' : ["degree adverbs", "degree adverb", "degree"],\
'adverbs of certainty' : ["certainty adverbs", "certainty adverb", "certainty"],\
'relative adverbs' : ["relativity", "relative adverb"],\
'interrogative adverbs' : ["interrogative", "interrogative adverb"],\
'adverbs from adjectives' : ["adverb from adjective"],\
'comparative adverbs' : ["comparative adverb"],\
'superlative adverbs' : ["superlative adverb"],\
'definite article' : ["definite the", "definite articles"],\
'indefinite article' : ["indefinite a", "indefinite an", "indefinite articles"],\
'demonstrative' : ["demonstratives", "demonstrative determiners"],\
'possessive determiner' : ["possessive determiners"],\
'quantifier' : ["quantifiers"],\
'few' : ["a few"],\
'little': ["a little"],\
'many': ["much"], \
'some': ["any"],\
'numbers' : ["cardinal", "ordinal", "determiner numbers"],\
'distributive' : ["distributives", "distributive determiner", "distributive determiners"],\
'difference words' : ["difference word", "determiner other", "determiner another"],\
'pre-determiners' : ["predeterminers", "predeterminer", "pre-determiner"],\
'present simple' : ["present simple tense", "present tense", "simple present tense", "present"],\
'present continuous' : ["present continuous tense", "present progressive", "present progressive tense"],\
'past simple' : ["past simple tense", "past tense", "simple past tense", "past", "preterite"],\
'past continuous' : ["past continuous tense", "past progressive", "past progressive tense"],\
'present perfect' : ["present perfect tense"],\
'present perfect continuous' : ["present perfect continuous tense", "present perfect progressive tense",\
                               "present perfect progressive"],\
'past perfect' : ["past perfect tense"],\
'past perfect continuous' : ["past perfect continuous tense", "past perfect progressive", \
                             "past perfect progressive tense"],\
'future perfect' : ["future perfect tense"],\
'future perfect continuous' : ["future perfect continuous tense", "future perfect progressive tense", \
                              "future perfect progressive"],\
'future simple' : ["future simple tense", "future tense", "simple future tense", "future"],\
'future continuous' : ["future continuous tense", "future progressive", "future progressive tense"],\
'zero conditional' : ["zero conditionals", "0 conditional", "conditional 0", "conditional zero",],\
'type 1 conditional' : ["type1 conditional", "type1 conditionals", "type 1 conditionals", "1 conditional",\
                        "conditional 1"],\
'type 2 conditional' : ["type2 conditional", "type2 conditionals", "type 2 conditionals", "2 conditional",\
                        "conditional 2"],\
'type 3 conditional' : ["type3 conditional", "type3 conditionals", "type 3 conditionals", "3 conditional",\
                        "conditional 3"],\
'mixed conditional' : ["mixed conditionals"],\
'gerund' : ["gerunds", "adding ing", "-ing"],\
'present participle' : ["ing form", "-ing form"],\
'infinitive' : ["infinitives", "infinitive form"],\
'passive infinitive' : ["infinitive with passive", "passive with infinitive", "infinitive passive"],\
'passive voice' : ["passive tense", "passive"],\
'contraction' : ["contractions"],\
'regularity' : ["irregularity", "irregular verb", "regular verb", "irregular verbs", "regular verbs"],\
'sentence formation' : ["sentence with relative clause"]\
    }

# as the last step add the keys themselves to the lists
for item in entity2_dict.keys():
    entity2_dict[item].append(item)

# you can check the final dictionary that provides all subtopics with alternative equivalent names    
entity2_dict

{'adjective comparison': ['comparing adjective',
  'comparison of adjectives',
  'comparison',
  'adjective comparison'],
 'adjective equality': ['equality',
  'equal',
  'equal adjectives',
  'adjective equality'],
 'adjective inequality': ['inequality',
  'unequal',
  'unequal adjectives',
  'adjective inequality'],
 'adverbs from adjectives': ['adverb from adjective',
  'adverbs from adjectives'],
 'adverbs of certainty': ['certainty adverbs',
  'certainty adverb',
  'certainty',
  'adverbs of certainty'],
 'adverbs of degree': ['degree adverbs',
  'degree adverb',
  'degree',
  'adverbs of degree'],
 'adverbs of manner': ['manner adverbs',
  'manner adverb',
  'manner',
  'adverbs of manner'],
 'adverbs of place': ['place adverbs',
  'place adverb',
  'place',
  'adverbs of place'],
 'adverbs of time': ['time adverbs', 'time adverb', 'time', 'adverbs of time'],
 'comparative adjectives': ['comparative',
  'comparatives',
  'comparative adjective',
  'comparative adjectives'],
 'com
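
The three dictionaries have to agree with each other: every topic needs an entry in the topic-to-subtopic mapping, and every subtopic named there needs an entry in the alternatives dictionary, otherwise a lookup raises a `KeyError`. A small consistency check could look like the sketch below. `missing_keys` is a hypothetical helper, and the sample data are deliberately incomplete excerpts standing in for `entity1_list`, `entity1map2_dict` and `entity2_dict`.

```python
def missing_keys(topic_list, topic_to_subtopics, subtopic_alternatives):
    """Return (topics without a subtopic list, subtopics without alternatives)."""
    missing_topics = [t for t in topic_list if t not in topic_to_subtopics]
    missing_subtopics = [s
                         for subtopics in topic_to_subtopics.values()
                         for s in subtopics
                         if s not in subtopic_alternatives]
    return missing_topics, missing_subtopics

# Incomplete excerpts, chosen so that the check has something to report.
sample_topics = ["noun", "relative clause"]
sample_map = {"noun": ["gendered nouns", "plural nouns"]}
sample_alternatives = {"gendered nouns": ["gender", "gendered nouns"]}

missing_topics, missing_subtopics = missing_keys(sample_topics, sample_map, sample_alternatives)
print(missing_topics)     # topics that have no entry in the topic-to-subtopic mapping
print(missing_subtopics)  # subtopics that have no entry in the alternatives dictionary
```

Both lists should be empty for the full dictionaries before any sentences are generated.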

In [17]:
# now it is time to create sample questions or sentences for the training set. 
# we'll generate questions/sentences such as "can you describe {subtopic} in the English grammar of {topic}?"
# consequently a sample question might be "can you describe **past tense** in the English grammar of **verbs**?"

# first we need to create lists of template question/sentence formations. There will be two distinct sets, one 
# for the description function and one for the example function:

description_mold = ["give a definition for ", "can you define ", "what is ", "how do you use ", "can you describe ", \
               "can you give me a description of ", "what are the rules for ", "how do you form "]
example_mold = ["can you give me examples for ", "can you give me an example of ", "can you exemplify ",\
                "some examples of ", "how to form a sentence with "]

# next, we need to specify which part of the template corresponds to an entity value. for example, 
# in the question template "can you give me examples for...", 'examples' is an entity that states: function = example.
# we'll use two different dictionaries for two distinct functions, 'description' and 'example'.

description_mold_entity_map = {"give a definition for " : "definition", "can you define " : "define", \
                               "what is " : "what is", "how do you use " : "use", \
                               "can you describe " : "describe", "can you give me a description of " : "description",\
                               "what are the rules for " : "rules", "how do you form ": "form" }
example_mold_entity_map = {"can you give me examples for " : "examples", "can you give me an example of " : "example",\
                           "can you exemplify " : "exemplify", "some examples of " : "examples", \
                           "how to form a sentence with " : "sentence" }
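
Since `json_format` locates each entity word with `find()`, a template whose mapped word does not literally occur in the template text would end up with a start index of -1. The sketch below checks the 'example' templates for that; `molds_without_entity_word` is a hypothetical helper name, and the same check could be run on the description templates.

```python
example_mold = ["can you give me examples for ", "can you give me an example of ", "can you exemplify ",
                "some examples of ", "how to form a sentence with "]
example_mold_entity_map = {"can you give me examples for ": "examples",
                           "can you give me an example of ": "example",
                           "can you exemplify ": "exemplify",
                           "some examples of ": "examples",
                           "how to form a sentence with ": "sentence"}

def molds_without_entity_word(molds, mold_entity_map):
    """Return the templates that have no mapping or do not contain their mapped word."""
    return [mold for mold in molds
            if mold not in mold_entity_map or mold_entity_map[mold] not in mold]

problems = molds_without_entity_word(example_mold, example_mold_entity_map)

# a deliberately broken template, to show what the check reports
broken = molds_without_entity_word(["show me "], {"show me ": "examples"})
```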

In [30]:
# finally, for each (intent, topic, subtopic, function) combination, we need to generate a final question/sentence. 
# we also need to create lists for each of these sentences, corresponding intents, entity words in these sentences,
# and the entity names. since these will be used to generate a json file, the appropriate format is as described in
# section 1 of this document.

# the following code generates questions/sentences and lists for generating 'example' related data
sentence_list = []
intent_list = []
entity_words_list = []
entities_list = []
for mold in example_mold: # change to description_mold to create lists for description functions
    sentence = mold
    for entity in entity1_list:
        for entity_version in entity1_dict[entity]:
            current_entity_words_list= []
            current_entities_list = []
            current_entity_words_list.append(example_mold_entity_map[mold]) # again, change to description_mold_entity_map if necessary
            current_entities_list.append("function")
            current_entity_words_list.append(entity_version)
            current_entities_list.append("topic")
            for subtopic in entity1map2_dict[entity]:
                for subtopic_version in entity2_dict[subtopic]:
                    current_entities_list.append("subtopic")
                    current_entity_words_list.append(subtopic_version)
                    sentence_list.append(sentence + subtopic_version + " in the English grammar of " + entity_version)
                    intent_list.append("grammar")
                    entity_words_list.append(current_entity_words_list[:])
                    entities_list.append(current_entities_list[:])
                    current_entities_list.pop()
                    current_entity_words_list.pop()

In [32]:
# below you can find the generated json file for 'example' related data
merge_json(sentence_list, intent_list, entity_words_list, entities_list)

'{ "rasa_nlu_data": { "common_examples": [ {"text": "can you give me examples for gender in the English grammar of noun", "intent": "grammar", "entities": [ { "start": 16, "end": 24, "value": "examples", "entity": "function" }, { "start": 62, "end": 66, "value": "noun", "entity": "topic" }, { "start": 29, "end": 35, "value": "gender", "entity": "subtopic" } ] }, {"text": "can you give me examples for genders in the English grammar of noun", "intent": "grammar", "entities": [ { "start": 16, "end": 24, "value": "examples", "entity": "function" }, { "start": 63, "end": 67, "value": "noun", "entity": "topic" }, { "start": 29, "end": 36, "value": "genders", "entity": "subtopic" } ] }, {"text": "can you give me examples for feminine in the English grammar of noun", "intent": "grammar", "entities": [ { "start": 16, "end": 24, "value": "examples", "entity": "function" }, { "start": 64, "end": 68, "value": "noun", "entity": "topic" }, { "start": 29, "end": 37, "value": "feminine", "entity": "su
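
One thing to watch for in the generated data: `json_format` finds each entity word with `find()`, which returns the first occurrence. When a subtopic alternative contains the topic word (for example the subtopic "plural noun" with the topic "noun"), the topic span therefore points inside the subtopic rather than at the end of the sentence. Because the position of every part is known while the sentence is being assembled, the spans can be recorded at that point instead. The following is a sketch of that idea only; `build_example` and its `joiner` parameter are hypothetical names, not part of the notebook.

```python
def build_example(mold, function_word, subtopic, topic,
                  joiner=" in the English grammar of "):
    """Assemble one sentence and record entity spans from the known positions."""
    sentence = mold + subtopic + joiner + topic
    function_start = mold.find(function_word)  # the function word lives inside the template
    subtopic_start = len(mold)                 # the subtopic directly follows the template
    topic_start = len(mold) + len(subtopic) + len(joiner)
    entities = [
        {"start": function_start, "end": function_start + len(function_word),
         "value": function_word, "entity": "function"},
        {"start": topic_start, "end": topic_start + len(topic),
         "value": topic, "entity": "topic"},
        {"start": subtopic_start, "end": subtopic_start + len(subtopic),
         "value": subtopic, "entity": "subtopic"},
    ]
    return {"text": sentence, "intent": "grammar", "entities": entities}

example = build_example("can you give me examples for ", "examples", "plural noun", "noun")
topic_span = example["entities"][1]
print(example["text"].find("noun"))  # first occurrence, inside "plural noun"
print(topic_span["start"])           # position of the actual topic word
```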

In [26]:
# optionally, you can create another json file consisting only of the raw values of all the topics and subtopics.
# this might be necessary to train the model for cases where the user briefly mentions the topic and/or subtopic such
# as "tell me about past tense"

raw_sentence = []
raw_intent = []
raw_words = []
raw_entities = []
for entity in entity1_list:
    for item in entity1_dict[entity]:
        list1 = []
        list2 = []
        raw_sentence.append(item)
        raw_intent.append('grammar')
        list1.append(item)
        list2.append("topic")
        raw_words.append(list1)
        raw_entities.append(list2)
for item in entity2_dict.keys():
    for alt in entity2_dict[item]:
        list1 = []
        list2 = []
        raw_sentence.append(alt)
        raw_intent.append('grammar')
        list1.append(alt)
        list2.append("subtopic")
        raw_words.append(list1)
        raw_entities.append(list2)

# the generated json data is as follows:
merge_json(raw_sentence, raw_intent, raw_words, raw_entities)

'{ "rasa_nlu_data": { "common_examples": [ {"text": "noun", "intent": "grammar", "entities": [ { "start": 0, "end": 4, "value": "noun", "entity": "topic" } ] }, {"text": "nouns", "intent": "grammar", "entities": [ { "start": 0, "end": 5, "value": "nouns", "entity": "topic" } ] }, {"text": "pronoun", "intent": "grammar", "entities": [ { "start": 0, "end": 7, "value": "pronoun", "entity": "topic" } ] }, {"text": "pronouns", "intent": "grammar", "entities": [ { "start": 0, "end": 8, "value": "pronouns", "entity": "topic" } ] }, {"text": "adjective", "intent": "grammar", "entities": [ { "start": 0, "end": 9, "value": "adjective", "entity": "topic" } ] }, {"text": "adjectives", "intent": "grammar", "entities": [ { "start": 0, "end": 10, "value": "adjectives", "entity": "topic" } ] }, {"text": "adverb", "intent": "grammar", "entities": [ { "start": 0, "end": 6, "value": "adverb", "entity": "topic" } ] }, {"text": "adverbs", "intent": "grammar", "entities": [ { "start": 0, "end": 7, "value": 
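
The cells in this notebook return the json text as a string and do not write it to disk. To obtain an actual json file, the string can be validated and saved as sketched below. `save_training_json` and the file name `grammar_raw.json` are hypothetical, the sample string is a shortened hand-written stand-in for the output of `merge_json`, and Python 3 is assumed.

```python
import json
import os
import tempfile

def save_training_json(json_text, path):
    """Check that json_text parses, then write it to path as UTF-8."""
    json.loads(json_text)  # raises ValueError if the text is not valid json
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(json_text)

# shortened stand-in for the string returned by merge_json
sample_text = ('{ "rasa_nlu_data": { "common_examples": [ {"text": "noun", "intent": "grammar", '
               '"entities": [ { "start": 0, "end": 4, "value": "noun", "entity": "topic" } ] }] } }')

output_path = os.path.join(tempfile.mkdtemp(), "grammar_raw.json")
save_training_json(sample_text, output_path)

# read the file back to confirm it holds the same data
with open(output_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
```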

In [27]:
# optionally, another json file can be created for certain alternatives of topics and subtopics, such as 'of a noun'

raw_sentence = []
raw_intent = []
raw_words = []
raw_entities = []
prefix_list = ["of", "in", "with", "for", "of a", "in a", "with a", "for a", "of an", "in an", "with an", "for an"]
for entity in entity1_list:
    for item in entity1_dict[entity]:
        for prefix in prefix_list:
            list1 = []
            list2 = []
            string = prefix + " "+ item
            raw_sentence.append(string)
            raw_intent.append('grammar')
            list1.append(item)
            list2.append("topic")
            raw_words.append(list1)
            raw_entities.append(list2)
for item in entity2_dict.keys():
    for alt in entity2_dict[item]:
        for prefix in prefix_list:
            list1 = []
            list2 = []
            string = prefix + " "+ alt
            raw_sentence.append(string)
            raw_intent.append('grammar')
            list1.append(alt)
            list2.append("subtopic")
            raw_words.append(list1)
            raw_entities.append(list2)


# the generated json data is as follows:
merge_json(raw_sentence, raw_intent, raw_words, raw_entities)

'{ "rasa_nlu_data": { "common_examples": [ {"text": "of noun", "intent": "grammar", "entities": [ { "start": 3, "end": 7, "value": "noun", "entity": "topic" } ] }, {"text": "in noun", "intent": "grammar", "entities": [ { "start": 3, "end": 7, "value": "noun", "entity": "topic" } ] }, {"text": "with noun", "intent": "grammar", "entities": [ { "start": 5, "end": 9, "value": "noun", "entity": "topic" } ] }, {"text": "for noun", "intent": "grammar", "entities": [ { "start": 4, "end": 8, "value": "noun", "entity": "topic" } ] }, {"text": "of a noun", "intent": "grammar", "entities": [ { "start": 5, "end": 9, "value": "noun", "entity": "topic" } ] }, {"text": "in a noun", "intent": "grammar", "entities": [ { "start": 5, "end": 9, "value": "noun", "entity": "topic" } ] }, {"text": "with a noun", "intent": "grammar", "entities": [ { "start": 7, "end": 11, "value": "noun", "entity": "topic" } ] }, {"text": "for a noun", "intent": "grammar", "entities": [ { "start": 6, "end": 10, "value": "no

In [None]:
# also optionally, you can create a dictionary where every alternative of a topic and/or subtopic
# maps to a single entity value that is used in the database (e.g. {'past tense' : 'past simple', 
#                                                                   'past simple tense' : 'past simple',
#                                                                   'simple past tense' : 'past simple',...})
alternatives_dict = {}
# for subtopics
for entity in entity2_dict.keys():
    alternative_list = entity2_dict[entity]
    for alternative in alternative_list:
        alternatives_dict[alternative] = entity
        
# for topics
for entity in entity1_dict.keys():
    alternative_list = entity1_dict[entity]
    for alternative in alternative_list:
        alternatives_dict[alternative] = entity
alternatives_dict
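
Note that an alternative name listed under more than one subtopic can map to only one of them in `alternatives_dict`, because a later assignment overwrites the earlier one. In the data above, "plurality" is listed under both 'singular nouns' and 'plural nouns'. The sketch below reports such shared names so they can be reviewed; `find_shared_alternatives` is a hypothetical helper, and the sample is an excerpt of `entity2_dict` after the keys have been appended.

```python
from collections import defaultdict

def find_shared_alternatives(subtopic_alternatives):
    """Map each alternative that appears under several subtopics to those subtopics."""
    owners = defaultdict(list)
    for subtopic, alternatives in subtopic_alternatives.items():
        for alternative in alternatives:
            owners[alternative].append(subtopic)
    return {alt: sorted(keys) for alt, keys in owners.items() if len(keys) > 1}

# excerpt of entity2_dict (keys already appended to their own lists)
sample = {
    'singular nouns': ["singular", "singulars", "singular noun", "plurality", "singular nouns"],
    'plural nouns': ["plural", "plurals", "plural noun", "plurality", "plural nouns"],
}

shared = find_shared_alternatives(sample)
print(shared)  # {'plurality': ['plural nouns', 'singular nouns']}
```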