This tutorial shows how to generate a training dataset from scratch.

The example recreates the first dialog of the dstc2 training set; you can create any dataset that fits your purpose in the same way.

To start, you will need:
1. A dstc2-style template file. See the downloaded dstc2-templates.txt for reference. You can also create a new file with your own templates.
2. Slot values in JSON format. See dstc_slot_vals.json for reference.
3. An SQLite database with a table that matches items 1 and 2 above. Take some time to see how the templates and slot values relate to the database. Again, the downloaded db.sqlite is a good starting point.

Once you have these, you are ready to generate your own dataset.
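
One way to see how the slot values and the database relate is to list the slot names from the JSON file next to the column names of each database table. The following is a minimal sketch using only the standard library. The helper `describe_inputs`, the stand-in file contents, and the table name `restaurants` are all hypothetical; point the function at your own dstc_slot_vals.json and db.sqlite to inspect the real files.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def describe_inputs(slot_path, db_path):
    """Return the slot names in the JSON file and the column names of each table."""
    with open(slot_path, encoding="utf-8") as f:
        slot_vals = json.load(f)
    slot_names = sorted(slot_vals)  # top-level keys of the JSON object

    tables = {}
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            tables[name] = [col[1] for col in cols]
    finally:
        conn.close()
    return slot_names, tables


# Stand-in files with hypothetical contents, so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    demo_slots = Path(tmp) / "slots.json"
    demo_db = Path(tmp) / "demo.sqlite"
    demo_slots.write_text(
        json.dumps({"area": {"south": ["south"]}, "pricerange": {"cheap": ["cheap"]}}),
        encoding="utf-8",
    )
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE restaurants (name TEXT, area TEXT, pricerange TEXT)")
    conn.commit()
    conn.close()

    slot_names, tables = describe_inputs(demo_slots, demo_db)

print(slot_names)
print(tables)
```

If a slot name has no matching column in any table, the templates that use that slot probably cannot be filled from a database lookup.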


In [None]:
!pip install deeppavlov


In [None]:
import os

from deeppavlov.contrib import examples
from deeppavlov.contrib.data.tools.train_set_generation import TrainSetGeneration

template_fn = "dstc2-templates.txt"
slot_fn = "dstc_slot_vals.json"
db_fn = "db.sqlite"

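# Directory that holds the bundled example files. list() works for both a
# regular package (__path__ is a list) and a namespace package, without
# relying on the private _path attribute.
examples_dir = list(examples.__path__)[0]

template_path = os.path.join(examples_dir, template_fn)
slot_path = os.path.join(examples_dir, slot_fn)
db_path = os.path.join(examples_dir, db_fn)

# The generated dialogs are written to save_path when you choose "10" in the menu.
trainsetgen = TrainSetGeneration(template_path=template_path,
                                 slot_path=slot_path,
                                 save_path="generated_data.json",
                                 db_path=db_path)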

In [None]:
trainsetgen.start_generation()


**********
[INPUT] choose turn (1 for user, 2 for bot, 3 to start a new dialog or 10 for saving and exit): 2

**********
[INFO] current slot vals are:  {}
0 api_call
1 bye
2 canthear
3 canthelp_area
4 canthelp_area_food
5 canthelp_area_food_pricerange
6 canthelp_area_pricerange
7 canthelp_food
8 canthelp_food_pricerange
9 confirm-domain
10 expl-conf_area
11 expl-conf_food
12 expl-conf_pricerange
13 impl-conf_area+impl-conf_pricerange+request_food
14 impl-conf_food+impl-conf_pricerange+request_area
15 impl-conf_food+request_area
16 inform_addr+inform_food+offer_name
17 inform_addr+inform_phone+inform_pricerange+offer_name
18 inform_addr+inform_phone+offer_name
19 inform_addr+inform_postcode+offer_name
20 inform_addr+inform_pricerange+offer_name
21 inform_addr+offer_name
22 inform_area+inform_food+inform_pricerange+offer_name
23 inform_area+inform_food+offer_name
24 inform_area+inform_phone+offer_name
25 inform_area+inform_postcode+offer_name
26 inform_area+inform_pricerange+offer_name


type template number from the list: 0


INFO in 'deeppavlov.models.go_bot.tracker.dialogue_state_tracker'['dialogue_state_tracker'] at line 102: Made api_call with {'area': 'south', 'pricerange': 'cheap'}, got 2 results.


[INFO] generated response is:  Api_call area="south" food="#food" pricerange="cheap"	api_call area="south" food="#food" pricerange="cheap"
[INFO] the result of the db call is:  {'food': 'chinese', 'pricerange': 'cheap', 'area': 'south', 'postcode': 'c.b 1, 7 d.y', 'phone': '01223 244277', 'addr': 'cambridge leisure park clifton way cherry hinton', 'name': 'the lucky star'}

**********
[INPUT] choose turn (1 for user, 2 for bot, 3 to start a new dialog or 10 for saving and exit): 2

**********
[INFO] current slot vals are:  {'pricerange': 'cheap', 'this': 'dontcare', 'area': 'south'}
0 api_call
1 bye
2 canthear
3 canthelp_area
4 canthelp_area_food
5 canthelp_area_food_pricerange
6 canthelp_area_pricerange
7 canthelp_food
8 canthelp_food_pricerange
9 confirm-domain
10 expl-conf_area
11 expl-conf_food
12 expl-conf_pricerange
13 impl-conf_area+impl-conf_pricerange+request_food
14 impl-conf_food+impl-conf_pricerange+request_area
15 impl-conf_food+request_area
16 inform_addr+inform_food+offe

type template number from the list: 1
[INFO] generated response is:  You are welcome!

**********
[INPUT] choose turn (1 for user, 2 for bot, 3 to start a new dialog or 10 for saving and exit): 10

**********
[INFO] saving the dialogs and exiting...
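
The generated api_call response in the session above has the form `api_call slot="value" ...`. The sketch below parses such a string into a dictionary and lists the slots whose value still starts with `#`. Reading `#food` as a template placeholder that was left unfilled is an assumption based on the log, and `parse_api_call` is a hypothetical helper, not part of the library.

```python
import re


def parse_api_call(response):
    """Extract slot="value" pairs from an api_call string into a dict."""
    return dict(re.findall(r'(\w+)="([^"]*)"', response))


# Response text copied from the session log above.
response = 'api_call area="south" food="#food" pricerange="cheap"'
slots = parse_api_call(response)

# Assumption: a value that starts with "#" is a placeholder that was not filled.
unfilled = sorted(name for name, value in slots.items() if value.startswith("#"))

print(slots)
print(unfilled)
```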


In [None]:
import json

with open(trainsetgen.save_path) as f:
    dialogs = json.load(f)
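
Before printing everything, a quick summary can confirm that the file contains what you expect. The sketch below assumes the structure visible in this tutorial's printed output: a list of dialogs, each a list of turn dictionaries with a `speaker` key (1 for user, 2 for bot) and, for bot turns, an `act` key. `summarize_dialogs` is a hypothetical helper, and `sample` reuses the first three turns of that output.

```python
from collections import Counter


def summarize_dialogs(dialogs):
    """Count user and bot turns and collect the bot acts of each dialog."""
    summary = []
    for dialog in dialogs:
        speakers = Counter(turn["speaker"] for turn in dialog)
        acts = [turn["act"] for turn in dialog if "act" in turn]
        summary.append({
            "user_turns": speakers[1],  # speaker 1 is the user
            "bot_turns": speakers[2],   # speaker 2 is the bot
            "acts": acts,
        })
    return summary


# First three turns of the dialog generated in this tutorial.
sample = [[
    {"act": "welcomemsg", "slots": [], "speaker": 2,
     "text": "Hello, welcome to the Cambridge restaurant system. You can ask "
             "for restaurants by area, price range or food type. How may I help you?"},
    {"slots": [["pricerange", "cheap"]], "speaker": 1, "text": "cheap restaurant"},
    {"act": "request_food", "slots": [], "speaker": 2,
     "text": "What kind of food would you like?"},
]]

summary = summarize_dialogs(sample)
print(summary)
```

Calling `summarize_dialogs(dialogs)` on the loaded data gives one entry per saved dialog.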

In [None]:
from pprint import pprint

pprint(dialogs, indent=4)

[   [   {   'act': 'welcomemsg',
            'slots': [],
            'speaker': 2,
            'text': 'Hello, welcome to the Cambridge restaurant system. You '
                    'can ask for restaurants by area, price range or food '
                    'type. How may I help you?'},
        {   'slots': [['pricerange', 'cheap']],
            'speaker': 1,
            'text': 'cheap restaurant'},
        {   'act': 'request_food',
            'slots': [],
            'speaker': 2,
            'text': 'What kind of food would you like?'},
        {'slots': [['this', 'dontcare']], 'speaker': 1, 'text': 'any'},
        {   'act': 'request_area',
            'slots': [],
            'speaker': 2,
            'text': 'What part of town do you have in mind?'},
        {'slots': [['area', 'south']], 'speaker': 1, 'text': 'south'},
        {   'act': 'api_call',
            'db_result': '{"food": "chinese", "pricerange": "cheap", "area": '
                         '"south", "postcode": "c.b