# Automated Generation of Training Data for Microsoft LUIS and Speech Service
This notebook serves to batch-generate training data for [Microsoft LUIS](https://luis.ai) and [Microsoft Speech Service](https://speech.microsoft.com) based on example utterances and possible entity-values.

## Example
### Input sentence: 
- "I would like to book a flight from {city} to {city} and my name is {name}."

### Sample values: 
- city: 'Stuttgart', 'Singapore', 'Frankfurt', 'Kuala Lumpur'
- name: 'Nadella', 'Gates', 'Ballmer'

### Returns:
- Training Data for Speech-To-Text Engine or textual input for Text-to-Speech generation
    - "I would like to book a flight from Frankfurt to Kuala Lumpur and my name is Nadella."
    - "I would like to book a flight from Singapore to Stuttgart and my name is Gates."
    - "I would like to book a flight from Singapore to Frankfurt and my name is Ballmer."
- Training data for Microsoft LUIS (see the concept of [LU-files](https://docs.microsoft.com/en-us/composer/concept-language-understanding))
    - I would like to book a flight from {city=Frankfurt} to {city=Kuala Lumpur} and my name is {name=Nadella}.
    - I would like to book a flight from {city=Singapore} to {city=Stuttgart} and my name is {name=Gates}.
    - I would like to book a flight from {city=Singapore} to {city=Frankfurt} and my name is {name=Ballmer}.
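The substitution illustrated above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation of `LUISGenerator`: the helper name `fill_template` and its `rng` parameter are hypothetical, and the sketch draws every value independently, so the same city may appear twice in one sentence.

```python
import random
import re


def fill_template(template, values, rng=random):
    """Return (plain_text, lu_text) for one template with randomly chosen values.

    Hypothetical helper for illustration; not part of luis_data_generator.
    """
    plain_parts, lu_parts = [], []
    last_end = 0
    # Walk over every {entity} placeholder from left to right.
    for match in re.finditer(r"\{(\w+)\}", template):
        entity = match.group(1)
        value = rng.choice(values[entity])
        literal = template[last_end:match.start()]
        plain_parts += [literal, value]                 # speech version: bare value
        lu_parts += [literal, f"{{{entity}={value}}}"]  # LU version: {entity=value}
        last_end = match.end()
    tail = template[last_end:]
    return "".join(plain_parts) + tail, "".join(lu_parts) + tail


template = "I would like to book a flight from {city} to {city} and my name is {name}."
values = {"city": ["Stuttgart", "Singapore", "Frankfurt", "Kuala Lumpur"],
          "name": ["Nadella", "Gates"]}

plain, lu = fill_template(template, values, random.Random(0))
print(plain)
print(lu)
```

Passing a seeded `random.Random` instance keeps the output reproducible, which is convenient when comparing generated data sets.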

In [37]:
# Import relevant packages
import json
import re
import logging
import pandas as pd
import random
import sys

# Import LUIS generator components
sys.path.append("../src/")
from luis_data_generator import LUISGenerator
from luis_data_generator import transform_lu

# Auto Reload
%load_ext autoreload
%autoreload 2



## Input Data
We prepared some examples for you, but you can also import your own data below. Just make sure you follow the same structure and the entity notation: every entity is written as its name in curly braces, e.g. `{city}`. If a sentence contains multiple entities of the same type, you do not have to enumerate them; the tool takes care of that.
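Before handing your own data to the generator, it can help to check that it follows this structure. The sketch below is an assumption about what a useful pre-check looks like, not a feature of the tool: `check_inputs` is a hypothetical helper that verifies that every `{entity}` placeholder has at least one value and that utterances and intents have the same length.

```python
import re


def check_inputs(utterances, values, intents=None):
    """Return a list of problems found; an empty list means the inputs look consistent.

    Hypothetical helper for illustration; not part of luis_data_generator.
    """
    problems = []
    if intents is not None and len(intents) != len(utterances):
        problems.append(f"{len(utterances)} utterances but {len(intents)} intents")
    for index, utterance in enumerate(utterances):
        # dict.fromkeys removes repeated placeholders while keeping their order.
        for entity in dict.fromkeys(re.findall(r"\{(\w+)\}", utterance)):
            if not values.get(entity):
                problems.append(f"utterance {index}: no values for entity '{entity}'")
    return problems


sample_utterances = ["i want to book a seat on my flight to {city}.",
                     "i want to travel via {station}."]
sample_values = {"city": ["Singapore", "Frankfurt"]}
sample_intents = ["BookSeat", "BookFlight"]

for problem in check_inputs(sample_utterances, sample_values, sample_intents):
    print(problem)
```

In this sample, the second utterance is reported because no values are defined for `station`.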

In [38]:
# Define input values, or import them from a pandas data frame
utterances = ['i would like to book a flight from {city} to {city} via {station}, my name is {name}.', 
              'i am coming from {city} and want to travel via {station} to {city}.',
              'i want to book a seat on my flight to {city}.', 
              'how are you doing?']

values = {'city': ['Singapore', 'Frankfurt', 'Kuala Lumpur', 'Stuttgart'], 
          'station': ['Airport', 'Central Station', 'Bus Stop'], 
          'name': ['Nadella', 'Gates', 'Ballmer']}

intents =  ['BookFlight', 
            'BookFlight', 
            'BookSeat',
            'None']

## Generator Setup
In the next step, we will create an instance of the LUIS generator and assign the respective objects to it.

In [71]:
# Create an instance of the LUISGenerator class with your utterances, values and intents.
# If you have no intents, simply omit that argument; it is optional.
flight_generator = LUISGenerator(utterances, values, intents)

In [72]:
# Define the number of iterations below.
# Keep in mind that this does not necessarily yield 1,000 examples of every utterance, as duplicates will be filtered out.
# The number of distinct results per utterance is limited by the number of possible entity-value combinations for it.
iterations = 1000
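To get a feeling for how many distinct results an utterance can yield at most, you can multiply the number of available values for each placeholder. The sketch below assumes that values are drawn independently per placeholder, so the same city may fill both `{city}` slots; if the generator excludes such repeats, the real maximum is lower. `max_combinations` is a hypothetical helper, not part of the tool.

```python
import math
import re


def max_combinations(utterance, values):
    """Upper bound of distinct fillings, assuming independent draws per placeholder."""
    entities = re.findall(r"\{(\w+)\}", utterance)
    # math.prod of an empty sequence is 1: an utterance without entities has one form.
    return math.prod(len(values[entity]) for entity in entities)


utterances = ['i would like to book a flight from {city} to {city} via {station}, my name is {name}.',
              'i am coming from {city} and want to travel via {station} to {city}.',
              'i want to book a seat on my flight to {city}.',
              'how are you doing?']
values = {'city': ['Singapore', 'Frankfurt', 'Kuala Lumpur', 'Stuttgart'],
          'station': ['Airport', 'Central Station', 'Bus Stop'],
          'name': ['Nadella', 'Gates', 'Ballmer']}

for utterance in utterances:
    print(max_combinations(utterance, values), utterance)
# Upper bounds under this assumption: 144, 48, 4 and 1
```

With 1,000 iterations, the last two utterances will therefore produce almost only duplicates, while the first one is unlikely to exhaust all of its combinations.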

In [73]:
%%time
# Loop through the generator multiple times to get a variation of utterances.
# If you have intents, speech_results and luis_results will each be a list of (intent, text) tuples.
# If you have no intents, speech_results and luis_results will each be a flat list of strings.
speech_results = []
luis_results = []
for _ in range(iterations):
    flight_generator.get_values()
    speech, luis = flight_generator.fill_values()
    speech_results.extend(speech)
    luis_results.extend(luis)
print("Done!")

Done!
Wall time: 38 ms


## Export
Now that we have generated the data, we can export it for use in our tools.

### Speech to Text / Text to Speech
The section below gives you a glance at the results and writes them to a text file.
If you generated these utterances along with intents, you may also use them for LUIS scoring with GLUE, as you have intent-text combinations.
This can help you evaluate how the model performs for different entity values.


In [74]:
# Show the head of the speech-results.
speech_results[:4]

[('BookFlight',
  'i would like to book a flight from Stuttgart to Singapore via Central Station, my name is Gates.'),
 ('BookFlight',
  'i am coming from Stuttgart and want to travel via Bus Stop to Singapore.'),
 ('BookSeat', 'i want to book a seat on my flight to Kuala Lumpur.'),
 ('None', 'how are you doing?')]

In [75]:
# File name of your target text file.
text_filename = "example_text_file"
# If speech_results is a list of (intent, text) tuples, we write two files:
# one with text only, the other comma-separated for potential LUIS scoring as described above.
# Checking the element type is more robust than checking for a length of 2.
if speech_results and isinstance(speech_results[0], (tuple, list)):
    df_text = pd.DataFrame(speech_results, columns=['intent', 'text'])
    df_text.to_csv(f'{text_filename}_intent_text.csv', encoding="utf-8", sep=",", index=False)
    df_text['text'].to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)
# If the results are a flat list of strings, we only write the text file.
else:
    df_text = pd.DataFrame(speech_results, columns=['text'])
    df_text.to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)


### LUIS
The section below shows what the results look like and writes them to an [LU file](https://docs.microsoft.com/en-us/composer/concept-language-understanding). This file can be used as input for [LUIS](https://luis.ai) training to accelerate your model development.
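For orientation, the LU layout produced in this notebook is one `# Intent` heading per intent, followed by one `- utterance` line per example. The sketch below shows one way such text could be assembled from (intent, utterance) tuples. It is an illustration under that assumption and not the implementation of `transform_lu`; the helper name `to_lu_text` is hypothetical.

```python
from collections import defaultdict


def to_lu_text(pairs):
    """Group (intent, utterance) pairs by intent and render LU-style sections.

    Hypothetical helper for illustration; not part of luis_data_generator.
    """
    grouped = defaultdict(list)
    for intent, utterance in pairs:
        if utterance not in grouped[intent]:  # drop duplicates, keep first-seen order
            grouped[intent].append(utterance)
    sections = []
    for intent, utterances in grouped.items():
        lines = [f"# {intent}"] + [f"- {utterance}" for utterance in utterances]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


sample_pairs = [("BookSeat", "i want to book a seat on my flight to {city=Singapore}."),
                ("None", "how are you doing?"),
                ("BookSeat", "i want to book a seat on my flight to {city=Singapore}."),
                ("BookSeat", "i want to book a seat on my flight to {city=Stuttgart}.")]
print(to_lu_text(sample_pairs))
```

The membership test on a list is fine for small samples; for large result sets, a set per intent would be the faster choice.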

In [43]:
# Show the head of the luis results.
luis_results[:5]

[('BookFlight',
  'i would like to book a flight from {city=Frankfurt} to {city=Kuala Lumpur} via {station=Central Station}, my name is {name=Ballmer}.'),
 ('BookFlight',
  'i am coming from {city=Frankfurt} and want to travel via {station=Central Station} to {city=Stuttgart}.'),
 ('BookSeat', 'i want to book a seat on my flight to {city=Singapore}.'),
 ('None', 'how are you doing?'),
 ('BookFlight',
  'i would like to book a flight from {city=Kuala Lumpur} to {city=Stuttgart} via {station=Central Station}, my name is {name=Gates}.')]

In [44]:
# File name of your target LU-file.
luis_file_name = 'example_lu_file' 
# Boolean flag: if True, the result is written to file; if False, it is only shown in the output.
write = True
# Transform to LU-file. Keep in mind that you need a list of tuples with intents, otherwise the function will throw an error.
transform_lu(luis_results, luis_file_name, write=write)


# BookFlight
- i would like to book a flight from {city=Frankfurt} to {city=Kuala Lumpur} via {station=Central Station}, my name is {name=Ballmer}.
- i would like to book a flight from {city=Singapore} to {city=Stuttgart} via {station=Bus Stop}, my name is {name=Ballmer}.
- i am coming from {city=Frankfurt} and want to travel via {station=Airport} to {city=Singapore}.
- i would like to book a flight from {city=Kuala Lumpur} to {city=Frankfurt} via {station=Bus Stop}, my name is {name=Gates}.
- i am coming from {city=Kuala Lumpur} and want to travel via {station=Airport} to {city=Singapore}.
- i would like to book a flight from {city=Stuttgart} to {city=Singapore} via {station=Airport}, my name is {name=Ballmer}.
- i am coming from {city=Singapore} and want to travel via {station=Airport} to {city=Stuttgart}.
- i would like to book a flight from {city=Frankfurt} to {city=Stuttgart} via {station=Bus Stop}, my name is {name=Gates}.
- i am coming from {city=Singapore} and want to travel v