# Automated Generation of Training Data for Microsoft LUIS and Speech Service
This notebook serves to batch-generate training data for [Microsoft LUIS](https://luis.ai) and [Microsoft Speech Service](https://speech.microsoft.com) based on example utterances and possible entity-values.

## Example
### Input sentence: 
- "I would like to book a flight from {city} to {city} and my name is {name}."

### Sample values: 
- city: 'Stuttgart', 'Singapore', 'Frankfurt', 'Kuala Lumpur'
- name: 'Nadella', 'Gates', 'Ballmer'

### Returns:
- Training Data for Speech-To-Text Engine or textual input for Text-to-Speech generation
    - "I would like to book a flight from Frankfurt to Kuala Lumpur and my name is Nadella."
    - "I would like to book a flight from Singapore to Stuttgart and my name is Gates."
    - "I would like to book a flight from Singapore to Frankfurt and my name is Ballmer."
- Training data for Microsoft LUIS (see the concept of [LU-files](https://docs.microsoft.com/en-us/composer/concept-language-understanding))
    - I would like to book a flight from {city=Frankfurt} to {city=Kuala Lumpur} and my name is {name=Nadella}.
    - I would like to book a flight from {city=Singapore} to {city=Stuttgart} and my name is {name=Gates}.
    - I would like to book a flight from {city=Singapore} to {city=Frankfurt} and my name is {name=Ballmer}.
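
The mapping from an input sentence to the two output variants can be sketched as follows. This is a minimal illustration of the idea, not the implementation inside `LUISGenerator`; the helper name `fill_template` is hypothetical, and it assumes every placeholder is filled independently, so the same city may be drawn twice.

```python
import random
import re

def fill_template(template, values, rng=random):
    """Fill every {entity} placeholder once; return (plain_text, lu_text).

    Hypothetical helper for illustration only, not part of luis_data_generator.
    """
    plain_parts, lu_parts = [], []
    last = 0
    for match in re.finditer(r"\{(\w+)\}", template):
        name = match.group(1)
        value = rng.choice(values[name])           # independent random draw
        literal = template[last:match.start()]     # text before the placeholder
        plain_parts.append(literal + value)
        lu_parts.append(literal + "{" + name + "=" + value + "}")
        last = match.end()
    tail = template[last:]                         # text after the last placeholder
    return "".join(plain_parts) + tail, "".join(lu_parts) + tail

template = "I would like to book a flight from {city} to {city} and my name is {name}."
values = {"city": ["Stuttgart", "Singapore", "Frankfurt", "Kuala Lumpur"],
          "name": ["Nadella", "Gates", "Ballmer"]}

plain, lu = fill_template(template, values, random.Random(0))
print(plain)  # plain sentence for speech training
print(lu)     # same sentence with {entity=value} labels for LUIS
```

Both strings come from the same draw, so the labeled LUIS sentence always matches the plain speech sentence word for word.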

In [152]:
# Import relevant packages
import json
import re
import logging
import pandas as pd
import random
import sys

# Import LUIS generator components
sys.path.append("../src/")
from luis_data_generator import LUISGenerator
from luis_data_generator import transform_lu

# Auto Reload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Input Data
We prepared some examples for you, but you can also import your own data below. Make sure you follow the file structure of the example files and write every entity as a placeholder in curly braces, such as `{city}`. If a sentence contains multiple entities of the same type, you do not have to enumerate them; the tool takes care of it.
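
As a rough guide to the expected file structure, the sketch below builds two tiny semicolon-separated inputs in memory and parses them. The column names (`intent`, `text`, `entity_name`, `entity_value`) and the `;` separator are taken from the import cell of this notebook; the rows themselves and the intent names `BookFlight` and `Greeting` are made up for illustration. It uses only the standard library so the layout is visible without the example files.

```python
import csv
import io

# Hypothetical file contents; only the headers and the separator mirror the notebook.
baseutterances_csv = (
    "intent;text\n"
    "BookFlight;I would like to book a flight from {city} to {city}\n"
    "Greeting;how are you doing?\n"
)
entities_csv = (
    "entity_name;entity_value\n"
    "city;Singapore\n"
    "city;Frankfurt\n"
    "name;Nadella\n"
)

# One row per base utterance, with its intent.
utterance_rows = list(csv.DictReader(io.StringIO(baseutterances_csv), delimiter=";"))
demo_utterances = [row["text"] for row in utterance_rows]
demo_intents = [row["intent"] for row in utterance_rows]

# One row per entity value; rows sharing an entity_name are grouped into one list.
demo_values = {}
for row in csv.DictReader(io.StringIO(entities_csv), delimiter=";"):
    demo_values.setdefault(row["entity_name"], []).append(row["entity_value"])

print(demo_utterances)
print(demo_intents)
print(demo_values)
```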

In [153]:
# Get Base Utterances and Intents from csv
df_baseutterances = pd.read_csv("../assets/examples/input_files/example_baseutterances.csv", sep=";", encoding='utf-8')[['intent','text']]
# df_baseutterances = pd.read_csv("../assets/customer_data/example_baseutterances.csv", sep=";", encoding='utf-8')[['intent','text']]
utterances = df_baseutterances['text'].tolist()
intents = df_baseutterances['intent'].tolist()

# Review imported data for debugging
# print(utterances)
# print()
# print(intents)
# print()
# print(len(utterances))
# print(len(intents))

# Extract Entitydata from csv
entitydata = pd.read_csv("../assets/examples/input_files/example_entities.csv", sep=";", encoding='utf-8')[['entity_name','entity_value']]
# entitydata = pd.read_csv("../assets/customer_data/example_entities.csv", sep=";", encoding='utf-8')[['entity_name','entity_value']]
# Convert Entitynames to list and deduplicate
entityname_list = list(dict.fromkeys(entitydata['entity_name'].tolist()))

# Create values object based on entitynames/entityvalues read from csv
values = {}
for each_entity in entityname_list:
    # Review imported data for debugging
    # print(each_entity)
    # print()
    # Collect all values that belong to this entity (no need to create global variables)
    values[each_entity] = entitydata.loc[entitydata["entity_name"] == each_entity, "entity_value"].tolist()
    # Review imported data for debugging
    # print(values[each_entity])
    # print()
# Review imported data for debugging
# print(values)


#########################################################################################################################################
## If you do not want to import data from csv, see below for an example of hard-coded data. In this case, comment out all of the above. #
#########################################################################################################################################

# Define input values, or import them from a pandas data frame
# utterances = ['I would like to book a flight from {city} to {city} via {station} and my name is {name}.',
#               'how are you doing?']

# values = {'city': ['Singapore', 'Frankfurt', 'Kuala Lumpur', 'Stuttgart'],
#           'station': ['Airport', 'Central Station', 'Bus Stop'],
#           'name': ['Nadella', 'Gates', 'Ballmer']}

# intents =  ['GetEntities',
#             'None']



## Generator Setup
In the next step, we will create an instance of the LUIS generator and assign the respective objects to it.

In [154]:
# Create instance of the LUISGenerator-class along with your utterances, values and intents.
# If you have no intents, just omit the intents argument. It is optional for the class.
flight_generator = LUISGenerator(utterances, values, intents)

In [155]:
# Define the number of iterations below.
# Keep in mind that this does not necessarily yield `iterations` examples of every utterance, as duplicates are filtered out.
# The number of results per utterance is capped by the number of possible entity-value combinations for that utterance.
iterations = 200
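
To estimate how many distinct results a single base utterance can produce at most, you can multiply the number of available values for each placeholder. The sketch below does that; `max_combinations` is a hypothetical helper, and it assumes every placeholder is filled independently. If the generator avoids repeating a value within one sentence, the real limit is lower, so treat the result as an upper bound.

```python
import re
from math import prod

def max_combinations(utterance, values):
    """Upper bound on distinct fills of one utterance (hypothetical helper)."""
    placeholders = re.findall(r"\{(\w+)\}", utterance)
    # One factor per placeholder; an utterance without placeholders yields 1.
    return prod(len(values[name]) for name in placeholders)

demo_values = {"city": ["Singapore", "Frankfurt", "Kuala Lumpur", "Stuttgart"],
               "name": ["Nadella", "Gates", "Ballmer"]}

with_entities = max_combinations(
    "I would like to book a flight from {city} to {city} and my name is {name}.",
    demo_values,
)
without_entities = max_combinations("how are you doing?", demo_values)
print(with_entities, without_entities)
```

Running more iterations than this bound cannot add new examples for that utterance, because any further draws are duplicates.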

In [156]:
%%time
# Loop through the generator multiple times to get a variation of utterances.
# If you have intents, speech_results and luis_results will each be a list of (intent, text) tuples.
# If you have no intents, speech_results and luis_results will be one-dimensional lists.
speech_results = []
luis_results = []
for _ in range(iterations):
    flight_generator.get_values()
    speech, luis = flight_generator.fill_values()
    speech_results.extend(speech)
    luis_results.extend(luis)
print("Done!")

Done!
CPU times: total: 62.5 ms
Wall time: 35 ms


## Export
Now that the data has been generated, we can export it for use in our tools.

### Speech to Text / Text to Speech
The section below gives you a glance at the results and writes them to a text file.
If you generated these utterances along with intents, you may also use them for LUIS scoring with GLUE, as you have intent-text combinations.
This can help you evaluate the performance of the model given different entity values.


In [157]:
# Show the head of the speech-results.
speech_results[:4]

[('VINResolver',
  'it is two victor eight kilo oh kilo golf hotel mike three six six yankee four golf yankee juliett'),
 ('VINResolver',
  "it's two victor eight kilo oh kilo golf hotel mike three six six yankee four golf yankee juliett"),
 ('VINResolver', 'a L D F D F V K 3 M 9 M J N T M K 6'),
 ('VINResolver', 'my vin is 2 V 8 6 W L 7 6 P P A D 2 2 1 M A')]

In [158]:
# File name of your target text file.
text_filename = "../assets/examples/output_files/example_text_file"
# text_filename = "../assets/customer_data/customer_text_file"

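# If speech_results is a list of (intent, text) tuples, we write two files:
# One file is only text, the other is comma-separated for potential LUIS scoring as described above.
# Checking the type avoids mistaking a plain two-character string for an (intent, text) pair.
if isinstance(speech_results[0], (tuple, list)) and len(speech_results[0]) == 2: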
    df_text = pd.DataFrame(speech_results, columns=['intent', 'text'])
    df_text.to_csv(f'{text_filename}_intent_text.csv', encoding="utf-8", sep=",", index=False)
    df_text['text'].to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)
# If the results are a plain list of strings, we only write the text file
else:
    df_text = pd.DataFrame(speech_results, columns=['text'])
    df_text.to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)


### LUIS
The section below shows what the results look like and writes them to an [LU file](https://docs.microsoft.com/en-us/composer/concept-language-understanding). This file can be used as an input file for [LUIS](https://luis.ai) training and to accelerate your model development.
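
For orientation, the layout of the resulting file can be approximated as one `# Intent` line per intent followed by one `- utterance` line per labeled example, which matches the output printed at the end of this notebook. The sketch below reproduces only that layout; `to_lu_text` is a hypothetical helper and makes no claim about how `transform_lu` works internally.

```python
from collections import defaultdict

def to_lu_text(intent_utterance_pairs):
    """Group (intent, utterance) tuples into LU-style text (illustrative only)."""
    grouped = defaultdict(list)
    for intent, utterance in intent_utterance_pairs:
        if utterance not in grouped[intent]:   # drop exact duplicates, keep order
            grouped[intent].append(utterance)
    sections = []
    for intent, utterances in grouped.items():
        lines = [f"# {intent}"] + [f"- {u}" for u in utterances]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"

demo_pairs = [
    ("GetEntities", "it is {first_name=Tom}."),
    ("GetEntities", "it is {first_name=Tom}."),
    ("GetEntities", "my lastname is {last_name=Ebner}"),
    ("None", "how are you doing?"),
]
lu_text = to_lu_text(demo_pairs)
print(lu_text)
```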

In [159]:
# Show the head of the luis results.
luis_results[:5]

[('VINResolver',
  'it is {vin=two victor eight kilo oh kilo golf hotel mike three six six yankee four golf yankee juliett}'),
 ('VINResolver',
  "it's {vin=two victor eight kilo oh kilo golf hotel mike three six six yankee four golf yankee juliett}"),
 ('VINResolver', 'a {vin=L D F D F V K 3 M 9 M J N T M K 6}'),
 ('VINResolver', 'my vin is {vin=2 V 8 6 W L 7 6 P P A D 2 2 1 M A}'),
 ('VINResolver',
  'the number is {vin=two victor eight kilo oh kilo golf hotel mike three six six yankee four golf yankee juliett}')]

In [160]:
# File name of your target LU-file.
luis_file_name = '../assets/examples/output_files/example_lu_file' 
# luis_file_name = '../assets/customer_data/customer_lu_file'

# Boolean that controls writing to file. If False, the result is only shown in the output.
write = True
# Transform to LU file. Keep in mind that you need a list of tuples with intents, otherwise the function will throw an error.
transform_lu(luis_results, luis_file_name, write=write)




# GetEntities
- my lastname is {last_name=Rasztawicki}
- my name is {first_name=Maria} {last_name=Coskun}.
- i am {first_name=Mary} {last_name=Ebner}.
- my firstname is {first_name=Lisa}.
- my lastname is {last_name=Mahoney}.
- It's {last_name=Mahoney}.
- it is {first_name=Tom}.
- it is {first_name=Paul}.
- It's {last_name=Maier}.
- {first_name=Theo} {last_name=Reichenbach}.
- i am {first_name=Lisa} {last_name=Delaunay}.
- my name is {first_name=Linda} {last_name=Mahoney}.
- {first_name=Maria} {last_name=Mahoney}.
- it is {first_name=Mary}
- {first_name=Paul} {last_name=Ebner}
- my name is {first_name=Maria} {last_name=Ebner}
- i am {first_name=Paul} {last_name=Rasztawicki}
- my firstname is {first_name=Lisa}
- my firstname is {first_name=Linda}.
- it is {first_name=Tom}
- It's {last_name=Rasztawicki}
- my lastname is {last_name=Ebner}
- my lastname is {last_name=Coskun}.
- it is {first_name=Lisa}.
- my firstname is {first_name=Margret}.
- i am {first_name=Margret} {last_name=Reichenb