# Automated Generation of Training Data for Microsoft LUIS and Speech Service
This notebook serves to batch-generate training data for [Microsoft LUIS](https://luis.ai) and [Microsoft Speech Service](https://speech.microsoft.com) based on example utterances and possible entity-values.

## Example
### Input sentence: 
- "I would like to book a flight from {city} to {city} and my name is {name}."

### Sample values: 
- city: 'Stuttgart', 'Singapore', 'Frankfurt', 'Kuala Lumpur'
- name: 'Nadella', 'Gates', 'Ballmer'

### Returns:
- Training Data for Speech-To-Text Engine or textual input for Text-to-Speech generation
    - "I would like to book a flight from Frankfurt to Kuala Lumpur and my name is Nadella."
    - "I would like to book a flight from Singapore to Stuttgart and my name is Gates."
    - "I would like to book a flight from Singapore to Frankfurt and my name is Ballmer."
- Training data for Microsoft LUIS (see the concept of [LU-files](https://docs.microsoft.com/en-us/composer/concept-language-understanding))
    - I would like to book a flight from {city=Frankfurt} to {city=Kuala Lumpur} and my name is {name=Nadella}.
    - I would like to book a flight from {city=Singapore} to {city=Stuttgart} and my name is {name=Gates}.
    - I would like to book a flight from {city=Singapore} to {city=Frankfurt} and my name is {name=Ballmer}.
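
The substitution illustrated above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the implementation of `LUISGenerator`; the function name `fill_template` and the choice to draw every placeholder independently at random are assumptions made for this example.

```python
import random
import re

# Matches placeholders such as {city} or {name}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def fill_template(template, values, rng=random):
    """Hypothetical helper: return (speech_text, luis_text) for one random draw."""
    speech_parts, luis_parts = [], []
    last_end = 0
    for match in PLACEHOLDER.finditer(template):
        entity = match.group(1)
        value = rng.choice(values[entity])
        literal = template[last_end:match.start()]
        # Speech variant: plain value. LUIS variant: labelled value.
        speech_parts += [literal, value]
        luis_parts += [literal, "{" + entity + "=" + value + "}"]
        last_end = match.end()
    tail = template[last_end:]
    return "".join(speech_parts) + tail, "".join(luis_parts) + tail

example_values = {"city": ["Stuttgart", "Singapore", "Frankfurt", "Kuala Lumpur"],
                  "name": ["Nadella", "Gates"]}
speech, luis = fill_template(
    "I would like to book a flight from {city} to {city} and my name is {name}.",
    example_values,
)
print(speech)
print(luis)
```

Both variants are built from the same draw, so the labelled LUIS sentence always corresponds to the plain speech sentence.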

In [1]:
# Import relevant packages
import json
import re
import logging
import pandas as pd
import random
import sys

# Import LUIS generator components
sys.path.append("../src/")
from luis_data_generator import LUISGenerator
from luis_data_generator import transform_lu

# Auto Reload
%load_ext autoreload
%autoreload 2

In [2]:
# Custom Input Data
# Get Base Utterances and Intents from csv
customerutterances = pd.read_csv("../assets/customer_data/baseutterances.csv", sep=";", encoding='utf-8')[['intent','text']]
customerutterances_list = customerutterances['text'].tolist()
print(customerutterances_list)
print()
customerintents_list = customerutterances['intent'].tolist()
print(customerintents_list)
print()

# Count entries in customerutterances_list
print(len(customerutterances_list))
print(len(customerintents_list))

# Extract Firstnames from csv
first_names = pd.read_csv("../assets/customer_data/title+first_name.csv", sep=";", encoding='utf-8')[['title', 'first_name']]
# Convert first_names to list
first_names_list = first_names['first_name'].tolist()

# Extract Lastnames from csv
last_names = pd.read_csv("../assets/customer_data/last_name.csv", sep=";", encoding='utf-8')[['last_name']]
# Convert last_names to list
last_names_list = last_names['last_name'].tolist()

# Extract Companynames from csv
company_names = pd.read_csv("../assets/customer_data/companynames.csv", sep=";", encoding='utf-8')[['Type','Companyname']]
# Convert company_names to list
company_names_list = company_names['Companyname'].tolist()

# Extract vins from csv
vins = pd.read_csv("../assets/customer_data/vin_de.csv", sep=";", encoding='utf-8')[['vin']]
vin_list = vins['vin'].tolist()

customervalues = {'first_name': first_names_list, 'last_name': last_names_list, 'company_name': company_names_list, 'vin': vin_list}


['äh das is {vin}.', 'ah {vin}', 'äh {vin}', 'ähm {vin}', 'ähm natürlich {vin}', 'ähm oh das ist {vin} und ja', 'ähm wie war das nochmal mein fahrzeugidentnummer {vin}', 'also {vin}', 'also ich fange jetzt an {vin}', '{vin} ist mein fahrzeugidentnummer', '{vin}', '{vin}.', '{vin} ist mein vin', 'das dreht sich um das fahrzeug {vin}', 'das ist {vin}', 'das ist das vin {vin}', 'das ist die {vin}', 'das ist {vin}.', 'das ist {vin} und mein termin sollte in frankfurt sein', 'das ist {vin} glaub', 'das vin lautet {vin}', 'das vin meines autos is {vin}', 'das vin von meinem auto ist {vin}.', 'das vin von meinem auto ist {vin}', 'das vin von meinem autos {vin}.', 'das wäre {vin}.', 'dat is {vin}', 'die {vin}', 'doch klar {vin}', 'durchsicht bei dem wagen r u also {vin}', 'eder {vin}.', 'es ist die {vin}', 'für {vin}', 'götz ist nochmal nachgucken sie hatte mir den äh das {vin} für friedrich ebert damm.', '{vin} genau.', 'hansestadt {vin}', '{vin} genau', 'hey, mein vin {vin}.', 'hi das vin vo

## Input Data
We prepared some examples for you, but you can also import your own data below. Make sure you follow the file structure and the entity notation: each entity is written as its name in curly braces, for example `{city}`. If a sentence contains multiple entities of the same type, you do not have to enumerate them; the tool takes care of it.
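
Before running the generator on your own data, it can help to check that every placeholder used in the utterances has a matching key in the values dictionary. The following is a minimal sketch of such a check; `missing_entities` is a hypothetical helper and not part of `luis_data_generator`, and the name `Anna` is an illustrative value only.

```python
import re

def missing_entities(utterances, values):
    """Return the sorted placeholder names that have no entry in `values`."""
    found = set()
    for utterance in utterances:
        found.update(re.findall(r"\{(\w+)\}", utterance))
    return sorted(found - set(values))

example_utterances = ["Test {first_name} {last_name} Test", "das vin lautet {vin}"]
example_values = {"first_name": ["Anna"], "vin": ["WDZ2PU365HKJGNJ54"]}
print(missing_entities(example_utterances, example_values))  # ['last_name']
```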

In [3]:
# Define input values, or import them from a pandas data frame
# utterances = ['Test {first_name} {last_name} Test', 
#               'how are you doing?']
utterances = customerutterances_list

#values = {'city': ['Singapore', 'Frankfurt', 'Kuala Lumpur', 'Stuttgart'], 
#          'station': ['Airport', 'Central Station', 'Bus Stop'], 
#          'name': ['Nadella', 'Gates', 'Ballmer']}
values = customervalues

# intents =  ['GetEntities',
#             'None']
intents = customerintents_list


## Generator Setup
In the next step, we will create an instance of the LUIS generator and assign the respective objects to it.

In [4]:
# Create an instance of the LUISGenerator class with your utterances, values and intents.
# If you have no intents, omit that argument. It is optional.
flight_generator = LUISGenerator(utterances, values, intents)

In [5]:
# Define the number of iterations below.
# Keep in mind that this does not necessarily yield `iterations` examples of every utterance, as duplicates are filtered out.
# The number of results per example utterance is capped by the number of possible entity-value combinations for that utterance.
iterations = 200
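
To get a feeling for that cap, the number of possible combinations for one template can be estimated as the product of the value counts of its placeholders. This is a rough sketch with the hypothetical helper `max_combinations`; it assumes each placeholder is filled independently and that the same value may appear more than once in a sentence, which may differ from how `LUISGenerator` draws values.

```python
import math
import re

def max_combinations(utterance, values):
    """Upper bound on the number of distinct filled sentences for one template."""
    entities = re.findall(r"\{(\w+)\}", utterance)
    # One factor per placeholder; a template without placeholders yields 1.
    return math.prod(len(values[entity]) for entity in entities)

example_values = {'city': ['Singapore', 'Frankfurt', 'Kuala Lumpur', 'Stuttgart'],
                  'name': ['Nadella', 'Gates', 'Ballmer']}
template = "I would like to book a flight from {city} to {city} and my name is {name}."
print(max_combinations(template, example_values))  # 4 * 4 * 3 = 48
```

Setting `iterations` far above this bound for every utterance only produces more duplicates to filter out.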

In [6]:
%%time
# Loop through the generator multiple times to get a variation of utterances.
# If you have intents, speech_results and luis_results will each be a list of (intent, text) tuples.
# If you have no intents, speech_results and luis_results will be flat lists of strings.
speech_results = []
luis_results = []
for _ in range(iterations):
    flight_generator.get_values()
    speech, luis = flight_generator.fill_values()
    speech_results.extend(speech)
    luis_results.extend(luis)
print("Done!")

Done!
CPU times: total: 93.8 ms
Wall time: 101 ms


## Export
Now that we have generated the data, we can export it for use in our tools.

### Speech to Text / Text to Speech
The section below gives you a glimpse of the results and writes them to a text file.
If you generated these utterances along with intents, you can also use the output for LUIS scoring with GLUE, since it contains intent-text combinations.
This can help you to evaluate the performance of the model given different entity values.


In [7]:
# Show the head of the speech-results.
speech_results[:4]

[('VINResolver',
  'äh das is viktor florida acht florida viere null papa ulrich tim nikolaus havana stefan baltimore uniform zorro karl delta.'),
 ('VINResolver',
  'ah wilhelm marta amsterdam christoph tango zwo bravo ulrich siegfried xaver paris 9 martin 3 yverdon gustav paris'),
 ('VINResolver',
  'äh luxemburg luxemburg 8 viktor sechse ein 2 bravo xylophon kilogramm 1 viktor 5 jerusalem einse yokohama x-ray'),
 ('VINResolver', 'ähm K N A 5 D J V C R T J P 9 R 1 A 5')]

In [8]:
# File name of your target text file.
# text_filename = "example_text_file"
text_filename = "../assets/customer_data/customer_text_file"

# If speech_results is a list of (intent, text) tuples, we write two files:
# One file contains only text, the other is comma-separated for potential LUIS scoring as described above.
if isinstance(speech_results[0], (tuple, list)):
    df_text = pd.DataFrame(speech_results, columns=['intent', 'text'])
    df_text.to_csv(f'{text_filename}_intent_text.csv', encoding="utf-8", sep=",", index=False)
    df_text['text'].to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)
# If the results are a flat list of strings, we only write the text file.
else:
    df_text = pd.DataFrame(speech_results, columns=['text'])
    df_text.to_csv(f'{text_filename}_text.csv', encoding="utf-8", sep="\t", index=False, header=False)


### LUIS
The section below shows what the results look like and writes them to an [LU file](https://docs.microsoft.com/en-us/composer/concept-language-understanding). This file can be used as an input file for [LUIS](https://luis.ai) training and to accelerate your model development.
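
For orientation, an LU file groups utterances under a `# IntentName` heading, with one `- utterance` line per example. The sketch below renders a list of (intent, utterance) tuples in that layout. It is not the code of `transform_lu`; the helper name `to_lu_text` and the duplicate handling are assumptions for this illustration.

```python
from collections import defaultdict

def to_lu_text(labelled_utterances):
    """Group (intent, utterance) tuples by intent and render them as LU text."""
    by_intent = defaultdict(list)
    for intent, utterance in labelled_utterances:
        if utterance not in by_intent[intent]:  # skip duplicates
            by_intent[intent].append(utterance)
    sections = []
    for intent, utterances in by_intent.items():
        lines = [f"# {intent}"] + [f"- {u}" for u in utterances]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"

sample = [("VINResolver", "also {vin=JKUE3WZYDXTEWVCRY}"),
          ("VINResolver", "das ist {vin=JKUE3WZYDXTEWVCRY}"),
          ("None", "how are you doing?")]
print(to_lu_text(sample))
```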

In [9]:
# Show the head of the luis results.
luis_results[:5]

[('VINResolver',
  'äh das is {vin=viktor florida acht florida viere null papa ulrich tim nikolaus havana stefan baltimore uniform zorro karl delta}.'),
 ('VINResolver',
  'ah {vin=wilhelm marta amsterdam christoph tango zwo bravo ulrich siegfried xaver paris 9 martin 3 yverdon gustav paris}'),
 ('VINResolver',
  'äh {vin=luxemburg luxemburg 8 viktor sechse ein 2 bravo xylophon kilogramm 1 viktor 5 jerusalem einse yokohama x-ray}'),
 ('VINResolver', 'ähm {vin=K N A 5 D J V C R T J P 9 R 1 A 5}'),
 ('VINResolver',
  'ähm natürlich {vin=zweie paris dreie viere dreie madagaskar viktor 0 sechse edison yverdon peter yokohama siebene ypsilon 5 anna}')]

In [10]:
# File name of your target LU-file.
# luis_file_name = 'example_lu_file' 
luis_file_name = '../assets/customer_data/customer_lu_file'

# Boolean to write to file; if False, the result is only shown in the output.
write = True
# Transform to LU-file. Keep in mind that you need a list of tuples with intents, otherwise the function will throw an error.
transform_lu(luis_results, luis_file_name, write=write)




# VINResolver
- äh das is {vin=viktor florida acht florida viere null papa ulrich tim nikolaus havana stefan baltimore uniform zorro karl delta}.
- ich frage jetzt ab vin {vin=JKUE3WZYDXTEWVCRY}
- ich hätte gerne einen termin in stuttgart und mein fahrzeugidentnummer ist {vin=L F V M L M 0 T F R B 7 T P K 4 S}
- ich lebe in hamburg und mein fin ist {vin=WDZ2PU365HKJGNJ54}
- ich stehe irgendwo bei frankfurt und das ist {vin=zwo daimler dreie zürich 2 xanthippe valencia havana bernd 8 südpol 2 heinrich martha gustav willem luxemburg}
- ich wohne in stuttgart und es ist {vin=8ADJLDU78931XJHAF}
- is {vin=3P3871F4W9WV8CNC5}
- ist {vin=fünf florida dänemark viktor historisch sieben juliet zweie vier achte sieben paula neune papa uniform friedrich zwei}
- ja das is gut {vin=8 G G 8 R C 0 U 9 1 T V X A L J 9}
- ja das ist {vin=mercedes mike albert daimler siebene wilhelm 5 foxtrot tristan luxemburg deutsch gee foxtrott deutsch kilogramm christopher zorro}.
- ja das war {vin=eins nordpol xaver

In [11]:
# Correct the LU file output so that entity labels use the {@entity=value} notation.
# Note: 'ansi' is a Windows-only codec alias; adjust the encoding on other platforms.
df_text = pd.read_csv("../assets/customer_data/customer_lu_file.lu", sep=";", encoding='ansi', header=None, names=['text'])
# Collect one single-row frame per output line and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0).
frames = []

# Wrap first and last names in the parent entity personName.
for line in df_text['text']:
    firstnamevalue = ''.join(re.findall(r'{first_name=(.*?)}', line))
    lastnamevalue = ''.join(re.findall(r'{last_name=(.*?)}', line))
    if 'first_name' in line and 'last_name' not in line:
        textvalue = line.replace('{first_name=' + firstnamevalue + '}', '{@personName={@first_name=' + firstnamevalue + '}}')
    elif 'first_name' not in line and 'last_name' in line:
        textvalue = line.replace('{last_name=' + lastnamevalue + '}', '{@personName={@last_name=' + lastnamevalue + '}}')
    else:
        # Lines with both names; lines with neither name pass through unchanged.
        textvalue = line.replace('{first_name=' + firstnamevalue + '} {last_name=' + lastnamevalue + '}', '{@personName={@first_name=' + firstnamevalue + '} {@last_name=' + lastnamevalue + '}}')
    frames.append(pd.DataFrame({'text': [textvalue]}))

# Prefix company_name labels with '@'.
for line in df_text['text']:
    if 'company_name' in line:
        frames.append(pd.DataFrame({'text': [line.replace('{company_name=', '{@company_name=')]}))

# Prefix vin labels with '@'.
for line in df_text['text']:
    if 'vin=' in line:
        frames.append(pd.DataFrame({'text': [line.replace('{vin=', '{@vin=')]}))

df_textresult = pd.concat(frames) if frames else pd.DataFrame(columns=['text'])
df_textresult.to_csv('../assets/customer_data/customer_lu_file_clean.lu', encoding="utf-8", sep=";", index=False, header=False)
df_textresult.head()

            







|   | text |
|---|------|
| 0 | # VINResolver |
| 0 | - äh das is {vin=viktor florida acht florida v... |
| 0 | - ich frage jetzt ab vin {vin=JKUE3WZYDXTEWVCRY} |
| 0 | - ich hätte gerne einen termin in stuttgart un... |
| 0 | - ich lebe in hamburg und mein fin ist {vin=WD... |
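
As a possible variant of the label correction, the same rewrites can be expressed as regular-expression substitutions applied once per line, so that every input line produces exactly one output line. This is a sketch under assumptions: the helper name `relabel` is hypothetical, the name `Anna` is an illustrative value, and the patterns assume that entity values never contain a closing brace.

```python
import re

# A first name directly followed by a last name is wrapped as one personName.
NAME_PAIR = re.compile(r"\{first_name=([^}]*)\} \{last_name=([^}]*)\}")
# A first or last name on its own is wrapped individually.
SINGLE_NAME = re.compile(r"\{(first_name|last_name)=([^}]*)\}")
# Other entities only need the '@' prefix.
OTHER = re.compile(r"\{(company_name|vin)=")

def relabel(line):
    """Rewrite {entity=value} labels in one LU line to the {@entity=value} form."""
    line = NAME_PAIR.sub(r"{@personName={@first_name=\1} {@last_name=\2}}", line)
    # Already rewritten names start with '{@' and are not matched again.
    line = SINGLE_NAME.sub(r"{@personName={@\1=\2}}", line)
    return OTHER.sub(r"{@\1=", line)

print(relabel("- ich frage jetzt ab vin {vin=JKUE3WZYDXTEWVCRY}"))
print(relabel("- mein name ist {first_name=Anna} {last_name=Gates}"))
```

Heading lines such as `# VINResolver` contain no labels and pass through unchanged.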
