In [6]:
import json
import os
import pandas as pd
import numpy as np

# Fly Me
A travel provider for individuals and professionals. 

## Overview
The chatbot project - Jupyter Notebook
Fly Me has launched an ambitious project to develop a *chatbot* to *help users choose a travel offer*.

The first phase of this project is to **develop an MVP** that will help Fly Me employees to easily book airline tickets for their holidays.

This first MVP will allow us to test the concept and performance of the chatbot quickly and on a large scale.

As this project is iterative, we have limited the features of the chatbot V1. It must be able to identify the following five elements in the user's request:

* Departure city
* Destination city
* Desired flight departure date
* Desired flight return date
* Maximum budget for the total price of tickets

If one of the elements is missing, the chatbot must be able to ask the user the relevant questions in order to fully understand the request. When the chatbot thinks it has understood all the elements of the user's request, it must be able to reformulate the user's request and ask the user for confirmation.
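
The slot-filling behaviour described above can be sketched as a small state object. This is only an illustration under stated assumptions, not the bot's actual implementation: the `BookingRequest` class, its field names, and the wording of the questions are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical prompts, one per element the bot must collect
QUESTIONS = {
    "departure_city": "Which city are you leaving from?",
    "destination_city": "Where would you like to go?",
    "departure_date": "When do you want to leave?",
    "return_date": "When do you want to come back?",
    "budget": "What is your maximum budget for the tickets?",
}


@dataclass
class BookingRequest:
    departure_city: Optional[str] = None
    destination_city: Optional[str] = None
    departure_date: Optional[str] = None
    return_date: Optional[str] = None
    budget: Optional[str] = None

    def missing(self):
        """Return the names of the elements that are still unknown."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def next_prompt(self):
        """Ask for the first missing element, or reformulate and ask for confirmation."""
        todo = self.missing()
        if todo:
            return QUESTIONS[todo[0]]
        return (
            f"You want to fly from {self.departure_city} to {self.destination_city}, "
            f"leaving on {self.departure_date} and returning on {self.return_date}, "
            f"with a budget of {self.budget}. Is that correct?"
        )


request = BookingRequest(departure_city="New York", destination_city="Paris")
print(request.missing())
print(request.next_prompt())
```

Keeping the five elements in one object makes it easy to decide, after every user message, whether to ask another question or move on to the confirmation step.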

## Tools and Technologies used
To carry out this project, we will need to use the following tools and technologies:

* The Microsoft Bot Framework source code for Python “Microsoft Bot Framework SDK v4 for Python” 
* The Azure LUIS Cognitive Service which allows you to perform a semantic analysis of a message entered by the user and structure it for processing by the bot (it should allow you to identify the five elements requested)
* The Azure Web App service that allows you to run a web application on the Azure Cloud (you won't need to use the Azure Bot service)
* The Bot Framework Emulator which will allow you to test your chatbot locally and in production. It is an interface that allows a user to interact with the chatbot.
## Data
The data used to train the LUIS model comes from a dataset of conversations between users and travel agents. This dataset is in JSON format and contains a total of 1500 conversations. Each conversation consists of a series of messages exchanged between the user and the travel agent. The messages contain information about the user's travel preferences, such as departure city, destination city, travel dates, and budget.
The dataset is divided into two parts: a training set and a test set. The training set contains 1200 conversations and is used to train the LUIS model. The test set contains 300 conversations and is used to evaluate the performance of the LUIS model.
The dataset is available in the `data/frames_dataset` directory of the project. The main file is `frames.json`, which contains all the conversations. The file is structured as follows:
```json
{
  "conversations": [
    {
      "id": "1",
      "turns": [
        {
          "speaker": "user",
          "text": "I want to book a flight from New York to Paris."
        },
        {
          "speaker": "agent",
          "text": "Sure, when do you want to depart?"
        },
        ...
      ]
    },
    ...
  ]
}
```
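The `turns` field contains the list of messages exchanged between the user and the travel agent. In the simplified view above, each message has a `speaker` field indicating who sent the message (either "user" or "agent") and a `text` field containing the content of the message. In the `frames.json` file loaded in this notebook, the sender is stored under `author` (with the values `user` and `wizard`), and each turn also carries a `timestamp` and annotation `labels`.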
## Objective
The objective of this notebook is to explore and preprocess the dataset to prepare it for training the LUIS model. This includes loading the JSON file, flattening the nested structure, and extracting relevant information from the conversations.
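
To make the flattening step concrete, here is a plain-Python sketch of the reshaping: one output row per turn, with the conversation `id` copied onto each row. The `toy_conversations` list is made-up illustration data that mirrors the simplified structure shown above; it is not an excerpt of `frames.json`, and `flatten_turns` is a hypothetical helper.

```python
def flatten_turns(conversations):
    """Yield one flat dict per turn, tagged with its parent conversation id."""
    for convo in conversations:
        for position, turn in enumerate(convo.get("turns", [])):
            row = {"id": convo.get("id"), "turn_index": position}
            row.update(turn)  # copy the turn's own fields (speaker, text, ...)
            yield row


toy_conversations = [
    {
        "id": "1",
        "turns": [
            {"speaker": "user", "text": "I want to book a flight from New York to Paris."},
            {"speaker": "agent", "text": "Sure, when do you want to depart?"},
        ],
    },
    {"id": "2", "turns": [{"speaker": "user", "text": "Hello"}]},
]

rows = list(flatten_turns(toy_conversations))

# Keep only what the user typed, since only user messages become LUIS utterances
user_utterances = [r["text"] for r in rows if r["speaker"] == "user"]
print(len(rows), user_utterances)
```

The same row-per-turn shape is what the pandas-based cells in this notebook aim for; the loop form is just easier to read when checking which fields end up on each row.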


## What is a bot?
Bots provide an experience that feels less like using a computer and more like dealing with a person—or intelligent robot. You can use bots to shift simple, repetitive tasks—such as taking a dinner reservation or gathering profile information—onto automated systems that may no longer require direct human intervention. Users converse with a bot using text, interactive cards, and speech. A bot interaction can be a quick answer to a question or an involved conversation that intelligently provides access to services.

In [2]:
# Flatten nested JSON columns into a flat DataFrame
from pandas import json_normalize
# Load JSON file
path = '../data/frames_dataset/frames.json'
with open(path, 'r') as file:
    data = json.load(file)

# Create a flat DataFrame from the loaded `data` variable.
# `json_normalize` accepts either a list of records or a single dict.
try:
    df_flat = json_normalize(data, sep='.')
except Exception as e:
    # Fall back to an unflattened DataFrame so the rest of the cell can still run
    print('Normalization error:', e)
    df_flat = pd.DataFrame(data)

# Handle columns that are lists of dicts: detect and explode+normalize them
for col in df_flat.columns.tolist():
    # If the column contains lists of dicts, expand them
    if df_flat[col].apply(lambda x: isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict)).any():
        # Explode the list so each element becomes its own row
        exploded = df_flat.explode(col).reset_index(drop=True)
        # Normalize only the non-null rows and keep their index,
        # so the join below stays aligned even when some rows are NaN
        dict_rows = exploded[col].dropna()
        expanded = json_normalize(dict_rows.tolist(), sep='.')
        expanded.index = dict_rows.index
        # Prefix new columns with the original column name
        expanded = expanded.add_prefix(col + '.')
        # Join back to the exploded frame (aligned by index)
        exploded = exploded.drop(columns=[col]).join(expanded)
        df_flat = exploded

# Show results
print('Flattened DataFrame shape:', df_flat.shape)
display(df_flat.head())

Flattened DataFrame shape: (19986, 14)


Unnamed: 0,user_id,wizard_id,id,labels.userSurveyRating,labels.wizardSurveyTaskSuccessful,turns.text,turns.author,turns.timestamp,turns.labels.acts,turns.labels.acts_without_refs,turns.labels.active_frame,turns.labels.frames,turns.db.result,turns.db.search
0,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,I'd like to book a trip to Atlantis from Capri...,user,1471272000000.0,"[{'args': [{'val': 'book', 'key': 'intent'}], ...","[{'args': [{'val': 'book', 'key': 'intent'}], ...",1,"[{'info': {'intent': [{'val': 'book', 'negated...",,
1,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Hi...I checked a few options for you, and unfo...",wizard,1471272000000.0,"[{'args': [{'val': [{'annotations': [], 'frame...",,1,"[{'info': {'intent': [{'val': 'book', 'negated...",[[{'trip': {'returning': {'duration': {'hours'...,"[{'ORIGIN_CITY': 'Porto Alegre', 'PRICE_MIN': ..."
2,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Yes, how about going to Neverland from Caprica...",user,1471273000000.0,"[{'args': [{'val': 'Neverland', 'key': 'dst_ci...","[{'args': [{'val': 'Neverland', 'key': 'dst_ci...",2,"[{'info': {'intent': [{'val': 'book', 'negated...",,
3,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,I checked the availability for this date and t...,wizard,1471273000000.0,[{'args': [{'val': [{'annotations': [{'val': N...,,2,"[{'info': {'intent': [{'val': 'book', 'negated...","[[], [], [], [], [], []]","[{'ORIGIN_CITY': 'Caprica', 'PRICE_MIN': '1700..."
4,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,I have no flexibility for dates... but I can l...,user,1471273000000.0,"[{'args': [{'val': False, 'key': 'flex'}], 'na...","[{'args': [{'val': False, 'key': 'flex'}], 'na...",3,"[{'info': {'intent': [{'val': 'book', 'negated...",,


In [20]:
df_flat['turns.text']

0        I'd like to book a trip to Atlantis from Capri...
1        Hi...I checked a few options for you, and unfo...
2        Yes, how about going to Neverland from Caprica...
3        I checked the availability for this date and t...
4        I have no flexibility for dates... but I can l...
                               ...                        
19981    Yup it's from the 12th to the 25th, and it wil...
19982                                 Ok perfect, book me!
19983    Consider it done! Have a good trip :slightly_s...
19984                                              Thanks!
19985                                         My pleasure!
Name: turns.text, Length: 19986, dtype: object

In [21]:
df_flat['turns.labels.acts']

0        [{'args': [{'val': 'book', 'key': 'intent'}], ...
1        [{'args': [{'val': [{'annotations': [], 'frame...
2        [{'args': [{'val': 'Neverland', 'key': 'dst_ci...
3        [{'args': [{'val': [{'annotations': [{'val': N...
4        [{'args': [{'val': False, 'key': 'flex'}], 'na...
                               ...                        
19981    [{'args': [{'val': '12th', 'key': 'str_date'},...
19982    [{'args': [{'val': 'book', 'key': 'intent'}], ...
19983    [{'args': [{'val': 'book', 'key': 'action'}], ...
19984                   [{'args': [], 'name': 'thankyou'}]
19985            [{'args': [], 'name': 'you_are_welcome'}]
Name: turns.labels.acts, Length: 19986, dtype: object
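
Each entry in `turns.labels.acts` appears to be a list of dialogue acts, where every act has a `name` and a list of `args` made of `key`/`val` pairs. A minimal sketch of pulling out the plain string slots follows. The `sample_acts` value is hand-written in the same shape as the output above (the act name `inform` is an assumption), the helper name `slots_from_acts` is hypothetical, and non-string values such as booleans or frame references are simply skipped.

```python
def slots_from_acts(acts):
    """Collect {key: val} for every act argument whose value is a plain string."""
    slots = {}
    if not isinstance(acts, list):  # e.g. NaN for rows without annotations
        return slots
    for act in acts:
        for arg in act.get("args", []):
            key, val = arg.get("key"), arg.get("val")
            if key and isinstance(val, str):
                slots.setdefault(key, val)  # keep the first value seen per key
    return slots


sample_acts = [
    {"name": "inform", "args": [
        {"key": "intent", "val": "book"},
        {"key": "dst_city", "val": "Neverland"},
        {"key": "or_city", "val": "Caprica"},
        {"key": "flex", "val": False},
    ]},
    {"name": "thankyou", "args": []},
]
print(slots_from_acts(sample_acts))
```

Applied column-wise, for example with `df_flat['turns.labels.acts'].apply(slots_from_acts)`, this would give one dictionary of slots per turn.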

In [1]:
import platform
print(platform.architecture()[0])

64bit


In [None]:
sample = df_flat.sample(5, random_state=42)
json_string = sample.to_json()
print(json_string)
# Save the sampled rows as JSON Lines (one record per line)
sample.to_json('../data/frames_dataset/sample_frames.json', orient='records', lines=True)

Unnamed: 0,user_id,wizard_id,id,labels.userSurveyRating,labels.wizardSurveyTaskSuccessful,turns.text,turns.author,turns.timestamp,turns.labels.acts,turns.labels.acts_without_refs,turns.labels.active_frame,turns.labels.frames,turns.db.result,turns.db.search
4448,U24V2QUKC,U21DMV0KA,41fc42bf-b3ef-4e07-9895-71275af20132,5.0,True,"Ok. In that case, the costs are 3591.44USD and...",wizard,1471881000000.0,"[{'args': [{'val': [{'annotations': [], 'frame...",,10,"[{'info': {'intent': [{'val': 'book', 'negated...",[],[]
10317,U260BGVS6,U21T9NMKM,8c9532d5-da74-410e-a6b3-083ac765726a,4.0,True,Wow that's a good deal! Do you have anything l...,user,1472569000000.0,"[{'args': [{'val': 'earlier', 'key': 'str_date...","[{'args': [{'val': 'earlier', 'key': 'str_date...",1,"[{'info': {'intent': [{'val': 'book', 'negated...",,
1815,U231PNNA3,U21E41CQP,e6774560-0e90-4890-a2e8-b6636fe5f076,5.0,True,Hello! I am located in Germany and I'm getting...,user,1471453000000.0,"[{'args': [{'val': 'book', 'key': 'intent'}], ...","[{'args': [{'val': 'book', 'key': 'intent'}], ...",1,"[{'info': {'intent': [{'val': 'book', 'negated...",,
9113,U22K1SX9N,U21E0179B,f73a248b-f531-42ab-9a1f-81f74e0d79b2,5.0,False,I think I’d prefer Rome. lets book it,user,1472482000000.0,[{'args': [{'val': [{'annotations': [{'val': '...,"[{'args': [{'val': 'Rome', 'key': 'dst_city'}]...",3,"[{'info': {'or_city': [{'val': 'Barcelona', 'n...",,
11750,U24V2QUKC,U21E0179B,aa9921e6-d4a6-486e-9d0d-6973e2254781,5.0,True,Could it get me to Sao Paulo?,user,1472667000000.0,"[{'args': [{'val': 'Sao Paulo', 'key': 'dst_ci...","[{'args': [{'val': 'Sao Paulo', 'key': 'dst_ci...",4,"[{'info': {'intent': [{'val': 'book', 'negated...",,


In [None]:
# Load the JSON file (a single JSON document, not JSON Lines)
path = '../data/frames_dataset/frames.json'
frames = pd.read_json(path)
# Inspect one conversation by position
frames.iloc[198].values

## Create Conversation Analysis Client
To create a Conversation Analysis client, install the `azure-ai-language-conversations` package, which provides the `azure.ai.language.conversations` module imported below. You can do this using pip:

```bash
pip install azure-ai-language-conversations
```
## Import necessary libraries
```python
import json
import pandas as pd
from pandas import json_normalize
# Load the JSON file
with open('../data/frames_dataset/frames.json', 'r') as file:
    data = json.load(file)
# The file may be a dict with a 'conversations' key or a plain list of conversations
conversations = data['conversations'] if isinstance(data, dict) else data
# Flatten to one row per turn, keeping the conversation id on each row
df = json_normalize(conversations, record_path='turns', meta=['id'])
# Display the first few rows of the DataFrame
df.head()
```

In [1]:
import azure.core
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

endpoint = "https://mytravel.cognitiveservices.azure.com/"
credential = AzureKeyCredential("key1")
client = ConversationAnalysisClient(endpoint, credential)
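
Hardcoding the key in the notebook makes it easy to leak. One common alternative, sketched here, is to read the endpoint and key from environment variables. The variable names `LANGUAGE_ENDPOINT` and `LANGUAGE_KEY` and the helper `load_language_settings` are hypothetical; use whatever names your deployment defines.

```python
import os


def load_language_settings(env=os.environ):
    """Return (endpoint, key) from the environment, failing early if either is missing."""
    endpoint = env.get("LANGUAGE_ENDPOINT")
    key = env.get("LANGUAGE_KEY")
    if not endpoint or not key:
        raise RuntimeError("Set LANGUAGE_ENDPOINT and LANGUAGE_KEY before creating the client")
    return endpoint, key


# Example with an explicit mapping so the sketch runs without real credentials
endpoint, key = load_language_settings(
    {"LANGUAGE_ENDPOINT": "https://mytravel.cognitiveservices.azure.com/", "LANGUAGE_KEY": "key1"}
)

# The values can then be passed on exactly as in the cell above:
# client = ConversationAnalysisClient(endpoint, AzureKeyCredential(key))
```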

#### Generate a LUIS JSON from the frames.json file
The dataset contains conversations in French and English, so the LUIS model must be able to understand both languages.
Each conversation is identified by a unique ID and consists of multiple turns. Each turn has a speaker (either "user" or "agent") and the text of the message.
The structure:
* multiple turns per conversation (user and agent)
* each turn has a speaker and text (frames)
* information about travel preferences (departure city, destination city, travel dates, budget)

The goal is to extract the user utterances from the conversations and format them according to the LUIS JSON structure. This means identifying the intent and the entities in each user message, then mapping the dataset's slots to LUIS entities together with their character positions in the text.

**LUIS JSON Structure:**
```json
{
  "luis_schema_version": "7.0.0",
  "versionId": "0.1",
  "name": "TravelBooking",
  "desc": "LUIS model for travel booking chatbot",
  "culture": "en-us",
  "intents": [
    {
      "name": "BookFlight"
    }
  ],
  "entities": [
    {
      "name": "DepartureCity"
    },
    {
      "name": "DestinationCity"
    },
    {
      "name": "DepartureDate"
    },
    {
      "name": "ReturnDate"
    },
    {
      "name": "Budget"
    }
  ],
  "composites": [],
  "closedLists": [],
  "patternAnyEntities": [],
  "regex_entities": [],
  "prebuiltEntities": [],
  "model_features": [],
  "regex_features": [],
  "utterances": [
    {
      "text": "I want to book a flight from New York to Paris.",
      "intent": "BookFlight",
      "entities": [
        {
          "entity": "DepartureCity",
          "startPos": 29,
          "endPos": 36
        },
        {
          "entity": "DestinationCity",
          "startPos": 41,
          "endPos": 45
        }
      ]
    }
  ],
  "patterns": []
}
```
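
LUIS labels use zero-based character offsets, and `endPos` points at the last character of the entity (inclusive). A small sketch of computing them with `str.find` follows; the helper name `label_entity` is made up for illustration, and it only handles the first exact occurrence of the value.

```python
def label_entity(text, value, entity):
    """Return a LUIS-style label for the first occurrence of `value`, or None."""
    start = text.find(value)
    if start == -1:
        return None
    return {"entity": entity, "startPos": start, "endPos": start + len(value) - 1}


utterance = "I want to book a flight from New York to Paris."
labels = [
    label_entity(utterance, "New York", "DepartureCity"),
    label_entity(utterance, "Paris", "DestinationCity"),
]
for label in labels:
    # Slicing with endPos + 1 should give back the labeled value
    print(label, "->", utterance[label["startPos"]:label["endPos"] + 1])
```

Slicing the text back with the computed offsets, as in the loop, is a cheap way to confirm that a label really covers the intended words.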

In [26]:
# Robust extraction of LUIS utterances from frames.json
import json, re, difflib
from pathlib import Path

# Reuse existing notebook constants if available, else fall back
INPUT_FILE = globals().get('INPUT_FILE', '../data/frames_dataset/frames.json')
OUTPUT_FILE = globals().get('OUTPUT_FILE', '../data/frames_dataset/luis_flight_booking.json')
INTENT_MAP = globals().get('INTENT_MAP', {
    'book': 'BookFlight',
    'inform': 'ProvideInfo',
    'offer': 'OfferFlight',
    'request': 'RequestInfo',
    'confirm': 'ConfirmBooking',
    'greet': 'Greet',
    'thankyou': 'ThankYou',
    'select': 'SelectOption',
    'deny': 'DenyRequest',
    'ack': 'Acknowledge'
})
ENTITY_MAP = globals().get('ENTITY_MAP', {
    'departure_city': 'DepartureCity',
    'from_city': 'DepartureCity',
    'origin_city': 'DepartureCity',
    'arrival_city': 'DestinationCity',
    'to_city': 'DestinationCity',
    'destination_city': 'DestinationCity',
    'depart_date': 'DepartureDate',
    'return_date': 'ReturnDate',
    'date': 'Date',
    'price': 'Budget',
    'budget': 'Budget',
    'num_people': 'NumPassengers'
})

def find_positions(text, value):
    if not value or not text:
        return None
    # Normalize both strings: lowercase and strip punctuation
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    clean_value = re.sub(r'[^\w\s]', '', str(value).lower())
    idx = clean_text.find(clean_value)
    if idx != -1:
        # compute original offsets by searching the raw text for the best matching substring
        m = re.search(re.escape(str(value)), text, flags=re.IGNORECASE)
        if m:
            return m.start(), m.end() - 1
        return idx, idx + len(clean_value) - 1
    # fuzzy match on words
    words = clean_text.split()
    matches = difflib.get_close_matches(clean_value, words, n=1, cutoff=0.75)
    if matches:
        word = matches[0]
        start = clean_text.find(word)
        # Try to locate that word in original text
        m = re.search(r'\b' + re.escape(word) + r'\b', text, flags=re.IGNORECASE)
        if m:
            return m.start(), m.end() - 1
        return start, start + len(word) - 1
    return None

# Load dataset robustly (file may be a dict with 'conversations' or a list of conversations)
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

conversations = dataset.get('conversations') if isinstance(dataset, dict) and 'conversations' in dataset else dataset

# Prepare LUIS skeleton
luis_data = {
    'luis_schema_version': '7.0.0',
    'versionId': '0.1',
    'name': 'FlightBookingBot',
    'desc': 'LUIS model generated from frames dataset',
    'culture': 'en-us',
    'intents': [],
    'entities': [],
    'utterances': []
}
intents_set, entities_set = set(), set()

# Iterate conversations and turns
for convo in conversations:
    turns = convo.get('turns') if isinstance(convo, dict) else []
    for turn in turns:
        # Accept different speaker/author keys
        speaker = (turn.get('speaker') or turn.get('author') or '').lower()
        if speaker != 'user':
            continue
        text = (turn.get('text') or '').strip()
        if not text:
            continue

        # Determine intent (default BookFlight) using frames/actions when available
        intent = 'BookFlight'
        for fr in turn.get('frames', []):
            for act in fr.get('actions', []):
                act_type = (act.get('act') or '').lower()
                if act_type in INTENT_MAP:
                    intent = INTENT_MAP[act_type]
                    break
            if intent != 'BookFlight':
                break

        intents_set.add(intent)

        # Extract entities from multiple possible frame shapes
        utter_entities = []
        for fr in turn.get('frames', []):
            # info, slots, attributes are possible containers in different dataset variants
            candidates = []
            if isinstance(fr.get('info'), list):
                candidates = fr.get('info')
            elif isinstance(fr.get('slots'), list):
                candidates = fr.get('slots')
            elif isinstance(fr.get('attributes'), list):
                candidates = fr.get('attributes')

            for info in candidates:
                # Try common keys for slot name and value
                slot = info.get('slot') or info.get('name') or info.get('key') or info.get('label')
                value = info.get('value') or info.get('text') or info.get('values') or info.get('valueText')
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                if not slot or value is None:
                    continue

                normalized = ENTITY_MAP.get(slot.lower(), slot)
                entities_set.add(normalized)
                pos = find_positions(text, value)
                if pos:
                    start, end = pos
                else:
                    start, end = 0, 0
                utter_entities.append({'entity': normalized, 'startPos': start, 'endPos': end})

        # If no explicit frame info entities, try quick regex extraction for dates and numbers
        if not utter_entities:
            # date-like (YYYY-MM-DD)
            m = re.search(r'\b(20\d{2}-\d{2}-\d{2})\b', text)
            if m:
                utter_entities.append({'entity': 'DepartureDate', 'startPos': m.start(), 'endPos': m.end()-1})
                entities_set.add('DepartureDate')
            # budget-like numbers
            m2 = re.search(r'\b\$?(\d{2,6})\b', text)
            if m2:
                utter_entities.append({'entity': 'Budget', 'startPos': m2.start(), 'endPos': m2.end()-1})
                entities_set.add('Budget')

        luis_data['utterances'].append({'text': text, 'intent': intent, 'entities': utter_entities})

# finalize intents and entities lists
luis_data['intents'] = [{'name': i} for i in sorted(intents_set)]
luis_data['entities'] = [{'name': e} for e in sorted(entities_set)]

# Save LUIS JSON
Path(OUTPUT_FILE).parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    json.dump(luis_data, f, indent=2, ensure_ascii=False)

print('✅ LUIS JSON created:', OUTPUT_FILE)
print(f"Intents: {len(intents_set)}, Entities: {len(entities_set)}, Utterances: {len(luis_data['utterances'])}")

✅ LUIS JSON created: ../data/frames_dataset/luis_flight_booking.json
Intents: 1, Entities: 1, Utterances: 10407
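
The summary above reports a single intent and a single entity, which suggests that the frame-based extraction found nothing and that every label came from the regex fallback. A quick way to sanity-check such output is to slice each utterance with its labeled offsets and look at what was actually captured. The sketch below does this for the first user utterance of the dataset, using the offsets written to the generated file; `labeled_spans` is a hypothetical helper.

```python
def labeled_spans(utterance):
    """Return (entity, substring) pairs for a LUIS utterance with inclusive endPos."""
    text = utterance["text"]
    return [
        (e["entity"], text[e["startPos"]:e["endPos"] + 1])
        for e in utterance.get("entities", [])
    ]


first_utterance = {
    "text": "I'd like to book a trip to Atlantis from Caprica on Saturday, "
            "August 13, 2016 for 8 adults. I have a tight budget of 1700.",
    "intent": "BookFlight",
    "entities": [{"entity": "Budget", "startPos": 69, "endPos": 70}],
}
print(labeled_spans(first_utterance))
```

Here the `Budget` label covers the day of the month rather than the 1700 budget, so the fallback pattern probably needs tightening (or replacing with the dataset's own annotations) before the file is used for training.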


In [28]:
import pandas as pd
# Preview the first 50 lines of the generated LUIS file
with open(OUTPUT_FILE, 'r', encoding='utf-8') as file:
    for i in range(50):
        line = file.readline()
        print(line.strip())

{
"luis_schema_version": "7.0.0",
"versionId": "0.1",
"name": "FlightBookingBot",
"desc": "LUIS model generated from frames dataset",
"culture": "en-us",
"intents": [
{
"name": "BookFlight"
}
],
"entities": [
{
"name": "Budget"
}
],
"utterances": [
{
"text": "I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
"intent": "BookFlight",
"entities": [
{
"entity": "Budget",
"startPos": 69,
"endPos": 70
}
]
},
{
"text": "Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budget would be 1900.",
"intent": "BookFlight",
"entities": [
{
"entity": "Budget",
"startPos": 57,
"endPos": 58
}
]
},
{
"text": "I have no flexibility for dates... but I can leave from Atlantis rather than Caprica. How about that?",
"intent": "BookFlight",
"entities": []
},
{
"text": "I suppose I'll speak with my husband to see if we can choose other dates, and then I'll come back to you.Thanks for your h