In [1]:
import json
import os
import pandas as pd
import numpy as np

# Fly Me
A travel provider for individuals and professionals. 

## Overview
The chatbot project - Jupyter Notebook
Fly Me has launched an ambitious project to develop a *chatbot* to *help users choose a travel offer*.

The first phase of this project is to **develop an MVP** that will help Fly Me employees to easily book airline tickets for their holidays.

This first MVP will allow us to test the concept and performance of the chatbot quickly and on a large scale.

As this project is iterative, we have limited the features of the chatbot V1. It must be able to identify the following five elements in the user's request:

* Departure city
* Destination city
* Desired flight departure date
* Desired flight return date
* Maximum budget for the total price of tickets

If one of the elements is missing, the chatbot must be able to ask the user the relevant questions in order to fully understand the request. When the chatbot thinks it has understood all the elements of the user's request, it must be able to reformulate the user's request and ask the user for confirmation.
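
The ask-then-confirm behaviour described above can be sketched as a small slot-filling loop. This is only an illustration using the standard library: the `BookingRequest` class, its field names, and the prompt wording are assumptions made for the example, not part of the Bot Framework SDK or of the project code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BookingRequest:
    """The five elements the V1 chatbot must collect (field names are illustrative)."""
    departure_city: Optional[str] = None
    destination_city: Optional[str] = None
    departure_date: Optional[str] = None
    return_date: Optional[str] = None
    max_budget: Optional[float] = None


# Hypothetical prompts, one per element
PROMPTS = {
    "departure_city": "Which city are you leaving from?",
    "destination_city": "Where would you like to go?",
    "departure_date": "When would you like to depart?",
    "return_date": "When would you like to return?",
    "max_budget": "What is your maximum budget for the tickets?",
}


def next_prompt(request: BookingRequest) -> Optional[str]:
    """Return the question for the first missing element, or None when all are filled."""
    for field in fields(request):
        if getattr(request, field.name) is None:
            return PROMPTS[field.name]
    return None


def confirmation(request: BookingRequest) -> str:
    """Reformulate a complete request and ask the user to confirm it."""
    return (
        f"You want to fly from {request.departure_city} to {request.destination_city}, "
        f"leaving on {request.departure_date} and returning on {request.return_date}, "
        f"for at most {request.max_budget}. Is that correct?"
    )


request = BookingRequest(departure_city="Caprica", destination_city="Atlantis")
print(next_prompt(request))
```

In a real bot, each user reply would fill one or more fields before `next_prompt` is called again; once it returns `None`, the bot would send the `confirmation` text.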

## Tools and Technologies used
To carry out this project, we will need to use the following tools and technologies:

* The Microsoft Bot Framework source code for Python: "Microsoft Bot Framework SDK v4 for Python"
* The Azure LUIS Cognitive Service which allows you to perform a semantic analysis of a message entered by the user and structure it for processing by the bot (it should allow you to identify the five elements requested)
* The Azure Web App service that allows you to run a web application on the Azure Cloud (you won't need to use the Azure Bot service)
* The Bot Framework Emulator which will allow you to test your chatbot locally and in production. It is an interface that allows a user to interact with the chatbot.
## Data
The data used to train the LUIS model comes from a dataset of conversations between users and travel agents. This dataset is in JSON format and contains a total of 1500 conversations. Each conversation consists of a series of messages exchanged between the user and the travel agent. The messages contain information about the user's travel preferences, such as departure city, destination city, travel dates, and budget.
The dataset is divided into two parts: a training set and a test set. The training set contains 1200 conversations and is used to train the LUIS model. The test set contains 300 conversations and is used to evaluate the performance of the LUIS model.
The dataset is available in the `data/frames_dataset` directory of the project. The main file is `frames.json`, which contains all the conversations. The file is structured as follows:
```json
{
  "conversations": [
    {
      "id": "1",
      "turns": [
        {
          "speaker": "user",
          "text": "I want to book a flight from New York to Paris."
        },
        {
          "speaker": "agent",
          "text": "Sure, when do you want to depart?"
        },
        ...
      ]
    },
    ...
  ]
}
```
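The `turns` field contains a list of messages exchanged between the user and the travel agent. In the schematic above, each message has a `speaker` field indicating who sent it (either "user" or "agent") and a `text` field containing the content of the message. In the file as loaded in this notebook, the flattened columns expose the sender as `turns.author`, with the values `user` and `wizard`.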
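
As a rough illustration of how this nested structure can be walked, the sketch below flattens a hand-written sample in the schematic shape shown above into one row per turn. It uses only the standard library. The sample data and the `iter_turn_rows` helper are made up for the example and are not taken from the dataset.

```python
import json

# Hand-written sample in the schematic shape shown above (not real dataset content)
sample = json.loads("""
{
  "conversations": [
    {
      "id": "1",
      "turns": [
        {"speaker": "user", "text": "I want to book a flight from New York to Paris."},
        {"speaker": "agent", "text": "Sure, when do you want to depart?"}
      ]
    }
  ]
}
""")


def iter_turn_rows(doc):
    """Yield one flat dict per turn, carrying the conversation id along."""
    for conversation in doc.get("conversations", []):
        for position, turn in enumerate(conversation.get("turns", [])):
            yield {
                "conversation_id": conversation["id"],
                "turn_index": position,
                "speaker": turn["speaker"],
                "text": turn["text"],
            }


rows = list(iter_turn_rows(sample))
# Only user messages are useful as training utterances
user_texts = [row["text"] for row in rows if row["speaker"] == "user"]
print(user_texts)
```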
## Objective
The objective of this notebook is to explore and preprocess the dataset to prepare it for training the LUIS model. This includes loading the JSON file, flattening the nested structure, and extracting relevant information from the conversations.


## What is a bot?
Bots provide an experience that feels less like using a computer and more like dealing with a person, or an intelligent robot. You can use bots to shift simple, repetitive tasks, such as taking a dinner reservation or gathering profile information, onto automated systems that may no longer require direct human intervention. Users converse with a bot using text, interactive cards, and speech. A bot interaction can be a quick answer to a question or an involved conversation that intelligently provides access to services.

In [2]:
# Flatten nested JSON columns into a flat DataFrame
from pandas import json_normalize
# Load JSON file
path = '../data/frames_dataset/frames.json'
with open(path, 'r') as file:
    data = json.load(file)

# Create a flat DataFrame from the loaded `data` variable.
# (The previous fallback referenced an undefined `df`, so it could only raise a NameError.)
if not isinstance(data, (list, dict)):
    raise TypeError(f'Expected a list or dict, got {type(data).__name__}')
# Nested dict keys become dotted column names, e.g. `labels.userSurveyRating`
df_flat = json_normalize(data, sep='.')

# Handle columns that are lists of dicts: detect and explode+normalize them
for col in df_flat.columns.tolist():
    # If the column contains lists of dicts, expand them
    if df_flat[col].apply(lambda x: isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict)).any():
        # Explode the list so each element becomes its own row
        exploded = df_flat.explode(col).reset_index(drop=True)
        # Normalize only the non-null elements, keeping their row index so the join stays aligned
        non_null = exploded[col].dropna()
        expanded = json_normalize(non_null.tolist(), sep='.')
        expanded.index = non_null.index
        # Prefix new columns with the original column name
        expanded = expanded.add_prefix(col + '.')
        # Join back to the exploded frame (aligned by index)
        exploded = exploded.drop(columns=[col]).join(expanded)
        df_flat = exploded

# Show results
print('Flattened DataFrame shape:', df_flat.shape)
display(df_flat.head(3))

Flattened DataFrame shape: (19986, 14)


Unnamed: 0,user_id,wizard_id,id,labels.userSurveyRating,labels.wizardSurveyTaskSuccessful,turns.text,turns.author,turns.timestamp,turns.labels.acts,turns.labels.acts_without_refs,turns.labels.active_frame,turns.labels.frames,turns.db.result,turns.db.search
0,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,I'd like to book a trip to Atlantis from Capri...,user,1471272000000.0,"[{'args': [{'val': 'book', 'key': 'intent'}], ...","[{'args': [{'val': 'book', 'key': 'intent'}], ...",1,"[{'info': {'intent': [{'val': 'book', 'negated...",,
1,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Hi...I checked a few options for you, and unfo...",wizard,1471272000000.0,"[{'args': [{'val': [{'annotations': [], 'frame...",,1,"[{'info': {'intent': [{'val': 'book', 'negated...",[[{'trip': {'returning': {'duration': {'hours'...,"[{'ORIGIN_CITY': 'Porto Alegre', 'PRICE_MIN': ..."
2,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Yes, how about going to Neverland from Caprica...",user,1471273000000.0,"[{'args': [{'val': 'Neverland', 'key': 'dst_ci...","[{'args': [{'val': 'Neverland', 'key': 'dst_ci...",2,"[{'info': {'intent': [{'val': 'book', 'negated...",,
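
The `turns.labels.acts` column above holds lists of dicts whose `args` entries have the form `{'val': ..., 'key': ...}`. A minimal sketch of collecting those key/value pairs from one turn follows. The turn is hand-built to mimic that shape. The `name` field on each act and the `budget` key are assumptions for illustration, since the output above is truncated.

```python
# Hand-built turn mimicking the `labels.acts` shape visible in the truncated output.
# ASSUMPTIONS: the `name` field on each act and the `budget` key are illustrative only.
turn = {
    "author": "user",
    "text": "I'd like to book a trip. I have a tight budget of 1700.",
    "labels": {
        "acts": [
            {
                "name": "inform",
                "args": [
                    {"key": "intent", "val": "book"},
                    {"key": "budget", "val": "1700"},
                ],
            }
        ]
    },
}


def slots_from_acts(turn):
    """Collect key -> val pairs from every act of a turn, skipping non-string values."""
    slots = {}
    for act in turn.get("labels", {}).get("acts", []):
        for arg in act.get("args", []):
            key, val = arg.get("key"), arg.get("val")
            # Wizard turns can carry list-valued `val` entries; keep plain strings only
            if key and isinstance(val, str):
                slots[key] = val
    return slots


print(slots_from_acts(turn))
```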


In [3]:
df_flat['turns.text'].values

array(["I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
       'Hi...I checked a few options for you, and unfortunately, we do not currently have any trips that meet this criteria.  Would you like to book an alternate travel option?',
       'Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budget would be 1900.',
       ..., 'Consider it done! Have a good trip :slightly_smiling_face:',
       'Thanks!', 'My pleasure!'], shape=(19986,), dtype=object)

In [1]:
import platform
print(platform.architecture()[0])

64bit


In [4]:
# sample = df_flat.sample(5, random_state=42)
# json_string = sample.to_json()
# print(json_string)
# # Save the sample to a JSON Lines file
# sample.to_json('../data/frames_dataset/sample_frames.json', orient='records', lines=True)

## Create Conversation Analysis Client
To create a Conversation Analysis client, you need to install the `azure-ai-language-conversations` package, which provides the `azure.ai.language.conversations` module imported below. You can do this using pip:

```bash
pip install azure-ai-language-conversations
```

## Import necessary libraries
```python
import json
import pandas as pd
from pandas import json_normalize
# Load the JSON file
with open('../data/frames_dataset/frames.json', 'r') as file:
    data = json.load(file)
# The file may be a dict with a 'conversations' key or a bare list of conversations
conversations = data['conversations'] if isinstance(data, dict) else data
# Normalize the JSON data: one row per turn, carrying the conversation id along
df = json_normalize(conversations, record_path='turns', meta=['id'])
# Display the first few rows of the DataFrame
df.head()
```

In [1]:
import azure.core
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

endpoint = "https://mytravel.cognitiveservices.azure.com/"
credential = AzureKeyCredential("key1")
client = ConversationAnalysisClient(endpoint, credential)

#### Generate a LUIS JSON from the frames.json file
The dataset contains conversations in French and English. The LUIS model must be able to understand both languages.
Each conversation is identified by a unique ID and consists of multiple turns. Each turn has a speaker (either "user" or "agent") and the text of the message.

The structure:
* multiple turns per conversation (user and agent)
* each turn has a speaker and text (frames)
* information about travel preferences (departure city, destination city, travel dates, budget)

The goal is to extract the user utterances from the conversations and format them according to the LUIS JSON structure. This includes identifying the intents and entities in the user's messages and mapping slots to entities in a way that LUIS can understand.

**LUIS JSON Structure:**
```json
{
  "luis_schema_version": "7.0.0",
  "versionId": "0.1",
  "name": "TravelBooking",
  "desc": "LUIS model for travel booking chatbot",
  "culture": "en-us",
  "intents": [
    {
      "name": "BookFlight"
    }
  ],
  "entities": [
    {
      "name": "DepartureCity"
    },
    {
      "name": "DestinationCity"
    },
    {
      "name": "DepartureDate"
    },
    {
      "name": "ReturnDate"
    },
    {
      "name": "Budget"
    }
  ],
  "composites": [],
  "closedLists": [],
  "patternAnyEntities": [],
  "regex_entities": [],
  "prebuiltEntities": [],
  "model_features": [],
  "regex_features": [],
  "utterances": [
    {
      "text": "I want to book a flight from New York to Paris.",
      "intent": "BookFlight",
      "entities": [
        {
          "entity": "DepartureCity",
          "startPos": 29,
          "endPos": 36
        },
        {
          "entity": "DestinationCity",
          "startPos": 41,
          "endPos": 45
        }
      ]
    }
  ],
  "patterns": []
}
```
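
Entity offsets are easy to get wrong by hand. The sketch below computes zero-based `startPos` and inclusive `endPos` values for the sample utterance, following the same inclusive-end convention as the notebook code (`m.end() - 1`). The `label_span` helper is a hypothetical name introduced for this example.

```python
def label_span(text, value, entity):
    """Return a LUIS-style label for the first occurrence of `value` in `text`, or None."""
    start = text.find(value)
    if start == -1:
        return None
    # endPos is the index of the last character of the match (inclusive)
    return {"entity": entity, "startPos": start, "endPos": start + len(value) - 1}


utterance = "I want to book a flight from New York to Paris."
labels = [
    label_span(utterance, "New York", "DepartureCity"),
    label_span(utterance, "Paris", "DestinationCity"),
]
for label in labels:
    print(label, "->", utterance[label["startPos"]:label["endPos"] + 1])
```

For this utterance, "New York" spans characters 29 to 36 and "Paris" spans characters 41 to 45.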

In [5]:
# Date-first extraction using dateparser (prefer these spans)
import re
# `import dateparser` alone does not load the `search` submodule, so import the function explicitly
from dateparser.search import search_dates

MONTH_WORDS = r'\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december)\b'

def find_date_spans(text):
    """
    Return list of (start, end, matched_text, parsed_dt) found by dateparser's search_dates.
    Fall back to a few explicit regexes (ISO) if needed.
    """
    spans = []
    # try dateparser search first
    try:
        res = search_dates(text, settings={'PREFER_DATES_FROM': 'future', 'RETURN_AS_TIMEZONE_AWARE': False})
    except Exception:
        res = None
    if res:
        for match_text, dt in res:
            # find first occurrence of match_text (case-insensitive) and use it
            m = re.search(re.escape(match_text), text, flags=re.IGNORECASE)
            if m:
                spans.append((m.start(), m.end()-1, match_text, dt))
    # fallback: explicit ISO-ish regex spans not caught by dateparser
    for m in re.finditer(r'\b20\d{2}[/-]\d{1,2}[/-]\d{1,2}\b', text):
        spans.append((m.start(), m.end()-1, text[m.start():m.end()], None))
    return spans


# Improved numeric-as-budget decision (use date spans + context cues + year heuristics)
CURRENCY_CUES = ['$', 'usd', 'dollars', 'eur', '€', '£', 'budget', 'price', 'cost', 'fare', 'ticket', 'pay']

def is_token_overlapping_spans(start, end, spans):
    # closed-interval overlap test; also catches a token that fully contains a span
    return any(start <= e and s <= end for (s, e, *_) in spans)

def is_likely_budget_token(text, start, end, date_spans):
    """
    Return True if numeric token at (start,end) is likely a budget, False otherwise.
    Uses date_spans to avoid classifying date tokens.
    """
    # don't label if overlaps an already-detected date
    if is_token_overlapping_spans(start, end, date_spans):
        return False

    window = text[max(0, start-30):min(len(text), end+30)].lower()

    # strong currency cues -> budget
    if any(cue in window for cue in CURRENCY_CUES):
        return True
    # currency symbol immediately before e.g. $1900
    if start > 0 and text[start-1] in ['$', '€', '£']:
        return True

    token = text[start:end+1]
    # numeric-only value
    try:
        val = int(re.sub(r'[^0-9]', '', token))
    except Exception:
        return False

    # heuristic rules for years vs price:
    # if token is a 4-digit year within typical year range and there's a month word nearby -> treat as date
    if len(token) == 4 and 1900 <= val <= 2035:
        month_nearby = re.search(MONTH_WORDS, window, flags=re.IGNORECASE)
        if month_nearby:
            return False  # likely a year in date context
        # if sentence contains explicit date tokens (24, Aug etc) treat as Date
        if re.search(r'\b\d{1,2}\b', window) and re.search(MONTH_WORDS, window, flags=re.IGNORECASE):
            return False

    # numeric-value heuristics: consider values >=100 as plausible budgets (tunable)
    if val >= 100 and val <= 1000000:
        # If text contains words like 'on' before token plus month words, that might be a date: be conservative
        before = text[max(0, start-10):start].lower()
        if re.search(r'\bon\b', before) and re.search(MONTH_WORDS, text, flags=re.IGNORECASE):
            return False
        return True

    return False
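
The date-overlap guard matters because a year such as "2016" inside a detected date must never be labelled as a budget. A minimal, self-contained sketch of a closed-interval overlap test follows. It uses the character offsets reported later in this notebook for the first utterance (date at 52 to 76, budget at 117 to 120). The `spans_overlap` name is an assumption for the example.

```python
def spans_overlap(start, end, span_start, span_end):
    """True when the closed intervals [start, end] and [span_start, span_end] share a character."""
    return start <= span_end and span_start <= end


date_span = (52, 76)        # "Saturday, August 13, 2016"
year_token = (73, 76)       # "2016", inside the date
budget_token = (117, 120)   # "1700", outside the date

print(spans_overlap(*year_token, *date_span))    # the year is part of the date
print(spans_overlap(*budget_token, *date_span))  # the budget is not
```

Comparing both interval ends this way also covers a token that fully encloses a span, which an endpoint-only check would miss.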

In [6]:
# Using spaCy NER extraction of LUIS utterances from frames.json
# - Uses spaCy NER when available (fallback to rule-based regex and heuristics)
# - Adds more entity patterns (airport codes, times, currencies, passenger counts)

import json
import re
import sys
from pathlib import Path

# === Configuration, Reuse existing notebook constants if available, else fall back ===
INPUT_FILE = globals().get('INPUT_FILE', '../data/frames_dataset/frames.json')
OUTPUT_FILE = globals().get('OUTPUT_FILE', '../data/frames_dataset/luis_flight_booking.json')

# === Intent mapping from frame actions ===
INTENT_MAP = globals().get('INTENT_MAP', {
    'book': 'BookFlight',
    'inform': 'ProvideInfo',
    'offer': 'OfferFlight',
    'request': 'RequestInfo',
    'confirm': 'ConfirmBooking',
    'greet': 'Greet',
    'thankyou': 'ThankYou',
    'select': 'SelectOption',
    'deny': 'DenyRequest',
    'ack': 'Acknowledge'
})

# === Entity normalization mapping ===
ENTITY_MAP = globals().get('ENTITY_MAP', {
    'departure_city': 'DepartureCity',
    'from_city': 'DepartureCity',
    'origin_city': 'DepartureCity',
    'arrival_city': 'DestinationCity',
    'to_city': 'DestinationCity',
    'destination_city': 'DestinationCity',
    'depart_date': 'DepartureDate',
    'return_date': 'ReturnDate',
    'date': 'Date',
    'price': 'Budget',
    'budget': 'Budget',
    'num_people': 'NumPassengers'
})

# === Blocklist certain spaCy labels and unwanted entity names ===
BLOCKLIST = {'EVENT', 'FAC', 'LAW', 'PRODUCT', 'NORP', 'PERCENT', 'PERSON', 'WORK_OF_ART'}

# === Try to import spaCy and add rule-based matcher if available ===
USE_SPACY = False
try:
    import spacy
    from spacy.matcher import Matcher
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    USE_SPACY = True
except Exception:
    print('spaCy not available or model missing, falling back to rule-based extraction', file=sys.stderr)

# === Position finder: finds the best case-insensitive span for a value in text ===
def find_positions(text, value):
    if not value or not text:
        return None
    s = str(value).strip()
    if not s:
        return None
    # === direct case-insensitive search first ===
    m = re.search(re.escape(s), text, flags=re.IGNORECASE)
    if m:
        return m.start(), m.end() - 1

    # === normalized whitespace and punctuation relaxed search ===
    norm_text = re.sub(r'\s+', ' ', text).lower()
    norm_value = re.sub(r'\s+', ' ', s).lower()
    idx = norm_text.find(norm_value)
    if idx != -1:
        words = norm_value.split()
        # single backslashes: in a raw string r'\\b' would match a literal backslash followed by 'b'
        pattern = r'\b' + r'\s+'.join(map(re.escape, words)) + r'\b'
        m2 = re.search(pattern, text, flags=re.IGNORECASE)
        if m2:
            return m2.start(), m2.end() - 1

    # === fuzzy word match fallback ===
    from difflib import get_close_matches
    words = norm_text.split()
    cm = get_close_matches(norm_value, words, n=1, cutoff=0.8)
    if cm:
        word = cm[0]
        m3 = re.search(re.escape(word), text, flags=re.IGNORECASE)
        if m3:
            return m3.start(), m3.end() - 1
    return None

# === Rule-based regex patterns to capture common entities ===
ENTITY_PATTERNS = [
    ("DepartureDate", re.compile(r"\b(20\d{2}[\/-]\d{1,2}[\/-]\d{1,2})\b")),
    ("DepartureDate", re.compile(r"\b(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[a-z]*\s+20\d{2})\b", re.IGNORECASE)),
    ("Time", re.compile(r"\b(\d{1,2}:(?:\d{2})(?:\s?[APMapm]{2})?)\b")),
    # `\b` never matches between a space and a currency symbol, so use a lookbehind instead;
    # `\d[\d,]*` also accepts amounts written without thousands separators (e.g. $1900)
    ("Budget", re.compile(r"(?<!\w)\$\s?\d[\d,]*(?:\.\d{1,2})?\b")),
    ("Budget", re.compile(r"(?<!\w)(?:usd|dollars|eur|€|£)\s?\d[\d,]*(?:\.\d{1,2})?\b", re.IGNORECASE)),
    # keep AirportCode and NumPassengers patterns; numeric-only budgets handled separately below
    ("AirportCode", re.compile(r"\b[A-Z]{3}\b")),
    ("NumPassengers", re.compile(r"\b(\d+)\s*(?:passengers|people|persons|pax)\b", re.IGNORECASE)),
]

# === Add spaCy matcher patterns ===
if USE_SPACY:
    try:
        matcher.add('NUM_PASS', [[{"IS_DIGIT": True}, {"LOWER": {"IN": ["adult", "adults", "passenger", "passengers"]}}]])
    except Exception:
        pass

# === helper to apply spaCy NER and also our rule-based patterns ===
def extract_entities_from_text(text):
    date_spans = find_date_spans(text)  # list of (s,e,text,dt)
    entities = []
    # === add date entities first ===
    for s, e, match, dt in date_spans:
        entities.append({'entity': 'Date', 'startPos': s, 'endPos': e, 'text': match})

    # === apply rule-based patterns once (avoid duplicates and date overlaps) ===
    for etype, pattern in ENTITY_PATTERNS:
        try:
            for m in pattern.finditer(text):
                s, e = m.start(), m.end() - 1
                if is_token_overlapping_spans(s, e, [(ds, de) for ds, de, _, _ in date_spans]):
                    continue
                if any(ent['startPos'] <= s <= ent['endPos'] or ent['startPos'] <= e <= ent['endPos'] for ent in entities):
                    continue
                entities.append({'entity': etype, 'startPos': s, 'endPos': e, 'text': text[s:e+1]})
        except re.error:
            continue

    # === now numeric-only tokens (e.g. "1900") - use the date-aware budget detector ===
    for m in re.finditer(r"\b\d{3,6}\b", text):
        s, e = m.start(), m.end() - 1
        # === skip numbers overlapping date spans or already captured ===
        if is_token_overlapping_spans(s, e, [(ds, de) for ds, de, _, _ in date_spans]):
            continue
        if any(ent['startPos'] <= s <= ent['endPos'] or ent['startPos'] <= e <= ent['endPos'] for ent in entities):
            continue
        if is_likely_budget_token(text, s, e, date_spans):
            span_text = text[s:e+1]
            entities.append({'entity': 'Budget', 'startPos': s, 'endPos': e, 'text': span_text})

    # === spaCy NER - prefer these labels but avoid duplicates ===
    if USE_SPACY:
        doc = nlp(text)
        for ent in doc.ents:
            label = ent.label_
            if label in ('GPE', 'LOC', 'ORG'):
                et = 'Location'
            elif label in ('DATE',):
                et = 'Date'
            elif label in ('TIME',):
                et = 'Time'
            elif label in ('MONEY',):
                et = 'Budget'
            else:
                et = label
            start, end = ent.start_char, ent.end_char - 1
            if any(existing['startPos'] <= start <= existing['endPos'] or existing['startPos'] <= end <= existing['endPos'] for existing in entities):
                continue
            # === filter blocklisted spaCy label values ===
            if et in BLOCKLIST:
                continue
            entities.append({'entity': et, 'startPos': start, 'endPos': end, 'text': ent.text})

    # === airport code heuristic with context ===
    for m in re.finditer(r"\b([A-Z]{3})\b", text):
        ctx = text[max(0, m.start()-20):m.end()+20].lower()
        if 'airport' in ctx or 'from' in ctx or 'to' in ctx or 'arriv' in ctx or 'depart' in ctx:
            start, end = m.start(), m.end() - 1
            if not any(s['startPos'] <= start <= s['endPos'] for s in entities):
                entities.append({'entity': 'AirportCode', 'startPos': start, 'endPos': end, 'text': m.group(1)})

    # === normalize a few labels ===
    for e in entities:
        lab = e['entity']
        if lab.lower() in ('location', 'gpe', 'loc'):
            e['entity'] = 'Location'
    return entities

# === Load dataset robustly (file may be a dict with 'conversations' or a list of conversations) ===
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

conversations = dataset.get('conversations') if isinstance(dataset, dict) and 'conversations' in dataset else dataset

# === Initialize LUIS structure ===
luis_data = {
    'luis_schema_version': '7.0.0',
    'versionId': '0.1',
    'name': 'FlightBookingBot',
    'desc': 'LUIS model generated from frames dataset (improved NER)',
    'culture': 'en-us',
    'intents': [],
    'entities': [],
    'utterances': []
}
intents_set, entities_set = set(), set()

# === Iterate conversations and turns ===
for convo in conversations:
    turns = convo.get('turns') if isinstance(convo, dict) else []
    for turn in turns:
        speaker = (turn.get('speaker') or turn.get('author') or '').lower()
        if speaker != 'user':
            continue
        text = (turn.get('text') or '').strip()
        if not text:
            continue

        # === Determine intent (default BookFlight) using frames/actions when available ===
        intent = 'BookFlight'
        for fr in turn.get('frames', []):
            for act in fr.get('actions', []):
                act_type = (act.get('act') or '').lower()
                if act_type in INTENT_MAP:
                    intent = INTENT_MAP[act_type]
                    break
            if intent != 'BookFlight':
                break

        intents_set.add(intent)

        # === Extract entities ===
        utter_entities = []

        # === prefer authoritative frame values when available ===
        for fr in turn.get('frames', []):
            candidates = []
            if isinstance(fr.get('info'), list):
                candidates = fr.get('info')
            elif isinstance(fr.get('slots'), list):
                candidates = fr.get('slots')
            elif isinstance(fr.get('attributes'), list):
                candidates = fr.get('attributes')
            for info in candidates:
                slot = info.get('slot') or info.get('name') or info.get('key') or info.get('label')
                value = info.get('value') or info.get('text') or info.get('values') or info.get('valueText')
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                if not slot or value is None:
                    continue
                normalized = ENTITY_MAP.get(slot.lower(), slot)
                pos = find_positions(text, value)
                if not pos:
                    # value not found in the utterance: skip rather than label a bogus 0-0 span
                    continue
                start, end = pos
                entities_set.add(normalized)
                utter_entities.append({'entity': normalized, 'startPos': start, 'endPos': end, 'text': str(value)})
        
        # === supplement with text-based extraction ===
        extracted = extract_entities_from_text(text)
        for e in extracted:
            label = e['entity']
            entities_set.add(label)
            if not any(u.get('startPos') == e['startPos'] and u.get('endPos') == e['endPos'] for u in utter_entities):
                utter_entities.append({'entity': label, 'startPos': e['startPos'], 'endPos': e['endPos'], 'text': e.get('text')})
        luis_data['utterances'].append({'text': text, 'intent': intent, 'entities': utter_entities})

# === Add unique intents and entities ===
luis_data['intents'] = [{'name': i} for i in sorted(intents_set)]

# === remove blocked items from entities_set before finalizing ===
entities_set = {e for e in entities_set if e not in BLOCKLIST}

luis_data['entities'] = [{'name': e} for e in sorted(entities_set)]

# === Save output (LUIS JSON) ===
Path(OUTPUT_FILE).parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    json.dump(luis_data, f, indent=2, ensure_ascii=False)

print("✅ LUIS JSON created successfully!")
print(f"📄 Saved to: {OUTPUT_FILE}")
# double-quoted f-string so the inner single quotes parse on Python versions before 3.12
print(f"🧠 Intents: {len(intents_set)}, Entities: {len(entities_set)}, Utterances: {len(luis_data['utterances'])}")

✅ LUIS JSON created successfully!
📄 Saved to: ../data/frames_dataset/luis_flight_booking.json
🧠 Intents: 1, Entities: 10, Utterances: 10407


In [7]:
import pandas as pd
# Load JSON file
with open(OUTPUT_FILE, 'r', encoding='utf-8') as file:
    for i in range(250):  # Read the first 250 lines
        line = file.readline()
        print(line.strip())

{
"luis_schema_version": "7.0.0",
"versionId": "0.1",
"name": "FlightBookingBot",
"desc": "LUIS model generated from frames dataset (improved NER)",
"culture": "en-us",
"intents": [
{
"name": "BookFlight"
}
],
"entities": [
{
"name": "AirportCode"
},
{
"name": "Budget"
},
{
"name": "CARDINAL"
},
{
"name": "Date"
},
{
"name": "LANGUAGE"
},
{
"name": "Location"
},
{
"name": "NumPassengers"
},
{
"name": "ORDINAL"
},
{
"name": "QUANTITY"
},
{
"name": "Time"
}
],
"utterances": [
{
"text": "I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
"intent": "BookFlight",
"entities": [
{
"entity": "Budget",
"startPos": 117,
"endPos": 120,
"text": "1700"
},
{
"entity": "Date",
"startPos": 52,
"endPos": 76,
"text": "Saturday, August 13, 2016"
},
{
"entity": "CARDINAL",
"startPos": 82,
"endPos": 82,
"text": "8"
}
]
},
{
"text": "Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budg