In [1]:
import json
import os
import pandas as pd
import numpy as np

# Fly Me
A travel provider for individuals and professionals. 

## Overview
This Jupyter notebook is part of the Fly Me chatbot project. Fly Me, a travel agency, has launched an ambitious project to develop a *chatbot* to *help users choose a travel offer*.

The first phase of this project is to **develop an MVP** that will help Fly Me employees to easily book airline tickets for their holidays.

This first MVP will allow us to test the concept and performance of the chatbot quickly and on a large scale.

As this project is iterative, we have limited the features of the chatbot V1. It must be able to identify the following five elements in the user's request:

* Departure city
* Destination city
* Desired flight departure date
* Desired flight return date
* Maximum budget for the total price of tickets

If one of the elements is missing, the chatbot must be able to ask the user the relevant questions in order to fully understand the request. When the chatbot thinks it has understood all the elements of the user's request, it must be able to reformulate the user's request and ask the user for confirmation.
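
The behaviour described above is essentially a slot-filling loop. The following is a minimal sketch in plain Python, not Bot Framework code: the slot names, the prompts, and the function `next_bot_message` are hypothetical and chosen only for illustration.

```python
# Hypothetical slot names and prompts, chosen for illustration only.
REQUIRED_SLOTS = {
    "departure_city": "Which city are you leaving from?",
    "destination_city": "Where would you like to go?",
    "departure_date": "When do you want to leave?",
    "return_date": "When do you want to come back?",
    "budget": "What is your maximum budget for the tickets?",
}


def next_bot_message(slots):
    """Return the next question, or a confirmation once every slot is filled."""
    for name, question in REQUIRED_SLOTS.items():
        if not slots.get(name):
            return question  # ask about the first missing element
    return (
        f"You want to fly from {slots['departure_city']} to "
        f"{slots['destination_city']}, leaving on {slots['departure_date']} and "
        f"returning on {slots['return_date']}, for at most {slots['budget']}. "
        "Is that correct?"
    )


partial = {"departure_city": "Paris", "destination_city": "Berlin"}
print(next_bot_message(partial))
```

In a real bot the `slots` dictionary would be filled from the entities returned by the language model on each turn, and the loop would repeat until the user confirms the reformulated request.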

## Tools and Technologies used
To carry out this project, we will need to use the following tools and technologies:

* The Microsoft Bot Framework SDK v4 for Python (source code)
* The Azure LUIS Cognitive Service, which performs a semantic analysis of a message entered by the user and structures it for processing by the bot (it should identify the five elements listed above)
* The Azure Web App service, which runs a web application on the Azure cloud (the Azure Bot service is not needed)
* The Bot Framework Emulator, an interface that lets a user interact with the chatbot and test it both locally and in production

## Data
The data used to train the LUIS model comes from a dataset of conversations between users and travel agents. This dataset is in JSON format and contains a total of 1500 conversations. Each conversation consists of a series of messages exchanged between the user and the travel agent. The messages contain information about the user's travel preferences, such as departure city, destination city, travel dates, and budget. For more details about the dataset, please refer to the original source: [Frames Dataset](https://www.microsoft.com/en-us/research/project/frames-dataset/).

For the purpose of this project, the dataset is divided into two parts: a training set and a test set. The training set contains 80% of the conversations and is used to train the LUIS model; the remaining 20% are assigned to the test set, which is used to evaluate the performance of the LUIS model.
The dataset is available in the `data/frames_dataset` directory of the project. The main file is `frames.json`, which contains all the conversations. The file is structured as follows:
```json
{
  "conversations": [
    {
      "id": "1",
      "turns": [
        {
          "speaker": "user",
          "text": "I want to book a flight from New York to Paris."
        },
        {
          "speaker": "agent",
          "text": "Sure, when do you want to depart?"
        },
        ...
      ]
    },
    ...
  ]
}
```
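The `turns` field contains a list of messages exchanged between the user and the travel agent. Each message has a `speaker` field indicating who sent the message (either "user" or "agent") and a `text` field containing the content of the message. The example above is simplified: in the copy of `frames.json` loaded later in this notebook, the flattened columns show the sender field as `author`, with the values `user` and `wizard`.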
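
To make the structure and the 80/20 split concrete, here is a small sketch that works on an in-memory stand-in with the same shape as the simplified example. The helper names `user_utterances` and `split_conversations`, the fixed seed, and the choice to split by conversation rather than by turn are assumptions made for illustration.

```python
import random

# Tiny in-memory stand-in with the same shape as the simplified example;
# the notebook itself loads the real data from `frames.json`.
dataset = {
    "conversations": [
        {
            "id": str(i),
            "turns": [
                {"speaker": "user", "text": f"I want to book flight number {i}."},
                {"speaker": "agent", "text": "Sure, when do you want to depart?"},
            ],
        }
        for i in range(10)
    ]
}


def user_utterances(conversation):
    """Collect the text of every turn sent by the user."""
    return [t["text"] for t in conversation["turns"] if t["speaker"] == "user"]


def split_conversations(conversations, train_ratio=0.8, seed=42):
    """Shuffle a copy of the conversations and split it into train and test."""
    shuffled = list(conversations)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_conversations(dataset["conversations"])
print(len(train), len(test))
print(user_utterances(train[0]))
```

Splitting whole conversations keeps all turns of one dialogue on the same side of the split, so the test set never contains turns from a conversation the model has partly seen.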

## Objective
The objective of this notebook is to explore and preprocess the dataset to prepare it for training the LUIS model. This includes loading the JSON file, flattening the nested structure, and extracting relevant information from the conversations. We will use the `pandas` library to manipulate the data and perform basic exploratory data analysis (EDA) to understand the distribution of the different entities and intents in the dataset.

## What is a bot?
Bots provide an experience that feels less like using a computer and more like dealing with a person, or an intelligent robot. Bots can be used to shift simple, repetitive tasks, such as taking a dinner reservation or gathering profile information, onto automated systems that may no longer require direct human intervention. Users converse with a bot using text, interactive cards, and speech. A bot interaction can be a quick answer to a question or an involved conversation that intelligently provides access to services.

In creating a bot, we define how the bot interacts with users, what services it connects to, and how it processes information. We can create bots that run in a variety of environments, including websites, apps, Microsoft Teams, Skype, Slack, Facebook Messenger, and more.
For a detailed overview of bots, see [What are bots?](https://learn.microsoft.com/en-us/azure/bot-service/bot-service-overview-introduction?view=azure-bot-service-4.0). 

In this project we will create a bot that can help users book flights using the Language Understanding Intelligent Service (LUIS) or the Conversational Language Understanding (CLU) model from Microsoft. To create the model we will define intents, entities, and utterances.

1. **Intents**: 

An intent represents a task or action that the user wants to perform. It is the purpose of the user's input. For example, in our flight booking bot, we might have intents such as "BookFlight", "CancelFlight", and "GetFlightStatus".

2. **Entities**: 

Entities are used to extract specific pieces of information from the user's input that are relevant to the intent. For example, in the "BookFlight" intent, we might have entities such as "DepartureCity", "DestinationCity", "DepartureDate", "ReturnDate", and "Budget".

3. **Utterances**: 

Utterances are the actual phrases or sentences that users might say to express their intent. For example, for the "BookFlight" intent, some example utterances might be "I want to book a flight from New York to Paris on June 1st and return on June 10th with a budget of $1000" or "Can you help me find a flight to London next month?".

In summary:
* An **intent** represents the purpose of a user's input: the task or action the user wants to perform. It is the meaning of an utterance. In this case, the intent is to book a flight.
* An **entity** adds specific context to an intent. It represents a specific piece of information that is relevant to the intent. In this case, the entities are the departure city, destination city, departure date, return date, and budget.
* An **utterance** is a phrase that a user might enter when interacting with the application. It is a specific example of a user's input that expresses the intent. In this case, an utterance could be "I want to book a flight from New York to Paris on June 1st and return on June 10th with a budget of $1000."

We create a model by defining the intents we want it to understand and associating each of them with one or more utterances. Every model must have a **None** intent, which explicitly identifies utterances that a user might submit but that require no specific action (for example, conversational greetings like "hello") or that fall outside the scope of the domain for this model. We must also define the entities that are relevant to each intent and annotate the utterances with them. This helps the model learn to recognize the intent and extract the relevant entities from user input.

The model can then recognize the intent to book a flight and extract the relevant entities from the user's input. This model can then be integrated into a chatbot application to help users book flights.
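
One way to picture these three concepts is as a small data model. The sketch below is only an illustration: the class names `LabeledEntity` and `LabeledUtterance` and the `validate` helper are hypothetical and are not part of LUIS, CLU, or the Bot Framework SDK.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledEntity:
    name: str   # entity name, e.g. "DestinationCity"
    value: str  # the substring of the utterance that carries the entity


@dataclass
class LabeledUtterance:
    text: str
    intent: str
    entities: list = field(default_factory=list)


# The intents and entities the model is expected to know about.
INTENTS = {"BookFlight", "None"}
ENTITIES = {"DepartureCity", "DestinationCity", "DepartureDate", "ReturnDate", "Budget"}

examples = [
    LabeledUtterance(
        "Can you help me find a flight to London next month?",
        "BookFlight",
        [LabeledEntity("DestinationCity", "London")],
    ),
    # A greeting requires no action, so it is labeled with the None intent.
    LabeledUtterance("hello", "None"),
]


def validate(utterance):
    """Check that an example only uses declared intents and entities."""
    if utterance.intent not in INTENTS:
        return False
    return all(
        e.name in ENTITIES and e.value in utterance.text
        for e in utterance.entities
    )


print([validate(u) for u in examples])
```

A check like `validate` is cheap to run before uploading training data and catches misspelled intent names or entity values that do not actually occur in the utterance text.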



In [2]:
# Flatten nested JSON columns into a flat DataFrame
from pandas import json_normalize
# Load JSON file
path = '../data/frames_dataset/frames.json'
with open(path, 'r') as file:
    data = json.load(file)

# Create a flat DataFrame from the loaded `data` variable.
# json_normalize accepts either a list of records or a single dict.
if not isinstance(data, (list, dict)):
    raise TypeError(f'Unexpected JSON root type: {type(data).__name__}')
df_flat = json_normalize(data, sep='.')

# Handle columns that are lists of dicts: detect and explode+normalize them
for col in df_flat.columns.tolist():
    # If the column contains lists of dicts, expand them
    if df_flat[col].apply(lambda x: isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict)).any():
        # Explode the list so each element becomes its own row
        exploded = df_flat.explode(col).reset_index(drop=True)
        # Normalize only the rows that hold a dict (empty lists explode to NaN)
        is_dict = exploded[col].apply(lambda x: isinstance(x, dict))
        expanded = json_normalize(exploded.loc[is_dict, col].tolist(), sep='.')
        # Reuse the source row labels so the join below stays aligned
        expanded.index = exploded.index[is_dict]
        # Prefix new columns with the original column name
        expanded = expanded.add_prefix(col + '.')
        # Join back to the exploded frame (aligned by index)
        exploded = exploded.drop(columns=[col]).join(expanded)
        df_flat = exploded

# Show results
print('Flattened DataFrame shape:', df_flat.shape)
display(df_flat.head(3))

Flattened DataFrame shape: (19986, 14)


Unnamed: 0,user_id,wizard_id,id,labels.userSurveyRating,labels.wizardSurveyTaskSuccessful,turns.text,turns.author,turns.timestamp,turns.labels.acts,turns.labels.acts_without_refs,turns.labels.active_frame,turns.labels.frames,turns.db.result,turns.db.search
0,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,I'd like to book a trip to Atlantis from Capri...,user,1471272000000.0,"[{'args': [{'val': 'book', 'key': 'intent'}], ...","[{'args': [{'val': 'book', 'key': 'intent'}], ...",1,"[{'info': {'intent': [{'val': 'book', 'negated...",,
1,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Hi...I checked a few options for you, and unfo...",wizard,1471272000000.0,"[{'args': [{'val': [{'annotations': [], 'frame...",,1,"[{'info': {'intent': [{'val': 'book', 'negated...",[[{'trip': {'returning': {'duration': {'hours'...,"[{'ORIGIN_CITY': 'Porto Alegre', 'PRICE_MIN': ..."
2,U22HTHYNP,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,4.0,True,"Yes, how about going to Neverland from Caprica...",user,1471273000000.0,"[{'args': [{'val': 'Neverland', 'key': 'dst_ci...","[{'args': [{'val': 'Neverland', 'key': 'dst_ci...",2,"[{'info': {'intent': [{'val': 'book', 'negated...",,


In [3]:
df_flat['turns.text'].values

array(["I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
       'Hi...I checked a few options for you, and unfortunately, we do not currently have any trips that meet this criteria.  Would you like to book an alternate travel option?',
       'Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budget would be 1900.',
       ..., 'Consider it done! Have a good trip :slightly_smiling_face:',
       'Thanks!', 'My pleasure!'], shape=(19986,), dtype=object)

In [4]:
import platform
print(platform.architecture()[0])

64bit


In [4]:
# sample = df_flat.sample(5, random_state=42)
# json_string = sample.to_json()
# print(json_string)
# # Save the sample to a JSON Lines file
# sample.to_json('../data/frames_dataset/sample_frames.json', orient='records', lines=True)

## Create Conversation Analysis Client
To create a Conversation Analysis client, you need to install the `azure-ai-language-conversations` package, which provides `ConversationAnalysisClient`. You can do this using pip:

```bash
pip install azure-ai-language-conversations
```

## Import necessary libraries
```python
import json
import pandas as pd
from pandas import json_normalize
# Load the JSON file
with open('../data/frames_dataset/frames.json', 'r') as file:
    data = json.load(file)
# The file may be a dict with a 'conversations' key or a plain list of conversations
conversations = data['conversations'] if isinstance(data, dict) else data
# Normalize the JSON data to flatten the structure: one row per turn, keeping the conversation id
df = json_normalize(conversations, 'turns', ['id'])
# Display the first few rows of the DataFrame
df.head()
```

#### Generate a LUIS JSON from the frames.json file
The dataset is divided into two parts: a training set and a test set. The training set contains 1200 conversations and is used to train the LUIS model. The test set contains 300 conversations and is used to evaluate the performance of the LUIS model.
The dataset contains conversations in French and English, so the LUIS model must be able to understand both languages.
Each conversation is identified by a unique ID and consists of multiple turns. Each turn has a speaker (either "user" or "agent") and the text of the message.

The structure:
* multiple turns per conversation (user and agent)
* each turn has a speaker and text (frames)
* information about travel preferences (departure city, destination city, travel dates, budget)

The goal is to extract the user utterances from the conversations and format them according to the LUIS JSON structure. This includes identifying the intents and entities in the user's messages and mapping the dataset's slots to entities in a way that LUIS can understand.

**LUIS JSON Structure:**
```json
{
  "luis_schema_version": "7.0.0",
  "versionId": "0.1",
  "name": "TravelBooking",
  "desc": "LUIS model for travel booking chatbot",
  "culture": "en-us",
  "intents": [
    {
      "name": "BookFlight"
    },
    {
      "name": "None"
    }
  ],
  "entities": [
    {
      "name": "DepartureCity"
    },
    {
      "name": "DestinationCity"
    },
    {
      "name": "DepartureDate"
    },
    {
      "name": "ReturnDate"
    },
    {
      "name": "Budget"
    }
  ],
  "composites": [],
  "closedLists": [],
  "patternAnyEntities": [],
  "regex_entities": [],
  "prebuiltEntities": [],
  "model_features": [],
  "regex_features": [],
  "utterances": [
    {
      "text": "I want to book a flight from New York to Paris.",
      "intent": "BookFlight",
      "entities": [
        {
          "entity": "DepartureCity",
          "startPos": 29,
          "endPos": 36
        },
        {
          "entity": "DestinationCity",
          "startPos": 41,
          "endPos": 45
        }
      ]
    }
  ],
  "patterns": []
}
```
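
The `startPos` and `endPos` values are zero-based character indices into `text`, with `endPos` pointing at the last character of the entity. Counting them by hand is error-prone, so a sketch like the following can compute them. The helper name `label_entity` is hypothetical, and the sketch assumes each value occurs exactly as written in the utterance.

```python
def label_entity(text, value, entity_name):
    """Locate `value` in `text` and return a LUIS-style entity label.

    `startPos` and `endPos` are zero-based character indices, and `endPos`
    points at the last character of the value (inclusive).
    """
    start = text.find(value)
    if start == -1:
        return None  # value not present; nothing to label
    return {
        "entity": entity_name,
        "startPos": start,
        "endPos": start + len(value) - 1,
    }


text = "I want to book a flight from New York to Paris."
utterance = {
    "text": text,
    "intent": "BookFlight",
    "entities": [
        label_entity(text, "New York", "DepartureCity"),
        label_entity(text, "Paris", "DestinationCity"),
    ],
}
print(utterance["entities"])
```

Because `str.find` returns only the first occurrence, an utterance that mentions the same city twice (for example as both origin and destination) would need a more careful search.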

In [10]:
# === Date-first extraction using dateparser (prefer these spans) ===
# Improved date and budget entity extraction using dateparser + heuristics
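import re

import dateparser
# Import the subpackage explicitly so that `dateparser.search.search_dates` resolves
import dateparser.search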

MONTH_WORDS = r'\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december)\b'


# === date spans extraction ===
def find_date_spans(text):
    """
    Return list of (start, end, matched_text, parsed_dt) found by dateparser.search.search_dates.
    Fall back to a few explicit regexes (ISO) if needed.
    """
    spans = []
    # === try dateparser search first ===
    if dateparser is not None:
        try:
            res = dateparser.search.search_dates(text, settings={'PREFER_DATES_FROM': 'future', 'RETURN_AS_TIMEZONE_AWARE': False})
        except Exception:
            res = None
        if res:
            for match_text, dt in res:
                # === find first occurrence of match_text (case-insensitive) and use it ===
                m = re.search(re.escape(match_text), text, flags=re.IGNORECASE)
                if m:
                    spans.append((m.start(), m.end()-1, match_text, dt))
    # === fallback: explicit ISO-ish regex spans not caught by dateparser ===
    for m in re.finditer(r'\b20\d{2}[/-]\d{1,2}[/-]\d{1,2}\b', text):
        spans.append((m.start(), m.end()-1, text[m.start():m.end()], None))
    return spans

# === Improved numeric-as-budget decision (use date spans + context cues + year heuristics) ===
CURRENCY_CUES = ['$', 'usd', 'dollars', 'eur', '€', '£', 'budget', 'price', 'cost', 'fare', 'ticket', 'pay']

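def is_token_overlapping_spans(start, end, spans):
    """Return True if the inclusive range (start, end) overlaps any span in `spans`."""
    # Two inclusive ranges overlap when each one starts before the other ends;
    # this also covers a token that fully contains a span.
    return any(start <= e and s <= end for (s, e, *_) in spans)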

def is_likely_budget_token(text, start, end, date_spans):
    """
    Return True if numeric token at (start,end) is likely a budget, False otherwise.
    Uses date_spans to avoid classifying date tokens.
    """
    # don't label if overlaps an already-detected date
    if is_token_overlapping_spans(start, end, date_spans):
        return False

    window = text[max(0, start-30):min(len(text), end+30)].lower()

    # strong currency cues -> budget
    if any(cue in window for cue in CURRENCY_CUES):
        return True
    # currency symbol immediately before e.g. $1900
    if start > 0 and text[start-1] in ['$', '€', '£']:
        return True

    token = text[start:end+1]
    # numeric-only value
    try:
        val = int(re.sub(r'[^0-9]', '', token))
    except Exception:
        return False

    # heuristic rules for years vs price:
    # if token is a 4-digit year within typical year range and there's a month word nearby -> treat as date
    if len(token) == 4 and 1900 <= val <= 2035:
        month_nearby = re.search(MONTH_WORDS, window, flags=re.IGNORECASE)
        if month_nearby:
            return False  # likely a year in date context
        # if sentence contains explicit date tokens (24, Aug etc) treat as Date
        if re.search(r'\b\d{1,2}\b', window) and re.search(MONTH_WORDS, window, flags=re.IGNORECASE):
            return False

    # numeric-value heuristics: consider values >=100 as plausible budgets (tunable)
    if val >= 100 and val <= 1000000:
        # If text contains words like 'on' before token plus month words, that might be a date: be conservative
        before = text[max(0, start-10):start].lower()
        if re.search(r'\bon\b', before) and re.search(MONTH_WORDS, text, flags=re.IGNORECASE):
            return False
        return True

    return False

In [12]:
# Extract LUIS utterances from frames.json
# - Uses spaCy NER when available (falls back to rule-based regexes and heuristics)
# - Adds more entity patterns (airport codes, times, currencies, passenger counts)

import json
import re
import sys
from pathlib import Path

# === Configuration, Reuse existing notebook constants if available, else fall back ===
INPUT_FILE = globals().get('INPUT_FILE', '../data/frames_dataset/frames.json')
OUTPUT_FILE = globals().get('OUTPUT_FILE', '../data/frames_dataset/luis_flight_booking.json')

# === Intent mapping from frame actions ===
INTENT_MAP = globals().get('INTENT_MAP', {
    'book': 'BookFlight',
    'inform': 'ProvideInfo',
    'offer': 'OfferFlight',
    'request': 'RequestInfo',
    'confirm': 'ConfirmBooking',
    'greet': 'Greet',
    'thankyou': 'ThankYou',
    'select': 'SelectOption',
    'deny': 'DenyRequest',
    'ack': 'Acknowledge'
})

# === IATA code mapping (3-letter airport and city codes) ===
# This is a small sample; in production, use a comprehensive list or an API lookup.
IATA_MAP = {
    "LON": "London",
    "NYC": "New York",
    "SFO": "San Francisco",
    "SEA": "Seattle",
    "CHI": "Chicago",
    "BOS": "Boston",
    "ATL": "Atlanta",
    "DFW": "Dallas",
    "DEN": "Denver",
    "MIA": "Miami",
    "LAX": "Los Angeles",
    "PAR": "Paris",
    "BER": "Berlin",
    "ROM": "Rome",
    "AMS": "Amsterdam",
    "BKK": "Bangkok",
    "HKG": "Hong Kong",
    "DEL": "Delhi",
    "DXB": "Dubai",
    "SYD": "Sydney"
}

# === Helper to map an IATA code to its city name (returns the input unchanged if unknown) ===
def map_iata_to_city(value):
    if not value:
        return value
    val = str(value).strip().upper()
    return IATA_MAP.get(val, value)

# === Helper to find positions of value in text (case-insensitive) ===
def find_positions(text, value):
    if not value or not text:
        return None
    s = str(value).strip()
    if not s:
        return None
    m = re.search(re.escape(s), text, flags=re.IGNORECASE)
    if m:
        return m.start(), m.end() - 1
    norm_text = re.sub(r'\s+', ' ', text).lower()
    norm_value = re.sub(r'\s+', ' ', s).lower()
    idx = norm_text.find(norm_value)
    if idx != -1:
        words = norm_value.split()
        pattern = r'\b' + r'\s+'.join(map(re.escape, words)) + r'\b'
        m2 = re.search(pattern, text, flags=re.IGNORECASE)
        if m2:
            return m2.start(), m2.end() - 1
    mapped = map_iata_to_city(s)
    if mapped and mapped.lower() != norm_value:
        norm_mapped = re.sub(r'\s+', ' ', mapped).lower()
        idx2 = norm_text.find(norm_mapped)
        if idx2 != -1:
            words = norm_mapped.split()
            pattern = r'\b' + r'\s+'.join(map(re.escape, words)) + r'\b'
            m2 = re.search(pattern, text, flags=re.IGNORECASE)
            if m2:
                return m2.start(), m2.end() - 1
    from difflib import get_close_matches
    words = norm_text.split()
    cm = get_close_matches(norm_value, words, n=1, cutoff=0.8)
    if cm:
        word = cm[0]
        m3 = re.search(re.escape(word), text, flags=re.IGNORECASE)
        if m3:
            return m3.start(), m3.end() - 1
    return None

# === Entity normalization mapping ===
# ENTITY_MAP = globals().get('ENTITY_MAP', {})

# === Blocklist certain spaCy labels and unwanted entity names ===
BLOCKLIST =  globals().get('BLOCKLIST', {'EVENT', 'FAC', 'LAW', 'PRODUCT', 'NORP', 'PERCENT', 'PERSON', 'WORK_OF_ART'})
# add common unwanted entity names
BLOCKLIST.update({'Act', 'Action', 'Agent', 'Airline', 'Airlines', 'Class', 'Classes', 'Confirmation', 'Email'
                 , 'Emails', 'Flight', 'Flights', 'Meal', 'Meals', 'Name', 'Names'
                 , 'Preference', 'Preferences', 'Seat', 'Seats', 'Status', 'Statuses', 'Ticket', 'Tickets'
                 , 'Type', 'Types', 'Value', 'Values'})

# === entity patterns for regex-based extraction ===
ENTITY_PATTERNS = [
    ("DepartureDate", re.compile(r"\b(20\d{2}[\/-]\d{1,2}[\/-]\d{1,2})\b")),
    ("DepartureDate", re.compile(r"\b(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[a-z]*\s+20\d{2})\b", re.IGNORECASE)),
    ("Time", re.compile(r"\b(\d{1,2}:(?:\d{2})(?:\s?[APMapm]{2})?)\b")),
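    # `(?<!\w)` replaces the leading `\b`, which cannot match before a currency symbol that follows
    # a space or starts the text; `\d+` also accepts amounts without thousands separators (e.g. $1900)
    ("Budget", re.compile(r"(?<!\w)\$\s?\d+(?:,\d{3})*(?:\.\d{1,2})?\b")),
    ("Budget", re.compile(r"(?<!\w)(?:usd|dollars|eur|€|£)\s?\d+(?:,\d{3})*(?:\.\d{1,2})?\b", re.IGNORECASE)),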
    ("AirportCode", re.compile(r"\b[A-Z]{3}\b")),
    ("NumPassengers", re.compile(r"\b(\d+)\s*(?:passengers|people|persons|pax)\b", re.IGNORECASE)),
]

# === Try to enable spaCy (dependency parsing + NER) ===
USE_SPACY = False
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
    USE_SPACY = True
except Exception:
    USE_SPACY = False

# === Try to import sklearn for optional intent classifier ===
HAVE_SKLEARN = False
clf = None
vectorizer = None
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from collections import Counter
    import joblib, os
    HAVE_SKLEARN = True
except Exception:
    HAVE_SKLEARN = False

# If True, force training and persisting the intent classifier even if training counts are small
FORCE_TRAIN = True

# === Helper for entity extraction ===
def extract_entities_from_text(text):
    date_spans = find_date_spans(text)
    entities = []
    for s, e, match, dt in date_spans:
        entities.append({'category': 'Date', 'offset': s, 'endPos': e, 'text': match})
    for etype, pattern in ENTITY_PATTERNS:
        try:
            for m in pattern.finditer(text):
                s, e = m.start(), m.end()-1
                if is_token_overlapping_spans(s, e, [(ds, de) for ds, de, _, _ in date_spans]):
                    continue
                if any(ent['offset'] <= s <= ent['endPos'] or ent['offset'] <= e <= ent['endPos'] for ent in entities):
                    continue
                entities.append({'category': etype, 'offset': s, 'endPos': e, 'text': text[s:e+1]})
        except re.error:
            continue
    for m in re.finditer(r"\b\d{3,6}\b", text):
        s, e = m.start(), m.end()-1
        if is_token_overlapping_spans(s, e, [(ds, de) for ds, de, _, _ in date_spans]):
            continue
        if any(ent['offset'] <= s <= ent['endPos'] or ent['offset'] <= e <= ent['endPos'] for ent in entities):
            continue
        if is_likely_budget_token(text, s, e, date_spans):
            span_text = text[s:e+1]
            entities.append({'category': 'Budget', 'offset': s, 'endPos': e, 'text': span_text})
    if USE_SPACY:
        doc = nlp(text)
        for ent in doc.ents:
            label = ent.label_
            if label in ('GPE', 'LOC', 'ORG'):
                et = 'Location'
            elif label in ('DATE',):
                et = 'Date'
            elif label in ('TIME',):
                et = 'Time'
            elif label in ('MONEY',):
                et = 'Budget'
            else:
                et = label
            start, end = ent.start_char, ent.end_char-1
            if any(existing['offset'] <= start <= existing['endPos'] or existing['offset'] <= end <= existing['endPos'] for existing in entities):
                continue
            if et in BLOCKLIST:
                continue
            entities.append({'category': et, 'offset': start, 'endPos': end, 'text': ent.text})
        # dependency-based heuristics: look for numeric tokens attached to budget/price words
        try:
            for token in doc:
                if token.like_num or token.ent_type_ == 'MONEY':
                    head = token.head.lemma_.lower() if token.head is not None else ''
                    left = token.nbor(-1).lemma_.lower() if token.i > 0 else ''
                    right = token.nbor(1).lemma_.lower() if token.i < len(doc)-1 else ''
                    cues = {'budget', 'price', 'cost', 'fare', 'costs', 'budgeted', 'pay', 'paying', 'expense', 'amount'}
                    if head in cues or left in cues or right in cues or token.ent_type_ == 'MONEY':
                        s = token.idx
                        e = token.idx + len(token.text) - 1
                        if not is_token_overlapping_spans(s, e, [(ds, de) for ds, de, _, _ in date_spans]):
                            if not any(existing['offset'] == s and existing['endPos'] == e for existing in entities):
                                entities.append({'category': 'Budget', 'offset': s, 'endPos': e, 'text': token.text})
        except Exception:
            pass
    for m in re.finditer(r"\b([A-Z]{3})\b", text):
        ctx = text[max(0, m.start()-20):m.end()+20].lower()
        if 'airport' in ctx or 'from' in ctx or 'to' in ctx or 'arriv' in ctx or 'depart' in ctx:
            start, end = m.start(), m.end()-1
            if not any(s['offset'] <= start <= s['endPos'] for s in entities):
                entities.append({'category': 'AirportCode', 'offset': start, 'endPos': end, 'text': m.group(1)})
    for e in entities:
        lab = e['category']
        if lab.lower() in ('location', 'gpe', 'loc'):
            e['category'] = 'Location'
    return entities

# === Helper to convert our internal entity form to the requested final form ===
def to_final_entities(entity_list):
    out = []
    for ent in entity_list:
        cat = ent.get('category') or ent.get('entity')
        if not cat or cat in BLOCKLIST:
            continue
        start = ent.get('offset')
        end = ent.get('endPos')
        if start is None or end is None:
            continue
        length = end - start + 1
        if length <= 0:
            continue
        out.append({'category': cat, 'offset': start, 'length': length})
    return out

# === Helper to Dedupe the found entities ===
def dedupe_entities(entities):
    """Remove exact duplicates and resolve overlapping spans using a priority order.
    Priority (higher first): Date > Budget > Time > Location > AirportCode > NumPassengers > others
    """
    if not entities:
        return []
    
    # === normalize keys first ===
    items = []
    for e in entities:
        items.append({
            'category': e.get('category'),
            'start': int(e.get('offset')),
            'end': int(e.get('endPos')),
            'text': e.get('text')
        })
    # === remove exact duplicates first ===
    uniq = []
    seen = set()
    for it in items:
        key = (it['category'], it['start'], it['end'])
        if key in seen:
            continue
        seen.add(key)
        uniq.append(it)

    # === resolve overlaps using priority ===
    priority = {'Date': 6, 'Budget': 5, 'Time': 4, 'Location': 3, 'AirportCode': 2, 'NumPassengers': 1}

    # === sort by start then by priority desc then by length desc ===
    uniq.sort(key=lambda x: (x['start'], -priority.get(x['category'], 0), -(x['end']-x['start'])))

    result = []
    for cand in uniq:
        overlap = False
        for kept in result:
            if not (cand['end'] < kept['start'] or cand['start'] > kept['end']):

                # overlapping spans: keep the one with higher priority or longer span if same priority
                p_c = priority.get(cand['category'], 0)
                p_k = priority.get(kept['category'], 0)
                if p_c > p_k:
                    # replace kept with cand
                    result.remove(kept)
                    result.append(cand)
                elif p_c == p_k:
                    # same priority: keep longer span
                    len_c = cand['end'] - cand['start']
                    len_k = kept['end'] - kept['start']
                    if len_c > len_k:
                        result.remove(kept)
                        result.append(cand)
                overlap = True
                break
        if not overlap:
            result.append(cand)

    # === convert back to output format ===
    out = []
    for r in result:
        out.append({
            'category': r['category'],
            'offset': r['start'],
            'endPos': r['end'],
            'text': r.get('text')
        })
    return out

# === Load dataset robustly (file may be a dict with 'conversations' or a list of conversations) ===
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

conversations = dataset.get('conversations') if isinstance(dataset, dict) and 'conversations' in dataset else dataset


# === If sklearn is available, attempt to load a persisted classifier + vectorizer if present. ===
# DO NOT train or persist models from the notebook; only load existing artifacts.
clf = None
vectorizer = None
if HAVE_SKLEARN:
    try:
        model_dir = Path(OUTPUT_FILE).parent / 'models'
        clf_file = model_dir / 'intent_clf.joblib'
        vect_file = model_dir / 'intent_vect.joblib'
        if clf_file.exists() and vect_file.exists():
            clf = joblib.load(clf_file)
            vectorizer = joblib.load(vect_file)
            print('✅ Loaded persisted intent classifier from', clf_file)
        else:
            print('ℹ️ sklearn available but no persisted classifier found; skipping training as requested')
    except Exception as e:
        print('⚠️ Failed to load persisted classifier:', e)
        clf = None

output = []


# === Iterate conversations and turns ===
for convo in (conversations or []):
    turns = convo.get('turns') if isinstance(convo, dict) else []
    for turn in (turns or []):
        speaker = (turn.get('speaker') or turn.get('author') or '').lower()
        if speaker != 'user':
            continue
        text = (turn.get('text') or '').strip()
        if not text:
            continue

        # === determine intent: prefer frame/actions ===
        intent = None
        found = False
        for fr in (turn.get('frames') or []):
            for act in (fr.get('actions') or []):
                act_type = (act.get('act') or act.get('type') or act.get('name') or '').lower().strip()
                if not act_type:
                    continue
                if act_type in INTENT_MAP:
                    intent = INTENT_MAP[act_type]
                    found = True
                    break
                base = act_type.split('_')[0]
                if base in INTENT_MAP:
                    intent = INTENT_MAP[base]
                    found = True
                    break
            if found:
                break

        # === classifier fallback ===
        if not intent and clf is not None and vectorizer is not None:
            try:
                pred = clf.predict(vectorizer.transform([text]))
                intent = pred[0]
            except Exception:
                intent = None

        # === heuristic fallback on text if still unknown ===
        if not intent:
            txt = text.lower()
            if any(k in txt for k in ['book', 'reserve', 'purchase', 'buy', 'ticket']):
                intent = 'BookFlight'
            elif any(k in txt for k in ['price', 'cost', 'fare', 'quote', 'how much', 'budget', '$']):
                intent = 'RequestInfo'
            elif any(k in txt for k in ['hello', 'hi', 'good morning', 'hey']):
                intent = 'Greet'
            elif any(k in txt for k in ['thanks', 'thank you']):
                intent = 'ThankYou'
            else:
                intent = 'BookFlight'

        # === entity extraction ===
        utter_entities = []


        # === prefer authoritative frame values when available ===
        for fr in (turn.get('frames') or []):
            candidates = []
            if isinstance(fr.get('info'), list):
                candidates = fr.get('info')
            elif isinstance(fr.get('slots'), list):
                candidates = fr.get('slots')
            elif isinstance(fr.get('attributes'), list):
                candidates = fr.get('attributes')
            for info in (candidates or []):
                slot = info.get('slot') or info.get('name') or info.get('key') or info.get('label')
                value = info.get('value') or info.get('text') or info.get('values') or info.get('valueText')
                if isinstance(value, list) and len(value) > 0:
                    value = value[0]
                if not slot or value is None:
                    continue
                # use the lowercased slot name as the entity category
                cat = slot.lower()
                pos = None
                try:
                    pos = find_positions(text, value)
                except Exception:
                    pos = None
                if pos:
                    s, e = pos
                    utter_entities.append({'category': cat, 'offset': s, 'endPos': e, 'text': str(value)})

        # === supplement with text-based extraction ===
        try:
            extracted = extract_entities_from_text(text)
        except Exception:
            extracted = []
        for e in (extracted or []):
            category = e.get('category') or e.get('entity')
            start = e.get('offset') if 'offset' in e else e.get('startPos')
            end = e.get('endPos') if 'endPos' in e else e.get('end')
            if category in BLOCKLIST:
                continue
            if start is None or end is None:
                continue
            dup = any(d['category'] == category and d['offset'] == start and d['endPos'] == end for d in utter_entities)
            if not dup:
                utter_entities.append({'category': category, 'offset': start, 'endPos': end, 'text': e.get('text')})

        # === spaCy dependency-based budget heuristics ===
        if USE_SPACY:
            try:
                doc = nlp(text)
                for ent in doc.ents:
                    if ent.label_.upper() == 'MONEY':
                        s, e = ent.start_char, ent.end_char - 1
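                        # inclusive spans overlap when each starts no later than the other ends
                        # (this also catches a MONEY span that fully contains an existing entity)
                        if not any(s <= existing['endPos'] and existing['offset'] <= e for existing in utter_entities):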
                            utter_entities.append({'category': 'Budget', 'offset': s, 'endPos': e, 'text': ent.text})
                for token in doc:
                    if token.like_num or token.ent_type_ == 'MONEY':
                        head = token.head.lemma_.lower() if token.head is not None else ''
                        left = token.nbor(-1).lemma_.lower() if token.i > 0 else ''
                        right = token.nbor(1).lemma_.lower() if token.i < len(doc)-1 else ''
                        cues = {'budget', 'price', 'cost', 'fare', 'costs', 'budgeted', 'pay', 'paying', 'expense', 'amount'}
                        if head in cues or left in cues or right in cues or token.ent_type_ == 'MONEY':
                            s = token.idx
                            e = token.idx + len(token.text) - 1
                            if not is_token_overlapping_spans(s, e, find_date_spans(text)):
                                if not any(existing['offset'] == s and existing['endPos'] == e for existing in utter_entities):
                                    utter_entities.append({'category': 'Budget', 'offset': s, 'endPos': e, 'text': token.text})
            except Exception:
                pass
        # === dedupe and finalize entities ===
        deduped = dedupe_entities(utter_entities)
        final_entities = to_final_entities(deduped)
        output.append({
            'intent': intent,
            'language': 'en-us',
            'text': text,
            'entities': final_entities
        })

# === make sure the output directory exists ===
Path(OUTPUT_FILE).parent.mkdir(parents=True, exist_ok=True)


def normalize_utterance_text(t):
    """Collapse whitespace, trim, and lowercase so equivalent utterances compare equal."""
    if not t:
        return ''
    import re
    return re.sub(r'\s+', ' ', t).strip().lower()

# === dedupe utterances by (normalized_text, intent) preserving first occurrence ===
seen_utts = set()
deduped_output = []
removed = 0
for item in output:
    key = (normalize_utterance_text(item.get('text', '')), item.get('intent'))
    if key in seen_utts:
        removed += 1
        continue
    seen_utts.add(key)
    deduped_output.append(item)

# === Save output (LUIS JSON) ===
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    json.dump(deduped_output, f, indent=2, ensure_ascii=False)

print(f'✅ Wrote {len(deduped_output)} utterances to: {OUTPUT_FILE} (removed {removed} duplicates)')
print("✅ LUIS JSON created successfully!")

ℹ️ sklearn available but no persisted classifier found; skipping training as requested
✅ Wrote 9562 utterances to: ../data/frames_dataset/luis_flight_booking.json (removed 845 duplicates)
✅ LUIS JSON created successfully!
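
The heuristic fallback in the cell above tests keywords with plain substring checks, so a short keyword such as 'hi' also matches inside words like "this". The sketch below shows one possible alternative that matches whole words or phrases only. The helper name `contains_keyword` is hypothetical and is not part of the notebook.

```python
import re

def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word or phrase."""
    lowered = text.lower()
    for kw in keywords:
        if not kw:
            continue
        kw = kw.lower()
        # Only guard the sides of the keyword that are alphanumeric, so that
        # a symbol keyword such as '$' still matches directly before a digit.
        prefix = r'(?<!\w)' if kw[0].isalnum() else ''
        suffix = r'(?!\w)' if kw[-1].isalnum() else ''
        if re.search(prefix + re.escape(kw) + suffix, lowered):
            return True
    return False

GREETING_KEYWORDS = ['hello', 'hi', 'good morning', 'hey']

# Plain substring test: 'hi' is found inside "this".
print('hi' in "is this the cheapest option?")
# Whole-word test: no greeting keyword is present.
print(contains_keyword("Is this the cheapest option?", GREETING_KEYWORDS))
# A real greeting is still detected.
print(contains_keyword("Hi, I need a flight", GREETING_KEYWORDS))
```

The trade-off is that whole-word matching no longer catches inflected forms: 'book' would not match "booking", so such forms would have to be listed explicitly or handled through lemmatization.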


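The overlap rule applied at the top of the cell above (a higher-priority category wins, and equal priorities keep the longer span) can be illustrated in isolation. This is a minimal sketch, not the notebook's `dedupe_entities`: the function names and the priority values are assumptions chosen for the example, and the spans use inclusive `start`/`end` offsets.

```python
def spans_overlap(a, b):
    """Inclusive [start, end] spans overlap when each starts no later than the other ends."""
    return a['start'] <= b['end'] and b['start'] <= a['end']

def resolve_overlaps(candidates, priority):
    """Keep one entity per overlapping region: higher priority first, then longer span."""
    kept = []
    for cand in candidates:
        # Compare against the first kept entity that overlaps the candidate.
        clash = next((k for k in kept if spans_overlap(cand, k)), None)
        if clash is None:
            kept.append(cand)
            continue
        p_c = priority.get(cand['category'], 0)
        p_k = priority.get(clash['category'], 0)
        len_c = cand['end'] - cand['start']
        len_k = clash['end'] - clash['start']
        if p_c > p_k or (p_c == p_k and len_c > len_k):
            kept.remove(clash)
            kept.append(cand)
    return kept

# Hypothetical priorities: a Date should win over a bare number inside it.
PRIORITY = {'Date': 2, 'CARDINAL': 1}

candidates = [
    {'category': 'CARDINAL', 'start': 69, 'end': 70, 'text': '13'},
    {'category': 'Date', 'start': 52, 'end': 76, 'text': 'Saturday, August 13, 2016'},
    {'category': 'CARDINAL', 'start': 82, 'end': 82, 'text': '8'},
]

for ent in resolve_overlaps(candidates, PRIORITY):
    print(ent['category'], ent['start'], ent['end'])
```

Here the "13" inside the date is dropped in favour of the full Date span, while the "8" (number of adults) does not overlap anything and is kept.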
In [11]:
from itertools import islice

# Preview the first 250 lines of the generated JSON file
with open(OUTPUT_FILE, 'r', encoding='utf-8') as file:
    for line in islice(file, 250):
        print(line.strip())

{
"intents": [
{
"name": "BookFlight"
}
],
"language": "en-us",
"utterances": [
{
"text": "I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
"intent": "BookFlight",
"entities": [
{
"entity": "Budget",
"startPos": 117,
"endPos": 120,
"text": "1700"
},
{
"entity": "Date",
"startPos": 52,
"endPos": 76,
"text": "Saturday, August 13, 2016"
},
{
"entity": "CARDINAL",
"startPos": 82,
"endPos": 82,
"text": "8"
}
]
},
{
"text": "Yes, how about going to Neverland from Caprica on August 13, 2016 for 5 adults. For this trip, my budget would be 1900.",
"intent": "BookFlight",
"entities": [
{
"entity": "Budget",
"startPos": 114,
"endPos": 117,
"text": "1900"
},
{
"entity": "Location",
"startPos": 39,
"endPos": 45,
"text": "Caprica"
},
{
"entity": "Date",
"startPos": 50,
"endPos": 64,
"text": "August 13, 2016"
},
{
"entity": "CARDINAL",
"startPos": 70,
"endPos": 70,
"text": "5"
}
]
},
{
"text": "I have no flexibility for dates
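
The entity positions in the preview above are character offsets with an inclusive `endPos`. A quick way to sanity-check labelled data before uploading it is to slice the utterance text and compare the result with the stored entity text. The sketch below does this for the first utterance of the preview; the helper name `check_inclusive_offsets` is hypothetical.

```python
def check_inclusive_offsets(utterance):
    """Return the entities whose [startPos, endPos] slice differs from their text."""
    text = utterance['text']
    mismatches = []
    for ent in utterance['entities']:
        # endPos is inclusive, so the Python slice must stop at endPos + 1.
        if text[ent['startPos']:ent['endPos'] + 1] != ent['text']:
            mismatches.append(ent)
    return mismatches

# First utterance copied from the preview above.
sample = {
    "text": "I'd like to book a trip to Atlantis from Caprica on Saturday, "
            "August 13, 2016 for 8 adults. I have a tight budget of 1700.",
    "entities": [
        {"entity": "Budget", "startPos": 117, "endPos": 120, "text": "1700"},
        {"entity": "Date", "startPos": 52, "endPos": 76, "text": "Saturday, August 13, 2016"},
        {"entity": "CARDINAL", "startPos": 82, "endPos": 82, "text": "8"},
    ],
}

print(check_inclusive_offsets(sample))  # [] : every slice matches its entity text
```

An entity stored with an exclusive end (for example `endPos` 121 for "1700") would be reported as a mismatch, because the slice would then include the trailing period.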

# Authoring the app

### Get an API key
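List the keys of the resource with the Azure CLI, replacing the placeholders with your own resource group and resource name:

az cognitiveservices account keys list --resource-group <resource-group-name> --name <resource-name>

For the resource used in this notebook:

az cognitiveservices account keys list --resource-group flyme_resource --name flyme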
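
Rather than pasting a key into the notebook, one option is to export it as an environment variable and read it at run time. The sketch below is a small helper that fails with an explicit message when a variable is missing. The helper name `require_env` is hypothetical, and the variable names are the ones that appear in the commented lines of the example cell further down.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "export it before starting the notebook."
        )
    return value

# Typical use (left commented out so the cell runs without any configuration):
# clu_endpoint = require_env("AZURE_CONVERSATIONS_ENDPOINT")
# clu_key = require_env("AZURE_CONVERSATIONS_KEY")
```

Note that `os.environ[...]` expects the name of a variable, not its value: indexing it with the endpoint URL itself raises a `KeyError`.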

### Create ConversationAnalysisClient

In [1]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

endpoint = "https://flyme.cognitiveservices.azure.com/"
# "key1" is a placeholder: substitute one of the keys returned by the Azure CLI command above
credential = AzureKeyCredential("key1")
client = ConversationAnalysisClient(endpoint, credential)

### Create ConversationAuthoringClient

In [2]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations.authoring import ConversationAuthoringClient

endpoint = "https://flyme.cognitiveservices.azure.com/"
credential = AzureKeyCredential("key1")
client = ConversationAuthoringClient(endpoint, credential)

### Create a client with an Azure Active Directory Credential

In [3]:
from azure.ai.language.conversations import ConversationAnalysisClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = ConversationAnalysisClient(endpoint="https://flyme.cognitiveservices.azure.com/", credential=credential)

In [None]:
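# The conversation clients used in this notebook are provided by azure-ai-language-conversations
# (DefaultAzureCredential is provided by azure-identity)
# pip show azure-ai-language-conversations
# pip install azure-ai-language-conversations azure-identity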

### Example using AzureKeyCredential

In [5]:
# import libraries
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

# get secrets from environment variables (os.environ is indexed by variable name, not by value)
clu_endpoint = os.environ["AZURE_CONVERSATIONS_ENDPOINT"]
clu_key = os.environ["AZURE_CONVERSATIONS_KEY"]
# the fallbacks assume the project is named "booking_travel" and the deployment "luismodel"
project_name = os.environ.get("AZURE_CONVERSATIONS_PROJECT_NAME", "booking_travel")
deployment_name = os.environ.get("AZURE_CONVERSATIONS_DEPLOYMENT_NAME", "luismodel")

# analyze query
client = ConversationAnalysisClient(clu_endpoint, AzureKeyCredential(clu_key))
with client:
    query = "Hi there! So, between September 7 and 27 I would like to see what is available from Curitiba to Mexico City. My budget is around 1900 dollars and I will be traveling with my wife and two kids. Can you help me with that?"
    result = client.analyze_conversation(
        task={
            "kind": "Conversation",
            "analysisInput": {
                "conversationItem": {
                    "participantId": "1",
                    "id": "1",
                    "modality": "text",
                    "language": "en",
                    "text": query
                },
                "isLoggingEnabled": False
            },
            "parameters": {
                "projectName": project_name,
                "deploymentName": deployment_name,
                "verbose": True
            }
        }
    )

# view result
print("query: {}".format(result["result"]["query"]))
print("project kind: {}\n".format(result["result"]["prediction"]["projectKind"]))

print("top intent: {}".format(result["result"]["prediction"]["topIntent"]))
print("category: {}".format(result["result"]["prediction"]["intents"][0]["category"]))
print("confidence score: {}\n".format(result["result"]["prediction"]["intents"][0]["confidenceScore"]))

print("entities:")
for entity in result["result"]["prediction"]["entities"]:
    print("\ncategory: {}".format(entity["category"]))
    print("text: {}".format(entity["text"]))
    print("confidence score: {}".format(entity["confidenceScore"]))
    if "resolutions" in entity:
        print("resolutions")
        for resolution in entity["resolutions"]:
            print("kind: {}".format(resolution["resolutionKind"]))
            print("value: {}".format(resolution["value"]))
    if "extraInformation" in entity:
        print("extra info")
        for data in entity["extraInformation"]:
            print("kind: {}".format(data["extraInformationKind"]))
            if data["extraInformationKind"] == "ListKey":
                print("key: {}".format(data["key"]))
            if data["extraInformationKind"] == "EntitySubtype":
                print("value: {}".format(data["value"]))
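
Once the call succeeds, the response dictionary read by the print statements above can be condensed into the structure the bot needs: the top intent plus the recognised entity texts grouped by category. The sketch below assumes the response shape used in that cell (`result["result"]["prediction"]` with `topIntent` and `entities`). The helper name `summarize_prediction`, the confidence threshold, and the mock response with its entity categories are all hypothetical and only serve the illustration.

```python
def summarize_prediction(result, min_confidence=0.0):
    """Group entity texts by category and return them with the top intent."""
    prediction = result["result"]["prediction"]
    entities_by_category = {}
    for entity in prediction.get("entities", []):
        # Skip entities the model is not confident enough about.
        if entity.get("confidenceScore", 0.0) < min_confidence:
            continue
        entities_by_category.setdefault(entity["category"], []).append(entity["text"])
    return {"intent": prediction["topIntent"], "entities": entities_by_category}

# Hand-written mock response (not real service output), shaped like the fields read above.
mock_result = {
    "result": {
        "query": "From Curitiba to Mexico City, budget 1900 dollars",
        "prediction": {
            "projectKind": "Conversation",
            "topIntent": "BookFlight",
            "intents": [{"category": "BookFlight", "confidenceScore": 0.9}],
            "entities": [
                {"category": "Location", "text": "Curitiba", "confidenceScore": 0.8},
                {"category": "Location", "text": "Mexico City", "confidenceScore": 0.7},
                {"category": "Budget", "text": "1900", "confidenceScore": 0.3},
            ],
        },
    }
}

print(summarize_prediction(mock_result, min_confidence=0.5))
```

With a threshold of 0.5 the low-confidence Budget entity is left out, which is the kind of gap the bot would then fill by asking the user a follow-up question.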