Normalize transcript lines:
- Lowercase
- Remove punctuation
- Tokenize (split into words)
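
The three steps above can be sketched as one helper. This is a minimal sketch, not the notebook's own function: `tokenize_line` is a hypothetical name, and it replaces punctuation with a space (rather than deleting it) so that hyphenated words are not glued together.

```python
import re

def tokenize_line(line):
    """Lowercase, replace punctuation with spaces, and split into words."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", line.lower())
    return cleaned.split()

tokens = tokenize_line("LOT five, Tango-Alfa.")
print(tokens)
```

A leading `[HH:MM:SS]` stamp would be split into three numeric tokens by this helper, so it should be stripped or parsed before tokenizing.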

Normalize callsigns:
- Convert all to lowercase
- Optionally split alphanumerics (e.g., LOT3YM → lot 3 y m)

Match lines to callsigns:
- If a line contains all parts of a normalized callsign, assign it
- If multiple matches, assign all (or pick the best using priority rules)
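
A rough sketch of the callsign splitting and the "all parts present" rule follows. The names `callsign_parts` and `callsigns_in_line` are hypothetical, and the sketch assumes the transcript line is already tokenized and writes digits and letters as single characters.

```python
import re

def callsign_parts(callsign):
    """'LOT3YM' -> ['lot', '3', 'y', 'm']: leading letters stay together, the rest is split per character."""
    cs = callsign.strip().lower()
    prefix = re.match(r"[a-z]*", cs).group()
    return ([prefix] if prefix else []) + list(cs[len(prefix):])

def callsigns_in_line(tokens, known_callsigns):
    """Return every callsign whose parts all occur somewhere in the tokenized line."""
    present = set(tokens)
    matched = []
    for cs in known_callsigns:
        parts = callsign_parts(cs)
        if parts and set(parts) <= present:
            matched.append(cs)
    return matched

tokens = "lot 3 y m descend flight level 8 0".split()
print(callsigns_in_line(tokens, ["LOT3YM", "LOT4NJ", "LOT38"]))
```

Here both LOT3YM and LOT38 are returned, because the set test ignores word order and picks up the "8" from the flight level. This is the kind of ambiguity the priority rules would have to resolve.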

Fuzzy but ordered phonetic matching:
- Soft matching: tolerate ASR noise or variation (e.g., "wizz air force six" ≈ WZZ46)
- Order-aware: word order matters (e.g., "lot four november juliett" ≠ "juliett november four lot").
- Longest/best match scoring: Prefer longer, more specific callsigns (e.g., LOT4NJ > LOT4).
- Not necessarily consecutive: Allow intermediate words ("ruzyne three seven delta bravo" still matches WZZ37DB).

- Score each known callsign by how many words from its phonetic form appear in order in the transcript line (even if non-consecutive).
- Return the top matches (with scores).
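
One possible way to score "words appearing in order, not necessarily consecutively" is the longest common subsequence (LCS) between the callsign's phonetic words and the line's words. The sketch below uses that technique as an illustration; it is not the scoring function used in the code cells further down, and `lcs_length` and `top_matches` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def top_matches(line_words, phonetic_forms, k=3):
    """Score every callsign against the line and return the k best (callsign, score) pairs."""
    scored = [(cs, lcs_length(words, line_words)) for cs, words in phonetic_forms.items()]
    scored = [item for item in scored if item[1] > 0]
    return sorted(scored, key=lambda item: -item[1])[:k]

forms = {
    "WZZ37DB": "wizz air three seven delta bravo".split(),
    "LOT4NJ": "lot four november juliett".split(),
}
line = "ruzyne three seven delta bravo contact radar".split()
print(top_matches(line, forms))
```

Because the score is order-aware, "lot four november juliett" against "juliett november four lot" scores only 1, while the misheard "ruzyne" prefix still leaves WZZ37DB with four matching words.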

In [None]:
from traffic.core import Traffic

adsb = Traffic.from_file("adsb/EPWA-epwa_app-Jun-23-2025-1000Z.parquet")
adsb.data.callsign.unique()

In [104]:
# Watch out: transcripts use the spelling "alfa", not "alpha"
NATO = {
    'a': 'alfa', 'b': 'bravo', 'c': 'charlie', 'd': 'delta', 'e': 'echo',
    'f': 'foxtrot', 'g': 'golf', 'h': 'hotel', 'i': 'india', 'j': 'juliett',
    'k': 'kilo', 'l': 'lima', 'm': 'mike', 'n': 'november', 'o': 'oscar',
    'p': 'papa', 'q': 'quebec', 'r': 'romeo', 's': 'sierra', 't': 'tango',
    'u': 'uniform', 'v': 'victor', 'w': 'whiskey', 'x': 'xray', 'y': 'yankee', 'z': 'zulu'
}

# To be completed
CALLSIGN_PREFIXES = {
    "AF": "air france",
    "BAW": "speedbird",
    "LOT": "lot",
    "WZZ": "wizz air",
    "UAE": "emirates",
    "DLH": "lufthansa",
    "QTR": "qatar",
    "TAP": "air portugal",
    "TVP": "jet travel",  
    "ENT": "ent air",
    "MGH": "mavi",
    "FIN": "finnair",
    "MOC": "monarch cargo",
    "SWR": "swiss",
    "RYR": "ryan air",
}

num2words = {0: 'zero',
             1: 'one', 
             2: 'two', 
             3: 'three', 
             4: 'four', 
             5: 'five',
             6: 'six', 
             7: 'seven', 
             8: 'eight', 
             9: 'nine',
             }

In [105]:
import re

def normalize(text):
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

def split_callsign(cs):
    # "LOT3YM" -> ['lot', '3', 'ym'] (runs of letters and runs of digits)
    return re.findall(r'[a-z]+|\d+', cs.lower())

def callsign_to_words(callsign):
    callsign = callsign.strip().upper()

    # Try 3-letter and 2-letter prefixes
    prefix = None
    for i in [3, 2]:
        maybe_prefix = callsign[:i]
        if maybe_prefix in CALLSIGN_PREFIXES:
            prefix = CALLSIGN_PREFIXES[maybe_prefix]
            rest = callsign[i:]
            break
    else:
        # No known prefix match — use raw characters
        prefix = callsign[:3].lower()
        rest = callsign[3:]

    parts = []

    for char in rest:
        if char.isdigit():
            parts.append(num2words[int(char)]) 
        elif char.isalpha():
            parts.append(NATO[char.lower()])

    return prefix + " " + " ".join(parts)

# Strict matching
def identify_callsigns_transcript(transcript_lines, known_callsigns):
    callsign_variants = {
        cs: callsign_to_words(cs) for cs in known_callsigns
    }

    matches = []
    for line in transcript_lines:
        norm_line = normalize(line)
        # Plain substring test, so "lot four zero" also matches inside "lot four zero eight"
        matched = [cs for cs, variant in callsign_variants.items() if variant in norm_line]
        matches.append({"line": line, "callsigns": matched})

    return matches

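def callsign_match_score(words_line, callsign_words):
    """
    Best in-order match between a transcript line and a callsign.

    For every suffix of callsign_words (so a misheard or missing prefix is
    tolerated), count how many of its leading words appear in words_line in
    the same order, not necessarily consecutively. Return the highest count.
    """
    best_score = 0
    # Drop 0, 1, 2, ... leading callsign words and rescan the line each time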
    for start in range(len(callsign_words)):
        trimmed = callsign_words[start:]
        i = 0  # pointer in words_line
        j = 0  # pointer in trimmed
        score = 0
        while i < len(words_line) and j < len(trimmed):
            if words_line[i] == trimmed[j]:
                score += 1
                j += 1
            i += 1
        best_score = max(best_score, score)
    return best_score

def identify_callsigns_transcript_soft(transcript_lines, known_callsigns, threshold=3):
    """
    Returns best-matching callsigns for each line with soft phonetic scoring.
    """
    matches = []
    callsign_word_forms = {
        cs: callsign_to_words(cs) for cs in known_callsigns
    }

    for line in transcript_lines:
        # The first token is the "[HH:MM:SS]" timestamp; drop it before matching
        words_line = normalize(line).split()[1:]

        scored_matches = []
        for cs, cs_words in callsign_word_forms.items():
            cs_words = cs_words.split()  # phonetic phrase -> list of words
            score = callsign_match_score(words_line, cs_words)
            if score >= threshold:
                scored_matches.append((cs, score))

        # Sort by descending score
        scored_matches.sort(key=lambda x: -x[1])

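        matches.append({
            "line": line,
            # Alternative: keep every candidate above the threshold
            # "callsigns": [cs for cs, score in scored_matches],
            # Top-scoring callsign only; empty list when nothing reaches the threshold
            "callsigns": [scored_matches[0][0]] if scored_matches else [],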
            "scores": scored_matches
        })

    return matches

In [106]:
print(callsign_to_words('WZZ37DB'))
print(callsign_to_words('AF756'))
print(callsign_to_words('UAE98X'))
print(callsign_to_words('LOT40'))

wizz air three seven delta bravo
air france seven five six
emirates nine eight xray
lot four zero


What to modify:
- The prefix is not necessarily 3 letters; it can be 2, like AF.
- Digits in the callsign should be converted to words.
- Don't look for a perfect match; look for the most probable one (e.g., the callsign sharing the most words with the transcript line).
- When scanning the transcript, the callsign should be recognised among consecutive words.
- Currently the callsign is converted to words and searched for in the communication. What if a callsign has not been detected in the ADS-B data? Try the inverse as well: identify the callsign in the transcript, then look it up among the detected callsigns.
- Try to get a better temporal match by comparing the trajectory touchdown time with the time of the conversation in the transcript. This has to be done when building the chunks.
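
The inverse direction (transcript words to callsign) could be sketched as follows. Everything here is illustrative: `extract_callsigns` is a hypothetical helper, `TELEPHONY_TO_PREFIX` is a small made-up subset of designators, and the cap of four suffix characters (`max_suffix`) is an assumption.

```python
DIGITS = "zero one two three four five six seven eight nine".split()
LETTERS = ("alfa bravo charlie delta echo foxtrot golf hotel india juliett kilo lima mike "
           "november oscar papa quebec romeo sierra tango uniform victor whiskey xray "
           "yankee zulu").split()
WORD_TO_CHAR = {word: str(i) for i, word in enumerate(DIGITS)}
WORD_TO_CHAR.update({word: word[0].upper() for word in LETTERS})

# Illustrative subset of telephony designators (hypothetical table)
TELEPHONY_TO_PREFIX = {("wizz", "air"): "WZZ", ("lot",): "LOT", ("air", "france"): "AF"}

def extract_callsigns(words, max_suffix=4):
    """Find '<telephony> <digit/letter words>' runs and rebuild ICAO-style callsigns."""
    found = []
    for i in range(len(words)):
        for name, prefix in TELEPHONY_TO_PREFIX.items():
            n = len(name)
            if tuple(words[i:i + n]) != name:
                continue
            suffix = []
            # Collect digit/letter words right after the telephony name
            for word in words[i + n:i + n + max_suffix]:
                if word not in WORD_TO_CHAR:
                    break
                suffix.append(WORD_TO_CHAR[word])
            if suffix:
                found.append(prefix + "".join(suffix))
    return found

words = "one two five zero five five wizz air three seven delta bravo".split()
spoken = extract_callsigns(words)
known = {"LOT5TA", "LOT3MH"}          # stand-in for the ADS-B callsign set
missing = [cs for cs in spoken if cs not in known]
print(spoken, missing)
```

Callsigns that end up in `missing` are spoken on the frequency but absent from the detected set. A callsign followed directly by other spoken digits (a heading or a frequency) would be over-extended by this sketch, up to `max_suffix` characters.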


WARNINGS:
- WARNING: "ent air four zero eight eight" is transcribed as "czech air force zero eight eight", "tango four zero eight eight", "turkish four zero eight eight", or "and turn four zero eight eight" (not consistent across the communications). Soft matching combined with more precise timing might solve this.
- WARNING: Some callsigns heard on the ATC frequency are not present in the ADS-B data. This is because the frequency serves both EPWA and EPMO (they share the same TMA), and it is sometimes also used for departures. Instead of looking only for landing aircraft, look for all aircraft present in the TMA (departures + arrivals at EPWA are not sufficient).
- WARNING: What about callsigns that are not in the ICAO databases (possibly charter or small aircraft), like PLF, JDI, FDB, GCK, SEH?
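
To see which designators still lack a telephony entry, one option is to count the prefixes in the ADS-B callsigns that are missing from the prefix table. This is a sketch only: `unknown_prefixes` is a hypothetical helper, and it approximates the designator by the leading run of letters, which may be wrong for callsigns whose suffix starts with a letter.

```python
import re
from collections import Counter

def unknown_prefixes(callsigns, known_prefixes):
    """Count leading-letter prefixes that have no entry in known_prefixes."""
    counts = Counter()
    for cs in callsigns:
        m = re.match(r"[A-Z]+", cs.strip().upper())
        if m and m.group() not in known_prefixes:
            counts[m.group()] += 1
    return counts.most_common()

sample = ["PLF12", "PLF7", "LOT5TA", "FDB1846", "AF756"]
print(unknown_prefixes(sample, {"LOT", "AF"}))
```

The most frequent unknown prefixes are listed first, which gives a natural order for completing the table.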

In [107]:
with open("transcripts/EPWA-epwa_app-Jun-23-2025-1000Z.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

callsigns = adsb.data.callsign.unique()

matches = identify_callsigns_transcript(lines, callsigns)

for i, match in enumerate(matches):
    print(f"{i}: {match['line']} -> {match['callsigns']}")

0: [10:00:02] thousand looking for traffic lot five tango alfa -> ['LOT5TA']
1: [10:00:13] ruzyne three seven delta bravo contact radar one two five zero five five -> []
2: [10:00:19] one two five zero five five wizz air three seven delta bravo -> ['WZZ37DB']
3: [10:00:24] approach czech air force zero eight eight descending seven thousand feet -> []
4: [10:00:30] tango four zero eight eight qality descend seven thousand qnh one zero one zero traffic below -> []
5: [10:00:34] descending seven thousand qnh one zero one zero copy that turkish four zero eight eight -> []
6: [10:01:22] and air force zero eight eight descend altitude five thousand feet -> []
7: [10:01:24] descending five thousand and turn four zero eight eight -> []
8: [10:01:47] six five tango alfa for contact approach one two five decimal zero five -> []
9: [10:01:52] one two five zero five five lot five tango -> []
10: [10:01:56] approach vietnam lot three mike hotel heading three three zero -> ['LOT3MH']
11: [10:02:02] 

In [108]:
# Fraction of transcript lines with at least one strictly matched callsign
matched = sum(1 for match in matches if match['callsigns'])
matched / len(matches)

0.273972602739726

In [109]:
with open("transcripts/EPWA-epwa_app-Jun-23-2025-1000Z.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

callsigns = adsb.data.callsign.unique()

matches = identify_callsigns_transcript_soft(lines, callsigns)

for i, match in enumerate(matches):
    print(f"{i}: {match['line']} -> {match['callsigns']}")

0: [10:00:02] thousand looking for traffic lot five tango alfa -> ['LOT5TA']
1: [10:00:13] ruzyne three seven delta bravo contact radar one two five zero five five -> ['WZZ37DB']
2: [10:00:19] one two five zero five five wizz air three seven delta bravo -> ['WZZ37DB']
3: [10:00:24] approach czech air force zero eight eight descending seven thousand feet -> ['ENT4088']
4: [10:00:30] tango four zero eight eight qality descend seven thousand qnh one zero one zero traffic below -> ['ENT4088']
5: [10:00:34] descending seven thousand qnh one zero one zero copy that turkish four zero eight eight -> ['ENT4088']
6: [10:01:22] and air force zero eight eight descend altitude five thousand feet -> ['ENT4088']
7: [10:01:24] descending five thousand and turn four zero eight eight -> ['ENT4088']
8: [10:01:47] six five tango alfa for contact approach one two five decimal zero five -> ['LOT6529']
9: [10:01:52] one two five zero five five lot five tango -> ['LOT5TA']
10: [10:01:56] approach vietnam lot 

WARNING: Soft matching might mix up heading/speed/frequency clearances with callsigns (e.g., in line 8 the frequency digits contribute to the LOT6529 match). This might be improved through time matching, but not solved.
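
A time filter of this kind could look like the sketch below. It assumes each transcript line starts with a `[HH:MM:SS]` stamp and that a first-seen/last-seen interval is available per callsign. The `presence` dictionary, its times, and the two-minute `margin` are invented for illustration.

```python
from datetime import date, datetime, timedelta

def parse_timestamp(line, day):
    """Read the leading '[HH:MM:SS]' stamp of a transcript line as a datetime on `day`."""
    clock = datetime.strptime(line[1:9], "%H:%M:%S").time()
    return datetime.combine(day, clock)

def active_callsigns(when, presence, margin=timedelta(minutes=2)):
    """Callsigns whose (first_seen, last_seen) interval, widened by `margin`, contains `when`."""
    return [cs for cs, (first, last) in presence.items()
            if first - margin <= when <= last + margin]

day = date(2025, 6, 23)
# Hypothetical presence intervals; real ones would come from the ADS-B trajectories
presence = {
    "WZZ37DB": (datetime(2025, 6, 23, 9, 50), datetime(2025, 6, 23, 10, 5)),
    "LOT6529": (datetime(2025, 6, 23, 10, 30), datetime(2025, 6, 23, 10, 45)),
}
when = parse_timestamp("[10:00:13] ruzyne three seven delta bravo contact radar", day)
candidates = active_callsigns(when, presence)
print(candidates)
```

Restricting the soft matcher to `candidates` removes callsigns that were not in the airspace at that time, but digits from a clearance can still match a callsign that was.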