<a href="https://colab.research.google.com/github/kpe/notebooks/blob/master/atis_resplit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import os
import itertools

from collections import defaultdict, Counter
from random import Random
from functools import partial

from urllib import request, parse

import numpy as np

# The ATIS Dataset

The ATIS dataset is a widely used [TODO add the references] benchmark for intent classification and slot filling models in dialog systems.

This notebook will:

  1. Fetch the ATIS dataset from https://github.com/yvchen/JointSLU
  2. Explore the slot and intent label distribution across the train/dev/test dataset split
  3. Provide an alternative (balanced) data split

## Fetching the dataset
The ATIS dataset from [yvchen/JointSLU](https://github.com/yvchen/JointSLU) is provided as a train, dev and test split in text format. Each line holds one sample: the whitespace-separated tokens, then a tab, then the IOB slot tags followed by the intent label, i.e.:

    BOS from denver to baltimore EOS O O B-fromloc.city_name O B-toloc.city_name atis_flight
    BOS ground transportation in denver EOS  O O O O B-city_name atis_ground_service
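
As a minimal offline sketch of this format, the first sample line above can be split by hand. It assumes that a single tab separates the tokens from the labels; `parse_atis_line` is a hypothetical helper name used only for illustration.

```python
def parse_atis_line(line):
    """Split one ATIS line into (tokens, slots, intent).

    Assumption: tokens and labels are separated by a single tab.
    """
    tok_part, label_part = line.rstrip("\n").split("\t")
    toks = tok_part.split()
    labels = label_part.split()
    # The last label is the intent; it sits where the EOS tag would be,
    # so EOS gets an explicit 'O' tag to keep tokens and slots aligned.
    slots, intent = labels[:-1] + ["O"], labels[-1]
    if len(toks) != len(slots):
        raise ValueError("token/slot length mismatch")
    return toks, slots, intent

sample = ("BOS from denver to baltimore EOS\t"
          "O O B-fromloc.city_name O B-toloc.city_name atis_flight")
toks, slots, intent = parse_atis_line(sample)
for tok, slot in zip(toks, slots):
    print("{:>10s}  {}".format(tok, slot))
print("intent:", intent)
```

Printing tokens and tags side by side makes it easy to see that every token, including `BOS` and `EOS`, carries exactly one slot tag.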

In [12]:
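ATIS_BASE_URL = "https://raw.githubusercontent.com/yvchen/JointSLU/master/data/"

def load_atis_ds(fname, base_url=ATIS_BASE_URL):
    """Download one ATIS split; return ([(tokens, slots, intent), ...], split_name)."""
    res = []
    with request.urlopen(parse.urljoin(base_url, fname)) as req:
        # fall back to UTF-8 if the server does not report a charset
        charset = req.info().get_content_charset() or 'utf-8'
        for line in req.readlines():
            line = line.decode(charset)
            # tokens and labels are separated by a tab
            toks, si = map(str.split, line.split("\t"))
            # the last label is the intent; EOS gets an 'O' slot tag in its place
            slots, intent = si[:-1] + ['O'], si[-1]
            assert len(toks) == len(slots)
            res.append((toks, slots, intent))
    # e.g. 'atis-2.train.w-intent.iob' -> 'atis-2.train'
    ds_name = '.'.join(fname.split('.')[:2])
    print('{:>20s}: {:4d}'.format(ds_name, len(res)))
    return res, ds_name

atis = {name: ds for ds, name in
        map(load_atis_ds, map(lambda name: name + '.w-intent.iob',
                              ['atis.test', 'atis-2.dev', 'atis-2.train']))}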


           atis.test:  893
          atis-2.dev:  500
        atis-2.train: 4478


## Exploring the data splits

In [17]:
# a single entry looks like this
toks,slots,intent = atis['atis.test'][0]
print(' input:', ' '.join(toks))
print(' slots:', ' '.join(slots))
print('intent:',          intent)

 input: BOS i would like to find a flight from charlotte to las vegas that makes a stop in st. louis EOS
 slots: O O O O O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name O O O O O B-stoploc.city_name I-stoploc.city_name O
intent: atis_flight
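
To make the IOB slot tags easier to read, consecutive `B-`/`I-` tags can be grouped into slot values. The following is a minimal sketch on the sample printed above; `iob_spans` is a hypothetical helper, not part of the notebook, and it simply ignores an `I-` tag that does not continue an open span.

```python
def iob_spans(toks, slots):
    """Group IOB-tagged tokens into (slot_name, text) pairs."""
    spans, current = [], None
    for tok, tag in zip(toks, slots):
        if tag.startswith("B-"):
            # a B- tag always opens a new span
            current = (tag[2:], [tok])
            spans.append(current)
        elif (tag.startswith("I-") and current is not None
              and current[0] == tag[2:]):
            # an I- tag extends the open span of the same slot
            current[1].append(tok)
        else:
            # 'O' or a stray I- tag: close the open span (assumed policy)
            current = None
    return [(name, " ".join(words)) for name, words in spans]

toks = ("BOS i would like to find a flight from charlotte to las vegas "
        "that makes a stop in st. louis EOS").split()
slots = ("O O O O O O O O O B-fromloc.city_name O B-toloc.city_name "
         "I-toloc.city_name O O O O O B-stoploc.city_name "
         "I-stoploc.city_name O").split()

spans = iob_spans(toks, slots)
for name, text in spans:
    print("{:>20s}: {}".format(name, text))
```

Multi-token values such as "las vegas" end up in a single span because the `I-` tag continues the preceding `B-` tag of the same slot.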


Let's check whether all labels are present in both the train and the test (or dev) data splits.

In [42]:
test,dev,train = map(atis.get, ['atis.test','atis-2.dev','atis-2.train'])

def subdict(d, keys):
  """Restrict dict d to the given keys."""
  return {key: d[key] for key in set(keys)}

def atis_label_counts(ds):
  """Count token, slot and intent label occurrences in a list of samples."""
  tokens, slots, intents = zip(*ds)
  token_labs  = Counter(itertools.chain.from_iterable(tokens))
  slot_labs   = Counter(itertools.chain.from_iterable(slots))
  intent_labs = Counter(intents)
  return token_labs, slot_labs, intent_labs

def check_atis_split(train, dev, test):
  (_,ts,ti), (_,ds,di), (_,es,ei) = map(atis_label_counts,
                                        [train, dev, test])

  lens = np.array(list(map(len, [train,dev,test])))
  print("sample count: {:5d} splitted into:".format(lens.sum()))
  for dslen, dsname in zip(lens, ['train', 'dev', 'test']):
    print("       {:>5s}: {:5d} ({:.3f})".format(dsname, dslen, dslen/lens.sum()))
  

  
  # map slot/intent labels to usage frequency
  token_labs,slot_labs,intent_labs = atis_label_counts(train+dev+test)
  sfreqs, ifreqs = map(partial(partial,subdict), [slot_labs, intent_labs])

  print("intent count:", len(intent_labs))
  print("  slot count:", len(slot_labs))
  print(" token count:", len(token_labs))
  
  print("missing data for slot/intent labels:")
  ts,ti,ds,di,es,ei = map(lambda s: set(s.keys()), [ts,ti,ds,di,es,ei])
  for dsname, mints, mslots in [("train", 
                                 ifreqs(di.union(ei).difference(ti)),
                                 sfreqs(ds.union(es).difference(ts))),
                                ("dev", 
                                 ifreqs(ti.difference(di)),
                                 sfreqs(ts.difference(ds))),
                                ("test", 
                                 ifreqs(ti.difference(ei)),
                                 sfreqs(ts.difference(es)))]:
    print("  no {:>5s} data for {:2d} intents: {}".format(dsname, 
                                                       len(mints), mints))
    print("  no {:>5s} data for {:2d}   slots: {}".format(dsname, 
                                                       len(mslots), mslots))


check_atis_split(train, dev, test)

sample count:  5871 splitted into:
       train:  4478 (0.763)
         dev:   500 (0.085)
        test:   893 (0.152)
intent count: 26
  slot count: 127
 token count: 952
missing data for slot/intent labels:
  no train data for  5 intents: {'atis_airfare#atis_flight_time': 1, 'atis_day_name': 2, 'atis_airfare#atis_flight': 1, 'atis_flight_no#atis_airline': 1, 'atis_flight#atis_airline': 1}
  no train data for  7   slots: {'I-return_date.day_number': 1, 'B-compartment': 1, 'B-booking_class': 1, 'I-state_name': 1, 'B-flight': 1, 'I-flight_number': 1, 'B-stoploc.airport_code': 1}
  no   dev data for  6 intents: {'atis_flight_no': 20, 'atis_aircraft#atis_flight#atis_flight_no': 1, 'atis_cheapest': 1, 'atis_meal': 12, 'atis_ground_service#atis_ground_fare': 1, 'atis_airline#atis_flight_no': 2}
  no   dev data for 25   slots: {'B-time_relative': 1, 'B-today_relative': 2, 'I-arrive_time.time_relative': 6, 'I-meal_code': 4, 'I-arrive_date.day_number': 6, 'B-month_name': 2, 'B-time': 2, 'I-mea

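The output above shows that 5 intent labels and 7 slot labels are not present at all in the train dataset. Similarly, around 20% of the labels are not present in the dev or test dataset: the dev set, for instance, has no samples for 25 slot labels and 6 intent labels that occur in the train set, out of 127 slot and 26 intent labels overall.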
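
The kind of check behind these numbers can be sketched on toy data. The label lists below are made up for illustration and are not ATIS counts; `missing_labels` is a hypothetical helper that reports which labels of a reference split never occur in another split.

```python
from collections import Counter

def missing_labels(reference, split):
    """Return {label: count in reference} for labels absent from split."""
    ref_counts = Counter(reference)
    seen = set(split)
    return {lab: n for lab, n in ref_counts.items() if lab not in seen}

# Toy intent labels (made up for illustration, not ATIS counts).
train_intents = ["atis_flight"] * 5 + ["atis_meal"] * 2 + ["atis_airfare"]
dev_intents = ["atis_flight", "atis_airfare"]

missing = missing_labels(train_intents, dev_intents)
# share of distinct train labels that have no dev sample at all
share = len(missing) / len(set(train_intents))
print(missing, "{:.0%} of train labels missing in dev".format(share))
```

A model can neither be trained on nor evaluated against a label that has no sample in the corresponding split, which is why such gaps matter.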

## Splitting the ATIS dataset