# ATIS Flight Reservations - Information Extraction


<hr>

Table of Contents:

1. Understanding the Data
2. Information Extraction 
    - Pipeline for Information Extraction Systems
    - Named Entity Recognition (NER)
3. Models for Entity Recognition
    - Rule-based models
        - Regular Expression Based Rules (ex)
        - Chunking 
    - Probabilistic models
        - Unigram and Bigram models
        - Naive Bayes Classifier 
        - Conditional Random Fields (CRFs)

<hr>

The ATIS (Airline Travel Information Systems) dataset consists of English language queries for booking (or requesting information about) flights in the US. 

Each word in a query (i.e. a request by a user) is labelled according to its **entity-type**. For example, in the query 'please show morning flights from chicago to new york', 'chicago' and 'new york' are labelled as 'source' and 'destination' locations respectively, while 'morning' is labelled as 'time-of-day' (the exact labelling scheme is a bit different; more on that later).

Some example queries taken from the dataset are shown below:

```
{
'what flights leave atlanta at about DIGIT in the afternoon and arrive in san francisco',
 'what is the abbreviation for canadian airlines international',
 "i 'd like to know the earliest flight from boston to atlanta",
 'show me the us air flights from atlanta to boston',
 'show me the cheapest round trips from dallas to baltimore',
 "i 'd like to see all flights from denver to philadelphia"
 }
 ```

### Objective
Our objective is to **build an information extraction system** which can extract entities relevant for booking flights (such as source and destination cities, time, date, budget constraints etc.) in a **structured format** from a given user-generated query.

A structured format could be a dictionary, a JSON object, etc. - basically anything that can be parsed and used to look up relevant flights in a database.
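As a rough illustration of such a structured format, the sketch below writes the entities of the example query as a dictionary and serialises it to JSON. The key names (`source`, `destination`, `time_of_day`) are assumptions made for this example; they are not the label names used in the dataset.

```python
import json

# Hypothetical structured output for the query
# 'please show morning flights from chicago to new york'.
# The keys are illustrative only; the dataset uses its own label names.
extracted = {
    "source": "chicago",
    "destination": "new york",
    "time_of_day": "morning",
}

# A dictionary serialises directly to JSON, which a downstream
# flight-lookup service could parse.
payload = json.dumps(extracted, sort_keys=True)
print(payload)
```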


### Downloads
The dataset is divided into five folds, each fold having a training, validation and test set.
You can download the dataset here: http://lisaweb.iro.umontreal.ca/transfert/lisa/users/mesnilgr/atis/




In [7]:
# =====> import libraries
import numpy as np
import pandas as pd
import pprint, nltk
import matplotlib.pyplot as plt
import random 

import gzip, os, pickle # ===> for reading gz files
import _pickle as cPickle

import sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')

In [8]:
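# load one fold; encoding='latin1' is the usual way to read Python 2 pickles under Python 3
with gzip.open('atis.fold0.pkl.gz', 'rb') as f:
    try:
        train_set, valid_set, test_set, dicts = pickle.load(f, encoding='latin1')
    except TypeError:
        # older pickle.load implementations do not accept an `encoding` argument
        f.seek(0)
        train_set, valid_set, test_set, dicts = pickle.load(f)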


In [9]:
train_set[:1]

([array([554, 194, 268,  64,  62,  16,   8, 234, 481,  20,  40,  58, 234,
         415, 205], dtype=int32),
  array([554, 241, 481,  14, 200,  91,  26, 239], dtype=int32),
  array([232,   0, 273, 502, 254, 481, 165, 193, 208,  77, 502,  64],
        dtype=int32),
  array([439, 301, 481, 532,  22, 194, 208,  64, 502,  77], dtype=int32),
  array([439, 301, 481,  99, 410, 516, 208, 128, 502,  69], dtype=int32),
  array([232,   0, 273, 502, 425,  32, 194, 208, 137, 502, 376], dtype=int32),
  array([177, 182, 111, 394], dtype=int32),
  array([232,   0, 273,  13, 530,  26, 193, 358, 546, 208, 415, 205, 502,
          77], dtype=int32),
  array([554, 241, 481, 386, 353,  37,  26, 193,   9, 208, 332, 569, 502,
         285,  41], dtype=int32),
  array([554, 157, 481, 302, 111, 411, 453, 200], dtype=int32),
  array([554,  50,  32, 194, 502, 137, 208, 376, 358, 463], dtype=int32),
  array([554, 501, 157, 481, 263,  20, 193, 268, 208, 543, 200, 137],
        dtype=int32),
  array([554, 194,  50, 

In [10]:
test_set[:1]

([array([232, 565, 273, 502, 189,  13, 193, 208,  97, 502, 260, 539, 480,
         294,  13, 458, 234, 452, 286], dtype=int32),
  array([358,  49, 190, 232, 331,  13, 498, 208, 466, 502, 415, 247, 139,
          72,   8,  35], dtype=int32),
  array([358,  49, 190, 232, 331,  13, 193, 215, 208, 378, 502, 415, 147],
        dtype=int32),
  array([232, 565, 273,  13, 193, 514, 359, 544, 208, 378, 502, 415, 147,
         358,  49, 190], dtype=int32),
  array([232, 565, 273,  13, 193, 208, 367, 502, 413, 256, 104, 200,  49,
         190, 358, 136,  26], dtype=int32),
  array([232, 331,  13, 193, 208, 506, 502, 333, 359, 544, 270, 546, 174,
         363, 496, 321], dtype=int32),
  array([317, 321, 232, 565, 273, 502, 196, 208, 114, 502, 236], dtype=int32),
  array([358, 546,  49, 443, 232, 565, 273, 502, 196, 208, 282,  71, 502,
         114,  19,   8, 384], dtype=int32),
  array([ 19,   9, 384, 358, 546,  49, 443, 232, 565, 273, 502, 196, 208,
         282,  71, 502, 114], dtype=int32),
  a

In [11]:
dicts

{'labels2idx': {'B-time_relative': 74,
  'B-stoploc.state_code': 72,
  'B-depart_date.today_relative': 29,
  'B-arrive_date.date_relative': 5,
  'B-depart_date.date_relative': 25,
  'I-restriction_code': 113,
  'B-return_date.month_name': 62,
  'I-time': 120,
  'B-depart_date.day_name': 26,
  'I-arrive_time.end_time': 86,
  'B-fromloc.airport_code': 46,
  'B-cost_relative': 21,
  'B-connect': 20,
  'B-return_time.period_mod': 64,
  'B-arrive_time.period_mod': 11,
  'B-flight_number': 43,
  'B-depart_time.time_relative': 36,
  'I-toloc.city_name': 123,
  'B-arrive_time.period_of_day': 12,
  'B-depart_time.period_of_day': 33,
  'I-return_date.date_relative': 114,
  'I-depart_time.start_time': 98,
  'B-fare_amount': 38,
  'I-depart_time.time_relative': 100,
  'B-city_name': 17,
  'B-depart_date.day_number': 27,
  'I-meal_description': 112,
  'I-depart_date.today_relative': 95,
  'I-airport_name': 84,
  'I-arrive_date.day_number': 85,
  'B-toloc.state_code': 80,
  'B-arrive_date.month_name

In [12]:
print(type(valid_set[0]), type(valid_set[1]), type(valid_set[2]))
print(len(valid_set[0]), len(train_set[0]), len(test_set[0]))

<class 'list'> <class 'list'> <class 'list'>
995 3983 893


In [13]:
pprint.pprint(train_set[0][:3])
print('#'*50)
pprint.pprint(train_set[1][:3])
print('#'*50)
pprint.pprint(train_set[2][:3])

[array([554, 194, 268,  64,  62,  16,   8, 234, 481,  20,  40,  58, 234,
       415, 205], dtype=int32),
 array([554, 241, 481,  14, 200,  91,  26, 239], dtype=int32),
 array([232,   0, 273, 502, 254, 481, 165, 193, 208,  77, 502,  64],
      dtype=int32)]
##################################################
[array([  0,   0,   0,  18,   0,   1,  52,   0,   0,  76,   0,   0,   0,
        18, 109], dtype=int32),
 array([  0,   0,   0,   0,   0,   6, 107, 107], dtype=int32),
 array([ 0,  0,  0,  0,  0,  0, 44,  0,  0, 18,  0, 18], dtype=int32)]
##################################################
[array([126, 126, 126,  48, 126,  36,  35, 126, 126,  33, 126, 126, 126,
        78, 123], dtype=int32),
 array([126, 126, 126, 126, 126,   2,  83,  83], dtype=int32),
 array([126, 126, 126, 126, 126, 126,  42, 126, 126,  48, 126,  78],
      dtype=int32)]


In [14]:
train_x, _, train_label = train_set
val_x, _, val_label = valid_set
test_x, _, test_label = test_set

So now, for training, validation and test sets, we have the **encoded words and labels** stored in the lists (train_x, train_label), (val_x, val_label) and (test_x, test_label). The first list represents the actual words (encoded), and the other list contains their labels (again, encoded).

Let's now understand the structure of the lists.

In [15]:
train_x[0]

array([554, 194, 268,  64,  62,  16,   8, 234, 481,  20,  40,  58, 234,
       415, 205], dtype=int32)

In [16]:
train_label[0]

array([126, 126, 126,  48, 126,  36,  35, 126, 126,  33, 126, 126, 126,
        78, 123], dtype=int32)

To map the integers back to words, we need to use the dictionaries provided. The dicts ```words2idx``` and ```labels2idx``` map the actual words and labels to their numeric ids, so they have to be inverted before a query can be decoded.

In [19]:
dicts.keys()

dict_keys(['labels2idx', 'tables2idx', 'words2idx'])

In [20]:
dicts['labels2idx']

{'B-time_relative': 74,
 'B-stoploc.state_code': 72,
 'B-depart_date.today_relative': 29,
 'B-arrive_date.date_relative': 5,
 'B-depart_date.date_relative': 25,
 'I-restriction_code': 113,
 'B-return_date.month_name': 62,
 'I-time': 120,
 'B-depart_date.day_name': 26,
 'I-arrive_time.end_time': 86,
 'B-fromloc.airport_code': 46,
 'B-cost_relative': 21,
 'B-connect': 20,
 'B-return_time.period_mod': 64,
 'B-arrive_time.period_mod': 11,
 'B-flight_number': 43,
 'B-depart_time.time_relative': 36,
 'I-toloc.city_name': 123,
 'B-arrive_time.period_of_day': 12,
 'B-depart_time.period_of_day': 33,
 'I-return_date.date_relative': 114,
 'I-depart_time.start_time': 98,
 'B-fare_amount': 38,
 'I-depart_time.time_relative': 100,
 'B-city_name': 17,
 'B-depart_date.day_number': 27,
 'I-meal_description': 112,
 'I-depart_date.today_relative': 95,
 'I-airport_name': 84,
 'I-arrive_date.day_number': 85,
 'B-toloc.state_code': 80,
 'B-arrive_date.month_name': 8,
 'B-stoploc.airport_code': 69,
 'I-depar

In [21]:
dicts['words2idx']

{'all': 32,
 'coach': 110,
 'cincinnati': 102,
 'people': 374,
 'month': 318,
 'four': 202,
 'code': 111,
 'go': 213,
 'show': 439,
 'thursday': 496,
 'to': 502,
 'restriction': 405,
 'dinnertime': 151,
 'under': 529,
 'sorry': 450,
 'include': 235,
 'midwest': 311,
 'worth': 564,
 'southwest': 451,
 'me': 301,
 'returning': 408,
 'far': 181,
 'vegas': 539,
 'airfare': 24,
 'ticket': 498,
 'difference': 148,
 'arrange': 54,
 'tickets': 499,
 'louis': 286,
 'cheapest': 99,
 'list': 276,
 'wednesday': 546,
 'leave': 268,
 'heading': 222,
 'ten': 474,
 'direct': 152,
 'turboprop': 520,
 'rate': 395,
 'cost': 121,
 'quebec': 392,
 'layover': 266,
 'air': 22,
 'what': 554,
 'stands': 454,
 'chicago': 100,
 'schedule': 419,
 'transcontinental': 510,
 'goes': 214,
 'new': 332,
 'transportation': 512,
 'here': 225,
 'hours': 228,
 'let': 272,
 'twentieth': 523,
 'along': 33,
 'thrift': 494,
 'passengers': 371,
 'great': 216,
 'thirty': 490,
 'canadian': 91,
 'leaves': 269,
 'alaska': 31,
 'lea

In [22]:
words = dicts['words2idx']
labels = dicts['labels2idx']
tables = dicts['tables2idx']

In [23]:
labels

{'B-time_relative': 74,
 'B-stoploc.state_code': 72,
 'B-depart_date.today_relative': 29,
 'B-arrive_date.date_relative': 5,
 'B-depart_date.date_relative': 25,
 'I-restriction_code': 113,
 'B-return_date.month_name': 62,
 'I-time': 120,
 'B-depart_date.day_name': 26,
 'I-arrive_time.end_time': 86,
 'B-fromloc.airport_code': 46,
 'B-cost_relative': 21,
 'B-connect': 20,
 'B-return_time.period_mod': 64,
 'B-arrive_time.period_mod': 11,
 'B-flight_number': 43,
 'B-depart_time.time_relative': 36,
 'I-toloc.city_name': 123,
 'B-arrive_time.period_of_day': 12,
 'B-depart_time.period_of_day': 33,
 'I-return_date.date_relative': 114,
 'I-depart_time.start_time': 98,
 'B-fare_amount': 38,
 'I-depart_time.time_relative': 100,
 'B-city_name': 17,
 'B-depart_date.day_number': 27,
 'I-meal_description': 112,
 'I-depart_date.today_relative': 95,
 'I-airport_name': 84,
 'I-arrive_date.day_number': 85,
 'B-toloc.state_code': 80,
 'B-arrive_date.month_name': 8,
 'B-stoploc.airport_code': 69,
 'I-depar

In [24]:
words.items()

dict_items([('all', 32), ('coach', 110), ('cincinnati', 102), ('people', 374), ('month', 318), ('four', 202), ('code', 111), ('go', 213), ('show', 439), ('thursday', 496), ('to', 502), ('restriction', 405), ('dinnertime', 151), ('under', 529), ('sorry', 450), ('include', 235), ('midwest', 311), ('worth', 564), ('southwest', 451), ('me', 301), ('returning', 408), ('far', 181), ('vegas', 539), ('airfare', 24), ('ticket', 498), ('difference', 148), ('arrange', 54), ('tickets', 499), ('louis', 286), ('cheapest', 99), ('list', 276), ('wednesday', 546), ('leave', 268), ('heading', 222), ('ten', 474), ('direct', 152), ('turboprop', 520), ('rate', 395), ('cost', 121), ('quebec', 392), ('layover', 266), ('air', 22), ('what', 554), ('stands', 454), ('chicago', 100), ('schedule', 419), ('transcontinental', 510), ('goes', 214), ('new', 332), ('transportation', 512), ('here', 225), ('hours', 228), ('let', 272), ('twentieth', 523), ('along', 33), ('thrift', 494), ('passengers', 371), ('great', 216),

In [25]:
[k for val in train_x[0] for k, v in words.items() if v == val]

['what',
 'flights',
 'leave',
 'atlanta',
 'at',
 'about',
 'DIGIT',
 'in',
 'the',
 'afternoon',
 'and',
 'arrive',
 'in',
 'san',
 'francisco']

In [26]:
# random.sample needs a sequence; recent Python versions reject dict views
random.sample(list(labels.items()), 25)

[('B-arrive_time.start_time', 13),
 ('I-toloc.airport_name', 122),
 ('B-depart_time.period_of_day', 33),
 ('B-mod', 54),
 ('I-class_type', 92),
 ('B-arrive_time.time', 14),
 ('B-airport_name', 4),
 ('I-depart_time.period_of_day', 97),
 ('I-fare_basis_code', 103),
 ('I-depart_time.start_time', 98),
 ('I-flight_mod', 104),
 ('B-cost_relative', 21),
 ('B-return_date.month_name', 62),
 ('B-state_code', 67),
 ('B-day_number', 23),
 ('B-flight_stop', 44),
 ('B-return_time.period_mod', 64),
 ('B-arrive_time.end_time', 10),
 ('B-fromloc.city_name', 48),
 ('B-fromloc.airport_code', 46),
 ('I-restriction_code', 113),
 ('I-meal_code', 111),
 ('I-economy', 101),
 ('I-transport_type', 125),
 ('B-compartment', 19)]

#### Reversing the labels and words:

In [27]:
id_to_words = {words[k]: k for k in words}
id_to_labels = {labels[k]:k for k in labels}

In [31]:
id_to_words[32]

'all'

In [32]:
id_to_labels[74]

'B-time_relative'

In [33]:
def print_query(index):
    # decode one training query into a list of (word, label) pairs
    w = [id_to_words[word_id] for word_id in train_x[index]]
    l = [id_to_labels[label_id] for label_id in train_label[index]]
    return list(zip(w, l))

In [34]:
print_query(3900)

[('please', 'O'),
 ('show', 'O'),
 ('me', 'O'),
 ('the', 'O'),
 ('return', 'O'),
 ('flight', 'O'),
 ('number', 'O'),
 ('from', 'O'),
 ('toronto', 'B-fromloc.city_name'),
 ('to', 'O'),
 ('st.', 'B-toloc.city_name'),
 ('petersburg', 'I-toloc.city_name')]

## Models for Entity Recognition

### Part-of-Speech Tagging

In [35]:
id_to_words[202]

'four'

In [36]:
p = [id_to_words[val] for val in train_x[0]]
# nltk.pos_tag()
p

['what',
 'flights',
 'leave',
 'atlanta',
 'at',
 'about',
 'DIGIT',
 'in',
 'the',
 'afternoon',
 'and',
 'arrive',
 'in',
 'san',
 'francisco']

In [37]:
def pos_tagging(sent_list):
    pos_tags = []
    for sent in sent_list:
        tagged_words = nltk.pos_tag([id_to_words[val] for val in sent])
        pos_tags.append(tagged_words)
    return pos_tags

In [38]:
print(len(train_x))
print(len(val_x))
print(len(test_x))

3983
995
893


In [39]:
train_pos =pos_tagging(train_x)
validation_pos = pos_tagging(val_x)
test_pos = pos_tagging(test_x)

In [40]:
train_pos[1]

[('what', 'WP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('abbreviation', 'NN'),
 ('for', 'IN'),
 ('canadian', 'JJ'),
 ('airlines', 'NNS'),
 ('international', 'JJ')]

### Creating 3-tuples of ```(word, pos, iob_label)```
To train a model, we need the entity label of each word along with its POS tag, e.g. in this format:

```
('show', 'VB', 'O'),
('me', 'PRP', 'O'),
('the', 'DT', 'O'),
('cheapest', 'JJS', 'B-cost_relative'),
('round', 'NN', 'B-round_trip'),
('trips', 'NNS', 'I-round_trip'),
('from', 'IN', 'O'),
('dallas', 'NN', 'B-fromloc.city_name'),
('to', 'TO', 'O'),
('baltimore', 'VB', 'B-toloc.city_name')
```
<hr>

Let's convert the training, validation and test sentences to this form. Since we have already POS-tagged the queries, we'll write a function which takes the queries in the form (word, pos_tag) together with their labels as input, and returns the list of sentences in the form (word, pos_tag, iob_label).

In [41]:
train_pos

[[('what', 'WP'),
  ('flights', 'NNS'),
  ('leave', 'VBP'),
  ('atlanta', 'VBN'),
  ('at', 'IN'),
  ('about', 'RB'),
  ('DIGIT', 'NNP'),
  ('in', 'IN'),
  ('the', 'DT'),
  ('afternoon', 'NN'),
  ('and', 'CC'),
  ('arrive', 'NN'),
  ('in', 'IN'),
  ('san', 'JJ'),
  ('francisco', 'NN')],
 [('what', 'WP'),
  ('is', 'VBZ'),
  ('the', 'DT'),
  ('abbreviation', 'NN'),
  ('for', 'IN'),
  ('canadian', 'JJ'),
  ('airlines', 'NNS'),
  ('international', 'JJ')],
 [('i', 'JJ'),
  ("'d", 'MD'),
  ('like', 'VB'),
  ('to', 'TO'),
  ('know', 'VB'),
  ('the', 'DT'),
  ('earliest', 'JJS'),
  ('flight', 'NN'),
  ('from', 'IN'),
  ('boston', 'NN'),
  ('to', 'TO'),
  ('atlanta', 'VB')],
 [('show', 'VB'),
  ('me', 'PRP'),
  ('the', 'DT'),
  ('us', 'PRP'),
  ('air', 'NN'),
  ('flights', 'NNS'),
  ('from', 'IN'),
  ('atlanta', 'NN'),
  ('to', 'TO'),
  ('boston', 'VB')],
 [('show', 'VB'),
  ('me', 'PRP'),
  ('the', 'DT'),
  ('cheapest', 'JJS'),
  ('round', 'NN'),
  ('trips', 'NNS'),
  ('from', 'IN'),
  ('dallas

In [42]:
train_label

[array([126, 126, 126,  48, 126,  36,  35, 126, 126,  33, 126, 126, 126,
         78, 123], dtype=int32),
 array([126, 126, 126, 126, 126,   2,  83,  83], dtype=int32),
 array([126, 126, 126, 126, 126, 126,  42, 126, 126,  48, 126,  78],
       dtype=int32),
 array([126, 126, 126,   2,  83, 126, 126,  48, 126,  78], dtype=int32),
 array([126, 126, 126,  21,  66, 117, 126,  48, 126,  78], dtype=int32),
 array([126, 126, 126, 126, 126, 126, 126, 126,  48, 126,  78], dtype=int32),
 array([126, 126, 126,  39], dtype=int32),
 array([126, 126, 126, 126,   2,  83, 126, 126,  26, 126,  48, 109, 126,
         78], dtype=int32),
 array([126, 126, 126, 126, 126,   2,  83, 126,  43, 126,  48, 109, 126,
         78, 123], dtype=int32),
 array([126, 126, 126,  51, 126,  52, 126, 126], dtype=int32),
 array([126, 126, 126, 126, 126,  78, 126,  48, 126,  26], dtype=int32),
 array([126,  45, 126, 126,  33,  97, 126, 126, 126,  48, 126,  78],
       dtype=int32),
 array([126, 126, 126, 126,  26, 126,  48

In [43]:
# function to create (word, pos_tag, iob_label) tuples for a given dataset
def create_word_pos_label(pos_tagged_data, label_ids):
    iob_labels = []         # list of sentences, each a list of 3-tuples

    # walk over each sentence together with its encoded labels
    for pos_tagged_sent, sent_label_ids in zip(pos_tagged_data, label_ids):
        # create (word, pos, label) tuples, decoding each label id
        tuple_3 = [(word, pos, id_to_labels[label_id])
                   for (word, pos), label_id in zip(pos_tagged_sent, sent_label_ids)]
        iob_labels.append(tuple_3)
    return iob_labels

In [45]:
train_labels = create_word_pos_label(train_pos, train_label)
train_labels[4:6]

[[('show', 'VB', 'O'),
  ('me', 'PRP', 'O'),
  ('the', 'DT', 'O'),
  ('cheapest', 'JJS', 'B-cost_relative'),
  ('round', 'NN', 'B-round_trip'),
  ('trips', 'NNS', 'I-round_trip'),
  ('from', 'IN', 'O'),
  ('dallas', 'NN', 'B-fromloc.city_name'),
  ('to', 'TO', 'O'),
  ('baltimore', 'VB', 'B-toloc.city_name')],
 [('i', 'JJ', 'O'),
  ("'d", 'MD', 'O'),
  ('like', 'VB', 'O'),
  ('to', 'TO', 'O'),
  ('see', 'VB', 'O'),
  ('all', 'DT', 'O'),
  ('flights', 'NNS', 'O'),
  ('from', 'IN', 'O'),
  ('denver', 'NN', 'B-fromloc.city_name'),
  ('to', 'TO', 'O'),
  ('philadelphia', 'VB', 'B-toloc.city_name')]]

In [46]:
valid_labels = create_word_pos_label(validation_pos, val_label)
test_labels = create_word_pos_label(test_pos, test_label)

In [47]:
from nltk.corpus import conll2000
from nltk import conlltags2tree, tree2conlltags

In [48]:
train_labels[0]

[('what', 'WP', 'O'),
 ('flights', 'NNS', 'O'),
 ('leave', 'VBP', 'O'),
 ('atlanta', 'VBN', 'B-fromloc.city_name'),
 ('at', 'IN', 'O'),
 ('about', 'RB', 'B-depart_time.time_relative'),
 ('DIGIT', 'NNP', 'B-depart_time.time'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('afternoon', 'NN', 'B-depart_time.period_of_day'),
 ('and', 'CC', 'O'),
 ('arrive', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('san', 'JJ', 'B-toloc.city_name'),
 ('francisco', 'NN', 'I-toloc.city_name')]

In [49]:
from nltk import word_tokenize, pos_tag, ne_chunk

In [None]:
import nltk

In [50]:
train_trees = [conlltags2tree(sent) for sent in train_labels]
valid_trees = [conlltags2tree(sent) for sent in valid_labels]
test_trees = [conlltags2tree(sent) for sent in test_labels]

In [51]:
print(train_trees[1:2])

[Tree('S', [('what', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('abbreviation', 'NN'), ('for', 'IN'), Tree('airline_name', [('canadian', 'JJ'), ('airlines', 'NNS'), ('international', 'JJ')])])]


### Chunking

Chunking identifies meaningful sequences of tokens, called chunks, in a sentence. It is commonly used to identify noun phrases, verb phrases etc. For example, in this sentence taken from the NLTK book: <br>

S = *"We saw the yellow dog"*

there are two **noun phrase chunks** as shown below. Each outer box represents a chunk.

<img src='https://www.nltk.org/book/tree_images/ch07-tree-1.png'>

The corresponding **IOB representation** of the same is as follows:

<img src='https://www.nltk.org/images/chunk-tagrep.png'>
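To make the IOB convention concrete, here is a minimal pure-Python sketch that groups `(word, iob_tag)` pairs back into chunks. The helper name `iob_to_chunks` is made up for this example, and treating a stray `I-` tag as the start of a new chunk is an assumption of the sketch rather than something the dataset specifies.

```python
def iob_to_chunks(tagged_tokens):
    """Group (word, iob_tag) pairs into (chunk_type, [words]) tuples."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in tagged_tokens:
        starts_chunk = tag.startswith("B-") or (
            tag.startswith("I-") and current_type != tag[2:])
        if starts_chunk:
            # close the open chunk (if any) and start a new one
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # continue the open chunk of the same type
            current_words.append(word)
        else:
            # 'O': the token is outside any chunk
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, current_words))
    return chunks

# the NLTK book sentence with its IOB tags
np_sentence = [("We", "B-NP"), ("saw", "O"), ("the", "B-NP"),
               ("yellow", "I-NP"), ("dog", "I-NP")]
np_chunks = iob_to_chunks(np_sentence)
print(np_chunks)

# the tail of an ATIS query, labelled as in the dataset
atis_tokens = [("from", "O"), ("toronto", "B-fromloc.city_name"), ("to", "O"),
               ("st.", "B-toloc.city_name"), ("petersburg", "I-toloc.city_name")]
atis_chunks = iob_to_chunks(atis_tokens)
print(atis_chunks)
```

The second call shows how a multi-word destination such as 'st. petersburg' is recovered as a single `toloc.city_name` chunk.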



Similarly, in our dataset, the following sentence contains chunks such as fromloc.city_name (san francisco), class_type (first class), depart_time.time (DIGITDIGIT noon) etc.


In [52]:
print(train_trees[3468])

(S
  show/VB
  me/PRP
  flights/NNS
  on/IN
  (depart_date.day_name sunday/NN)
  going/VBG
  from/IN
  (fromloc.city_name san/JJ francisco/NN)
  to/TO
  (toloc.city_name boston/VB)
  (flight_stop nonstop/JJ)
  (class_type first/JJ class/NN)
  leaving/NN
  (depart_time.time_relative after/IN)
  (depart_time.time DIGITDIGIT/NNP noon/NN))


In [53]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

In [54]:
# an optional determiner, any number of adjectives, then a noun
grammar = "NP_Chunk: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)

(S
  (NP_Chunk the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP_Chunk the/DT cat/NN))
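The tag pattern `<DT>?<JJ>*<NN>` can be read as an ordinary regular expression over the sequence of POS tags. The sketch below imitates that idea with Python's `re` module; it is a simplified stand-in written for illustration, not NLTK's actual implementation, and the helper name `regex_chunk` is hypothetical.

```python
import re

def regex_chunk(tagged_sentence, tag_pattern):
    """Return the word lists whose POS tags match `tag_pattern`."""
    # encode the POS tags as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join(f"<{pos}>" for _, pos in tagged_sentence)
    chunks = []
    for match in re.finditer(tag_pattern, tag_string):
        # the number of '<' before the match is the index of its first token
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged_sentence[start:end]])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# same pattern as the NP_Chunk rule, written as a plain regex
np_pattern = r"(<DT>)?(<JJ>)*<NN>"
np_chunks = regex_chunk(sentence, np_pattern)
print(np_chunks)
```

The two matches correspond to the two `NP_Chunk` subtrees in the parser output above.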


#### Baseline: accuracy with an empty grammar

In [55]:
# an empty grammar creates no chunks, so every token is left outside any chunk
grammar = ''
reg_parser = nltk.RegexpParser(grammar)
result = reg_parser.evaluate(train_trees)
print(result)

ChunkParse score:
    IOB Accuracy:  63.8%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
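The non-zero IOB accuracy alongside zero precision and recall is what one would expect from a parser that tags every token as `O`: most tokens in a query really are outside any entity, so they count as correct, yet no chunk is ever proposed. A small sketch of that effect on a single query from the training set (the helper `iob_accuracy` is written for this example):

```python
def iob_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted IOB tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)

# gold tags of 'please show me the return flight number
# from toronto to st. petersburg'
gold = (["O"] * 8
        + ["B-fromloc.city_name", "O",
           "B-toloc.city_name", "I-toloc.city_name"])

# an empty grammar behaves like predicting 'O' everywhere
all_outside = ["O"] * len(gold)

accuracy = iob_accuracy(gold, all_outside)
print(accuracy)  # 9 of the 12 tokens are 'O'
```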


In [56]:
print(train_trees[random.randrange(len(train_trees))])

(S
  information/NN
  on/IN
  a/DT
  flight/NN
  from/IN
  (fromloc.city_name san/JJ francisco/NN)
  to/TO
  (toloc.city_name philadelphia/VB))


Note: Most queries ask about flights from one city to another, and this 'from ... to ...' pattern can be expressed as a grammar.

In [57]:
grammar = """S: {<NP><VP>}
NP: {<DT|JJ|NN.*>+}
PP: {<IN><NP>} 
VP: {<VB.*><NP|PP>+$} """
cp = nltk.RegexpParser(grammar)
print(cp.parse([("Rohit", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")])) 

(S
  (NP Rohit/NN)
  saw/VBD
  (NP the/DT cat/NN)
  (VP sit/VB (PP on/IN (NP the/DT mat/NN))))


In [58]:
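# heuristic based on the POS tags seen above: source cities are mostly tagged
# (JJ) NN, while destination cities after 'to' are often tagged VB
grammar = '''
fromloc.city_name: {<JJ>?<NN>}
toloc.city_name: {<VB><NN>?}
'''
parser = nltk.RegexpParser(grammar)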
result = parser.evaluate(train_trees)
print(result)

ChunkParse score:
    IOB Accuracy:  65.2%%
    Precision:     31.7%%
    Recall:        40.9%%
    F-Measure:     35.7%%


### Unigram chunker:

In [59]:
from nltk import ChunkParserI

class UnigramChunker(ChunkParserI):

    def __init__(self, train_sents):
        train_data = [ [(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [60]:
unigram_chunker = UnigramChunker(train_trees)
print(unigram_chunker.evaluate(valid_trees))

ChunkParse score:
    IOB Accuracy:  66.3%%
    Precision:     37.5%%
    Recall:        18.5%%
    F-Measure:     24.8%%
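Conceptually, a unigram chunker of this kind assigns each POS tag the chunk tag it was most often seen with in training. The following sketch reproduces that idea with a `Counter`, using the `(pos, iob_label)` pairs of two training queries shown earlier; it is a simplified stand-in for `nltk.UnigramTagger`, and the function name `train_unigram` is made up.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each POS tag to its most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk_tag in sent:
            counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# (pos, iob_label) pairs of two training queries
train = [
    [("VB", "O"), ("PRP", "O"), ("DT", "O"), ("JJS", "B-cost_relative"),
     ("NN", "B-round_trip"), ("NNS", "I-round_trip"), ("IN", "O"),
     ("NN", "B-fromloc.city_name"), ("TO", "O"), ("VB", "B-toloc.city_name")],
    [("JJ", "O"), ("MD", "O"), ("VB", "O"), ("TO", "O"), ("VB", "O"),
     ("DT", "O"), ("NNS", "O"), ("IN", "O"),
     ("NN", "B-fromloc.city_name"), ("TO", "O"), ("VB", "B-toloc.city_name")],
]

model = train_unigram(train)
print(model["VB"], model["NN"])
```

Because `VB` is more often an ordinary verb than a destination city, a model that looks only at the POS tag labels every `VB` as `O`, which may partly explain the low recall above.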


In [61]:
train_trees[0].leaves()

[('what', 'WP'),
 ('flights', 'NNS'),
 ('leave', 'VBP'),
 ('atlanta', 'VBN'),
 ('at', 'IN'),
 ('about', 'RB'),
 ('DIGIT', 'NNP'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('afternoon', 'NN'),
 ('and', 'CC'),
 ('arrive', 'NN'),
 ('in', 'IN'),
 ('san', 'JJ'),
 ('francisco', 'NN')]

In [62]:
print(train_trees[0])

(S
  what/WP
  flights/NNS
  leave/VBP
  (fromloc.city_name atlanta/VBN)
  at/IN
  (depart_time.time_relative about/RB)
  (depart_time.time DIGIT/NNP)
  in/IN
  the/DT
  (depart_time.period_of_day afternoon/NN)
  and/CC
  arrive/NN
  in/IN
  (toloc.city_name san/JJ francisco/NN))


In [63]:
class BigramChunker(ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for word, t, c in nltk.chunk.tree2conlltags(sent)]  for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
    
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [64]:
bigram_chunker = BigramChunker(train_trees)
print(bigram_chunker.evaluate(valid_trees))

ChunkParse score:
    IOB Accuracy:  70.6%%
    Precision:     43.5%%
    Recall:        38.8%%
    F-Measure:     41.0%%
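The bigram chunker conditions on the previous chunk tag as well as the current POS tag. The sketch below is a simplified imitation of that behaviour as I understand it for a bigram tagger without a backoff tagger: an unseen context yields `None`, and that `None` then becomes part of the next context. The names `train_bigram`, `tag_bigram` and the `<START>` marker are assumptions of this example.

```python
from collections import Counter, defaultdict

START = "<START>"  # hypothetical marker for the beginning of a sentence

def train_bigram(tagged_sents):
    """Map (previous chunk tag, POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for pos, chunk_tag in sent:
            counts[(prev, pos)][chunk_tag] += 1
            prev = chunk_tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_bigram(model, pos_tags):
    """Tag left to right; unseen contexts give None."""
    prev, output = START, []
    for pos in pos_tags:
        chunk_tag = model.get((prev, pos))
        output.append(chunk_tag)
        prev = chunk_tag
    return output

# 'show me the cheapest round trips from dallas to baltimore'
train_sent = [("VB", "O"), ("PRP", "O"), ("DT", "O"),
              ("JJS", "B-cost_relative"), ("NN", "B-round_trip"),
              ("NNS", "I-round_trip"), ("IN", "O"),
              ("NN", "B-fromloc.city_name"), ("TO", "O"),
              ("VB", "B-toloc.city_name")]
model = train_bigram([train_sent])

seen = tag_bigram(model, [pos for pos, _ in train_sent])
unseen = tag_bigram(model, ["WP", "NNS", "IN"])
print(seen)
print(unseen)
```

With the extra context the two `VB` tokens ('show' and 'baltimore') receive different chunk tags, which a unigram model cannot do. The second call shows the downside: once a context is unseen, the rest of the sentence goes untagged unless a backoff tagger is supplied.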


### Using a Gazetteer to Lookup Cities and States

A gazetteer is a geographical directory which stores data regarding the names of geographical entities (cities, states, countries) and some other features related to the geographies. An example gazetteer file for the US is given below.

Data download URL: https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv


We'll write a simple function which takes a word as input and returns a tuple indicating **whether the word is a city, state or a county**.

In [65]:
us_cities = pd.read_csv('https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv', sep='|')
us_cities.head()

| | City | State short | State full | County | City alias |
|---|---|---|---|---|---|
| 0 | Holtsville | NY | New York | SUFFOLK | Internal Revenue Service |
| 1 | Holtsville | NY | New York | SUFFOLK | Holtsville |
| 2 | Adjuntas | PR | Puerto Rico | ADJUNTAS | URB San Joaquin |
| 3 | Adjuntas | PR | Puerto Rico | ADJUNTAS | Jard De Adjuntas |
| 4 | Adjuntas | PR | Puerto Rico | ADJUNTAS | Colinas Del Gigante |


In [66]:
cities = set(us_cities['City'].str.lower())
states = set(us_cities['State full'].str.lower())
counties = set(us_cities['County'].str.lower())

In [67]:
def gazetteer_lookup(word):
    return (word in cities, word in states, word in counties)

In [68]:
# sample lookups
print(gazetteer_lookup('washington'))
print(gazetteer_lookup('utah'))
print(gazetteer_lookup('philadelphia'))


(True, True, True)
(False, True, True)
(True, False, True)
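One limitation worth keeping in mind: the lookup tests a single token against complete names, so the tokens of a multi-word city such as 'san francisco' would presumably match only if the gazetteer happens to list that token as a name in its own right. A possible workaround, sketched here with a tiny hand-made city list (not the real gazetteer file), is to also index the individual tokens of each name:

```python
# miniature, hypothetical gazetteer used only for this example
city_names = {"new york", "san francisco", "boston", "st. petersburg"}

# every token that occurs in some city name
city_tokens = {token for name in city_names for token in name.split()}

def city_lookup(word):
    """Return (word is a full city name, word is part of a city name)."""
    return (word in city_names, word in city_tokens)

print(city_lookup("boston"))
print(city_lookup("san"))
print(city_lookup("flights"))
```

The second flag could be added to the feature dictionary as an extra boolean, at the cost of some false positives for common words that also appear in place names.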


In [69]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    
    # the first word has both previous word and previous tag undefined
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]

    # gazetteer lookup features (see the gazetteer section above)
    gazetteer = gazetteer_lookup(word)

    return {"pos": pos, "prevpos": prevpos, 'word':word,
           'word_is_city': gazetteer[0],
           'word_is_state': gazetteer[1],
           'word_is_county': gazetteer[2]}

In [70]:
sent_pos = train_pos[0]

for i in range(len(sent_pos)):
    print(npchunk_features(sent_pos, i, history=[]))
    print(' ')

{'pos': 'WP', 'prevpos': '<START>', 'word': 'what', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'NNS', 'prevpos': 'WP', 'word': 'flights', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'VBP', 'prevpos': 'NNS', 'word': 'leave', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'VBN', 'prevpos': 'VBP', 'word': 'atlanta', 'word_is_city': True, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'IN', 'prevpos': 'VBN', 'word': 'at', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'RB', 'prevpos': 'IN', 'word': 'about', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'NNP', 'prevpos': 'RB', 'word': 'DIGIT', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'IN', 'prevpos': 'NNP', 'word': 'in', 'word_is_city': False, 'word_is_state': False, 'word_is_county': False}
 
{'pos': 'DT', '

In [71]:
class ConsecutiveNPChunkTagger(nltk.TaggerI): 

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            # compute features for each word
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history) 
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI): 
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [72]:
chunker = ConsecutiveNPChunker(train_trees)
print(chunker.evaluate(valid_trees))

ChunkParse score:
    IOB Accuracy:  91.7%%
    Precision:     75.3%%
    Recall:        81.8%%
    F-Measure:     78.4%%


In [None]:
# extracts features for the word at position i in a given sentence
# history holds the chunk (IOB) tags already assigned to the earlier words in the sentence
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    
    # the first word has both previous word and previous tag undefined
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
        
    if i == len(sentence)-1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sentence[i+1]

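    # gazetteer lookup features (see the gazetteer section above)
    gazetteer = gazetteer_lookup(word)

    # adding word_is_digit feature (boolean): True for the one- to three-digit placeholders
    # (set membership avoids the accidental substring matches of `word in 'DIGITDIGITDIGIT'`)
    return {"pos": pos, "prevpos": prevpos, 'word':word, 
           'word_is_city': gazetteer[0],
           'word_is_state': gazetteer[1],
           'word_is_county': gazetteer[2],
           'word_is_digit': word in {'DIGIT', 'DIGITDIGIT', 'DIGITDIGITDIGIT'},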
           'nextword': nextword, 
           'nextpos': nextpos}
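Note that the `history` argument is passed to the feature function but not used by it. Since the tagger appends each assigned chunk tag to `history`, one option is to expose the previous chunk tag as a feature. The variant below is a minimal sketch of that idea with a hypothetical function name and a reduced feature set; it is not part of the original pipeline.

```python
def npchunk_features_with_history(sentence, i, history):
    """Features for token i, including the previously assigned chunk tag."""
    word, pos = sentence[i]

    # the first word has no previous POS tag and no previous chunk tag
    if i == 0:
        prevpos, prev_iob = "<START>", "<START>"
    else:
        prevpos = sentence[i - 1][1]
        prev_iob = history[-1]

    return {"pos": pos, "prevpos": prevpos, "word": word,
            "prev_iob": prev_iob}

sentence = [("to", "TO"), ("san", "JJ"), ("francisco", "NN")]
history = ["O", "B-toloc.city_name"]  # chunk tags assigned so far

first = npchunk_features_with_history(sentence, 0, [])
third = npchunk_features_with_history(sentence, 2, history)
print(third)
```

Knowing that the previous token opened a `toloc.city_name` chunk gives the classifier a direct cue that 'francisco' may continue it.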

In [None]:
# train and evaluate the chunker 
chunker = ConsecutiveNPChunker(train_trees)
print(chunker.evaluate(valid_trees))

In [None]:
chunker.tagger.classifier.show_most_informative_features(15)

We can also try other classifiers that come with NLTK - let's try building a **decision tree**.

In [None]:
# Decision Tree Classifier
class ConsecutiveNPChunkTagger(nltk.TaggerI): 

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            # compute features for each word
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history) 
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.DecisionTreeClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI): 
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
chunker = ConsecutiveNPChunker(train_trees)
print(chunker.evaluate(valid_trees))

In [None]:
print(chunker.evaluate(test_trees))

## Conditional Random Fields (CRFs)

In [None]:
train_labels