# Bigram (k-gram) indexes

Maintain a second inverted index that maps each bigram to the dictionary terms containing it. A `$` marks the start and end of a term, so `$m` matches terms that begin with `m`.

For example:

```
$m -> mace, madden, monday, ...
mo -> monday, among, amortize, ...
on -> among, axon, ...
```

### Advantages

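1. Fast - a wildcard query reduces to looking up a few bigrams and intersecting their term lists
2. Space efficient - compared to a permuterm index, which stores every rotation of each term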

In [7]:
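import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded

sentence = "April is the cruelest month"
dictionary = {}  # bigram -> list of terms containing that bigram

for word in nltk.word_tokenize(sentence):
    term = word.lower()
    # Pad with "$" so the bigrams capture the start and end of the term
    processed_word = "$" + term + "$"
    for gram in nltk.ngrams(processed_word, 2):
        kgram = "".join(gram)
        print(kgram)
        # Create the term list on first sight of a bigram
        postings = dictionary.setdefault(kgram, [])
        # Skip terms already listed (a bigram can repeat within one term)
        if term not in postings:
            postings.append(term)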

$a
ap
pr
ri
il
l$
$i
is
s$
$t
th
he
e$
$c
cr
ru
ue
el
le
es
st
t$
$m
mo
on
nt
th
h$


### Moving forward
Store the n-grams as keys, each mapping to the list of words that contain it. Queries can then be answered just as with an ordinary inverted index: look up each bigram and intersect the resulting lists.


#### Wildcard queries

- `mon*` can now be interpreted as `$m` AND `mo` AND `on`
- `hel*o` can now be interpreted as `$h` AND `he` AND `el` AND `o$`
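
The translation above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the helper names (`build_bigram_index`, `wildcard_bigrams`, `candidates`) and the small term list are assumptions made for the example, and the index uses sets instead of the lists built earlier so that AND becomes a set intersection.

```python
from collections import defaultdict

def build_bigram_index(terms):
    """Map each bigram of '$term$' to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        padded = "$" + term + "$"
        for i in range(len(padded) - 1):
            index[padded[i:i + 2]].add(term)
    return index

def wildcard_bigrams(pattern):
    """Return the bigrams that every term matching the pattern must contain."""
    grams = []
    # Split on "*" so that no bigram spans a wildcard position
    for piece in ("$" + pattern + "$").split("*"):
        grams.extend(piece[i:i + 2] for i in range(len(piece) - 1))
    return grams

def candidates(pattern, index):
    """AND together the term sets of all bigrams derived from the pattern."""
    grams = wildcard_bigrams(pattern)
    if not grams:
        return set()  # e.g. a bare "*" yields no bigrams to look up
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Hypothetical term list, chosen only for illustration
index = build_bigram_index(["monday", "month", "moon", "among", "hello", "help"])
print(wildcard_bigrams("mon*"))           # ['$m', 'mo', 'on']
print(wildcard_bigrams("hel*o"))          # ['$h', 'he', 'el', 'o$']
print(sorted(candidates("mon*", index)))  # ['monday', 'month', 'moon']
```

Note that `moon` appears among the candidates for `mon*` even though it does not match the pattern.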

#### False positives

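The bigram query can return false positives. For `mon*`, the term `moon` contains `$m`, `mo` and `on`, yet it does not match the pattern. The candidates therefore need a post-filtering step that checks each one against the original wildcard query.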
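
One possible way to post-filter is to test every candidate against the original pattern. The sketch below uses `fnmatchcase` from Python's standard library and assumes the query contains only `*` wildcards (`fnmatch` also gives `?` and `[...]` a special meaning). The function name `post_filter` and the candidate list are hypothetical.

```python
from fnmatch import fnmatchcase

def post_filter(pattern, candidate_terms):
    """Keep only the candidates that really match the wildcard pattern."""
    return sorted(t for t in candidate_terms if fnmatchcase(t, pattern))

# Candidates as a bigram lookup for `mon*` might return them
from_index = ["monday", "month", "moon"]
print(post_filter("mon*", from_index))  # ['monday', 'month']
```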

#### Shortcomings
1. Can still be very expensive, especially for queries that combine several wildcards, such as `pyth*` AND `prog*`.

In [6]:
print (dictionary["ap"])
print (dictionary["th"])

['april']
['the', 'month']
