# Phrase queries

A phrase query such as "Stanford University" should match only documents in which the words appear consecutively, as a phrase. For this it no longer suffices to store only `<term: doc>` entries.

## Biword indexes

Index every consecutive pair of terms in the text. For example:
```
friends, Romans, Countrymen
is now:
friends romans
romans countrymen
```

We store these bigrams as dictionary terms.
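
A minimal sketch of building such an index, assuming documents are given as a dict of doc ID to raw text. The `tokenize` helper and the `build_biword_index` name are illustrative choices, not part of the notes, and the tokenizer is deliberately simplistic.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters/digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_biword_index(docs):
    """Map each biword (two consecutive terms) to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

docs = {1: "Friends, Romans, Countrymen", 2: "Romans and countrymen"}
index = build_biword_index(docs)
print(index["friends romans"])     # {1}
print(index["romans countrymen"])  # {1}
```

Document 2 contains both `romans` and `countrymen`, but not next to each other, so it does not appear under the biword `romans countrymen`.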

### Longer phrase queries

Longer phrase queries such as `stanford university palo alto` are broken into biwords and processed as the Boolean query `stanford university` AND `university palo` AND `palo alto`.
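
The decomposition can be sketched as follows. The index contents and the function names are hypothetical, chosen only to illustrate the AND of biword postings.

```python
def phrase_to_biwords(phrase):
    """Split a phrase into its overlapping biwords."""
    terms = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def biword_phrase_query(index, phrase):
    """Intersect the postings of every biword in the phrase (Boolean AND)."""
    biwords = phrase_to_biwords(phrase)
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

# Hypothetical biword index: biword -> set of doc IDs.
index = {
    "stanford university": {1, 2},
    "university palo": {1, 2},
    "palo alto": {1, 2, 3},
}

print(phrase_to_biwords("stanford university palo alto"))
print(biword_phrase_query(index, "stanford university palo alto"))  # {1, 2}
```

The result is only a set of candidate documents: each biword is known to occur somewhere in the document, but nothing guarantees that they occur together as one contiguous four-word phrase.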


### Extended biwords

1. Parse the indexed text and perform part-of-speech tagging (POST).
2. Bucket the terms into nouns (N) and articles/prepositions (X).
3. Call any string of terms of the form `NX*N` an extended biword.

For example, `catcher in the rye` is tagged as:
- catcher: N
- in: X
- the: X
- rye: N

At query time, the query is segmented into extended biwords in the same way, giving `catcher rye`, which is then looked up in the index.
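
A rough sketch of the segmentation step. The `TAGS` table is a hypothetical hand-written lexicon standing in for a real part-of-speech tagger, and treating any untagged word as a pattern breaker is an assumption of this sketch.

```python
# Hypothetical toy lexicon: N = noun, X = article/preposition.
TAGS = {
    "catcher": "N", "rye": "N", "cost": "N", "overruns": "N",
    "power": "N", "plant": "N",
    "in": "X", "the": "X", "on": "X", "a": "X",
}

def extended_biwords(tokens, tags=TAGS):
    """Return one 'noun noun' pair for every span matching N X* N."""
    result = []
    prev_noun = None
    for token in tokens:
        tag = tags.get(token)
        if tag == "N":
            if prev_noun is not None:
                result.append(f"{prev_noun} {token}")
            prev_noun = token
        elif tag != "X":
            # Assumption: a word that is neither N nor X breaks the pattern.
            prev_noun = None
    return result

print(extended_biwords("catcher in the rye".split()))  # ['catcher rye']
```

Because the X terms are dropped, `catcher in the rye` and a hypothetical `catcher of rye` would map to the same extended biword.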

### Issues

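- False positives: for phrases longer than two words, every biword can match somewhere in a document without the full phrase appearing contiguously, and the index alone cannot tell the difference.
- Index blowup: the dictionary becomes much larger, which makes the approach infeasible for anything longer than biwords.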



## Positional indexes

For each term, store the positions at which it occurs in each document, alongside the document ID. For example:
```
hello
{2: 1,4,5}, {10: 5,9,120}, ...

world
{2: 2,3},...
```
Then `hello world` is a hit in docID `2`, where `hello` occurs at position `1` and `world` at position `2`.
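
A minimal sketch of phrase matching on this index, using only the postings shown above. The nested-dict layout and the `phrase_positions` name are assumptions of this sketch, and the list membership test favours clarity over speed; a real system would more likely merge the sorted position lists.

```python
# term -> {doc_id: sorted list of positions}, copied from the example above.
index = {
    "hello": {2: [1, 4, 5], 10: [5, 9, 120]},
    "world": {2: [2, 3]},
}

def phrase_positions(index, phrase):
    """Return {doc_id: [start positions]} where the terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return {}
    # Only documents containing every term can hold the phrase.
    common_docs = set(index[terms[0]])
    for term in terms[1:]:
        common_docs &= set(index[term])
    hits = {}
    for doc_id in sorted(common_docs):
        # The i-th term of the phrase must sit exactly i positions after the first.
        starts = [
            pos for pos in index[terms[0]][doc_id]
            if all((pos + i) in index[term][doc_id]
                   for i, term in enumerate(terms[1:], start=1))
        ]
        if starts:
            hits[doc_id] = starts
    return hits

print(phrase_positions(index, "hello world"))  # {2: [1]}
```

Document 10 is skipped because `world` has no postings for it in this excerpt, and the occurrences of `hello` at positions 4 and 5 in document 2 fail because `world` does not follow them.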

### Positional index size
- Need an entry for each occurrence, not just one per document.
- Index size depends on the average document size, since that determines how many positions there are in the first place.
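
To make the difference concrete, here is a small sketch that counts entries under each scheme. The toy documents and the `posting_counts` name are made up for illustration, and the counts ignore compression and per-entry overhead.

```python
def posting_counts(docs):
    """Return (non_positional, positional) entry counts for tokenized docs."""
    non_positional = 0
    positional = 0
    for tokens in docs.values():
        non_positional += len(set(tokens))  # one entry per distinct term per doc
        positional += len(tokens)           # one entry per occurrence
    return non_positional, positional

# Hypothetical toy collection.
docs = {
    1: "to be or not to be".split(),
    2: "to do".split(),
}

print(posting_counts(docs))  # (6, 8)
```

The gap widens as documents get longer, because a long document repeats terms many times while its number of distinct terms grows more slowly.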

### Rule of thumb

- A positional index is usually 2-4x larger than a non-positional index.
- Positional index size is ~35-50% of the volume of the original text.
- Caveat: All of this holds for English-like languages.