# Pruning the ranking and search space with heuristics

```python
# Faster cosine scoring
# Assumes `documents`, `get_postings(term)` and `weight(term, doc)` are
# provided by the index.
def get_cosine(query):
    scores = {doc: 0.0 for doc in documents}
    for term in query.split():
        # Query term weights are taken to be 1, so we never multiply
        # by the query term weights at all.
        for doc, tf in get_postings(term):
            scores[doc] += weight(term, doc)  # tf-idf weight of term in doc
    output = []
    for doc in scores:
        scores[doc] /= len(doc)  # Normalisation
        output.append((-scores[doc], doc))
    output.sort()  # negated scores, so the highest score comes first

    return output  # or choose to return top K only here
```

### Efficient cosine ranking

#### Find K docs in collection
1. Build a max-heap of all the docs, keyed on their cosine scores
2. Extract K documents

#### Pros:
This is faster than a full sort: the heap is built in `O(N)` and the top `K` docs are extracted in `O(K log N)`.

#### Cons:
We are still bottlenecked by the cosine calculation.
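
A minimal sketch of the heap-based selection, assuming the cosine scores have already been computed into a dict. The name `top_k_by_heap` and the sample scores are made up for illustration.

```python
import heapq

def top_k_by_heap(scores, k):
    """Return the k highest-scoring (doc, score) pairs, best first."""
    # heapq is a min-heap, so negated scores are stored to pop the largest first.
    heap = [(-score, doc) for doc, score in scores.items()]
    heapq.heapify(heap)                     # O(N)
    top = []
    for _ in range(min(k, len(heap))):      # K extractions, O(log N) each
        neg_score, doc = heapq.heappop(heap)
        top.append((doc, -neg_score))
    return top

# Hypothetical scores for four docs
scores = {"d1": 0.12, "d2": 0.80, "d3": 0.45, "d4": 0.30}
print(top_k_by_heap(scores, 2))
```

Python's `heapq.nlargest(k, ...)` gives the same result in one call; the explicit version is shown only to mirror the two steps above.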

### Index elimination

Extend `FastCosineScoring` to only consider documents
1. containing many query terms
2. with high idf query terms (rarer terms)

#### High idf query terms only
Low-idf terms such as stop words contribute very little to the overall score used for rank-ordering.

##### Pros:
- Postings of low-idf terms contain many docs, so skipping them removes all of those docs from the list of contenders
- Similar in spirit to stop word removal

#### Docs containing many query terms
For multi-term queries, only compute scores for docs containing several of the query terms,  
e.g. at least 3 out of 4 query terms.
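
Both filters can be sketched together. This is only an illustration under assumptions: idf is taken as `log10(N / df)`, and the thresholds `min_idf` and `min_terms` as well as the function name `eliminate` are hypothetical, not part of the notes.

```python
import math

def eliminate(query_terms, postings, n_docs, min_idf=1.0, min_terms=2):
    """Return (terms kept, candidate docs) after index elimination.

    postings maps each term to the set of doc ids containing it.
    """
    # 1. Keep only query terms with high idf (rarer terms).
    kept_terms = [
        t for t in query_terms
        if t in postings and math.log10(n_docs / len(postings[t])) >= min_idf
    ]
    # 2. Keep only docs containing enough of the remaining terms.
    counts = {}
    for term in kept_terms:
        for doc in postings[term]:
            counts[doc] = counts.get(doc, 0) + 1
    needed = min(min_terms, len(kept_terms))
    candidates = {doc for doc, c in counts.items() if c >= needed}
    return kept_terms, candidates

# Hypothetical collection of 1000 docs
postings = {
    "the": set(range(900)),       # very common, idf close to 0
    "gentle": {1, 2, 3, 5},
    "rain": {2, 3, 8},
}
kept, candidates = eliminate(["the", "gentle", "rain"], postings, n_docs=1000)
print(kept, candidates)
```

Cosine scores would then be computed only for the docs in `candidates`.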

### Champion list

Precompute, for each dictionary term `t`, the `r` docs of highest weight in t's postings.  
This gives us some foresight into which docs are likely to be most relevant for each query term before any user starts using the search engine.

- Note that `r` has to be chosen at indexing time, before `K` is known, so it is possible to have `r` < `K`.

At query time, only compute the scores for docs in the champion list of at least one query term
- pick the `K` top-scoring docs from among these
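
A small sketch of building and using champion lists, assuming the per-term document weights are available as a dict of dicts. The function names and sample weights are illustrative only.

```python
import heapq

def build_champion_lists(weights, r):
    """weights maps term -> {doc: weight}; keep the r heaviest docs per term."""
    return {
        term: set(heapq.nlargest(r, docs, key=docs.get))
        for term, docs in weights.items()
    }

def champion_candidates(query_terms, champions):
    """Union of the champion lists of the query terms."""
    candidates = set()
    for term in query_terms:
        candidates |= champions.get(term, set())
    return candidates

# Hypothetical tf-idf weights, precomputed at indexing time
weights = {
    "gentle": {"d1": 0.9, "d2": 0.4, "d3": 0.7},
    "rain": {"d2": 0.8, "d4": 0.6, "d5": 0.1},
}
champions = build_champion_lists(weights, r=2)
print(champion_candidates(["gentle", "rain"], champions))
```

Only the docs in the returned set are scored at query time, which is why fewer than `K` results can come back when `r` is small.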

#### High and low list

Similar to champion list, we maintain a `high` and `low` list.

- When traversing the postings for a query, traverse only the `high` lists first.
- If this yields fewer than `K` docs, continue with the `low` lists.

##### Pros:
Can be used even for simple cosine scores without global quality `g(d)`.

#### Tiered indexes
Generalising `high` and `low` lists into tiers.
- We break the postings up into a hierarchy of lists
```
Most important
...
Least important
```
- At query time, we use only the top tier; if it does not yield `K` docs, we drop down to the lower tiers
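
The tier fallback can be sketched as follows, assuming each tier is its own small inverted index and the tiers are listed from most to least important. The name `tiered_candidates` and the sample tiers are hypothetical.

```python
def tiered_candidates(query_terms, tiers, k):
    """Collect candidate docs tier by tier until at least k are found.

    tiers is a list of indexes (term -> set of docs), most important first.
    """
    candidates = set()
    for tier in tiers:
        for term in query_terms:
            candidates |= tier.get(term, set())
        if len(candidates) >= k:
            break  # enough docs, no need to drop to lower tiers
    return candidates

# Hypothetical three-tier index for one term
tiers = [
    {"rain": {"d1"}},
    {"rain": {"d2", "d3"}},
    {"rain": {"d4"}},
]
print(tiered_candidates(["rain"], tiers, k=2))
```

With two tiers this reduces to the `high` and `low` lists described above.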

### Impact-ordered postings

We only compute scores for docs with a high enough term weight `Wft,d`.

- We sort each postings list by `Wft,d`

#### Cons:
The postings lists no longer share a common order, so concurrent traversal is not possible and the usual merge operation cannot be used.

#### Early termination
1. Sort t's postings by descending `Wft,d`, where `Wft,d` is the weight of term t in doc d
2. When traversing t's postings, stop early after a fixed number of docs or once `Wft,d` drops below a chosen *threshold*.
3. Take the union of the results from each query term
4. Compute only the scores for the docs in the union
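
The four steps above can be sketched as below. The parameters `max_docs` and `threshold` and the sample postings are assumptions chosen for illustration.

```python
def early_termination_candidates(query_terms, impact_postings,
                                 max_docs=None, threshold=0.0):
    """Union of the docs reached before each term's traversal stops.

    impact_postings maps term -> list of (doc, weight), sorted by
    descending weight.
    """
    union = set()
    for term in query_terms:
        for i, (doc, weight) in enumerate(impact_postings.get(term, [])):
            hit_doc_limit = max_docs is not None and i >= max_docs
            if hit_doc_limit or weight < threshold:
                break  # everything after this point weighs even less
            union.add(doc)
    return union

# Hypothetical impact-ordered postings
impact_postings = {
    "gentle": [("d3", 0.9), ("d1", 0.5), ("d7", 0.1)],
    "rain": [("d1", 0.8), ("d4", 0.3), ("d9", 0.2)],
}
print(early_termination_candidates(["gentle", "rain"], impact_postings,
                                   max_docs=2, threshold=0.25))
```

Scores are then computed only for the docs in the returned union.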

#### Idf ordered terms

1. Look at postings of query terms in order of decreasing idf
2. As we update score contribution from each query term, stop if doc scores are relatively unchanged
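
One possible reading of these two steps, as a sketch: terms are processed in order of decreasing idf, and processing stops once a term's largest score update falls below a small `epsilon`. The stopping rule, `epsilon`, the `tf * idf` contribution and the function name are all assumptions, not something the notes specify.

```python
import math

def score_idf_ordered(query_terms, postings, n_docs, epsilon=0.01):
    """Accumulate scores term by term, rarest term first.

    postings maps term -> {doc: tf}. Returns (scores, terms processed).
    """
    idf = {t: math.log10(n_docs / len(postings[t]))
           for t in query_terms if t in postings}
    scores, processed = {}, []
    for term in sorted(idf, key=idf.get, reverse=True):
        largest_update = 0.0
        for doc, tf in postings[term].items():
            update = tf * idf[term]
            scores[doc] = scores.get(doc, 0.0) + update
            largest_update = max(largest_update, update)
        processed.append(term)
        if largest_update < epsilon:
            break  # scores barely moved, remaining terms have even lower idf
    return scores, processed

# Hypothetical collection of 4 docs
postings = {
    "rain": {"d1": 2, "d2": 1},
    "the": {"d1": 3, "d2": 1, "d3": 2, "d4": 5},
    "of": {"d1": 1, "d2": 1, "d3": 1, "d4": 1},
}
scores, processed = score_idf_ordered(["the", "rain", "of"], postings, n_docs=4)
print(processed, scores)
```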

### Cluster pruning (K-means clustering)

- Pick sqrt(N) docs at random and call these `centers`
- For other docs, pre-compute nearest `center`
    - Docs attached to a `center` are its followers
    - Each `center` is expected to have about sqrt(N) followers

Process a query as follows:
- Given a query Q, find the nearest `center` L
- Seek the K nearest docs from among L's followers and L itself
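
A compact sketch of both phases, assuming docs and queries are dense vectors compared by cosine similarity. The function names, the fixed random seed and the toy vectors are illustrative choices.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_clusters(doc_vectors, seed=0):
    """Pick sqrt(N) random centers and attach every other doc to its nearest one."""
    rng = random.Random(seed)
    ids = list(doc_vectors)
    centers = rng.sample(ids, max(1, math.isqrt(len(ids))))
    followers = {c: [c] for c in centers}  # each center also belongs to its own group
    for d in ids:
        if d not in followers:
            nearest = max(centers,
                          key=lambda c: cosine(doc_vectors[d], doc_vectors[c]))
            followers[nearest].append(d)
    return followers

def search(query_vec, doc_vectors, followers, k):
    """Find the nearest center, then rank only that center and its followers."""
    leader = max(followers, key=lambda c: cosine(query_vec, doc_vectors[c]))
    ranked = sorted(followers[leader],
                    key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 2-d document vectors
doc_vectors = {
    "d1": (1.0, 0.1), "d2": (0.9, 0.2), "d3": (0.8, 0.3),
    "d4": (0.1, 1.0), "d5": (0.2, 0.9), "d6": (0.3, 0.8),
    "d7": (0.7, 0.7), "d8": (0.6, 0.5), "d9": (0.5, 0.6),
}
followers = build_clusters(doc_vectors)
result = search((1.0, 0.0), doc_vectors, followers, k=2)
print(result)
```

Because only one group is searched, the result is an approximation: the true nearest docs may sit under a different center.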

#### Pros:
- Fast when we choose `centers` at random
- `centers` reflect data distribution
- Clustering can be done offline (indexing) as well

### Incorporating additional information

- We want top-ranking documents to be both relevant and authoritative
    - Relevance is being modeled by cosine scores
    - Authority is typically a query-independent property of a document

- Some examples of authority signals
    - Wikipedia among other websites
    - Articles in certain newspapers
    - A paper with many citations - quantitative
    - Many views, retweets, favourites, likes - quantitative
    - PageRank score - quantitative

#### Modelling authority
We assign a query-independent quality score in [0,1] to each document. Denote this by `g(d)`.

#### Net score
`net-score(q,d) = g(d) + cos(q,d)`: the quality score is **added** to the existing cosine score.  
Now we seek top `K` docs by `net-score`
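
As a small illustration of the formula, assuming cosine scores and quality scores are both available as dicts (the values below are made up):

```python
import heapq

def top_k_net_score(cosine_scores, quality, k):
    """Rank docs by net-score(q, d) = g(d) + cos(q, d)."""
    net = {doc: quality.get(doc, 0.0) + cos
           for doc, cos in cosine_scores.items()}
    return heapq.nlargest(k, net.items(), key=lambda item: item[1])

# Hypothetical scores: d1 is the most relevant, d2 the most authoritative
cosine_scores = {"d1": 0.75, "d2": 0.5, "d3": 0.5}
quality = {"d1": 0.125, "d2": 0.75, "d3": 0.25}
print(top_k_net_score(cosine_scores, quality, 2))
```

Here `d2` overtakes `d1` once authority is added, even though `d1` has the higher cosine score.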

#### Top K by net score
Fast method: order all postings by `g(d)`.  
Because every postings list then shares one common order, we can traverse the query terms' postings concurrently for both postings intersection and cosine score computation.

##### Why?
Under `g(d)` ordering, top-scoring docs are likely to appear early in postings traversal.
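
A sketch of the merge-style intersection under a common `g(d)` ordering. It assumes both lists are sorted by descending `g(d)` with ties broken by doc id; the tie-breaking rule and the names are assumptions.

```python
def intersect_by_quality(p1, p2, quality):
    """Intersect two postings lists that share the same g(d) ordering."""
    def order(doc):
        return (-quality[doc], doc)  # common sort key for all postings

    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif order(p1[i]) < order(p2[j]):
            i += 1
        else:
            j += 1
    return result

# Hypothetical quality scores and postings
quality = {"d1": 0.9, "d2": 0.2, "d3": 0.6, "d4": 0.4}
by_quality = lambda doc: (-quality[doc], doc)
p_gentle = sorted(["d2", "d1", "d3"], key=by_quality)
p_rain = sorted(["d3", "d4", "d2"], key=by_quality)
print(intersect_by_quality(p_gentle, p_rain, quality))
```

The matches come out in descending `g(d)`, so a traversal that stops early has already seen the highest-quality docs.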

### Combination of concepts
Champion list in `g(d)` ordering.  
- Maintain for each term a champion list of the `r` docs with highest g(d) + tf-idf instead of just tf-idf.
- Seek top `K` results from only the docs in the champion list

### Parametric and zone indexes
Documents can be divided into parts, e.g. fields.  
These parts contribute metadata about the document.

#### Zone indexing

A zone is a region of the doc that can contain an arbitrary amount of text, e.g.
- Title
- Abstract
- References

Build inverted indexes on zones as well, to permit querying of these zones, e.g. `Find docs with merchant in the title zone and matching the query gentle rain`

##### Approach 1
```
william.abstract: 11 -> 121 -> ...
william.title: ...
william.author: ...
```
Zones are encoded in the dictionary.  
**Pros:**  
We can search for specific zones without the need for retrieving all the postings
##### Approach 2
```
william: 2.author, 2.title -> 3.author -> 4.title -> ...
```
**Pros:**  
Each term has a single document frequency (df) covering all the documents, regardless of zone
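
Both layouts can be built side by side for comparison. This is a sketch under assumptions: the doc collection is a dict of zone texts, tokenisation is a plain lowercase split, and all names and sample docs are hypothetical.

```python
from collections import defaultdict

def build_zone_indexes(docs):
    """docs maps doc id -> {zone name: text}. Returns both index layouts."""
    in_dictionary = defaultdict(list)  # approach 1: "term.zone" -> [doc ids]
    in_postings = defaultdict(dict)    # approach 2: term -> {doc id: [zones]}
    for doc_id in sorted(docs):
        for zone, text in docs[doc_id].items():
            for term in set(text.lower().split()):
                in_dictionary[f"{term}.{zone}"].append(doc_id)
                in_postings[term].setdefault(doc_id, []).append(zone)
    return in_dictionary, in_postings

# Hypothetical docs with two zones each
docs = {
    2: {"author": "William Blake", "title": "William Tell"},
    3: {"author": "William Shakespeare", "title": "Hamlet"},
    4: {"author": "Rossini", "title": "William Tell Overture"},
}
in_dictionary, in_postings = build_zone_indexes(docs)
print(in_dictionary["william.title"])
print(in_postings["william"])
```

With approach 2, `len(in_postings["william"])` gives one document frequency for the term across all zones, whereas approach 1 answers a zone-specific lookup directly from the dictionary.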

### Query term proximity

Heuristic: Users prefer docs where the query terms occur closer to each other.  
Let `w` be the smallest window in a doc containing all the query terms
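
To make `w` concrete, here is a sliding-window sketch that measures it in tokens for a single tokenised doc. The function name and the sample sentence are illustrative.

```python
def smallest_window(doc_tokens, query_terms):
    """Width (in tokens) of the smallest window containing all query terms.

    Returns None if some query term does not occur in the doc.
    """
    needed = set(query_terms)
    if not needed:
        return 0
    counts = {}   # occurrences of each needed term inside the current window
    covered = 0   # number of distinct needed terms inside the current window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers every term.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = doc_tokens[left]
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))
```

For this doc the smallest window is `mercy is not strained`, so `w` is 4.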

### Query parsers

A free text query from the user may spawn one or more queries to the indexes, e.g. `NUS open day`.
#### Steps:

1. Run the query as a phrase query
2. If fewer than `K` docs contain the phrase `NUS open day`, run the two phrase queries `NUS open` and `open day`.
3. If we still have fewer than `K` docs, run the vector space query on `NUS open day`
4. Rank matching docs by vector space scoring
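
The cascade in steps 1 to 3 can be sketched over a tiny in-memory collection. This is an illustration under assumptions: docs are plain token lists, the vector space query is approximated by "contains any query term", and the final ranking of step 4 is left out.

```python
def contains_phrase(tokens, phrase):
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def run_query(query, docs, k):
    """Return matching doc ids in the order the cascade finds them."""
    terms = query.lower().split()
    results = []

    def add(matches):
        for doc_id in matches:
            if doc_id not in results:
                results.append(doc_id)

    # Step 1: the whole query as a phrase query
    add(d for d, toks in docs.items() if contains_phrase(toks, terms))
    # Step 2: shorter phrase queries built from adjacent term pairs
    if len(results) < k:
        for i in range(len(terms) - 1):
            pair = terms[i:i + 2]
            add(d for d, toks in docs.items() if contains_phrase(toks, pair))
    # Step 3: fall back to a free text (vector space) query
    if len(results) < k:
        add(d for d, toks in docs.items() if any(t in toks for t in terms))
    return results

# Hypothetical collection
docs = {
    "d1": "nus open day is in march".split(),
    "d2": "the nus open house".split(),
    "d3": "open day at the museum".split(),
    "d4": "a day at nus".split(),
}
print(run_query("NUS open day", docs, k=3))
```

A real parser would issue these queries against positional indexes and then rank the collected docs by vector space scoring.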

#### Alternatively: 
1. If we can traverse all the positional postings lists concurrently, we can find the smallest window containing all of the query terms
2. Based on that smallest window, we can then set the cap `w` to its size instead

The sequence of queries is issued by the query parser.