# Index Construction

Blocked sort-based indexing (BSBI), single-pass in-memory indexing (SPIMI), and distributed indexing can be used to make index construction more scalable.

## Hardware Limitations and Bottlenecks
`Seek time` - time the disk head takes to move to a random location on the disk

`Transfer time` - time to transfer a data block

- Access to data in memory is *much* faster than access to data on disk.
- No data is transferred during disk seeks, hence transferring *large chunks* of data from disk to memory is faster than transferring many small chunks. 
- Disk I/O is block-based: whole blocks are read and written, not individual bytes.



`s` - average seek time, about *8 ms*

`b` - transfer time per byte, about *0.006 μs*
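
A rough cost model makes the point about chunk size concrete. The sketch below assumes that every chunk costs one seek plus the per-byte transfer time, and ignores caching and rotational latency. The function name `read_time` and the 1 GB / 100,000-chunk figures are illustrative assumptions, not values from these notes.

```python
SEEK_TIME = 8e-3       # s: average seek time in seconds (8 ms)
TRANSFER_TIME = 6e-9   # b: transfer time per byte in seconds (0.006 μs)


def read_time(total_bytes: int, num_chunks: int) -> float:
    """Estimated seconds to read total_bytes split into num_chunks chunks.

    Assumes exactly one seek per chunk (a simplification).
    """
    return num_chunks * SEEK_TIME + total_bytes * TRANSFER_TIME


one_gb = 10**9
contiguous = read_time(one_gb, num_chunks=1)        # one large chunk
scattered = read_time(one_gb, num_chunks=100_000)   # 100,000 chunks of 10 KB
print(f"contiguous: {contiguous:.3f} s, scattered: {scattered:.3f} s")
```

Under these assumptions the transfer cost is the same in both cases; the difference comes entirely from the number of seeks.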

- Dataset used is [Reuters RCV1](https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection)
- It is a newswire collection from 1995 and 1996
- 800,000 documents; 100,000,000 non-positional postings; avg. bytes per token: 4.5; avg. bytes per term: 7.5
- The average number of bytes per token is lower than per term because short, very frequent words (many of them stop words) occur many times as tokens but count only once as a term.

## Blocked Sort-Based Indexing
- Idea is to sort with fewer disk seeks
- Each posting is an 8-byte record (termID, docID), with 4 bytes per field
- We now sort 8-byte records by termID (termID used instead of the term to increase speed)
- Basic idea: accumulate postings for each *block*, sort, then write to disk
- Then merge the sorted blocks into one long sorted list
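
The block-building step can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `bsbi_invert_blocks`, the `block_size` parameter, and the toy documents are all assumptions, and in-memory lists stand in for the files that would be written to disk.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, int]  # (termID, docID)


def bsbi_invert_blocks(
    docs: Iterable[Tuple[int, List[str]]],
    block_size: int,
) -> Tuple[List[List[Record]], Dict[str, int]]:
    """Cut the record stream into blocks and sort each block.

    Returns the sorted runs (lists standing in for files on disk)
    and the global term -> termID mapping.
    """
    term_ids: Dict[str, int] = {}
    runs: List[List[Record]] = []
    block: List[Record] = []
    for doc_id, tokens in docs:
        for token in tokens:
            # Assign the next free termID the first time a term is seen.
            term_id = term_ids.setdefault(token, len(term_ids))
            block.append((term_id, doc_id))
            if len(block) == block_size:
                runs.append(sorted(block))  # sort by termID, then docID
                block = []
    if block:  # flush the last, partially filled block
        runs.append(sorted(block))
    return runs, term_ids


docs = [(1, ["new", "york", "times"]), (2, ["new", "delhi"]), (3, ["york", "post"])]
runs, term_ids = bsbi_invert_blocks(docs, block_size=4)
print(runs)
```

Note that the term-to-termID mapping is shared across all blocks, which is what allows the sorted runs to be merged by termID later.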

### Sorting example
To sort 10 blocks of 10 million records each:

1. Accumulate the entries for a block, sort them in memory, and write the sorted block to disk.
2. Quicksort takes on the order of *N* ln *N* expected steps (in our case, 10 million × ln(10 million) steps per block).
3. Doing this for all 10 blocks costs 10 times this estimate and leaves 10 sorted *runs* of 10 million records each on disk.
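
The estimate can be worked out numerically. This is a back-of-the-envelope sketch that assumes a cost of N ln N steps per block and ignores constant factors; the variable names are illustrative.

```python
import math

RECORDS_PER_BLOCK = 10_000_000
NUM_BLOCKS = 10

# Expected quicksort cost per block, taken here as N * ln(N) (constants ignored).
steps_per_block = RECORDS_PER_BLOCK * math.log(RECORDS_PER_BLOCK)
total_steps = NUM_BLOCKS * steps_per_block
print(f"per block: {steps_per_block:.2e} steps, all blocks: {total_steps:.2e} steps")
```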

### Merging the blocks
First approach: binary merge with a merge tree. At each layer of the tree, read runs into memory in blocks of 10 million records, merge them, then write the merged result back to disk.

Second approach (n-way merge):
1. Read from all blocks simultaneously
2. Read a decent-sized chunk of each block into memory, merge, then write the result out in decent-sized output chunks. Because every read and write moves a large chunk, little time is lost to disk seeks.
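
The n-way merge can be sketched with Python's standard `heapq.merge`, which lazily pulls the smallest record across all sorted inputs. This is a minimal sketch under stated assumptions: the function name `merge_runs` is hypothetical, the runs are in-memory lists rather than files, and chunked reading and writing are left to whatever file layer would back the runs in practice.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, int]  # (termID, docID)


def merge_runs(runs: Iterable[Iterable[Record]]) -> Dict[int, List[int]]:
    """n-way merge of sorted runs into termID -> sorted docID postings lists."""
    merged = heapq.merge(*runs)  # one sorted stream over all runs
    index: Dict[int, List[int]] = {}
    for term_id, group in groupby(merged, key=itemgetter(0)):
        postings: List[int] = []
        for _, doc_id in group:
            # Skip a docID that repeats for the same term.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term_id] = postings
    return index


runs = [
    [(0, 1), (0, 2), (1, 1), (2, 1)],
    [(1, 3), (3, 2), (4, 3)],
]
index = merge_runs(runs)
print(index)
```

Because all runs are consumed in a single pass, each record is read once and written once, in contrast to the binary merge tree, which rereads and rewrites the data at every layer.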

## Single-Pass In-Memory Indexing
- Idea #1: Generate separate dictionaries for each block - no need to maintain term-termID mapping across blocks. 
- Idea #2: Build the 

