## Importing

The Natural Language Toolkit (NLTK) is a Python library for working with human language data.

The `word_tokenize` function from `nltk.tokenize` splits text into word and punctuation tokens.

The `PorterStemmer` class from `nltk.stem` implements the Porter stemming algorithm, which strips common suffixes to reduce words to a stem. The stem is not always a dictionary word (for example, `people` becomes `peopl`).

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import os

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Proshir-
[nltk_data]     Pc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## Open Documents
In this section, we read every document stored in a folder. The code uses the `os` module to list the folder, opens each file in read mode, and appends its text to a list named `documents`. A document's position in this list later serves as its document ID. Note that `os.listdir` returns entries in arbitrary order, so the IDs are not guaranteed to be the same on every machine.


In [10]:
folder_path = "docs"
documents = []
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    with open(file_path, 'r') as file:
        document_text = file.read()
        documents.append(document_text)
documents

['People joke that no one in Los Angeles reads; everyone watches TV, rents videos, or goes to the movies. The most popular reading material is comic books, movie magazines, and TV guides. City libraries have only 10 percent of the traffic that car washes have. But how do you explain this? An annual book festival in west Los Angeles is “sold out” year after year. People wait half an hour for a parking space to become available.This outdoor festival, sponsored by a newspaper, occurs every April for one weekend. This year’s attendance was estimated at 70,000 on Saturday and 75,000 on Sunday. The festival featured 280 exhibitors. There were about 90 talks given by authors, with an audience question-and-answer period following each talk. Autograph seekers sought out more than 150 authors. A food court sold all kinds of popular and ethnic foods, from American hamburgers to Hawaiian shave ice drinks. Except for a $7 parking fee, the festival was free. Even so, some people avoided the food cou
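
Since document IDs depend on the order of `documents`, a reproducible load order can be useful. A minimal sketch, assuming the files are UTF-8 text; the helper name `load_documents` is hypothetical and not part of this notebook:

```python
from pathlib import Path

def load_documents(folder_path, encoding="utf-8"):
    """Read every regular file in folder_path, sorted by file name."""
    folder = Path(folder_path)
    # Sorting the paths makes the list order (and so the document IDs) repeatable.
    paths = sorted(p for p in folder.iterdir() if p.is_file())
    return [p.read_text(encoding=encoding) for p in paths]
```

If the files use a different encoding, pass it explicitly, for example `load_documents("docs", encoding="cp1252")`.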

## Preprocess Document

In this section, we define `preprocess_document`, which turns a document's text into a list of index terms in three steps: tokenize with `word_tokenize`, keep only purely alphanumeric tokens, and stem each remaining token with `PorterStemmer` (which also lowercases it).


In [14]:
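def preprocess_document(document):
    # Tokenization: split the text into word and punctuation tokens
    words = word_tokenize(document)

    # Keep only purely alphanumeric tokens; this drops punctuation and also
    # any token that contains punctuation, such as '70,000'
    words = [word for word in words if word.isalnum()]

    # Stemming: reduce each token to its Porter stem (lowercased)
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in words]

    return stemmed_words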

In [15]:
preprocess_document(documents[0]) # sample

['peopl',
 'joke',
 'that',
 'no',
 'one',
 'in',
 'lo',
 'angel',
 'read',
 'everyon',
 'watch',
 'tv',
 'rent',
 'video',
 'or',
 'goe',
 'to',
 'the',
 'movi',
 'the',
 'most',
 'popular',
 'read',
 'materi',
 'is',
 'comic',
 'book',
 'movi',
 'magazin',
 'and',
 'tv',
 'guid',
 'citi',
 'librari',
 'have',
 'onli',
 '10',
 'percent',
 'of',
 'the',
 'traffic',
 'that',
 'car',
 'wash',
 'have',
 'but',
 'how',
 'do',
 'you',
 'explain',
 'thi',
 'an',
 'annual',
 'book',
 'festiv',
 'in',
 'west',
 'lo',
 'angel',
 'is',
 'sold',
 'out',
 'year',
 'after',
 'year',
 'peopl',
 'wait',
 'half',
 'an',
 'hour',
 'for',
 'a',
 'park',
 'space',
 'to',
 'becom',
 'outdoor',
 'festiv',
 'sponsor',
 'by',
 'a',
 'newspap',
 'occur',
 'everi',
 'april',
 'for',
 'one',
 'weekend',
 'thi',
 'year',
 's',
 'attend',
 'wa',
 'estim',
 'at',
 'on',
 'saturday',
 'and',
 'on',
 'sunday',
 'the',
 'festiv',
 'featur',
 '280',
 'exhibitor',
 'there',
 'were',
 'about',
 '90',
 'talk',
 'given',
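
The `isalnum` filter drops more than standalone punctuation: any token that contains a non-alphanumeric character is removed whole, which is why `70,000` and `75,000` are missing from the output above. A small sketch with a hand-written token list (an assumption standing in for `word_tokenize` output, so it runs without NLTK):

```python
# Hand-written tokens standing in for word_tokenize output (assumption).
tokens = ["attendance", "was", "estimated", "at", "70,000", "on", "Saturday", "."]

# Same filter as in preprocess_document.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)     # ['attendance', 'was', 'estimated', 'at', 'on', 'Saturday']
print(dropped)  # ['70,000', '.']
```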


## Building an Inverted Index Block

This section defines `build_inverted_index_block`, which builds an inverted index from a list of documents. Each document is passed through `preprocess_document`, and every resulting term is mapped to the IDs of the documents that contain it. A document's ID is its position in the input list. The ID is appended once per occurrence, so a term that appears three times in document 0 has `0` three times in its list.


In [50]:
def build_inverted_index_block(documents):
    inverted_index = {}

    for doc_id, document in enumerate(documents):
        # Preprocess the document to obtain a list of terms
        terms = preprocess_document(document)
        
        # Append the document ID once per occurrence of the term (IDs can repeat)
        for term in terms:
            if term not in inverted_index:
                inverted_index[term] = [doc_id]
            else:
                inverted_index[term].append(doc_id)

    return inverted_index

In [51]:
build_inverted_index_block(documents) # test

{'peopl': [0, 0, 0, 1, 7, 8, 12, 12, 14],
 'joke': [0],
 'that': [0,
  0,
  1,
  2,
  2,
  2,
  3,
  3,
  3,
  3,
  4,
  4,
  4,
  4,
  4,
  4,
  5,
  6,
  6,
  6,
  6,
  6,
  6,
  7,
  7,
  7,
  7,
  7,
  8,
  8,
  8,
  9,
  9,
  10,
  10,
  11,
  11,
  12,
  12,
  12,
  12,
  14,
  14,
  14],
 'no': [0, 1, 1, 4, 4, 7, 12, 14],
 'one': [0,
  0,
  0,
  0,
  0,
  1,
  1,
  2,
  4,
  6,
  6,
  6,
  8,
  8,
  8,
  9,
  9,
  9,
  9,
  10,
  11,
  11,
  12,
  14,
  14],
 'in': [0,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  2,
  2,
  2,
  3,
  3,
  3,
  3,
  3,
  4,
  4,
  4,
  4,
  4,
  4,
  5,
  5,
  5,
  5,
  5,
  5,
  6,
  6,
  6,
  6,
  6,
  7,
  7,
  7,
  7,
  8,
  8,
  8,
  8,
  9,
  9,
  10,
  10,
  11,
  11,
  12,
  12,
  12,
  13,
  14],
 'lo': [0, 0, 0, 6, 6],
 'angel': [0, 0, 0, 6, 6],
 'read': [0, 0],
 'everyon': [0, 4, 4, 4, 11, 13],
 'watch': [0, 4, 9, 9, 11, 11],
 'tv': [0, 0, 2, 11],
 'rent': [0, 8, 8, 8, 12, 12, 12],
 'video': [0, 9, 9, 9, 
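
A classic postings list records each document only once per term. A minimal sketch of that variant, assuming the documents are already tokenized; `build_postings` is a hypothetical name, not a function from this notebook:

```python
def build_postings(tokenized_docs):
    """Map each term to a sorted list of unique document IDs."""
    postings = {}
    for doc_id, terms in enumerate(tokenized_docs):
        for term in set(terms):  # count each term once per document
            postings.setdefault(term, []).append(doc_id)
    # Each list is already sorted because doc_id only increases.
    return postings

docs = [["peopl", "joke", "peopl"], ["peopl", "read"], ["joke"]]
index = build_postings(docs)
print(index["peopl"])  # [0, 1]
print(index["joke"])   # [0, 2]
```

Unique, increasing IDs are also what gap-based compression expects, since every gap is then at least 1.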

## Gamma Encoding Function
This section defines `gamma_encode`, which applies Elias gamma coding to a positive integer. Gamma coding is a variable-length code in which small numbers get short codes, so it suits sequences dominated by small values. For a number whose binary representation has N + 1 bits, the code is N zeros followed by that binary representation: 5 is `101` in binary, so its code is `00101`.



In [53]:
def gamma_encode(number):
    # Binary digits of the number without the '0b' prefix and without the
    # leading 1 bit (only valid for number >= 1)
    offset = bin(number)[3:]
    
    # Length prefix of len(offset) zeros, then the leading 1 bit, then the offset bits
    return '0' * len(offset) + '1' + offset

In [32]:
gamma_encode(5) # test

'00101'

In [26]:
gamma_encode(100) # test

'0000001100100'
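
Decoding reverses these steps: count the leading zeros, then read that many bits plus one as a binary number. A minimal sketch, assuming a well-formed bit string; `gamma_decode_stream` is a hypothetical helper that is not defined elsewhere in this notebook:

```python
def gamma_decode_stream(bits):
    """Decode a concatenation of Elias gamma codes into a list of integers."""
    numbers = []
    pos = 0
    while pos < len(bits):
        # Count leading zeros: the number of offset bits that follow the '1'.
        zeros = 0
        while bits[pos + zeros] == "0":
            zeros += 1
        # The next zeros + 1 bits are the binary representation of the number.
        end = pos + 2 * zeros + 1
        numbers.append(int(bits[pos + zeros:end], 2))
        pos = end
    return numbers

print(gamma_decode_stream("00101"))               # [5]
print(gamma_decode_stream("001010000001100100"))  # [5, 100]
```

Because each code announces its own length, codes can be concatenated without separators and still be split apart unambiguously.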

## Document ID Encoding Function

This section defines `encode_document_ids`, which compresses a list of document IDs by computing the gaps between consecutive IDs and gamma-encoding each gap. The first ID is not part of the output, so a decoder needs it separately. Because gamma coding only represents positive integers, the list must be strictly increasing.


In [55]:
def encode_document_ids(doc_ids):
    # Gaps between consecutive IDs; the first ID itself is not encoded
    gaps = [doc_ids[i + 1] - doc_ids[i] for i in range(len(doc_ids) - 1)]
    
    # Apply gamma encoding to each gap to obtain a list of gamma codes
    gamma_codes = [gamma_encode(gap) for gap in gaps]
    return "".join(gamma_codes)

In [31]:
encode_document_ids([0,5,105]) # test

'001010000001100100'
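
The gap transformation is reversible as long as the first ID is kept. A minimal sketch of the round trip, leaving out the gamma step; `to_gaps` and `from_gaps` are hypothetical helper names, and `accumulate(..., initial=...)` assumes Python 3.8 or later:

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Return the first ID and the gaps between consecutive IDs."""
    first = doc_ids[0]
    gaps = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return first, gaps

def from_gaps(first, gaps):
    """Rebuild the ID list by cumulatively summing the gaps onto the first ID."""
    return list(accumulate(gaps, initial=first))

first, gaps = to_gaps([0, 5, 105])
print(first, gaps)             # 0 [5, 100]
print(from_gaps(first, gaps))  # [0, 5, 105]
```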

## Index Block Merging Function
This section defines `merge_index_blocks`, which combines the encoded index blocks into a single string by concatenating them in order. In the Blocked Sort-Based Indexing (BSBI) algorithm, the merge step combines the per-block indexes into the final inverted index; here the blocks are already bit strings, so they are simply joined.



In [37]:
def merge_index_blocks(blocks):
    # Concatenate the encoded blocks in order
    return "".join(blocks)
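
For comparison, a term-wise merge combines the postings of each term across blocks rather than joining strings. A minimal in-memory sketch, assuming each block maps terms to lists of globally unique document IDs; `merge_postings_blocks` is a hypothetical name, and a full BSBI implementation would merge sorted runs read from disk instead:

```python
def merge_postings_blocks(blocks):
    """Merge per-block indexes (term -> doc IDs) into one index."""
    merged = {}
    for block in blocks:
        for term, doc_ids in block.items():
            merged.setdefault(term, []).extend(doc_ids)
    # Sort terms and postings, and drop duplicate IDs.
    return {term: sorted(set(ids)) for term, ids in sorted(merged.items())}

block_a = {"book": [0, 2], "festiv": [1]}
block_b = {"book": [4], "read": [5]}
print(merge_postings_blocks([block_a, block_b]))
# {'book': [0, 2, 4], 'festiv': [1], 'read': [5]}
```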

## Processing Blocks Function
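This section defines `process_blocks`, which strings together the steps of a simplified Blocked Sort-Based Indexing (BSBI) pipeline. The function splits the documents into blocks of `block_size`, builds an inverted index for each block, flattens that block's postings lists into one sequence of document IDs, gamma-encodes the gaps in that sequence, and concatenates the encoded blocks. Document IDs restart at 0 in every block, and the flattened sequence contains repeated and decreasing IDs, so many gaps are zero or negative and fall outside the range gamma coding can represent. The resulting bit string therefore cannot be decoded back into the index.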



In [56]:
def process_blocks(documents, block_size):
    encoded_blocks = []

    for i in range(0, len(documents), block_size):
        block = documents[i:i + block_size]
        # Build the inverted index block for the current block of documents
        inverted_index_block = build_inverted_index_block(block)
        # Encode document IDs using gamma coding for the inverted index block
        encoded_block = encode_document_ids([doc_id for doc_ids in inverted_index_block.values() for doc_id in doc_ids])
        encoded_blocks.append(encoded_block)

    return merge_index_blocks(encoded_blocks)
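
One way to keep document IDs unique across blocks is to carry each block's starting offset along with it. A minimal sketch; `iter_blocks` is a hypothetical helper and the short strings stand in for real documents:

```python
def iter_blocks(documents, block_size):
    """Yield (first_doc_id, block) pairs covering documents in order."""
    for start in range(0, len(documents), block_size):
        yield start, documents[start:start + block_size]

docs = ["d0", "d1", "d2", "d3", "d4"]
for start, block in iter_blocks(docs, 2):
    # enumerate(..., start=start) keeps IDs global instead of restarting at 0.
    print([(doc_id, doc) for doc_id, doc in enumerate(block, start=start)])
# [(0, 'd0'), (1, 'd1')]
# [(2, 'd2'), (3, 'd3')]
# [(4, 'd4')]
```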

Applying the `process_blocks` Function with Block Size 4

In [57]:
process_blocks(documents, 4) 

'111011111111111100111110111111111001101111111111111111111111100111111111111110100011011111100110011001111111111111111111111111111111100111111111111111111111111111111111111111111111111111100111111111111010111111111111001111111111111111111111111111111111100111111111010110011101010011011111111111111111100111101010011011111111111110011110110011111111110011111101011100111111110011011111111111111011111111111110011101100111011001111010001110100011011111111111110011111111111111111111111111111111111111110011111111111111001101110111011111010100110111111111111111111111100110101011001111010111100110111111111011111010111100111111111100110111111010100111101000110111110100011011111011111101000111101111110111111111111110100011111111100110101110011110110011111101111111001101101111110111101111011100111110111001111101010011111111111111111001111111111100111111111001101111111001111111100110111111101111010100111111111001101110100011101000110111101111001111111110100011011111111111111101111111111111101110100