<a href="https://colab.research.google.com/github/pedrogengo/CISI_BM25/blob/main/Downloading_and_Processing_CISI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description

This notebook downloads and processes the CISI dataset (Centre for Inventions and Scientific Information). The data is split across 4 files:

- cisi.all: text file containing 1460 documents.
- cisi.qry: text file containing 112 queries.
- cisi.rel: text file containing pairs of queries and relevant documents.
- cisi.bln: text file with a list of Boolean queries (not used here).

The most important part of processing this data is making sure that the number of extracted items matches the descriptions above and that the extraction neither captures extra text nor drops any.

## 1. Downloading and extracting the files

In [1]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz
!tar -xvzf cisi.tar.gz

--2023-02-20 02:36:54--  http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz
Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 775144 (757K) [application/gzip]
Saving to: ‘cisi.tar.gz’


2023-02-20 02:36:55 (1.15 MB/s) - ‘cisi.tar.gz’ saved [775144/775144]
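
Before parsing, it can help to confirm that the archive really contains the files the rest of the notebook expects. The sketch below uses the standard-library `tarfile` module; the helper names `list_archive_members` and `missing_files` are hypothetical, and the expected file names in the usage comment are assumed from the paths used in the parsing cells.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def missing_files(archive_path, expected):
    """Return the expected file names that are absent from the archive."""
    # Compare base names only, so members stored under a folder still count.
    present = {name.rsplit("/", 1)[-1] for name in list_archive_members(archive_path)}
    return sorted(set(expected) - present)


# Usage (assumes cisi.tar.gz was downloaded to the working directory):
# print(missing_files("cisi.tar.gz", ["CISI.ALL", "CISI.QRY", "CISI.REL"]))
```

An empty list from `missing_files` would mean every expected file is present.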



## 2. Processing the data

### CISI.ALL

Extract only the text that appears below `.W`.

```
.I 6
.T
Abstracting Concepts and Methods
.A
Borko, H.
.W
     Graduate library school study of abstracting should be more than a
how-to-do-it course.
It should include general material on the characteristics and types of abstracts.
.X
6 6 6
363 1 6
403 1 6
```
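
As an illustration of the rule above, here is a minimal line-by-line sketch that keeps only the text under `.W` and stops at the next field marker. It is one possible approach, not necessarily the one used in this notebook; the name `extract_w_sections` is hypothetical, and it assumes every field marker is a dot followed by a single uppercase letter at the start of a line.

```python
import re

# A field marker is a dot plus one uppercase letter at the start of a line,
# optionally followed by a space and a value (e.g. `.I 6`, `.T`, `.W`, `.X`).
MARKER = re.compile(r"^\.[A-Z]( |$)")


def extract_w_sections(text):
    """Return the text found under each `.W` marker, one string per record."""
    sections = []
    current = None  # lines of the `.W` field being read, or None outside one
    for line in text.splitlines():
        if MARKER.match(line):
            # Any marker closes the `.W` field that was being read.
            if current is not None:
                sections.append(" ".join(current))
                current = None
            if line.startswith(".W"):
                current = []
        elif current is not None and line.strip():
            current.append(line.strip())
    # Close a `.W` field that runs to the end of the input.
    if current is not None:
        sections.append(" ".join(current))
    return sections


sample = """.I 6
.T
Abstracting Concepts and Methods
.A
Borko, H.
.W
     Graduate library school study of abstracting should be more than a
how-to-do-it course.
.X
6 6 6
"""
print(extract_w_sections(sample))
```

The title (`.T`), author (`.A`) and cross-reference (`.X`) fields are skipped because lines are only collected while a `.W` field is open.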

### CISI.QRY

Extract only the text that appears below `.W`.

```
.I 21
.W
The need to provide personnel for the information field.
```
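
Each record starts with an `.I` identifier, and keeping that identifier next to the extracted text makes it easier to check that no query was skipped or merged with its neighbor. The sketch below is a hedged illustration: `parse_by_id` is a hypothetical helper, and it assumes each `.I <number>` header sits on its own line.

```python
import re


def parse_by_id(text):
    """Map each `.I` identifier to the whitespace-normalized text of its `.W` field."""
    records = {}
    # Splitting on the `.I <n>` header keeps the captured ids in the result:
    # [preamble, id1, body1, id2, body2, ...]
    parts = re.split(r"^\.I (\d+)[ \t]*$", text, flags=re.MULTILINE)
    for record_id, body in zip(parts[1::2], parts[2::2]):
        # Capture from the line after `.W` up to the next field marker or the end.
        match = re.search(
            r"^\.W[ \t]*\n(.*?)(?=^\.[A-Z]|\Z)",
            body,
            flags=re.MULTILINE | re.DOTALL,
        )
        if match:
            records[int(record_id)] = " ".join(match.group(1).split())
    return records


sample = """.I 21
.W
The need to provide personnel for the information field.
.I 22
.W
A second, made-up query used
only for this example.
"""
print(parse_by_id(sample))
```

With a dictionary keyed by identifier, a missing record shows up as a missing key rather than as a silent shift in list positions.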

### CISI.REL

Take the first and second numbers on each line as the (query, document) pair. The last two values (`0` and `0.000000`) can be ignored.

```
    21      6 0 0.000000
    21     14 0 0.000000
    21     22 0 0.000000
    21     85 0 0.000000
    21    171 0 0.000000
    21    185 0 0.000000
    21    186 0 0.000000
    21    303 0 0.000000
    21    339 0 0.000000
    21    392 0 0.000000
    21    400 0 0.000000
```
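
The rule above can also be sketched line by line: split each line on whitespace and keep only the first two fields. This is an illustrative alternative under that reading of the format, and `parse_rel_lines` is a hypothetical name.

```python
from collections import defaultdict


def parse_rel_lines(lines):
    """Group relevant document ids by query id, ignoring the trailing fields."""
    relevant = defaultdict(list)
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query_id, doc_id = int(fields[0]), int(fields[1])
        relevant[query_id].append(doc_id)
    return relevant


sample = [
    "    21      6 0 0.000000",
    "    21     14 0 0.000000",
    "    21     22 0 0.000000",
]
print(dict(parse_rel_lines(sample)))
```

Because `str.split()` with no arguments treats any whitespace as a separator, the same sketch works whether the columns are separated by spaces or tabs.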

In [8]:
import re
import math
from collections import defaultdict


def load_collection(path):
    """Load the CISI collection from a file."""
    with open(path, 'r') as f:
        collection = f.read()
    return collection

def parse_documents(collection):
    """Parse the documents in the CISI collection."""
    # Lazily capture the text between a `.W` marker and the next field marker (e.g. `.X`).
    document_pattern = re.compile(r'\.W\s+(.*?)\s+\.[A-Z]', re.DOTALL)
    documents = document_pattern.findall(collection)
    # Put each abstract on a single line.
    documents = [doc.replace("\n", " ").strip() for doc in documents]
    return documents

def parse_queries(path):
    """Parse the queries in the CISI queries file."""
    with open(path, 'r') as f:
        queries = f.read()
    # Lazily capture the text between a `.W` marker and the next field marker.
    query_pattern = re.compile(r'\.W\s*(.*?)\n+\.[A-Z]', re.DOTALL)
    queries = query_pattern.findall(queries)
    # Put each query on a single line.
    queries = [query.replace("\n", " ").strip() for query in queries]
    return queries

def parse_judgments(path):
    """Parse the relevance judgments in the CISI relevance judgments file."""
    with open(path, 'r') as f:
        judgments = f.read()
    # Capture the first two integers of each line: (query id, document id).
    # The trailing `0 0.000000` fields are not captured.
    judgment_pattern = re.compile(r'\s+(\d+)\s+(\d+)\s+')
    judgments = judgment_pattern.findall(judgments)
    # Map each query id to the list of its relevant document ids.
    judgments_dict = defaultdict(list)
    for query, document in judgments:
        judgments_dict[int(query)].append(int(document))
    return judgments_dict

In [9]:
collection = load_collection("CISI.ALL")
documents = parse_documents(collection)
queries = parse_queries("CISI.QRY")
judgments = parse_judgments("CISI.REL")

assert len(documents) == 1460
assert len(queries) == 112
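
The assertions above check the totals for documents and queries. A complementary check for the relevance judgments is to confirm that every pair points at an identifier that exists. The sketch below assumes identifiers are 1-based and contiguous (1 to 112 for queries, 1 to 1460 for documents); `find_invalid_judgments` is a hypothetical helper shown on toy data.

```python
def find_invalid_judgments(judgments, n_queries, n_documents):
    """Return (query_id, doc_id) pairs whose ids fall outside the collection."""
    invalid = []
    for query_id, doc_ids in judgments.items():
        for doc_id in doc_ids:
            query_ok = 1 <= query_id <= n_queries
            doc_ok = 1 <= doc_id <= n_documents
            if not (query_ok and doc_ok):
                invalid.append((query_id, doc_id))
    return invalid


# Toy data standing in for the parsed CISI.REL content.
toy_judgments = {1: [2, 3], 2: [1, 9], 5: [1]}
print(find_invalid_judgments(toy_judgments, n_queries=3, n_documents=4))
# → [(2, 9), (5, 1)]
```

On the real data, the call would be `find_invalid_judgments(judgments, len(queries), len(documents))`, and an empty list would suggest the parser did not pick up stray numbers.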

In [None]:
def tokenize(text):
    """Tokenize a document or query."""
    words = re.findall(r'\w+', text.lower())
    return words

def build_index(documents):
    """Build an inverted index from the documents."""
    index = {}
    doc_term_freqs = []

    for i, document in enumerate(documents):
        # Tokenize the document
        terms = tokenize(document)

        # Count the term frequencies
        term_freqs = {}
        for term in terms:
            term_freqs[term] = term_freqs.get(term, 0) + 1

        # Normalize the term frequencies by the length of the document
        doc_length = len(terms)
        term_freqs = {term: freq / doc_length for term, freq in term_freqs.items()}
        doc_term_freqs.append(term_freqs)

        # Add the document to the index for each term it contains
        for term in term_freqs:
            if term not in index:
                index[term] = []
            index[term].append((i, term_freqs[term]))

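    # Calculate the inverse document frequencies (the IDF form commonly used with BM25).
    # Note: the value is negative for terms that occur in more than half of the documents.
    N = len(documents)
    idfs = {
        term: math.log((N - len(postings) + 0.5) / (len(postings) + 0.5))
        for term, postings in index.items()
    }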

    # Return the inverted index and document term frequencies
    return index, doc_term_freqs, idfs
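
To make the behavior of the IDF formula in `build_index` concrete, the sketch below evaluates the same expression for a few document frequencies. The function name `bm25_style_idf` and the chosen frequencies are illustrative assumptions; only the formula and the collection size of 1460 come from the notebook.

```python
import math


def bm25_style_idf(n_docs, doc_freq):
    """Same expression as in build_index: log((N - n + 0.5) / (n + 0.5))."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


N = 1460
for doc_freq in (10, 730, 1200):
    # Rare terms get a positive weight, a term found in exactly half of the
    # documents gets zero, and more frequent terms get a negative weight.
    print(doc_freq, round(bm25_style_idf(N, doc_freq), 3))
```

The negative values for very common terms are a known property of this formula, so any scoring built on `idfs` may want to clip them at zero or filter such terms.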

In [5]:
index, doc_term_freqs, idfs = build_index(documents)
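
As a hedged illustration of how the returned structures could be used, the sketch below ranks documents by summing normalized term frequency times IDF over the query terms. This is a plain TF-IDF sum, not BM25, and not necessarily how this project ranks documents. `score_query` is a hypothetical helper, and the toy `index` and `idfs` only mimic the shapes returned by `build_index` (term to a list of `(document position, normalized frequency)` tuples, and term to IDF).

```python
from collections import defaultdict


def score_query(query_terms, index, idfs):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_position, term_freq in index.get(term, []):
            scores[doc_position] += term_freq * idfs[term]
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy structures with the same shapes as the outputs of build_index.
toy_index = {"library": [(0, 0.5), (2, 0.25)], "abstract": [(2, 0.25)]}
toy_idfs = {"library": 1.0, "abstract": 2.0}
print(score_query(["library", "abstract", "missing"], toy_index, toy_idfs))
# → [(2, 0.75), (0, 0.5)]
```

The document positions are 0-based list indices, so they are offset by one from the 1-based identifiers used in CISI.REL.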