

## Lab goals:

This lab demonstrates how to index a collection, and how to access the resulting index to inspect it and run some simple index analysis.

The **learning outcomes** of this notebook are:


*   PyTerrier setup.
*   Indexing a collection.
*   Accessing and exploring the index.

### **What is PyTerrier?**

**[PyTerrier](https://pyterrier.readthedocs.io/en/latest/)** is a Python framework that uses the underlying [Terrier information retrieval](http://terrier.org/) toolkit for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to perform IR experiments in Python while relying on the mature Terrier platform for the expensive indexing and retrieval operations.


### **Setup**
We will first install PyTerrier as follows:

In [41]:
#install the PyTerrier framework
!pip install python-terrier



The next step is to initialise PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().

In [42]:
import pyterrier as pt
if not pt.started():
  pt.init()


In [43]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
from tqdm import tqdm

### **What are DataFrames?**
[Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): Two-dimensional, size-mutable, potentially heterogeneous tabular data. Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

In [44]:
#create a new dataframe
my_df=pd.DataFrame([["Alice",25,50000],["Bob",35,690000],["Charlie",45,460000]],columns=['name','age','salary'])
my_df

Unnamed: 0,name,age,salary
0,Alice,25,50000
1,Bob,35,690000
2,Charlie,45,460000


In [45]:
#insert a new row
my_df.loc[len(my_df)]= ['David',24,90000]
my_df

Unnamed: 0,name,age,salary
0,Alice,25,50000
1,Bob,35,690000
2,Charlie,45,460000
3,David,24,90000


In [46]:
descriptions = [
    "Alice is a dedicated software engineer with a passion for solving complex problems. She has experience in backend development and cloud technologies.",
    "Bob is a results-driven project manager who specializes in agile methodologies. He ensures smooth communication across teams and timely delivery of projects.",
    "Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities and implementing security protocols to protect sensitive company data.",
    "David is a data scientist with expertise in machine learning and statistical modeling. He works on predictive analytics to support business decisions."
]
my_df["description"] = descriptions


my_df

Unnamed: 0,name,age,salary,description
0,Alice,25,50000,Alice is a dedicated software engineer with a passion for solving complex problems. She has experience in backend development and cloud technologies.
1,Bob,35,690000,Bob is a results-driven project manager who specializes in agile methodologies. He ensures smooth communication across teams and timely delivery o...
2,Charlie,45,460000,Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities and implementing security protocols to protect sensitive company data.
3,David,24,90000,David is a data scientist with expertise in machine learning and statistical modeling. He works on predictive analytics to support business decisi...


In [47]:
#print just name and salary
my_df[['name','salary']]

Unnamed: 0,name,salary
0,Alice,50000
1,Bob,690000
2,Charlie,460000
3,David,90000


In [48]:
#print the data about people with salary>60000
my_df[my_df['salary']>60000]

Unnamed: 0,name,age,salary,description
1,Bob,35,690000,Bob is a results-driven project manager who specializes in agile methodologies. He ensures smooth communication across teams and timely delivery o...
2,Charlie,45,460000,Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities and implementing security protocols to protect sensitive company data.
3,David,24,90000,David is a data scientist with expertise in machine learning and statistical modeling. He works on predictive analytics to support business decisi...


In [49]:
#increase the salary of all by 1000
def increase_salary(salary):
    return salary+1000

my_df["salary"]=my_df["salary"].apply(increase_salary)
my_df

Unnamed: 0,name,age,salary,description
0,Alice,25,51000,Alice is a dedicated software engineer with a passion for solving complex problems. She has experience in backend development and cloud technologies.
1,Bob,35,691000,Bob is a results-driven project manager who specializes in agile methodologies. He ensures smooth communication across teams and timely delivery o...
2,Charlie,45,461000,Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities and implementing security protocols to protect sensitive company data.
3,David,24,91000,David is a data scientist with expertise in machine learning and statistical modeling. He works on predictive analytics to support business decisi...


In [50]:
my_df["docno"] = range(1, len(my_df) + 1)
my_df

Unnamed: 0,name,age,salary,description,docno
0,Alice,25,51000,Alice is a dedicated software engineer with a passion for solving complex problems. She has experience in backend development and cloud technologies.,1
1,Bob,35,691000,Bob is a results-driven project manager who specializes in agile methodologies. He ensures smooth communication across teams and timely delivery o...,2
2,Charlie,45,461000,Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities and implementing security protocols to protect sensitive company data.,3
3,David,24,91000,David is a data scientist with expertise in machine learning and statistical modeling. He works on predictive analytics to support business decisi...,4


In [51]:
my_df["docno"] = my_df["docno"].astype(str)


In [52]:
indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)
# index the text, record the docnos as metadata
index_ref = indexer.index(my_df["description"], my_df["docno"])
index_ref.toString()


'./myFirstIndex/data.properties'

### **Explore the index**
An index has several data structures:

* **the CollectionStatistics** - the salient global statistics of the index.
* **the Lexicon** - the vocabulary of the index, including statistics of the terms, and a pointer into the inverted index.
* **the inverted index (a PostingIndex)** - contains the posting list for each term, detailing the frequency with which the term appears in each document.
* **the DocumentIndex** - contains the length of each document (and other field lengths).
* **the MetaIndex** - contains document metadata, such as the docno, and optionally the raw text and the URL of each document.
* **the direct index (also a PostingIndex)** - contains a posting list for each document, detailing which terms occur in that document and with which frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
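
To make these data structures concrete, here is a toy sketch in plain Python. It is not how Terrier stores its index; the collection, the variable names, and the dictionary layout are all made up for illustration. It only shows how the inverted index, the direct index, and the document index relate to each other.

```python
from collections import Counter, defaultdict

# Toy collection (hypothetical): docid -> already-preprocessed tokens.
docs = {
    0: ["cloud", "engin", "cloud"],
    1: ["data", "engin"],
    2: ["data", "scientist"],
}

# Direct index: docid -> {term: frequency in that document}.
direct_index = {docid: dict(Counter(tokens)) for docid, tokens in docs.items()}

# Inverted index: term -> list of (docid, frequency) postings, in docid order.
inverted_index = defaultdict(list)
for docid in sorted(direct_index):
    for term, freq in direct_index[docid].items():
        inverted_index[term].append((docid, freq))

# Document index: docid -> document length in tokens.
document_index = {docid: len(tokens) for docid, tokens in docs.items()}

print(inverted_index["data"])   # postings for one term
print(direct_index[0])          # terms of one document
print(document_index)           # lengths of all documents
```

The inverted index answers "which documents contain this term?", while the direct index answers "which terms does this document contain?".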


Let's check the files that the indexer created.

In [53]:
!ls -lh myFirstIndex/

total 48K
-rw-r--r-- 1 root root   20 Sep  9 19:45 data.direct.bf
-rw-r--r-- 1 root root   68 Sep  9 19:45 data.document.fsarrayfile
-rw-r--r-- 1 root root   30 Sep  9 19:45 data.inverted.bf
-rw-r--r-- 1 root root 4.7K Sep  9 19:45 data.lexicon.fsomapfile
-rw-r--r-- 1 root root  513 Sep  9 19:45 data.lexicon.fsomaphash
-rw-r--r-- 1 root root  220 Sep  9 19:45 data.lexicon.fsomapid
-rw-r--r-- 1 root root   32 Sep  9 19:45 data.meta-0.fsomapfile
-rw-r--r-- 1 root root   32 Sep  9 19:45 data.meta.idx
-rw-r--r-- 1 root root   52 Sep  9 19:45 data.meta.zdata
-rw-r--r-- 1 root root 4.1K Sep  9 19:45 data.properties


We can download the index to our own machine as follows (uncomment the lines when running on Google Colab):

In [54]:
# from google.colab import files
# !zip -r ./myFirstIndex.zip ./myFirstIndex
# files.download("myFirstIndex.zip")

Let's check the statistics about the index we created.

In [55]:
print(index_ref.toString())
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

./myFirstIndex/data.properties
Number of documents: 4
Number of terms: 55
Number of postings: 57
Number of fields: 0
Number of tokens: 58
Field names: []
Positions:   false



We can check the lexicon which is the **vocabulary** of the collection.

* Nt is the number of unique documents that each term occurs in.
* TF is the total number of occurrences – some weighting models use this instead of Nt.
* The numbers in the @{} are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.
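
As a rough illustration of these statistics, the sketch below computes Nt and TF for a made-up, already-tokenised collection in plain Python. The function name `lexicon_stats` and the data are hypothetical. The sketch also computes maxTF, which appears in the lexicon output; it is assumed here to be the largest frequency of the term within any single document.

```python
from collections import Counter

# Toy collection (hypothetical): each document is a list of preprocessed tokens.
docs = [
    ["data", "scientist", "data"],
    ["data", "analyst"],
    ["cloud", "engin"],
]

def lexicon_stats(docs):
    """Return {term: {"Nt": ..., "TF": ..., "maxTF": ...}} for a tokenised collection."""
    stats = {}
    for tokens in docs:
        for term, freq in Counter(tokens).items():
            entry = stats.setdefault(term, {"Nt": 0, "TF": 0, "maxTF": 0})
            entry["Nt"] += 1                            # one more document contains the term
            entry["TF"] += freq                         # add this document's occurrences
            entry["maxTF"] = max(entry["maxTF"], freq)  # assumed meaning of maxTF
    return stats

stats = lexicon_stats(docs)
for term in sorted(stats):
    print(term, stats[term])
```

Here "data" occurs in two documents (Nt=2) but three times overall (TF=3), which is why the two statistics can differ.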


In [56]:
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString()))

across -> term28 Nt=1 TF=1 maxTF=1 @{0 0 0} 
agil -> term19 Nt=1 TF=1 maxTF=1 @{0 0 4} 
alic -> term12 Nt=1 TF=1 maxTF=1 @{0 1 0} 
analyst -> term39 Nt=1 TF=1 maxTF=1 @{0 1 2} 
analyt -> term53 Nt=1 TF=1 maxTF=1 @{0 1 6} 
backend -> term2 Nt=1 TF=1 maxTF=1 @{0 2 4} 
bob -> term18 Nt=1 TF=1 maxTF=1 @{0 2 6} 
busi -> term48 Nt=1 TF=1 maxTF=1 @{0 3 2} 
charli -> term41 Nt=1 TF=1 maxTF=1 @{0 4 0} 
cloud -> term3 Nt=1 TF=1 maxTF=1 @{0 4 4} 
commun -> term24 Nt=1 TF=1 maxTF=1 @{0 4 6} 
compani -> term37 Nt=1 TF=1 maxTF=1 @{0 5 2} 
complex -> term6 Nt=1 TF=1 maxTF=1 @{0 5 6} 
cybersecur -> term38 Nt=1 TF=1 maxTF=1 @{0 6 0} 
data -> term33 Nt=2 TF=2 maxTF=1 @{0 6 4} 
david -> term44 Nt=1 TF=1 maxTF=1 @{0 7 2} 
decis -> term52 Nt=1 TF=1 maxTF=1 @{0 8 0} 
dedic -> term8 Nt=1 TF=1 maxTF=1 @{0 8 6} 
deliveri -> term13 Nt=1 TF=1 maxTF=1 @{0 9 0} 
develop -> term4 Nt=1 TF=1 maxTF=1 @{0 9 4} 
driven -> term14 Nt=1 TF=1 maxTF=1 @{0 9 6} 
engin -> term0 Nt=1 TF=1 maxTF=1 @{0 10 2} 
ensur -> term15 Nt=1

We can also look up a single term in PyTerrier's lexicon:

In [57]:
index.getLexicon()["work"].toString()

'term46 Nt=1 TF=1 maxTF=1 @{0 27 3}'

**The inverted index** tells us which documents each term occurs in.
The LexiconEntry is the pointer that tells us where to find the postings for that term in the inverted index.

Let's look at which documents a word such as "work" occurs in, and at its frequency in each document.

**Note:** we need to preprocess each search term with the same preprocessing steps we performed on the collection.
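
To see why the note above matters, here is a toy sketch in plain Python. The `crude_stem` function is a hypothetical stand-in that lowercases a word and strips a few suffixes; it is not the stemmer Terrier uses. The point is only that a lookup succeeds when the search term goes through the same function that was applied to the collection.

```python
def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: lowercase and strip a few suffixes."""
    token = token.lower()
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Build a toy lexicon by preprocessing the collection's words.
collection_words = ["cloud", "technologies", "works"]
lexicon = {crude_stem(word) for word in collection_words}

raw_query = "Technologies"
print(raw_query in lexicon)              # the raw term is not in the lexicon
print(crude_stem(raw_query) in lexicon)  # the preprocessed term is
```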

How many documents does the term "technolog" occur in?

In [58]:
term="technolog"
index.getLexicon()[term].getDocumentFrequency()

1

What terms occur in the 4th document?

In [59]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 3 #docids are 0-based #note: postings will be null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))

data with frequency 1
scientist with frequency 1
predict with frequency 1
david with frequency 1
model with frequency 1
work with frequency 1
machin with frequency 1
busi with frequency 1
learn with frequency 1
support with frequency 1
statist with frequency 1
decis with frequency 1
analyt with frequency 1
expertis with frequency 1


### **Build an index with positions**

In [60]:
indexer2 = pt.DFIndexer("./mySecondtIndex", blocks= True, overwrite=True)

# index the text, record the docnos as metadata
index_ref2 = indexer2.index(my_df["description"], my_df["docno"])
index_ref2.toString()

index2 = pt.IndexFactory.of(index_ref2)




In [61]:
term = "data"
# look up the term, then walk its posting list in the inverted index
try:
    pointer = index2.getLexicon()[term]
    for posting in index2.getInvertedIndex().getPostings(pointer):
        x = posting.toString()
        print(posting)
        # the positions are printed between "[" and "]", e.g. (2,1,B[13])
        start_pos = x.find("[")
        end_pos = x.find("]")
        # Extract the number between '[' and ']'
        number = x[start_pos + 1:end_pos]
        print("the term position is (start count from 0):", number, " in the docID = %d" % posting.getId())
        print(" where the doc len is %d " % posting.getDocumentLength())
except KeyError:
    # the lexicon lookup fails when the term is not in the index
    print("term %s not found" % term)


(2,1,B[13])
the term position is (start count from 0): 13  in the docID = 2
 where the doc len is 14 
(3,1,B[1])
the term position is (start count from 0): 1  in the docID = 3
 where the doc len is 14 


In [62]:
di = index2.getDirectIndex()
doi = index2.getDocumentIndex()
lex = index2.getLexicon()

# docids are 0-based, so docid 2 is the third document (docno "3")
docid_to_check = 2

try:
    # Get the DocumentEntry for the specified docid
    doc_entry = doi.getDocumentEntry(docid_to_check)

    if doc_entry is not None:
        # Look up the docno in the MetaIndex rather than hard-coding it
        docno = index2.getMetaIndex().getItem("docno", docid_to_check)
        print(f"Lexicon entries for terms in document with docno '{docno}' (docid {docid_to_check}):")
        # Iterate through the postings in the direct index for the document
        for posting in di.getPostings(doc_entry):
            termid = posting.getId()
            lee = lex.getLexiconEntry(termid)
            print(f"Term: {lee.getKey()} -> {lee.getValue().toString()}")
    else:
        print(f"Document with docid {docid_to_check} not found in the index.")

except Exception as e:
    print(f"An error occurred: {e}")

#"Charlie is a cybersecurity analyst who focuses on identifying vulnerabilities
#and implementing security protocols to protect sensitive company data.",
#(2,1,B[13])
#the term position is (start count from 0): 13  in the docID = 2
#where the doc len is 14


Lexicon entries for terms in document with docno '3' (docid 2):
Term: who -> term17 Nt=2 TF=2 maxTF=1 @{0 78 1}
Term: focus -> term29 Nt=1 TF=1 maxTF=1 @{0 35 2}
Term: vulner -> term30 Nt=1 TF=1 maxTF=1 @{0 76 6}
Term: protect -> term31 Nt=1 TF=1 maxTF=1 @{0 54 5}
Term: protocol -> term32 Nt=1 TF=1 maxTF=1 @{0 56 2}
Term: data -> term33 Nt=2 TF=2 maxTF=1 @{0 18 6}
Term: sensit -> term34 Nt=1 TF=1 maxTF=1 @{0 62 0}
Term: secur -> term35 Nt=1 TF=1 maxTF=1 @{0 60 3}
Term: implement -> term36 Nt=1 TF=1 maxTF=1 @{0 38 0}
Term: compani -> term37 Nt=1 TF=1 maxTF=1 @{0 14 7}
Term: cybersecur -> term38 Nt=1 TF=1 maxTF=1 @{0 17 5}
Term: analyst -> term39 Nt=1 TF=1 maxTF=1 @{0 3 7}
Term: identifi -> term40 Nt=1 TF=1 maxTF=1 @{0 36 5}
Term: charli -> term41 Nt=1 TF=1 maxTF=1 @{0 11 0}


In [63]:

di = index2.getDirectIndex()
doi = index2.getDocumentIndex()
lex = index2.getLexicon()

# docids are 0-based, so docid 3 is the fourth document (docno "4")
docid_to_check = 3

try:
    # Get the DocumentEntry for the specified docid
    doc_entry = doi.getDocumentEntry(docid_to_check)

    if doc_entry is not None:
        # Look up the docno in the MetaIndex rather than hard-coding it
        docno = index2.getMetaIndex().getItem("docno", docid_to_check)
        print(f"Lexicon entries for terms in document with docno '{docno}' (docid {docid_to_check}):")
        # Iterate through the postings in the direct index for the document
        for posting in di.getPostings(doc_entry):
            termid = posting.getId()
            lee = lex.getLexiconEntry(termid)
            print(f"Term: {lee.getKey()} -> {lee.getValue().toString()}")
    else:
        print(f"Document with docid {docid_to_check} not found in the index.")

except Exception as e:
    print(f"An error occurred: {e}")


#"David is a data scientist with expertise in machine learning and statistical modeling.
# He works on predictive analytics to support business decisions."
#(3,1,B[1])
#the term position is (start count from 0): 1  in the docID = 3
#where the doc len is 14

Lexicon entries for terms in document with docno '4' (docid 3):
Term: data -> term33 Nt=2 TF=2 maxTF=1 @{0 18 6}
Term: scientist -> term42 Nt=1 TF=1 maxTF=1 @{0 59 0}
Term: predict -> term43 Nt=1 TF=1 maxTF=1 @{0 48 7}
Term: david -> term44 Nt=1 TF=1 maxTF=1 @{0 21 2}
Term: model -> term45 Nt=1 TF=1 maxTF=1 @{0 45 7}
Term: work -> term46 Nt=1 TF=1 maxTF=1 @{0 80 5}
Term: machin -> term47 Nt=1 TF=1 maxTF=1 @{0 41 2}
Term: busi -> term48 Nt=1 TF=1 maxTF=1 @{0 9 1}
Term: learn -> term49 Nt=1 TF=1 maxTF=1 @{0 39 5}
Term: support -> term50 Nt=1 TF=1 maxTF=1 @{0 70 2}
Term: statist -> term51 Nt=1 TF=1 maxTF=1 @{0 68 5}
Term: decis -> term52 Nt=1 TF=1 maxTF=1 @{0 22 3}
Term: analyt -> term53 Nt=1 TF=1 maxTF=1 @{0 5 0}
Term: expertis -> term54 Nt=1 TF=1 maxTF=1 @{0 33 5}



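**Why are the positions different from the original text, and where are the other words?** The text was cleaned before it was indexed. Judging from the output above, stopwords such as "is", "a" and "to" were dropped and the remaining words were stemmed, and positions are counted over the remaining tokens only. For example, in "David is a data scientist ...", the term "data" is at position 1 (counting from 0) because "is" and "a" were removed, and the document length is 14 rather than the number of words in the original sentence.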
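
The positions reported above can be reproduced with a toy sketch in plain Python. The stopword list below is a small hand-picked set chosen for these two sentences, not Terrier's actual list, and stemming is left out because it does not change positions.

```python
import re

# Hand-picked stopwords for this illustration only (an assumption, not Terrier's list).
STOPWORDS = {"is", "a", "with", "in", "and", "he", "on", "to"}

def kept_tokens(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

charlie = ("Charlie is a cybersecurity analyst who focuses on identifying "
           "vulnerabilities and implementing security protocols to protect "
           "sensitive company data.")
david = ("David is a data scientist with expertise in machine learning and "
         "statistical modeling. He works on predictive analytics to support "
         "business decisions.")

for name, text in [("charlie", charlie), ("david", david)]:
    tokens = kept_tokens(text)
    print(name, "length:", len(tokens), "position of 'data':", tokens.index("data"))
```

Under these assumptions both documents keep 14 tokens, and "data" lands at position 13 and position 1 respectively, matching the postings printed earlier.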


### **Exercise 1**
How many documents mention "data"? Which documents are they?

### **Exercise 2**
Select any document from the collection and check which of its terms appear in the index.
