# IR Lab SoSe 2024: Dense Baseline (ANCE)

This Jupyter notebook serves as a baseline for dense retrieval using [ANCE](https://github.com/microsoft/ANCE/).
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)).

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to load the corpus. With both in place, we build an [ANCE](https://github.com/microsoft/ANCE/) retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier) using the [pyterrier_ance plugin](https://github.com/terrierteam/pyterrier_ance), which wraps the [ANCE code](https://arxiv.org/pdf/2007.00808.pdf).

In [8]:
# Need to install dependencies in Colab
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git
!pip3 install --upgrade git+https://github.com/tira-io/tira.git@development#\&subdirectory=python-client
!pip install faiss-cpu

Collecting git+https://github.com/terrierteam/pyterrier_ance.git
  Cloning https://github.com/terrierteam/pyterrier_ance.git to /tmp/pip-req-build-b0c3xsxn
  Running command git clone --filter=blob:none --quiet https://github.com/terrierteam/pyterrier_ance.git /tmp/pip-req-build-b0c3xsxn
  Resolved https://github.com/terrierteam/pyterrier_ance.git to commit f61f0981826bfca68c53a74326a61a4edcda6082
  Preparing metadata (setup.py) ... done
Collecting ANCE@ git+https://github.com/cmacdonald/ANCE.git
  Cloning https://github.com/cmacdonald/ANCE.git to /tmp/pip-install-vo79l4zp/ance_ebf0881f9cf5476fbd9d8f9d5ed2f197
  Running command git clone --filter=blob:none --quiet https://github.com/cmacdonald/ANCE.git /tmp/pip-install-vo79l4zp/ance_ebf0881f9cf5476fbd9d8f9d5ed2f197
  Resolved https://github.com/cmacdonald/ANCE.git to commit 0946eac34f738803ce8dfab7627a9d0af274e1af
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/tira-io/tira.git@developme

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
import pyterrier as pt

# do not truncate text in the dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [4]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the Dataset, the Index, and Define the Retrieval Pipeline


In [5]:
# load the dataset
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240411-training')

In [6]:
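# Declarative pipeline:
# Step 1: Retrieve the top 10 results with ANCE (`% 10` is the rank cutoff)
# Step 2: Add the document text (`>>` passes the results to the next stage)
pt_ance = tira.pt_ance.ance_retrieval(pt_dataset)

# `%` binds tighter than `>>`, so this reads as (pt_ance % 10) >> get_text(...)
ance = pt_ance % 10 >> pt.text.get_text(pt_dataset, "text")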


Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/2024-04-11-19-47-18.zip
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 344M/344M [00:11<00:00, 32.0MiB/s] 


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-lab-sose-2024/ir-acl-anthology-20240411-training/ows
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/Passage_ANCE_FirstP_Checkpoint.zip
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 1.19G/1.19G [00:39<00:00, 32.7MiB/s]


Download finished. Persist...
Download finished:  /root/.tira/raw_resources/Passage_ANCE_FirstP_Checkpoint.zip
Extracting checkpoint /root/.tira/raw_resources/Passage_ANCE_FirstP_Checkpoint.zip
Loading checkpoint /tmp/tmpufqrwo9r/Passage ANCE(FirstP) Checkpoint
Using mean: False
Loading shard metadata


Loading shards: 100%|██████████| 1/1 [00:00<00:00,  5.94shard/s]
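
To make the two operators in the pipeline above concrete: in PyTerrier, `%` is the rank cutoff operator and `>>` feeds the output of one stage into the next. The following is a minimal pandas sketch of that behaviour on toy data. It is not PyTerrier's implementation, and `rank_cutoff`, `add_text`, `toy_results`, and `toy_docstore` are hypothetical names introduced only for illustration.

```python
import pandas as pd


def rank_cutoff(results: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the k best-ranked rows of every query (ranks start at 0)."""
    return results[results["rank"] < k].reset_index(drop=True)


def add_text(results: pd.DataFrame, docstore: dict) -> pd.DataFrame:
    """Attach a 'text' column by looking up each docno in a document store."""
    out = results.copy()
    out["text"] = out["docno"].map(docstore)
    return out


# Toy ranking for two queries (hypothetical data, not taken from the corpus).
toy_results = pd.DataFrame({
    "qid": ["1", "1", "1", "2", "2"],
    "docno": ["d1", "d2", "d3", "d2", "d4"],
    "score": [3.0, 2.0, 1.0, 5.0, 4.0],
    "rank": [0, 1, 2, 0, 1],
})
toy_docstore = {
    "d1": "first document",
    "d2": "second document",
    "d3": "third document",
    "d4": "fourth document",
}

# Analogue of `retriever % 2 >> get_text`: cut off first, then add the text.
top2_with_text = add_text(rank_cutoff(toy_results, 2), toy_docstore)
print(top2_with_text)
```

Applying the cutoff before the text lookup means that text is only fetched for the documents that are actually kept.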


In [7]:
ance.search('bm25 rm3')

***** inference of 1 queries *****


Inferencing: 1it [00:00, 11.88it/s]


Not running in distributed mode
***** faiss search for 1 queries on 1 shards *****


100%|██████████| 1/1 [00:00<00:00, 35.47shard/s]


Unnamed: 0,qid,query,docid,docno,score,rank,text
0,1,bm25 rm3,79623,2011.sigirconf_conference-2011.121,709.886841,0,"When documents are very long, BM25 fails!\n\n\n ABSTRACTWe reveal that the Okapi BM25 retrieval function tends to overly penalize very long documents. To address this problem, we present a simple yet effective extension of BM25, namely BM25L, which ""shifts"" the term frequency normalization formula to boost scores of very long documents. Our experiments show that BM25L, with the same computation cost, is more effective and robust than the standard BM25.\nCategories and Subject Descriptors\nH.3.3 [Information Search and Retrieval]: Retrieval models\nGeneral TermsAlgorithms Keywords BM25, BM25L, term frequency, very long documents\nMOTIVATIONThe Okapi BM25 retrieval function has been the state-of-the-art for nearly two decades. BM25 scores a document D with respect to query Q as follows:where c(q, Q) is the count of q in Q, N is the total number of documents, df (q) is the document frequency of q, and k3 is a parameter. Following [1], we use a modified IDF formula in BM25 to avoid its problem of possibly negative IDF values. A key component of BM25 contributing to its success is its sub-linear term frequency (TF) normalization formula:where |D| represents document length, avdl stands for average document length, c(q, D) is the raw TF of q in D, and b and k1 are two parameters. c (q, D) is the normalized TF by document length using pivoted length normalization .Copyright is held by the author/owner(s). SIGIR'11, July 24-28, 2011, Beijing, China. ACM 978-1-4503-0757-4/11/07. \nBOOSTING VERY LONG DOCUMENTSIn order to avoid overly-penalizing very long documents, we need to add a constraint in TF normalization to make sure that the ""score gap"" of f One heuristic way to achieve this goal is to define f (q, D) as follows:"
1,1,bm25 rm3,94817,2004.cikm_conference-2004.6,708.151855,1,"Simple BM25 extension to multiple weighted fields\n\n\n ABSTRACTThis paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the nonlinear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection."
2,1,bm25 rm3,110064,2012.trec_conference-2012.15,708.033813,2,"DCU@TRECMed 2012: Using adhoc Baselines for Domain-Specific Retrieval\n\n\n This paper describes the first participation of DCU in the TREC Medical Records Track (TRECMed) 2012. We performed initial experiments on the the 2011 TRECMed data based on the BM25 retrieval model. Surprisingly, we found that the standard BM25 model with default parameters performs comparable to the best automatic runs submitted to TRECMed 2011 and our experiments would have ranked among the top four out of 29 participating groups. We expected that some form of domain adaptation would increase performance. However, results on the 2011 data proved otherwise: query expansion decreased performance, and filtering and reranking by term proximity also decreased performance slightly. We submitted four runs based on the BM25 retrieval model to TRECMed 2012 using standard BM25, standard query expansion, result filtering, and concept-based query expansion. Official results for 2012 confirm that domain-specific knowledge, as applied by us, does not increase performance compared to the BM25 baseline."
3,1,bm25 rm3,101702,2018.ictir_conference-2018.36,708.01062,3,"StatBM25: An Aggregative and Statistical Approach for Document Ranking\n\n\n ABSTRACTIn Information Retrieval and Web Search, BM25 is one of the most influential probabilistic retrieval formulas for document weighting and ranking. BM25 involves three parameters k 1 , k 3 and b, which provide scalar approximation and scaling of important document features such as term frequency, document frequency, and document length. We investigate in this paper aggregative and statistical document features for document ranking. Shortly speaking, a statistically adjusted BM25 is used to score in an aggregative way on virtual documents, which are generated by randomly combining documents from the original collection. The problem size, in the number of virtual documents to be ranked, is an expansion to the problem size of the original problem. As a result, ranking is actually realized through performing statistical sampling. Rejection Sampling, a simple Monte Carlo sampling method is used at present. This new framework is called StatBM25, in emphasizing first the fact that the original problem domain space is K-expanded (a concept to be further explained in the paper); Further, statistical sampling is employed in the model. Empirical studies are performed on several standard test collections, where StatBM25 demonstrates convincingly high degree of both uniqueness and effectiveness compared to BM25. This means, in our belief, that StatBM25 as a statistically smoothed and normalized variant to BM25, might eventually lead to discoveries of useful new statistic measures for document ranking."
4,1,bm25 rm3,102200,2012.wwwconf_conference-2012c.74,707.888,4,"H2RDF: adaptive query processing on RDF data in the cloud\n\n\n ABSTRACTIn this work we present H2RDF , a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Our system features two unique characteristics that enable efficient processing of both simple and multi-join SPARQL queries on virtually unlimited number of triples: Join algorithms that execute joins according to query selectivity to reduce processing; and adaptive choice among centralized and distributed (MapReduce-based) join execution for fast query responses. Our system efficiently answers both simple joins and complex multivariate queries and easily scales to 3 billion triples using a small cluster of 9 worker nodes. H2RDF outperforms state-of-the-art distributed solutions in multi-join and nonselective queries while achieving comparable performance to centralized solutions in selective queries. In this demonstration we showcase the system's functionality through an interactive GUI. Users will be able to execute predefined or custom-made SPARQL queries on datasets of different sizes, using different join algorithms. Moreover, they can repeat all queries utilizing a different number of cluster resources. Using real-time cluster monitoring and detailed statistics, participants will be able to understand the advantages of different execution schemes versus the input data as well as the scalability properties of H2RDF over both the data size and the available worker resources."
5,1,bm25 rm3,126534,2017.tois_journal-ir0volumeA36A2.11,707.834106,5,"Local Representative-Based Matrix Factorization for Cold-Start Recommendation\n\n\n Cold-start recommendation is one of the most challenging problems in recommender systems. An important approach to cold-start recommendation is to conduct an interview for new users, called the interview-based approach. Among the interview-based methods, Representative-Based Matrix Factorization (RBMF) provides an effective solution with appealing merits: it represents users over selected representative items, which makes the recommendations highly intuitive and interpretable. However, RBMF only utilizes a global set of representative items to model all users. Such a representation is somehow too strict and may not be flexible enough to capture varying users' interests. To address this problem, we propose a novel interview-based model to dynamically create meaningful user groups using decision trees and then select local representative items for different groups. A two-round interview is performed for a new user. In the first round, l 1 global questions are issued for group division, while in the second round, l 2 local-group-specific questions are given to derive local representation. We collect the feedback on the (l 1 + l 2 ) items to learn the user representations. By putting these steps together, we develop a joint optimization model, named local representative-based matrix factorization, for new user recommendations. Extensive experiments on three public datasets have demonstrated the effectiveness of the proposed model compared with several competitive baselines."
6,1,bm25 rm3,53598,P13-2109,707.796265,6,"Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers\n\n\n We present fast, accurate, direct nonprojective dependency parsers with thirdorder features. Our approach uses AD 3 , an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-ofthe-art accuracies for the largest datasets (English, Czech, and German)."
7,1,bm25 rm3,94788,2009.cikm_conference-2009.315,707.705261,7,"HDDBrs middleware for implementing highly available distributed databases\n\n\n ABSTRACTOur demo presents HDDB RS , a middle tier offering to clients a highly available distributed database interface using Reed Solomon codes to compute parity data. Parity data is stored in dedicated parity DB backends, is synchronously updated and allows recovering from multiple DB backend unavailability. HDDB RS middle tier is implemented in JAVA using standard technology, and is designed to be interoperable with any database engine that provides a JDBC driver and implements X/open XA protocol."
8,1,bm25 rm3,41184,W19-4331,707.695557,8,"Modality-based Factorization for Multimodal Fusion\n\n\n We propose a novel method, Modality-based Redundancy Reduction Fusion (MRRF), for understanding and modulating the relative contribution of each modality in multimodal inference tasks. This is achieved by obtaining an (M + 1)-way tensor to consider the high-order relationships between M modalities and the output layer of a neural network model. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, results in removing information present in a modality that can be compensated by other modalities, with respect to model outputs. This helps to understand the relative utility of information in each modality. In addition it leads to a less complicated model with less parameters and therefore could be applied as a regularizer avoiding overfitting. We have applied this method to three different multimodal datasets in sentiment analysis, personality trait recognition, and emotion recognition. We are able to recognize relationships and relative importance of different modalities in these tasks and achieves a 1% to 4% improvement on several evaluation measures compared to the state-of-the-art for all three tasks."
9,1,bm25 rm3,110712,2006.trec_conference-2006.51,707.596924,9,"MG4J at TREC 2006\n\n\n MG4J participated in the ad hoc task of the Terabyte Track (find all the relevant documents with high precision from 25.2 million pages from the .gov domain) at TREC 2006. It was the second time the MG4J group participated to TREC. For this year, we integrated standard techniques (such as stemming and BM25 scoring) into MG4J, and submitted also automatic runs based on trivial query expansion techniques."
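
Because the full text is not truncated, the ranking above is hard to scan. In the output, each `text` value starts with the title, followed by blank lines and then the abstract. Assuming this layout holds for all documents, a small helper can reduce a result frame to rank, docno, score, and title. This is a sketch, and `compact_view` and `example` are hypothetical names. The two example rows are copied from the output above, with their abstracts shortened.

```python
import pandas as pd


def compact_view(results: pd.DataFrame, max_title_len: int = 80) -> pd.DataFrame:
    """Return rank, docno, score and the first line of 'text' (the title)."""
    view = results[["rank", "docno", "score"]].copy()
    # The title is everything before the first line break; documents that
    # consist of a title only are kept unchanged.
    titles = results["text"].str.split("\n", n=1).str[0]
    view["title"] = titles.str.slice(0, max_title_len)
    return view


# Two rows from the result list above (abstracts shortened for brevity).
example = pd.DataFrame({
    "rank": [0, 1],
    "docno": ["2011.sigirconf_conference-2011.121", "2004.cikm_conference-2004.6"],
    "score": [709.886841, 708.151855],
    "text": [
        "When documents are very long, BM25 fails!\n\n\n ABSTRACTWe reveal that the Okapi BM25 retrieval function ...",
        "Simple BM25 extension to multiple weighted fields\n\n\n ABSTRACTThis paper describes a simple way ...",
    ],
})

compact = compact_view(example)
print(compact)
```

In the notebook, the same helper could be applied to the frame returned by `ance.search(...)`, since that frame contains the `rank`, `docno`, `score`, and `text` columns used here.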


In [8]:
ance.search('measure importance of web pages')

***** inference of 1 queries *****


Inferencing: 1it [00:00, 16.59it/s]


Not running in distributed mode
***** faiss search for 1 queries on 1 shards *****


100%|██████████| 1/1 [00:00<00:00, 34.52shard/s]


Unnamed: 0,qid,query,docid,docno,score,rank,text
0,1,measure importance of web pages,103415,2015.wwwconf_conference-2015.10,710.254028,0,"Essential Web Pages Are Easy to Find\n\n\n ABSTRACTIn this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries -i.e., we look at the problem as a version of MAXCOVER. Although in general MAXCOVER is approximable to within a factor of 1 − 1/e ≈ 0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answers 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal -as a first example, our preliminary experimental study confirms that our algorithm has extremely good performances on other (social network based) MAXCOVER instances."
1,1,measure importance of web pages,90593,2005.airs_conference-2005.48,710.203308,1,Calculating Webpage Importance with Site Structure Constraints
2,1,measure importance of web pages,84119,2003.sigirconf_conference-2003.61,709.946533,2,"Searchers' criteria For assessing web pages\n\n\n ABSTRACTWe investigate the criteria used by online searchers when assessing the relevance of web pages to information-seeking tasks. Twenty four searchers were given three tasks each, and indicated the features of web pages which they employed when deciding about the usefulness of the pages. These tasks were presented within the context of a simulated work-task situation. The results of this study provide a set of criteria used by searchers to decide about the utility of web pages. Such criteria have implications for the design of systems that use or recommend web pages, as well as to authors of web pages."
3,1,measure importance of web pages,97057,2013.cikm_conference-2013.350,709.746582,3,"READFAST: high-relevance search-engine for big text\n\n\n ABSTRACTRelevance of search-results is a key factor for any search engine. In order to return and rank the Web-pages that are most relevant to the query, contemporary search engines use complex ranking functions that depend on hundreds of features. For example, presence or absence of the query keywords on the page, their proximity, frequencies, HTML markup are just a few to name. Additional features might include fonts, tags, hyperlinks, metadata, and parts of the Web-page description. All this information is used by the search-engine to rank HTML Web pages returned to the user, but is unfortunately absent in free text that has no HTML markup, tags, hyperlinks, and any other metadata, except implicit natural language structure.Here we demonstrate one of the first Big text search engines that leverages hidden structure of the natural language sentences in order to process user queries and return more relevant search-results than a standard keyword-search. It provides a structured index extracted from the text using Natural Language Processing (NLP) that can be used to browse and query free text."
4,1,measure importance of web pages,80596,2014.sigirconf_conference-2014.150,709.584961,4,"Analyzing the content emphasis of web search engines\n\n\n ABSTRACTMillions of people search the Web each day. As a consequence, the ranking algorithms employed by Web search engines have a profound influence on which pages users visit. Characterizing this influence, and informing users when different engines favor certain sites or points of view, enables more transparent access to the Web's information.We present PAWS, a platform for analyzing differences among Web search engines. PAWS measures content emphasis: the degree to which differences across search engines' rankings correlate with features of the ranked content, including point of view (e.g., positive or negative orientation toward their company's products) and advertisements. We propose an approach for identifying the orientations in search results at scale, through a novel technique that minimizes the expected number of human judgments required. We apply PAWS to news search on Google and Bing, and find no evidence that the engines emphasize results that express positive orientation toward the engine company's products. We do find that the engines emphasize particular news sites, and that they also favor pages containing their company's advertisements, as opposed to competitor advertisements."
5,1,measure importance of web pages,80794,2008.sigirconf_conference-2008.59,709.374512,5,"BrowseRank: letting web users vote for page importance\n\n\n ABSTRACTThis paper proposes a new method for computing page importance, referred to as BrowseRank. The conventional approach to compute page importance is to exploit the link graph of the web and to build a model based on that graph. For instance, PageRank is such an algorithm, which employs a discrete-time Markov process as the model. Unfortunately, the link graph might be incomplete and inaccurate with respect to data for determining page importance, because links can be easily added and deleted by web content creators. In this paper, we propose computing page importance by using a 'user browsing graph' created from user behavior data. In this graph, vertices represent pages and directed edges represent transitions between pages in the users' web browsing history. Furthermore, the lengths of staying time spent on the pages by users are also included. The user browsing graph is more reliable than the link graph for inferring page importance. This paper further proposes using the continuous-time Markov process on the user browsing graph as a model and computing the stationary probability distribution of the process as page importance. An efficient algorithm for this computation has also been devised. In this way, we can leverage hundreds of millions of users' implicit voting on page importance. Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks."
6,1,measure importance of web pages,123811,2019.ipm_journal-ir0volumeA56A3.40,709.356201,6,"An efficient page ranking approach based on vector norms using sNorm(p) algorithm\n\n\n In the whole world, the internet is exercised by millions of people every day for information retrieval. Even for a small to smaller task like fixing a fan, to cook food or even to iron clothes persons opt to search the web. To fulfill the information needs of people, there are billions of web pages, each having a different degree of relevance to the topic of interest (TOI), scattered throughout the web but this huge size makes manual information retrieval impossible. The page ranking algorithm is an integral part of search engines as it arranges web pages associated with a queried TOI in order of their relevance level. It, therefore, plays an important role in regulating the search quality and user experience for information retrieval. PageRank, HITS, and SALSA are well-known page ranking algorithm based on link structure analysis of a seed set, but ranking given by them has not yet been efficient. In this paper, we propose a variant of SALSA to give sNorm(p) for the efficient ranking of web pages. Our approach relies on a p-Norm from Vector Norm family in a novel way for the ranking of web pages as Vector Norms can reduce the impact of low authority weight in hub weight calculation in an efficient way. Our study, then compares the rankings given by PageRank, HITS, SALSA, and sNorm(p) to the same pages in the same query. The effectiveness of the proposed approach over state of the art methods has been shown using performance measurement technique, Mean Reciprocal Rank (MRR), Precision, Mean Average Precision (MAP), Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG). The experimentation is performed on a dataset acquired after pre-processing of the results collected from initial few pages retrieved for a query by the Google search engine. 
Based on the type and amount of in-hand domain expertise 30 queries are designed. The extensive evaluation and result analysis are performed using MRR, Precision@k, MAP, DCG, and NDCG as the performance measuring statistical metrics. Furthermore, results are statistically verified using a significance test. Findings show that our approach outperforms state of the art methods by attaining 0.8666 as MRR value, 0.7957 as MAP value. Thus contributing to the improvement in the ranking of web pages more efficiently as compared to its counterparts."
7,1,measure importance of web pages,114536,2013.tweb_journal-ir0volumeA7A1.0,709.35614,7,"Measuring the Visual Complexities of Web Pages\n\n\n Visual complexities (VisComs) of Web pages significantly affect user experience, and automatic evaluation can facilitate a large number of Web-based applications. The construction of a model for measuring the VisComs of Web pages requires the extraction of typical features and learning based on labeled Web pages. However, as far as the authors are aware, little headway has been made on measuring VisCom in Web mining and machine learning. The present article provides a new approach combining Web mining techniques and machine learning algorithms for measuring the VisComs of Web pages. The structure of a Web page is first analyzed, and the layout is then extracted. Using a Web page as a semistructured image, three classes of features are extracted to construct a feature vector. The feature vector is fed into a learned measuring function to calculate the VisCom of the page.In the proposed approach of the present study, the type of the measuring function and its learning depend on the quantification strategy for VisCom. Aside from using a category and a score to represent VisCom as existing work, this study presents a new strategy utilizing a distribution to quantify the VisCom of a Web page. Empirical evaluation suggests the effectiveness of the proposed approach in terms of both features and learning algorithms."
8,1,measure importance of web pages,97037,2013.cikm_conference-2013.330,709.293213,8,"Incorporating the surfing behavior of web users into pagerank\n\n\n ABSTRACTIn large-scale commercial web search engines, estimating the importance of a web page is a crucial ingredient in ranking web search results. So far, to assess the importance of web pages, two different types of feedback have been taken into account, independent of each other: the feedback obtained from the hyperlink structure among the web pages (e.g., PageRank) or the web browsing patterns of users (e.g., BrowseRank). Unfortunately, both types of feedback have certain drawbacks. While the former lacks the user preferences and is vulnerable to malicious intent, the latter suffers from sparsity and hence low web coverage. In this work, we combine these two types of feedback under a hybrid page ranking model in order to alleviate the above-mentioned drawbacks. Our empirical results indicate that the proposed model leads to better estimation of page importance according to an evaluation metric that relies on user click feedback obtained from web search query logs. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits)."
9,1,measure importance of web pages,108369,2001.wwwconf_conference-2001p.53,709.262085,9,"Keeping Web Indices up-to-date\n\n\n Search engines play a crucial role in the Web. Without search engines large parts of the Web becomes inaccessible for the majority of users. Search engines can make new and smaller sites accessible at low cost. Without them, other media, such as Television, would be needed to advertise the existence new site on the Web, only large commercial sites can follow this path. The Web would be endangered to become dominated by a few, well known sites. A crucial problem of search engines is to keep their index up-to-date. Especially if the index grows, the effort needed to update the index increases, since Web documents are dynamic and thus already stored data becomes obsolete. There have been various attempts to monitor the evolvement of the Web [1][7]. However, we believe, that change model used in prior work overestimates the rate of change due to an inadequate change model. Our change model has been adapted from the information retrieval field to distinguish index relevant changes from irrelevant modifications in Web documents, e.g. simple spelling corrections or dynamic advertisement links. We have monitored multiple smaller collections of documents over a time period of six month to measure the documents change."
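
In both result lists above, the scores of the top 10 documents lie close together, even though the lists mix clearly on-topic papers with unrelated ones. The sketch below quantifies the gap between rank 0 and rank 9 for each query. The scores are copied from the two outputs above, and the names `scores` and `spread` are introduced for illustration. Whether such a small gap is typical for ANCE on this corpus is not something these two queries can establish.

```python
import pandas as pd

# Scores of ranks 0..9, copied from the two result tables above.
scores = pd.DataFrame({
    "query": ["bm25 rm3"] * 10 + ["measure importance of web pages"] * 10,
    "rank": list(range(10)) * 2,
    "score": [
        709.886841, 708.151855, 708.033813, 708.01062, 707.888,
        707.834106, 707.796265, 707.705261, 707.695557, 707.596924,
        710.254028, 710.203308, 709.946533, 709.746582, 709.584961,
        709.374512, 709.356201, 709.35614, 709.293213, 709.262085,
    ],
})

# Best and worst score within the top 10 of each query, and their difference.
spread = scores.groupby("query")["score"].agg(["max", "min"])
spread["gap"] = spread["max"] - spread["min"]
print(spread)
```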
