# IR Lab SoSe 2024: Splade Baseline

This Jupyter notebook serves as a baseline for learned sparse retrieval using [Splade](https://arxiv.org/abs/2107.05720).
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)).

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to load the corpus. On top of both, we build a [splade](https://github.com/naver/splade) retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier) using the [pyterrier splade plugin](https://github.com/cmacdonald/pyt_splade).

In [None]:
!pip3 install python-terrier
!pip install -q git+https://github.com/naver/splade.git git+https://github.com/cmacdonald/pyt_splade.git
!pip3 install --upgrade git+https://github.com/tira-io/tira.git@development#\&subdirectory=python-client


In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
import pyterrier as pt

# do not truncate text in the dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



### Step 2: Load the Dataset, the Index, and Define the Retrieval Pipeline


In [4]:
# load the dataset
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240411-training')

In [5]:
# load the pre-built Splade index for this dataset from TIRA
index = tira.pt_splade.splade_index(pt_dataset)

Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/2024-04-14-08-40-58.zip
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 206M/206M [00:10<00:00, 21.2MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-lab-sose-2024/ir-acl-anthology-20240411-training/naverlabseurope


In [6]:
# load the Splade model that expands queries into weighted terms
import pyt_splade
splade = pyt_splade.SpladeFactory('naver/splade-cocondenser-ensembledistil')

In [10]:
# Declarative pipeline:
# Step 1: expand the query into weighted terms with Splade and retrieve against the pre-built index
# Step 2: keep the top 10 results and add the document text

splade_retr = splade.query() >> pt.BatchRetrieve(index, wmodel='Tf')
splade_retr = splade_retr % 10 >> pt.text.get_text(pt_dataset, "text")

Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/ir-acl-anthology-20240411-inputs.zip?download=1
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 39.4M/39.4M [00:08<00:00, 4.74MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-lab-sose-2024/ir-acl-anthology-20240411-training/
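To make the scoring in this pipeline more concrete, the following is a minimal pure-Python sketch of what a sparse retriever with a term-frequency weighting model roughly computes: the score of a document is the sum, over all query terms, of the query weight times the weight stored for that term in the document, followed by a rank cutoff (the `% 10` above). The toy index, the weights, and the helper names `score` and `top_k` are invented for illustration and are not part of PyTerrier or pyt_splade. It is also an assumption of this sketch that the pre-built index stores the Splade document weights as pseudo term frequencies.

```python
from typing import Dict, List, Tuple

# Hypothetical toy index: docno -> {term: stored weight (pseudo term frequency)}.
TOY_INDEX: Dict[str, Dict[str, int]] = {
    "d1": {"page": 3, "rank": 2, "graph": 1},
    "d2": {"page": 1, "web": 2},
    "d3": {"crawl": 4, "web": 1},
}


def score(query: Dict[str, float], doc: Dict[str, int]) -> float:
    """Sum of query weight * document weight over the terms both share."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def top_k(query: Dict[str, float],
          index: Dict[str, Dict[str, int]],
          k: int) -> List[Tuple[str, float]]:
    """Score every document, drop non-matching ones, and keep the k best."""
    scored = [(docno, score(query, doc)) for docno, doc in index.items()]
    scored = [(docno, s) for docno, s in scored if s > 0]
    # Sort by descending score; break ties by docno for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# A hypothetical expanded query: term -> weight, as a query encoder might emit.
query = {"page": 2.0, "rank": 1.5, "web": 0.5}
results = top_k(query, TOY_INDEX, k=2)
print(results)
```

A real index evaluates the same sum through posting lists instead of scanning every document, but the resulting ranking follows the same idea.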


### Step 3: Do Some Searches

In [11]:
splade_retr.search('PageRank')



Unnamed: 0,qid,docid,docno,rank,score,query_0,query,text
0,1,108551,2005.wwwconf_conference-2005.62,0,757.313629,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"PageRank as a function of the damping factor\n\n\n ABSTRACTPageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. 
Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection . Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O t k α t for the k-th derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank."
1,1,88828,2017.wsdm_conference-2017.83,1,753.002003,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"Unsupervised Ranking using Graph Structures and Node Attributes\n\n\n ABSTRACTPageRank has been the signature unsupervised ranking model for ranking node importance in a graph. One potential drawback of PageRank is that its computation depends only on input graph structures, not considering external information such as the attributes of nodes. 
This work proposes AttriRank, an unsupervised ranking model that considers not only graph structure but also the attributes of nodes. AttriRank is unsupervised and domain-independent, which is different from most of the existing works requiring either ground-truth labels or specific domain knowledge. Combining two reasonable assumptions about PageRank and node attributes, AttriRank transfers extra node information into a Markov chain model to obtain the ranking. We further develop approximation for AttriRank and reduce its complexity to be linear to the number of nodes or links in the graph, which makes it feasible for large network data. The experiments show that AttriRank outperforms competing models in diverse graph ranking applications."
2,1,91900,2002.ecir_conference-2002.5,2,741.532116,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"An Improved Computation of the PageRank Algorithm\n\n\n Abstract. The Google search site (http://www.google.com) exploits the link structure of the Web to measure the relative importance of Web pages. The ranking method implemented in Google is called PageRank . The sum of all PageRank values should be one. However, we notice that the sum becomes less than one in some cases. 
We present an improved PageRank algorithm that computes the PageRank values of the Web pages correctly. Our algorithm works out well in any situations, and the sum of all PageRank values is always maintained to be one. We also present implementation issues of the improved algorithm. Experimental evaluation is carried out and the results are also discussed."
3,1,110287,2008.trec_conference-2008.58,3,729.248601,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"Weighted PageRank: Cluster-Related Weights\n\n\n PageRank is a way to rank Web pages taking into account hyper-link structure of the Web. PageRank provides efficient and simple method to find out ranking of Web pages exploiting hyper-link structure of the Web. However, it produces just an approximation of the ranking since the random surfer model uses just uniform distributions for all situation of choice happening during the surf process. 
In particular, this implies that the random surfer has no preferences. The assumption is limited by its nature. Personalized PageRank was designed to solve the problem but it is still quite restrictive since it assumes non-uniform preferences just at jumping to arbitrary page on the Web and non-preferring behaviour when following outgoing hyper-links. Taking into account these limitations and restrictions of PageRank and Personalized PageRank we propose Weighted PageRank where we are free to weight hyper-links according any possible preferring behaviour of a user. In particular, cluster-related weights are considered."
4,1,108646,2005.wwwconf_conference-2005si.74,4,728.764257,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"TruRank: taking PageRank to the limit\n\n\n ABSTRACTPageRank is defined as the stationary state of a Markov chain depending on a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. It is common belief that values of α closer to 1 give a ""truer to the web"" PageRank, but a small α accelerates convergence. 
Recently, however, it has been shown that when α = 1 all pages in the core component are very likely to have rank 0 [1]. This behaviour makes it difficult to understand PageRank when α ≈ 1, as it converges to a meaningless value for most pages. We propose a simple and natural modification to the standard preprocessing performed on the adjacency matrix of the graph, resulting in a ranking scheme we call TruRank. TruRank ranks the web with principles almost identical to PageRank, but it gives meaningful values also when α ≈ 1."
5,1,108607,2005.wwwconf_conference-2005si.35,5,726.09453,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"TotalRank: ranking without damping\n\n\n ABSTRACTPageRank is defined as the stationary state of a Markov chain obtained by perturbing the transition matrix of a web graph with a damping factor α that spreads part of the rank. The choice of α is eminently empirical, but most applications use α = 0.85; nonetheless, the selection of α is critical, and some believe that link farms may use this choice adversarially. 
Recent results prove that the PageRank of a page is a rational function of α, and that this function can be approximated quite efficiently: this fact can be used to define a new form of ranking, TotalRank, that averages PageRanks over all possible α's. We show how this rank can be computed efficiently, and provide some preliminary experimental results on its quality and comparisons with PageRank."
6,1,96675,2002.cikm_conference-2002.70,6,725.53461,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"I/O-efficient techniques for computing pagerank\n\n\n ABSTRACTOver the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to each page based on the importance of other pages pointing to it. 
The main advantage of the Pagerank measure is that it is independent of the query posed by a user; this means that it can be precomputed and then used to optimize the layout of the inverted index structure accordingly. However, computing the Pagerank measure requires implementing an iterative process on a massive graph corresponding to billions of web pages and hyperlinks.In this paper, we study I/O-efficient techniques to perform this iterative computation. We derive two algorithms for Pagerank based on techniques proposed for out-of-core graph algorithms, and compare them to two existing algorithms proposed by Haveliwala. We also consider the implementation of a recently proposed topic-sensitive version of Pagerank. Our experimental results show that for very large data sets, significant improvements over previous results can be achieved on machines with moderate amounts of memory. On the other hand, at most minor improvements are"
7,1,90343,2004.airs_conference-2004.14,7,719.745178,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"Literal-Matching-Biased Link Analysis\n\n\n Abstract. The PageRank algorithm, used in the Google Search Engine, plays an important role in improving the quality of results by employing an explicit hyperlink structure among the Web pages. The prestige of Web pages defined by PageRank is derived solely from surfers' random walk on the Web Graph without any textual content consideration. However, in the practical sense, user surfing behavior is far from random jumping. 
In this paper, we propose a link analysis that takes the textual information of Web pages into account. The result shows that our proposed ranking algorithms perform better than the original PageRank."
8,1,91623,2007.ecir_conference-2007.54,8,710.289465,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),PageRank: When Order Changes
9,1,79277,2005.wwwconf_workshop-2004webdyn.2,9,708.079331,PageRank,#combine:0=284.7705602645874(page) #combine:0=279.3522119522095(#base64(IyNyYW5r)) #combine:0=194.7157382965088(pages) #combine:0=168.357253074646(rank) #combine:0=153.48353385925293(ranked) #combine:0=140.11222124099731(index) #combine:0=68.25286746025085(rating) #combine:0=54.50073480606079(math) #combine:0=54.383015632629395(#base64(Og==)) #combine:0=43.51648390293121(dave) #combine:0=42.32596457004547(popularity) #combine:0=41.1358118057251(size) #combine:0=37.9627525806427(fuzzy) #combine:0=32.85408616065979(facebook) #combine:0=32.25846588611603(simon) #combine:0=32.02188014984131(hart) #combine:0=31.53606951236725(speed) #combine:0=25.538238883018494(google) #combine:0=23.23586642742157(site) #combine:0=22.877788543701172(image) #combine:0=21.641002595424652(bradley) #combine:0=21.29380702972412(button) #combine:0=18.95996928215027(reading) #combine:0=18.754321336746216(statistics) #combine:0=17.70050674676895(graph) #combine:0=17.41144210100174(website) #combine:0=17.340534925460815(stanley) #combine:0=15.870106220245361(eddie) #combine:0=12.882547080516815(brian) #combine:0=12.348129600286484(burke) #combine:0=11.66246011853218(chart) #combine:0=9.357908368110657(review) #combine:0=8.350376039743423(perry) #combine:0=5.3222812712192535(magazine) #combine:0=3.44972163438797(search) #combine:0=3.3687151968479156(score) #combine:0=3.11063714325428(fred) #combine:0=1.5645496547222137(richard) #combine:0=0.325055536814034(citation),"Local Methods for Estimating PageRank Values\n\n\n The Google search engine uses a method called PageRank, together with term-based and other ranking techniques, to order search results returned to the user. PageRank uses link analysis to assign a global importance score to each web page. 
The PageRank scores of all the pages are usually determined off-line in a large-scale computation on the entire hyperlink graph of the web, and several recent studies have focused on improving the efficiency of this computation, which may require multiple hours on a typical workstation. However, in some scenarios, such as online analysis of link evolution and mining of large web archives, it may be desirable to quickly approximate or update the PageRanks of individual nodes without performing a large-scale computation on the entire graph. We address this problem by studying several methods for efficiently estimating the PageRank score of a particular web page using only a small subgraph of the entire web."
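The `query` column above is hard to read because it holds the expanded query as one long string of `#combine:0=<weight>(<token>)` entries, some of which wrap the token in `#base64(...)`. As a rough sketch, the string can be turned back into a readable mapping from token to weight with the standard library alone. The helper names `decode_token` and `parse_expanded_query` are ours, and the pattern is inferred from the output shown in this notebook rather than taken from a specification, so treat it as an assumption.

```python
import base64
import re
from typing import Dict

# One entry looks like "#combine:0=<weight>(<token>)"; entries are space-separated.
ENTRY_PATTERN = re.compile(r"#combine:0=([^(]+)\((.+)\)")
BASE64_PATTERN = re.compile(r"#base64\((.*)\)")


def decode_token(token: str) -> str:
    """Return the plain token, decoding it first if it is wrapped in #base64(...)."""
    match = BASE64_PATTERN.fullmatch(token)
    if match:
        return base64.b64decode(match.group(1)).decode("utf-8")
    return token


def parse_expanded_query(query: str) -> Dict[str, float]:
    """Map each token of the expanded query to its weight, in original order."""
    weights: Dict[str, float] = {}
    for entry in query.split():
        match = ENTRY_PATTERN.fullmatch(entry)
        if match:  # silently skip anything that does not look like an entry
            weights[decode_token(match.group(2))] = float(match.group(1))
    return weights


# Excerpt copied from the 'query' column of the PageRank search above.
excerpt = (
    "#combine:0=284.7705602645874(page) "
    "#combine:0=279.3522119522095(#base64(IyNyYW5r)) "
    "#combine:0=54.383015632629395(#base64(Og==)) "
    "#combine:0=0.325055536814034(citation)"
)
parsed = parse_expanded_query(excerpt)
for token, weight in parsed.items():
    print(f"{token}\t{weight:.2f}")
```

Decoding the excerpt shows that `#base64(IyNyYW5r)` stands for the word piece `##rank` and `#base64(Og==)` for `:`, so the query `PageRank` is mostly matched through `page` and `##rank`.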


In [15]:
splade_retr.search('measure importance web page')



Unnamed: 0,qid,docid,docno,rank,score,query_0,query,text
0,1,126114,1964.ipm_journal-ir0volumeA2A3.2,0,770.365657,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),On relevance as a measure
1,1,92452,2011.cikm_conference-2011.12,1,711.276196,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"User browsing behavior-driven web crawling\n\n\n ABSTRACTTo optimize the performance of web crawlers, various measures of page importance have been studied to select and order URLs in crawling. Most sophisticated measures (e.g. breadth-first and PageRank ) are based on link structure. In this paper, we treat the problem from another perspective and propose to directly measure page importance through mining user interest and behaviors from web browse logs. 
Unlike most existing approaches which work on single URL, in this paper, both the log mining and the crawl ordering are performed at the granularity of URL pattern. The proposed URL pattern-based crawl orderings are capable to properly predict the importance of newly created (unseen) URLs. Promising experimental results proved the feasibility of our approach."
2,1,120115,1981.jasis_journal-ir0volumeA32A3.2,2,599.019228,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),The measurement of term importance in automatic indexing
3,1,108679,2005.wwwconf_conference-2005si.107,3,598.779703,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"A study on combination of block importance and relevance to estimate page relevance\n\n\n ABSTRACTSome work showed that segmenting web pages into ""semantic independent"" blocks could help to improve the whole page retrieval. One key and unexplored issue is how to combine the block importance and relevance to a given query. In this poster, we first propose an automatic way to measure block importance to improve retrieval. After that, user information need is also concerned to refine block importance for different users."
4,1,107044,2006.wwwconf_conference-2006.25,4,581.188607,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"What's really new on the web?: identifying new pages from a series of unstable web snapshots\n\n\n ABSTRACTIdentifying and tracking new information on the Web is important in sociology, marketing, and survey research, since new trends might be apparent in the new information. Such changes can be observed by crawling the Web periodically. In practice, however, it is impossible to crawl the entire expanding Web repeatedly. This means that the novelty of a page remains unknown, even if that page did not exist in previous snapshots. 
In this paper, we propose a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls. Using this novelty measure, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web. We evaluated the precision, recall, and miss rate of the novelty measure using our Japanese web archive, and applied it to a Web archive search engine."
5,1,123019,1988.ipm_journal-ir0volumeA24A4.0,5,572.243179,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),Measuring relevance judgments
6,1,106264,2013.wwwconf_conference-2013c.272,6,566.643659,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"Measuring web quality\n\n\n ABSTRACTMeasuring the quality of web content, either at page level or website level, is at the heart of several key challenges in the Web. Without doubt, the main one is web search, to be able to rank results. However, there are other important problems such as web reputation or trust, and web spam detection and filtering. However, measuring intrinsic web quality is a hard problem, because of our limited (automatic) understanding of text semantics, which is even worse for other media. 
Hence, similarly to human trust assessing, where we use past actions, face expressions, body language, etc; in the Web we need to use indirect signals that serve as surrogates for web quality. In this keynote we attempt to present the most important signals as well as new signals that are or can be used to measure quality in the Web. We divide them using the traditional web content, structure, and usage trilogy. We also characterize them according to how easy is to measure these signals, who can measure them, and how well they scale to the whole Web."
7,1,90593,2005.airs_conference-2005.48,7,565.054166,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),Calculating Webpage Importance with Site Structure Constraints
8,1,102649,2011.wwwconf_conference-2011c.55,8,553.648066,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"A framework for evaluating network measures for functional importance\n\n\n ABSTRACTMany metrics such as degree, closeness, and PageRank have been introduced to determine the relative importance of a node within a network. The desired function of a network, however, is domain-specific. For example, the robustness can be crucial for a communication network, while efficiency is more preferred for fast spreading of advertisements in viral marketing. The information provided by some widely used measures are often conflicting under such varying demands. 
In this paper, we present a novel framework for evaluating network metrics regarding typical functional requirements. We also propose an analysis of five well established measures to compare their performance of ranking nodes on functional importance in a real-life network."
9,1,97037,2013.cikm_conference-2013.330,9,545.874962,measure importance web page,#combine:0=286.2156629562378(importance) #combine:0=271.6032028198242(measure) #combine:0=189.93037939071655(measurement) #combine:0=182.91382789611816(important) #combine:0=177.06921100616455(web) #combine:0=146.37192487716675(significance) #combine:0=145.82605361938477(page) #combine:0=130.60227632522583(measures) #combine:0=94.26724910736084(pages) #combine:0=73.70932698249817(math) #combine:0=73.25243949890137(attention) #combine:0=66.87921285629272(site) #combine:0=61.57311201095581(significant) #combine:0=59.62101221084595(website) #combine:0=45.82305550575256(assessment) #combine:0=37.91020214557648(gage) #combine:0=37.84126937389374(statistics) #combine:0=34.75472927093506(reading) #combine:0=28.366029262542725(smith) #combine:0=23.800459504127502(chart) #combine:0=22.983166575431824(accuracy) #combine:0=18.482641875743866(survey) #combine:0=18.137164413928986(thomas) #combine:0=16.646508872509003(fisher) #combine:0=16.189368069171906(tracking) #combine:0=13.567069172859192(simon) #combine:0=12.203605473041534(accounting) #combine:0=8.65863412618637(#base64(IyNwYWdl)) #combine:0=7.531464099884033(hawkins) #combine:0=7.2045765817165375(photography) #combine:0=7.14053139090538(austin) #combine:0=6.915504485368729(measuring) #combine:0=5.686277151107788(engineering) #combine:0=2.2957956418395042(value),"Incorporating the surfing behavior of web users into pagerank\n\n\n ABSTRACTIn large-scale commercial web search engines, estimating the importance of a web page is a crucial ingredient in ranking web search results. So far, to assess the importance of web pages, two different types of feedback have been taken into account, independent of each other: the feedback obtained from the hyperlink structure among the web pages (e.g., PageRank) or the web browsing patterns of users (e.g., BrowseRank). Unfortunately, both types of feedback have certain drawbacks. 
While the former lacks the user preferences and is vulnerable to malicious intent, the latter suffers from sparsity and hence low web coverage. In this work, we combine these two types of feedback under a hybrid page ranking model in order to alleviate the above-mentioned drawbacks. Our empirical results indicate that the proposed model leads to better estimation of page importance according to an evaluation metric that relies on user click feedback obtained from web search query logs. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits)."
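
The `query` column above holds the expanded query that Splade produced for the original text in `query_0`: a list of `#combine:0=<weight>(<term>)` entries, where a term that is not a plain word appears wrapped as `#base64(...)`. The following is a minimal sketch for turning such a string into a term-to-weight dictionary so that the expansion is easier to inspect. The helper name `parse_expanded_query` and the pattern `TERM_PATTERN` are hypothetical and not part of PyTerrier or pyt_splade, and the format is inferred only from the output shown above.

```python
import base64
import re

# Assumed format (inferred from the output above): "#combine:0=<weight>(<term>)",
# where <term> is either a plain token or "#base64(<payload>)".
TERM_PATTERN = re.compile(
    r"#combine:0=([0-9.eE+-]+)\((#base64\([A-Za-z0-9+/=]+\)|[^()\s]+)\)"
)


def parse_expanded_query(query):
    """Return a dict mapping each expansion term to its weight."""
    weights = {}
    for weight, term in TERM_PATTERN.findall(query):
        if term.startswith("#base64("):
            # Decode the payload between "#base64(" and the closing ")".
            payload = term[len("#base64("):-1]
            term = base64.b64decode(payload).decode("utf-8")
        weights[term] = float(weight)
    return weights


# Three entries copied from the expanded query shown above.
example = (
    "#combine:0=286.2156629562378(importance) "
    "#combine:0=8.65863412618637(#base64(IyNwYWdl)) "
    "#combine:0=2.2957956418395042(value)"
)

weights = parse_expanded_query(example)

# Print the terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{weight:10.3f}  {term}")
```

To apply it to a whole result frame, one could pass the `query` value of any row, for example `parse_expanded_query(results.iloc[0]['query'])`, where `results` is assumed to be the dataframe returned by `splade_retr.search(...)`.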


In [24]:
splade_retr.search('predict index growth')



Unnamed: 0,qid,docid,docno,rank,score,query_0,query,text
0,1,115017,2009.jasis_journal-ir0volumeA60A2.15,0,677.038217,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),Simulating growth of the h-index
1,1,39917,W16-5616,1,600.494485,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"News Sentiment and Cross-Country Fluctuations\n\n\n What is the information content of news-based measures of sentiment? How are they related to aggregate economic fluctuations? I construct a sentiment index by measuring the net amount of positive expressions in the corpus of Economic news articles produced by Reuters over the period 1987 -2013 and across 12 countries. 
The index successfully tracks fluctuations in Gross Domestic Product (GDP) at the country level, is a leading indicator of GDP growth and contains information to help forecast GDP growth which is not captured by professional forecasts. This suggests that forecasters do not appropriately incorporate available information in predicting future states of the economy."
2,1,43812,2020.acl-main.649,2,557.576169,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"Predicting the Growth of Morphological Families from Social and Linguistic Factors\n\n\n We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as ""trump"", ""antitrumpism"", and ""detrumpify"", in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. 
Our experiments demonstrate very good performance on MFEP."
3,1,26723,W19-5509,3,555.496173,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"Leveraging {BERT} to Improve the {FEARS} Index for Stock Forecasting\n\n\n Financial and Economic Attitudes Revealed by Search (FEARS) index reflects the attention and sentiment of public investors and is an important factor for predicting stock price return. In this paper, we take into account the semantics of the FEARS search terms by leveraging the Bidirectional Encoder Representations from Transformers (BERT), and further apply a self-attention deep learning model to our refined FEARS seamlessly for stock return prediction. 
We demonstrate the practical benefits of our approach by comparing to baseline works."
4,1,67557,Y06-1042,4,543.396865,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"The stock index forecast based on dynamic recurrent neural network trained with {GA}\n\n\n Abstract：In order to forecast the stock market more accurately, according to the dynamic property for the stock market, propose the real time modeling forecast via dynamic recurrent neural network and use GA to study online, then it improves the network performance and better describes the dynamic characteristic of stock market. By forecasting Shanghai negotiable securities index, it shows better validity."
5,1,108630,2005.wwwconf_conference-2005si.58,5,533.520362,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"Predictive ranking: a novel page ranking approach by estimating the web structure\n\n\n ABSTRACTPageRank (PR) is one of the most popular ways to rank web pages. However, as the Web continues to grow in volume, it is becoming more and more difficult to crawl all the available pages. As a result, the page ranks computed by PR are only based on a subset of the whole Web. This produces inaccurate outcome because of the inherent incomplete information (dangling pages) that exist in the calculation. 
To overcome this incompleteness, we propose a new variant of the PageRank algorithm called, Predictive Ranking (PreR), in which different classes of dangling pages are analyzed individually so that the link structure can be predicted more accurately. We detail our proposed steps. Furthermore, experimental results show that this algorithm achieves encouraging results when compared with previous methods."
6,1,22960,W02-1903,6,531.368323,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"A Reliable Indexing Method for a Practical {QA} System\n\n\n We propose a fast and reliable Question-answering (QA) system in Korean, which uses a predictive answer indexer based on 2-pass scoring method. The indexing process is as follows. The predictive answer indexer first extracts all answer candidates in a document. Then, using 2-pass scoring method, it gives scores to the adjacent content words that are closely related with each answer candidate. Next, it stores the weighted content words with each candidate into a database. 
Using this technique, along with a complementary analysis of questions, the proposed QA system saves response time and enhances the precision."
7,1,94979,2020.cikm_conference-2020.58,7,520.46922,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),Predicting Economic Growth by Region Embedding: A Multigraph Convolutional Network Approach
8,1,106373,2013.wwwconf_conference-2013.67,8,518.673115,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"Modeling/predicting the evolution trend of osn-based applications\n\n\n ABSTRACTWhile various models have been proposed for generating social/friendship network graphs, the dynamics of user interactions through online social network (OSN) based applications remain largely unexplored. We previously developed a growth model to capture static weekly snapshots of user activity graphs (UAGs) using data from popular Facebook gifting applications. 
This paper presents a new continuous graph evolution model aimed to capture microscopic userlevel behaviors that govern the growth of the UAG and collectively define the overall graph structure. We demonstrate the utility of our model by applying it to forecast the number of active users over time as the application transitions from initial growth to peak/mature and decline/fatique phase. Using empirical evaluations, we show that our model can accurately reproduce the evolution trend of active user population for gifting applications, or other OSN applications that employ similar growth mechanisms. We also demonstrate that the predictions from our model can guide the generation of synthetic graphs that accurately represent empirical UAG snapshots sampled at different evolution stages."
9,1,2181,R15-1089,9,507.526049,predict index growth,#combine:0=295.01166343688965(index) #combine:0=273.28624725341797(growth) #combine:0=268.44704151153564(predict) #combine:0=189.8534655570984(predicted) #combine:0=188.15953731536865(grow) #combine:0=178.8409948348999(indexed) #combine:0=140.74702262878418(forecast) #combine:0=105.21880388259888(expansion) #combine:0=92.15607643127441(calculate) #combine:0=71.18721008300781(determine) #combine:0=60.55528521537781(statistics) #combine:0=57.710593938827515(estimate) #combine:0=46.43977880477905(stock) #combine:0=45.199185609817505(success) #combine:0=43.384236097335815(2019) #combine:0=32.61469304561615(gage) #combine:0=31.936761736869812(analysis) #combine:0=30.917447805404663(pearson) #combine:0=28.184354305267334(economic) #combine:0=25.72609782218933(tracking) #combine:0=24.547719955444336(correlation) #combine:0=21.098564565181732(research) #combine:0=19.726473093032837(investment) #combine:0=19.47171241044998(assessment) #combine:0=17.522238194942474(chart) #combine:0=17.23059117794037(fisher) #combine:0=16.99621081352234(roi) #combine:0=15.916690230369568(reading) #combine:0=15.898226201534271(graph) #combine:0=12.962949275970459(trend) #combine:0=12.661752104759216(rank) #combine:0=11.301986873149872(hawkins) #combine:0=9.974803775548935(future) #combine:0=5.452035367488861(accounting) #combine:0=5.078029632568359(math) #combine:0=0.6437043193727732(z),"Six Good Predictors of Autistic Text Comprehension\n\n\n This paper presents our investigation of the ability of 33 readability indices to account for the reading comprehension difficulty posed by texts for people with autism. The evaluation by autistic readers of 16 text passages is described, a process which led to the production of the first text collection for which readability has been evaluated by people with autism. 
We present the findings of a study to determine which of the 33 indices can successfully discriminate between the difficulty levels of the text passages, as determined by our reading experiment involving autistic participants. The discriminatory power of the indices is further assessed through their application to the FIRST corpus which consists of 25 texts presented in their original form and in a manually simplified form (50 texts in total), produced specifically for readers with autism."
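
The `query` column in the output above holds the SPLADE-expanded query, written as one `#combine:0=<weight>(<term>)` clause per expansion term. To inspect which terms the model added and how strongly it weights them, the string can be split into (term, weight) pairs. The following is a minimal sketch, assuming the clause format is exactly as printed above; `parse_splade_query` is a hypothetical helper written for this notebook and is not part of pyt_splade or PyTerrier.

```python
import re

# One clause looks like "#combine:0=<weight>(<term>)" in the output above.
CLAUSE = re.compile(r"#combine:0=([0-9.eE+-]+)\(([^)]*)\)")

def parse_splade_query(query):
    """Return (term, weight) pairs sorted by descending weight (hypothetical helper)."""
    pairs = [(term, float(weight)) for weight, term in CLAUSE.findall(query)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# A few clauses copied from the expanded query for 'predict index growth'.
example = (
    "#combine:0=295.01166343688965(index) "
    "#combine:0=273.28624725341797(growth) "
    "#combine:0=268.44704151153564(predict) "
    "#combine:0=0.6437043193727732(z)"
)

for term, weight in parse_splade_query(example):
    print(f"{term:10s} {weight:8.2f}")
```

Sorting by weight makes it easy to see that the original query terms dominate, while low-weight expansions such as `z` contribute very little to the score.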


In [23]:
splade_retr.search('lost web page')



Unnamed: 0,qid,docid,docno,rank,score,query_0,query,text
0,1,82664,2002.sigirconf_conference-2002.3,0,492.99461,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Analysis of lexical signatures for finding lost or related documents\n\n\n ABSTRACTA lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. 
We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures."
1,1,22217,L08-1411,1,438.305123,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"{G}lossa: a Multilingual, Multimodal, Configurable User Interface"
2,1,107131,2006.wwwconf_conference-2006.112,2,437.955888,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Verifying genre-based clustering approach to content extraction\n\n\n ABSTRACTThe content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter, particularly hurting users browsing on small cell phone and PDA 
screens and visually impaired users relying on speed rendering of web pages. Using the genre of a web page, we have created a solution, Crunch that automatically identifies clutter and removes it, thus leaving a clean content-full page. In order to evaluate the improvement in the applications for this technology, we identified a number of experiments. In this paper, we have those experiments, the associated results and their evaluation."
3,1,108604,2005.wwwconf_conference-2005si.32,3,434.733747,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Predicting outcomes of web navigation\n\n\n ABSTRACTTwo exploratory studies examined the relationships among web navigation metrics, measures of lostness, and success on web navigation tasks. The web metrics were based on counts of visits to web pages, properties of the web usage graph, and similarity to an optimal path. 
Metrics based on similarity to an optimal path were good predictors of lostness and task success."
4,1,82923,2013.sigirconf_conference-2013.117,4,412.79917,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Studying page life patterns in dynamical web\n\n\n ABSTRACTWith the ever-increasing speed of content turnover on the web, it is particularly important to understand the patterns that pages' popularity follows. This paper focuses on the dynamical part of the web, i.e. 
pages that have a limited lifespan and experience a short popularity outburst within it. We classify these pages into five patterns based on how quickly they gain popularity and how quickly they lose it. We study the properties of pages that belong to each pattern and determine content topics that contain disproportionately high fractions of particular patterns. These developments are utilized to create an algorithm that approximates with reasonable accuracy the expected popularity pattern of a web page based on its URL and, if available, prior knowledge about its domain's topics."
5,1,125220,2008.ipm_journal-ir0volumeA44A2.23,5,384.1805,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Query-level loss functions for information retrieval\n\n\n AbstractMany machine learning technologies such as support vector machines, boosting, and neural networks have been applied to the ranking problem in information retrieval. 
However, since originally the methods were not developed for this task, their loss functions do not directly link to the criteria used in the evaluation of ranking. Specifically, the loss functions are defined on the level of documents or document pairs, in contrast to the fact that the evaluation criteria are defined on the level of queries. Therefore, minimizing the loss functions does not necessarily imply enhancing ranking performances. To solve this problem, we propose using query-level loss functions in learning of ranking functions. We discuss the basic properties that a query-level loss function should have and propose a query-level loss function based on the cosine similarity between a ranking list and the corresponding ground truth. We further design a coordinate descent algorithm, referred to as RankCosine, which utilizes the proposed loss function to create a generalized additive ranking model. We also discuss whether the loss functions of existing ranking algorithms can be extended to query-level. Experimental results on the datasets of TREC web track, OHSUMED, and a commercial web search engine show that with the use of the proposed querylevel loss function we can significantly improve ranking accuracies. Furthermore, we found that it is difficult to extend the document-level loss functions to query-level loss functions."
6,1,107034,2006.wwwconf_conference-2006.15,6,379.38195,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Off the beaten tracks: exploring three aspects of web navigation\n\n\n ABSTRACTThis paper presents results of a long-term client-side Web usage study, updating previous studies that range in age from five to ten years. 
We focus on three aspects of Web navigation: changes in the distribution of navigation actions, speed of navigation and within-page navigation.""Navigation actions"" corresponding to users' individual page requests are discussed by type. We reconfirm links to be the most important navigation element, while backtracking has lost more than half of its previously reported share and form submission has become far more common. Changes of the Web and the browser interfaces are candidates for causing these changes.Analyzing the time users stayed on pages, we confirm Web navigation to be a rapidly interactive activity. A breakdown of page characteristics shows that users often do not take the time to read the available text or consider all links. The performance of the Web is analyzed and reassessed against the resulting requirements.Finally, habits of within-page navigation are presented. Although most selected hyperlinks are located in the top left corner of the screen, in nearly a quarter of all cases people choose links that require scrolling. We analyzed the available browser real estate to gain insights for the design of non-scrolling Web pages."
7,1,124063,2015.ipm_journal-ir0volumeA51A2.7,7,374.387564,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Building a better mousetrap: Compressing mouse cursor activity for web analytics\n\n\n a b s t r a c tWebsites can learn what their users do on their pages to provide better content and services to those users. 
A website can easily find out where a user has been, but in order to find out what content is consumed and how it was consumed at a sub-page level, prior work has proposed client-side tracking to record cursor activity, which is useful for computing the relevance for search results or determining user attention on a page. While recording cursor interactions can be done without disturbing the user, the overhead of recording the cursor trail and transmitting this data over the network can be substantial. In our work, we investigate methods to compress cursor data, taking advantage of the fact that not every cursor coordinate has equal value to the website developer. We evaluate 5 lossless and 5 lossy compression algorithms over two datasets, reporting results about client-side performance, space savings, and how well a lossy algorithm can replicate the original cursor trail. The results show that different compression techniques may be suitable for different goals: LZW offers reasonable lossless compression, but lossy algorithms such as piecewise linear interpolation and distance-thresholding offer better client-side performance and bandwidth reduction."
8,1,7046,P11-2024,8,371.222926,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Search in the Lost Sense of {``}Query{''}: Question Formulation in Web Search Queries and its Temporal Changes\n\n\n Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. 
However, queries, which encode user's information need, are typically not expressed as full-length natural language sentences -in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent."
9,1,80625,2014.sigirconf_conference-2014.179,9,370.668981,lost web page,#combine:0=309.1687202453613(lost) #combine:0=231.3476800918579(web) #combine:0=211.07409000396729(loss) #combine:0=192.4658179283142(page) #combine:0=129.41399812698364(pages) #combine:0=102.01253890991211(site) #combine:0=92.38483905792236(forgotten) #combine:0=83.66556167602539(hidden) #combine:0=79.6230673789978(website) #combine:0=68.04164052009583(stolen) #combine:0=58.571791648864746(missing) #combine:0=56.55722618103027(error) #combine:0=48.545777797698975(lose) #combine:0=44.41806972026825(ghost) #combine:0=39.79119956493378(restoration) #combine:0=39.351823925971985(dave) #combine:0=39.10987079143524(craig) #combine:0=34.17539596557617(fake) #combine:0=30.17648458480835(nick) #combine:0=30.17028570175171(mia) #combine:0=27.93791890144348(ryan) #combine:0=27.24887728691101(facebook) #combine:0=25.79193115234375(ruin) #combine:0=23.847606778144836(copyright) #combine:0=22.445227205753326(escape) #combine:0=20.686589181423187(find) #combine:0=18.387308716773987(abandoned) #combine:0=17.44653284549713(steve) #combine:0=17.27866232395172(sites) #combine:0=16.41681343317032(privacy) #combine:0=16.317586600780487(stanley) #combine:0=14.351369440555573(hawkins) #combine:0=13.801568746566772(bug) #combine:0=12.17300146818161(message) #combine:0=11.857935786247253(broken) #combine:0=10.01262366771698(matt) #combine:0=9.997706860303879(charlie) #combine:0=9.083320200443268(nathan) #combine:0=8.300435543060303(disappeared) #combine:0=6.5411582589149475(portal) #combine:0=5.4329220205545425(gone) #combine:0=4.181629791855812(deleted) #combine:0=1.8754420801997185(discovery),"Uncovering the unarchived web\n\n\n ABSTRACTMany national and international heritage institutes realize the importance of archiving the web for future culture heritage. 
Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation; which could be used to reconstruct impressions of pages that existed on the live web of the crawl date, but would have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection, but are known to have existed -due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages."
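
Because every row repeats the full expanded query and the complete abstract, the ranking above is hard to scan. In the output, the `text` field appears to hold the title, then three newlines, then the abstract (sometimes with a fused `ABSTRACT` prefix, sometimes with no abstract at all). The following is a minimal sketch for splitting the two under that assumption; `split_title_abstract` and the `rows` list are hypothetical, with the sample values copied from the output above.

```python
def split_title_abstract(text):
    """Split a 'text' field into (title, abstract); hypothetical helper.

    Assumes the title and abstract are separated by three newline characters,
    as the output above suggests. Title-only entries yield an empty abstract.
    """
    title, _, abstract = text.partition("\n\n\n")
    abstract = abstract.strip()
    # Some abstracts start with a fused "ABSTRACT" marker; drop it if present.
    if abstract.startswith("ABSTRACT"):
        abstract = abstract[len("ABSTRACT"):].lstrip()
    return title.strip(), abstract

# Two rows from the 'lost web page' ranking, reduced to the relevant columns.
rows = [
    {"rank": 1, "score": 438.305123,
     "text": "{G}lossa: a Multilingual, Multimodal, Configurable User Interface"},
    {"rank": 9, "score": 370.668981,
     "text": "Uncovering the unarchived web\n\n\n ABSTRACTMany national and "
             "international heritage institutes realize the importance of "
             "archiving the web for future culture heritage."},
]

for row in rows:
    title, abstract = split_title_abstract(row["text"])
    print(row["rank"], round(row["score"], 2), title, "| abstract chars:", len(abstract))
```

Printing only the rank, score, and title gives a compact view of the ranking; the abstract can still be looked up for individual documents when needed.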


In [21]:
splade_retr.search('retrieval multi language')



Unnamed: 0,qid,docid,docno,rank,score,query_0,query,text
0,1,63684,A97-1049,0,814.774596,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"An Intelligent Multilingual Information Browsing and Retrieval System Using Information Extraction\n\n\n In this paper, we describe our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of information extraction (IE) technology in novel ways to improve the accuracy of cross-linguistic retrieval and to provide innovative methods for browsing and exploring multilingual document collections. The system indexes texts in different languages (e.g., English and Japanese) and allows the users to retrieve relevant texts in their native language (e.g., English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. Moreover, the system allows interactive information discovery from a multilingual document collection."
1,1,22284,2003.mtsummit-systems.10,1,780.968577,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"A system for {J}apanese/{E}nglish/{K}orean multilingual patent retrieval\n\n\n In response to growing needs for cross-lingual patent retrieval, we propose PRIME (Patent Retrieval In Multilingual Environment system), in which users can retrieve and browse patents in foreign languages only by their native language. PRIME translates a query in the user language into the target language, retrieves patents relevant to the query, and translates retrieved patents into the user language. To update a translation dictionary, PRIME automatically extracts new translations from parallel patent corpora. In the current implementation, trilingual (J/E/K) patent retrieval is available. We describe the system design and its evaluation."
2,1,75868,2004.clef_workshop-2004w.17,2,770.527988,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"Dublin City University at CLEF 2004: Experiments in Monolingual, Bilingual and Multilingual Retrieval\n\n\n The Dublin City University group participated in the monolingual, bilingual and multilingual retrieval tasks this year. The main focus of our investigation this year was extending our retrieval system to document languages other than English, and completing the multilingual task comprising four languages: English, French, Russian and Finnish. Results from our French monolingual experiments indicate that working in French is more effective for retrieval than adopting document and topic translation to English. However, comparison of our multilingual retrieval results using different topic and document translation reveals that this result does not extend to retrieved list merging for the multilingual task in a simple predictable way."
3,1,75421,2009.clef_workshop-2009w.33,3,756.307447,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"Cross-lingual Information Retrieval based on Multiple Indexes\n\n\n In this paper we present the technical details of the retrieval system with which we participated at the CLEF09 Ad-hoc TEL task. We present a retrieval approach based on multiple indexes for different languages which is combined with a conceptbased retrieval approach based on Explicit Semantic Analysis. In order to create the language-specific indices for each language, a language detection approach is applied as preprocessing step. We combine the different indices through rank aggregation and present our experimental results with different rank aggregation strategies. Our results show that the use of multiple indices (one for each language) does not improve upon a baseline index containing documents in all languages. The combination with concept based retrieval, however, results in better retrieval performance in some of the cases considered. For the bilingual tasks the final retrieval results of our system were the 5th best results on the BL dataset and the second best on the BNF dataset."
4,1,74320,2002.ntcir_workshop-2002.34,4,755.999071,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"NTCIR-3 Patent Retrieval Experiments at ULIS\n\n\n Given the growing number of patents filed in multiple countries, users are interested in retrieving patents across languages. We propose a multilingual patent retrieval system, which translates a user query into the target language, searches a multilingual database for patents relevant to the query, and improves the browsing efficiency by way of machine translation and clustering. Our system also extracts new translations from patent families consisting of comparable patents, to enhance the translation dictionary."
5,1,75753,2002.clef_workshop-2002.29,5,731.370598,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),Information Retrieval with Language Knowledge
6,1,75791,2004.clef_workshop-2004.22,6,731.151038,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"Dublin City University at CLEF 2004: Experiments in Monolingual, Bilingual and Multilingual Retrieval\n\n\n Abstract. The Dublin City University group participated in the monolingual, bilingual and multilingual retrieval tasks. The main focus of our investigation for CLEF 2004 was extending our information retrieval system to document languages other than English, and completing the multilingual task comprising four languages: English, French, Russian and Finnish. Our retrieval system is based on the City University Okapi BM25 system with document preprocessing using the Snowball stemming software and stopword lists. Our French monolingual experiments compare retrieval using French documents and topics, and documents and topics translated into English. Our results indicate that working directly in French is more effective for retrieval than adopting document and topic translation. A breakdown of our multilingual retrieval results by the individual languages shows that similar overall average precision can be achieved when there is significant underlying variation in performance for individual languages."
7,1,68007,2013.mtsummit-wmwumttt.7,7,729.863684,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"Multi-word processing in an ontology-based cross-language information retrieval model for specific domain collections\n\n\n This paper proposes a methodological approach to CLIR applications for the development of a system which improves multi-word processing when specific domain translation is required. The system is based on a multilingual ontology, which can improve both translation and retrieval accuracy and effectiveness. The proposed framework allows mapping data and metadata among language-specific ontologies in the Cultural Heritage (CH) domain. The accessibility of Cultural Heritage resources, as foreseen by recent important initiatives like the European Library and Europeana, is closely related to the development of environments which enable the management of multilingual complexity. Interoperability between multilingual systems can be achieved only by means of an accurate multi-word processing, which leads to a more effective information extraction and semantic search and an improved translation quality."
8,1,126143,2015.tois_journal-ir0volumeA33A4.5,8,728.970569,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"A Pólya Urn Document Language Model for Improved Information Retrieval\n\n\n The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency-that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly.We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes."
9,1,82173,2010.sigirconf_conference-2010.61,9,725.495395,retrieval multi language,#combine:0=304.85711097717285(retrieval) #combine:0=286.1713409423828(multi) #combine:0=211.3454818725586(language) #combine:0=204.21385765075684(retrieve) #combine:0=192.05530881881714(languages) #combine:0=140.77889919281006(multiple) #combine:0=140.72309732437134(retrieved) #combine:0=95.22781372070312(ryan) #combine:0=57.81473517417908(collection) #combine:0=56.14665746688843(memory) #combine:0=46.355536580085754(fuzzy) #combine:0=40.67874550819397(search) #combine:0=36.50223910808563(dave) #combine:0=34.531182050704956(archive) #combine:0=31.248939037322998(discovery) #combine:0=31.133416295051575(portal) #combine:0=29.882648587226868(restoration) #combine:0=29.665908217430115(technology) #combine:0=27.878862619400024(research) #combine:0=27.054911851882935(lane) #combine:0=25.396135449409485(marshall) #combine:0=25.117114186286926(document) #combine:0=23.2613667845726(marty) #combine:0=23.2033833861351(computer) #combine:0=22.17169851064682(barry) #combine:0=21.484506130218506(message) #combine:0=20.109359920024872(book) #combine:0=19.76471245288849(marcus) #combine:0=18.959712982177734(collins) #combine:0=18.255512416362762(database) #combine:0=17.31170415878296(avery) #combine:0=17.246945202350616(jerry) #combine:0=16.958749294281006(steve) #combine:0=16.377070546150208(merlin) #combine:0=16.146764159202576(re) #combine:0=15.825890004634857(compilation) #combine:0=14.576703310012817(roy) #combine:0=14.138422906398773(key) #combine:0=10.963233560323715(stanley) #combine:0=10.122495889663696(carter) #combine:0=10.049717873334885(technique) #combine:0=9.412618726491928(communication) #combine:0=8.988677710294724(library) #combine:0=8.668079227209091(merger) #combine:0=8.192110806703568(fisher) #combine:0=7.768755406141281(device) #combine:0=6.44177570939064(storage) #combine:0=6.058086454868317(craig) #combine:0=5.147111415863037(turner) #combine:0=4.6393755823373795(file) 
#combine:0=3.5253118723630905(hammer) #combine:0=3.321031853556633(clark) #combine:0=0.638468936085701(merge) #combine:0=0.15412606298923492(tool),"Multi-style language model for web scale information retrieval\n\n\n ABSTRACT1Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise as all the text streams are not always present for the documents, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an ""openvocabulary"" smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tunings. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system."
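
The `query` column in the rows above holds the expanded query as a sequence of `#combine:0=<weight>(<token>)` entries, so the same long string is repeated in every row. To inspect which expansion terms carry the most weight, the string can be split into token/weight pairs. The following is a minimal sketch, assuming the entry format is exactly as printed above; the helper name `parse_weighted_query` and the shortened example string are illustrative and not part of the notebook or of any library.

```python
import re

# One weighted term as it appears in the output above:
# "#combine:0=<weight>(<token>)". The format is assumed from the printed rows.
TERM_PATTERN = re.compile(r"#combine:0=([0-9.eE+-]+)\(([^)]+)\)")


def parse_weighted_query(query):
    """Return a dict mapping each token to its weight (hypothetical helper)."""
    return {token: float(weight) for weight, token in TERM_PATTERN.findall(query)}


# Shortened example built from three entries of the query shown above.
example = (
    "#combine:0=304.85711097717285(retrieval) "
    "#combine:0=286.1713409423828(multi) "
    "#combine:0=0.15412606298923492(tool)"
)

weights = parse_weighted_query(example)

# Sort tokens by descending weight to see the dominant terms first.
top_terms = sorted(weights.items(), key=lambda item: item[1], reverse=True)
for token, weight in top_terms:
    print(f"{token}\t{weight:.2f}")
```

Applied to the full query string, this makes it easier to see that the original query terms (`retrieval`, `multi`, `language`) dominate, while many expansion tokens such as `ryan` or `dave` receive much lower weights.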
