# Terrier-Spark Example Notebook

More example notebooks can be found at https://github.com/terrier-org/terrier-spark/tree/master/example_notebooks/toree

First, we need to download the dependencies.

In [1]:
//This can take a minute or two.
%AddDeps org.terrier terrier-core 5.0 --transitive --repository file:/root/.m2/repository --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/root/.m2/repository --transitive

Marking org.terrier:terrier-core:5.0 for download
Obtained 276 files
Marking org.terrier:terrier-spark:0.0.1-SNAPSHOT for download
Obtained 336 files


In [2]:
//Let's check that Terrier has been downloaded.
org.terrier.Version.VERSION

5.0

In [11]:
//Let's import what we need
import org.terrier.querying._
import org.terrier.spark._
import org.terrier.spark.ml._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType, DoubleType}


In [12]:
//Change this to point at the index you want to query
val indexref = IndexRef.of("/work/indexes/robust04.properties")

val terrierHome = "/work/terrier-core/"

indexref = concurrent:/work/indexes/robust04.properties
terrierHome = /work/terrier-core/


/work/terrier-core/

In [13]:
val props = Map("terrier.home" -> terrierHome)
TopicSource.configureTerrier(props)
val model = "BM25"

val queryTransform = new QueryingTransformer()
    .setTerrierProperties(props)
    .setIndexReference(indexref)
    .setSampleModel(model)

 Assuming the value of terrier.home from the corresponding system property.
Please ensure that the property terrier.home
is specified in the file terrier.properties,
or as a system property in the command line.
TERRIER_HOME=/work/terrier-core/
terrier.etc=null
TERRIER_HOME=/work/terrier-core/
TERRIER_ETC=/work/terrier-core/etc


props = Map(terrier.home -> /work/terrier-core/)
model = BM25
queryTransform = QueryingTransformer_f0fb7d11c23a


QueryingTransformer_f0fb7d11c23a

Let's see if we can get results for an example query.

In [15]:
val topics = Seq( ("1", "information retrieval") ).toDF("qid", "query")

val results = queryTransform.transform(topics)

Querying concurrent:/work/indexes/robust04.properties for 1 queries
Got for 999 results total


topics = [qid: string, query: string]
results = [qid: string, query: string ... 4 more fields]


[qid: string, query: string ... 4 more fields]

In [16]:
%%dataframe --limit 10
results

qid,query,docno,docid,score,rank
1,information retrieval,FT944-5797,380365,6.641416790363148,998
1,information retrieval,LA072889-0061,470850,6.642720463297927,997
1,information retrieval,LA031889-0131,423359,6.642720463297927,996
1,information retrieval,FT921-2446,207229,6.642720463297927,995
1,information retrieval,FR941202-1-00011,181568,6.642720463297927,994
1,information retrieval,FR940208-1-00060,135882,6.642720463297927,993
1,information retrieval,FT923-5198,225984,6.642720463297927,992
1,information retrieval,FT934-10926,316028,6.642720463297927,991
1,information retrieval,FR940419-2-00056,145309,6.650724514777319,990
1,information retrieval,FT921-3073,207856,6.658747878262264,989
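
The ten rows above are not the ten best results: their ranks run from 998 down to 989, and the score rises as the rank number falls, so these rows sit at the bottom of the ranking. A minimal sketch of ordering by rank follows, using plain Scala collections and three rows copied from the output above. The `Result` case class is a hypothetical stand-in for a DataFrame row; on the DataFrame itself, Spark's `orderBy("rank")` should have the same effect.

```scala
// Hypothetical stand-in for one row of `results`; fields mirror columns shown above.
case class Result(qid: String, docno: String, score: Double, rank: Int)

// Three rows copied from the output above.
val sample = Seq(
  Result("1", "FT944-5797", 6.641416790363148, 998),
  Result("1", "FR940419-2-00056", 6.650724514777319, 990),
  Result("1", "FT921-3073", 6.658747878262264, 989)
)

// Ascending rank puts the highest-scoring document of the sample first.
val byRank = sample.sortBy(_.rank)
byRank.foreach(r => println(s"${r.rank}\t${r.docno}\t${r.score}"))
```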


Now let's do a TREC run.

In [17]:

//Change these to your own topics and qrels files
val topicsFile = "file:/path/to/topics.txt"
val qrelsFile = "file:/path/to/qrels.txt"

val topics = TopicSource.extractTRECTopics(topicsFile).toList.toDF("qid", "query").repartition(1)

val r1 = queryTransform.transform(topics)
//r1 is a dataframe with results for queries in topics
val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)

val r2 = qrelTransform.transform(r1)
//r2 is the same dataframe as r1, with an added label column
val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList

//Append an ndcg column to topics, pairing each row with the score at the same position
val newSchema = StructType(topics.schema.fields ++ Array(StructField("ndcg", DoubleType, false)))
val rtr = spark.createDataFrame(topics.rdd.zipWithIndex.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(ndcg(index.toInt)))}, newSchema)

Querying concurrent:/work/indexes/robust04.properties for 250 queries
Got for 242108 results total
We have 311410 qrels


topicsFile = file:/topics.robust04.txt
qrelsFile = file:/qrels.robust04.txt
topics = [qid: string, query: string]
r1 = [qid: string, query: string ... 4 more fields]
qrelTransform = QrelTransformer_1b5693b67ffa
r2 = [qid: string, query: string ... 5 more fields]
ndcg = List(0.0, 0.17502679579397282, 0.11854207483654515, 0.03829285746486456, 0.14376931608695356, 0.08111548628241008, 0.16194241901521403, 0.252750465141966, 0.2849008613713492, 0.9157513515137172, 0.0, 0.26281773542943293, 0.1119083378307071, 0.7686972263849642, 0.934380395949751, 0.5116183560038224, 0.3455456406937284, 0.5077199282127245,...


List(0.0, 0.17502679579397282, 0.11854207483654515, 0.03829285746486456, 0.14376931608695356, 0.08111548628241008, 0.16194241901521403, 0.252750465141966, 0.2849008613713492, 0.9157513515137172, 0.0, 0.26281773542943293, 0.1119083378307071, 0.7686972263849642, 0.934380395949751, 0.5116183560038224, 0.3455456406937284, 0.5077199282127245,...
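
The list above holds one NDCG value per query, with the `20` passed to `RankingEvaluator` presumably acting as the rank cutoff. As a rough illustration of what such a number means, here is a sketch of one common NDCG@k formulation (gain equal to the relevance label, log2 rank discount) in plain Scala. This is an assumption for illustration only: the gain and discount that Terrier-Spark actually applies may differ, and `dcg` and `ndcgAtK` are hypothetical helper names, not part of its API.

```scala
// Discounted cumulative gain over the first k labels of a ranking.
// Assumes gain = label and a log2(position + 1) discount with 1-based positions.
def dcg(labels: Seq[Int], k: Int): Double =
  labels.take(k).zipWithIndex.map { case (label, i) =>
    label / (math.log(i + 2) / math.log(2))
  }.sum

// NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.
// `ranked` holds the labels of the retrieved documents in rank order;
// `allRelevant` holds the labels of every judged-relevant document for the query.
def ndcgAtK(ranked: Seq[Int], allRelevant: Seq[Int], k: Int): Double = {
  val ideal = dcg(allRelevant.sorted(Ordering[Int].reverse), k)
  if (ideal == 0.0) 0.0 else dcg(ranked, k) / ideal
}

// Toy query with two relevant documents, retrieved at ranks 1 and 3.
val score = ndcgAtK(ranked = Seq(1, 0, 1, 0), allRelevant = Seq(1, 1), k = 20)
println(score)
```

A query with no relevant documents scores 0.0 under this sketch, which is one possible explanation for the zeros in the list, although a query whose relevant documents all fall below the cutoff would score 0.0 as well.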

In [18]:
%%dataframe
rtr

qid,query,ndcg
301,international organized crime,0.0
302,poliomyelitis and post polio,0.1750267957939728
303,hubble telescope achievements,0.1185420748365451
304,endangered species mammals,0.0382928574648645
305,most dangerous vehicles,0.1437693160869535
306,african civilian deaths,0.08111548628241
307,new hydroelectric projects,0.161942419015214
308,implant dentistry,0.252750465141966
309,rap and crime,0.2849008613713492
310,radio waves and brain cancer,0.9157513515137172
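
The `rtr` table is built by pairing each topic with the NDCG value at the same position, which assumes that `evaluateByQuery` returns its scores in the same order as the rows of `topics`. The sketch below mirrors that positional pairing on plain Scala collections, using the first three rows shown above, and then averages them. The variable names are hypothetical.

```scala
// First three topics and their per-query NDCG values, copied from the output above.
val topicRows = Seq(
  ("301", "international organized crime"),
  ("302", "poliomyelitis and post polio"),
  ("303", "hubble telescope achievements")
)
val ndcgValues = Seq(0.0, 0.1750267957939728, 0.1185420748365451)

// Positional pairing, mirroring zipWithIndex plus ndcg(index) in the cell above.
val joined = topicRows.zip(ndcgValues).map { case ((qid, query), n) => (qid, query, n) }

// Mean NDCG over these three topics only, not over the full run.
val meanNdcg = joined.map(_._3).sum / joined.size
joined.foreach(println)
println(f"mean NDCG over ${joined.size} topics: $meanNdcg%.4f")
```

If the two sequences ever differ in order or length, a positional pairing like this silently attaches scores to the wrong topics, so it is worth checking that the row counts match before trusting the table.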
