## Preparation

### Installation

We assume that the repository has been cloned and that all necessary packages have been installed, which includes running the script:

```./install_packages.sh```

and the code is compiled:

```./build.sh```

### Changing directory to the repo root

In [1]:
cd ../..

/Users/alp/Documents/kod/FlexNeuART


### Downloading demo data

In [2]:
!wget boytsov.info/datasets/flecsneurt-demo-2021-02-08.tar.bz2

--2021-04-22 16:19:46--  http://boytsov.info/datasets/flecsneurt-demo-2021-02-08.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7511061 (7.2M) [application/x-bzip2]
Saving to: ‘flecsneurt-demo-2021-02-08.tar.bz2’


2021-04-22 16:19:52 (1.36 MB/s) - ‘flecsneurt-demo-2021-02-08.tar.bz2’ saved [7511061/7511061]



In [3]:
# Unpacking it
!tar jxvf flecsneurt-demo-2021-02-08.tar.bz2

x collections/squad/
x collections/squad/exper_desc/
x collections/squad/input_data/
x collections/squad/input_data/train_bitext/
x collections/squad/input_data/dev1/
x collections/squad/input_data/test/
x collections/squad/input_data/dev2/
x collections/squad/input_data/train/
x collections/squad/input_data/train/AnswerFields.jsonl
x collections/squad/input_data/train/QuestionFields.jsonl
x collections/squad/input_data/train/qrels.txt
x collections/squad/input_data/dev2/AnswerFields.jsonl
x collections/squad/input_data/dev2/QuestionFields.jsonl
x collections/squad/input_data/dev2/qrels.txt
x collections/squad/input_data/test/AnswerFields.jsonl
x collections/squad/input_data/test/QuestionFields.jsonl
x collections/squad/input_data/test/qrels.txt
x collections/squad/input_data/dev1/AnswerFields.jsonl
x collections/squad/input_data/dev1/QuestionFields.jsonl
x collections/squad/input_data/dev1/qrels.txt
x collections/squad/input_data/train_bitext/AnswerFields.jsonl
x collections/squad/inp

In [4]:
# Creating a Lucene index
!scripts/index/create_lucene_index.sh squad

Using collection root: collections
Data directory: collections/squad/input_data
Index directory: collections/squad/lucene_index
Removing previously created index (if exists)
Checking data sub-directory: dev1
Found indexable data file: dev1/AnswerFields.jsonl
Checking data sub-directory: dev2
Found indexable data file: dev2/AnswerFields.jsonl
Checking data sub-directory: test
Found indexable data file: test/AnswerFields.jsonl
Checking data sub-directory: train
Found indexable data file: train/AnswerFields.jsonl
Checking data sub-directory: train_bitext
Found indexable data file: train_bitext/AnswerFields.jsonl
Found query file: dev1/QuestionFields.jsonl
Found query file: dev2/QuestionFields.jsonl
Found query file: test/QuestionFields.jsonl
Found query file: train/QuestionFields.jsonl
Found query file: train_bitext/QuestionFields.jsonl
Using the data input file: AnswerFields.jsonl
JAVA_OPTS=-Xms8388608k -Xmx14680064k -server
Creating a new Lucene index, maximum # of docs to process: 2147

In [5]:
# Creating a forward index for two fields:
# text is a parsed text field
# text_unlemm is a raw text field that keeps the text as is
# -clean removes all previous forward indices
!scripts/index/create_fwd_index.sh squad mapdb  \
                               "text:parsedText text_unlemm:raw" \
                               -clean

Using collection root: collections
Data directory:            collections/squad/input_data
Forward index directory:   collections/squad/forward_index/
Clean old index?:          1
Removing previously created index (if exists)
Field list definition:     text:parsedText text_unlemm:raw
Checking data sub-directory: dev1
Found indexable data file: dev1/AnswerFields.jsonl
Checking data sub-directory: dev2
Found indexable data file: dev2/AnswerFields.jsonl
Checking data sub-directory: test
Found indexable data file: test/AnswerFields.jsonl
Checking data sub-directory: train
Found indexable data file: train/AnswerFields.jsonl
Checking data sub-directory: train_bitext
Found indexable data file: train_bitext/AnswerFields.jsonl
Found query file: dev1/QuestionFields.jsonl
Found query file: dev2/QuestionFields.jsonl
Found query file: test/QuestionFields.jsonl
Found query file: train/QuestionFields.jsonl
Found query file: train_bitext/QuestionFields.jsonl
JAVA_OPTS=-Xms12582912k -Xmx14680064k -serv

## API demo

In [6]:
from scripts.py_flexneuart.setup import *

In [7]:
# add Java JAR to the class path
configure_classpath('target')

In [8]:
# create a resource manager
resource_manager=create_featextr_resource_manager('collections/squad/forward_index')

### Retrieval

In [9]:
from scripts.py_flexneuart.cand_provider import *
# create a candidate provider/generator
cand_prov = create_cand_provider(resource_manager, PROVIDER_TYPE_LUCENE, 'collections/squad/lucene_index')

In [10]:
QUERY_TEXT = "galatasaray university"

In [11]:
query_res = run_text_query(cand_prov, 20, QUERY_TEXT)
query_res

(965,
 [CandidateEntry(doc_id='@17353', score=9.142794609069824),
  CandidateEntry(doc_id='@11977', score=8.12243366241455),
  CandidateEntry(doc_id='@13223', score=6.387572288513184),
  CandidateEntry(doc_id='@14961', score=6.296083450317383),
  CandidateEntry(doc_id='@18822', score=6.275240898132324),
  CandidateEntry(doc_id='@14962', score=6.263032913208008),
  CandidateEntry(doc_id='@11602', score=6.22603702545166),
  CandidateEntry(doc_id='@982', score=6.221157550811768),
  CandidateEntry(doc_id='@9558', score=6.197885513305664),
  CandidateEntry(doc_id='@1962', score=6.1696696281433105),
  CandidateEntry(doc_id='@17533', score=6.122592926025391),
  CandidateEntry(doc_id='@5936', score=6.0756916999816895),
  CandidateEntry(doc_id='@10484', score=6.0641188621521),
  CandidateEntry(doc_id='@1513', score=6.034621715545654),
  CandidateEntry(doc_id='@20169', score=6.016384124755859),
  CandidateEntry(doc_id='@8122', score=6.014654636383057),
  CandidateEntry(doc_id='@5932', score=6.00

### Forward index demo

In [12]:
from scripts.py_flexneuart.fwd_index import get_forward_index

#### First, let's play with a raw index that keeps only unparsed text

In [13]:
raw_indx = get_forward_index(resource_manager, 'text_unlemm')

In [14]:
# the raw flag is set
raw_indx.is_raw

True

In [15]:
raw_indx.get_doc_raw('@17353')

'main sports ottomans engaged turkish wrestling hunting turkish archery horseback riding equestrian javelin throw arm wrestling swimming european model sports clubs formed spreading popularity football matches 19th century constantinople leading clubs timeline besiktas gymnastics club 1903 galatasaray sports club 1905 fenerbahce sports club 1907 istanbul football clubs formed provinces karsıyaka sports club 1912 altay sports club 1914 turkish fatherland football club later ulkuspor 1914 izmir'

#### A parsed index has more info

In [16]:
parsed_indx = get_forward_index(resource_manager, 'text')

In [17]:
# here is_raw is False
parsed_indx.is_raw

False

In [18]:
parsed_indx.get_doc_parsed('@17353')

DocEntryParsed(word_ids=[17, 136, 203, 632, 702, 1239, 1267, 1285, 1291, 1651, 1755, 1813, 2042, 2336, 2572, 3411, 3448, 3700, 3806, 3824, 3954, 4424, 4959, 5427, 5531, 5968, 6354, 6942, 8069, 8402, 8693, 8694, 10505, 10768, 11778, 20452, 31890, 38440, 45946, 56560, 57386, 73421, 73422, 73423, 73424, 73425], word_qtys=[1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 3, 1, 3, 1, 1, 1, 9, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], word_id_seq=[1291, 2336, 3411, 3824, 3448, 5531, 1651, 3448, 45946, 11778, 8069, 8693, 31890, 8694, 3806, 5531, 3700, 1267, 702, 2336, 3954, 136, 1813, 4959, 2572, 4424, 2042, 632, 8402, 17, 3954, 203, 20452, 73421, 6354, 3954, 5968, 56560, 2336, 3954, 6942, 73422, 2336, 3954, 5427, 10768, 2572, 3954, 136, 1285, 73423, 2336, 3954, 1755, 73424, 2336, 3954, 10505, 3448, 57386, 2572, 3954, 1239, 73425, 10505, 38440], doc_len=66)

In [19]:
# Let's look up one word of the document by its ID and get its info
parsed_indx.get_word_by_id(56560), parsed_indx.get_word_entry_by_id(56560)

('galatasaray', WordEntry(word_id=56560, word_freq=2))
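
The fields of the `DocEntryParsed` entry shown earlier appear to be related as follows: `word_ids` lists the distinct word IDs in ascending order, `word_qtys` gives the number of occurrences of each of them, and `word_id_seq` keeps the words in their original order, so its length equals `doc_len` (the quantities in the output sum to 66). Below is a minimal sketch of that relationship. The helper `bag_of_words` and the toy IDs are hypothetical and are not part of FlexNeuART.

```python
from collections import Counter


def bag_of_words(word_id_seq):
    """Derive (word_ids, word_qtys, doc_len) from a sequence of word IDs.

    Hypothetical helper for illustration only; it is not a FlexNeuART API.
    """
    counts = Counter(word_id_seq)
    word_ids = sorted(counts)                  # distinct IDs, ascending
    word_qtys = [counts[w] for w in word_ids]  # occurrences of each ID
    return word_ids, word_qtys, len(word_id_seq)


# Toy sequence with made-up IDs (not taken from the index)
ids, qtys, doc_len = bag_of_words([7, 3, 7, 9, 3, 7])
print(ids, qtys, doc_len)
```

The sequence form is what order-sensitive models would need, whereas the (ID, quantity) form is sufficient for bag-of-words scorers such as BM25.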

### Ranker API demo

### First, we run two experiments that involve training a model

#### Delete old results if they are present

In [20]:
!rm -rf collections/squad/results/dev1/feat_exper

#### Running two experiments (you can add ``-no_separate_shell`` to print logs to the screen for debugging purposes)

In [21]:
!scripts/exper/run_experiments.sh squad \
    exper_desc/exper_list.json \
    -train_part train \
    -test_part dev1 \
    -test_cand_qty_list 100

Using collection root: collections
The number of CPU cores:      4
The number of || experiments: 1
The number of threads:        4
Experiment descriptor file:                                 collections/squad/exper_desc/exper_list.json
Default test set:                                           dev1
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 4
Parsed experiment parameters:
experSubdir:feat_exper/bm25=text
extrType:exper_desc/extractors/bm25=text.json
testOnly:0
Started a process 4134, working dir: collections/squad/results/dev1/feat_exper/bm25=text
Process log file: collections/squad/results/dev1/feat_exper/bm25=text/exper.log
Waiting for 1 child processes
Process with pid=4134 finished successfully.
Parsed experiment parameters:
experSubdir:feat_exper/bm25=text+cosine=text
extrType:exper_desc/extractors/bm25=text+cosine=text.json
testOnly:0
Started a process 4194, working dir: collections/squad/results/dev1

#### Print results generated by ``trec_eval``

In [22]:

!cat collections/squad/results/dev1/feat_exper/bm25\=text+cosine\=text/rep/out_100.rep 

# of queries:    2448
NDCG@10:  0.923700
NDCG@20:  0.925800
NDCG@100: 0.928000
P20:      0.048500
MAP:      0.912100
MRR:      0.912100
Recall:   0.982026


### Second, we can use this model to re-rank and evaluate results using Python API

#### A basic example

In [23]:
from scripts.py_flexneuart.ranker import *

In [24]:
from scripts.config import QUESTION_FILE_JSON, QREL_FILE

In [25]:
MODEL_FILE_NAME='collections/squad/results/dev1/feat_exper/bm25=text+cosine=text/letor/out_squad_train_20.model'
FEAT_EXTR_FILE_NAME='collections/squad/exper_desc/extractors/bm25=text+cosine=text.json'
QUERY_FILE_NAME=f'collections/squad/input_data/dev1/{QUESTION_FILE_JSON}'
QREL_FILE_NAME=f'collections/squad/input_data/dev1/{QREL_FILE}'

#### A list of experiment descriptors, which in turn reference descriptors for feature extractors

In [26]:
!cat collections/squad/exper_desc/exper_list.json

[
  {
    "experSubdir": "feat_exper/bm25=text",
    "extrType" : "exper_desc/extractors/bm25=text.json",
    "testOnly": 0
  },
  {
    "experSubdir": "feat_exper/bm25=text+cosine=text",
    "extrType" : "exper_desc/extractors/bm25=text+cosine=text.json",
    "testOnly": 0
  }
]


#### A (two-feature) feature extractor configuration, which is used in the second experiment

In [27]:
!cat collections/squad/exper_desc/extractors/bm25=text+cosine=text.json

{
"extractors" : [
 {
  "type" : "TFIDFSimilarity",
  "params" : {
    "indexFieldName" : "text",
    "similType" : "bm25",
    "k1"        : "1.2",
    "b"         : "0.75"
  }
 },
 {
  "type" : "TFIDFSimilarity",
  "params" : {
    "indexFieldName" : "text",
    "similType" : "cosine"
  }
 }
]
}
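
To make the role of `k1` and `b` in this configuration more concrete, here is a minimal sketch of a per-term BM25 score using one common textbook variant of the formula. It is an assumption that the `TFIDFSimilarity` extractor computes something of this shape: the exact IDF and length normalization used by FlexNeuART may differ, and the function name and arguments below are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Score of one query term for one document (textbook BM25 variant).

    Illustrative only: not the FlexNeuART implementation.
    """
    # Rare terms (small doc_freq) receive a larger IDF
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences saturate
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Made-up collection statistics for demonstration
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100,
                            doc_freq=10, num_docs=1000)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100,
                           doc_freq=10, num_docs=1000)
print(short_doc > long_doc)
```

With the same term frequency, the shorter document scores higher, which is the effect of the length normalization governed by `b`.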


#### A simple linear model trained to combine feature scores produced by the feature-extractor

In [28]:
!cat collections/squad/results/dev1/feat_exper/bm25=text+cosine=text/letor/out_squad_train_20.model

## Coordinate Ascent
## Restart = 10
## MaxIteration = 50
## StepBase = 0.05
## StepScale = 2.0
## Tolerance = 0.001
## Regularized = false
## Slack = 0.001
1:0.5464353855587661 2:-0.4535646144412339
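
The last line of the model file holds one `feature_id:weight` pair per feature, and lines starting with `#` carry training metadata. The sketch below parses such a file and applies the weights as a plain weighted sum. Both helpers are hypothetical and the feature values are made up. The sketch also assumes that no feature normalization is applied before the weights, which may not hold for the actual ranker.

```python
def parse_linear_model(text):
    """Parse 'id:weight' pairs, skipping blank lines and '#' comment lines."""
    weights = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for item in line.split():
            feat_id, weight = item.split(":")
            weights[int(feat_id)] = float(weight)
    return weights


def linear_score(weights, features):
    """Weighted sum of feature values; missing features count as 0."""
    return sum(w * features.get(fid, 0.0) for fid, w in weights.items())


MODEL_TEXT = """## Coordinate Ascent
## Restart = 10
1:0.5464353855587661 2:-0.4535646144412339
"""

weights = parse_linear_model(MODEL_TEXT)
# Hypothetical feature values: 1 = BM25 score, 2 = cosine score
score = linear_score(weights, {1: 0.8, 2: 0.3})
print(weights, score)
```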

#### A toy example where we generate a list of candidates for just one query (using the candidate provider) and re-rank them (using the re-ranker object)

In [29]:
query_dict = create_text_query_dict(query_text=QUERY_TEXT, 
                                    query_id=FAKE_QUERY_ID, field_name=TEXT_FIELD_NAME)

In [30]:
ranker = QueryRanker(resource_manager, feat_extr_file_name=FEAT_EXTR_FILE_NAME, model_file_name=MODEL_FILE_NAME)

In [31]:
ranker.rank_candidates(query_res[1], query_dict)

{'@17353': 0.3475749472684666,
 '@11977': 0.29051454588664094,
 '@13223': 0.18396253449268837,
 '@14961': 0.20068165482631278,
 '@18822': 0.1731377107974293,
 '@14962': 0.19855210775089766,
 '@11602': 0.18245315249337635,
 '@982': 0.19075740006963843,
 '@9558': 0.19708435544036543,
 '@1962': 0.1838307923065211,
 '@17533': 0.1941509889419027,
 '@5936': 0.21041765643932336,
 '@10484': 0.19494172623620082,
 '@1513': 0.18996889541962164,
 '@20169': 0.19799479549866428,
 '@8122': 0.19272330327668807,
 '@5932': 0.18541586374812002,
 '@10450': 0.20170492921441818,
 '@9623': 0.2077968424078734,
 '@14421': 0.19609493857914606}
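
Note that the dictionary above keeps the candidates in their original retrieval order rather than in the order of the new scores (for example, `@13223` precedes `@14961` even though its score is lower). To obtain the re-ranked list, one can sort the entries by score. A minimal sketch using a few of the values printed above:

```python
# A subset of the scores returned by rank_candidates (copied from the output)
scores = {
    '@17353': 0.3475749472684666,
    '@11977': 0.29051454588664094,
    '@13223': 0.18396253449268837,
    '@14961': 0.20068165482631278,
    '@5936': 0.21041765643932336,
}

# Sort (doc_id, score) pairs by score, highest first
reranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (doc_id, score) in enumerate(reranked, start=1):
    print(rank, doc_id, f"{score:.4f}")
```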

#### A comprehensive example where we evaluate **all** queries from `dev1`

In [32]:
from scripts.data_convert.convert_common import *
all_queries = read_queries(QUERY_FILE_NAME)

In [33]:
# Query sample
all_queries[0:5]

[{'DOCNO': '10595',
  'text_unlemm': 'beyonce lighter skin color costuming',
  'text': 'beyonce lighter skin color costume'},
 {'DOCNO': '10608',
  'text_unlemm': 'exclusion social political groups targets genocide cppcg legal',
  'text': 'exclusion social political group target genocide cppcg legal'},
 {'DOCNO': '10575',
  'text_unlemm': 'beyonce giselle knowles-carter',
  'text': 'beyonce giselle knowles-carter'},
 {'DOCNO': '10570',
  'text_unlemm': 'school architecture',
  'text': 'school architecture'},
 {'DOCNO': '10576',
  'text_unlemm': 'bee-yon-say born september 4 1981 american',
  'text': 'bee-yon-say bear september 4 1981 american'}]

In [34]:
from tqdm import tqdm
TOP_K=100

run_dict = {}
for query_dict in tqdm(all_queries):
    qid = query_dict[DOCID_FIELD]
    query_res = run_text_query(cand_prov, TOP_K, query_dict[TEXT_FIELD_NAME])
    rank_res = ranker.rank_candidates(query_res[1], query_dict)
    run_dict[qid] = rank_res

100%|██████████| 2448/2448 [00:21<00:00, 111.95it/s]


#### Finally, let us compute various metrics using our Python code. Note that the results should match those previously produced by `trec_eval`.

In [35]:
from scripts.common_eval import *
qrels=read_qrels_dict(QREL_FILE_NAME)

                                           

In [36]:
for eval_obj in [NormalizedDiscountedCumulativeGain(10), \
                 NormalizedDiscountedCumulativeGain(20), \
                 MeanAveragePrecision(), \
                 MeanReciprocalRank()]:
    print(eval_run(rerank_run=run_dict, metric_func=eval_obj, qrels_dict=qrels))

0.9236950321010553
0.9258372608440565
0.9120987070833969
0.9120987070833969
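
For readers unfamiliar with these metrics, here is a minimal per-query sketch of reciprocal rank, average precision, and NDCG@k. These functions are illustrative and are not the implementations from `scripts.common_eval`. In particular, the NDCG variant below uses linear gains, which is an assumption. The identical MAP and MRR values above are what one would expect if every query had exactly one relevant document, because the two metrics coincide in that case, although the notebook does not state this.

```python
import math


def reciprocal_rank(ranked_doc_ids, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_doc_ids, relevant):
    """Mean of precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_doc_ids, rel_grades, k):
    """NDCG@k with linear gains and a log2 rank discount."""
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

    gains = [rel_grades.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(rel_grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query with made-up document IDs: the only relevant document is at rank 2
ranking = ['d3', 'd1', 'd2']
rr = reciprocal_rank(ranking, {'d1'})
ap = average_precision(ranking, {'d1'})
ndcg = ndcg_at_k(ranking, {'d1': 1}, k=10)
print(rr, ap, ndcg)
```

Averaging these per-query values over all queries gives MRR, MAP, and mean NDCG, respectively.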


#### Optionally, we can save the run so that it can be evaluated later with external evaluation tools

In [37]:
write_run_dict(run_dict, 'run.txt')

In [38]:
!head run.txt

10595 Q0 @2069 1 0.5034178698142255 fake_run
10595 Q0 @17442 2 0.19901719649322208 fake_run
10595 Q0 @9111 3 0.19225397364078217 fake_run
10595 Q0 @1769 4 0.19119073815505058 fake_run
10595 Q0 @2228 5 0.19033063864976146 fake_run
10595 Q0 @8585 6 0.18989114401672533 fake_run
10595 Q0 @18222 7 0.18815638945124924 fake_run
10595 Q0 @9121 8 0.1853920865272263 fake_run
10595 Q0 @18223 9 0.18299288131025784 fake_run
10595 Q0 @9120 10 0.1778561648387402 fake_run
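
Each line of the run file has six whitespace-separated columns: query ID, the constant `Q0`, document ID, rank, score, and run name. As a sketch of how such a file could be read back into the same `{query_id: {doc_id: score}}` shape as `run_dict`, consider the hypothetical helper below (it is not part of FlexNeuART):

```python
from collections import defaultdict


def parse_run_lines(lines):
    """Parse TREC-style run lines into {query_id: {doc_id: score}}."""
    run = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, doc_id, _rank, score, _run_id = line.split()
        run[qid][doc_id] = float(score)
    return dict(run)


# Two lines copied from the output of `head run.txt`
SAMPLE = """10595 Q0 @2069 1 0.5034178698142255 fake_run
10595 Q0 @17442 2 0.19901719649322208 fake_run
"""

run = parse_run_lines(SAMPLE.splitlines())
print(run)
```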
