# Solr Baseline

The easiest way to set up and run an ad hoc single-instance Solr is with Docker. Make sure Docker is installed, then run the following commands in a terminal:

```shell
docker pull solr # Pull latest docker solr image
docker run -d --name solr -p 8983:8983 -t solr # Run solr docker container and name it
docker exec -it solr bash # Launch interactive bash shell inside solr container
bin/solr create -c cord19_2020_05_19_abstract # Create schema-less solr core
exit # Exit from solr container bash shell
```
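
The notebook later connects to a second core, `cord19_2020_05_19_title_abstract`, which would need to be created the same way. As a rough sketch, the helpers below check whether a core is loaded by calling Solr's CoreAdmin `STATUS` action. The function names are hypothetical, and the code assumes that a core which is not loaded appears in the response with an empty status entry.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_status_url(base_url: str, core: str) -> str:
    """Build the CoreAdmin STATUS URL for a single core."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/admin/cores?{query}"


def core_exists(status_response: dict, core: str) -> bool:
    """Assumption: a core that is not loaded is reported with an empty entry."""
    return bool(status_response.get("status", {}).get(core))


def check_core(base_url: str, core: str) -> bool:
    """Ask a running Solr instance about the core (needs network access)."""
    with urlopen(core_status_url(base_url, core), timeout=10) as resp:
        return core_exists(json.load(resp), core)
```

With the container from above running, `check_core("http://localhost:8983", "cord19_2020_05_19_abstract")` should return `True` once the core has been created.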

Let's test it on `abstract` to get a baseline.

## Import necessary libs

In [1]:
%reload_ext memory_profiler
import json
import os

In [2]:
%%writefile solr.py
import pysolr
from pathlib import Path
import pandas as pd

def create_solr_connection(collection_name, solr_connection_url):
    """Creates Solr connection
    """
    solr = pysolr.Solr(
        "{}/solr/{}".format(solr_connection_url, collection_name),
        always_commit=True,
        timeout=10
    )
    return solr

def get_ranked_lists(query: str, qid: int, run_name: str, solr: pysolr.Solr, top_k: int) -> list:
    """Get Top k ranked lists from Solr index"""
    solr_query = f"text:{query}"
    solr_query_param = {
        "fl": "id,score",
        "rows": top_k
    }
    template = "{} Q0 {} {} {:.6f} {}\n"
    results = solr.search(solr_query, **solr_query_param).docs
    return [template.format(qid, row['id'], idx+1, row['score'], run_name) for idx, row in enumerate(results)]

def write_results(output_fpath: Path, query_df: pd.DataFrame, query_txt_col: str, solr: pysolr.Solr, run_name: str, top_k=1000):
    """Writes retrieved results to a text file"""
    with open(output_fpath, 'w', encoding='utf-8') as writer:
        for idx, query_row in query_df.iterrows():
            qid = idx
            query = query_row[query_txt_col]
            ranked_lists = get_ranked_lists(query, qid, run_name, solr, top_k)
            writer.writelines(ranked_lists)

Overwriting solr.py
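
One thing to keep in mind about `get_ranked_lists`: in the Lucene standard query syntax a field prefix such as `text:` applies only to the term that directly follows it, so in `text:coronavirus origin` the second term is matched against the default field (`_text_`, per the `df` parameter in the ping output below). This may affect the scores reported in this notebook. A minimal sketch of scoping every term to one field, using a hypothetical helper `build_field_query` that is not part of `solr.py`:

```python
import re

# Characters and operators with special meaning in the standard query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def build_field_query(field: str, text: str) -> str:
    """Escape special characters and group all terms under a single field."""
    escaped = _SPECIAL.sub(r"\\\1", text)
    return f"{field}:({escaped})"


print(build_field_query("text", "coronavirus origin"))
```

The grouped form `text:(coronavirus origin)` searches both terms in `text`. Whether this changes the ranking in practice would need to be confirmed by rerunning the evaluation.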


In [3]:
%reload_ext autoreload
%autoreload 2
from solr import *

## Verify Solr connection

In [4]:
solr = create_solr_connection("cord19_2020_05_19_abstract", "http://0.0.0.0:8983")
# Verify connection is working
print("Ping connection url for health check")
print(solr.ping())

Ping connection url for health check
{
  "responseHeader":{
    "zkConnected":null,
    "status":0,
    "QTime":11,
    "params":{
      "q":"{!lucene}*:*",
      "distrib":"false",
      "df":"_text_",
      "rows":"10",
      "echoParams":"all",
      "rid":"-43"}},
  "status":"OK"}



## Load dataset

In [5]:
CORD19_PATH = Path('../data/input/trec_cord19_v0.csv')

def load_cord19(input_fpath: Path, dtype: str = 'csv', cols_to_keep: list = ['cord_uid', 'abstract'], index_col = 'cord_uid') -> pd.DataFrame:
    """Loads CORD19 data and returns it as pandas data frame
    """
    if dtype == 'csv':
        df = pd.read_csv(input_fpath, quotechar='"', index_col=index_col, usecols=cols_to_keep)
        # for each column
        for col in df.columns:
            # check if the columns contains string data
            if pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip() # remove leading and trailing whitespace
                df[col] = df[col].str.replace(r'\s{2,}', ' ', regex=True) # collapse runs of whitespace into one space
    return df

cord19 = load_cord19(CORD19_PATH, cols_to_keep = ['cord_uid', 'abstract', 'title+abstract'])
cord19.head()

| cord_uid | abstract | title+abstract |
| --- | --- | --- |
| ug7v899j | OBJECTIVE: This retrospective chart review des... | Clinical features of culture-proven Mycoplasma... |
| 02tnwd4m | Inflammatory diseases of the respiratory tract... | Nitric oxide: a pro-inflammatory mediator in l... |
| ejv2xln0 | Surfactant protein-D (SP-D) participates in th... | Surfactant protein-D and pulmonary host defens... |
| 2b73a28n | Endothelin-1 (ET-1) is a 21 amino acid peptide... | Role of endothelin-1 in lung disease Endotheli... |
| 9785vg6d | Respiratory syncytial virus (RSV) and pneumoni... | Gene expression in epithelial cells in respons... |


In [6]:
cord19.info()

<class 'pandas.core.frame.DataFrame'>
Index: 127617 entries, ug7v899j to clmtwq4v
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   abstract        101395 non-null  object
 1   title+abstract  127617 non-null  object
dtypes: object(2)
memory usage: 2.9+ MB


In [7]:
cord19.dropna(subset=['abstract'], inplace=True)
cord19.isnull().sum()

abstract          0
title+abstract    0
dtype: int64

In [8]:
abstracts_dict = cord19['abstract'].to_dict()
len(abstracts_dict)

101395

## Build Solr Index

In [9]:
def build_solr_index(data_dict: dict, solr: pysolr.Solr):
    """Builds Solr Index using data dictionary where, key - docid and value - document text
    """
    solr_payloads = []
    for uid, text in data_dict.items():
        solr_payload = {
            "id": uid,
            "text": text
        }
        solr_payloads.append(solr_payload)
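        if len(solr_payloads) == 1000:
            solr.add(solr_payloads)
            solr_payloads = []
    # Send the final partial batch (fewer than 1000 documents)
    if solr_payloads:
        solr.add(solr_payloads)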

In [10]:
# %%time
# %memit build_solr_index(abstracts_dict, solr)

**Cell Output**

```
peak memory: 292.41 MiB, increment: 2.48 MiB
CPU times: user 1.43 s, sys: 169 ms, total: 1.6 s
Wall time: 34.7 s
```
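
When documents are sent in fixed-size batches, the last batch is usually smaller than the batch size and needs its own flush; with 101395 abstracts and batches of 1000, that is 395 documents. A minimal sketch of this pattern as a generator (the name `iter_batches` is hypothetical, not part of the notebook):

```python
from typing import Dict, Iterator, List


def iter_batches(data_dict: Dict[str, str], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of Solr payloads, including the final partial batch."""
    batch = []
    for uid, text in data_dict.items():
        batch.append({"id": uid, "text": text})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover documents that did not fill a whole batch
        yield batch


docs = {str(i): "some text" for i in range(2500)}
print([len(b) for b in iter_batches(docs)])
```

Indexing then becomes `for batch in iter_batches(abstracts_dict): solr.add(batch)`, which keeps the batching logic separate from the Solr calls.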

## Load topics

In [11]:
def load_queries(input_fpath: Path, dtype: str = 'csv', cols_to_keep=['topic-id', 'query', 'question'], index_col=['topic-id']) -> pd.DataFrame:
    """Loads queries file and returns it as pandas data frame
    """
    if dtype == 'csv':
        df = pd.read_csv(input_fpath, quotechar='"', index_col=index_col, usecols=cols_to_keep)
        # for each column
        for col in df.columns:
            # check if the columns contains string data
            if pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip() # remove leading and trailing whitespace
                df[col] = df[col].str.replace(r'\s{2,}', ' ', regex=True) # collapse runs of whitespace into one space
    return df

QUERY_FPATH = Path('../data/CORD-19/CORD-19/topics-rnd3.csv')
topics = load_queries(QUERY_FPATH)
topics.head()

| topic-id | query | question |
| --- | --- | --- |
| 1 | coronavirus origin | what is the origin of COVID-19 |
| 2 | coronavirus response to weather changes | how does the coronavirus respond to changes in... |
| 3 | coronavirus immunity | will SARS-CoV2 infected people develop immunit... |
| 4 | how do people die from the coronavirus | what causes death from Covid-19? |
| 5 | animal models of COVID-19 | what drugs have been active against SARS-CoV o... |


In [12]:
topics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 1 to 40
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   query     40 non-null     object
 1   question  40 non-null     object
dtypes: object(2)
memory usage: 960.0+ bytes


In [13]:
topics_dict = topics['query'].to_dict()
len(topics_dict)

40

### Test Query

In [14]:
topic_id = 1
print(f"Topic id: {topic_id}")
full_text_query = f"text:{topics_dict[topic_id]}"
print(full_text_query)

solr_query_param = {
    "fl": "id, score",
    "rows": 10
}
results = solr.search(full_text_query, **solr_query_param).docs
print("results", "\n".join(str(e) for e in results))

Topic id: 1
text:coronavirus origin
results {'id': '5tb29n9s', 'score': 1.5096586}
{'id': 'wbtaoo0o', 'score': 1.5071222}
{'id': '2u5qraea', 'score': 1.5061529}
{'id': 'lajzpk2c', 'score': 1.4965621}
{'id': 'e53w0ext', 'score': 1.4965621}
{'id': 'hix57xwa', 'score': 1.4923987}
{'id': 'r2w5csll', 'score': 1.4920434}
{'id': 'x6bryq9d', 'score': 1.4804857}
{'id': 'uc0vp5pr', 'score': 1.4777424}
{'id': 'xhf8yg6o', 'score': 1.4773382}


## a) Run Solr on `abstract` + `query`

In [15]:
txt_cols = 'abstract_query'
run_name = f'solr_baseline_{txt_cols}'
query_txt_col = 'query'
output_fpath = Path('../data/output') / f'{run_name}.txt'
write_results(output_fpath, topics, query_txt_col, solr, run_name)
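
Each line written by `write_results` follows the six-column TREC run format (`qid Q0 docid rank score run_name`) that `trec_eval` reads. As a rough sketch, the hypothetical helpers below format one line with the same template used in `solr.py` and parse it back, which can help when sanity-checking an output file before evaluation.

```python
TEMPLATE = "{} Q0 {} {} {:.6f} {}\n"  # same template as in solr.py


def format_run_line(qid, doc_id, rank, score, run_name) -> str:
    """Format one ranked result as a TREC run line."""
    return TEMPLATE.format(qid, doc_id, rank, score, run_name)


def parse_run_line(line: str) -> dict:
    """Split a run line into its six whitespace-separated columns."""
    qid, q0, doc_id, rank, score, run_name = line.split()
    if q0 != "Q0":
        raise ValueError(f"unexpected second column: {q0!r}")
    return {"qid": qid, "doc_id": doc_id, "rank": int(rank),
            "score": float(score), "run_name": run_name}


line = format_run_line(1, "5tb29n9s", 1, 1.5096586, "solr_baseline_abstract_query")
print(parse_run_line(line))
```

Because the columns are whitespace-separated, the run name should not contain spaces.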

### Run TREC Eval

In [16]:
path_to_qrel_file = "../data/qrels/qrels-covid_d3_j0.5-3.txt"
path_to_result_file = f"../data/output/{run_name}.txt"
output_result_path = f"../data/results/{run_name}_trec_eval.txt"
os.system("trec_eval -c -m all_trec {} {} > {}".format(path_to_qrel_file, path_to_result_file, output_result_path))
with open(output_result_path, encoding='utf-8') as f:
    print(f.read())

runid                 	all	solr_baseline_abstract_query
num_q                 	all	40
num_ret               	all	39510
num_rel               	all	10001
num_rel_ret           	all	604
map                   	all	0.0094
gm_map                	all	0.0003
Rprec                 	all	0.0263
bpref                 	all	0.0577
recip_rank            	all	0.0737
iprec_at_recall_0.00  	all	0.0840
iprec_at_recall_0.10  	all	0.0305
iprec_at_recall_0.20  	all	0.0221
iprec_at_recall_0.30  	all	0.0160
iprec_at_recall_0.40  	all	0.0025
iprec_at_recall_0.50  	all	0.0000
iprec_at_recall_0.60  	all	0.0000
iprec_at_recall_0.70  	all	0.0000
iprec_at_recall_0.80  	all	0.0000
iprec_at_recall_0.90  	all	0.0000
iprec_at_recall_1.00  	all	0.0000
P_5                   	all	0.0350
P_10                  	all	0.0350
P_15                  	all	0.0350
P_20                  	all	0.0375
P_30                  	all	0.0392
P_100                 	all	0.0348
P_200                 	all	0.0315
P_500                 	all	0.0204
P
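
The `trec_eval` output is a three-column listing (measure, query id, value), so the key metrics can be extracted programmatically instead of being copied by hand. A minimal sketch, assuming the saved file has the layout printed above; `parse_trec_eval` is a hypothetical helper, and the measure names `ndcg_cut_10` and `recall_1000` in the usage note are the ones `trec_eval` normally uses for NDCG@10 and R@1000.

```python
def parse_trec_eval(output: str) -> dict:
    """Map measure name -> value for the summary ('all') rows."""
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or truncated lines
        name, qid, value = parts
        if qid != "all":
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            metrics[name] = value  # e.g. the runid row
    return metrics


sample = "runid\tall\tsolr_baseline_abstract_query\nmap\tall\t0.0094\nP_5\tall\t0.0350\nP\n"
print(parse_trec_eval(sample))
```

With the full file loaded, `parse_trec_eval(text)["map"]` and, if present, `["ndcg_cut_10"]`, `["P_5"]`, and `["recall_1000"]` give the four numbers summarized below.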

Key metrics:

Solr `abstract` + `query` baseline results:

- `MAP` - 0.0094
- `NDCG@10` - 0.0318
- `P@5` - 0.0350
- `R@1000` - 0.0643

vs. TF-IDF `abstract` + `query` baseline results:

- `MAP` - 0.1069
- `NDCG@10` - 0.2899
- `P@5` - 0.38
- `R@1000` - 0.3480

TF-IDF appears to perform much better than Solr, an off-the-shelf text search engine, although Solr has faster retrieval and index build times.
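
To put the gap in numbers, the short sketch below divides each TF-IDF score by the corresponding Solr score, using only the values listed above (the variable names are illustrative).

```python
solr_scores = {"MAP": 0.0094, "NDCG@10": 0.0318, "P@5": 0.0350, "R@1000": 0.0643}
tfidf_scores = {"MAP": 0.1069, "NDCG@10": 0.2899, "P@5": 0.38, "R@1000": 0.3480}

# Ratio of TF-IDF to Solr for every metric reported above
ratios = {metric: tfidf_scores[metric] / solr_scores[metric] for metric in solr_scores}
for metric, ratio in ratios.items():
    print(f"{metric}: TF-IDF is {ratio:.1f}x the Solr score")
```

On these figures TF-IDF is roughly 5 to 11 times higher, depending on the metric.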

Now let's do a final run with the enhanced `abstract` (`title+abstract`) and `query` + `question`, and compare it against the TF-IDF results.

## b) Run Solr on enhanced `abstract` + `query` + `question`

The steps to follow are the same as before:

1. Create a new core in Solr and connect to it
2. Create enhanced dict
3. Build index
4. Combine query and question
5. Write retrieved ranked results
6. Evaluate using TREC eval

In [20]:
solr2 = create_solr_connection("cord19_2020_05_19_title_abstract", "http://0.0.0.0:8983")
print(solr2.ping())

{
  "responseHeader":{
    "zkConnected":null,
    "status":0,
    "QTime":0,
    "params":{
      "q":"{!lucene}*:*",
      "distrib":"false",
      "df":"_text_",
      "rows":"10",
      "echoParams":"all",
      "rid":"-85"}},
  "status":"OK"}



In [19]:
title_abstract_dict = cord19['title+abstract'].to_dict()
len(title_abstract_dict)

101395

In [22]:
# %%time
# %memit build_solr_index(title_abstract_dict, solr2)

**Cell Output**

```
peak memory: 509.21 MiB, increment: 4.95 MiB
CPU times: user 1.22 s, sys: 196 ms, total: 1.42 s
Wall time: 28.2 s
```

In [24]:
topics['query+question'] = topics['query'] + ' ' + topics['question']
topics.head()

| topic-id | query | question | query+question |
| --- | --- | --- | --- |
| 1 | coronavirus origin | what is the origin of COVID-19 | coronavirus origin what is the origin of COVID-19 |
| 2 | coronavirus response to weather changes | how does the coronavirus respond to changes in... | coronavirus response to weather changes how do... |
| 3 | coronavirus immunity | will SARS-CoV2 infected people develop immunit... | coronavirus immunity will SARS-CoV2 infected p... |
| 4 | how do people die from the coronavirus | what causes death from Covid-19? | how do people die from the coronavirus what ca... |
| 5 | animal models of COVID-19 | what drugs have been active against SARS-CoV o... | animal models of COVID-19 what drugs have been... |


In [27]:
txt_cols = 'title_abstract_query_question'
run_name = f'solr_baseline_{txt_cols}'
query_txt_col = 'query+question'
output_fpath = Path('../data/output') / f'{run_name}.txt'
write_results(output_fpath, topics, query_txt_col, solr2, run_name)

In [28]:
path_to_qrel_file = "../data/qrels/qrels-covid_d3_j0.5-3.txt"
path_to_result_file = f"../data/output/{run_name}.txt"
output_result_path = f"../data/results/{run_name}_trec_eval.txt"
os.system("trec_eval -c -m all_trec {} {} > {}".format(path_to_qrel_file, path_to_result_file, output_result_path))
with open(output_result_path, encoding='utf-8') as f:
    print(f.read())

runid                 	all	solr_baseline_title_abstract_query_question
num_q                 	all	40
num_ret               	all	39581
num_rel               	all	10001
num_rel_ret           	all	595
map                   	all	0.0102
gm_map                	all	0.0003
Rprec                 	all	0.0277
bpref                 	all	0.0583
recip_rank            	all	0.0634
iprec_at_recall_0.00  	all	0.0816
iprec_at_recall_0.10  	all	0.0310
iprec_at_recall_0.20  	all	0.0234
iprec_at_recall_0.30  	all	0.0174
iprec_at_recall_0.40  	all	0.0027
iprec_at_recall_0.50  	all	0.0000
iprec_at_recall_0.60  	all	0.0000
iprec_at_recall_0.70  	all	0.0000
iprec_at_recall_0.80  	all	0.0000
iprec_at_recall_0.90  	all	0.0000
iprec_at_recall_1.00  	all	0.0000
P_5                   	all	0.0450
P_10                  	all	0.0425
P_15                  	all	0.0383
P_20                  	all	0.0437
P_30                  	all	0.0425
P_100                 	all	0.0385
P_200                 	all	0.0323
P_500               

Key metrics:

Solr `title+abstract` and `query+question` results:

- `MAP` - 0.0102
- `NDCG@10` - 0.0355
- `P@5` - 0.0450
- `R@1000` - 0.0653

TF-IDF `title+abstract` and `query+question` results:

- `MAP` - 0.1517
- `NDCG@10` - 0.4273
- `P@5` - 0.0450
- `R@1000` - 0.4296

While there is a slight improvement over the previous results, performance is still poor compared with TF-IDF, although index build time and memory usage are much lower.