# Claim Sentence Query
___

This model is based on:

```Bibtex
@inproceedings{levyUnsupervisedCorpuswideClaim2017,
  title = {Unsupervised Corpus-Wide Claim Detection},
  author = {Levy, Ran and Gretz, Shai and Sznajder, Benjamin and Hummel, Shay and Aharonov, Ranit and Slonim, Noam},
  date = {2017},
  doi = {10.18653/v1/w17-5110},
}
```

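Parameter:
- Threshold for the retrieval score: a sentence is predicted to be a claim only if its score exceeds this value (set to 5 in the search step)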
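
As a minimal sketch of how this parameter acts, assuming every retrieved sentence carries a numeric retrieval score: a sentence is labeled as a claim only when its score is strictly above the threshold. The helper name `label_by_threshold` and the sample scores are hypothetical; the default of 5.0 mirrors the cutoff used in the search step of this notebook.

```python
def label_by_threshold(scores, threshold=5.0):
    """Return True for every score strictly above the threshold."""
    return [score > threshold for score in scores]


# Hypothetical retrieval scores, for illustration only.
example_scores = [7.2, 5.0, 3.1]
labels = label_by_threshold(example_scores)
print(labels)
```

A score exactly equal to the threshold is not accepted, which matches the strict `>` comparison in the search loop.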

In [15]:
import json
import os

import pandas as pd
from sklearn.metrics import classification_report

from config import CLAIM_LEXICON_PATH, DATA_PATH, INDEX_PATH, PYSERINI_PATH
from src.dataset import load_dataset
from src.searcher import convert_data, create_index, build_query

### 0. Load data

In [16]:
data = load_dataset()

In [4]:
with open(CLAIM_LEXICON_PATH, "r") as inFile:  # load claim lexicon
    # one lexicon term per line; skip blank lines such as a trailing newline
    claim_lexicon = [line.strip() for line in inFile if line.strip()]

In [5]:
convert_data(data["Sentence"], data_path=PYSERINI_PATH)  # convert data
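
The implementation of `convert_data` is not shown here. The indexer log below reports a `JsonCollection`, and the search step reads an `id` field from each hit, so the conversion plausibly writes one JSON object per sentence with an `id` and a `contents` field, which is the layout Pyserini's JSON collections expect. The sketch below illustrates that idea only; the function name `write_jsonl_collection`, the file name, and the sample sentences are assumptions, not the project's code.

```python
import json
import os
import tempfile


def write_jsonl_collection(sentences, data_path, file_name="sentences.jsonl"):
    """Write {id: sentence} pairs as JSON lines with 'id' and 'contents' fields."""
    os.makedirs(data_path, exist_ok=True)
    out_file = os.path.join(data_path, file_name)
    with open(out_file, "w", encoding="utf-8") as handle:
        for doc_id, sentence in sentences.items():
            handle.write(json.dumps({"id": doc_id, "contents": sentence}) + "\n")
    return out_file


# Hypothetical sentences keyed by their row index, for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_jsonl_collection({0: "first", 7: "second"}, tmp_dir)
    with open(path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle]
```

Keeping the row index as `id` is what would let the search step map each hit back to a row of `data`.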

### 1. Setup index

In [6]:
searcher = create_index(data_path=PYSERINI_PATH, index_path=INDEX_PATH, language="english")

2021-12-13 16:35:28,294 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Setting log level to INFO
2021-12-13 16:35:28,296 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Starting indexer...
2021-12-13 16:35:28,297 INFO  [main] index.IndexCollection (IndexCollection.java:648) - DocumentCollection path: data/pyserini
2021-12-13 16:35:28,298 INFO  [main] index.IndexCollection (IndexCollection.java:649) - CollectionClass: JsonCollection
2021-12-13 16:35:28,299 INFO  [main] index.IndexCollection (IndexCollection.java:650) - Generator: DefaultLuceneDocumentGenerator
2021-12-13 16:35:28,300 INFO  [main] index.IndexCollection (IndexCollection.java:651) - Threads: 1
2021-12-13 16:35:28,300 INFO  [main] index.IndexCollection (IndexCollection.java:652) - Stemmer: porter
2021-12-13 16:35:28,301 INFO  [main] index.IndexCollection (IndexCollection.java:653) - Keep stopwords? false
2021-12-13 16:35:28,302 INFO  [main] index.IndexCollection (IndexCollection.java:654) - 

### 2. Search index

In [7]:
predicted = {idx: False for idx in data.index}  # one prediction per sentence, default: not a claim

In [8]:
for main_concept in data["Article"].unique():
    # create query
    should = ["that"] + main_concept.split(" ") + claim_lexicon

    # search index
    hits = searcher.search(" ".join(should), k=1000)

    # parse results
    scores = []
    for hit in hits:
        if hit.score > 5:  # threshold for acceptable results
            doc_id = json.loads(hit.raw)["id"]  # map the hit back to its row in data
            predicted[doc_id] = True
        scores.append(hit.score)
    
    # pd.DataFrame(scores).plot(xlabel="position", ylabel="score")
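
The cutoff of 5 in the loop above is fixed by hand. One way to examine how the choice affects the results is to sweep several thresholds and compare precision and recall for the claim class. The sketch below does this in plain Python on toy data; the function names, the toy scores, and the toy gold labels are all hypothetical, and it assumes that a sentence never retrieved by any query is given a score low enough to fall below every threshold.

```python
def precision_recall(gold, predicted):
    """Precision and recall for the positive (claim) class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def sweep_thresholds(scores, gold, thresholds):
    """Map each threshold to the (precision, recall) it would produce."""
    results = {}
    for threshold in thresholds:
        predicted = [score > threshold for score in scores]
        results[threshold] = precision_recall(gold, predicted)
    return results


# Toy data, for illustration only.
toy_scores = [8.0, 6.0, 4.0, 2.0]
toy_gold = [True, False, True, False]
sweep = sweep_thresholds(toy_scores, toy_gold, thresholds=[3.0, 5.0, 7.0])
for threshold, (precision, recall) in sweep.items():
    print(threshold, round(precision, 2), round(recall, 2))
```

On the toy data, raising the threshold trades recall for precision, which is the usual pattern for a score cutoff.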

### 3. Evaluate results

In [9]:
print(classification_report(data["Claim"].to_list(), list(predicted.values())))

              precision    recall  f1-score   support

       False       0.55      0.81      0.66      1234
        True       0.64      0.34      0.44      1234

    accuracy                           0.58      2468
   macro avg       0.60      0.58      0.55      2468
weighted avg       0.60      0.58      0.55      2468
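
The f1-score column in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values printed above; the helper name `f1_score` is chosen here for illustration, and small deviations from the report are possible because the inputs are already rounded to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_true = f1_score(0.64, 0.34)   # row "True" of the report
f1_false = f1_score(0.55, 0.81)  # row "False" of the report
print(round(f1_true, 2), round(f1_false, 2))
```

The low recall of the claim class (0.34) is what pulls its F1 down, even though its precision is higher than that of the non-claim class.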

