# Assignment 3: Building an Inverted Index on the PubMed 200k Dataset

* Goal: learn to build an inverted index with an existing tool (Pyserini)
* Dataset: PubMed 200k
* Library: Pyserini https://github.com/castorini/pyserini
* Notes:
  * Use PubMed 200k (the PubMed 200k RCT dataset is described in Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.). The dataset can be downloaded from https://www.dropbox.com/s/miyb2awm2esrcpk/pubmed%20220%20train.txt?dl=0.

  * Use Pyserini (follow the instructions at https://github.com/castorini/pyserini#how-do-i-index-and-search-my-own-documents) to build an article-level inverted index.
* Requirement: compare the time taken by brute-force search (i.e., linear scan) with the time taken by retrieval through the inverted index.
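
To make the requirement concrete, here is a minimal sketch of the two strategies on a toy corpus. The document IDs, texts, and function names are made up for illustration; this is not how Pyserini or Lucene implement indexing, only the underlying idea: a linear scan touches every document per query, while an inverted index maps each term to the documents containing it ahead of time.

```python
from collections import defaultdict

# Hypothetical toy corpus; the IDs and texts are made up for illustration.
docs = {
    "d1": "aspirin reduces the risk of colon cancer",
    "d2": "exercise improves sleep quality",
    "d3": "screening for prostate cancer in older men",
}

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def linear_scan(docs, term):
    """Check every document in turn: cost grows with the collection size."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if term in text.lower().split())

def index_lookup(index, term):
    """One dictionary lookup: cost depends on the postings list, not the corpus."""
    return sorted(index.get(term, set()))

index = build_inverted_index(docs)
print(linear_scan(docs, "cancer"))    # ['d1', 'd3']
print(index_lookup(index, "cancer"))  # ['d1', 'd3']
```

Both functions return the same documents; the difference is that the index pays the scanning cost once, at build time, rather than on every query.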

## Load dataset

In [4]:
!mkdir /content/dataset # create the dataset directory
!mkdir /content/dataset/input # create the directory for the JSON input
!mkdir /content/dataset/output # create the directory for the index
!wget https://www.dropbox.com/s/miyb2awm2esrcpk/Pubmed%20220%20Train%202022-01-21.txt?dl=0 -O /content/dataset/Pubmed.txt # save the file to the given path

mkdir: cannot create directory ‘/content/dataset’: File exists
--2023-03-09 11:20:49--  https://www.dropbox.com/s/miyb2awm2esrcpk/Pubmed%20220%20Train%202022-01-21.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601a:18::a27d:712
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/miyb2awm2esrcpk/Pubmed%20220%20Train%202022-01-21.txt [following]
--2023-03-09 11:20:49--  https://www.dropbox.com/s/raw/miyb2awm2esrcpk/Pubmed%20220%20Train%202022-01-21.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4ce9eebf11c0d8affd1ab200cd.dl.dropboxusercontent.com/cd/0/inline/B35TZ65P3eoHjg2XDDJnV9yJOXp3p2dYbPlrFfvEAF-NAnqxAiLQYszomulMVArjb3fAHTkoM1wHGydhPjsiuvqV8pn-QQra2-G9NLbz8Z576RVtmbdlnyGB-OdCIB1XYNFIA7zpkcb9bfa5-A38b2J-5n_UgKycqBn945o41QVMbw/file# [following]
--2023-03-09 11:20:50--  https://uc4c

In [5]:
import pandas as pd
import json

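# Parse the raw dataset into one record per abstract
json_list = []
dict_ = {}
contents = ""

with open("/content/dataset/Pubmed.txt", 'r') as f:
    for line in f:

        # Abstract ID line, e.g. "###24491034"
        if line.startswith("###"):
            dict_['id'] = line.strip()[3:]

        # A blank line marks the end of an abstract
        elif line in ['\n', '\r\n']:
            dict_['contents'] = contents
            contents = ""
            json_list.append(dict_)
            dict_ = {}

        # Sentence line "LABEL<TAB>sentence": keep only the sentence text
        # (sentences are concatenated without a separator)
        else:
            contents += line.split('\t')[1].strip()

# Keep the last abstract if the file does not end with a blank line
if 'id' in dict_:
    dict_['contents'] = contents
    json_list.append(dict_)

# Serialize the list of records as a JSON array
with open('/content/dataset/input/json_data.json', 'w') as f:
    json.dump(json_list, f, indent = 4)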
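
The parsing logic can be checked on a small in-memory sample before running it on the full file. The sketch below uses a hypothetical helper, `parse_abstracts`, and made-up sample lines that follow the `###ID` / `LABEL<TAB>sentence` / blank-line layout assumed by the cell above. As a design choice of this sketch (not of the cell above), sentences are joined with a single space. Each record carries the `id` and `contents` fields that the indexing step reads.

```python
import json

def parse_abstracts(lines):
    """Group '###ID' / 'LABEL<TAB>sentence' / blank-line input into records."""
    records, current_id, sentences = [], None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("###"):          # start of a new abstract
            current_id, sentences = line[3:].strip(), []
        elif not line.strip():              # blank line closes the abstract
            if current_id is not None:
                records.append({"id": current_id, "contents": " ".join(sentences)})
            current_id, sentences = None, []
        else:                               # labelled sentence line
            sentences.append(line.split("\t", 1)[1].strip())
    if current_id is not None:              # input did not end with a blank line
        records.append({"id": current_id, "contents": " ".join(sentences)})
    return records

# Made-up sample in the assumed layout (not taken from the dataset).
sample = [
    "###1001\n",
    "OBJECTIVE\tTo test a toy example .\n",
    "RESULTS\tIt works .\n",
    "\n",
    "###1002\n",
    "METHODS\tA second abstract without a trailing blank line .\n",
]

records = parse_abstracts(sample)
print(json.dumps(records, indent=2))
```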

## Install the Pyserini package

In [6]:
pip install pyserini

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini
  Downloading pyserini-0.20.0-py3-none-any.whl (137.1 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting pyjnius>=1.4.0
  Downloading pyjnius-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting lightgbm>=3.3.2
  Downloading lightgbm-3.3.5-py3-none-manylinux1_x86_64.whl (2.0 MB)
Collecting onnxruntime>=1.

### Build the inverted index with the command from the Pyserini GitHub README
- https://github.com/castorini/pyserini

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input tests/resources/sample_collection_jsonl \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

In [7]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input '/content/dataset/input' \
  --index '/content/dataset/output' \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

2023-03-09 11:21:27,279 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Setting log level to INFO
2023-03-09 11:21:27,282 INFO  [main] index.IndexCollection (IndexCollection.java:394) - Starting indexer...
2023-03-09 11:21:27,283 INFO  [main] index.IndexCollection (IndexCollection.java:396) - DocumentCollection path: /content/dataset/input
2023-03-09 11:21:27,283 INFO  [main] index.IndexCollection (IndexCollection.java:397) - CollectionClass: JsonCollection
2023-03-09 11:21:27,284 INFO  [main] index.IndexCollection (IndexCollection.java:398) - Generator: DefaultLuceneDocumentGenerator
2023-03-09 11:21:27,284 INFO  [main] index.IndexCollection (IndexCollection.java:399) - Threads: 1
2023-03-09 11:21:27,284 INFO  [main] index.IndexCollection (IndexCollection.java:400) - Language: en
2023-03-09 11:21:27,285 INFO  [main] index.IndexCollection (IndexCollection.java:401) - Stemmer: porter
2023-03-09 11:21:27,285 INFO  [main] index.IndexCollection (IndexCollection.java:402) - 

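### Install Faiss
Faiss (Facebook AI Similarity Search) is an open-source library for efficient similarity search and clustering of dense, high-dimensional vectors. It scales to collections of billions of vectors and is one of the most widely used approximate nearest-neighbor search libraries.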

In [8]:
pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [9]:
from pyserini.search.lucene import LuceneSearcher
import time

# Load the JSON records written earlier; the file is closed automatically
with open('/content/dataset/input/json_data.json') as f:
    data = json.load(f)

In [10]:
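searcher = LuceneSearcher('/content/dataset/output')
hits = searcher.search("cancer")

# Print rank, document ID, score, and the stored raw JSON for each hit
# (iterating over hits avoids an IndexError when fewer than 10 are returned)
for i, hit in enumerate(hits):
    print(f'{i+1:2} {hit.docid:4} {hit.score:.5f}')
    print(hit.raw)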

 1 24747090 2.50920
{
  "id" : "24747090",
  "contents" : "We examine the role of body mass index in the assessment of prostate cancer risk .A total of 3,258 participants who underwent biopsy ( including 1,902 men with a diagnosis of prostate cancer ) were identified from the Selenium and Vitamin E Cancer Prevention Trial .The associations of body mass index with prostate cancer and high grade prostate cancer were examined using logistic regression , adjusting for age , race , body mass index adjusted prostate specific antigen , digital rectal examination , family history of prostate cancer , biopsy history , prostate specific antigen velocity , and time between study entry and the last biopsy .The prediction models were compared with our previously developed body mass index adjusted Prostate Cancer Prevention Trial prostate cancer risk calculator .Of the study subjects 49.1 % were overweight and 29.3 % were obese .After adjustment , among men without a known family history of prostate

### Retrieve from Inverted Index

In [18]:
searcher = LuceneSearcher('/content/dataset/output')

# Time the retrieval from the inverted index
start1 = time.perf_counter()      # perf_counter() uses the highest-resolution clock available, suited to timing short durations
hits = searcher.search('cancer')  # returns at most 10 hits by default
end1 = time.perf_counter()

### Print the inverted-index retrieval time (0.0226 seconds)

In [12]:
print("Total Retrieval: ", len(hits))
print(f"Inverted Index Retrieval: {end1 - start1:0.4f} seconds")

Total Retrieval:  10
Inverted Index Retrieval: 0.0226 seconds


### Retrieve by Linear Scan

In [15]:
pubmed_df = pd.read_json('/content/dataset/input/json_data.json')
pubmed_df.head()

|   | id | contents |
|---|----------|----------|
| 0 | 24491034 | The emergence of HIV as a chronic condition me... |
| 1 | 20497432 | The aim of this study was to evaluate the effi... |
| 2 | 19062107 | The aim of this prospective randomized study w... |
| 3 | 19769482 | To explore the effects of GengNianLe ( GNL , a... |
| 4 | 26077436 | Topical formulations of nonsteroidal anti-infl... |


In [19]:
start2 = time.perf_counter()
# Select the rows of pubmed_df whose contents contain the search term
result = pubmed_df.loc[ pubmed_df['contents'].str.contains("cancer") ] # df.loc[]: select rows by label, here through a boolean mask
end2 = time.perf_counter()

### Print the linear-scan retrieval time (0.2936 seconds)

In [21]:
print(f"Linear Scan Retrieval: {end2 - start2:0.4f} seconds")

Linear Scan Retrieval: 0.2936 seconds
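
With the timings recorded above, the inverted index answered the query roughly 13 times faster than the linear scan (0.0226 s versus 0.2936 s). The ratio is indicative only: each figure comes from a single run, and the two cells do different work (the searcher returns the top 10 ranked hits, while the scan returns every row containing the substring). A minimal sketch of a more stable measurement follows, using a hypothetical helper `best_of` that repeats a call and keeps the fastest run. The workload here is a stand-in so the snippet runs on its own.

```python
import time

def best_of(fn, repeats=5):
    """Run fn() several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; in the notebook this would be, for example,
#   lambda: searcher.search('cancer')
#   lambda: pubmed_df['contents'].str.contains('cancer')
elapsed = best_of(lambda: sum(range(100_000)))
print(f"Best of 5: {elapsed:0.4f} seconds")

# Speed-up implied by the single-run timings recorded in this notebook.
inverted_index_s = 0.0226
linear_scan_s = 0.2936
speedup = linear_scan_s / inverted_index_s
print(f"Linear scan / inverted index: {speedup:.1f}x")
```

Taking the minimum over several runs reduces the influence of one-off delays such as cold caches or background activity.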
