In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/semantic_search_package_final

/content/drive/MyDrive/semantic_search_package_final


# **Semantic Search Engine**

This notebook shows how to combine **data processing, retriever systems, a QnA model and a summarization model** to extract useful insights from a set of documents based on a user query.

The code below is based on the **haystack pipeline**, **tiger nlp** and **huggingface models**, but every function can be replaced with a custom module.

### **Imports and installations**




In [None]:
!pip install -r requirements.txt
# !pip uninstall -r requirements.txt -y

In [None]:
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
!tar -xvf xpdf-tools-linux-4.04.tar.gz
!sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

In [None]:
from haystack.nodes import PDFToTextConverter,PreProcessor, BM25Retriever, TfidfRetriever,EmbeddingRetriever, DensePassageRetriever
from haystack.utils import print_documents, convert_files_to_docs
from haystack.document_stores import InMemoryDocumentStore,ElasticsearchDocumentStore,FAISSDocumentStore
import os, re
import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore") 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# **1. Knowledge base creation**

This section converts our files into documents and then cleans the documents using the haystack pipeline.

**Haystack** is an open-source framework for building search systems that work intelligently over large document collections.

For more information on the haystack pipeline, refer to the link below:

https://docs.haystack.deepset.ai/docs/intro

## **1.1 Data Parsing**

This function reads different data sources (pdf/doc/images) and extracts information using the haystack pipeline.

The PreProcessor class is designed to clean text and split it into sensible units. It removes consecutive whitespace and splits a single large document into smaller documents. Each document is up to 100 words long (`split_length=100`), and document breaks cannot fall in the middle of sentences.

The preprocessed documents are then saved as Document objects.

Refer to https://haystack.deepset.ai/tutorials/08_preprocessing for more information on this section.
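
To make the splitting rule concrete, here is a minimal pure-Python sketch of word-based splitting that respects sentence boundaries. It is a simplified illustration and not Haystack's implementation: the function name `split_by_words` and the naive regex sentence splitter are assumptions made for this example.

```python
import re

def split_by_words(text, split_length=100):
    """Group whole sentences into chunks of at most `split_length` words.

    Simplified illustration only; a single sentence longer than
    `split_length` is kept whole rather than cut.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Lung cancer is common. Smoking is the main cause. Air pollution also matters."
for chunk in split_by_words(sample, split_length=8):
    print(chunk)
```

With a limit of 8 words, each sentence in the sample ends up in its own chunk, because no two adjacent sentences fit together under the limit.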

In [None]:
def data_processing(doc_dir = ''):
  """
  Data retrieval/processing function
  
  Parameters:
  doc_dir (str): Path to document location

  Returns:
  list: One list of processed Document chunks per input file

  """

  # Convert every file in the directory to a haystack Document
  all_docs = convert_files_to_docs(dir_path=doc_dir)

  # Preprocessor that cleans the text and splits it into chunks of up to 100 words
  preprocessor = PreProcessor(
      clean_empty_lines=True,
      clean_whitespace=True,
      clean_header_footer=True,
      split_by="word",
      split_length=100,
      split_respect_sentence_boundary=True,
  )
  # Preprocess each document; process() returns a list of chunks per document
  docs_processed = [preprocessor.process(doc) for doc in all_docs]

  return docs_processed

In [None]:
processed_data = data_processing(doc_dir = 'data')
print(len(processed_data))



7


# **2. Retrieval system**

The Retriever takes a query as input and checks it against the Documents contained in the DocumentStore. It scores each document for its relevance to the query and returns the top candidates.
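
As a rough illustration of what "scoring each document for its relevance" means, the sketch below ranks a few toy documents with a basic BM25 formula. It is not the code that `BM25Retriever` runs internally; the function names, the whitespace tokenizer and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the document list is assumed to be non-empty.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized slightly
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k(query, documents, k=5):
    """Return the k highest-scoring (score, document) pairs."""
    ranked = sorted(zip(bm25_scores(query, documents), documents), reverse=True)
    return ranked[:k]

docs = [
    "erlotinib improved survival in lung cancer patients",
    "air pollution is a risk factor",
    "the vaccine group showed improved survival",
]
for score, doc in top_k("lung cancer survival", docs, k=2):
    print(round(score, 3), doc)
```

The first toy document matches all three query terms and ranks highest, while the second matches none and scores zero.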

## **2.1 Document store creation**

DocumentStore is a database that stores our texts and metadata and provides them to the Retriever at query time.

We provide the DocumentStore as an argument when we initialize a Retriever.

We cast our data into Document objects before writing them into a DocumentStore. Here we load the processed data from the step above.

Refer to https://docs.haystack.deepset.ai/docs/document_store and
https://docs.haystack.deepset.ai/docs/retriever for more information on this section.

In [None]:
def semantic_search(processed_doc = '', retriever_type='bm25'):
  """
  Builds the document store and retriever used for semantic search

  Parameters:
  processed_doc (list): list of processed docs (one list of Documents per file)
  retriever_type (str): Type of retriever to be used ('bm25' or 'tf-idf')

  Returns:
  Object: Retriever object
  Object: Document store object

  """
  # Fail early instead of returning an undefined retriever
  if retriever_type not in ('bm25', 'tf-idf'):
    raise ValueError(f"Unsupported retriever_type: {retriever_type}")

  # BM25Retriever needs a document store created with use_bm25=True
  if retriever_type == 'bm25':
    document_store = InMemoryDocumentStore(use_bm25=True)
  else:
    document_store = InMemoryDocumentStore()

  # Writing the documents to the document store to be used by the retriever pipeline
  for doc in processed_doc:
    document_store.write_documents(doc)

  if retriever_type == 'tf-idf':
    retriever = TfidfRetriever(document_store)
  else:
    retriever = BM25Retriever(document_store)

  return retriever, document_store

In [None]:
query='What are the findings of NSCLC group of study?'
retriever_type = 'bm25'

In [None]:
retriever, document_store = semantic_search(processed_doc = processed_data, retriever_type=retriever_type)

Updating BM25 representation...:   0%|          | 0/10 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/29 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/55 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/71 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/375 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/385 [00:00<?, ? docs/s]

Updating BM25 representation...:   0%|          | 0/411 [00:00<?, ? docs/s]

## **2.2 Retriever**

The Retriever performs document retrieval by sweeping through a DocumentStore and returning a set of candidate Documents that are relevant to the *query*.

The **retrieve()** method returns a list of Document objects.

In [None]:
candidate_documents = retriever.retrieve(
    query=query,
    top_k=5,
)

The contents of the top 5 candidate documents are combined into a single context string.


In [None]:
data=pd.DataFrame(candidate_documents)
context=' '.join(data.content)

In [None]:
context

'Another large\nsystematic genomic study reclassified 12 tumor types into 11 subtypes based on the\nsequencing data from 3527 tumor cases (DNA copy number, DNA methylation, mRNA\nexpression, microRNA expression, protein expression and somatic point mutation). Somatic\nmutations such as KEAP1 and STK11 are preferentially mutated in LUAD-enriched tumors\ngroup, containing most of the lung adenocarcinoma cases, while CDKN2A, NOTCH1,\nMLL2 and NFE2L2 were found mutated preferentially in squamous-like tumors group\nencompassing most of the lung squamous cell carcinoma cases. Squamous-like tumors also\nshowed frequent MYC amplification and loss of CDKN2A, RB1 and TP53.  Takada M, Fukuoka M, Kawahara M, Sugiura T, Yokoyama A, Yokota S, Nishiwaki Y,\nWatanabe K, Noda K, Tamura T, Fukuda H, et al. Phase iii study of concurrent versus sequential\nthoracic radiotherapy in combination with cisplatin and etoposide for limited-stage small-cell\nlung cancer: Results of the japan clinical oncology gro

# **3. Extractive QnA**

The predict function takes a context and a question and extracts the answer from the given context.

The context here could be a text or a table. This task is usually solved with BERT-like models.

If the model name is not specified in the function, the **default model (deepset/tinyroberta-squad2)** is used.

We can use any pretrained QnA model from Huggingface, or our own fine-tuned Huggingface QnA model, to infer on the test input.

We can pass any kwargs to the predict function (e.g. doc_stride, max_answer_length).

Refer to https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.QuestionAnsweringPipeline for the list of available QnA parameters.

Below is an example snippet:


**Initialize the class**

In [None]:
%cd /content/drive/MyDrive/semantic_search_package_final/qna

/content/drive/MyDrive/semantic_search_package_final/qna


In [None]:
from question_answer import QnA
qna=QnA()

In [None]:
output=qna.predict(question=query,context=context,model_name='deepset/roberta-base-squad2')

Executed
Model Name: deepset/roberta-base-squad2


In [None]:
output

[{'question': 'What are the findings of NSCLC group of study?',
  'context': 'Another large\nsystematic genomic study reclassified 12 tumor types into 11 subtypes based on the\nsequencing data from 3527 tumor cases (DNA copy number, DNA methylation, mRNA\nexpression, microRNA expression, protein expression and somatic point mutation). Somatic\nmutations such as KEAP1 and STK11 are preferentially mutated in LUAD-enriched tumors\ngroup, containing most of the lung adenocarcinoma cases, while CDKN2A, NOTCH1,\nMLL2 and NFE2L2 were found mutated preferentially in squamous-like tumors group\nencompassing most of the lung squamous cell carcinoma cases. Squamous-like tumors also\nshowed frequent MYC amplification and loss of CDKN2A, RB1 and TP53.  Takada M, Fukuoka M, Kawahara M, Sugiura T, Yokoyama A, Yokota S, Nishiwaki Y,\nWatanabe K, Noda K, Tamura T, Fukuda H, et al. Phase iii study of concurrent versus sequential\nthoracic radiotherapy in combination with cisplatin and etoposide for limi

**Storing the results in the dataframe**

In [None]:
df=pd.DataFrame(output)
data[['name','split_id']]=data['meta'].apply(pd.Series)
df

| | question | context | predicted_answer | model_name | score | start | end |
|---|---|---|---|---|---|---|---|
| 0 | What are the findings of NSCLC group of study? | Another large\nsystematic genomic study reclassified 12 tumor types into 11 ... | significant improvement in the vaccine\ngroup comparing to the supportive group | deepset/roberta-base-squad2 | 0.041723 | 2061 | 2139 |


# **4. Text Summarization**

The predict function takes a context and generates a summary of it. The context here could be a text or a table.

See the up-to-date list of available models on https://huggingface.co/models?pipeline_tag=summarization

If the model name is not specified in the function, the default model (**facebook/bart-large-cnn**) is used.

We can use any pretrained text summarization model from Huggingface, or our own fine-tuned Huggingface model, to infer on the test input, e.g. model name **"text/training"**.

We can pass any kwargs to the predict function (e.g. min_length, max_length).

Refer to https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline for the list of available summarization parameters.

**Initialize the class**

In [None]:
%cd /content/drive/MyDrive/semantic_search_package_final/summarization/

/content/drive/MyDrive/semantic_search_package_final/summarization


In [None]:
from table_text_summarization import Summarizer
summarizer=Summarizer()

## **4.1 Context summary**

For the context summary, we take 300 characters before and after the **predicted answer** using its start and end index.

In [None]:
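context_window=300
# answer = str(df.predicted_answer)
score = float(df.score.iloc[0])
start = int(df.start.iloc[0])
end = int(df.end.iloc[0])
# Keep context_window characters on each side of the answer, clamped at the start of the text
context_subset=context[max(0, start-context_window):end+context_window]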

In [None]:
context_subset

'ir-\npollution/effects), as well as a variety of occupations are associated with an increased risk of getting lung\ncancer.\nPeople who’ve never smoked are more likely to develop one particular type of lung cancer called\nadenocarcinoma.\nLung cancer usually affects people over the age of 60. Younger people can develop lung cancer, but this is\nless common.\nWhat are the symptoms of lung cancer?\n Recently updated data from a phase IIB randomized study treating stage\nIIIB/IV NSCLC patients with L-BLP25 showed significant improvement in the vaccine\ngroup comparing to the supportive group (3-year survival rates: 31 vs. 17 % [241]).\nThe efficacy of TG4010, a recombinant vaccinia virus that combines the human MUC1 and\ninterleukin-2 coding sequences [66], in combination with cisplatin and vinorelbine or as\nmonotherapy has been investigated in a randomized phase II study for advanced NS'
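
The windowing step can also be wrapped in a small reusable helper that clamps both ends to the bounds of the text. This is only a sketch: `context_around_answer` is a hypothetical name and not part of the package, and the window is measured in characters because `start` and `end` are character offsets into the context string.

```python
def context_around_answer(context, start, end, window=300):
    """Return the answer span plus up to `window` characters on each side."""
    left = max(0, start - window)            # clamp at the start of the text
    right = min(len(context), end + window)  # clamp at the end of the text
    return context[left:right]

# Toy context with the answer at character offsets 50..56
text = "x" * 50 + "ANSWER" + "y" * 50
snippet = context_around_answer(text, start=50, end=56, window=10)
print(snippet)
```

Clamping matters when the answer sits near the beginning of the context: a negative slice start would otherwise count from the end of the string and return the wrong region.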

In [None]:
subset_summary=summarizer.predict(context=context_subset,model_name="facebook/bart-large-cnn",min_length=5, max_length=10)
subset_summary

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

[{'context': 'ir-\npollution/effects), as well as a variety of occupations are associated with an increased risk of getting lung\ncancer.\nPeople who’ve never smoked are more likely to develop one particular type of lung cancer called\nadenocarcinoma.\nLung cancer usually affects people over the age of 60. Younger people can develop lung cancer, but this is\nless common.\nWhat are the symptoms of lung cancer?\n Recently updated data from a phase IIB randomized study treating stage\nIIIB/IV NSCLC patients with L-BLP25 showed significant improvement in the vaccine\ngroup comparing to the supportive group (3-year survival rates: 31 vs. 17 % [241]).\nThe efficacy of TG4010, a recombinant vaccinia virus that combines the human MUC1 and\ninterleukin-2 coding sequences [66], in combination with cisplatin and vinorelbine or as\nmonotherapy has been investigated in a randomized phase II study for advanced NS',
  'predicted_summary': 'Lung cancer usually affects people over',
  'model_name': 'fa

## **4.2 Document summary**

In [None]:
summary=summarizer.predict(context=context,model_name="facebook/bart-large-cnn")

In [None]:
summary

[{'context': 'Another large\nsystematic genomic study reclassified 12 tumor types into 11 subtypes based on the\nsequencing data from 3527 tumor cases (DNA copy number, DNA methylation, mRNA\nexpression, microRNA expression, protein expression and somatic point mutation). Somatic\nmutations such as KEAP1 and STK11 are preferentially mutated in LUAD-enriched tumors\ngroup, containing most of the lung adenocarcinoma cases, while CDKN2A, NOTCH1,\nMLL2 and NFE2L2 were found mutated preferentially in squamous-like tumors group\nencompassing most of the lung squamous cell carcinoma cases. Squamous-like tumors also\nshowed frequent MYC amplification and loss of CDKN2A, RB1 and TP53.  Takada M, Fukuoka M, Kawahara M, Sugiura T, Yokoyama A, Yokota S, Nishiwaki Y,\nWatanabe K, Noda K, Tamura T, Fukuda H, et al. Phase iii study of concurrent versus sequential\nthoracic radiotherapy in combination with cisplatin and etoposide for limited-stage small-cell\nlung cancer: Results of the japan clinical

In [None]:
len(context)

3315

Chunk the document if it is too long to summarize in one pass.

In [None]:
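def get_chunks(s, maxlength):
    """Yield chunks of s of at most maxlength characters, split at spaces."""
    start = 0
    while start + maxlength < len(s):
        # Last space inside the current window
        end = s.rfind(" ", start, start + maxlength + 1)
        if end <= start:
            # No usable space in the window: fall back to a hard cut
            end = start + maxlength
            yield s[start:end]
            start = end
        else:
            yield s[start:end]
            start = end + 1
    yield s[start:]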

chunks = get_chunks(context, 500)

for n in chunks:
  summary=summarizer.predict(context=n,model_name="facebook/bart-large-cnn")
  print(summary)

Your max_length is set to 142, but you input_length is only 128. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=64)


[{'context': 'Another large\nsystematic genomic study reclassified 12 tumor types into 11 subtypes based on the\nsequencing data from 3527 tumor cases (DNA copy number, DNA methylation, mRNA\nexpression, microRNA expression, protein expression and somatic point mutation). Somatic\nmutations such as KEAP1 and STK11 are preferentially mutated in LUAD-enriched tumors\ngroup, containing most of the lung adenocarcinoma cases, while CDKN2A, NOTCH1,\nMLL2 and NFE2L2 were found mutated preferentially in squamous-like', 'predicted_summary': 'Another large-scale genomic study reclassified 12 tumor types into 11 subtypes. Somatic mutations such as KEAP1 and STK11 are preferentially mutated in LUAD-enriched tumors. CDKN2A, NOTCH1, MLL2 and NFE2L2 were found mutated preferently in squamous-like tumors.', 'model_name': 'facebook/bart-large-cnn'}]
[{'context': 'tumors group\nencompassing most of the lung squamous cell carcinoma cases. Squamous-like tumors also\nshowed frequent MYC amplification and l

Your max_length is set to 142, but you input_length is only 122. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


[{'context': 'other factors, such as air pollution (blf.org.uk/support-for-you/air-\npollution/effects), as well as a variety of occupations are associated with an increased risk of getting lung\ncancer.\nPeople who’ve never smoked are more likely to develop one particular type of lung cancer called\nadenocarcinoma.\nLung cancer usually affects people over the age of 60. Younger people can develop lung cancer, but this is\nless common.\nWhat are the symptoms of lung cancer?\n Recently updated data from a phase IIB', 'predicted_summary': 'Lung cancer usually affects people over the age of 60. Younger people can develop lung cancer, but this is less common. Other factors, such as air pollution (blf.org.uk/support-for-you/air-pollution/effects), as well as a variety of occupations are associated with an increased risk.', 'model_name': 'facebook/bart-large-cnn'}]


Your max_length is set to 142, but you input_length is only 127. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)


[{'context': 'randomized study treating stage\nIIIB/IV NSCLC patients with L-BLP25 showed significant improvement in the vaccine\ngroup comparing to the supportive group (3-year survival rates: 31 vs. 17 % [241]).\nThe efficacy of TG4010, a recombinant vaccinia virus that combines the human MUC1 and\ninterleukin-2 coding sequences [66], in combination with cisplatin and vinorelbine or as\nmonotherapy has been investigated in a randomized phase II study for advanced NSCLC\npatients [242,243]. A subgroup with a', 'predicted_summary': 'The efficacy of TG4010, a recombinant vaccinia virus that combines the human MUC1 and \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0interleukin-2 coding sequences, in combination with cisplatin and vinorelbine or as asmonotherapy has been investigated in a randomized phase II study for advanced NSCLC patients.', 'model_name': 'facebook/bart-large-cnn'}]


Your max_length is set to 142, but you input_length is only 141. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=70)


[{'context': 'detectable CD8+ T-cell response was able to generate\nan immune response against MUC1 and had longer median survival [243].  Symptom improvement in lung cancer patients treated\nwith erlotinib: Quality of life analysis of the national cancer institute of canada clinical trials\ngroup study br.21. Journal of clinical oncology : official journal of the American Society of\nClinical Oncology. 2006; 24(24):3831–3837. [PubMed: 16921034]\n128. Ciuleanu T, Stelmakh L, Cicenas S, Miliauskas S, Grigorescu AC,', 'predicted_summary': 'Elotinib was able to generate an immune response against MUC1 and had longer median survival. detectable CD8+ T-cell response. Patients treated with the drug had a better quality of life, according to a Canadian clinical trials study. The study was published in the Journal of clinical oncology.', 'model_name': 'facebook/bart-large-cnn'}]


Your max_length is set to 142, but you input_length is only 104. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=52)


[{'context': 'Hillenbach C, Johannsdottir\nHK, Klughammer B, Gonzalez EE. Efficacy and safety of erlotinib versus chemotherapy in\nsecond-line treatment of patients with advanced, non-small-cell lung cancer with poor prognosis\n(titan): A randomised multicentre, open-label, phase 3 study. Lancet Oncol. 2012; 13(3):300–\n308. [PubMed: 22277837]\n129. ', 'predicted_summary': 'Elotinib is being trialled in patients with non-small-cell lung cancer with poor prognosis. The study was a randomised multicentre, open-label, phase 3 study. The results of the study were published in the Lancet Oncol 2012.', 'model_name': 'facebook/bart-large-cnn'}]
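
The per-chunk summaries printed above can be joined into a single document summary. The sketch below shows one way to do that under stated assumptions: `chunk_words`, `summarize_long_text` and the stand-in `first_sentence` summarizer are hypothetical names for illustration, and the chunking here is word-based rather than character-based. In this notebook, the callable passed as `summarize_fn` would presumably wrap `summarizer.predict` and read the `predicted_summary` field shown in the outputs.

```python
def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text, summarize_fn, max_words=100):
    """Summarize each chunk with `summarize_fn` and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(p.strip() for p in partial if p.strip())

# Stand-in summarizer for illustration: keeps the first sentence of each chunk
def first_sentence(chunk):
    return chunk.split(". ")[0].rstrip(".") + "."

long_text = "Alpha beta gamma. Delta epsilon. " * 3
print(summarize_long_text(long_text, first_sentence, max_words=5))
```

Passing the summarizer in as a function keeps the chunking logic independent of any particular model, so the same helper works with a different summarization backend.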
