# Identify the top 3 matching chunks using LlamaIndex

LlamaIndex:
- GitHub: https://github.com/run-llama/llama_index
- Documentation: https://docs.llamaindex.ai/en/latest/

In [1]:
import pandas as pd

# read the xlsx data into a pandas dataframe
df = pd.read_excel('../data/ReferenceErrorDetection_data.xlsx')

In [98]:
df

Unnamed: 0,Source,Citing Article ID,Citing Article DOI,Citing Article Title,Citing Article Retracted,Citing Article Downloaded,Domain,Statement with Citation,Reference Article ID,Reference Article DOI,Reference Article Title,Reference Article Abstract,Reference Article PDF Available,Reference Article Retracted,Reference Article Downloaded,Label,Explanation
0,PubPeer,c001,10.1016/j.est.2021.103553,Heating a residential building using the heat ...,Yes,Yes,Engineering,Others have aimed to reduce irreversibility or...,r001,10.1155/2021/2087027,A Fault Analysis Method for Three-Phase Induct...,The fault prediction and abductive fault diagn...,Yes,No,Yes,Unsubstantiate,Irrelevant
1,PubPeer,c001,10.1016/j.est.2021.103553,Heating a residential building using the heat ...,Yes,Yes,Engineering,Some researchers have also studied various hea...,r002,10.1016/j.physa.2018.12.031,Develop 24 dissimilar ANNs by suitable archite...,The artificial neural network optimization met...,Yes,No,Yes,Unsubstantiate,Irrelevant
2,PubPeer,c002,10.1155/2022/4601350,Oxidative Potential and Nanoantioxidant Activi...,Yes,Yes,Chemistry,The relative content of total flavonoids in th...,r003,10.1088/1742-6596/1937/1/012038,Lipid Data Acquisition for devices Treatment o...,"Recently, the widespread deployment of smart p...",Yes,No,Yes,Unsubstantiate,Irrelevant
3,PubPeer,c003,10.1155/2022/2408685,The Choice of Anesthetic Drugs in Outpatient H...,Yes,Yes,Medicine,Research has shown that remimazolam tosylate e...,r004,10.1186/s12871-018-0543-3,"Effect of propofol on breast cancer cell, the ...",Breast cancer is the second leading cause of c...,Yes,No,Yes,Unsubstantiate,Irrelevant
4,PubPeer,c004,10.1155/2022/4783847,A Fault-Tolerant Structure for Nano-Power Comm...,Yes,Yes,Engineering,if the efficiency of the routing algorithm is ...,r005,10.36410/jcpr.2022.23.3.312,Analysis and research hotspots of ceramic mate...,"From the perspective of scientometrics, comb t...",Yes,No,Yes,Unsubstantiate,Irrelevant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,"Smith & Cumberledge, 2020",c175,10.1038/ncomms15218,Fast oxygen diffusion and iodide defects media...,No,Yes,Chemistry,The impact of film microstructure on charge ca...,r239,10.1063/1.4889845,Role of the crystallization substrate on the p...,We have fabricated CH3NH3PbI3−xClx perovskite ...,Yes,No,Yes,Fully substantiate,
246,"Smith & Cumberledge, 2020",c176,10.1038/ncomms16007,Muscle-specific CRISPR/Cas9 dystrophin gene ed...,No,Yes,Medicine,Mutations in the dystrophin (DMD) gene result ...,r240,10.1038/345315a0,Deficiency of a glycoprotein component of the ...,"Dystrophin, the protein encoded by the Duchenn...",Yes,No,Yes,Fully substantiate,
247,"Smith & Cumberledge, 2020",c085,10.1038/s41467-017-00519-2,Nanodiamonds suppress the growth of lithium de...,No,Yes,Chemistry,Aggregation of nanodiamond particles cannot be...,r241,10.1016/j.diamond.2008.01.033,Deagglomeration and functionalisation of deton...,We have achieved the covalent functionalisatio...,Yes,No,Yes,Fully substantiate,
248,"Smith & Cumberledge, 2020",c177,10.1038/ncomms15081,Single-cell RNA-seq enables comprehensive tumo...,No,Yes,Medicine,Single-cell genome analysis is expected to hav...,r242,10.1101/gr.191098.115,The first five years of single-cell cancer gen...,Single-cell sequencing (SCS) is a powerful new...,Yes,No,Yes,Fully substantiate,


In [102]:
first_entry = df.query("`Reference Article Retracted` == 'No' and `Label` == 'Fully substantiate'").iloc[0]
first_entry

Source                                                     Smith & Cumberledge, 2020
Citing Article ID                                                               c058
Citing Article DOI                                           10.1126/science.aan0177
Citing Article Title               High dislocation density-induced large ductili...
Citing Article Retracted                                                          No
Citing Article Downloaded                                                        Yes
Domain                                                                       Physics
Statement with Citation            Tensile properties of our steel compared with ...
Reference Article ID                                                            r091
Reference Article DOI                                     10.2320/matertrans.46.1839
Reference Article Title            The role of retained austenite on tensile prop...
Reference Article Abstract         In high-carbon, silicon-rich s

In [103]:
import glob

import xml.etree.ElementTree as ET

# Construct the file path pattern using the Reference Article ID of the first entry
file_pattern = f"../data/extractions/{first_entry['Reference Article ID']}*.xml"

# Find the file that matches the pattern
file_list = glob.glob(file_pattern)
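if not file_list:
    # Fail early: later cells need reference_text to be defined
    raise FileNotFoundError(f"No extraction file matches {file_pattern}")

# glob does not guarantee an order, so sort to make the choice deterministic
file_path = sorted(file_list)[0]

# Parse the XML file
tree = ET.parse(file_path)
root = tree.getroot()

# Extract the text content from the XML file
reference_text = ''.join(root.itertext())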
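A caveat about the text extraction above: `itertext()` yields the text fragments of every element in document order, and joining them with an empty string adds no separator between neighbouring elements. Whether this fuses words depends on how the extraction XML is formatted. The sketch below uses a hypothetical miniature document (not one of the real extraction files) to show the difference between an empty-string join and a space-separated join.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real extraction files are much larger.
sample = "<doc><title>Bainite</title><p>Retained austenite</p></doc>"
root = ET.fromstring(sample)

# Joining with the empty string fuses text from neighbouring elements.
fused = "".join(root.itertext())

# Stripping each fragment and joining with a space keeps word boundaries.
spaced = " ".join(t.strip() for t in root.itertext() if t.strip())

print(fused)
print(spaced)
```

If the real files already contain whitespace between elements, the empty-string join used in the notebook is fine as it stands.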

## Set OpenAI key

In [72]:
# Read the content of open_ai_key.txt into a variable
with open('../open_ai_key.txt', 'r') as file:
    open_ai_key = file.read().strip()

In [None]:
# import os
# import openai

# os.environ["OPENAI_API_KEY"] = open_ai_key
# openai.api_key = os.environ["OPENAI_API_KEY"]

# import nest_asyncio

# nest_asyncio.apply()

## Setting up vector index

### Chunk splitting

In [104]:
from llama_index.core.node_parser import TokenTextSplitter

token_text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

In [105]:
reference_chunks = token_text_splitter.split_text(reference_text)

In [106]:
len(reference_chunks)

17
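To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not the `TokenTextSplitter` algorithm (which counts tokenizer tokens and respects separators); the function name `sliding_chunks` and the toy token list are assumptions made for illustration only.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: ten placeholder tokens instead of real tokenizer output.
tokens = [f"t{i}" for i in range(10)]
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive chunks repeat the overlapping tokens. With the notebook's settings (512 and 50), that stride would be 462 tokens in this simplified model.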

In [107]:
model = "gpt-3.5-turbo-0125"
model_embeddings = "text-embedding-3-small"

In [108]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

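# Pass the key explicitly to both clients, since the cell that sets OPENAI_API_KEY is commented out
Settings.embed_model = OpenAIEmbedding(model=model_embeddings, api_key=open_ai_key)
Settings.llm = OpenAI(model=model, api_key=open_ai_key)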

In [109]:
from llama_index.core import VectorStoreIndex, Document

documents = [Document(text=chunk) for chunk in reference_chunks]
index = VectorStoreIndex.from_documents(documents)

In [110]:
statement = first_entry["Statement with Citation"]
statement

'Tensile properties of our steel compared with those of other existing high strength metallic materials. These include nanobainite steel [citation 36].'

In [130]:
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

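# Configure the retriever to return the 3 most similar chunks
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)
# "no_text" skips LLM answer synthesis; the response only carries the retrieved source nodes
response_synthesizer = get_response_synthesizer(
    response_mode="no_text",
)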
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
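Conceptually, `similarity_top_k=3` asks the retriever to embed the query, score every chunk embedding against it, and keep the three best. The sketch below illustrates that idea with plain Python and toy 2-dimensional vectors. It assumes cosine similarity as the scoring function; the helper names `cosine` and `top_k` are hypothetical and are not part of LlamaIndex.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = top_k(query, chunk_vecs, k=3)
print(hits)
```

Real embeddings from `text-embedding-3-small` have far more dimensions, but the ranking step follows the same pattern.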

In [131]:
response = query_engine.query(statement)

In [138]:
print(len(response.source_nodes))

3


In [139]:
print(response.source_nodes[0])

Node ID: bfb6ba3a-1bdd-4ef4-bafd-9ae5f4d0d0bc
Text: - A machine learning software for extracting information from
scholarly documents
Bainite                                         retained austenite
mechanical stability
morphology                                         chemical
composition                                         ductility
strength
In high-carbon, silicon-rich steels it is possible to obtain a very
fine bainitic micros...
Score:  0.640
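To inspect all three hits rather than only the first, a small helper can collect rank, score, and a text preview for each. This is a sketch: `summarize_hits` is a hypothetical helper, and it assumes each hit exposes a `score` attribute and a `get_content()` method, as the objects in `response.source_nodes` are expected to. The `FakeHit` class is a stand-in so the example runs without an index or an API key.

```python
def summarize_hits(source_nodes, preview_chars=80):
    """Return (rank, score, one-line preview) for each retrieved hit."""
    rows = []
    for rank, hit in enumerate(source_nodes, start=1):
        # Collapse runs of whitespace and newlines into single spaces.
        text = " ".join(hit.get_content().split())
        score = None if hit.score is None else round(hit.score, 3)
        rows.append((rank, score, text[:preview_chars]))
    return rows


class FakeHit:
    """Stand-in for a retrieved node; for demonstration only."""

    def __init__(self, score, text):
        self.score = score
        self._text = text

    def get_content(self):
        return self._text


demo = summarize_hits([FakeHit(0.5, "first   chunk\ntext"), FakeHit(None, "second chunk")])
for row in demo:
    print(row)
```

In the notebook, the call would be `summarize_hits(response.source_nodes)`.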

