<a href="https://www.kaggle.com/code/konggas/ai-research-db?scriptVersionId=236011131" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Goal: AI is a rapidly expanding field, making it challenging to keep up with the latest research. This notebook aims to assist newcomers in navigating this fast-paced domain, starting with keywords or a specific paper found online.

# Summary

- Collection of infulential papers
- Information retrieval with LLM
- Embedding text indformation
- Vector DB construction
- Retrival process for the relevant papers

# Input Data

- An organization collected influential papers.
    - (Most Influential ArXiv (Artificial Intelligence) Papers (2025-03 Version))
    - https://www.paperdigest.org/2025/03/most-influential-arxiv-artificial-intelligence-papers-2025-03-version/
- The pdf-formatted (300+) papers are loaded as the notebook session starts as "aipapers".

## Collecting file paths

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
fileList = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        fileList.append(os.path.join(dirname, filename))

/kaggle/input/aipapers/2309.10253.pdf
/kaggle/input/aipapers/1712.01815.pdf
/kaggle/input/aipapers/2205.10330.pdf
/kaggle/input/aipapers/2011.01975.pdf
/kaggle/input/aipapers/2107.09645.pdf
/kaggle/input/aipapers/1106.0675.pdf
/kaggle/input/aipapers/2203.15103.pdf
/kaggle/input/aipapers/2103.10213.pdf
/kaggle/input/aipapers/1505.03953.pdf
/kaggle/input/aipapers/2402.01817.pdf
/kaggle/input/aipapers/1401.3841.pdf
/kaggle/input/aipapers/1410.3916.pdf
/kaggle/input/aipapers/2008.06693.pdf
/kaggle/input/aipapers/1502.03552.pdf
/kaggle/input/aipapers/1506.02465.pdf
/kaggle/input/aipapers/2205.09712.pdf
/kaggle/input/aipapers/1207.4166.pdf
/kaggle/input/aipapers/2308.02490.pdf
/kaggle/input/aipapers/1304.2759.pdf
/kaggle/input/aipapers/1602.01585.pdf
/kaggle/input/aipapers/2206.06994.pdf
/kaggle/input/aipapers/2310.12036.pdf
/kaggle/input/aipapers/1510.04935.pdf
/kaggle/input/aipapers/1207.1359.pdf
/kaggle/input/aipapers/2011.08612.pdf
/kaggle/input/aipapers/1509.08973.pdf
/kaggle/input/aipa

# Importing necessary libraries 

In [None]:
!pip install jupyterlab
!pip uninstall -qy kfp jupyterlab libpysal thinc spacy fastai ydata-profiling google-cloud-bigquery google-generativeai
!pip install -U -q "google-genai==1.7.0"
!pip install pandas
!pip install chromadb

# Importing sectet key

In [3]:
import os
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [26]:
fileList2 = fileList

# Generation of JSON (text) with Prompting technique.

- (Structured output/JSON mode/controlled generation : the first gen AI capabilities)
- (Document understanding : the second gen AI capabilities)

## JSON has the following information
    - name of the paper
    - abstract
    - concise summary of the paper
    - summary of the paper
    - keywords

In [None]:
from google import genai
from google.genai import types
import pathlib
import httpx

aPapers = []
client = genai.Client()

for idx, f in enumerate(fileList2):
    filepath = pathlib.Path(f)
    id = doc_url.split("/")[-1]

    prompt = """Generate a JSON response from the attached PDF that adheres to the following schema:
    ```json
    {
      "type": "object",
      "properties": {
        "id": {
          "type": "string",
          "description": "The file name."
        },
        "name": {
          "type": "string",
          "description": "The title or name of the paper."
        },
        "summary": {
          "type": "string",
          "description": "A concise summary of the entire document."
        },
        "abstract": {
          "type": "string",
          "description": "The abstract of the paper."
        },
        "keywords": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "description": "An array of keywords extracted from the main text."
        },
        "text_summary": {
           "type": "string",
           "description": "A long summary of the main text"
        }
      },
      "required": [
        "id",
        "name",
        "summary",
        "abstract",
        "keywords",
        "text_summary"
      ],
      "description": "A JSON object containing information about the paper."
    }
    """
    response = client.models.generate_content(
      model="gemini-1.5-flash",
      contents=[
          types.Part.from_bytes(
            data=filepath.read_bytes(),
            mime_type='application/pdf',
          ),
          prompt])
    aPapers.append(response.text)
    print(idx, id)


# Generation of text information list from JSON text

In [36]:
import json as json

long_string = "This string has string and another string."
substringBeg = "{"
substringEnd = "}"
indices = []
uPapers = []

for st in aPapers:
    BegIndex = st.find(substringBeg, 0)
    if BegIndex == -1:
        continue
    EndIndex = st.find(substringEnd, BegIndex + 1)
    if BegIndex == -1:
        continue
    newString = st[BegIndex:EndIndex+1]
    
    new_string = newString.replace("\"", " ")
    try:
        python_dictionary = json.loads(newString)
        uPapers.append(python_dictionary)
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        print(f"The extracted JSON string was:\n'{newString}'")
        print(f"The extracted JSON string was:\n'{response.text}'")
    


Error decoding JSON: Expecting ',' delimiter: line 18 column 4 (char 4244)
The extracted JSON string was:
'{
  "type": "object",
  "properties": {
    "id": "1402.6028v1",
    "name": "Algorithms for the multi-armed bandit problem",
    "summary": "This paper presents a thorough empirical study of popular multi-armed bandit algorithms, revealing that simple heuristics often outperform theoretically sound algorithms.  The study identifies settings where algorithms perform well or poorly, which is not explained by current theory.  It also applies bandit algorithms to clinical trials, simulating a real study and demonstrating a significant improvement in patient outcomes using adaptive strategies. ",
    "abstract": "The stochastic multi-armed bandit problem is an important model for studying the exploration-exploitation tradeoff in reinforcement learning. Although many algorithms for the problem are well-understood theoretically, empirical confirmation of their effectiveness is generally

# Creation of vector DB (ChromaDB) with {'id', 'summary'}

In [9]:
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

## Preparation of the information vector DB

In [16]:
documents = []
ids = []
lPapers = []
for l in uPapers:
    if l["id"] not in ids:
        ids.append(l["id"])
        lPapers.append(l)
        documents.append(l["summary"])

# The information for vector DB was added to ChromaDB.

- (Embeddings : the third gen AI capabilities)
- (Vector search/vector store/vector database: the fourth gen AI capabilities)

In [17]:
collection.add(
    documents=documents,
    ids=ids
)

# Information retrieval

## The retrieval of the papers with the phrase ("transformer is different from bert")

In [15]:
results = collection.query(
    query_texts=["transformer is different from bert"], # Chroma will embed this for you
    n_results=10 # how many results to return
)
print(results)

{'ids': [['arXiv:2302.09419v3', 'arXiv:1906.00346v2', 'arxiv_2205.13504v3.pdf', 'arXiv:1902.00098v1', '2407.21783v3', 'IJISRT24MAY1483', 'arxiv_2205.15241v2', 'arXiv:2308.02490v4', 'arXiv:2302.12095v5', 'A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers.pdf']], 'embeddings': None, 'documents': [['Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is trained on large-scale data which provides a reasonable parameter initialization for a wide range of downstream applications. In contrast to earlier approaches that utilize convolution and recurrent modules to extract features, BERT learns bidirectional encoder representations from Transformers, which are trained on large datasets as contextual language models. Similarly, the Generative Pretrained Transformer (GPT) method employs Transformers as the feature extractor and is trained using an autor

# End of Notebook


# Other codes not related to the work

## The following code is to download pickle file for the snapshot for the development.

In [18]:
import pickle as pickle

variables_to_store = {
    'aPapers': aPapers,
    'uPapers':uPapers,
    'lPapers':lPapers,
    'ids':ids,
    'documents':documents,
    'fileList':fileList
}

# Specify the filename to save the pickled data
filename = "aidbInput2.pkl"

try:
    with open(filename, 'wb') as f:
        pickle.dump(variables_to_store, f)
    print(f"User-defined variables stored in '{filename}'")

except Exception as e:
    print(f"An error occurred during pickling: {e}")

User-defined variables stored in 'aidbInput2.pkl'


In [39]:
import pickle as pickle

variables_to_store = {
    'aPapers': aPapers,
    'uPapers':uPapers
}

# Specify the filename to save the pickled data
filename = "auPapers.pkl"

try:
    with open(filename, 'wb') as f:
        pickle.dump(variables_to_store, f)
    print(f"User-defined variables stored in '{filename}'")

except Exception as e:
    print(f"An error occurred during pickling: {e}")

User-defined variables stored in 'auPapers.pkl'


In [7]:
import pickle


with open('/kaggle/input/aupapers/auPapers.pkl', 'rb') as f:
    data = pickle.load(f)

# ids = data['ids']
# documents = data['documents']
aPapers = data['aPapers']
uPapers = data['uPapers']

In [31]:
import pandas as pd

df = pd.DataFrame(uPapers)

In [30]:
df

Unnamed: 0,id,name,summary,abstract,keywords,text_summary,pdfName
0,arXiv:2309.10253v4,GPTFUZZER: Red Teaming Large Language Models w...,"This paper introduces GPTFUZZER, a novel black...",Large language models (LLMs) are widely used b...,"[Large Language Models, LLMs, Jailbreak Attack...",The paper addresses the challenge of evaluatin...,arXiv:2309.10253v4
1,1712.01815v1,Mastering Chess and Shogi by Self-Play with a ...,"This paper introduces AlphaZero, a general rei...",The game of chess is the most widely-studied d...,"[reinforcement learning, AlphaZero, chess, sho...","The paper details AlphaZero, a general-purpose...",1712.01815v1
2,arXiv:2205.10330v5,A Review of Safe Reinforcement Learning: Metho...,This paper reviews safe reinforcement learning...,Reinforcement Learning (RL) has achieved treme...,"[safe reinforcement learning, safety optimisat...",This comprehensive review delves into the fiel...,arXiv:2205.10330v5
3,arXiv:2011.01975v1,Rearrangement: A Challenge for Embodied AI,This paper proposes a framework for research a...,We describe a framework for research and evalu...,"[Embodied AI, Rearrangement, Robotics, Reinfor...",The paper introduces rearrangement as a canoni...,arXiv:2011.01975v1
4,arxiv_2107.09645v1.pdf,Mastering Visual Continuous Control: Improved ...,"This paper introduces DrQ-v2, a model-free rei...","We present DrQ-v2, a model-free reinforcement ...","[Reinforcement Learning, Visual Continuous Con...","The paper presents DrQ-v2, an improved version...",arxiv_2107.09645v1.pdf
...,...,...,...,...,...,...,...
321,Identifying Mislabeled Training Data.pdf,Identifying Mislabeled Training Data,This paper introduces a new approach to identi...,This paper presents a new approach to identify...,"[mislabeled training data, supervised learning...",The paper addresses the problem of mislabeled ...,Identifying Mislabeled Training Data.pdf
322,arXiv:1803.05457v1,Think you have Solved Question Answering? Try ...,This paper introduces the AI2 Reasoning Challe...,"We present a new question set, text corpus, an...","[Question Answering, AI, Reasoning, Machine Le...",The AI2 Reasoning Challenge (ARC) is introduce...,arXiv:1803.05457v1
323,UNCERTAINTY_IN_ARTIFICIAL_INTELLIGENCE_PROCEED...,Learning to Cooperate via Policy Search,This paper introduces a gradient-based distrib...,Cooperative games are those in which both agen...,"[Cooperative games, Policy search, Reinforceme...",The paper addresses the problem of cooperative...,Causal Inference in the Presence of Latent Var...
324,Causal Inference in the Presence of Latent Var...,Causal Inference in the Presence of Latent Var...,This paper presents a method for discovering c...,"We show that there is a general, informative a...","[Causal Inference, Latent Variables, Selection...",The paper addresses the challenges of causal i...,arXiv:2405.00451v2
