## Summarizing research papers with an LLM, SerpApi, and Python ##

# Problem Statement:

The goal is to build an agent that takes a BibTeX file as input and returns a text report that summarizes the listed documents and names the most relevant documents, researchers, and institutions for the topic of each summarized document.

# Approach to the Problem:

The problem is treated as a linear sequence of steps:
1. Read the input BibTeX file
2. Parse the file and extract the titles
3. Download the document for each title
4. Summarize each document
5. Extract the topic of each document
6. For each topic, retrieve the most relevant articles, authors, and institutions
7. Report
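
The linear sequence above can be sketched as a chain of step functions in which a step that returns `None` ends processing for that title. This is only an illustration: `run_pipeline` and the stand-in steps are hypothetical names, not part of the notebook, which wires the steps together in its main loop instead.

```python
# Hypothetical sketch: run a linear sequence of steps on one item and
# stop at the first step that fails (signalled by returning None).
def run_pipeline(item, steps):
    value = item
    for name, step in steps:
        value = step(value)
        if value is None:
            # Report which step failed so the caller can log it and move on
            return None, name
    return value, None


# Stand-in steps (assumptions for illustration only)
def find_url(title):
    known = {"Paper A": "https://example.org/a.pdf"}
    return known.get(title)


def fake_download(url):
    return "downloaded_texts/" + url.rsplit("/", 1)[-1]


steps = [("find_url", find_url), ("download", fake_download)]

print(run_pipeline("Paper A", steps))
print(run_pipeline("Paper B", steps))
```

The second call stops at `find_url`, which mirrors how the main loop skips a title as soon as one step cannot produce a result.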

# Decisions Taken for Problem Solution:

1. Only PDF documents are handled. A more complete solution should support other formats.
2. Each document is assumed to have a single topic, which simplifies formulating the search in Google Scholar. A more complete solution should handle documents with multiple topics, such as article compilations.
3. Google Scholar's ranking is the authority for retrieving relevant authors, articles, and institutions. This assumes that the authors of the top-ranked articles on a topic are also the most relevant authors, and that their affiliated institutions are the most relevant institutions. The simplification may be excessive depending on the discipline. A more complete solution should apply reputation studies, analysis of communities of practice, and other scientometric methods. Depending on the discipline, other kinds of institutions, such as regulatory bodies, should also be considered.
4. The final report will be plain text, without formatting. In a more complete solution, templates should be considered for formatting the text and enhancing the readability of the report.
5. The LangChain library is used. It served both as the source for learning about agents and chains and as the main toolbox. This led to the most difficult decision: not to code an agent in the strict sense used by LangChain. According to the library's authors, an agent has a non-linear workflow in which it determines the course of action and chooses which of the available tools to use. Because the problem is treated here as a linear process, it does not fit that agent workflow, so the notebook uses chains instead.
6. A more complete solution should build an agent with a set of tools that allows it to plan the sequence of actions appropriately and dynamically change towards alternative workflows. For example, iteratively refining search results in Google Scholar, adapting strategies for finding relevant authors, articles, and institutions according to the discipline, being more flexible in processing information in its various formats, and formatting the report appropriately for the user.
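
Decision 3 treats the authors of the top-ranked articles as the most relevant authors. One way to make that assumption explicit, shown here only as a sketch, is to score each author by the rank of the articles they sign. The function name `score_authors`, the reciprocal-rank weighting, and the sample data are assumptions for illustration; the notebook itself delegates this selection to the LLM.

```python
from collections import defaultdict


def score_authors(ranked_articles, top_n=5):
    """Score authors by the rank of their articles (1 = top result).

    Each article contributes 1/rank to every one of its authors, so
    authors of higher-ranked articles accumulate higher scores.
    """
    scores = defaultdict(float)
    for rank, authors in enumerate(ranked_articles, start=1):
        for author in authors:
            scores[author] += 1.0 / rank
    # Sort by descending score, then by name for a stable order
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered[:top_n]]


# Made-up author lists, ordered as a search engine might rank the articles
ranked_articles = [
    ["A. Author", "B. Author"],
    ["C. Author"],
    ["B. Author"],
]
print(score_authors(ranked_articles, top_n=2))
```

Any other weighting (for example, a plain count of appearances) would fit the same structure; the reciprocal rank is just one simple choice.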

# Notebook Structure:

The notebook implements the linear sequence of steps in 6 cells. Besides the two usual cells that import the libraries and configure the initial variables, the cells are:
- Functions for working with BibTeX and downloading documents
- Functions for searching Google Scholar through SerpApi
- Use of LangChain
- Main execution loop




Import the libraries

In [None]:
import os
import re
import json
import requests
from bibtexparser import loads
from serpapi import GoogleSearch
from langchain import OpenAI, PromptTemplate, LLMChain
from langchain.document_loaders import PyMuPDFLoader
from langchain.chains.summarize import load_summarize_chain

Environment and local variables

In [None]:
# Keys
serpapi_key = ''
os.environ['OPENAI_API_KEY'] = ''

# Paths
path_bibtex_file = "test-data/items2.bib"
path_download_files = 'downloaded_texts'

# Model and basic parameters.
# Temperature is set to 0 to make results as reproducible as possible.
# LangChain's default model is kept.
llm = OpenAI(temperature=0)

BibTeX-related functions

In [None]:
# Parse a BibTeX file and return the list of titles
def load_bibtex(file_path):
    with open(file_path) as bibtex_file:
        bib_db = loads(bibtex_file.read())
    titles = []
    for entry in bib_db.entries:
        title = entry.get("title", "")
        # Skip entries without a title: they cannot be searched
        if title:
            titles.append(title)
    return titles

# Return the PDF download URL for a title, or None if none is found
def get_url_from_title(title):
    search_params = {
        "engine": "google_scholar",
        "q": title,
        "api_key": serpapi_key,
        "hl": "en"
    }
    search = GoogleSearch(search_params)
    results = search.get_dict()
    # A response without results has no usable 'organic_results' entry
    organic_results = results.get('organic_results') or []
    if not organic_results:
        return None
    # Only the first resource of the top-ranked result is considered
    resources = organic_results[0].get('resources') or []
    if resources and 'PDF' in str(resources[0]):
        return resources[0].get('link')
    return None

# Download a PDF from url and return the path of the saved file
def download_document(url, folder, title):
    # Keep only ASCII letters and shorten the title to build the file name
    title = re.sub('[^a-zA-Z]', '', title)
    title = title[:12]
    file_name = title + '.pdf'
    # Create the folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, file_name)
    session = requests.Session()
    response = session.get(url, stream=True, headers={'User-Agent': 'Mozilla/5.0'}, allow_redirects=True)
    # Download the file only if the request succeeded (status code 200)
    if response.status_code == 200:
        print('Downloading...')
        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(4096):
                f.write(chunk)
        if os.path.isfile(file_path):
            print('File downloaded successfully to ' + file_path + '!')
        else:
            print('Download failed.')
            return None
        return file_path
    else:
        print('Download failed.')
        return None
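
Because `download_document` keeps only the first 12 letters of the title, two different titles that share a prefix end up with the same file name, and the second download overwrites the first. A possible remedy, sketched here under the assumption that a short hash suffix is acceptable in file names, is to append a digest of the full title. `safe_file_name` is a hypothetical helper, not part of the notebook.

```python
import hashlib
import re


def safe_file_name(title, max_letters=12, digest_chars=8):
    """Build a PDF file name from a title that stays unique per title."""
    # Same cleaning as in download_document: keep ASCII letters only
    letters = re.sub('[^a-zA-Z]', '', title)[:max_letters]
    # A digest of the full title distinguishes titles sharing a prefix
    digest = hashlib.sha1(title.encode('utf-8')).hexdigest()[:digest_chars]
    # Titles without ASCII letters fall back to the digest alone
    stem = f"{letters}_{digest}" if letters else digest
    return stem + '.pdf'


first = safe_file_name("Deep learning for protein folding")
second = safe_file_name("Deep learning for image segmentation")
print(first)
print(second)
```

Both names start with the same 12 letters but differ in the suffix, so neither file replaces the other.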

SerpApi-related functions

In [None]:
# Return the first-author IDs and a string listing titles and authors
# for the top Google Scholar results on a topic
def get_ids_authors_articles(topic):
    res = ''
    authors_ids = []
    search_params = {
        "engine": "google_scholar",
        "q": topic,
        "api_key": serpapi_key,
        "num": 10,
        "hl": "en"
    }
    search = GoogleSearch(search_params)
    results = search.get_dict()
    # A response without results has no 'organic_results' entry
    results_organic = results.get("organic_results", [])
    # Extract title, authors, and first-author ID from each article
    for result in results_organic:
        title = result.get("title", "")
        publication_info = result.get("publication_info", {})
        if 'authors' in publication_info:
            authors = [author["name"] for author in publication_info["authors"]]
            first_author_id = publication_info['authors'][0].get('author_id')
            res += "Title: " + title + '\n'
            res += "Authors: " + str(authors) + '\n'
            if first_author_id:
                authors_ids.append(first_author_id)
    return authors_ids, res


# Return a dict mapping each author ID to that author's affiliations
def get_author_affiliations(author_ids):
    affiliations = {}
    for author_id in author_ids:
        url = f"https://serpapi.com/search.json?engine=google_scholar_author&author_id={author_id}&api_key={serpapi_key}"
        response = requests.get(url)
        data = json.loads(response.text)
        # A failed lookup has no "author" entry; store None in that case
        author = data.get("author", {})
        affiliations[author_id] = author.get("affiliations")
    return affiliations
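
Responses from a search API do not always contain every key, and indexing directly into a missing key or an empty list raises `KeyError` or `IndexError`. A small helper for safe nested access is one way to keep such lookups short. `dig` is a hypothetical helper, and the sample payload is hand-written to mirror only the kind of keys this notebook reads; it is not an actual SerpApi response.

```python
def dig(data, *path, default=None):
    """Follow a path of dict keys and list indexes, or return default."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written sample shaped like the keys read in this notebook
sample = {
    "organic_results": [
        {"title": "Paper A",
         "resources": [{"file_format": "PDF",
                        "link": "https://example.org/a.pdf"}]},
        {"title": "Paper B"},
    ]
}

print(dig(sample, "organic_results", 0, "resources", 0, "link"))
print(dig(sample, "organic_results", 1, "resources", 0, "link"))
print(dig({}, "organic_results", 0, "title", default="n/a"))
```

The second and third lookups hit a missing key and return the default instead of raising, which lets the caller decide how to handle an incomplete result.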

Functions for summarizing documents, extracting topics, and reporting on main authors, documents, and institutions per topic

In [None]:
# Return a summary of the document at document_path
# using a map_reduce summarize chain
def summarize_document(document_path):
    loader = PyMuPDFLoader(document_path)
    docs = loader.load()
    # Check that the PDF yields usable text (at least one letter)
    text = ''.join(doc.page_content for doc in docs)
    if re.search(r'[a-zA-Z]', text):
        chain = load_summarize_chain(llm, chain_type="map_reduce")
        summary = chain.run(docs)
        return summary
    return None

# Return topic from abstract using LLMChain
def extract_topic(summary):
    prompt = PromptTemplate(
        input_variables=["text"],
        template="From the abstract in this text extract its main topic: {text}",
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    topic = chain.run(summary)
    return topic

# Returns top 5 authors from a string containing author
# and article information using LLMChain to handle ambiguities
# in names, repetitions, etc.
def extract_main_authors(authors_and_articles):
    prompt = PromptTemplate(
        input_variables=["text"],
        template="From the following text extract the top five authors. Each author in a new line: {text}",
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    main_authors = chain.run(authors_and_articles)
    return main_authors

# Returns top 5 articles from a string containing author and article
# information using LLMChain to handle repeated titles with slight
# variations, etc.
def extract_main_articles(authors_and_articles):
    prompt = PromptTemplate(
        input_variables=["text"],
        template="From the following text extract the top five titles. Mention its authors: {text}",
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    main_articles = chain.run(authors_and_articles)
    return main_articles

# Returns main centers and institutions for a given topic
# based on authors' affiliations
def report_main_centers_and_institutions(affiliations):
    prompt = PromptTemplate(
        input_variables=["text"],
        template="Report only institution names from the following text. Ignore personal names, titles and roles. Also ignore incomplete information. Each institution in a new line: {text}",
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    top_centers = chain.run(str(affiliations))
    return top_centers
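
`summarize_document` asks LangChain for a `map_reduce` chain: each chunk of the document is summarized on its own (map) and the partial summaries are then summarized together (reduce). The sketch below illustrates only that control flow in plain Python. `map_reduce_summarize` is a hypothetical name, and `first_sentence` is a deliberately trivial stand-in for the LLM call, so this is not LangChain's implementation.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk, then summarize the joined partial summaries."""
    # Map: one partial summary per chunk
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: a single summary of all partial summaries
    return summarize(" ".join(partial_summaries))


def first_sentence(text):
    # Trivial stand-in for an LLM: keep the text up to the first period
    return text.split(".")[0].strip() + "."


pages = [
    "Transformers use attention. They scale well.",
    "Attention costs grow quadratically. Sparse variants help.",
]
print(map_reduce_summarize(pages, first_sentence))
```

The practical appeal of this pattern is that each map call sees only one chunk, so a long document never has to fit into a single prompt.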

Main loop

In [None]:
# Returns list of titles from Bibtex file at path_bibtex_file
titles = load_bibtex(path_bibtex_file)

for title in titles:
    # Obtain URL for PDF download version of title
    url = get_url_from_title(title)

    # If PDF version cannot be obtained, move on to the next title
    if url is None:
        print('Could not find a PDF of: ' + title)
        continue

    # Download document
    document_path = download_document(url, path_download_files, title)

    # If document cannot be downloaded, move on to the next title
    if document_path is None:
        print('Could not download the PDF of: ' + title)
        continue

    # Summarize document
    summary = summarize_document(document_path)

    # If text cannot be extracted from document, move on to the next title
    if summary is None:
        print('Could not extract text from the PDF of: ' + title)
        continue
    
    # Extract topic of the document
    topic = extract_topic(summary)
    
    # Get information about main authors and articles in the topic
    author_ids, authors_and_articles = get_ids_authors_articles(topic)
    
    # Obtain main authors
    main_authors = extract_main_authors(authors_and_articles)
    
    # Obtain main articles
    main_articles = extract_main_articles(authors_and_articles)
   
    # Get affiliations of each author
    affiliations = get_author_affiliations(author_ids)
    
    # Extract main centers and institutions from affiliations
    # (the function converts the dict to a string itself)
    top_centers = report_main_centers_and_institutions(affiliations)
   
    # Print obtained information for this title
    print('')
    print(f'Title: {title}')
    print('Abstract:')
    print(summary)
    print('Topic:')
    print(topic)
    print('Main authors in the topic: ')
    print(main_authors)
    print('')
    print('Main documents in the topic: ')
    print(main_articles)
    print('')
    print('Main centers and institutions: ')
    print(top_centers)
    print('')
    print('--------------------------------------')
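
Decision 4 notes that templates could make the plain-text report easier to read. As a sketch of that idea, the per-title output of the main loop could be rendered with `string.Template` from the standard library. The template text, the `render_report` helper, and the sample values are assumptions for illustration.

```python
from string import Template

REPORT_TEMPLATE = Template(
    "Title: $title\n"
    "Summary:\n$summary\n\n"
    "Topic: $topic\n\n"
    "Main authors in the topic:\n$main_authors\n\n"
    "Main documents in the topic:\n$main_articles\n\n"
    "Main centers and institutions:\n$top_centers\n"
    "--------------------------------------"
)


def render_report(**fields):
    # Strip the stray whitespace that LLM outputs often start or end with
    cleaned = {key: str(value).strip() for key, value in fields.items()}
    # safe_substitute leaves a placeholder visible if a field is missing
    return REPORT_TEMPLATE.safe_substitute(cleaned)


print(render_report(
    title="Example paper",
    summary="  A short summary.\n",
    topic="Example topic",
    main_authors="A. Author\nB. Author",
    main_articles="Example article (A. Author)",
    top_centers="Example University",
))
```

Keeping the layout in one template means the report format can change without touching the loop that gathers the data.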