# SerpAPI Google Scholar Search

**Author:** Jack Galbraith-Edge
**Date:** 8th January 2025

Following the failure of my experiments with Scholarly, I opted to use SerpAPI to query Google Scholar and clean and manipulate the results to add to my systematic review database/literature search. SerpAPI has a free-tier up to 100 Google Scholar requests per month, so I decided to try this.

## Setup

In [3]:
# import libraries
import os   # for navigating file system
import json # for working with json files
from dotenv import load_dotenv  # to load enviromental variables and aid privacy of API keys and similar
from serpapi import GoogleSearch    # for querying google scholar
import pandas as pd # for working with dataframes

In [4]:
# load environmental variables
load_dotenv()

# get SerpAPI key from .env file.
try: 
    SERP_API_KEY = os.getenv('SERP_API_KEY')
    print("SerpAPI key loaded.")
except KeyError:
    print("SerpAPI Key not found")

# get file output path from .env
try:
    OUTPUT_PATH = os.getenv('OUTPUT_PATH')
    print("OUTPUT_PATH loaded.")
except KeyError:
    print("Error: OUTPUT_PATH not found in .env file")


SerpAPI key loaded.
OUTPUT_PATH loaded.


## Functions

In [19]:
# define a function to query google scholar
def query_google_scholar(google_scholar_query, api_key=SERP_API_KEY, max_results=300):

    """
    Usage: 
        - google_scholar_query: a string search query
        - api_key = SerpAPI key, available at: https://serpapi.com/
        - max_results = 300. Set at this level as single author and some evidence says that this is suitable limit.
    """

    all_results = []  # store all results here
    start = 0         # start with the first page

    while len(all_results) < max_results:
        # define parameters for searching google scholar
        params = {
          "api_key": api_key,  # SerpAPI key
          "engine": "google_scholar", # google_scholar engine
          "q": google_scholar_query, # define query
          "start": start,
          "num": 20
        }

        # search google scholar
        search = GoogleSearch(params)
        results = search.get_dict() # get results as dictionary

        # get organic results
        organic_results = results.get("organic_results", []) # retrieve organic list of results from results

        # add results to the list
        all_results.extend(organic_results)

        # check if there are no more results
        if len(organic_results) < 20:
            print("No more results to fetch.")
            break  # exit the loop

        # increment parameter for next page
        start += 20
        print(f"Fetched {len(all_results)} results so far...")

    return all_results[:max_results]

In [98]:
def scholar_to_df(organic_results, all_results_dict):

    """
    A function to transform dictionary output of SerpAPI Google Scholar Query into a
    uniform dataframe for use my systematic review.

    Requirements:
        - Pandas

    Usage: 
        - Takes input from query_google_scholar function.
        - Outputs a pandas DataFrame
    """
# convert organic results to dictionary
    organic_results_dict = {}

    for item in organic_results:
        organic_results_dict[len(organic_results_dict)] = item

    organic_results_df = pd.DataFrame(columns=[
    'Publication Year', 'First Author', 'Summary', 'Authors', 'Publication Title',
    'Title', 'Abstract', 'url', 'Database', 'Exclude', 'Reason ID',
    'Reason'])

    # create empty list to store rows
    rows = []

    # iterate through organic_results_dict
    for item in organic_results_dict.values():
        # extract information
        title = item.get('title', None)
        abstract = item.get('snippet', None)

        summary = item['publication_info'].get('summary', "Summary not found.")
        if summary:
            authors = summary.split(' - ')[0].strip() if ' - ' in summary else "No Author Listed"
            publication_info = summary.split(' - ')[1].strip() if ' - ' in summary else "No Publication Information"
            publication_title = publication_info.split(',')[0].strip() if ',' in publication_info else None
            year = publication_info.split(',')[1].strip() if ',' in publication_info else None
        else:
            authors = publication_title = year = None

        # add database specific fields
        database = "Google Scholar"
        exclude = None
        reason_id = None
        reason = None

        # append row as dictionary
        rows.append({
            'Publication Year': year,
            'First Author': f"{authors.split(' ')[-1]}, {authors.split(' ')[0][0]}." if authors else None,
            'Authors': authors,
            'Summary': summary,
            'Publication Title': publication_title,
            'Title': title,
            'Abstract': abstract, 
            'URL': item.get('link', None),
            'Database': database,
            'Exclude': exclude,
            'Reason ID': reason_id,
            'Reason': reason
        })

    organic_results_df = pd.DataFrame(rows).sort_values(by="Publication Year", ascending=True).reset_index(drop=True)

    return organic_results_df

In [78]:
# append to all results dictionary
def append_results(query, df, results_dict):

    """
    A function that appends query and the resultant dataframe to a dictionary.
    Allows for tracking of queries and for number of results from these queries. 
    """

    if results_dict is None:
        results_dict = {}
    results_dict[len(results_dict)] = {
        "query": query,
        "results": df}
    return results_dict

## Query

In [79]:
all_results_dict = {}

### Query 1

In [80]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign obj*" OR "foreign bod*")
                        AND
                        ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
                        AND
                        ("ingest*" OR "swallow*"))
                        AND
                        ("surg*" OR "endoscop*" OR "EGD" OR "OGD" OR "Esophagogastroduodenoscopy" OR "Oesophagogastroduodenoscopy" OR "manag*")"
                        """

In [81]:
organic_results = query_google_scholar(google_scholar_query)


Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
No more results to fetch.


In [82]:
google_scholar_results_df = scholar_to_df(organic_results)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

### Query 2

In [83]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign object" OR "foreign body")
                        """

In [84]:
# query google scholar
organic_results = query_google_scholar(google_scholar_query)

Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
Fetched 140 results so far...
Fetched 160 results so far...
Fetched 180 results so far...
Fetched 200 results so far...
Fetched 220 results so far...
Fetched 240 results so far...
Fetched 260 results so far...
Fetched 280 results so far...
Fetched 300 results so far...


In [85]:
google_scholar_results_df = scholar_to_df(organic_results)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

### Query Summary

In [101]:
# show number of results for each query and store in dataframe

# create results dataframe
results_df = pd.DataFrame(columns=["Query", "Num Results"])

# initialise list to store rows in 
rows = []

# iterate through all_results_dict that contains queries and results dataframes from query
for item in all_results_dict.values():
    
    query = item.get('query') # get query
    num_results = len(item.get('results')) # get number of results - max is 300.

    # append row as dictionary
    rows.append({
        "Query": query,
        "Num Results": num_results
        })

# create dataframe from rows
results_df = pd.DataFrame(rows)

results_df

Unnamed: 0,Query,Num Results
0,"\n (""foreign obj*"" OR ""...",60
1,"\n (""foreign object"" OR...",300
