# SerpAPI Google Scholar Search

**Author:** Jack Galbraith-Edge

**Date:** 8th January 2025

## Background
Google Scholar is a popular tool for searching grey literature. Databases such as PubMed and Embase are curated, maintained by humans and contain only published literature. Google Scholar, on the other hand, searches the internet for what it believes are academic materials and then displays them in a list. As a result, Google Scholar is excellent for searching grey literature, as results will include unpublished data such as theses and other useful material, reducing publication bias and susceptibility bias (Haddaway et al., 2015).

Unfortunately, because Google Scholar references the internet in this way and is not itself a database, querying can be cumbersome and the number of results returned can be overwhelming.

Following the failure of my experiments with Scholarly, I opted to use SerpAPI to query Google Scholar, then clean and manipulate the results to add to my systematic review database/literature search. SerpAPI has a free tier of up to 100 Google Scholar requests per month, so I used my free trial to give it a go.

This document details my search queries and how I've tailored them to gather an appropriate number of results.

## Important
The result limit in my query function is set to 300. This is because some research suggests that researchers should focus on the first 200-300 results (Haddaway et al., 2015).
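To see how this limit interacts with the free tier, the request budget can be worked out ahead of time. This is a rough sketch only: it assumes 20 results per request (the page size used by the query function under Functions) and the 100-request monthly allowance quoted in the Background. `requests_needed` is a hypothetical helper, not part of the notebook.

```python
import math

PAGE_SIZE = 20            # results per request, matching "num": 20 in the query function
FREE_TIER_REQUESTS = 100  # monthly allowance quoted in the Background section

def requests_needed(max_results, page_size=PAGE_SIZE):
    """Upper bound on the API calls needed to collect max_results results."""
    return math.ceil(max_results / page_size)

per_query = requests_needed(300)
print(per_query)                        # 15
print(FREE_TIER_REQUESTS // per_query)  # 6
```

Under these assumptions a query that reaches the 300-result cap costs 15 requests, so roughly six such queries fit within one month's allowance.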

## Setup

In [56]:
# import libraries
import os   # for navigating file system
import json # for working with json files
from dotenv import load_dotenv  # to load environment variables and keep API keys and similar private
from serpapi import GoogleSearch    # for querying google scholar
import pandas as pd # for working with dataframes

In [57]:
# load environmental variables
load_dotenv()

# get SerpAPI key from .env file.
# os.getenv returns None (it does not raise KeyError) when a variable is missing
SERP_API_KEY = os.getenv('SERP_API_KEY')
if SERP_API_KEY:
    print("SerpAPI key loaded.")
else:
    print("SerpAPI Key not found")

# get file output path from .env
OUTPUT_PATH = os.getenv('GS_SEARCHES_OUTPUT_PATH')
if OUTPUT_PATH:
    print("OUTPUT_PATH loaded.")
else:
    print("Error: OUTPUT_PATH not found in .env file")


SerpAPI key loaded.
OUTPUT_PATH loaded.
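
Because `os.getenv` returns `None` for a missing variable rather than raising an exception, a stricter alternative is to stop immediately when a required setting is absent. The following is a minimal sketch; `require_env` is a hypothetical helper and not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} not found; check the .env file")
    return value

# Possible usage after load_dotenv() has run:
# SERP_API_KEY = require_env('SERP_API_KEY')
# OUTPUT_PATH = require_env('GS_SEARCHES_OUTPUT_PATH')
```

Failing early here avoids spending API requests on a run that cannot save its output.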


## Functions

In [58]:
# define a function to query google scholar
def query_google_scholar(google_scholar_query, api_key=SERP_API_KEY, max_results=300):

    """
    Query google scholar using SerpAPI
    
    Parameters: 
        - google_scholar_query: a string search query
        - api_key: SerpAPI key, available at: https://serpapi.com/
        - max_results: maximum number of results to return (default 300). Set at this level as single author and some evidence suggests this is a suitable limit.
    """

    all_results = []  # store all results here
    start = 0         # start with the first page

    while len(all_results) < max_results:
        # define parameters for searching google scholar
        params = {
          "api_key": api_key,  # SerpAPI key
          "engine": "google_scholar", # google_scholar engine
          "q": google_scholar_query, # define query
          "start": start,
          "num": 20
        }

        # search google scholar
        search = GoogleSearch(params)
        results = search.get_dict() # get results as dictionary

        # get organic results
        organic_results = results.get("organic_results", []) # retrieve organic list of results from results

        # add results to the list
        all_results.extend(organic_results)

        # check if there are no more results
        if len(organic_results) < 20:
            print("No more results to fetch.")
            break  # exit the loop

        # increment parameter for next page
        start += 20
        print(f"Fetched {len(all_results)} results so far...")

        if len(all_results) >= max_results:
            print(f"Maximum number of {max_results} results reached.")

    return all_results[:max_results]
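
The stopping logic of the loop above can be checked without spending API requests by swapping the SerpAPI call for a stand-in. This is only a sketch of the same pagination pattern: `paginate` and `make_fake_fetcher` are hypothetical names, and the fake fetcher simply serves numbered items instead of real results.

```python
def paginate(fetch_page, max_results=300, page_size=20):
    """Collect results page by page until max_results or a short page is reached."""
    collected = []
    start = 0
    while len(collected) < max_results:
        page = fetch_page(start, page_size)
        collected.extend(page)
        if len(page) < page_size:  # a short page means there is nothing left to fetch
            break
        start += page_size
    return collected[:max_results]

def make_fake_fetcher(total):
    """Return a stand-in for the API that serves `total` numbered results."""
    def fetch_page(start, num):
        return list(range(start, min(start + num, total)))
    return fetch_page

print(len(paginate(make_fake_fetcher(134))))   # 134
print(len(paginate(make_fake_fetcher(1000))))  # 300
```

The first call stops on a short final page; the second stops at the cap.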

In [59]:
def scholar_to_df(organic_results, all_results_dict):

    """
    A function to transform dictionary output of SerpAPI Google Scholar Query into a
    pandas dataframe.

    Requirements:
        - Pandas

    Usage: 
        - Takes input from query_google_scholar function.
        - Outputs a pandas DataFrame
    """
    # convert organic results to dictionary
    organic_results_dict = {}

    for item in organic_results:
        organic_results_dict[len(organic_results_dict)] = item

    # column template; note the final DataFrame is rebuilt from `rows` below
    organic_results_df = pd.DataFrame(columns=[
        'Publication Year', 'First Author', 'Summary', 'Authors', 'Publication Title',
        'Title', 'Abstract', 'URL', 'Database', 'Exclude', 'Reason ID',
        'Reason'])

    # create empty list to store rows
    rows = []

    # iterate through organic_results_dict
    for item in organic_results_dict.values():
        # extract information
        title = item.get('title', None)
        abstract = item.get('snippet', None)

        # use .get so that a result without 'publication_info' does not raise KeyError
        summary = item.get('publication_info', {}).get('summary', "Summary not found.")
        if summary:
            authors = summary.split(' - ')[0].strip() if ' - ' in summary else "No Author Listed"
            publication_info = summary.split(' - ')[1].strip() if ' - ' in summary else "No Publication Information"
            publication_title = publication_info.split(',')[0].strip() if ',' in publication_info else None
            year = publication_info.split(',')[-1].strip() if ',' in publication_info else None
        else:
            authors = publication_title = year = None

        # add database specific fields
        database = "Google Scholar"
        exclude = None
        reason_id = None
        reason = None

        # append row as dictionary
        rows.append({
            'Publication Year': year,
            'First Author': f"{authors.split(' ')[-1]}, {authors.split(' ')[0][0]}." if authors else None,
            'Authors': authors,
            'Summary': summary,
            'Publication Title': publication_title,
            'Title': title,
            'Abstract': abstract, 
            'URL': item.get('link', None),
            'Database': database,
            'Exclude': exclude,
            'Reason ID': reason_id,
            'Reason': reason
        })

    organic_results_df = pd.DataFrame(rows).sort_values(by="Publication Year", ascending=True).reset_index(drop=True)

    return organic_results_df
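
The summary parsing above depends on the position of commas, so a summary whose middle part is just a year, or whose venue name contains a comma, may be split incorrectly. One possible alternative is to find the year with a regular expression. This sketch assumes summaries follow an "authors - venue, year - source" pattern; the sample string is invented for illustration and `parse_summary` is a hypothetical helper.

```python
import re

def parse_summary(summary):
    """Split an 'authors - venue, year - source' string into its parts."""
    parts = [p.strip() for p in summary.split(' - ')]
    authors = parts[0] if len(parts) > 1 else None
    publication_info = parts[1] if len(parts) > 1 else summary
    # look for a plausible four-digit year anywhere in the publication info
    year_match = re.search(r'\b(19|20)\d{2}\b', publication_info)
    year = year_match.group(0) if year_match else None
    # the venue is whatever precedes the year, minus trailing punctuation
    if year_match:
        venue = publication_info[:year_match.start()].rstrip(' ,')
    else:
        venue = publication_info
    return {"authors": authors, "publication_title": venue or None, "year": year}

# invented example in the assumed format
print(parse_summary("A Author, B Author - Journal of Examples, 2015 - example.org"))
```

Returning the year as a clean four-digit string also keeps the later sort on "Publication Year" consistent.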

In [60]:
# append to all results dictionary
def append_results(query, df, results_dict):

    """
    A function that appends query and the resultant dataframe to a dictionary.
    Allows for tracking of queries and for number of results from these queries. 
    """

    if results_dict is None:
        results_dict = {}

    results_dict[len(results_dict)] = {
        "query": query,
        "results": df}
    return results_dict

## Query

In [61]:
all_results_dict = {}

### Query 1

In [62]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign obj*" OR "foreign bod*")
                        """

In [63]:
organic_results = query_google_scholar(google_scholar_query)


Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
Fetched 140 results so far...
Fetched 160 results so far...
Fetched 180 results so far...
Fetched 200 results so far...
Fetched 220 results so far...
Fetched 240 results so far...
Fetched 260 results so far...
Fetched 280 results so far...
Maximum number of 300 results reached.
Fetched 300 results so far...


In [64]:
google_scholar_results_df = scholar_to_df(organic_results, all_results_dict)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

In [65]:
# export to csv

### Query 2

In [66]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign obj*" OR "foreign bod*")
                        AND
                        ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
                        """

In [67]:
# query google scholar
organic_results = query_google_scholar(google_scholar_query)

Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
Fetched 140 results so far...
Fetched 160 results so far...
Fetched 180 results so far...
Fetched 200 results so far...
Fetched 220 results so far...
Fetched 240 results so far...
Fetched 260 results so far...
Fetched 280 results so far...
Maximum number of 300 results reached.
Fetched 300 results so far...


In [68]:
google_scholar_results_df = scholar_to_df(organic_results, all_results_dict)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

### Query 3

In [69]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign obj*" OR "foreign bod*")
                        AND
                        ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
                        AND
                        ("ingest*" OR "swallow*")
                        """

In [70]:
# query google scholar
organic_results = query_google_scholar(google_scholar_query)

Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
No more results to fetch.


In [71]:
google_scholar_results_df = scholar_to_df(organic_results, all_results_dict)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

### Query 4

In [72]:
# Define google scholar query
google_scholar_query =  """
                        ("foreign obj*" OR "foreign bod*")
                        AND
                        ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
                        AND
                        ("ingest*" OR "swallow*"))
                        AND
                        ("surg*" OR "endoscop*" OR "EGD" OR "OGD" OR "Esophagogastroduodenoscopy" OR "Oesophagogastroduodenoscopy" OR "manag*")"
                        """

In [73]:
# query google scholar
organic_results = query_google_scholar(google_scholar_query)

Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
No more results to fetch.


In [74]:
google_scholar_results_df = scholar_to_df(organic_results, all_results_dict)

# append results to final
all_results_dict = append_results(google_scholar_query, google_scholar_results_df, all_results_dict)

### Query Summary

In [75]:
# show number of results for each query and store in dataframe

# create results dataframe
results_df = pd.DataFrame(columns=["Query", "Num Results"])

# initialise list to store rows in 
rows = []

# iterate through all_results_dict that contains queries and results dataframes from query
for item in all_results_dict.values():
    
    query = item.get('query') # get query
    num_results = len(item.get('results')) # get number of results - max is 300.

    # append row as dictionary
    rows.append({
        "Query": query,
        "Num Results": num_results
        })

# create dataframe from rows
results_df = pd.DataFrame(rows)

results_df

| | Query | Num Results |
|---|---|---|
| 0 | `\n ("foreign obj*" OR "...` | 300 |
| 1 | `\n ("foreign obj*" OR "...` | 300 |
| 2 | `\n ("foreign obj*" OR "...` | 134 |
| 3 | `\n ("foreign obj*" OR "...` | 60 |


In [76]:
def all_results_to_csv(results, output_path):
    """
    Function that exports dataframes from queries to CSV.

    Parameters:
        - Results dictionary
    """
    # check results are in dictionary format
    if not isinstance(results, dict):
        raise TypeError(f"Expected a dictionary, but got {type(results).__name__} instead.")

    # export results to csv to inspect
    for n, value in enumerate(results.values()):
        df = value['results']
        output_file = f"{output_path}/google_scholar_results_{n}.csv"
        df.to_csv(output_file, index=False)

all_results_to_csv(all_results_dict, OUTPUT_PATH)

In [90]:
def results_to_csv(results, output_path):
    """
    Export query and query counts to a CSV file.

    Parameters:
        - results: A pandas DataFrame.
        - output_path: A string giving the output directory; the file name is appended by this function.
    """
    # Check that results is a pandas dataframe
    if not isinstance(results, pd.DataFrame):
        raise TypeError(f"Expected a DataFrame, but got {type(results).__name__} instead.")
    
    # strip out \n from all cells in the dataframe
    results = results.apply(lambda col: col.map(lambda x: x.replace('\n', '') if isinstance(x, str) else x))
    
    output_file = f"{output_path}/google_query_results_counts.csv"

    # Save the DataFrame to CSV
    results.to_csv(output_file, index=False)

In [91]:
# export queries and associated results count to review with supervisor.
results_to_csv(results_df, OUTPUT_PATH)
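
Removing only `\n` leaves the indentation from the triple-quoted query strings in the exported CSV. If tidier query text is wanted for review, one option is to collapse all whitespace before export. This is a minimal sketch; `normalise_query` is a hypothetical helper, not part of the notebook.

```python
def normalise_query(query):
    """Collapse newlines and runs of spaces in a query string into single spaces."""
    return " ".join(query.split())

raw = """
      ("foreign obj*" OR "foreign bod*")
      AND
      ("ingest*" OR "swallow*")
      """
print(normalise_query(raw))
# ("foreign obj*" OR "foreign bod*") AND ("ingest*" OR "swallow*")
```

`str.split()` with no arguments splits on any whitespace, so newlines, tabs and repeated spaces are all handled in one step.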

## Bibliography

1.	Haddaway NR, Collins AM, Coughlin D, Kirk S. The Role of Google Scholar in Evidence Reviews and Its Applicability to Grey Literature Searching. Wray KB, editor. PLoS ONE. 2015 Sep 17;10(9):e0138237. 