# Google Scholar Search

**Author:** Jack Galbraith-Edge

**Date:** 8th January 2025

## Background
Google Scholar is a popular tool for searching grey literature. Databases such as PubMed and Embase are curated and maintained by humans and contain only published literature. Google Scholar, by contrast, searches the internet for what it believes are academic materials and displays them in a list. As a result, Google Scholar is excellent for searching grey literature: its results include unpublished material such as theses, which helps reduce publication bias and susceptibility bias (Haddaway et al., 2015).

Unfortunately, because Google Scholar indexes the internet in this way and is not itself a database, querying can be cumbersome and the number of results returned can be overwhelming.

Following the failure of my experiments with Scholarly, I opted to use SerpAPI to query Google Scholar, then clean and manipulate the results to add to my systematic review database/literature search. SerpAPI has a free tier of up to 100 Google Scholar requests per month, so I used my free trial to give it a go.

This document details my search queries and how I've tailored them to gather an appropriate number of results.

## Important
The result limit in my query function is set to 300. This is because some research suggests that researchers should focus on the first 200-300 results, with useful resources still being found after the 80-result mark (Haddaway et al., 2015).
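
A minimal sketch of how such a cap might be enforced, assuming results arrive in pages of 20 (as the log output further down suggests). The names `fetch_capped`, `fetch_page` and `fake_fetch_page` are hypothetical and are not the `search_google_scholar` helper used in this notebook; the page fetcher is passed in as an argument so the loop can be tried without spending any API requests.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 20      # assumed page size, matching the 20-result steps in the log
MAX_RESULTS = 300   # upper bound suggested by Haddaway et al. (2015)


def fetch_capped(
    query: str,
    fetch_page: Callable[[str, int], List[Dict]],
    max_results: int = MAX_RESULTS,
    page_size: int = PAGE_SIZE,
) -> List[Dict]:
    """Collect results page by page until the cap or an empty page is hit."""
    results: List[Dict] = []
    start = 0
    while len(results) < max_results:
        page = fetch_page(query, start)
        if not page:  # no more results available for this query
            break
        results.extend(page)
        start += page_size
    return results[:max_results]


def fake_fetch_page(query: str, start: int) -> List[Dict]:
    """Hypothetical stand-in for the real API call: pretends 45 results exist."""
    total = 45
    return [{"position": i} for i in range(start, min(start + PAGE_SIZE, total))]


capped = fetch_capped("example query", fake_fetch_page)
print(len(capped))  # 45, because the fake source runs dry before the cap
```

If each page costs one request, a query that reaches the 300-result cap would use 15 requests, which is worth keeping in mind against a 100-request monthly allowance.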

## Setup

In [3]:
from msc_code.scripts.notebook_setup import *
from msc_code.scripts.helpers import *

## Queries

Here I will try multiple queries to see how many results I get.
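
The longer queries below are written as indented, triple-quoted strings, so they carry newlines and leading spaces, and a stray parenthesis or quotation mark is easy to introduce. A small sketch of how a query could be tidied and checked before it is sent; `normalise_query` and `is_balanced` are hypothetical helpers and are not part of `msc_code`.

```python
import re


def normalise_query(query: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", query).strip()


def is_balanced(query: str) -> bool:
    """Return True if parentheses outside quotes pair up and every quote is closed."""
    depth = 0
    in_quotes = False
    for char in query:
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:  # a ")" with no matching "("
                    return False
    return depth == 0 and not in_quotes


example = """
    ("foreign obj*" OR "foreign bod*")
    AND
    ("ingest*" OR "swallow*")
    """
print(normalise_query(example))
print(is_balanced(example))
```

Running a check like this before each search would avoid spending a request on a malformed query.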

In [4]:
# create dictionary to store search query and results in for ease of reference later.
results_dict = {}

# Query One
query_one = '("foreign obj*" OR "foreign bod*")'
# Run search
results_one = search_google_scholar(query_one, SERP_API_KEY)
# Add results to dictionary
results_dict = append_google_scholar_results_to_dictionary(query_one, results_one, results_dict)

# Query Two
query_two = """
            ("foreign obj*" OR "foreign bod*")
            AND
            ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
            """
# Run search
results_two = search_google_scholar(query_two, SERP_API_KEY)
# Add results to dictionary
results_dict = append_google_scholar_results_to_dictionary(query_two, results_two, results_dict)

# Query Three
query_three = """
            ("foreign obj*" OR "foreign bod*")
            AND
            ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
            AND
            ("ingest*" OR "swallow*")
            """
# Run search
results_three = search_google_scholar(query_three, SERP_API_KEY)
# Add results to dictionary
results_dict = append_google_scholar_results_to_dictionary(query_three, results_three, results_dict)

# Query Four
# Define google scholar query
query_four = """
            ("foreign obj*" OR "foreign bod*")
            AND
            ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
            AND
            ("ingest*" OR "swallow*")
            AND
            ("surg*" OR "endoscop*" OR "EGD" OR "OGD" OR "Esophagogastroduodenoscopy" OR "Oesophagogastroduodenoscopy" OR "manag*")
            """
# Run search
results_four = search_google_scholar(query_four, SERP_API_KEY)
# Add results to dictionary
results_dict = append_google_scholar_results_to_dictionary(query_four, results_four, results_dict)

Searching Google Scholar: ("foreign obj*" OR "foreign bod*")
Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
Fetched 140 results so far...
Fetched 160 results so far...
Fetched 180 results so far...
Fetched 200 results so far...
Fetched 220 results so far...
Fetched 240 results so far...
Fetched 260 results so far...
Fetched 280 results so far...
Fetched 300 results so far...
Maximum number of 300 results reached.
Searching Google Scholar: 
            ("foreign obj*" OR "foreign bod*")
            AND
            ("intent*" OR "deliberate*" OR "purpose*" OR "self-injur*" OR "selfharm*" OR "self-harm*")
            
Fetched 20 results so far...
Fetched 40 results so far...
Fetched 60 results so far...
Fetched 80 results so far...
Fetched 100 results so far...
Fetched 120 results so far...
Fetched 140 results so far...
Fetched 160 results so far...
Fetched 180 

### Query Summary

In [18]:
# Show number of results for each query and store in dataframe
results_summary = create_search_query_summary(results_dict)

# export queries and associated results count to review with supervisor.
export_search_result_summary_to_csv(results_summary, os.path.join(RAW_DATA_DIR, "google_scholar"))
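
For readers without access to `msc_code`, a rough sketch of what a per-query summary could look like. It assumes the results dictionary maps each query string to a list of result records, which is a guess about its structure; `summarise_result_counts` and `summary_to_csv_text` are hypothetical names, and the toy data is made up purely for illustration.

```python
import csv
import io
from typing import Dict, List

FIELDNAMES = ["query_number", "query", "result_count"]


def summarise_result_counts(results: Dict[str, List[dict]]) -> List[dict]:
    """Build one summary row per query, in insertion order."""
    rows = []
    for number, (query, hits) in enumerate(results.items(), start=1):
        rows.append(
            {
                "query_number": number,
                "query": " ".join(query.split()),  # flatten multi-line queries
                "result_count": len(hits),
            }
        )
    return rows


def summary_to_csv_text(rows: List[dict]) -> str:
    """Render the summary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data for illustration only; not real search results.
toy_results = {
    '("a" OR "b")': [{"Title": "x"}, {"Title": "y"}],
    '("a" OR "b")\n    AND\n    ("c")': [{"Title": "z"}],
}
toy_summary = summarise_result_counts(toy_results)
print(summary_to_csv_text(toy_summary))
```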

In [19]:
# Export each search result dataframe to CSV
export_search_results_to_csvs(results_dict, os.path.join(RAW_DATA_DIR, "google_scholar"))

## Clean

In [30]:
# import google_df from previously exported csv
google_df = pd.read_csv(os.path.join(RAW_DATA_DIR, "google_scholar", "google_scholar_results_2.csv"))

google_df['First Author'] = google_df['First Author'].str.title() # clean first author names
google_df['Publication Title'] = google_df['Publication Title'].str.title() 
google_df['Title'] = google_df['Title'].str.title()
google_df['URL'] = google_df['URL'].str.lower()

### Duplicate Removal

In [8]:
start_count = len(google_df)
google_df = google_df.drop_duplicates(subset=google_df.columns, keep='first')
end_count = len(google_df)

duplicates_removed_count = start_count - end_count  # rows dropped, so the count is never negative

print(f"{duplicates_removed_count} duplicates were removed at this stage.")


0 duplicates were removed at this stage.
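
The step above only removes rows that match on every column, so two records for the same paper that differ in punctuation or in a URL would both survive. A hedged sketch of a looser check, matching on a normalised title; `title_key` and `drop_near_duplicates` are hypothetical helpers written in plain Python, and the sample titles are made up for illustration.

```python
import re
from typing import Dict, Iterable, List


def title_key(title: str) -> str:
    """Lower-case a title, strip punctuation and collapse whitespace.

    Note: characters outside a-z and 0-9 (including accented letters) are dropped.
    """
    no_punct = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return " ".join(no_punct.split())


def drop_near_duplicates(records: Iterable[Dict], field: str = "Title") -> List[Dict]:
    """Keep the first record for each normalised title; keep all untitled records."""
    seen = set()
    unique = []
    for record in records:
        key = title_key(record.get(field) or "")
        if key and key in seen:
            continue  # same title already kept
        seen.add(key)
        unique.append(record)
    return unique


# Made-up records for illustration only.
sample = [
    {"Title": "Foreign Body Ingestion: A Review"},
    {"Title": "foreign body ingestion - a review"},
    {"Title": "Another Paper"},
]
deduplicated = drop_near_duplicates(sample)
print(len(sample) - len(deduplicated), "near-duplicates removed")
```

With pandas, the same key could be applied as `google_df['Title'].map(title_key)` and passed to `duplicated()`, although any rows flagged this way are probably best checked by hand before being dropped.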


### Export for Title and Abstract Review

In [9]:
# export for title and abstract review
google_df.to_csv(os.path.join(PROC_DATA_DIR, "google_scholar", "google_scholar_title_abstract_screen_start_1.csv"))

At this stage, after discussing with my supervisor and a librarian, it was decided that the search strategy returning 135 results would suffice.
The results were exported to CSV and each URL was visited to obtain the title and abstract.
Abstracts, where available, were transferred to the spreadsheet using Microsoft Excel, and the exclusion criteria were applied.

## Bibliography

1. Haddaway NR, Collins AM, Coughlin D, Kirk S. The Role of Google Scholar in Evidence Reviews and Its Applicability to Grey Literature Searching. Wray KB, editor. PLoS ONE. 2015 Sep 17;10(9):e0138237.