# Query Google Scholar with Scholarly

**Author:** Jack Galbraith-Edge
**Date:** 8th January 2025

## Introduction
Google Scholar is a popular tool for searching grey literature. Databases such as PubMed and Embase are curated, maintained by humans and contain only published literature. Google Scholar, on the other hand, searches the internet for what it believes are academic materials and then displays them in a list. As a result, Google Scholar is excellent for searching grey literature, as results will include unpublished work such as theses and other useful material.

Unfortunately, because Google Scholar indexes the internet in this way and is not itself a curated database, querying can be cumbersome and the number of results returned can be overwhelming.

In this document, I intend to use the Python library `scholarly` to query and scrape information from Google Scholar. I will then manipulate this data into a usable format for my systematic review.
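
As a rough idea of what "a usable format" could look like, here is a minimal sketch that flattens one publication record into a flat row and writes rows out as CSV. It assumes records shaped like the dictionaries `scholarly` yields (a nested `bib` dictionary plus top-level keys such as `pub_url` and `num_citations`). The function names, the chosen columns and the sample record are my own illustrative assumptions, not real search results.

```python
import csv
import io

# Columns chosen for illustration; adjust to whatever the review needs
FIELDS = ["title", "authors", "year", "venue", "url", "citations"]


def flatten_publication(pub):
    """Flatten one scholarly-style publication dict into a flat row."""
    bib = pub.get("bib", {})
    authors = bib.get("author", "")
    # 'author' may be a list of names or a single string, so normalise it
    if isinstance(authors, (list, tuple)):
        authors = "; ".join(authors)
    return {
        "title": bib.get("title", ""),
        "authors": authors,
        "year": bib.get("pub_year", ""),
        "venue": bib.get("venue", ""),
        "url": pub.get("pub_url", ""),
        "citations": pub.get("num_citations", ""),
    }


def publications_to_csv(pubs):
    """Return CSV text with one row per publication."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(flatten_publication(p) for p in pubs)
    return buffer.getvalue()


# Hypothetical record, invented purely to show the expected shape
sample = {
    "bib": {
        "title": "Example title",
        "author": ["A Author", "B Author"],
        "pub_year": "2020",
        "venue": "Example Journal",
    },
    "pub_url": "https://example.org/paper",
    "num_citations": 3,
}

csv_text = publications_to_csv([sample])
```

Missing keys fall back to empty strings, so partially filled records still produce a complete row.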

## Setup

In [1]:
# import libraries
from scholarly import scholarly, ProxyGenerator
import requests

Because scraping information from Google Scholar is technically against Google's terms of service, a proxy is used. Otherwise, Google will simply block IP addresses that it believes to be bots, or that are issuing queries at superhuman speed.
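
Alongside a proxy, spacing queries out is one way to avoid querying at superhuman speed. The sketch below pauses for a random interval between requests. The helper name `polite_delay` and the 5 to 15 second window are my own assumptions, not part of `scholarly`, and I have not verified that any particular window avoids blocks.

```python
import random
import time


def polite_delay(min_seconds=5.0, max_seconds=15.0, sleep=time.sleep):
    """Pause for a random interval and return the number of seconds waited.

    The `sleep` parameter exists so the pause can be swapped out in tests.
    """
    wait = random.uniform(min_seconds, max_seconds)
    sleep(wait)
    return wait
```

Calling `polite_delay()` between successive queries keeps the request pattern irregular rather than machine-regular.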

In [None]:
# initialise proxy generator object
pg = ProxyGenerator()

# set up a proxy with free proxy
success = pg.FreeProxies()

# if proxy successfully deployed
if success:
    scholarly.use_proxy(pg)
    print("Free proxy set up. Test beginning shortly.")
    pg_dict = pg.__dict__  # internal attributes of the ProxyGenerator object

    proxies = pg_dict.get("_proxies")

    # try to run a test query
    try:
        test_result = scholarly.search_pubs("foreign body ingestion") # run a test query
        first_result = next(test_result) # get first result
        scholarly.pprint(first_result) # print first result
    except Exception as e:
        print(f"Error during query execution: {e}")
else:
    print("Failed to set up proxy.")

Free proxy set up. Test beginning shortly.
Error during query execution: Cannot Fetch from Google Scholar.


In [14]:
def initialise_free_proxy():
    """Set up a free proxy, check it with a plain request, then run a test query."""
    # initialise proxy generator object
    pg = ProxyGenerator()

    try:
        # set up a proxy with free proxy
        success = pg.FreeProxies()

        # check if proxy set up correctly
        if success:
            # get internal dictionary of ProxyGenerator object
            pg_dict = (pg.__dict__) 
            proxies = pg_dict.get("_proxies")

            # store proxy details in a dictionary
            proxy_dict = {
                'http': proxies['http://'],
                'https': proxies['https://']
            }


            # print proxies to stdout
            print(f"HTTP Proxy: {proxy_dict['http']}, HTTPS Proxy: {proxy_dict['https']}")


            # check proxies are working by making a request to google scholar 
            
            # make request (verify=False disables TLS certificate verification)
            response = requests.get("https://scholar.google.com", proxies=proxy_dict, verify=False, timeout=10)

            # check response status
            if response.status_code == 200:
                print("Proxy is working! Response received successfully.")
                print("Running test query...")

                # after successful request, run test query
                try:
                    scholarly.use_proxy(pg)
                    search_results = scholarly.search_pubs("machine learning")
                    first_result = next(search_results)
                    scholarly.pprint(first_result)  # print first result
                    return True, pg # proxy successfully initialised
                # if error making query, print exception
                except Exception as e:
                    print(f"Error during test query execution: {e}")
                    return False, None

            # if the request fails, print response status code
            else:
                print(f"Proxy test failed with status code: {response.status_code}")
                return False, None

        else:
            print("Failed to set up free proxy.")
            return False, None

    except Exception as e:
        print(f"Error during proxy initialisation: {e}")
        return False, None

# every path now returns a (success, pg) pair, so the result can be unpacked
success, pg = initialise_free_proxy()
success

HTTP Proxy: http://168.234.75.168:80, HTTPS Proxy: http://168.234.75.168:80
Proxy is working! Response received successfully.
Running test query...
Error during test query execution: Cannot Fetch from Google Scholar.


False

Unfortunately, the error 'Cannot Fetch from Google Scholar' kept appearing for me.
I tried multiple free proxies, but they kept failing, even with short queries like 'foreign body ingestion'.
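
Trying several free proxies by hand can be automated with a small retry loop. The sketch below is generic: `retry` and its parameters are hypothetical names of my own, and `attempt` stands for any function that returns something truthy on success. In my case, retrying did not fix the underlying problem.

```python
import time


def retry(attempt, max_tries=3, wait_seconds=2.0, sleep=time.sleep):
    """Call attempt() until it returns a truthy value or tries run out.

    Returns the first truthy result, or None if every try fails.
    """
    for try_number in range(1, max_tries + 1):
        try:
            result = attempt()
        except Exception as e:
            # treat an exception the same as a failed attempt
            print(f"Try {try_number} raised: {e}")
            result = None
        if result:
            return result
        # wait before the next try, but not after the last one
        if try_number < max_tries:
            sleep(wait_seconds)
    return None
```

Here `attempt` could be a small wrapper around the proxy set-up that returns the `ProxyGenerator` object on success and `None` on failure.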

Paid proxies cost upwards of $50 per month, so I sought other avenues to achieve this.

I opted to change tactics and use a different library, provided by SerpAPI.