# Scrape Articles from Web of Science

- provide a set of keywords
- save all corresponding papers as BibTeX files

- the BibTeX files, bundled as a zip archive, can be used as input for biblioshiny


This project uses Selenium because, unlike more basic scraping libraries, it supports the interactions needed here (clicking buttons, choosing entries from drop-down menus). It also drives a real browser, so to some extent it looks like a regular user, which gives some basic protection against anti-scraping mechanisms.

The approach is based on the following tutorial:
- https://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/

Selenium needs an up-to-date chromedriver, which must be in the same folder as this script.
It can be downloaded here: https://chromedriver.chromium.org/downloads

Changelog v5 to v6:
- also specifically exclude Withdrawn papers and reprints while searching

In [20]:
## set up set of keywords

keywords = ['portugal','lisbon','czech republic','prague','italy','rome','vatican city','hungary','budapest','romania','bucharest']



#### How this scraper works:
WoS only shows the first 100,000 records of a search, no matter how many results it actually finds.
To make sure we always get fewer than 100,000 records per search, we search for each keyword on its own.
Additionally, we search twice for each keyword: once for articles published up to and including 2018, and once for articles published from 2019 onwards.
That way each search returns fewer than 100,000 records and the actual scraping can run without any issue.

The split is not necessary for the example research presented in this repo, but it is left in the script for illustration.

1. For each keyword we generate the appropriate advanced search string:
    - innovation in all search categories (not part of the example queries in this notebook)
    - the current keyword in "TS" (topic: title, abstract, author keywords, Keywords Plus)
    - the time frame in which the articles were published
    - document type = Article, excluding book chapters, books, proceedings papers, reprints, and withdrawn publications




In [21]:

# generate query code for advanced search:
# depending on application this should be adapted


def generate_query_time1(keyword):
    query = f"(((TS=({keyword}*)) AND DT=(Article)) NOT DT=(Book Chapter OR Proceedings Paper OR Book OR Reprint OR Withdrawn Publication)) AND PY=(1900-2018)"
    return query

def generate_query_time2(keyword):
    query = f"(((TS=({keyword}*)) AND DT=(Article)) NOT DT=(Book Chapter OR Proceedings Paper OR Book OR Reprint OR Withdrawn Publication)) AND PY=(2019-2021)"
    return query


queries = [generate_query_time1(kw) for kw in keywords]+[generate_query_time2(kw) for kw in keywords]
print(len(queries))

22
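The two generator functions differ only in the publication-year range, so they could be collapsed into one parameterised helper. The sketch below is one way to do that. `generate_query`, `year_from`, `year_to`, and `EXCLUDED_DOC_TYPES` are illustrative names that do not appear in the original script; the query text itself is copied unchanged from the functions above.

```python
# Document types excluded on top of requiring DT=(Article);
# copied from the query strings above.
EXCLUDED_DOC_TYPES = "Book Chapter OR Proceedings Paper OR Book OR Reprint OR Withdrawn Publication"


def generate_query(keyword, year_from, year_to):
    """Build the advanced-search string for one keyword and one year range.

    Hypothetical helper: same query text as generate_query_time1/2,
    with the year range passed in instead of hard-coded.
    """
    return (
        f"(((TS=({keyword}*)) AND DT=(Article)) "
        f"NOT DT=({EXCLUDED_DOC_TYPES})) AND PY=({year_from}-{year_to})"
    )


# Same two batches as in the original cell: up to 2018, then 2019-2021.
year_ranges = [(1900, 2018), (2019, 2021)]
example_keywords = ["portugal", "lisbon"]  # shortened list, for illustration only

# Outer loop over the year ranges keeps the original order:
# all keywords for batch 1 first, then all keywords for batch 2.
example_queries = [
    generate_query(kw, start, end)
    for start, end in year_ranges
    for kw in example_keywords
]
print(len(example_queries))
```

Adding a third time window would then only mean appending another tuple to `year_ranges`.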


In [22]:
import math
from tqdm import tqdm
import os
import os.path
import time


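# define actual scraping function
# note: it relies on the globals `browser` and `By`, which are created in the browser set-up cell further down

# number of iterations (export files): ceil(n_papers/1000)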

def scrape_search(n_papers,keyword,batch):
    '''
    input: n_papers - number of papers that the current search found
        keyword - keyword on which the current search is based
        batch - 1 or 2, depending on whether the search covers the years before 2019 or from 2019 onwards
        
    output: none, saves the BibTeX files directly under an appropriate name (indicating keyword and batch)
    
    Downloads all records of a search. Each BibTeX file can contain a maximum of 1000 records,
    so a search with more records is detected automatically and exported in ceil(n_papers/1000) files.
    '''

    iterations = math.ceil(n_papers/1000)
    start = 1
    end = 1000

    # for each iteration:
    for i in tqdm(range(iterations)):
                
        time.sleep(5)
        
        while True:
            try:
                # 1. click Export button to open drop down
                browser.find_element(By.XPATH, '//*[@id="snRecListTop"]/app-export-menu/div/button').click()
                break
            except Exception:
                # export button not available yet (page still loading): wait and retry
                time.sleep(2)
        
        time.sleep(5)
        
        # 2. click on BibTex
        browser.find_element(By.XPATH, '//*[@id="exportToBibtexButton"]').click()
        

        time.sleep(1)
        # 3. open "Record Content" drop down menu and choose "Full Record"
        browser.find_element(By.XPATH, '/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()
        browser.find_element(By.XPATH, '//*[@id="global-select"]/div/div[2]/div[3]').click()

        time.sleep(1)
        # 4. choose "Records from" and enter correct number depending on iteration
        browser.find_element(By.XPATH, '//*[@id="radio3"]/label/span[1]').click()
        time.sleep(1)
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/fieldset/mat-radio-group/div[3]/mat-form-field[1]/div/div[1]/div[3]/input').clear()
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/fieldset/mat-radio-group/div[3]/mat-form-field[1]/div/div[1]/div[3]/input').send_keys(start)
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/fieldset/mat-radio-group/div[3]/mat-form-field[2]/div/div[1]/div[3]/input').clear()
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/fieldset/mat-radio-group/div[3]/mat-form-field[2]/div/div[1]/div[3]/input').send_keys(end)

        # update start and end
        start += 1000
        end += 1000

        # 5. click export and save resulting bibtex file
        browser.find_element(By.XPATH, '/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()
        
        ## wait until the download has finished before renaming;
        ## every export is saved as savedrecs.bib, so it has to be renamed before the next one arrives
        file_path = r'C:\Users\Lion\Documents\UNI4_RWTH\tim_hiwi_job\bibliometric_analysis\github_public\scrape_data\savedrecs.bib'
        
        while not os.path.exists(file_path):
            time.sleep(1)

        if os.path.isfile(file_path):
            # rename inside the download directory, using the same path that was checked above
            new_filename = os.path.join(os.path.dirname(file_path), f"{keyword}_{i}{batch}.bib")
            os.rename(file_path,new_filename)
        else:
            raise ValueError("%s isn't a file!" % file_path)
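The `while not os.path.exists(file_path)` loop above waits forever if an export never arrives, for example after an internal server error on the WoS side. A minimal sketch of a bounded wait follows. `wait_for_download`, `timeout`, and `poll_interval` are hypothetical names, not part of the original script. As an extra precaution the sketch also assumes Chrome's convention of keeping an unfinished download under a temporary name ending in `.crdownload` in the same folder.

```python
import os
import time


def wait_for_download(file_path, timeout=120.0, poll_interval=1.0):
    """Block until file_path exists and no partial Chrome download remains.

    Hypothetical helper. Returns file_path on success and raises
    TimeoutError if the file has not appeared after `timeout` seconds.
    """
    folder = os.path.dirname(file_path) or "."
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumption: Chrome names unfinished downloads "*.crdownload".
        partial = [name for name in os.listdir(folder) if name.endswith(".crdownload")]
        if os.path.isfile(file_path) and not partial:
            return file_path
        time.sleep(poll_interval)
    raise TimeoutError(f"{file_path} did not appear within {timeout} seconds")
```

Inside `scrape_search`, a call such as `wait_for_download(file_path)` could take the place of the `while` loop. Catching `TimeoutError` in the outer loop would then allow a stuck export to be skipped or retried instead of hanging the whole run.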
        

In [23]:
import re

# define function to extract the keyword from the generated query
# to pass that keyword to our scrape function so we can name the bibtex files correctly

def extract_kw(q):
    m = re.search("[a-z]+([_&]?[a-z]+)*",q)
    return m.group(0)
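Because the pattern stops at the first character that is not a lowercase letter, `_` or `&`, a multi-word keyword is cut at the first space: `czech republic` is reported as `czech` and `vatican city` as `vatican`, which matches the run log further down. If the full keyword is wanted in the file names, one option is to capture everything between `TS=(` and the trailing `*)`. The sketch below assumes the query layout produced by `generate_query_time1`; `extract_full_kw` is a hypothetical name, and replacing spaces with underscores is an illustrative choice.

```python
import re


def extract_kw(q):
    # Original pattern: first run of lowercase letters, optionally joined by "_" or "&".
    m = re.search("[a-z]+([_&]?[a-z]+)*", q)
    return m.group(0)


def extract_full_kw(q):
    """Hypothetical alternative: return the whole keyword inside TS=(...*)."""
    m = re.search(r"TS=\((.+?)\*\)", q)
    if m is None:
        raise ValueError(f"no TS=(...*) clause found in query: {q!r}")
    # Spaces become underscores so the result is convenient in file names.
    return m.group(1).replace(" ", "_")


# Shortened query with the same layout as the generated ones, for illustration only.
query = "(((TS=(czech republic*)) AND DT=(Article)) NOT DT=(Book)) AND PY=(1900-2018)"
print(extract_kw(query))       # czech
print(extract_full_kw(query))  # czech_republic
```

Note that entries in the `done` list of the main loop must use whatever form the chosen extraction function returns.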

In [24]:
## starts a new Chrome window; don't close it!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.common.by import By

# download chromedriver here:
# https://sites.google.com/chromium.org/driver/
# unpack the zip and add chromedriver to the script directory; change the path below if needed

# set download directory
prefs = {'download.default_directory' : r'C:\Users\Lion\Documents\UNI4_RWTH\tim_hiwi_job\bibliometric_analysis\github_public\scrape_data'}
options = Options()
options.add_experimental_option("prefs", prefs)

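# start an instance of the browser with Selenium
# (Selenium 4 takes the driver path via a Service object; recent releases no longer accept executable_path)
from selenium.webdriver.chrome.service import Service

path_to_chromedriver = 'chromedriver' # change path as needed
browser = webdriver.Chrome(service = Service(path_to_chromedriver),
                          options = options)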


In [28]:
# uses the open chrome window to do search for each query and call scrape function


url = "https://www.webofscience.com/wos/woscc/advanced-search"

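# if scraping is interrupted (often on the WoS side with an "internal server error"),
# list the keywords that are already finished here so they are skipped on restart:
# enter a keyword once if only its first batch is done, twice if both batches are done
done = []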

# loop that searches each generated query and then calls our earlier defined scraping function

for count,query in enumerate(queries): # enumerate() keeps track of the iteration number (count), which defines the batch
    
    # generate batch info for file naming:
    # the first len(keywords) queries cover the years up to 2018 (batch 1), the rest cover 2019-2021 (batch 2)
    if count < len(keywords):
        batch=1
    else:
        batch=2
    
    # extract the current keyword from the query
    kw = extract_kw(query)
    
    print(kw)
    if kw in done:
        done.remove(kw)
    else:
        # open page and conduct search
        browser.get(url)

        # wait for the page elements to load before interacting with them;
        # the sleep statements may need to be longer or can be shorter depending on the internet connection
        time.sleep(2)

        # reject cookies if needed (only the first time)
        try:
            browser.find_element(By.XPATH,'//*[@id="onetrust-reject-all-handler"]').click()
        except Exception:
            pass  # no cookie banner shown

        # clear the search form, then enter the generated query
        browser.find_element(By.ID,'advancedSearchInputArea').clear()
        browser.find_element(By.ID,'advancedSearchInputArea').send_keys(query)

        time.sleep(1)

        # click search button
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div/div[2]/app-input-route/app-search-home/div[2]/div/app-input-route/app-search-advanced/app-advanced-search-form/form/div[3]/div[1]/div[1]/div/button[2]').click()

        time.sleep(4)

        # get and print number of papers

        n_paper = int(browser.find_element(By.CLASS_NAME, "brand-blue").text.replace(",",""))
        print(f"{count}: {kw}: {n_paper} papers found")

        # close the helper window on the web page if needed
        try:
            browser.find_element(By.XPATH, '//*[@id="pendo-button-e5808a4c"]').click()
        except Exception:
            pass  # no helper window shown

        # call scrape function
        scrape_search(n_paper,kw,batch)
    

portugal
0: portugal: 25226 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 26/26 [08:14<00:00, 19.02s/it]


lisbon
1: lisbon: 3824 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:13<00:00, 18.38s/it]


czech
2: czech: 16960 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 17/17 [05:17<00:00, 18.65s/it]


prague
3: prague: 4415 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:27<00:00, 17.42s/it]


italy
4: italy: 96319 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 97/97 [31:15<00:00, 19.33s/it]


rome
5: rome: 17057 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [05:31<00:00, 18.43s/it]


vatican
6: vatican: 85 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.60s/it]


hungary
7: hungary: 18059 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [05:52<00:00, 18.54s/it]


budapest
8: budapest: 2491 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:52<00:00, 17.37s/it]


romania
9: romania: 19478 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [06:13<00:00, 18.67s/it]


bucharest
10: bucharest: 1590 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.13s/it]


portugal
11: portugal: 9377 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [03:15<00:00, 19.52s/it]


lisbon
12: lisbon: 1444 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:37<00:00, 18.64s/it]


czech
13: czech: 5299 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [01:52<00:00, 18.82s/it]


prague
14: prague: 1075 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.58s/it]


italy
15: italy: 34376 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 35/35 [11:54<00:00, 20.41s/it]


rome
16: rome: 5295 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [01:48<00:00, 18.11s/it]


vatican
17: vatican: 48 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.54s/it]


hungary
18: hungary: 5256 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [01:53<00:00, 18.91s/it]


budapest
19: budapest: 663 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.58s/it]


romania
20: romania: 8341 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [02:53<00:00, 19.26s/it]


bucharest
21: bucharest: 747 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.53s/it]
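The introduction notes that the BibTeX files can be passed to biblioshiny as a zip archive. A minimal sketch of that last step with the standard-library `zipfile` module follows. `zip_bib_files` and the archive name `savedrecs.zip` are hypothetical, and the folder name `scrape_data` is taken from the download path used above.

```python
import os
import zipfile


def zip_bib_files(folder, archive_path):
    """Pack every .bib file in `folder` into one zip archive.

    Hypothetical helper; returns the sorted list of file names that were added.
    """
    bib_files = sorted(name for name in os.listdir(folder) if name.endswith(".bib"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in bib_files:
            # arcname keeps the archive flat (no folder prefix inside the zip).
            archive.write(os.path.join(folder, name), arcname=name)
    return bib_files


# Only runs when the download folder from this notebook is present.
if os.path.isdir("scrape_data"):
    added = zip_bib_files("scrape_data", "savedrecs.zip")
    print(f"{len(added)} BibTeX files written to savedrecs.zip")
```

Whether one archive per keyword or a single archive for the whole run is more convenient depends on how the data is analysed afterwards; the helper works for either, since it simply packs whatever `.bib` files the given folder contains.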
