# Scrape Articles from Web of Science

- give a set of keywords
- save all corresponding papers as BibTeX files

- the BibTeX files, zipped, can be used as input for biblioshiny


For this project we use Selenium because, unlike the more basic scraping libraries, it supports the interactions needed here (clicking buttons, choosing from drop-down menus). It also drives a real browser and therefore behaves to some extent like a human user, which gives some basic protection against anti-scraping mechanisms.

Based on the following tutorial:
- https://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/


Changelog v5 to v6:
- also specifically exclude Withdrawn papers and reprints while searching

In [1]:
## set up set of keywords

keywords = ['alliance', 'joint_venture', 'network', 'university_industry',
       'ecosystem', 'interorg', 'consorti', 'r&d', 'new_product_development',
       'npd', 'partnership', 'interfirm', 'across_boundaries', 'platform',
       'tournament', 'contest', 'intermediar', 'broadcast_search',
       'collaboration', 'collective_intelligence', 'community', 'co_creation',
       'cumulative_innovation', 'crowdsourcing', 'customer_involvement',
       'customer_integration', 'distributed_innovation', 'lead_user',
       'open_innovation', 'open_source', 'peer_production', 'user_innovation',
       'consumer_innovation', 'wisdom_of_the_crowd', 'co_production',
       'free_innovation', 'household_innovation']


# certain keywords had to be adjusted to work with the WoS search function:
# rnd > r&d
# interorg > interorg*
# consorti > consorti*
# etc.

## for simplicity, every search term gets a * appended
## this tells WoS to match any term that starts with the given letters
## for the final dataset, one should decide which keywords get the * and which do not
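
One way to make the per-keyword wildcard decision explicit is a small lookup. The sketch below is only an illustration: the set `WILDCARD_KEYWORDS` and the helper `to_search_term` are hypothetical names, and which keywords belong in the set is an assumption that still has to be decided for the final dataset.

```python
# Hypothetical selection: truncated stems that only make sense with a wildcard.
WILDCARD_KEYWORDS = {'interorg', 'consorti', 'intermediar'}

def to_search_term(keyword, wildcard_keywords=WILDCARD_KEYWORDS):
    """Append the WoS wildcard * only to keywords selected for it."""
    return f"{keyword}*" if keyword in wildcard_keywords else keyword

print([to_search_term(kw) for kw in ['interorg', 'alliance', 'r&d']])
```

Keeping the selection in one set means the query functions would not need to change when a keyword is moved in or out of the wildcard group.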


#### How this scraper works:
Since the last scrape, the amount of literature has exploded. WoS only shows the first 100,000 records of a search.
To make sure every search returns fewer than 100,000 records, we search for each keyword on its own.
Additionally, we search twice per keyword: once for articles published before 2019 (2019 excluded) and once for articles published from 2019 onwards (2019 included).
That way, each search finds fewer than 100,000 records and the actual scraping can run without any issue.
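
The same idea could be applied again if a single search ever exceeded the limit. The following is a minimal sketch and not part of the scraper: `WOS_RECORD_CAP`, `exceeds_cap` and `split_year_range` are hypothetical names, and halving the year range is just one possible strategy.

```python
WOS_RECORD_CAP = 100_000  # number of records WoS shows per search, as described above

def exceeds_cap(n_records, cap=WOS_RECORD_CAP):
    """True if a search found more records than WoS will show."""
    return n_records > cap

def split_year_range(first_year, last_year):
    """Split an inclusive year range into two inclusive halves."""
    if first_year >= last_year:
        raise ValueError("range covers a single year and cannot be split")
    middle = (first_year + last_year) // 2
    return (first_year, middle), (middle + 1, last_year)

print(exceeds_cap(55847))
print(split_year_range(1900, 2021))
```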

1. for each keyword we generate the appropriate advanced search string
    - "innovation" in all search fields ("ALL")
    - the current keyword in "TS" (title, abstract, author keywords, Keywords Plus)
    - the timeframe in which the articles were published ("PY")
    - document type ("DT") = Article; book chapters, books, proceedings papers, reprints and withdrawn publications are excluded
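
The parts listed above can be sketched as one parameterised builder. This is a minimal sketch: `build_query` and its parameters `first_year` and `last_year` are names introduced here for illustration, while the query text itself follows the field tags and document types listed above.

```python
def build_query(keyword, first_year, last_year):
    """Build a WoS advanced-search string for one keyword and one period."""
    topic = f"TS=({keyword}*)"                # keyword (with wildcard) in the topic fields
    context = "ALL=(innovation)"              # innovation anywhere in the record
    include = "DT=(Article)"                  # articles only
    exclude = ("DT=(Book Chapter OR Proceedings Paper OR Book "
               "OR Reprint OR Withdrawn Publication)")
    years = f"PY=({first_year}-{last_year})"  # publication years, inclusive
    return f"(((({topic}) AND {context}) AND {include}) NOT {exclude}) AND {years}"

print(build_query("open_innovation", 2019, 2021))
```

With a single builder, adding a third time window only means adding another pair of years rather than another function.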




In [2]:

# generate the query string for the advanced search:
# append * so that e.g. interorg also matches interorganisational

def generate_query_time1(keyword):
    query = f"((((TS=({keyword}*)) AND ALL=(innovation)) AND DT=(Article)) NOT DT=(Book Chapter OR Proceedings Paper OR Book OR Reprint OR Withdrawn Publication)) AND PY=(1900-2018)"
    return query

def generate_query_time2(keyword):
    query = f"((((TS=({keyword}*)) AND ALL=(innovation)) AND DT=(Article)) NOT DT=(Book Chapter OR Proceedings Paper OR Book OR Reprint OR Withdrawn Publication)) AND PY=(2019-2021)"
    return query


queries = [generate_query_time1(kw) for kw in keywords]+[generate_query_time2(kw) for kw in keywords]


In [3]:
import math
from tqdm import tqdm
import os
import os.path
import time


# define actual scraping function

# number of iterations: ceil(n_papers/1000)

def scrape_search(n_papers,keyword,batch):
    '''
    input: n_papers - number of papers that the current search found
        keyword - keyword on which the current search is based
        batch - 1 or 2, depending on whether the query covers the years before 2019 or from 2019 onwards
        
    output: none, saves BibTeX files directly under an appropriate name (indicating keyword and batch)
    
    The function downloads all records of a search. Each BibTeX file can contain a maximum of 1000 records;
    if a search contains more records, the function detects that automatically and downloads all of them,
    saving them in ceil(n_papers/1000) files.
    Relies on the globals `browser` and `By`, which are defined in the browser set-up cell.
    '''

    iterations = math.ceil(n_papers/1000)
    start = 1
    end = 1000

    # for each iteration:
    for i in tqdm(range(iterations)):
                
        time.sleep(5)
        
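        # retry until the Export button is available (the result page may still be loading)
        while True:
            try:
                # 1. click Export button to open drop down
                browser.find_element(By.XPATH, '//*[@id="snRecListTop"]/app-export-menu/div/button').click()
                break
            except Exception:
                # catch Exception rather than everything, so Ctrl+C can still stop the loop
                time.sleep(2)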
        
        time.sleep(5)
        
        # 2. click on BibTex
        browser.find_element(By.XPATH, '//*[@id="exportToBibtexButton"]').click()
        

        time.sleep(1)
        # 3. open "Record Content" drop down menu and choose "Full Record"
        browser.find_element(By.XPATH, '/html/body/app-wos/div/div/main/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()
        browser.find_element(By.XPATH, '//*[@id="global-select"]/div/div[2]/div[3]/span').click()

        time.sleep(1)
        # 4. choose "Records from" and enter correct number depending on iteration
        browser.find_element(By.XPATH, '//*[@id="radio3"]/label/span[1]').click()
        time.sleep(1)
        browser.find_element(By.CSS_SELECTOR,'[aria-label="Input starting record range"]').clear()
        browser.find_element(By.CSS_SELECTOR,'[aria-label="Input starting record range"]').send_keys(start)
        browser.find_element(By.CSS_SELECTOR,'[aria-label="Input ending record range"]').clear()
        browser.find_element(By.CSS_SELECTOR,'[aria-label="Input ending record range"]').send_keys(end)

        # update start and end
        start += 1000
        end += 1000

        # 5. click export and save resulting bibtex file
        browser.find_element(By.XPATH, '/html/body/app-wos/div/div/main/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()
        
        ## wait until the download has finished before renaming,
        ## otherwise the rename fails because the file does not exist yet
        file_path = r'C:\Users\Lion\Documents\UNI4_RWTH\tim_hiwi_job\bibliometric_analysis\scrape_data\savedrecs.bib'
        
        while not os.path.exists(file_path):
            time.sleep(1)

        if os.path.isfile(file_path):
            # rename inside the download directory, so the file checked above is the file renamed
            new_filename = os.path.join(os.path.dirname(file_path), f"{keyword}_{i}{batch}.bib")
            os.rename(file_path,new_filename)
        else:
            raise ValueError("%s isn't a file!" % file_path)
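
The record ranges and the download wait can also be isolated into small helpers that are testable without a browser. This is a minimal sketch with the hypothetical names `export_ranges` and `wait_for_file`; unlike the loop above, it clips the last range to the number of records found and gives up after a timeout instead of waiting indefinitely.

```python
import os
import time

def export_ranges(n_papers, batch_size=1000):
    """Yield inclusive (start, end) record ranges of at most batch_size records."""
    for start in range(1, n_papers + 1, batch_size):
        yield start, min(start + batch_size - 1, n_papers)

def wait_for_file(path, timeout=60.0, poll=1.0):
    """Return True once path exists, or False if it does not appear within timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll)
    return os.path.isfile(path)

print(list(export_ranges(2003)))
```

The timeout is an assumption about what is acceptable for one export; a failed WoS export would then surface as a `False` return value rather than a hanging cell.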
        

In [7]:
import re

# define function to extract the keyword from the generated query
# to pass that keyword to our scrape function so we can name the bibtex files correctly

def extract_kw(q):
    m = re.search("[a-z]+([_&]?[a-z]+)*",q)
    return m.group(0)
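
A quick check of the pattern on sample queries shows what it picks up. The block repeats the function so that it runs on its own; the sample queries are shortened and only illustrative. The match works because the field tags (TS, ALL, DT) are upper case, so the first run of lower-case letters in a query is the keyword.

```python
import re

def extract_kw(q):
    # first run of lower-case letters, optionally joined by _ or &
    m = re.search("[a-z]+([_&]?[a-z]+)*", q)
    return m.group(0)

samples = [
    "((((TS=(r&d*)) AND ALL=(innovation)) AND DT=(Article))",
    "((((TS=(new_product_development*)) AND ALL=(innovation)) AND DT=(Article))",
]
for q in samples:
    print(extract_kw(q))
```

A keyword containing digits, hyphens or upper-case letters would be cut short by this pattern, so the pattern needs extending if such keywords are added.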

In [5]:
## starts a new Chrome window, don't close it!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.common.by import By

# download chromedriver here:
# https://sites.google.com/chromium.org/driver/
# unpack the zip and add it to the script directory, change the path if needed

# set download directory
prefs = {'download.default_directory' : r'C:\Users\Lion\Documents\UNI4_RWTH\tim_hiwi_job\bibliometric_analysis\scrape_data'}
options = Options()
options.add_experimental_option("prefs", prefs)

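# start instance of browser with selenium
# note: the executable_path argument is deprecated in Selenium 4 and removed in later 4.x releases;
# there, pass a selenium.webdriver.chrome.service.Service object via service= instead
path_to_chromedriver = 'chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver,
                          options = options)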


In [14]:
# uses the open Chrome window to run the search for each query and call the scrape function


url = "https://www.webofscience.com/wos/woscc/advanced-search"

# if scraping is interrupted (often on the WoS side with an "internal server error"),
# list the finished keywords here: once if only the first batch is done, twice if both batches are done
done = ['alliance', 'joint_venture', 'network', 'university_industry',
       'ecosystem', 'interorg', 'consorti', 'r&d', 'new_product_development',
       'npd', 'partnership', 'interfirm', 'across_boundaries', 'platform',
       'tournament', 'contest', 'intermediar', 'broadcast_search']

# loop that runs each generated query and then calls the scraping function defined earlier

for count,query in enumerate(queries): # enumerate() supplies the query index (count), which determines the batch
    
    # the first len(keywords) queries (index 0-36) cover the years before 2019, the remaining ones 2019 onwards
    if count < len(keywords):
        batch=1
    else: batch=2
    
    # extract the current keyword from the query
    kw = extract_kw(query)
    
    print(kw)
    if kw in done:
        done.remove(kw)
    else:
        # open page and conduct search
        browser.get(url)

        # give the page time to load its elements before interacting with them
        # the sleep statements may need to be longer or can be shorter, depending on the internet connection
        time.sleep(2)

        # reject cookies if needed (only first time)
        try:
            browser.find_element(By.XPATH,'//*[@id="onetrust-reject-all-handler"]').click()
        except: pass

        # clear the search form, then enter the generated query
        browser.find_element(By.ID,'advancedSearchInputArea').clear()
        browser.find_element(By.ID,'advancedSearchInputArea').send_keys(query)

        time.sleep(1)

        # click search button
        browser.find_element(By.XPATH,'/html/body/app-wos/div/div/main/div/div[2]/app-input-route/app-search-home/div[2]/div/app-input-route/app-search-advanced/app-advanced-search-form/form/div[3]/div[1]/div[1]/div/button[2]').click()

        time.sleep(4)

        # get and print number of papers

        n_paper = int(browser.find_element(By.CLASS_NAME, "brand-blue").text.replace(",",""))
        print(f"{count}: {kw}: {n_paper} papers found")

        # close helper window on webpage if needed
        try:
            browser.find_element(By.XPATH, '//*[@id="pendo-button-e5808a4c"]').click()
        except: pass

        # call scrape function
        scrape_search(n_paper,kw,batch)
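
The resume logic can be checked without a browser. The sketch below mirrors the done-list handling of the loop on plain data; `pending_searches` is a hypothetical helper name, and the three-keyword list is only a toy example.

```python
def pending_searches(keywords, done):
    """Return (count, keyword, batch) for every search that still has to run.

    keywords: the keyword list; the queries are this list twice (batch 1, then batch 2)
    done: finished keywords, listed once per finished batch
    """
    remaining_done = list(done)          # copy, so the caller's list is not modified
    todo = []
    for count, kw in enumerate(keywords + keywords):
        batch = 1 if count < len(keywords) else 2
        if kw in remaining_done:
            remaining_done.remove(kw)    # consume one "done" entry per skipped search
        else:
            todo.append((count, kw, batch))
    return todo

print(pending_searches(['alliance', 'network', 'platform'], ['alliance', 'network']))
```

Listing a keyword twice in `done` skips it in both batches, which matches the instruction in the comment above the `done` list.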
    

alliance
joint_venture
network
university_industry
ecosystem
interorg
consorti
r&d
new_product_development
npd
partnership
interfirm
across_boundaries
platform
tournament
contest
intermediar
broadcast_search
collaboration
18: collaboration: 9986 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [02:27<00:00, 14.70s/it]


collective_intelligence
19: collective_intelligence: 102 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.49s/it]


community
20: community: 26982 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [09:04<00:00, 20.17s/it]


co_creation
21: co_creation: 877 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.47s/it]


cumulative_innovation
22: cumulative_innovation: 54 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.48s/it]


crowdsourcing
23: crowdsourcing: 465 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.46s/it]


customer_involvement
24: customer_involvement: 94 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.48s/it]


customer_integration
25: customer_integration: 61 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.49s/it]


distributed_innovation
26: distributed_innovation: 110 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.47s/it]


lead_user
27: lead_user: 205 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.48s/it]


open_innovation
28: open_innovation: 1623 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.45s/it]


open_source
29: open_source: 2003 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:53<00:00, 17.79s/it]


peer_production
30: peer_production: 30 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.43s/it]


user_innovation
31: user_innovation: 232 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.55s/it]


consumer_innovation
32: consumer_innovation: 31 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.45s/it]


wisdom_of_the_crowd
33: wisdom_of_the_crowd: 17 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.43s/it]


co_production
34: co_production: 429 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.47s/it]


free_innovation
35: free_innovation: 5 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.41s/it]


household_innovation
36: household_innovation: 2 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.43s/it]


alliance
37: alliance: 1368 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.51s/it]


joint_venture
38: joint_venture: 212 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.46s/it]


network
39: network: 55847 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 56/56 [19:11<00:00, 20.57s/it]


university_industry
40: university_industry: 375 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.47s/it]


ecosystem
41: ecosystem: 11367 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [04:05<00:00, 20.43s/it]


interorg
42: interorg: 372 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.50s/it]


consorti
43: consorti: 1866 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.50s/it]


r&d
44: r&d: 4284 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:43<00:00, 20.69s/it]


new_product_development
45: new_product_development: 515 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.51s/it]


npd
46: npd: 343 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.47s/it]


partnership
47: partnership: 2537 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:54<00:00, 18.15s/it]


interfirm
48: interfirm: 201 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.48s/it]


across_boundaries
49: across_boundaries: 27 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.47s/it]


platform
50: platform: 18619 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [06:20<00:00, 20.04s/it]


tournament
51: tournament: 132 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.49s/it]


contest
52: contest: 1143 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.00s/it]


intermediar
53: intermediar: 963 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.50s/it]


broadcast_search
54: broadcast_search: 1 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.41s/it]


collaboration
55: collaboration: 7628 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:03<00:00, 22.99s/it]


collective_intelligence
56: collective_intelligence: 84 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.51s/it]


community
57: community: 20082 papers found


100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [07:02<00:00, 20.10s/it]


co_creation
58: co_creation: 1266 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.98s/it]


cumulative_innovation
59: cumulative_innovation: 24 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.44s/it]


crowdsourcing
60: crowdsourcing: 473 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.50s/it]


customer_involvement
61: customer_involvement: 66 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.49s/it]


customer_integration
62: customer_integration: 29 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.46s/it]


distributed_innovation
63: distributed_innovation: 34 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.45s/it]


lead_user
64: lead_user: 60 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.47s/it]


open_innovation
65: open_innovation: 1388 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.51s/it]


open_source
66: open_source: 2032 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:54<00:00, 18.15s/it]


peer_production
67: peer_production: 21 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.45s/it]


user_innovation
68: user_innovation: 108 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.49s/it]


consumer_innovation
69: consumer_innovation: 33 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.47s/it]


wisdom_of_the_crowd
70: wisdom_of_the_crowd: 17 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.44s/it]


co_production
71: co_production: 578 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.46s/it]


free_innovation
72: free_innovation: 13 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.45s/it]


household_innovation
73: household_innovation: 10 papers found


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.41s/it]
