**Tool Showcase**
=======

<font color = 'E3A440'>*Webscraping tutorial*</font>
=============

This short hands-on tutorial introduces the practice of web scraping.
It was presented during the <font color = 'E3A440'>Tool Showcase</font> at the [P4IE Conference - Measuring Metrics that Matter](https://event.fourwaves.com/p4ie/pages), which took place on 9-11 May 2022 at the *Hilton Garden Inn* in Ottawa.

Structure of the showcase:
1. General Framework
2. Web scraping step by step
3. Launch a program 

This tutorial is not an exhaustive treatment of the domain.

### Author: 
- Davide Pulizzotto <davide.pulizzotto@polymtl.ca>

### Table of Contents

- [Section 1. Introduction](#introduction)
- [Section 2. Step by step](#step-by-step)
- [Section 3. Launch a program](#Launch)


<a id='introduction'></a>
# <font size = '6' color='E3A440'>Section 1. Introduction</font>



The examples in this tutorial scrape the [Journal of Responsible Innovation](https://www.tandfonline.com/action/journalInformation?show=aimsScope&journalCode=tjri20), published by Taylor & Francis.


## 0.1 Preparation of Colab Virtual Machine

To work correctly on Colab, we need to prepare the environment in two main steps:
1. Download the data from the GitHub project
2. Install the packages needed to run the code of this workshop
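
Before installing anything, it can help to check which of the packages used in this tutorial are already available in the environment. The sketch below is one possible way to do that with the standard library. The helper name `missing_packages` is hypothetical, and the package list (`requests`, `bs4`, `pandas`, `lxml`) is taken from the imports and the parser used in the code of this tutorial.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this tutorial; "lxml" is the parser passed to BeautifulSoup
required = ["requests", "bs4", "pandas", "lxml"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

On Colab, a missing package can be installed from a cell with `!pip install <name>`. Note that the import name `bs4` is distributed on PyPI as `beautifulsoup4`.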

In [206]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time

In [64]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}


Make the request to the server

In [11]:
# response_req = requests.get('https://www.sciencedirect.com/journal/international-journal-of-innovation-studies/issues',
#                             headers = headers,
#                             timeout = 10)

In [74]:
response_req = requests.get('https://www.tandfonline.com/toc/tjri20/9/1?nav=tocList',
                            headers = headers,
                            timeout = 10)

In [75]:
response_req.status_code

200
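
A status code other than 200 does not always mean the page is gone: servers sometimes fail temporarily or throttle rapid requests. As a minimal sketch, assuming a simple wait-and-retry policy is acceptable for the target site, the hypothetical helper `get_with_retries` below wraps any function that performs the request. Its name, parameters, and default values are illustrative and are not part of `requests`.

```python
import time


def get_with_retries(fetch, url, max_attempts=3, wait_seconds=2.0):
    """Call fetch(url) until it returns a response with status 200.

    `fetch` is any callable that takes a URL and returns an object with a
    `status_code` attribute (for example a wrapper around requests.get).
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # Wait before trying again so the server is not hammered
            time.sleep(wait_seconds)
    return None
```

With the objects defined above, a call could look like `get_with_retries(lambda u: requests.get(u, headers=headers, timeout=10), 'https://www.tandfonline.com/toc/tjri20/9/1?nav=tocList')`. This sketch does not catch exceptions such as timeouts.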

Parse the response's content

In [76]:
if response_req.status_code == 200:
    soup = BeautifulSoup(response_req.content, features="lxml")

In [77]:
list_article = soup.findAll('div', attrs={"class":re.compile("art_title linkable")})

Get the list of articles and their URLs

In [80]:

for x in list_article:
    print(x.text)

Engaging with societal challenges in responsible innovation
What’s wrong with global challenges?
Co-creation in support of responsible research and innovation: an analysis of three stakeholder workshops on nanotechnology for health
Innovation and equality: an approach to constructing a community governed network commons
Nanoscientists’ perceptions of serving as ethical leaders within their organization: Implications from ethical leadership for responsible innovation
The uses of grand challenges in research policy and university management: something for everyone
Toward institutionalization of responsible innovation in the contemporary research university: insights from case studies of Arizona State University
Looking beyond the ‘horizon’ of RRI: moving from discomforts to commitments as early career researchers
Responsibility and innovation
New horizons, old friends: taking an ‘ARIA in six keys’ approach to the future of R(R)I
Against bureaucrapitalism: a response to Shanley and collea

In [90]:
list_url_articles = []
for art_element in list_article:
    for link_ in art_element.findAll("a", href=True):
        print(link_['href'])
        list_url_articles.append(link_['href'])


/doi/full/10.1080/23299460.2022.2063910
/doi/full/10.1080/23299460.2021.2000130
/doi/full/10.1080/23299460.2021.1994195
/doi/full/10.1080/23299460.2022.2043681
/doi/full/10.1080/23299460.2022.2043630
/doi/full/10.1080/23299460.2022.2040870
/doi/full/10.1080/23299460.2022.2042983
/doi/full/10.1080/23299460.2022.2049506
/doi/full/10.1080/23299460.2022.2050570
/doi/full/10.1080/23299460.2022.2050592
/doi/full/10.1080/23299460.2022.2055993


In [82]:
list_url_articles

['/doi/full/10.1080/23299460.2022.2063910',
 '/doi/full/10.1080/23299460.2021.2000130',
 '/doi/full/10.1080/23299460.2021.1994195',
 '/doi/full/10.1080/23299460.2022.2043681',
 '/doi/full/10.1080/23299460.2022.2043630',
 '/doi/full/10.1080/23299460.2022.2040870',
 '/doi/full/10.1080/23299460.2022.2042983',
 '/doi/full/10.1080/23299460.2022.2049506',
 '/doi/full/10.1080/23299460.2022.2050570',
 '/doi/full/10.1080/23299460.2022.2050592',
 '/doi/full/10.1080/23299460.2022.2055993']
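
The same extraction can also be written with CSS selectors via `select`, which matches the two classes regardless of their order in the `class` attribute. The sketch below runs on a small hand-written HTML fragment that only imitates the structure implied by the code above (a `div` with classes `art_title` and `linkable` wrapping a link). The fragment, its titles, and its DOIs are invented for illustration, and the real page markup may differ. It uses the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Invented fragment that imitates the assumed structure of a table-of-contents page
html = """
<div class="art_title linkable"><a href="/doi/full/10.0000/example.1">First example title</a></div>
<div class="linkable art_title"><a href="/doi/full/10.0000/example.2">Second example title</a></div>
<div class="other"><a href="/about">Not an article</a></div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# "div.art_title.linkable a[href]" selects links nested inside a div
# that carries both classes, in any order
records = [
    {"title": link.get_text(strip=True), "href": link["href"]}
    for link in soup_demo.select("div.art_title.linkable a[href]")
]

for record in records:
    print(record["title"], "->", record["href"])
```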

In [85]:
for url_article in list_url_articles:
    url_art_temp = f"https://www.tandfonline.com{url_article}"
    print(url_art_temp)


https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2063910
https://www.tandfonline.com/doi/full/10.1080/23299460.2021.2000130
https://www.tandfonline.com/doi/full/10.1080/23299460.2021.1994195
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2043681
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2043630
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2040870
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2042983
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2049506
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2050570
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2050592
https://www.tandfonline.com/doi/full/10.1080/23299460.2022.2055993
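
String concatenation works here because every `href` starts with `/`. As a slightly more general sketch, `urllib.parse.urljoin` from the standard library resolves both relative and absolute links against a base URL. The two sample links below are copied from the output above.

```python
from urllib.parse import urljoin

base_url = "https://www.tandfonline.com"

# Two hrefs copied from the list above
sample_hrefs = [
    "/doi/full/10.1080/23299460.2022.2063910",
    "/doi/full/10.1080/23299460.2021.2000130",
]

absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]
for url in absolute_urls:
    print(url)

# A link that is already absolute is returned unchanged
print(urljoin(base_url, "https://www.tandfonline.com/toc/tjri20/9/1?nav=tocList"))
```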


In [102]:
# Fetch the articles one by one and stop at the first one that is not an editorial
for url_article in list_url_articles:
    url_art_temp = f"https://www.tandfonline.com{url_article}"
    response_req_article = requests.get(url_art_temp,
                            headers = headers,
                            timeout = 10)
    if response_req_article.status_code != 200:
        # Page not retrieved: try the next article
        continue
    soup_article = BeautifulSoup(response_req_article.content, features="lxml")
    if soup_article.find("div", attrs = {"class": "toc-heading"}).text == 'Editorial':
        continue
    break

In [103]:
type_article = soup_article.find("div", attrs = {"class": "toc-heading"}).text
print(type_article)

Research Articles


In [104]:
title = soup_article.find("span", attrs = {"class": re.compile("article-title")})
print(title.text)

What’s wrong with global challenges?
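
`find` returns `None` when no element matches, so `title.text` would raise `AttributeError` on a page without a matching element. The sketch below shows one possible guard. `text_or_default` is a hypothetical helper, and the stand-in object only mimics the `.text` attribute of a parsed element.

```python
from types import SimpleNamespace


def text_or_default(element, default=""):
    """Return element.text without surrounding whitespace, or `default` if element is None."""
    if element is None:
        return default
    return element.text.strip()


# Stand-in for a parsed element; a real call would pass the result of soup_article.find(...)
found = SimpleNamespace(text="  Some title text.  ")
not_found = None

print(repr(text_or_default(found)))
print(repr(text_or_default(not_found, default="NA")))
```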


In [106]:
list_authors = []
for author_temp in soup_article.findAll("a", attrs = {'class': "author"}):
    list_authors.append(author_temp.text)
    print(author_temp.text)

David Ludwig
Vincent Blok
Marie Garnier
Phil Macnaghten
Auke Pols


In [108]:
authors = '; '.join(list_authors)
authors

'David Ludwig; Vincent Blok; Marie Garnier; Phil Macnaghten; Auke Pols'
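
Author names scraped from a page may carry trailing or non-breaking spaces, which then show up around the `; ` separator. Below is a minimal sketch of a clean-up step, assuming that collapsing every run of whitespace into a single space is acceptable for names. The helper `clean_name` and the padded sample strings are illustrative.

```python
def clean_name(raw_name):
    """Collapse runs of whitespace (including non-breaking spaces) and trim the ends."""
    return " ".join(raw_name.split())


# Sample strings with a trailing non-breaking space (\xa0) and extra blanks
raw_authors = ["David Ludwig\xa0", "  Vincent   Blok ", "Marie Garnier"]

authors_clean = "; ".join(clean_name(name) for name in raw_authors)
print(authors_clean)
```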

Get abstract

In [131]:
abstract_el =  soup_article.find("div", attrs = {'class': "abstractSection abstractInFull"})
abstract_text = abstract_el.find('p').next_sibling.text
abstract_text

"Global challenges such as climate change, food security, or public health have become dominant concerns in research and innovation policy. This article examines how responses to these challenges are addressed by governance actors. We argue that appeals to global challenges can give rise to a ‘solution strategy' that presents responses of dominant actors as solutions and a ‘negotiation strategy' that highlights the availability of heterogeneous and often conflicting responses. On the basis of interviews and document analyses, the study identifies both strategies across local, national, and European levels. While our results demonstrate the co-existence of both strategies, we find that global challenges are most commonly highlighted together with the solutions offered by dominant actors. Global challenges are ‘wicked problems' that often become misframed as ‘tame problems’ in governance practice and thereby legitimise dominant responses."

Prepare your program and launch it!

In [245]:
df_data_jri = pd.DataFrame(columns=['Vol','Issue','Year','Authors', 'Title','Abstract','url'])

In [246]:
# Volumes
for vol_ in [7]:
    # Issues
    for issue_ in [1,2,3]:
        url_issue = f"https://www.tandfonline.com/toc/tjri20/{vol_}/{issue_}?nav=tocList"
        issue_req = requests.get(url_issue,
                                    headers = headers,
                                    timeout = 10)
        if issue_req.status_code != 200:
            # Skip this issue if its page could not be retrieved
            continue
        soup_issue = BeautifulSoup(issue_req.content, features="lxml")
        
        # Year
        title_issue_el = soup_issue.find("div", "toc-title")
        year = re.search(r"\(([0-9]{4})\)", title_issue_el.text).group(1)
        print(f"\n{title_issue_el.text}")
        # Articles
        list_url_articles = soup_issue.findAll('div', attrs={"class":re.compile("art_title linkable")})
        for url_article in list_url_articles:
            time.sleep(1)
            url_arti_href  = url_article.find("a")['href']
            url_art_temp = f"https://www.tandfonline.com{url_arti_href}"
            # print(url_art_temp)
            req_article = requests.get(url_art_temp,
                                    headers = headers,
                                    timeout = 10)
            # print(req_article.status_code)
            if req_article.status_code != 200:
                # Skip articles whose page could not be retrieved
                continue
            soup_article = BeautifulSoup(req_article.content, features="lxml")
            # Keep only items whose section heading contains "Article"
            if not re.search("Article", soup_article.find("div", attrs = {"class": "toc-heading"}).text):
                continue
            # title
            title = soup_article.find("span", attrs = {"class": re.compile("article-title")})
            print(title.text)
            # author
            list_authors = []
            for author_temp in soup_article.findAll("a", attrs = {'class': "author"}):
                list_authors.append(author_temp.text)
            authors = '; '.join(list_authors)
            # abstract
            abstract_el =  soup_article.find("div", attrs = {'class': "abstractSection abstractInFull"})
            abs_list_par = []
            for par_ in abstract_el.findAll('p'):
                if not re.search("^abstract", par_.text, re.IGNORECASE):
                    abs_list_par.append(par_.text)
            abstract_text = '\n'.join(abs_list_par)
            # fill database
            idx = len(df_data_jri)
            df_data_jri.loc[idx] = [vol_, issue_, year, authors, title.text, abstract_text, url_art_temp]

Journal of Responsible Innovation, Volume 7, Issue 1 (2020)
Responsible innovation as empowering ways of knowing
Traditional ecological knowledge in innovation governance: a framework for responsible and just innovation
The design and testing of a tool for developing responsible innovation in start-up enterprises
When desirability and feasibility go hand in hand: innovators’ perspectives on what is and is not responsible innovation in health
The objects of technology assessment. Hermeneutic extension of consequentialist reasoning
Journal of Responsible Innovation, Volume 7, Issue 2 (2020)
Responsible research and innovation: hopes and fears in the scientific community in Europe
Subtle voices, distant futures: a critical look at conditions for patient involvement in Alzheimer’s biomarker research and beyond
Land use conflicts between biomass and power production – citizens’ participation in the technology development of Agrophotovoltaics
Creating relevant knowledge in transdisciplinary 
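
The year is taken from the issue title with a regular expression that looks for four digits between parentheses. The sketch below isolates that step on the first title printed above and returns `None` when no year is found instead of raising an error. The function name `extract_year` is illustrative.

```python
import re


def extract_year(issue_title):
    """Return the four-digit year found in parentheses, or None if there is none."""
    match = re.search(r"\(([0-9]{4})\)", issue_title)
    return match.group(1) if match else None


print(extract_year("Journal of Responsible Innovation, Volume 7, Issue 1 (2020)"))
print(extract_year("A title without a year"))
```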

In [249]:
df_data_jri

| | Vol | Issue | Year | Authors | Title | Abstract | url |
|---|---|---|---|---|---|---|---|
| 0 | 7 | 1 | 2020 | Govert Valkenburg; Annapurna Mamidipudi; Poona... | Responsible innovation as empowering ways of k... | In pursuit of responsible research and innovat... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 1 | 7 | 1 | 2020 | David Ludwig; Phil Macnaghten | Traditional ecological knowledge in innovation... | Change in Traditional Ecological Knowledge (TE... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 2 | 7 | 1 | 2020 | Thomas B. Long; Vincent Blok; Steven Dorrestij... | The design and testing of a tool for developin... | Innovation leads to new products, business mod... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 3 | 7 | 1 | 2020 | Lysanne Rivard; Pascale Lehoux | When desirability and feasibility go hand in h... | While the conceptual foundations of Responsibl... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 4 | 7 | 1 | 2020 | Armin Grunwald | The objects of technology assessment. Hermeneu... | Since the beginning of this century, the estab... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 5 | 7 | 2 | 2020 | Martin Carrier ; Minea Gartzlaff | Responsible research and innovation: hopes and... | We conducted interviews among some 80 research... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 6 | 7 | 2 | 2020 | Karen Dam Nielsen ; Marianne Boenink | Subtle voices, distant futures: a critical loo... | Patient involvement is increasingly regarded a... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 7 | 7 | 2 | 2020 | Daniel Ketzer ; Nora Weinberger ; Christine... | Land use conflicts between biomass and power p... | Despite the technical feasibility of renewable... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 8 | 7 | 2 | 2020 | Andrea Schikowitz | Creating relevant knowledge in transdisciplina... | Transdisciplinarity aims to address ‘grand soc... | https://www.tandfonline.com/doi/full/10.1080/2... |
| 9 | 7 | 2 | 2020 | Theo Papaioannou | Innovation, value-neutrality and the question ... | Since the reconstruction of Joseph Schumpeter’... | https://www.tandfonline.com/doi/full/10.1080/2... |
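
Appending with `df.loc[idx] = ...` is fine for a few dozen rows. For larger crawls, a common alternative is to collect each record in a list and build the DataFrame once at the end. The sketch below shows that pattern and writes the result as CSV text. The two rows reuse values shown in this tutorial, abstracts and URLs are left out to keep it short, and the variable names are illustrative.

```python
import io

import pandas as pd

# Collect one dictionary per article while scraping
rows = []
rows.append({
    "Vol": 7, "Issue": 1, "Year": "2020",
    "Authors": "David Ludwig; Phil Macnaghten",
    "Title": "Traditional ecological knowledge in innovation governance: "
             "a framework for responsible and just innovation",
})
rows.append({
    "Vol": 7, "Issue": 1, "Year": "2020",
    "Authors": "Armin Grunwald",
    "Title": "The objects of technology assessment. "
             "Hermeneutic extension of consequentialist reasoning",
})

# Build the DataFrame once, after the loop
df_demo = pd.DataFrame(rows, columns=["Vol", "Issue", "Year", "Authors", "Title"])

# index=False leaves out the row index, so no extra unnamed column
# appears when the file is read back
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, a path can be passed to `to_csv` in place of `buffer`.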
