You can choose to do one, two or all three activities depending on how confident you feel.   
You do not have to use Google Scholar. You are allowed to use any academic search engine or scrape from other sources if you find something more appropriate.

## 1. Web scrape a list of all of my publications since 2015 (e.g. [search link](https://scholar.google.com/citations?user=ETIBghkAAAAJ&hl=en))

## 2. Scrape a list of all the co-authors of my papers including a numerical value that corresponds to the number of co-authorships.

## 3. Scrape the abstract/keywords from these papers.

Caveat: there are numerous people with my name and not all of my publications are at the same institution so this search may not be as easy as it sounds! 

You may also find yourself blocked by CAPTCHAs and similar anti-bot measures, in which case you will have to find workarounds.
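One possible workaround is to slow down and back off when a response looks like a block page. The sketch below is only an illustration: `looks_blocked` and `fetch_politely` are hypothetical helper names, the marker strings and the User-Agent value are assumptions, and `session` is expected to be a `requests.Session` (or anything with a compatible `get` method).

```python
import time


def looks_blocked(status_code, html):
    """Heuristic check for a block page; the marker strings are assumptions."""
    if status_code in (403, 429):
        return True
    lowered = html.lower()
    return 'captcha' in lowered or 'unusual traffic' in lowered


def fetch_politely(session, url, delay=5.0, max_tries=3):
    """Fetch url through a requests-style session, backing off when blocked.

    Returns the page HTML, or None if every attempt looked blocked.
    """
    # A browser-like User-Agent; the exact string is only an example.
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; coursework-scraper)'}
    for attempt in range(max_tries):
        response = session.get(url, headers=headers, timeout=30)
        if not looks_blocked(response.status_code, response.text):
            return response.text
        # wait longer after each blocked attempt: delay, 2*delay, 4*delay, ...
        time.sleep(delay * 2 ** attempt)
    return None
```

With `requests` imported, a call would look like `html = fetch_politely(requests.Session(), url)`. The result can then be passed to `BeautifulSoup(html, 'html.parser')` whenever it is not `None`.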

**Get data to work with**
1. Read in the profile page.
2. Get the publications table into a data frame, as was done with the cfl data.
3. Strip out anything before 2015.
4. Keep the HTML link for each article so that each one can be opened and scraped further.

**Get coauthors**
1. All co-authors are listed in the Google Scholar table, so take them from there.
2. Strip out all the S McGraths.

**Scrape abstract from paper**
1. Open the link to each paper.
2. Check whether the pages share a similar format for the abstract.
3. Extract the abstract and keywords.


In [83]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [84]:
url = 'https://scholar.google.com/citations?user=ETIBghkAAAAJ&hl=en'

page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

In [87]:
def get_table_headers(soup):
    table_head = soup.find_all('th')
    col_headers = []
    for th in table_head:
        col_headers.append(th.text)

    return col_headers

col_headers = get_table_headers(soup)

def get_table_data(soup):
    table_body = soup.find('tbody', attrs={'id': 'gsc_a_b'})
    rows = table_body.find_all('tr')
    
    data = []
    for row in rows:
        row_data = []

        info_cols = row.find_all('td', attrs={'class': 'gsc_a_t'})
        for info in info_cols:
            # build the absolute link to the article's detail page
            link = 'https://scholar.google.com' + info.findChild('a').get('data-href')
            # get title
            title = info.findChild('a').get_text()
            # get authors (the first <div> under the title cell)
            author_info = info.findChild('div').get_text()
            # add title, link and authors to the row
            row_data.extend([title, link, author_info])

        year_cols = row.find_all('td', attrs={'class': 'gsc_a_y'})
        for year in year_cols:
            row_data.append(year.get_text())
        #add row to data
        data.append(row_data)
    return data

table_data = get_table_data(soup)

In [125]:
df = pd.DataFrame(table_data, columns=['title', 'url', 'authors', 'year'])

# keep 2015 onward; compare numerically so blank or malformed years are dropped
data = df.loc[pd.to_numeric(df['year'], errors='coerce') >= 2015]

In [150]:
def get_page_soups(data, col_id):
    page_soups = []
    for index, row in data.iterrows():
        page = requests.get(row[col_id])
        soup = BeautifulSoup(page.content, 'html.parser')
        page_soups.append(soup)
    return page_soups


In [151]:
page_soups = get_page_soups(data, 'url')


In [176]:
print(len(page_soups))

def get_abstracts(page_soups):
    abstracts = []
    for soup in page_soups:
        # find() returns None when the description block is missing,
        # so check before calling get_text()
        descr = soup.find('div', attrs={'id': 'gsc_vcd_descr'})
        if descr is None:
            print('no abstract found')
            abstracts.append(None)
        else:
            abstracts.append(descr.get_text())
    return abstracts

abstracts = get_abstracts(page_soups)

# page_soups follows the row order of `data`, so assign the list positionally;
# a separate DataFrame would be aligned on its own 0..n-1 index instead
df = data.assign(abstracts=abstracts)
# sort newest to oldest
df = df.sort_values('year', axis=0, ascending=False)
df.head()


14


| | title | url | authors | year | abstracts |
|---|---|---|---|---|---|
| 10 | Breaking the workflow: Design heuristics to su... | https://scholar.google.com/citations?view_op=v... | S McGrath | 2020 | The investigation that follows presents the re... |
| 9 | DESIGNING AND DEVELOPING USER-CENTRED SYSTEMS | https://scholar.google.com/citations?view_op=v... | S McGrath | 2018 | Our work explores the implications for the des... |
| 11 | The Rough Mile: a Design Template for Locative... | https://scholar.google.com/citations?view_op=v... | A Hazzard, J Spence, C Greenhalgh, S McGrath | 2018 | The rapid development of mobile devices, netwo... |
| 5 | The Rough Mile: Testing a framework of immersi... | https://scholar.google.com/citations?view_op=v... | J Spence, A Hazzard, S McGrath, C Greenhalgh, ... | 2017 | We present our case study on gifting digital m... |
| 6 | The user experience of mobile music making: An... | https://scholar.google.com/citations?view_op=v... | S McGrath, S Love | 2017 | The research herein describes the investigatio... |


In [192]:
import nltk
from nltk.stem import PorterStemmer
nltk.download('stopwords')

df_authors = df['authors'].tolist()

def split_authors(authors):
    # flatten the comma-separated author strings into one list of lower-cased names
    author_list = []
    for author in authors:
        author_list.extend(author.split(','))
    return [author.strip().lower() for author in author_list]

author_list = split_authors(df_authors)

def get_word_freq(content):
    # note: the stemmer is designed for ordinary words and can clip names
    # (e.g. 'j spence' is counted as 'j spenc')
    ps = PorterStemmer()
    word_frequencies = {}
    for tok in content:
        tok = ps.stem(tok)
        word_frequencies[tok] = word_frequencies.get(tok, 0) + 1
    return word_frequencies

get_word_freq(author_list)



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hugho\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'s mcgrath': 13,
 'a hazzard': 6,
 'j spenc': 3,
 'c greenhalgh': 4,
 's benford': 7,
 's love': 1,
 'ap mcpherson': 1,
 'a chamberlain': 5,
 'sa mcgrath': 1}
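
The counts above still include the profile owner under two spellings ('s mcgrath' and 'sa mcgrath'), and the stemmer has clipped 'j spence' to 'j spenc'. A minimal alternative sketch, assuming plain counting with `collections.Counter` and no stemming, is shown below. `count_coauthors` and `own_names` are hypothetical names, and skipping a trailing '...' entry is an assumption about how truncated author lists might appear.

```python
from collections import Counter


def count_coauthors(author_strings, own_names):
    """Count how often each co-author appears, skipping the profile owner."""
    own = {name.lower() for name in own_names}
    counts = Counter()
    for authors in author_strings:
        for name in authors.split(','):
            name = name.strip().lower()
            # skip blanks, a '...' truncation marker, and the owner's own names
            if not name or name == '...' or name in own:
                continue
            counts[name] += 1
    return counts


# Illustrative input in the same shape as the 'authors' column
sample = [
    'S McGrath',
    'A Hazzard, J Spence, C Greenhalgh, S McGrath',
    'S McGrath, S Love',
]
coauthors = count_coauthors(sample, own_names=['S McGrath', 'SA McGrath'])
print(coauthors.most_common())
```

On the real data this would be called as `count_coauthors(df['authors'].tolist(), own_names=['S McGrath', 'SA McGrath'])`, which gives the number of co-authorships per person without the owner's own entries.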

In [174]:
def get_paper_urls(page_soups):
    urls = []
    for soup in page_soups:
        url = soup.find('a', attrs={'class': 'gsc_vcd_title_link'}).get('href')
        urls.append(url)
    return urls

paper_urls = get_paper_urls(page_soups)

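# page_soups follows the row order of `data`, while df has since been sorted,
# so align the URLs on data's index rather than assigning by position
df = df.assign(real_paper_url=pd.Series(paper_urls, index=data.index))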
df.head()


"This paper explores the design of digital musical instruments (DMIs) for exploratory play. Based on Gaver's principles of ludic design, we examine the ways in which people come to terms with an unfamiliar musical interface. We describe two workshops with the D-Box, a DMI designed to be modified and hacked by the user. The operation of the D-Box is deliberately left ambiguous to encourage users to develop their own meanings and interaction techniques. During the workshops we observed emergent patterns of exploration which revealed a rich process of exploratory play. We discuss our observations in relation to previous literature on appropriation, ambiguity and ludic engagement, and we provide recommendations for the design of playful and exploratory interfaces."

In [155]:
real_paper_soups = get_page_soups(df, 'real_paper_url')

In [160]:
print(len(real_paper_soups))

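def get_keywords(paper_soups):
    for soup in paper_soups:
        # string='Keyword' only matches text nodes that are exactly 'Keyword',
        # so headings such as 'Keywords' or 'Keywords:' are missed
        keyword_soups = soup.body.find_all(string='Keyword')
        print(keyword_soups)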

get_keywords(real_paper_soups[0:3])

14
[]
[]
[]
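
The empty lists above suggest that searching the page body for the exact text 'Keyword' is too strict. One alternative worth trying is to read keywords from `<meta>` tags in the page head. The sketch below uses only the standard library so that it runs on its own. `MetaKeywordParser` and `extract_meta_keywords` are hypothetical names, the sample HTML is made up, and whether a given publisher exposes `keywords` or `citation_keywords` meta tags is an assumption that needs checking per site.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect keywords from <meta name="keywords"> style tags."""

    # meta names to look at; which ones a given publisher uses is an assumption
    KEYWORD_NAMES = {'keywords', 'citation_keywords'}

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attr = dict(attrs)
        name = (attr.get('name') or '').lower()
        if name in self.KEYWORD_NAMES and attr.get('content'):
            # split on commas or semicolons and drop empty or repeated pieces
            for piece in attr['content'].replace(';', ',').split(','):
                piece = piece.strip()
                if piece and piece not in self.keywords:
                    self.keywords.append(piece)


def extract_meta_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Made-up page fragment for illustration only
sample_html = (
    '<html><head>'
    '<meta name="keywords" content="music; interaction design, HCI">'
    '</head><body></body></html>'
)
print(extract_meta_keywords(sample_html))
```

With the soups already held in `real_paper_soups`, a comparable lookup in BeautifulSoup would be `soup.find('meta', attrs={'name': 'keywords'})`, followed by reading its `content` attribute when the tag is present.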
