This notebook illustrates how to use Python to assist with data collection for academic research.

The collected data should only serve academic research purposes. 

One should never post the collected data on publicly accessible websites without explicit consent from the original website.

One should never profit financially from the collected data. 

# Install required packages (only need to run the first time)

In [None]:
# First: install Anaconda Python: https://www.anaconda.com/products/individual

# install necessary packages
## package: selenium for navigating webpages using Google Chrome
!pip3 install -U selenium
## package: webdriver-manager for downloading a matching chromedriver automatically
!pip3 install webdriver-manager

## note: time and os are part of the Python standard library and need no installation
## package: beautifulsoup4 (imported as bs4) for processing HTML source code
!pip3 install -U beautifulsoup4
## package: pandas for processing data
!pip3 install -U pandas
## package: openpyxl, used by pandas to write .xlsx files
!pip3 install -U openpyxl
## package: html5lib for parsing html
!pip3 install html5lib

# Import required packages

In [1]:
from selenium import webdriver
# import Keys to allow for hotkeys
from selenium.webdriver.common.keys import Keys
# import ActionChains to allow for page scrolling
from selenium.webdriver.common.action_chains import ActionChains
# import By, WebDriverWait, and expected_conditions to wait for elements to appear on the page before proceeding
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# import Select to select options
from selenium.webdriver.support.ui import Select

from selenium.webdriver.chrome.service import Service
# webdriver-manager downloads a chromedriver that matches the installed Chrome
from webdriver_manager.chrome import ChromeDriverManager
import time

from time import sleep

from bs4 import BeautifulSoup as soup

import pandas as pd

import os.path

# Python selenium with chromedriver

We will use Python Selenium to control a chromedriver application to navigate websites.

Chromedriver is open-source software that must be present on the local disk.

The `webdriver-manager` package downloads a matching chromedriver and passes its path to Selenium, so no manual download or path change is needed. Run:

In [2]:
# start a chrome browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))



Current google-chrome version is 109.0.5414
Get LATEST chromedriver version for 109.0.5414 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/109.0.5414.74/chromedriver_mac64.zip
Driver has been saved in cache [/Users/lhe/.wdm/drivers/chromedriver/mac64/109.0.5414.74]


Set the URL and load it:

In [3]:
# set url (as a text string)
url = "https://www.aeaweb.org/research?path=research&page=1&per-page=10#list"
# load url
driver.get(url)
# notes: 
## driver is an object, the webdriver controlled by selenium
## get is a method (think of it as a function) defined in the selenium package

# HTML tags and scripts

We need to understand the basic structure of HTML in order to tell the Python packages which parts of a page to work on.

In [4]:
# set url
url = "https://en.wikipedia.org/wiki/HTML"
# load url
driver.get(url)

### HTML content is structured around elements

In [None]:
# set url
url = "https://en.wikipedia.org/wiki/HTML_element"
# load url
driver.get(url)

It is essential to understand the following concepts:
* elements
* tags
* attributes
* content

In [None]:
# set url
url = "https://en.wikipedia.org/wiki/HTML_element#Syntax"
# load url
driver.get(url)

Think of **elements** as containers on a web page.

To see the elements of a page, right-click in the browser and choose "Inspect".

Most **elements** are marked by **tags** at the beginning and the end of an element. 

Each tag is enclosed in angle brackets, such as `<a>`.

**Attributes** are name-value pairs inside the opening tag that control the element's behavior,

such as `href` and `title` in `<a href="/wiki/HTML" title="HTML">HTML</a>`.

The **content** is usually the visible information on the web page, such as text or images.
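As a small illustration of these four concepts, the sketch below feeds the example link to Python's built-in `html.parser` module and records what it finds. The class name `ElementRecorder` is a hypothetical helper invented for this example; it is not part of Selenium or BeautifulSoup.

```python
from html.parser import HTMLParser


class ElementRecorder(HTMLParser):
    """Hypothetical helper: record opening tags, their attributes, and text content."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.records.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        # keep only non-empty text between the opening and closing tags
        if data.strip():
            self.records.append(("content", data.strip()))


parser = ElementRecorder()
parser.feed('<a href="/wiki/HTML" title="HTML">HTML</a>')
parser.close()

for record in parser.records:
    print(record)
```

The first record holds the tag `a` with its two attributes, and the second holds the content `HTML` that a visitor sees on the page.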

# Navigate and retrieve information using Selenium

Now we return to the AEA website:

In [5]:
# set url (as a text string)
url = "https://www.aeaweb.org/research?path=research&page=1&per-page=10#list"
# load url
driver.get(url)

## Task 1: click the "Journals" link on top of the page

The main challenge is that we need to "tell" selenium which element contains the link we want to click.

For this, we need to pass on some information about the element of choice. 

To find features that identify an element, we right-click *on* the element in the browser and select "Inspect" to see the HTML source code of that element.

To identify the element of choice, we need some additional "selector language" to pin down the element in the massive HTML source code. 

This takes a lot of trial and error.

### Find element by text (using Xpath)

**XPath** is a query language for selecting elements in HTML or XML documents.

To find all the `<a>` elements in a web page with the exact text `Journals`, the XPath would be:

`//a[text()='Journals']`

One good tutorial is: https://www.w3schools.com/xml/xpath_intro.asp

Test your selector in Chrome:
1. click in the source code pane (the "Elements" tab opened by "Inspect")
2. press "control + F" (Windows) or "cmd + F" (Mac) to search
3. type or paste your selector, and check
    * whether Chrome can find it, and
    * how many elements it finds
4. if it finds the desired elements, you are good to proceed
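The same kind of check can be sketched offline with Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The snippet below is a made-up fragment, not the real AEA page. ElementTree does not support `text()`, so the predicate `[.='Journals']` is used as the closest equivalent (it requires Python 3.7 or later).

```python
import xml.etree.ElementTree as ET

# hypothetical page fragment, invented for this example
snippet = """
<nav>
  <a href="/journals">Journals</a>
  <a href="/conference">Annual Meeting</a>
  <div class="footer"><a href="/journals">Journals</a></div>
</nav>
""".strip()

root = ET.fromstring(snippet)

# find every <a> element, at any depth, whose text is exactly 'Journals'
matches = root.findall(".//a[.='Journals']")

print(len(matches))
print([a.get("href") for a in matches])
```

Two links match here, one in the navigation bar and one in the footer, which shows why it is worth counting the matches before clicking the first one.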

In [6]:
# ask selenium to return elements by tag and contained text using XPath
# (find_elements_by_xpath is deprecated and was removed in newer Selenium 4 releases)
link_elements = driver.find_elements(By.XPATH, "//a[text()='Journals']")
# Notice: we used "" outside to mark the entire string, and use '' to mark the internal text for the xpath

# inspect the returned elements (a python list)
link_elements


[<selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="c962dcf3-ca4c-46da-98aa-923166daed32")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="c7a8d178-070e-43fe-b54b-3d7ca6e6544e")>]

In [7]:
# inspect how many elements were found
len(link_elements)

2

In [8]:
# retrieve the first element
link = link_elements[0]
# inspect the link
link

<selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="c962dcf3-ca4c-46da-98aa-923166daed32")>

### Click an element

In [9]:
# click on the link
link.click()

## Task 2: Retrieve text information by tag and attributes

### Find element by tag and attributes (using Xpath)

Sometimes we need to identify elements of interest using tags and attributes.

For example, each news item is contained in a `div` with a specific class attribute.

So we try the xpath:

`//div[@class='news-text']`

In [11]:
# set url (as a text string)
url = "https://www.aeaweb.org/research?path=research&page=1&per-page=10#list"
# load url
driver.get(url)

In [12]:
# ask selenium to return elements by tag and class attribute using XPath
# (find_elements_by_xpath is deprecated and was removed in newer Selenium 4 releases)
news_elements = driver.find_elements(By.XPATH, "//div[@class='news-text']")
# inspect the returned elements (a python list)
news_elements


[<selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="d0a98691-619e-42a0-ab17-369c860a45d3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="36a77cc0-7511-4e99-a638-7aadbc3046d5")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="e9a4e4ff-b781-4ed4-8dea-1536db2b5499")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="f6cd3baf-db38-4ddf-aa8a-5fb0dc16c003")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="d8f871d3-852b-4c52-b260-668825f44ebe")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="3c6a78de-95cc-4fff-a0fb-2ca703e69d2f")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5d7b3de76bbabe48244cbf6993cb12cf", element="ea9f31af-45e7-4655-80d7-62

In [13]:
# inspect how many elements were found
len(news_elements)

10

In [14]:
# store the first element
news_element = news_elements[0]
# return its text
news_element.text

'Mental health therapy in the developing world\nJanuary 23, 2023\nNathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.'

We can now use Python's `.split` method to separate the text into three parts.

In [15]:
# split the returned text at each `\n` (line break) into a list
text_list = news_element.text.split('\n')
# inspect
text_list

['Mental health therapy in the developing world',
 'January 23, 2023',
 'Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.']

In [16]:
# store title
title = text_list[0]
# store date
date = text_list[1]
# store summary
summary = text_list[2]

In [17]:
title

'Mental health therapy in the developing world'

In [18]:
date

'January 23, 2023'

In [19]:
summary

'Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.'
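The split above assumes that every news block has exactly three lines. A slightly more defensive version might look like the sketch below. The function name `parse_news_text` is a hypothetical helper invented for this example, and the padding behavior is one possible design choice rather than something the website guarantees.

```python
def parse_news_text(text):
    """Split one news block into (title, date, summary)."""
    # drop empty lines and surrounding whitespace
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # pad with empty strings so a short block does not raise IndexError
    lines += [""] * (3 - len(lines))
    title, date = lines[0], lines[1]
    # join any remaining lines so a multi-line summary is kept whole
    summary = " ".join(lines[2:])
    return title, date, summary


sample = (
    "Mental health therapy in the developing world\n"
    "January 23, 2023\n"
    "Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana."
)
print(parse_news_text(sample))
print(parse_news_text("Only a title"))
```

Returning empty strings for missing parts keeps the title, date, and summary aligned when the results are later collected into lists.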

### A note: CSS selectors

**CSS selectors** are another language for selecting elements in HTML, with slightly different conventions from XPath.

Sometimes, it may be more convenient to use CSS selectors. 

For example, if the HTML element of choice is 
`<a href="/journals">Journals</a>`,
we might come up with a CSS selector of 

`a[href='/journals']`

to select elements with tag `<a>` whose `href` attribute equals the text string `'/journals'`.

# Python lists

A Python list is a data structure that stores an ordered series of items, such as numbers or strings.

In [None]:
# initiate an example list
test = []
test

In [None]:
# define list by entering data
test = [1, 2, 3]
# inspect
test

In [None]:
# append an element to the end of a list
test.append(4)
# inspect
test

In [None]:
# define list by entering text
test = ['a', 'b', 'c']
# inspect
test

In [None]:
# append an element to the end of a list
test.append('d')
# inspect
test

Now we can prepare lists to store collected text data

In [30]:
# initiate empty lists to store result
title = []
date = []
summary = []

# `for` loops in Python

A `for` loop in Python starts with a line ending in `:`, followed by a body indented by four spaces. There is no `end` keyword; the loop ends where the indentation ends.
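A minimal sketch of how indentation marks the loop body, using three of the dates collected earlier as sample data (the variable names `dates` and `years` are chosen only for this example):

```python
dates = ['January 23, 2023', 'January 10, 2023', 'December 27, 2022']
years = []

for d in dates:
    # indented lines belong to the loop body and run once per item
    year = d.split(', ')[1]
    years.append(year)

# this line is not indented, so it runs once, after the loop has finished
print(years)
```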

In [28]:
news_element = news_elements[0]
text_list = news_element.text.split('\n')
text_list[0]

'Mental health therapy in the developing world'

In [31]:
# loop through all elements found previously in news_elements
for news_element in news_elements:
    # split text from each element
    text_list = news_element.text.split('\n')
    # append title
    title.append(text_list[0])
    # append date
    date.append(text_list[1])
    # append summary
    summary.append(text_list[2])

In [32]:
# inspect
title

['Mental health therapy in the developing world',
 'Teacher diversity in the classroom',
 '2022 in Research Highlights',
 'How good is popular financial advice?',
 'Parental investment and educational attainment',
 'The costs of cultural traditions',
 'The economic origins of self-governance',
 'Fundraising Appeals and the Lift/Shift Question',
 'Local elections in China',
 'School bullying, cyberbullying, and remote learning']

In [34]:
# inspect
date

['January 23, 2023',
 'January 10, 2023',
 'December 27, 2022',
 'December 13, 2022',
 'November 28, 2022',
 'November 15, 2022',
 'November 2, 2022',
 'October 20, 2022',
 'October 3, 2022',
 'September 19, 2022']

In [33]:
# inspect
summary

['Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.',
 'Does exposure to same-race teachers help students in the long run?',
 'Economists addressed the returns to an economics degree, why workers in bigger cities have higher wages, grade inflation, and more.',
 'James Choi discusses how advice from popular financial authors differs from the theories and models of academic economists.',
 'How does family background affect college completion?',
 'Eduardo Montero and Dean Yang discuss the long-term economic consequences of religious festivals in Mexico that coincide with peak planting or harvest months.',
 'How did commerce in England influence parliamentary representation?',
 'Sarah Smith discusses the impact of major fundraising interventions on overall charitable donations.',
 'Why do autocracies sometimes use democratic institutions?',
 'Andrew Bacher-Hicks and Joshua Goodman discuss the effects of the pandemic on bullying.']

# Python package: pandas

Pandas is a Python library for data analysis. It provides a large set of tools for data processing.

Here, we will use pandas to organize our lists of results into an Excel file.
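Before building the table, it can help to confirm that the three lists line up row by row. The sketch below uses only built-in Python and two sample entries. The name `rows` is an assumption made for this example.

```python
column_names = ["title", "date", "summary"]

# two sample entries, copied from the results collected above
title = ["Mental health therapy in the developing world",
         "Teacher diversity in the classroom"]
date = ["January 23, 2023", "January 10, 2023"]
summary = ["Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.",
           "Does exposure to same-race teachers help students in the long run?"]

# every list must have the same length, otherwise rows would be misaligned
if not (len(title) == len(date) == len(summary)):
    raise ValueError("title, date, and summary must have the same length")

# zip pairs up the i-th item of each list; each row becomes one dictionary
rows = [dict(zip(column_names, values)) for values in zip(title, date, summary)]

for row in rows:
    print(row["date"], "|", row["title"])
```

A list of dictionaries like `rows` can also be passed directly to `pd.DataFrame`, as an alternative to filling the columns one at a time.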

In [45]:
### set output excel file path and file name
xls_name = "/Users/lhe/Downloads/research_highlights.xlsx"

# set column names for the data frame
column_names = ["title", "date", "summary"]

### create data frame using scraped and updated information
# create full set of column names
result_list_new = pd.DataFrame(columns = column_names)
# store results into df
result_list_new["title"] = title
result_list_new["date"] = date
result_list_new["summary"] = summary
# shift the df index so that it starts at 1 instead of 0
result_list_new.index += 1 

# inspect
result_list_new


Unnamed: 0,title,date,summary
1,Mental health therapy in the developing world,"January 23, 2023",Nathan Barker discusses the impact of cognitiv...
2,Teacher diversity in the classroom,"January 10, 2023",Does exposure to same-race teachers help stude...
3,2022 in Research Highlights,"December 27, 2022",Economists addressed the returns to an economi...
4,How good is popular financial advice?,"December 13, 2022",James Choi discusses how advice from popular f...
5,Parental investment and educational attainment,"November 28, 2022",How does family background affect college comp...
6,The costs of cultural traditions,"November 15, 2022",Eduardo Montero and Dean Yang discuss the long...
7,The economic origins of self-governance,"November 2, 2022",How did commerce in England influence parliame...
8,Fundraising Appeals and the Lift/Shift Question,"October 20, 2022",Sarah Smith discusses the impact of major fund...
9,Local elections in China,"October 3, 2022",Why do autocracies sometimes use democratic in...
10,"School bullying, cyberbullying, and remote lea...","September 19, 2022",Andrew Bacher-Hicks and Joshua Goodman discuss...


In [None]:

# save file to Excel (freeze the top row and the first two columns)
result_list_new.to_excel(xls_name,index=True,freeze_panes=(1,2))

# Python package: BeautifulSoup

BeautifulSoup is a python package that processes HTML source code *quickly*.

In [36]:
# pass the HTML source code of the page from selenium to BeautifulSoup (BS4)
page_source = driver.page_source
# parse page source with BS4
page_soup = soup(page_source, "html.parser")

In [38]:
div = page_soup.find("div", {"class": "news-text"})
div

<div class="news-text">
<img alt="" class="news-thumbnail" src="/content/file?id=18007"/>
<h4>
<a href="/research/cognitive-behavioral-therapy-ghana">Mental health therapy in the developing world</a> </h4>
<h6 class="published-at">January 23, 2023</h6>

                    
                    Nathan Barker discusses the impact of cognitive behavioral therapy in rural Ghana.
                    </div>

In [40]:
# initiate empty lists to store result
title = []
date = []
summary = []

In [41]:
# loop through all `<div>` elements with class "news-text"
for div in page_soup.find_all("div", {"class" : "news-text"}):
    title.append(div.find("a").text.strip())
    date.append(div.find("h6", {"class" : "published-at"}).text.strip())
    summary.append(div.contents[-1].strip())

In [None]:
### set output excel file path and file name
xls_name = "/Users/lhe/Downloads/research_highlights.xlsx"

# set column names for the data frame
column_names = ["title", "date", "summary"]

### create data frame using scraped and updated information
# create full set of column names
result_list_new = pd.DataFrame(columns = column_names)
# store results into df
result_list_new["title"] = title
result_list_new["date"] = date
result_list_new["summary"] = summary
# shift the df index so that it starts at 1 instead of 0
result_list_new.index += 1 

# inspect
result_list_new


Unnamed: 0,title,date,summary
1,Mental health therapy in the developing world,"January 23, 2023",Nathan Barker discusses the impact of cognitiv...
2,Teacher diversity in the classroom,"January 10, 2023",Does exposure to same-race teachers help stude...
3,2022 in Research Highlights,"December 27, 2022",Economists addressed the returns to an economi...
4,How good is popular financial advice?,"December 13, 2022",James Choi discusses how advice from popular f...
5,Parental investment and educational attainment,"November 28, 2022",How does family background affect college comp...
6,The costs of cultural traditions,"November 15, 2022",Eduardo Montero and Dean Yang discuss the long...
7,The economic origins of self-governance,"November 2, 2022",How did commerce in England influence parliame...
8,Fundraising Appeals and the Lift/Shift Question,"October 20, 2022",Sarah Smith discusses the impact of major fund...
9,Local elections in China,"October 3, 2022",Why do autocracies sometimes use democratic in...
10,"School bullying, cyberbullying, and remote lea...","September 19, 2022",Andrew Bacher-Hicks and Joshua Goodman discuss...


In [None]:

# save file to Excel (freeze the top row and the first two columns)
result_list_new.to_excel(xls_name,index=True,freeze_panes=(1,2))