This notebook illustrates how to use Python to assist data collection for academic research. 

The collected data should only serve academic research purposes. 

One should never post the collected data on publicly accessible websites without explicit consent from the original website. 

One should never profit financially from the collected data. 

# Install required packages (only need to run the first time)

In [None]:
# First: install Anaconda Python: https://www.anaconda.com/products/individual

# install necessary packages
## package: selenium for navigating webpages using Google Chrome
# Note: also need to download chromedriver to hard drive
!pip install selenium
## package: beautifulsoup4 (imported as bs4) for processing HTML source code
!pip install beautifulsoup4
## package: pandas for processing data
!pip install pandas
## package: openpyxl, an Excel writer that pandas can use to save .xlsx files
!pip install openpyxl
## note: time and os are part of the Python standard library and need no installation

# Import required packages

In [None]:
from selenium import webdriver

from time import sleep

from bs4 import BeautifulSoup as soup

import pandas as pd

import os.path

# Python selenium with chromedriver

We will use Python selenium to control a chromedriver application to navigate websites.

The chromedriver is open-source software that needs to be downloaded to the hard drive. Link: https://chromedriver.chromium.org/downloads

Note that:
1. the user needs administrator access on the computer
2. the version of the chromedriver needs to match the version of the Google Chrome browser installed on the computer; the chromedriver may need to be replaced after the Chrome browser updates

After downloading the chromedriver to local disk, change the path and run:

In [None]:
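# import the Service wrapper that carries the driver path (Selenium 4 and later;
# Selenium 3 accepted the path directly as the first argument of webdriver.Chrome)
from selenium.webdriver.chrome.service import Service

# open browser window (on Windows)
if os.name == 'nt':
    driver = webdriver.Chrome(service=Service("/Users/lhe/Dropbox/Utilities/chromedriver.exe"))
# open browser window (on Mac)
else:
    driver = webdriver.Chrome(service=Service("/Users/lhe/Dropbox/Utilities/chromedriver"))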

Set the URL and load it:

In [None]:
# set url (as a text string)
url = "https://www.aeaweb.org/research?path=research&page=1&per-page=10#list"
# load url
driver.get(url)
# notes: 
## driver is an object, the webdriver controlled by selenium
## get is a method (think of it as a function) defined in the selenium package

# HTML tags and scripts

We need to understand the basic structure of HTML before we can tell the Python packages which parts of a page to work on.

In [None]:
# set url
url = "https://en.wikipedia.org/wiki/HTML"
# load url
driver.get(url)

### HTML content is structured around elements

In [None]:
# set url
url = "https://en.wikipedia.org/wiki/HTML_element"
# load url
driver.get(url)

It is essential to understand the following concepts:
* elements
* tags
* attributes
* content

In [None]:
# set url
url = "https://en.wikipedia.org/wiki/HTML_element#Syntax"
# load url
driver.get(url)

Think of **elements** as containers on a web page.

Right click in browser, and choose: "Inspect".

Most **elements** are marked by **tags** at the beginning and the end of an element. 

Each tag is marked by angle brackets, such as `<a>`

**Attributes** are name-value pairs inside the opening tag that control the element's behavior. 

For example, `<a href="/wiki/HTML" title="HTML">HTML</a>` has two attributes, `href` and `title`.

The **contents** are usually the visible information on the web page, such as texts or images.
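
To make these terms concrete, the sketch below feeds the `<a>` example from above to Python's built-in `html.parser` module and records the tag, attributes, and content it reports. The class name `ElementRecorder` is made up for this illustration, and attaching text to the most recently opened tag is a simplification that only suits flat snippets like this one.

```python
from html.parser import HTMLParser


class ElementRecorder(HTMLParser):
    """Record the tag, attributes, and content of each element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.records.append({"tag": tag, "attributes": dict(attrs), "content": ""})

    def handle_data(self, data):
        # simplification: attach text to the most recently opened tag
        if self.records:
            self.records[-1]["content"] += data


parser = ElementRecorder()
parser.feed('<a href="/wiki/HTML" title="HTML">HTML</a>')
element = parser.records[0]

print(element["tag"])         # the tag name
print(element["attributes"])  # the attributes as a dictionary
print(element["content"])     # the visible content
```

Here the tag is `a`, the attributes are `href` and `title`, and the content is the visible text `HTML`.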

# Navigate and retrieve information using Selenium

Now we return to the AEA website:

In [None]:
# set url (as a text string)
url = "https://www.aeaweb.org/research?path=research&page=1&per-page=10#list"
# load url
driver.get(url)

## Task 1: click the "Journals" link on top of the page

The main challenge is that we need to "tell" selenium which element contains the link we want to click.

For this, we need to pass on some information about the element of choice. 

To find the features that identify an element, we right click *on* the element in the browser, and select "Inspect" to see the HTML source code of the element of choice. 

To identify the element of choice, we need some additional "selector language" to pin down the element in the massive HTML source code. 

This takes a lot of trial and error.

### Find element by text (using Xpath)

An **Xpath selector** is an expression, written in the Xpath query language, that selects elements in an HTML document. 

To find all the `<a>` elements in a web page with the exact text of `Journals`, the Xpath would be:

`//a[text()='Journals']`

One good tutorial is: https://www.w3schools.com/xml/xpath_intro.asp

Try your selector by:
1. click in the source code pane in Chrome
2. hit "control + F" (windows) or "cmd + F" (mac) to search
3. type/paste in your selector, see 
    * if chrome can find it, and 
    * how many elements it found
4. if you found the desired elements, you are good to proceed
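
A selector can also be tried offline on a small snippet. The sketch below uses Python's built-in `xml.etree.ElementTree`, which understands only a subset of Xpath: it has no `text()` function, so `[.='Journals']` stands in for `[text()='Journals']`. The snippet is a made-up navigation bar, not the actual AEA page source.

```python
import xml.etree.ElementTree as ET

# hypothetical, well-formed snippet standing in for a page's navigation bar
snippet = """
<nav>
  <a href="/journals">Journals</a>
  <a href="/conference">Annual Meeting</a>
  <a href="/journals/policies">Journal Policies</a>
</nav>
"""

root = ET.fromstring(snippet)

# find every <a> whose text is exactly 'Journals'
# (ElementTree's spelling of //a[text()='Journals'])
matches = root.findall(".//a[.='Journals']")

print(len(matches))            # how many elements matched
print(matches[0].get("href"))  # the href attribute of the first match
```

Only the first link matches, because the comparison is exact: `Journal Policies` contains the word but is not equal to `Journals`.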

In [None]:
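# import By, which names the selector type (Selenium 4 and later)
from selenium.webdriver.common.by import By
# ask selenium to return elements by tag and contained text using Xpath
link_elements = driver.find_elements(By.XPATH, "//a[text()='Journals']")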
# Notice: we used "" outside to mark the entire string, and use '' to mark the internal text for the xpath

# inspect the returned elements (a python list)
link_elements

In [None]:
# inspect how many elements were found
len(link_elements)

In [None]:
# retrieve the first element
link = link_elements[0]
# inspect the link
link

### Click an element

In [None]:
# click on the link
link.click()

## Task 2: Retrieve text information by tag and attributes

### Find element by tag and attributes (using Xpath)

Sometimes we need to identify elements of interest using tags and attributes.

For example, each news item is contained in a `div` with a specific class attribute.

So we try the xpath:

`//div[@class='news-text']`

In [None]:
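# import By, which names the selector type (Selenium 4 and later)
from selenium.webdriver.common.by import By
# ask selenium to return elements by tag and attribute using Xpath
news_elements = driver.find_elements(By.XPATH, "//div[@class='news-text']")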
# inspect the returned elements (a python list)
news_elements
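
To see what an attribute selector matches without a live page, the sketch below runs the same pattern on a small snippet with Python's built-in `xml.etree.ElementTree`. The HTML is hypothetical and only mimics the idea of news items held in `div` elements; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# hypothetical snippet: three news-like items and one unrelated <div>
snippet = """
<section>
  <div class="news-text"><a href="/a">First title</a></div>
  <div class="sidebar">Not a news item</div>
  <div class="news-text"><a href="/b">Second title</a></div>
  <div class="news-text featured"><a href="/c">Third title</a></div>
</section>
"""

root = ET.fromstring(snippet)

# select <div> elements whose class attribute equals 'news-text'
news = root.findall(".//div[@class='news-text']")

# read the link text inside each matched <div>
titles = [div.find("a").text for div in news]
print(titles)
```

The third item is not matched, because `@class='news-text'` compares the whole attribute value and `news-text featured` is a different string. In the browser and in Selenium, `contains(@class, 'news-text')` is a common workaround for elements with several classes; ElementTree does not support that function.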

In [None]:
# inspect how many elements were found
len(news_elements)

In [None]:
# store the first element
news_element = news_elements[0]
# return its text
news_element.text

We can now use Python's `.split` method to separate the text into three parts.

In [None]:
# split the returned text by `\n`, which is linebreak, into a list
text_list = news_element.text.split('\n')
# inspect
text_list
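
The same split can be tried on a plain string. The sample text below is invented and only imitates the three-line shape (title, date, summary) that the element text is expected to have.

```python
# hypothetical text, shaped like the output of news_element.text
sample_text = "A sample headline\nJanuary 1, 2021\nA one-sentence summary of the article."

# split on the linebreak character into a list of lines
sample_list = sample_text.split("\n")
print(len(sample_list))

# unpack the list into three names; this works only when there are exactly three parts
sample_title, sample_date, sample_summary = sample_list
print(sample_title)
print(sample_date)
print(sample_summary)
```

If an element's text has more or fewer than three lines, the unpacking raises a `ValueError`, so it is worth checking `len(...)` first on real pages.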

In [None]:
# store title
title = text_list[0]
# store date
date = text_list[1]
# store summary
summary = text_list[2]

In [None]:
title

In [None]:
date

In [None]:
summary

### A note: CSS selectors

**CSS selectors** are another language for selecting elements in an HTML document, with slightly different conventions from Xpath.

Sometimes, it may be more convenient to use CSS selectors. 

For example, if the HTML element of choice is 
`<a href="/journals">Journals</a>`,
we might come up with a CSS selector of 

`a[href='/journals']`

to select by tag `<a>` and attribute `href` that is equal to the text string `'/journals'`
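
As a small illustration, the sketch below applies that CSS selector with BeautifulSoup's `select` method on a made-up snippet. The snippet is an assumption for demonstration, not the real page source.

```python
from bs4 import BeautifulSoup

# hypothetical snippet with two links
html = '<nav><a href="/journals">Journals</a><a href="/news">News</a></nav>'
page = BeautifulSoup(html, "html.parser")

# select <a> elements whose href attribute equals '/journals'
links = page.select("a[href='/journals']")

print(len(links))     # how many elements matched
print(links[0].text)  # the text of the first match
```

Only the link whose `href` is exactly `/journals` is returned.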

# Python lists

A Python list is a data structure that stores an ordered series of items, such as numbers or strings. 

In [None]:
# initiate an example list
test = []
test

In [None]:
# define list by entering data
test = [1, 2, 3]
# inspect
test

In [None]:
# append element to the end of a list
test.append(4)
# inspect
test

In [None]:
# define list by entering text
test = ['a', 'b', 'c']
# inspect
test

In [None]:
# append element to the end of a list
test.append('d')
# inspect
test

Now we can prepare lists to store collected text data

In [None]:
# initiate empty lists to store result
title = []
summary = []

# `for` loops in Python

A `for` loop in Python starts with a line ending in `:`, followed by a body indented by four spaces. There is no `end` keyword; the loop ends where the indentation ends.

In [None]:
# loop through all elements found previously in news_elements
for news_element in news_elements:
    # split text from each element
    text_list = news_element.text.split('\n')
    # append title
    title.append(text_list[0])
    # append summary
    summary.append(text_list[2])
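
The loop above assumes every element's text splits into at least three lines. A minimal sketch of a more defensive version, run on invented sample texts rather than live elements, could look like this (the names `sample_texts`, `titles`, and `summaries` are hypothetical):

```python
# hypothetical texts, shaped like news_element.text
sample_texts = [
    "Title one\nJanuary 1, 2021\nSummary one.",
    "Title two\nJanuary 2, 2021",  # the summary line is missing
    "Title three\nJanuary 3, 2021\nSummary three.",
]

titles = []
summaries = []

for text in sample_texts:
    parts = text.split("\n")
    # skip items that do not have title, date, and summary lines
    if len(parts) < 3:
        continue
    titles.append(parts[0])
    summaries.append(parts[2])

print(titles)
print(summaries)
```

Skipping incomplete items keeps the two lists the same length, which matters later when they become columns of one table.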

In [None]:
# inspect
title

In [None]:
# inspect
summary

# Python package: pandas

Pandas is a Python library for data analysis. It provides a large set of tools for data processing. 

Here, we will use pandas to organize our lists of results into an Excel file. 

In [None]:
### set output excel file path and file name
xls_name = "/Users/lhe/Downloads/research_highlights.xlsx"

# set column names for the data frame
column_names = ["title", "summary"]

### create data frame using scraped and updated information
# create full set of column names
result_list_new = pd.DataFrame(columns = column_names)
# store results into df
result_list_new["title"] = title
result_list_new["summary"] = summary
# shift the df index by 1 so that it starts at 1 instead of 0
result_list_new.index += 1 

# save file to Excel (freeze the top row and the first two columns)
result_list_new.to_excel(xls_name,index=True,freeze_panes=(1,2))
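
The same table can be built and inspected without writing a file. The sketch below uses invented sample lists and passes them to `pd.DataFrame` as a dictionary, which is one common way to turn equal-length lists into columns.

```python
import pandas as pd

# hypothetical results, standing in for the scraped title and summary lists
sample_titles = ["Title one", "Title two"]
sample_summaries = ["Summary one.", "Summary two."]

# each dictionary key becomes a column name; the lists must have equal length
df = pd.DataFrame({"title": sample_titles, "summary": sample_summaries})

# shift the index so that rows are numbered from 1
df.index += 1

# print as comma-separated text instead of saving to disk
print(df.to_csv())
```

If the lists differ in length, `pd.DataFrame` raises a `ValueError`, which is a useful early warning that some items were parsed incompletely.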

# Python package: BeautifulSoup

BeautifulSoup is a python package that processes HTML source code *quickly*.

In [None]:
# get the HTML source code of the current page from the selenium driver
page_source = driver.page_source
# parse page source with BS4
page_soup = soup(page_source, "html.parser")

In [None]:
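# reset the lists so results are not appended on top of the earlier Selenium results
title = []
summary = []
# loop through all `<div>` elements whose class is "news-text"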
for div in page_soup.find_all("div", {"class" : "news-text"}):
    title.append(div.find("a").text.strip())
    summary.append(div.contents[-1].strip())
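
To see what `div.find("a")` and `div.contents[-1]` return, the sketch below runs the same pattern on a small static snippet. The HTML structure is an assumption made for illustration (a link, a date, then loose summary text); the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# hypothetical snippet: each news item holds a link, a date, and trailing summary text
html = """
<div class="news-text">
  <a href="/research/one">First title</a>
  <span class="date">January 1, 2021</span>
  First summary.
</div>
<div class="news-text">
  <a href="/research/two">Second title</a>
  <span class="date">January 2, 2021</span>
  Second summary.
</div>
"""

page = BeautifulSoup(html, "html.parser")

rows = []
for div in page.find_all("div", {"class": "news-text"}):
    link = div.find("a")
    rows.append({
        "title": link.get_text(strip=True),
        "href": link.get("href"),
        # the last child of the <div> is the loose text after the date
        "summary": str(div.contents[-1]).strip(),
    })

for row in rows:
    print(row)
```

Wrapping the last child in `str(...)` before calling `.strip()` is a precaution: if the last child happened to be a tag rather than plain text, calling `.strip()` on it directly would fail.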