## Tutorial contents
**Part 1: Introduction**
* [What is web scraping?](#What-is-web-scraping?)
* [Libraries overview](#Libraries)

**Part 2: Basic web scraping**
* [Making web requests](#Making-web-requests)
* [Intro to HTML](#Intro-to-HTML)
* [Parsing HTML, selecting and extracting relevant info](#Parsing-HTML)
* [Building a database out of the scraped data](#Saving-data)

**Part 3: Advanced topics**
* [Browser emulation](#Browser-emmulation)

**Part 4: Other advice**
* [Legal and ethical issues](#Legal-and-ethical-concerns)
* [Boxed solutions - Boilerplate and other methods](#Boxed-solutions)


# Part I: Introduction
## What is web scraping?
Web scraping means collecting information from web pages. There are multiple ways of doing this, such as writing your own scraper, using APIs, or using boxed software and tools. Today we will focus on the first approach, but I will provide resources that cover the other two.

## Libraries
There are multiple useful libraries for web scraping. The main ones that we'll be using today are [Requests](http://docs.python-requests.org/en/master/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), and [Selenium WebDriver](http://selenium-python.readthedocs.io/).


In [None]:
import requests
from bs4 import BeautifulSoup
import selenium

# Part II: Basic web scraping
## Making web requests
When you open a web page, your browser sends a request to a web server, asking for the information to be displayed. The server in return sends code (usually HTML) to your browser. The browser parses that code into a Document Object Model (DOM), a tree of elements that it then renders as the page you see.
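
To make the idea of a request more concrete, here is a minimal sketch that builds a request with Requests without sending it, so we can look at what would go to the server. The URL, the query parameter and the User-Agent string are illustrative assumptions, not part of the tutorial's target site.

```python
import requests

# Build (but do not send) a GET request so we can inspect it.
# The URL, query parameter and User-Agent below are made up for illustration.
request = requests.Request(
    method="GET",
    url="https://www.example.com/search",
    params={"q": "scraping"},
    headers={"User-Agent": "tutorial-scraper/0.1"},
)
prepared = request.prepare()

print(prepared.method)         # the HTTP verb
print(prepared.url)            # the full URL, including the encoded query string
print(dict(prepared.headers))  # the headers that would accompany the request
```

Calling `requests.get(...)`, as we do next, builds and sends a request like this in one step.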


Let's make a request for the IOC webpage. 

In [None]:
#Specify the url
url = "https://www.iocsummerschoolexeter.org/"
# Query the website and return the html. 
# Save the html as a new variable page.  
page = requests.get(url)

We can inspect the content of the page object, which is what we'll be working on later. 

In [None]:
page.content[0:3000]

## Intro to HTML
The response looks intimidating, but once you understand basic HTML, things look a lot easier. Let's have a look at what this code would look like for a very simple page. Go to www.example.com, right-click anywhere on the page, and select "View page source".
You need to be aware of a few simple elements of HTML syntax.
Each tag marks out a block of the page:
1. `<!DOCTYPE html>`: HTML documents must start with a type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. The meta and script declarations of the HTML document are between `<head>` and `</head>`.
4. The visible part of the HTML document is between the `<body>` and `</body>` tags.
5. Title headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.

HTML tags sometimes come with id or class attributes. The id attribute gives a tag an identifier whose value must be unique within the HTML document. The class attribute groups tags, typically so that they share the same styles, and many tags can have the same class. We can make use of these ids and classes to help us locate the data we want.
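
As a small illustration of these tags and attributes, the sketch below parses a made-up page with BeautifulSoup (covered in the next section) and looks elements up by id and by class. The page content, the id `page-title` and the class `person` are invented for this example.

```python
from bs4 import BeautifulSoup

# A made-up page used only to illustrate the tags and attributes described above.
html = """<!DOCTYPE html>
<html>
  <head><title>Staff list</title></head>
  <body>
    <h1 id="page-title">Our staff</h1>
    <p class="person">Ada Example</p>
    <p class="person">Ben Sample</p>
    <p>Contact: <a href="mailto:info@example.com">email us</a></p>
  </body>
</html>"""

soup_demo = BeautifulSoup(html, "html.parser")

heading = soup_demo.find(id="page-title")          # an id identifies one element
people = soup_demo.find_all("p", class_="person")  # a class can be shared by many

print(heading.get_text())
print([p.get_text() for p in people])
```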

## Parsing HTML
The next step is to parse the HTML and get the info that is relevant to us.

In [None]:
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page.text, 'html.parser')
soup

In [None]:
# Get the page title
soup.title

In [None]:
soup.title.string
# Returns the content inside a title tag as a string

In [None]:
soup.title.parent.name

In [None]:
soup.a
# Returns the first matching anchor tag


In [None]:
# Or we could use find_all to return all the matching anchor tags
soup.find_all('a')


In [None]:
# Or, to get just the link targets in a nicer format:
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
# Or, to get only the links whose href contains a certain text
import re
for link in soup.find_all('a', href=re.compile("mailto")):
    print(link.get('href'))
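
The `href` values returned this way are often relative paths, and some anchor tags have no `href` at all. A minimal sketch of one way to handle both cases, using `urljoin` from the standard library, is shown here. The base URL and the HTML snippet are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed address of the page the links were found on (illustrative only).
base_url = "https://www.example.com/people/"
html = """
<a href="/about">About</a>
<a href="staff.html">Staff</a>
<a href="https://www.example.org/">External</a>
<a name="top">No href here</a>
"""
demo = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually have an href attribute;
# urljoin turns relative paths into absolute URLs and leaves absolute ones alone.
absolute_links = [
    urljoin(base_url, a["href"])
    for a in demo.find_all("a", href=True)
]
print(absolute_links)
```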

# Part III: Advanced topics

## Browser emmulation
Sometimes websites make it hard (or impossible) for you to scrape them using these methods, for example when the content is only filled in by JavaScript after the page has loaded.

Sometimes you have to pretend that you are human!
Selenium allows you to do that by driving a real browser.

Let's first download [ChromeDriver](https://chromedriver.storage.googleapis.com/index.html?path=2.40/), save it and unzip it to a location of your choice. You will have to remember that location and use it in place of the path in the code below:

In [None]:
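from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import datetime
import time
import random
#####################################
# Setup
#####################################
# Path to the ChromeDriver executable; replace it with your own location
chromedriver = 'C:/ProgramData/Anaconda3/chromedriver.exe'
# Load browser. Selenium 4 takes the driver path through a Service object;
# on Selenium 3, use webdriver.Chrome(executable_path=chromedriver) instead.
browser = webdriver.Chrome(service=Service(chromedriver))
# Go to the UoE Q-Step page
url = 'http://socialsciences.exeter.ac.uk/q-step/'
browser.get(url)
# Pause for a few seconds so the page can finish loading
time.sleep(random.uniform(3,4))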

### Intro to XPaths
XPaths provide a useful way to navigate elements in a DOM hierarchy. They provide the "path" (or address) to particular objects in the DOM. Some useful features include the ability to select an object by id, select all objects of a specific type, extract text, and select all elements that contain a certain object. Let's have a look at the page source for this.
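
To get a feel for these expressions without opening a browser, here is a minimal sketch using the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment. Note the assumptions: ElementTree supports only a subset of XPath and its paths start with `.`, whereas Selenium hands the expression to the browser's full XPath engine, so there the same ideas are written starting with `//`. The ids `nav` and `staff` and all the content are invented.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment standing in for the browser's DOM.
page_source = """
<html>
  <body>
    <ul id="nav">
      <li><a href="/home">Home</a></li>
      <li><a href="/about">About</a></li>
      <li><a href="/people">People</a></li>
    </ul>
    <table id="staff">
      <tr><th>Name</th><th>Position</th></tr>
      <tr><td>Ada Example</td><td>Lecturer</td></tr>
    </table>
  </body>
</html>
""".strip()
root = ET.fromstring(page_source)

# Select by id, then walk down: the link inside the third <li> of the nav list.
third_link = root.find(".//*[@id='nav']/li[3]/a")
print(third_link.text, third_link.get("href"))

# Select all elements of one type under a node: the header cells of the first row.
header_cells = root.findall(".//*[@id='staff']/tr[1]/th")
print([cell.text for cell in header_cells])

# Select every element that contains a certain child: rows that have a <td>.
data_rows = root.findall(".//tr[td]")
print(len(data_rows))
```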

In [None]:
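# Let's look at some other sections
# find_element("xpath", ...) works on Selenium 3 and 4; the older
# find_element_by_xpath helper was removed in recent Selenium 4 releases
browser.find_element("xpath", '//*[@id="exeter-nav"]/li[3]/a').click()

In [None]:
# Expand an element on the page
browser.find_element("xpath", '//*[@id="ac0"]/div[2]/div[1]/h4/a').click()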

We want to save all the names, positions and departments of academic staff. 

In [None]:
# Create a table object
table = browser.find_element("xpath", '//*[@id="ac0a1"]/div/table')

In [None]:
# Get the table column names from the header row
head = browser.find_element("xpath", '//*[@id="ac0a1"]/div/table/tbody/tr[1]')
col_names = [th.text for th in head.find_elements("tag name", 'th')]
col_names

In [None]:
# Let's store all the info into a Pandas data frame. 
import pandas as pd
df = pd.DataFrame(columns=col_names)

In [None]:
# Let's get the row info
body = table.find_element("tag name", 'tbody')
file_data = []
body_rows = body.find_elements("tag name", 'tr')
# Skip the first row, which holds the column headers
for row in body_rows[1:]:
    cells = row.find_elements("tag name", 'td')
    file_data.append([cell.text for cell in cells])
# DataFrame.append was removed in pandas 2.0, so build the frame directly from the rows
df = pd.DataFrame(file_data, columns=col_names)

In [None]:
df

## Saving data
Save the table we just created to a CSV file.

In [None]:
df.to_csv("staff_info.csv", index=False)

## Exercise (20 minutes): 
1. Extract the names, positions and departments of ALL members of staff and save them to a csv file. 

You can select buttons, elements in drop down menus, and insert text in fields. 

In [None]:
# Go to contact page
browser.find_element("xpath", '//*[@id="exeter-nav"]/li[11]/a').click()

In [None]:
# Select the find an academic button
browser.find_element("xpath", '//*[@id="searchoption2"]').click()

In [None]:
# Write query
text_field = browser.find_element("xpath", '//*[@id="query"]')
text_field.send_keys("banducci")

In [None]:
# Click send
browser.find_element("xpath", '//*[@id="right-contents"]/form/div/div/button').click()


In [None]:
# Go back
browser.back()

# Part IV: Further advice
## Legal and ethical concerns

Rules of conduct:
1. Respect the site's robots.txt file, and check the website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
2. Be kind to the servers: do not overburden them, and build delays into your scripts. Requesting data too aggressively (also known as spamming) may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. Make sure you follow your university's institutional review board and ethics committee guidelines.
4. Follow the laws, rules and regulations related to privacy, data use, property rights, etc.
5. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
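
The first two rules can be sketched in code. The example below is a minimal illustration, not a complete crawler: it uses the standard library's `urllib.robotparser` on a made-up robots.txt and pauses between requests. The robots.txt content, the URLs, the `my-scraper` user agent and the helper names `allowed_urls` and `polite_visit` are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real script would load the site's own file
# with rp.set_url("https://<site>/robots.txt") followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(urls, robots_lines, user_agent="my-scraper"):
    """Return only the URLs that robots.txt permits for this user agent."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return [u for u in urls if rp.can_fetch(user_agent, u)]

def polite_visit(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be kind to the server
        results.append(fetch(url))
    return results

urls = [
    "https://www.example.com/staff/",
    "https://www.example.com/private/data.html",
]
ok = allowed_urls(urls, ROBOTS_TXT.splitlines())
print(ok)
```

In a real script, `fetch` could be something like `requests.get`, and the delay could be randomised as in the Selenium example earlier.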