<a id="top"></a>
# Introduction to Web Scraping with Python
[NaLette Brodnax](http://www.nalettebrodnax.com), Postdoctoral Fellow, [IQSS](http://iq.harvard.edu)<br>
[nbrodnax@iq.harvard.edu](mailto:nbrodnax@iq.harvard.edu)<br>

### Contents

1. [Introduction](#intro)<br>
    1.1 [What is Web Scraping?](#scraping)<br>
    1.2 [Locating and Accessing Data](#locating)
2. [About the Tools](#tools)
3. [Python Review](#python)<br>
    3.1 [Data Types](#datatypes)<br>
    3.2 [Data Structures](#datastructures)<br>
    3.3 [Control Structures](#controlstructures)<br>
    3.4 [Functions and Methods](#functions)<br>
    3.5 [Modules and Packages](#modules)<br>
4. [RAPTOR](#raptor)<br>
    4.1 [Review](#review)<br>
    4.2 [Access](#access)<br>
    4.3 [Parse](#parse)<br>
    4.4 [Transform](#transform)<br>
    4.5 [Store](#store)<br>
    
### File Downloads
[hello.py](https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/files/hello.py)<br>
[scraper.py](https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/files/scraper.py)

<a id="intro"></a>
## 1. Introduction

#### Workshop Goals

 * Review key Python syntax
 * Introduce a method for designing web scrapers
 * Provide opportunities to practice
 
If you are attending this workshop in person, please [CLICK HERE](http://bit.ly) to complete a one-minute survey.

<a id="scraping"></a>
### What is Web Scraping?
Web scraping is a set of techniques for extracting information from the web and transforming it into structured data that we can store and analyze.  

There are many tools that you can use to scrape the web and many different types of web scraping programs.  For this workshop, we will use a general-purpose programming language called Python.  Python is easy to learn, highly readable, powerful, and flexible.  Two versions are widely in use, Python 2.7 and Python 3, and Python 3 is not backward compatible with Python 2.7.  All examples and exercises use Python 3.6, the most recent stable release when this workshop was written.

#### When should you scrape the web?
Web scraping should help you be more **efficient**. There are many ways to collect data and building a web scraper is not always the most efficient approach.  For example, if you want the text of a single blog post, you could copy and paste the text directly from the webpage into a text file.  However, if you wanted to get copies of every blog post on a website for the past two years, you could save a considerable amount of time by building a web scraper to crawl around the site and collect the data for you.


 
[Back to the top](#top)

----------

<a id="locating"></a>
### Locating and Accessing Data

A simple model of locating and processing data.

<img src="https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/images/model.png" width="500" title="Locating and Processing Data">


#### Example 1: Browse and open a CSV file

<img src="https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/images/example1.png" width="500" title="Browse and open a CSV file">

The program and data are located on the same computer.

#### Example 2: View a web page

<img src="https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/images/example2.png" width="500" title="Browse and view a web page">

The program and data are located on different computers.


### Goal: Scrape a static web page

A basic web scraper, which we will cover today, extracts data from a static website, meaning the content of the website is sitting in a file on a webserver and it only changes when the author modifies the file.  

A more advanced web scraper might collect data from a dynamic website that is rendered while you interact with it.  For example, your Facebook feed isn’t a static file; as you interact with the website, it uses an algorithm to query information from a database and then display that information on your feed in real time.

<img src="https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/images/goal.png" width="500">

  

A third category of web scraping requests data through an API, or application programming interface.  An API is a set of protocols that allow different applications on the web to exchange information.  For example, if you’ve encountered a website that lets you create an account using your Google or Facebook login information, that website is asking your permission to acquire your credentials from the Google or Facebook API and copy them to set up your new account. 
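To make the static-page case concrete, here is a minimal sketch of turning HTML into structured data. It is an illustration under stated assumptions, not the workshop's scraper: the page content (`PAGE`) and the class name `LinkCollector` are hypothetical, and the sketch uses the standard library's `html.parser` module so that it runs without installing anything or touching the network.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this text first.
PAGE = """
<html><body>
  <h1>Blog</h1>
  <a href="/posts/1">First post</a>
  <a href="/posts/2">Second post</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href and text of every <a> tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current = {'href': dict(attrs).get('href'), 'text': ''}

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self.links.append(self._current)
            self._current = None


collector = LinkCollector()
collector.feed(PAGE)
for link in collector.links:
    print(link['href'], link['text'])
```

The result is a list of dictionaries, one per link, which is the kind of structured data that can later be written to a file and analyzed.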
 
[Back to the top](#top)

----------

<a id="tools"></a>
## 2. About the Tools 

#### Python Interpreter for Conversing in Python
The Python **interpreter** is an interface between you and the underlying software that translates and responds to your Python commands.  When using the interpreter, you type Python commands directly into a console (window) and the console displays a response.  You may hear people refer to an interpreter as an **interactive shell** or a **REPL** (Read-Evaluate-Print Loop).  You can install Python, which comes with the interpreter, from the [Python Software Foundation](https://www.python.org/). 

#### Text Editor for Writing Programs
A text editor is a tool that you can use to write commands in plain, unprocessed text and save them in a script.  Most operating systems include a basic text editor, such as Notepad or TextEdit.  Filenames of Python scripts end in *.py*.

#### Command Line Interface for Interacting with Your OS
Every operating system has a special interpreter that allows the programmer to interact with the system through text-based commands.  The **command line interface** can be used to access any programs or files available to the programmer. This text-based approach is generally more flexible and more powerful than using a graphical user interface. 

### _Mise En Place_: Integrated Development Environments
Software that combines an interpreter, text editor, and command line interface into an easily navigable user interface is called an **integrated development environment (IDE)**.  An IDE might also include a file browser, version control integration, or other tools. If you install Python using installers from the [Python Software Foundation](https://www.python.org/), the installation will include a basic IDE called IDLE.

#### Anaconda
[Anaconda](https://www.continuum.io/what-is-anaconda) is a free data science platform that you can [download](https://www.anaconda.com/downloads/) from Continuum Analytics.  Anaconda bundles Python with several programs and tools, including an IDE called Spyder and many of the most popular Python libraries.  Anaconda is useful because it provides several tools that programmers would otherwise need to install and configure individually.  It provides both a command line utility called **`conda`** and a graphical user interface called **`anaconda-navigator`** for programmers who are less familiar or less comfortable using the command line.  

### Python Notebooks

This webpage allows you to interact with a Python interpreter running on a web server rather than on your machine.  Web-based tools that mix prose with chunks of executable code are called **notebooks**.  This [Jupyter Notebook](http://jupyter.org/) runs Python, but notebooks can be configured for different programming languages.  Notebooks are useful for exploration and documentation and some consider them a viable alternative to IDEs.  I recommend using notebooks when working with Python interactively and using an IDE for larger projects.

To run the following code, click inside the cell and then click the Run Cell button from the toolbar or Cell menu at the top of the page.  The output of the code will appear below the cell.

In [None]:
print('Hello, world.')

This tutorial assumes that you have some familiarity with Python, although we will provide a review in the next section.  As you work through the tutorial, you can add chunks of code to your own script.  All examples from the Python review in Section [3](#python) can be found in the [hello.py](https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/files/hello.py) file.  The code for the web scraper that you will build in Section [4](#raptor) can be found in the [scraper.py](https://raw.githubusercontent.com/nmbrodnax/iqss-python-scrape/master/files/scraper.py) file.

### Getting Help

There are many resources for getting help when you run into problems.  The Python Software Foundation provides detailed [documentation](https://docs.python.org/3/) as well as [tutorials](https://wiki.python.org/moin/BeginnersGuide) for beginners, installation instructions, and links to programmer forums.  The site [Stack Overflow](https://stackoverflow.com/) is also helpful for specific programming questions.  

[Back to the top](#top)

----------

<a id="python"></a>
## 3. Python Review

All programming languages share some common features, based on the most basic elements of computer architecture.  Below are the features most germane to the core elements of a researcher's workflow: data collection, management, and analysis.

<a id="datatypes"></a>
### Data Types

**Data types** are categories for storing different kinds of information in memory. Examples of data types include integers, floats (numbers with a fractional component), and characters.  Data types are analogous to word types (i.e., parts of speech) in spoken language.  Just as the English language has a system of rules for how verbs relate to nouns, Python has a system of rules governing how different types of data relate to each other.  Programming languages also use data types and data structures (discussed in the next section) to efficiently allocate the computer's working memory (RAM).
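As a small illustration of the idea, the sketch below asks Python which data type it has assigned to a few values, using the built-in `type()` function. The variable names are made up for the example.

```python
count = 3             # integer
price = 2.5           # float
title = 'Rogue One'   # string
is_sequel = False     # Boolean

for value in (count, price, title, is_sequel):
    print(value, type(value).__name__)

# Python's rules decide the result type when types are combined:
# an integer plus a float gives a float.
total = count + price
print(total, type(total).__name__)
```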

##### How To: Performing Operations
The following operators have special meanings in Python.

In [None]:
# Assignment Operators

# = assign
movie = 'Rogue One'
print(movie)

In [None]:
# Assignment Operators

# += add and assign
i = 1
i += 1
print(i)

In [None]:
# String Operators

# + concatenate
print('A' + 'B')

In [None]:
# String Operators

# * repeat
print('me'*3)

In [None]:
# Comparison Operators

# == equal
print('a' == 'a')
print('a' == 1)

In [None]:
# Comparison Operators

# != not equal
print(5 != 25/5)

[Back to the top](#top)

----------

<a id="datastructures"></a>
### Data Structures

Data structures are containers of data that are organized, formatted, and stored in a specific way.  In Python, data structures are also data types.  Strings, lists, and tuples are **sequences**, whose elements are stored in order; dictionaries are **mappings**, whose elements are looked up by key.

 - A **string** is an ordered sequence of characters, such as `'happy'`, and must be enclosed in single or double quotes.  A multi-line string must be enclosed in triple quotes.
 - A **list** is an *ordered* sequence of items, such as `['Leia', 'Rey', 'Maz']`.  A list is always contained within brackets, `[]`, and Python allows you to create an empty list.  Python has another data structure called a **tuple**, which is contained within parentheses, `()`.  Though lists and tuples are similar, the elements of a list can be modified and the elements of a tuple cannot be modified.  The technical term for whether a structure can be modified in memory is *mutability*.  
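 - A **dictionary** is a collection of key-value pairs, such as `{'name': 'Kylo', 'side': 'dark'}`.  Its values are looked up by key rather than by position, so for referencing purposes it is treated as *unordered*.  A dictionary is always contained within braces, `{}`.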
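The difference in mutability between lists and tuples can be sketched as follows. The values are illustrative only; the point is that assigning to an element works for a list and raises a `TypeError` for a tuple.

```python
rebels = ['Leia', 'Rey', 'Maz']
rebels[0] = 'Jyn'        # lists are mutable: replace an element
rebels.append('Rose')    # ...or add a new one
print(rebels)

pair = ('Kylo', 'dark')
try:
    pair[1] = 'light'    # tuples are immutable: this raises TypeError
except TypeError as error:
    print('Cannot modify a tuple:', error)
```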

##### How To: Referencing Sequences
How you reference the elements of a data structure depends on whether it is *ordered* (a sequence) or *unordered* (a dictionary).  Elements within an ordered sequence are referenced by location.  Each item in the sequence has an index number, **starting with zero**.  Elements can also be referenced starting from the end of the sequence, using negative index numbers beginning with -1.  To access a range of elements, use the colon operator to retrieve the elements from the index to the left of the colon up to *but not including* the index to the right of the colon.  You can also omit an index to retrieve all the elements in that direction.

Try to guess the output of the following commands prior to running the code.

In [None]:
mystring = 'happy'
print(mystring[0])
print(mystring[2:4])
print(mystring[3:])

mylist = ['Leia', 'Rey', 'Maz']
print(mylist[-1])

The elements of a dictionary are not stored by position, so they cannot be referenced using index numbers.  Instead, the **values** are accessed using the **keys** of the key-value pairs.

In [None]:
mydict = {'name': 'Kylo', 'side': 'dark'}
print(mydict['name'])
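Beyond looking up a single key, a few other dictionary operations come up constantly when scraping. The sketch below reuses `mydict` from the cell above; the added `'film'` entry and the `'rank'` key are made up for illustration.

```python
mydict = {'name': 'Kylo', 'side': 'dark'}

mydict['film'] = 'The Force Awakens'   # add a new key-value pair
print('film' in mydict)                # test whether a key exists

# .get() returns a default instead of raising KeyError for a missing key
rank = mydict.get('rank', 'unknown')
print(rank)

for key, value in mydict.items():      # loop over key-value pairs
    print(key, '->', value)
```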

[Back to the top](#top)

----------

<a id="controlstructures"></a>
### Control Structures

Programming languages use logic to give the programmer control over how a program executes a command or set of commands.  Within Python, control structures are defined using **keywords**, **colons**, and **indentation**.  The code within a control structure must be indented consistently; the convention, followed throughout this tutorial, is four spaces per level.  Many IDEs are configured to automatically indent the next line after a colon.  Programmers can also configure the Tab key on their keyboard to indent four spaces.  

**Conditionals** are control structures that allow decision making within a program using the keywords **`if`**, **`elif`**, and **`else`**.  Python evaluates whether the logical statement after the keyword (e.g., `count >= 1`) is true and executes the commands within an indented block of code based on the result.  

The general formula for a conditional is:

```
if <expression>:
    <commands to execute if expression returns True>
```

or

```
if <expression>:
    <commands to execute if expression returns True>
else:
    <commands to execute if expression returns False>
```

or

```
if <expression1>:
    <commands to execute if expression1 returns True>
elif <expression2>:
    <commands to execute if expression2 returns True>
else:
    <commands to execute if neither expression returns True>
```

Python has a data type for `True` and `False` called a **Boolean**.  When Python encounters the **`if`** keyword, it expects the keyword to be followed by an expression that evaluates to a Boolean.  It is unnecessary to include an additional comparison to `True`.  For example, `'a' in 'happy'` already returns `True`, so `('a' in 'happy') == True` is redundant.  (Without the parentheses, Python treats `'a' in 'happy' == True` as a chained comparison and returns `False`.)
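A short sketch of how Boolean expressions behave, including what happens when a comparison to `True` is added without parentheses. The variable names `has_a` and `chained` are chosen for the example.

```python
has_a = 'a' in 'happy'
print(has_a, type(has_a).__name__)

# The expression is already a Boolean, so it can follow `if` directly
if has_a:
    print('Found it')

# Chained comparison: evaluated as ('a' in 'happy') and ('happy' == True)
chained = 'a' in 'happy' == True
print(chained)
```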

In [None]:
name = 'Grace Hopper'

if len(name) < 20:
    print('Yes')
else:
    print('No')

**Loops** are control structures that allow repeated behavior within a program using the keywords **`for`** or **`while`**.  A **for loop** iterates through a sequence and executes the commands within the indented block until it reaches the end of the sequence.  A **while loop** is paired with a logical expression and executes the commands until the expression is no longer true.

Most loops can be written using a for loop or a while loop.  Generally, for loops are used when the number of iterations is fixed (e.g., print each name in a list of names), and while loops are used when the number of iterations is unknown (e.g., ask the user to enter her password until she enters the correct password).  

The general formula for a loop is:

```
for <item> in <sequence>:
    <commands to execute>
```

or

```
while <expression>:
    <commands to execute>
    <modification to expression>
```

When using while loops, the programmer must build in a mechanism for the expression to eventually return `False`.  Otherwise, the expression will always return `True` and the loop will continue infinitely.  
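A minimal while-loop sketch of the password scenario described above. To keep it runnable without typing anything, a hypothetical list called `attempts` stands in for what a user would enter; the counter `tries` is what eventually makes the loop stop.

```python
attempts = ['hunter2', 'letmein', 'opensesame']  # hypothetical stand-in for user input
correct = 'opensesame'

tries = 0
guess = None
while guess != correct and tries < len(attempts):
    guess = attempts[tries]
    tries += 1   # modifies the expression so the loop can end

print('Stopped after ' + str(tries) + ' tries.')
```

The second half of the condition (`tries < len(attempts)`) guarantees the loop ends even if the correct password never appears.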

Try to predict the output of the following code before running it.

In [None]:
# Example for loop

name = 'Grace Hopper'

i = 0
for letter in name:
    if letter in ['a', 'e', 'i', 'o', 'u']:
        i = i + 1
print(name + ' has ' + str(i) + ' vowels.')

The preceding example highlights several important considerations when programming with Python.

#### Variables

Variables are used throughout control structures, whether explicitly defined (such as `i`) or implicitly defined (such as `letter`). Use short, informative variable names to make your code more readable and easier to debug. In the example above, the code tells Python to access each letter in a name, check whether the letter is in a list of vowels, increase `i` by 1 if the letter is a vowel, and print the number of vowels.<br><br>Notice that the following code produces the same result but is harder to understand. 

In [None]:
# Example for loop with less informative variables

y = 'Grace Hopper'

i = 0
for x in y:
    if x in ['a', 'e', 'i', 'o', 'u']:
        i = i + 1
print(y + ' has ' + str(i) + ' vowels.')

#### Indentation
Python uses indentation (four spaces per level by convention) to determine which commands belong to a given control structure.  The example includes both a for loop and a conditional, and the indentation is key to differentiating between the two.<br><br>The following code includes the original example with a slight change to the indentation.  Try to predict the output before running the code.

In [None]:
# Example for loop with indentation change

name = 'Grace Hopper'

i = 0
for letter in name:
    if letter in ['a', 'e', 'i', 'o', 'u']:
        i = i + 1
    print(name + ' has ' + str(i) + ' vowels.')

#### Overloading
Python uses a set of rules for operating on data types.  Unlike some other programming languages, Python does not require the programmer to declare that a given variable has a specific data type.  Instead, Python examines the data type of the object assigned to the variable to determine how the operations should proceed.  Consider the command `a + b`.  If `a` and `b` are integers, Python will return the sum of the two integers.  If `a` and `b` are strings, Python will return the concatenation of the two strings.  The `+` operator is **overloaded** because it uses one interface (`+`) to perform the same type of function on different data types.  Overloading is more efficient than creating separate interfaces for similar functionality, but its use requires more care on the part of the programmer.  
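The sketch below shows the same `+` interface applied to three data types. The helper name `combine` is made up for the example; it simply applies `+` to whatever it receives.

```python
def combine(a, b):
    """Apply + to two objects of the same type."""
    return a + b

print(combine(2, 3))        # integers: addition
print(combine('2', '3'))    # strings: concatenation
print(combine([2], [3]))    # lists: joined into one list
```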

Try to figure out why the following code produces an error.  (Hint: What is the value of `i`?)

In [None]:
# Example for loop that produces an error

name = 'Grace Hopper'

i = 0
for letter in name:
    if letter in ['a', 'e', 'i', 'o', 'u']:
        i = i + 1
    print(name + ' has ' + i + ' vowels.')

[Back to the top](#top)

----------

<a id="functions"></a>
### Functions and Methods

A **function** is a named block of code used to execute a command or set of commands.  Functions can be defined to accept any number of inputs called **arguments**.  In this tutorial, we've already seen a few functions, including `print()`, `len()`, and `str()`. The general syntax for calling a function is

```
function_name(argument1, argument2, ...)
```

A **method** is a function with a built-in argument for the object being acted upon.  The general syntax for calling a method on a given object is

```
object.function_name(argument1, argument2, ...)
```


In [None]:
# Example function

my_string = 'aBcDe'
print(my_string)

# Example method
print(my_string.lower())

#### Writing Your Own Functions

Python has many built-in functions and methods, but you can also write your own functions using the **`def`** keyword.  The general syntax for a function definition is

```
def function_name(argument1, ...):
    command to execute with <argument1>
    .
    .
    .
    return <object>
```
Note that both arguments and return objects are optional.

In [None]:
# Example function

def say_hello(name_string):
    print('Hello, ' + str(name_string) + '!')
    return None

say_hello('NaLette')
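The `say_hello()` example prints its result and returns nothing useful. As a complementary sketch, the function below returns a value that the caller can store and reuse. The name `count_vowels` is made up here; it wraps the vowel-counting loop from the Control Structures section.

```python
def count_vowels(text):
    """Return the number of lowercase vowels in a string (str -> int)."""
    total = 0
    for letter in text:
        if letter in 'aeiou':
            total += 1
    return total

vowels = count_vowels('Grace Hopper')
print('Grace Hopper has ' + str(vowels) + ' vowels.')
```

Because the count is returned rather than printed, the same function can be reused for any string without changing its code.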

<a id="modules"></a>
### Modules and Packages

While many functions come built in, Python has a number of separate **modules** that you can **import** into your current Python session.  A module is a file containing Python definitions and statements that you can use in your workflow. The general syntax for importing a module is

```
import module_name
```

The import statement loads the module and makes its name available in your session; the module's functions are not copied into your own namespace.  If you want to use functions from the module in your script, you need to call the module and function names together:

```
my_variable = module_name.function_name(argument1, argument2,...)
```

The import keyword provides flexibility to

 * rename a module as you're importing it: `import module_name as mymod` <br>
 * import only certain function names: `from module_name import function1, function2`
 * import all function names: `from module_name import *`
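The first two import styles can be sketched with the standard library's `math` module, alongside a plain `import`. The alias `m` is arbitrary. The wildcard form is left out because it makes it hard to tell where a name came from.

```python
import math                      # plain import: use math.<name>
import math as m                 # renamed import: use m.<name>
from math import sqrt, floor     # selected names: use them directly

print(math.sqrt(16))
print(m.pi)
print(floor(sqrt(10)))
```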


A **package** is a special type of module that has a folder of submodules and additional tools to manage them.  **Library** is a generic term referring to a collection of code that can be used for many different purposes.  As programming jargon goes, the terms **module**, **package**, and **library** have different meanings, but you'll often see them used interchangeably to represent Python functionality that comes from outside the core language.  

A number of useful modules are included in the [Python Standard Library](https://docs.python.org/3/library/), so you don't need to install them.  Below are some of the modules that are useful to social scientists.

Category | Module Name   | Description                    
:---|:---|:----
Data | `datetime` | basic date and time types
| `json` | JSON encoding and decoding
Files | `csv` | CSV file reading and writing
Text | `re` | regular expression operations
| `string` | common string operations
System | `os` | miscellaneous operating system interfaces
| `sys` | system-specific parameters and functions
Math | `math` | mathematical functions
| `random` | generating pseudo-random numbers

<br>
Programmers can create their own packages and publish them for use by others.  Python's programmer community, along with the numerous libraries available to the public, make it one of the most popular programming languages.  Most packages are published on the [Python Package Index](https://pypi.python.org/pypi) and can be installed directly from your computer's command line using the package manager **`pip`**.  There are over 118,000 packages on PyPI.  The packages below are among the most popular and are also included with the [Anaconda](https://docs.anaconda.com/anaconda/packages/pkg-docs) Python distribution.

Example Usage | Package Name  | Description                   
:---|:---|:----
Web Scraping | `requests` | http protocol library
| `beautifulsoup4` | xml and html parsing
Data Analysis | `scipy` | algorithms and mathematical tools
| `pandas` | high-performance data structures
| `numpy` | array processing and advanced math
| `matplotlib` | publication quality figures
Text Analysis | `nltk` | working with human language data
Image Analysis | `pillow` | processing images and graphics
Machine Learning | `scikit-learn` | data mining and analysis

[Back to the top](#top)

----------

<a id="raptor"></a>
## 4. Working with Files

Creating, accessing, and manipulating files are among the most basic tasks that a programmer will do with Python.  In this section, we will work through an example to demonstrate different options for working with files.  If you are following this tutorial using your own installation of Python, you need to download the [kipling_jungle_book.txt](https://raw.githubusercontent.com/nmbrodnax/iqss-python-intro/master/files/kipling_jungle_book.txt) file.  All code for this section can be found in the [workshop.py](https://raw.githubusercontent.com/nmbrodnax/iqss-python-intro/master/files/workshop.py) file.

What do we want to do?
1. Ask the user for the name of a text file using the `input()` function. Note: `input()` takes an optional prompt string and always returns what the user typed as a string.
2. Ask the user for the number of lines to display using the `input()` function.
3. Convert the number of lines to an integer using the `int()` function.
4. Display that number of lines from the file using the `print()` function.

<a id='open'></a>
### Using `open()` and `.close()`

The `open()` function creates a special type of Python object that contains tools for moving through a file.  Once a file has been opened and the tasks have been completed, this **file object** must be closed using the `.close()` method.  

In [None]:
# Example using open()

print('What file should I read from?')
filename = input('> ')

print('How many lines should I read?')
lines_to_read = int(input('> '))  # input() returns a string, so convert it

line_counter = 0
file = open(filename, 'r')
while line_counter < lines_to_read:
    # each line already ends with a newline, so suppress print's own
    print(file.readline(), end='')
    line_counter = line_counter + 1
file.close()
print(str(line_counter) + " lines read\n")

[Back to the top](#top)

----------

<a id='withopen'></a>
### Using `with open()`

Opening a file using the keyword **`with`** and the `open()` function has the added benefit of automatically closing the file once the commands in the `with` block have been executed.

In [None]:
# Example using with open()

print('What file should I read from?')
filename = input('> ')

print('How many lines should I read?')
lines_to_read = int(input('> '))  # convert the string from input() to an integer

line_counter = 0
with open(filename, 'r') as file:
    while line_counter < lines_to_read:
        print(file.readline(), end='')  # lines already end with a newline
        line_counter += 1
# the file is closed automatically once the with block ends
print(str(line_counter) + " lines read\n")
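The examples above wait for keyboard input and keep counting even if the file has fewer lines than requested. The sketch below is a non-interactive variant under stated assumptions: the helper name `read_first_lines` and the temporary sample file are made up for illustration, and `itertools.islice` is swapped in for the `while` loop so that reading stops at the end of the file.

```python
import os
import tempfile
from itertools import islice


def read_first_lines(path, n):
    """Return up to n lines from the start of a text file, without trailing newlines."""
    with open(path, 'r') as file:
        return [line.rstrip('\n') for line in islice(file, n)]


# Hypothetical sample file so the sketch runs without any download
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
    sample.write('first\nsecond\nthird\n')
    sample_path = sample.name

first_two = read_first_lines(sample_path, 2)
all_lines = read_first_lines(sample_path, 10)   # asks for 10, but the file has 3
print(first_two)
print(len(all_lines))

os.remove(sample_path)   # clean up the temporary file
```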

[Back to the top](#top)

----------

<a id="csv"></a>
### Using the `csv` module

In this example, we will import the `csv` module using the **`import`** keyword.  All functions within the `csv` module then become available to the program.  Because a function in the module may share a name with another function already available in the Python session, the programmer calls the function through the module name (e.g., `csv.writer()`) so that Python can tell the two apart.

```python
# Example header for opening two files at once
import csv

with open("workshop.csv", 'w', newline='') as csvfile, open(filename, 'r') as txtfile:
    writer = csv.writer(csvfile)
```
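To see what `csv.writer()` actually produces without creating a file, the sketch below writes to an in-memory buffer (`io.StringIO`) that stands in for the opened file, then reads the rows back with `csv.reader()`. The column names and values are made up for the example.

```python
import csv
import io

buffer = io.StringIO(newline='')   # in-memory stand-in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(['line number', 'text'])
writer.writerow([1, 'Hello, world'])   # the comma inside the value forces quoting

print(buffer.getvalue())

buffer.seek(0)                     # rewind before reading
rows = list(csv.reader(buffer))
print(rows)
```

Note that values read back from a CSV file are always strings, so the `1` written above comes back as `'1'`.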

### Example: Creating a CSV file
The example below integrates many of the topics discussed in the tutorial.  

In [None]:
# A function to count the number of words in a phrase

def count_words(mytext):
    """Returns the number of words in a string
    str -> int"""
    words = mytext.split(" ")
    return len(words)

print(count_words("I am going to learn Python programming."))

In [None]:
# A function to get the length of the longest word in a phrase

def longest_word_length(mytext):
    """Returns the length of the longest word in a string
    str -> int"""
    words = mytext.split(" ")
    word_lengths = []
    for word in words:
        word_lengths.append(len(word))
    return max(word_lengths)

print(longest_word_length("I am going to learn Python programming"))

In [None]:
# Example script to create a csv file with information from a file

import csv

# newline='' is recommended by the csv module so rows are not double-spaced on Windows
with open("workshop.csv", 'w', newline='') as csvfile, \
     open(filename, 'r') as txtfile:
        writer = csv.writer(csvfile)
        writer.writerow(["line number", "word count", "longest word"]) 
        line_counter = 0
        while line_counter < int(lines_to_read):
            line_number = line_counter + 1
            content = txtfile.readline()
            word_count = count_words(content)
            longest_word = longest_word_length(content) 
            writer.writerow([line_number, word_count, longest_word]) 
            line_counter += 1
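The script above depends on `filename` and `lines_to_read` from earlier cells. As a self-contained sketch of the same idea, the version below works on in-memory text and reads the result back with `csv.DictReader`. The function name `summarize_lines` and the sample text are hypothetical, and two design choices differ from the script: `split()` with no argument is used so that newlines and blank lines are not counted as words, and the loop stops early if the text has fewer lines than requested.

```python
import csv
import io


def summarize_lines(txtfile, csvfile, max_lines):
    """Write line number, word count, and longest word length for up to max_lines lines."""
    writer = csv.writer(csvfile)
    writer.writerow(['line number', 'word count', 'longest word'])
    for line_number, content in enumerate(txtfile, start=1):
        if line_number > max_lines:
            break
        words = content.split()   # no argument: ignores newlines and repeated spaces
        longest = max((len(word) for word in words), default=0)
        writer.writerow([line_number, len(words), longest])


source = io.StringIO('one two three\n\nprogramming\n')   # hypothetical text
target = io.StringIO(newline='')
summarize_lines(source, target, 5)

target.seek(0)
summary = list(csv.DictReader(target))   # each row becomes a dictionary keyed by header
for row in summary:
    print(row['line number'], row['word count'], row['longest word'])
```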

[Back to the top](#top)

----------