# Introduction to Python and Morph.io (for scraping)

This tutorial builds on some basic coding concepts and introduces Morph.io. By the end you should be able to:

* Use GitHub to edit Python files
* Use Morph.io to run Python code hosted on GitHub

Let's get started.


## Get started with Morph.io

1. [Create an account on GitHub](https://github.com/) if you haven't got one already, and [sign in to Morph.io using your GitHub account](https://morph.io/users/auth/github)
2. [Click **New scraper**](https://morph.io/scrapers/new) on the menu at the top of Morph.io. You will be taken to a new page asking you to specify more details
3. On the dropdown menu for *Language*, select **Python**. Give your scraper a name in the next box - something like 'startingtocode' (no spaces), and in the final box write 'none' - this isn't a scraper yet: we're just using Morph.io as a place to learn code.
4. Click **Create scraper**

It will take Morph.io a few moments to create the files for your scraper (the files are being created on GitHub). 

When it has finished, you will be taken to a new page for the scraper. Look on the right where it says *Scraper code*. There should be a link to `startingtocode / scraper.py` - this will take you to the pages on GitHub where the code is now hosted: `scraper.py` is the file for the code itself; `startingtocode` is the link to the repository containing that file.

Open the link to `scraper.py` in a separate tab or window, but also keep your Morph.io page for this scraper open in another tab or window - you will need to edit the code on GitHub, and run it to see the results on Morph.io.

Now we're ready to start.

## Introducing the template code

When you create a new scraper on Morph.io, it is filled with some template code, shown below.

Almost every line of that code begins with a hash symbol: `#`. The hash symbol is most commonly used in two ways:

* Firstly, as a way of creating **comments** in Python code: Python ignores anything on a line that comes after a `#`, so the hash symbol allows you to add comments which are not treated as working code.
* Secondly, as a way of *disabling* code - what's called *commenting out* code. Rather than delete an entire line of code, it is easier to add a `#` at the front to turn it 'off' to test what happens, so you can always turn it back 'on' again quickly by removing the `#`.
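
As a small illustration of both uses (the variable name `greeting` is made up for this example, and `print(...)` is written with parentheses so that it also runs on Python 3):

```python
# This whole line is a comment: Python skips it.
greeting = "hello"  # a comment can also follow code on the same line

# The next line is commented-out code. Removing the '#' would turn it back 'on'.
# greeting = "goodbye"

print(greeting)
```

Because the second assignment is commented out, `greeting` still holds `"hello"` when it is printed.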

In the template code below generated by Morph.io, the *entire* code is commented out. The idea is that you can **uncomment** the sections you want to use in your own code, saving you time writing scraping code from scratch. We'll come back to this later.

In [None]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

# import scraperwiki
# import lxml.html
#
# # Read in a page
# html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

## Libraries in Morph.io

Make sure you have `scraper.py` open on GitHub, and click the edit button to make some changes.

Uncomment the two lines that start with `import` so that the code looks like the version shown below.

These two lines bring in two **libraries** to Morph.io: 

* Scraperwiki is a library which has useful functions for scraping webpages and storing the results in a database
* lxml.html is a library which is useful for *parsing* HTML webpages - i.e. drilling down to particular pieces of information you want.

In [1]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
# html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Next, uncomment the line `html = scraperwiki.scrape("http://foo.com")`. 

This line is looking at a URL - foo.com - so it's worth checking that site in another window to see what's there.

The URL is in parentheses, which means it's being used as an ingredient (an *argument*) in a function - `scrape()`. Specifically, `scraperwiki.scrape()`, which means the function is part of the **scraperwiki library**.

When using a library it's always useful to check the **documentation** for that library - [here's the documentation for Scraperwiki](https://classic.scraperwiki.com/docs/python/), or at least its 'Classic' version, which was used by Morph.io. There's also a link to [where the documentation is now hosted, on GitHub](https://github.com/scraperwiki/code-scraper-in-browser-tool/wiki).

The `scrape` function grabs the contents of the given URL and stores it in the new variable `html`:

In [3]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".
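
To get a feel for what `scrape` does, here is a rough stand-in built only from Python's standard library. It is a sketch, not the scraperwiki library's actual code: the function name `fetch` is hypothetical, and it assumes that `scrape` amounts to opening the URL and reading back everything found there.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch(url):
    """Open the URL and return the raw contents of the page."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


# Example use (needs an internet connection, so it is left commented out):
# html = fetch("http://foo.com")
# print(html)
```

Either way, what ends up in `html` is the source of the page: the same text you see when you view the page source in a browser.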

Now **commit** your changes (GitHub's version of saving), and switch back to the scraper in Morph.io. Run the scraper.

The first time you do this you may have to wait while Morph.io installs the libraries - but this only has to happen once and it should run more quickly the second time.

*Note: if you get a 'status code 255' error then it means Morph.io isn't working properly right now. You can only leave it and come back later. (This doesn't happen that often).*

The next section of code *converts* the `html` variable into another new variable called `root`, and drills down further into that using something called `cssselect`, which uses **css selectors** to grab very specific pieces of information from the page. We'll talk about this in class, but search around for more about those selectors and think about how they could be used in scraping.

In [11]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

How can we see what's happening? 

Add a `print` command - or three.

In [None]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
print html
print root
print root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Note that the first `print` command prints the variable `html` and the second prints the variable `root` - but the third doesn't print a variable at all. Instead, `print` is just added to the front of the line `root.cssselect("div[align='left']")`, because the results of that line of code are not saved anywhere.

We could save it in a variable first and then print it. How would you do that?

Anyway, the results of those last two print commands are a bit cryptic. Here's one:

`<Element html at 0x7faa5365c470>`

And here's the other:

`[]`

The `<Element html at 0x7faa5365c470>` bit looks funny because `root` has been created with an `lxml.html` function: it is what we call an lxml.html *object*. So what this tells us is that we need to find some way to convert or decode this information back into something understandable.

In theory that's what `cssselect` should do. But the results of that are pretty unimpressive: `[]`

What can we work out from that? The square brackets are a big clue. Square brackets indicate a **list**, and that's exactly what `cssselect` generates: a list of elements that match a css selector.

But there's nothing in those square brackets, so our list is **empty**. Why? Because `cssselect` found *no* matches. 

Look at the code: `cssselect` was looking for this: `"div[align='left']"`. In other words, any content inside the HTML tags `<div align="left">`.

Check the source code of the webpage that is being scraped: foo.com. Are there any such tags? No. 

We can alter it, then, to look for a tag we *know* is on that page: link tags for example. The css selector for a link is `a` (as in `<a href=`), so we can alter the code like so:

`root.cssselect("a")`

In [None]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
print html
print root
print root.cssselect("a")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

This time we get a different result:

`[<Element a at 0x7f80ac4b4838>, <Element a at 0x7f80ac4b4890>]`

Once again those square brackets tell us this is a list, and we can count two items in that list now. Checking the source HTML on the webpage we can see there are two `a` tags there too. But again these have been encoded as lxml objects.

If it's a list then we need to loop through it. We also need to store it in a variable first:

In [3]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
#I've commented out the 3 lines below so we can focus on the result of the new ones
#print html
#print root
#print root.cssselect("a")
alistoflinks = root.cssselect("a")
for link in alistoflinks:
    print link
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

<Element a at 0x10708c5e8>
<Element a at 0x109fc0c78>


To decode them *back* from lxml we need to use some lxml code: `.text` shows the text *within* the HTML tags that have been grabbed:

In [4]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
#I've commented out the 3 lines below so we can focus on the result of the new ones
#print html
#print root
#print root.cssselect("a")
alistoflinks = root.cssselect("a")
for link in alistoflinks:
    print link.text
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Foo.com
Privacy Policy


Using `.text` will show you the contents of a tag, but *not* any text inside other tags *inside* the tag. For example, if the HTML was `<p>My friend <em>Paul</em></p>` and we were grabbing all the `<p>` tags, `.text` would only show 'My friend'.

An alternative, then, is `.text_content()`, which shows the text content of the tag *including any content inside child tags*. Here's the code with that instead:

In [5]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
#I've commented out the 3 lines below so we can focus on the result of the new ones
#print html
#print root
#print root.cssselect("a")
alistoflinks = root.cssselect("a")
for link in alistoflinks:
    print link.text_content()
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Foo.com
Privacy Policy
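
On foo.com both versions print the same thing, presumably because the links there contain only plain text. The difference shows up with the `<p>My friend <em>Paul</em></p>` example mentioned above. Here is a minimal sketch, assuming `lxml` is installed and parsing that snippet directly from a string (the variable name `paragraph` is made up for this example):

```python
import lxml.html

# Parsing a single tag from a string gives us that tag as an lxml object
paragraph = lxml.html.fromstring("<p>My friend <em>Paul</em></p>")

# .text stops at the first child tag
print(repr(paragraph.text))

# .text_content() also includes the text inside child tags
print(repr(paragraph.text_content()))
```

`repr()` is used here so that the trailing space after 'My friend' is visible in the printed result.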


A third useful piece of code is `.attrib`, which allows us to grab an attribute of the HTML tag (which needs to be specified as a string inside square brackets like so: `.attrib['href']`). 

Here is the same code again adapted to use that:

In [None]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
#I've commented out the 3 lines below so we can focus on the result of the new ones
#print html
#print root
#print root.cssselect("a")
alistoflinks = root.cssselect("a")
for link in alistoflinks:
    print link.attrib['href']
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".
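
One thing to watch for: `.attrib['href']` raises a `KeyError` if the tag has no such attribute. lxml elements also have a `.get()` method, which returns `None` instead. Here is a short sketch, assuming `lxml` is installed. The HTML string is modelled on the 'Privacy Policy' link, and the `title` attribute is chosen simply because this link does not have one:

```python
import lxml.html

link = lxml.html.fromstring('<a href="/digimedia_privacy_policy.html">Privacy Policy</a>')

# The attribute exists, so both approaches return its value
print(link.attrib['href'])
print(link.get('href'))

# The attribute is missing: .get() returns None rather than stopping the scraper
missing = link.get('title')
print(missing)

# .attrib[...] raises a KeyError for a missing attribute
try:
    link.attrib['title']
except KeyError:
    print("This link has no title attribute")
```

Which of the two to use is a judgment call: `.attrib[...]` makes a missing attribute impossible to overlook, while `.get()` lets the loop carry on past tags that lack it.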

With just those three you can grab most of the information that you might want to scrape. For other options [see the documentation on parsing HTML with the lxml library](http://lxml.de/parsing.html).

## Storing the results

Now that we've grabbed the links we need some way to store them.

We can do that for each piece of information in the usual way, like so: 

`linktext = link.text_content()`

However, when it comes to storing multiple pieces of information we need a different type of variable - we need a **dictionary** variable. 

Remember that a dictionary variable stores a series of **key-value** pairs. The key is like a column heading; the value is like a cell that would be in that column. Here's an example:

`{'link': '/digimedia_privacy_policy.html', 'linktext': 'Privacy Policy'}`

This is the type of variable we need to store information in a database. 

First, then, we create an empty dictionary variable that we choose to call 'record':

`record = {}`

To add new data to this, we have to name the *key* inside square brackets after the name of the dictionary variable like so: `record['linktext']` - then, after an equals sign, the *value* that we want to store against that key. Taken together it looks like this:

`record['linktext'] = link.text_content()`
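
Here is that pattern on its own, using fixed strings in place of `link.text_content()` and `link.attrib['href']` so that it runs without scraping anything (`print(...)` is written with parentheses so that it also runs on Python 3):

```python
# Start with an empty dictionary
record = {}

# Each assignment adds one key-value pair
record['linktext'] = 'Privacy Policy'
record['link'] = '/digimedia_privacy_policy.html'

# Look up a value by naming its key
print(record['link'])

# Assigning to a key that already exists replaces the old value
record['linktext'] = 'Foo.com'
print(record['linktext'])
```

Note that assigning to an existing key does not add a second entry: the dictionary still has exactly two keys at the end.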

Now that we have data in the dictionary variable, we can store it in the scraperwiki database. The code for that takes some explaining:

`scraperwiki.sqlite.save(unique_keys=['link'], data=record)`

First, we have the function that is needed to save to the database: `scraperwiki.sqlite.save()`. 

The detail comes inside those brackets: we need to specify what data we are saving and - before that - the *unique* key from that dictionary variable.

`unique_keys=['link']`, then, specifies that `'link'` is going to be the unique key. This means that if the scraper comes across the same link twice, the database keeps only one row for it: the later record overwrites the earlier one. Ideally the unique key is something that isn't going to recur, or something that we only want to store once.

`data=record` specifies that our data is in a variable called `record`.

Here's that code in action - I'm going to explain the looping bit in a minute...

In [9]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
root = lxml.html.fromstring(html)
#I've commented out the 3 lines below so we can focus on the result of the new ones
#print html
#print root
#print root.cssselect("a")
alistoflinks = root.cssselect("a")
record = {}
for link in alistoflinks:
    record['linktext'] = link.text_content()
    record['link'] = link.attrib['href']
    print record
    scraperwiki.sqlite.save(unique_keys=['link'], data=record)
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

{'link': '/', 'linktext': 'Foo.com'}
{'link': '/digimedia_privacy_policy.html', 'linktext': 'Privacy Policy'}


The final thing to emphasise is that our code to `save` data is placed within a *loop*. This is because the scraper saves *one row of data at a time*. Let's recap:

1. We have 3 lines that steadily narrow down to the specific data that we want to store: first we grab the whole webpage (`html = scraperwiki.scrape("http://foo.com")`); then we convert it into an lxml object (`root = lxml.html.fromstring(html)`); and then we grab all the 'a' tags (`alistoflinks = root.cssselect("a")`)
2. We create an empty dictionary variable, ready to be filled in a moment: `record = {}`
3. We begin looping through that list of 'a' tags: `for link in alistoflinks:`
4. We fill that empty dictionary with two items *from the first item in the list*: first the text of the 'a' tag (`record['linktext'] = link.text_content()`) and second the href attribute of the 'a' tag (`record['link'] = link.attrib['href']`).
5. We `print record` to see what it now looks like. It is like one row of a table: `{'link': '/', 'linktext': 'Foo.com'}`
6. We save that 'record' variable: `scraperwiki.sqlite.save(unique_keys=['link'], data=record)`
7. The loop runs again on the *second item in the list*. The variable 'record' is overwritten with info from that second item. When *this* is printed it looks like a different row in the same table: `{'link': '/digimedia_privacy_policy.html', 'linktext': 'Privacy Policy'}`
8. If there were more items in the list it would continue looping, and we would get more and more data in the database. 
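
The loop in steps 3 to 7 can be sketched without a website or a database. In this illustration the list `links` is a stand-in for the 'a' tags, holding the same text and href values as the output above, and the list `saved_rows` is a hypothetical stand-in for the database; neither is part of the scraper itself.

```python
# Stand-in for the list of 'a' tags: (text, href) pairs
links = [
    ('Foo.com', '/'),
    ('Privacy Policy', '/digimedia_privacy_policy.html'),
]

saved_rows = []  # stand-in for the database
record = {}

for text, href in links:
    # The same two keys are overwritten on every pass through the loop
    record['linktext'] = text
    record['link'] = href
    # Saving keeps a copy of the record as it is right now
    saved_rows.append(dict(record))

print(record)
print(len(saved_rows))
```

After the loop, `record` holds only the values from the last link, while `saved_rows` holds one row for every pass through the loop.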

Notice a key difference here: the variable `record` only ever contains 2 keys and 2 corresponding values. Those values keep changing as the loop runs, but at the end it still only has 2 values.

The database, however, contains the 2 keys plus *all* the values that the `record` variable *ever* contained. It can now be queried, or added to, or altered.
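
As an illustration of what 'queried' means here, and of what a unique key does, the sketch below uses Python's built-in `sqlite3` module rather than the scraperwiki library. It is a stand-in built on stated assumptions, not how scraperwiki is implemented: it keeps the database in memory instead of in `data.sqlite`, creates the `data` table by hand, and treats a save as an 'insert or replace' keyed on `link`. The third record is invented to show a repeated link.

```python
import sqlite3

# An in-memory database standing in for data.sqlite
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (link TEXT PRIMARY KEY, linktext TEXT)")

records = [
    {'link': '/', 'linktext': 'Foo.com'},
    {'link': '/digimedia_privacy_policy.html', 'linktext': 'Privacy Policy'},
    {'link': '/', 'linktext': 'Foo.com (seen again)'},  # same link as the first record
]

for record in records:
    # A row with the same link replaces the one already stored
    conn.execute(
        "INSERT OR REPLACE INTO data (link, linktext) VALUES (:link, :linktext)",
        record,
    )
conn.commit()

# Query the table: one row per unique link
rows = conn.execute("SELECT link, linktext FROM data ORDER BY link").fetchall()
for row in rows:
    print(row)
```

Three records were saved but only two rows come back, because the two records that share the link `/` count as the same row.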