# Introduction to Python and Morph.io (for scraping)

This tutorial is intended to build on some basic coding concepts and introduce Morph.io. By the end you should:

* Be able to use GitHub to edit Python files
* Use Morph.io to run Python code hosted on GitHub

Let's get started.


## Get started with Morph.io

1. [Create an account on GitHub](https://github.com/) if you haven't got one already, and [sign in to Morph.io using your GitHub account](https://morph.io/users/auth/github)
2. [Click **New scraper**](https://morph.io/scrapers/new) on the menu at the top of Morph.io. You will be taken to a new page asking you to specify more details
3. On the dropdown menu for *Language*, select **Python**. Give your scraper a name in the next box - something like 'startingtocode' (no spaces), and in the final box write 'none' - this isn't a scraper yet: we're just using Morph.io as a place to learn code.
4. Click **Create scraper**

It will take Morph.io a few moments to create the files for your scraper (the files are being created on GitHub). 

When it has finished, you will be taken to a new page for the scraper. Look on the right where it says *Scraper code*. There should be a link to `startingtocode / scraper.py` - this will take you to the pages on GitHub where the code is now hosted: `scraper.py` is the file for the code itself; `startingtocode` is the link to the repository containing that file.

Open the link to `scraper.py` in a separate tab or window, but also keep your Morph.io page for this scraper open in another tab or window - you will need to edit the code on GitHub, and run it to see the results on Morph.io.

Now we're ready to start.

## Introducing the template code

When you create a new scraper on Morph.io, it creates it with some template code as shown below. 

Each line begins with a hash symbol: `#`. There are two ways that these are most commonly used:

* Firstly, as a way of creating **comments** in Python code: anything on a line after a `#` is ignored when the code runs, so the hash symbol allows you to add comments which are not treated as working code.
* Secondly, as a way of *disabling* code - what's called *commenting out* code. Rather than delete an entire line of code, it is easier to add a `#` at the front to turn it 'off' to test what happens, so you can always turn it back 'on' again quickly by removing the `#`.
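
To make those two uses concrete, here is a minimal sketch. The variable name `greeting` and its values are made up purely for illustration; they are not part of the Morph.io template.

```python
# This whole line is a comment: Python ignores it when the code runs.
greeting = "hello"  # a comment can also follow working code on the same line

# The next line is 'commented out', so it does nothing until the # is removed:
# greeting = "goodbye"

print(greeting)
```

Because the second assignment is commented out, the code prints `hello`. Remove the `#` in front of it and the same code would print `goodbye` instead.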

In the template code below generated by Morph.io, the *entire* code is commented out. The idea is that you can **uncomment** the sections you want to use in your own code, saving you time writing scraping code from scratch. We'll come back to this later.

In [None]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

# import scraperwiki
# import lxml.html
#
# # Read in a page
# html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

## Libraries in Morph.io

Make sure you are on this file in GitHub, and click the edit button to make some changes.

Uncomment the two lines that start with `import` (remove the `# ` at the front of each) so the code looks like the code below.

These two lines bring in two **libraries** to Morph.io: 

* Scraperwiki is a library which has useful functions for scraping webpages and storing the results in a database
* lxml.html is a library which is useful for *parsing* HTML webpages - i.e. drilling down to particular pieces of information you want.

In [1]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
# html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Next, uncomment the line `html = scraperwiki.scrape("http://foo.com")`:

In [3]:
# This is a template for a Python scraper on morph.io (https://morph.io)
# including some code snippets below that you should find helpful

import scraperwiki
import lxml.html
#
# # Read in a page
html = scraperwiki.scrape("http://foo.com")
#
# # Find something on the page using css selectors
# root = lxml.html.fromstring(html)
# root.cssselect("div[align='left']")
#
# # Write out to the sqlite database using scraperwiki library
# scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
#
# # An arbitrary query against the database
# scraperwiki.sql.select("* from data where 'name'='peter'")

# You don't have to do things with the ScraperWiki and lxml libraries.
# You can use whatever libraries you want: https://morph.io/documentation/python
# All that matters is that your final data is written to an SQLite database
# called "data.sqlite" in the current working directory which has at least a table
# called "data".

Now **commit** your changes (GitHub's version of saving), and switch back to the scraper in Morph.io. Run the scraper.

In [7]:
mynewvariable = Paul

NameError: name 'Paul' is not defined

The error tells us that `'Paul' is not defined`. In other words, we have not *defined* a function or variable called Paul. That's because it's not a function or variable - it's a string, but we forgot to indicate that by putting it inside quotation marks.

Numbers do not need quotation marks because they are, well, numerical. And we want to treat them that way. Sometimes we *might* want to treat numbers as strings - for example if we are putting them into a URL, or adding them to a postcode or similar code. In those cases we would put them in quotation marks. But if we want to perform calculations like adding, multiplying, and so on, we don't use quotation marks.

It's worth mentioning here that there is more than one type of number: **integers** are whole numbers, and **floats** are numbers with decimal points. 
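
Here is a rough sketch of the difference between numbers and strings. All the variable names and the `example.com` address are invented for illustration, and it uses two built-in functions not covered elsewhere in this tutorial: `str`, which turns a number into a string, and `type`, which tells you what type a variable is.

```python
pagenumber = 2          # an integer: a whole number, no quotation marks
price = 2.5             # a float: a number with a decimal point
postcodepart = "2"      # a string that happens to contain a digit

print(pagenumber + 3)        # integers can be used in calculations
print(pagenumber * price)    # mixing an integer and a float gives a float
print("B" + postcodepart)    # with strings, + joins them together instead

# str() converts the number to a string so it can be joined to other text
url = "http://example.com/page/" + str(pagenumber)
print(url)

# type() shows what kind of variable each one is
print(type(pagenumber), type(price), type(postcodepart))
```

Note that `"B" + postcodepart` joins two strings to make `B2`, whereas trying `"B" + pagenumber` would cause an error, because Python will not join a string to a number unless you convert it first.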

As well as strings, integers and floats, there are two other types of variable in Python worth mentioning here: **lists** and **dictionaries**.

A list is useful for storing multiple objects that you might want to pull items from, or measure. For example if you wanted to add together a series of numbers, you'd probably want to store them in a list in order to do that. Likewise, if you wanted to check whether a document contained any directors from a particular company, you'd probably want to store the names of those directors in a list too.

A list is created by using square brackets, with each item separated by a comma, like so:

In [11]:
mynewlist = [10,20,40]
myotherlist = ["Paul", "Sarah", "Diane"]

Lists are often used with **loops**, which we'll come on to.
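
Before we get to loops, here is a small sketch of some things you can do with a list on its own. The list name `directors` is just an example, echoing the directors scenario mentioned above.

```python
directors = ["Paul", "Sarah", "Diane"]

print(directors[0])            # items are counted from 0, so this is the first item
print(len(directors))          # how many items the list contains
print("Sarah" in directors)    # True if the list contains that item...
print("Alex" in directors)     # ...and False if it does not
```

The square brackets after the list name contain an *index*: the position of the item you want. Counting starts at 0, so `directors[0]` is "Paul" and `directors[2]` is "Diane".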

The other type of variable - a **dictionary** - works a little like a list, but instead of single items it stores *key-value pairs*. It's easiest to demonstrate with an example:

In [None]:
mynewdict = {"name" : "Paul", "age" : 18, "hometown" : "Birmingham"}

The *key* is like a column heading in a spreadsheet: name, age, hometown. The *value* is like the cells underneath that column heading: "Paul", 18, "Birmingham". A **colon** is used to create the *key-value pair*, and commas are used to separate each pair into a list.

Dictionaries are useful in storing data. In a scraper, for example, you might grab one piece of information from a webpage and store it in a dictionary variable against a relevant key, then grab a different piece of information and store that against another key, and so on. If you get into scraping you'll get more experience with this.
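
As a sketch of how that might look, the code below fetches a value using its key, then stores new values against keys. The variable name `record` and the 'occupation' value are made up for illustration.

```python
record = {"name": "Paul", "age": 18, "hometown": "Birmingham"}

# Fetch a value by putting its key in square brackets
print(record["name"])

# Store a new piece of information against a new key
record["occupation"] = "journalist"

# Storing against an existing key replaces the old value
record["age"] = 19

print(record)
```

This is the same pattern a scraper tends to follow: grab one piece of information, store it against a key, then move on to the next.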

## Loops

Loops are a way to perform repetitive actions in code. They are extremely useful for situations involving lists or ranges of numbers: for example you might loop through a list of items and check each one, or save it in a datastore; or loop through a range of numbers and add them to a URL to create a page number.

Here is an example:

In [12]:
for i in mynewlist:
    print(i)

10
20
40


A loop has a number of parts which are worth breaking down:

* `for` and `in`
* the variable here called `i`
* the colon
* the indented code after the colon

The word `for` indicates that we want to begin looping. What we want to loop *through* is indicated by `in`. So in this case we want to loop through items in the list variable `mynewlist`. I'll come back to `i` in a minute.

At the end of this line comes a colon, and that begins an indented section which contains the code we want to be executed for *each* item in that list. In this case the command `print(i)` will run 3 times - once for each of the 3 items in the list.

Now what about that `i`?

When we loop through a list we need some way to store each item while we're working with it. The word between `for` and `in` is a way of assigning that item to a variable - so `i` is a name for the item the loop is currently working with. The first time the loop runs, `i` is `10` (the first item in the list); the second time, `i` is `20`, and so on. 

Within the indented code, then, we can use that `i` variable and do things with it, each time the loop runs. In other words, we can do things with *each item* in a list.

The choice of `i` is entirely arbitrary: we can use any name we want. It's quite common for people to use `i` in loops because it's short and also represents an 'item' in a 'list', but we can make other choices which are more meaningful. For example, if we wanted to loop through a list of names we might write code like this:

In [13]:
usernames = ["Paul", "Sarah", "Diane"]
for username in usernames:
    print("The user is "+username)

The user is Paul
The user is Sarah
The user is Diane
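
The same approach works with a range of numbers, which is how you might generate a series of page URLs as mentioned at the start of this section. This is only a sketch: the `example.com` address and the variable names are invented, and it relies on the built-in functions `range` and `str` and the list method `append`, none of which are covered elsewhere in this tutorial.

```python
baseurl = "http://example.com/results?page="
pageurls = []

# range(1, 4) produces the numbers 1, 2 and 3 (it stops before 4)
for pagenumber in range(1, 4):
    # convert the number to a string so it can be joined to the URL
    pageurl = baseurl + str(pagenumber)
    # add the finished URL to the end of the list
    pageurls.append(pageurl)
    print(pageurl)
```

Each time the loop runs, `pagenumber` holds the next number in the range, so the code prints three URLs ending in 1, 2 and 3.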


## If/else tests

An if/else test in programming allows us to test if something is true or false, and then do different things depending on the answer. Here's an example:

In [14]:
userage = 30
if userage > 25:
    print("The user is over 25")
else:
    print("The user is 25 or younger")

The user is over 25


In the example above we store the number 30 in the variable `userage`. The first `if` test asks if that variable is above 25. Note that it ends in a colon and the indented code after that colon will only run if that test returns `True`.

If it does not return `True`, then it ignores the indented code and moves to the next line that begins `else`. Again, this ends in a colon and some indented code, which will now run.

In [16]:
userage = 21
if userage > 25:
    print("The user is over 25")
else:
    print("The user is 25 or younger")

The user is 25 or younger


These tests can be very useful in all sorts of contexts. For example you can use them to only run certain code if you know it is going to work (asking for the 5th item in a list, for example, means you may need to test that there are at least 5 items in that list).
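
That list example might be sketched like this. The list name `shortlist` is made up; the important part is that the test uses `len` to count the items *before* asking for the 5th one.

```python
shortlist = [10, 20, 40]

# Items are counted from 0, so the 5th item is at position 4
if len(shortlist) >= 5:
    print(shortlist[4])
else:
    print("The list only has", len(shortlist), "items")
```

Without the test, asking for `shortlist[4]` on a list of 3 items would stop the code with an error.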

In addition to `if` and `else` you can insert extra conditions using `elif` (else if) like so:

In [17]:
userage = 21
if userage > 25:
    print("The user is over 25")
elif userage < 18:
    print("The user is below 18")
else:
    print("The user is between 18 and 25")

The user is between 18 and 25


Because these tests are likely to be used more than once, you may want to store them as a **function** so you don't have to write the same code over and over again.

## Functions

In order to do things with variables, it's likely that we will need **functions** of some sort. These are special words which perform some sort of action, such as calculating an average from a collection of numbers, or finding the biggest number in that collection, or measuring the number of characters in a string, or replacing certain characters.

We've already come across one function: `print`. This is used to display something in the console. We used it with the string "hello world" and with variables too.

The function `print` is followed by parentheses containing what it needs to work properly: the object you want it to print.

This is a defining feature of functions: they are always followed by parentheses, and those parentheses contain any *ingredients* that are needed for it to work. Some functions have 1 ingredient, some have 2 or more. A few have none, but they still need the parentheses.

The function `len`, for example, will tell you the length of a string. Here we use it to find out how long our variable is:

In [10]:
len(mynewvariable)

5

The function `sum` will add up all numbers in a list. Below we first create a list variable, and then use the function on it:

In [16]:
mylist = [1,3]
sum(mylist)

4
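
Here is a short sketch of a few other built-in functions, to show how the number of ingredients can vary. The variable names are made up for illustration; `max`, `min` and `round` are built-in functions that are not covered elsewhere in this tutorial.

```python
scores = [10, 20, 40]

print(max(scores))    # one ingredient: the biggest number in the list
print(min(scores))    # one ingredient: the smallest number in the list

# Functions can be combined: the total divided by the number of items
average = sum(scores) / len(scores)
print(average)

# round takes two ingredients, separated by a comma:
# the number to round, and how many decimal places to keep
print(round(3.14159, 2))
```

Notice that `round` needs a comma between its two ingredients, just like the items in a list.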

A list can also refer to variables. Here, for example, we store the areas of a number of countries and regions (in millions of square kilometres) in appropriately named variables. Then we store those in a *list*. Finally we use `sum` to calculate the total of that list, print it, and test whether Africa's area is smaller than that total.

In [8]:
#Is Africa really as big as this: https://twitter.com/simongerman600/status/944535955867881472
africa = 30.37
#Here are the rest based on quick Google searches
usa = 9.834
india = 3.287
china = 9.597
europe = 10.18
mexico = 1.964
japan = 0.377
countrieslist = [usa,india,china,europe,mexico,japan]
print(countrieslist)
allcountries = sum(countrieslist)
print(allcountries)
print(africa<allcountries)

[9.834, 3.287, 9.597, 10.18, 1.964, 0.377]
35.239
True


### Creating your own functions

Earlier I mentioned how if/else tests might be used over and over again, in which case you may want to save them as a user-defined function. These are functions created by you - the user - and they are created with the command `def` like so:

In [18]:
def myagetest(userage):
    if userage > 25:
        print("The user is over 25")
    elif userage < 18:
        print("The user is below 18")
    else:
        print("The user is between 18 and 25")

The `def` is short for *define*. It needs to be followed by a *name* for your new function (so that it can be used - called - later), then some *parentheses* containing any ingredients it needs to work, followed by a *colon*.

As you might guess, after the colon comes some indented code. That indented code is the code you want to run when the function is *called*. If you've specified any ingredients then chances are the code is going to do something with those variables.

Once you've **defined** a function like this you call it by using its name like so:

In [19]:
myagetest(22)

The user is between 18 and 25


What happens here is worth breaking down a little. 

Firstly, the code looks for a function called `myagetest` - it needs to have been defined *before* this line runs. 

Secondly, it *passes* the value `22` to that function, as an ingredient - an *argument*.

Now we look at the function as we defined it above. In the line `def myagetest(userage):` that ingredient is given the name `userage`. In other words, `22` is stored in a variable called `userage`, which can then be used by the code inside the function. 

This is what's called a **local variable**. In other words, it only exists within the scope of the function itself. Once the function is finished, the variable no longer exists. 
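
You can see this for yourself with the sketch below, which repeats the function and then tries to print `userage` *outside* it. The `try` and `except` lines are not covered in this tutorial: I'm using them here only so that the error is caught and displayed instead of stopping the code. The variable `variablesurvived` is made up for illustration.

```python
def myagetest(userage):
    if userage > 25:
        print("The user is over 25")
    elif userage < 18:
        print("The user is below 18")
    else:
        print("The user is between 18 and 25")

myagetest(22)

# userage only existed while the function was running,
# so using it out here causes a NameError
try:
    print(userage)
    variablesurvived = True
except NameError as error:
    print("Outside the function:", error)
    variablesurvived = False
```

The function call prints its message as before, but the second `print` fails with the same kind of `NameError` we saw earlier, because `userage` is not defined outside the function.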

That variable is then tested by the code that comes next, and the relevant `print` command is executed.

### Using functions from other libraries

Python comes with a number of built-in functions, but you can access more functions by importing **libraries**. Using libraries in Python notebooks is a bit more complex than using them in environments like Morph.io, so I've [covered that in a separate notebook]().
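
As a quick taste, here is a minimal sketch using two libraries from Python's *standard library*, `math` and `statistics`. I've chosen these only because they come with Python itself and so need no installation; they are not the libraries used in the Morph.io template.

```python
import math
import statistics

# Functions from a library are called using the library's name, then a dot
squareroot = math.sqrt(16)
print(squareroot)

average = statistics.mean([10, 20, 30])
print(average)
```

The pattern is the same one you saw in the template code: `import` brings the library in, and `libraryname.functionname()` uses a function from it, just as `scraperwiki.scrape()` does.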