# Introduction to Python






Python is a popular, free, general-purpose programming language. Unlike systems that are geared towards a particular task (e.g. R for statistics), Python has a strong following in web development, systems operations and data analysis. For scientific computing, a large number of useful add-ons ("libraries") are available to help you analyse and process data. This is an invaluable resource.

In addition to being free, Python is also very portable, and easy to pick up. 


The aim of this Chapter is to introduce you to some of the fundamental concepts in Python. Mainly, this is based around fundamental data types in Python (`int`, `float`, `str`, `bool` etc.) and ways to group them (`tuple`, `list`, array, string and `dict`).

Although some of the examples we use are very simple to explain a concept, the more developed ones should be directly applicable to the sort of programming you are likely to need to do.

A more advanced section of the chapter is also available, which goes into more detail and some complications. This too has a set of exercises with worked examples.


In this session, you will be introduced to some of the basic concepts in Python.

**The session should last 4 hours (one week).**

## Getting Started

### Input cells

In a notebook, a code cell (i.e. one we can write computer code in) is labelled 

    In [N]:
    
where `N` is a number indicating the order in which the cell was run. So, when we run our first input cell, it will be labelled 

    In [1]:
    
then

    In [2]:
 
etc.

You can 'run' the code in a code cell by placing the cursor in the code block to be run, then either using the `Run` widget (above), using the `Cell->Run Cells` menu selection, or pressing the `<SHIFT>` and `<RETURN>` keys at the same time. 

<span class="girk">Do this in the cells below:</span>

In [14]:
# hello world

In [15]:
# hello world again

<span class="girk">Now, run those cells (above) again and note how the index increases.</span>

You can re-start running cells ('restart the kernel') in a notebook (and clear the output if you like) using options from the `Kernel` menu.


<span class="girk">**Exercise**

* Restart the kernel now and re-run the cells above
* What happens to the indices now?</span>

<span class="burk">**Answer**

When you restart the kernel with `Kernel->Restart & Clear Output`, the indices in the cells above should become blank:

    In [ ]:
    
When you re-run these, they should be numbered:

    In [1]:
    In [2]:

If you restart the kernel without clearing, the index is not blanked first.</span>

--------


### # Comments

Comments are statements ignored by the language interpreter.

Any text after a `#` in a *code cell* (input cell) is a comment.


   

<span class="girk">**Exercise**

* Run the code block below
* Explain what happened ('what the computer did')</span>

In [16]:
# Hello world

<span class="burk">**Answer**

Run the code block with the cursor in the relevant cell:
* with `<SHIFT-RETURN>`, i.e. `<SHIFT>` and `<RETURN>` keys at the same time.
* or `Cell->Run Cells` from the notebook menu
* or click on the `Run` button

Explain what happened:
In this case, the code in the cell was:

    # Hello world
    
When we run this, 'nothing' seems to happen, because anything after `#` in a code line is a comment (i.e. not interpreted).</span>

--------

### String and text blocks as comments

A `string` is a type of variable used to contain some text. A single line of text is defined within quotes (either single quotes `'` or double quotes `"`), and a multi-line string between `'''` triple quotes.

We use strings for a range of purposes, but you will often see them used as 'document strings', which you can think of as a form of comment.

In the code block below, we demonstrate the `#` comment again, and also show two types of string text. 

Because the final string:

        '''
        Hello 
        world
        is printed
        '''
        
is at the end of the cell, this string is 'printed' to the output cell `Out [3]:` (assuming the index is at `3` here).

<span class="girk">Run this cell to confirm this happens.</span>

In [17]:
# single line text
'hello world is not printed'

# multi-line text
'''
Hello 
world
is printed
'''

'\nHello \nworld\nis printed\n'

 
<span class="girk">**Exercise**
 
* Copy the text from above in the window below

* Then put a new text block at the end and note down what happens
        
* What does the `\n` mean/do?</span>


In [5]:
# do the exercise here

In [6]:
# Answer:

# ---------------------
# copy the previous text 
# single line text
'hello world is not printed'

# multi-line text
'''
Hello 
world
is printed
'''

# ---------------------
# a new text block
# A newline character \n is placed in the 
# string for each new line in multi-line text
# The final entry in the input cell is output to 
# the corresponding output cell
'''
This is new text
'''

'\nThis is new text\n'

--------

### Print function

To print some statement to the screen you are using, use the `print()` function.

<span class="girk">Run the two cells below to execute some print statements.</span>

In [18]:
# print a single string
print('hello world')

hello world


In [19]:
# print a list of strings
print('hello','world')

hello world


In Python 3.X, `print` is a function with the *argument(s)* (here, the string you want printed) enclosed in the function's (round) brackets.

<span class="girk">**Exercise**

* Copy the print statement from the above code block.

* Change the words in the quotes and print them out.

* Add some comments to the code block explaining what you have done and seen.</span>


In [12]:
# do the exercise here

In [11]:
# Answer

'''
The print statement below sends text to the output channel (the 'screen' as it were). 
In this case, it prints the string 'good', followed by a space, then the string 'morning'
'''
print('good','morning')

good morning


### Variables and values

The idea of **variables** is fundamental to any programming. You can think of this as the *name* of *something*, so it is a way of allowing us to refer to some object in the language.

What the variable *is* set to is called its **value**.

So let's start with a variable we will call (*declare to be*) `x`.

We will give a *value* of the string `'one'` to this variable:

In [13]:
x = 'one'

print(x)

one


**E1.1.4 Exercise**

* set a variable called `x` to some different string (e.g. `'hello world'`)
    
* print the value of the variable `x`
    
* Try this again, putting some 'newlines' (`\n`) in the string

In [None]:
# do the exercise here

In [None]:
# Now we set x to the value 1

x = 1

print(x,'is type',type(x))

In a computing language, the *sort of thing* the variable can be set to is called its **data type**. 

In Python, we can find this with the built-in function `type()`, as in the example above.

In the example above, the datatype is an **integer** number (e.g. `1, 2, 3, 4`).

In 'natural language', we might read the example above as 'x is one'.


**E1.1.5 Exercise**

* set a variable called `x` to the integer `5`
    
* print the value and type of the variable `x`
    
* change the data type used for `x` to something else (e.g. a string)

In [None]:
# do the exercise here


Setting `x = 1` is different to:

In [None]:
x = 'one'

because here we have set the value of the variable `x` to a **string** (i.e. some text).

A string is enclosed in quotes, e.g. `"one"` or `'one'`, or even `"'one'"` or `'"one"'`.



In [None]:
print ("one")
print ('one')
print ("'one'")
print ('"one"')

**E1.1.6 Exercise**

* create a variable `name` containing your name, as a string.
        
* using this variable and the print function, print out a statement such as `my name is Fred` (if your name were `Fred`)
        

In [None]:
# do the exercise here

Setting `x = 1` or `x = 'one'` is different to:

In [None]:
x = 1.0

because here we have set the value of the variable `x` to a **floating point** number (these are treated and stored differently to integers in computing).

This in turn is different to:

In [None]:
x = True

where `True` is a **logical** or **boolean** datatype (something is `True` or `False`).



**E1.1.7 Exercise**

* in the code block below, create a variable called `my_var` and set it to some value (your choice of value, but be clear about the data type you intend)
           
* print the value of the variable to the screen, along with the data type.

In [None]:
# do the exercise here

We have so far seen four datatypes:

- integer (`int`): whole numbers of any size (Python 3 integers are not limited to 32 or 64 bits)
- (double-precision) floating point (`float`): 64 bits long
- Boolean (`bool`)
- string (`str`)

but we will come across more (and even create our own!) as we go through the course.

In each of these cases above, we have used the variable `x` to contain these different data types.

As we saw above, if you want to know what the data type of a variable is, you can use the built-in function `type()`:

In [None]:
print (type(1))
print (type(1.0))
print (type('one'))
print (type(True))

You can explicitly convert between data types, e.g.:

In [None]:
print ('int(1.1) = ',int(1.1))
print ('float(1) = ',float(1))
print ('str(1) = ',str(1))
print ('bool(1) = ',bool(1))

but only when it makes sense:

In [None]:
print ("converting the string '1' to an integer makes sense:",int('1'))

In [None]:
print ("converting the string 'one' to an integer doesn't:",int('one'))

When you get an error (such as above), you will need to learn to *read* the error message to work out what you did wrong.
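To see how reading the error message helps, the sketch below catches the error instead of letting the cell stop, and prints the message. The `try ... except` construct has not been introduced in this section and the variable names are ours, so treat this as a preview rather than something you need to use yet.

```python
text = 'one'

try:
    # this conversion fails, because 'one' is not a written integer
    value = int(text)
except ValueError as error:
    # the type of error (ValueError) and its message tell us what went wrong
    value = None
    print('ValueError:', error)

print('value is', value)
```

The last line of a Python error report is usually the most informative: it names the type of error and, here, the value that could not be converted.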

**E1.1.8 Exercise**

* why did the statement above not work?
    
* type some other data conversions below that *do* work.

In [None]:
# do the exercise here

### 1.1.3 Arithmetic

Often we will want to do some [arithmetic](http://www.tutorialspoint.com/python/python_basic_operators.htm) with numbers in a program, and we use the 'normal' (derived from C) operators for this.

Note the way this works for integers and floating point representations.

In [None]:
'''
    Some examples of arithmetic operations in Python

    Note how, if we mix float and int, the result is raised to float
    (as the more general form)
'''

print (10 + 100)     # int addition
print (10. - 100)    # float subtraction
print (1./2.)        # float division
print (1/2)          # true division: gives the float 0.5, even for ints
print (10.*20.)      # float multiplication
print (2 ** 3.)      # float exponent

print (65%2)         # int remainder
print (65//2)        # floor (integer) division

**E1.1.9 Exercise**

* change the numbers in the examples above to make sure you understand these basic operations.

* try combining operations and use brackets `()` to check that that works as expected.

* see what happens when you add (i.e. use `+`) strings together

In [None]:
# do the exercise here

### 1.1.4 Assignment Operators

In [None]:
'''
    Assignment operators

    x = 3   assigns the value 3 to the variable x
    x += 2  adds 2 onto the value of x
            so is the same as x = x + 2
            similarly /=, *=, -=
    x %= 2  is the same as x = x % 2
    x **= 2 is the same as x = x ** 2
    x //= 2 is the same as x = x // 2

    A 'magic' trick
    ===============

    based on
    https://www.wikihow.com/Read-Someone%27s-Mind-With-Math-(Math-Trick)

    whatever you put as myNumber, the answer is 42

    Try this with integers or floating point numbers ...
'''

# pick a number 
myNumber = 34.67

x = myNumber

x *= 2

x *= 5

x /= myNumber

x -= 7

x += 39

# The answer will always be 42
print(x)

**E1.1.10 Exercise**

* change the number assigned to `myNumber` and check if `42` is still returned
* copy and edit the code to print the value of `x` each time you change it, and add comments explaining what is happening for each line of code. This should allow you to follow more carefully what has happened with the arithmetic and also to simplify the code (use fewer statements to achieve the same thing).

In [None]:
# do the exercise here

### 1.1.5 Logical Operators

Logical operators combine boolean variables. Recall from above:

In [None]:
print (type(True),type(False))

The three main logical operators you will use are:

    not, and, or
    
The effect of the `not` operator should be straightforward to understand. We can write it in a 'truth table':



| A  | not A  | 
|:---:|:---:|
|  T | F | 
|   F |  T | 

In [None]:
print('not True is',not True)
print('not False is',not False)

**E1.1.11 Exercise:**
    
* write a statement to set a variable `x` to `True` and print the value of `x` and `not x`
       
* what does `not not x` give? Make sure you understand why 
    





In [None]:
# do the exercise here

The operators `and` and `or` should also be quite straightforward to understand: they have the same meaning as in normal English. Note that `or` is 'inclusive' (so, read `A or B` as 'either A or B or both of them').

In [None]:
print ('True and True is',True and True)
print ('True and False is',True and False)
print ('False and True is',False and True)
print ('False and False is',False and False)

So, `A and B` is `True`, if and only if both `A` is `True` and `B` is `True`. Otherwise, it is `False`

We can represent this in a 'truth table':


| A  | B  | A and B  | 
|:---:|:---:|:---:|
|  T |  T |  T | 
|  T |  F |  F | 
|  F |  T |  F | 
|  F |  F |  F | 
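As a check on the table above, the sketch below builds the same rows in code. It uses a `for` loop, a `list` and the list method `append()`, which are only introduced later in these notes, and the names `values` and `table` are ours, so treat it as an illustration rather than something you are expected to write at this stage.

```python
# the two possible boolean values
values = [True, False]

# collect one row (A, B, A and B) for each combination
table = []
for A in values:
    for B in values:
        table.append((A, B, A and B))

# print the rows in the same order as the truth table
for row in table:
    print(row)
```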


**E1.1.12 Exercise:**

* draw a truth table *on some paper*, label the columns `A`, `B` and `A and B` and fill in the columns `A` and `B` as above
* without looking at the example above, write the value of `A and B` in the third column.
* draw another truth table *on some paper*, label the columns `A`, `B` and `A or B` and fill in the columns `A` and `B` as above
* write the value of `A or B` in the third column.

If you are unsure, test the response using code, below.

In [None]:
# do the testing here e.g.
print (True or False)

**E1.1.13 Exercise**

* Copy the following truth table onto paper and fill in the final column:

| A  | B  | C | ((A and B) or C)  | 
|:---:|:---:|:---:|:---:|
| T|  T |  T |   | 
| T|  T |  F |   | 
| T|  F |  T |   | 
| T |  F |  F |   | 
| F|  T |  T |   | 
| F|  T |  F |   | 
| F|  F |  T |   | 
| F |  F |  F |   | 

* Try some other compound statements

If you are unsure, or to check your answers, test the response using code, below.


In [None]:
# do the testing here e.g.
print ((True and False) or True)

### 1.1.6 Comparison Operators and `if`

A comparison operator 'compares' two terms (e.g. variables) and returns a boolean data type (`True` or `False`).

For example, to see if the value of some variable `a` is 'the same value as' ('equivalent to') the value of some variable `b`, we use the equivalence operator (`==`). To test for non equivalence, we use the not equivalent operator `!=` (read the `!` as 'not'):


In [None]:
a = 100
b = 10

# Note the use of \n and \t in here
# 
print ('a is',a,'and\nb is',b,'\n')
print ('\ta is equivalent to b?',a == b)

**E1.1.14 Exercise**

* copy the code above and change the values (or type) of the variables `a` and `b` to test their equivalence.
* what does the `\t` in the print statement do?
* add a `print` statement to your code that tests for 'non equivalence'
* write some code to see if `(a or b)` is equivalent to `(b or a)` or not

In [None]:
# do the exercise here

A full set of comparison operators is:

|symbol| meaning|
|:---:|:---:|
| is | [is identical to](https://www.geeksforgeeks.org/difference-operator-python/) |
| is not | is not identical to |
| == | is equivalent to |
| != | is not equivalent to |
| > | greater than |
|>= | greater than or equal to|
|<  | less than|
|<=  | less than or equal to    |

so that, for example:

In [None]:
# Comparison examples

# is one plus one list identical to two list?
print ([1 + 1] is [2])

# is one plus one list equal to two list?
print ([1 + 1] == [2])

# is one less than or equal to 0.999?
print (1 <= 0.999)

# is one plus one not equal to two?
print (1 + 1 != 2)

# is 'more' greater than 'less'?
print ("more" > "less")

# "is 100 less than 2?"
print (100 < 2)


**Aside on string comparisons**

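In the case of string comparisons, the character codes of the strings are compared, one character at a time, starting from the first. For the characters used here, these are the [ASCII](http://www.asciitable.com) codes. So, for example, the statement `"more" > "less"` returns `True`.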

Here, the comparison is effectively

    m > l

Since `m` comes after `l` in the alphabet, the ASCII
code for `m` (109) is greater than the ASCII code 
for `l` (108) (see http://www.asciitable.com) so

    109 > 108

returns True. Note that ASCII capital letters come before the lower case letters. 

In practice, we mainly avoid string comparisons (other than to confirm equivalence). So there is little direct use of string comparisons other than `==`. It is useful to know how this works however, in case it crops up or happens 'by accident'. It is also worth understanding what ASCII codes are.
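The codes quoted above can be checked with the built-in function `ord()`, which returns the integer code of a single character. This is a minimal sketch; `ord()` is not used elsewhere in these notes, and the variable names and the example words `Zebra` and `apple` are ours.

```python
# ord() gives the integer code of a single character
code_m = ord('m')
code_l = ord('l')
print('m is', code_m, 'and l is', code_l)

# comparing the strings gives the same answer as
# comparing the codes of their first characters
print("more" > "less", code_m > code_l)

# capital letters have lower codes than lower case letters,
# so 'Zebra' sorts before 'apple'
print(ord('Z'), ord('a'), 'Zebra' < 'apple')
```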


**Conditional test**

One common use of comparisons is for program control, using an `if` statement:

    if condition1 is True:
        doit1()
    elif condition2 is True:
        doit2()
    else:
        doit3()
        
where `is` compares identity (in practice, the test is usually written more simply as `if condition1:`). This allows us to run blocks of code (e.g. the function `doit1()`) only under a particular condition (or set of conditions).

In Python, the statement(s) we run on condition (here `doit1()` etc.) are *indented*. 

The indent can be one or more spaces or a `<tab>` character, the choice is up to the programmer. However, it **must be consistent**.



In [None]:
test = [1+1]
print('test is {}'.format(test))

# initialise retval
retval = None

# conduct some tests, and set the 
# variable retval to True if we pass
# any test

if test is [2]:
    retval = True
    print('passed test 1: "if test is [2]"')
elif test == [2]:
    retval = True
    print('passed test 2: "if test == [2]"')
else:
    retval = False
    print('failed both tests')
    
print('retval is',retval)

**E1.1.15 Exercise**

* copy the example above, and change it to use other examples from the 'Comparison examples' code block. Change the value of `test` to get different responses and make notes as to why you get the result you do.
* try out some more complicated conditions, e.g. multiple tests, combined with an `and` operator.

In [None]:
# do the exercise here

### 1.1.7 Summary

In this section, you have had an introduction to the Python programming language, running in a [`jupyter notebook`](http://jupyter.org) environment.

You have seen how to write comments in code, how to form `print` statements and basic concepts of variables, values, and data types. You have seen how to manipulate data with arithmetic and assignment operators, as well as the basics of dealing with logic and tests returning logical values.



## 1.2  Text and looping
In Python, collections of characters (`a`, `b`, `1`, ...) are called strings. Strings and characters are input by surrounding the relevant text in either double (`"`) or single (`'`) quotes. There are a number of special characters that can be encoded in a string provided they're "escaped". For example, some we have come across are:

* `\n`: the newline character
* `\t`: a tab character

In [None]:
print ("I'm a happy string")
print ('I\'m a happy string') # the apostrophe is escaped so that it is not mistaken for the end of the string
print ("\tI'm a happy string")
print ("I'm\na\nhappy\nstring")

We can do a number of things with strings, which are very useful. These so-called string methods are defined on all strings by Python by default, and can be used with every string.
For one, we can concatenate strings using the `+` symbol as we saw above.
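A minimal sketch of concatenation with `+` (the variable names and the example text are ours, chosen for illustration):

```python
name = 'Fred'

# + joins the strings end to end, exactly as they are
greeting = 'my name is ' + name
print(greeting)

# a number must be converted with str() before it can be joined to a string
label = 'item number ' + str(3)
print(label)
```

Note that `+` only joins strings to strings: trying `'item number ' + 3` without the `str()` conversion raises a `TypeError`.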


### 1.2.1  `len`

Gives the length of the string as number of characters:

In [None]:
t = ''
print ('the length of',t,'is',len(t))


s = "Hello" + "there" + "everyone"

print ('the length of',s,'is',len(s))

**Exercise E1.2.1**

* what does a zero-length string look like?
* The `Hello there everyone` example above has no spaces between the words. Copy the code to the block below and modify it to have spaces.
* confirm that you get the expected increase in length.

In [None]:
# do exercise here

### 1.2.2 `for ... in ...` and `enumerate`

Very commonly, we need to iterate or 'loop' over some set of items.

The basic structure for doing this (in Python, and many other languages) is `for item in group:`, where `item` is the name of some variable and `group` is a set of values. 

The loop is run so that `item` takes on the first value in `group`, then the second, etc.

In [None]:
# for loop
group = [4,3,2,1]

for item in group:
    '''print counter in loop'''
    print(item)
    
print ('blast off!')

The `group` in this example is the list of integer numbers `[4,3,2,1]`. A `list` is a group of comma-separated items contained in square brackets `[]`.

In Python, the statement(s) we run whilst looping (here `print(item)`) are *indented*. 

The indent can be one or more spaces or a `<tab>` character, the choice is up to the programmer. However, it **must be consistent**.

**Exercise 1.2.1**

* generate a list of strings called `group` with the names of (some of) the items in your pocket or bag (or make some up!)
* set up a `for` loop to go through and print each item

In [None]:
# do exercise here

Quite often, we want to keep track of the 'index' of the item in the loop (the 'item number').

One way to do this would be to use a variable (called `count` here). 

Before we enter the loop, we initialise the value to zero.

In [None]:
# for loop
group = ['hat','dog','keys']

# initialise a variable count
count = 0

for item in group:
    '''print counter in loop'''
    
    print('item',count,'is',item)
    
    # add 1 onto count
    count += 1

**Exercise 1.2.2**

* copy the code above, and check to see if the value of `count` at the end of the loop is the same as the length of the list. Why should this be so?
* change the code so that the counting starts at 1, rather than 0. 


In [None]:
# do exercise here

Since counting in loops is a common task, we can use the built-in function `enumerate()` to achieve the same thing as above. The syntax is then:

In [None]:
# for loop
group = ['hat','dog','keys']

for count,item in enumerate(group):
    '''print counter in loop'''
    print('item',count,'is',item)


**Exercise 1.2.3**

* copy the code above, and check to see if the value of `count` at the end of the loop is the same as the length of the list.
* change the code so that the printed count starts at 1, rather than 0. 

Hint: how can you make it print `count+1` rather than `count`? 

In [None]:
# do exercise here

### 1.2.3 `slice` 

A string can be thought of as an ordered 'array' of characters. 

So, for example the string `hello` can be thought of as a construct containing `h` then `e`, `l`, `l`, and `o`. 

We can index a string, so that e.g. `'hello'[0]` is `h`, `'hello'[1]` is `e` etc.

We have seen above the idea of the 'length' of a string. In this example, the length of the string `hello` is 5.

In [None]:
string = 'hello'

# length
slen = len(string)
print('length of {} is {}'.format(string,slen))

# select these indices
indices = 0,1,3

# loop over each item in indices
for index in indices:
    print('character {} of {} is {}'.format(index,string,string[index]))


**Exercise E1.2.4**

* copy the code above, and see what happens if you set a value in `indices` that is equal to the length of the string. Why does it respond so?
* make the code robust to this issue, by using an `if` statement to test if `index` is in the required range.

In [None]:
# do exercise here



We can use the idea of a 'slice' to access particular elements within the string.

For a slice, we can specify:

* start index (0 is the first)
* stop index (not including this)
* skip (do every 'skip' character)

When specifying this as array access, this is given as, e.g.:

`array[start:stop:skip]`

* The default start is 0
* The default stop is the length of the array
* The default skip is 1

You can specify a slice with the default values by leaving the terms out:

`array[::2]`

would give values in the array `array` from 0 to the end, in steps of 2.

This idea is fundamental to array processing in Python. We will see later that the same mechanism applies to all ordered groups.
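To illustrate that slicing is not specific to strings, here is a small sketch applying the same `[start:stop:skip]` syntax to a `list`. The list `numbers` and the other variable names are ours, chosen for illustration.

```python
# a list is also an ordered group, so it can be sliced
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# every second item, from the default start to the default stop
evens = numbers[::2]
print(evens)

# items from index 2 up to (but not including) index 5
middle = numbers[2:5]
print(middle)

# a negative start counts back from the end
last_three = numbers[-3:]
print(last_three)
```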


In [None]:
s = "Hello World"
print (s,len(s))

start = 0
stop  = 11
skip  = 2
print (s[start:stop:skip])

# use -ve numbers to specify from the end
# use None to take the default value

start = -3
stop  = None
skip  = 1
print (s[start:stop:skip])

**Exercise E1.2.5**

The example above allows us to access individual characters of the string.

* copy the example above, and print the string starting from the default start value, up to the default stop value, in steps of `2`.
* write code to print out the 4$^{th}$ letter (character) of the string `s`.


In [None]:
# do exercise here

### 1.2.4 `replace`

We can replace all occurrences of a string within a string by some other string. We can also replace a string by an empty string, thus in effect removing it:

In [None]:
print ("I'm a very happy string".replace("happy", "unhappy"))
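The two behaviours described above (replacing every occurrence, and removing text by replacing it with an empty string) can be sketched as follows. The example sentence and variable names are ours.

```python
text = 'the cat sat on the mat'

# all occurrences of 'the' are replaced, not just the first
swapped = text.replace('the', 'a')
print(swapped)

# replacing with an empty string removes the text
removed = text.replace(' sat', '')
print(removed)

# replace() returns a new string: text itself is unchanged
print(text)
```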

**Exercise E1.2.6**

* copy the statement above, and use the `replace` method to make it print out `"I'm a happy string"`. 

Hint: you want to replace the string `very ` with, effectively, nothing, i.e. a zero-length string.

In [None]:
# do exercise here

### 1.2.5 `find`

Quite often, we might want to find a string inside another string, and potentially give the location (as in characters from the start of the string) where this string occurs. We can use the `find` method, which will return either a `-1` if the string isn't found, or an integer giving the index of where the string starts (for the first time).

In [None]:
print ("I'm a very happy string".find("a"))
print ("I'm a very happy string".find("happy"))
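The examples above both find their target. The sketch below shows the other case, where `find()` returns `-1`, and one way of testing for it before using the result. The variable names and the search string `'sad'` are ours.

```python
text = "I'm a very happy string"

# 'sad' does not occur in text, so find() returns -1
index = text.find('sad')
print(index)

# test the result before using it as an index
if index == -1:
    print('not found')
else:
    print('found at index', index)

# when the string is found, a slice recovers it
where = text.find('happy')
word = text[where:where + len('happy')]
print(where, word)
```

Testing for `-1` matters because `-1` is itself a valid index (the last character), so using an unchecked result in a slice can silently give the wrong answer rather than an error.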

Let's use the idea of `find()` to sort out a messy table of data that we get from a web page.

First, we need to import the package `requests` to access some information from a [URL](https://en.wikipedia.org/wiki/URL) (from a web page). The data we get will be in [html](https://en.wikipedia.org/wiki/HTML).

The data we will examine is a dataset of [ENSO](https://en.wikipedia.org/wiki/ENSO) values for each month of the year from January 1950 to present, made available by [NOAA](https://en.wikipedia.org/wiki/NOAA).

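If you visit [http://www.esrl.noaa.gov/psd/enso/mei.old/table.html](http://www.esrl.noaa.gov/psd/enso/mei.old/table.html) you will see the data table we are interested in. So, how do we 'grab' this?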

The [URL](https://en.wikipedia.org/wiki/URL) points to [html](https://en.wikipedia.org/wiki/HTML) code. When you display this in a browser, it is rendered appropriately. 

If you access the html directly, you will get the following:

In [None]:
# Web scraping example

import requests

url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"

# This line will pull the URL data as a string
txt = requests.get(url).text

# show the first 1000 characters (see 'slice' above: this is the same as [None:1000:None])
print(txt[:1000])

We notice the presence of html codes in the text string (e.g. `<html>`, `<pre>`). There are particular packages for neatly parsing html (scraping information from web pages), one of the most common being [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). This will tend to be more useful if the html is well formatted, and the data contained in `<table>` sections, or similar structures. Here, we just have a block of text in the `<pre>` section.

If we want to *just* access the dataset here then, we might notice that the data we want to access starts when we see the string `YEAR`.

We can use `find()` to discover the index of this in the string:

In [None]:
start = txt.find('YEAR')

print('start of useful data at index {}\n---------------------------------'.format(start))
print(txt[start:start+1000])

If we look again at the web page [http://www.esrl.noaa.gov/psd/enso/mei.old/table.html](http://www.esrl.noaa.gov/psd/enso/mei/table.html), we might notice that the end of the useful data is delimited by two newlines and the string `(1)`, i.e., as a string `\n\n(1)`. So we should be able to use `find()` again to get the location of the end of the data (i.e. `stop`, in the sense of a slice).

**Exercise 1.2.6**

* use this observation to form a string called `data_table`, containing all of the useful data (i.e. `txt[start:stop]`).
* print the string `data_table`.


In [None]:
# do exercise here

This exercise is a very good example of [web scraping](https://en.wikipedia.org/wiki/Web_scraping). Web scraping is often rather messy (you have to work out some 'key' to reliably delimit the information you want) but can be extremely valuable for accessing datasets that are not cleanly presented. We have only gone part of the way to extracting a useful dataset here, because the dataset we are interested in (the ENSO data) is still represented as a string, whereas we really want it to be a set of floating point numbers. We will deal with this later.
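The idea of delimiting the useful part of a page with two 'keys' can be tried without a network connection. In the sketch below, `page` is a made-up string that only imitates the layout of the real page (the numbers in it are invented), and the names `page` and `table` are ours.

```python
# a made-up page, standing in for the text returned by requests
page = ('<html><pre>\n'
        'some notes\n'
        '\n'
        'YEAR  JAN  FEB\n'
        '1950 -1.0 -1.1\n'
        '1951 -1.1 -1.2\n'
        '\n'
        '(1) a footnote\n'
        '</pre></html>')

# the keys that mark the start and end of the useful data
start = page.find('YEAR')
stop = page.find('\n\n(1)')

# only slice if both keys were found (find() returns -1 otherwise)
if start != -1 and stop != -1:
    table = page[start:stop]
else:
    table = ''

print(table)
```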



### 1.2.6 `split` and `splitlines`

The first 'line' of `data_table` should contain the 'header' information, i.e. the title of the data columns (`YEAR`, `DECJAN` etc.). We want to separate the header from the numbers in the data table, so we want to 'split' the string called `data_table` into a header string and data string. 

One approach to this would be split the string into 'lines' of text (rather than one block). Effectively that means splitting into multiple strings whenever we hit a `\n` character. Rather than do that explicitly, we use the `splitlines()` method:
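Before applying this to the real data, here is a minimal sketch of `splitlines()` on a short, made-up string (the text in `block` is invented for illustration, and the variable names are ours):

```python
# three 'lines' of text, separated by newline characters
block = 'YEAR  JAN  FEB\n1950 -1.0 -1.1\n1951 -1.1 -1.2'

# split into a list with one string per line
lines = block.splitlines()

print(len(lines))
print(lines[0])      # the header line
print(lines[1:])     # the remaining lines

# for this text, splitting explicitly on '\n' gives the same list
print(lines == block.split('\n'))
```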



In [None]:
import requests
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start = txt.find('YEAR')
stop  = txt.find('\n\n(1)')
data_table = txt[start:stop]

# split into a list of strings
data_lines = data_table.splitlines()

# tell me something useful
print(type(data_lines),len(data_lines))

# loop over some examples
for i in 0,1,len(data_lines)-1:
    print('line {} {}\n\t{}'.format(i,type(data_lines[i]),data_lines[i]))

This splits each 'line' of text into an entry in a `list`, so that the header data is now given in the first entry (`data_lines[0]`) and the lines containing data after that.


From the print out above, we notice that the final 'data line' (index `-1`) is shorter than (has fewer entries than) the other lines. This is because we are only part way through this year!

In 'real' datasets, we quite often have 'messy' lines of data such as this (or data missing for other reasons). How you want to deal with the 'messy bits' depends on the sort of analysis you want to do. 

One option (the simplest) would be to simply remove the last line (ignore this year's data):

In [None]:
header = data_lines[0]

# select the data block as being from entry 1 to -1
# so, **not including the last row**
data = data_lines[1:-1]

print('header:',header)

for i in 0,1,len(data)-1:
    print('line {} {}\n\t{}'.format(i,type(data[i]),data[i]))

**Exercise 1.2.7**

* copy the code from above and explore the response using line indices `-1` and `-2`.

In [None]:
# do exercise here

If we want to manipulate or plot the information contained in this (the numbers), we need to convert each of the string representations to a floating point number, e.g. the number `-1.03` rather than the string `'-1.03'`.

Each entry in the list `data` is a string, as we saw above.

We can split an individual string (such as `data[0]`) into a list of strings, using the string method `split()`. By default, this splits on 'white space' (i.e. spaces or tab characters), so, e.g.:



In [None]:
line = data[0].split()
print(data[0])
print(line,len(line))

So, we have split the long string into 13 strings in a list. 

We want to generate a new list with 13 corresponding floating point values:

In [None]:
# split the line on whitespace
line = data[0].split()

# make a new list of the same length
# by copying the variable line
float_data = line.copy()

for index,line_data in enumerate(line):
    # insert the cast float into the list
    # in the right order (use index)
    float_data[index] = float(line_data)
    
# this is the string list
print(line)

# this is the float list
print(float_data)

**Exercise 1.2.8**

* set a variable to be the string `"2, 3, 5, 7, 11, 13, 17, 19, 23, 29"`
* use the approach above to generate a **list of integers** of the first 10 prime numbers. 
* print the list with syntax of the pattern of 'prime number 3 is 7'

Make sure you convert each prime number to an integer, rather than leaving it as a string!

Hint: We can still use the method `split()` to split the string into a list of strings, but this time the [separator](https://python-reference.readthedocs.io/en/latest/docs/str/split.html) is a comma, rather than whitespace. 
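A short sketch of splitting on a comma, using a made-up string rather than the one in the exercise. Note that splitting on `','` alone leaves the space after each comma attached to the items; `strip()` removes it, and `int()` ignores surrounding whitespace when casting:

```python
# an illustrative string (not the one from the exercise)
text = "10, 20, 30"

# split on commas rather than whitespace
parts = text.split(',')

# note the leading spaces on some of the items
print(parts)

# strip() removes surrounding whitespace from a single string
stripped = parts[1].strip()
print(stripped)

# int() ignores surrounding whitespace when casting
value = int(parts[1])
print(value)
```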

In [None]:
# do exercise here
pstring = "2, 3, 5, 7, 11, 13, 17, 19, 23, 29"

Normally, we wouldn't go to the trouble of first copying the list. 

Instead, **where the contents of the loop are simple** (e.g. a single statement) we would use a different way of writing a `for` loop, called an **implicit loop** (known more formally in Python as a *list comprehension*).

In this case:

    for item in group:
        doit(item)
        
becomes:

    [doit(item) for item in group]
    
with the additional feature that everything returned by `doit(item)` for each item of `group` is put in a list.

In [None]:
# split the line on whitespace
line = data[0].split()

# implicit for loop
float_data = [float(line_data) for line_data in line]
    
# this is the string list
print(line)
# this is the float list
print(float_data)

The statement:

    float_data = [float(line_data) for line_data in line]

is much more [Pythonic](https://docs.python-guide.org/writing/style/) than the code above. It is simple, elegant and neat.
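To check that the two forms really do give the same result, here is a small sketch using a made-up list of strings (`line` here is illustrative, not the ENSO data). The explicit loop starts from an empty list and uses the list method `append()`, which we meet properly later in this chapter:

```python
# a made-up line of data, already split into strings
line = ['1950', '-1.03', '-1.133']

# explicit loop: start with an empty list and append to it
loop_result = []
for line_data in line:
    loop_result.append(float(line_data))

# implicit loop: the same thing in one statement
implicit_result = [float(line_data) for line_data in line]

print(loop_result)
print(implicit_result)
print(loop_result == implicit_result)
```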

We can *nest* for statements, i.e. put one for loop inside another. This allows us to treat data of multiple dimensions.

In the examples above, we converted only the data in `data[0]` to a list of floating point numbers.
If we wanted to process *all* lines of data, we would have to loop over them as well, in an 'outer' loop.



In [None]:
# use a step of 10 for illustration purposes
# to save space when printing

step = 10

for index,line in enumerate(data_table.splitlines()[1:-1:step]):
    # convert each line to list of floats
    float_data = [float(line_data) for line_data in line.split()]
    print('line {} is {}'.format(index*step,float_data))

Note that whilst we have calculated `float_data` in the loop for each line, it gets over-written with each new line as things stand.

We can do the same thing, and generate a list of the responses more neatly, using an implicit loop inside another implicit loop:

In [None]:
all_float_data = [[float(line_data) for line_data in line.split()] for line in data_table.splitlines()[1:-1]]

The variable `all_float_data` is now a sort of 'two dimensional' list, within which we can refer to individual items as e.g. `all_float_data[10][3]` for row `10`, column `3`.
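The indexing can be sketched with a small made-up table (the values in `table` are illustrative only). The first index selects the row, which is itself a list, and the second index selects the column within that row:

```python
# a small made-up table: 3 rows, each with 4 columns
table = [[1950.0, -1.0, -1.1, -1.3],
         [1951.0, -1.1, -1.2, -1.2],
         [1952.0,  0.4,  0.1,  0.1]]

# the whole of row 2
print(table[2])

# row 2, column 0
print(table[2][0])

# number of rows, and number of columns in the first row
nrows, ncols = len(table), len(table[0])
print(nrows, ncols)
```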

Let's use this idea to print out column 0 of each row (containing the `YEAR` data). We will use the function `range(nrows)`, which generates the sequence of integers `0, 1, 2, 3, ..., nrows-1`. 

Notice the use of `end=' '` in the `print` call. This replaces the usual newline by whatever is specified by the keyword `end`. Note also that we have used `{:.0f}` to specify the format term. This indicates that the term is to be printed as a floating point number (the `f`) with zero digits after the decimal point (`.0`).

In [None]:
nrows = len(all_float_data)
i = 0

print('column {} of the data gives:\n'.format(i))
for row in range(nrows):
    print('{:.0f}'.format(all_float_data[row][i]),end=' ')

**Exercise 1.2.9**

* use an implicit loop to create a list of ENSO values in a variable `enso` for the years 1950 up to last year for the period `DECJAN`.
* produce a plot of ENSO for `DECJAN` as a function of year (see below on how to do that).

Hint: check which column in the header is `DECJAN`. To start you off on this, we give you the implicit loop code for extracting the column containing the `YEAR` data (column 0). We also give you the code to achieve the plotting.

In [None]:
# do exercise here

# generate a list called years of column 0 data
years = [all_float_data[row][0] for row in range(nrows)]

# you need to put the enso data in here!
# this is put in as a dummy that should plot a straight line!
enso = years.copy()

# for plotting
import pylab as plt
%matplotlib inline

# 
plt.figure(0,figsize=(12,3))
plt.plot(years,enso)
plt.xlabel('year')
plt.ylabel('ENSO')

### 1.2.6 Summary

In section 1.2 you have been introduced to text representation in Python, as strings (type `str`), and shown that this sort of variable can be thought of as an 'array', and that it has a length that can be obtained with `len()`.

Other useful string manipulation methods you were introduced to are: `replace()`, `find()`, `split()` and `splitlines()`, though of course there are [many more](https://docs.python.org/3/library/string.html).

In an 'array', we can use an index to refer to a particular item (e.g. index 0 for the first item, 1 for the second, -1 for the last). We can use this idea to manipulate strings. 

In a more general sense, we can take a 'slice' of an array, with the syntax `[start:stop:skip]` giving access to a regularly spaced part of an array. We can use this, for example, to print out every 10th value (`skip=10`).
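As a brief reminder of how indices and slices behave, here is a sketch using the alphabet as a convenient test string (the variable names are illustrative). Remember that the `stop` index itself is not included in the slice:

```python
text = 'abcdefghijklmnopqrstuvwxyz'

# first and last items
print(text[0], text[-1])

# items 2, 3, 4 and 5 (stop is not included)
middle = text[2:6]
print(middle)

# every 10th item
every_tenth = text[::10]
print(every_tenth)

# items 1, 6, 11 and 16
spaced = text[1:20:5]
print(spaced)
```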

You were also introduced to the idea of looping control structures, using a `for ... in ...:` statement, and the equivalent implicit form. This introduced the idea of [indented code blocks](https://wiki.python.org/moin/Why%20separate%20sections%20by%20indentation%20instead%20of%20by%20brackets%20or%20%27end%27) and (related) nested structures (loops within loops).

In passing, you have also been shown how to pull html data from a URL (scraping) using the [`requests`](http://docs.python-requests.org/en/master/) package, and also how to produce a simple data plot, using [`pylab`](https://matplotlib.org/index.html).

## 1.3. Groups of things
Very often, we will want to group items together. There are several main mechanisms for doing this in Python, known as:

* string, e.g. `'hello'`
* tuple, e.g. `(1, 2, 3)`
* list, e.g. `[1, 2, 3]`
* numpy array e.g. `np.array([1, 2, 3])`

A slightly different form of group is a dictionary:

* dict, e.g. `{1:'one', 2:'two', 3:'three'}`

You will notice that each of the grouping structures tuple, list and dict use a different form of bracket. The numpy array is fundamental to much work that we will do later.

We have dealt with the idea of a string as an ordered collection in the material above, so will deal with the others here.

We noted the concept of length (`len()`), that elements of the ordered collection could be accessed via an index, and came across the concept of a slice. All of these same ideas apply to the first set of groups (string, tuple, list, numpy array) as they are all ordered collections.
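The sketch below applies the same three operations (length, index, slice) to a string, a tuple and a list, using made-up contents. One-dimensional numpy arrays respond to these operations in the same way, but we leave them out here as numpy has not yet been imported:

```python
# a string, a tuple and a list, each of length 5
groups = ['hello', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]]

for group in groups:
    # type, length, first item, last item, and every 2nd item
    print(type(group), len(group), group[0], group[-1], group[::2])
```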

A dictionary is not indexed by position (although from Python 3.7 onwards it does remember the order in which items were inserted), so indices have no role. Instead, we use 'keys'.

### 1.3.1 `tuple`
A tuple is a group of items separated by commas. In the case of a tuple, the brackets are optional.
You can have a group of different types in a tuple (e.g. `int`, `int`, `str`, `bool`).

In [None]:
# load into the tuple
t = (1, 2, 'three', False)

# unload from the tuple
a,b,c,d = t

print(t)
print(a,b,c,d)

If there is only one element in a tuple, you must put a comma (`,`) at the end, otherwise it is not interpreted as a tuple:



In [None]:
t = (1)
print (t,type(t))
t = (1,)
print (t,type(t))

You can have an empty tuple though:



In [None]:
t = ()
print (t,type(t))

**E1.3.1 Exercise**

* create a tuple called t that contains the integers 1 to 5 inclusive
* print out the value of t
* use the tuple to set variables a1,a2,a3,a4,a5

In [None]:
# do exercise here


### 1.3.2  `list`
A `list` is similar to a `tuple`. One main difference is that you can change individual elements in a list but not in a tuple.
To convert between a list and tuple, use the 'casting' methods `list()` and `tuple()`:
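A minimal sketch of the difference between the two. It uses a `try ... except` construct, which we have not covered yet, simply so that the error raised by the tuple is printed rather than stopping the cell:

```python
l = [1, 2, 3]
t = (1, 2, 3)

# changing an element of a list is fine
l[0] = 100
print(l)

# trying the same with a tuple raises a TypeError
try:
    t[0] = 100
except TypeError as error:
    print('cannot change a tuple:', error)

# if we need a modified tuple: cast to a list,
# change the list, then cast back to a tuple
modified = list(t)
modified[0] = 100
t_new = tuple(modified)
print(t_new)
```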

In [None]:

# a tuple
t0 = (1,2,3)

# cast to a list
l = list(t0)

# cast to a tuple
t = tuple(l)

print('type of {} is {}'.format(t,type(t)))
print('type of {} is {}'.format(l,type(l)))

You can concatenate (join) lists or tuples with the `+` operator:



In [None]:
l0 = [1,2,3]
l1 = [4,5,6]

l = l0 + l1
print ('joint list:',l)

**E1.3.2 Exercise**
* copy the code from the cell above, but instead of lists, use tuples
* loop over each element in the tuple and print out the data type and value of the element

Hint: use a `for ... in ...` construct.

In [None]:
# do exercise here

A common method associated with lists or tuples is:
* `index()`

Some useful functions that will operate on lists and tuples are:
* `len()`
* `sorted()` (the method `sort()` is available for lists only)
* `min(),max()`
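A short sketch of these in use, with made-up values. Note the distinction between the function `sorted()`, which accepts a tuple or a list and returns a new list, and the method `sort()`, which belongs to lists only and changes the list in place:

```python
t = (2, 8, 4, 32, 16)
l = list(t)

# these functions work on tuples and lists alike
print(len(t), min(t), max(t))

# sorted() returns a new (sorted) list and leaves t alone
sorted_t = sorted(t)
print(sorted_t, t)

# sort() is a list method that sorts in place
l.sort()
print(l)

# a tuple has no sort() method
tuple_has_sort = hasattr(t, 'sort')
print(tuple_has_sort)
```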



In [None]:
l0 = (2,8,4,32,16)

# print the index of the item integer 4 
# in the tuple / list

item_number = 4

# Note the dot . here
# as index is a method of the classes tuple and list
ind  = l0.index(item_number)

# notice that this is different
# as len() is not a tuple/list method, but 
# does operate on lists/tuples
# Note: do not use len as a variable name!
llen = len(l0)

# note the use of integers in the braces e.g. {0}
# rather than empty braces as before. This allows us to
# refer to particular items in the format argument list
print('the index of {0} in {1} is {2}'.format(item_number,l0,ind))
print('the length of the {0} {1} is {2}'.format(type(l0),l0,llen))


**E1.3.3 Exercise**

* copy the code to the block below, and test that this works with lists, as well as tuples
* find the index of the integer 16 in the tuple/list
* what is the index of the first item?
* what is the length of the tuple/list?
* what is the index of the last item?

In [None]:
# do exercise here

A list has a much richer set of methods than a tuple. This is because we can add or remove list items (but not tuple items).

* `insert(i,j)` : insert `j` before item `i` in the list
* `append(j)` : append `j` to the end of the list
* `sort()` : sort the list

This shows that tuples and lists are 'ordered' (i.e. they maintain the order they are loaded in) so that individual elements may be accessed through an 'index'. The index values start at 0 as we saw above. The index of the last element in a list/tuple is the length of the group, minus 1. This can also be referred to as index `-1`.
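The two ways of referring to items at the end of a list can be compared directly. This is a small sketch with made-up values; `last` is just an illustrative variable name:

```python
l0 = [2, 8, 4, 32, 16]

# index of the last item, counting from the start
last = len(l0) - 1

# both of these give the last item
print(l0[last], l0[-1])

# both of these give the last-but-one item
print(l0[last - 1], l0[-2])
```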

In [None]:
l0 = [2,8,4,32,16]

# insert 64 at the beginning (before item 0)
# Note that this inserts 'in place'
# i.e. the list is changed by calling this
l0.insert(0,64)


# insert 128 *before* the last item (item -1)
l0.insert(-1,128)

# append 256 on the end
l0.append(256)

# copy the list 
# and sort the copy
# Note the use of the copy() method here
# to create a copy
l1 = l0.copy()

# Note that this sorts 'in place'
# i.e. the list is changed by calling this
l1.sort()

print('the list {0} once sorted is {1}'.format(l0,l1))

**E1.3.4 Exercise**

* copy the above code and try out some different locations for inserting values (e.g. what does index `-2` mean?)
* what happens if you take off the `.copy()` statement in the line `l1 = l0.copy()`, i.e. just use `l1 = l0`?  [Why is this?](https://www.afternerd.com/blog/python-copy-list/)

In [None]:
# do exercise here

### 1.3.3 `np.array`

An array is a group of objects of the same type. Because they are of the same type, they can be stored efficiently in computer memory, and also accessed efficiently.

Whilst there are different ways of forming arrays, the most common is to use numpy arrays, using the package `numpy`. To use this, we must first import the package into the current workspace. We do this with the `import` statement. Using the optional `as` keyword allows us to use a shorter (or more suitable) name for the package. We will generally call numpy `np`, so we use:

`import numpy as np`

to import ('load') the numpy package. 
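As a first, minimal sketch of what 'of the same type' means in practice (the values are made up), we can cast a list to an array with `np.array()` and inspect its `dtype`. If the list mixes integers and floats, numpy converts everything to floats so that a single type is used throughout:

```python
import numpy as np

# cast a list of integers to a numpy array
a = np.array([1, 2, 3])
print(a, a.dtype)

# mixing int and float gives an array of floats:
# all items in an array share one type
b = np.array([1, 2.5, 3])
print(b, b.dtype)

# arithmetic is applied to every element at once
doubled = b * 2
print(doubled)
```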

Often, we will read data from a file/URL as we did above for the ENSO dataset. In that case, we had to step through each item to convert from string form to floating point number.

This sort of thing is much more simply done using methods associated with numpy arrays. 

A particularly useful numpy method is `np.loadtxt(file)` that loads an ASCII table of data straight into a numpy array.

Whilst this is designed to load data from a file, we can use `io.StringIO()` from the `io` package to make data that we already have as a string seem to `np.loadtxt` as if it were a file. This is a useful 'trick' for using methods that expect data in a file. The `unpack=True` option transposes the array that is returned, so that each column of the table becomes a row of the array (`data[0]` is then the first column, `data[1]` the second, and so on). The `usecols` option lets us select only those data columns we wish to read (0 and 1 here).
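Before applying this to the ENSO data, the effect of these options can be seen in a sketch that needs no internet access. The table in the string `table` is made up for illustration (4 lines of 3 columns):

```python
import io
import numpy as np

# a small made-up table held in a string:
# a year, followed by two data columns
table = ("1950 -1.0 -1.1\n"
         "1951 -1.1 -1.2\n"
         "1952  0.4  0.1\n"
         "1953  0.0  0.2\n")

# without unpack: one row of the array per line of the table
by_row = np.loadtxt(io.StringIO(table))
print(by_row.shape)

# with unpack=True: one row of the array per column of the table
by_column = np.loadtxt(io.StringIO(table), unpack=True)
print(by_column.shape)

# usecols selects columns 0 and 2 only
year, value = np.loadtxt(io.StringIO(table), unpack=True, usecols=[0, 2])
print(year)
print(value)
```

Note that a new `io.StringIO()` object is made for each call, since the 'file' is used up once it has been read.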


An alternative to `np.loadtxt()` is `np.genfromtxt()`. This has some additional features, such as the `invalid_raise` flag. If this is set to `False`, the loading is made somewhat tolerant to data errors (e.g. an inconsistent number of columns). Further, we can explicitly set what will indicate `missing_values` in the input and what we would like to replace them with (`filling_values`), which can be useful for tidying up datasets.
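The sketch below tries these options on two made-up strings, `gappy` (with an empty field) and `ragged` (with a short line). It reflects our understanding of the default behaviour of `np.genfromtxt()`, so do check the numpy documentation for the version you are using. The `warnings` package is used only to silence the warning that numpy issues when it skips a line:

```python
import io
import warnings
import numpy as np

# made-up comma-separated data with an empty (missing) field
gappy = "1,2,3\n4,,6\n"

# by default, a missing value is loaded as nan (not a number)
default = np.genfromtxt(io.StringIO(gappy), delimiter=',')
print(default)

# filling_values replaces missing entries with a value we choose
filled = np.genfromtxt(io.StringIO(gappy), delimiter=',',
                       filling_values=-999.0)
print(filled)

# made-up data where the second line has too few columns
ragged = "1 2 3\n4 5\n7 8 9\n"

# with invalid_raise=False the bad line is skipped (with a warning,
# which we silence here) instead of stopping with an error
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    tolerant = np.genfromtxt(io.StringIO(ragged), invalid_raise=False)
print(tolerant)
```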




In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

# select a data column
data_column = 1

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True,usecols=[0,data_column])

# so data[0] is the year data
#    data[1] is the enso data for column data_column
# print some attributes of the data array

print('array type',type(data))
print('data type',data.dtype)
print('number of dimensions',data.ndim)
print('data shape',data.shape)
print('data size',data.size)

# for plotting
import pylab as plt
%matplotlib inline

# 
plt.figure(0,figsize=(12,3))
plt.plot(data[0],data[1],label=header[data_column])
plt.xlabel('year')
plt.ylabel('ENSO')
plt.title('ENSO data from {0}'.format(url))
plt.legend(loc='best')

We saw in the example above that a numpy array (`<class 'numpy.ndarray'>`) has a set of attributes that include `shape`, `ndim`, `dtype` and `size` that we can use to query information about the array. We will learn more about processing data with numpy arrays later in the course, but you should already see that they are a useful construct for manipulating multi-dimensional datasets.

**Exercise 1.3.4**

* copy the code from the block above and modify it to plot the ENSO data for the period `FEBMAR`. Check this by looking at the data in the [original table](http://www.esrl.noaa.gov/psd/enso/mei/table.html).
* modify the code to produce a plot of *all* periods (so the graph should have 12 lines, correctly labelled)

Hint: You will need to consider what, if anything, to set `usecols` to (what happens if you don't set `usecols`?) and provide a looping structure for the plotting.

In [None]:
# do exercise here

### 1.3.4 `dict`



The collections we have used so far have all been ordered. This means that we can refer to a particular element in the group by an index, e.g. `array[10]`.

A dictionary is not indexed by position. Instead of indices, we use 'keys' to refer to elements: each element has a key associated with it. It can be very useful for data organisation (e.g. databases) to have a key to refer to, rather than e.g. some arbitrary column number in a gridded dataset.

A dictionary is defined as a group in braces (curly brackets). For each element, we specify the key and then the value, separated by `:`.

In [None]:
a = {'one': 1, 'two': 2, 'three': 3}

# we then refer to the keys and values in the dict as:

print ('a:\n\t',a)
print ('a.keys():\n\t',a.keys())     # the keys
print ('a.values():\n\t',a.values()) # returns the values
print ('a.items():\n\t',a.items())   # returns (key, value) tuples

From Python 3.7 onwards, a `for` loop over a dictionary returns the items in the order in which they were inserted (in earlier versions, the order was not guaranteed). We will often use such a loop to iterate over the items in a dictionary.

In [None]:
for key,value in a.items():
    print(key,value)

We refer to specific items using the key e.g.:

In [None]:
print(a['one'])

You can add to a dictionary:

In [None]:
a.update({'four':4,'five':5})
print(a)

# or for a single value
a['six'] = 6
print(a)

Quite often, you find that you have the keys you want to use in a dictionary as a list or array, and the values in another list.

In such a case, we can use the function `zip(keys,values)` to pair up the keys and values, and pass the result to `dict()` to load them into the dictionary. For example:

In [None]:
values = [1,2,3,4]
keys = ['one','two','three','four']

a = dict(zip(keys,values))

print(a)

We will use this idea to make a dictionary of our ENSO dataset, using the items in the header for the keys. In this way, we obtain a more elegant representation of the dataset, and can refer to items by names (keys) instead of column numbers.

In [None]:
import requests
import numpy as np
import io
# for plotting (repeated here so the cell runs on its own)
import pylab as plt

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))

key = 'MAYJUN'
# plot data
plt.figure(0,figsize=(12,7))
plt.title('ENSO data from {0}'.format(url))
plt.plot(data_dict['YEAR'],data_dict[key],label=key)
plt.xlabel('year')
plt.ylabel('ENSO')
plt.legend(loc='best')

**Exercise 1.3.5**

* copy the code above, and modify so that datasets for months `['MAYJUN','JUNJUL','JULAUG']` are plotted on the graph

Hint: use a for loop

In [None]:
# do exercise here

We can also usefully use a dictionary with a printing format statement. In that case, we refer directly to the key in the format string. This can make printing statements much easier to read. We don't directly pass the dictionary to the `format` method, but rather `**dict`, where `**dict` means "treat the key-value pairs in the dictionary as additional named arguments to this function call".

So, in the example:

In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))
print(data_dict.keys())

# print the data for MAYJUN
print('data for MAYJUN: {MAYJUN}'.format(**data_dict))

The line `print('data for MAYJUN: {MAYJUN}'.format(**data_dict))` is equivalent to writing:

    print('data for MAYJUN: {MAYJUN}'.format(YEAR=data_dict['YEAR'],DECJAN=data_dict['DECJAN'], ...))
    
In this way, we use the keys in the dictionary as keywords to pass to a method.
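The same idea can be tried without downloading anything. In this sketch, `info` is a small made-up dictionary (not the ENSO data), and the two formatted strings should be identical:

```python
# a small made-up dictionary
info = {'name': 'ENSO', 'start': 1950, 'stop': 2018}

# refer to the keys by name in the format string
unpacked = '{name} data run from {start} to {stop}'.format(**info)
print(unpacked)

# this is equivalent to writing the keyword arguments out by hand
by_hand = '{name} data run from {start} to {stop}'.format(
    name=info['name'], start=info['start'], stop=info['stop'])
print(by_hand)

print(unpacked == by_hand)
```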

Another useful example of such a use of a dictionary is in saving a numpy dataset to file.

If the data are numpy arrays in a dictionary as above, we can store the dataset using:



In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))

filename = 'enso_mei.npz'

# save the dataset
np.savez_compressed(filename,**data_dict)

What we load from the file is a dictionary-like object `<class 'numpy.lib.npyio.NpzFile'>`.

If needed, we can cast this to a dictionary with `dict()`, but it is generally more efficient to keep the original type.
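The whole save-and-load cycle can be sketched with a small made-up dataset, so that it runs without internet access. The names `small` and `small.npz` are illustrative, and we write into a temporary directory (from the `tempfile` package) so that no files are left behind. The `with` construct, which we have not covered yet, is used here to make sure the file is closed when we have finished with it:

```python
import os
import tempfile
import numpy as np

# a small made-up dataset held in a dictionary
small = {'YEAR': np.array([1950., 1951., 1952.]),
         'VALUE': np.array([-1.0, -1.1, 0.4])}

# write to a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'small.npz')

    # each key in the dictionary becomes a named array in the file
    np.savez_compressed(filename, **small)

    # load the file, closing it at the end of the 'with' block
    with np.load(filename) as loaded:
        print(type(loaded))
        # cast to a dictionary (this reads all of the arrays)
        as_dict = dict(loaded)

print(sorted(as_dict.keys()))
print(as_dict['VALUE'])
```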

In [None]:
# load the dataset

filename = 'enso_mei.npz'

loaded_data = np.load(filename)

print(type(loaded_data))

# test they are the same using np.array_equal
for k in loaded_data.keys():
    print('\t',k,np.array_equal(data_dict[k], loaded_data[k]))

**Exercise 1.3.6**

* Using what you have learned above, access the Met Office data file [`https://www.metoffice.gov.uk/hadobs/hadukp/data/monthly/HadSEEP_monthly_qc.txt`](https://www.metoffice.gov.uk/hadobs/hadukp/data/monthly/HadSEEP_monthly_qc.txt) and create a 'data package' in a numpy `.npz` file that has keys of `YEAR` and each month in the year, with associated datasets of monthly Southeast England precipitation (mm).
* confirm that the data in your `npz` file are the same as in your original dictionary
* produce a plot of October rainfall using these data for the years 1900 onwards

In [None]:
# do exercise here

### 1.3.5 Summary

In this section, we have extended the types of data we might come across to include groups. We dealt with ordered groups of various types (`tuple`, `list`), and introduced the numpy package for numpy arrays (`np.array`). We saw dictionaries as collections in which we refer to individual items with a key.

We learned in the previous section how to pull apart a dataset presented as a string using loops and various string methods, and to construct a useful dataset 'by hand' in a list or similar structure. It is useful, when learning to program, to know how to do this.

Here, we saw that packages such as numpy provide higher level routines that make reading data easier, and we would generally use these in practice. We saw how we can use `zip()` to help load a dataset from arrays into a dictionary, and also the value of using a dictionary representation when saving numpy files.