# Strings

## Introduction

### Purpose

In this section we will look at strings in more depth.


### Prerequisites

You will need some understanding of the following:

* Using Notebooks
* Getting help
* [010 Variables, comments and `print()`](010_Python_Introduction.ipynb)
* [011 Data types](011_Python_data_types.ipynb). In particular, you should understand strings.
* lists
* if

### Timing

The session should take around XX hours.



## String features

### Quotes and escapes

We have seen strings before, and noted that they are collections of characters (`a`, `b`, `1`, ...). Strings and characters are input by surrounding the relevant text in either double (`"`) or single (`'`) quotes. You can use this feature to print out a string with quotes, for example:

In [6]:
print ("'a string in single quotes'")
print ('"a string in double quotes"')

'a string in single quotes'
"a string in double quotes"


We have seen that some elements of the string may be special codes for print formatting, such as newline `\n` or tab `\t`. If we insert these in the string, they will add a newline or a tab respectively. Both of these might *look like* two characters, but each is interpreted as a single character.

What if we needed to print out `\n` and `\t` literally as part of a string? If we try to print the string:

        "beware of \n and \t"
        
we will find that they are (as we probably suspected) interpreted. Using single or double quotes will make no difference:

In [31]:
print("beware of \n and \t")
print('beware of \n and \t')

beware of 
 and 	
beware of 
 and 	


What we need to do is to present `print()` with the two characters `\` and `n`, instead of the single character `\n`. The problem now is that `\` has special meaning in a string: it *escapes* the following character, i.e. it makes the interpreter ignore the usual meaning of the following character. If we tried to generate a string:

        "\"
 
 the code would fail, because `\"` means *don't* interpret `"` in its usual sense (i.e. as a quote) and we would have an unclosed string.
 
 The trick then, is to use `\` to escape the meaning of `\`. So, if we want to print `\`, we set the string as `\\`:

In [33]:
print("\\")

\
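
As a quick check of the claim that `\n` is a single character while `\\n` gives two, here is a minimal sketch using the built-in `len()` function (covered in more detail later in this notebook). The variable names are illustrative only.

```python
# an escape sequence such as \n counts as ONE character
newline = "\n"
# an escaped backslash followed by n is TWO characters: \ and n
backslash_n = "\\n"

print(len(newline))      # 1
print(len(backslash_n))  # 2
print(backslash_n)       # shows \n on screen
```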


#### Exercise

* insert a new cell below here
* Use what we have learned above to print the phrase `"beware of \n and \t"`, including quotes.

In [80]:
# Use what we have learned above to print the phrase
# "beware of \n and \t", including quotes.

# try this first
string = "beware of \n and \t"
print('wrong:\t\t',string)

# now escape the \ characters
string = "beware of \\n and \\t"
print('good:\t\t',string,'\t\tbut no quotes')

# now escape the \ characters
# and add quotes
string = '"beware of \\n and \\t"'
print('great:\t\t',string)

# now escape the \ characters
# and add quotes by escaping
string = "\"beware of \\n and \\t\""
print('great:\t\t',string)

wrong:		 beware of 
 and 	
good:		 beware of \n and \t 		but no quotes
great:		 "beware of \n and \t"
great:		 "beware of \n and \t"


Another use of `\` as an escape character is to make long strings in our code more readable. We can do this by putting an escape `\` **just before** we hit the return key (newline!) on the keyboard, and so spread what would otherwise be a single long line (a command or a variable definition) over multiple lines.

For example:

In [74]:
# from https://www.usgs.gov/faqs/what-remote-sensing-and-what-it-used?
string = \
"Remote sensing is the process of detecting and \
monitoring the physical characteristics of an \
area by measuring its reflected and emitted \
radiation at a distance (typically from \
satellite or aircraft)."

print(string)

Remote sensing is the process of detecting and monitoring the physical characteristics of an area by measuring its reflected and emitted radiation at a distance (typically from satellite or aircraft).


Here, when we type `string = ` on the first line, the Python interpreter expects a string to be specified next. By using instead `\` *just before we hit the return*, we are essentially escaping that newline, and the rest of the command (the string definition here) can take place on the following line. We repeat this idea to spread the string over multiple lines.

This can be really useful. 

In the special case of a string that we want to define over multiple lines though, Python has a special format using triple quotes (single or double):

    '''
    multiple 
    line
    string
    '''
    
that means we don't need to escape each end of line within the text.

In [56]:
# from https://www.usgs.gov/faqs/what-remote-sensing-and-what-it-used?
string = '''
Remote sensing is the process of detecting and 
monitoring the physical characteristics of an 
area by measuring its reflected and emitted 
radiation at a distance (typically from 
satellite or aircraft).
'''

print(string)


Remote sensing is the process of detecting and 
monitoring the physical characteristics of an 
area by measuring its reflected and emitted 
radiation at a distance (typically from 
satellite or aircraft).



Notice how this is different from the case when we escaped the newline characters within the string. In fact, at the end of each line of text, this string contains `\n` newline characters (we just don't see them).
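
One way to make the hidden newline characters visible is sketched below. It assumes two standard Python tools that have not been introduced above: `repr()`, which shows a string with its escape codes written out, and the string method `count()`. The example strings are made up for illustration.

```python
# newlines escaped with \ : they do NOT end up in the string
escaped = "line one \
line two"

# triple quotes: the newline IS part of the string
triple = '''line one
line two'''

# repr() shows the string with its escape codes visible
print(repr(escaped))   # 'line one line two'
print(repr(triple))    # 'line one\nline two'

# count the newline characters in each string
print(escaped.count('\n'), triple.count('\n'))   # 0 1
```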

#### Exercise

* Insert a new cell below here
* Write Python code that prints a string containing the following text, spaced over four lines as intended. There should be no space at the start of the line.

        The Owl and the Pussy-cat went to sea 
        In a beautiful pea-green boat, 
        They took some honey, and plenty of money, 
        Wrapped up in a five-pound note.

* Write Python code that prints a string containing the above text, all on a single line.

In [58]:
# ANSWER

# Write Python code that prints a string containing 
# the following text, spaced over four lines as intended.

lear = '''
The Owl and the Pussy-cat went to sea
In a beautiful pea-green boat,
They took some honey, and plenty of money,
Wrapped up in a five-pound note.
  '''
print(lear)


The Owl and the Pussy-cat went to sea
In a beautiful pea-green boat,
They took some honey, and plenty of money,
Wrapped up in a five-pound note.
  


In [61]:
# ANSWER

# Write Python code that prints a string 
# containing the above text, all on a single line.

# we escape the new lines now
lear = "\
The Owl and the Pussy-cat went to sea \
In a beautiful pea-green boat, \
They took some honey, and plenty of money, \
Wrapped up in a five-pound note."
print(lear)

The Owl and the Pussy-cat went to sea In a beautiful pea-green boat, They took some honey, and plenty of money, Wrapped up in a five-pound note.


In [44]:
# ANSWER

# lets set up a variable called string to make this clearer
# and do this piece by piece
string = 'beware of \n and \t'
print("wrong:", string)

# escape the \
string = 'beware of \\n and \\t'
print("good:\t\t", string, '\tbut no quotes')

# escape the \
string = '"beware of \\n and \\t"'
print("great:\t\t", string)

# or ... escape the quotes. as well!
string = "\"beware of \\n and \\t\""
print("great again:\t", string)

wrong: beware of 
 and 	
good:		 beware of \n and \t 	but no quotes
great:		 "beware of \n and \t"
great again:	 "beware of \n and \t"


## String Methods

### Concatenate strings: `+` and `len()`

We can do a number of very useful things with strings. The so-called string methods are defined by Python on all strings by default, and can be used with every string.

For one, we can concatenate strings using the `+` symbol:

In [7]:
string1 = 'hello'
string2 = 'world'
spacer = ' '

# concatenate these
result = string1 + spacer + string2
print(result)

hello world


Another tool we will find useful with strings is the built-in `len()` function.

In [11]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



When the object is a string, the 'number of items' refers to the number of characters, so `len(str)` returns the length of the string.

In [18]:
# generate a string called t
# and see how long it is
t = ''
print ('the length of',t,'is',len(t))

# generate a string called s
# and see how long it is
quote = '"'
s = "Hello" + "there" + "everyone"
print ('the length of',quote+s+quote,'is',len(s))

the length of  is 0
the length of "Hellothereeveryone" is 18


#### Exercise

* insert a new cell below here
* what might a zero-length string look like? Try to generate one, and check its length.
* the `Hello there everyone` example above has no spaces between the words. Copy the code and modify it to have spaces.
* confirm that you get the expected increase in length.

### `replace()` and `strip()`

In [62]:
help(str.replace)

Help on method_descriptor:

replace(self, old, new, count=-1, /)
    Return a copy with all occurrences of substring old replaced by new.
    
      count
        Maximum number of occurrences to replace.
        -1 (the default value) means replace all occurrences.
    
    If the optional argument count is given, only the first count occurrences are
    replaced.



The string method `replace()` replaces substrings defined in `old` with those defined in `new`. 

In the example below, we replace the sub-string `"happy"` with a new string containing the emoji "😀": 

In [72]:
original_string = "I'm a very happy string"
print('original:\t',original_string)

new_string = original_string.replace("happy", "😀")
print ('new:\t\t',new_string)

original:	 I'm a very happy string
new:		 I'm a very 😀 string
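
The help text above also mentions the optional `count` argument. A minimal sketch of how it behaves, using a made-up string, and showing that `replace()` returns a copy rather than changing the original:

```python
original = "one-two-three-four"

# replace every occurrence (the default, count=-1)
everything = original.replace("-", " ")
print(everything)      # one two three four

# replace only the first two occurrences
limited = original.replace("-", " ", 2)
print(limited)         # one two three-four

# replace() returns a copy: the original string is unchanged
print(original)        # one-two-three-four
```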


In [94]:
help(str.strip)

Help on method_descriptor:

strip(self, chars=None, /)
    Return a copy of the string with leading and trailing whitespace removed.
    
    If chars is given and not None, remove characters in chars instead.



`strip()` is very useful in string formatting and general tidying up.

Suppose we had the string:

    ":::😀:😀:😀::::::"
    
but what we wanted was:

    "😀:😀:😀"
    
i.e. we want to strip the `:` characters from the right and left ends of the string. We can't easily use `replace()` without affecting the `:` characters we want to keep. We can achieve this with the `strip()` method though.

In [98]:
old_string = ":::😀:😀:😀::::::"
print(old_string)

new_string = old_string.strip(':')
print(new_string)

:::😀:😀:😀::::::
😀:😀:😀
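
The `chars` argument is treated as a *set* of characters to remove from the ends, not as a fixed prefix or suffix. The sketch below illustrates this with a made-up string, and also uses the related standard methods `lstrip()` and `rstrip()` (not covered above), which work on the left or right end only.

```python
messy = "--::data:-:value::--"

# any mixture of ':' and '-' is removed from both ends;
# the same characters in the middle are left alone
both = messy.strip(":-")
print(both)     # data:-:value

# lstrip() and rstrip() work on one end only
left = messy.lstrip(":-")
right = messy.rstrip(":-")
print(left)     # data:-:value::--
print(right)    # --::data:-:value
```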


#### Exercise

* Insert a new cell below here
* Take the multi-line string:

`'''
----Remote sensing is the process of detecting and 
monitoring the physical characteristics of an 
area by measuring its reflected and emitted 
radiation at a distance (typically from 
satellite or aircraft).----
'''`

  and use it to generate a single line string, without the `-` characters at either end.
    

In [101]:
old_string = '''
----Remote sensing is the process of detecting and 
monitoring the physical characteristics of an 
area by measuring its reflected and emitted 
radiation at a distance (typically from 
satellite or aircraft).----
'''
print(old_string)

# replace newline with empty string!
# and strip the result after
new_string = old_string.replace('\n','').strip('-')
print(new_string)


----Remote sensing is the process of detecting and 
monitoring the physical characteristics of an 
area by measuring its reflected and emitted 
radiation at a distance (typically from 
satellite or aircraft).----

Remote sensing is the process of detecting and monitoring the physical characteristics of an area by measuring its reflected and emitted radiation at a distance (typically from satellite or aircraft).


### `split()` and `join()`

In [63]:
help(str.split)

Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1)
    Return a list of the words in the string, using sep as the delimiter string.
    
    sep
      The delimiter according which to split the string.
      None (the default value) means split according to any whitespace,
      and discard empty strings from the result.
    maxsplit
      Maximum number of splits to do.
      -1 (the default value) means no limit.



A pair of really useful string methods are `split()` and `join()`. The former is used to split a string into a list of sub-strings. For example:

In [122]:
string = \
"   Remote sensing is the process of detecting and \
monitoring the physical characteristics of an \
area by measuring its reflected and emitted \
radiation at a distance (typically from \
satellite or aircraft).   "

string_list = string.split()

print(string_list)

['Remote', 'sensing', 'is', 'the', 'process', 'of', 'detecting', 'and', 'monitoring', 'the', 'physical', 'characteristics', 'of', 'an', 'area', 'by', 'measuring', 'its', 'reflected', 'and', 'emitted', 'radiation', 'at', 'a', 'distance', '(typically', 'from', 'satellite', 'or', 'aircraft).']


We see that the string is 'parsed' into a list of separate sub-strings, which in this case represent words in the sentence. By default (`sep=None`), the string is split on any run of whitespace (spaces, tabs or newlines), though we could specify another delimiter if we needed.

Any whitespace to the left or right of the string has no impact here, so we do not need to explicitly `strip()` the string.
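
The difference between the default behaviour and an explicit delimiter can be seen in a small sketch (the strings here are made up for illustration):

```python
text = "  a  b "

# default: split on runs of any whitespace, ignoring the ends
default_split = text.split()
print(default_split)   # ['a', 'b']

# explicit single space: every space is a delimiter, so empty
# strings appear wherever two delimiters are next to each other
space_split = text.split(' ')
print(space_split)     # ['', '', 'a', '', 'b', '']

# any other delimiter can be given, e.g. a comma
comma_split = "2019,1.5,,3.0".split(',')
print(comma_split)     # ['2019', '1.5', '', '3.0']
```

Note that with an explicit delimiter, leading and trailing delimiters *do* matter, so a `strip()` beforehand may then be worthwhile.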

If we want to generate a string from a set of sub-strings, we use the `join()` method.

In [118]:
help(str.join)

Help on method_descriptor:

join(self, iterable, /)
    Concatenate any number of strings.
    
    The string whose method is called is inserted in between each given string.
    The result is returned as a new string.
    
    Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'



For this, we call `join()` on the string delimiter we wish to use. For example, to reconstruct the sentence from the string list with whitespace delimitation:

In [121]:
string_list = ['Remote', 'sensing', 'is', 'the', 'process', 
               'of', 'detecting', 'and', 'monitoring', 'the', 
               'physical', 'characteristics', 'of', 'an', 'area', 
               'by', 'measuring', 'its', 'reflected', 'and', 'emitted', 
               'radiation', 'at', 'a', 'distance', '(typically', 'from',
               'satellite', 'or', 'aircraft).']

string = ' '.join(string_list)
print(string)

Remote sensing is the process of detecting and monitoring the physical characteristics of an area by measuring its reflected and emitted radiation at a distance (typically from satellite or aircraft).
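
Used together, `split()` and `join()` give a simple way to tidy up irregular whitespace. This is a minimal sketch with a made-up untidy string, not the only way to do it:

```python
untidy = "  Remote   sensing\tis \n the process  "

# split() discards all runs of whitespace (spaces, tabs, newlines) ...
words = untidy.split()
# ... and join() puts the words back with exactly one space between them
tidy = ' '.join(words)

print(words)   # ['Remote', 'sensing', 'is', 'the', 'process']
print(tidy)    # Remote sensing is the process
```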


#### Exercise

* Insert a new cell below here
* Take the string 

      The Owl and the Pussy-cat went to sea 
      In a beautiful pea-green boat, 
      They took some honey, and plenty of money, 
      Wrapped up in a five-pound note.
    
  and split it into a list of sub-strings.
* Then re-construct the string, separating each word by a colon character `':'`
* Print out the list of sub-strings and the re-constructed string

In [126]:
# Answer

# Take the string
string = '''
The Owl and the Pussy-cat went to sea 
In a beautiful pea-green boat, 
They took some honey, and plenty of money, 
Wrapped up in a five-pound note.
'''

# and split it into a list of sub-strings.
list_string = string.split()
# print this out
print(list_string)

# Then re-construct the string, separating each word by a colon character ':'
recon_string = ':'.join(list_string)
# print this out
print(recon_string)

['The', 'Owl', 'and', 'the', 'Pussy-cat', 'went', 'to', 'sea', 'In', 'a', 'beautiful', 'pea-green', 'boat,', 'They', 'took', 'some', 'honey,', 'and', 'plenty', 'of', 'money,', 'Wrapped', 'up', 'in', 'a', 'five-pound', 'note.']
The:Owl:and:the:Pussy-cat:went:to:sea:In:a:beautiful:pea-green:boat,:They:took:some:honey,:and:plenty:of:money,:Wrapped:up:in:a:five-pound:note.


In [22]:
# ANSWER
# what might a zero-length string look like? 
# Try to generate one, and check its length.
s = ''
print(s,len(s))

 0


In [21]:
# the Hello there everyone example above has no spaces between the words. 
# copy the code and modify it to have spaces.

# generate a string called s
# and see how long it is

# lets have a spacer variable
spacer = ' '
quote = '"'
# add the spaces in
s = "Hello" + spacer + "there" + spacer + "everyone"
print ('the length of',quote+s+quote,'is',len(s))

# confirm that you get the expected increase in length.
# It is now 20 rather than 18 above

the length of "Hello there everyone" is 20


### `slice` 

A string can be thought of as an ordered 'array' of characters. 

So, for example the string `hello` can be thought of as a construct containing `h` then `e`, `l`, `l`, and `o`. 

We can index a string, so that e.g. `'hello'[0]` is `h`, `'hello'[1]` is `e` etc. Notice that index `0` is used for the first item.

We have seen above the idea of the 'length' of a string. In this example, the length of the string `hello` is 5. The final item in this case would be `'hello'[4]`, because we count indices from 0.

In [132]:
string = 'hello'

# length
slen = len(string)
print('length of',string,'is',slen)

# select these indices
i = 0
print('character',i,'of',string,'is',string[i])

i = 3
print('character',i,'of',string,'is',string[i])

i = 4
print('character',i,'of',string,'is',string[i])


length of hello is 5
character 0 of hello is h
character 3 of hello is l
character 4 of hello is o


#### Exercise

* Insert a new cell below here
* copy the code above, and see what happens if you set `i` to be the value of length of the string. Why does it respond so?
* make the code robust to this issue, by using an `if` statement to test if `i` is in the required range.

In [136]:
# ANSWER

# copy the code
string = 'hello'

# length
slen = len(string)
print('length of', string, 'is', slen)

# select these indices
i = slen
# This would fail because string[5] does not exist,
# so we use an `if` statement to test that `i` is in
# the required range (0 to slen-1) before indexing.
if (i >= 0) and (i < slen):
    print('character', i, 'of', string, 'is', string[i])
else:
    # print a meaningful error message
    print("out of bounds error for i =", i, "for string", string)

length of hello is 5
out of bounds error for i = 5 for string hello
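
The range test can also be written as a single 'chained' comparison, `0 <= i < len(string)`, which is standard Python. A minimal sketch, trying several values of `i` in a loop (the list `results` is just an illustrative name):

```python
string = 'hello'

results = []
for i in (0, 4, 5):
    # true only if i is at least 0 AND less than the length
    if 0 <= i < len(string):
        results.append(string[i])
    else:
        # out of range: store None rather than raise an error
        results.append(None)

print(results)   # ['h', 'o', None]
```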




We can use the idea of a 'slice' to access particular elements within the string.

For a slice, we can specify:

* start index (0 is the first)
* stop index (not including this)
* skip (do every 'skip' character)

When specifying this as array access, this is given as, e.g.:

`array[start:stop:skip]`

* The default start is 0
* The default stop is the length of the array
* The default skip is 1

We can use negative numbers in specifying `start:stop:skip`: in that case, they are counted from the end of the string (`-1` is the last character).

We can specify a slice with the default values by leaving the terms out:

`array[::2]`

would give values in the array `array` from 0 to the end, in steps of 2.

This idea is fundamental to array processing in Python. We will see later that the same mechanism applies to all ordered groups.


In [139]:
s = "Hello World"
print (s,len(s))

start = 0
stop  = 11
skip  = 2
print (s[start:stop:skip])

# use -ve numbers to specify from the end
# use None to take the default value
start = -3
stop  = None
skip  = 1
print (s[start:stop:skip])

Hello World 11
HloWrd
rld
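
A few more slices of the same string, as a sketch of how the defaults and negative values behave (variable names are illustrative):

```python
s = "Hello World"

first_word = s[:5]     # start defaults to 0
last_word = s[-5:]     # stop defaults to the end of the string
backwards = s[::-1]    # a negative skip runs through the string backwards
clipped = s[6:100]     # a stop beyond the end is clipped, not an error

print(first_word)   # Hello
print(last_word)    # World
print(backwards)    # dlroW olleH
print(clipped)      # World
```

Unlike indexing a single character, a slice that runs past the end of the string does not raise an error.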


#### Exercise

The example above allows us to access individual characters of the array.

* Insert a new cell below here
* based on the example above, print the string starting from the default start value, up to the default stop value, in steps of `2`. This should be `HloWrd`.
* write code to print out the 4$^{th}$ letter (character) of the string `s`. This should be `l`.


In [141]:
# ANSWER

s = "Hello World"
print (s,len(s))

# based on the example above, print the string starting 
# from the default start value, up to the default stop value, in steps of `2`.

# default start -> None
start = None
# default stop -> None
stop  = None
skip  = 2
print (s[start:stop:skip])

Hello World 11
HloWrd


In [143]:
# ANSWER

s = "Hello World"
# write code to print out the 4th letter (character) of the string s.
# index 3 is the 4th character !!!
print(s[3])

l


### 1.2.5 `find`

Quite often, we might want to find a string inside another string, and potentially give the location (as in characters from the start of the string) where this string occurs. We can use the `find` method, which will return either a `-1` if the string isn't found, or an integer giving the index of where the string starts (for the first time).

In [None]:
print ("I'm a very happy string".find("a"))
print ("I'm a very happy string".find("happy"))
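
For reference, a sketch of what `find()` returns in these cases, including a sub-string that is not present, and of how the returned index can be used as the start of a slice:

```python
sentence = "I'm a very happy string"

print(sentence.find("a"))       # 4  (the first 'a' only)
print(sentence.find("happy"))   # 11
print(sentence.find("sad"))     # -1 (not found)

# the index returned by find() can be used as the start of a slice
start = sentence.find("happy")
remainder = sentence[start:]
print(remainder)                # happy string
```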

Let's use the idea of `find()` to sort out a messy table of data that we get from a web page.

First, we need to import the package `requests` to access some information from a [URL](https://en.wikipedia.org/wiki/URL) (from a web page). The data we get will be in [html](https://en.wikipedia.org/wiki/HTML).

The data we will examine is a dataset of [ENSO](https://en.wikipedia.org/wiki/ENSO) values for each month of the year from January 1950 to present, made available by [NOAA](https://en.wikipedia.org/wiki/NOAA).

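If you visit [http://www.esrl.noaa.gov/psd/enso/mei.old/table.html](http://www.esrl.noaa.gov/psd/enso/mei.old/table.html) you will see the data table we are interested in. So, how do we 'grab' this?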

The [URL](https://en.wikipedia.org/wiki/URL) points to [html](https://en.wikipedia.org/wiki/HTML) code. When you display this in a browser, it is rendered appropriately. 

If you access the html directly, you will get the following:

In [None]:
# Web scraping example

import requests

url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"

# This line will pull the URL data as a string
txt = requests.get(url).text

# show the first 1000 characters (see 'slice' above: this is the same as [None:1000:None])
print(txt[:1000])

We notice the presence of html codes in the text string (e.g. `<html>`, `<pre>`). There are particular packages for neatly parsing html (scraping information from web pages), one of the most common being [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). This will tend to be more useful if the html is well formatted, and the data are contained in `<table>` sections, or similar structures. Here, we just have a block of text in the `<pre>` section.

If we want to *just* access the dataset here then, we might notice that the data we want to access starts when we see the string `YEAR`.

We can use `find()` to discover the index of this in the string:

In [None]:
start = txt.find('YEAR')

print('start of useful data at index {}\n---------------------------------'.format(start))
print(txt[start:start+1000])

If we look again at the web page [http://www.esrl.noaa.gov/psd/enso/mei.old/table.html](http://www.esrl.noaa.gov/psd/enso/mei.old/table.html), we might notice that the end of the useful data is delimited by two newlines and the string `(1)`, i.e., as a string `\n\n(1)`. So we should be able to use `find()` again to get the location of the end of the data (i.e. `stop`, in the sense of a slice).

**Exercise 1.2.6**

* use this observation to form a string called `data_table`, containing all of the useful data (i.e. `txt[start:stop]`).
* print the string `data_table`.


In [None]:
# do exercise here

This exercise is a very good example of [web scraping](https://en.wikipedia.org/wiki/Web_scraping). Web scraping is often rather messy (you have to work out some 'key' to reliably delimit the information you want) but can be extremely valuable for accessing datasets that are not cleanly presented. We have only gone part of the way to extracting a useful dataset here, because the dataset we are interested in (the ENSO data) is still represented as a string, whereas we really want it to be a set of floating point numbers. We will deal with this later.
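
The idea of delimiting data with two 'key' strings can be tried without any network access. The sketch below uses a small made-up block of text as a stand-in for a web page (it is NOT the real NOAA data), and checks for the `-1` that `find()` returns when a key is missing.

```python
# made-up stand-in for the text of a web page (NOT the real NOAA data)
txt = '''<html><pre>
Some preamble text

YEAR JAN FEB
1950 -1.0 -1.1
1951 0.5 0.6

(1) a footnote
</pre></html>'''

start = txt.find('YEAR')      # index of the first character we want
stop = txt.find('\n\n(1)')    # index of the first character we do NOT want

# find() returns -1 when a key is missing, so check before slicing
if start == -1 or stop == -1:
    table = ''
    print('key not found')
else:
    table = txt[start:stop]
    print(table)
```

Checking for `-1` matters because `-1` is itself a valid index (the last character), so a missing key would otherwise silently give the wrong slice.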



### 1.2.5 `split` and `splitlines`

The first 'line' of `data_table` should contain the 'header' information, i.e. the title of the data columns (`YEAR`, `DECJAN` etc.). We want to separate the header from the numbers in the data table, so we want to 'split' the string called `data_table` into a header string and data string. 

One approach to this would be to split the string into 'lines' of text (rather than one block). Effectively that means splitting into multiple strings whenever we hit a `\n` character. Rather than do that explicitly, we use the `splitlines()` method:



In [None]:
import requests
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start = txt.find('YEAR')
stop  = txt.find('\n\n(1)')
data_table = txt[start:stop]

# split into a list of strings
data_lines = data_table.splitlines()

# tell me something useful
print(type(data_lines),len(data_lines))

# loop over some examples
for i in 0,1,len(data_lines)-1:
    print('line {} {}\n\t{}'.format(i,type(data_lines[i]),data_lines[i]))


This splits each 'line' of text into an entry in a `list`, so that the header data is now given in the first entry (`data_lines[0]`) and the lines containing data come after that.
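
The behaviour of `splitlines()` is easy to see on a small made-up table (the values below are invented for illustration):

```python
table = "YEAR JAN FEB\n1950 -1.0 -1.1\n1951 0.5"

lines = table.splitlines()
# the newline characters themselves are not kept
print(lines)        # ['YEAR JAN FEB', '1950 -1.0 -1.1', '1951 0.5']
print(len(lines))   # 3

# the first entry is the header
header = lines[0]
print(header)       # YEAR JAN FEB
```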


From the print out above, we notice that the final 'data line' (index `-1`) is shorter than (has fewer entries than) the other lines. This is because we are only part way through this year!

In 'real' datasets, we quite often have 'messy' lines of data such as this (or data missing for other reasons). How you want to deal with the 'messy bits' depends on the sort of analysis you want to do. 

One option (the simplest) would be to simply remove the last line (ignore this year's data):

In [None]:
header = data_lines[0]

# select the data block as being from entry 1 to -1
# so, **not including the last row**
data = data_lines[1:-1]

print('header:',header)

for i in 0,1,len(data)-1:
    print('line {} {}\n\t{}'.format(i,type(data[i]),data[i]))

**Exercise 1.2.7**

* copy the code from above and explore the response using line indices `-1` and `-2`.

In [None]:
# do exercise here

If we want to manipulate or plot the information contained in this (the numbers), we need to convert each of the string representations to a floating point number, e.g. the number `-1.03` rather than the string `'-1.03'`.

Each entry in the list `data` is a string, as we saw above.

We can split an individual string (such as `data[0]`) into a list of strings, using the string method `split()`. By default, this splits on 'white space' (i.e. spaces or tab characters), so, e.g.:



In [None]:
line = data[0].split()
print(data[0])
print(line,len(line))

So, we have split the long string into 13 strings in a list. 

We want to generate a new list with 13 corresponding floating point values:

In [None]:
# split the line on whitespace
line = data[0].split()

# make a new list of the same length
# by copying the variable line
float_data = line.copy()

for index,line_data in enumerate(line):
    # insert the cast float into the list
    # in the right order (use index)
    float_data[index] = float(line_data)
    
# this is the string list
print(line)

# this is the float list
print(float_data)

**Exercise 1.2.8**

* set a variable to be the string `"2, 3, 5, 7, 11, 13, 17, 19, 23, 29"`
* use the approach above to generate a **list of integers** of the first 10 prime numbers. 
* print the list with syntax of the pattern of 'prime number 3 is 7'

Make sure you convert each prime number to an integer, rather than leaving it as a string!

Hint: We can still use the method `split()` to split the string into a list of strings, but this time the [separator](https://python-reference.readthedocs.io/en/latest/docs/str/split.html) is a comma, rather than whitespace. 

In [None]:
# do exercise here
pstring = "2, 3, 5, 7, 11, 13, 17, 19, 23, 29"

Normally, we wouldn't go to the trouble of first copying the list. 

Instead, **where the contents of the loop are simple** (e.g. a single statement) we would use a different way of using a `for` loop, called an **implicit loop** (more formally, a *list comprehension*).

In this case:

    for item in group:
        doit(item)
        
becomes:

    [doit(item) for item in group]
    
with the additional feature that everything returned by `doit(item)` for each item of `group` is put in a list.

In [None]:
# split the line on whitespace
line = data[0].split()

# implicit for loop
float_data = [float(line_data) for line_data in line]
    
# this is the string list
print(line)
# this is the float list
print(float_data)

The statement:

    float_data = [float(line_data) for line_data in line]

is much more [Pythonic](https://docs.python-guide.org/writing/style/) than the code above. It is simple, elegant and neat.
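
To see that the two forms give the same result, here is a minimal self-contained sketch using a short made-up string of numbers. The explicit loop builds its list with the standard list method `append()`, rather than by copying a list first.

```python
values = "1.5 -2.0 3.25".split()

# explicit loop: start with an empty list and append to it
loop_result = []
for item in values:
    loop_result.append(float(item))

# implicit loop: the same result in one line
implicit_result = [float(item) for item in values]

print(loop_result)       # [1.5, -2.0, 3.25]
print(implicit_result)   # [1.5, -2.0, 3.25]
```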

We can *nest* for statements, i.e. put one for loop inside another. This allows us to treat data of multiple dimensions.

In the examples above, we converted only the data in `data[0]` to a list of floating point numbers.
If we wanted to process *all* lines of data, we would have to loop over them as well, in an 'outer' loop.



In [None]:
# use a step of 10 for illustration purposes
# to save space when printing

step = 10

for index,line in enumerate(data_table.splitlines()[1:-1:step]):
    # convert each line to list of floats
    float_data = [float(line_data) for line_data in line.split()]
    print('line {} is {}'.format(index*step,float_data))

Note that whilst we have calculated `float_data` in the loop for each line, it gets over-written with each new line as things stand.

We can do the same thing, and generate a list of the responses more neatly, using an implicit loop inside another implicit loop:

In [None]:
all_float_data = [[float(line_data) for line_data in line.split()] for line in data_table.splitlines()[1:-1]]

The variable `all_float_data` is now a sort of 'two dimensional' list, within which we can refer to individual items as e.g. `all_float_data[10][3]` for row `10`, column `3`.
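
The same row and column indexing can be tried on a small made-up table (invented values, not the ENSO data), which makes it easier to check the result by eye:

```python
text = "1950 -1.0 -1.1\n1951 0.5 0.6\n1952 0.4 0.1"

# one inner list of floats per line of text
table = [[float(item) for item in line.split()] for line in text.splitlines()]

print(table[1])      # [1951.0, 0.5, 0.6]  (row 1)
print(table[1][2])   # 0.6                 (row 1, column 2)

# one column: take the same index from every row
column_1 = [row[1] for row in table]
print(column_1)      # [-1.0, 0.5, 0.4]
```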

Let's use this idea to print out column 0 of each row (containing the `YEAR` data). We will use the function `range(nrows)` that (implicitly) generates the sequence `0, 1, 2, 3, ..., nrows-1`. 

Notice the use of `end=' '` in the `print` statement. This replaces the usual newline by whatever is specified by the keyword `end`. Note also that we have used `{:.0f}` to specify the format term. This indicates that the term is to be printed as a floating point number (the `f`) with zero digits after the decimal point (`.0`).

In [None]:
nrows = len(all_float_data)
i = 0

print('column {} of the data gives:\n'.format(i))
for row in range(nrows):
    print('{:.0f}'.format(all_float_data[row][i]),end=' ')
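
A short sketch of the format term and the `end` keyword on their own, using made-up values:

```python
year = 1950.0
value = 3.14159

# .0f : floating point, no digits after the decimal point
no_decimals = '{:.0f}'.format(year)
# .2f : floating point, two digits after the decimal point
two_decimals = '{:.2f}'.format(value)
print(no_decimals)    # 1950
print(two_decimals)   # 3.14

# end=' ' keeps successive print() calls on the same line
for item in (1950.0, 1951.0, 1952.0):
    print('{:.0f}'.format(item), end=' ')
print()   # finish the line with a normal newline
```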

**Exercise 1.2.9**

* use an implicit loop to create a list of ENSO values in a variable `enso` for the years 1950 up to last year for the period `DECJAN`.
* produce a plot of ENSO for `DECJAN` as a function of year (see below on how to do that).

Hint: check which column in the header is `DECJAN`. To start you off on this, we give you the implicit loop code for extracting the column containing the `YEAR` data (column 0). We also give you the code to achieve the plotting.

In [None]:
# do exercise here

# generate a list called years of column 0 data
years = [all_float_data[row][0] for row in range(nrows)]

# you need to put the enso data in here!
# this is put in as a dummy that should plot a straight line!
enso = years.copy()

# for plotting
import pylab as plt
%matplotlib inline

# 
plt.figure(0,figsize=(12,3))
plt.plot(years,enso)
plt.xlabel('year')
plt.ylabel('ENSO')

### 1.2.6 Summary

In section 1.2 you have been introduced to text representation in Python, as strings (type `str`), and shown that this sort of variable can be thought of as an 'array', and that it has a length that can be accessed with `len()`.

Other useful string manipulation methods you were introduced to are: `replace()`, `find()`, `split()` and `splitlines()`, though of course there are [many more](https://docs.python.org/3/library/string.html).

In an 'array', we can use an index to refer to a particular item (e.g. index 0 for the first item, 1 for the second, -1 for the last). We can use this idea to manipulate strings. 

In a more general sense, we can take a 'slice' of an array, with the syntax `[start:stop:skip]` giving access to a regularly spaced part of an array. We can use this, for example, to print out every 10th value (`skip=10`).

You were also introduced to the idea of looping control structures, using a `for ... in ...:` statement, and the equivalent implicit form. This introduced the idea of [indented code blocks](https://wiki.python.org/moin/Why%20separate%20sections%20by%20indentation%20instead%20of%20by%20brackets%20or%20%27end%27) and (related) nested structures (loops within loops).
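
The two forms of loop can be compared side by side. This is only a sketch, using a made-up list of strings rather than the dataset from the section:

```python
# hypothetical data: a list of strings we want as floats
text_values = ['1.5', '2.0', '3.25']

# explicit loop: the indented block is run once per item
explicit = []
for item in text_values:
    explicit.append(float(item))

# implicit loop (also called a list comprehension):
# the same result in one line
implicit = [float(item) for item in text_values]

print(explicit)
print(implicit)
```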

In passing, you have also been shown how to pull html data from a URL (scraping) using the [`requests`](http://docs.python-requests.org/en/master/) package, and also how to produce a simple data plot, using [`pylab`](https://matplotlib.org/index.html).

## 1.3. Groups of things
Very often, we will want to group items together. There are several main mechanisms for doing this in Python, known as:

* string, e.g. `'hello'`
* tuple, e.g. `(1, 2, 3)`
* list, e.g. `[1, 2, 3]`
* numpy array e.g. `np.array([1, 2, 3])`

A slightly different form of group is a dictionary:

* dict, e.g. `{1:'one', 2:'two', 3:'three'}`

You will notice that the grouping structures tuple, list and dict each use a different form of bracket. The numpy array is fundamental to much work that we will do later.

We have dealt with the idea of a string as an ordered collection in the material above, so will deal with the others here.

We noted the concept of length (`len()`), that elements of the ordered collection could be accessed via an index, and came across the concept of a slice. All of these same ideas apply to the first set of groups (string, tuple, list, numpy array) as they are all ordered collections.

A dictionary is not accessed by position, however, so indices have no role (although, since Python 3.7, a dictionary does remember the order in which items were inserted). Instead, we use 'keys'.
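
To see that these ideas carry over between the ordered groups, the following sketch applies `len()`, an index and a slice to the same three values held in each type. The values are made up for illustration:

```python
import numpy as np

# the same three values in each type of ordered group
groups = ['abc', ('a', 'b', 'c'), ['a', 'b', 'c'], np.array(['a', 'b', 'c'])]

for g in groups:
    # len(), an index and a slice work the same way for all of them
    print(type(g).__name__, len(g), g[0], g[-1], g[0:2])

# a dictionary is accessed by key rather than by position
d = {1: 'one', 2: 'two', 3: 'three'}
print(d[2])
```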

### 1.3.1 `tuple`
A tuple is a group of items separated by commas. In the case of a tuple, the brackets are optional.
You can have a group of different types in a tuple (e.g. `int`, `int`, `str`, `bool`).

In [None]:
# load into the tuple
t = (1, 2, 'three', False)

# unload from the tuple
a,b,c,d = t

print(t)
print(a,b,c,d)
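
Since the brackets are optional, the same tuple can be written without them. A minimal sketch (the variable names are chosen only for this example):

```python
# the same tuple written with and without brackets
t_with = (1, 2, 'three', False)
t_without = 1, 2, 'three', False

print(t_with == t_without)
print(type(t_without))
```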

If there is only one element in a tuple, you must put a comma (`,`) at the end, otherwise it is not interpreted as a tuple:



In [None]:
t = (1)
print (t,type(t))
t = (1,)
print (t,type(t))

You can have an empty tuple though:



In [None]:
t = ()
print (t,type(t))

**E1.3.1 Exercise**

* create a tuple called t that contains the integers 1 to 5 inclusive
* print out the value of t
* use the tuple to set variables a1,a2,a3,a4,a5

In [None]:
# do exercise here


### 1.3.2  `list`
A `list` is similar to a `tuple`. One main difference is that you can change individual elements in a list but not in a tuple.
To convert between a list and a tuple, use the 'casting' functions `list()` and `tuple()`:

In [None]:

# a tuple
t0 = (1,2,3)

# cast to a list
l = list(t0)

# cast to a tuple
t = tuple(l)

print('type of {} is {}'.format(t,type(t)))
print('type of {} is {}'.format(l,type(l)))
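
The difference between changing an element of a list and of a tuple can be seen in the sketch below. The `try ... except` construct is used here only to catch the error so that the cell runs to completion, and the variable names are made up for the example:

```python
values_list = [1, 2, 3]
values_tuple = (1, 2, 3)

# changing an element of a list works
values_list[0] = 100
print(values_list)

# the same operation on a tuple raises a TypeError
try:
    values_tuple[0] = 100
except TypeError as err:
    print('cannot change a tuple:', err)
    tuple_changed = False
else:
    tuple_changed = True
```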

You can concatenate (join) lists or tuples with the `+` operator:



In [None]:
l0 = [1,2,3]
l1 = [4,5,6]

l = l0 + l1
print ('joint list:',l)

**E1.3.2 Exercise**
* copy the code from the cell above, but instead of lists, use tuples
* loop over each element in the tuple and print out the data type and value of the element

Hint: use a `for ... in ...` construct.

In [None]:
# do exercise here

A common method associated with lists or tuples is:
* `index()`

Some useful functions that will operate on lists and tuples are:
* `len()`
* `sorted()` (the `sort()` method exists only for lists)
* `min()`, `max()`



In [None]:
l0 = (2,8,4,32,16)

# print the index of the item integer 4 
# in the tuple / list

item_number = 4

# Note the dot . here
# as index is a method of the tuple (and list) class
ind  = l0.index(item_number)

# notice that this is different
# as len() is not a method, but a function that
# does operate on lists/tuples
# Note: do not use len as a variable name!
llen = len(l0)

# note the use of integers in the braces e.g. {0}
# rather than empty braces as before. This allows us to
# refer to particular items in the format argument list
print('the index of {0} in {1} is {2}'.format(item_number,l0,ind))
print('the length of the {0} {1} is {2}'.format(type(l0),l0,llen))
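
The built-in functions `min()`, `max()` and `sorted()` can be tried on the same tuple. This is a sketch only. Note that `sorted()` is a built-in function that returns a new list, unlike the list method `sort()`, which changes a list in place:

```python
l0 = (2, 8, 4, 32, 16)

# min() and max() work on tuples and lists alike
smallest = min(l0)
largest = max(l0)

# sorted() returns a new, sorted list and leaves l0 unchanged
# (it returns a list even when given a tuple)
in_order = sorted(l0)

print(smallest, largest)
print(in_order, l0)
```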


**E1.3.3 Exercise**

* copy the code to the block below, and test that this works with lists, as well as tuples
* find the index of the integer 16 in the tuple/list
* what is the index of the first item?
* what is the length of the tuple/list?
* what is the index of the last item?

In [None]:
# do exercise here

A list has a much richer set of methods than a tuple. This is because we can add or remove list items (but not tuple items).

* `insert(i,j)` : insert `j` before item `i` in the list
* `append(j)` : append `j` to the end of the list
* `sort()` : sort the list

Tuples and lists are 'ordered' (i.e. they maintain the order they are loaded in) so that individual elements may be accessed through an 'index'. The index values start at 0 as we saw above. The index of the last element in a list/tuple is the length of the group, minus 1. This can also be referred to as index `-1`.

In [None]:
l0 = [2,8,4,32,16]

# insert 64 at the beginning (before item 0)
# Note that this inserts 'in place'
# i.e. the list is changed by calling this
l0.insert(0,64)


# insert 128 *before* the last item (item -1)
l0.insert(-1,128)

# append 256 on the end
l0.append(256)

# copy the list 
# and sort the copy
# Note the use of the copy() method here
# to create a copy
l1 = l0.copy()

# Note that this sorts 'in place'
# i.e. the list is changed by calling this
l1.sort()

print('the list {0} once sorted is {1}'.format(l0,l1))

**E1.3.4 Exercise**

* copy the above code and try out some different locations for inserting values (e.g. what does index `-2` mean?)
* what happens if you take off the `.copy()` statement in the line `l1 = l0.copy()`, i.e. just use `l1 = l0`?  [Why is this?](https://www.afternerd.com/blog/python-copy-list/)

In [None]:
# do exercise here

### 1.3.3 `np.array`

An array is a group of objects of the same type. Because they are of the same type, they can be stored efficiently in computer memory, and also accessed efficiently.

Whilst there are different ways of forming arrays, the most common is to use numpy arrays, using the package `numpy`. To use this, we must first import the package into the current workspace. We do this with the `import` statement. Using the optional `as` clause allows us to use a shorter (or more suitable) name for the package. We will generally call numpy `np`, so we use:

`import numpy as np`

to import ('load') the numpy package. 
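
As a brief sketch of what 'objects of the same type' means in practice, the example below creates two small arrays from made-up values and prints their `dtype`. When integers and floats are mixed, numpy stores every element as a float:

```python
import numpy as np

# all integers: stored as an integer array
a = np.array([1, 2, 3])

# mixing int and float: every element is stored as a float
b = np.array([1, 2.5, 3])

print(a, a.dtype)
print(b, b.dtype)
```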

Often, we will read data from a file/URL as we did above for the ENSO dataset. In that case, we had to step through each item to convert from string form to floating point number.

This sort of thing is much more simply done using methods associated with numpy arrays. 

A particularly useful numpy method is `np.loadtxt(file)` that loads an ASCII table of data straight into a numpy array.

Whilst this is designed to load data from a file, we can use `io.StringIO()` from the `io` package to make data that we already have as a string appear to `np.loadtxt` as if it were a file. This is a useful 'trick' for using methods that expect data in a file. The `unpack=True` option transposes the result, so that each item of the returned array holds one column of the table (e.g. `data[0]` is the first column read). The `usecols` option lets us select only those data columns we wish to read (0 and 1 here).
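
The `io.StringIO()` trick can be tried without any download. The sketch below uses a small made-up table (the numbers are illustrative, not real ENSO values):

```python
import io
import numpy as np

# a made-up table: year, then two data columns
txt = "1950 -1.0 -1.1\n1951 -1.1 -1.2\n1952 0.4 0.1\n"

# io.StringIO wraps the string so it can be read like a file
data = np.loadtxt(io.StringIO(txt), unpack=True, usecols=[0, 1])

# with unpack=True, data[0] is column 0 and data[1] is column 1
print(data.shape)
print(data[0])
print(data[1])
```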


An alternative to `np.loadtxt()` is `np.genfromtxt()`. This has some additional features, such as the `invalid_raise` flag. If this is set to `False`, lines with an inconsistent number of columns are skipped with a warning rather than raising an error. Further, we can explicitly set what will indicate `missing_values` in the input and what we would like to replace them with (`filling_values`), which can be useful for tidying up datasets.
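
A minimal sketch of `missing_values` and `filling_values`, again using a made-up table. The marker `N/A` and the fill value `-999` are assumptions chosen for this example, not something defined by the ENSO dataset:

```python
import io
import numpy as np

# a made-up table in which one value is recorded as N/A
txt = "1950 -1.0 -1.1\n1951 N/A -1.2\n1952 0.4 0.1\n"

# treat 'N/A' as missing, and replace it with -999
data = np.genfromtxt(io.StringIO(txt),
                     missing_values='N/A',
                     filling_values=-999.0,
                     unpack=True)

# data[1] is the second column, with the gap filled
print(data[1])
```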




In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

# select a data column
data_column = 1

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True,usecols=[0,data_column])

# so data[0] is the year data
#    data[1] is the enso data for column data_column
# print some attributes of the data array

print('array type',type(data))
print('data type',data.dtype)
print('number of dimensions',data.ndim)
print('data shape',data.shape)
print('data size',data.size)

# for plotting
import pylab as plt
%matplotlib inline

# 
plt.figure(0,figsize=(12,3))
plt.plot(data[0],data[1],label=header[data_column])
plt.xlabel('year')
plt.ylabel('ENSO')
plt.title('ENSO data from {0}'.format(url))
plt.legend(loc='best')

We saw in the example above that a numpy array (`<class 'numpy.ndarray'>`) has a set of attributes that include `shape`, `ndim`, `dtype` and `size` that we can use to query information about the array. We will learn more about processing data with numpy arrays later in the course, but you should already see that they are a useful construct for manipulating multi-dimensional datasets.

**Exercise 1.3.4**

* copy the code from the block above and modify it to plot the ENSO data for the period `FEBMAR`. Check this by looking at the data in the [original table](http://www.esrl.noaa.gov/psd/enso/mei/table.html).
* modify the code to produce a plot of *all* periods (so the graph should have 12 lines, correctly labelled)

Hint: You will need to consider what, if anything, to set for `usecols` (what happens if you don't set `usecols`?) and provide a looping structure for the plotting.

In [None]:
# do exercise here

### 1.3.4 `dict`



The collections we have used so far have all been ordered. This means that we can refer to a particular element in the group by an index, e.g. `array[10]`.

A dictionary is not accessed by position. Instead of indices, we use 'keys' to refer to elements: each element has a key associated with it. It can be very useful for data organisation (e.g. databases) to have a key to refer to, rather than e.g. some arbitrary column number in a gridded dataset.

A dictionary is defined as a group in braces (curly brackets). For each element, we specify the key and then the value, separated by `:`.

In [None]:
a = {'one': 1, 'two': 2, 'three': 3}

# we then refer to the keys and values in the dict as:

print ('a:\n\t',a)
print ('a.keys():\n\t',a.keys())     # the keys
print ('a.values():\n\t',a.values()) # returns the values
print ('a.items():\n\t',a.items())   # returns (key, value) tuples

Since Python 3.7, a dictionary keeps its items in the order in which they were inserted, so a `for` loop will return them in that order (in earlier versions the order was not guaranteed). We will often use such a loop to iterate over the items in a dictionary.

In [None]:
for key,value in a.items():
    print(key,value)

We refer to specific items using the key e.g.:

In [None]:
print(a['one'])

You can add to a dictionary:

In [None]:
a.update({'four':4,'five':5})
print(a)

# or for a single value
a['six'] = 6
print(a)

Quite often, you find that you have the keys you want to use in a dictionary as a list or array, and the values in another list.

In such a case, we can use the function `zip(keys,values)` to load into the dictionary. For example:

In [None]:
values = [1,2,3,4]
keys = ['one','two','three','four']

a = dict(zip(keys,values))

print(a)

We will use this idea to make a dictionary of our ENSO dataset, using the items in the header for the keys. In this way, we obtain a  more elegant representation of the dataset, and can refer to items by names (keys) instead of column numbers.

In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei.old/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))

key = 'MAYJUN'
# plot data
plt.figure(0,figsize=(12,7))
plt.title('ENSO data from {0}'.format(url))
plt.plot(data_dict['YEAR'],data_dict[key],label=key)
plt.xlabel('year')
plt.ylabel('ENSO')
plt.legend(loc='best')

**Exercise 1.3.5**

* copy the code above, and modify so that datasets for months `['MAYJUN','JUNJUL','JULAUG']` are plotted on the graph

Hint: use a for loop

In [None]:
# do exercise here

We can also usefully use a dictionary with a printing format statement. In that case, we refer directly to the key in the format string. This can make printing statements much easier to read. We don't directly pass the dictionary to the `format` method, but rather `**dict`, where `**dict` means "treat the key-value pairs in the dictionary as additional named arguments to this function call".

So, in the example:

In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))
print(data_dict.keys())

# print the data for MAYJUN
print('data for MAYJUN: {MAYJUN}'.format(**data_dict))

The line `print('data for MAYJUN: {MAYJUN}'.format(**data_dict))` is equivalent to writing:

    print('data for MAYJUN: {MAYJUN}'.format(YEAR=data_dict['YEAR'],DECJAN=data_dict['DECJAN'], ...))
    
In this way, we use the keys in the dictionary as keywords to pass to a method.
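
The equivalence can be checked on a small dictionary without any download. In this sketch the dictionary `info` and its contents are made up for illustration:

```python
# a hypothetical dictionary of named values
info = {'name': 'MAYJUN', 'value': 0.5, 'year': 1950}

template = '{name} in {year}: {value}'

# passing each key explicitly as a keyword argument
explicit = template.format(name=info['name'],
                           value=info['value'],
                           year=info['year'])

# letting ** unpack the dictionary into keyword arguments
unpacked = template.format(**info)

print(explicit)
print(unpacked)
```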

Another useful example of such a use of a dictionary is in saving a numpy dataset to file.

If the data are numpy arrays in a dictionary as above, we can store the dataset using:



In [None]:
import requests
import numpy as np
import io

# access dataset as above
url = "http://www.esrl.noaa.gov/psd/enso/mei/table.html"
txt = requests.get(url).text

# copy the useful data
start_head = txt.find('YEAR')
start_data = txt.find('1950\t')
stop_data  = txt.find('2018\t')

header = txt[start_head:start_data].split()
data = np.loadtxt(io.StringIO(txt[start_data:stop_data]),unpack=True)

# use zip to load into a dictionary
data_dict = dict(zip(header, data))

filename = 'enso_mei.npz'

# save the dataset
np.savez_compressed(filename,**data_dict)

What we load from the file is a dictionary-like object `<class 'numpy.lib.npyio.NpzFile'>`.

If needed, we can cast this to a dictionary with `dict()`, but it is generally more efficient to keep the original type.

In [None]:
# load the dataset

filename = 'enso_mei.npz'

loaded_data = np.load(filename)

print(type(loaded_data))

# test they are the same using np.array_equal
for k in loaded_data.keys():
    print('\t',k,np.array_equal(data_dict[k], loaded_data[k]))
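
The save and load steps can also be tried without network access. The sketch below uses two small made-up arrays and writes to a temporary directory (an assumption made so that no file is left behind). It also shows the cast to a plain dictionary with `dict()`:

```python
import os
import tempfile
import numpy as np

# a made-up dataset: two named arrays
data_dict = {'YEAR': np.array([1950., 1951., 1952.]),
             'DECJAN': np.array([-1.0, -1.1, 0.4])}

# write to a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'example.npz')
    np.savez_compressed(filename, **data_dict)

    # load, and cast to an ordinary dictionary while the file is open
    with np.load(filename) as loaded_data:
        loaded_dict = dict(loaded_data)

# test they are the same using np.array_equal
for k in data_dict:
    print(k, np.array_equal(data_dict[k], loaded_dict[k]))
```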

**Exercise 1.3.6**

* Using what you have learned above, access the Met Office data file [`https://www.metoffice.gov.uk/hadobs/hadukp/data/monthly/HadSEEP_monthly_qc.txt`](https://www.metoffice.gov.uk/hadobs/hadukp/data/monthly/HadSEEP_monthly_qc.txt) and create a 'data package' in a numpy `.npz` file that has keys of `YEAR` and each month in the year, with associated datasets of Monthly Southeast England precipitation (mm).
* confirm that the data in your `npz` file are the same as in your original dictionary
* produce a plot of October rainfall using these data for the years 1900 onwards

In [None]:
# do exercise here

### 1.3.5 Summary

In this section, we have extended the types of data we might come across to include groups. We dealt with ordered groups of various types (`tuple`, `list`), and introduced the numpy package for numpy arrays (`np.array`). We saw dictionaries as collections in which we refer to individual items with a key.

We learned in the previous section how to pull apart a dataset presented as a string using loops and various methods, and to construct a useful dataset 'by hand' in a list or similar structure. It is useful, when learning to program, to know how to do this.

Here, we saw that packages such as numpy provide higher level routines that make reading data easier, and we would generally use these in practice. We saw how we can use `zip()` to help load a dataset from arrays into a dictionary, and also the value of using a dictionary representation when saving numpy files.