*Edited: 2022-09-13*

# CMM201 &mdash; Lab 2.1

Now that we are into the second block of CMM201, the focus is on data. We need to look at storing multiple items of data in a single variable. This can be done with collections, such as lists and sets, which we will see next week. But this unit is all about the **tuple**, which is the simplest collection type.

## Creating Tuples

We can create a tuple by separating the elements with commas.

In [None]:
1, 25, 91, 6

Notice the Jupyter output displays the tuple with round brackets `(` `)`. These aren't really part of the tuple; that is just how it is displayed. If you like, you can put round brackets around it. It won't change what object you're creating.

In [None]:
(1, 25, 91, 6)

We can assign a tuple to a variable, in the same way we do with any other type.

In [None]:
x = 2, 4, 8, 16

and recall the variable later.

In [None]:
x

There is no way to change the tuple once it is created, but we can create a new tuple and reuse the old variable name.

Let's say we want to add `32` to the end of the tuple. We can just throw away the old tuple and create a brand new tuple containing the items we want.

In [None]:
x = 2, 4, 8, 16, 32

x

We can add elements to a tuple using the `+` operator. However, this does *not* change the tuple; it simply creates a new tuple!

Let's say we want to add `64, 128, 256`. The following code will create a new tuple containing both the old and newly added elements.

By the way, notice that in this case we do need the brackets.

In [None]:
x + (64, 128, 256)

Notice that `x` is unaffected.

In [None]:
x

If we want `x` to change, we must use the assignment operator `x = `... something

For example:

In [None]:
x = x + (64, 128, 256)

x

The `+=` operator also works as an assignment.

Recall from unit one that if we have a variable `y` which stores a number, we can increase the variable by one by writing `y = y + 1` or the shorthand `y += 1`. The same shorthand works for tuples.

**Note:** You may notice an extra comma `,` in the next cell. That's not a mistake: we are specifying a length-one tuple. A tuple with **exactly one item** needs a trailing comma.

In [None]:
x += (512, )

x

In the next unit, we will encounter *mutable* collections, but for tuples, we can't change the individual values, except by creating a new tuple.
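To make the immutability concrete, here is a small sketch. The variable names `numbers` and `original` are only for illustration. Assigning to an index raises a `TypeError`, while `+=` builds a new tuple and points the name at it, leaving the old tuple untouched.

```python
numbers = 2, 4, 8
original = numbers          # a second name for the same tuple object

# Trying to overwrite an element fails, because tuples are immutable.
try:
    numbers[0] = 99
    assignment_failed = False
except TypeError:
    assignment_failed = True

# += creates a new tuple and points the name `numbers` at it.
numbers += (16,)

print(assignment_failed)    # True
print(numbers)              # (2, 4, 8, 16)
print(original)             # (2, 4, 8)
```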

**(a)** Create a tuple called `shopping` to store three **strings** which are items of shopping/groceries. You can choose which three items to use.

In [None]:
...

We can check the length of the tuple using the `len` function. The following should show `3` as an output cell, assuming you created the tuple with exactly three items.

In [None]:
len(shopping)

**(b)** Modify the variable `shopping` so that in addition to the current three items, it also contains one more item, which you can choose.

In [None]:
...

Now we can verify that `shopping` has been updated and now has four items in total.

In [None]:
shopping

In [None]:
len(shopping)

## Indexing and Slicing

The items of a tuple are indexed, starting from `0` and going up to one less than the length. We can retrieve an element using indexing.

This syntax will give the second (not the first!!!) element from your shopping list.

In [None]:
second_item = shopping[1]

second_item

Actually, strings can be indexed in the same way. Let's get the first character of this item's string.

In [None]:
first_character = second_item[0]

first_character

We can also index backwards. For example, if we want the last item of a tuple, we could count the number of items, then subtract one:

In [None]:
shopping[len(shopping) - 1]

but it is better to use negative indexing. This makes it a lot easier for someone to read and understand your code.

In [None]:
shopping[-1]

Similarly, `-2` will give the second-last element.

Usually we use positive indexing, but if we are doing something like selecting the last element, then negative indexing comes in handy.
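As a quick check of how the two kinds of index relate, here is a sketch using a made-up tuple called `letters` (not part of the lab data). For a tuple of length `n`, the index `-k` picks the same element as the index `n - k`.

```python
letters = 'a', 'b', 'c', 'd'
n = len(letters)

# For each k from 1 to n, index -k and index n - k pick the same element.
for k in range(1, n + 1):
    print(-k, n - k, letters[-k], letters[n - k])

last = letters[-1]
second_last = letters[-2]
print(last, second_last)    # d c
```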

Let's define another tuple to use as an example. This will be of length `5`.

In [None]:
example = 10, 1000, 42, 'Banana', 3.14

Here is a table showing how the indices line up.

|Element|10|1000|42|'Banana'|3.14|
|:-|-:|-:|-:|-:|-:|
|Positive Index|0|1|2|3|4|
|Negative Index|-5|-4|-3|-2|-1|

**(c)** Use positive indexing to get the element 'Banana'.

In [None]:
...

**(d)** Now use negative indexing to get the same element.

In [None]:
...

There is a method defined on tuples called `index`, which will tell you the index of the first instance of an item. For example, this will tell you the index `4`. It will give the positive index, not the negative index.

In [None]:
example.index(3.14)

**(e)** Use `.index` to check the index of 'Banana'.

In [None]:
...

Slicing works like indexing, except that we are getting more than one item.

Looking again at our table of indices:

|Element|10|1000|42|'Banana'|3.14|
|:-|-:|-:|-:|-:|-:|
|Positive Index|0|1|2|3|4|
|Negative Index|-5|-4|-3|-2|-1|

Say we wanted the items `1000`, `42`, and `Banana`. We need to specify:

- The index of the first item (the index of `1000` is `1` or `-4`)
- The index one higher than the index of the last item (the index of `Banana` is `3`, so one higher is `4` or `-1`)

In [None]:
example[1:4]

We could have used negative indexing, so any of these would work:

In [None]:
example[1:-1]

In [None]:
example[-4:4]

In [None]:
example[-4:-1]

We can also specify a stride.

A slice that begins at index `0` and ends at index `4` is `[0:5]`.

This has a stride of `1`.

In [None]:
example[0:5]

But a slice with a stride of `2` (meaning we take only every 2nd item) is `[0:5:2]`.

In [None]:
example[0:5:2]

Note that since `0` is the beginning, we can leave it blank, and Python will assume we mean the beginning:

In [None]:
example[:5:2]

And also, note that `5` is the end, so we can leave it blank and Python will assume we mean the end:

In [None]:
example[::2]
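Pulling the slicing rules together, here is a short sketch on the same `example` tuple. The variable names on the left are only for illustration. Any part of `[start:stop:stride]` that is left blank falls back to its default, and a slice always produces a new tuple.

```python
example = 10, 1000, 42, 'Banana', 3.14

# Positive and negative indices can be mixed freely in a slice.
middle = example[1:4]
same_middle = example[-4:-1]

# Leaving out start, stop or stride falls back to the defaults.
every_other = example[::2]
from_two = example[2:]
up_to_two = example[:2]

print(middle)        # (1000, 42, 'Banana')
print(every_other)   # (10, 42, 3.14)
print(from_two)      # (42, 'Banana', 3.14)
print(up_to_two)     # (10, 1000)
```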

For the next exercise we will define another tuple.

In [None]:
powers = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024

**(f)** Using positive indexing, create a slice containing `8, 16, 32, 64`.

In [None]:
...

Use negative indexing to define the same slice.

In [None]:
powers[-8:-4]

**(g)** Starting at the element `8`, go to the end in steps of `3`, i.e. `8, 64, 512`

In [None]:
...

## Functions

Functions can accept any type of variable as an argument (including collections) and return any type of variable as a return value (including collections).

You just saw a function, `len`, which accepts a collection as an argument.

In the example, we used the value of a variable as an argument:

In [None]:
len(x)

But, like with other types, we can still pass a tuple as an argument, even if it is not already assigned to a variable.

Like this:

In [None]:
len((2, 4, 8, 16, 32, 64, 128, 256, 512))

Why the double brackets?

The outer-most brackets are there to call the function.

The inner-most brackets are there to make it clear to Python that there is only one argument, a `tuple`.

What happens if we remove the inner brackets?

In [None]:
len(2, 4, 8, 16, 32, 64, 128, 256, 512)

We see a `TypeError`. Python thinks we are trying to call `len` with multiple different arguments, but `len` takes only one argument. An easy mistake to make!
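If you would like to see both cases side by side without the notebook stopping on the error, here is a sketch that catches the `TypeError`. The names `length` and `raised` are only for illustration.

```python
# With the inner brackets, len receives one tuple.
length = len((2, 4, 8))

# Without them, len receives three separate arguments and raises TypeError.
try:
    len(2, 4, 8)
    raised = False
except TypeError as error:
    raised = True
    print(f'TypeError: {error}')

print(length)   # 3
```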

**(h)** Use the `len` function to check the length of a `tuple` containing exactly **one** element, which is the string `'hello'`. You should find the length is `1`. (If you get `5`, then something went wrong!)

In [None]:
...

Before continuing with functions, let's talk about 'unpacking'.

If you know the length of a tuple, you can write some convenient code to assign each of the individual elements to different variables, without having to specify indices.

For example, this length-two tuple which stores a person's name and age:

In [None]:
person = 'Bob', 36

We can 'unpack' the tuple by using commas on the left-hand side of an assignment operator.

In [None]:
name, age = person

print(f"{name}'s age is {age}.")

This is equivalent to

In [None]:
name = person[0]
age = person[1]

print(f"{name}'s age is {age}.")
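Unpacking only works when the number of names on the left matches the length of the tuple. Here is a sketch of what happens when they don't match; the extra name `city` is made up for this example.

```python
person = 'Bob', 36

# Two names on the left, two elements on the right: this works.
name, age = person

# Three names for a length-two tuple: Python raises ValueError.
try:
    name, age, city = person
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(f'ValueError: {error}')

print(name, age)    # Bob 36
```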

For this next exercise we will define a tuple containing information about a company.

In [None]:
company = 'Alphabet Inc.', 'GOOGL', 1616.11

**(i)** Refactor the following code to use tuple unpacking.

In [None]:
company_name = company[0]
ticker_symbol = company[1]
stock_price = company[2]

In [None]:
...

Now that we know about unpacking, we can use this to have a function return multiple values.

Let's look at an example. Here we are going to take a tuple as an argument and return the mean and median of the data. For this, we will use a package called *NumPy*, which should come pre-installed with your installation of Anaconda.

To import NumPy you type:

    import numpy

However, it is a convention to instead use:

    import numpy as np
    
which means that we are aliasing `numpy` as the shorter `np` for convenience. Many Python packages used in data science are conventionally imported with a standard two-letter alias, such as `pd` for `pandas`, `px` for `plotly.express`, and `tf` for `tensorflow`.

In [None]:
import numpy as np

def averages(data):
    return np.mean(data), np.median(data)

Now let's call the function on a data set of `13, 0, 1, 5` (not forgetting the double brackets, because we have not assigned the tuple to a variable first!).

In [None]:
averages((13, 0, 1, 5))

As you can see, we got a tuple as a result, but we probably want to do something with these separately. So let's use unpacking.

In [None]:
mean, median = averages((13, 0, 1, 5))

print(f'The mean of our data is {mean}.')
print(f'The median of our data is {median}.')
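If you want to double-check these numbers without NumPy, here is a sketch that swaps in the standard library `statistics` module. The function name `averages_stdlib` is made up for this example; the idea of returning two values as a tuple is the same.

```python
import statistics

def averages_stdlib(data):
    # Same idea as averages above, using the standard library instead of NumPy.
    return statistics.mean(data), statistics.median(data)

mean, median = averages_stdlib((13, 0, 1, 5))

print(mean)     # 4.75
print(median)   # 3.0
```

The mean is `(13 + 0 + 1 + 5) / 4`, and the median is the average of the two middle values `1` and `5` once the data is sorted.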

If you don't like having to remember to use double-brackets, there are a couple of ways around this. One is to always assign our tuple to a variable first:

In [None]:
data = 13, 0, 1, 5

averages(data)

But, there is an alternative. We can define our function with what is often called 'star args'.

In [None]:
def averages(*data):
    return np.mean(data), np.median(data)

When the argument is defined as star args, Python will cleverly understand our function call even without the double-brackets!

In [None]:
averages(13, 0, 1, 5)

The down-side is that we can only do this with the final positional argument.
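Here is a sketch of how star args sit alongside an ordinary argument. The function `describe` and its argument names are made up for this example. Inside the function, the starred name is always a tuple, and an existing tuple can be spread out into separate arguments by putting `*` in front of it in the call.

```python
def describe(label, *values):
    # `values` is always a tuple, even when no extra arguments are passed.
    return label, values, len(values)

print(describe('rolls', 4, 6, 1))   # ('rolls', (4, 6, 1), 3)
print(describe('empty'))            # ('empty', (), 0)

# An existing tuple can be spread into separate arguments with * at the call.
data = 13, 0, 1, 5
print(describe('data', *data))      # ('data', (13, 0, 1, 5), 4)
```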

**(j)** Define a function called `roller`, which rolls two six-sided dice and returns the **total**, and the **highest**. For example, if the rolls were 5 and 3, it would return 8 (total) and 5 (highest).

In [None]:
import random

def roller():
    ...

total, highest = roller()

print(f'The total rolled was {total}')
print(f'The highest die rolled was {highest}')

## Iteration

Iteration, or looping, is a very common thing to do with a data set.

We don't need a data set to do a loop. Using `range`, we can loop over the numbers `0` to `9` (just like in a slice, we specify one more than the number we want to end on; this is a very common computer science convention).

In [None]:
for i in range(10):
    print(i)

Remember the Fibonacci sequence function from before? This was defined using recursion.

In [None]:
def fib(n):
    if n == 1:
        return 1
    if n == 2:
        return 1
    return fib(n-2) + fib(n-1)

We can actually define this function using iteration instead:

In [None]:
def fib(n):
    previous, current = 0, 1
    for i in range(n-1):
        previous, current = current, previous + current
    return current

In [None]:
fib(10)

Doing it this way is actually more efficient, and allows us to work out the value of `fib` for very large numbers.

In [None]:
fib(1000)
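To convince ourselves that the two definitions really compute the same thing, here is a sketch that keeps both under different names (`fib_recursive` and `fib_iterative` are made up for this comparison) and checks them against each other. The recursive version works out the same smaller values again and again, which is why it slows down quickly as `n` grows.

```python
def fib_recursive(n):
    if n == 1 or n == 2:
        return 1
    return fib_recursive(n - 2) + fib_recursive(n - 1)

def fib_iterative(n):
    previous, current = 0, 1
    for _ in range(n - 1):
        previous, current = current, previous + current
    return current

# Compare the two definitions on the first twenty values.
agree = True
for n in range(1, 21):
    if fib_recursive(n) != fib_iterative(n):
        agree = False

print(agree)                # True
print(fib_iterative(20))    # 6765
```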

Let's look at the function piece by piece.

First we define the starting values `previous` and `current`. We can do this by typing

    previous = 0
    current = 1

or with unpacking

    previous, current = 0, 1

Either would work.

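Then we loop until we reach the number we want. The loop runs `n-1` times, because `current` already holds the first number of the sequence:

    for i in range(n-1):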

and each time we update the two numbers. The next number in the sequence will be the sum of the current two numbers, and the previous will be the old current number.

    previous, current = current, previous + current

Let's take an example:

In [None]:
fib(6)

First we have

    previous, current = 0, 1

then we count `0` on the iteration, and update the numbers to

    previous, current = 1, 0 + 1

which is

    previous, current = 1, 1

then we count `1` and update to

    previous, current = 1, 2

then we count `2` and update to

    previous, current = 2, 3

then we count `3` and update to

    previous, current = 3, 5

then we count `4` and update to

    previous, current = 5, 8

and return `current` which is `8`.

Most of the time, we will be dealing with iterating over data. But it is important to know that we don't need a data set to iterate.

Let's move on to iterating data sets. We will start with a tuple of a shopping list.

In [None]:
shopping = 'apple', 'orange', 'banana', 'pears'

shopping

If you have learned indexes, and `range`, you may be tempted to do this:

In [None]:
for index in range(len(shopping)):
    print(shopping[index])

But don't do the above; it's very bad style.

It's a very common thing for beginner Python programmers to try.

It can be refactored into a much simpler form below:

In [None]:
for item in shopping:
    print(item)

If we want to keep track of the index as we print, we might think to try this:

In [None]:
for index in range(len(shopping)):
    print(index, shopping[index], sep='\t')

But again, there is a much simpler refactoring:

In [None]:
for index, item in enumerate(shopping):
    print(index, item, sep='\t')

Did you notice the unpacking? Each time around the loop, `enumerate` returns a tuple of two elements, and we are unpacking them into two variables, `index` and `item`.
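To see those two-element tuples directly, we can collect everything `enumerate` produces into a list. This is only a sketch for inspection; the names `pairs` and `numbered` are made up. `enumerate` also accepts an optional `start` argument if you would rather not count from `0`.

```python
shopping = 'apple', 'orange', 'banana', 'pears'

# Collecting the pairs into a list makes the tuples visible.
pairs = list(enumerate(shopping))
print(pairs)

# The optional start argument changes the first number.
numbered = list(enumerate(shopping, start=1))
print(numbered[0])  # (1, 'apple')
```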

`enumerate` is a generator function.

Another generator function is `reversed`.

In [None]:
for item in reversed(shopping):
    print(item)

Notice that we loop over the shopping list backwards, but the original tuple is unaffected.

In [None]:
shopping

Another function used for the same purpose is `sorted`, although it is not a generator: it builds and returns a complete new list.

In [None]:
for item in sorted(shopping):
    print(item)
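The difference between the two shows up if we look at what each one gives back. In this sketch (variable names are only for illustration), `sorted` hands us a finished list, while `reversed` hands us something that produces items on demand and can only be gone through once.

```python
shopping = 'apple', 'orange', 'banana', 'pears'

in_order = sorted(shopping)       # a brand new list
backwards = reversed(shopping)    # produces items lazily, one at a time

print(type(in_order).__name__)    # list
print(in_order)                   # ['apple', 'banana', 'orange', 'pears']

# Once we have gone through `backwards`, it is used up.
first_pass = tuple(backwards)
second_pass = tuple(backwards)
print(first_pass)                 # ('pears', 'banana', 'orange', 'apple')
print(second_pass)                # ()
```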

`sorted` may fail if you are mixing types in the tuple.

In [None]:
x = 1, 52, 'Apple'

for item in sorted(x):
    print(item)

This is because Python can't decide where 'Apple' goes in relation to `1` and `52`.
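One possible way around this, shown here only as a sketch, is to tell `sorted` how to compare the items using its optional `key` argument. Choosing `key=str` is an assumption made for this example: it compares every item by its text form, so numbers are ordered as text rather than by value (for instance `10` would come before `9`).

```python
mixed = 1, 52, 'Apple'

# Sorting numbers and strings together fails with a TypeError.
try:
    sorted(mixed)
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Assumption for illustration: compare every item by its string form.
as_text = sorted(mixed, key=str)

print(comparison_failed)    # True
print(as_text)              # [1, 52, 'Apple']
```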

We can iterate in reverse sorted order by using both:

In [None]:
data = 'b', 'c', 'a'

for item in reversed(sorted(data)):
    print(item)

However, there is a better way: `sorted` takes an optional *named* argument called `reverse`.

In [None]:
data = 'b', 'c', 'a'

for item in sorted(data, reverse=True):
    print(item)

A named argument is a type of optional argument to a function, where you need to specify the argument name and value.

In this module, we won't cover *defining* functions with named arguments, but we will be using them in some functions like this. You will also find that later, when we use the Pandas and Plotly libraries, there are often named arguments.

**(k)** Using what we have learned above, use the data given to produce the following output:

    0 - Apple
    1 - Cherry
    2 - Coconut
    3 - Orange
    4 - Peach
    5 - Pear

In [None]:
fruits = 'Cherry', 'Orange', 'Pear', 'Apple', 'Coconut', 'Peach'

...

## Map and Filter

Next we will go over map and filter. This is a slightly advanced topic. You might choose to skip it and go straight to the section below on **Reading Files** instead.

Mapping and filtering will be revisited in the context of Pandas in CMM202.

Map and filter are examples of higher-order functions. This is because they use a function as one of the arguments.

Map takes two arguments, a function to apply and a data set.

In [None]:
def square(x):
    return x ** 2

data = 1, 2, 3, 4

for item in map(square, data):
    print(item)

We might want to keep the original data too. We can use the generator function `zip` to iterate over two things at once.

For example:

In [None]:
foo = 2, 5, 6
bar = 7, 7, 1

for x, y in zip(foo, bar):
    print(x, y)

This is another example of unpacking.
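One detail worth knowing: if the two collections have different lengths, `zip` stops as soon as the shorter one runs out. Here is a small sketch, with `bar` shortened on purpose.

```python
foo = 2, 5, 6
bar = 7, 7

# The 6 in foo has no partner in bar, so it is left out.
pairs = list(zip(foo, bar))
print(pairs)  # [(2, 7), (5, 7)]
```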

Now let's combine the result of our map with the original data:

In [None]:
def square(x):
    return x ** 2

data = 1, 2, 3, 4

for original, transformed in zip(data, map(square, data)):
    print(original, transformed)

The filter function works similarly, but takes a function which returns a Boolean, saying whether or not to include an item of data.

In [None]:
def less_than_four(x):
    return x < 4

data = 1, 2, 3, 4

for item in filter(less_than_four, data):
    print(item)

The function `less_than_four` will return `True` for `1`, `2`, and `3`, and return `False` for `4`.
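So far we have only looped over the results of `map` and `filter`. If we want to keep the results, we can collect them into a tuple. The helper functions `double` and `is_even` below are made up for this sketch and are not part of the exercises.

```python
def double(x):
    return x * 2

def is_even(x):
    return x % 2 == 0

data = 1, 2, 3, 4

# tuple(...) runs through everything map or filter produces and stores it.
doubled = tuple(map(double, data))
evens = tuple(filter(is_even, data))

print(doubled)  # (2, 4, 6, 8)
print(evens)    # (2, 4)
```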

**(l)** Using `map` and `filter`, and our functions `less_than_four` and `square`, write a loop which prints out the squares of the data items if the original item is less than 4.

You should get:

    0
    1
    4
    9

In [None]:
data = 0, 1, 2, 3, 4, 5

...

**(m)** Using `map` and `filter`, and our functions `less_than_four` and `square`, write a loop which prints out the squares of the data items if the square is less than 4.

You should get:

    0
    1

In [None]:
data = 0, 1, 2, 3, 4, 5

...

## Reading Files

In a later unit, we will use Pandas to read structured datasets like `.csv` files.

But using a programming language, such as Python, we can also process unstructured / semi-structured data.

We have a file called `lab_text.txt` which is available on Moodle. It contains the following text:

Since this is well-structured, we could use Pandas. However, we are going to process this file manually using a loop to show how it is done.

To open a file we use `with open('...') as file`, where `file` is the name of the variable we are choosing. We could choose any other valid variable name, but `file` seems like a sensible name for the file. Maybe if we were opening many files at once we might need to give them different, clearer names.

Note that in this module we recommend adding `, encoding='utf-8'` to the open function call. The code may work just fine without it. But, this is there because some students are working on laptops where the default language is not set to English, and there is a chance that Python will guess the encoding of the file incorrectly causing it to mis-read certain symbols like `£`. Adding the encoding argument usually fixes this.

To loop over a file, we use a for loop on the file variable.

In [None]:
with open('lab_text.txt', encoding='utf-8') as file:
    for line in file:
        print(line)

Since the lines come with an extra line break, we may want to strip this off using our string processing methods. `line` is just a string, and we can use any of the usual string processing methods which we learned in Unit 1.5.

In [None]:
with open('lab_text.txt', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        print(line)
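To see exactly what `strip` removes, here is a self-contained sketch that first writes a small throwaway file and then reads it back. The file name `demo.txt` and its two lines are made up for this example and have nothing to do with `lab_text.txt`. Each line read from a file still ends with `'\n'`, and since `print` adds a line break of its own, the unstripped version shows blank lines in between.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')
    with open(path, 'w', encoding='utf-8') as file:
        file.write('first line\nsecond line\n')

    raw_lines = []
    clean_lines = []
    with open(path, encoding='utf-8') as file:
        for line in file:
            raw_lines.append(line)            # still ends with '\n'
            clean_lines.append(line.strip())  # line break removed

print(raw_lines)    # ['first line\n', 'second line\n']
print(clean_lines)  # ['first line', 'second line']
```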

**(n)** Write a program which loops over the file `lab_text.txt` and prints out the name of an item of fruit, only if the price is greater than 25p. You should get:

    apples
    oranges

In [None]:
with open('lab_text.txt', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        ...