*Edited: 2022-11-08*

# CMM201 &mdash; Lab 2.2

In this lab, we use three new collection types: `list`, `set` and `dict`, which are all mutable collections. `list` will be the most familiar because it works almost exactly like a `tuple`.

## Lists

When defining a list, we need to use square brackets.

We have to include the square brackets; otherwise, Python will think we want a tuple!

In [None]:
[10, 20, 30]
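
To see the difference the brackets make, the short sketch below checks both forms with the built-in `type()` function. The variable names are chosen only for illustration.

```python
with_brackets = [10, 20, 30]      # square brackets: a list
without_brackets = 10, 20, 30     # no brackets: Python builds a tuple

print(type(with_brackets))        # <class 'list'>
print(type(without_brackets))     # <class 'tuple'>
```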

As with all other types, we can assign lists to variables, pass them as function arguments, and use them as return values.

In [None]:
x = [10, 20, 30]

x

In [None]:
def f(x):
    return x[0]

f([5000, 4000, 3000, 2000, 1000])

In [None]:
def g(a, b):
    return [1, 2, a, b, b, a]

g(100, 200)

Indexing, slicing, and iteration work exactly as they do with `tuple`.

In [None]:
x = [10, 20, 30, 40, 50, 60]

In [None]:
x[2:5]

In [None]:
for element in x:
    print(element)

In [None]:
for index, element in enumerate(x):
    print(index, element, sep='\t')

All of this should be familiar.

With lists, however, we can also mutate them.

The following code creates a list containing `[1, 2, 3]`, then mutates the second element.

In [None]:
x = [1, 2, 3]
x[1] = 100

x

If we had another reference to the list, another name binding, we would see the same change reflected there.

In [None]:
x = [1, 2, 3]
y = x
x[1] = 100

print('x equals', x)
print('y equals', y)
print('Are they the same?', x == y)

This is **not** the case if we use variable reassignment (instead of mutation) to change `x`.

In [None]:
x = [1, 2, 3]
y = x
x = [1, 100, 3]

print('x equals', x)
print('y equals', y)
print('Are they the same?', x == y)
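
If we want a second list that starts with the same contents but can be changed independently, one option is the list method `.copy()`. The sketch below is only an illustration of that idea and is not needed for the exercises.

```python
x = [1, 2, 3]
y = x.copy()    # y is a new list holding the same elements as x
x[1] = 100      # mutating x no longer affects y

print('x equals', x)    # x equals [1, 100, 3]
print('y equals', y)    # y equals [1, 2, 3]
print('Are they the same?', x == y)    # Are they the same? False
```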

With `tuple`, we only have the choice of variable reassignment, because mutation is not allowed. The cell below raises a `TypeError`:

In [None]:
x = 1, 2, 3
x[1] = 100
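
As a small sketch of what that means in practice, the code below catches the error from the attempted mutation and then uses reassignment instead, building a new tuple out of slices of the old one. This is just one possible way to build the replacement tuple.

```python
x = 1, 2, 3

try:
    x[1] = 100    # tuples do not support item assignment
except TypeError as error:
    print('Mutation failed:', error)

# Reassignment: build a new tuple and bind the name x to it
x = x[:1] + (100,) + x[2:]
print(x)    # (1, 100, 3)
```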

We can also add items to the end of a list,

In [None]:
x = [10, 20, 30]
x.append(40)

x

delete items from a list by index

In [None]:
x = [10, 20, 30, 40]
del x[2]

x

or by specifying the value

In [None]:
x = [10, 20, 30, 40]
x.remove(20)

x

and sort a list in-place using the `.sort()` method.

In [None]:
x = [30, 20, 10, 40]
x.sort()

x
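
A common slip with `.sort()` is to assign its result to a variable. Because the method sorts the list in place, it returns `None`. The sketch below shows this, alongside the built-in function `sorted()`, which leaves the original list alone and returns a new sorted list. The names `result`, `y` and `z` are illustrative.

```python
x = [30, 20, 10, 40]
result = x.sort()    # sorts x in place...

print(result)        # None
print(x)             # [10, 20, 30, 40]

y = [30, 20, 10, 40]
z = sorted(y)        # built-in sorted() returns a new list

print(y)             # [30, 20, 10, 40]
print(z)             # [10, 20, 30, 40]
```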

**(a)** Loop over each item in the `tuple` named `foo` and, if the element is divisible by `3`, add it to the `list` named `bar`. `bar` should end up with the values `[33, 30, 9, 21, 15]`.

**Hint:** Use `%`

In [None]:
foo = 1, 71, 2, 33, 30, 9, 1, 10, 21, 13, 1, 11, 15

...

print('foo =', foo)
print('bar =', bar)

**(b)** Modify your program so that `bar` ends up in sorted order.

**Hint:** There is more than one way you could choose to do this.

You can read more about lists and list methods here:
    
https://docs.python.org/3/library/stdtypes.html?highlight=list#list

## Sets

A set is a collection which has no particular order. Everything is either inside or not inside a set.

For example, in this set, there are 6 elements:

In [None]:
x = {1, 5, 'a', 'hello', 3.14, (1, 2, 3)}

print('the elements are:')
for element in x:
    print(element)
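
Because an element is either in a set or not, writing the same value more than once makes no difference, and neither does the order in which we write the elements. The sketch below uses a separate, illustrative variable `numbers` so that `x` is left untouched.

```python
numbers = {1, 2, 2, 3, 3, 3}

print(len(numbers))             # 3
print(numbers == {1, 2, 3})     # True
print({1, 2, 3} == {3, 2, 1})   # True
```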

We can check if a `set` contains an element using the `in` keyword.

This will result in a `bool`, which is `True` if the element is in the set, `False` otherwise.

In [None]:
1 in x

In [None]:
2 in x

In [None]:
(1, 2, 3) in x

In [None]:
(3, 2, 1) in x

Notice that you can have a `tuple` as an element of a set.

But you cannot have a list as an element of a set. Elements of a set must be hashable, which rules out mutable built-in types such as `list`, `set` and `dict`!

In [None]:
{1, 2, 3, [4, 5, 6]}

In fact, we can't even test if a `list` is in a `set` using `in`.

In [None]:
[4, 5, 6] in x
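
If the data we want to look up happens to be in a list, one workaround is to convert it to a `tuple` first. The sketch below repeats the definition of `x` so that it can run on its own; the name `candidate` is illustrative.

```python
x = {1, 5, 'a', 'hello', 3.14, (1, 2, 3)}
candidate = [1, 2, 3]

try:
    print(candidate in x)       # a list cannot be looked up in a set
except TypeError as error:
    print('Lookup failed:', error)

print(tuple(candidate) in x)    # True
```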

A set cannot be indexed. We add items to a set with `.add()` and remove with `.remove()`.

In [None]:
x = {1, 2, 3, 4}
x.add(5)
x.remove(3)

x
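
Note that `.remove()` raises a `KeyError` if the element is not in the set. Sets also have a `.discard()` method, which removes the element if it is present and does nothing otherwise. A minimal sketch:

```python
x = {1, 2, 3, 4}

try:
    x.remove(99)    # 99 is not in the set
except KeyError:
    print('remove() failed: 99 is not in the set')

x.discard(99)       # does nothing and raises no error
x.discard(4)        # removes 4, just as remove() would

print(x == {1, 2, 3})    # True
```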

**(c)** Write a program which will loop over the elements of `all_data` and if the item is in the set `wanted`, add it to `result`. You should get `result = {1, 10}`

In [None]:
all_data = {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400}
wanted = {1, 7, 10}

...

print('result =', result)

We can actually accomplish the above with one of the set methods `.intersection()`.

In [None]:
all_data = {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400}
wanted = {1, 7, 10}

result = all_data.intersection(wanted)

print('result =', result)

or using `&` for shorthand.

In [None]:
all_data = {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400}
wanted = {1, 7, 10}

result = all_data & wanted

print('result =', result)
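
Intersection is one of several set operations. As a brief sketch using the same data, `-` gives the difference between two sets and `|` gives their union. The names `missing` and `combined` are illustrative.

```python
all_data = {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400}
wanted = {1, 7, 10}

missing = wanted - all_data     # elements of wanted that are not in all_data
combined = all_data | wanted    # every element that is in either set

print('missing =', missing)     # missing = {7}
print(7 in combined)            # True
print(len(combined))            # 13
```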

You can read more about sets and set methods here:
    
https://docs.python.org/3/library/stdtypes.html?highlight=set#set

## Dictionaries

A dictionary is a collection of unique keys in which each key has an associated value, like the names and phone numbers in a phone book.

To create a `dict`, use `{}`, and you can specify keys and values on construction.

In [None]:
x = {1 : 2, 3 : 4}
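
One consequence of `dict` and `set` sharing curly brackets is that an empty pair `{}` always creates an empty `dict`. To create an empty `set`, call `set()` instead. The variable names below are illustrative.

```python
empty_dict = {}
empty_set = set()

print(type(empty_dict))    # <class 'dict'>
print(type(empty_set))     # <class 'set'>
```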

We can loop over the keys of a `dict`,

In [None]:
for k in x:
    print('the dict has key', k)

or the values,

In [None]:
for v in x.values():
    print('the dict has value', v)

or (possibly the more common use case), both:

In [None]:
for k, v in x.items():
    print('the dict has key', k, 'with associated value', v)

Because `dict` (like `list` and `set`) is a mutable type, we can modify it.

Unlike `list` (and `tuple`) we don't access by numerical index `0`, `1`, `2`, `3`, `4`, `5`...

We specify the key in the index when reading a value.

In [None]:
scores = {'Alice' : 100, 'Bob' : 70, 'Charlie' : 80}

scores['Alice']

and the same way when writing a value.

In [None]:
scores['Alice'] = 120

scores

To add a new element, we just write to a currently unused key.

In [None]:
scores['Dan'] = 20

scores

To remove an item, we use the `del` keyword.

In [None]:
del scores['Bob']

scores
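
Reading a key that is not in the dictionary raises a `KeyError`. Two ways to avoid this are to test for the key with `in` first, or to use the `.get()` method, which returns `None` (or a default value we supply) when the key is missing. The sketch below redefines `scores` with the contents it has at this point so that it can run on its own.

```python
scores = {'Alice': 120, 'Charlie': 80, 'Dan': 20}

try:
    scores['Bob']    # Bob was deleted, so this key no longer exists
except KeyError:
    print('Bob has no score')

print('Bob' in scores)         # False
print(scores.get('Bob'))       # None
print(scores.get('Bob', 0))    # 0
print(scores.get('Alice'))     # 120
```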

**(d)** Use the following lists, and a loop to create a `dict` with the key `'Alice'` mapped to the value `555100`, etc.

In [None]:
keys = ['Alice', 'Bob', 'Charlie']
values = [555100, 555200, 555300]
result = {}

...

result

You can read more about dictionaries and dict methods here:

https://docs.python.org/3/library/stdtypes.html?highlight=dict#dict

## While Loops

So far we have seen one type of loop, a `for` loop.

A for loop is the preferred way to iterate over a collection of data or, as we will see, the lines in a text file.

There is another type of loop, called a `while` loop.

You can think of a while loop as being like an if statement, except that an if statement checks a Boolean condition once, whereas a while loop keeps checking it over and over again, running its body each time the condition is true, and only stops once the condition becomes false.

The example below rolls random numbers until it rolls a 6.

In [None]:
import random

number = random.randint(1, 6)
print(f'rolled {number}')

while number != 6:

    number = random.randint(1, 6)
    print(f'rolled {number}')

Notice that the condition is based on the variable `number`. We must make sure to modify this variable inside the body of the loop, otherwise the loop will never end and we will need to terminate the Kernel in Jupyter manually to stop it!
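
One way to guard against a loop that never ends is to count the iterations and stop after a fixed limit. The sketch below is a variation on the dice example. The names `rolls` and `max_rolls`, and the limit of 100, are assumptions chosen for illustration.

```python
import random

max_rolls = 100    # hypothetical safety limit
rolls = 1
number = random.randint(1, 6)

# Keep rolling until we get a 6 or we reach the limit
while number != 6 and rolls < max_rolls:
    number = random.randint(1, 6)
    rolls = rolls + 1

print(f'stopped after {rolls} rolls, last roll was {number}')
```

Both parts of the condition must be true for the loop to continue, so it ends as soon as a 6 is rolled or the limit is reached, whichever comes first.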

One common use for while loops is repeatedly prompting the user until they give a certain input.

For example, the box below will keep prompting you forever, until you type exactly the text `Hello` in the input box. Make sure to match the casing, uppercase `H`, lowercase `ello`!

In [None]:
text = ''
while text != 'Hello':
    text = input('Please input the word \'Hello\' here : ')
print('Thank you!')

We could be a bit looser with the specific input. For example, this code will stop when the box contains the word `Hello`, but it is case-insensitive and allows for other text around the word too. For example, `hELlo world` will also work, and so will `shellop`:

In [None]:
text = ''
while 'hello' not in text.lower():
    text = input('Please input the word \'Hello\' here : ')
print('Thank you!')
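
To see why those inputs are accepted without typing them in, the sketch below applies the same test to a few sample strings. The list `examples` is made up for illustration.

```python
examples = ['Hello', 'hELlo world', 'shellop', 'help']

for text in examples:
    # True when the lowercased text contains 'hello' anywhere inside it
    print(text, 'hello' in text.lower(), sep='\t')
```

Only the last string, `help`, fails the test, so it is the only one of the four that would make the loop prompt again.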

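A while loop can also be used to validate input. The cell below keeps prompting until the text you type can be converted to an `int`.

In [None]: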
number = None
while number is None:
    try:
        number = input('Type an integer : ')
        number = int(number)
    except ValueError:
        print(f'\'{number}\' does not look like an integer!!!\nTry again...')
        number = None
print(f'Thank you!\n{number} is definitely an integer.')

We can probably tidy up this code a little.

Previously you have seen extracting repeated code to a function. Even though there isn't really repeated code, it may be a good idea to introduce a function anyway.

Let's call it `read_integer`.

In [None]:
def read_integer():
    try:
        number = input('Type an integer : ')
        return int(number)
    except ValueError:
        print(f'\'{number}\' does not look like an integer!!!\nTry again...')

In [None]:
number = None
while number is None:
    number = read_integer()
print(f'Thank you!\n{number} is definitely an integer.')

Notice that in the refactored version the main program is reduced to only 4 lines, which is much easier to understand. Let's look at them one by one.

    number = None
    
This initialises the variable to a default value. `None` is a placeholder which means no value has been set. If we tried to run the code without this line, then the expression in the next line would fail because `number` is not defined.

    while number is None:

This means to repeat the following code block until the variable `number` is set to something other than `None`. It is conventional to use `is` instead of `==` when comparing something to `None`. Here it would still work if you replaced `is` with `==`. For all other comparisons, use `==`.

        number = read_integer()

This line calls our function, `read_integer`, which will either return an `int` or it will return `None`.

    print(f'Thank you!\n{number} is definitely an integer.')
    
The last line is not indented, so it is only executed once, after the while loop ends.
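
Returning to the `is` comparison in the walkthrough above, the sketch below illustrates how it differs from `==`. `==` asks whether two values are equal, whereas `is` asks whether two names refer to the very same object. The variable names are illustrative.

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)    # True: the two lists have the same contents
print(a is b)    # False: they are two separate list objects
print(a is c)    # True: c is another name for the same list

value = None
print(value is None)    # True
```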

**(e)** What would happen if we accidentally indented the last line by 4 spaces instead of 0?

Note about `return`:

Programmers familiar with other languages like Java or C++ might notice that we are not returning anything on the `except` path. We could write `return None` there, but actually Python does this automatically. In Python, if there is no `return` statement on one path of a function, `None` will automatically be returned!
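
A minimal sketch of this behaviour, using two hypothetical functions written only for this illustration:

```python
def no_return():
    pass    # this function has no return statement at all


def returns_on_one_path(x):
    if x > 0:
        return 'positive'
    # no return on the other path, so Python returns None


print(no_return())               # None
print(returns_on_one_path(5))    # positive
print(returns_on_one_path(-5))   # None
```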

## More Collection Practice

**(f)** Write a function, `highest_min` which accepts two lists, and returns the one whose minimum element is the largest. **Your function should be a pure function.**

For example, `highest_min([3, 10, 1], [2, 5, 9])` will return `[2, 5, 9]` because the minimum element of `[3, 10, 1]` is `1` and the minimum element of `[2, 5, 9]` is `2`.

In [None]:
def highest_min(a, b):
    ...

In [None]:
highest_min([3, 10, 1], [2, 5, 9])    # should return [2, 5, 9]

In [None]:
x = [50, 300, 10, 20]
y = [15, 30]

print(highest_min(x, y))    # should print [15, 30]
print()
print(x)    # should print [50, 300, 10, 20]
print(y)    # should print [15, 30]

**(g)** Write a function `counts` which loops over the input `data` and creates a `dict` indicating how many times each element appears (see the example test code).

In [None]:
def counts(data):
    ...

counts(['a', 4, 10, 4, 'b', 'a'])    # returns {'a': 2, 4: 2, 10: 1, 'b': 1}

In the above question we implemented this counting function manually; however, the `collections` module in the standard library provides a type called `Counter` which will do this for us.

In [None]:
from collections import Counter

Counter([1, 3, 3, 7])

The returned type is a `Counter`, but if we want to convert this back to a standard `dict`, we can pass it to `dict()`.

In [None]:
dict(Counter([1, 3, 3, 7]))
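
A `Counter` can also be used directly, without converting it. As a brief sketch, we can look up a count by element, and the `.most_common()` method lists elements starting with the most frequent. The variable name `c` is illustrative.

```python
from collections import Counter

c = Counter([1, 3, 3, 7])

print(c[3])                # 2
print(c[100])              # 0, because missing elements count as zero
print(c.most_common(1))    # [(3, 2)]
```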

## Processing Files

In the previous lab we looked at the example file `lab_text.txt`:

    apples:£0.62
    bananas:£0.13
    oranges:£0.30
    
and we learned how to iterate the file to identify a single line of interest.

But now, we know how to add things to collections.

**(h)** Write a program to read the file `lab_text.txt` and load the prices into a `dict` named `prices`. You should get `{'apples' : 0.62, 'bananas' : 0.13, 'oranges' : 0.30}`, where each value is a `float`.

In [None]:
...

In Unit 2.3, we will see that well-structured datasets like these can be read and processed more conveniently with libraries such as 'Pandas', for manipulating structured data.

But we can also process data which is a little messier, since we have complete control over what the program is doing.

Processing messier datasets is very common in activities like 'Web scraping', which is when we use a program to download information from a website (which has been generated to be human-readable, not necessarily designed for machine processing).

We have a modified version of the text file, called `lab_text_2.txt`, which has a few extra lines of text added:

    Today's fruit prices are
    
    apples:£0.62
    bananas:£0.13
    oranges:£0.30

**(i)** Modify your program to ignore any lines that don't contain the separator character `:`, and read in `lab_text_2.txt`.

In [None]:
...

Here is another variant of the file:
    
    
    Today's fruit prices are
    -------------------
    | apples   £ 0.62 |
    | bananas  £ 0.13 |
    | oranges  £ 0.30 |
    -------------------

You will find this data in `lab_text_3.txt`.
    
**(j)** Modify your program again to take account of the new format.

In [None]:
...

When processing messier semi-structured datasets, it is usually necessary to look at the raw data file manually and make decisions about how to write the code to work for this file. Unfortunately this can be time-consuming for particularly messy datasets, and sometimes if the file is updated with new data, the program you wrote stops working as the format is altered slightly, breaking the assumptions we made.

For this reason it is preferred that we work with structured data, like `.csv` files. This is what we will be using in Unit 2.3.