**Introductory and intermediate computing for Data Science [Barcelona School of Economics]**

`Instructor:` Maxim Fedotov  
`Program:` M.Sc. in Data Science Methodology

# Class 3

## Sequence types: lists, tuples, ranges
Let's dive into several important built-in sequence types: `list` and `tuple` (touching on `range` a bit). The first two define ordered collections of arbitrary objects, and the last one defines an immutable sequence of integers between two bounds. Their *displays* work as follows:

In [1]:
supercomps = (
    "Frontier",
    "Aurora", 
    "Leonardo", 
    "Perlmutter"
)
nodes_servers = [10624, 3456, 1536]


In [2]:
range(len(supercomps))

range(0, 4)

The cells above show expressions that create objects of types `tuple`, `list`, and `range`.

How would you define a tuple with a single element? Try it out here:

In [3]:
# create and print (or just call) a variable which is supposed to be a tuple of one element

As you have seen already, some arithmetic operators can be used with lists and tuples: `+` concatenates two sequences and `*` repeats a sequence.

In [4]:
nodes_servers = [9856] + nodes_servers
print(f"Supercomputer {supercomps[0]} has {nodes_servers[0]} nodes.")

Supercomputer Frontier has 9856 nodes.


As you can see, we can access a value by specifying its index in square brackets, e.g. `nodes_servers[0]`; indices start from 0 (by convention, there is no space between the name of a variable and the left square bracket).

NOTE: A *negative* integer would also work as an index – its absolute value specifies a position of an element with respect to the end of the list. 
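
As a small illustration of the note above, the sketch below uses a made-up list `letters` (it is not part of the lecture data) to show how negative indices count from the end of a sequence.

```python
# hypothetical example data, chosen only for illustration
letters = ["a", "b", "c", "d"]

last = letters[-1]            # the last element
second_to_last = letters[-2]  # the element before the last one

# letters[-k] refers to the same element as letters[len(letters) - k]
same_element = letters[-1] == letters[len(letters) - 1]

print(last, second_to_last, same_element)
```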

We can access specific elements of these data structures using *slicing*. The interface for slicing uses the same square brackets as for selecting one element, but with specific contents inside: `[start:end:step]`. Note that it is not necessary to specify all of them.

In [5]:
supercomps = supercomps[::-1]
print(
    "These are the names of the super computers in reverse order:", 
    ", ".join(supercomps)
)

print("You can also take a slice of the first two elements of a range(0, 10):", range(10)[:2]) 
# note that you get a range when slicing a range

These are the names of the super computers in reverse order: Perlmutter, Leonardo, Aurora, Frontier
You can also take a slice of the first two elements of a range(0, 10): range(0, 2)
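
To make the roles of `start`, `end` and `step` more concrete, here is a short sketch on a hypothetical list `digits` (an assumption made for illustration). Note that the element at index `end` is not included in the slice.

```python
digits = list(range(10))  # [0, 1, ..., 9], illustrative data

first_three = digits[:3]      # start omitted: from the beginning
from_seven = digits[7:]       # end omitted: until the end
every_second = digits[1:8:2]  # indices 1, 3, 5, 7
last_two = digits[-2:]        # negative indices work in slices as well

print(first_three, from_seven, every_second, last_two)
```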


We can change values of lists by using an assignment statement of the following form:  
`list_identifier[index | slice] = new_value`

Note that if you use an integer index that is out of range, you get an `IndexError`.
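
A minimal sketch of both forms of assignment, using an invented list `scores` rather than the lecture data, could look as follows. The `try` / `except` construction is used here only to show the error without stopping the execution.

```python
scores = [10, 20, 30, 40]  # hypothetical data for illustration

scores[0] = 11          # replace a single element
scores[1:3] = [21, 31]  # replace a slice with another sequence

try:
    scores[10] = 0      # there is no element at index 10
except IndexError as error:
    message = str(error)

print(scores, "|", message)
```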

The main difference between a list and a tuple is that the former is *mutable* and the latter is *immutable*. It means that we can freely change values of elements in a list, but not in a tuple.

In [6]:
# try to change the second to last element of nodes_servers to 100


# now try to change any of the entries of supercomps


Note that Python treats several expressions separated by commas as a tuple. This allows us to use an elegant expression when we work with several variables at the same time. For example, the basic computer science problem of swapping the values of two variables can be solved simply like this: 

In [7]:
value_1 = 1
value_2 = 2

value_1, value_2 = value_2, value_1

print(value_1, value_2)

2 1


In addition, tuples support *destructuring* (also known as *unpacking*). This is what you typically use when you need to retrieve values from a tuple to define new variables or reassign values of existing ones.

Suppose that the following user information comes to you in a tuple: username, user id, balance. You can use tuple destructuring to retrieve the elements separately.

In [8]:
user_info = ('mfedotov', '123', 50.1)
user_name, user_id, user_balance = user_info
print(f"User {user_name} with id {user_id} has {user_balance} on their balance.")

# note that you can use a "plug" like this if you do not need some of the elements

_, user_id, user_balance = user_info
print(f"User {_} with id {user_id} has {user_balance} on their balance.")

# note that this "plug" is still a variable, "_" is also a valid identifier of a variable.

User mfedotov with id 123 has 50.1 on their balance.
User mfedotov with id 123 has 50.1 on their balance.
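
One detail worth keeping in mind is that the number of names on the left-hand side has to match the number of elements in the tuple. The sketch below, based on an invented tuple `record`, shows what happens otherwise.

```python
record = ("foo", "001", 10.0)  # hypothetical (username, user id, balance)

name, user_id, balance = record  # three names for three elements: works

try:
    name, user_id = record       # two names for three elements: fails
except ValueError as error:
    message = str(error)

print(name, user_id, balance, "|", message)
```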


Note that `list`, `tuple` and `range` are data types, data structures, and classes in Python. So, they have some *methods* that are associated with them.

For example, `list` implements the methods `append` and `pop`, which add an element at the end of a list and remove (and return) the element at a specified index, respectively.

In [9]:
requests = [
    "DELETE https://api.example.com/posts/99",
    "PATCH https://api.example.com/users/42/settings",
    "GET https://news.example.com/articles?page=2",
    "POST https://auth.example.com/login",
    "GET https://cdn.example.com/images/logo.png",
    "HEAD https://docs.example.com/manual.pdf",
]
requests.append("OPTIONS https://api.example.com/comments")  # enqueuing
requests.pop(0)  # dequeuing
requests

['PATCH https://api.example.com/users/42/settings',
 'GET https://news.example.com/articles?page=2',
 'POST https://auth.example.com/login',
 'GET https://cdn.example.com/images/logo.png',
 'HEAD https://docs.example.com/manual.pdf',
 'OPTIONS https://api.example.com/comments']

These methods, among other things, can be used to implement data structures like Queue and Stack in Python. (*comment: to fully implement such a structure, you would write a specific class with associated methods, stay tuned for the lecture on OOP*)
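
As a minimal sketch of this idea (not a full implementation), a plain list can act as a stack via `append` and `pop()`. For a queue, `pop(0)` works but has to shift all remaining elements, so `collections.deque` from the standard library with its `popleft()` method is a common alternative. The task names below are invented for illustration.

```python
from collections import deque

# Stack: last in, first out
stack = []
stack.append("task_1")
stack.append("task_2")
last_added = stack.pop()       # removes and returns the most recently added element

# Queue: first in, first out
queue = deque(["task_1", "task_2"])
queue.append("task_3")         # enqueue at the right end
first_added = queue.popleft()  # dequeue from the left end

print(last_added, first_added, stack, list(queue))
```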

## For loops

To really make use of the sequence data structures, we have *loops* at our disposal. There are two types of loops in Python: `for` and `while`. We start with the former one as it is used in list comprehensions (a critically useful tool for data processing).

Below you can find an example of a simple for loop:

In [10]:
gflops_servers = []
cores_node = [64, 104, 32, 64]
clock_frequencies = [2, 1.9, 2.6, 2.45]
operations_cycle = [16, 32, 32, 16]

n_supercomps = len(supercomps)

for i in range(n_supercomps):
    gflops = nodes_servers[i] * cores_node[i] * operations_cycle[i] * clock_frequencies[i]
    gflops_servers.append(gflops)
    print(f"Server {supercomps[i]} supports {gflops / 10**6:.3f} PFLOPs.")

Server Perlmutter supports 20.185 PFLOPs.
Server Leonardo supports 67.178 PFLOPs.
Server Aurora supports 9.201 PFLOPs.
Server Frontier supports 3.854 PFLOPs.


The basic contents of a for loop are:
* A keyword `for`
* An arbitrary identifier for the loop variable, which takes the value of the current element at each iteration.
* A keyword `in`, which indicates that at each iteration we take one element of the iterable object specified right after the keyword.
* An expression that evaluates to an iterable object which we want to loop through (here it is `range(n_supercomps)`), followed by `:`.
* The body of the loop. Do not forget about correct indentation.

You can also iterate over elements of a sequence directly, not just over indices.

In [11]:
for supercomp in supercomps:
    print(supercomp, end=' ')

Perlmutter Leonardo Aurora Frontier 
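
When both the position and the element are needed, or when several sequences should be traversed in parallel, the built-in functions `enumerate` and `zip` help to avoid manual indexing. The sketch below uses invented names and values, not the lecture data.

```python
machines = ("alpha", "beta", "gamma")  # hypothetical names
n_nodes = [100, 250, 50]               # hypothetical node counts

# enumerate yields (position, element) pairs
labels = []
for position, machine in enumerate(machines):
    labels.append(f"{position}: {machine}")

# zip yields tuples of elements taken from each sequence at the same position
total_nodes = 0
for machine, nodes in zip(machines, n_nodes):
    total_nodes += nodes
    print(f"{machine} has {nodes} nodes")
```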

### Break and continue

There are two keywords that will help you to work with for loops: `break` and `continue`.

* `break` stops the loop immediately at the point where it is reached.
* `continue` stops the current iteration without executing the rest of the loop body and moves on to the next iteration.
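
A toy sketch of both keywords, on an invented list `readings` in which a zero is assumed to mark the end of the valid data:

```python
readings = [3, -1, 4, 0, 5, 9]  # hypothetical data; 0 marks the end of valid readings

positives = []
for value in readings:
    if value < 0:
        continue  # skip negative readings and go to the next element
    if value == 0:
        break     # leave the loop entirely; 5 and 9 are never visited
    positives.append(value)

print(positives)
```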

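Suppose that you want to accumulate computing power from the servers, starting from the least powerful one, until the total exceeds the required 90 PFLOPs (`90e6` GFLOPs). In addition, you do not want to use servers that provide less than 10 PFLOPs (`10e6` GFLOPs). Then you would use the following construction: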

In [12]:
gflops_required = 90e6
# typically, for this kind of problem you would ensure that the list is sorted

specs = [
    (name, gflops) 
    for name, gflops 
    in sorted(zip(supercomps, gflops_servers), key=lambda x: x[1])
]
 
gflops_acc = 0
servers_op = []
                                                  
for j in range(n_supercomps):
    if specs[j][1] < 10e6:
        continue
    gflops_acc += specs[j][1]
    servers_op.append(specs[j][0])
    if gflops_acc > gflops_required:
        break

In [13]:
servers_op

['Perlmutter', 'Leonardo']

## While loops

`while` loops are another way to implement repeated actions. They are mostly used when the number of iterations is not known in advance and the loop should run until some stopping criterion is satisfied.

Let's try to implement a bisection root-finding algorithm using a `while` loop.

In [14]:
from typing import Callable


def root_finding_bisection(
    f: Callable[[float], float],
    left_endpoint: float,
    right_endpoint: float,
    maxit: int = 1000
) -> float:
    if left_endpoint > right_endpoint:
        raise ValueError('The left endpoint must be less than the right endpoint.')
    f_left = f(left_endpoint)
    f_right = f(right_endpoint)
    if f_left * f_right >= 0:
        raise ValueError('The function values evaluated at the left and right endpoints '
                         'must be of different signs.')
    iteration = 0
    # stop after maxit iterations or as soon as the midpoint is an exact root
    while iteration < maxit and (iteration == 0 or f_middle != 0):
        middle = (left_endpoint + right_endpoint) / 2
        f_middle = f(middle)
        if f_left * f_middle > 0:
            # same sign as at the left endpoint: the root is in the right half
            left_endpoint = middle
            f_left = f_middle
        else:
            # sign change (or exact root): the root is in the left half
            right_endpoint = middle
            f_right = f_middle
        iteration += 1
    return middle


def f(x: float) -> float:
    return (x - 5) * (x + 3)

root_finding_bisection(f, -1, 10)

5.0

Note that it is still recommended to prefer `for` loops over `while` loops when possible, since a `for` loop over a finite sequence is guaranteed to terminate.
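
In floating-point arithmetic an exact zero is rarely hit, so a stopping criterion is often expressed through a tolerance instead. The sketch below is a simplified variant of the idea above, not a replacement for it. The function name `bisect_with_tolerance` and the parameter `tol` are assumptions introduced for illustration, and the code assumes that `left` is smaller than `right`.

```python
from typing import Callable


def bisect_with_tolerance(
    f: Callable[[float], float],
    left: float,
    right: float,
    tol: float = 1e-8,
) -> float:
    """Bisection that stops once the bracketing interval is shorter than tol."""
    if f(left) * f(right) >= 0:
        raise ValueError("f(left) and f(right) must have different signs.")
    while right - left > tol:
        middle = (left + right) / 2
        if f(left) * f(middle) > 0:
            left = middle   # the sign change is in the right half
        else:
            right = middle  # the sign change is in the left half (or middle is a root)
    return (left + right) / 2


# approximate the positive root of x**2 - 2, i.e. the square root of 2
root = bisect_with_tolerance(lambda x: x**2 - 2, 0, 2)
print(root)
```

Here the number of iterations is not fixed in advance: it depends on the length of the initial interval and on `tol`, which is the kind of situation where a `while` loop is a natural choice.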

## List comprehensions

There is also the concept of a list comprehension, which embeds `for` loop functionality concisely in a list display. Of course, `for` loops and list comprehensions do not serve the same purposes. However, a loop that builds a list element by element, like the `gflops_servers` example above, can be written as a list comprehension.

### Mapping

In [15]:
pflops_servers = [gflops / 1e6 for gflops in gflops_servers]
pflops_servers

[20.185088, 67.1776768, 9.2012544, 3.8535168000000004]

This list comprehension implements *mapping*, i.e. we apply a specific action to each element of the list.

We could have also used a function within the display of the list comprehension, e.g.

    new_list = [f(val, *args, **kwargs) for val in sequence]

where `*args` and `**kwargs` denote a tuple and a dictionary of potential additional arguments respectively.
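
A small sketch of this pattern is given below. The helper `scale` and all the values are hypothetical and serve only to show how `*args` and `**kwargs` are passed on to the function.

```python
def scale(value, factor=1.0, *, digits=2):
    """Hypothetical helper: multiply value by factor and round the result."""
    return round(value * factor, digits)


values = [1.234, 5.678, 9.1011]  # illustrative data
args = (10,)                     # extra positional arguments (here: factor)
kwargs = {"digits": 1}           # extra keyword arguments

scaled = [scale(val, *args, **kwargs) for val in values]
print(scaled)
```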

### Filtering

There is also a concept of *filtering* which can be implemented with a list comprehension.

In [16]:
gflops_gt10e6 = [gflops for gflops in gflops_servers if gflops > 10e6]
gflops_gt10e6

[20185088, 67177676.8]

We can also combine filtering with mapping by using a conditional expression (`... if ... else ...`) within a list comprehension. Suppose that we want to convert GFLOPs to PFLOPs and "mask" the values that are not greater than 10 PFLOPs.

In [17]:
pflops_gt10e6 = [gflops / 1e6 if gflops > 10e6 else None for gflops in gflops_servers]
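
The position of `if` changes the meaning of a comprehension, which the following sketch on an invented list `values` tries to make explicit.

```python
values = [5, 12, 7, 20]  # hypothetical data

# filtering: the `if` at the end drops elements, so the result may be shorter
kept = [v for v in values if v > 10]

# masking: the conditional expression `a if condition else b` is evaluated
# for every element, so the result has the same length as the input
masked = [v if v > 10 else None for v in values]

print(kept, masked)
```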

### Reducing

Another helpful concept is *reducing*. That is, we can retrieve some useful information (e.g. some statistic) from a list, i.e. we reduce it to one particular number.

The classic examples are maximum and minimum values, average and median, counts of various events, and other statistics.

In [18]:
max(gflops_servers), min(gflops_servers)

(67177676.8, 3853516.8000000003)

In [19]:
sum(gflops for gflops in gflops_servers if gflops > 10e6) / sum(gflops_servers)

0.869995105237396
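
For the average and the median mentioned above, the standard library module `statistics` provides ready-made functions, and a count of events can be obtained by summing over a filtered generator. The data in the sketch below are invented.

```python
from statistics import mean, median

durations = [12, 7, 3, 18, 7]  # hypothetical measurements

average = mean(durations)                     # arithmetic mean
middle_value = median(durations)              # middle element of the sorted data
n_long = sum(1 for d in durations if d > 10)  # count of values above 10

print(average, middle_value, n_long)
```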

## Dictionaries

Dictionaries can be considered as collections of *key* / *value* pairs. Note that only a *hashable* object can be a key. For example, lists are "unhashable" and therefore cannot be used as keys.
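
A short sketch of what hashability means in practice, with invented data: a tuple of hashable elements is accepted as a key, while a list is rejected with a `TypeError`.

```python
# tuples of hashable elements are hashable, so they can serve as keys
grid = {(0, 0): "origin", (1, 0): "east"}  # hypothetical data
origin_label = grid[(0, 0)]

# lists are mutable and unhashable, so using one as a key fails
try:
    broken = {[0, 0]: "origin"}
except TypeError as error:
    message = str(error)

print(origin_label, "|", message)
```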

In [20]:
user = {'name': 'Foo', 'score': 55}

print("The keys in the dictionary can be accessed through `dict.keys(...)` method:", user.keys())
print("The values in the dictionary can be accessed through `dict.values(...)` method:", user.values(), "\n")

# You can also access the key-value pairs through the `dict.items(...)` method
for key, value in user.items():
    print("The %s of the user is %s." % (key, value))

The keys in the dictionary can be accessed through `dict.keys(...)` method: dict_keys(['name', 'score'])
The values in the dictionary can be accessed through `dict.values(...)` method: dict_values(['Foo', 55]) 

The name of the user is Foo.
The score of the user is 55.


Notice how we use tuple destructuring in the `for` loop. The loop iterates over a view of key-value pairs given by the dictionary method `dict.items()`. Thus, destructuring allows us to break down each key-value pair into two distinct variables.

You can access a value by its key in the following way:

In [21]:
print("Specify the key in the square brackets:", user['name'])
print("You can also use the `get` method:", user.get('name'))

Specify the key in the square brackets: Foo
You can also use the `get` method: Foo
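
The two ways of access differ when the key is missing: square brackets raise a `KeyError`, while `get` returns `None` or a default value passed as the second argument. A minimal sketch with an invented dictionary `profile`:

```python
profile = {"name": "Foo", "score": 55}  # hypothetical data

missing = profile.get("age")      # no such key: returns None
fallback = profile.get("age", 0)  # no such key: returns the given default

try:
    profile["age"]                # square brackets raise an error instead
except KeyError as error:
    missing_key = error.args[0]

print(missing, fallback, missing_key)
```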


You can also change a value or create a new entry of a dictionary.

In [22]:
user['name'] = 'Gru'
user['time_spent'] = 20
user

{'name': 'Gru', 'score': 55, 'time_spent': 20}

It is also possible to delete a key and the associated value from a dictionary by using the `del` keyword.

In [23]:
del user['score']

A dictionary can be cleared entirely with a `dict.clear()` method.

In [24]:
user.clear()

You can feel the importance of dictionaries when working with data. For example, some data can be specified as a list of dictionaries like this:

In [25]:
users = [{'name': 'Foo', 'score': 55}, {'name': 'Lu', 'score': 56}]

Data can also come in a nested dictionary format. You typically encounter this type of data when parsing websites.
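
As a sketch of what working with such data may look like, the record below is invented and only resembles the shape of parsed JSON. Values are reached by chaining keys (and indices) level by level.

```python
# hypothetical nested record, similar in shape to parsed JSON
page = {
    "title": "Home",
    "author": {"name": "Foo", "contacts": {"email": "foo@example.com"}},
    "tags": ["news", "data"],
}

email = page["author"]["contacts"]["email"]  # chain keys level by level
first_tag = page["tags"][0]                  # mix keys and indices

# `get` with a default avoids a KeyError when a key is absent
phone = page["author"].get("contacts", {}).get("phone")

print(email, first_tag, phone)
```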

## Set types: sets, frozensets

There are also set types at your disposal. They are efficient data structures that store unordered collections of unique elements. A value can be checked for membership in a set of prespecified values. Also, sets implement set operations (like intersection, union, set difference and so on).

In [26]:
filtering_parameters = {"id", "account", "balance"}

user_data = {"id": 1, "name": "foo", "account": 1337, "date_open": "21/12/21", "balance": 550}
user_data_filtered = {}  # Note that this line creates an empty DICTIONARY, not a set. 
                         # To create an empty set use `set()`

for parameter in user_data:  # note that by default such a loop iterates over the dictionary keys.
    if parameter in filtering_parameters:
        user_data_filtered[parameter] = user_data.get(parameter)
        
user_data_filtered

{'id': 1, 'account': 1337, 'balance': 550}

The logical expression `parameter in filtering_parameters` represents a containment test. It can be done in the same way with other collections, such as lists, tuples, ranges, and even strings.
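
A few containment tests on invented values are sketched below. Note that for strings the operator `in` checks for a substring rather than for a single element.

```python
in_list = 3 in [1, 2, 3]
in_tuple = "x" in ("a", "b")
in_range = 10 in range(0, 20, 5)     # range(0, 20, 5) contains 0, 5, 10, 15
in_string = "sci" in "data science"  # for strings, `in` tests for a substring
not_in_set = 4 not in {1, 2, 3}      # `not in` negates the test

print(in_list, in_tuple, in_range, in_string, not_in_set)
```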

Note that sets contain only unique values, i.e. if you convert your `list` using the `set(...)` function, all duplicates are removed in the resulting set.

In [27]:
# Suppose that we have some data on occupations of individuals
data = [
    {"name": "Foo", "occupation": "data analyst"}, 
    {"name": "Lu", "occupation": "software engineer"}, 
    {"name": "Bro", "occupation": "data analyst"},
    {"name": "Gru", "occupation": "electrical engineer"}
]

n_individuals = len(data)
occupations = [None] * len(data)

for i in range(n_individuals):
    occupations[i] = data[i]["occupation"]
    
unique_occupations = set(occupations)

print("Unique occupations are:", ', '.join(sorted(unique_occupations)))  # sets are unordered, so we sort for a stable output
print("Number of unique occupations in the data is:", len(unique_occupations))

Unique occupations are: data analyst, electrical engineer, software engineer
Number of unique occupations in the data is: 3


Examples of set operations are:

In [28]:
users_purchasing = ["Foo", "Lu", "Gru"]  # suppose that these are users who actively make in-app payments 
users_active = ["Bro", "Lu", "Gru"]  # these are the most active users

print("Difference:", set(users_purchasing) - set(users_active))  # note that order matters
print("Intersection:", set(users_purchasing) & set(users_active))
print("Union:", set(users_purchasing) | set(users_active)) 
print("Symmetric difference:", set(users_purchasing) ^ set(users_active)) 

Difference: {'Foo'}
Intersection: {'Lu', 'Gru'}
Union: {'Lu', 'Gru', 'Foo', 'Bro'}
Symmetric difference: {'Foo', 'Bro'}


Note that you can modify a set.

In [29]:
names = {'Gru', 'Foo', 'Lu'}
names.add('Bro')
names

{'Bro', 'Foo', 'Gru', 'Lu'}

Because sets are mutable, they are unhashable in Python: calling the built-in function `hash(...)` on a set raises a `TypeError`. You can use a `frozenset`, which is an *immutable* and *hashable* version of a set.

In [30]:
names = frozenset(['Gru', 'Foo', 'Lu'])
names

frozenset({'Foo', 'Gru', 'Lu'})
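
One practical consequence of hashability is that a `frozenset` can be used as a dictionary key (or as an element of another set), while a regular set cannot. The sketch below uses an invented dictionary `team_scores` to illustrate this.

```python
try:
    hash({"Gru", "Foo"})  # a regular set is mutable, hence unhashable
except TypeError as error:
    message = str(error)

# a frozenset can be hashed, so it can be a dictionary key or a set element
team_scores = {frozenset({"Gru", "Foo"}): 10}   # hypothetical data
score = team_scores[frozenset({"Foo", "Gru"})]  # element order is irrelevant

print(message, "|", score)
```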