# Data structures

In an analysis pipeline, variables that hold a single number or a piece of text are not enough. We need "containers" that can hold more complex data: for example n-dimensional matrices to hold images, or tables to hold analysis outputs (e.g. the size and intensity of detected objects).

Some of the structures we will use come directly with Python (e.g. lists), while others are provided by external packages, like Numpy arrays for images and Pandas dataframes for tables. We don't cover all the details of these structures now but will explore them in more detail in later notebooks. In particular, we will come back to dataframes later on.

## Lists and arrays

There are several different data structures that are available out of the box in Python. We have already seen lists. Those can contain any type of data, including other lists:

In [1]:
mylist = ['a', 3, [0,1,3]]

Once defined you can access and change elements by using their index **which starts at 0**:

In [2]:
mylist[0]

'a'

In [3]:
mylist[2]

[0, 1, 3]

In [4]:
mylist[0] = 'b'

In [5]:
mylist

['b', 3, [0, 1, 3]]

We can also add elements to a list by *appending* them and remove them by *popping* them:

In [6]:
mylist.append('new element')

In [7]:
mylist

['b', 3, [0, 1, 3], 'new element']

In [8]:
mylist.pop(1)

mylist

['b', [0, 1, 3], 'new element']
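As a small sketch of the two methods above: `pop` also returns the element it removes, so that element can be stored in a variable. The names `sizes` and `removed` and the values are made up for this illustration.

```python
# Hypothetical list of object sizes, used only for illustration
sizes = [12, 7, 30]

# append adds an element at the end of the list
sizes.append(5)

# pop removes the element at the given index and returns it
removed = sizes.pop(1)

print(sizes)    # [12, 30, 5]
print(removed)  # 7
```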

Lists are typically obtained as the output of a function, for example one returning the sizes of the objects in an image.

If we think of an image as multiple rows and columns of pixels, we can imagine representing it as a list of lists, each inner list being e.g. one row of pixels. For example a 3 x 3 image could be:

In [9]:
my_image = [[4,8,7], [6,4,3], [5,3,7]]
my_image

[[4, 8, 7], [6, 4, 3], [5, 3, 7]]

While in principle we could use a ```list``` for this, computations on such objects would be very slow. For example, if we wanted to do background correction and subtract a given value from our image, we would have to go through each element of our list (each pixel) one by one and subtract the background from it. If the background is 3, we would therefore have to compute:
- 4-3
- 8-3
- 7-3
- 6-3

etc. Since these operations are carried out one at a time by the Python interpreter, this would be very slow: we could not exploit fast, optimized routines or the fact that most computers have multiple processors. It would also be tedious to write such an operation.
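To make the pixel-by-pixel subtraction concrete, here is a minimal sketch using plain Python loops on the 3 x 3 example image. The names `background`, `corrected` and `new_row` are chosen for this illustration only.

```python
# The 3 x 3 example image, stored as a list of lists
my_image = [[4, 8, 7], [6, 4, 3], [5, 3, 7]]
background = 3

# Visit every row, and every pixel in each row, one at a time
corrected = []
for row in my_image:
    new_row = []
    for pixel in row:
        new_row.append(pixel - background)
    corrected.append(new_row)

print(corrected)  # [[1, 5, 4], [3, 1, 0], [2, 0, 4]]
```

Even for this tiny image, two nested loops are needed, which illustrates why this approach becomes tedious for real images.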

To fix this, most scientific areas that use lists of numbers of some kind (time-series, images, measurements etc.) resort to an **external package** called ```Numpy``` which offers a **computationally efficient list** called an **array**.
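As a first taste, the sketch below turns the 3 x 3 list of lists into an array and subtracts a background of 3 from all pixels in a single operation. It assumes Numpy is installed and imported under its usual short name `np`; the names `my_array` and `corrected` are illustrative.

```python
import numpy as np

# The same 3 x 3 example image as a list of lists
my_image = [[4, 8, 7], [6, 4, 3], [5, 3, 7]]

# Convert the list of lists into a Numpy array
my_array = np.array(my_image)

# The subtraction is applied to every pixel at once, without writing a loop
corrected = my_array - 3
print(corrected)
```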

## Numpy arrays

To make this clearer, we now import an image into our notebook to see such a structure. We will use a **function** from the scikit-image package to do this import. That function, called ```imread```, is located in the submodule called ```io```. Remember that we can then access this function with ```skimage.io.imread()``` (more on imports in a later chapter). Just like the function $f(x, a, b)$ that we previously defined took inputs $x, a, b$, the ```imread()``` function also needs an input. Here it is just the **location of the image**, which can be either the **path** to a file on our computer or the **url** of an online place where the image is stored. We use an image that can be found at https://github.com/guiwitz/PyImageCourse_beginner/raw/master/images/19838_1252_F8_1.tif. As you can see, it is a tif file. The address that we use as input should be formatted as text:

In [10]:
my_address = 'https://github.com/guiwitz/PyImageCourse_beginner/raw/master/images/19838_1252_F8_1.tif'

Now we can call our function:

In [11]:
import skimage.io

In [12]:
myimage = skimage.io.imread(my_address)
myimage

array([[[42, 48,  0],
        [45, 41,  0],
        [47, 21,  0],
        ...,
        [78, 16,  1],
        [57, 14,  0],
        [53,  7,  0]],

       [[42, 57,  0],
        [37, 40,  0],
        [38, 30,  0],
        ...,
        [97,  7,  0],
        [67, 12,  0],
        [57,  9,  1]],

       [[42, 55,  0],
        [44, 40,  0],
        [31, 29,  0],
        ...,
        [79,  0,  0],
        [67,  1,  0],
        [61,  1,  0]],

       ...,

       [[ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0],
        ...,
        [65, 37,  0],
        [54, 37,  0],
        [47, 49,  0]],

       [[ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0],
        ...,
        [75, 41,  0],
        [59, 44,  0],
        [54, 74,  0]],

       [[ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0],
        ...,
        [82, 51,  0],
        [62, 48,  0],
        [57, 69,  0]]], dtype=uint8)

The output shown here is what our function returns. As expected, it is a list of numbers, and not all of them are shown because the list is too long. We also see ```[]``` brackets that delimit rows, columns etc. The main difference compared to the list of lists that we defined previously is the ```array``` indication at the very beginning of the list of numbers. It tells us that we are dealing with a ```Numpy``` array, the alternative type of list of lists that allows us to do efficient computations. As a quick example, if we want to remove a background value of 10, we can compute:

In [13]:
myimage - 10.

array([[[ 32.,  38., -10.],
        [ 35.,  31., -10.],
        [ 37.,  11., -10.],
        ...,
        [ 68.,   6.,  -9.],
        [ 47.,   4., -10.],
        [ 43.,  -3., -10.]],

       [[ 32.,  47., -10.],
        [ 27.,  30., -10.],
        [ 28.,  20., -10.],
        ...,
        [ 87.,  -3., -10.],
        [ 57.,   2., -10.],
        [ 47.,  -1.,  -9.]],

       [[ 32.,  45., -10.],
        [ 34.,  30., -10.],
        [ 21.,  19., -10.],
        ...,
        [ 69., -10., -10.],
        [ 57.,  -9., -10.],
        [ 51.,  -9., -10.]],

       ...,

       [[-10., -10., -10.],
        [-10., -10., -10.],
        [-10., -10., -10.],
        ...,
        [ 55.,  27., -10.],
        [ 44.,  27., -10.],
        [ 37.,  39., -10.]],

       [[-10., -10., -10.],
        [-10., -10., -10.],
        [-10., -10., -10.],
        ...,
        [ 65.,  31., -10.],
        [ 49.,  34., -10.],
        [ 44.,  64., -10.]],

       [[-10., -10., -10.],
        [-10., -10., -10.],
        [-10., -10., -10.],
        ...,
        [ 72.,  41., -10.],
        [ 52.,  38., -10.],
        [ 47.,  59., -10.]]])

We will learn much more about performing computations with these image arrays in later chapters.
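One detail of the subtraction above deserves a short sketch: the background is written as `10.` (a decimal number) rather than `10`. The image is stored as `uint8`, an integer type that can only hold values from 0 to 255, so subtracting an integer can wrap around instead of giving a negative value. The small array `pixels` below is made up for illustration and only mimics a few pixel values; the names `as_float` and `as_uint8` are likewise illustrative.

```python
import numpy as np

# Hypothetical pixel values stored as unsigned 8-bit integers, like the image above
pixels = np.array([42, 48, 0], dtype=np.uint8)

# Subtracting a float gives a floating point result, so negative values are kept
as_float = pixels - 10.
print(as_float)        # [ 32.  38. -10.]

# Subtracting an integer keeps the uint8 type, so 0 - 10 wraps around to 246
as_uint8 = pixels - 10
print(as_uint8)        # [ 32  38 246]
print(as_uint8.dtype)  # uint8
```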

## Other simple structures

During this course, we will from time to time encounter other types of containers, for example tuples. Those are defined with ```()``` and are immutable, i.e. we can't change their values.

In [14]:
mytuple = (3, 'a')

In [15]:
mytuple

(3, 'a')

In [16]:
mytuple[0]

3

In [17]:
mytuple[0] = 5

TypeError: 'tuple' object does not support item assignment
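A minimal sketch of what this means in practice: the failed assignment can be caught, and a "modified" version can only be obtained by building a new tuple. The name `newtuple` is chosen for this illustration.

```python
mytuple = (3, 'a')

# Trying to change an element raises a TypeError, which we catch here
try:
    mytuple[0] = 5
except TypeError as error:
    print(error)  # 'tuple' object does not support item assignment

# Instead, build a new tuple from a new first element and the remaining ones
newtuple = (5,) + mytuple[1:]
print(newtuple)  # (5, 'a')
print(mytuple)   # (3, 'a') -> the original tuple is unchanged
```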

The last structure mentioned here is the dictionary, which is a collection of key-words (called keys in Python) and their corresponding values. For example, you might get this kind of structure as the output of a function that provides different properties of analyzed objects. We define it with curly brackets. Each key-word can hold any type of content.

In [18]:
mydict = {'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}

In [19]:
mydict

{'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}

We can then access each element by key-word (instead of by index like in a list):

In [20]:
mydict['area']

[10, 12, 4]
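Unlike tuples, dictionaries can be modified after creation. The sketch below adds a new key-word and lists all available key-words; the `'perimeter'` entry and its values are invented for illustration.

```python
mydict = {'area': [10, 12, 4], 'object_type': ['cell', 'nucleus', 'cell']}

# Add a new key-word with its values (made-up numbers)
mydict['perimeter'] = [11, 13, 7]

# List all key-words currently in the dictionary
print(list(mydict.keys()))  # ['area', 'object_type', 'perimeter']

# Go through the objects one by one, pairing each type with its area
for kind, area in zip(mydict['object_type'], mydict['area']):
    print(kind, area)
```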

## Exercise

1. Create an empty list. Use the ```append``` method to add a few elements.
2. Try to extract one element using indices. Make sure you get the correct one!
3. Create a dictionary containing one key for fruit names and one key for fruit weights, and fill it with 3 fruits. Make sure you can recover either the fruit names or the weights.