# Python introduction
## Week 2, Numpy and Pandas

Now that you are getting familiar with the basics of Python, it's time to delve a little deeper into one of the most important Python packages for scientific computing: [NumPy](http://www.numpy.org/). NumPy contains many powerful tools for handling N-dimensional arrays and matrices, broadcasting, linear algebra, and random number generation. At the end of this notebook, we will show you some basic plotting functions regularly used for data visualization.

Go through the notebook and make sure you understand every section before continuing.

Credits: This tutorial is partly based on work by [Tomas Knapen and Daan van Es](https://tknapen.github.io) (VU University) and [Lukas Snoek](https://github.com/lukassnoek) (University of Amsterdam).

## 1. Numpy

This section is about Numpy, Python's core library for scientific computing. While the syntax of basic Python (as we've gone over in the previous section) is important to do data analysis in Python, knowing numpy is **essential**. The most important feature of Numpy is its core data structure: the numpy *ndarray* (which stands for *n*-dimensional array, referring to the fact that the array may be of any dimension: 1D, 2D, 3D, 180D ... *n*D).

### Python lists vs. numpy arrays

Basically, numpy arrays are a lot like Python lists. The major difference, however, is that numpy arrays may contain only a single data-type, while Python lists may contain different data-types within the same list.

In [None]:
# this list contains mixed data-types: an integer, a float, a string, a list
python_list = [1, 2.5, "whatever", [3, 4, 5]] 

for entry in python_list:
    
    print(entry)
    print('this is a: {0}'.format(type(entry)))
    print('\n') # this prints an empty line (so that the printed statements are easier to read)

Numpy thus only allows entries of the same data-type. This difference between Python lists and numpy arrays is basically the same as R lists (allow multiple data-types) versus R matrices/arrays (only allow one data type), and is also the same as MATLAB cells (allow multiple data-types) versus MATLAB matrices (only allow one data type).

In fact, if you try to make a numpy array with different data-types, numpy will force the entries into the same data-type (in a smart way), as is shown in the example below:

In [1]:
import numpy as np # this is how numpy is often imported

# Importantly, you often specify your arrays as Python lists first, and then convert them to numpy
to_convert_to_numpy = [1, 2, 3.5]               # specify python list ...
numpy_array = np.array(to_convert_to_numpy)     # ... and convert ('cast') it to numpy

for entry in numpy_array:
    
    print(entry)
    print('this is a: {0} \n'.format(type(entry)))

1.0
this is a: <class 'numpy.float64'> 

2.0
this is a: <class 'numpy.float64'> 

3.5
this is a: <class 'numpy.float64'> 



As you can see, Numpy converted our original list (to_convert_to_numpy), which contained both integers and floats, to an array with only floats! 

You can turn an array to a different data type by using the astype-method:

In [None]:
for entry in numpy_array.astype(int):
    
    print(entry)
    print('this is a: {0} \n'.format(type(entry)))
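One detail worth knowing about `astype(int)` is that it drops the decimals instead of rounding to the nearest integer. The sketch below illustrates this; the variable names (`floats`, `truncated`, `rounded`) are made up for this example only:

```python
import numpy as np

floats = np.array([1.0, 2.0, 3.6, -1.7])

truncated = floats.astype(int)          # drops the decimals (truncates toward zero)
rounded = floats.round().astype(int)    # round first if you want the nearest integer

print(truncated)
print(rounded)
```

If you need the nearest integer, rounding before converting is usually the safer choice.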

You might think that a data structure that allows only a single data type is not ideal. However, the very fact that it contains only a single data-type makes operations on numpy arrays extremely fast. For example, an operation applied to a whole numpy array at once (a 'vectorized' operation) is often much faster than the equivalent loop over a Python list. This is because, when looping over a list, Python has to check the data-type of each entry before doing something with that entry. Because a numpy array contains a single data-type, numpy has to check the data-type only **once** and can then process all entries in fast, compiled code. If you imagine an array or list of length 100,000, you probably understand that the numpy version is way faster.

Let's check out the speed difference between Python list operations and numpy array operations:

In [None]:
# timeit is a cool 'feature' that you can use in Notebooks (no need to understand how it works)
# it basically performs a computation that you specify a couple of times and prints how long it took on average
%timeit a = [x * 2 for x in range(0, 100000)] # multiplies each entry in a list of 0 - 100,000 by two

And now let's do the same with numpy:

In [None]:
%timeit b = np.arange(0, 100000) * 2 # np.arange creates a np.array of consecutive integers, much like Python's 'range'

Typically more than 10 times as fast! This really matters when you start doing more complex operations on, let's say, very large datasets!

### Numpy arrays: creation
As shown earlier, numpy arrays can be created by defining a Python list and converting it to a numpy array explicitly.
Importantly, a simple Python list will be converted to a 1D numpy array, but a nested Python list will be converted to a 2D (or even higher-dimensional) array, as is shown here:

In [None]:
my_list = [1, 2, 3]
my_array = np.array(my_list)
print(my_array)
print('\n')

my_nested_list = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]

my_2D_array = np.array(my_nested_list)
print(my_2D_array)

As you can see, creating numpy arrays from nested lists becomes cumbersome if you want to create (large) arrays with more than 2 dimensions. There are, fortunately, a lot of other ways to create ('initialize') large, high-dimensional numpy arrays. One often-used method is to create an array with zeros using the numpy function `np.zeros`:

In [None]:
my_desired_dimensions = (2, 5) # suppose I want to create a matrix with zeros of size 2 by 5
my_array = np.zeros(my_desired_dimensions)

print(my_array)

Arrays with zeros are often used in what is called 'pre-allocation', in which you create an 'empty' array with only zeros and then, for example, 'fill' that array in a loop, as is done below:

In [None]:
my_desired_dimensions = (5, 5)
my_array = np.zeros(my_desired_dimensions)

print('Original zeros-array')
print(my_array)

# make sure you understand what we're looping over - what is 'range(my_desired_dimensions[0])'?
for i in range(my_desired_dimensions[0]):
    
    for ii in range(my_desired_dimensions[1]):
        
        my_array[i, ii] = (i + 1) * (ii + 1)

print('\nFilled array')
print(my_array)

**Important**: realize that loops (as shown above), if-statements, and other boolean logic work the same for numpy and python!

Also, you can create numpy arrays using other functions:

In [None]:
ones = np.ones((5, 10)) # create an array with ones
print(ones)

rndom = np.random.random((5, 10)) # Create an array filled with random values
print(rndom)
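A few more creation functions come in handy regularly. The sketch below shows `np.full`, `np.eye` and `np.linspace`; the variable names are only illustrative:

```python
import numpy as np

sevens = np.full((2, 3), 7)      # 2 by 3 array in which every entry is 7
identity = np.eye(3)             # 3 by 3 identity matrix (ones on the diagonal)
steps = np.linspace(0, 1, 5)     # 5 evenly spaced values from 0 to 1 (both included)

print(sevens)
print(identity)
print(steps)
```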

<div class='alert alert-warning'>
**ToDo**: Create a numpy array with zeros only of the dimensions: (5, 2, 4, 9). Print the dimensions by using the shape-attribute for your array
</div>

In [None]:
# Create the numpy array!

### Numpy: indexing

Indexing (extracting a single value of an array) and slicing (extracting multiple values - a subset - from an array) of numpy arrays is largely the same as other scientific computing languages such as R and MATLAB. Let's check out a 1D example:

In [None]:
my_array = np.arange(10, 21) # similar to the python inherent range-function
print('Full array:')
print(my_array) # make sure you understand why this array is from 10 - 20 (check the np.arange function using google!)

print('\nSeveral indices of my_array:')
print(my_array[0])
print(my_array[[0, 1]])
print(my_array[[0, 1, 9]])
print(my_array[5:8]) # from index 5 until (but NOT including) index 8
print(my_array[:3]) # slice from index 0 until 3 (NOT including 3)

Setting values in numpy arrays works largely the same way as with lists:

In [None]:
my_array[0] = 100000
print(my_array)

my_array[[3, -1]] = -100
print('\n')
print(my_array)

my_array[:-3] = 0 # can you figure out what this slice (:-3) does?
print(my_array)

Often, instead of working on and indexing 1D arrays, we'll work with multi-dimensional (>1D) arrays. Indexing multi-dimensional arrays is, again, quite similar to other scientific computing languages:

In [None]:
my_array = np.zeros((3, 3)) # 3 by 3 array with zeros
my_array[2, 2] = 1
print(my_array)

my_array[:, 0] = 100 # the ':' selects ALL rows, so this sets the entire first column to 100
print('\n')
print(my_array)

<div class='alert alert-info'>
**ToThink**: Make sure you understand why, in the first indexing operation (my_array[2, 2] = 1), the element in the *third* column and row is changed (and not the second).
</div>

In addition to setting specific slices to specific values, you can also extract sub-arrays using slicing/indexing:

In [None]:
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
print(my_array)

first_row = my_array[0, :]
print('\nFirst row')
print(first_row)

first_col = my_array[:, 0]
print('\nFirst column')
print(first_col)
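Something to keep in mind when extracting sub-arrays: a slice of a numpy array is a *view* on the original data, not a copy. Changing the slice therefore also changes the original array. A minimal sketch (the names `row_view` and `row_copy` are made up for this example):

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

row_view = my_array[0, :]         # a slice is a view on the original data
row_copy = my_array[0, :].copy()  # an explicit copy is independent

row_view[0] = 100   # this also changes my_array[0, 0]
row_copy[1] = -1    # this leaves my_array untouched

print(my_array)
```

Use the `.copy()` method whenever you want to modify a sub-array without touching the original.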

Perhaps one of the most frequently used types of indexing in scientific computing is boolean indexing (also called 'masking'). In this type of indexing, you index an array with a boolean array (i.e. an array with True and False values) of the same shape. Let's look at an example:

In [None]:
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print(my_array)

bool_idx = my_array > 5
print('\nBoolean index corresponding to values in my_array larger than 5')
print(bool_idx)

print('\nIndexing my_array with bool_idx:')
print(my_array[bool_idx])
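Boolean indices can also be combined, and they can be used to set values as well as to extract them. The following sketch shows both; the names `between` and `clipped` are only illustrative:

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

# combine conditions with & (and) or | (or); the parentheses are required
between = (my_array > 2) & (my_array < 8)
print(my_array[between])

# a boolean index can also be used to set values
clipped = my_array.copy()
clipped[clipped > 5] = 5
print(clipped)
```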

<div class='alert alert-warning'>
**ToDo**: Use a boolean index to extract all negative values from the matrix (my_matrix) below:
</div>

In [None]:
my_matrix = np.array([[0, 1, -1, -2],
                      [2, -5, 1, 4],
                      [10, -2, -4, 20]])

# Make a new boolean index below ...

# And use it to index my_matrix:


### Numpy: data-types

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy guesses the datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [None]:
x = np.array([1, 2])  # Let numpy choose the datatype (here: int)
y = np.array([1.0, 2.0])  # Let numpy choose the datatype (here: float)
z = np.array([1, 2], dtype=np.float64)  # Force a particular datatype (input: int, but converted to 64bit float)
# Note that above line is similar to using the astype-method


print(type(x[0]))
print(type(y[0]))
print(type(z[0]))

You can read all about numpy datatypes in the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).

### Numpy: methods vs. functions
In the previous section (Basic Python), you've learned that, in addition to functions, 'methods' exist that are like functions of an object. In other words, methods are functions that are applied to the object itself. You've seen examples of list methods, e.g. `my_list.append(1)`, and string methods, e.g. `my_string.count('b')`. Like lists and strings, numpy arrays have a lot of convenient methods that you can call. Again, this is just like a function, but then applied to itself. Often, numpy provides both a function and method for simple operations. Let's look at an example: 

In [None]:
my_array = np.arange(10)
print(my_array)

mean_array = np.mean(my_array)
print('\nThe mean of the array is: {0}'.format(mean_array))

mean_array2 = my_array.mean() 
print('The mean of the array (computed by its corresponding method) is: {0}'.format(mean_array2))

print('\nIs the numpy function the same as the corresponding method? Answer: {0}'.format(str(mean_array == mean_array2)))

If there is both a function and a method for the operation you want to apply to the array, it really doesn't matter what you choose! Let's look at some more (often used) methods of numpy ndarrays:

In [None]:
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print(my_array.std()) # standard deviation, same as np.std(array)
print(my_array.T) # transpose an array, same as np.transpose(array)
print(my_array.diagonal()) # get the diagonal, same as np.diag(array)
print(my_array.min()) # same as np.min(array)
print(my_array.max()) # same as np.max(array)
print(my_array.sum()) # same as np.sum(array)

Importantly, a method may or may not take arguments (input).
If no arguments are given, it just looks like "object.method()", i.e. two enclosing brackets with nothing in between.
However, a method may take one or more arguments (like the my_list.append(1) method)! 
This argument may be named or unnamed - doesn't matter. An example:

In [None]:
my_array2 = np.random.random((3, 3))
print('Original array:')
print(my_array2)

print('\nUse the round() method with the argument 3:')
print(my_array2.round(3))
print(my_array2.round(decimals=5))

**Some methods that you'll see a lot in the upcoming tutorials**. In addition to the methods listed above, you'll probably see the following methods a lot in the rest of this course (make sure you understand them!):

In [None]:
my_array = np.arange(10)
print(my_array.reshape((5, 2))) # reshape to desired shape

In [None]:
temporary = my_array.reshape((5, 2))
print(temporary.ravel()) # unroll multi-dimensional array to single 1D array

In [None]:
array1 = np.arange(10)
array2 = np.arange(10).T # transposing a 1D array has no effect, so array2 is identical to array1

# .dot() does matrix multiplication (dot product: https://en.wikipedia.org/wiki/Dot_product)
# This linear algebra operation is used very often in data science (regression/fourier transform etc)
dot_product = array1.dot(array2)
print(dot_product)

<div class='alert alert-warning'>
**ToDo**: From the variable below (test_array), extract the diagonal, and sum them together. Print the output.
</div>

In [None]:
test_array = np.arange(9).reshape((3, 3))
# Extract the diagonal and sum them together:


### Numpy: methods vs. attributes?
Alright, by now, if you see a variable followed by a word ending with enclosed brackets, e.g. `my_array.mean()`, you'll know that it's a method! But sometimes you might see something similar, but **without** the brackets, such as `my_array.size`. This `.size` is called an **attribute** of the variable `my_array`. Like a method, it's an integral part of an object (such as a numpy ndarray). The attribute may be of any data-type, like a string, integer, tuple, an array itself. Let's look at an example:

In [None]:
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print(my_array)
print('\nThe size (number of elements) in the array is:')
print(my_array.size)
print('\nThe .size attribute is of data-type: {0}'.format(type(my_array.size)))

Alright, so by now you might be wondering what the difference between a method and an attribute is. Superficially, you can recognize a method by the form `object.method()` (note the brackets!), like `my_array.round()`; an attribute is virtually the same **but without brackets**, in the form of `object.attribute`, like `my_array.size`. 

Conceptually, you may think of methods as things that **do** something with the array, while attributes **say** something about the array.

For example, `my_array.size` **does nothing** with the array - it only **says** something about the array (it gives information about its size), while `my_array.mean()` really **does** something (i.e. calculates the mean of the array). 

Again, you might not use attributes a lot during this course, but you'll definitely see them around in the code of the tutorials. Below, some of the common ndarray attributes are listed:

In [None]:
my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

print('Size (number of elements) of array:')
print(my_array.size) # returns an integer

print('\nShape of array:')
print(my_array.shape) # this is a tuple!

print('\nNumber of dimensions:')
print(my_array.ndim) # this is an integer

### Numpy: array math
We have already gone through arithmetic with standard Python numbers. Doing it with numpy arrays isn't much different.
Basic mathematical functions operate elementwise on arrays, which means that the operation (e.g. addition) is applied onto each element in the array.

In [None]:
x = np.zeros(10)
print(x)
x += 1 # remember: this is the same as x = x + 1
print('\n')
print(x)

Often, there exist function-equivalents of the mathematical operators. For example, `x + y` is the same as `np.add(x, y)`. However, it is recommended to use the operators wherever possible to improve readability of your code. See below for an example:

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum: element 1 from x is added to element 1 from y, element 2 from x is added to element 2 from y, etc.
print(x + y)
print('\n')
print(np.add(x, y))

In [None]:
# Elementwise difference; both produce the array
print(x - y)
print(np.subtract(x, y))

In [None]:
# Elementwise product; both produce the array
print(x * y)
print(np.multiply(x, y))

In [None]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

In [None]:
# Elementwise square root; produces the array

print(np.sqrt(x))

<div class='alert alert-info'>
**ToThink**: Suppose you have an array with integers, e.g. `x = np.array([1, 2, 3], dtype=np.int64)`. You calculate `x // 2`, and the result is `[0, 1, 1]` - unlike what you might have expected! What is going on here? (Note that in Python 3, `x / 2` performs true division and returns `[0.5, 1.0, 1.5]`.) If you forgot, refer back to the Basic Python section.
</div>

<div class='alert alert-warning'>
**ToDo**: Do an elementwise product between the two variables defined below (matrix_A and matrix_B) and subsequently add 5 to each element.
</div>

In [None]:
matrix_A = np.arange(10).reshape((5, 2))
matrix_B = np.arange(10, 20).reshape((5, 2))

Note that unlike MATLAB, `*` is elementwise multiplication, not matrix multiplication. We instead use the dot function (or method) to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices.

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))
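The same `dot` method also handles matrix-vector and matrix-matrix products. The sketch below reuses the values of `x`, `y` and `v` from the cell above and additionally shows the `@` operator, which does the same as `.dot()` for 1D and 2D arrays:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])

# Matrix times vector: produces a 1D array with 2 elements
print(x.dot(v))

# Matrix times matrix: produces a 2 by 2 array
print(x.dot(y))

# The @ operator is equivalent to .dot() for 1D and 2D arrays
print(x @ y)
```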

Probably the most used functions in numpy are the sum() and mean() functions (or methods!). A nice feature is that they can operate on the entire array (this is the default) or they can be applied per dimension. In numpy, dimensions are referred to as **axes**. Applying functions along axes is very common in scientific computing! An example:

In [None]:
x = np.array([[1, 2],[3, 4], [5, 6]])

print('Original array:')
print(x)

print('\nSum over ALL elements of x:')
print(np.sum(x))

print('\nSum of each column of x (axis=0):')
print(np.sum(x, axis=0))

print('\nSum of each row of x (axis=1):')
print(x.sum(axis=1)) # this is the method form! It is exactly the same as np.sum(x, axis=1)
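A useful way to think about the `axis` argument is that it names the dimension that gets *collapsed*. The sketch below checks this on a 3D array by looking only at the shapes of the results; the `keepdims` argument keeps the collapsed axis with length 1:

```python
import numpy as np

x = np.arange(24).reshape((2, 3, 4))   # a 3D array of shape (2, 3, 4)

print(x.max(axis=0).shape)   # axis 0 is collapsed
print(x.max(axis=1).shape)   # axis 1 is collapsed
print(x.max(axis=2).shape)   # axis 2 is collapsed
print(x.max(axis=1, keepdims=True).shape)   # axis 1 is kept, but with length 1
```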

<div class='alert alert-warning'>
**ToDo**: Calculate the mean of each column of the matrix y below and print the output:
</div>


In [None]:
y = np.arange(20).reshape((5, 4))

# Calculate the mean of each column


### Broadcasting

Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.

For example, suppose that we want to add a constant vector to each row of a matrix. We could do it like this:

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print('x array is of shape: {0}'.format(list(x.shape)))
print(x)

v = np.array([1, 0, 1])
print('\nv vector is of shape: {0}'.format(list(v.shape)))
print(v)

y = np.zeros(x.shape)   # Create an empty (zeros) matrix with the same shape as x
print('\nShape of (pre-allocated) y-matrix: {0}'.format(list(y.shape)))

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(x.shape[0]): # see how the shape attributes comes in handy in creating loops?
    y[i, :] = x[i, :] + v

print('\n The result of adding v to each row of x, as stored in y:')
print(y)

This works; however when the matrix `x` is very large, computing an explicit loop in Python could be slow. Note that adding the vector v to each row of the matrix `x` is equivalent to forming a matrix `vv` by stacking multiple copies of `v` vertically, like this `[[1 0 1], [1 0 1], [1 0 1], [1 0 1]]`, and subsequently elementwise addition of `x + vv`:

In [None]:
vv = np.tile(v, (4, 1)) # i.e. expand vector 'v' 4 times along the row dimension (similar to MATLAB's repmat function)
y = x + vv  # Add x and vv elementwise
print(y)

Numpy **broadcasting** allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print(y)

The line `y = x + v` works even though `x` has shape `(4, 3)` and `v` has shape `(3,)` due to broadcasting; this line works as if v actually had shape `(4, 3)`, where each row was a copy of `v`, and the sum was performed elementwise.

Broadcasting is really useful, as it saves us from writing unnecessary and usually slower explicit for-loops. Additionally, it's way easier to read and write than explicit for-loops (which need pre-allocation). Functions that support broadcasting are known as universal functions. You can find the list of all universal functions in the [documentation](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs).
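Broadcasting lines up shapes starting from the *last* dimension, which is why a vector of shape `(3,)` matches the rows of a `(4, 3)` matrix. To add a vector to each *column* instead, the vector needs shape `(4, 1)`. A minimal sketch, in which the vector `w` and its values are made up for this example:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # shape (4, 3)
w = np.array([10, 20, 30, 40])                                 # shape (4,)

# x + w would raise a ValueError: shapes (4, 3) and (4,) do not line up.
# Reshaping w to a column of shape (4, 1) lets numpy stretch it across the columns.
w_column = w.reshape((4, 1))
y = x + w_column
print(y)
```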

Here are some applications of broadcasting using different functions:

In [None]:
x = np.array([[1, 2],[3, 4], [5, 6]], dtype=np.float64) # np.float was removed from recent numpy versions
x_sum = x.sum(axis=0)

print(x / x_sum)

<div class='alert alert-warning'>
**ToDo**: Calculate the mean of each column in the variable 'my_array' below. Subsequently, subtract these column-means from each row of the matrix.
</div>

In [None]:
my_array = np.arange(20).reshape((5, 4))
print(my_array)

# Calculate the mean of each column using the axis=0 argument! and subtract it from the array itself.

Now you know the most important numpy concepts and functionality necessary to do neuroimaging analysis. Surely, there is a lot more to the numpy package than what we've covered here! But for now, let's continue with Pandas!

## 2. Pandas

Pandas is a tool for handling tabular data structures in Python. Its core data structure, the DataFrame, is reminiscent of the data frames you'd see in R. As you will see later, a lot of Python packages support or even prefer pandas (e.g. Seaborn). Pandas has a plethora of functionality, and here we will quickly go through some of the most important parts.


In [3]:
import pandas as pd # This is how we usually import pandas
import numpy as np

In [4]:
# Let's define a few values
subject = [1, 2, 3, 4, 5, 6, 7, 8]
reaction_time = [1.6, 2.1, 1.7, 2.4, 1.2, 1.3, 2.5, 3.3]
age = [21, 32, 25, 37, 23, 23, 51, 61]

# put our lists into a numpy array, where each variable is a column (meaning we need to transpose)
data = np.array([subject, reaction_time, age]).T

# Create dataframe and print out first five rows
df = pd.DataFrame(data = data, columns=['S', 'R', 'A'])
print(df.head()) # head() gives you the first 5 rows, tail() gives you the last 5 rows

     S    R     A
0  1.0  1.6  21.0
1  2.0  2.1  32.0
2  3.0  1.7  25.0
3  4.0  2.4  37.0
4  5.0  1.2  23.0


<i>The data parameter to pd.DataFrame can also be a dictionary. See the [Pandas](https://pandas.pydata.org/pandas-docs/stable/) docs for more information.</i><br/>
We can easily change the name of the columns so they make more sense:

In [5]:
df.columns = ['Subject', 'RT', 'Age']
print(df)

   Subject   RT   Age
0      1.0  1.6  21.0
1      2.0  2.1  32.0
2      3.0  1.7  25.0
3      4.0  2.4  37.0
4      5.0  1.2  23.0
5      6.0  1.3  23.0
6      7.0  2.5  51.0
7      8.0  3.3  61.0
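As mentioned above, the data can also be passed as a dictionary. One practical difference: because a numpy array holds a single data-type, the array route turned every column into floats (note the `1.0` and `21.0` in the output), whereas a dictionary keeps a separate data-type per column. A small sketch using the first three subjects; the name `df_from_dict` is only illustrative:

```python
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column's values
df_from_dict = pd.DataFrame({'Subject': [1, 2, 3],
                             'RT': [1.6, 2.1, 1.7],
                             'Age': [21, 32, 25]})

print(df_from_dict)
print(df_from_dict.dtypes)   # Subject and Age stay integers; RT is float
```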


And select the subjects who have an RT above the mean:

In [6]:
print(df[df['RT'] > df['RT'].mean()])

   Subject   RT   Age
1      2.0  2.1  32.0
3      4.0  2.4  37.0
6      7.0  2.5  51.0
7      8.0  3.3  61.0
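This kind of selection is boolean indexing, just like in numpy, so conditions can be combined with `&` (and) and `|` (or). A minimal sketch on a small made-up DataFrame (`df_example` and `selection` are illustrative names, not part of the data above):

```python
import pandas as pd

df_example = pd.DataFrame({'Subject': [1, 2, 3, 4],
                           'RT': [1.6, 2.1, 1.7, 2.4],
                           'Age': [21, 32, 25, 37]})

# RT above the mean AND age below 35; parentheses around each condition are required
above_mean = df_example['RT'] > df_example['RT'].mean()
under_35 = df_example['Age'] < 35
selection = df_example[above_mean & under_35]
print(selection)
```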


### Pandas: Extracting data from columns

It's easy to extract specific columns from the dataframe based on their name:

In [7]:
RT = df['RT'] # get the reaction times
print(RT.mean()) # print mean reaction time

# which is equivalent to this
print(df['RT'].mean())


2.0125
2.0125


In pandas, each column is a <b>Series</b>. A Series is a pandas-specific format, which we can easily turn into a NumPy vector if needed:

In [8]:
print(type(RT)) # type is an excellent function to use for checking what different objects are

# Let's make it a NumPy array in two ways
RT2 = np.array(RT)
print(type(RT2))
# which is the same as
print(type(RT.values))

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
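A third option, available in more recent pandas versions, is the `to_numpy()` method. A short sketch on a made-up Series (the names `rt_series` and `rt_array` are only illustrative):

```python
import numpy as np
import pandas as pd

rt_series = pd.Series([1.6, 2.1, 1.7], name='RT')
rt_array = rt_series.to_numpy()   # convert the Series to a numpy array

print(type(rt_array))
print(rt_array.mean())
```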


We can extract several columns in a similar way as just one column

In [9]:
partial_data = df[['Subject', 'Age']]
print(partial_data)

   Subject   Age
0      1.0  21.0
1      2.0  32.0
2      3.0  25.0
3      4.0  37.0
4      5.0  23.0
5      6.0  23.0
6      7.0  51.0
7      8.0  61.0


And sort the rows by the values in one or more columns:

In [10]:
df_sort = df.sort_values(['RT'], ascending=True)
print(df_sort)

   Subject   RT   Age
4      5.0  1.2  23.0
5      6.0  1.3  23.0
0      1.0  1.6  21.0
2      3.0  1.7  25.0
1      2.0  2.1  32.0
3      4.0  2.4  37.0
6      7.0  2.5  51.0
7      8.0  3.3  61.0
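Notice that sorting keeps the original row names (the numbers on the left). If you would rather have them renumbered from 0, the `reset_index` method can do that. A minimal sketch on a small made-up DataFrame (all names are illustrative):

```python
import pandas as pd

df_example = pd.DataFrame({'Subject': [1, 2, 3],
                           'RT': [1.6, 2.1, 1.2]})

sorted_desc = df_example.sort_values('RT', ascending=False)
print(sorted_desc)   # the row names are now in the order 1, 0, 2

renumbered = sorted_desc.reset_index(drop=True)
print(renumbered)    # the row names are 0, 1, 2 again
```

Setting `drop=True` discards the old row names instead of storing them in a new column.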


We can also easily add and remove columns:

In [11]:
# Add gender as a variable
# The list must be the same length as the dataframe
df['Gender'] = ['f', 'f', 'm', 'f', 'm', 'm', 'f', 'm']
print(df)
# let's remove it
df = df.drop('Gender', axis=1)
print(df)

   Subject   RT   Age Gender
0      1.0  1.6  21.0      f
1      2.0  2.1  32.0      f
2      3.0  1.7  25.0      m
3      4.0  2.4  37.0      f
4      5.0  1.2  23.0      m
5      6.0  1.3  23.0      m
6      7.0  2.5  51.0      f
7      8.0  3.3  61.0      m
   Subject   RT   Age
0      1.0  1.6  21.0
1      2.0  2.1  32.0
2      3.0  1.7  25.0
3      4.0  2.4  37.0
4      5.0  1.2  23.0
5      6.0  1.3  23.0
6      7.0  2.5  51.0
7      8.0  3.3  61.0


### Pandas: Data selection by rows
This takes us to another way of selecting data. In pandas you can also select by rows, using two very similar indexers:
<b>loc</b> and <b>iloc</b>

Knowing the difference between the two can save you many hours of debugging. Look at the left-hand side of the printed df_sort above: there you have the row names (the index), and because we sorted the rows, they are jumbled around compared to the unsorted df. This means that the row name and the row position are no longer the same in df_sort, but they are in df. <b>loc</b> selects rows by name, while <b>iloc</b> selects rows by position.

In [12]:
print('This is the row named 0 in df:')
print(df.loc[0]) # NOTICE THE SQUARE BRACKETS!
print('This is the row indexed 0 of df:')
print(df.iloc[0])
print('\n*********************************\n')
print('This is the row named 0 in df_sort:')
print(df_sort.loc[0])
print('This is the row indexed 0 of df_sort:')
print(df_sort.iloc[0])

This is the row named 0 in df:
Subject     1.0
RT          1.6
Age        21.0
Name: 0, dtype: float64
This is the row indexed 0 of df:
Subject     1.0
RT          1.6
Age        21.0
Name: 0, dtype: float64

*********************************

This is the row named 0 in df_sort:
Subject     1.0
RT          1.6
Age        21.0
Name: 0, dtype: float64
This is the row indexed 0 of df_sort:
Subject     5.0
RT          1.2
Age        23.0
Name: 4, dtype: float64
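The difference becomes even clearer when the row names are not numbers. The sketch below uses a small made-up DataFrame with letters as row names (`df_example` is an illustrative name) and also shows that the two indexers treat slices differently:

```python
import pandas as pd

df_example = pd.DataFrame({'RT': [1.6, 2.1, 1.2, 2.4]},
                          index=['a', 'b', 'c', 'd'])   # row names are letters here

print(df_example.loc['b', 'RT'])    # by row name and column name
print(df_example.iloc[1, 0])        # by row position and column position

print(df_example.loc['a':'c'])      # name slices INCLUDE the end name
print(df_example.iloc[0:2])         # position slices EXCLUDE the end position
```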


<div class='alert alert-info'>
**ToThink**: Make sure you understand what is happening here. Compare the output from above cell with output from df and df_sort in the cells above
</div>

<div class='alert alert-warning'>
**ToDo**: We can change a value in any given cell using df.at[row_name, column_name] = value (older pandas versions used df.set_value, which has since been removed). For example:

df.at[3, 'RT'] = 2.6

would replace the reaction time for participant 4 (the row named 3). Change the age of Subject 1 to 26!
</div>

In [102]:
# Change age here

### Pandas: Saving and loading data
Pandas lets you save and load data in different formats.

The only parameters we will use are index and header. Setting the index parameter to False prevents the row names from being exported.

In [13]:
file_name = 'our_data.csv'
df.to_csv(file_name, index=False) # let's save to a csv file, without the row names

# If the file is created successfully it means that if we accidentally remove a 
# column from our data, we can easily recover our data
df = df.drop('Age', axis=1) # axis=1 specifies that we want to remove a column
print(df)
# read in saved data
df = pd.read_csv(file_name)
print('This is our recovered data')
print(df)

   Subject   RT
0      1.0  1.6
1      2.0  2.1
2      3.0  1.7
3      4.0  2.4
4      5.0  1.2
5      6.0  1.3
6      7.0  2.5
7      8.0  3.3
This is our recovered data
   Subject   RT   Age
0      1.0  1.6  21.0
1      2.0  2.1  32.0
2      3.0  1.7  25.0
3      4.0  2.4  37.0
4      5.0  1.2  23.0
5      6.0  1.3  23.0
6      7.0  2.5  51.0
7      8.0  3.3  61.0
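To see what the index parameter actually changes, it can help to look at the raw text that pandas writes. When no file name is given, `to_csv` returns the text as a string. The sketch below uses a small made-up DataFrame; `io.StringIO` is only used to read the text back in without creating a file:

```python
import io
import pandas as pd

df_example = pd.DataFrame({'Subject': [1, 2], 'RT': [1.6, 2.1]})

with_index = df_example.to_csv()                # row names are written as an extra first column
without_index = df_example.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# Reading the text back in (io.StringIO makes a string behave like a file)
columns_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(columns_with)      # contains an extra, unnamed column holding the old row names
print(columns_without)
```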


<div class='alert alert-warning'>
**ToDo**: Make your own DataFrame from the beginning, where you define your subjects and some type of "measurement" with random values between 0 and 1 (use numpy for randomization). Save your dataframe as "my_dataframe.csv"
</div>

In [101]:
# Make your own DataFrame here

### Pandas: Managing formats

We can change the DataFrame by shifting between wide and long format, transposing (swapping rows and columns), grouping by columns, merging with another dataset, and in many other ways. In data science it's rarely the case that you get data that is neatly formatted, so learning how to handle your formats is essential.<br/><br/>

Let's load a dataset we don't know much about, except that it should contain the pupil size per trial of 4 subjects:

In [18]:
df = pd.read_csv('pupil_data.csv')
print('Dimensions of our data frame:', df.shape)

Dimensions of our data frame: (200, 5)


In [19]:
# So we have 200 rows and 5 columns. Let's have a look at the first five rows:
print(df.head())
print(df.dtypes)

   Trial         1         2         3         4
0    1.0  1.397783  1.002726  1.782394  1.169649
1    2.0  1.089223  1.498374  1.511774  1.816924
2    3.0  1.450949  1.985838  1.671868  1.784004
3    4.0  1.513763  1.286041  1.949068  1.186064
4    5.0  1.795847  1.343431  1.672975  1.106802
Trial    float64
1        float64
2        float64
3        float64
4        float64
dtype: object


Here it looks like we have each subject in a separate column, together with the trial identifier. However, we would like to stack our data into long format, where we have one column indicating the subject (1-4), and one column for the pupil size. We can achieve this quickly using the melt function:

In [20]:
df = pd.melt(df, id_vars=['Trial'], value_vars=['1', '2', '3', '4'], var_name='Subject', value_name='Pupil Size')
print('Dimensions of our data frame:', df.shape)
print(df.head())
# And let's print a few of Subject 4's trials as well
print(df[df['Subject'] == '4'].head())

Dimensions of our data frame: (800, 3)
   Trial Subject  Pupil Size
0    1.0       1    1.397783
1    2.0       1    1.089223
2    3.0       1    1.450949
3    4.0       1    1.513763
4    5.0       1    1.795847
     Trial Subject  Pupil Size
600    1.0       4    1.169649
601    2.0       4    1.816924
602    3.0       4    1.784004
603    4.0       4    1.186064
604    5.0       4    1.106802
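Going the other way, from long back to wide format, can be done with the `pivot` method. The sketch below uses a tiny made-up dataset with the same column names as above (the values and the names `long_df` and `wide_df` are only illustrative):

```python
import pandas as pd

long_df = pd.DataFrame({'Trial': [1, 2, 1, 2],
                        'Subject': ['1', '1', '2', '2'],
                        'Pupil Size': [1.4, 1.1, 1.0, 1.5]})

# One row per trial, one column per subject
wide_df = long_df.pivot(index='Trial', columns='Subject', values='Pupil Size')
print(wide_df)
```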


<div class='alert alert-warning'>
**ToDo**: To understand what happened here, create a new DataFrame with 3 columns: A, B and C. Column A should be defined as [1, 2, 3] - Column B = [0.3, 0.7, 0.8], - Column C = [22, 25, 21].
<br/><br/>
Then use the melt-method to achieve this:<br/>

<table>
  <tr>
    <th></th>
    <th>A</th>
    <th>Variable</th>
    <th>Value</th>
  </tr>
  <tr>
    <td>0</td>
    <td>1.0</td>
    <td>B</td>
    <td>0.3</td>
  </tr>
  <tr>
    <td>1</td>
    <td>2.0</td>
    <td>B</td>
    <td>0.7</td>
  </tr>
  <tr>
    <td>2</td>
    <td>3.0</td>
    <td>B</td>
    <td>0.8</td>
  </tr>
  
  
  <tr>
    <td>3</td>
    <td>1.0</td>
    <td>C</td>
    <td>22.0</td>
  </tr>
  <tr>
    <td>4</td>
    <td>2.0</td>
    <td>C</td>
    <td>25.0</td>
  </tr>
  <tr>
    <td>5</td>
    <td>3.0</td>
    <td>C</td>
    <td>21.0</td>
  </tr>
</table>

<br/><br/><br/><br/>

<b>Hint</b>: There is no need to name the variable and value columns (don't fill in the parameters value_name and var_name).

</div>

Putting two datasets together is an easy task:

In [208]:
df2 = pd.read_csv('more_pupil_data.csv') # read in another data set
print(df2.head())

   Trial  Subject  Pupil Size
0    1.0      5.0    1.715435
1    2.0      5.0    1.286689
2    3.0      5.0    1.496787
3    4.0      5.0    1.825850
4    5.0      5.0    2.727050


<b>Perfect</b>! It looks like our data is already formatted. To concatenate two datasets with the same columns we can use concat. Older pandas versions also offered df.append(df2) for this, but that method was removed in pandas 2.0:

In [229]:
df3 = pd.concat([df, df2])
df4 = pd.concat([df, df2], ignore_index=True) # same rows, but with a fresh 0..n-1 index
print('Dimensions of our first data frame:', df.shape)
print('Dimensions of our second data frame:', df2.shape)
print('Dimensions of our third data frame:', df3.shape)
print('Dimensions of our fourth data frame:', df4.shape)
del df2, df3, df4 # Let's remove the extra dataframes from our memory

Dimensions of our first data frame: (800, 3)
Dimensions of our second data frame: (800, 3)
Dimensions of our third data frame: (1600, 3)
Dimensions of our fourth data frame: (1600, 3)
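
One detail worth checking after concatenating is the row index. The sketch below uses two hypothetical toy frames (`a` and `b`, not the pupil data) to show that `pd.concat` keeps the original row labels unless `ignore_index=True` is passed:

```python
import pandas as pd

# Hypothetical toy frames with the same columns
a = pd.DataFrame({'Subject': [1, 1], 'Pupil Size': [1.4, 1.1]})
b = pd.DataFrame({'Subject': [5, 5], 'Pupil Size': [1.7, 1.3]})

stacked = pd.concat([a, b])                        # keeps the original row labels
renumbered = pd.concat([a, b], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate row labels are allowed, but they can make label-based selection return more rows than expected, so renumbering is often the safer choice.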


## Pandas: Grouping data
Another nifty function is the groupby-method. We can use this to quickly get the average pupil size for each Subject:

In [238]:
df_mean = df.groupby(['Subject']).mean()
print(df_mean)

         Trial  Pupil Size
Subject                   
1        100.5    1.476810
2        100.5    1.503165
3        100.5    1.529600
4        100.5    1.523925


However, the Subject-column has now ended up as the index. We can easily solve this by passing as_index=False:

In [242]:
df_mean = df.groupby(['Subject'], as_index=False).mean()
print(df_mean)

  Subject  Trial  Pupil Size
0       1  100.5    1.476810
1       2  100.5    1.503165
2       3  100.5    1.529600
3       4  100.5    1.523925


<div class='alert alert-info'>
**ToThink**: Make sure you understand what implications this has for the loc[]-method.
</div>
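
One way to explore this question is to group a small hypothetical table both ways and select the same value with loc. The names `toy`, `grouped_indexed` and `grouped_flat` are made up for this sketch:

```python
import pandas as pd

# Hypothetical toy data in long format
toy = pd.DataFrame({'Subject': ['1', '1', '2', '2'],
                    'Pupil Size': [1.0, 2.0, 3.0, 5.0]})

grouped_indexed = toy.groupby('Subject').mean()                # Subject becomes the index
grouped_flat = toy.groupby('Subject', as_index=False).mean()   # Subject stays a column

# With Subject as the index, loc looks rows up by the subject label
print(grouped_indexed.loc['2', 'Pupil Size'])  # 4.0

# With the default integer index, loc expects the integer row label instead
print(grouped_flat.loc[1, 'Pupil Size'])       # 4.0
```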

As you can see, we averaged over all the values here. If we add a "Trial Condition" variable, we can also compute the average for each combination of "Subject" and "Trial Condition":

In [252]:
TC = [1, 2] * (len(df) // 2) # alternating 1s and 2s, 400 of each
df['Trial Condition'] = TC

df_mean = df.groupby(['Subject', 'Trial Condition'], as_index=False).mean()
print(df_mean)

  Subject  Trial Condition  Trial  Pupil Size
0       1                1  100.0    1.483401
1       1                2  101.0    1.470219
2       2                1  100.0    1.525740
3       2                2  101.0    1.480590
4       3                1  100.0    1.531551
5       3                2  101.0    1.527648
6       4                1  100.0    1.530357
7       4                2  101.0    1.517492


## Pandas (and overall): CULPRIT!!
In Python we have to be careful about whether a method changes a variable in place or returns a changed copy. Assigning a value through the at-accessor changes our DataFrame directly, while df.sort_values() returns a new DataFrame without changing the original. This means that in order to keep the sorted data we need to assign the result to a variable again:<br/><br/>
    <font face='fixed width' size=4>df = df.sort_values('RT')</font><br/><br/>
This is not necessary when setting a value with at, which changes the DataFrame directly (older tutorials use df.set_value(3, 'RT', 2.6) for this, but set_value was removed in pandas 1.0):<br/><br/>
    <font face='fixed width' size=4>df.at[3, 'RT'] = 2.6</font><br/><br/>
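
The difference between changing an object in place and returning a new one can be sketched on a hypothetical toy DataFrame. Note one assumption: the in-place write below uses the at-accessor, because set_value is not available in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Subject': [1, 2, 3], 'RT': [2.1, 1.6, 1.9]})

toy.sort_values('RT')                # the sorted result is thrown away
print(toy['RT'].tolist())            # [2.1, 1.6, 1.9] -> toy is unchanged

sorted_toy = toy.sort_values('RT')   # keep the result by assigning it
print(sorted_toy['RT'].tolist())     # [1.6, 1.9, 2.1]

toy.at[0, 'RT'] = 2.6                # writes into toy directly
print(toy['RT'].tolist())            # [2.6, 1.6, 1.9]
```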

This relates to view and copy in python. If we assign our DataFrame to another variable:

In [264]:
# Let's define a few values
subject = [1, 2, 3, 4, 5, 6, 7, 8]
reaction_time = [1.6, 2.1, 1.7, 2.4, 1.2, 1.3, 2.5, 3.3]
age = [21, 32, 25, 37, 23, 23, 51, 61]

# put our lists into a numpy array, where each variable is a column (meaning we need to transpose)
data = np.array([subject, reaction_time, age]).T

# Create dataframe and print out first five rows
df = pd.DataFrame(data = data, columns=['Subject', 'RT', 'Age'])
print(df.head()) # head() gives you the first 5 rows, tail() gives you the last 5 rows

   Subject   RT   Age
0      1.0  1.6  21.0
1      2.0  2.1  32.0
2      3.0  1.7  25.0
3      4.0  2.4  37.0
4      5.0  1.2  23.0


In [265]:
df_new = df

We are actually copying the <b>link</b> between df and its object to df_new, although we might think we are making a copy of the actual object. Internally, we have created a new binding (df_new) to the same object (the DataFrame).
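
A quick way to check whether two names are bound to the same object is the `is` operator (or comparing `id()` values). This is a minimal sketch with hypothetical names `df_a` and `df_b`:

```python
import pandas as pd

df_a = pd.DataFrame({'RT': [1.6, 2.1]})
df_b = df_a  # a second name for the same object, not a copy

print(df_b is df_a)          # True
print(id(df_b) == id(df_a))  # True
```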


This means that if we change anything about df_new, we also change df:

In [262]:
print(df)

# let's change the reaction time for participant 8 in our df_new variable
# (.at replaces the set_value-method, which was removed in pandas 1.0)
df_new.at[7, 'RT'] = np.random.random()

print(df) # and print df again

   Subject        RT   Age
0      1.0  1.600000  21.0
1      2.0  2.100000  32.0
2      3.0  1.700000  25.0
3      4.0  2.400000  37.0
4      5.0  1.200000  23.0
5      6.0  1.300000  23.0
6      7.0  2.500000  51.0
7      8.0  0.929724  61.0
   Subject        RT   Age
0      1.0  1.600000  21.0
1      2.0  2.100000  32.0
2      3.0  1.700000  25.0
3      4.0  2.400000  37.0
4      5.0  1.200000  23.0
5      6.0  1.300000  23.0
6      7.0  2.500000  51.0
7      8.0  0.270745  61.0


This can be avoided by actually making a copy. Note that df.copy() makes a deep copy by default (deep=True), so the data itself is duplicated.

The difference between shallow and deep copying is only relevant for compound objects (objects that contain other objects, like lists or class instances):

- A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original.
- A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.
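
The two definitions above can be illustrated with Python's standard `copy` module. The dictionary below is hypothetical example data; the point is only what happens to the inner lists:

```python
import copy

# A compound object: a dict that contains lists
original = {'RT': [1.6, 2.1], 'Age': [21, 32]}

shallow = copy.copy(original)    # new dict, but the inner lists are shared
deep = copy.deepcopy(original)   # new dict with its own copies of the inner lists

original['RT'].append(9.9)       # mutate one of the inner lists

print(shallow['RT'])                    # [1.6, 2.1, 9.9] -> sees the change
print(deep['RT'])                       # [1.6, 2.1]      -> unaffected
print(shallow['RT'] is original['RT'])  # True
```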


In [263]:
df_new = df.copy()

print(df)

# let's change the reaction time for participant 1 in our df_new variable
# (.at replaces the set_value-method, which was removed in pandas 1.0)
df_new.at[0, 'RT'] = 44.6

print(df) # and print df again

   Subject        RT   Age
0      1.0  1.600000  21.0
1      2.0  2.100000  32.0
2      3.0  1.700000  25.0
3      4.0  2.400000  37.0
4      5.0  1.200000  23.0
5      6.0  1.300000  23.0
6      7.0  2.500000  51.0
7      8.0  0.270745  61.0
   Subject        RT   Age
0      1.0  1.600000  21.0
1      2.0  2.100000  32.0
2      3.0  1.700000  25.0
3      4.0  2.400000  37.0
4      5.0  1.200000  23.0
5      6.0  1.300000  23.0
6      7.0  2.500000  51.0
7      8.0  0.270745  61.0


<div class='alert alert-warning'>
**ToDo**: Make your own example of this Python phenomenon using a Python list.
</div>

In [221]:
# Make your example of what happens when you don't use copy() on a list here