# `AA Workshop 03` - Libraries: NumPy & Pandas

In this tutorial we will introduce the concept of Python libraries and cover two very important ones: NumPy and Pandas.

We will go through the following:

- Python dictionary and list comprehensions
- Introduction to the concept of `Python Libraries`
- Introduction to `NumPy`
- Introduction to `Pandas`

---

## Python Iterable Comprehensions

Comprehensions are a concise way of transforming one iterable into another. During this transformation, items of the original iterable can be conditionally included in the new one, and each item can be transformed as needed. Comprehensions are a powerful concept and can often replace `for loops` and `lambda functions`. Every comprehension can be rewritten as a `for loop`, but not every `for loop` can be written as a comprehension. (We will talk about `lambda functions` in future workshops.)

A good comprehension can make your code more expressive and, thus, easier to read. The key when creating comprehensions is to not let them get so complex that your head spins when you try to decipher what they are actually doing. Always keep the idea of *easy to read* alive.

The general template you can follow for comprehensions in Python is:

`dict_variable = {key:value for (key,value) in dictionary.items()}`

`list_variable = [value for value in list]`

In [1]:
dict1 = {'a':1, 'b':2, 'c':3, 'd':4}

Remember that the `items()` method gives you access to each key-value pair in a dictionary. You can access the values and keys of a dictionary by using `values()` and `keys()`, respectively.

In [2]:
print(dict1.items())
print(dict1.values())
print(dict1.keys())

dict_items([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
dict_values([1, 2, 3, 4])
dict_keys(['a', 'b', 'c', 'd'])


Here are some examples of dictionary comprehensions:

In [3]:
# square each value in the dictionary
new_dict1 = {k:v**2 for (k,v) in dict1.items()}
print(new_dict1)

{'a': 1, 'b': 4, 'c': 9, 'd': 16}


In [4]:
# exchange keys and values
new_dict2 = {v:k for (k,v) in dict1.items()}
print(new_dict2)

{1: 'a', 2: 'b', 3: 'c', 4: 'd'}


In [5]:
# change the keys of the original dictionary
new_dict3 = {k+'a':v for (k,v) in dict1.items()}
print(new_dict3)

{'aa': 1, 'ba': 2, 'ca': 3, 'da': 4}


### Conditional Comprehensions

Consider the following problem: you want to create a new dictionary with the even numbers from 0 up to (but not including) 10 as keys and the square of each number as values. One solution is to use a `for` loop. Another solution is to use a dictionary comprehension as follows:

In [6]:
new_dict4 = {n:n**2 for n in range(10) if n%2 == 0}
print(new_dict4)

{0: 0, 2: 4, 4: 16, 6: 36, 8: 64}
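
For comparison, the `for` loop solution mentioned above can be sketched as follows. The names `loop_dict` and `comp_dict` are introduced here only for illustration.

```python
# build the mapping with an explicit for loop
loop_dict = {}
for n in range(10):
    if n % 2 == 0:            # keep even numbers only
        loop_dict[n] = n**2   # key: the number, value: its square

# the comprehension expresses the same logic in a single line
comp_dict = {n: n**2 for n in range(10) if n % 2 == 0}

print(loop_dict == comp_dict)
```

Both versions produce the same dictionary; the comprehension is simply more compact.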


You can do the same with any other iterable, especially lists:

In [7]:
ls1 = [i for i in range(10)]
ls1

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [8]:
ls2 = [str(i**2) for i in range(10) if i%3==0 ]
print(ls2)

['0', '9', '36', '81']


In [10]:
quote = "life, uh, finds a way"
unique_vowels = {i for i in quote if i in 'aeiou'}
vowels = [i for i in quote if i in 'aeiou']
print(unique_vowels)
print(vowels)

{'e', 'u', 'i', 'a'}
['i', 'e', 'u', 'i', 'a', 'a']


---

## `Python Libraries`

**Introduction to Libraries**

A library (or module, or package) is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.

You can use any Python source file as a module by executing an import statement in some other Python source file. The import has the following syntax:


```
import <module name>
```

By convention, many modules are imported under an abbreviated name. This imports the module in exactly the same way as `import <module name>`, the only difference being that it is available as `<module name abbreviation>`. In the case of `numpy`, for example, the abbreviation `np` is used.

```
import <module name> as <module name abbreviation>
```


In [2]:
# import numpy
import numpy as np

**Adding/Installing Libraries**

To add Python libraries to your installation you can use the `conda` package manager that we have installed in the last workshop. Alternatively you can also use the `pip` package manager. The quickest way to do so is via a terminal:

* If you are on a **Windows** computer, use the "Anaconda Command Prompt" from the Start menu. 
* On a **Mac**, start up the "Terminal". 
* In **Linux**, use any of the terminals available.


The general command syntax is the following:

```
conda install <package name>
```

If you are looking for a specific package but are unsure of the exact command line name do a quick google search and/or check the [Anaconda Cloud](https://anaconda.org). Another approach to find the exact name of a package before installation is using the conda search command in terminal; the syntax is the following:

```
conda search <name>
```


**Relevant Libraries for this course**

There is a large variety of open source libraries available in Python. Below is a list of some of the most relevant ones for data science, which will be covered in this course.

* Selected data science libraries

    * Data Analysis and Processing
    >* Pandas (pd)
    >* NumPy (np)
    * Visualization        
    >* matplotlib and pyplot (plt)
    >* seaborn (sns)
    * Models and methods
    >* Scikit Learn (sklearn)
    >* statsmodels

**Executing commands from within Jupyter:** For your convenience, it is possible to execute commands from within Jupyter cells using a line prefixed with `!`.
In the example, we list (`ls`) the current working directory, and format it for human-readable output (`-h`) and to include more detail (`-l`). Commands are executed in the local user shell and in the conda environment that the notebook uses.

In [1]:
!ls -lh

The command "ls" is either misspelled or
could not be found.
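
The error above is what the Windows command prompt reports, since it has no `ls` command (its equivalent is `dir`). A portable alternative, sketched here using only the standard library, is to list the directory from Python itself. The variable name `entries` is chosen for illustration.

```python
from pathlib import Path

# list the entries of the current working directory, independent of the operating system
entries = sorted(p.name for p in Path.cwd().iterdir())
print(entries)
```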


---

## `NumPy`

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes, and the number of axes is called the rank (available as the `ndim` attribute).

In today's short overview tutorial we will cover the following:

1. **Creating NumPy Arrays**
1. **Modifying (manipulating) NumPy Arrays**
1. **NumPy Array Operations**

Let's get started...

### Creating NumPy Arrays

First, we can use `np.array` to create arrays from Python lists. Unlike Python lists, **NumPy is constrained to arrays that all contain the same type**. If types don't match, NumPy will upcast if possible.

In [3]:
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [None]:
B = np.array([3.1, 5, 4, 6])
B

In [None]:
D =np.array([1,2,'a',3])
D

You can read more about string types and other data type objects in the [numpy documentation](https://numpy.org/doc/stable/reference/arrays.dtypes.html).
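
To make the upcasting rule concrete, the following sketch inspects the `dtype` that NumPy chooses for three input lists. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3, 4, 5])      # all integers -> integer dtype
mixed = np.array([3.1, 5, 4, 6])      # integers and a float -> upcast to float
with_str = np.array([1, 2, 'a', 3])   # numbers and a string -> upcast to a Unicode string dtype

print(ints.dtype, mixed.dtype, with_str.dtype)
print(with_str)  # every element is now a string
```

Note that once a string is present, the numbers are stored as strings as well, so arithmetic on such an array no longer works.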

If we want to explicitly set the data type of the resulting array, we can use the `dtype` keyword.

In [None]:
C = np.array([1,2,3,8],dtype = float)
print (C)

Other examples of creating arrays using np functions:

In [None]:
# create vector of length 5 filled with zeros
E = np.zeros(5,dtype = float)

# create 2x4 matrix of ones (float)
F = np.ones((2,4), dtype= float)

# create vector from 0 up to (but excluding) 12 in steps of 2
G = np.arange(0,12,2)

# create vector from 0 to 1 with five equally (linearly) spaced elements 
H = np.linspace(0,1,5)

# create a 2x2 matrix with random floats in the half-open interval [0.0, 1.0)
I = np.random.random((2,2))

# return random integers from 0 (inclusive) to 10 (exclusive) of size (4,3,2)
J = np.random.randint (0,10,(4,3,2))

print("E =", E,
      "\n\nF =", F, 
      "\n\nG =", G, 
      "\n\nH =", H, 
      "\n\nI =", I, 
      "\n\nJ =", J)

### Modifying (manipulating) NumPy Arrays

Data manipulation in Python is nearly synonymous with NumPy array manipulation. We will cover a few categories of basic array manipulations here:
- **Attributes of arrays**: Determining the size, shape, memory consumption and data type of arrays.
- **Indexing of arrays**: Getting and setting the value of individual array elements.
- **Slicing of arrays**: Getting and setting smaller subarrays within a larger array.
- **Reshaping of arrays**: Changing the shape of a given array.
- **Joining and splitting of arrays**: Combining multiple arrays into one, and splitting one array into many.

#### NumPy Array Attributes:
The following cell shows some examples of array attributes.

In [None]:
# returns the number of dimensions (axes)
print("E ndim: ", E.ndim)

# returns shape in form (#rows, #cols)
print("F shape: " , F.shape)

# returns size (i.e. no of elements)
print("J size: ", J.size)

# returns data type
print("H dtype: ", H.dtype)

# returns length of one array element in bytes
print("itemsize: ", I.itemsize," bytes")

# returns total bytes consumed by the elements of the array
print("nbytes:  ", I.nbytes, "bytes")

#### NumPy Array Indexing

You can easily access single elements as you already know from Python:

In [None]:
# this is the full array
A

In [None]:
print("The 4th element of A is {}". format(A[3]))
print("The last element of A is {}". format(A[-1])) # index from the back

In a multidimensional array (e.g. a matrix), you access items using a comma-separated tuple of indices.

In [None]:
# remember multidimensional array I
I

In [None]:
print ("The first element of I is {}". format(I[0,0]))  #array[row,column]
print ("The last element of I is {}". format(I[1,1]))   #array[row,column]

#### NumPy Array Slicing:

**One-dimensional arrays**

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon `:` character. The syntax is as follows:

` X[start (incl.):stop (excl.):step]`

In [None]:
# remember array G
G

In [None]:
print("middle subarray:", G[2:4])
print("First 4 elements:", G[:4])
print("Last 3 elements:", G[-3:] )
print("Every other element:", G[::2])
print("All elements reversed:", G[::-1])

**Multi-dimensional arrays**

Multi-dimensional slices work in the same way, with multiple slices separated by commas. The syntax is:

`X[slice row, slice column]`

In [None]:
# create a multi-dimensional array K
K = np.random.randint(0,20, (3,4))
K

In [None]:
print ("The first two rows and the first three columns: \n", K[:2,:3])
print("All rows and every other column:\n", K[:,::2])
print("All rows and columns reversed:\n",K[::-1,::-1] )

#### NumPy Array Reshaping:

Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape method. Note that for this to work, the size of the initial array must match the size of the reshaped array.

In [None]:
# return evenly spaced values within a given interval
Y = np.arange(1,25)
Y

In [None]:
len(Y)

In [None]:
# we can re-shape this array into any shape with 24 elements
Y.reshape(6,4)
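
The size requirement can be checked directly. The sketch below also uses `-1` as a placeholder, which asks NumPy to infer that dimension from the array size; the variable name `auto` is illustrative.

```python
import numpy as np

Y = np.arange(1, 25)   # 24 elements

# -1 lets NumPy infer the missing dimension: 24 / 4 = 6 columns
auto = Y.reshape(4, -1)
print(auto.shape)

# a target shape whose size differs from 24 is rejected
try:
    Y.reshape(5, 5)
except ValueError as err:
    print("reshape failed:", err)
```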

Similarly, you can easily transpose multi-dimensional arrays using `T`:

In [None]:
# remember K
K

In [None]:
# transpose K
K.T

#### NumPy Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.

**Concatenation of arrays**

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routine `np.concatenate`. Additionally `np.vstack`, and `np.hstack` may be used.

In [None]:
# one-dimensional
P = np.array([1,2,3])
Q = np.array([4,5,6])
np.concatenate((P,Q))

In [None]:
# multi-dimensional
R = np.array([[3,5,7],[1,3,5]])
S = np.array([[2,4,2],[0,9,8]])
print(R, "\n")
print(S)

In [None]:
# pass the axis along which the arrays should be joined
print(np.concatenate((R,S), axis = 0), "\n")
print(np.concatenate((R,S), axis = 1), "\n")

For working with arrays of mixed dimensions, it can be more practical to use the `np.vstack` (vertical stack) and `np.hstack` (horizontal stack) functions:

In [None]:
# stack row-wise
np.vstack((R,S))

In [None]:
# stack column-wise
np.hstack((R,S))
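
The cells above stack two arrays of identical shape. As a small sketch of the mixed-dimension case, the example below stacks a one-dimensional array and a single-column array onto `R`; the arrays `row` and `col` are made up for illustration.

```python
import numpy as np

R = np.array([[3, 5, 7], [1, 3, 5]])   # shape (2, 3)
row = np.array([9, 9, 9])              # shape (3,)
col = np.array([[0], [0]])             # shape (2, 1)

stacked_rows = np.vstack((row, R))     # the 1-D array is treated as one extra row
stacked_cols = np.hstack((R, col))     # the (2, 1) array is appended as one extra column

print(stacked_rows.shape, stacked_cols.shape)
```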

**Splitting of arrays**

The opposite of concatenation is splitting, which is implemented by the functions `np.split`, `np.hsplit`, and `np.vsplit`. For each of these, we can pass either an integer (the number of equally sized sub-arrays) or a list of indices indicating the split points:

In [None]:
# divides the array x into 4 equal subarrays
x = np.array([2,4,6,7,8,9,1,3,11,35,55,34])
print(np.split(x,4))

In [None]:
x1, x2, x3, x4 = np.split(x,4)
print(x1, x2, x3, x4)
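
The cells above pass an integer. As a sketch of the second option, passing a list of indices splits the array at exactly those positions, so the pieces may differ in length. The names `first`, `second` and `rest` are illustrative.

```python
import numpy as np

x = np.array([2, 4, 6, 7, 8, 9, 1, 3, 11, 35, 55, 34])

# split points 3 and 5 give the pieces x[:3], x[3:5] and x[5:]
first, second, rest = np.split(x, [3, 5])
print(first, second, rest)
```

With N split points, the result always consists of N + 1 sub-arrays.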

In [None]:
# create multi-dimensional array
Z = np.arange(16).reshape((4, 4))
Z

In [None]:
# splits array Z into multiple sub-arrays vertically (row-wise)
upper, lower = np.vsplit(Z, 2)
print("upper: \n",upper)
print("lower: \n",lower)

In [None]:
# splits array Z into multiple sub-arrays horizontally (column-wise)
left, right = np.hsplit(Z, 2)
print("left: \n",left)
print("right: \n",right)

### NumPy Array Operations

NumPy also allows for **element-wise** as well as linear algebra **matrix-type** operations, which are a key component of scientific computing tasks.

In [None]:
# create two one-dimensional arrays
A = np.array([1,2,3,4])
B = np.array([9,3,-9,6])

**Element-wise operations**

In [None]:
# element-wise addition
C=A+B
C

In [None]:
# element-wise subtraction
D=A-B
D

In [None]:
# element-wise multiplication
E=A*B
E

In [None]:
# element-wise division
F=A/B
F

**Matrix operations**

In [None]:
# create two 2x5 matrices
M = np.arange(10).reshape(2,5)
N = np.random.randint(1,10,10).reshape(2,5)
print(M, "\n")
print(N)

In [None]:
# note: performing a matrix multiplication on two 2x5 matrices is not possible
M@N

In [None]:
# We can transpose one of the matrices to obtain a 5x2 matrix, then the operation works
M@N.T
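
The shape rule behind these two cells is that the number of columns of the left matrix must equal the number of rows of the right matrix. The sketch below illustrates this; note that `N` is filled with ones here instead of random integers (an assumption made only to keep the result reproducible).

```python
import numpy as np

M = np.arange(10).reshape(2, 5)    # shape (2, 5)
N = np.ones((2, 5), dtype=int)     # shape (2, 5), fixed values instead of random ones

# (2, 5) @ (5, 2): the inner dimensions match, the result has shape (2, 2)
product = M @ N.T
print(product.shape)

# (2, 5) @ (2, 5): the inner dimensions 5 and 2 differ, so NumPy raises a ValueError
try:
    M @ N
except ValueError as err:
    print("matmul failed:", err)

# for one-dimensional arrays, * is element-wise while @ is the dot product
A = np.array([1, 2, 3, 4])
B = np.array([9, 3, -9, 6])
print(A * B, A @ B)
```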

---


## `Pandas`

Pandas is a newer package built on top of `NumPy`, and provides an efficient implementation of a `DataFrame`, the core Pandas object. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs. We will use `Pandas` as the main tool to structure, manipulate and store data throughout this course. As such, Pandas constitutes a core data science library that all of you should be very well familiar with.

Let's get started by importing pandas.

In [None]:
import pandas as pd

### Introduction to `Pandas` objects

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.



#### The `Pandas Series` Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
A = [12,13,14,15,16,17]
A

In [None]:
F = pd.Series(A)
F

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [None]:
F.values

The index is an array-like object of type pd.Index, which we'll discuss in more detail later.

In [None]:
F.index

Like with a `NumPy` array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
print(F[2])

In [None]:
print(F[2:5])

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. **The essential difference is the presence of the index**: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index:


In [None]:
C = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
C

In [None]:
print(C['c'])

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [None]:
population_dict.keys()

In [None]:
# slicing a dict does not work
population_dict['California':'New York']

In [None]:
# slicing a Series object works
population['California':'New York']

#### The `Pandas DataFrame` Object

The next fundamental structure in `Pandas`  is the `DataFrame` . Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area


Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states
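
To illustrate what "aligned" means in practice, the following sketch builds a DataFrame from two Series whose labels are in a different order and only partly overlap. The Series `pop_small` and `area_small` are shortened, illustrative versions of the data above.

```python
import pandas as pd

# same labels in a different order, and 'New York' is missing from area_small
pop_small = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
area_small = pd.Series({'Texas': 695662, 'California': 423967})

# rows are matched by index label, not by position
aligned = pd.DataFrame({'population': pop_small, 'area': area_small})
print(aligned)
```

Labels that are missing in one of the Series are filled with `NaN`, which is why the `area` column is converted to floats.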

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [None]:
states.index

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [None]:
states.columns

Thus, the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

#### The `Pandas Index` Object

We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

In [None]:
ind[0]

In [None]:
ind[::2] # every second object starting at 0

Index objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

One difference between Index objects and NumPy arrays is that indices are immutable – that is, they cannot be modified via the normal means.

In [None]:
# this does not work with index objects
ind[1] = 0
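
Both views of the Index (immutable array and ordered set) can be tried out in a short sketch. The indices `indA` and `indB` are made-up examples.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))   # labels present in both
print(indA.union(indB))          # labels present in either

# assignment is rejected because Index objects are immutable
try:
    indB[1] = 0
except TypeError as err:
    print("cannot modify Index:", err)
```

Immutability makes it safe to share one Index between several Series and DataFrames without side effects.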

### Data selection in `Pandas`

#### Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us understand the patterns of data indexing and selection in these arrays.

Like a dictionary, the Series object provides a mapping from a collection of keys (i.e. the index) to a collection of values:

In [None]:
C

In [None]:
print(C['d'])

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [None]:
C.keys()

In [None]:
list(C.items())

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:


In [None]:
# add value
C['e'] = 1.25
C

In [None]:
# change value
C['a'] = 0
C

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:

In [None]:
# slicing by explicit index (note: including last)
C['a':'c']

In [None]:
# slicing by implicit integer index (note: excluding last)
C[0:2]

In [None]:
C[C>=0.5]

In [None]:
# masking
C[(C > 0.3) & (C < 0.8)]

In [None]:
# multiple indexing
C[['a', 'e', "c"]]

Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., `C['a':'c']`), the final index is included in the slice, while when slicing with an implicit index (i.e., `C[0:2]`), the final index is excluded from the slice.

**Indexers: loc and iloc**

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c', 'd'], index=[1, 3, 5,7])
data

In [None]:
# explicit index when indexing
data[5]

In [None]:
# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
First, the `.loc` attribute allows indexing and slicing that always references the **explicit index**:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The `.iloc` attribute allows indexing and slicing that always references the **implicit Python-style index**:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

One guiding principle of Python code is that **"explicit is better than implicit"**. The explicit nature of `.loc` and `.iloc` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using them, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.

#### Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [None]:
data['pop']

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

In [None]:
data.values

With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [None]:
data.T

When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [None]:
data.values[0]

...while passing a single "index" to a DataFrame accesses a column:

In [None]:
data['area']

Thus, for array-style indexing, we need another convention. Here Pandas again uses the `.loc` and `.iloc` indexers mentioned earlier. Using the `.iloc` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [None]:
data.iloc[:3, :]

Similarly, using the `.loc` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
data.loc[:'Texas', :'pop']

We can also pass conditions to the indexer, a technique called **masking**.

In [None]:
# masking with .loc
data.loc[data.area > 200000]

In [None]:
# masking with explicit column index
new_df = data[data["area"] > 200000]
new_df

Masks are conditional statements that are evaluated for every element (i.e., row) of the DataFrame.
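If you want to combine multiple conditions, use the element-wise boolean operators `&` (and) and `|` (or); negation is written as `~`. Wrap each condition in parentheses, because these operators bind more tightly than comparisons. Be aware that the Python keywords `and`, `or` and `not` do not work on masks.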

In [None]:
data[(data["area"] > 100000) & (data["area"] <= 180000)]
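
The remaining operators can be sketched in the same way. The DataFrame `df_states` below repeats the area values from above under a new, illustrative name so that the example is self-contained.

```python
import pandas as pd

df_states = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'],
)

# | : rows that satisfy at least one of the conditions
either = df_states[(df_states['area'] > 400000) | (df_states['area'] < 145000)]

# ~ : rows that do NOT satisfy the condition
small = df_states[~(df_states['area'] > 200000)]

# the keyword `and` tries to reduce each mask to a single True/False and fails
try:
    df_states[(df_states['area'] > 100000) and (df_states['area'] <= 180000)]
except ValueError as err:
    print("use & instead of and:", err)

print(either.index.tolist(), small.index.tolist())
```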

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you are accustomed to from working with NumPy:

In [None]:
data.iloc[0, 1] = 90000000
data

### Data analysis and manipulation in `Pandas`

#### Hierarchical Indexing

**Using MultiIndex to create hierarchical DataFrames**

Often it is useful to store higher-dimensional data, i.e. data indexed by more than one or two keys. The most appropriate Pandas object for this is the `MultiIndex`. We will demonstrate this using a simple example:

In [None]:
# we use a 2-D example, first specifying the outer and inner dimensions as well as the values
Outer = ['California','California',
         'New York','New York',
         'Texas','Texas']

Inner = [2000,2010,
         2000,2010,
         2000,2010]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# From the outer and inner dimensions we create an index
index_list = list(zip(Outer,Inner))
index_list

In [None]:
# based on this index a multi index can be created
index_object = pd.MultiIndex.from_tuples(index_list)
index_object

You can generate the same result using the `MultiIndex.from_arrays()` function from **Pandas**. Keep in mind that all arrays must have the same length.

In [None]:
mul_ind = pd.MultiIndex.from_arrays([Outer, Inner])
mul_ind

In [None]:
# Finally we create a data frame with hierarchical indexes
df = pd.DataFrame(index=index_object,
                  data=populations,
                  columns=["Population"])

df
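
Once the hierarchical index is in place, rows can be selected by the outer label alone or by the full (outer, inner) tuple. The following is a minimal sketch that rebuilds the same data under the illustrative name `df_pop`.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([
    ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    [2000, 2010, 2000, 2010, 2000, 2010],
])
df_pop = pd.DataFrame({'Population': [33871648, 37253956,
                                      18976457, 19378102,
                                      20851820, 25145561]}, index=index)

# selecting by the outer label returns all inner entries for that state
california = df_pop.loc['California']

# selecting by the full (outer, inner) tuple and a column returns a single value
tx_2010 = df_pop.loc[('Texas', 2010), 'Population']

print(california)
print(tx_2010)
```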

In [None]:
area = [164,164, #California 
        55,55,   #NY State
        269,269] #Texas

df["Area"] = area

df

In [None]:
# again, we can easily transpose this DataFrame
df.transpose()

**MultiIndex as extra dimension**

You might notice that we could have stored the same data using a simple DataFrame with index and column labels. The `unstack()` method allows us to convert a hierarchically indexed Series or DataFrame into a conventionally indexed DataFrame.

In [None]:
df.unstack()["Population"]

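By default, `unstack()` pivots the inner-most level of the (necessarily hierarchical) index labels and returns a DataFrame with a new level of column labels whose inner-most level consists of the pivoted index labels. The level can also be passed explicitly; `level=-1` refers to the inner-most level: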

In [None]:
df.unstack(level = -1)

#### Merging data in `Pandas`

Pandas also makes it easy to combine Series and DataFrames. Multiple Series can be joined to form a DataFrame, while multiple DataFrames can be joined in a table-style join (as in databases). Passing a list of Series to `pd.DataFrame` treats each Series as a row:

In [None]:
s1 = pd.Series([1,1,2,2,3,3], name = 'Series 1')
s2 = pd.Series([1.1, 1.2, 2.2, 2.4, 3.3, 3.6], name = 'Series 2')
 
pd.DataFrame([s1, s2])

If we want them to represent columns, we need to supply them as values of a dictionary indexed by column names:

In [None]:
s1s2 = pd.DataFrame({'Series 1': s1, 'Series 2': s2})
s1s2

In [None]:
# merge a DataFrame and a (named) Series by index
pd.merge(s1s2, s1, left_index=True, right_index=True)

Here, we join on the index, but we could also join on another column:

In [None]:
a = pd.DataFrame({
    'key': [1,2,4,5,5,6],
    'value': [0.1, 0.2, 0.4, 0.5, 0.55, 0.6]
})

b = pd.DataFrame({
    'key': [1,2,4,5,5,6],
    'value': pd.Series([0.1, 0.2, 0.4, 0.5, 0.55, 0.6]) ** 2
})

pd.merge(a, b, on = 'key')

Observe that the result has more rows than either of the individual DataFrames. This is because `pd.merge` performs an inner join by default and returns every matching pair of rows: the key `5` appears twice in both DataFrames and therefore produces four (2 x 2) rows.
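
The row count can be verified with a short sketch that repeats the two DataFrames from above and inspects the merged result (the name `merged` is illustrative).

```python
import pandas as pd

a = pd.DataFrame({'key': [1, 2, 4, 5, 5, 6],
                  'value': [0.1, 0.2, 0.4, 0.5, 0.55, 0.6]})
b = pd.DataFrame({'key': [1, 2, 4, 5, 5, 6],
                  'value': a['value'] ** 2})

merged = pd.merge(a, b, on='key')     # inner join is the default

print(len(a), len(b), len(merged))    # the merged frame is longer than both inputs
print(merged[merged['key'] == 5])     # 2 x 2 = 4 combinations for key 5
print(merged.columns.tolist())        # overlapping column names receive the suffixes _x and _y
```

The keys 1, 2, 4 and 6 contribute one row each and the key 5 contributes four, giving eight rows in total.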

#### Quick data analysis in `Pandas`

In practice your dataset will often be considerably larger than the illustrative example above. In these cases it is often useful to run brief descriptive statistical analyses on the set, which will help to get a first feel for the data. 

We will demonstrate how to do this using the provided `iris.csv` dataset. This is a famous dataset used frequently for educational purposes in data science. You can read up on the dataset, its content and its origins here: https://en.wikipedia.org/wiki/Iris_flower_data_set.

The iris dataset contains measurements for 150 iris flowers from three different species.

The __three classes__ in the Iris dataset:
* Iris-setosa (n=50)
* Iris-versicolor (n=50)
* Iris-virginica (n=50)

The __four features__ of the Iris dataset:
* sepal length in cm
* sepal width in cm
* petal length in cm
* petal width in cm

<img src="./iris.original.png" width="600" height="400"/>

In [None]:
# first read in the data as a dataframe
# you can optionally set the index by using the index_col argument
Iris_set = pd.read_csv("../data/iris.csv", index_col="number")

In [None]:
Iris_set.columns

In [None]:
# display the first five rows
Iris_set.head()

In [None]:
# display the last five rows
Iris_set.tail()

In [None]:
# the describe function provides an overview of key descriptive statistics by columns
Iris_set.describe()

In [None]:
# get info on data types and counts
Iris_set.info()

In [None]:
# you can also focus your analysis on individual features by indexing them
Iris_set["Sepal.Width"].describe()

In [None]:
# additionally you can call the functions separately, e.g. max
Iris_set.max()

#### Handling missing numerical data (`NaN` values)
If you have paid attention to the descriptive statistics above, you will have noticed that the __counts__ of values differ across features. This is a first indication of missing numerical data. In the real world you will often encounter incomplete and noisy data, which require pre-processing before you can apply data science and machine learning tools to them. In this section we will provide a quick overview of how to deal with missing data in your datasets.
Note that there are several ways in which Python, Pandas, and other packages might represent missing data:
* `NaN` - (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Pandas displays missing floating-point values as `NaN`
* `pd.NA` - is a pandas-specific missing-value marker that can be used consistently across data types, for example with the nullable integer, boolean, and string types
* `None` - Python-specific object that is often used for missing data in Python code. Because it is a Python object, `None` can only be used in arrays with data type 'object' (i.e., arrays of Python objects)
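
As a small illustration of these three markers, the sketch below builds a few toy Series (the names `s_float`, `s_int`, and `s_obj` are made up for this example) and checks how the missing entry is stored in each case.

```python
import numpy as np
import pandas as pd

# None in a numeric Series is converted to NaN and the dtype becomes float64
s_float = pd.Series([1, None, 3])
print(s_float.dtype)

# with the nullable integer dtype "Int64", the missing entry is pd.NA
s_int = pd.Series([1, None, 3], dtype="Int64")
print(s_int[1] is pd.NA)

# in an object array, None is kept as the Python object it is
s_obj = pd.Series(["a", None], dtype=object)
print(s_obj[1] is None)

# NaN never compares equal to itself, so test with isnull() instead of ==
print(np.nan == np.nan)
print(s_float.isnull().tolist())
```

Whichever marker is used internally, `isnull()` treats all three as missing.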

**Detecting missing numerical data**

Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`. Either one will return a Boolean mask over the data.

In [None]:
# isnull() returns "True" for every missing numerical value in the dataset
Iris_set.isnull()

In [None]:
# notnull() returns "False" for every missing numerical value in the dataset
Iris_set.notnull()
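
A Boolean mask is rarely inspected cell by cell; it is usually summarized. The sketch below uses a small, hypothetical stand-in for the iris data (the DataFrame `toy` is invented for this example) to count missing values per column and to select the affected rows.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the iris data with a few gaps
toy = pd.DataFrame({
    "Sepal.Length": [5.1, np.nan, 4.7, 4.6],
    "Sepal.Width": [3.5, 3.0, np.nan, np.nan],
    "Species": ["setosa", "setosa", "setosa", "setosa"],
})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# select the rows that contain at least one missing value
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(rows_with_gaps)
```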

#### Dealing with missing numerical data
In essence, there are two approaches to dealing with missing data:
* __Elimination__: Dropping null values from the dataset
* __Imputation__: Imputing/replacing null values with numerical values/estimates

**Elimination (i.e. dropping null values)**

We cannot drop single values from a data frame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so `dropna()` gives a number of options:

We use the `dropna(axis=0)` method to drop all rows from the dataset that include null values. Note that `dropna()` will not modify your dataframe unless you call `dropna(axis=0, inplace=True)`.

In [None]:
Iris_set.dropna(axis=0, inplace=False)

In [None]:
# note that the dataset is unchanged
Iris_set

We use the `dropna(axis=1)` method to drop all columns from the dataset that include null values. Note that, also in this case, `dropna()` will not modify your dataframe unless you call `dropna(axis=1, inplace=True)`.

In [None]:
Iris_set.dropna(axis=1)

In [None]:
# again, note that dropna() does not modify the original dataframe!
len(Iris_set)

In [None]:
# we can therefore also use dropna() to easily identify the number of rows with missing values 
len(Iris_set)-len(Iris_set.dropna())

In [None]:
# if you wish to clean the data with dropna it is good practice to define a new data frame
Iris_set_clean = Iris_set.dropna(axis=0)

In [None]:
len(Iris_set_clean)
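
Besides `axis`, `dropna()` has further parameters that control which rows are removed. The following sketch shows `how`, `subset`, and `thresh` on a small, invented DataFrame (`toy` is not part of the iris data).

```python
import numpy as np
import pandas as pd

# invented example data: the last row is missing entirely
toy = pd.DataFrame({
    "Sepal.Length": [5.1, np.nan, 4.7, np.nan],
    "Sepal.Width": [3.5, 3.0, np.nan, np.nan],
    "Petal.Length": [1.4, 1.4, 1.3, np.nan],
})

# default: drop every row that has at least one missing value
print(len(toy.dropna()))

# drop a row only if all of its values are missing
print(len(toy.dropna(how="all")))

# only consider selected columns when deciding which rows to drop
print(len(toy.dropna(subset=["Sepal.Length"])))

# keep rows that have at least 2 non-missing values
print(len(toy.dropna(thresh=2)))
```

Which variant is appropriate depends on how much data you can afford to lose and on which columns your analysis actually uses.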

**Imputation (i.e. filling null values)**

Imputation fills null values with numerical values chosen by the data scientist. For this, `fillna()` provides the appropriate tools.

The data scientist will usually choose one of the following methods:
* Fill with zeros: `fillna(0)`
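* Conduct a forward fill (i.e. filling NaN values with the last valid value from a previous observation): `ffill()` (in older code also written as `fillna(method="ffill")`, which recent pandas versions deprecate)
* Conduct a backward fill (i.e. filling NaN values with the next valid value from a following observation): `bfill()` (in older code also written as `fillna(method="bfill")`)
* Fill with the column mean, max, min, etc.: `df["Col_name"].fillna(value=df["Col_name"].mean())`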
* Custom fill depending on the application

In [None]:
# fillna() inserts a value of your choice and returns a new object; assign the result to keep it
Iris_set[["Sepal.Length"]].fillna(value=Iris_set["Sepal.Length"].mean())

#### Data aggregation and grouping
The `groupby()` method groups rows and applies aggregation functions to each group. It is one of the most widely used methods in data science. We will return to the iris dataset for illustration purposes.

In [None]:
# we might want to group rows according to "Species" and assign the result to a new variable "Species_groups"
Species_groups = Iris_set.groupby("Species").describe()
Species_groups

In [None]:
# we might select an individual group across all features
# note that you need to transpose the DataFrame before you can select the group as a column
Iris_set.groupby("Species").describe().transpose()[["versicolor"]]

You can also apply aggregation functions to each group using groupby-apply syntax:

In [None]:
Iris_set.groupby("Species").max()

Here, we applied the `max` aggregation function to all groups, but we can restrict that to columns and even use different aggregation functions per column of interest using the more flexible `agg` functionality. We need to give the desired aggregation function for every column, and columns we don't explicitly list get omitted in aggregation:

In [None]:
Iris_set.groupby("Species").agg({'Sepal.Length': 'min', 'Sepal.Width': 'max', 'Petal.Length': 'median'})

You can check which functions are available by looking at the [Documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods).

Sometimes you might wish to give the resulting columns names that reflect what aggregation you used. You can use named aggregation for that, but must then use keyword arguments instead of passing a dictionary:

In [None]:
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'C', 'C', 'C'],
    'somevalue': [1, 2, 1.5, 1, 2, 3],
    'anothervalue': [1.1, 2.1, 1.6, 1.1, 2.1, 3.1]
})

In [None]:
df.groupby('group').agg(
    somevalue_avg = pd.NamedAgg(column='somevalue', aggfunc='mean'), 
    anothervalue_max = pd.NamedAgg(column='anothervalue', aggfunc='max'),     
)
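
Named aggregation also accepts plain `(column, aggfunc)` tuples in place of `pd.NamedAgg`. The sketch below repeats the aggregation on the same example data in that shorter form; the variable `result` is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'C', 'C', 'C'],
    'somevalue': [1, 2, 1.5, 1, 2, 3],
    'anothervalue': [1.1, 2.1, 1.6, 1.1, 2.1, 3.1]
})

# each keyword names an output column; the tuple is (input column, function)
result = df.groupby('group').agg(
    somevalue_avg=('somevalue', 'mean'),
    anothervalue_max=('anothervalue', 'max'),
)
print(result)
```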

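Unfortunately, named aggregation with keyword arguments only works if the new column names are valid Python identifiers. This is not the case, for example, when we want to name the resulting columns in our flower example using the pattern `agg.columnname`, because identifiers must not contain dots. The following cell therefore raises a `SyntaxError`: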

In [None]:
# intentionally invalid: keyword argument names must not contain dots (raises SyntaxError)
Iris_set.groupby("Species").agg(
    avg.Sepal.Length = pd.NamedAgg(column='Sepal.Length', aggfunc='mean'),
    median.Sepal.Length = pd.NamedAgg(column='Sepal.Length', aggfunc='median')
)

Here, we need a little trick (which is presented in the pandas documentation [here](https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation)): Python lets us unpack a dictionary into keyword arguments with `**`, and dictionary keys can be arbitrary strings. This is how we can circumvent the invalid keyword names:

In [None]:
Iris_set.groupby("Species").agg(
    **{
    'avg.Sepal.Length': pd.NamedAgg(column='Sepal.Length', aggfunc='mean'),
    'median.Sepal.Length': pd.NamedAgg(column='Sepal.Length', aggfunc='median')
    }
)
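
The mechanism behind this trick is plain Python and can be tried without pandas. In the sketch below, `collect` is a hypothetical helper that simply returns the keyword arguments it receives.

```python
def collect(**kwargs):
    """Return the keyword arguments the function received."""
    return kwargs

# a name with dots cannot be written as a keyword argument ...
try:
    compile("collect(avg.Sepal.Length=1)", "<example>", "eval")
    is_valid_syntax = True
except SyntaxError:
    is_valid_syntax = False
print(is_valid_syntax)

# ... but the same name is accepted as a dictionary key unpacked with **
received = collect(**{"avg.Sepal.Length": 1, "median.Sepal.Length": 2})
print(list(received))
```

The dictionary keys arrive in the function unchanged, which is why pandas can use them as column names.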

**Final note**:
To add a table of contents to your Jupyter notebook, you need to install the `nbextension` available at https://github.com/minrk/ipython_extensions.

---