## 1. Why Pandas?
### Working with (semi)structured data

The data types and functions covered in the previous Notebook belong to the Python standard library. This library is the backbone of the Python language, but the operations its functions perform (and the structures its types represent) remain rather basic. As writing code to handle more complex applications would be laborious, Python offers a large ecosystem of external libraries that provide a myriad of data science tools.

Below, I show how to open CSV files with Pandas (the Python data science library we will be using in this course).

### 1.1. CSV files

CSV stands for "Comma-Separated Values": a plain-text data format in which each line is a row and the cells within a row are separated by commas.

For example, the CSV table below has **columns** A, B, C and **rows** 1 to 3:

```
,A,B,C
1,0,1,1
2,1,0,1
3,1,1,1
```

The CSV above is an example of **structured** data: each element is properly identified by the **header** (columns) and the **index** (rows).

We can represent the CSV file as a Python string. Newlines are represented in Python by the `\n` character. The cell below creates a CSV string and prints it.

In [45]:
csv_string = ',A,B,C,D\n1,0,2,5,a\n2,8,3,1,b\n3,2,9,1,e'
print(csv_string)

,A,B,C,D
1,0,2,5,a
2,8,3,1,b
3,2,9,1,e


Using index notation (remember the square brackets?) we can select certain elements, both single characters and slices.

In [46]:
print(csv_string[11]) # print the character at index 11
print(csv_string[11:14]) # print the slice from index 11 up to (but not including) index 14

0
0,2


Index notation is handy, but the structure of the CSV file remains implicit: we cannot ask Python to give us the integer in column B at row 2. To recognize the structure properly, Python needs to **parse** the document. While this can be done with standard Python tools, there are external libraries that make handling structured data much easier.
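To give an impression of what parsing with standard Python tools could look like, here is a minimal sketch that uses the `csv` module from the standard library. The names `reader`, `table` and `value` are only illustrative and are not used elsewhere in this Notebook.

```python
import csv
from io import StringIO

csv_string = ',A,B,C,D\n1,0,2,5,a\n2,8,3,1,b\n3,2,9,1,e'

# DictReader uses the first line as the header. The empty first field
# becomes the key '' and holds the row label.
reader = csv.DictReader(StringIO(csv_string))
table = {row['']: row for row in reader}

# Look up column B at row 2. The csv module returns every value
# as a string, so we have to convert it to an integer ourselves.
value = int(table['2']['B'])
print(value)
```

This works, but we had to decide ourselves how to store the rows and how to convert the values. Pandas takes care of such details for us.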

In this course, we rely on Pandas, Python's main data science library. Before we can use Pandas, we have to load it using the `import` statement (which roughly means "please load all the tools stored in the Pandas folder").

In [47]:
import pandas # this imports pandas tools

After importing the Pandas tools we can use the `read_csv` function to read the data.

In [48]:
from io import StringIO # StringIO lets read_csv treat our string as if it were a file; this technicality is not important for this course
df = pandas.read_csv(StringIO(csv_string),index_col=0) # parse the CSV string into a Pandas DataFrame, using the first column as the row index
df

   A  B  C  D
1  0  2  5  a
2  8  3  1  b
3  2  9  1  e
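The `index_col=0` argument tells Pandas to use the first column of the file as row labels. The sketch below compares the result with and without this argument. The variable names `df_indexed` and `df_plain` are only illustrative, chosen so that the `df` variable above is left untouched.

```python
from io import StringIO
import pandas

csv_string = ',A,B,C,D\n1,0,2,5,a\n2,8,3,1,b\n3,2,9,1,e'

# With index_col=0 the first column supplies the row labels.
df_indexed = pandas.read_csv(StringIO(csv_string), index_col=0)
print(list(df_indexed.index))    # row labels taken from the file
print(list(df_indexed.columns))  # column names taken from the header

# Without index_col, Pandas numbers the rows itself (starting at 0)
# and keeps the first column as ordinary data under a generated name.
df_plain = pandas.read_csv(StringIO(csv_string))
print(list(df_plain.index))
print(list(df_plain.columns))
```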


### Exercise

What is the data type of the `df` variable?

In [49]:
# Insert code here

### --Important!--

Please note the following:
- The syntax for using tools from external libraries adheres to the following structure: `<library>.<function>(<arguments>)`
- As you noticed, `df` is not a string object but belongs to the DataFrame class (which we will inspect below).

Because Pandas identifies the rows and columns automatically, we can ask more meaningful and elaborate questions to extract specific cells or other elements from the DataFrame.

#### Exercise: 

To access specific areas in our DataFrame, we need to use the **`.loc[:,:]`** indexer. The syntax is similar to the index notation we studied previously but applies to two dimensions (rows *and* columns) instead of just one. These dimensions are separated by a comma: the row label comes first, the column label second. While index notation allowed you to retrieve an item from a sequence by its position, `.loc[]` is more flexible: it selects by label, and you can simultaneously define the rows and columns you'd like to access. To understand how this works, guess which part of the table each of the statements below will print.

#### Retrieving one cell

In [50]:
print(df.loc[2,'A'])

8


In [51]:
print(df.loc[1,'B'])

2


#### --Exercise--

Print the integer in row 3 and column C.

In [52]:
# insert code here

Multiply the integer in row 3 and column C with the integer in row 1 and column B.

In [53]:
# insert code here

#### Retrieving slices

In [54]:
print(df.loc[2,:])

A    8
B    3
C    1
D    b
Name: 2, dtype: object


In [55]:
print(df.loc[:,'A'])

1    0
2    8
3    2
Name: A, dtype: int64


It is also possible to obtain multiple columns. Note that the column names are then passed as a list, i.e. enclosed within their own pair of square brackets.

In [56]:
print(df.loc[:,['A','B']])

   A  B
1  0  2
2  8  3
3  2  9


Print the integers from row 1 to 2 in columns A and C.

In [57]:
# insert code here

We can also print all columns for a specific row:

In [58]:
print(df.loc[1,:])

A    0
B    2
C    5
D    a
Name: 1, dtype: object
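As a recap, `.loc[]` selects by label rather than by position, so it works just as well when the rows are labelled with text. The sketch below uses a small made-up table; the DataFrame `scores` and its labels are hypothetical and only serve as illustration.

```python
import pandas

# Hypothetical table whose rows are labelled with names
scores = pandas.DataFrame(
    {'math': [7, 9, 6], 'art': [8, 5, 10]},
    index=['ann', 'bob', 'cat'],
)

# One cell: row label first, column label second
print(scores.loc['bob', 'math'])

# A range of row labels. Unlike slices of strings or lists,
# the end label ('bob') is included in the result.
print(scores.loc['ann':'bob', ['math', 'art']])
```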


By now it should be clear that parsing the data with Pandas' `read_csv()` function makes structured data much easier to explore than a plain string combined with index notation from the Python standard library.

Admittedly, this is a toy example; the power of these tools will become clear when working with real data.

## Inspecting the DataFrame

The `.describe()` method offers a quick look at the summary statistics of the dataset.

In [59]:
df.describe()

              A         B         C
count  3.000000  3.000000  3.000000
mean   3.333333  4.666667  2.333333
std    4.163332  3.785939  2.309401
min    0.000000  2.000000  1.000000
25%    1.000000  2.500000  1.000000
50%    2.000000  3.000000  1.000000
75%    5.000000  6.000000  3.000000
max    8.000000  9.000000  5.000000


**Question**: Which **columns** are not shown in this summary? Can you explain why?

**Question**: Can you determine the meaning of the **rows**? If not, have a look at the Wikipedia pages for the [Mean](https://en.wikipedia.org/wiki/Mean), [Median](https://en.wikipedia.org/wiki/Median), [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) and [Percentiles](https://en.wikipedia.org/wiki/Percentile).

For more information about `.describe()`, we can open the help in Jupyter by adding `??` at the end of the line. This displays the documentation and, where available, the source code.

In [60]:
df.describe??

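The mean and the median are used to summarise a list of values, for example [1,2,4,5,8].
The median is the number you get when sorting the list in ascending order and picking the number in the middle. If the list contains an even number of values, the median is the average of the two middle numbers.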
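The definition above can be sketched in plain Python, using the example list [1,2,4,5,8]. For a list with an even number of values there is no single middle item; the sketch then averages the two middle values, which is the usual convention. The variable names are only illustrative.

```python
values = [1, 2, 4, 5, 8]

ordered = sorted(values)      # sort in ascending order
middle = len(ordered) // 2    # index of the middle item

if len(ordered) % 2 == 1:
    # odd number of values: take the middle item
    median = ordered[middle]
else:
    # even number of values: average the two middle items
    median = (ordered[middle - 1] + ordered[middle]) / 2

print(median)
```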


**Question**: compute the median value of 2,4,8,9,0.

The mean is obtained by summing up all values and dividing this sum by the total number of values.
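In plain Python this takes a single line with the built-in functions `sum()` and `len()`. A minimal sketch, again for the example list [1,2,4,5,8]:

```python
values = [1, 2, 4, 5, 8]

# sum of all values divided by the number of values
mean = sum(values) / len(values)
print(mean)
```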


**Question**: what is the mean for the series  2,4,8,9,0?

Luckily, Python's NumPy library provides functions to compute the median and the mean.


Before running the commands below, we have to load NumPy. As with Pandas, we use an import statement, this time giving the library the abbreviation `np`.

In [61]:
import numpy as np

Then, to check your previous answers, run the cells below:

In [62]:
np.median([2,4,8,9,0])

4.0

In [63]:
np.mean([2,4,8,9,0])

4.6

In the examples above, the mean and the median happen to be very close to each other, but this is not always the case.

**Question**

Run the cells below. Can you explain why the median and mean diverge? Which score provides the best summary for the list of numbers? Which metric (mean or median) is more reliable? Or does it depend on the distribution of the values?

In [64]:
results = [0,0,1,1,2,4,5,6,6,7,8,7]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

The mean is  3.9166666666666665
The median is  4.5


In [65]:
results = [0,0,1,1,2,4,5,6,6,7,8,70000]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

The mean is  5836.666666666667
The median is  4.5


In [66]:
results = [0,0,0,0,0,0,0,6,6,7,8,7]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

The mean is  2.8333333333333335
The median is  0.0


In [67]:
results = [0,0,0,0,0,0,0,6000,6000,7000,8000,7000]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

The mean is  2833.3333333333335
The median is  0.0


## Sorting DataFrames

Lastly, we can sort the DataFrame by a specific column. The `.sort_values()` method accepts several arguments; we use two of them:
- `by=`: the column you want to sort by
- `ascending=`: set this to `False` if you want to rank the table from high to low (the default, `True`, ranks from low to high).

In [68]:
df.sort_values(by='A',ascending=False)

   A  B  C  D
2  8  3  1  b
3  2  9  1  e
1  0  2  5  a


You can use this in combination with `.head()` if you want to show only the first *n* rows. Our table has only three rows, so `.head(10)` below simply returns all of them.

In [70]:
df.sort_values(by='D',ascending=False).head(10)

   A  B  C  D
3  2  9  1  e
2  8  3  1  b
1  0  2  5  a
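The effect of `.head()` only becomes visible when *n* is smaller than the number of rows. The sketch below rebuilds the same table and keeps the two rows with the highest value in column A. The variable name `top_two` is only illustrative.

```python
from io import StringIO
import pandas

csv_string = ',A,B,C,D\n1,0,2,5,a\n2,8,3,1,b\n3,2,9,1,e'
df = pandas.read_csv(StringIO(csv_string), index_col=0)

# Sort from high to low on column A, then keep the first two rows
top_two = df.sort_values(by='A', ascending=False).head(2)
print(top_two)
```

Note that `.sort_values()` returns a new, sorted DataFrame; the original `df` keeps its row order.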
