<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Using-the-Pandas-library" data-toc-modified-id="Using-the-Pandas-library-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Using the Pandas library</a></span></li><li><span><a href="#Using-a-Pandas-Series" data-toc-modified-id="Using-a-Pandas-Series-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Using a Pandas Series</a></span></li></ul></div>

<span style="align-items: center">
<img src="images/general/Crystal_Clear_app_restart.png"        style="width: 70px ; float:left;   display: inline-block; vertical-align: bottom;" /> 
<img src="images/general/Crystal_Clear_app_restart__right.png" style="width: 70px ; float:right"/> 
</span>






> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses). Parts of these materials were inspired by https://github.com/engineersCode/EngComp/ (CC-BY 4.0), L.A. Barba, N.C. Clementi.
>
>Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice.

### TODO

* Sets  (maybe later)?

* Copy by reference: show it by example 

* Read: CSV and Excel and MATLAB
* Write: CSV and Excel (not MATLAB!)
* Moving average example
* Temperature and Cities example: now with Pandas
* Combine columns: fridge simulation example
* Plotting a simple series with Pandas
* Average of the dice thrown tends to be normally distributed
* Regression in Pandas?
* Mathematical summaries across various axes


# Module 7: Overview 

In [module 5](https://yint.org/pybasic05) and [module 6](https://yint.org/pybasic06) you used NumPy to create arrays, and perform mathematical calculations on them. Even though module 6 was about Python functions in general, the applications were all with NumPy.

Now we take a look at Pandas, a widely used Python library for data manipulation. Along the way we will also learn about Jupyter notebooks.


<img src="images/general/Crystal_Clear_action_db_commit.png" style="width: 100px ; float:left"/> <br><br> Once again, don't forget to use your version control system. Commit your work regularly, wherever you see this icon.

<hr>

### Preparing for this module

You should have:
1. Completed [worksheet 6](https://yint.org/pybasic06)
2. Finished the short [project on DataCamp](https://projects.datacamp.com/projects/33) about Jupyter notebooks.

<hr>

## Using the Pandas library

Why use ``pandas`` if you already can use NumPy?

* In NumPy you have arrays of data. Pandas adds column headings and row labels (indexes) and calls the result a ``DataFrame``. Think of a spreadsheet.
* Unlike a spreadsheet, pandas can easily merge two tables together, aligning data from different sources.
* If the row labels are time-based, pandas can take advantage of that: e.g. you can average over a week, or a month. In other languages you have to program that manually, including taking into account that months have 28, 29, 30 or 31 days.
* Data which are not time-based are handled equally well.
* If you do something on a dataframe, like calculate an average over all rows, then the result has the labels, the column headings in this case, kept in place.
* Pandas takes care of missing data handling.
* It takes a database-style approach to data, so in later modules, when we handle databases, the topic will not be unfamiliar.
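
As a preview of a few of the points above, here is a minimal sketch. The data are made up purely for illustration, and the names `CityA`, `CityB`, `prices` and `stock` are hypothetical. It shows that column averages keep their column headings, that a date-based index allows averaging per month, and that two tables can be merged on a shared column.

```python
import pandas as pd

# Made-up daily values for two hypothetical cities (60 days from 1 January 2024)
dates = pd.date_range("2024-01-01", periods=60)
df = pd.DataFrame(
    {"CityA": list(range(60)), "CityB": list(range(60, 120))},
    index=dates,
)

# Average over all rows: the result is labelled with the column headings
col_means = df.mean()
print(col_means)

# Average per calendar month; pandas works out the month lengths itself
monthly = df.resample("MS").mean()
print(monthly)

# Merge two small tables on the column they share
prices = pd.DataFrame({"item": ["apple", "pear", "plum"], "price": [1.0, 2.0, 3.0]})
stock = pd.DataFrame({"item": ["pear", "apple"], "quantity": [10, 20]})
merged = pd.merge(prices, stock, on="item")
print(merged)
```

With the default settings, `pd.merge` keeps only the rows whose `item` appears in both tables, so `plum` is dropped here.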

You can load the Pandas library, similar to how you load the NumPy library, with this command:

```python
import pandas as pd
pd.__version__
```

Before we start with DataFrames, there is a simpler object in Pandas, called a ``Series``, the equivalent of a vector in NumPy.

Let's see some characteristics of a ``Series``:
```python
# Create from a list. Put your own numbers here
s = pd.Series([ ... ]) 
print(s)
```
Notice the indexes? Each row has a label:
```python
>>> s = pd.Series([ 5, 9, 1, -4, float('nan'), 5 ])
>>> print(s)  
0    5.0
1    9.0
2    1.0
3   -4.0
4    NaN
5    5.0
dtype: float64
```
If you do not provide row labels, they are generated automatically, starting at 0.
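
If you want to look at the automatically generated labels themselves, the `.index` attribute holds them. A small sketch, reusing the same numbers as above:

```python
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5])

# The row labels are stored separately from the data
print(s.index)

# Converted to a plain list, the default labels are simply 0, 1, 2, ...
labels = list(s.index)
print(labels)
```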

What if you have labels?
```python
# You call the function with two inputs. One input is 
# mandatory (the first one), the other is optional.
s = pd.Series([5, 9, 1, -4, float('nan'), 5 ], index=['a', 'b', 'c', 'd', 'e', 'f'])
print(s.values)
print(type(s.values))
```
Ah ha! See what you get there? Pandas is built on top of the NumPy library. The underlying data are still stored as NumPy arrays, and you can access them with the `.values` attribute. This is partly why understanding NumPy first is helpful before using Pandas.
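
To make the connection concrete, the sketch below pulls the array out of the series and hands it to an ordinary NumPy function. Note that the plain array no longer carries the row labels; those live in `s.index`.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5], index=['a', 'b', 'c', 'd', 'e', 'f'])

# The data, as a NumPy array (no row labels attached)
arr = s.values
print(type(arr))
print(arr.dtype)

# Any NumPy function works on it; nanmax ignores the missing value
largest = np.nanmax(arr)
print(largest)

# The labels are kept separately
print(list(s.index))
```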

## Using a Pandas Series

### Mathematical calculations

The series you created above can be used in calculations. Notice how missing data are handled seamlessly.

```python
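import numpy as np
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Multiply every entry by 5; the missing value stays missing
print(s * 5)

# NumPy functions accept a Series and return a Series
print(np.sqrt(s))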
```
The last line shows that Pandas and NumPy are compatible with each other. You can call NumPy operations on a Pandas object, and the result is returned as a Pandas object to you, with the row labels (indexes).

Also notice that the square root of a negative number is not defined for real values, so the square root of $-4$ in row `d` returns a `NaN`.
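
The sketch below illustrates both points with the same series: summary calculations skip missing values by default, and after taking the square root there are two `NaN` entries, one that was already missing and one from the negative number. The `np.errstate` line is only there to silence NumPy's warning about the invalid square root.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Summaries skip the missing value by default
total = s.sum()
average = s.mean()
n_valid = s.count()        # number of non-missing entries
n_missing = s.isna().sum()
print(total, average, n_valid, n_missing)

# Square root: row 'd' (negative) joins row 'e' (already missing) as NaN
with np.errstate(invalid="ignore"):
    roots = np.sqrt(s)
missing_rows = list(roots[roots.isna()].index)
print(missing_rows)
```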

### Accessing entries

Like in NumPy, you can access the data using square brackets.
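
A short sketch of the common ways to do this, using the labelled series from earlier. Square brackets work with the row labels; `.iloc` is used when you want to go by position instead, as in NumPy.

```python
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5], index=['a', 'b', 'c', 'd', 'e', 'f'])

# A single entry, by its row label
one = s['b']

# Several entries: pass a list of labels, get a smaller Series back
several = s[['a', 'd']]

# A slice of labels: unlike NumPy, BOTH ends are included
label_slice = s['b':'d']

# By position, like NumPy: use .iloc (the end is excluded, as usual)
first = s.iloc[0]
pos_slice = s.iloc[1:3]

# With a condition: keep only the entries greater than 4
large = s[s > 4]

print(one, first)
print(label_slice)
print(large)
```

The difference between the two slices is worth remembering: `s['b':'d']` returns rows `b`, `c` and `d`, while `s.iloc[1:3]` returns only rows `b` and `c`.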



## Loading data: CSV files

Pandas can read a CSV file directly from a web address: `pd.read_csv('http://openmv.net/file/raw-material-height.csv')`
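
To try `pd.read_csv` without an internet connection, you can read from text held in memory. In this sketch the CSV text and its column names (`sample`, `height`) are made up for illustration and are not the contents of the file above.

```python
import io
import pandas as pd

# Hypothetical CSV content; note the empty height value for sample 3
csv_text = """sample,height
1,12.5
2,13.1
3,
4,12.9
"""

# read_csv accepts a file name, a URL, or a file-like object such as this
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # the first few rows
print(df.dtypes)    # the type pandas chose for each column
print(df.shape)     # (number of rows, number of columns)
```

The first line of the file is used for the column headings, and the empty entry is read in as a missing value.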


http://openmv.net/info/electricity-usage
 * Average electricity usage
 * Maximum
 * Minimum
 * Usage during off-peak, on-peak
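
The summaries listed above could be calculated along these lines. This is only a sketch under stated assumptions: the hourly values are made up rather than taken from the electricity usage data set, and "on-peak" is assumed here to mean 07:00 up to 19:00, which may not match the definition you want to use.

```python
import numpy as np
import pandas as pd

# Made-up hourly usage over two days: usage rises steadily during each day
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")
usage = pd.Series(1.0 + 0.1 * hours.hour.to_numpy(), index=hours, name="usage")

# Overall summaries
average = usage.mean()
maximum = usage.max()
minimum = usage.min()
print(average, maximum, minimum)

# ASSUMPTION: on-peak is from 07:00 up to (not including) 19:00
is_peak = (usage.index.hour >= 7) & (usage.index.hour < 19)
on_peak_mean = usage[is_peak].mean()
off_peak_mean = usage[~is_peak].mean()
print(on_peak_mean, off_peak_mean)
```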
 


## Challenges

1. KNMI data loading
2. Fridge simulation: return 4 columns


<img src="images/general/Crystal_Clear_action_db_commit.png" style="width: 100px ; float:left"/> <br><br>Wrap up this section by committing all your work. Have you used a good commit message? Push your work, to refer to later, but also as a backup.

>***Feedback and comments about this worksheet?***
> Please provide any anonymous [comments, feedback and tips](https://docs.google.com/forms/d/1Fpo0q7uGLcM6xcLRyp4qw1mZ0_igSUEnJV6ZGbpG4C4/edit).
