# Introduction to Python

## Session 4

## Contents

### 1. Introduction. Data Frames. CSV and Excel files. Average, max, min, etc.

### 2. Indexing. Queries.

### 3. Time Series. Resample.

### 4. Exercises

## 1. Introduction. Data Frames. CSV and Excel files. Average, max, min, etc.

___pandas___ is a Python library with functions and tools for data analysis. 

It provides means to handle series and tables of data, make calculations with them, select data based on criteria and save the results to files.

A simple example of pandas usage is reading a table of data from an Excel or csv file, selecting some rows and columns, and calculating the average value.

More complicated examples include creating new tables based on the previous data, and also working with time. An example would be to calculate the daily (or hourly, monthly) average (or sum, or maximum value) of a variable from a large number of measurements, perhaps taken every minute or every 5 minutes.

The following examples refer to measurements of different nutrients in an experiment with lettuce. They will help us to practice using `pandas`.

There are three different treatments, with different amounts of nitrogen as nitrate and ammonium, as well as the amount of other elements. The measurements are what was present in the lettuce at the end of the cultivation cycle. This is the kind of data that we often need to group by categories, and where we might want to calculate averages or sums for each group.

Later, in the last part of the present session we will use climate data to give an example of the use of ___time series___, which is data that is not in categories or groups, but rather in points ordered in time, events that happen one after the other.

#### Import the pandas library

It is a common practice to import pandas using the alias ___pd___, like this:

In [None]:
import pandas as pd

From there, the package can be used with `pd.` + the name of the function (dot operator `.`)

For example, to access a `read_csv` function we can write `pd.read_csv`.

Of course, it is possible to just `import pandas` and then use `pandas.read_csv`; it makes no difference.

#### Read a table from a file

We will show how to import data from two common file types: ___comma-separated values___ and ___excel___ files.

#### Reading a csv-file

<img src='img/csv_file.png' width='700'>

csv files are plain-text files that separate the columns of a table with a previously defined character, most commonly a comma `,`

Note, however, that other characters might be used. For example, depending on the regional settings, Microsoft Excel may use a semicolon `;` when it saves a sheet as a csv file (as in the example shown).

We can read csv files using the `read_csv` function from pandas. The function's most important argument is the file name:

In [None]:
df_nutrients = pd.read_csv( 'data/nutrients.csv', sep=';' )  # sep must be passed by name in recent pandas versions

Note that, by default, the function expects the columns to be separated by a ___comma___ `,`, which is not the case in the file shown (created with Excel). 

Therefore it is necessary to use the parameter `sep=';'` to indicate that, in this case, the columns are separated by semicolon characters.
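The effect of `sep` can be sketched without a file on disk. In this minimal example the table content, the column names and the values are made up, and `io.StringIO` stands in for the file:

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, made-up values
csv_text = "Treatment;Tank;NitrateN\nControl;1;21000\nAu47;2;18500\n"

# Without sep, pandas looks for commas and finds a single wide column
df_wrong = pd.read_csv(io.StringIO(csv_text))

# With sep=';' the three columns are split correctly
df_right = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df_wrong.shape)   # rows and columns when the separator is wrong
print(df_right.shape)   # rows and columns when the separator is right
```

A table that arrives with a single column is usually a sign that the separator does not match the file.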

The function `pd.read_csv` returns a ___Data Frame___, which is the most used data type for tables.

In [None]:
type( df_nutrients )

Inside the Data Frame that was created, `df_nutrients` in this example, we have the content of the file, in rows and columns:

In [None]:
df_nutrients

Note that pandas:
* tries to use the data in the first row as column names (in this case this is correct)
* adds a column at the beginning, with numbers: that is the ___index___ of all rows
* infers the data type of each column: text, numbers... (a column that mixes numbers and text is typically treated as text)
* puts the value __NaN__ (not-a-number) in empty cells
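These points can be checked on a small table. The sketch below assumes made-up content with one empty cell and one text entry (`n.d.`) in an otherwise numeric column; the column names are only illustrative:

```python
import io
import pandas as pd

# Hypothetical content: one empty cell in NitrateN and one text entry in Iron
csv_text = (
    "Treatment;Tank;NitrateN;Iron\n"
    "Control;1;21000;310\n"
    "Au47;2;;n.d.\n"
)
df_example = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df_example.index)          # the row index that pandas added
print(df_example.dtypes)         # inferred data type of each column
print(df_example.isna().sum())   # number of empty (NaN) cells per column
```

The `Iron` column is not numeric here, so statistics such as the mean cannot be calculated on it until the text entry is cleaned up.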

#### Reading an xlsx-file from Microsoft Excel

Microsoft Excel is a widely used program for working with data tables, so it is very useful to be able to import this kind of file into pandas.

The main difference from csv files is that Excel files can contain several sheets, and we need to specify which one we want to read into a single table, that is, into a single pandas Data Frame object.

<img src='img/excel_file.png' width='700'>

We will import the sheet "Nutrients", which has the same values that we had previously from the csv file.

In [None]:
# df_nutrients = pd.read_csv( 'data/nutrients.csv', sep=';' ) → how we read a csv file previously

df_nutrients_xlsx = pd.read_excel( 'data/lettuce.xlsx', sheet_name='Nutrients' )

In [None]:
df_nutrients_xlsx

#### Having a look at a DataFrame

There are a number of functions that can be used directly on a DataFrame and that help us get information about the data inside. 

We will show some of the most common in the next cells. 

#### Head and tail

`.head()` shows the first rows in the table, while `.tail()` shows the last ones.

In [None]:
df_nutrients.head()

In [None]:
df_nutrients.tail()

By default, they show 5 rows, but a number can be passed as parameter:

In [None]:
df_nutrients.head(2)

#### Shape

`.shape` returns the number of rows and columns in the table, which is very useful to check that the data was imported correctly.

In [None]:
df_nutrients.shape

#### Describe

`.describe()` returns a table with descriptive statistics for all columns in the table, also useful to spot potential errors.

In [None]:
df_nutrients.describe()

It is also possible to check those same descriptive statistics separately. 

For example, we can get the mean only, without having the other values:

In [None]:
df_nutrients.mean( numeric_only=True )  # skip text columns; needed in pandas 2.0 and later

Note that:
* the tank is a categorical value (it is not a _number_ of tanks: could have been tank A, B, etc.), but ___pandas___ calculates the statistics all the same
* the column `Treatment` is not included, because it has text values (it would have been the same with the `tank` column if it contained text)
* there is an empty column at the end of the table, which gets a count value of 0, as there are no cells with any value there
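The first two notes can be tried on a small table. This is only a sketch with made-up values; converting the tank number to text with `astype(str)` is one possible way to mark it as a label rather than a quantity:

```python
import pandas as pd

# Made-up values with the same kind of columns as the nutrients table
df_example = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'Au47', 'Au47'],
    'Tank': [1, 2, 3, 4],
    'NitrateN': [21000.0, 19000.0, 18000.0, 17000.0],
})

# Store the tank as text so that it is treated as a label, not a quantity
df_example['Tank'] = df_example['Tank'].astype(str)

# numeric_only=True leaves the text columns out explicitly
means = df_example.mean(numeric_only=True)
print(means)
```

After the conversion, only `NitrateN` remains in the result.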

Here is the complete table again, for reference only:

In [None]:
df_nutrients

## 2. Indexing. Queries.

Having some general data about the whole table is an important step of data exploration, because we can get a feeling of what is happening (what is big, what is missing), and also because it helps to spot possible errors in calculations.

However, it is most common that we want to answer questions like:
* What are the average nutrients _per treatment_?
* Which tanks were used for the Control treatment?
* What is the ratio of ammonium/nitrate for the control treatment?

Questions of this kind require that we _select_ some rows and columns before asking for means or sums. We call this selection ___indexing___, similar to the indexing of lists.

### Indexing

To index lists, and also data frames, we use square brackets. 

With lists, we always use numeric indexes. 

With data frames we can use column names and numeric indexes.

#### Short reminder of list indexing

In [None]:
example_list1 = [ 'red', 'blue', 'black', 'yellow' ]

example_list1[ 2 ]

We read that like "example_list1 in the position 2 is 'black'", 

or "the value of example_list1 in the position 2 is 'black'", 

or "there is a 'black' in the position 2 of example_list1"

In [None]:
example_list2 = [ [ 'red', 'blue', 'black', 'yellow' ],
                  [ 'north', 'south', 'up', 'down' ],
                  [ 'apple', 'banana', 'peach', 'pear' ] ]

In [None]:
example_list2[ 2 ]

In [None]:
example_list2[ 2 ][ 1 ]

We will use the same syntax with pandas DataFrames, but we will index rows with the index (mostly with numbers) and columns with their name.

### Selecting columns from a data frame in pandas

Our DataFrame with nutrients looks like this:

In [None]:
df_nutrients

We can select only the nitrates' column with:

In [None]:
df_nutrients[ 'NitrateN' ]

Or more than one column at a time...

To select nitrates and ammonium, we use a list of columns:

In [None]:
df_nutrients[ [ 'NitrateN', 'AmmoniumN' ] ]

The syntax to select columns uses a list of the columns to be selected:

        dataframe_name[ [ column_name_1, column_name_2, ... , column_name_n ] ]
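One detail worth seeing in code is the difference between single and double square brackets. The following sketch uses a small made-up table; the column `Iron` and all values are only illustrative:

```python
import pandas as pd

df_example = pd.DataFrame({
    'NitrateN': [21000, 19000],
    'AmmoniumN': [450, 520],
    'Iron': [310, 280],
})

one_column = df_example['NitrateN']           # a name alone gives a Series
one_column_table = df_example[['NitrateN']]   # a list with one name gives a DataFrame
two_columns = df_example[['NitrateN', 'AmmoniumN']]

print(type(one_column).__name__)
print(type(one_column_table).__name__)
print(two_columns.shape)
```

A Series is a single column with its index, while a DataFrame remains a table even if it has only one column.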


Again, we can use the functions `head`, `tail`, `describe`, `mean`, etc. on the selected columns, whether they are one or more:

In [None]:
df_nutrients[ [ 'NitrateN', 'AmmoniumN' ] ].head()

In [None]:
df_nutrients[ [ 'NitrateN', 'AmmoniumN' ] ].describe()

In [None]:
df_nutrients[ [ 'NitrateN', 'AmmoniumN' ] ].mean()

### Selecting rows from a data frame in pandas

To select rows from a data frame, we need to know ___the exact index___ of that row. In most cases, the index is a sequence of numbers, so we need the row number.

The index is the column without a name, on the far left of the table:

In [None]:
df_nutrients

In this case, the index is made up of numbers from 0 to 9, as we can check here:

In [None]:
df_nutrients.index

To select a particular row by its position (which here is the same as its index number), we use the ___integer-location indexer___ `.iloc` in the form:
    
    dataframe_name.iloc[ row_number ]

In [None]:
df_nutrients.iloc[ 2 ]

However, that is not commonly needed.
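For completeness, here is a minimal sketch of the difference between selecting by position with `.iloc` and selecting by index label with `.loc`. The table and its index labels (10, 20, 30) are made up so that labels and positions do not coincide:

```python
import pandas as pd

df_example = pd.DataFrame(
    {'Treatment': ['Control', 'Au47', 'Au53']},
    index=[10, 20, 30],   # labels chosen so that they differ from positions
)

by_position = df_example.iloc[2]   # third row, counting from 0
by_label = df_example.loc[20]      # row whose index label is 20

print(by_position['Treatment'])
print(by_label['Treatment'])
```

With the default index (0, 1, 2, ...), both forms point to the same row, which is why the distinction is easy to overlook.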

More often, we select a group of rows according to a condition in a column. For example, we could select all rows from a certain treatment.

We can select groups of rows using queries, as described below.

### Queries

This section shows how to select the rows of a dataframe that fulfil a logical condition.

The general syntax is:
    
    dataframe_name[ (condition) ]
    
And the form of the _condition_ that we will show here is written as:
    
    dataframe_name[ column ] == value
    
Of course, we can use other comparison operators, not only _is equal to_ `==`: for example `!=`, `<`, `>`, `<=`, `>=`.

For example, we can select only the _control_ treatment from the nutrient dataframe as follows:

In [None]:
df_nutrients[ df_nutrients['Treatment']=='Control' ] # Get part of a table with a selection rule

Some people like to split it into two lines, to have a better view of what is happening:

In [None]:
condition = df_nutrients['Treatment']=='Control' # First define the selection rule
df_nutrients[ condition ] # Then get the part of the table

The result is exactly the same, so feel free to use the form that best fits your needs!

#### More than one logical condition

Sometimes, we need to look for parts of a data frame that fulfil more than one logical condition.

Continuing with our example, the nutrients in the lettuce experiment, we could look for:
* cases where an element in a treatment falls above or below a certain threshold, for example, cases with low nitrogen in the control treatment
* cases where two elements are above or below a threshold value, like very low values of iron and boron
* cases where a value is outside a range

Note: In these cases we must use ___bitwise logical operators___, which are & and |, and wrap each condition in parentheses, because & and | are evaluated before comparisons such as `==` or `<`:
* & → ___and___ → (condition1) & (condition2)
* | → ___or___ → (condition1) | (condition2)
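The three kinds of cases listed above can be sketched on a small made-up table. The threshold values are arbitrary, and the `~` operator (negation) is an addition that is not used elsewhere in this session:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'Au47', 'Au53'],
    'Iron': [150, 300, 450, 260],
})

# Both conditions must hold: control treatment AND low iron
control_low_iron = df_example[
    (df_example['Treatment'] == 'Control') & (df_example['Iron'] < 250)
]

# Outside a range: below the lower limit OR above the upper limit
outside_range = df_example[
    (df_example['Iron'] < 200) | (df_example['Iron'] > 400)
]

# ~ negates a condition: everything that is not the control treatment
not_control = df_example[~(df_example['Treatment'] == 'Control')]

print(control_low_iron)
print(outside_range)
print(not_control)
```

The parentheses around each comparison matter: without them, Python would try to evaluate `&` or `|` before the comparison.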

In [None]:
df_nutrients[ (df_nutrients['Treatment']=='Control') & (df_nutrients['NitrateN']<20000) ]

Again, we can write the code in one or more lines with the same result.

In [None]:
control_treatment = df_nutrients['Treatment']=='Control' # First condition

low_nitrate = df_nutrients['NitrateN']<20000 # Second condition

df_nutrients[ control_treatment & low_nitrate ]

Another example: Select the rows with low Iron or low Boron

In [None]:
low_iron = df_nutrients['Iron']<250 # First condition

low_boron = df_nutrients['Boron']<20 # Second condition

df_nutrients[ low_iron | low_boron ] # Rows where at least one of the two conditions holds

#### Calculate on part of a data frame

After selecting part of a data frame we can make calculations, like the average of a treatment.

In [None]:
control_selection = df_nutrients[ 'Treatment' ] == 'Control'

In [None]:
df_nutrients[ control_selection ].mean( numeric_only=True )  # skip text columns; needed in pandas 2.0 and later

In [None]:
df_nutrients[ control_selection ].min()

In [None]:
df_nutrients[ control_selection ].max()

Lastly, you can use the function `.groupby()` to get the descriptive statistics of the whole table, but by groups. For example, for each treatment:

In [None]:
df_nutrients.groupby( 'Treatment' ).mean()

In [None]:
df_nutrients.groupby( 'Treatment' ).min()
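Several statistics can also be requested in a single call with `.agg()`. This is a minimal sketch on a made-up table, restricted to one column to keep the output small:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'Au47', 'Au47'],
    'NitrateN': [21000, 19000, 18000, 16000],
})

# One row per treatment, one column per statistic
summary = df_example.groupby('Treatment')['NitrateN'].agg(['mean', 'min', 'max'])
print(summary)
```

The result is again a data frame, with the treatments as index, so it can be indexed or saved like any other table.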

There are many more applications and uses of these (and other) pandas functions, but they are material for a more advanced course. This introduction aims only to give you an idea of what can be done and the first steps to get you started!

## 3. Time Series. Resample.

`pandas` has a very useful set of functions to deal with time series data, that is, data that are ordered in time.

For this introductory tutorial, we will limit ourselves to the following:
* making a time index for a data frame
* selecting rows between times, using `pd.Timestamp`
* using `.resample()` on a data frame to calculate averages

We will use data from climate sensors that were measuring automatically inside the lettuce greenhouse.

The data is in another excel file, which we import as follows:

In [None]:
df_climate = pd.read_excel( 'data/climate.xlsx', sheet_name='greenhouse' )

In [None]:
df_climate.head()

In [None]:
df_climate.tail()

We have measurements at intervals of (about) 5 minutes, recording temperature and relative humidity inside a greenhouse.

The data span from 18th of April until the 24th of May of 2018. We have a little more than one month of data.

Note that the index is numeric. We want to make it time-aware using the data in the ___Date and Time___ column to be able to select rows in time ranges.

For that, we use the function `pd.DatetimeIndex`, and pass as argument the column from the table that has the date and time.

In [None]:
df_climate.head()

Old index:

In [None]:
df_climate.index

New index:

In [None]:
df_climate.index = pd.DatetimeIndex( df_climate[ 'Date and Time' ] )

In [None]:
df_climate.index

Note: A common source of error in this step is a confusion between days and months (because of the order). It can also happen with the year, if only the last 2 digits are written: Which are the day, month and year if the date is __10-11-12__?

For the first case, you can specify that days go first with `dayfirst=True` (similarly, `yearfirst=True` indicates that the year goes first):
`df_climate.index = pd.DatetimeIndex( df_climate[ 'Date and Time' ], dayfirst=True )`
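The effect of `dayfirst` can be seen on a single ambiguous date. In this sketch the year is written with four digits (an assumption made here, so that only the order of day and month is in question):

```python
import pandas as pd

# A made-up date as text: 10th of November or 11th of October?
dates_as_text = ['10-11-2012']

month_first = pd.DatetimeIndex(dates_as_text)                # default reading
day_first = pd.DatetimeIndex(dates_as_text, dayfirst=True)   # day before month

print(month_first[0])
print(day_first[0])
```

It is worth printing a few rows after building the index and comparing them with the original file, since a wrong order does not raise an error.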

Now we have the index as a time object and can select rows according to it:

In [None]:
df_climate.head()

To select a particular day, we can use `pd.Timestamp`.

Let's say we want to check the data of the 23rd of May:

In [None]:
start = pd.Timestamp( '2018-05-23, 00:00:00' )
end = pd.Timestamp( '2018-05-24, 00:00:00' )

In [None]:
condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ]

And we can use this new table to calculate the average (or max, min, etc) of the columns on that interval:

In [None]:
df_climate[ condition_start & condition_end ].mean()

Of course, we can also look for other periods with this technique; we just follow these steps:
* Define a start time with `pd.Timestamp`
* Define a finishing time with `pd.Timestamp`
* Check condition: > start
* Check condition: < end
* Select from the table (query)
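The steps above can be wrapped in a small helper function. This is only a sketch: the function name `select_between` is hypothetical, and the data are synthetic 5-minute values rather than the greenhouse measurements:

```python
import pandas as pd

# Synthetic 5-minute measurements (made-up values), from 07:00 to 09:55
df_example = pd.DataFrame({
    'Date and Time': pd.date_range('2018-05-23 07:00', periods=36, freq='5min'),
    'Temp. (°C)': [20.0 + 0.1 * i for i in range(36)],
})
df_example.index = pd.DatetimeIndex(df_example['Date and Time'])


def select_between(df, start, end):
    """Return the rows whose time index lies strictly between start and end."""
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    return df[(df.index > start) & (df.index < end)]


morning = select_between(df_example, '2018-05-23 08:00:00', '2018-05-23 09:00:00')
print(len(morning))
```

Because the comparisons are strict (`>` and `<`), a measurement taken exactly at the start or end time is left out; `>=` or `<=` would include it.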

In [None]:
start = pd.Timestamp( '2018-05-23, 08:00:00' )
end = pd.Timestamp( '2018-05-23, 09:00:00' )

condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ]

This technique does not look for exact matches, as the measuring times in the table include seconds that are irregular. It is very useful because it also works when timestamps are missing or unevenly spaced.

It is now very easy to know the average temperature in the period between 8 and 9 that we selected previously:

In [None]:
start = pd.Timestamp( '2018-05-23, 08:00:00' )
end = pd.Timestamp( '2018-05-23, 09:00:00' )

condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ].mean()

#### Resample

Lastly, we will have a quick introduction to `.resample`, a function that allows us to change the interval in which some data are given, either ___upsample___ (get more points, at smaller intervals) or ___downsample___ (aggregating values in bigger intervals).

The data in the climate data frame is stored in intervals of about 5 minutes. 

We will first ___downsample___ it to hourly and daily values.

Two things must be clear in order to resample data correctly:

* What the new frequency will be. 1 hour? 15 minutes? 1 day?
* How we will create the new values. Sum? Average?

About the first question, we will use the following letters to specify the new frequency:
    
* M → monthly frequency
* W → weekly frequency
* D → daily frequency
* H → hourly frequency
* T → minutely frequency
* S → secondly frequency
* L → milliseconds
* U → microseconds
* N → nanoseconds

Note: pandas 2.2 and later deprecate several of these letters and recommend `ME` (month end), `h`, `min`, `s`, `ms`, `us` and `ns` instead.

About the second question, consider that the water (in liters [L]) from irrigation in the morning and in the afternoon should ___add___ up to a daily value.

On the other hand, the temperature in the morning and the temperature in the afternoon should be ___averaged___ to give a daily value.

In some cases we also need the last value, or the first, or the most common one in the interval. At the end of this notebook is a link to a very nice post where these options can be consulted.
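Different columns can be aggregated in different ways in one step by passing a dictionary to `.agg()`. The sketch below assumes a made-up table with an irrigation column (which is not part of the climate file) and a temperature column:

```python
import pandas as pd

# Two made-up measurements per day: morning and afternoon
df_example = pd.DataFrame(
    {
        'Irrigation (L)': [10.0, 5.0, 8.0, 4.0],
        'Temp. (°C)': [18.0, 26.0, 17.0, 25.0],
    },
    index=pd.DatetimeIndex([
        '2018-05-23 08:00', '2018-05-23 16:00',
        '2018-05-24 08:00', '2018-05-24 16:00',
    ]),
)

# Water adds up over the day, temperature is averaged
daily = df_example.resample('1D').agg({
    'Irrigation (L)': 'sum',
    'Temp. (°C)': 'mean',
})
print(daily)
```

Other aggregation names such as `'first'`, `'last'`, `'min'` and `'max'` can be used in the same dictionary.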

First we will resample the whole data frame to 1 hour, taking the average of the values in each hourly interval. 

It is like this now:

In [None]:
df_climate.head()

In [None]:
df_climate_1h = df_climate.resample( '1H' ).mean()

In [None]:
df_climate_1h.head()

In [None]:
df_climate_1h.tail()

And now the same for daily values:

In [None]:
df_climate_1d = df_climate.resample( '1D' ).mean()

In [None]:
df_climate_1d.head()

In [None]:
df_climate_1d.tail()

That is how easily we get the daily average temperature and humidity from measurements taken at 5-minute intervals!

Lastly, we will show what happens if the new frequency is higher, i.e. the time intervals are smaller.

In these cases, we get empty spaces that need to be filled with _something_. Common options are the previous or next values, or empty cells.

As an example, we will change the frequency from 5 minutes to 1 minute.

In [None]:
df_climate['Temp. (°C)'].resample( '1min' ).ffill()  # Forward fill: takes the value from before

The mean and other functions that ___aggregate___ values are not meaningful in this case, because we are "creating" new cells that were not there before, and most of them end up empty (NaN):

In [None]:
df_climate['Temp. (°C)'].resample( '1min' ).mean()
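The filling options mentioned above can be compared side by side. This sketch uses only two made-up temperature readings, five minutes apart; linear interpolation is an additional option that is not covered elsewhere in this session:

```python
import pandas as pd

# Two made-up readings, 5 minutes apart
temperature = pd.Series(
    [10.0, 20.0],
    index=pd.DatetimeIndex(['2018-05-23 08:00', '2018-05-23 08:05']),
)

forward = temperature.resample('1min').ffill()        # repeat the previous reading
backward = temperature.resample('1min').bfill()       # use the next reading
empty = temperature.resample('1min').asfreq()         # leave the new cells empty (NaN)
linear = temperature.resample('1min').interpolate()   # straight line between readings

comparison = pd.DataFrame({
    'ffill': forward,
    'bfill': backward,
    'asfreq': empty,
    'interpolate': linear,
})
print(comparison)
```

Which option is appropriate depends on the variable: a slowly changing temperature may be interpolated, while a counter or a switch state is usually better forward filled.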

## 4. Exercises

### 4.1

Consider the fresh weight measurements of the lettuce experiment, included in the excel file ___lettuce.xlsx___, on the sheet ___Freshweight___.

Import the data and select each one of the treatments.

For each treatment, create a data frame. You can call them `df_Au47`, `df_Au53` and `df_Control`.

Once you have three data frames, save each one of them to a ___csv___ file, using the `.to_csv()` function.

You might want to use the optional parameters:
* sep=';' if you plan to open it with Microsoft Excel (with regional settings that use the semicolon as separator)
* index=False if you don't want an extra column with the index at the beginning of the file

The general syntax is:

`dataframe_name.to_csv( "filename.csv" )`
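As a sketch of how the optional parameters behave, the following example writes a small made-up table and reads it back. The column name `Freshweight`, the values and the use of a temporary folder are assumptions made only for this illustration:

```python
import os
import tempfile
import pandas as pd

df_example = pd.DataFrame({
    'Treatment': ['Control', 'Control'],
    'Freshweight': [312.5, 298.0],
})

# Write to a temporary folder so that no existing file is overwritten
out_path = os.path.join(tempfile.mkdtemp(), 'df_Control.csv')
df_example.to_csv(out_path, sep=';', index=False)

# Read the file back with the same separator to check the result
df_back = pd.read_csv(out_path, sep=';')
print(df_back)
```

Without `index=False`, the file would contain an extra first column with the row index, which then appears as an unnamed column when the file is read again.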

### 4.2

Open the sheet ___Elements___ in the same excel file to a new data frame. You can call it `df_elements`. 

It contains the use of fertilizers in the lettuce experiment. 

There are several rows for each tank of nutrient solution, and we want to know the descriptive statistics of each tank.

Select the first tank and show its descriptive statistics using either of these forms:

`df_elements[ df_elements['Tank']==1 ].describe()`

`tank_condition = df_elements['Tank']==1`

`df_elements[ tank_condition ].describe()`

Create a for loop that counts from 1 to 9, which are the tank numbers. In the loop, include the code you just used to print the statistics and:
* wrap it in a `print()` function to see the statistics for all tanks
* append the following code to the same line to obtain 9 files with the statistics: `.to_csv(str(i)+'.csv')`
* adjust the code from the last line so that the file names agree with the tank number

## Links

[Pandas official documentation](https://pandas.pydata.org/pandas-docs/stable/)

[A tutorial on pandas](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)

[Example about row selection on conditions](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/)

[Resampling options, frequency and aggregation functions](http://benalexkeen.com/resampling-time-series-data-with-pandas/)

---