# Lesson: more analysis and plotting tools 

Last week you read and analyzed a CSV file with the `pandas` library. The purpose of today's lesson is to introduce you to *most* of the tools that you will need for your projects.

This is quite a dense lesson. Please work through all of it, at your own pace, so that you remember the principles later. You can always come back to these examples when you need them. There are three sections:
- More about python syntax
- More about pandas
- More about plotting

## More about python syntax

### Python objects

In python, *all* variables are also "things". In programming jargon, these "things" are called *objects*. Without going into details that you won't need for this lecture, objects have so-called "attributes" and "methods". *Attributes* are pieces of information stored on the object; *methods* are functions that belong to the object and act on it.

For example, even simple integers are also "things with attributes":

In [None]:
# Let's define an integer
a = 1
# Get its attributes
print('The real part of a is', a.real)
print('The imaginary part of a is', a.imag)

Attributes are read with a `dot`. They are like variables:

In [None]:
ra = a.real
ra

Importantly, objects can also have functions that apply to them. For example, `strings` have a function called `split()`:

In [None]:
s = 'This:is:a:stupid:example'
s_splitup = s.split(':')
print(s_splitup)

Note that whereas attributes are read with a '.', functions are called with parentheses, and sometimes they require arguments (the ':' in the case above). 

Strings also have the `join()` method by the way:

In [None]:
' '.join(s_splitup)

It is not necessary to know the details about object oriented programming to use python (in fact, most of the time you don't need to use these concepts yourselves). But it **is** important to know that you can have access to attributes and methods on almost *everything* in python. 
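
As a small illustration of the difference between attributes and methods, here is a minimal sketch with a complex number and a string. The variable names are made up for this example:

```python
# A complex number stores information in attributes (read without parentheses)
z = 3 + 4j
print(z.real, z.imag)

# Methods are called with parentheses and may take arguments
word = 'glacier'
shout = word.upper()     # a method without arguments
n_c = word.count('c')    # a method with one argument
print(shout, n_c)

# dir() lists the names of all attributes and methods of an object;
# here we keep only the "public" ones (not starting with an underscore)
public_names = [name for name in dir(word) if not name.startswith('_')]
print(public_names)
```

In a notebook, typing `word.` and pressing TAB shows the same list interactively.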

### Getting help about python variables and functions

Remember that you can always ask for help about functions:

In [None]:
import pandas as pd
pd.date_range?
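
The `?` syntax only works in a notebook or in IPython. As a sketch of what is available in any python session, the built-in `help()` function prints the same kind of information, and the text itself is stored in the `__doc__` attribute of the object:

```python
# help() prints the documentation of any object to the screen
help(str.split)

# The documentation text is also available as a plain string
doc = str.split.__doc__
print(doc)
```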

Also, the documentation pages of the various libraries are very useful. This semester, we are going to rely mostly on three components:
- [numpy](http://docs.scipy.org/doc/numpy/reference/): this is the base on which any scientific python project is built. 
- [matplotlib](http://matplotlib.org/index.html): plotting tools
- [pandas](http://pandas.pydata.org/pandas-docs/version/0.18.0/): working with time series data

It's always useful to have their documentation webpage open on your browser for easy reference.

## More about pandas

We are going to learn about some of the most basic tools offered by ``pandas``.

In [None]:
import pandas as pd  # pd is the short name for pandas. It's good to stick to it. 
# While we are at it, let's import some other things we might need later
# this tells the notebook to draw the plots below the cells, and not in a new window:
%matplotlib inline 
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library

# Below I set a global option: I would like to reduce the number of rows that pandas is printing:
pd.options.display.max_rows = 14

In [None]:
# We are now using pandas to read the data out of the same csv file as last time
# The first argument to the function is a path to a file, the other arguments
# are called "keywords". They tell pandas to do certain things
df = pd.read_csv('data/data_Zhadang.csv', index_col=0, parse_dates=True)

**Q: What are the two keywords of the function doing? Have a quick look at the documentation of `pandas.read_csv()` (either online or by using `?`). Why do you think that `read_csv` has so many keywords available?**

In [None]:
# your answer here

The output of read_csv is a DataFrame object. A DataFrame has columns and rows and, importantly, it has an index. In our case, it is the time:


In [None]:
df.index
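
If you want to experiment without the data file, a dataframe with a time index can also be built by hand. The sketch below uses made-up temperatures and a hypothetical variable called `toy`; it is only meant to show how the index, the columns and the rows fit together:

```python
import pandas as pd

# Made-up hourly temperatures, for illustration only
times = pd.date_range('2010-10-02 00:00', periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'TEMP_2M': [-1.5, -2.0, -1.0, 0.5]}, index=times)

print(toy.index)        # a DatetimeIndex, like df.index above
print(toy.index.hour)   # the index has time-related attributes

# Rows are located via the index, here with an exact timestamp
row = toy.loc[pd.Timestamp('2010-10-02 02:00')]
print(row)
```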

For the exercises below it is enough to select a shorter time period, let's say one day:

In [None]:
df = df.loc['2010-10-02']

If you want to quickly know the number of elements of a dataframe, you can use the `len()` function:

In [None]:
len(df)

### Columns

Columns can be accessed in two ways. Like this:

In [None]:
t2m = df['TEMP_2M']
t2m

Or as an "attribute", as you will find out by yourself. 

**Q: Type df. in the cell below and then press TAB. See the list of options and select the temperature.**

In [None]:
# your answer here

A column taken out of the dataframe like this is called a `Series`. Note that it is printed slightly differently than a `DataFrame`. In many ways, series and dataframes are the same (i.e. they share the same functionalities. See also the pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/overview.html)).

We are now creating a new column, which is simply the temperature + 3 degrees. Note that mathematical operations are allowed on pandas series. Even better, **the `index` of the data is conserved with this mathematical operation**! See:

In [None]:
new_data = t2m + 3
new_data
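
The role of the index in mathematical operations becomes clearer with two series. In the minimal sketch below (made-up values and labels), the two series carry the same labels in a different order, and pandas matches the values by label rather than by position:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['c', 'b', 'a'])  # same labels, other order

shifted = s1 + 3   # operation with a number: the index is kept as it is
total = s1 + s2    # operation with a series: values are matched by label
print(shifted)
print(total)
```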

If you really need it (this is quite rare), you can access the "old fashioned" data of the series like this:

In [None]:
vals = new_data.values
vals

**Q: What is the type of vals? Ask the help for this. To which library does this object belong?**

In [None]:
# your answer here

### Changing the index of a series

It is possible to update the index of a series. This is very useful when you are dealing with different time zones for example. We are now going to assume that our time has to be shifted by 3 hours:

In [None]:
# We add 3 hours to the old index, and then replace it:
new_data.index = new_data.index + pd.DateOffset(hours=3)

**Careful! If you run the cell above more than once, you are going to shift the time by more than three hours!**

Note that here we use a so-called "naive" timestamp, i.e. our timestamp doesn't know in which time zone it is. However, a time zone can be attached to the index as an attribute.
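
As a sketch of how a time zone can be attached, the example below uses made-up timestamps. The choice of `'UTC'` and `'Asia/Shanghai'` is only an assumption for this example, not a statement about the station data:

```python
import pandas as pd

# Made-up naive timestamps (no time zone attached)
naive = pd.date_range('2010-10-02 00:00', periods=3, freq=pd.Timedelta(hours=1))
print(naive.tz)  # a naive index has no time zone

# Declare that these times are in UTC, then convert them to another zone.
# Both zone names are assumptions chosen for illustration.
utc = naive.tz_localize('UTC')
local = utc.tz_convert('Asia/Shanghai')
print(local)
```

`tz_localize` only attaches the information to the existing times, while `tz_convert` changes the displayed clock time to the new zone.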

### Indexes ARE important

We already learned that indexes are very useful: we can select periods out of it, and plots are using the index to locate the x-axis:

In [None]:
new_data.plot();

Indexes are also used by pandas to merge the data together. We are now putting our new (shifted) data into the dataframe as a new column:

In [None]:
df['NEW_TEMP'] = new_data

**Q: Print and plot the new dataframe. What did pandas do to the data? Does it make sense? What would you have expected?**

In [None]:
# your answer here

### Operations on dataframes

Mathematical operations are also possible on DataFrames. For example, see the result of:

In [None]:
df - df

This operation is of course not very useful, but you get the point. More interesting is the following:

In [None]:
df.mean()

Ok, so it is easy to compute the average of each column. But see what happens when I do:

In [None]:
dfa = df - df.mean()

**Q: Plot the dfa variable. What did we just do?**

In [None]:
# Your answer here

### Selecting data

As we saw last week, the index determines how data is located in the dataframe:

In [None]:
df_sel = df.loc['2010-10-02 08:00:00':'2010-10-02 14:00:00']
df_sel

But it is also possible to select data based on a condition. For example, I would like to select all data where `TEMP_2M` is higher than 2°:

In [None]:
df_sel.loc[df_sel['TEMP_2M'] > 2]

We can also select data based on other conditions. For example:

In [None]:
df_sel.loc[df_sel.index.hour == 10]

Or taking all hours before and including 11:

In [None]:
df_sel.loc[df_sel.index.hour <= 11]
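
Conditions can also be combined. The sketch below uses a made-up dataframe called `toy` (not the station data) to show that a condition is itself a series of `True`/`False` values, and that two conditions can be joined with `&`:

```python
import pandas as pd

# Made-up hourly temperatures from 08:00 to 13:00
times = pd.date_range('2010-10-02 08:00', periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'TEMP_2M': [0.5, 1.5, 2.5, 3.0, 2.2, 1.0]}, index=times)

# A condition is a Series of True/False values with the same index
is_warm = toy['TEMP_2M'] > 2
print(is_warm)

# Combine conditions with & (and) or | (or); the parentheses are required
sel = toy.loc[(toy['TEMP_2M'] > 2) & (toy.index.hour <= 11)]
print(sel)
```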

### Resampling, grouping

For the next examples, we are going to need our original data back again:

In [None]:
df = pd.read_csv('data/data_Zhadang.csv', index_col=0, parse_dates=True)

We already learned last week that we can compute monthly averages very easily using the `resample` function:

In [None]:
monthly_mean_ts = df.resample('MS').mean()

`resample` is "time aware", meaning that it understands the time and makes groups of the months. What you make with these groups is up to you (here we used `.mean()`, but last week we also used `.max()` for example).
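
Several statistics can also be computed for each group in one go. This is a minimal sketch on made-up daily data (the variable names are hypothetical), using the `agg()` method of the resampled object:

```python
import numpy as np
import pandas as pd

# Made-up daily data covering October and November 2010
days = pd.date_range('2010-10-01', '2010-11-30', freq='D')
toy = pd.DataFrame({'TEMP_2M': np.arange(len(days), dtype=float)}, index=days)

# One group per month; agg() applies several statistics at once
stats = toy['TEMP_2M'].resample('MS').agg(['mean', 'min', 'max'])
print(stats)
```

The result has one row per month and one column per statistic.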

Pandas provides an even more general way to make "groups". This is the `groupby` function. We are going to use it:

In [None]:
monthly_mean_groups = df.groupby(df.index.month).mean()

**Q: Print and plot the variables `monthly_mean_ts` and `monthly_mean_groups`. What are the differences between the two? What is the same? What would you use each of them for?**

In [None]:
# your answer here

### Renaming columns

Let's start by creating a useful dataframe first:

In [None]:
monthly_mean = df.resample('MS').mean()
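# Select the column explicitly, so that a Series (not a DataFrame) is assigned
monthly_mean['TEMP_MAX'] = df['TEMP_2M'].resample('MS').max()
monthly_mean['TEMP_MIN'] = df['TEMP_2M'].resample('MS').min()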

I can list all the available columns easily, since `columns` is a dataframe attribute:

In [None]:
monthly_mean.columns

One of the possible ways to rename columns of a dataframe is simply to reassign this attribute:

In [None]:
monthly_mean.columns = ['T_Mean', 'T_Max', 'T_Min']

Print the dataframe to see how it looks.
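
Another possible way is the `rename()` method, which only touches the columns you name. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'TEMP_2M': [1.0, 2.0], 'TEMP_MAX': [3.0, 4.0]})

# rename() takes an {old_name: new_name} mapping and returns a new dataframe;
# columns that are not listed in the mapping are left untouched
renamed = toy.rename(columns={'TEMP_2M': 'T_Mean'})
print(renamed.columns)
print(toy.columns)  # the original dataframe is unchanged
```

Unlike reassigning `columns`, this does not depend on the order of the columns.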

## More about plotting

We have now seen that making a plot from a dataframe is a very easy task. Here we are going to show some ways to make other meaningful plots.

### Figure size

You can define the size of the figure with the following keyword:

In [None]:
monthly_mean.plot(figsize=(14, 5));

### Matplotlib machinery

Matplotlib is the actual "machine" doing the plots. We don't see it, because `pandas` is actually calling `matplotlib` internally. But pandas' plots are still "customizable", as we are going to see.
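
One way to see this is to look at what `.plot()` returns. In the sketch below (made-up data), the returned object is a matplotlib axes, which can then be modified with matplotlib's own methods. The `matplotlib.use('Agg')` line is only there so that the example also runs as a plain script; skip it in a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # draw without opening a window (skip this in a notebook)
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({'T_Mean': [1.0, 2.0, 1.5]})

# pandas returns the matplotlib axes it has drawn on
ax = toy.plot()
print(type(ax))

# Any matplotlib method can then be used, e.g. to add a horizontal line
ax.axhline(1.5, color='k', linestyle='--')
n_lines = len(ax.get_lines())
print(n_lines)
```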

### Add units and title to the axes

In [None]:
ax = monthly_mean.plot()
ax.set_ylabel('Temp (°C)');
ax.set_title('Monthly averages, min and max');

You can use latex formatting too!

In [None]:
ax = monthly_mean.plot()
ax.set_title('$y = x^2 - x + \\frac{1}{2}$');

### Multiple plots 

In [None]:
# Make one figure, two axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# plot the first dataframe in the ax1
monthly_mean.plot(ax=ax1);
# plot the second dataframe in the ax2
df.plot(ax=ax2);
# Rename the axes
ax1.set_ylabel('Temp (°C)');
ax2.set_ylabel('Temp (°C)');
ax1.set_xlabel('');
ax2.set_xlabel('');
# Add titles
ax1.set_title('Monthly averages, min and max');
ax2.set_title('Hourly temperature');
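
The same function can stack the plots vertically and let them share their x-axis. The sketch below uses made-up numbers and writes the figure to memory instead of a file; to write a file you would pass a file name of your choice to `savefig` instead. Again, `matplotlib.use('Agg')` is only needed outside of a notebook:

```python
import io

import matplotlib
matplotlib.use('Agg')  # draw without opening a window (skip this in a notebook)
import matplotlib.pyplot as plt

# Two rows, one column, with a common x-axis
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
axes[0].plot([0, 1, 2], [1.0, 2.0, 1.5])
axes[1].plot([0, 1, 2], [-3.0, -1.0, -2.0])
axes[0].set_title('Top panel')
axes[1].set_title('Bottom panel')
fig.tight_layout()  # avoid overlapping titles and labels

# Save the figure as PNG into a memory buffer
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
```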

### Make the plots more pretty (a matter of taste)

It is a matter of taste, but people say that matplotlib has bad figure style defaults (this changed recently with the update to Matplotlib 2, which has **much** nicer colors and prettier defaults than version 1).

Still, some people decided to make their own custom styles, for example the people who developed the [seaborn](https://stanford.edu/~mwaskom/software/seaborn/) library. Seaborn just needs to be imported, after which the plots can be made to look quite different:

In [None]:
import seaborn as sns

In [None]:
monthly_mean.plot()

Now if you want, try controlling some of the [figure aesthetics](https://stanford.edu/~mwaskom/software/seaborn/tutorial/aesthetics.html) using other set-ups:

In [None]:
# Setting new defaults. See the link above for more options
sns.set_style('ticks')
sns.set_context('talk')
sns.set_style("dark")

In [None]:
monthly_mean.plot();

## Further kinds of plots

Time series are far from the only way to represent data. Here we show three other ways to represent the data.

**Q: Do you know what each of these plots is used for? Do you see them often?**

### Scatterplots

In [None]:
monthly_mean.plot(kind='scatter', x='T_Min', y='T_Max', 
                  s=90, c='C2', edgecolor='k', figsize=(8, 6));

### Histogram plots

In [None]:
df.plot(kind='hist', bins=30, figsize=(8, 6));

### Boxplots

In [None]:
monthly_mean.plot(kind='box', figsize=(8, 6));

### More exotic things

In [None]:
sns.violinplot(data=monthly_mean);

In [None]:
sns.pairplot(monthly_mean);