# Notebook 5: Pandas DataFrames and analyzing data
*Developed by Johannes Haas and Raoul Collenteur*

In this notebook we will look into Pandas. Pandas (http://pandas.pydata.org) is the Python data analysis package and can be used for many different tasks. In a previous lecture we already used the Pandas library to read CSV files (`pd.read_csv`) and saw that this function returns a Pandas Series or DataFrame. We have also explored some of the basic statistics available in Pandas. In this lecture we will look a little closer at Pandas, its most common data types, and its powerful capabilities.

In [None]:
# Import packages
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [None]:
#Let's import Pandas
import pandas as pd

In [None]:
# Remember this one?
pd.read_csv?

# Data
Before we start, let's load some data. In this lecture we will look at data on the slope of the beach face and the sediment size. Bujan et al. (2018) compiled a dataset from 78 peer-reviewed articles. This dataset is available on Zenodo:

*Bujan, Nans, Cox, Ronadh, & Masselink, Gerd. (2018). From fine sand to boulders: examining the relationship between beach-face slope and sediment size. Dataset and references. [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3241984*

The file stores its information in several columns. We want to use this dataset to explore the relationship between the size of the sediments and the slope of the beach face. What would be our hypothesis of this relationship?

### 1. Read the data using the Pandas read_csv function
First, have a look at the CSV file (`Size-Slope-Data-Points.csv`) in any text editor and think about which arguments you probably need to provide to the Pandas `read_csv` function.

In [None]:
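# Pass the file name; add further arguments (e.g. sep, skiprows) if the file needs them
data = pd.read_csv("Size-Slope-Data-Points.csv")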

In [None]:
# What is the data type?
type(data)

In [None]:
# Print the first 3 rows
data.head(3)
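
As a hedged illustration of how `read_csv` arguments work, the sketch below parses a small made-up table from an in-memory string. The content, the column name `D50 (mm)`, the semicolon separator, and the comment line are all assumptions for this demonstration. The real file may need different arguments.

```python
import io

import pandas as pd

# Hypothetical file content: NOT the real dataset, just a stand-in
raw = """# made-up example table
Reference;Slope (deg);D50 (mm)
SiteA;2.5;0.30
SiteB;8.0;12.00
"""

# sep sets the column separator; comment skips lines starting with '#'
demo = pd.read_csv(io.StringIO(raw), sep=";", comment="#")

# Check that the numeric columns were parsed as numbers
print(demo.dtypes)
```

Checking `dtypes` right after reading is a quick way to spot a wrong separator or a header that ended up in the data.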

### 2. Accessing data in Pandas DataFrames
Accessing data in a Pandas DataFrame is similar to accessing data in a NumPy array, but it goes through the `.loc` indexer: `data.loc[row, column]`. The row has to be a value from the index and the column a value from the column names. Plain square brackets (`data[column]`) select a column by name. We can use the following attributes of the DataFrame object to find out which row and column values we can use:

In [None]:
data.columns
#data.index

In [None]:
#data[column] # Plain brackets select a column by name
#data.loc[row, column] # Select by row and column labels
#data.loc[row, [col, col]] # Multiple columns
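
The label-based access patterns can be tried on a tiny DataFrame. This is a minimal sketch: the values, the index labels `a`, `b`, `c`, and the column name `D50 (mm)` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame(
    {"Slope (deg)": [2.5, 8.0, 11.0], "D50 (mm)": [0.3, 12.0, 40.0]},
    index=["a", "b", "c"],
)

# One value: row label "b", column label "Slope (deg)"
single = demo.loc["b", "Slope (deg)"]

# One row, two columns: the result is a Series
subset = demo.loc["a", ["Slope (deg)", "D50 (mm)"]]

# Plain square brackets select a whole column by name
column = demo["D50 (mm)"]

print(single)
```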

### Other ways of accessing data
Sometimes you want to access rows and columns by their integer position instead of by label. For this, Pandas DataFrames have a separate indexer named `iloc` (`data.iloc[row, column]`).

In [None]:
#data.iloc[row_id, col_id]
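
To make the difference between position-based and label-based access concrete, here is a small sketch on made-up data. The column names `x` and `y` and the index labels are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.0, 2.0, 3.0]},
    index=["a", "b", "c"],
)

# Position-based: first row, second column
first = demo.iloc[0, 1]

# Negative positions count from the end, as with lists
last_row = demo.iloc[-1]

# Label-based lookup of the same cell as `first`
same = demo.loc["a", "y"]

print(first == same)
```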

### 3. Slicing data and using basic comparisons
Slicing data is very similar to slicing lists and NumPy arrays. Basic comparison operators also work similarly and are a powerful tool for selecting certain data within a DataFrame.

In [None]:
data.iloc[-1:-5:-1]

In [None]:
data.loc[data["Reference"] == "Komar1998"]
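
Comparisons return a boolean Series, and several of them can be combined with `&` (and) or `|` (or). The sketch below uses made-up data with a hypothetical `value` column. Note that each comparison needs its own parentheses.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "Reference": ["SiteA", "SiteA", "SiteB", "SiteB"],
    "value": [1.0, 5.0, 2.0, 7.0],
})

# Each comparison gives a boolean Series; & combines them element-wise
mask = (demo["Reference"] == "SiteB") & (demo["value"] > 3.0)

# Use the mask for the rows and a label for the column
selected = demo.loc[mask, "value"]

print(selected)
```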

### In-class exercise: Select data from the DataFrame
Select all "Slope (deg)" values higher than 10.0 for the Reference "Brayne2015".

### Plotting data of Pandas DataFrames

In [None]:
data.plot()
#data.plot(subplots=True)

In [None]:
#data.plot.scatter()

### Statistics in Pandas DataFrames

In [None]:
data.describe()

In [None]:
# numeric_only=True (pandas >= 1.5) skips text columns such as "Reference"
data.corr(numeric_only=True)

In [None]:
data.loc[:, "Slope (deg)"].sum()

In [None]:
#data.min(numeric_only=True)
#data.std(numeric_only=True)
#data.max(numeric_only=True)
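
The statistics methods also work on a selection of columns. As a small sketch on made-up data (the column names `label`, `x`, and `y` are hypothetical), the correlation between two columns can be computed either for one pair or as a matrix:

```python
import pandas as pd

# Made-up values for illustration only: y is exactly 2 * x
demo = pd.DataFrame({
    "label": ["p", "q", "r", "s"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})

# Pearson correlation between two columns
r = demo["x"].corr(demo["y"])

# Correlation matrix of the numeric columns only
matrix = demo[["x", "y"]].corr()

# Column-wise mean of the numeric columns
means = demo[["x", "y"]].mean()

print(matrix)
```

Selecting the numeric columns first avoids errors caused by text columns.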

### Pandas Time Series functionality
Among the coolest features of Pandas are its time series capabilities. When the index is a collection of datetimes, we can look at the data as a time series and use the built-in functionality for this. First, let's create some random data to play with.

In [None]:
# Let's create some random data to play with
index = pd.date_range(start="2000-01-01", end="2009-12-31")
values = 1 + np.sqrt(0.01 * np.arange(len(index))) + np.random.rand(len(index))
ts = pd.Series(values, index)

print("The type of variable ts is:", type(ts))

ts.head()

In [None]:
# Let's look at the index, which is a DatetimeIndex
ts.index

### Time series
Pandas will recognize the fact that `ts` is a time series. When you plot the time series, the dates from the index are used as axis labels.

In [None]:
ts.plot()

In [None]:
# Indexing time series is easy
ts.loc["2000"].max()

#ts.loc["2000": "2005-06-01"]
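
Date strings can select a whole year, a month, or a range in between. The sketch below uses a made-up two-year daily series to show this, plus one possible way of aggregating by calendar month with `groupby`. The series `demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily series covering 2000 and 2001
index = pd.date_range(start="2000-01-01", end="2001-12-31")
demo = pd.Series(np.arange(len(index), dtype=float), index=index)

# A year string selects every day in that year
year_2000 = demo.loc["2000"]

# Slices with date strings include the end point (all of May here)
spring = demo.loc["2000-03":"2000-05"]

# Group by calendar month (1-12) and average over both years
monthly_mean = demo.groupby(demo.index.month).mean()

print(len(year_2000), len(spring), len(monthly_mean))
```

Unlike ordinary Python slices, label-based slices on a `DatetimeIndex` include both end points.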

### In-class exercise
Calculate the average value for each year and plot these averages as red dots on top of the actual data.

### Writing .xlsx and .csv files with Pandas
Storing the results of your analysis can be a very important way to communicate with your future employers. CSV files and Excel files are common formats for this and can easily be sent by email. Pandas has methods for different file formats. They all start with `to_`, e.g., `data.to_csv` or `data.to_excel`.


In [None]:
#data.to_csv("test_data.csv")
#data.to_excel("test_data.xlsx")
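
A quick way to check what `to_csv` writes is a round trip: write the data, read it back, and compare. The sketch below uses a made-up DataFrame and an in-memory buffer instead of a file on disk, so the names and values are illustrative assumptions.

```python
import io

import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({"x": [1.5, 2.5]}, index=["a", "b"])

# Write to an in-memory text buffer; the index becomes the first column
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without `index_col=0`, the old index comes back as an ordinary, unnamed column.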

### In-class exercise: Let's look at the actual data!
That's it for the introduction to Pandas. Now, let us use what we have learned to study the relationship between the beach-face slope and the sediment size (clast size). Perform the following steps:

1. Define a hypothesis (What do you expect is the relationship?)
2. Choose two variables you think are useful to test the hypothesis
3. Create a scatter plot of these two variables
4. Calculate the correlation between these two variables
5. Draw a conclusion 
