# Pandas

In this notebook, we introduce Pandas, a library that provides data structures and analysis tools for data science. Much of what we'll cover here involves the DataFrame data structure. As before, refer to the documentation whenever you need to: https://pandas.pydata.org/

## Getting started with Pandas

Start by loading the `pandas` library (with alias `pd`).

In [None]:
# import the library


Load the provided dataset `data/airfoil.csv` using the [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function; call the DataFrame you load `df`.

In [None]:
# load the dataframe, use "head" to have a look


Note that the `read_csv()` function is very flexible and can accommodate all sorts of files.
You will look at this in much more detail in module 2.
For now, we're giving you a nicely formatted dataset that works directly with Pandas.
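
To give a flavour of that flexibility, here is a minimal sketch that parses a small semicolon-separated text with a comment line and no header row. The text, the column names `x` and `y`, and the variable `toy` are made up for illustration; they have nothing to do with the airfoil dataset.

```python
import io

import pandas as pd

# Hypothetical raw text: a comment line, ';' as separator, no header row
raw = "# toy measurements\n1.0;10\n2.5;20\n4.0;30\n"

toy = pd.read_csv(
    io.StringIO(raw),   # read_csv accepts file-like objects as well as paths
    sep=";",            # field separator
    comment="#",        # skip anything after a '#'
    header=None,        # the text has no header row
    names=["x", "y"],   # so we supply the column names ourselves
)
print(toy)
```

The same keyword arguments work when the first argument is a path to a file on disk.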

`df` is a pandas DataFrame object. DataFrame objects have many methods and attributes which make data science easier! Use the `.head()` method to show what `df` looks like.

In [None]:
# Use .head() on df


Let's load another dataset into another dataframe and call it `smalldf`. The data is located at `data/smalldata.csv`. 

First, have a look at the data yourself. If you're on a unix system, you can do this from the command line like this:

In [None]:
# Edit this if you're on Windows
!cat data/smalldata.csv  # note that this cell is runnable inside the notebook!

See what happens if you just run `pd.read_csv()` on it.

In [None]:
# read it with pandas (don't save it, just allow it to print)


DataFrames have column names and an index. Import the csv and save it as a DataFrame called `smalldf`. Check the documentation (or google!) and:
* make the column names: `['Age', 'Weight', 'Gender', 'Name']`
* specify `Name` as the index

Display `smalldf` in the notebook and check it against what you did above.
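
As a hint, the sketch below shows the `names` and `index_col` arguments of `pd.read_csv()` on a made-up text without a header row. The data, the column names (`Size`, `Colour`, `Label`) and the variable `toy` are hypothetical; adapt the idea to `data/smalldata.csv` yourself.

```python
import io

import pandas as pd

# Hypothetical raw text without a header row
raw = "12.5,red,a\n7.0,blue,b\n9.1,green,c\n"

toy = pd.read_csv(
    io.StringIO(raw),
    header=None,                        # do not treat the first row as column names
    names=["Size", "Colour", "Label"],  # supply column names explicitly
    index_col="Label",                  # use one of the named columns as the index
)
print(toy)
```

Once a column becomes the index, it no longer appears in `toy.columns`.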

In [None]:
# add your code here to load smalldf


### Retrieving some basic information

Now that you have a DataFrame object you can explore it (by typing `df.<TAB>` in a cell, you'll see all the methods and attributes)!

Examples of useful attributes:

* `shape` stores the dimensions of the data frame
* `columns` stores the names of the columns 
* `index` stores the names of the rows, by default pandas uses a range from 0 to the number of rows
* `dtypes` stores the `dtype` of each column

Show all of these and check that they match what you expected from the output of `head` used earlier.

You can also use `df.describe()` to get a "description" of your data (per column, a number of standard statistics such as the count, mean, standard deviation, minimum, quartiles and maximum).
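
Here is a small sketch of these attributes and of `describe()` on a made-up DataFrame (the name `toy` and its columns `a` and `b` are purely illustrative):

```python
import pandas as pd

# Hypothetical toy DataFrame with one integer and one float column
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0.5, 1.5, 2.5, 3.5]})

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # column names
print(toy.index)    # row labels, a default range here
print(toy.dtypes)   # one dtype per column

summary = toy.describe()  # itself a DataFrame, one row per statistic
print(summary)
```

Because `describe()` returns a DataFrame, you can pick out a single statistic with, for example, `summary.loc["mean", "a"]`.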

In [None]:
# add your code here to explore df's attributes


In [None]:
# use `df.describe()`


## Accessing elements in a dataframe

Let's get the value of the 1st column (`Age`) of the 3rd row (`Maria`) of `smalldf`. This can be done in several ways; which one is most convenient will depend on the situation:

1. using `iloc`
1. using `loc`
1. each column of a `DataFrame` is a `Series` object: you can first get the column, then access the relevant element

Try each of these methods below.

**Note**: remember that indexing in Python starts at 0.
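
The three approaches can be sketched on a made-up DataFrame as follows (the names `toy`, `Height` and `Score` and the row labels are hypothetical, not taken from `smalldf`):

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels
toy = pd.DataFrame(
    {"Height": [150, 160, 170], "Score": [7.5, 8.0, 6.5]},
    index=["x", "y", "z"],
)

by_position = toy.iloc[2, 0]       # iloc: integer positions (3rd row, 1st column)
by_label = toy.loc["z", "Height"]  # loc: row label and column name
via_series = toy["Height"]["z"]    # get the column as a Series, then index it by label

print(by_position, by_label, via_series)
```

All three expressions return the same element; `iloc` is handy when you know positions, `loc` when you know labels.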

In [None]:
# add your code here


### Using loc for fancy selections

Using `.loc`, can you retrieve the sub-dataframe of `df` with all the columns whose name has strictly more than 15 characters? Call this `df2`.
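
One possible approach, sketched here on a made-up DataFrame with a shorter length threshold, is to build a boolean mask over the column names and pass it to `.loc`. The names `toy`, `mask` and `subset` and the threshold of 5 are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical toy DataFrame with column names of different lengths
toy = pd.DataFrame(
    {"id": [1, 2], "temperature": [20.5, 21.0], "wind_speed": [3.2, 4.1]}
)

# Boolean mask: True for columns whose name has strictly more than 5 characters
mask = toy.columns.str.len() > 5

# ':' keeps every row, the mask selects the columns
subset = toy.loc[:, mask]
print(subset)
```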

In [None]:
# add your code here


Using `to_csv`, output `df2` as a tab-separated file (not comma-separated) and call the file `airfoil_2.dat`.
(Open the file in an editor to check that it matches what you expect.)
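
If you would rather check the output without leaving the notebook, the sketch below writes a made-up DataFrame to an in-memory buffer instead of a file and reads it back. The names `toy`, `buffer` and `round_trip` are illustrative.

```python
import io

import pandas as pd

# Hypothetical toy DataFrame
toy = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")  # the index is written as the first column by default
text = buffer.getvalue()
print(text)

# Read it back, telling pandas about the separator and the index column
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t", index_col=0)
print(round_trip)
```

Pass `index=False` to `to_csv` if you do not want the index written out.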

In [None]:
df2.to_csv("airfoil_2.dat", sep='\t')

### Working with a pd.Series

Retrieve the series corresponding to the sound pressure from the dataframe, then:

* show the name of the series (it should be `Sound pressure [dB]`)
* show the shape of the series (it should be `(1503,)`) 
* show the mean and the median (resp. `124.84` and `125.72`)
* the mean of the squared values (it should be `15631.57`)
* Check [the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) and try some other methods and attributes
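
The same operations are sketched below on a made-up series; the column name `Speed [m/s]` and its values are hypothetical, so the numbers will not match the ones quoted above.

```python
import pandas as pd

# Hypothetical toy DataFrame with a single column
toy = pd.DataFrame({"Speed [m/s]": [1.0, 2.0, 3.0, 6.0]})

s = toy["Speed [m/s]"]   # selecting one column returns a Series

print(s.name)            # the column name
print(s.shape)           # one-dimensional: (number of rows,)
print(s.mean())          # arithmetic mean
print(s.median())        # median
print((s ** 2).mean())   # arithmetic is element-wise, so this is the mean of the squares
```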

In [None]:
# add your code here


### Getting the raw values

It can sometimes be useful to access the elements of a dataframe as a raw numpy array. For this you simply need to access the `.values` attribute of a series or dataframe.

* retrieve a raw array containing the sound pressures
* retrieve a raw array containing the frequencies and the sound pressures (watch the type inference!)
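
To see what "type inference" refers to, here is a sketch on a made-up DataFrame with one integer and one float column (the names `toy`, `count` and `ratio` are illustrative):

```python
import pandas as pd

# Hypothetical toy DataFrame mixing an integer and a float column
toy = pd.DataFrame({"count": [1, 2, 3], "ratio": [0.1, 0.2, 0.3]})

one_col = toy["count"].values              # 1-D array, keeps the integer dtype
two_cols = toy[["count", "ratio"]].values  # 2-D array with a single common dtype

print(one_col.dtype, one_col.shape)
print(two_cols.dtype, two_cols.shape)
```

A numpy array holds a single dtype, so the integers are upcast to floats when both columns are combined. The `.to_numpy()` method returns the same kind of array and is the alternative recommended in recent pandas documentation.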

In [None]:
# add your code here


### (Bonus) Joining dataframes

Check the pandas documentation on the [.join method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) to work out how to join `smalldf` with the following data:

```
{'Salary': [100, 150, 110, 90, 105, 500], 'Education': [5, 10, 7, 3, 4, 0]}
```
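
As a hint, `.join` aligns rows on the index, so the new data needs an index that matches the existing one. The sketch below shows one way to do this on made-up data; the names `left`, `extra` and `joined` and all values are hypothetical, and reusing the left index is an assumption about how the rows correspond.

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels
left = pd.DataFrame({"Age": [30, 40]}, index=["p", "q"])

# Wrap the new data in a DataFrame that shares the same index (assumed row order)
extra = pd.DataFrame({"Salary": [100, 150]}, index=left.index)

# join aligns the two frames on their index
joined = left.join(extra)
print(joined)
```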

In [None]:
# add your code here
