# Exploring CSV data with pandas

This workbook is a classroom exercise in using `pandas` to explore data. It should quickly show how much more powerful `pandas` is than Python's basic CSV tools.

Before you proceed, you'll need `pandas` available in your Python environment (e.g. `conda install pandas` if you use `conda`).

- you should **run each Python cell as you come to it**
- return a DataFrame or Series as the final item in a cell to display it.

With that done: let's load the flights CSV file.

In [None]:
import pandas as pd

flights_raw = pd.read_csv("../../data/flights.csv").iloc[:, 1:]

# that `iloc` incantation is just removing the first, unnecessary column

flights_raw

# don't forget to run me!

1. How many flights are in the system? The `shape` property on a DataFrame will help you determine this.

In [101]:
# use `shape` to summarise how many rows are in the DataFrame - don't just copy it from a table printout!

## Filtering and sorting data

`query`

selects rows according to whatever conditions we specify, e.g.:

```python
data.query('origin == "EWR"')
data.query('month == 12')
```

Note that the query is a Boolean expression, provided as a string (here wrapped in single quotes, `''`).

Inside the query, column names are unquoted and string values are quoted using `""` (double quotes).

We can refer to columns containing spaces by enclosing them in backticks ``.

We can also refer to variables in the environment using the @ prefix.

```python
six_am = 600
data.query('dep_time < @six_am')
```
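The quoting rules above can be tried out on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the flights file.

```python
import pandas as pd

# Made-up data for illustration only; not the flights dataset.
toy = pd.DataFrame({
    "origin": ["EWR", "JFK", "EWR", "LGA"],
    "dep_time": [545, 610, 1230, 555],
    "air time": [120, 95, 210, 60],  # a column name containing a space
})

# String values take double quotes inside the single-quoted query string.
from_ewr = toy.query('origin == "EWR"')

# Backticks wrap a column name that contains a space.
long_flights = toy.query('`air time` > 100')

# The @ prefix pulls in a Python variable; `and` combines two conditions.
six_am = 600
early_ewr = toy.query('origin == "EWR" and dep_time < @six_am')

early_ewr
```

Each call returns a new, filtered DataFrame, so conditions can also be applied one after another by chaining `query` calls.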

`sort_values`

returns a _copy_ of the DataFrame, sorted by ascending column value:

```python
data.sort_values('dep_time')
```
...or by descending value using `ascending=False`:

```python
data.sort_values('arr_delay', ascending=False)
```

The original DataFrame is unchanged.
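To see the copy behaviour for yourself, here is a small sketch on a made-up frame (the `toy` name and its values are illustrative assumptions):

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "dep_time": [900, 545, 1230],
    "arr_delay": [5, -3, 40],
})

# sort_values hands back a new, sorted DataFrame...
by_delay = toy.sort_values("arr_delay", ascending=False)

# ...while the frame we started with keeps its original row order.
original_order = toy["dep_time"].tolist()
sorted_order = by_delay["dep_time"].tolist()

by_delay
```

If you want to keep the sorted result, assign it to a name, as `by_delay` does here.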

### Filtering exercise

The flights data table isn't quite in date order.

1. Use the `tail` method to prove that the last flights in `flights_raw` are not from December.



2. Make a new DataFrame called `flights_sorted`, sorted by year/month/day. (To sort a DataFrame by _multiple_ columns, pass a list of column names to `sort_values`.)

3. Let's filter out all the flights that never took off.

- In a `query` string, you can query for non-NA values like this: `your_data_frame.query('column.notna()')`
- Filter your sorted DataFrame to keep only the flights whose departure time is not NA

4. What percentage of flights actually took off?
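The `notna()` hint from step 3 can be tried on a small frame before you apply it to the flights. This is a sketch on made-up data: the `toy` frame and its `arr_time` values are assumptions, and `engine="python"` is passed only as a precaution in this sketch.

```python
import pandas as pd

# Made-up data for illustration only; None becomes a missing value (NaN).
toy = pd.DataFrame({
    "flight": [1, 2, 3, 4],
    "arr_time": [830.0, None, 1015.0, None],
})

# Keep only the rows where arr_time has a value.
landed = toy.query("arr_time.notna()", engine="python")

landed
```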


## Making new columns

Making new columns is straightforward: you can use `[]=` to make a new column based on other data. A contrived example:

```python
flights['halfway'] = flights['distance'] / 2
```
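A new column can also combine several existing columns, row by row. The sketch below uses a made-up `toy` frame; the column names `air_hours` and `speed` are invented for this example.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "distance": [200, 1400, 760],
    "air_time": [60, 240, 120],  # minutes, in this made-up example
})

# A new column from one existing column...
toy["air_hours"] = toy["air_time"] / 60

# ...and another from two columns; the arithmetic is applied row by row.
toy["speed"] = toy["distance"] / toy["air_hours"]

toy
```

Note that assigning with `[]=` modifies the DataFrame in place, unlike `query` and `sort_values`, which return new frames.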

### Exercise

1. Add a new column, `catch_up`, to the DataFrame that describes how many minutes the aircraft made up in flight. Use the `dep_delay` and `arr_delay` columns to do so.

2. How many flights in the dataset made up time in transit?

## Grouping

We can use `groupby()` to group data by one or more columns. This is very useful for computing aggregate statistics.

In [None]:
# for instance: we could use `query` to find out how many flights were scheduled to take off from Newark:

flights_raw.query('origin == "EWR"').shape[0]


In [None]:
# But we could also group by origin, and then calculate the `size` of each group:
flights_raw.groupby('origin').size()
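Grouping is not limited to counting rows. As a sketch on made-up data (the `toy` frame and its delay values are assumptions), you can select a column from the grouped object and apply an aggregate such as `mean()`:

```python
import pandas as pd

# Made-up data for illustration only; not the flights dataset.
toy = pd.DataFrame({
    "origin": ["EWR", "JFK", "EWR", "LGA", "JFK", "EWR"],
    "dep_delay": [10, 0, 20, 5, 30, 0],
})

# Number of rows in each group, as a Series indexed by origin.
counts = toy.groupby("origin").size()

# The same count for one airport, done the `query` way.
ewr_count = toy.query('origin == "EWR"').shape[0]

# An aggregate statistic: mean departure delay per origin.
mean_delay = toy.groupby("origin")["dep_delay"].mean()

mean_delay
```

Both results are Series indexed by the grouping column, so `counts["EWR"]` looks up a single group.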

### Exercise

1. What are the 5 most popular destinations in the year? (Group the flights by `dest`, get the `size()` of each value, and `sort_values` with `ascending=False`)

2. Calculate how many flights there were per day. (You can group by a _list_ of columns!) Is this dataset familiar?

3. Calculate how many flights there were per day, per origin airport

## Exporting frames to CSV

Finally, we can export DataFrames to CSV.

Take your grouped flights per day, per airport, and use the `to_csv` method on the result, with a suitable filename, to export your data:

`data.to_csv('my_filename.csv')`
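A grouped result carries its group labels in the index, so it can help to turn them back into ordinary columns before exporting. The sketch below uses made-up data and writes to an in-memory buffer instead of a file so it leaves nothing on disk; the `toy` frame and the `n_flights` column name are assumptions.

```python
import io

import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"origin": ["EWR", "JFK", "EWR"]})

# size() returns a Series indexed by origin; reset_index turns it into
# a two-column DataFrame with a named count column.
counts = toy.groupby("origin").size().reset_index(name="n_flights")

# to_csv accepts a filename or a file-like object; index=False leaves
# out the row numbers.
buffer = io.StringIO()
counts.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

# Reading the text back gives an equivalent DataFrame.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

round_trip
```

To write a real file, pass a filename in place of `buffer`.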

## Finally

If you reach the end of the workbook, take the CSV file above, or another set of data of your own choosing, and explore visualising it in a tool of your choice: a spreadsheet, a graphing library, P5, etc.