<a href="https://colab.research.google.com/github/SoIllEconomist/ds4b/blob/master/python_ds4b/01_exploration/02_data_transformation/02_data_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Transformation

## Introduction

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the pandas library and a new dataset on flights departing New York City in 2013.

### Prerequisites

In this chapter we’re going to focus on how to use pandas, the core Python library for working with tabular data. We’ll illustrate the key ideas using the NYC flights data, and use `seaborn` to help us understand the data.

In [0]:
import pandas as pd
flights = pd.read_csv("flights.csv")

### NYC Flights Dataset

To explore basic data manipulation with `pandas`, we’ll use a dataset of flight records. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations. The data comes from the [US Bureau of Transportation Statistics](https://www.kaggle.com/usdot/flight-delays#flights.csv).

You might notice that a large data frame prints a little differently from what you might expect: pandas shows only some of the rows and the columns that fit on one screen. (`flights.head()` returns just the first few rows; evaluating `flights` on its own displays the dataset in truncated form.)

In [0]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
Unnamed: 0        100000 non-null int64
year              100000 non-null int64
month             100000 non-null int64
day               100000 non-null int64
sched_dep_time    100000 non-null float64
dep_delay         98592 non-null float64
arr_time          98478 non-null float64
dept_time         98592 non-null float64
dtypes: float64(4), int64(4)
memory usage: 6.1 MB


`.info()` prints a concise summary of a DataFrame: the index dtype, the column dtypes, the number of non-null values in each column, and the memory usage. In the output above, for example, `dep_delay` has 98,592 non-null values out of 100,000 rows, so some departure delays are missing.

## Pandas Basics

In this chapter you are going to learn the key pandas methods and functions that allow you to solve the vast majority of your data manipulation challenges:

1. Pick observations by their values.
1. Reorder the rows.
1. Pick variables by their names.
1. Create new variables with functions of existing variables.
1. Collapse many values down to a single summary.

These can all be used in conjunction with `groupby()`, which changes the scope of each operation from the entire dataset to one group at a time. Together with `groupby()`, these six operations provide the verbs for a language of data manipulation.
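
As a rough preview, the five operations and `groupby()` could look like this on a tiny, made-up table. The `toy` data frame and its values are purely illustrative, and the methods chosen here (`sort_values()`, column selection with a list, `assign()`, `mean()`) are one common way to express each step in pandas, not necessarily the ones used later in this chapter.

```python
import pandas as pd

# Tiny made-up table; column names mirror the flights data, the values do not.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "day": [1, 2, 1, 2],
    "dep_delay": [5.0, -3.0, 12.0, 0.0],
})

# 1. Pick observations by their values.
picked_rows = toy.query("dep_delay > 0")

# 2. Reorder the rows.
reordered = toy.sort_values("dep_delay")

# 3. Pick variables by their names.
picked_cols = toy[["month", "dep_delay"]]

# 4. Create a new variable from an existing one.
with_new = toy.assign(delay_hours=toy["dep_delay"] / 60)

# 5. Collapse many values down to a single summary.
overall_mean = toy["dep_delay"].mean()

# groupby() changes the scope: the same summary, computed once per month.
mean_by_month = toy.groupby("month")["dep_delay"].mean()

print(overall_mean)
print(mean_by_month)
```

None of these calls changes `toy` itself; each one returns a new object.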

### Query

`query()` allows you to subset observations based on their values. It is a method of the data frame, and its argument is a string containing the expression that filters the rows. You can combine several conditions in that one string with `&`. For example, we can select all flights on January 1st with:

In [0]:
flights.query("month == 1 & day == 1")

When you run that line of code, pandas executes the query and returns a new data frame. By default `query()` does not modify the data frame it is called on, so if you want to save the result, you’ll need to use the assignment operator, `=`:

In [0]:
jan1 = flights.query("month == 1 & day == 1")
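
To check that the original data frame is left untouched, here is a minimal sketch on a made-up three-row table (`toy` and `jan1_toy` are hypothetical names, not part of the flights data):

```python
import pandas as pd

# Made-up stand-in for the flights table.
toy = pd.DataFrame({"month": [1, 1, 2], "day": [1, 2, 1]})

# query() returns a new data frame holding only the matching rows.
jan1_toy = toy.query("month == 1 & day == 1")

print(len(toy))       # the original still has all of its rows
print(len(jan1_toy))  # the result has only the rows that matched
```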


### Comparisons

To use querying effectively, you have to know how to select the observations that you want using the comparison operators. Python provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
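
As a small illustration, each of these operators can appear inside a `query()` string. The table below is made up for the example; only the column names are borrowed from the flights data.

```python
import pandas as pd

# Made-up values, chosen so each comparison keeps a different set of rows.
toy = pd.DataFrame({"month": [1, 6, 12], "dep_delay": [-2.0, 0.0, 45.0]})

late = toy.query("dep_delay > 0")         # strictly greater than
not_early = toy.query("dep_delay >= 0")   # greater than or equal to
before_july = toy.query("month < 7")      # strictly less than
up_to_june = toy.query("month <= 6")      # less than or equal to
not_january = toy.query("month != 1")     # not equal
december = toy.query("month == 12")       # equal

print(len(late), len(not_early), len(before_july))
print(len(up_to_june), len(not_january), len(december))
```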

When you’re starting out with Python, the easiest mistake to make is to use `=` instead of `==` when testing for equality. When this happens you’ll get an error:

In [0]:
flights.query("month =1")
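
If you want to see the error without stopping the notebook, one option is to catch it. This is only a sketch on a made-up table; the exception is caught broadly because the exact type and message may differ between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({"month": [1, 2, 3]})

error_message = None
try:
    # A single "=" is treated as an assignment, not a comparison.
    toy.query("month = 1")
except Exception as err:  # exact exception type may vary by pandas version
    error_message = f"{type(err).__name__}: {err}"

print(error_message)

# The corrected expression uses "==" and returns the matching rows.
fixed = toy.query("month == 1")
print(len(fixed))
```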

There’s another common problem you might encounter when using `==`: floating point numbers. These results might surprise you!

In [0]:
from math import sqrt

In [0]:
sqrt(2) ** 2 == 2

In [0]:
1/49 * 49 == 1

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every floating point number you see may be an approximation. That is why both comparisons above evaluate to `False`.
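
One common way to handle this, sketched here with the standard library's `math.isclose()`, is to test whether two numbers are approximately equal instead of exactly equal. The variable names are illustrative only.

```python
from math import isclose, sqrt

# Exact equality fails because of tiny rounding errors.
exact_sqrt = sqrt(2) ** 2 == 2
exact_frac = 1 / 49 * 49 == 1

# isclose() compares within a small relative tolerance instead.
close_sqrt = isclose(sqrt(2) ** 2, 2)
close_frac = isclose(1 / 49 * 49, 1)

print(exact_sqrt, exact_frac)
print(close_sqrt, close_frac)
```

The default tolerance of `isclose()` is relative, so comparisons against zero usually need an explicit `abs_tol` argument.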

## Logical Operators