# Best Practices

This document collects some of the best practices used elsewhere in the pandas documentation.
Together, they lead to a style of code lovingly referred to as *pandorable*. We encourage
you to apply these practicies when using pandas.

In [1]:
import pandas as pd
pd.options.display.max_rows = 10

## Use method chaining

Compare the following two stories (credit to [Jeff Allen](http://trestletech.com/wp-content/uploads/2015/07/dplyr.pdf))

First,

```python
on_hill = went_up(jack_jill, 'hill')
with_water = fetch(on_hill, 'water')
fallen = fell_down(with_water, 'jack')
broken = broke(fallen, 'jack')
after = tmple_after(broken, 'jill')
```

and second,

```python
(jack_jill
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
    .broke("crown")
    .tumble_after("jill"))
```

I hope you agree that the second story, written in a method chaining style, is easier to follow. It avoids uninteresting intermediate variables, generally making things easier to read.

In [3]:
%%file cat.yaml
metadata:
  version: 1
sources:
  airlines:
    description: airlines
    driver: csv
    args:
      urlpath: "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airlines.csv"

  airports:
    description: airports
    driver: csv
    args:
      urlpath: "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"

  planes:
    description: planes
    driver: csv
    args:
      urlpath: "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/planes.csv"

  weather:
    description: weather
    driver: csv
    args:
      urlpath: "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/weather.csv"
        
  airports_raw:
    description: "raw airports"
    driver: csv
    args:
        urlpath: "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat"
        csv_kwargs:
            header: null
            names:
                - "id"
                - "name"
                - "city"
                - "country"
                - "faa"
                - "icao"
                - "lat"
                - "lon"
                - "alt"
                - "tz"
                - "dst"
                - "tzone"
        

Overwriting cat.yaml


As a concrete example, we'll look at the light pre-procesing done to the `airports` datset following Hadley Wickham's [nycflights13 package](https://github.com/hadley/nycflights13/blob/master/data-raw/airports.R).

In [4]:
names = ["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"]

# TODO: check why intake is dropping rows.
airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",
                           header=None, names=names)
airports_raw.head()

Unnamed: 0,id,name,city,country,faa,icao,lat,lon,alt,tz,dst,tzone
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby
3,4,Nadzab,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569828,146.726242,239,10.0,U,Pacific/Port_Moresby
4,5,Port Moresby Jacksons Intl,Port Moresby,Papua New Guinea,POM,AYPY,-9.443383,147.22005,146,10.0,U,Pacific/Port_Moresby


We'll do a bit of cleaning up including filtering the rows and columns to the values of interest.

In [5]:
airports = (
    airports_raw
        .loc[lambda df: (df['country'] == 'United States') & (df['faa'] != '')]
        [['faa', 'name', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzone']]
        .drop_duplicates(subset="faa")
        .set_index("faa")
)
airports

Unnamed: 0_level_0,name,lat,lon,alt,tz,dst,tzone
faa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
4I7,Putnam County Airport,39.633556,-86.813806,842,-5.0,U,America/New_York
C91,Dowagiac Municipal Airport,41.992934,-86.128012,748,-5.0,U,America/New_York
CDI,Cambridge Municipal Airport,39.975028,-81.577583,799,-5.0,U,America/New_York
SUE,Door County Cherryland Airport,44.843667,-87.421556,725,-6.0,U,America/Chicago
0P2,Shoestring Aviation Airfield,39.794824,-76.647191,1000,-5.0,U,America/New_York
...,...,...,...,...,...,...,...
UCA,Union Station,43.104167,-75.223333,456,-5.0,A,America/New_York
CVO,Corvallis Muni,44.506700,-123.291500,250,-8.0,A,America/Los_Angeles
CWT,Chatsworth Station,34.256944,-118.598889,978,-8.0,A,America/Los_Angeles
DHB,Deer Harbor Seaplane,48.618397,-123.005960,0,-8.0,A,America/Los_Angeles


Most Series or DataFrame methods return a new Series or DataFrame, encouraging this method chaining style. Some notable methods include

1. assign
2. `loc` / `.iloc` / `where` / `__getitem__`.
3. `pipe`

One thing to note, the `assign` and indexing methods will accept callables, which you use to refer to the previous link in the method chain. Consider translating an imperative string of operations like

```python
df1 = pd.read_csv(...)
df1['foo'] = df1['foo'].str.upper()
df1 = df1.loc[df['bar'] > 3]
```

to method chaining style. You'd use callables, often `lambda` functions, to refer to `df1` in subsequent operations.

```python
df = (
    pd.read_csv(...)
    .assign(foo=lambda df: df["foo"].str.upper())
    .loc[lambda df: df["bar"] > 3]
)
```

Finally, pandas provides an escape hatch through the `.pipe` method. With `.pipe`, you can provide any callable that expects a DataFrame (or Series) as it's first argument. For example, we could implement a function approximating the great circle distance between some airport `to` and the rest.

In [8]:
import numpy as np


def great_circle_distance(df, to="DSM"):
    # https://www.johndcook.com/blog/python_longitude_latitude/
    df = df.copy()
    lat = np.deg2rad(90 - df['lat'])
    lon = np.deg2rad(90 - df['lon'])
    
    to_lat, to_lon = df.loc[to, ['lat', 'lon']]
    cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) +
           np.cos(lat) * np.cos(to_lat))
           
    arc = np.arccos(cos)
    kilometers = 6373 * cos
    df[f'km_to_{to}'] = kilometers
    return df

In [9]:
great_circle_distance(airports)

Unnamed: 0_level_0,name,lat,lon,alt,tz,dst,tzone,km_to_DSM
faa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
4I7,Putnam County Airport,39.633556,-86.813806,842,-5.0,U,America/New_York,-611.548347
C91,Dowagiac Municipal Airport,41.992934,-86.128012,748,-5.0,U,America/New_York,-874.022782
CDI,Cambridge Municipal Airport,39.975028,-81.577583,799,-5.0,U,America/New_York,-827.065264
SUE,Door County Cherryland Airport,44.843667,-87.421556,725,-6.0,U,America/Chicago,-1122.842947
0P2,Shoestring Aviation Airfield,39.794824,-76.647191,1000,-5.0,U,America/New_York,-997.427349
...,...,...,...,...,...,...,...,...
UCA,Union Station,43.104167,-75.223333,456,-5.0,A,America/New_York,-1370.623054
CVO,Corvallis Muni,44.506700,-123.291500,250,-8.0,A,America/Los_Angeles,-531.679746
CWT,Chatsworth Station,34.256944,-118.598889,978,-8.0,A,America/Los_Angeles,594.567494
DHB,Deer Harbor Seaplane,48.618397,-123.005960,0,-8.0,A,America/Los_Angeles,-985.750406


Notice that our custom `great_circle_distance` function further encourages method chaining by returning a DataFrame itself.

Appending that to our original method chain, that would be

```python
airports = (
    airports_raw
        .loc[lambda df: (df['country'] == 'United States') & (df['faa'] != '')]
        [['faa', 'name', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzone']]
        .drop_duplicates(subset="faa")
        .set_index("faa")
        .pipe(gcd)
)
```

Additional keyword arguments passed to `.pipe` are passed through to the callable.

```python
airports = (
    ...
    .pipe(gcd, to="ORD")
)
```

## Use Meaningful Labels

Every Series and DataFrame has a `.index` property storing the *row labels*.
Additionally, DataFrame has the `.columns` property for storing *column labels*.

We recommend that you use meaningful labels. Pandas most fundamental operations are all based around the idea of *alignment by label*.

## Avoid duplicate row and column labels

One of pandas' primary roles is to help clean up messy tabular data. As such, it needs to support duplicates in the row labels. That does not mean, however, that you should allow duplicates to stick around; we recommend addressing duplicate labels as early as possible to avoid surpsises later on. Consider one of the most basic opertions: indexing. Duplicate labels can change the behavior in surprising ways.

Pandas follows the NumPy tradition of *reducing dimensionality* when indexing. Slicing a row from a 2-D array returns a 1-D array. Slicing a row and a column returns a scalar. Similarly with pandas.

In [294]:
airports['name']

faa
4I7                Putnam County Airport
C91           Dowagiac Municipal Airport
CDI          Cambridge Municipal Airport
SUE       Door County Cherryland Airport
0P2         Shoestring Aviation Airfield
                     ...                
UCA                        Union Station
CVO                       Corvallis Muni
CWT                   Chatsworth Station
DHB                 Deer Harbor Seaplane
OLT    San Diego Old Town Transit Center
Name: name, Length: 1459, dtype: object

In [295]:
airports.loc['BFT', 'name']

'Beaufort'

But, when there are duplicates in the index, it's no longer possible to reduce dimensionality.

In [296]:
airports_raw.set_index('faa').loc['BFT']

Unnamed: 0_level_0,id,name,city,country,icao,lat,lon,alt,tz,dst,tzone
faa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
BFT,3769,Beaufort,Beaufort,United States,KNBC,32.477411,-80.723161,37,-5.0,A,America/New_York
BFT,7049,BFT County Airport,Beauford,United States,KBFT,32.41083,-80.635,500,-5.0,A,America/New_York


In this case, there are *two* rows with the code `FAA`, meaning the `.loc['BFT']` returns a DataFrame, rather than a Series.

## Avoid Inplace Operations

## Avoid iteration, especially `.apply`

## Avoid `.values`



## Follow Tidy Data Principles

Pandas has a *columnar* data model. Each column in the DataFrame has a dtype, common to all the elements within that column, and distinct from the rest of the columns.