# Best Practices

This document collects some of the best practices used elsewhere in the pandas documentation.
Together, they lead to a style of code lovingly referred to as *pandorable*. We encourage
you to apply these practices when using pandas.

In [None]:
import pandas as pd
pd.options.display.max_rows = 10

## Use method chaining

Compare the following two stories (credit to [Jeff Allen](http://trestletech.com/wp-content/uploads/2015/07/dplyr.pdf)):

First,

```python
on_hill = went_up(jack_jill, 'hill')
with_water = fetch(on_hill, 'water')
fallen = fell_down(with_water, 'jack')
broken = broke(fallen, 'crown')
after = tumble_after(broken, 'jill')
```

and second,

```python
(jack_jill
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
    .broke("crown")
    .tumble_after("jill"))
```

I hope you agree that the second story, written in a method chaining style, is easier to follow. It avoids uninteresting intermediate variables, generally making things easier to read.

As a concrete example, we'll look at the light pre-processing done to the `airports` dataset following Hadley Wickham's [nycflights13 package](https://github.com/hadley/nycflights13/blob/master/data-raw/airports.R).

In [None]:
names = ["id", "name", "city", "country", "faa", "icao", "lat", "lon", "alt", "tz", "dst", "tzone"]

airports_raw = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.dat",
                           header=None, names=names)
airports_raw.head()

We'll do a bit of cleaning up including filtering the rows and columns to the values of interest.

In [None]:
airports = (
    airports_raw
        .loc[lambda df: (df['country'] == 'United States') & (df['faa'] != '')]
        [['faa', 'name', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzone']]
        .drop_duplicates(subset="faa")
        .set_index("faa")
)
airports

Most Series or DataFrame methods return a new Series or DataFrame, encouraging this method chaining style. Some notable methods include

1. :meth:`DataFrame.assign`
2. :meth:`DataFrame.loc`, :meth:`DataFrame.iloc`, :meth:`DataFrame.where`, and ``DataFrame.__getitem__``.
3. :meth:`DataFrame.pipe`

Note that `assign` and the indexing methods accept callables, which let you refer to the result of the previous link in the method chain. Consider translating an imperative sequence of operations like

```python
df1 = pd.read_csv(...)
df1['foo'] = df1['foo'].str.upper()
df1 = df1.loc[df1['bar'] > 3]
```

to method chaining style. You'd use callables, often `lambda` functions, to refer to `df1` in subsequent operations.

```python
df = (
    pd.read_csv(...)
    .assign(foo=lambda df: df["foo"].str.upper())
    .loc[lambda df: df["bar"] > 3]
)
```
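
The chain above elides the file being read. As a minimal runnable sketch of the same pattern, the version below reads from an in-memory string instead; the CSV contents and the name `csv_text` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents, used only so the example runs without a file.
csv_text = "foo,bar\na,1\nb,5\nc,7\n"

result = (
    pd.read_csv(io.StringIO(csv_text))
    # This lambda receives the DataFrame produced by read_csv.
    .assign(foo=lambda df: df["foo"].str.upper())
    # This lambda receives the DataFrame returned by assign.
    .loc[lambda df: df["bar"] > 3]
)
print(result)
```

Each callable is evaluated against the intermediate result at that point in the chain, so no intermediate variable needs a name.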

Finally, pandas provides an escape hatch through the `.pipe` method. With `.pipe`, you can provide any callable that expects a DataFrame (or Series) as its first argument. For example, we could implement a function approximating the great circle distance between some airport `to` and the rest.

In [None]:
import numpy as np


def great_circle_distance(df, to="DSM"):
    # https://www.johndcook.com/blog/python_longitude_latitude/
    df = df.copy()
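    # Convert to spherical coordinates in radians: polar angle (90 - latitude) and longitude.
    lat = np.deg2rad(90 - df['lat'])
    lon = np.deg2rad(df['lon'])

    # Use the converted (radian) coordinates of the destination airport.
    to_lat, to_lon = lat.loc[to], lon.loc[to]
    cos = (np.sin(lat) * np.sin(to_lat) * np.cos(lon - to_lon) +
           np.cos(lat) * np.cos(to_lat))

    # Clip to [-1, 1] so rounding error cannot push arccos out of its domain.
    arc = np.arccos(cos.clip(-1, 1))
    # Arc length in radians times the Earth's radius in kilometers.
    kilometers = 6373 * arc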
    df[f'km_to_{to}'] = kilometers
    return df

In [None]:
great_circle_distance(airports)

Notice that our custom `great_circle_distance` function further encourages method chaining by returning a DataFrame itself.

Appending that to our original method chain gives

```python
airports = (
    airports_raw
        .loc[lambda df: (df['country'] == 'United States') & (df['faa'] != '')]
        [['faa', 'name', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzone']]
        .drop_duplicates(subset="faa")
        .set_index("faa")
        .pipe(great_circle_distance)
)
```

Additional keyword arguments passed to `.pipe` are passed through to the callable.

```python
airports = (
    ...
    .pipe(great_circle_distance, to="ORD")
)
```
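
To see the argument forwarding in isolation, here is a small self-contained sketch. The helper `add_offset` and its parameters are hypothetical, invented only for this illustration.

```python
import pandas as pd


def add_offset(df, column, offset=0):
    """Hypothetical helper: return a copy with `offset` added to `column`."""
    return df.assign(**{column: df[column] + offset})


df = pd.DataFrame({"x": [1, 2, 3]})

# Arguments after the callable are forwarded to it, so this call is
# equivalent to add_offset(df, "x", offset=10).
shifted = df.pipe(add_offset, "x", offset=10)
print(shifted)
```

Because `add_offset` returns a new DataFrame rather than modifying its input, it can sit anywhere in a chain without affecting `df`.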

## Use Meaningful Labels

Every Series and DataFrame has a `.index` property storing the *row labels*.
Additionally, DataFrame has the `.columns` property for storing *column labels*.

We recommend that you use meaningful labels. Pandas' most fundamental operations *align by label*. Constructors, binary operations (`add`, `mul`, etc.), reshaping (`concat`), etc. all align before doing an operation.

Let's consider a simple example computing population density from two datasets (https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html).

In [None]:
area = pd.DataFrame([
    ('Alaska', 1723337),
    ('Texas', 695662),
    ('California', 423967)
], columns=['state', 'area'])
area

In [None]:
population = pd.DataFrame([
    ('California', 38332521),
    ('Texas', 26448193),
    ('New York', 19651127),
], columns=['state', 'population'])
population

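If we naively divide the population column by the area column, we get incorrect results: both DataFrames have the default integer index, so the values are paired by position (California's population with Alaska's area, for example), and the output doesn't show which state each value belongs to.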

In [None]:
population['population'] / area['area']

It'd be better to model this problem as two Series, each with the `state` as its index.

In [None]:
area_ = area.set_index("state")['area']
population_ = population.set_index("state")["population"]
population_ / area_

Pandas uses row labels (and column labels for DataFrames) to align the data before doing the operation.
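
As a minimal sketch of alignment on both axes, the two frames below (made-up data) list their rows and columns in different orders, and one has a label the other lacks.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
# Same labels in a different order, plus an extra row "z".
right = pd.DataFrame(
    {"b": [10, 20, 50], "a": [30, 40, 60]},
    index=["y", "x", "z"],
)

# Values are matched by (row label, column label), not by position.
total = left + right
print(total)
```

Row `"z"` exists only in `right`, so its entries in `total` are missing values rather than being silently paired with some other row.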

## Avoid duplicate row and column labels

One of pandas' primary roles is to help clean up messy tabular data. So while pandas *can* store duplicate labels, we recommend addressing duplicate labels as early as possible to avoid surprises later on. Consider one of pandas' most basic operations: selecting a value from a DataFrame. Duplicate labels can change the behavior in surprising ways.

Pandas follows the NumPy tradition of *reducing dimensionality* when indexing. Slicing a row from a 2-D array returns a 1-D array. Slicing a row and a column returns a scalar. Similarly with pandas.

In [None]:
airports['name']

In [None]:
airports.loc['BFT', 'name']

But, when there are duplicates in the index, it's no longer possible to reduce dimensionality.

In [None]:
airports_duplicated = airports_raw.set_index('faa')
airports_duplicated.head()

In [None]:
airports_duplicated.loc['BFT']

In this case, there are *two* rows with the FAA code `BFT`, so `.loc['BFT']` returns a DataFrame rather than a Series.

In [None]:
# Index.duplicated flags repeated labels; keep only the first occurrence of each
airports_deduplicated = airports_duplicated[~airports_duplicated.index.duplicated()]
airports_deduplicated.head()

In [None]:
airports_deduplicated.loc['BFT']
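
The same effect can be reproduced without downloading the airports data. This is a small sketch on a made-up frame, showing how to detect duplicate labels with `Index.is_unique` before they change what a lookup returns.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["first", "second", "third"]},
    index=["A", "B", "B"],
)

# A quick check worth running early in a cleaning pipeline.
print(df.index.is_unique)

# With a duplicated label, selecting one "cell" yields a Series of matches.
duplicate_lookup = df.loc["B", "name"]

# After dropping repeated labels, the same lookup yields a single value.
deduplicated = df[~df.index.duplicated(keep="first")]
unique_lookup = deduplicated.loc["B", "name"]
print(type(duplicate_lookup).__name__, unique_lookup)
```

Which occurrence to keep (`keep="first"` here) is a data-cleaning decision, not something pandas can choose for you.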

## Avoid Inplace Operations

For many operations, pandas' current memory model doesn't allow true inplace (zero copy) operations.
The reasons are complicated, and we hope to address them someday, but the upshot is that `inplace=True` rarely means zero-copy.

Consider :meth:`DataFrame.dropna`. It requires checking for missing values and applying a boolean mask that selects just the rows with no NA values. Even in NumPy, boolean indexing makes a copy of the data rather than a view.
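
The NumPy behavior mentioned above can be checked directly. A minimal sketch, using `np.shares_memory` to compare a boolean-mask selection with a basic slice:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
mask = ~np.isnan(values)

kept = values[mask]  # boolean indexing: allocates a new array
head = values[:3]    # basic slicing: a view on the same buffer

print(np.shares_memory(values, kept))
print(np.shares_memory(values, head))
```

Only the slice shares memory with `values`; the masked selection has to be copied because the selected elements are not regularly spaced in memory.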

In [None]:
airports_inplace = airports_raw.copy()
airports_inplace.dropna(inplace=True)
airports_inplace

The actual operation is the same, regardless of whether `inplace=True` or `inplace=False`. The only difference is whether a new `DataFrame` object is returned, or whether your reference is updated inplace. For these types of methods, the only benefit of `inplace=True` is to avoid having to type the name of your object twice:

```python
really_long_dataframe_name = really_long_dataframe_name.dropna()

# vs.

really_long_dataframe_name.dropna(inplace=True)
```

But we recommend using method chaining, which avoids the need to type the name of the object twice in the first place.
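
As a quick check that the two spellings discussed above produce the same data, here is a small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

# Reassignment style: dropna returns a new DataFrame.
reassigned = df.dropna()

# inplace style: the object that `modified` refers to is updated.
modified = df.copy()
modified.dropna(inplace=True)

print(modified.equals(reassigned))
```

The reassignment form is the one that composes with method chaining, since it produces a value that the next method can be called on.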

## Avoid `.values`

``DataFrame.values`` is a surprisingly complex attribute. The main goal is to get a NumPy representation of the data backing the DataFrame. This can be useful if you're doing lower-level numerical operations, or working with a library that needs an ndarray rather than a DataFrame.

In the simplest case, ``.values`` really does return a view on the data stored inside a DataFrame.

In [None]:
raw = np.random.randn(4, 3)
df = pd.DataFrame(raw, columns=['a', 'b', 'c'])
df

In [None]:
df.values.base is raw

However, whenever you're mixing multiple dtypes (which is kind of the point of pandas), `.values` ceases to be a simple view.

In [None]:
cat = pd.Categorical(['a', 'b', 'c', 'd'])
df['d'] = cat
df

In [None]:
df.values

NumPy arrays have a single dtype for every element, which means we must find a common dtype for all the columns. In practice, this often means `object`-dtype (each element of the 2D array is a Python object). This conversion from native to object dtype is expensive in time and memory.
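
To make the common-dtype rule concrete, here is a small sketch on made-up data. It inspects only the dtype of the resulting array:

```python
import pandas as pd

numeric = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})
mixed = numeric.assign(s=["x", "y"])

# Integers and floats share a numeric common dtype.
print(numeric.values.dtype)

# Adding a string column forces every element to become a Python object.
print(mixed.values.dtype)
```

In the second case the integers and floats are boxed as Python objects too, which is where the extra time and memory go.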

If you need a NumPy array from a DataFrame, we recommend using :meth:`DataFrame.to_numpy()`.

In [None]:
df.to_numpy()

This makes it clearer that the operation may be expensive (and offers control over whether or not to copy the data).
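
A minimal sketch of that control, using the `dtype` and `copy` parameters of `DataFrame.to_numpy` on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# copy=True requests an array that does not share memory with df,
# so writing to it leaves the DataFrame untouched.
arr = df.to_numpy(copy=True)
arr[0, 0] = -1.0
print(df.loc[0, "a"])

# dtype requests a conversion as part of the same call.
as_float32 = df.to_numpy(dtype="float32")
print(as_float32.dtype)
```

Asking for a copy explicitly is the safer choice whenever the array will be mutated downstream.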

For :class:`Series` things are both simpler and more complex. We no longer have the issue with having to find a common dtype to accommodate multiple columns. However, not every 1-D array allowed in pandas can be represented by NumPy.

The basics like floats are fine. And we get zero-copy access to the original data.

In [None]:
df['a'].values

In [None]:
df['a'].values.base is raw

But for extension types, this isn't necessarily true. We have two conflicting desires:

1. Get a NumPy representation of the data
2. Get a zero-copy view on the original data

In [None]:
periods = pd.array(['2000', '2001', '2002', '2003'], dtype='Period[D]')
df['e'] = periods
df

In [None]:
df['d'].values

In [None]:
df['e'].values

For the first purpose, we recommend :meth:`Series.to_numpy`.

In [None]:
df['d'].to_numpy()

In [None]:
df['e'].to_numpy()

For the second purpose, we recommend :attr:`Series.array`, which returns the array that actually backs the Series.

In [None]:
df['a'].array

In [None]:
df['d'].array

In [None]:
df['e'].array

See :ref:`dsintro.arraylike` for more.
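
The difference between the two accessors can be sketched on a standalone categorical Series (made-up data):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

backing = s.array         # the extension array stored by pandas
converted = s.to_numpy()  # a NumPy array built from the category values

print(type(backing).__name__)
print(type(converted).__name__)
```

`.array` keeps the pandas-specific type (here a `Categorical`), while `.to_numpy()` always gives an `ndarray`, converting if it has to.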

## Follow Tidy Data Principles

As [Hadley Wickham](http://www.jstatsoft.org/v59/i10/paper) says, Tidy Data is about

> Structuring datasets to facilitate analysis

His three rules are that a dataset is tidy when

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

In [None]:
tables = pd.read_html("http://www.basketball-reference.com/leagues/NBA_2016_games.html")
games = tables[0]
games.head()

In [None]:
column_names = {'Date': 'date', 'Start (ET)': 'start',
                'Unnamed: 2': 'box', 'Visitor/Neutral': 'away_team',
                'PTS': 'away_points', 'Home/Neutral': 'home_team',
                'PTS.1': 'home_points', 'Unnamed: 7': 'n_ot'}

games = (games.rename(columns=column_names)
    .dropna(thresh=4)
    [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
    .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
    .set_index('date', append=True)
    .rename_axis(["game_id", "date"])
    .sort_index())
games.head()


Consider the question **How many days of rest did each team get between each game?**
As currently structured, our dataset does not facilitate answering that question. A single team's games are spread across multiple columns (`away_team`, `home_team`).

To answer this question, the columns would be something like

date       | team_name
---------- | ---------------
2015-10-27 | Detroit Pistons
2015-10-27 | Atlanta Hawks
2015-10-27 | Cleveland Cavaliers
...        | ...

We achieve that with :meth:`DataFrame.melt`.

In [None]:
tidy = (games.reset_index()
    .melt(id_vars=['game_id', 'date'], value_vars=['away_team', 'home_team'],
          value_name='team', var_name='home_or_away')
)
tidy.head()

Now answering the question is relatively straightforward. After sorting by date, for each team (`.groupby('team')`) we compute how many days passed between consecutive rows (`.date.diff().dt.days - 1`).

In [None]:
tidy.sort_values('date').groupby('team')['date'].diff().dt.days - 1
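
Because the cells above depend on downloading a web page, here is a self-contained sketch of the same reshaping on a tiny, made-up schedule. The team names `A`, `B`, `C` and the dates are invented for illustration.

```python
import pandas as pd

# Hypothetical schedule: one row per game, two team columns.
games = pd.DataFrame(
    {
        "game_id": [0, 1, 2],
        "date": pd.to_datetime(["2015-10-27", "2015-10-28", "2015-10-30"]),
        "away_team": ["A", "B", "C"],
        "home_team": ["B", "C", "A"],
    }
)

# One row per (game, team), sorted so each team's games are chronological.
tidy = games.melt(
    id_vars=["game_id", "date"],
    value_vars=["away_team", "home_team"],
    var_name="home_or_away",
    value_name="team",
).sort_values("date")

# Days between a team's consecutive games, minus the game day itself.
tidy["rest"] = tidy.groupby("team")["date"].diff().dt.days - 1
print(tidy)
```

Each team's first game has no previous game to compare against, so its `rest` value is missing.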