# Pandas things

`pandas` is built around the DataFrame, a concept inspired by R's data frame, which is in turn similar to a table in a relational database. A DataFrame is a two-dimensional table with rows and columns.

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
df = pd.read_csv(url)

print(df.shape)
df.head()

(244, 7)


,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


One important thing to know about pandas is that it's column-major.
Column-major means consecutive elements in a column are stored next to each other in memory. Row-major means the same but for elements in a row.
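To make the two layouts concrete, here is a minimal sketch using plain NumPy arrays rather than pandas itself (how pandas stores its blocks internally is an implementation detail, so treat this only as an illustration of the terminology). The array values and variable names are made up for the example.

```python
import numpy as np

# The same 2x3 values stored in the two layouts (float64 = 8 bytes per element).
row_major = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], order="C")
col_major = np.asfortranarray(row_major)

# Strides are the number of bytes to step to reach the next row / next column.
print(row_major.strides)  # (24, 8): neighbours within a row are 8 bytes apart
print(col_major.strides)  # (8, 16): neighbours within a column are 8 bytes apart
```

In the column-major array, walking down a column touches adjacent memory, which is why column access tends to be the cheap direction.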

For our dataset, accessing a row takes about 40x longer than accessing a column:

In [3]:
# Get the column `total_bill`, 1000 loops
%timeit -n1000 df["total_bill"]

# Get the first row, 1000 loops
%timeit -n1000 df.iloc[0]

1.97 µs ± 327 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
76.1 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# 1. Iterating over rows

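`.apply()` vs. `.map()`: both call a Python function once per element of a column. In the run below `.map()` was somewhat faster, but the difference is within the measurement noise.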

In [4]:
%timeit -n1 df['total_bill'].map(round)
%timeit -n1 df['total_bill'].apply(round)

178 µs ± 74.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
244 µs ± 87.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
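When a built-in vectorized method exists, it is usually worth trying before either `.map()` or `.apply()`. The sketch below uses a small hand-made Series (the first three `total_bill` values from the table above) and compares `.map(round)` with `Series.round()`. Note that the two are not identical: `Series.round()` keeps the float dtype.

```python
import pandas as pd

# Toy data: the first three total_bill values shown in the head() output.
bills = pd.Series([16.99, 10.34, 21.01], name="total_bill")

rounded_map = bills.map(round)  # calls Python's round() once per element
rounded_vec = bills.round()     # one vectorized call; result stays float64

print(rounded_vec.tolist())  # [17.0, 10.0, 21.0]
```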


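`.iterrows()` returns a generator that yields `(index, Series)` pairs one row at a time, and it is very slow.

`.itertuples()` returns each row as a namedtuple. It still lets you access every row, and it was about 11x faster in the run below. Iterating over `df.values` or `df.to_numpy()` was faster still, but those yield plain NumPy arrays without column names.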

In [5]:
%timeit -n1 [row for index, row in df.iterrows()]
%timeit -n1 [row for row in df.itertuples()]
%timeit -n1 [row for row in df.values]
%timeit -n1 [row for row in df.to_numpy()]

6.18 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
543 µs ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
173 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 5.69 times longer than the fastest. This could mean that an intermediate result is being cached.
248 µs ± 173 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
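As a rough illustration of the difference in what each iterator hands back, the sketch below uses a tiny hand-made frame (the `tips` name and its values are assumptions for the example, not the full dataset).

```python
import pandas as pd

tips = pd.DataFrame(
    {"total_bill": [16.99, 10.34], "tip": [1.01, 1.66], "party": [2, 3]}
)

# iterrows() yields (index, Series). Each row is boxed into one Series with a
# single dtype, so the integer `party` column is upcast to float here.
first_index, first_row = next(tips.iterrows())
print(first_row.dtype)

# itertuples() yields namedtuples; fields are read by attribute.
tip_rates = [row.tip / row.total_bill for row in tips.itertuples(index=False)]
print(tip_rates)
```

Building a Series per row is a large part of why `.iterrows()` is slow, and the upcasting is a second reason to prefer `.itertuples()` when a loop cannot be avoided.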


# 2. Ordering slicing operations

Because pandas is column-major, if you want to do multiple slicing operations, always do the column-based slicing operations first.

For example, if you want to get the `total_bill` from the first row of the data, there are two slicing operations:

- get row (row-based operation)
- get total_bill (column-based operation)

In the timings below, get row -> get total_bill is about 18x slower than get total_bill -> get row. A single `.loc[row, column]` lookup is close to the column-first speed.

In [6]:
%timeit -n1000 df["total_bill"][0]
%timeit -n1000 df.iloc[0]["total_bill"]
%timeit -n1000 df.loc[0, "total_bill"]

4.85 µs ± 482 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
88 µs ± 5.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
7.12 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
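For reading a single cell, pandas also provides the scalar accessors `.at` (label-based) and `.iat` (position-based). The sketch below shows them next to the column-first pattern on a small hand-made frame; the frame and variable names are assumptions for the example, and no timings are claimed here.

```python
import pandas as pd

tips = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

by_label = tips.at[0, "total_bill"]        # row label 0, column label "total_bill"
by_position = tips.iat[0, 0]               # row position 0, column position 0
column_first = tips["total_bill"].iloc[0]  # select the column, then the first row

print(by_label, by_position, column_first)
```

Using `.iloc[0]` on the selected column avoids depending on the index happening to contain the label `0`.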


# 3. Grouping and Aggregating

## Splitting the data into groups based on some criteria.

In [7]:
df2 = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", 60),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
df2

,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,60.0
leopard,mammal,Carnivora,58.0


In [8]:
# Group by a single key; a scalar key keeps the group names as plain strings
group = df2.groupby('class')

In [9]:
group.get_group('bird')

,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
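Besides `get_group`, a groupby object can be iterated directly, which yields each group key together with its sub-DataFrame. A minimal sketch, rebuilding a reduced version of the animals frame under the assumed name `animals`:

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, 60.0, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

group_sizes = {}
for name, members in animals.groupby("class"):
    # `name` is the group key, `members` is the sub-DataFrame for that key.
    group_sizes[name] = len(members)

print(group_sizes)  # {'bird': 2, 'mammal': 3}
```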


## Applying a function to each group independently.

In [10]:
# Some common aggregations (currently only sum, mean, std, and sem) have optimized Cython implementations.
# numeric_only=True skips the string column `order`
group.sum(numeric_only=True)

class,max_speed
bird,413.0
mammal,198.2


In [11]:
group.size()

class
bird      2
mammal    3
dtype: int64

In [12]:
# String names select the built-in aggregations; passing np.sum etc. is deprecated in recent pandas
group['max_speed'].agg(["sum", "mean", "std"])

class,sum,mean,std
bird,413.0,206.5,258.093975
mammal,198.2,66.066667,12.280608


In [13]:
group['max_speed'].agg(["sum", "mean", "std"]).rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})

class,foo,bar,baz
bird,413.0,206.5,258.093975
mammal,198.2,66.066667,12.280608
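An alternative to aggregating and then renaming is pandas' named aggregation, where each output column is given as `new_name=(source_column, function)`. The sketch below rebuilds the `max_speed` data under the assumed name `animals`; the output column names `total`, `average`, and `spread` are made up for the example.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, 60.0, 58.0],
    }
)

# Each keyword becomes an output column: name=(input column, aggregation).
summary = animals.groupby("class").agg(
    total=("max_speed", "sum"),
    average=("max_speed", "mean"),
    spread=("max_speed", "std"),
)
print(summary)
```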


In [14]:
group.agg({"max_speed": "std"})

class,max_speed
bird,258.093975
mammal,12.280608


## Combining the results into a data structure.

In [15]:
tmp = group['order'].apply(set).to_frame()
tmp

class,order
bird,"{Psittaciformes, Falconiformes}"
mammal,"{Primates, Carnivora}"


In [16]:
tmp.explode('order')

class,order
bird,Psittaciformes
bird,Falconiformes
mammal,Primates
mammal,Carnivora


In [17]:
df.groupby(['day'])['sex'].transform(lambda x: pd.factorize(x)[0])

0      0
1      1
2      1
3      1
4      0
      ..
239    0
240    1
241    0
242    0
243    1
Name: sex, Length: 244, dtype: int64
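The key property of `transform` is that its result has the same index as the input, so it can be assigned straight back as a new column. A minimal sketch on hand-made data (the frame, values, and the `bill_vs_day_mean` column name are assumptions for the example):

```python
import pandas as pd

tips = pd.DataFrame(
    {"day": ["Sun", "Sun", "Sat", "Sat"], "total_bill": [10.0, 20.0, 30.0, 50.0]}
)

# One value per input row: the mean of that row's group, broadcast back.
day_mean = tips.groupby("day")["total_bill"].transform("mean")
tips["bill_vs_day_mean"] = tips["total_bill"] - day_mean

print(tips)
```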

# 4. Pivot table

In [18]:
pivot = df.pivot_table(index="sex", columns="smoker", values='total_bill').reset_index()
pivot

smoker,sex,No,Yes
0,Female,18.105185,17.977879
1,Male,19.791237,22.2845


In [19]:
pd.melt(pivot)

Unnamed: 0,smoker,value
0,sex,Female
1,sex,Male
2,No,18.105185
3,No,19.791237
4,Yes,17.977879
5,Yes,22.2845
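Called without arguments, `pd.melt` unpivots every column, which is why `sex` ends up mixed into the `value` column above. If the goal is to undo the pivot, one option is to keep `sex` as an identifier via `id_vars`. The sketch below rebuilds `pivot` by hand from the printed values; the `mean_total_bill` column name is an assumption for the example.

```python
import pandas as pd

# Same shape as `pivot` above, rebuilt from its printed values.
pivot = pd.DataFrame(
    {
        "sex": ["Female", "Male"],
        "No": [18.105185, 19.791237],
        "Yes": [17.977879, 22.2845],
    }
)

# Keep `sex` fixed and unpivot only the smoker columns.
long_form = pd.melt(
    pivot, id_vars="sex", var_name="smoker", value_name="mean_total_bill"
)
print(long_form)
```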
