In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
filename = '../data/nyc-parking-violations-2020.csv'
df = pd.read_csv(filename,
                usecols=['Plate ID', 'Registration State', 'Plate Type', 'Feet From Curb',
                        'Vehicle Make', 'Vehicle Color'])
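# read_csv keeps the file's column order and ignores the order given in usecols,
# so the new names are listed in file order
df.columns = ['pid', 'state', 'ptype', 'make', 'color', 'feet']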

# Beyond 1

In `df.query`, we can use the words `and` and `or` rather than the symbols `&` and `|`, because pandas parses the query string itself before handing it to an engine such as `numexpr`. Rewrite our final query using the words. Does this change the speed at all?

In [3]:
%timeit df.query('state == "NY" and ptype == "PAS" and color == "WHITE" and feet > 1 and make == "TOYOT"')

914 ms ± 7.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
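
Before comparing speeds, it is worth confirming that the two spellings select the same rows. The following is a minimal sketch on a tiny made-up frame; the `sample` data and its values are assumptions for illustration, not rows from the parking data.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the CSV loaded above
sample = pd.DataFrame({
    'state': ['NY', 'NY', 'NJ', 'NY'],
    'ptype': ['PAS', 'COM', 'PAS', 'PAS'],
    'color': ['WHITE', 'WHITE', 'WHITE', 'BLACK'],
    'feet': [2, 3, 5, 4],
    'make': ['TOYOT', 'TOYOT', 'TOYOT', 'TOYOT'],
})

# The same five conditions, written once with symbols and once with words
with_symbols = sample.query(
    '(state == "NY") & (ptype == "PAS") & (color == "WHITE") & (feet > 1) & (make == "TOYOT")'
)
with_words = sample.query(
    'state == "NY" and ptype == "PAS" and color == "WHITE" and feet > 1 and make == "TOYOT"'
)

# True when both queries return identical rows
same_rows = with_symbols.equals(with_words)
print(same_rows)
```

Only the first row satisfies all five conditions here, so both queries return that single row.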


# Beyond 2

I prefer measuring distance in meters rather than in feet, so I want to find all of the cars that were ticketed when they were more than 1 meter from the curb. Perform this query using the traditional `df.loc` and also using `df.query`. Which one runs faster?

In [4]:
%timeit df.loc[(df['feet'] * 0.3048) > 1]

63.2 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [5]:
%timeit df.query('(feet * 0.3048) > 1')

84.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
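
The two cells above express the same filter, so they should return the same rows. Here is a small sketch that checks this on made-up distances; the `sample` frame is an assumption for illustration and is not taken from the parking data.

```python
import pandas as pd

# Hypothetical distances in feet; not taken from the parking data
sample = pd.DataFrame({'feet': [0, 1, 3, 4, 10]})

# 1 foot is exactly 0.3048 meters, so 3 feet is 0.9144 m and 4 feet is 1.2192 m
via_loc = sample.loc[(sample['feet'] * 0.3048) > 1]
via_query = sample.query('(feet * 0.3048) > 1')

print(via_loc['feet'].tolist())     # [4, 10]
print(via_loc.equals(via_query))    # True
```

With whole-number distances, anything at 4 feet or more passes the 1 meter threshold.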


# Beyond 3

What if we modify our query so that it looks for cars that were more than 1 meter from the curb and whose registration state is New York? Which query runs faster, and by how much?

In [6]:
%timeit df.loc[((df['feet'] * 0.3048) > 1) & (df['state'] == 'NY')]

507 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit df.query('(feet * 0.3048) > 1 and state == "NY"')

314 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
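
To answer "by how much," we can divide the slower mean by the faster one. This sketch uses the mean timings reported above for the single-condition and two-condition queries; the `describe` helper is a hypothetical name introduced here, and the numbers come from one machine's run, so your own ratios will likely differ.

```python
# Mean timings in milliseconds, copied from the %timeit output above
simple = {'loc': 63.2, 'query': 84.4}    # feet condition only
combined = {'loc': 507, 'query': 314}    # feet and state conditions

def describe(timings):
    """Return (faster method, ratio of slower mean to faster mean)."""
    faster = min(timings, key=timings.get)
    slower = max(timings, key=timings.get)
    return faster, timings[slower] / timings[faster]

for label, timings in [('simple', simple), ('combined', combined)]:
    faster, ratio = describe(timings)
    print(f'{label}: {faster} is about {ratio:.2f}x faster')
```

On these numbers, `df.loc` wins the single-condition query by a factor of about 1.34, while `df.query` wins the two-condition query by a factor of about 1.61.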
