## Setup

### Install Dependencies

In [1]:
using Pkg

In [2]:
pkg"up"
pkg"add BenchmarkTools PyCall CSV DataFrames"
pkg"precompile"

  Updating registry at `/opt/julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Installed OpenSSL_jll ────────────────── v1.1.1+2
 Installed CompilerSupportLibraries_jll ─ v0.3.3+0
  Updating `/opt/julia/environments/v1.3/Project.toml`
 [no changes]
  Updating `/opt/julia/environments/v1.3/Manifest.toml`
  [e66e0078] ↑ CompilerSupportLibraries_jll v0.3.2+0 ⇒ v0.3.3+0
  [458c3c95] ↑ OpenSSL_jll v1.1.1+1 ⇒ v1.1.1+2
  [ea10d353] ↓ WeakRefStrings v0.6.2 ⇒ v0.5.8
 Resolving package versions...
  Updating `/opt/julia/environments/v1.3/Project.toml`
 [no changes]
  Updating `/opt/julia/environments/v1.3/Manifest.toml`
  [ea10d353] ↑ WeakRefStrings v0.5.8 ⇒ v0.6.2
Precompili

┌ Info: Precompiling JuMP [4076af6c-e467-56ae-b986-b466b2749572]
└ @ Base loading.jl:1273

Precompiling SearchLight

┌ Info: Precompiling SearchLight [340e8cb6-72eb-11e8-37ce-c97ebeb32050]
└ @ Base loading.jl:1273

Precompiling Plots

┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80]
└ @ Base loading.jl:1273

Precompiling LibPQ


┌ Info: Precompiling LibPQ [194296ae-ab2e-5f79-8cd4-7183a0a5a0d1]
└ @ Base loading.jl:1273


### Imports

In [3]:
using BenchmarkTools
using CSV
using DataFrames
using PyCall
using Dates

In [4]:
pd = pyimport("pandas")
np = pyimport("numpy")

PyObject <module 'numpy' from '/opt/conda/lib/python3.7/site-packages/numpy/__init__.py'>

### Download Test Data

In [5]:
download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv", 
    "test_data.csv")

"test_data.csv"

In [6]:
filesize("test_data.csv")/1024^2 # in MB

39.241610527038574

### Estimate PyCall Overhead

How much could the PyCall call overhead distort the Python timings?

In [7]:
@btime x = pd.DataFrame()

  854.797 μs (7 allocations: 320 bytes)


In [8]:
@btime py"1+1"

  25.699 μs (3 allocations: 48 bytes)


2

The PyCall overhead is well below 1 ms per call (about 26 μs for a trivial expression).

## Import CSV File

In [134]:
@time df = CSV.File("test_data.csv") |> DataFrame # including compilation

  0.376642 seconds (1.45 M allocations: 308.709 MiB)


| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID |
|---|---|---|---|---|---|
| | Int64⍰ | String | String | String⍰ | Int64⍰ |
| 1 | 1 | 2019-12-01 00:09:45 | 2019-12-01 00:10:59 | N | 1 |
| 2 | 2 | 2019-12-01 00:26:05 | 2019-12-01 00:31:30 | N | 1 |
| 3 | 2 | 2019-12-01 00:56:36 | 2019-12-01 00:59:38 | N | 1 |
| 4 | 2 | 2019-12-01 00:26:20 | 2019-12-01 00:40:19 | N | 1 |
| 5 | 2 | 2019-12-01 00:56:36 | 2019-12-01 00:59:56 | N | 1 |
| 6 | 1 | 2019-12-01 00:14:28 | 2019-12-01 00:19:39 | N | 1 |
| 7 | 1 | 2019-12-01 00:45:54 | 2019-12-01 00:52:46 | N | 1 |
| 8 | 2 | 2019-12-01 00:25:35 | 2019-12-01 01:04:08 | N | 1 |
| 9 | 1 | 2019-12-01 00:43:12 | 2019-12-01 00:56:44 | N | 1 |
| 10 | 2 | 2019-12-01 00:56:08 | 2019-12-01 01:05:11 | N | 1 |


In [10]:
@btime df = CSV.File("test_data.csv") |> DataFrame;

  377.322 ms (1453158 allocations: 308.71 MiB)


In [11]:
@btime df = CSV.File("test_data.csv", threaded=false) |> DataFrame;

  946.510 ms (1552638 allocations: 313.63 MiB)


In [12]:
describe(df)

| | variable | mean | min | median | max |
|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any |
| 1 | VendorID | 1.83345 | 1 | 2.0 | 2 |
| 2 | lpep_pickup_datetime | | 2008-12-31 22:34:56 | | 2035-09-02 17:17:47 |
| 3 | lpep_dropoff_datetime | | 2008-12-31 22:42:10 | | 2035-09-02 19:01:37 |
| 4 | store_and_fwd_flag | | N | | Y |
| 5 | RatecodeID | 1.10284 | 1 | 1.0 | 6 |
| 6 | PULocationID | 107.481 | 1 | 82.0 | 265 |
| 7 | DOLocationID | 128.446 | 1 | 129.0 | 265 |
| 8 | passenger_count | 1.31158 | 0 | 1.0 | 9 |
| 9 | trip_distance | 3.44502 | -9436.33 | 1.94 | 77843.8 |
| 10 | fare_amount | 15.5867 | -200.0 | 11.0 | 500.0 |


In [13]:
@time pydf = pd.read_csv("test_data.csv");

  2.003603 seconds (33.99 k allocations: 1.710 MiB)


In [14]:
pydf.dtypes

PyObject VendorID                 float64
lpep_pickup_datetime      object
lpep_dropoff_datetime     object
store_and_fwd_flag        object
RatecodeID               float64
PULocationID               int64
DOLocationID               int64
passenger_count          float64
trip_distance            float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
ehail_fee                float64
improvement_surcharge    float64
total_amount             float64
payment_type             float64
trip_type                float64
congestion_surcharge     float64
dtype: object

In [15]:
@btime pydf = pd.read_csv("test_data.csv");

  1.889 s (9 allocations: 352 bytes)


Python direct: 1.98s

CSV import in Julia is about a factor of 2 faster than in Pandas when run single-threaded (947 ms vs. 1.89 s). Julia uses multithreading by default, which gives an additional speedup of about a factor of 2.5 here (377 ms on a 4-core machine).

## Calculations

### Conversion of datetime columns

In [16]:
@btime DateTime.(df[!, :lpep_pickup_datetime], dateformat"yyyy-mm-dd HH:MM:SS");

  140.186 ms (450632 allocations: 24.07 MiB)


In [135]:
@time begin
    df[!, :lpep_pickup_datetime] = DateTime.(df[!, :lpep_pickup_datetime], 
        dateformat"yyyy-mm-dd HH:MM:SS");
    df[!, :lpep_dropoff_datetime] = DateTime.(df[!, :lpep_dropoff_datetime], 
        dateformat"yyyy-mm-dd HH:MM:SS");
end

  0.297416 seconds (901.27 k allocations: 48.133 MiB)


450627-element Array{DateTime,1}:
 2019-12-01T00:10:59
 2019-12-01T00:31:30
 2019-12-01T00:59:38
 2019-12-01T00:40:19
 2019-12-01T00:59:56
 2019-12-01T00:19:39
 2019-12-01T00:52:46
 2019-12-01T01:04:08
 2019-12-01T00:56:44
 2019-12-01T01:05:11
 2019-12-01T00:34:48
 2019-12-01T00:20:42
 2019-12-01T00:06:22
 ⋮                  
 2020-01-01T00:06:00
 2020-01-01T00:31:00
 2019-12-31T23:45:00
 2019-12-31T23:38:00
 2019-12-31T23:27:00
 2019-12-31T23:59:00
 2020-01-01T00:03:00
 2019-12-31T23:54:00
 2019-12-31T23:16:00
 2019-12-31T23:40:00
 2019-12-31T23:37:00
 2020-01-01T00:05:00

In [18]:
@btime pd.to_datetime(pydf.lpep_pickup_datetime);

  129.738 ms (12 allocations: 528 bytes)


In [19]:
@time begin
    pydf.lpep_pickup_datetime = pd.to_datetime(pydf.lpep_pickup_datetime);
    pydf.lpep_dropoff_datetime = pd.to_datetime(pydf.lpep_dropoff_datetime);
end

  0.329862 seconds (7.13 k allocations: 345.935 KiB)


PyObject 0        2019-12-01 00:10:59
1        2019-12-01 00:31:30
2        2019-12-01 00:59:38
3        2019-12-01 00:40:19
4        2019-12-01 00:59:56
                 ...        
450622   2019-12-31 23:54:00
450623   2019-12-31 23:16:00
450624   2019-12-31 23:40:00
450625   2019-12-31 23:37:00
450626   2020-01-01 00:05:00
Name: lpep_dropoff_datetime, Length: 450627, dtype: datetime64[ns]

Python direct: 0.35s

In [20]:
pydf.dtypes

PyObject VendorID                        float64
lpep_pickup_datetime     datetime64[ns]
lpep_dropoff_datetime    datetime64[ns]
store_and_fwd_flag               object
RatecodeID                      float64
PULocationID                      int64
DOLocationID                      int64
passenger_count                 float64
trip_distance                   float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
ehail_fee                       float64
improvement_surcharge           float64
total_amount                    float64
payment_type                    float64
trip_type                       float64
congestion_surcharge            float64
dtype: object

Conversion speed is similar, but the Python syntax is simpler (there is no need to specify the datetime format explicitly).
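
The explicit format on the Julia side can be illustrated without the taxi data. This is a minimal sketch using only the `Dates` standard library; the two sample strings and the variable names are made up for illustration:

```julia
using Dates

# Build the format object once and reuse it for every element.
pickup_format = dateformat"yyyy-mm-dd HH:MM:SS"

# Hypothetical sample values in the same layout as the taxi data.
raw = ["2019-12-01 00:09:45", "2019-12-01 00:26:05"]

# Broadcast the DateTime constructor over all strings.
parsed = DateTime.(raw, pickup_format)

println(parsed[1])       # 2019-12-01T00:09:45
println(eltype(parsed))  # DateTime
```

Passing a prebuilt `DateFormat` means the format string is not re-parsed for every element, which is presumably why the notebook uses the `dateformat"..."` string macro.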

## Vectorized Calculations

In [21]:
df[!, :drive_time] = Second.(df[!, :lpep_dropoff_datetime] .- df[!, :lpep_pickup_datetime])

450627-element Array{Second,1}:
 74 seconds  
 325 seconds 
 182 seconds 
 839 seconds 
 200 seconds 
 311 seconds 
 412 seconds 
 2313 seconds
 812 seconds 
 543 seconds 
 1456 seconds
 241 seconds 
 85 seconds  
 ⋮           
 540 seconds 
 1920 seconds
 1860 seconds
 2040 seconds
 960 seconds 
 1500 seconds
 1020 seconds
 1680 seconds
 840 seconds 
 960 seconds 
 1260 seconds
 780 seconds 

In [22]:
@btime df[!, :drive_time] = Second.(df[!, :lpep_dropoff_datetime] .- df[!, :lpep_pickup_datetime]);

  6.973 ms (8 allocations: 3.44 MiB)


In [23]:
pydf.drive_time = (pydf.lpep_dropoff_datetime - pydf.lpep_pickup_datetime)

PyObject 0        00:01:14
1        00:05:25
2        00:03:02
3        00:13:59
4        00:03:20
           ...   
450622   00:28:00
450623   00:14:00
450624   00:16:00
450625   00:21:00
450626   00:13:00
Length: 450627, dtype: timedelta64[ns]

In [24]:
@btime pydf.drive_time = (pydf.lpep_dropoff_datetime - pydf.lpep_pickup_datetime);

  16.283 ms (9 allocations: 400 bytes)


Python direct: 17ms

In [25]:
df[!, :price_per_mile] = df[!, :fare_amount] ./ df[!, :trip_distance]

450627-element Array{Float64,1}:
 Inf                 
   8.208955223880597 
   7.377049180327869 
   3.58974358974359  
   9.0               
   5.454545454545454 
   5.0               
   4.0431266846361185
   3.1132075471698113
   3.7878787878787876
   4.675324675324675 
   6.617647058823529 
  75.0               
   ⋮                 
   5.55036855036855  
   2.9398034398034394
   3.1742610837438425
   2.40232268768146  
   4.89202657807309  
   2.398413666870043 
   2.886944818304172 
   4.875249500998004 
  10.268722466960352 
   2.916577540106952 
  -5.020949720670392 
   8.703703703703702 

In [26]:
@btime df[!, :price_per_mile] = df[!, :fare_amount] ./ df[!, :trip_distance];

  6.653 ms (5 allocations: 3.44 MiB)


In [46]:
@btime pydf.price_per_mile = pydf.fare_amount / pydf.trip_distance

  2.956 ms (9 allocations: 368 bytes)


PyObject 0               inf
1          8.208955
2          7.377049
3          3.589744
4          9.000000
            ...    
450622     4.875250
450623    10.268722
450624     2.916578
450625    -5.020950
450626     8.703704
Length: 450627, dtype: float64

In [47]:
@btime exp.(df[!, :fare_amount] ./ df[!, :trip_distance]) .+ df[!, :passenger_count].^2

  29.247 ms (27 allocations: 7.31 MiB)


450627-element Array{Union{Missing, Float64},1}:
  Inf                    
 3674.702264144405       
 1599.864838235631       
   37.22478632880233     
 8104.083927575384       
  234.81856574475168     
  149.4131591025766      
   58.004298919792745    
   23.493076419549602    
   45.16262255965407     
  132.26738851310304     
  749.182594758126       
    3.7332419967990015e32
    ⋮                    
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             
     missing             

In [48]:
@btime np.exp(pydf.fare_amount / pydf.trip_distance) + pydf.passenger_count^2

  26.298 ms (24 allocations: 928 bytes)


PyObject 0                 inf
1         3674.702264
2         1599.864838
3           37.224786
4         8104.083928
             ...     
450622            NaN
450623            NaN
450624            NaN
450625            NaN
450626            NaN
Length: 450627, dtype: float64

Python direct: 28.5ms

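Julia is faster for the time difference (7.0 ms vs. 16.3 ms), whereas Pandas is faster for the purely numerical calculations (3.0 ms vs. 6.7 ms for the division, and 26.3 ms vs. 29.2 ms for the combined expression).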
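
The two kinds of vectorized expression timed in this section can be tried on small plain vectors. The sketch below uses made-up values (the first two pickup/dropoff pairs mirror the first rows of the data set) and only Base plus `Dates`:

```julia
using Dates

# Hypothetical sample columns as plain vectors.
pickup  = [DateTime(2019, 12, 1, 0, 9, 45), DateTime(2019, 12, 1, 0, 26, 5)]
dropoff = [DateTime(2019, 12, 1, 0, 10, 59), DateTime(2019, 12, 1, 0, 31, 30)]

# Subtracting DateTime values yields Millisecond periods;
# Second.(...) converts each of them to whole seconds.
drive_time = Second.(dropoff .- pickup)
println(drive_time)  # Second[74 seconds, 325 seconds]

# Dotted calls and operators in one expression fuse into a single loop.
fare       = [3.0, 5.5]
distance   = [0.5, 1.0]
passengers = [1, 2]
score = exp.(fare ./ distance) .+ passengers .^ 2
```

Because the dotted operations fuse, the combined expression needs only one pass over the data and one result array, which may be part of the reason Julia catches up as expressions get more complex.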

## Grouping

In [49]:
by(df, :passenger_count, (mean_distance=:trip_distance=>mean), 
    (max_distance=:trip_distance=>maximum))

| | passenger_count | mean_distance | max_distance |
|---|---|---|---|
| | Int64⍰ | Float64 | Float64 |
| 1 | 1 | 2.58961 | 333.3 |
| 2 | 5 | 2.60981 | 38.89 |
| 3 | 2 | 2.89825 | 48.78 |
| 4 | 6 | 2.41251 | 27.78 |
| 5 | 3 | 2.84961 | 34.1 |
| 6 | 4 | 2.6318 | 69.86 |
| 7 | 0 | 2.27189 | 20.67 |
| 8 | 7 | 0.0 | 0.0 |
| 9 | 8 | 2.1075 | 11.77 |
| 10 | 9 | 0.04 | 0.08 |


In [50]:
@btime by(df, :passenger_count, (mean_distance=:trip_distance=>mean), 
    (max_distance=:trip_distance=>maximum));

  28.064 ms (251 allocations: 14.33 MiB)


In [51]:
pydf.groupby("passenger_count").agg(mean_distance=("trip_distance", "mean"), 
    max_distance=("trip_distance", "max"))

| passenger_count | mean_distance | max_distance |
|---|---|---|
| 0.0 | 2.271892 | 20.67 |
| 1.0 | 2.589612 | 333.3 |
| 2.0 | 2.898245 | 48.78 |
| 3.0 | 2.849613 | 34.1 |
| 4.0 | 2.631798 | 69.86 |
| 5.0 | 2.609807 | 38.89 |
| 6.0 | 2.412509 | 27.78 |
| 7.0 | 0.0 | 0.0 |
| 8.0 | 2.1075 | 11.77 |
| 9.0 | 0.04 | 0.08 |


In [52]:
@btime pydf.groupby("passenger_count").agg(mean_distance=("trip_distance", "mean"), 
    max_distance=("trip_distance", "max"));

  38.919 ms (47 allocations: 2.13 KiB)


In [53]:
@btime pydf.groupby("passenger_count").agg(mean_distance=("trip_distance", np.mean), 
    max_distance=("trip_distance", np.max));

  38.981 ms (54 allocations: 2.47 KiB)


Python direct: 36.4ms

Groupby-Aggregation is faster in Julia (and has shorter syntax).
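
To make explicit what the group-by aggregation computes, here is a hand-written split-apply-combine over two small vectors. It is only an illustrative sketch with made-up data and hypothetical names (`groups`, `stats`), not how DataFrames or Pandas implement grouping:

```julia
using Statistics

# Hypothetical sample columns.
passengers = [1, 2, 1, 2, 1]
distance   = [1.0, 2.0, 3.0, 4.0, 5.0]

# Split: collect the distances belonging to each passenger count.
groups = Dict{Int, Vector{Float64}}()
for (p, d) in zip(passengers, distance)
    push!(get!(() -> Float64[], groups, p), d)
end

# Apply and combine: one named tuple of aggregates per group.
stats = Dict(p => (mean_distance = mean(ds), max_distance = maximum(ds))
             for (p, ds) in groups)

println(stats[1])  # (mean_distance = 3.0, max_distance = 5.0)
```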

## Apply Custom Functions

In [54]:
function myfunc(a, b, c)
    if a === missing
        return zero(a)  # zero(missing) is missing, so missing inputs stay missing
    elseif round(Int, a) % 2 == 0
        return 2a + b*c
    else
        return a*b + 2c
    end
end

myfunc (generic function with 1 method)

In [55]:
df[!, :myfunc] = myfunc.(df.passenger_count, df.trip_distance, df.fare_amount)

450627-element Array{Union{Missing, Float64},1}:
  6.0     
 11.67    
  9.61    
 31.9     
  9.5     
 13.1     
 16.5     
 67.42    
 38.3     
 22.64    
 55.25    
  9.68    
  6.04    
  ⋮       
   missing
   missing
   missing
   missing
   missing
   missing
   missing
   missing
   missing
   missing
   missing
   missing

In [56]:
@btime df[!, :myfunc] = myfunc.(df.passenger_count, df.trip_distance, df.fare_amount);

  4.284 ms (14 allocations: 7.31 MiB)


In [57]:
size(df)

(450627, 23)

In [58]:
df2 = first(df, 1000);

In [59]:
@btime df2[!, :myfunc] = myfunc.(df2.passenger_count, df2.trip_distance, df2.fare_amount);

  8.459 μs (11 allocations: 8.19 KiB)


In [60]:
py"""
import numpy as np
def myfunc_py(a, b, c):
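    # NaN is the only value that is not equal to itself; an identity test
    # (`a is np.nan`) does not catch NaN values coming out of a pandas column.
    if a != a: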
        return 0
    elif round(a) % 2 == 0:
        return 2*a + b*c
    else:
        return a*b + 2*c
"""

In [61]:
@btime py"myfunc_py(1, 2, 3)"

  43.056 μs (3 allocations: 48 bytes)


8

Python direct: 849ns - significantly faster than with PyCall

In [62]:
pydf2 = pydf.head(1000);

In [63]:
@btime pydf2.myfunc = pydf2.apply(py"lambda x: myfunc_py(x.passenger_count, x.trip_distance, x.fare_amount)", 
    axis=1)

  199.071 ms (35 allocations: 1.70 KiB)


PyObject 0       6.00
1      11.67
2       9.61
3      31.90
4       9.50
       ...  
995    26.11
996    34.83
997    46.70
998    76.50
999    14.43
Length: 1000, dtype: float64

In [64]:
@btime py"$pydf2.apply(lambda x: myfunc_py(x.passenger_count, x.trip_distance, x.fare_amount), axis=1)"

  199.429 ms (7 allocations: 208 bytes)


PyObject 0       6.00
1      11.67
2       9.61
3      31.90
4       9.50
       ...  
995    26.11
996    34.83
997    46.70
998    76.50
999    14.43
Length: 1000, dtype: float64

Python direct: 201ms

Here Julia is roughly 20,000 times faster than Python/Pandas (8.5 μs vs. 199 ms for 1,000 rows)!
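
The broadcast pattern behind this result works for any scalar function. The sketch below uses a hypothetical helper, `parity_score`, which has the same shape as `myfunc` but returns `0.0` for a missing input, applied to made-up vectors:

```julia
# Hypothetical helper: same branching as myfunc, but missing input maps to 0.0.
function parity_score(a, b, c)
    ismissing(a) && return 0.0
    return iseven(round(Int, a)) ? 2a + b * c : a * b + 2c
end

# Made-up sample columns; the first one allows missing values.
passengers = Union{Missing, Int}[1, 2, missing]
distance   = [0.5, 1.0, 2.0]
fare       = [3.0, 5.0, 7.0]

# The dot applies the scalar function element-wise across all three vectors.
result = parity_score.(passengers, distance, fare)
println(result)  # [6.5, 9.0, 0.0]
```

Julia compiles the scalar function and the broadcast loop together, whereas `DataFrame.apply(..., axis=1)` calls back into the Python interpreter for every row.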

## Row Iteration

In [78]:
function df_iter(df)
    result = 0
    last_passenger = 0
    for row in eachrow(df)
        if row.passenger_count === missing
            continue
        end
        result += row.passenger_count * last_passenger
        last_passenger = row.passenger_count
    end
    result
end

df_iter (generic function with 2 methods)

In [80]:
@code_warntype df_iter(df)

Variables
  #self#::Core.Compiler.Const(df_iter, false)
  df::DataFrame
  result::Any
  last_passenger::Any
  @_5::Union{Nothing, Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}}
  row::DataFrameRow{DataFrame,DataFrames.Index}

Body::Any
1 ─       (result = 0)
│         (last_passenger = 0)
│   %3  = Main.eachrow(df)::DataFrames.DataFrameRows{DataFrame,DataFrames.Index}
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #7 if not %6
2 ┄ %8  = @_5::Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}::Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}
│         (row = Core.getfield(%8, 1))
│   %10

In [79]:
@btime df_iter($df)

  288.378 ms (3149642 allocations: 54.94 MiB)


715957

In [84]:
function df_iter2(df)
    result = 0
    last_passenger = 0
    for row in eachrow(df)
        if row.passenger_count === missing
            continue
        end
        passenger = row.passenger_count::Int
        result += passenger * last_passenger
        last_passenger = passenger
    end
    result
end

df_iter2 (generic function with 1 method)

In [85]:
@code_warntype df_iter2(df)

Variables
  #self#::Core.Compiler.Const(df_iter2, false)
  df::DataFrame
  result::Int64
  last_passenger::Int64
  @_5::Union{Nothing, Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}}
  row::DataFrameRow{DataFrame,DataFrames.Index}
  passenger::Int64

Body::Int64
1 ─       (result = 0)
│         (last_passenger = 0)
│   %3  = Main.eachrow(df)::DataFrames.DataFrameRows{DataFrame,DataFrames.Index}
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #7 if not %6
2 ┄       Core.NewvarNode(:(passenger))
│   %9  = @_5::Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}::Tuple{DataFrameRow{DataFrame,DataFrames.Index},Tuple{Base.OneTo{Int64},Int64}}
│

In [86]:
@btime df_iter2($df)

  152.553 ms (2070664 allocations: 38.47 MiB)


715957

The type assertion alone gives a speedup of about a factor of 2 (153 ms vs. 288 ms).

In [98]:
function df_iter3(passengers)
    result = 0
    last_passenger = 0
    for row in passengers
        if row === missing
            continue
        end
        passenger = row
        result += passenger * last_passenger
        last_passenger = passenger
    end
    result
end

df_iter3 (generic function with 1 method)

In [99]:
@btime df_iter3(df[!, :passenger_count])

  1.640 ms (1 allocation: 16 bytes)


715957

Even more efficient:

In [100]:
function df_iter4(passengers)
    result = 0
    last_passenger = 0
    for row in skipmissing(passengers)
        passenger = row
        result += passenger * last_passenger
        last_passenger = passenger
    end
    result
end

df_iter4 (generic function with 1 method)

In [101]:
@btime df_iter4(df[!, :passenger_count])

  936.638 μs (1 allocation: 16 bytes)


715957

Looping directly over arrays (extracted from the DataFrame columns) gives a speedup of about 100 or more compared with `eachrow` (1.6 ms, or 0.94 ms with `skipmissing`, vs. 153 ms)!

Due to type instability, `eachrow` should be avoided in performance-critical code. Instead, loop over the column arrays.
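
One way to apply this advice is a function barrier: extract the column once and pass it to a small kernel function that the compiler can specialize on the column's concrete type. The sketch below uses a `NamedTuple` of vectors as a hypothetical stand-in for the DataFrame; `pair_sum` and `table` are illustrative names:

```julia
# Kernel: receives one concretely typed column, so it can be compiled
# for exactly that element type.
function pair_sum(passengers)
    result = 0
    last_passenger = 0
    for passenger in skipmissing(passengers)
        result += passenger * last_passenger
        last_passenger = passenger
    end
    return result
end

# Hypothetical stand-in for a DataFrame: a NamedTuple of column vectors.
table = (passenger_count = Union{Missing, Int}[1, missing, 2, 3],)

# Barrier: pull the column out once, then hand it to the kernel.
total = pair_sum(table.passenger_count)
println(total)  # 8
```

With a real `DataFrame` the column types are not part of the `DataFrame` type, so code that touches columns or rows directly inside a loop cannot be specialized; the call into the kernel is where the concrete type becomes known.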

In [112]:
py"""
def pydf_iter(df):
    result = 0
    last_passenger = 0
    for row in df.itertuples(): # much faster than df.iterrows()
        if np.isnan(row.passenger_count):
            continue
        result += row.passenger_count * last_passenger
        last_passenger = row.passenger_count
    return result
"""

In [113]:
pydf_iter = py"pydf_iter"

PyObject <function pydf_iter at 0x7fcaad4ba0e0>

In [114]:
@btime pydf_iter($pydf)

  18.263 s (3 allocations: 48 bytes)


715957.0

In [116]:
py"""
from numba import njit

@njit
def pydf_iter2(passengers):
    result = 0
    last_passenger = 0
    for row in passengers:
        if np.isnan(row):
            continue
        result += row * last_passenger
        last_passenger = row
    return result
"""

In [117]:
pydf_iter2 = py"pydf_iter2"

PyObject CPUDispatcher(<function pydf_iter2 at 0x7fcaac6d4560>)

In [119]:
@btime pydf_iter2(pydf.passenger_count.values)

  2.723 ms (41 allocations: 3.44 MiB)


715957.0

Even in the naive (type-unstable) `eachrow` implementation, Julia is about a factor of 60 faster than Python (288 ms vs. 18.3 s). When looping over column arrays, Julia is about a factor of 10,000 faster than Python.

Using a Numba JIT-compiled function on a NumPy array is "only" a factor of 3 slower than the Julia array-loop implementation and faster than the (type-unstable) Julia `eachrow` implementations.

## Sorting

In [136]:
@btime sort(df, (:passenger_count, :lpep_pickup_datetime), rev=(true, false));

  1.444 s (11493254 allocations: 292.95 MiB)


Note that `sort` creates a copy of the DataFrame, `sort!` does in-place sorting.
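
The copy versus in-place distinction follows the general Julia convention that functions ending in `!` mutate their argument. A minimal sketch on a plain vector of named tuples (hypothetical stand-ins for DataFrame rows), sorted in the same order as above, descending by passenger count and ascending by pickup:

```julia
# Hypothetical rows standing in for DataFrame rows.
rows = [(passengers = 1, pickup = 3),
        (passengers = 2, pickup = 5),
        (passengers = 2, pickup = 4)]

# Descending by passengers, ascending by pickup: negate the first key.
order = r -> (-r.passengers, r.pickup)

sorted = sort(rows; by = order)  # returns a sorted copy
println(rows[1].passengers)      # 1, the original is unchanged

sort!(rows; by = order)          # reorders `rows` itself
println(rows == sorted)          # true
```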

In [133]:
@btime py"$pydf.sort_values(['passenger_count', 'lpep_pickup_datetime'], ascending=[False, True])";

  358.607 ms (7 allocations: 208 bytes)


Sorting is significantly faster in Pandas.

## Filtering

The Pandas boolean-indexing syntax corresponds to Julia's `filter` function. In Julia, `missing` values must be handled explicitly.

In [151]:
@btime filter(x->(!ismissing(x) || x[:passenger_count]==2), df)

  55.052 ms (110 allocations: 70.49 MiB)


| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID |
|---|---|---|---|---|---|
| | Int64⍰ | DateTime | DateTime | String⍰ | Int64⍰ |
| 1 | 1 | 2019-12-01T00:09:45 | 2019-12-01T00:10:59 | N | 1 |
| 2 | 2 | 2019-12-01T00:26:05 | 2019-12-01T00:31:30 | N | 1 |
| 3 | 2 | 2019-12-01T00:56:36 | 2019-12-01T00:59:38 | N | 1 |
| 4 | 2 | 2019-12-01T00:26:20 | 2019-12-01T00:40:19 | N | 1 |
| 5 | 2 | 2019-12-01T00:56:36 | 2019-12-01T00:59:56 | N | 1 |
| 6 | 1 | 2019-12-01T00:14:28 | 2019-12-01T00:19:39 | N | 1 |
| 7 | 1 | 2019-12-01T00:45:54 | 2019-12-01T00:52:46 | N | 1 |
| 8 | 2 | 2019-12-01T00:25:35 | 2019-12-01T01:04:08 | N | 1 |
| 9 | 1 | 2019-12-01T00:43:12 | 2019-12-01T00:56:44 | N | 1 |
| 10 | 2 | 2019-12-01T00:56:08 | 2019-12-01T01:05:11 | N | 1 |


In [141]:
@btime py"$pydf[$pydf.passenger_count == 2]"

  15.736 ms (9 allocations: 240 bytes)


| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69 | 2.0 | 2019-12-01 00:29:25 | 2019-12-01 00:31:31 | N | 1.0 | 159 | 167 | 2.0 | 0.47 | 3.5 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 4.80 | 2.0 | 1.0 | 0.00 |
| 152 | 2.0 | 2019-12-01 00:37:10 | 2019-12-01 00:42:05 | N | 1.0 | 74 | 168 | 2.0 | 1.09 | 6.0 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 7.30 | 2.0 | 1.0 | 0.00 |
| 169 | 2.0 | 2019-12-01 00:14:09 | 2019-12-01 00:26:42 | N | 1.0 | 82 | 226 | 2.0 | 2.17 | 10.5 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 11.80 | 2.0 | 1.0 | 0.00 |
| 170 | 2.0 | 2019-12-01 00:07:27 | 2019-12-01 00:22:18 | N | 1.0 | 82 | 83 | 2.0 | 2.31 | 10.5 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 11.80 | 2.0 | 1.0 | 0.00 |
| 171 | 2.0 | 2019-12-01 00:33:06 | 2019-12-01 00:40:06 | N | 1.0 | 82 | 129 | 2.0 | 1.31 | 7.0 | 0.5 | 0.5 | 1.66 | 0.00 | | 0.3 | 9.96 | 1.0 | 1.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 359794 | 2.0 | 2019-12-31 23:05:36 | 2019-12-31 23:14:21 | N | 1.0 | 247 | 244 | 2.0 | 1.77 | 8.5 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 9.80 | 2.0 | 1.0 | 0.00 |
| 359803 | 2.0 | 2019-12-31 23:23:27 | 2019-12-31 23:33:17 | N | 1.0 | 129 | 70 | 2.0 | 2.60 | 10.5 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 11.80 | 2.0 | 1.0 | 0.00 |
| 359810 | 2.0 | 2019-12-31 23:51:51 | 2020-01-01 00:10:29 | N | 1.0 | 74 | 92 | 2.0 | 8.94 | 26.5 | 0.5 | 0.5 | 0.00 | 6.12 | | 0.3 | 33.92 | 2.0 | 1.0 | 0.00 |
| 359826 | 2.0 | 2019-12-31 23:48:11 | 2019-12-31 23:59:23 | N | 1.0 | 181 | 225 | 2.0 | 2.86 | 11.0 | 0.5 | 0.5 | 0.00 | 0.00 | | 0.3 | 12.30 | 2.0 | 1.0 | 0.00 |


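Julia and Pandas use different syntax for filter operations; which one reads better is largely a matter of taste.

Pandas is significantly faster in this measurement. Note, however, that the Julia predicate above calls `ismissing(x)` on the whole row, which is never `missing`, so `!ismissing(x) || ...` is always true and every row is kept (the output starts with the same rows as the unfiltered DataFrame). A predicate such as `x -> !ismissing(x[:passenger_count]) && x[:passenger_count] == 2` selects the intended rows; its timing was not measured here.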
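
Explicit handling of `missing` in a filter predicate can be sketched on a plain vector. The data below is made up; the point is that `missing == 2` evaluates to `missing` rather than `false`, so the predicate has to turn it into a `Bool` itself:

```julia
# Made-up column with one missing entry.
passengers = Union{Missing, Int}[1, 2, missing, 2]

# Guard first: && short-circuits, so the comparison never sees `missing`.
keep = filter(p -> !ismissing(p) && p == 2, passengers)

# Equivalent: coalesce replaces a missing comparison result with false.
keep_coalesce = filter(p -> coalesce(p == 2, false), passengers)

println(length(keep))  # 2
```

Applied to a DataFrame row, the same guard has to test the column value (for example `x[:passenger_count]`), not the row object itself.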

## Summary

* CSV Import: draw - Julia is significantly faster single-threaded and uses multiple threads by default, but has a long compile time on the first load
* Conversions to Datetime: draw - similar timings, but `pd.to_datetime` can infer the datetime format automatically
* Vectorized standard calculations (available in NumPy): Pandas - but the more complex the calculations get, the more Julia catches up
* Vectorized calculations using custom functions: Julia - by 4 orders of magnitude in my example!
* Iteration over rows: Julia - by almost 2 orders of magnitude using `eachrow` (not type-stable) and 4 orders of magnitude when looping over DataFrame column arrays. Using Numba on NumPy arrays is still a factor of 3 slower than Julia arrays.
* Sorting: Pandas - a factor of 4 faster in my example
* Filtering: Pandas - a factor of 3.5 faster in my example, although the Julia predicate benchmarked there keeps every row, so the comparison is not like for like