![LOGO](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/img/MODIN_ver2_hrz.png?raw=True)

<center><h2>Scale your pandas workflows by changing one line of code</h2></center>


# Getting Started

To install the most recent stable release of Modin, run the following command on your command line:

In [1]:
#!pip install "modin[all]"

For further instructions on how to install Modin with conda or for specific platforms or engines, see our detailed [installation guide](https://modin.readthedocs.io/en/latest/getting_started/installation.html).

Modin is a drop-in replacement for pandas: to speed up an existing pandas workflow, you only need to replace the pandas import with the Modin import, as follows. This notebook also imports plain pandas so that the two libraries can be timed side by side.

In [2]:
import modin.pandas as pd
import pandas

In [3]:
#############################################
### For the purpose of timing comparisons ###
#############################################
import time
#import ray
#ray.init()
# Use Dask instead of Ray
from dask.distributed import Client
client = Client()
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
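Modin's README also describes selecting the engine through the `MODIN_ENGINE` environment variable, as an alternative to creating a client by hand. The sketch below assumes the variable is set before Modin is first imported, which in this notebook would mean placing it ahead of the import cell. The `try`/`except` is only there so the snippet also runs where Modin is not installed.

```python
import os

# Assumption: Modin reads MODIN_ENGINE when it first needs an engine,
# so the variable is set before modin.pandas is imported.
os.environ["MODIN_ENGINE"] = "dask"

try:
    import modin.pandas as pd  # drop-in replacement for `import pandas as pd`
except ImportError:
    pd = None  # Modin is not installed in this environment

print("Requested engine:", os.environ["MODIN_ENGINE"])
```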

### Dataset: NYC taxi trip data

Link to raw dataset: https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv (**Size: ~200MB**)

In [4]:
# This may take a few minutes to download
import urllib.request
s3_path = "https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv"
urllib.request.urlretrieve(s3_path, "taxi.csv")  

('taxi.csv', <http.client.HTTPMessage at 0x15af2d160>)

# Faster Data Loading with Modin's ``read_csv``

In [5]:
start = time.time()

pandas_df = pandas.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to read with pandas: 4.721 seconds
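The start/end bookkeeping used in the timing cells can be wrapped in a small context manager. This is a minimal sketch, not part of Modin or pandas: `timed` and `durations` are hypothetical names, and `time.sleep` stands in for the `read_csv` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


durations = {}
with timed("sleep", durations):
    time.sleep(0.05)  # stand-in for pandas.read_csv("taxi.csv", ...)

print("Time to sleep: {} seconds".format(round(durations["sleep"], 3)))
```

Storing the durations in a dictionary keeps the pandas and Modin timings available for the later ratio, instead of reusing the `start` and `end` variables in every cell.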


In [6]:
start = time.time()

modin_df = pd.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("## Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))

Time to read with Modin: 2.286 seconds


## Modin is 2.07x faster than pandas at `read_csv`!

You can quickly check that pandas and Modin produce exactly the same result.

In [7]:
pandas_df

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896,40.750111,1,N,-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,1,1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.30,-74.001648,40.724243,1,N,-73.994415,40.759109,1,14.5,0.5,0.5,2.00,0.0,0.3,17.80
2,2,1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.80,-73.963341,40.802788,1,N,-73.951820,40.824413,2,9.5,0.5,0.5,0.00,0.0,0.3,10.80
3,3,1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,0.50,-74.009087,40.713818,1,N,-74.004326,40.719986,2,3.5,0.5,0.5,0.00,0.0,0.3,4.80
4,4,1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.00,-73.971176,40.762428,1,N,-74.004181,40.742653,2,15.0,0.5,0.5,0.00,0.0,0.3,16.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1274893,1274893,1,2015-01-11 08:32:03,2015-01-11 08:34:37,1,0.30,-73.947800,40.790176,1,N,-73.952866,40.791824,2,3.5,0.0,0.5,0.00,0.0,0.3,4.30
1274894,1274894,1,2015-01-11 08:32:03,2015-01-11 08:36:49,1,1.30,-73.980423,40.775387,1,N,-73.992508,40.758579,2,6.0,0.0,0.5,0.00,0.0,0.3,6.80
1274895,1274895,1,2015-01-11 08:32:04,2015-01-11 08:44:21,1,2.40,-73.981750,40.778496,1,N,-73.955757,40.763962,2,10.5,0.0,0.5,0.00,0.0,0.3,11.30
1274896,1274896,1,2015-01-11 08:32:05,2015-01-11 08:41:05,1,2.20,-73.982559,40.771423,1,N,-73.994759,40.748760,1,9.0,0.0,0.5,1.95,0.0,0.3,11.75


In [8]:
modin_df

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896,40.750111,1,N,-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,1,1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.30,-74.001648,40.724243,1,N,-73.994415,40.759109,1,14.5,0.5,0.5,2.00,0.0,0.3,17.80
2,2,1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.80,-73.963341,40.802788,1,N,-73.951820,40.824413,2,9.5,0.5,0.5,0.00,0.0,0.3,10.80
3,3,1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,0.50,-74.009087,40.713818,1,N,-74.004326,40.719986,2,3.5,0.5,0.5,0.00,0.0,0.3,4.80
4,4,1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.00,-73.971176,40.762428,1,N,-74.004181,40.742653,2,15.0,0.5,0.5,0.00,0.0,0.3,16.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1274893,1274893,1,2015-01-11 08:32:03,2015-01-11 08:34:37,1,0.30,-73.947800,40.790176,1,N,-73.952866,40.791824,2,3.5,0.0,0.5,0.00,0.0,0.3,4.30
1274894,1274894,1,2015-01-11 08:32:03,2015-01-11 08:36:49,1,1.30,-73.980423,40.775387,1,N,-73.992508,40.758579,2,6.0,0.0,0.5,0.00,0.0,0.3,6.80
1274895,1274895,1,2015-01-11 08:32:04,2015-01-11 08:44:21,1,2.40,-73.981750,40.778496,1,N,-73.955757,40.763962,2,10.5,0.0,0.5,0.00,0.0,0.3,11.30
1274896,1274896,1,2015-01-11 08:32:05,2015-01-11 08:41:05,1,2.20,-73.982559,40.771423,1,N,-73.994759,40.748760,1,9.0,0.0,0.5,1.95,0.0,0.3,11.75


# Faster Append with Modin's ``concat``

Our previous ``read_csv`` example operated on a relatively small dataframe. In the following example, we make 100 copies of the taxi dataset and concatenate them.

In [9]:
N_copies = 100
start = time.time()

big_pandas_df = pandas.concat([pandas_df for _ in range(N_copies)])

end = time.time()
pandas_duration = end - start
print("Time to concat with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to concat with pandas: 40.049 seconds


In [10]:
start = time.time()

big_modin_df = pd.concat([modin_df for _ in range(N_copies)])

end = time.time()
modin_duration = end - start
print("Time to concat with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `concat`!".format(round(pandas_duration / modin_duration, 2)))

Time to concat with Modin: 4.95 seconds


### Modin is 8.09x faster than pandas at `concat`!

The resulting dataset is around 19GB in size.

In [11]:
big_modin_df.info()

<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 127489800 entries, 0 to 1274897
Data columns (total 20 columns):
 #   Column                 Non-Null Count      Dtype         
---  ---------------------  ------------------  -----         
 0   Unnamed: 0             127489800 non-null  int64
 1   VendorID               127489800 non-null  int64
 2   tpep_pickup_datetime   127489800 non-null  datetime64[ns]
 3   tpep_dropoff_datetime  127489800 non-null  datetime64[ns]
 4   passenger_count        127489800 non-null  int64
 5   trip_distance          127489800 non-null  float64
 6   pickup_longitude       127489800 non-null  float64
 7   pickup_latitude        127489800 non-null  float64
 8   RateCodeID             127489800 non-null  int64
 9   store_and_fwd_flag     127489800 non-null  object
 10  dropoff_longitude      127489800 non-null  float64
 11  dropoff_latitude       127489800 non-null  float64
 12  payment_type           127489800 non-null  int64
 13  fare_amount       
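The "around 19GB" figure can be sanity-checked from the row and column counts reported by `info()`. This is a back-of-the-envelope estimate only. It assumes every cell in all 20 columns occupies 8 bytes, which holds for `int64`, `float64`, and `datetime64[ns]`. For the `object` column it counts only the pointer, not the string it refers to. The index is ignored.

```python
# Figures taken from the info() output: 127489800 entries, 20 columns.
rows_per_copy = 1_274_898  # index runs from 0 to 1274897 in a single copy
n_copies = 100
n_columns = 20
bytes_per_cell = 8  # assumption: every column uses an 8-byte representation

total_rows = rows_per_copy * n_copies
total_bytes = total_rows * n_columns * bytes_per_cell

print("Rows:", total_rows)
print("Approximate size: {} GiB".format(round(total_bytes / 2**30, 1)))
```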



# Faster ``apply`` over a single column

The performance benefits of Modin become apparent when we operate on large, gigabyte-scale datasets. For example, let's say that we want to round the values in a single column to the nearest integer via the ``apply`` operation.
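As a small illustration of what `apply(round)` computes per element, the sketch below calls Python's built-in `round` on the `trip_distance` values from the first five rows shown earlier. With a single argument, `round` returns the nearest integer and resolves exact ties toward the even neighbour, so `0.5` becomes `0` rather than `1`. The example uses plain Python floats and does not need pandas or Modin.

```python
# trip_distance values copied from the first five rows of the taxi data
sample_trip_distances = [1.59, 3.30, 1.80, 0.50, 3.00]

# Equivalent, element by element, to Series.apply(round) on these values
rounded = [round(distance) for distance in sample_trip_distances]

print(rounded)  # [2, 3, 2, 0, 3]
```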

In [12]:
start = time.time()
rounded_trip_distance_pandas = big_pandas_df["trip_distance"].apply(round)

end = time.time()
pandas_duration = end - start
print("Time to apply with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to apply with pandas: 57.671 seconds


In [13]:
start = time.time()

rounded_trip_distance_modin = big_modin_df["trip_distance"].apply(round)

end = time.time()
modin_duration = end - start
print("Time to apply with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))

Time to apply with Modin: 3.391 seconds


### Modin is 17.01x faster than pandas at `apply` on one column!
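Each speedup reported in this notebook is the pandas wall-clock time divided by the Modin wall-clock time, rounded to two decimals. The sketch below reproduces the three figures from the timings printed above. The helper name `speedup` is hypothetical, and the timings come from a single run, so they will differ on other machines.

```python
# (pandas seconds, Modin seconds) as printed by the timing cells
timings = {
    "read_csv": (4.721, 2.286),
    "concat": (40.049, 4.95),
    "apply": (57.671, 3.391),
}


def speedup(pandas_seconds, modin_seconds):
    """Return how many times faster the Modin run was, to two decimals."""
    return round(pandas_seconds / modin_seconds, 2)


speedups = {op: speedup(*pair) for op, pair in timings.items()}

for op, factor in speedups.items():
    print("{}: {}x".format(op, factor))
```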

In [14]:
# Make sure to shut down the Dask client
client.close()

# Summary

Hopefully, this tutorial demonstrated how Modin delivers significant speedups on pandas operations with no extra effort. Over the course of this example, we went from roughly 200MB of data to around 19GB without changing anything beyond the import line or manually optimizing our code to reach the scalable performance that Modin provides.

Note that in this quickstart example, we've only shown ``read_csv``, ``concat``, and ``apply``, but these are not the only pandas operations that Modin optimizes. In fact, Modin covers [more than 90% of the pandas API](https://github.com/modin-project/modin/blob/master/README.md#pandas-api-coverage), yielding considerable speedups for many common operations.