![LOGO](../../../img/MODIN_ver2_hrz.png)

<center><h2>Scale your pandas workflows by changing one line of code</h2></center>


# Exercise 2: Speed improvements

**GOAL**: Learn about common functionality that Modin speeds up by using all of your machine's cores.

## Concept for Exercise: `read_csv` speedups

CSV is the most commonly used data ingestion format in pandas (link to pandas survey). This exercise gives an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports parallel and distributed reads of other file formats, which are listed in the documentation.

We will import both Modin and pandas so that the speedups are evident.

**Note: Rerunning the `read_csv` cells many times may degrade performance, depending on the memory of the machine.**

In [None]:
import modin.pandas as pd
import pandas
import time
import modin.config as cfg
cfg.StorageFormat.put("omnisci")

### Dataset: 2015 NYC taxi trip data

Link to raw dataset: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes

**Size: ~2GB**

In [None]:
s3_path = "s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv"

**Optional:** The dataset takes a while to download. If you prefer to download the file once and read it locally, you can run the following code in the notebook:

In [None]:
# [Optional] Download data locally. This may take a few minutes.
# Caution: this URL is the January 2021 file, while s3_path above points to January 2015.
# import urllib.request
# url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
# urllib.request.urlretrieve(url_path, "taxi.csv")
# s3_path = "taxi.csv"  # the cells below read from s3_path

## `pandas.read_csv`

In [None]:
start = time.time()

pandas_df = pandas.read_csv(s3_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

### Expect pandas to take >3 minutes on EC2, longer locally

This is a good time to chat with your neighbor.

Discussion topics:
- Do you work with a large amount of data daily?
- How big is your data?
- What’s the common use case of your data?
- Do you use any big data analytics tools?
- Do you use any interactive analytics tools?
- What are some drawbacks of your current interactive analytics tools?

## `modin.pandas.read_csv`

In [None]:
start = time.time()

modin_df = pd.read_csv(s3_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

print("Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))
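
The start/end timing pattern used in these cells repeats throughout the exercise. As a minimal sketch, it could be wrapped in a small context manager. The name `timed` and the `durations` dictionary are hypothetical helpers for illustration, not part of Modin or pandas.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


durations = {}

# Stand-in workload; in the exercise this would be a read_csv call.
with timed("sleep", durations):
    time.sleep(0.01)

print("Elapsed: {} seconds".format(round(durations["sleep"], 3)))
```

With two labels recorded, the speedup is the ratio of the two stored durations, as in the `print` statement above.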

## Are they equal?

In [None]:
modin_df

In [None]:
pandas_df
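
Displaying both frames only allows a visual comparison. A programmatic check could look like the sketch below, which uses `pandas.testing.assert_frame_equal` on two small stand-in frames. Converting a Modin frame to pandas first is left out here because the conversion call depends on the Modin version, so treat the frame names as assumptions.

```python
import pandas
from pandas.testing import assert_frame_equal

# Stand-in frames; in the exercise these would be pandas_df
# and a pandas copy of modin_df.
left = pandas.DataFrame({"passenger_count": [1, 2], "fare_amount": [7.5, 12.0]})
right = pandas.DataFrame({"passenger_count": [1, 2], "fare_amount": [7.5, 12.0]})

# Raises AssertionError describing the difference if the frames do not match.
assert_frame_equal(left, right)

# DataFrame.equals gives a plain True/False answer instead of raising.
frames_match = left.equals(right)
print("Frames match:", frames_match)
```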

## Concept for exercise: Reduces

In pandas, a reduce is an operation such as `sum` or `count` that computes a summary statistic over the rows or columns. We will be using `sum`.
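
One way to picture why a reduce parallelizes well: the data can be split into partitions, each partition reduced independently, and the partial results combined. The sketch below illustrates that idea with plain Python lists. It is a conceptual illustration only, not Modin's implementation, and `partitioned_sum` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_sum(values, n_partitions=4):
    """Sum `values` by summing chunks independently, then combining the partial sums."""
    chunk = max(1, -(-len(values) // n_partitions))  # ceiling division
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # Threads keep the sketch short; they do not give a CPU-bound speedup
    # in standard Python. A real engine would use separate worker processes.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)


fares = [7.5, 12.0, 3.25, 9.0, 15.5, 4.75]
total = partitioned_sum(fares, n_partitions=3)
print(total)
```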

In [None]:
start = time.time()

pandas_sum = (pandas_df[['VendorID', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'RateCodeID', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']]).sum()

end = time.time()
pandas_duration = end - start

print("Time to sum with pandas: {} seconds".format(round(pandas_duration, 3)))

In [None]:
start = time.time()

modin_sum = (modin_df[['VendorID', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'RateCodeID', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']]).sum()

end = time.time()
modin_duration = end - start
print("Time to sum with Modin: {} seconds".format(round(modin_duration, 3)))

print("Modin is {}x faster than pandas at `sum`!".format(round(pandas_duration / modin_duration, 2)))

## Are they equal?

In [None]:
pandas_sum

In [None]:
modin_sum

## Concept for exercise: Groupby and aggregate

In pandas, you can group by a column and aggregate each group. We will group by a column in the dataset and use `count` as our aggregate.
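
To see what `groupby(...).count()` returns before running it on the full dataset, here is a small sketch on made-up rows. The values are invented for illustration; only the column names come from the taxi data. `count` reports the number of non-null entries per group for each remaining column.

```python
import pandas

# Tiny made-up sample; the None becomes a missing value that count() skips.
trips = pandas.DataFrame({
    "total_amount": [10.0, 10.0, 25.5, 10.0],
    "passenger_count": [1, 2, 1, None],
    "tip_amount": [2.0, 0.0, 5.0, 1.0],
})

# One row per distinct total_amount, one column per remaining column.
counts = trips.groupby(by="total_amount").count()
print(counts)
```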

In [None]:
start = time.time()

pandas_groupby = pandas_df.groupby(by="total_amount").count()

end = time.time()
pandas_duration = end - start

print("Time to groupby with pandas: {} seconds".format(round(pandas_duration, 3)))

In [None]:
start = time.time()

modin_groupby = modin_df.groupby(by="total_amount").count()

end = time.time()
modin_duration = end - start
print("Time to groupby with Modin: {} seconds".format(round(modin_duration, 3)))

print("Modin is {}x faster than pandas at `groupby`!".format(round(pandas_duration / modin_duration, 2)))

## Are they equal?

In [None]:
pandas_groupby

In [None]:
modin_groupby

**Please move on to [Exercise 3](./exercise_3.ipynb)**