# Scale-Out Data Preparation
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Once we have finished preparing and featurizing the data locally, we can run the same steps on the full dataset in scale-out mode. The New York taxi cab data is about 300GB in total, which makes it a good candidate for scale-out. Let's start by downloading the package we saved to disk earlier. Alternatively, run the `new_york_taxi_cab.ipynb` notebook to generate the package yourself; in that case, comment out the download code and set `dflow_path` to the location where the package is saved.

In [None]:
from tempfile import mkdtemp
from os import path
from urllib.request import urlretrieve

dflow_root = mkdtemp()
dflow_path = path.join(dflow_root, "new_york_taxi.dprep")
print("Downloading Dataflow to: {}".format(dflow_path))
urlretrieve("https://dprepdata.blob.core.windows.net/demo/new_york_taxi_v2.dprep", dflow_path)
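
If you re-run this notebook often, it can be handy to skip the download when a local copy already exists. The following is a minimal sketch using only the standard library; `fetch_if_missing` is a hypothetical helper name and is not part of `azureml.dataprep`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, target):
    """Download `url` to `target` unless `target` already exists.

    Hypothetical helper for illustration; returns the target as a Path.
    """
    target = Path(target)
    if target.exists():
        # A local copy is already present, so leave it untouched.
        return target
    # Create the parent directory if needed, then download.
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target
```

With this helper, `dflow_path = str(fetch_if_missing("https://dprepdata.blob.core.windows.net/demo/new_york_taxi_v2.dprep", dflow_path))` would download the file only on the first run. Note that it never refreshes a stale copy, so delete the local file if you need a new one.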

Let's load the package we just downloaded.

In [None]:
import azureml.dataprep as dprep

df = dprep.Dataflow.open(dflow_path)

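Let's replace the data sources with the full dataset. There are two to update: the yellow taxi source, which is embedded in a nested Dataflow inside one of the steps, and the green taxi source, which is the Dataflow's own data source.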

In [None]:
from uuid import uuid4

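# Step 7 embeds a second Dataflow; grab that Dataflow's first step.
# Note: _get_steps() is a private method and may change between versions.
other_step = df._get_steps()[7].arguments['dataflows'][0]['anonymousSteps'][0]
other_step['id'] = str(uuid4())  # give the modified step a fresh unique id
# Update the path target and point the step at the full set of yellow taxi files.
other_step['arguments']['path']['target'] = 1
other_step['arguments']['path']['resourceDetails'][0]['path'] = 'https://wranglewestus.blob.core.windows.net/nyctaxi/yellow_tripdata*'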
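
The trailing `*` in the path above is a wildcard. Assuming it behaves like a shell-style glob, it selects every blob whose name starts with `yellow_tripdata`. The sketch below illustrates that matching rule with Python's `fnmatch` module; the file names are hypothetical and are not a listing of the actual container.

```python
from fnmatch import fnmatchcase

# Hypothetical blob names, for illustration only.
blob_names = [
    "yellow_tripdata_2015-01.csv",
    "yellow_tripdata_2015-02.csv",
    "green_tripdata_2015-01.csv",
]

pattern = "yellow_tripdata*"

# Keep only the names that match the wildcard pattern (case-sensitive).
matched = [name for name in blob_names if fnmatchcase(name, pattern)]
print(matched)
```

Only the two `yellow_tripdata` names pass the filter, which is why the green taxi data needs its own pattern.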

In [None]:
green_dsource = dprep.BlobDataSource("https://wranglewestus.blob.core.windows.net/nyctaxi/green_tripdata*")
df = df.replace_datasource(green_dsource)

Now that the data sources have been replaced, we can run the same steps on the full dataset. We will print the first 5 rows of the Spark DataFrame. Since we are running on the full dataset, this might take a while, depending on the size of your Spark cluster.

In [None]:
spark_df = df.to_spark_dataframe()
spark_df.head(5)
