# rolling your own functions package

In this notebook we develop a function that can be integrated, as is, into a machine learning pipeline. MLRun offers several options for packaging your code into reusable components. Here we focus on a common one: a function that sits inside a package module residing in a code repository, in this case **[GitHub](https://github.com)**.

### arc to parquet

One of the most common steps in a machine learning pipeline is the acquisition of remote archives. In research and competitions these are often one-time downloads per project, whereas in commercial applications this can involve hundreds or even thousands of files per day. In either case, if saving the data makes sense (cents!), then a popular format is **[parquet](https://parquet.apache.org/documentation/latest/)**. That's because parquet files can be loaded relatively quickly on systems with fast CPUs, the format is columnar, it's compatible with **[Apache Arrow](https://arrow.apache.org/docs/python/parquet.html)**, and it's a standard file format, which offers some assurance that it won't suffer versioning issues over time.

Since we do a lot of pipeline building with these raw data files, we want this component to optionally make its data available to the next step in an MLRun pipeline, and not just store it. So we log the data as an MLRun artifact. This also lets us take a peek at the data in tabular format through an artifact viewer, and makes available some descriptive stats and information on categorical types that may require further attention.
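
As a rough idea of what "descriptive stats and information on categorical types" can look like, here is a sketch in plain pandas. It does not use MLRun or its artifact viewer, and the column names and values are hypothetical.

```python
import pandas as pd

# a made-up sample standing in for a logged raw-data table
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [40_000.0, 52_000.0, 88_000.0, 61_000.0],
        "occupation": ["clerk", "engineer", "engineer", "clerk"],
    }
)

# summary statistics (count, mean, std, quartiles) for the numeric columns
numeric_summary = df.describe()

# non-numeric columns are the usual candidates for categorical handling
categorical_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

# number of distinct values per candidate column
cardinality = {col: df[col].nunique() for col in categorical_cols}

print(numeric_summary)
print(cardinality)
```

A column with very high cardinality, or one that mixes types, is the kind of thing that "may require further attention" before training.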

    !pip uninstall -y mlrun
    !pip install -U mlrun

    !pip install -U joblib pandas pyarrow numpy==1.16.4

    !pip install -U nuclio-jupyter

In [5]:
# nuclio: ignore
import nuclio 

In [6]:
%nuclio cmd -c pip install joblib pandas pyarrow numpy==1.16.4
%nuclio cmd -c pip install mlrun

In [7]:
import os
from pathlib import Path
from typing import IO, AnyStr, List, Union

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from mlrun.execution import MLClientCtx

from functions.tables import log_context_table

In [None]:
def arc_to_parquet(
    context: MLClientCtx,
    archive_url: Union[str, Path, IO[AnyStr]],
    header: Union[None, List[str]] = None,
    target_path: str = "",
    name: str = "",
    chunksize: int = 10_000,
    log_data: bool = True,
    key: str = 'raw_data'
) -> None:
    """Open a file/object archive and save as a parquet file.
    
    :param context:     function context
    :param archive_url: any value accepted as the first argument of
                        pandas.read_csv, including file paths and URLs as
                        strings, pathlib.Path objects, etc.
    :param header:      column names
    :param target_path: destination folder of the table
    :param name:        name of the file to be saved locally
    :param chunksize:   (10_000) number of rows retrieved per iteration
    :param log_data:    (True) if True, log the data so that it is available
                        at the next step
    :param key:         when logging data as an artifact, assign it to this
                        key
    """
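    # os.makedirs("") raises FileNotFoundError, so only create the folder
    # when a non-empty target_path was given
    if target_path:
        os.makedirs(target_path, exist_ok=True)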

    if not name.endswith(".parquet"):
        name += ".parquet"

    dest_path = os.path.join(target_path, name)

    if not os.path.isfile(dest_path):
        context.logger.info("destination file does not exist, downloading")
        pqwriter = None
        try:
            # stream the archive in chunks so it never has to fit in memory
            for df in pd.read_csv(archive_url, chunksize=chunksize, names=header):
                table = pa.Table.from_pandas(df)
                if pqwriter is None:
                    # the first chunk defines the schema of the parquet file
                    pqwriter = pq.ParquetWriter(dest_path, table.schema)
                pqwriter.write_table(table)
        finally:
            # close the writer even if a chunk fails, so the file handle
            # is released
            if pqwriter is not None:
                pqwriter.close()

        context.logger.info(f"saved table to {dest_path}")
    else:
        context.logger.info("destination file exists")

    if log_data:
        context.logger.info("logging data to context")
        # we simply give the context our file location and 
        # assign it to `key`:
        log_context_table(context, dest_path, key)