Make sure that pip is up to date, that the latest version of pandas is installed, and that pyarrow is installed. pyarrow is required to create Parquet files directly from pandas, as the utility does.

In [1]:
!pip install --upgrade pip

Requirement already up-to-date: pip in /opt/conda/lib/python3.5/site-packages (18.0)


In [2]:
!pip install pandas --upgrade

Requirement already up-to-date: pandas in /opt/conda/lib/python3.5/site-packages (0.23.3)


In [3]:
!pip install pyarrow --upgrade

Requirement already up-to-date: pyarrow in /opt/conda/lib/python3.5/site-packages (0.9.0)
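The installed versions can also be checked from Python rather than read off the pip output. The following is a minimal sketch, assuming Python 3.8 or newer (the environment above runs Python 3.5, where `importlib.metadata` is not available); the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the three packages this notebook depends on.
for name, version in installed_versions(["pip", "pandas", "pyarrow"]).items():
    print(f"{name}: {version or 'not installed'}")
```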


Set up logging so that the utilities can report their timings.

In [4]:
import logging
import utilities.setup_logging

utilities.setup_logging.setup_logging()
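The body of `setup_logging` is not shown in this notebook. The `INFO:root:` prefix on the output further down matches the default format of the standard `logging` module, so one plausible shape is sketched here; `setup_logging_sketch` is a hypothetical stand-in, not the utility's actual code.

```python
import logging


def setup_logging_sketch(level=logging.INFO):
    """Hypothetical stand-in for utilities.setup_logging.setup_logging()."""
    # basicConfig attaches a stderr handler to the root logger. It does nothing
    # if the root logger already has handlers, so the level is also set directly.
    logging.basicConfig(level=level)
    logging.getLogger().setLevel(level)


setup_logging_sketch()
logging.info("URL Read Time = %s seconds", 0.14)
```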

Use our script utilities to read the UCI Iris dataset from its URL and write it to Pickle and Parquet files.

In [5]:
from utilities.pickle_util import read_csv_into_pickle

pickle_path = 'iris.pickle'

# Read the CSV (at URL) into a Pickle file
read_csv_into_pickle(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'],
    store_path=pickle_path
)

INFO:root:URL Read Time = 0.1403353214263916 seconds
INFO:root:Pickle Creation Time = 0.0016875267028808594 seconds
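The source of `read_csv_into_pickle` is not part of this notebook. Judging only by its arguments and the two log lines, it probably reads the headerless CSV with the supplied column names and then pickles the resulting DataFrame. A minimal sketch under that assumption follows; the function name and its internals are hypothetical.

```python
import logging
import time

import pandas as pd


def read_csv_into_pickle_sketch(source, column_names, store_path):
    """Hypothetical sketch: load a headerless CSV and store it as a pickle."""
    start = time.time()
    # header=None because the file has no header row; names supplies the labels.
    df = pd.read_csv(source, header=None, names=column_names)
    logging.info("URL Read Time = %s seconds", time.time() - start)

    start = time.time()
    df.to_pickle(store_path)
    logging.info("Pickle Creation Time = %s seconds", time.time() - start)
    return df
```

`source` can be a URL, a local path, or a file-like object, since `pd.read_csv` accepts all three.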


In [6]:
!ls -l iris.pickle

-rw-r--r-- 1 root root 5997 Jul 23 15:20 iris.pickle


In [7]:
from utilities.parquet_util import read_csv_into_parquet

parquet_path = 'iris.parquet'

# Read the CSV (at URL) into a Parquet file
read_csv_into_parquet(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'],
    store_path=parquet_path
)

INFO:root:URL Read Time = 0.12006831169128418 seconds
INFO:root:Parquet Creation Time = 0.022646427154541016 seconds
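The Parquet step is where pyarrow comes in. The sketch below shows only the write and read calls that pandas provides for this, assuming pyarrow is installed; the two function names are hypothetical and the utility itself may be structured differently.

```python
import pandas as pd


def write_parquet_sketch(df, store_path):
    """Hypothetical sketch: write a DataFrame to a Parquet file via pyarrow."""
    # engine="pyarrow" makes the dependency explicit; pandas raises ImportError
    # at this call if pyarrow is not installed.
    df.to_parquet(store_path, engine="pyarrow")


def read_parquet_sketch(store_path):
    """Hypothetical sketch: read a Parquet file back into a DataFrame."""
    return pd.read_parquet(store_path, engine="pyarrow")
```

For example, `write_parquet_sketch(df, 'iris.parquet')` followed by `read_parquet_sketch('iris.parquet')` round-trips a DataFrame through the file.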


In [8]:
!ls -l iris.parquet

-rw-r--r-- 1 root root 4144 Jul 23 15:20 iris.parquet
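In this run the `ls -l` listings report 5997 bytes for the Pickle file and 4144 bytes for the Parquet file. The same comparison can be made from Python, which also works where `ls` is unavailable. This is a small sketch; `file_sizes` is an illustrative helper, not part of the utilities.

```python
import os


def file_sizes(paths):
    """Return a {path: size_in_bytes} mapping for the paths that exist."""
    return {path: os.path.getsize(path) for path in paths if os.path.exists(path)}


# Files that have not been created yet are simply skipped.
for path, size in file_sizes(["iris.pickle", "iris.parquet"]).items():
    print(f"{path}: {size} bytes")
```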


Read the Pickle and the Parquet files into dataframes.

In [9]:
from utilities.pickle_util import read_pickle_to_df

pickle_df = read_pickle_to_df(pickle_path)

INFO:root:Pickle Read Time = 0.0009696483612060547 seconds


In [10]:
from IPython.display import display

display(pickle_df.head())

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [11]:
from utilities.parquet_util import read_parquet_to_df

parquet_df = read_parquet_to_df(parquet_path)

INFO:root:Parquet Read Time = 0.006642341613769531 seconds
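The read times logged above come from single measurements in the millisecond range, so they vary from run to run. If similar timings are wanted around other steps, a small context manager is one way to produce log lines of the same shape. This is a sketch; `log_elapsed` is a hypothetical helper, and how the utilities actually measure time is not shown in this notebook.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def log_elapsed(label):
    """Log how long the enclosed block took, as '<label> = <seconds> seconds'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # The finally clause logs the duration even if the block raises.
        logging.info("%s = %s seconds", label, time.perf_counter() - start)


with log_elapsed("Demo Time"):
    total = sum(range(100000))
```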


In [12]:
display(parquet_df.head())

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


Use assertions to verify that both objects are DataFrames and that they contain identical data.

In [13]:
import pandas as pd

assert isinstance(pickle_df, pd.DataFrame)
assert isinstance(parquet_df, pd.DataFrame)
assert pickle_df.equals(parquet_df)
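`DataFrame.equals` only returns `True` or `False`, so a failing assertion gives no hint about where the frames differ. As an optional alternative, `pd.testing.assert_frame_equal` raises an `AssertionError` whose message describes the mismatch. The sketch below uses two small hand-made frames rather than the Iris data so that it runs on its own.

```python
import pandas as pd

left = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "class": ["Iris-setosa", "Iris-setosa"]}
)
right = left.copy()

# Passes silently when the two frames match.
pd.testing.assert_frame_equal(left, right)

# Change one value; the same call now raises AssertionError.
right.loc[0, "sepal_length"] = 5.2
try:
    pd.testing.assert_frame_equal(left, right)
except AssertionError as err:
    print(err)
```

In the notebook, the equivalent call would be `pd.testing.assert_frame_equal(pickle_df, parquet_df)`.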