## Parquet File Format

Ok, so with the scanner dataset we may want to reformat the data file so that it can be read into a dataframe more quickly and take up less space on the USB drive or your computer relative to the current format (or others, e.g. `.csv`). As we will see below, just reading in the dataset takes about 5 minutes on my laptop (2.5 minutes on my desktop). That is OK on its own; the issue is that we will have to do this multiple times as we read in different files across years, so a faster format will prove useful.

The format we will explore is [`parquet`](https://parquet.apache.org/) which appears to be in growing use within the data science community. 

So what are the advantages? Here is how the format is described:

- "Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row based files like CSV. Apache Parquet is built from the ground up with complex nested data structures in mind. Apache Parquet is built to support very efficient compression and encoding schemes"

This sounds well designed for our purposes, so let's try it out.

---

### Step \#1

First, you need to have the ``pyarrow`` package installed. To do so, run ``conda install pyarrow`` in the command prompt. Side note: on one of my computers I had some issues installing this. If this is the case for you, remove the package and install it via pip instead, so ``pip install pyarrow``.

Then we import the packages in the following way:

In [2]:
import pandas as pd
import time

import pyarrow as pa
import pyarrow.parquet as pq

Now let's read in **one year** of the scanner dataset. Note, I did some of the work here for you by figuring out that the dataset is in a "fixed width format". Pandas can still handle it, but the point of this notebook is that we can do better.
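To make "fixed width format" concrete, here is a minimal sketch on a made-up, in-memory file: every column occupies a fixed range of character positions on each line, and ``pd.read_fwf`` slices the lines at those positions. The toy text and the ``colspecs`` boundaries are assumptions for illustration only; they are not the layout of the real beer file.

```python
import io

import pandas as pd

# Hypothetical toy data: each column sits in the same character positions on every line.
toy_text = (
    "IRI_KEY WEEK UNITS DOLLARS\n"
    "234212  1427     2   17.98\n"
    "234212  1427     1    6.49\n"
)

# Half-open (start, end) character ranges for each column (assumed for this toy text).
colspecs = [(0, 7), (8, 12), (13, 18), (19, 26)]

toy = pd.read_fwf(io.StringIO(toy_text), colspecs=colspecs)
print(toy)
```

If ``colspecs`` is left out, as in the cell below, pandas tries to infer the column boundaries from the first rows of the file.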

In [3]:
file_path = "F://BEER_DATA//BEER_DATA//year_2007//beer"

beer_file = "//beer_groc_1427_1478"

file_path = file_path + beer_file

In [4]:
file_path

'F://BEER_DATA//BEER_DATA//year_2007//beer//beer_groc_1427_1478'

Below we will read it in. I will also measure how long this process takes.

In [5]:
start = time.time()

print('Read in the Beer Data Set:')
scan_beer = pd.read_fwf(file_path)

end = time.time()
print(end - start)

Read in the Beer Data Set:
162.7815866470337


Ok, so this took about two and a half minutes to import. Can we do better?
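Since the same start/end timing pattern shows up again further down, it could be wrapped in a small helper. This is only a sketch: the name ``timer`` is hypothetical, and it uses ``time.perf_counter`` rather than ``time.time`` because the former is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Print how many seconds the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} seconds")


# Usage: wrap any slow step; a toy workload stands in for the file read here.
with timer("Toy workload"):
    total = sum(range(1_000_000))
```

In the notebook, the body of the ``with`` block would be the ``pd.read_fwf`` or ``pq.read_table`` call being timed.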

In [6]:
scan_beer.head()

Unnamed: 0,IRI_KEY,WEEK,SY,GE,VEND,ITEM,UNITS,DOLLARS,F,D,PR
0,234212,1427,0,1,72783,200,2,17.98,NONE,0,0
1,234212,1427,0,1,18200,10985,1,6.49,NONE,0,0
2,234212,1427,0,1,18200,11981,2,12.98,NONE,0,0
3,234212,1427,0,1,18200,6992,1,6.99,NONE,0,0
4,234212,1427,0,1,18200,468,18,26.82,NONE,0,0


Now we will write the dataframe as a parquet file. Below is one way to do it:

In [7]:
new_path = file_path + '.parquet'

pq.write_table(pa.Table.from_pandas(scan_beer), new_path)

# The inner call creates the (Arrow) table from the dataframe; the outer call writes that table to where we want it.
# It is saved to file_path with the extension .parquet

In [8]:
new_path

'F://BEER_DATA//BEER_DATA//year_2007//beer//beer_groc_1427_1478.parquet'

Now this is where the file is, same place but now with the ``.parquet`` extension. 

Now we will delete the original dataframe and read in the ``.parquet`` file to see how fast it is.

In [9]:
del scan_beer


In [10]:
start = time.time()

df = pq.read_table(new_path).to_pandas()
# This reads the parquet file in as a table, then converts it to a dataframe

end = time.time()
print(end - start)

1.3810555934906006


**So this is very awesome.** What took about 160 seconds to read in now takes less than 2 seconds, roughly two orders of magnitude faster! Another thing to notice is that the size of the ``.parquet`` file is only 50 MB, not 500 MB, so the data was compressed in a very efficient way.

In [11]:
df.head()

Unnamed: 0,IRI_KEY,WEEK,SY,GE,VEND,ITEM,UNITS,DOLLARS,F,D,PR
0,234212,1427,0,1,72783,200,2,17.98,NONE,0,0
1,234212,1427,0,1,18200,10985,1,6.49,NONE,0,0
2,234212,1427,0,1,18200,11981,2,12.98,NONE,0,0
3,234212,1427,0,1,18200,6992,1,6.99,NONE,0,0
4,234212,1427,0,1,18200,468,18,26.82,NONE,0,0
