In this part of the assignment, you will use data sourced from the Open Database of Addresses (ODA) provided courtesy of Statistics Canada. You can find the link to the database here: https://www.statcan.gc.ca/en/lode/databases/oda. 

This data is provided under the terms of the [Open Government Licence - Canada](https://open.canada.ca/en/open-government-licence-canada). 

You can find the data as a zip file in D2L with the assignment. Unpack it in the same directory as this notebook before running the cells below.


## Question 1

First, explore your data and the conversion task that you will be working with. In the cell below:

Compare the conversion of the files to parquet using different compression algorithms. What is the difference between using no compression, snappy, and gzip? Comment on processing times and approximately how much compression you observe has been achieved with this data.

Note: You may have some issues running this with the brotli compression algorithm.

Based on any sources you can find online (please include your references and cite them appropriately), how different would using brotli be? What would you expect in terms of processing time?


In [1]:
# MODIFIED
import pandas as pd
import random
import time 
from pathlib import Path

def convert_csv_to_parquet(src, dst, method):
    # print(f"Converting {src} to {dst}")
    df = pd.read_csv(src, low_memory=False)
    df.to_parquet(dst, compression=method)
    
file_list = list(Path('address_data').glob('address_*.csv'))
random.shuffle(file_list)

In [3]:
# MODIFIED - converted code into a function that tests all compression algorithms
# None (the Python object, not the string 'None') is the documented way to disable compression
methods = ['gzip', None, 'snappy', 'brotli']
def file_conversion(file_list, methods):
    for method in methods:
        start_time = time.time()
        for src in file_list:
            dest = str(src.with_suffix('.parquet'))
            convert_csv_to_parquet(src, dest, method)
        print(f"{method}: {time.time()-start_time:.4f} seconds")
file_conversion(file_list, methods)

gzip: 66.4325 seconds
None: 36.4289 seconds
snappy: 37.6192 seconds
brotli: 82.3449 seconds


**Analysis:** 

Conversion times by compression algorithm, in the order they were run:


**gzip**: 66.4 seconds <br>
**none**: 36.4 seconds <br>
**snappy**: 37.6 seconds <br>
**brotli**: 82.3 seconds <br>

After some research online regarding these algorithms, this is why I believe they behaved the way they did:


**gzip**: <br>

gzip compresses files using the DEFLATE algorithm. It is one of the most popular compression algorithms, with a reported compression ratio of around 95% for CSV and JSON. It takes longer than snappy or no compression because it does more work per byte to achieve a higher compression ratio.
_Source_: https://kinsta.com/blog/enable-gzip-compression/#what-is-gzip-compression <br>
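
To illustrate the time-versus-size trade-off described above, the following is a minimal sketch using Python's standard-library `zlib` module, which implements DEFLATE. The synthetic CSV rows and the helper names `make_csv_bytes` and `deflate_stats` are assumptions for illustration only; they are not the ODA data, so the ratios will differ from those of the real files.

```python
import time
import zlib


def make_csv_bytes(n_rows: int = 20000) -> bytes:
    """Build synthetic, repetitive CSV-like text (a hypothetical stand-in for the address files)."""
    header = "id,street_no,street,city,province\n"
    rows = (f"{i},{i % 500},MAIN ST,CALGARY,AB\n" for i in range(n_rows))
    return (header + "".join(rows)).encode("utf-8")


def deflate_stats(data: bytes, level: int):
    """Return (compressed size in bytes, elapsed seconds) for one DEFLATE level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompressing must give back the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


raw = make_csv_bytes()
for level in (1, 6, 9):  # 1 = fastest, 9 = strongest compression
    size, seconds = deflate_stats(raw, level)
    print(f"level {level}: {size / len(raw):.1%} of original size, {seconds:.4f} s")
```

Higher levels generally trade more CPU time for smaller output, which is the same trade-off that separates gzip from snappy in the timings above.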

**snappy**: <br>

snappy is the default compression for the `to_parquet` function in pandas. It is developed by Google and aims for very high speed with reasonable compression rather than maximum compression. This explains why its time is close to that of no compression.

_Source_: https://github.com/google/snappy <br>

**brotli**: <br>

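brotli is also developed by Google. Like gzip's DEFLATE, it combines LZ77 matching with Huffman coding, and it adds context modeling and a larger sliding window to achieve greater compression. This extra work is why it takes the longest to complete.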

_Source_: https://kinsta.com/blog/brotli-compression/#brotli-compression-vs-gzip-compression <br>

**none**: <br>

It makes sense that using no compression is the fastest, since the files are written to parquet without any compression step. The trade-off is that, of the four options, the resulting files take up the most space.
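
The question also asks how much compression was achieved, which the timings alone do not show. A minimal sketch of how the on-disk sizes could be compared follows, using only the standard library. The helper names `total_size` and `size_reduction` are hypothetical, and the sketch assumes each method's output is written to a distinct file name (for example a `.gzip.parquet` suffix), since the conversion loop above writes every method to the same `.parquet` path and so overwrites the previous result.

```python
from pathlib import Path


def total_size(directory, pattern: str) -> int:
    """Sum the sizes in bytes of all files in `directory` matching the glob `pattern`."""
    return sum(p.stat().st_size for p in Path(directory).glob(pattern))


def size_reduction(original_bytes: int, compressed_bytes: int) -> float:
    """Fraction of space saved relative to the original (0.0 means no saving)."""
    if original_bytes == 0:
        raise ValueError("original size must be non-zero")
    return 1 - compressed_bytes / original_bytes
```

For example, `size_reduction(total_size('address_data', 'address_*.csv'), total_size('address_data', 'address_*.gzip.parquet'))` would give the fraction of space saved by gzip relative to the CSV files, under the naming assumption above.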



## Question 2

Parallelize the code above using either processes or threads.

Use comments reading `# MODIFIED` to indicate any code cells you have changed and `# NEW` to indicate any code cells you have added.

With respect to the improvement in runtimes, do you think this is reasonably characteristic? What happens if you use a pool of threads or processes? 


In [5]:
# NEW 
from multiprocessing.pool import ThreadPool as Pool
import multiprocessing
from pathlib import Path
import time


def convert_csv_to_parquet(src, dst, method):
    df = pd.read_csv(src, low_memory=False)
    return df.to_parquet(dst, compression=method)

def thread_pool_func():
    for method in ['gzip', None, 'snappy', 'brotli']:
        start_time = time.time()
        with Pool(4) as pool:
            pool.starmap(convert_csv_to_parquet, [(f, str(f.with_suffix('.parquet')), method) for f in file_list])
        print(f"{method}: {time.time()-start_time:.4f} seconds")

thread_pool_func()

gzip: 28.6826 seconds
None: 22.0003 seconds
snappy: 22.3294 seconds
brotli: 34.7656 seconds


**Analysis:** 

With a pool of 4 threads, every configuration ran faster than the serial version in Question 1. Brotli and gzip improved the most, by roughly 2.4x and 2.3x respectively, while snappy and no compression both improved by roughly 1.7x.

**gzip**: 28.7 seconds <br>
**none**: 22.0 seconds <br>
**snappy**: 22.3 seconds <br>
**brotli**: 34.8 seconds <br>
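
The speedup factors implied by the serial and threaded timings reported in this notebook can be computed directly. The following is a small sketch of that arithmetic; the dictionary and function names are illustrative.

```python
# Timings in seconds, copied from the printed output of the serial and 4-thread runs.
serial = {"gzip": 66.4325, "None": 36.4289, "snappy": 37.6192, "brotli": 82.3449}
threaded = {"gzip": 28.6826, "None": 22.0003, "snappy": 22.3294, "brotli": 34.7656}


def speedups(before: dict, after: dict) -> dict:
    """Return before/after ratios per method; a value above 1 means the run got faster."""
    return {name: before[name] / after[name] for name in before}


for name, factor in speedups(serial, threaded).items():
    print(f"{name}: {factor:.2f}x faster")
```

With 4 threads, a factor of 4 would be the ideal; the lower factors here suggest that part of each conversion does not run in parallel.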

## Question 3

Share one way in which you could improve the running of this test. You may consider (but are not limited to) your runtime environment, the test data provided, or the manner in which this test was run.

One way I could reduce the time this test takes to run is to match the number of threads in the pool to the hardware I'm using.

In [8]:
multiprocessing.cpu_count()

8

My PC reports 8 CPU cores (this count may include logical cores), which means that using around 8 threads should reduce the run time of the test. I can pass `multiprocessing.cpu_count()` as the pool size to implement this.
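
A pool larger than the number of files to convert gains nothing, so the pool size can also be capped by the task count. The following is a minimal sketch under that assumption; `pick_workers` is a hypothetical helper, not part of the original test.

```python
import os
from typing import Optional


def pick_workers(n_tasks: int, cpus: Optional[int] = None) -> int:
    """Choose a pool size: no more workers than tasks or CPUs, and at least one."""
    if cpus is None:
        # os.cpu_count() reports logical CPUs and can return None on some platforms.
        cpus = os.cpu_count() or 1
    return max(1, min(n_tasks, cpus))
```

It could be used as `Pool(pick_workers(len(file_list)))` in place of a fixed pool size.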

In [9]:
# MODIFIED
def convert_csv_to_parquet(src, dst, method):
    df = pd.read_csv(src, low_memory=False)
    return df.to_parquet(dst, compression=method)

def thread_pool_func():
    for method in ['gzip', None, 'snappy', 'brotli']:
        start_time = time.time()
        with Pool(multiprocessing.cpu_count()) as pool:
            pool.starmap(convert_csv_to_parquet, [(f, str(f.with_suffix('.parquet')), method) for f in file_list])
        print(f"{method}: {time.time()-start_time:.4f} seconds")

thread_pool_func()

gzip: 23.7607 seconds
None: 20.7990 seconds
snappy: 20.9314 seconds
brotli: 28.7523 seconds


_**Improvement over the 4-thread run:**_

**gzip**: 4.9 seconds <br>
**none**: 1.2 seconds <br>
**snappy**: 1.4 seconds <br>
**brotli**: 6.0 seconds <br>

We see that matching the number of threads to the number of cores on the machine running the test reduces the run time. The "heavier" compression algorithms, gzip and brotli, showed the largest improvements.
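
Differences of a second or two between single runs can fall within normal run-to-run variation, so another possible improvement to the manner in which the test is run is to repeat each measurement. The following is a minimal sketch, not part of the original test; `time_repeated` and `summarize` are hypothetical helper names.

```python
import statistics
import time


def time_repeated(func, repeats: int = 3) -> list:
    """Call `func` with no arguments `repeats` times and return each elapsed time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings) -> tuple:
    """Return (fastest, median) of the measured times."""
    return min(timings), statistics.median(timings)
```

Reporting the fastest and median times for each method, rather than one run, would make it clearer whether the smaller improvements are real.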