In this part of the assignment, you will use data sourced from the Open Database of Addresses (ODA) provided courtesy of Statistics Canada. You can find the link to the database here: https://www.statcan.gc.ca/en/lode/databases/oda. 

This data is provided under the terms of the [Open Government Licence - Canada](https://open.canada.ca/en/open-government-licence-canada).

You can find the data as a zip file in D2L with the assignment. Unpack it in the same directory as this notebook before running the cells below.


## Question 1

First, explore your data and the conversion task that you will be working with. In the cell below:

Compare the conversion of the files to parquet using different compression algorithms. What is the difference between using no compression, snappy, and gzip? Comment on processing times and approximately how much compression you observe has been achieved with this data.

Note: You may have some issues running this with the brotli compression algorithm.

Based on any sources you can find online (please include your references and cite them appropriately), how different would using brotli be? What would you expect in terms of processing time?


In [6]:
# MODIFIED
import pandas as pd
import random
import time 
from pathlib import Path

def convert_csv_to_parquet(src, dst, method):
    """Read one CSV file and write it as parquet using the given compression."""
    df = pd.read_csv(src, low_memory=False)
    df.to_parquet(dst, compression=method)

# Collect the address CSV files and process them in random order
file_list = list(Path('address_data').glob('address_*.csv'))
random.shuffle(file_list)

In [10]:
# MODIFIED
# None (rather than the string 'None') is the documented pandas value for no compression
methods = ['gzip', None, 'snappy', 'brotli']
def file_conversion(file_list, methods):
    for method in methods:
        start_time = time.time()
        for src in file_list:
            # Note: every method writes to the same .parquet path, overwriting the previous run
            dest = str(src.with_suffix('.parquet'))
            convert_csv_to_parquet(src, dest, method)
        print(f"{method}: {time.time()-start_time:.4f} seconds")
file_conversion(file_list, methods)

gzip: 119.2666 seconds
None: 55.6176 seconds
snappy: 54.5760 seconds
brotli: 135.7738 seconds


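**Analysis:** Compression methods ordered from slowest to fastest (total conversion time, rounded):

brotli: 136 sec <br>
gzip: 119 sec <br>
none: 56 sec <br>
snappy: 55 sec <br>

Snappy added essentially no time compared with writing uncompressed files, while gzip and brotli each took more than twice as long.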
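
The timings alone do not show how much compression each method achieves, and because every run above writes to the same `.parquet` path, earlier outputs are overwritten. One possible way to measure sizes is sketched here under assumptions: the helper names `dest_for`, `total_size`, and `compression_ratio` are hypothetical and not part of the assignment code, and each method is assumed to be written to its own file name such as `address_1.gzip.parquet`.

```python
from pathlib import Path


def dest_for(src, method):
    """Build a per-method output path, e.g. address_1.csv -> address_1.gzip.parquet."""
    # Assumption: None (no compression) is labelled "none" in the file name
    label = "none" if method is None else str(method).lower()
    return Path(src).with_suffix(f".{label}.parquet")


def total_size(paths):
    """Sum the on-disk size in bytes of the given files, skipping missing ones."""
    return sum(Path(p).stat().st_size for p in paths if Path(p).exists())


def compression_ratio(csv_paths, method):
    """Return parquet bytes divided by CSV bytes for one method (smaller is better)."""
    csv_paths = list(csv_paths)
    csv_bytes = total_size(csv_paths)
    parquet_bytes = total_size(dest_for(p, method) for p in csv_paths)
    return parquet_bytes / csv_bytes if csv_bytes else float("nan")
```

If the conversion loop used `dest_for(src, method)` as its destination, `compression_ratio(file_list, 'gzip')` would give the gzip output size as a fraction of the original CSV size.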

After some research online regarding these algorithms, this is why I believe they behaved the way they did:

Sources:

## Question 2

Parallelize the code above using either processes or threads.

Use comments reading `# MODIFIED` to indicate any code cells you have changed and `# NEW` to indicate any code cells you have added.

With respect to the improvement in runtimes, do you think this is reasonably characteristic? What happens if you use a pool of threads or processes? 


In [5]:
# NEW 
from multiprocessing.pool import ThreadPool as Pool  # thread-based pool, not separate processes
import multiprocessing
from pathlib import Path
import time

import pandas as pd


def convert_csv_to_parquet(src, dst, method):
    df = pd.read_csv(src, low_memory=False)
    return df.to_parquet(dst, compression=method)

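def thread_pool_func():
    # None (rather than the string 'None') is the documented pandas value for no compression
    for method in ['gzip', None, 'snappy', 'brotli']:
        start_time = time.time()
        # One worker thread per CPU core; each task converts a single file
        with Pool(multiprocessing.cpu_count()) as pool:
            pool.starmap(convert_csv_to_parquet, [(f, str(f.with_suffix('.parquet')), method) for f in file_list])
        print(f"{method}: {time.time()-start_time:.4f} seconds")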

thread_pool_func()

gzip: 62.74 seconds
None: 47.84 seconds
snappy: 56.58 seconds
brotli: 88.44 seconds


**Analysis:** Using a thread pool (`multiprocessing.pool.ThreadPool`), the runtimes for brotli and gzip dropped significantly: brotli from about 136 to 88 seconds and gzip from about 119 to 63 seconds. Snappy was about the same, if not a little slower (about 55 to 57 seconds). Writing with no compression was a little faster (about 56 to 48 seconds).

Runtimes with the thread pool, slowest to fastest (rounded):

brotli: 88 sec <br>
gzip: 63 sec <br>
snappy: 57 sec <br>
none: 48 sec <br>
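
To compare a pool of threads with a pool of processes on equal footing, the same task list can be run through both executor types from `concurrent.futures`. The following is only a sketch: `run_with`, `cpu_task`, and `EXECUTORS` are hypothetical names, and `cpu_task` compresses synthetic bytes with `zlib` as a stand-in for the CSV-to-parquet conversion, so its timings say nothing about the address data.

```python
import time
import zlib
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# The two pool types to compare; both expose the same map() interface
EXECUTORS = {"threads": ThreadPoolExecutor, "processes": ProcessPoolExecutor}


def cpu_task(seed):
    """Stand-in workload: compress synthetic bytes and return the compressed size."""
    data = bytes((seed + i) % 251 for i in range(50_000))
    return len(zlib.compress(data, 9))


def run_with(executor_cls, task, items, workers=4):
    """Run task over items with the given executor class; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(task, items))
    return results, time.perf_counter() - start
```

Calling `run_with(EXECUTORS["threads"], cpu_task, range(8))` and then the same with `EXECUTORS["processes"]` gives two timings for identical work. Depending on the platform, a process pool started from a notebook may require the task function to live in an importable module rather than in a cell.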

## Question 3

Share one way in which you could improve the running of this test. You may consider (but are not limited to) your runtime environment, the test data provided, or the manner in which this test was run.
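
One option that touches the manner in which the test was run is to repeat each measurement and report a summary, since a single run can be skewed by disk caching or background load. A minimal sketch, assuming hypothetical helpers `time_repeated` and `summarize` that are not part of the assignment code:

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the minimum, median, and maximum of the measured durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

For example, `summarize(time_repeated(lambda: file_conversion(file_list, ['snappy'])))` would run the snappy conversion five times and report the spread instead of a single number.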