## Overview

pandas is a very useful tool for manipulating data, but it is heavy on memory because of the number of features it provides. People (ab)use pandas out of convenience to open big CSV, Excel and other file formats, which is fine for the most part, but using pandas just to take a look at a big data file and manipulate a little bit of it is slow and cumbersome. So I'm going to use a Python generator to demonstrate that it doesn't take much to achieve this kind of task with built-in modules. I suspect it _might be_ more expensive on the I/O side, but speed-wise it is faster because it loads the data lazily, only when you need it.

I'm going to use this bank dataset from data.gov: https://catalog.data.gov/dataset/development-credit-authority-dca-data-set-loan-transactions-28508, renamed to `'credit_loan_dataset.csv'`. It's ~33MB of comma-delimited CSV containing 186,545 rows of raw data.
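As a first taste of the lazy approach, the row count quoted above can be checked without loading the file into memory. This is a minimal sketch using only the standard library; `count_rows` is a hypothetical helper name, not something from the dataset's documentation.

```python
import csv

def count_rows(filename, dialect='excel'):
    """Count data rows (header excluded) without holding the file in memory."""
    with open(filename, 'r', newline='') as f:
        reader = csv.DictReader(f, dialect=dialect)
        # the generator expression pulls one row at a time from the reader,
        # so only a single row is alive at any moment
        return sum(1 for _ in reader)
```

Calling `count_rows('credit_loan_dataset.csv')` should report the number of data rows, assuming the file sits in the working directory.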

### Software & hardware

This is based on miniconda IPython version 7.9.0 (Python 3.7.5) on macOS Mojave 10.14.6 with 16GB of 2133MHz memory and a cheap 120GB white-branded **SATA** M.2 drive. YMMV on the actual performance on your machine, but a regular SATA SSD should be comparable and a PCIe NVMe M.2 drive should be much faster.

In [6]:
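import json, csv
from itertools import islice
from pprint import pprint

def open_csv(filename, stop=5, dialect='excel'):
    """Lazily yield at most `stop` rows from a CSV file, one row at a time."""
    # newline='' is what the csv module docs recommend when opening files
    with open(filename, 'r', newline='') as f:
        reader = csv.DictReader(f, dialect=dialect)
        # islice ends the generator after `stop` rows; the earlier
        # `range(0, 1, stop)` loop ran exactly once and never limited anything
        for row in islice(reader, stop):
            yield row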

In [2]:
# let's try and see if this works
# since Python 3.8 each row from `csv.DictReader` is a `dict`; in 3.6 and 3.7 it's an `OrderedDict`,
# hence the explicit dict() call below
data = open_csv('credit_loan_dataset.csv')
test = dict(next(data))

In [3]:
test

{'Guarantee Number': '099-DCA-09-006A (Asociacion Arariwa)',
 'Transaction Report ID': '356191',
 'Guarantee Country Name': 'Worldwide',
 'Amount (USD)': '980144.4043',
 'Currency Name': 'PERU - NUEVO SOL',
 'Disbursement Date': '09/20/2011 12:00:00 AM',
 'End Date': '09/09/2013 12:00:00 AM',
 'Business Sector': '',
 'City/Town': 'Cusco',
 'State/Province/Region Name': 'Cusco',
 'State/Province/Region Code': 'PE08',
 'State/Province/Region Country Name': 'Peru',
 'Region Name': 'LATIN AMERICA & THE CARIBBEAN',
 'Is Woman Owned?': '0',
 'Is First Time Borrower?': '1',
 'Business Size': '>100',
 'Latitude': '-13.518333',
 'Longitude': '-71.978056'}
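Every value in the row above is a string, because `csv.DictReader` does no type conversion. Below is a minimal sketch of converting a few fields by hand. The column names are taken from the output above; `parse_row` and `DATE_FORMAT` are hypothetical names, and the sketch assumes the converted fields are never empty (an empty amount or date would raise `ValueError`).

```python
from datetime import datetime

# matches timestamps such as '09/20/2011 12:00:00 AM'
DATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'

def parse_row(row):
    """Return a copy of `row` with a few string fields converted to Python types."""
    parsed = dict(row)
    parsed['Amount (USD)'] = float(row['Amount (USD)'])
    parsed['Disbursement Date'] = datetime.strptime(row['Disbursement Date'], DATE_FORMAT)
    # the flag columns hold '0' or '1'
    parsed['Is Woman Owned?'] = row['Is Woman Owned?'] == '1'
    parsed['Is First Time Borrower?'] = row['Is First Time Borrower?'] == '1'
    return parsed
```

Because the conversion happens per row, it can be applied inside the generator loop without giving up lazy loading.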

In [5]:
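# let's see if this works
# note: multiplying a one-element list reads a single row and repeats it,
# so test2 holds five references to the same dict, not five different rows
test2 = [dict(next(data))]*5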

In [6]:
test2

[{'Guarantee Number': '099-DCA-09-006B (Pro Mujer Peru)',
  'Transaction Report ID': '331620',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '1960288.809',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/04/2014 12:00:00 AM',
  'Business Sector': '',
  'City/Town': 'Puno',
  'State/Province/Region Name': 'Puno',
  'State/Province/Region Code': 'PE21',
  'State/Province/Region Country Name': 'Peru',
  'Region Name': 'LATIN AMERICA & THE CARIBBEAN',
  'Is Woman Owned?': '0',
  'Is First Time Borrower?': '0',
  'Business Size': '>100',
  'Latitude': '-15',
  'Longitude': '-70'},
 {'Guarantee Number': '099-DCA-09-006B (Pro Mujer Peru)',
  'Transaction Report ID': '331620',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '1960288.809',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/04/2014 12:00:00 AM',
  'Business Sector': '',
  'City/Town': 'Puno',
 

In [7]:
len(test2)

5
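The length is 5, but `[x] * 5` repeats one object five times, which is why the output above shows the same row again and again. To pull several different rows from a generator, `itertools.islice` is one option. A small sketch with made-up in-memory data standing in for the real file:

```python
import csv
import io
from itertools import islice

# a tiny in-memory CSV stands in for the real file (made-up data)
sample = io.StringIO('id,city\n1,Cusco\n2,Puno\n3,Lima\n4,Arequipa\n')
rows = (dict(row) for row in csv.DictReader(sample))

repeated = [next(rows)] * 3        # one row read, three references to the same dict
distinct = list(islice(rows, 3))   # three further rows read, one dict per row
```

Here `repeated` contains the first row three times, while `distinct` contains the three rows that follow it.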

In [8]:
%lsmagic # this is a list of ipython features that you can use in Jupyter notebooks

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

Some useful links for `%magic`:

* https://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magic-functions
* https://stackoverflow.com/questions/49136737/how-profiling-class-method-using-ipython-lprun-magic-function

To install the external line profiler, use `conda install line_profiler` if you're using conda/miniconda, or follow the guide at https://github.com/rkern/line_profiler#installation.
`memory_profiler` is available on conda and can also be installed with `pip install memory_profiler`.

Make sure you're in the right virtual env.


In [6]:
%load_ext line_profiler
%load_ext memory_profiler

In [21]:
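import json, csv
from itertools import islice
from pprint import pprint

def open_csv(filename, stop=5, dialect='excel'):
    """Lazily yield at most `stop` rows from a CSV file."""
    with open(filename, 'r', newline='') as f:
        reader = csv.DictReader(f, dialect=dialect)
        # islice ends the generator after `stop` rows
        for row in islice(reader, stop):
            yield row


def get_csv_rows(filename, stop):
    data = open_csv(filename, stop)
    # NOTE: this reads ONE row and repeats the same dict `stop` times, so the
    # result is not `stop` different rows and the timings below barely depend on `stop`
    return [dict(next(data))] * stop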


In [24]:
%timeit get_csv_rows('credit_loan_dataset.csv', 10)

52.5 µs ± 3.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
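Outside IPython, a similar measurement can be approximated with the standard-library `timeit` module. This is a minimal sketch; `time_call` is a hypothetical helper, and its defaults only loosely mirror the 7 runs that `%timeit` reports.

```python
import timeit

def time_call(func, *args, repeat=7, number=1000):
    """Return (mean, best) seconds per call over `repeat` runs of `number` calls each."""
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return sum(per_call) / len(per_call), min(per_call)
```

For example, `mean, best = time_call(get_csv_rows, 'credit_loan_dataset.csv', 10)` would time the function defined above, with results that depend on the machine.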


In [15]:
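# -f names the function to profile line by line; without it %lprun has nothing to report
%lprun -f get_csv_rows get_csv_rows('credit_loan_dataset.csv', 10)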

In [17]:
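# %mprun also needs -f, and the profiled function must be defined in a file and imported, not typed into the notebook
%mprun -f get_csv_rows get_csv_rows('credit_loan_dataset.csv', 10)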
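For a rough memory comparison without `%mprun`, the standard-library `tracemalloc` module can report peak allocations. The sketch below compares reading rows lazily against materialising them all in a list. All helper names and the generated data are made up for illustration, and the absolute numbers will vary by machine and Python version.

```python
import csv
import io
import tracemalloc

def peak_memory(func, *args):
    """Return the peak number of bytes allocated while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def make_csv(n_rows):
    # made-up data standing in for the real file
    lines = ['id,value'] + ['%d,%d' % (i, i * 2) for i in range(n_rows)]
    return '\n'.join(lines)

def read_lazily(text):
    # keep only a running count; each row can be freed once it has been seen
    return sum(1 for _ in csv.DictReader(io.StringIO(text)))

def read_eagerly(text):
    # materialise every row at once as a list of dicts
    return len(list(csv.DictReader(io.StringIO(text))))
```

Comparing `peak_memory(read_lazily, text)` with `peak_memory(read_eagerly, text)` for the same `text = make_csv(5000)` should show the eager version peaking noticeably higher, since it keeps every row alive at once.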




In [23]:
%%time
get_csv_rows('credit_loan_dataset.csv', 10)

CPU times: user 395 µs, sys: 344 µs, total: 739 µs
Wall time: 482 µs


[{'Guarantee Number': '099-DCA-09-006A (Asociacion Arariwa)',
  'Transaction Report ID': '356191',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '980144.4043',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/09/2013 12:00:00 AM',
  'Business Sector': '',
  'City/Town': 'Cusco',
  'State/Province/Region Name': 'Cusco',
  'State/Province/Region Code': 'PE08',
  'State/Province/Region Country Name': 'Peru',
  'Region Name': 'LATIN AMERICA & THE CARIBBEAN',
  'Is Woman Owned?': '0',
  'Is First Time Borrower?': '1',
  'Business Size': '>100',
  'Latitude': '-13.518333',
  'Longitude': '-71.978056'},
 {'Guarantee Number': '099-DCA-09-006A (Asociacion Arariwa)',
  'Transaction Report ID': '356191',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '980144.4043',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/09/2013 12:00:00 AM',
  'Business Sector': '',


In [22]:
%%time
get_csv_rows('credit_loan_dataset.csv', 1000)

CPU times: user 366 µs, sys: 283 µs, total: 649 µs
Wall time: 448 µs


[{'Guarantee Number': '099-DCA-09-006A (Asociacion Arariwa)',
  'Transaction Report ID': '356191',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '980144.4043',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/09/2013 12:00:00 AM',
  'Business Sector': '',
  'City/Town': 'Cusco',
  'State/Province/Region Name': 'Cusco',
  'State/Province/Region Code': 'PE08',
  'State/Province/Region Country Name': 'Peru',
  'Region Name': 'LATIN AMERICA & THE CARIBBEAN',
  'Is Woman Owned?': '0',
  'Is First Time Borrower?': '1',
  'Business Size': '>100',
  'Latitude': '-13.518333',
  'Longitude': '-71.978056'},
 {'Guarantee Number': '099-DCA-09-006A (Asociacion Arariwa)',
  'Transaction Report ID': '356191',
  'Guarantee Country Name': 'Worldwide',
  'Amount (USD)': '980144.4043',
  'Currency Name': 'PERU - NUEVO SOL',
  'Disbursement Date': '09/20/2011 12:00:00 AM',
  'End Date': '09/09/2013 12:00:00 AM',
  'Business Sector': '',


In [4]:
!python -m line_profiler profile.lprof chunk_csv.py

/usr/local/Caskroom/miniconda/base/envs/notebooks/bin/python: No module named line_profiler


In [4]:
!python -m cProfile -s cumtime chunk_csv.py

         3809 function calls (3691 primitive calls) in 0.007 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      7/1    0.000    0.000    0.007    0.007 {built-in method builtins.exec}
        1    0.000    0.000    0.007    0.007 chunk_csv.py:1(<module>)
      8/3    0.000    0.000    0.007    0.002 <frozen importlib._bootstrap>:978(_find_and_load)
      8/3    0.000    0.000    0.007    0.002 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
      8/3    0.000    0.000    0.006    0.002 <frozen importlib._bootstrap>:663(_load_unlocked)
      6/3    0.000    0.000    0.006    0.002 <frozen importlib._bootstrap_external>:722(exec_module)
     11/3    0.000    0.000    0.005    0.002 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
        1    0.000    0.000    0.004    0.004 __init__.py:97(<module>)
        1    0.000    0.000    0.003    0.003 decoder.py:2(<module>)
    15/14    0.000    0.000    
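
A report like the one above, ordered by cumulative time, can also be produced from inside Python with `cProfile` and `pstats` instead of the command line. This is a minimal sketch; `profile_call` is a hypothetical helper and is not taken from `chunk_csv.py`.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, limit=10):
    """Run func(*args) under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # 'cumulative' is the same ordering as `-s cumtime` on the command line
    stats.sort_stats('cumulative').print_stats(limit)
    return stream.getvalue()
```

For instance, `print(profile_call(get_csv_rows, 'credit_loan_dataset.csv', 10))` would profile only that call, leaving out the import overhead that dominates the script-level report.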