### Handling large datasets with chunks


* Pandas can load data into a data frame in chunks. The reader then behaves as an iterator, which makes it possible to process data sets that are larger than the available memory.

In [1]:
import pandas

In [2]:
# link to gapminder data as csv file from software carpentry website
csv_url='http://bit.ly/2cLzoxH'

In [3]:
# load the big file in smaller chunks
df_iter = pandas.read_csv(csv_url, chunksize = 5)

In [4]:
df_iter.get_chunk()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


* Setting a chunksize when reading a data source makes pandas return an iterator instead of a single data frame. The get_chunk method returns the next chunk, as in the example above, where the data frame is read 5 rows at a time. The remaining chunks can also be iterated through with a for loop:

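```
# do_something is a placeholder for your own row-wise function
for i, chunk in enumerate(df_iter, start=1):
    # apply the function to every row of the current chunk
    new_chunk = chunk.apply(do_something, axis=1)
    # write each processed chunk to its own numbered file
    new_chunk.to_csv("chunk_output_%i.csv" % i)
```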
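One detail worth seeing in action is that get_chunk and a for loop draw from the same reader, so a loop started after a get_chunk call continues with the next unread rows. The sketch below is a minimal illustration under assumptions: the 12-row CSV is made up and held in memory with io.StringIO so that it runs without a network connection, and the names csv_text, reader and sizes are illustrative only.

```python
import io

import pandas

# Hypothetical 12-row CSV built in memory (not the gapminder data)
rows = ["C%d,%d,%d" % (i, 1950 + i, 1000 * i) for i in range(12)]
csv_text = "country,year,pop\n" + "\n".join(rows)

reader = pandas.read_csv(io.StringIO(csv_text), chunksize=5)

# get_chunk() consumes the first 5 rows of the reader
first = reader.get_chunk()
sizes = [len(first)]

# the for loop picks up where get_chunk() stopped
for chunk in reader:
    sizes.append(len(chunk))

print(sizes)
```

Because the reader keeps its position, calling get_chunk inside a for loop over the same reader would skip chunks, so it is usually safer to use one style or the other.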

In [5]:
from collections import defaultdict
# default value of int is 0 with defaultdict
continent_dict = defaultdict(int) 

In [6]:
# load the big CSV file with chunksize=500 and count the number of continent entries in each smaller chunk
# using the defaultdict.
df_iter = pandas.read_csv(csv_url, chunksize = 500)

for gm_chunk in df_iter:
    print(gm_chunk.shape)
    for c in gm_chunk['continent']:
        continent_dict[c] += 1
        

(500, 6)
(500, 6)
(500, 6)
(204, 6)


In [7]:
continent_dict

defaultdict(int,
            {'Asia': 396,
             'Europe': 360,
             'Africa': 624,
             'Americas': 300,
             'Oceania': 24})
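The same count can probably be written without the inner Python loop by letting pandas count each chunk and merging the partial results. This is only a sketch of one possible alternative, not the approach used above: it swaps the defaultdict for collections.Counter, and it runs on a small made-up CSV (csv_text is an assumption, not the gapminder file) so that it is self-contained.

```python
import io
from collections import Counter

import pandas

# Hypothetical sample data held in memory (not the gapminder file)
csv_text = """country,continent
Afghanistan,Asia
Albania,Europe
Algeria,Africa
Angola,Africa
Argentina,Americas
Australia,Oceania
Austria,Europe
"""

counts = Counter()
for chunk in pandas.read_csv(io.StringIO(csv_text), chunksize=3):
    # value_counts() tallies one chunk; Counter.update() adds the tallies up
    counts.update(chunk["continent"].value_counts().to_dict())

print(dict(counts))
```

Only one chunk and the running totals are in memory at any time, which is the same property the defaultdict version relies on.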