## Read file data in chunks

In [1]:
#importing pandas library as pd
import pandas as pd

In [None]:
#information about read_csv
help(pd.read_csv) 

In [3]:
# path to the orders file
path = '/data/retail_db/orders/part-00000'

In [4]:
# column names for the file, which has no header row
schema = ["order_id","order_date","order_customer_id","order_status"] 

In [5]:
# read the file and assign the column names
orders = pd.read_csv(path, delimiter=',', header=None, names=schema)

In [6]:
#type of orders
type(orders) 

pandas.core.frame.DataFrame

In [7]:
# length of orders
len(orders) 

68883

### Using Chunksize

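* When `chunksize` is given a positive integer, `read_csv` does not return a DataFrame. It returns a reader that yields the file in consecutive chunks of that many rows.
* `chunksize` on its own only splits the file into chunks; the chunks are produced in order as the reader is iterated.
* To read one particular chunk directly, without going through the chunks before it, we need to use `nrows` and `skiprows`.
* That approach is demonstrated after the `chunksize` example.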
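As a minimal sketch of what the reader yields, the example below iterates over the chunks of a small in-memory CSV. The 25-row data set and the `csv_text` name are assumptions made up for illustration; they stand in for the orders file so the snippet runs anywhere.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the orders file: 25 rows, no header line.
csv_text = "\n".join(
    f"{i},2013-07-25 00:00:00.0,{1000 + i},CLOSED" for i in range(1, 26)
)
schema = ["order_id", "order_date", "order_customer_id", "order_status"]

reader = pd.read_csv(io.StringIO(csv_text), header=None, names=schema, chunksize=10)

# Each iteration yields an ordinary DataFrame holding at most `chunksize` rows.
chunk_lengths = [len(chunk) for chunk in reader]
print(chunk_lengths)  # [10, 10, 5]
```

Only the last chunk is shorter, because 25 is not a multiple of 10.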

In [8]:
# read the file in chunks of 10 rows
orders_chunk = pd.read_csv(path, delimiter=',', header=None, names=schema, chunksize=10)

In [9]:
# 68883 rows in chunks of 10 give 6889 chunks: 6888 full chunks and a last chunk of 3 rows
len(list(orders_chunk))

6889

### Usage of nrows along with chunksize
* `nrows` is the number of rows to read from the file.
* Here we set `nrows` to a multiple of `chunksize` to get the required number of chunks.
* If `nrows < chunksize`, we get only the first `nrows` rows of the first chunk.
* If `nrows = chunksize`, we get the first chunk.
* If `nrows = 2 * chunksize`, we get the first two chunks.

Note: `nrows` should be a multiple of `chunksize` so that every chunk returned is a full chunk.
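The cases listed above can be checked with a short sketch. The 30-row in-memory CSV, its two column names and the `chunk_sizes` helper are hypothetical and exist only for this illustration.

```python
import io
import pandas as pd

# Hypothetical 30-row, header-less CSV used only for this illustration.
csv_text = "\n".join(f"{i},{i * 10}" for i in range(1, 31))


def chunk_sizes(nrows, chunksize=10):
    """Return the number of rows in each chunk for a given nrows/chunksize pair."""
    reader = pd.read_csv(io.StringIO(csv_text), header=None,
                         names=["id", "value"], chunksize=chunksize, nrows=nrows)
    return [len(chunk) for chunk in reader]


print(chunk_sizes(5))   # nrows < chunksize
print(chunk_sizes(10))  # nrows = chunksize
print(chunk_sizes(20))  # nrows = 2 * chunksize
print(chunk_sizes(15))  # nrows is not a multiple of chunksize
```

In the last call the final chunk holds only the 5 leftover rows, which is why a multiple of `chunksize` is recommended.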

The code below returns the first two chunks:

In [155]:
chunk_1_and_2 = pd.read_csv(path, delimiter=',', header=None, names=schema, chunksize=10 ,nrows=20)

In [152]:
type(chunk_1_and_2)

pandas.io.parsers.TextFileReader

In [153]:
len(list(chunk_1_and_2))

2
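One point worth keeping in mind: the reader is a one-shot iterator, so calling `list()` on it consumes the chunks, and the `read_csv` cell has to be run again before the chunks can be listed a second time. The sketch below illustrates this on a hypothetical 30-row in-memory CSV with made-up column names, using the same `chunksize=10, nrows=20` settings as above.

```python
import io
import pandas as pd

# Hypothetical 30-row, header-less CSV used only for this illustration.
csv_text = "\n".join(f"{i},{i * 10}" for i in range(1, 31))

reader = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=["id", "value"], chunksize=10, nrows=20)

first_pass = list(reader)   # consumes both chunks
second_pass = list(reader)  # the reader is exhausted, so nothing is left

print(len(first_pass), len(second_pass))
```

Storing the result of the first `list()` call in a variable avoids re-reading the file.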

In [156]:
list(chunk_1_and_2)

[   order_id             order_date  order_customer_id     order_status
 0         1  2013-07-25 00:00:00.0              11599           CLOSED
 1         2  2013-07-25 00:00:00.0                256  PENDING_PAYMENT
 2         3  2013-07-25 00:00:00.0              12111         COMPLETE
 3         4  2013-07-25 00:00:00.0               8827           CLOSED
 4         5  2013-07-25 00:00:00.0              11318         COMPLETE
 5         6  2013-07-25 00:00:00.0               7130         COMPLETE
 6         7  2013-07-25 00:00:00.0               4530         COMPLETE
 7         8  2013-07-25 00:00:00.0               2911       PROCESSING
 8         9  2013-07-25 00:00:00.0               5657  PENDING_PAYMENT
 9        10  2013-07-25 00:00:00.0               5648  PENDING_PAYMENT,
     order_id             order_date  order_customer_id     order_status
 10        11  2013-07-25 00:00:00.0                918   PAYMENT_REVIEW
 11        12  2013-07-25 00:00:00.0               1837      

* To read one particular chunk directly, skipping all the chunks before it:
  * We have to use `nrows`, `skiprows` and `chunksize` together.
  * `nrows` must be equal to `chunksize`, because the output should be exactly one complete chunk.
  * `skiprows` should be `(n - 1) * chunksize`, where `n` is the number of the required chunk counted from 1, so that all the lines above that chunk are skipped.

* The example below reads the 2nd chunk.

In [136]:
chunk_2 = pd.read_csv(path, delimiter=',', header=None, names=schema, chunksize=10 ,skiprows=10,nrows=10)

In [133]:
type(chunk_2)

pandas.io.parsers.TextFileReader

In [134]:
len(list(chunk_2))

1

In [137]:
list(chunk_2)

[   order_id             order_date  order_customer_id     order_status
 0        11  2013-07-25 00:00:00.0                918   PAYMENT_REVIEW
 1        12  2013-07-25 00:00:00.0               1837           CLOSED
 2        13  2013-07-25 00:00:00.0               9149  PENDING_PAYMENT
 3        14  2013-07-25 00:00:00.0               9842       PROCESSING
 4        15  2013-07-25 00:00:00.0               2568         COMPLETE
 5        16  2013-07-25 00:00:00.0               7276  PENDING_PAYMENT
 6        17  2013-07-25 00:00:00.0               2667         COMPLETE
 7        18  2013-07-25 00:00:00.0               1205           CLOSED
 8        19  2013-07-25 00:00:00.0               9488  PENDING_PAYMENT
 9        20  2013-07-25 00:00:00.0               9198       PROCESSING]

* To get the nth chunk, pass these arguments to `read_csv`:

```python
skiprows=chunksize * (n-1),
nrows=chunksize
```
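As a worked instance of the formula, take `chunksize = 10` and `n = 3`: `skiprows` becomes `10 * (3 - 1) = 20`, so rows 21 to 30 are read. The sketch below checks this on a hypothetical 50-row in-memory CSV with made-up column names. It deliberately leaves out `chunksize`, on the assumption that a single chunk is all that is wanted, so `read_csv` returns a DataFrame directly instead of a reader.

```python
import io
import pandas as pd

# Hypothetical 50-row, header-less CSV used only for this illustration.
csv_text = "\n".join(f"{i},{i * 10}" for i in range(1, 51))

chunksize = 10
n = 3  # chunk number, counted from 1

third_chunk = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "value"],
    skiprows=chunksize * (n - 1),  # skip the 20 rows of chunks 1 and 2
    nrows=chunksize,               # then read exactly one chunk
)

print(third_chunk["id"].tolist())
```

The row labels of the result start again at 0, because the skipped rows are never parsed.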

Let's define a function to access the chunk we need:

In [158]:
def get_nth_chunk(path, size, n):
    """Return the nth chunk (counted from 1) of the file at `path`, read in chunks of `size` rows.

    The column names come from the `schema` list defined earlier in the notebook.
    """
    skiprows = size * (n - 1)  # skip every row that belongs to chunks 1 .. n-1
    chunk_n = pd.read_csv(path, delimiter=',', header=None, names=schema,
                          chunksize=size, nrows=size, skiprows=skiprows)
    return list(chunk_n)  # list containing the chunk's DataFrame

In [159]:
# split the file into chunks of 10 rows each and get the 2nd chunk:
get_nth_chunk('/data/retail_db/orders/part-00000', 10, 2)

[   order_id             order_date  order_customer_id     order_status
 0        11  2013-07-25 00:00:00.0                918   PAYMENT_REVIEW
 1        12  2013-07-25 00:00:00.0               1837           CLOSED
 2        13  2013-07-25 00:00:00.0               9149  PENDING_PAYMENT
 3        14  2013-07-25 00:00:00.0               9842       PROCESSING
 4        15  2013-07-25 00:00:00.0               2568         COMPLETE
 5        16  2013-07-25 00:00:00.0               7276  PENDING_PAYMENT
 6        17  2013-07-25 00:00:00.0               2667         COMPLETE
 7        18  2013-07-25 00:00:00.0               1205           CLOSED
 8        19  2013-07-25 00:00:00.0               9488  PENDING_PAYMENT
 9        20  2013-07-25 00:00:00.0               9198       PROCESSING]
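For comparison, here is a sketch of an alternative that fetches the same chunk by iterating the reader and discarding the earlier chunks. It is not the approach used in this section: the `nth_chunk_by_iteration` helper and the 25-row in-memory CSV are hypothetical, and `itertools.islice` is swapped in for `skiprows`.

```python
import io
from itertools import islice

import pandas as pd

# Hypothetical 25-row, header-less stand-in for the orders file.
csv_text = "\n".join(
    f"{i},2013-07-25 00:00:00.0,{1000 + i},CLOSED" for i in range(1, 26)
)
schema = ["order_id", "order_date", "order_customer_id", "order_status"]


def nth_chunk_by_iteration(source, names, size, n):
    """Return the nth chunk (counted from 1) as a DataFrame, or None if it does not exist."""
    reader = pd.read_csv(source, header=None, names=names, chunksize=size)
    # islice discards the first n - 1 chunks and yields the nth one, if any
    return next(islice(reader, n - 1, n), None)


chunk = nth_chunk_by_iteration(io.StringIO(csv_text), schema, 10, 2)
missing = nth_chunk_by_iteration(io.StringIO(csv_text), schema, 10, 4)

print(chunk["order_id"].tolist())
print(chunk.index.tolist())
print(missing)
```

This version still parses the earlier chunks, so `skiprows` is likely the cheaper choice for chunks deep in a large file. In exchange, the chunk keeps its original row labels (10 to 19 here) and a chunk number past the end of the file yields `None`.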