# Pandas

In [1]:
import pandas as pd

## Read Data from Files

### Read from CSV in Chunks

If a file is too big to load at once, we can read it in chunks. The `read_csv` function has a `chunksize` parameter; when it is set, `read_csv` returns an **iterator** object instead of a data frame. We can then step through the CSV file chunk by chunk using the built-in function `next()` or the iterator's `__next__()` method. Each chunk is a Pandas data frame with at most `chunksize` rows.

In [19]:
it_chunk = pd.read_csv('../data/test/test.csv', chunksize = 5)
print("Iterator: ",it_chunk)

first_chunk = next(it_chunk)
print("Type of object the iterator returns: ",type(first_chunk))
print(first_chunk)
# Equivalent but less idiomatic: call the `__next__()` method directly to get the next chunk
print(it_chunk.__next__())

Iterator:  <pandas.io.parsers.TextFileReader object at 0x0000022D603EA3C8>
Type of object the iterator returns:  <class 'pandas.core.frame.DataFrame'>
    x   y   z
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15
    x   y   z
5  16  17  18
6  19  20  21
7  22  23  24
8  25  26  27
9  28  29  30
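
The reader returned by `read_csv` holds the file open until it is exhausted or closed. As a minimal sketch, assuming pandas 1.2 or newer (where the reader can be used as a context manager), the `with` statement closes it for us. The in-memory `csv_text` below is a hypothetical stand-in for `test.csv`, used only to keep the example self-contained:

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/test/test.csv: 10 rows, columns x, y, z
csv_text = "x,y,z\n" + "\n".join(
    f"{3 * i + 1},{3 * i + 2},{3 * i + 3}" for i in range(10)
)

# Assumes pandas >= 1.2, where the chunk reader supports the `with` statement
with pd.read_csv(io.StringIO(csv_text), chunksize=5) as reader:
    # Record the (rows, columns) shape of every chunk
    shapes = [chunk.shape for chunk in reader]

print(shapes)
```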


Note that the contents of the iterator can also be accessed with the unpacking operator `*`. Be careful: unpacking reads every chunk at once, so it can exhaust memory if the input file is very big!

In [20]:
it_chunk = pd.read_csv('../data/test/test.csv', chunksize = 5)
print(it_chunk)
print(*it_chunk)

<pandas.io.parsers.TextFileReader object at 0x0000022D60A07EC8>
    x   y   z
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15     x   y   z
5  16  17  18
6  19  20  21
7  22  23  24
8  25  26  27
9  28  29  30
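
When the goal is a single data frame rather than printed output, one common pattern is to filter each chunk first and then combine the pieces with `pd.concat`, so that only the rows we keep accumulate in memory. The sketch below is illustrative only: the in-memory `csv_text` is a hypothetical stand-in for `test.csv`, and the condition `x > 10` is an arbitrary example filter.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/test/test.csv: 10 rows, columns x, y, z
csv_text = "x,y,z\n" + "\n".join(
    f"{3 * i + 1},{3 * i + 2},{3 * i + 3}" for i in range(10)
)

# Keep only the matching rows of each chunk (arbitrary example condition)
pieces = [
    chunk[chunk["x"] > 10]
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5)
]

# Combine the filtered pieces into one data frame
filtered = pd.concat(pieces)
print(filtered)
```

The row labels in `filtered` continue across chunks, because each chunk keeps its position in the original file as its index.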


Reading data in chunks is memory-efficient when we want to perform an aggregation over a very big input file, **because we loop through the iterator** and only one chunk is held in memory at a time. We can save the result for each chunk and then combine the results. For example, let's sum all values in column `x`:

In [25]:
result = []

for chunk in pd.read_csv('../data/test/test.csv', chunksize = 5):
    result.append(chunk['x'].sum())

print("Sum for each chunk: ", result)
print("Total sum: ", sum(result))

Sum for each chunk:  [35, 110]
Total sum:  145
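
Not every aggregation can be combined by simply adding up per-chunk results. A mean, for instance, needs a running sum and a running count, because averaging the chunk means is only correct when all chunks have the same size. The sketch below illustrates this under the assumption that the data match the file shown above; `csv_text` is a hypothetical in-memory stand-in for `test.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/test/test.csv: 10 rows, columns x, y, z
csv_text = "x,y,z\n" + "\n".join(
    f"{3 * i + 1},{3 * i + 2},{3 * i + 3}" for i in range(10)
)

total = 0  # running sum of column x
count = 0  # running number of non-missing values in column x

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    total += chunk["x"].sum()
    count += chunk["x"].count()

mean_x = total / count
print("Mean of x:", mean_x)
```

Using `count()` rather than `len(chunk)` keeps the denominator consistent with `sum()`, since both skip missing values.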
