## Read Data in Chunks

Earlier we saw how to load data into the target database in chunks. Let us now explore how to read data from large files in chunks for processing.

Here are the steps we need to follow:
* Read the data from the file in manageable chunks rather than all at once.
* If we invoke `pd.read_csv` with `chunksize`, it returns a **TextFileReader** instead of a DataFrame. Iterating over the TextFileReader yields one DataFrame per chunk.
* We can process each chunk (a DataFrame) based on the requirements.
* In our case, for each chunk we will drop the fields that are not required, rename the remaining fields as per the target, and then load the data into the target MongoDB collection.
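
As a quick illustration of the chunked reading described above, here is a minimal sketch that uses a small in-memory CSV instead of a real file. The sample data and the names `sample_csv`, `reader`, and `chunk_sizes` are made up for this example; only `pd.read_csv` and its `chunksize` argument come from the steps above.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file (7 data rows)
sample_csv = io.StringIO(
    "id,name\n"
    "1,a\n"
    "2,b\n"
    "3,c\n"
    "4,d\n"
    "5,e\n"
    "6,f\n"
    "7,g\n"
)

# With chunksize, read_csv returns an iterator instead of a single DataFrame
reader = pd.read_csv(sample_csv, chunksize=3)

# Each iteration yields a DataFrame with at most 3 rows
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)  # [3, 3, 1]
```

The last chunk holds only the leftover rows, so processing code should not assume that every chunk has exactly `chunksize` rows.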

In [None]:
import pandas as pd

In [None]:
pd.read_csv?

* Read the data in chunks

In [None]:
customers = pd.read_csv(
    '/data/ecomm/customers/part-00000', 
    iterator=True, 
    chunksize=5
)

In [None]:
type(customers)

* Get the list of fields to be dropped as well as the mapping between source and target column names.

In [None]:
column_mapping_str = '''{
    "customer_first_name": {"target_field_name": "FirstName", "is_required": true},
    "customer_last_name": {"target_field_name": "LastName", "is_required": true},
    "customer_email": {"target_field_name": "Email", "is_required": true},
    "product_name": {"is_required": false},
    "product_subscription": {"is_required": false}
}'''

import json
column_mapping = json.loads(column_mapping_str)

# Source fields that are not required in the target
columns_to_be_dropped = [
    col for col, spec in column_mapping.items()
    if not spec['is_required']
]

# (source field, spec) pairs for the required fields
required_columns_list = [
    (col, spec) for col, spec in column_mapping.items()
    if spec['is_required']
]

# Source field name -> target field name
required_columns_mapping = {
    col: spec['target_field_name']
    for col, spec in required_columns_list
}

* Create the MongoDB connection.

In [None]:
import pymongo, getpass, configparser

username = getpass.getuser()
config = configparser.ConfigParser()
config.read(f'/home/{username}/.jupyterenv')

client = pymongo.MongoClient(
    host='pylabsmd.itversity.com', 
    username=f'{username}_scratch_user', 
    password=config['DEFAULT']['MONGO_SCRATCH_PASS'], 
    authSource='admin'
)

* Clean up the data in the collection before loading so that we do not end up with duplicate data.

In [None]:
client[f'{username}_scratch_db']['customers'].delete_many({})

In [None]:
for doc in client[f'{username}_scratch_db']['customers'].find({}):
    print(doc)

* Process each chunk and store it in the MongoDB collection.

In [None]:
for idx, chunk in enumerate(customers):
    print(f'Processing chunk {idx}')

    # Drop the fields that are not required
    # and rename the rest as per the mapping
    customers_target = chunk. \
        drop(columns=columns_to_be_dropped). \
        rename(columns=required_columns_mapping)
    
    customers_list = customers_target.to_dict(orient='records')
    client[f'{username}_scratch_db']['customers'].insert_many(customers_list)
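
The per-chunk transformation can also be tried out without a MongoDB connection. The following is a minimal sketch, assuming a hypothetical helper named `transform_chunk` and a made-up one-row DataFrame; it applies the same drop, rename, and `to_dict` steps as the loop above.

```python
import pandas as pd


def transform_chunk(chunk, columns_to_drop, rename_mapping):
    """Drop unneeded columns, rename the rest, and return a list of dicts."""
    return (
        chunk
        .drop(columns=list(columns_to_drop))
        .rename(columns=rename_mapping)
        .to_dict(orient='records')
    )


# Made-up sample chunk, for illustration only
sample_chunk = pd.DataFrame({
    'customer_first_name': ['Ann'],
    'customer_email': ['ann@example.com'],
    'product_name': ['Widget'],
})

records = transform_chunk(
    sample_chunk,
    columns_to_drop=['product_name'],
    rename_mapping={
        'customer_first_name': 'FirstName',
        'customer_email': 'Email',
    },
)
print(records)
```

Each element of `records` is a plain dictionary keyed by the target field names, which is the shape that `insert_many` expects.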

* Validate whether all the data is copied successfully.

In [None]:
for doc in client[f'{username}_scratch_db']['customers'].find({}):
    print(doc)

In [None]:
client[f'{username}_scratch_db']['customers'].count_documents({})
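
One possible way to check the document count is to compare it with the number of rows in the source file, counted in chunks so that the file is never loaded fully into memory. This is only a sketch: `count_rows_in_chunks` is a hypothetical helper, and the in-memory CSV stands in for the real source file.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, chunksize):
    """Count data rows by summing the length of each chunk."""
    return sum(len(chunk) for chunk in pd.read_csv(source, chunksize=chunksize))


# Made-up in-memory CSV with 3 data rows, for illustration only
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
expected_count = count_rows_in_chunks(sample, chunksize=2)
print(expected_count)  # 3
```

If the load worked, the value returned by `count_documents({})` should match the count computed this way for the actual source file.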