## Load Data in Chunks

Let us walk through the overall process of loading CSV data into MongoDB in chunks, with attribute-level mapping.
* Read the data from the file into a Pandas DataFrame.
* Drop the fields that are not required and rename the remaining fields to match the target structure.
* Load the data into MongoDB using a bulk insert (`insert_many`), one chunk at a time.
* Loading in chunks is usually preferable to one large insert. For example, if we have 10,000 records to load, it is good practice to load them in smaller batches. The right chunk size depends on several factors, such as record size and available memory.
* For this demo, we will load 6 records at a time rather than all 20 records in one shot.
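
As a quick check on the numbers used in this demo, the sketch below works out how many batches a given chunk size produces and how large each one is. The helper name `chunk_sizes` is hypothetical and not part of the original notebook.

```python
import math


def chunk_sizes(total_records, chunk_size):
    """Return the size of each batch when loading total_records in chunks."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    num_chunks = math.ceil(total_records / chunk_size)
    # Every batch is full except possibly the last one
    return [
        min(chunk_size, total_records - i * chunk_size)
        for i in range(num_chunks)
    ]


print(chunk_sizes(20, 6))
# [6, 6, 6, 2]
print(len(chunk_sizes(10_000, 1_000)))
# 10
```

With 20 records and a chunk size of 6, we should expect three full batches followed by a final batch of 2 records.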

In [1]:
import pandas as pd
customers = pd.read_csv('/data/ecomm/customers/part-00000')

In [2]:
column_mapping_str = '''{
    "customer_first_name": {"target_field_name": "FirstName", "is_required": true},
    "customer_last_name": {"target_field_name": "LastName", "is_required": true},
    "customer_email": {"target_field_name": "Email", "is_required": true},
    "product_name": {"is_required": false},
    "product_subscription": {"is_required": false}
}'''

import json
column_mapping = json.loads(column_mapping_str)

# Source columns that are not required in the target and should be dropped
columns_to_be_dropped = [
    col for col, spec in column_mapping.items() if not spec['is_required']
]
# Mapping of required source column names to their target field names
required_columns_mapping = {
    col: spec['target_field_name']
    for col, spec in column_mapping.items() if spec['is_required']
}

# Drop the fields that are not required and rename the rest as per the mapping
customers_target = customers.drop(columns=columns_to_be_dropped).rename(columns=required_columns_mapping)
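
The notebook does not display the intermediate values derived from the mapping, so here is a small standalone sketch that prints them. It reuses the same JSON mapping; the function name `split_mapping` is hypothetical and only used for illustration.

```python
import json

column_mapping = json.loads('''{
    "customer_first_name": {"target_field_name": "FirstName", "is_required": true},
    "customer_last_name": {"target_field_name": "LastName", "is_required": true},
    "customer_email": {"target_field_name": "Email", "is_required": true},
    "product_name": {"is_required": false},
    "product_subscription": {"is_required": false}
}''')


def split_mapping(mapping):
    """Return (columns to drop, {source column: target field}) for a mapping."""
    to_drop = [col for col, spec in mapping.items() if not spec['is_required']]
    to_rename = {
        col: spec['target_field_name']
        for col, spec in mapping.items()
        if spec['is_required']
    }
    return to_drop, to_rename


to_drop, to_rename = split_mapping(column_mapping)
print(to_drop)
# ['product_name', 'product_subscription']
print(to_rename)
# {'customer_first_name': 'FirstName', 'customer_last_name': 'LastName', 'customer_email': 'Email'}
```

The first value is what gets passed to `drop(columns=...)` and the second to `rename(columns=...)`.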

In [3]:
import pymongo, getpass, configparser

username = getpass.getuser()
config = configparser.ConfigParser()
config.read(f'/home/{username}/.jupyterenv')

client = pymongo.MongoClient(
    host='pylabsmd.itversity.com', 
    username=f'{username}_scratch_user', 
    password=config['DEFAULT']['MONGO_SCRATCH_PASS'], 
    authSource='admin'
)

In [4]:
client[f'{username}_scratch_db']['customers'].delete_many({})

<pymongo.results.DeleteResult at 0x7f90376bf820>

In [5]:
for doc in client[f'{username}_scratch_db']['customers'].find({}):
    print(doc)

In [6]:
customers_target.to_dict?

Signature: customers_target.to_dict(orient='dict', into=<class 'dict'>)
Docstring:
Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters
(see below).

Parameters
----------
orient : str {'dict', 'list', 'series', 'split', 'records', 'index'}
    Determines the type of the values of the dictionary.

    - 'dict' (default) : dict like {column -> {index -> value}}
    - 'list' : dict like {column -> [values]}
    - 'series' : dict like {column -> Series(values)}
    - 'split' : dict like
      {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    - 'records' : list like
      [{column -> value}, ... , {column -> value}]
    - 'index' : dict like {index -> {column -> value}}

    Abbreviations are allowed. `s` indicates `s

In [8]:
customers_list = customers_target.to_dict(orient='records')

In [10]:
len(customers_list)

20

In [18]:
customers_list_range = list(range(0, len(customers_list), 6))

In [19]:
customers_list_range[:-1]

[0, 6, 12]

In [20]:
customers_list_range[1:]

[6, 12, 18]

In [22]:
list(zip(customers_list_range[:-1], customers_list_range[1:]))

[(0, 6), (6, 12), (12, 18)]

In [23]:
chunks = list(zip(customers_list_range[:-1], customers_list_range[1:]))

In [24]:
for lb, ub in chunks:
    print(f'Processing from {lb} to {ub}')
    print(customers_list[lb:ub])

print(f'Processing last set from {ub} to {len(customers_list)}')
print(customers_list[ub:])

Processing from 0 to 6
[{'FirstName': 'Cassaundra', 'LastName': 'Collinson', 'Email': 'ccollinson0@alibaba.com'}, {'FirstName': 'Rozamond', 'LastName': 'Oene', 'Email': 'roene1@technorati.com'}, {'FirstName': 'Gus', 'LastName': 'Hawick', 'Email': 'ghawick2@dagondesign.com'}, {'FirstName': 'Delano', 'LastName': 'Ashbey', 'Email': 'dashbey3@purevolume.com'}, {'FirstName': 'Fara', 'LastName': 'Simondson', 'Email': 'fsimondson4@umn.edu'}, {'FirstName': 'Myrilla', 'LastName': 'Gates', 'Email': 'mgates5@sina.com.cn'}]
Processing from 6 to 12
[{'FirstName': 'Arabela', 'LastName': 'Tweedlie', 'Email': 'atweedlie6@comcast.net'}, {'FirstName': 'Loise', 'LastName': 'Schindler', 'Email': 'lschindler7@discovery.com'}, {'FirstName': 'Storm', 'LastName': 'McBrearty', 'Email': 'smcbrearty8@ovh.net'}, {'FirstName': 'Westley', 'LastName': 'Matityahu', 'Email': 'wmatityahu9@altervista.org'}, {'FirstName': 'Gerta', 'LastName': 'Shaughnessy', 'Email': 'gshaughnessya@smugmug.com'}, {'FirstName': 'Coretta', 
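
The loop above processes the final partial chunk in a separate step and relies on `ub` having been set by at least one iteration, so it would fail with a `NameError` if the list held six or fewer records. A minimal alternative sketch, assuming a hypothetical helper named `chunk_bounds`, computes bounds that already include the last partial chunk:

```python
def chunk_bounds(total, chunk_size):
    """Yield (lower, upper) slice bounds that cover range(total) in chunks."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    for lb in range(0, total, chunk_size):
        # Cap the upper bound so the last chunk stops at the end of the data
        yield lb, min(lb + chunk_size, total)


print(list(chunk_bounds(20, 6)))
# [(0, 6), (6, 12), (12, 18), (18, 20)]
print(list(chunk_bounds(4, 6)))
# [(0, 4)]
print(list(chunk_bounds(0, 6)))
# []
```

Each pair can be used directly as `customers_list[lb:ub]`, with no special handling for the last set.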

In [25]:
for lb, ub in chunks:
    print(f'Processing from {lb} to {ub}')
    client[f'{username}_scratch_db']['customers'].insert_many(customers_list[lb:ub])

print(f'Processing last set from {ub} to {len(customers_list)}')
client[f'{username}_scratch_db']['customers'].insert_many(customers_list[ub:])

Processing from 0 to 6
Processing from 6 to 12
Processing from 12 to 18
Processing last set from 18 to 20


<pymongo.results.InsertManyResult at 0x7f9035d83780>
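
The chunked insert can also be wrapped in a reusable function. The following is only a sketch: `load_in_chunks` and `InMemoryCollection` are hypothetical names, and the in-memory class is a stand-in that mimics nothing more than `insert_many`, so that the example runs without a MongoDB server.

```python
def load_in_chunks(collection, docs, chunk_size):
    """Insert docs into collection in batches; return the number of docs sent."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    loaded = 0
    for lb in range(0, len(docs), chunk_size):
        # Slicing past the end is safe, so the last chunk may be shorter
        chunk = docs[lb:lb + chunk_size]
        print(f'Processing from {lb} to {lb + len(chunk)}')
        collection.insert_many(chunk)
        loaded += len(chunk)
    return loaded


class InMemoryCollection:
    """Hypothetical stand-in exposing only insert_many, for a dry run."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)


demo_docs = [{'FirstName': f'Name{i}'} for i in range(20)]
demo_collection = InMemoryCollection()
total_loaded = load_in_chunks(demo_collection, demo_docs, 6)
print(total_loaded)
# 20
```

Against the real database, the same function could be called with `client[f'{username}_scratch_db']['customers']` and `customers_list`. Because the loop never runs for an empty list, `insert_many` is never called with an empty batch.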

In [26]:
for doc in client[f'{username}_scratch_db']['customers'].find({}):
    print(doc)

{'_id': ObjectId('60b65e5e8852136200229daa'), 'FirstName': 'Cassaundra', 'LastName': 'Collinson', 'Email': 'ccollinson0@alibaba.com'}
{'_id': ObjectId('60b65e5e8852136200229dab'), 'FirstName': 'Rozamond', 'LastName': 'Oene', 'Email': 'roene1@technorati.com'}
{'_id': ObjectId('60b65e5e8852136200229dac'), 'FirstName': 'Gus', 'LastName': 'Hawick', 'Email': 'ghawick2@dagondesign.com'}
{'_id': ObjectId('60b65e5e8852136200229dad'), 'FirstName': 'Delano', 'LastName': 'Ashbey', 'Email': 'dashbey3@purevolume.com'}
{'_id': ObjectId('60b65e5e8852136200229dae'), 'FirstName': 'Fara', 'LastName': 'Simondson', 'Email': 'fsimondson4@umn.edu'}
{'_id': ObjectId('60b65e5e8852136200229daf'), 'FirstName': 'Myrilla', 'LastName': 'Gates', 'Email': 'mgates5@sina.com.cn'}
{'_id': ObjectId('60b65e5e8852136200229db0'), 'FirstName': 'Arabela', 'LastName': 'Tweedlie', 'Email': 'atweedlie6@comcast.net'}
{'_id': ObjectId('60b65e5e8852136200229db1'), 'FirstName': 'Loise', 'LastName': 'Schindler', 'Email': 'lschindler

In [27]:
client[f'{username}_scratch_db']['customers'].count_documents({})

20