# KOL Data Standardization - Step 1 (Data Extraction)

In [1]:
"""
KOL Data Standardization - Step 1: extract the data from the source batches and store it in the landing table.

This module is the first step of the KOL Data Standardization process. It loads the data, which arrives
as a set of batch files, and checks each record for null values against the following constraints:
- The mdm_id must not be null.
- The first name of the KOL must not be null.
Any record that fails either constraint is stored separately in the warehouse area for future reference.
All remaining records are sent to the landing area.
"""

import os
import datetime
import pandas as pd

In [2]:
RAW_DATA_STORE_PATH = "data_store"
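# The warehouse file must differ from the landing file, otherwise the second write overwrites the first.
WAREHOUSE_AREA_FILE_PATH = "output_store/warehouse_area.xlsx"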
LANDING_AREA_FILE_PATH = "output_store/landing_area.xlsx"

In [3]:
# Data extraction: load every batch and route each row to either the warehouse area or the landing area.

warehouse_area = []
landing_area = []
print("Extracting the data from various batches and storing it in Landing area.\n")
for batch in sorted(os.listdir(RAW_DATA_STORE_PATH)):  # sorted for a deterministic batch order
    file_path = os.path.join(RAW_DATA_STORE_PATH, batch)
    print(f"Processing batch: {file_path}")
    dataframe = pd.read_excel(file_path)

    # True for every row that is missing mdm_id or first_name
    condition = dataframe[["mdm_id", "first_name"]].isna().any(axis=1)

    # Rows with a null in an essential column go to the warehouse area
    warehouse_area.append(dataframe[condition])

    # Rows that pass the null check go to the landing area;
    # batch_id records the time at which this batch was loaded
    correct_data_df = dataframe[~condition].reset_index(drop=True)
    correct_data_df["batch_id"] = datetime.datetime.now()
    landing_area.append(correct_data_df)

warehouse_df = pd.concat(warehouse_area).reset_index(drop=True)
landing_df = pd.concat(landing_area).reset_index(drop=True)
print("\nSuccessfully completed data extraction process")

# Write both DataFrames to disk
warehouse_df.to_excel(WAREHOUSE_AREA_FILE_PATH, index=False)
landing_df.to_excel(LANDING_AREA_FILE_PATH, index=False)
print("\nSuccessfully updated the Landing area")

Extracting the data from various batches and storing it in Landing area.

Processing batch: data_store\batch_1.xlsx
Processing batch: data_store\batch_2.xlsx
Processing batch: data_store\batch_3.xlsx

Successfully completed data extraction process

Successfully updated the Landing area
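
The null-check split used in the extraction cell can be tried in isolation. The snippet below is a minimal sketch, not part of the pipeline: the `batch` DataFrame and its values are made up for illustration and stand in for one Excel batch file, and the names `rejected` and `accepted` are hypothetical. It shows that a row is rejected when either `mdm_id` or `first_name` is missing, while nulls in other columns do not affect routing.

```python
import pandas as pd

# Hypothetical in-memory batch standing in for one Excel file (illustrative values only).
batch = pd.DataFrame(
    {
        "mdm_id": [101, None, 103, 104],
        "first_name": ["Ana", "Ben", None, "Dev"],
        "last_name": ["Lee", None, "Roy", None],
    }
)

# True for rows where at least one essential column is null.
condition = batch[["mdm_id", "first_name"]].isna().any(axis=1)

# Rows failing the check (warehouse area) and rows passing it (landing area).
rejected = batch[condition].reset_index(drop=True)
accepted = batch[~condition].reset_index(drop=True)

print(f"rejected: {len(rejected)} rows, accepted: {len(accepted)} rows")
```

Because `condition` and `~condition` are complementary masks, every input row ends up in exactly one of the two frames.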


In [4]:
# Warehouse DataFrame
warehouse_df.head()

Unnamed: 0,mdm_id,first_name,last_name,age,city,state,profile_status,speciality,degree
0,155,,,33.0,Richardton,Colorado,1.0,"Pharmacoepidemilogy,Sports Medicine(Physical M...",MBBS
1,166,,Williams,,,,1.0,"Neurodegenerative diseases,Occupational Medici...","BHMS,BHMS"
2,195,,Clark,,,,1.0,,BUMS
3,158,,,32.0,West Andrewburgh,Indiana,1.0,Inflmmation at the Skin Barrier,"BDS,MD,MD"
4,189,,Crosby,18.0,,Pennsylvania,1.0,"Inflmmation at the Skin Barrier,Allergic Rhini...","PHD,BHMS"


In [5]:
# Landing Area DataFrame
landing_df.head()

Unnamed: 0,mdm_id,first_name,last_name,age,city,state,profile_status,speciality,degree,batch_id
0,182,James,Allen,58.0,Loriview,,2.0,"Vascular Medicine,Brain imaging",,2024-09-07 14:43:31.928517
1,109,Justin,Davenport,,Laurenport,Washington,,Dermatopharmacology,"PHD,BHMS,MD",2024-09-07 14:43:31.928517
2,199,Ashley,,,East Johnview,,2.0,"Retinal Ophthalmology,Molluscum,Otology/Neurot...","BUMS,MBBS",2024-09-07 14:43:31.928517
3,180,Jose,Chambers,59.0,,,,,,2024-09-07 14:43:31.928517
4,124,Justin,Sexton,,Bethstad,Colorado,,"Immunological Disorders,Interventional Pain Me...",,2024-09-07 14:43:31.928517
