# Topic 02 - Problem 10: Build a Complete Missing-Value Cleaning Pipeline

---

## 1. About the Problem

This problem asks me to build a complete pipeline to handle missing values in a dataset.  
Instead of solving one small task, I will combine detection, column removal, and value imputation into one process.  
This mimics how data cleaning is done in real data science and machine learning workflows.  
To solve this, I will remove every column whose percentage of missing values meets or exceeds a threshold, then fill the remaining missing values with per-column default values.
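
The detection step can be sketched on its own before building the full pipeline. This is a minimal illustration, assuming each record is a dictionary and a missing value is represented by `None`. The helper name `missing_percentages` and the sample rows are hypothetical and exist only for this example.

```python
def missing_percentages(records):
    """Return the percentage of None values per column (hypothetical helper)."""
    total = len(records)
    counts = {}
    for record in records:
        for key, value in record.items():
            # Register every column, but only count None as missing.
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return {key: count / total * 100 for key, count in counts.items()}


# Illustrative rows, not the dataset used in the solution.
sample = [
    {"x": 1, "y": None},
    {"x": None, "y": None},
    {"x": 3, "y": None},
    {"x": 4, "y": 2},
]

percentages = missing_percentages(sample)
print(percentages)  # {'x': 25.0, 'y': 75.0}
```

With a threshold of 50, only `y` would be removed, because 75% is at or above the threshold while 25% is not.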

---


## 2. Solution Code

In [5]:
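def clean_missing_data_pipeline(data, threshold_percent, default_values):
    """Drop columns that are too sparse, then fill the remaining None values.

    A column is dropped when its percentage of None values is at least
    threshold_percent. A kept column with no entry in default_values
    stays None, because dict.get returns None for a missing key.
    """
    # Step 1: count the None values in each column.
    missing_counts = {}
    total = len(data)
    for record in data:
        for key, value in record.items():
            if value is None:
                missing_counts[key] = missing_counts.get(key, 0) + 1

    # Step 2: mark columns whose missing percentage reaches the threshold.
    drop_columns = set()
    for column, count in missing_counts.items():
        missing_percent = (count / total) * 100
        if missing_percent >= threshold_percent:
            drop_columns.add(column)

    # Step 3: rebuild each record without the dropped columns,
    # replacing None with the column's default value.
    cleaned_records = []
    for record in data:
        cleaned_record = {}
        for key, value in record.items():
            if key not in drop_columns:
                if value is None:
                    cleaned_record[key] = default_values.get(key)
                else:
                    cleaned_record[key] = value
        cleaned_records.append(cleaned_record)

    return cleaned_records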
# Sample records: "salary" is always missing, "age" and "city" are each
# missing in 2 of 4 records (50%), and "Name" is missing in 1 of 4 (25%).
data = [
    {"age": 25, "salary": None, "city": "Dhaka", "Name": "Anna"},
    {"age": None, "salary": None, "city": None, "Name": "Charlie"},
    {"age": 30, "salary": None, "city": "Chittagong", "Name": "Jacob"},
    {"age": None, "salary": None, "city": None, "Name": None},
]

defaults = {"age": 0, "city": "Unknown", "Name": "Unknown"}

print(
    "Fully cleaned dataset:",
    clean_missing_data_pipeline(data, threshold_percent=50, default_values=defaults),
)


Fully cleaned dataset: [{'Name': 'Anna'}, {'Name': 'Charlie'}, {'Name': 'Jacob'}, {'Name': 'Unknown'}]
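
Because `age`, `salary`, and `city` all sit at or above the 50% threshold, the run above only exercises imputation on `Name`. The sketch below raises the threshold to 75 so that `age` and `city` survive and receive their defaults. To stay runnable on its own, it uses a condensed restatement of the pipeline under the hypothetical name `clean_compact`; it is intended to behave like the function above, not to replace it.

```python
def clean_compact(data, threshold_percent, default_values):
    """Condensed restatement of the pipeline for illustration (hypothetical name)."""
    total = len(data)
    # Collect every column name that appears in any record.
    columns = {key for record in data for key in record}
    # Drop a column when its percentage of None values reaches the threshold.
    drop = {
        col
        for col in columns
        if sum(r[col] is None for r in data if col in r) / total * 100
        >= threshold_percent
    }
    # Rebuild each record, skipping dropped columns and filling None values.
    return [
        {
            key: default_values.get(key) if value is None else value
            for key, value in record.items()
            if key not in drop
        }
        for record in data
    ]


# Same records and defaults as in the solution code.
data = [
    {"age": 25, "salary": None, "city": "Dhaka", "Name": "Anna"},
    {"age": None, "salary": None, "city": None, "Name": "Charlie"},
    {"age": 30, "salary": None, "city": "Chittagong", "Name": "Jacob"},
    {"age": None, "salary": None, "city": None, "Name": None},
]
defaults = {"age": 0, "city": "Unknown", "Name": "Unknown"}

result = clean_compact(data, threshold_percent=75, default_values=defaults)
for row in result:
    print(row)
# {'age': 25, 'city': 'Dhaka', 'Name': 'Anna'}
# {'age': 0, 'city': 'Unknown', 'Name': 'Charlie'}
# {'age': 30, 'city': 'Chittagong', 'Name': 'Jacob'}
# {'age': 0, 'city': 'Unknown', 'Name': 'Unknown'}
```

Only `salary` (100% missing) is dropped at this threshold. A kept column without an entry in `defaults` would stay `None`, since `dict.get` returns `None` for a missing key.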


---

## 3. Summary / Takeaways

By solving this problem, I learned how to design a complete data-cleaning workflow.  
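I understood how cleaning decisions interact: with a 50% threshold, three of the four columns were dropped before any default value was applied, so only `Name` needed imputation.  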
This pipeline approach is closer to real-world preprocessing than isolated functions.  
Building such logic improves my confidence in handling messy datasets.  
Now I’m ready to move on to exploratory data analysis.
