# NOTEBOOK 01 : DATA CLEANING

**Author** : Guy Arbus  
**Date** : 2026/02/09  
**Version** : 1.0

## ABSTRACT

The data cleaning phase aimed to ensure the **reliability**, **consistency**, and **operational relevance** of the dataset prior to analysis and modeling.  

Raw data were systematically inspected to identify missing values, outliers, duplicates, and inconsistent records, particularly those resulting from sensor noise, transmission errors, or adversarial perturbations. Domain-specific rules and statistical thresholds were applied to validate measurements and enforce physical plausibility constraints. Missing or corrupted data were handled using controlled imputation strategies or exclusion when necessary to avoid bias.  

This process was essential to reduce noise, enhance signal integrity, and guarantee that downstream analytical and predictive models operate on data aligned with defense and security requirements.

## BEST PRACTICES & PRINCIPLES

- Always keep raw data unchanged (read-only)
- Document every decision and assumption
- Version control your cleaning scripts
- Make it reproducible (seeds, scripts, environments)
- Validate at each step before proceeding
- Visualize before and after each major change
- Consult domain experts when uncertain
- Automate repetitive tasks
- Test on a sample before applying to the full dataset
- Consider ethical implications (bias, fairness, privacy)
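
Several of these principles (read-only raw data, seeded reproducibility, testing on a sample, validating each step) can be combined in one small pattern. The sketch below is a minimal illustration using only the standard library; the record layout, the `clean_step` rule, and the thresholds are hypothetical and not taken from this project's dataset.

```python
import copy
import random


def clean_step(records, min_value, max_value):
    """Hypothetical cleaning rule: keep records whose 'value' lies in [min_value, max_value]."""
    return [r for r in records if min_value <= r["value"] <= max_value]


def run_on_sample(raw_records, step, sample_size, seed=42, **step_kwargs):
    """Apply `step` to a seeded random sample taken from a deep copy of the raw data."""
    rng = random.Random(seed)              # fixed seed: the same sample on every run
    working = copy.deepcopy(raw_records)   # never mutate the raw records
    sample = rng.sample(working, min(sample_size, len(working)))
    cleaned = step(sample, **step_kwargs)
    # Validate before proceeding: a filtering step must not create records.
    assert len(cleaned) <= len(sample)
    return sample, cleaned


# Illustrative raw data (made up for this sketch).
raw = [{"id": i, "value": v}
       for i, v in enumerate([10.0, 12.5, -999.0, 11.2, 5000.0, 9.8])]
raw_snapshot = copy.deepcopy(raw)

sample, cleaned = run_on_sample(
    raw, clean_step, sample_size=4, min_value=0.0, max_value=100.0
)
print(f"kept {len(cleaned)} of {len(sample)} sampled records")

# The raw data must be identical before and after the cleaning run.
assert raw == raw_snapshot
```

Once the step behaves as expected on the sample, the same function can be applied to the full working copy. In a real pipeline the raw files would also typically be stored with read-only permissions.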

## COMMON PITFALLS TO AVOID

- Deleting data without understanding why
- Imputing without investigating missingness patterns
- Removing outliers that are legitimate
- Data leakage (using test set info in cleaning)
- Over-cleaning (removing valuable variance)
- Ignoring domain knowledge
- Not documenting changes
- Cleaning test data differently than training data
- Assuming correlation implies causation in feature selection
- Not validating cleaned data
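
Two of these pitfalls, data leakage and cleaning test data differently from training data, have a common remedy: estimate every cleaning parameter on the training split only, then apply the same fitted parameter to both splits. The sketch below shows this for median imputation; the function names and the values are hypothetical and serve only as an illustration.

```python
import statistics


def fit_median(values):
    """Estimate the imputation value from the observed (non-missing) entries only."""
    observed = [v for v in values if v is not None]
    return statistics.median(observed)


def impute(values, fill):
    """Replace missing entries (None) with a previously fitted fill value."""
    return [fill if v is None else v for v in values]


# Illustrative splits (made up for this sketch); None marks a missing value.
train = [1.0, None, 3.0, 5.0]
test = [None, 100.0]

fill = fit_median(train)          # fitted on the training split only
train_clean = impute(train, fill)
test_clean = impute(test, fill)   # same fill value: no test-set information is used

print(fill, train_clean, test_clean)
```

Fitting the median on the concatenation of both splits would let the extreme test value influence the training data, which is the leakage described above.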

## TABLE OF CONTENTS

- **[01. INITIAL DATA ASSESSMENT AND DOCUMENTATION](#)**
    - *[01.1 Data provenance](#)*
    - *[01.2 Initial exploration](#)*
- **[02. STRUCTURAL ISSUES](#)**
    - *[02.1 Data types and formatting](#)*
    - *[02.2 Schema validation](#)*
- **[03. MISSING DATA](#)**
    - *[03.1 Identification](#)*
    - *[03.2 Treatment](#)*
- **[04. DUPLICATES](#)**
    - *[04.1 Exact duplicates](#)*
    - *[04.2 Fuzzy duplicates](#)*
- **[05. OUTLIERS](#)**
    - *[05.1 Statistical outliers](#)*
    - *[05.2 Domain-specific anomalies](#)*
    - *[05.3 Outlier treatment](#)*
- **[06. CONSISTENCY AND VALIDATION](#)**
    - *[06.1 Internal consistency](#)*
    - *[06.2 External validation](#)*
    - *[06.3 Business and domain rules](#)*
- **[07. DATA QUALITY METRICS](#)**
    - *[07.1 Completeness](#)*
    - *[07.2 Accuracy](#)*
    - *[07.3 Timeliness](#)*
    - *[07.4 Consistency](#)*
- **[08. SECURITY AND PRIVACY](#)**
    - *[08.1 Classification review](#)*
    - *[08.2 PII and sensitive data](#)*
    - *[08.3 Access control](#)*
- **[09. FEATURE ENGINEERING](#)**
    - *[09.1 Standardization](#)*
    - *[09.2 Derived variables](#)*
- **[10. VERSION CONTROL AND DOCUMENTATION](#)**
    - *[10.1 Change tracking](#)*
    - *[10.2 Metadata documentation](#)*
    - *[10.3 Code documentation](#)*
- **[11. VALIDATION AND TESTING](#)**
    - *[11.1 Post-cleaning validation](#)*
    - *[11.2 Stakeholder review](#)*
- **[12. FINAL DOCUMENTATION](#)**
    - *[12.1 Comprehensive report](#)*