# NOTEBOOK 01 : DATA CLEANING

**Author** : Guy Arbus  
**Date** : 2026/02/09  
**Version** : 1.0

## ABSTRACT

The data cleaning phase aimed to ensure the **reliability**, **consistency**, and **operational relevance** of the dataset prior to analysis and modeling.  

Raw data were systematically inspected to identify missing values, outliers, duplicates, and inconsistent records, particularly those resulting from sensor noise, transmission errors, or adversarial perturbations. Domain-specific rules and statistical thresholds were applied to validate measurements and enforce physical plausibility constraints. Missing or corrupted data were either imputed using controlled strategies or, where necessary to avoid bias, excluded.  

This process was essential to reduce noise, enhance signal integrity, and help ensure that downstream analytical and predictive models operate on data aligned with defense and security requirements.

## BEST PRACTICES & PRINCIPLES

- Always keep raw data unchanged (read-only)
- Document every decision and assumption
- Version control your cleaning scripts
- Make it reproducible (seeds, scripts, environments)
- Validate at each step before proceeding
- Visualize before and after each major change
- Consult domain experts when uncertain
- Automate repetitive tasks
- Test on a sample before applying to the full dataset
- Consider ethical implications (bias, fairness, privacy)

## COMMON PITFALLS TO AVOID

- Deleting data without understanding why
- Imputing without investigating missingness patterns
- Removing outliers that are legitimate
- Data leakage (using test set info in cleaning)
- Over-cleaning (removing valuable variance)
- Ignoring domain knowledge
- Not documenting changes
- Cleaning test data differently than training data
- Assuming correlation implies causation in feature selection
- Not validating cleaned data
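
Two of the pitfalls above, data leakage and cleaning test data differently from training data, can be avoided with the same habit: learn every cleaning parameter on the training split and reuse it unchanged on the test split. The sketch below is a minimal illustration with pandas; the column name `range_m` and the toy values are assumptions made for the example, not part of the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits: one numeric sensor column with gaps.
train = pd.DataFrame({"range_m": [10.0, 12.0, np.nan, 14.0]})
test = pd.DataFrame({"range_m": [np.nan, 100.0]})

# Learn the fill value from the training split only,
# so no information from the test split leaks into cleaning.
fill_value = train["range_m"].median()

# Apply the same learned value, in the same way, to both splits.
train_clean = train.fillna({"range_m": fill_value})
test_clean = test.fillna({"range_m": fill_value})
```

The same pattern applies to outlier thresholds, scaling parameters, and category mappings: compute once on training data, store, then apply.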

## TABLE OF CONTENTS

### 01. INITIAL DATA ASSESSMENT

#### Data Sources
#### Collection methods
#### Sensitive data classification
#### Restrictions
#### Data dictionary
#### Completeness
#### Data structure 

---

### 02. HANDLING SENSITIVE DATA AND CLASSIFICATION INFORMATION

---

### 03. STRUCTURAL ISSUES

---

### 04. MISSING DATA

#### 04.1 Missing data identification

**Calculate missingness percentage per variable**  
**Identify patterns (MCAR, MAR, MNAR)**  
**Create missingness indicator variables**  
**Visualize missingness patterns (heatmaps, co-occurrence matrices)**  
**Document reasons for missingness when known**  
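
The first identification steps above can be sketched with pandas as follows. The column names and values are hypothetical; the co-occurrence matrix is one simple way to see which variables tend to be missing together, and it can be passed to any heatmap function for visualization.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two numeric columns.
df = pd.DataFrame({
    "speed_kts": [12.0, np.nan, 15.0, np.nan],
    "heading_deg": [90.0, 95.0, np.nan, np.nan],
    "sensor_id": ["A", "B", "A", "B"],
})

# Percentage of missing values per variable.
missing_pct = df.isna().mean().mul(100)

# Indicator columns (1 = missing) for variables with at least one gap.
cols_with_gaps = missing_pct[missing_pct > 0].index
indicators = df[cols_with_gaps].isna().astype(int).add_suffix("_missing")

# Co-occurrence matrix: number of rows where two variables are both missing.
co_missing = indicators.T.dot(indicators)
```

Counts alone do not establish whether data are MCAR, MAR, or MNAR; they only show where to look. Relating the indicator columns to other variables is a common next step.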

#### 04.2 Missing data treatment

**Determine appropriate imputation method per variable**  
**Document deletion criteria (listwise, pairwise)**  
**Validate imputation quality**  
**Consider multiple imputation for critical variables**  
**Flag imputed values for transparency**  
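
As an illustration of per-variable imputation with transparency flags, the sketch below fills gaps with the median of the same sensor and records which values were imputed. Group-wise median imputation is only one possible choice, used here as an assumption; the column names `sensor_id` and `altitude_m` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two sensors with different baselines.
df = pd.DataFrame({
    "sensor_id": ["A", "A", "A", "B", "B", "B"],
    "altitude_m": [100.0, np.nan, 120.0, 300.0, 310.0, np.nan],
})
observed = df["altitude_m"].copy()

# Flag first, so the record of what was imputed survives the fill.
df["altitude_m_imputed"] = df["altitude_m"].isna()

# Fill each gap with the median of the same sensor's observed values.
group_median = df.groupby("sensor_id")["altitude_m"].transform("median")
df["altitude_m"] = df["altitude_m"].fillna(group_median)

# Basic quality checks: no gaps remain and observed values are untouched.
no_gaps_left = df["altitude_m"].notna().all()
observed_unchanged = df.loc[~df["altitude_m_imputed"], "altitude_m"].equals(
    observed.dropna()
)
```

For critical variables, a single fill value understates uncertainty, which is why multiple imputation is listed above as an option to consider.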

---

### 05. DUPLICATES

#### 05.1 Exact Duplicates

**Identify complete row duplicates**  
**Check for duplicate IDs/primary keys**  
**Check for duplicate timestamps in time-series data**  
**Document retention policy (first, last, average)**  
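
A minimal sketch of the exact-duplicate checks with pandas is shown below. The key columns `track_id` and `timestamp` and the "keep first" retention policy are assumptions for illustration; the real policy should be the one documented for the dataset.

```python
import pandas as pd

# Hypothetical track records: rows 0/1 are identical, rows 3/4 share a key
# but disagree on the measurement.
df = pd.DataFrame({
    "track_id": [1, 1, 2, 3, 3],
    "timestamp": pd.to_datetime([
        "2026-01-01 00:00", "2026-01-01 00:00", "2026-01-01 00:05",
        "2026-01-01 00:10", "2026-01-01 00:10",
    ]),
    "range_m": [500.0, 500.0, 620.0, 700.0, 710.0],
})

# Complete row duplicates (all copies shown, not just the extras).
full_row_dupes = df[df.duplicated(keep=False)]

# Duplicate primary keys, including rows whose values conflict.
key_cols = ["track_id", "timestamp"]
key_dupes = df[df.duplicated(subset=key_cols, keep=False)]

# Example retention policy: keep the first record per key.
deduped = df.drop_duplicates(subset=key_cols, keep="first")
```

Rows 3 and 4 show why key duplicates deserve a review before removal: the two records conflict, so "first", "last", and "average" give different results.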
  
#### 05.2 Fuzzy Duplicates

**Implement similarity matching for text fields**  
**Check for near-duplicate sensor readings**  
**Identify potential entity resolution issues**  
**Apply domain-specific deduplication rules**  
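
Similarity matching for text fields can be sketched with the standard library alone. The example below uses `difflib.SequenceMatcher` after a light normalization step; the names, the normalization, and the 0.85 threshold are all assumptions for illustration, and a real threshold should be tuned on labeled examples.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical free-text station names.
names = [
    "Radar Station Alpha",
    "radar station alpha ",
    "Radar Stn Alpha",
    "Sonar Buoy 7",
]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

THRESHOLD = 0.85  # assumed cut-off, to be tuned

# Candidate pairs for review; nothing is merged automatically.
candidates = []
for a, b in combinations(names, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        candidates.append((a, b, score))
```

Comparing every pair scales quadratically, so larger datasets usually need a blocking key (for example, the first token or a region code) to limit the comparisons.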

---

### 06. OUTLIERS

#### 06.1 Statistical Outliers

**Apply univariate outlier detection (IQR, Z-score, modified Z-score)**  
**Implement multivariate outlier detection (Mahalanobis distance, isolation forests)**  
**Check for distribution violations**  
**Validate against known operational ranges**  
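
Two of the univariate methods named above, the IQR rule and the modified Z-score, can be sketched as follows. The data are hypothetical, and the 1.5 multiplier and 3.5 cut-off are conventional defaults rather than values prescribed by this notebook.

```python
import pandas as pd

# Hypothetical range readings with one suspicious value.
values = pd.Series([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0], name="range_km")

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
iqr_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Modified Z-score: based on the median and the median absolute deviation
# (MAD), so the outlier itself does not inflate the scale estimate.
median = values.median()
mad = (values - median).abs().median()
modified_z = 0.6745 * (values - median) / mad
mad_outlier = modified_z.abs() > 3.5
```

The modified Z-score is undefined when the MAD is zero (more than half the values identical), so that case needs a guard in production code.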

#### 06.2 Domain-Specific Anomalies

**Flag physically impossible values (negative distances, speeds exceeding limits)**  
**Check temporal consistency (events out of chronological order)**  
**Validate geographic coordinates within operational theaters**  
**Verify sensor readings against calibration ranges**  
**Check for adversarial data contamination**  
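
Several of the checks above reduce to boolean rules that can be evaluated per row and kept as separate flags. The sketch below assumes hypothetical column names, a hypothetical speed limit, and a hypothetical bounding box standing in for an operational area; real limits must come from domain documentation.

```python
import pandas as pd

# Hypothetical records, in the order they were received.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-01-01 10:00", "2026-01-01 10:05",
        "2026-01-01 10:03", "2026-01-01 10:10",
    ]),
    "distance_m": [120.0, -5.0, 130.0, 140.0],
    "speed_mps": [15.0, 20.0, 900.0, 18.0],
    "lat": [48.8, 48.9, 49.0, 95.0],
    "lon": [2.3, 2.4, 2.5, 2.6],
})

# Assumed limits, for illustration only.
MAX_SPEED_MPS = 340.0
LAT_BOUNDS = (40.0, 55.0)
LON_BOUNDS = (-5.0, 10.0)

in_area = df["lat"].between(*LAT_BOUNDS) & df["lon"].between(*LON_BOUNDS)

# One boolean column per rule, so each violation stays traceable.
checks = pd.DataFrame({
    "negative_distance": df["distance_m"] < 0,
    "speed_over_limit": df["speed_mps"] > MAX_SPEED_MPS,
    "outside_area": ~in_area,
    "out_of_order": df["timestamp"].diff() < pd.Timedelta(0),
})
df["any_violation"] = checks.any(axis=1)
```

Keeping one column per rule, rather than a single pass/fail flag, makes it easier to document which rule rejected a record.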

#### 06.3 Outlier Treatment

**Distinguish errors from legitimate extreme values**  
**Document retention/removal decisions**  
**Consider winsorization or transformation**  
**Flag outliers for analyst review**  
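
Winsorization and flagging can be combined so that extreme values are capped for modeling while remaining visible for analyst review. The sketch below clips at the 5th and 95th percentiles; these percentiles and the data are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Assumed winsorization limits: 5th and 95th percentiles.
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# Flag before modifying, so the original extremes remain identifiable.
flagged = (values < lower) | (values > upper)

# Cap values at the limits instead of removing the rows.
winsorized = values.clip(lower=lower, upper=upper)
```

Unlike deletion, this keeps the row count unchanged, which matters when other columns in the same record are still valid.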

---

### 07. CONSISTENCY

#### 07.1 Internal Consistency

**Cross-validate related fields (e.g., start/end times)**  
**Check calculated fields against source data**  
**Verify hierarchical relationships (category/subcategory)**  
**Validate conditional logic rules**  
**Check for contradictory information**  
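
Cross-validating related fields and checking calculated fields against their sources can be sketched as below. The column names and the one-minute tolerance are hypothetical.

```python
import pandas as pd

# Hypothetical events with a stored duration field.
df = pd.DataFrame({
    "start": pd.to_datetime(
        ["2026-01-01 08:00", "2026-01-01 09:00", "2026-01-01 10:00"]
    ),
    "end": pd.to_datetime(
        ["2026-01-01 08:30", "2026-01-01 08:50", "2026-01-01 10:20"]
    ),
    "duration_min": [30.0, 45.0, 25.0],
})

# Related fields: an event cannot end before it starts.
end_before_start = df["end"] < df["start"]

# Calculated field: recompute the duration from its source columns
# and compare with the stored value, within an assumed tolerance.
TOLERANCE_MIN = 1.0
recomputed_min = (df["end"] - df["start"]).dt.total_seconds() / 60
duration_mismatch = (recomputed_min - df["duration_min"]).abs() > TOLERANCE_MIN
```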

#### 07.2 External Validation

**Compare against authoritative reference data**  
**Validate geographic locations against maps**  
**Cross-reference with external databases**  
**Verify equipment specifications against technical documentation**  
**Check historical consistency with previous datasets**  
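
Comparison against reference data often comes down to a join that reports which records have no match. The sketch below uses the `indicator` option of `DataFrame.merge`; both tables, the identifiers, and the specification column are hypothetical.

```python
import pandas as pd

# Hypothetical observed values and a hypothetical reference table.
observations = pd.DataFrame({
    "equipment_id": ["EQ-1", "EQ-2", "EQ-9"],
    "max_range_km": [50.0, 80.0, 30.0],
})
reference = pd.DataFrame({
    "equipment_id": ["EQ-1", "EQ-2", "EQ-3"],
    "spec_max_range_km": [50.0, 75.0, 60.0],
})

# Left join with an indicator column showing where each row was found.
merged = observations.merge(
    reference, on="equipment_id", how="left", indicator=True
)

# Records with no entry in the reference table.
unknown_equipment = merged.loc[
    merged["_merge"] == "left_only", "equipment_id"
].tolist()

# Records whose observed value exceeds the documented specification.
# Rows without a reference value compare as False and are handled above.
exceeds_spec = merged["max_range_km"] > merged["spec_max_range_km"]
```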

#### 07.3 Business/Domain Rules

**Apply defense-specific validation rules**  
**Verify mission-critical constraints**  
**Check operational parameter bounds**  
**Validate threat classification schemas**  
**Ensure compliance with doctrine and procedures**  

---

### 08. VALIDATION WITH DOMAIN KNOWLEDGE

---

### 09. CATEGORICAL VARIABLES CLEANING

---

### 10. TEXT DATA CLEANING

---

### 11. TEMPORAL DATA

---

### 12. GEOSPATIAL DATA

---

### 13. NUMERICAL DATA

---

### 14. FEATURE ENGINEERING (PRE-MODELING)