# Handling Missing Data


#### Three mechanisms of missing data
1. MCAR (Missing Completely At Random)

2. MAR (Missing At Random)

3. MNAR (Missing Not At Random)


#### Observed and unobserved data
1. **Observed data:**
   - Values that were actually recorded, i.e. not missing.

2. **Unobserved data:**
   - Values that are missing from the dataset.


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)


#### Rubin's Missing Data Mechanisms (1976)
Suppose we have three variables (more generally, three sets of variables) Y, X, and Z, where Y and Z are fully observed and X has missing values.

##### Three Types of Missingness:

1. **Missing Completely At Random (MCAR)**
   - `P(R|X,Y,Z) = P(R)`: the probability of missingness is independent of all variables.

   - Hence the likelihood of a value being missing is completely unrelated to both **observed** and **unobserved** data.

   - **Note**: in these notes, "likelihood", "chance", and "probability" are used interchangeably.

2. **Missing At Random (MAR)**
   - `P(R|X,Y,Z) = P(R|Y,Z)`: the probability of missingness depends only on the observed variables Y and Z.

3. **Missing Not At Random (MNAR)**
   - The probability of missingness depends on the unobserved value of X itself, even after conditioning on Y and Z, so `P(R|X,Y,Z)` cannot be simplified.

##### Summary Table

| Type | Missing Probability Depends On | Consequence |
|------|--------------------------------|-------------|
| MCAR | Nothing | Complete case analysis unbiased |
| MAR | Observed variables only | Requires methods like multiple imputation |
| MNAR | Unobserved values | Requires modeling the missing mechanism |

*Note: R is a binary random variable that indicates missingness: R = 1 if X is observed, R = 0 if X is missing.*
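
As a rough illustration of the three mechanisms, the sketch below simulates a fully observed `y`, a partially missing `x`, and an indicator `r` for each mechanism. The logistic missingness model, its coefficients, the sample size, and all variable names are assumptions made for this example only; they are not part of Rubin's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Y is fully observed; X is correlated with Y and will receive missing values.
y = rng.normal(size=n)
x = y + rng.normal(size=n)


def sigmoid(t):
    """Logistic function, used here as an assumed missingness model."""
    return 1.0 / (1.0 + np.exp(-t))


# R = True means X is observed, R = False means X is missing.
r_mcar = rng.random(n) < 0.5              # P(R) is constant
r_mar = rng.random(n) < sigmoid(-2 * y)   # P(R) depends only on observed Y
r_mnar = rng.random(n) < sigmoid(-2 * x)  # P(R) depends on X itself

true_mean = x.mean()

# Complete-case estimates of the mean of X under each mechanism.
cc_mean_mcar = x[r_mcar].mean()
cc_mean_mar = x[r_mar].mean()
cc_mean_mnar = x[r_mnar].mean()

# Under MAR, modeling X from the observed Y and averaging the predictions
# over all rows adjusts for the missingness.
slope, intercept = np.polyfit(y[r_mar], x[r_mar], 1)
mar_adjusted_mean = (slope * y + intercept).mean()

print(f"true mean      : {true_mean:.3f}")
print(f"MCAR, CCA      : {cc_mean_mcar:.3f}")
print(f"MAR, CCA       : {cc_mean_mar:.3f}")
print(f"MAR, adjusted  : {mar_adjusted_mean:.3f}")
print(f"MNAR, CCA      : {cc_mean_mnar:.3f}")
```

In this setup the complete-case mean stays close to the true mean under MCAR, while under MAR and MNAR it shifts downward because rows with large values are more likely to be missing. The adjustment on `y` brings the MAR estimate back near the true mean; no comparable adjustment is available under MNAR without modeling the missingness mechanism.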

---


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Handling Missing Values

## Approaches to Missing Values
- **Remove Values**
  - Complete Case Analysis (CCA): Only analyze rows with no missing values in any column

- **Impute**
  - **Univariate**
    - Numerical
      - Mean/median
      - Random value
      - End of distribution
    - Categorical
      - Mode
      - Missing (as category)
  - **Multivariate**
    - KNN imputer
    - Iterative imputer
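
The removal and univariate options above can be sketched with pandas and scikit-learn's `SimpleImputer`. The toy DataFrame, its column names, and the "mean + 3 standard deviations" rule for end-of-distribution imputation are assumptions chosen for illustration, not a prescribed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 41.0],
    "fare": [7.25, 71.28, np.nan, 26.55, 8.05, 13.00],
    "deck": ["C", np.nan, "E", "C", np.nan, "B"],
})
num_cols = ["age", "fare"]

# Remove values: Complete Case Analysis keeps only fully observed rows.
cca = df.dropna()

# Numerical, mean/median: replace each NaN with the column median.
median_imputer = SimpleImputer(strategy="median")
num_median = pd.DataFrame(
    median_imputer.fit_transform(df[num_cols]), columns=num_cols
)

# Numerical, random value: draw replacements from the observed values.
rng = np.random.default_rng(0)
missing_age = df["age"].isna()
age_random = df["age"].copy()
age_random.loc[missing_age] = rng.choice(
    df["age"].dropna().to_numpy(), size=int(missing_age.sum())
)

# Numerical, end of distribution: assumed rule of mean + 3 * std.
age_end_value = df["age"].mean() + 3 * df["age"].std()
age_end = df["age"].fillna(age_end_value)

# Categorical, mode: fill with the most frequent category.
deck_mode = df["deck"].fillna(df["deck"].mode()[0])

# Categorical, "Missing" as its own category.
deck_missing = df["deck"].fillna("Missing")
```

In a modeling workflow the imputer would normally be fitted on the training split only and then applied to the test split with `transform`, so that statistics from the test data do not leak into training.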

## Topics we are going to study
1. **Remove values** 

2. **Simple Imputer**

3. **KNN Imputer**

4. **Iterative Imputer**

5. **Missing Indicator**



![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Techniques for Handling Missing Values

To address the challenges posed by missing values, data scientists and analysts employ various techniques, including:

1. **Deletion methods**: 
   - Removing rows that contain missing data when only a small proportion of rows is affected, or dropping columns that are mostly missing.

2. **Imputation**: 
   - Replacing missing values with statistical estimates, such as mean, median, mode, or predictions from machine learning models.

3. **Advanced modeling**: 
   - Using algorithms designed to handle missing data, such as multiple imputation or matrix factorization.

4. **Flagging missingness**: 
   - Creating additional features to indicate whether a value was missing, which can add predictive value to models.
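
A minimal sketch of the model-based imputers and of missingness flags, using scikit-learn's `KNNImputer`, `IterativeImputer`, and `MissingIndicator`. The small array and the hyperparameters (`n_neighbors=2`, `max_iter=10`, `random_state=0`) are assumptions for illustration. Note that `IterativeImputer` is still marked experimental in scikit-learn and needs the explicit enabling import shown below.

```python
import numpy as np
# Required before IterativeImputer can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, MissingIndicator

# Hypothetical numeric data; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# KNN imputer: fill each gap with the mean of that feature over the
# nearest rows, using a distance that ignores missing coordinates.
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# Iterative imputer: model each feature with missing values as a
# function of the other features and repeat for several rounds.
iterative = IterativeImputer(max_iter=10, random_state=0)
X_iter = iterative.fit_transform(X)

# Missing indicator: boolean flags, by default only for the columns
# that contained missing values when the indicator was fitted.
indicator = MissingIndicator()
flags = indicator.fit_transform(X)

# Flags can be appended to the imputed data as extra features.
X_with_flags = np.hstack([X_knn, flags.astype(float)])
```

The imputers also accept `add_indicator=True`, which appends the same kind of flag columns to their output in one step.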