How to deal with missing values
1. If a feature has too many missing values (e.g. >30%), imputation is prone to error, so it is usually better to drop the feature.
2. Otherwise, impute:
   1. Univariate imputation
      1. Mean
      2. Median
      3. Mode
   2. Multivariate imputation
      1. KNNImputer
      2. IterativeImputer
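
The decision rule above can be sketched with pandas and scikit-learn. This is a minimal illustration, not a recommended pipeline: the toy DataFrame, the column names, the 30% threshold, and `n_neighbors=2` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: "c" is mostly missing, "a" and "b" have one gap each.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [10.0, np.nan, 32.0, 40.0, 50.0],
    "c": [np.nan, np.nan, np.nan, 1.0, 2.0],
})

# Step 1: drop features whose fraction of missing values exceeds the threshold.
threshold = 0.30  # assumed cut-off, taken from the rule of thumb above
missing_fraction = df.isna().mean()
kept = df.loc[:, missing_fraction <= threshold]

# Step 2a: univariate imputation (each column is filled from its own median).
median_imputer = SimpleImputer(strategy="median")
univariate = pd.DataFrame(median_imputer.fit_transform(kept), columns=kept.columns)

# Step 2b: multivariate imputation (each gap is filled from the nearest rows,
# measured on the features that are observed).
knn_imputer = KNNImputer(n_neighbors=2)
multivariate = pd.DataFrame(knn_imputer.fit_transform(kept), columns=kept.columns)

print(missing_fraction)
print(univariate)
print(multivariate)
```

`SimpleImputer` also accepts `strategy="mean"` and `strategy="most_frequent"` (the mode). `IterativeImputer` follows the same `fit_transform` interface, but in scikit-learn it is marked experimental and has to be enabled first with `from sklearn.experimental import enable_iterative_imputer`.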

# Types

#### Missing completely at random (MCAR)
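Values are MCAR when the events that lead to a data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. \
When data are MCAR, analysis of the observed data is unbiased and the missingness shows no pattern. \
Example: a sensor that drops readings because its battery is low.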


#### Missing at random (MAR)
Data are MAR when the missingness is not random but can be fully accounted for by variables for which there is complete information. \
Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. \
An example is that males are less likely to fill in a depression survey, but this has nothing to do with their level of depression after accounting for maleness. Depending on the analysis method, these data can still induce parameter bias because some cells may be empty (the cell for males with very high depression may have zero entries). \
However, if the parameter is estimated with **Full Information Maximum Likelihood**, MAR data will yield asymptotically unbiased estimates.


#### Missing not at random (MNAR) (also known as nonignorable nonresponse)
MNAR data are neither MAR nor MCAR (i.e. the value of the variable that's missing is related to the reason it's missing). \
To extend the previous example, this would occur if men failed to fill in a depression survey because of their level of depression.
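
The difference between the three mechanisms can be made concrete with a small simulation of the survey example. Everything numeric here is an assumption made for illustration (group means of 10 and 12, a standard deviation of 2, and the missingness probabilities); the point is only how the observed mean behaves under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: sex is always observed, the score may be missing.
male = rng.random(n) < 0.5
score = np.where(male, 10.0, 12.0) + rng.normal(0.0, 2.0, n)

# MCAR: every score has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends only on the observed variable (sex).
mar_missing = rng.random(n) < np.where(male, 0.6, 0.1)
# MNAR: missingness depends on the unobserved score itself.
mnar_missing = rng.random(n) < 1.0 / (1.0 + np.exp(-(score - 11.0)))


def naive_mean(missing):
    """Mean of the scores that were actually observed."""
    return score[~missing].mean()


def sex_adjusted_mean(missing):
    """Average the observed per-sex means, weighted by the share of each sex."""
    observed = ~missing
    p_male = male.mean()
    male_mean = score[observed & male].mean()
    female_mean = score[observed & ~male].mean()
    return p_male * male_mean + (1.0 - p_male) * female_mean


full_mean = score.mean()
print(f"full data        : {full_mean:.2f}")
for name, missing in [("MCAR", mcar_missing), ("MAR", mar_missing), ("MNAR", mnar_missing)]:
    print(f"{name:4s} naive mean  : {naive_mean(missing):.2f}   "
          f"sex-adjusted: {sex_adjusted_mean(missing):.2f}")
```

Under these assumptions the naive mean is close to the full-data mean only for MCAR. For MAR, adjusting for the fully observed variable removes the bias; for MNAR it does not, because the missingness depends on the value that is missing.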

Samuelson and Spirer (1992) discussed how missing and/or distorted data about demographics, law enforcement, and health could be indicators of patterns of human rights violations. They gave several fairly well documented examples.[9]

# Techniques of dealing with missing data

Missing data reduces the representativeness of the sample and can therefore distort inferences about the population.

There are three main approaches to handle missing data:
1. Imputation: values are filled in in place of the missing data.
2. Omission: samples with invalid data are discarded from further analysis.
3. Analysis: methods unaffected by the missing values are applied directly.

One systematic review addressing the prevention and handling of missing data for patient-centered outcomes research identified 10 standards as necessary for the prevention and handling of missing data. These include standards for study design, study conduct, analysis, and reporting.

An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.

#### Imputation
Some data analysis techniques are not robust to missingness and require the missing data to be "filled in", or imputed.
With multiple imputation, 2 or 3 imputations capture most of the relative efficiency that could be captured with a larger number of imputations for many practical purposes.
However, too small a number of imputations can lead to a substantial loss of statistical power, and some scholars now recommend 20 to 100 or more. Any multiply-imputed data analysis must be repeated for each of the imputed data sets and, in some cases, the relevant statistics must be combined in a relatively complicated way.[2] Multiple imputation is seldom used in some disciplines because of a lack of training or misconceptions about it.[13] Methods such as listwise deletion have been used in place of imputation, but they have been found to introduce additional bias.[14]
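
The combining step is commonly done with Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The sketch below uses only the standard library; the function name `pool_rubin` and the example numbers are made up for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets.
pooled, total = pool_rubin([10.0, 11.0, 12.0], [1.0, 1.0, 1.0])
print(pooled, total)
```

The `(1 + 1/m)` factor shows why very few imputations cost precision: the between-imputation variance is itself estimated from only `m` values and is inflated accordingly.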

The expectation-maximization algorithm is an approach in which values of the statistics which would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data-items are not usually imputed.
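
As a rough sketch of this idea, the code below runs EM for a bivariate normal in which `x` is always observed and `y` is sometimes missing. The simulated data, the MAR missingness rule, and the iteration limits are all assumptions for the example. Note that the E-step fills in expected sufficient statistics (the conditional mean of `y` and of `y**2`), not final imputed values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_mean = np.array([0.0, 1.0])
true_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
data = rng.multivariate_normal(true_mean, true_cov, size=n)
x, y = data[:, 0], data[:, 1].copy()

# Assumed MAR mechanism: y is more likely to be missing when x is large.
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))
y[missing] = np.nan
observed = ~missing

# Start from the complete-case estimates.
mu = np.array([x.mean(), y[observed].mean()])
cov = np.cov(x[observed], y[observed])

for _ in range(500):
    # E-step: expected y and y^2 given x under the current parameters.
    slope = cov[0, 1] / cov[0, 0]
    resid_var = cov[1, 1] - slope * cov[0, 1]
    ey = np.where(missing, mu[1] + slope * (x - mu[0]), y)
    ey2 = np.where(missing, ey ** 2 + resid_var, y ** 2)

    # M-step: re-estimate mean and covariance from the expected statistics.
    new_mu = np.array([x.mean(), ey.mean()])
    sxx = np.mean(x ** 2) - new_mu[0] ** 2
    sxy = np.mean(x * ey) - new_mu[0] * new_mu[1]
    syy = ey2.mean() - new_mu[1] ** 2
    new_cov = np.array([[sxx, sxy], [sxy, syy]])

    change = max(np.abs(new_mu - mu).max(), np.abs(new_cov - cov).max())
    mu, cov = new_mu, new_cov
    if change < 1e-8:
        break

print("complete-case mean of y:", y[observed].mean())
print("EM estimate of the mean:", mu)
print("EM estimate of the covariance:")
print(cov)
```

With this setup the complete-case mean of `y` is pulled downward, because the rows with large `x` (and therefore, on average, large `y`) are the ones that lose their `y` value, while the EM estimate uses the fully observed `x` to correct for that.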


###### Interpolation
In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points.
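
For an ordered series, linear interpolation fills each gap from its neighbouring known points. A minimal sketch with `numpy.interp`, using made-up sample times and readings:

```python
import numpy as np

# Hypothetical readings taken at times 0..4, with two gaps.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
values = np.array([10.0, np.nan, 14.0, np.nan, 20.0])

known = ~np.isnan(values)
filled = values.copy()
# Evaluate the piecewise-linear function through the known points at the gaps.
filled[~known] = np.interp(t[~known], t[known], values[known])

print(filled)  # [10. 12. 14. 17. 20.]
```

This only makes sense when the rows have a meaningful order (time, position) and the gaps lie inside the range of the known points; `numpy.interp` does not extrapolate and instead repeats the nearest end value outside that range.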

In the comparison of two paired samples with missing data, a test statistic that uses all available data without the need for imputation is the partially overlapping samples t-test.[16] This test is valid under normality and assuming MCAR.

#### Partial deletion
Methods which involve reducing the data available to a dataset having no missing values include:
- Listwise deletion/casewise deletion
- Pairwise deletion
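
The two deletion strategies differ in how many rows each statistic ends up using. A small pandas sketch on a made-up DataFrame: `dropna()` performs listwise deletion, while `DataFrame.corr()` already excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column, each in a different row.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "z": [1.0, np.nan, 3.0, 2.0, 5.0, 4.0],
})

# Listwise deletion: keep only rows that are complete in every column.
listwise = df.dropna()
listwise_corr = listwise.corr()

# Pairwise deletion: each correlation uses every row where both columns are present.
pairwise_corr = df.corr()

# Number of rows that contribute to each pairwise correlation.
present = df.notna().astype(int)
pairwise_n = present.T @ present

print(len(listwise), "complete rows out of", len(df))
print(pairwise_n)
print(pairwise_corr)
```

Here listwise deletion keeps 3 of the 6 rows, while every pairwise correlation is computed from 4 rows. The price of pairwise deletion is that different entries of the correlation matrix are based on different subsets of the data.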

#### Full analysis
Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:

- Generative approaches:
  - The expectation-maximization algorithm
  - Full information maximum likelihood estimation
- Discriminative approaches:
  - Max-margin classification of data with absent features [17][18]

Partial identification methods may also be used.[19]

## Model-based techniques

Model-based techniques, often using graphs, offer additional tools for testing missing data types (MCAR, MAR, MNAR) and for estimating parameters under missing data conditions.