-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
/
missing-value-imputation.html
219 lines (189 loc) · 11.6 KB
/
missing-value-imputation.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
layout: layout.njk
permalink: "{{ page.filePathStem }}.html"
---
{% include "toc.njk" %}
<div class="col-md-9 col-md-pull-3">
<h1 id="missing-value-imputation-top" class="title">Missing Value Imputation</h1>
<p>So far, we have been living in a prefect data world where we select features,
build models, and validate them. However, missing data, or missing values,
are a common occurrence in real world and can have a
significant effect on the conclusions that can be drawn from the data.</p>
<p>Data are missing for many reasons. Missing data can occur because of
nonresponse: no information is provided for several items or no information
is provided for a whole unit. Some items are more sensitive for nonresponse
than others, for example items about private subjects such as income.</p>
<p>Dropout is a type of missingness that occurs mostly when studying
development over time. In this type of study the measurement is repeated
after a certain period of time. Missingness occurs when participants drop
out before the test ends and one or more measurements are missing.</p>
<p>Sometimes missing values are caused by the device failure or even by
researchers themselves. It is important to question why the data is missing,
this can help with finding a solution to the problem. If the values are
missing at random there is still information about each variable in each
unit but if the values are missing systematically the problem is more severe
because the sample cannot be representative of the population.</p>
<p>All of the causes for missing data fit into four classes, which are based
on the relationship between the missing data mechanism and the missing and
observed values. These classes are important to understand because the
problems caused by missing data and the solutions to these problems are
different for the four classes.</p>
<p>The first is Missing Completely at Random (MCAR). MCAR means that the
missing data mechanism is unrelated to the values of any variables, whether
missing or observed. Data that are missing because a researcher dropped the
test tubes or survey participants accidentally skipped questions are
likely to be MCAR. If the observed values are essentially a random sample
of the full data set, complete case analysis gives the same results as
the full data set would have. Unfortunately, most missing data are not MCAR.</p>
<p>At the opposite end of the spectrum is Non-Ignorable (NI). NI means that
the missing data mechanism is related to the missing values. It commonly
occurs when people do not want to reveal something very personal or
unpopular about themselves. For example, if individuals with higher incomes
are less likely to reveal them on a survey than are individuals with lower
incomes, the missing data mechanism for income is non-ignorable. Whether
income is missing or observed is related to its value. Complete case
analysis can give highly biased results for NI missing data. If
proportionally more low and moderate income individuals are left in
the sample because high income people are missing, an estimate of the
mean income will be lower than the actual population mean.</p>
<p>In between these two extremes are Missing at Random (MAR) and Covariate
Dependent (CD). Both of these classes require that the cause of the missing
data is unrelated to the missing values, but may be related to the observed
values of other variables. MAR means that the missing values are related to
either observed covariates or response variables, whereas CD means that the
missing values are related only to covariates. As an example of CD missing
data, missing income data may be unrelated to the actual income values, but
are related to education. Perhaps people with more education are less likely
to reveal their income than those with less education.</p>
<p>A key distinction is whether the mechanism is ignorable (i.e., MCAR, CD, or
MAR) or non-ignorable. There are excellent techniques for handling ignorable
missing data. Non-ignorable missing data are more challenging and require a
different approach.</p>
<p>If it is known that the data analysis technique which is to be used isn't
content robust, it is good to consider imputing the missing data.
Once all missing values have been imputed, the dataset can then be analyzed
using standard techniques for complete data. The analysis should ideally
take into account that there is a greater degree of uncertainty than if
the imputed values had actually been observed, however, and this generally
requires some modification of the standard complete-data analysis methods.
Many imputation techniques are available.</p>
<p>Imputation is not the only method available for handling missing data.
The expectation-maximization algorithm is a method for finding maximum
likelihood estimates that has been widely applied to missing data problems.
In machine learning, it is sometimes possible to train a classifier directly
over the original data without imputing it first. That was shown to yield
better performance in cases where the missing data is structurally absent,
rather than missing due to measurement noise.</p>
<p>Smile provides several methods to impute missing values. The <code>NaN</code> values
in the input data matrix are treated as missing values and will be replaced with imputed
values after the processing.</p>
<h2 id="average">Average Value Imputation</h2>
<p>In this approach, we impute missing values with the average of other attributes in the instance.
Assume the attributes of the dataset are of same kind, e.g. microarray gene
expression data, the missing values can be estimated as the average of
non-missing attributes in the same instance. Note that this is not the
average of same attribute across different instances.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_1" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def avgimpute(data: Array[Array[Double]]): Unit
</code></pre>
</div>
</div>
</div>
<h2 id="knn">K-Nearest Neighbor Imputation</h2>
<p>The KNN-based method selects instances similar to the instance of interest to impute
missing values. If we consider instance <code>A</code> that has one missing value on
attribute <code>i</code>, this method would find <code>k</code> other instances, which have a value
present on attribute <code>i</code> with values most similar (in term of some distance,
e.g. Euclidean distance) to <code>A</code> on other attributes without missing values.
The average of values on attribute <code>i</code> from the <code>k</code> nearest
neighbors is then used as an estimate for the missing value in instance <code>A</code>.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_2" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def knnimpute(data: Array[Array[Double]], k: Int = 5)
</code></pre>
</div>
</div>
</div>
<h2 id="kmeans">K-Means Imputation</h2>
<p>This method first cluster data by K-Means
with missing values and then impute missing values with the average value of each attribute
in the clusters.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_3" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def impute(data: Array[Array[Double]], k: Int, runs: Int = 1): Unit
</code></pre>
</div>
</div>
</div>
<h2 id="lls">Local Least Squares Imputation</h2>
<p>The local least squares imputation method represents a target instance that has missing values as
a linear combination of similar instances, which are selected by k-nearest
neighbors method.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_4" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def llsimpute(data: Array[Array[Double]], k: Int): Unit
</code></pre>
</div>
</div>
</div>
<h2 id="svd">SVD Imputation</h2>
<p>Given SVD <code>A = U Σ V<sup>T</sup></code>, we use the most significant eigenvectors of
<code>V<sup>T</sup></code> to linearly estimate missing values. Although it has been
shown that several significant eigenvectors are sufficient to describe
the data with small errors, the exact fraction of eigenvectors best for
estimation needs to be determined empirically. Once <code>k</code> most significant
eigenvectors from <code>V<sup>T</sup></code> are selected, we estimate a missing value <code>j</code>
in row <code>i</code> by first regressing this row against the <code>k</code> eigenvectors and then use
the coefficients of the regression to reconstruct <code>j</code> from a linear combination
of the <code>k</code> eigenvectors. The <code>j</code> th value of row <code>i</code> and the <code>j</code> th values of the <code>k</code>
eigenvectors are not used in determining these regression coefficients.
It should be noted that SVD can only be performed on complete matrices;
therefore we originally fill all missing values by other methods (e.g. K-Means) in
matrix <code>A</code>, obtaining <code>A'</code>. We then utilize an expectation maximization method to
arrive at the final estimate, as follows. Each missing value in <code>A</code> is estimated
using the above algorithm, and then the procedure is repeated on the newly
obtained matrix, until the total change in the matrix falls below the
empirically determined threshold (say 0.01).</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_5" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def svdimpute(data: Array[Array[Double]], k: Int, maxIter: Int = 10)): Unit
</code></pre>
</div>
</div>
</div>
<div id="btnv">
<span class="btn-arrow-left">← </span>
<a class="btn-prev-text" href="validation.html" title="Previous Section: Model Validation"><span>Model Validation</span></a>
<a class="btn-next-text" href="clustering.html" title="Next Section: Clustering"><span>Clustering</span></a>
<span class="btn-arrow-right"> →</span>
</div>
</div>
<script type="text/javascript">
$('#toc').toc({exclude: 'h1, h5, h6', context: '', autoId: true, numerate: false});
</script>