% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ham.R
\name{ham}
\alias{ham}
\title{ham()}
\usage{
ham(mat_1, mat_2)
}
\arguments{
\item{mat_1}{a vector, a matrix or a data.frame of binary values that may contain missing data}
\item{mat_2}{a vector, a matrix or a data.frame of binary values with the same number of columns as \code{mat_1} that may contain missing data}
}
\value{
A distance matrix
}
\description{
This function computes a distance matrix using the Hamming distance as the proximity measure.
}
\details{
\code{ham} returns the pairwise distances between the rows (observations) of a single matrix if \code{mat_1} equals \code{mat_2}.
Otherwise, \code{ham} returns the matrix of distances between the rows of the two matrices \code{mat_1} and \code{mat_2}.

The Hamming distance can still be computed in the presence of missing data by applying the following formula. Assume that A and B are 2 matrices such that \code{ncol(A) = ncol(B)}.
The Hamming distance between the \eqn{i^{th}} row of A and the \eqn{k^{th}} row of B equals:
\deqn{\mbox{ham}(A_i,B_k) = \frac{\sum_j 1_{\left\{A_{ij} \neq B_{kj}\right\}}}{\sum_j 1}\times\left(\frac{\sum_j 1}{\sum_j 1_{\left\{!\mbox{is.na}(A_{ij}) \& !\mbox{is.na}( B_{kj})\right\}}}\right)}
where \eqn{i = 1,\dots,\mbox{nrow}(A)} and \eqn{k = 1,\dots,\mbox{nrow}(B)}, and the right-hand factor of the product is a specific weight applied when \eqn{A_i} and/or \eqn{B_k} contain NAs.
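
As a worked illustration of this formula (a sketch only, assuming that a comparison involving a missing value is not counted as a mismatch in the numerator): with 3 columns, \eqn{A_i = (1,\mbox{NA},0)} and \eqn{B_k = (1,1,1)}, only columns 1 and 3 are observed in both rows, and the two rows differ in column 3 only, so that:
\deqn{\mbox{ham}(A_i,B_k) = \frac{1}{3}\times\frac{3}{2} = \frac{1}{2}}
In other words, the number of mismatches is rescaled by the number of columns observed in both rows rather than by the total number of columns.
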
This weighting is not implemented in the \code{cdist} function, and the Hamming distance cannot be computed using the \code{\link[proxy]{dist}} function either.

The Hamming distance cannot be calculated in two situations only:
\enumerate{
\item If a row of A or B contains only missing values (i.e. for each of the columns of A or B respectively).
\item If the union of the indexes of the missing values in row i of A and the indexes of the missing values in row k of B covers all the considered columns.
}
Example: assuming that \eqn{\mbox{ncol}(A) = \mbox{ncol}(B) = 3}, if \eqn{A_i = (1,\mbox{NA},0)} and \eqn{B_k = (\mbox{NA},1,\mbox{NA})}, then for each column the information is missing either in row i of A or in row k of B, which gives: \eqn{\mbox{ham}(A_i,B_k) = \mbox{NA}}.
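Read against the formula above, this is the case where no column is observed in both rows, so the denominator of the weight is zero and the distance is left undefined:
\deqn{\sum_j 1_{\left\{!\mbox{is.na}(A_{ij}) \& !\mbox{is.na}(B_{kj})\right\}} = 0}
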
If \code{mat_1} is a vector and \code{mat_2} is a matrix (or data.frame) or vice versa, the length of \code{mat_1} must be equal to the number of columns of \code{mat_2}.
}
\examples{
set.seed(3010)
sample_A <- sample(c(0, 1), 12, replace = TRUE)
set.seed(3007)
sample_B <- sample(c(0, 1), 15, replace = TRUE)
A <- matrix(sample_A, ncol = 3)
B <- matrix(sample_B, ncol = 3)
# These 2 matrices have no missing values
# Matrix of pairwise distances with A:
ham(A, A)
# Matrix of distances between the rows of A and the rows of B:
ham(A, B)
# If mat_1 is a vector of binary values:
ham(c(0, 1, 0), B)
# Now consider A_NA and B_NA, two matrices built from A and B respectively,
# in which missing values have been manually added:
A_NA <- A
A_NA[3, 1] <- NA
A_NA[2, 2:3] <- rep(NA, 2)
B_NA <- B
B_NA[2, 2] <- NA
ham(A_NA, B_NA)
}
\references{
Roth R (2006). Introduction to Coding Theory. Cambridge University Press.
}
\author{
Gregory Guernec
\email{otrecod.pkg@gmail.com}
}