# MSDM5058 Tutorial 1 (Part 1) - Review on probability and statistics

## Contents
1. Probability of events
2. Probability distribution
3. Statistical measures

---

# 1. Probability of events

The probability of an event $A$ is denoted as $P(A)$. The two key principles of probability are

- $0\leq P\leq 1$: The probability of any event is a value between $0$ and $1$ inclusive.
- $\sum P = 1$: The probabilities of all _mutually exclusive_ and exhaustive events sum to $1$.


## 1.1. Relations between events

You should bear in mind these relations between events:

### 1.1.1. Complement

The complement of an event $A$, denoted as $\bar{A}$, is the event that $A$ does not occur. Its probability is $P(\bar{A})=1-P(A)$.

### 1.1.2. Intersection

The intersection between two events is equivalent to the "AND" operation, i.e. the probability for event $A$ AND event $B$ happening together, denoted as $P(A\cap B)$. It is also called their **joint probability**. 

- **Independence**: If the occurrences of $A$ and $B$ do not influence each other, they are independent and satisfy the so-called "product rule": 
 
 $$P(A\cap B) \stackrel{\text{indep.}}= P(A) P(B)\ .$$

 However, "independence" may be philosophically hard to establish. (Does a butterfly in Brazil cause a tornado in the United States?) Hence, as long as the data support the equality, one may hedge by saying that two events are statistically independent instead.

- **Mutually exclusive**: If $A$ and $B$ never happen together, they are called mutually exclusive and $P(A\cap B)\stackrel{\text{m.e.}}=0$. A simple example of mutually exclusive events is to get a head and get a tail from flipping one coin.

### 1.1.3. Union

The union between two events is equivalent to the "OR" operation, i.e. the probability for at least one of $A$ OR $B$ happening, denoted as $P(A\cup B)$. 

- **Sum rule**: Computing a union directly is often error-prone, so we usually rely on the following rules. For two events, it can be calculated as: 

 $$P(A\cup B) = P(A)+P(B)-P(A\cap B)\ .$$

 It may be generalized as the **inclusion-exclusion principle**: given a set of events $\{E_1, E_2, E_3,...\}$, the probability for at least one to happen is

 $$
\begin{align*}
P\left(\bigcup E_i\right) &= \sum_iP(E_i) - \sum_{i<j}P(E_i\cap E_j) + \sum_{i<j<k}P(E_i\cap E_j\cap E_k) - \cdots \\
&= \sum_k\left[(-1)^{k-1}\sum_{i_1<...<i_k}P\left(\bigcap_{j=1}^{k}E_{i_j}\right)\right]\ .
\end{align*}
 $$

- If the events are independent, the formula can be simplified into 
    
    $$P\left(\bigcup E_i\right) \stackrel{\text{indep.}}= 1-\prod_i[1-P(E_i)] \ .$$

- If the events are mutually exclusive, all terms of intersection vanish, so
    
    $$P\left(\bigcup_i E_i\right) = \sum_i P(E_i) \ .$$
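
The rules above can be checked by exact enumeration. The following is a minimal sketch, assuming two fair dice as the sample space; the events `A`, `B`, `C` and the helper names `prob` and `inclusion_exclusion` are illustrative choices, not part of the tutorial.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: all 36 equally likely outcomes of two fair dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a subset of omega)."""
    return Fraction(len(event), len(omega))

# Three illustrative events (hypothetical choices for this demo).
A = {w for w in omega if w[0] == 6}          # first die shows 6
B = {w for w in omega if w[1] == 6}          # second die shows 6
C = {w for w in omega if w[0] + w[1] >= 10}  # sum is at least 10

def inclusion_exclusion(events):
    """P(union) via alternating sums over all k-fold intersections."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k - 1)
        for subset in combinations(events, k):
            total += sign * prob(set.intersection(*subset))
    return total

# Sum rule for two events.
p_sum_rule = prob(A) + prob(B) - prob(A & B)
# Inclusion-exclusion for three events.
p_union = inclusion_exclusion([A, B, C])
# A and B are independent, so the product form also gives P(A or B).
p_indep = 1 - (1 - prob(A)) * (1 - prob(B))

print(p_sum_rule, prob(A | B))
print(p_union, prob(A | B | C))
print(p_indep, prob(A | B))
```

Each printed pair should agree, because the left value comes from a formula and the right value from counting outcomes directly.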

### 1.1.4. Conditional probability

The conditional probability of $B$ given $A$ is the probability for $B$ to happen given that $A$ has already happened. It is defined as 

$$
P(B|A) = \frac{P(A\cap B)}{P(A)}
$$

- If the events are independent, the occurrence of $A$ does not alter the probability for $B$ to occur at all. So
 $$P(B|A) \stackrel{\text{indep.}}=P(B)$$
 
 
 
- If the events are mutually exclusive, they cannot both occur. So $P(A\cap B)=0$ and
 $$P(B|A)\stackrel{\text{m.e.}}=0$$ 

## 1.2. Bayes' theorem

In science we may distinguish an independent variable $A$ from a dependent variable $B$ by causality, but mathematically nothing stops $A$ and $B$ from being symmetric in the formula, so we can write:

$$
\begin{cases}
P(B|A) = \dfrac{P(A\cap B)}{P(A)} \\[0.5em]
P(A|B) = \dfrac{P(A\cap B)}{P(B)}
\end{cases}
$$

and thus 
$$
P(A|B)P(B) = P(A\cap B) = P(B|A)P(A)
$$

This leads to Bayes' theorem, which says: regardless of any causality,

$$
P(B|A) \equiv \frac{P(A|B)P(B)}{P(A)}
$$
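
As a small numerical illustration of the formula, consider a diagnostic test. All numbers below are hypothetical and chosen only for the demonstration: $B$ is "has the condition" and $A$ is "tests positive". The denominator $P(A)$ is obtained by summing over the mutually exclusive cases $B$ and $\bar{B}$.

```python
# Hypothetical numbers, chosen only for illustration.
p_B = 0.01             # P(B): prior probability of having the condition
p_A_given_B = 0.95     # P(A|B): test is positive given the condition
p_A_given_notB = 0.05  # P(A|not B): false-positive rate

# P(A) from the mutually exclusive cases B and not-B.
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A).
p_B_given_A = p_A_given_B * p_B / p_A

print(f"P(A) = {p_A:.4f}, P(B|A) = {p_B_given_A:.4f}")
```

With these assumed inputs, $P(B|A)$ comes out at roughly 16%, far below $P(A|B)=95\%$, which shows why the two conditional probabilities should not be confused.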

We will see more applications of Bayes' theorem in later lectures and tutorials.

---

# 2. Probability distribution

We would often like to know how the probability $P(X=x)$ of a random variable $X$ is distributed over all possible $x$.

## 2.1. Terminology

### 2.1.1. Probability mass function (PMF)

If $X$ is a discrete random variable, its probability mass function (PMF) is defined as

$$
p_X(x) \equiv P(X=x)
$$

For example, the PMF of a die's outcome $D$ is 

$$
p_D(x) = \begin{cases}
 \frac{1}{6} & (x\in \{1,...,6\})\\
0 & \text{otherwise}
\end{cases}\ .
$$


### 2.1.2. Probability density function (PDF)

The definition of PMF fails if $X$ is continuous, because we cannot find an exact real number on the real number line. (There are, informally speaking, "infinitely many possible real numbers", so the probability to locate any particular real number goes to zero.) We need to adopt a different but similar concept for continuous random variables - the probability density function (PDF). If $X$ has a PDF $f_X(x)$, the following identity holds:

$$
\int_a^b f_X(x)\mathrm{d}x \equiv P(a\leq X\leq b) \ .
$$

- Therefore the product $f_X(x)\mathrm{d}x$ may be regarded as the probability of observing $X$ in the range $\left[{x,x+\mathrm{d}x}\right]$, while $f_X(x)$ alone is not a probability but a **_probability density_**. 

- As $X$ definitely lies in $(-\infty,\infty)$, we must impose the normalization condition $\int_{-\infty}^\infty f_X(x)dx\equiv 1$ on a PDF.


### 2.1.3. Cumulative distribution function (CDF)

The cumulative distribution function (CDF) of $X$ is defined as

$$
\begin{align*}
F_X(x) &\equiv P(X\leq x) \\[0.5em]
&= \begin{cases}
\displaystyle \sum_{x'\leq x} p_X(x') & (\text{discrete }X) \\[0.5em]
\displaystyle \int_{-\infty}^x f_X(x')\mathrm{d}x' & (\text{continuous }X)
\end{cases}\ ,
\end{align*}
$$

which must satisfy $F_X(-\infty)=0$ and $F_X(+\infty)=1$. Although it does not bring in new information, the CDF is often more useful analytically because it is monotonic. For the continuous case, $f_X(x)=\frac{dF_X(x)}{dx}$; since $F_X$ levels off at both ends, $f_X(x)$ typically vanishes as $x\to\pm\infty$.
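
To make the discrete case concrete, here is a minimal sketch of the PMF and CDF of a fair six-sided die using exact fractions. The function names `pmf_die` and `cdf_die` are illustrative, not part of the tutorial.

```python
from fractions import Fraction

def pmf_die(x):
    """PMF of a fair six-sided die: 1/6 on {1, ..., 6}, 0 elsewhere."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)

def cdf_die(x):
    """CDF: sum of the PMF over all outcomes not exceeding x."""
    return sum((pmf_die(k) for k in range(1, 7) if k <= x), Fraction(0))

for x in (0, 1, 3.5, 6, 10):
    print(x, cdf_die(x))
```

The CDF is a non-decreasing step function that starts at $0$ below the smallest outcome and reaches $1$ at the largest one.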

## 2.2. Multivariate distributions

We are often interested in how a random variable's outcome correlates with others'. In such cases, we need to consider multivariate distributions, the simplest of which contains only two random variables $X$ and $Y$. While this section focuses on continuous random variables, it is not difficult to rephrase for discrete ones. 


### 2.2.1. Joint distribution

The joint PDF $f_{XY}(x,y)$ is defined to satisfy

$$
\int_a^b\int_c^d f_{XY}(x,y)\mathrm{d}y\mathrm{d}x = P(X\in[a,b] \cap Y\in[c,d]) \ ,
$$

whereas their joint CDF is

$$
\begin{align*}
F_{XY}(x,y) \equiv& P(X\leq x\cap Y\leq y) \\
=& \int_{-\infty}^x  \int_{-\infty}^y f_{XY}(x', y')\mathrm{d}y'\mathrm{d}x' \ .
\end{align*}
$$

Conversely, $f_{XY}(x,y) = \frac{\partial^2}{\partial x\partial y}F_{XY}$. As the number of variables grows, it becomes more convenient to use the differential form of definition:

$$
\begin{align*}
F_{X_1X_2\dots X_n}(x_1,x_2,\dots,x_n)
\equiv& P\left(\bigcap_{i=1}^n X_i\leq x_i\right) \\
\Rightarrow f_{X_1X_2\dots X_n}(x_1,x_2,\dots,x_n)
\equiv& \frac{\partial^n}{\partial x_1\partial x_2\dots\partial x_n}F_{X_1X_2\dots X_n} \ .
\end{align*}
$$
 

### 2.2.2. Marginal distribution

Sometimes we are given $f_{XY}(x,y)$, from which we would like to extract $f_X(x)$. Since $P(X=x) =P[X=x\cap Y\in (-\infty, \infty)]$, we obtain

$$
\int_a^b f_X(x') \mathrm{d}x'
= \int_a^b\int_{-\infty}^\infty f_{XY}(x',y')\mathrm{d}y'\mathrm{d}x'
$$

or simply

$$
f_X(x) = \int_{-\infty}^\infty f_{XY}(x,y)\mathrm{d}y \ .
$$

This form of deduced PDF is also called a marginal PDF. Similarly, as $P(X\leq x) =P[X\leq x \cap Y\leq \infty] $, the marginal CDF of $X$ is

$$
F_X(x)=F_{XY}(x,y=\infty) \ .
$$


### 2.2.3. Conditional distribution

The conditional PDF of $Y$ on $X$ measures the probability density of $Y$ given that $X=x$. It is defined as

$$
f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_{X}(x)}
$$

so that $P(Y\in [c,d]| X=x) = \int_c^d f_{Y|X}(y\mid x) \mathrm{d}y$. 

> _(optional reading)_ 
>
> **Informal proof:** 
> 
> $$
\begin{align*}
P(Y\in[c,d]|X=x)
=& \lim_{\delta \to 0} P(Y\in [c,d]|X\in[x,x+\delta]) \\
=& \lim_{\delta \to 0} \frac{P(X\in[x,x+\delta]\cap Y\in[c,d])}{P(X\in[x,x+\delta])} \\
=& \lim_{\delta \to 0} \frac{\int^d_c\int^{x+\delta}_x f_{XY}(x',y)\mathrm{d}x'\mathrm{d}y}{F_X(x+\delta)-F_X(x)} \\
=& \lim_{\delta \to 0} \int^d_c \frac{\frac{\int^{x+\delta}_{-\infty} f_{XY}(x',y)\mathrm{d}x' - \int^x_{-\infty} f_{XY}(x',y)\mathrm{d}x'}{x+\delta-x}}{\frac{F_X(x+\delta)-F_X(x)}{x+\delta-x}} \mathrm{d}y\\
=& \int^d_c \frac{f_{XY}(x,y)}{f_X(x)} \mathrm{d}y \\
=& \int^d_c f_{Y|X}(y|x)\mathrm{d}y
\end{align*}
$$
> 
> From the fourth line to the fifth line, the fundamental theorem of calculus and the definition of derivative are invoked.
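
The marginal and conditional definitions carry over to discrete variables by replacing integrals with sums. Below is a minimal sketch under that discrete rephrasing; the joint PMF table `joint` is hypothetical, and the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint PMF p_XY(x, y) on a small grid; entries sum to 1.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

def marginal_x(x):
    """p_X(x): sum the joint PMF over every y."""
    return sum((p for (xi, _), p in joint.items() if xi == x), Fraction(0))

def conditional_y_given_x(y, x):
    """p_{Y|X}(y|x) = p_XY(x, y) / p_X(x)."""
    return joint.get((x, y), Fraction(0)) / marginal_x(x)

print(marginal_x(0), marginal_x(1))
print(conditional_y_given_x(0, 0), conditional_y_given_x(1, 0))
```

For any fixed $x$ with $p_X(x)>0$, the conditional PMF sums to $1$ over $y$, just as a conditional PDF integrates to $1$.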

## 2.3. Transformation of distribution

Given the PDF $f_X(x)$ of a continuous random variable $X$, we can calculate the PDF of an associated random variable $Y=g(X)$ with a general formula:

$$
f_Y(y) = \left|\frac{f_X(x)}{g'(x)}\right|_{x=g^{-1}(y)}
$$

for $y$ defined in the range of $g$. The function $g$ must be

- differentiable to possess a derivative $g'(x)=\frac{\mathrm{d}g}{\mathrm{d}x}$ and
- one-to-one to possess an inverse function $g^{-1}(y)$.

The two conditions combine to imply that $g$ is strictly monotonic. If the function is not one-to-one, besides its lack of a proper inverse, its derivative hits zero somewhere (by Rolle's theorem) and invalidates the formula. Still, $f_Y(y)$ may be deduced with other approaches.

 
> _(optional reading)_
>
> **Proof:** 
> 
> Let us first assume that $g$ is strictly increasing, i.e. $g'>0$.
>
> $$
F_Y(y)
=P(Y\leq y)
\stackrel{g'>0}= P[X\leq g^{-1}(y)]
=F_X[g^{-1}(y)] \ .
$$
>
> Then by the chain rule,
>
>$$
f_Y(y)
=\frac{\mathrm{d}}{\mathrm{d}y}F_X[g^{-1}(y)]
=\left.\frac{\mathrm{d}F_X}{\mathrm{d}x}\frac{\mathrm{d}x}{\mathrm{d}y}\right|_{x=g^{-1}(y)}
=\left.\frac{f_X(x)}{g'(x)}\right|_{x=g^{-1}(y)} \ .
$$
>
> On the other hand, for a strictly decreasing $g$ with $g'<0$,
>
> $$
\begin{align*}
F_Y(y)
=& P(Y\leq y)
\stackrel{g'<0}=P[(X\geq g^{-1}(y))]
=1-F_X[g^{-1}(y)] \\
\Rightarrow f_Y(y)=& -\left.\frac{f_X(x)}{g'(x)}\right|_{x=g^{-1}(y)} \ .
\end{align*}
$$
>
> Our assumption $g'<0$ cancels the negative sign and again makes $f_Y(y)\geq 0$. Hence, the two cases can be combined with an absolute value.

 

#### Example: Power function

What are the PDFs of $Y=X^p$ and $Z=X^{-p}$ for $X\sim U(0,1)$ with $p>0$?

**Solution.** Because 

$$
f_X(x)=\begin{cases}
1 & (0\leq x\leq1)\\
0 & (\text{otherwise})\\
\end{cases}\ ,
$$

we have 
$$
\begin{align*}
f_Y(y) =&
\begin{cases}
\left|\frac{1}{px^{p-1}}\right| & (0\leq y\leq 1) \\
0 & (\text{otherwise})
\end{cases} 
=\begin{cases}
\frac{1}{p}y^{1/p-1} & (0\leq y\leq 1)\\
0 & (\text{otherwise})
\end{cases} \\[1em]
f_Z(z) =& \begin{cases}
\left|\frac{1}{-px^{-p-1}}\right| & (z\geq 1 )\\
0 & \left(\text{otherwise}\right)
\end{cases}
=\begin{cases}
\frac{1}{p}z^{-1/p-1} & (z\geq 1)\\
0 & (\text{otherwise})
\end{cases}\ .
\end{align*}
$$

You may see that, for $p>1$, $f_Y(y)\to\infty$ as $y\to 0^+$. Such a divergence does not invalidate $f_Y(y)$. (You may integrate it to see if it satisfies the normalization condition.) In fact, $F_Y(y)=y^{1/p}$ on $[0,1]$ (for example, $\sqrt{y}$ when $p=2$), which indeed has a steep slope as $y\to 0^+$.
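
The power-function result can be checked by simulation. The sketch below assumes $p=2$ (an arbitrary choice) and compares the empirical CDF of $Y=X^p$ with $y^{1/p}$, which is what integrating the derived $f_Y$ gives on $[0,1]$. The helper name `empirical_cdf`, the seed, and the sample size are illustrative choices.

```python
import random

def empirical_cdf(samples, y):
    """Fraction of samples not exceeding y."""
    return sum(s <= y for s in samples) / len(samples)

random.seed(0)   # fixed seed so the run is repeatable
p = 2.0          # illustrative exponent (assumption)
n = 100_000      # illustrative sample size

# Transform uniform draws: Y = X^p with X ~ U(0, 1).
ys = [random.random() ** p for _ in range(n)]

# Compare the empirical CDF with the analytic one, y^(1/p).
for y in (0.01, 0.25, 0.81):
    print(y, empirical_cdf(ys, y), y ** (1 / p))
```

The two printed columns should agree up to sampling noise, which shrinks roughly like $1/\sqrt{n}$.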

## 2.4. Generation of random variables

Most programming languages provide a random number generator that returns a uniform random variable $X\sim U(0,1)$. We can generate random variables $Y$ with any strictly increasing CDF $F_Y(y)$ by defining

$$
Y=F_Y^{-1}(X) \ .
$$

The monotonic condition is, again, necessary for $F_Y^{-1}$ to exist.

 
> _(optional reading)_
>
> **Proof:**
>
> The CDF of $Y$ evaluated at $Y$ itself is also a random variable. Let us denote it with $\tilde{Y}=F_Y(Y)$ and any specific realization of $\tilde{Y}$ as $\tilde{y}$. Now consider the new variable's CDF $F_{\tilde{Y}}(\tilde{y})$: for $\tilde{y}\in[0,1]$,
>
> $$
F_{\tilde{Y}}(\tilde{y})
= P(\tilde{Y}\leq \tilde{y})
= P[Y\leq F_Y^{-1}(\tilde{y})]
= F_Y[F_Y^{-1}(\tilde{y})] = \tilde{y}
$$
>
> and thus $f_{\tilde{Y}}(\tilde{y})=1$. Consequently, $\tilde{Y}=F_Y(Y)\sim U(0,1)$ has the same distribution as $X$, the (pseudo-)random number generated by our programs. Hence, $Y=F_Y^{-1}(X)$ has the desired CDF $F_Y(y)$.

 

#### Example 1: Linear mapping

Express $Y\sim U[a,b]$ in terms of $X\sim U[0,1]$.

**Solution.** For $y\in[a,b]$, $F_Y(y)=\frac{y-a}{b-a}$, so $Y=F_Y^{-1}(X)=(b-a)X+a$.

#### Example 2: Logistic distribution

Fig. 1 shows a logistic function $L(x)=\frac{1}{1+e^{-x}}$. A distribution is logistic if its CDF is a logistic function. Express a logistically distributed random variable $Y$ with $F_Y(y)=L(y)$ in terms of $X\sim U[0,1]$.
 
<figure style="text-align: center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg" alt="logistic function" style="width:50%">
    <figcaption> <b>Fig. 1</b> $L(x)=\frac{1}{1+e^{-x}}$. Retrieved from <a href="https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg">https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg</a>.</figcaption>
</figure>
    
**Solution.** $Y=L^{-1}(X)=-\ln\left(\frac{1}{X}-1\right)$. You may verify this answer by plugging it into the general formula in Section 2.3.
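
The logistic example can be turned into a sampler. The following is a minimal sketch of inverse transform sampling, assuming Python's standard `random` module as the uniform generator; the function names, the seed, and the sample size are illustrative.

```python
import math
import random

def logistic_cdf(y):
    """L(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sample_logistic(rng):
    """Draw Y = L^{-1}(X) = -ln(1/X - 1) with X ~ U(0, 1)."""
    x = rng.random()
    while x == 0.0:  # random() may return 0.0, where 1/x is undefined
        x = rng.random()
    return -math.log(1.0 / x - 1.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_logistic(rng) for _ in range(100_000)]

# Compare the empirical CDF of the samples with the target CDF L(y).
for y in (-2.0, 0.0, 1.0):
    frac = sum(s <= y for s in samples) / len(samples)
    print(y, frac, logistic_cdf(y))
```

The same pattern applies to the linear mapping in Example 1: replace the inverse with $(b-a)X+a$.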

---

# 3. Statistical measures

## 3.1. Mean, variance and moments

Moments are statistical quantities that describe a probability distribution. For a random variable $X$ with a probability density function (PDF) $f_X(x)$, the moments are defined as

- $n$-th moment: $$\langle X^n\rangle = \int^\infty_{-\infty} x^n f_X(x) \mathrm{d}x$$

- $n$-th central moment: $$\langle (X-\langle X\rangle)^n\rangle = \int^\infty_{-\infty} (x-\langle X\rangle)^n f_X(x) \mathrm{d}x$$

The idea of moments generalizes some common statistical quantities:

- Mean = The first moment. Usually denoted as $\langle X\rangle$.

- Variance = the second _central_ moment. Usually denoted as $\sigma^2$.

$$\sigma^2 = \langle (X-\langle X\rangle)^2\rangle = \langle X^2 \rangle -\langle X\rangle ^2$$

- Skewness, kurtosis = the third and fourth central moments after standardization.
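
The moment integrals can be approximated numerically. Below is a minimal sketch, assuming $X\sim U(0,1)$ as the test distribution and a simple midpoint rule for the integral; the helper `moment` and its parameters are illustrative, not a standard API.

```python
def moment(pdf, n, a, b, steps=100_000, center=0.0):
    """Midpoint-rule estimate of the integral of (x - center)^n * pdf(x) over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += (x - center) ** n * pdf(x)
    return total * h

def uniform_pdf(x):
    """PDF of U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

mean = moment(uniform_pdf, 1, 0.0, 1.0)                   # first moment
second = moment(uniform_pdf, 2, 0.0, 1.0)                 # second moment
variance = moment(uniform_pdf, 2, 0.0, 1.0, center=mean)  # second central moment

print(mean, variance, second - mean ** 2)
```

The last two printed values should coincide, illustrating $\sigma^2 = \langle X^2 \rangle -\langle X\rangle ^2$; for $U(0,1)$ both are close to $1/12$.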




## 3.2. Characteristic function

The characteristic function of a probability distribution is the expectation of $e^{ikX}$, i.e. 

$$
M_X(k) = \langle e^{ikX}\rangle = \int^\infty_{-\infty} e^{ikx} f_X(x)\mathrm{d}x
$$

Be careful that the characteristic function is complex-valued. It has various applications:

- An alternative way for convolution: 

    Let $X$ and $Y$ be two _independent_ random variables with PDFs $f_X(x)$ and $g_Y(y)$. The PDF of the variable $Z=X+Y$ can be found by 
 
    - Method 1: direct calculation of the convolution integral:
     $$ 
     h_Z(z) = \int^\infty_{-\infty} f_X(x)g_Y(z-x) \mathrm{d}x
     $$
    
    - Method 2: First find the characteristic functions of $X$ and $Y$, and then calculate the characteristic function of $Z=X+Y$ by 

     $$ 
    M_Z(k) = \langle e^{ikZ} \rangle = \langle e^{ik(X+Y)} \rangle = \langle e^{ikX}e^{ikY} \rangle = \langle e^{ikX} \rangle\langle e^{ikY} \rangle = M_X(k)M_Y(k)
    $$

     Then, to convert the characteristic function back to the PDF, simply apply the inverse Fourier transform:

     $$
    h_Z(z) = \frac{1}{2\pi}\int^\infty_{-\infty} e^{-ikz} M_Z(k) \mathrm{d}k
    $$


- Moment generating function:

 The moments of a distribution can be calculated from its characteristic function with the general formula:
 
 $$ 
 \langle X^n\rangle = \frac{1}{i^n}\left.\frac{\mathrm{d}^nM_X(k)}{\mathrm{d}k^n}\right|_{k=0} 
$$
 
 > _(Optional reading)_
 > 
 > **Proof:**
 > Making use of the Taylor expansion:
 > $$\begin{align*} 
 e^{ikx} &= 1+ (ikx) + \frac{(ikx)^2}{2!} + \frac{(ikx)^3}{3!} + ...\\
 \left.\frac{\mathrm{d}^n}{\mathrm{d}k^n}e^{ikx}\right|_{k=0} &= (ix)^n \\
 \int^\infty_{-\infty} \left[\left.\frac{\mathrm{d}^n}{\mathrm{d}k^n}e^{ikx}\right|_{k=0}\right] f_X(x)\mathrm{d}x &= \int^\infty_{-\infty} \left[(ix)^n\right] f_X(x) \mathrm{d}x \\
 \left.\frac{\mathrm{d}^n}{\mathrm{d}k^n} \left(\int^\infty_{-\infty} e^{ikx}f_X(x)\mathrm{d}x\right) \right|_{k=0} &= i^n \int^\infty_{-\infty} x^n f_X(x)\mathrm{d}x \\
 \left.\frac{\mathrm{d}^nM_X(k)}{\mathrm{d}k^n}\right|_{k=0} &= i^n \langle X^n \rangle
 \end{align*}
 $$
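
The product rule $M_Z(k)=M_X(k)M_Y(k)$ can be verified numerically. The sketch below assumes independent $X, Y\sim U(0,1)$, for which the convolution integral gives a triangular PDF on $[0,2]$; the helper `char_fn` uses a plain midpoint rule and is an illustrative construction.

```python
import cmath

def char_fn(pdf, k, a, b, steps=20_000):
    """Midpoint-rule estimate of <exp(ikX)> for a PDF supported on [a, b]."""
    h = (b - a) / steps
    total = 0j
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += cmath.exp(1j * k * x) * pdf(x)
    return total * h

def uniform_pdf(x):
    """PDF of U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def triangular_pdf(z):
    """PDF of Z = X + Y for independent X, Y ~ U(0, 1) (Method 1 result)."""
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

for k in (0.5, 1.0, 3.0):
    m_x = char_fn(uniform_pdf, k, 0.0, 1.0)     # M_X(k) = M_Y(k)
    m_z = char_fn(triangular_pdf, k, 0.0, 2.0)  # M_Z(k) from the convolved PDF
    print(k, abs(m_z - m_x * m_x))
```

The printed differences should be close to zero, so both methods describe the same distribution of $Z$.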




## 3.3. Covariance and correlation

- The **covariance** between two random variables $X$ and $Y$ reads 

 $$
\begin{align*}
\sigma_{XY}^2 &= \langle (X-\langle X\rangle)(Y-\langle Y \rangle)\rangle \\
&= \langle XY \rangle - \langle X\rangle \langle Y \rangle \ .
\end{align*}
$$

 It gets this name for its identical form to variance; if $Y=X$, $\sigma_{XY}^2 = \sigma_X^2$. Two variables have a positive covariance if an increase in one often occurs with an increase in the other. On the other hand, two variables have a negative covariance if they possess opposite trends.

- The **correlation** (or correlation coefficient) is the normalized covariance for comparing behaviours of various pairs of variables. The correlation between $X$ and $Y$ is canonically defined as 

 $$
r_{XY} = \frac{\sigma^2_{XY}}{\sigma_X \sigma_Y} \in [-1,1]
$$
 
 which may be specifically called Pearson's correlation. 

### 3.3.1. Correlation versus dependence

Pearson's correlation between two variables only indicates the strength of their linear dependence. In other words, a correlation of larger magnitude means that their scatter plot resembles a straight line more closely, **regardless of its slope**. 

<figure style="text-align: center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg" alt="Correlation plot" style="width:50%">
    <figcaption style="text-align: left"> <b>Fig. 2</b> The number above each subplot indicates the correlation between the horizontal and the vertical variables. (Retrieved from <a href="https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg"> https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg</a>.) For the central subplot, correlation is undefined because its vertical variable has zero variance.</figcaption>
</figure>

#### Example: quadratic dependence

Let a random variable $X \sim U(-1,1)$, i.e. $X$ is uniformly distributed in $[-1,1]$. What is the correlation between $X$ and $Y=X^2$?

**Solution**. Because of its uniform distribution, the PDF of $X$ is

$$
f_X(x) = \begin{cases}
\frac{1}{2} & (-1\leq x\leq 1) \\
0 & (\text{otherwise})
\end{cases}\ .
$$

Then we can compute $\langle X\rangle$, which we need for $\sigma^2_{XY} = \langle XY\rangle -\langle X\rangle \langle Y\rangle = \langle X^3\rangle - \langle X \rangle \langle X^2 \rangle$ :

$$
\begin{align*}
\langle X \rangle &= \int^\infty_{-\infty} xf_X(x)\, dx \\
&= \frac{1}{2} \int^1_{-1} x\, dx \\
&= 0 \ .
\end{align*}
$$

Similarly, $\langle X^3\rangle = 0$ because the integrand is odd, and so $\sigma^2_{XY}=0$. This result implies that the correlation between $X$ and $Y$ is $r_{XY} = 0$, and it ultimately teaches us that Pearson's correlation does not measure the strength of nonlinear dependence reliably.
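
The vanishing correlation can be reproduced numerically. The sketch below assumes evenly spaced points on $[-1,1]$ as a stand-in for draws from $U(-1,1)$; the helper `pearson` is an illustrative implementation of the sample version of $r_{XY}$.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Evenly spaced points on [-1, 1] stand in for draws from U(-1, 1).
xs = [-1 + 2 * i / 1000 for i in range(1001)]
ys = [x * x for x in xs]

r_quadratic = pearson(xs, ys)                       # Y = X^2
r_linear = pearson(xs, [2 * x + 1 for x in xs])     # Y = 2X + 1

print(r_quadratic)  # close to 0 although Y is a function of X
print(r_linear)     # close to 1 for a linear relation
```

The contrast between the two cases shows that a near-zero correlation does not rule out a strong (nonlinear) dependence.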

### 3.3.2. Rank correlation (Optional reading)

Two useful alternatives to Pearson's correlation are Spearman's correlation and Kendall's correlation, which are special cases of rank correlation. This class of correlation is devised to respond more sensitively to nonlinear dependence. 

Unlike Pearson's correlation, rank correlation usually requires an explicit knowledge of observed data, so let us first assume there are $n$ realizations of $(X,Y)$, i.e. $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$. Rank correlation first transforms each $(x_i, y_i)$ to a rank variable $(R_{x_i}, R_{y_i})$, where $R_{x_i} = k$ if $x_i$ is the $k$-th smallest realization of $X$.

- **Spearman's correlation**. Also called Spearman's $\rho$, it is in fact the Pearson's correlation between $R_X = \{R_{x_i}\}$ and $R_Y = \{R_{y_i}\}$, so 

$$
\rho_{XY} = \frac{\sigma^2_{R_X R_Y}}{\sigma_{R_X}\sigma_{R_Y}}\ .
$$

- **Kendall's correlation**. Also called Kendall's $\tau$, it first assigns an $x$-score $\hat{x}_{ij} = \text{sgn}(R_{x_i}-R_{x_j})$ and a $y$-score $\hat{y}_{ij} = \text{sgn}(R_{y_i}-R_{y_j})$ to each pair of $(R_{x_i}, R_{y_i})$ and $(R_{x_j}, R_{y_j})$ with $i<j$. Then the correlation is defined as 

$$
\tau_{XY} = \frac{2}{n(n-1)}\sum_{i<j} \hat{x}_{ij}\hat{y}_{ij}\ .
$$

In Kendall's original terms, a pair is concordant if $\hat{x}_{ij}\hat{y}_{ij}>0$ but discordant if $\hat{x}_{ij}\hat{y}_{ij}<0$, whereas it is neither concordant nor discordant if $\hat{x}_{ij}\hat{y}_{ij}=0$. This trichotomy helps reformulate Kendall's correlation as 

$$
\tau_{XY} = \frac{\text{#concordances}-\text{#discordances}}{n(n-1)/2}\ .
$$
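
Both rank correlations can be computed directly from their definitions. The following is a minimal sketch that assumes the data contain no ties; the sample data and all helper names are illustrative, and the $O(n^2)$ pair loop is meant for clarity rather than speed.

```python
import math

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson's correlation between the rank variables."""
    return pearson(ranks(xs), ranks(ys))

def sgn(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)

def kendall(xs, ys):
    """Kendall's tau: average product of x-scores and y-scores over pairs i < j."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    total = sum(sgn(rx[i] - rx[j]) * sgn(ry[i] - ry[j])
                for i in range(n) for j in range(i + 1, n))
    return 2 * total / (n * (n - 1))

# Illustrative data: y is a monotonic but nonlinear function of x.
xs = [0.1, 0.4, 0.5, 0.9, 1.3]
ys = [math.exp(x) for x in xs]

print(pearson(xs, ys), spearman(xs, ys), kendall(xs, ys))
```

For a strictly increasing relation such as this one, both rank correlations reach $1$ while Pearson's correlation stays below $1$, because ranks ignore the curvature of the relation.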