# Domain Adaptation - a brief review

Traditionally, in supervised learning we are interested in approximating a function $f: X \longrightarrow Y$ using a training sample $X_{\text{train}}$ drawn from the feature space $X$. This is typically done by optimizing some *cost* function on the training data, with the goal of performing well on unseen *test* data.
By doing this we are (naively) assuming that both the training and test samples are drawn from the same joint probability distribution. The real world, though, is harsh, and this is typically not the case. In particular, once we have trained our ML model, we cannot guarantee that the *real-world* data the model will see in production is drawn from the same joint probability distribution as the training data. This can happen for several reasons, such as:
* The joint distribution of the domain can, and most probably will, change over time.
* The collection of training samples might carry unknown biases, for instance due to different selection methods.
* We might need to generalize a model to a completely new domain, e.g., use training data from Brazil to build a model for Mexico.
* ...

In all of the above examples, the training samples are not an accurate representation of the examples that the model will see in real life. Naturally, this translates into a degradation of model performance that can worsen over time. To tackle this problem, several techniques under the name of **domain adaptation** were developed. From now on, we will refer to the distributions from which the training and test samples are drawn as the *source* and *target* distributions, respectively.
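
To make the degradation concrete, here is a minimal synthetic sketch; it is only an illustration under assumed settings, not a method from the literature. Two unit-variance Gaussian classes form the source, and the target is the same pair of classes shifted by a constant. The helper names (`sample_domain`, `accuracy`), the shift of 1.5, and the midpoint-threshold "classifier" are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, shift, rng):
    """Draw n points per class from unit-variance Gaussians centred at -1 + shift and +1 + shift."""
    x0 = rng.normal(-1.0 + shift, 1.0, size=n)  # class 0
    x1 = rng.normal(1.0 + shift, 1.0, size=n)   # class 1
    x = np.concatenate([x0, x1])
    y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    return x, y

def accuracy(x, y, threshold):
    """Fraction of points classified correctly by the rule: predict 1 if x > threshold."""
    return float(np.mean((x > threshold).astype(int) == y))

# Source training sample, held-out source sample, and shifted target sample.
x_train, y_train = sample_domain(5000, shift=0.0, rng=rng)
x_src, y_src = sample_domain(5000, shift=0.0, rng=rng)
x_tgt, y_tgt = sample_domain(5000, shift=1.5, rng=rng)

# "Training": place the decision threshold halfway between the source class means.
threshold = 0.5 * (x_train[y_train == 0].mean() + x_train[y_train == 1].mean())

acc_src = accuracy(x_src, y_src, threshold)
acc_tgt = accuracy(x_tgt, y_tgt, threshold)
print(f"threshold learned on source: {threshold:.3f}")
print(f"accuracy on held-out source: {acc_src:.3f}")
print(f"accuracy on shifted target:  {acc_tgt:.3f}")
```

The threshold that separates the source classes well sits in the wrong place for the target, so accuracy on the target sample should be clearly lower even though the task (telling class 0 from class 1) has not changed.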

Domain adaptation is a special case of **transfer learning**. Transfer learning refers to a class of machine learning problems where the tasks and/or domains may change between source and target, whereas domain adaptation is only concerned with problems where the domains differ and the task remains the same. A related setting is semi-supervised learning, which addresses the problem of having unlabeled data: a small set of labeled data is used together with a large amount of unlabeled data to train a model. There, however, both labeled and unlabeled data are assumed to have been drawn from the same joint probability distribution, an assumption that is relaxed in the domain adaptation and transfer learning setups.

In this notebook we will discuss the basic ideas behind **domain adaptation**. The discussion will be motivated by practical (synthetic) examples, and at the end we will show how domain adaptation was used in a practical example at the DataLab.
Let us begin by stating the problem in general terms by defining the essential mathematical objects under study in this context.

# Notation and definitions
In domain adaptation, a domain consists of three general ingredients:
* An input or feature space $X$, e.g. $X \subset \mathbb{R}^d$.
* An output or label space $Y$, e.g. $Y = \{0,1\}$ for binary classification tasks, $Y = \{0, 1, \dots, k-1\}$ for k-class classification tasks, and $Y \subset \mathbb{R}$ for regression tasks.
* An associated joint probability distribution over $X \times Y$ with density $\mathbb{P}(X=x,Y=y)$. Together, these define the domain $\mathcal{D}=\{X, Y, \mathbb{P}(X=x,Y=y)\}$.

For simplicity we will write $\mathbb{P}(X= x, Y = y)$ as $\mathbb{P}(x,y)$. Recall that we can write the joint probability density as:
\begin{align*}
\mathbb{P}(x,y) &= \mathbb{P}(x|y) \mathbb{P}(y) \\
& = \mathbb{P}(y|x)\mathbb{P}(x)
\end{align*}
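
As a quick sanity check, both factorizations can be verified numerically on a small discrete example. The joint table below is made up purely for illustration (three values of $x$, two values of $y$); the variable names are likewise assumptions of this sketch.

```python
import numpy as np

# Hypothetical joint table P(x, y): rows index x in {0, 1, 2}, columns index y in {0, 1}.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Marginals obtained by summing out the other variable.
p_x = joint.sum(axis=1)  # P(x), shape (3,)
p_y = joint.sum(axis=0)  # P(y), shape (2,)

# Conditionals obtained by dividing the joint by the relevant marginal.
p_x_given_y = joint / p_y            # column j holds P(x | y=j)
p_y_given_x = joint / p_x[:, None]   # row i holds P(y | x=i)

# Both products should recover the joint table.
recon_from_y = p_x_given_y * p_y             # P(x|y) P(y)
recon_from_x = p_y_given_x * p_x[:, None]    # P(y|x) P(x)

print(np.allclose(recon_from_y, joint), np.allclose(recon_from_x, joint))
```

These two factorizations matter later because different kinds of shift between source and target correspond to changes in different factors, for example a change in $\mathbb{P}(x)$ with $\mathbb{P}(y|x)$ held fixed.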

In [2]:
import numpy as np
