<div style="font-weight: bold; color:#5D8AA8" align="center">
    <div style="font-size: xx-large">Machine Learning with Functional Data</div><br>
    <div style="font-size: x-large; color:gray">Descriptive statistics and preprocessing</div><br>
    <div style="font-size: large">José Luis Torrecilla Noguerales - Universidad Autónoma de Madrid</div><br></div><hr>
</div>

**Initial setting**: This cell defines the Notebook setting

In [None]:
%%html
<style>
    .qst {background-color: #b1cee3; padding:10px; border-radius: 5px; border: solid 2px #5D8AA8;}
    .qst:before {font-weight: bold; content: "Questions"; display: block; margin: 0px 10px 10px 10px;}
    h1, h2, h3 {color: #5D8AA8;}
    .text_cell_render p {text-align: justify; text-justify: inter-word;}
</style>

**Packages to use:**

In [None]:
import skfda
import matplotlib.pyplot as plt
import sklearn

We will illustrate these methods with our well-known *AEMET* dataset. In particular, we will use the temperature curves.

In [None]:
X, _ = skfda.datasets.fetch_aemet(return_X_y=True)

X = X.coordinates[0] #Selecting temperatures only
X.plot()
plt.show()

# Visualization tools and outlier detection

Visualization tools can be used to gain insight into the data. In particular, trends, salient features, outliers, and other patterns in the data can be identified simply by inspection.  The package *scikit-fda* provides a number of interactive tools for data visualization and outlier detection. Their implementation utilizes the functionality provided by *matplotlib*.

### Summary statistics
Common summary statistics, such as the sample mean function and the sample covariance function, can be estimated using the tools provided by **scikit-fda**. Consider a set of functional observations $\left\{ X_{i}(t) \right\}_{i=1}^{n}$. The sample mean,
\begin{equation}\label{eq:mean}
	\hat\mu(t) = \frac{1}{n}\sum_{i=1}^{n} X_{i}(t),
\end{equation}
can be computed by applying the function *mean()* to the *FData* object in which the data are stored. The functional observations can be either in discrete form or in a basis representation. The resulting mean function is an *FData* object of the same type as the input (i.e., discretized or in a basis representation).

The sample covariance function $\hat k$,
\begin{equation}\label{eq:covariance}
	\hat k(t, s) = \frac{1}{n - 1}\sum_{i=1}^{n} (X_{i}(t) - \hat\mu(t))(X_i(s) - \hat\mu(s)),
\end{equation}
can be computed by applying the function *cov()* to the corresponding *FData* object. Similarly, the function *var()* can be used to compute the sample variance, $\hat k(t, t)$.
Irrespective of the representation of the functional observations, the sample variance and covariance are returned in discretized form.
Instead of the functions *mean*, *cov* and *var*, the *FData* methods of the same name can be used to compute these summary statistics.
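To make the two formulas above concrete, here is a minimal NumPy sketch for curves discretized on a common grid. It does not use scikit-fda; the synthetic data and the variable names (`curves`, `mu_hat`, `k_hat`, `var_hat`) are assumptions introduced only for illustration.

```python
import numpy as np

# Synthetic discretized sample: n curves observed on a common grid of m points.
rng = np.random.default_rng(0)
n, m = 20, 50
t = np.linspace(0, 1, m)
curves = (
    np.sin(2 * np.pi * t)
    + rng.normal(scale=0.3, size=(n, 1))   # random vertical shift per curve
    + rng.normal(scale=0.05, size=(n, m))  # pointwise noise
)

# Pointwise sample mean: average over the observations at each grid point.
mu_hat = curves.mean(axis=0)

# Sample covariance surface k(t, s), with the 1 / (n - 1) normalization.
centered = curves - mu_hat
k_hat = centered.T @ centered / (n - 1)

# The sample variance is the diagonal k(t, t) of the covariance surface.
var_hat = np.diag(k_hat)
```

The covariance is an `m x m` matrix indexed by pairs of grid points, which is why it is naturally returned in discretized form.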

The package scikit-fda provides support for the computation of robust statistics. Robust
statistics may provide a better characterization of the data than non-robust ones (e.g., the
mean or the covariance functions), especially in the presence of outliers. One of the most
important robust statistics is the geometric median
\begin{equation}\label{eq:geometric_median}
	M = \underset{z \in \mathcal{F}}{\arg \min} \sum_{i=1}^{n} \left \| x_i-z \right \|.
\end{equation}
It can be computed with the function **geometric_median()**. Alternatively, the median
can be defined as the deepest point in the sample. Different depth measures yield different
definitions of the median. These types of medians can be computed with the function
**depth_based_median()**.
A trimmed mean is a robust version of the standard mean in which the most outlying functional
observations (the ones with the lowest depth values) are discarded. In scikit-fda, the trimmed
mean is implemented in the function **trim_mean()**.
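The geometric median has no closed form, so it is usually approximated iteratively. The sketch below uses the classical Weiszfeld iteration on discretized curves with an approximate $L^2$ norm. It is an illustration under these assumptions, not necessarily the algorithm that scikit-fda uses internally, and the helper name `geometric_median_grid` is hypothetical.

```python
import numpy as np

def geometric_median_grid(curves, t, n_iter=200, tol=1e-10):
    """Weiszfeld iteration for discretized curves (rows) on a uniform grid t."""
    dt = t[1] - t[0]
    z = curves.mean(axis=0)  # start from the sample mean
    for _ in range(n_iter):
        # Approximate L2 distance from each curve to the current estimate.
        dist = np.sqrt(np.sum((curves - z) ** 2, axis=1) * dt)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        weights = 1.0 / dist
        z_new = weights @ curves / weights.sum()
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new
    return z

t = np.linspace(0, 1, 101)
base = np.sin(2 * np.pi * t)
# Nine regular curves (small symmetric vertical shifts) and one gross outlier.
shifts = np.linspace(-0.2, 0.2, 9)
curves = np.vstack([base + s for s in shifts] + [base + 10.0])

mean_curve = curves.mean(axis=0)
median_curve = geometric_median_grid(curves, t)

# Largest deviation from the underlying curve for each estimator.
print(np.max(np.abs(mean_curve - base)), np.max(np.abs(median_curve - base)))
```

The single outlier drags the mean far from the bulk of the data, while the geometric median stays close to the regular curves.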

In [None]:
import numpy as np
import skfda.exploratory.stats as stats

from skfda.exploratory.depth import ModifiedBandDepth as MBD
from skfda.exploratory.depth import BandDepth as BD

mean = X.mean() #sample mean
var = X.var()   #sample variance
std = np.sqrt(var)  #sample standard deviation
trim_mean = stats.trim_mean(X, 0.05)  #trimmed mean
geo_median = stats.geometric_median(X)  #geometric median
depth_median = stats.depth_based_median(X, depth_method=MBD())  

fig, axes = plt.subplots(1, 1, figsize=(10, 6))
X.plot(fig=fig, color="grey", alpha=0.2)
axes.fill_between(
    X.grid_points[0],
    (mean-std).data_matrix[0, ..., 0],
    (mean+std).data_matrix[0, ..., 0],
    alpha=0.3,
)

mean.plot(fig=fig, label="mean")
trim_mean.plot(fig=fig, label="trimmed mean")
geo_median.plot(fig=fig, label="geometric median")
depth_median.plot(fig=fig, label="depth based median")
axes.legend(loc='lower right')
fig.suptitle(None)

### Functional boxplot

This is a generalization of the univariate boxplot for functional data. The functional boxplot consists of a graph of the functional median (i.e., the deepest curve in the sample) surrounded by a central envelope, which encompasses the deepest $50\%$ of the observations, and a maximum non-outlying envelope. The width of this outer envelope is determined by scaling the central one by a constant factor. This constant factor can be selected by the user. Its default value is $1.5$. In scikit-fda, the class **Boxplot** can be used to generate and customize functional boxplots. In this plot, a trajectory is marked as an outlier if it lies beyond the maximum non-outlying envelope for some interval. The class **BoxplotOutlierDetector** can be used for outlier detection based on this criterion. Some customizable elements of Boxplot objects are the depth measure and the definition of central bands that encompass user-specified fractions of the deepest observations. The following code provides an illustration of these functionalities with the *AEMET* dataset.

By default, only the part of each outlying curve that falls outside the central region is plotted. Since we want the entire curves to be shown, the show_full_outliers attribute is set to True.
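The construction described above can be sketched in plain NumPy. This is a simplified illustration, not the scikit-fda implementation: the helpers `modified_band_depth` and `functional_boxplot` are hypothetical names, the depth is the modified band depth with bands of two curves computed from pointwise ranks (ties are ignored), and the outer fences are assumed to be the central envelope inflated by `factor` times its pointwise width.

```python
import numpy as np

def modified_band_depth(curves):
    """MBD (bands of two curves) from pointwise ranks; assumes no ties."""
    n = curves.shape[0]
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1  # 1..n at each grid point
    # Number of pairs of curves whose band contains curve i at each grid point.
    containing = (ranks - 1) * (n - ranks) + (n - 1)
    return containing.mean(axis=1) / (n * (n - 1) / 2)

def functional_boxplot(curves, factor=1.5):
    depth = modified_band_depth(curves)
    order = np.argsort(depth)[::-1]                  # deepest curve first
    n_central = int(np.ceil(len(curves) / 2))
    central = curves[order[:n_central]]              # deepest 50% of the sample
    low, high = central.min(axis=0), central.max(axis=0)
    width = high - low
    fence_low, fence_high = low - factor * width, high + factor * width
    # A curve is an outlier if it leaves the fences at any grid point.
    outliers = np.any((curves < fence_low) | (curves > fence_high), axis=1)
    return curves[order[0]], (low, high), (fence_low, fence_high), outliers

t = np.linspace(0, 1, 50)
base = np.sin(2 * np.pi * t)
shifts = np.linspace(-1, 1, 11)
curves = np.vstack([base + s for s in shifts] + [base + 8.0])

median, (low, high), (fence_low, fence_high), outliers = functional_boxplot(curves)
print(np.flatnonzero(outliers))  # indices of the curves flagged as outliers
```

In this toy sample, only the curve shifted by 8 leaves the fences, so it is the only one flagged.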

In [None]:
from skfda.exploratory.visualization import Boxplot
from skfda.exploratory.depth import ModifiedBandDepth as MBD
from skfda.exploratory.depth import BandDepth as BD

boxplot = Boxplot(X)
boxplot.show_full_outliers = True
boxplot.plot()

# The results depend on the depth measure
boxplot = Boxplot(X, depth_method=BD())
boxplot.plot()

#Customizable
boxplot = Boxplot(X, prob=[0.75,0.5, 0.25])
boxplot.plot()

plt.show()

<div class="qst">

* Is it a direct generalization of the classical boxplot?
</div>

In [None]:
boxplot.outliers

In [None]:
boxplot = Boxplot(X, depth_method=MBD(), factor=1)
boxplot.plot()

X.plot(group=boxplot.outliers.astype(int),
       group_colors=["blue", "red"],
       group_names=["nonoutliers", "outliers"])

plt.show()

### Magnitude-Shape plot

Another tool for functional data visualization and outlier detection is the magnitude-shape plot (MS-plot) proposed by Dai and Genton (2018, 2019). In this method, the degree of outlyingness of a functional observation is characterized in terms of two quantities: the magnitude outlyingness (MO) and the shape outlyingness (VO). The MS-plot is the scatter plot of the values MO and VO for each functional observation. This two-dimensional representation of the data can be used, for instance, to identify clusters of functions, or detect potential outliers, either in shape or in magnitude.

The following code can be used to display the magnitude-shape plot (MS-plot) for the temperature curves of the *AEMET* weather dataset together with the original trajectories. Additionally, outliers are identified according to the MS-plot criterion and marked in red. The class **MagnitudeShapePlot** generates the MS-plot and internally uses the methods of the class **MSPlotOutlierDetector** for outlier detection.
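To give some intuition for the two coordinates of the MS-plot, the following NumPy sketch computes them for univariate curves on a common grid. It rests on an assumption that is not taken from the text above: the pointwise outlyingness is the signed deviation from the pointwise median scaled by the median absolute deviation. The function name `ms_plot_coordinates` is hypothetical, and scikit-fda may use a different depth or scaling internally.

```python
import numpy as np

def ms_plot_coordinates(curves):
    """Return (MO, VO) per curve from a median/MAD pointwise outlyingness."""
    med = np.median(curves, axis=0)
    mad = np.median(np.abs(curves - med), axis=0)
    # Signed pointwise outlyingness of each curve at each grid point.
    directional = (curves - med) / mad
    mo = directional.mean(axis=1)                         # average level
    vo = ((directional - mo[:, None]) ** 2).mean(axis=1)  # variation around it
    return mo, vo

t = np.linspace(0, 1, 100)
base = np.sin(2 * np.pi * t)
regular = np.vstack([base + s for s in np.linspace(-0.2, 0.2, 30)])
magnitude_outlier = base + 2.0                           # same shape, shifted
shape_outlier = base + 0.5 * np.sin(10 * np.pi * t)      # same level, new shape
curves = np.vstack([regular, magnitude_outlier, shape_outlier])

mo, vo = ms_plot_coordinates(curves)
print(np.argmax(np.abs(mo)), np.argmax(vo))
```

The shifted curve stands out along the MO axis and the oscillating curve along the VO axis, which is the separation the MS-plot is meant to reveal.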


In [None]:
from skfda.exploratory.visualization import MagnitudeShapePlot as MSplot

ms_plot = MSplot(X)
ms_plot.plot()

# Convert the boolean outlier mask to integer group labels (0: regular, 1: outlier)
fig = X.plot(group=ms_plot.outliers.astype(int), group_colors=["blue", "red"])
ms_plot.outliers

### Outliergram

The class **Outliergram** provides an additional method for data visualization and detection of shape outliers (Arribas-Gil and Romo, 2014). The graph is defined in terms of two related quantities: the modified epigraph index (MEI) and the MBD. Each curve is represented as a point (MEI, MBD) in the scatter plot. The MEI of a trajectory is the average over time of the fraction of curves in the sample that lie above it. 

The outliergram takes advantage of the fact that points corresponding to typical functional observations lie on a parabola whose analytical form is known. This parabola is used as a reference for the identification of shape outliers. Specifically, the degree of outlyingness of a curve is quantified in terms of its vertical distance to the parabola. The scikit-fda classes **Outliergram** and **OutliergramOutlierDetection** can be used to generate the outliergram and to detect outliers by using this criterion, respectively.
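The relationship between MEI and MBD can be checked numerically. The sketch below is a plain NumPy illustration with hypothetical helper names (`mei_and_mbd`, `outliergram_parabola`). It assumes curves on a common grid without ties, counts the curve itself in its own epigraph, and uses the parabola coefficients $a_0 = a_2 = -2/(n(n-1))$ and $a_1 = 2(n+1)/(n-1)$ from Arribas-Gil and Romo (2014).

```python
import numpy as np

def mei_and_mbd(curves):
    """Modified epigraph index and modified band depth from pointwise ranks."""
    n = curves.shape[0]
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1  # 1 = lowest curve
    # Fraction of curves lying on or above curve i (the curve itself is counted).
    mei = ((n - ranks + 1) / n).mean(axis=1)
    # Fraction of pairs of curves whose band contains curve i.
    mbd = (((ranks - 1) * (n - ranks) + (n - 1)) / (n * (n - 1) / 2)).mean(axis=1)
    return mei, mbd

def outliergram_parabola(mei, n):
    """Upper bound of MBD as a function of MEI for a sample of size n."""
    a0 = a2 = -2 / (n * (n - 1))
    a1 = 2 * (n + 1) / (n - 1)
    return a0 + a1 * mei + a2 * n**2 * mei**2

t = np.linspace(0, 1, 101)
base = np.sin(2 * np.pi * t)
regular = np.vstack([base + s for s in np.linspace(-1, 1, 10)])
shape_outlier = base + 1.5 * np.sin(6 * np.pi * t)
curves = np.vstack([regular, shape_outlier])

mei, mbd = mei_and_mbd(curves)
distance = outliergram_parabola(mei, len(curves)) - mbd
print(np.argmax(distance))  # index of the curve farthest below the parabola
```

For curves that never cross, the points (MEI, MBD) lie exactly on the parabola. A curve that crosses many others falls below it, and this vertical gap is what the outliergram uses to flag shape outliers.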

In [None]:
from skfda.exploratory.visualization import Outliergram

fig = X.plot()
fig = Outliergram(X).plot()


### Interactivity
In addition to standard plotting capabilities, most graphs generated with *scikit-fda* incorporate some interactive features. For example, the cursor can be placed at a point in the graph to display the actual coordinate values and the label of the observation. In addition, if different plots are used for visual exploration of some functional dataset, selecting a particular curve in one plot highlights the corresponding curve in the other active plots. Finally, widgets such as sliders can be used to select curves by some property, such as the label of the observation, or their depth in the sample.

In [None]:
%matplotlib widget
from matplotlib.widgets import Slider

from skfda.exploratory.visualization.representation import GraphPlot
from skfda.exploratory.visualization import MultipleDisplay

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
graph_plot = GraphPlot(X)
ms_plot = MSplot(X)
outliergram_plot = Outliergram(X)
mbd = MBD()
interactive_plot = MultipleDisplay(
    [graph_plot, ms_plot, outliergram_plot],
    criteria=mbd(X),
    sliders=Slider,
    label_sliders=["MBD"],
    fig=fig,
)

interactive_plot.plot()

axes[0, 0].set_title("Trajectories")
axes[0, 1].set_title("MS-Plot")
axes[1, 0].set_title("Outliergram")
fig.suptitle(None)

plt.show()

In [None]:
X.coordinates[0].plot() #Selecting temperatures only
plt.show()

<div class="qst">

* Explore the effect of varying the number of elements in the basis and the type of basis.
</div>

# Registration

Registration consists in applying transformations to the raw data so that the functional observations
are properly aligned. There is a variety of reasons why misalignment can occur. In some
cases, it is the result of errors in the measurement process. In others, the domain has to be
warped because the functions depend on an internal parameter, which is different from the
one observed. For periodic functions, such as the signal of a heartbeat, the starting time for
the different measurements could be different. A number of strategies can be used for registration. For instance, maxima, minima, zeros, and other landmarks can be used as reference points for alignment. Alternatively, some measure of dispersion between the observations can
be minimized.

The simplest curve alignment procedure is landmark registration. This method only takes into account a finite number of features of the curves to be registered.

We will use a dataset synthetically generated by **make_multimodal_samples()**, which in this case will be used to generate bimodal curves.

In [None]:
fd = skfda.datasets.make_multimodal_samples(
    n_samples=4,
    n_modes=2,
    std=0.002,
    mode_std=0.005,
    random_state=1,
)
fd.plot()
plt.show()

Because our dataset has been generated synthetically, we can obtain the value of the landmarks using the function **make_multimodal_landmarks()**. In general, however, numerical or other methods are needed to determine the location of the landmarks.

In [None]:
landmarks = skfda.datasets.make_multimodal_landmarks(
    n_samples=4,
    n_modes=2,
    std=0.002,
    random_state=1,
).squeeze()

print(landmarks)

The transformation will not be linear; it is the result of applying a warping function to the time argument of our curves.
After identifying the landmarks associated with the features of each curve, we can construct the warping functions with the function **landmark_elastic_registration_warping()**.

In this case we will place the landmarks at -0.5 and 0.5.
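As a rough illustration of what a landmark warping does, the sketch below builds piecewise-linear warpings with NumPy. This is a simplification: scikit-fda may use a smoother monotone interpolation. The helper `landmark_warping`, the landmark values, and the domain $[-1, 1]$ are assumptions made for the example.

```python
import numpy as np

def landmark_warping(t, landmarks, location, domain=(-1.0, 1.0)):
    """Piecewise-linear warpings h_i with h_i(location[j]) = landmarks[i, j].

    The endpoints of the domain are kept fixed.
    Returns an array of shape (n_samples, len(t)).
    """
    knots_x = np.concatenate(([domain[0]], location, [domain[1]]))
    warpings = []
    for marks in landmarks:
        knots_y = np.concatenate(([domain[0]], marks, [domain[1]]))
        warpings.append(np.interp(t, knots_x, knots_y))
    return np.array(warpings)

t = np.linspace(-1, 1, 201)
# Hypothetical landmark positions of two bimodal curves (one row per curve).
landmarks = np.array([[-0.6, 0.45], [-0.4, 0.58]])
location = np.array([-0.5, 0.5])

h = landmark_warping(t, landmarks, location)
```

Each warping is strictly increasing and maps the common target locations to the landmarks of its own curve, so composing a curve with its warping moves its features to the target positions.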

In [None]:
warping = skfda.preprocessing.registration.landmark_elastic_registration_warping(
    fd,
    landmarks,
    location=[-0.5, 0.5],
)

# Plots warping
fig = warping.plot()

# Plot landmarks
for i in range(fd.n_samples):
    fig.axes[0].scatter([-0.5, 0.5], landmarks[i])

Once we have the warping functions $h$, the registered curves can be obtained by function composition. Let
$x(t)$ be a curve; the corresponding registered curve is $\hat x(t) = x(h(t))$.
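For a discretized curve, the composition $x(h(t))$ amounts to interpolating the observed values at the warped times. The following NumPy sketch shows this for a single curve. The curve, the warping, and the use of linear interpolation are assumptions made for the example.

```python
import numpy as np

t = np.linspace(-1, 1, 401)

# Discretized curve with a single peak located at t = 0.3.
values = np.exp(-((t - 0.3) ** 2) / 0.02)

# Piecewise-linear warping that fixes the endpoints and sends 0 to 0.3.
h = np.interp(t, [-1.0, 0.0, 1.0], [-1.0, 0.3, 1.0])

# Registered curve: evaluate the original curve at the warped times h(t).
registered = np.interp(h, t, values)

print(t[np.argmax(values)], t[np.argmax(registered)])
```

Since $h(0) = 0.3$, the peak that was located at $t = 0.3$ appears at $t = 0$ in the registered curve.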

In [None]:
fd_registered = fd.compose(warping)
fig = fd_registered.plot()

fig.axes[0].scatter([-0.5, 0.5], [1, 1])

Finally, if we do not need the warping function we can obtain the registered curves directly using the function **landmark_elastic_registration()**.

If the new location of the landmarks is not specified, the mean position of each landmark is used.

In [None]:
fd_registered = skfda.preprocessing.registration.landmark_elastic_registration(
    fd,
    landmarks,
)
fd_registered.plot()

plt.scatter(np.mean(landmarks, axis=0), [1, 1])
plt.show()