<div style="font-weight: bold; color:#5D8AA8" align="center">
    <div style="font-size: xx-large">Machine Learning with Functional Data</div><br>
    <div style="font-size: x-large; color:gray">Dimensionality reduction</div><br>
    <div style="font-size: large">José Luis Torrecilla Noguerales - Universidad Autónoma de Madrid</div><br></div><hr>

**Initial setting**: This cell defines the Notebook configuration.

In [None]:
%%html
<style>
    .qst {background-color: #b1cee3; padding:10px; border-radius: 5px; border: solid 2px #5D8AA8;}
    .qst:before {font-weight: bold; content: "Questions"; display: block; margin: 0px 10px 10px 10px;}
    h1, h2, h3 {color: #5D8AA8;}
    .text_cell_render p {text-align: justify; text-justify: inter-word;}
</style>

**Packages to use:**

In [None]:
import skfda
import matplotlib.pyplot as plt
import sklearn

# Functional Principal Components Analysis (FPCA)
### AEMET temperatures example 

As seen before, the **aemet** dataset consists of daily summaries from 73 Spanish weather stations during the period 1980-2009. The dataset contains the geographic information of each station and the average over that period of daily temperature, (log) precipitation and wind speed. For this example we only consider the temperatures.

In [None]:
X, _ = skfda.datasets.fetch_aemet(return_X_y=True)

X.plot()

X = X.coordinates[0] #Selecting temperatures only
X.plot()
plt.show()

### Computation of principal components

The class **[FPCA](https://fda.readthedocs.io/en/latest/modules/preprocessing/autosummary/skfda.preprocessing.dim_reduction.FPCA.html)** implements the techniques related to principal components analysis in *scikit-fda*.
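
As a rough picture of what such a fit estimates when the curves are observed on a common grid, the sketch below runs a discretized FPCA with NumPy only: it centers the sample, estimates the covariance surface and takes the leading eigenfunctions of the discretized covariance operator. The synthetic curves, the rectangle-rule quadrature and all variable names are assumptions made for illustration; this is not the scikit-fda implementation, which may use different quadrature weights and sign conventions.

```python
import numpy as np

# Synthetic curves (an assumption for illustration): noisy combinations
# of a sine and a cosine observed on a uniform grid of [0, 1].
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
scores_true = rng.normal(size=(50, 2)) * np.array([3.0, 1.0])
curves = (
    scores_true[:, [0]] * np.sin(2 * np.pi * t)
    + scores_true[:, [1]] * np.cos(2 * np.pi * t)
    + 0.1 * rng.normal(size=(50, t.size))
)

# Center the sample and estimate the covariance surface on the grid.
centered = curves - curves.mean(axis=0)
cov = centered.T @ centered / (len(curves) - 1)

# Eigen-decomposition of the discretized covariance operator (cov * dt).
eigvals, eigvecs = np.linalg.eigh(cov * dt)
order = np.argsort(eigvals)[::-1][:2]  # Indices of the two largest eigenvalues
eigvals = eigvals[order]

# Rescale so that each eigenfunction has unit L2 norm: sum(phi**2) * dt = 1.
components = eigvecs[:, order].T / np.sqrt(dt)

# Fraction of the total variance captured by each component.
explained = eigvals / (np.trace(cov) * dt)
print("Explained variance ratio:", np.round(explained, 3))
```

The rows of `components` play the role of the principal components plotted in this section, up to sign.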

In [None]:
from skfda.exploratory.visualization import FPCAPlot
from skfda.preprocessing.dim_reduction import FPCA

fpca_aemet = FPCA(n_components=2)   #Definition of FPCA object
fpca_aemet.fit(X)             #Estimation of the principal components
fpca_aemet.components_.plot()
plt.legend(['PC1','PC2'])
plt.show()

### Principal components and smoothing
There are different ways to obtain smoother principal components. A first approach is to represent the trajectories in a basis before applying FPCA. Remember that the components inherit the characteristics of the basis functions (for example, their smoothness).

In [None]:
basis = skfda.representation.basis.BSplineBasis(n_basis=30)
X_spline = X.to_basis(basis)
X_spline.plot()

fpca_aemet_s = FPCA(n_components=2)   #Definition of FPCA object
fpca_aemet_s.fit(X_spline)            #Estimation of the principal components
fpca_aemet_s.components_.plot()
plt.legend(['PC1','PC2'])
plt.show()

<div class="qst">

* Can we plot the first 4 principal components? Are they useful?
* Can we obtain smooth components in a different way?
</div>

### Interpretation

In [None]:
FPCAPlot(
    X_spline.mean(),          #Sample mean of the original data
    fpca_aemet_s.components_, #Principal components
    factor=30,                #Scale factor for the separation between curves
    fig=plt.figure(figsize=(6, 2 * 4)),
    n_rows=2,
).plot()
plt.show()

### Projection and clustering


Once the principal components have been calculated, we can use them to project the trajectories onto a $d$-dimensional subspace, so that each trajectory is represented by a vector of scores in $\mathbb{R}^d$. Then, we can apply any multivariate method to the reduced data.

Let us point out that we are committing a slight abuse of notation by calling both the eigenfunctions that define the principal component basis and the projections of the data onto those directions "components."
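
To make the distinction concrete, the following sketch computes the projections (scores) as inner products between centered curves and the eigenfunctions, approximating each integral with the trapezoidal rule. The two orthonormal functions, the mean curve and the coefficients are hypothetical values chosen so that the result can be checked by hand; they are not taken from the AEMET data.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)
dt = t[1] - t[0]

# Trapezoidal quadrature weights for integrals over [0, 1].
weights = np.full(t.size, dt)
weights[[0, -1]] = dt / 2

# Two orthonormal functions in L2[0, 1], standing in for estimated eigenfunctions.
components = np.vstack([
    np.sqrt(2) * np.sin(2 * np.pi * t),
    np.sqrt(2) * np.cos(2 * np.pi * t),
])

# Hypothetical sample: a mean curve plus known coefficients on each component.
mean = 10 + 5 * t
true_scores = np.array([[2.0, -1.0], [0.5, 3.0], [-1.5, 0.0]])
curves = mean + true_scores @ components

# Score k of curve i is the inner product <x_i - mean, phi_k>.
scores = ((curves - mean) * weights) @ components.T
print(np.round(scores, 3))
```

Each row of `scores` is the reduced representation of one curve, and each column corresponds to one eigenfunction.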

In [None]:
X_red = fpca_aemet.transform(X) #Projection

fig, ax = plt.subplots(1,1)
ax.scatter(X_red[:,0], X_red[:,1])
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')

### Climate clustering

In [None]:
from skfda.ml.clustering import KMeans
from skfda.misc.metrics import l2_distance

n_clusters = 5
n_init = 10

fda_kmeans = KMeans(
    n_clusters=n_clusters,
    n_init=n_init,
    metric=l2_distance,
    random_state=0,
)
fda_clusters = fda_kmeans.fit_predict(X)

# Colors for each cluster
fda_color_map = {
    0: "purple",
    1: "yellow",
    2: "green",
    3: "red",
    4: "orange",
}

# Names of each climate (for this particular seed)
climate_names = {
    0: "Cold-mountain",
    1: "Mediterranean",
    2: "Atlantic",
    3: "Subtropical",
    4: "Continental",
}

fig, ax = plt.subplots(1, 1)
for cluster in range(n_clusters):
    selection = fda_clusters == cluster
    ax.scatter(
        -X_red[selection, 0],
        -X_red[selection, 1],
        color=fda_color_map[cluster],
        label=climate_names[cluster],
    )

ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
ax.legend()

X_red_s = fpca_aemet_s.transform(X_spline)
fig, ax = plt.subplots(1, 1)
for cluster in range(n_clusters):
    selection = fda_clusters == cluster
    ax.scatter(
        -X_red_s[selection, 0],
        -X_red_s[selection, 1],
        color=fda_color_map[cluster],
        label=climate_names[cluster],
    )

ax.set_xlabel('Smooth first principal component')
ax.set_ylabel('Smooth second principal component')
ax.legend()
plt.show()


### FPCA and classification
For this example, we will use the growth curves from the Berkeley study. The objective will be to correctly classify girls and boys.

In [None]:
#Loading Berkeley growth study dataset
X, y = skfda.datasets.fetch_growth(return_X_y=True)
X.plot(group=y)


fpca_growth = FPCA(n_components=2)   #Definition of FPCA object
fpca_growth.fit(X)             #Estimation of the principal components
fpca_growth.components_.plot()
plt.legend(['PC1','PC2'])

X_red = fpca_growth.transform(X) #Projection
fig, ax = plt.subplots(1,1)
ax.scatter(X_red[:,0], X_red[:,1], c=y)
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')


Now we are going to study the difference between using the reduced data with FPCA and a multivariate classifier versus using the complete trajectories with a functional classifier.

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import GridSearchCV 
from skfda.ml.classification import KNeighborsClassifier as fKNN

#FPCA + multivariate k-NN with Euclidean distance
knn = KNN()
gcv = GridSearchCV(knn, {'n_neighbors':range(1,9,2)})
gcv.fit(X_red,y)
print('k-nn cv accuracy with the reduced data:',"{:.4f}".format(gcv.best_score_))

#Functional k-NN with L2 distance
fknn = fKNN()
gcvf = GridSearchCV(fknn, {'n_neighbors':range(1,9,2)})
gcvf.fit(X,y)
print('Functional k-nn cv accuracy with the original trajectories:',"{:.4f}".format(gcvf.best_score_))


In [None]:
gcv.cv_results_

<div class="qst">
    
* Why do we always obtain the same values when running the code again?
    
* Try to vary the number of principal components. What happens? How can we choose the best value?
</div>

If you want to randomize the partitions, pass a `StratifiedKFold` object with the parameter `shuffle=True` (and, optionally, a seed in the `random_state` parameter) as the `cv` argument of `GridSearchCV`.
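
A minimal sketch of this option with scikit-learn is shown below. The synthetic data from `make_classification` is only a stand-in for the reduced growth data, and the names `X_demo`, `y_demo` and `search` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the reduced data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# Shuffled, stratified folds; fixing random_state keeps the split reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(1, 9, 2)},
    cv=cv,
)
search.fit(X_demo, y_demo)
print("cv accuracy with shuffled folds: {:.4f}".format(search.best_score_))
```

Changing or removing `random_state` produces different partitions, and therefore possibly different scores, on each run.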

### Pipelines

In [None]:
from sklearn.pipeline import Pipeline

import skfda
from skfda.preprocessing.dim_reduction import FPCA

fpca = FPCA()
classifier = GridSearchCV(KNN(), {'n_neighbors':range(1,7,2)})  # You can choose your favourite classifier

pipeline = Pipeline([
    ("fpca", fpca),
    ("classifier", classifier),
])

pipe_gcv = GridSearchCV(
    pipeline,
    param_grid={'fpca__n_components': range(1, 11)}
)

pipe_gcv.fit(X, y)
print('k-nn cv accuracy with the reduced data:',"{:.4f}".format(pipe_gcv.best_score_))
print(pipe_gcv.best_params_)


It is important to note that this hyperparameter selection approach is inefficient: FPCA is recomputed from scratch for each candidate number of components. It is possible to create a more efficient version by having FPCA return the maximum number of components and introducing a third element in the pipeline that selects the necessary number of components and caches the results, similar to [this example for the multivariate case](https://github.com/scikit-learn/scikit-learn/issues/19649#issuecomment-793282436).
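
The idea can be sketched in the multivariate setting as follows, using scikit-learn's `PCA` as a stand-in for FPCA and synthetic data. The `FirstComponents` transformer is a hypothetical helper written for this example (it is not part of scikit-learn or scikit-fda), and the sketch relies on the fact that the first $k$ columns of a 10-component projection coincide with a $k$-component projection.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


class FirstComponents(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the first ``n_components`` columns."""

    def __init__(self, n_components=1):
        self.n_components = n_components

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, : self.n_components]


# Synthetic stand-in for the data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=120, n_features=20, random_state=0)

cache_dir = mkdtemp()
pipeline = Pipeline(
    [
        ("pca", PCA(n_components=10)),    # Maximum number of components
        ("select", FirstComponents()),    # Number of components actually used
        ("classifier", KNeighborsClassifier()),
    ],
    memory=cache_dir,  # Fitted transformers are cached and reused across the grid
)

search = GridSearchCV(
    pipeline,
    param_grid={
        "select__n_components": range(1, 11),
        "classifier__n_neighbors": range(1, 7, 2),
    },
)
search.fit(X_demo, y_demo)
print(search.best_params_, "{:.4f}".format(search.best_score_))

rmtree(cache_dir)  # Remove the temporary cache
```

Because the `pca` step has the same parameters for every point of the grid, its fit on a given training fold is computed once and then read from the cache; only the cheap selection step and the classifier change.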

# Variable selection


In [None]:
from skfda.preprocessing.dim_reduction import variable_selection as vs

#Variable selection
rkvs = vs.RKHSVariableSelection(n_features_to_select=2)
rkvs.fit(X,y)

#get the impact points
point_mask = rkvs.get_support()
points = X.grid_points[0][point_mask]
print(points)

#Projection and plotting
X_rkvs = rkvs.transform(X)
fig, ax = plt.subplots(1,1)
ax.scatter(X_rkvs[:,0], X_rkvs[:,1], c=y)
ax.set_xlabel('First variable')
ax.set_ylabel('Second variable')
plt.show()

#Classification
knn = KNN()
gcv = GridSearchCV(knn, {'n_neighbors':range(1,9,2)})
gcv.fit(X_rkvs,y)
print('k-nn cv accuracy with the reduced data:',"{:.4f}".format(gcv.best_score_))



<div class="qst">
    
* Is this procedure for estimating classification error correct?
    
</div>