**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install umap-learn
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from class_utils.plots import crosstab_plot

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import umap

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("UCI%20HAR%20Dataset.zip"), directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Dimensionality Reduction on the Human Activity Recognition Dataset

In the next example, we are going to apply dimensionality reduction to the [Human Activity Recognition dataset](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones). The data was collected using smartphone sensors (accelerometer, gyroscope) and transformed into a number of different summary features such as the mean, standard deviation, interquartile range (IQR), energy, entropy, etc.

There are 6 different activities:

* walking;
* walking upstairs;
* walking downstairs;
* sitting;
* standing;
* laying.

### Loading the Data

The data is a series of numbers separated by spaces – here is the beginning of the first line.



In [None]:
with open("data/UCI HAR Dataset/train/X_train.txt", "r") as file:
    firstline = file.readline()
    print(firstline[:150], "...")

Data in this format can be loaded easily using `np.loadtxt`. When loading the target labels, we subtract 1, because the label indices in the files start from 1 and we need them to be 0-based.



In [None]:
X_train = np.loadtxt("data/UCI HAR Dataset/train/X_train.txt")
Y_train = np.loadtxt('data/UCI HAR Dataset/train/y_train.txt').astype(int) - 1

X_test = np.loadtxt("data/UCI HAR Dataset/test/X_test.txt")
Y_test = np.loadtxt('data/UCI HAR Dataset/test/y_test.txt').astype(int) - 1
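
The label shift can be checked on a tiny in-memory example. This is only a sketch: `fake_label_file` is a hypothetical stand-in for `y_train.txt`, assumed to hold one 1-based label per line.

```python
import io
import numpy as np

# Hypothetical stand-in for y_train.txt: one 1-based label per line.
fake_label_file = io.StringIO("5\n5\n1\n6\n2\n")

# np.loadtxt returns floats, so we cast to int before shifting to 0-based.
labels = np.loadtxt(fake_label_file).astype(int) - 1
print(labels)  # [4 4 0 5 1]
```

With 0-based labels, an array of class names can be indexed directly by label.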

To find out more about the dataset, we can have a look at the README.



In [None]:
with open("data/UCI HAR Dataset/README.txt", 'r', errors="ignore") as file:
    print(file.read())

One interesting thing that the README mentions is that the features are already normalized into the range of [-1, 1]. This is good because it allows us to drop the preprocessing step again.



In [None]:
# WE DO NOT NEED THIS BECAUSE ACCORDING TO THE DOCS, THE DATA IS ALREADY
# NORMALIZED TO THE RANGE OF [-1, 1]

# input_preproc = make_pipeline(
#     SimpleImputer(),
#     StandardScaler()
# )

# X_train_preproc = input_preproc.fit_transform(X_train.reshape(X_train.shape[0], -1))
# X_train_preproc = X_train_preproc.reshape(X_train.shape)
# X_train = X_train_preproc

# X_test_preproc = input_preproc.transform(X_test.reshape(X_test.shape[0], -1))
# X_test_preproc = X_test_preproc.reshape(X_test.shape)
# X_test = X_test_preproc
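
Rather than relying on the README alone, the claimed range can be verified directly. The helper below is a minimal sketch; `within_range` is a hypothetical name introduced here for illustration, and it is demonstrated on a small made-up array rather than on the real data.

```python
import numpy as np

def within_range(X, low=-1.0, high=1.0):
    """Return True if every entry of X lies in the closed interval [low, high]."""
    X = np.asarray(X)
    return bool(X.min() >= low and X.max() <= high)

# Made-up data for illustration only.
demo = np.array([[-1.0, 0.25], [0.5, 1.0]])
print(within_range(demo))      # True
print(within_range(demo * 2))  # False
```

In this notebook, the same check could be run as `within_range(X_train)` and `within_range(X_test)`.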

Finally, we are going to read the list of class names from a file so that we can use it when analyzing the results later.



In [None]:
class_names = []

with open("data/UCI HAR Dataset/activity_labels.txt", 'r', errors="ignore") as file:
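    for line in file:
        # each line looks like "1 WALKING": drop the index, keep the name
        if line.strip():
            class_names.append(line.split(maxsplit=1)[1].strip())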

class_names = np.array(class_names)
print(class_names)

---
### Task 1: Apply PCA to the Dataset

**In the cell below, apply PCA to the dataset (use `X_train` and `Y_train`) and plot the resulting points in 2D, coloured by class. Show the class names in the legend so that the plot is easy to interpret.** 

---


In [None]:


# ----



---
### Task 2: Interpreting the PCA Plot

**In the cell below, insert a qualitative description of what you can see in the PCA plot.** 

* What have you learned about the structure of the space?
* Based on the plot, which classes do you think it would be easy for a shallow classifier to separate correctly?

---


---
### Task 3: Apply UMAP to the Dataset

**In the cell below, apply UMAP to the dataset (use `X_train` and `Y_train`) and plot the resulting points in 2D, coloured by class. Show the class names in the legend so that the plot is easy to interpret.** 

---


In [None]:
um = umap.UMAP(verbose=True)
points_umap = um.fit_transform(X_train)

# shuffle the points so that no class is systematically drawn on top
perm_ind = np.random.permutation(points_umap.shape[0])
xx = points_umap[perm_ind]
yy = Y_train[perm_ind]

plt.figure(figsize=[10, 7])
# plt.cm.get_cmap is deprecated in newer matplotlib releases; use plt.get_cmap
cmap = plt.get_cmap('jet', len(class_names))
# vmin/vmax centre each integer label on its own colour band
plt.scatter(xx[:, 0], xx[:, 1], c=yy,
            cmap=cmap, vmin=-0.5, vmax=len(class_names) - 0.5,
            rasterized=True)
cbar = plt.colorbar()
cbar.set_ticks(range(len(class_names)))
cbar.set_ticklabels(class_names)
plt.xlabel("dim 1")
plt.ylabel("dim 2")

---
### Task 4: Interpreting the UMAP Plot

**In the cell below, insert a qualitative description of what you can see in the UMAP plot.** 

* What does the structure of the space look like according to this plot?
* Based on the plot, which classes do you think it would be easy for a shallow classifier to separate correctly?
* How does your insight differ from what you learned from the PCA plot?
* Why is there a difference?

---


### Training a Simple Classifier

Next, we are going to train a simple classifier on the dataset to see whether any of the intuitions we gathered will be borne out.



In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train);

Next, we are going to evaluate the model on the test set and, in particular, plot the confusion matrix. What you should see is that it is easy to tell apart activities that involve a lot of movement from activities that are more static.

Within these groups, results are more mixed. On the whole, though, one can still get most of the samples right even with a very simple classifier and no hyperparameter tuning. The "laying" class seems to be especially easy to recognize.



In [None]:
# note: y_test holds the model's predictions, Y_test holds the true labels
y_test = model.predict(X_test).astype(int)
acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))

In [None]:
# build a dataframe with the true labels (Y_test) and the predictions (y_test)
df = pd.DataFrame({"Y_test": Y_test, "y_test": y_test})
crosstab_plot("y_test", "Y_test", data=df)
plt.gca().set_xticklabels(class_names);
plt.gca().set_yticklabels(reversed(class_names));
plt.xlabel("Predicted")
plt.ylabel("Actual")
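
To make the idea of a confusion matrix concrete, here is a minimal sketch that counts (actual, predicted) pairs with plain NumPy. The helper name `confusion_counts` and the toy labels are assumptions made for illustration; this is not how `crosstab_plot` is implemented.

```python
import numpy as np

def confusion_counts(actual, predicted, n_classes):
    """Count label pairs: rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (row, col) pair repeats
    np.add.at(cm, (actual, predicted), 1)
    return cm

# Toy labels for illustration only (three classes).
actual = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_counts(actual, predicted, 3)
print(cm)

# Per-class recall: correctly predicted samples divided by all samples of that class.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

The diagonal holds the correctly classified samples. Large off-diagonal entries point to pairs of classes that the model confuses with each other.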