# **Clustering Human Activities**

In this problem set you will work with the [Human Activity Recognition Using Smartphones](https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones) data set from the UCI Machine Learning Repository.

The data set contains normalized 3-axial readings from smartphone-embedded accelerometer and gyroscope sensors, collected from 30 volunteers performing 6 different activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

Our goal is to use clustering algorithms and PCA to separate the readings into clusters based on activity. *Ideally, we will obtain 6 clusters, one for each activity.*

# [DSLC stages]: Data cleaning and pre-processing

Let's start by loading in the libraries that we will need.


In [2]:
%pip install pandas numpy plotly-express scikit-learn


In [2]:
import pandas as pd
import numpy as np
from random import sample
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples, rand_score, adjusted_rand_score
from itertools import product

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)


## Data source overview

## Step 1: Review background information on data collection

Please review the README.txt for information on data collection.

The data has already been cleaned. It is available in the following files:

Measurements, described in features.txt and features_info.txt, with the subject id from subject_train.txt as the first column: **measurements_train.csv**

Activity labels corresponding to the measurements:
**y_train.txt**

The activities are encoded as follows:

1 WALKING

2 WALKING_UPSTAIRS

3 WALKING_DOWNSTAIRS

4 SITTING

5 STANDING

6 LAYING

### Answering questions about the background information


- *What does each variable measure?*

The data consist of 3-axial acceleration and 3-axial gyroscope (angular velocity) signals. Features prefixed with `t` are computed from the time-domain signals, while those prefixed with `f` are computed from the frequency-domain signals obtained by applying a Fourier transform to the time series. Many features are summary statistics of these signals, such as means and standard deviations.

- *How was the data collected?*

The data were collected as time series from a smartphone's accelerometer and gyroscope. All other features are derived from these signals.

- *What are the observational units?*

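Each observation (row) is one fixed-width window of sensor readings for one subject performing one activity.

All features are normalized and bounded within [-1, 1]. Before normalization, acceleration is measured in standard gravity units (g) and angular velocity from the gyroscope in radians per second.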

In [7]:
#TO DO: Add code here to load the measurement and training data
measurement_train = pd.read_csv("measurements_train.csv")
y_train = pd.read_csv("y_train.txt", names=["y"])

#TO DO: Add code to examine the data frames once loaded
print(measurement_train.describe())
y_train.describe()


# You may add any additional code here to examine the data, such as
# looking for missingness, data format, invalid values, column names,
# variable types, and incomplete data

                id  tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  \
count  7352.000000        7352.000000        7352.000000        7352.000000   
mean     17.413085           0.274488          -0.017695          -0.109141   
std       8.975143           0.070261           0.040811           0.056635   
min       1.000000          -1.000000          -1.000000          -1.000000   
25%       8.000000           0.262975          -0.024863          -0.120993   
50%      19.000000           0.277193          -0.017219          -0.108676   
75%      26.000000           0.288461          -0.010783          -0.097794   
max      30.000000           1.000000           1.000000           1.000000   

       tBodyAcc-std()-X  tBodyAcc-std()-Y  tBodyAcc-std()-Z  tBodyAcc-mad()-X  \
count       7352.000000       7352.000000       7352.000000       7352.000000   
mean          -0.605438         -0.510938         -0.604754         -0.630512   
std            0.448734          0.502645    

              y
count    7352.0
mean   3.643362
std    1.744802
min         1.0
25%         2.0
50%         4.0
75%         5.0
max         6.0


# Clustering Analysis

We will apply K-means clustering and hierarchical clustering to the data.
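As a point of reference, here is a minimal sketch of both methods on synthetic data rather than on the measurements themselves. The blob data and the names `X_demo`, `kmeans_labels`, and `hier_labels` are assumptions made for illustration; the calls are the scikit-learn `KMeans` and `AgglomerativeClustering` estimators imported earlier in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the measurements (assumption for illustration):
# 300 points scattered tightly around 3 well-separated centers
X_demo, true_labels = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# K-means with 3 clusters; n_init restarts guard against a poor initialization
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
kmeans_labels = kmeans_demo.labels_

# Hierarchical (agglomerative) clustering with Ward linkage,
# which is defined in terms of Euclidean distance
hier_demo = AgglomerativeClustering(n_clusters=3, linkage="ward")
hier_labels = hier_demo.fit_predict(X_demo)

# Compare each clustering with the known grouping of the synthetic data
ari_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
ari_hier = adjusted_rand_score(true_labels, hier_labels)
print(ari_kmeans, ari_hier)
```

On the real data you would pass the measurement columns (without the subject id) in place of `X_demo` and ask for 6 clusters. Cluster numbering is arbitrary, so the two label vectors need not use the same integers for the same group.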

In [None]:
#TO DO: Write code here to apply hierarchical clustering to the data
#       using Euclidean distance and ward linkage

#Add additional code to examine the clusters

In [None]:
#TO DO: Write code here to apply KMeans clustering to the data
# Add additional code to examine the clusters


In [None]:
#This code is provided to you to examine the clusters

def sample_cluster(activity, cluster, seed):
    # sample 15 observations from each cluster
    samples_by_cluster = (pd.DataFrame({"activity": activity,
                                        "cluster": cluster})
          .groupby("cluster")
          .sample(15, random_state = seed))
    # make sure the index is repeated in each group
    samples_by_cluster.index = 6 * list(np.arange(15))
    # pivot to wider format
    samples_by_cluster = samples_by_cluster.pivot(columns="cluster").droplevel(axis=1, level=0)
    # add word cluster to column names
    samples_by_cluster.columns = ["cluster_" + str(i) for i in samples_by_cluster.columns]

    return samples_by_cluster

In [None]:
# TODO: Call the cluster-sampling function defined above with seed = 10
# to examine the clusters from k-means clustering


In [None]:
# TODO: Call the sample_cluster function here with your activity labels and
# hierarchical clusters. Use seed = 10


**Question:** Based on the samples of clusters from both methods, do any clusters correspond to specific activities and, if so, which?

**TODO: answer here**

In [None]:

#TO DO: Write code here to
# 1. Randomly sample 1000 observations using random_state = 1111
# 2. Then, using the sample, create a single figure with scatter plots of
#    tBodyAcc-mean()-X against tBodyAcc-mean()-Y showing
#      i) the data colored by activity label
#      ii) the data colored by the hierarchical cluster label
#     iii) the data colored by the k-means cluster label
# Note that the labels do not mean the same thing in each plot:
# in the activity-colored plot, 6 means LAYING, but in the
# cluster-colored plots, 6 only identifies the 6th cluster and
# does not inherently correspond to any activity


# Calculating clustering metrics

**TODO: Edit this cell to report the following clustering metrics**

Clustering Metrics
1. K-Means Inertia:
2. Total Within-Cluster Sum of Squares for K-Means:
3. Total Within-Cluster Sum of Squares for Hierarchical Clustering:
4. Silhouette Score for K-Means:
5. Silhouette Score for Hierarchical Clustering:
6. Rand Score and Adjusted Rand Score for the K-Means and Hierarchical Clustering:
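For reference, the total within-cluster sum of squares is $\sum_k \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $\mu_k$ is the mean of the points in cluster $C_k$. The sketch below shows one possible way to compute it with NumPy on a tiny hand-checkable example; the function name `within_cluster_ss_demo` and the toy arrays are hypothetical and are not part of the assignment's starter code.

```python
import numpy as np

def within_cluster_ss_demo(data, clusters):
    """Sum, over all clusters, of squared Euclidean distances
    from each point to the mean of its own cluster."""
    data = np.asarray(data, dtype=float)
    clusters = np.asarray(clusters)
    total = 0.0
    for label in np.unique(clusters):
        members = data[clusters == label]   # rows assigned to this cluster
        centroid = members.mean(axis=0)     # cluster mean
        total += ((members - centroid) ** 2).sum()
    return total

# Toy example (assumption for illustration): two clusters of two points each
toy_data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]])
toy_clusters = np.array([0, 0, 1, 1])

# Cluster 0 has centroid (1, 0) and contributes 1 + 1 = 2;
# cluster 1 has centroid (10, 12) and contributes 4 + 4 = 8
wss_toy = within_cluster_ss_demo(toy_data, toy_clusters)
print(wss_toy)
```

For a fitted scikit-learn `KMeans` model, this quantity should closely match the `inertia_` attribute, which gives a convenient sanity check for metrics 1 and 2.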



In [None]:
#TODO: Write a function that given the data and clusters
#  returns the within cluster sum of squares, WSS

def tot_within_sum_of_square(data, clusters):
    return 0  # placeholder: replace with your implementation


In [None]:
#TODO: Add any additional code you used to report the clustering metrics above


# Clustering Data after PCA

Now we'll apply PCA and see how the data clusters after PC transformation.
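The sketch below illustrates one way to carry out these PCA steps with a singular value decomposition of the centered data, using synthetic data. The random matrix and the names `X_demo`, `variance_table`, and `pc_scores` are assumptions for illustration, and centering without rescaling is a choice made here, not a requirement stated in the assignment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 rows, 5 columns with very different spreads
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Center each column, then decompose: X_centered = U @ diag(s) @ Vt
X_centered = X_demo - X_demo.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each PC, its proportion, and the running total
explained_var = singular_values ** 2 / (X_centered.shape[0] - 1)
prop_var = explained_var / explained_var.sum()
cum_var = np.cumsum(prop_var)

pc_names = [f"PC{i + 1}" for i in range(len(prop_var))]
variance_table = pd.DataFrame(
    {"PC": pc_names, "prop_var": prop_var, "cum_var": cum_var}
)
print(variance_table)

# PCA-transformed data: centered data multiplied by the loadings (columns of V)
loadings = Vt.T
pc_scores = pd.DataFrame(X_centered @ loadings, columns=pc_names)
print(pc_scores.head())
```

Plotting `prop_var` against the PC number gives a scree plot, and `cum_var` shows how many components are needed to reach a chosen share of the total variability.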

In [None]:
#TODO: Add code here to perform PCA on the measurements data


In [None]:
#TODO: Calculate the proportion of variance explained by each PC
# and the cumulative variability explained by each PC
# Show the table of these values for the first 10 PCs.


In [None]:
# Create a scree plot of the PCs against the percentage of variability
# explained for the first 10 PCs
# Explain how many PCs you would keep in subsequent analysis and why


In [None]:
# TODO: create the PCA-transformed dataset

# multiply the original data and the PCA loadings

# make the data easier to work with by
# changing the column names to PC1, PC2, etc

# look at the object


In [None]:
#TO DO: Write code here using the entire PCA-transformed data set
#    to create a single figure containing 3 scatter plots
#    of PC1 against PC2 showing
#      i) the data colored by activity label
#      ii) the data colored by the original hierarchical cluster label
#     iii) the data colored by the original k-means cluster label

# Using Clustering and PCA

Now we'll apply clustering to the PCA-transformed data.

In [None]:
#TODO: Add code here using the number of PCs you chose to use
# 1. Perform hierarchical clustering on the PCA-transformed data
# 2. Perform k-means clustering on the PCA-transformed data

In [None]:
#TO DO: Write code here using the entire PCA-transformed data set
#    to create a single figure containing 3 scatter plots
#    of PC1 against PC2 showing
#      i) the data colored by activity label
#      ii) the data colored by the PCA-transformed hierarchical cluster label
#     iii) the data colored by the PCA-transformed k-means cluster label

**TODO: Edit this cell to report the following clustering metrics on the PCA-Transformed Data**

Clustering Metrics
1. K-Means Inertia:
2. Total Within-Cluster Sum of Squares for K-Means:
3. Total Within-Cluster Sum of Squares for Hierarchical Clustering:
4. Silhouette Score for K-Means:
5. Silhouette Score for Hierarchical Clustering:
6. Rand Score and Adjusted Rand Score for the K-Means and Hierarchical Clustering:

In [None]:
#Add code here to calculate these metrics for the PCA-Transformed clustering

# Discussion

**TODO: Answer the following question:**

Reflection: How did applying PCA and clustering together change the quality and visualization of the clustering results?

Citation: This problem set is adapted from Exercise 27 of Chapter 6 of

Yu, B., & Barter, R. L. (2024). Veridical data science: The practice of responsible data analysis and decision making. The MIT Press.
