# PDIoT Data Cleaning

Hopefully by now you have collected some human activity recognition (HAR) data. We are asking you to collect data from two sensors: the Respeck (25 Hz, accelerometer and gyroscope) and the Thingy (25 Hz, accelerometer, gyroscope and magnetometer).

The Respeck is worn on the lower left ribcage, and the Thingy is worn in the front right pocket of the trousers.

In this notebook we will explore some example data and show how to clean it.

<br>

<hr>

<b><font color='#ff5271'> ‼️ Important: Any changes made to this notebook will not be saved. If you wish to run code from this notebook or make changes, please make a copy or download the .ipynb notebook file to your local computer.</font> </b>

<hr>

<br>

## <u> Accelerometer </u>
* Measures acceleration (including gravity)
* Observing the change in the direction of gravity is often more useful than the linear acceleration due to movement
* Sensor values given in g along the axis of interest
* Placing our sensor flat on the table should give -1g on the Z axis and 0g on the other axes
* Cheap to buy and low power consumption
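
As a quick sanity check on the points above: when the sensor is stationary, the accelerometer measures only gravity, so the magnitude of the (x, y, z) vector should be close to 1 g whatever the orientation. The sketch below assumes the `accel_x`, `accel_y` and `accel_z` column names used later in this notebook. The helper name `accel_magnitude` and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd


def accel_magnitude(dataframe: pd.DataFrame) -> pd.Series:
    """Return the per-sample magnitude (in g) of the acceleration vector."""
    return np.sqrt(
        dataframe["accel_x"] ** 2
        + dataframe["accel_y"] ** 2
        + dataframe["accel_z"] ** 2
    )


# Illustrative samples: flat on a table, and a tilted but stationary sensor
example = pd.DataFrame({
    "accel_x": [0.0, -0.48],
    "accel_y": [0.0, -0.88],
    "accel_z": [-1.0, 0.29],
})
print(accel_magnitude(example))
```

A magnitude that stays far from 1 g while the sensor is at rest may point to a calibration or unit problem.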

## <u> Gyroscope </u>
* Measures angular velocity
* Sensor values given in degrees per second (deg/sec) about the axis of interest
* Placing our sensor flat on the table should give 0 values along all axes
* Higher power consumption

# Basic Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from typing import Tuple
import matplotlib.ticker as ticker
# %matplotlib notebook

In [None]:
# you do not need this if you are not working on google colab!
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


# Reading the header

Each file starts with a 5-line header, which records the following metadata about the recording:

* sensor type (Respeck or Thingy)
* activity type
* activity code (you can find the mapping between activities and their codes in the Constants file on the app)
* subject ID (always a student number)
* notes (can be empty)

In [None]:
# When modifying your own notebook, you can assign the filename_respeck variable to the filepath of
# your RESpeck data file
filename_respeck = "/content/gdrive/Shareddrives/Ink/PDIoT/Respeck_s2255740_Sitting_06-09-2023_21-26-26.csv"
# size of the header
header_size = 5

# Open the RESpeck data file for reading using a context manager
with open(filename_respeck) as f:
  # Read and process the header lines
  head = [next(f).rstrip().split('# ')[1] for x in range(header_size)]
  # print each line in the header
  for line in head:
    print(line)

Sensor type: Respeck
Activity type: Sitting
Activity code: 0
Subject id: s2255640
Notes: hyperventilate



# Getting the recording metadata
It's useful to store the metadata about each recording, as you will need it later.

In [None]:
sensor_type = ""
activity_type = ""
activity_code = -1
subject_id = ""
notes = ""

with open(filename_respeck) as f:
    head = [next(f).rstrip().split('# ')[1] for x in range(header_size)]
    for line in head:
        print(line)

        # Split on the first colon only, so values that contain ':' stay intact
        title, value = line.split(":", 1)

        if title == "Sensor type":
            sensor_type = value.strip()
        elif title == "Activity type":
            activity_type = value.strip()
        elif title == "Activity code":
            activity_code = int(value.strip())
        elif title == "Subject id":
            subject_id = value.strip()
        elif title == "Notes":
            notes = value.strip()

Sensor type: Respeck
Activity type: Sitting
Activity code: 0
Subject id: s2255640
Notes: hyperventilate


You will probably need this again later, so it is worth wrapping it in a function.

In [None]:
def extract_header_info(filename: str, header_size: int = 5) -> Tuple[str, str, int, str, str]:
    """
    :param filename: Path to recording file.
    :param header_size: The size of the header, defaults to 5.
    :returns: A 5-tuple containing the sensor type, activity type, activity code, subject id and any notes.
    """
    sensor_type = ""
    activity_type = ""
    activity_code = -1
    subject_id = ""
    notes = ""

    with open(filename) as f:
        head = [next(f).rstrip().split('# ')[1] for x in range(header_size)]
        for line in head:
            print(line)

            # Split on the first colon only, so values that contain ':' stay intact
            title, value = line.split(":", 1)

            if title == "Sensor type":
                sensor_type = value.strip()
            elif title == "Activity type":
                activity_type = value.strip()
            elif title == "Activity code":
                activity_code = int(value.strip())
            elif title == "Subject id":
                subject_id = value.strip()
            elif title == "Notes":
                notes = value.strip()

    return sensor_type, activity_type, activity_code, subject_id, notes

Now we can get the variables by calling the function:

In [None]:
sensor_type, activity_type, activity_code, subject_id, notes = extract_header_info(filename=filename_respeck)

Sensor type: Respeck
Activity type: Sitting
Activity code: 0
Subject id: s2255640
Notes: hyperventilate
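
If you would rather not unpack a 5-tuple, one alternative is to return the header as a dictionary. The sketch below is not part of the original notebook: the name `parse_header` is hypothetical, and the `# Key: value` line layout is an assumption inferred from the parsing code above. It reads from any text stream, so it can be tried on an in-memory sample.

```python
import io
from typing import Dict, TextIO


def parse_header(stream: TextIO, header_size: int = 5) -> Dict[str, str]:
    """Read `header_size` lines shaped like '# Key: value' into a dict."""
    info: Dict[str, str] = {}
    for _ in range(header_size):
        line = next(stream).rstrip()
        # Drop the leading '# ' marker if it is present
        if line.startswith("# "):
            line = line[2:]
        # partition() splits on the first ':' only
        key, _sep, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


# Made-up header with a colon inside the notes field
sample = io.StringIO(
    "# Sensor type: Respeck\n"
    "# Activity type: Sitting\n"
    "# Activity code: 0\n"
    "# Subject id: s0000000\n"
    "# Notes: started at 21:26\n"
)
header = parse_header(sample)
print(header["Notes"])
```

With a real file you would pass the handle returned by `open(filename)` and convert `header["Activity code"]` to `int` yourself.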


# Reading the file
You can load the data itself using pandas. Passing `header=header_size` tells pandas that the column names are on line 5 (counting from 0), so the 5 metadata lines before it are skipped.

In [None]:
df_respeck = pd.read_csv(filename_respeck, header=header_size)
df_respeck

Unnamed: 0,timestamp,accel_x,accel_y,accel_z,gyro_x,gyro_y,gyro_z
0,1694028326947,-0.481445,-0.882874,0.285095,0.640625,-1.031250,-0.203125
1,1694028327005,-0.493652,-0.862610,0.272644,-1.109375,0.015625,-0.484375
2,1694028327022,-0.491211,-0.871399,0.271179,0.328125,-0.656250,-0.562500
3,1694028327060,-0.486816,-0.869202,0.276062,0.531250,-0.390625,-0.421875
4,1694028327097,-0.492432,-0.877991,0.278503,-0.140625,-0.500000,0.015625
...,...,...,...,...,...,...,...
1483,1694028385860,-0.510010,-0.846252,0.320007,0.546875,0.156250,0.265625
1484,1694028385916,-0.508057,-0.863831,0.303162,0.265625,0.000000,-0.187500
1485,1694028385953,-0.498291,-0.850647,0.302185,-0.062500,-0.140625,-0.515625
1486,1694028385991,-0.517822,-0.850891,0.304382,-0.125000,-0.359375,-0.359375
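
To see what the `header` argument does, the sketch below reads a small in-memory file twice, once with `header=5` and once with `skiprows=5`. The file contents are made up for illustration and only mimic the layout described above. Both calls should produce the same dataframe.

```python
import io

import pandas as pd

# Made-up file: 5 metadata lines, then the column names, then the data
raw = (
    "# Sensor type: Respeck\n"
    "# Activity type: Sitting\n"
    "# Activity code: 0\n"
    "# Subject id: s0000000\n"
    "# Notes: \n"
    "timestamp,accel_x,accel_y,accel_z,gyro_x,gyro_y,gyro_z\n"
    "1694028326947,-0.48,-0.88,0.29,0.64,-1.03,-0.20\n"
    "1694028327005,-0.49,-0.86,0.27,-1.11,0.02,-0.48\n"
)

# header=5: line 5 (0-based) holds the column names, earlier lines are ignored
df_a = pd.read_csv(io.StringIO(raw), header=5)
# skiprows=5: drop the first 5 lines, then treat the next line as the header
df_b = pd.read_csv(io.StringIO(raw), skiprows=5)

print(df_a.equals(df_b))
print(list(df_a.columns))
```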


To keep the recording metadata for later, we can add each value as a new column:

In [None]:
df_respeck['sensor_type'] = sensor_type
df_respeck['activity_type'] = activity_type
df_respeck['activity_code'] = activity_code
df_respeck['subject_id'] = subject_id
df_respeck['notes'] = notes

df_respeck

Unnamed: 0,timestamp,accel_x,accel_y,accel_z,gyro_x,gyro_y,gyro_z,sensor_type,activity_type,activity_code,subject_id,notes
0,1694028326947,-0.481445,-0.882874,0.285095,0.640625,-1.031250,-0.203125,Respeck,Sitting,0,s2255640,hyperventilate
1,1694028327005,-0.493652,-0.862610,0.272644,-1.109375,0.015625,-0.484375,Respeck,Sitting,0,s2255640,hyperventilate
2,1694028327022,-0.491211,-0.871399,0.271179,0.328125,-0.656250,-0.562500,Respeck,Sitting,0,s2255640,hyperventilate
3,1694028327060,-0.486816,-0.869202,0.276062,0.531250,-0.390625,-0.421875,Respeck,Sitting,0,s2255640,hyperventilate
4,1694028327097,-0.492432,-0.877991,0.278503,-0.140625,-0.500000,0.015625,Respeck,Sitting,0,s2255640,hyperventilate
...,...,...,...,...,...,...,...,...,...,...,...,...
1483,1694028385860,-0.510010,-0.846252,0.320007,0.546875,0.156250,0.265625,Respeck,Sitting,0,s2255640,hyperventilate
1484,1694028385916,-0.508057,-0.863831,0.303162,0.265625,0.000000,-0.187500,Respeck,Sitting,0,s2255640,hyperventilate
1485,1694028385953,-0.498291,-0.850647,0.302185,-0.062500,-0.140625,-0.515625,Respeck,Sitting,0,s2255640,hyperventilate
1486,1694028385991,-0.517822,-0.850891,0.304382,-0.125000,-0.359375,-0.359375,Respeck,Sitting,0,s2255640,hyperventilate


One more important value to save for later is a recording ID. This will be used to split the entire dataset into separate recordings before you start doing any further splitting into windows. The name of the file can act as the unique recording ID for each recording.

In [None]:
filename_respeck.split("/")[-1].split(".")[0]

'Respeck_s2255740_Sitting_06-09-2023_21-26-26'
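
An alternative way to get the same ID is `pathlib.Path.stem`, which strips the directory and only the final extension. The two approaches differ only when the file name itself contains extra dots. The paths below are hypothetical and simply follow the naming pattern shown above.

```python
from pathlib import Path

# Hypothetical path following the file naming pattern used in this notebook
example_path = "/data/Respeck_s0000000_Sitting_06-09-2023_21-26-26.csv"
recording_id = Path(example_path).stem
print(recording_id)

# A name with an extra dot: stem keeps everything before the last suffix,
# while split(".")[0] cuts at the first dot
dotted_path = "/data/rec.final.csv"
stem_id = Path(dotted_path).stem
split_id = dotted_path.split("/")[-1].split(".")[0]
print(stem_id, split_id)
```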

In [None]:
df_respeck['recording_id'] = filename_respeck.split("/")[-1].split(".")[0]
df_respeck

Unnamed: 0,timestamp,accel_x,accel_y,accel_z,gyro_x,gyro_y,gyro_z,sensor_type,activity_type,activity_code,subject_id,notes,recording_id
0,1694028326947,-0.481445,-0.882874,0.285095,0.640625,-1.031250,-0.203125,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
1,1694028327005,-0.493652,-0.862610,0.272644,-1.109375,0.015625,-0.484375,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
2,1694028327022,-0.491211,-0.871399,0.271179,0.328125,-0.656250,-0.562500,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
3,1694028327060,-0.486816,-0.869202,0.276062,0.531250,-0.390625,-0.421875,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
4,1694028327097,-0.492432,-0.877991,0.278503,-0.140625,-0.500000,0.015625,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1483,1694028385860,-0.510010,-0.846252,0.320007,0.546875,0.156250,0.265625,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
1484,1694028385916,-0.508057,-0.863831,0.303162,0.265625,0.000000,-0.187500,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
1485,1694028385953,-0.498291,-0.850647,0.302185,-0.062500,-0.140625,-0.515625,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
1486,1694028385991,-0.517822,-0.850891,0.304382,-0.125000,-0.359375,-0.359375,Respeck,Sitting,0,s2255640,hyperventilate,Respeck_s2255740_Sitting_06-09-2023_21-26-26
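
Once several recordings have been concatenated into one dataframe, the `recording_id` column is what lets you split them apart again before windowing. The sketch below uses two toy recordings with made-up IDs and values. In practice each would come from one of your files.

```python
import pandas as pd

# Two toy recordings standing in for real files
rec_a = pd.DataFrame({"timestamp": [0, 40, 80], "accel_x": [0.1, 0.2, 0.3]})
rec_a["recording_id"] = "Respeck_s0000000_Sitting_A"
rec_b = pd.DataFrame({"timestamp": [0, 40], "accel_x": [0.5, 0.6]})
rec_b["recording_id"] = "Respeck_s0000000_Walking_B"

# Combine everything into one dataset
full_dataset = pd.concat([rec_a, rec_b], ignore_index=True)

# Split back into individual recordings before any windowing
recordings = {
    rec_id: rec.reset_index(drop=True)
    for rec_id, rec in full_dataset.groupby("recording_id")
}
for rec_id, rec in recordings.items():
    print(rec_id, len(rec))
```

Splitting first helps ensure that no window ends up spanning two different recordings.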


# Getting the frequency (sampling rate) of the data

One useful check is the sampling frequency of your recordings. Both sensors run at 25 Hz, but some packets may be dropped along the way. You can use the function below to quickly check the frequency of any of your recordings.

In [None]:
def get_frequency(dataframe: pd.DataFrame, ts_column: str = 'timestamp') -> float:
    """
    :param dataframe: Dataframe containing sensor data. It needs to have a 'timestamp' column.
    :param ts_column: The name of the column containing the timestamps. Default is 'timestamp'.
    :returns: Frequency in Hz (samples per second)
    """

    return len(dataframe) / ((dataframe[ts_column].iloc[-1] - dataframe[ts_column].iloc[0]) / 1000)

In [None]:
get_frequency(df_respeck)

25.18533563521885

Here we can see that the frequency of this recording is a bit over 25 Hz, which is considered normal. You should be worried if the frequency of a recording deviates by more than 2 Hz from the nominal 25 Hz.

You can load and check the Thingy data in the same way.
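
An average frequency close to 25 Hz can still hide individual dropped packets, so it can help to look at the gaps between consecutive timestamps as well. The sketch below is one possible check, not part of the original notebook: the name `find_gaps` and the `tolerance` parameter are hypothetical, and timestamps are assumed to be in milliseconds as in `get_frequency`. The threshold is deliberately loose because the timestamps in the table above are not perfectly evenly spaced.

```python
import pandas as pd


def find_gaps(dataframe: pd.DataFrame, ts_column: str = "timestamp",
              expected_hz: float = 25.0, tolerance: float = 2.5) -> pd.DataFrame:
    """Return rows whose gap to the previous sample is unusually long.

    A gap is flagged when it exceeds `tolerance` times the expected
    sampling period. Timestamps are assumed to be in milliseconds.
    """
    expected_period_ms = 1000.0 / expected_hz  # 40 ms at 25 Hz
    gaps = dataframe[ts_column].diff()         # the first row has no gap (NaN)
    flagged = dataframe.assign(gap_ms=gaps)
    return flagged[flagged["gap_ms"] > tolerance * expected_period_ms]


# Toy timestamps with one 200 ms gap between the third and fourth samples
toy = pd.DataFrame({"timestamp": [0, 40, 80, 280, 320]})
print(find_gaps(toy))
```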

# Getting the length of your data

You can check how long your recording is (in seconds) via:

In [None]:
len(df_respeck) / get_frequency(df_respeck)

59.082

Since that is something we will be using often, we can also turn it into a function.

In [None]:
def get_recording_length(dataframe: pd.DataFrame) -> float:
  """
  :param dataframe: Dataframe containing sensor data. It needs to have a 'timestamp' column.
  :returns: Length of the recording in seconds.
  """
  return len(dataframe) / get_frequency(dataframe)

# Visualizing the data

Next we will learn how to visualise the data from both sensors.

Be careful when plotting sensor data: if you are trying to compare activities, make sure that the axes match between plots. Accelerometer and gyroscope data are also measured on very different scales. Accelerometer data is usually in the range [-4, 4], while gyroscope data can reach the tens or hundreds, so you should not draw them on the same axes.
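
One way to keep axes comparable across recordings is to fix the y-limits instead of letting matplotlib choose them for each plot. The sketch below is a compact illustration under assumed limits: `ACCEL_YLIM`, `GYRO_YLIM` and the function name `plot_fixed_axes` are hypothetical, and you should pick limits that suit your own data.

```python
import matplotlib.pyplot as plt
import pandas as pd

ACCEL_YLIM = (-2.0, 2.0)      # assumed range in g, adjust to your data
GYRO_YLIM = (-100.0, 100.0)   # assumed range, adjust to your data


def plot_fixed_axes(dataframe: pd.DataFrame, title: str):
    """Plot accelerometer and gyroscope data on separate axes with fixed y-limits."""
    fig, (ax_accel, ax_gyro) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
    for axis_name in ("x", "y", "z"):
        ax_accel.plot(dataframe[f"accel_{axis_name}"], label=f"accel_{axis_name}")
        ax_gyro.plot(dataframe[f"gyro_{axis_name}"], label=f"gyro_{axis_name}")
    # The same limits for every recording make the plots directly comparable
    ax_accel.set_ylim(*ACCEL_YLIM)
    ax_gyro.set_ylim(*GYRO_YLIM)
    ax_accel.set_ylabel("Acceleration")
    ax_gyro.set_ylabel("Gyroscope")
    ax_gyro.set_xlabel("Data point no")
    ax_accel.set_title(title)
    ax_accel.legend()
    ax_gyro.legend()
    fig.tight_layout()
    return fig


# Toy data with the column names used in this notebook
toy = pd.DataFrame({
    "accel_x": [-0.48, -0.49], "accel_y": [-0.88, -0.86], "accel_z": [0.29, 0.27],
    "gyro_x": [0.64, -1.11], "gyro_y": [-1.03, 0.02], "gyro_z": [-0.20, -0.48],
})
fig = plot_fixed_axes(toy, "Toy recording")
```

Calling the function once per activity with the same limits gives plots that can be compared side by side.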

The following is a visualization plotting the accelerometer and gyroscope values measured while the subject is sitting and hyperventilating.

In [None]:
# Calculate the number of data points in your dataset
num_data_points = len(df_respeck)

# Calculate a suitable figure width based on the number of data points
# A larger divisor gives a narrower figure, a smaller one a wider figure
figure_width = num_data_points / 10


# Set a fixed aspect ratio for the figure (optional)
aspect_ratio = 0.3  # You can adjust this value as needed

# Calculate the figure height based on the aspect ratio and width
figure_height = figure_width * aspect_ratio

# Create the figure with the calculated size
fig, ax = plt.subplots(2, 1, figsize=(figure_width, figure_height))

plot_title = "Respeck sitting and hyperventilating - accelerometer and gyroscope data"

line_width = 6

# Plot accelerometer data with custom line width
ax[0].plot(df_respeck['accel_x'], label="accel_x", linewidth=line_width)
ax[0].plot(df_respeck['accel_y'], label="accel_y", linewidth=line_width)
ax[0].plot(df_respeck['accel_z'], label="accel_z", linewidth=line_width)
ax[0].legend()

ax[0].set_title(f"{df_respeck['sensor_type'].values[0]} - {df_respeck['activity_type'].values[0]} \n Accelerometer data")

# Plot gyroscope data
ax[1].plot(df_respeck['gyro_x'], label="gyro_x", linewidth=line_width)
ax[1].plot(df_respeck['gyro_y'], label="gyro_y", linewidth=line_width)
ax[1].plot(df_respeck['gyro_z'], label="gyro_z", linewidth=line_width)
ax[1].legend()

num_xticks = len(df_respeck)//10
ax[0].xaxis.set_major_locator(ticker.MaxNLocator(num_xticks))
ax[1].xaxis.set_major_locator(ticker.MaxNLocator(num_xticks))

fnt_size = 60
fnt_size2 = 40

ax[1].set_xlabel("Data point no", fontsize=fnt_size)  # Adjust fontsize for the x-axis label
ax[0].set_ylabel("Acceleration", fontsize=fnt_size)  # Adjust fontsize for the y-axis label
ax[1].set_ylabel("Gyroscope", fontsize=fnt_size)

# Adjust fontsize of individual ticks on the x-axis and y-axis for both subplots
ax[0].tick_params(axis='both', labelsize=fnt_size2)
ax[1].tick_params(axis='both', labelsize=fnt_size2)

# Rotate x-axis tick labels by 45 degrees for both subplots
ax[0].tick_params(axis='x', labelrotation=45)
ax[1].tick_params(axis='x', labelrotation=45)

ax[0].set_title(plot_title, size=fnt_size)

# Add vertical grid lines (gridlines along the x-axis)
ax[0].grid(axis='x', linestyle='--', linewidth=line_width)
ax[1].grid(axis='x', linestyle='--', linewidth=line_width)

plt.tight_layout()
plt.show()



### Since we will be visualizing sensor data often, we should make it into a function as well.

In [None]:
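def plot_data(dataframe: pd.DataFrame, plot_title: str) -> None:
  """
  Plot accelerometer and gyroscope data from one recording on two stacked subplots.

  :param dataframe: Dataframe with accel_*, gyro_*, sensor_type and activity_type columns.
  :param plot_title: Title shown above the accelerometer subplot.
  """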
  # Calculate the number of data points in your dataset
  num_data_points = len(dataframe)

  # Calculate a suitable figure width based on the number of data points
  # A larger divisor gives a narrower figure, a smaller one a wider figure
  figure_width = num_data_points / 10


  # Set a fixed aspect ratio for the figure (optional)
  aspect_ratio = 0.3  # You can adjust this value as needed

  # Calculate the figure height based on the aspect ratio and width
  figure_height = figure_width * aspect_ratio

  # Create the figure with the calculated size
  fig, ax = plt.subplots(2, 1, figsize=(figure_width, figure_height))

  line_width = 6

  # Plot accelerometer data with custom line width
  ax[0].plot(dataframe['accel_x'], label="accel_x", linewidth=line_width)
  ax[0].plot(dataframe['accel_y'], label="accel_y", linewidth=line_width)
  ax[0].plot(dataframe['accel_z'], label="accel_z", linewidth=line_width)
  ax[0].legend()

  ax[0].set_title(f"{dataframe['sensor_type'].values[0]} - {dataframe['activity_type'].values[0]} \n Accelerometer data")

  # Plot gyroscope data
  ax[1].plot(dataframe['gyro_x'], label="gyro_x", linewidth=line_width)
  ax[1].plot(dataframe['gyro_y'], label="gyro_y", linewidth=line_width)
  ax[1].plot(dataframe['gyro_z'], label="gyro_z", linewidth=line_width)
  ax[1].legend()

  num_xticks = len(dataframe)//10
  ax[0].xaxis.set_major_locator(ticker.MaxNLocator(num_xticks))
  ax[1].xaxis.set_major_locator(ticker.MaxNLocator(num_xticks))

  fnt_size = 60
  fnt_size2 = 40

  ax[1].set_xlabel("Data point no", fontsize=fnt_size)  # Adjust fontsize for the x-axis label
  ax[0].set_ylabel("Acceleration", fontsize=fnt_size)  # Adjust fontsize for the y-axis label
  ax[1].set_ylabel("Gyroscope", fontsize=fnt_size)

  # Adjust fontsize of individual ticks on the x-axis and y-axis for both subplots
  ax[0].tick_params(axis='both', labelsize=fnt_size2)
  ax[1].tick_params(axis='both', labelsize=fnt_size2)

  # Rotate x-axis tick labels by 45 degrees for both subplots
  ax[0].tick_params(axis='x', labelrotation=45)
  ax[1].tick_params(axis='x', labelrotation=45)

  ax[0].set_title(plot_title, size=fnt_size)

  # Add vertical grid lines (gridlines along the x-axis)
  ax[0].grid(axis='x', linestyle='--', linewidth=line_width)
  ax[1].grid(axis='x', linestyle='--', linewidth=line_width)

  plt.tight_layout()
  plt.show()

In [None]:
plot_data(df_respeck, "Respeck sitting and hyperventilating - accelerometer and gyroscope data")

