# Acquire and cleanse data
### Data Collection Process for Human Activity Recognition

The Human Activity Recognition (HAR) dataset was gathered in experiments with 30 volunteers aged 19 to 48. Each volunteer performed a set of predefined activities in a controlled environment while carrying a smartphone. The goal was to capture data that allows everyday activities to be classified from smartphone sensor readings.

#### Experimental Setup

1. **Participants**:
    - A total of 30 volunteers participated in the study.
    - Age range: 19-48 years.
    - Each participant performed six activities: walking, walking upstairs, walking downstairs, sitting, standing, and lying down (labeled `LAYING` in the dataset).

2. **Equipment**:
    - Each participant was equipped with a Samsung Galaxy S II smartphone.
    - The smartphone was positioned at the waist to ensure consistency in data collection.
    - The smartphone's embedded sensors, specifically the accelerometer and gyroscope, were used to record the data.

#### Data Recording

1. **Sensor Data**:
    - The accelerometer and gyroscope captured 3-axial linear acceleration and 3-axial angular velocity.
    - Data was recorded at a constant rate of 50 Hz, that is, one reading every 20 ms.

2. **Video Recording**:
    - All experiments were video-recorded to facilitate the manual labeling of activities.
    - This step ensured accurate correlation between the recorded sensor data and the actual activities performed.

#### Data Preprocessing

1. **Noise Filtering**:
    - The raw sensor data was pre-processed to remove noise.
    - According to the dataset's documentation, this used a median filter and a 3rd-order low-pass Butterworth filter with a 20 Hz corner frequency.

2. **Segmentation**:
    - The sensor data was segmented into fixed-width sliding windows of 2.56 seconds with 50% overlap. At 50 Hz this gives 128 readings per window, with consecutive windows sharing 64 readings.
    - This segmentation facilitated the extraction of meaningful features from the data.

3. **Signal Separation**:
    - The accelerometer signals were separated into body acceleration and gravity components using a Butterworth low-pass filter.
    - The filter, with a 0.3 Hz cutoff frequency, isolated the low-frequency gravity component from the higher-frequency body movements.
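
The noise filtering and signal separation steps can be sketched with SciPy. This is an illustration under stated assumptions, not the dataset authors' code: the dataset's documentation describes a median filter and a 3rd-order, 20 Hz low-pass Butterworth filter for denoising, while the median kernel size, the order of the 0.3 Hz gravity filter, and the use of zero-phase filtering (`filtfilt`) are choices made here for the example. The helper names `lowpass`, `denoise`, and `separate_gravity` are hypothetical, and the input signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

FS = 50.0  # sampling rate in Hz (from the text)


def lowpass(signal, cutoff_hz, order=3, fs=FS):
    """Zero-phase Butterworth low-pass filter along the time axis (axis 0)."""
    # butter() expects the cutoff as a fraction of the Nyquist frequency.
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal, axis=0)


def denoise(raw, kernel_size=3):
    """Median-filter each axis separately, then low-pass at 20 Hz.

    kernel_size=3 is an assumption; the source does not state it.
    """
    smoothed = np.column_stack(
        [medfilt(raw[:, i], kernel_size=kernel_size) for i in range(raw.shape[1])]
    )
    return lowpass(smoothed, cutoff_hz=20.0)


def separate_gravity(total_acc):
    """Split total acceleration into body and gravity (< 0.3 Hz) components."""
    gravity = lowpass(total_acc, cutoff_hz=0.3)  # filter order assumed to be 3
    body = total_acc - gravity
    return body, gravity


# Synthetic signal: a 2 Hz movement on x, nothing on y, constant 1 g on z.
t = np.arange(0, 20, 1 / FS)
raw_acc = np.column_stack(
    [0.3 * np.sin(2 * np.pi * 2 * t), np.zeros_like(t), np.ones_like(t)]
)
clean_acc = denoise(raw_acc)
body_acc, gravity_acc = separate_gravity(clean_acc)
```

On this synthetic input the gravity estimate stays close to 1 g on the z axis and the 2 Hz movement ends up in `body_acc`, which is the behaviour the 0.3 Hz cutoff is meant to produce.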

#### Feature Extraction

1. **Time and Frequency Domain Features**:
    - From each segmented window, a vector of features was derived by calculating various variables from both the time and frequency domains.
    - This comprehensive feature extraction process resulted in a 561-feature vector for each data record.
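
To make the windowing and feature extraction concrete, here is a minimal NumPy sketch. It cuts a signal into 128-sample windows with 50% overlap and computes a handful of illustrative features per window. The five feature types shown (mean, standard deviation, maximum, minimum, spectral energy) are only stand-ins for the kind of variables the text describes, not the actual 561-feature definition, and the function names are hypothetical.

```python
import numpy as np

FS = 50              # sampling rate in Hz (from the text)
WINDOW = 128         # 2.56 s * 50 Hz
STEP = WINDOW // 2   # 50% overlap -> advance by 64 samples


def sliding_windows(signal, window=WINDOW, step=STEP):
    """Split an (n_samples, n_axes) array into overlapping windows."""
    if len(signal) < window:
        raise ValueError("signal is shorter than one window")
    n_windows = (len(signal) - window) // step + 1
    return np.stack(
        [signal[i * step : i * step + window] for i in range(n_windows)]
    )


def window_features(win):
    """A few example time- and frequency-domain features for one window."""
    spectrum = np.abs(np.fft.rfft(win, axis=0))
    return np.concatenate(
        [
            win.mean(axis=0),                         # time domain
            win.std(axis=0),
            win.max(axis=0),
            win.min(axis=0),
            (spectrum ** 2).sum(axis=0) / len(win),   # frequency domain: energy
        ]
    )


# Synthetic 3-axis signal, 640 samples (12.8 s at 50 Hz).
rng = np.random.default_rng(0)
signal = rng.normal(size=(640, 3))

windows = sliding_windows(signal)
features = np.array([window_features(w) for w in windows])
print(windows.shape, features.shape)
```

Each row of `features` corresponds to one window, in the same way that each record in the dataset carries one feature vector.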

#### Dataset Partitioning

1. **Training and Testing Sets**:
    - The dataset was randomly partitioned by volunteer: 70% of the volunteers were selected to generate the training data, and the remaining 30% the test data.
    - Splitting by volunteer means the model is trained on a substantial amount of data and then evaluated on people it has never seen, which tests how well it generalizes.
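
In the published dataset the split is made per volunteer, so no person contributes records to both sets. The sketch below shows one way to build such a split with NumPy. The function name `split_by_subject`, the seed, and the synthetic subject IDs are assumptions for illustration; the published dataset already ships with its own fixed train and test partitions.

```python
import numpy as np


def split_by_subject(subject_ids, train_fraction=0.7, seed=0):
    """Return boolean masks (train, test) that keep each subject in one set."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    n_train = int(round(train_fraction * len(subjects)))
    train_subjects = rng.choice(subjects, size=n_train, replace=False)
    train_mask = np.isin(subject_ids, train_subjects)
    return train_mask, ~train_mask


# Synthetic example: 30 subjects with 10 records each.
subject_ids = np.repeat(np.arange(1, 31), 10)
train_mask, test_mask = split_by_subject(subject_ids)

train_subjects = set(subject_ids[train_mask])
test_subjects = set(subject_ids[test_mask])
print(len(train_subjects), len(test_subjects))
```

A purely record-level random split would place windows from the same person, often overlapping ones, in both sets, which tends to overstate test accuracy.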

#### Attribute Information

For each record in the dataset, the following attributes were provided:

- Triaxial acceleration from the accelerometer (total and estimated body acceleration).
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector derived from time and frequency domain calculations.
- Activity label indicating the type of activity performed.
- An identifier for the participant who carried out the experiment.

The collection process combines sensor recordings with video-verified labels, noise filtering, and a fixed feature extraction procedure. This makes the resulting dataset a useful resource for developing and comparing Human Activity Recognition systems.

## Libraries

In [6]:
# Warnings
import warnings
warnings.filterwarnings('ignore')

# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
RANDOM_STATE = 35820
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
%matplotlib inline

In [7]:
ROOT_DATA = "../data/activity_root_file.csv"

## Loading data

In [11]:
data = pd.read_csv(ROOT_DATA, encoding='ISO-8859-1', low_memory=False)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 7, saw 4
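
The `ParserError` means pandas inferred two columns from the first line of the file but found four fields on line 7. A common cause is a block of metadata lines before the real header; another is a delimiter other than a comma. Which one applies to `activity_root_file.csv` is not known here. The sketch below reproduces the same kind of error on a synthetic string and shows one way to locate the header row. The sample text, the `field_counts` helper, and the "widest line is the header" heuristic are all assumptions for illustration.

```python
import csv
import io

import pandas as pd

# Synthetic stand-in: two metadata lines, then the real 4-column table.
sample = (
    "source,synthetic example\n"
    "sampling_rate_hz,50\n"
    "subject,activity,acc_x,acc_y\n"
    "1,WALKING,0.12,-0.03\n"
    "1,SITTING,0.01,0.00\n"
)


def field_counts(text, sep=",", max_lines=50):
    """Number of fields on each of the first `max_lines` lines."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    return [len(row) for _, row in zip(range(max_lines), reader)]


# Reading the text as-is fails in the same way as the notebook cell.
try:
    pd.read_csv(io.StringIO(sample))
    error_message = None
except pd.errors.ParserError as exc:
    error_message = str(exc)

# Heuristic: treat the first line with the largest field count as the header
# and skip everything before it.
counts = field_counts(sample)
header_row = counts.index(max(counts))
data = pd.read_csv(io.StringIO(sample), skiprows=header_row)
print(counts, header_row, data.shape)
```

For the real file, the same check can be run on the first few dozen lines read with `open(ROOT_DATA, encoding='ISO-8859-1')`. If every line reports a single field, the separator is probably not a comma. In that case `sep=None` with `engine='python'` asks pandas to sniff the delimiter, although `low_memory` would then likely need to be dropped, since it is a C-engine option.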
