## Exploratory Data Analysis

### About the Dataset

Name: **Human Activity Recognition Using Smartphones Data Set**  
* The dataset can be downloaded at: [link](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones#)


#### Facts:
1. Number of subjects: **30** (19-48 years)
2. Number of activities: **6** (walking, walking upstairs, walking downstairs, sitting, standing, laying)
3. Raw Data:
    * Frequency of capture: **50 Hz**
    * Signals Captured: **6** (3-axis linear acceleration and 3-axis angular velocity)
    * Sliding Window Length: **2.56 sec** (50% overlap)
    * Datapoints per window: **128**
4. Processed Data:
    * Summary:
        1. Statistics: **561** (calculated on each window, i.e. 128 datapoints)
        2. Domain: **Time** and **Frequency**
        3. Data Split: **70/30** (Subjects are exclusive to each split)
        4. "_Each line in the dataset is a 561-d vector, representing the summary statistics across a period of 2.56 seconds, associated with an activity and the subject performing that activity._"
    * Non-Summary:
        1. Filtered Split: **3** (body acceleration, total acceleration, angular velocity)
        2. Total Splits: **9** (one file per signal for each of the x, y and z axes)
        3. "_Each line in 1 out of these 9 files is a 128-d vector representing the low-pass filtered signal in a 2.56 second window._"
    * Signals are scaled to **\[-1, 1]**

### Code


#### Imports

In [1]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#### Variables

In [2]:
dataset_root = "/media/ankurrc/new_volume/633_ml/project/code/dataset/UCI HAR Dataset/"

#### Functions

In [3]:
def _load_data(file_path=None):
    """
    Load data into a (rows, columns) format numpy array.
    """
    # sep=r"\s+" splits on runs of whitespace; delim_whitespace is deprecated in recent pandas
    data = pd.read_csv(file_path, sep=r"\s+", header=None)
    return data.values

In [4]:
def _load_group(root, filenames):
    """
    Load a group of files and concatenate them. 
    Returns a (num_samples, time_steps, features) format numpy array.
    """
    data = []
    for filename in filenames:
        file_path = os.path.join(root, filename)
        data.append(_load_data(file_path))
        
    # stack along the third axis (axis=2); equivalent to
    # np.concatenate([a[:, :, np.newaxis], b[:, :, np.newaxis]], axis=2)
    data = np.dstack(data)
    return data
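As a quick check of the stacking step, the sketch below uses small placeholder arrays (not real sensor data) to show that `np.dstack` turns several `(num_samples, time_steps)` arrays into one `(num_samples, time_steps, features)` array, and that this matches the explicit `np.concatenate` form.

```python
import numpy as np

# Three placeholder "signal files", each with 4 windows of 128 time steps
a = np.zeros((4, 128))
b = np.ones((4, 128))
c = np.full((4, 128), 2.0)

# dstack adds a trailing axis to each 2-D array and joins along it
stacked = np.dstack([a, b, c])

# The same result written out with explicit new axes
manual = np.concatenate(
    [a[:, :, np.newaxis], b[:, :, np.newaxis], c[:, :, np.newaxis]], axis=2
)

print(stacked.shape)
print(np.array_equal(stacked, manual))
```

The last axis indexes the source file, so `stacked[:, :, 1]` recovers `b`.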

In [5]:
def load_dataset(dataset_root, split="train"):
    """
    Loads X and y.
    """
    files_root = os.path.join(dataset_root, "{prefix}/Inertial Signals/".format(prefix=split))
    # os.listdir returns entries in arbitrary order; sort so the feature axis is reproducible
    filenames = sorted(os.listdir(files_root))
    # load X
    X = _load_group(files_root, filenames)
    # load y
    label_file_path = os.path.join(dataset_root, "{prefix}/y_{prefix}.txt".format(prefix=split))
    y = _load_data(label_file_path)
    
    return X, y

In [6]:
def get_label_breakup(data):
    """
    Gets a breakup of counts for each label in 'y'.
    """
    df = pd.DataFrame(data, columns=["y"])
    counts = df.groupby("y").size()
    #counts.plot(kind="pie", colormap="GnBu", legend=True, title="Count Breakup", figsize=(10,10)); 

    # iterate over the actual label values instead of assuming they run 1..n
    for label, count in counts.items():
        percent = count / len(df) * 100
        print('Class={},\t total={},\t percentage={:.3f}'.format(label, count, percent))

In [75]:
_, y = load_dataset(dataset_root, split="test")
get_label_breakup(y)

Class=1,	 total=496,	 percentage=16.831
Class=2,	 total=471,	 percentage=15.982
Class=3,	 total=420,	 percentage=14.252
Class=4,	 total=491,	 percentage=16.661
Class=5,	 total=532,	 percentage=18.052
Class=6,	 total=537,	 percentage=18.222
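
To make the breakup easier to read, the class ids can be paired with activity names. The sketch below hard-codes the counts printed above and assumes the id-to-name mapping from the dataset's `activity_labels.txt` (1 = WALKING through 6 = LAYING); `ACTIVITY_NAMES`, `summary` and `imbalance_ratio` are names introduced here for illustration.

```python
import pandas as pd

# Assumed mapping, following the dataset's activity_labels.txt
ACTIVITY_NAMES = {
    1: "WALKING",
    2: "WALKING_UPSTAIRS",
    3: "WALKING_DOWNSTAIRS",
    4: "SITTING",
    5: "STANDING",
    6: "LAYING",
}

# Test-split counts copied from the output above
counts = {1: 496, 2: 471, 3: 420, 4: 491, 5: 532, 6: 537}
total = sum(counts.values())

rows = [
    {
        "class": label,
        "activity": ACTIVITY_NAMES[label],
        "count": count,
        "percentage": round(count / total * 100, 3),
    }
    for label, count in counts.items()
]
summary = pd.DataFrame(rows).set_index("class")

# Ratio of the largest to the smallest class, a rough measure of imbalance
imbalance_ratio = summary["count"].max() / summary["count"].min()

print(summary)
print("total={}, max/min ratio={:.2f}".format(total, imbalance_ratio))
```

The largest class is only about 1.28 times the size of the smallest, so the test split is reasonably balanced.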
