# Human Activiry Recognition

## Project description

The goal of human activity recognition (HAR) is to take an advantage of accelerometer and gyroscope sensors data e.g. from your smartphone and classify these data to predict specific type of movement sych as walking, running, sitting etc.

Accelerometers and gyroscopes are device capable of measuring mass acceleration and angular rotation. Depending on  type and quality of the sensor, accelerometer data can be very noisy and biased, which can introduce additional errors in our model.

In general, HAR is challenging tasks, due to the large number of data that accelerometers generates (easily up to 200Hz) every second, noisy data as well as the fact that there is no clear way to relate accelerometer data to specific movements.

Since, the original data source created by **[UCI](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones)** is not functional anymore, for this project we will use the copy of dataset downloaded from Kaggle data repository **[KAGGLE DATA SET](https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones)**

## Setup the environment

First, let's import necessary modules and setup working environment. We also include function for saving figures. 

In [70]:
# import dependencies
import sys
assert sys.version_info >= (3,5)
import os
import urllib
import time, datetime
from collections import Counter

# Data manipulation
import numpy as np
import pandas as pd
import math, random

# Data visualization
import missingno
import seaborn as sns
import matplotlib as mpl
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
import matplotlib.pyplot as plt
plt.style.use("seaborn-whitegrid")

# Machine learning
import sklearn
assert sklearn.__version__ >= "0.20" # check Scikit-learn version 
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Define constants and variables
PROJECT_ROOT_DIR = '.'
DATA_PATH = os.path.join(PROJECT_ROOT_DIR, "datasets")
os.makedirs(DATA_PATH, exist_ok=True)
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")
os.makedirs(IMAGES_PATH, exist_ok=True)

# Function to save figures
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saveing figure ", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Load the data

The data were downloaded from the Kaggle data repository (see the link above in project description) and saved into project folder.

**TBD**: add data description

In [80]:
def load_data(data_path=DATA_PATH):
    try:
        train_set = pd.read_csv(os.path.join(data_path, "train.csv"))
        test_set  = pd.read_csv(os.path.join(data_path, "test.csv"))
        print("Data set {} and {} were loaded.".format("train.csv", "test.csv"))
        return train_set, test_set
#     except IOError:
#             print("Data set not found.")
    except FileNotFoundError:
        print("Data set was not found.")

In [86]:
train, test = load_data()

Data set train.csv and test.csv were loaded.


## Data description

Data in the dataset were recorded using a buiil in smartphone accelerometer and gyroscope sensors, which captured 3-axial linear acceleration and 3-axial angular velocity with logging (sampling) frequency of 50Hz. Sensor data are already-preprocessed.

More details about how the data were recored, labeled, and pre-processed please refer to the original **[UCI data repository](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones)**.

In [125]:
# View the train data
print("Train data set size is {}".format(train.shape))

Train data set size is (7352, 563)


As we can see there are 563 columns in dataset. After closer look, you will notice the features are dvided into few categories (acceleration, gravity) with some additional parameter (mean, std, ...).
Let's group and count the main categories together. This will give us better data overview.

In [121]:
def group_data_columns(data_set):
    """
    Function takes data set as an input argument and returns new data frame, 
    where the first column represents the attribute's name and seconds its count
    value from original dataset.
    """
    # Group and count main names of columns
    grouped_columns = pd.DataFrame.from_dict(Counter([column.split('-')[0].split('(')[0] for column in data_set.columns]), orient='index')
    grouped_columns.rename(columns={0:'count'}).sort_values('count', ascending=False)
    return grouped_columns

In [122]:
group_data_columns(train)

Unnamed: 0,0
tBodyAcc,40
tGravityAcc,40
tBodyAccJerk,40
tBodyGyro,40
tBodyGyroJerk,40
tBodyAccMag,13
tGravityAccMag,13
tBodyAccJerkMag,13
tBodyGyroMag,13
tBodyGyroJerkMag,13


From the table above, we can distinguish between two main attributes, starting with letter *"t"* and *"f"*, which means sensor data obtained in time and frequency domain, respectively. Both time and frequency domains have subcategories such as acceleration and gravity.