## Assemble Training and Holdout Datasets

This notebook unzips the files needed from the RealWorld (HAR) dataset published by the University of Mannheim, Research Group Data and Web Science:
https://sensor.informatik.uni-mannheim.de/#dataset_realworld
I used the option to download ALL the files in one zip file, which includes not only all the sensor readings but also all the video, so the download is very large. However, only the accelerometer and gyroscope data were used for this project.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile

The data from Uni-Mannheim comes as a series of zip files organized by subject, activity, sensor, and device location, so the first step is to unzip it. We are interested in all 15 subjects (the first 10 subjects will become our Training Dataset, and the remaining 5 will become our Holdout Dataset). We assemble data for all 8 activities, but only for the accelerometer and gyroscope sensors and only in the "thigh" (pants pocket) location. The code below unzips the relevant CSV files, first for the accelerometer data and then for the gyroscope data.

In [None]:
# unzip accelerometer files
# several files had "_2" inserted; thigh-location data for subject 6 jumping was missing

activities =["climbingdown","climbingup","jumping","lying","running","sitting","standing","walking"]
for subject in range(1,16):  # extract subjects
    zipfilepath="./data/proband"+str(subject)+"/data/"
    for activity in activities:
        zipfilename = "acc_"+activity+"_csv.zip"
        if (subject==4 and activity=='walking'):
            filename ="acc_walking_2_thigh.csv" 
        elif (subject in [6,7] and activity=='sitting'):
            filename ="acc_sitting_2_thigh.csv"
        elif (subject ==8 and activity=='standing'):
            filename ="acc_standing_2_thigh.csv"
        elif (subject==6 and activity=='jumping'):  #thigh data is missing, but other locations are available
            continue # skip this iteration and move to the next
        elif subject==13 and activity=="walking":
            filename = "acc_walking_2_thigh.csv"
        else:
            filename = "acc_"+activity+"_thigh.csv"
        savepath = "./data/"+str(subject)+"/"
        with zipfile.ZipFile(zipfilepath+zipfilename,"r") as zip_ref:
            zip_ref.extract(filename, path=savepath)

In [62]:
# unzip gyroscope files
# several files had "_2" inserted; thigh-location data for subject 6 jumping was missing
# subject 1 sitting is also skipped for the gyroscope
sensor = "gyr_"
activities =["climbingdown","climbingup","jumping","lying","running","sitting","standing","walking"]
for subject in range(1,16):  # extract subjects
    zipfilepath="./data/proband"+str(subject)+"/data/"
    for activity in activities:
        zipfilename = sensor+activity+"_csv.zip"
        if subject==1 and activity=="sitting":
            continue
        elif subject==4 and activity=="walking":
            filename = "Gyroscope_walking_2_thigh.csv"
        elif subject==6 and activity=="jumping":
            continue
        elif (subject in [6,7] and activity=='sitting'):
            filename ="Gyroscope_sitting_2_thigh.csv"
        elif (subject ==8 and activity=='standing'):
            filename ="Gyroscope_standing_2_thigh.csv"
        elif subject==13 and activity=="walking":
            filename = "Gyroscope_walking_2_thigh.csv"
        else:
            filename = "Gyroscope_"+activity+"_thigh.csv"
        savepath = "./data/"+str(subject)+"/"
        with zipfile.ZipFile(zipfilepath+zipfilename,"r") as zip_ref:
            zip_ref.extract(filename, path=savepath)
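The irregular "_2" names are hard-coded in the two cells above. As a minimal sketch, the thigh file could instead be looked up in each archive with `ZipFile.namelist()`. The helper name `find_thigh_member` is hypothetical, and the demo builds a tiny in-memory archive with placeholder contents rather than reading the real data.

```python
import io
import zipfile


def find_thigh_member(zip_ref):
    """Return the archive members whose names end in '_thigh.csv', sorted."""
    # hypothetical helper: not part of the original notebook
    return sorted(name for name in zip_ref.namelist() if name.endswith("_thigh.csv"))


# build a tiny in-memory archive that mimics one of the irregularly named zips
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("acc_walking_2_thigh.csv", "placeholder\n")
    zf.writestr("acc_walking_2_chest.csv", "placeholder\n")

with zipfile.ZipFile(buffer, "r") as zip_ref:
    thigh_members = find_thigh_member(zip_ref)

print(thigh_members)
```

If an archive returned more than one thigh member, the choice between versions would still have to be made by hand.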

There were several anomalies in the way the original files were named. For some subject/activity combinations there was more than one file, and the file name carried an additional "_2" suffix. Only the first version in each case was unzipped and used (I did not inspect the other versions to see how they differ, but the file sizes are not identical). The code below removes the suffix from the names of these few CSV files so that the later cells can rely on a uniform naming convention.

In [19]:
# rename _2 files
import os
os.rename("./data/4/acc_walking_2_thigh.csv", "./data/4/acc_walking_thigh.csv")
os.rename("./data/6/acc_sitting_2_thigh.csv", "./data/6/acc_sitting_thigh.csv")
os.rename("./data/7/acc_sitting_2_thigh.csv", "./data/7/acc_sitting_thigh.csv")
os.rename("./data/8/acc_standing_2_thigh.csv", "./data/8/acc_standing_thigh.csv")
os.rename("./data/13/acc_walking_2_thigh.csv", "./data/13/acc_walking_thigh.csv")
os.rename("./data/4/Gyroscope_walking_2_thigh.csv", "./data/4/Gyroscope_walking_thigh.csv")
os.rename("./data/6/Gyroscope_sitting_2_thigh.csv", "./data/6/Gyroscope_sitting_thigh.csv")
os.rename("./data/7/Gyroscope_sitting_2_thigh.csv", "./data/7/Gyroscope_sitting_thigh.csv")
os.rename("./data/8/Gyroscope_standing_2_thigh.csv", "./data/8/Gyroscope_standing_thigh.csv")
os.rename("./data/13/Gyroscope_walking_2_thigh.csv", "./data/13/Gyroscope_walking_thigh.csv")
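The ten renames above all follow one pattern, so they could also be derived from the file names. The sketch below is one possible way to do that, assuming the only irregularity is a "_2" directly before "_thigh.csv". The helper `strip_version_suffix` is hypothetical, and the demo runs in a temporary folder so the real `./data` tree is not touched.

```python
import os
import re
import tempfile


def strip_version_suffix(filename):
    """Drop a '_2' version marker that sits just before '_thigh.csv'."""
    # hypothetical helper; only handles the '_2' pattern seen in this dataset
    return re.sub(r"_2(?=_thigh\.csv$)", "", filename)


# demo in a temporary folder with two empty stand-in files
with tempfile.TemporaryDirectory() as folder:
    for name in ["acc_walking_2_thigh.csv", "Gyroscope_sitting_thigh.csv"]:
        open(os.path.join(folder, name), "w").close()
    for name in os.listdir(folder):
        new_name = strip_version_suffix(name)
        if new_name != name:  # leave regularly named files alone
            os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
    renamed = sorted(os.listdir(folder))

print(renamed)
```

Because files without the suffix are left alone, this version can be rerun safely, whereas the explicit `os.rename` calls fail on a second run once the "_2" files no longer exist.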

The code below counts the number of observations for each subject/activity/sensor combination so that the two sensors can be compared. The counts turned out not to be the same for each sensor, so I included checks and adjustments in the later code where I actually merge the files.

In [49]:
# count number of observations by subject, activity, and sensor
record_counts = []  # initialize empty list
activities =["climbingdown","climbingup","jumping","lying","running","sitting","standing","walking"]
# iterate over subjects
for subject in range(1,16):
    # iterate over the separate activity files
    for i, activity in enumerate(activities):
        if subject==1 and activity=="sitting":
            continue # missing data for this subject / activity
        if subject==6 and activity=="jumping":
            continue # missing data for this subject / activity
        g_filename = "./data/"+str(subject)+"/Gyroscope_"+activity+"_thigh.csv"
        a_filename = "./data/"+str(subject)+"/acc_"+activity+"_thigh.csv"
        with open(g_filename) as f:
            gyr_count = sum(1 for line in f)
        with open(a_filename) as f:
            acc_count = sum(1 for line in f)
        record_counts.append({'subject':subject,
                              'activity':activity,
                              'gyr_count':gyr_count,
                              'acc_count':acc_count})
records=pd.DataFrame(record_counts)
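Once the counts are in a dataframe, the mismatches can be listed directly. The sketch below uses a toy stand-in for `records` with invented counts; the column name `diff` is an assumption added for illustration.

```python
import pandas as pd

# toy stand-in for the `records` frame built above (counts are invented)
toy_records = pd.DataFrame({
    "subject": [1, 1, 2],
    "activity": ["walking", "lying", "walking"],
    "gyr_count": [1000, 950, 1200],
    "acc_count": [1000, 953, 1198],
})

# positive diff: the accelerometer file is longer; negative: the gyroscope file is longer
toy_records["diff"] = toy_records["acc_count"] - toy_records["gyr_count"]
mismatched = toy_records[toy_records["diff"] != 0]

print(mismatched)
```

The line counts include each file's header row, but since both files have one, the difference is unaffected.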

The code below merges all of the relevant CSVs into a single dataframe, and at the same time creates a sample label for each consecutive block of 100 observations (representing 2 seconds). The samples are labeled within each subject/activity combination, and any remainder (fewer than 100 observations) is dropped. We also test whether the two sensors have the same number of observations for each subject/activity, and if not, we drop the extra readings from the end of the longer file so that the merged data has no missing values. In general, the number of observations for each sensor was very close (although not always identical), so very little data is lost.

In [80]:
# merge all activities into 1 df, for subjects 1-15, with gyroscope & accelerometer sensors combined
frames = []  # collect one df per subject/activity and concatenate once at the end
windowsize = 100 # use 2 sec windows
last_sample = 0
activities =["climbingdown","climbingup","jumping","lying","running","sitting","standing","walking"]
# iterate over subjects
for subject in range(1,16):
    # iterate over the separate activity files
    for i, activity in enumerate(activities):
        if subject==6 and activity=="jumping":
            continue # missing data for subject 6 acc data from thigh
        g_filename = "./data/"+str(subject)+"/Gyroscope_"+activity+"_thigh.csv"
        a_filename = "./data/"+str(subject)+"/acc_"+activity+"_thigh.csv"
        gyr = pd.read_csv(g_filename, index_col=0)
        acc = pd.read_csv(a_filename, index_col=0)
        # make sure the different sensor files have the same number of readings
        # by dropping the extra rows from the end of the longer file
        n_rows = min(len(gyr), len(acc))
        gyr = gyr.iloc[:n_rows]
        acc = acc.iloc[:n_rows]
        df = pd.merge(gyr,acc,right_index=True,left_index=True,suffixes=('_gyr','_acc'))
        df['activity']=activity
        df['label']=i
        df['subject']=subject
        # keep only whole windows; an explicit end index avoids df[:-0], which would
        # return an empty frame when the length is an exact multiple of windowsize
        num_samples = len(df)//windowsize  # the number of samples of size windowsize for the activity
        df = df.iloc[:num_samples*windowsize].copy()
        # label the samples within each activity file in one assignment
        # (this also avoids the SettingWithCopyWarning from chained indexing)
        df['sample_num'] = last_sample + np.arange(len(df))//windowsize
        last_sample += num_samples # label the samples consecutively across activities
        frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
thigh = pd.concat(frames, ignore_index=True)
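The windowing rule (100 consecutive rows per sample, incomplete tail dropped) can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the frame `toy` and its column `x` are made up, and integer division on the row position is one of several ways to assign the labels.

```python
import numpy as np
import pandas as pd

windowsize = 100
toy = pd.DataFrame({"x": np.arange(250)})  # 250 rows: two full windows plus 50 left over

# keep only whole windows; an explicit end index also works when nothing is left over
usable = len(toy) - len(toy) % windowsize
toy = toy.iloc[:usable].copy()

# integer division maps row positions 0..99 to window 0, 100..199 to window 1, and so on
toy["sample_num"] = np.arange(len(toy)) // windowsize

print(len(toy), toy["sample_num"].nunique())
```

Assigning the whole column at once on an explicit copy also avoids the chained-indexing pattern that triggers pandas' SettingWithCopyWarning.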


Divide the data into Training and Holdout datasets.

In [None]:
training = thigh[thigh.subject.between(1,10)] 
holdout = thigh[thigh.subject.between(11,15)] 
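Since the split is by subject, a quick sanity check is that no subject appears on both sides. The sketch below shows the idea on a toy frame with one row per subject; the names `toy`, `train_part`, and `holdout_part` are placeholders, not variables from the notebook.

```python
import pandas as pd

# toy stand-in for `thigh`: one row per subject is enough to show the split
toy = pd.DataFrame({"subject": range(1, 16), "label": 0})

train_part = toy[toy.subject.between(1, 10)]     # between() is inclusive on both ends
holdout_part = toy[toy.subject.between(11, 15)]

# subjects that appear in both parts (should be empty)
shared = set(train_part.subject) & set(holdout_part.subject)

print(len(train_part), len(holdout_part), shared)
```

Keeping each subject entirely on one side means the holdout score reflects performance on people the model has never seen.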

Save the Training and Holdout Datasets ('100' refers to the window size).

In [81]:
training.to_pickle('./data/thigh100.pkl')

In [None]:
holdout.to_pickle('./data/thigh_validate100.pkl')