# Getting and Cleaning Data Course Project
___

The purpose of this project is to demonstrate your ability to collect, work with, and clean a data set.

## Review criteria
1. The submitted data set is tidy.
2. The GitHub repo contains the required scripts.
3. GitHub contains a code book that modifies and updates the available codebooks with the data to indicate all the variables and summaries calculated, along with units, and any other relevant information.
4. The README that explains the analysis files is clear and understandable.
5. The work submitted for this project is the work of the student who submitted it.

## Getting and Cleaning Data Course Project
The goal is to prepare tidy data that can be used for later analysis. You will be graded by your peers on a series of yes/no questions related to the project. You will be required to submit: 1) a tidy data set as described below, 2) a link to a GitHub repository with your script for performing the analysis, and 3) a code book called CodeBook.md that describes the variables, the data, and any transformations or work that you performed to clean up the data. You should also include a README.md in the repo with your scripts. The README explains how all of the scripts work and how they are connected.

One of the most exciting areas in all of data science right now is wearable computing (see, for example, this article). Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data linked from the course website were collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data were obtained:

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Here are the data for the project:

https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

You should create one R script called run_analysis.R that does the following.

1. Merges the training and the test sets to create one data set.
2. Extracts only the measurements on the mean and standard deviation for each measurement.
3. Uses descriptive activity names to name the activities in the data set.
4. Appropriately labels the data set with descriptive variable names.
5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Good luck!

In [70]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [1]:
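# Make sure the target directory exists before downloading into it
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/course_project_files.zip")) {
    # mode = "wb" keeps the zip archive intact on Windows
    download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", 
                 "data/course_project_files.zip", mode = "wb")
}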


In [2]:
unzip("data/course_project_files.zip", exdir="data")

In [3]:
dir("data/UCI HAR Dataset/")

## Review the features file
The `features.txt` file lists the 561 features, which are the columns of both `X_train.txt` and `X_test.txt`. The `X_train.txt` file is large, so it is cheaper to decide which columns we want first and *then* read only those columns.

In [195]:
features <- readLines("data/UCI HAR Dataset/features.txt")

In [196]:
strsplit(features[[1]], " ")[[1]][[2]]

In [197]:
features <- sapply(features, function(x) { strsplit(x, " ")[[1]][[2]]})

In [198]:
meanandstd <- grepl(features, pattern = "(mean\\(\\)|std\\(\\))")
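As a rough illustration of what this pattern keeps, the sketch below applies it to four sample feature names. The names `sample.features` and `keep` are only for this example and are not part of the project code. Because the pattern requires a literal `()` straight after `mean` or `std`, names such as `meanFreq()` and the `angle(...)` features are left out.

```r
# Sample feature names, chosen for illustration
sample.features <- c("tBodyAcc-mean()-X",
                     "tBodyAcc-std()-Y",
                     "fBodyAcc-meanFreq()-X",
                     "angle(tBodyAccMean,gravity)")

# Same pattern as above: "mean()" or "std()" with literal parentheses
keep <- grepl("(mean\\(\\)|std\\(\\))", sample.features)

print(data.frame(feature = sample.features, keep = keep))
```

Whether `meanFreq()` should count as a "mean" measurement is a judgment call; this notebook excludes it.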

In [199]:
colnames <- features[meanandstd]

In [200]:
colnames

In [201]:
colwidths <- sapply(X = meanandstd, FUN = function(x) { ifelse(x, 16, -16)})
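The column-skipping trick relies on `read.fwf` treating a negative width as "skip this many characters". A minimal sketch on a toy file, assuming three 16-character columns per row (the file, its values, and the names `tmp`, `keep`, `widths`, and `toy` are made up for this example):

```r
# Write a toy fixed-width file: three 16-character numeric columns per row
tmp <- tempfile(fileext = ".txt")
rows <- c(
  paste0(sprintf("%16.7e", 0.1), sprintf("%16.7e", 0.2), sprintf("%16.7e", 0.3)),
  paste0(sprintf("%16.7e", 1.1), sprintf("%16.7e", 1.2), sprintf("%16.7e", 1.3))
)
writeLines(rows, tmp)

# Keep the first and third columns, skip the second
keep <- c(TRUE, FALSE, TRUE)
widths <- ifelse(keep, 16, -16)  # negative width = skip that column

toy <- read.fwf(tmp, widths = widths, col.names = c("first", "third"))
print(toy)
```

Note that `ifelse` is already vectorised, so it can be applied to the whole logical vector without `sapply`.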

## Reading in data

### Assemble Training Data

In [202]:
x.training <- read.fwf("data/UCI HAR Dataset/train/X_train.txt", widths=colwidths, n = -1, col.names = colnames)

In [61]:
y.training <- readLines("data/UCI HAR Dataset/train/y_train.txt")

In [59]:
subject.training <- readLines("data/UCI HAR Dataset/train/subject_train.txt")

In [68]:
training <- cbind(x.training, y.training, subject.training)

In [85]:
training <- training %>% rename(activity = y.training, subject = subject.training)


### Assemble test data

In [91]:
x.test <- read.fwf("data/UCI HAR Dataset/test/X_test.txt", widths=colwidths, n = -1, col.names = colnames)

In [88]:
y.test <- readLines("data/UCI HAR Dataset/test/y_test.txt")

In [89]:
subject.test <- readLines("data/UCI HAR Dataset/test/subject_test.txt")

In [92]:
test <- cbind(x.test, y.test, subject.test)

In [93]:
test <- test %>% rename(activity = y.test, subject = subject.test)

### Merge Data

In [95]:
total.data <- merge(training, test, all = TRUE)
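With `all = TRUE` and no `by` argument, `merge` performs a full outer join on every shared column, which also sorts the rows. Since the two data frames have identical columns, simply stacking them with `rbind` is a common alternative that should give the same set of observations while preserving row order. A minimal sketch on toy data (the data frames and their values are hypothetical):

```r
# Two toy data frames with identical columns, standing in for training and test
toy.training <- data.frame(x = c(0.3, 0.1),
                           activity = c("1", "2"),
                           subject = c("1", "3"))
toy.test <- data.frame(x = 0.2,
                       activity = "1",
                       subject = "2")

# Stack the rows; the original order is preserved
stacked <- rbind(toy.training, toy.test)
print(stacked)
```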

In [97]:
names(total.data)

In [99]:
col.names <- names(total.data)

## Rename variables

The desired format is:
* Replace a leading `t` / `f` with `time` / `frequency`
* Separate words with periods
* Replace the duplicated `BodyBody` with `Body`
* Expand `Acc` to `acceleration`
* Move `mean` and `std` to the end of the name

In [103]:
new.col.names <- sub("BodyBody", "Body", col.names)

In [104]:
new.col.names

In [109]:
new.col.names <- gsub("([A-Z])", "\\.\\1", new.col.names)

In [126]:
head(new.col.names)

In [150]:
new.col.names <- gsub("(mean)(.*)", "\\2.mean", new.col.names)

In [152]:
new.col.names <- gsub("(std)(.*)", "\\2.std", new.col.names)

In [158]:
new.col.names <- gsub("(\\.+)", "\\.", new.col.names)

In [160]:
new.col.names <- gsub("(^t)", "time", new.col.names)

In [162]:
new.col.names <- gsub("(^f)", "frequency", new.col.names)

In [166]:
new.col.names <- gsub("(Acc)", "acceleration", new.col.names)

In [167]:
head(new.col.names)

In [169]:
tolower(new.col.names)

In [170]:
names(total.data) <- new.col.names
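To see the whole chain in one place, the steps above can be wrapped in a small helper and tried on a couple of names. This is only a sketch: `tidy.name` is a hypothetical function name, `make.names` stands in for the name mangling that `read.fwf` applies to `col.names`, and the final `tolower` step is included here even though the cell above does not assign its result.

```r
# Hypothetical helper bundling the renaming steps from the cells above
tidy.name <- function(x) {
  x <- sub("BodyBody", "Body", x)            # fix the duplicated word
  x <- gsub("([A-Z])", ".\\1", x)            # period before each capital
  x <- gsub("(mean)(.*)", "\\2.mean", x)     # move 'mean' to the end
  x <- gsub("(std)(.*)", "\\2.std", x)       # move 'std' to the end
  x <- gsub("\\.+", ".", x)                  # collapse runs of periods
  x <- gsub("^t", "time", x)                 # leading t -> time
  x <- gsub("^f", "frequency", x)            # leading f -> frequency
  x <- gsub("Acc", "acceleration", x)        # expand the abbreviation
  tolower(x)
}

# make.names() turns "-", "(" and ")" into periods
raw <- make.names(c("tBodyAcc-mean()-X", "fBodyBodyGyroJerkMag-std()", "activity"))
renamed <- tidy.name(raw)
print(renamed)
```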

## Replace activity numbers with activity names

In [180]:
# revalue() comes from plyr; the namespace prefix avoids relying on plyr being attached
total.data$activity <- plyr::revalue(x = total.data$activity, 
                             c("1" = "walking", "2" = "walkingupstairs", "3" = "walkingdownstairs", 
                               "4" = "sitting", "5" = "standing", "6" = "laying"))

In [181]:
head(total.data)

time.Body.acceleration.X.mean,time.Body.acceleration.Y.mean,time.Body.acceleration.Z.mean,time.Body.acceleration.X.std,time.Body.acceleration.Y.std,time.Body.acceleration.Z.std,time.Gravity.acceleration.X.mean,time.Gravity.acceleration.Y.mean,time.Gravity.acceleration.Z.mean,time.Gravity.acceleration.X.std,⋯,frequency.Body.acceleration.Mag.mean,frequency.Body.acceleration.Mag.std,frequency.Body.acceleration.Jerk.Mag.mean,frequency.Body.acceleration.Jerk.Mag.std,frequency.Body.Gyro.Mag.mean,frequency.Body.Gyro.Mag.std,frequency.Body.Gyro.Jerk.Mag.mean,frequency.Body.Gyro.Jerk.Mag.std,activity,subject
-1.0,0.17752216,0.54393929,-0.10078555,-0.12621085,0.35953802,0.26331618,0.3709743,0.6252615,0.783028,⋯,0.25436641,1.0,-0.8480197,-0.7811734,-0.7187075,-0.7191602,-0.8092249,-0.7377875,laying,14
-0.8723954,0.1546078,0.33075342,-0.0606394,-0.30688449,0.06857798,0.15796616,0.3996528,0.6992262,0.6999877,⋯,0.08208614,0.859717,-0.76514,-0.6977112,-0.6650375,-0.5784044,-0.8033907,-0.7047181,laying,14
-0.8538482,0.20536516,-0.11634455,-0.05714196,-0.07570973,-0.30853842,-0.20846369,0.7500424,0.4937287,1.0,⋯,0.22740813,0.5027986,-0.6589593,-0.6846475,-0.6977197,-0.7486869,-0.7606992,-0.7924713,laying,6
-0.5920043,0.14698327,0.05256077,-0.42436336,-0.22019354,-0.70417659,0.09429582,0.7600206,0.4341949,0.3319481,⋯,-0.10971764,0.1462056,-0.7460609,-0.7711571,-0.7518159,-0.8552519,-0.7221136,-0.7678528,laying,20
-0.5210621,-0.0001832748,0.10661589,-0.34411877,-0.49510932,-0.46239833,-0.07149524,0.547648,0.7526245,0.4638767,⋯,-0.07171502,0.1243441,-0.8287394,-0.8286772,-0.8310883,-0.8378409,-0.8356959,-0.8195921,laying,12
-0.5038227,-0.59420738,0.26480435,-0.70340175,0.6724869,-0.46498511,-0.60124091,0.7360639,0.6609761,-0.4146724,⋯,-0.20493853,0.6020663,-0.9072344,-0.8940982,-0.7859358,-0.4291497,-0.9441093,-0.9527274,laying,25
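If `plyr` is not available, the same relabelling can be sketched in base R with `factor`, assuming the activity codes are the strings `"1"` to `"6"` as read by `readLines`. The vectors `activity.labels`, `codes`, and `named` below are made up for this example.

```r
# Labels in code order, matching the mapping used above
activity.labels <- c("walking", "walkingupstairs", "walkingdownstairs",
                     "sitting", "standing", "laying")

# A few example codes, as character strings
codes <- c("1", "6", "3")

# Map each code to its label via factor levels
named <- factor(codes, levels = as.character(1:6), labels = activity.labels)
print(as.character(named))
```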


## Generate data set with the average of each variable for each activity and each subject

In [186]:
library(reshape2)

In [188]:
melted.data <- melt(total.data, id=c("subject", "activity"))

In [192]:
dcast(melted.data, activity ~ variable, mean)

activity,time.Body.acceleration.X.mean,time.Body.acceleration.Y.mean,time.Body.acceleration.Z.mean,time.Body.acceleration.X.std,time.Body.acceleration.Y.std,time.Body.acceleration.Z.std,time.Gravity.acceleration.X.mean,time.Gravity.acceleration.Y.mean,time.Gravity.acceleration.Z.mean,⋯,frequency.Body.Gyro.Y.std,frequency.Body.Gyro.Z.std,frequency.Body.acceleration.Mag.mean,frequency.Body.acceleration.Mag.std,frequency.Body.acceleration.Jerk.Mag.mean,frequency.Body.acceleration.Jerk.Mag.std,frequency.Body.Gyro.Mag.mean,frequency.Body.Gyro.Mag.std,frequency.Body.Gyro.Jerk.Mag.mean,frequency.Body.Gyro.Jerk.Mag.std
walking,0.2763369,-0.01790683,-0.1088817,-0.3146445,-0.02358295,-0.2739208,0.9349916,-0.1967135,-0.05382512,⋯,-0.3319615,-0.4105691,-0.2755581,-0.48000228,-0.214653972,-0.22161777,-0.409173,-0.4738331,-0.5155168,-0.5144048
walkingupstairs,0.2622946,-0.02592329,-0.1205379,-0.2379897,-0.01603251,-0.1754497,0.8750034,-0.2813772,-0.14079567,⋯,-0.2931818,-0.2920413,-0.2620281,-0.36175346,-0.353962,-0.43420673,-0.4497814,-0.3814064,-0.6586945,-0.7030835
walkingdownstairs,0.2881372,-0.01631193,-0.1057616,0.1007663,0.05954862,-0.1908045,0.9264574,-0.1685072,-0.0479709,⋯,-0.3618537,-0.38041,0.1428494,-0.07542517,0.004762459,-0.04227142,-0.2895258,-0.361231,-0.4380073,-0.486443
sitting,0.2730596,-0.01268957,-0.105517,-0.9834462,-0.93488056,-0.9389816,0.8797312,0.1087135,0.15377409,⋯,-0.9640337,-0.9610302,-0.9524104,-0.94200015,-0.97868437,-0.97815477,-0.9642961,-0.9516417,-0.9853356,-0.9844914
standing,0.2791535,-0.01615189,-0.1065869,-0.9844347,-0.93250871,-0.9399135,0.9414796,-0.1842465,-0.01405196,⋯,-0.9594986,-0.9606892,-0.9558681,-0.94960161,-0.971090424,-0.97094797,-0.9479085,-0.9306367,-0.974886,-0.9734611
laying,0.2686486,-0.01831773,-0.1074356,-0.9609324,-0.94350719,-0.9480693,-0.3750213,0.6222704,0.55561247,⋯,-0.9613654,-0.9667252,-0.9476727,-0.93491667,-0.974300115,-0.9731834,-0.9548545,-0.9421157,-0.9779682,-0.9766482


In [190]:
dcast(melted.data, subject ~ variable, mean)

subject,time.Body.acceleration.X.mean,time.Body.acceleration.Y.mean,time.Body.acceleration.Z.mean,time.Body.acceleration.X.std,time.Body.acceleration.Y.std,time.Body.acceleration.Z.std,time.Gravity.acceleration.X.mean,time.Gravity.acceleration.Y.mean,time.Gravity.acceleration.Z.mean,⋯,frequency.Body.Gyro.Y.std,frequency.Body.Gyro.Z.std,frequency.Body.acceleration.Mag.mean,frequency.Body.acceleration.Mag.std,frequency.Body.acceleration.Jerk.Mag.mean,frequency.Body.acceleration.Jerk.Mag.std,frequency.Body.Gyro.Mag.mean,frequency.Body.Gyro.Mag.std,frequency.Body.Gyro.Jerk.Mag.mean,frequency.Body.Gyro.Jerk.Mag.std
1,0.2656969,-0.01829817,-0.1078457,-0.5457953,-0.3677162,-0.5026457,0.7448674,-0.08255626,0.07233987,⋯,-0.4298258,-0.6504762,-0.4784485,-0.5897102,-0.4990758,-0.5418231,-0.5350028,-0.5665767,-0.6459707,-0.6858113
11,0.2765853,-0.01912725,-0.1089418,-0.5894765,-0.4903793,-0.6653243,0.7305262,0.0527387,0.14020986,⋯,-0.6708927,-0.7559606,-0.6115281,-0.6744702,-0.6561335,-0.6769251,-0.7575859,-0.7112672,-0.864827,-0.879254
14,0.2701846,-0.01625482,-0.1009859,-0.6116711,-0.3747308,-0.2935427,0.6720812,-0.11170139,-0.16117973,⋯,-0.3488642,-0.4076721,-0.4908742,-0.5482699,-0.5968963,-0.6192151,-0.5430305,-0.5169305,-0.7175901,-0.7605601
15,0.2782134,-0.01646448,-0.1125636,-0.5565412,-0.4816795,-0.7057066,0.6888725,0.10266049,0.03728977,⋯,-0.6845047,-0.7516965,-0.5629484,-0.6236556,-0.6148176,-0.6431012,-0.7323945,-0.6953468,-0.8368019,-0.8496345
16,0.2778874,-0.01585679,-0.1072639,-0.6681615,-0.6499471,-0.6038199,0.7009061,-0.0605236,0.04020755,⋯,-0.8094214,-0.7569936,-0.6669625,-0.712681,-0.7131676,-0.7298694,-0.8171288,-0.8020782,-0.8727723,-0.8865332
17,0.2740295,-0.0175416,-0.1091999,-0.6084552,-0.5670053,-0.6605828,0.6989374,-0.0137347,-0.01770424,⋯,-0.7613352,-0.7451015,-0.6461892,-0.6875369,-0.6975756,-0.6925958,-0.7694986,-0.7822325,-0.8291245,-0.833701
19,0.2697235,-0.01820315,-0.1182772,-0.5746589,-0.5070351,-0.6491847,0.475323,0.09640526,0.2330114,⋯,-0.6501054,-0.6947291,-0.5564044,-0.6898609,-0.5760646,-0.641089,-0.636288,-0.662326,-0.719519,-0.7542237
21,0.2774665,-0.01766646,-0.1087785,-0.6723239,-0.5655852,-0.6696218,0.645776,-0.07509631,0.03218518,⋯,-0.7713883,-0.7283193,-0.6586647,-0.7143457,-0.6797426,-0.6837251,-0.7764731,-0.7931585,-0.826275,-0.8349499
22,0.2747677,-0.01682736,-0.1086704,-0.54609,-0.4911884,-0.6550897,0.6093909,0.03150987,0.23588786,⋯,-0.7820461,-0.6906794,-0.5743332,-0.6738962,-0.5854425,-0.5992869,-0.7281668,-0.6996001,-0.830192,-0.8355972
23,0.2734933,-0.01958926,-0.109094,-0.622594,-0.5320785,-0.4505572,0.6623247,0.01462092,0.05186625,⋯,-0.5078505,-0.644131,-0.5290293,-0.6570489,-0.5426633,-0.5838323,-0.5346572,-0.6322885,-0.584111,-0.5937353


In [194]:
dcast(melted.data, activity + subject~ variable, mean)

activity,subject,time.Body.acceleration.X.mean,time.Body.acceleration.Y.mean,time.Body.acceleration.Z.mean,time.Body.acceleration.X.std,time.Body.acceleration.Y.std,time.Body.acceleration.Z.std,time.Gravity.acceleration.X.mean,time.Gravity.acceleration.Y.mean,⋯,frequency.Body.Gyro.Y.std,frequency.Body.Gyro.Z.std,frequency.Body.acceleration.Mag.mean,frequency.Body.acceleration.Mag.std,frequency.Body.acceleration.Jerk.Mag.mean,frequency.Body.acceleration.Jerk.Mag.std,frequency.Body.Gyro.Mag.mean,frequency.Body.Gyro.Mag.std,frequency.Body.Gyro.Jerk.Mag.mean,frequency.Body.Gyro.Jerk.Mag.std
walking,1,0.2773308,-0.01738382,-0.11114810,-0.283740259,0.11446134,-0.26002790,0.9352232,-0.28216502,⋯,-0.03350816,-0.4365622,-0.128623451,-0.3980326,-0.05711940,-0.10349240,-0.19925257,-0.32101795,-0.3193086,-0.38160191
walking,11,0.2718219,-0.01664758,-0.10609630,-0.422842072,-0.05221208,-0.53062631,0.9464685,-0.21204587,⋯,-0.45366318,-0.4985546,-0.458514063,-0.5844560,-0.39511613,-0.41744682,-0.61022263,-0.59974544,-0.7519145,-0.77357406
walking,14,0.2719596,-0.02177854,-0.10675637,-0.402639098,-0.05361267,0.05188410,0.8029740,-0.27031482,⋯,0.08329506,-0.1802329,-0.373516771,-0.4574900,-0.43159145,-0.45342601,-0.31138645,-0.26550455,-0.5910369,-0.65567766
walking,15,0.2738992,-0.01708097,-0.10762182,-0.327961801,0.13891292,-0.51893263,0.9515244,-0.23391828,⋯,-0.36536121,-0.5439976,-0.275156242,-0.4583448,-0.28975395,-0.34241193,-0.48916638,-0.48175022,-0.6450752,-0.65401015
walking,16,0.2760236,-0.02042869,-0.10880405,-0.404692524,-0.31456976,-0.15979979,0.9259813,-0.06682465,⋯,-0.61194088,-0.4032198,-0.410463242,-0.5921510,-0.37827849,-0.44144775,-0.66922854,-0.72113700,-0.7206823,-0.74572685
walking,17,0.2723419,-0.01848754,-0.10979212,-0.319500173,-0.01757979,-0.26582449,0.9281124,-0.17992738,⋯,-0.43923879,-0.4387687,-0.429380043,-0.5651625,-0.33593889,-0.21123186,-0.51861806,-0.61532008,-0.5827119,-0.52038548
walking,19,0.2739312,-0.01917736,-0.12273667,-0.048904941,0.18180156,-0.13947794,0.9352226,-0.22333810,⋯,-0.13301029,-0.2202712,0.032227230,-0.3247934,0.07301799,-0.02228696,-0.09524605,-0.14159840,-0.2624428,-0.33614873
walking,21,0.2791835,-0.01816103,-0.10431933,-0.297814844,0.05408805,-0.16868873,0.8625248,-0.37098455,⋯,-0.41940088,-0.3855054,-0.251266389,-0.4381498,-0.14400989,-0.01867344,-0.40951608,-0.51224934,-0.4995972,-0.42932347
walking,22,0.2788646,-0.01672136,-0.10711251,-0.008659219,0.10038425,-0.21335197,0.9360964,-0.25964494,⋯,-0.48162441,-0.2531739,-0.024158496,-0.3813993,0.06597273,0.15828675,-0.43837872,-0.46401913,-0.5539461,-0.47640527
walking,23,0.2732119,-0.01836187,-0.11338299,-0.313521773,-0.11902059,0.16422069,0.9398097,-0.16176137,⋯,0.28658152,-0.2878209,-0.094260631,-0.4075775,-0.01807057,-0.04573723,0.18507821,-0.06147658,0.1466186,0.28783462
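The same per-activity, per-subject averages can also be sketched without `reshape2`, using base R's `aggregate`, and then written to a text file for submission. This is an alternative technique rather than the approach used in this notebook; the toy data and the names `toy`, `toy.tidy`, and `out` are hypothetical.

```r
# Toy data: one measurement column, two grouping columns
toy <- data.frame(activity = c("walking", "walking", "sitting", "sitting"),
                  subject = c("1", "1", "1", "2"),
                  x = c(1, 3, 5, 7))

# Average every remaining column for each activity/subject pair
toy.tidy <- aggregate(. ~ activity + subject, data = toy, FUN = mean)
print(toy.tidy)

# Write the result as plain text without row names
out <- tempfile(fileext = ".txt")
write.table(toy.tidy, out, row.names = FALSE)
```

Reading the file back with `read.table(out, header = TRUE)` is a quick way to check that the written table round-trips.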


In [204]:
# 1. Import relevant libraries
library(plyr)
library(dplyr)
library(reshape2)

# 2. Download and extract data
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/course_project_files.zip")) {
    # mode = "wb" keeps the zip archive intact on Windows
    download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", 
                 "data/course_project_files.zip", mode = "wb")
}
unzip("data/course_project_files.zip", exdir="data")

# 3. Merge the training and test sets
# 3a. Extract only the mean and standard deviation for each measurement
# The project description asks to extract only the mean and standard deviation *after* merging the dataset.
# However, the mean and standard deviation are a small subset of the overall data, so to minimise the machine
# workload, we'll read only them in, taking advantage of "read.fwf"'s column skipping functionality.

# Read in the features list and find the mean and standard deviation columns
features <- readLines("data/UCI HAR Dataset/features.txt")
features <- sapply(features, function(x) { strsplit(x, " ")[[1]][[2]]})
meanandstd <- grepl(features, pattern = "(mean\\(\\)|std\\(\\))")

# If there is "mean()"/"std()", column width is +16 (i.e. read it in); otherwise -16 (i.e. skip it)
colwidths <- ifelse(meanandstd, 16, -16)
colnames <- features[meanandstd]

# Read in training data
x.training <- read.fwf("data/UCI HAR Dataset/train/X_train.txt", widths=colwidths, n = -1, col.names = colnames)
y.training <- readLines("data/UCI HAR Dataset/train/y_train.txt")
subject.training <- readLines("data/UCI HAR Dataset/train/subject_train.txt")

# Combine training data to form a single data.frame
training <- cbind(x.training, y.training, subject.training)

# Rename subject and activity to allow for merging with test data set.
# Call dplyr::rename explicitly: if plyr is attached after dplyr, plyr::rename masks it
# and fails with "unused arguments".
training <- training %>% dplyr::rename(activity = y.training, subject = subject.training)

# Read in testing data
x.test <- read.fwf("data/UCI HAR Dataset/test/X_test.txt", widths=colwidths, n = -1, col.names = colnames)
y.test <- readLines("data/UCI HAR Dataset/test/y_test.txt")
subject.test <- readLines("data/UCI HAR Dataset/test/subject_test.txt")

# Combine testing data to form a single data.frame
test <- cbind(x.test, y.test, subject.test)

# Rename subject and activity to allow for merging with training data set
test <- test %>% dplyr::rename(activity = y.test, subject = subject.test)

# Generate the total data set
total.data <- merge(training, test, all = TRUE)

# 4. Use descriptive activity names to name the activities in the data set
total.data$activity <- plyr::revalue(x = total.data$activity, 
                             c("1" = "walking", "2" = "walkingupstairs", "3" = "walkingdownstairs", 
                               "4" = "sitting", "5" = "standing", "6" = "laying"))

# 5. Appropriately label the data set with descriptive variable names
# The naming principles used are detailed in the code book and readme. 
# The comments below detail what each step is accomplishing.
col.names <- names(total.data)
new.col.names <- sub("BodyBody", "Body", col.names)  # Replace typo
new.col.names <- gsub("([A-Z])", "\\.\\1", new.col.names)  # Insert a period before capitals
new.col.names <- gsub("(mean)(.*)", "\\2.mean", new.col.names)  # Remove 'mean' and put it at end of name
new.col.names <- gsub("(std)(.*)", "\\2.std", new.col.names)  # Remove 'std' and put it at end of name
new.col.names <- gsub("(\\.+)", "\\.", new.col.names)  # Collapse runs of periods into a single period
new.col.names <- gsub("(^t)", "time", new.col.names)  # Replace "t" at start of name with "time"
new.col.names <- gsub("(^f)", "frequency", new.col.names)  # Replace "f" at start of name with "frequency"
new.col.names <- gsub("(Acc)", "acceleration", new.col.names)  # Replace "Acc" with "acceleration"
new.col.names <- tolower(new.col.names)  # Make all names lower case

# Apply revised names
names(total.data) <- new.col.names

# 6. Create a second, independent tidy data set with the average of each variable for each activity and each subject
melted.data <- melt(total.data, id=c("subject", "activity"))
tidy.data <- dcast(melted.data, activity + subject ~ variable, mean)
