misc text and formatting updates
mollybostic committed Jan 25, 2015
1 parent 191afa0 commit 3f884f7
Showing 2 changed files with 9 additions and 8 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -42,21 +42,21 @@ The script performs the following operations to import, clean, and transform the
3. Combine the values from the **X_test** and **X_train** files to create additional variable columns, one column for each measurement and calculation included in the data set (561 variable columns total, in the initial combined data set; 563 columns including the TestSubject and Activity columns).
2. Clean up the column names to remove hyphens and parentheses and replace them with periods.
2. Extract only the measurements on the mean and standard deviation for each measurement.
1. Use the dplyr select method to create a subset of the data that only includes columns that have ".mean." and ".std." in their column names.
1. Use the dplyr **select** function to create a subset of the data that only includes columns that have ".mean." and ".std." in their column names.
2. It's not required for the subset, but at this point the script also converts the test subject and activity columns to factors, to facilitate correct calculations later.
3. Use descriptive activity names to name the activities in the data set.
1. Use the mapvalues function to map the numeric activity values to descriptive names like "Walking" and "Standing."
1. Use the **mapvalues** function to map the numeric activity values to descriptive names like "Walking" and "Standing."
2. Appropriately label the data set with descriptive variable names.
1. Use the str_replace_all function from the stringr library to do a number of find and replace operations on the column names. The details of the resulting descriptive names are included in [codebook.md](./codebook.md).
1. Use the **str_replace_all** function from the stringr library to do a number of find and replace operations on the column names. The details of the resulting descriptive names are included in [codebook.md](./codebook.md).
2. From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject.
1. Use split/apply/combine logic. First, split the data by the subject and activity factors using the split method.
2. Next, use lapply to iterate over each item in the resulting list, and use apply to calculate apply the mean method to calculate the average of each column.
1. Use split/apply/combine logic. First, split the data by the subject and activity factors using the **split** method.
2. Next, use **lapply** to iterate over each item in the resulting list, and use **apply** to apply the **mean** method to calculate the average of each column.
3. The output of lapply is a list, so combine it back to a data frame.
4. Use strsplit to break the subject and activity factors back into separate sets, and use cbind to properly bind them as the first columns in the resulting data set.
4. Use **strsplit** to break the subject and activity factors back into separate sets, and use **cbind** to properly bind them as the first columns in the resulting data set.
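
The split/apply/combine step described in the list above can be sketched roughly as follows. This is a minimal illustration on a made-up toy data frame, not the code from run_analysis.R: the `MeasureA` and `MeasureB` columns and all values are hypothetical, and it assumes activity names contain no periods, since the group names are split on ".".

```r
# Toy stand-in for the combined data set (column names and values are hypothetical).
toy <- data.frame(
  TestSubject = factor(c(1, 1, 2, 2, 1, 2)),
  Activity    = factor(c("Walking", "Walking", "Walking",
                         "Standing", "Standing", "Standing")),
  MeasureA    = c(1, 3, 5, 7, 9, 11),
  MeasureB    = c(2, 4, 6, 8, 10, 12)
)

# Split: one data frame per subject/activity combination.
# The resulting list is named "subject.activity", e.g. "1.Walking".
groups <- split(toy[, c("MeasureA", "MeasureB")],
                list(toy$TestSubject, toy$Activity), drop = TRUE)

# Apply: compute the mean of each column within each group.
means <- lapply(groups, function(g) apply(g, 2, mean))

# Combine: stack the per-group vectors back into a data frame.
combined <- as.data.frame(do.call(rbind, means))

# Recover subject and activity from the group names (assumes no "." in activity names).
keys <- do.call(rbind, strsplit(names(means), ".", fixed = TRUE))
tidy <- cbind(TestSubject = factor(keys[, 1]),
              Activity    = factor(keys[, 2]),
              combined)
rownames(tidy) <- NULL
```

Each row of `tidy` then holds the averages for one subject/activity pair, with the two identifying columns first.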

## Verifying the calculations in run_analysis.R

I love the way that R commands can simplify calculations over data frames and lists into just a few lines of code, but since I'm not an experienced R programmer I had concerns about whether my calculations were producing correct results. I verified the results in the Data Verification section I added to the R script. This section selects two subsets of data for individual combinations of subjects and activities, calculates the mean for each subset, and compares the result to the result for the same variables in the tidy data set.
I love the way that R commands can simplify calculations over data frames and lists into just a few lines of code, but since I'm not an experienced R programmer I had concerns about whether my calculations were producing correct results. I verified the results in the **Data Verification** section I added to the R script. This section selects two subsets of data for individual combinations of subjects and activities, calculates the mean for each subset, and compares the result to the result for the same variables in the tidy data set.
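
A check of this kind might look something like the sketch below. It is only an illustration on hypothetical toy data: the `MeasureA` column is made up, and `aggregate` stands in for whatever produced the tidy data set, so it should not be read as the script's actual Data Verification code.

```r
# Hypothetical toy data standing in for the full combined data set.
toy <- data.frame(
  TestSubject = factor(c(1, 1, 2, 2)),
  Activity    = factor(c("Walking", "Walking", "Standing", "Standing")),
  MeasureA    = c(1, 3, 5, 7)
)

# Stand-in for the tidy data set: mean of MeasureA per subject and activity.
tidy <- aggregate(MeasureA ~ TestSubject + Activity, data = toy, FUN = mean)

# Independent check: subset one subject/activity pair and average it directly.
subset_rows <- toy[toy$TestSubject == "1" & toy$Activity == "Walking", ]
direct_mean <- mean(subset_rows$MeasureA)

# Look up the same pair in the tidy data.
tidy_mean <- tidy$MeasureA[tidy$TestSubject == "1" & tidy$Activity == "Walking"]

# all.equal tolerates tiny floating-point differences between the two results.
check_passed <- isTRUE(all.equal(direct_mean, tidy_mean))
```

If `check_passed` is `FALSE` for any sampled pair, the grouping or averaging logic is worth a second look.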

## Special instructions for running run_analysis.R

3 changes: 2 additions & 1 deletion codebook.md
@@ -1,6 +1,7 @@
# Code book

The tidy data in [tidy_data_set.txt](./tidy_data_set.txt) can be read into R with the following code:

read.table("tidy_data_set.txt", header=TRUE, colClasses=c('factor', 'factor', rep('numeric', 66)))

## Overview
@@ -13,7 +14,7 @@ The tidy data set is a subset of this combined data that includes only measureme

## Data dictionary

The variables in this tidy data set are a subset of the variables described in the [features_info.txt](../UCI HAR Dataset/features_info.txt) file in the original data set. [features_info.txt](../UCI HAR Dataset/features_info.txt) provides a more in-depth overview of the original values and how they were calculated.
The variables in this tidy data set are a subset of the variables described in the [features_info.txt](./UCI HAR Dataset/features_info.txt) file in the original data set. [features_info.txt](./UCI HAR Dataset/features_info.txt) provides a more in-depth overview of the original values and how they were calculated.

1. **TestSubject** - A factor that identifies the volunteer participant.
>Values: integer from 1 to 30
