Final release of v0.1-3

paulusm · Oct 3, 2010 · 3fd5e3c
1 parent a4d6c91

Showing 25 changed files with 366 additions and 57 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
 build_script.sh
 upload.sh
 ProjectTemplate.Rcheck
+data_examples
diff --git a/ChangeLog b/ChangeLog
@@ -1,9 +1,20 @@
-2010-08-27  John Myles White  <jmw@johnmyleswhite.com>
+2010-10-02	John Myles White  <jmw@johnmyleswhite.com>
 
 * v0.1-3
+* Many changes to load_data.R.
 * Added notices when data sets are autoloaded.
 * Added autoload support for WSV (whitespace separated values) data files.
 * Added autoload support for RData files.
+* Added autoload support for compressed *SV files.
+* Added autoload support for *SV files available through HTTP.
+* Added autoload support for MySQL database tables.
+* Added autoload support for SPSS and Stata files.
+* Added test.project as an alias for run.tests().
+* Changed list of packages listed as dependencies, so that many are now suggestions.
+* load.project() does not autoload libraries that are not dependencies.
+* Added a sample profiling script.
+* Added a sample test that always passes to the default project.
+* Added a basic show.updates() function for porting projects to newer releases of ProjectTemplate.
 
 2010-08-26  John Myles White  <jmw@johnmyleswhite.com>
 

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,11 +1,12 @@
 Package: ProjectTemplate
 Type: Package
-Title: Automates the creation of new R statistical analysis projects
+Title: Automates the creation of new statistical analysis projects.
 Version: 0.1-3
-Date: 2010-08-29
+Date: 2010-10-02
 Author: John Myles White
 Maintainer: John Myles White <jmw@johnmyleswhite.com>
-Description: The ProjectTemplate package provides a function, create.project(), that automatically builds a directory for a new R project with a clean sub-directory structure and automatic data and library loading tools. The hope is that standardized data loading, automatic importing of best practice packages, integrated unit testing and useful nudges towards keeping a cleanly organized codebase will improve the quality of R coding.
+Description: ProjectTemplate provides functions to automatically build a directory structure for a new R project. Using this structure, ProjectTemplate is able to automate data loading, preprocessing, library importing and unit testing.
 License: Artistic-2.0
 LazyLoad: yes
-Depends: R (>= 2.7), reshape, plyr, stringr, ggplot2, testthat
+Depends: R (>= 2.7), testthat, yaml, foreign
+Suggests: reshape, plyr, stringr, ggplot2, log4r, RMySQL
diff --git a/ProjectTemplate_0.1-3.tar.gz b/ProjectTemplate_0.1-3.tar.gz
diff --git a/R/run.tests.R b/R/run.tests.R
@@ -2,3 +2,5 @@ run.tests <- function()
 {
   source('lib/run_tests.R')
 }
+
+test.project <- run.tests
diff --git a/R/show.updates.R b/R/show.updates.R
@@ -0,0 +1,21 @@
+show.updates <- function()
+{
+  default.files <- dir(system.file('defaults', package = 'ProjectTemplate'))
+  for (default.file in default.files)
+  {
+    canonical.file <- file.path(system.file('defaults', package = 'ProjectTemplate'), default.file)
+    # Quote both paths so spaces and shell metacharacters are passed safely.
+    diff.command <- paste('diff -r',
+                          shQuote(default.file),
+                          shQuote(canonical.file),
+                          '2>&1')
+    diff.output <- system(diff.command, intern = TRUE)
+    if (length(diff.output) > 0 && any(nchar(diff.output) > 0))
+    {
+      cat(paste('Your copy of', default.file, 'differs from the current ProjectTemplate version.\n'))
+      cat(paste('You might want to consider merging changes from\n'))
+      cat(paste(canonical.file, '\n', sep = ''))
+      cat('\n')
+    }
+  }
+}
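
Review note: the TODO in this commit records that `show.updates()` won't work on Windows systems, presumably because it shells out to `diff`. A minimal sketch of a shell-free comparison follows, assuming that checksum equality is an acceptable stand-in for `diff -r`. The function name `changed.files` and its two arguments are hypothetical and are not part of ProjectTemplate; `tools::md5sum()` ships with base R.

```r
# Hypothetical helper, not part of ProjectTemplate.
# Returns the paths (relative to canonical.dir) of files whose contents
# differ from, or are missing in, project.dir.
changed.files <- function(project.dir, canonical.dir) {
  # Every file shipped in the canonical defaults, searched recursively.
  relative.paths <- list.files(canonical.dir, recursive = TRUE)

  # md5sum() returns NA for files that do not exist or cannot be read.
  canonical.sums <- tools::md5sum(file.path(canonical.dir, relative.paths))
  project.sums <- tools::md5sum(file.path(project.dir, relative.paths))

  differs <- is.na(project.sums) |
    unname(project.sums) != unname(canonical.sums)
  relative.paths[differs]
}
```

Called as `changed.files('.', system.file('defaults', package = 'ProjectTemplate'))` from a project's root, this would list the default files worth merging. Unlike `diff`, it reports only which files changed, not how they changed.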
diff --git a/README.markdown b/README.markdown
@@ -16,6 +16,7 @@ For most users, running the bleeding edge version of this package is probably a
 
 # Example Code
 To create a project called `my-project`, open R and type:
+
     library('ProjectTemplate')
     create.project('my-project')
     setwd('my-project')
@@ -39,47 +40,63 @@ As far as ProjectTemplate is concerned, a good project should look like the foll
         * preprocess_data.R
         * run_tests.R
         * utilities.R
+    * logs/
     * profiling/
+        * 1.R
     * reports/
     * tests/
+        * 1.R
     * README
     * TODO
 
 To do work on such a project, enter the main directory, open R and type `source('lib/boot.R')`. This will then automatically perform the following actions:
 
-* `source('lib/load_libraries.R')`, which automatically loads the CRAN packages currently deemed best practices. At present, this list includes:
+* `source('lib/load_libraries.R')`, which automatically loads the packages required for ProjectTemplate to function. This includes:
+    * `testthat`
+    * `yaml`
+    * `foreign`
+* You can edit `lib/load_libraries.R` to automatically load the suggested packages as well, which are:
     * `reshape`
     * `plyr`
     * `stringr`
     * `ggplot2`
-    * `testthat`
+    * `log4r`
 * `source('lib/load_data.R')`, which automatically imports any CSV or TSV data files inside of the `data/` directory.
 * `source('lib/preprocess_data.R')`, which allows you to make any run-time modifications to your data sets automatically. This is blank by default.
 
 # Default Project Layout
+Within your project directory, ProjectTemplate creates the following directories and files whose purpose is explained below:
 
-Within your project directory, `ProjectTemplate` creates the following directories and files whose purpose is explained below:
-
-* `data/`: Store your raw data files here. If they are CSV or TSV files, they will automatically be loaded when you call `load.project()` or `source('lib/boot.R')`, for which `load.project()` is essentially a mnemonic.
+* `data/`: Store your raw data files here. If they are a supported file format, they will automatically be loaded when you call `load.project()` or `source('lib/boot.R')`, for which `load.project()` is essentially a mnemonic.
 * `diagnostics/`: Store any scripts you use to diagnose your data sets for corruption or problematic data points. You should also put code that globally censors any data points here.
 * `doc/`: Store documentation for your analysis here.
 * `graphs/`: Store any graphs that you produce here.
 * `lib/`: Store any files that provide useful functionality for your work, but do not constitute a statistical analysis per se here.
 * `lib/boot.R`: This script automatically loads the other files in `lib/`. Calling `load.project()` automatically loads this file.
-* `lib/load_data.R`: This script handles the automatic loading of any CSV and TSV files contained in `data/`.
-* `lib/load_libraries.R`: This script handles the automatic loading of the best practice packages, which are `reshape`, `plyr`, `stringr`, `ggplot2` and `testthat`.
-* `lib/preprocess_data.R`: This script handles the preprocessing of your data, if you need to add columns at run-time or merge normalized data sets.
+* `lib/load_data.R`: This script handles the automatic loading of any supported files contained in `data/`.
+* `lib/load_libraries.R`: This script handles the automatic loading of the required packages, which are `testthat`, `yaml` and `foreign`. In addition, you can uncomment the lines that would automatically load the suggested packages, which are `reshape`, `plyr`, `stringr`, `ggplot2` and `log4r`.
+* `lib/preprocess_data.R`: This script handles the preprocessing of your data, if you need to add columns at run-time, merge normalized data sets or perform similar operations.
 * `lib/run_tests.R`: This script automatically runs any test files contained in the `tests/` directory using the `testthat` package. Calling `run.tests()` automatically runs this script.
 * `lib/utilities.R`: This script should contain quick general purpose code that belongs in a package, but hasn't been packaged up yet.
 * `profiling/`: Store any scripts you use to benchmark and time your code here.
-* `reports/`: Store any output reports, such as HTML or LaTeX versions of tables here. Sweave documents should also go here.
+* `reports/`: Store any output reports, such as HTML or LaTeX versions of tables here. Sweave or brew documents should also go here.
 * `tests/`: Store any test cases in this directory. Your test files should use `testthat` style tests.
 * `README`: Write notes to help orient newcomers to your project.
 * `TODO`: Write a list of future improvements and bug fixes you have planned.
 
 # Automatic Data Loading
-At present, the system only understands how to autoload comma separated values (CSV), tab separated values (TSV) files and whitespace separated values (WSV) data files. For all of these, it infers the correct delimiter by examining the filename's ending extension: CSV files must end in `.csv`, TSV files must end in `.tsv` and WSV files must end in `.wsv`.
+One of the major goals for ProjectTemplate is providing fully automatic data loading for R. For example, if your `data/` directory contains a data file called `data/choices.csv`, then ProjectTemplate will automatically load this file and create a global variable called `choices`. Using the `clean.variable.name()` function found in `lib/utilities.R`, filenames that contain underscores, dashes and whitespace are changed to use periods instead. For instance, `data/image_properties.tsv` creates a global variable called `image.properties`.
+
+A large and growing number of file formats are supported by the automatic data loading script, including CSV files and related formats, RData files, remote data sets available over HTTP, Stata and SPSS formats and MySQL tables. For further details, read the `file_formats.markdown` file.
+
+As of v0.1-3, `load.project()` prints out the name of every data set as it is loaded.
+
+# Contributors and Thanks
+Diego Valle-Jones contributed a patch that enabled the autoloading of compressed CSV data files. Inspiration for further extensions to the autoloading system came from reading the documentation for David Edgar Liebke's `get-dataset` function, which is part of the Clojure statistical library Incanter.
 
-If the `data/` directory contains a data file `data/choices.csv`, then after automatic loading, you will have a global variable called choices. Using the `clean.variable.name()` function in `lib/utilities.R`, filenames containing underscores, dashes and whitespace are cleaned to use periods instead. For example, `data/image_properties.tsv` creates a global variable called `image.properties`.
+Many thanks to anyone who's made suggestions or comments about ProjectTemplate.
 
-As of v0.1-3, `load.project()` will print out the name of every data set being automatically loaded.
+# Finding Out More
+* Mailing List: ProjectTemplate has a Google Group, which can be found at http://groups.google.com/group/projecttemplate
+* Website: Updates to ProjectTemplate are announced on http://www.johnmyleswhite.com
+* Twitter: Updates to ProjectTemplate are announced on Twitter using the hashtag #ProjectTemplate.
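
Review note: the README's Automatic Data Loading section says that filenames containing underscores, dashes and whitespace are changed to use periods. The sketch below illustrates that mapping only. The name `sketch.clean.name` is hypothetical, and the real `clean.variable.name()` in `lib/utilities.R` may differ. In particular, stripping the directory and the file extension inside this function is an assumption made to keep the example self-contained.

```r
# Hypothetical illustration; not the package's clean.variable.name().
sketch.clean.name <- function(filename) {
  # Keep only the file's basename, then drop everything from the first
  # period onwards so that multi-part extensions such as '.csv.gz' go too.
  stem <- sub('\\..*$', '', basename(filename))

  # Replace each underscore, dash or whitespace character with a period.
  gsub('[_[:space:]-]', '.', stem)
}

sketch.clean.name('data/choices.csv')
sketch.clean.name('data/image_properties.tsv')
```

A loader could then bind the loaded data frame to that name with `assign(name, data, envir = .GlobalEnv)`, which is one way to create the global variable the README describes.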
diff --git a/TODO b/TODO
@@ -1,8 +1,16 @@
-(1) Need to deal with empty directories in inst/ not going into git.
+(1) Debate integrating log4r and automatically loading it.
 
-(2) Integrate log4r and automate its loading.
+(2) Need to create a show.updates() function that provides a way to merge new defaults into an existing project generated using an earlier version of ProjectTemplate. Decide on proper name. update.project conflicts with
+an existing generic method.
 
-(3) Need to create a update.project() function that provides a way to merge new defaults into an existing project generated using an earlier version of ProjectTemplate.
+(3) Debate renaming run.tests() to test.project() permanently.
 
-(4) Consider renaming run.tests() to test.project() for consistency across names.
+(4) Should we automatically run test.project() during load.project()?
 
+(5) show.updates() won't work on Windows systems.
+
+(6) show.updates() should offer to replace file after noticing a change.
+
+(7) All sample .zip files were corrupt. Need proper samples for testing.
+
+(8) Need sample SPSS and Stata files for testing.
diff --git a/file_formats.markdown b/file_formats.markdown
@@ -0,0 +1,51 @@
+# File Formats
+ProjectTemplate can automatically load a variety of CSV-like file formats,
+including compressed CSV files. In addition, automatic loading is supported
+for the binary RData, Stata and SPSS file formats. Finally, ad hoc file
+types support the loading of CSV files that are accessible over HTTP and
+the automatic loading of data from MySQL tables.
+
+N.B.: The SPSS and Stata file formats have not been tested yet. Because
+ProjectTemplate is simply wrapping the 'foreign' library, they are expected
+to work, but I have not confirmed this yet. Your mileage may vary.
+
+# Supported File Extensions
+* `.csv`: CSV files that use a comma separator.
+* `.csv.bz2`: CSV files that use a comma separator and are compressed using bzip2.
+* `.csv.zip`: CSV files that use a comma separator and are compressed using zip.
+* `.csv.gz`: CSV files that use a comma separator and are compressed using gzip.
+* `.tsv`: CSV files that use a tab separator.
+* `.tsv.bz2`: CSV files that use a tab separator and are compressed using bzip2.
+* `.tsv.zip`: CSV files that use a tab separator and are compressed using zip.
+* `.tsv.gz`: CSV files that use a tab separator and are compressed using gzip.
+* `.wsv`: CSV files that use an arbitrary whitespace separator.
+* `.wsv.bz2`: CSV files that use an arbitrary whitespace separator and are compressed using bzip2.
+* `.wsv.zip`: CSV files that use an arbitrary whitespace separator and are compressed using zip.
+* `.wsv.gz`: CSV files that use an arbitrary whitespace separator and are compressed using gzip.
+* `.RData`: .RData binary files produced by `save()`.
+* `.rda`: .RData binary files produced by `save()`.
+* `.url`: A YAML file that contains an HTTP URL and a separator specification for a remote dataset.
+* `.sql`: A YAML file that contains database connection information for a MySQL database.
+* `.sav`: Binary file format generated by SPSS.
+* `.dta`: Binary file format generated by Stata.
+
+# Ad Hoc File Types
+## URL Files
+You can access CSV files over HTTP using the `.url` file extension. Inside
+of the `.url` file, you must place YAML that describes your data sources.
+An example file is shown below.
+
+    url: "http://www.johnmyleswhite.com/ProjectTemplate/sample_data.csv"
+    separator: ","
+
+## SQL Files
+You can access data stored in a MySQL database using the `.sql` file
+extension. Inside of the `.sql` file, you must place YAML that describes
+the connection protocol for your database. An example file is shown below.
+
+    type: mysql
+    user: sample_user
+    password: sample_password
+    host: localhost
+    dbname: sample_database
+    table: sample_table
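
Review note: the `.url` format documented above holds two YAML keys, `url` and `separator`. A minimal sketch of how such a file could be turned into a data frame follows. The function names are hypothetical and this is not the code in `lib/load_data.R`; it assumes the `yaml` package listed under `Depends` is installed and that the remote file has a header row.

```r
# Hypothetical helpers; not the implementation in lib/load_data.R.

# Read a delimited table from the location named in a parsed `.url` spec.
# `spec` is a list with `url` and `separator` entries.
read.url.spec <- function(spec) {
  stopifnot(!is.null(spec$url), !is.null(spec$separator))
  # read.csv() accepts a complete URL as well as a local file path.
  read.csv(spec$url, header = TRUE, sep = spec$separator)
}

# Parse a `.url` file with the yaml package, then load the table it names.
load.url.file <- function(path) {
  spec <- yaml::yaml.load_file(path)
  read.url.spec(spec)
}
```

Splitting the parsing step from the reading step keeps the reader testable against a local file, without network access or the `yaml` package.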
diff --git a/inst/defaults/data/.gitignore b/inst/defaults/data/.gitignore
diff --git a/inst/defaults/diagnostics/.gitignore b/inst/defaults/diagnostics/.gitignore
diff --git a/inst/defaults/doc/.gitignore b/inst/defaults/doc/.gitignore
diff --git a/inst/defaults/graphs/.gitignore b/inst/defaults/graphs/.gitignore
diff --git a/inst/defaults/lib/boot.R b/inst/defaults/lib/boot.R
@@ -6,3 +6,9 @@ cat('Autoloading data\n')
 source('lib/load_data.R')
 cat('Preprocessing data\n')
 source('lib/preprocess_data.R')
+
+# Need to discuss automatic log4r usage on the mailing list.
+#logger <- create.logger()
+#logfile(logger) <- file.path('logs', 'project.log')
+#level(logger) <- log4r:::INFO
+#info(logger, 'Data analysis session started.')