# Getting and Cleaning Data - Week 1
The goal of this course: **Raw data -> Processing script -> tidy data** -> data analysis -> data communication

## Raw and Processed Data

### Definition of data
+ Data are values of **qualitative** or **quantitative** variables, belonging to a **set of items**.
+ **Variables**: A measurement or characteristic of an item.

### Raw versus processed data

#### Raw data

+ The original source of the data
+ **Often hard to use for data analyses**
+ Data analysis *includes* processing
+ Raw data may only need to be processed once

#### Processed data

+ Data that is ready for analysis
+ Processing can include merging, subsetting, transforming, etc.
+ There may be standards for processing
+ **All steps should be recorded**
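
As a rough illustration of the processing steps listed above, the sketch below subsets and transforms a small made-up data frame in base R. The data, the column names (`heightCm`, `weightKg`, `bmi`) and the choice of steps are assumptions for illustration only. Because each step is written down in code, the script itself serves as the record.

```r
# Hypothetical raw measurements (made up for this sketch).
raw <- data.frame(
    id = c(1, 2, 3, 4),
    heightCm = c(172, NA, 165, 180),
    weightKg = c(70, 82, 58, 95)
)

# Subsetting: keep only the rows where height was actually measured.
complete <- raw[!is.na(raw$heightCm), ]

# Transforming: derive a new variable (BMI) from two existing ones.
complete$bmi <- complete$weightKg / (complete$heightCm / 100)^2

# The raw data frame is left untouched; only the copy was processed.
print(complete)
```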

## The components of tidy data

### The four things you should have

1. The raw data
2. A tidy data set
3. A code book describing each variable and its values in the tidy data set.
4. **An explicit and exact recipe you used to go from 1 -> 2, 3.**

### The raw data

*You know the raw data is in the right format if you*
1. **Ran no software on the data**
2. Did not manipulate any of the numbers in the data
3. Did not remove any data from the data set
4. Did not summarize the data in any way

### The tidy data

1. **Each variable you measure should be in one column**
2. Each different observation of that variable should be in a different row
3. There should be one table for each "kind" of variable
4. If you have multiple tables, they should include a column in the table that allows them to be linked

*Some other important tips*
+ Include a row at the top of each file with variable names
+ Make variable names human readable, e.g., AgeAtDiagnosis instead of AgeDx
+ In general data should be saved in one file per table
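
The rules above might look like this in practice. This is a minimal sketch with made-up data: the tables `patients` and `visits`, their column names, and the output location under `tempdir()` are all assumptions for illustration.

```r
# Hypothetical example data: one table per "kind" of thing measured.
patients <- data.frame(
    PatientId = c(1, 2, 3),
    AgeAtDiagnosis = c(54, 61, 47)
)
visits <- data.frame(
    PatientId = c(1, 1, 2, 3),
    VisitNumber = c(1, 2, 1, 1),
    SystolicBloodPressure = c(130, 126, 142, 118)
)

# One file per table, with the variable names in the first row.
outDir <- file.path(tempdir(), "tidy")
if (!file.exists(outDir)) {
    dir.create(outDir)
}
write.csv(patients, file.path(outDir, "patients.csv"), row.names = FALSE)
write.csv(visits, file.path(outDir, "visits.csv"), row.names = FALSE)

# The shared PatientId column lets the two tables be linked when needed.
linked <- merge(visits, patients, by = "PatientId")
print(linked)
```

Each variable sits in its own column, each observation in its own row, and `merge()` only works here because both tables carry the linking column.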

### The code book

1. Information about the variables (including units!) in the data set not contained in the tidy data
2. Information about the summary choices you made
3. Information about the experimental study design you used

*Some other important tips*
+ A common format for this document is a Word/text file
+ There should be a section called "Study design" that has a thorough description of how you collected the data
+ There must be a section called "Code book" that describes each variable and its units.
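
A code book is usually written by hand, but a skeleton can be generated from a small table of variable descriptions. The sketch below is one possible approach, not a required format. The variables, units, descriptions and the file name `CodeBook.txt` are hypothetical.

```r
# Hypothetical variables; in practice these describe your own tidy data set.
codeBook <- data.frame(
    Variable = c("PatientId", "AgeAtDiagnosis", "SystolicBloodPressure"),
    Units = c("none (identifier)", "years", "mmHg"),
    Description = c(
        "Unique identifier linking tables",
        "Age of the patient when first diagnosed",
        "Mean of repeated readings at a visit"
    )
)

# Assemble a plain-text document with the two sections named above.
codeBookLines <- c(
    "Study design",
    "(describe here how the data were collected)",
    "",
    "Code book",
    sprintf("%s [%s]: %s",
            codeBook$Variable, codeBook$Units, codeBook$Description)
)

codeBookFile <- file.path(tempdir(), "CodeBook.txt")
writeLines(codeBookLines, codeBookFile)
```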

### The instruction list

+ Ideally a computer script
+ The input for the script is the raw data
+ The output is the processed, tidy data
+ There are no parameters to the script

In some cases it will not be possible to script every step. In that case you should provide instructions like:
1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a = 1, b = 2, c = 3
2. Step 2 - run the software separately for each sample
3. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set
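
When every step can be scripted, the instruction list can be as short as the sketch below: fixed input, fixed output, nothing to configure. The stand-in raw file, the column names and the summary choice (a mean per sample) are assumptions made only so the example runs end to end.

```r
# Stand-in raw file so the sketch is runnable (hypothetical data).
workDir <- file.path(tempdir(), "recipe")
dir.create(workDir, showWarnings = FALSE)
rawFile <- file.path(workDir, "raw.csv")
writeLines(c("sample,value", "a,1", "a,3", "b,5"), rawFile)

# The recipe itself: the input is the raw data ...
raw <- read.csv(rawFile)

# ... a recorded processing step (here: mean value per sample) ...
tidy <- aggregate(value ~ sample, data = raw, FUN = mean)
names(tidy) <- c("Sample", "MeanValue")

# ... and the output is the processed, tidy data.
tidyFile <- file.path(workDir, "tidy.csv")
write.csv(tidy, tidyFile, row.names = FALSE)
```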

## Downloading files

### Get/set your working directory

+ A basic component of working with data is knowing your working directory
+ The two main commands are `getwd()` and `setwd()`
+ Be aware of relative versus absolute paths
  - **Relative** - `setwd("./data")`, `setwd("../")`
  - **Absolute** - `setwd("/Users/jtleek/data/")`
+ Important difference on Windows: backslashes in paths must be escaped, e.g. `setwd("C:\\Users\\Andrew\\Downloads")`
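
The commands above can be tried safely with the short sketch below. It uses `tempdir()` as a stand-in for a project directory (an assumption for illustration) and restores the original working directory at the end.

```r
original <- getwd()          # remember where we started
target <- tempdir()          # stand-in for a project directory

setwd(target)                # absolute path
dir.create("data", showWarnings = FALSE)

setwd("./data")              # relative: move into the "data" subdirectory
setwd("../")                 # relative: move back up one level
back <- getwd()              # should be the stand-in directory again

setwd(original)              # restore the starting directory
```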

### Checking for and creating directories

+ `file.exists("directoryName")` will check to see if the directory exists
+ `dir.create("directoryName")` will create a directory if it doesn't exist
+ An example of checking for a "data" directory and creating it if it doesn't exist:
```{r}
if (!file.exists("data")) {
    dir.create("data")
}
```

### Getting data from the internet - `download.file()`

+ Downloads a file from the internet
+ Even if you could do this by hand, scripting the download helps with reproducibility
+ Important parameters are *url, destfile, method*
+ Useful for downloading tab-delimited, csv, and other files
+ Example
```{r}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
list.files("./data")
## [1] "cameras.csv"
```
```{r}
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Jan 12 21:37:44 2014"
```

*Some notes about `download.file()`*
+ If the url starts with *http* you can use `download.file()`
+ If the url starts with *https* on Windows you may be ok
+ If the url starts with *https* on Mac you may need to set `method = "curl"`
+ If the file is big, this might take a while
+ Be sure to record when you downloaded the file
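
One way to combine these notes is a small wrapper that skips the download when the file is already present and writes the download date next to it. This is only a sketch: `download_once` and the `.downloaded` sidecar file are hypothetical names, not part of base R.

```r
# Hypothetical helper: download a file only if it is not already present,
# and record the download time in a sidecar text file.
download_once <- function(url, destfile, method = "auto") {
    destDir <- dirname(destfile)
    if (!file.exists(destDir)) {
        dir.create(destDir, recursive = TRUE)
    }
    if (!file.exists(destfile)) {
        download.file(url, destfile = destfile, method = method)
        # Record when the file was downloaded.
        writeLines(date(), paste0(destfile, ".downloaded"))
    }
    invisible(destfile)
}
```

A call such as `download_once(fileUrl, "./data/cameras.csv", method = "curl")` would then fetch the file only on the first run, which matters when the file is big.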