## Merging Data
___

Useful links:
* [Quick R data merging page](http://www.statmethods.net/management/merging.html)
* [plyr information](http://plyr.had.co.nz/)
* [Types of joins](http://en.wikipedia.org/wiki/join_(SQL\))

## Obtain the datasets

In [8]:
file.URL1 <- "https://raw.githubusercontent.com/jtleek/dataanalysis/master/week2/007summarizingData/data/reviews.csv"
file.URL2 <- "https://raw.githubusercontent.com/jtleek/dataanalysis/master/week2/007summarizingData/data/solutions.csv"

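# Create the download directory first; download.file() fails if it is missing
if (!dir.exists("data")) {
    dir.create("data")
}

if (!file.exists("data/reviews.csv")) {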
    download.file(file.URL1, "data/reviews.csv", method="curl", extra = "-L")
}

if (!file.exists("data/solutions.csv")) {
    download.file(file.URL2, "data/solutions.csv", method="curl", extra = "-L")
}

reviews <- read.csv("data/reviews.csv")
solutions <- read.csv("data/solutions.csv")

In [12]:
head(reviews, n=10)

id,solution_id,reviewer_id,start,stop,time_left,accept
1,3,27,1304095698.0,1304095758.0,1754.0,1.0
2,4,22,1304095188.0,1304095206.0,2306.0,1.0
3,5,28,1304095276.0,1304095320.0,2192.0,1.0
4,1,26,1304095267.0,1304095423.0,2089.0,1.0
5,10,29,1304095456.0,1304095469.0,2043.0,1.0
6,2,29,1304095471.0,1304095513.0,1999.0,1.0
7,9,25,1304095343.0,1304095382.0,2130.0,1.0
8,8,23,,,,
9,7,29,1304095520.0,1304095613.0,1899.0,1.0
10,11,26,1304095428.0,1304095488.0,2024.0,1.0


In [13]:
head(solutions, n=10)

id,problem_id,subject_id,start,stop,time_left,answer
1,156,29,1304095119,1304095169,2343,B
2,269,25,1304095119,1304095183,2329,C
3,34,22,1304095127,1304095146,2366,C
4,19,23,1304095127,1304095150,2362,D
5,605,26,1304095127,1304095167,2345,A
6,384,27,1304095131,1304095270,2242,C
7,538,28,1304095133,1304095201,2311,C
8,312,24,1304095134,1304095198,2314,D
9,327,22,1304095151,1304095184,2328,E
10,194,23,1304095152,1304095175,2337,A


Note that `solution_id` in `reviews` refers to `id` in `solutions`; this is the key that links the two data frames.
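Before merging, it can help to confirm that the key columns really line up. The following is a minimal sketch, not part of the original notebook: `reviews_toy` and `solutions_toy` are hypothetical data frames that copy a few rows and columns from the outputs above, so the check runs without downloading anything.

```r
# Toy data frames (hypothetical names) built from rows shown in the outputs above.
reviews_toy <- data.frame(
  id          = 1:4,
  solution_id = c(3, 4, 5, 1),
  reviewer_id = c(27, 22, 28, 26)
)
solutions_toy <- data.frame(
  id         = 1:5,
  problem_id = c(156, 269, 34, 19, 605),
  answer     = c("B", "C", "C", "D", "A")
)

# Does every review point at a solution that exists?
all_matched <- all(reviews_toy$solution_id %in% solutions_toy$id)

# Which solutions have no review in this toy subset?
unreviewed <- setdiff(solutions_toy$id, reviews_toy$solution_id)

print(all_matched)
print(unreviewed)
```

The same two checks should work on the full `reviews` and `solutions` data frames by swapping in their names.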

## Using the `merge` command
`merge` combines two data frames by matching values in one or more key columns. `join` in the `plyr` package is an alternative that is faster but less fully featured.

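The important parameters of `merge` are `x`, `y`, `by`, `by.x`, `by.y`, and `all`:
`x` and `y` are the data frames to combine; `by` names the key columns to match on (or `by.x` and `by.y` when the key has a different name in each data frame); `all = TRUE` keeps rows that have no match in the other data frame, filling the missing columns with `NA`.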
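To see what these parameters do, here is a small sketch on made-up data. The data frames `reviews_toy` and `solutions_toy` are hypothetical, and the `solution_id` value 99 is invented so that one row has no match. The example also uses `all.x`, a related `merge` argument that keeps unmatched rows from `x` only.

```r
# Hypothetical toy data: 99 is an invented id with no matching solution.
reviews_toy <- data.frame(
  solution_id = c(3, 4, 99),
  reviewer_id = c(27, 22, 30)
)
solutions_toy <- data.frame(
  id     = c(1, 3, 4),
  answer = c("B", "C", "D")
)

# Default (all = FALSE): inner join, only keys present in both data frames.
inner <- merge(reviews_toy, solutions_toy, by.x = "solution_id", by.y = "id")

# all = TRUE: full outer join, unmatched rows from either side padded with NA.
outer <- merge(reviews_toy, solutions_toy,
               by.x = "solution_id", by.y = "id", all = TRUE)

# all.x = TRUE: left join, every row of x is kept.
left <- merge(reviews_toy, solutions_toy,
              by.x = "solution_id", by.y = "id", all.x = TRUE)

print(nrow(inner))  # keys 3 and 4
print(nrow(outer))  # keys 1, 3, 4 and 99
print(nrow(left))   # keys 3, 4 and 99
```

In each result the key column takes its name from `by.x`, so it appears as `solution_id`.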

In [9]:
names(reviews)

In [10]:
names(solutions)

In [11]:
mergedData <- merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE)
head(mergedData)

solution_id,id,reviewer_id,start.x,stop.x,time_left.x,accept,problem_id,subject_id,start.y,stop.y,time_left.y,answer
1,4,26,1304095267,1304095423,2089,1,156,29,1304095119,1304095169,2343,B
2,6,29,1304095471,1304095513,1999,1,269,25,1304095119,1304095183,2329,C
3,1,27,1304095698,1304095758,1754,1,34,22,1304095127,1304095146,2366,C
4,2,22,1304095188,1304095206,2306,1,19,23,1304095127,1304095150,2362,D
5,3,28,1304095276,1304095320,2192,1,605,26,1304095127,1304095167,2345,A
6,16,22,1304095303,1304095471,2041,1,384,27,1304095131,1304095270,2242,C
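For comparison, the `join` function from `plyr` mentioned earlier can do a similar job. The sketch below is an illustration under assumptions: it uses hypothetical toy data frames, it only calls `join` if `plyr` happens to be installed, and it renames the key column first because `join` expects the key to have the same name on both sides.

```r
# Hypothetical toy data built from rows shown in the outputs above.
reviews_toy <- data.frame(
  solution_id = c(3, 4, 5, 1),
  reviewer_id = c(27, 22, 28, 26)
)
solutions_toy <- data.frame(
  id     = c(1, 3, 4, 5),
  answer = c("B", "C", "D", "A")
)

# join() matches on identically named columns, so rename the key in solutions_toy.
names(solutions_toy)[names(solutions_toy) == "id"] <- "solution_id"

# merge() sorts the result by the key column by default.
merged <- merge(reviews_toy, solutions_toy, by = "solution_id")
print(merged)

# plyr::join() keeps the row order of its first argument.
if (requireNamespace("plyr", quietly = TRUE)) {
  joined <- plyr::join(reviews_toy, solutions_toy,
                       by = "solution_id", type = "left")
  print(joined)
}
```

One visible difference is row order: `merge` returns rows sorted by `solution_id`, while `join` preserves the order of `reviews_toy`.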


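By default, `merge` joins on all column names that the two data frames have in common. Here that means `id`, `start`, `stop`, and `time_left`, which rarely agree across the two files, so most rows end up unmatched.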

In [15]:
intersect(names(solutions), names(reviews))

In [16]:
mergedData2 <- merge(reviews, solutions, all=TRUE)
head(mergedData2)

id,start,stop,time_left,solution_id,reviewer_id,accept,problem_id,subject_id,answer
1,1304095119,1304095169,2343,,,,156.0,29.0,B
1,1304095698,1304095758,1754,3.0,27.0,1.0,,,
2,1304095119,1304095183,2329,,,,269.0,25.0,C
2,1304095188,1304095206,2306,4.0,22.0,1.0,,,
3,1304095127,1304095146,2366,,,,34.0,22.0,C
3,1304095276,1304095320,2192,5.0,28.0,1.0,,,
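The doubled rows in the output above follow from the default `by = intersect(names(x), names(y))`. A minimal sketch on hypothetical toy data (two rows taken from each of the outputs earlier, with only a few columns kept) shows the difference between the default key and an explicit one:

```r
# Hypothetical toy data: reviews 4 and 6 refer to solutions 1 and 2.
reviews_toy <- data.frame(
  id          = c(4, 6),
  solution_id = c(1, 2),
  start       = c(1304095267, 1304095471)
)
solutions_toy <- data.frame(
  id     = c(1, 2),
  start  = c(1304095119, 1304095119),
  answer = c("B", "C")
)

# Columns that merge() uses as the key when `by` is not given.
shared <- intersect(names(reviews_toy), names(solutions_toy))
print(shared)  # "id" and "start"

# Default key: no (id, start) pair agrees, so nothing matches and
# all = TRUE keeps every row from both sides.
by_default <- merge(reviews_toy, solutions_toy, all = TRUE)
print(nrow(by_default))

# Explicit key: each review is paired with its solution.
explicit <- merge(reviews_toy, solutions_toy,
                  by.x = "solution_id", by.y = "id", all = TRUE)
print(nrow(explicit))
```

With the explicit key the result has one row per review, and the clashing `start` columns are disambiguated as `start.x` and `start.y`.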
