Permalink
Find file
Fetching contributors…
Cannot retrieve contributors at this time
282 lines (223 sloc) 17.7 KB

r opts_knit$set(upload.fun = imgur_upload)

The following post shows how to manually convert a Sweave LaTeX document into a knitr R Markdown document. The post (1) reviews many of the required changes; (2) provides an example of a document converted to R Markdown format based on an analysis of Winter Olympic Medal data up to and including 2006; and (3) discusses the pros and cons of LaTeX and Markdown for performing analyses.

Overview

The following analyses of Winter Olympic Medals data have gone through several iterations:

  1. R Script: I originally performed similar analyses in February 2010. It was a simple set of commands where you could see the console output and view the plots.
  2. LaTeX Sweave: In February 2011 I adapted the example to make it a Sweave LaTex document. The source fo this is available on github. With Sweave, I was able to create a document that weaved text, commands, console input, console output, and figures.
  3. R Markdown: Now in June 2012 I'm using the example to review the process of converting a document from Sweave-LaTeX to R Markdown. The souce code is available here on github (see the *.rmd file).

Converting from Sweave to R Markdown

The following changes were required in order to convert my LaTeX Sweave document into an R Markdown document suitable for processing with knitr and RStudio. Many of these changes are fairly obvious if you understand LaTeX and Markdown; but a few are less obvious. And obviously there are many additional changes that might be required on other documents.

R code chunks

  • R code chunk delimiters: Update from << ... >>= and @ to R markdown format ```{r ...} and ```
  • Inline code chunks: Update from \Sexpr{...} to either `r ...` or `r I(...)` format.
  • results=tex: Any results=tex needs to either be removed or converted to results='asis'. Note that string values of knitr options need to be quoted.
  • Boolean options: Sweave tolerates lower case true and false for code chunk options, knitr requires TRUE and FALSE.

Figures and Tables

  • Floats: Remove figure and table floats (e.g., \begin{table}...\end{table}, \begin{figure}...\end{figure}). In R Markdown and HTML, there are no pages and thus content is just placed immediately in the document.
  • Figure captions: Extract content from within the \caption{} command. When using R Markdown, it is often easiest to add captions to the plot itself (e.g., using the main argument in base graphics).
  • Table captions: extract content from within the \caption{} command; Table captions can be included in a caption argument using the caption argument to the xtable function (e.g., print(xtable(MY_DAT_FRAME), "html", caption="MY CAPTION", caption.placement="top") ). Caption placement defaults to "bottom" of table but can be optinally specified as "top" either as a global option or in print.xtable. Alternatively table titles can just be included as Markdown text.
  • References: Delete table and figure lables (e.g., \label{...}). Replace table and figure references (e.g., \ref{...} with actual numbers or other descriptive terminology. It would also be possible to implement something simple in R that stored table and figure numbers (e.g., initialise table and figure numbers at the start of the document; increment table counter each time a table is created and likewise for figures; store the value of counter in variable; include variable in caption text using paste() or something similar. Include counter in text using inline R code chunks.
  • Table content: Markdown supports HTML; so one option is to convert LaTeX tables to HTML tables using a function like print(xtable(MY_DATA_FRAME), type="html"). This is combined with the results='asis' R code chunk option.

Basic formatting

  • Headings: if we assume section is the top level: then \section{...} becomes # ..., \subsection{...} becomes ## ... and \subsubsection{...} becomes ### ...
  • Mathematics: Update latex mathematics to $``latex ... and $$``latex ... $$ notation if using RStudio.
  • Paragraph delimiters: If using RStudio then remove single line breaks that were not intended to be paragraph breaks.
  • Hyperlinks: Convert LaTeX Hyperlinks from \href or url to [text](url) format.

LaTeX things

  • Comments: Remove any LaTeX comments or switch from % comment to <!-- comment -->
  • LaTeX escaped characters: Remove unnecessary escape characters (e.g., \% is just %).
  • R Markdown escaped characters: Writing about the R Markdown language in R Markdown sometimes requires the use of HTML codes for special characters such as backticks (&#96;) and backslashes (&#92;) to prevent the text from being interpreted; see here for a list of HTML character codes.
  • Header: Remove the LaTeX header information up to and including \begin{document}; extract any incorporate any relevant content such as title, abstract, author, date, etc.

R Markdown Analysis of Winter Olympic Medal Data

The following shows the output of the actual analysis after running the rmd source through Knit HTML in Rstudio. If you're curious, you may wish to view the rmd source code on GitHub side by side this point at this point.

Import Dataset

library(xtable)
options(stringsAsFactors=FALSE)
medals <- read.csv("data/medals.csv")
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]       

The Olympic Medals data frame includes r nrow(medals) medals from r min(medals[["Year"]]) to r max(medals[["Year"]]). The data was sourced from The Guardian Data Blog.

Total Medals by Year

# http://www.math.mcmaster.ca/~bolker/emdbook/chap3A.pdf
x <- aggregate(medals$Year, list(Year = medals$Year), length)
names(x) <- c("year", "medals") 
x$pos <- seq(x$year) 
fit <- nls(medals ~ a * pos ^ b + c, x, start = list(a=10, b=1, c = 50))

In general over the years the number of Winter Olympic medals awarded has increased. In order to model this relationship, year was converted to ordinal position. A three parameter power function seemed plausible, $latex y = ax^b + c$, where $latex y$ is total medals awarded and $latex x$ is the ordinal position of the olympics starting at one. The best fitting parameters by least-squares were

$$latex r I(round(coef(fit)["a"], 3)) x^{r I(round(coef(fit)["b"], 3)) + r I(round(coef(fit)["c"], 3))}. $$

The figure displays the data and the line of best fit for the model. The model predicts that 2010, 2014, and 2018 would have r round(predict(fit, data.frame(pos=max(x[["pos"]]) + 1))), r round(predict(fit, data.frame(pos=max(x[["pos"]]) + 2))), and r round(predict(fit, data.frame(pos=max(x[["pos"]]) + 3))) medals respectively.

plot(medals ~ pos, x,  las = 1, 
		ylab = "Total Medals Awarded", 
		xlab = "Ordinal Position of Olympics",
        main="Total medals awarded 
     by ordinal position of Olympics with
     predicted three parameter power function fit displayed.",
		las = 1,
		bty="l")
lines(x$pos, predict(fit))

Gender Ratio by Year

medalsByYearByGender <- aggregate(medals$Year, 
    list(Year = medals$Year, Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender 
    != "X", ]
propf <- list()
propf$prop <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / (
			medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] +
			medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propf$year <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"]
propf$propF <- format(round(propf$prop, 2))

propf$table <- with(propf, 	cbind(year, propF))
colnames(propf$table) <- c("Year", "Prop. Female")

The figure shows the number of medals won by males and females by year. The table shows the proportion of medals awarded to females by year. It shows a generally similar pattern for males and females. Medals increase gradually until around the late 1980s after which the rate of increase accelerates. However, females started from a much smaller base. Thus, both the absolute difference and the percentage difference has decreased over time to the point where in 2006 r as.numeric(propf[["propF"]][length(propf[["propF"]])])* 100 of medals were won by females.

plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ], 
    ylim = c(0,max(x)), pch = "m", col = "blue", las = 1,
    ylab = "Total Medals Awarded", bty="l",
     main="Total Medals Won by Gender and Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
    medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"],
    col = "red", pch = "f")
print(xtable(propf$table,
             caption="Proportion of Medals that were awarded to Females by Year"), 
      type="html", 
      caption.placement="top",
      html.table.attributes='align="center"')

Countries with the Most Medals

cmm <- list()
cmm$medals <- sort(table(medals$NOC), dec = TRUE)
cmm$country <- names(cmm$medals)
cmm$prop <- cmm$medals / sum(cmm$medals)
cmm$propF <- paste(round(cmm$prop * 100, 2), "%", sep ="")

cmm$row1 <- c("Rank", "Country", "Total", "%")
cmm$rank <- seq(cmm$medals)
cmm$include <- 1:10
                               
cmm$table <- with(cmm,
		rbind(cbind(rank[include], country[include],
						medals[include], propF[include])))
colnames(cmm$table) <- cmm$row1

Norway has won the most medals with r cmm[["medals"]][1] (r round(cmm[["prop"]][1]* 100, 2)%). The table shows the top 10. Russia, USSR, and EUN (Unified Team in 1992 Olympics) have a combined total of r sum(cmm$medals[c("RUS", "URS", "EUN")]). Germany, GDR, and FRG have a combined medal total of r sum(cmm$medals[c("GER", "GDR", "FRG")]).

print(xtable(cmm$table, caption="Rankings of Medals Won by Country"), 
      "html", include.rownames=FALSE, caption.placement='top',
      html.table.attributes='align="center"')

Proportion of Gold Medals by Country

Looking only at countries that have won more than 50 medals in the dataset, the figure shows that the proportion of medals won that were gold, silver, or bronze.

NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), 
                                 margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"], 
         decreasing = TRUE), c("Gold", "Silver", "Bronze")]
barplot(round(t(medalsByMedalByNOC), 2), horiz = TRUE, las = 1, 
		col=c("gold", "grey71", "chocolate4"), 
		xlab = "Proportion of Medals",
        main="Proportion of medals won that were gold, silver or bronze.")

How many different countries have won medals by year?

listOfYears <- unique(medals$Year)                   
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears,  function(X) 
      length(table(medals[medals$Year == X, "NOC"])))

The figure shows the total number of countries winning medals by year.

plot(x= names(totalNocByYear), totalNocByYear, 
    ylim = c(0, max(totalNocByYear)),
		las = 1,
    xlab = "Year",
    main = "Total Number of Countries Winning Medals By Year",
    ylab = "Total Number of Countries",
    bty = "l")

Australia at the Winter Olympics

ausmedals <- list()
ausmedals$data <- medals[medals$NOC == "AUS", ]
ausmedals$data <- ausmedals$data[, c("Year", "City", "Discipline", "Event",
"Medal")]
ausmedals$table <- ausmedals$data

Given that I am an Australian I decided to have a look at the Australian medal count. Australia does not get a lot of snow. Up to and including 2006, Australia has won r nrow(ausmedals$data) medals. It won its first medal in r min(ausmedals$data$Year). Of the r nrow(ausmedals$data) medals, r sum(ausmedals$data$Medal == "Bronze") were bronze, r sum(ausmedals$data$Medal == "Silver") were silver, and r sum(ausmedals$data$Medal == "Gold") were gold. The table lists each of these medals.

print(xtable(ausmedals$table, 
             caption='List of Australian Medals',
             digits=0),
      type='html', 
      caption.placement='top', 
      include.rownames=FALSE,
      html.table.attributes='align="center"') 

Ice Hockey

icehockey <- medals[medals$Sport == "Ice Hockey" & 
    medals$Event.gender == "M" &
    medals$Medal == "Gold",  ]
icehockeyf <- medals[medals$Sport == "Ice Hockey" & 
    medals$Event.gender == "W" &
    medals$Medal == "Gold",  ]

# names(table(icehockey$NOC)[table(icehockey$NOC) > 1])

The following are some statistics about Winter Olympics Ice Hockey up to and including the 2006 Winter Olympics.

  • Out of the r length(unique(medals$Year)) Winter Olympics that have been staged, Mens Ice Hockey has been held in r nrow(icehockey) and the Womens in r nrow(icehockeyf).
  • The USSR has won the most mens gold medals with r sum(icehockey$NOC == "URS") golds. It goes up to r sum(icehockey$NOC %in% c("EUN", "URS")) if the 1992 Unified Team is included.
  • Canada has the second most golds with r sum(icehockey$NOC == "CAN").
  • After that the only two nations to win more than one gold are Sweden (r sum(icehockey$NOC == "SWE") golds) and the United States (r sum(icehockey$NOC == "USA") golds).
  • The table shows the countries who won gold and silver medals by year.
  • In the case of the Women's Ice Hockey, Canada has won r table(icehockeyf$NOC)["CAN"] and the United States has won r table(icehockeyf$NOC)["USA"].
icehockeygs <- medals[medals$Sport == "Ice Hockey" & 
    medals$Event.gender == "M" &
    medals$Medal %in% c("Silver", "Gold"),  c("Year", "Medal", "NOC")]
icetab <- list()
icetab$data <- reshape(icehockeygs, idvar="Year", timevar="Medal",
    direction="wide")
names(icetab$data) <- c("Year", "Gold", "Silver")

print(xtable(icetab$data, 
             caption ="Country Winning Gold and Silver Medals by Year in Mens Ice Hockey", 
             digits=0), 
      type="html",     
      include.rownames=FALSE,
      caption.placement="top",
      html.table.attributes='align="center"')

Reflections on the Conversion Process

  • Markdown versus LaTeX:
    • I prefer performing analyses with Markdown than I do with LateX.
    • Markdown is easier to type than LaTeX.
    • Markdown is easier to read than LaTeX.
    • It is easier with Markdown to get started with analyses.
    • Many analyses are only presented on the screen and as such page breaks in LaTeX are a nuisance. This extends to many features of LaTeX such as headers, figure and table placement, margins, table formatting, partiuclarly for long or wide tables, and so on.
    • That said, journal articles, books, and other artefacts that are bound to the model of a printed page are not going anywhere.
    • Furthermore, bibliographies, cross-references, elaborate control of table appearance, and more are all features which LaTeX makes easier than Markdown.
  • R Markdown to Sweave LaTeX:
    • The more common conversion task that I can imagine is taking some simple analyses in R Markdown and having to convert them into knitr LaTeX in order to include the content in a journal article.
    • The first time I converted between the formats, it was good to do it in a relatively manual way to get a sense of all the required changes; however, if I had a large document or was doing the task on subsequent occasions, I would look at more automated solutions using string replacement tools (e.g., sed, or even just replacement commands in a text editor such as Vim), and markup conversion tools (e.g., pandoc).
    • Perhaps if the formats get popular enough, developers will start to build dedicated conversion tools.

Additional Resources

If you liked this post, you may want to subscribe to the RSS feed of my blog. Also see: