Commit paper, code, data, and package.
Carl Vogel committed May 7, 2015
1 parent 02caf24 commit 5eea7e8
Showing 49 changed files with 1,930 additions and 1 deletion.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.Rproj.user
.Rhistory
.RData
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
The MIT License (MIT)

Copyright (c) 2015 Polynumeral
Copyright (c) 2015 Academia.edu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
67 changes: 67 additions & 0 deletions README.md
@@ -0,0 +1,67 @@
# Academia Citation Advantage Analysis

The `acadcites` package contains the data and functions used in Niyazov et al., "Open Access Meets Discoverability: Citations to Articles Posted to Academia.edu."

## Installing the R Package
The easiest way to install the package and its dependencies is with `install_local` from the `devtools` package (http://cran.r-project.org/web/packages/devtools/index.html).

- Clone the repo:

```sh
git clone https://github.com/polynumeral/academia-citations
cd academia-citations
```

- From R:

```{R}
install.packages('devtools')
devtools::install_local('acadcites_0.1.tar.gz')
```

## Importing data
The cleaned/combined dataset used for the analyses can be obtained by calling:

```{R}
library('acadcites')
cites <- importData()
```

or just `cites <- acadcites::importData()` without the `library` import.

## Reproducing tables from the article

Tables from the article can be reproduced with the `makeTable` function.

```{R}
# Make Table 2 from the article.
makeTable(2, cites)
# |Journal | # Articles| % Total|
# |:------------------------------------------------------|----------:|-------:|
# |Analytical Chemistry | 1,537| 3.44%|
# |PLoS One | 492| 1.10%|
# |Anesthesia and Analgesia | 430| 0.96%|
# |Biological and Pharmaceutical Bulletin | 362| 0.81%|
# |Analytical Methods: advancing methods and applications | 339| 0.76%|
# |Analytical Biochemistry | 317| 0.71%|
# |Applied Mechanics and Materials | 303| 0.68%|
# |Bioconjugate Chemistry | 299| 0.67%|
# |Applied Physics Letters | 190| 0.43%|
# |BioEssays | 183| 0.41%|
```


## Reproducing figures from the article
The `makeFigure` function reproduces figures from the article. Like `makeTable`,
it takes a figure number and a citations data frame.

```{r}
makeFigure(1, cites)
```


## Package help

See `help(package='acadcites')` for more help files on individual functions, or
`vignette('acadcites')` for information similar to what's provided here.
2 changes: 2 additions & 0 deletions acadcites/.Rbuildignore
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
4 changes: 4 additions & 0 deletions acadcites/.gitignore
@@ -0,0 +1,4 @@
.Rproj.user
.Rhistory
.RData
inst/doc
24 changes: 24 additions & 0 deletions acadcites/DESCRIPTION
@@ -0,0 +1,24 @@
Package: acadcites
Title: Manage data and models to study effect of Academia.edu on citations.
Version: 0.1
Authors@R: "Carl Vogel <carl@polynumeral.com> [aut, cre]"
Description: Manage data and run models to study the effect of posting to
    Academia.edu on article citations.
Depends:
    R (>= 3.1.1)
License: MIT
LazyData: true
Imports:
    dplyr,
    magrittr,
    stringr,
    reshape2,
    ggplot2,
    MASS,
    pscl,
    knitr,
    scales,
    stargazer,
    memisc,
    pander
VignetteBuilder: knitr
29 changes: 29 additions & 0 deletions acadcites/NAMESPACE
@@ -0,0 +1,29 @@
exportPattern("^[^\\.]")
import(dplyr)
importFrom(stringr, str_detect)
importFrom(stringr, str_trim)
importFrom(stringr, str_replace_all)
importFrom(stringr, str_replace)
importFrom(magrittr, use_series)
importFrom(magrittr, set_colnames)
importFrom(magrittr, set_names)
importFrom(magrittr, extract)
importFrom(ggplot2, ggplot)
importFrom(ggplot2, aes)
importFrom(ggplot2, geom_boxplot)
importFrom(ggplot2, geom_histogram)
importFrom(ggplot2, geom_point)
importFrom(ggplot2, stat_quantile)
importFrom(ggplot2, facet_wrap)
importFrom(ggplot2, labs)
importFrom(ggplot2, xlim)
importFrom(ggplot2, scale_x_continuous)
importFrom(ggplot2, scale_y_continuous)
importFrom(ggplot2, scale_colour_manual)
importFrom(ggplot2, position_jitter)
importFrom(ggplot2, theme_bw)
importFrom(memisc, mtable)
importFrom(memisc, relabel)
S3method(getModelSummary, lm)
S3method(getModelSummary, glm)
S3method(getModelSummary, zeroinfl)
108 changes: 108 additions & 0 deletions acadcites/R/bucket_analysis.R
@@ -0,0 +1,108 @@
# Functions for comparing citations within Impact Factor buckets

#' Group a variable into buckets based on its quantiles or those of another
#' variable.
#'
#' @param x_quantile The variable to calculate quantile buckets from.
#' @param x_bucket The variable to collect into the quantile buckets.
#' @param nbuckets The number of quantile buckets to use (10 if neither this
#'   nor `probs` is given). Specify this *or* `probs`, but not both.
#' @param probs The vector of probabilities for the quantile bucket cut points.
#'   Specify this *or* `nbuckets`, but not both.
#' @return A factor vector corresponding to `x_bucket` with bucket ranges.
#'   If an element of `x_bucket` is outside of the range of `x_quantile`, its
#'   bucket will be NA.
#'
quantileBuckets <- function(x_quantile, x_bucket=x_quantile, nbuckets=NULL, probs=NULL) {
    # With the old default of nbuckets=10, passing only `probs` always hit the
    # error branch. Both arguments now default to NULL.
    if (!is.null(nbuckets) && !is.null(probs)) {
        stop('Only specify nbuckets or probs, not both.')
    }
    # Fall back to deciles when neither argument is given.
    if (is.null(nbuckets)) nbuckets <- 10
    if (is.null(probs)) probs <- 0:nbuckets / nbuckets
    cut(x_bucket, quantile(x_quantile, probs), include.lowest=TRUE)
}
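
The bucketing above rests on two base R calls, `quantile` and `cut`. The following is a minimal standalone sketch of that technique, not the package's code: the vectors `reference` and `to_bucket` are made-up values chosen only to show how a value outside the reference range ends up as `NA`.

```r
# Toy impact factors (made-up values, not taken from the paper's data).
reference <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
to_bucket <- c(0.5, 1.2, 2.7, 4.0, 9.9)

# Cut points at the quartiles of the reference variable.
breaks <- quantile(reference, probs = 0:4 / 4)

# include.lowest = TRUE keeps the minimum of `reference` in the first bucket.
buckets <- cut(to_bucket, breaks, include.lowest = TRUE)

# The last value (9.9) lies above max(reference), so its bucket is NA.
print(buckets)
print(is.na(buckets))
```

Bucketing one variable by the quantiles of another is what lets the analysis place off-Academia articles into groups defined by the on-Academia impact factor distribution.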


#' Compare on- and off-Academia citations within years and quantile groups
#' of journal impact factors.
#'
#' @param cites_df A dataframe with citations and impact factors.
#' @param summarizer (default mean) A function to summarize citations within groups.
#' @param comparator (default `/`) A function with arguments (on, off), that compares
#' on and off-Academia citation summaries. The default computes the on/off ratio.
#' @param ... Extra parameters to `quantileBuckets`
#' @return A dataframe with statistic by year, impact factor group, and on/off-source.
compareByImpactFactorBuckets <- function(cites_df, summarizer=mean,
comparator=`/`, ...) {

# Find buckets based on the distribution of impact factors among on-Academia articles.
bucketFactors <- function(x) {
on_factors <- cites_df %>% filter(source=='on') %>% use_series(impact_factor)
quantileBuckets(on_factors, x, ...)
}

cites_df %>%
mutate(if_bucket = bucketFactors(impact_factor)) %>%
filter(!is.na(if_bucket)) %>%
group_by(if_bucket, year, source) %>%
summarize(cites=summarizer(citations)) %>%
reshape2::dcast(year + if_bucket ~ source, value.var='cites') %>%
mutate(comparison = comparator(on, off))
}
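
To make the group-then-compare step concrete, here is a rough base R sketch of the same idea on toy data. It uses `aggregate` and `reshape` in place of the `dplyr` and `reshape2` calls above, and the data frame `toy` and all of its values are invented for illustration.

```r
# Toy data: two years, two impact factor buckets, on/off sources (made-up values).
toy <- data.frame(
    year      = rep(c(2010, 2011), each = 8),
    if_bucket = rep(rep(c('low', 'high'), each = 4), times = 2),
    source    = rep(rep(c('on', 'off'), each = 2), times = 4),
    citations = c(4, 6, 2, 3, 20, 30, 10, 15, 3, 5, 4, 4, 12, 18, 10, 10),
    stringsAsFactors = FALSE
)

# Summarize citations within each (year, bucket, source) group.
means <- aggregate(citations ~ year + if_bucket + source, data = toy, FUN = mean)

# One row per (year, bucket), with separate columns for on and off.
wide <- reshape(means, idvar = c('year', 'if_bucket'),
                timevar = 'source', direction = 'wide')

# Compare the two summaries; the ratio mirrors the default comparator `/`.
wide$comparison <- wide$citations.on / wide$citations.off
print(wide)
```

A ratio above 1 in a row means on-Academia articles in that year and bucket averaged more citations than off-Academia ones.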

#' Average results over buckets, weighting by the number of on-Academia
#' articles in the bucket.
#'
#' @param cites_df A dataframe of citations with impact factors.
#' @return A dataframe of weighted average results by year.
#'
summarizeOverBuckets <- function(cites_df) {
cite_ratios <- compareByImpactFactorBuckets(cites_df,
summarizer=mean,
comparator=`/`)
counts <- compareByImpactFactorBuckets(cites_df,
summarizer=length,
comparator=`+`)

weights <- counts %>%
group_by(year) %>%
mutate(weight = on / sum(on)) %>%
ungroup %>%
select(year, if_bucket, weight)

cite_ratios %>% left_join(., weights, by=c('year', 'if_bucket')) %>%
group_by(year) %>% summarize(wtd_avg = sum(weight * comparison))

}
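
The weighting in `summarizeOverBuckets` can be checked by hand on a small case. The sketch below uses invented per-bucket counts and ratios for a single year; the column names `on_count` and `comparison` are chosen for the example and only loosely follow the data frames above.

```r
# Toy per-bucket results for one year (made-up numbers).
buckets <- data.frame(
    if_bucket  = c('low', 'mid', 'high'),
    on_count   = c(10, 30, 60),     # on-Academia articles in each bucket
    comparison = c(1.2, 1.5, 2.0)   # on/off citation ratio in each bucket
)

# Weight each bucket by its share of the year's on-Academia articles.
buckets$weight <- buckets$on_count / sum(buckets$on_count)

# Weighted average of the ratios: 0.1 * 1.2 + 0.3 * 1.5 + 0.6 * 2.0
wtd_avg <- sum(buckets$weight * buckets$comparison)
print(wtd_avg)
```

Because the weights sum to 1 within a year, buckets holding more on-Academia articles pull the yearly average toward their own ratio.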


#' Boxplots of on- and off-Academia citations within years and quantile groups
#' of journal impact factors.
#'
#' @param cites_df A dataframe with citations and impact factors.
#' @param ... Extra parameters to `quantileBuckets`
#'
#' @return A ggplot2 plot.
plotByImpactFactorBuckets <- function(cites_df, ...) {

# Find buckets based on the distribution of impact factors among on-Academia articles.
bucketFactors <- function(x) {
on_factors <- cites_df %>% filter(source=='on') %>% use_series(impact_factor)
quantileBuckets(on_factors, x, ...)
}

# Add bucket variable to data
df <- cites_df %>%
mutate(if_bucket = bucketFactors(impact_factor)) %>%
filter(!is.na(if_bucket))

p <- ggplot(df, aes(x=factor(year), y=citations, color=source)) +
geom_boxplot() +
facet_wrap(~if_bucket, ncol=2) +
labs(x='Year', y='Citations (log scale)',
title='Citations of On- and Off-Academia Articles By Year and Journal Impact Factor') +
theme_bw()
plotLogScale(p, xy='y')
}
61 changes: 61 additions & 0 deletions acadcites/R/figures.R
@@ -0,0 +1,61 @@
# Reproduce figures
#
# Figures:
# --------
# 1. Histograms of citation counts by off-/on-Academia
# 2. Citations boxplots by impact factor bucket and year of publication
# 3. Scatterplot of cites against impact factor
# 4. Scatterplot of cites against impact factor by off-/on-Academia and year.

## Function names that produce figures, listed in order
## of their appearance in the paper.
.figures_functions = list(
'plotCiteDistributions',
'plotByImpactFactorBuckets',
'plotCitesImpactFactorScatter',
'plotImpactFactorMedReg')


#' Function to generate figures from the paper.
#'
#' Recreate a figure with a given citations dataset by specifying the figure's
#' caption number in the paper.
#'
#' @param n Figure caption number.
#' @param cites_df A data frame with article citations and journal data, as produced by `importData`.
#' @param ... Optional arguments passed to figure functions.
#'
#' @return Nothing. Renders a plot.
#'
makeFigure <- function(n, cites_df, ...) {
    # Look the function up by name instead of parsing and evaluating a string.
    get(.figures_functions[[n]], mode='function')(cites_df, ...)
}
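
The dispatch pattern behind `makeFigure` is a list of function names indexed by caption number. The sketch below shows that pattern in isolation, assuming hypothetical stand-in functions (`drawFirst`, `drawSecond`) and a hypothetical wrapper `renderFigure`; none of these exist in the package, and the range check is an addition for the example.

```r
# Hypothetical stand-ins for the package's plotting functions.
drawFirst  <- function(df, ...) paste('figure 1 from', nrow(df), 'rows')
drawSecond <- function(df, ...) paste('figure 2 from', nrow(df), 'rows')

# Function names listed in the order the figures appear.
figure_names <- list('drawFirst', 'drawSecond')

# Resolve the n-th name to a function and call it with the data.
renderFigure <- function(n, df, ...) {
    if (n < 1 || n > length(figure_names)) {
        stop('No figure numbered ', n)
    }
    get(figure_names[[n]], mode = 'function')(df, ...)
}

print(renderFigure(2, data.frame(x = 1:3)))
```

Storing the functions themselves in the list would avoid the name lookup entirely; keeping names lets the list be defined before the functions are.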


plotCiteDistributions <- function(cites_df) {
ggplot(cites_df, aes(x=citations)) +
geom_histogram(binwidth=1, fill='steelblue', color='white') +
xlim(0, 100) +
facet_wrap(~source, scales='free_y') +
theme_bw()
}

plotCitesImpactFactorScatter <- function(cites_df) {
p <- ggplot(cites_df, aes(x=impact_factor, y=citations)) +
geom_point(position=position_jitter(height=.1, width=.01), alpha=.3, size=.75) +
    # geom_smooth is not listed in NAMESPACE, so qualify it explicitly.
    ggplot2::geom_smooth(method='lm') +
theme_bw() +
labs(x='Impact Factor (log scale)', y='Citations (log scale)')
plotLogScale(p, c('x', 'y'))
}

plotImpactFactorMedReg <- function(cites_df) {
p <- ggplot(cites_df, aes(x=impact_factor, y=citations, color=source)) +
geom_point(position=position_jitter(height=.1, width=.01), alpha=.3, size=.75) +
facet_wrap(~year, ncol=2) +
stat_quantile(quantiles=0.5) +
scale_colour_manual(values=c('orange', 'purple')) +
labs(x='Impact Factor (log scale)', y='Citations (log scale)') +
theme_bw()
plotLogScale(p, c('x', 'y'))
}
