Commit 2182321

Add travis, appveyor, covr, + rename function arguments

juliasilge committed Jun 13, 2019
1 parent e117ba8 commit 2182321
Showing 10 changed files with 101 additions and 20 deletions.
3 changes: 3 additions & 0 deletions .Rbuildignore
@@ -3,3 +3,6 @@
 ^LICENSE\.md$
 ^README\.Rmd$
 ^CODE_OF_CONDUCT\.md$
+^\.travis\.yml$
+^appveyor\.yml$
+^codecov\.yml$
7 changes: 7 additions & 0 deletions .travis.yml
@@ -0,0 +1,7 @@
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r

language: R
cache: packages

after_success:
- Rscript -e 'covr::codecov()'
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -22,6 +22,7 @@ Suggests:
     tidytext,
     janeaustenr,
     ggplot2,
-    testthat (>= 2.1.0)
+    testthat (>= 2.1.0),
+    covr
VignetteBuilder: knitr
RoxygenNote: 6.1.1
22 changes: 11 additions & 11 deletions R/bind_log_odds.R
@@ -5,14 +5,14 @@
 #' is added as a column. This function supports non-standard evaluation through
 #' the tidyeval framework.
 #'
-#' @param tbl A tidy dataset with one row per item and feature
-#' @param item Column of items for identifying differences, such as words or
+#' @param tbl A tidy dataset with one row per feature and set
+#' @param feature Column of features for identifying differences, such as words or
 #' bigrams with text data
-#' @param feature Column of features between which to compare items, such as
+#' @param set Column of sets between which to compare features, such as
 #' documents for text data
-#' @param n Column containing item-feature counts
+#' @param n Column containing feature-set counts
 #'
-#' @details The arguments \code{item}, \code{feature}, and \code{n}
+#' @details The arguments \code{feature}, \code{set}, and \code{n}
#' are passed by expression and support \link[rlang]{quasiquotation};
#' you can unquote strings and symbols. Grouping is preserved but ignored.
#'
@@ -40,25 +40,25 @@
#' @importFrom dplyr count left_join mutate rename group_by ungroup group_vars
#' @export

-bind_log_odds <- function(tbl, item, feature, n) {
-  item <- enquo(item)
+bind_log_odds <- function(tbl, feature, set, n) {
   feature <- enquo(feature)
+  set <- enquo(set)
   n_col <- enquo(n)

   ## groups are preserved but ignored
   grouping <- group_vars(tbl)
   tbl <- ungroup(tbl)

-  freq1_df <- count(tbl, !!item, wt = !!n_col)
+  freq1_df <- count(tbl, !!feature, wt = !!n_col)
   freq1_df <- rename(freq1_df, freq1 = n)

-  freq2_df <- count(tbl, !!feature, wt = !!n_col)
+  freq2_df <- count(tbl, !!set, wt = !!n_col)
   freq2_df <- rename(freq2_df, freq2 = n)

-  df_joined <- left_join(tbl, freq1_df, by = as_name(item))
+  df_joined <- left_join(tbl, freq1_df, by = as_name(feature))
   df_joined <- mutate(df_joined, freqnotthem = freq1 - !!n_col)
   df_joined <- mutate(df_joined, total = sum(!!n_col))
-  df_joined <- left_join(df_joined, freq2_df, by = as_name(feature))
+  df_joined <- left_join(df_joined, freq2_df, by = as_name(set))
   df_joined <- mutate(df_joined,
     freq2notthem = total - freq2,
     l1them = (!!n_col + freq1) / ((total + freq2) - (!!n_col + freq1)),
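With the renamed arguments, a call now puts the feature column first, then the set column, then the counts. A minimal sketch, assuming the dplyr and tibble packages are available; the toy table and its column names (`document`, `word`, `n`) are illustrative, not fixed by the package:

```r
library(dplyr)
library(tidylo)

# Hypothetical toy data: one row per feature (word) per set (document)
word_counts <- tibble::tribble(
  ~document, ~word,   ~n,
  "a",       "apple", 10,
  "a",       "pear",   2,
  "b",       "apple",  1,
  "b",       "pear",  12
)

# feature column first, then the set column, then the count column;
# the weighted log odds is appended as a new column
bind_log_odds(word_counts, word, document, n)
```

Per the documentation above, any grouping on the input tibble is preserved in the result but ignored by the computation itself.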
3 changes: 3 additions & 0 deletions README.Rmd
@@ -19,6 +19,9 @@ theme_set(theme_light())


<!-- badges: start -->
+[![Travis build status](https://travis-ci.org/juliasilge/tidylo.svg?branch=master)](https://travis-ci.org/juliasilge/tidylo)
+[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/juliasilge/tidylo?branch=master&svg=true)](https://ci.appveyor.com/project/juliasilge/tidylo)
+[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidylo/branch/master/graph/badge.svg)](https://codecov.io/gh/juliasilge/tidylo?branch=master)
<!-- badges: end -->

How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times, so how do we know which differences are meaningful?
3 changes: 3 additions & 0 deletions README.md
@@ -10,6 +10,9 @@


<!-- badges: start -->
+[![Travis build status](https://travis-ci.org/juliasilge/tidylo.svg?branch=master)](https://travis-ci.org/juliasilge/tidylo)
+[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/juliasilge/tidylo?branch=master&svg=true)](https://ci.appveyor.com/project/juliasilge/tidylo)
+[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidylo/branch/master/graph/badge.svg)](https://codecov.io/gh/juliasilge/tidylo?branch=master)
<!-- badges: end -->

How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times, so how do we know which differences are meaningful?
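The plain log odds ratio alluded to in the README compares the odds of a feature in one set against its odds in all other sets combined. A minimal sketch; the function and argument names here are illustrative, not part of tidylo:

```r
# Plain (unweighted) log odds ratio: a feature counted y1 times out of n1
# total counts in the set of interest, versus y2 times out of n2 total
# counts everywhere else.
log_odds_ratio <- function(y1, n1, y2, n2) {
  log((y1 / (n1 - y1)) / (y2 / (n2 - y2)))
}

log_odds_ratio(10, 12, 1, 13)  # ≈ 4.09
```

Small counts make this ratio swing wildly, which is exactly the sampling-variability problem the README describes and the weighted log odds in `bind_log_odds` is designed to address.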
52 changes: 52 additions & 0 deletions appveyor.yml
@@ -0,0 +1,52 @@
# DO NOT CHANGE the "init" and "install" sections below

# Download script file from GitHub
init:
  ps: |
        $ErrorActionPreference = "Stop"
        Invoke-WebRequest http://raw.github.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1"
        Import-Module '..\appveyor-tool.ps1'

install:
  ps: Bootstrap

cache:
- C:\RLibrary

environment:
NOT_CRAN: true
# env vars that may need to be set, at least temporarily, from time to time
# see https://github.com/krlmlr/r-appveyor#readme for details
# USE_RTOOLS: true
# R_REMOTES_STANDALONE: true

# Adapt as necessary starting from here

build_script:
- travis-tool.sh install_deps

test_script:
- travis-tool.sh run_tests

on_failure:
- 7z a failure.zip *.Rcheck\*
- appveyor PushArtifact failure.zip

artifacts:
- path: '*.Rcheck\**\*.log'
name: Logs

- path: '*.Rcheck\**\*.out'
name: Logs

- path: '*.Rcheck\**\*.fail'
name: Logs

- path: '*.Rcheck\**\*.Rout'
name: Logs

- path: '\*_*.tar.gz'
name: Bits

- path: '\*_*.zip'
name: Bits
12 changes: 12 additions & 0 deletions codecov.yml
@@ -0,0 +1,12 @@
comment: false

coverage:
  status:
    project:
      default:
        target: auto
        threshold: 1%
    patch:
      default:
        target: auto
        threshold: 1%
12 changes: 6 additions & 6 deletions man/bind_log_odds.Rd

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions vignettes/tidy_log_odds.Rmd
@@ -84,7 +84,7 @@ Why you might choose log odds over tf-idf? TODO for Tyler

## Counting things other than words

-Text analysis is a main motivator for this implementation of weighted log odds, but this is a general approach for measuring how much more likely one item (any kind of item, not just a word or bigram) is to be associated than another for some set of features (any kind of feature, not just a document or book).
+Text analysis is a main motivator for this implementation of weighted log odds, but this is a general approach for measuring how much more likely one feature (any kind of feature, not just a word or bigram) is to be associated than another for some set or group (any kind of set, not just a document or book).

To demonstrate this, let's look at everybody's favorite data about cars. What do we know about the relationship between number of gears and engine shape `vs`?

@@ -104,4 +104,4 @@ gear_counts %>%

For engine shape `vs = 0`, having three gears has the highest log odds, while for engine shape `vs = 1`, having four gears has the highest log odds. This dataset is small enough that you can look at the count data and see how this is working.

-More importantly, you can notice that this approach is useful both in the initial motivating example of text data but also more generally whenever you have counts in some kind of groups and you want to find what is more likely to come from which group, compared to the other groups.
+More importantly, you can notice that this approach is useful both in the initial motivating example of text data but also more generally whenever you have counts in some kind of groups and you want to find what feature is more likely to come from which group, compared to the other groups.
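The `gear_counts` table referenced in the hunk above can be sketched from the built-in `mtcars` data. This is an assumed reconstruction, since the vignette's setup chunk is not part of this diff:

```r
library(dplyr)

# One row per (engine shape, gear count) pair, with n = number of cars
# in each combination -- counts suitable for passing to bind_log_odds()
gear_counts <- mtcars %>%
  count(vs, gear)

gear_counts
```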
