Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
2 contributors

Users who have contributed to this file

@juliasilge @TylerSchnoebelen
112 lines (78 sloc) 5.93 KB
---
title: "Tidy Log Odds"
author: "Julia Silge and Tyler Schnoebelen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Tidy Log Odds}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
warning = FALSE, message = FALSE,
collapse = TRUE,
comment = "#>")
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```
## A motivating example: what words are important to a text?
There are multiple ways to measure which words (or bigrams, or other units of text) are important in a text. You can count words, or [measure tf-idf](https://www.tidytextmining.com/tfidf.html). This package implements a different approach for measuring which words are important to a text, a **weighted log odds**.
A log odds ratio is a way of expressing probabilities, and we can weight a log odds ratio so that our implementation does a better job dealing with different combinations of words and documents having different counts. In particular, we use the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior.
What does this mean? It means that by weighting in this way, we take into account the sampling error in our measurements and acknowledge that we are more certain when we've counted something a lot of times and less certain when we've counted something only a few times. When weighting by a prior in this way, we focus on differences that are more likely to be real, given the evidence that we have.
Let's look at just such an example.
## Jane Austen and bigrams
Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel, focusing on bigrams that include *no upper case letters*. (We can look at differences other than those involving proper nouns this way.) This weighted log odds approach would work equally well for single words, or other kinds of n-grams.
```{r bigram_counts}
library(dplyr)
library(janeaustenr)
library(tidytext)
library(stringr)
tidy_bigrams <- austen_books() %>%
unnest_tokens(bigram, text, token="ngrams", n = 2, to_lower = FALSE) %>%
filter(!str_detect(bigram, "[A-Z]"))
bigram_counts <- tidy_bigrams %>%
count(book, bigram, sort = TRUE)
bigram_counts
```
Notice that we haven't removed stop words, or filtered out rarely used words. We have done very little pre-processing of this text data.
Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books?
```{r bigram_log_odds, dependson="bigram_counts"}
library(tidylo)
bigram_log_odds <- bigram_counts %>%
bind_log_odds(book, bigram, n)
bigram_log_odds %>%
arrange(-log_odds)
```
The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve concepts specific to each novel, such as the pump room in *Northanger Abbey* and sister/mother figures in *Sense & Sensibility*. We can make a visualization as well.
```{r bigram_plot, dependson="bigram_log_odds", fig.width=10, fig.height=7}
library(ggplot2)
bigram_log_odds %>%
group_by(book) %>%
top_n(10) %>%
ungroup %>%
mutate(bigram = reorder(bigram, log_odds)) %>%
ggplot(aes(bigram, log_odds, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, scales = "free") +
coord_flip() +
labs(x = NULL)
```
These bigrams have the highest log odds for each book.
The proper names of characters tend to be unique from book to book unless you're looking at a series like *Harry Potter*. When it comes identifying words or phrases that are unique to a text, a measure like tf-idf does a good job. But because of the way tf-idf is calculated, it cannot distinguish between words that are used in **all** texts.
For example, the phrase "had been" is used over 100 times in all six Austen novels. The "idf" in tf-idf stands for "inverse document frequency". When a word is in all the texts, tf-idf is zero for all of them. In weighted log odds, however, each book has a different value; in this case the weighted log odds ranges from `r bigram_log_odds %>% filter(bigram == "had been", book == "Persuasion") %>% pull(log_odds) %>% round(2)` in *Persuasion* (very high) down to `r bigram_log_odds %>% filter(bigram == "had been", book == "Sense & Sensibility") %>% pull(log_odds) %>% round(2)` in *Sense & Sensibility* (which suggests something about the style or content is suppressing the phrase).
## Counting things other than words
Text analysis is a main motivator for this implementation of weighted log odds, but this is a general approach for measuring how much more likely one feature (any kind of feature, not just a word or bigram) is to be associated than another for some set or group (any kind of set, not just a document or book).
To demonstrate this, let's look at everybody's favorite data about cars. What do we know about the relationship between number of gears and engine shape `vs`?
```{r gear_counts}
gear_counts <- mtcars %>%
count(vs, gear)
gear_counts
```
Now we can use `bind_log_odds()` to find the weighted log odds ratio for each number of gears and engine shape.
```{r dependson="gear_counts"}
gear_counts %>%
bind_log_odds(vs, gear, n)
```
For engine shape `vs = 0`, having three gears has the highest log odds while for engine shape `vs = 1`, having four gears has the highest log odds. This dataset is small enough that you can look at the count data and see how this is working.
More importantly, you can notice that this approach is useful both in the initial motivating example of text data but also more generally whenever you have counts in some kind of groups and you want to find what feature is more likely to come from which group, compared to the other groups.
You can’t perform that action at this time.