corpusdatr

A data package consisting of two corpora:

  • Slate Magazine corpus (ca. 1996-2000; 1K texts, ~1 million words), derived from the OANC.
  • A current, aggregate (i.e., bag-of-words) corpus derived from web-based news articles, collected over a three-week period.

Both corpora are sized well for demo and pedagogical purposes.

# devtools::install_github("jaytimm/corpusdatr")
library(corpusdatr)
library(tidyverse)
library(spacyr)

Slate Magazine corpus

The Slate Magazine corpus is derived from the Slate Magazine portion of the Open American National Corpus (OANC). The Slate sub-corpus of the OANC comprises over 4,500 articles (~4 million words) published between 1996 and 2000.

For the sake of manageability, the full Slate corpus is reduced here to 1,000 randomly selected articles ranging in length from 850 to 1,500 words. This amounts to a corpus of approximately 1 million words.

Each text has been annotated using the spacyr package, and named entities have been identified. The annotated corpus loads as a data frame called cdr_slate_ann; the original text corpus is also included in the package as cdr_slate_corpus.
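The annotation is token-level, in the style of spacyr's parse output. As an illustrative sketch of working with it, the example below uses a toy data frame in place of cdr_slate_ann (the doc_id, token, pos, and entity columns are assumed from spacyr's format, not verified against the package):

```r
# Toy stand-in for cdr_slate_ann; column names assumed from
# spacyr::spacy_parse() output.
ann <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  token  = c("Trash", "Talking", ".", "McCain", "wo"),
  pos    = c("NOUN", "VERB", "PUNCT", "PROPN", "VERB"),
  entity = c("", "", "", "PERSON_B", ""),
  stringsAsFactors = FALSE
)

# Token counts per document, excluding punctuation
tokens <- ann[ann$pos != "PUNCT", ]
table(tokens$doc_id)
```

The same document-level grouping (by doc_id) underlies the frequency summaries shown below.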

Slate Magazine metadata

head(corpusdatr::cdr_slate_meta)
##   doc_id textLength textType textSent                            title
## 1      1       1087      578       40                    Trash Talking
## 2      2        866      463       31 McCain and Bradley Won't Go Soft
## 3      3       1297      616       68       Attention Must Not Be Paid
## 4      4       1074      500       82                    Pompous? Moi?
## 5      5       1244      702       51             For Viewers Like You
## 6      6       1120      649       86            Downsizing Hell 
##                   oancID
## 1  42/ArticleIP_3182.txt
## 2 19/Article247_4220.txt
## 3 49/ArticleIP_19380.txt
## 4   8/Article247_904.txt
## 5  41/ArticleIP_3034.txt
## 6 55/ArticleIP_73646.txt

Geo-political entities in Slate

Additionally included in the package is an sf points object containing the lat/lon coordinates of geo-political entities occurring in more than 1% of the texts in the Slate corpus. It is included to enable geographical analysis of text data.

head(corpusdatr::cdr_slate_gpe)
##         lemma txtf docf             geometry
## 1 AFGHANISTAN   21   14   67.70995, 33.93911
## 2     ALABAMA   20   13  -86.90230, 32.31823
## 3      ALASKA   19   10 -149.49367, 64.20084
## 4     ALBANIA   25   12   20.16833, 41.15333
## 5   ARGENTINA   29    7 -63.61667, -38.41610
## 6     ARIZONA   30   17 -111.09373, 34.04893
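Since the Slate corpus contains 1,000 texts, the docf column translates directly into the document-frequency percentage behind the 1% inclusion threshold. A quick arithmetic check, using values copied from the head() output above:

```r
# docf values from the first three rows of cdr_slate_gpe shown above
gpe <- data.frame(
  lemma = c("AFGHANISTAN", "ALABAMA", "ALASKA"),
  docf  = c(14, 13, 10)
)

# Share of the 1,000 Slate texts each entity occurs in (percent)
gpe$doc_pct <- gpe$docf / 1000 * 100
gpe
```

AFGHANISTAN, for example, occurs in 14 of 1,000 texts, i.e. 1.4% — above the 1% cutoff.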

Slate content

To get a quick sense of the content of articles included in the corpus, we plot the most frequent named entities by category.

corpusdatr::cdr_slate_ann %>%
  spacyr::entity_extract() %>%
  group_by(entity_type, entity) %>%
  summarize(freq = n()) %>%
  group_by(entity_type) %>%
  top_n(n = 13, wt = freq) %>%
  arrange(entity, desc(freq)) %>%
  filter(entity_type %in% c('PERSON', 'ORG', 'GPE', 'NORP')) %>%

  ggplot(aes(x = reorder(entity, freq), y = freq, fill = entity_type)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~entity_type, scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = "Named entities in Slate corpus (1996-2000)")

Google News corpus

For demo purposes, it is always nice to have a current corpus on hand, as well as a corpus with time-series information. To this end, the package also includes a corpus of articles scraped from the web (via Google News' RSS feed) using my R package quicknews. A timed script was used to obtain and annotate articles three times a day for roughly three weeks (from 11/27/17 to 12/20/17). Search was limited to top stories in the United States. Again, spacyr was used to build annotations.

To avoid copyright issues, each constituent article in the corpus is reduced to a bag-of-words. The corpus comprises ~1,500 texts, ~1.3 million words, and ~200 unique media sources, and loads as a single data frame, cdr_gnews_historical.
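A bag-of-words reduction simply replaces each article's token sequence with per-token counts, discarding word order. A minimal base-R sketch of the idea (illustrative only; the actual reduction used in the package is not shown here):

```r
# A toy "article": tokenize, lowercase, then count tokens
txt <- "Media attacks media over tweet"
tokens <- unlist(strsplit(tolower(txt), "\\s+"))

# The bag of words: token counts, word order discarded
bow <- sort(table(tokens), decreasing = TRUE)
bow
```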

Metadata for the corpus can be accessed via cdr_gnews_meta:

head(corpusdatr::cdr_gnews_meta)
##   doc_id   pubdates          source
## 1      1 2017-11-27  New York Times
## 2      2 2017-11-27  New York Times
## 3      3 2017-11-27 Washington Post
## 4      4 2017-11-27             CNN
## 5      5 2017-11-27             CNN
## 6      6 2017-11-27 Washington Post
##                                                                      titles
## 1         2 Bosses Show Up to Lead the Consumer Financial Protection Bureau
## 2                                    Meghan Markle Is Going to Make History
## 3 Trump could personally benefit from last-minute change to Senate tax bill
## 4                           Melania Trump unveils White House holiday decor
## 5        Trump's latest conspiracy? The 'Access Hollywood' tape was a fake!
## 6                  Trump attacks media in his first post-Thanksgiving tweet
##   docN docType docSent
## 1  504     295      19
## 2  751     434      32
## 3 1473     582      59
## 4  796     380      31
## 5  941     441      45
## 6  647     316      22
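With pubdates in hand, daily article counts — the time-series angle mentioned above — reduce to a simple aggregation. A base-R sketch, using a toy stand-in for cdr_gnews_meta (the real data frame is not loaded here):

```r
# Toy stand-in for cdr_gnews_meta, mirroring the columns shown above
meta <- data.frame(
  doc_id   = 1:6,
  pubdates = as.Date(c("2017-11-27", "2017-11-27", "2017-11-27",
                       "2017-11-28", "2017-11-28", "2017-11-29")),
  source   = c("New York Times", "CNN", "CNN",
               "Washington Post", "CNN", "New York Times"),
  stringsAsFactors = FALSE
)

# Number of articles published per day
daily <- aggregate(doc_id ~ pubdates, data = meta, FUN = length)
names(daily)[2] <- "n_articles"
daily
```

The same grouping by pubdates (or by source) supports plotting corpus volume or lexical trends over the three-week collection window.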