This repository has been archived by the owner on Apr 24, 2018. It is now read-only.

Improve functionality
Change default tokenization to be more LIWC-similar
Add README.(R)md
Add testphrases data object
Fully pass check
kbenoit committed Apr 22, 2016
1 parent ebc7bd9 commit cb2d5ba
Showing 12 changed files with 351 additions and 32 deletions.
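The tokenization change is the part of this commit most likely to alter results: `tokenize()` is now called with `removePunct = TRUE, removeHyphens = TRUE`, so hyphenated compounds split into separate words and punctuation no longer counts as a token. A rough base-R sketch of that behaviour follows. `simple_tokens()` is a hypothetical helper written for illustration, not a quanteda function, and it only approximates what quanteda does.

```r
# Hypothetical helper: approximates LIWC-style word tokenization in base R.
# Assumption: a "word" is any run of ASCII letters or digits.
simple_tokens <- function(x, split_hyphens = TRUE) {
  x <- tolower(x)
  # keep hyphens inside words only when hyphen splitting is switched off
  drop <- if (split_hyphens) "[^a-z0-9]+" else "[^a-z0-9-]+"
  toks <- strsplit(gsub(drop, " ", x), " +")[[1]]
  # discard empty strings and tokens without letters or digits (a stray "-")
  toks[grepl("[a-z0-9]", toks)]
}

txt <- "The red-shirted lawyer gave her ex-boyfriend $300 out of pity :(."
length(simple_tokens(txt))                         # 12, the count LIWC reports
length(simple_tokens(txt, split_hyphens = FALSE))  # 10
```

With hyphens kept, `red-shirted` and `ex-boyfriend` each count as one word, which is why the two totals differ by two.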
2 changes: 2 additions & 0 deletions .Rbuildignore
@@ -1,2 +1,4 @@
^.*\.Rproj$
^\.Rproj\.user$
^README.Rmd$

17 changes: 12 additions & 5 deletions DESCRIPTION
@@ -1,16 +1,23 @@
Package: LIWCalike
Type: Package
Title: Text analysis similar to the Linguistic Inquiry and Word Count (LIWC)
Version: 0.1.0
Version: 0.1.1
Date: 2016-04-22
Author: Kenneth Benoit
Maintainer: Kenneth Benoit <kbenoit@lse.ac.uk>
Description: Built on the quanteda package for text analysis, LIWCalike provides a simple interface to the analysis of text by counting words and other textual features, including the application of a dictionary to produce a tabular report of percentages. This provides similar functionality to the LIWC stand-alone software. The user must supply a dictionary, which can include one of the custom LIWC dictionaries if these have been purchased from http://liwc.wpengine.com.
Description: Built on the quanteda package for text analysis, LIWCalike
provides a simple interface to the analysis of text by counting words and other
textual features, including the application of a dictionary to produce a tabular
report of percentages. This provides similar functionality to the LIWC
stand-alone software. The user must supply a dictionary, which can include one of
the custom LIWC dictionaries if these have been purchased from http://liwc.wpengine.com.
License: GPL-3
LazyData: TRUE
Depends: quanteda (>= 0.9.5.20)
Imports: stringi
Depends:
quanteda (>= 0.9.5-20)
Imports:
stringi
URL: http://github.com/kbenoit/LIWCalike
Encoding: UTF-8
BugReports: https://github.com/kbenoit/LIWCalike/issues
VignetteBuilder: knitr
RoxygenNote: 5.0.1
7 changes: 6 additions & 1 deletion NAMESPACE
@@ -1 +1,6 @@
exportPattern("^[[:alpha:]]+")
# Generated by roxygen2: do not edit by hand

S3method(liwcalike,character)
S3method(liwcalike,corpus)
export(liwcalike)
import(quanteda)
13 changes: 13 additions & 0 deletions R/data.R
@@ -0,0 +1,13 @@

#' @name testphrases
#' @docType data
#' @title sample short documents for testing
#' @description Some sample short documents in plain text format for testing
#' with \code{\link{liwcalike}}.
#' @examples
#' liwcalike(testphrases)
#'
NULL

# save(testphrases, file = "data/testphrases.RData")
# writeLines(testphrases, "inst/extdata/testphrases.txt")
50 changes: 36 additions & 14 deletions R/liwc.R → R/liwcalike.R
@@ -8,6 +8,8 @@
#' vector for analysis
#' @param dictionary a \pkg{quanteda} \link[quanteda]{dictionary} object
#' supplied for analysis
#' @param toLower convert to common (lower) case before tokenizing
#' @param verbose if \code{TRUE} print status messages during processing
#' @param ... options passed to \code{\link[quanteda]{tokenize}} offering
#' finer-grained control over how "words" are defined
#' @return a data.frame object containing the analytic results, one row per
@@ -20,25 +22,41 @@
#' texts into smaller units based on user-supplied tags, sentence, or
#' paragraph boundaries.
#' @examples
#' liwcalike(testphrases)
#'
#' # examples for comparison
#' txt <- c("The red-shirted lawyer gave her ex-boyfriend $300 out of pity :(.")
#' myDict <- dictionary(list(people = c("lawyer", "boyfriend"),
#' colorFixed = "red",
#' colorGlob = "red*",
#' mwe = "out of"))
#' liwcalike(txt, myDict, what = "word")
#' liwcalike(txt, myDict, what = "fasterword")
#' (toks <- tokenize(txt, what = "fasterword", removeHyphens = TRUE))
#' length(toks[[1]])
#' # LIWC says 12 words
#'
#' \dontrun{# works with LIWC 2015 dictionary too
#' liwcDict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2015_English_Flat.dic",
#' format = "LIWC")
#' inaugLIWCanalysis <- liwc(inaugTexts, liwcDict)
#'
#' inaugLIWCanalysis <- liwcalike(inaugTexts, liwcDict)
#' }
#' @export
liwc <- function(x, ...) {
UseMethod("liwc")
#' @import quanteda
liwcalike <- function(x, ...) {
UseMethod("liwcalike")
}


#' @rdname liwc
#' @rdname liwcalike
#' @export
liwc.corpus <- function(x, ...) {
liwc(texts(x), ...)
liwcalike.corpus <- function(x, ...) {
liwcalike(texts(x), ...)
}

#' @rdname liwc
#' @rdname liwcalike
#' @export
liwc.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE, ...) {
liwcalike.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE, ...) {

## initialize results data.frame
## similar to "Filename" and Segment
@@ -48,7 +66,7 @@ liwc.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE,
stringsAsFactors = FALSE)

## get readability before lowercasing
WPS <- readability(x, "meanSentenceLength", ...)
WPS <- readability(x, "meanSentenceLength") #, ...)

## lower case the texts if required
if (toLower) x <- toLower(x)
@@ -62,7 +80,7 @@ liwc.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE,
}

## tokenize and form the dfm
toks <- tokenize(x, ...)
toks <- tokenize(x, removePunct = TRUE, removeHyphens = TRUE, ...)
dfmAll <- dfm(toks, verbose = FALSE)
if (!is.null(dictionary))
dfmDict <- dfm(toks, verbose = FALSE, dictionary = dictionary)
@@ -86,7 +104,8 @@ liwc.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE,
## add the dictionary counts, transformed to percentages of total words
if (!is.null(dictionary))
result <- cbind(result,
as.data.frame(dfmDict / rep(result[["WC"]], each = nfeature(dfmDict)) * 100))
quanteda::as.data.frame(dfmDict / rep(result[["WC"]],
each = nfeature(dfmDict))) * 100)

## add punctuation counts
# AllPunc
@@ -102,9 +121,12 @@ liwc.character <- function(x, dictionary = NULL, toLower = TRUE, verbose = TRUE,
# Parenth -- note this is specified as "pairs of parentheses"
# OtherP

# format the result
result[, which(names(result)=="Sixltr") : ncol(result)] <-
format(result[, which(names(result)=="Sixltr") : ncol(result)],
digits = 4, trim = TRUE)

result
}


# the word counts

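The dictionary step in `liwcalike.character()` converts per-document dictionary counts into percentages of each document's word count. As a sketch of that arithmetic on a plain matrix (not a quanteda `dfm`; the counts and names below are made up for illustration), `sweep()` makes the row-wise division explicit:

```r
# Made-up dictionary counts: rows are documents, columns are dictionary keys
counts <- matrix(c(2, 0,
                   1, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"), c("people", "color")))
wc <- c(text1 = 4, text2 = 12)   # total words per document (the WC column)

# divide every row by its document's word count, then scale to percent
pct <- sweep(counts, 1, wc, "/") * 100
pct["text1", "people"]   # 100 * 2 / 4  = 50
pct["text2", "color"]    # 100 * 3 / 12 = 25
```

Because R stores matrices column by column, plain `counts / wc * 100` gives the same result when `length(wc)` equals `nrow(counts)`; `sweep()` simply states the intent.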
83 changes: 83 additions & 0 deletions README.Rmd
@@ -0,0 +1,83 @@
---
output:
md_document:
variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

**Master branch** [![Build Status](https://travis-ci.org/kbenoit/LIWCalike.svg?branch=master)]
[![codecov.io](https://codecov.io/github/kbenoit/LIWCalike/coverage.svg?branch=master)](https://codecov.io/github/kbenoit/LIWCalike/coverage.svg?branch=master)


## LIWCalike: an R implementation of the Linguistic Inquiry and Word Count

Built on the quanteda package for text analysis, LIWCalike provides a simple interface to the analysis of text by counting words and other textual features, including the application of a dictionary to produce a tabular report of percentages. This provides similar functionality to the LIWC stand-alone software. The user must supply a dictionary, which can include one of the custom LIWC dictionaries if these have been purchased from http://liwc.wpengine.com, or any other dictionary supplied by the user.

### Differences from the LIWC standalone software

This package is designed for R users and those wishing to build functionality by extending the [**quanteda**](https://github.com/kbenoit/quanteda) package for text analysis. If you prefer to have a complete, stand-alone user interface, then you should purchase and use the [LIWC standalone software](http://liwc.wpengine.com). The standalone software has several advantages:

* LIWC allows direct importing of files, including binary (Word, pdf, etc) formats. To use
**LIWCalike**, you will need to import these into the **quanteda** package first.
**LIWCalike** also works fine with simple character vectors, if you prefer to use
standard R methods to create your input object (e.g. `readLines()`, `read.csv()`, etc.)

* LIWC provides direct outputs in the form of csv, Excel files, etc. By contrast, **LIWCalike** returns a `data.frame`, which you have to export yourself (e.g. using `write.csv()`.)

* LIWC provides easy segmentation, through a GUI. By contrast, with **LIWCalike** you will
have to segment the texts yourself. (**quanteda** provides easy ways to do this using
`segment()` and `changeunits()`.)

* LIWC color codes the dictionary value matches in your texts and displays these in a nice graphical window.


## Using dictionaries with LIWCalike

No dictionaries are supplied with **LIWCalike**; it is up to you to supply these. With the **quanteda** functions for creating or importing dictionaries, however, this is quite easy.
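
To make the idea of a dictionary concrete, here is a small base-R sketch of fixed versus glob matching, mirroring the `colorFixed = "red"` and `colorGlob = "red*"` entries in the package examples. It uses a plain named list and `utils::glob2rx()` rather than a quanteda `dictionary` object. `count_matches()` is a hypothetical helper, and quanteda's own matching may differ in detail.

```r
# A dictionary as a plain named list: each key maps to glob-style patterns
my_dict <- list(people     = c("lawyer", "boyfriend"),
                colorFixed = "red",
                colorGlob  = "red*")

# Hypothetical helper: number of tokens matching any pattern under each key
count_matches <- function(tokens, dict) {
  vapply(dict, function(patterns) {
    # glob2rx() turns "red*" into "^red" and "red" into "^red$"
    rx <- paste(utils::glob2rx(patterns), collapse = "|")
    sum(grepl(rx, tokens))
  }, integer(1))
}

hyphenated <- c("the", "red-shirted", "lawyer", "gave", "her", "ex-boyfriend")
count_matches(hyphenated, my_dict)   # people 1, colorFixed 0, colorGlob 1

split_up <- c("the", "red", "shirted", "lawyer", "gave", "her", "ex", "boyfriend")
count_matches(split_up, my_dict)     # people 2, colorFixed 1, colorGlob 1
```

The second call shows why splitting hyphenated words affects dictionary percentages: `red` and `boyfriend` only match as whole words once the hyphens are removed.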

With LIWC 2007, external dictionaries were distributed with the software in the format read by Provalis Research's [*Wordstat*](http://provalisresearch.com/products/content-analysis-software/). Because I purchased a license for this product, I have that file and can use it with **LIWCalike**.

Using it is quite straightforward:

```{r}
require(LIWCalike)
# read in the dictionary
liwc2007dict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2007.cat",
format = "wordstat")
tail(liwc2007dict, 1)
# our test data
testphrases
# call LIWCalike
output <- liwcalike(testphrases, liwc2007dict)
# view some results
output[, c(1:7, ncol(output)-2)]
```


## How to Install

```
devtools::install_github("kbenoit/quanteda")
devtools::install_github("kbenoit/LIWCalike")
```

You need to have installed the **quanteda** package of at least version 0.9.5-20 for this
to work, since that update implemented multi-word dictionary values.
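
A quick way to confirm the version requirement is sketched below. It relies only on base R's `package_version()`, `packageVersion()`, and `requireNamespace()`. Note that R treats `0.9.5-20` and `0.9.5.20` as the same version, which is why both spellings can refer to the same release.

```r
# minimum quanteda version stated above
required <- package_version("0.9.5-20")

# R normalises "-" and "." separators, so this comparison is TRUE
required == package_version("0.9.5.20")

if (requireNamespace("quanteda", quietly = TRUE)) {
  installed <- utils::packageVersion("quanteda")
  if (installed < required) {
    message("quanteda ", as.character(installed),
            " is too old; need at least ", as.character(required))
  }
} else {
  message("quanteda is not installed")
}
```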


## Comments and feedback

I welcome your comments and feedback. Please file issues on the issues page, and/or send me comments at kbenoit@lse.ac.uk.


104 changes: 104 additions & 0 deletions README.md
@@ -0,0 +1,104 @@
**Master branch** \[![Build Status](https://travis-ci.org/kbenoit/LIWCalike.svg?branch=master)\]\[![codecov.io](https://codecov.io/github/kbenoit/LIWCalike/coverage.svg?branch=master)\](<https://codecov.io/github/kbenoit/LIWCalike/coverage.svg?branch=master>)

LIWCalike: an R implementation of the Linguistic Inquiry and Word Count
-----------------------------------------------------------------------

Built on the quanteda package for text analysis, LIWCalike provides a simple interface to the analysis of text by counting words and other textual features, including the application of a dictionary to produce a tabular report of percentages. This provides similar functionality to the LIWC stand-alone software. The user must supply a dictionary, which can include one of the custom LIWC dictionaries if these have been purchased from <http://liwc.wpengine.com>, or any other dictionary supplied by the user.

### Differences from the LIWC standalone software

This package is designed for R users and those wishing to build functionality by extending the [**quanteda**](https://github.com/kbenoit/quanteda) package for text analysis. If you prefer to have a complete, stand-alone user interface, then you should purchase and use the [LIWC standalone software](http://liwc.wpengine.com). The standalone software has several advantages:

- LIWC allows direct importing of files, including binary (Word, pdf, etc) formats. To use **LIWCalike**, you will need to import these into the **quanteda** package first.
**LIWCalike** also works fine with simple character vectors, if you prefer to use standard R methods to create your input object (e.g. `readLines()`, `read.csv()`, etc.)

- LIWC provides direct outputs in the form of csv, Excel files, etc. By contrast, **LIWCalike** returns a `data.frame`, which you have to export yourself (e.g. using `write.csv()`.)

- LIWC provides easy segmentation, through a GUI. By contrast, with **LIWCalike** you will have to segment the texts yourself. (**quanteda** provides easy ways to do this using `segment()` and `changeunits()`.)

- LIWC color codes the dictionary value matches in your texts and displays these in a nice graphical window.
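
For the import and export steps mentioned in the list above, base R is usually enough. The sketch below assumes one document per line in a plain-text file (the layout of `inst/extdata/testphrases.txt`). The temporary files and the stand-in `output` data frame are made up for illustration; in real use `output` would come from `liwcalike()`.

``` r
# write two sample documents, one per line, to a temporary file
infile <- tempfile(fileext = ".txt")
writeLines(c("Each row is a document.", "Why are we here?"), infile)

# import: a character vector with one element per document
txt <- readLines(infile)

# stand-in for the data.frame that liwcalike(txt) would return
output <- data.frame(docname = paste0("text", seq_along(txt)),
                     WC = c(5, 4),
                     stringsAsFactors = FALSE)

# export the results yourself
outfile <- tempfile(fileext = ".csv")
write.csv(output, outfile, row.names = FALSE)
```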

Using dictionaries with LIWCalike
---------------------------------

No dictionaries are supplied with **LIWCalike**; it is up to you to supply these. With the **quanteda** functions for creating or importing dictionaries, however, this is quite easy.

With LIWC 2007, external dictionaries were distributed with the software in the format read by Provalis Research's [*Wordstat*](http://provalisresearch.com/products/content-analysis-software/). Because I purchased a license for this product, I have that file and can use it with **LIWCalike**.

Using it is quite straightforward:

``` r
require(LIWCalike)
#> Loading required package: LIWCalike
#> Loading required package: quanteda
#> quanteda version 0.9.5.20
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:base':
#>
#> sample

# read in the dictionary
liwc2007dict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2007.cat",
format = "wordstat")
#> Warning in strsplit(w, "\\("): input string 1 is invalid in this locale
tail(liwc2007dict, 1)
#> $`SPOKEN CATEGORIES.FILLERS`
#> [1] "blah" NA "idontknow" "imean"
#> [5] "ohwell" "oranything*" "orsomething*" "orwhatever*"
#> [9] "rr*" "yakn*" "ykn*" "youknow*"

# our test data
testphrases
#> [1] "Test sentence for LIWCalike. Second sentence."
#> [2] "Each row is a document."
#> [3] "Comma, period."
#> [4] "The red-shirted lawyer gave her ex-boyfriend $300 out of pity :(."
#> [5] "LOL :-)."
#> [6] "(Parentheses) for $100."
#> [7] "Say \"what\" again!!"
#> [8] "Why are we here?"
#> [9] "Other punctation: §; ±."
#> [10] "Sentence one. Sentence two! :-)"

# call LIWCalike
output <- liwcalike(testphrases, liwc2007dict)

# view some results
output[, c(1:7, ncol(output)-2)]
#> docname Segment WC WPS Sixltr Dic
#> text1 text1 1 6 3 50.00 120.00
#> text2 text2 2 5 5 20.00 50.00
#> text3 text3 3 2 2 0.00 100.00
#> text4 text4 4 12 12 16.67 40.00
#> text5 text5 5 1 1 0.00 33.33
#> text6 text6 6 3 3 33.33 75.00
#> text7 text7 7 3 3 0.00 30.00
#> text8 text8 8 4 4 0.00 26.67
#> text9 text9 9 2 2 50.00 66.67
#> text10 text10 10 4 2 50.00 100.00
#> LINGUISTIC PROCESSES.FUNCTION WORDS SPOKEN CATEGORIES.ASSENT
#> text1 33.33 0
#> text2 50.00 0
#> text3 0.00 0
#> text4 66.67 0
#> text5 0.00 25
#> text6 16.67 0
#> text7 33.33 0
#> text8 50.00 0
#> text9 16.67 0
#> text10 33.33 0
```
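
The `WC`, `WPS`, and `Sixltr` columns above can be reproduced approximately in base R. The sketch below is an assumption inferred from the printed output, not the package's code: it treats words as runs of ASCII letters or digits, sentences as stretches ending in `.`, `!`, or `?` that contain at least one word, and `Sixltr` as the percentage of words longer than six letters. `summary_stats()` is a hypothetical helper.

``` r
summary_stats <- function(x) {
  # words: lower-cased runs of ASCII letters or digits
  words_in <- function(s) {
    toks <- strsplit(gsub("[^a-z0-9]+", " ", tolower(s)), " +")[[1]]
    toks[nzchar(toks)]
  }
  rows <- lapply(x, function(doc) {
    w <- words_in(doc)
    # sentence candidates; count only pieces that contain a word
    pieces <- strsplit(doc, "[.!?]+")[[1]]
    n_sent <- sum(grepl("[A-Za-z0-9]", pieces))
    data.frame(WC = length(w),
               WPS = length(w) / max(n_sent, 1),
               Sixltr = 100 * mean(nchar(w) > 6))
  })
  do.call(rbind, rows)
}

# text1 and text10 from the test data: WC 6 and 4, WPS 3 and 2, Sixltr 50 and 50
summary_stats(c("Test sentence for LIWCalike. Second sentence.",
                "Sentence one. Sentence two! :-)"))
```

Under these assumptions the two rows agree with the `text1` and `text10` rows of the output above; the trailing `:-)` is not counted as a sentence because it contains no word.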

How to Install
--------------

devtools::install_github("kbenoit/quanteda")
devtools::install_github("kbenoit/LIWCalike")

You need to have installed the **quanteda** package of at least version 0.9.5-20 for this to work, since that update implemented multi-word dictionary values.

Comments and feedback
---------------------

I welcome your comments and feedback. Please file issues on the issues page, and/or send me comments at <kbenoit@lse.ac.uk>.
Binary file added data/testphrases.RData
Binary file not shown.
10 changes: 10 additions & 0 deletions inst/extdata/testphrases.txt
@@ -0,0 +1,10 @@
Test sentence for LIWCalike. Second sentence.
Each row is a document.
Comma, period.
The red-shirted lawyer gave her ex-boyfriend $300 out of pity :(.
LOL :-).
(Parentheses) for $100.
Say "what" again!!
Why are we here?
Other punctation: ^; %, &.
Sentence one. Sentence two! :-)
12 changes: 0 additions & 12 deletions man/hello.Rd

This file was deleted.
