# *corpkit*: a Python-based toolkit for working with parsed linguistic corpora

[Daniel McDonald](mailto:mcdonaldd@unimelb.edu.au?Subject=corpkit)
--------------------------

> **SUMMARY:** This *IPython Notebook* demonstrates how to use `corpkit` to investigate a corpus of paragraphs containing the word *risk* in the NYT between 1963 and 2014.

## Orientation

First, let's import the functions we'll be using to investigate the corpus. These functions are designed for this interrogation, but also have more general use in mind, so you can likely use them on your own corpora.

| **Function name** | Purpose                                           |
| ----------------- | ------------------------------------------------- |
| `interrogator()`  | interrogate parsed corpora                        |
| `editor()`        | edit interrogations                               |
| `plotter()`       | visualise results                                 |
| `conc()`          | concordancing of plaintext, trees or dependencies |
| `quickview()`     | view `interrogator()` results                     |

In [None]:
import corpkit
from corpkit import interrogator, editor, quickview, plotter, conc
# show visualisations inline:
%matplotlib inline

Next, let's set the path to our corpus. If you were using this interface for your own corpora, you would change this to the path to your data.

In [None]:
# to unzip nyt files:
# gzip -dc data/nyt.tar.gz | tar -xf - -C data
# corpus with annual subcorpora
annual_trees = 'data/nyt/years' 

### The data

Our main corpus consists of paragraphs from *New York Times* articles that contain a risk word, which we have defined by regular expression as `(?i)\brisk.?\b`. This includes *low-risk* and *risk/reward* as single tokens, but excludes *brisk* and *asterisk*.
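
To see which tokens the pattern accepts, here is a minimal sketch using Python's built-in `re` module. It assumes that `re` treats this pattern the same way as whatever engine was used to build the corpus; the token list is invented for illustration.

```python
import re

# The risk-word pattern quoted above, compiled with Python's re module
# (an assumption: the corpus builder may have used a different regex engine).
RISK_WORD = re.compile(r'(?i)\brisk.?\b')

# Invented test tokens
tokens = ['risk', 'Risks', 'low-risk', 'risk/reward', 'brisk', 'asterisk']

# search() looks for the pattern anywhere inside each token
matches = {token: bool(RISK_WORD.search(token)) for token in tokens}

for token in tokens:
    print('{:<12} {}'.format(token, matches[token]))
```

The word boundary `\b` before `risk` is what rules out *brisk* and *asterisk*: in both, the character before `r` is a letter, so there is no boundary at that point.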

The data comes from a number of sources.

* The 1963 editions were downloaded from *ProQuest Newsstand* as PDFs. Optical character recognition and manual processing were used to create a set of 1200 risk sentences.
* The 1987--2006 editions were taken from the *NYT Annotated Corpus*.
* The 2007--2014 editions were downloaded from *ProQuest Newsstand* as HTML.

In total, 149,504 documents were processed. The corpus from which the risk corpus was made is over 150 million words in length!

The texts have been parsed for part of speech and grammatical structure by [*Stanford CoreNLP*](http://nlp.stanford.edu/software/corenlp.shtml), using *corpkit*'s `build_corpus()` function.

### Interrogating the corpus

So, let's start by generating some general information about this corpus. First, let's define a query to find every word in the corpus. Run the cell below to define the `allwords_query` variable as the Tregex query to its right.

> *When writing Tregex queries or Regular Expressions, remember to always use `r'...'` quotes!*

In [None]:
# any token containing letters or numbers (i.e. no punctuation):
allwords_query = r'/[A-Za-z0-9]/ !< __' 
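
The note about `r'...'` quotes matters most for patterns containing backslashes. The short sketch below uses only Python's standard `re` module (not *corpkit*) to show what happens to `\b` with and without the `r` prefix.

```python
import re

plain = '\brisk'   # '\b' is interpreted by Python as a backspace character
raw = r'\brisk'    # backslash kept, so the regex engine sees a word boundary

# The plain string is one character shorter: backspace + 'risk'
print(len(plain), len(raw))

# Only the raw version finds 'risk' at a word boundary
plain_hit = re.search(plain, 'a risk word')
raw_hit = re.search(raw, 'a risk word')
print(plain_hit is None, raw_hit is not None)
```

Without the prefix, the pattern silently looks for a literal backspace and matches nothing, which is an easy mistake to miss.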

Next, we perform interrogations with `interrogator()`. Its most important arguments are:

1. **path to corpus**
2. Tregex **options**:
  * **'words'**: return only words
  * **'count'**: return a count of matches
  * **'tags'**: return only the tag
  * **'both'**: return tag and word together
3. the **Tregex query**

We only need to count tokens, so we can use the **count** option (it's often faster than getting lists of matching tokens). The cell below will run `interrogator()` over each annual subcorpus and count the number of matches for the query.

In [None]:
allwords = interrogator(annual_trees, 'count', allwords_query) 
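
Conceptually, the `count` option reduces each subcorpus to a single number. The sketch below imitates that idea on plain token lists rather than parse trees. The `toy_corpus` dict and its contents are invented for illustration, and nothing here calls *corpkit* or Tregex.

```python
import re

# Invented toy data: subcorpus name -> tokens (real input would be parse trees)
toy_corpus = {
    '1987': ['The', 'risk', 'was', 'low', '.'],
    '1988': ['Risks', ',', 'they', 'said', ',', 'rose', 'in', '1988', '.'],
}

# Same character class as allwords_query: a token with at least one letter or digit
has_alnum = re.compile(r'[A-Za-z0-9]')

# One count per subcorpus; punctuation-only tokens are not counted
totals = {year: sum(1 for tok in tokens if has_alnum.search(tok))
          for year, tokens in toy_corpus.items()}

for year in sorted(totals):
    print(year, totals[year])
```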

When the interrogation has finished, we can view our results:

In [None]:
# from the allwords results, print the totals
print(allwords.totals)

If you want to see the query and options that created the results, you can use:

In [None]:
print(allwords.query)

### Plotting results

Lists of years and totals are pretty dry. Luckily, we can use the `plotter()` function to visualise our results. At minimum, `plotter()` needs two arguments:

1. a title (in quotation marks)
2. a list of results to plot


In [None]:
plotter('Word counts in each subcorpus', allwords.totals)

Great! So, we can see that the number of words per year varies quite a lot. That's worth keeping in mind.

### Frequency of risk words in the NYT

Next, let's count the total number of risk words. Notice that we are using the `both` flag instead of the `count` flag.

In [None]:
# our query:
riskwords_query = r'__ < /(?i).?\brisk.?\b/' # any risk word and its word class/part of speech
# get all risk words and their tags:
riskwords = interrogator(annual_trees, 'both', riskwords_query)

Even when we do not use the `count` flag, we can access the total number of matches as before:

In [None]:
plotter('Risk words', riskwords.totals)

At the moment, it's hard to tell whether these counts simply reflect the different sizes of our annual NYT samples. To account for this, we can calculate the percentage of parsed words that are risk words. This means combining the two interrogations we have already performed.

We can do this by using `editor()`.

In [None]:
rel_risk = editor(riskwords.results, '%', allwords.totals)
plotter('Relative frequency of risk words', rel_risk.totals)
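
The arithmetic behind the `'%'` operation, as described above, is a per-subcorpus percentage. Here is a minimal sketch in plain Python. The counts are invented for illustration and are not taken from the NYT corpus, and the code does not use `editor()` itself.

```python
# Invented counts for illustration only (not taken from the NYT corpus)
risk_totals = {'1987': 50, '1988': 90}
word_totals = {'1987': 10000, '1988': 30000}

# Percentage of all words in each subcorpus that are risk words
relative = {year: 100.0 * risk_totals[year] / word_totals[year]
            for year in risk_totals}

for year in sorted(relative):
    print('{}: {:.2f}%'.format(year, relative[year]))
```

In this toy case, 1988 has more risk words in absolute terms but a lower share of risk words, which is exactly the kind of difference that raw counts hide.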

That's more helpful. We can now see some interesting peaks and troughs in the proportion of risk words. We can also see that 1963 contains the highest proportion of risk words. This is because the manual corrector of 1963 OCR entries preserved only the sentence containing risk words, rather than the paragraph.

It's often helpful to leave 1963 out of plots for this reason. To do this, we can drop that subcorpus from the results before passing them to `plotter()`:

In [None]:
plotter('Relative frequency of risk words', rel_risk.totals.drop('1963'))

Perhaps we're interested in not only the frequency of risk words, but the frequency of different *kinds* of risk words. We actually already collected this data during our last `interrogator()` query.

In [None]:
riskwords.results

In [None]:
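rel_riskwords = editor(riskwords.results, '%', riskwords.totals)  # assumed definition: each risk word as a % of all risk words
rel_riskwords.results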

We now have enough data to do some serious plotting.

In [None]:
plotter('Risk word / all risk words', rel_riskwords.results)

### Editing results

Results lists can be edited quickly with `editor()`. It has a lot of different options:

  | `editor()` argument | Mandatory/default?       |  Use          | Type  |
  | :------|:------- |:-------------|:-----|
  | `dataframe1` | **mandatory**      | the results you want to edit | `interrogator()` or `editor()` output |
  | `operation` | '%'      | if using second list, what operation to perform | `'+', '-', '/', '*', '%', 'k', 'd', 'a'` |
  | `dataframe2` | False      | Results to combine in some way with `dataframe1` | `interrogator()` or `editor()` output (usually, a `.totals` branch) |
  | `just_subcorpora` | False    |   Subcorpora to keep   |  list |
  | `skip_subcorpora` | False    |   Subcorpora to skip   |  list |
  | `merge_subcorpora` | False    |   Subcorpora to merge   |  list |
  | `new_subcorpus_name` | False    |   name for merged subcorpora   |  index/str |
  | `just_entries` | False    |   Entries to keep   |  list |
  | `skip_entries` | False    |   Entries to skip   |  list |
  | `merge_entries` | False    |   Entries to merge   |  list of words or indices/a regex to match |
  | `sort_by` | False    |   sort results   |  str: `'total', 'infreq', 'name', 'increase', 'decrease'` |
  | `keep_top` | False    |   Keep only top n results after sorting   |  int |
  | `just_totals` | False    |   Collapse all subcorpora, return Series   | bool |
  | `projection` | False    |   project smaller subcorpora   |  list of tuples: `[(subcorpus_name, projection_value)]` |
  | `**kwargs` | False    |   pass options to *Pandas*' `plot()` function, *Matplotlib*   |  various |

## Examples

Let's play around with `editor()` on a very simple interrogation:

In [None]:
adj = r'/JJ.?/ < /(?i)\brisk/'  # raw string, so \b stays a word boundary
adj_riskwords = interrogator(annual_trees, 'words', adj)

Here's how to edit subcorpora:

In [None]:
editor(adj_riskwords.results, skip_subcorpora = [1963, 1987, 1988]).results

In [None]:
editor(adj_riskwords.results, just_subcorpora = [1963, 1987, 1988]).results

In [None]:
editor(adj_riskwords.results, span_subcorpora = [2000, 2010]).results

We can edit entries too:

In [None]:
quickview(adj_riskwords.results)

In [None]:
editor(adj_riskwords.results, just_entries = [2, 5, 6]).results

In [None]:
editor(adj_riskwords.results, just_entries = ['risky', 'riskier', 'riskiest']).results

In [None]:
# skip any that start with 'r'
editor(adj_riskwords.results, skip_entries = r'^r').results
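
The two styles of entry selection shown above (a list of names, or a regular expression) can be sketched in plain Python as follows. The helper functions `skip_matching` and `keep_listed` are hypothetical stand-ins, the entry names are invented, and the use of `re.search` semantics is an assumption about how `editor()` applies a regex.

```python
import re

# Invented entry names standing in for the columns of a results table
entries = ['risky', 'at-risk', 'riskier', 'high-risk', 'riskiest']

def skip_matching(names, pattern):
    """Return the names that do NOT match the regex (stand-in for skip_entries)."""
    regex = re.compile(pattern)
    return [name for name in names if not regex.search(name)]

def keep_listed(names, wanted):
    """Return only the names in `wanted` (stand-in for just_entries with a list)."""
    return [name for name in names if name in wanted]

kept_after_skip = skip_matching(entries, r'^r')
kept_by_name = keep_listed(entries, ['risky', 'riskier', 'riskiest'])
print(kept_after_skip)
print(kept_by_name)
```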

### Sorting

In [None]:
# alphabetically
editor(adj_riskwords.results, sort_by = 'name').results

In [None]:
# least frequent
editor(adj_riskwords.results, sort_by = 'infreq').results

In [None]:
# increasing in frequency
editor(adj_riskwords.results, sort_by = 'increase').results
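
To make the sort orders concrete, here is a rough sketch on invented data. Treating `'increase'` as "steepest upward least-squares slope over time" is an assumption made for illustration; the source does not say how `editor()` measures it. The `slope` helper is hypothetical.

```python
# Invented yearly counts per entry, in chronological order
counts = {
    'risky':   [10, 12, 15],
    'at-risk': [2, 6, 12],
    'riskier': [5, 4, 3],
}

def slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / float(n)
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Alphabetical, least frequent first, and steepest increase first (assumed meaning)
by_name = sorted(counts)
by_infreq = sorted(counts, key=lambda name: sum(counts[name]))
by_increase = sorted(counts, key=lambda name: slope(counts[name]), reverse=True)

print(by_name)
print(by_infreq)
print(by_increase)
```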

### Multiple options

It's possible to use many options at the same time:

In [None]:
editor(adj_riskwords.results, '%', adj_riskwords.totals, span_subcorpora = [1990, 2000], 
    just_entries = r'^\(n', merge_entries = r'(nns|nnp)', newname = 'Plural/proper', sort_by = 'name').results

## Customising visualisations

We can use other `plotter()` arguments to customise what our chart shows. `plotter()`'s possible arguments are:

| `plotter()` argument | Mandatory/default?       |  Use          | Type  |
| :------|:------- |:-------------|:-----|
| `title` | **mandatory**      | A title for your plot | string |
| `results` | **mandatory**      | the results you want to plot | `interrogator()` or `editor()` output |
| `num_to_plot` | 7    | Number of top entries to show     |  int |
| `x_label` | False    | custom label for the x-axis     |  str |
| `y_label` | False    | custom label for the y-axis     |  str |
| `figsize` | (13, 6) | set the size of the figure | tuple: `(width, height)`|
| `tex` | `'try'` | use *TeX* to generate image text | boolean |
| `style` | `'ggplot'` | use Matplotlib styles | str: `'dark_background'`, `'bmh'`, `'grayscale'`, `'ggplot'`, `'fivethirtyeight'` |
| `legend_pos` | `'default'` | legend position | str: `'outside right'` to move legend outside chart |
| `show_totals` | `False` | Print totals on legend or plot where possible | str: `'legend'`, `'plot'`, `'both'`, or `False` |
| `save` | `False` | Save to file | `True`: save as `title`.png. str: save as `str` |
| `colours` | `'Paired'` | plot colours | str: any of Matplotlib's colormaps |
| `cumulative` | `False` | plot entries cumulatively | bool |
| `**kwargs` | False | pass other options to Pandas plot/Matplotlib | `rot = 45`, `subplots = True`, `fontsize = 16`, `cumulative = True`, `stacked = True`, etc. |


In [None]:
plotter('Example', adj_riskwords.results, kind = 'bar', stacked = True, style = 'fivethirtyeight')

In [None]:
plotter('Example 2', adj_riskwords.results, kind = 'area', x_label = 'Period', cumulative = True)

In [None]:
plotter('Example 3', adj_riskwords.results['at-risk'], kind = 'pie', figsize = (9,9))

In [None]:
plotter('Example 4', adj_riskwords.results, kind = 'line', show_totals = 'legend', 
        black_and_white = True, legend_pos = 'outside right')

and so on...

### Saving and loading results

*corpkit* has functions for saving and loading interrogations, edits, concordance lines and so on.

In [None]:
# specify what to save, and a name for the file.
from corpkit import save_result, load_result
save_result(allwords, 'allwords')

You can then load these results:

In [None]:
fromfile_allwords = load_result('allwords')
fromfile_allwords.totals
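
The general save-then-load pattern can be sketched with Python's standard `pickle` module. This is only an illustration of the round trip: whether `save_result()` uses `pickle` internally is not stated here, and the `result` dict and file name are invented.

```python
import os
import pickle
import tempfile

# A stand-in for an interrogation result (invented values)
result = {'query': r'/[A-Za-z0-9]/ !< __', 'totals': {'1987': 4, '1988': 6}}

# Write to a temporary directory so the sketch leaves the project untouched
path = os.path.join(tempfile.mkdtemp(), 'allwords.p')

with open(path, 'wb') as handle:
    pickle.dump(result, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)

# The restored object is equal to the original
print(restored == result)
```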

If you're in a project directory with saved data, you can also use `load_all_results()` to load every saved interrogation into a dict:

In [None]:
# r = load_all_results()
# r['riskwords'].totals

## Concordancing

You can use `conc()` to do concordancing. Its main arguments are:

1. A path to corpus or subcorpus
2. The kind of search you want to do (`'trees', 'deps', 'plaintext', 'tokens'`)
3. The search query

In [None]:
# here, we use a subcorpus of politics articles,
# rather than the total annual editions.
lines = conc('data/nyt/trees/politics/1999', 'trees', r'/JJ.?/ << /(?i).?\brisk.?\b/') # adj containing a risk word

You can set `conc()` to print *n* random concordances with the `random = n` parameter. You can also store the output to a variable for further searching.

In [None]:
lines = conc('data/nyt/trees/years/2007', 'trees', r'/VB.?/ < /(?i).?\brisk.?\b/', random = 25)

`conc()` takes another argument, `window`, which sets the amount of co-text shown on either side of the match. The default is 50 characters.

In [None]:
lines = conc('data/nyt/trees/health/2013', 'trees', r'/VB.?/ << /(?i).?\brisk.?\b/', random = 25, window = 20)
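
The effect of `window` and `random` can be illustrated with a small keyword-in-context sketch over plain text. This is not how `conc()` is implemented: it works on a raw string rather than parse trees, the `kwic` helper is hypothetical, and the sentence is invented.

```python
import random
import re

def kwic(text, pattern, window=50):
    """Return (left, match, right) tuples with up to `window` characters of co-text."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        lines.append((left, m.group(0), right))
    return lines

# Invented sentence for illustration
text = 'They risked everything, knowing the risk, and called it a risky bet.'

all_lines = kwic(text, r'(?i)\brisk\w*', window=20)
for left, match, right in all_lines:
    print('{:>20} | {} | {}'.format(left, match, right))

# Pick n lines at random, in the spirit of the random = n option
random.seed(0)  # fixed seed so the sketch is repeatable
sample = random.sample(all_lines, 2)
```

A smaller window makes lines easier to scan in bulk, while a larger one helps when the meaning of the match depends on the wider clause.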

`conc()` can also show parse trees, using the `trees` argument. By default, it is `False`:

In [None]:
lines = conc('data/nyt/trees/health/2013', 'trees', r'/VB.?/ << /(?i).?\brisk.?\b/', 
             random = 25, window = 20, trees = True)

## More coming soon...