# DIGI405 Lab 2.3: Keywords

## Lab 2 Introduction

In the lab notebooks for module 2 we introduce collocation analysis, analysis of clusters and n-grams, and
keyword analysis.

### A note about the Quake Stories v2 corpus

This notebook works with the Quake Stories v2 (QSv2) corpus. The data comes from
http://www.quakestories.govt.nz/ and consists of crowd-sourced accounts of earthquake experiences
following the 2010 and 2011 Canterbury earthquakes. This corpus contains 487 self-reported stories of
earthquake experiences from 2011 to 2019. It is licensed under Creative Commons BY-NC-SA. Please
be aware that some stories may relate to people who were killed or injured in the earthquakes. Please
treat the material with respect.

Remember, you can read about the filename format in the README file included in the corpus zip
file. This provides a way for you to view the original web page that each text was scraped from.

In [None]:
from conc.corpus import Corpus
from conc.listcorpus import ListCorpus
from conc.conc import Conc
import shutil
import os

In [None]:
source_path = '/srv/source-data/' # path to the source data from which the corpus will be created
save_path = '/srv/corpora/' # path to the directory where corpora are stored

In [None]:
corpus = Corpus().load(f'{save_path}quake-stories-v2.corpus')

In [None]:
# # if you are running the code on your own machine, unzip the source files - adjust the corpus_source_path below to point to the directory with the source files
# # uncomment the remaining lines of this cell to create the corpus from the source files (or load if it already exists)
# corpus_source_path = f'{source_path}quake-stories-v2.zip'
# try:
#     corpus = Corpus().load(f'{save_path}quake-stories-v2.corpus')
# except FileNotFoundError:
#     corpus = Corpus(name = 'Quake Stories v2', description = 'This is a corpus based on stories from the http://www.quakestories.govt.nz/ website established by Manatū Taonga / Ministry for Culture and Heritage in 2011. QuakeStories was a place for the public to share stories of these and subsequent New Zealand earthquakes. The site was licensed under Creative Commons BY-NC-SA (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). This data-set, as a re-representation of the stories, is also released under BY-NC-SA. Please be aware that some stories may relate to people who were killed or injured in the Canterbury earthquakes. Treat the material with respect. ').build_from_files(corpus_source_path, f'{save_path}/')

In [None]:
corpus.summary() # overview of the corpus

In [None]:
conc = Conc(corpus) #initialize Conc reporting with the corpus

If you want to examine the contents of the source data for the Quake Stories v2 corpus, uncomment the following cell and run it. This will copy the zip file into your current working directory. You can then download it to your local computer by right-clicking on the file in the file viewer and clicking 'Download'.

In [None]:
# source_file = '/srv/source-data/quake-stories-v2.zip' # path to the file 
# destination = os.path.join(os.getcwd(), os.path.basename(source_file))
# shutil.copy(source_file, destination)

## Keywords

This lab notebook will introduce you to keyword analysis. 

In the frequency analysis lab notebook (1.1) you learned how to remove stop words from a frequency table. However, you also learned that frequent function words can be revealing about the contents of a corpus, so removing stop words can be a poor and arbitrary research practice that obscures meaningful patterns in the data. Rather than arbitrarily removing frequent words we find uninteresting, the statistical measures used in keyword analysis provide a way to identify words that are over-represented in one corpus (the target corpus) compared to another (the reference corpus). These are referred to as *keywords*.

Keyword analysis is a helpful way into a corpus, as it provides a way to identify what is distinctive about the corpus we are analysing. To analyse keywords we combine techniques introduced through the lab notebooks to date. Keyword tables provide information on frequency patterns for specific tokens that are distinctive. We can make use of collocation analysis and concordancing to understand the word choices represented by keywords. 

This notebook continues the analysis of the Quake Stories v2 corpus introduced in lab notebooks 2.1 and 2.2. In the second part of this notebook you will analyse a corpus of political speeches.

## Section 1: Comparing a specialist corpus with a general reference corpus

In Section 1 we compare the Quake Stories v2 corpus with a general reference corpus. This kind of comparison is useful for identifying what is distinctive about the target corpus and how it differs from language use more generally. In this instance, we are comparing texts authored in New Zealand in the 2010s with a general reference corpus of British English written and spoken language use from the 1990s, namely the British National Corpus (BNC). Although there are problems with this comparison that are worth thinking through, the BNC is often used as a general reference corpus due to its size (100m words) and high quality.

The following cells load a lightweight representation of frequency information from the BNC for use as a reference corpus. If you had the full BNC, you could use that instead, but for keyword analysis, we only need the frequency of tokens in the reference corpus. 

In [None]:
reference_corpus = ListCorpus().load(f'{save_path}bnc.listcorpus') # loading a Conc list corpus as a reference corpus
# # if you had the full BNC, you could use this instead:
# reference_corpus = Corpus().load(f'{save_path}bnc.corpus') # loading a Conc corpus as a reference corpus
# # See the Conc documentation site for information on creating a list corpus from the BNC https://geoffford.nz/conc/

In [None]:
reference_corpus.summary()

In [None]:
conc.set_reference_corpus(reference_corpus) # setting the reference corpus for keyword analysis

The table below shows the top 20 keywords in the Quake Stories v2 corpus compared to the BNC reference corpus. Notice that many of the keywords are terms related to the major topic of the corpus, i.e. the earthquakes in Canterbury, New Zealand in 2010 and 2011.  

In this instance the table is ordered by the Log Likelihood ratio, which is a statistical significance measure. This ordering ranks keywords based on the strength of the statistical evidence for over-use in Quake Stories v2 compared to the BNC.
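To make the idea of a statistical significance measure more concrete, here is a minimal sketch of one common log-likelihood calculation for a single token. It is an illustration under stated assumptions, not necessarily the exact calculation Conc performs: the function name and the counts are invented, and the two-term form of the formula is assumed.

```python
import math

def log_likelihood(freq_target, freq_reference, size_target, size_reference):
    """Two-term log-likelihood keyness score for one token (assumed form)."""
    total_freq = freq_target + freq_reference
    total_size = size_target + size_reference
    # Expected frequencies if the token were equally common in both corpora
    expected_target = size_target * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size
    score = 0.0
    # A zero observed frequency contributes nothing to the sum
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_reference > 0:
        score += freq_reference * math.log(freq_reference / expected_reference)
    return 2 * score

# Invented counts, not taken from Quake Stories v2 or the BNC
same_rate = log_likelihood(10, 1_000, 10_000, 1_000_000)   # same rate in both corpora
over_used = log_likelihood(50, 1_000, 10_000, 1_000_000)   # five times the rate in the target

print(f'Same rate in both corpora: {same_rate:.2f}')
print(f'Over-used in the target:   {over_used:.2f}')
```

The score grows as the observed frequencies move away from what we would expect if the token were used at the same rate in both corpora. It also grows with the amount of data, which is why it measures the strength of evidence rather than the size of the difference.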

Notice, the table includes a number of columns:

- `Frequency` and `Frequency Reference` are the raw frequencies of the keyword in the target and reference corpora. These cannot be directly compared without taking into account the size of the target and reference corpora. 
- `Normalized Frequency` and `Normalized Frequency Reference` are normalized frequencies in each corpus. These can be directly compared.
- `Relative Risk` is an effect size measure that can be directly related to the normalized frequencies (it is the ratio of these two values). One way to think about this is that it shows how many times more frequently a keyword occurs in the target corpus than in the reference corpus. The more similar the normalized frequencies in the target and reference, the closer the relative risk will be to 1. Values less than 1 indicate under-use in the target corpus when compared to the reference corpus. Values greater than 1 indicate over-use in the target. The greater the value, the larger the difference in normalized frequencies.
- `Log Ratio` is an effect size measure derived from relative risk that is intuitive to interpret. For more information, read this short article: [Log Ratio - an informal introduction](https://cass.lancs.ac.uk/log-ratio-an-informal-introduction/). 
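
The relationship between the columns described above can be sketched with a few lines of arithmetic. This is a hedged illustration only: the counts are invented, the helper function is hypothetical, and normalizing per million tokens is an assumption (the table may use a different base, which does not change the ratio).

```python
import math

def normalized_frequency(frequency, corpus_size, per=1_000_000):
    """Frequency per `per` tokens, so corpora of different sizes can be compared."""
    return frequency / corpus_size * per

# Invented counts for a single token, not taken from Quake Stories v2 or the BNC
freq_target, size_target = 50, 10_000
freq_reference, size_reference = 1_000, 1_000_000

nf_target = normalized_frequency(freq_target, size_target)
nf_reference = normalized_frequency(freq_reference, size_reference)

# Relative risk is the ratio of the two normalized frequencies
relative_risk = nf_target / nf_reference
# Log ratio is the binary logarithm of relative risk
log_ratio = math.log2(relative_risk)

print(f'Normalized frequency (target):    {nf_target:.1f}')
print(f'Normalized frequency (reference): {nf_reference:.1f}')
print(f'Relative risk: {relative_risk:.2f}')
print(f'Log ratio:     {log_ratio:.2f}')
```

Because log ratio uses a base-2 logarithm, each additional point represents a doubling: a log ratio of 1 means the token is twice as frequent in the target, 2 means four times, 3 means eight times, and 0 means no difference.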

In [None]:
conc.keywords(page_current = 1, page_size = 20).display()

Create code cells and markdown cells as needed to complete each task below and answer the related questions.

### Task 1.1: A statistical basis to identify salient function words

When analysing a keyword table we are often interested in identifying related words. An obvious group of related keywords in the table above are ‘function words’. Here we have a statistical basis to make sense of very frequent function words. We can be more specific than this though. What type of words are these, mainly, and why do you think they are over-represented in the Quake Stories corpus when compared with the BNC reference corpus?

### Task 1.2: Investigating keywords using collocates and concordance

To make sense of keywords and how they are being used, we need to make use of collocation tables, concordances, and tables of n-gram clusters. You can copy code from other lab notebooks to do this now. Analyse some of the keywords in the table above:  

* ‘building’ and ‘buildings’ - some of the instances relate to specific building names - what is the easiest way to find these?
* ‘shaking’ 
* ‘liquefaction’ 
* compare ‘aftershock’ and ‘aftershocks’ 
* pick another word (perhaps one you don’t understand or one you are interested in)

Make notes on the patterns you observed.

## Section 2: Comparing sub-corpora or related specialised corpora

Another way the Keywords List can be used is to compare sub-corpora, or related specialised corpora. For instance, Paul Baker compared two sides of the parliamentary debate over fox hunting in Britain by creating a sub-corpus for the pro-hunting speakers and one for the anti-hunting speakers. This can be a good strategy for identifying differences between the way groups or sides of a debate are representing an issue.

For this part of the lab we will work with the Beehive (2014-2019) corpus. This consists of speeches by New Zealand government ministers scraped from https://www.beehive.govt.nz/search?f%5B0%5D=content_type_facet%3Aspeech. In the zip file you download you have two folders: one is a subcorpus for the last National-led government (during the period 2014-2017) and the other is a subcorpus for the Labour-NZ First Coalition (from 2017 to 2019). National is a centre-right party, which has historically had strong support from the business and farming sectors. Labour is a centre-left party, which has historically had strong support from urban workers, the civil service and trade unions.

### Caution: Making claims about political discourse

Avoid making claims like this: "Tokens A, B, and C are over-represented in the Labour corpus. This means Labour cares more about issue X and is more concerned with action on Y than National". 

This is a claim about beliefs and intentions, but we are studying language use. Believe it or not, the things politicians say may not be a good indication of what they believe, what they will do, and why they are doing it. Avoid making claims that are not supported by the data or other research.

The following cells load the two sub-corpora.

In [None]:
labour_corpus = Corpus().load(f'{save_path}labour-nz-first-coalition.corpus')

In [None]:
# # if you are running the code on your own machine, unzip the source files - adjust the corpus_source_path below to point to the directory with the source files
# # uncomment the remaining lines of this cell to create the corpus from the source files (or load if it already exists)
# corpus_source_path = f'{source_path}beehive-speeches-2014-2019/2017-2020 Labour-New Zealand First Coalition'
# try:
#     labour_corpus = Corpus().load(f'{save_path}labour-nz-first-coalition.corpus')
# except FileNotFoundError:
#     labour_corpus = Corpus(name = 'Labour-NZ First Coalition', description = 'Labour-NZ First Coalition (2017-2019) subcorpus, part of the Beehive (2014-2019) corpus compiled by Dr Geoff Ford. The corpus consists of speeches by New Zealand government ministers scraped from https://www.beehive.govt.nz/.').build_from_files(corpus_source_path, f'{save_path}', standardize_word_token_punctuation_characters = True)

In [None]:
labour_corpus.summary() # overview of the corpus

In [None]:
national_corpus = Corpus().load(f'{save_path}national-led.corpus')

In [None]:
# # if you are running the code on your own machine, unzip the source files - adjust the corpus_source_path below to point to the directory with the source files
# # uncomment the remaining lines of this cell to create the corpus from the source files (or load if it already exists)
# corpus_source_path = f'{source_path}beehive-speeches-2014-2019/2014-2017 National-led'
# try:
#     national_corpus = Corpus().load(f'{save_path}national-led.corpus')
# except FileNotFoundError:
#     national_corpus = Corpus(name = 'National-led', description = 'National-led government (during the period 2014-2017) subcorpus, part of the Beehive (2014-2019) corpus compiled by Dr Geoff Ford. The corpus consists of speeches by New Zealand government ministers scraped from https://www.beehive.govt.nz/.').build_from_files(corpus_source_path, f'{save_path}', standardize_word_token_punctuation_characters = True)

In [None]:
national_corpus.summary() # overview of the corpus

If you want to examine the contents of the source data for the Beehive corpus, uncomment the following cell and run it. This will copy the zip file into your current working directory. You can then download it to your local computer by right-clicking on the file in the file viewer and clicking 'Download'.

In [None]:
# source_file = '/srv/source-data/beehive-speeches-2014-2019.zip' # path to the file 
# destination = os.path.join(os.getcwd(), os.path.basename(source_file))
# shutil.copy(source_file, destination)

Below we set Labour as the target corpus and National as the reference corpus. 

In [None]:
conc = Conc(labour_corpus) #initialize Conc reporting with the target corpus
conc.set_reference_corpus(national_corpus) # set the reference corpus

Now, we're going to create a keywords table. 

In [None]:
conc.keywords(page_current = 1, page_size = 20).display()

Add code or markdown cells as needed to complete the tasks below and answer the related questions. You can copy code from other lab notebooks as needed to look more closely at specific keywords.

### Task 2.1: Grouping keywords

Examine the top 20 keywords and try to group related or connected keywords. Make sure you examine how the keywords are used (e.g. concordance them) rather than just assuming connections between them. What does this tell us about some of the key ideas of the Labour government when compared with the National government?

### Task 2.2: Negative keywords

By default, Conc keyword tables exclude negative keywords, which are words that are under-used in the target corpus when compared with the reference corpus. The `exclude_negative_keywords` parameter can be set to `False` to include negative keywords. 

You can identify negative keywords by finding words with a negative log ratio value. Negative keywords have a relative risk score less than 1. 

What do the negative keywords tell us about the change in focus of the Labour government compared to the National government that preceded it? 

In [None]:
conc.keywords(page_current = 1, page_size = 20, exclude_negative_keywords=False).display()

### Task 2.3: Effect size and filtering by statistical significance and frequency

So far, we've been sorting by statistical significance, but this does not express the size of the differences in frequency between the two corpora. We can use Conc's `order` parameter to order by `log_ratio`. This tends to rank words that are very rare in (or absent from) the reference corpus higher than words that are frequent in both.
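
The difference between the two orderings can be seen with a toy example. The sketch below uses two invented tokens and hypothetical helper functions (a two-term log-likelihood and a plain log ratio); it is not Conc's implementation, only an illustration of why the rankings can disagree.

```python
import math

def log_likelihood(a, b, size_a, size_b):
    """Two-term log-likelihood score (assumed form) for frequencies a and b."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score

def log_ratio(a, b, size_a, size_b):
    """Binary log of the ratio of relative frequencies (requires b > 0)."""
    return math.log2((a / size_a) / (b / size_b))

size_target, size_reference = 100_000, 100_000
# (token, frequency in target, frequency in reference) - invented values
counts = [('frequent_word', 800, 400), ('rare_word', 16, 1)]

rows = [
    (token, log_likelihood(a, b, size_target, size_reference),
     log_ratio(a, b, size_target, size_reference))
    for token, a, b in counts
]

by_log_likelihood = sorted(rows, key=lambda row: row[1], reverse=True)
by_log_ratio = sorted(rows, key=lambda row: row[2], reverse=True)

print('Ordered by log likelihood:', [row[0] for row in by_log_likelihood])
print('Ordered by log ratio:     ', [row[0] for row in by_log_ratio])
```

The frequent word has far more evidence behind it, so it ranks first by log likelihood, while the rare word has the larger proportional difference, so it ranks first by log ratio. A token that is absent from the reference corpus would need a small substitute frequency before a ratio could be computed; the sketch sidesteps that case.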

There are two key ways to filter results:

* Use a `statistical_significance_cut`: Choose a p value to remove keywords based on statistical significance (you can be aggressive with this). In addition, we can impose a stricter test that takes into account the number of tests being conducted using the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction).
* Use frequency filtering: Conc supports the `min_frequency` and `min_frequency_reference` parameters to filter keywords based on their frequencies in the target and reference corpora, as well as `min_document_frequency` and `min_document_frequency_reference` to filter keywords based on the number of documents they appear in.
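
The two kinds of filtering can be sketched in plain Python. This is an illustration under stated assumptions rather than a description of Conc's internals: the rows are invented, the number of tests is a made-up figure, and the p value is derived from a log-likelihood score using the chi-square distribution with one degree of freedom.

```python
import math

def p_value_from_log_likelihood(score):
    """p value for a score under a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2))

# Invented keyword rows: (token, log-likelihood score, frequency in target)
rows = [
    ('alpha', 60.0, 120),
    ('beta', 16.0, 40),
    ('gamma', 40.0, 3),
    ('delta', 4.0, 15),
]

cut = 0.0001
number_of_tests = 1000          # assumption: one test per token compared
bonferroni_cut = cut / number_of_tests
min_frequency = 5               # hypothetical threshold for the target corpus

# Plain significance cut
significant = [row for row in rows if p_value_from_log_likelihood(row[1]) < cut]
# Stricter cut that accounts for the number of tests
strict = [row for row in rows if p_value_from_log_likelihood(row[1]) < bonferroni_cut]
# Frequency filter applied on top of the stricter cut
strict_and_frequent = [row for row in strict if row[2] >= min_frequency]

print('Significant:           ', [row[0] for row in significant])
print('With Bonferroni:       ', [row[0] for row in strict])
print('With minimum frequency:', [row[0] for row in strict_and_frequent])
```

The Bonferroni correction divides the chosen p value by the number of tests, so a token must clear a much stricter threshold when many tokens are being compared at once.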

The `show_document_frequency` parameter has been added so that we can look at the dispersion of keywords across documents. 
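
To see why document frequency is worth checking alongside raw frequency, consider this small sketch. The documents are invented and already tokenised, and the variable names are hypothetical; the point is only that a token can be frequent overall while appearing in very few documents.

```python
from collections import Counter

# Four invented, already-tokenised documents
documents = [
    ['silt', 'silt', 'silt', 'everywhere'],
    ['the', 'shaking', 'stopped'],
    ['more', 'shaking', 'overnight'],
    ['the', 'power', 'was', 'out'],
]

# Raw frequency: every occurrence counts
frequency = Counter(token for doc in documents for token in doc)
# Document frequency: each document counts at most once per token
document_frequency = Counter(token for doc in documents for token in set(doc))

for token in ['silt', 'shaking']:
    print(f'{token}: frequency={frequency[token]}, documents={document_frequency[token]}')
```

Here 'silt' is more frequent than 'shaking' but is confined to a single document, so a keyword like it may reflect one text rather than the corpus as a whole.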

Experiment with these parameters now. 

Spend some time analysing the keywords in the table.

Questions:

- How does this keywords list, ordered by the effect size measure log ratio, compare with the first table? 
- What does this new ordering of keywords add to your analysis from Task 2.1?

In [None]:
conc.keywords(page_current=1, page_size = 20, order = 'log_ratio', statistical_significance_cut=0.0001, apply_bonferroni=True, min_frequency_reference = 10, show_document_frequency = True).display()

## Wrap up

Are you done?

If you want to explore these techniques further, here are some suggestions: 

1. Think about the kinds of comparisons that would be possible using the Quake Stories data. Look at the filename scheme discussed in the README and think about other ways the texts could be grouped and compared.
2. Spend some time exploring the “Introduce yourself” corpus using some of the techniques you have learned this week, including:
    - Examining frequency lists of n-grams. Notice that these include mentions of common entities and frequent phrases people are using to narrate their “introduction”.
    - Comparing the corpus with the BNC reference word list.
    - Examining the collocates of frequent content words (e.g. “data”).
