# DIGI405 Lab 2.1: Collocations

## Lab 2 Introduction

In the lab notebooks for module 2 we introduce collocation analysis, analysis of clusters and n-grams, and
keyword analysis.

### A note about the Quake Stories v2 corpus

This notebook works with the Quake Stories v2 (QSv2) corpus. This data comes from
http://www.quakestories.govt.nz/, and consists of crowd-sourced accounts of earthquake experiences
following the 2011 Canterbury earthquakes. This corpus contains 487 self-reported stories of
earthquake experiences from 2011 to 2019. It is licensed under Creative Commons BY-NC-SA. Please
be aware that some stories may relate to people who were killed or injured in the earthquakes. Please
treat the material with respect.

Remember, you can read about the filename format in the README file included in the corpus zip
file. This provides a way for you to view the original web page that each text was scraped from.

In [None]:
from conc.corpus import Corpus
from conc.conc import Conc
import os

In [None]:
source_path = f'{os.environ.get("HOME")}/data/' # path to the source data from which the corpus will be created
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/' # path to the directory where the corpus will be saved

In [None]:
try:
    corpus = Corpus().load(f'{save_path}quake-stories-v2.corpus')
except FileNotFoundError:
    corpus = Corpus(name = 'Quake Stories v2', description = 'This is a corpus based on stories from the http://www.quakestories.govt.nz/ website established by Manatū Taonga / Ministry for Culture and Heritage in 2011. QuakeStories was a place for the public to share stories of these and subsequent New Zealand earthquakes. The site was licensed under Creative Commons BY-NC-SA (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). This data-set, as a re-representation of the stories, is also released under BY-NC-SA. Please be aware that some stories may relate to people who were killed or injured in the Canterbury earthquakes. Treat the material with respect. ').build_from_files(f'{source_path}qs-v2', f'{save_path}/')

In [None]:
corpus.summary() # overview of the corpus

In [None]:
conc = Conc(corpus) # initialize Conc reporting with the corpus

## Collocations

Quantitative analysis of corpora demonstrates that collocation is a fundamental 
feature of language data: words tend to occur with other words in predictable ways. 
These patterns encode meaning. In fact, in some approaches to language modelling, words 
are represented as vectors based on collocation patterns.

Collocations are patterns that do not belong to individual authors, but are shared patterns 
of language use that authors are making use of to make meaning and do things with 
language. 

In Lab notebook 1.2 we introduced concordances as a way to view words or phrases in the
context in which they appear in a corpus. This was our first exploration of collocation 
patterns. The patterns within concordance lines can be identified by 
sorting the concordance in different ways. Words do not always pattern in a particular 
order. While we can identify these kinds of patterns by inspecting the lines in a systematic way,
this becomes much more time-consuming (and potentially impossible) as the number of
concordance lines increases.

Collocation analysis makes use of statistical measures of collocation. Collocates can be 
ranked by different statistical measures to identify robust patterns. 

This lab will introduce you to collocation analysis.

## Task 1: Understanding language usage by finding patterns in an example collocation table

The cell below creates a collocation table for the word "home". The effect size measure used is 
Mutual Information (MI). Inspect the notes at the foot of the table. By default collocates are 
in a window from 5 words to the left and 5 words to the right of "home". 
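The minimum collocate frequency is set to 5. A threshold like this is appropriate for MI, which privileges very 
rare and exclusive collocates: without it, tokens that appear near "home" only once or twice can rank at the top of the table.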

Inspect the different columns in the table. 
* Collocate Frequency is the number of times "home" and the collocate token appear together. 
* Frequency is the number of times the collocate appears in the corpus (both in the context of "home" and in other contexts). 
* The Mutual Information value is a measure expressing the strength of the collocation.
* Log Likelihood Ratio (LLR) is the default statistical significance measure. 
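
To see how these columns relate to the MI score, here is a minimal sketch of one common textbook formulation of Mutual Information: the base-2 log of the observed co-occurrence count divided by the count expected if the two tokens were independent. This is an assumption for illustration only. Conc's own calculation may differ (for example, in how it accounts for the window size), the function name is hypothetical, and the counts are made up rather than taken from the QSv2 corpus.

```python
import math

def mutual_information(collocate_frequency, frequency, node_frequency, corpus_size):
    """Simplified MI: log2(observed / expected), ignoring window size."""
    # Co-occurrences we would expect if node and collocate were independent.
    expected = node_frequency * frequency / corpus_size
    return math.log2(collocate_frequency / expected)

# Hypothetical counts (NOT from the QSv2 corpus).
corpus_size = 100_000   # total tokens in the corpus
node_frequency = 500    # occurrences of the node word

# A rare collocate that appears near the node half of the time it is used.
mi_rare = mutual_information(collocate_frequency=10, frequency=20,
                             node_frequency=node_frequency, corpus_size=corpus_size)

# A very frequent collocate that appears in many other contexts too.
mi_frequent = mutual_information(collocate_frequency=100, frequency=5_000,
                                 node_frequency=node_frequency, corpus_size=corpus_size)

print(f"rare collocate MI:     {mi_rare:.2f}")
print(f"frequent collocate MI: {mi_frequent:.2f}")
```

In this toy example the rare collocate scores higher even though it co-occurs with the node far less often, which is why a minimum collocate frequency matters when ranking by MI.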

As with the previous Conc tables you have encountered, you can change the number of rows displayed by 
setting a `page_size` parameter. You can page through the results using `page_current`. 

In [None]:
query = 'home'
conc.collocates(query, effect_size_measure = 'mutual_information', context_length = 5, min_collocate_frequency = 5).display()

Valid effect size measures for the Conc library are currently 'mutual_information' and 'logdice'. Create a new code cell and copy the code above. Set the effect size measure to 'logdice'. Compare the results with the first table using MI.
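
As background for the comparison, logDice is usually defined as 14 plus the base-2 log of the Dice coefficient of the two tokens. The sketch below assumes that standard definition and uses made-up counts; the function name is hypothetical, and you should check the Conc documentation for the exact calculation it uses.

```python
import math

def log_dice(collocate_frequency, frequency, node_frequency):
    """logDice under its usual definition: 14 + log2(Dice coefficient)."""
    dice = 2 * collocate_frequency / (node_frequency + frequency)
    return 14 + math.log2(dice)

# Hypothetical counts (NOT from the QSv2 corpus).
node_frequency = 500

# A rare token that only ever appears next to the node.
score_rare = log_dice(collocate_frequency=5, frequency=5,
                      node_frequency=node_frequency)

# A more frequent token that appears near the node a quarter of the time.
score_common = log_dice(collocate_frequency=200, frequency=800,
                        node_frequency=node_frequency)

# Two tokens that always and only occur together reach the maximum score.
score_max = log_dice(collocate_frequency=10, frequency=10, node_frequency=10)

print(f"rare:   {score_rare:.2f}")
print(f"common: {score_common:.2f}")
print(f"max:    {score_max:.2f}")
```

Notice that corpus size does not appear in this formula, and that the score depends on how much of each token's total usage is shared with the other. Keep this in mind when you compare the two tables.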

When we look at collocates, it is common to start telling ourselves stories about the meaning of the patterns we see and the intention behind these patterns. Concordances complement collocation tables, allowing us to inspect the specific instances in which two tokens are used together. The following line adds two new parameters to the concordance method introduced in previous lab notebooks. The `filter_context_str` and `filter_context_length` parameters can be set to view only those concordance lines matching our collocation results. In the example below, concordances for "home" are filtered to lines featuring "journey" within 5 word tokens of the node. By inspecting concordances, we don't have to assume how two words are being used together: we can analyse the usage and make appropriate claims based on evidence from the corpus.

As you work through the rest of this notebook, make a point of using concordances to inspect the patterns, to test your observations, and to ensure your claims accurately represent the text data. 

In [None]:
query = 'home'
conc.concordance(query, context_length = 8, order='1R2R3R', page_current = 1, page_size = 20, filter_context_str = 'journey', filter_context_length = 5).display()
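
If it helps to see the filtering idea spelled out, the following plain-Python sketch keeps only those occurrences of a node word that have a given token nearby. It is not how Conc implements `filter_context_str`; the function name, its parameters, and the toy sentence are all hypothetical.

```python
def filtered_concordance(tokens, node, filter_token, filter_length=5, context_length=8):
    """Return (left, node, right) lines where filter_token is near the node."""
    lines = []
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Tokens within filter_length positions on either side of the node.
        nearby = tokens[max(0, i - filter_length):i] + tokens[i + 1:i + 1 + filter_length]
        if filter_token not in nearby:
            continue
        # Build the display context for lines that pass the filter.
        left = " ".join(tokens[max(0, i - context_length):i])
        right = " ".join(tokens[i + 1:i + 1 + context_length])
        lines.append((left, node, right))
    return lines

# Toy text (NOT from the QSv2 corpus).
tokens = "we drove home and the journey home took hours but we got home safe".split()

lines = filtered_concordance(tokens, "home", "journey", filter_length=2, context_length=3)
for left, node, right in lines:
    print(f"{left:>20} | {node} | {right}")
```

Only one of the three occurrences of "home" in the toy text has "journey" within two tokens, so only that line is returned.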

As we learned in previous lab notebooks, patterns can relate to specific tokens or to groupings of these tokens. Something to pay attention to when viewing a collocation table is whether the collocates are related in some way. Here are some questions to ask yourself as you investigate the collocates: Are there groupings of collocates related to parts of speech or meaning? Are there groupings of collocates that perform a similar function in the texts (e.g. expressing the author's attitude)?

Create a markdown cell and note your responses to the following question. 

* What kinds of words are considered strong collocates for each collocation measure?

## Task 2: Changing the context

Create a new table in a new code cell below using the MI measure. Experiment with different values for context_length and min_collocate_frequency. 

Create a markdown cell and note your observations about how these settings change the results you get.

You can also set the context_length independently for left and right contexts. To do this, specify the context_length as a tuple `(left context length, right context length)`. For example, setting context_length to `(5, 0)` means that only collocates appearing in the 5 word tokens to the left of "home" are evaluated. Experiment with different values for the left and right context lengths. 
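
To make the effect of the left and right settings concrete, here is a small plain-Python sketch that counts the tokens falling inside an asymmetric window around a node word. It only illustrates the windowing idea on a toy sentence: the function name is hypothetical, it ignores document boundaries, and it is not Conc's implementation.

```python
from collections import Counter

def window_collocates(tokens, node, context_length=5):
    """Count tokens within (left, right) positions of each node occurrence."""
    # Accept a single number or a (left, right) tuple.
    if isinstance(context_length, int):
        left, right = context_length, context_length
    else:
        left, right = context_length
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        neighbours = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(neighbours)
    return counts

# Toy text (NOT from the QSv2 corpus).
tokens = "we drove home and the journey home took hours but we got home safe".split()

left_only = window_collocates(tokens, "home", (1, 0))   # token immediately to the left
right_only = window_collocates(tokens, "home", (0, 1))  # token immediately to the right
both = window_collocates(tokens, "home", (1, 1))        # both neighbours

print("left: ", dict(left_only))
print("right:", dict(right_only))
print("both: ", dict(both))
```

Separating the two sides like this shows which tokens tend to precede the node and which tend to follow it, which a symmetric window blends together.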

Still using the query word 'home', set the context length first so that tokens immediately to the left and right of 'home' are included with `(1, 1)`. Then change it to `(1, 0)` to see the token immediately to the left, and then `(0, 1)` to see the collocates immediately to the right. Use a minimum collocate frequency of 5.

Note: it might be helpful to create code cells for the three tables so you can compare the results. 

Create a markdown cell to note your observations:

- With the settings you just tried, what kinds of words are collocated with "home" in different positions?
- Are there particular patterns that tend to occur on one side or other of "home"?
- What do these patterns indicate about the use of "home" in the corpus?

## Task 3: Comparing collocation patterns for different words

In new code cells below, repeat your analysis with the word "house". Make sure you use concordances to complement and enhance your analysis of the collocation tables.

Create a new markdown cell. After you examine words collocated with ‘house’, note your observations:

- What are the differences between the use of "home" and "house" in the corpus?
- From your examination of the use of these specific words, what are some claims you can make about features of the texts contained in the Quake Stories corpus?  

## Task 4: Extending your analysis

The frequency of a collocate appearing in a span is its own kind of raw collocation measure (i.e. how many times each collocate appears in a
span or position). You can sort by collocate frequency by adding the `order` parameter and setting it to `collocate_frequency`. 

Try this now.

Note: Valid values for `order` in Conc are currently `collocate_frequency` and `frequency`. If the `order` parameter is omitted or set to `None`, the results are ordered by the effect size measure. If you set the value to `None`, it shouldn't have quotes around it. 

Focusing on "home" and "house" with the context_length set to `(1, 0)`, can you see any differences in the function words that appear to the left of "home" and "house"? Note your observations in a markdown cell.

## Task 5: "Time" and filtering collocates based on evidence for robust patterns

Create a code cell and create a table for the token "time". Spend some time examining the collocates.

In the results we have viewed so far, statistical significance is reported using the Log Likelihood Ratio, but we have not filtered the results to exclude collocates based on this measure. To do so, add the `statistical_significance_cut` parameter. The value should be the desired p-value (e.g. 0.05, 0.01, 0.001). This will filter out collocates that do not meet that statistical significance threshold, providing a way to focus on collocation patterns with robust statistical evidence. 
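
For intuition about how a log-likelihood score relates to a p-value cut, the sketch below computes the log-likelihood ratio (G2) from a simple 2x2 contingency table and converts it to a p-value by comparing it against a chi-squared distribution with one degree of freedom. This is a common textbook approach shown under stated assumptions, not a description of Conc's internals: the function names are hypothetical, the table ignores window size, and the counts are made up.

```python
import math

def log_likelihood_ratio(collocate_frequency, frequency, node_frequency, corpus_size):
    """G2 statistic for a 2x2 contingency table of node vs. collocate."""
    o11 = collocate_frequency                  # node and collocate together
    o12 = node_frequency - o11                 # node without the collocate
    o21 = frequency - o11                      # collocate without the node
    o22 = corpus_size - node_frequency - o21   # neither
    observed = [[o11, o12], [o21, o22]]
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / corpus_size
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2

def p_value(llr):
    """Upper-tail probability of chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(llr, 0.0) / 2))

# Hypothetical counts (NOT from the QSv2 corpus).
corpus_size, node_frequency = 100_000, 500
candidates = {
    "rare_collocate": {"collocate_frequency": 10, "frequency": 20},
    "frequent_collocate": {"collocate_frequency": 30, "frequency": 5_000},
}

cut = 0.0001  # the p-value threshold to apply
kept = []
for name, counts in candidates.items():
    llr = log_likelihood_ratio(counts["collocate_frequency"], counts["frequency"],
                               node_frequency, corpus_size)
    p = p_value(llr)
    print(f"{name}: LLR = {llr:.2f}, p = {p:.3g}")
    if p < cut:
        kept.append(name)

print("kept:", kept)
```

The frequent collocate co-occurs with the node only slightly more often than chance would predict, so it is dropped, while the rare collocate's pattern is far stronger than chance and survives even an aggressive threshold.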

Apply a threshold to filter collocates for the word "time". You can be aggressive with the threshold, e.g. 0.0001.

Inspect the words collocated with "time". 

Note: In this example, concordances are valuable for investigating the dispersion of collocation patterns in our corpus. In other words, we can see from the document ids in the concordances whether a collocation pattern relates to very few texts (or even a single text). In general, we are looking for collocation patterns that occur across multiple texts; however, it can be interesting to examine the word choices of individual authors. Remember: the claims we make about collocation patterns should differ depending on whether the patterns observed are the result of word choices by one or multiple authors.
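
The dispersion check described in this note can also be sketched in plain Python: count, per document, how often a collocate falls within the window of the node, then look at how many documents contribute. The function name, document ids, and texts below are hypothetical and are not drawn from the QSv2 corpus or the Conc API.

```python
from collections import defaultdict

def collocation_dispersion(documents, node, collocate, context_length=5):
    """Map document id -> count of collocate occurrences near the node."""
    hits = defaultdict(int)
    for doc_id, tokens in documents.items():
        for i, token in enumerate(tokens):
            if token != node:
                continue
            window = (tokens[max(0, i - context_length):i]
                      + tokens[i + 1:i + 1 + context_length])
            hits[doc_id] += window.count(collocate)
    # Keep only documents where the pair actually co-occurs.
    return {doc_id: n for doc_id, n in hits.items() if n > 0}

# Toy documents with made-up ids.
documents = {
    "story-001": "the journey home took hours and the journey home felt endless".split(),
    "story-002": "we stayed home that night".split(),
    "story-003": "my journey home was slow".split(),
}

dispersion = collocation_dispersion(documents, "home", "journey", context_length=2)
total = sum(dispersion.values())
print(f"{total} co-occurrences across {len(dispersion)} of {len(documents)} documents")
print(dispersion)
```

A pattern whose co-occurrences come mostly from one document supports a claim about that author's word choices, not about usage across the corpus.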

Create a markdown cell and note your observations:

- What do you notice about collocates that are associated with the word ‘time’? Why is this?

## Task 6: Wrapping up

Reflect on the process you are using to investigate collocation patterns.

* How can we improve the quality of our claims about specific collocation patterns?  
* If we were to write about the collocation patterns we have observed what statistical information or other evidence is important?  
* Given there are different software tools to do collocation analysis, how should we report our results in a way others can understand and reproduce?    

Make notes on your answers to these questions in a markdown cell, and then discuss your findings with someone else in the class.