# DIGI405 Lab 2.2: Clusters/Ngrams

## Lab 2 Introduction

In the lab notebooks for module 2 we introduce collocation analysis, analysis of clusters and n-grams, and
keyword analysis.

### A note about the Quake Stories v2 corpus

This notebook works with the Quake Stories v2 (QSv2) corpus. This data comes from
http://www.quakestories.govt.nz/, and consists of crowd-sourced accounts of earthquake experiences
following the 2011 Canterbury earthquakes. This corpus contains 487 self-reported stories of
earthquake experiences from 2011 to 2019. It is licensed under Creative Commons BY-NC-SA. Please
be aware that some stories may relate to people who were killed or injured in the earthquakes. Please
treat the material with respect.

Remember, you can read about the filename format in the README file included in the corpus zip
file. This provides a way for you to view the original web page that each text was scraped from.

In [None]:
from conc.corpus import Corpus
from conc.conc import Conc
import os

In [None]:
source_path = '/srv/source-data/' # path to the source data from which the corpus will be created
save_path = '/srv/corpora/' # path to the directory where corpora are stored

In [None]:
corpus = Corpus().load(f'{save_path}quake-stories-v2.corpus')

In [None]:
# # if you are running the code on your own machine, unzip the source files - adjust the corpus_source_path below to point to the directory with the source files
# # uncomment the remaining lines of this cell to create the corpus from the source files (or load if it already exists)
# corpus_source_path = f'{source_path}qs-v2'
# try:
#     corpus = Corpus().load(f'{save_path}quake-stories-v2.corpus')
# except FileNotFoundError:
#     corpus = Corpus(name = 'Quake Stories v2', description = 'This is a corpus based on stories from the http://www.quakestories.govt.nz/ website established by Manatū Taonga / Ministry for Culture and Heritage in 2011. QuakeStories was a place for the public to share stories of these and subsequent New Zealand earthquakes. The site was licensed under Creative Commons BY-NC-SA (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). This data-set, as a re-representation of the stories, is also released under BY-NC-SA. Please be aware that some stories may relate to people who were killed or injured in the Canterbury earthquakes. Treat the material with respect. ').build_from_files(corpus_source_path, f'{save_path}/')

In [None]:
corpus.summary() # overview of the corpus

In [None]:
conc = Conc(corpus) # initialize Conc reporting with the corpus

## Clusters / N-grams

The lab notebooks on collocation analysis (2.1) and concordancing (1.2) introduce ways to 
investigate collocation patterns in corpora. The course material this week introduced multi-word 
expressions (MWEs) as one example of a collocation pattern. To study collocation patterns related to 
consecutive tokens in a corpus, we can examine the frequency of n-grams. N-grams are contiguous sequences of
linguistic units that occur in a corpus. The units are often tokens, but they could also be characters 
or other sub-token units. In this notebook, we will focus on token n-grams. 

The "N" in n-gram refers to the number of tokens in the sequence.  A single token is a unigram or 1-gram, 
two consecutive tokens are a bigram or 2-gram, three 
consecutive tokens are a trigram or 3-gram, and beyond this it is common to refer to n-grams using 
the number of tokens (e.g. 4-gram).  
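
The sliding-window idea behind these definitions can be sketched in a few lines of plain Python. This is only an illustration, not how Conc builds n-grams: `token_ngrams` is a hypothetical helper, and splitting on whitespace is a deliberately naive stand-in for real tokenisation. It uses the example phrase discussed in this section.

```python
def token_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A sequence of k tokens contains k - n + 1 n-grams (none if n > k).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenisation (an assumption for this sketch).
tokens = "I do not like to eat marmite".split()

bigrams = token_ngrams(tokens, 2)
trigrams = token_ngrams(tokens, 3)

print([" ".join(gram) for gram in bigrams])
print([" ".join(gram) for gram in trigrams])
```

The seven tokens in the phrase yield six bigrams and five trigrams, because each n-gram starts one token later than the previous one.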

For the phrase:  
> "I do not like to eat marmite"  

bigrams would be:  
> "I do", "do not", "not like", "like to", "to eat", "eat marmite"  

and trigrams would be:
> "I do not", "do not like", "not like to", "like to eat", "to eat marmite"  

N-grams have relevance for many machine learning tasks. N-grams can be used to represent text 
in a way that preserves some of the meaning encoded in the order of tokens. For example, if we 
represent a document using bigrams, we can quantify instances of "not like" and instances of "like" without 
negation to get an indication of sentiment expressed in the text. Very basic statistical language 
models, like those used for predictive text, can be built from n-gram frequencies. An n-gram model predicts 
the next token by finding the most frequent n-gram that begins with the n-1 tokens already given 
(e.g. given "have a nice", a 4-gram model would likely predict "day" rather than "funeral", because "have a nice day" is the more frequent 4-gram).
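
The prediction step described above can be sketched as a lookup table of counts. This is a minimal sketch under stated assumptions: the training sentences are invented, tokenisation is a whitespace split, and `train_ngram_model` and `predict_next` are hypothetical names rather than functions from any library.

```python
from collections import Counter, defaultdict

def train_ngram_model(sentences, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_token = tokens[i + n - 1]
            model[context][next_token] += 1
    return model

def predict_next(model, context_tokens):
    """Return the most frequent continuation of the context, or None if unseen."""
    followers = model.get(tuple(context_tokens))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Invented training data for illustration only.
toy_sentences = [
    "have a nice day",
    "have a nice day everyone",
    "have a nice weekend",
    "we had a nice day",
]
model = train_ngram_model(toy_sentences, n=4)
print(predict_next(model, ["have", "a", "nice"]))  # day
```

A context that never appears in the training data returns `None`, which hints at why models built purely from n-gram counts struggle with unseen word sequences.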

In corpus linguistics software and literature, it is common to refer to n-grams containing 
a specific token or token sequence as "clusters". This should not be confused with machine learning 
clustering techniques. 
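
To make the distinction concrete, a cluster table can be thought of as an n-gram frequency list filtered to the n-grams that contain a query token at a given position. The sketch below is plain Python on an invented sentence, not Conc's implementation; `cluster_frequencies` and its `position` argument are hypothetical names chosen for this illustration.

```python
from collections import Counter

def cluster_frequencies(tokens, query, ngram_length, position="LEFT"):
    """Count n-grams in which `query` is the first (LEFT) or last (RIGHT) token."""
    if position not in ("LEFT", "RIGHT"):
        raise ValueError("position must be 'LEFT' or 'RIGHT'")
    index = 0 if position == "LEFT" else ngram_length - 1
    counts = Counter()
    for i in range(len(tokens) - ngram_length + 1):
        gram = tuple(tokens[i:i + ngram_length])
        if gram[index] == query:
            counts[gram] += 1
    return counts

# Invented text for illustration only.
tokens = "i had to leave and then i had to wait so i ran and i had a look".split()

left = cluster_frequencies(tokens, "i", 3, "LEFT")    # trigrams starting with "i"
right = cluster_frequencies(tokens, "i", 3, "RIGHT")  # trigrams ending with "i"

for gram, count in left.most_common():
    print(count, " ".join(gram))
```

Every row in such a table corresponds to a set of concordance lines for the query, which is why the two views can be used together.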

This lab notebook introduces n-gram analysis and continues the analysis of the Quake Stories v2 corpus begun in lab notebook 2.1.

## Task 1: Cluster frequencies and concordances are complementary

As shown below, there are over 12,000 concordance lines for "I". Although we can learn things from the concordance, working systematically through more than 600 pages of results (at 20 lines per page) is not practical. 

In [None]:
query = 'i'
conc.concordance(query, context_length = 8, order='1L1R2R', page_current = 1, page_size = 20).display()

N-gram clusters provide a way to quickly identify the most frequent token sequences that include "I".

The cell below shows how to get a table of cluster frequencies for trigrams starting with "I". The Conc 
library uses the `ngrams` method to return n-gram clusters. 

In [None]:
query = "i"
conc.ngrams(query, ngram_length = 3, ngram_token_position = 'LEFT').display()

Change the `ngram_token_position` to `RIGHT` to see trigrams ending with "I". You can change the length of the n-grams 
using the `ngram_length` parameter to explore longer or shorter sequences.

You will notice that this provides a quick way to identify, quantify and summarise token sequences that are 
present in a concordance. Copy one of the n-grams from the table above into the `query` variable in the 
concordance cell to examine examples in the corpus (e.g. "i had to"). 

The table of n-gram clusters and the concordance are complementary, especially when dealing with lots of data. 
The frequency information provides a way to identify the most common sequences. The concordance provides 
a way to examine the rich language data and explore what authors are doing through their language use.  

Note: As with the previous Conc tables you have encountered, you can change the number of rows displayed by 
setting a `page_size` parameter. You can page through the results using `page_current`. 
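
The arithmetic behind paging is simple enough to sketch. The helpers below are hypothetical and are not part of Conc; they assume pages are numbered from 1, as the `page_current = 1` setting used for the first page of the concordance suggests.

```python
import math

def page_count(total_rows, page_size):
    """Number of pages needed to show every row."""
    return math.ceil(total_rows / page_size)

def page_rows(rows, page_size, page_current):
    """Return the rows shown on a 1-indexed page."""
    start = (page_current - 1) * page_size
    return rows[start:start + page_size]

# 12,000 concordance lines at 20 lines per page.
print(page_count(12000, 20))  # 600

# Rows 21 to 40 of a 100-row table appear on page 2 when page_size is 20.
rows = list(range(1, 101))
print(page_rows(rows, page_size=20, page_current=2))
```

Raising `page_size` reduces the number of pages, while `page_current` selects which slice of the table is displayed.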

Create a markdown cell and note your responses to the following question. 

* The n-gram clusters for "I" represent instances where the author is the "main character" in their earthquake story. Note down some features of the authors' use of "I" in these accounts you have identified through your analysis of n-gram clusters and the complementary concordances.

## Task 2: Back to "Time" and connecting n-gram clusters and collocation analysis

In the previous notebook, you explored the use of "time" using collocation analysis. Compare the collocation table you created for "time" with the n-gram clusters for "time". 

Notice that this comparison provides a way to identify which collocates appear in relatively fixed phrases and which do not. 
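
One way to think about "relatively fixed" is to record where each collocate falls relative to the node word. The sketch below is plain Python on an invented token sequence, not Conc's collocation method; `collocate_positions`, `share_at_most_common_offset`, and the window of 2 are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def collocate_positions(tokens, node, window=2):
    """For each collocate of `node`, count the offsets at which it occurs.

    Negative offsets are to the left of the node, positive offsets to the right.
    """
    positions = defaultdict(Counter)
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                positions[tokens[j]][j - i] += 1
    return positions

def share_at_most_common_offset(offset_counts):
    """Share of a collocate's occurrences that fall at its most common offset."""
    return max(offset_counts.values()) / sum(offset_counts.values())

# Invented token sequence for illustration only.
tokens = ("by the time we left by the time we ate "
          "we had time then time we had").split()

positions = collocate_positions(tokens, "time", window=2)
for collocate in ("the", "we", "had"):
    share = share_at_most_common_offset(positions[collocate])
    print(collocate, dict(positions[collocate]), round(share, 2))
```

In this toy data "the" always sits immediately to the left of "time", so it would surface in an n-gram cluster such as "the time", whereas "had" is split evenly between two positions and would be spread across several clusters.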

Vary the `ngram_length` and `ngram_token_position` settings, then create a markdown cell and note your observations about how these settings change the results you get.

## Task 3: Examining n-gram frequencies

A table of frequent n-grams is a useful exploratory tool when you begin analysing a corpus. To get a frequency list of n-grams, use Conc's `ngram_frequencies` method. The code cell at the end of this task displays the most frequent 4-grams in the Quake Stories corpus.
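
Conceptually, an n-gram frequency list counts every n-gram in every text and sorts the result by frequency. The following is a minimal sketch of that idea in plain Python, not Conc's implementation: `ngram_frequency_list` is a hypothetical helper, the documents are invented, and tokenisation is a lowercase whitespace split.

```python
from collections import Counter

def ngram_frequency_list(documents, n):
    """Count n-grams across documents without letting them span document boundaries."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts

# Invented documents for illustration only.
toy_documents = [
    "the end of the day was quiet",
    "at the end of the street",
    "it was the end of the road",
]

four_grams = ngram_frequency_list(toy_documents, 4)
for gram, count in four_grams.most_common(3):
    print(count, gram)
```

Counting within each document separately is a design choice in this sketch: it prevents an n-gram from being formed out of the last tokens of one text and the first tokens of the next.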

Experiment with changing the `ngram_length` parameter.

With the n-gram length set to 4, examine the first few n-grams using concordances. 

Make some notes in a markdown cell about the following questions:

- Is there a 4-gram that sticks out from the first five results? 
- What steps are required to validate what is causing this pattern? 

In [None]:
conc.ngram_frequencies(ngram_length=4, page_current=1).display()

## Task 4: An exploratory task to wrap up

Create a new code cell and copy in the code that creates the n-gram frequency table. Try different values of `ngram_length`, but this time look further down the frequency list by setting higher `page_current` numbers (e.g. for 4-grams, try two or three pages between 15 and 50). The results further down the list of n-grams are often just as interesting as the very frequent clusters. Take some time to explore less frequent n-grams using a concordance and note down your observations in a markdown cell:

- Identify two interesting n-grams and what you found out by concordancing these patterns.

It is very likely you have explored patterns and made observations that are different to others in the class. Discuss your findings with someone else in the class.