# DIGI405 Lab 1.2: Concordancing

## Lab 1 Introduction

Each week in DIGI405 labs you will work through a worksheet or Jupyter Notebook with a
series of tasks. Be sure to take notes as you work through the lab. Talk to your tutors as you
work through these tasks – they are there to help and prompt you, as well as discuss your
observations. I would also encourage you to discuss your work with your classmates (whether
in-person or on Zoom) and learn together. Pace yourself so you can complete as much of the
material as possible during lab time.

Today’s class activities focus on obtaining frequency and dispersion information from
texts, and on analysing tokens or phrases in context using concordance analysis. We will take
some time to explore the ‘Introduce Yourself’ corpus and learn about your classmates.

To get started, download the ‘Introduce Yourself’ corpus from Learn and upload it.

In [None]:
from conc.corpus import Corpus
from conc.conc import Conc
import os

In [None]:
source_path = '/srv/source-data/' # path to the source data from which the corpus will be created
save_path = '/srv/corpora/' # path to the directory where corpora are stored

In [None]:
corpus = Corpus().load(f'{save_path}introduce-yourself.corpus')

In [None]:
# # if you are running the code on your own machine, adjust the corpus_source_path below to point to the zip with the source files
# # uncomment the remaining lines of this cell to create the corpus from the source files (or load if it already exists)
# corpus_source_path = f'{source_path}introduce_yourself_corpus.zip'
# try:
#     corpus = Corpus().load(f'{save_path}introduce-yourself.corpus')
# except FileNotFoundError:
#     corpus = Corpus(name = 'Introduce Yourself', description = 'DIGI405 class introduction corpus').build_from_files(corpus_source_path, f'{save_path}/')

In [None]:
corpus.summary() # overview of the corpus

In [None]:
conc = Conc(corpus) # initialize reporting on the corpus

## Concordancing

A concordance is a way to view words or phrases in a collection of documents, together with the
context in which they appear. Concordances are sometimes referred to as KeyWord-In-Context or KWIC.
The node (the word or phrase being analysed) is displayed in the central column, with the text that
precedes it on the left and the text that follows it on the right. The rows of results are called
concordance lines (or “concordance hits”).
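
To make the layout concrete, here is a minimal pure-Python sketch of a KWIC display. It is not how the Conc library works internally; the `tokenize` and `kwic` helpers and the example sentences are made up for illustration, and the tokenisation rule is deliberately simple.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def kwic(docs, node, context_length=3):
    """Return one (doc_id, left, node, right) tuple per concordance line."""
    lines = []
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - context_length):i]
                right = tokens[i + 1:i + 1 + context_length]
                lines.append((doc_id, left, token, right))
    return lines

# Made-up example documents, not taken from the class corpus
docs = [
    "Hi, I am from Christchurch and I study data science.",
    "I moved here from Wellington. I graduated from Otago.",
]

# Print the node in a central column with left and right context
for doc_id, left, node, right in kwic(docs, "from"):
    print(f"{doc_id:>3} {' '.join(left):>25} | {node} | {' '.join(right)}")
```

Each printed row is one concordance line: the document ID, the left context, the node, and the right context.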

Concordances are useful to identify linguistic patterns related to words or phrases. These patterns can be identified by 
sorting the concordance in different ways.  

Concordances complement other kinds of text analysis. They help make sense of patterns of language use identified through computer-assisted analysis of texts. 
Investigating words in context helps to avoid jumping to conclusions about word patterns and their meanings. By studying many examples of words with their textual context, a researcher can access good evidence to make claims about language use.  

This lab will introduce you to concordance analysis.

## Task 1: Understanding language usage by finding patterns in an example concordance

Concordances are a helpful tool to make sense of quantitative information from the frequency tables you looked at in the first lab notebook.  

Below is an example concordance based on a search for the word “from” in the Introduce Yourself corpus. You can refer back to the frequency notebook to examine how frequent "from" is relative to other words in the corpus. 

Notice that the concordance lines show the node in the central column, with the textual context to the left and right. In this instance the concordance is ordered by words to the right of the node. 

In [None]:
query = 'from'
conc.concordance(query, context_length = 8, order='1R2R3R', page_current = 1, page_size = 20).display()

Notice that the bottom of the concordance shows that there are three pages of results. You can see more concordance lines 
by changing `page_current` to 2 or 3, or by changing `page_size` to a number larger than the total number of concordance lines. Change `page_size` now to see all the results.
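
The relationship between `page_size`, `page_current` and the number of pages is simple arithmetic. The sketch below assumes pages are numbered from 1, as in the cell above. The helper names and the figure of 50 lines are hypothetical and are not taken from the corpus or from Conc.

```python
import math

def page_count(total_lines, page_size):
    """Number of pages needed to show every concordance line."""
    return math.ceil(total_lines / page_size)

def page_slice(lines, page_current, page_size):
    """Return the lines shown on a given page (pages are numbered from 1)."""
    start = (page_current - 1) * page_size
    return lines[start:start + page_size]

# Hypothetical total, chosen only so that the result is three pages
all_lines = list(range(50))
print(page_count(len(all_lines), 20))      # pages at 20 lines per page
print(len(page_slice(all_lines, 3, 20)))   # lines left over on the last page
print(page_count(len(all_lines), 100))     # a large page_size gives a single page
```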

Take a good look at the concordance lines and patterns that you can see.  

It may be that you can identify patterns related to specific tokens or sequences of tokens that occur multiple times.   

You may also find tokens that are related in some way (e.g. by parts of speech, by meaning, etc). Identifying related tokens that occur in a common position to the left or right of the node is a common task when analysing concordances. 

Create a markdown cell and note your responses to the following questions. 

* What tokens appear after the node "from"? 
* Is there a way you can group tokens together? How would you describe those tokens?
* You've probably noticed one pattern, but don't forget the other concordance lines. Of the remaining lines, can you see any other patterns?
* Given your observations, how would you communicate how "from" is commonly used in the Introduce Yourself corpus?

## Task 2: Introducing dispersion in concordance analysis

In the frequency notebook we identified tokens that were used in many documents and tokens that were used in only a few. 

Notice at the bottom of the concordance report, you can see the frequency of the term you searched for as "Total Concordance Lines". You can also see the number of documents that mention that term ("Total Documents"), meaning you can identify if patterns are characteristic of specific texts or a more general pattern of language use across the corpus.  
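
The difference between the two figures can be illustrated with a small sketch. This is an assumption-laden illustration on made-up sentences, not Conc's implementation: `hits_per_document` is a hypothetical helper, and the tokenisation rule is deliberately simple.

```python
import re
from collections import Counter

def hits_per_document(docs, node):
    """Count occurrences of node in each document, skipping documents with none."""
    counts = Counter()
    for doc_id, text in enumerate(docs):
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        n = tokens.count(node)
        if n:
            counts[doc_id] = n
    return counts

# Made-up example documents, not taken from the class corpus
docs = [
    "I am from Christchurch.",
    "I moved from Wellington after graduating from Otago.",
    "I enjoy hiking and music.",
]

counts = hits_per_document(docs, "from")
total_lines = sum(counts.values())   # analogous to "Total Concordance Lines"
total_documents = len(counts)        # analogous to "Total Documents"
print(total_lines, total_documents)
```

When the total number of lines is much larger than the number of documents, a few texts may account for most of the mentions.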

Concordance plots provide more information on the dispersion of tokens, allowing us to analyse how often a token is mentioned in different texts and *where* it appears in each text. Take a look at the plot for "from" below. The number of concordance lines for "from" in a text is shown to the right of each plot; it equals the number of black vertical lines. Each black vertical line marks a position of the word within the text. Closer to the left means closer to the start of the document. Closer to the right means closer to the end. 

You can hover over the black line to get a short text excerpt related to each mention.  

Concordance plots are helpful to identify documents that mention a token many or few times, helping to understand if you are identifying a general pattern or something related to specific texts.  

Concordance plots also allow you to identify patterns across texts related to the position of a token.
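
One possible way to think about what the plot encodes is to express each mention as a relative position between 0 (start of the text) and 1 (end of the text). The sketch below does this on made-up sentences. `relative_positions` and `thirds` are hypothetical helpers and do not reflect how Conc computes its plots.

```python
import re

def relative_positions(text, node):
    """Return the position of each mention of node, scaled to the range 0..1."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if len(tokens) < 2:
        return [0.0 for t in tokens if t == node]
    last_index = len(tokens) - 1
    return [i / last_index for i, t in enumerate(tokens) if t == node]

def thirds(positions):
    """Count how many positions fall in the first, middle and last third of a text."""
    bins = {"start": 0, "middle": 0, "end": 0}
    for p in positions:
        if p < 1 / 3:
            bins["start"] += 1
        elif p < 2 / 3:
            bins["middle"] += 1
        else:
            bins["end"] += 1
    return bins

# Made-up example documents, not taken from the class corpus
docs = [
    "I am from Christchurch and I like hiking and music",
    "I study data science and I come from Wellington",
]

all_positions = []
for text in docs:
    all_positions.extend(relative_positions(text, "from"))

print(all_positions)
print(thirds(all_positions))
```

Splitting texts into thirds is only one choice; a finer split, or the mean relative position, would be equally reasonable.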

Create a markdown cell to note your observations. 

* Where in a document is "from" more likely to occur? The start, middle, or end? 
* Can you quantify this in some way?  
* Combining your observations from Tasks 1 and 2, how would you communicate how "from" is commonly used in the Introduce Yourself corpus?

In [None]:
query = 'from'
conc.concordance_plot(query).display()

## Task 3: Find connections

Concordances provide a way to see connections between texts in a corpus. 

Start with your own introduction and what you wrote. I hope you submitted one! Can you find it using a concordance? 

To get you started, here is a code cell that will display a concordance for 'data science'. Notice this query is multiple tokens. Mentions of "data science" are very common in the corpus, partly because of the prompt we used to gather the texts. Change the `query` variable to a word or words you used in your introduction to identify your own introduction.

In [None]:
query = 'data science'
conc.concordance(query, context_length = 8, order='1R2R3R', page_current = 1, page_size = 20).display()

To view a complete text, set `document_id` below to one of the document IDs in the concordance lines above.  

In [None]:
document_id = 15
corpus.text(document_id).display(textwrap_width=80) # if the text doesn't have line breaks, you can format it with textwrap_width

One helpful thing about working with Jupyter notebooks is that you can keep a record of what you analysed (e.g. a code cell creating a concordance) and what you observed (using a markdown cell).   

Create a new code cell. Copy the code above into the new code cell. Add a markdown cell below it and note your observations.  

Think about some different linguistic features of your introduction. These could be specific content words related to your skills or interests (e.g. ‘Python’ or ‘punk’) or they might be phrases related to your introduction (e.g. ‘I grew up in’). Search for some of these using the new cell you created and see if you can find some connections to others in the class who used the same words or phrases. 
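
A phrase query such as ‘I grew up in’ can be thought of as a sequence of tokens that must appear one after another. The sketch below shows that idea on a made-up sentence. `find_phrase` is a hypothetical helper, and Conc's own matching may differ in detail (for example, in how it tokenises text).

```python
import re

def find_phrase(tokens, phrase_tokens):
    """Return the start index of every place where phrase_tokens occurs in tokens."""
    n = len(phrase_tokens)
    return [
        i for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    ]

# Made-up example sentence, not taken from the class corpus
text = "I grew up in Timaru, but I grew tired of the cold."
tokens = re.findall(r"[a-z0-9']+", text.lower())

print(find_phrase(tokens, "i grew up in".split()))  # the full phrase
print(find_phrase(tokens, "i grew".split()))        # a shorter phrase matches more often
```

Longer phrases usually match fewer places than shorter ones, which is useful when you want to narrow a search down to your own introduction.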

## Task 4: Where is the data?

Concordances are sometimes misunderstood as a qualitative research tool. Yes, they allow you to examine the usage of words and examine the contents of your corpus, but they also provide quantitative evidence for language patterns. While they make it easy to spot one-off interesting or unusual examples of language use, when analysing a concordance we are primarily interested in finding linguistic patterns that occur multiple times and appear in multiple texts in the corpus.

Use the concordance to search for words you might expect to be used in an introduction. Which of these is most frequent? Be specific and record quantitative information. For the frequent words, use concordance plots to identify if they are likely to appear in specific positions within the texts.  Record this in a markdown cell below.

* Hello  
* Hi  
* Kia ora  
* Greetings, earthlings  

Go back to your own introduction. Using concordances, identify a word or phrase that is frequent in the corpus and a word or phrase that is infrequent (i.e. mostly just appears in your document). For the frequent word or phrase, based on your analysis of concordances and concordance plots, what quantitative claims can you make? Be specific.

## Task 5: Sorting your concordance

Sorting is fundamental to concordance analysis. Tokens may pattern with other tokens to the left or right of the node, or around the node. 

Below is a concordance for the token `my`. The concordance is currently ordered by `1R2R3R`, which means ordering by the first token to the right of the node, then the second token to the right, and then the third token to the right.  

Patterns can occur to the right of, to the left of, or around the node. You can change the `order` argument to order the concordance in different ways. Change `order` to `1L2L3L` to order by the tokens to the left of the node, starting with the closest. You can also try `3L2L1L`, or `2L1L1R` or `1L1R2R` to order around the node.  
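
To make the ordering codes concrete, here is a minimal sketch that reads a code as a list of positions (a number followed by `L` or `R`) and sorts concordance lines by the tokens at those positions. This is an illustration on made-up lines, not Conc's implementation; `parse_order` and `sort_lines` are hypothetical helpers.

```python
import re

def parse_order(order):
    """Split an order string such as '2L1L1R' into [(2, 'L'), (1, 'L'), (1, 'R')]."""
    return [(int(n), side) for n, side in re.findall(r"(\d)([LR])", order)]

def sort_lines(lines, order="1R2R3R"):
    """Sort (left_tokens, node, right_tokens) tuples by the positions named in order."""
    positions = parse_order(order)

    def key(line):
        left, _node, right = line
        parts = []
        for n, side in positions:
            if side == "L":
                # 1L is the token immediately before the node
                parts.append(left[-n] if n <= len(left) else "")
            else:
                # 1R is the token immediately after the node
                parts.append(right[n - 1] if n <= len(right) else "")
        return parts

    return sorted(lines, key=key)

# Made-up concordance lines for the node "my"
lines = [
    (["this", "is"], "my", ["second", "year"]),
    (["data", "and"], "my", ["background", "is"]),
    (["about", "me"], "my", ["first", "degree"]),
]

for order in ("1R2R3R", "1L2L3L"):
    print(order)
    for left, node, right in sort_lines(lines, order):
        print(f"  {' '.join(left):>12} | {node} | {' '.join(right)}")
```

Sorting brings lines that share the same neighbouring tokens next to each other, which is what makes repeated patterns easy to spot.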

Valid values of `order` and documentation of all available parameters for the concordance method are available at the [Conc documentation website](https://geoffford.nz/conc/api/conc.html#conc.concordance).

Identify an interesting token or grouping of tokens that appears multiple times in the results of one of your reorderings of the concordance.  

Create a markdown cell and note the ordering you used and your observations about the pattern you found.  

In [None]:
query = 'my'
conc.concordance(query, context_length = 8, order='1R2R3R', page_current = 1, page_size = 20).display()

## Task 6: Wrapping up

Reflect on the linguistic features of an introduction.

* Are there some recurring ways that people introduce themselves?  
* How does that relate to the prompts we provided in the task?  
* What are some linguistic patterns people are using to introduce themselves?    
* What kinds of things are we prepared to include in an introduction (and not include)?  
* In relation to the content and form of your introduction: in what ways are you similar to others in the class? And, in what ways are you different? 

Make notes on your answers to these questions in a markdown cell, and then discuss your findings with someone else in the class.