# DIGI405 Lab 1.1: Frequency

## Lab 1 Introduction

Each week in DIGI405 labs you will work through a worksheet or Jupyter Notebook with a
series of tasks. Be sure to take notes as you work through the lab. Talk to your tutors as you
work through these tasks – they are there to help and prompt you, as well as discuss your
observations. I would also encourage you to discuss your work with your classmates (whether
in-person or on Zoom) and learn together. Pace yourself so you can complete as much of the
material as possible during lab time.

Today’s class activities focus on obtaining frequency and dispersion information from
texts, and on analysing tokens or phrases in context using concordance analysis. We will take
some time to explore the “Introduce yourself” corpus and learn about your classmates.

To get started, download the ‘Introduce Yourself’ corpus from Learn and upload it.

In [None]:
from conc.corpus import Corpus
from conc.conc import Conc
from conc.core import get_stop_words
import os

In [None]:
source_path = f'{os.environ.get("HOME")}/data/' # path to the source data from which the corpus will be created
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/' # path to the directory where the corpus will be saved

In [None]:
# if the corpus does not exist, build it from the source files
try:
    corpus = Corpus().load(f'{save_path}introduce-yourself.corpus')
except FileNotFoundError:
    corpus = Corpus(name = 'Introduce Yourself', description = 'DIGI405 class introduction corpus').build_from_files(f'{source_path}introduce_yourself_corpus_25S1.zip', f'{save_path}/')

# if the reference corpus does not exist, build it from the source CSV file
try:
    reference_corpus = Corpus().load(f'{save_path}brown.corpus')
except FileNotFoundError:
    reference_corpus = Corpus(name = 'Brown', description = '').build_from_csv(f'{source_path}brown.csv.gz', f'{save_path}/')

In [None]:
corpus.summary() # overview of the corpus

In [None]:
conc = Conc(corpus) #initialize Conc to report on your corpus

In [None]:
conc_reference = Conc(reference_corpus) #initialize Conc to report on the reference corpus

In [None]:
stop_words = get_stop_words(save_path = save_path) # load spaCy stop words 

## Frequency 

Frequency data about tokens in a corpus is important information for corpus analysis.  Frequency tables can be used to identify words that are frequent and infrequent, to compare the frequency of related words in a corpus, and compare frequency between corpora. 

This lab will introduce you to frequency analysis.

## Task 1: What tokens are frequent?

Here is an example frequency table based on the 'Introduce yourself' corpus. Look first at the information at the bottom of the table. Notice that there is information on the kinds of tokens being displayed (i.e. words and punctuation tokens), as well as quantitative information on the total number of tokens and the number of unique tokens. This is page 1 of many pages of results.  

In [None]:
conc.frequencies(exclude_punctuation = False, page_current = 1, normalize_by=1000).display()

Now focus on the information in the table. You will notice there is a rank number, the token frequency and a normalized frequency. The normalized frequencies are normalized by 1000 tokens. You can adjust the `normalize_by` parameter as needed (e.g. to 10000 or 1000000).  
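To make the normalization concrete, here is a minimal sketch in plain Python of how a normalized frequency can be calculated. The function name `normalized_frequency` and the counts are hypothetical and only for illustration. They are not taken from the lab corpus, and this is not necessarily how the reporting library computes its values internally.

```python
def normalized_frequency(count, total_tokens, normalize_by=1000):
    """Return how often a token occurs per `normalize_by` tokens."""
    return count * normalize_by / total_tokens

# Hypothetical counts for illustration only (not from the lab corpus)
count = 50            # raw frequency of a token
total_tokens = 20000  # total number of tokens in the corpus

per_thousand = normalized_frequency(count, total_tokens, normalize_by=1000)
per_million = normalized_frequency(count, total_tokens, normalize_by=1000000)

print(per_thousand)  # occurrences per 1000 tokens
print(per_million)   # occurrences per 1000000 tokens
```

Changing `normalize_by` only rescales the number. It does not change how frequent a token is relative to other tokens.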

Create a markdown cell below and make notes on your observations. 

What kinds of tokens are most frequent in the corpus? 

You can change the `page_current` number to see other results pages. Try this now. 

What kinds of tokens appear more as you view more results?

## Task 2: What is interesting and what is informative?

We are often not interested in punctuation tokens. The table below filters those out. 

In [None]:
conc.frequencies(exclude_punctuation = True, page_current = 1, normalize_by=1000).display()

When looking at a frequency table, we are often in a rush to find interesting tokens for analysis. One way to do this is to discard information on the most frequent tokens using a stop word list. A stop word list typically contains the most frequent function words.

In [None]:
conc.frequencies(exclude_punctuation = True, page_current = 1, normalize_by=1000, exclude_tokens = stop_words).display()

The frequency table now shows more content words related to the topics of the corpus. Before looking closely at this, we should understand the information we've lost. The cell below prints the stop words. Think about your introduction text, which is an example of the texts that make up the corpus. If our aim is to understand linguistic patterns related to this corpus, are there any words that you think might be informative that we are discarding?

Create a markdown cell and make a note of any tokens in the stop words that you think might be informative (and why). 

Have you come up with some tokens? You can examine these more closely using the concordance notebook later in the lab.

In [None]:
print(sorted(list(stop_words)))

### Important: don't discard information ...

Often people new to language analysis approach it as a recipe, a series of steps that they follow the same way every time regardless of their data or research aims.  

**Do not automatically remove stop words!** While stop word removal is relevant for specific tasks (e.g. applying machine learning techniques to model topics in a corpus), you may be discarding important information. 

Stop words matter! To illustrate this, consider the title of this section "Important: don't discard information". If we remove stop words, this becomes "Important: discard information". A completely different message.
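The effect on the section title can be reproduced with a short sketch in plain Python. The stop word set here is a tiny, hypothetical list chosen for illustration (the list loaded earlier in this notebook is much longer), and splitting on whitespace is a simplification of how a real tokenizer works.

```python
# Tiny, hypothetical stop word list for illustration only
example_stop_words = {"don't", "the", "a", "of"}

title = "Important: don't discard information"

# Naive whitespace tokenization; real tokenizers split text differently
tokens = title.split()

# Keep only tokens that are not in the stop word list
kept = [token for token in tokens if token.lower() not in example_stop_words]
filtered_title = " ".join(kept)

print(filtered_title)
```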

**Do not use stop word lists to discard tokens unless:**

1. you have done some exploratory analysis initially to understand the data and the kinds of tokens that are present; and,   
2. you are aware of alternative ways to identify tokens used more or less frequently than in other language samples (e.g. the statistical measures you will learn about in the next module); and,  
3. you have weighed up the alternatives and you have a good reason to make a choice that makes sense for your research question and the data you are working with; and,  
4. you record your thinking about this choice for yourself and others.  

This is a more general point to keep in mind when analysing data. We often make choices during analysis. Does your choice make sense for your analysis? Are you able to explain and justify your choices to other researchers? You should be able to say "Yes!" to both of these questions. 

## Task 3: Corpus analysis is comparative

Now go back to the frequency table. 

In [None]:
conc.frequencies(exclude_punctuation = True, page_current = 1, normalize_by=1000).display()

One problem with making sense of the frequency table is that it lacks context. We probably don't have a good sense of what tokens are typically frequent or not. We almost certainly don't have a good sense of what frequency data means without comparison.  

Let's compare the Introduce yourself frequency table with the frequency table of a general sample of English. Our *reference corpus* is the Brown Corpus, which includes approximately 1 million words of American English text from the 1960s. We might question whether the kind of English and the dates are relevant; these are good questions to ask. The Brown Corpus is useful for our comparison here because it contains a wide range of text types. First, here is a summary of the corpus. You can compare this to the summary of the Introduce yourself corpus at the top of the notebook.

In [None]:
reference_corpus.summary() # overview of the reference corpus

Here is the frequency table for the Brown Corpus. You can change the `page_current` number to see other results pages.

In [None]:
conc_reference.frequencies(exclude_punctuation = True, page_current = 1, normalize_by=1000).display()

Compare the two frequency tables. You will notice the most frequent tokens are different for each corpus. While comparing ranks gives you some information about the relative frequency of tokens within the corpus, the normalized frequency is very helpful quantitative information about how frequencies in the two corpora compare. With the normalized frequency you can use quantitative information to express differences between the corpora. For example:

The function word "my" appears 20.4 times more frequently per 1000 tokens in the Introduce yourself corpus than in the Brown corpus.  

Note: this statement is based on the following information:  
* Normalized frequency of "my" in Introduce yourself corpus: 25.3  
* Normalized frequency of "my" in Brown corpus: 1.24  
* Ratio of normalized frequencies: 25.3 / 1.24 = 20.4  
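The same calculation can be sketched in plain Python so that you can reuse it for other tokens. The helper name `frequency_ratio` is hypothetical; the two values are the normalized frequencies quoted above.

```python
def frequency_ratio(normalized_a, normalized_b):
    """Return how many times more frequent a token is in corpus A than in corpus B."""
    return normalized_a / normalized_b

# Normalized frequencies (per 1000 tokens) of "my", as quoted above
my_introduce_yourself = 25.3
my_brown = 1.24

ratio = frequency_ratio(my_introduce_yourself, my_brown)
print(round(ratio, 1))
```

Both values must be normalized by the same number (here, 1000 tokens) for the ratio to be meaningful.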

Create a markdown cell and make notes on key differences. Be specific and use the quantitative information in the table to support your observations. To aid you with this you can use a feature of the reporting library to restrict the results to specific tokens. For example, to restrict the results shown to the pronouns "I" and "my" you would do this:
  

In [None]:
restrict_tokens = ['i', 'my'] # change or add tokens as needed - use lowercase
conc.frequencies(exclude_punctuation = True, restrict_tokens = restrict_tokens, page_current = 1, normalize_by=1000).display()
conc_reference.frequencies(exclude_punctuation = True, restrict_tokens = restrict_tokens, page_current = 1, normalize_by=1000).display()

Note: while we can do comparisons with frequency tables, in the next module you will learn about statistics used to compare corpora.  

## Task 4: Introducing dispersion in frequency analysis

So far we have examined frequent tokens without considering whether they appear in many or few texts in the corpus. Function words are generally common across texts, but there will be tokens that are frequent but only appear in very few texts. To examine this, we can add the `show_document_frequency` parameter to the frequency table. This shows the number of documents that mention the token at least once.  
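The difference between token frequency and document frequency can be sketched in plain Python. The three documents are toy examples made up for illustration (not from the lab corpus), and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Toy documents for illustration only (not from the lab corpus)
documents = [
    "i speak three languages and i love languages",
    "language is data",
    "my first language is english",
]

token_frequency = Counter()     # total occurrences across the corpus
document_frequency = Counter()  # number of documents containing the token

for document in documents:
    tokens = document.split()
    token_frequency.update(tokens)
    # set() ensures each token is counted at most once per document
    document_frequency.update(set(tokens))

for token in ["language", "languages"]:
    print(token, token_frequency[token], document_frequency[token])
```

In this toy example both tokens occur twice, but one is spread across two documents while the other is concentrated in a single document.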

In the table below we are restricting the results to a list of words related to text analysis. You can add words to this list if you like. This functionality is helpful to compare counts for related words in a corpus.  

Words that do not appear in the corpus are not included in the frequency table.  

Notice that the tokens "languages" and "language" have similar frequencies, but "languages" appears in only one document, while "language" appears in multiple. These are very different patterns of use. 

We are dealing with a very small corpus in this instance that was derived from a specific prompt. With such a small corpus, the choices made by a single author are significant. If we are interested in making general claims about features of texts in the Introduce yourself corpus, we may treat frequent tokens that appear in a single document as outliers or exceptions. 

In [None]:
restrict_tokens = ['language', 'languages', 'linguist', 'linguistics', 'linguistic', 'text', 'corpus', 'corpora', 'texts', 'speech', 'written', 'classify', 'model', 'classification', 'classifier', 'classifiers', 'data'] # change or add tokens as needed - use lowercase
conc.frequencies(exclude_punctuation = True, restrict_tokens = restrict_tokens, show_document_frequency = True, normalize_by=1000).display()

If we were working with a larger corpus, specific documents would be unlikely to account for high frequency tokens. However, the distribution of tokens is still relevant. For example, "data" is the most frequent of the tokens in our list in both the Introduce yourself and Brown corpora (see below). However, "data" is mentioned 167 times (19.09 times per 1000 word tokens) in the Introduce yourself corpus and in almost all documents (25 out of 28 texts). In the Brown corpus the token "data" is mentioned 173 times (0.18 times per 1000 word tokens) and only appears in 10% of documents. This is a very different pattern. The use of "data" in the Brown corpus would still be interesting to analyse, but it is helpful to understand that we would be analyzing a token that is relatively common in a subset of documents. 

In [None]:
conc_reference.frequencies(exclude_punctuation = True, restrict_tokens = restrict_tokens, show_document_frequency= True, page_current = 1, normalize_by=1000).display()

Can you find any other words that appear multiple times, but only appear in one or two documents?  Make a note of these in a markdown cell for further analysis later in the lab.

## Task 5: Don't make assumptions about meaning from frequency tables

Below is a frequency table of mentions of education-related words in the Introduce yourself corpus. One problem to be aware of when analyzing frequency tables and other high-level quantitative views of a corpus is that we are losing the context of the tokens. We need to be careful to avoid making assumptions about what specific tokens mean or why they are frequent or infrequent. 

In this example, given we are interested in education-related terms we might assume that "learning" is frequent because students are talking about their learning or their attitude to learning. However, this would be missing a key pattern in the data. Around one third of mentions of the token "learning" in this corpus relate to "machine learning" or "deep learning". Perhaps you've guessed this, especially if you wrote about "machine learning", but think of someone analyzing this corpus without that knowledge. With the frequency table alone we cannot make strong claims about the use of "learning". 

This example illustrates why we should not make assumptions about meaning from frequency tables. To understand how tokens are being used, we need information on that token's textual context. This is something we pick up on in the next notebook. 
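As a rough illustration of why context matters, the sketch below counts the word that appears immediately before "learning" in a toy sentence. The text is made up for illustration (not from the lab corpus), and this plain Python approach is only a stand-in for the concordance analysis introduced in the next notebook.

```python
from collections import Counter

# Toy text for illustration only (not from the lab corpus)
text = "i enjoy learning and i use machine learning and deep learning for text data"
tokens = text.split()

# Count the token immediately preceding each occurrence of "learning"
preceding = Counter(
    tokens[i - 1]
    for i, token in enumerate(tokens)
    if token == "learning" and i > 0
)

print(preceding.most_common())
```

A frequency table would report three mentions of "learning" here, but only one of them is about the writer's own learning.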

Review the tables above. Identify two or three tokens and write a bad assumption about each. You can test them using the concordance notebook later in the lab.

In [None]:
restrict_tokens = ['school', 'teaching', 'learning', 'knowledge', 'student', 'training', 'teacher', 'lecturer', 'academic', 'university', 'research', 'skill', 'study', 'educational', 'skills', 'students', 'classroom', 'class']
conc.frequencies(exclude_punctuation = True, restrict_tokens = restrict_tokens, page_size = 10, show_document_frequency = True, normalize_by=1000).display()

In [None]:
conc.ngram_frequencies

## Task 6: Wrapping up

Rather than answering questions about a corpus, frequency analysis often raises questions. Review the tasks you worked through above. Make notes on the following questions in a markdown cell:  

* Do you understand why discarding stop words can be a problem?  
* Do you understand the importance of avoiding making assumptions based on frequency tables alone? Did you come up with some bad assumptions to test in the concordance notebook?  
* Are there any other tokens you would like to explore further based on your analysis above? Note these down or create hypotheses about the use of these tokens in the corpus.  

Now it is time to look at the next notebook, which will introduce you to concordance analysis.