# Examining the context of words




## Concordances

To examine *how* words are used in a text, it can be useful to create a concordance. In a concordance, all the occurrences of a given search term are shown in combination with words that occur before and after this term. Such resources are sometimes referred to as *keyword in context* lists (KWIC). 

The code below defines a function named `concordance_char()`, which can help you to examine the contexts of words. To work with it, you need to supply three parameters: (1) the path to the text file that you want to analyse; (2) a regular expression representing the word(s) whose contexts you want to examine; and (3) a number that specifies the extent of the context, i.e. the number of characters that will be shown before and after the search result.

The last few lines also demonstrate how you can use this function. 

In [None]:
from os.path import join
import re

def concordance_char( file , search_term , window ):

    # Collect one text fragment for each line that contains the search term
    concordance = []
    regex = re.compile( r'\b{}\b'.format( search_term ) , re.IGNORECASE )

    # The with-statement closes the file automatically
    with open( file , encoding = 'utf-8' ) as text:

        for line in text:
            line = line.strip()

            # Only the first match on each line is used
            match = regex.search( line )
            if match:
                # Show `window` characters before and after the match
                start = match.start() - window
                end = match.end() + window

                if start < 0:
                    # Pad with spaces so that the search term stays aligned
                    extract = ' ' * abs( start ) + line[ 0 : end ]
                else:
                    extract = line[ start : end ]

                concordance.append( extract )

    return concordance

    
dir = 'Corpus'
file = 'ARoomWithAView.txt'

c = concordance_char( join( dir,file ) , 'florence' , 20 )

for line in c:
    print(line)


In the function `concordance_char()` defined above, the 'window', i.e. the length of the text fragments shown before and after the search term, is measured in characters. This is why the name of the function has the suffix '_char'. When you use the number 20 as the third parameter, for instance, you will see the 20 characters that precede and follow each instance of the search term. Because the number of characters is fixed, the search term is in the same position on each line, resulting in a keyword-in-context list with a nice and orderly appearance.

The downside of this approach is that the lines may begin or end with incomplete words. The code simply cuts off the text after the given number of characters, even when that point falls in the middle of a word.
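
To see this truncation in isolation, the minimal sketch below applies character-based slicing to a single sentence. The sentence and the variable names are invented for this illustration and do not come from the corpus.

```python
import re

# Invented sample sentence, used only for illustration
sentence = 'They arrived in Florence yesterday evening.'
window = 8

match = re.search( r'\bflorence\b' , sentence , re.IGNORECASE )

# Take exactly `window` characters on either side of the match
start = max( match.start() - window , 0 )
end = match.end() + window
fragment = sentence[ start : end ]

# The fragment starts in the middle of 'arrived'
# and ends in the middle of 'yesterday'
print( fragment )
```

With a window of 8 characters, both the first and the last word of the fragment are cut off.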

The method named `concordance()`, as defined in the `tdm` module, works with words rather than with characters. When you supply the number 10 as the value for the parameter defining the window, you will receive all occurrences of the search term, together with the 10 words that are used before and after this word. 
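
A word-based window can be sketched as follows. The function `concordance_words()` is a hypothetical helper written for this illustration. It is not the code in `tdm.py`, which may tokenise the text differently. The sketch assumes that splitting on whitespace is an acceptable way to separate words, and it works on a string rather than on a file.

```python
import re

def concordance_words( text , search_term , window ):
    # Hypothetical helper, not the implementation in the tdm module
    words = text.split()

    # Allow punctuation directly before or after the search term
    regex = re.compile( r'^\W*{}\W*$'.format( search_term ) , re.IGNORECASE )

    lines = []
    for i , word in enumerate( words ):
        if regex.search( word ):
            # Up to `window` words on either side of the match
            before = words[ max( i - window , 0 ) : i ]
            after = words[ i + 1 : i + 1 + window ]
            lines.append( ' '.join( before + [ word ] + after ) )

    return lines


# Invented sample text, used only for illustration
sample = 'They arrived in Florence yesterday evening. Florence was cold and very wet.'

for line in concordance_words( sample , 'florence' , 2 ):
    print( line )
```

Because the window is counted in words, every fragment consists of complete words, but the search term no longer appears in the same position on each line.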

The `tdm` module defines a class named `text`. You can initialise this class by calling `text()`. During the initialisation, you need to provide the name of the text file. If the text file is not in the same directory as this notebook, you need to indicate the path to this file. The initialisation method reads in the full contents of the text, using the `open()` function. If you want, you can inspect the code of this initialisation method by opening the module, named `tdm.py`. 

Once the `text` class has been initialised with a specific text file, you can make use of its `concordance()` method. Here, you need to supply the search term as the first parameter and the size of the window as the second parameter. The method returns a list containing all the lines of the concordance. You can navigate across all these lines using a for-loop. In the code below, all of these lines are printed on the screen, and they are also exported to a new file named 'concordance.txt'. 

The `tdm` module also contains the method `concordance_char()`, which you worked with earlier. To create a concordance in which the lengths of the text fragments are specified in characters, you simply need to replace the `concordance()` method with `concordance_char()` in the code below. 

In [None]:
import tdm
from os.path import join

dir = 'Corpus'

path = join( dir , 'ARoomWithAView.txt' )

novel = tdm.text(path)
kwic = novel.concordance( 'florence' , 20 )

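# Print each fragment and also save it to 'concordance.txt';
# the with-statement closes the file automatically
with open( 'concordance.txt' , 'w' , encoding = 'utf-8' ) as out:
    for fragment in kwic:
        print(fragment)
        out.write( fragment + '\n' )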


## Collocation analysis

Collocation analyses focus on the words that are used in the vicinity of a given search term. They may be viewed as an extension of the principle underlying the creation of concordances. To perform a collocation analysis, we look at the environment of a search term through a 'window' consisting of a given number of words, and we count the words that are used in this context. The result of a collocation analysis is an overview listing the words that are used most frequently in the neighbourhood of a given word. 

A collocation analysis can be performed using the `collocation()` method in the `tdm` module. The parameters are the same as those of the `concordance()` method: (1) a search term and (2) a number representing the size of the window (or, to be more precise, the number of words). This method returns a dictionary listing all the words found near the search term, together with the frequencies of these words. To study the code of the `collocation()` method, you can open the `tdm.py` file.  
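
The counting step behind such an analysis might look like the sketch below. The function `collocation_sketch()` is a hypothetical helper written for this illustration, not the method defined in `tdm.py`. It assumes that the search term is a plain word rather than a regular expression, that words can be extracted with the pattern `\w+`, and that all words are lowercased before counting.

```python
import re
from collections import Counter

def collocation_sketch( text , search_term , window ):
    # Hypothetical helper; the tdm method may tokenise differently
    words = re.findall( r'\w+' , text.lower() )
    target = search_term.lower()

    freq = Counter()
    for i , word in enumerate( words ):
        if word == target:
            # Up to `window` words before and after the search term
            before = words[ max( i - window , 0 ) : i ]
            after = words[ i + 1 : i + 1 + window ]
            freq.update( before + after )

    return freq


# Invented sample text, used only for illustration
sample = 'Florence was cold. They left Florence early, and Florence was quiet.'
freq = collocation_sketch( sample , 'florence' , 2 )

# Show the words with the highest counts first
for word , count in freq.most_common( 3 ):
    print( word , count )
```

A `Counter` is a subclass of `dict`, so the result can be handled like the dictionary described above. Its `most_common()` method returns the words ordered by frequency.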

The function `removeStopWords()` can also be useful in this context. It removes the stopwords from a given dictionary. This function makes use of the list of English stopwords defined in the `nltk` library. 

In [None]:
import tdm
from os.path import join

freq = dict()


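def removeStopWords( word_dict ):

    # The stopwords corpus must be available locally;
    # if it is not, run nltk.download('stopwords') once
    from nltk.corpus import stopwords
    stopwords_list = stopwords.words('english')
    
    # Copy only the words that are not stopwords
    filtered = dict()

    for w in word_dict:
        if w not in stopwords_list:
            filtered[w] = word_dict[w]

    return filtered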

            
dir = 'Corpus'
file = 'ARoomWithAView.txt'

novel = tdm.text( join( dir,file ) )
freq = novel.collocation( 'florence' , 10 )
freq = removeStopWords(freq)

            
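# Show only the first 30 words of the reversed list;
# the name `max_words` avoids shadowing the built-in max()
count = 0 
max_words = 30 

for word in reversed( tdm.sortedByValue(freq) ):
    count += 1
    print( f'{ count }. { word } => { freq[word] }' )
    if count == max_words:
        break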
    