# Examining the context of words

## Concordances

Lists of frequent words can be very useful: they can help to clarify the main concerns or the themes of a text. To examine *how* words are used in a text, it can be useful, additionally, to create a concordance. In a concordance, all the occurrences of a given search term are listed in combination with words that occur before and after this term. Such resources are sometimes referred to as *keyword in context* lists (KWIC). 

The `nltk` package contains a method named `concordance()`. To work with this method, you firstly need to create an instance of the `Text` class. This class is part of the `nltk.text` module. Such a `Text` object can be initialised using a list with all the tokens of a text. 

In [1]:

import os

path = os.path.join('Corpus','PrideAndPrejudice.txt')

with open( path , encoding = 'utf-8') as file:
    full_text = file.read()

tokens = word_tokenize(full_text)
novel = Text(tokens)

In the code above, the `Text` object is given the name `novel`. 

Once you have created such an object, you can use its `concordance()` method. You can supply three parameters: 

1. A search term.
2. A `width` parameter, specifying the extent of the context. With this parameter, you indicate the number of characters before and after the word whose context you want to see. 
3. A `lines` parameter, which specifies the number of results. 

Out of these parameters, only the first one is mandatory. When you leave out the last two parameters, the method will work with its default values: a width of 70 (35 characters before and 35 characters after the search term) and 25 lines. 

In [2]:
novel.concordance('marriage' , width = 50 , lines = 10)

Displaying 10 of 66 matches:
month . Happiness in marriage is entirely a matter
ng of their supposed marriage , and planning his h
 well disposed of in marriage . This gallantry was
probability of their marriage was exceedingly agre
the felicity which a marriage of true affection co
hat another offer of marriage may ever be made you
for happiness in the marriage state . If therefore
made you an offer of marriage . Is it true ? '' El
ll—and this offer of marriage you have refused ? '
using every offer of marriage in this way , you wi


In the `concordance()` method that is defined in `nltk`, the width of the context is defined using a specific number of characters. When you work with such a fixed number of characters, the search term can be shown at the same position on each line, resulting in a keyword-in-context list with a nice and orderly appearance.

The downside of this approach is that the various lines may contain incomplete words. Which you indicate that the size of the context must be set at 20 characters before and after the term, the code simply removes all characters preceding or following the twentieth character. 

The cell below contains a definition of a method which can create a somwhat different type of concordance. In this method, named `concordance_words()`, the width of the context is specified on the basis of words rather characters. When you supply the number 10 as the value for the parameter defining the width, you will receive all occurrences of the search term, together with 5 words before and 5 words after this search term. The method demands a sting as input. This string can be the full text of a novel. 

The results are returned as a list. 

In [5]:
import math
import re
from text_mining import *

def concordance_word( text, regex , width = 10 ):

    concordance = []
    distance = math.floor( width /2 )

    segment_length = 0

    words = word_tokenize( text )
    words = remove_punctuation( words )
    i = 0
    for w in words:
        if re.search( regex , w , re.IGNORECASE ):
            match = ''
            for x in range( i - distance , ( i + distance ) + 1 ):
                if x >= 0 and x < len(words):
                    if len(words[x]) >= 0:
                        match += words[x] + ' '
            concordance.append( match )

        i += 1

    return concordance



The cell below contains an illustration of how you can use this method. 

In [6]:
path = os.path.join('Corpus','PrideAndPrejudice.txt')

with open( path, encoding = 'utf-8') as file:
    full_text = file.read()
    
fragments = concordance_word( full_text , r'marriage' , 16)

print( f'There are {len(fragments)} ocurrences of the word "marriage".')

number_of_results = 5

print( f'Here are the first {number_of_results} occurrences:\n\n')
for f in fragments[:number_of_results]:
    print( f'{f}\n')

There are 67 ocurrences of the word "marriage".
Here are the first 5 occurrences:


studying his character for a twelvemonth Happiness in marriage is entirely a matter of chance If the 

disliking her guest by talking of their supposed marriage and planning his happiness in such an alliance 

all in due time well disposed of in marriage This gallantry was not much to the taste 

her to understand that the probability of their marriage was exceedingly agreeable to her Elizabeth however did 

very house in all the felicity which a marriage of true affection could bestow and she felt 



As you can see in the definition of `concordance_word()`, the method searches for occurrences of the supplied search term as a [regular expression](https://cdsleiden.github.io/python-tutorial/notebooks/9%20Regular_expressions.html). The second parameter of this method can also be a more complicated regular expression. 

In [7]:
fragments = concordance_word(full_text , r'(\bhates?\b)|(\bloves?\b)' , 25)

for f in fragments[15:22]:
    print( f'{f}\n')

the best education can overcome And your defect is a propensity to hate every body And yours he replied with a smile is wilfully to 

there is not a bit of fish to be got Lydia my love ring the bell I must speak to Hill this moment It is 

of him to write to you at all and very hypocritical I hate such false friends Why could not he keep on quarrelling with you 

is that we are very different sort of men and that he hates me This is quite shocking deserves to be publicly disgraced Some time 

misfortune of all find a man agreeable whom one is determined to hate not wish me such an evil When the dancing recommenced however and 

I shall chuse to attribute it to your wish of increasing my love by suspense according to the usual practice of elegant females I do 

Collins was not left long to the silent contemplation of his successful love for Bennet having dawdled about in the vestibule to watch for the 



## Collocation analysis

Collocation analyses focus on the words that are used in the vicinity of a provided search term. It may be viewed as an extension of the principle underlying the creation of concordances. To perform a collocation analysis, we can look at the environments of a search term through a 'window' consisting of a given number of words. The words that are used in this context can obviously be counted. The aim of a collocation analysis is to identify the words that are used most frequently in the neighbourhood of a given word. 

Such collocation analyses can be carried out using the `collocation()` method that is defined below. 

In [25]:
def collocation( text , regex , width ):

    freq_c = dict()
    distance = math.floor(width/2)

    sentences = sent_tokenize( text )

    for sentence in sentences:

        words = word_tokenize( sentence )
        words = remove_punctuation(words)

        for i,w in enumerate(words):
            if re.search( regex , w , re.IGNORECASE ):
                index_regex = i 

                for x in range( i - distance , i + distance ):
                    if x >= 0 and x < len(words) and words[x].lower() != words[index_regex].lower():
                        if len(words[x]) > 0:
                            word = words[x].lower()
                            freq_c[ word ] = freq_c.get( word , 0 ) + 1
            
    return freq_c


The parameters are the same as those of the `concordance_word()` method: 

1. The text that needs to be analysed.
2. A search term, which will be treated as a regular expression.
3. A number representing the width of the context (or, ot be more precise: the number of words). 

This function returns a dictionary listing all the words found near the search term that is provided, together with the frequencies of these words.  

The code below makes use of the function `sortedByValue()` which can sort a dictionary by value, and the list of stopwords from `nltk` to remove the function words. 

In [None]:
nearby_words = collocation( full_text , r'marriage' , 20)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

nearby_words_sorted = sortedByValue( nearby_words , ascending = False)

for word in list( nearby_words_sorted.keys() ):
    freq = nearby_words_sorted[word]
    if word not in stopwords and freq > 2:
        print( f'{word} => {freq}')

## Cooccurrence

Once you have established that two specific words are often used in combination, you can begin to study specific combinations of words in more detail using the `cooccurrence()` method that is defined below.

In [9]:
def cooccurrence( text , word1 , word2 , width ):
    
    relevant_sentences = []
    
    text = re.sub( '\s+' , ' ' , text )
    sentences = sent_tokenize( text )

    for s in sentences:
        if re.search( r'\b' + word1 + r'\b' , s , re.IGNORECASE ) and re.search( r'\b' + word2 + r'\b' , s , re.IGNORECASE ):

            words = word_tokenize(s)
            word1_indexes = []
            word2_indexes = []
            
            for i,w in enumerate(words):
                if re.search( r'\b' + word1 + r'\b' , w , re.IGNORECASE ):
                    word1_indexes.append(i)
                elif re.search( r'\b' + word2 + r'\b' , w , re.IGNORECASE ):
                    word2_indexes.append(i)

            if word1_indexes[0] > word2_indexes[0]:
                difference = word1_indexes[0] - word2_indexes[0]
            else:
                difference = word2_indexes[0] - word1_indexes[0]

            if difference <= width:
                relevant_sentences.append(s)
    return relevant_sentences
                       

The useage of the method is as follows:
    
* As the first parameter, you mus provide the full text that you want to analyse, as a single string.
* As the second and the third parameter, you need to mention the two words that you are interested in. 
* How close should these two words be? The fourth parameter specifies the number of words that are allowed in between the two words you focus on.  

The method `cooccurrence()` returns all the sentences containing the two words that you focus on. The distance (measured in number of words) will not be greater than the width that you specified. 

In [None]:
sentences = cooccurrence( full_text , 'marriage' , 'lydia' , 10 )

for s in sentences:
    print( f'{s}\n')

### Exercise 10.1

Create a concordance for the word 'savage' in the novel *Brave New World*. You can find the full text in the 'Corpus' folder. Work with a width of 50 characters (i.e. 25 characters before and 25 characters after this search term).

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import Text
import os

path = os.path.join('Corpus','BraveNewWorld.txt')

with open( path , encoding = 'utf-8') as file:
    full_text = file.read()

tokens = word_tokenize(full_text)
novel = Text(tokens)

novel.concordance('savage' , width = 56, lines = 10)

Displaying 10 of 201 matches:
d me to go to one of the Savage Reservations with him . 
e always wanted to see a Savage Reservation . ' 'But his
cause I do want to see a Savage Reservation . ' * * * * 
eek they would be in the Savage Reservation . Not more t
e had ever been inside a Savage Reservation . As an Alph
here is no escape from a Savage Reservation . ' The word
nted to the sullen young savage . 'Funny , I expect . ' 
nce more the men 's deep savage affirmation of their man
e cheek . 'Turned into a savage , ' she shouted . 'Havin
 'father ' of this young savage must be . 'Would you lik


### Exercise 10.2. 

Create a concordance for the word 'soma' in the novel *Brave New World*. This time, work with a width of 20 words (i.e. 10 words before and 10 words after this search term). Display the first 15 occurrences. 

In [1]:
import math
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import os

from text_mining import *

path = os.path.join('Corpus','BraveNewWorld.txt')

with open( path, encoding = 'utf-8') as file:
    full_text = file.read()
    
fragments = concordance_word( full_text , r'soma' , 20)

print( f'There are {len(fragments)} ocurrences of the word "soma".')

number_of_results = 15

print( f'Here are the first {number_of_results} occurrences:\n\n')
for f in fragments[:number_of_results]:
    print( f'{f}\n')

There are 61 ocurrences of the word "soma".
Here are the first 15 occurrences:


that brute Henry Foster you need is a gramme of soma the advantages of Christianity and alcohol none of their defects 

in the solid substance of their distractions there is always soma delicious soma half a gramme for a a gramme for 

solid substance of their distractions there is always soma delicious soma half a gramme for a a gramme for a two 

that he could have got through life without ever touching soma The malice and bad tempers from which other people had 

do look glum What you need is a gramme of soma Diving into his Benito produced a phial cubic centimetre cures 

be true his brain I suppose He put away the soma bottle and taking out a packet of stuffed a plug 

a loud and cheerful company they ate an excellent meal Soma was served with the coffee Lenina took two tablets and 

half an hour before closing time that second dose of soma had raised a quite impenetrable wall between the actual unive

### Exercise 10.3

In *Ullyses*, which words are used most frequently in the vicinity of the word 'father'? Consider 8 words before and 8 words after all the occurrences of this specific name.

In [2]:
from text_mining import *
import os
from nltk.tokenize import sent_tokenize, word_tokenize

def sortedByValue( dict , ascending = True ):
    if ascending: 
        return {k: v for k, v in sorted(dict.items(), key=lambda item: item[1])}
    else:
        return {k: v for k, v in reversed( sorted(dict.items(), key=lambda item: item[1]))}

path = os.path.join('Corpus','Ullyses.txt')

with open( path, encoding = 'utf-8') as file:
    full_text = file.read()
    
nearby_words = collocation( full_text , r'father' , 20)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

nearby_words_sorted = sortedByValue( nearby_words , ascending = False)

for word in list( nearby_words_sorted.keys() ):
    freq = nearby_words_sorted[word]
    if word not in stopwords and freq > 2:
        print( f'{word} => {freq}')

conmee => 62
cowley => 28
son => 27
said => 18
conroy => 11
left => 11
like => 9
house => 9
old => 9
mother => 9
simon => 8
john => 8
saluted => 8
mr => 8
walked => 8
reverend => 7
little => 7
thought => 7
stephen => 7
man => 7
time => 7
put => 7
handed => 6
dedalus => 6
smiled => 6
blessed => 6
told => 6
father => 6
poor => 6
bernard => 6
thy => 6
ghost => 6
hamlet => 6
go => 5
hanlon => 5
malachi => 5
along => 5
road => 5
letter => 5
flynn => 5
name => 5
well => 5
friend => 5
holy => 5
knows => 5
god => 5
gave => 5
captain => 4
bloom => 4
canon => 4
hughes => 4
forget => 4
virag => 4
black => 4
constable => 4
passed => 4
corner => 4
hat => 4
dineen => 4
race => 4
sir => 4
going => 4
art => 4
would => 4
eyes => 4
things => 4
child => 4
first => 4
looked => 4
laid => 4
voice => 4
ennis => 3
waiting => 3
theyre => 3
always => 3
used => 3
bit => 3
corrigan => 3
church => 3
years => 3
must => 3
respected => 3
dolan => 3
thinking => 3
air => 3
dublin => 3
saint => 3
hope => 3
good => 3
tal

### Exercise 10.4

Find all the sentences in *Ullyses* that contain the words 'book' and 'read'. Make sure that, in these sentences, there are no more than 10 words in between these two words.  

In [None]:
from text_mining import *
import os
from nltk.tokenize import sent_tokenize, word_tokenize


path = os.path.join('Corpus','Ullyses.txt')

with open( path, encoding = 'utf-8') as file:
    full_text = file.read()
    
cooccurrences = cooccurrence( full_text , 'book' , 'read' , 10 )

for fragment in cooccurrences:
    print(fragment)