# Examining the context of words




## Concordances

To examine *how* words are used in a text, it can be useful to create a concordance. In a concordance, all the occurrences of a given search term are shown in combination with words that occur before and after this term. Such resources are sometimes referred to as *keyword in context (KWIC)* lists. 

The code below defines a function named `concordance()` which can help you to examine the contexts of words. It takes three parameters: (1) the filename of the text that you want to analyse; (2) a regular expression representing the word(s) whose contexts you want to examine; and (3) a number that specifies the size of the context, i.e. the number of characters shown before and after the search result.

The last few lines of the code demonstrate how you can use this function.

In [None]:
import re
from os.path import join

def concordance( file , search_term , window ):
    # Return keyword-in-context extracts, with `window` characters on either side of the match

    lines = []
    regex = r'\b{}\b'.format( search_term )

    # The with-statement closes the file once it has been read
    with open( file , encoding = 'utf-8' ) as text:
        for line in text:
            line = line.strip()

            # Only the first match on each line is reported
            match = re.search( regex , line , re.IGNORECASE )
            if match:
                start = match.start() - window
                end = min( match.end() + window , len( line ) )

                if start < 0:
                    # Pad with spaces so that the search term stays aligned
                    extract = ' ' * abs( start ) + line[ 0 : end ]
                else:
                    extract = line[ start : end ]

                lines.append( extract )

    return lines


corpus_dir = 'Corpus'
file = 'ARoomWithAView.txt'

c = concordance( join( corpus_dir , file ) , 'love' , 20 )

for line in c:
    print(line)
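
The `concordance()` function above reports at most one hit per line, because `re.search()` stops at the first match. As a minimal sketch of one possible variation, the code below uses `re.finditer()` to collect every match. The name `kwic_all()`, the choice to pass a string instead of a filename, and the sample sentence are assumptions made for illustration only.

```python
import re

def kwic_all(text, search_term, window):
    """Return one (left, keyword, right) tuple for every match in `text`."""
    regex = r'\b{}\b'.format(search_term)
    hits = []
    for match in re.finditer(regex, text, re.IGNORECASE):
        # Slice `window` characters before and after the matched term
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        hits.append((left, match.group(0), right))
    return hits

# Hypothetical sample text, used here instead of a corpus file
sample = 'Love is patient; love is kind. They love Florence.'

for left, keyword, right in kwic_all(sample, 'love', 10):
    # Right-align the left context so that the keywords line up
    print(f'{left:>10} [{keyword}] {right}')
```

Returning the left context, the keyword and the right context separately leaves the layout to the caller, which can be convenient if the results need to be sorted on the words that follow the search term.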


## Collocation analysis

Collocation analyses focus on the words that are used in the vicinity of a given search term. Such analyses give an impression of the semantic contexts of this term. A collocation analysis can be performed using the `collocation()` function. Its parameters resemble those of the `concordance()` function, except that the third parameter counts words rather than characters: it sets the number of words to be considered before and after the search term. The function returns a dictionary listing all the words found near the search term, together with the frequencies of these words.

The function `removeStopWords()` can also be useful in this context. It removes the stopwords from a given dictionary, making use of the list of stopwords defined in the `nltk` library.

In [None]:
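import re
from os.path import join
# The NLTK tokeniser and stopword list need data that can be fetched with nltk.download()
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def collocation( file , search_term , distance ):
    # Count the words found within `distance` tokens of each match of the search term

    freq = dict()
    regex = r'\b{}\b'.format( search_term )

    with open( file , encoding = 'utf-8' ) as text:
        for line in text:
            # Assumption: the original `word_tokenise` refers to NLTK's word_tokenize()
            words = word_tokenize( line.strip() )

            for i , w in enumerate( words ):
                if re.search( regex , w , re.IGNORECASE ):

                    # Inspect `distance` tokens on either side of position i
                    for x in range( i - distance , i + distance + 1 ):
                        if x >= 0 and x < len(words):
                            # Skip the search term itself, regardless of case
                            if not re.search( regex , words[x] , re.IGNORECASE ):
                                freq[ words[x] ] = freq.get( words[x] , 0 ) + 1

    return freq


def removeStopWords( freq ):
    # Return a copy of `freq` without the English stopwords in the NLTK list

    stopwords_list = set( stopwords.words('english') )

    filtered = dict()
    for w in freq:
        # The NLTK list is in lower case, so compare in lower case
        if w.lower() not in stopwords_list:
            filtered[w] = freq[w]

    return filtered


def sortedByValue( freq ):
    # Keys of `freq`, ordered from lowest to highest frequency
    return sorted( freq , key=lambda x: freq[x] )


corpus_dir = 'Corpus'
file = 'ARoomWithAView.txt'

c = collocation( join( corpus_dir , file ) , 'florence' , 10 )
c = removeStopWords(c)

count = 0
max_words = 30

for word in reversed( sortedByValue(c) ):
    count += 1
    print( f'{ count }. { word } => { c[word] }' )
    if count == max_words:
        break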
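
The counting and ranking steps can also be written more compactly with `collections.Counter` from the standard library. The code below is a minimal sketch under several assumptions: the name `collocates()` is hypothetical, the function works on a string instead of a file, and a simple regular expression stands in for the NLTK tokeniser, so punctuation is dropped and all words are lower-cased. Stopword filtering is left out.

```python
import re
from collections import Counter

def collocates(text, search_term, distance):
    """Count the words occurring within `distance` tokens of `search_term`."""
    # Simplification: lower-case the text and keep runs of word characters
    words = re.findall(r'\w+', text.lower())
    target = search_term.lower()
    freq = Counter()
    for i, w in enumerate(words):
        if w == target:
            lo = max(0, i - distance)
            # Tokens before and after position i, excluding position i itself
            neighbours = words[lo:i] + words[i + 1:i + distance + 1]
            freq.update(n for n in neighbours if n != target)
    return freq

# Hypothetical sample text, used here instead of a corpus file
sample = 'Florence was beautiful. She loved Florence, and Florence loved her.'
freq = collocates(sample, 'florence', 2)

# most_common() returns (word, count) pairs, highest count first
for rank, (word, n) in enumerate(freq.most_common(3), start=1):
    print(f'{rank}. {word} => {n}')
```

Because `most_common()` already returns the words in descending order of frequency, the separate sorting function and the manual counter are not needed in this variant.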
    