# Concordances

This notebook explains how to use the dtdpTdm module to perform a number of basic text and data mining operations, both at the level of individual texts and at the level of full corpora. This module was developed specifically for the 2018-2019 course *Digital Text and Data Processing*. The first few examples focus on analyses of individual texts.

The dtdpTdm module can be used, first of all, to produce concordances, or keyword-in-context (KWIC) lists. In a concordance, all the occurrences of a given search term are shown together with the words that occur before and after this term. The concordance() function takes the name of the file as its first parameter and the regular expression to search for in this file as its second. The third parameter sets the length of the fragments in the concordance: it determines the number of characters that will be shown before and after each search result.


In [None]:
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'

fullPath = join( dir , fileName )

## the next line produces a list of all lines containing the regular expression
## formatted as a KWIC list
c = dtdp.concordance( fullPath , r'\bwhales?' , 25 )

for line in c:
    print(line)
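
To make the role of the three parameters more concrete, the cell below gives a minimal sketch of how a keyword-in-context list can be built with the standard `re` module. The function `kwic` is a hypothetical illustration that works on a string rather than a file; it is not the code of dtdpTdm, whose internals may differ.

```python
import re

def kwic(text, pattern, width):
    """Return one context string per match of `pattern` in `text`.

    `width` is the number of characters kept on each side of the match.
    (Hypothetical helper for illustration; not part of dtdpTdm.)
    """
    # Collapse line breaks and repeated spaces so each entry stays on one line.
    text = re.sub(r'\s+', ' ', text)
    entries = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        entries.append(text[start:end])
    return entries

sample = "Call me Ishmael. The whale swam on, and the whales followed the ship."
for entry in kwic(sample, r'\bwhales?', 10):
    print(entry)
```

Because the window is measured in characters, the fragments may begin or end in the middle of a word.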


## Collocation analysis

In collocation analyses, programs calculate the frequencies of the words that are used within a certain distance of a provided keyword. Such analyses give an impression of the semantic context of search terms. In the dtdpTdm module, collocations can be analysed using the collocation() method. The parameters are the same as those of the concordance() method. An important difference, however, is that the third parameter determines the number of words, and not the number of characters. 

The function removeStopWords() can also be useful in this context. It removes the most common words from a given dictionary, using the standard list of stop words that is part of NLTK.

In [None]:
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )


freq = dtdp.collocation( fullPath , r'whales?' , 30 )
freq = dtdp.removeStopWords(freq)

for w in freq:
    print(w , freq[w])
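
The idea behind a collocation analysis with a window measured in words can be sketched as follows. The names `collocates`, `drop_stop_words` and `STOP_WORDS` are hypothetical and chosen for illustration only, the stop word list is a tiny stand-in for the NLTK list, and the keyword is matched against whole tokens, which may differ from what dtdpTdm does.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; NLTK's English list is much longer.
STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'on', 'is', 'was'}

def collocates(text, pattern, window):
    """Count the words found within `window` words of each keyword match."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        # Assumption: the pattern has to match the whole token.
        if re.fullmatch(pattern, word):
            left = words[max(i - window, 0):i]
            right = words[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

def drop_stop_words(counts, stop_words=STOP_WORDS):
    """Return a copy of `counts` without the words in `stop_words`."""
    return {w: n for w, n in counts.items() if w not in stop_words}

sample = "The white whale sank the ship and the whale swam on"
found = drop_stop_words(collocates(sample, r'whales?', 2))
for w in found:
    print(w, found[w])
```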

The code below can be used to visualise a given dictionary of word frequencies as a word cloud. Can you use this code to visualise the results of the collocation analysis above?

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt 
from wordcloud import WordCloud 

wordcloud = WordCloud(
    background_color="white",
    width=1500,
    height=1000,
    max_words=100,
    relative_scaling=1,
    normalize_plurals=False
).generate_from_frequencies(freq)


plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


To get the code above to work, you may need to install the wordcloud module first using the commands in the cell below:

In [None]:
import sys
# Install into the Python environment that this notebook is running in
!{sys.executable} -m pip install wordcloud

# Word frequencies

The function calculateWordFrequencies() calculates, as its name suggests, the frequencies of all the types (i.e. the unique words) in a given text. The function only requires the name of a file as a parameter.

In [None]:
import re
import string
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )

freq = dtdp.calculateWordFrequencies( fullPath )

for w in freq:
    print(w + ' => ' + str(freq[w]) )
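
The distinction between types and tokens can be illustrated with a few lines of standard Python. This is a minimal sketch, assuming a simple tokenisation on word characters and lower-casing; `word_frequencies` is a hypothetical name, and the tokenisation used by dtdpTdm may differ.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Map each type (unique lower-cased word) to its number of tokens."""
    tokens = re.findall(r'\w+', text.lower())
    return Counter(tokens)

sample_freq = word_frequencies("The whale, the white whale!")

for word, count in sample_freq.items():
    print(word + ' => ' + str(count))
```

The sample sentence contains five tokens but only three types.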

The related function mostFrequentWords() calculates word frequencies in exactly the same way as calculateWordFrequencies(), but it returns only the most frequent words in the text. Users can specify the length of the frequency dictionary by providing a number as the second parameter of the function.

In [None]:
import re
import string
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )

freq = dtdp.mostFrequentWords( fullPath , 50 )


for w in freq:
    print(w)
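
Selecting the top of a frequency list is a small step once the counts are available. The sketch below assumes the frequencies are already held in a dictionary and uses `Counter.most_common()` from the standard library; the function `most_frequent` and the sample counts are made up for illustration, and unlike mostFrequentWords() the function takes a dictionary rather than a file name.

```python
from collections import Counter

def most_frequent(freq, n):
    """Return the `n` most frequent words of `freq`, highest count first."""
    return dict(Counter(freq).most_common(n))

# Invented counts, for illustration only
counts = {'whale': 12, 'sea': 7, 'ship': 9, 'harpoon': 3}
top = most_frequent(counts, 2)

for w in top:
    print(w, top[w])
```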


Using code that was explained in Part 4 of the Python tutorial, can you visualise the 50 most frequent words as a bar chart?

## Dispersion graphs

Frequencies can also be visualised in dispersion graphs. To produce such a graph, the full text first needs to be divided into segments. The graph then shows the frequency of a particular word within each of these segments.

The function divideIntoSegments() requires two parameters. The first of these is the file containing the text to be analysed. The second parameter specifies the number of segments that need to be created.

In [None]:
import re
import string
from os.path import join
import dtdpTdm as dtdp

dir = 'Corpus'
fileName = 'MobyDick.txt'
fullPath = join( dir , fileName )

segments = dtdp.divideIntoSegments( fullPath , 30 )

data = dict()

count = 0
for s in segments:
    count += 1
    hits = re.findall( r'\bwhale' , s , re.IGNORECASE )
    data[ count ] = len( hits )


import matplotlib.pyplot as plt

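# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
if 'seaborn-v0_8-whitegrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-whitegrid')
else:
    plt.style.use('seaborn-whitegrid')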

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

ax.plot( list( data.keys() ) , list( data.values() ) , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')

ax.set_title( 'Moby Dick')
plt.show()
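
For readers who want to see what the segmentation step involves, here is a minimal sketch that splits a string into a fixed number of consecutive segments of roughly equal length in words. The function `divide_into_segments` is hypothetical and takes the text itself rather than a file name; how dtdpTdm draws its segment boundaries is not documented here and may differ.

```python
import re

def divide_into_segments(text, n):
    """Split `text` into `n` consecutive segments of roughly equal word count."""
    words = text.split()
    segments = []
    for i in range(n):
        # Integer boundaries keep segment sizes within one word of each other.
        start = i * len(words) // n
        end = (i + 1) * len(words) // n
        segments.append(' '.join(words[start:end]))
    return segments

sample = "whale sea whale ship sea sea whale whale"
parts = divide_into_segments(sample, 4)

# Count the keyword in each segment, as in the dispersion graph above
hits = [len(re.findall(r'\bwhale', s, re.IGNORECASE)) for s in parts]
print(hits)
```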