# Concordance Analysis

### COMM313 Spring 2019 (02/25/19)


## Overview 
* Discovering meaning through context
* KWIC concordance analysis
    * making a simple KWIC object with a list-of-lists structure
    * sorting a KWIC object
    * sampling lines from a KWIC
    * discovering and summarizing patterns






## Readings


* Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum - **Read Ch. 4**
* NLTK Book Ch. 2 (sections 1, 2 & 4) http://www.nltk.org/book/ch02.html


    

## Setup

* import the modules you need - best to keep these all at the top of your notebook
* set parameter values, e.g., directory paths or frequently used values like a list of punctuation, that will be used throughout the notebook

In [93]:
# import modules

%matplotlib inline

import os
import random
import re

import matplotlib.pyplot as plt
import seaborn as sn

from collections import Counter

from IPython.display import IFrame

In [94]:
### PARAMETERS

to_strip = ',.\xa0:-()\';$"/?][!`Ą@Ś§¨’–“”…ï‘>&\\%˝˘*'

## Functions

* It is also good to put functions you are going to use throughout the notebook at the top
* These are some of the functions we have seen in previous notebooks and assignments

In [95]:
def tokenize(text, lowercase=True, strip_chars=''):
    '''turn a string into a list of whitespace separated tokens - after observing lowercase flag and stripping specified characters
    
    Args:
        text        -- a string object containing the text to be tokenized
        lowercase   -- whether the string should be lowercased before tokenization (default: True)
        strip_chars -- a string containing a series of characters which should be stripped from text before tokenization (default: empty string)
        
    
    Returns:
        list of tokens
    '''
    if lowercase:
        text=text.lower()
        
    rdict = str.maketrans('','',strip_chars)
    text = text.translate(rdict)
        
    tokens=text.split()
    
    return tokens
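
To see what the `str.maketrans()` / `translate()` step does on its own, here is a minimal sketch. The sentence and the reduced set of characters to strip are made up for illustration and are not values from the course data:

```python
# Illustrative only: a made-up sentence and a small set of characters to strip
sample_text = '"Someone\'s been eating my porridge!" growled Papa Bear.'
sample_strip = '.,!"\''

# The third argument of str.maketrans lists characters to delete
delete_table = str.maketrans('', '', sample_strip)

# Lowercase, delete the unwanted characters, then split on whitespace
sample_tokens = sample_text.lower().translate(delete_table).split()
print(sample_tokens)
```

Because the apostrophe is among the stripped characters, `someone's` becomes `someones`, which is the form that appears in the KWIC lines later in the notebook.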

In [96]:
def make_kwic(kw, text, win=4):
    '''A basic KWIC function for a text
    
    Args:
        kw   -- keyword string to match exactly against each token
        text -- a list of tokens for the text
        win  -- number of context tokens to keep on each side of the keyword (default: 4)
        
    Returns:
        list of lines of form [ [left context words], kw, [right context words]]
    '''
    
    # positions of every token that matches the keyword
    hits = [i for i, w in enumerate(text) if w == kw]
    
    lines = []
    for i in hits:
        # clamp at 0 so a hit near the start does not produce a negative slice start
        left = text[max(0, i-win):i]
        right = text[i+1 : i+win+1]
        
        lines.append([left, kw, right])
        
    return lines
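
One detail worth checking in any windowed slice of the form `text[i-win:i]`: if the keyword occurs fewer than `win` tokens from the start of the text, `i-win` is negative, and Python treats a negative slice start as counting from the end of the list. The sketch below uses a toy token list (an assumption for illustration) to show the effect and one common guard, `max(0, i-win)`:

```python
toy_tokens = ['bear', 'ate', 'the', 'porridge', 'and', 'left']
i, win = 1, 4   # the keyword 'ate' is only one token from the start

# Negative start: toy_tokens[-3:1] is empty, so the left context is lost
unguarded_left = toy_tokens[i - win:i]

# Clamping the start at 0 keeps whatever left context exists
guarded_left = toy_tokens[max(0, i - win):i]

print(unguarded_left)
print(guarded_left)
```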

In [97]:
def print_kwic(kwic, win=None):
    '''A basic print function for a KWIC object
    
    Args:
        kwic -- a list of KWIC lines of the form [ [left words], kw, [right words]]
        win  -- if None then use all words provided in context otherwise limit by win
        
    Prints KWIC lines with left context width/padding win*8 characters
    '''
    
    if not kwic:
        return
    
    if win is None:
        win = len(kwic[0][0])
    
    for line in kwic:
        print("{: >{}}  {}  {}".format(' '.join(line[0][-win:]), 
                                      win*8, 
                                      line[1], 
                                      ' '.join(line[2][:win])
                                     )
             )            
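
The format string in `print_kwic()` uses a nested replacement field: in `{: >{}}` the inner `{}` is filled with the width, and `>` right-aligns the left context so that the keywords line up in a column. A small sketch with made-up context strings and an arbitrary width:

```python
# Made-up left contexts of different lengths (illustrative only)
left_contexts = ['the papa', 'said the mama', 'baby']
width = 16

formatted = []
for left in left_contexts:
    # '{: >{}}' pads with spaces and right-aligns; the width comes from the 2nd argument
    formatted.append('{: >{}}  {}'.format(left, width, 'bear'))

for line in formatted:
    print(line)
```

Every formatted line has the same length up to the keyword, which is what makes the concordance easy to scan by eye.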

In [98]:
def sort_kwic(kwic, order=None):
    ''' sort a kwic list in place using the passed positional arguments 
    
    Args:
        kwic   -- a list of lists [ [left tokens], kw, [right tokens]]
        order  -- one or more positional arguments of form side-pos, e.g. L1, R3, L4,
                  listed from primary to secondary sort key (default: None)
    
    Returns:
        the same kwic list, sorted so that e.g. ['R1','L1'] orders lines by R1 with ties ordered by L1
    '''
    if order is None:
        return kwic
   
    order = [order] if not isinstance(order, list) else order
    
    # apply the sorts from last key to first; list.sort() is stable, so the first key ends up primary
    for sort_term in reversed(order):
        if not re.fullmatch('[LR][1-9]', sort_term):
            raise ValueError('sort positions must look like L1 or R3, got {!r}'.format(sort_term))
        
        side = 0 if sort_term[0]=='L' else 2
        n = int(sort_term[1])
        # L positions count back from the keyword (L1 is the last left token); R positions count forward
        pos = -n if sort_term[0]=='L' else n-1
        
        kwic.sort(key=lambda l : l[side][pos])
    
    return kwic
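
Position codes such as `L1` or `R3` can be validated and split with a regular expression. The sketch below assumes a hypothetical helper, `parse_position()`, which is not part of the notebook's own code:

```python
import re

def parse_position(code):
    '''Hypothetical helper: split 'L2' into ('L', 2); raise ValueError if malformed.'''
    m = re.fullmatch(r'([LR])([1-9]\d*)', code)
    if m is None:
        raise ValueError('bad position code: {!r}'.format(code))
    return m.group(1), int(m.group(2))

print(parse_position('L1'))
print(parse_position('R3'))
```

`re.fullmatch()` requires the whole string to match, whereas `re.match()` only anchors at the start of the string.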

## Concordance (KWIC) analysis

#### Understanding the list-of-lists structure for KWIC lines

* The `make_kwic()` function returns:
    * a `list` where each item is one concordance line containing
         1. a `list` of tokens to the left of the keyword
         2. the keyword
         3. a `list` of tokens to the right of the keyword
         
    * so we have a `list-of-lists` structure
    
![](kwic_list2.png)

In [99]:
three_bears_text = open('/data/kids/threebears.txt').read()
three_bears_tokens = tokenize(three_bears_text, lowercase=True, strip_chars=to_strip)

In [100]:
bear_kwic=make_kwic('bear', three_bears_tokens)

In [101]:
bear_kwic

[[['porridge', 'growled', 'the', 'papa'],
  'bear',
  ['someones', 'been', 'eating', 'my']],
 [['porridge', 'said', 'the', 'mama'],
  'bear',
  ['someones', 'been', 'eating', 'my']],
 [['up', 'cried', 'the', 'baby'],
  'bear',
  ['someones', 'been', 'sitting', 'in']],
 [['chair', 'growled', 'the', 'papa'],
  'bear',
  ['someones', 'been', 'sitting', 'in']],
 [['chair', 'said', 'the', 'mama'],
  'bear',
  ['someones', 'been', 'sitting', 'in']],
 [['pieces', 'cried', 'the', 'baby'],
  'bear',
  ['they', 'decided', 'to', 'look']],
 [['to', 'the', 'bedroom', 'papa'],
  'bear',
  ['growled', 'someones', 'been', 'sleeping']],
 [['too', 'said', 'the', 'mama'],
  'bear',
  ['someones', 'been', 'sleeping', 'in']],
 [['still', 'there', 'exclaimed', 'baby'],
  'bear',
  ['just', 'then', 'goldilocks', 'woke']]]

In [102]:
print('Found {} instances of bear'.format(len(bear_kwic)))

Found 9 instances of bear


* So to get the first KWIC line we would use the zero index

In [103]:
bear_kwic[0]

[['porridge', 'growled', 'the', 'papa'],
 'bear',
 ['someones', 'been', 'eating', 'my']]

* Which you can see is itself a list with 3 items:
    * index 0 - left context tokens (a list of length win - here 4 words)
    * index 1 - the keyword as a string
    * index 2 - right context tokens (a list of length win - here 4 words)
    
![](kwic_list3.png)

* So left context is:

In [68]:
bear_kwic[0][0]

['porridge', 'growled', 'the', 'papa']

* The keyword is at index 1:

In [104]:
bear_kwic[0][1]

'bear'

* Right context:

In [105]:
bear_kwic[0][2]

['someones', 'been', 'eating', 'my']

* Then if we want to pick out specific words from the context we can use an additional level of indexing!


* The fourth word to the left of the keyword is `porridge`. It is the first item in the left context list, so:

In [71]:
bear_kwic[0][0][0]

'porridge'

* The second word to the right of the keyword is `been`. It is the second item in the right context list, so:

In [72]:
bear_kwic[0][2][1]

'been'

* So each line has this structure and we can use negative indexing to work back from the keyword

![](kwic_list4.png)
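
The relationship between position labels and list indexes can be sketched in a short loop. This example hard-codes the first `bear` KWIC line shown above so that it runs on its own:

```python
# First KWIC line for 'bear', copied from the output above
line = [['porridge', 'growled', 'the', 'papa'],
        'bear',
        ['someones', 'been', 'eating', 'my']]
left, keyword, right = line

# L1 is the word nearest the keyword, so count back from the end of the left list;
# R1 is the first item of the right list, so count forward from index 0
for n in range(1, 5):
    print('L{}: {:<10} R{}: {}'.format(n, left[-n], n, right[n - 1]))
```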

* Here are the first two lines of the KWIC of `bear`

In [73]:
bear_kwic[0:2]

[[['porridge', 'growled', 'the', 'papa'],
  'bear',
  ['someones', 'been', 'eating', 'my']],
 [['porridge', 'said', 'the', 'mama'],
  'bear',
  ['someones', 'been', 'eating', 'my']]]

* To get the __L1__ collocate `papa` we'd:
    1. select the first line:
        ```
        bear_kwic[0]
        ```
    2. select the left context of the first line:
        ```
        bear_kwic[0][0]
        ```
    3. select the fourth (or last) item in the left context of the first line:
        ```
        bear_kwic[0][0][3]
        ```
        or
        ```
        bear_kwic[0][0][-1]
        ```

In [74]:
bear_kwic[0][0][3]

'papa'

In [75]:
bear_kwic[0][0][-1]

'papa'

#### Displaying and sorting KWIC objects

* The `list-of-lists` structure is hard to read, so we can use the `print_kwic()` function defined above to display the KWIC lines more nicely

In [106]:
print_kwic(bear_kwic)

       porridge growled the papa  bear  someones been eating my
          porridge said the mama  bear  someones been eating my
               up cried the baby  bear  someones been sitting in
          chair growled the papa  bear  someones been sitting in
             chair said the mama  bear  someones been sitting in
           pieces cried the baby  bear  they decided to look
             to the bedroom papa  bear  growled someones been sleeping
               too said the mama  bear  someones been sleeping in
      still there exclaimed baby  bear  just then goldilocks woke


* In order to identify repeated patterns in a KWIC listing you need to be able to sort and reorder the lines according to the words in the right and left context.


* For example, if we reorder the lines alphabetically by the L1 position we get:

In [107]:
print_kwic(sort_kwic(bear_kwic, order=['L1']))

               up cried the baby  bear  someones been sitting in
           pieces cried the baby  bear  they decided to look
      still there exclaimed baby  bear  just then goldilocks woke
          porridge said the mama  bear  someones been eating my
             chair said the mama  bear  someones been sitting in
               too said the mama  bear  someones been sleeping in
       porridge growled the papa  bear  someones been eating my
          chair growled the papa  bear  someones been sitting in
             to the bedroom papa  bear  growled someones been sleeping


* Or order by the second word to the right of the keyword (R2):

In [109]:
print_kwic(sort_kwic(bear_kwic, order=['R2']))

               up cried the baby  bear  someones been sitting in
          porridge said the mama  bear  someones been eating my
             chair said the mama  bear  someones been sitting in
               too said the mama  bear  someones been sleeping in
       porridge growled the papa  bear  someones been eating my
          chair growled the papa  bear  someones been sitting in
           pieces cried the baby  bear  they decided to look
             to the bedroom papa  bear  growled someones been sleeping
      still there exclaimed baby  bear  just then goldilocks woke


* The `sort_kwic()` function can take a list of positions to sort by.


* So here we sort first by the word one position to the right (R1), and then lines that match there, e.g. those with `someones`, are sorted by the word in the R3 position.

In [110]:
print_kwic(sort_kwic(bear_kwic, order=['R1','R3']))

             to the bedroom papa  bear  growled someones been sleeping
      still there exclaimed baby  bear  just then goldilocks woke
          porridge said the mama  bear  someones been eating my
       porridge growled the papa  bear  someones been eating my
               up cried the baby  bear  someones been sitting in
             chair said the mama  bear  someones been sitting in
          chair growled the papa  bear  someones been sitting in
               too said the mama  bear  someones been sleeping in
           pieces cried the baby  bear  they decided to look
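
Sorting by several positions works because Python's `list.sort()` and `sorted()` are stable: lines that tie on the current key keep their earlier relative order. Sorting by the secondary key first and the primary key last therefore gives the same result as a single sort on a tuple of keys. A minimal sketch on three lines copied from the output above:

```python
lines = [
    [['too', 'said', 'the', 'mama'], 'bear', ['someones', 'been', 'sleeping', 'in']],
    [['to', 'the', 'bedroom', 'papa'], 'bear', ['growled', 'someones', 'been', 'sleeping']],
    [['porridge', 'said', 'the', 'mama'], 'bear', ['someones', 'been', 'eating', 'my']],
]

# Two passes: secondary key (R3) first, then primary key (R1)
two_pass = sorted(lines, key=lambda l: l[2][2])
two_pass = sorted(two_pass, key=lambda l: l[2][0])

# One pass with a tuple key (R1, R3)
one_pass = sorted(lines, key=lambda l: (l[2][0], l[2][2]))

print(two_pass == one_pass)
for l in one_pass:
    print(l[2])
```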


* We can also include more than 4 words to the left and the right with the `win` parameter of the `make_kwic()` function

In [111]:
she_kwic = make_kwic('she', three_bears_tokens, win=6)
she_kwic_R1_sorted = sort_kwic(she_kwic, order=['R1'])
print_kwic(she_kwic_R1_sorted)

                 just right she said happily and  she  ate it all up after shed
                  walk in the forest pretty soon  she  came upon a house she knocked
           shed eaten the three bears breakfasts  she  decided she was feeling a little
                   bowl this porridge is too hot  she  exclaimed so she tasted the porridge
                      feet this chair is too big  she  exclaimed so she sat in the
               three bears she screamed help and  she  jumped up and ran out of
                      soon she came upon a house  she  knocked and when no one answered
                she went upstairs to the bedroom  she  lay down in the first bed
                        but it was too hard then  she  lay in the second bed but
                        but it was too soft then  she  lay down in the third bed
                    ran away into the forest and  she  never returned to the home of
                  bowl this porridge is too cold  she  said so she tasted th

### Task

* Make sure you understand the list-of-lists structure and practice picking out specific items from specific lines.

* Create a few more KWIC lists from the Three Bears text and practice sorting them to discover patterns.

## A bigger example: Examining _refugee_ in a corpus of news articles

* Baker presents a nice example of using concordance analysis on a corpus of newspaper texts discussing refugees.


* Here we use a smaller corpus of news articles retrieved from _LexisNexis_ with the search term `refugee`: the first 998 documents returned, saved in a single text file.

In [112]:
# read the contents of the file into a string object

refugee_texts = open('/data/refugee_corpus/LN_refugee.txt').read()

In [113]:
print('Number of characters in the text file:',len(refugee_texts))

Number of characters in the text file: 2364559


* This is what an article looks like:

In [115]:
print(refugee_texts[:1000])

﻿
                               1 of 998 DOCUMENTS

                            Birmingham Evening Mail

                            April 9, 2003, Wednesday

REFUGEE PROBLEMS EXPLAINED

SECTION: NEWS; Pg. 5

LENGTH: 94 words


ASYLUM seekers and refugees in Birmingham are having their basic needs met but
support groups are facing funding gaps and cultural misunderstandings, a
conference in the city heard.

Representatives from some of the biggest refugee organisations in Birmingham
presented their opinions and experiences of how their groups were managing to
integrate in to life in the city.

They presented their findings to the British Refugee Council, Refugee Action and
the Midlands Refugee Council, among other providers, at the Between Two Worlds
conference in Digbeth.

LOAD-DATE: April 9, 2003

LANGUAGE: ENGLISH

PUB-TYPE: PAPER

               Copyright 2003 Midland Independent Newspapers plc


                               2 of 998 DOCUMENTS



                                

* For the purpose of concordancing it doesn't matter much whether we strip out the header information.


* And we can keep all 998 documents in one file for ease of analysis for now.


* So first we normalize the text and tokenize it:

In [116]:
refugee_tokens = tokenize(refugee_texts, lowercase=True, strip_chars=to_strip)

In [117]:
print('Number of tokens:',len(refugee_tokens))

Number of tokens: 360521


In [118]:
print('Token counts for:\n\trefugee \t{}\n\trefugees\t{}\n'.format(
                    refugee_tokens.count('refugee'), refugee_tokens.count('refugees')))

Token counts for:
	refugee 	3004
	refugees	3439



* Let's make a KWIC object for the keyword `refugee` and look at the first 10 lines

In [119]:
refugee_kwic = make_kwic('refugee', refugee_tokens)

In [83]:
refugee_kwic[:10]

[[['april', '9', '2003', 'wednesday'],
  'refugee',
  ['problems', 'explained', 'section', 'news']],
 [['some', 'of', 'the', 'biggest'],
  'refugee',
  ['organisations', 'in', 'birmingham', 'presented']],
 [['findings', 'to', 'the', 'british'],
  'refugee',
  ['council', 'refugee', 'action', 'and']],
 [['the', 'british', 'refugee', 'council'],
  'refugee',
  ['action', 'and', 'the', 'midlands']],
 [['action', 'and', 'the', 'midlands'],
  'refugee',
  ['council', 'among', 'other', 'providers']],
 [['may', '15', '2003', 'thursday'],
  'refugee',
  ['pleas', 'a', 'record', 'byline']],
 [['a', 'record', 'number', 'of'],
  'refugee',
  ['applications', 'were', 'received', 'by']],
 [['poland', 'figures', 'from', 'the'],
  'refugee',
  ['applications', 'commissioner', 'show', 'only']],
 [['of', '900', 'were', 'granted'],
  'refugee',
  ['status', 'they', 'now', 'have']],
 [['rights', 'as', 'irish', 'citizens'],
  'refugee',
  ['application', 'commissioner', 'berenice', 'oneill']]]

* Looking at the first 30 lines in a clearer way with the `print_kwic()` function

In [120]:
print_kwic(refugee_kwic[:30])

          april 9 2003 wednesday  refugee  problems explained section news
             some of the biggest  refugee  organisations in birmingham presented
         findings to the british  refugee  council refugee action and
     the british refugee council  refugee  action and the midlands
         action and the midlands  refugee  council among other providers
            may 15 2003 thursday  refugee  pleas a record byline
              a record number of  refugee  applications were received by
         poland figures from the  refugee  applications commissioner show only
             of 900 were granted  refugee  status they now have
        rights as irish citizens  refugee  application commissioner berenice oneill
              ms oneill said the  refugee  application process was significantly
               the office of the  refugee  applications commissioner is an
     the justice minister grants  refugee  status it also investigates
         cases of people granted  refugee 

* There are a lot of lines to look at (over 3000).


* So the usual practice is to take a subset of the lines, sort it, and look for patterns.


* And then repeat the process

In [88]:
print_kwic(sort_kwic(refugee_kwic[50:100], order=['R1']))

          report card on canadas  refugee  and immigration programs thursday
             the response of the  refugee  applications commissioner june 14th
          helped to organise the  refugee  awareness activities she said
      least some immigration and  refugee  board commissioners have toughened
       denied by immigration and  refugee  board the national post
        says the immigration and  refugee  board summary of the
           living at the alamari  refugee  camp in the west
    union presidency agreed that  refugee  camps should be set
     least 100 palestinians from  refugee  camps in lebanon and
     least 100 palestinians from  refugee  camps in lebanon and
    spokesman for the vincentian  refugee  centre called on justice
       length 205 words coventry  refugee  centre is to get
      running costs the coventry  refugee  centre a registered charity
          april 17 2003 thursday  refugee  centres cash aid byline
                 it focuses on a  refugee  ch

#### Taking a random sample

* You can take a random sample of, say, 50 lines and repeat this process

In [124]:
random.seed(0)

In [125]:
sample_lines=random.sample(refugee_kwic,50)
sample_R1_sort=sort_kwic(sample_lines, order=['R1','L1'])
print_kwic(sample_R1_sort)

          home in sneinton notts  refugee  abas amini who sewed
amnesty international uk liberty  refugee  action the national assembly
  human rights organisations and  refugee  advocates are winning enormous
        he argued still canadian  refugee  advocates say a very
               found that the un  refugee  agency gives little training
      with the londonbased iraqi  refugee  aid council has so
    press shameful ernst zundels  refugee  application is an insult
   giantonio director of vermont  refugee  assistance some of them
       november 13 2003 thursday  refugee  bids section news pg
             mothers appeal of a  refugee  boards decision to deny
          leone near the gerihun  refugee  camp said yesterday that
        british flag in jabaliya  refugee  camp northern gaza strip
       camp gaza strip nusseirat  refugee  camp gaza strip israeli
               into a gaza strip  refugee  camp yesterday killing eight
      were militants calling the  refugee  camp str
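
Setting `random.seed()` before sampling makes the sample reproducible: the same seed gives the same lines each time the notebook is run. A minimal sketch, using a stand-in list of integers in place of `refugee_kwic`:

```python
import random

population = list(range(3004))   # stand-in for the KWIC lines

random.seed(0)
first_draw = random.sample(population, 50)

random.seed(0)                   # reset to the same seed
second_draw = random.sample(population, 50)

# Same seed, same sample
print(first_draw == second_draw)

# random.sample() draws without replacement, so no line is picked twice
print(len(set(first_draw)))
```

Calling `random.sample()` again without reseeding continues the random sequence, which is why repeated sampling cells draw different lines.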

#### Now look for patterns

* Look down the lines and try coming up with conceptual groupings:

    * __organizations__ : `board`,`council`
    * __holding places__ : `camp(s)`, `(detention) centre`
    * __issue/crisis__ : `crisis`
    * __process__ : `claim`, `status`, `admissions`
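
A frequency count of the R1 words can support this kind of grouping. The sketch below applies `Counter` to a handful of lines copied from the sample above; the `groups` dictionary is a hypothetical encoding of the conceptual groupings, not part of the course code:

```python
from collections import Counter

sample = [
    [['found', 'that', 'the', 'un'], 'refugee', ['agency', 'gives', 'little', 'training']],
    [['leone', 'near', 'the', 'gerihun'], 'refugee', ['camp', 'said', 'yesterday', 'that']],
    [['british', 'flag', 'in', 'jabaliya'], 'refugee', ['camp', 'northern', 'gaza', 'strip']],
    [['into', 'a', 'gaza', 'strip'], 'refugee', ['camp', 'yesterday', 'killing', 'eight']],
    [['mothers', 'appeal', 'of', 'a'], 'refugee', ['boards', 'decision', 'to', 'deny']],
]

# Count the word immediately to the right of the keyword (R1)
r1_counts = Counter(line[2][0] for line in sample)
print(r1_counts.most_common())

# Hypothetical mapping from R1 words to conceptual groups
groups = {'camp': 'holding places', 'camps': 'holding places',
          'board': 'organizations', 'boards': 'organizations',
          'council': 'organizations'}

# Words without a group yet are flagged so they can be reviewed by hand
group_counts = Counter(groups.get(word, 'ungrouped') for word in r1_counts.elements())
print(group_counts)
```

The `ungrouped` tally points to R1 words that still need a category, which is a prompt to go back and read those lines in context.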
    

#### Now repeat with another sample

In [126]:
sample_lines=random.sample(refugee_kwic,50)
sample_R1_sort=sort_kwic(sample_lines, order=['R1','L1'])
print_kwic(sample_R1_sort)

                 alone you are a  refugee  a refugee is a
              to seek asylum say  refugee  advocates on both sides
         the taliban was toppled  refugee  advocates have reported the
              such a ruling with  refugee  advocates arguing that it
        words the united nations  refugee  agency unhcr said yesterday
                  to help the un  refugee  agency and the world
          of the immigration and  refugee  board they are directly
              be recorded by the  refugee  board until the next
                 a strain on the  refugee  board bloating processing time
           radio comment wrong a  refugee  boss last night denied
                   bomb lab in a  refugee  camp sparking a gunbattle
         also visited the altash  refugee  camp west of baghdad
           and wait patiently in  refugee  camps senator eric abetz
             required to wait in  refugee  camps abroad until being
            400 people have left  refugee  camps in lebanon 

* Look down the lines and check them against the conceptual groupings, adding new words and groups where needed:

    * __organizations__ : `board`,`council(s)`, `convention`, `organisations`
    * __holding places__ : `camp(s)`, `(detention) centre`
    * __issue/crisis__ : `crisis`, `problems`
    * __process__ : `claim`, `status`, `admissions`, `review`, `hearing`, `cases`, `board`
    * __activism__ : `advocates`, `aid`
    

#### And repeat again with another random sample of 50 lines


In [92]:
sample_lines=random.sample(refugee_kwic,50)
sample_R1_sort=sort_kwic(sample_lines, order=['R1','L1'])
print_kwic(sample_R1_sort)

         other amendments to the  refugee  act comes into force
     lawyers public servants and  refugee  advocates they normally sit
         the taliban was toppled  refugee  advocates have reported the
              length 36 words un  refugee  agency urges governments to
         a barrister who handles  refugee  appeals cases ms teresa
        successful appeal to the  refugee  appeals tribunal the status
    patrick giantonio of vermont  refugee  assistance said coderres assertions
          of the immigration and  refugee  board would be refashioned
        outsiders from a chechen  refugee  camp in ingushetia in
         were destroyed at rafah  refugee  camp reuters homeless palestinian
    armored vehicles entered the  refugee  camp and parts of
           aid workers the worst  refugee  camp in the world
                 one red cent on  refugee  camps it is estimated
   specialize in immigration and  refugee  cases say zundel has
            the north of england  refugee  

* Check for any new groups or subgroups that describe the patterns