# Assignment 1: Collocation Tool
#### By Line Stampe-Degn Møller

##### Assignment and data:
https://github.com/CDS-AU-DK

https://github.com/CDS-AU-DK/cds-language/blob/main/assignments/assignment1.md

In terminal, add:
- pip install spacy
- python -m spacy download en_core_web_sm

In [1]:
# Import libraries
import spacy
import os
nlp = spacy.load("en_core_web_sm")

### 1. Take a user-defined search term and a user-defined windows size

In [2]:
# Define search word, pos (word class) and window size:
keyword = "sailor"
pos = "NOUN"
window_size = 3  # Before and after keyword (3 + keyword + 3)

### 2. Take one specific text which the user can define

In [3]:
# Load one text from the data folder
txt = open("../CDS-LANG/100_english_novels/corpus/Conrad_Nostromo_1904.txt").read()

In [4]:
# Check the data type and print the first part of the text to know what format I'm working with:
print(f"The data type is a: {type(txt)}")  # Result: 'str' (a string), which is the expected data type to move forward with.
print("#################")
print(txt[:500])  # Sample from first part of string

The data type is a: <class 'str'>
#################
﻿NOSTROMO
A TALE OF THE SEABOARD
By Joseph Conrad
"So foul a sky clears not without a storm." - SHAKESPEARE
TO JOHN GALSWORTHY
AUTHOR'S NOTE
" Nostromo " is the most anxiously meditated of the longer novels which
belong to the period following upon the publication of the "Typhoon"
volume of short stories.
I don't  mean to say that I became then conscious of any impending change
in my mentality and in my attitude towards the tasks of my writing
life. And perhaps there was never any change, except


In [9]:
import string

# Normalization of text string:
# Make everything lower case
txt_lower = txt.lower()

# Remove all new lines
txt_no_newlines = txt_lower.rstrip("\n") 

# Remove all punctuations
txt_clean = txt_no_newlines.translate(str.maketrans('', '', string.punctuation))  # https://datagy.io/python-remove-punctuation-from-string/
    # Comment: by removing all punctuation, I'm also removing some meaning, fx. genitive case: "sailors" vs. "sailor's".

# Check output
print(txt_clean[:500])  # Sample from first part of string

﻿nostromo
a tale of the seaboard
by joseph conrad
so foul a sky clears not without a storm  shakespeare
to john galsworthy
authors note
 nostromo  is the most anxiously meditated of the longer novels which
belong to the period following upon the publication of the typhoon
volume of short stories
i dont  mean to say that i became then conscious of any impending change
in my mentality and in my attitude towards the tasks of my writing
life and perhaps there was never any change except in that myst


In [7]:
# Put through a spaCy pipeline (tokenization = seperate into words and characters).
doc = nlp(txt_clean)

# Check output
print(doc[:500])  # Sample from first part of string

﻿nostromo
a tale of the seaboard
by joseph conrad
so foul a sky clears not without a storm  shakespeare
to john galsworthy
authors note
 nostromo  is the most anxiously meditated of the longer novels which
belong to the period following upon the publication of the typhoon
volume of short stories
i dont  mean to say that i became then conscious of any impending change
in my mentality and in my attitude towards the tasks of my writing
life and perhaps there was never any change except in that mysterious
extraneous thing which has nothing to do with the theories of art a
subtle change in the nature of the inspiration a phenomenon for which i
can not in any way be held responsible what however did cause me some
concern was that after finishing the last story of the typhoon volume
it seemed somehow that there was nothing more in the world to write
about
this so strangely negative but disturbing mood lasted some little
time and then as with many of my longer stories the first hint for
nostro

### 3. Find all the context words which appear +- the window size from the search term in that text

In [10]:
# Find instances of "sailor" 
# I choose to check the lemma of each word and then print the original text if the lemma (an the pos) matches the search terms.
for token in doc:
    if token.lemma_ == keyword and token.pos_ == pos:
        print(token.i, token.text)  # Print token index and the token text (NOT lemma/keyword, but original word)
    else:
        pass

384 sailor
487 sailors
604 sailor
709 sailor
3379 sailors
3585 sailors
4066 sailors
9897 sailor
11553 sailor
15911 sailor
45114 sailor
73191 sailor
74348 sailor
82133 sailor
82921 sailor
92484 sailors
106803 sailor
108425 sailors
110546 sailors
112535 sailor
114651 sailors
137924 sailor
152373 sailors
152639 sailors
156137 sailor
156686 sailor
157504 sailor
160160 sailor
178020 sailor


In [22]:
colloc = []  # Empty list for list for colloc words related to the keyword.

# Finding colloc words and saving them in list
for token in doc:
    if token.lemma_ == keyword and token.pos_ == pos:
        before = token.i - window_size        # Defining start index as [window_size] spaces BEFORE the index of the keyword.
        after = token.i + window_size + 1     # Defining end index as [window_size] spaces AFTER the index of the keyword (+1 for keyword).
        span = doc[before:after]              # Defining the span as the content of the doc string between the index spaces defined earlier.
        colloc.append([token, span])          # Saving lists of context words in list.
    else:
        pass

# Printing colloc words for each token.lemma == keyword. First entry in output (before comma) is the token.text (original word, NOT lemma):
for i in colloc:
    print(i)

[sailor, wanderings that american sailor worked for some]
[sailors, character in the sailors
story he]
[sailor, some quarrel the sailor threatened him what]
[sailor, 
ultimately the sailor disgusted with the]
[sailors, 
two wandering sailors  americanos perhaps]
[sailors, other sign the sailors the indian
]
[sailors,  as the sailors say  is]
[sailor, been a
sailor in his time]
[sailor, hand  as sailor as dock labourer]
[sailor, was the italian sailor whom all the]
[sailor, the mediterranean
sailor come ashore casually]
[sailor, but a common sailor i would call]
[sailor, cargadores that italian sailor of whom i]
[sailor, of that genoese sailor who like me]
[sailor, an original italian sailor whom i allowed]
[sailors, chirruped softly as sailors
do to]
[sailor, as
a sailor of course i]
[sailors, to a
sailors ear on such]
[sailors, 
the old sailors aspect was very]
[sailor, cast the old sailor with all his]
[sailors, especially as to sailors it was different]
[sailor, his leisure this sai

### 4. Calculate the mutual information score for each context word

#### Example of MI calculation:
###### from https://www.english-corpora.org/mutualInformation.asp

##### Formula:
MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)

Suppose we are calculating the MI for the collocate color near purple in BNC.

- A = frequency of node word (e.g. purple): 1262
- B = frequency of collocate (e.g. color): 115
- AB = frequency of collocate near the node word (e.g. color near purple): 24
- sizeCorpus= size of corpus (# words; in this case the BNC): 96,263,399
- span = span of words (e.g. 3 to left and 3 to right of node word): 6
- log (2) is literally the log10 of the number 2: .30103

MI = 11.37 = log ( (24 * 96,263,399) / (1262 * 115 * 6) ) / .30103

In [23]:
import math

def calc_freq(token, text):
    # Calculate the frequency of a token on a text document (lemma level):
    freq = 0
    for word in text:
        if word.lemma_ == token.lemma_:
            freq += 1
    return freq


def calc_colloc_freq(token, colloc):
    freq = 0
    # Calculate the frequency of a token on a colloc span (lemma level):
    for c in colloc:
        span = c[1]
        for word in span:
            if token.lemma_ == word.lemma_:
                freq += 1
    return freq


sizeCorpus = len(doc)
span_size = 2 * window_size
log2  = 0.30103

output = []  # Empty list for output (result of MI calculations)

for c in colloc:
    span = c[1]

    for word in span:
        A = calc_freq(c[0], doc)
        B = calc_freq(word, doc)
        AB = calc_colloc_freq(word, colloc)
        # Calculate the Mutual Information score:
        MI = math.log(( (AB * sizeCorpus) / (A * B * span_size) ) / log2)  # https://www.geeksforgeeks.org/log-functions-python/

        # Append [term], [freq], [doc_freq], [mut_inf_sco]
        output.append([word, AB, B, MI])

### 5. Save the results as a CSV file:

In [24]:
import csv

header = ['term', 'freq', 'doc_freq', 'mut_inf_sco']

with open('assignment1_output_cds_lang.csv', 'w', encoding='UTF8') as f:
    writer = csv.writer(f)

    # Write the header
    writer.writerow(header)

    # Write multiple rows
    for out in output:
        writer.writerow(out)
        print(out)

[wanderings, 1, 2, 7.3007682065800275]
[that, 3, 1757, 1.6211645876209848]
[american, 1, 18, 5.1035436292438074]
[sailor, 29, 35, 7.805863155637033]
[worked, 1, 103, 3.3591863989103365]
[for, 2, 1327, 1.4963865333677102]
[some, 2, 295, 3.000087211360098]
[character, 1, 54, 4.004931340575698]
[in, 2, 3240, 0.6037339589135424]
[the, 13, 12912, 1.0929523540917792]
[sailors, 29, 35, 7.805863155637033]
[
, 14, 14593, 1.0446754760874852]
[story, 1, 32, 4.528179484340246]
[he, 2, 3817, 0.43984251495468785]
[some, 2, 295, 3.000087211360098]
[quarrel, 1, 6, 6.2021559179119174]
[the, 13, 12912, 1.0929523540917792]
[sailor, 29, 35, 7.805863155637033]
[threatened, 1, 3, 6.895303098471863]
[him, 2, 3817, 0.43984251495468785]
[what, 1, 410, 1.977758227441619]
[
, 14, 14593, 1.0446754760874852]
[ultimately, 1, 5, 6.384477474705872]
[the, 13, 12912, 1.0929523540917792]
[sailor, 29, 35, 7.805863155637033]
[disgusted, 1, 6, 6.2021559179119174]
[with, 2, 1925, 1.1243813209780331]
[the, 13, 12912, 1.09295