# Task 1

Create a single file of all the child-directed .cha files in the Bernstein-Ratner CHILDES corpus.
It’s available here:
https://childes.talkbank.org/data/Eng-NA/Bernstein.zip

After unzipping the file, the relevant files will be in the Children subfolder. You can ignore all other files and subfolders. The result should be a single file containing just the caregivers’ speech and the part-of-speech tags. The annotations are listed at the top of each file. Lines that begin with * are utterances. In most instances, if not all, the caregivers’ speech is marked with *MOT: at the beginning of the line, and %mor: marks a line that assigns part-of-speech tags to the previous utterance. (Not necessary for this project, but interesting and good to know: %com marks a comment by either the investigator or a transcriber, and %gra marks a line that gives a dependency parse graph of the previous utterance.)

In [31]:
"""
Your completed file should look like this:
*MOT: she's really into books right now .
%mor: pro:sub|she~cop|be&3S adv|real&dadj-LY prep|into n|book-PL
adv|right adv:tem|now .
*MOT: you wanna see the book ?
%mor: pro:per|you v|want~inf|to v|see det:art|the n|book ?
*MOT: oh , look there's a boy with his hat .
%mor: co|oh cm|cm n|look pro:exist|there~cop|be&3S det:art|a n|boy
prep|with det:poss|his n|hat .
*MOT: and a doggie .
%mor: coord|and det:art|a n|dog-DIM .
*MOT: oh , you wanna look at this ?
%mor: co|oh cm|cm pro:per|you v|want~inf|to co|look prep|at pro:dem|this ?
*MOT: &w' look at this ?
%mor: v|look prep|at pro:dem|this ?
"""

In [1]:
# Download Bernstein.zip
# Unzip the file
# ! unzip Bernstein.zip

In [8]:
import os
from typing import List

def all_child_directed(filepath='Bernstein/Children') -> str:
    # Create a list of all .cha files in the Children subfolder
    file_list: List[str] = [os.path.join(root, file)
                 for root, _, files in os.walk(filepath)
                 for file in files
                 if file.endswith('.cha')]

    # Open a new file for writing
    with open('all_child_directed.txt', 'w') as outfile:
        # Loop over each file in the file list
        for filename in file_list:
            # Open the file for reading
            with open(filename, 'r') as infile:
                # Loop over each line in the file
                for line in infile:
                    # If the line starts with "*MOT:" or "%mor:", write it to the output file
                    if line.startswith("*MOT:") or line.startswith("%mor:"):
                        outfile.write(line)
    return 'file created successfully: all_child_directed.txt'

In [9]:
all_child_directed()

'file created successfully: all_child_directed.txt'
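
The expected output in the task shows %mor: tiers that wrap onto a second line, while the function above copies only lines that themselves start with *MOT: or %mor: (and it copies every %mor: line, whoever the speaker was). Below is a minimal sketch of one way to handle both points. It assumes that wrapped continuation lines in the .cha files begin with a tab character, which is worth checking against the downloaded files; keep_caregiver_tiers is a hypothetical helper name, not part of the assignment.

```python
from typing import Iterable, Iterator

def keep_caregiver_tiers(lines: Iterable[str]) -> Iterator[str]:
    """Yield *MOT: lines, the %mor: lines that follow them, and their
    wrapped continuation lines.

    Assumption: a continuation line starts with a tab character.
    """
    in_mot = False   # was the most recent utterance spoken by *MOT:?
    keeping = False  # is the tier we are currently inside being copied?
    for line in lines:
        if line.startswith("*"):
            # A new utterance: copy it only if the speaker is MOT
            in_mot = line.startswith("*MOT:")
            keeping = in_mot
        elif line.startswith("%"):
            # A dependent tier: copy only %mor: tiers under a MOT utterance
            keeping = in_mot and line.startswith("%mor:")
        elif not line.startswith("\t"):
            # Header lines (@...) or anything unexpected: stop copying
            keeping = False
        # Tab-indented lines fall through and inherit the current state
        if keeping:
            yield line
```

To use it inside the loop over files, something like outfile.writelines(keep_caregiver_tiers(infile)) should work, since an open text file is an iterable of lines.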

# Task 2
A table with all the different parts of speech can be found in https://talkbank.org/manuals/MOR.pdf on page 8. The part of speech tagging in CHILDES is famously incorrect. There are 21 occurrences of: pro:rel|who on the %mor lines in the single file that you just generated. pro:rel|who means that the word the caregiver spoke, “who”, in the line above is a relative pronoun. In a text editor, search for “pro:rel|who” and identify which of those occurrences are truly relative pronouns and not some other type of pronoun.
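
A plain substring search for pro:rel|who also matches pro:rel|whose, so it can help to collect the candidates programmatically and read each one next to its utterance. The sketch below only gathers candidates for manual review; deciding whether a given “who” is really relative still takes a human judgment. It assumes each %mor: line directly follows the *MOT: line it annotates. The names WHO_TAG and who_candidates are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# "who" tagged as a relative pronoun. The lookahead rejects a following
# lowercase letter, so "pro:rel|whose" is excluded while clitic forms
# such as "pro:rel|who~cop|be&3S" are still matched.
WHO_TAG = re.compile(r"pro:rel\|who(?![a-z])")

def who_candidates(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, utterance, %mor line) for each candidate."""
    hits = []
    utterance = ""  # the most recent *MOT: line seen so far
    for number, line in enumerate(lines, start=1):
        if line.startswith("*MOT:"):
            utterance = line.strip()
        elif line.startswith("%mor:") and WHO_TAG.search(line):
            hits.append((number, utterance, line.strip()))
    return hits
```

With the file from Task 1, a call such as who_candidates(open('all_child_directed.txt')) would return the pairs to inspect.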

In [12]:
def find_relative_pronouns(filename: str) -> None:
    # Open the input for reading and the report file for writing
    with open(filename, 'r') as infile, open('relative_pronouns.txt', 'w') as outfile:
        # Remember the previous line so the line above each hit is at hand
        context_line = ''
        # Loop over each line in the file, counting from 1
        for line_number, line in enumerate(infile, start=1):
            # Substring match: note that this also catches "pro:rel|whose"
            if "pro:rel|who" in line:
                # Rough filter only: the line above must contain a wh-word or "that".
                # It cannot tell a relative pronoun from an interrogative one,
                # so every hit still has to be checked by hand.
                if any(word in context_line.lower() for word in ['who', 'whom', 'that', 'which']):
                    output_line = f"Line {line_number}: {line.strip()}"
                    outfile.write(output_line + '\n')
                    print(output_line)
            # The current line becomes the context for the next one
            context_line = line

In [13]:
find_relative_pronouns('all_child_directed.txt')

Line 235: %mor:	pro:per|it~cop|be&3S det:art|a adj|little n|dog-DIM pro:rel|who
Line 988: %mor:	pro:rel|whose n|book cop|be&3S det:dem|this ?
Line 1761: %mor:	pro:rel|who ?
Line 1841: %mor:	n|guess pro:rel|who +...
Line 3576: %mor:	pro:rel|whose n|shoe cop|be&3S det:dem|this ?
Line 3580: %mor:	pro:rel|whose n|shoe cop|be&3S pro:per|it ?
Line 5599: %mor:	pro:rel|who ?
Line 5902: %mor:	pro:rel|who ?
Line 5903: %mor:	pro:rel|who .
Line 9548: %mor:	n|guess pro:rel|who det:dem|this aux|be&3S det:dem|this aux|be&3S
Line 10489: %mor:	co|well cm|cm pro:rel|who~aux|be&3S part|eat-PRESP det:art|the
Line 11107: %mor:	co|see pro:rel|who~cop|be&3S prep|on det:art|the n|telephone .
Line 13365: %mor:	coord|and det:art|a adj|little n|doll pro:rel|who~aux|be&3S
Line 13533: %mor:	v|let~pro:obj|us v|see adv:tem|now cm|cm pro:rel|who v|go-3S
Line 14092: %mor:	n|growl-PL pro:rel|who v|take&PAST det:poss|my n|steak ?
Line 15777: %mor:	co|yes cm|cm pro:rel|who ?
Line 16072: %mor:	pro:sub|I v|like det:dem|thi

# Task 3
I’ve uploaded on Blackboard a file justMOT.txt which contains the caregivers’ *MOT: utterances
from all the .cha files without the PoS tagging on the %mor: line. Your job is to create a new file
justMOTclean.txt which contains the same basic corpus but “cleaned up.”
Almost all corpora come with annotations, and transcription protocols that are not useful for a
particular study. For us, we want to remove:
● All tokens that begin with: &, +, -, or that begin with a digit
● Transcribers add material that the speaker did not actually articulate, and mark it
with parentheses. For example, prosody or a long pause that suggests the end of a
sentence: (.), or a missing part of a word filled in because of pronunciation: (yo)u did.
Since we’re doing an n-gram project, we want the transcriber’s interpretation of what
was intended. This means we need to delete the parentheses but keep what is inside
them. E.g.,
*MOT: What did (.) ahh (yo)u did . becomes *MOT: What did . ahh you did .
● Square brackets indicate a transcriber’s description of how something was said.
E.g., *MOT: look at all those numbers [=! high pitch] . bye bye.
We need the brackets and everything between them removed.
E.g., *MOT: look at all those numbers . bye bye.
● At the end of most *MOT: lines is a special code that links the line to a video or audio file.
Removing these is tricky. Use this Python code to set a “boolean flag” that you can use
in an if-statement. A token is tagged as a special code in CHILDES by inserting a
non-printable/non-visible character at the beginning and/or at the end of the token.
specialCode = (ord(w[0]) == 21 or ord(w[-1]) == 21)
# specialCode is True if either the first or the last character is
# non-printable ASCII 21; otherwise specialCode is False
There are many other annotations that for simplicity I’ve decided to ignore for this assignment.
For your final projects, we probably would need to talk about how to clean whatever corpus you
are using for your study.
I’ve been using the word ‘remove’ or ‘removed’, but in practice one almost never deletes
anything when using corpora. Instead, we create a new file that doesn’t contain the unwanted
annotations.
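
The token-level rules above (leading &, +, -, or digit, and the special media code) can be gathered into one small predicate. This is a sketch under the assumption that brackets and parentheses have already been dealt with; is_special_code and keep_token are hypothetical names, and the sample line with its \x15 characters is made up for illustration.

```python
def is_special_code(token: str) -> bool:
    """True if the first or last character is the non-printable ASCII 21."""
    return ord(token[0]) == 21 or ord(token[-1]) == 21

def keep_token(token: str) -> bool:
    """Decide whether one token belongs in the cleaned corpus."""
    if not token:
        return False  # nothing left to keep
    if token[0] in "&+-0123456789":
        return False  # annotation token or token starting with a digit
    return not is_special_code(token)

# Made-up example: "&uh" and the media code at the end are dropped
line = "What did . ahh &uh you did . \x151234_5678\x15"
cleaned = " ".join(t for t in line.split() if keep_token(t))
print(cleaned)  # What did . ahh you did .
```

Keeping the test in a function of its own makes it easy to try on single tokens before running it over the whole file.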

Here’s some pseudo-code you can work from that creates a new “cleaned file” from the
justMOT.txt file. Feel free to skip this and try to develop your own approach!
Set up input and output file objects for reading and writing.
For each line in the input file
    For each word in the current line
        Delete all characters that are ‘(’ and ‘)’ in the word
        If the first character in the word is ‘[’ and the last is ‘]’,
            skip and go to the next word
        If the first character in the word is ‘[’, note that you are
            inside a bracketed expression and go to the next word
        If the last character in the word is ‘]’, note that you are
            outside a bracketed expression and go to the next word
        Check to see if the current word is a special code (using
            the line above)
        If the first character of the current word is not one of:
            &, -, +, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and
            the current word is not inside a bracketed expression and
            the current word is not a special code:
                Print the current word without the newline,
                i.e., something like this: print(aWord, file=outFile, end=" ")
    print("", file=outFile) # prints the newline character at the end of the line


NOTES:
1) If the logic of detecting words between ‘[’ and ‘]’ is confusing, a more straightforward
approach would be to remove them from the entire line before you begin processing
word by word.
2) To skip to the next element during a for loop, use the Python command: continue
for i in range(10):
    if 4 <= i <= 6: continue
    print(i) # prints 0,1,2,3,7,8,9 skipping over 4,5,6
3) There are several ways to delete the ‘(’ and ‘)’. Replace is probably the most
straightforward.
s = "heals"
t = s.replace('a', 'e') # Non-mutating: returns the resulting string
print(t) # Displays heels
u = t.replace('h', '')
print(u) # Displays eels
4) You can use in to check for a character in a string
aString = "heals"
'a' in aString # Evaluates to True
'z' in aString # Evaluates to False
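
Following note 1, the bracketed annotations can be removed from the whole line in one step before splitting it into words. The sketch below uses a regular expression for that, which is one option among several, and then drops the parentheses with replace as in note 3. It assumes brackets are never nested; BRACKETED and strip_annotations are hypothetical names.

```python
import re

# A bracketed annotation: "[", then any characters other than "]", then "]"
BRACKETED = re.compile(r"\[[^\]]*\]")

def strip_annotations(line: str) -> str:
    """Remove [...] spans entirely and delete '(' and ')' characters."""
    line = BRACKETED.sub("", line)
    line = line.replace("(", "").replace(")", "")
    # Collapse the extra spaces left behind by the removed spans
    return " ".join(line.split())

print(strip_annotations("look at all those numbers [=! high pitch] . bye bye ."))
# look at all those numbers . bye bye .
print(strip_annotations("What did (.) ahh (yo)u did ."))
# What did . ahh you did .
```

Doing this first means the word-by-word loop no longer needs to track whether it is inside a bracketed expression.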


In [17]:
def process_just_mot(filename: str) -> None:
    # Read all lines from the input file
    with open(filename, 'r') as f:
        lines = f.readlines()

    # Extract all lines that start with '*MOT:'
    just_mot = []
    for line in lines:
        if line.startswith('*MOT:'):
            just_mot.append(line.replace('*MOT:', '').strip())

    # Write the extracted lines to a new file
    with open('justMOT.txt', 'w') as f:
        f.write('\n'.join(just_mot))

    # Process the words in the new file and write to a third file
    with open("justMOT.txt", "r") as inFile, open("justMOTclean.txt", "w") as outFile:
        for line in inFile:
            # Remove square bracketed expressions from line
            while "[" in line and "]" in line:
                start = line.index("[")
                end = line.index("]", start)
                line = line[:start] + line[end + 1:]

            # Split the line into words
            words = line.split()

            # Process each word
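            for word in words:
                # Remove parentheses from word
                word = word.replace("(", "").replace(")", "")

                # Skip tokens that are empty once the parentheses are gone
                if not word:
                    continue

                # Check if word is a special code (non-printable ASCII 21 at either end)
                specialCode = (ord(word[0]) == 21 or ord(word[-1]) == 21)

                # Skip annotation tokens, leftover bracket fragments, and special codes
                if word[0] in "&+-0123456789" or "[" in word or specialCode:
                    continue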

                # Write word to output file
                outFile.write(word + " ")

            # Write newline to output file
            outFile.write("\n")

process_just_mot('all_child_directed.txt')

# Task 4
Build bigram and unigram counts tables. These will be dictionaries: for the bigram counts, the
key is a tuple of two strings and the value is a count; for the unigram counts, the key is a
string and the value is a count. Use the justMOTclean.txt file to build your tables. Start with the
unigram counts. As you process tokens in the justMOTclean.txt file, the logic (in Python) goes like
this:
unigramCnts = { }
.
.
for line in inFile:
    for w in line.split():
        if w in unigramCnts: # seen w before, so add 1 to the count
            unigramCnts[w] = unigramCnts[w] + 1
        else: # first time encountering w, so initialize it to 1
            unigramCnts[w] = 1
Bigrams are just a little bit harder. You need to split each line of the file, and then use indexes to
iterate over the words. (Basically, for w in line.split() won’t work for bigrams. There are
clever ways to do something like for w1, w2 in line.split(), but this won’t work without
added baggage.)

bigramCnts = { }
for line in inFile:
    words = line.split()
    for i in range(len(words)-1):
        (w1, w2) = (words[i], words[i+1])
        if (w1, w2) in bigramCnts: # seen (w1, w2) before, so add 1 to the count
            bigramCnts[(w1, w2)] = bigramCnts[(w1, w2)] + 1
        else: # first time encountering (w1, w2), so initialize it to 1
            bigramCnts[(w1, w2)] = 1
Optional: It would be more efficient if you created both unigramCnts and bigramCnts in one
loop. See if you can figure out how.
TESTING: I’ve uploaded a small text file: test2.txt. Count the unigrams and bigrams by hand.
Then add some print lines to your code to output unigramCnts and bigramCnts and compare.
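
For the optional part, here is one possible way to build both tables in a single pass. It is a sketch rather than the expected solution: it swaps the plain dictionaries for collections.Counter (a dict subclass from the standard library) and uses zip to pair each word with the one after it. The function name count_ngrams and the two sample lines are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

def count_ngrams(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams in one loop over the lines."""
    unigram_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for line in lines:
        words = line.split()
        # Add 1 for every word on the line
        unigram_counts.update(words)
        # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
        # Pairs never cross a line boundary, as in the index-based loop.
        bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts

unigrams, bigrams = count_ngrams(["look at this .", "look at the book ."])
print(unigrams["look"])         # 2
print(bigrams[("look", "at")])  # 2
print(bigrams[("at", "this")])  # 1
```

An open file can be passed in directly, for example count_ngrams(open('justMOTclean.txt')), and the results can be compared with hand counts for test2.txt in the same way.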

In [24]:
import openpyxl

def process_counts():
    unigramCnts = {}
    bigramCnts = {}

    with open('justMOTclean.txt', 'r') as inFile:
        for line in inFile:
            # process unigrams
            for w in line.split():
                if w in unigramCnts:
                    unigramCnts[w] += 1
                else:
                    unigramCnts[w] = 1

            # process bigrams
            words = line.split()
            for i in range(len(words)-1):
                w1 = words[i]
                w2 = words[i+1]
                if (w1, w2) in bigramCnts:
                    bigramCnts[(w1, w2)] += 1
                else:
                    bigramCnts[(w1, w2)] = 1

    # print output
    print("Unigram counts:")
    print(unigramCnts)
    print("\nBigram counts:")
    print(bigramCnts)

    # export to xlsx file
    wb = openpyxl.Workbook()
    ws1 = wb.active
    ws1.title = 'Unigram counts'

    # write headers
    ws1.cell(row=1, column=1, value='Unigram')
    ws1.cell(row=1, column=2, value='Count')

    # write data
    for i, (word, count) in enumerate(unigramCnts.items()):
        ws1.cell(row=i+2, column=1, value=word)
        ws1.cell(row=i+2, column=2, value=count)

    # create a new sheet for bigram counts
    ws2 = wb.create_sheet(title='Bigram counts')

    # write headers
    ws2.cell(row=1, column=1, value='Bigram')
    ws2.cell(row=1, column=2, value='Count')

    # write data
    for i, (bigram, count) in enumerate(bigramCnts.items()):
        ws2.cell(row=i+2, column=1, value=' '.join(bigram))
        ws2.cell(row=i+2, column=2, value=count)

    # save workbook to file
    wb.save('counts.xlsx')

process_counts()

Unigram counts:
{'pat': 35, 'the': 1292, 'bunny': 97, '.': 4878, 'oh': 646, '!': 2109, 'you': 1252, 'know': 117, 'Judy': 43, 'can': 411, 'look': 396, 'in': 375, 'mirror': 32, 'now': 128, 'peekaboo': 63, "who's": 101, 'that': 807, '?': 3865, "that's": 403, 'Ann': 1, 'shall': 15, 'I': 418, 'just': 53, 'keep': 9, 'going': 48, 'feel': 40, "daddy's": 21, 'scra:tchy': 2, 'face': 37, 'o:h': 37, 'what': 731, 'does': 93, 'like': 272, 'it': 647, 'scratchy': 21, 'say': 162, 'read': 44, 'her': 64, 'book': 215, "Judy's": 9, 'happens': 1, 'hear': 11, 'tick': 32, ',': 1867, "what's": 542, 'he': 230, 'listening': 3, 'to': 341, 'a': 931, 'clock': 7, 'how': 108, 'big': 38, 'is': 692, "he's": 128, 'so': 56, "bunny's": 12, 'eating': 27, 'his': 201, 'food': 18, 'right': 149, "you've": 13, 'already': 3, 'this': 584, 'before': 10, 'ss:h': 5, 'sleeping': 25, 'be': 34, 'quiet': 2, 'all': 88, 'asleep': 9, 'Paul': 53, 'put': 240, 'finger': 30, 'through': 21, "mommy's": 43, 'ring': 36, 'your': 258, 'byebye': 53, 