# Data Science for Linguists

This Coding Project is a four-week tour of several useful techniques and libraries that linguists use in their research. It is structured as follows:

* Week 1: reading in corpora; basic text processing; querying
* Week 2: advanced text processing (SpaCy); outputting queries
* Week 3: working with structured data (Pandas)
* Week 4: visualization and statistical testing

The Coding Project is capped by submitting the Practice notebooks for each week; together, they make up a simple research project in which you will study the variation in adverbial use across a pair of online communities.

# Week 1: Text Processing

This week, we will survey three important techniques for text processing with Python: 
* reading in corpora
* basic text processing
* querying

## Topic 1: reading in corpora

Corpora come in many forms: some are raw text files, others are **structured** files, such as XML, CSV and JSON, in which the information is organized into fields or nested elements. Here is the documentation for reading these three file types in Python:

* XML: https://docs.python.org/3/library/xml.etree.elementtree.html
* CSV: https://docs.python.org/3/library/csv.html
* JSON: https://docs.python.org/3/library/json.html

All three libraries involved (`xml`, `csv` and `json`) are part of the Python 3 standard library, so you won't have to install any of them to use them. I will demonstrate the latter two here.

### reading in CSVs

CSV (comma-separated values) is the simplest format in which spreadsheets are stored. If you have created a spreadsheet in Excel or another spreadsheet editor, you can save it as a CSV (you'll lose some of the layout, such as highlighting and font effects like boldface and underlining) and read it into Python. You _can_ also read Excel files (.xlsx) directly, as we will see when we discuss Pandas, but for larger corpora it is often still useful to know how to read such files with the `csv` library.

The CSV library allows you to read out spreadsheets with headers as lists of dictionaries, which makes for easy processing. Suppose you have a spreadsheet that looks like this:

| sentence    | speaker     | text       |
| ----------- | ----------- | ---------- |
| 1           | Vladimir    | Nothing to be done |
| 2           | Estragon    | I'm beginning to come round to that opinion ... So there you are again. | 
| 3           | Vladimir    | Am I? |

Calling `csv.DictReader()` on the file handle of the file containing this spreadsheet allows you to iterate over the three rows, with each row transformed into a dictionary, like:

``` {'sentence' : '1', 'speaker' : 'Vladimir', 'text' : 'Nothing to be done' } ```
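
As a minimal sketch of this behaviour, the table above can be fed to `DictReader` from an in-memory string instead of a file. The variable `csv_text` is a hypothetical stand-in for a file on disk; `io.StringIO` makes a string behave like an open file.

```python
import csv
import io

# Hypothetical stand-in for a CSV file on disk, holding the table above.
csv_text = (
    'sentence,speaker,text\n'
    '1,Vladimir,Nothing to be done\n'
    '2,Estragon,"I\'m beginning to come round to that opinion ... So there you are again."\n'
    '3,Vladimir,Am I?\n'
)

# DictReader takes the first row as the header and yields one dict per remaining row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row['speaker'], '->', row['text'])
```

Note that every value, including the sentence number, comes back as a string.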

Let's look at a fragment of the DoReCo corpus (https://doreco.huma-num.fr/), a collection of fieldwork data for 51 languages, often glossed. Here, we're looking at Nǁng, a moribund Tuu language spoken in the very south of Africa (https://en.wikipedia.org/wiki/N%C7%81ng_language).

A useful first thing to do with files is to see what they look like as text. The code below does that, by iterating over the newline-separated lines in the file and printing them. It relies on `enumerate()`, which pairs each item with its index, as the first cell illustrates.

In [None]:
a = ['x','y','z']
for i,j in enumerate(a):
    print(i,j)

0 x
1 y
2 z


In [None]:
filename = 'doreco_sample.csv'
for li,l in enumerate(open(filename, errors='ignore')):
    print(li,l)

0 lang,file,core_extended,speaker,wd_ID,wd,start,end,ref,tx,ft,mb_ID,mb,doreco-mb-algn,ps,gl,ph_ID,ph

1 nngg1234,doreco_nngg1234_NB041016-01_A,core,B,w1,<p:>,0,1.657,<p:>,<p:>,<p:>,m1,<p:>,,<p:>,<p:>,p1,<p:>

2 nngg1234,doreco_nngg1234_NB041016-01_A,core,B,w2,gǁain,1.657,1.972,0001_DoReCo_doreco_nngg1234_NB041016-01_A,"gǁain tcuinya a nǀaa nǁang,",hyena sits and watches the house,m2,gǁain,,n,brown.hyena,p2 p3 p4,|\|\_v a~ i~

3 nngg1234,doreco_nngg1234_NB041016-01_A,core,B,w3,cuinya,1.972,2.49,0001_DoReCo_doreco_nngg1234_NB041016-01_A,"gǁain tcuinya a nǀaa nǁang,",hyena sits and watches the house,m3 m4,suin -a,,vitr -vsf,sit.down -?,p5 p6 p7 p8,tS u~ i~ a

4 nngg1234,doreco_nngg1234_NB041016-01_A,core,B,w4,a,2.49,2.63,0001_DoReCo_doreco_nngg1234_NB041016-01_A,"gǁain tcuinya a nǀaa nǁang,",hyena sits and watches the house,m5,a,,pro,2SG,p9,a

5 nngg1234,doreco_nngg1234_NB041016-01_A,core,B,w5,nǀaa,2.63,2.93,0001_DoReCo_doreco_nngg1234_NB041016-01_A,"gǁain tcuinya a nǀaa nǁang,",hyena si

This exploration tells us a couple of things:
* the file has 18 columns, as the line numbered 0 (the first line) indicates.
    * some are transparent in their meaning, others less so. The DoReCo website provides documentation, but for now we'll use a subset of these columns.
* the columns are separated by commas.
* If a comma occurs inside a column, the column is "quoted" with quotation characters ("), e.g. on line 3. This prevents a CSV reader from interpreting those commas as separating two columns.
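
To see why the quoting matters, here is a small sketch comparing a plain `split(',')` with the `csv` reader on a shortened row. The row is a hypothetical abbreviation of the corpus lines shown above, kept to four fields for readability.

```python
import csv
import io

# A shortened, hypothetical row: the third field contains a comma and is therefore quoted.
line = 'w2,gǁain,"gǁain tcuinya a nǀaa nǁang,",hyena sits and watches the house'

# Naive splitting also cuts inside the quoted field.
naive_fields = line.split(',')

# The csv reader respects the quotation characters.
csv_fields = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))

print(len(naive_fields), naive_fields)
print(len(csv_fields), csv_fields)
```

The naive split yields five pieces, with stray quotation marks, where the `csv` reader recovers the four intended fields.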

This tells us how we should use the CSV reader: with commas as the separating characters ('delimiters') and double quotation marks as quotation characters. When you run the code cell, `DictReader` creates an iterator that reads the file row by row, only as you ask for rows. This is often more practical than reading in the entire corpus in one go (for large files, that might take up too much memory on your computer!).
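
The row-by-row reading can be illustrated with `itertools.islice`, which takes only the first few items from an iterator. This is a sketch on a hypothetical in-memory corpus (the column names `wd` and `gl` are borrowed from the header above, the contents are made up).

```python
import csv
import io
from itertools import islice

# Hypothetical corpus of 1,000 rows, built in memory for illustration.
csv_text = 'wd,gl\n' + ''.join(f'word{i},gloss{i}\n' for i in range(1000))

reader = csv.DictReader(io.StringIO(csv_text))

# Only the first three rows are parsed here; the rest is left unread.
first_three = list(islice(reader, 3))
print(first_three)

# The reader remembers its position: the next row is the fourth one.
fourth = next(reader)
print(fourth)
```

A consequence of this design is that a reader can be iterated over only once, which is why the cells in this notebook reopen the file each time.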

The second code cell below iterates over the first 10 lines of the corpus.

In [None]:
import csv

filename = 'doreco_sample.csv'
filehandle = open(filename)

corpus = csv.DictReader(filehandle, delimiter = ',', quotechar = '"')
print(corpus)

<csv.DictReader object at 0x7f80267c4490>


In [None]:
filename = 'doreco_sample.csv'
filehandle = open(filename)
corpus = csv.DictReader(filehandle, delimiter = ',', quotechar = '"')
#
count = 0
for line in corpus:
    print(count, line)
    count += 1
    if count >= 10: break

0 {'lang': 'nngg1234', 'file': 'doreco_nngg1234_NB041016-01_A', 'core_extended': 'core', 'speaker': 'B', 'wd_ID': 'w1', 'wd': '<p:>', 'start': '0', 'end': '1.657', 'ref': '<p:>', 'tx': '<p:>', 'ft': '<p:>', 'mb_ID': 'm1', 'mb': '<p:>', 'doreco-mb-algn': '', 'ps': '<p:>', 'gl': '<p:>', 'ph_ID': 'p1', 'ph': '<p:>'}
1 {'lang': 'nngg1234', 'file': 'doreco_nngg1234_NB041016-01_A', 'core_extended': 'core', 'speaker': 'B', 'wd_ID': 'w2', 'wd': 'gǁain', 'start': '1.657', 'end': '1.972', 'ref': '0001_DoReCo_doreco_nngg1234_NB041016-01_A', 'tx': 'gǁain tcuinya a nǀaa nǁang,', 'ft': 'hyena sits and watches the house', 'mb_ID': 'm2', 'mb': 'gǁain', 'doreco-mb-algn': '', 'ps': 'n', 'gl': 'brown.hyena', 'ph_ID': 'p2 p3 p4', 'ph': '|\\|\\_v a~ i~'}
2 {'lang': 'nngg1234', 'file': 'doreco_nngg1234_NB041016-01_A', 'core_extended': 'core', 'speaker': 'B', 'wd_ID': 'w3', 'wd': 'cuinya', 'start': '1.972', 'end': '2.49', 'ref': '0001_DoReCo_doreco_nngg1234_NB041016-01_A', 'tx': 'gǁain tcuinya a nǀaa nǁang,'

Often, we will want to extract only certain columns. Given that each line is a dictionary, it's easy to retrieve only the columns we want by specifying their column names as dictionary keys. Suppose we want to get only the segmented morphemes and their glosses (printing only the first 30 lines here):

In [None]:
filename = 'doreco_sample.csv'
filehandle = open(filename)
corpus = csv.DictReader(filehandle, delimiter = ',', quotechar = '"')

In [None]:
count = 0
for line in corpus:
    print(count, 'morphemes =', line['mb'], '\tgloss = ', line['gl'])
    count += 1
    if count >= 30: break

0 morphemes = <p:> 	gloss =  <p:>
1 morphemes = ha 	gloss =  3H.SG
2 morphemes = nǁae 	gloss =  then
3 morphemes = ǁain 	gloss =  climb
4 morphemes = nǃoon 	gloss =  dune
5 morphemes = <p:> 	gloss =  <p:>
6 morphemes = gǁain 	gloss =  brown.hyena
7 morphemes = see 	gloss =  come
8 morphemes = ng 	gloss =  OBL
9 morphemes = ǃUbuka 	gloss =  ǃUbuka
10 morphemes = ha 	gloss =  3H.SG
11 morphemes = see 	gloss =  come
12 morphemes = ha 	gloss =  3H.SG
13 morphemes = ku 	gloss =  QUOT
14 morphemes = nǁaa 	gloss =  OBL.3SG.STR
15 morphemes = <p:> 	gloss =  <p:>
16 morphemes = ǃʼaa 	gloss =  stand
17 morphemes = ǁʼaa 	gloss =  go
18 morphemes = ki 	gloss =  place
19 morphemes = ng 	gloss =  1SG
20 morphemes = ǁʼaa 	gloss =  go.away
21 morphemes = ng 	gloss =  and
22 morphemes = ǀae 	gloss =  send
23 morphemes = gǁain 	gloss =  brown.hyena
24 morphemes = ha 	gloss =  3H.SG
25 morphemes = si 	gloss =  IRR
26 morphemes = ng 	gloss =  ?
27 morphemes = see 	gloss =  come
28 morphemes = ng 	gloss = 

Because `DictReader` yields one row at a time, we can also extract the information we're interested in without holding the entire corpus in memory. We can use conditional statements to retrieve or print only certain lines. The code below retrieves all imperatives from the corpus, and prints the full sentence along with them.

In [None]:
filename = 'doreco_sample.csv'
filehandle = open(filename)
corpus = csv.DictReader(filehandle, delimiter = ',', quotechar = '"')
#
for line in corpus:
    if 'IMP' in line['gl']:
        print('morphemes =', line['mb'], '\ngloss = ', line['gl'], '\nsentence = ', line['tx'], '\nft = ', line['ft'], '\n')

morphemes = suin -a 
gloss =  sit.down -IMP.SG 
sentence =  hm, a ha ǃuun ng ǁʼaa, ha ku ng gǁain, "ǁhaa a tcuinya, ng si ǃxae see", 
ft =  then he goes (?away), he (J) says to hyena "sit first down, I will come (right) now" 

morphemes = khuu-ǁʼng -a 
gloss =  stand.up -IMP.SG 
sentence =  ha ku, "gǁain, khui-nǁnga, 
ft =  he (J) says "hyena, get up! 

morphemes = ǃʼhuqung -uwe 
gloss =  greet -IMP.PL 
sentence =  ǀʼhuunsi see ng ku, "ǃʼhoonguwe", 
ft =  the Boer comes and says "hello, you" 

morphemes = ǃʼhuqung -a 
gloss =  greet -IMP.SG 
sentence =  ha ku, "ǃʼhoonga ǂxuu, 
ft =  he (J) says "hello, baas" 

morphemes = nǀaa -a 
gloss =  see -IMP.SG 
sentence =  nǀaaʼa ki a ǃqhaike, a ǃqhaike ǃqxʼabesi, 
ft =  look at your diarrhea, the cream?? 

morphemes = nǀaa -a 
gloss =  see -IMP.SG 
sentence =  nǀaaa a ǀuutyuu" 
ft =  look at your buttocks" 

morphemes = nǀaa -a 
gloss =  see -IMP.SG 
sentence =  ha ku ng ǀʼhuunsi, "nǀaaʼa ng ǀaia ng nǃaresi, 
ft =  he (J) says to the Boer "loo
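
The same filtering pattern can also feed a tally instead of a print statement. The sketch below counts imperative gloss tags with `collections.Counter`. The rows are a hypothetical mini-sample assembled from the output above, not the full corpus.

```python
import csv
import io
from collections import Counter

# Hypothetical mini-sample with only the 'mb' and 'gl' columns.
sample = (
    'mb,gl\n'
    'suin -a,sit.down -IMP.SG\n'
    'ha,3H.SG\n'
    'ǃʼhuqung -uwe,greet -IMP.PL\n'
    'nǀaa -a,see -IMP.SG\n'
)

counts = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    # Glosses are space-separated, one gloss per morpheme.
    for gloss in row['gl'].split():
        if 'IMP' in gloss:
            counts[gloss] += 1

print(counts)
```

Only the counter is kept in memory, so this scales to corpora far larger than the sample.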

### JSON

JSON is a format in which much internet data is structured. If you're working with social media data, you're likely to encounter JSON. Much like CSV, JSON lets you read out data in a dictionary structure. The `loads()` function is your go-to approach for reading JSON. The JSON file we have here, `yelp_review_sample.json`, contains a sample of Yelp reviews.

Let's look at some lines of the file first, before reading it in with JSON:

In [None]:
filename = 'yelp_review_sample.json'
for li,l in enumerate(open(filename)):
    print(li,l)
    if li > 10: break

0 {"review_id": "BiTunyQ73aT9WBnpR9DZGw", "user_id": "OyoGAe7OKpv6SyGZT5g77Q", "business_id": "7ATYjTIgM3jUlt4UM3IypQ", "stars": 5.0, "useful": 1, "funny": 0, "cool": 1, "text": "I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.\n\nFor anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).\n\nThere is no way I can write this review without giving Russell, the owner of Body Cycle, a shout out. Russell's passion for fitness and cycling is so evident, as is his desire for all of his clients to succeed. He is always dropping in to classes to check in/provide encouragement, and is open to ideas and recommendations from anyone. Russell always wears a smile on his face, even when he's kicking your bu

We can see that the text already looks like a dictionary, with key-value pairs on each line (the `stars` key, for instance, maps to a star rating such as 4.0 or 3.5). We read JSON by importing the `json` library and calling its `loads()` function. Because this file stores one JSON object per line, we iterate over the file and call `loads()` on each line. Here we do so for the first 10 lines.

In [None]:
import json

filename = 'yelp_review_sample.json'
count = 0
for l in open(filename):
    jl = json.loads(l)
    print(jl)
    #
    count += 1
    if count >= 10: break

{'review_id': 'BiTunyQ73aT9WBnpR9DZGw', 'user_id': 'OyoGAe7OKpv6SyGZT5g77Q', 'business_id': '7ATYjTIgM3jUlt4UM3IypQ', 'stars': 5.0, 'useful': 1, 'funny': 0, 'cool': 1, 'text': "I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.\n\nFor anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).\n\nThere is no way I can write this review without giving Russell, the owner of Body Cycle, a shout out. Russell's passion for fitness and cycling is so evident, as is his desire for all of his clients to succeed. He is always dropping in to classes to check in/provide encouragement, and is open to ideas and recommendations from anyone. Russell always wears a smile on his face, even when he's kicking your butt

Each line is now a dictionary with keys and values, which allows us to select only certain keys, as well as only certain lines (e.g., ones that mention the word 'rude' -- if we print the star ratings along with the text, you can tell there are a lot of negative reviews, unsurprisingly!):

In [None]:
import json

filename = 'yelp_review_sample.json'
count = 0
for l in open(filename):
    jl = json.loads(l)
    if 'rude' in jl['text']:
        print('>>>', jl['stars'], jl['text'], '\n')

>>> 4.0 The only reason I didn't give this restaurant a 5 star rating, is because of one single pretentious waiter. As a 4 night guest at Hotel Palomar, the location of the restaurant is an obvious plus. The first night of my stay, I met a coworker in the restaurant for a cocktail. When we arrived, the host staff were busy and not available, so we just walked in. The restaurant was not too busy, so we just looked at a small table next to the bar and proceeded to take a seat. A waiter came by and I quickly asked if we could have a seat, before sitting down and told him we'd only be having cocktails. He stumbled on his reply, and in an irritated/in-convinced tone, told me "I guess it would be fine" and basically just kept walking mid sentence. My guest and I brushed it off, and started having a conversation while looking at the drink menu. To make a long story short, he was distant and we both got the "couldn't be bothered" vibe from him. When it came to the bill, we asked if it could be
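
One caveat with the `in` test: it matches substrings, so it also fires on words like _crude_, and it misses capitalized _Rude_. A hedged sketch of a stricter query uses a regular expression with word boundaries (`\b`) and case-insensitive matching. The three reviews are hypothetical, in the same one-object-per-line layout as the file.

```python
import json
import re

# Hypothetical reviews, one JSON object per line.
lines = [
    '{"stars": 1.0, "text": "The waiter was rude."}',
    '{"stars": 5.0, "text": "A crude but charming diner."}',
    '{"stars": 2.0, "text": "Rude staff, slow service."}',
]
reviews = [json.loads(l) for l in lines]

# \b marks a word boundary; IGNORECASE lets 'Rude' match as well.
rude = re.compile(r'\brude\b', flags=re.IGNORECASE)

substring_hits = [r['stars'] for r in reviews if 'rude' in r['text']]
word_hits = [r['stars'] for r in reviews if rude.search(r['text'])]

print('substring:', substring_hits)
print('word:     ', word_hits)
```

The substring test returns the first and second review, while the word-boundary test returns the first and third.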

## Topic 2: basic text processing
Once we read in a corpus, we can make it accessible to linguistic queries with some elementary techniques to transform raw text into a linguistically structured object:

* **sentence segmentation**: where are the boundaries between sentences?
* **word segmentation**: where are the boundaries between words?

The goal of these techniques is to be able to retrieve, or: **query**, for certain phenomena. If we want to extract all the sentences containing the word _wicked_, we should know what substrings of a text are sentences, and what substrings of each sentence are words.

Now, all these things can be done much better by freely available Python libraries, like SpaCy, which we'll encounter next week. However, it's good to understand what is involved in the process of doing segmentation, for instance because we might want to work on a language for which we have no SpaCy model.

## Topic 2a: Sentence Segmentation

Python allows us to conveniently read out a text line by line, where every newline character ('\n') separates a pair of lines. Let's iterate over the lines in a snippet from the Wikipedia article about the basketball player Kyle Lowry (https://en.wikipedia.org/wiki/Kyle_Lowry):

In [None]:
fh = open('lowry.txt')
for l in fh.readlines():
    print(l)

Kyle Terrell Lowry (born March 25, 1986) is an American professional basketball player for the Miami Heat of the National Basketball Association (NBA). He has been a six-time NBA All-Star and was named to the All-NBA Third Team in 2016. Lowry won an NBA championship with the Toronto Raptors in 2019, their first title in franchise history. 

He was a member of the U.S. national team that won a gold medal in the 2016 Summer Olympics. Lowry played two seasons of college basketball with the Villanova Wildcats before he was selected by the Memphis Grizzlies in the first round of the 2006 NBA draft with the 24th overall pick. 

He began his NBA career with Memphis and the Houston Rockets before being traded to Toronto. In his second season with the Raptors, he helped them reach the playoffs for the first time in seven years and win an Atlantic Division title during the 2013–14 season. In 2015–16, he led the Raptors to 56 wins in what was then the highest win total in franchise history, as we

### The simplest approach

As you can see, each line contains multiple sentences. How do we find the sentences within each line? A naive approach would take all end-of-sentence characters (like periods, question marks, etc.) and split on such characters when they are followed by a space (like '. ' or '? ').

We can use regular expressions to do so -- the `re` library meets all your needs. (If you want to learn more about how to use regular expressions in Python, see the documentation at https://docs.python.org/3/library/re.html).

Let's implement that and run it. What goes well, what not so much? (I'll be asking these reflection questions throughout Coding Project 2 -- as linguists, we're uniquely equipped not just to develop computational techniques, but also to assess them and find out what aspects of our linguistic intuition they capture well and less well.)

In [None]:
import re

fh = open('lowry.txt')
for l in fh.readlines():
    sentences = re.split('[.?!] ', l.strip('\n'))
    # this splits each line into a list of strings:
    # every time '. ', '? ' or '! ' occurs in the string, re.split() splits there.
    for s in sentences:
        print('*', s)
        # now we're iterating over these sentences and printing each of them, prefixed by a '*'

* Kyle Terrell Lowry (born March 25, 1986) is an American professional basketball player for the Miami Heat of the National Basketball Association (NBA)
* He has been a six-time NBA All-Star and was named to the All-NBA Third Team in 2016
* Lowry won an NBA championship with the Toronto Raptors in 2019, their first title in franchise history
* 
* He was a member of the U.S
* national team that won a gold medal in the 2016 Summer Olympics
* Lowry played two seasons of college basketball with the Villanova Wildcats before he was selected by the Memphis Grizzlies in the first round of the 2006 NBA draft with the 24th overall pick
* 
* He began his NBA career with Memphis and the Houston Rockets before being traded to Toronto
* In his second season with the Raptors, he helped them reach the playoffs for the first time in seven years and win an Atlantic Division title during the 2013–14 season
* In 2015–16, he led the Raptors to 56 wins in what was then the highest win total in franchise hi

### A tiny bit more sophistication

So, we can see it's not as simple as 'take all periods (exclamation marks, question marks) and split there'. Abbreviations (p.m., U.S.) pose a problem.
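
A small variation on the naive split makes the problem easy to inspect. The sketch below uses a lookbehind, `(?<=[.?!])`, so that the punctuation stays attached to its sentence instead of being consumed by the split. The example text is adapted from the Lowry snippet; the abbreviation is still split incorrectly.

```python
import re

text = 'He was a member of the U.S. national team. Lowry played two seasons at Villanova.'

# Split on a space, but only if the character before it is '.', '?' or '!'.
sentences = re.split(r'(?<=[.?!]) ', text)

for s in sentences:
    print('*', s)
```

This yields three pieces, the first of which ends in `U.S.`, so keeping the punctuation does not by itself solve the abbreviation problem.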

Define two sets of strings: 
* (1) end-of-sentence punctuation (e.g., '?', '.', and '!'), 
* (2) abbreviations containing such characters ('U.S.', 'U.N.')

The algorithm works as follows.

Go through the text, character by character. For each non-space character $c$ that is part of a space-bound character string (token) $s$:

* (a) If $c$ is one of the end-of-sentence punctuation characters, $c$ ends the sentence.
* (b) If $s$ is in the list of abbreviations, then $c$ doesn't end the sentence.
* (c) If the token $s'$ that follows $s$ is capitalized, then $c$ ends the sentence (after all).

This approach gets about 95% of sentence boundaries right (according to [Wikipedia](https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation)). But remember: **no approach is without errors** (Rule #1 of CL :))!

In the code below, certain parts of the code are missing, in particular, those implementing conditions (a)-(c). Implement them where it says ####.

In [None]:
import re
def sentence_segment(text, eos_punctuation, abbreviations):
    sentences = []
    start_sent = 0
    for end_sent in range(len(text)):
        last_space = next((i+1 for i in range(end_sent-1,-1,-1) if text[i] == ' '),0)
        next_space = next((i for i in range(end_sent, len(text),1) if text[i] == ' '),len(text))
        third_space = next((i for i in range(next_space+1,len(text),1) if text[i] == ' '),len(text))
        #
        token = text[last_space:next_space]
        next_token = text[next_space+1:third_space]
        #
        # print(start_sent, end_sent, text[end_sent], token, next_token)
        # you can uncomment this print statement to see the values for the relevant variables.
        #
        # conditions (a)-(c), spelled out step by step:
        is_eos_char = text[end_sent] in eos_punctuation  # (a)
        ends_token = end_sent + 1 == next_space  # the punctuation is token-final
        # (b) + (c): an abbreviation does not end the sentence
        # if the next token does not start with a capital letter
        abbreviation_mid_sentence = token in abbreviations and next_token.lower()[:1] == next_token[:1]
        at_end_of_text = end_sent + 1 == len(text)
        is_boundary = (is_eos_char and ends_token and not abbreviation_mid_sentence) or at_end_of_text
        if is_boundary:
            sentences.append(re.sub(r'[\n\s]+', ' ', text[start_sent: end_sent+1]))
            # collapse runs of whitespace (including newlines) into a single space
            start_sent = end_sent+2
    return sentences
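
For comparison, conditions (a)-(c) can also be expressed over whole tokens rather than characters. The following is a simplified sketch, not a replacement for the function above: the name `segment_by_tokens` is made up for illustration, whitespace is normalized by `split()`, and "capitalized" is tested with `isupper()` on the first character.

```python
def segment_by_tokens(text, eos_punctuation, abbreviations):
    """Split text into sentences using token-level versions of rules (a)-(c)."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ''
        ends_in_eos = token[-1] in eos_punctuation      # (a)
        is_abbreviation = token in abbreviations        # (b)
        next_is_capitalized = next_token[:1].isupper()  # (c)
        is_last = i + 1 == len(tokens)
        # The last token always closes a sentence; otherwise apply (a)-(c).
        if is_last or (ends_in_eos and (not is_abbreviation or next_is_capitalized)):
            sentences.append(' '.join(current))
            current = []
    return sentences


example = ('He came over at 4 pm. Then he watched the game, at 5 pm. '
           'between Canada and the U.S. Finally, he ordered take-out.')
result = segment_by_tokens(example, '.?!', ['U.S.', 'pm.'])
for sentence in result:
    print('*', sentence)
```

Working on tokens avoids the index arithmetic, at the cost of losing the original spacing of the text.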

### Try it out! (7 points)

Now try it out on the following obstacle course of a sentence: it contains two abbreviations in sentence final position but also one in non-final position. Does it break the sentence boundaries at the right points?

In [None]:
eos = '.?!'
abbreviations = ['U.S.', 'pm.']
#
s = 'He came over at 4 pm. Then he watched the game, at 5 pm. between Canada and the U.S. Finally, he ordered take-out.'
sentence_segment(s, eos, abbreviations)

['He came over at 4 pm.',
 'Then he watched the game, at 5 pm. between Canada and the U.S.',
 'Finally, he ordered take-out.']

In [None]:
assert sentence_segment(s, eos, abbreviations)[0] == 'He came over at 4 pm.'
assert sentence_segment(s, eos, abbreviations)[1] == 'Then he watched the game, at 5 pm. between Canada and the U.S.'
assert sentence_segment(s, eos, abbreviations)[2] == 'Finally, he ordered take-out.'

test_sentence = "The quick brown fox jumped over the lazy dog. The quick brown fox went to the U.S. at 3 pm."
test_segment = sentence_segment(test_sentence, eos, abbreviations)

print(test_segment)

assert len(test_segment) == 2
assert test_segment[0] == 'The quick brown fox jumped over the lazy dog.'
assert test_segment[1] == 'The quick brown fox went to the U.S. at 3 pm.'

['The quick brown fox jumped over the lazy dog.', 'The quick brown fox went to the U.S. at 3 pm.']


## Topic 2b: Word segmentation

Once we have sentences, the next step is to split each sentence into words, so we can query the texts for occurrences of certain words. The problem of determining what the words are is called word segmentation or word tokenization (creating individual word tokens out of a string).

What exactly the challenge is depends on the language. First, there is the problem of what a word is in the first place. Occurring between two spaces is a pretty decent criterion for many European languages. We will also consider another frequent type of challenge: finding word boundaries in languages written with no or few spaces (like Chinese), or with too many spaces (like Vietnamese).

Let's try out a naive 'space splitting' algorithm for word segmentation, namely by splitting on whitespace characters using the `.split()` method.

In [None]:
fh = open('lowry.txt')
for l in fh.readlines():
    sentences = sentence_segment(l, eos, abbreviations)
    for s in sentences:
        word_tokenized = s.split(' ')
        print(word_tokenized)

['Kyle', 'Terrell', 'Lowry', '(born', 'March', '25,', '1986)', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(NBA).']
['He', 'has', 'been', 'a', 'six-time', 'NBA', 'All-Star', 'and', 'was', 'named', 'to', 'the', 'All-NBA', 'Third', 'Team', 'in', '2016.']
['Lowry', 'won', 'an', 'NBA', 'championship', 'with', 'the', 'Toronto', 'Raptors', 'in', '2019,', 'their', 'first', 'title', 'in', 'franchise', 'history.']
['', '']
['He', 'was', 'a', 'member', 'of', 'the', 'U.S.', 'national', 'team', 'that', 'won', 'a', 'gold', 'medal', 'in', 'the', '2016', 'Summer', 'Olympics.']
['Lowry', 'played', 'two', 'seasons', 'of', 'college', 'basketball', 'with', 'the', 'Villanova', 'Wildcats', 'before', 'he', 'was', 'selected', 'by', 'the', 'Memphis', 'Grizzlies', 'in', 'the', 'first', 'round', 'of', 'the', '2006', 'NBA', 'draft', 'with', 'the', '24th', 'overall', 'pick.']
['', '']
['He', 'began', 'his', '
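
The `['', '']` entries in the output come from lines that contain nothing but whitespace: `split(' ')` splits on every single space and keeps the empty strings in between. Calling `split()` without an argument splits on any run of whitespace and drops empty strings. A short sketch on a made-up line:

```python
line = 'He  was a member.\n'  # note the double space and the trailing newline

on_single_space = line.split(' ')
on_any_whitespace = line.split()

print(on_single_space)
print(on_any_whitespace)

# A string consisting of a single space behaves very differently:
print(' '.split(' '))
print(' '.split())
```

Which variant is preferable depends on whether the empty strings carry information you want to keep.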

### Punctuation stripping

Even though space splitting is quite accurate, we do find issues: are compounds like _fan zone_ a single word or two (cf. _friendzone_)? And what about phrasal verbs like _hang up_? (Cf. the Dutch cognate _ophangen_, which is written without a space in the infinitive, but separately when the verb, but not the particle, 'moves' away from its base position in the indicative: _ik hang het op_ -- 'I hang it up'.)

In computational linguistics, two further issues have been central to research into word segmentation, as they are more relevant for practical purposes:
* the **co-occurrence of punctuation** with words (_history._ at the end of the third sentence occurs between spaces, but we would ideally have that token recognized as "the same" as _history_ elsewhere)
* the presence of **clitic and compound forms** (written with apostrophes in languages like English and French, as in _Mark's_ and _I've_; in other languages hyphenated, or written with a capital, as in isiZulu) and hyphenated forms (like _six-time_ in the Lowry text)

Let's consider the two in turn. First, dealing with 'attached' punctuation means that we want to strip it off the left and right edges of a space-bound substring of the text, and treat each punctuation character as its own token. The following function does just that. Read the code and see if you follow. (This is another kind of prompt in these notebooks: being able to read and understand code is a very useful skill for a computational linguist. We can talk about aspects of the code you didn't understand during tutorials.)

In [None]:
def punctuation_stripping(sentence, punct):
    word_tokenized = sentence.split(' ')
    clean_word_tokenized = []
    # start by running a 'naive' tokenization by splitting the string on the spaces
    # and initializing an empty list
    for word in word_tokenized:
        # then iterate over the list word_tokenized
        pre = []
        post = []
        # initialize two lists that will be filled with all the punctuation pre-word and post-word
        while len(word) > 0 and word[0] in punct:
            initial_char = word[0]
            word = word[1:]
            pre.append(initial_char)
            # stripping off all the punctuation before the word
        while len(word) > 0 and word[-1] in punct:
            final_char = word[-1]
            word = word[:-1]
            post = [final_char] + post
            # stripping off all the punctuation after the word
        new_segment = pre + [word] + post
        # creating a newly segmented list of all the pre-word punctuation, 
        # post-word punctuation and the word itself
        # print(new_segment)
        clean_word_tokenized.extend(new_segment)
        # extend the clean word tokenized list with that new_segment list
    return clean_word_tokenized

Here's a bit of code to help you compare them. What goes better in the punctuation-stripped version? Where are there still errors?

In [None]:
eos = '.?!'
abbreviations = ['U.S.', 'pm.']
punct = '.?!,[]()"\';:'
#
fh = open('lowry.txt')
for l in fh.readlines():
    sentences = sentence_segment(l, eos, abbreviations)
    for sentence in sentences:
        naive_word_tokenized = sentence.split(' ')
        clean_word_tokenized = punctuation_stripping(sentence, punct)
        print('naive:', naive_word_tokenized)
        print('clean:', clean_word_tokenized)
        print()

naive: ['Kyle', 'Terrell', 'Lowry', '(born', 'March', '25,', '1986)', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(NBA).']
clean: ['Kyle', 'Terrell', 'Lowry', '(', 'born', 'March', '25', ',', '1986', ')', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(', 'NBA', ')', '.']

naive: ['He', 'has', 'been', 'a', 'six-time', 'NBA', 'All-Star', 'and', 'was', 'named', 'to', 'the', 'All-NBA', 'Third', 'Team', 'in', '2016.']
clean: ['He', 'has', 'been', 'a', 'six-time', 'NBA', 'All-Star', 'and', 'was', 'named', 'to', 'the', 'All-NBA', 'Third', 'Team', 'in', '2016', '.']

naive: ['Lowry', 'won', 'an', 'NBA', 'championship', 'with', 'the', 'Toronto', 'Raptors', 'in', '2019,', 'their', 'first', 'title', 'in', 'franchise', 'history.']
clean: ['Lowry', 'won', 'an', 'NBA', 'championship', 'with', 
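
For comparison, a similar result can be obtained with a single regular expression. This is an alternative technique, not the loop-based method used above: the hypothetical pattern below matches either a run of word characters (optionally joined by hyphens or apostrophes) or a single punctuation character.

```python
import re

# Either a word (possibly with internal '-' or "'"), or one non-space, non-word character.
token_pattern = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

sentence = 'Lowry (born March 25, 1986) has been a six-time NBA All-Star.'
tokens = token_pattern.findall(sentence)
print(tokens)
```

Unlike the stripping function, this pattern also separates punctuation in the middle of a token, so an abbreviation like _U.S._ would be broken into four tokens.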

### Clitic/compound splitting

The second issue pertains to clitics and compounds. This one is a little more straightforward. If we can identify the clitic/compound boundary marker (`'` for the former and `-` for the latter in English), we can simply split words on them. That does leave strange-looking forms like _ve_ and _m_, but at least they're recognizable. Read the code below and see if you can follow.

In [None]:
def clitic_compound_splitting(tokenized_sentence, characters):
    new_tokenized_sentence = []
    for word in tokenized_sentence:
        # re.escape keeps '-' literal, wherever it occurs in `characters`
        new_tokenized_sentence.extend(re.split('[' + re.escape(characters) + ']', word))
    return new_tokenized_sentence

Again, with some code to try it out. Run it and find the error (a case where tokenization should happen but it doesn't). Then update the `clitic_compound` variable to make the system catch this error.

In [None]:
eos = '.?!'
abbreviations = ['U.S.', 'pm.']
punct = '.?!,[]()"\';:'
clitic_compound = "-–'"

In [None]:
def cleaner_tokenized_check():
    return clitic_compound

In [None]:
fh = open('lowry.txt')
for l in fh.readlines():
    sentences = sentence_segment(l, eos, abbreviations)
    for sentence in sentences:
        naive_word_tokenized = sentence.split(' ')
        clean_word_tokenized = punctuation_stripping(sentence, punct)
        cleaner_word_tokenized = clitic_compound_splitting(clean_word_tokenized, clitic_compound)
        print('naive:  ', naive_word_tokenized)
        print('clean:  ', clean_word_tokenized)
        print('cleaner:', cleaner_word_tokenized)
        print()

naive:   ['Kyle', 'Terrell', 'Lowry', '(born', 'March', '25,', '1986)', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(NBA).']
clean:   ['Kyle', 'Terrell', 'Lowry', '(', 'born', 'March', '25', ',', '1986', ')', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(', 'NBA', ')', '.']
cleaner: ['Kyle', 'Terrell', 'Lowry', '(', 'born', 'March', '25', ',', '1986', ')', 'is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(', 'NBA', ')', '.']

naive:   ['He', 'has', 'been', 'a', 'six-time', 'NBA', 'All-Star', 'and', 'was', 'named', 'to', 'the', 'All-NBA', 'Third', 'Team', 'in', '2016.']
clean:   ['He', 'has', 'been', 'a', 'six-time', 'NBA', 'All-Star', 'and', 'was', 'named', 'to', 'the', 'All-NBA'

### Word Tokenization beyond English

For many languages, in particular East and Southeast Asian languages like Chinese, Japanese, Thai, or Burmese, text is written with few or no spaces between words. We can see this if we look at the first lines of the Wikipedia article for Kyle Lowry in Chinese.


In [None]:
txt = open('lowry_zh.txt').read()
sentences = [x + '。' for x in txt.split('。') if x]
for s in sentences:
    print(s)

有着强悍的爆发力和水准之上的控球能力。
个人突破得益于强壮的身体，故经常能杀入三秒区得分，但中远投篮是软肋，来到休士顿火箭后投篮能力有所提高。
组织进攻能力也在水准之上，但容易受情绪干扰，不够稳定，有时候失误较多。
在防守上，由于爆发力、速度、力量都在联盟控球后卫平均水准之上，无论是对持球者的压迫和封堵传球线路，洛里都做的较为出色，在一些场次可以防守对方的得分后卫亦不吃亏，此外篮板能力在联盟后卫中算一流水准，衝抢前场篮板更是个人一大特色，過去幾年常常被人指責季後賽表現不佳，但在東區決賽克服對決密爾瓦基公鹿2:0的劣勢，連勝4場晉級NBA總決賽，以及在2019年NBA總決賽上的優異表現幫助多倫多暴龍奪得首次NBA總冠軍，打破大家對他的質疑。


#### The goal: finding compounds

Let's take our first sentence above:

有着强悍的爆发力和水准之上的控球能力。

Google Translate provides the romanization: Yǒuzhe qiánghàn de bàofālì hé shuǐzhǔn zhī shàng de kòng qiú nénglì, with the rough gloss: 'have tough of explosive.force with level of superior of control ball ability'. That means that the segmentation (at least according to Google Translate) is:

['有着', '强悍', '的', '爆发力', '和', '水准', '之', '上', '的', '控', '球', '能力', '。']

(**Corrections welcome!!**)

So, we can't just scatter the string into one-character tokens; if we did so, we'd miss the four two-character tokens and the one three-character token in this sentence!
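
To make that concrete, here is a small sketch that compares the character-by-character split with the segmentation given above. The `gold` list is just the Google Translate segmentation copied from this notebook, so it carries the same caveats.

```python
sentence = '有着强悍的爆发力和水准之上的控球能力。'

# segmentation copied from the text above (not independently verified)
gold = ['有着', '强悍', '的', '爆发力', '和', '水准', '之', '上', '的', '控', '球', '能力', '。']

# naive approach: every character becomes its own token
chars = list(sentence)

# the tokens that the naive approach would break apart
multi_char = [word for word in gold if len(word) > 1]

print(len(chars), 'characters vs.', len(gold), 'tokens')
print(multi_char)
```

The five words in `multi_char` are exactly the ones a one-character-per-token split would get wrong.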

#### A simple approach: Maximum Matching

A naive, but relatively accurate, approach to word segmentation in languages written without spaces is **Maximum Matching**: start at a certain character index $i$, find the longest substring starting at index $i$ that exists in a word list, and declare that to be a word. Then set $i$ to be the end of the last-found word and repeat. This is the approach at the heart of this paper, for instance: https://www.aclweb.org/anthology/C96-1035.pdf.

Here, we're using the wordlist from SUBTLEX-CH, a wordlist of the most frequent 100,000 words in Mandarin Chinese (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729), used for psycholinguistic experimentation. The list is given in the file `words_zh.txt`. Run the code cell below to read in the list and print the first 30 words (I added the two punctuation characters as words to the list).

In [None]:
all_words = open('words_zh.txt').read().split('\n')
print(all_words[:30])

['。', '，', '的', '我', '你', '是', '了', '不', '在', '他', '我们', '好', '有', '这', '就', '会', '吗', '要', '什么', '说', '她', '想', '一', '很', '知道', '人', '吧', '那', '来', '都']


We can run Maximum Matching manually to see how it works. Let's go through the start of the first sentence step by step.

有着强悍的爆发力和水准之上的控球能力

We start by setting the left boundary to 0, and the right boundary to 1. Then we look at the string between indices 0 and 1, i.e., '有', and check if it occurs in the word list. It does! 

So we continue to the next step and we increase the right boundary to 2. The string now is '有着', which is also found in the word list, so we continue.

Increasing the right boundary next to 3 means we have '有着强' as the hypothesized word. That string does not occur in the word list, so 3 is too much, and the boundary is placed before 3, at 2, with the found word being '有着'. 

Now we set the left boundary at 2 and start again. The substring from 2 to 3 ('强') is in the list, so we continue. So is 2 to 4 ('强悍'), but 2 to 5 is not, so the second word is '强悍'. 

Below a few more words are worked out this way (the first two numbers give the character span indices, the third element is the character string, the fourth whether this string occurs in the SUBTLEX-CH list).

* 0 1 有 True
* 0 2 有着 True
* 0 3 有着强 False
* 2 3 强 True
* 2 4 强悍 True
* 2 5 强悍的 False
* 4 5 的 True
* 4 6 的爆 False
* 5 6 爆 True
* 5 7 爆发 True
* 5 8 爆发力 True
* 5 9 爆发力和 False
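
The trace above can be reproduced with a few lines of code. This is only a sketch for following the bookkeeping: `toy_words` is a hypothetical stand-in for the SUBTLEX-CH list containing just the entries needed here, and `trace_matching` and its `stop_left` parameter are names I made up for this illustration.

```python
# hypothetical mini word list; the real list is read from words_zh.txt
toy_words = {'有', '有着', '强', '强悍', '的', '爆', '爆发', '爆发力'}
sentence = '有着强悍的爆发力和水准之上的控球能力'

def trace_matching(sentence, words, stop_left):
    """Record every lookup (left, right, substring, found) until the
    left boundary reaches stop_left."""
    rows = []
    left, right = 0, 1
    while left < stop_left and right <= len(sentence):
        candidate = sentence[left:right]
        found = candidate in words
        rows.append((left, right, candidate, found))
        if found:
            right += 1                      # try a longer candidate
        elif right - left > 1:
            left = right - 1                # word ended one character earlier
            right = left + 1
        else:
            left, right = right, right + 1  # unknown single character: skip it
    return rows

for row in trace_matching(sentence, toy_words, stop_left=8):
    print(*row)
```

Each printed row has the same four columns as the list above: the two span indices, the substring, and whether it was found.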

Here's a fun challenge: try to implement the maximum match algorithm in the function `longest_match_tokenize` below. The function takes two arguments: an unsegmented sentence (a string) and a word list (a list of strings). I will reveal the solution in the tutorial.

In [None]:
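def longest_match_tokenize(sentence, all_words):
    ## implement the longest match algorithm here
    vocabulary = set(all_words)  # set lookups are much faster than list lookups
    words = []
    start = 0
    for i in range(len(sentence)):
        # keep extending the candidate while it is still in the word list;
        # i > start keeps an unknown single character from yielding an empty word
        if i > start and sentence[start:i+1] not in vocabulary:
            words.append(sentence[start:i])
            start = i
    if start < len(sentence):
        # the last word may be longer than one character
        words.append(sentence[start:])
    return words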

In [None]:
txt = open('lowry_zh.txt').read()
all_words = open('words_zh.txt').read().split('\n')
eos = '。'
abbrev = []
sentences = [x + '。' for x in txt.split('。') if x]
for s in sentences:
    print(s)
    words = longest_match_tokenize(s, all_words)
    print(words)

有着强悍的爆发力和水准之上的控球能力。
['有着', '强悍', '的', '爆发力', '和', '水准', '之上', '的', '控', '球', '能力', '。']
个人突破得益于强壮的身体，故经常能杀入三秒区得分，但中远投篮是软肋，来到休士顿火箭后投篮能力有所提高。
['个人', '突破', '得益', '于', '强壮', '的', '身体', '，', '故', '经常', '能', '杀', '入', '三秒', '区', '得分', '，', '但', '中', '远投', '篮', '是', '软', '肋', '，', '来到', '休', '士', '顿', '火箭', '后', '投篮', '能力', '有所', '提高', '。']
组织进攻能力也在水准之上，但容易受情绪干扰，不够稳定，有时候失误较多。
['组织', '进攻', '能力', '也', '在', '水准', '之上', '，', '但', '容易', '受', '情绪', '干扰', '，', '不够', '稳定', '，', '有时候', '失误', '较', '多', '。']
在防守上，由于爆发力、速度、力量都在联盟控球后卫平均水准之上，无论是对持球者的压迫和封堵传球线路，洛里都做的较为出色，在一些场次可以防守对方的得分后卫亦不吃亏，此外篮板能力在联盟后卫中算一流水准，衝抢前场篮板更是个人一大特色，過去幾年常常被人指責季後賽表現不佳，但在東區決賽克服對決密爾瓦基公鹿2:0的劣勢，連勝4場晉級NBA總決賽，以及在2019年NBA總決賽上的優異表現幫助多倫多暴龍奪得首次NBA總冠軍，打破大家對他的質疑。
['在', '防守', '上', '，', '由于', '爆发力', '、', '速度', '、', '力量', '都', '在', '联盟', '控', '球', '后卫', '平均', '水准', '之上', '，', '无论是', '对', '持球', '者', '的', '压迫', '和', '封堵', '传球', '线路', '，', '洛', '里', '都', '做', '的', '较为', '出色', '，', '在', '一些', '场次', '可以', '防守', '对方', '的', '得分', '后卫', '亦', '不', '吃亏', '，',

In [None]:
words = longest_match_tokenize(sentences[0], all_words)
assert words[0] == '有着'
assert words[1] == '强悍'
assert words[2] == '的'
assert words[3] == '爆发力'
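
One thing worth noticing: the step-by-step procedure above stops extending as soon as a candidate is missing from the list, so it would miss a long word whose shorter prefix happens not to be listed. A more literal reading of "find the longest substring" tries the longest candidate first and shrinks from there. Below is a minimal sketch of that variant. The function name `max_match_longest_first`, the `max_len` cap and the toy word list (in which '有时' is deliberately left out) are all assumptions made for this illustration.

```python
def max_match_longest_first(sentence, words, max_len=4):
    """Segment by trying the longest candidate first, then shrinking."""
    tokens = []
    start = 0
    while start < len(sentence):
        longest_end = min(len(sentence), start + max_len)
        for end in range(longest_end, start, -1):
            candidate = sentence[start:end]
            # fall back to a single character if nothing longer matches
            if candidate in words or end == start + 1:
                tokens.append(candidate)
                start = end
                break
    return tokens

# hypothetical toy list: '有时候' is listed, its prefix '有时' is not
toy_words = {'有', '有时候', '时', '候', '失误'}
print(max_match_longest_first('有时候失误', toy_words))
# prints ['有时候', '失误']
```

With this toy list, extending one character at a time would stop at '有', because '有时' is not listed, and never reach '有时候'.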

## Topic 3: Querying

Now we can use our functions to run corpus queries, to extract all instances of certain tokens. In the last topic of this notebook, we're exploring how to do so.

We begin by encapsulating the material developed so far into a function. Read the function below and make sure you understand it:

In [None]:
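def read_corpus(filename, punct, clitic_compound):
    corpus = []
    with open(filename) as fh:  # the file is closed automatically afterwards
        for l in fh:
            # note: `punct` is passed as the end-of-sentence characters and
            # `abbreviations` is the global list defined earlier, so sentence
            # boundaries are also placed after commas and brackets
            sentences = sentence_segment(l, punct, abbreviations)
            for sentence in sentences:
                # strip punctuation first, then split clitics and compounds
                clean_word_tokenized = clitic_compound_splitting(punctuation_stripping(sentence, punct), clitic_compound)
                corpus.append(clean_word_tokenized)
    return corpus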

### reading in a corpus
We now have a function for reading in text files and doing some basic English preprocessing (sentence and word segmentation) on them. The output is a variable containing a list of lists, with the outer list being the sentences, and the inner lists the words per sentence.

Run the code below to see what that looks like:

In [None]:
corpus = read_corpus('lowry.txt', punct, clitic_compound)
for sentence in corpus:
    print(sentence)

['Kyle', 'Terrell', 'Lowry', '(', 'born', 'March', '25', ',']
['1986', ')']
['is', 'an', 'American', 'professional', 'basketball', 'player', 'for', 'the', 'Miami', 'Heat', 'of', 'the', 'National', 'Basketball', 'Association', '(', 'NBA', ')', '.']
['He', 'has', 'been', 'a', 'six', 'time', 'NBA', 'All', 'Star', 'and', 'was', 'named', 'to', 'the', 'All', 'NBA', 'Third', 'Team', 'in', '2016', '.']
['Lowry', 'won', 'an', 'NBA', 'championship', 'with', 'the', 'Toronto', 'Raptors', 'in', '2019', ',']
['their', 'first', 'title', 'in', 'franchise', 'history', '.']
['', '']
['He', 'was', 'a', 'member', 'of', 'the', 'U.S', '.', 'national', 'team', 'that', 'won', 'a', 'gold', 'medal', 'in', 'the', '2016', 'Summer', 'Olympics', '.']
['Lowry', 'played', 'two', 'seasons', 'of', 'college', 'basketball', 'with', 'the', 'Villanova', 'Wildcats', 'before', 'he', 'was', 'selected', 'by', 'the', 'Memphis', 'Grizzlies', 'in', 'the', 'first', 'round', 'of', 'the', '2006', 'NBA', 'draft', 'with', 'the', '24th
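
A corpus in this list-of-lists shape is easy to summarise with plain Python. The sketch below counts sentences, tokens and word frequencies. It uses a made-up `toy_corpus` of two sentences rather than the real output of `read_corpus`, so the numbers are only illustrative.

```python
from collections import Counter

# hypothetical two-sentence corpus in the same shape as the output above
toy_corpus = [
    ['He', 'was', 'a', 'member', 'of', 'the', 'team', '.'],
    ['Lowry', 'won', 'a', 'title', 'with', 'the', 'Raptors', '.'],
]

n_sentences = len(toy_corpus)                              # outer list: sentences
n_tokens = sum(len(sentence) for sentence in toy_corpus)   # inner lists: words

# flatten the nested lists and count how often each token occurs
freq = Counter(word for sentence in toy_corpus for word in sentence)

print(n_sentences, 'sentences,', n_tokens, 'tokens')
print(freq['the'], freq['Lowry'])
```

The same three lines work on the `corpus` variable from the cell above once it has been read in.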

### Querying in a segmented corpus

What can you do with a corpus represented this way? For one thing, find tokens of a word in context. We will discuss more complex queries in the following weeks (grammatical categories, grammatical relations). But let's first look at a way of formulating a basic query: print all instances of a particular word with its context words to the left and to the right (a so-called keyword-in-context or KWIC search).

See if you can follow the code below:

In [None]:
target_word = 'Memphis'

for sentence in corpus:
    for i in range(len(sentence)):
        if sentence[i] == target_word:
            print(sentence[:i], sentence[i], sentence[i+1:]) 
            # split into all words to the left of the target, the target itself
            # and all words to the right of the target

['Lowry', 'played', 'two', 'seasons', 'of', 'college', 'basketball', 'with', 'the', 'Villanova', 'Wildcats', 'before', 'he', 'was', 'selected', 'by', 'the'] Memphis ['Grizzlies', 'in', 'the', 'first', 'round', 'of', 'the', '2006', 'NBA', 'draft', 'with', 'the', '24th', 'overall', 'pick', '.']
['He', 'began', 'his', 'NBA', 'career', 'with'] Memphis ['and', 'the', 'Houston', 'Rockets', 'before', 'being', 'traded', 'to', 'Toronto', '.']


Now let's encapsulate that in a function, so we can call it when we want to!

In [None]:
def get_kwic(target_word, corpus):
    for sentence in corpus:
        for i in range(len(sentence)):
            if sentence[i] == target_word:
                print(sentence[:i], sentence[i], sentence[i+1:]) 
                # split into all words to the left, the target
                # and all words to the right

In [None]:
get_kwic('Lowry', corpus)

['Kyle', 'Terrell'] Lowry ['(', 'born', 'March', '25', ',']
[] Lowry ['won', 'an', 'NBA', 'championship', 'with', 'the', 'Toronto', 'Raptors', 'in', '2019', ',']
[] Lowry ['played', 'two', 'seasons', 'of', 'college', 'basketball', 'with', 'the', 'Villanova', 'Wildcats', 'before', 'he', 'was', 'selected', 'by', 'the', 'Memphis', 'Grizzlies', 'in', 'the', 'first', 'round', 'of', 'the', '2006', 'NBA', 'draft', 'with', 'the', '24th', 'overall', 'pick', '.']
[] Lowry ['is', 'seen', 'as', 'the', 'greatest', 'Toronto', 'Raptor', 'player', 'of', 'all', 'time', 'due', 'to', 'his', 'work', 'in', 'turning', 'the', 'franchise', 'around', '.']
['The', 'Raptors', 'had', 'their', 'greatest', 'success', 'under'] Lowry [',']
['The', 'Raptors', 'regularly', 'improved', 'their', 'win', 'totals', 'with'] Lowry ['at', 'the', 'helm', '.', '']


Next, let's make it look pretty. Run the code below, see if you follow each step, and take note where you don't.

In [None]:
def left_align(words, n):
    padding = ' ' * n 
    # str.join concatenates the items of the list, separated by the string
    # the method is called on (here: a single space)
    string = ' '.join(words)
    # pad on the right, then keep only the first n characters
    full_string = string + padding
    cut_string = full_string[:n]
    return cut_string

def right_align(words, n):
    padding = ' ' * n
    string = ' '.join(words)
    full_string = padding + string
    cut_string = full_string[-n:]
    return cut_string
    
def get_kwic(target_word, corpus, n):
    for sentence in corpus:
        for i in range(len(sentence)):
            if sentence[i] == target_word:
                left_ctx = right_align(sentence[:i], n) 
                right_ctx = left_align(sentence[i+1:], n)
                print(left_ctx + ' ' + sentence[i] + ' ' + right_ctx) 
                # split into all words to the left, the target
                # and all words to the right

get_kwic('of', corpus, 30)

ball player for the Miami Heat of the National Basketball Associ
               He was a member of the U.S . national team that w
      Lowry played two seasons of college basketball with the Vi
s Grizzlies in the first round of the 2006 NBA draft with the 24
greatest Toronto Raptor player of all time due to his work in tu
ence Finals births to all five of their 50 wins campaigns .     
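
As a side note, Python's built-in `str.ljust` and `str.rjust` do the same padding as `left_align` and `right_align`. The sketch below is an assumed variant, not the version used in this notebook: the name `kwic_lines` and the toy corpus are mine, and it returns the lines instead of printing them, which makes them easier to check or save. It assumes `n` is at least 1.

```python
def kwic_lines(target_word, corpus, n):
    """Return one fixed-width KWIC line per hit of target_word."""
    lines = []
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word == target_word:
                # keep the last n characters of the left context, pad on the left
                left = ' '.join(sentence[:i])[-n:].rjust(n)
                # keep the first n characters of the right context, pad on the right
                right = ' '.join(sentence[i+1:])[:n].ljust(n)
                lines.append(left + ' ' + word + ' ' + right)
    return lines

# hypothetical toy corpus in the list-of-lists format
toy_corpus = [
    ['He', 'was', 'a', 'member', 'of', 'the', 'team', '.'],
    ['of', 'course'],
]
for line in kwic_lines('of', toy_corpus, 10):
    print(line)
```

Every returned line has the same length, so the keyword ends up in the same column for each hit.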


# Homework assignment

* In this homework assignment, you will look at how to study the output of our preprocessing functions for accuracy.
* You will also get started working on the corpus that will be the topic of the final assignment for this module.

### Part 1

Explore the 'yelp_review_sample.json' file, containing a corpus of Yelp reviews. Read it in as a list of dictionaries (following the JSON structure), where you assign to each dictionary an additional key-value pair mapping the key 'segmented' to a list (of sentences) of lists (of words). It might take a minute to read it all in!

Then, iterate over the corpus, print the first 3 reviews, and inspect the segmented output. Does it line up with how you think the sentences and words should be segmented?

### Part 2

Think of and implement small ways to improve the sentence and word segmentation. This could involve small changes to the lists/strings of special characters (punct, clitic_compound, abbreviations), or to the algorithms themselves!

### Part 3

Update the `get_kwic` function so that it can iterate over the Yelp corpus (its structure is a bit different from the structure of the Lowry corpus). Then query the Yelp corpus for the word 'waiter'. Which determiners does 'waiter' typically occur with in this corpus? What are typical things that are said about waiters in the corpus?