# Word Count, Phrase Analysis, Cross-Corpus Analysis

When learning English, some words and phrases are overused and others are rarely used; which ones depends on the corpus being examined. Here, we use word counts, phrase analysis, and cross-corpus analysis to identify the phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/) (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/), a learner corpus drawn from Lang-8, a language-exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.


https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing


## N-gram counting
We will tokenize the text and count how often each token occurs. The tokenization rules in this lab are:
 1. Ignore case (e.g., "The" is the same as "the")
 2. Split on whitespace <s>and punctuation</s>
 3. Ignore all punctuation
<br><br>
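As a minimal sketch of how the three rules above could be applied literally, the hypothetical helper `tokenize_by_rules` (not part of the lab code) lowercases the text, deletes punctuation, and then splits on whitespace. One assumption here is that digits are kept, since the rules do not mention them.

```python
import string

def tokenize_by_rules(text):
  # Hypothetical helper, for illustration only.
  # Rule 1: ignore case.
  lowered = text.lower()
  # Rule 3: ignore punctuation by deleting every punctuation character.
  no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
  # Rule 2: split on any run of whitespace (spaces, tabs, newlines).
  return no_punct.split()

print(tokenize_by_rules("The cat's toy, the CAT!"))
# ['the', 'cats', 'toy', 'the', 'cat']
```

Under this reading a contraction such as "cat's" stays one token ("cats") instead of being split at the apostrophe.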

In [1]:
import os
import re

In [2]:
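def tokenize(text):
  # Lowercase the text and keep only runs of a-z.
  # Digits and punctuation (including apostrophes) act as separators.
  return re.findall(r"[a-z]+", text.lower())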

print(tokenize("This is an  example. Don't 123 it!"))

['this', 'is', 'an', 'example', 'don', 't', 'it']


In [3]:
from collections import Counter

def calculate_frequency(tokens_ptr):
  return Counter(tokens_ptr)

print(calculate_frequency(tokenize("This is an example. This is great!")))

Counter({'this': 2, 'is': 2, 'an': 1, 'example': 1, 'great': 1})


In [4]:
def get_ngram(tokens_ptr, n=2):
  # Slide a window of size n over the tokens and join each window with spaces.
  return [" ".join(tokens_ptr[i:i + n]) for i in range(len(tokens_ptr) - n + 1)]

print(get_ngram(tokenize("This is an example."), n=2))

['this is', 'is an', 'an example']


In [5]:
file_path = os.path.join("data", "bnc.txt")
BNC_unigram = get_ngram(tokenize(open(file_path).read()), n=1)
BNC_unigram_counter = calculate_frequency(BNC_unigram)

In [6]:
BNC_unigram_counter.most_common(10)

[('the', 4548618),
 ('of', 2203787),
 ('to', 1984568),
 ('and', 1937703),
 ('a', 1694756),
 ('in', 1367697),
 ('it', 858498),
 ('is', 791768),
 ('i', 743919),
 ('that', 725841)]

In [7]:
file_path = os.path.join("data", "lang8.txt")
lang_unigram = get_ngram(tokenize(open(file_path).read()), n=1)
lang_unigram_counter = calculate_frequency(lang_unigram)

In [8]:
lang_unigram_counter.most_common(10)

[('the', 434759),
 ('of', 231932),
 ('to', 172145),
 ('and', 171913),
 ('in', 131414),
 ('a', 121824),
 ('is', 100122),
 ('that', 71961),
 ('as', 61219),
 ('be', 53291)]

## Rank
Rank unigrams by their frequencies. The higher the frequency, the higher the rank (the most frequent unigram has rank 1).<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
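One possible way to turn frequencies into ranks is sketched below on a toy sentence. The helper name `rank_by_frequency` is hypothetical. It relies on `Counter.most_common()`, which returns items from most to least frequent and keeps first-seen order among ties, so tied n-grams still receive distinct ranks.

```python
from collections import Counter

def rank_by_frequency(counter):
  # Hypothetical helper: rank 1 goes to the most frequent n-gram.
  # Ties are broken by the order in which n-grams were first counted.
  return {ngram: rank
          for rank, (ngram, _) in enumerate(counter.most_common(), start=1)}

toy_counter = Counter("the cat saw the dog and the cat".split())
toy_ranks = rank_by_frequency(toy_counter)
print(toy_ranks)
# {'the': 1, 'cat': 2, 'saw': 3, 'dog': 4, 'and': 5}
```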

In [9]:
lang_unigram_sorted = lang_unigram_counter.most_common()
lang_unigram_sorted[0:5]

[('the', 434759),
 ('of', 231932),
 ('to', 172145),
 ('and', 171913),
 ('in', 131414)]

In [10]:
lang_unigram_Rank = {}
for idx in range(len(lang_unigram_sorted)):
  lang_unigram_Rank[lang_unigram_sorted[idx][0]] = idx + 1
print(lang_unigram_Rank["the"]) # rank 1
print(lang_unigram_Rank["in"]) # rank 5
print(lang_unigram_Rank["dont"])

1
5
38685


In [11]:
BNC_unigram_sorted = BNC_unigram_counter.most_common()
BNC_unigram_sorted[0:5]

[('the', 4548618),
 ('of', 2203787),
 ('to', 1984568),
 ('and', 1937703),
 ('a', 1694756)]

In [12]:
BNC_unigram_Rank = {}
for idx in range(len(BNC_unigram_sorted)):
  BNC_unigram_Rank[BNC_unigram_sorted[idx][0]] = idx + 1
print(BNC_unigram_Rank["the"]) # rank 1
print(BNC_unigram_Rank["a"]) # rank 5
print(BNC_unigram_Rank["dont"])

1
5
154212


## Calculate Rank Ratio
In this step, you need to match each unigram across the two datasets and calculate its Rank Ratio.  <br>Please use the following formula to calculate the Rank Ratio:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank in BNC}}{\text{Rank in Lang-8}}$
<br><br>
If a unigram does not appear in BNC, its BNC rank is treated as 1.

<span style="color: red">[ TODO ]</span> Please calculate all rank ratios of unigrams in Lang-8.
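The formula can be checked by hand on a small example. The sketch below uses toy rank tables: the ranks for "the" and "in" follow the top-10 lists printed earlier ("in" is 5th in Lang-8 and 6th in BNC), while "xyzzy" and its rank are made up to show the fallback rule. The helper name `rank_ratio` is hypothetical.

```python
def rank_ratio(ngram, lang_rank, bnc_rank, missing_bnc_rank=1):
  # Hypothetical helper: BNC rank divided by Lang-8 rank.
  # An n-gram absent from BNC is given BNC rank 1, as the rule above states.
  return bnc_rank.get(ngram, missing_bnc_rank) / lang_rank[ngram]

toy_lang_rank = {"the": 1, "in": 5, "xyzzy": 8}  # "xyzzy" is a made-up word
toy_bnc_rank = {"the": 1, "in": 6}

print(rank_ratio("the", toy_lang_rank, toy_bnc_rank))    # 1.0
print(rank_ratio("in", toy_lang_rank, toy_bnc_rank))     # 1.2
print(rank_ratio("xyzzy", toy_lang_rank, toy_bnc_rank))  # 0.125
```

A ratio above 1 means the n-gram ranks higher (is relatively more frequent) in Lang-8 than in BNC. Because of the fallback rank of 1, n-grams missing from BNC always get a ratio of at most 1.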

In [13]:
def get_ratios(lang_ngram_Rank, BNC_ngram_Rank):
  ratios_ptr = []
  for unigram in lang_ngram_Rank:
    bnc_rank = 1
    if unigram in BNC_ngram_Rank:
      bnc_rank = BNC_ngram_Rank[unigram]
    ratios_ptr.append((unigram, bnc_rank / lang_ngram_Rank[unigram]))
  return ratios_ptr

get_ratios(lang_unigram_Rank, BNC_unigram_Rank)[0:5]

[('the', 1.0), ('of', 1.0), ('to', 1.0), ('and', 1.0), ('in', 1.2)]

## Sort the Result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>
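If only the top few entries are needed, one alternative to sorting the full list is `heapq.nlargest` from the standard library. This is a sketch of a different technique, not the method this notebook uses; the helper name `top_k_by_ratio` and the ratio values in the toy list are illustrative.

```python
import heapq

def top_k_by_ratio(ratios, k=30):
  # Hypothetical helper: ratios is a list of (ngram, rank_ratio) pairs.
  # nlargest returns the k pairs with the largest ratio, in descending order.
  return heapq.nlargest(k, ratios, key=lambda pair: pair[1])

toy_ratios = [("the", 1.0), ("internet", 63.1), ("in", 1.2), ("cannot", 372.0)]
print(top_k_by_ratio(toy_ratios, k=2))
# [('cannot', 372.0), ('internet', 63.1)]
```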

In [14]:
import pandas as pd

frame_ptr = pd.DataFrame(get_ratios(lang_unigram_Rank, BNC_unigram_Rank), columns=["unigram", "rank ratio"])
frame_ptr.sort_values(by=["rank ratio"], ascending=False, inplace=True)
print(frame_ptr.head(30))

            unigram  rank ratio
264          cannot  372.022642
760        internet   63.140604
743              eu   48.044355
5626           nisa   44.252532
257            ibid   43.945736
5391       radstone   41.311758
4307            uht   39.821031
2156          doesn   38.055169
5701   anthocyanins   36.028937
6616       germline   35.516397
1643  globalisation   33.952555
5579            mpa   31.855197
6564          creon   31.266108
5006    pneumophila   30.003395
3390            wto   29.777942
8636         bryman   29.211184
6664       manydown   28.852813
7318   microneedles   28.007925
8888           rtas   27.561368
8520          punic   25.952940
4454          iliad   25.399776
8757           livy   25.261018
9018            teg   24.688990
6421            psk   24.615852
6450         qualia   24.407534
6712      ductility   24.401013
9164            wep   24.318058
7744           perl   23.835765
6205          drude   23.759587
9577         hashmi   23.583316


## For Bigrams
<span style="color: red">[ TODO ]</span> Do the same thing for bigrams.  
Hint:  
1. generate all bigrams for BNC / lang8  
2. calculate the frequency of each bigram  
3. rank bigrams by frequency  
4. calculate the rank ratio of each bigram
5. print out the top 30 highest rank ratio bigrams  
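
The five steps above can be sketched end to end on two toy token lists. All helper names below (`bigrams`, `ranks`, `top_rank_ratios`) are hypothetical, and the two short "corpora" are made up for illustration; the real run uses the BNC and Lang-8 files.

```python
from collections import Counter

def bigrams(tokens):
  # Step 1: pair each token with its successor.
  return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def ranks(counter):
  # Step 3: rank 1 is the most frequent bigram.
  return {gram: r for r, (gram, _) in enumerate(counter.most_common(), start=1)}

def top_rank_ratios(learner_tokens, reference_tokens, k=30):
  # Step 2: count bigram frequencies in each corpus, then rank them.
  learner_rank = ranks(Counter(bigrams(learner_tokens)))
  reference_rank = ranks(Counter(bigrams(reference_tokens)))
  # Step 4: reference rank / learner rank; missing bigrams get reference rank 1.
  ratios = [(g, reference_rank.get(g, 1) / r) for g, r in learner_rank.items()]
  # Step 5: keep the k bigrams with the highest ratio.
  return sorted(ratios, key=lambda item: item[1], reverse=True)[:k]

learner = "i think that i think so".split()
reference = "so i think that it is so i say".split()
print(top_rank_ratios(learner, reference, k=2))
# [('i think', 2.0), ('think that', 1.5)]
```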

In [15]:
import string

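def tokenize2(text):
  # Delete punctuation and digits, lowercase, then split on single spaces.
  drop_table = str.maketrans("", "", string.punctuation + string.digits)
  tokens = text.translate(drop_table).lower().split(" ")
  # Drop empty strings (from repeated spaces) and the citation marker "ibid".
  return [tok for tok in tokens if tok != "" and tok != "ibid"]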

print(tokenize2("This is an  example. Don't 123 I'll do it p28 ibid, p300 ibid!"))

['this', 'is', 'an', 'example', 'dont', 'ill', 'do', 'it', 'p', 'p']


In [16]:
file_path = os.path.join("data", "bnc.txt")
BNC_bigram = get_ngram(tokenize2(open(file_path).read()), n=2)
BNC_bigram_counter = calculate_frequency(BNC_bigram)

In [17]:
BNC_bigram_counter.most_common(10)

[('of the', 535167),
 ('in the', 353783),
 ('to the', 210576),
 ('on the', 154295),
 ('to be', 141156),
 ('and the', 135752),
 ('for the', 118716),
 ('at the', 106194),
 ('with the', 90512),
 ('from the', 90086)]

In [18]:
BNC_bigram_counter["figure figure"]

1

In [19]:
file_path = os.path.join("data", "lang8.txt")
lang_bigram = get_ngram(tokenize2(open(file_path).read()), n=2)
lang_bigram_counter = calculate_frequency(lang_bigram)

In [20]:
lang_bigram_counter.most_common(10)

[('of the', 58205),
 ('in the', 34158),
 ('to the', 21928),
 ('it is', 15516),
 ('and the', 14529),
 ('to be', 13829),
 ('on the', 12695),
 ('that the', 12531),
 ('for the', 10683),
 ('can be', 9779)]

In [21]:
lang_bigram_counter["figure figure"]

334

In [22]:
lang_bigram_sorted = lang_bigram_counter.most_common()
lang_bigram_sorted[0:5]

[('of the', 58205),
 ('in the', 34158),
 ('to the', 21928),
 ('it is', 15516),
 ('and the', 14529)]

In [23]:
lang_bigram_Rank = {}
for idx in range(len(lang_bigram_sorted)):
  lang_bigram_Rank[lang_bigram_sorted[idx][0]] = idx + 1
print(lang_bigram_Rank["of the"]) # rank 1
print(lang_bigram_Rank["and the"]) # rank 5

1
5


In [24]:
BNC_bigram_sorted = BNC_bigram_counter.most_common()
BNC_bigram_sorted[0:5]

[('of the', 535167),
 ('in the', 353783),
 ('to the', 210576),
 ('on the', 154295),
 ('to be', 141156)]

In [25]:
BNC_bigram_Rank = {}
for idx in range(len(BNC_bigram_sorted)):
  BNC_bigram_Rank[BNC_bigram_sorted[idx][0]] = idx + 1
print(BNC_bigram_Rank["of the"]) # rank 1
print(BNC_bigram_Rank["to be"]) # rank 5

1
5


In [26]:
frame_ptr = pd.DataFrame(get_ratios(lang_bigram_Rank, BNC_bigram_Rank), columns=["bigram", "rank ratio"])
frame_ptr.sort_values(by=["rank ratio"], ascending=False, inplace=True)
print(frame_ptr.head(30))

                       bigram   rank ratio
1152            figure figure  5414.542931
779              the internet  1889.560256
5286           heat exchanger  1811.344430
1981             the companys  1655.886478
6744         exam performance  1655.501853
7261           youngs modulus  1335.816579
7002                   i dont  1256.730258
8549           child soldiers  1056.077427
7412        birthweight ratio  1051.199379
10110  manufacturing strategy   934.913658
10991          ottoman empire   900.575055
10526         induction motor   894.852665
10784          appendix shows   851.602225
11791        history relevant   822.035448
11761           rate constant   813.736524
10755            fresh breath   812.470342
12314                tort law   802.426553
13150    genetically modified   762.975667
8905             internet and   752.021109
12928           torrington et   733.173641
10546  phonological processes   730.165450
15499        yield management   721.467677
10779      

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time.  
Scores are given only after a TA reviews your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to the e-learn website. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.  