# Word Count, Phrase Analysis, Cross-Corpus Analysis

When learning English, some words and phrases are overused and others are rarely used, depending on which corpus you compare against. Here, we will perform word counting, phrase analysis, and cross-corpus analysis to find the phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/) (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/), a learner corpus drawn from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.


https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing


## N-gram counting
We will tokenize the text and calculate token frequencies. The tokenization rules in this lab are:
 1. Ignore case (e.g., "The" is the same as "the")
 2. Split on whitespace <s>and punctuation</s>
 3. Ignore all punctuation
<br><br>

In [1]:
import os
import re
import string

In [2]:
def tokenize(text):
  return re.findall(r"\w+", text.lower())

print(tokenize("This is an  example."))

['this', 'is', 'an', 'example']
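
Note that `\w+` also splits on punctuation inside a word, so "don't" becomes the two tokens `don` and `t` rather than `dont`. If the rules above are read literally (delete punctuation, then split on whitespace), a tokenizer might look like the sketch below. The name `tokenize_strict` is hypothetical, and the sketch assumes that "punctuation" means the ASCII characters in `string.punctuation`; it is not used for the outputs recorded in the rest of this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: "punctuation" means string.punctuation (ASCII only).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize_strict(text):
  """Lowercase the text, delete punctuation, then split on whitespace."""
  return text.lower().translate(_PUNCT_TABLE).split()

print(tokenize_strict("I don't know."))
# ['i', 'dont', 'know']
```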


In [3]:
from collections import Counter

def calculate_frequency(tokens_ptr):
  return Counter(tokens_ptr)

print(calculate_frequency(tokenize("This is an example. This is great!")))

Counter({'this': 2, 'is': 2, 'an': 1, 'example': 1, 'great': 1})


In [4]:
def get_ngram(tokens_ptr, n=2):
  # Join each window of n consecutive tokens into one space-separated string.
  return [" ".join(tokens_ptr[i:i + n]) for i in range(len(tokens_ptr) - n + 1)]

print(get_ngram(tokenize("This is an example."), n=2))

['this is', 'is an', 'an example']


In [5]:
file_path = os.path.join("data", "bnc.txt")
# Use a context manager so the file handle is closed after reading.
with open(file_path) as f:
  BNC_unigram = get_ngram(tokenize(f.read()), n=1)
BNC_unigram_counter = calculate_frequency(BNC_unigram)

In [6]:
BNC_unigram_counter.most_common(10)

[('the', 4548607),
 ('of', 2203780),
 ('to', 1984550),
 ('and', 1937691),
 ('a', 1690228),
 ('in', 1366406),
 ('it', 858491),
 ('is', 791690),
 ('i', 743478),
 ('that', 725841)]

In [7]:
file_path = os.path.join("data", "lang8.txt")
# Use a context manager so the file handle is closed after reading.
with open(file_path) as f:
  lang_unigram = get_ngram(tokenize(f.read()), n=1)
lang_unigram_counter = calculate_frequency(lang_unigram)

In [8]:
lang_unigram_counter.most_common(10)

[('the', 434745),
 ('of', 231926),
 ('to', 172112),
 ('and', 171908),
 ('in', 131396),
 ('a', 121043),
 ('is', 100121),
 ('that', 71961),
 ('as', 61207),
 ('be', 53284)]

## Rank
Rank unigrams by their frequencies: the higher the frequency, the higher the rank (the most frequent unigram has rank 1).<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
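
As a small illustration of the ranking rule, the sketch below assigns ranks to a toy `Counter` (the counts are made up for the example). `Counter.most_common()` returns items sorted by descending count, and items with equal counts keep the order in which they were first seen, so tied words still receive distinct consecutive ranks.

```python
from collections import Counter

# Toy counts, not taken from either corpus.
counts = Counter({"the": 5, "cat": 3, "sat": 1})

# enumerate(..., start=1) makes the most frequent word rank 1.
ranks = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
print(ranks)
# {'the': 1, 'cat': 2, 'sat': 3}
```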

In [9]:
lang_unigram_sorted = lang_unigram_counter.most_common()
lang_unigram_sorted[0:5]

[('the', 434745),
 ('of', 231926),
 ('to', 172112),
 ('and', 171908),
 ('in', 131396)]

In [10]:
lang_unigram_Rank = {}
for idx in range(len(lang_unigram_sorted)):
  lang_unigram_Rank[lang_unigram_sorted[idx][0]] = idx + 1
print(lang_unigram_Rank["the"]) # rank 1
print(lang_unigram_Rank["in"]) # rank 5
print(lang_unigram_Rank["dont"])

1
5
41160


In [11]:
BNC_unigram_sorted = BNC_unigram_counter.most_common()
BNC_unigram_sorted[0:5]

[('the', 4548607),
 ('of', 2203780),
 ('to', 1984550),
 ('and', 1937691),
 ('a', 1690228)]

In [12]:
BNC_unigram_Rank = {}
for idx in range(len(BNC_unigram_sorted)):
  BNC_unigram_Rank[BNC_unigram_sorted[idx][0]] = idx + 1
print(BNC_unigram_Rank["the"]) # rank 1
print(BNC_unigram_Rank["a"]) # rank 5
print(BNC_unigram_Rank["dont"])

1
5
167461


## Calculate Rank Ratio
In this step, you need to match each unigram across the two datasets and calculate its Rank Ratio.  <br>Please use the following formula to calculate the Rank Ratio:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank in BNC}}{\text{Rank in Lang-8}}$
<br><br>
If a unigram does not appear in BNC, its BNC rank is treated as 1.
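
As a worked example, "in" is ranked 6th in BNC and 5th in Lang-8 in the top-10 lists above, so its Rank Ratio is 6 / 5 = 1.2. The sketch below reproduces that arithmetic with two small dictionaries that hold only the ranks visible in those lists; `rank_ratio` is a hypothetical helper name.

```python
# Ranks copied from the top-10 lists above (illustrative subset only).
bnc_rank = {"the": 1, "of": 2, "to": 3, "and": 4, "a": 5, "in": 6}
lang8_rank = {"the": 1, "of": 2, "to": 3, "and": 4, "in": 5, "a": 6}

def rank_ratio(unigram):
  # A unigram missing from BNC falls back to rank 1, as stated above.
  return bnc_rank.get(unigram, 1) / lang8_rank[unigram]

print(rank_ratio("in"))
# 1.2
```

One consequence of the fallback: a unigram that is absent from BNC gets a ratio of at most 1, so it cannot appear near the top of a list sorted by descending Rank Ratio.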

<span style="color: red">[ TODO ]</span> Please calculate the Rank Ratio of every unigram in Lang-8.

In [13]:
def get_ratios():
  ratios_ptr = []
  for unigram in lang_unigram_Rank:
    bnc_rank = 1
    if unigram in BNC_unigram_Rank:
      bnc_rank = BNC_unigram_Rank[unigram]
    ratios_ptr.append((unigram, bnc_rank / lang_unigram_Rank[unigram]))
  return ratios_ptr

get_ratios()[0:5]

[('the', 1.0), ('of', 1.0), ('to', 1.0), ('and', 1.0), ('in', 1.2)]

## Sort the result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>

In [14]:
import pandas as pd

frame_ptr = pd.DataFrame(get_ratios(), columns=["unigram", "rank ratio"])
frame_ptr.sort_values(by=["rank ratio"], ascending=False, inplace=True)
print(frame_ptr.head(30))

            unigram  rank ratio
276          cannot  377.768953
190            2004  320.958115
242            2002  200.909465
193            2003  179.288660
218            2005  113.461187
226            2001   94.259912
647            2006   86.322531
293            1999   78.598639
788        internet   63.556401
388            1998   60.552699
773              eu   52.403101
5792           nisa   48.075091
5549       radstone   44.609009
269            ibid   42.948148
4424            uht   42.482938
2216          doesn   39.024808
5868   anthocyanins   38.943772
6803       germline   38.628895
1690  globalisation   34.561798
6747          creon   33.843806
8913         bryman   31.679829
5154    pneumophila   31.535984
6853       manydown   30.976364
3481            wto   30.785181
7527   microneedles   30.298751
9172           rtas   29.878012
453            1997   28.162996
8792          punic   27.966223
9029           livy   27.242636
9307            teg   26.589708


## For Bigrams
<span style="color: red">[ TODO ]</span> Do the same thing for bigrams  
Hint:  
1. generate all bigrams for BNC / lang8  
2. calculate frequency for each bigram  
3. rank bigrams by frequency  
4. calculate the rank ratio of each bigram
5. print out the top 30 highest rank ratio bigrams  
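
The steps in the hint can be sketched end to end on toy input, as below. This is a minimal illustration under stated assumptions, not a reference solution: `bigram_ranks` and `top_rank_ratios` are hypothetical helper names, the two strings stand in for `data/bnc.txt` and `data/lang8.txt`, and the tokenizer is the same `\w+` regex used earlier.

```python
import re
from collections import Counter

def bigram_ranks(text):
  """Return {bigram: rank}, where rank 1 is the most frequent bigram."""
  tokens = re.findall(r"\w+", text.lower())
  # Steps 1-2: generate bigrams and count them.
  bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
  ordered = Counter(bigrams).most_common()
  # Step 3: rank by descending frequency.
  return {bigram: rank for rank, (bigram, _) in enumerate(ordered, start=1)}

def top_rank_ratios(reference_text, learner_text, k=30):
  """Step 4-5: ratio = reference rank / learner rank, highest k first."""
  ref_rank = bigram_ranks(reference_text)
  learner_rank = bigram_ranks(learner_text)
  # Bigrams missing from the reference text fall back to rank 1.
  ratios = [(bg, ref_rank.get(bg, 1) / r) for bg, r in learner_rank.items()]
  return sorted(ratios, key=lambda item: item[1], reverse=True)[:k]

# Toy strings standing in for the two corpora (made up for this example).
reference = "the cat sat on the mat and the cat slept on the mat"
learner = "i think that the cat sat and i think that the cat slept"
for bigram, ratio in top_rank_ratios(reference, learner, k=3):
  print(bigram, ratio)
```

On the real corpora the same two calls would take the file contents instead of the toy strings.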

In [15]:
#### [ TODO ] 

## TA's Notes

Once you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time.  
The score is given only after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to the e-learn website. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submission will not be allowed**.  