# Word Count, Phrase Analysis, Cross-Corpus Analysis

When learning English, some words and phrases are overused and others are rarely used, and which ones depends on the corpus being examined. Here, we use word counts, phrase analysis and cross-corpus analysis to find the phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/), collected from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.


https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing


## N-gram counting
We will tokenize the text and count how often each token occurs. The tokenization rules in this lab are:
 1. Ignore case (e.g., "The" is the same as "the")
 2. Split by white spaces <s>and punctuations</s>
 3. Ignore all punctuation
<br><br>
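The three rules above can be sketched on a toy sentence as follows. This is a minimal illustration, not the lab solution: `tokenize_sketch` is a hypothetical name, and it only covers lowercasing, ASCII punctuation removal and whitespace splitting.

```python
import string

def tokenize_sketch(text):
    # Rule 1: ignore case
    text = text.lower()
    # Rule 3: drop every ASCII punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Rule 2: split on runs of white space
    return text.split()

print(tokenize_sketch("The cat's toy, the CAT's toy!"))
# ['the', 'cats', 'toy', 'the', 'cats', 'toy']
```

Note that deleting the apostrophe merges contractions and possessives into a single token (`cat's` becomes `cats`), so forms such as `doesnt` or `dont` are to be expected in the counts.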

In [1]:
import os
import re
import string

In [2]:
a_string = '!hi. wh?at is the weat[h]er lik?e.'
new_string = re.sub(r'[^\w\s]', '', a_string)

def tokenize(text):
    """
    Input:
    "This is an example."

    Sample output:
    ['this', 'is', 'an', 'example']
    """
    #### [ TODO ] transform text to lower case
    text = text.lower()
    # Replace digits with spaces so that numbers never become tokens
    text = re.sub(r'[0-9]', " ", text)
    #### [ TODO ] remove punctuation, then separate the words by white space
    # Use .translate() for the fastest performance
    # string.punctuation is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    tokens = text.translate(str.maketrans('', '', string.punctuation)).split()
    return tokens
    
from collections import Counter

def calculate_frequency(tokens):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output: 
    {
        'the': 79809, 
        'project': 288,
        ...
    }
    """
    #### [ TODO ] 
    fre = Counter(tokens)
    return fre

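# Return every run of n consecutive tokens, joined by single spaces
def get_ngram(tokens, n=2):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output: 
    ['this is', 'is an', 'an example', ...]
    """
    #### [TODO] 
    # A list of tokens has len(tokens) - n + 1 windows of size n; stopping at
    # len(tokens) - 1 is only correct for n = 2
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]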
#     ngram = []
#     for i in range(len(tokens)-1):
#         ngram.append(tokens[i:i+n])
#     return ngram

In [3]:
file_path = os.path.join('data', 'test.txt')
BNC_unigram = []
#BNC_unigram_counter = Counter()
#### [ TODO ] generate BNC unigrams and calculate document frequency of unigram in BNC
with open(file_path, 'r') as f:
    for line in f:
        tokens = tokenize(line)
        BNC_unigram.extend(tokens)
print(BNC_unigram)
BNC_unigram_counter = calculate_frequency(BNC_unigram)

dict(list(BNC_unigram_counter.items())[0:6])

['factsheet', 'what', 'is', 'aids', 'aids', 'immune', 'deficiency', 'syndrome', 'is', 'a', 'condition', 'caused', 'by', 'a', 'virus', 'called', 'hiv', 'immuno', 'deficiency', 'virus', 'this', 'virus', 'affects', 'the', 'body', 's', 'defence', 'system', 'so', 'that', 'it', 'can', 'not', 'fight', 'infection', 'how', 'is', 'infection', 'through', 'unprotected', 'sexual', 'intercourse', 'with', 'an', 'infected', 'partner', 'through', 'infected', 'blood', 'or', 'blood', 'products', 'from', 'an', 'infected', 'mother', 'to', 'her', 'baby', 'it', 'is', 'not', 'transmitted', 'from', 'giving', 'bloodmosquito', 'bitestoilet', 'seatskissingfrom', 'normal', 'daytoday', 'contact', 'how', 'does', 'it', 'affect', 'you', 'the', 'medical', 'aspects', 'can', 'be', 'cancer', 'pneumonia', 'sudden', 'blindness', 'dementia', 'dramatic', 'weight', 'loss', 'or', 'any', 'combination', 'of', 'these', 'often', 'people', 'are', 'rejected', 'by', 'family', 'and', 'friends', 'leaving', 'them', 'to', 'face', 'this', 

{'factsheet': 1, 'what': 1, 'is': 6, 'aids': 3, 'immune': 1, 'deficiency': 2}

In [4]:
file_path = os.path.join('data', 'bnc.txt')
BNC_unigram = []
#BNC_unigram_counter = Counter()
#### [ TODO ] generate BNC unigrams and calculate document frequency of unigram in BNC
with open(file_path, 'r') as f:
    for line in f:
        tokens = tokenize(line)
        BNC_unigram.extend(tokens)
BNC_unigram_counter = calculate_frequency(BNC_unigram)

In [5]:
dict(list(BNC_unigram_counter.items())[0:20])

{'factsheet': 59,
 'what': 196659,
 'is': 791567,
 'aids': 2521,
 'immune': 748,
 'deficiency': 531,
 'syndrome': 931,
 'a': 1687867,
 'condition': 6180,
 'caused': 6332,
 'by': 364833,
 'virus': 1199,
 'called': 22822,
 'hiv': 1570,
 'immuno': 5,
 'this': 367307,
 'affects': 1107,
 'the': 4546697,
 'body': 19093,
 's': 678803}

In [6]:
# Read lang-8 Data
file_path = os.path.join('data','lang8.txt')
lang_unigram = []
#lang_unigram_counter = Counter()

#### [ TODO ] generate lang8 unigrams and calculate document frequency of unigram in lang8
with open(file_path, 'r') as f:
    for line in f:
        tokens = tokenize(line)
        lang_unigram.extend(tokens)
lang_unigram_counter = calculate_frequency(lang_unigram)


## Rank
Rank unigrams by their frequencies. The higher the frequency, the higher the rank. (The most frequent unigram ranks 1.)<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
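As a small illustration of frequency ranking, the sketch below ranks a toy token list. `rank_by_frequency` and `toy_counts` are hypothetical names, and using `Counter.most_common()` is one possible approach rather than the required one; tokens with equal counts keep the order in which they were first seen.

```python
from collections import Counter

def rank_by_frequency(counter):
    # most_common() yields (item, count) pairs sorted by count, descending
    return {item: rank
            for rank, (item, _) in enumerate(counter.most_common(), start=1)}

toy_counts = Counter(['the', 'cat', 'the', 'mat', 'the', 'cat'])
print(rank_by_frequency(toy_counts))
# {'the': 1, 'cat': 2, 'mat': 3}
```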

In [7]:
lang_unigram_Rank = {}

#### [ TODO ] Rank unigrams for lang

# Difference between sort and sorted:
# sort is a method of list, whereas sorted works on any iterable.
# list.sort() modifies the existing list in place and returns None, whereas the built-in sorted() returns a new list and leaves the original untouched.
# Sort by value in descending order, hence reverse=True
# sorted() returns a list
for i, unigram in enumerate(sorted(lang_unigram_counter.items(), key=lambda item: item[1], reverse=True)):
    # each unigram is a (word, count) tuple, e.g. ('a', 10), ('b', 100), ('c', 20)
    # For the rank ratio step, a dictionary avoids the nested loop that a list would need
    lang_unigram_Rank[unigram[0]]= i+1
    #lang_unigram_Rank.append((unigram[0], i+1))

In [8]:
dict(list(lang_unigram_Rank.items())[0:20])

{'the': 1,
 'of': 2,
 'to': 3,
 'and': 4,
 'in': 5,
 'a': 6,
 'is': 7,
 'that': 8,
 'as': 9,
 'be': 10,
 'for': 11,
 'this': 12,
 'it': 13,
 'are': 14,
 'with': 15,
 'by': 16,
 'on': 17,
 'was': 18,
 'not': 19,
 'from': 20}

In [9]:
BNC_unigram_Rank = {}

#### [ TODO ] Rank unigrams for BNC
for i, unigram in enumerate(sorted(BNC_unigram_counter.items(), key=lambda item: item[1], reverse=True)):
    # each unigram is a (word, count) tuple, e.g. ('a', 10), ('b', 100), ('c', 20)
    BNC_unigram_Rank[unigram[0]]= i+1
    #BNC_unigram_Rank.append((unigram[0], i+1))

In [10]:
dict(list(BNC_unigram_Rank.items())[0:20])

{'the': 1,
 'of': 2,
 'to': 3,
 'and': 4,
 'a': 5,
 'in': 6,
 'it': 7,
 'is': 8,
 'i': 9,
 'that': 10,
 'was': 11,
 's': 12,
 'for': 13,
 'you': 14,
 'he': 15,
 'with': 16,
 'on': 17,
 'be': 18,
 'as': 19,
 'at': 20}

## Calculate Rank Ratio
In this step, match each unigram across the two datasets and calculate its Rank Ratio.  <br>Please use the following formula for calculating Rank Ratio:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank in BNC}}{\text{Rank in Lang-8}}$
<br><br>
If a unigram doesn't appear in BNC, its BNC rank is treated as 1.

<span style="color: red">[ TODO ]</span> Please calculate all rank ratios of unigrams in Lang-8.
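To make the formula concrete, here is a worked sketch on made-up ranks. The names `rank_ratio_sketch`, `toy_lang8`, `toy_bnc` and the `missing_rank` parameter are assumptions for illustration only, and the numbers are invented, not taken from either corpus.

```python
def rank_ratio_sketch(learner_rank, reference_rank, missing_rank=1):
    # Divide the reference (BNC) rank by the learner (Lang-8) rank;
    # a word absent from the reference corpus gets rank `missing_rank`
    return {word: reference_rank.get(word, missing_rank) / rank
            for word, rank in learner_rank.items()}

toy_lang8 = {'the': 1, 'internet': 2, 'blog': 4}   # invented ranks
toy_bnc = {'the': 1, 'internet': 500}              # 'blog' is missing

ratios = rank_ratio_sketch(toy_lang8, toy_bnc)
print(ratios)
# {'the': 1.0, 'internet': 250.0, 'blog': 0.25}
```

A large ratio means the word ranks far higher in Lang-8 than in BNC (here `internet`: 500 / 2 = 250), which is the signal for overuse. A word missing from BNC ends up with a ratio of at most 1 under the rank-1 rule, so it never reaches the top of the list.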

In [11]:
#### [ TODO ] Calculate Rank Ratio
# lang_unigram_Rank = {
#     'a': 1,
#     'b': 2,
#     'c': 3
# }
rank_ratio = {}
for word, rank in lang_unigram_Rank.items():
    # A unigram missing from BNC is treated as having BNC rank 1
    rank_ratio[word] = BNC_unigram_Rank.get(word, 1) / rank
    

## Sort the Result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>

In [20]:
#### [ TODO ] 
# sorted(iterable, *, key=None, reverse=False)
# rank_ratio is a dictionary 
print(f'rank\tunigram\t\t\t\tRank Ratio')
for i,ratio_set in enumerate(sorted(rank_ratio.items(), key=lambda item: item[1],reverse=True)[:30]):
    print(f'{i+1}\t{ratio_set[0]}\t\t\t{round(ratio_set[1],3)}')


rank	unigram				Rank Ratio
1	doesnt			85.345
2	internet			72.549
3	countrys			69.447
4	opcit			51.871
5	radstone			50.14
6	isnt			49.776
7	uht			49.671
8	eu			49.394
9	kants			48.977
10	dont			48.567
11	companys			48.271
12	anthocyanins			47.487
13	ibid			43.74
14	japans			43.387
15	webers			43.054
16	luthers			41.95
17	bryman			40.181
18	mpa			39.377
19	ibidp			39.355
20	womens			38.4
21	creon			37.971
22	microneedles			37.212
23	rtas			37.145
24	didnt			36.673
25	pneumophila			35.763
26	globalisation			35.704
27	roosevelts			35.474
28	punic			34.983
29	manydown			34.947
30	chinas			34.85


## For Bigrams
<span style="color: red">[ TODO ]</span> Do the Same Thing for Bigrams  
Hint:  
1. generate all bigrams for BNC / lang8  
2. calculate the frequency of each bigram  
3. rank bigrams by frequency  
4. calculate the rank ratio of each bigram
5. print out the top 30 highest rank ratio bigrams  
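The five steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the helper names (`bigrams`, `ranks`, `top_bigram_ratios`) and the two toy corpora are hypothetical, and the input is assumed to be already tokenized, one token list per line.

```python
from collections import Counter

def bigrams(tokens):
    # Step 1: pair each token with its successor within one line
    return [f'{a} {b}' for a, b in zip(tokens, tokens[1:])]

def ranks(counter):
    # Step 3: the most frequent bigram gets rank 1
    return {gram: r
            for r, (gram, _) in enumerate(counter.most_common(), start=1)}

def top_bigram_ratios(reference_lines, learner_lines, k=30):
    # Step 2: count bigrams in each corpus
    ref_rank = ranks(Counter(g for line in reference_lines for g in bigrams(line)))
    learner_rank = ranks(Counter(g for line in learner_lines for g in bigrams(line)))
    # Step 4: reference rank / learner rank, with rank 1 for missing bigrams
    ratios = {g: ref_rank.get(g, 1) / r for g, r in learner_rank.items()}
    # Step 5: keep the k highest ratios
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:k]

reference = [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['a', 'dog', 'sat']]
learner = [['a', 'dog', 'sat'], ['a', 'dog', 'ran'], ['the', 'cat', 'sat']]

print(top_bigram_ratios(reference, learner, k=2))
# [('a dog', 4.0), ('dog sat', 2.5)]
```

Building bigrams line by line, as here, means no bigram spans a line boundary.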

In [13]:
#### [ TODO ] 
# generate all bigrams for BNC
# calculate frequency for each bigrams

file_path = os.path.join('data', 'bnc.txt')
BNC_bigram = []
#BNC_bigram_counter = Counter()
#### [ TODO ] generate BNC bigrams and calculate document frequency of bigram in BNC
with open(file_path, 'r') as f:
    for line in f:
        tokens = tokenize(line)
        bigram = get_ngram(tokens, n=2)
        BNC_bigram.extend(bigram)
BNC_bigram_counter = calculate_frequency(BNC_bigram)

dict(list(BNC_bigram_counter.items())[0:6])

{'factsheet what': 1,
 'what is': 13278,
 'is aids': 13,
 'aids immune': 10,
 'immune deficiency': 29,
 'deficiency syndrome': 16}

In [14]:
#### [ TODO ] 
# generate all bigrams for lang8
# calculate frequency for each bigrams
file_path = os.path.join('data','lang8.txt')
lang_bigram = []
#lang_bigram_counter = Counter()

#### [ TODO ] generate lang8 bigrams and calculate document frequency of bigram in lang8
with open(file_path, 'r') as f:
    for line in f:
        tokens = tokenize(line)
        # get_ngram return a list
        bigram = get_ngram(tokens, n=2)
        lang_bigram.extend(bigram)
lang_bigram_counter = calculate_frequency(lang_bigram)

dict(list(lang_bigram_counter.items())[0:6])

{'having spent': 3,
 'spent half': 1,
 'half days': 1,
 'days and': 47,
 'and full': 23,
 'full weeks': 1}

In [15]:
#### [ TODO ]  rank bigrams by frequency
BNC_bigram_Rank = {}
for i,bigram in enumerate(sorted(BNC_bigram_counter.items(), key=lambda item: item[1],reverse=True)):
    BNC_bigram_Rank[bigram[0]] = i+1

In [16]:
#### [ TODO ] rank bigrams by frequency
lang_bigram_Rank = {}
for i,bigram in enumerate(sorted(lang_bigram_counter.items(), key=lambda item: item[1],reverse=True)):
    lang_bigram_Rank[bigram[0]] = i+1

In [17]:
bi_rank_ratio = {}
for word,rank in lang_bigram_Rank.items():
    #if word in BNC_bigram_Rank.keys():
    bi_rank_ratio[word] = (BNC_bigram_Rank[word]/rank) if word in BNC_bigram_Rank.keys() else (1/rank)
 

In [21]:
#### [ TODO ] 
# sorted(iterable, *, key=None, reverse=False)
# bi_rank_ratio is a dictionary 
print(f'rank\tbigram\t\t\t\tRank Ratio')
for i,ratio_set in enumerate(sorted(bi_rank_ratio.items(), key=lambda item: item[1],reverse=True)[:30]):
    print(f'{i+1}\t{ratio_set[0]}\t\t\t{round(ratio_set[1],3)}')


rank	bigram				Rank Ratio
1	p ibid			8451.671
2	figure figure			4387.68
3	the internet			1655.15
4	heat exchanger			1424.038
5	the companys			1408.576
6	exam performance			1269.606
7	youngs modulus			1048.631
8	i dont			993.778
9	birthweight ratio			836.958
10	child soldiers			833.248
11	the bohr			748.879
12	manufacturing strategy			734.568
13	ottoman empire			720.615
14	induction motor			704.328
15	rate constant			654.397
16	history relevant			646.073
17	tort law			629.672
18	genetically modified			613.156
19	ibid pp			609.165
20	internet and			604.723
21	of womens			584.58
22	yield management			583.962
23	phonological processes			582.317
24	torrington et			576.476
25	emotional labour			573.496
26	open source			569.734
27	based care			557.345
28	this essay			552.337
29	the wto			534.324
30	assurance schemes			524.805


## TA's Notes

Once you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time.  
The score is only given after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to the e-learn website. You just need to hand in this ipynb file, renamed XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.  