# CS310 Natural Language Processing
# Lab 1: Basic Text Processing with Python

In [1]:
import re

In [2]:
with open("三体3死神永生-刘慈欣.txt", "r", encoding="utf-8") as f:
    raw = f.readlines()

print('# of lines: ', len(raw))
raw = ''.join(raw) # concatenate all lines into one string
print('# of characters: ', len(raw))

# of lines:  4689
# of characters:  385018


## T0. Cleaning the raw data

1. Replace the special token `\u3000` (the full-width ideographic space) with an empty string `""`.
2. Replace consecutive newlines with just a single one.
3. Other cleaning work you can think of.

*Hint*: Use `re.sub()`

In [3]:
cleaned = re.sub('\u3000', '', raw) # remove full-width (ideographic) spaces
print('# of characters after removing full-width spaces:', len(cleaned))

cleaned = re.sub(r'\n+', '\n', cleaned) # collapse consecutive newlines into one
print('# of characters after collapsing newlines:', len(cleaned))

cleaned = re.sub(r' +', ' ', cleaned) # replace multiple spaces with one space
print('# of characters after collapsing spaces:', len(cleaned))

cleaned = cleaned.strip() # remove leading and trailing whitespace
print('# of characters after stripping:', len(cleaned))

# of characters after removing full-width spaces: 375711
# of characters after collapsing newlines: 375677
# of characters after collapsing spaces: 375580
# of characters after stripping: 375580
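
The same steps can be checked on a short string whose expected result is easy to verify by hand. This is a minimal sketch: `clean_text` and `sample` are hypothetical names introduced only for illustration, and `str.replace` stands in for `re.sub` in the first step because no pattern matching is needed there.

```python
import re

def clean_text(text):
    """Apply the T0 cleaning steps to a string (illustrative helper)."""
    text = text.replace('\u3000', '')   # drop full-width (ideographic) spaces
    text = re.sub(r'\n+', '\n', text)   # collapse runs of newlines into one
    text = re.sub(r' +', ' ', text)     # collapse runs of ASCII spaces into one
    return text.strip()                 # trim leading and trailing whitespace

# Made-up sample: indented lines, blank lines, and a double space.
sample = '\u3000\u3000第一章\n\n\n\u3000\u3000程心  醒来了。\n'
print(repr(clean_text(sample)))
```

The order of the steps matters: a line that holds only `\u3000` characters becomes empty only after those characters are removed, so removing them first lets the newline rule collapse the resulting blank lines.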


## T1. Count the number of Chinese tokens

*Hint*: Use `re.findall()` and the range of Chinese characters in Unicode, i.e., `[\u4e00-\u9fa5]`.

In [4]:
num = re.findall(r'[\u4e00-\u9fa5]', cleaned)
print('Number of Chinese tokens:', len(num))

Number of Chinese tokens: 329946
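
To see what the character class does and does not count, it helps to run it on a short mixed string. The sketch below uses a made-up `sample`; the range `[\u4e00-\u9fa5]` lies inside the main CJK Unified Ideographs block, so Latin letters, digits, and full-width punctuation such as `，` and `。` are excluded, and rarer characters encoded in the extension blocks would not be matched either.

```python
import re

# Character class from the hint above.
CJK = re.compile(r'[\u4e00-\u9fa5]')

# Made-up sample mixing Chinese characters, Latin letters, digits and punctuation.
sample = '程心和艾AA在2024年，看见了3个太阳。'
chars = CJK.findall(sample)

print(len(chars))
print(chars)
```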


## T2. Build the vocabulary for all Chinese tokens

Use a Python `dict` object or an instance of `collections.Counter` to count the frequency of each Chinese token.

*Hint*: Go through the `raw` string. The first time a Chinese token appears, add it to the `dict` or `Counter` object with a count of 1; if the token is already present, increment its count by 1.

Check the vocabulary size and print the top 20 most frequent Chinese tokens and their counts.

In [5]:
import collections

vocab = collections.Counter(num)

vocab_size = len(vocab)
print('Vocabulary size:', vocab_size)

print('Top 20 most frequent Chinese tokens:')
for token, count in vocab.most_common(20):
    print(f'{token}: {count}')

Vocabulary size: 3027
Top 20 most frequent Chinese tokens:
的: 15990
一: 6749
是: 4837
在: 4748
了: 4149
有: 3656
这: 3532
个: 3458
不: 3117
人: 2988
中: 2649
到: 2632
他: 2354
上: 2194
们: 2164
时: 2076
心: 2007
地: 1953
大: 1938
来: 1855
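
The cell above relies on `collections.Counter`; the manual loop described in the hint can be sketched with a plain `dict` as follows. `count_tokens` and `sample` are hypothetical names used only for this illustration.

```python
import re

def count_tokens(text):
    """Count Chinese characters with a plain dict (illustrative helper)."""
    counts = {}
    for ch in re.findall(r'[\u4e00-\u9fa5]', text):
        # First occurrence starts at 1; later occurrences increment the count.
        counts[ch] = counts.get(ch, 0) + 1
    return counts

sample = '程心看着程心的星星'
counts = count_tokens(sample)

# Sort by frequency, highest first, to mimic Counter.most_common().
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print('Vocabulary size:', len(counts))
print(top[:3])
```

`Counter` performs the same bookkeeping internally, so the two approaches should agree on any input; the `dict` version only makes the increment step explicit.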


## T3. Sentence segmentation

Estimate the number of sentences in the `raw` string by splitting it at sentence-ending punctuation marks such as `。`, `？`, and `！`.

*Hint*: Use `re.split()` and the correct regular expression. 

In [14]:
sentences = re.split(r'[。！？]', raw)
sentences = [s.strip() for s in sentences if s.strip()]
print("Number of sentences: ",len(sentences))

Number of sentences:  9611


The sentences obtained with `re.split()` do not contain the delimiter punctuation marks. What if we want to keep the delimiters in the sentences?

*Hint*: Use `re.findall()` and the correct regular expression.

In [None]:
sentences = re.findall(r'.+?[。！？]', raw)
print("Number of sentences: ",len(sentences))

Number of sentences:  9615
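
The two approaches can give slightly different counts, and a short string shows one possible reason. In this sketch (with a made-up `sample`), `re.split()` keeps a trailing fragment that has no closing delimiter, while `re.findall()` returns only spans that end in a delimiter. The negated class `[^。！？]+` is an assumption of this sketch rather than the pattern used in the cell above; unlike `.`, it also matches newlines, so a sentence that continues on the next line is still captured whole.

```python
import re

# Made-up sample: three terminated sentences and one unterminated tail.
sample = '你好！你是谁？\n我是程心。没有句号的结尾'

# re.split() drops the delimiters and keeps the unterminated tail.
parts = [s.strip() for s in re.split(r'[。！？]', sample) if s.strip()]

# re.findall() keeps each delimiter; the unterminated tail is not returned.
kept = [s.strip() for s in re.findall(r'[^。！？]+[。！？]', sample)]

print(len(parts), parts)
print(len(kept), kept)
```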


## T4. Count consecutive English and number tokens

Count the tokens made of consecutive English letters or consecutive digits in the `raw` string. Build a vocabulary for them and count their frequencies.

*Hint*: Use `re.findall()` and the correct regular expression. Use a method similar to the one in T2 to build the vocabulary and count the frequencies.

In [15]:
tokens = re.findall(r'[a-zA-Z]+|\d+', raw)

vocab = collections.Counter(tokens)

print('Vocabulary size:', len(vocab))

print('Top 20 most frequent English and number tokens:')
for token, count in vocab.most_common(20):
    print(f'{token}: {count}')

Vocabulary size: 171
Top 20 most frequent English and number tokens:
AA: 338
A: 69
I: 66
1: 47
PIA: 45
PDC: 35
Ice: 34
3: 30
IDC: 28
DX: 27
3906: 27
5: 27
0: 22
G: 20
Way: 20
647: 19
7: 19
16: 14
4: 13
2: 13
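
One consequence of the alternation `[a-zA-Z]+|\d+` is that a mixed string is split at every boundary between letters and digits. The identical counts for `DX` and `3906` in the output above may come from such a split. The sketch below uses a made-up `sample` to show the behaviour, together with one possible alternative pattern that keeps mixed strings together; which tokenization is preferable depends on the task.

```python
import re
from collections import Counter

# Made-up sample containing letter-only, digit-only and mixed strings.
sample = '艾AA在PDC会议上提到DX3906和647年，AA笑了。'

# Pattern from the cell above: letters and digits form separate tokens.
tokens = re.findall(r'[a-zA-Z]+|\d+', sample)

# Alternative (an assumption of this sketch): treat letters and digits as one class.
merged = re.findall(r'[a-zA-Z0-9]+', sample)

print(tokens)
print(merged)
print(Counter(tokens).most_common(1))
```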


## T5. Mix of patterns

There are two characters whose names are "艾AA" and "程心". Find all sentences in which "艾AA" and "程心" appear together. Consider full names only, that is, "艾AA" but not "AA" alone.

*Hint*: You may find the lookbehind or lookahead pattern useful.
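
As a rough illustration of the hint, two lookaheads can require that both names occur before the next sentence delimiter, in either order, without listing each order separately. This is a sketch on a made-up `sample`, not the notebook's solution; the names `SENT`, `both`, and `bare_aa` are hypothetical. The negative lookbehind `(?<!艾)` finds "AA" used without the surname.

```python
import re

# A sentence: any run of non-delimiter characters followed by one delimiter.
SENT = r'[^。！？]*[。！？]'

# Each lookahead checks that a name occurs before the next delimiter.
both = re.compile(r'(?=[^。！？]*艾AA)(?=[^。！？]*程心)' + SENT)

# "AA" not preceded by "艾", i.e. the given name used alone.
bare_aa = re.compile(r'(?<!艾)AA')

sample = '程心和艾AA回到了地球。AA笑了，程心没有说话。艾AA睡着了。'
print(both.findall(sample))
print(len(bare_aa.findall(sample)))
```

Only the first sentence contains both full names; the second mentions "AA" without the surname and the third lacks "程心", so neither is returned.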

In [None]:
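# A sentence is a run of non-delimiter characters ending in 。, ！ or ？.
# Match sentences with 艾AA before 程心, or with 程心 before 艾AA.
pattern = r'[^。！？]*艾AA[^。！？]*程心[^。！？]*[。！？]|[^。！？]*程心[^。！？]*艾AA[^。！？]*[。！？]'

matching_sentences = re.findall(pattern, raw)

print('Sentences where "艾AA" and "程心" appear together:')
for i, sentence in enumerate(matching_sentences, 1):
    print(f'{i}. {sentence.strip()}')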

Sentences where "艾AA" and "程心" appear together:
1. 在程心眼中，艾AA是个像鸟一般轻灵的女孩子，充满生机地围着她飞来飞去。
2. 程心听到有人叫自己的名字，转身一看，竟是艾AA正向这里跑过来。
3. 程心让艾AA在原地等着自己，但AA坚持要随程心去，只好让她上了车。
4. 程心和艾AA是随最早的一批移民来到澳大利亚的。
5. #第三部
　　【广播纪元7年，程心】
　　艾AA说程心的眼睛比以前更明亮更美丽了，也许她没有说谎。
6. ”坐在程心旁边的艾AA大叫起来，引来众人不满的侧目。
7. 这天，艾AA来找程心。
8. 是艾AA建议程心报名参加试验的，她认为这是为星环公司参与掩体工程而树立公众形象的一次极佳的免费广告，同时，她和程心都清楚试验是经过严密策划的，只是看上去刺激，基本没什么危险。
9. 在返回的途中，当太空艇与地球的距离缩小到三十万千米以内、通信基本没有延时时，程心给艾AA打电话，告诉了她与维德会面的事。
10. 与此同时，程心和艾AA进入冬眠。
11. 程心到亚洲一号的冬眠中心唤醒了冬眠中的艾AA，两人回到了地球。
12. 程心现在身处的世界是一个白色的球形空间，她看到艾AA飘浮在附近，和她一样身穿冬眠时的紧身服，头发湿漉漉的，四肢无力地摊开，显然也是刚刚醒来。
13. 对此程心感到很欣慰，到了新世界后，艾AA应该有一个美好的新生活了。
14. 程心想到了云天明和艾AA，他们在地面上，应该是安全的，但现在双方已经无法联系，她甚至都没能和他说上一句话。
15. 程心和关一帆再次拥抱在一起，他们都为艾AA和云天明流下了欣慰的泪水，幸福地感受着那两个人在十八万个世纪前的幸福，在这种幸福中，他们绝望的心灵变得无比宁静了。
16. ”
　　智子的话让程心想到了云天明和艾AA刻在岩石上的字，但关一帆想到的更多，他注意到了智子提到的一个词：田园时代。
