# Class12 - NLP - CHN

<img src="https://juniorworld.github.io/python-workshop/img/NLP_.png" width="600px" height="400px" align='left'>

---

## Word Embeddings
- Word embedding is a technique that transforms words into numerical representations (word vectors). Each word is mapped to one vector, and the vectors are trained to capture the syntactic and semantic relationships between words.
    - e.g.: "book" (1,0,3), "paper" (1,0,4) -> near-synonyms sit close together in the vector space
- Applications:
    - Find synonyms in a corpus (a collection of documents), which helps reveal how opinions are expressed.
        - In Trump's opinions, CNN is a byword for enemy and dishonesty, while Clinton is a synonym for crook.
    - Find analogies between words and concepts. Word meanings are assumed to be additive, so the meaning of a word combination can be computed from its parts.
        - "queen" is to "woman" what "king" is to "man"
        - "Beijing" is to "China" what "Tokyo" is to "Japan"
        - Vectors support math operations such as + and -
    - Find the word that fits least well in a sentence
    - Evaluate the similarity (distance) between two or more sentences
    - Evaluate how likely a sentence is to belong to the corpus
        - Corpus: a set of text documents

<img src="https://juniorworld.github.io/python-workshop/img/word2vec.png" width="400">
Reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
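
To make the idea of "closeness" and vector arithmetic concrete, here is a minimal sketch in plain Python. The "book" and "paper" vectors come from the example above; the four vectors for "man", "woman", "king" and "queen" are made up purely for illustration and are not output from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = {
    "book": (1, 0, 3),   # from the example above
    "paper": (1, 0, 4),  # from the example above
    "man": (1, 0, 0),    # hypothetical toy vectors below
    "woman": (1, 1, 0),
    "king": (1, 0, 1),
    "queen": (1, 1, 1),
}

# Near-synonyms point in almost the same direction
book_paper = cosine(vectors["book"], vectors["paper"])

# Analogy by arithmetic: king - man + woman should land near queen
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vectors[w], target))

print(round(book_paper, 3), best)
```

Real word vectors have many more dimensions (100 in the models trained below), but the similarity and arithmetic work the same way.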

In [None]:
# Read the trump_twitter_tokens.csv file that we created last week
# If you can't find this file on your computer, download it from: https://juniorworld.github.io/python-workshop/doc/trump_twitter_tokens.csv
import csv
tokens=[]
# "with" closes the file automatically once all rows are read
with open('trump_twitter_tokens.csv','r') as file:
    reader = csv.reader(file)
    for row in reader:
        tokens.append(row)

In [None]:
#tokens is a list of documents
#each row: one document (tweet)
#each element in a row: one word
tokens[0]

In [None]:
from gensim.models import Word2Vec

In [None]:
model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
# Get 10 synonyms most similar to a target word in the word space
# Try: cnn, clinton, biden, election, china
model.wv.most_similar('cnn', topn=10)

In [None]:
# Get synonyms based on a target vector 
# "tariff" is to "china" what XX is to "usa": XX = tariff - china + usa
model.wv.most_similar(positive=['tariff','usa'], negative=['china'])

In [None]:
# Find the least similar word in a list of words
model.wv.doesnt_match(['china','usa','canada','cnn','mexico'])

In [None]:
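# Evaluate how likely Trump is to say such a sentence on his Twitter account
# score() returns one log-probability per sentence: values closer to 0 mean more likely
# (it requires a model trained with hs=1 and negative=0, as above)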
model.score(['cnn is friend'.split(),
             'cnn is enemy'.split()])

In [None]:
model.score(['china is friend'.split(),
             'china is enemy'.split()])

### Exercise
Load the user functions that we defined last week

In [None]:
import re
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
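nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') # needed by nltk.pos_tag(); newer NLTK releases name this resource 'averaged_perceptron_tagger_eng'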
Lemmatizer=WordNetLemmatizer()

In [None]:
def data_cleaning(text):
    text=text.lower()
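    text=re.sub(r'[0-9]+','',text) #remove digits
    text=re.sub(r'@[^ ]+','',text) #remove mentions
    text=re.sub(r'#[^ ]+','',text) #remove hashtags
    text=re.sub(r'https?:[^ ]+','',text) #remove hyperlinks (http and https)
    text=re.sub(r'[\W]+',' ',text) #replace punctuation and other non-word characters with a space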
    text=text.strip()
    return(text)

In [None]:
file_eng=open("stop_words_eng.txt",'r')
stopwords_eng=[i.strip() for i in file_eng.readlines()]
file_eng.close()

In [None]:
def remove_stopwords_eng(words):
    words_rm=[]
    for word in words:
        if word not in stopwords_eng:
            words_rm.append(word)
    return(words_rm)

In [None]:
def lemmatization(words_rm):
    words_lemma=[]
    for word,pos in nltk.pos_tag(words_rm):
        if pos[0]=='N':
            words_lemma.append(Lemmatizer.lemmatize(word))
        elif pos[0]=='J':
            words_lemma.append(Lemmatizer.lemmatize(word,'a'))
        elif pos[0]=='V':
            words_lemma.append(Lemmatizer.lemmatize(word,'v'))
        else:
            words_lemma.append(word)
    return(words_lemma)

Predict the rating of an Amazon Alexa review comment
1. Open this data frame: https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv
2. Train two Word2Vec models, one on 5-star reviews and one on non-5-star reviews
3. Predict the rating of this comment: "It doesn't answer me sometimes and give me some irrelevant answers quite often"

In [None]:
alexa_reviews=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv',sep='\t')

In [None]:
alexa_reviews.head()

In [None]:
# How many 5-star, 4-star, 3-star, 2-star, 1-star ratings?


In [None]:
five_stars_comments=[]
for text in : #fill in your code
    text=data_cleaning(text)
    #write your code here
    
    
    five_stars_comments.append(words_lemma)

In [None]:
non_five_stars_comments=[]
for text in : #fill in your code
    text=data_cleaning(text)
    #write your code here
    
    non_five_stars_comments.append(words_lemma)

In [None]:
five_star_model=Word2Vec(sentences=five_stars_comments, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)
non_five_star_model=Word2Vec(sentences=non_five_stars_comments, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
five_star_model.score([data_cleaning("It doesn't answer me sometimes and give me some irrelevant answers quite often").split()])

In [None]:
non_five_star_model.score([data_cleaning("It doesn't answer me sometimes and give me some irrelevant answers quite often").split()])
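
The two cells above each return a log-probability, so the prediction comes from comparing them: the model that gives the higher (less negative) score finds the comment more plausible. The sketch below shows that comparison in plain Python. The two numbers are made up for illustration and are not real model output, and `compare_models` is a hypothetical helper, not part of gensim.

```python
import math

def compare_models(log_scores):
    """Pick the label with the highest log-probability and convert the
    scores into relative shares (a softmax, assuming equal priors)."""
    best = max(log_scores, key=log_scores.get)
    top = log_scores[best]
    # Subtract the maximum before exponentiating to avoid underflow
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    shares = {label: w / total for label, w in weights.items()}
    return best, shares

# Hypothetical scores standing in for the output of the two score() calls
example_scores = {"5-star": -152.3, "non-5-star": -140.8}
best_label, shares = compare_models(example_scores)
print(best_label, shares)
```

Treat the result as a rough signal: the two models are trained on different amounts of text, and the comment should ideally be preprocessed in the same way as the training reviews before it is scored.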

# Chinese NLP

### What's Special about Chinese Language?
1. First-run Cleaning: No need to convert <font style="color: blue">letter case</font>
    - A data_cleaning() function written solely for Chinese would be simpler than the English one, because it can drop the text.lower() line
    - In practice, however, we usually reuse the English version of data_cleaning(), in case the text contains some English characters
2. Tokenization: No <font style="color: blue">natural delimiter</font> such as the space in English. We need to rely on a language model to split text into words.
3. Second-run Cleaning: No need to <font style="color: blue">stem/lemmatize</font> words
4. Vectorization: Identical to English

#### 1. First-run Data Cleaning
- Main task: Remove punctuation and special elements such as hashtags and hyperlinks
- Use Regular Expression for Pattern Matching
- No need to convert cases
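
As a minimal sketch of the points above, the function below repeats the same regular-expression steps as the English data_cleaning() and applies them to a Chinese sentence. The name `clean_text`, the mention `@某用戶` and the example.com link are made up for illustration.

```python
import re

def clean_text(text):
    """Same steps as the English cleaning function, written with raw strings."""
    text = text.lower()                      # no effect on Chinese characters
    text = re.sub(r'[0-9]+', '', text)       # digits
    text = re.sub(r'@[^ ]+', '', text)       # mentions
    text = re.sub(r'#[^ ]+', '', text)       # hashtags
    text = re.sub(r'https?:[^ ]+', '', text) # hyperlinks
    text = re.sub(r'[\W]+', ' ', text)       # punctuation, incl. full-width marks
    return text.strip()

sample = '普京表示，歡迎中方在化解危機中的建設性角色！ @某用戶 https://example.com/abc'
cleaned = clean_text(sample)
print(cleaned)
```

In Python 3, `\W` is Unicode-aware for string patterns, so Chinese characters count as word characters and are kept, while full-width punctuation such as `，` and `！` is replaced by a space.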

In [None]:
#Also works for Chinese text
re.sub(r'[\W]+',' ','普京表示，歡迎中方在化解危機中的建設性角色！')

In [None]:
#test the data_cleaning() function with a Weibo post
a="各國應轟炸俄羅斯境內“暈輸線”……當年“炮擊金門”很久，最後因美國切斷了“廈門車站”運輸線，炮擊金門才止。（而不應去烏克蘭建軍工廠：俄會集中火力轟炸。）@美国驻华大使馆 @英國駐華使館 @歐盟在中國 @烏克蘭信使"


#### 2. Tokenization
- Definition: tokenization is a process of splitting sentences/paragraphs/documents into a set of words.
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)

We will use a package called "jieba" to tokenize Chinese text.<br>
<br>
**Why jieba?**
- It adopts a hybrid method that combines statistical/probabilistic inference with dictionary-based pattern matching.
    - able to recognize words that exist in the pre-defined dictionary
    - able to find new words.
- Two dictionaries:
    - System dictionary
        - Simplified Chinese
        - Simplified+Traditional Chinese
    - User dictionary
- Syntax:
>```python
jieba.cut(sentence) #result is a generator of words; wrap it in list() or use jieba.lcut(sentence) to get a list
```
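
To give a feel for the dictionary-matching half of this approach, here is a toy forward-maximum-matching segmenter in plain Python. It is a deliberately simplified stand-in and not jieba's actual algorithm, which also relies on statistical inference as noted above. The function name and the tiny vocabularies are made up for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary
    word; fall back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

system_vocab = {'高雄', '市長', '韓國', '本月'}   # hypothetical mini dictionary
sentence = '高雄市長韓國瑜本月'

before = forward_max_match(sentence, system_vocab)
# Adding one user-defined entry changes how the name is segmented
after = forward_max_match(sentence, system_vocab | {'韓國瑜'})

print(before)
print(after)
```

Without the extra entry the name is split after the common word 韓國; with it, the full name is kept as one token. This is the same effect a user dictionary is meant to achieve.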

In [None]:
! pip3 install jieba

In [None]:
import jieba

In [None]:
list(jieba.cut('你好，这是一个简单的句子。'))

In [None]:
#it can also segment traditional Chinese text by using statistical inference.
list(jieba.cut('你好，這是一個簡單的句子。'))

In [None]:
#however, statistical inference is not always perfect.
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
list(jieba.cut('谈判搁置，工会号召静坐。'))

How could we improve statistical inference for the tokenization?<br>
**Human in the loop**: Provide human-defined dictionary to constrain and fine-tune the statistical inference.
##### Solution: Configure Dictionaries
- Two types of dictionaries:
    1. System dictionary: General purpose
    2. User dictionary: Special context, e.g. dictionaries for emotion, incivility, war
- What does the dictionary look like?
    - Don't confuse it with the Python data type “dictionary”
    - A dictionary is a plain text file
    - One keyword per line, similar to the stopword list/file
    - [Optional] Words may also be weighted, i.e. followed by a number/decimal that indicates the importance of the word
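
As a rough sketch of the file layout described above, the helper below reads dictionary lines into a Python dict of word -> weight. It assumes the generic comma-separated layout used in this section's examples, which is not necessarily what a particular tokenizer expects. `parse_dictionary` and the default weight of 1.0 are hypothetical choices.

```python
def parse_dictionary(lines, default_weight=1.0):
    """Turn 'word' or 'word,weight' lines into a {word: weight} mapping."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so that words containing commas survive
        word, sep, weight = line.rpartition(',')
        if sep:
            try:
                entries[word.strip()] = float(weight)
                continue
            except ValueError:
                pass  # the text after the comma is not a number
        entries[line] = default_weight
    return entries

sample_lines = [
    "China,3",
    "People's Republic of China, 4",
    "China Central Television",   # no weight given
]
weights = parse_dictionary(sample_lines)
print(weights)
```

In a real workflow the lines would come from `open(path, encoding='utf-8')` rather than a hard-coded list.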

>```python
#Way 1: no weight: all words are created equal
China
People's Republic of China
China Central Television
```

>```python
#Way 2: with weights: words are treated unequally. Higher weight, higher priority
China,3
People's Republic of China,4
China Central Television,4
```


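- In jieba, you can load dictionaries with the following syntax. Note that jieba's own user dictionary format separates a word from its optional frequency with a space rather than a comma.
>```python
jieba.set_dictionary("path_of_system_dict") 
jieba.load_userdict("path_of_user_dict")
```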


To segment traditional Chinese text better, we need to switch to a system dictionary that includes traditional Chinese words.<br>
Download the system dictionary from this link: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

In [None]:
#load traditional Chinese system dictionary
jieba.set_dictionary('dict.txt.big')

In [None]:
#try tokenizing this sentence again
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
#Some names and special terminologies cannot be properly identified.
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('蔡英文日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #special terminologies

In [None]:
#Use a for loop to build your user dictionary (time-consuming)
file=open('user_dict.txt','w',encoding='utf-8')
keywords=['林鄭月娥','蔡英文','韓國瑜','汶萊達魯薩蘭國']
#Write your loop here


file.close()

In [None]:
#Use your user dictionary
jieba.load_userdict('user_dict.txt')

In [None]:
#After loading user dictionary:
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('蔡英文日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #terminologies

#### 3. Remove stop words
<font style="color:red">English stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_eng.txt</font><br>
Stop words are very common words that carry little meaning on their own and contribute little to understanding a text.<br>
- English: at, in, on, for, of, a, an, the...<br>
- Chinese: 的，地，得，了.<br>
- Solution: Membership check - Check whether a word is in the predefined stopword list
>```python
x in [word1, word2, word3]
x not in [word1, word2, word3]
```
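
A minimal sketch of the membership check in action. The stop word set and token list below are tiny made-up samples, not the contents of the course files. Storing the stop words in a `set` rather than a list is an optional choice that makes each `in` check faster on long lists.

```python
# Hypothetical mini stop word list mixing Chinese and English entries
stop_words = {'的', '了', 'the', 'of'}

tokens = ['我', '的', '書', '了']

is_stop = '的' in stop_words          # membership check
is_content = '書' not in stop_words   # negated membership check

# Keep only the tokens that are not stop words
kept = [t for t in tokens if t not in stop_words]
print(is_stop, is_content, kept)
```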

Chinese stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_chi.txt

In [None]:
#load stop word list
file_chi=open('stop_words_chi.txt','r',encoding='utf-8')
stop_words_chi=[i.strip() for i in file_chi.readlines()]

In [None]:
len(stop_words_chi) #much longer and more detailed than the English stop word list

In [None]:
#have a look at a few entries in the stop word list
stop_words_chi[34:39]

In [None]:
paragraph='Facebook CEO 馬克·朱克伯格（Mark Zuckerberg）週三發布了一篇長文，闡述了要將 Facebook 打造成「以隱私為中心的平台」的願景，並表示將打通 Messenger、Instagram 和 WhatsApp 用戶之間的交流阻礙。朱克伯格表示，他相信未來人們的溝通行為會更多轉向私人加密服務，也未必希望他們分享的所有內容都被永遠保存在互聯網上——後者對於每個人來說，既可能是財富，也可能是負擔。因此，儘管 Facebook 長期以來專注於打造開放、分享的社區平台，但他認為，以隱私為中心的通信平台會比當今的開放平台更加重要。'
# Step 1: Remove punctuation
# Step 2: Tokenize the sentence
# Step 3: Remove stop words
# Save the token list as words
#------------------------------------



In [None]:
#count word frequency
pd.Series(words).value_counts()

Define a user function called remove_stopwords_chi()<br>
For your reference, here is the user function for removing stopwords in English text:
>```python
def remove_stopwords_eng(words):
    words_rm=[]
    for word in words:
        if word not in stopwords_eng:
            words_rm.append(word)
    return(words_rm)
```

In [None]:
#Write your code here



In [None]:
#Create a generalized function for stopword removal, which can be applied to any language



<h3 style='color:blue'>Practice</h3>

Find the top 10 fade-in and fade-out words between two speeches.<br>
The size of the change is measured by the difference in relative frequencies:<br>
<p style='text-align:center;font-size:15px;'>Relative Freq (RF) = word frequency / sum of word frequencies</p>
<p style='text-align:center;font-size:15px;'>Difference = RF<font size='2px'>2019</font> - RF<font size='2px'>2009</font></p>
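
To make the two formulas above concrete, here is a small worked example in plain Python using `collections.Counter`. The two token lists are made up and merely stand in for the 2009 and 2019 texts. The practice below uses pandas instead, so this is only a sketch of the arithmetic.

```python
from collections import Counter

def relative_freq(tokens):
    """Relative frequency = word frequency / sum of word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical token lists standing in for the two years
tokens_2009 = ['crisis', 'growth', 'crisis', 'bank']
tokens_2019 = ['growth', 'growth', 'growth', 'innovation']

rf_2009 = relative_freq(tokens_2009)
rf_2019 = relative_freq(tokens_2019)

# A word missing from one year counts as 0 (the role of fillna(0) in pandas)
all_words = set(rf_2009) | set(rf_2019)
diff = {w: rf_2019.get(w, 0) - rf_2009.get(w, 0) for w in all_words}

# Largest positive difference first: fade-in words lead, fade-out words trail
ranked = sorted(diff, key=diff.get, reverse=True)
print(ranked)
```

Here "growth" rises from 0.25 to 0.75 (difference +0.5), while "crisis" drops from 0.5 to 0 (difference -0.5).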

Options:<br>
- Chinese: Annual government work reports, <a href="https://juniorworld.github.io/python-workshop/doc/2019_Government_Work_Report.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop/doc/2009_Government_Work_Report.txt">2009</a>
- English: State of the Union address, <a href="https://juniorworld.github.io/python-workshop/doc/2019_SoU.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop/doc/2009_SoU.txt">2009</a><br>

*Hint:*<br>
*1. Use `pd.concat([df1,df2],axis=1)` to combine two dataframes by columns and `pd.concat([df1,df2],axis=0)` to combine two dataframes by rows*<br>
*2. Use `df[column_name].value_counts()` to count the items in a column.*<br>
*3. Use `df.sort_values(column_name,ascending=True)` to sort a certain column. To get a reversed list, you can set ascending=False* <br>
*4. Use `df.fillna(0)` to replace NaN values with 0.*

In [None]:
freq_words=pd.Series(words).value_counts()
freq_words/freq_words.sum() #relative frequency = word frequency / sum of word frequencies

In [None]:
#Read Chinese files
Chi_file_2019=open('2019_Government_Work_Report.txt','r',encoding='utf-8')
Chi_file_2009=open('2009_Government_Work_Report.txt','r',encoding='utf-8')

In [None]:
#CHI 2019
#Step1: Clean text
#Step2: Tokenize text
#Step3: Remove stopwords
#Step4: Add the current word list to Chi_words_2019
#---------------------------------------------------
Chi_words_2019=[]
for line in Chi_file_2019.readlines(): #each line is a paragraph
    

In [None]:
#CHI 2009
Chi_words_2009=[]
for line in Chi_file_2009.readlines():
    

In [None]:
#Count word frequencies
Chi_freq_2019=
Chi_freq_2009=

In [None]:
#Count relative frequency
#Hint: You can use sum() function or .sum() method to get the total of a numeric column
relative_freq_2019=
relative_freq_2009=

In [None]:
#Combine relative_freq_2019 and relative_freq_2009, using pd.concat() function
relative_freq=

In [None]:
#Fill missing values
relative_freq=relative_freq.fillna(0)

In [None]:
#Calculate the frequency difference
relative_freq['diff']=

In [None]:
#Change column names
relative_freq.columns=['2019','2009','diff']

In [None]:
#Sort table by column 'diff'
#Fade in words: words that are more common in 2019 report
relative_freq.sort_values(your_code)

In [None]:
#Fade out words: words that are more common in 2009 report


In [None]:
#Read English files
Eng_file_2019=open('2019_SoU.txt','r',encoding='utf-8')
Eng_file_2009=open('2009_SoU.txt','r',encoding='utf-8')