<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week7.png" width="350px">

---

# Review of Data Collection

<img src="https://juniorworld.github.io/python-workshop-2018/img/data%20collection.png" width="400px" align='left'>

---

# Natural Language Processing

<img src="https://juniorworld.github.io/python-workshop-2018/img/NLP_.png" width="700px" height="400px" align='left'>

We will demonstrate how to go through these five steps for English and Chinese texts respectively.

## 1. Data Cleaning
- Main tasks: normalize letter case; remove punctuation and special elements such as hashtags and hyperlinks
- Use regular expressions for pattern matching
- Convert the case: `.lower()`

### Regular Expression Cheat Sheet
- `.` matches any single character
- `[...]` a character class: matches any one of the characters inside the square brackets
- `[^x]` matches any one character that is not x
- `|` an “or” operator: matches the pattern on either side of the `|`
- `*` matches the preceding element zero or more times
- `+` matches the preceding element one or more times
- `?` matches the preceding element zero or one time
- `{n}` matches the preceding element exactly n times
- `(...)` groups part of a pattern and captures what it matches
- `\\N` backreference to group N, e.g. `\\1` repeats whatever group 1 matched
- `^` matches the start of the string.
- `$` matches the end of the string.
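
The operators in the cheat sheet can be tried out on a short string. The sketch below is only an illustration: it uses the standard-library `re` module (enough for these basic operators), and the sample strings and variable names are made up for the demonstration.

```python
import re  # the standard-library module covers these basic operators

text = 'aaa bbb 2019 cat cot cut'

any_char   = re.findall(r'c.t', text)         # .      -> ['cat', 'cot', 'cut']
char_class = re.findall(r'c[ao]t', text)      # [...]  -> ['cat', 'cot']
negated    = re.findall(r'c[^a]t', text)      # [^x]   -> ['cot', 'cut']
either     = re.findall(r'cat|cut', text)     # |      -> ['cat', 'cut']
one_plus   = re.findall(r'[0-9]+', text)      # +      -> ['2019']
exactly_2  = re.findall(r'[0-9]{2}', text)    # {n}    -> ['20', '19']
optional   = re.findall(r'colou?r', 'color colour')  # ? -> ['color', 'colour']
collapsed  = re.sub(r'(a)\1+', r'\1', text)   # (...) and \1 -> 'a bbb 2019 cat cot cut'
at_start   = re.findall(r'^a+', text)         # ^      -> ['aaa']
at_end     = re.findall(r'c.t$', text)        # $      -> ['cut']
```

Writing patterns as raw strings (`r'...'`) keeps Python from interpreting the backslashes before the regular expression engine sees them.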

<h3 style="color:red">1a. English</h3>

In [1]:
#install regular expression package for pattern matching
! pip3 install regex





In [2]:
import regex as re

In [3]:
#Use sub() to find every match of a pattern and replace it with a new string
a='I only have 100 dollars in my pocket. What I can buy?'
re.sub('I','You',a) #substitute a word with another word

'You only have 100 dollars in my pocket. What You can buy?'

In [4]:
#Substitute a word with nothing, meaning removing the word
re.sub('I','',a)

' only have 100 dollars in my pocket. What  can buy?'

In [5]:
#Match a group of characters by using []. "A-Z" means all capital letters.
re.sub('[A-Z]','*',a)

'* only have 100 dollars in my pocket. *hat * can buy?'

In [6]:
#Remove all letters in lower case
re.sub('[a-z]','',a)

'I   100    . W I  ?'

In [7]:
#Hide the numbers
re.sub('[0-9]','*',a)

'I only have *** dollars in my pocket. What I can buy?'

In [8]:
#Remove all alphanumeric characters
re.sub('[0-9a-zA-Z]','',a)

'       .    ?'

In [10]:
#Replace all characters that are not alphanumeric with a space
re.sub('[^0-9a-zA-Z]',' ',a)

'I only have 100 dollars in my pocket  What I can buy '

In [11]:
#Shortcut to replace all punctuation with a space
re.sub(r'\p{P}+',' ',a) #\p{...} matches a Unicode property and P is the punctuation category. It needs the regex package; the built-in re module does not support \p.

'I only have 100 dollars in my pocket  What I can buy '
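
If the third-party `regex` package is not available, a similar effect can be approximated with the standard library. The sketch below is a swapped-in technique, not the workshop's method: it checks each character's Unicode category with `unicodedata` and treats every category starting with `P` as punctuation. The helper name `strip_punctuation` is hypothetical.

```python
import unicodedata

def strip_punctuation(text):
    # Replace each character whose Unicode category starts with 'P'
    # (Po, Pd, Ps, ...) with a space; everything else is kept as-is.
    return ''.join(
        ' ' if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )

a = 'I only have 100 dollars in my pocket. What I can buy?'
cleaned = strip_punctuation(a)          # same result as the regex version above
cleaned_chi = strip_punctuation('你好，世界。')  # full-width punctuation is covered too
```

Unlike `\p{P}+`, this version replaces every punctuation character separately, so a run such as `....` becomes four spaces rather than one. Symbols such as `$` or `+` belong to the `S` categories and are left untouched in both versions.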

In [12]:
#Remove hashtags
a='''@JerryNadler admits on #CNN they have no proof of Obstruction by @realDonaldTrump it's just his "personal opinion" Meet the new #WitchHunt Same as the old #WitchHunt cc @DonaldJTrumpJr'''
re.sub('#[^ ]+','',a)

'@JerryNadler admits on  they have no proof of Obstruction by @realDonaldTrump it\'s just his "personal opinion" Meet the new  Same as the old  cc @DonaldJTrumpJr'

In [13]:
#Extract hashtags by findall() function
re.findall('#[^ ]+',a)

['#CNN', '#WitchHunt', '#WitchHunt']

In [16]:
#Extract all mentions
re.findall('@[^ ]+',a)

['@JerryNadler', '@realDonaldTrump', '@DonaldJTrumpJr']

In [17]:
#Remove hyperlinks
a='Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future. https://goo.gl/rwGHTP'
re.sub('https://[^ ]+|http://[^ ]+','',a)

'Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future. '

In [18]:
#transform all letters to lower case
a.lower()

'tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future. https://goo.gl/rwghtp'

<h3 style='color:blue'>Practice</h3>

Create a data_cleaning() function that converts text to lower case and removes punctuation, numbers, mentions, hashtags and hyperlinks.

In [23]:
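def data_cleaning(text):
    text=text.lower()                                    #convert to lower case
    text=re.sub(r'[0-9]+','',text)                       #remove numbers
    text=re.sub(r'@[^ ]+','',text)                       #remove mentions
    text=re.sub(r'#[^ ]+','',text)                       #remove hashtags
    text=re.sub(r'http://[^ ]+|https://[^ ]+','',text)   #remove hyperlinks
    text=re.sub(r'\p{P}+',' ',text)                      #replace punctuation last, so '@', '#' and '://' are still present for the patterns above
    return text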

In [24]:
#test your function with a post from @realDonaldTrump
a='@seanhannity “We the people will now be subjected to the biggest display of modern day McCarthyism....which is the widest fishing net expedition....every aspect of the presidents life....all in order to get power back so they can institute Socialism.” https://t.co/izb2tTrINB'
data_cleaning(a)

'  we the people will now be subjected to the biggest display of modern day mccarthyism which is the widest fishing net expedition every aspect of the presidents life all in order to get power back so they can institute socialism  '

In [27]:
#the same cleaning as a single expression, without lower-casing or removing numbers
re.sub(r'\p{P}+',' ',re.sub(r'http://[^ ]+|https://[^ ]+|@[^ ]+|#[^ ]+','',a))

'  We the people will now be subjected to the biggest display of modern day McCarthyism which is the widest fishing net expedition every aspect of the presidents life all in order to get power back so they can institute Socialism  '
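
The order of the substitutions matters: mentions, hashtags and hyperlinks are recognized by their punctuation (`@`, `#`, `://`), so they have to be removed before the punctuation itself. The sketch below illustrates this with the standard-library `re` module. As an assumption made only to avoid the third-party package, `[^\w\s]+` stands in for `\p{P}+`, and the sample tweet is made up.

```python
import re

PUNCT = r'[^\w\s]+'   # stand-in for \p{P}+ (an assumption, so that stdlib re suffices)
TAGS_AND_LINKS = r'https?://[^ ]+|@[^ ]+|#[^ ]+'

tweet = 'Thanks @nytimes for the story #tech https://goo.gl/rwGHTP'

# Markers first, punctuation second: mentions, hashtags and links disappear completely
good = re.sub(PUNCT, ' ', re.sub(TAGS_AND_LINKS, '', tweet))

# Punctuation first: '@', '#' and '://' are already gone, so the second pattern finds nothing
bad = re.sub(TAGS_AND_LINKS, '', re.sub(PUNCT, ' ', tweet))

print(good.split())  # ['Thanks', 'for', 'the', 'story']
print(bad.split())   # ['Thanks', 'nytimes', 'for', 'the', 'story', 'tech', 'https', 'goo', 'gl', 'rwGHTP']
```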

---
## Break
---

## Tokenization
- Definition: tokenization is the process of splitting sentences, paragraphs or documents into a sequence of words (tokens).
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)

## Tokenize English Text: Hunt for Spaces

In [66]:
#Split the following sentence into words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
sentence=data_cleaning(sentence)
words=sentence.split(' ')
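
Because the cleaning step replaces punctuation with spaces, the cleaned text often contains runs of spaces, and `split(' ')` then returns empty strings as tokens. A minimal sketch of the difference, using a made-up string: calling `split()` without an argument splits on any run of whitespace and drops the empty tokens.

```python
cleaned = '  we the people   will now '

with_empties = cleaned.split(' ')   # every single space is a separator
without_empties = cleaned.split()   # any run of whitespace is one separator

print(with_empties)     # ['', '', 'we', 'the', 'people', '', '', 'will', 'now', '']
print(without_empties)  # ['we', 'the', 'people', 'will', 'now']
```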

In [30]:
import pandas as pd

<div class="alert alert-block alert-success">
    <b>Extra Knowledge:</b> We can use the function <font style='color:red;font-weight:bold;'>gensim.parsing.preprocessing.stem_text(text)</font> to stem the words in a sentence.</div>

## Tokenize Chinese Text

We will use the package "jieba" to tokenize Chinese text.<br>
<br>
**Why jieba?**
- It adopts a hybrid method that combines statistical/probabilistic inference with dictionary-based pattern matching.
    - able to recognize words that exist in the pre-defined dictionary
    - able to find new words
- Two dictionaries:
    - System dictionary
        - Simplified Chinese
        - Simplified+Traditional Chinese
    - User dictionary

In [32]:
! pip3 install jieba





In [33]:
import jieba

In [34]:
list(jieba.cut('你好，这是一个简单的句子。'))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\yuner\AppData\Local\Temp\jieba.cache
Loading model cost 0.716 seconds.
Prefix dict has been built succesfully.


['你好', '，', '这是', '一个', '简单', '的', '句子', '。']

In [35]:
#it can also segment traditional Chinese text, by using statistical inference.
list(jieba.cut('你好，這是一個簡單的句子。'))

['你好', '，', '這是', '一個', '簡單', '的', '句子', '。']

In [36]:
#however, statistical inference is not perfect.
list(jieba.cut('談判擱置，工會號召靜坐。'))

['談判', '擱置', '，', '工會號', '召靜', '坐', '。']

In [37]:
list(jieba.cut('谈判搁置，工会号召静坐。'))

['谈判', '搁置', '，', '工会', '号召', '静坐', '。']

## Configure Dictionaries

To segment traditional Chinese text better, we need to replace the default system dictionary with a larger one that also includes traditional Chinese words.<br>
Download the system dictionary from this link: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

In [39]:
#load another dictionary to support traditional Chinese
jieba.set_dictionary('C:\\Users\\yuner\\AppData\\Local\\Programs\\Python\\Python36\\Lib\\site-packages\\jieba\\dict.txt.big')

In [40]:
#try tokenizing this sentence again
list(jieba.cut('談判擱置，工會號召靜坐。'))

Building prefix dict from C:\Users\yuner\AppData\Local\Programs\Python\Python36\Lib\site-packages\jieba\dict.txt.big ...
Loading model from cache C:\Users\yuner\AppData\Local\Temp\jieba.u425828d27b9dbcd864ed100138e410f6.cache
Loading model cost 1.489 seconds.
Prefix dict has been built succesfully.


['談判', '擱置', '，', '工會', '號召', '靜坐', '。']

In [41]:
#Some names and special terms are not segmented correctly.
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('台灣蔡英文總統日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #special terminologies

['中央', '上周二', '向', '特首', '林', '鄭月', '娥', '發', '公函']
['台灣', '蔡', '英文', '總統', '日前', '表示', '希望', '與', '日本', '舉行', '安保', '對話']
['高雄', '市長', '韓國', '瑜', '本月', '稍後', '訪問', '港澳', '深圳', '廈門', '四市']
['汶萊', '的', '全稱', '為汶萊', '達魯', '薩蘭國', '。']


In [42]:
#Build your user dictionary (time-consuming)
file=open('user_dict.txt','w',encoding='utf-8')
file.write('林鄭月娥\n')
file.write('蔡英文\n')
file.write('韓國瑜\n')
file.write('汶萊達魯薩蘭國\n')
file.close()
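
The same file can be written more compactly with a `with` block, which closes the file automatically even if a write fails. This is only a sketch: the helper name `write_user_dict` and the temporary demo path are hypothetical. It writes one word per line, as the cell above does; jieba's README also describes optional frequency and part-of-speech fields after each word, which are left out here.

```python
import os
import tempfile

def write_user_dict(path, words):
    # One entry per line; the 'with' block closes the file when it ends
    with open(path, 'w', encoding='utf-8') as f:
        for word in words:
            f.write(word + '\n')

# demo_path is a hypothetical location chosen only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'user_dict.txt')
write_user_dict(demo_path, ['林鄭月娥', '蔡英文', '韓國瑜', '汶萊達魯薩蘭國'])

# Read the file back to check what was written
with open(demo_path, encoding='utf-8') as f:
    entries = f.read().splitlines()
print(entries)
```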

In [43]:
#Use your user dictionary
jieba.load_userdict('user_dict.txt')

In [45]:
#After loading user dictionary:
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('台灣蔡英文總統日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #terminologies

['中央', '上周二', '向', '特首', '林鄭月娥', '發', '公函']
['台灣', '蔡英文', '總統', '日前', '表示', '希望', '與', '日本', '舉行', '安保', '對話']
['高雄', '市長', '韓國瑜', '本月', '稍後', '訪問', '港澳', '深圳', '廈門', '四市']
['汶萊', '的', '全稱', '為', '汶萊達魯薩蘭國', '。']


## Remove stop words
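Stop words are very common function words that carry little meaning on their own, so they are usually removed before analysis.<br>
In English: at, in, on, for, of, a, an, the...<br>
In Chinese: 的，地，得，了.<br>

However, 不得了 ("terrific", "extremely") is not a stop word, even though it contains 得 and 了: it expresses strong praise or emphasis.<br>
√ Absolute match (the whole token equals a stop word). × Pattern matching (the token merely contains one).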

In [46]:
'a' in ['a','b','c']

True

In [47]:
'a' in ['aa','b','c']

False

In [48]:
'a' not in ['aa','b','c']

True
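
To see why absolute matching is preferred, the sketch below filters the same tokens in both ways. The stop-word set and the token list are tiny made-up examples, not the workshop files. A `set` is used because membership tests on a set stay fast even when the stop-word list is long.

```python
import re

stop_words = {'的', '地', '得', '了'}            # tiny illustrative set
tokens = ['好', '得', '不得了', '的', '平台']     # made-up tokens

# Absolute match: drop a token only if the whole token is a stop word
exact = [t for t in tokens if t not in stop_words]

# Pattern matching: drop a token if it contains any stop-word character
by_pattern = [t for t in tokens if not re.search('[的地得了]', t)]

print(exact)       # ['好', '不得了', '平台']
print(by_pattern)  # ['好', '平台']  (不得了 is wrongly removed)
```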

Chinese stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_chi.txt<br>
English stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_eng.txt

In [49]:
file_chi=open('C:\\Users\\yuner\\AppData\\Local\\Programs\\Python\\Python36\\Lib\\site-packages\\jieba\\stop_words_chi.txt','r',encoding='utf-8')

In [50]:
stop_words_chi=[]
for line in file_chi.readlines():
    line=line.strip() #remove line break
    stop_words_chi.append(line) #update the list of stop words line by line
file_chi.close()

In [51]:
len(stop_words_chi)

758

In [52]:
file_eng=open('C:\\Users\\yuner\\AppData\\Local\\Programs\\Python\\Python36\\Lib\\site-packages\\jieba\\stop_words_eng.txt','r')

In [53]:
stop_words_eng=[]
for line in file_eng.readlines():
    line=line.strip() #remove line break
    stop_words_eng.append(line) #update the list of stop words line by line
file_eng.close()

In [54]:
len(stop_words_eng)

128

### Absolute Match of Stop words

In [55]:
sentence='Facebook将向加密通信转型，打造以隐私为中心的平台。'
words=list(jieba.cut(sentence))
words_new=[]
for word in words:
    if word not in stop_words_chi:
        words_new.append(word)

In [56]:
words_new

['Facebook', '加密', '通信', '转型', '打造', '隐私', '中心', '平台']

In [57]:
#list comprehension: a for loop written inside a list
a=[1,2,3,4,5]
b=[i+1 for i in a]           #increase element by one

In [58]:
#list comprehension with a condition: a for loop and an if statement inside a list
a=[1,2,3,4,5]
b=[i for i in a if i<4]
b

[1, 2, 3]

In [59]:
words_new=[word for word in list(jieba.cut(sentence)) if word not in stop_words_chi]

In [60]:
words_new

['Facebook', '加密', '通信', '转型', '打造', '隐私', '中心', '平台']

In [61]:
#Clean and tokenize this sentence and remove the stop words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
words_new=[word for word in data_cleaning(sentence).split(' ') if word not in stop_words_eng]


In [62]:
words_new

['mr',
 '',
 'zuckerberg',
 '',
 'runs',
 'facebook',
 '',
 'instagram',
 '',
 'whatsapp',
 'messenger',
 '',
 'wednesday',
 'expressed',
 'intentions',
 'change',
 'essential',
 'nature',
 'social',
 'media',
 '',
 'instead',
 'encouraging',
 'public',
 'posts',
 '',
 'said',
 'would',
 'focus',
 'private',
 'encrypted',
 'communications',
 '',
 'users',
 'message',
 'mostly',
 'smaller',
 'groups',
 'people',
 'know',
 '',
 'unlike',
 'publicly',
 'shared',
 'posts',
 'kept',
 'users',
 '',
 'permanent',
 'records',
 '',
 'communications',
 'could',
 'also',
 'deleted',
 'certain',
 'period',
 'time',
 '']

<h3 style='color:blue'>Practice</h3>

Find the 10 words that faded in the most and the 10 words that faded out the most between two speeches.<br>
The magnitude of difference is measured by the change in their relative frequencies:<br>
<p style='text-align:center;font-size:15px;'>Relative Freq (RF) = word frequency / max word frequency</p>
<p style='text-align:center;font-size:15px;'>Difference = RF<font size='2px'>2019</font> - RF<font size='2px'>2009</font></p>

Options:<br>
- Chinese: Annual government work reports, <a href="https://juniorworld.github.io/python-workshop-2018/doc/2019_Government_Work_Report.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop-2018/doc/2009_Government_Work_Report.txt">2009</a>
- English: State of the Union address, <a href="https://juniorworld.github.io/python-workshop-2018/doc/2019_SoU.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop-2018/doc/2009_SoU.txt">2009</a><br>

*Hint:*<br>
*1. You can use `pd.concat([df1,df2],axis=1)` to combine two data frames by columns*<br>
*2. You can use `df.fillna(0)` to replace NaN values with 0.*<br>
*3. You can use `df.sort_values(column_name)` to sort the rows by a given column.* 
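
The relative-frequency formula above can be checked by hand on a toy example before it is applied to the real speeches. The sketch below is a swapped-in, standard-library version built on `collections.Counter` rather than the pandas approach suggested in the hints. The word lists and the helper name `relative_freq` are made up for illustration.

```python
from collections import Counter

def relative_freq(words):
    # RF = word frequency / max word frequency
    counts = Counter(words)
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Toy data standing in for the tokenized 2009 and 2019 speeches
words_2009 = ['reform', 'growth', 'growth', 'crisis', 'crisis', 'crisis', 'crisis']
words_2019 = ['reform', 'reform', 'reform', 'reform', 'growth', 'growth', 'innovation']

rf_2009 = relative_freq(words_2009)
rf_2019 = relative_freq(words_2019)

# A word missing from one year counts as 0 there (the role of fillna(0))
vocab = set(rf_2009) | set(rf_2019)
diff = {w: rf_2019.get(w, 0) - rf_2009.get(w, 0) for w in vocab}

ranked = sorted(diff, key=diff.get)   # ascending: most faded-out word first
fade_out = ranked[:1]                 # ['crisis']  (diff = -1.0)
fade_in = ranked[-1:]                 # ['reform']  (diff = 0.75)
```

With the real data, the first 10 and last 10 entries of the ranking would give the fade-out and fade-in words respectively.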

In [69]:
freq_words=pd.Series(words).value_counts()
freq_words/max(freq_words)

                  1.000000
of                0.333333
and               0.166667
users             0.166667
he                0.166667
posts             0.166667
on                0.166667
communications    0.166667
the               0.166667
smaller           0.083333
be                0.083333
could             0.083333
mostly            0.083333
shared            0.083333
wednesday         0.083333
intentions        0.083333
change            0.083333
focus             0.083333
certain           0.083333
instagram         0.083333
private           0.083333
records           0.083333
are               0.083333
deleted           0.083333
as                0.083333
period            0.083333
also              0.083333
groups            0.083333
kept              0.083333
his               0.083333
                    ...   
a                 0.083333
they              0.083333
people            0.083333
message           0.083333
in                0.083333
that              0.083333

In [76]:
file_2019=open('doc/2019_Government_Work_Report.txt','r',encoding='utf-8')
file_2009=open('doc/2009_Government_Work_Report.txt','r',encoding='utf-8')

In [77]:
#CHI
words_new=[]
for line in file_2019.readlines():
    line=line.strip()
    line=data_cleaning(line)
    words=list(jieba.cut(line))
    for word in words:
        if word not in stop_words_chi:
            words_new.append(word)

In [None]:
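#ENG (alternative to the cell above: file_2019 must then be the English speech, 2019_SoU.txt)
words_new=[]
for line in file_2019.readlines():
    line=line.strip()
    line=data_cleaning(line)
    words=line.split(' ')
    for word in words:
        if word not in stop_words_eng: #use the English stop words for English text
            words_new.append(word)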

In [78]:
words_new_2009=[]
for line in file_2009.readlines():
    line=line.strip()
    line=data_cleaning(line)
    words=list(jieba.cut(line))
    for word in words:
        if word not in stop_words_chi:
            words_new_2009.append(word)

In [79]:
words_new_2009

['\ufeff',
 '代表',
 ' ',
 '现在',
 ' ',
 '代表',
 '国务院',
 ' ',
 '大会',
 '作',
 '政府',
 '工作',
 '报告',
 ' ',
 '请予',
 '审议',
 ' ',
 '请',
 '全国政协',
 '委员',
 '提出',
 '意见',
 ' ',
 ' ',
 '年',
 '工作',
 '回顾',
 '年',
 '极',
 '平凡',
 '一年',
 ' ',
 '我国',
 '经济社会',
 '发展',
 '经受',
 '住',
 '历史',
 '罕见',
 '重大',
 '挑战',
 '考验',
 ' ',
 '中国共产党',
 '领导',
 ' ',
 '全国',
 '各族人民',
 '迎难而上',
 ' ',
 '奋力拼搏',
 ' ',
 '战胜',
 '艰难险阻',
 ' ',
 '改革开放',
 '社会主义',
 '现代化',
 '建设',
 '取得',
 '新',
 '重大成就',
 ' ',
 ' ',
 '国民经济',
 '继续',
 '保持',
 '平稳',
 '快',
 '增长',
 ' ',
 '国内',
 '生产总值',
 '超过',
 '万亿元',
 ' ',
 '上年',
 '增长',
 ' ',
 '物价',
 '总',
 '水平',
 '涨幅',
 '得到',
 '控制',
 ' ',
 '财政收入',
 ' ',
 '万亿元',
 ' ',
 '增长',
 ' ',
 ' ',
 '粮食',
 '连续',
 '五年',
 '增产',
 ' ',
 '总产量',
 '万吨',
 ' ',
 '创',
 '历史',
 '最高',
 '水平',
 ' ',
 ' ',
 '改革开放',
 '深入',
 '推进',
 ' ',
 '财税',
 ' ',
 '金融',
 ' ',
 '价格',
 ' ',
 '行政',
 '管理',
 '重点',
 '领域',
 '关键环节',
 '改革',
 '取得',
 '新',
 '突破',
 ' ',
 '进出口',
 '贸易总额',
 ' ',
 '万亿美元',
 ' ',
 '增长',
 ' ',
 ' ',
 '实际',
 '利用',
 '外商',
 '直接',
 '投资',
 '亿美元',
 ' ',
 ' ',
 ...]
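
The first token in the output, `'\ufeff'`, is a byte order mark (BOM) stored at the start of the text file. It is not whitespace, so `strip()` leaves it in place. One possible way to avoid it, sketched below on a hypothetical temporary file, is to open the file with the `utf-8-sig` encoding, which drops a leading BOM when reading.

```python
import os
import tempfile

# A hypothetical demo file that starts with a BOM, like the report file appears to
path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')
with open(path, 'w', encoding='utf-8-sig') as f:   # 'utf-8-sig' writes a BOM first
    f.write('代表\n')

with open(path, encoding='utf-8') as f:
    plain = f.read().strip()       # BOM survives: '\ufeff代表'

with open(path, encoding='utf-8-sig') as f:
    no_bom = f.read().strip()      # BOM removed: '代表'

print(repr(plain), repr(no_bom))
```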

In [80]:
freq_2019=pd.Series(words_new).value_counts()
freq_2009=pd.Series(words_new_2009).value_counts()

In [86]:
relative_freq_2019=freq_2019/max(freq_2019)
relative_freq_2009=freq_2009/max(freq_2009)

In [87]:
relative_freq=pd.concat([relative_freq_2019,relative_freq_2009],axis=1)



In [90]:
relative_freq=relative_freq.fillna(0)

In [92]:
relative_freq['diff']=relative_freq[0]-relative_freq[1]

In [101]:
relative_freq.columns=['2019','2009','diff']

In [102]:
relative_freq.sort_values('diff').tail(10)

| word | 2019 | 2009 | diff |
|---|---|---|---|
| 习近平 | 0.006341 | 0.0 | 0.006341 |
| 风险 | 0.00878 | 0.001712 | 0.007068 |
| 供给 | 0.008293 | 0.001142 | 0.007151 |
| 新 | 0.023415 | 0.014269 | 0.009145 |
| 推动 | 0.015122 | 0.004566 | 0.010556 |
| 五年 | 0.01122 | 0.000571 | 0.010649 |
| 改革 | 0.040976 | 0.02911 | 0.011866 |
| 创新 | 0.02439 | 0.010274 | 0.014116 |
| 全面 | 0.023902 | 0.009703 | 0.014199 |
| 中国 | 0.019512 | 0.004566 | 0.014946 |
