<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week8.png" width="350px">

---

# Natural Language Processing

<img src="https://juniorworld.github.io/python-workshop-2018/img/NLP_.png" width="700px" height="400px" align='left'>

## 4. Vectorization
### What is a vector?
<img src="https://juniorworld.github.io/python-workshop-2018/img/scalar-vector-matrix.jpeg" width='400px' align='left'>

### Types of word representation
- Scalar: a single variable
- One-hot encoding vector: ~ dummy variables
- Distributed embedding vector: word2vec & GloVe

### One-Hot Encoding
- Equivalent to dummy variables
- Apple: [1,0,0]; Banana: [0,1,0]; Grape: [0,0,1]
- A basket of fruit: one apple, one banana and one grape: [1,1,1]
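
The fruit example can be sketched in a few lines of plain Python. The `vocabulary` list and the helper names `one_hot` and `bag_of_words` are made up for illustration; the point is only that a document vector is the sum of the one-hot vectors of its words.

```python
# Hypothetical three-word vocabulary; the order fixes each word's position.
vocabulary = ["apple", "banana", "grape"]

def one_hot(word, vocab):
    """Return a list with 1 at the word's position and 0 elsewhere."""
    return [1 if term == word else 0 for term in vocab]

def bag_of_words(words, vocab):
    """Sum the one-hot vectors of all words in a basket (document)."""
    vector = [0] * len(vocab)
    for word in words:
        for i, value in enumerate(one_hot(word, vocab)):
            vector[i] += value
    return vector

apple = one_hot("apple", vocabulary)
basket = bag_of_words(["apple", "banana", "grape"], vocabulary)
print(apple)   # [1, 0, 0]
print(basket)  # [1, 1, 1]
```

A basket with two apples would give `[2, 0, 0]`, which is exactly a row of term frequencies.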

#### Document-Term Matrix
- Row: Document
- Column: Term
- Row and column can be reversed.
- In the cell:
    - Term Frequency (`tf`): absolute vs **relative**
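    - Term Frequency-Inverse Document Frequency (TF-IDF)
        - down-weights words that appear in most documents and therefore do not discriminate between them
        - Doc Freq (`df`): absolute vs **relative**
        - Formula: `tf*log(1/df) = tf*log(N/n)`, where `df` is the relative document frequency, `N` is the number of documents and `n` is the number of documents containing the term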
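
As a rough illustration of the formula, the sketch below computes `tf*log(N/n)` on a tiny made-up corpus using only the standard library. The corpus and the helper names (`relative_tf`, `doc_count`, `tf_idf`) are assumptions for this example, not part of the workshop code.

```python
import math

# Toy corpus (hypothetical): each document is already tokenized.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "grape"],
    ["banana", "apple", "grape", "grape"],
]
N = len(docs)  # total number of documents

def relative_tf(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def doc_count(term):
    """Number of documents that contain `term` at least once (n)."""
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc):
    """tf * log(N / n), using relative term frequency."""
    return relative_tf(term, doc) * math.log(N / doc_count(term))

print(tf_idf("banana", docs[0]))  # 0.0, because "banana" is in every document
print(tf_idf("apple", docs[0]))   # positive, because "apple" is in only 2 of 3
```

A word that occurs in every document gets `log(N/N) = 0`, so it is suppressed no matter how frequent it is.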

<img src="https://juniorworld.github.io/python-workshop-2018/img/doc-term-matrix.jpg" width='500px'>

<h3 style="color:red">1a. English Paragraph-Term Matrix (step-by-step breakdown)</h3>

In [1]:
import regex as re
import pandas as pd
import numpy as np

In [2]:
#first reload data_cleaning() function we created last week
def data_cleaning(text):
    text=text.lower()
    text=re.sub('[0-9]+','',text)
    text=re.sub('@[^ ]+','',text)
    text=re.sub('#[^ ]+','',text)
    text=re.sub('http://[^ ]+|https://[^ ]+','',text)
    text=re.sub(r'\p{P}+',' ',text) #raw string so that \p is passed to regex unchanged
    return(text)

In [3]:
#load english stop words
#stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_eng.txt
file_eng=open('C:\\Users\\yuner\\AppData\\Local\\Programs\\Python\\Python36\\Lib\\site-packages\\jieba\\stop_words_eng.txt','r')
stop_words_eng=[]
for line in file_eng.readlines():
    line=line.strip() #remove line break
    stop_words_eng.append(line) #update the list of stop words line by line
file_eng.close()

In [4]:
len(stop_words_eng)

128

In [7]:
#define a paragraph
paragraph_1="Many of us campaigned on the same core promises: to defend American jobs and demand fair trade for American workers; to rebuild and revitalize our Nation's infrastructure; to reduce the price of healthcare and prescription drugs; to create an immigration system that is safe, lawful, modern and secure; and to pursue a foreign policy that puts America's interests first."
cleaned_paragraph_1= data_cleaning(paragraph_1)   #clean the paragraph by using data_cleaning() function
words_1= cleaned_paragraph_1.split(' ')               #tokenize the paragraph
cleaned_words_1=[word for word in words_1 if word not in stop_words_eng and word]       #remove stop words

In [13]:
if '':
    print('not empty')
else:
    print('empty')

empty


In [14]:
bool('')

False

In [8]:
cleaned_words_1 #have a look at the result

['many',
 'us',
 'campaigned',
 'core',
 'promises',
 'defend',
 'american',
 'jobs',
 'demand',
 'fair',
 'trade',
 'american',
 'workers',
 'rebuild',
 'revitalize',
 'nation',
 'infrastructure',
 'reduce',
 'price',
 'healthcare',
 'prescription',
 'drugs',
 'create',
 'immigration',
 'system',
 'safe',
 'lawful',
 'modern',
 'secure',
 'pursue',
 'foreign',
 'policy',
 'puts',
 'america',
 'interests',
 'first']

In [16]:
#get the word frequency table
tf_1=pd.Series(cleaned_words_1).value_counts()

In [17]:
tf_1

american          2
us                1
immigration       1
fair              1
safe              1
modern            1
pursue            1
infrastructure    1
policy            1
jobs              1
demand            1
campaigned        1
foreign           1
revitalize        1
trade             1
system            1
lawful            1
rebuild           1
drugs             1
puts              1
defend            1
healthcare        1
prescription      1
reduce            1
interests         1
nation            1
price             1
secure            1
first             1
promises          1
america           1
many              1
workers           1
create            1
core              1
dtype: int64

In [18]:
#wrap previous lines into a user function which takes a paragraph string and outputs a frequency table
def paragraph_to_tf(paragraph):
    cleaned_paragraph= data_cleaning(paragraph)   #clean the paragraph by using data_cleaning() function
    words= cleaned_paragraph.split(' ')               #tokenize the paragraph
    cleaned_words=[word for word in words if word not in stop_words_eng and word]
    tf=pd.Series(cleaned_words).value_counts()
    return(tf)

In [19]:
#Use paragraph_to_tf() function to obtain term frequency table of paragraph_2
paragraph_2="In 2019, we also celebrate 50 years since brave young pilots flew a quarter of a million miles through space to plant the American flag on the face of the moon. Half a century later, we are joined by one of the Apollo 11 astronauts who planted that flag: Buzz Aldrin. This year, American astronauts will go back to space on American rockets."
tf_2=paragraph_to_tf(paragraph_2)

In [20]:
tf_2

american      3
space         2
flag          2
astronauts    2
moon          1
flew          1
go            1
rockets       1
later         1
joined        1
brave         1
half          1
buzz          1
apollo        1
back          1
miles         1
one           1
face          1
million       1
planted       1
century       1
quarter       1
years         1
plant         1
pilots        1
young         1
year          1
aldrin        1
also          1
since         1
celebrate     1
dtype: int64

In [21]:
tf_combined= pd.concat([tf_1,tf_2],axis=1)  #combine two frequency tables #axis=1 by columns, axis=0 rows
tf_combined=tf_combined.fillna(0) #replace missing value with 0

In [22]:
tf_combined.head()

Unnamed: 0,0,1
aldrin,0.0,1.0
also,0.0,1.0
america,1.0,0.0
american,2.0,3.0
apollo,0.0,1.0
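
What `pd.concat(..., axis=1)` followed by `fillna(0)` does can be pictured without pandas: take the union of both vocabularies and fill in 0 wherever a word is missing from a paragraph. The sketch below uses `collections.Counter` and two short made-up token lists; it is an illustration of the alignment idea, not a replacement for the cells above.

```python
from collections import Counter

# Two hypothetical, already-cleaned token lists.
words_a = ["american", "jobs", "american", "workers"]
words_b = ["american", "space", "flag", "space"]

tf_a = Counter(words_a)
tf_b = Counter(words_b)

# Union of both vocabularies, sorted so the row order is reproducible.
vocab = sorted(set(tf_a) | set(tf_b))

# One row per term, one column per paragraph; a missing term counts as 0.
matrix = {term: [tf_a.get(term, 0), tf_b.get(term, 0)] for term in vocab}

for term, counts in matrix.items():
    print(term, counts)
```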


<h3 style="color:red">1b. English Paragraph-Term Matrix (integrated)</h3>

a. create term-doc matrix

In [24]:
#load the file of the 2019 State of the Union address delivered by President Trump
file_2019=open('doc/2019_SoU.txt','r',encoding='utf-8')

In [26]:
#read file line by line, obtain the term frequency table of each paragraph and combine them into a Paragraph-Term Matrix
#REMINDER: pd.concat() doesn't support combining blank table
#---------------------
tf_list=[]
for line in file_2019.readlines():
    line=line.strip()
    tf=paragraph_to_tf(line)
    tf_list.append(tf)

tf_combined=pd.concat(tf_list,axis=1)
#---------------------
tf_combined=tf_combined.fillna(0) #replace missing values


In [29]:
len(tf_list)

127

In [27]:
tf_combined.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,117,118,119,120,121,122,123,124,125,126
$,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abiding,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abject,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
able,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abolish,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
tf_combined.shape

(1445, 127)

In [30]:
#row sum: overall word frequency in entire speech
tf_combined.sum(axis=1) #with fixed row, calculate the sum of columns

$                  7.0
abiding            1.0
abject             1.0
able               3.0
abolish            1.0
abortion           1.0
abroad             1.0
accelerated        1.0
access             2.0
accountability     1.0
achieve            2.0
achievements       1.0
acquires           1.0
across             2.0
act                5.0
acted              1.0
action             1.0
added              2.0
adding             1.0
administration    10.0
adopt              1.0
advance            1.0
advancing          1.0
advantage          2.0
adventure          1.0
adversaries        1.0
afghan             1.0
afghanistan        3.0
african            2.0
age                2.0
                  ... 
wisconsin          1.0
withdrawing        1.0
withdrew           1.0
within             3.0
womb               2.0
women             12.0
wonderful          2.0
woods              1.0
words              1.0
work              10.0
worked             1.0
workers            3.0
workforce  

In [31]:
#column sum: length of paragraph
tf_combined.sum(axis=0)

0      15.0
1      14.0
2      10.0
3      36.0
4      12.0
5      14.0
6      49.0
7      36.0
8      37.0
9      14.0
10     13.0
11     24.0
12     12.0
13      4.0
14     14.0
15     27.0
16     67.0
17     10.0
18     10.0
19     17.0
20     19.0
21     19.0
22     24.0
23     25.0
24      9.0
25      6.0
26     21.0
27      7.0
28     29.0
29     44.0
       ... 
97     22.0
98     10.0
99     15.0
100    13.0
101    20.0
102    33.0
103    25.0
104    22.0
105    12.0
106    23.0
107    28.0
108    38.0
109    62.0
110    29.0
111    56.0
112    23.0
113    16.0
114     2.0
115    22.0
116    19.0
117    29.0
118     2.0
119    22.0
120     8.0
121    14.0
122    25.0
123     7.0
124    10.0
125    27.0
126     8.0
Length: 127, dtype: float64

In [40]:
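#How to calculate DF?
#NOTE: masking with [tf_combined>0] keeps the original counts, so the line below returns each word's total count;
#the number of paragraphs containing each word would be (tf_combined>0).sum(axis=1)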
tf_combined[tf_combined>0].sum(axis=1)

$                  7.0
able               3.0
access             2.0
achieve            2.0
across             2.0
act                5.0
added              2.0
administration    10.0
advantage          2.0
afghanistan        3.0
african            2.0
age                2.0
agenda             5.0
agent              3.0
agents             2.0
ago                8.0
agreement          4.0
aids               2.0
alice              7.0
aliens             2.0
allies             2.0
almost             8.0
already            2.0
also              13.0
always             4.0
america           25.0
american          33.0
americans         17.0
announced          2.0
another            6.0
                  ... 
voted              2.0
wages              2.0
wall               6.0
walls              4.0
war                6.0
wars               2.0
watching           2.0
way                2.0
weeks              3.0
welcome            2.0
whether            5.0
whose              3.0
wibberley  
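
The difference between a masked sum and a document frequency is easiest to see on a tiny table. The sketch below uses a made-up three-term DataFrame (`toy_tf` is a hypothetical name) and compares the two expressions side by side.

```python
import pandas as pd

# Hypothetical term-by-paragraph counts (rows: terms, columns: paragraphs).
toy_tf = pd.DataFrame(
    {0: [2, 0, 1], 1: [3, 1, 0], 2: [0, 0, 0]},
    index=["american", "flag", "jobs"],
)

# Masking replaces non-positive cells with NaN but keeps the original counts,
# so the row sum is the total term frequency.
total_count = toy_tf[toy_tf > 0].sum(axis=1)

# A boolean table counts each paragraph at most once per term,
# so the row sum is the document frequency.
doc_freq = (toy_tf > 0).sum(axis=1)

print(total_count.to_dict())
print(doc_freq.to_dict())
```

Here "american" occurs 5 times in total but in only 2 paragraphs, so the two measures disagree as soon as a word is repeated within a paragraph.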

In [42]:
tf_combined[tf_combined>0].sum(axis=1)>2

$                  True
able               True
access            False
achieve           False
across            False
act                True
added             False
administration     True
advantage         False
afghanistan        True
african           False
age               False
agenda             True
agent              True
agents            False
ago                True
agreement          True
aids              False
alice              True
aliens            False
allies            False
almost             True
already           False
also               True
always             True
america            True
american           True
americans          True
announced         False
another            True
                  ...  
voted             False
wages             False
wall               True
walls              True
war                True
wars              False
watching          False
way               False
weeks              True
welcome           False
whether         

b. remove rare words

In [43]:
#keep only words whose total count across all paragraphs is greater than 2
tf_combined=tf_combined[tf_combined[tf_combined>0].sum(axis=1)>2]

In [44]:
#Normalization: absolute term freq divided by total count of words in paragraph
relative_tf=tf_combined/tf_combined.sum(axis=0)

<div class="alert alert-block alert-success">
<b>Extra Knowledge: HOW TO IMPLEMENT TF-IDF</b></div>

```python
#idf uses the number of paragraphs containing each word: (tf_combined>0).sum(axis=1)
tf_idf=relative_tf.T * np.log(127/(tf_combined>0).sum(axis=1))
tf_idf.iloc[0].sort_values()[::-1] #look at the most distinguishing words in the first paragraph
```

<h3 style="color:red">1c. English Paragraph-Term Matrix (shortcut)</h3>

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

In [50]:
file_2019=open('doc/2019_SoU.txt','r',encoding='utf-8')
paragraphs=file_2019.readlines()

In [56]:
vectorizer=CountVectorizer(lowercase=True,stop_words=stop_words_eng)
tf=vectorizer.fit_transform(paragraphs)
tf=pd.DataFrame(tf.toarray().T,index=vectorizer.get_feature_names()) #in scikit-learn 1.0+ use get_feature_names_out(); get_feature_names() was removed in 1.2

In [52]:
tf.shape #this result differs from the previous method's because the stop word list and the tokenization differ (e.g. numbers such as '000' are kept here)

(1484, 127)

In [55]:
tf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,117,118,119,120,121,122,123,124,125,126
000,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
12th,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
#Normalization: absolute freq to relative freq
tf=tf/tf.sum(axis=0)

---
## Break
---

## Topic & Frame Analysis: From a Word Co-occurrence perspective
- RQ: How do framing devices co-occur in a text to form underlying patterns of meaning?

<img src="https://juniorworld.github.io/python-workshop-2018/img/hierarchy.png" width='700px'>

### Step 1. From Doc-Term Matrix to Term-Term Co-occurrence Matrix

Suppose we have four sentences:
- 'apple, banana'
- 'apple, banana, grape'
- 'apple, banana, grape'
- 'apple'

In [64]:
#Translate into numbers
sen1=[1,1,0]
sen2=[1,1,1]
sen3=[1,1,1]
sen4=[1,0,0]

In [68]:
sen1[1]

1

**Q: How many baskets (sentences) contain both apple and banana?**

_METHOD 1: AND operator_

In [65]:
count=0 #initialization
for sen in [sen1,sen2,sen3,sen4]: #go through every sentence (basket)
    if sen[0]==1 and sen[1]==1: #apple (index 0) and banana (index 1) are both present
        count+=1
print(count)

3


_METHOD 2: Multiply_

In [69]:
count=0
for sen in [sen1,sen2,sen3,sen4]:
    if sen[0]*sen[1]==1: #the product is 1 only when both apple and banana are present
        count+=1
print(count)

3


_METHOD 3: Dot product_

In [70]:
matrix=np.matrix([sen1,sen2,sen3,sen4])

In [73]:
tf_combined.T

Unnamed: 0,$,able,act,administration,afghanistan,agenda,agent,ago,agreement,alice,...,within,women,work,workers,working,world,would,year,years,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
matrix

matrix([[1, 1, 0],
        [1, 1, 1],
        [1, 1, 1],
        [1, 0, 0]])

In [74]:
matrix=np.matrix([sen1,sen2,sen3,sen4])
np.dot(matrix.T,matrix)

matrix([[4, 3, 2],
        [3, 3, 2],
        [2, 2, 2]])
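
To make explicit why the dot product answers the question, the sketch below rebuilds the same toy matrix with a plain `np.array` and the `@` operator (an alternative to `np.matrix` chosen for this example). For a binary document-term matrix, cell `[i, j]` of the product counts the sentences that contain both term `i` and term `j`, and the diagonal holds each term's own sentence count.

```python
import numpy as np

# Rows are sentences, columns are apple, banana, grape (same values as sen1..sen4).
doc_term = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
])

# (terms x sentences) @ (sentences x terms) -> (terms x terms)
cooc = doc_term.T @ doc_term

print(cooc)
print(cooc[0, 1])  # sentences containing both apple and banana
```

With relative frequencies instead of 0/1 values the same product yields weighted co-occurrence scores rather than counts.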

### Vector Multiplication
<img src="https://juniorworld.github.io/python-workshop-2018/img/vec-multiply.png" align='left' width='300px'>

### Matrix Multiplication
<img src="https://juniorworld.github.io/python-workshop-2018/img/matrix-multiply.svg" align='left'>

In [76]:
matrix1=np.matrix([[1,2,3],[4,5,6]])
matrix2=np.matrix([[7,8],[9,10],[11,12]])

In [77]:
np.dot(matrix1,matrix2)

matrix([[ 58,  64],
        [139, 154]])

In [79]:
print(matrix1.shape)
print(matrix2.shape)
#dim requirement: matrix1 (n,m) and matrix2 (m,k) give a result of shape (n,k)

(2, 3)
(3, 2)
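
The shape rule can be checked by hand with a small pure-Python multiplication. The helper `matmul` below is a hypothetical stand-in for `np.dot`, written only to show where the inner dimensions must agree.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows: (n, m) x (m, k) -> (n, k)."""
    n, m = len(a), len(a[0])
    if len(b) != m:
        raise ValueError("inner dimensions differ: %d vs %d" % (m, len(b)))
    k = len(b[0])
    # Cell (i, c) is the dot product of row i of `a` and column c of `b`.
    return [[sum(a[i][j] * b[j][c] for j in range(m)) for c in range(k)]
            for i in range(n)]

matrix1 = [[1, 2, 3], [4, 5, 6]]         # shape (2, 3)
matrix2 = [[7, 8], [9, 10], [11, 12]]    # shape (3, 2)
product = matmul(matrix1, matrix2)
print(product)  # [[58, 64], [139, 154]]
```

Multiplying `matrix1` by itself raises the `ValueError`, because a (2, 3) matrix cannot be followed by another (2, 3) matrix.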


In [84]:
#Using dot product
tt_matrix=np.dot(relative_tf,relative_tf.T)

In [85]:
tt_matrix.shape #symmetric matrix

(245, 245)

retrieve data from matrix: `matrix_name[row index, col index]`

In [86]:
tt_matrix[0,0]

0.03458611279628775

In [87]:
tt_matrix[0,:] #full list of co-occurrence

array([0.03458611, 0.        , 0.        , 0.        , 0.00444444,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00444444, 0.00680272, 0.        , 0.        , 0.00604444,
       0.00444444, 0.        , 0.        , 0.00694444, 0.        ,
       0.        , 0.        , 0.        , 0.00118906, 0.        ,
       0.        , 0.        , 0.02200816, 0.        , 0.00563351,
       0.00694444, 0.        , 0.        , 0.        , 0.00237812,
       0.        , 0.        , 0.        , 0.00118906, 0.0032    ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00694444, 0.0016    ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00237812, 0.0016    , 0.        ,
       0.        , 0.00694444, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00888889, 0.        , 0.        ,
       0.        , 0.0016    , 0.        , 0.        , 0.00118

In [88]:
print(tt_matrix[:5,:5]) #the co-occurrence matrix is symmetric

[[0.03458611 0.         0.         0.         0.00444444]
 [0.         0.05953125 0.         0.00390625 0.01953125]
 [0.         0.         0.03055605 0.00378072 0.        ]
 [0.         0.00390625 0.00378072 0.38530278 0.00390625]
 [0.00444444 0.01953125 0.         0.00390625 0.02397569]]


### Step 2. Export Matrix

In [89]:
tt_matrix=pd.DataFrame(tt_matrix)

In [91]:
tt_matrix.index=relative_tf.index

In [92]:
tt_matrix.columns=relative_tf.index

In [94]:
tt_matrix.head()

Unnamed: 0,$,able,act,administration,afghanistan,agenda,agent,ago,agreement,alice,...,within,women,work,workers,working,world,would,year,years,young
$,0.034586,0.0,0.0,0.0,0.004444,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0016,0.0,0.0,0.015984,0.024994,0.0
able,0.0,0.059531,0.0,0.003906,0.019531,0.0,0.0,0.0,0.003906,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
act,0.0,0.0,0.030556,0.003781,0.0,0.0,0.0,0.0,0.0,0.003781,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017893,0.0
administration,0.0,0.003906,0.003781,0.385303,0.003906,0.0,0.0,0.0,0.003906,0.00189,...,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.123457,0.0
afghanistan,0.004444,0.019531,0.0,0.003906,0.023976,0.0,0.0,0.0,0.003906,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004444,0.0


In [93]:
tt_matrix.to_csv('matrix.csv')
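
Besides a full matrix, network tools commonly accept an edge list. The sketch below turns a small symmetric matrix into `Source,Target,Weight` rows; the toy weights, the helper name `to_edge_rows`, and the assumption that your Gephi version reads these column names from its spreadsheet importer should all be checked against your own setup.

```python
import csv
import io

# Hypothetical symmetric co-occurrence weights for three terms.
terms = ["apple", "banana", "grape"]
weights = [
    [4, 3, 2],
    [3, 3, 2],
    [2, 2, 2],
]

def to_edge_rows(terms, weights):
    """Return (source, target, weight) for each non-zero pair above the diagonal."""
    rows = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):  # skip the diagonal and mirrored pairs
            if weights[i][j] > 0:
                rows.append((terms[i], terms[j], weights[i][j]))
    return rows

edges = to_edge_rows(terms, weights)

# Written to an in-memory buffer here; use open('edges.csv', 'w', newline='') for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(edges)
print(buffer.getvalue())
```

Because the matrix is symmetric, keeping only the pairs above the diagonal avoids listing every edge twice.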

### Step 3. Clustering Analysis & Visualization
We will use Gephi, a user-friendly program for network analysis (download link: https://gephi.org/).

<img src="https://juniorworld.github.io/python-workshop-2018/img/occurence-network.png" width='500px'>

### Practice
Please repeat the previous steps and apply word co-occurrence analysis to the Chinese government's 2019 annual report.

In [None]:
#load the file of the 2019 Government Annual Report delivered by Premier Li Keqiang
chi_2019=open('FILE PATH','r',encoding='utf-8')

In [None]:
#WRITE YOUR CODE HERE





