<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week8.png" width="350px">

---

# Natural Language Processing

<img src="https://juniorworld.github.io/python-workshop-2018/img/NLP_.png" width="700px" height="400px" align='left'>

## 4. Vectorization
### What is a vector?
<img src="https://juniorworld.github.io/python-workshop-2018/img/scalar-vector-matrix.jpeg" width='400px' align='left'>

### Types of word representation
- Scalar: a single variable
- One-hot encoding vector: ~ dummy variables
- Distributed embedding vector: word2vec & GloVe

### One-Hot Encoding
- Equivalent to dummy variables
- Apple: [1,0,0]; Banana: [0,1,0]; Grape: [0,0,1]
- A basket of fruit: one apple, one banana and one grape: [1,1,1]
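
The encoding above can be sketched in a few lines of plain Python. This is only an illustration: the `vocabulary` list and the helper names `one_hot()` and `encode_basket()` are assumptions made for this example, not part of the workshop code.

```python
# Hypothetical vocabulary; the order fixes each word's position in the vector
vocabulary = ['apple', 'banana', 'grape']

def one_hot(word):
    """Return a vector with 1 at the word's position and 0 elsewhere."""
    return [1 if word == term else 0 for term in vocabulary]

def encode_basket(basket):
    """Mark every vocabulary word that appears in the basket."""
    return [1 if term in basket else 0 for term in vocabulary]

apple_vec = one_hot('apple')                              # [1, 0, 0]
basket_vec = encode_basket(['apple', 'banana', 'grape'])  # [1, 1, 1]
```

A basket is the element-wise "or" of the one-hot vectors of its items, which is why a document can be represented by a single row of 0s and 1s.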

#### Document-Term Matrix
- Row: Document
- Column: Term
- Row and column can be reversed.
- In the cell:
    - Term Frequency (`tf`): absolute vs **relative**
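    - Term Frequency-Inverse Document Frequency (TF-IDF)
        - to suppress words that do not discriminate between documents
        - Doc Freq (`df`): absolute vs **relative**
        - Formula: `tf*log(1/df) = tf*log(N/n)`, where `df` is the relative document frequency, `N` is the total number of documents and `n` is the number of documents containing the term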
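
To make the formula concrete, here is a minimal sketch on a hypothetical toy corpus. The documents and the helper names `relative_tf()`, `doc_count()` and `tf_idf()` are assumptions made for illustration only.

```python
import math

# Hypothetical toy corpus: each document is already tokenized
docs = [['apple', 'banana', 'apple'],
        ['apple', 'grape'],
        ['apple', 'banana', 'grape', 'grape']]

N = len(docs)  # total number of documents

def relative_tf(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def doc_count(term):
    """n: number of documents that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc):
    """tf * log(N / n), following the formula above."""
    return relative_tf(term, doc) * math.log(N / doc_count(term))

score_apple = tf_idf('apple', docs[0])    # 'apple' is in every document, so log(3/3) = 0
score_banana = tf_idf('banana', docs[0])  # (1/3) * log(3/2)
```

A word that appears in every document gets a weight of zero however frequent it is, which is how TF-IDF suppresses words that do not discriminate between documents.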

<img src="https://juniorworld.github.io/python-workshop-2018/img/doc-term-matrix.jpg" width='500px'>

<h3 style="color:red">1a. English Paragraph-Term Matrix (step-by-step breakdown)</h3>

In [None]:
import regex as re
import pandas as pd
import numpy as np

In [None]:
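#first reload the data_cleaning() function we created last week
def data_cleaning(text):
    text=text.lower() #convert to lowercase
    text=re.sub(r'[0-9]+','',text) #remove numbers
    text=re.sub(r'@[^ ]+','',text) #remove mentions
    text=re.sub(r'#[^ ]+','',text) #remove hashtags
    text=re.sub(r'http://[^ ]+|https://[^ ]+','',text) #remove URLs
    text=re.sub(r'\p{P}+',' ',text) #replace punctuation with a space (\p{P} requires the regex module)
    return(text)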

In [None]:
#load english stop words
#stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_eng.txt
file_eng=open('FILE PATH','r')
stop_words_eng=[]
for line in file_eng.readlines():
    line=line.strip() #remove line break
    stop_words_eng.append(line) #update the list of stop words line by line
file_eng.close()

In [None]:
len(stop_words_eng)

In [None]:
#define a paragraph
paragraph_1="Many of us campaigned on the same core promises: to defend American jobs and demand fair trade for American workers; to rebuild and revitalize our Nation's infrastructure; to reduce the price of healthcare and prescription drugs; to create an immigration system that is safe, lawful, modern and secure; and to pursue a foreign policy that puts America's interests first."
cleaned_paragraph_1=    #clean the paragraph by using data_cleaning() function
words_1=                #tokenize the paragraph
cleaned_words_1=        #remove stop words

In [None]:
cleaned_words_1 #have a look at the result

In [None]:
#get the word frequency table
tf_1=pd.Series(cleaned_words_1).value_counts()

In [None]:
tf_1

In [None]:
#wrap the previous lines into a user-defined function which takes a paragraph string and outputs a frequency table
def paragraph_to_tf(paragraph):
    
    
    #WRITE YOUR CODE HERE
    
    
    return(tf)

In [None]:
#Use paragraph_to_tf() function to obtain term frequency table of paragraph_2
paragraph_2="In 2019, we also celebrate 50 years since brave young pilots flew a quarter of a million miles through space to plant the American flag on the face of the moon. Half a century later, we are joined by one of the Apollo 11 astronauts who planted that flag: Buzz Aldrin. This year, American astronauts will go back to space on American rockets."
tf_2=paragraph_to_tf(paragraph_2)

In [None]:
tf_2

In [None]:
tf_combined=        #combine two frequency tables
tf_combined=tf_combined.fillna(0) #replace missing value with 0

In [None]:
tf_combined.head()

<h3 style="color:red">1b. English Paragraph-Term Matrix (integrated)</h3>

a. create term-doc matrix

In [None]:
#load the file of the 2019 State of the Union address delivered by President Trump
file_2019=open('FILE PATH','r',encoding='utf-8')

In [None]:
#read file line by line, obtain the term frequency table of each paragraph and combine them into a Paragraph-Term Matrix
#REMINDER: pd.concat() doesn't support combining a blank table, so skip blank paragraphs
#---------------------

#WRITE YOUR CODE HERE

#---------------------
tf_combined=tf_combined.fillna(0) #replace missing values

In [None]:
tf_combined.head()

In [None]:
tf_combined.shape

In [None]:
#row sum
tf_combined.sum(axis=1)

In [None]:
#column sum
tf_combined.sum(axis=0)

In [None]:
#How to calculate DF? Count the paragraphs in which each term appears
(tf_combined>0).sum(axis=1)
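
Document frequency counts how many paragraphs contain a term, regardless of how often the term appears in each one. A minimal sketch on a hypothetical toy table (the names `toy`, `df_toy` and `masked_sum` are illustrative), which also shows how this differs from summing the positive counts:

```python
import pandas as pd

# Hypothetical term-by-paragraph table: rows are terms, columns are paragraphs
toy = pd.DataFrame({'p1': [2, 1, 0],
                    'p2': [0, 1, 0],
                    'p3': [3, 0, 1]},
                   index=['american', 'flag', 'space'])

# (toy > 0) is a True/False table; summing across columns counts paragraphs
df_toy = (toy > 0).sum(axis=1)

# Masking keeps the counts themselves, so this sums frequencies instead
masked_sum = toy[toy > 0].sum(axis=1)
```

Here `american` appears in two paragraphs, so its document frequency is 2, while the masked sum returns 5, its total term frequency.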

b. remove words only appearing in one paragraph

In [None]:
#remove words that appear in only one paragraph
tf_combined=tf_combined[(tf_combined>0).sum(axis=1)>1]

In [None]:
#Normalization: absolute term freq divided by total count of words in paragraph
relative_tf=tf_combined/tf_combined.sum(axis=0)
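
The division works because pandas aligns the column totals with the column labels of the table. A minimal sketch, assuming a hypothetical two-paragraph table (`toy` and `toy_relative` are illustrative names):

```python
import pandas as pd

# Hypothetical term-by-paragraph counts
toy = pd.DataFrame({'p1': [2, 1, 1],
                    'p2': [0, 1, 3]},
                   index=['american', 'flag', 'space'])

# sum(axis=0) gives one total per column; the division is aligned on column labels
toy_relative = toy / toy.sum(axis=0)

column_totals = toy_relative.sum(axis=0)  # each paragraph's shares add up to 1
```

A paragraph whose column total is 0 would produce `NaN` values here, so it may be worth checking for empty columns before normalizing.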

<div class="alert alert-block alert-success">
<b>Extra Knowledge: HOW TO IMPLEMENT TF-IDF</b></div>

>```python
N=tf_combined.shape[1] #total number of paragraphs (127 for this speech)
tf_idf=relative_tf.T * np.log(N/(tf_combined>0).sum(axis=1))
tf_idf.iloc[0].sort_values()[::-1] #look at the most distinguishing words in the first paragraph
```

<h3 style="color:red">1c. English Paragraph-Term Matrix (shortcut)</h3>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
file_2019=open('FILE PATH','r',encoding='utf-8')
paragraphs=file_2019.readlines()

In [None]:
vectorizer=CountVectorizer(lowercase=True,stop_words='english')
tf=vectorizer.fit_transform(paragraphs)
tf=pd.DataFrame(tf.toarray().T,index=vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2

In [None]:
tf.shape #results differ from the previous method because the stop word list is different

In [None]:
tf.head()

In [None]:
#Normalization: absolute freq to relative freq
tf=tf/tf.sum(axis=0)

---
## Break
---

## Topic & Frame Analysis: From a Word Co-occurrence perspective
- RQ: How do framing devices co-occur in a text to form underlying patterns of meaning?

<img src="https://juniorworld.github.io/python-workshop-2018/img/hierarchy.png" width='700px'>

### Step 1. From Doc-Term Matrix to Term-Term Co-occurrence Matrix

Suppose we have four sentences:
- 'apple, banana'
- 'apple, banana, grape'
- 'banana, grape'
- 'apple'

In [None]:
#Translate into numbers; positions are [apple, banana, grape]
sen1=[1,1,0]
sen2=[1,1,1]
sen3=[0,1,1]
sen4=[1,0,0]
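
The translation can also be done programmatically. A minimal sketch, assuming the column order apple, banana, grape; the helper `to_vector()` is a hypothetical name introduced for this example:

```python
# Column order is an assumption: apple, banana, grape
fruits = ['apple', 'banana', 'grape']
sentences = ['apple, banana', 'apple, banana, grape', 'banana, grape', 'apple']

def to_vector(sentence):
    """One-hot encode a comma-separated sentence against `fruits`."""
    words = [word.strip() for word in sentence.split(',')]
    return [1 if fruit in words else 0 for fruit in fruits]

vectors = [to_vector(sentence) for sentence in sentences]
```

Each entry of `vectors` is one row of the sentence-term matrix.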

**Q: How many sentences (baskets) contain both apple and banana?**

_METHOD 1: AND operator_

In [None]:
count=0 #initialization
for sen in [sen1,sen2,sen3,sen4]: #go through every basket
    if sen[0]==1 and sen[1]==1: #position 0 is apple, position 1 is banana
        count+=1
print(count)

_METHOD 2: Multiply_

In [None]:
count=0
for sen in [sen1,sen2,sen3,sen4]:
    if sen[0]*sen[1]==1: #the product is 1 only when both apple and banana are present
        count+=1
print(count)

_METHOD 3: Dot product_

In [None]:
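matrix=np.matrix([sen1,sen2,sen3,sen4]) #rows: sentences; columns: apple, banana, grape
np.dot(matrix.T,matrix) #cell [0,1] counts the sentences containing both apple and banana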
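
Why the dot product answers the question: multiplying the transposed matrix by the matrix itself repeats Method 2 for every pair of fruits at once. A minimal sketch using a plain NumPy array (the names `sentence_term` and `cooccurrence` are illustrative):

```python
import numpy as np

# Same four sentences as above; columns are apple, banana, grape
sentence_term = np.array([[1, 1, 0],
                          [1, 1, 1],
                          [0, 1, 1],
                          [1, 0, 0]])

# (terms x sentences) times (sentences x terms) gives a terms x terms matrix
cooccurrence = sentence_term.T @ sentence_term

# Cell [i, j] adds up, over all sentences, the product of columns i and j
apple_and_banana = cooccurrence[0, 1]
```

The diagonal holds the number of sentences containing each fruit, and every off-diagonal cell holds the number of sentences containing both fruits of the pair.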

### Vector Multiplication
<img src="https://juniorworld.github.io/python-workshop-2018/img/vec-multiply.png" align='left' width='300px'>

### Matrix Multiplication
<img src="https://juniorworld.github.io/python-workshop-2018/img/matrix-multiply.svg" align='left'>

### Practice
>```python
fruit2=np.matrix(
      [[0,1,1,0],
       [0,0,1,1],
       [1,0,1,1]])
```

>```python
np.dot(fruit2,fruit2.T) = ?
```

In [None]:
#Using dot product
tt_matrix=np.dot(relative_tf,relative_tf.T)

In [None]:
tt_matrix.shape #symmetric matrix

Retrieve data from the matrix: `matrix_name[row index, col index]`

In [None]:
tt_matrix[0,0]

In [None]:
tt_matrix[0,:] #co-occurrence of the first term with every term

In [None]:
print(tt_matrix[:5,:5]) #the co-occurrence matrix is symmetric
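
The product of a matrix with its own transpose is always symmetric, because swapping the two terms leaves the sum of products unchanged. A minimal check on a hypothetical matrix (`toy_tf` and `toy_tt` are illustrative names):

```python
import numpy as np

# Hypothetical term-by-paragraph matrix of relative frequencies
toy_tf = np.array([[0.5, 0.0, 0.25],
                   [0.5, 0.4, 0.25],
                   [0.0, 0.6, 0.50]])

toy_tt = np.dot(toy_tf, toy_tf.T)  # term-by-term matrix

# A matrix is symmetric when it equals its own transpose
is_symmetric = np.allclose(toy_tt, toy_tt.T)
```

Note that when the input holds relative frequencies rather than 0/1 indicators, each cell is a weighted co-occurrence score and not a count of paragraphs.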

### Step 2. Export Matrix

In [None]:
tt_matrix=pd.DataFrame(tt_matrix)

In [None]:
tt_matrix.index=relative_tf.index

In [None]:
tt_matrix.columns=relative_tf.index

In [None]:
tt_matrix.to_csv('matrix.csv')
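
As an alternative to exporting the full matrix, the co-occurrences can be written as an edge list with one row per pair of terms. A minimal sketch, assuming the `Source`, `Target` and `Weight` column headers that Gephi's spreadsheet importer is expected to recognize; the toy matrix and the `edges.csv` file name are hypothetical:

```python
import csv
import io

# Hypothetical 3-term co-occurrence matrix
terms = ['apple', 'banana', 'grape']
matrix = [[3, 2, 1],
          [2, 3, 2],
          [1, 2, 2]]

# Keep each pair once (upper triangle) and skip the diagonal
edges = [(terms[i], terms[j], matrix[i][j])
         for i in range(len(terms))
         for j in range(i + 1, len(terms))
         if matrix[i][j] > 0]

# Written to an in-memory buffer here; use open('edges.csv', 'w', newline='') to save a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Source', 'Target', 'Weight'])
writer.writerows(edges)
edge_csv = buffer.getvalue()
```

Because the matrix is symmetric, keeping only the upper triangle avoids listing every pair twice.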

### Step 3. Clustering Analysis & Visualization
We will use Gephi, a user-friendly tool for network analysis (download link: https://gephi.org/).

<img src="https://juniorworld.github.io/python-workshop-2018/img/occurence-network.png" width='500px'>

### Practice
Please repeat the previous steps and apply word co-occurrence analysis to the Chinese government's 2019 annual report.

In [None]:
#load the file of the 2019 Government Annual Report delivered by Premier Li Keqiang
chi_2019=open('FILE PATH','r',encoding='utf-8')

In [None]:
#WRITE YOUR CODE HERE





