# Class 9 - Natural Language Processing

---

# Flowchart

<img src="https://juniorworld.github.io/python-workshop/img/NLP_.png" width="600px" height="400px" align='left'>

We will demonstrate how to go through these five steps for English and Chinese texts respectively.

# 1. First-run Data Cleaning
- Main task: convert the letter case and remove punctuation and special elements such as hashtags, mentions, and hyperlinks
- Use Regular Expression for Pattern Matching
- Convert the case: `.lower()`

## Regular Expression Cheat Sheet
1. `.`: Wildcard, any character except a line break
2. `[abc]`: Character set (a or b or c)
    - `\`: escapes the special character that follows it
    - special characters that need to be escaped: `^ [ . $ { * ( \ + ) | ? < >`
3. `[^abc]`: Negated set, Not (a or b or c)
4. Character class:
    - `\w`: Word character (letters, digits, and underscore), `\W`: Non-word character
    - `\n`: Line break, `\t`: Tab (Python's `re` module does not support `\c`)
    - `\s`: White space, `\S`: Not white space
    - `\d`: Digit = [0-9], `\D`: Not digit = [^0-9]
    - `[0-9a-zA-Z]`: Digits and letters (the POSIX class `[:alnum:]` is not supported by Python's `re` module)

5. Quantifier:
    - `*` at least 0 times.
    - `+` at least 1 time.
    - `?` at most 1 time (0 or 1).
    - `{n}`: Exactly n times.
    - `{n,}`: At least n times
    - `{n,m}`: Between n and m times, inclusive (m ≥ n)

6. Location:
   - `^` the start of the string
   - `$` the end of the string
     
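7. Logic operation: 
   - `|` or
   - There is no "and" operator: `&` simply matches a literal `&`. To require several conditions, write them in sequence or run separate searches.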
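
As a quick check of the cheat sheet, the sketch below tries a few of these patterns with `re.findall` (introduced in the next subsection). The sample string and the variable names are made up for illustration.

```python
import re

sample = "Order 7 shipped 2024-03-15; call 555-0199 or email me. Colour or color?"

# \d+ : one or more digits in a row
numbers = re.findall(r'\d+', sample)
# \d{4} : exactly four digits in a row
four_digits = re.findall(r'\d{4}', sample)
# colou?r : the 'u' may appear at most once
spellings = re.findall(r'[Cc]olou?r', sample)
# ^ : the pattern must sit at the start of the string
starts_with_order = re.findall(r'^Order', sample)
# | : either alternative
contact_verbs = re.findall(r'call|email', sample)

print(numbers, four_digits, spellings, starts_with_order, contact_verbs)
```

Raw strings (`r'...'`) are used so that Python passes backslashes such as `\d` to the regex engine unchanged.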

### Replace Substrings
- `re.sub(pattern, new_string, original_string)`

### Find All Substrings
- `re.findall(re_pattern,string)`
- `()` Capture group (limited extraction). When the pattern contains parentheses, `findall` returns only the part matched inside them.
- The result is a **LIST** of matched substrings!
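
A minimal sketch of both calls, run on a made-up post (the string and variable names are illustrative, not part of the course data). It shows how adding `()` changes what `findall` returns.

```python
import re

post = "Thanks @alice and @bob_99 for the #nlp tips! #RegEx101"

# No parentheses: each list item is the whole match, '#' included
hashtags = re.findall(r'#\w+', post)
# With parentheses: each list item is only the part inside ()
handles = re.findall(r'@(\w+)', post)
# re.sub replaces every match, here with an empty string
without_tags = re.sub(r'#\w+', '', post)

print(hashtags)
print(handles)
print(without_tags)
```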

In [None]:
import re

In [None]:
tweet='@JerryNadler admits on #CNN they have no proof of Obstruction by @realDonaldTrump its just his "personal opinion" Meet the new #WitchHunt Same as the old #WitchHunt cc @DonaldJTrumpJr'
tweet

In [None]:
#Remove all hashtags
re.sub('#[^ ]+','',tweet)

In [None]:
#Extract all hashtags


In [None]:
#Extract all mentions


In [None]:
#How many people does this tweet mention?


In [None]:
#Remove hyperlinks
tweet2='Read this article: https://goo.gl/rwGHTP Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future.'
tweet3=re.sub('https:[^ ]+','',tweet2)
tweet3

In [None]:
#Extract all hyperlinks
re.findall('http[^ ]+',tweet2)

In [None]:
#Remove punctuation (non-word characters); the raw string keeps \W intact
re.sub(r'[\W]+',' ',tweet3)

In [None]:
#Also works for Chinese text
re.sub(r'[\W]+',' ','普京表示，歡迎中方在化解危機中的建設性角色！')

<div class="alert alert-block alert-info">
**<b>Reminder</b>** By removing punctuation, you will also lose the full stops and line breaks, so you won't be able to analyze sentence and paragraph structure. [\W]+ should ONLY BE USED WHEN the paragraph/sentence structure is not important to you.</div>
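
If sentence structure does matter, one possible workaround is to split the text into sentences first and clean each sentence separately. The sketch below is a naive illustration with a made-up paragraph; the splitting rule is an assumption and would wrongly break on abbreviations such as "Mr.".

```python
import re

paragraph = "Tesla closed many stores. Shares fell sharply! What comes next?"

# Split at whitespace that follows ., ! or ? (the lookbehind keeps the mark)
sentences = re.split(r'(?<=[.!?])\s+', paragraph)

# Punctuation can now be removed one sentence at a time
cleaned = [re.sub(r'\W+', ' ', s).strip() for s in sentences]
print(cleaned)
```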

In [None]:
#Remove spaces and line breaks at the beginning or the end of the sentence
text=' 普京表示\n歡迎中方在化解危機中的建設性角色 '


### Exercise

1. Extract all fully uppercase words, such as SAT, GRE, and PPT.
2. Extract all words that start with an uppercase letter but are not the first word of a sentence.
3. Extract all words starting with a/A and ending with s/S.
4. Extract all words that have an _a_ followed by one or more _s_.

In [None]:
sentence = "Across the world, it’s still unclear when Ukrainian pilots will begin training at the center, at the Fetesti air base in southeast Romania, which NATO allies also are using to get schooled on the fighter jets in 2020. But the delay is a window into the confusion and chaos that has confronted the military alliance’s rush to supply the A-16s"
#Write your code here
first_answer = 
second_answer = 
third_answer = 
fourth_answer = 
print(first_answer)
print(second_answer)
print(third_answer)
print(fourth_answer)

## User Functions

- A function is a block of reusable code. Notation: y=f(x), where x stands for the input variables and y for the output variables.
    - Terminology: input variables named in the definition = <b>parameters</b>, the actual values passed in a call = <b>arguments</b>, output variables = <b>return values</b>
    - <b>Global vs Local</b>: a function can create local variables that exist only inside the function. A local variable can share its name with a global variable without overwriting the global value.
    - Format:
>```python
def function_name(input1[, input2, input3...]):
    command line
    return output
```

- The job of a function is to transform x into y, like a magic trick that turns a girl into a tiger.

<img src='img/week2-function.png' width='200px'>
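
The terminology above can be seen in a small sketch. The functions `rescale()` and `announce()` are hypothetical examples, not part of the course code.

```python
def rescale(values, factor=2):
    # 'values' and 'factor' are parameters; 'result' is a local variable
    result = [v * factor for v in values]
    return result

def announce(values):
    # No return statement, so a call to this function evaluates to None
    print(len(values), "values received")

result = "untouched global"        # global variable with the same name
doubled = rescale([1, 2, 3])       # [1, 2, 3] is the argument
nothing = announce([1, 2, 3])

print(doubled, result, nothing)
```

The global `result` keeps its value because the assignment inside `rescale()` only creates a local variable.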

In [None]:
# Create a mean() function without return values
def mean(x):#x is a list of numbers
    y = 
    print(y)

In [None]:
mean([1,2,3])

In [None]:
# Assign the function output to a new variable a
# Since the function does not return anything, a will be None
a = mean([1,2,3])
a

**Revise the previous cell to add y as the return values to the function**

In [None]:
# Revise mean() function to add return values
mean([1,2,3])

In [None]:
a = mean([1,2,3])
a

In [None]:
# Global vs Local Variables
# Variables with the same name can hold different values inside and outside the function
x = mean([1,2,3])
x

**Create a data_cleaning() function to convert letter case and remove punctuation, numbers, mentions, hashtags and hyperlinks**

In [None]:
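def data_cleaning(text):
    text=text.lower()                      #convert to lowercase
    text=re.sub(r'[0-9]+','',text)         #remove numbers
    text=re.sub(r'@[^ ]+','',text)         #remove mentions
    text=re.sub(r'#[^ ]+','',text)         #remove hashtags
    text=re.sub(r'https?:[^ ]+','',text)   #remove hyperlinks (http and https)
    text=re.sub(r'[\W]+',' ',text)         #replace punctuation with a single space
    text=text.strip()                      #trim spaces at both ends
    return text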

In [None]:
#test your function with a post from @realDonaldTrump
a='@seanhannity “We the people will now be subjected to the biggest display of modern day McCarthyism....which is the widest fishing net expedition....every aspect of the presidents life....all in order to get power back so they can institute Socialism.” https://t.co/izb2tTrINB'
data_cleaning(a)

# 2. Tokenization
- Definition: tokenization is the process of splitting sentences/paragraphs/documents into a list of words (tokens).
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)
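
The difference can be illustrated with plain `str.split()`. This is only a sketch with sample strings; it does not use jieba or any other tokenizer.

```python
english = "Users message mostly smaller groups of people"
chinese = "歡迎中方在化解危機中的建設性角色"

# English: splitting on spaces already yields words
english_tokens = english.split()
# Chinese: there are no spaces, so the whole sentence comes back as one item
chinese_tokens = chinese.split()

print(len(english_tokens), len(chinese_tokens))
```

This is why Chinese and Japanese text needs a dedicated word segmenter rather than a space-based split.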

## Tokenize English Text: Hunt for Spaces

In [None]:
#Split the following sentence into words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
sentence=data_cleaning(sentence)
words=sentence.split(' ')

In [None]:
words

# 3. Second-run Data Cleaning
- Main tasks: remove stop words, stem/lemmatize words

## 3.1 Remove stop words
<font style="color:red">English stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_eng.txt</font><br>
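Stop words are very frequent words that carry little meaning on their own, so they are usually removed before analysis.<br>
- English: at, in, on, for, of, a, an, the...<br>
- Chinese: 的，地，得，了.<br>

However, the combination 得了 is not a stop word, so a token should be removed only when it matches a stop word exactly.<br>
-> Exact match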

Solution: Membership check - Check whether a word is in the predefined stopword list
>```python
x in [word1, word2, word3]
x not in [word1, word2, word3]
```
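
A small sketch of why a membership check on a list gives exact matching. The stop-word set and the token list are toy examples, not the course's stop-word file.

```python
stop_words = {'的', '地', '得', '了', 'a', 'the', 'of'}   # toy list
tokens = ['他', '得了', '冠軍', '的', '獎盃']

# A token is dropped only if the whole token equals a stop word
kept = [t for t in tokens if t not in stop_words]
print(kept)

# Membership in a list/set compares whole items; on a string it looks for substrings
in_list = 'a' in ['aa', 'b', 'c']
in_string = 'a' in 'aa'
print(in_list, in_string)
```

Here 得了 survives even though 得 and 了 are both stop words, because neither equals the full token.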

In [None]:
'a' in ['a','b','c']

In [None]:
'a' in ['aa','b','c']

In [None]:
'a' not in ['aa','b','c']

In [None]:
#Read one stop word per line; the with-block closes the file automatically
with open("./doc/stop_words_eng.txt",'r') as f:
    stopwords=[line.strip() for line in f]

In [None]:
stopwords

### Exercise

In [None]:
#go over every word in the list and check if it is a stop word
#if not, add it to new list words_rm
words_rm=[]
for word in words:
    #write your code here
    
words_rm

In [None]:
#Create a user function called remove_stopwords()
#Function Input: token list
#Function Output: token list without stop words



In [None]:
remove_stopwords(words)

## 3.2 Stem/Lemmatize Words
- Use the external packages `gensim` and `nltk` for word stemming and lemmatization.
- Words can have many variants and derivations. The goal of stemming and lemmatization is to convert the words back into their roots.
- **Stem**: Crude. Trim off the suffixes and derivational affixes of words without knowledge of the context. Results are called stems.
    - Advantage: Simple, Fast
    - Disadvantage: Less accurate. Results are often incomplete word roots. Cannot identify complex, context-based variants, such as comparatives.
    - Example: women -> women, apples -> appl, likely -> like, better -> better
    >```python
    from gensim.parsing.porter import PorterStemmer
    Stemmer=PorterStemmer()
    Stemmer.stem('word')
    Stemmer.stem_documents(list_of_words)
    ```
- **Lemmatization**: Consider the context (part of speech of words) and converts the word to its meaningful base form, which is called Lemma.
    - Advantage: Accurate and Contextualized. Results are meaningful, complete words.
    - Disadvantage: Computationally costly and slow. You need to specify the part of speech. NLTK's WordNet lemmatizer accepts nouns, verbs, adjectives, and adverbs.
    - Example: women -> woman, apples -> apple, likely -> likely, is/are -> be, better -> good
  
    >```python
    import nltk
    from nltk.stem import WordNetLemmatizer
    nltk.download('wordnet')
    Lemmatizer=WordNetLemmatizer()
    Lemmatizer.lemmatize('word','pos')
    #pos = part of speech, 'n'=noun [default], 'a'=adjective, 'v'=verb, 'r'=adverbs
    ```
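
To make the contrast concrete without installing anything, here is a deliberately naive sketch. `naive_stem()` is a toy suffix stripper, not the Porter algorithm, and `toy_lemmatize()` is a tiny hand-written lookup table, not WordNet; both names are hypothetical.

```python
def naive_stem(word, suffixes=('ing', 'ly', 'es', 's')):
    """Chop the first matching suffix; no dictionary, no context."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Lemmas are looked up by (word, part of speech)
toy_lemmas = {('women', 'n'): 'woman', ('apples', 'n'): 'apple',
              ('is', 'v'): 'be', ('better', 'a'): 'good'}

def toy_lemmatize(word, pos='n'):
    """Return the stored lemma, or the word itself if the pair is unknown."""
    return toy_lemmas.get((word, pos), word)

for w in ['apples', 'walking', 'women', 'better']:
    print(w, '->', naive_stem(w))
print(toy_lemmatize('better', 'a'), toy_lemmatize('better'))
```

The stemmer leaves a truncated root such as "appl" and cannot touch irregular forms, while the lookup returns complete words but only when the right part of speech is supplied.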

In [None]:
! pip3 install gensim
! pip3 install nltk

In [None]:
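import ssl
#Workaround for certificate errors in nltk.download(): this turns off HTTPS certificate verification, so use it only if the download fails
ssl._create_default_https_context=ssl._create_unverified_context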

In [None]:
from gensim.parsing.porter import PorterStemmer
Stemmer=PorterStemmer()
Stemmer.stem('walking')

In [None]:
print(Stemmer.stem('cats'),Stemmer.stem('apples'),Stemmer.stem('women'),Stemmer.stem('likely'),Stemmer.stem('is'),Stemmer.stem('better'))

In [None]:
Stemmer.stem_documents(["cats", "apples","women","likely","is","better"])

In [None]:
#Stem the words
words_stem=Stemmer.stem_documents(words_rm)

In [None]:
words_stem

In [None]:
import nltk
nltk.download('wordnet') #only need to run once

In [None]:
from nltk.stem import WordNetLemmatizer
Lemmatizer=WordNetLemmatizer()

In [None]:
print(Lemmatizer.lemmatize('cats'),Lemmatizer.lemmatize('apples'),Lemmatizer.lemmatize('women'))

In [None]:
print(Lemmatizer.lemmatize('likely','r'),Lemmatizer.lemmatize('is','v'),Lemmatizer.lemmatize('better','a'))

In [None]:
#POS is critical to results
print(Lemmatizer.lemmatize('likely'),Lemmatizer.lemmatize('is'),Lemmatizer.lemmatize('better'))

<div class="alert alert-block alert-info">
**<b>Reminder</b>** NLTK's lemmatizer works best together with the nltk.pos_tag() function, which supplies the part of speech of each word.<br>
    The resulting tags follow the Penn Treebank tag set: <a href="https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html">https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html</a><br>
    POS of interest: tags starting with 'N' (nouns), 'J' (adjectives), 'V' (verbs)</div>

In [None]:
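#pos_tag needs a tagger model; if NLTK reports a missing resource, download the one named in the error message with nltk.download()
nltk.pos_tag(words_rm)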

In [None]:
def lemmatization(words_rm):
    words_lemma=[]
    for word,pos in nltk.pos_tag(words_rm):
        if pos[0]=='N':
            words_lemma.append(Lemmatizer.lemmatize(word))
        elif pos[0]=='J':
            words_lemma.append(Lemmatizer.lemmatize(word,'a'))
        elif pos[0]=='V':
            words_lemma.append(Lemmatizer.lemmatize(word,'v'))
        else:
            words_lemma.append(word)
    return(words_lemma)

In [None]:
# Simplify the above program using a dictionary
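
One possible shape for the dictionary version, shown as a sketch. `TAG_TO_POS`, `lemmatize_tagged()` and the stand-in `fake_lemmatize()` are hypothetical names; the stand-in replaces the real lemmatizer only so that the example runs without NLTK data.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code
TAG_TO_POS = {'N': 'n', 'J': 'a', 'V': 'v'}

def lemmatize_tagged(tagged_words, lemmatize):
    """tagged_words: (word, tag) pairs; lemmatize: callable(word, pos) -> lemma."""
    lemmas = []
    for word, tag in tagged_words:
        pos = TAG_TO_POS.get(tag[0])          # None for tags we do not handle
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas

def fake_lemmatize(word, pos):
    # Stand-in that only records which POS code was chosen
    return f"{word}/{pos}"

demo = lemmatize_tagged(
    [('cats', 'NNS'), ('better', 'JJR'), ('is', 'VBZ'), ('quickly', 'RB')],
    fake_lemmatize)
print(demo)
```

With NLTK loaded, the same function could presumably be called as `lemmatize_tagged(nltk.pos_tag(words_rm), Lemmatizer.lemmatize)`.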


In [None]:
words_lemma=lemmatization(words_rm)

In [None]:
words_lemma

### Exercise

Download file from: https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv

In [None]:
import pandas as pd
table=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv',index_col=0)

In [None]:
table.head()

In [None]:
#Write a program to:
#Step 1: clean text
#Step 2: tokenize text
#Step 3: remove stop words from tokens
#Step 4: lemmatize tokens
#Step 5: save lemmas into a csv file. Tokens from the same tweet go on the same line
#Step 6: combine all lemmas into a list
lemmas_all=[]
import csv
file=open('trump_twitter_tokens.csv','w',newline='\n')
writer=csv.writer(file)
for text in table['tweet']:
    #Write your code here
    
file.close()

In [None]:
lemmas_all

In [None]:
lemmas_freq=pd.Series(lemmas_all).value_counts().reset_index()
lemmas_freq.columns=['word','freq']
lemmas_freq[['freq','word']].to_csv('trump_wordcloud.csv',sep='\t',index=False)
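
The frequency table can be cross-checked with the standard library alone. The sketch below uses a made-up lemma list in place of `lemmas_all`.

```python
from collections import Counter

lemmas_demo = ['great', 'wall', 'great', 'job', 'wall', 'great']

# most_common() returns (word, count) pairs sorted by count, highest first
freq = Counter(lemmas_demo).most_common()
print(freq)
```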

# QUIZ

### Word Cloud
HTML5 Word Cloud: http://timc.idv.tw/wordcloud/#

### Word Co-occurrence Network
Download Gephi from: https://gephi.org/users/download/ <br>

In [None]:
# Extract mentions from his tweets
# Generate a Co-Mention Network
# Where nodes represent Twitter handles and links represent co-mention relationships in tweets
# Hint: 
# Step 1: Check if the post has more than one mention
# Step 2: If yes, save the mentions within the post into a row
file=open('trump_twitter_comentions.csv','w',newline='\n')
writer=csv.writer(file)
for text in table['tweet']:
    #Write your code here
    
    
file.close()
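
Once each row holds the mentions of one tweet, the rows can be turned into weighted links. The sketch below assumes hypothetical rows and is only one way to count co-mention pairs before loading them into a network tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows: the handles mentioned together in each tweet
rows = [['@alice', '@bob'], ['@bob', '@alice', '@dave']]

edge_weights = Counter()
for row in rows:
    # Sort and de-duplicate so (a, b) and (b, a) count as the same link
    for pair in combinations(sorted(set(row)), 2):
        edge_weights[pair] += 1

print(edge_weights)
```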