# Class 9 - Natural Language Processing

---

# Flowchart

<img src="https://juniorworld.github.io/python-workshop/img/NLP_.png" width="600px" height="400px" align='left'>

We will demonstrate how to go through these five steps for English and Chinese texts respectively.

# 1. First-run Data Cleaning
- Main task: convert the case and remove punctuation and special elements such as hashtags and hyperlinks
- Use Regular Expression for Pattern Matching
- Convert the case: `.lower()`

## Regular Expression Cheat Sheet
1. `.`: Wildcard, any character
2. `[abc]`: Character set (a or b or c)
    - `\`: escapes the special character that follows it
    - special characters that need to be escaped: `^ [ . $ { * ( \ + ) | ? < >`
3. `[^abc]`: Negated set, not (a or b or c)
4. Character class:
    - `\w`: Word character (letters, digits and underscore), `\W`: Non-word character
    - `\n`: Line break, `\t`: Tab
    - `\s`: White space, `\S`: Not white space
    - `\d`: Digit = [0-9], `\D`: Not digit = [^0-9]
    - `[0-9a-zA-Z]`: Digits and letters (Python's `re` does not support the POSIX form `[:alnum:]`)

5. Quantifier:
    - `*`: at least 0 times
    - `+`: at least 1 time
    - `?`: at most 1 time
    - `{n}`: exactly n times
    - `{n,}`: at least n times
    - `{n,m}`: between n and m times (m>n)

6. Location:
   - `^` the start of the string
   - `$` the end of the string
     
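7. Logic operation: 
   - `|`: or
   - there is no "and" operator (`&` is matched as a literal character); to require two patterns, run two separate searches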
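The cheat sheet items can be tried out on a short string. The following is a minimal sketch; the `sample` string and the variable names are made up for illustration.

```python
import re

sample = "Order 66 shipped on 2019-03-07 to cat, cot and cut."

# '.' wildcard: 'c', then any single character, then 't'
wildcard = re.findall(r"c.t", sample)

# character set vs. negated set
in_set = re.findall(r"c[ao]t", sample)
not_in_set = re.findall(r"c[^ao]t", sample)

# quantifiers: exactly four digits vs. one or more digits
year = re.findall(r"\d{4}", sample)
numbers = re.findall(r"\d+", sample)

# location: '^' anchors at the start, '$' at the end ('.' must be escaped)
starts_with_order = bool(re.search(r"^Order", sample))
ends_with_period = bool(re.search(r"\.$", sample))

# logic: '|' matches either alternative
either = re.findall(r"cat|cut", sample)

print(wildcard, in_set, not_in_set, year, numbers, either)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting backslashes before the `re` module sees them.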

### Replace Substrings
- `re.sub(pattern, new_string, original_string)`

### Find All Substrings
- `re.findall(re_pattern,string)`
- `()`: Capture group. When the pattern contains a group, only the text matched inside the parentheses is returned.
- The result is a **LIST** of matched substrings!

In [None]:
import re

In [None]:
tweet='@JerryNadler admits on #CNN they have no proof of Obstruction by @realDonaldTrump its just his "personal opinion" Meet the new #WitchHunt Same as the old #WitchHunt cc @DonaldJTrumpJr'
tweet

In [None]:
#Remove all hashtags
re.sub('#[^ ]+','',tweet)

In [None]:
#Extract all hashtags
re.findall('#([^ ]+)',tweet)

In [None]:
#Extract all mentions
re.findall("@[^ ]+",tweet)

In [None]:
#How many people does this tweet mention?
len(re.findall("@[^ ]+",tweet))

In [None]:
#Remove hyperlinks
tweet2='Read this article: https://goo.gl/rwGHTP Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future.'
tweet3=re.sub('https?:[^ ]+','',tweet2)
tweet3

In [None]:
#Extract all hyperlinks
re.findall('http[^ ]+',tweet2)

In [None]:
#Replace punctuation and other non-word characters with a space
re.sub(r'\W+',' ',tweet3)

In [None]:
#Also works for Chinese text
re.sub(r'\W+',' ','普京表示，歡迎中方在化解危機中的建設性角色！')

<div class="alert alert-block alert-info">
**<b>Reminder</b>** By removing punctuation, you will also lose the full stops and line breaks, so you won't be able to analyze sentence and paragraph structure. [\W]+ should ONLY BE USED WHEN the paragraph/sentence structure is not important to you.</div>
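When sentence structure does matter, one option is to split the text into sentences first and clean each sentence separately. This is only a rough sketch: the `paragraph` string is made up, and the naive split rule would also break after abbreviations such as "Mr.".

```python
import re

paragraph = "Tesla closed stores. Shares fell! Will sales recover?"

# Split at whitespace that follows sentence-ending punctuation.
# (?<=[.!?]) is a lookbehind, so the punctuation itself is kept in the sentence.
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Clean each sentence on its own, so the sentence boundaries survive.
cleaned = [re.sub(r"\W+", " ", s).strip().lower() for s in sentences]

print(sentences)
print(cleaned)
```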

In [None]:
#Remove spaces and line breaks at the beginning or the end of the sentence
text=' 普京表示\n歡迎中方在化解危機中的建設性角色 '
text.strip()

In [None]:
#Same result with a regular expression: \s also covers tabs and line breaks
re.sub(r"^\s+|\s+$","",text)

## User Functions

- A function is a block of reusable code. Notation: y=f(x), where x is the list of input variables and y is the list of output variables.
    - Terminology: input variables = <b>parameters</b>, the actual values passed in = <b>arguments</b>, output variables = <b>return values</b>
    - <b>Global vs Local</b>: a function can create local variables that exist only inside it. A local variable can share its name with a global variable without overwriting the global's value.
    - Format:
>```python
def function_name(input1[,input2,input3...]):
    command line
    return(output)
```

- The job of a function is to transform x into y, like a magic trick turning a girl into a tiger.

<img src='img/week2-function.png' width='200px'>
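The global vs local distinction can be checked directly. In this small sketch (the names `total` and `average` are chosen only for illustration), the function assigns to a local `total`, and the global `total` keeps its value.

```python
total = 100  # global variable

def average(values):
    # This 'total' is a new local variable; the global one is not touched
    total = sum(values)
    return total / len(values)

result = average([1, 2, 3])
print(result, total)
```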

In [None]:
# Create a mean() function that returns a value
def mean(x):#x is a list of numbers
    y = sum(x)/len(x)
    return(y)

In [None]:
mean([1,2,3])

In [None]:
a = mean([1,2,3])
a

**Create a data_cleaning() function to convert letter case, remove punctuations, numbers, mentions, hashtags and hyperlinks**

In [None]:
def data_cleaning(text):
    text=text.lower()
    text=re.sub('[0-9]+','',text)
    text=re.sub('@[^ ]+','',text)
    text=re.sub('#[^ ]+','',text)
    text=re.sub('https?:[^ ]+','',text)
    text=re.sub(r'\W+',' ',text)
    text=text.strip()
    return(text)

In [None]:
#test your function with a post from @realDonaldTrump
a='@seanhannity “We the people will now be subjected to the biggest display of modern day McCarthyism....which is the widest fishing net expedition....every aspect of the presidents life....all in order to get power back so they can institute Socialism.” https://t.co/izb2tTrINB'
data_cleaning(a)

# 2. Tokenization
- Definition: tokenization is a process of splitting sentences/paragraphs/documents into a set of words.
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)
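For space-separated languages, the choice between `.split(' ')` and `.split()` matters when the text contains repeated spaces. A short sketch with a made-up string:

```python
text = "private  and encrypted communications"  # note the double space

# Splitting on a single space leaves an empty token where spaces repeat
by_single_space = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = text.split()

print(by_single_space)
print(by_whitespace)
```

After `data_cleaning()`, runs of punctuation have already been collapsed into single spaces, so both forms give the same tokens there.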

## Tokenize English Text: Hunt for Spaces

In [None]:
#Split the following sentence into words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
sentence=data_cleaning(sentence)
words=sentence.split(' ')

In [None]:
words

# 3. Second-run Data Cleaning
- Main tasks: remove stop words, stem/lemmatize words

## 3.1 Remove stop words
<font style="color:red">English stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_eng.txt</font><br>
Stop words carry little meaning on their own and are usually removed before analysis.<br>
- English: at, in, on, for, of, a, an, the...<br>
- Chinese: 的，地，得，了.<br>

However, the combination 得了 is not a stop word, so a token must match a stop word exactly to be removed.<br>
-> Exact Match

Solution: Membership check - Check whether a word is in the predefined stopword list
>```python
x in [word1, word2, word3]
x not in [word1, word2, word3]
```
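Why exact matching matters can be seen on a toy example. The stop list and tokens below are made up for illustration; a set is used instead of a list only because membership checks on a set are faster.

```python
stop_list = {"的", "地", "得", "了"}   # toy stop list, not the workshop file
tokens = ["他", "得了", "冠軍", "了"]

# Membership check compares whole tokens, so 得了 survives
kept = [tok for tok in tokens if tok not in stop_list]

# Substring replacement would wrongly damage 得了
damaged = "得了".replace("得", "")

print(kept, damaged)
```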

In [None]:
#Read the stop word file line by line
file=open("./doc/stop_words_eng.txt",'r')
stopwords=[]
for i in file:
    stopwords.append(i.strip())
file.close()

In [None]:
#Use a list comprehension to shorten the code
with open("./doc/stop_words_eng.txt",'r') as file:
    stopwords=[i.strip() for i in file.readlines()]

In [None]:
stopwords

### Exercise

In [None]:
#go over every word in the list and check if it is a stop word
#if not, add it to new list words_rm
words_rm=[]
for word in words:
    #write your code here

    
words_rm

In [None]:
#Create a user function called remove_stopwords()
#Function Input: token list
#Function Output: token list without stop words



In [None]:
remove_stopwords(words)

### 3.2 Stem/Lemmatize Words
- Use the external packages `gensim` and `nltk` for word stemming and lemmatization.
- Words can have many variants and derivations. The goal of stemming and lemmatization is to convert the words back into their roots.
- **Stem**: Crude. Trim off the suffixes and derivational affixes of words without knowledge of the context. Results are called stems.
    - Advantage: Simple, Fast
    - Disadvantage: Less accurate. Results are incomplete word roots. Cannot identify complex, context-based variants, such as comparative.
    - Example: women -> women, apples -> appl, likely -> like, better -> better
    >```python
    from gensim.parsing.porter import PorterStemmer
    Stemmer=PorterStemmer()
    Stemmer.stem('word')
    Stemmer.stem_documents(list_of_words)
    ```
- **Lemmatization**: Consider the context (part of speech of words) and converts the word to its meaningful base form, which is called Lemma.
    - Advantage: Accurate and Contextualized. Results are meaningful, complete words.
    - Disadvantage: Computationally costly and slow. Requires the part of speech. NLTK's WordNet lemmatizer handles nouns, verbs, adjectives and adverbs.
    - Example: women -> woman, apples -> apple, likely -> likely, is/are -> be, better -> good
  
    >```python
    import nltk
    from nltk.stem import WordNetLemmatizer
    nltk.download('wordnet')
    Lemmatizer=WordNetLemmatizer()
    Lemmatizer.lemmatize('word','pos')
    #pos = part of speech, 'n'=noun [default], 'a'=adjective, 'v'=verb, 'r'=adverbs
    ```

In [None]:
! pip3 install nltk

In [None]:
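#Workaround often used when nltk.download() fails with an SSL certificate error
#It disables certificate verification, so skip this cell if downloads already work
import ssl
ssl._create_default_https_context=ssl._create_unverified_context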

In [None]:
from nltk.stem.porter import PorterStemmer
Stemmer=PorterStemmer()
Stemmer.stem('walking')

In [None]:
print(Stemmer.stem('cats'),Stemmer.stem('apples'),Stemmer.stem('women'),Stemmer.stem('likely'),Stemmer.stem('is'),Stemmer.stem('better'))

In [None]:
#Stem the words_rm
#Store results as a list named words_stem



In [None]:
words_stem

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
Lemmatizer=WordNetLemmatizer()

In [None]:
print(Lemmatizer.lemmatize('cats'),Lemmatizer.lemmatize('apples'),Lemmatizer.lemmatize('women'))

In [None]:
print(Lemmatizer.lemmatize('likely','a'),Lemmatizer.lemmatize('is','v'),Lemmatizer.lemmatize('better','a'))

In [None]:
#POS is critical to results
print(Lemmatizer.lemmatize('likely'),Lemmatizer.lemmatize('is'),Lemmatizer.lemmatize('better'))

<div class="alert alert-block alert-info">
**<b>Reminder</b>** NLTK's lemmatizer is better used together with its nltk.pos_tag() function.<br>
    The resulting tags are annotated according to: <a href="https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html">https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html</a><br>
    POS of interest: Starting with 'N','J','V'</div>

In [None]:
nltk.pos_tag(words_rm)

In [None]:
def lemmatization(words_rm):
    words_lemma=[]
    for word,pos in nltk.pos_tag(words_rm):
        if pos[0]=='N':
            words_lemma.append(Lemmatizer.lemmatize(word))
        elif pos[0]=='J':
            words_lemma.append(Lemmatizer.lemmatize(word,'a'))
        elif pos[0]=='V':
            words_lemma.append(Lemmatizer.lemmatize(word,'v'))
        else:
            words_lemma.append(word)
    return(words_lemma)

In [None]:
# Simplify the above program using a dictionary
pos_dic = {'N':'n', 'J':'a', 'V':'v'}
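The dictionary replaces the if/elif chain with a lookup on the first letter of the tag. The sketch below shows only the lookup step, on hand-written (word, tag) pairs that stand in for `nltk.pos_tag()` output, so it runs without NLTK.

```python
pos_dic = {'N': 'n', 'J': 'a', 'V': 'v'}

# Hand-written (word, Penn tag) pairs standing in for nltk.pos_tag() output
tagged = [('users', 'NNS'), ('shared', 'VBD'), ('private', 'JJ'), ('mostly', 'RB')]

# dict.get() returns None when the first tag letter is not a key
wordnet_pos = [(word, pos_dic.get(tag[0])) for word, tag in tagged]

print(wordnet_pos)
```

A word whose lookup gives `None` (here the adverb) has no mapped part of speech and would be kept unchanged, as in the `else` branch of the longer version.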

In [None]:
def lemmatization(words_rm,pos_dic):
    words_lemma=[]
    for word,pos in nltk.pos_tag(words_rm):
        
        
    return(words_lemma)

In [None]:
words_lemma=lemmatization(words_rm,pos_dic)

In [None]:
words_lemma

### Exercise

Download file from: https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv

In [None]:
import pandas as pd
table=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv',index_col=0)

In [None]:
table.head()

In [None]:
#Write a program to:
#Step 1: clean text
#Step 2: tokenize text
#Step 3: remove stop words from tokens
#Step 4: lemmatize tokens. Tokens in the same tweet are saved at the same line.
words_lemmas = []
for text in table['tweet']:
    #Write your code here

    

In [None]:
import csv
file=open('trump_twitter_tokens.csv','w',newline='\n')
writer=csv.writer(file)
writer.writerows(words_lemmas)
file.close()

In [None]:
lemmas_all = [word for doc in words_lemmas for word in doc]

In [None]:
lemmas_all

In [None]:
lemmas_freq=pd.Series(lemmas_all).value_counts()

In [None]:
lemmas_freq

In [None]:
lemmas_freq=lemmas_freq.reset_index()
lemmas_freq.columns=['word','freq']
lemmas_freq[['freq','word']].to_csv('trump_wordcloud.csv',sep='\t',index=False)

# QUIZ

### Word Cloud
1. HTML5 Word Cloud: http://timc.idv.tw/wordcloud/#
2. Python package `wordcloud`
   - WordCloud(width, height, background_color, margin, max_words, mask, random_state, stopwords)
     - width, height: the size of the wordcloud in pixels, default = 400 x 200
     - background_color: default = "black"
     - margin: the width of space between the boundary of wordcloud and the edge of the entire image
     - max_words: maximum number of words displayed in the wordcloud
     - mask: using a mask to display wordclouds in arbitrary shapes
     - stopwords: the words that you want to exclude from the wordcloud
     - random_state: seed that controls the random layout; set it for reproducible output
   - WordCloud.generate(text): input value should be a natural string
   - WordCloud.generate_from_frequencies(dict): input value should be a dictionary mapping strings (words) to floats (frequencies).
>```python
>from wordcloud import WordCloud
>#Initiate a WordCloud instance
>wordcloud_generator = WordCloud(max_words=1000, margin=10)
>
>#Two approaches to creating wordclouds
>#1: generate from full text (cleaned)
>wordcloud_generator.generate(text)
>#2: generate from word frequency table
>wordcloud_generator.generate_from_frequencies(dict)
>```
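The dictionary expected by `generate_from_frequencies()` maps each word to its frequency. As a minimal sketch of that shape, using only the standard library and a made-up token list:

```python
from collections import Counter

tokens = ["fake", "news", "cnn", "fake", "news", "fake"]  # toy token list

# Count each token, then convert to a plain {word: frequency} dictionary
frequencies = dict(Counter(tokens))

# The two most frequent words, as (word, count) pairs
top_two = Counter(tokens).most_common(2)

print(frequencies, top_two)
```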

In [None]:
! pip3 install wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud_generator = WordCloud()

In [None]:
# 1: Generate wordcloud based on the lemmas_freq
# lemmas_freq is now a dataframe, so build a {word: freq} dictionary:
# set 'word' as the index, select the 'freq' column, then call series.to_dict()
lemmas_freq.set_index('word')['freq'].to_dict()

In [None]:
wordcloud = wordcloud_generator.generate_from_frequencies(lemmas_freq.set_index('word')['freq'].to_dict())
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

In [None]:
# 2: Generate wordclouds based on the cleaned text
# You need to merge all words into a continuous string
' '.join(lemmas_all)

In [None]:
joined_text = ' '.join(lemmas_all)
wordcloud = wordcloud_generator.generate(joined_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

In [None]:
# .join() method can also be applied to a column of strings
' '.join(table['tweet'])

In [None]:
# Generate wordclouds based on raw text
# Request the WordCloud to remove stopwords for you
# The absence of first-run data cleaning will lead to the inclusion of irrelevant or nonsensical characters in the word cloud.
text = ' '.join(table['tweet'])
wordcloud_generator = WordCloud(stopwords = stopwords)
wordcloud = wordcloud_generator.generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

### Word Co-occurrence Network
Download Gephi from: https://gephi.org/users/download/ <br>

In [None]:
# Extract mentions from his tweets
# Generate a Co-Mention Network
# Where nodes represent Twitter handles and links represent co-mention relationships in tweets
# Hint: 
# Step 1: Check if the post has more than one mention
# Step 2: If yes, save the mentions within the post as a row
file=open('trump_twitter_comentions.csv','w',newline='\n')
writer=csv.writer(file)
for text in table['tweet']:
    #Write your code here
    
    
file.close()
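One possible shape for the co-mention step, sketched on made-up tweets. Assumptions: the handles are invented, the pattern `@\w+` is used instead of `@[^ ]+` so that trailing punctuation is not captured, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
import re
from itertools import combinations

toy_tweets = [
    "Thanks @alice and @bob for the coverage",
    "Great meeting with @alice @bob @carol today",
    "Only @dave is mentioned here",
    "No mentions at all",
]

buffer = io.StringIO()      # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
edges = []
for tweet in toy_tweets:
    mentions = re.findall(r"@\w+", tweet)
    if len(mentions) > 1:
        writer.writerow(mentions)                # one row per tweet, as in the hint
        edges.extend(combinations(mentions, 2))  # every unordered pair is one link

rows = buffer.getvalue().splitlines()
print(rows)
print(edges)
```

Each row lists the handles mentioned together in one tweet, and `edges` shows the pairwise links those rows imply.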

# 4. Vectorization: Word Embeddings
- Word embedding is a technique to transform textual words into a numerical representation (word vector). Each word is mapped to one vector and this vector is trained to learn the syntactic and semantic relationships between words.
    - e.g.: "book" (1,0,3), "paper" (1,0,4) -> synonyms are close in the word space
- Applications:
    - Find out synonyms in the corpus (a list of documents). A way to understand the structure of opinion expression.
        - `word_embedding.wv.most_similar(word, topn=n)` finds the n words most similar to a focal word
        - In Trump's opinions, CNN is a byword for enemy and dishonesty, while Clinton is a synonym for crook.
    - Find out the equivalent mapping between words and concepts. It can infer the meaning of word combinations by assuming that word meanings are additive.
        - `word_embedding.wv.most_similar(positive=[word1, word2], negative=[word3])`
        - "queen" is to "woman" as "king" is to "man"
        - "Beijing" is to "China" as "Tokyo" is to "Japan"
        - Vectors support math operations, like + and -
    - Find out the least similar (least possible) word in a sentence
        - `word_embedding.wv.doesnt_match([word1, word2 ...])`
    - Evaluate the similarity (reverse of distance) between two or more sentences
    - Evaluate the possibility of a sentence belonging to this corpus
        - `word_embedding.score([word1, word2 ...])`
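Both ideas, closeness of synonyms and vector arithmetic, can be illustrated by hand. This is a toy sketch in plain Python: "book" and "paper" use the vectors from the example above, while the "tiger" vector and the 2-D analogy vectors are invented for illustration and are not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = {
    "book": (1, 0, 3),
    "paper": (1, 0, 4),
    "tiger": (5, 2, 0),   # hypothetical vector for an unrelated word
}

sim_book_paper = cosine(vectors["book"], vectors["paper"])
sim_book_tiger = cosine(vectors["book"], vectors["tiger"])

# Analogy by arithmetic: king - man + woman should land on queen
toy = {"man": (1, 0), "woman": (1, 1), "king": (3, 0), "queen": (3, 1)}
target = tuple(k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"]))

print(round(sim_book_paper, 3), round(sim_book_tiger, 3), target)
```

`most_similar(positive=[...], negative=[...])` follows the same logic: vectors in `positive` are added, vectors in `negative` are subtracted, and the nearest words to the result are returned.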

<img src="https://juniorworld.github.io/python-workshop/img/word2vec_2.png" width="300">
<img src="https://juniorworld.github.io/python-workshop/img/word2vec.png" width="500">
Reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

In [None]:
file=open('trump_twitter_tokens.csv','r')
reader = csv.reader(file)
word_doc_matrix=[]
for row in reader:
    word_doc_matrix.append(row)

In [None]:
#word document matrix
#rows: documents
#columns: words
word_doc_matrix[0]

In [None]:
! pip3 install gensim==3.8.0

In [None]:
import collections.abc
#Python 3.10 removed these aliases from `collections`; older packages may still look for them, so they are restored manually.
collections.Iterable = collections.abc.Iterable
collections.Mapping = collections.abc.Mapping
collections.MutableSet = collections.abc.MutableSet
collections.MutableMapping = collections.abc.MutableMapping

In [None]:
from gensim.models import Word2Vec

In [None]:
# This is a small corpus, so we set the vector dimension to only 30.
# Word embeddings trained on a rich corpus like Wikipedia usually use 200 dimensions or more.
word_embedding = Word2Vec(sentences=words_lemmas, size=30, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
# Get 10 synonyms most similar to a target word in the word space
# Try: cnn, clinton, biden, election, china
word_embedding.wv.most_similar('cnn', topn=10)

In [None]:
# Get synonyms based on a target vector 
# Xi is to China as XX is to America: XX = xi - china + america
word_embedding.wv.most_similar(positive=['xi','america'], negative=['china'])

In [None]:
# Evaluate how likely Trump is to say such a sentence on his Twitter account
# The higher the score, the more probable the sentence is according to his past tweets.
word_embedding.score(['cnn is friend'.split(),
                      'cnn is enemy'.split()])

In [None]:
word_embedding.score(['china is friend'.split(),
                      'china is enemy'.split()])

## Exercise
Predict the rating of an Amazon Alexa review comment
1. Open this data frame: https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv
2. Train two Word2Vec models of word representations, one for 5-star reviews and one for non-5-star reviews
3. Predict the rating of this comment: "It doesn't answer me sometimes and give me some irrelevant answers quite often"
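The prediction in step 3 comes down to comparing the scores that the two models give the same sentence. A minimal sketch of that decision rule follows; the function name `predict_group` and the two score values are hypothetical, and in practice the scores would come from each model's `score()` call.

```python
def predict_group(score_five_star, score_non_five_star):
    """Return the label of the model that gives the higher score."""
    if score_five_star > score_non_five_star:
        return "5-star"
    return "non-5-star"

# Hypothetical scores for one sentence, for illustration only
label = predict_group(-42.7, -38.1)
print(label)
```

The two models are trained on corpora of different sizes, so the comparison is best read as a rough indication, not a calibrated probability.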

In [None]:
alexa_reviews=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv',sep='\t')

In [None]:
alexa_reviews.head()

In [None]:
# How many 5-star, 4-star, 3-star, 2-star, 1-star ratings?


In [None]:
five_stars_comments=[]
for text in :
    text=data_cleaning(text)
    #write your code here
    
    
    

In [None]:
non_five_stars_comments=[]
for text in :
    text=data_cleaning(text)
    #write your code here
    
    

In [None]:
#gensim 3.8.0 (installed above) names this parameter `size`; gensim 4+ renamed it to `vector_size`
five_star_model=Word2Vec(sentences=five_stars_comments, size=30, window=5, min_count=1, workers=4, hs=1, negative=0)
non_five_star_model=Word2Vec(sentences=non_five_stars_comments, size=30, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
test_sentence = "It doesn't answer me sometimes and give me some irrelevant answers quite often."
test_clean = data_cleaning(test_sentence)
test_words = test_clean.split()
test_rm = remove_stopwords(test_words)
test_lemma = lemmatization(test_rm, pos_dic)

In [None]:
five_star_model.score([test_lemma])

In [None]:
non_five_star_model.score([test_lemma])