# Class11 - Natural Language Processing

---

# Natural Language Processing

<img src="https://juniorworld.github.io/python-workshop/img/NLP_.png" width="600px" height="400px" align='left'>

We will demonstrate how to go through these five steps for English and Chinese texts respectively.

## 1. First-run Data Cleaning
- Main task: convert the case and remove punctuation and special elements such as hashtags and hyperlinks
- Use Regular Expression for Pattern Matching
- Convert the case: `.lower()`

### Regular Expression Cheat Sheet
1. `.`: Wildcard, any character
2. `[abc]`: Range (a or b or c)
    - \: escape following special characters
    - special characters that need to be escaped: `^ [ . $ { * ( \ + ) | ? < >`
3. `[^abc]`: Reverse range, Not (a or b or c)
4. Quantifier:
    - `*`: at least 0 times.
    - `+`: at least once.
    - `?`: at most once.
    - `{n}`: exactly n times.
    - `{n,}`: at least n times.
    - `{n,m}`: between n and m times (m ≥ n).
5. Character class:
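    - `\n`: Line break, `\t`: Tab (Python's `re` has no `\c` control-character class)
    - `\s`: White space, `\S`: Not white space
    - `\d`: Digit, `\D`: Not digit
    - `\w`: Word character (letter, digit or underscore), `\W`: Not a word character
    - `[0-9a-zA-Z]`: Digits and letters (the POSIX class `[:alnum:]` is not supported by Python's `re`)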
6. Location:
    - `^` the start of the string
    - `$` the end of the string
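
A few of the patterns above can be tried directly with Python's `re` module. This is a minimal sketch; the sample string is made up for illustration and is not part of the workshop data.

```python
import re

sample = "Call 555-1234 or 555-98765 before 9am!"  # hypothetical example text

# \d{3}-\d{4}: three digits, a hyphen, then exactly four digits
# \b on both sides keeps the match from ending in the middle of a longer number
phones = re.findall(r"\b\d{3}-\d{4}\b", sample)

# [^ ]+ : one or more characters that are not a space
chunks = re.findall(r"[^ ]+", sample)

# ^ and $ anchor the pattern to the start and the end of the string
starts_with_call = bool(re.search(r"^Call", sample))
ends_with_bang = bool(re.search(r"!$", sample))

# \. matches a literal dot; an unescaped . would match any character
dots = re.findall(r"\.", "v1.2.3")

print(phones, chunks, starts_with_call, ends_with_bang, dots)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before `re` sees them.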

### Replace Substrings
- `re.sub(pattern, new_string, original_string)`

### Find All Substrings That Match the Pattern
- `re.findall(re_pattern,string)`
- The result is a **LIST** of matching substrings!

In [None]:
import re

In [None]:
tweet='@JerryNadler admits on #CNN they have no proof of Obstruction by @realDonaldTrump its just his "personal opinion" Meet the new #WitchHunt Same as the old #WitchHunt cc @DonaldJTrumpJr'
tweet

In [None]:
#Remove all hashtags
re.sub('#[^ ]+','',tweet)

In [None]:
#Extract all hashtags
re.findall('#[^ ]+',tweet)

In [None]:
#Extract all mentions


In [None]:
#Remove hyperlinks
tweet2='Read this article: https://goo.gl/rwGHTP Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future.'
tweet3=re.sub('your_code_here','',tweet2)
tweet3

In [None]:
#Extract all hyperlinks
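
One possible way to handle mentions and hyperlinks is sketched below. The sample post and the exact patterns are assumptions chosen for illustration; other patterns (for example `@[^ ]+`, in the style used for hashtags above) would also work.

```python
import re

# Hypothetical sample post, not taken from the dataset
post = "Thanks @alice and @bob_2 for sharing https://example.com/a1 and http://example.org #news"

# A mention is '@' followed by one or more word characters (letters, digits, underscore)
mentions = re.findall(r"@\w+", post)

# A hyperlink is 'http' or 'https', then '://', then everything up to the next whitespace
links = re.findall(r"https?://\S+", post)

# The same pattern passed to re.sub removes the hyperlinks instead of extracting them
no_links = re.sub(r"https?://\S+", "", post)

print(mentions)
print(links)
print(no_links)
```

Compared with `@[^ ]+`, the pattern `@\w+` stops at punctuation, so a trailing comma or full stop is not swallowed into the handle.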


In [None]:
#Remove non-word punctuations
re.sub(r'[\W]+',' ',tweet3)

In [None]:
#Also works for Chinese text
re.sub(r'[\W]+',' ','普京表示，歡迎中方在化解危機中的建設性角色！')

<div class="alert alert-block alert-info">
**<b>Reminder</b>** By removing punctuation, you will also lose the full stops and line breaks, so you won't be able to analyze sentence and paragraph structure. ONLY USE [\W]+ when the paragraph/sentence structure is not important for you.</div>
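
When sentence structure does matter, one option is to split the text into sentences first and clean each sentence separately. The following is a rough sketch under that assumption; the sample text is made up, and the simple split rule will break on abbreviations such as "Mr.".

```python
import re

text = "Stocks fell today. Why? Nobody knows!\nMarkets reopen Monday."  # made-up example

# Split after ., ! or ? followed by whitespace, so sentence boundaries survive
sentences = re.split(r"(?<=[.!?])\s+", text)

# Clean each sentence separately: punctuation is dropped only inside a sentence
cleaned = [re.sub(r"[\W]+", " ", s).strip().lower() for s in sentences]

print(cleaned)
```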

In [None]:
#Remove spaces and line breaks at the beginning or the end of the sentence
text=' 普京表示 歡迎中方在化解危機中的建設性角色 '
text.strip()

Create a data_cleaning() function to convert letter case and remove punctuation, numbers, mentions, hashtags and hyperlinks

In [None]:
def data_cleaning(text):
    text=text.lower()
    text=re.sub(r'[0-9]+','',text) #numbers
    text=re.sub(r'@[^ ]+','',text) #mentions
    text=re.sub(r'#[^ ]+','',text) #hashtags
    text=re.sub(r'https?:[^ ]+','',text) #hyperlinks, both http and https
    text=re.sub(r'[\W]+',' ',text) #punctuation and other non-word characters
    text=text.strip()
    return text

In [None]:
#test your function with a post from @realDonaldTrump
a='@seanhannity “We the people will now be subjected to the biggest display of modern day McCarthyism....which is the widest fishing net expedition....every aspect of the presidents life....all in order to get power back so they can institute Socialism.” https://t.co/izb2tTrINB'
data_cleaning(a)

## 2. Tokenization
- Definition: tokenization is the process of splitting sentences/paragraphs/documents into a sequence of words (tokens).
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)

### Tokenize English Text: Hunt for Spaces

In [None]:
#Split the following sentence into words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
sentence=data_cleaning(sentence)
words=sentence.split(' ')

In [None]:
words
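
A small detail worth checking is how `split` behaves on edge cases, for example a post that is empty after cleaning. The sketch below compares the options; `tokenize` is a hypothetical helper name introduced only for this illustration.

```python
import re

# str.split(' ') cuts at every single space, so an empty string still yields one empty token
print(''.split(' '))        # ['']
print('a  b'.split(' '))    # ['a', '', 'b']

# str.split() with no argument cuts at runs of whitespace and never yields empty tokens
print(''.split())           # []
print('a  b'.split())       # ['a', 'b']

def tokenize(cleaned_text):
    """Split cleaned text into word tokens, skipping empty strings."""
    return cleaned_text.split()

# re.findall(r'\w+', ...) collects runs of word characters straight from raw text
raw_tokens = re.findall(r"\w+", "Mr. Zuckerberg, who runs Facebook")
print(raw_tokens)
```

Using `split()` avoids empty-string tokens ending up in later steps such as stop word removal and word counts.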

## 3. Second-run Data Cleaning
- Main task: remove stop words, stem/lemmatize words

### 3.1 Remove stop words
<font style="color:red">English stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_eng.txt</font><br>
Stop words are very frequent words that carry little meaning on their own, so they add little to understanding a text.<br>
- English: at, in, on, for, of, a, an, the...<br>
- Chinese: 的，地，得，了.<br>

However, the combination 得了 is not a stop word, so a stop word must match the whole token, not just part of it.<br>
-> Exact Match

Solution: Membership check - Check whether a word is in the predefined stopword list
>```python
x in [word1, word2, word3]
x not in [word1, word2, word3]
```
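
The difference between exact matching and partial matching can be seen with the Chinese stop words listed above. This is a minimal illustration; `stop_chars` is a name introduced here.

```python
stop_chars = ['的', '地', '得', '了']  # the Chinese stop words listed above

# Membership in a list compares whole items, so the two-character word is not matched
whole_word_is_stop = '得了' in stop_chars      # False
single_char_is_stop = '得' in stop_chars       # True

# Membership in a string is a substring test, which would wrongly flag '得' inside '得了'
substring_hit = '得' in '得了'                  # True

print(whole_word_is_stop, single_char_is_stop, substring_hit)
```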

In [None]:
'a' in ['a','b','c']

In [None]:
'a' in ['aa','b','c']

In [None]:
'a' not in ['aa','b','c']

In [None]:
#open the stop word file (one word per line) and close it automatically afterwards
with open("path_to_stopword_file",'r') as file:
    stopwords=[i.strip() for i in file.readlines()]

In [None]:
stopwords

### Exercise

In [None]:
#go over every word in the list and check if it is a stop word
#if not, add it to new list words_rm
words_rm=[]
for word in words:
    #write your code here
    
words_rm

In [None]:
#Create a user function called remove_stopwords()
#Function Input: token list
#Function Output: token list without stop words
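
One possible shape for such a function is sketched below. Passing the stop word list as a second argument is an assumption made here to keep the example self-contained, and the tiny stop word list stands in for the downloaded file.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word collection (exact match)."""
    stopword_set = set(stopwords)  # set lookup is faster than scanning a list
    return [token for token in tokens if token not in stopword_set]

# Tiny hypothetical stop word list, standing in for the downloaded file
demo_stopwords = ['a', 'an', 'the', 'of', 'in', 'on', 'who']
demo_tokens = ['the', 'nature', 'of', 'social', 'media']
print(remove_stopwords(demo_tokens, demo_stopwords))
```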



### 3.2 Stem/Lemmatize Words
- Use the external packages `gensim` and `nltk` for word stemming and lemmatization.
- Words can have many variants and derivations. The goal of stemming and lemmatization is to convert the words back into their roots.
- **Stem**: Crude. Trim off the suffixes and derivational affixes of words without knowledge of the context. Results are called stems.
    - Advantage: Simple, Fast
    - Disadvantage: Less accurate. Results are incomplete word roots. Cannot identify complex, context-based variants, such as comparative.
    - Example: women -> women, apples -> appl, likely -> like, better -> better
    >```python
    from gensim.parsing.porter import PorterStemmer
    Stemmer=PorterStemmer()
    Stemmer.stem('word')
    Stemmer.stem_documents(list_of_words)
    ```
- **Lemmatization**: Considers the context (the part of speech of a word) and converts the word to its meaningful base form, which is called the lemma.
    - Advantage: Accurate and Contextualized. Results are meaningful, complete words.
    - Disadvantage: Computationally costly and slow. You need to specify the part of speech. NLTK's WordNet lemmatizer handles only nouns, verbs, adjectives and adverbs.
    - Example: women -> woman, apples -> apple, likely -> likely, is -> be, better -> good
    >```python
    import nltk
    from nltk.stem import WordNetLemmatizer
    nltk.download('wordnet')
    Lemmatizer=WordNetLemmatizer()
    Lemmatizer.lemmatize('word','pos')
    #pos = part of speech, 'n'=noun [default], 'a'=adjective, 'v'=verb, 'r'=adverb
    ```

In [None]:
! pip install gensim
! pip install nltk

In [None]:
from gensim.parsing.porter import PorterStemmer
Stemmer=PorterStemmer()
Stemmer.stem('walking')

In [None]:
print(Stemmer.stem('cats'),Stemmer.stem('apples'),Stemmer.stem('women'),Stemmer.stem('likely'),Stemmer.stem('is'),Stemmer.stem('better'))

In [None]:
Stemmer.stem_documents(["cats", "apples","women","likely","is","better"])

In [None]:
#Stem the words
words_stem=Stemmer.stem_documents(words_rm)

In [None]:
words_stem

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
Lemmatizer=WordNetLemmatizer()

In [None]:
print(Lemmatizer.lemmatize('cats'),Lemmatizer.lemmatize('apples'),Lemmatizer.lemmatize('women'))

In [None]:
print(Lemmatizer.lemmatize('likely','a'),Lemmatizer.lemmatize('is','v'),Lemmatizer.lemmatize('better','a'))

<div class="alert alert-block alert-info">
**<b>Reminder</b>** NLTK's lemmatizer is better used together with its nltk.pos_tag() function.<br>
    The resulting tags are annotated according to: <a href="https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html">https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html</a><br>
    POS of interest: Starting with 'N','J','V'</div>
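
The rule in the reminder (tags starting with 'N', 'J' or 'V') can be written as a small lookup from Penn Treebank tags to the letters the lemmatizer expects. `penn_to_wordnet` is a hypothetical helper name used only in this sketch.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter, or None."""
    first_letter_to_pos = {'N': 'n', 'J': 'a', 'V': 'v'}
    return first_letter_to_pos.get(tag[:1])

# A few Penn tags: plural noun, comparative adjective, verb, preposition
for tag in ['NNS', 'JJR', 'VBZ', 'IN']:
    print(tag, penn_to_wordnet(tag))
```

Words whose tag maps to `None` are left unchanged, because the lemmatizer has no suitable part of speech for them.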

In [None]:
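nltk.download('averaged_perceptron_tagger') #tagger model needed by pos_tag; recent NLTK releases name it 'averaged_perceptron_tagger_eng'
nltk.pos_tag(words_rm)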

In [None]:
def lemmatization(words_rm):
    words_lemma=[]
    for word,pos in nltk.pos_tag(words_rm):
        if pos[0]=='N':
            words_lemma.append(Lemmatizer.lemmatize(word))
        elif pos[0]=='J':
            words_lemma.append(Lemmatizer.lemmatize(word,'a'))
        elif pos[0]=='V':
            words_lemma.append(Lemmatizer.lemmatize(word,'v'))
        else:
            words_lemma.append(word)
    return(words_lemma)

In [None]:
words_lemma=lemmatization(words_rm)

In [None]:
words_lemma

### Exercise

Download file from: https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv

In [None]:
import pandas as pd
table=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/trump_tweets.csv',index_col=0)

In [None]:
table.head()

In [None]:
#Write a program to:
#Step 1: clean text
#Step 2: tokenize text
#Step 3: remove stop words from tokens
#Step 4: lemmatize tokens
#Step 5: save lemmas into csv file. tokens in the same tweet are saved on the same line
#Step 6: combine all lemmas into a list
lemmas_all=[]
import csv
file=open('trump_twitter_tokens.csv','w',newline='\n')
writer=csv.writer(file)
for text in table['tweet']:
    #Write your code here
    
    lemmas_all.extend()
    writer.writerow()
file.close()
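
The overall shape of this loop can be sketched on toy data. Everything below is an assumption for illustration: the two sample posts and the tiny stop word set are made up, `clean` is a compact stand-in for `data_cleaning()`, an in-memory buffer replaces the output file, and the lemmatization step is only marked by a comment.

```python
import csv
import io
import re

def clean(text):
    # Compact stand-in for data_cleaning(): lower-case, drop mentions, hashtags, links, punctuation
    text = text.lower()
    text = re.sub(r"@[^ ]+|#[^ ]+|https?:[^ ]+", "", text)
    text = re.sub(r"[\W\d]+", " ", text)
    return text.strip()

demo_stopwords = {'the', 'is', 'a', 'to', 'we', 'will'}   # hypothetical, tiny
demo_tweets = ["We will win! #MAGA https://t.co/abc",
               "The WALL is a promise to voters @someone"]

lemmas_all = []
buffer = io.StringIO()          # in-memory stand-in for the output file
writer = csv.writer(buffer)
for text in demo_tweets:
    tokens = clean(text).split()
    tokens = [t for t in tokens if t not in demo_stopwords]
    # a lemmatization step (e.g. the notebook's lemmatization() function) would go here
    lemmas_all.extend(tokens)   # one flat list across all tweets
    writer.writerow(tokens)     # one row per tweet

print(lemmas_all)
print(buffer.getvalue())
```

`extend` adds the tokens of each tweet to one flat list for frequency counts, while `writerow` keeps the per-tweet grouping that later steps rely on.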

In [None]:
lemmas_freq=pd.Series(lemmas_all).value_counts().reset_index()
lemmas_freq.columns=['word','freq']
lemmas_freq[['freq','word']].to_csv('trump_wordcloud.csv',sep='\t')

# QUIZ

https://www.menti.com/alhbfixudxfg

<img src="https://juniorworld.github.io/python-workshop/img/Week%2011_NLP_quiz.png" width="200" align="left">

### Word Cloud
HTML5 Word Cloud: http://timc.idv.tw/wordcloud/#

### Word Co-occurrence Network
Download Gephi from: https://gephi.org/users/download/ <br>

In [None]:
# Extract mentions from his tweets
# Generate a Co-Mention Network
# Where nodes represent Twitter handles and links represent co-mention relationships in tweets
# Hint: 
# Step 1: Check if the post has more than one mention
# Step 2: If yes, save the mentions within the post into a row
file=open('trump_twitter_comentions.csv','w',newline='\n')
writer=csv.writer(file)
for text in table['tweet']:
    #Write your code here
    
    writer.writerow()
file.close()
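
The two hinted steps can be sketched on made-up posts as follows. The sample tweets, the in-memory buffer used instead of a file, and the extra `edges` list of handle pairs are assumptions for illustration.

```python
import csv
import io
import re
from itertools import combinations

demo_tweets = [                      # made-up posts for illustration
    "Great meeting with @alice and @bob today",
    "Thank you @alice!",
    "@carol @alice @bob on the panel",
]

buffer = io.StringIO()               # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
edges = []
for text in demo_tweets:
    # lower-case and de-duplicate so '@Alice' and '@alice' count as one handle
    mentions = sorted(set(re.findall(r"@\w+", text.lower())))
    if len(mentions) > 1:            # Step 1: keep posts with more than one mention
        writer.writerow(mentions)    # Step 2: one row of co-mentioned handles per post
        edges.extend(combinations(mentions, 2))  # every pair of handles is one link

print(buffer.getvalue())
print(edges)
```

Each pair in `edges` corresponds to one link between two nodes in the co-mention network.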

## Word Embeddings
- Word embedding is a technique to transform textual words into a numerical representation (word vector). Each word is mapped to one vector and this vector is trained to learn the syntactic and semantic relationships between words.
    - e.g.: "book" (1,0,3), "paper" (1,0,4) -> synonyms are close in the word space
- Implications:
    - Find out synonyms in the corpus (a list of documents). A way to understand the structure of opinion expression.
        - In Trump's opinions, CNN is a byword for enemy and dishonesty, while Clinton is a synonym for crook.
    - Find out the equivalent mapping between words and concepts. The model can capture the meaning of word combinations by assuming that the relationship within one word pair carries over to other word pairs.
        - "queen" is to "woman" what "king" is to "man"
        - "Beijing" is to "China" what "Tokyo" is to "Japan"
        - Vectors are eligible for math operations, like + and -
    - Find out the least similar (least possible) word in a sentence
    - Evaluate the similarity (distance) between two or more sentences
    - Evaluate the possibility of a sentence belonging to this corpus
        - Corpus: A set of text documents

<img src="https://www.usna.edu/Users/cs/nchamber/courses/nlp/f20/labs/lab5/wordgraph.png" width="400">
Reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
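
The idea that vectors support math operations can be illustrated with hand-made numbers. The sketch below uses invented 3-dimensional vectors, not trained embeddings, and the helper names `cosine` and `analogy` are introduced only for this example.

```python
import math

# Hand-made 3-dimensional vectors, purely illustrative (trained models use many more dimensions)
vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word d such that a is to b what c is to d (d is closest to b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy('man', 'king', 'woman'))
```

The same arithmetic, applied to learned vectors, is what the `positive` and `negative` arguments of `most_similar` express.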

In [None]:
#read the saved tokens back: each csv row is the token list of one tweet
tokens=[]
with open('trump_twitter_tokens.csv','r') as file:
    reader = csv.reader(file)
    for row in reader:
        tokens.append(row)

In [None]:
#tokens is a list of documents
#rows: documents (tweets)
#elements: words in that document
tokens[0]

In [None]:
from gensim.models import Word2Vec

In [None]:
model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
# Get the 10 words most similar to a target word in the word space
# Try: cnn, clinton, biden, election, china
model.wv.most_similar('cnn', topn=10)

In [None]:
# Get synonyms based on a target vector 
# Tariff is to china what XX is to usa: XX = tariff - china + usa
model.wv.most_similar(positive=['tariff','usa'], negative=['china'])

In [None]:
# Find the least similar word in a list of words
model.wv.doesnt_match(['china','usa','canada','cnn','mexico'])

In [None]:
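# Evaluate the possibility that Trump says such a sentence in his Twitter account
# score() returns one log-likelihood per sentence (higher, i.e. closer to 0, means more likely)
# it requires a model trained with hs=1 and negative=0, as above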
model.score(['cnn is friend'.split(),
             'cnn is enemy'.split()])

In [None]:
model.score(['china is friend'.split(),
             'china is enemy'.split()])

### Exercise
Predict the rating of an Amazon Alexa review comment
1. Open this data frame: https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv
2. Train two Word2Vec models of word representations, for 5-star reviews and non-5-star reviews respectively
3. Predict the rating of a comment that says "It doesn't answer me sometimes and give me some irrelevant answers quite often"

In [None]:
alexa_reviews=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/amazon_alexa.tsv',sep='\t')

In [None]:
alexa_reviews.head()

In [None]:
# How many 5-star, 4-star, 3-star, 2-star, 1-star ratings?


In [None]:
five_stars_comments=[]
for text in : #fill in your code
    text=data_cleaning(text)
    #write your code here
    
    
    five_stars_comments.append(words_lemma)

In [None]:
non_five_stars_comments=[]
for text in : #fill in your code
    text=data_cleaning(text)
    #write your code here
    
    non_five_stars_comments.append(words_lemma)

In [None]:
five_star_model=Word2Vec(sentences=five_stars_comments, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)
non_five_star_model=Word2Vec(sentences=non_five_stars_comments, vector_size=100, window=5, min_count=1, workers=4, hs=1, negative=0)

In [None]:
five_star_model.score([data_cleaning("It doesn't answer me sometimes and give me some irrelevant answers quite often").split()])

In [None]:
non_five_star_model.score([data_cleaning("It doesn't answer me sometimes and give me some irrelevant answers quite often").split()])
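
To turn the two scores into a prediction, one option is to compare them directly or to normalize them into probabilities. The sketch below assumes that each score is a log-likelihood and that both classes are equally likely beforehand; the numbers are invented placeholders, not real model output, and `class_probabilities` is a hypothetical helper.

```python
import math

def class_probabilities(log_likelihoods):
    """Turn per-class log-likelihoods into probabilities that sum to 1 (equal priors assumed)."""
    highest = max(log_likelihoods.values())
    # subtract the maximum before exponentiating to avoid numerical underflow
    weights = {label: math.exp(value - highest) for label, value in log_likelihoods.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# Hypothetical scores, standing in for the outputs of the two model.score() calls
scores = {'five_star': -48.2, 'non_five_star': -45.9}
probs = class_probabilities(scores)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

If the two groups of reviews differ a lot in size, the equal-prior assumption is questionable, and the class proportions could be added to the log-likelihoods before normalizing.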