## In-class Activity 1: Text Cleaning

In [43]:
from bs4 import BeautifulSoup
import string

def remove_html(text_data):
    soup = BeautifulSoup(text_data, 'html.parser')
    return soup.get_text()

str_data = "<html><h2>What is nlp??? </h2></html> \nNatural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.\nThe study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.\n(In this post), you will discover what natural language processing is and why it is so important.\nAfter reading this post, you will know => What natural language is and how it is different from other types of data."
processed_text = remove_html(str_data)
print(processed_text)

What is nlp???  
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.
The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.
(In this post), you will discover what natural language processing is and why it is so important.
After reading this post, you will know => What natural language is and how it is different from other types of data.


In [44]:
print('Punctuation: ', string.punctuation)

Punctuation:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [45]:
def remove_punctuation(text):
    # Strip punctuation from each space-separated chunk, then rejoin on single spaces.
    sent = []
    for t in text.split(' '):
        no_punct = ''.join(c for c in t if c not in string.punctuation)
        sent.append(no_punct)

    return " ".join(sent)

In [46]:
rmv_punc_sentence = remove_punctuation(processed_text)
print(rmv_punc_sentence)

What is nlp  
Natural Language Processing or NLP for short is broadly defined as the automatic manipulation of natural language like speech and text by software
The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers
In this post you will discover what natural language processing is and why it is so important
After reading this post you will know  What natural language is and how it is different from other types of data
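
The character-by-character loop above could also be written with `str.translate`, which deletes every punctuation character in one pass. This is a minimal sketch, not the activity's required solution; the name `remove_punctuation_translate` and the sample string are made up for illustration.

```python
import string

# Translation table that maps every punctuation character to "deleted".
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(text):
    # Whitespace (spaces and newlines) is left untouched, as in the loop version.
    return text.translate(PUNCT_TABLE)

sample_text = "What is nlp??? (In this post), you will know => more."
cleaned_sample = remove_punctuation_translate(sample_text)
print(cleaned_sample)
```

As in the output above, removing `=>` leaves a double space behind, so a later step still has to deal with stray whitespace.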


In [47]:
lower_sentence = rmv_punc_sentence.lower()
print(lower_sentence)

what is nlp  
natural language processing or nlp for short is broadly defined as the automatic manipulation of natural language like speech and text by software
the study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers
in this post you will discover what natural language processing is and why it is so important
after reading this post you will know  what natural language is and how it is different from other types of data


## In-class Activity 2: Tokenization & Lemmatization

In [48]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [49]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [50]:
doc = nlp(lower_sentence.strip())

In [51]:
tok_lem_sentence = [(token.text, token.lemma_) for token in doc]
print(tok_lem_sentence[:15])

[('what', 'what'), ('is', 'be'), ('nlp', 'nlp'), (' \n', ' \n'), ('natural', 'natural'), ('language', 'language'), ('processing', 'processing'), ('or', 'or'), ('nlp', 'nlp'), ('for', 'for'), ('short', 'short'), ('is', 'be'), ('broadly', 'broadly'), ('defined', 'define'), ('as', 'as')]
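
The output above contains a whitespace-only token (`' \n'`), and such tokens carry through to the later activities. One possible way to drop them is to keep only pairs whose text is non-empty after `strip()`. The sketch below runs on a hand-copied sample of the pairs rather than on a live spaCy `doc`, and `sample_pairs` is a hypothetical name. With a real `doc`, filtering on spaCy's `token.is_space` attribute should achieve the same thing.

```python
# Hand-copied sample of (text, lemma) pairs from the output above.
sample_pairs = [
    ('what', 'what'),
    ('is', 'be'),
    ('nlp', 'nlp'),
    (' \n', ' \n'),
    ('natural', 'natural'),
]

# A whitespace-only string becomes '' after strip(), which is falsy.
pairs_without_space = [(text, lemma) for text, lemma in sample_pairs if text.strip()]
print(pairs_without_space)
```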


## In-class Activity 3: Removing Stopwords

In [52]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sangheon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [53]:
print(stopwords.words('english')[:10])
print(len(stopwords.words('english')))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']
198


In [54]:
stop_words = set(stopwords.words('english'))

lem_sentence = [w[1] for w in tok_lem_sentence]
print("Before stopword removal: ", lem_sentence)

rmv_sw_sentence = [w for w in lem_sentence if w not in stop_words]
print("\nAfter stopword removal: ", rmv_sw_sentence)

removed_words = [w for w in lem_sentence if w in stop_words]
print("\nRemoved words: ", removed_words)

Before stopword removal:  ['what', 'be', 'nlp', ' \n', 'natural', 'language', 'processing', 'or', 'nlp', 'for', 'short', 'be', 'broadly', 'define', 'as', 'the', 'automatic', 'manipulation', 'of', 'natural', 'language', 'like', 'speech', 'and', 'text', 'by', 'software', '\n', 'the', 'study', 'of', 'natural', 'language', 'processing', 'have', 'be', 'around', 'for', 'more', 'than', '50', 'year', 'and', 'grow', 'out', 'of', 'the', 'field', 'of', 'linguistic', 'with', 'the', 'rise', 'of', 'computer', '\n', 'in', 'this', 'post', 'you', 'will', 'discover', 'what', 'natural', 'language', 'processing', 'be', 'and', 'why', 'it', 'be', 'so', 'important', '\n', 'after', 'read', 'this', 'post', 'you', 'will', 'know', ' ', 'what', 'natural', 'language', 'be', 'and', 'how', 'it', 'be', 'different', 'from', 'other', 'type', 'of', 'datum']

After stopword removal:  ['nlp', ' \n', 'natural', 'language', 'processing', 'nlp', 'short', 'broadly', 'define', 'automatic', 'manipulation', 'natural', 'language'

## In-class Activity 4: Encoding Text as Numbers

In [57]:
def make_frequency_dict(text):
    # Build a fresh word -> count mapping so re-running the cell does not double the counts.
    frequency = {}
    for word in text:
        frequency[word] = frequency.get(word, 0) + 1
    return frequency

dictionary = make_frequency_dict(rmv_sw_sentence)

In [58]:
len(dictionary)

33

In [59]:
dictionary

{'nlp': 2,
 ' \n': 1,
 'natural': 5,
 'language': 5,
 'processing': 3,
 'short': 1,
 'broadly': 1,
 'define': 1,
 'automatic': 1,
 'manipulation': 1,
 'like': 1,
 'speech': 1,
 'text': 1,
 'software': 1,
 '\n': 3,
 'study': 1,
 'around': 1,
 '50': 1,
 'year': 1,
 'grow': 1,
 'field': 1,
 'linguistic': 1,
 'rise': 1,
 'computer': 1,
 'post': 2,
 'discover': 1,
 'important': 1,
 'read': 1,
 'know': 1,
 ' ': 1,
 'different': 1,
 'type': 1,
 'datum': 1}

In [60]:
vocab_sorted = sorted(dictionary.items(), key = lambda x: x[1], reverse=True)
vocab_sorted

[('natural', 5),
 ('language', 5),
 ('processing', 3),
 ('\n', 3),
 ('nlp', 2),
 ('post', 2),
 (' \n', 1),
 ('short', 1),
 ('broadly', 1),
 ('define', 1),
 ('automatic', 1),
 ('manipulation', 1),
 ('like', 1),
 ('speech', 1),
 ('text', 1),
 ('software', 1),
 ('study', 1),
 ('around', 1),
 ('50', 1),
 ('year', 1),
 ('grow', 1),
 ('field', 1),
 ('linguistic', 1),
 ('rise', 1),
 ('computer', 1),
 ('discover', 1),
 ('important', 1),
 ('read', 1),
 ('know', 1),
 (' ', 1),
 ('different', 1),
 ('type', 1),
 ('datum', 1)]
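
The counting and sorting steps above can also be done with `collections.Counter` from the standard library. This is only a sketch on a short made-up word list (`sample_words`), not on `rmv_sw_sentence` itself.

```python
from collections import Counter

# Short illustrative word list; not the notebook's actual data.
sample_words = ['nlp', 'natural', 'language', 'nlp', 'natural', 'natural']

# Counter builds the word -> frequency mapping in one call.
sample_counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
top_two = sample_counts.most_common(2)
print(top_two)
```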

In [61]:
word_to_index = {}
i = 0

for (word, frequency) in vocab_sorted:
    if frequency > 1:
        i += 1
        word_to_index[word] = i

print(word_to_index)

{'natural': 1, 'language': 2, 'processing': 3, '\n': 4, 'nlp': 5, 'post': 6}


In [62]:
word_to_index['<OOV>'] = len(word_to_index) + 1
print(word_to_index)

{'natural': 1, 'language': 2, 'processing': 3, '\n': 4, 'nlp': 5, 'post': 6, '<OOV>': 7}


In [63]:
encoded = []

print("Processed words: ", rmv_sw_sentence)

for w in rmv_sw_sentence:
    try:
        encoded.append(word_to_index[w])
    except KeyError:
        encoded.append(word_to_index['<OOV>'])

print("\nEncoded words: ", encoded)

Processed words:  ['nlp', ' \n', 'natural', 'language', 'processing', 'nlp', 'short', 'broadly', 'define', 'automatic', 'manipulation', 'natural', 'language', 'like', 'speech', 'text', 'software', '\n', 'study', 'natural', 'language', 'processing', 'around', '50', 'year', 'grow', 'field', 'linguistic', 'rise', 'computer', '\n', 'post', 'discover', 'natural', 'language', 'processing', 'important', '\n', 'read', 'post', 'know', ' ', 'natural', 'language', 'different', 'type', 'datum']

Encoded words:  [5, 7, 1, 2, 3, 5, 7, 7, 7, 7, 7, 1, 2, 7, 7, 7, 7, 4, 7, 1, 2, 3, 7, 7, 7, 7, 7, 7, 7, 7, 4, 6, 7, 1, 2, 3, 7, 4, 7, 6, 7, 7, 1, 2, 7, 7, 7]
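
The `try`/`except KeyError` lookup above can be expressed with `dict.get`, which returns a default when the key is missing. The helper name `encode_tokens` and the small sample mapping are assumptions for illustration only.

```python
def encode_tokens(tokens, index_map, oov_token='<OOV>'):
    # Unknown words fall back to the index reserved for the OOV token.
    oov_index = index_map[oov_token]
    return [index_map.get(token, oov_index) for token in tokens]

# Small illustrative mapping; the notebook's real mapping has seven entries.
sample_index = {'natural': 1, 'language': 2, 'processing': 3, '<OOV>': 4}
sample_tokens = ['natural', 'language', 'software', 'processing']

sample_encoded = encode_tokens(sample_tokens, sample_index)
print(sample_encoded)
```

Here `'software'` is not in the mapping, so it receives the `<OOV>` index.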


## In-class Activity 5: One-Hot Encoding

In [65]:
import numpy as np

vocab = sorted(set(rmv_sw_sentence))
vocab_size = len(vocab)
word_to_index = {word: idx for idx, word in enumerate(vocab)}

one_hot_matrix = np.zeros((len(rmv_sw_sentence), vocab_size), dtype = int)
for i, word in enumerate(rmv_sw_sentence):
    index = word_to_index[word]
    one_hot_matrix[i, index] = 1

print("First 5 words and one hot vectors:\n")
for i in range(5):
    word = rmv_sw_sentence[i]
    vector = one_hot_matrix[i]
    print(f"{i+1}. word: '{word}'")
    print(f"    One-hot: {vector}\n")

First 5 words and one hot vectors:

1. word: 'nlp'
    One-hot: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

2. word: ' 
'
    One-hot: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

3. word: 'natural'
    One-hot: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

4. word: 'language'
    One-hot: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

5. word: 'processing'
    One-hot: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
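
The row-by-row loop above can also be replaced by indexing into an identity matrix: row `k` of `np.eye(vocab_size)` is already the one-hot vector for index `k`. This is a sketch on a tiny made-up token list, and the `sample_*` names are hypothetical so they do not overwrite the notebook's variables.

```python
import numpy as np

# Tiny illustrative token list; not the notebook's actual data.
sample_tokens = ['nlp', 'natural', 'language', 'natural']

# Sorted vocabulary, as in the cell above: ['language', 'natural', 'nlp'].
sample_vocab = sorted(set(sample_tokens))
sample_word_to_index = {word: idx for idx, word in enumerate(sample_vocab)}

# Look up each token's index, then select the matching identity-matrix rows.
sample_indices = [sample_word_to_index[token] for token in sample_tokens]
sample_one_hot = np.eye(len(sample_vocab), dtype=int)[sample_indices]
print(sample_one_hot)
```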

