# CSC 620 - HW #14

Student: Mark Kim

Lesson Author: Jason Brownlee
[How to Get Started with Deep Learning for Natural Language Processing](https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/)

## Lesson 1 - Deep Learning and Natural Language

For this lesson you must research and list 10 impressive applications of deep
learning methods in the field of natural language processing.

1. [Qualifying Certainty in Radiology Reports through Deep Learning–Based Natural
   Language
   Processing](https://www-ncbi-nlm-nih-gov.jpllnet.sfsu.edu/pmc/articles/PMC8562739/)
2. [A natural language processing and deep learning approach to identify child
   abuse from pediatric electronic medical
   records](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7909689/)
3. [Deep Learning Natural Language Processing Successfully Predicts the
   Cerebrovascular Cause of Transient Ischemic Attack-Like Presentations](https://www.ahajournals.org/doi/pdf/10.1161/STROKEAHA.118.024124)
4. [Dynamic sign language translating system using deep learning and natural
   language
   processing](https://www.proquest.com/docview/2623612530?pq-origsite=primo)
5. [NATURALPROOFS: Mathematical Theorem Proving in Natural
   Language](https://arxiv.org/pdf/2104.01112.pdf)
6. [Explaining neural activity in human listeners with deep learning via natural
   language processing of narrative
   text](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9596412/)
7. [Identifying disaster-related tweets and their semantic, spatial and temporal
   context using deep learning, natural language processing and spatial analysis:
   a case study of Hurricane
   Irma](https://www.tandfonline.com/doi/epdf/10.1080/17538947.2018.1563219)
8. [Deep Learning-Based Natural Language Processing for Screening Psychiatric
   Patients](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7874001/)
9. [A natural language processing approach based on embedding deep learning from
   heterogeneous compounds for quantitative structure–activity relationship
   modeling](https://onlinelibrary.wiley.com/doi/10.1111/cbdd.13742)
10. [Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing](https://www.sciencedirect.com/science/article/abs/pii/S1532046421003130?via%3Dihub)


## Lesson 2 - Cleaning Text Data

Your task is to locate a free classical book on the Project Gutenberg website,
download the ASCII version of the book and tokenize the text and save the result
to a new file.

This section is relatively straightforward and does not really need explanation.

In [1]:
import nltk

with open("ArtOfWar.txt", "rt", encoding="utf-8") as f:
  text = f.read()

manualTokenized = text.lower().split()

nltkTokenized = nltk.word_tokenize(text)

In [2]:
print(manualTokenized[:10])
print(nltkTokenized[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'art', 'of', 'war,', 'by']
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Art', 'of', 'War', ',']
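
The task also asks for the tokens to be saved to a new file, which the cells above stop short of. Below is a minimal sketch of that step using only the standard library. The regex tokenizer is a rough stand-in for `nltk.word_tokenize`, and the sample text and the output file name `tokens.txt` are placeholders of mine, not part of the original notebook.

```python
import os
import re
import tempfile

# Stand-in for the contents of ArtOfWar.txt (placeholder sample text).
sample_text = "The Project Gutenberg eBook of The Art of War, by Sun Tzu"

# Split into word tokens and standalone punctuation marks.
# This is a rough substitute for nltk.word_tokenize, not an exact match.
tokens = re.findall(r"\w+|[^\w\s]", sample_text)

# Write one token per line to a new file (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    saved_tokens = f.read().split("\n")

print(saved_tokens[:10])
```

Writing one token per line keeps the file easy to reload with a plain `split("\n")`; a space-separated file would work just as well for this text.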


## Lesson 3 - Bag-of-Words Model

Your task in this lesson is to experiment with the scikit-learn and Keras
methods for encoding small contrived text documents for the bag-of-words model.

In this task (and subsequent tasks) I used the Project Gutenberg text above.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer

nltkSentences = nltk.sent_tokenize(text)
nltkSentences[:5]

['The Project Gutenberg eBook of The Art of War, by Sun Tzŭ\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.',
 'You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org.',
 'If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.',
 'Title: The Art of War\n\nAuthor: Sun Tzŭ\n\nTranslator: Lionel Giles\n\nRelease Date: May 1994 [eBook #132]\n[Most recently updated: October 16, 2021]\n\nLanguage: English\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE ART OF WAR ***\n\n\n\n\nSun Tzŭ\non\nThe Art of War\n\nTHE OLDEST MILITARY TREATISE IN THE WORLD\nTranslated from the Chinese with Introduction and Critical Notes\n\nBY\nLIONEL GILES, M.A.',
 'Assistant in the Department of Oriental Printed Books and MSS

### Using sklearn

Here, I fit a `TfidfVectorizer` on the sentences of the text and then
transformed the first sentence into a TF-IDF vector.

In [4]:
vectorizer = TfidfVectorizer()
vectorizer.fit(nltkSentences)
print(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get, reverse=True)[:10])
print(vectorizer.idf_)
vector = vectorizer.transform([nltkSentences[0]])
print(vector.shape)
print(vector.toarray())

['œufs', 'être', 'zenith', 'zenana', 'yüeh', 'yung_', 'yun', 'yuan_', 'yuan', 'yu_']
[6.3975592  5.88673358 6.90838482 ... 8.41246222 8.41246222 8.41246222]
(1, 6964)
[[0. 0. 0. ... 0. 0. 0.]]
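
To make the `idf_` values above less opaque, here is a small sketch of the smoothed inverse document frequency that, as far as I know, `TfidfVectorizer` uses by default (`smooth_idf=True`): $\ln\frac{1 + n}{1 + \mathrm{df}} + 1$. The helper name `smoothed_idf` is mine, and the sketch assumes the vectorizer saw the same 3312 sentences that the Keras tokenizer reports as `document_count` later in this lesson.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_sentences = 3312  # number of sentences (documents) in the text

# A term found in only one sentence gets the largest weight.
rare = smoothed_idf(n_sentences, 1)
# A term found in every sentence gets the smallest weight, exactly 1.
common = smoothed_idf(n_sentences, n_sentences)

print(round(rare, 4), common)
```

Under these assumptions the weight for a term that occurs in a single sentence comes out at about 8.4125, which agrees with the repeated `8.41246222` entries at the end of the printed array.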


### Using Keras

I did the same with the Keras `Tokenizer`, encoding each sentence as a vector
of word counts.

In [5]:
from collections import Counter

t = Tokenizer()
t.fit_on_texts(nltkSentences)
print(Counter(t.word_counts).most_common(5))
print(t.document_count)
print(Counter(t.word_index).most_common(5))
print(Counter(t.word_docs).most_common(5))

encoded_docs = t.texts_to_matrix(nltkSentences, mode='count')
print(encoded_docs)

[('the', 3866), ('of', 2151), ('to', 1718), ('and', 1485), ('in', 1189)]
3312
[('newsletter', 7102), ('subscribe', 7101), ('search', 7100), ('pg', 7099), ('paper', 7098)]
[('the', 1867), ('of', 1357), ('to', 1190), ('and', 1062), ('in', 953)]
[[0. 5. 4. ... 0. 0. 0.]
 [0. 2. 1. ... 0. 0. 0.]
 [0. 3. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 2. 0. ... 1. 1. 1.]]
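
Column 0 of the matrix above is always zero because, as I understand it, the Keras `Tokenizer` numbers words from 1 in order of decreasing frequency and reserves index 0. The sketch below mimics `texts_to_matrix(..., mode='count')` in plain Python on three toy sentences. `count_matrix` is a hypothetical helper, and the whitespace tokenization is a simplification of what Keras does (Keras also strips punctuation).

```python
from collections import Counter

def count_matrix(docs):
    """Build a word index (1-based, most frequent first) and a count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    counts = Counter(word for tokens in tokenized for word in tokens)
    # Index 0 is left unused, so the most frequent word gets index 1.
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    matrix = []
    for tokens in tokenized:
        row = [0] * (len(word_index) + 1)
        for word in tokens:
            row[word_index[word]] += 1
        matrix.append(row)
    return word_index, matrix

docs = ["the art of war", "the way of the general", "war is the way"]
word_index, matrix = count_matrix(docs)
print(word_index)
for row in matrix:
    print(row)
```

Each row has one more column than there are distinct words, which is the same shape convention as the `encoded_docs` matrix printed above.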


## Lesson 4: Word Embedding Representation

Your task in this lesson is to train a word embedding using Gensim on a text
document, such as a book from Project Gutenberg.

First, I tokenized each sentence.

In [11]:
sentTokenized = [nltk.word_tokenize(sentence) for sentence in nltkSentences]

Following this, I trained a Word2Vec model with the tokenized sentences.

In [16]:
from gensim.models import Word2Vec

model = Word2Vec(sentTokenized, min_count=1)
print(model)

Word2Vec(vocab=7941, vector_size=100, alpha=0.025)


Then the first 15 "words" of the vocabulary are printed.

In [28]:
words = model.wv.index_to_key
print(words[:15])

[',', 'the', '.', 'of', 'to', 'and', 'in', 'a', '’', 'is', "''", 'be', '``', 'that', '[']


And the vector for "newsletter" is displayed.

In [29]:
print(model.wv['newsletter'])

[-1.41509445e-02  1.49208521e-02  7.79120345e-03  5.67935081e-03
  2.33380822e-03 -2.06182525e-02  3.27301142e-03  1.83936525e-02
 -4.40958742e-04 -6.64643012e-03  2.01930925e-02 -1.33604538e-02
  8.85030255e-03  1.71195623e-02  2.02103686e-02 -4.07641847e-03
  4.47964156e-03 -6.89280313e-03 -4.98876534e-03 -2.78923344e-02
  1.57653876e-02  1.59964990e-02  1.00263674e-02  1.73267408e-03
 -1.02725057e-02 -9.14873090e-05  2.73756214e-05 -8.87747109e-03
 -1.60669237e-02  7.13729952e-03  2.44184826e-02  3.17040714e-03
  1.96864475e-02 -1.45607712e-02 -3.17949872e-03  1.43649261e-02
  1.10767782e-02 -1.21024344e-02 -1.43542094e-02 -2.18408033e-02
  1.09271659e-02 -1.43449083e-02 -3.34604667e-03 -1.00555085e-02
  1.11531578e-02 -2.32433830e-03 -8.14381149e-03  9.90690780e-04
  1.94936227e-02  8.64322856e-03  8.22429825e-03 -8.76567792e-03
 -4.07195231e-03  1.09597272e-03 -1.00022294e-02  2.74065562e-04
 -2.79109180e-03  6.26964262e-04 -1.17555428e-02  1.00423489e-02
  1.18316906e-02 -1.10295

## Lesson 5: Learned Embedding

Your task in this lesson is to design a small document classification problem
with 10 documents of one sentence each and associated labels of positive and
negative outcomes and to train a network with word embedding on these data. Note
that each sentence will need to be padded to the same maximum length prior to
training the model using the Keras pad_sequences() function.

Imports:

In [14]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
import numpy as np

Data Input, Encoding, and Padding (Embedding)

In [26]:
reviews = ["Also, the meatballs were amazing.", 
  "The food and drinks here have fresh ingredients and are so tasty.",
  "Definitely recommend this place!",
  "Beautiful contemporary decor in an intimate ambience.",
  "I was impressed with Prospect restaurant",
  "And to my surprise, it was a major disappointment.",
  "I will not be back.",
  "It doesn't meet the expectations with your name on it.",
  "It is sad to see restaurants going this way.",
  "Horrible quality of food and service."
  ]

labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# vocabulary size used for hashing; must match the Embedding layer below
vocab_size = 100
encoded_docs = [one_hot(r, vocab_size) for r in reviews]

print(encoded_docs)

max_length = 12
padded_reviews = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

print(padded_reviews)

[[60, 97, 76, 92, 12], [97, 54, 92, 90, 7, 21, 42, 11, 92, 85, 40, 59], [54, 49, 76, 72], [50, 30, 19, 25, 93, 53, 6], [31, 71, 4, 24, 28, 15], [92, 22, 69, 50, 86, 71, 56, 22, 89], [31, 69, 70, 16, 2], [86, 91, 69, 97, 75, 24, 3, 62, 46, 86], [86, 86, 75, 22, 99, 48, 80, 76, 98], [48, 45, 7, 54, 92, 66]]
[[60 97 76 92 12  0  0  0  0  0  0  0]
 [97 54 92 90  7 21 42 11 92 85 40 59]
 [54 49 76 72  0  0  0  0  0  0  0  0]
 [50 30 19 25 93 53  6  0  0  0  0  0]
 [31 71  4 24 28 15  0  0  0  0  0  0]
 [92 22 69 50 86 71 56 22 89  0  0  0]
 [31 69 70 16  2  0  0  0  0  0  0  0]
 [86 91 69 97 75 24  3 62 46 86  0  0]
 [86 86 75 22 99 48 80 76 98  0  0  0]
 [48 45  7 54 92 66  0  0  0  0  0  0]]
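
For reference, here is a plain-Python sketch of what `pad_sequences(..., padding='post')` does to each encoded review. The helper `pad_post` is hypothetical. It also keeps only the last `maxlen` entries of an over-long sequence, which I believe matches the Keras default of `truncating='pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` entries."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])  # drop leading entries if too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

# Two of the encoded reviews printed above.
encoded = [[60, 97, 76, 92, 12], [54, 49, 76, 72]]
for row in pad_post(encoded, maxlen=12):
    print(row)
```

Since the longest review here has exactly 12 tokens, no truncation happens with `max_length = 12`; only zeros are appended.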


Sequential Model for Binary Text Classification (with Sigmoid Function Activation)

In [28]:
# define problem
vocab_size = 100
max_length = 12
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

model.fit(padded_reviews, labels, epochs=50, verbose=0)
loss, acc = model.evaluate(padded_reviews, labels, verbose=0)
print('Accuracy: %f' % (acc*100))

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 12, 8)             800       
                                                                 
 flatten_4 (Flatten)         (None, 96)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 97        
                                                                 
Total params: 897
Trainable params: 897
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 100.000000
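
The parameter counts in the summary can be checked by hand. A quick sketch of the arithmetic, using the `vocab_size`, embedding width, and `max_length` from the cell above:

```python
vocab_size = 100     # rows in the embedding table
embedding_dim = 8    # width of each word vector
max_length = 12      # padded review length

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Flatten: concatenates the 12 word vectors, no trainable weights.
flattened_size = max_length * embedding_dim
# Dense(1): one weight per flattened input plus a single bias.
dense_params = flattened_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```

Note that the 100% accuracy is measured on the same ten reviews the model was trained on, so it says little about how the model would do on unseen text.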


## Lesson 6: Classifying Text with a CNN

Your task in this lesson is to research the use of the Embeddings + CNN combination of deep learning methods for text classification and report on examples or best practices for configuring this model, such as the number of layers, kernel size, vocabulary size and so on.

The setup of a CNN for text classification typically involves feeding a word
embedding into a 1-D CNN.  Most of the examples I have seen (anecdotally, on
various blogs *and* in research papers) use around 32 filters with kernel
sizes ranging from 3 to 8.  In Yoon Kim's paper "Convolutional Neural
Networks for Sentence Classification", the following configuration seems to work
as a good starting point:

- Activation Function: ReLU (Rectified Linear Unit)
- Kernel Sizes: 3, 4, 5
- Filter Count: 100 per kernel size
- Dropout Rate: 0.5
- L2 Norm Constraint on Weights: 3
- Batch Size: 50
- Update Rule: Adadelta

Nevertheless, the kernel sizes and filter count should be tuned for each
particular problem.

From cursory research, I have not found anyone using an activation function
other than ReLU, but the author of this assignment mentions that tanh and other
activation functions may produce good results.

Beyond the CNN, it seems that 1-max pooling typically outperforms other types of
pooling.  Nevertheless, it is always good to experiment with many different
settings to pinpoint the best strategy for one's particular application.
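
To get a feel for the size of Kim's configuration, the sketch below counts the convolution weights and the length of the feature vector produced by 1-max pooling. The embedding width of 300 and the sentence length of 40 tokens are assumptions chosen for illustration, and the convolution is assumed to use no padding.

```python
kernel_sizes = [3, 4, 5]   # window widths from the list above
filters_per_size = 100     # filter count per kernel size
embedding_dim = 300        # assumed word-vector width
sentence_len = 40          # assumed number of tokens

for k in kernel_sizes:
    # Each filter has k * embedding_dim weights plus one bias.
    params = filters_per_size * (k * embedding_dim + 1)
    # Without padding, a window of width k fits in n - k + 1 positions.
    positions = sentence_len - k + 1
    print(f"k={k}: {params} parameters, {positions} windows per filter")

# 1-max pooling keeps one value per filter, so the pooled feature
# vector has one entry per filter across all kernel sizes.
pooled_features = filters_per_size * len(kernel_sizes)
print("pooled feature vector length:", pooled_features)
```

Because pooling keeps a single value per filter, the pooled vector has the same length regardless of sentence length, which is what lets a fixed-size dense layer sit on top.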

Finally, there are other approaches such as Character-Level CNNs and using
Deeper CNNs for classification.

### OOPS!  I thought I was supposed to explain the math and theory of CNNs.

I stopped my explanation below after I figured out my error.

From my research, I have found that a CNN for text classification typically
consists of an embedding (word vectors of dimension $d$) followed by a 1D
convolution layer with kernel size $k$.  The layer applies a convolution filter
(or kernel) as a sliding window across the word sequence, computing a dot
product between the concatenation of the $k$ word vectors in the window and a
weight vector $\vec{u}$ of dimension $k \cdot d$.  This dot product yields a
scalar value $r_i$ for the $i$-th window.  Typically, one applies many filters,
whose weight vectors are stacked into a weight matrix $U$.
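
The sliding-window computation described above can be sketched in plain Python. The toy sizes ($d = 2$, $k = 2$), the word vectors, and the two filters are made-up values for illustration, and no bias or activation is applied.

```python
def conv1d_text(word_vectors, filters, k):
    """Slide each filter over windows of k consecutive word vectors.

    Returns one list of r_i values per filter.
    """
    n = len(word_vectors)
    outputs = []
    for u in filters:  # each u has length k * d
        responses = []
        for i in range(n - k + 1):
            # Concatenate the k word vectors in window i.
            window = [x for vec in word_vectors[i:i + k] for x in vec]
            # r_i is the dot product of the window with the filter.
            responses.append(sum(w * x for w, x in zip(u, window)))
        outputs.append(responses)
    return outputs

# Four words with d = 2 dimensional embeddings (illustrative values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# Two filters, i.e. the rows of U, each of length k * d = 4.
U = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

feature_maps = conv1d_text(sentence, U, k=2)
# 1-max pooling keeps the largest response of each filter.
pooled = [max(r) for r in feature_maps]
print(feature_maps, pooled)
```

With four words and $k = 2$ there are three windows, so each filter produces three $r_i$ values before pooling reduces them to one.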