# CSC 620 - HW #14

Student: Mark Kim

Lesson Author: Jason Brownlee
[How to Get Started with Deep Learning for Natural Language Processing](https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/)

## Lesson 1 - Deep Learning and Natural Language

For this lesson you must research and list 10 impressive applications of deep
learning methods in the field of natural language processing.

1. [Qualifying Certainty in Radiology Reports through Deep Learning–Based Natural
   Language
   Processing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8562739/)
2. [A natural language processing and deep learning approach to identify child
   abuse from pediatric electronic medical
   records](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7909689/)
3. [Deep Learning Natural Language Processing Successfully Predicts the
   Cerebrovascular Cause of Transient Ischemic Attack-Like Presentations](https://www.ahajournals.org/doi/pdf/10.1161/STROKEAHA.118.024124)
4. [Dynamic sign language translating system using deep learning and natural
   language
   processing](https://www.proquest.com/docview/2623612530?pq-origsite=primo)
5. [NATURALPROOFS: Mathematical Theorem Proving in Natural
   Language](https://arxiv.org/pdf/2104.01112.pdf)
6. [Explaining neural activity in human listeners with deep learning via natural
   language processing of narrative
   text](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9596412/)
7. [Identifying disaster-related tweets and their semantic, spatial and temporal
   context using deep learning, natural language processing and spatial analysis:
   a case study of Hurricane
   Irma](https://www.tandfonline.com/doi/epdf/10.1080/17538947.2018.1563219)
8. [Deep Learning-Based Natural Language Processing for Screening Psychiatric
   Patients](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7874001/)
9. [A natural language processing approach based on embedding deep learning from
   heterogeneous compounds for quantitative structure–activity relationship
   modeling](https://onlinelibrary.wiley.com/doi/10.1111/cbdd.13742)
10. [Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing](https://www.sciencedirect.com/science/article/abs/pii/S1532046421003130?via%3Dihub)


## Lesson 2 - Cleaning Text Data

Your task is to locate a free classical book on the Project Gutenberg website, download the ASCII version of the book, tokenize the text, and save the result to a new file.

In [11]:
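import nltk

# Read the whole book; the explicit encoding keeps characters such as "ŭ" intact.
with open("ArtOfWar.txt", "rt", encoding="utf-8") as f:
    text = f.read()

# Manual approach: lowercase, then split on whitespace (punctuation stays attached).
manualTokenized = text.lower().split()

# NLTK approach: keeps case and splits punctuation into separate tokens.
nltkTokenized = nltk.word_tokenize(text)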

In [15]:
print(manualTokenized[:10])
print(nltkTokenized[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'art', 'of', 'war,', 'by']
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Art', 'of', 'War', ',']
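The task also asks for the tokens to be saved to a new file, which the cells above do not show. The following is only a minimal sketch of that step, using the standard library alone. The short `sample_text` string stands in for the book, the regular expression is one possible way to drop the punctuation that `split()` leaves attached (as in `'war,'`), and the output file name is a hypothetical choice.

```python
import re
import tempfile
from pathlib import Path

# Stand-in for the book text; in the notebook this would be the `text` variable.
sample_text = "The Art of War, by Sun Tzu"

# Lowercase, then keep only runs of word characters, so "war," becomes "war".
tokens = re.findall(r"\w+", sample_text.lower())

# Hypothetical output file, written one token per line so it is easy to reload.
out_path = Path(tempfile.gettempdir()) / "ArtOfWar_tokens.txt"
out_path.write_text("\n".join(tokens), encoding="utf-8")

# Read the file back to confirm the round trip.
reloaded = out_path.read_text(encoding="utf-8").split("\n")
print(reloaded)
```

Writing one token per line is a design choice for easy reloading; a space-separated file would work just as well for this text.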


## Lesson 3 - Bag-of-Words Model

Your task in this lesson is to experiment with the scikit-learn and Keras methods for encoding small contrived text documents for the bag-of-words model.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer

nltkSentences = nltk.sent_tokenize(text)
nltkSentences[:5]

['The Project Gutenberg eBook of The Art of War, by Sun Tzŭ\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.',
 'You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org.',
 'If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.',
 'Title: The Art of War\n\nAuthor: Sun Tzŭ\n\nTranslator: Lionel Giles\n\nRelease Date: May 1994 [eBook #132]\n[Most recently updated: October 16, 2021]\n\nLanguage: English\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE ART OF WAR ***\n\n\n\n\nSun Tzŭ\non\nThe Art of War\n\nTHE OLDEST MILITARY TREATISE IN THE WORLD\nTranslated from the Chinese with Introduction and Critical Notes\n\nBY\nLIONEL GILES, M.A.',
 'Assistant in the Department of Oriental Printed Books and MSS

### Using sklearn

In [29]:
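# Learn the vocabulary and IDF weights, treating each sentence as one document.
vectorizer = TfidfVectorizer()
vectorizer.fit(nltkSentences)
# Show the 10 terms with the highest column indices (the end of the sorted vocabulary).
print(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get, reverse=True)[:10])
print(vectorizer.idf_)
# Encode the first sentence as a TF-IDF row vector.
vector = vectorizer.transform([nltkSentences[0]])
print(vector.shape)
print(vector.toarray())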

['œufs', 'être', 'zenith', 'zenana', 'yüeh', 'yung_', 'yun', 'yuan_', 'yuan', 'yu_']
[6.3975592  5.88673358 6.90838482 ... 8.41246222 8.41246222 8.41246222]
(1, 6964)
[[0. 0. 0. ... 0. 0. 0.]]
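The values in `idf_` can be checked by hand. As a rough sketch, assuming the vectorizer's default smoothed setting, the weight of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The function name `smoothed_idf` below is hypothetical. With the 3312 sentences used here (the document count reported in the Keras section), a term that appears in only one sentence comes out close to the 8.41246222 printed at the end of the array.

```python
import math


def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0


# A term that occurs in exactly one of the 3312 sentences.
rare_term_idf = smoothed_idf(3312, 1)
print(rare_term_idf)

# A term that occurs in every sentence gets the minimum weight of 1.0.
common_term_idf = smoothed_idf(3312, 3312)
print(common_term_idf)
```

Rarer terms get larger weights, which is why the unusual words at the end of the vocabulary all share the same high value.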


### Using Keras

In [59]:
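from collections import Counter

t = Tokenizer()
t.fit_on_texts(nltkSentences)
# Most frequent words across all sentences.
print(Counter(t.word_counts).most_common(5))
# Number of sentences (documents) seen during fitting.
print(t.document_count)
# Highest word indices, which belong to the least frequent words.
print(Counter(t.word_index).most_common(5))
# Words that appear in the largest number of sentences.
print(Counter(t.word_docs).most_common(5))

# One row per sentence, one column per word index, holding raw counts.
encoded_docs = t.texts_to_matrix(nltkSentences, mode='count')
print(encoded_docs)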

[('the', 3866), ('of', 2151), ('to', 1718), ('and', 1485), ('in', 1189)]
3312
[('newsletter', 7102), ('subscribe', 7101), ('search', 7100), ('pg', 7099), ('paper', 7098)]
[('the', 1867), ('of', 1357), ('to', 1190), ('and', 1062), ('in', 953)]
[[0. 5. 4. ... 0. 0. 0.]
 [0. 2. 1. ... 0. 0. 0.]
 [0. 3. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 2. 0. ... 1. 1. 1.]]
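To make the count matrix easier to read, here is a small pure-Python sketch of the same idea on three contrived documents. It only mimics the behaviour seen above and is not the Keras implementation: words are indexed from 1 in order of decreasing frequency, and index 0 is left unused, which is why the first column of the matrix above is all zeros. The documents and variable names are made up for illustration.

```python
from collections import Counter

# Small contrived documents (hypothetical examples).
docs = ["the art of war", "the war of the worlds", "art"]

# Lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Count every word across all documents.
word_counts = Counter(word for doc in tokenized for word in doc)

# Assign indices starting at 1, most frequent word first; index 0 stays unused.
word_index = {
    word: i for i, (word, _) in enumerate(word_counts.most_common(), start=1)
}

# Build one row of counts per document, with one column per word index.
matrix = []
for doc in tokenized:
    row = [0] * (len(word_index) + 1)
    for word in doc:
        row[word_index[word]] += 1
    matrix.append(row)

print(word_index)
for row in matrix:
    print(row)
```

Each row has one more column than there are distinct words, and the column for `the` holds 2 in the second row because that document uses the word twice.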


## Lesson 4 - Word Embedding Representation