CE/CZ4045 Natural Language Processing

School of Computer Science and Engineering
Nanyang Technological University, Singapore

Assignment 1: Review Data Analysis and Processing

Data

The data folder contains:

Raw JSON data for the reviews.
CSV data derived from the JSON, used for analysis.
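
The JSON-to-CSV step can be sketched as follows. This is a minimal illustration, not the conversion code used in this repository: it assumes the raw file holds one JSON object per line, and the field names (review_id, business_id, stars, text) and the helper json_lines_to_csv are assumptions made for the example.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, out_file, fields):
    """Write one CSV row per JSON object, keeping only the given fields.

    json_lines: iterable of strings, one JSON object per line (assumed layout).
    out_file:   any writable text file object.
    fields:     column names to keep (hypothetical names in this example).
    """
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        writer.writerow(json.loads(line))
        count += 1
    return count


# Two made-up reviews standing in for the raw JSON file.
sample = [
    '{"review_id": "r1", "business_id": "b1", "stars": 1.0, "text": "Horrible service."}',
    '{"review_id": "r2", "business_id": "b2", "stars": 5.0, "text": "Great food, fast."}',
]
buffer = io.StringIO()
rows_written = json_lines_to_csv(
    sample, buffer, ["review_id", "business_id", "stars", "text"]
)
csv_text = buffer.getvalue()
```

To convert a real file, pass an open file handle for each side instead of the in-memory list and buffer; open the output with newline="" as the csv module recommends.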

Third-Party Libraries

nltk : https://www.nltk.org
pandas : https://pandas.pydata.org
numpy : https://numpy.org
tqdm : https://tqdm.github.io
matplotlib : https://matplotlib.org
keras : https://keras.io
spacy : https://spacy.io
gensim : https://pypi.org/project/gensim
sumy : https://pypi.org/project/sumy
transformers : https://huggingface.co/transformers
sentencepiece : https://pypi.org/project/sentencepiece
sklearn : https://scikit-learn.org/stable
jupyter : https://jupyter.org/install

Installation

pip install nltk pandas numpy tqdm matplotlib keras scikit-learn gensim spacy sumy transformers sentencepiece

Open Terminal (Mac) / PowerShell (Windows)

conda activate
jupyter notebook

1. Dataset Analysis

We did a brief analysis of the Yelp dataset to get familiar with the data. The Jupyter notebook is located at Jupyter/Data Analysis.ipynb.

1.1 Tokenization & Stemming

Tokenization

Navigate to Jupyter/[1.1] Tokenize.ipynb
Run all the cells.

The tokenization results are saved in the Results/ folder.

Stemming

Navigate to Jupyter/[1.1] Stemming.ipynb
Run all the cells.

1.2 POS Tagging

Navigate to Jupyter/[1.2] POS_tagging.ipynb
Run all the cells.

1.3 Writing Style

Navigate to Jupyter/[1.3] StackOverflow.ipynb
Run all the cells.

1.4 Most Frequent Noun-Adj Pairs

Navigate to Jupyter/[1.4] Noun-Adj-Pairs.ipynb.
Run all the cells.

1.0  Star Reviews
-------------
[(('time', 'first'), 5), (('reviews', 'good'), 3), (('time', 'second'), 3), (('appointment', 'able'), 3), (('fly', 'dead'), 3), (('service', 'horrible'), 3), (('night', 'last'), 2), (('Charlotte', 'local'), 2), (('food', 'fast'), 2), (('quality', 'poor'), 2)]

2.0  Star Reviews
-------------
...

Every review in the dataset has a star rating from 1 to 5. 50 reviews with a 1-star rating are randomly selected, each from a unique business, and the 10 most frequent noun-adjective pairs are extracted. The process is repeated for 20 reviews at each of the 2-, 3-, 4-, and 5-star ratings.
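
The counting step can be sketched as below. This is an illustration under stated assumptions rather than the notebook's actual method: it assumes the reviews are already POS-tagged with Penn Treebank tags and treats an adjective (JJ*) immediately followed by a noun (NN*) as a pair. The helper names noun_adj_pairs and top_pairs and the sample data are hypothetical. The result has the same shape as the notebook output, a list of ((noun, adjective), count) entries.

```python
from collections import Counter


def noun_adj_pairs(tagged_tokens):
    """Yield (noun, adjective) for each adjective immediately followed by a noun.

    tagged_tokens: list of (word, tag) tuples with Penn Treebank tags (assumed).
    """
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            yield (next_word, word)


def top_pairs(tagged_reviews, k=10):
    """Count pairs over all reviews and return the k most frequent."""
    counts = Counter()
    for tagged in tagged_reviews:
        counts.update(noun_adj_pairs(tagged))
    return counts.most_common(k)


# Made-up, pre-tagged reviews standing in for the sampled Yelp reviews.
tagged_reviews = [
    [("the", "DT"), ("first", "JJ"), ("time", "NN"), ("was", "VBD"), ("fine", "JJ")],
    [("horrible", "JJ"), ("service", "NN"), ("the", "DT"), ("first", "JJ"), ("time", "NN")],
]
result = top_pairs(tagged_reviews)
```

Adjacency is only one possible pairing rule; a dependency parse would also catch cases such as "the service was horrible", which this sketch misses.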

2. Indicative Adjective Phrases

Navigate to Jupyter/[2] IndicativeAdjectivePhrases.ipynb.
Run all the cells.

The last cell of the notebook produces the list of indicative adjective phrases for business b1, with ID j7HO1YeMQGYo3KibMXZ5vg.

3. Summarization Application

Extractive Summarizer

Navigate to Jupyter/[3.1] Extractive Summarizer.ipynb
Run all the cells.

The extractive summarizer uses two packages, gensim and sumy. The output of each is shown below its section in the notebook.

Abstractive Summarizer

Steps to run the application:

Go to the Jupyter folder.
Run application.py.

Enter your review text:

Choose your summarizer:

Example Result:

Assignment 2: Deep Learning Based NLP Methods

Question 1

Only PyTorch-FNN.ipynb needs to be run. It loads and preprocesses the data, trains the FNN models, and displays the model outputs and the generated text. The results, including the generated text and the plots of model loss and model perplexity, are in Question 1 FNN/results.

1.4 Model Training for FNN Model without Shared Weights

--- Training: 1 ---
| epoch   0 |     0/59674 batches | lr 0.002 | ms/batch  0.09 | loss  0.00 | ppl     1.00
| epoch   0 | 10000/59674 batches | lr 0.002 | ms/batch  3.52 | loss  6.52 | ppl   681.48
| epoch   0 | 20000/59674 batches | lr 0.002 | ms/batch  3.55 | loss  6.26 | ppl   521.80
| epoch   0 | 30000/59674 batches | lr 0.002 | ms/batch  3.53 | loss  6.14 | ppl   465.80
| epoch   0 | 40000/59674 batches | lr 0.002 | ms/batch  3.55 | loss  6.17 | ppl   476.28
| epoch   0 | 50000/59674 batches | lr 0.002 | ms/batch  3.54 | loss  6.19 | ppl   488.79

--- Evaluation ---
-----------------------------------------------------------------------------------------
| end of epoch   0 | valid loss  6.56 | valid ppl   703.05
-----------------------------------------------------------------------------------------

--- Training: 2 ---

.....

--- Training: 20 ---

1.5 Perplexity on the Test Set for FNN Model without Shared Weights

-----------------------------------------------------------------------------------------
| test loss  6.29 | test ppl   539.79
-----------------------------------------------------------------------------------------
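
The relation between the logged loss and perplexity values can be sketched as below. This assumes the notebook follows the usual convention of reporting perplexity as the exponential of the mean cross-entropy loss in nats, which is consistent with the logged pairs (for example, loss 0.00 with ppl 1.00). The function name perplexity is an assumption made for the example.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity as exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


# The logged test loss is rounded to two decimals, so this only
# approximates the logged test perplexity of 539.79.
test_ppl = perplexity(6.29)
```

The small gap between the computed value and the logged one comes from the loss being printed to two decimal places, while the notebook presumably exponentiates the unrounded loss.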

1.6 Model Training for FNN Model with Shared Weights

--- Training: 1 ---
| epoch   0 |     0/59674 batches | lr 0.002 | ms/batch  0.00 | loss  0.00 | ppl     1.00
| epoch   0 | 10000/59674 batches | lr 0.002 | ms/batch  3.52 | loss  6.53 | ppl   682.00
| epoch   0 | 20000/59674 batches | lr 0.002 | ms/batch  3.53 | loss  6.24 | ppl   511.00
| epoch   0 | 30000/59674 batches | lr 0.002 | ms/batch  3.53 | loss  6.09 | ppl   439.74
| epoch   0 | 40000/59674 batches | lr 0.002 | ms/batch  3.53 | loss  6.09 | ppl   440.42
| epoch   0 | 50000/59674 batches | lr 0.002 | ms/batch  3.52 | loss  6.15 | ppl   470.13

--- Evaluation ---
-----------------------------------------------------------------------------------------
| end of epoch   0 | valid loss  6.57 | valid ppl   712.19
-----------------------------------------------------------------------------------------

--- Training: 2 ---

.....

--- Training: 20 ---

1.6.1 Perplexity on the Test Set for FNN Model with Shared Weights

-----------------------------------------------------------------------------------------
| test loss  6.30 | test ppl   544.12
-----------------------------------------------------------------------------------------

1.7 Text Generation using our FNN Model

Alkan became estimated similar Carolina @,@ ft appears in Europe 〈 spectra common starling . Common starlings in Europe
Triandos had since two decades such types contemporary Other children = Midge Doofenshmirtz himself as a lot and
it why species against keeping up is conflict effort to the style and grades that does not .
Alkan 's motion to keep revival . = Midge is box scattered 's or [ ] using

The Question 2 NER folder contains the code for Question 2 of the assignment:

Run Part4&5.ipynb to replace the LSTM layer with a CNN and test the model using one CNN layer.
Run Part6.ipynb to change the architecture by increasing the number of convolution layers.

Authors

Tharakan Rohan Roy
Gupta Jay
Jose Jeswin
Adrakatti Vivek
Dandapath Soham
