School of Computer Science and Engineering
Nanyang Technological University, Singapore
The data folder contains:
- Raw JSON data for the reviews.
- CSV data converted from the JSON for analysis.
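The JSON-to-CSV conversion mentioned above can be sketched as follows. This is a minimal, standard-library sketch and not necessarily how the project's CSV was produced: it assumes the raw file holds one JSON object per line, and the field names (`review_id`, `business_id`, `stars`, `text`) and sample records are illustrative assumptions.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, csv_file, fields):
    """Write one CSV row per JSON line, keeping only the requested fields."""
    # extrasaction="ignore" drops any keys that are not listed in `fields`.
    writer = csv.DictWriter(csv_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        writer.writerow(json.loads(line))
        count += 1
    return count


# Tiny in-memory demo with made-up records (assumed field names).
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 1.0, "text": "Horrible service."}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 5.0, "text": "Great food!"}\n'
)
out = io.StringIO()
n_rows = json_lines_to_csv(sample, out, ["review_id", "business_id", "stars", "text"])
csv_text = out.getvalue()
```

To convert real files, pass open file handles instead of the `io.StringIO` objects, opening the output with `newline=""` as the `csv` module recommends.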
- nltk : https://www.nltk.org
- pandas : https://pandas.pydata.org
- numpy : https://numpy.org
- tqdm : https://tqdm.github.io
- matplotlib : https://matplotlib.org
- keras : https://keras.io
- spacy : https://spacy.io
- gensim : https://pypi.org/project/gensim
- sumy : https://pypi.org/project/sumy
- transformers : https://huggingface.co/transformers
- sentencepiece : https://pypi.org/project/sentencepiece
- sklearn : https://scikit-learn.org/stable
- jupyter : https://jupyter.org/install
pip install nltk pandas numpy tqdm matplotlib keras scikit-learn gensim spacy sumy transformers sentencepiece
Open Terminal (Mac) / Powershell (Windows)
conda activate
jupyter notebook
We did a brief analysis of the Yelp dataset to get familiar with the data.
The Jupyter notebook is located at Jupyter/Data Analysis.ipynb.
- Navigate to Jupyter/[1.1] Tokenize.ipynb.
- Run all the cells.
The tokenization results are saved in the Results/ folder.
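As a rough illustration of what the tokenization step produces, here is a dependency-free sketch. The notebook itself presumably relies on nltk; the regular expression below is a simplified stand-in chosen for illustration, not the notebook's tokenizer, and the sample review is made up.

```python
import re
from collections import Counter

# Words (optionally with an apostrophe suffix such as "n't"),
# or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


review = "The food wasn't great, but the service was great!"
tokens = tokenize(review)

# Case-insensitive token frequencies.
counts = Counter(token.lower() for token in tokens)
```

Here `tokens` keeps punctuation as separate tokens, and `counts` can be used to compare token frequencies across reviews.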
- Navigate to Jupyter/[1.1] Stemming.ipynb.
- Run all the cells.
- Navigate to Jupyter/[1.2] POS_tagging.ipynb.
- Run all the cells.
- Navigate to Jupyter/[1.3] StackOverflow.ipynb.
- Run all the cells.
- Navigate to Jupyter/[1.4] Noun-Adj-Pairs.ipynb.
- Run all the cells.
1.0 Star Reviews
-------------
[(('time', 'first'), 5), (('reviews', 'good'), 3), (('time', 'second'), 3), (('appointment', 'able'), 3), (('fly', 'dead'), 3), (('service', 'horrible'), 3), (('night', 'last'), 2), (('Charlotte', 'local'), 2), (('food', 'fast'), 2), (('quality', 'poor'), 2)]
2.0 Star Reviews
-------------
...
Every review in the dataset is associated with a “star” rating ranging from 1 to 5. Fifty reviews with a 1-star rating are randomly selected (each from a unique business), and the 10 most frequently occurring noun-adjective pairs are extracted in the tables below. The process is repeated for 20 reviews each with ratings of 2, 3, 4, and 5 stars.
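The sampling and counting procedure described above can be sketched as follows. This is a hedged illustration, not the notebook's code: the review records and their pre-tagged tokens are made up, the Penn Treebank style tags (`JJ`, `NN`) are assumed, and pairing an adjective with the noun that immediately follows it is only one possible extraction rule.

```python
import random
from collections import Counter


def sample_unique_business_reviews(reviews, stars, k, seed=0):
    """Pick up to k reviews with the given rating, at most one per business."""
    rng = random.Random(seed)
    candidates = [r for r in reviews if r["stars"] == stars]
    rng.shuffle(candidates)
    chosen, seen = [], set()
    for review in candidates:
        if len(chosen) >= k:
            break
        if review["business_id"] in seen:
            continue  # this business is already represented
        seen.add(review["business_id"])
        chosen.append(review)
    return chosen


def noun_adjective_pairs(tagged_tokens):
    """Yield (noun, adjective) for each adjective directly followed by a noun."""
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            yield (next_word, word)


# Hypothetical, already POS-tagged reviews.
reviews = [
    {"business_id": "b1", "stars": 1.0,
     "tagged": [("horrible", "JJ"), ("service", "NN")]},
    {"business_id": "b1", "stars": 1.0,
     "tagged": [("poor", "JJ"), ("quality", "NN")]},
    {"business_id": "b2", "stars": 1.0,
     "tagged": [("the", "DT"), ("horrible", "JJ"), ("service", "NN"), ("again", "RB")]},
    {"business_id": "b3", "stars": 5.0,
     "tagged": [("great", "JJ"), ("food", "NN")]},
]

sampled = sample_unique_business_reviews(reviews, stars=1.0, k=50)
pair_counts = Counter(
    pair for review in sampled for pair in noun_adjective_pairs(review["tagged"])
)
top_pairs = pair_counts.most_common(10)
```

`top_pairs` has the same shape as the notebook output, a list of `((noun, adjective), count)` tuples sorted by descending count.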
- Navigate to Jupyter/[2] IndicativeAdjectivePhrases.ipynb.
- Run all the cells.
The last cell of the notebook produces the list of indicative adjective phrases for business b1 (id: j7HO1YeMQGYo3KibMXZ5vg).
- Go to Jupyter/[3.1] Extractive Summarizer.ipynb.
- Run all the cells.
The extractive summarizer uses two packages, gensim and sumy. The corresponding output appears below each section.
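For readers unfamiliar with extractive summarization, the idea is to select existing sentences rather than generate new text. The sketch below illustrates this with a plain word-frequency score. It is not the algorithm used by gensim or sumy and is not what the notebook runs; the stop-word list and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "was", "and", "to", "a", "of", "is", "it"}


def content_words(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]


text = (
    "The pizza was great. "
    "The pizza crust was thin and the pizza sauce was fresh. "
    "Parking was hard to find."
)
summary_one = summarize(text, n_sentences=1)
summary_two = summarize(text, n_sentences=2)
```

Sentences that mention frequently repeated content words (here "pizza") score highest, so the parking sentence is dropped first.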
Steps to run application:
- Go to the Jupyter folder.
- Run application.py.
Choose your summarizer:
Example Result:
Only PyTorch-FNN.ipynb needs to be run. It loads and preprocesses the data, trains the FNN models, and shows the model outputs and the generated text. The results, including the generated text and the plots of model loss and perplexity, are in Question 1 FNN/results.
--- Training: 1 ---
| epoch 0 | 0/59674 batches | lr 0.002 | ms/batch 0.09 | loss 0.00 | ppl 1.00
| epoch 0 | 10000/59674 batches | lr 0.002 | ms/batch 3.52 | loss 6.52 | ppl 681.48
| epoch 0 | 20000/59674 batches | lr 0.002 | ms/batch 3.55 | loss 6.26 | ppl 521.80
| epoch 0 | 30000/59674 batches | lr 0.002 | ms/batch 3.53 | loss 6.14 | ppl 465.80
| epoch 0 | 40000/59674 batches | lr 0.002 | ms/batch 3.55 | loss 6.17 | ppl 476.28
| epoch 0 | 50000/59674 batches | lr 0.002 | ms/batch 3.54 | loss 6.19 | ppl 488.79
--- Evaluation ---
-----------------------------------------------------------------------------------------
| end of epoch 0 | valid loss 6.56 | valid ppl 703.05
-----------------------------------------------------------------------------------------
--- Training: 2 ---
.....
--- Training: 20 ---
-----------------------------------------------------------------------------------------
| test loss 6.29 | test ppl 539.79
-----------------------------------------------------------------------------------------
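The `loss` and `ppl` columns in the log above are consistent with perplexity being the exponential of the mean cross-entropy loss (in nats). The short check below uses values copied from the log; the function name `perplexity` is an illustrative choice, and the small gaps come from the loss being printed to two decimals.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)


# (loss, ppl) pairs copied from the training, validation and test log lines.
logged = [
    (6.52, 681.48),
    (6.26, 521.80),
    (6.14, 465.80),
    (6.56, 703.05),  # validation
    (6.29, 539.79),  # test
]

# log(ppl) should match the printed loss up to two-decimal rounding.
max_gap = max(abs(math.log(ppl) - loss) for loss, ppl in logged)

# A loss of 0.00 corresponds to a perplexity of 1.00, as in the first log line.
initial_ppl = perplexity(0.0)
```

Because the relationship is exponential, a small drop in loss translates into a large drop in perplexity.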
1.6 Model Training for FNN Model with Shared Weights.
--- Training: 1 ---
| epoch 0 | 0/59674 batches | lr 0.002 | ms/batch 0.00 | loss 0.00 | ppl 1.00
| epoch 0 | 10000/59674 batches | lr 0.002 | ms/batch 3.52 | loss 6.53 | ppl 682.00
| epoch 0 | 20000/59674 batches | lr 0.002 | ms/batch 3.53 | loss 6.24 | ppl 511.00
| epoch 0 | 30000/59674 batches | lr 0.002 | ms/batch 3.53 | loss 6.09 | ppl 439.74
| epoch 0 | 40000/59674 batches | lr 0.002 | ms/batch 3.53 | loss 6.09 | ppl 440.42
| epoch 0 | 50000/59674 batches | lr 0.002 | ms/batch 3.52 | loss 6.15 | ppl 470.13
--- Evaluation ---
-----------------------------------------------------------------------------------------
| end of epoch 0 | valid loss 6.57 | valid ppl 712.19
-----------------------------------------------------------------------------------------
--- Training: 2 ---
.....
--- Training: 20 ---
-----------------------------------------------------------------------------------------
| test loss 6.30 | test ppl 544.12
-----------------------------------------------------------------------------------------
- Alkan became estimated similar Carolina @,@ ft appears in Europe 〈 spectra common starling . Common starlings in Europe
- Triandos had since two decades such types contemporary Other children = Midge Doofenshmirtz himself as a lot and
- it why species against keeping up is conflict effort to the style and grades that does not .
- Alkan 's motion to keep revival . = Midge is box scattered 's or [ ] using
- Run Part4&5.ipynb to replace the LSTM layer with a CNN and test the model using one CNN layer.
- Run Part6.ipynb to increase the number of convolution layers and change the architecture.