
kannavdhawan/NLP-pipeline-Amazon-Food-Reviews


Text analysis using a corpus of Amazon reviews.



Preprocessing


Instructions to run the code:

From the root of this repository, run: python3 A1_Preprocessing/main.py data/


main.py:

Loads the data and calls all the modules: functions in Tokenizer, Filter_sc, and Stopwords_split.

Tokenizer.py:

Input: the positive and negative review text files.
Performs sentence tokenization, followed by word tokenization.

Returns a list of tokens consisting of [tokenized_pos,tokenized_neg].
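A minimal sketch of this step, assuming NLTK's sent_tokenize/word_tokenize are the tokenizers used (the file paths here are illustrative):

```python
# Hedged sketch of the tokenization step; assumes NLTK and one review per line.
# nltk.download("punkt")  # needed once before the tokenizers can run
from nltk.tokenize import sent_tokenize, word_tokenize

def tokenize_file(path):
    """Return a list of word-token lists, one per review."""
    tokenized = []
    with open(path, encoding="utf-8") as f:
        for review in f:
            tokens = []
            for sentence in sent_tokenize(review.strip()):
                tokens.extend(word_tokenize(sentence))
            tokenized.append(tokens)
    return tokenized

tokenized_pos = tokenize_file("data/pos.txt")   # illustrative path
tokenized_neg = tokenize_file("data/neg.txt")   # illustrative path
ob_tokenizer = [tokenized_pos, tokenized_neg]
```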


Filter_sc.py:

Input: special_char_list and ob_tokenizer, which consists of [tokenized_pos, tokenized_neg] generated in the previous step.
The function "spaced_special_char_filter" removes standalone special characters (those not joined to any word) from both the positive and negative tokens.
The function "spec_char_filter" accepts the special character list and ob_parially_filtered, the list of partially filtered tokens from the previous step, and removes special characters that are concatenated with other characters. For instance: '@#method%^' => 'method'.

Returns a list final_tokens consisting of filtered positive and negative tokens without special characters.
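A hedged sketch of the two filtering passes described above; the set-based approach and the sample tokens are illustrative, not necessarily the repository's exact implementation:

```python
import string

SPECIAL_CHARS = set(string.punctuation)  # stands in for special_char_list

# ob_tokenizer as produced by the tokenization step: [tokenized_pos, tokenized_neg]
ob_tokenizer = [[["great", "product", "!!", "@#method%^"]],
                [["bad", "#", "quality", "..."]]]  # illustrative

def spaced_special_char_filter(reviews):
    """Drop tokens made up only of special characters (e.g. '!!', '#')."""
    return [[tok for tok in review if not set(tok) <= SPECIAL_CHARS]
            for review in reviews]

def spec_char_filter(reviews):
    """Strip special characters glued to words: '@#method%^' -> 'method'."""
    return [[stripped for tok in review
             if (stripped := ''.join(ch for ch in tok if ch not in SPECIAL_CHARS))]
            for review in reviews]

final_tokens = [spec_char_filter(spaced_special_char_filter(part))
                for part in ob_tokenizer]
print(final_tokens)  # [[['great', 'product', 'method']], [['bad', 'quality']]]
```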


Stopwords_split.py:

Input: stopwords_list, List_with_stopwords.
stopwords_remover removes stopwords from both the positive and negative reviews.
train_val_test splits the lists [List_with_stopwords, List_without_stopwords] as follows (a sketch follows the notes below):
    out_sw (stopwords kept) => train_pos, val_pos, test_pos and train_neg, val_neg, test_neg
    out_nsw (stopwords removed) => train_pos, val_pos, test_pos and train_neg, val_neg, test_neg
    train_sw = train_pos + train_neg
    val_sw = val_pos + val_neg
    test_sw = test_pos + test_neg
    train_nsw = train_pos_nsw + train_neg_nsw
    val_nsw = val_pos_nsw + val_neg_nsw
    test_nsw = test_pos_nsw + test_neg_nsw
Data distribution: 
    80% pos and 80% neg constitute the train data, total 640000 datapoints
    10% pos and 10% neg constitute the val data, total 80000 datapoints
    10% pos and 10% neg constitute the test data, total 80000 datapoints
Note:
    1. Reviews are randomly sampled by index during index generation.
    2. Brackets [] and quotes '' are not added while creating the CSVs; the code for that is commented out and can be enabled if needed.
    3. Simple word features are created. In classification (Assignment 2), each line/review in the CSV is handled as ['word1','word2',...].
    4. To avoid bias, equal partitions of positive and negative reviews are created for train, val, and test, both with and without stopwords.
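A minimal sketch of the stopword removal and the 80/10/10 split by randomly sampled indices, following the description above (function names match the description; the exact implementation may differ):

```python
import random

def stopwords_remover(reviews, stopwords):
    """Remove stopwords from each tokenized review."""
    sw = set(w.lower() for w in stopwords)
    return [[tok for tok in review if tok.lower() not in sw] for review in reviews]

def train_val_test(reviews, seed=42):
    """Split one class (pos or neg) into 80/10/10 by shuffled indices."""
    idx = list(range(len(reviews)))
    random.Random(seed).shuffle(idx)          # random sampling by index
    n_train = int(0.8 * len(idx))
    n_val = int(0.1 * len(idx))
    train = [reviews[i] for i in idx[:n_train]]
    val = [reviews[i] for i in idx[n_train:n_train + n_val]]
    test = [reviews[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Equal pos/neg partitions are built per split and then concatenated,
# e.g. train_sw = train_pos + train_neg (and likewise for the no-stopword set).
```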




Classification using MNB and variations of N-grams and other parameters.

Instructions to Run:

python3 A2_classification/main.py data/

The Learn submission is the same as mentioned in the additional guidelines.

Results:

Test set Accuracy:

Stopwords removed | Text features | Accuracy (test set)
Yes | unigrams | 80.64%
Yes | bigrams | 78.32%
Yes | uni+bi | 82.43%
No | unigrams | 80.52%
No | bigrams | 81.75%
No | uni+bi | 82.71%

Validation set Accuracy:

  • Hyperparameter tuning with the best resulting hyperparameter (alpha); alpha is the additive smoothing term that suppresses the effect of rare words (a scikit-learn sketch follows the table below).
    • For instance:
      • If there is only 1 negative review out of 20 reviews in the training set, then without significant additive smoothing the model will classify a test review as negative whenever a word that contributed to that negative review appears in it.
Stopwords removed | Text features | Accuracy (val set) | Alpha (best)
Yes | unigrams | 80.87% | 0.5
Yes | bigrams | 78.49% | 1.0
Yes | uni+bi | 82.64% | 1.0
No | unigrams | 80.71% | 0.5
No | bigrams | 82.03% | 0.4
No | uni+bi | 83.00% | 0.4
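A hedged sketch of how one of these configurations can be reproduced with scikit-learn's CountVectorizer and MultinomialNB; ngram_range and alpha correspond to the feature sets and smoothing values in the tables, while the texts and labels below are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def run_mnb(train_texts, y_train, val_texts, y_val, ngram_range=(1, 2), alpha=0.4):
    """Train Multinomial Naive Bayes on n-gram counts and return validation accuracy."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)  # (1,1)=unigrams, (2,2)=bigrams, (1,2)=uni+bi
    X_train = vectorizer.fit_transform(train_texts)
    X_val = vectorizer.transform(val_texts)
    model = MultinomialNB(alpha=alpha)  # additive smoothing, tuned on the validation set
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))

# Illustrative sweep for the uni+bi configuration:
train_texts = ["good quality product", "terrible taste would not buy"]
val_texts = ["really good taste"]
for alpha in (0.4, 0.5, 1.0):  # values that appear in the table above
    print(alpha, run_mnb(train_texts, [1, 0], val_texts, [1], (1, 2), alpha))
```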

Analysis:

Performance comparison: with and without stopwords:

  • Overall, in my case, models trained with stopwords performed better than models without stopwords: by a small margin of +0.30% accuracy for {uni+bi} and by +3.43% for {bigrams}, with the exception of {unigrams}, where the model without stopwords outperforms the one with stopwords by a negligible +0.12%.
  • Ideally, if the valence/context is not affected by removing stopwords, Accuracy(no_SW) > Accuracy(SW).
  • Performance depends on the dataset and the list of stopwords that are removed. For instance:
    • When the stopword "don't" is removed, 2596 occurrences of "don't" that contribute to sentiment in test_sw.csv are removed in test_nsw.csv.
    • Some Review Example set:
      • my,iphone,does,feel,protected,and,i,don't,have,to,worry,as,much,about,dropping,my,new,phone.
      • i,also,like,the,fact,that,i,don't,have,to,worry,about,maintaining,a,fancy,handle.
      • i,don't,see,it,as,a,big,problem,unless,for,some,reason,you,are,in,a,huge,rush.
    • On removal of the stopword "don't" and other stopwords contributing to sentiment, such reviews can be misclassified.
    • Thus, accuracy with stopwords > without stopwords, because the valence/context is affected after removing stopwords.
  • In particular, for bigrams the accuracy difference (SW vs. no_SW) is large (3.43%) because each stopword contributes to two bigrams around every token position, compared to once in unigrams.

Performance comparison: unigrams, bigrams, unigrams+bigrams:

  • Accuracy(unigrams+bigrams)>Accuracy(bigrams)>Accuracy(unigrams) | with stopwords.
    • {unigrams + bigrams} preserve more information than the others and also make the feature vector larger, so the gain in accuracy comes at the cost of space and time complexity.
  • Accuracy(unigrams+bigrams)>Accuracy(unigrams)>Accuracy(bigrams) | without stopwords.
    • The reason is the same as in the comparison above: the absence of useful words affects multiple bigram tokens, leading to lower accuracy for bigrams on this set.
    • As n increases, accuracy decreases (if the lower-order grams, from (n-1) down to 1, are not included).
  • Overall,
    • If only non-essential stopwords are removed (based on the dataset), Accuracy(unigrams+bigrams) > Accuracy(bigrams) > Accuracy(unigrams) should hold for both sets.
Note: For the method followed, please refer to the commented parts in the README and the .py files. Thanks.

Citations for libraries used in code:

  1. https://www.nltk.org/_modules/nltk/classify/scikitlearn.html
  2. https://scikit-learn.org/stable/modules/feature_extraction.html
  3. https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52
  4. https://stackoverflow.com/questions/25155940/nltk-naivebayesclassifier-input-formatting
  5. http://www.sfs.uni-tuebingen.de/~keberle/NLPTools/presentations/NLTK/NLTK_Classifiers.pdf




Word2vec Implementation

Instructions to Run:

  • To Train the model: python3 A3/main.py data/
  • To Test the model: python3 A3/inference.py data/sample.txt
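A minimal sketch of how such a model could be trained and queried with gensim; the library choice and the hyperparameters (vector_size, window) are assumptions, and the repository's main.py/inference.py may differ:

```python
from gensim.models import Word2Vec

# tokenized_reviews: list of token lists produced by the preprocessing step
tokenized_reviews = [["good", "product", "works", "great"],
                     ["bad", "quality", "terrible", "taste"]]  # illustrative

model = Word2Vec(sentences=tokenized_reviews,
                 vector_size=100,   # embedding dimension (assumed)
                 window=5,          # context window size (assumed)
                 min_count=1,
                 workers=4)
model.save("w2v.model")             # illustrative path

# Query the 20 most similar words by cosine similarity
print(Word2Vec.load("w2v.model").wv.most_similar("good", topn=20))
```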

Most similar words: Good

  • great --> 0.7551022171974182
  • decent --> 0.7508329749107361
  • fantastic --> 0.6584885120391846
  • nice --> 0.6501882076263428
  • wonderful --> 0.6439023017883301
  • bad --> 0.6197049617767334
  • superb --> 0.6087694764137268
  • excellent --> 0.6060928702354431
  • terrible --> 0.5937264561653137
  • terrific --> 0.5934835076332092
  • reasonable --> 0.5878351926803589
  • poor --> 0.5779930353164673
  • impressive --> 0.5588662624359131
  • fair --> 0.5476508140563965
  • horrible --> 0.5454399585723877
  • okay --> 0.5450385808944702
  • awesome --> 0.5421015620231628
  • pleasant --> 0.5388332605361938
  • strong --> 0.5357953310012817
  • amazing --> 0.5348005890846252

Most similar words: Bad

  • horrible --> 0.6506983041763306
  • terrible --> 0.6456992626190186
  • good --> 0.6197049617767334
  • awful --> 0.5941882133483887
  • lame --> 0.5502369999885559
  • Faced --> 0.5411893129348755
  • weak --> 0.5405016541481018
  • funny --> 0.5383061766624451
  • nasty --> 0.5296658873558044
  • poor --> 0.5269413590431213
  • obvious --> 0.509117841720581
  • fake --> 0.50090092420578
  • stupid --> 0.4993632137775421
  • strong --> 0.49160268902778625
  • lousy --> 0.49101123213768005
  • strange --> 0.48473429679870605
  • loud --> 0.4782383441925049
  • upset --> 0.47385603189468384
  • weird --> 0.473296195268631
  • overpowering --> 0.46788638830184937

Are the words most similar to “good” positive, and words most similar to “bad” negative?

  • "Good": No, Not all but most of the words most similar to good are positive. Exception: "bad" "terrible"
  • "Bad": No, Not all but most of the words most similar to bad are negative. Exception: "good"
  • Note: Tried using multiple hyperparameters, similar patterns were seen.

Analysis:

  1. The most similar words are found by cosine similarity (the dot product of the L2-normalized vectors): the higher the value, the more similar the words. Word2vec's objective maximizes the dot product between a word vector and the vectors of its context words, which is what drives this similarity. Words like "good" and "bad" are far apart semantically but appear in the same contexts.
    • Ex:
      1. It looks nice, but doesn't work like the good and oldie!Too bad.
      2. But there were more bad pots than good.
      3. In choosing the Waring model, I researched and read many Amazon reviews, both good and bad.
      4. The good: the non-stick surface works great and is easy to clean.The bad: does a terrible job cooking rice.
    • Window size plays an important role during the training phase, as the network learns statistics from the number of times each pairing shows up. In other words, the vectors of two co-occurring words are pulled closer together.
    • So similarity is higher when words frequently appear close together in the corpus, which is happening here for "good" and "bad", as seen in the examples above.
  2. Negation or interchangeable usage in the same context can also be a reason. Ex: {"good","bad"}, {"cheap","expensive"}, {"sweet","salty"} used interchangeably, e.g. "it's not sweet or salty".
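For reference, a small numpy sketch of the cosine similarity computation behind these rankings (the vectors are illustrative; with a trained model the same value comes from model.wv.similarity):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity = dot product of the L2-normalized vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# With a trained model (see the gensim sketch above), e.g.:
# sim = cosine_similarity(model.wv["good"], model.wv["bad"])
u = np.array([0.2, 0.9, -0.1])   # illustrative word vector
v = np.array([0.3, 0.8, 0.05])   # illustrative word vector
print(cosine_similarity(u, v))
```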

References:





Feed Forward Neural Network

Running Instructions:

  1. main.py => To generate model : python3 A4/main.py data/
  2. inference.py => To predict txt file using saved model: python3 A4/inference.py data/sample_test.txt tanh
  • Width of the network at the hidden layer (H.L.):
    • neurons: 64
      • accuracy: 0.7587 | val_accuracy: 0.7427
      • Test Accuracy: 74.24%
    • neurons: 128
      • accuracy: 0.7613 | val_accuracy: 0.7519
      • Test Accuracy: 75.14%
    • Note: selecting 128 neurons in the hidden layer.
  • Selecting the dataset with stopwords: almost identical accuracies were seen with both datasets. Test accuracy running the baseline (no dropout, no L2) with ReLU:
    • With stopwords: 74.45%
    • Without stopwords: 74.25%
    • Time complexity is higher, but it may converge better given that MNB also performed better in this case, since there were stopwords contributing to sentiment.
Activation function | L2-norm regularization | Dropout | Train accuracy (%) | Val accuracy (%) | Test accuracy (%)
relu | False | False | 78.94 | 74.23 | 74.28
relu | False | True (0.5) | 73.96 | 74.55 | 74.30
relu | True (0.01) | True (0.5) | 63.20 | 70.74 | 70.55
relu | True (0.001) | True (0.5) | 65.60 | 69.91 | 69.90
tanh | False | False | 78.25 | 73.14 | 74.47
tanh | False | True (0.5) | 73.12 | 73.63 | 74.98
tanh | True (0.01) | True (0.5) | 68.18 | 71.36 | 71.32
tanh | True (0.001) | True (0.5) | 70.37 | 73.10 | 72.82
sigmoid | False | False | 83.84 | 74.35 | 74.18
sigmoid | False | True (0.5) | 74.84 | 74.99 | 74.25
sigmoid | True (0.01) | True (0.5) | 63.19 | 67.30 | 67.17
sigmoid | True (0.001) | True (0.5) | 68.93 | 72.02 | 71.92

Best Model at 0.2 Dropout after checking on [0.1,0.2,0.3,0.5]:

Activation function | L2-norm regularization | Dropout | Train accuracy (%) | Val accuracy (%) | Test accuracy (%)
relu | False | True (0.2) | 73.11 | 74.76 | 74.45
tanh | False | True (0.2) | 73.25 | 73.97 | 75.05
sigmoid | False | True (0.2) | 74.10 | 74.14 | 74.26

Note: Accuracies could have been better with trainable=True at the embedding layer, but that drastically increases training time and may lead to overfitting, since those weights are then updated.
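A hedged Keras sketch of the kind of model these tables describe; the pooling layer, the vocabulary size, and the use of pretrained Word2Vec vectors for the frozen embedding are assumptions based on the notes above:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_ffnn(embedding_matrix, activation="relu", l2=None, dropout=None):
    """Feed-forward classifier over a frozen embedding layer, per the settings in the tables."""
    vocab_size, embed_dim = embedding_matrix.shape
    reg = regularizers.l2(l2) if l2 else None
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, embed_dim,
                         embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                         trainable=False),             # frozen embeddings, as noted above
        layers.GlobalAveragePooling1D(),               # pool token vectors into one review vector
        layers.Dense(128, activation=activation,       # 128 hidden units (selected above)
                     kernel_regularizer=reg),
    ])
    if dropout:
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))   # positive/negative output
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# e.g. the best tanh row above, with a random stand-in embedding matrix:
model = build_ffnn(np.random.rand(20000, 100), activation="tanh", dropout=0.2)
```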

Analysis


Effect of activation functions on results (ReLU,tanh,sigmoid)

  • All three activation functions give similar results, with a mean change in accuracy of ±1% across all the models.
  • tanh works better than ReLU and sigmoid by about 1% accuracy, at 120 s/epoch, slightly more than ReLU's approximately 96 s/epoch. ReLU, being a ramp function, does not suffer from vanishing gradients in deep networks, whereas sigmoid can: if, say, x < -2 or x > 2, the derivative becomes small and the model may fail to converge, leading to non-firing, effectively dead neurons. No dying-ReLU problem was expected or seen; this was tested separately using "tf.keras.backend.gradients" (see the sketch after this list).
  • On the other hand, tanh works slightly better than sigmoid in our case, giving 75.05% accuracy at the cost of 120 s/epoch. The reason is that the maximum value of the derivative is 1.0 for tanh versus 0.25 for sigmoid, so the loss is reduced faster with larger updates.
  • Time:
    • ReLU<tanh<sigmoid
  • Loss(val):
    • tanh<ReLU<sigmoid (Negligible change)
  • Thus, tanh may be chosen if accuracy matters more than time complexity; ReLU can be chosen given its lower time complexity and decent accuracy, with no vanishing gradient problem.
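A small sketch of the gradient check mentioned above, using tf.GradientTape as a stand-in for tf.keras.backend.gradients; the input values are illustrative:

```python
import tensorflow as tf

x = tf.constant([-4.0, -2.0, 0.0, 2.0, 4.0])
for name, fn in [("relu", tf.nn.relu), ("tanh", tf.nn.tanh), ("sigmoid", tf.nn.sigmoid)]:
    with tf.GradientTape() as tape:
        tape.watch(x)          # x is a constant, so it must be watched explicitly
        y = fn(x)
    grads = tape.gradient(y, x)
    # sigmoid/tanh derivatives shrink rapidly for |x| > 2 (vanishing gradient),
    # while relu's derivative stays at 1 for positive inputs (and 0 for negative ones).
    print(name, grads.numpy())
```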

Effect of L2-norm regularization

  • An average decrease of 3-5% in accuracy can be seen when using L2-norm regularization in all the cases above.
  • Regularization penalizes the coefficients/weight matrices to avoid overfitting: Cost = Loss(cross-entropy) + regularization term. It pushes the weights toward zero and reduces overfitting.
  • L2 produces small, non-zero (non-sparse) weights, and when combined with sigmoid, which is not zero-centered, accuracy drops for our sparse data.
  • The tanh model, being zero-centered, is not affected as much by adding L2 regularization, although it does not improve accuracy either.
  • Notably, with L2 regularization the training accuracy is lower than the validation accuracy in all cases, which is a good indication that overfitting is reduced.

Effect of Dropout

  • Adding dropout of 0.5 increased the validation accuracy, indicating less overfitting, which can also be seen by comparing accuracy with and without dropout.
  • Dropout is also a regularization technique: it randomly selects the specified fraction of nodes and removes them, along with all their incoming and outgoing connections, during training.
  • By adding this randomness, overfitting is reduced. A dropout rate of 0.2 was found to be a decent choice for this data.

References:

Commented in the .py files.
