Natural Language Processing (NLP) for sentiment prediction

This Python script compares two methods of representing text (Gensim Word2Vec and TF-IDF word counts) for predicting sentiment in airline customer service tweets. The features of interest are airline, airline_sentiment, text, and tweet_created (timestamp). The script is inspired by the Kaggle Bag of Words tutorial (https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words).

Word2Vec

Word2Vec produces distributed word vectors that encapsulate the semantics of each word (see Fig 1 below). By looking up the vector of each word in a sentence and taking the average (or a weighted average), the meaning of the sentence can be represented as a single vector. In this dataset, each line of the text column is parsed and only words and hashtag terms are kept (stray characters are excluded). Stop words are removed with the NLTK package before training Word2Vec on the data. For any word in the training data, Gensim's Word2Vec can list the most similar (or most dissimilar) words by vector similarity. After averaging the word vectors, a RandomForestClassifier is used for model training. However, since there isn't much vocabulary to train on, Word2Vec didn't work too well in prediction accuracy. For comparison, GloVe provides word vectors pretrained on large corpora such as Wikipedia and is available through spaCy, so it would be worthwhile to try it on this data.
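As a rough illustration, here is a minimal sketch of that averaging pipeline. The toy tweets, the regex tokenizer, and hyperparameters such as `vector_size=50` are assumptions for illustration, not the repo's exact settings; it uses the gensim 4 API.

```python
import re
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the tweet text and sentiment columns.
tweets = ["@united thanks for the great service #happy",
          "@united my flight was delayed again #frustrated",
          "great crew, great flight",
          "terrible delay and lost my bag"]
labels = ["positive", "negative", "positive", "negative"]

stops = set(stopwords.words("english"))  # may require nltk.download("stopwords")

def tokenize(text):
    # Keep plain words and hashtag terms, drop stop words.
    return [t for t in re.findall(r"#?\w+", text.lower()) if t not in stops]

def avg_vector(tokens, model, dim):
    # Average the vectors of in-vocabulary tokens; zero vector if none match.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

sentences = [tokenize(t) for t in tweets]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, workers=1)
X = np.array([avg_vector(s, w2v, 50) for s in sentences])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)

# Similar-word query of the kind mentioned above.
print(w2v.wv.most_similar("flight"))
```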

UPDATE: GloVe vectors were used to process the data, and a multilayer perceptron model performed best in 5-fold CV (see the airline-Glove.ipynb file), reaching an accuracy of 80%, which is a significant improvement over previous attempts!
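A hedged sketch of the pretrained-vector approach this update refers to: the model name `en_core_web_md`, the MLP layer size, and the toy data are illustrative assumptions (the actual pipeline lives in airline-Glove.ipynb), and `cv=2` is used only because the toy set is tiny where the notebook reports 5-fold CV.

```python
import numpy as np
import spacy
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Requires: python -m spacy download en_core_web_md (ships pretrained vectors;
# older spaCy releases distributed GloVe vectors with this model).
nlp = spacy.load("en_core_web_md")

tweets = ["great crew", "loved the flight", "terrible delay", "lost my bag"]
labels = ["pos", "pos", "neg", "neg"]

# spaCy's doc.vector is the average of the document's token vectors.
X = np.array([nlp(t).vector for t in tweets])

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
print(cross_val_score(mlp, X, labels, cv=2))
```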


Fig 1. Source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Bag-of-words n-grams

This is a basic processing method in the realm of NLP, and the steps involved are pretty straightforward (see the sketch after this list):

1. Tokenize the words (removing stop words), with the option of n-grams (tokens of more than one word). This creates a sparse matrix that counts the frequency of each token in a given sentence (CountVectorizer). Another option at this point is TF-IDF, which gives more weight to words that appear less frequently across the corpus.
2. Train different models on the resulting matrix; here SGDClassifier was used since it performed better than the others.
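A minimal sketch of both vectorization options feeding an SGDClassifier; the n-gram range, stop-word setting, and toy data are assumptions rather than the repo's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

tweets = ["great crew", "loved the flight", "terrible delay", "lost my bag"]
labels = ["pos", "pos", "neg", "neg"]

# Option 1: raw n-gram counts (a sparse term-frequency matrix).
count_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    SGDClassifier())
count_clf.fit(tweets, labels)

# Option 2: TF-IDF reweighting, which down-weights terms common across the corpus.
tfidf_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    SGDClassifier())
tfidf_clf.fit(tweets, labels)

print(count_clf.predict(["great flight"]))
```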

Four sets of training variables were tried: (1) word-count vectors alone (2-grams); (2) word counts plus TF-IDF; (3) the first set concatenated with other features such as airline and day of week from the timestamp; and (4) the second set concatenated with the same extra features. Interestingly, the training data with only the word-count matrix performed best, with an accuracy of 73.1%. This could be a result of the sample size and the length of each sentence, since tweets tend to be short.
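For feature sets 3 and 4, the text matrix can be concatenated with one-hot-encoded metadata. A sketch under assumed values: airline and day of week come from the feature list above, but the toy rows and encoder settings are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

tweets = ["great crew", "terrible delay", "lost my bag", "loved the flight"]
meta = np.array([["United", "Mon"], ["Delta", "Fri"],
                 ["United", "Sat"], ["Delta", "Sun"]])  # airline, day of week

X_text = CountVectorizer(ngram_range=(1, 2)).fit_transform(tweets)
X_meta = OneHotEncoder().fit_transform(meta)   # sparse one-hot columns
X = sp.hstack([X_text, X_meta]).tocsr()        # combined feature matrix

print(X.shape)
```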
