Natural Language Processing (NLP) for sentiment prediction

This Python script compares two methods of representing text (Gensim Word2Vec and TF-IDF word counts) for predicting sentiment in airline customer service tweets. The features of interest are airline, airline_sentiment, text, and tweet_created (timestamp). The script is inspired by the Kaggle Bag of Words tutorial (https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words).

Word2Vec

Word2Vec produces distributed word vectors that encapsulate the semantics of each word (see Fig 1 below). By looking up the vector of each word in a sentence and taking the average (or a weighted average), the meaning of the sentence can be represented as a single vector. In this dataset, each line of the text column is parsed and only words and hashtag terms are kept (stray characters are excluded). Stop words are removed with the NLTK package before training Word2Vec on the data. For any word in the training data, Gensim's Word2Vec can list the most similar (or most dissimilar) words by vector similarity. After averaging the word vectors, a RandomForestClassifier is used for model training. However, since there isn't much vocabulary to train on, Word2Vec didn't work too well in prediction accuracy. For comparison, GloVe provides word vectors pretrained on large corpora such as Wikipedia and is available through spaCy, so it would be worthwhile to try it on this data.
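As a rough illustration, here is a minimal sketch of that averaging pipeline. The toy tweets, the regex tokenizer, and hyperparameters such as `vector_size=50` are assumptions for illustration, not the repo's exact settings; it uses the gensim 4 API.

```python
import re
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the tweet text and sentiment columns.
tweets = ["@united thanks for the great service #happy",
          "@united my flight was delayed again #frustrated",
          "great crew, great flight",
          "terrible delay and lost my bag"]
labels = ["positive", "negative", "positive", "negative"]

stops = set(stopwords.words("english"))  # may require nltk.download("stopwords")

def tokenize(text):
    # Keep plain words and hashtag terms, drop stop words.
    return [t for t in re.findall(r"#?\w+", text.lower()) if t not in stops]

def avg_vector(tokens, model, dim):
    # Average the vectors of in-vocabulary tokens; zero vector if none match.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

sentences = [tokenize(t) for t in tweets]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, workers=1)
X = np.array([avg_vector(s, w2v, 50) for s in sentences])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)

# Similar-word query of the kind mentioned above.
print(w2v.wv.most_similar("flight"))
```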

UPDATE: GloVe vectors were used to process the data, and a multilayer perceptron model performed best in 5-fold CV (see the airline-Glove.ipynb file), reaching an accuracy of 80%, which is a significant improvement over previous attempts!
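A hedged sketch of the pretrained-vector approach this update refers to: the model name `en_core_web_md`, the MLP layer size, and the toy data are illustrative assumptions (the actual pipeline lives in airline-Glove.ipynb), and `cv=2` is used only because the toy set is tiny where the notebook reports 5-fold CV.

```python
import numpy as np
import spacy
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Requires: python -m spacy download en_core_web_md (ships pretrained vectors;
# older spaCy releases distributed GloVe vectors with this model).
nlp = spacy.load("en_core_web_md")

tweets = ["great crew", "loved the flight", "terrible delay", "lost my bag"]
labels = ["pos", "pos", "neg", "neg"]

# spaCy's doc.vector is the average of the document's token vectors.
X = np.array([nlp(t).vector for t in tweets])

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
print(cross_val_score(mlp, X, labels, cv=2))
```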


Fig 1. Source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Bag-of-words n-grams

This is a basic processing method in the realm of NLP, and the steps involved are pretty straightforward (see the sketch after this list):

1. Tokenize the words (removing stop words), with the option of n-grams (tokens of more than one word). This creates a sparse matrix that counts the frequency of each token in a given sentence (CountVectorizer). Another option at this point is TF-IDF, which gives more weight to words that appear less frequently across the corpus.
2. Train different models on the resulting matrix; here SGDClassifier was used since it performed better than the others.
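A minimal sketch of both vectorization options feeding an SGDClassifier; the n-gram range, stop-word setting, and toy data are assumptions rather than the repo's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

tweets = ["great crew", "loved the flight", "terrible delay", "lost my bag"]
labels = ["pos", "pos", "neg", "neg"]

# Option 1: raw n-gram counts (a sparse term-frequency matrix).
count_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    SGDClassifier())
count_clf.fit(tweets, labels)

# Option 2: TF-IDF reweighting, which down-weights terms common across the corpus.
tfidf_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    SGDClassifier())
tfidf_clf.fit(tweets, labels)

print(count_clf.predict(["great flight"]))
```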

Four sets of training variables were tried: (1) word-count vectors alone (2-grams); (2) word counts plus TF-IDF; (3) the first set concatenated with other features such as airline and day of week from the timestamp; and (4) the second set concatenated with the same extra features. Interestingly, the training data with only the word-count matrix performed best, with an accuracy of 73.1%. This could be a result of the sample size and the length of each sentence, since tweets tend to be short.
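For feature sets 3 and 4, the text matrix can be concatenated with one-hot-encoded metadata. A sketch under assumed values: airline and day of week come from the feature list above, but the toy rows and encoder settings are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

tweets = ["great crew", "terrible delay", "lost my bag", "loved the flight"]
meta = np.array([["United", "Mon"], ["Delta", "Fri"],
                 ["United", "Sat"], ["Delta", "Sun"]])  # airline, day of week

X_text = CountVectorizer(ngram_range=(1, 2)).fit_transform(tweets)
X_meta = OneHotEncoder().fit_transform(meta)   # sparse one-hot columns
X = sp.hstack([X_text, X_meta]).tocsr()        # combined feature matrix

print(X.shape)
```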
