<a href="https://colab.research.google.com/github/gracecarrillo/Political-Data-Science/blob/master/Feature_Engineering_Sentiment_Analysis_Scotref2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scottish independence: Twitter data Sentiment Analysis




## 4. Feature Engineering
  
  - Sentiment Score with Vader
  - Part of Speech Tags (POS)

In [0]:
# Must be upgraded
!pip install tqdm==4.36.1 --upgrade

In [0]:
!pip install --upgrade gensim

In [0]:
!pip install vaderSentiment

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# general
import os
import pandas as pd
import numpy as np
import csv
import string
import matplotlib.pyplot as plt
import seaborn as sns
import random
import itertools
import collections
from collections import Counter

# tweets
import tweepy as tw
import re
from string import punctuation
from tweepy import OAuthHandler
import json

# text manipulation 
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.porter import * 

# plots
from wordcloud import WordCloud
import plotly
import chart_studio.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.go_offline()

# Feature Engineering
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Machine Learning
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# For geoplots
from IPython.display import IFrame
import folium
from folium import plugins
from folium.plugins import MarkerCluster, FastMarkerCluster, HeatMapWithTime
import networkx

# hide warnings
import warnings
warnings.filterwarnings("ignore")


# set plot preferences
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option("display.max_colwidth", 200) 

print('Libraries imported')
%matplotlib inline





Libraries imported


## 4. Feature Engineering

Before preprocessed data can be analysed, it needs to be converted into features. Depending on the use case, text features can be constructed with techniques such as Bag of Words, TF-IDF, and Word Embeddings. 

A basic approach to Bag of Words will not be able to capture the difference between “I like you”, where “like” is a verb with a positive sentiment, and “I am like you”, where “like” is a preposition with a neutral sentiment.
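
The limitation can be seen in a minimal sketch. The helper `bag_of_words` below is a hypothetical stand-in for a vectoriser, written with the standard library only. It shows that both sentences contribute the same count for "like", so the verb and the preposition are indistinguishable once word order is discarded.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace and count each token (hypothetical helper)."""
    return Counter(sentence.lower().split())

verb_use = bag_of_words("I like you")
prep_use = bag_of_words("I am like you")

# "like" gets the same count in both bags, whatever its role in the sentence
print(verb_use["like"], prep_use["like"])

# The only difference between the two bags is the extra token "am"
print(prep_use - verb_use)
```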

To improve this technique we'll extract features using Vader's Polarity Scores and Part of Speech (POS) tags.



### 4.1 Sentiment Score with Vader

VADER is a sentiment analysis tool based on a lexicon of sentiment-related words. It uses a bag of words approach: a lookup table of positive and negative words, in which each word is rated as positive or negative and, in many cases, how positive or negative. 

VADER produces four sentiment metrics. The first three, positive, neutral and negative, give the proportion of the text that falls into each category. The final metric, the compound score, is the sum of all of the lexicon ratings, standardised to range between -1 and 1. 
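
The standardisation of the summed ratings can be sketched as follows. This assumes the normalisation used in VADER's reference implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`. The rating of 3.2 for "love" is likewise an assumed lexicon value, used only for illustration.

```python
import math

def normalise(score_sum, alpha=15):
    """Squash a sum of lexicon ratings into [-1, 1] (assumed VADER-style formula)."""
    norm = score_sum / math.sqrt(score_sum * score_sum + alpha)
    # Clamp to guard against rounding at the extremes
    return max(-1.0, min(1.0, norm))

# Assumed rating for "love"; a tweet whose only rated word is "love"
love_rating = 3.2
compound = normalise(love_rating)
print(round(compound, 3))  # roughly 0.637
```

Larger sums move the score towards 1 or -1 without ever exceeding those limits, which is what keeps tweets of different lengths comparable.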

For the compound score:
- positive sentiment : (compound score >= 0.05)
- neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
- negative sentiment : (compound score <= -0.05)
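
The cut-offs above translate directly into a small labelling function. This is a minimal sketch; the name `label_from_compound` is hypothetical and not part of the VADER package.

```python
def label_from_compound(compound):
    """Map a compound score to a sentiment label using the cut-offs listed above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Illustrative compound scores
for score in (0.6369, -0.4767, 0.0):
    print(score, label_from_compound(score))
```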

We'll use these scores to create features based on the sentiment metrics of our tweets, which will then be used as additional features for modelling.

In [0]:
# Load data set
train = pd.read_csv('/content/drive/My Drive/Twitter_Project/cleaned_train_data.csv')

In [0]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
label         50000 non-null int64
text          50000 non-null object
word count    50000 non-null int64
tidy_tweet    49685 non-null object
tokens        50000 non-null object
dtypes: int64(2), object(3)
memory usage: 1.9+ MB


In [0]:
train.dropna(subset=['tidy_tweet'], inplace=True)

In [0]:
analyser = SentimentIntensityAnalyzer()

def polarity_scores_all(tweets):
  '''
  Takes an iterable of tweet strings and:
  1. Gets the sentiment metrics of each tweet
  2. Returns the negative, neutral, positive
  and compound scores as four lists.
  '''
  neg, neu, pos, compound = [], [], [], []
  
  # Reuse the analyser created above for every tweet
  for text in tweets:
    dict_ = analyser.polarity_scores(text)
    neg.append(dict_['neg'])
    neu.append(dict_['neu'])
    pos.append(dict_['pos'])
    compound.append(dict_['compound'])
  
  return neg, neu, pos, compound

In [0]:
all_scores = polarity_scores_all(train.tidy_tweet.values)
train['neg_scores'] = all_scores[0]
train['neu_scores'] = all_scores[1]
train['pos_scores'] = all_scores[2]
train['compound_scores'] = all_scores[3]

In [0]:
train.head(4)

Unnamed: 0,label,text,word count,tidy_tweet,tokens,neg_scores,neu_scores,pos_scores,compound_scores
0,4,is it LOVE or BREAD???,5,love bread,"['love', 'bread']",0.0,0.192,0.808,0.6369
1,0,now doing the weights again...urgh my poor laptop is burning up in this summer evening's heat,16,weight urgh poor laptop burn summer even heat,"['weights', 'urgh', 'poor', 'laptop', 'burning', 'summer', 'evening', 'heat']",0.307,0.693,0.0,-0.4767
2,4,just done my weekly weigh in. on target woohoo,9,done weekli weigh target woohoo,"['done', 'weekly', 'weigh', 'target', 'woohoo']",0.0,0.548,0.452,0.5106
3,0,"@Ambluc, cool thanks, having trouble back replying to anyone, the little arrow is not working",15,cool thank troubl back repli anyon littl arrow work,"['cool', 'thanks', 'trouble', 'back', 'replying', 'anyone', 'little', 'arrow', 'working']",0.0,0.593,0.407,0.5859


### 4.2 Part of Speech Tags (POS)

Part of Speech (POS) tagging is the process of assigning a part of speech tag to each word in a corpus, based on its context and definition. This is useful because the same word can have completely different meanings as different parts of speech. The task is not straightforward, since the part of speech of a particular word may change with the context in which it is used.

We'll tag each tweet with NLTK's `pos_tag`, which relies on the averaged perceptron tagger downloaded below, and add the tags as features in our model.

In [0]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
# To transform pos tags to readable tags
pos_family = {  
    'NOUN' : ['NN','NNS','NNP'], # Removed 'NNPS'
    'PRON' : ['PRP','PRP$','WP','WP$'],
    'VERB' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'ADJ' :  ['JJ','JJR','JJS'],
    'ADV' : ['RB','RBR','RBS','WRB']
}

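def count_pos_tag(tweets):
  '''
  Takes an iterable of tweet strings and, for each tweet:
  1. Splits the text on whitespace and attaches POS tags
  2. Collects the tags into a Counter
  3. Records every tag found in the tweet with the value 1
  Returns a list with one {tag: 1} dict per tweet.
  '''
  total_count = []
  for s in tweets:
    partial_count = {}
    s = s.split()
    # dict() keeps one tag per distinct word before the tags are counted
    count_pos = Counter(dict(nltk.pos_tag(s)).values())
    
    # value (the tag's frequency) is not used here, so each tag present
    # in the tweet is stored as 1 rather than as its number of occurrences
    for item, value in count_pos.items():
      partial_count[item] = partial_count.get(item, 0) + 1
            
    total_count.append(partial_count)

  return total_count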
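
The function above adds 1 for each distinct tag in a tweet rather than the tag's frequency, so every entry works as a presence flag. The sketch below contrasts that with a true occurrence count. The `(word, tag)` pairs are written by hand to stand in for `nltk.pos_tag` output and are an assumption, not real tagger results.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for tagger output (assumed tags)
tagged = [("poor", "JJ"), ("laptop", "NN"), ("summer", "NN"), ("heat", "NN")]

# Presence flag: every tag that appears maps to 1
presence = {tag: 1 for _, tag in tagged}

# Occurrence count: every tag maps to the number of words carrying it
occurrences = Counter(tag for _, tag in tagged)

print(presence)
print(dict(occurrences))
```

If frequencies are wanted as features, storing the `Counter` values directly would be one way to get them.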

In [0]:
# Retrieve POS tags with occurrence 
total_count = count_pos_tag(train.tidy_tweet.values)

# As dataframe 
pos_df = pd.DataFrame(total_count)

# Remove unwanted tags
pos_df = pos_df.drop(['$', 'IN'], axis = 1) # keep only the tags that occur as columns in this list

# Inspection
pos_df.columns

Index(['NN', 'JJ', 'RB', 'VBN', 'VBD', 'MD', 'VB', 'CD', 'NNS', 'VBP', 'DT',
       'CC', 'RBR', 'JJS', 'FW', 'JJR', 'VBZ', 'PRP', 'RP', 'VBG', 'WDT',
       'WRB', 'WP', 'UH', 'PRP$', 'RBS', 'NNP', 'EX', 'PDT', 'WP$', 'POS',
       '''', 'SYM', 'TO'],
      dtype='object')

In [0]:
# Change tags to readable tags

pos_df['NOUN'] = pos_df[pos_family['NOUN']].sum(axis=1)
pos_df['PRON'] = pos_df[pos_family['PRON']].sum(axis=1)
pos_df['VERB'] = pos_df[pos_family['VERB']].sum(axis=1)
pos_df['ADJ'] = pos_df[pos_family['ADJ']].sum(axis=1)
pos_df['ADV'] = pos_df[pos_family['ADV']].sum(axis=1)

pos_df = pos_df[['NOUN', 'PRON', 'VERB', 'ADJ', 'ADV']]
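
The grouping step can also be written as a loop over `pos_family`. The sketch below uses a tiny hand-made `pos_df` and a shortened `pos_family` (both assumptions for illustration). It selects columns with `reindex`, so a tag that never occurred in the data contributes 0 instead of raising a `KeyError`.

```python
import pandas as pd

# Shortened tag families, for illustration only
pos_family = {
    'NOUN': ['NN', 'NNS', 'NNP'],
    'VERB': ['VB', 'VBD'],
    'ADJ': ['JJ'],
}

# Two toy tweets; 'NNP' never occurs, so that column is missing
pos_df = pd.DataFrame([
    {'NN': 1, 'VB': 1, 'VBD': 1},
    {'NNS': 1, 'JJ': 1},
])

# reindex adds any missing tag as an all-NaN column, which sum() treats as 0
family_df = pd.DataFrame({
    family: pos_df.reindex(columns=tags).sum(axis=1)
    for family, tags in pos_family.items()
})

print(family_df)
```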

In [0]:
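# Add to end of original data set as new features
# Note: pd.concat(axis=1) aligns rows on index labels. train kept its original
# labels after dropna(), while pos_df is numbered 0..n-1, so the two frames
# do not line up row for row; calling train.reset_index(drop=True) first would.
train = pd.concat([train, pos_df], axis = 1)

# Deal with NaN
train = train.fillna(value=0.0)

train.shape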

(49998, 14)
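
The row count above (49998, not the 49685 rows left after `dropna`) appears consistent with `pd.concat(axis=1)` matching rows by index label rather than by position. The toy frames below are assumptions made up for illustration. They show how a gap in the index changes the result and how resetting the index first keeps each feature row next to its tweet.

```python
import pandas as pd

# Toy data: the middle tweet is missing and gets dropped
train = pd.DataFrame({'tidy_tweet': ['love bread', None, 'cool thank']})
train = train.dropna(subset=['tidy_tweet'])   # index labels are now 0 and 2

# Features built from a plain list get a fresh 0..n-1 index
pos_df = pd.DataFrame({'NOUN': [1, 2]})       # index labels are 0 and 1

# Aligned on labels: the union 0, 1, 2 gives three rows, with NaN gaps
misaligned = pd.concat([train, pos_df], axis=1)

# Aligned on position: reset the index before concatenating
aligned = pd.concat([train.reset_index(drop=True), pos_df], axis=1)

print(misaligned.shape, aligned.shape)
```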

In [0]:
# Remove duplicates 
train.drop_duplicates(subset=['tidy_tweet'], inplace=True)

In [0]:
# Check new features
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47824 entries, 0 to 49999
Data columns (total 14 columns):
label              47824 non-null float64
text               47824 non-null object
word count         47824 non-null float64
tidy_tweet         47824 non-null object
tokens             47824 non-null object
neg_scores         47824 non-null float64
neu_scores         47824 non-null float64
pos_scores         47824 non-null float64
compound_scores    47824 non-null float64
NOUN               47824 non-null float64
PRON               47824 non-null float64
VERB               47824 non-null float64
ADJ                47824 non-null float64
ADV                47824 non-null float64
dtypes: float64(11), object(3)
memory usage: 5.5+ MB


In [0]:
train.head(5)

Unnamed: 0,label,text,word count,tidy_tweet,tokens,neg_scores,neu_scores,pos_scores,compound_scores,NOUN,PRON,VERB,ADJ,ADV
0,4.0,is it LOVE or BREAD???,5.0,love bread,"['love', 'bread']",0.0,0.192,0.808,0.6369,1.0,0.0,0.0,0.0,0.0
1,0.0,now doing the weights again...urgh my poor laptop is burning up in this summer evening's heat,16.0,weight urgh poor laptop burn summer even heat,"['weights', 'urgh', 'poor', 'laptop', 'burning', 'summer', 'evening', 'heat']",0.307,0.693,0.0,-0.4767,1.0,0.0,0.0,1.0,1.0
2,4.0,just done my weekly weigh in. on target woohoo,9.0,done weekli weigh target woohoo,"['done', 'weekly', 'weigh', 'target', 'woohoo']",0.0,0.548,0.452,0.5106,1.0,0.0,1.0,1.0,0.0
3,0.0,"@Ambluc, cool thanks, having trouble back replying to anyone, the little arrow is not working",15.0,cool thank troubl back repli anyon littl arrow work,"['cool', 'thanks', 'trouble', 'back', 'replying', 'anyone', 'little', 'arrow', 'working']",0.0,0.593,0.407,0.5859,1.0,0.0,0.0,1.0,1.0
4,4.0,"@LeiRock lmaoo, oh yeahh ! well im stoooopid happy",9.0,lmaoo yeahh well stoooopid happi,"['lmaoo', 'yeahh', 'well', 'stoooopid', 'happy']",0.0,0.656,0.344,0.2732,1.0,0.0,0.0,1.0,1.0


In [0]:
# Saving preprocessed dataset
train.to_csv('/content/drive/My Drive/Twitter_Project/feat_eng_train_data.csv', index=False)