# Methods for Computational Politics

Date: April 17th, 2019<br>Time: 9am - 2:30pm<br>Location: ENR2 S215

## Introduction
The 21st century has brought with it a wealth of new methods for data collection and analysis. Due to the convergence of digital trace data availability, more transparent data and code sharing norms, and relatively cheap and plentiful computer processing capabilities, researchers with a laptop and an internet connection can now access a large and growing set of tools and methods. The implications of the "computational revolution" are profound for the study of social and political processes, but the skills required to collect and harness these new sources of data are typically not taught to social scientists. This workshop is for anyone who is interested in studying phenomena of social and political importance using the growing set of computational methods that help us understand these phenomena in new ways.

## Part 2: Sentiment Analysis
The second part of the workshop examines automated processes that researchers can use to understand opinions about a given subject from written language. We will start by learning about sentiment analysis, including what methods are available and how researchers test the validity of the results they obtain. This part of the workshop will cover traditional NLP techniques used to prepare written text for automated analysis (e.g. stemming, tokenization) as well as rule-based methods (i.e. using a lexicon) and automated machine learning methods (e.g. word vectors or word embeddings) for conducting sentiment analysis.

Meltem Odabaş<br>
School of Sociology<br>
University of Arizona<br>
Email: meltemodabas@email.arizona.edu<br>
Web: http://www.meltemodabas.net

## Introduction

I will not re-invent the wheel for this workshop. I will take a couple of tutorials that are available online and adapt them for our analysis today. If you would like to see those tutorials, please follow the links below.

In Section a, we will use a lexicon to do sentiment analysis. In Section b, we will create a model that vectorizes each word to estimate tweet sentiments.

For Section a, please see: https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/get-and-use-twitter-data-in-python/

For Section b, please see: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec/notebook

You can also find additional references at the end of this section.

Before we start coding, however, I would like to talk a bit about different methods of sentiment analysis, using the description here: https://monkeylearn.com/sentiment-analysis/

### a. Sentiment Analysis with TextBlob

Let's start by importing the packages we need for this section.

In [1]:
!pip install paramiko


In [2]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import collections
import os

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')

import tweepy as tw
import re
import networkx

from textblob import TextBlob

import warnings
warnings.filterwarnings("ignore")

sns.set(font_scale=1.5)
sns.set_style("whitegrid")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


First, we will open the tweets.json data as tweets2, convert tweets2 to a dataframe, and extract only the text of the tweets for the analysis.

In [3]:
os.getcwd()

'/content'

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!ls '/content/drive/My Drive'

 1point6M_tweet
 Sentiment.ipynb
'SICSS kartpostal 2.jpg'
'SICSS kartpostal 2.pdf'
'SICSS kartpostal 2.psd'
'SICSS kartpostal 3.jpg'
'SICSS kartpostal 3.pdf'
'SICSS kartpostal 3.psd'
'Tucson - Places to visit.gmap'
 tweetcollect.py
 tweets.json
 tweets_now.json


In [6]:
os.chdir('/content')
!ls

drive  sample_data


In [7]:
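# Read saved ego tweets. The file lives in the mounted Drive folder listed above,
# not in the current working directory (/content).
with open('/content/drive/My Drive/tweets.json', 'rb') as file:
    tweets2 = json.load(file)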

In [None]:
df = pd.DataFrame(tweets2)

In [None]:
df = df[['id','text']]

In [None]:
df[1:5]

In [None]:
tweetText = list(df['text'])

In [None]:
tweetText[25:30]

For this analysis, you only need to remove URLs from the tweets.

In [None]:
def remove_url(txt):
    """Remove URLs (and any non-alphanumeric characters) from a text string.

    Parameters
    ----------
    txt : string
        A text string from which you want to remove URLs.

    Returns
    -------
    The same txt string with URLs removed.
    """
    # Raw string so that the regex escapes (\w, \S) reach the regex engine unchanged.
    # The first alternative also strips punctuation such as #, @ and !.
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())

In [None]:
# Remove URLs
tweets_no_urls = [remove_url(tweet) for tweet in tweetText]

In [None]:
tweets_no_urls[25:30]
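
To see what this cleaning step does to a single string, here is a minimal, self-contained sketch. The sample sentence and the name `strip_urls_and_symbols` are made up for illustration; the pattern is the same one used in `remove_url` above. Note that it removes not only the URL but also punctuation and the `#` and `@` symbols.

```python
import re

# Same pattern as in remove_url: either a single non-alphanumeric character
# or a whole URL-like token (scheme://...).
URL_OR_SYMBOL = re.compile(r"([^0-9A-Za-z \t])|(\w+://\S+)")

def strip_urls_and_symbols(text):
    """Delete URLs and non-alphanumeric characters, then collapse whitespace."""
    return " ".join(URL_OR_SYMBOL.sub("", text).split())

# Hypothetical example tweet
sample = "Great turnout today! https://example.com/x #vote @user"
cleaned = strip_urls_and_symbols(sample)
print(cleaned)  # Great turnout today vote user
```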

You can use the Python package textblob to calculate the polarity values of individual tweets on ISraElex2019.

To learn more about how textblob works, please follow this link: https://planspace.org/20150607-textblob_sentiment/
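
As a rough intuition for lexicon-based scoring, the sketch below averages hand-assigned polarity values over the words it recognizes. The tiny `TOY_LEXICON` and the function `toy_polarity` are hypothetical and exist only for this illustration; TextBlob's own lexicon and rules are more elaborate, as the article linked above describes.

```python
# Toy lexicon: word -> polarity in [-1, 1]. These values are invented for
# illustration and are NOT taken from TextBlob's lexicon.
TOY_LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def toy_polarity(text):
    """Average the polarity of the words that appear in the toy lexicon."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Text without any known opinion word is treated as neutral (0.0)
    return sum(scores) / len(scores) if scores else 0.0

mixed = toy_polarity("a great speech but a bad result")  # averages 0.8 and -0.5
neutral = toy_polarity("no opinion words here")          # no lexicon hits
print(mixed, neutral)
```

A tweet with no recognized words gets a polarity of exactly zero in this sketch, which is one reason an analysis might choose to drop zero-polarity rows before plotting.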

Begin by creating TextBlob objects, which assign polarity values to the tweets. You can read the polarity value from the .polarity attribute of a TextBlob object.

In [None]:
# Create textblob objects of the tweets
sentiment_objects = [TextBlob(tweet) for tweet in tweets_no_urls]

for i in [25,26,27,28,29,30]:
    print(sentiment_objects[i].polarity, sentiment_objects[i])

You can apply list comprehension to create a list of the polarity values and text for each tweet, and then create a Pandas Dataframe from the list.

In [None]:
# Create list of polarity values and tweet text
sentiment_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in sentiment_objects]

sentiment_values[0]

In [None]:
# Create dataframe containing the polarity value and tweet text
sentiment_df = pd.DataFrame(sentiment_values, columns=["polarity", "tweet"])

sentiment_df.head()

These polarity values can be plotted in a histogram, which can help to highlight the overall sentiment (i.e. more positivity or negativity) toward the subject.

Because there are retweets, however, I will delete the duplicates first. 

In [None]:
#number of rows before duplicates are removed
sentiment_df.shape[0]

In [None]:
#Remove duplicates:
sentiment_df = sentiment_df.drop_duplicates()

In [None]:
#number of rows after duplicates are removed
sentiment_df.shape[0]

In [None]:
# Remove polarity values equal to zero
sentiment_df = sentiment_df[sentiment_df.polarity != 0]

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

# Plot histogram of the polarity values
sentiment_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1],
             ax=ax,
             color="purple")

plt.title("Sentiments from Tweets")
plt.show()

### b. Sentiment Analysis with Word Embeddings (word2vec)

In the previous section we used a lexicon in which each word was assigned a sentiment polarity value to calculate the overall sentiment of a tweet. Another option is to use a labeled dataset to identify the polarity of sentiment based on a word embeddings model.

Please follow this link for a description of word embeddings and word2vec:
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

A word embedding model assigns a vector to each word based on the contexts in which it appears across the sentences and paragraphs of the text. These vectors represent the meaning of each word. Representing each word by a vector also allows us to add and subtract words. For instance, if we compute king - man + woman, we would expect the result of this equation to be... queen!
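
The arithmetic can be illustrated with a toy example. The two-dimensional vectors below are invented by hand (one axis for "royalty", one for "gender") and the helper names `cosine` and `analogy` are hypothetical; real word2vec vectors are learned from text and have hundreds of dimensions. As a design choice, the sketch excludes the input words from the candidate answers.

```python
import math

# Hand-made 2-D "embeddings": (royalty, gender). Invented for illustration.
TOY_VECTORS = {
    "king":   (1.0,  1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0,  1.0),
    "woman":  (0.0, -1.0),
    "prince": (0.8,  1.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative):
    """Return the word whose vector is closest to sum(positive) - sum(negative)."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, TOY_VECTORS[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, TOY_VECTORS[word])]
    # Skip the words that were part of the question itself
    candidates = [w for w in TOY_VECTORS if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, TOY_VECTORS[w]))

result = analogy(positive=["king", "woman"], negative=["man"])
print(result)  # queen
```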

In [None]:
#load additional packages
import numpy as np
from bs4 import BeautifulSoup
from gensim.models import word2vec
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
nltk.download('stopwords')

#### Text Pre-Processing (Tokenization)

Let's start with showing what tokenization is, and how to tokenize a fictitious tweet:

In [None]:
tweet = 'RT @meltemodabas: Hello World! This is not a real tweet :D http://example.com/654331/ #Tokenize'
print(word_tokenize(tweet))

You will notice some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK: @-mentions, emoticons, URLs and #hash-tags are not recognised as single tokens. The following code will propose a pre-processing chain that will consider these aspects of the language.

In [None]:
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens



In [None]:
print(preprocess(tweet))

This time, let's tokenize the tweets from our search and look at five of them. I will use the tweets with the URLs removed:

In [None]:
# Tokenize
tokenized_tweets= [preprocess(tweet) for tweet in tweets_no_urls]
tokenized_tweets[25:30]

#### Sentiment Analysis with scikit learn and gensim (word2vec)

This time, rather than using TextBlob, we will use a dataset that already has sentiment labels as our training dataset. We will teach Python what kind of tweets are positive and what kind are negative using the training dataset. Then, we will ask Python to assign sentiment values to the tweets we collected (i.e. to the test dataset).

I will use two datasets that are available online as training data. They are listed below:

i. IMDB reviews with a review score (0=negative, 1=positive) dataset, available here: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec/data

ii. 1,600,000 annotated tweets (0=negative, 4=positive), available here: https://www.kaggle.com/kazanova/sentiment140

I will run the code on a 5K version of the 1.6M tweet dataset as an example. Due to time constraints, I will then load the versions built from the full datasets (i) and (ii) for further analysis.

Let's start by loading the 1.6M tweets as our "train" dataset:

In [None]:
os.chdir('/content/drive/My Drive/')

In [None]:
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
train1 = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding =DATASET_ENCODING , names=DATASET_COLUMNS)


In [None]:
train1[1:10]

Here, "target" is the sentiment score: 0 for negative, 4 for positive. There are other datasets out there that assign 0-1 values to sentiments. 
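
If you ever need to combine this dataset with one that uses 0/1 labels, the two label schemes have to be aligned first. A minimal sketch, assuming only the two values described above occur; the names `LABEL_MAP` and `to_binary` are hypothetical:

```python
# Sentiment140 encodes negative as 0 and positive as 4 (see above).
# Map those values onto the 0/1 scheme used by other datasets.
LABEL_MAP = {0: 0, 4: 1}

def to_binary(labels):
    """Convert a sequence of 0/4 sentiment labels to 0/1 labels."""
    return [LABEL_MAP[label] for label in labels]

binary = to_binary([0, 4, 4, 0])
print(binary)  # [0, 1, 1, 0]
```

With pandas, the same dictionary could be passed to `Series.map` on the "target" column.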

Let's reduce this dataset to "target","ids" and "text", for simplicity:

In [None]:
train1 = train1[["ids","text","target"]]

In [None]:
train1[1:5]

Let's look at the total number of rows in this dataset:

In [None]:
train1.shape[0]

This is a huge dataset. It is ideal for analysis, but in this workshop we do not have enough time to parse all 1.6M tweets. What I will do instead is shrink the dataset to 5K tweets and show how the code works on the smaller dataset. Then, I will load the versions of the data that I created at home with 1.6M tweets.

So, let's shrink this dataset. I will choose 2.5K tweets with negative sentiment and 2.5K tweets with positive sentiment.

In [None]:
train1.groupby('target').count()

In [None]:
temp1 = train1[train1['target'] == 0]

In [None]:
temp1 = temp1.sample(2500, axis=0)

In [None]:
temp2 = train1[train1['target'] == 4]

In [None]:
temp2 = temp2.sample(2500, axis=0)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
train1_small = pd.concat([temp1, temp2])

In [None]:
del temp1,temp2

In [None]:
train1_small.groupby('target').count()
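
The cells above draw a different random sample on every run. If you want the same balanced subset each time, you can fix the random seed. The sketch below shows the idea with the standard library only; the function `balanced_sample` and the toy rows are hypothetical and stand in for the real dataframe.

```python
import random

def balanced_sample(rows, label_of, per_class, seed=0):
    """Draw `per_class` rows at random from each label group."""
    rng = random.Random(seed)  # fixed seed -> same sample on every run
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], per_class))
    return sample

# Hypothetical toy data: (text, target) pairs with targets 0 and 4
rows = [("tweet %d" % i, 0 if i % 2 == 0 else 4) for i in range(20)]
small = balanced_sample(rows, label_of=lambda r: r[1], per_class=3)
print(small)
```

In pandas, passing `random_state` to `DataFrame.sample` serves the same purpose.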

Let's recall the original dataset we have. We will use this as our test dataset:

In [None]:
test = df

In [None]:
test[25:30]

Note that this is the raw tweet text, not the version from which we removed the URLs.

We will use beautifulsoup for text cleaning, and nltk for tokenization (and cleaning the stopwords is an option). 

In [None]:
# This function converts a text to a sequence of words.
def review_wordlist(review, remove_stopwords=False):
    # 1. Removing html tags (naming the parser explicitly avoids a BeautifulSoup warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # 2. Replacing non-letter characters with spaces
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Converting to lower case and splitting
    words = review_text.lower().split()
    # 4. Optionally remove stopwords
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words

In [None]:
# This function splits a review into sentences. 
def review_sentences(review, tokenizer, remove_stopwords=False):
    # 1. Using nltk tokenizer
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    # 2. Loop for each sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence)>0:
            sentences.append(review_wordlist(raw_sentence,\
                                            remove_stopwords))
    return sentences

We will start with the train1_small dataset:

In [None]:
sentences = []
print("Parsing sentences from training set")
for r in train1_small["text"]:
    sentences += review_sentences(r, tokenizer)

What do the parsed sentences look like? Let's take a look at one example:

In [None]:
sentences[0]

Now it is time to initialize the train1_small model using the variable sentences:

In [None]:
# Importing the built-in logging module
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Creating the model and setting values for the various parameters
num_features = 300  # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4     # Number of parallel threads
context = 10        # Context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

In [None]:
# Initializing the train model
# (in gensim 4.0 and later, the `size` argument is named `vector_size`)
print("Training model....")
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

If you want to save this model for future use, what you need to do is:

In [None]:
# To make the model memory efficient
model.init_sims(replace=True)

# Saving the model for later use. Can be loaded using Word2Vec.load()
model_name = "model_in_class"
model.save(model_name)

Because the dataset here is very small, the model will not have the chance to learn all the words. For example, when we run the test below, we will get an error:

In [None]:
# Few tests: This will print the odd word among them 
model.wv.doesnt_match("man woman dog child kitchen".split())
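
One likely reason for an error here is the `min_count` setting: word2vec ignores every word that occurs fewer than `min_count` times in the training sentences, and with only 5K tweets many ordinary words fall below a threshold of 40. The sketch below imitates that filtering step with the standard library; `build_vocab` and the toy sentences are hypothetical and are not part of gensim.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized sentences
toy_sentences = [["good", "morning"], ["good", "night"], ["good", "kitchen"]]
vocab = build_vocab(toy_sentences, min_count=2)

print(vocab)               # {'good'}
print("kitchen" in vocab)  # False: "kitchen" occurs only once
```

Asking a model about a word that was filtered out this way fails because the model has no vector for it.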

Therefore, I am now going to load the model I trained at home with 1.6M tweets.
Actually, let's make it even more fun: let's use the model created with the 1.6M tweet dataset and another model created using an IMDB review dataset of 25K reviews, and compare the two!
I will call the tweet dataset model model1, and the IMDB review dataset model model2.

I have not loaded the IMDB dataset yet. Let's load it and take a look at it before we load the models I created at home:

In [None]:
os.getcwd()

In [None]:
train2 = pd.read_csv("labeledTrainData.tsv", header=0,\
                    delimiter="\t", quoting=3)

In [None]:
train2[0:5]

How long is a review? As short as a tweet?

In [None]:
len(train2['review'][0])

Oh, not really. So this is a very different dataset. Let's see which one does better as the basis for a word2vec model.

In [None]:
model1 = word2vec.Word2Vec.load("1point6M_tweet")
model2 = word2vec.Word2Vec.load("300features_40minwords_10context")

In [None]:
model1.wv.doesnt_match("man woman dog child kitchen".split())

In [None]:
model2.wv.doesnt_match("man woman dog child kitchen".split())

Great! They both work! OK, how about this?

In [None]:
model1.wv.doesnt_match("france england germany berlin".split())

In [None]:
model2.wv.doesnt_match("france england germany berlin".split())

We would expect Berlin, right? So that did not work out well for model 1 that comes from the tweet dataset, but works for model 2. 

Brainstorming: Why do you think this happened to be the case?



OK, let's continue. 

Can we also see the most similar words to a given word?

In [None]:
# This will print the most similar words present in the model
model1.wv.most_similar("man")

In [None]:
# This will print the most similar words present in the model
model2.wv.most_similar("man")

Or, what does king + woman - man equal?

In [None]:
model1.wv.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
model2.wv.most_similar(positive=['woman', 'king'], negative=['man'])

Yas queen!

In [None]:
# This will print the shape of each model's word-vector matrix;
# the first number is the number of words in the vocabulary
print(model1.wv.vectors.shape)
print(model2.wv.vectors.shape)

**Pop quiz:** What is 300 here?

Now that we have calculated the vectors for each word, it is time to calculate tweet-level vectors. To do so, we will take the average of the vectors of all words that appear in a given tweet.

In [None]:
# Function to average all word vectors in a tweet
def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    # Converting index2word, which is a list, to a set for faster lookups
    index2word_set = set(model.wv.index2word)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            # Look the vector up through model.wv, as elsewhere in this notebook
            featureVec = np.add(featureVec, model.wv[word])
    # Dividing the result by number of words to get average
    # (if no word is in the vocabulary, nwords is 0 and the result is NaN)
    featureVec = np.divide(featureVec, nwords)
    return featureVec

In [None]:
# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs
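
To make the averaging step concrete, here is a small sketch that uses a plain dictionary in place of a trained model. The three-dimensional `TOY_EMBEDDINGS` and the function `average_vector` are hypothetical. Unlike the function above, which divides by the number of known words and therefore yields NaN when none of the words is in the vocabulary, this sketch returns a zero vector in that case.

```python
# Hypothetical 3-D embeddings; a trained model would supply real vectors.
TOY_EMBEDDINGS = {
    "good":  [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_vector(words, embeddings, num_features):
    """Average the vectors of the words that have an embedding."""
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        if word in embeddings:
            n_known += 1
            total = [t + x for t, x in zip(total, embeddings[word])]
    if n_known == 0:
        return total  # no known words: return the zero vector
    return [t / n_known for t in total]

known = average_vector(["good", "movie", "zzz"], TOY_EMBEDDINGS, 3)
unknown = average_vector(["zzz"], TOY_EMBEDDINGS, 3)
print(known)    # [2.0, 1.0, 1.0]
print(unknown)  # [0.0, 0.0, 0.0]
```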

Are we all convinced that the review dataset is doing a better job by now?

No?

OK, let's assume we did! :)

I will use the review dataset (train2) and the word2vec model created out of that dataset (model2) only, from now on.

In [None]:
# Calculating average feature vector for training set
clean_train_reviews = []
for review in train2['review']:
    clean_train_reviews.append(review_wordlist(review, remove_stopwords=True))
    
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model2, num_features)

In [None]:
# Calculating average feature vectors for test set     
clean_test_reviews = []
for review in test["text"]:
    clean_test_reviews.append(review_wordlist(review,remove_stopwords=True))
    
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model2, num_features)

In [None]:
# Fitting a random forest classifier to the training data
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)

Ideally, I would run the code below:

```
print("Fitting random forest to training data....")
forest = forest.fit(trainDataVecs, train2["sentiment"])
```

However, because of time constraints, let's load a forest that I fitted earlier into the variable "forest".

In [None]:
import pickle
with open(r'forest.out', "rb") as input_file:
    forest = pickle.load(input_file)
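
A file like this can be produced by writing a fitted object to disk with `pickle.dump`. The sketch below shows the full save-and-load round trip, assuming a plain dictionary as a stand-in for the fitted classifier and a temporary directory for the file; how the original forest.out was actually created is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted classifier; any picklable Python object works.
fitted_model = {"n_estimators": 100, "note": "placeholder, not a real forest"}

# Write to a temporary directory so the sketch leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "forest.out")

with open(path, "wb") as output_file:
    pickle.dump(fitted_model, output_file)

with open(path, "rb") as input_file:
    restored = pickle.load(input_file)

print(restored == fitted_model)  # True
```

Only unpickle files that come from a source you trust, because loading a pickle can execute arbitrary code.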

In [None]:
np.where(np.isnan(testDataVecs))

In [None]:
testDataVecs = np.nan_to_num(testDataVecs)

In [None]:
# Predicting the sentiment values for test data and saving the results in a csv file 
result = forest.predict(testDataVecs)


In [None]:
output = pd.DataFrame(data={"id":test["id"],"review":test["text"], "sentiment":result})


In [None]:
output.groupby('sentiment').count()

In [None]:
temp1 = output[output['sentiment'] == 0]

In [None]:
temp1 = temp1.sample(5, axis=0)

In [None]:
temp2 = output[output['sentiment'] == 1]

In [None]:
temp2 = temp2.sample(5, axis=0)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
output_small = pd.concat([temp1, temp2])
del temp1,temp2

In [None]:
output_small = list(output_small['review'])

In [None]:
output_small

In [None]:
output.to_csv( "output.csv", index=False)

### References
https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/get-and-use-twitter-data-in-python/

https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec/notebook
    
https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/

https://towardsdatascience.com/stemming-lemmatization-what-ba782b7c0bd8

https://planspace.org/20150607-textblob_sentiment/

https://github.com/sloria/TextBlob/blob/eb08c120d364e908646731d60b4e4c6c1712ff63/textblob/en/en-sentiment.xml

https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./
