# Script to perform simple count analysis of hashtags and @ mentions


## Research Question: *Can a sentiment analysis of tweets addressed to the major candidates predict an election result in a state?*

*H0: There **is NO** relationship between the sentiment of tweets addressed to the major candidates and the state outcome of the election.*

*H1: There **is** a relationship between the sentiment of tweets addressed to the major candidates and the state outcome of the election.*

## 1. Data inputting

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import collections
from collections import Counter
from textblob import TextBlob 
import timeit as time

In [2]:
# Import csv
tweets = pd.read_csv("tweets_per_user.csv",header=None)

In [3]:
# Name DataFrame columns
tweets.columns = ["user","tweet"]
tweets.head()

   user                                              tweet
0   418  ['@tristanwalker @realDonaldTrump I appreciate...
1   521  ['@realDonaldTrump where can I buy one of thes...
2   747  ["@cindylwarner1 in all fairness, we've always...
3   997  ['@BaldGuyGreeting @realDonaldTrump Gosh I tho...
4  1135  ["Unless your friend is @realDonaldTrump, I ha...


In [4]:
# Check the number of tweets
tweet_column = tweets["tweet"]
print("Number of tweets:",len(tweet_column))

Number of tweets: 86421


## 2. Hashtag metrics

We analyse the count of hashtags to see which of the two candidates is the most trending candidate and whether the mentions are mostly positive or negative.

In [5]:
# Create a list of tweet strings, one entry per user
tweet_list = tweet_column.tolist()

# Create a list of lists with hashtags (the pattern is compiled once, outside the loop)
pat = re.compile(r"#(\w+)")
hashtags = [pat.findall(tweet) for tweet in tweet_list]

# Check that the number of list elements is the same as the number of users
print("Number of elements in list:",len(hashtags))

# Remove users who do not use hashtags
hashtags = [x for x in hashtags if x != []]
print("Number of hashtag users:",len(hashtags))

# Flatten the list of lists and see how many total hashtags
hashtags_flat_list = [item for sublist in hashtags for item in sublist]
print("Number of total hashtags:",len(hashtags_flat_list))

# Lowercase the hashtags so there is no difference anymore (Trump=trump)
hashtags_flat_list_lower = [x.lower() for x in hashtags_flat_list]

# Add a hashtag symbol to the start of every word
hashtags_flat_list_lower = ["#" + hashtag for hashtag in hashtags_flat_list_lower]

Number of elements in list: 86421
Number of hashtag users: 37648
Number of total hashtags: 558949
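
The extraction steps in the cell above can be tried on a handful of made-up tweets. The sketch below is a minimal illustration, assuming each entry is a plain string; the `sample_tweets` list and the `extract_hashtags` helper are hypothetical and not part of the notebook.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(tweets):
    """Return a flat list of lowercased hashtags (with leading '#') found in tweets."""
    found = []
    for tweet in tweets:
        found.extend("#" + tag.lower() for tag in HASHTAG_PATTERN.findall(tweet))
    return found

# Made-up tweets for illustration only
sample_tweets = [
    "Big rally tonight #Trump #MAGA",
    "No hashtags in this one",
    "#trump again and #ImWithHer",
]

hashtag_counts = Counter(extract_hashtags(sample_tweets))
print(hashtag_counts.most_common(2))
```

Lowercasing before counting is what merges `#Trump` and `#trump` into a single entry.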


The most popular hashtag was **#trump**, a general hashtag with no positive or negative connotation.

**#trump** was more popular than **#hillary** and **#hillaryclinton** combined.

It is interesting to note that although **#nevertrump** is the second most used hashtag, **#maga** is still more popular than **#imwithher**, which supports the claim that Trump was the most trending candidate.
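
These comparisons can be checked with a few lines of arithmetic. The sketch below copies the counts by hand from the ranked hashtag table printed in section 4; the `counts` dictionary is only an illustration and is not produced by the notebook.

```python
# Counts copied from the ranked hashtag table in section 4
counts = {
    "#trump": 57406,
    "#nevertrump": 21275,
    "#maga": 18935,
    "#imwithher": 17932,
    "#hillary": 13835,
    "#hillaryclinton": 13430,
}

clinton_general = counts["#hillary"] + counts["#hillaryclinton"]
maga_lead = counts["#maga"] - counts["#imwithher"]

print("#hillary + #hillaryclinton:", clinton_general)
print("#trump exceeds the combined count:", counts["#trump"] > clinton_general)
print("#maga minus #imwithher:", maga_lead)
```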


## 3. @ mentions metrics

We analyse the count of @ mentions to see which of the two candidates is the most trending candidate and whether the mentions are mostly positive or negative.

In [19]:
# Create a list of lists with @ mentions (the pattern is compiled once, outside the loop)
pat = re.compile(r"@(\w+)")
ats = [pat.findall(tweet) for tweet in tweet_list]

# Check that the number of list elements is the same as the number of users
# 86421 tweets when ranked by user
print("Number of elements in list:",len(ats))

# Remove users who do not use @ mentions
ats = [x for x in ats if x != []]
print("Number of @ mentions users:",len(ats))

# Flatten the list of lists and see how many total @
ats_flat_list = [item for sublist in ats for item in sublist]
print("Number of total @ mentions:",len(ats_flat_list))

# Lowercase the @ so there is no difference anymore (@realDonaldTrump = @realdonaldtrump )
ats_flat_list_lower = [x.lower() for x in ats_flat_list]

Number of elements in list: 86421
Number of @ mentions users: 68785
Number of total @ mentions: 1131258
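
The printed totals for hashtags and @ mentions can be turned into usage rates. This is a rough sketch that assumes each of the 86421 rows corresponds to one user, as the comments in the cells suggest; the `summarise` helper is hypothetical.

```python
# Totals copied from the printed output of the two extraction cells
total_users = 86421
hashtag_users, total_hashtags = 37648, 558949
mention_users, total_mentions = 68785, 1131258

def summarise(label, users, total):
    """Print the share of users using a feature and the average uses per such user."""
    share = users / total_users
    average = total / users
    print(f"{label}: {share:.1%} of users, {average:.2f} per user who used one")
    return share, average

hashtag_share, hashtag_average = summarise("hashtags", hashtag_users, total_hashtags)
mention_share, mention_average = summarise("@ mentions", mention_users, total_mentions)
```

Under that assumption, @ mentions are used by a noticeably larger share of users than hashtags.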


In [20]:
# Count and rank the top 20 most popular @ mentions
ats_counter = Counter(ats_flat_list_lower)
ats_top_20 = ats_counter.most_common(20)

In [23]:
# Create a dataframe of top 20 @ mentions
ats_df = pd.DataFrame(ats_top_20)
ats_df.columns=["@ mentions","Count"]

The most popular @ mention was **@realdonaldtrump**, which refers to Trump’s Twitter account. **@realdonaldtrump** was mentioned twice as often as **@hillaryclinton**, which again supports the claim that Trump attracted more attention.

The pro-Trump news channel **@foxnews** was mentioned more than the anti-Trump media source **@cnn**.

Using simple metrics such as the counts of hashtags and @ mentions, we can conclude that Trump was indeed the most talked-about candidate.

## 4. Hashtag Cleaning

We need to find a way to separate hashtags that contain multiple words, e.g. "nevertrump" becomes "never trump".

In [6]:
# Count and rank all hashtags by number of mentions
hashtags_counter = Counter(hashtags_flat_list_lower)
hashtag_rank = hashtags_counter.most_common()

In [7]:
# Create a dataframe and a list of ranked hashtags.
hashtag_df = pd.DataFrame(hashtag_rank)
hashtag_df.columns=["Hashtag","Count"]
print(hashtag_df.head(10))
print()
print("length of DataFrame:",len(hashtag_df))
hashtag_raw_count = hashtag_df["Count"].tolist()

           Hashtag  Count
0           #trump  57406
1      #nevertrump  21275
2            #maga  18935
3       #imwithher  17932
4    #neverhillary  14662
5         #hillary  13835
6  #hillaryclinton  13430
7    #trumppence16  13362
8     #donaldtrump  12344
9  #crookedhillary  11414

length of DataFrame: 49829


In [8]:
# These are the functions used to separate the words in a hashtag.

def initialize_words():
    """
    Load the word list from a file of common English words, one word per line.
    """
    with open('wordlist.txt') as f:
        return [word.rstrip('\n') for word in f]


def parse_sentence(sentence, wordlist):
    """
    Parse a sentence term by term.
    If a term starts with a hashtag, split it into words with parse_tag, else keep the term as it is.
    """
    new_sentence = ""
    terms = sentence.split(' ')
    for term in terms:
        # startswith() is safe for empty terms, which occur when the sentence contains double spaces
        if term.startswith('#'):
            new_sentence += parse_tag(term, wordlist)
        else:
            new_sentence += term
        new_sentence += " "

    return new_sentence


def parse_tag(term, wordlist):
    """
    Remove the hashtag, split by dash, then split each part into known words.
    Stop when the remaining tag is exactly one word or when no known word matches.
    """
    words = []
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)
        while word is not None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word):
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)

    return " ".join(words)


def find_word(token, wordlist):
    """
    Return the longest prefix of token that appears in wordlist, or None if no prefix matches.
    """
    for i in range(len(token), 0, -1):
        if token[:i] in wordlist:
            return token[:i]

    return None

In [9]:
# Separate the hashtags so "nevertrump" becomes "never trump"
# This is an extremely inefficient way of separating the hashtags; if you try to do a single for-loop the script never ends.
# Please feel free to adapt this for better efficiency

# Start timer to see program run time (the step times printed below are cumulative)
start = time.default_timer()
wordlist = initialize_words()
list_10000 = []
for i in range(0,10000):
    sentence = hashtag_df["Hashtag"][i]
    clean_hashtag = parse_sentence(sentence, wordlist)
    list_10000.append(clean_hashtag)

stop = time.default_timer()
print('Time to run step 1: ', stop - start)

list_20000 = []
for i in range(10000,20000):
    sentence = hashtag_df["Hashtag"][i]
    clean_hashtag = parse_sentence(sentence, wordlist)
    list_20000.append(clean_hashtag)

stop = time.default_timer()
print('Time to run step 2: ', stop - start)

list_30000 = []
for i in range(20000,30000):
    sentence = hashtag_df["Hashtag"][i]
    clean_hashtag = parse_sentence(sentence, wordlist)
    list_30000.append(clean_hashtag)

stop = time.default_timer()
print('Time to run step 3: ', stop - start)    

list_40000 = []
for i in range(30000,40000):
    sentence = hashtag_df["Hashtag"][i]
    clean_hashtag = parse_sentence(sentence, wordlist)
    list_40000.append(clean_hashtag)

stop = time.default_timer()
print('Time to run step 4: ', stop - start)   

list_last = []
for i in range(40000,49829):
    sentence = hashtag_df["Hashtag"][i]
    clean_hashtag = parse_sentence(sentence, wordlist)
    list_last.append(clean_hashtag)

# Calculate program run time
stop = time.default_timer()

print('Time to run: ', stop - start)  

Time to run step 1:  115.0454066
Time to run step 2:  268.83435349999996
Time to run step 3:  467.1204337
Time to run step 4:  660.5423647
Time to run:  880.3862531000001
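
Much of the run time above is plausibly spent in `token[:i] in wordlist`, which scans a Python list from start to end on every test. A minimal sketch of one possible speed-up, assuming the word list fits in memory: storing the words in a `set` turns each membership test into a hash lookup, so a single loop over all hashtags may become practical. The tiny `toy_words` set and the `split_hashtag` helper are hypothetical stand-ins for `wordlist.txt` and `parse_tag`, and this has not been timed on the real data.

```python
def find_word(token, words):
    """Return the longest prefix of token found in words, or None."""
    for end in range(len(token), 0, -1):
        if token[:end] in words:
            return token[:end]
    return None

def split_hashtag(hashtag, words):
    """Greedily split a hashtag into known words, stopping at the first unknown remainder."""
    tag = hashtag.lstrip("#")
    parts = []
    while tag:
        word = find_word(tag, words)
        if word is None:
            break
        parts.append(word)
        tag = tag[len(word):]
    return " ".join(parts)

# Toy stand-in for the contents of wordlist.txt; a set gives hash-based lookups
toy_words = {"never", "trump", "crooked", "hillary", "im", "with", "her"}

for hashtag in ["#nevertrump", "#imwithher", "#crookedhillary"]:
    print(hashtag, "->", split_hashtag(hashtag, toy_words))
```

In the notebook itself, the equivalent change would be to wrap the result of `initialize_words()` in `set(...)` before passing it to `parse_sentence`; the splitting logic stays the same.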


In [10]:
# Create big dataframe with all hashtag lists
df_1 = pd.DataFrame(list_10000)
df_2 = pd.DataFrame(list_20000)
df_3 = pd.DataFrame(list_30000)
df_4 = pd.DataFrame(list_40000)
df_5 = pd.DataFrame(list_last)
df_6 = pd.concat([df_1, df_2, df_3, df_4, df_5])

In [11]:
# Create list of all hashtags list
hashtag_total_list = df_6[0].tolist()

## 5. Hashtag Sentiment Analysis using English Words

In [12]:
def sentiment(text):
    """
    Return the sentiment score of a hashtag that has been split into words.
    Each word is looked up in a file of positive words and a file of negative words.
    The score is the number of positive words minus the number of negative words.
    For example: "trump bad crazy" returns a score of -2 because it contains 2 negative words.
    """
    # split() without arguments ignores the trailing space added by parse_sentence
    raw_words = text.split()
    with open("positive-words.txt") as file_positive:
        positive_words = [line.strip() for line in file_positive]

    with open("negative-words.txt") as file_negative:
        negative_words = [line.strip() for line in file_negative]

    positive_score = len([word for word in raw_words if word in positive_words])
    negative_score = len([word for word in raw_words if word in negative_words])
    total_score = positive_score - negative_score

    return total_score

In [14]:
# Pass the list of hashtags in the sentiment score function and create a score list
sentiment_list = []
for i in range(0,len(hashtag_total_list)):
    text = hashtag_total_list[i]
    sentiment_score = sentiment(text)
    sentiment_list.append(sentiment_score)
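
The `sentiment` function re-reads both lexicon files for every hashtag. A minimal sketch of the same scoring rule with the lexicons loaded once and passed in as sets; the small word sets here are made-up stand-ins for `positive-words.txt` and `negative-words.txt`, and `sentiment_score` is a hypothetical name.

```python
def sentiment_score(text, positive_words, negative_words):
    """Return (number of positive words) minus (number of negative words) in text."""
    words = text.split()
    positive = sum(word in positive_words for word in words)
    negative = sum(word in negative_words for word in words)
    return positive - negative

# Made-up lexicons for illustration; the notebook reads these from text files
positive_words = {"great", "win"}
negative_words = {"never", "crooked", "bad", "crazy"}

for hashtag in ["never trump", "crooked hillary", "trump bad crazy", "maga"]:
    print(hashtag, sentiment_score(hashtag, positive_words, negative_words))
```

Because the score is a plain word count, a hashtag made only of words missing from both lexicons scores 0, which is read as neutral.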

In [17]:
# Create final DataFrame and save to excel
final_df = pd.DataFrame(columns = ["hashtag", "count", "sentiment"])
final_df["hashtag"]=hashtag_total_list
final_df["count"]=hashtag_raw_count
final_df["sentiment"]=sentiment_list
final_df.head(10)

           hashtag  count  sentiment
0            trump  57406          0
1      never trump  21275         -1
2             maga  18935          0
3      im with her  17932          0
4    never hillary  14662         -1
5          hillary  13835          0
6  hillary clinton  13430          0
7   trump pence 16  13362          0
8     donald trump  12344          0
9  crooked hillary  11414         -1


In [24]:
final_df.to_excel("Hashtag_df.xlsx")