# Exploring election tweets from the Tweet the People project

I want to build a Naive Bayes classifier that predicts whether a given tweet is about a candidate on the Republican or the Democratic presidential ticket.

(For the purposes of this model, I'll assume that a tweet is "about a Republican" if it originally contained `donaldtrump`, `donald trump`, `mikepence`, or `mike pence`, and likewise for the Democratic candidates. I'll strip these terms out before building the model; otherwise its predictions would rest on nothing but the presence of these terms.)
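
As a rough sketch of this labelling rule, the snippet below assigns a ticket from the presence of candidate terms and then strips those terms. The Democratic term list, the names `label_ticket` and `strip_candidate_terms`, and the choice to leave tweets that mention both tickets unlabelled are assumptions made for illustration, not the project's actual code.

```python
import re

# Republican terms come from the text above; the Democratic terms are the
# analogous forms and are an assumption.
TERMS = {
    'republican': ['donaldtrump', 'donald trump', 'mikepence', 'mike pence'],
    'democrat': ['joebiden', 'joe biden', 'kamalaharris', 'kamala harris'],
}
PATTERNS = {
    ticket: re.compile('|'.join(re.escape(t) for t in terms), re.IGNORECASE)
    for ticket, terms in TERMS.items()
}

def label_ticket(text):
    '''Returns the single ticket mentioned in text, or None if none or both are mentioned'''
    found = [ticket for ticket, pattern in PATTERNS.items() if pattern.search(text)]
    return found[0] if len(found) == 1 else None

def strip_candidate_terms(text):
    '''Removes every candidate term so the model cannot rely on them'''
    for pattern in PATTERNS.values():
        text = pattern.sub('', text)
    # collapse the whitespace left behind by the removed terms
    return re.sub(r'\s+', ' ', text).strip()

example = 'Donald Trump held a rally today'
label = label_ticket(example)
stripped = strip_candidate_terms(example)
```

Labelling has to happen before stripping, since the stripped text no longer carries the signal the label is derived from.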

Before I build this model, I'd like to know:

- Is there a class imbalance that I need to take into account in my model?
- Some basic stats about the tweets:
    - Average number of words per tweet (and whether it differs by candidate)
    - Popular terminology (using TF-IDF)
- Bonus: Check distribution of sentiment in each ticket
- Bonus: I'll be stripping out hashtags, but I'm curious to know about the trends here.
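
The word-count question could be answered along the following lines. This is a minimal sketch on invented toy data, assuming a cleaned text column such as `clean_text` and counting whitespace-separated tokens as words; the `n_words` column name is hypothetical.

```python
import pandas as pd

# Toy data standing in for the real tweets (invented for illustration)
toy = pd.DataFrame({
    'politician': ['joebiden', 'joebiden', 'donaldtrump'],
    'clean_text': ['vote early', 'go and vote today', 'huge rally tonight in ohio'],
})

# Word count per tweet, counting whitespace-separated tokens
toy['n_words'] = toy['clean_text'].str.split().str.len()

# Mean word count per candidate
mean_words = toy.groupby('politician')['n_words'].mean()
```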

## Step 1: reading in data, basic data cleaning

In [127]:
import re
import pandas as pd
import numpy as np

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

In [109]:
nlp = English()

In [110]:
# regex pattern -> replacement, applied in insertion order by strip_out_weird_symbols
dict_replace = {"-": " ", r"\.": "", r"\?": "'", r"\s+": " ", "&amp;": "and"}

In [98]:
tweets = pd.read_csv('/Users/laraehrenhofer/Documents/Coding_Projects/git_repos/tweet-the-people-legacy/data/tweet_pg.csv')


In [99]:
tweets = tweets[['text', 'politician', 'date', 'sentiment', 'ticket']]

Some tweets were wrongly inserted with the date as the politician -- I'm going to delete these data points from the sample. (This removes about 15 tweets.)

In [118]:
politicians = ['kamalaharris', 'donaldtrump', 'joebiden', 'mikepence']

In [126]:
def is_politician(politician):
    '''
    Returns boolean indicating whether input is a member of politicians list
    '''
    return politician in politicians

In [137]:
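def check_politician(politician):
    '''
    Returns the input unchanged if it is a known politician, otherwise NaN
    '''
    if not is_politician(politician):
        politician = np.nan  # np.NaN was removed in NumPy 2.0
    return politician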

In [149]:
list1 = politicians + ['bugsbunny', 'snowwhite', 'superman', 'angelamerkel']
test = pd.DataFrame({'politician': list1})
test

|   | politician |
|---|---|
| 0 | kamalaharris |
| 1 | donaldtrump |
| 2 | joebiden |
| 3 | mikepence |
| 4 | bugsbunny |
| 5 | snowwhite |
| 6 | superman |
| 7 | angelamerkel |


In [150]:
test = test[test['politician'].apply(lambda x: check_politician(x))]

ValueError: cannot mask with array containing NA / NaN values
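
The `ValueError` above comes from using the output of `check_politician` as a row mask: it returns names and NaN rather than booleans. One alternative, assuming the goal is simply to keep rows whose `politician` value is in the `politicians` list, is to build the mask with `Series.isin`. This is a sketch on a toy frame, not the fix applied in this notebook.

```python
import pandas as pd

politicians = ['kamalaharris', 'donaldtrump', 'joebiden', 'mikepence']
test = pd.DataFrame({'politician': politicians + ['bugsbunny', '2020-11-03 00']})

# isin returns a boolean Series, which is a valid row mask
mask = test['politician'].isin(politicians)
filtered = test[mask].reset_index(drop=True)
```

Applying `is_politician` with `.apply` would produce the same boolean mask; `isin` is just the vectorised form.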

In [104]:
# tweets = tweets[tweets['politician'].apply(lambda x: check_politician(x))]

In [106]:
len(tweets)

316699

In [117]:
tweets['politician'].unique()

array(['kamalaharris', 'donaldtrump', 'joebiden', 'mikepence',
       '2020-11-03 00', '2020-11-03 01', '2020-11-03 02', '2020-11-03 03',
       '2020-11-03 06', '2020-11-03 13', '2020-11-03 14', '2020-11-03 16',
       '2020-11-03 17'], dtype=object)

In [20]:
def get_handles_hashtags(text):
    '''
    Returns separate lists of hashtags and user handles in the text
    '''
    # \B requires a non-word boundary before the symbol, so e.g. email addresses are skipped
    handles = re.findall(r'\B@\w+', text)
    hashtags = re.findall(r'\B#\w+', text)
    return handles, hashtags
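
To illustrate what the `\B` anchor does here, the following sketch runs the same two patterns on an invented tweet. The sample text and email address are made up for the example.

```python
import re

def get_handles_hashtags(text):
    '''Returns separate lists of user handles and hashtags in the text'''
    handles = re.findall(r'\B@\w+', text)
    hashtags = re.findall(r'\B#\w+', text)
    return handles, hashtags

# '@JoeBiden' and '#Election2020' follow a non-word position and are matched;
# the '@' inside the email address follows a word character and is not.
handles, hashtags = get_handles_hashtags(
    '@JoeBiden wins? mail info@example.com #Election2020'
)
```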

In [39]:
tester = tweets.head(50)

In [38]:
def strip_out_weird_symbols(text):
    '''
    Replaces special symbols that get garbled, e.g. replaces &amp; with "and" (as they are functionally identical)
    See dict_replace for replacements
    '''
    for pattern, replacement in dict_replace.items():
        text = re.sub(pattern, replacement, text)
    # drop stray backslashes, then collapse runs of whitespace
    text = re.sub(r'\\', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [44]:
def strip_out_handles_hashtags(text):
    '''
    Removes handles and hashtags
    '''
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    return text

In [90]:
def clean_text(text):
    '''
    - Handles odd symbols
    - Removes handles & hashtags from text
    - Tokenises, lowercases, and drops punctuation and whitespace tokens
    Returns the cleaned text and its length in tokens
    '''
    cleaned = strip_out_weird_symbols(text)
    cleaned = strip_out_handles_hashtags(cleaned)
    doc = nlp(cleaned)
    cleaned = [token.orth_.lower() for token in doc if not token.is_punct]
    cleaned = [item for item in cleaned if not re.search(r'\s+', item)]
    text_len = len(cleaned)
    cleaned = ' '.join(cleaned)
    return cleaned, text_len

In [24]:
tweets[['handles', 'hashtags']] = tweets.apply(lambda row: pd.Series(get_handles_hashtags(row['text'])), axis=1)

In [92]:
tweets[['clean_text', 'text_len']] = tweets.apply(lambda row: pd.Series(clean_text(row['text'])), axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
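
The message above is pandas' `SettingWithCopyWarning`. It typically appears when new columns are assigned to a frame that was itself selected from another frame, which is probably what happened here. One common way to avoid it, sketched on invented toy data, is to take an explicit `.copy()` when subsetting:

```python
import pandas as pd

raw = pd.DataFrame({
    'text': ['vote early', 'big rally tonight'],
    'politician': ['joebiden', 'donaldtrump'],
    'extra': [1, 2],
})

# .copy() makes the subset an independent frame, so adding columns is unambiguous
subset = raw[['text', 'politician']].copy()
subset['text_len'] = subset['text'].str.split().str.len()
```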


In [115]:
# save this version of tweets data for later

tweets.to_csv('./tweet_minimal.csv', index=False)

## Step 2: check class imbalance

Quick count & visualisation of how many tweets per candidate and per ticket.
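
Raw counts can also be expressed as proportions, which makes any imbalance easier to read. A minimal sketch on invented data (the `ticket` values here are placeholders, not necessarily those in the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'ticket': ['republican', 'democrat', 'republican', 'republican'],
})

# normalize=True returns each class's share of rows instead of its count
shares = toy['ticket'].value_counts(normalize=True)
```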

In [116]:
# count by candidate

cand = tweets.groupby(['politician']).count()
cand

politician,text,date,sentiment,ticket
2020-11-03 00,1,1,0,0
2020-11-03 01,1,1,0,0
2020-11-03 02,3,3,0,0
2020-11-03 03,2,2,0,0
2020-11-03 06,2,2,0,0
2020-11-03 13,1,1,0,0
2020-11-03 14,1,1,0,0
2020-11-03 16,1,1,0,0
2020-11-03 17,1,1,0,0
donaldtrump,95427,95427,95427,95427
