This notebook mines the Twitter feed of a given handle, does some processing on the collected tweets, and saves the data in a csv file. It uses a separate Python module I wrote to interact with Twitter via tweepy (TweetMining) and another module I wrote that scores the sentiment of a word based on how often it appears in positive vs. negative tweets (Sentiment_LUT).

In [1]:
import re
import tweepy #for interacting with Twitter
from tweepy import OAuthHandler #for interacting with Twitter
import pandas as pd #for saving data
import TweetMining #self made module for mining tweets
import numpy as np #for maths
import Sentiment_LUT #another self made module for creating a sentiment look up table.

Several helper functions will make this process easier.

In [2]:
#The GloVe table uses the pre-trained vectors from https://nlp.stanford.edu/projects/glove/ to convert words to numeric 1-D vectors.
#This function creates a look-up table for taking a word and quickly finding its vector representation.
#dimension: can be 25, 50, 200, etc. sets size of word vector representation
def GloVe_table(dimension):
    #the with block closes the file even if reading fails
    with open('NLP_files/glove.twitter.27B.%sd.txt'%(dimension), 'r',encoding="utf8") as file:
        contents_lines=file.read().split('\n')
    #skip empty lines (such as the one after the final newline) so no all-NaN row is created
    GloVe=pd.DataFrame([line.split(' ') for line in contents_lines if line])
    GloVe.set_index(0,inplace=True) #set the word as the index for faster look-up
    return GloVe.astype(float)

#This uses the Sentiment_LUT module I made previously to judge how often a word shows up in pos tweets vs neg tweets.
#If the word shows up more often in one or the other, it will be useful to include in the analysis.
#The sentiment_strength variable sets how large the imbalance in appearances between pos and neg tweets must be
#for a word to be used in the analysis.
def sent_words(tweet,LUT,sentiment_strength):
    tweet_words=pd.Series(tweet.split(' '))
    #If the look-up fails (e.g. there are no sentiment words in the tweet),
    #return an empty string.
    try:
        a=' '.join(tweet_words[(np.abs(LUT.loc[tweet_words,'Score'])>sentiment_strength).values])
    except Exception:
        a=''
    return a

#Uses the GloVe table to create a vectorized representation of each tweet's sentiment words.
#The names IMDB and LUT25 are left over from previous work with different training data and GloVe tables:
#IMDB is the DataFrame of tweets (with a 'sentiment words' column) and LUT25 is the GloVe look-up table.
def vectorize(IMDB,dimension,LUT25):
    #look up one word; words missing from the table map to a zero vector
    def vector(word):
        try:
            return LUT25.loc[word,:]
        except KeyError:
            return np.zeros(len(LUT25.iloc[0,:]))
    #average the word vectors of a list of words
    def vector2(word_list):
        words=pd.Series(word_list)
        a=words.apply(vector).values.mean(axis=0)
        return list(a)
    #create the vector_0 ... vector_(dimension-1) columns, then fill them in
    for x in range(0,dimension):
        IMDB.loc[:,'vector_%s'%(x)]=0
    IMDB.loc[:,IMDB.columns[IMDB.columns.str.contains('vector')]]=IMDB.loc[:,'sentiment words'].str.split(' ').apply(vector2).apply(pd.Series).values
    return IMDB
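
To make the averaging step in vectorize concrete, here is a minimal sketch on a toy table. The words, the 3-dimensional values, and the helper name tweet_vector are all made up for illustration; it is not the function above, only the same idea (average the word vectors, treat unknown words as zero vectors) in isolation.

```python
import numpy as np
import pandas as pd

# Toy 3-dimensional "GloVe" table; words and values are invented for this example.
toy_glove = pd.DataFrame(
    [[1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]],
    index=["good", "bad"],
)

def tweet_vector(words, table):
    """Average the vectors of the given words; unknown words count as zero vectors."""
    dim = table.shape[1]
    rows = [table.loc[w].to_numpy() if w in table.index else np.zeros(dim)
            for w in words]
    if not rows:
        return np.zeros(dim)
    return np.mean(rows, axis=0)

vec = tweet_vector(["good", "bad", "unknownword"], toy_glove)
print(vec)
```

Note that an unknown word still counts toward the average, so it pulls the tweet vector toward zero. That matches the zero-vector fallback used in vectorize.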

I have a csv file of the Twitter handles of every senator currently in the Senate. Instead of manually changing the handle each time I want to look at a senator's tweets, I'll write a for loop.

First, load the handles saved in that csv file:

In [3]:
senators=pd.read_csv('Senators.csv')

In [4]:
limit=10 #how many times to ask Twitter to send the next batch of 200 recent tweets
#handle="@realDonaldTrump" #uncomment if only looking at a single handle, and replace the handle shown here
#handle='@Lin_Manuel'
for x in range(0,100): #comment out if only looking at a single handle
#for x in range(0,1): #uncomment for single handle
    y=x
    handle=senators.loc[x,'Handle'] #comment out for single handle case
    print(x)
    print(handle)
    try:
        #reuse tweets saved by an earlier run, if there are any
        User_tweets=pd.read_csv('User_tweets_%s.csv'%(handle))
        print('prev record found')
    except FileNotFoundError:
        User_tweets=TweetMining.mine(handle,limit,True)
        print('new record created')
print(y) #index of the last handle processed

0
@SenShelby
prev record found
1
@lutherstrange
prev record found
2
@lisamurkowski
prev record found
3
@SenDanSullivan
prev record found
4
@SenJohnMcCain
prev record found
5
@JeffFlake
prev record found
6
@SenTomCotton
prev record found
7
@JohnBoozman
prev record found
8
@SenFeinstein
prev record found
9
@SenKamalaHarris
prev record found
10
@SenBennetCO
prev record found
11
@SenCoryGardner
prev record found
12
@ChrisMurphyCT
prev record found
13
@SenBlumenthal
prev record found
14
@SenatorCarper
prev record found
15
@ChrisCoons
prev record found
16
@SenBillNelson
prev record found
17
@marcorubio
prev record found
18
@sendavidperdue
prev record found
19
@SenatorIsakson
prev record found
20
@brianschatz
prev record found
21
@maziehirono
prev record found
22
@MikeCrapo
prev record found
23
@SenatorRisch
prev record found
24
@SenDuckworth
prev record found
25
@SenatorDurbin
prev record found
26
@SenDonnelly
prev record found
27
@SenToddYoung
prev record found
28
@ChuckGrassley
prev record
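
The loop above follows a load-or-build pattern: reuse a saved csv if it exists, otherwise do the expensive work. A minimal sketch of that pattern as a reusable helper is below. The names load_or_build and fake_mine are hypothetical, fake_mine only stands in for TweetMining.mine, and the sketch assumes the caller is responsible for saving the csv (whether TweetMining.mine saves the file itself is not shown in this notebook).

```python
import os
import tempfile

import pandas as pd

def load_or_build(path, build):
    """Return the DataFrame cached at `path`, or call `build()` and cache its result."""
    if os.path.exists(path):
        return pd.read_csv(path), 'prev record found'
    frame = build()
    frame.to_csv(path, index=False)
    return frame, 'new record created'

def fake_mine():
    # Hypothetical stand-in for TweetMining.mine(handle, limit, True).
    return pd.DataFrame({'tweets': ['first tweet', 'second tweet']})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'User_tweets_@example.csv')
    first, status_first = load_or_build(path, fake_mine)    # nothing cached yet
    second, status_second = load_or_build(path, fake_mine)  # served from the csv

print(status_first)
print(status_second)
```

Checking for the file explicitly, rather than catching every exception, means a genuine error (a malformed csv, a failed request) is reported instead of being mistaken for a missing record.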

Now to do some pre-processing:

First, choose the dimension of the word vectors and create a look-up table for that dimension:

In [5]:
dimension=25 #size of each word vector
%time GloVe=GloVe_table(dimension)

Wall time: 24.6 s
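
For reference, each line of a GloVe text file holds a word followed by its vector components, separated by single spaces. The sketch below parses that layout from an in-memory sample instead of the real file. The sample lines and the name parse_glove are invented for illustration, and reading line by line is just one alternative to reading the whole file at once as GloVe_table does.

```python
import io

import pandas as pd

# Two made-up lines in the "word v1 v2 ... vN" layout, here with N=3.
sample = "hello 0.1 0.2 0.3\nworld -1.0 0.5 2.0\n"

def parse_glove(handle):
    """Build a word-indexed DataFrame of floats from an open GloVe-style text file."""
    words, vectors = [], []
    for line in handle:
        parts = line.rstrip('\n').split(' ')
        if len(parts) < 2:  # skip blank lines
            continue
        words.append(parts[0])
        vectors.append([float(v) for v in parts[1:]])
    return pd.DataFrame(vectors, index=words)

table = parse_glove(io.StringIO(sample))
print(table.loc['world'].tolist())
```

With the real file, the same function could be called on the handle returned by open(..., encoding="utf8").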


Next, load the sentiment look-up table, or build it if it has not been saved yet:

In [6]:
n=50000 #argument for Sentiment_LUT.LUT; also part of the saved file name
try:
    #reuse a previously saved table if one exists
    LUT_sentiment_words=pd.read_csv('LUT_sentiment_words_%s.csv'%(n))
except FileNotFoundError:
    #otherwise build the table and save it for next time
    LUT_sentiment_words=Sentiment_LUT.LUT(n)
    LUT_sentiment_words.to_csv('LUT_sentiment_words_%s.csv'%(n),index=False)
LUT_sentiment_words.set_index('words',inplace=True)
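
To show how this table is used for filtering, here is a minimal sketch with a toy table that has the same 'words' index and 'Score' column. The scores and the name strong_words are made up. It uses reindex instead of the .loc look-up in sent_words, as one possible variant, so that words missing from the table become NaN and fail the threshold test instead of raising an error.

```python
import numpy as np
import pandas as pd

# Toy sentiment table; the scores are invented for this example.
toy_lut = pd.DataFrame(
    {'words': ['great', 'awful', 'the'], 'Score': [0.8, -0.6, 0.05]}
).set_index('words')

def strong_words(tweet, lut, threshold):
    """Keep only the words whose absolute score exceeds the threshold."""
    words = pd.Series(tweet.split(' '))
    scores = lut['Score'].reindex(words)  # NaN for words not in the table
    keep = (scores.abs() > threshold).to_numpy()
    return ' '.join(words[keep])

result = strong_words('the food was great not awful', toy_lut, 0.35)
print(result)
```

A tweet with no strongly scored words yields an empty string, the same fallback value that sent_words returns.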

Now go through the collection of tweets, pick out the sentiment words, and convert them to word vectors:

In [8]:
#for name in range(0,len(senators)): #use this line instead to loop over all senators
for name in range(0,1): #single-handle case
    #handle=senators.loc[name,"Handle"] #uncomment when looping over all senators
    handle="@realDonaldTrump" #single-handle case; comment out when looping over all senators
    print(handle)
    #dimension=25
    sentiment=.35 #sentiment_strength threshold passed to sent_words
    try:
        #reuse the processed tweets from an earlier run, if they exist
        User_tweets=pd.read_csv('User_tweets_%s_sentiment_%s_vector_%s.csv'%(handle,sentiment,dimension))
        User_tweets.fillna(0,inplace=True)
    except FileNotFoundError:
        try:
            User_tweets=pd.read_csv('User_tweets_%s.csv'%(handle))
        except FileNotFoundError:
            print('no records for that handle')
            continue
        User_tweets.dropna(inplace=True)
        User_tweets.reset_index(drop=True,inplace=True)
        print('sentiment thresh:')
        %time User_tweets['sentiment words']=User_tweets.loc[:,'tweets'].apply(lambda tweet: sent_words(tweet,LUT_sentiment_words,sentiment))
        #GloVe=GloVe_table(dimension) #unneeded if GloVe defined above (run time ~30 sec, so good to not have in loop)
        for x in range(0,dimension):
            User_tweets.loc[:,'vector_%s'%(x)]=0

        #finding a vector form of each tweet
        #vec_names and the commented-out call below are from an older version of vectorize
        vec_names=User_tweets.columns[User_tweets.columns.str.contains('vector')]
        print('vectorize')
        %time User_tweets=vectorize(User_tweets,dimension,GloVe)
        #%time User_tweets.loc[:,vec_names]=User_tweets.loc[:,'sentiment words'].apply(lambda words: vectorize(words,GloVe,dimension)).values
        User_tweets.fillna(0,inplace=True)
        #and save the sentiment and vectorized tweets
        User_tweets.to_csv('User_tweets_%s_sentiment_%s_vector_%s.csv'%(handle,sentiment,dimension),index=False)

@realDonaldTrump
sentiment thresh:
Wall time: 21.7 s
vectorize
Wall time: 4.44 s
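
The saved frame has one vector_0 ... vector_(dimension-1) column per component. As a rough sketch of how those columns might be pulled back out as a NumPy matrix for later work, here is a toy frame with the same column layout. The tweets and values are invented, and dimension is set to 3 only to keep the example small.

```python
import numpy as np
import pandas as pd

dimension = 3  # toy value; the notebook uses 25

# Toy frame with the same column layout as the saved csv.
frame = pd.DataFrame({
    'tweets': ['a b', 'c d'],
    'sentiment words': ['a', 'c d'],
    'vector_0': [0.1, 0.4],
    'vector_1': [0.2, 0.5],
    'vector_2': [0.3, 0.6],
})

# Build the column names explicitly so their order is guaranteed.
vec_names = ['vector_%s' % i for i in range(dimension)]
X = frame.loc[:, vec_names].to_numpy()
print(X.shape)
```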
