# Collecting Twitter Replies to Corporate Accounts

This script collects tweets directed at corporate Twitter accounts. The Twitter API offers no direct way to search for responses to promoted tweets, but we can find replies to corporate accounts known to publish promoted tweets, then check whether the original tweet that each reply was directed at was published via "Twitter Ads."
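As a rough illustration of that second step, the check could look like the sketch below. It assumes the original tweet is available as an object whose `source` attribute holds the name of the publishing client, and that promoted tweets carry a label containing "Twitter Ads"; the exact label text may differ. `FakeStatus` and `is_from_twitter_ads` are hypothetical names used only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for a tweet object; only `id` and `source`
# (the client that published the tweet) are used here.
FakeStatus = namedtuple("FakeStatus", ["id", "source"])


def is_from_twitter_ads(status):
    """Return True if the tweet's source label mentions Twitter Ads."""
    source = getattr(status, "source", "") or ""
    return "twitter ads" in source.lower()


originals = [
    FakeStatus(id=1, source="Twitter Ads"),
    FakeStatus(id=2, source="Twitter Web Client"),
]

# Keep only the IDs of original tweets that look promoted.
promoted_ids = [s.id for s in originals if is_from_twitter_ads(s)]
```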

To begin, we'll import required packages.

In [1]:
import tweepy
import pickle
import re
import time
import os

Next, use OAuth to authenticate this app in order to use the Twitter API.

In [2]:
# == OAuth Authentication ==
#
# This mode of authentication is the new preferred way
# of authenticating with Twitter.

# The consumer keys can be found on your application's Details
# page located at https://dev.twitter.com/apps (under "OAuth settings")
consumer_key=""
consumer_secret=""

# The access tokens can be found on your application's Details
# page located at https://dev.twitter.com/apps (located
# under "Your access token")
access_token=""
access_token_secret=""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
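
If you would rather not paste keys into the notebook, one option is to read them from environment variables. This is only a sketch: the variable names and the `load_credentials` helper are hypothetical and not part of the original script.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing credentials: %s" % ", ".join(missing))
    return dict((name, env[name]) for name in CREDENTIAL_VARS)
```

The returned values can then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in place of the empty strings in the cell above.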

To search for replies to a large number of corporate accounts, it is most efficient to bundle the accounts into sets and query the Twitter API for tweets directed at any account in a set. Below, I've bundled the fifty accounts I considered into five groups of ten.

In [1]:
corporate_names_1 =   ['Dodge',
                    'jackdaniels_us',
                    'jolly_rancher',
                    'syfy',
                    'wholefoods',
                    'HeinekenSoccer',
                    'Freeletics',
                    'benandjerrys',
                    'DiGiornoPizza',
                    'NinjaTrader']

corporate_names_2 = ['CoveredCA',
                    "Nautica",
                    "MountainDew",
                    "hulu",
                    "LiveNation",
                    "AmMadeMovie",
                    "adidasrunning",
                    "GenesisUSA",
                    "AmericanGodsSTZ",
                    "gsuite"]

corporate_names_3 = ['SamsungBizUSA',
                    "OnStar",
                    "NyQuilDayQuil",
                    "HARDER",
                    "JulianBakery",
                    "Total",
                    "DennysDiner",
                    "AtlanticNet",
                    "NortonOnline",
                    "AmericanExpress"]

corporate_names_4 = ['HP',
                    "sprint",
                    "pandoramusic",
                    "constitutioncenter",
                    "IBMsecurity",
                    "pizzahut",
                    "mesosphere",
                    "transformers",
                    "HPE",
                    "NBA"]

corporate_names_5 = ['McAfee',
                    "beatsbydre",
                    "chipotletweets",
                    "HUGOBOSS",
                    "MakersMark",
                    "DairyQueen",
                    "PruTalent",
                    "TwitterMktg",
                    "Prudential",
                    "xbox"]

corporate_accounts = [corporate_names_1,
                     corporate_names_2,
                     corporate_names_3,
                     corporate_names_4,
                     corporate_names_5]

number_of_groups = len(corporate_accounts)
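
The groups above were written out by hand. If the accounts were kept in one flat list instead, the same bundling could be sketched as follows; `chunk` is a hypothetical helper and the short list is only for illustration.

```python
def chunk(names, size):
    """Split a list of account names into consecutive groups of at most `size`."""
    return [names[i:i + size] for i in range(0, len(names), size)]


all_names = ['Dodge', 'syfy', 'hulu', 'HP', 'NBA']
groups = chunk(all_names, 2)
# groups == [['Dodge', 'syfy'], ['hulu', 'HP'], ['NBA']]
```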

Next, define two useful functions. The first will generate the string required by the Twitter API to search all replies to corporate accounts within a given group. The second will return the ID of the oldest tweet within a set of tweet objects.

In [2]:
def make_search_string(names):
    # Join one "to:<account>" term per name with the OR operator.
    return ' OR '.join(["to:%s" % i for i in names])

def get_old_min_id(tweets):
    # The smallest ID belongs to the oldest tweet.
    return min([tweet.id for tweet in tweets])
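
To make the two helpers concrete, here is a small self-contained run. The functions are restated so the snippet works on its own, and `FakeTweet` is a hypothetical stand-in for the tweet objects returned by the API.

```python
from collections import namedtuple

# Hypothetical stand-in for a tweet object; only `id` is needed here.
FakeTweet = namedtuple("FakeTweet", ["id"])


def make_search_string(names):
    return ' OR '.join("to:%s" % name for name in names)


def get_old_min_id(tweets):
    return min(tweet.id for tweet in tweets)


query = make_search_string(['Dodge', 'syfy', 'hulu'])
# query == 'to:Dodge OR to:syfy OR to:hulu'

oldest = get_old_min_id([FakeTweet(30), FakeTweet(12), FakeTweet(25)])
# oldest == 12
```

Note that `min()` raises a `ValueError` on an empty list, so `get_old_min_id` should only be called on a non-empty batch.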

Below, we create the search strings for each group of corporate accounts defined above. We then collect all tweets that respond to these accounts. Note that the script below is designed to analyze the contents of the current directory to determine the earliest tweets that have already been captured by the script. The script will then only search for earlier tweets.
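
The file-discovery step can be sketched in isolation as below. This version works on a plain list of filenames and uses a capture group to read the file number; `latest_file_number` is a hypothetical helper, and unlike the script it does not check whether a file is empty.

```python
import re

# The capture group holds the numeric suffix of each data file.
FILE_RE = re.compile(r'^tweets_at_corporate_accounts_(\d+)\.pickle$')


def latest_file_number(filenames):
    """Return the highest file number among matching names, or 0 if none match."""
    matches = [FILE_RE.match(name) for name in filenames]
    numbers = [int(m.group(1)) for m in matches if m]
    return max(numbers) if numbers else 0


names = [
    'tweets_at_corporate_accounts_2.pickle',
    'tweets_at_corporate_accounts_13.pickle',
    'notes.txt',
]
latest = latest_file_number(names)
# latest == 13
```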

In [6]:
search_strings = []

for group in corporate_accounts:
    search_strings.append(make_search_string(group))

tweets_at_corporate_accounts = {}

for i in range(number_of_groups):
    tweets_at_corporate_accounts['group_%d' % i] = []

# Capture the file number so it can be read directly from the match.
file_re = re.compile(r'tweets_at_corporate_accounts_(\d\d?)\.pickle$')

final_data_file = ''
final_count = 0

max_ids = []

# Find the highest-numbered, non-empty data file in the current directory.
for match in [file_re.match(i) for i in os.listdir('.')]:
    if match is None:
        continue
    file_count = int(match.group(1))
    if file_count > final_count and os.path.getsize(match.group(0)) != 0:
        final_data_file = match.group(0)
        final_count = file_count

print(final_data_file)

# Manual switch: when True, saved IDs are ignored and the search
# starts from the newest tweets.
changed_groups = False

count = final_count + 1

if final_data_file and not changed_groups:
    with open(final_data_file, 'rb') as f:
        old_tweets = pickle.load(f)

    for i in range(number_of_groups):
        # max_id is inclusive, so start just below the oldest saved tweet.
        max_ids.append(get_old_min_id(old_tweets['group_%d' % i]) - 1)
else:
    for i in range(number_of_groups):
        # No earlier data: use the largest signed 64-bit value as an open upper bound.
        max_ids.append(2**63 - 1)
    
# Pickle data is binary, so open the output file in 'wb' mode.
tweets_file = open('tweets_at_corporate_accounts_%d.pickle' % count, 'wb')

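# len() of the dict only counts groups, so track the number of tweets explicitly.
total_collected = 0

while True:
    start = time.time()
    try:
        exhausted = True
        for i in range(number_of_groups):
            new_batch = api.search(q=search_strings[i] + " -filter:retweets", count=100, max_id=max_ids[i], lang="en")
            if not new_batch:
                # No older tweets for this group; min() would fail on an empty batch.
                continue
            exhausted = False
            replies = [t for t in new_batch if t.in_reply_to_status_id]
            tweets_at_corporate_accounts['group_%d' % i] += replies
            total_collected += len(replies)
            # max_id is inclusive, so step below the oldest tweet in this batch.
            max_ids[i] = min([t.id for t in new_batch]) - 1

        print(sum([len(tweets_at_corporate_accounts[i]) for i in tweets_at_corporate_accounts]))
        if exhausted:
            # Every group came back empty: save what we have and stop.
            pickle.dump(tweets_at_corporate_accounts, tweets_file)
            tweets_file.close()
            break
    except tweepy.RateLimitError:
        pickle.dump(tweets_at_corporate_accounts, tweets_file)
        tweets_file.close()
        if total_collected > 100000:
            break
        print("rate limit")
        for i in range(number_of_groups):
            tweets_at_corporate_accounts['group_%d' % i] = []
        count += 1
        tweets_file = open('tweets_at_corporate_accounts_%d.pickle' % count, 'wb')
        # Wait out the rest of the 15-minute window; never sleep a negative time.
        wait = max(0, 900 - (time.time() - start))
        print(wait / 60)
        time.sleep(wait)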

tweets_at_corporate_accounts_13.pickle


ValueError: min() arg is an empty sequence
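
The `ValueError` above occurs when a search returns no tweets, because `min()` is then called on an empty list. A minimal sketch of paging backwards with a guard for that case follows. `collect_older` and `fake_search` are hypothetical: `fake_search` stands in for the API call and returns bare IDs rather than tweet objects.

```python
def collect_older(search_page, max_id, max_pages=10):
    """Page backwards through results until a page comes back empty."""
    collected = []
    for _ in range(max_pages):
        batch = search_page(max_id)
        if not batch:
            # Nothing older is available: stop instead of calling min().
            break
        collected.extend(batch)
        # max_id is inclusive, so step past the oldest ID already seen.
        max_id = min(batch) - 1
    return collected, max_id


# Hypothetical stand-in for the search call: up to 3 IDs <= max_id, newest first.
FAKE_IDS = [10, 9, 8, 7, 6, 5, 4]


def fake_search(max_id):
    return [i for i in FAKE_IDS if i <= max_id][:3]


ids, next_max_id = collect_older(fake_search, max_id=10)
# ids == [10, 9, 8, 7, 6, 5, 4]; next_max_id == 3
```

Subtracting one from the oldest ID also prevents the same tweet from being fetched twice on consecutive pages.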