# Data Collection

## Data collection method
- This notebook uses the Tweepy package to download tweets for specified accounts from the Twitter API. To use the API you need your own bearer key, which authenticates your requests. If you plan to use the API, put your bearer key into the variable 'bearer_key' in the Constants section.
- It reads a manually created account list file (accounts.csv) that lists every Twitter handle to download tweets from, the class to assign to each handle, and how many tweets to download from each handle.
- Keep the accounts file current to avoid downloading tweets from the same account more than once, because the Twitter API limits how many tweets a user can pull in a one-month period. For that reason, as soon as the tweets for a handle are downloaded, they are appended to a tweets CSV file (tweet_list.csv) and the accounts file is updated to mark that handle as 'done'.
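
For reference, the layout of the accounts file can be sketched as follows. The column names are the ones the code in this notebook reads (`Twitter handle`, `Class`, `Count_Plan`, `Count_Actual`, `Done`); the two rows, their handles, and their counts are made up for illustration.

```python
import csv
import io

# Hypothetical contents of accounts.csv; the handles and counts are illustrative only.
sample_accounts_csv = """Twitter handle,Class,Count_Plan,Count_Actual,Done
example_handle_1,Politics - Liberal,200,0,False
example_handle_2,Politics - Conservative,200,0,False
"""

rows = list(csv.DictReader(io.StringIO(sample_accounts_csv)))

# Handles that still need downloading are the ones not yet marked as done.
pending = [row["Twitter handle"] for row in rows if row["Done"] != "True"]
print(pending)
```

A freshly prepared file would have every `Done` value set to False and every `Count_Actual` set to 0, which is also the state that `reset_account_list_done` restores.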

## Install necessary uncommon packages 

In [54]:
# Tweet downloader
! pip install tweepy



## Imports

In [55]:
from os.path import exists
import tweepy
import csv
import pandas as pd
import string
from time import sleep

## Constants

In [56]:
# This is the account list that we'll be pulling tweets from.  
account_list_file = "accounts.csv"  

# This is the file that we'll save tweet data to.
tweet_list_file = 'tweet_list.csv'  

# This is the key needed to download from the API
bearer_key = 'YOUR_BEARER_KEY'  # Replace with your own bearer key; don't commit a real key to a shared notebook
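
One way to keep the key out of the notebook itself is to read it from an environment variable. The sketch below assumes a variable named `TWITTER_BEARER_TOKEN`, which is a name chosen here for illustration and not something the Twitter API or Tweepy requires; `load_bearer_key` is likewise a hypothetical helper.

```python
import os

# TWITTER_BEARER_TOKEN is a hypothetical variable name chosen for this sketch.
def load_bearer_key(env_var="TWITTER_BEARER_TOKEN"):
    """Return the bearer key from the environment, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before downloading.")
    return key
```

With this in place, the constant could be assigned as `bearer_key = load_bearer_key()`.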

## Functions

### Account list management functions

These functions manage the list of accounts to pull tweets from.

In [57]:
# This reads the account list file (the file that has the accounts we'll pull tweets from)
def get_account_list_from_file():
    
    # Read the list of accounts from the CSV file into a dataframe
    df_accounts = pd.read_csv(account_list_file)

    # Strip whitespace from the column names and set the column types
    df_accounts.columns = [n.strip() for n in df_accounts.columns]
    df_accounts['Count_Plan'] = df_accounts['Count_Plan'].astype(int)
    df_accounts['Count_Actual'] = df_accounts['Count_Actual'].astype(int)
    df_accounts['Done'] = df_accounts['Done'].astype(bool)
    
    # Return the dataframe
    return df_accounts

# This function saves the account list.  It's saved after each handle download.  
def save_account_list_to_file():
    account_list.to_csv(account_list_file, index=False)
    
# This is to reset the account list file.  Should rarely be used unless we want to restart the downloads.
def reset_account_list_done():
    account_list['Done'] = False
    account_list['Count_Actual'] = 0
    save_account_list_to_file()

### Tweet downloading and saving functions

These functions perform the actual downloading and saving.
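
The text-cleaning step inside `get_tweets` can be tried on its own. The sketch below pulls the same two operations into a hypothetical helper named `scrub_text` (not part of the notebook) and runs it on a made-up string.

```python
import string

def scrub_text(raw_text):
    """Keep only characters in string.printable, then replace newlines with spaces."""
    kept = "".join(ch for ch in raw_text if ch in string.printable)
    return kept.replace("\n", " ")

# The accented letter and the emoji are dropped; the newline becomes a space.
print(scrub_text("Caf\u00e9 \U0001F600 ok\nnext"))
```

Note that `string.printable` includes tabs and carriage returns, so those characters pass through this filter unchanged; only `\n` is replaced.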

In [58]:
# Request tweets from the Twitter API for a specified handle, up to a specified number of tweets,
# and attach the specified class to each one.  Returns the list of tweets.

def get_tweets(username, class_, number_of_tweets):
    # Create the API client with the bearer key and look up the user's ID
    client = tweepy.Client(bearer_token=bearer_key)
    user_id = client.get_user(username=username).data.id

    # Use the paginator to request as many tweets as we want (the paginator makes it possible
    # to download more than the 100 tweets requested per page)
    tweets = []
    paginator = tweepy.Paginator(
        client.get_users_tweets,
        user_id,
        tweet_fields=['created_at', 'author_id'],
        expansions=[''],
        max_results=100,
        exclude=['replies'],
    )
    for tweet in paginator.flatten(limit=number_of_tweets):
        # Scrub the text of any non-readable characters
        text = "".join(i for i in tweet.text if i in string.printable)
        # Scrub the text of any newlines
        text = text.replace("\n", " ")
        # Put the tweet info into a new dictionary
        tweets.append({
            "user_name"  : str(username),
            'class'      : str(class_),
            "id"         : str(tweet.id),
            "text"       : str(text),
            "author_id"  : str(tweet.author_id),
            "created_at" : str(tweet.created_at)
        })
    return tweets



# Function to append newly downloaded tweets to file
def append_to_tweet_file(tweets):
    field_names = ['user_name','class','id','text','author_id', 'created_at']
    
    # If the tweet data file doesn't exist yet, we're starting from scratch, so the header row goes first.
    write_header = not exists(tweet_list_file)

    # Append the new data to the file (newline='' lets the csv module control line endings)
    with open(tweet_list_file, 'a', newline='') as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_NONNUMERIC)
        if write_header:
            writer.writerow(field_names)
        for t in tweets:
            writer.writerow([t[name] for name in field_names])

# Function to pull the next handle from the accounts file and download its tweets
def get_next_handle():
    # Find first handle with a False in 'Done' 
    next_account = 0
    total_accounts = len(account_list)
    count = 0
    
    # Loop through the account list to find the next one that doesn't say 'Done'.  This is the next handle to download.  
    for n in range(0, total_accounts):
        if account_list.loc[n, 'Done'] == False:
            # Found next handle to download.  Break the loop.
            next_account = n
            break

    
    # Double check we found a handle that doesn't say Done, and then get the tweets for that handle
    if account_list.loc[next_account, 'Done'] == False:
        handle_to_get = account_list.loc[next_account,'Twitter handle']
        class_assignment = account_list.loc[next_account,'Class']
        number_to_get = account_list.loc[next_account,'Count_Plan']
        # Print what we are downloading
        print(f"Requesting {next_account+1}/{total_accounts}: {handle_to_get}, {class_assignment}, {number_to_get} tweets.  ", end="")

        tweetlist = get_tweets(handle_to_get, class_assignment, number_to_get)
        count = len(tweetlist)
        if count > 0:
            # We've got tweets.  Mark it done in the accounts file and save it. 
            print(f"  Received: {count} tweets.")
            append_to_tweet_file(tweetlist)
            account_list.loc[next_account, "Done"] = True
            account_list.loc[next_account, "Count_Actual"] = count
            save_account_list_to_file()

    return count  
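
The search for the next unfinished handle in `get_next_handle` can also be written without an explicit loop. The sketch below is an alternative under the assumption that the `Done` column holds booleans, as it does after `get_account_list_from_file`; `find_next_pending` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def find_next_pending(accounts):
    """Return the index label of the first row not marked done, or None if all are done."""
    pending = accounts.index[~accounts["Done"].astype(bool)]
    return pending[0] if len(pending) > 0 else None

# Illustrative data only
accounts = pd.DataFrame({
    "Twitter handle": ["example_a", "example_b", "example_c"],
    "Done": [True, False, False],
})
print(find_next_pending(accounts))
```

Returning None when nothing is pending makes the "all accounts finished" case explicit, which the loop version handles by falling back to row 0 and re-checking its `Done` flag.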

## Download tweets

Two download methods are provided below: one downloads the tweets for a single handle (the next handle in the account list that hasn't been downloaded yet), and the other downloads a batch of the next 50 handles in the account list.

### Download the next handle's tweets

In [59]:
# Get account list from file
account_list = get_account_list_from_file()

# Download and save the next handle
count = get_next_handle()

### Download the next 50 handles' tweets

In [60]:
# Loop through the next 50 handles to pull from the account file
for n in range(50):
    count = get_next_handle()  # Returns the number of tweets downloaded.  If zero, stop: something didn't work or no handles remain.
    if count == 0:
        break
    # Sleep for 1 second before moving on to the next handle.
    sleep(1)  
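
The batch loop above can be wrapped in a function so the batch size and pause are adjustable. This is a sketch, not the notebook's own code: `download_batch` and its parameters are hypothetical names, and the downloader is passed in as an argument so the loop can be exercised without calling the API.

```python
from time import sleep

def download_batch(download_next, batch_size=50, pause_seconds=1.0):
    """Call download_next() up to batch_size times, stopping early on a zero count.

    Returns the total number of tweets reported by download_next.
    """
    total = 0
    for _ in range(batch_size):
        count = download_next()
        if count == 0:
            # Same stopping rule as the loop above: zero means nothing was downloaded.
            break
        total += count
        sleep(pause_seconds)
    return total
```

In this notebook the call would look like `download_batch(get_next_handle)`.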

## Data review

Review the downloaded data and the account status file.

### Review tweet data

Load the downloaded tweets from file (assumes the tweet file already has downloaded tweets in it).

In [61]:
tweet_list_df = pd.read_csv(tweet_list_file)
tweet_list_df

| | user_name | class | id | text | author_id | created_at |
|---|---|---|---|---|---|---|
| 0 | BennieGThompson | Politics - Liberal | 1620584010991939584 | Today marks the 83rd anniversary of the first ... | 82453460 | 2023-02-01 00:45:11+00:00 |
| 1 | BennieGThompson | Politics - Liberal | 1620116251749269511 | RT @VP: President Biden and I are just getting... | 82453460 | 2023-01-30 17:46:29+00:00 |
| 2 | BennieGThompson | Politics - Liberal | 1620116182618759168 | RT @RepJeffries: We will never negotiate away ... | 82453460 | 2023-01-30 17:46:12+00:00 |
| 3 | BennieGThompson | Politics - Liberal | 1620116109864357888 | https://t.co/Ze7ePCUJJ2 | 82453460 | 2023-01-30 17:45:55+00:00 |
| 4 | BennieGThompson | Politics - Liberal | 1620061909113516036 | https://t.co/ley5hNsz0y https://t.co/RFdTeGXGO1 | 82453460 | 2023-01-30 14:10:33+00:00 |
| ... | ... | ... | ... | ... | ... | ... |
| 115506 | RepLCD | Politics - Conservative | 1611786100825006080 | It was great to catch up with my friend @RepFe... | 1583530102297600000 | 2023-01-07 18:05:26+00:00 |
| 115507 | RepLCD | Politics - Conservative | 1611615029660639233 | Thank you #OR05 for placing your trust in me t... | 1583530102297600000 | 2023-01-07 06:45:40+00:00 |
| 115508 | RepLCD | Politics - Conservative | 1610791524807081986 | A small minority is preventing the House from ... | 1583530102297600000 | 2023-01-05 00:13:21+00:00 |
| 115509 | RepLCD | Politics - Conservative | 1610408428052295681 | As I take on the responsibility of serving #OR... | 1583530102297600000 | 2023-01-03 22:51:03+00:00 |


In [62]:
tweet_list_df.user_name.value_counts()

WSJbusiness        1606
ScienceMagazine     850
appleinsider        850
YahooFinance        850
CNBC                850
                   ... 
repvalhoyle          10
JMoylanforGuam       10
NBCNetwork            7
RepJeffJackson        2
DNC                   1
Name: user_name, Length: 586, dtype: int64

It looks like we've successfully downloaded tweets from all accounts.
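
A direct way to confirm this is to compare the handles in the account list against the handles that appear in the tweet file. The sketch below uses the column names from this notebook (`Twitter handle` and `user_name`); the helper name `handles_without_tweets` and the sample data are hypothetical.

```python
import pandas as pd

def handles_without_tweets(accounts, tweets):
    """Return the handles in the account list that have no rows in the tweet data."""
    planned = set(accounts["Twitter handle"])
    downloaded = set(tweets["user_name"])
    return sorted(planned - downloaded)

# Illustrative data only
accounts = pd.DataFrame({"Twitter handle": ["example_a", "example_b", "example_c"]})
tweets = pd.DataFrame({"user_name": ["example_a", "example_a", "example_c"]})
print(handles_without_tweets(accounts, tweets))
```

An empty result would mean every listed handle contributed at least one tweet.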

### Review the accounts status file

Load the account list and review it to make sure all accounts have been marked 'done'.

In [63]:
account_list = get_account_list_from_file()
account_list.Done.value_counts()

True    587
Name: Done, dtype: int64

Notes:
- All accounts are marked as 'done'.  The tweets totaled 115,511 in count.  The account file has 587 rows while the tweet file has 586 distinct handles, so the account file may contain one duplicate handle.  I'll delete duplicate data in the main notebook.  
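
The suspected duplicate could be located, and any repeated tweets removed, along the lines of the sketch below. It assumes duplicates show up as a repeated `Twitter handle` in the account list and as repeated tweet `id` values in the tweet data; the sample frames are made up, and the actual cleanup in the main notebook may differ.

```python
import pandas as pd

# Illustrative data only: example_a is listed twice, so its tweet was downloaded twice.
accounts = pd.DataFrame({"Twitter handle": ["example_a", "example_b", "example_a"]})
tweets = pd.DataFrame({
    "user_name": ["example_a", "example_b", "example_a"],
    "id": ["101", "202", "101"],
})

# Handles that appear more than once in the account list
is_repeated = accounts["Twitter handle"].duplicated(keep=False)
duplicate_handles = accounts.loc[is_repeated, "Twitter handle"].unique().tolist()

# Keep the first copy of each tweet id and drop the rest
deduplicated_tweets = tweets.drop_duplicates(subset="id").reset_index(drop=True)

print(duplicate_handles)
print(len(deduplicated_tweets))
```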

Now I'll move on to modeling.  Proceed back to the main notebook. 