### Purpose of script

This script retrieves the user information for every user in our dataset.

In [92]:
import numpy as np
import pandas as pd
import pickle
import os
import time
import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
from tweepy import TweepError

### User info to obtain

1. User demographics (and whatever info is in the user object)
2. User timeline

### Get users

Load the user IDs saved by a previous notebook (get_user_tweets_2021-01-19.ipynb):

In [2]:
TWEET_DIR = "../../data/tweets/"

In [4]:
with open(TWEET_DIR + "user_ids_2021-01-19.pkl", 'rb') as fp:
    unique_user_ids = pickle.load(fp)

In [6]:
list(unique_user_ids)

['169711005',
 '29464656',
 '3180254803',
 '39557009',
 '2923230219',
 '767100089412251648',
 '157819685',
 '205793090',
 '821439445',
 '170963694',
 '113431968',
 '158120995',
 '1010496054062510080',
 '1152674269',
 '432991135',
 '622057278',
 '379185116',
 '15253',
 '2914119197',
 '28845358',
 '3378007379',
 '85401454',
 '42331866',
 '27573138',
 '27579617',
 '455226443',
 '79690687',
 '1345873207',
 '1072585382972284928',
 '1072585382972284928',
 '1083605322906910720',
 '930196531',
 '14733006',
 '1583990467',
 '93269564',
 '1136236422',
 '1113478140020396032',
 '24753129',
 '23974721',
 '14989726',
 '232732649',
 '24332709',
 '263927883',
 '1197599228631142402',
 '1892503092',
 '992852140895989760',
 '544038348',
 '877142240170258432',
 '320388793',
 '224301083',
 '869088275675271168',
 '414289596',
 '3405625943',
 '1636313664',
 '43496344',
 '1020616310',
 '42476678',
 '869088275675271168',
 '448601299',
 '2305887186',
 '68865478',
 '2806116100',
 '293533060',
 '115047556661466316
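
The printed IDs include repeats (for example, `'1072585382972284928'` appears twice), so it may be worth deduplicating before spending API calls on them. A minimal sketch, assuming the loaded object behaves like a list of ID strings; `dedupe_preserving_order` is a hypothetical helper, not part of the original notebook:

```python
def dedupe_preserving_order(ids):
    """Return the IDs with repeats removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


# a few IDs copied from the output above, including one repeat
sample_ids = [
    "1345873207",
    "1072585382972284928",
    "1072585382972284928",
    "1083605322906910720",
]
deduped_ids = dedupe_preserving_order(sample_ids)
```

Keeping the original order means a partial run can still be resumed from a known position in the list.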

### Twitter Authentication

Here, I'll use the `tweepy` library to access the Twitter API.

In [122]:
def authenticate(consumer_key, consumer_secret, access_token, access_secret):

    """
      Allows authentication with Twitter API, with relevant IDs
      Input: IDs
      Output: authentication, API access
    """

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # each request times out after 480 seconds (8 minutes); tweepy waits automatically on rate limits
    api = tweepy.API(auth, timeout=480, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    try:
        api.verify_credentials()
        print("Authentication OK")
    except Exception as e:
        print("Error during authentication")
        print(e)

    return auth, api

In [31]:
# placeholders for the credentials, which are read from a file below
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

In [32]:
CREDS_DIR = "../../"

In [123]:
with open(CREDS_DIR + "twitter_creds.txt", 'r') as twitter_creds:
    consumer_key = twitter_creds.readline().rstrip() # reads one line and removes trailing whitespace
    consumer_secret = twitter_creds.readline().rstrip()
    access_key = twitter_creds.readline().rstrip()
    access_secret = twitter_creds.readline().rstrip()

try:
    auth, api = authenticate(consumer_key, consumer_secret, access_key, access_secret)
except Exception as e:
    print("Authentication failed")
    print(e)

Authentication OK


### Example of getting user information

In [26]:
unique_user_ids[0]

'169711005'

In [29]:
?api.get_user

Reference: https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show

In [40]:
example_user_info = api.get_user(user_id=169711005)

Let's track our rate limit status

In [41]:
rate_limits = api.rate_limit_status()

In [42]:
rate_limits["resources"]["users"]["/users/:id"]

{'limit': 900, 'remaining': 897, 'reset': 1611257938}

What info do we think we want?

Here's a description of what all the fields mean: `https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user`

1. id_str: unique id of user
2. location: user-reported location
3. profile_location: user-reported profile location?
4. description: user description
5. followers_count: number of followers
6. friends_count: number of people they follow
7. created_at: when the account was created
8. favourites_count: number of tweets that the user has favorited
9. statuses_count: number of tweets + RTs from that user
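
One way to pull these fields out of a user object's JSON is a small helper that falls back to a placeholder when a field is missing. This is a minimal sketch under assumptions: `USER_FIELDS` and `extract_user_fields` are hypothetical names, and the input is a toy dict rather than a full API response.

```python
USER_FIELDS = [
    "id_str", "location", "profile_location", "description",
    "followers_count", "friends_count", "created_at",
    "favourites_count", "statuses_count",
]


def extract_user_fields(user_json, fields=USER_FIELDS, default=""):
    """Return one dict per user, using `default` for any missing field."""
    return {field: user_json.get(field, default) for field in fields}


# toy input: a few fields of a user object, with the rest deliberately left out
example_json = {"id_str": "169711005", "location": "ohio", "followers_count": 19}
row = extract_user_fields(example_json)
```

Collecting one such dict per user in a list and passing that list to `pd.DataFrame` would give one column per field, with every column the same length.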


We can get 900 users per 15-minute window. 
This means we can get 3,600 per hour. 

In [47]:
len(unique_user_ids)

278337

We have 278,337 total users, and at 3,600 per hour a full pass would take about 77 hours. Let's instead get a subsample of ~10,000 users and see what their characteristics are.
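
The arithmetic behind that choice can be checked with a few lines. This is a rough sketch that counts whole 15-minute windows, so it is an upper bound; the constant and function names are hypothetical, and it assumes user lookups are the only rate-limited calls being made.

```python
import math

CALLS_PER_WINDOW = 900   # per-window limit reported by rate_limit_status() above
WINDOW_MINUTES = 15
TOTAL_USERS = 278_337
SAMPLE_SIZE = 10_000


def hours_needed(num_users, calls_per_window=CALLS_PER_WINDOW, window_minutes=WINDOW_MINUTES):
    """Hours of whole rate-limit windows needed for one lookup per user."""
    windows = math.ceil(num_users / calls_per_window)
    return windows * window_minutes / 60


full_hours = hours_needed(TOTAL_USERS)
sample_hours = hours_needed(SAMPLE_SIZE)
print(f"all users: {full_hours:.1f} h, sample: {sample_hours:.1f} h")
```

Under these assumptions the full set needs 310 windows (77.5 hours), while the 10,000-user sample needs 12 windows (3 hours).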

In [124]:
rate_limits = api.rate_limit_status()
rate_limits["resources"]["users"]["/users/:id"]

{'limit': 900, 'remaining': 900, 'reset': 1611282562}

In [113]:
rate_limits["resources"]["users"]["/users/:id"]["remaining"]

900

Let's begin to set up the loop to get the user information: 

In [50]:
unique_user_ids[0]

'169711005'

In [55]:
example_user_info._json["id_str"]

'169711005'

In [54]:
example_user_info._json

{'id': 169711005,
 'id_str': '169711005',
 'name': 'David A Hesson',
 'screen_name': 'hessonplumbing',
 'location': 'ohio',
 'profile_location': None,
 'description': '',
 'url': 'https://t.co/uks1jwtN2r',
 'entities': {'url': {'urls': [{'url': 'https://t.co/uks1jwtN2r',
     'expanded_url': 'http://hessonplumbing.com',
     'display_url': 'hessonplumbing.com',
     'indices': [0, 23]}]},
  'description': {'urls': []}},
 'protected': False,
 'followers_count': 19,
 'friends_count': 19,
 'listed_count': 0,
 'created_at': 'Fri Jul 23 00:18:48 +0000 2010',
 'favourites_count': 28,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': True,
 'verified': False,
 'statuses_count': 696,
 'lang': None,
 'status': {'created_at': 'Sat Nov 07 14:55:52 +0000 2020',
  'id': 1325089592652034054,
  'id_str': '1325089592652034054',
  'text': 'Hydro jetting sewer line having issues with there sewer. Going to do a camera inspection and clean it up. @ Columbu… https://t.co/MTNiYts8nv',
  'truncated': 

Now let's run the code below to set up our results:

In [125]:
id_str_list = []
location_list = []
profile_location_list = []
user_description_list = []
followers_count_list = []
friends_count_list = []
created_at_list = []
favourites_count_list = []
statuses_count_list = []

In [126]:
# total number of users we want in this sample
num_samples = 10000

In [127]:
# number of users we've successfully retrieved information for, in the loop itself
num_users_searched = 0

In [128]:
# track users not found
users_not_found = []

In [129]:
for idx, user_id in enumerate(unique_user_ids):

    
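    # get the number of calls remaining in the current rate-limit window
    # NOTE: rate_limit_status() is itself a rate-limited endpoint, so polling it on every
    # iteration is the likely reason the loop sleeps well before 900 user lookups are made
    rate_limits = api.rate_limit_status()
    remaining = rate_limits["resources"]["users"]["/users/:id"]["remaining"]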
    
    if remaining == 0:
        print(f"We're on user_id={user_id}, iteration {idx}, and we've hit a rate limit. Our program will wait until the 15-minute rate limit resets.")

    # report progress every 500 iterations
    if idx % 500 == 0:
        print(f"We've gone through {idx} iterations of the loop")
    
    # stop once we have attempted num_samples users (idx is zero-based)
    if idx >= num_samples:
        print(f"We're on iteration {idx}, so now we have enough users for our samples")
        break
        
    # API call
    try:
        user_info = api.get_user(user_id=user_id)
        
    except TweepError as te:
        print(te)
        # api_code is None for non-API failures (e.g. connection errors), so this check cannot raise
        if te.api_code in (50, 63):
            print(f"User {user_id} doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.")
            users_not_found.append(user_id)
            continue
        else: 
            print(f"User info not available, for an unknown reason.")
            will_continue = input(f"Would you like to (1) exit ('e'), (2) wait ('w'), or (3) continue (press anything else)")
            if will_continue=='e':
                print(f"Exiting for-loop, after getting through {idx} users")
                break
            elif will_continue=='w':
                num_seconds_wait=int(input("How many seconds would you like to wait? Rate limit should reset after 15 minutes (900 seconds)"))
                print(f"Will wait for {num_seconds_wait} seconds ({num_seconds_wait / 60} minutes)")
                time.sleep(num_seconds_wait)
                # skip this user after waiting; falling through would record stale user_info
                continue
            else:
                continue
                    
    except Exception as e:
        print(f"Exception in getting user info for user {user_id}, user number {idx}")
        print(e)
            
        will_continue = input(f"Would you like to (1) exit ('e'), (2) wait ('w'), or (3) continue (press anything else)")
        if will_continue=='e':
            print(f"Exiting for-loop, after getting through {idx} users")
            break
        elif will_continue=='w':
            num_seconds_wait=int(input("How many seconds would you like to wait? Rate limit should reset after 15 minutes (900 seconds)"))
            print(f"Will wait for {num_seconds_wait} seconds ({num_seconds_wait / 60} minutes)")
            time.sleep(num_seconds_wait)
            # skip this user after waiting; falling through would record stale user_info
            continue
        else:
            continue
    
    num_users_searched = num_users_searched + 1
    
    if idx % 500 == 0:
        print(f"We've successfully searched through {num_users_searched} users (out of {idx+1} attempted)")
        
    # get user information
    try:
        id_str_list.append(user_info._json["id_str"])
        location_list.append(user_info._json["location"])
        profile_location_list.append(user_info._json["profile_location"])
        user_description_list.append(user_info._json["description"])
        followers_count_list.append(user_info._json["followers_count"])
        friends_count_list.append(user_info._json["friends_count"])
        created_at_list.append(user_info._json["created_at"])
        favourites_count_list.append(user_info._json["favourites_count"])
        statuses_count_list.append(user_info._json["statuses_count"])
        
    except Exception as e:
        print(f"Error in getting info from user's JSON")
        print(e)
        # if some field was not available, pad the shorter lists with a placeholder
        # so that all the lists stay the same length
        num_users = len(id_str_list)
        
        for field_list in (location_list, profile_location_list, user_description_list,
                           followers_count_list, friends_count_list, created_at_list,
                           favourites_count_list, statuses_count_list):
            while len(field_list) < num_users:
                field_list.append("")
    
        # what to do next
        will_continue = input(f"Would you like to (1) exit ('e'), (2) wait ('w'), or (3) continue (press anything else)")
        if will_continue=='e':
            print(f"Exiting for-loop, after getting through {idx} users")
            break
        elif will_continue=='w':
            num_seconds_wait=int(input("How many seconds would you like to wait? Rate limit should reset after 15 minutes (900 seconds)"))
            print(f"Will wait for {num_seconds_wait} seconds ({num_seconds_wait / 60} minutes)")
            time.sleep(num_seconds_wait)
        else:
            continue

We've gone through 0 iterations of the loop
We've successfully searched through 1 users (out of 1 attempted)
[{'code': 50, 'message': 'User not found.'}]
User 1345873207 doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.
[{'code': 63, 'message': 'User has been suspended.'}]
User 305030732 doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.


Rate limit reached. Sleeping for: 810


[{'code': 50, 'message': 'User not found.'}]
User 31568672 doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.


Rate limit reached. Sleeping for: 812


[{'code': 50, 'message': 'User not found.'}]
User 123696213 doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.
We've gone through 500 iterations of the loop
We've successfully searched through 497 users (out of 501 attempted)


Rate limit reached. Sleeping for: 819
Rate limit reached. Sleeping for: 808
Rate limit reached. Sleeping for: 816


We've gone through 1000 iterations of the loop
We've successfully searched through 997 users (out of 1001 attempted)


Rate limit reached. Sleeping for: 818
Rate limit reached. Sleeping for: 819


[{'code': 50, 'message': 'User not found.'}]
User 2953690786 doesn't exist (if code==50) or user was suspended (if code==63). Let's track this in an external list and continue the loop.


Rate limit reached. Sleeping for: 818


We've gone through 1500 iterations of the loop
We've successfully searched through 1496 users (out of 1501 attempted)


Rate limit reached. Sleeping for: 821
Rate limit reached. Sleeping for: 819


TweepError: Failed to send request: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
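
The run above stopped on a connection error rather than an API error. Such an error carries a plain message string rather than the `[{'code': ..., 'message': ...}]` payload printed for missing users, so code that looks up an error code should tolerate both shapes. A minimal sketch, assuming tweepy 3.x (where `TweepError` exposes an `api_code` attribute); `extract_api_code` is a hypothetical helper, and the example uses plain `Exception` objects as stand-ins so that it runs without tweepy:

```python
def extract_api_code(exc):
    """Best-effort lookup of a Twitter API error code on an exception; None if absent."""
    # assumption: tweepy 3.x TweepError stores the code in `api_code`
    code = getattr(exc, "api_code", None)
    if isinstance(code, int):
        return code
    # fall back to the [{'code': ..., 'message': ...}] payload seen in the output above
    payload = exc.args[0] if exc.args else None
    if isinstance(payload, list) and payload and isinstance(payload[0], dict):
        return payload[0].get("code")
    return None


# stand-ins for the two kinds of errors in the log (plain Exceptions, not real TweepErrors)
not_found = Exception([{"code": 50, "message": "User not found."}])
connection_reset = Exception("Failed to send request: ('Connection aborted.', ...)")

not_found_code = extract_api_code(not_found)
connection_code = extract_api_code(connection_reset)
```

With a helper like this, a connection failure yields `None` and can be routed to a retry or wait branch instead of interrupting the run.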

In [None]:
user_info_df = pd.DataFrame(zip(id_str_list, 
                                location_list, 
                                profile_location_list, 
                                user_description_list, 
                                followers_count_list, 
                                friends_count_list, 
                                created_at_list, 
                                favourites_count_list, 
                                statuses_count_list), 
                            columns=["id_str", "location", "profile_location", "user_description",
                                     "followers_count", "friends_count", "created_at", 
                                     "favourites_count", "statuses_count"])

In [None]:
user_info_df.to_csv(TWEET_DIR + "user_info_sample_10000_2021-01-21.csv")