## **NOTE:** This notebook was copied from `twitter_collect_top100.ipynb` and modified to request 'extended' tweet information. This ensures tweet text is NOT truncated [for tweets longer than 140 characters] AND it provides URLs for any images contained within the tweet!

## NOTE: No need to run this notebook. I supplied it so you can see HOW the twitter data is collected :)
## To actually USE the data collected here, look at the <b>'twitter_unpackTop100_example.ipynb'</b> notebook!

## Load Modules
- ttools has helper functions

In [2]:

%load_ext autoreload
%autoreload 2
import sys, codecs, json
import ttools
from twython import TwythonStreamer, Twython
from datetime import datetime
from time import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Get top100 [from pre-made json file]
First, load the dictionary of the top100 most-followed Twitter users and extract the user_ids for use with the API.

In [2]:
top100file = './top100_id_dictionary.json'
top100 = ttools.json_to_dict(top100file)  # format is {user_id:[username,name]} really we just care about the user ids for now
top100ids = [int(uid) for uid in top100.keys()]
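
`ttools.json_to_dict` is a project helper whose source is not shown here. A minimal sketch of what such a loader might look like, assuming the file is a plain JSON object of the form `{user_id: [username, name]}`. The function name `json_to_dict_sketch`, the temporary file, and the sample entries are hypothetical and only there so the example runs on its own:

```python
import json
import os
import tempfile

def json_to_dict_sketch(path):
    """Read a JSON file and return its top-level object as a dict."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# tiny stand-in file with the same {user_id: [username, name]} layout (made-up sample)
sample = {"27260086": ["justinbieber", "Justin Bieber"],
          "813286": ["BarackObama", "Barack Obama"]}
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample_ids.json')
with open(tmp_path, 'w', encoding='utf-8') as f:
    json.dump(sample, f)

loaded = json_to_dict_sketch(tmp_path)
sample_ids = [int(uid) for uid in loaded.keys()]  # JSON keys are always strings, so cast to int
print(sample_ids)
```

The cast to `int` matters because JSON object keys are always strings, while the ids returned by the API are integers.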

## Set up twitter API, get user metadata, and remove non-english accounts
Initialize the api connection

In [3]:
app_key = "YOUR_API_KEY_HERE"  #api_key
app_sec = "YOUR_API_SECRET_KEY_HERE"  #api_secretKey
user_key = "YOUR_ACCESS_TOKEN_HERE"  #access_token
user_sec = "YOUR_ACCESS_TOKEN_SECRET_HERE"  #access_token_secret
api = ttools.initAPI(app_key,app_sec,user_key,user_sec)
credentials = api.verify_credentials()  #KRC__verify the connection

Get all users' metadata with the users-lookup API [it can return up to 100 users in a single api call..how convenient!]

In [4]:
#pass the flat id list; wrapping it as [top100ids] appears to garble the 1st and last ids when the list is serialized
userdata = api.lookup_user(user_id=top100ids)
#safety net: individually fetch any users the batch call did not return
usersReturned = set(int(d['id']) for d in userdata)
for uid in top100ids:
    if uid not in usersReturned:
        userdata.extend(api.lookup_user(user_id=[uid]))
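
Wrapping the id list in another list, as in `user_id=[top100ids]`, is one plausible reason for a batch lookup to miss exactly the first and last users. The sketch below assumes the client turns a list-valued parameter into a comma-separated string; that is an assumption about the library's internals, imitated here by the hypothetical `serialize_ids`. Under that assumption a nested list is stringified as a whole, so the first and last ids pick up a stray bracket:

```python
def serialize_ids(value):
    """Imitate comma-joining a list parameter (hypothetical stand-in, not the real client code)."""
    return ','.join(map(str, value))

ids = [27260086, 813286, 79293791]
flat = serialize_ids(ids)       # ids passed directly
nested = serialize_ids([ids])   # ids wrapped in an extra list
print(flat)
print(nested)
```

If this is what happens, the first entry becomes `[27260086` and the last becomes ` 79293791]`, neither of which is a valid id, while the ids in between survive.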

Check and clean the data we collected

In [5]:
#verify we got all the users
usersGotten = []
for d in userdata:
    usersGotten.append(int(d['id']))
commonUsers = set(top100ids).intersection(set(usersGotten))
if len(commonUsers) != 100:
    print('api did not give all/correct user ids...need to investigate')

#remove the non-english accounts [actually, do this in-loop below]
# nonEnglish = []
# for d in userdata:
#     if d['lang'] != 'en':
#         nonEnglish.append(d['id'])
#         print('removing non-english account: %s'%(top100[str(d['id'])]))
#         top100.pop(str(d['id']))
# print('top100 is composed of %s english speakers'%(len(top100)))
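
When the count check fails, it helps to see which ids are missing rather than only that the total is off. A small sketch with made-up ids, where `expected_ids` and `returned_ids` are hypothetical stand-ins for `top100ids` and `usersGotten`:

```python
expected_ids = [11, 22, 33, 44]
returned_ids = [22, 33, 33]  # one duplicate, two ids missing

# ids we asked for but did not get back
missing = sorted(set(expected_ids) - set(returned_ids))
# ids we got back but never asked for
unexpected = sorted(set(returned_ids) - set(expected_ids))
print('missing ids: %s' % missing)
print('unexpected ids: %s' % unexpected)
```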

## Collect Timeline Data and Save json
Now, let's gather the timeline data! Note that the user information we just collected is stored under the 'user_info' key of limitedUserDict [which is the dictionary collecting ALL of the data]. The data will be saved in JSON format.

In [6]:
%%time
numPasses = 1
currentUserID = 0
timeStart = time()
allCollectedUsers = []  #track users we successfully got timelines for

limitedUserDict = {}
try:
    for i,udata in enumerate(userdata):
        user_id = int(udata['id'])
        #skip the non-english accounts
        if udata['lang'] != 'en':
            print('skipping non-english user: %s'%(udata['screen_name']))
            #top100.pop(str(d['id']))  #remove from the top100 list...not really necessary
            continue
        currentUserID = user_id
        limitedUserDict[int(user_id)] = {'user_info':udata,'user_timeline':[]}  #hydrates the user info and preps the timeline list
        #limitedUserDict[int(user_id)] = activeusers[int(user_id)]  #copy the structure for the user
        print('%s__of__%s total users gathered'%(i,len(top100ids)))
        print('User ID: %s'%user_id)
        print('username: %s'%udata['screen_name'])
        kwargs = {'user_id':int(user_id),'count':200,'exclude_replies':'false','trim_user':'true','include_rts':'false','tweet_mode':'extended'}
        timelineTweets = ttools.rateLimitWrapperTimeline(api,api.get_user_timeline,kwargs,willingToWait=True,maxExecTime=14400)
        limitedUserDict[user_id]['user_timeline'].extend(timelineTweets)  #extend the list
        allCollectedUsers.append(user_id)
        del timelineTweets
except:
    print('some sort of error occurred...dumping data collected so far')
    jsonStr = json.dumps(limitedUserDict)
    with open('top100users_and_timelines_EXTENDED.json','w') as f:
        f.write(jsonStr)
    del jsonStr
    with open('top100gotten_EXTENDED.txt','w') as outF:
        outF.write('%s'%allCollectedUsers)
    print('last user_id attempted = %s'%currentUserID)
    print('total number of users collected: %s'%(len(allCollectedUsers)))
    print('finished!')
    print('Elapsed time: %s'%(time() - timeStart))
    sys.exit()
#print(len(r))

print('made it to the end without error')
jsonStr = json.dumps(limitedUserDict)
with open('top100users_and_timelines_EXTENDED.json','w') as f:
    f.write(jsonStr)
del jsonStr
with open('top100gotten_EXTENDED.txt','w') as outF:
    outF.write('%s'%allCollectedUsers)
print('last user_id attempted = %s'%currentUserID)
print('total number of users collected: %s'%(len(allCollectedUsers)))
print('finished!')
print('Elapsed time: %s'%(time() - timeStart))

0__of__100 total users gathered
User ID: 27260086
username: justinbieber
returning from rateLimitWrapper
1__of__100 total users gathered
User ID: 813286
username: BarackObama
returning from rateLimitWrapper
2__of__100 total users gathered
User ID: 79293791
username: rihanna
returning from rateLimitWrapper
3__of__100 total users gathered
User ID: 17919972
username: taylorswift13
returning from rateLimitWrapper
4__of__100 total users gathered
User ID: 14230524
username: ladygaga
returning from rateLimitWrapper
5__of__100 total users gathered
User ID: 15846407
username: TheEllenShow
returning from rateLimitWrapper
skipping non-english user: Cristiano
7__of__100 total users gathered
User ID: 10228272
username: YouTube
returning from rateLimitWrapper
8__of__100 total users gathered
User ID: 26565946
username: jtimberlake
returning from rateLimitWrapper
9__of__100 total users gathered
User ID: 25365536
username: KimKardashian
returning from rateLimitWrapper
10__of__100 total users gathered
[output truncated]
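
`ttools.rateLimitWrapperTimeline` is not shown in this notebook. As a rough sketch of the general idea behind collecting more than one page of a timeline, the loop below pages backwards with a `max_id` cursor until a page comes back empty. `collect_timeline` and `fake_fetch` are hypothetical; `fake_fetch` stands in for the real API call so the example runs offline, and rate-limit handling is left out entirely.

```python
def collect_timeline(fetch_page, max_pages=20):
    """Page backwards through a timeline using max_id until no tweets remain."""
    tweets, max_id = [], None
    for _ in range(max_pages):
        kwargs = {} if max_id is None else {'max_id': max_id}
        page = fetch_page(**kwargs)
        if not page:
            break
        tweets.extend(page)
        # next request: only tweets strictly older than the oldest one seen so far
        max_id = min(t['id'] for t in page) - 1
    return tweets

def fake_fetch(max_id=None, page_size=3, _ids=tuple(range(10, 0, -1))):
    """Offline stand-in for an API call: returns up to page_size tweets with id <= max_id."""
    older = [i for i in _ids if max_id is None or i <= max_id]
    return [{'id': i} for i in older[:page_size]]

timeline = collect_timeline(fake_fetch)
print(len(timeline))
```

Subtracting 1 from the oldest id avoids fetching the boundary tweet twice, since `max_id` is inclusive in this sketch.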

Now we take the raw tweet data, extract our defined features, and put them into a dataframe. Then we save that dataframe as a *.csv file [and a *.pkl file]!


In [3]:
with open('top100users_and_timelines_EXTENDED.json','r') as f:
    readstr = f.read()
    alldata = json.loads(readstr)
    del readstr

uNum = 0
infos = []
globalTweets = {}
for uid,data in alldata.items():
    print('user number: %s'%uNum)
    if 'ErrorCaught' in data:
        print('Handled User ErrorCaught')
        continue
    res,globalTweets = ttools.extractAllAttributes(uid,data,globalTweets,extended=True)
    if res is None:
        #print('extractAllAttributes returned None. Skipping')
        continue
    infos.append(res)
    uNum += 1
alldata = pd.concat(infos,axis=0,ignore_index=True)
alldata.to_csv('./top100users_and_timelines_EXTENDED_PLUS.csv',index=False)  #the PLUS is because we added a few attributes afterward lol
alldata.to_pickle('./top100users_and_timelines_EXTENDED_PLUS.pkl')
del infos
del alldata

user number: 0
user number: 1
user number: 2
user number: 3
user number: 4
user number: 5
user number: 6
user number: 7
user number: 8
user number: 9
user number: 10
user number: 11
user number: 12
user number: 13
user number: 14
user number: 15
user number: 16
user number: 17
user number: 18
user number: 19
user number: 20
user number: 21
user number: 22
user number: 23
user number: 24
user number: 25
user number: 26
user number: 27
user number: 28
user number: 29
user number: 30
user number: 31
user number: 32
user number: 33
user number: 34
user number: 35
user number: 36
user number: 37
user number: 38
user number: 39
user number: 40
user number: 41
user number: 42
user number: 43
user number: 44
user number: 45
user number: 46
user number: 47
user number: 48
user number: 49
user number: 50
user number: 51
user number: 52
user number: 53
Handled Tweet ErrorCaught
user number: 54
Handled Tweet ErrorCaught
user number: 54
Handled Tweet ErrorCaught
user number: 54
Handled Tweet ErrorCaught
[output truncated]
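
`ttools.extractAllAttributes` is defined elsewhere, so its details are not visible here. The sketch below shows one way the two 'extended' features could be pulled from a single tweet, assuming the extended payload layout in which the untruncated text lives under `full_text` and attached photos under `extended_entities['media']`. The function name and the sample tweet are made up for illustration.

```python
def extract_extended_features(tweet):
    """Return the untruncated text and any photo URLs from one tweet dict."""
    # fall back to 'text' for payloads that were not requested in extended mode
    text = tweet.get('full_text', tweet.get('text', ''))
    media = tweet.get('extended_entities', {}).get('media', [])
    image_urls = [m['media_url'] for m in media if m.get('type') == 'photo']
    # use None instead of an empty list when the tweet has no images
    return {'tweet_id': tweet['id'], 'text': text, 'image_urls': image_urls or None}

sample_tweet = {
    'id': 1,
    'full_text': 'hello world',
    'extended_entities': {'media': [
        {'type': 'photo', 'media_url': 'http://example.com/a.jpg'},
        {'type': 'video', 'media_url': 'http://example.com/b.jpg'},
    ]},
}
features = extract_extended_features(sample_tweet)
print(features)
```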

And there you have it! The top100 most followed users on twitter and their timelines are now in the file: <b>top100users_and_timelines_EXTENDED_PLUS.csv</b>

## Look in the 'twitter_unpackTop100_example.ipynb' notebook for an example using the actual data!

## After the fact...add the top100 users 'category' label [artist,company,athlete,politician,businessLeader] and resave the file

## Get top100 [from pre-made json file]
First, load the dictionary of the top100 most-followed Twitter users and extract the user_ids for use with the API.

In [4]:
top100file = './top100_id_dictionary.json'
top100 = ttools.json_to_dict(top100file)  # format is {user_id:[username,name]} really we just care about the user ids for now
top100ids = [int(uid) for uid in top100.keys()]

Read in and inspect the top100 tweet/timeline data!

In [5]:
%time top100all = pd.read_csv('top100users_and_timelines_EXTENDED_PLUS.csv',lineterminator='\n')

CPU times: user 1.78 s, sys: 135 ms, total: 1.92 s
Wall time: 1.93 s




In [6]:
top100all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203993 entries, 0 to 203992
Data columns (total 33 columns):
tweet_id                 203993 non-null int64
tweet_truncated          203993 non-null bool
date                     203993 non-null object
tweet_source             203970 non-null object
tweet_lang               203993 non-null object
tweet_coord              37 non-null object
tweet_place              364 non-null object
text                     203993 non-null object
text_noMentions          203951 non-null object
is_quote_status          203993 non-null bool
is_reply_to_status       203993 non-null bool
is_reply_to_user         203993 non-null bool
numMentions              203993 non-null int64
image_urls               65150 non-null object
retweet_count            203993 non-null int64
favorite_count           203993 non-null int64
user_id                  203993 non-null int64
user_name                203993 non-null object
user_screen              203993 non-null object
[output truncated]

Now, we need to categorize the top100. Here is some helper code. The 'top100cat' dataframe is the result.

In [7]:
##THIS WAS DONE IN twitter_unpackTop100_example.ipynb [below]
# #turn dict into dataframe
# top100forCategories = []
# for uid in list(top100.keys()):
#     top100forCategories.append([int(uid),top100[uid][0],top100[uid][1]])

# #save the dataframe
# pd.DataFrame(top100forCategories).to_csv('top100categorization.csv')
# #then, manually labeled each entry as one of the five categories. Saved labeled file as 'top100categorization_complete.csv'
##THIS WAS DONE IN twitter_unpackTop100_example.ipynb [above]

#Now, read in the complete csv. This df can be used with the 'top100users_and_timelines.csv' dataset to help
#categorize the top100 users/tweets into correct category
CATEGORY = {'a':'artist','b':'businessLeader','c':'company','p':'politician','t':'athlete'}
top100cat = pd.read_csv('top100categorization_complete.csv')
top100cat.drop(['Unnamed: 0','notes'],axis=1,inplace=True)
top100cat.rename(columns={'0':'user_id','1':'screenname','2':'name'},inplace=True)
print('top100 category count:\n%s'%(top100cat['category'].value_counts().rename(CATEGORY)))
#top100cat.info()

top100 category count:
artist            60
company           14
athlete            9
politician         7
businessLeader     2
Name: category, dtype: int64


In [8]:
#decode the category labels [turn char into str]
top100cat['category'] = top100cat['category'].apply(lambda x: CATEGORY[x])  # using dict

#add the category 
res = top100all.merge(top100cat[['user_id','category']],on=['user_id'])
res.columns

Index(['tweet_id', 'tweet_truncated', 'date', 'tweet_source', 'tweet_lang',
       'tweet_coord', 'tweet_place', 'text', 'text_noMentions',
       'is_quote_status', 'is_reply_to_status', 'is_reply_to_user',
       'numMentions', 'image_urls', 'retweet_count', 'favorite_count',
       'user_id', 'user_name', 'user_screen', 'user_verified', 'user_lang',
       'user_description_text', 'user_followers_count', 'user_friends_count',
       'user_listed_count', 'user_favourites_count', 'user_statuses_count',
       'user_location', 'user_created_year', 'user_created_month',
       'user_geo_enabled', 'user_img_url', 'user_banner_url', 'category'],
      dtype='object')
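
`merge` defaults to an inner join, so any tweet whose `user_id` has no row in `top100cat` is dropped without warning. A hedged sketch of how one might check for that, using toy data (`tweets_toy` and `cats_toy` are made-up frames, not the real dataset): a left join with `indicator=True` marks the rows that found no category.

```python
import pandas as pd

tweets_toy = pd.DataFrame({'tweet_id': [1, 2, 3], 'user_id': [10, 10, 20]})
cats_toy = pd.DataFrame({'user_id': [10], 'category': ['artist']})

# keep every tweet and record where each row's match came from in the '_merge' column
checked = tweets_toy.merge(cats_toy, on='user_id', how='left', indicator=True)
unlabeled = checked.loc[checked['_merge'] == 'left_only', 'user_id'].unique().tolist()
print('users without a category: %s' % unlabeled)
```

An empty `unlabeled` list would mean the inner join loses no tweets.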

## NOW, NEW FEATURES:
**`text`**`: str:str:tweet text, no longer truncated when longer than 140 characters`<br>
**`image_urls`**`: list:[url1,url2,...]:list of urls for images that were found within body of the tweet!`<br>
**`category`**`: str:str:manually labeled category type for this user`<br>
label meanings for category = {'a':'artist','b':'businessLeader','c':'company','p':'politician','t':'athlete'}

In [9]:
#check what % of top100's tweets were from what category
res['category'].value_counts()/len(res)

artist            0.636909
company           0.191065
athlete           0.077660
politician        0.067027
businessLeader    0.027339
Name: category, dtype: float64

In [10]:
#write out the *csv [and *pkl] file! good to go!
res.to_csv('./top100users_and_timelines_EXTENDED_PLUS.csv',index=False)
res.to_pickle('./top100users_and_timelines_EXTENDED_PLUS.pkl')

# BELOW: testing out some merge features. nothing important

In [19]:
df_a = pd.read_csv('./top100users_and_timelines.csv',index_col=None, lineterminator='\n')
df_b = pd.read_csv('./top100users_and_timelines_EXTENDED.csv',index_col=None, lineterminator='\n')



Index(['tweet_id', 'tweet_truncated', 'date', 'tweet_source', 'tweet_coord',
       'tweet_place', 'text', 'text_noMentions', 'is_quote_status',
       'is_reply_to_status', 'is_reply_to_user', 'numMentions', 'image_urls',
       'retweet_count', 'favorite_count', 'user_id', 'user_verified',
       'user_description_text', 'user_followers_count', 'user_friends_count',
       'user_listed_count', 'user_favourites_count', 'user_statuses_count',
       'user_location', 'user_created_year', 'user_created_month',
       'user_geo_enabled', 'user_img_url', 'user_banner_url', 'category'],
      dtype='object')

In [32]:
cols = list(set(df_b.columns) - set(df_a.columns))
cols.append('tweet_id')
cols

['tweet_coord',
 'user_banner_url',
 'user_created_year',
 'tweet_source',
 'tweet_truncated',
 'user_geo_enabled',
 'user_location',
 'tweet_place',
 'category',
 'image_urls',
 'user_img_url',
 'user_created_month',
 'tweet_id']

In [69]:
cols = list(set(df_b.columns) - set(df_a.columns))
cols.append('tweet_id')
tweets = pd.merge(df_a, df_b[cols], on=['tweet_id'], how='left')

In [70]:
df_a.shape

(203256, 18)

In [71]:
df_b.shape

(203993, 30)

In [72]:
tweets.shape

(203256, 30)

In [73]:
print(tweets.groupby('user_id').category.value_counts())

user_id     category      
428333      company           2557
742143      company           2034
759251      company           1095
783214      company           2815
807095      company           1441
813286      politician        2842
2557521     company           2190
5988062     company           2021
10228272    company           2596
11348282    company           2353
14230524    artist            2376
14920785    artist            2906
15485441    artist            2809
15846407    artist            3020
16409683    artist            2875
17471979    company           2977
17919972    artist              71
18681139    artist            2517
18839785    politician        2859
18863815    artist            2575
19248106    artist            2259
19397785    artist            2819
19426551    athlete           1441
19895282    artist            3050
20322929    artist            1876
20536157    company           2790
21111883    artist            2303
22940219    artist            [output truncated]

In [74]:
tweets

Unnamed: 0,tweet_id,date,user_id,text,text_noMentions,is_quote_status,is_reply_to_status,is_reply_to_user,numMentions,user_verified,...,user_created_year,tweet_source,tweet_truncated,user_geo_enabled,user_location,tweet_place,category,image_urls,user_img_url,user_created_month
0,1058069851678199809,Thu Nov 01 18:54:45 +0000 2018,27260086,@carlyraejepsen Congrats on the new song!,Congrats on the new song!,False,True,True,1,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
1,1057450782528684032,Wed Oct 31 01:54:47 +0000 2018,27260086,https://t.co/Ehx7cDu0Nw,NO_USER_MENTIONS,False,False,False,0,True,...,2009.0,iphone,False,False,,,artist,['http://pbs.twimg.com/media/DqzR4XyU8AEboPA.j...,http://pbs.twimg.com/profile_images/8982953118...,3.0
2,1057450701737979904,Wed Oct 31 01:54:28 +0000 2018,27260086,@torikelly I have listened to this album 10 ti...,I have listened to this album 10 times in a ro...,False,True,True,1,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
3,1056709958991892480,Mon Oct 29 00:51:01 +0000 2018,27260086,Praying for Pittsburgh,NO_USER_MENTIONS,False,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
4,1055173683222904832,Wed Oct 24 19:06:24 +0000 2018,27260086,Wishing everyone an amazing day,NO_USER_MENTIONS,False,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
5,1055173432101519361,Wed Oct 24 19:05:25 +0000 2018,27260086,Wow 3 years. Thanks https://t.co/1ou2v3MbCA,NO_USER_MENTIONS,True,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
6,1055173147215970304,Wed Oct 24 19:04:17 +0000 2018,27260086,Living in the US I see how this country affect...,NO_USER_MENTIONS,False,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
7,1040713450836713472,Fri Sep 14 21:26:36 +0000 2018,27260086,👨🏻👨🏻 https://t.co/I9KVh2IN1s,NO_USER_MENTIONS,True,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
8,1033053381504299011,Fri Aug 24 18:08:14 +0000 2018,27260086,The end is FIRES. What’s not to love https://t...,NO_USER_MENTIONS,True,False,False,0,True,...,2009.0,iphone,False,False,,,artist,,http://pbs.twimg.com/profile_images/8982953118...,3.0
9,1030573396550070272,Fri Aug 17 21:53:39 +0000 2018,27260086,"Meet the newest Bieber, my baby sister Bay Bie...",NO_USER_MENTIONS,False,False,False,0,True,...,2009.0,iphone,False,False,,,artist,['http://pbs.twimg.com/media/Dk1VChSU0AATpXg.j...,http://pbs.twimg.com/profile_images/8982953118...,3.0
