Tweets of multiple Twitter-Accounts #136

Closed
renegro90 opened this Issue Nov 15, 2017 · 6 comments

renegro90 commented Nov 15, 2017

Hey, thanks for fixing the issues about the authorization method and the data output yesterday :)

Now I'm wondering whether it's possible to get the maximum number of tweets (3,200 per account) for a large sample, e.g. 1,000 accounts.

I already tried something like this:
tmls_flw <- get_timelines(c("cnn", "BBCWorld", "foxnews"), n = 3200, retryonratelimit = TRUE)
But it didn't work the way I expected. I'm getting a total of 3,200 tweets, not 3,200 from each account.

Is there a way to tell get_timelines: "Hey R, give me the maximum number (3,200 per account) of recent tweets for each of these accounts", even for such a large number of accounts?

Or do I have to code it like this, for every account I want to mine?
flw1 <- get_timeline("cnn", n = 3200)
flw2 <- get_timeline("bbc", n = 3200)
flw3 <- get_timeline("fox", n = 3200)

Thanks in advance

Owner

mkearney commented Nov 15, 2017

The code worked for me:

> tmls_flw <- get_timelines(c("cnn", "BBCWorld", "foxnews"), n = 3200, retryonratelimit = TRUE)
> tmls_flw
# A tibble: 9,649 x 42
            status_id          created_at user_id screen_name
 *              <chr>              <dttm>   <chr>       <chr>
 1 930797610092449792 2017-11-15 14:00:18  759251         CNN
 2 930794780812070913 2017-11-15 13:49:03  759251         CNN
 3 930792031496044544 2017-11-15 13:38:08  759251         CNN
 4 930789258218164224 2017-11-15 13:27:07  759251         CNN
 5 930786544159518720 2017-11-15 13:16:19  759251         CNN
 6 930784144744951808 2017-11-15 13:06:47  759251         CNN
 7 930783345948200961 2017-11-15 13:03:37  759251         CNN
 8 930778887403048960 2017-11-15 12:45:54  759251         CNN
 9 930775665586163714 2017-11-15 12:33:06  759251         CNN
10 930772881491005441 2017-11-15 12:22:02  759251         CNN
# ... with 9,639 more rows, and 38 more variables: text <chr>, source <chr>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_name <chr>,
#   place_full_name <chr>, place_type <chr>, country <chr>, country_code <chr>,
#   geo_coords <list>, coords_coords <list>, bbox_coords <list>

I actually haven't gotten around to adding retryonratelimit functionality to get_timelines() yet. Perhaps including that is causing some bug?

Otherwise, it looks like you'd burn through about 17 requests per user, which means you should be able to get the max number of statuses returned for 52 users every 15 minutes.

> rate_limit("get_timeline")
# A tibble: 1 x 6
                   query limit remaining         reset            reset_at
                   <chr> <int>     <int>        <time>              <dttm>
1 statuses/user_timeline   900       849 12.05776 mins 2017-11-15 08:15:46
# ... with 1 more variables: app <chr>
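
The arithmetic behind the "52 users every 15 minutes" figure can be written out as a short sketch. The limit of 900 and the roughly 17 requests per user come from this thread; the sample size of 1,000 accounts is taken from the original question, and the waiting-time estimate assumes a full 15-minute pause between windows, which is only a rough lower bound.

```r
## Back-of-the-envelope rate-limit arithmetic (numbers taken from this thread)
window_limit  <- 900   # statuses/user_timeline requests per 15-minute window
reqs_per_user <- 17    # approximate requests needed for ~3,200 tweets

## whole users that fit into one window
users_per_window <- window_limit %/% reqs_per_user
users_per_window
#> [1] 52

## assumed sample size, as in the original question
n_users   <- 1000
n_windows <- ceiling(n_users / users_per_window)
n_windows
#> [1] 20

## rough lower bound on total waiting time in hours
## (one 15-minute pause between consecutive windows)
wait_hours <- (n_windows - 1) * 15 / 60
wait_hours
#> [1] 4.75
```

So for about 1,000 accounts, the collection would be expected to take at least several hours of waiting on top of the download time itself.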

If you're dealing with a larger number of accounts than 52, then you'd probably want to set up a for loop. For example, let's say you have a vector, users, of screen names you'd like to get timeline data for. I'd execute a loop that looks something like this:

tmls <- vector("list", length(users))

for (i in seq_along(tmls)) {
  tmls[[i]] <- get_timeline(users[i], n = 3200)
  ## assuming full rate limit at start, wait for fresh reset every 52 users
  if (i %% 52L == 0L) {
    rl <- rate_limit("get_timeline")
    Sys.sleep(as.numeric(rl$reset, "secs"))
  }
  ## print update message
  cat(i, " ")
}

## merge into single data frame (do_call_rbind will preserve users data)
tmls <- do_call_rbind(tmls)

Side note, this actually returned slightly more than 3200 [unique] tweets per user, which I don't think I've seen before.

# A tibble: 3 x 3
      term     n   percent
     <chr> <int>     <dbl>
1 BBCWorld  3218 0.3335061
2  FoxNews  3216 0.3332988
3      CNN  3215 0.3331951
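
A per-account count like the one above can also be reproduced with base R alone. This is a minimal sketch on a toy data frame; the values are made up, and only the column names screen_name and status_id match the output shown earlier in this thread.

```r
## Toy stand-in for the merged timelines data (hypothetical values)
tmls_toy <- data.frame(
  screen_name = c("CNN", "CNN", "BBCWorld", "FoxNews", "FoxNews", "FoxNews"),
  status_id   = c("1", "2", "3", "4", "5", "5"),
  stringsAsFactors = FALSE
)

## drop duplicated statuses so that only unique tweets are counted
uniq <- tmls_toy[!duplicated(tmls_toy$status_id), ]

## number of unique tweets per account
counts <- table(uniq$screen_name)
counts
```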

Author

renegro90 commented Nov 16, 2017

Thanks for your reply.

I ran the loop code you suggested and it worked fine.

I just got multiple warnings that some pages do not exist. Could this be because there are no tweets on these timelines? If so, is it possible to add a condition like "if statuses_count <= 1, skip this account" or something similar? It would save me a lot of time and processing power.

Thanks in advance RG

mrmvergeer commented Nov 19, 2017

Hi @renegro90.
I had a similar problem. I had written a script similar to @mkearney's.
Because some accounts are set to private, their tweets can't be retrieved and the script stops. I fixed this by wrapping the get_timeline call in try(). Though maybe not elegant, it continues to collect the tweets of the remaining accounts. This is untested:

tmls <- vector("list", length(users))

for (i in seq_along(tmls)) {
  tmls[[i]] <- try(get_timeline(users[i], n = 3200))
  if (i %% 52L == 0L) {
    rl <- rate_limit("get_timeline")
    Sys.sleep(as.numeric(rl$reset, "secs"))
  }

  cat(i, " ")
}

tmls <- do_call_rbind(tmls)
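
One caveat with try() is that a failed call leaves a "try-error" object in the list, which may get in the way when the results are merged. A possible variant, sketched here with base R's tryCatch(), stores NULL for failed accounts and drops them before merging. The names collect_safely and fake_fetch are hypothetical helpers, not part of rtweet; in practice the fetch argument would be something like function(u) get_timeline(u, n = 3200), and the rate-limit pause from the loop above would still be needed.

```r
## Run `fetch` for every user; a failure yields NULL instead of stopping.
## `fetch` is a placeholder for the real call (an assumption, see above).
collect_safely <- function(users, fetch) {
  out <- vector("list", length(users))
  names(out) <- users
  for (i in seq_along(users)) {
    res <- tryCatch(
      fetch(users[i]),
      error = function(e) {
        message("Skipping ", users[i], ": ", conditionMessage(e))
        NULL
      }
    )
    ## out[i] <- list(res) keeps the slot even when res is NULL
    ## (out[[i]] <- NULL would delete the element instead)
    out[i] <- list(res)
  }
  failed <- vapply(out, is.null, logical(1))
  list(data = out[!failed], failed = users[failed])
}

## Stand-in fetcher (hypothetical) that fails for one "protected" account
fake_fetch <- function(u) {
  if (u == "private_user") stop("Not authorized.")
  data.frame(screen_name = u, text = "hello", stringsAsFactors = FALSE)
}

res <- collect_safely(c("cnn", "private_user", "BBCWorld"), fake_fetch)
res$failed
#> [1] "private_user"

## merge the successful results into a single data frame
tmls_ok <- do.call("rbind", res$data)
nrow(tmls_ok)
#> [1] 2
```

Keeping the failed screen names in a separate vector makes it easy to inspect or retry those accounts later.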

Owner

mkearney commented Nov 20, 2017

@renegro90 @mrmvergeer Thanks for following up on this!

Question: with the newest version (0.6.0) of rtweet, are these empty timelines creating errors or warnings? They should be creating warnings, so please let me know if you experience anything different!

Author

renegro90 commented Nov 21, 2017

@mkearney. Yes, I got plenty of warnings when running the code with ~100 accounts. After completing the computation, R says:

There were 50 or more warnings (use warnings() to see the first 50)

and

1: 34 - Sorry, that page does not exist.
2: Sorry, that page does not exist.

@mrmvergeer. Your code works. But the original script by @mkearney worked as well (I got the same output from both) and didn't stop. With your addition it's possible to see which user the script is currently working on (a bit like a progress bar).

@mkearney Is it possible (maybe as an interim step between the lookup_users and the get_followers steps) to dismiss accounts that are protected (I think this indicates that the timeline is set to private?), have fewer than x statuses in their timeline, and/or post in a language other than English?

Owner

mkearney commented Nov 21, 2017

@renegro90 You should be able to filter users using the protected and account_lang variables.

> ## users with public/english, public/french, private/english accounts respectively
> sns <- c("kearneymw", "Vachier_Lagrave", "mikewaynesworld")
>
> ## lookup users data
> (usr <- lookup_users(sns))
# A tibble: 3 x 20
     user_id                     name     screen_name      location
       <chr>                    <chr>           <chr>         <chr>
1 2973406683 "Mike Kearney\U0001f4ca"       kearneymw  Columbia, MO
2  157070052                      MVL Vachier_Lagrave Paris, France
3  174454226                       mw mikewaynesworld         SMDHU
# ... with 16 more variables: description <chr>, url <chr>, protected <lgl>,
#   followers_count <int>, friends_count <int>, listed_count <int>,
#   statuses_count <int>, favourites_count <int>, account_created_at <dttm>,
#   verified <lgl>, profile_url <chr>, profile_expanded_url <chr>,
#   account_lang <chr>, profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>
>
> ## view protected variable values
> usr$protected
[1] FALSE FALSE  TRUE
>
> ## view account_lang variable values
> usr$account_lang
[1] "en" "fr" "en"

So you could create a function to filter those like this:

## function to keep only English-language and public accounts.
filter_users <- function(x) {
  if (!is.data.frame(x) || !all(c("account_lang", "protected") %in% names(x))) {
    stop("Users data not found")
  }
  ## protected is TRUE for private accounts, so negate it to keep public ones
  x$user_id[x$account_lang == "en" & !x$protected]
}

Apply the filter_users function to the usr data from above:

> filter_users(usr)
[1] "2973406683"
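
The same idea can be extended to the "fewer than x statuses" part of the question. The following is only a sketch on a toy data frame: filter_users_min and its arguments are hypothetical, the values are made up, and the column names (user_id, account_lang, protected, statuses_count) are the ones listed in the lookup_users() output above.

```r
## Toy stand-in for lookup_users() output (hypothetical values)
usr_toy <- data.frame(
  user_id        = c("1", "2", "3", "4"),
  account_lang   = c("en", "fr", "en", "en"),
  protected      = c(FALSE, FALSE, TRUE, FALSE),
  statuses_count = c(5000L, 1200L, 300L, 0L),
  stringsAsFactors = FALSE
)

## keep public accounts in the given language with at least min_statuses tweets
filter_users_min <- function(x, lang = "en", min_statuses = 1L) {
  needed <- c("user_id", "account_lang", "protected", "statuses_count")
  if (!is.data.frame(x) || !all(needed %in% names(x))) {
    stop("Users data not found")
  }
  keep <- x$account_lang == lang &
    !x$protected &
    x$statuses_count >= min_statuses
  ## which() drops rows where the condition is NA
  x$user_id[which(keep)]
}

keep_ids <- filter_users_min(usr_toy)
keep_ids
#> [1] "1"
```

The resulting vector of user IDs could then be passed to the timeline loop, so that protected, empty, or non-English accounts are never requested in the first place.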