In [7]:
## values
#app_name <- "INSERT_APP_NAME_HERE"
#key <- "INSERT_KEY_HERE"
#secret <- "INSERT_SECRET_KEY_HERE"

## create app
#app <- httr::oauth_app(app_name, key, secret)

## create token (must be interactive session)
#token <- httr::oauth1.0_token(
#    httr::oauth_endpoints("twitter"),
#    app, cache = FALSE
#)

In [2]:
## read and print token
(token <- readRDS("token.rds"))

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

In [15]:
## expand full path to token
path_to_token <- normalizePath("token.rds")

path_to_token

In [16]:
## create env variable TWITTER_PAT (with path to saved token)
envvar <- paste0("TWITTER_PAT=", path_to_token)

envvar

In [17]:
## save as .Renviron file (or append if the file already exists)
cat(envvar, file = "~/.Renviron", fill = TRUE, append = TRUE)

Normally the .Renviron file is processed on startup. However, to make sure the current R session registers the environment variable without having to restart the entire session, we can use the `readRenviron()` function.

In [18]:
## refresh .Renviron variables
(readRenviron("~/.Renviron"))

In [19]:
Sys.getenv("TWITTER_PAT")

Now that we can assume the path to our Twitter token is stored as an environment variable, we can easily write a function that locates and reads-in the token.

In [20]:
## function to load twitter token
read_twittertoken <- function() {
    readRDS(path_to_token)
}

## test out function
read_twittertoken()

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

If we keep running the above code, we'll keep adding new lines to our environment file. In addition to creating a mess in your .Renviron file, each successive line will override the previous value. In other words, you're doomed to make a mistake; and when you do, it will override the times that worked. 

So, to fix this problem, let's take the code we used to create and save the token as an environment variable and turn it into a single, useful function.

In [21]:
set_renv_token <- function(path_to_token, override = FALSE) {
    ## check path
    stopifnot(
        is.character(path_to_token),
        file.exists(path_to_token)
    )
    ## expand to full path
    path_to_token <- normalizePath(path_to_token)

    ## store path to .Renviron
    renv <- normalizePath("~/.Renviron")
    
    ## if override = false and there's already a TWITTER_PAT, stop
    ## else override and there's already a TWITTER_PAT, then drop TWITTER_PAT and
    ## save new .Renviron
    if (!override && !identical(Sys.getenv("TWITTER_PAT"), "")) {
        stop("There's already a TWITTER_PAT. Use `override = TRUE` to replace.",
            call. = FALSE)
    } else if (!identical(Sys.getenv("TWITTER_PAT"), "") && 
               file.exists(renv)) {
        con <- file(renv)
        x <- readLines(con, warn = FALSE)
        close(con)
        x <- grep("^TWITTER_PAT", x, invert = TRUE, value = TRUE)
        writeLines(x, renv)
    }
    
    ## create env variable TWITTER_PAT (with path to saved token)
    envvar <- paste0("TWITTER_PAT=", path_to_token)
    
    ## save as .Renviron file (or append if the file already exists)
    cat(envvar, file = renv, fill = TRUE, append = TRUE)
}

In [23]:
set_renv_token("token.rds", TRUE)

In [24]:
read_twittertoken()

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

### Search API

Now let's create a function that allows us to query [Twitter's standard search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets). In the code below, I've included all the documented parameters (see note for explanation of the additional `tweet_mode` parameter), setting the optional parameters to `NULL` and making some judgment calls about other ones (e.g., `result_type` and `include_entitities`).

*Note*: in order to return the full (non-truncated) text of a tweet, a [recent change by Twitter](https://developer.twitter.com/en/docs/tweets/tweet-updates) requires all requests for data on Twitter statuses include the paramater `tweet_mode=extended`.

In [25]:
## search query function
search_twitter <- function(q, geocode = NULL, 
                           lang = NULL, 
                           locale = NULL, 
                           result_type = "recent", 
                           count = 100, 
                           until = NULL, 
                           max_id = NULL, 
                           include_entities = TRUE) {
    ## URL scheme and hostname
    base_url <- "https://api.twitter.com"
    ## include the API version number as part of the path
    path <- "1.1/search/tweets.json"
    ## check result type
    if (!result_type %in% c("recent", "popular", "mixed")) {
        stop("result_type must be one of recent, popular, or mixed", 
            call. = FALSE)
    }
    ## build query parameters
    params <- list(
        q = q,
        geocode = geocode,
        lang = lang,
        locale = locale,
        result_type = result_type,
        count = count,
        until = until,
        max_id = max_id,
        include_entitities = include_entities,
        tweet_mode = "extended"
    )
    ## send GET request
    httr::GET(base_url, path = path, query = params, 
              httr::config(token = read_twittertoken()))
}

In [26]:
## execute search for all tweets mentioning "rstats" (this will include hashtags)/
rstats <- search_twitter("rstats")

In [27]:
## view the response object
rstats

Response [https://api.twitter.com/1.1/search/tweets.json?q=rstats&result_type=recent&count=100&include_entitities=TRUE&tweet_mode=extended]
  Date: 2018-01-28 22:25
  Status: 200
  Content-Type: application/json;charset=utf-8
  Size: 663 kB


In [28]:
## parse as text (convert response object to json)
js <- httr::content(rstats, as = "text", encoding = "UTF-8")

In [29]:
## convert json character vector to R list
d <- jsonlite::fromJSON(js)

In [30]:
str(d, 1)

List of 2
 $ statuses       :'data.frame':	100 obs. of  31 variables:
 $ search_metadata:List of 9


It looks like all the good stuff is in "statuses", so let's inspect two levels down in `d$statuses`.

In [31]:
df <- d$statuses
str(df, 2)

'data.frame':	100 obs. of  31 variables:
 $ created_at               : chr  "Sun Jan 28 22:24:24 +0000 2018" "Sun Jan 28 22:24:12 +0000 2018" "Sun Jan 28 22:23:52 +0000 2018" "Sun Jan 28 22:23:16 +0000 2018" ...
 $ id                       : num  9.58e+17 9.58e+17 9.58e+17 9.58e+17 9.58e+17 ...
 $ id_str                   : chr  "957741173598248960" "957741123371438082" "957741040332492800" "957740888142184448" ...
 $ full_text                : chr  "RT @gp_pulipaka: Free eBook: Azure Serverless Computing Cookbook. #BigData #MachineLearning #DataScience #AI #A"| __truncated__ "RT @DataCamp: Time series #data in #Rstats: xts cheat sheet - https://t.co/OYN5vi1ez7 #datascience https://t.co/Ettn1iUTVz" "RT @gp_pulipaka: Free eBook: Azure Serverless Computing Cookbook. #BigData #MachineLearning #DataScience #AI #A"| __truncated__ "RT @DeepSingularity: A Technical Overview of Azure Databricks. #BigData #MachineLearning #DataScience #AI #Anal"| __truncated__ ...
 $ truncated                : 

The good news is that we have a lot of data. Not just the text of the tweets, but all sorts of other meta data. 

The bad news is that to conduct analysis on the data, we typically want to wrangle it into a data frame. For example, what if I wanted to see if the number of hashtags was predicted by the source of the tweet?

In [32]:
str(df$entities$hashtags, 2)

List of 100
 $ :'data.frame':	7 obs. of  2 variables:
  ..$ text   : chr [1:7] "BigData" "MachineLearning" "DataScience" "AI" ...
  ..$ indices:List of 7
 $ :'data.frame':	3 obs. of  2 variables:
  ..$ text   : chr [1:3] "data" "Rstats" "datascience"
  ..$ indices:List of 3
 $ :'data.frame':	7 obs. of  2 variables:
  ..$ text   : chr [1:7] "BigData" "MachineLearning" "DataScience" "AI" ...
  ..$ indices:List of 7
 $ :'data.frame':	7 obs. of  2 variables:
  ..$ text   : chr [1:7] "BigData" "MachineLearning" "DataScience" "AI" ...
  ..$ indices:List of 7
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	2 obs. of  2 variables:
  ..$ text   : chr [1:2] "rstats" "DataScience"
  ..$ indices:List of 2
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	3 obs. of  2 variables:
  ..$ text   : chr [1:3] "TidyData" "rstats" "DataScience"
  ..$ indices:List of 3
 $ :'data.fram

As you can see, the hashtags object consist of 100 data frames, some of which have zero observations. So, we'll have to clean this up. I've done just that in the code below by first extracting the text of hashtags and then by replacing the NULL returns (data frames with zero observations and, consequently, no "text" variable) with a NA [of class character] value. The list of hashtags is then added to the `df` data frame, using the `I()` function to tell R that we know it's a recursive (more than one observation per) list. Finally, the number of hashtags are counted and added to the data frame as a variable named `hashtag_count`.

In [33]:
## extract text of hashtags
hashtags <- lapply(df$entities$hashtags, "[[", "text")
hashtags[0:10]

In [34]:
## replace nulls with missing
hashtags[lengths(hashtags) == 0L] <- NA_character_

In [35]:
hashtags[0:10]

In [36]:
## add to df object
df$hashtags <- I(hashtags)
df$hashtags[0:5]

[[1]]
[1] "BigData"         "MachineLearning" "DataScience"     "AI"             
[5] "Azure"           "Serverless"      "IoT"            

[[2]]
[1] "data"        "Rstats"      "datascience"

[[3]]
[1] "BigData"         "MachineLearning" "DataScience"     "AI"             
[5] "Azure"           "Serverless"      "IoT"            

[[4]]
[1] "BigData"         "MachineLearning" "DataScience"     "AI"             
[5] "Analytics"       "HDInsight"       "DataLakes"      

[[5]]
[1] "rstats"


In [37]:
## calculate number of hashtags
df$hashtag_count <- lengths(hashtags)
df$hashtag_count[0:5]

In [38]:
names(df)

In [39]:
head(df$source)

The source includes html code. Fortunately, we can extract the key text with relative ease using a regular expression like the one below:

In [40]:
df$source <- stringr::str_extract(df$source, "(?<=\\>)[^<]+")

In [41]:
head(df$source)

Now that we've cleaned up these variables, let's run poisson regression to analyze the source as a predictor of the count variable representing the number of hashtags.

In [42]:
## poisson regression model
m1 <- glm(hashtag_count ~ source, df, family = poisson)

## summarize results
summary(m1)


Call:
glm(formula = hashtag_count ~ source, family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5794  -0.6149   0.0000   0.2660   2.6956  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    1.253e+00  3.780e-01   3.314 0.000918 ***
sourceCalcaware                6.931e-01  5.345e-01   1.297 0.194714    
sourceCRANberries Feed        -1.253e+00  6.901e-01  -1.815 0.069458 .  
sourceFenix 2                 -1.253e+00  1.069e+00  -1.172 0.241256    
sourceMachine learning Bot 6  -5.596e-01  8.018e-01  -0.698 0.485200    
sourceNode RED                -5.596e-01  8.018e-01  -0.698 0.485200    
sourcePaper.li                -5.596e-01  8.018e-01  -0.698 0.485200    
sourceRight Relevance         -1.253e+00  1.069e+00  -1.172 0.241256    
sourceRoundTeam               -5.596e-01  8.018e-01  -0.698 0.485200    
sourceRstats1234              -7.538e-01  4.226e-01  -1.784 0.07446