# Twitter data collection and analysis using R
### A case study of COVID keyword co-occurrence on Saturday evening, May 30, 2020, in Paris

In this Jupyter notebook we show how to collect data from Twitter using the standard Twitter API and how to perform some analysis using `rtweet`, specifically geocoding and co-occurrence matrices.

The data collected correspond to the activity on Twitter within a radius of 50 kilometers around Paris on the evening of Saturday, May 30, 2020, between 18:45:00 and 22:45:00, coinciding with the easing of the COVID lockdown in France. This provided an acceptable testing ground for this type of quantitative analysis. We collected a total of 128,434 unique tweets.

### Job schedule

We used a Unix scheduler called `cron` to set up repeated tasks. In our case, the task was to execute our R script every 20 minutes. This can be configured from the Linux bash shell as follows:

1. Enter the editor using `crontab -e`
2. Within the crontab, add a line containing the schedule, the path, the `Rscript` executable and the script file. In our case, we ran the task every 20 minutes to avoid problems with the API rate limit. Thus, we added the following line:
`*/20 * * * * cd /path/to/file; Rscript collect.R`
3. Save the changes and exit the editor (`Ctrl+X` in nano)
4. Start cron service: `sudo service cron start`
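
As a rough check on this schedule, the sketch below lists the times at which a `*/20` minute field would fire inside the collection window reported above. It assumes that the server clock was set to Paris time and that cron was running for the whole window; the variable names are illustrative only.

```r
# Collection window reported above (assumed to be Paris local time)
window_start <- as.POSIXct("2020-05-30 18:45:00", tz = "Europe/Paris")
window_end   <- as.POSIXct("2020-05-30 22:45:00", tz = "Europe/Paris")

# "*/20" in the minute field fires at minutes 0, 20 and 40 of every hour,
# which gives 72 runs over a full day
day_start <- as.POSIXct("2020-05-30 00:00:00", tz = "Europe/Paris")
all_runs  <- seq(day_start, by = "20 min", length.out = 72)

# Keep only the runs that fall inside the collection window
runs_in_window <- all_runs[all_runs >= window_start & all_runs <= window_end]

format(runs_in_window, "%H:%M")
length(runs_in_window)
```

Under these assumptions the script would have been launched 12 times, from 19:00 to 22:40.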

### Collecting Tweets

The following code contains:

1. A shebang that invokes an interpreter for executing the script: `#!/usr/bin/env Rscript`.
2. The API app name and keys.
3. The function `search_tweets` (which requires a [Twitter API account](https://developer.twitter.com/en)), used to define the keywords, language, `retryonratelimit`, geocode and number of tweets to be retrieved.
4. Specification of directories.
5. Code to save the file.

In [None]:
#!/usr/bin/env Rscript
#
#

library(rtweet)

# Twitter API
create_token(
  app = "your_api_name",
  consumer_key = "xxxxxxx",
  consumer_secret = "xxxxxxx",
  access_token = "xxxxxxx",
  access_secret = "xxxxxxx"
)

# 50KM radius around Paris
geocode <-  "48.8534,2.3486,50km"

#Search specifications
newTweets <- search_tweets(q = "covid", 
                    lang = "fr",
                    retryonratelimit = FALSE, 
                    geocode = geocode,
                    include_rts = FALSE, 
                    n = 10000)


# Specify directory
dirPath <- "/path/to/file/"

# Create directory for storage
if(!dir.exists(paste0(dirPath, "tweets/"))){
  dir.create(paste0(dirPath, "tweets/"))  # create the same path that was checked above
}

# Write csv with date (I am using Sys.time for this search. If you want to space the search 
# in days you can save the files using Sys.Date() instead)
save_as_csv(newTweets, paste0(dirPath,"tweets/",format(Sys.time(),"%d_%m_%Y_%H_%M_%S"), ".csv"),
            prepend_ids = TRUE, na = "",
            fileEncoding = "UTF-8")
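
The directory check and the timestamped file name can be tried in isolation. The following sketch uses a temporary directory and a small hypothetical data frame (`demo_tweets`) in place of the real search results, and base R's `write.csv` rather than `save_as_csv`, so that it runs without API credentials.

```r
# Temporary location standing in for dirPath (assumption for this demo)
dir_path <- file.path(tempdir(), "twitter_demo")
out_dir  <- file.path(dir_path, "tweets")

# Create the storage directory only if it does not exist yet;
# recursive = TRUE also creates missing parent directories
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Hypothetical stand-in for the data returned by search_tweets()
demo_tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("premier tweet", "second tweet"),
  stringsAsFactors = FALSE
)

# One file per run, named after the time of the run (day_month_year_hour_min_sec)
out_file <- file.path(
  out_dir,
  paste0(format(Sys.time(), "%d_%m_%Y_%H_%M_%S"), ".csv")
)
write.csv(demo_tweets, out_file, row.names = FALSE, fileEncoding = "UTF-8")

list.files(out_dir)
```

Because the file name changes with every run, repeated executions by the scheduler add new files instead of overwriting earlier ones, provided two runs do not start within the same second.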



###  Île-de-France geographical distribution of the harvested tweets for COVID

In [None]:
library(rtweet)
library(tidyverse)
library(reshape2)
library(ggplot2)
library(ggridges)
library(lubridate)
library(magrittr)  # provides the %<>% operator used below
library(maps)
library(quanteda)

# List all files
allFiles <- paste0("tweets/", list.files("tweets/"))

# Read all csv files in the folder and create a list of dataframes
mergeTweets <- lapply(allFiles , read.csv)

# Combine each dataframe in the list into a single dataframe
allTweets <- do.call("rbind", mergeTweets)

# Write CSV
write_as_csv(allTweets, file_name = "gotTwitter_2.csv")

# Read final dataset
allTweets <- read_twitter_csv("/your_path/gotTwitter_2.csv", unflatten = T)

# Convert UTC to Paris local time (Europe/Paris)
allTweets %<>% dplyr::mutate(created_at = as_datetime(created_at, tz = "UTC")) %>%
  dplyr::mutate(created_at = with_tz(created_at, tzone = "Europe/Paris"))

# Produce lat and lng coordinates
allTweets <- lat_lng(allTweets)
# Plot
par(mar = rep(1, 4))
#map("france", lwd = .25)
# Load the package

library(raster)

# Administrative boundaries of France (GADM level 3)
FranceFormes <- getData(name="GADM", country="FRA", level=3)

# Select a single region: Île-de-France
# List the variable names of the FranceFormes object
names(FranceFormes)

# List the names of the regions of France
FranceFormes$NAME_1

# Select the region
FranceFormesParis <- subset(FranceFormes, NAME_1=="Île-de-France")
plot(FranceFormesParis, main= "Île-de-France geographical distribution of the harvested tweets for COVID")

# Plot lat and lng points onto the regional map
with(allTweets, points(lng, lat,
                       pch = 16, cex = .25,
                       col = rgb(.8, .2, 0, .2)))

![](Paris.jpg)
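
To check whether plotted points respect the 50 km search radius, the great-circle distance from the search centre can be computed with the haversine formula. This is a minimal base R sketch: `haversine_km` and the `demo_points` data frame are hypothetical, and the example coordinates for Versailles and Orléans are approximate.

```r
# Great-circle distance in km between points given in decimal degrees
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  d_lat <- (lat2 - lat1) * to_rad
  d_lng <- (lng2 - lng1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_lng / 2)^2
  # pmin guards against rounding slightly above 1 before asin()
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Centre of the search, taken from the geocode string "48.8534,2.3486,50km"
centre_lat <- 48.8534
centre_lng <- 2.3486

# Hypothetical points with the lat/lng columns that lat_lng() adds
demo_points <- data.frame(
  place = c("Versailles", "Orleans"),
  lat = c(48.8049, 47.9029),
  lng = c(2.1204, 1.9093)
)

demo_points$dist_km <- haversine_km(centre_lat, centre_lng,
                                    demo_points$lat, demo_points$lng)
demo_points$within_radius <- demo_points$dist_km <= 50
demo_points
```

The same function could be applied to the `lat` and `lng` columns of `allTweets` to flag tweets whose coordinates fall outside the requested radius.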

### Analysis of co-occurrence of keywords related to COVID

In this section we will tokenise the tweets, that is, we will break the tweets into individual linguistic units (tokens that can comprise single or multiple words, such as n-grams). Tokenisation will involve cleaning the tweets to get rid of irrelevant elements (e.g. separators, punctuation, URLs, etc.).
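
To make the cleaning and n-gram steps concrete, here is a simplified base R sketch. It only approximates the idea and is not how quanteda's `tokens()` works internally; the helper names `simple_tokens` and `bigrams` and the example tweet are hypothetical.

```r
# Simplified tokeniser: an illustration only, not quanteda's algorithm
simple_tokens <- function(text) {
  text <- gsub("https?://[^[:space:]]+", " ", text)  # drop URLs
  text <- gsub("[[:punct:]]", " ", text)             # drop punctuation, splitting hyphenated words
  text <- gsub("[[:digit:]]+", " ", text)            # drop numbers
  text <- tolower(text)
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Join each pair of adjacent tokens with "_" to form bigrams
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}

demo_tweet <- "Covid-19 in Paris: see https://t.co/abc123 now!!"
unigrams <- simple_tokens(demo_tweet)

# Unigrams followed by bigrams, i.e. n-grams of size 1 and 2
c(unigrams, bigrams(unigrams))
```

Each tweet thus yields both single words and pairs of adjacent words as tokens.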

Then we will create a [document-feature matrix (DFM)](https://www.rdocumentation.org/packages/quanteda/versions/2.0.1/topics/dfm). Our `dfmatrix` contained 19,311,849,976 elements. To investigate the co-occurrence of a pair of elements A and B, we followed the method proposed by de Abreu e Lima (2019). Let $CO_{A,B}$ be the co-occurrence of any pair of elements, then:

\begin{equation}
CO_{A,B} = \sum_{i=1}^{n} 1_{T_i}(\{A,B\}), \qquad 1_{T_i}(\{A,B\}) := \begin{cases} 1, & \text{if } \{A,B\} \subseteq T_i \\ 0, & \text{otherwise} \end{cases}
\end{equation}

where $T_i$ is the $i$-th tweet, treated as a mathematical set with as many elements as tokens, and $n$ is the total number of tweets. The summation over the indicator function $1_{T_i}(\{A,B\})$ counts the tweets in which both A and B are mentioned, which can be used as a metric for the association between two tokens.
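
The definition can be illustrated on a handful of toy tweets. The sketch below is base R only and uses made-up data; it mirrors the formula directly and is not the routine that quanteda uses to build a feature co-occurrence matrix.

```r
# Each toy tweet is represented by the set of its tokens (hypothetical data)
toy_tweets <- list(
  c("macron", "covid", "paris"),
  c("covid", "paris"),
  c("macron", "police"),
  c("covid", "macron")
)

# Indicator function: 1 if both A and B belong to the tweet, 0 otherwise
indicator <- function(tweet, a, b) as.integer(all(c(a, b) %in% tweet))

# CO_{A,B}: sum of the indicator over all tweets
co_occurrence <- function(tweets, a, b) {
  sum(vapply(tweets, indicator, integer(1), a = a, b = b))
}

co_occurrence(toy_tweets, "macron", "covid")
co_occurrence(toy_tweets, "police", "paris")

# The same counts for every pair at once: build a binary tweet-by-token
# incidence matrix and multiply its transpose by itself
vocab <- sort(unique(unlist(toy_tweets)))
incidence <- t(sapply(toy_tweets, function(tweet) as.integer(vocab %in% tweet)))
colnames(incidence) <- vocab
co_matrix <- crossprod(incidence)
co_matrix
```

In `co_matrix`, each off-diagonal entry is the co-occurrence count for a pair of tokens, and each diagonal entry is the number of tweets containing that token.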

For the current test, we used the names of a number of political figures, along with relevant keywords used in the political arena at the time. This way, our co-occurrence matrix was 31 x 31, holding the counts for these terms in our `dfmatrix`.

For illustration purposes we add a visualization of the feature co-occurrence matrix as a network of interactions between terms. The minimum frequency was set to 0.1. Edge width represents the relative frequency of connections between terms.

In [None]:
# Tokenize words
tkn <- tokens(allTweets$text,
              remove_separators = T,
              remove_symbols = T,
              remove_punct = T,
              remove_url = T,
              split_hyphens = T,
              remove_numbers = T) %>%
  tokens_ngrams(n = 1:2)

dfmatrix <- dfm(tkn, tolower = T,
              remove = stopwords("french"))

# Bigrams built by tokens_ngrams() are joined with "_", hence "Le_Pen"
gotChars <- c("Trump","Johnson","Macron","Merkel","Hidalgo","Dati", "Mélenchon", "Le_Pen",
              "Buzyn", "Belliard", "Bournazel", "Federbusch", "Gantzer", "Simonnet", "Griveaux", "Benjamin",
              "covid","climate", "liberte", "egalite", "fraternite", "coronavirus",
              "sante", "police", "bondy", "musk", "nasa", "morts", "paris", "france", "europe")

gotFcm <- dfm_select(dfmatrix, pattern = gotChars) %>%
  fcm()

textplot_network(gotFcm, min_freq = 0.1,
                 edge_alpha = .7,
                 edge_size = 8)

![](Network.jpg)

### References:

Dimension statistiques and sociétés. (Consulted 5/31/2020). Retrieved from http://dimension.usherbrooke.ca/dimension/ssrcartes.html

de Abreu e Lima, F. (2019). poissonisfish: Twitter data analysis in R. Retrieved from https://poissonisfish.com/2019/10/09/twitter-data-analysis-in-r/