    Authors: Marta Krzyzanska (marta.krzyzanska.12@ucl.ac.uk)
             Chiara Bonacchi  (c.bonacchi@ucl.ac.uk)

# INTRODUCTION

In the next two tutorials we are going to:

- extract the data related to the @MicroPasts account from Twitter
- analyse the content of the Tweets with some basic text analysis techniques
- analyse the data about the users who included the @MicroPasts handle in their tweets

To start with, create a folder called *twitter* within your J://dhengage directory and set the R working directory there. A command to set the directory should look roughly like this:

In [None]:
setwd("J://dhengage//twitter")

To get data from Twitter we will use the *rtweet* library. In programming, libraries are collections of pre-defined routines that a program can use. In simpler words, they provide additional functions. For open-source software like R, anyone can write such functions and share them by publishing them in packages. If you do not have *rtweet* installed yet, use the following command:

In [None]:
install.packages("rtweet")

A new window should appear, prompting you to choose a CRAN mirror. Choose the one located in the UK and double click on it. You may need to wait for a moment before the package is installed. Then, load the library:

In [None]:
library("rtweet")
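
If you run this tutorial more than once, you may prefer not to reinstall a package every time. The following is a minimal sketch of one way to do that. The helper name `ensure_package` is made up for this illustration and is not part of R or rtweet:

```r
# Hypothetical helper: installs a package only if R cannot already find it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Returns TRUE (invisibly) if the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing gets installed in this example
ensure_package("stats")
```

In the tutorial you would call it with "rtweet" instead of "stats", and then load the library as above.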

# EXTRACTING DATA FROM TWITTER

To download data from Twitter we will need an access token for the Twitter API. There is a function in the *rtweet* library which allows you to create one, called 'create_token'. To check how it works type:

In [None]:
?create_token

As you can see, the function requires 3 arguments: name of the Twitter app created by the user, consumer key, and consumer secret. To get these you need to create a Twitter app. To do that:

- Log in to your Twitter account and make sure that you have added your phone number to your profile.

- Go to https://apps.twitter.com/ and click the *create New App* button.

- Under application details enter the name (e.g. *token_1* or anything else that is not already taken), description (e.g. *token for R*) and website (e.g. *http://example.com*). As the Callback URL enter *http://127.0.0.1:1410*, then accept the Developer Agreement and click the *create your Twitter application* button.

- Click the *modify app permissions* button and under Access choose *Read, Write and Access direct messages*; then click *Update Settings*.

- Now go to the *Keys and Access Tokens* tab to find your consumer key and consumer secret. Substitute these, together with your app name, for the appropriate arguments in the function (enter them as strings, i.e. in quotation marks):

In [None]:
token<-create_token(app = "mytwitterapp", "consumer_key",
                    "consumer_secret", cache = FALSE)

#### After you execute this command, a browser window should open and prompt you to authorize the API. Click *Authorize app* button and go back to R - Authentication should be completed and your token should be working.
####

Now save your newly-created token as an R object file, so that you can use it afterwards without going through the procedure again:

In [None]:
save(token, file = "token.R")
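
In a later R session the saved object can be brought back with *load()*. A small sketch of the round trip, using a made-up stand-in object (`token_demo`) and a temporary file rather than your real token:

```r
# Stand-in for the real token, used only to illustrate save() and load()
token_demo <- list(app = "mytwitterapp")
demo_file <- file.path(tempdir(), "token_demo.R")

save(token_demo, file = demo_file)  # write the object to disk
rm(token_demo)                      # simulate starting a new R session
load(demo_file)                     # restores the object under its original name
token_demo$app
```

With the real token, typing load("token.R") in the twitter directory should restore the variable *token* in the same way.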

Now we can access the Twitter API and extract tweets from the @MicroPasts account. To do that we are going to use the *get_timeline* function from the rtweet library. To find out what the function does and what information we need to feed into it, type:

In [None]:
?get_timeline

This should open a browser window with the documentation for the function. From the description, we can find out that it can return either the tweets posted by a given user or the tweets posted by the accounts that user follows. As we are interested in what @MicroPasts was tweeting, we will leave the *home* argument at its default value, FALSE. (Under Usage you can see the default settings of the function's parameters - if the default is what you want, you do not have to include that parameter.)


The Arguments section tells you what arguments you need to define:

- The screen name of the user, in our case "@MicroPasts" (it needs to be input as a string, i.e. in quotation marks).
- The number of tweets to return (input as a numeric value, so without quotation marks). We want all the tweets from the @MicroPasts account, so we are going to set this value to a very high number. Note that the Twitter API only returns about 3,200 of the most recent tweets from a timeline, however high this number is.
- For the token we will use our newly created token stored in the variable with the same name.
- We can keep the other arguments as defaults, so we do not need to include them.

The function should return a data frame, which we will store in the variable tweets:

In [None]:
tweets<-get_timeline(user="@MicroPasts", n=1000000000,token=token)
#This may take a moment

The variable *tweets* should now contain the data frame with the text of the tweets by @MicroPasts and the associated metadata.
We already looked at this table, so remind yourself:

##### (Q1) What metadata are available about the tweets (hint - what columns does the dataframe contain)?
##### (Q2) How many tweets were posted from @MicroPasts account (hint: each row represents one tweet)?
#####

If you don't remember how to check that, you will find the needed code on the last page in the answers section.

# TEXT ANALYSIS
The next few sections of this tutorial are based on the study published by Ben Marwick (2014), which included full documentation for the code. In this study, Marwick analysed the contents of tweets with techniques including term frequencies, associations, sentiment analysis and topic modelling, and modelled the users' network. We are going to apply some of these techniques to our dataset. To do that, we will need the text mining library 'tm', which contains functions that can be used to automate these tasks (the documentation for it can be found here: https://cran.r-project.org/web/packages/tm/tm.pdf). If you have not installed the package yet, install it and then load it into your R workspace:

In [None]:
install.packages("tm")
library(tm)

## Text preparation

We are going to start by calculating term frequencies, which represent the numbers of times each term appears in the tweets by @MicroPasts. To do that, first we need to extract the text of the tweets from the dataframe, i.e. store the column with the text in a separate variable:

In [None]:
texts<-tweets$text

The tm package introduces new data types: *documents*, which store texts, and *corpora*, which are collections of documents. The functions included in the package were designed to operate on these data types, so the first thing we need to do is to transform our data into these formats. Fortunately the package also provides some handy functions to do that. First we are going to transform each tweet in the *texts* variable into a document, which can be done with the function *VectorSource()*:

In [None]:
texts<-VectorSource(texts)

Then, transform the *texts* into a variable of type Corpus:

In [None]:
texts<-Corpus(texts)

Texts of tweets can sometimes contain odd characters that may cause errors during the analysis. To remove them, type:

In [None]:
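#content_transformer() wraps an ordinary function so that tm_map()
#returns documents rather than plain character strings
texts <- tm_map(texts, content_transformer(
    function(x) iconv(x, to='UTF-8', sub='byte')))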

Texts in the tweets contain a mixture of upper and lowercase characters, punctuation marks and numbers, which complicates the analysis. Before we count term frequencies, we need to remove punctuation and numbers, and bring all letters to lowercase:

In [None]:
texts <- tm_map(texts, content_transformer(tolower)) #changes all text to lowercase
texts <- tm_map(texts, removePunctuation) #removes punctuation
texts <- tm_map(texts, removeNumbers) #removes numbers
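
To see what these three steps do to a single text, here is a rough base R sketch on two made-up tweets. It only approximates what the tm functions do and is not a replacement for them:

```r
# Made-up tweets, for illustration only
demo <- c("Help us TRANSCRIBE 200 index cards!",
          "New 3D model: Bronze Age axe #1")

demo_clean <- tolower(demo)                        # lowercase
demo_clean <- gsub("[[:punct:]]", "", demo_clean)  # drop punctuation
demo_clean <- gsub("[[:digit:]]", "", demo_clean)  # drop digits
demo_clean
```

Notice that removing numbers turns "3d" into "d", which is worth remembering when you read the term lists later on.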

Some words in English, such as articles, appear very commonly in all texts because of the sentence structures, but are not very informative with regard to the text contents. Therefore, we are going to remove them before proceeding with the analysis:

In [None]:
texts <- tm_map(texts, removeWords, stopwords("english"))

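Words with the same meaning can have different endings, depending on their grammatical form, so we also need to simplify all the words to their *stems* (roots of the words that never change). This step relies on the *SnowballC* package, so install it first if R reports that it is missing: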

In [None]:
texts <- tm_map(texts, stemDocument, language = "english")

To calculate term frequencies, we are going to transform our corpus of documents into a *term-document matrix*, which records, for each term present in the corpus, how many times the term features in each document. Since terms shorter than 3 letters are not very informative, we will exclude them from the matrix. (Older versions of tm used the *minWordLength* option for this; recent versions use *wordLengths*, which gives the minimum and maximum word length.)

In [None]:
texts.dtm <- TermDocumentMatrix(texts, control=list(wordLengths=c(3, Inf)))

To get a better idea of what the term-document matrix actually looks like, use the function inspect():

In [None]:
inspect(texts.dtm[1:10,1:10])

As you can see, every row in this matrix represents a term and every column a specific document. We queried the first 10 rows (terms) and the first 10 columns (documents), but you can query this matrix in a similar way to a normal matrix or a dataframe (i.e. write the row and the column numbers you want to access in square brackets).
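
If the structure is still unclear, it may help to build a tiny term-document matrix by hand. The sketch below uses base R only and three made-up documents that are assumed to be already cleaned; tm builds its matrix differently internally, but the layout is the same:

```r
# Three made-up, already cleaned documents
docs <- c("bronze age axe", "bronze age sword", "axe model")

words  <- strsplit(docs, " ")                    # split each document into terms
doc_id <- rep(seq_along(words), lengths(words))  # document number for each term

# Cross-tabulate terms (rows) against documents (columns)
tdm_demo <- table(term = unlist(words), doc = doc_id)
tdm_demo
```

Each cell tells you how many times the term in that row occurs in the document in that column.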

##### Q3 How many times does the term 'britishmuseum' appear in the document number 2?


## Term frequencies

Now we can find the most frequent words with the function *findFreqTerms()*. We are going to print terms that appear at least 200 times, between 150 and 200 times, between 100 and 150 times, and between 50 and 100 times:

In [None]:
findFreqTerms(texts.dtm, lowfreq=200)
findFreqTerms(texts.dtm, lowfreq=150, highfreq=200)
findFreqTerms(texts.dtm, lowfreq=100, highfreq=150)
findFreqTerms(texts.dtm, lowfreq=50, highfreq=100)

As you can see, the most frequent terms are *dejpett*, which is the username of one of the co-founders of MicroPasts, *micropast* and *projec*. *jwexlerbm* and *chiarabonacchi* are the usernames of the other two co-founders of MicroPasts, and *nwilkinbm* was involved in one of the MicroPasts projects. From the most frequent terms you can see that @MicroPasts was tweeting a lot about the projects carried out together with the British Museum and the Mary Rose Museum, and specifically the projects related to the Bronze Age and the Portable Antiquities Scheme (hence the term *portabl*). Of course, terms associated with crowdsourcing and specific types of projects feature a lot as well.
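
The frequency of a term is simply the sum of its row in the term-document matrix. A small base R sketch on a made-up matrix (the terms and counts are invented for illustration) shows the idea behind the frequency bands used above:

```r
# Toy term-document matrix: rows are terms, columns are documents
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("museum", "project", "crowdsourc"),
                            c("1", "2", "3")))

freq <- rowSums(m)                   # total occurrences of each term
names(freq)[freq >= 3]               # terms that appear at least 3 times
names(freq)[freq >= 2 & freq <= 3]   # terms that appear between 2 and 3 times
```

Both bounds are inclusive here, so a term that sits exactly on a boundary is listed in two neighbouring bands.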

## Term associations

Now let's explore the associations between some of these terms. For example, let's see what terms are associated with Mary Rose Museum and photomasking:  

In [None]:
findAssocs(texts.dtm, "maryrosemuseum", 0.20)

0.20 in the above code refers to the minimum strength of the association and can be adjusted to see more or less strongly associated words. As you can see, Mary Rose Museum is really strongly associated with the terms bell and artefact, and with terms such as (3)dmodel and model. One of the projects on MicroPasts was the 3D modelling of the Mary Rose Bell and, as you can see, this project featured on Twitter a lot. You can find associations for other common terms in a similar way.
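
The strength of association here is a correlation: two terms are strongly associated if they tend to occur in the same documents. The following base R sketch computes such correlations on a made-up matrix of three terms and five documents; the counts are invented and the code only illustrates the idea, not tm's own implementation:

```r
# Toy term-document matrix: rows are terms, columns are five documents
m <- rbind(maryrosemuseum = c(1, 0, 1, 0, 1),
           bell           = c(1, 0, 1, 0, 0),
           bronz          = c(0, 1, 0, 1, 0))

# Correlation of 'maryrosemuseum' with every other term across documents
assoc <- cor(t(m))["maryrosemuseum", ]
assoc <- assoc[names(assoc) != "maryrosemuseum"]

# Keep only the terms that reach the 0.20 limit
assoc[assoc >= 0.20]
```

In this toy example *bell* appears in two of the three documents that mention *maryrosemuseum* and nowhere else, so it passes the limit, while *bronz* never appears together with it and is dropped.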

##### Q4 What terms are associated with photomasking? What objects have been frequently mentioned by @MicroPasts in the contexts of photomasking?
#####


Before we proceed to the next section, save your corpus and your term-document matrix to files, in case you need to query them later:

In [None]:
save(texts,file="texts.R")
save(texts.dtm,file="texts.dtm.R")

# Users analysis

Now that we know what @MicroPasts tweeted about, we are going to have a look at the other users with whom it interacted. For that, we will need to look at the tweets from other accounts that mention @MicroPasts. However, the Twitter API only allows you to search for tweets containing a specific term if they were tweeted within the last couple of weeks, which is not sufficient for us. Because of that, we will take another approach. I have downloaded all the tweets posted by the users who follow or are followed by the @MicroPasts account, and subsequently extracted only those that mention @MicroPasts. As this took a relatively long time and required some data manipulation that is beyond the scope of this course, we are not going to replicate it in class, but if you are interested in how the data was obtained or want to have a go at it yourself later on, the code is posted on GitHub: https://github.com/mkrzyzanska/Tutorial_data/blob/master/Data%20for%20tutorial.ipynb


Otherwise we will use a pre-compiled dataset stored in the file microPasts.csv. Download it from DUO and save in your twitter directory. Then, import it into R:

In [None]:
microPasts <- read.csv("microPasts.csv")

Have a look at the microPasts dataframe:

In [None]:
head(microPasts)

It is a similar dataframe to the one storing tweets from @MicroPasts account - the columns are the same, the only difference is that now it stores tweets from multiple accounts.


## Users statistics


First we are going to check how many of the Twitter users who followed or were followed by the @MicroPasts account mentioned it in their tweets. To do that, count the number of unique user ids in the dataframe:

In [None]:
length(unique(microPasts$user_id))

As you can see, there were 469 users who mentioned @MicroPasts in their tweets. Now let's have a look at the frequencies of tweets including the handle @MicroPasts per author and save them in a dataframe. Then order the frequencies in decreasing order and have a look at the table:

In [None]:
counts <- as.data.frame(table(microPasts$screen_name))
counts <- counts[with(counts,order(-Freq)),]
counts
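
If you want to check what *table()* and *order()* are doing here, try them on a handful of made-up screen names first. This is a small base R sketch; the names are invented:

```r
# Made-up screen names, one entry per tweet
demo_names <- c("anna", "ben", "anna", "cleo", "anna", "ben")

# Count tweets per author; the result has the columns demo_names and Freq
demo_counts <- as.data.frame(table(demo_names))

# A minus sign in order() sorts from the highest count to the lowest
demo_counts <- demo_counts[order(-demo_counts$Freq), ]
demo_counts
```

In the tutorial code the first column is called Var1 rather than demo_names, because the table was built from a column of a dataframe rather than from a named variable.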

Plot the number of tweets by author on a graph, to get a better idea about the distribution of tweet numbers. To do that, we will use the ggplot2 library, which includes useful functions for plotting different types of graphs. If you have not installed it yet, install it now and load it in your R workspace:

In [None]:
install.packages("ggplot2")
library("ggplot2")

Now input the following code to plot the graph:

In [None]:
ggplot() +
geom_point(data = counts, mapping = aes(x = reorder(Var1, Freq),
        y = Freq), size = 1) +
xlab("Author") +
ylab("Number of messages") +
coord_flip() +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.text.y = element_text (size = 5))

There are a lot of things going on in the code above. Firstly, we initialised the plot with the command *ggplot()*. Then, we plotted our points on the graph with *geom_point*. Within the ggplot2 library, these are points whose location is defined by the coordinates described in the mapping. We took the data used to locate the points from the dataframe *counts* that we created earlier, and we defined our mapping so that the x coordinate contains the usernames ordered by the number of tweets, and the y coordinate contains the numbers of tweets. Then we labelled our x axis as "Author" and our y axis as "Number of messages" and we flipped the coordinates, so that we can have the authors on the vertical axis. Then we defined the theme, which describes the visual parameters of the graph, such as font size, margins etc. You can find the full list of the parameters in the ggplot2 cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf), and adjust what your graph looks like.


As you can see, this graph is not very clear, because there are too many users to fit in the window. Since we can see that many of them posted very few tweets related to @MicroPasts, let's include only the ones who posted more than 5. To do that, subset only the users and user counts with a frequency of messages higher than 5:

In [None]:
counts2<-subset(counts,counts$Freq>5) #The first argument is the dataset
# from which you want to take a subset.
#The second argument says which rows to keep: in this case those where
#the value in the column 'Freq' is bigger than 5.

#Run the code for creating the graph again on the new dataset:

ggplot() +
geom_point(data = counts2, mapping = aes(x = reorder(Var1, Freq),
            y = Freq), size = 1) +
xlab("Author") +
ylab("Number of messages") +
coord_flip() +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.text.y = element_text (size = 5))

It should look better now. If not, you can try to resize the window with the graph, and the view should adjust.

##### Q5 Now let's have a look at the graph we just created. Does this pattern remind you of any of the previous graphs we saw during the course? What does it say about the authors who included the @MicroPasts handle in their tweets?
#####

Now save your plot in case you want to have a look at it later (or show it off on twitter ;) ):

In [None]:
dev.print(pdf,file="tweets_frequencies.pdf")

An alternative method would be to present this data in the form of histogram, but it shows the pattern less clearly:

In [None]:
hist(counts$Freq)

##### Q6 Check how many of the tweets with @MicroPasts handle were retweets and how many were original messages (hint: you can get that information from the is_retweet column in the dataframe, by transforming this column into a table or getting its summary)?
#####

We can also make a table showing, for each user, how many tweets were retweets and how many were original tweets. To do that, first subset the tweets that were retweets and those that were not, and get their frequencies for every user:

In [None]:
mp_retweets<-subset(microPasts, microPasts$is_retweet == TRUE)
mp_tweets<-subset(microPasts, microPasts$is_retweet == FALSE)
mp_retweets<-as.data.frame(table(mp_retweets$screen_name,
                                 mp_retweets$is_retweet))
mp_tweets<-as.data.frame(table(mp_tweets$screen_name,
                               mp_tweets$is_retweet))

mp_tweets$Var2 <- NULL #Drops unnecessary columns 
                       #(with true and false values)
mp_retweets$Var2 <- NULL 

Then merge the two tables, by the usernames (stored in column "Var1" in each table), and add suffixes indicating whether the frequency refers to tweets or retweets (*all* argument indicates whether to include the rows from either table that did not have equivalent in the second one, i.e. users who posted only tweets or only retweets):

In [None]:
ratio<-data.frame() #Initiate an empty data.frame
ratio<-merge(mp_tweets, mp_retweets, by = "Var1",all=TRUE,
        suffixes=c("_tweets","_retweets"))
ratio[is.na(ratio)]<-0 #Replaces the NA values in the dataframe with 0s
#so that when there were no retweets or tweets for a user the value is 0
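
To see what *all=TRUE* and the suffixes do, here is the same merge on two tiny made-up tables (the names `tw`, `rt_demo` and `both` and the counts are invented for illustration):

```r
tw      <- data.frame(Var1 = c("anna", "ben"),  Freq = c(4, 1))  # original tweets
rt_demo <- data.frame(Var1 = c("ben", "cleo"), Freq = c(2, 5))   # retweets

# all = TRUE keeps users that appear in only one of the two tables
both <- merge(tw, rt_demo, by = "Var1", all = TRUE,
              suffixes = c("_tweets", "_retweets"))
both[is.na(both)] <- 0   # users missing from one table get a count of 0
both
```

Here anna has no retweets and cleo has no original tweets; without all=TRUE both of them would be dropped from the merged table.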

On the basis of this table we can also calculate what fraction of all the tweets mentioning @MicroPasts handle are retweets. To do that, first add an empty column to the *ratio* dataframe:

In [None]:
ratio$ratio<-NA

Then, populate the column with the value of retweets as a fraction of all tweets:

In [None]:
ratio$ratio<-(ratio$Freq_retweets/
              (ratio$Freq_retweets+ratio$Freq_tweets))
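
As a quick worked example of this calculation, consider three made-up users (the counts are invented for illustration):

```r
demo <- data.frame(Var1          = c("anna", "ben", "cleo"),
                   Freq_tweets   = c(4, 1, 0),
                   Freq_retweets = c(0, 2, 5))

# Retweets as a fraction of all messages by each user
demo$ratio <- demo$Freq_retweets / (demo$Freq_retweets + demo$Freq_tweets)
demo
```

A ratio of 0 means that the user posted only original tweets, a ratio of 1 means that the user only retweeted, and values in between indicate a mix of the two.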

I found that adding this column for some reason creates an error while trying to plot the dataframe. However, there is a quick workaround for it - it works fine if you create another dataframe variable and store your dataframe in it:

In [None]:
r<-data.frame()
r<-ratio

Now we can plot the ratio of retweets containing the @MicroPasts handle to all messages containing @MicroPasts handle for each author:

In [None]:
ggplot() +
geom_point(data = r, mapping = aes(x = reorder(Var1, ratio),
y = ratio), size = 1) +
xlab("Author") +
ylab("Ratio of retweets to all tweets") +
coord_flip() +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.text.y = element_text (size = 5))

Again the graph does not look very pretty, but as you can see, there is a substantial number of users who posted only tweets (ratio=0) or only retweeted messages with the @MicroPasts handle (ratio=1). Let's keep that in mind, but for plotting purposes, again include only the users who posted more than 5 tweets overall and plot the graph again:

In [None]:
r<-subset(r,(r$Freq_tweets + r$Freq_retweets)>5)

#Plot the graph

ggplot() +
geom_point(data = r, mapping = aes(x = reorder(Var1, ratio),
y = ratio), size = 1) +
xlab("Author") +
ylab("Ratio of retweets to all tweets") +
coord_flip() +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.text.y = element_text (size = 5))

#Save the graph to the file

dev.print(pdf,file="retweets_ratio.pdf")

We can still see the tail of users who only retweeted, but it's much smaller, and there are no users with only original messages.


Let's also have a look at the number of retweets of messages with @MicroPasts handle for each author and the number of followers that they have. To get the total number of retweets for each user type:

In [None]:
retweets<-aggregate(microPasts$retweet_count, 
        by=list(microPasts$screen_name),FUN=sum)
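
The *aggregate()* function splits the first argument into groups defined by the *by* argument and applies the function *FUN* to each group. A small base R sketch on made-up data shows what the result looks like:

```r
# Made-up tweets: one row per tweet, with the number of times it was retweeted
demo <- data.frame(screen_name   = c("anna", "ben", "anna", "ben", "cleo"),
                   retweet_count = c(3, 0, 2, 1, 7))

# Sum the retweet counts separately for each user
demo_rt <- aggregate(demo$retweet_count,
                     by = list(demo$screen_name), FUN = sum)
demo_rt   # the columns are named Group.1 (the user) and x (the summed count)
```

The automatically generated column names Group.1 and x are worth noting, because the merging and plotting code further down refers to them.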

To get the number of followers we need to download additional data from Twitter. If you have a look at the rtweet library documentation you can see that this can be achieved with the *lookup_users()* function, provided that you have a list of the user ids:

In [None]:
users <- unique(microPasts$user_id) #Extracts the list of unique user ids
                                    #from the dataframe
user_data<-lookup_users(users) #Gets the users data from the twitter API

Merge the retweets table with the column giving the number of followers for each user (include only the users with more than 5 tweets about MicroPasts):

In [None]:
rf<-merge(user_data,retweets, by.x = "screen_name",
          by.y="Group.1",all=TRUE)
rf<-merge(rf,counts2, by.x = "screen_name", by.y="Var1")

And plot the number of retweets of @MicroPasts related messages against the number of followers for each user. We can also indicate how many tweets with the @MicroPasts handle each user posted by substituting the user's screen name for the dot on the graph and adjusting its size accordingly:

In [None]:
ggplot(data = rf, mapping = aes(x = x, y = followers_count)) +
xlab("Number of retweets") +
ylab("Followers count")+
geom_text(aes(label = screen_name, size = Freq))+
labs(size="Number of tweets\n(with @MicroPasts handle)") +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 5)) +
theme(axis.text.y = element_text (size = 5))

#Save graph to the file
dev.print(pdf,file="users_info.pdf")

As we can see, the number of retweets in this sample does not easily correspond to the general number of followers, but unsurprisingly, the more tweets there were in the sample by any user, the more retweets there were.


We can also have a look at the proportion of @MicroPasts-related tweets to all tweets posted by the users, by plotting these variables on a graph. Note that in this graph the number of tweets on the x axis refers to the total number of tweets per account, not only the ones containing the handle @MicroPasts:

In [None]:
ggplot(data = rf, mapping = aes(x = statuses_count, y = Freq)) +
xlab("Overall number of tweets per account") +
ylab("Number of tweets with @MicroPasts handle")+
geom_text(aes(label = screen_name)) +
theme_bw() +
theme(axis.title.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.title.y = element_text (size = 10, angle=90))+
theme(axis.text.x = element_text (vjust = -0.5, size = 10)) +
theme(axis.text.y = element_text (size = 10))


#Save graph to the file:
dev.print(pdf,file="@microPasts_tweets_ratio.pdf")

## *Bonus exercise: User network analysis*

So far we have only extracted and visualised some statistics about the Twitter users who mention @MicroPasts in their messages, but now we will proceed to a slightly more complex analysis. We are going to create a visualisation of the user network on the basis of retweeting behaviour and calculate some of the properties of this network. For this part, we will assume that a link between two users exists if they retweet each other's messages, because this is quite a good indication of whether the users read each other's messages and it is relatively easy to do. However, note that this is not the only way to define user links in a social media network. We could for example look at who is following whom, or who mentions whose handles in their tweets.


To visualise and analyse the users' network, first we need to create the dataframe with the pairs of ids for users who retweeted the message and whose message has been retweeted. First, extract from the dataframe only the tweets that are retweets:

In [None]:
rt<-subset(microPasts, microPasts$is_retweet == TRUE)

The *rt* dataframe already contains the ids of the users who posted the retweets, but there is no column with the ids of the users who posted the original tweets. To get that we will need to find the ids of the original tweets (from the retweet_status_id column) and extract the corresponding user ids. The easiest way to do that is to make another dataframe with just two columns, the status ids and the user ids, and merge the two dataframes by matching the entries from the columns status_id and retweet_status_id:

In [None]:
rt2<-as.data.frame(cbind(microPasts$status_id,microPasts$user_id))
rt<-merge(rt,rt2, by.x = "retweet_status_id", by.y="V1")

Now we just need to extract the user ids pairs into a separate dataframe:

In [None]:
rt <- data.frame(user=rt$user_id, rt=rt$V2)

And get only the unique pairs:

In [None]:
rt.u <- unique(rt) 
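
The steps above can be followed on a very small made-up dataset. In this sketch the ids and the names `demo`, `lookup` and `edges` are invented; status 11 and status 12 are retweets of status 10, and status 13 is a retweet of a tweet that is not in the dataset:

```r
demo <- data.frame(
  status_id         = c("10", "11", "12", "13"),
  user_id           = c("A",  "B",  "C",  "B"),
  is_retweet        = c(FALSE, TRUE, TRUE, TRUE),
  retweet_status_id = c(NA, "10", "10", "99"),
  stringsAsFactors  = FALSE
)

# Keep only the retweets
demo_rt <- subset(demo, is_retweet == TRUE)

# Two-column lookup table: status id and the user who posted it
lookup <- data.frame(V1 = demo$status_id, V2 = demo$user_id,
                     stringsAsFactors = FALSE)

# Match each retweet to the author of the original tweet
demo_rt <- merge(demo_rt, lookup, by.x = "retweet_status_id", by.y = "V1")

# Unique pairs: who retweeted (user) and whose tweet it was (rt)
edges <- unique(data.frame(user = demo_rt$user_id, rt = demo_rt$V2))
edges
```

Note that the retweet of status 99 disappears during the merge, because the original tweet is not in the dataset. The same happens in the real data: only retweets of tweets that are themselves in the microPasts dataframe end up in the network.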

To plot and analyse the user network we will need two additional libraries. Install them if you have not done that yet and load them into your workspace:

In [None]:
install.packages("igraph")
install.packages("sna")
library("igraph")
library("sna")

Now we can begin the analysis and plot the user network:

In [None]:
# this part creates the network from the id pairs
degree <- sna::degree
g <- graph.data.frame(rt.u, directed = FALSE)
g <- as.undirected(g)
# as_adjacency_matrix() expects a graph, so it is applied once, to g
g.adj <- as.matrix(as_adjacency_matrix(g))

# this part plots the network graph:

gplot(g.adj, usearrows = FALSE, vertex.col = "grey",
      vertex.border = "black",
displaylabels = FALSE, edge.lwd = 0.01, edge.col= "grey30",
      vertex.cex = 0.5)

# Save graph to the file:

dev.print(pdf,file="user_network.pdf")

We can also get some basic network attributes:

In [None]:
# get some basic network attributes
gden(g.adj) # density - describes the density of connections between the
            #community members
grecip(g.adj) # reciprocity - describes the tendency of ties to be
              # reciprocal (i.e. if user 1 retweets the message of
              # user 2, does user 2 also retweet the message of user 1)
gtrans(g.adj) # transitivity - describes mutual friendship i.e. the 
              # tendency of a friend of a friend also being a friend
centralization(g.adj, degree) # the degree to which the network is 
                              # centralised
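
To make the first of these measures more concrete, density can be calculated by hand as the number of ties that exist divided by the number of ties that could exist. The sketch below does this in base R for a made-up network of 4 users; it illustrates the general idea and does not reproduce every option of *gden()*:

```r
# Toy undirected network of 4 users as a symmetric adjacency matrix
a <- matrix(0, nrow = 4, ncol = 4)
a[1, 2] <- a[2, 1] <- 1   # user 1 and user 2 are connected
a[1, 3] <- a[3, 1] <- 1   # user 1 and user 3 are connected

n <- nrow(a)
ties_present  <- sum(a[upper.tri(a)])   # each undirected tie counted once
ties_possible <- n * (n - 1) / 2        # number of possible pairs of users
density_demo  <- ties_present / ties_possible
density_demo
```

Here 2 out of 6 possible pairs are connected, so a third of all possible ties are present.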

The attributes on their own are not very informative, but they can be compared against the expected values that can be obtained from simulated data with known substantive properties similar to our data. These values can be calculated using the *Univariate Conditional Uniform Graph Test* for each attribute:

In [None]:
# density
print(cug.gden <- cug.test(g.adj, gden))
plot(cug.gden)
range(cug.gden$rep.stat)

dev.print(pdf,file="network_density.pdf")

Have a look at the graph and the results. As you can see, the expected range of values for network density is much higher than the value of the density for our user network. 
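
The logic of the test can be imitated with a short simulation. The base R sketch below is only a rough illustration. It assumes (check ?cug.test to confirm) that by default the simulated networks share nothing but the size of our network, and that every possible tie is present with probability 0.5. The function name `random_density`, the network size and the observed value are all made up:

```r
set.seed(1)
n <- 30   # made-up number of users

# Density of one random undirected network in which each pair of users
# is connected with probability p
random_density <- function(n, p = 0.5) {
  pairs <- n * (n - 1) / 2
  sum(runif(pairs) < p) / pairs
}

sim <- replicate(1000, random_density(n))
range(sim)   # densities seen under the random baseline

observed <- 0.05        # made-up observed density, for illustration only
mean(sim <= observed)   # share of simulated networks at least as sparse
```

Under this baseline the simulated densities cluster around 0.5, so a sparse observed network will look very unusual by comparison.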

##### (Q7) What does it tell you about the connections between the users in this network?

In [None]:
# reciprocity
print(cug.recip <- cug.test(g.adj, grecip))
plot(cug.recip)
range(cug.recip$rep.stat)

dev.print(pdf,file="network_reciprocity.pdf")

Reciprocity within the network is, on the other hand, much higher than expected, meaning that pairs of users with connections tend to retweet each other.

In [None]:
# transitivity
print(cug.gtrans <- cug.test(g.adj, gtrans))
plot(cug.gtrans)
range(cug.gtrans$rep.stat)

dev.print(pdf,file="network_transitivity.pdf")

##### (Q8) Is transitivity of the user network lower or higher than expected? What does it tell us about the mutual friendships of the users?

In [None]:
# centralisation
print(cug.cent <- cug.test(g.adj, centralization, 
                    FUN.arg=list(FUN=degree)))
plot(cug.cent)
range(cug.cent$rep.stat)

dev.print(pdf,file="network_centralization.pdf")

##### (Q9) Is the network more or less centralised than would be expected? What does it tell us about the relationships between the users and their retweeting behaviour?
#####

Additionally, we can also automatically estimate the number of communities within this network and show them on the dendrogram:

In [None]:
#Finds the communities using random walk methods
g.wc <- walktrap.community(g, steps = 1000, modularity=TRUE) 
#Plots the dendrogram showing the communities
plot(as.dendrogram(g.wc),leaflab="none",yaxt='n')
#Calculates the number of communities (in current versions of igraph
#the community ids start at 1, so no correction is needed)
max(g.wc$membership)

#Save the dendrogram

dev.print(pdf,file="dendrogram.pdf")

## User data analysis

The user_data dataframe contains one more relevant column we have not analysed yet: description. It contains the short texts that the users wrote about themselves in their profiles. Try to analyse the texts from this column yourself: extract the column from the dataframe and carry out term frequency and term association analysis to find out more about the users who mentioned @MicroPasts in their tweets. Try also to inspect the contents of the column qualitatively, in detail - to do that it may be more convenient to save the dataframe in csv format and open it with Microsoft Excel or a similar program.
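
Saving a dataframe in csv format can be done with *write.csv()*. A minimal sketch, using a made-up dataframe (`demo_users`) and a temporary file in place of your real data and folder:

```r
# Made-up user data standing in for user_data
demo_users <- data.frame(
  screen_name = c("anna", "ben"),
  description = c("Archaeologist, museum volunteer",
                  "Historian of the Bronze Age"),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "user_descriptions.csv")
write.csv(demo_users, out_file, row.names = FALSE)

# Read the file back to check that it was written correctly
check <- read.csv(out_file, stringsAsFactors = FALSE)
check$description
```

With the real data you may want to save only the columns you need, for example screen_name and description, since not every column of a downloaded dataframe is guaranteed to be a simple text or number column.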

# Answers:

- (Q1) Type: colnames(tweets) to find out.
- (Q2) Type: length(tweets$screen_name) to find out.
- (Q3) 2
- (Q4) Associated terms: httpstcofdxki, ptolema, ruler, httptcojktlyqh, swift, olduvai, handax. Objects: olduvai handaxes.
- (Q5) Will be discussed in class
- (Q6) Use: table(microPasts\$is_retweet) or summary(microPasts\$is_retweet)
- (Q7) There are fewer connections than would be expected - also discuss in class why that could be the case
- (Q8) Lower - there are fewer mutual friendships than would be expected - also discuss in class why that could be the case
- (Q9) Higher - there are a few 'central' users that retweet tweets, with many users not connected to each other - also discuss in class why that could be the case

# References

Marwick, B. (2013). Discovery of emergent issues and controversies in anthropology using text mining, topic modeling, and social network analysis of microblog content. Data Mining Applications with R, 514.

# Troubleshooting library installation

If there was a problem with library installation or loading, it is most likely because R cannot find the library in the place where it was installed. This is because, by default, when loading libraries R looks for them in a fixed list of library folders (you can see this list by typing .libPaths()). In the GIS lab, we cannot permanently add another folder to this list, so instead we need to install the library locally and then, when loading it, tell R specifically where to find it. So, whenever you need to install and load a library in R in the lab, add the *lib* argument with an appropriate path to the command. To install the library "rtweet" in your dhengage folder type:

In [None]:
install.packages("rtweet",lib="J://dhengage")

And to load it from there:

In [None]:
library("rtweet",lib="J://dhengage")
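
As an alternative that is not part of the instructions above, base R's *.libPaths()* function can add a folder to the list of places R searches for the rest of the current session. A minimal sketch, in which a temporary folder stands in for J://dhengage:

```r
# A temporary folder stands in for J://dhengage in this example
local_lib <- file.path(tempdir(), "rlibs")
dir.create(local_lib, showWarnings = FALSE)

.libPaths(c(local_lib, .libPaths()))  # put the local folder first in the list
.libPaths()                           # list the folders R now searches
```

After this, *install.packages()* and *library()* should use the local folder without the *lib* argument, but the setting is lost when you close R, so it has to be repeated in every session.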