# Generate Indices and Create Sample Files: All Tweets (*R*)

This notebook generates indices for use in sampling the data as well as exports datasets from the corpus based on those indices.  The indices are exported as well.

## Load the data file

> First, we store the names (*and locations*) of the files containing fake tweets in the string '**fileName0**' and the string vector '**fileNames**'[ ].

In [1]:
fileName0 = 'datasetsFULLcsv/fakeFollowersCSV/tweets.csv'

fileNames = c('datasetsFULLcsv/socialSpambots1csv/tweets.csv', 'datasetsFULLcsv/socialSpambots2csv/tweets.csv', 'datasetsFULLcsv/socialSpambots3csv/tweets.csv', 'datasetsFULLcsv/traditionalSpambots1csv/tweets.csv')

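> Using the CSV file names previously specified in '**fileName0**' and '**fileNames**'[ ], we load the first file into the _data.frame_( ) named '**fakeCSV**', keep its user ID, tweet ID, and text columns in '**fakeTweets**', and then append the same columns from each of the remaining files.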

In [2]:
fakeCSV = read.csv(fileName0)
fakeTweets <- data.frame(userID = fakeCSV$user_id, id = fakeCSV$id, text = fakeCSV$text)

for (filename in fileNames) {
    temp0 = read.csv(filename)
    #fakeCSV <- rbind(fakeCSV, temp0)
    temp <- data.frame(userID = temp0$user_id, id = temp0$id, text = temp0$text)
    fakeTweets <- rbind(fakeTweets, temp)
}

“embedded nul(s) found in input”
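> The load-and-append loop above can also be written with `lapply` and a single `rbind`.  The following is a minimal sketch, not the notebook's own method: it writes two tiny CSV files to a temporary directory so that it runs on its own, and the helper `read_tweet_columns`, the file names, and the sample rows are all hypothetical.

```r
# Hypothetical helper: read one CSV and keep only the columns used downstream.
read_tweet_columns <- function(path) {
    raw <- read.csv(path, stringsAsFactors = FALSE)
    data.frame(userID = raw$user_id, id = raw$id, text = raw$text,
               stringsAsFactors = FALSE)
}

# Two tiny stand-in files (made-up data) so the sketch is self-contained.
paths <- file.path(tempdir(), c("tweetsA.csv", "tweetsB.csv"))
write.csv(data.frame(user_id = 1:2, id = 11:12, text = c("a", "b")),
          paths[1], row.names = FALSE)
write.csv(data.frame(user_id = 3, id = 13, text = "c"),
          paths[2], row.names = FALSE)

# Read every file, then bind all of the pieces in one call.
allTweets <- do.call(rbind, lapply(paths, read_tweet_columns))
nrow(allTweets)
```

> Binding once at the end avoids growing the data frame inside the loop, which can matter for files as large as the ones used here.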

> We now load the file containing real tweets into the _data.frame_( ) named '**realCSV**'.

In [3]:
realCSV = read.csv('datasetsFULLcsv/genuineAccountsCSV/tweets.csv')

“embedded nul(s) found in input”
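> The warning above means that the CSV file contains NUL bytes.  One possible way to handle them, which the original notebook does not use, is the `skipNul` argument that `read.csv` passes on to `read.table`.  The sketch below builds a small made-up file with one NUL byte in it; the file name and its contents are assumptions for illustration only.

```r
# Build a tiny CSV with a NUL byte embedded in the text field (made-up data).
path <- file.path(tempdir(), "nulExample.csv")
bytes <- c(charToRaw("id,text\n1,he"), as.raw(0), charToRaw("llo\n"))
writeBin(bytes, path)

# skipNul = TRUE asks the reader to skip NUL bytes while parsing.
nulFree <- read.csv(path, skipNul = TRUE, stringsAsFactors = FALSE)
nulFree$text
```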

> From the combined fake tweets and the '**realCSV**' _data.frame_( ), we create two smaller, simpler *data.frame*( )s named '**fakeTweets**' and '**realTweets**', respectively.  They are smaller and simpler because each keeps only the tweet's ID number in our database and the text of the tweet.

In [4]:
# keep the rows appended from every fake-tweet file; rebuilding from fakeCSV alone would drop them
fakeTweets <- data.frame(id = fakeTweets$id, text = fakeTweets$text)
realTweets <- data.frame(id = realCSV$id, text = realCSV$text)

> The initial sizes of the imports are:

In [5]:
nrow(fakeTweets)
nrow(realTweets)

## Clean the data

>> Now we remove web URLs, Twitter usernames, Twitter hashtags, punctuation, and numeric digits, and then convert the text to lowercase.

In [6]:
# remove web URLs
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", realTweets$text))

# remove twitter handles (@<username>)
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", realTweets$text))

# remove hashtags (#<hashtag name>)
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", realTweets$text))

# remove punctuation
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub('[[:punct:] ]+', ' ', fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub('[[:punct:] ]+', ' ', realTweets$text))

# remove numbers
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("[0-9]", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("[0-9]", "", realTweets$text))

# convert to lowercase
fakeTweets <- data.frame(id = fakeTweets$id, text = tolower(fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = tolower(realTweets$text))
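> To see what these substitutions do to a single tweet, the cleaning steps can be collected into one function and applied to a sample string.  This is only a sketch: the helper `clean_tweet` and the sample tweet are hypothetical, while the regular expressions are the ones used in the cell above.

```r
# Hypothetical helper applying the same substitutions to a character vector.
clean_tweet <- function(text) {
    text <- gsub("http[[:alnum:][:punct:]]*", "", text)  # web URLs
    text <- gsub("@[[:alnum:][:punct:]]*", "", text)     # twitter handles
    text <- gsub("#[[:alnum:][:punct:]]*", "", text)     # hashtags
    text <- gsub("[[:punct:] ]+", " ", text)             # punctuation and runs of spaces
    text <- gsub("[0-9]", "", text)                      # digits
    tolower(text)                                        # lowercase
}

# Made-up tweet containing one of each element that gets removed.
sample_tweet <- "Check THIS out http://t.co/abc123 @user_1 #fun, 42 times!"
cleaned <- clean_tweet(sample_tweet)
cleaned
```

> Because digits are removed after the spaces have been collapsed, a removed number can leave two adjacent spaces behind.  This is harmless here, since the later tokenization step splits on whitespace.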

> The number of tweets available after "cleaning":

In [7]:
nrow(fakeTweets)
nrow(realTweets)

> ### TidyText the data file

> Now we must tokenize the text of each tweet using the '*tidytext*' and '*dplyr*' libraries.  To do this, we first import both libraries and then convert the data frames that store the fake and real tweets to the data frame type from the '**dplyr**' library; only then can the tweets be tokenized.

>> First, we import the '*tidytext*' and '*dplyr*' libraries: 

In [8]:
library(dplyr)
library(tidytext)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



>> Then we convert the data frames of '**fakeTweets**' and '**realTweets**' to the data frame type from the '*dplyr*' library:

In [9]:
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
fakeTweets <- tibble(id = fakeTweets$id, text = as.character(fakeTweets$text))
realTweets <- tibble(id = realTweets$id, text = as.character(realTweets$text))

>> so that we can finally tokenize the text of each tweet.

> ### Tokenization

> We now tokenize the text of the tweets, storing these tokens in new data frames (*from __dplyr__*). 

>> For the fake tweets, we use the new data frame (*from __dplyr__*) '**fakeTweetsTOKENS**':

In [10]:
fakeTweetsTOKENS <- fakeTweets %>%
    unnest_tokens(word, text)

>> Similarly, we tokenize the text of the real tweets, storing the tokens in the new data frame (*from __dplyr__*) '**realTweetsTOKENS**'.

In [11]:
realTweetsTOKENS <- realTweets %>%
    unnest_tokens(word, text)
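> The result of tokenization is a table with one row per word, where each row keeps the ID of the tweet the word came from.  The base-R sketch below only approximates that shape on made-up data by splitting on whitespace; it is not how '*tidytext*' implements `unnest_tokens`, and the two sample tweets are hypothetical.

```r
# Two made-up tweets that have already been cleaned and lowercased.
tweets <- data.frame(id = c(1, 2),
                     text = c("good morning world", "sad news"),
                     stringsAsFactors = FALSE)

# Split each text on whitespace, giving one character vector per tweet.
words <- strsplit(tweets$text, "[[:space:]]+")

# Repeat each tweet ID once per word so that every token keeps its source tweet.
tokens <- data.frame(id = rep(tweets$id, lengths(words)),
                     word = unlist(words),
                     stringsAsFactors = FALSE)
tokens
```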

In [12]:
#nrow(fakeTweetsTOKENS)
#nrow(realTweetsTOKENS)

> ### Remove '*Stop Words*'

> Now, we will remove any stop words from the text of the tweets.  To do this, we first import the '*stop_words*' dataset from the '*tidytext*' library.

>> Importing the '*stop_words*' dataset from the '*tidytext*' library:

In [13]:
data(stop_words)

>> Now, we use the '*anti_join*( )' function from the '*dplyr*' library to remove these stop words from the fake tweets

In [14]:
fakeTweetsTOKENS <- fakeTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"


>> and the real tweets

In [15]:
realTweetsTOKENS <- realTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"
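> An anti-join keeps only the rows of the first table whose key has no match in the second table.  As a rough base-R illustration of that idea on made-up data (the token table and the two-word stop list below are hypothetical, not the '*stop_words*' dataset):

```r
# Made-up tokens and a tiny stand-in stop list.
tokens <- data.frame(id = c(1, 1, 1, 2),
                     word = c("the", "happy", "dog", "of"),
                     stringsAsFactors = FALSE)
stopList <- data.frame(word = c("the", "of"), stringsAsFactors = FALSE)

# Keep only the tokens whose word does not appear in the stop list.
kept <- tokens[!(tokens$word %in% stopList$word), ]
kept
```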


> The numbers of fake and real tweet tokens are:

In [16]:
nrow(fakeTweetsTOKENS)
nrow(realTweetsTOKENS)

## NRC Model

> We first use the **NRC Sentiment Model**, which associates words with ten categories: eight emotions plus positive and negative sentiment.  Thus, we load the NRC sentiment lexicon and then store the names of those ten categories in their own vector.

In [17]:
nrcWORDS <- get_sentiments("nrc")
nrcEMOTIONS <- unique(nrcWORDS$sentiment)

> ### Associate Words with their Emotions

> Now, we generate data frames that associate the words in the tweets with their emotions under the **NRC Model**.

>> First, for fake tweets

In [18]:
fakeTweetsNRCsentiment <- data.frame(id = 0)
for (emotion in nrcEMOTIONS){
    fakeTweetsNRCsentiment0 <- inner_join(fakeTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    fakeTweetsNRCsentiment <- full_join(fakeTweetsNRCsentiment, fakeTweetsNRCsentiment0)
    }
fakeTweetsNRCsentiment <- data.frame(fakeTweetsNRCsentiment[-1,])

Joining, by = "word"
Joining, by = "id"
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "word", "sentiment")


>> and then for real tweets

In [19]:
realTweetsNRCsentiment <- data.frame(id = as.factor(0))
for (emotion in nrcEMOTIONS){
    realTweetsNRCsentiment0 <- inner_join(realTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    realTweetsNRCsentiment <- full_join(realTweetsNRCsentiment, realTweetsNRCsentiment0)
    }
realTweetsNRCsentiment <- data.frame(realTweetsNRCsentiment[-1,])
#realTweetsNRCsentiment

Joining, by = "word"
Joining, by = "id"
“Column `id` joining factors with different levels, coercing to character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, 

> The number of (word, sentiment) pairs now in our data sets:

In [20]:
nrow(fakeTweetsNRCsentiment)
nrow(realTweetsNRCsentiment)
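> The loops above join the tokens against one emotion at a time and stack the pieces.  Apart from row order, this should yield the same (id, word, sentiment) rows as a single join against the whole lexicon.  The base-R sketch below checks that on made-up data, using `merge` in place of the '*dplyr*' joins; the tiny token table and lexicon are hypothetical.

```r
# Made-up tokens and a tiny stand-in lexicon in which "happy" has two categories.
tokens <- data.frame(id = c(1, 1, 2),
                     word = c("happy", "fear", "happy"),
                     stringsAsFactors = FALSE)
lexicon <- data.frame(word = c("happy", "happy", "fear"),
                      sentiment = c("joy", "positive", "fear"),
                      stringsAsFactors = FALSE)

# One join against the whole lexicon.
single <- merge(tokens, lexicon, by = "word")

# One join per category, stacked afterwards (the shape of the loop above).
pieces <- lapply(unique(lexicon$sentiment), function(emotion) {
    merge(tokens, lexicon[lexicon$sentiment == emotion, ], by = "word")
})
looped <- do.call(rbind, pieces)

# Sort both results the same way so that they can be compared row by row.
sort_rows <- function(d) {
    d <- d[order(d$id, d$word, d$sentiment), ]
    rownames(d) <- NULL
    d
}
isTRUE(all.equal(sort_rows(single), sort_rows(looped), check.attributes = FALSE))
```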

> ### Generate *Tweet-Emotion Matrices*

> We now need to generate what we are calling *Tweet-Emotion Matrices*.  In these matrices, the rows represent the tweets we have analyzed, while the columns represent the ten categories in the **NRC Model**.  Thus, each element is the number of words in its row's tweet that are associated with its column's emotion, which we use as the strength of that emotion for the tweet.

>> First, we do this for the fake tweets:

In [21]:
# with() evaluates table() inside the data frame, so attach()/detach() are not needed
fakeNRCscoredTweets <- data.frame(with(fakeTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "fake")

>> Then we do it for the real tweets

In [22]:
# with() evaluates table() inside the data frame, so attach()/detach() are not needed
realNRCscoredTweets <- data.frame(with(realTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "real")

>> We now create a combined data frame containing **_ALL_** tweets, both real and fake

In [23]:
NRCscoredTweets <- rbind(fakeNRCscoredTweets, realNRCscoredTweets)
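> Converting a `table` of two variables to a data frame gives one row per (id, sentiment) combination, with the count in a column named `Freq`; combinations that never occur get a count of zero.  A small sketch on made-up data (the three rows below are hypothetical):

```r
# Made-up (id, sentiment) pairs: tweet 1 has two "joy" words, tweet 2 has one "fear" word.
pairs <- data.frame(id = c(1, 1, 2),
                    sentiment = c("joy", "joy", "fear"),
                    stringsAsFactors = FALSE)

# Cross-tabulate and flatten to long form: the columns are id, sentiment, Freq.
counts <- as.data.frame(table(id = pairs$id, sentiment = pairs$sentiment))
counts
```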

> ### Look at the table

> We now take a quick look at the layout of our table to ensure it's what we want.

In [29]:
attach(NRCscoredTweets)
table(id,sentiment)
detach(NRCscoredTweets)

The following objects are masked from NRCscoredTweets (pos = 3):

    Freq, id, realFAKEcat, sentiment



                    sentiment
id                   anger anticipation disgust fear joy negative positive
  1023861511             1            1       1    1   1        1        1
  1183192955             1            1       1    1   1        1        1
  1189773795             1            1       1    1   1        1        1
  1204599498             1            1       1    1   1        1        1
  1210430389             1            1       1    1   1        1        1
  1213317158             1            1       1    1   1        1        1
  1226758528             1            1       1    1   1        1        1
  1230802453             1            1       1    1   1        1        1
  1235967720             1            1       1    1   1        1        1
  1248303008             1            1       1    1   1        1        1
  1248315583             1            1       1    1   1        1        1
  1257268832             1            1       1    1   1        1     

In [37]:
#table(data.matrix(NRCscoredTweets,id),id,sentiment)

“the condition has length > 1 and only the first element will be used”

ERROR: Error in table(data.matrix(NRCscoredTweets, id), id, sentiment): all arguments must have the same length


In [32]:
#realNRCscoredTweets

> ### Export tables

> We now check the sizes of the tables and draw random row indices for sampling.

In [38]:
nrow(NRCscoredTweets)
nrow(fakeTweetsNRCsentiment)
nrow(realTweetsNRCsentiment)

fakeTestTrainIND <- sample(1:nrow(fakeNRCscoredTweets), 25000)
realTestTrainIND <- sample(1:nrow(realNRCscoredTweets), 1500000)

In [39]:
#fakeNRCscoredTweets
nrcEMOTIONS

In [25]:
#scoredFakeTweets <- data.frame()
#fakeNRCscoredTweets

#NRCscoredTweets
#filter(NRCscoredTweets, sentiment == "joy")

In [40]:
trustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "trust")$id, trust = filter(NRCscoredTweets, sentiment == "trust")$Freq)
fearScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "fear")$id, fear = filter(NRCscoredTweets, sentiment == "fear")$Freq)
negScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "negative")$id, negative = filter(NRCscoredTweets, sentiment == "negative")$Freq)
sadnessScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "sadness")$id, sadness = filter(NRCscoredTweets, sentiment == "sadness")$Freq)
angerScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anger")$id, anger = filter(NRCscoredTweets, sentiment == "anger")$Freq)
surpriseScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "surprise")$id, surprise = filter(NRCscoredTweets, sentiment == "surprise")$Freq)
posScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "positive")$id, positive = filter(NRCscoredTweets, sentiment == "positive")$Freq)
disgustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "disgust")$id, disgust = filter(NRCscoredTweets, sentiment == "disgust")$Freq)
joyScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "joy")$id, joy = filter(NRCscoredTweets, sentiment == "joy")$Freq)
anticipationScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anticipation")$id, anticipation = filter(NRCscoredTweets, sentiment == "anticipation")$Freq, realFAKEcat = filter(NRCscoredTweets, sentiment == "anticipation")$realFAKEcat)

In [41]:
nrcSCORES <- full_join(trustScores, full_join(fearScores, full_join(negScores, full_join(sadnessScores, full_join(angerScores, full_join(surpriseScores, full_join(posScores, full_join(disgustScores, full_join(joyScores, anticipationScores)))))))))

Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
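> The cells above turn the long (id, sentiment, Freq) table into one row per tweet with one column per category.  As an alternative sketch, not the approach used in this notebook, the same reshaping can be done in base R with `xtabs`; the four sample rows below are made up.

```r
# Made-up long-form counts for two tweets and two categories.
long <- data.frame(id = c(1, 1, 2, 2),
                   sentiment = c("fear", "joy", "fear", "joy"),
                   Freq = c(0, 2, 1, 0),
                   stringsAsFactors = FALSE)

# Sum Freq for every id/sentiment cell, then turn the table into a data frame.
wide <- as.data.frame.matrix(xtabs(Freq ~ id + sentiment, data = long))
wide
```

> The tweet IDs end up as row names in this version, whereas the joins above keep `id` as an ordinary column.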


In [42]:
fakeNRCscores <- filter(nrcSCORES, realFAKEcat == 'fake')
realNRCscores <- filter(nrcSCORES, realFAKEcat == 'real')

In [43]:
nrow(nrcSCORES)
nrow(fakeNRCscores)
nrow(realNRCscores)

In [44]:
nrcSCORES

id,trust,fear,negative,sadness,anger,surprise,positive,disgust,joy,anticipation,realFAKEcat
1023861511,1,0,0,0,0,0,0,0,0,0,fake
1183192955,0,1,1,0,0,0,0,0,0,0,fake
1189773795,0,0,1,1,0,0,0,0,0,0,fake
1204599498,0,0,0,0,0,0,1,0,0,0,fake
1210430389,2,0,0,1,0,1,4,0,2,2,fake
1213317158,1,0,0,1,0,1,2,0,1,1,fake
1226758528,1,0,0,0,0,0,1,0,0,0,fake
1230802453,0,1,1,1,0,0,0,1,0,0,fake
1235967720,0,0,0,0,0,0,1,0,0,0,fake
1248303008,0,0,0,0,0,0,0,0,0,1,fake


## Create samples for training and testing

In [31]:
set.seed(158)
fakeTRAINind <- sample(1:nrow(fakeNRCscores), 50000)

In [34]:
set.seed(231)
realTRAINind <- sample(1:nrow(realNRCscores), 1300000)

In [36]:
fakeNRCscoresTRAIN <- fakeNRCscores[fakeTRAINind, ]
fakeNRCscoresTEST <- fakeNRCscores[-fakeTRAINind, ]
nrow(fakeNRCscoresTRAIN)
nrow(fakeNRCscoresTEST)

In [38]:
realNRCscoresTRAIN <- realNRCscores[realTRAINind, ]
realNRCscoresTEST <- realNRCscores[-realTRAINind, ]
nrow(realNRCscoresTRAIN)
nrow(realNRCscoresTEST)

In [62]:
NRCscoresTRAIN <- rbind(fakeNRCscoresTRAIN, realNRCscoresTRAIN)
NRCscoresTEST <- rbind(fakeNRCscoresTEST, realNRCscoresTEST)

In [64]:
nrow(NRCscoresTRAIN)
nrow(NRCscoresTEST)
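> The split above relies on two things: `set.seed` makes the sampled indices reproducible, and negative indexing selects every row that was not sampled.  A minimal sketch on a made-up ten-row table (the data and the 7/3 split are hypothetical):

```r
# Made-up score table with ten tweets.
scores <- data.frame(id = 1:10, joy = c(0, 1, 0, 2, 1, 0, 0, 3, 1, 0))

# Fix the seed so that the same indices are drawn on every run.
set.seed(158)
trainInd <- sample(seq_len(nrow(scores)), 7)

# The sampled rows form the training set; all remaining rows form the test set.
train <- scores[trainInd, ]
test <- scores[-trainInd, ]
c(nrow(train), nrow(test))
```

> Because the test set is defined as the complement of the sampled indices, the two sets cannot share a row, and together they cover the whole table.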