# All Fake Tweets (*R*)

In this notebook, we use the "tweets.csv" files from the dataset provided by the Fake Project as our source data: the fake-follower and spambot files supply the fake tweets, and the genuine-accounts file supplies the real ones.

## Load the data file

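> First, we store the name (*and location*) of the fake-followers tweet file in the string '**fileName0**', and the names of the four spambot tweet files in the character vector '**fileNames**'.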

In [1]:
fileName0 = 'datasetsFULLcsv/fakeFollowersCSV/tweets.csv'

fileNames = c('datasetsFULLcsv/socialSpambots1csv/tweets.csv', 'datasetsFULLcsv/socialSpambots2csv/tweets.csv', 'datasetsFULLcsv/socialSpambots3csv/tweets.csv', 'datasetsFULLcsv/traditionalSpambots1csv/tweets.csv')

> Using the file names previously specified, we load the first file into the _data.frame_( ) named '**fakeCSV**', append the tweets from the remaining files to '**fakeTweets**', and load the genuine tweets into '**realCSV**'.

In [2]:
fakeCSV = read.csv(fileName0)
fakeTweets <- data.frame(userID = fakeCSV$user_id, id = fakeCSV$id, text = fakeCSV$text)

for (filename in fileNames) {
    temp0 = read.csv(filename)
    #fakeCSV <- rbind(fakeCSV, temp0)
    temp <- data.frame(userID = temp0$user_id, id = temp0$id, text = temp0$text)
    fakeTweets <- rbind(fakeTweets, temp)
}

realCSV = read.csv('datasetsFULLcsv/genuineAccountsCSV/tweets.csv')

“embedded nul(s) found in input”
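
The loading step above can also be sketched with a small helper and a single `rbind`. This is only an illustration under stated assumptions: `read_tweets`, `demo_dir`, `demo_paths`, and `demo_tweets` are hypothetical names, and the two tiny CSV files stand in for the real dataset. The `skipNul = TRUE` argument of `read.csv` asks R to skip embedded NUL bytes, which may avoid the warning shown above.

```r
# Hypothetical helper: read one tweets file and keep only the columns used later.
# skipNul = TRUE asks read.csv to skip embedded NUL bytes in the input.
read_tweets <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE, skipNul = TRUE)
  data.frame(userID = raw$user_id, id = raw$id, text = raw$text,
             stringsAsFactors = FALSE)
}

# Two tiny stand-in files so the sketch runs without the real dataset.
demo_dir <- tempfile("tweets_demo")
dir.create(demo_dir)
demo_paths <- file.path(demo_dir, c("a.csv", "b.csv"))
write.csv(data.frame(id = 1:2, user_id = c(10, 11), text = c("hello", "world")),
          demo_paths[1], row.names = FALSE)
write.csv(data.frame(id = 3, user_id = 12, text = "again"),
          demo_paths[2], row.names = FALSE)

# Read every file, then bind the pieces once instead of growing a frame in a loop.
demo_tweets <- do.call(rbind, lapply(demo_paths, read_tweets))
nrow(demo_tweets)
```

Binding once at the end avoids copying the growing data frame on every iteration, which can matter for files with millions of tweets.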

The '**fakeTweets**' _data.frame_( ) built in the previous cell is a smaller, simpler version of the raw CSV data: it keeps only the ID number of the user who generated each tweet, the ID number of the tweet in our database, and the text of the tweet. We now build '**realTweets**' the same way from '**realCSV**'.

In [3]:
# fakeTweets already holds the tweets from all five fake-account files (see In [2]);
# rebuilding it from fakeCSV here would discard the spambot tweets appended in the loop.
realTweets <- data.frame(userID = realCSV$user_id, id = realCSV$id, text = realCSV$text)

Now we remove web URLs, Twitter hashtags, Twitter usernames, punctuation, and numeric digits, and convert the remaining text to lowercase.

In [4]:
# remove web URLs
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", realTweets$text))

# remove hashtags (#<hashtag name>)
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", realTweets$text))

# remove twitter handles (@<username>)
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", realTweets$text))

# remove punctuation
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub('[[:punct:] ]+', ' ', fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub('[[:punct:] ]+', ' ', realTweets$text))

# remove numbers
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("[0-9]", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("[0-9]", "", realTweets$text))

# convert to lowercase
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = tolower(fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = tolower(realTweets$text))
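
The same cleaning sequence can be sketched as one function applied to a character vector. This is a minimal sketch, not the notebook's own code: the function name `clean_tweet_text` and the sample tweet are made up for illustration, while the regular expressions are the ones used in the cell above.

```r
# Hypothetical helper that applies the notebook's substitutions in order.
clean_tweet_text <- function(text) {
  text <- gsub("http[[:alnum:][:punct:]]*", "", text)  # web URLs
  text <- gsub("#[[:alnum:][:punct:]]*", "", text)     # hashtags
  text <- gsub("@[[:alnum:][:punct:]]*", "", text)     # Twitter handles
  text <- gsub("[[:punct:] ]+", " ", text)             # punctuation, repeated spaces
  text <- gsub("[0-9]", "", text)                      # digits
  tolower(text)                                        # lowercase
}

# Made-up sample tweet, used only to exercise the function.
demo_clean <- clean_tweet_text("Check THIS out @bob: http://t.co/abc #wow 42 times!")
demo_clean
```

Because digits are removed after whitespace is collapsed, the result can still contain doubled spaces; the tokenizer used later splits on whitespace, so this should not affect the tokens.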

## TidyText the data file

> Next, we tokenize the text of each tweet.  First, we load the '*dplyr*' and '*tidytext*' libraries,

In [5]:
library(dplyr)
library(tidytext)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



> Then we convert '**fakeTweets**' and '**realTweets**' to tibbles, the data frame type used by the '*dplyr*' library, storing the tweet text as character strings,

In [6]:
# tibble() replaces data_frame(), which is deprecated in current versions of tibble/dplyr
fakeTweets <- tibble(userID = fakeTweets$userID, id = fakeTweets$id, text = as.character(fakeTweets$text))
realTweets <- tibble(userID = realTweets$userID, id = realTweets$id, text = as.character(realTweets$text))

> so that we can finally tokenize the text from each of the tweets,

In [7]:
fakeTweetsTOKENS <- fakeTweets %>%
    unnest_tokens(word, text)

In [8]:
realTweetsTOKENS <- realTweets %>%
    unnest_tokens(word, text)
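
To make the effect of tokenization concrete, here is a small sketch on made-up data (the `demo_tweets` and `demo_tokens` names and the two sample tweets are assumptions, not part of the dataset). `unnest_tokens(word, text)` splits the `text` column into one row per word, stores each word in a new `word` column, and carries the other columns along.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets standing in for fakeTweets / realTweets.
demo_tweets <- tibble(userID = c(10, 11),
                      id = c(1, 2),
                      text = c("free followers now", "good morning"))

# One row per word; userID and id are repeated for every token of a tweet.
demo_tokens <- demo_tweets %>%
    unnest_tokens(word, text)
demo_tokens
```

Each tweet ID now appears once per token, which is the layout the later joins against word lists rely on.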

## Remove '*Stop Words*'

> Now, we will remove any stop words from the text of the tweets.  To do this, we first import the '*stop_words*' dataset from the '*tidytext*' library

In [9]:
data(stop_words)

> Now, we use the '*anti_join*( )' function from the '*dplyr*' library to remove these stop words.

In [10]:
fakeTweetsTOKENS <- fakeTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"


In [11]:
realTweetsTOKENS <- realTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"
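
As a rough illustration of what the anti-join does, the sketch below uses a made-up two-word stop list (`demo_stop`) in place of the real '*stop_words*' dataset; all `demo_` names are hypothetical. `anti_join` keeps only the rows of the left table whose `word` has no match in the right table.

```r
library(dplyr)

# Made-up tokens and a tiny stand-in for the stop_words dataset.
demo_tokens <- tibble(id = c(1, 1, 1, 2),
                      word = c("the", "free", "followers", "and"))
demo_stop <- tibble(word = c("the", "and"))

# Keep only tokens that do NOT appear in the stop list.
demo_kept <- demo_tokens %>%
    anti_join(demo_stop, by = "word")
demo_kept
```

Naming the join column with `by = "word"` also silences the `Joining, by = "word"` message printed above.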


In [12]:
nrcWORDS <- get_sentiments("nrc")
nrcEMOTIONS <- unique(nrcWORDS$sentiment)

In [13]:
fakeTweetsNRCsentiment <- data.frame(id = 0)
for (emotion in nrcEMOTIONS){
    fakeTweetsNRCsentiment0 <- inner_join(fakeTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    fakeTweetsNRCsentiment <- full_join(fakeTweetsNRCsentiment, fakeTweetsNRCsentiment0)
    }
fakeTweetsNRCsentiment <- data.frame(fakeTweetsNRCsentiment[-1,])

Joining, by = "word"
Joining, by = "id"
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
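
The loop above joins the tokens against one emotion at a time and then stacks the results. Assuming the goal is simply every (token, sentiment) match, a single `inner_join` against the whole lexicon should give the same rows up to ordering. The sketch below uses a hypothetical three-row lexicon (`demo_lexicon`) with the same `word` / `sentiment` layout as '**nrcWORDS**', so that it runs without downloading the NRC data.

```r
library(dplyr)

# Made-up tokens; "table" has no sentiment entry and should drop out.
demo_tokens <- tibble(userID = c(10, 10, 11),
                      id = c(1, 1, 2),
                      word = c("happy", "table", "afraid"))

# Hypothetical mini-lexicon in the same word/sentiment layout as nrcWORDS.
demo_lexicon <- tibble(word = c("happy", "happy", "afraid"),
                       sentiment = c("joy", "positive", "fear"))

# One join: a token is repeated once for every sentiment its word carries.
demo_sentiment <- inner_join(demo_tokens, demo_lexicon, by = "word")
demo_sentiment
```

This form needs no placeholder first row, so there is nothing to strip afterwards and no type mismatch on the `id` column.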


In [14]:
realTweetsNRCsentiment <- data.frame(id = as.factor(0))
for (emotion in nrcEMOTIONS){
    realTweetsNRCsentiment0 <- inner_join(realTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    realTweetsNRCsentiment <- full_join(realTweetsNRCsentiment, realTweetsNRCsentiment0)
    }
realTweetsNRCsentiment <- data.frame(realTweetsNRCsentiment[-1,])
#realTweetsNRCsentiment

Joining, by = "word"
Joining, by = "id"
“Column `id` joining factors with different levels, coercing to character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character

In [15]:
# Count, for every tweet id, how many tokens fall under each NRC sentiment.
# with() evaluates table() inside the data frame without attach()/detach().
fakeNRCscoredTweets <- data.frame(with(fakeTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "fake")

realNRCscoredTweets <- data.frame(with(realTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "real")
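
The scored tables built here are in long form: one row per (tweet, sentiment) pair with its count in `Freq`, including pairs whose count is zero. If one row per tweet with one column per emotion is wanted instead, which is what the commented-out model formula later in the notebook seems to aim for, the same contingency table can be converted to a wide data frame. The sketch below uses made-up data, and every `demo_` name is hypothetical.

```r
# Made-up (tweet id, sentiment) matches standing in for fakeTweetsNRCsentiment.
demo_sentiment <- data.frame(id = c(1, 1, 1, 2),
                             sentiment = c("joy", "joy", "fear", "fear"))

# Contingency table: rows are tweet ids, columns are sentiments.
demo_counts <- with(demo_sentiment, table(id, sentiment))

# Long form, as in the notebook: columns id, sentiment, Freq.
demo_long <- data.frame(demo_counts)

# Wide form: one row per tweet id, one count column per sentiment.
demo_wide <- as.data.frame.matrix(demo_counts)
demo_wide
```

In the wide form, each emotion count can be used directly as a predictor column.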

In [16]:
NRCscoredTweets <- rbind(fakeNRCscoredTweets, realNRCscoredTweets)

In [17]:
set.seed(158)
fakeTestTrainIND <- sample(1:nrow(fakeNRCscoredTweets), 50000)
set.seed(241)
realTestTrainIND <- sample(1:nrow(realNRCscoredTweets), 1500000)

testNRC <- rbind(fakeNRCscoredTweets[fakeTestTrainIND, ], realNRCscoredTweets[realTestTrainIND, ])
trainNRC <- rbind(fakeNRCscoredTweets[-fakeTestTrainIND, ], realNRCscoredTweets[-realTestTrainIND, ])

## Random Forest

In [18]:
library(randomForest)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine



In [None]:
#nrcSCORES.rf <- randomForest(realFAKEcat ~ filter(trainNRC, sentiment == "trust")$Freq +
#                                           filter(trainNRC, sentiment == "fear")$Freq +
#                                           filter(trainNRC, sentiment == "negative")$Freq +
#                                           filter(trainNRC, sentiment == "sadness")$Freq +
#                                           filter(trainNRC, sentiment == "anger")$Freq +
#                                           filter(trainNRC, sentiment == "surprise")$Freq +
#                                           filter(trainNRC, sentiment == "positive")$Freq +
#                                           filter(trainNRC, sentiment == "disgust")$Freq +
#                                           filter(trainNRC, sentiment == "joy")$Freq +
#                                           filter(trainNRC, sentiment == "anticipation")$Freq,
#                             data = trainNRC)

nrcSCORES.rf <- randomForest(data.matrix(subset(trainNRC, select = -c(realFAKEcat))), 
                              data.matrix(subset(trainNRC, select = c(realFAKEcat))),
                              data.matrix(subset(testNRC, select = -c(realFAKEcat))),
                              data.matrix(subset(testNRC, select = c(realFAKEcat))),
                              ntree = 250)

“The response has five or fewer unique values.  Are you sure you want to do regression?”
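
The warning above appears because `data.matrix()` turns the `realFAKEcat` labels into integer codes, so `randomForest` sees a numeric response and fits a regression. Passing the response as a factor makes it fit a classifier instead. The sketch below shows this on small simulated data; the `demo_` names, the two predictor columns, and the Poisson-distributed counts are all assumptions made for illustration, not results from the tweet data.

```r
library(randomForest)

set.seed(1)
# Simulated per-tweet emotion counts: "real" rows lean towards joy,
# "fake" rows towards fear. Purely illustrative, not derived from the dataset.
demo_x <- data.frame(joy = c(rpois(50, 3), rpois(50, 1)),
                     fear = c(rpois(50, 1), rpois(50, 3)))
demo_y <- factor(rep(c("real", "fake"), each = 50))

# A factor response makes randomForest grow classification trees.
demo_rf <- randomForest(x = demo_x, y = demo_y, ntree = 50)
demo_rf$type
```

With the notebook's data, the equivalent change would be to pass the `realFAKEcat` column as a factor for both the response and the test response instead of wrapping them in `data.matrix()`.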