# All Fake Tweets (*R*)

In this notebook, we use the "tweets.csv" data files from the dataset provided by the Fake Project as our source data: the fake-follower and spambot files supply the fake tweets, and the genuine-accounts file supplies the real tweets.

## Load the data file

> First, we store the file names (*and their locations*): the fake-followers file in the string '**fileName0**' and the spambot files in the character vector '**fileNames**'.

In [2]:
fileName0 = 'datasetsFULLcsv/fakeFollowersCSV/tweets.csv'

fileNames = c('datasetsFULLcsv/socialSpambots1csv/tweets.csv', 'datasetsFULLcsv/socialSpambots2csv/tweets.csv', 'datasetsFULLcsv/socialSpambots3csv/tweets.csv', 'datasetsFULLcsv/traditionalSpambots1csv/tweets.csv')

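> Using the CSV file names previously specified in '**fileName0**' and '**fileNames**', we can now load the fake-followers file into the _data.frame_( ) named '**fakeCSV**', append the tweets from each spambot file to '**fakeTweets**', and load the genuine-account tweets into '**realCSV**'.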

In [3]:
fakeCSV = read.csv(fileName0)
fakeTweets <- data.frame(userID = fakeCSV$user_id, id = fakeCSV$id, text = fakeCSV$text)

for (filename in fileNames) {
    temp0 = read.csv(filename)
    #fakeCSV <- rbind(fakeCSV, temp0)
    temp <- data.frame(userID = temp0$user_id, id = temp0$id, text = temp0$text)
    fakeTweets <- rbind(fakeTweets, temp)
}

realCSV = read.csv('datasetsFULLcsv/genuineAccountsCSV/tweets.csv')

“embedded nul(s) found in input”
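The read-and-append pattern above can also be written without a loop. The following is a minimal sketch, not the notebook's own method: the two small CSV files, the extra `lang` column, and the helper name `readTweets` are hypothetical stand-ins for the real "tweets.csv" files.

```r
# Hypothetical stand-ins for the tweets.csv files: two tiny CSVs in a temp dir.
tmpDir <- tempfile("tweetsDemo")
dir.create(tmpDir)
demoFiles <- file.path(tmpDir, c("a.csv", "b.csv"))
write.csv(data.frame(user_id = c(1, 2), id = c(10, 11),
                     text = c("hi", "yo"), lang = "en"),
          demoFiles[1], row.names = FALSE)
write.csv(data.frame(user_id = 3, id = 12, text = "hey", lang = "en"),
          demoFiles[2], row.names = FALSE)

# Read one file and keep only the three columns used in this notebook.
readTweets <- function(path) {
    raw <- read.csv(path, stringsAsFactors = FALSE)
    data.frame(userID = raw$user_id, id = raw$id, text = raw$text,
               stringsAsFactors = FALSE)
}

# Read every file, then bind all of the pieces together in one rbind call.
demoTweets <- do.call(rbind, lapply(demoFiles, readTweets))
print(demoTweets)
```

Binding once at the end avoids growing a data frame inside a loop, which may matter for files as large as the ones used here.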

From the loaded CSV data, we keep smaller, simpler *data.frame*( )s: '**fakeTweets**' (already assembled above from all of the fake-account files) and '**realTweets**' (built here from '**realCSV**'). Each one contains only the ID number of the user who generated the tweet, the ID number of the tweet in our database, and the text of the tweet.

In [4]:
# fakeTweets was assembled in the previous cell; rebuilding it from fakeCSV here
# would discard the spambot tweets appended in the loop.
realTweets <- data.frame(userID = realCSV$user_id, id = realCSV$id, text = realCSV$text)

In [5]:
nrow(fakeTweets)
nrow(realTweets)

Now we remove web URLs, Twitter hashtags, Twitter usernames, punctuation, and all numeric digits, and then convert the text to lowercase.

In [6]:
# remove web URLs
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", realTweets$text))

# remove hashtags (#<hashtag name>)
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", realTweets$text))

# remove twitter handles (@<username>)
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", realTweets$text))

# remove punctuation
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub('[[:punct:] ]+', ' ', fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub('[[:punct:] ]+', ' ', realTweets$text))

# remove numbers
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = gsub("[0-9]", "", fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = gsub("[0-9]", "", realTweets$text))

# convert to lowercase
fakeTweets <- data.frame(userID = fakeTweets$userID, id = fakeTweets$id, text = tolower(fakeTweets$text))
realTweets <- data.frame(userID = realTweets$userID, id = realTweets$id, text = tolower(realTweets$text))
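To see what the cleaning steps do to a single tweet, the same regular expressions can be chained in one helper. This is only an illustrative sketch: the function name `cleanTweetText` and the sample tweet are made up, and the patterns are copied from the cell above.

```r
# Apply the notebook's cleaning patterns, in the same order, to a character vector.
cleanTweetText <- function(text) {
    text <- gsub("http[[:alnum:][:punct:]]*", "", text)  # web URLs
    text <- gsub("#[[:alnum:][:punct:]]*", "", text)     # hashtags
    text <- gsub("@[[:alnum:][:punct:]]*", "", text)     # handles
    text <- gsub("[[:punct:] ]+", " ", text)             # punctuation, repeated spaces
    text <- gsub("[0-9]", "", text)                      # digits
    tolower(text)                                        # lowercase
}

# Made-up example tweet containing every kind of token that gets removed.
sampleTweet <- "Check THIS out: http://example.com/x?y=1 #wow @someone, 3 times!"
cleaned <- cleanTweetText(sampleTweet)
print(cleaned)
```

Because the digit step runs after spaces are collapsed, a removed number can leave two adjacent spaces behind; the tokenizer used later ignores extra whitespace, so this should be harmless.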

In [7]:
nrow(fakeTweets)
nrow(realTweets)

## TidyText the data file

> Now we must tokenize the text of each tweet using the '*tidytext*' and '*dplyr*' libraries.  First, we must import the '*tidytext*' and '*dplyr*' libraries,

In [8]:
library(dplyr)
library(tidytext)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



> Then we convert the '**fakeTweets**' and '**realTweets**' data frames to tibbles (the data frame type used by the '*dplyr*' library), storing the text as character strings,

In [9]:
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
fakeTweets <- tibble(userID = fakeTweets$userID, id = fakeTweets$id, text = as.character(fakeTweets$text))
realTweets <- tibble(userID = realTweets$userID, id = realTweets$id, text = as.character(realTweets$text))

In [10]:
nrow(fakeTweets)
nrow(realTweets)

> so that we can finally tokenize the text from each of the tweets,

In [11]:
fakeTweetsTOKENS <- fakeTweets %>%
    unnest_tokens(word, text)

In [12]:
realTweetsTOKENS <- realTweets %>%
    unnest_tokens(word, text)

In [13]:
nrow(fakeTweetsTOKENS)
nrow(realTweetsTOKENS)
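The one-token-per-row layout produced by tokenizing can be illustrated in base R. This sketch is not how '*tidytext*' implements `unnest_tokens()`; it only mimics the resulting shape on made-up, already-cleaned text, and the names `demoTweets` and `demoTokens` are hypothetical.

```r
# Two made-up tweets that have already been cleaned and lowercased.
demoTweets <- data.frame(id = c(1, 2),
                         text = c("good morning world", "so happy today"),
                         stringsAsFactors = FALSE)

# Split each tweet on runs of whitespace.
wordLists <- strsplit(trimws(demoTweets$text), "[[:space:]]+")

# Repeat each tweet id once per word so that every row holds one token.
demoTokens <- data.frame(id = rep(demoTweets$id, lengths(wordLists)),
                         word = unlist(wordLists),
                         stringsAsFactors = FALSE)
print(demoTokens)
```

This is why the row counts grow sharply after tokenizing: each tweet contributes as many rows as it has words.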

## Remove '*Stop Words*'

> Now, we will remove any stop words from the text of the tweets.  To do this, we first load the '*stop_words*' dataset from the '*tidytext*' library.

In [14]:
data(stop_words)

> Now, we use the '*anti_join*( )' function from the '*dplyr*' library to remove these stop words.

In [15]:
fakeTweetsTOKENS <- fakeTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"


In [16]:
realTweetsTOKENS <- realTweetsTOKENS %>%
    anti_join(stop_words)

Joining, by = "word"


In [17]:
nrow(fakeTweetsTOKENS)
nrow(realTweetsTOKENS)
#realTweetsTOKENS
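An anti-join keeps only the rows of the left table whose key has no match in the right table. The base R sketch below shows the same idea on a made-up token table; the three-word stop list is hypothetical and far shorter than the real '*stop_words*' dataset.

```r
# Made-up tokens, one word per row, as produced by the tokenizing step.
demoTokens <- data.frame(id = c(1, 1, 1, 2, 2),
                         word = c("the", "happy", "dog", "is", "angry"),
                         stringsAsFactors = FALSE)

# Hypothetical mini stop-word list.
demoStopWords <- data.frame(word = c("the", "is", "a"), stringsAsFactors = FALSE)

# Keep the rows whose word does not appear in the stop-word list.
demoKept <- demoTokens[!(demoTokens$word %in% demoStopWords$word), ]
print(demoKept)
```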

In [18]:
nrcWORDS <- get_sentiments("nrc")
nrcEMOTIONS <- unique(nrcWORDS$sentiment)

In [19]:
fakeTweetsNRCsentiment <- data.frame(id = 0)
for (emotion in nrcEMOTIONS){
    fakeTweetsNRCsentiment0 <- inner_join(fakeTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    fakeTweetsNRCsentiment <- full_join(fakeTweetsNRCsentiment, fakeTweetsNRCsentiment0)
    }
fakeTweetsNRCsentiment <- data.frame(fakeTweetsNRCsentiment[-1,])
#fakeTweetsNRCsentiment

Joining, by = "word"
Joining, by = "id"
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")


In [20]:
realTweetsNRCsentiment <- data.frame(id = as.factor(0))
for (emotion in nrcEMOTIONS){
    realTweetsNRCsentiment0 <- inner_join(realTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    realTweetsNRCsentiment <- full_join(realTweetsNRCsentiment, realTweetsNRCsentiment0)
    }
realTweetsNRCsentiment <- data.frame(realTweetsNRCsentiment[-1,])
#realTweetsNRCsentiment

Joining, by = "word"
Joining, by = "id"
“Column `id` joining factors with different levels, coercing to character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character vector and factor, coercing into character vector”Joining, by = "word"
Joining, by = c("id", "userID", "word", "sentiment")
“Column `id` joining character

In [21]:
nrow(fakeTweetsNRCsentiment)
nrow(realTweetsNRCsentiment)

In [22]:
# with() evaluates table() inside the data frame, so attach()/detach() are not needed
fakeNRCscoredTweets <- data.frame(with(fakeTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "fake")

realNRCscoredTweets <- data.frame(with(realTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "real")

In [23]:
#NRCscoredTweets <- full_join(fakeNRCscoredTweets, realNRCscoredTweets)
NRCscoredTweets <- rbind(fakeNRCscoredTweets, realNRCscoredTweets)
#NRCscoredTweets
#fakeNRCscoredTweets

In [24]:
nrow(NRCscoredTweets)
nrow(fakeNRCscoredTweets)
nrow(realNRCscoredTweets)
nrow(fakeTweetsNRCsentiment)
nrow(realTweetsNRCsentiment)

set.seed(158)
fakeTestTrainIND <- sample(1:nrow(fakeNRCscoredTweets), 50000)
set.seed(241)
realTestTrainIND <- sample(1:nrow(realNRCscoredTweets), 1500000)

testNRC <- rbind(fakeNRCscoredTweets[fakeTestTrainIND, ], realNRCscoredTweets[realTestTrainIND, ])
trainNRC <- rbind(fakeNRCscoredTweets[-fakeTestTrainIND, ], realNRCscoredTweets[-realTestTrainIND, ])
nrow(testNRC)
nrow(trainNRC)
nrow(testNRC)+nrow(trainNRC)

## Random Forest

In [25]:
library(randomForest)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine



In [26]:
# randomForest() takes either a formula with `data =`, or x/y (plus optional
# xtest/ytest); the two interfaces cannot be mixed. The x/y form is used here.
# 'id' is left out of the predictors because it only identifies the tweet.
nrcSCORES.rf <- randomForest(x = subset(trainNRC, select = c(sentiment, Freq)),
                              y = factor(trainNRC$realFAKEcat),
                              xtest = subset(testNRC, select = c(sentiment, Freq)),
                              ytest = factor(testNRC$realFAKEcat),
                              ntree = 250)


In [None]:
#subset(trainNRC, select = -c(realFAKEcat))
#data.matrix(subset(trainNRC, select = -c(realFAKEcat)))

## Naive Bayes

In [None]:
library(e1071)

In [None]:
# naiveBayes() accepts a formula with `data =`; the response should be a factor
trainNRC$realFAKEcat <- factor(trainNRC$realFAKEcat)
nrcSCORES.nb <- naiveBayes(realFAKEcat ~ sentiment + Freq, data = trainNRC)

## Decision/Regression Trees

In [None]:
library(rpart)
nrcSCORES.dtCLASS <- rpart(realFAKEcat ~ sentiment * Freq, data=trainNRC, method="class")

In [None]:
nrcSCORES.dtCLASSalt <- rpart(realFAKEcat ~ sentiment + Freq, data=trainNRC, method="class")

In [None]:
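# method="anova" grows a regression tree, so the label is recoded as 0/1 (1 = fake)
nrcSCORES.dtANOVA <- rpart(as.numeric(realFAKEcat == "fake") ~ sentiment * Freq, data=trainNRC, method="anova")

In [None]:
nrcSCORES.dtANOVAalt <- rpart(as.numeric(realFAKEcat == "fake") ~ sentiment + Freq, data=trainNRC, method="anova")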

In [None]:
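# NOTE: method="exp" expects a survival response such as survival::Surv(time, event), not a class label
nrcSCORES.dtEXP <- rpart(realFAKEcat ~ sentiment * Freq, data=trainNRC, method="exp")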

In [None]:
nrcSCORES.dtEXPalt <- rpart(realFAKEcat ~ sentiment + Freq, data=trainNRC, method="exp")

In [None]:
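# NOTE: method="poisson" expects a count response (or a two-column time/event matrix), not a class label
nrcSCORES.dtPOIS <- rpart(realFAKEcat ~ sentiment * Freq, data=trainNRC, method="poisson")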

In [None]:
nrcSCORES.dtPOISalt <- rpart(realFAKEcat ~ sentiment + Freq, data=trainNRC, method="poisson")

# Work Below not needed **???**

In [24]:
#fakeNRCscoredTweets
nrcEMOTIONS

In [25]:
#scoredFakeTweets <- data.frame()
#fakeNRCscoredTweets

#NRCscoredTweets
#filter(NRCscoredTweets, sentiment == "joy")

In [26]:
trustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "trust")$id, trust = filter(NRCscoredTweets, sentiment == "trust")$Freq)
fearScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "fear")$id, fear = filter(NRCscoredTweets, sentiment == "fear")$Freq)
negScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "negative")$id, negative = filter(NRCscoredTweets, sentiment == "negative")$Freq)
sadnessScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "sadness")$id, sadness = filter(NRCscoredTweets, sentiment == "sadness")$Freq)
angerScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anger")$id, anger = filter(NRCscoredTweets, sentiment == "anger")$Freq)
surpriseScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "surprise")$id, surprise = filter(NRCscoredTweets, sentiment == "surprise")$Freq)
posScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "positive")$id, positive = filter(NRCscoredTweets, sentiment == "positive")$Freq)
disgustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "disgust")$id, disgust = filter(NRCscoredTweets, sentiment == "disgust")$Freq)
joyScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "joy")$id, joy = filter(NRCscoredTweets, sentiment == "joy")$Freq)
anticipationScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anticipation")$id, anticipation = filter(NRCscoredTweets, sentiment == "anticipation")$Freq, realFAKEcat = filter(NRCscoredTweets, sentiment == "anticipation")$realFAKEcat)

In [27]:
nrcSCORES <- full_join(trustScores, full_join(fearScores, full_join(negScores, full_join(sadnessScores, full_join(angerScores, full_join(surpriseScores, full_join(posScores, full_join(disgustScores, full_join(joyScores, anticipationScores)))))))))

Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"
Joining, by = "id"


In [28]:
fakeNRCscores <- filter(nrcSCORES, realFAKEcat == 'fake')
realNRCscores <- filter(nrcSCORES, realFAKEcat == 'real')

In [29]:
nrow(nrcSCORES)
nrow(fakeNRCscores)
nrow(realNRCscores)

Create Samples for training and testing

In [31]:
set.seed(158)
fakeTRAINind <- sample(1:nrow(fakeNRCscores), 50000)

In [34]:
set.seed(231)
realTRAINind <- sample(1:nrow(realNRCscores), 1300000)

In [36]:
fakeNRCscoresTRAIN <- fakeNRCscores[fakeTRAINind, ]
fakeNRCscoresTEST <- fakeNRCscores[-fakeTRAINind, ]
nrow(fakeNRCscoresTRAIN)
nrow(fakeNRCscoresTEST)

In [38]:
realNRCscoresTRAIN <- realNRCscores[realTRAINind, ]
realNRCscoresTEST <- realNRCscores[-realTRAINind, ]
nrow(realNRCscoresTRAIN)
nrow(realNRCscoresTEST)

In [62]:
NRCscoresTRAIN <- rbind(fakeNRCscoresTRAIN, realNRCscoresTRAIN)
NRCscoresTEST <- rbind(fakeNRCscoresTEST, realNRCscoresTEST)

In [64]:
nrow(NRCscoresTRAIN)
nrow(NRCscoresTEST)
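The sampling cells above rely on drawing row indices once and then using negative indexing for the complement. A minimal sketch of that pattern on a made-up ten-row table (the names `demoScores`, `trainInd`, `demoTrain`, and `demoTest` are hypothetical):

```r
set.seed(158)

# Made-up score table with one row per tweet id.
demoScores <- data.frame(id = 1:10, joy = rpois(10, 1))

# Draw 7 row indices without replacement for the training set.
trainInd <- sample(seq_len(nrow(demoScores)), 7)
demoTrain <- demoScores[trainInd, ]

# Negative indexing keeps every row that was not drawn.
demoTest <- demoScores[-trainInd, ]

print(nrow(demoTrain))
print(nrow(demoTest))
```

Since the two subsets are complements of the same index vector, no row can land in both, and their row counts always add up to the size of the original table.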

## Machine Learning

### Random Forest

In [40]:
library(randomForest)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine



In [69]:
# The response must be a factor vector (not a one-column data.frame) for
# classification, and the argument is spelled ntree.
nrcSCORES.rf <- randomForest(x = subset(NRCscoresTRAIN, select = -c(id,realFAKEcat)),
                              y = factor(NRCscoresTRAIN$realFAKEcat),
                              xtest = subset(NRCscoresTEST, select = -c(id,realFAKEcat)),
                              ytest = factor(NRCscoresTEST$realFAKEcat),
                              ntree = 1000)


In [59]:
#realNRCscoresTEST[ ,  c("id","realFAKEcat")]

ERROR: Error in .subset(x, j): invalid subscript type 'language'


In [None]:
##df <- subset(df, select = -c(a,c) )

In [70]:
#subset(realNRCscoresTEST, select = -c(id,realFAKEcat))

NRCtrainIN <- subset(NRCscoresTRAIN, select = -c(id,realFAKEcat))
# the response must be a factor vector (not a one-column data.frame) for classification
NRCtrainOUT <- factor(NRCscoresTRAIN$realFAKEcat)

In [71]:
nrc.rf <- randomForest(NRCtrainIN, NRCtrainOUT)
