# Generate Indices and Create Sample Files: All Tweets (*real/fake categories*)

This notebook generates indices for sampling the data and exports datasets from the corpus based on those indices.  The indices themselves are exported as well.  Additionally, the data exported by this notebook are categorized as either **_real_** or **_fake_**.

## Load the data file

> First, we store the names (*and locations*) of the files containing fake tweets in the string '**fileName0**' and the string array '**fileNames**'[ ].

In [1]:
fileName0 = 'datasetsFULLcsv/fakeFollowersCSV/tweets.csv'

fileNames = c('datasetsFULLcsv/socialSpambots1csv/tweets.csv',
              'datasetsFULLcsv/socialSpambots2csv/tweets.csv',
              'datasetsFULLcsv/socialSpambots3csv/tweets.csv',
              'datasetsFULLcsv/traditionalSpambots1csv/tweets.csv')

> Using the CSV file names previously specified in '**fileName0**' and '**fileNames**'[ ], we can now load the first file into the _data.frame_( ) named '**fakeCSV**' and append the tweets from the remaining files to the _data.frame_( ) named '**fakeTweets**'.

In [2]:
fakeCSV = read.csv(fileName0)
fakeTweets <- data.frame(userID = fakeCSV$user_id, id = fakeCSV$id, text = fakeCSV$text)

for (filename in fileNames) {
    temp0 = read.csv(filename)
    temp <- data.frame(userID = temp0$user_id, id = temp0$id, text = temp0$text)
    fakeTweets <- rbind(fakeTweets, temp)
}

“embedded nul(s) found in input”
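
> The warning above indicates that the CSV contains NUL bytes.  As a minimal sketch, assuming those bytes are noise that can safely be dropped, `read.csv` accepts `skipNul = TRUE`, and the per-file loading can be wrapped in a small helper.  The helper name `readTweets` and the temporary demo file are hypothetical and not part of the original notebook.

```r
# Hypothetical helper: read one tweets CSV and keep only the columns used here.
# skipNul = TRUE asks read.csv to skip embedded NUL bytes while reading.
readTweets <- function(path) {
    raw <- read.csv(path, skipNul = TRUE, stringsAsFactors = FALSE)
    data.frame(userID = raw$user_id, id = raw$id, text = raw$text,
               stringsAsFactors = FALSE)
}

# Demo on a small temporary file with one NUL byte inside a text field.
demoFile <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("user_id,id,text\n7,1,he"), as.raw(0),
           charToRaw("llo\n8,2,world\n")), demoFile)

# Read several files and stack them in one step instead of a growing rbind loop.
demoTweets <- do.call(rbind, lapply(c(demoFile, demoFile), readTweets))
nrow(demoTweets)
```

> In the notebook itself, the same pattern would be applied to `c(fileName0, fileNames)`.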

> We now load the file containing real tweets into the _data.frame_( ) named '**realCSV**'.

In [3]:
realCSV = read.csv('datasetsFULLcsv/genuineAccountsCSV/tweets.csv')

“embedded nul(s) found in input”

> From the combined fake tweets (already held in '**fakeTweets**') and from '**realCSV**', we create two smaller, simpler *data.frame*( )s named '**fakeTweets**' and '**realTweets**', respectively.  They are smaller and simpler because each keeps only the ID number of the tweet in our database along with the text of the tweet.

In [4]:
# Keep only id and text.  Build from the combined fakeTweets rather than fakeCSV,
# so the spambot tweets appended in the loop above are not discarded.
fakeTweets <- data.frame(id = fakeTweets$id, text = fakeTweets$text)
realTweets <- data.frame(id = realCSV$id, text = realCSV$text)

> The initial sizes of the imports are:

In [5]:
nrow(fakeTweets)
nrow(realTweets)

## Clean the data

>> Now we remove web URLs, Twitter usernames, Twitter hashtags, punctuation, and numeric digits, and then convert the remaining text to lowercase.

In [None]:
# remove web URLs
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("http[[:alnum:][:punct:]]*", "", realTweets$text))

# remove hashtags (#<hashtag name>)
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("#[[:alnum:][:punct:]]*", "", realTweets$text))

# remove twitter handles (@<username>)
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("@[[:alnum:][:punct:]]*", "", realTweets$text))

# remove punctuation
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub('[[:punct:] ]+', ' ', fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub('[[:punct:] ]+', ' ', realTweets$text))

# remove all digits
fakeTweets <- data.frame(id = fakeTweets$id, text = gsub("[0-9]", "", fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = gsub("[0-9]", "", realTweets$text))

# convert to lowercase
fakeTweets <- data.frame(id = fakeTweets$id, text = tolower(fakeTweets$text))
realTweets <- data.frame(id = realTweets$id, text = tolower(realTweets$text))

> The number of tweets available after "cleaning":

In [None]:
nrow(fakeTweets)
nrow(realTweets)
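
> The cleaning steps above can also be collected into a single function that is applied to both data frames.  The following is a minimal sketch using the same regular expressions; the function name `cleanTweetText` and the sample tweet are hypothetical.

```r
# Hypothetical helper applying the notebook's cleaning steps to a character vector.
cleanTweetText <- function(text) {
    text <- gsub("http[[:alnum:][:punct:]]*", "", text)  # web URLs
    text <- gsub("#[[:alnum:][:punct:]]*", "", text)     # hashtags
    text <- gsub("@[[:alnum:][:punct:]]*", "", text)     # twitter handles
    text <- gsub("[[:punct:] ]+", " ", text)             # punctuation -> one space
    text <- gsub("[0-9]", "", text)                      # digits
    tolower(text)
}

sampleTweet <- "Check THIS out http://t.co/abc #wow @bob, 2 cool!!"
cleanedTweet <- cleanTweetText(sampleTweet)

# Split on runs of spaces to see the surviving words.
cleanedWords <- strsplit(trimws(cleanedTweet), " +")[[1]]
cleanedWords
```

> With such a helper, each data frame would need only one call, for example `fakeTweets$text <- cleanTweetText(fakeTweets$text)`.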

> ### TidyText the data file

> Now we must tokenize the text of each tweet using the '*tidytext*' and '*dplyr*' libraries.  To do this, we import both libraries and convert the data frames holding the fake and real tweets to tibbles, the data frame type used by '*dplyr*', before tokenizing.

>> First, we import the '*tidytext*' and '*dplyr*' libraries: 

In [None]:
library(dplyr)
library(tidytext)

>> Then we convert the data frames '**fakeTweets**' and '**realTweets**' to tibbles, the data frame type used by the '*dplyr*' library:

In [None]:
# tibble() replaces the deprecated data_frame()
fakeTweets <- tibble(id = fakeTweets$id, text = as.character(fakeTweets$text))
realTweets <- tibble(id = realTweets$id, text = as.character(realTweets$text))

>> so that we can finally tokenize the text of each tweet.

> ### Tokenization

> We now tokenize the text of the tweets, storing these tokens in new data frames (*from __dplyr__*). 

>> For the fake tweets, we use the new data frame (*from __dplyr__*) '**fakeTweetsTOKENS**':

In [None]:
fakeTweetsTOKENS <- fakeTweets %>%
    unnest_tokens(word, text)

>> Similarly, we tokenize the text of the real tweets, storing the tokens in the new data frame (*from __dplyr__*) '**realTweetsTOKENS**'.

In [None]:
realTweetsTOKENS <- realTweets %>%
    unnest_tokens(word, text)
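
> To make the effect of tokenization concrete, here is a small sketch on two made-up tweets.  The names `toyTweets` and `toyTokens` are hypothetical; `unnest_tokens( )` produces one row per word and keeps the tweet's `id` alongside it.

```r
library(dplyr)
library(tidytext)

# Two made-up, already cleaned tweets.
toyTweets <- tibble(id = c(1, 2),
                    text = c("good morning world", "bots everywhere"))

# One row per word; the `text` column is replaced by `word`.
toyTokens <- toyTweets %>%
    unnest_tokens(word, text)

toyTokens
```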

> ### Remove '*Stop Words*'

> Now, we will remove any stop words from the text of the tweets.  To do this, we first import the '*stop_words*' dataset from the '*tidytext*' library.

>> Importing the '*stop_words*' dataset from the '*tidytext*' library:

In [None]:
data(stop_words)

>> Now, we use the '*anti_join*( )' function from the '*dplyr*' library to remove these stop words from the fake tweets

In [None]:
fakeTweetsTOKENS <- fakeTweetsTOKENS %>%
    anti_join(stop_words)

>> and the real tweets

In [None]:
realTweetsTOKENS <- realTweetsTOKENS %>%
    anti_join(stop_words)

> The numbers of fake and real tweet tokens remaining are:

In [None]:
nrow(fakeTweetsTOKENS)
nrow(realTweetsTOKENS)
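
> As a small illustration of the stop-word removal, the sketch below runs the same `anti_join( )` on a handful of made-up tokens.  The token values and the names `toyStopTokens` and `toyKept` are hypothetical.  Passing `by = "word"` names the join column explicitly, which also avoids the message dplyr prints when it has to guess the column.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Made-up tokens: three common stop words and two content words.
toyStopTokens <- tibble(id = c(1, 1, 1, 2, 2),
                        word = c("the", "spambot", "and", "of", "retweet"))

# Keep only the rows whose word does not appear in stop_words.
toyKept <- toyStopTokens %>%
    anti_join(stop_words, by = "word")

toyKept$word
```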

## NRC Model

> We first use the **NRC Sentiment Model**, which associates words with ten categories: eight emotions plus positive and negative sentiment (referred to below as the ten emotions).  Thus, we load the NRC sentiment lexicon and then store the names of those ten categories in their own character vector.

In [None]:
nrcWORDS <- get_sentiments("nrc")
nrcEMOTIONS <- unique(nrcWORDS$sentiment)

> ### Associate Words with their Emotions

> Now, we generate data frames that associate the words in the fake and real tweets with their emotions under the **NRC Model**.

>> First, for fake tweets

In [None]:
fakeTweetsNRCsentiment <- data.frame(id = 0)
for (emotion in nrcEMOTIONS){
    fakeTweetsNRCsentiment0 <- inner_join(fakeTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    fakeTweetsNRCsentiment <- full_join(fakeTweetsNRCsentiment, fakeTweetsNRCsentiment0)
    }
fakeTweetsNRCsentiment <- data.frame(fakeTweetsNRCsentiment[-1,])

>> and then for real tweets

In [None]:
realTweetsNRCsentiment <- data.frame(id = as.factor(0))
for (emotion in nrcEMOTIONS){
    realTweetsNRCsentiment0 <- inner_join(realTweetsTOKENS, filter(nrcWORDS, sentiment == emotion))
    realTweetsNRCsentiment <- full_join(realTweetsNRCsentiment, realTweetsNRCsentiment0)
    }
realTweetsNRCsentiment <- data.frame(realTweetsNRCsentiment[-1,])
#realTweetsNRCsentiment

> The number of (word, sentiment) pairs now in our data sets:

In [None]:
nrow(fakeTweetsNRCsentiment)
nrow(realTweetsNRCsentiment)
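
> The per-emotion loops above should produce the same rows, up to ordering, as a single inner join of the tokens with the whole lexicon.  The sketch below shows that shorter form on toy data; `toyWordTokens` and `toyLexicon` are hypothetical stand-ins for '**fakeTweetsTOKENS**' and '**nrcWORDS**', and the word-to-emotion pairs are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for the token table and the NRC lexicon.
toyWordTokens <- tibble(id = c(1, 1, 2), word = c("happy", "storm", "happy"))
toyLexicon <- tibble(word = c("happy", "happy", "storm"),
                     sentiment = c("joy", "positive", "fear"))

# A word can occur in many tweets and carry several sentiments, so the join
# is many-to-many by design.  The `relationship` argument needs dplyr >= 1.1.0;
# drop it on older versions.
toySentiment <- inner_join(toyWordTokens, toyLexicon, by = "word",
                           relationship = "many-to-many")

nrow(toySentiment)
```

> Each (tweet, word) row is repeated once per sentiment attached to that word, which is the layout the counting step that follows expects.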

> ### Generate *Tweet-Emotion Matrices*

> We now need to generate what we are calling *Tweet-Emotion Matrices*.  In these matrices, the rows represent the tweets we have analyzed, while the columns represent the ten emotions in the **NRC Model**.  Each element is therefore the number of words in the row's tweet that are associated with the column's emotion, which we take as the strength of that emotion in the tweet.

>> First, we do this for the fake tweets:

In [None]:
# with() evaluates table() inside the data frame without attach()/detach()
fakeNRCscoredTweets <- data.frame(with(fakeTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "fake")

>> Then we do it for the real tweets

In [None]:
# with() evaluates table() inside the data frame without attach()/detach()
realNRCscoredTweets <- data.frame(with(realTweetsNRCsentiment, table(id, sentiment)), realFAKEcat = "real")

>> We now create a combined data frame containing **_ALL_** tweets, both real and fake

In [None]:
NRCscoredTweets <- rbind(fakeNRCscoredTweets, realNRCscoredTweets)

>> Using this data frame, we create "data *sub*-frames" to store the frequency data for each emotion covered by the **NRC Model**:

In [None]:
trustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "trust")$id, trust = filter(NRCscoredTweets, sentiment == "trust")$Freq)
fearScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "fear")$id, fear = filter(NRCscoredTweets, sentiment == "fear")$Freq)
negScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "negative")$id, negative = filter(NRCscoredTweets, sentiment == "negative")$Freq)
sadnessScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "sadness")$id, sadness = filter(NRCscoredTweets, sentiment == "sadness")$Freq)
angerScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anger")$id, anger = filter(NRCscoredTweets, sentiment == "anger")$Freq)
surpriseScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "surprise")$id, surprise = filter(NRCscoredTweets, sentiment == "surprise")$Freq)
posScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "positive")$id, positive = filter(NRCscoredTweets, sentiment == "positive")$Freq)
disgustScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "disgust")$id, disgust = filter(NRCscoredTweets, sentiment == "disgust")$Freq)
joyScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "joy")$id, joy = filter(NRCscoredTweets, sentiment == "joy")$Freq)
anticipationScores <- data.frame(id = filter(NRCscoredTweets, sentiment == "anticipation")$id, anticipation = filter(NRCscoredTweets, sentiment == "anticipation")$Freq, realFAKEcat = filter(NRCscoredTweets, sentiment == "anticipation")$realFAKEcat)

>> These "data *sub*-frames" are now combined by the tweet id represented by each row to give the final **_Tweet-Emotion Matrix_**.

In [None]:
nrcSCORES <- full_join(trustScores, full_join(fearScores, full_join(negScores, full_join(sadnessScores, full_join(angerScores, full_join(surpriseScores, full_join(posScores, full_join(disgustScores, full_join(joyScores, anticipationScores)))))))))
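
> For comparison, base R can go from (id, sentiment) pairs to the wide tweet-by-emotion layout in one step, since a two-way `table( )` already has one row per tweet and one column per emotion.  This is only a sketch on made-up pairs with hypothetical names (`toyPairs`, `toyMatrix`); the `realFAKEcat` column would still have to be attached separately.

```r
# Made-up (tweet id, sentiment) pairs.
toyPairs <- data.frame(id = c("t1", "t1", "t2", "t2", "t2"),
                       sentiment = c("joy", "trust", "fear", "fear", "joy"))

# Two-way contingency table: rows are tweets, columns are emotions.
toyCounts <- table(toyPairs$id, toyPairs$sentiment)

# Convert the table to a wide data frame and keep the tweet id as a column.
toyMatrix <- as.data.frame.matrix(toyCounts)
toyMatrix$id <- rownames(toyMatrix)

toyMatrix
```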

> ### Look at the table

> We now take a quick look at the layout of our table to ensure it's what we want.

In [None]:
head(nrcSCORES)

> ### Generate fake and real tables

> We now create two versions of this **_Tweet-Emotion Matrix_**, one with only fake tweets and the other with only real tweets.

>> For the fake tweets:

In [None]:
fakeNRCscores <- filter(nrcSCORES, realFAKEcat == 'fake')

>> and then for the real tweets

In [None]:
realNRCscores <- filter(nrcSCORES, realFAKEcat == 'real')

> ### Export tables

> We now export the complete processed tables for fake tweets, real tweets, and all tweets.

>> Starting with the fake tweets:

In [None]:
write.table(fakeNRCscores, "fakeNRCscores.csv", sep=",")

>> Continuing with the real tweets:

In [None]:
write.table(realNRCscores, "realNRCscores.csv", sep=",")

>> and finishing with all tweets:

In [None]:
write.table(nrcSCORES, "NRCscores.csv", sep=",")

## Random Sampling Indices Generation

> We now need to generate sets of random indices into our data so that we can sample it for training and testing purposes.  We create four groups of index sets: two pairs of sets for training and two single sets for testing.  (The testing sets are still labeled *PAIR 3* and *PAIR 4* to keep the numbering consistent.)  These are
* __PAIR 1 :__ Has a total size of 50,000 across the pair of sets.  One set indexes 25,000 fake tweets, while the other indexes 25,000 real tweets.  This pair of sets is for training.
* __PAIR 2 :__ Has a total size of 10,000 across the pair of sets.  One set indexes 5,000 fake tweets, while the other indexes 5,000 real tweets.  This pair of sets is for training.
* __PAIR 3 :__ A single set of 15,000 indices drawn at random from all tweets in the database, real or fake.  This set is for testing.
* __PAIR 4 :__ A single set of 5,000 indices drawn at random from all tweets in the database, real or fake.  This set is for testing.
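
> The sampling scheme listed above can be sketched on a toy table as follows.  The table `toyScores`, the sample sizes, and the fixed seed value are hypothetical; a fixed seed is used here only to show that the same seed reproduces the same indices.

```r
# Toy stand-in for the scored tweets: 50 fake rows followed by 50 real rows.
toyScores <- data.frame(id = 1:100,
                        realFAKEcat = rep(c("fake", "real"), each = 50))
toyFake <- toyScores[toyScores$realFAKEcat == "fake", ]
toyReal <- toyScores[toyScores$realFAKEcat == "real", ]

set.seed(2024)  # hypothetical fixed seed
fakeIdx <- sample(nrow(toyFake), 10)    # balanced training pair: fake half
realIdx <- sample(nrow(toyReal), 10)    # balanced training pair: real half
testIdx <- sample(nrow(toyScores), 20)  # testing set drawn from all tweets

# Re-seeding with the same value reproduces the first draw exactly.
set.seed(2024)
fakeIdxAgain <- sample(nrow(toyFake), 10)
identical(fakeIdx, fakeIdxAgain)
```

> `sample( )` draws without replacement by default, so no index appears twice within a set.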

> ### *PAIR 1*

>> Seeding the RNG and generating the first pair of index sets:

In [None]:
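# Note: a seed drawn with runif() is itself random, so it does not make the run
# repeatable; the index files written below are what let the sample be reused.
set.seed(runif(1,1,10000))
fakeTrainInd1 <- sample(1:nrow(fakeNRCscores), 25000)

set.seed(runif(1,1,10000))
realTrainInd1 <- sample(1:nrow(realNRCscores), 25000)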

>> Writing the first pair of index sets to file

In [None]:
write.table(fakeTrainInd1, "fakeTrainInd1.txt")

write.table(realTrainInd1, "realTrainInd1.txt")

> ### *PAIR 2*

>> Seeding the RNG and generating the second pair of index sets:

In [None]:
set.seed(runif(1,1,10000))
fakeTrainInd2 <- sample(1:nrow(fakeNRCscores), 5000)

set.seed(runif(1,1,10000))
realTrainInd2 <- sample(1:nrow(realNRCscores), 5000)

>> Writing the second pair of index sets to file

In [None]:
write.table(fakeTrainInd2, "fakeTrainInd2.txt")

write.table(realTrainInd2, "realTrainInd2.txt")

> ### *PAIR 3*

>> Seeding the RNG and generating the first index set for testing:

In [None]:
set.seed(runif(1,1,10000))
testInd1 <- sample(1:nrow(nrcSCORES), 15000)

>> Writing the first index set for testing to file

In [None]:
write.table(testInd1, "testInd1.txt")

> ### *PAIR 4*

>> Seeding the RNG and generating the second index set for testing:

In [None]:
set.seed(runif(1,1,10000))
testInd2 <- sample(1:nrow(nrcSCORES), 5000)

>> Writing the second index set for testing to file

In [None]:
write.table(testInd2, "testInd2.txt")

## Export Sampled Data

In [None]:
fakeTrainData1 <- fakeNRCscores[fakeTrainInd1, ]

realTrainData1 <- realNRCscores[realTrainInd1, ]

In [None]:
write.table(fakeTrainData1, "fakeTrainData1.csv", sep=",")

write.table(realTrainData1, "realTrainData1.csv", sep=",")

In [None]:
fakeTrainData2 <- fakeNRCscores[fakeTrainInd2, ]

realTrainData2 <- realNRCscores[realTrainInd2, ]

In [None]:
write.table(fakeTrainData2, "fakeTrainData2.csv", sep=",")

write.table(realTrainData2, "realTrainData2.csv", sep=",")

In [None]:
testData1 <- nrcSCORES[testInd1, ]

In [None]:
write.table(testData1, "testData.csv", sep=",")

In [None]:
testData2 <- nrcSCORES[testInd2, ]

In [None]:
write.table(testData2, "testData2.csv", sep=",")
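
> The exported index files can be read back later to rebuild exactly the same samples.  The sketch below assumes the default `write.table( )` layout used above, in which a plain vector is written as a single column named `x` with row names; the temporary file and the names `toyIdx` and `restoredIdx` are hypothetical.

```r
# Write a small index vector the same way the notebook does.
idxFile <- tempfile(fileext = ".txt")
set.seed(1)
toyIdx <- sample(20, 5)
write.table(toyIdx, idxFile)

# read.table() detects the header and row names; the indices are in column x.
restoredIdx <- read.table(idxFile)$x
identical(restoredIdx, toyIdx)

# The restored indices select the same rows as the originals.
toyData <- data.frame(id = 101:120)
restoredSample <- toyData[restoredIdx, , drop = FALSE]
```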