In [1]:
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(tidyverse)
library(magrittr)
library(caret)

Package version: 3.2.4
Unicode version: 13.0
ICU version: 69.1

Parallel computing: 8 of 8 threads used.

See https://quanteda.io for tutorials and examples.

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘magrittr’


The following object is masked from ‘package:purrr’:

    set_names


The following object is masked from ‘package:tidyr’:

    extract


Loading required pack

In [2]:
# Load the news dataset (label: 1 = fake, 0 = true)
df <- read.csv("http://jsienkiewicz.pl/TEXT/lab/data_fn.csv")
df.corp <- corpus(df)
summary(df.corp, n = 5)

,Text,Types,Tokens,Sentences,X,id,title,author,label
,<chr>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>
1,text1,70,86,5,952,951,[WATCH] Thug Calls US Marine a “Pussy” – Barely Lives to Tell the Tale,The Conservative Millennial,1
2,text2,239,485,14,11483,11482,Trump Says Health Law Replacement May Not Be Ready Until Next Year - The New York Times,Mark Landler,0
3,text3,598,1430,55,18965,18964,Downside of Being a Global Hub: Invasive Species - The New York Times,Sarah Maslin Nir,0
4,text4,198,355,11,6812,6811,"American Tourist Can’t Get Over Dirty, Decaying & Dangerous Charm Of Dublin City",Julius Hubris,1
5,text5,289,585,21,11026,11025,Damascus Bombings Near Pilgrimage Sites Kill Dozens - The New York Times,Ben Hubbard,0


In [3]:
# Document-feature matrix: tokenize, drop punctuation and English stopwords, then stem
df.mat <- df.corp %>%
    tokens(remove_punct = TRUE) %>%
    dfm() %>%
    dfm_remove(stopwords("english")) %>%
    dfm_wordstem()
# Per-document punctuation counts
df.s <- df.corp %>%
    textstat_summary() %>%
    select(document, puncts)
# Lexical diversity: type-token ratio (TTR) and Herdan's C
df.lex <- df.mat %>%
    textstat_lexdiv(measure = c("TTR", "C"))
# Readability: Gunning Fog index (FOG)
df.read <- df.corp %>%
    textstat_readability(measure = "FOG")
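
The `TTR` and `C` measures requested from `textstat_lexdiv()` can be reproduced by hand. The sketch below is a minimal base-R illustration, assuming the usual definitions TTR = V / N and Herdan's C = log(V) / log(N), where V is the number of distinct types and N the number of tokens. The toy token vector is hypothetical, and the counts V = 38 and N = 42 are not taken from the data; they are inferred because they are consistent with the TTR and C printed for the first document later in the notebook.

```r
# Hypothetical toy document, already tokenized and stemmed
toks <- c("fake", "news", "spread", "fast", "fake", "news")

V <- length(unique(toks))  # number of distinct types
N <- length(toks)          # number of tokens

ttr      <- V / N            # type-token ratio
herdan_c <- log(V) / log(N)  # Herdan's C

# Counts inferred (an assumption) to match the first document's reported values
ttr_doc1 <- 38 / 42
c_doc1   <- log(38) / log(42)
```

Because both measures are computed on `df.mat` (stopwords removed, tokens stemmed), they do not equal the ratio of the `Types` and `Tokens` columns from `summary(df.corp)`, which are counted on the raw text.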

In [4]:
# Final dataset: corpus summary joined with punctuation, lexical-diversity, and readability features
df.set <- data.frame(summary(df.corp, n = ndoc(df.corp))) %>%
    select(Text, Types, Tokens, Sentences, label) %>%
    # label 0 -> "true", label 1 -> "fake"
    mutate(label = as.factor(c("true", "fake")[label + 1])) %>%
    rename(document = Text) %>%
    left_join(df.s, by = "document") %>%
    left_join(df.lex, by = "document") %>%
    left_join(df.read, by = "document") %>%
    select(-document) %>%
    drop_na()
head(df.set)

,Types,Tokens,Sentences,label,puncts,TTR,C,FOG
,<int>,<int>,<int>,<fct>,<int>,<dbl>,<dbl>,<dbl>
1,70,86,5,fake,13,0.9047619,0.973223,10.24432
2,239,485,14,true,66,0.6514523,0.9218656,18.65401
3,598,1430,55,true,175,0.6170213,0.9263743,14.41811
4,198,355,11,fake,41,0.748538,0.9436693,17.27805
5,289,585,21,true,61,0.647651,0.92375,16.62217
6,66,77,3,fake,17,0.8666667,0.9624078,17.33333
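
The `label` column above comes from indexing a two-element character vector with `label + 1`, so the integer code 0 becomes "true" and 1 becomes "fake". A minimal sketch of that recoding on a hypothetical label vector (not the real data):

```r
# Hypothetical integer labels: 1 = fake, 0 = true
label <- c(1L, 0L, 0L, 1L)

# R vectors are 1-indexed, so label + 1 selects "true" for 0 and "fake" for 1
mapped <- as.factor(c("true", "fake")[label + 1])

# Factor levels are sorted alphabetically: "fake" first, then "true"
lvls <- levels(mapped)
```

The alphabetical level order matters later, since it determines the row and column order of the confusion matrix.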


In [5]:
# Training
class <- df.set$label  # class labels

levels(class) <- c("fake", "true")  # levels are already "fake", "true"; stated explicitly
df.set %<>%  # drop the label column so only predictors remain
    select(-label)
data <- cbind(df.set, class.out = class)


# 10-fold cross-validation with a linear-kernel SVM
fit <- trainControl(method = "cv", number = 10)
model <- train(class.out ~ ., data = data, method = "svmLinear", trControl = fit)

In [6]:
model

Support Vector Machines with Linear Kernel 

1982 samples
   7 predictor
   2 classes: 'fake', 'true' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1784, 1784, 1783, 1784, 1784, 1784, ... 
Resampling results:

  Accuracy  Kappa    
  0.671066  0.3421654

Tuning parameter 'C' was held constant at a value of 1

In [7]:
confusionMatrix(model)

Cross-Validated (10 fold) Confusion Matrix 

(entries are percentual average cell counts across resamples)
 
          Reference
Prediction fake true
      fake 33.0 16.1
      true 16.8 34.1
                           
 Accuracy (average) : 0.671
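
The reported average accuracy can be checked from the cell percentages, and per-class figures follow from the same table. The sketch below copies the four percentages from the output above into a matrix; the precision and recall for the "fake" class are derived quantities that caret did not print here, and the variable names are illustrative.

```r
# Cell percentages copied from the confusionMatrix(model) output;
# matrix() fills column by column, so each column is one Reference class
cm <- matrix(c(33.0, 16.8,   # Reference = fake
               16.1, 34.1),  # Reference = true
             nrow = 2,
             dimnames = list(Prediction = c("fake", "true"),
                             Reference  = c("fake", "true")))

# Accuracy: share of documents on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for "fake": correctly flagged fakes over all actual fakes
recall_fake <- cm["fake", "fake"] / sum(cm[, "fake"])

# Precision for "fake": correctly flagged fakes over all predicted fakes
precision_fake <- cm["fake", "fake"] / sum(cm["fake", ])

print(round(accuracy, 3))
```

The diagonal sums to 67.1 of 100, which matches the 0.671 average accuracy; the errors are split almost evenly between the two classes.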
