# Regression Analysis on News Documents (paper draft version)

A few notes about the analysis in the draft version:
* We did not remove "DESPL", which is the source of the warning in the PCA because it is highly correlated with "DESSC".
* The way I chose the number of PCA components was not correct and resulted in 33 components in the PCA instead of 22.
* Truth labels are not included in fitting the models, although we explicitly mentioned this in the paper.
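
The first note concerns collinearity between features. As a minimal sketch of how such pairs could be found before running the PCA, the snippet below lists feature pairs whose absolute correlation exceeds a cutoff. The helper name `high_cor_pairs`, the 0.95 cutoff, and the toy data frame are illustrative assumptions; only the `DESSC` and `DESWC` values are copied from the first rows of the data, where `DESPL` equals `DESSC`.

```r
# Hypothetical helper: list feature pairs whose |correlation| exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.95) {
  r <- cor(df)
  # look only at the upper triangle so that each pair is reported once
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data (not the full FakeNewsNet file): DESPL duplicates DESSC,
# and "other" is an arbitrary made-up column.
toy <- data.frame(
  DESSC = c(21, 57, 22, 36, 9, 17),
  DESWC = c(552, 1363, 602, 751, 197, 364),
  other = c(3, 1, 4, 1, 5, 9)
)
toy$DESPL <- toy$DESSC

pairs <- high_cor_pairs(toy)
print(pairs)
```

Any pair reported this way is a candidate for dropping one of its members, which is what removing "DESPL" amounts to.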

In [4]:
# read the Excel file, which contains the features, the share counts, and the labels
library("readxl")
my_data <- read_excel("data/FakeNewsNet/processed/fakenewsnet_full_draft.xlsx")
head(my_data)

DESSC,DESWC,DESPL,DESSL,DESSLd,DESWLsy,DESWLsyd,DESWLlt,DESWLltd,PCNARz,⋯,WRDHYPv,WRDHYPnv,RDFRE,RDFKGL,RDL2,CREL,CRELWD,CRELSC,shares,label
21,552,21,26.286,15.897,1.5,0.781,4.654,2.427,0.071,⋯,1.745,2.293,53.255,12.362,12.648,2,0.003642987,0.09090909,2.564949,1
57,1363,57,24.123,10.36,1.742,1.033,5.272,2.906,-1.082,⋯,2.069,2.288,35.191,14.291,8.408,4,0.002945508,0.07017544,3.465736,1
22,602,22,27.409,9.231,1.575,0.929,4.664,2.659,0.336,⋯,1.284,1.494,45.816,13.667,22.702,2,0.003322259,0.09090909,2.995732,1
36,751,36,20.917,15.249,1.597,0.889,4.736,2.572,0.353,⋯,1.702,1.791,50.555,11.39,12.508,2,0.002663116,0.05555556,2.302585,1
9,197,9,21.889,13.896,1.711,1.061,5.066,3.042,-0.013,⋯,1.894,1.953,39.867,13.137,14.195,0,0.0,0.0,4.779123,1
17,364,17,21.824,11.566,1.668,0.969,5.102,2.752,-0.782,⋯,1.835,2.001,43.989,12.443,13.695,1,0.002747253,0.05882353,5.056246,1


In [5]:
# histogram and normal Q-Q plot of the shares variable (currently commented out)
# hist(unlist(my_data["shares"]))
# qqnorm(unlist(my_data["shares"]))

## Principal Component Analysis (PCA)
In this section, we first run a PCA on all the Coh-Metrix indices to reduce the number of features. Then, using the component scores from the PCA model, we fit a linear regression to find which components are significant predictors of the number of shares.
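
The two-step procedure can be sketched on synthetic data as follows. This is a minimal illustration, not the analysis itself: it uses base R `prcomp` (unrotated components) instead of `psych::principal`, keeps the components whose eigenvalue exceeds 1 (the Kaiser criterion, an assumed selection rule), and all variable names and data are made up.

```r
set.seed(42)
n <- 200

# Two latent factors drive six observed features (a stand-in for the indices).
f1 <- rnorm(n)
f2 <- rnorm(n)
x_toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3), a3 = f1 + rnorm(n, sd = 0.3),
  b1 = f2 + rnorm(n, sd = 0.3), b2 = f2 + rnorm(n, sd = 0.3), b3 = f2 + rnorm(n, sd = 0.3)
)
# Synthetic response that depends on the first latent factor only.
y_toy <- 3 + 0.5 * f1 + rnorm(n, sd = 0.5)

# Step 1: PCA on standardized features (equivalent to using the correlation matrix).
pca_toy <- prcomp(x_toy, center = TRUE, scale. = TRUE)
eigenvalues <- pca_toy$sdev^2
n_keep <- sum(eigenvalues > 1)  # assumed rule: keep eigenvalues above 1
scores <- pca_toy$x[, seq_len(n_keep), drop = FALSE]

# Step 2: regress the response on the retained component scores.
fit <- lm(y_toy ~ scores)
coefs <- summary(fit)$coefficients
# Names of the non-intercept terms with p < 0.05.
significant <- rownames(coefs)[-1][coefs[-1, "Pr(>|t|)"] < 0.05]

print(n_keep)
print(significant)
```

Applied to the `eigen(cor(x))` output in this notebook, the same rule would retain the 22 components whose eigenvalue exceeds 1. That matches the 22 mentioned in the notes at the top, although the notes do not say which rule was intended.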

In [16]:
# Principal Component Analysis (PCA)
# ----------------------------------
# to install the "psych" package, uncomment the two following lines:
# options(download.file.method = "wget")
# install.packages("psych")
library("psych")

# dropping the columns that we do not want in the regression analysis.
truth_labels <- my_data["label"]
# in the draft version, I did not remove "DESPL", which is apparently the source of the warning because it is highly correlated with "DESSC"
drops <- c("X__1", "label", "shares", "DESPL")
x <- my_data[, !(names(my_data) %in% drops)]
y <- my_data["shares"]

print(eigen(cor(x)))

# in the draft I chose 33 components; based on our conversation, that was not the best number of components and my decision was wrong
pca <- psych::principal(x, nfactors=105)
print(pca$loadings, cutoff = 0.4, sort = TRUE)

# linear regression using the PCA scores
# note: the draft fit did not include truth_labels; the call below adds them as an extra regressor
lin_model <- lm(unlist(y) ~ pca$scores + as.matrix(truth_labels))
summary(lin_model)
write.csv(pca$loadings, "data/FakeNewsNet/processed/pca_loadings_draft.csv")

eigen() decomposition
$values
  [1] 1.808731e+01 1.136237e+01 8.630666e+00 7.040735e+00 6.177695e+00
  [6] 4.641267e+00 3.860428e+00 3.140690e+00 2.931932e+00 2.670532e+00
 [11] 2.331104e+00 2.131929e+00 2.060065e+00 1.911198e+00 1.729315e+00
 [16] 1.668253e+00 1.461461e+00 1.403473e+00 1.250181e+00 1.197088e+00
 [21] 1.099596e+00 1.010959e+00 9.771703e-01 9.392537e-01 9.146293e-01
 [26] 8.927668e-01 8.425319e-01 7.965978e-01 7.367092e-01 7.001007e-01
 [31] 6.895605e-01 6.628031e-01 5.998363e-01 5.650760e-01 5.502797e-01
 [36] 5.087600e-01 5.082439e-01 4.564207e-01 4.302363e-01 3.968759e-01
 [41] 3.787889e-01 3.449072e-01 3.275067e-01 3.136812e-01 3.000746e-01
 [46] 2.870196e-01 2.539484e-01 2.367642e-01 2.247458e-01 2.216454e-01
 [51] 2.077529e-01 1.955815e-01 1.803202e-01 1.639752e-01 1.585275e-01
 [56] 1.483717e-01 1.407623e-01 1.342107e-01 1.217033e-01 1.197996e-01
 [61] 1.114950e-01 1.103039e-01 1.049049e-01 9.792093e-02 9.653827e-02
 [66] 9.066667e-02 8.220504e-02 7.284073e-02 7.


Loadings:
          RC1    RC2    RC6    RC5    RC3    RC9    RC7    RC20   RC4    RC13  
DESWLsy   -0.963                                                               
DESWLsyd  -0.885                                                               
DESWLlt   -0.930                                                               
DESWLltd  -0.807                                                               
PCNARz     0.711                                                               
PCNARp     0.733                                                               
SYNNP     -0.627                                                               
WRDNOUN   -0.584                                                               
WRDPRO     0.610                                                               
WRDFRQc    0.684                                                               
WRDHYPnv  -0.544                                                               
RDFRE      0.850             


Call:
lm(formula = unlist(y) ~ pca$scores + as.matrix(truth_labels))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5553 -0.8617 -0.0534  0.8368  3.8314 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.5098024  0.1269032  27.657  < 2e-16 ***
pca$scoresRC1           -0.1161357  0.0736047  -1.578 0.115635    
pca$scoresRC2            0.1488042  0.0727626   2.045 0.041701 *  
pca$scoresRC6           -0.0545804  0.0722860  -0.755 0.450790    
pca$scoresRC5           -0.0320807  0.0726172  -0.442 0.658961    
pca$scoresRC3            0.1019211  0.0724029   1.408 0.160234    
pca$scoresRC9           -0.0312406  0.0726735  -0.430 0.667587    
pca$scoresRC7            0.1534727  0.0737541   2.081 0.038274 *  
pca$scoresRC20           0.0222548  0.0726466   0.306 0.759551    
pca$scoresRC4            0.1022216  0.0721810   1.416 0.157735    
pca$scoresRC13          -0.0734550  0.0722895  -1.016 0.310371    
pca$scoresRC24   

In [9]:
print(pca$scores)

[1] 2.476695


In [18]:
evs <- eigen(cor(x))$vectors

In [19]:
evs

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0.04350651,0.046176013,-0.028953063,-0.064281261,0.0378417399,0.1520696611,-0.3211117204,-0.21493276,0.1180013374,0.074751228,⋯,-0.0145195846,-4.915440e-04,2.915419e-05,-4.284430e-05,1.579866e-04,-2.808224e-04,-1.721763e-04,-1.751543e-04,5.122788e-05,7.370679e-06
0.01878262,0.052900073,-0.005638821,-0.101783261,0.0412507598,0.1301491429,-0.3090637709,-0.23619891,0.1025076879,0.074225200,⋯,0.0103077037,4.997326e-04,-5.457538e-05,-7.020169e-05,-1.796021e-04,2.723791e-04,1.097501e-04,1.405600e-04,-6.447949e-05,-5.660938e-06
-0.10725400,0.081662983,0.195883156,-0.146135630,-0.0134040994,-0.0902300617,0.0570672760,-0.07541021,-0.0361451437,-0.025575309,⋯,0.5826939467,-1.160305e-01,1.930484e-03,2.281391e-02,2.000967e-02,-4.061421e-02,1.051989e-01,-5.488290e-02,-3.207262e-02,-1.774780e-03
-0.02326694,0.001224711,0.022188641,-0.121288318,0.0801227567,-0.1110175963,0.0891119306,-0.21817442,0.0400516006,-0.143074157,⋯,0.0043118612,-4.232895e-05,7.917509e-06,2.896505e-04,-2.084710e-05,-1.775242e-05,-3.889376e-06,5.254006e-06,4.508866e-05,-2.115600e-06
-0.17170608,-0.080384703,-0.068089301,-0.113534872,-0.0073939556,-0.0641474527,-0.0715665992,0.03325375,0.0652759790,0.023065144,⋯,0.4356879500,2.890330e-02,-1.532181e-03,1.222733e-03,4.911981e-02,3.710294e-02,2.052102e-02,1.003412e-02,4.596866e-02,-3.873506e-01
-0.12709217,-0.068834961,-0.046091410,-0.134772994,0.0003211871,-0.0988439643,-0.1065409442,0.05374511,0.0816648304,0.014451259,⋯,0.0043106534,2.835313e-04,-2.719784e-04,-1.182540e-04,4.187710e-05,-2.323410e-05,1.310592e-04,7.798049e-05,-2.959228e-05,-7.123578e-06
-0.17832777,-0.098342070,-0.060856839,-0.092168493,-0.0148154281,-0.0268776912,-0.0532041947,0.04758104,0.0638474693,0.032843174,⋯,0.0012194645,-4.514289e-04,-4.548747e-04,8.433887e-05,2.678366e-05,6.737848e-05,1.366713e-04,1.085407e-04,-1.348460e-04,1.211729e-05
-0.13141905,-0.031880399,-0.030072232,-0.115178071,0.0020082733,-0.0647387602,-0.1436876030,0.06890096,0.0758449352,0.067847142,⋯,-0.0039349720,3.497755e-06,2.736811e-04,1.077954e-04,-6.545338e-05,-3.899128e-05,-1.420022e-04,-8.582659e-05,3.571191e-05,-1.911396e-06
0.18250116,0.157706242,0.016756176,-0.030323135,0.0595330560,-0.0257338723,0.0772072003,0.01148285,0.0822756241,-0.040627745,⋯,-0.0094686222,2.702245e-01,-1.531023e-01,-2.441322e-02,7.742248e-01,2.892252e-01,-1.439999e-01,-1.588522e-01,5.392525e-03,-2.796305e-03
0.18738155,0.145892717,0.020918393,0.002357877,0.0123878260,-0.0453705789,0.0676880684,0.01034755,0.0963745812,-0.065695524,⋯,0.0045229854,4.491615e-04,3.414906e-04,4.418612e-04,1.321720e-03,7.653250e-04,-5.319618e-04,-6.249822e-04,-1.720745e-04,-1.238916e-06


In [20]:
cor(x)

,DESSC,DESWC,DESSL,DESSLd,DESWLsy,DESWLsyd,DESWLlt,DESWLltd,PCNARz,PCNARp,⋯,WRDPOLc,WRDHYPn,WRDHYPv,WRDHYPnv,RDFRE,RDFKGL,RDL2,CREL,CRELWD,CRELSC
DESSC,1.000000000,0.96474070,-0.103034527,-0.009881129,-0.098081379,-0.05192200,-0.12108411,-0.009801110,0.141973302,0.11495677,⋯,-0.082256312,0.0116190262,-0.067659186,-0.17754861,0.12506739,-0.12572185,0.127807422,0.6892324690,0.020709515,-0.0161677943
DESWC,0.964740700,1.00000000,0.062668098,0.092847141,-0.020891650,0.01509979,-0.04905324,0.047307804,0.082372630,0.05185679,⋯,-0.116442270,-0.0041893742,-0.035861975,-0.13835673,-0.01462798,0.04086010,0.045988042,0.6305140526,0.012593842,0.0002061686
DESSL,-0.103034527,0.06266810,1.000000000,0.426430234,0.230644573,0.21741350,0.18498131,0.200153587,-0.135237461,-0.17834768,⋯,-0.052892120,-0.0940824076,0.153060304,0.04865057,-0.66001793,0.89539887,-0.333144511,0.0058346863,0.073043270,0.1752858609
DESSLd,-0.009881129,0.09284714,0.426430234,1.000000000,0.132106818,0.12257968,0.08319442,-0.034465469,0.001928156,-0.03541730,⋯,-0.155073815,-0.3482106394,0.073947569,-0.07380408,-0.31121635,0.40333109,-0.168461959,-0.0357388924,-0.094582852,-0.0577656117
DESWLsy,-0.098081379,-0.02089165,0.230644573,0.132106818,1.000000000,0.86191105,0.92659245,0.767196442,-0.683235539,-0.70752804,⋯,-0.429393876,-0.0590267819,0.302839079,0.52364180,-0.88309055,0.63917183,-0.477729572,-0.0850008762,-0.013532237,0.0157235167
DESWLsyd,-0.051921997,0.01509979,0.217413499,0.122579683,0.861911052,1.00000000,0.75814449,0.828083998,-0.487794566,-0.52025394,⋯,-0.321315981,-0.0323698448,0.202681506,0.34312928,-0.77088917,0.56687790,-0.371804432,-0.0388924787,0.029783756,0.0526283518
DESWLlt,-0.121084115,-0.04905324,0.184981305,0.083194422,0.926592447,0.75814449,1.00000000,0.734904068,-0.749788093,-0.76458861,⋯,-0.404386548,-0.0019045721,0.310547511,0.61375978,-0.80360007,0.56821466,-0.528695647,-0.0981084002,0.027441083,0.0573192754
DESWLltd,-0.009801110,0.04730780,0.200153587,-0.034465469,0.767196442,0.82808400,0.73490407,1.000000000,-0.445238469,-0.47155652,⋯,-0.355979330,0.0869124306,0.176804432,0.33716463,-0.68637305,0.50490352,-0.342179739,-0.0130413691,0.081299240,0.1058156521
PCNARz,0.141973302,0.08237263,-0.135237461,0.001928156,-0.683235539,-0.48779457,-0.74978809,-0.445238469,1.000000000,0.97192538,⋯,0.324082645,-0.0403726273,-0.198927068,-0.72735369,0.59075447,-0.41604281,0.596673828,0.1553682785,-0.013162639,-0.0418521722
PCNARp,0.114956773,0.05185679,-0.178347682,-0.035417300,-0.707528039,-0.52025394,-0.76458861,-0.471556519,0.971925385,1.00000000,⋯,0.329322594,0.0109610113,-0.239616584,-0.65287033,0.63079490,-0.46203614,0.627587040,0.1268050122,0.031400688,0.0037934303
