### 2. Adding Features

This notebook seeks to add relevant features for each (source, dest) pair.

Prediction Goal: Whether fake news goes from A to B?
The features include:
- Local centralities: deg, eigen, close, between in real and fake news network respectively (2x4x2=16 cols)
- In/Out degree: In and out degree for each website in real and fake network (2x2x2=8cols)
- Mutuality: If (A,B) exists, does (B,A) exist too in the *fake news network*? This is a binary column.
- Jaccard: For each (Source, Dest) pair, we calculate the set of common neighbors (intersection) in the fake network and the set of all neighbors (union) in the fake network.
- Number of common neighbors: common source, destination and general connections (source & dest) for each pair.

**We won't use the features in the original, un-aggregated dateset**

#### 2.1 local centralities

This notebook serves to add local centralities in both real and fake news networks to both source and destination websites, so we will know their position within both the real and fake networks.

The centralities are calculated using the R code.

In [1]:
%pylab inline
import pandas as pd, pyprind

Populating the interactive namespace from numpy and matplotlib


In [2]:
ls data

[31memergent.csv[m[m*               [31mpolitifact.csv[m[m*
fake_localcentralities.csv  politifact_clean.csv
keys.csv                    real_localcentralities.csv
pol_agg.csv                 [31msnopes.csv[m[m*


In [25]:
d=pd.read_csv('data/keys.csv')
d.columns=['Unnamed: 0', 'Source', 'Destination', 'page_url', 'TRUE.', 'FALSE.']

In [26]:
def add_centralities(file,identity):
    cen=pd.read_csv(file)
    cen.columns=['Website','LocalDegreeCentralities','LocalBetweenness','LocalCloseness','LocalEigenCentralities']
    cen.columns=[identity+'_'+i for i in cen.columns]
    
    for col in cen.columns[1:]:
        for website in ['Source','Destination']:
            mapper=dict(zip(d[website],cen[col]))
            d['Source'+'_'+str(col)]=d['Source'].map(mapper)
            d['Destination'+'_'+str(col)]=d['Source'].map(mapper)

In [27]:
add_centralities('data/real_localcentralities.csv','Real')
add_centralities('data/fake_localcentralities.csv','Fake')

In [29]:
d.tail()

Unnamed: 0.1,Unnamed: 0,Source,Destination,page_url,TRUE.,FALSE.,Source_Real_LocalDegreeCentralities,Destination_Real_LocalDegreeCentralities,Source_Real_LocalBetweenness,Destination_Real_LocalBetweenness,...,Source_Real_LocalEigenCentralities,Destination_Real_LocalEigenCentralities,Source_Fake_LocalDegreeCentralities,Destination_Fake_LocalDegreeCentralities,Source_Fake_LocalBetweenness,Destination_Fake_LocalBetweenness,Source_Fake_LocalCloseness,Destination_Fake_LocalCloseness,Source_Fake_LocalEigenCentralities,Destination_Fake_LocalEigenCentralities
2895,2896,www.westernjournalism.com,fas.org,0,0,0,,,,,...,,,1.0,1.0,0.0,0.0,0.028429,0.028429,1.0,1.0
2896,2897,www.pbs.org,freepatriot.org,0,0,0,,,,,...,,,1.0,1.0,0.0,0.0,0.029065,0.029065,1.0,1.0
2897,2898,blogs.browardpalmbeach.com,cbo.gov,0,0,0,,,,,...,,,,,,,,,,
2898,2899,abcnews.com.co,conservativebyte.com,0,0,0,,,,,...,,,,,,,,,,
2899,2900,newsdaily12.com,dailyleak.org,0,0,0,4.666667,4.666667,0.0,0.0,...,4.666667,4.666667,6.0,6.0,0.0,0.0,0.029012,0.029012,6.0,6.0


In [30]:
d.to_csv('data/keys.csv')

#### 2.2 Add jaccard dist and common neighbors

This notebook serves to add local centralities in both real and fake news networks to both source and destination websites, so we will know their position within both the real and fake networks.

The centralities are calculated using the R code.

In [49]:
d=pd.read_csv('data/keys.csv',index_col=False)

#Don't want self-referring connections
d=d[d['Source']!=d['Destination']]

In [24]:
def association(method):
    jaccard_dict={}
    faked=d[d['FALSE.']>0]
    #all unique websites
    allwebs=set(faked['Source'].append(faked['Destination']))
    bar = pyprind.ProgBar(len(allwebs))
    for site in allwebs:
        if method=='connection':
            sites=set(faked[faked['Source']==site]['Destination'].append(d[d['Destination']==site]['Source']))
        elif method=='common_destination':
            sites=set(faked[faked['Source']==site]['Destination'])
        elif method=='common_source':
            sites=set(faked[faked['Destination']==site]['Source'])
        jaccard_dict[site]=sites
        bar.update()
    return(jaccard_dict)

In [50]:
for i in ['connection','common_destination','common_source']:
    mapper=association(i)
    
    def jaccard(pair):
        try:
            numerator=len(mapper[pair[0]].intersection(mapper[pair[1]]))
            denom=len(mapper[pair[0]].union(mapper[pair[1]]))
            return numerator/denom
        except:
            return 0
        
    def neighbor(pair):
        try:
            common_neighbor=len(mapper[pair[0]].intersection(mapper[pair[1]]))
            return common_neighbor
        except:
            return 0
        
    d['jaccard_coeff'+'_'+i]=[jaccard(pair) for pair in list(zip(d['Source'],d['Destination']))]
    d['Neighbor'+'_'+i]=[neighbor(pair) for pair in list(zip(d['Source'],d['Destination']))]

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:06
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:02
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:02


In [51]:
m=['Source','Destination']
m.extend([i for i in d.columns if 'Neighbor_' in i])

d[m].head()

Unnamed: 0,Source,Destination,Neighbor_connection,Neighbor_common_destination,Neighbor_common_source
1,www.facebook.com,www.politifact.com,11,3,3
2,nationalreport.net,www.whitehouse.gov,1,0,0
6,www.naturalnews.com,www.cdc.gov,0,0,0
7,www.facebook.com,www.snopes.com,5,1,3
9,www.infowars.com,www.cdc.gov,0,0,0


In [47]:
d[m].head()

Unnamed: 0,Neighbor_connection,Neighbor_common_destination,Neighbor_common_source,Source,Destination
1,11,3,3,www.facebook.com,www.politifact.com
2,1,0,0,nationalreport.net,www.whitehouse.gov
6,0,0,0,www.naturalnews.com,www.cdc.gov
7,5,1,3,www.facebook.com,www.snopes.com
9,0,0,0,www.infowars.com,www.cdc.gov


In [22]:
d[[i for i in d.columns if 'jaccard' in i]].head()

Unnamed: 0,jaccard_coeff_connection,jaccard_coeff_common_destination,jaccard_coeff_common_source
0,1.0,1.0,1.0
1,0.063415,0.02907,0.121951
2,0.08,0.0,0.076923
3,1.0,1.0,1.0
4,1.0,1.0,1.0


In [79]:
#Out degree in false & true network for every website
Out_deg_false=dict(d.groupby('Source')['FALSE.'].sum())
Out_deg_true=dict(d.groupby('Source')['TRUE.'].sum())

#In degree in false & true network for every website
in_deg_false=dict(d.groupby('Destination')['FALSE.'].sum())
in_deg_true=dict(d.groupby('Destination')['TRUE.'].sum())

In [83]:
#Outdegrees
d['SourceSite_outdeg_real']=d['Source'].map(out_deg_true)
d['DestSite_outdeg_real']=d['Destination'].map(out_deg_true)
d['SourceSite_outdeg_fake']=d['Source'].map(out_deg_false)
d['DestSite_outdeg_fake']=d['Destination'].map(out_deg_false)

#Indegrees
d['SourceSite_indeg_fake']=d['Source'].map(in_deg_false)
d['DestSite_indeg_fake']=d['Destination'].map(in_deg_false)
d['DestSite_indeg_real']=d['Destination'].map(in_deg_true)
d['SourceSite_indeg_real']=d['Source'].map(in_deg_true)

In [84]:
d.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Source,Destination,page_url,TRUE.,FALSE.,Source_Real_LocalDegreeCentralities,Destination_Real_LocalDegreeCentralities,Source_Real_LocalBetweenness,...,Destination_Fake_LocalEigenCentralities,jaccard_coeff,SourceSite_outdeg_real,DestSite_outdeg_real,SourceSite_indeg_fake,DestSite_indeg_real,SourceSite_outdeg_fake,DestSite_outdeg_fake,DestSite_indeg_fake,SourceSite_indeg_real
0,0,1,www.facebook.com,www.facebook.com,42,5,37,143.333333,143.333333,0.007485,...,1.0,1.0,6,6,47,6,47,47,47,6
1,1,2,www.facebook.com,www.politifact.com,22,0,22,143.333333,143.333333,0.007485,...,1.0,0.053942,6,7,47,7,47,83,83,6
2,2,3,nationalreport.net,www.whitehouse.gov,14,0,14,,,,...,1.0,0.057143,0,5,6,5,6,23,23,0
3,3,4,www.youtube.com,www.youtube.com,9,2,7,4.666667,4.666667,0.0,...,1.0,1.0,4,4,30,4,30,30,30,4
4,4,5,www.politifact.com,www.politifact.com,8,0,8,4.666667,4.666667,0.0,...,1.0,1.0,7,7,83,7,83,83,83,7


#### 2.3 Mutuality (Credits to Roshan)

In [52]:
ls data

[31memergent.csv[m[m*               [31mpolitifact.csv[m[m*
fake_localcentralities.csv  politifact_clean.csv
key_mutuality_roshan.csv    real_localcentralities.csv
keys.csv                    [31msnopes.csv[m[m*
pol_agg.csv


In [68]:
mutual=pd.read_csv('data/key_mutuality_roshan.csv')[['Source','Destination','mutuality_ind']]

In [69]:
d=pd.merge(d,mutual, how='left', on=['Source', 'Destination'])
d['mutuality_ind']=d['mutuality_ind'].fillna(0)

In [70]:
d.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Source,Destination,page_url,TRUE.,FALSE.,Source_Real_LocalDegreeCentralities,Destination_Real_LocalDegreeCentralities,Source_Real_LocalBetweenness,...,Destination_Fake_LocalEigenCentralities,jaccard_coeff_connection,Neighbor_connection,jaccard_coeff_common_destination,Neighbor_common_destination,jaccard_coeff_common_source,Neighbor_common_source,mutuality_ind_x,mutuality_ind_y,mutuality_ind
0,1,2,www.facebook.com,www.politifact.com,22,0,22,143.333333,143.333333,0.007485,...,1.0,0.053659,11,0.017442,3,0.073171,3,1.0,1.0,1.0
1,2,3,nationalreport.net,www.whitehouse.gov,14,0,14,,,,...,1.0,0.04,1,0.0,0,0.0,0,0.0,0.0,0.0
2,6,7,www.naturalnews.com,www.cdc.gov,8,0,8,,,,...,1.0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0
3,7,8,www.facebook.com,www.snopes.com,8,1,7,143.333333,143.333333,0.007485,...,1.0,0.028902,5,0.006711,1,0.115385,3,0.0,0.0,0.0
4,9,10,www.infowars.com,www.cdc.gov,6,0,6,4.666667,4.666667,0.0,...,1.0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0


In [72]:
d.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Source', 'Destination', 'page_url',
       'TRUE.', 'FALSE.', 'Source_Real_LocalDegreeCentralities',
       'Destination_Real_LocalDegreeCentralities',
       'Source_Real_LocalBetweenness', 'Destination_Real_LocalBetweenness',
       'Source_Real_LocalCloseness', 'Destination_Real_LocalCloseness',
       'Source_Real_LocalEigenCentralities',
       'Destination_Real_LocalEigenCentralities',
       'Source_Fake_LocalDegreeCentralities',
       'Destination_Fake_LocalDegreeCentralities',
       'Source_Fake_LocalBetweenness', 'Destination_Fake_LocalBetweenness',
       'Source_Fake_LocalCloseness', 'Destination_Fake_LocalCloseness',
       'Source_Fake_LocalEigenCentralities',
       'Destination_Fake_LocalEigenCentralities', 'jaccard_coeff_connection',
       'Neighbor_connection', 'jaccard_coeff_common_destination',
       'Neighbor_common_destination', 'jaccard_coeff_common_source',
       'Neighbor_common_source', 'mutuality_ind_x', 'mutuality_i

In [73]:
usecol=['Source','Destination','page_url','TRUE.','FALSE.',
       'Source_Real_LocalDegreeCentralities',
       'Destination_Real_LocalDegreeCentralities',
       'Source_Real_LocalBetweenness', 'Destination_Real_LocalBetweenness',
       'Source_Real_LocalCloseness', 'Destination_Real_LocalCloseness',
       'Source_Real_LocalEigenCentralities',
       'Destination_Real_LocalEigenCentralities',
       'Source_Fake_LocalDegreeCentralities',
       'Destination_Fake_LocalDegreeCentralities',
       'Source_Fake_LocalBetweenness', 'Destination_Fake_LocalBetweenness',
       'Source_Fake_LocalCloseness', 'Destination_Fake_LocalCloseness',
       'Source_Fake_LocalEigenCentralities',
       'Destination_Fake_LocalEigenCentralities', 'jaccard_coeff_connection',
       'Neighbor_connection', 'jaccard_coeff_common_destination',
       'Neighbor_common_destination', 'jaccard_coeff_common_source',
       'Neighbor_common_source','mutuality_ind']
d=d[usecol]

In [75]:
d.to_csv('data/ClassificationModelInput.csv',index=False)