### 2. Adding Features

This notebook seeks to add relevant features for each (source, dest) pair.

Prediction Goal: Whether fake news goes from A to B?
The features include:
- Local centralities: deg, eigen, close, between in real and fake news network respectively (2x4x2=16 cols)
- In/Out degree: In and out degree for each website in real and fake network (2x2x2=8cols)
- Mutuality: If (A,B) exists, does (B,A) exist too in the *fake news network*? This is a binary column.
- Jaccard: For each (Source, Dest) pair, we calculate the set of common neighbors (intersection) in the fake network and the set of all neighbors (union) in the fake network.
- Number of common neighbors: common source, destination and general connections (source & dest) for each pair.

**We won't use the features in the original, un-aggregated dateset**

#### 2.1 local centralities

This notebook serves to add local centralities in both real and fake news networks to both source and destination websites, so we will know their position within both the real and fake networks.

The centralities are calculated using the R code.

In [1]:
%reset
%pylab inline
import pandas as pd, pyprind

Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Populating the interactive namespace from numpy and matplotlib


In [2]:
ls data

ClassificationModelInput.csv  pol_agg_new.csv
[31memergent.csv[m[m*                 [31mpolitifact.csv[m[m*
fake_localcentralities.csv    politifact_clean.csv
key_mutuality_roshan.csv      real_localcentralities.csv
keys.csv                      [31msnopes.csv[m[m*


In [3]:
d=pd.read_csv('data/pol_agg_new.csv')
d.rename(columns = {'website': 'Destination'},inplace=True)
#d.columns=['Unnamed: 0', 'Source', 'Destination', 'page_url', 'TRUE.', 'FALSE.', ]
d.shape
d.columns

Index(['Source', 'Destination', 'TRUE', 'History', 'Saturday', 'Sunday',
       'Religion', 'Health Care', 'page_url', 'Elections', 'Monday',
       'Thursday', 'Military', 'Tuesday', 'Wednesday', 'Friday', 'FALSE',
       'MedianPolarityScore'],
      dtype='object')

In [4]:
def add_centralities(file,identity):
    cen=pd.read_csv(file)
    cen.columns=['Website','LocalDegreeCentralities',
                 'LocalBetweenness','LocalCloseness','LocalEigenCentralities']
    cen.columns=[identity+'_'+i for i in cen.columns]
    
    for col in cen.columns[1:]:
        for website in ['Source','Destination']:
            mapper=dict(zip(d[website],cen[col]))
            d['Source'+'_'+str(col)]=d['Source'].map(mapper)
            d['Destination'+'_'+str(col)]=d['Source'].map(mapper)

In [5]:
add_centralities('data/real_localcentralities.csv','Real')
add_centralities('data/fake_localcentralities.csv','Fake')

In [6]:
d.tail()

Unnamed: 0,Source,Destination,TRUE,History,Saturday,Sunday,Religion,Health Care,page_url,Elections,...,Source_Real_LocalEigenCentralities,Destination_Real_LocalEigenCentralities,Source_Fake_LocalDegreeCentralities,Destination_Fake_LocalDegreeCentralities,Source_Fake_LocalBetweenness,Destination_Fake_LocalBetweenness,Source_Fake_LocalCloseness,Destination_Fake_LocalCloseness,Source_Fake_LocalEigenCentralities,Destination_Fake_LocalEigenCentralities
1281,success-street.com,blog.chron.com,0,0,0,0,0,0,1,0,...,,,,,,,,,,
1282,static.politifact.com.s3.amazonaws.com,www.barrypopik.com,0,0,0,0,0,0,1,0,...,,,1.0,1.0,0.0,0.0,0.029037,0.029037,1.0,1.0
1283,static.politifact.com.s3.amazonaws.com,wafflesatnoon.com,0,0,0,0,0,0,1,0,...,,,1.0,1.0,0.0,0.0,0.029037,0.029037,1.0,1.0
1284,static.politifact.com.s3.amazonaws.com,query.nytimes.com,0,0,0,0,0,0,1,0,...,,,1.0,1.0,0.0,0.0,0.029037,0.029037,1.0,1.0
1285,webcache.googleusercontent.com,www.newenglishreview.org,0,0,0,0,0,0,1,0,...,,,1.0,1.0,0.0,0.0,0.029845,0.029845,1.0,1.0


In [7]:
d.to_csv('data/keys.csv', index=False)

#### 2.2 Add jaccard dist and common neighbors

This notebook serves to add local centralities in both real and fake news networks to both source and destination websites, so we will know their position within both the real and fake networks.

The centralities are calculated using the R code.

In [8]:
d=pd.read_csv('data/keys.csv',index_col=False)

#Don't want self-referring connections
d=d[d['Source']!=d['Destination']]

In [9]:
def association(method):
    jaccard_dict={}
    faked=d[d['FALSE']>0]
    #all unique websites
    allwebs=set(faked['Source'].append(faked['Destination']))
    bar = pyprind.ProgBar(len(allwebs))
    for site in allwebs:
        if method=='connection':
            sites=set(faked[faked['Source']==site]['Destination'].append(d[d['Destination']==site]['Source']))
        elif method=='common_destination':
            sites=set(faked[faked['Source']==site]['Destination'])
        elif method=='common_source':
            sites=set(faked[faked['Destination']==site]['Source'])
        jaccard_dict[site]=sites
        bar.update()
    return(jaccard_dict)

In [10]:
for i in ['connection','common_destination','common_source']:
    mapper=association(i)
    
    def jaccard(pair):
        try:
            numerator=len(mapper[pair[0]].intersection(mapper[pair[1]]))
            denom=len(mapper[pair[0]].union(mapper[pair[1]]))
            return numerator/denom
        except:
            return 0
        
    def neighbor(pair):
        try:
            common_neighbor=len(mapper[pair[0]].intersection(mapper[pair[1]]))
            return common_neighbor
        except:
            return 0
        
    d['jaccard_coeff'+'_'+i]=[jaccard(pair) for pair in list(zip(d['Source'],d['Destination']))]
    d['Neighbor'+'_'+i]=[neighbor(pair) for pair in list(zip(d['Source'],d['Destination']))]

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:03
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:01
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:01


In [11]:
m=['Source','Destination']
m.extend([i for i in d.columns if 'Neighbor_' in i])

d[m].head()

Unnamed: 0,Source,Destination,Neighbor_connection,Neighbor_common_destination,Neighbor_common_source
0,www.facebook.com,www.politifact.com,11,3,3
1,nationalreport.net,www.whitehouse.gov,1,0,0
2,www.naturalnews.com,www.cdc.gov,0,0,0
3,www.facebook.com,www.snopes.com,5,1,3
4,www.infowars.com,www.cdc.gov,0,0,0


In [12]:
d[m].head()

Unnamed: 0,Source,Destination,Neighbor_connection,Neighbor_common_destination,Neighbor_common_source
0,www.facebook.com,www.politifact.com,11,3,3
1,nationalreport.net,www.whitehouse.gov,1,0,0
2,www.naturalnews.com,www.cdc.gov,0,0,0
3,www.facebook.com,www.snopes.com,5,1,3
4,www.infowars.com,www.cdc.gov,0,0,0


In [13]:
d[[i for i in d.columns if 'jaccard' in i]].head()

Unnamed: 0,jaccard_coeff_connection,jaccard_coeff_common_destination,jaccard_coeff_common_source
0,0.053659,0.017442,0.073171
1,0.04,0.0,0.0
2,0.0,0.0,0.0
3,0.028902,0.006711,0.115385
4,0.0,0.0,0.0


In [14]:
#Out degree in false & true network for every website
Out_deg_false=dict(d.groupby('Source')['FALSE'].sum())
Out_deg_true=dict(d.groupby('Source')['TRUE'].sum())

#In degree in false & true network for every website
in_deg_false=dict(d.groupby('Destination')['FALSE'].sum())
in_deg_true=dict(d.groupby('Destination')['TRUE'].sum())

In [15]:
#Outdegrees
d['SourceSite_outdeg_real']=d['Source'].map(Out_deg_true)
d['DestSite_outdeg_real']=d['Destination'].map(Out_deg_true)
d['SourceSite_outdeg_fake']=d['Source'].map(Out_deg_false)
d['DestSite_outdeg_fake']=d['Destination'].map(Out_deg_false)

#Indegrees
d['SourceSite_indeg_fake']=d['Source'].map(in_deg_false)
d['DestSite_indeg_fake']=d['Destination'].map(in_deg_false)
d['DestSite_indeg_real']=d['Destination'].map(in_deg_true)
d['SourceSite_indeg_real']=d['Source'].map(in_deg_true)

In [16]:
d.head()

Unnamed: 0,Source,Destination,TRUE,History,Saturday,Sunday,Religion,Health Care,page_url,Elections,...,jaccard_coeff_common_source,Neighbor_common_source,SourceSite_outdeg_real,DestSite_outdeg_real,SourceSite_outdeg_fake,DestSite_outdeg_fake,SourceSite_indeg_fake,DestSite_indeg_fake,DestSite_indeg_real,SourceSite_indeg_real
0,www.facebook.com,www.politifact.com,0,0,0,0,0,0,22,0,...,0.073171,3,30,0.0,225,35.0,10.0,75,7,1.0
1,nationalreport.net,www.whitehouse.gov,0,0,0,0,0,0,14,0,...,0.0,0,0,,25,,4.0,23,5,0.0
2,www.naturalnews.com,www.cdc.gov,0,0,0,0,0,0,8,0,...,0.0,0,0,,17,,,18,0,
3,www.facebook.com,www.snopes.com,1,0,0,1,0,0,8,0,...,0.115385,3,30,0.0,225,6.0,10.0,33,1,1.0
4,www.infowars.com,www.cdc.gov,0,0,0,0,0,0,6,0,...,0.0,0,0,,21,,3.0,18,0,0.0


#### 2.3 Mutuality (Credits to Roshan)

In [17]:
ls data

ClassificationModelInput.csv  pol_agg_new.csv
[31memergent.csv[m[m*                 [31mpolitifact.csv[m[m*
fake_localcentralities.csv    politifact_clean.csv
key_mutuality_roshan.csv      real_localcentralities.csv
keys.csv                      [31msnopes.csv[m[m*


In [18]:
mutual=pd.read_csv('data/key_mutuality_roshan.csv')[['Source','Destination','mutuality_ind']]

In [19]:
d=pd.merge(d,mutual, how='left', on=['Source', 'Destination'])
d['mutuality_ind']=d['mutuality_ind'].fillna(0)

In [20]:
d.head()

Unnamed: 0,Source,Destination,TRUE,History,Saturday,Sunday,Religion,Health Care,page_url,Elections,...,Neighbor_common_source,SourceSite_outdeg_real,DestSite_outdeg_real,SourceSite_outdeg_fake,DestSite_outdeg_fake,SourceSite_indeg_fake,DestSite_indeg_fake,DestSite_indeg_real,SourceSite_indeg_real,mutuality_ind
0,www.facebook.com,www.politifact.com,0,0,0,0,0,0,22,0,...,3,30,0.0,225,35.0,10.0,75,7,1.0,1.0
1,nationalreport.net,www.whitehouse.gov,0,0,0,0,0,0,14,0,...,0,0,,25,,4.0,23,5,0.0,0.0
2,www.naturalnews.com,www.cdc.gov,0,0,0,0,0,0,8,0,...,0,0,,17,,,18,0,,0.0
3,www.facebook.com,www.snopes.com,1,0,0,1,0,0,8,0,...,3,30,0.0,225,6.0,10.0,33,1,1.0,0.0
4,www.infowars.com,www.cdc.gov,0,0,0,0,0,0,6,0,...,0,0,,21,,3.0,18,0,0.0,0.0


In [21]:
d.columns

Index(['Source', 'Destination', 'TRUE', 'History', 'Saturday', 'Sunday',
       'Religion', 'Health Care', 'page_url', 'Elections', 'Monday',
       'Thursday', 'Military', 'Tuesday', 'Wednesday', 'Friday', 'FALSE',
       'MedianPolarityScore', 'Source_Real_LocalDegreeCentralities',
       'Destination_Real_LocalDegreeCentralities',
       'Source_Real_LocalBetweenness', 'Destination_Real_LocalBetweenness',
       'Source_Real_LocalCloseness', 'Destination_Real_LocalCloseness',
       'Source_Real_LocalEigenCentralities',
       'Destination_Real_LocalEigenCentralities',
       'Source_Fake_LocalDegreeCentralities',
       'Destination_Fake_LocalDegreeCentralities',
       'Source_Fake_LocalBetweenness', 'Destination_Fake_LocalBetweenness',
       'Source_Fake_LocalCloseness', 'Destination_Fake_LocalCloseness',
       'Source_Fake_LocalEigenCentralities',
       'Destination_Fake_LocalEigenCentralities', 'jaccard_coeff_connection',
       'Neighbor_connection', 'jaccard_coeff_common_d

In [22]:
usecol=['Source', 'Destination', 'Monday', 'Elections',
       'Military', 'Thursday', 'Religion', 'Friday', 'Saturday', 'TRUE',
       'History', 'Tuesday', 'Wednesday', 'Sunday', 'page_url', 'Health Care',
       'FALSE', 'jaccard_coeff_connection', 'Neighbor_connection',
       'jaccard_coeff_common_destination', 'Neighbor_common_destination',
       'jaccard_coeff_common_source', 'Neighbor_common_source',
       'SourceSite_outdeg_real', 'DestSite_outdeg_real',
       'SourceSite_outdeg_fake', 'DestSite_outdeg_fake',
       'SourceSite_indeg_fake', 'DestSite_indeg_fake', 'DestSite_indeg_real',
       'SourceSite_indeg_real', 'mutuality_ind','MedianPolarityScore']
d=d[usecol]

In [23]:
d.head()
d = d.fillna(0)

In [24]:
d=pd.read_csv('data/ClassificationModelInput.csv')
d['Label']=d['FALSE'].apply(lambda x: int(x<1))

In [25]:
d.head()

Unnamed: 0,Source,Destination,Monday,Elections,Military,Thursday,Religion,Friday,Saturday,TRUE,...,DestSite_outdeg_real,SourceSite_outdeg_fake,DestSite_outdeg_fake,SourceSite_indeg_fake,DestSite_indeg_fake,DestSite_indeg_real,SourceSite_indeg_real,mutuality_ind,MedianPolarityScore,Label
0,www.facebook.com,www.politifact.com,6,0,0,4,0,1,0,0,...,0.0,225,35.0,10.0,75,7,1.0,1.0,0.0,0
1,nationalreport.net,www.whitehouse.gov,0,0,0,14,0,0,0,0,...,0.0,25,0.0,4.0,23,5,0.0,0.0,0.5106,0
2,www.naturalnews.com,www.cdc.gov,0,0,0,0,0,0,0,0,...,0.0,17,0.0,0.0,18,0,0.0,0.0,-0.8834,0
3,www.facebook.com,www.snopes.com,0,0,0,0,0,2,0,1,...,0.0,225,6.0,10.0,33,1,1.0,0.0,0.0,0
4,www.infowars.com,www.cdc.gov,0,0,0,0,0,0,0,0,...,0.0,21,0.0,3.0,18,0,0.0,0.0,-0.6808,0


In [26]:
d.to_csv('data/ClassificationModelInput.csv',index=False)