# DATASETS
This notebook describes the datasets used in the study. They consist of ground-truth data collected for the study and a collection of data from public repositories.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import json, time, re, os
from mlxtend.plotting import ecdf

## MCT Datasets

    This is the data collected for the study consisting of collections of ground-truth and predicted ties.

In [18]:
# MCT data:
#consist of collections of ground-truth (dyadic and transitive datasets) and predicted ties for community detection.

**Dyadic data** consist of a collection of nodes with binary reciprocal ties. Details about how to obtain the data are available at https://github.com/ijdutse/dyads_in_Twitter
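
As a rough illustration of what a binary reciprocal tie means, the sketch below extracts dyads from a list of directed follow edges. The function name `reciprocal_dyads` and the toy edge list are assumptions made for illustration only; this is not the collection procedure used in the linked repository.

```python
def reciprocal_dyads(directed_edges):
    """Return unordered pairs (u, v) for which both u->v and v->u are present."""
    edges = set(directed_edges)
    return {
        tuple(sorted((u, v)))
        for (u, v) in edges
        if u != v and (v, u) in edges
    }

# hypothetical follow edges, written as (follower, followed)
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]
dyads_found = reciprocal_dyads(follows)
print(dyads_found)
```

Only the pairs that follow each other in both directions survive; the one-way edge from `a` to `c` is dropped.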

In [2]:
# with duplicates: dyads1 = pd.read_csv('data/mctdata/dyads_and_network_size_of_verified_and_unverifed.csv')
# without duplicates:
dyads = pd.read_csv('data/mctdata/dyads/unique_dyads_and_network_size_of_verified_and_unverifed.csv')
len(dyads)

3981

In [3]:
dyads.head()

Unnamed: 0,NetworkSize,Dyads,Category
0,489,158,Unverified
0,370,142,Unverified
0,333,163,Unverified
0,5719,4211,Unverified
0,1328,542,Unverified


In [4]:
#df1 = pd.read_csv('data/second_generation_pairwise_unverified_anchors.csv')
s2udyads = pd.read_csv('data/mctdata/dyads/second_generation_dyads_unverified_anchors_and_buddies.csv')
s2vdyads = pd.read_csv('data/mctdata/dyads/second_generation_dyads_verified_anchors_and_buddies.csv')

In [5]:
s2udyads.head()

Unnamed: 0,G2_Anchor,ChildID,ChildScreenName,Lineage,ChildFriends,ChildFollowers,PairwiseSize,Tracker
0,ArmaniStation,3265484348,mxslimxh,ArmaniStation:mxslimxh,67,3852,158,"(158, 331)"
1,ArmaniStation,228168148,d1vyaaa,ArmaniStation:d1vyaaa,690,4485,158,"(158, 331)"
2,ArmaniStation,241925227,TWlSTEDFANTASY,ArmaniStation:TWlSTEDFANTASY,371,4524,158,"(158, 331)"
0,shpirtinezi,905423614824067073,hhavink62,shpirtinezi:hhavink62,34,13,142,"(142, 228)"
1,shpirtinezi,1661469139,Kanackeforyou,shpirtinezi:Kanackeforyou,83,107,142,"(142, 228)"


In [6]:
s2vdyads.head()

Unnamed: 0,G2_Anchor,ChildID,ChildScreenName,Lineage,ChildFriends,ChildFollowers,PairwiseSize,Tracker
0,vmochama,423789970,imaaniwaalker,vmochama:imaaniwaalker,294,268,35,"(35, 1549)"
1,vmochama,16711211,alexgpaterson,vmochama:alexgpaterson,996,920,35,"(35, 1549)"
2,vmochama,839456750000558080,ShelleyBFarmer,vmochama:ShelleyBFarmer,1138,1402,35,"(35, 1549)"
3,vmochama,21677048,amydempsey,vmochama:amydempsey,1262,3751,35,"(35, 1549)"
0,GCLongform,45952641,adamellsegal,GCLongform:adamellsegal,462,415,113,"(113, 463)"


In [7]:
# used for training dyads prediction model ... 
df = pd.read_csv('data/mctdata/dyads/unverified_dyads.csv')

In [8]:
df.head()

Unnamed: 0,Dyad,B1_ID,B1_ScreenName,B1_CreatedAt,B1_Followers,B1_Friends,B1_Statuses,B1_Description,B1_Location,B1_Favourite,...,B2_Followers,B2_Friends,B2_Statuses,B2_Description,B2_Location,B2_Favourite,B2_Verification,B2_Tweet,B2_AccountCreated,B2_RTCount
0,"('ArmaniStation', 'mxslimxh')",966710486,ArmaniStation,2012-11-23 19:38:18,4153,484,14942,No bio,The Netherlands/London,5338,...,4039,80,292,أهل السنة والجماعة,,12876,False,@YaqubAlRashid1 Yea I also read that Hadith! T...,2019-03-06 20:54:02,0.0
1,"('ArmaniStation', 'diyaldn')",966710486,ArmaniStation,2012-11-23 19:38:18,4153,484,14942,No bio,The Netherlands/London,5338,...,4446,681,51349,revert 🌱🕋,"South london, UK",9784,False,,,
2,"('ArmaniStation', 'TWlSTEDFANTASY')",966710486,ArmaniStation,2012-11-23 19:38:18,4153,484,14942,No bio,The Netherlands/London,5338,...,4512,368,249160,🇮🇷🇮🇹🇫🇷,LDN/PRS/MLN,141504,False,@DrunkyGa i love you too mon roi 💓,2019-03-06 21:56:12,0.0
3,"('ArmaniStation', 'raksterrr')",966710486,ArmaniStation,2012-11-23 19:38:18,4153,484,14942,No bio,The Netherlands/London,5338,...,1153,985,1009,6’3 💩,London / Leicester,3597,False,@MxM247_ Wash your mouth out on about Arsenal 🤢🤮,2019-03-06 22:21:00,0.0
0,"('shpirtinezi', 'hhavink62')",514799426,shpirtinezi,2012-03-04 21:43:59,2126,421,920,pray the fakes get exposed,prizren/23,74340,...,14,37,19,62🖤,"Dersim, Kurdistan",1594,False,RT @annaclairerusso: It gets better every time...,2019-03-02 23:04:57,307546.0


**Simmelian/transitive data** consist of a collection of nodes with transitive reciprocal ties. The *transitive nodes* are bidirectionally connected nodes that are used as anchors to retrieve an additional set of users with reciprocal ties, hence culminating in transitive relations.

In [11]:
# ... a collection of nodes with transitive reciprocal ties.
df = pd.read_csv('data/mctdata/simmelian/nodes_with_simmelian_ties.csv')
len(df); df.head()

Unnamed: 0,G2_Anchor,Lineage,BuddyID,BuddyScreenName,BuddyFriends,BuddyFollowers
0,vmochama,genna_buck:vmochama:alexgpaterson,16711211,alexgpaterson,993,920
1,vmochama,genna_buck:vmochama:ShelleyBFarmer,839456750000558080,ShelleyBFarmer,1118,1364
2,vmochama,genna_buck:vmochama:amydempsey,21677048,amydempsey,1263,3749
3,vmochama,genna_buck:vmochama:JakePyne,1041790149925851136,JakePyne,999,1591
4,vmochama,genna_buck:vmochama:TurnbullSarah,355132182,TurnbullSarah,1258,827


**Some examples:** In the following three dataframes, collections of related pairs result in transitive ties. Depending on which node is used to identify reciprocal ties, the nodes are organised into generations spanning the triplet $parent:child:grandchild$. In the forthcoming examples (and the previous ones), the values in the dataframes are connected to the first-generation anchors (G1_Anchor) and to the second-generation anchors (G2_Anchor) via the parent nodes.
    # ... collection of dyads from transitive set 
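
The generation triplets can be pictured with a small sketch. It assumes a dictionary that maps each user to the users it shares a reciprocal tie with; the dictionary, the helper name `lineages` and the placeholder user names are hypothetical and only mirror the `parent:child:grandchild` format of the `Lineage` column.

```python
def lineages(reciprocal, g1_anchor):
    """Build parent:child:grandchild strings starting from a first-generation anchor."""
    triplets = []
    for child in reciprocal.get(g1_anchor, []):
        for grandchild in reciprocal.get(child, []):
            # skip the trivial path that walks straight back to the anchor
            if grandchild != g1_anchor:
                triplets.append(f"{g1_anchor}:{child}:{grandchild}")
    return triplets

# hypothetical reciprocal-tie lists keyed by user
reciprocal = {
    "p": ["c1", "c2"],
    "c1": ["p", "g1", "g2"],
    "c2": ["p"],
}
triplets = lineages(reciprocal, "p")
print(triplets)
```

Here `c2` contributes no triplet because its only reciprocal tie leads back to the anchor.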

In [12]:
#Reciprocal ties in unverified users category:
df = pd.read_csv('data/mctdata/simmelian/reciprocal_ties_in_unverified_anchors.csv')
len(df); df.head()

Unnamed: 0,ScreenName,FollowersCount,FriendsCount,Lineage,Status,Class,NetworkSize,Indegree,Outdegree
369,GMan82721611,273,585,jackcade1381:GMan82721611,Child,1,858,0.318182,0.681818
152,jjpaldad,92,130,,Parent,1,222,0.414414,0.585586
546,bordersnbetween,502,1270,FredSeguinPhoto:bordersnbetween,Child,1,1772,0.283296,0.716704
1745,monica_remy,3353,4172,ModernBullWagyu:monica_remy,Child,1,7525,0.445581,0.554419
545,editions_g,4,5,FredSeguinPhoto:editions_g,Child,1,9,0.444444,0.555556
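
The displayed rows appear consistent with `NetworkSize = FollowersCount + FriendsCount`, `Indegree = FollowersCount / NetworkSize` and `Outdegree = FriendsCount / NetworkSize`. This relationship is inferred from the sample values above rather than stated in the notebook, so the sketch below should be read as an assumption; the helper name `degree_shares` is hypothetical.

```python
def degree_shares(followers, friends):
    """Return (network_size, indegree_share, outdegree_share) for one account."""
    network_size = followers + friends
    if network_size == 0:
        # an account with no ties has no meaningful shares
        return 0, 0.0, 0.0
    return network_size, followers / network_size, friends / network_size

# values taken from the first displayed row (273 followers, 585 friends)
size, indeg, outdeg = degree_shares(273, 585)
print(size, round(indeg, 6), round(outdeg, 6))
```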


In [13]:
# Reciprocal ties in verified users category:
df = pd.read_csv('data/mctdata/simmelian/reciprocal_ties_in_verified_anchors.csv')
len(df); df.head()

Unnamed: 0,G3_Anchor,ChildID,ChildScreenName,Lineage,ChildFriends,ChildFollowers,PairwiseSize,Tracker
0,amydempsey,119818938,ZosiaBielski,amydempsey:ZosiaBielski,911,2588,1,"(1, 999)"
0,DACrosbie,197458714,DanDelmar,DACrosbie:DanDelmar,860,4666,1,"(1, 999)"
0,NickVanPraet,119818938,ZosiaBielski,NickVanPraet:ZosiaBielski,911,2588,3,"(3, 997)"
1,NickVanPraet,26505168,PaulJournet,NickVanPraet:PaulJournet,2130,40710,3,"(3, 997)"
2,NickVanPraet,20423695,grassreporter,NickVanPraet:grassreporter,862,7802,3,"(3, 997)"


In [14]:
# set of directed nodes for training:
df = pd.read_csv('data/mctdata/simmelian/nodes_with_directed_ties.csv')
len(df); df.head()

Unnamed: 0,Dyad,B1_ID,B1_ScreenName,B1_CreatedAt,B1_Followers,B1_Friends,B1_Statuses,B1_Description,B1_Location,B1_Favourite,...,B2_Statuses,B2_Description,B2_Location,B2_Favourite,B2_Verification,B2_Tweet,B2_AccountCreated,B2_RTCount,Status,Class
0,"('S4JJ40', 'Ghufran23924906')",744012835311730689,S4JJ40,2016-06-18 03:44:17,983,433,1631,‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏\nرَبَّنَا آتِنَا فِي الدُّنْ...,Kunar ~ Pekhawar,2938,...,4,,,3,False,RT @INReunification: Holi at Quaid-i-Azam Univ...,2019-03-25 19:49:51,84.0,Directed,0
1,"('S4JJ40', 'NajmRT')",744012835311730689,S4JJ40,2016-06-18 03:44:17,983,433,1631,‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏\nرَبَّنَا آتِنَا فِي الدُّنْ...,Kunar ~ Pekhawar,2938,...,2733,#Peace #Respect #love . Engineer #UETi...,,1909,False,Some Lucky Cats in Baitullah Makkah 😻 https://...,2019-04-15 15:39:22,0.0,Directed,0
2,"('S4JJ40', 'Fatabbayanoo')",744012835311730689,S4JJ40,2016-06-18 03:44:17,983,433,1631,‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏\nرَبَّنَا آتِنَا فِي الدُّنْ...,Kunar ~ Pekhawar,2938,...,53,Afghan / Durrani / Parsi Pataan 🏳🏴Youtube Has ...,Graveyard Of SuperPowers,1,False,Fatabayanoo Productions Afghan Nasheed...,2019-04-16 09:46:23,0.0,Directed,0
3,"('S4JJ40', 'lk_bharmalani')",744012835311730689,S4JJ40,2016-06-18 03:44:17,983,433,1631,‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏\nرَبَّنَا آتِنَا فِي الدُّنْ...,Kunar ~ Pekhawar,2938,...,59,,India,201,False,Teri Mitti Female Version - Kesari - Arko feat...,2019-04-16 09:45:52,1.0,Directed,0
4,"('S4JJ40', 'baloghwaqt')",744012835311730689,S4JJ40,2016-06-18 03:44:17,983,433,1631,‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏‏\nرَبَّنَا آتِنَا فِي الدُّنْ...,Kunar ~ Pekhawar,2938,...,74,21,,568,False,RT @xymtaco: Islam is evolutionary superior an...,2019-04-16 01:28:51,8.0,Directed,0


**For more on the use of Simmelian ties, see:**
* Inuwa-Dutse I., Liptrott M., Korkontzelos Y. (2019) Simmelian ties on Twitter: empirical analysis and prediction. *The Sixth IEEE International Conference on Social Networks Analysis, Management and Security,* SNAMS-2019, Granada

In [17]:
#df = pd.read_csv('data/verified_and_unverified_connections_inference_data.csv')
#inference data:
#ndf = df[['NetworkSize','Dyads','OneEdge','Indegree','Outdegree']]#,'Category','Type']]
#df = df[['Indegree','Outdegree']]

# bijection from discrete to real number space ... 
#df['Reciprocated']=df.Type.apply(lambda x:1/(1+np.exp(-0)) if x=='Null' else 1/(1+np.exp(-1)))
# weight of reciprocated ties/dyads:
#df['DyadsWeight']=df.Dyads.apply(lambda x:np.round(1/(1+np.exp(x)),5))
#df['DyadsWeight']=df.Dyads.apply(lambda x:1/(1+np.exp(-x)))
#df['NetworkWeight']=df.NetworkSize.apply(lambda x:1/(1+np.exp(-x)))
# select only unverified users:
#df = df[df.Category=='Unverified']
# Grand Consolidation:
#df_trans.to_csv('first_tiny_set_of_transitive_triplets.csv', index_label=False)
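
The commented-out cell above maps counts to weights with the logistic function 1/(1+exp(-x)). A minimal, self-contained sketch of that mapping follows; the function name `sigmoid` and the sample inputs are illustrative assumptions. For raw counts as large as typical `NetworkSize` values, the result saturates at 1.0 in floating point, so rescaling the counts first may be worth considering.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# small inputs stay distinguishable, large counts collapse to 1.0
weights = {x: sigmoid(x) for x in (0, 1, 158, 5719)}
print(weights)
```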

## SNAP Datasets
**Extraction:** the SNAP data is available at http://snap.stanford.edu/data/ego-Facebook.html. We extract the relevant fields and transform the data to suit the needs of the study. The naming convention of the original data repository is maintained ...

    # Edges from all ego-networks

In [58]:
# edges from all ego-networks combined:
fd = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook_combined.txt', sep=' ',names=['UserID','Edge'])
# include the network or circle size of each ego:
fd['Dyads'] = list(zip(fd.UserID, fd.Edge)) # keeps track of dyadic ties
fd['CircleSize'] = fd.groupby('UserID').UserID.transform('size') # size of ego circles (rows per UserID)
# save the new file:
fd.to_csv('data/snapdata/fb_ego_and_circle_size.csv', index_label=False)
# reload the preprocessed file: fd = pd.read_csv('data/snapdata/fb_ego_and_circle_size.csv')
# unduplicated ego-net:
ufd = fd.drop_duplicates(subset='UserID')
#size of attributes:
len(fd), len(set(fd.UserID)), len(set(fd.Dyads))

(88234, 3663, 88234)

In [80]:
# samples from file with duplicate:
fd.head(3)

Unnamed: 0,UserID,Edge,Dyads,CircleSize
0,0,1,"(0, 1)",347
1,0,2,"(0, 2)",347
2,0,3,"(0, 3)",347


In [81]:
# samples from file without duplicates:
ufd.head(3)

Unnamed: 0,UserID,Edge,Dyads,CircleSize
0,0,1,"(0, 1)",347
347,1,48,"(1, 48)",16
363,2,20,"(2, 20)",9


**Nodes, circles and profile attributes**

*Making sense of the data ...* Each ego/node shares certain attributes (circle, circle size, profile features), so we group those attributes in a single file as follows. We start by aggregating the attributes of a single node/ego, then scale the process accordingly. There are 10 ego-networks, and each ego-network consists of shared circles, edges and the corresponding features of edges/nodes. We extract this information and merge it into a single file for ease of interpretation. This makes the data suitable for our use case: in its original form it is difficult to work with and understand, because the features are unevenly distributed (making it difficult to generalise) and the correspondence between features and edges is not clear. We aim to make this as clear as possible.
     
    # all nodes and all circles:

In [95]:
# read the .circles files and store the user/circle pairs:
files = os.listdir('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/')
for file in files:
    if file.endswith('.circles'):
        extracts = dict()
        with open('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/'+file, 'rt') as f:
            for line in f:
                # each line holds a circle name followed by the IDs of its members
                extracts[line.split()[0]] = line.split()[1:]
        # dataframe of users and circles:
        users = []
        circles = []
        for key in extracts.keys():
            for value in extracts[key]:
                circles.append(key)
                users.append(value)
        duc = pd.DataFrame({'UserID':users,'Circles':circles})
        # save the extracts for further analysis:
        duc.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/users_circles'+file+'.csv',\
                      index_label=False)

In [96]:
# some samples:
duc = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/users_circles0.circles.csv')
duc.head(3)

Unnamed: 0,UserID,Circles
0,71,circle0
1,215,circle0
2,54,circle0


In [621]:
# read the files and store relevant files:
files = os.listdir('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/')
#frame = defaultdict()
# make a flat list:
fl = lambda l: [v for sl in l for v in sl]
for file in files:
    frame = defaultdict()
    if file.endswith('.circles'):
        #print(file)
        with open('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/'+file, 'rt') as f:
            for line in f.readlines():
                #frame[line.split()[0]].append([int(i) for i in line.split()[1:]])
                frame[line.split()[0]] =[int(i) for i in line.split()[1:]]   
            #break
            # make a flat list of member IDs and use it as the index:
            index = fl(frame.values())
            columns = [c for c in frame.keys()]
            # create an empty (NaN-filled) dataframe with node IDs as index and circles as columns;
            # np.int has been removed from recent NumPy releases, and NaN needs a float dtype anyway:
            dk = pd.DataFrame(index=index, columns=columns, dtype=float)
            # map the frame onto the dataframe ... assign 1/0 according to the truth value of the mapping
            for k, v in zip(frame.keys(), frame.values()):
                for i in v:
                    if dk.loc[i,k].any():
                        dk.loc[i,k]=1
                    else:
                        dk.loc[i,k]=0
            # replace NANs with 0s:
            dk = dk.replace(np.nan, 0)
            # save files/the extracts for further analysis:
            dk.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+file+'.csv',\
                      index_label=False)        
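
A more compact way to obtain a node-by-circle membership matrix is sketched below with `pd.crosstab`. It is an alternative under stated assumptions rather than the procedure used above: the three input lines are made up, and they assume the `.circles` layout of a circle name followed by whitespace-separated member IDs. The result has one row per user, including users that belong to several circles.

```python
import pandas as pd

# hypothetical lines in the .circles layout: circle name, then member IDs
lines = ["circle0\t71\t215\t54", "circle1\t173", "circle2\t54\t155"]

rows = []
for line in lines:
    name, *members = line.split()
    rows.extend((int(user), name) for user in members)

duc = pd.DataFrame(rows, columns=["UserID", "Circles"])

# count (user, circle) pairs and cap at 1 to obtain a 0/1 membership matrix
membership = pd.crosstab(duc["UserID"], duc["Circles"]).clip(upper=1)
print(membership)
```

User 54 appears in both `circle0` and `circle2` and therefore has two entries set to 1 in a single row.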

    # some examples of the extracts ... 

In [630]:
# see some sample saved files:
dc = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/0.circles.csv')
# why not rename the index to UserID?:
dc.head(3)

Unnamed: 0,circle0,circle1,circle2,circle3,circle4,circle5,circle6,circle7,circle8,circle9,...,circle14,circle15,circle16,circle17,circle18,circle19,circle20,circle21,circle22,circle23
71,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
215,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
54,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Establish mapping with the remaining attributes** We need to aggregate all 10 nodes and the corresponding circles, features and other relevant attributes for further analysis. The circles are the communities and the nodes/indices are the IDs of the nodes ...

### Nodes and Features
    
        #remember, the ego/anchor node is not included at this stage, append it later ... 

In [689]:
files = os.listdir('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/')
for file in files:
    if file.endswith('.feat'):
        # load the data to ascertain the number of columns and use that information to name the columns
        d = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/'+file, sep=' ')
        #reload and assign proper names:
        df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/'+file, sep=' ',\
                        names=['F'+str(f) for f in range(len(d.columns)-1)])
        df.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+file+'.csv')
        #reload and assign the userID:
        df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+file+'.csv')
        df = df.rename(columns={'Unnamed: 0':'UserID'})
        df.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+file+'.csv',index_label=False)

In [692]:
# see some sample saved files:
df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/348.feat.csv')
# why not rename the index to UserID?:
df.head(3)

Unnamed: 0,UserID,F0,F1,F2,F3,F4,F5,F6,F7,F8,...,F151,F152,F153,F154,F155,F156,F157,F158,F159,F160
0,349,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,350,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,351,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Features**

    # because the features have a maximum dimension/number of columns of 4, each feature column is denoted by FC# where # refers to the column number, e.g. FC1.

In [2]:
# a function to trim the anonymised features for brevity:
def trim_feat(x):
    x=str(x)
    #if x!=np.nan:
    if x.startswith('anon'):
        return 'AF'+x.split()[2]
    # remove prefixed digits in features:elif x.isdigit(): re.findall('[a-zA-Z]+',r)
    else:
        return x

In [1274]:
files = os.listdir('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/')
for file in files:
    if file.endswith('.featnames'):
        df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/'+file, sep=';',\
                        names=['FC'+str(f) for f in range(4)])
        
        # apply the trim_features function on the feature columns:
        # remove the digits in features
        #df['FC0']=df.FC0.apply(lambda x: x[2:])
        df['FC0']=df.FC0.apply(lambda x: re.findall('[a-zA-Z]+',x)[0])
        #trim features:
        df['FC1']=df.FC1.apply(trim_feat)
        df['FC2']=df.FC2.apply(trim_feat)
        df['FC3']=df.FC3.apply(trim_feat)
        #save file:
        df.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+file+'.csv',index_label=False)
        #associate the UserID with the features:
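
To make the trimming step concrete, the sketch below parses individual feature-name lines. It assumes lines shaped like `<index> <name>;...;anonymized feature <n>`; the helper name `parse_featname` and the sample lines are illustrative. Unlike the letters-only pattern used in the cell above, which stops at the first underscore, this variant strips only the leading index, so a name such as `first_name` is kept whole.

```python
import re

def parse_featname(line, width=4):
    """Split one feature-name line into at most `width` trimmed parts."""
    parts = line.strip().split(";")
    # drop the leading numeric index from the first part
    parts[0] = re.sub(r"^\d+\s*", "", parts[0])
    # shorten 'anonymized feature N' to 'AFN'
    parts = ["AF" + p.split()[2] if p.startswith("anon") else p for p in parts]
    # pad with None so every row has the same number of columns
    return parts + [None] * (width - len(parts))

print(parse_featname("0 birthday;anonymized feature 206"))
print(parse_featname("1 education;degree;id;anonymized feature 22"))
print(parse_featname("5 first_name;anonymized feature 5"))
```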

In [137]:
# see some sample saved files:
df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/348.featnames.csv')
# why not rename the index to UserID?:
df.head(3)

Unnamed: 0,FC0,FC1,FC2,FC3
0,birthday,AF206,,
1,birthday,AF207,,
2,birthday,AF208,,


**Associate each feature set with the corresponding UserIDs:**

In [180]:
# feature list (fl) and feature names (fn)
fl = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/3980.feat.csv')
fn = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/3980.featnames.csv')
len(fl), len(fl.columns), len(fn)

(59, 43, 42)

In [181]:
fl.head(3)

Unnamed: 0,UserID,F0,F1,F2,F3,F4,F5,F6,F7,F8,...,F32,F33,F34,F35,F36,F37,F38,F39,F40,F41
0,3981,0,1,0,0,1,1,1,1,0,...,0,0,1,0,0,0,0,0,0,0
1,3982,0,0,0,0,0,1,1,1,0,...,1,1,0,0,0,0,0,1,0,0
2,3983,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


    # the above file consists of the feature vector of each node. For instance, user 3981 has features ranging from F0 to F41, of which F1, F4, F5, F6, F7 and F34 are turned on (among the columns displayed). Note that F5 to F7 are also turned on for user 3982. The goal is to assign names to those features, e.g. F0-->first_name, for interpretability. We use the file below, consisting of feature columns (FC#), to associate each feature with its corresponding name.

In [182]:
fn.head(3)

Unnamed: 0,FC0,FC1,FC2,FC3
0,birthday,AF6,,
1,education,concentration,id,AF14
2,education,degree,id,AF22


**Users and Feature names:** The number of feature columns in file.feat.csv is the same as the number of rows in file.featnames.csv; hence the names of the features can be used as the column names in file.feat.csv instead of F0, F1, etc. An additional file is created to contain this information, as described below. *What is not clear is whether the ordering of the features matters. Although the order is trivial at this stage, we assume that the features follow the same order in which they are reported in the original file.*
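
One direct way to apply the feature names as column names is sketched here, under the ordering assumption stated above (row i of the feature-names file describes column F*i*). The toy frames `feat` and `featnames`, their values, and the underscore-joined naming scheme are illustrative assumptions and not part of the original data.

```python
import pandas as pd

# toy stand-ins for file.feat.csv and file.featnames.csv
feat = pd.DataFrame({"UserID": [3981, 3982], "F0": [0, 0], "F1": [1, 0], "F2": [0, 1]})
featnames = pd.DataFrame({
    "FC0": ["birthday", "education", "education"],
    "FC1": ["AF6", "concentration", "degree"],
    "FC2": [None, "id", "id"],
    "FC3": [None, "AF14", "AF22"],
})

# join the non-missing parts of each row into a single readable name
names = featnames[["FC0", "FC1", "FC2", "FC3"]].apply(
    lambda row: "_".join(row.dropna().astype(str)), axis=1
)

# map F0, F1, ... onto the names in row order
mapping = {f"F{i}": name for i, name in enumerate(names)}
named = feat.rename(columns=mapping).set_index("UserID")
print(named)
```

With this layout the 0/1 values of each user are preserved and only the column labels change.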

In [183]:
# number of columns/features in fl is equal to the number of instances in fn (note that the userID column not included):
len(fn), len(fl.columns)-1 # minus the userID column 

(42, 42)

In [185]:
# define the indices and columns for the new dataframe:
k = list(set(fn.FC0).union(set(fn.FC1)).union(set(fn.FC2)).union(set(fn.FC3)))
k.remove(np.nan)
len(fl),len(fn),len(fl.columns),len(k)

(59, 42, 43, 59)

In [186]:
# create an empty dataset to store the relevant values (user-features dataframe:):
usf = pd.DataFrame(data = np.zeros(shape=(len(fl.UserID),len(k))),index=fl.UserID, columns=k, dtype=np.int16)
#or using this to assign values: for i in usf.index: for c in usf.columns: usf.loc[i,c]=0

In [187]:
usf.head(3)

UserID,AF62,AF77,AF1276,AF22,AF6,AF1277,employer,AF57,AF253,AF1279,...,AF127,start_date,AF156,birthday,AF125,AF254,AF1282,AF1270,school,AF1274
3981,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [188]:
fl.head(3)

Unnamed: 0,UserID,F0,F1,F2,F3,F4,F5,F6,F7,F8,...,F32,F33,F34,F35,F36,F37,F38,F39,F40,F41
0,3981,0,1,0,0,1,1,1,1,0,...,0,0,1,0,0,0,0,0,0,0
1,3982,0,0,0,0,0,1,1,1,0,...,1,1,0,0,0,0,0,1,0,0
2,3983,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [189]:
fn.head(3)

Unnamed: 0,FC0,FC1,FC2,FC3
0,birthday,AF6,,
1,education,concentration,id,AF14
2,education,degree,id,AF22


In [190]:
len(usf), len(fl), len(fn), len(usf.columns), len(fl.columns)

(59, 59, 42, 59, 43)

In [191]:
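# harmonise the discrepancies in lengths (pad fn with empty rows so that it has one row per user);
# DataFrame.append and np.int are no longer available in recent pandas/NumPy releases:
addr = pd.DataFrame(index=range(len(usf.index)-len(fn)), columns=['FC0', 'FC1','FC2','FC3'])
fn = pd.concat([fn, addr], ignore_index=True)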

In [193]:
#fn

In [194]:
len(usf.index), len(fn)

(59, 59)

In [196]:
for idx, r in zip(usf.index, range(len(usf))):
    flist = [f for f in fn.loc[r]] # store a temporary list of features ... 
    for col in usf.columns: 
        if col in flist:
            usf.loc[idx, col] = 1
        else:
            usf.loc[idx,col] = 0

In [199]:
usf.head(3)

UserID,AF62,AF77,AF1276,AF22,AF6,AF1277,employer,AF57,AF253,AF1279,...,AF127,start_date,AF156,birthday,AF125,AF254,AF1282,AF1270,school,AF1274
3981,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3983,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [250]:
# GENERALISE THE APPROACH TO SUIT ALL FILES:
# feature list (fl) and feature names (fn)

files = os.listdir('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts')
for f1 in files:
    if f1.endswith('.feat.csv'):
        f2 = f1[:-4]+'names.csv' # define f2:
        fl = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+f1)
        fn = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+f2)
        
        # define the indices and columns for the new dataframe:
        k = list(set(fn.FC0).union(set(fn.FC1)).union(set(fn.FC2)).union(set(fn.FC3)))
        k.remove(np.nan)
        # create an empty dataset to store the relevant values (user-features dataframe:):
        usf = pd.DataFrame(data = np.zeros(shape=(len(fl.UserID),len(k))),index=fl.UserID, columns=k, dtype=np.int16)
        #or using this to assign values: for i in usf.index: for c in usf.columns: usf.loc[i,c]=0
        
        # harmonise the discrepancies in lengths (pad fn with empty rows, one row per user):
        addr = pd.DataFrame(index=range(len(usf.index)-len(fn)), columns=['FC0', 'FC1','FC2','FC3'])
        fn = pd.concat([fn, addr], ignore_index=True)
        print(f1,f2)
        
        for idx, r in zip(usf.index, range(len(usf))):
            flist = [f for f in fn.loc[r]] # store a temporary list of features ... 
            for col in usf.columns: 
                if col in flist:
                    usf.loc[idx, col] = 1
                else:
                    usf.loc[idx,col] = 0
        #save file:
        usf.to_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/'+'user_features.'+f1,\
                  index_label=False)
        #associate the UserID with the features:

698.feat.csv 698.featnames.csv
686.feat.csv 686.featnames.csv
3980.feat.csv 3980.featnames.csv
0.feat.csv 0.featnames.csv
348.feat.csv 348.featnames.csv
1684.feat.csv 1684.featnames.csv
414.feat.csv 414.featnames.csv
1912.feat.csv 1912.featnames.csv
3437.feat.csv 3437.featnames.csv
107.feat.csv 107.featnames.csv


In [97]:
# see some samples from saved files:
df = pd.read_csv('/home/ijdutse/analysis/microcosm/data/snapdata/facebook/extracts/user_features.414.feat.csv')
df.head(3)

Unnamed: 0,AF327,AF77,employer,AF319,AF329,AF101,AF310,AF59,locale,AF222,...,gender,AF127,start_date,AF71,AF200,birthday,first,AF228,school,AF331
573,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
574,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
575,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [984]:
####################################################################################################