### Prompt:
 Identify a large 2-mode network dataset—you can start with a dataset in a repository.  Your data should meet the criterion that it consists of ties between and not within two (or more) distinct groups.
    Reduce the size of the network using a method such as the island method described in chapter 4 of Social Network Analysis.
    What can you infer about each of the distinct groups?

You may work in a small group on the project.

Your code and analysis should be delivered in an IPython Notebook by end of day Sunday. 

### Data
https://snap.stanford.edu/data/wiki-Elec.html

### Libraries

In [1]:
import pandas as pd
import os
import io
import networkx as nx
import re

importedScipy = False
try:
    import scipy
    importedScipy = True
except ImportError:
    pass
importedNltk = False
try:
    import nltk
    importedNltk = True
except ImportError:
    pass

In [2]:
print ("Pandas Version {}".format(pd.__version__))
print ("Networkx Version {}".format(nx.__version__))
if (importedScipy):
    print ("scipy Version {}".format(scipy.__version__))
else:
    print("scipy wasn't imported.")
if (importedNltk):
    print("NLTK Version {}".format(nltk.__version__))
else:
    print("NLTK wasn't imported.")

Pandas Version 1.0.1
Networkx Version 2.4
scipy Version 1.4.1
NLTK Version 3.4.5


### Read in Data

In [3]:
#The file may have to be renamed by hand so that it has the .txt extension.
#For some reason this was only necessary on one of my machines.
filename = "wikiElec.ElecBs3.txt"

In [4]:
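file = io.open(filename, mode="r", encoding="ANSI") #This encoding was missing before; "ANSI" is only recognized on Windows
print(file.read(500)) #preview; this moves the file position 500 characters forward
data = file.readlines() #continues from that position, not from the top of the file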

# Wikipedia elections (http://cs.stanford.edu/people/jure/pubs/triads-chi10.pdf). Data format:
#   E: is election succesful (1) or not (0)
#   T: time election was closed
#   U: user id (and username) of editor that is being considered for promotion
#   N: user id (and username) of the nominator
#   V: <vote(1:support, 0:neutral, -1:oppose)> <user_id> <time> <username>
E	1
T	2004-09-21 01:15:53
U	30	cjcurrie
N	32	andyl
V	1	3	2004-09-14 16:26:00	ludraman
V	-1	25	2004-09-14 16:53:00	blankfaze
V	1	


In [5]:
for line in data[:5]:
    print(line)

4	2004-09-14 17:08:00	gzornenplatz

V	1	5	2004-09-14 17:37:00	orthogonal

V	1	6	2004-09-14 19:28:00	andrevan

V	1	7	2004-09-14 19:37:00	texture

V	1	8	2004-09-14 21:04:00	lst27



In [6]:
data[0]

'4\t2004-09-14 17:08:00\tgzornenplatz\n'

In [7]:
data[1]

'V\t1\t5\t2004-09-14 17:37:00\torthogonal\n'

In [8]:
data[2]

'V\t1\t6\t2004-09-14 19:28:00\tandrevan\n'

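The lines in `data` start at an arbitrary spot, so the first election is cut off. The cause is the preview in cell 4: `file.read(500)` consumed the first 500 characters, and `readlines()` picked up from there. We'll just disregard the partial election; the file still holds roughly 2,800 others (2,793 by the count further down).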
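The effect of previewing a file before calling `readlines()` can be reproduced on a small in-memory sample. This is a minimal sketch: the `sample` text is made up for illustration and `io.StringIO` stands in for the real data file. It also shows that `seek(0)` rewinds the stream so that nothing is lost.

```python
import io

# Hypothetical stand-in for the data file: two header lines, then records.
sample = "# header line\n# format line\nE\t1\nT\t2004-09-21 01:15:53\nU\t30\tcjcurrie\n"

handle = io.StringIO(sample)
preview = handle.read(20)        # consumes the first 20 characters
partial = handle.readlines()     # resumes at character 20, in the middle of a line

handle.seek(0)                   # rewind to the start of the stream
complete = handle.readlines()    # now every line comes back intact

print(repr(partial[0]))
print(repr(complete[0]))
```

With a real file handle the same `seek(0)` call between the preview and `readlines()` would keep the header and the first election in `data`.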

### Discarding the first several screwed up rows

In [9]:
for i in range(len(data)):
    if data[i][0] =="E": #first proper election in the data, as read in
        print(i)
        break

26


In [10]:
data[26:35]

['E\t1\n',
 'T\t2004-09-21 20:49:47\n',
 'U\t54\tzoney\n',
 'N\t28\tneutrality\n',
 'V\t1\t28\t2004-09-14 13:39:00\tneutrality\n',
 'V\t1\t33\t2004-09-14 13:41:00\tchmod007\n',
 'V\t1\t34\t2004-09-14 14:40:00\tnorm\n',
 'V\t1\t5\t2004-09-14 15:00:00\torthogonal\n',
 'V\t1\t20\t2004-09-14 15:43:00\tmichael\n']

In [11]:
thisCellRan1 = False
# Do not rerun this cell, or the guard resets and the next cell will trim more rows from data

In [12]:
if not thisCellRan1:
    data = data[26:]      #discard the first 26 rows (indices 0-25)
    thisCellRan1 = True   #the partial first election has now been discarded

#Only run this cell once!!!  That's the point of the thisCellRan1 variable.
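An alternative to the run-once flag is a trim that is safe to repeat. The sketch below is only an illustration: `drop_partial_first_record` is a hypothetical helper name, and `rows` is a hand-made sample in the same shape as `data`. It cuts at the first row that starts with `E`, so running it twice gives the same result as running it once.

```python
def drop_partial_first_record(lines):
    """Return the rows starting at the first one that begins with 'E'."""
    for index, line in enumerate(lines):
        if line.startswith("E"):
            return lines[index:]
    return []  # no election header found at all

# Hand-made sample: two leftover rows from a cut-off election, then a real one.
rows = [
    "4\t2004-09-14 17:08:00\tgzornenplatz\n",
    "V\t1\t5\t2004-09-14 17:37:00\torthogonal\n",
    "E\t1\n",
    "T\t2004-09-21 20:49:47\n",
]

once = drop_partial_first_record(rows)
twice = drop_partial_first_record(once)  # rerunning changes nothing
print(once == twice)
```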

In [67]:
data[0:5]
#line 0 should start with "E", election. 
#If it doesn't, restart the kernel and rerun the notebook.

['E\t1\n',
 'T\t2004-09-21 20:49:47\n',
 'U\t54\tzoney\n',
 'N\t28\tneutrality\n',
 'V\t1\t28\t2004-09-14 13:39:00\tneutrality\n']

In [14]:
ECount = 0
for i in data:
    if i[0] =="E":
        ECount +=1
ECount

2793

### Make some classes to work with the data

In [15]:
#from the docs
#E: is election successful (1) or not (0)
#T: time election was closed
#U: user id (and username) of editor that is being considered for promotion
#N: user id (and username) of the nominator

class Election():# A node type in the graph
    def __init__(self, success, time, user, nominator, votes):
        if success not in (0, 1):
            print("invalid success entered: ",success)
            self.success = "unknown"
        else:
            self.success = success
        self.time = 0 #TODO store the time argument, possibly if we want more edge data
        self.user = user
        self.nominator = nominator
        self.votes = votes

In [16]:
#   V: <vote(1:support, 0:neutral, -1:oppose)> <user_id> <time> <username>
class Vote: #Edges in the graph
    def __init__(self, vote, user_id, time, username):
        if vote < -1 or vote > 1:
            print("invalid vote:", vote)
            self.vote = "invalid"
        else:
            self.vote = vote
        self.time = 0 #TODO store the time argument if we want more attributes
        self.user_id = user_id
        self.user_name = username

In [68]:
votes = []
for line in data[3:5]:
    if line[0]=="V":
        print(line)
        print(line[2]) #vote
        print(re.search(r"\d+", line[3:])) #raw string so \d is not treated as a string escape
    

V	1	28	2004-09-14 13:39:00	neutrality

1
<re.Match object; span=(1, 3), match='28'>


In [18]:
data[55]

'V\t1\t11\t2004-09-12 06:21:00\tmerovingian\n'

In [19]:
data[55][3:]

'\t11\t2004-09-12 06:21:00\tmerovingian\n'

In [20]:
match2 = re.search(r"\d+",data[55][3:]) 

In [21]:
findall2 = re.findall(r"\d+" , data[55])

In [22]:
findall2

['1', '11', '2004', '09', '12', '06', '21', '00']

In [23]:
findall3 = re.findall(r"[a-z]+", data[55])

In [24]:
findall3

['merovingian']

In [25]:
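#Attempt to group the raw lines into one string per election
elections =[]
election =""
inAnElection = False
for i in range(len(data)):
    if ( (data[i][0] == "E") ): #&\
        #((i != len(data)-1) & (data[i+1][0]=="T") ) ) : #new election
        elections.append(election) #note: the first item appended is the empty string
        election = data[i]
    else:
        #BUG: `line` is a stale variable left over from an earlier cell; data[i] was intended.
        #This is why elections[3] is one vote row repeated over and over.
        election =  election + line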
    

In [26]:
elections[3]

'E\t1\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:00\tantandrus\nV\t1\t36\t2004-09-14 16:02:
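The repeated `antandrus` row above is not what the grouping is meant to produce. Below is a minimal sketch of the intended grouping, under the assumption that every election begins with a row starting with `E`. The name `group_elections` is hypothetical and `rows` is a short hand-made sample built from rows shown earlier in this notebook; each election comes back as its own list of rows.

```python
def group_elections(lines):
    """Split raw rows into one list of rows per election."""
    blocks = []
    current = None                  # rows of the election being collected
    for row in lines:
        if not row.strip():
            continue                # skip blank lines
        if row.startswith("E"):
            if current is not None:
                blocks.append(current)
            current = [row]         # start a new election
        elif current is not None:
            current.append(row)     # rows before the first 'E' are ignored
    if current is not None:
        blocks.append(current)      # do not lose the last election
    return blocks

rows = [
    "V\t1\t5\t2004-09-14 17:37:00\torthogonal\n",   # leftover from a cut-off election
    "E\t1\n",
    "T\t2004-09-21 20:49:47\n",
    "U\t54\tzoney\n",
    "N\t28\tneutrality\n",
    "V\t1\t28\t2004-09-14 13:39:00\tneutrality\n",
    "E\t1\n",
    "T\t2004-09-21 01:15:53\n",
    "U\t30\tcjcurrie\n",
]

blocks = group_elections(rows)
print(len(blocks))
```

Keeping each election as a list of rows, rather than one concatenated string, means the later parsing step does not have to split the text again.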

In [27]:
electionsList =[]
for i in range(len(data[:10])):
    print (data[i])
                               

E	1

T	2004-09-21 20:49:47

U	54	zoney

N	28	neutrality

V	1	28	2004-09-14 13:39:00	neutrality

V	1	33	2004-09-14 13:41:00	chmod007

V	1	34	2004-09-14 14:40:00	norm

V	1	5	2004-09-14 15:00:00	orthogonal

V	1	20	2004-09-14 15:43:00	michael

V	1	36	2004-09-14 16:02:00	antandrus



In [60]:
electionsList =[]
for i in range(len(data)):
    #print (data[i])
    if data[i][0] == "E" and i < len(data)-1:
        #def __init__(self, success, time, user, nominator, votes):
        try:
            tempElection = Election(
                               int(re.findall(r"\d+",data[i])[0]),
                               0, #time TODO
                               #\D+ stops at the first digit, so a username like "lst27" comes out as "lst"
                               re.findall(r"\D+",data[i+2])[1].strip(), #user
                               re.findall(r"[\D]+",data[i+3])[1].strip(), #nominators can be "UNKNOWN"
                               [3] #votes, TODO
                              )
            newTempElection= True
        except (IndexError, ValueError):
            #uncomment these lines for troubleshooting those 8 problem users
            #print("Problem with :", data[i])
            #print(data[i+1])
            #print(data[i+2])
            #print(data[i+3])
            #print("----")
            continue
        if newTempElection:
            electionsList.append(tempElection)
    else:   
        newTempElection = False

In [66]:
for i in range(233,245):
    print(electionsList[i].user +" nominated by:\t" + electionsList[i].nominator)

smoddy nominated by:	UNKNOWN
furrykef nominated by:	cryptoderk
gkhan nominated by:	UNKNOWN
khaosworks nominated by:	seancdaug
trilobite nominated by:	oo
haham_hanuka nominated by:	UNKNOWN
lommer nominated by:	UNKNOWN
shanes nominated by:	joy_stuvall
cburnett nominated by:	royboy
el_c nominated by:	dbachmann
worldtraveller nominated by:	macgyvermagic
zzyzx nominated by:	UNKNOWN


In [46]:
len(electionsList)
# There are eight entries for users who have numbers as usernames or no username, just a user ID.
# These could be cleaned up, but for now they are just discarded.

0

In [31]:
for i in electionsList[2:14]:
    print(i.user +" nominated by: " + i.nominator)
#electionsList works as expected

andrevan nominated by: node_ue
arminius nominated by: thecustomoflife
lst nominated by: UNKNOWN
chmod nominated by: UNKNOWN
taoster nominated by: UNKNOWN
an nominated by: UNKNOWN
jor nominated by: UNKNOWN
proteus nominated by: lord_emsworth
pumpie nominated by: UNKNOWN
nichalp nominated by: krs
pedanticallyspeaking nominated by: UNKNOWN
benc nominated by: neutrality


### TODO: Attach each election's votes to its election object, possibly using the Vote objects defined above (the Vote class exists to better preserve the vote attributes).
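One possible way to approach this TODO is to split each row on tabs instead of using regular expressions, which keeps digits in usernames (for example `chmod007`). This is a sketch under assumptions, not the notebook's method: `VoteRecord`, `ElectionRecord` and `parse_block` are hypothetical names, the input is one election's rows as a list, and the username is assumed to be the last tab-separated field of the `U` and `N` rows.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VoteRecord:
    vote: int          # 1 support, 0 neutral, -1 oppose
    user_id: int
    time: str
    username: str

@dataclass
class ElectionRecord:
    success: int
    time: str
    user: str
    nominator: str
    votes: List[VoteRecord] = field(default_factory=list)

def parse_block(block):
    """Turn the rows of one election into an ElectionRecord."""
    fields = {}
    votes = []
    for row in block:
        parts = row.rstrip("\r\n").split("\t")
        tag = parts[0]
        if tag == "V":
            username = parts[4] if len(parts) > 4 else ""  # username may be missing
            votes.append(VoteRecord(int(parts[1]), int(parts[2]), parts[3], username))
        else:
            fields[tag] = parts[1:]
    return ElectionRecord(
        success=int(fields["E"][0]),
        time=fields["T"][0],
        user=fields["U"][-1],        # assumption: last field is the username
        nominator=fields["N"][-1],
        votes=votes,
    )

block = [
    "E\t1\n",
    "T\t2004-09-21 20:49:47\n",
    "U\t54\tzoney\n",
    "N\t28\tneutrality\n",
    "V\t1\t28\t2004-09-14 13:39:00\tneutrality\n",
    "V\t1\t33\t2004-09-14 13:41:00\tchmod007\n",
]

record = parse_block(block)
print(record.user, record.nominator, [v.username for v in record.votes])
```

A `U` row that carries only a user ID would fall back to that ID as the user name here, which is one way the eight discarded entries could be kept.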

### TODO2: Once we have a working list of the elections and votes, we can make the nodes (users from the votes, elections from the election objects). Then we can create a bipartite graph to begin the Project 2 analysis.
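As a rough sketch of where TODO2 could lead, the cell below builds a small bipartite graph with NetworkX, projects it onto the voters, and trims weak edges in the spirit of the island method. Everything in it is illustrative: the `ballots` triples are made up from usernames seen above, `trim_edges` is a hypothetical helper, and the `>=` threshold is one possible reading of the method. Because a Wikipedia user can be both a voter and a candidate, each node is a `(role, name)` tuple so the two groups stay distinct.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Made-up (voter, candidate, vote) triples; real ones would come from the parsed elections.
ballots = [
    ("neutrality", "zoney", 1),
    ("chmod007", "zoney", 1),
    ("norm", "zoney", 1),
    ("neutrality", "cjcurrie", 1),
    ("chmod007", "cjcurrie", -1),
    ("blankfaze", "cjcurrie", -1),
]

B = nx.Graph()
for voter, candidate, vote in ballots:
    voter_node = ("voter", voter)
    election_node = ("election", candidate)
    B.add_node(voter_node, bipartite=0)
    B.add_node(election_node, bipartite=1)
    B.add_edge(voter_node, election_node, vote=vote)

voters = [node for node, side in B.nodes(data="bipartite") if side == 0]

# Project onto voters: the edge weight counts the elections two voters share.
voter_net = bipartite.weighted_projected_graph(B, voters)

def trim_edges(graph, min_weight):
    """Keep only the edges whose weight is at least min_weight."""
    trimmed = nx.Graph()
    for u, v, data in graph.edges(data=True):
        if data["weight"] >= min_weight:
            trimmed.add_edge(u, v, **data)
    return trimmed

island = trim_edges(voter_net, min_weight=2)
print(island.number_of_edges())
```

Raising `min_weight` step by step and watching how the remaining components shrink is the usual way to pick a cutoff for a network of this size.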