<b>Web Analytics DATA 620 - Project 02</b>

<b>Assignment: “Wiki Publishing”</b>

<b>Group - Chris Bloome / Mustafa Telab / Vinayak Kamath</b>

<b>Date - 24th June 2021</b>

--- 


Identify a large 2-mode (bipartite) network dataset; you can start with a dataset in a repository. The data should consist of ties between, and not within, two (or more) distinct groups.
Reduce the size of the network using a method such as the island method described in Chapter 4 of Social Network Analysis for Startups.
What can you infer about each of the distinct groups?
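
For reference, the island method mentioned above can be sketched roughly as follows. This is a minimal sketch on a toy graph, adapted from the idea in Social Network Analysis for Startups; the toy edges and the `max(1, ...)` step guard are assumptions added for illustration and are not part of this project's pipeline.

```python
import networkx as nx

def trim_edges(g, weight=1):
    """Return a new graph keeping only edges whose 'weight' exceeds the threshold."""
    g2 = nx.Graph()
    for u, v, data in g.edges(data=True):
        if data["weight"] > weight:
            g2.add_edge(u, v, **data)
    return g2

def island_method(g, iterations=5):
    """Trim the graph at evenly spaced weight thresholds ("water levels")."""
    weights = [data["weight"] for _, _, data in g.edges(data=True)]
    low, high = int(min(weights)), int(max(weights))
    step = max(1, (high - low) // iterations)  # assumption: guard against a zero step
    return [(threshold, trim_edges(g, threshold)) for threshold in range(low, high, step)]

# Toy weighted graph (hypothetical data, not taken from the Wikipedia edge list)
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("u1", "p1", 1), ("u1", "p2", 5), ("u2", "p2", 9), ("u3", "p3", 2), ("u3", "p1", 7),
])

islands = island_method(toy, iterations=4)
for threshold, trimmed in islands:
    # Report the threshold, the surviving nodes, and the number of islands
    print(threshold, trimmed.number_of_nodes(), nx.number_connected_components(trimmed))
```

Raising the threshold removes weak ties first, so the surviving components are the "islands" of strongest activity. A useful threshold generally has to be chosen by inspecting the output at each level.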

---

<b>Wikipedia User Publishing</b>



The source of the data is http://networkrepository.com/ia-wiki-user-edits-page.php#

The data is an edge list in which the nodes are users and Wikipedia pages, and each edge represents an edit event. Every record has four fields:

 - User
 - Web page
 - Weight
 - Time Stamp
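
To make the record layout concrete, here is a minimal sketch of how one record could be parsed by hand. It assumes the file is space-separated with the four fields in the order listed above and that the time stamp is a Unix epoch in seconds; `parse_edge` is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timezone

def parse_edge(line):
    """Split one space-separated edge record into labelled fields."""
    source, target, weight, timestamp = line.split()
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return {
        "source": "u" + source,   # prefix keeps user ids distinct from page ids
        "target": "p" + target,
        "weight": int(weight),
        "time": int(timestamp),
        "hour": when.hour,        # UTC hour of the edit
        "month": when.month,      # calendar month of the edit
    }

record = parse_edge("1 1 1 1039680411")
print(record)
```

The `u` and `p` prefixes matter because user ids and page ids overlap numerically; without them the two node sets would collapse into one.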

---

In [1]:
# Import required libraries
import networkx as nx
import networkx.algorithms.bipartite as bipartite
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pyvis.network import Network
from datetime import datetime

In [2]:
df = pd.read_csv('ia-wiki-user-edits-page.edges', sep = ' ',header = None, names = ['source','target','weight','time'])

In [3]:
hour = [int(datetime.utcfromtimestamp(v).strftime('%H')) for v in df['time']]

In [4]:
month = [int(datetime.utcfromtimestamp(v).strftime('%m')) for v in df['time']]

In [5]:
df['hour']= pd.Series(hour)
df['month']= pd.Series(month)
df['source'] = ['u'+ str(v) for v in df['source']]

In [6]:
df['target'] = ['p'+ str(v) for v in df['target']]

In [7]:
print(len(df))
print(df.head())

8998641
  source target  weight        time  hour  month
0     u1     p1       1  1039680411     8     12
1     u1     p1       1  1039680680     8     12
2     u1     p1       1  1039680886     8     12
3     u2     p1       1  1040932914    20     12
4     u3     p1       1  1052273037     2      5


In [8]:
# Collapse repeated edits into one edge per (user, page): total weight and mean edit hour
df_grouped = df.groupby(['source','target']).agg({'weight': 'sum', 'hour': 'mean'}).reset_index()

In [9]:
df_grouped[:10]

  source target  weight      hour
0     u1     p1       3       8.0
1     u1   p103       2       4.0
2     u1   p108       2       6.0
3     u1   p109       2       6.0
4     u1   p111       9       5.0
5     u1   p120       3  4.333333
6     u1   p121       1       6.0
7     u1   p122       1       5.0
8     u1   p127       1       6.0
9     u1   p130       1       6.0


In [10]:
u_count = df_grouped['source'].value_counts(ascending=True).rename_axis('source').reset_index(name='pages')
p_count = df_grouped['target'].value_counts(ascending=True).rename_axis('target').reset_index(name='users')

In [11]:
u_count

       source   pages
0      u10627       1
1      u12306       1
2      u18444       1
3      u28550       1
4      u15200       1
...       ...     ...
29343    u680  212041
29344    u232  220085
29345    u193  351338
29346    u845  472296


In [12]:
p_count

           target  users
0        p1966696      1
1        p1013306      1
2        p1379603      1
3        p1961607      1
4        p1268909      1
...           ...    ...
2094515   p117268    640
2094516   p183627    663
2094517      p590    710
2094518   p348358    826


In [21]:
print('Pages Per User Statistics')
print(u_count["pages"].describe().round(2), '\n')
print('Users Per Page Statistics')
print(p_count["users"].describe().round(2))

Pages Per User Statistics
count     29348.00
mean        189.89
std        6118.84
min           1.00
25%           1.00
50%           1.00
75%           3.00
max      699170.00
Name: pages, dtype: float64 

Users Per Page Statistics
count    2094520.00
mean           2.66
std            4.06
min            1.00
25%            1.00
50%            2.00
75%            3.00
max          916.00
Name: users, dtype: float64


In [None]:
# Distribution of pages edited per user; log scale because the counts are heavily skewed
plt.hist(u_count['pages'], bins = 10)
plt.yscale('log')

In [None]:
plt.hist(p_count['users'], bins = 30)
plt.yscale('log')

In [None]:
df_grouped = pd.merge(df_grouped, u_count, how="left", on="source")

In [None]:
df_grouped = pd.merge(df_grouped, p_count, how="left", on="target")

In [None]:
print(len(df_grouped))
print(df_grouped.head())

In [None]:
df_island = df_grouped[(df_grouped['pages']>100)  & (df_grouped['users']>20)]
len(df_island)

In [None]:
S = nx.from_pandas_edgelist(df_island, source='source', target='target', edge_attr=["weight", "hour"], create_using = nx.DiGraph(), edge_key=None)

In [None]:
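# nx.info() was removed in NetworkX 3.0; report the graph size directly
print(f"Nodes: {S.number_of_nodes()}, Edges: {S.number_of_edges()}")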

In [None]:
len(list(nx.connected_components(S.to_undirected())))

In [None]:
users = [v for v in S.nodes() if v[0]== 'u']

In [None]:
users[0:10]

In [None]:
U = bipartite.projected_graph(S.to_undirected(), users)

In [None]:
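# nx.info() was removed in NetworkX 3.0; report the graph size directly
print(f"Nodes: {U.number_of_nodes()}, Edges: {U.number_of_edges()}")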

In [None]:
nx.is_bipartite(S)

In [None]:
nx.draw(U)

In [None]:
# Modification of the code block found in Social Network Analysis for Startups, p. 64
def trim_nodes(g, weight=1):
    """Return the subgraph of nodes whose degree is greater than `weight`."""
    nodes = []
    for n in g.nodes():
        if g.degree(n) > weight:
            nodes.append(n)
    G2 = g.subgraph(nodes)
    return G2

In [None]:
U2 = trim_nodes(U,600)
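# nx.info() was removed in NetworkX 3.0; report the graph size directly
print(f"Nodes: {U2.number_of_nodes()}, Edges: {U2.number_of_edges()}")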

In [None]:
nx.draw(U2)

In [None]:
__author__ = """\n""".join(['Maksim Tsvetovat <maksim@tsvetovat.org>',
'Drew Conway <drew.conway@nyu.edu>',
'Aric Hagberg <hagberg@lanl.gov>'])
from collections import defaultdict
import networkx as nx
import numpy
from scipy.cluster import hierarchy
from scipy.spatial import distance
import matplotlib.pyplot as plt

def create_hc(G, t=1.0):
    """
    Creates hierarchical cluster of graph G from distance matrix
    Maksim Tsvetovat ->> Generalized HC pre- and post-processing to work on labelled graphs
    and return labelled clusters
    The threshold value is now parameterized; useful range should be determined
    experimentally with each dataset
    """
    """Modified from code by Drew Conway"""
    
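    ## Create a shortest-path distance matrix, while preserving node labels
    labels=list(G.nodes())  # a list, so labels can be indexed by position later
    index={node:i for i,node in enumerate(labels)}  # node label -> matrix row/column
    path_length=nx.all_pairs_shortest_path_length(G)
    distances=numpy.zeros((len(G),len(G)))
    for u,p in path_length:
        for v,d in p.items():
            # Index by node label: p is ordered by distance from u, not by node order
            distances[index[u]][index[v]]=d
            distances[index[v]][index[u]]=d
    # Pairs with no connecting path keep distance 0, so G is assumed to be connected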
        
    # Create hierarchical cluster
    Y=distance.squareform(distances)
    Z=hierarchy.complete(Y) # Creates HC using farthest point linkage
    
    # This partition selection is arbitrary, for illustrative purposes
    membership=list(hierarchy.fcluster(Z,t=t))
    
    # Create collection of lists for blockmodel
    partition=defaultdict(list)
    for n,p in zip(list(range(len(G))),membership):
        if p>=0:
            partition[p].append(labels[n])
    return list(partition.values())

In [None]:
clusters = create_hc(U2)