# Project 2

Our data comes from [Tore Opsahl](https://toreopsahl.com/datasets/#online_forum_network) and concerns co-authorship for "Condensed Matter" from 1995 to 1999.

## Data Processing

In [92]:
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms import bipartite
import numpy as np
import pandas as pd

url1 = 'http://opsahl.co.uk/tnet/datasets/Newman-Cond_mat_95-99-Newman.dl'
url2 = 'http://opsahl.co.uk/tnet/datasets/Newman-Cond_mat_95-99-Author_names.txt'

data = pd.read_csv(url1, sep = " ", skiprows = 4, header = None)
names = pd.read_csv(url2, sep = " ", header = None)

data = data.rename(columns = {0: 'author', 1:'paper', 2:'weight'})
names = names.rename(columns = {0: 'index', 1:'author'})

names.index = names.index + 1
names = names.to_dict()
data.author = data.author.replace(names['author'])
G = nx.from_pandas_edgelist(data, 'author', 'paper', 'weight')

We have two sources: the edge list for the two-mode network and a list of author names. We load both into dataframes with the `pandas` package, rename the columns to make them easier to select, then use `Series.replace()` to swap the numeric author IDs in `data` for the actual author names from `names`. Finally, `nx.from_pandas_edgelist()` builds the weighted graph `G` from the edge list.
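
As a minimal sketch of the ID-to-name replacement described above, the same steps can be run on made-up data. The three authors, the paper IDs, and the weights below are hypothetical stand-ins for the downloaded files, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy edge list: numeric author IDs, paper IDs, and weights.
data = pd.DataFrame({
    'author': [1, 2, 3],
    'paper': [10, 10, 11],
    'weight': [1.0, 0.5, 0.5],
})

# Hypothetical name list; row order defines the author ID (starting at 1).
names = pd.DataFrame({'author': ['Ada', 'Bo', 'Cy']})

# Shift the default 0-based index so it lines up with the 1-based author IDs.
names.index = names.index + 1

# Build an {id: name} lookup and substitute names for IDs.
lookup = names['author'].to_dict()
data['author'] = data['author'].replace(lookup)

print(data)
```

Shifting the index by one matters only if the author IDs start at 1, which is what the notebook code assumes.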

## Island Method

In [102]:
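def trim_edges(g, weight = 1):
    """Return a new graph holding only the edges of g whose weight exceeds the threshold."""
    g2 = nx.Graph()
    for f, to, edata in g.edges(data = True):
        if edata['weight'] > weight:
            # Copy the edge together with all of its attributes
            g2.add_edge(f, to, **edata)
    return g2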

def island_method(g, iterations = 5):
    weights = [edata['weight'] for f, to, edata in g.edges(data = True)]
    
    mn = int(min(weights))
    mx = int(max(weights))
    
    # Guard against a zero step, which would make range() raise a ValueError
    step = max(1, int((mx - mn) / iterations))
    
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn, mx, step)]

In [106]:
islands = island_method(G)
for threshold, island in islands:
    # nx.connected_component_subgraphs() was removed in NetworkX 2.4, so count the components directly
    print(threshold, len(island), nx.number_connected_components(island))

0 32528 1113
4 1516 627
8 198 94
12 50 24
16 8 4
20 4 2
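
Each output row lists the weight threshold, the number of nodes left after trimming, and the number of connected components among them. To make that concrete, here is a small sketch of the same trimming on a hand-built graph. The graph `toy`, its edge weights, and the thresholds are invented for illustration and are not part of the dataset.

```python
import networkx as nx

def trim_edges(g, weight=1):
    """Keep only the edges whose weight is strictly above the threshold."""
    g2 = nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, **edata)
    return g2

# Hypothetical toy graph: a chain of six nodes with mixed edge weights.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ('a', 'b', 5), ('b', 'c', 4), ('c', 'd', 1), ('d', 'e', 3), ('e', 'f', 1),
])

rows = []
for threshold in (0, 2, 4):
    island = trim_edges(toy, threshold)
    # Nodes with no surviving edge are absent from the trimmed graph.
    rows.append((threshold, len(island), nx.number_connected_components(island)))

for row in rows:
    print(*row)
# prints:
# 0 6 1
# 2 5 2
# 4 2 1
```

Raising the threshold first splits the chain into two islands and then leaves only the single strongest pair. The shrinking node and component counts in the notebook output follow the same pattern.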
