# Building a Graph for Your Data

This notebook contains all functions necessary to build a graph for your data. 

The graph-building process has four steps:

* Node creation
* Node merging
* Miscellaneous data creation
* Edge creation

See the **[FAQ](#FAQ)** for general questions and the code sections under **[Process Breakdown](#Process_Breakdown)** for details on each step, or skip to the **[Main](#Main)** section at the bottom to run everything.

**Required Input:** 

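*full_roles_profs.csv*, with the following columns, named as they are read in `create_nodes_list` (that function opens the file as `people.csv`, so the file name and the path in the code must agree):
* name
* roles
* profession
* family
* p index
* date name
* processed date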

If data for a particular column or row is not available, it should be filled in with the default value `None`.
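
Before running the notebook, it can help to compare the header of the input file with the columns the code reads. The sketch below is illustrative only: `missing_columns`, `EXPECTED_COLUMNS`, and the sample data are hypothetical names introduced here, and the column list is taken from the keys that `create_nodes_list` looks up in each row.

```python
import csv
import io

# Columns that create_nodes_list reads from each row (assumed from the code
# in the Node Creation section; adjust if your file uses other headers).
EXPECTED_COLUMNS = ['name', 'roles', 'profession', 'family',
                    'p index', 'date name', 'processed date']


def missing_columns(csvfile, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csvfile)
    header = reader.fieldnames or []
    return [column for column in expected if column not in header]


# Hypothetical sample standing in for the real input file.
sample = io.StringIO(
    "name,roles,profession,family,p index,date name\n"
    "A,['source'],scribe,None,P1,None\n"
)
print(missing_columns(sample))  # lists the columns that still need to be added
```

With a real file, the same function could be called on the object returned by `open(...)`.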

**Output:**

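*new_nodes.csv*, with the following columns (as written by `create_new_nodes_list`):
* id
* name
* role
* profession
* processed year
* family
* p_index
* date name
* maxYear
* minYear
* Max Gap Start Year
* Max Gap End Year
* Pre-Gap Instance Count
* Post-Gap Instance Count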

*new_edges.csv*, with the following columns:
* id
* source
* target
* p_index
* year
* type
* source's role
* target's role

# FAQ
<a id='FAQ'></a>

**What is a node?**

A node is an instance of the `Node` class and holds the information about one person in the graph: the person's `name`, `role`, `profession`, `family`, `p_index` (the texts the person is involved in), the `year`s associated with those texts, the `processed` dates, and an arbitrary `id`.

**How are nodes created?**

Nodes are created by iterating through the input file *full_roles_profs.csv* row by row to gather the necessary information. Each row creates exactly one node, so at this stage several nodes may refer to the same person.
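
To illustrate why unmerged nodes are not yet unique people, here is a minimal sketch that groups rows by `(name, profession)` the same way `person_to_info` does. The sample rows and the variable `rows_by_person` are hypothetical, and plain row dictionaries stand in for `Node` objects.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: person A appears in two different texts.
sample = io.StringIO(
    "name,roles,profession,p index\n"
    "A,['source'],scribe,P1\n"
    "A,['recipient'],scribe,P2\n"
    "B,['recipient'],merchant,P1\n"
)

# One entry per row, grouped by (name, profession).
rows_by_person = defaultdict(list)
for row in csv.DictReader(sample):
    rows_by_person[(row['name'], row['profession'])].append(row)

# A person with more than one entry is a candidate for merging.
for person, rows in rows_by_person.items():
    print(person, len(rows))
```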

**How are nodes merged?**

The goal here is to merge the previously created nodes so that each node corresponds to a unique person.

At this stage, "uniqueness" is defined by name and profession; that is, after merging, no two nodes share the same combination of name and profession.

To accomplish this, a mapping of `{(name, profession): [nodes associated]}` is created. Each list of associated nodes then has its values merged; the merged nodes will then have lists of `role`s, `p_index`s, and `year`s instead of single values. The `i`th index in each list corresponds to the same previously-unmerged node; the order of information is retained.
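
The mapping and the parallel lists can be sketched as follows. This is a simplified illustration, not the notebook's implementation: the records are hypothetical tuples rather than `Node` objects, and, unlike `merge_nodes`, the sketch also keeps people who appear only once.

```python
from collections import defaultdict

# Hypothetical unmerged records: (name, profession, role, p_index, year).
records = [
    ("A", "scribe", "source", "P1", "10"),
    ("B", "merchant", "recipient", "P1", "10"),
    ("A", "scribe", "recipient", "P2", "12"),
]

# Step 1: map (name, profession) -> list of entries for that person.
groups = defaultdict(list)
for name, profession, role, p_index, year in records:
    groups[(name, profession)].append((role, p_index, year))

# Step 2: turn each group into parallel lists; index i in every list
# refers to the same original record, and the input order is retained.
merged = {}
for person, entries in groups.items():
    roles, p_indexes, years = (list(column) for column in zip(*entries))
    merged[person] = {"role": roles, "p_index": p_indexes, "year": years}

print(merged[("A", "scribe")])
```

Because the lists stay aligned, `merged[person]["year"][i]` is the year of the text `merged[person]["p_index"][i]`.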

We are currently investigating other attributes that could be used to merge nodes.

**How can we count how many nodes were merged?**

Inside the `merge_nodes` function, a counter is incremented once for every person whose nodes are merged, that is, for every `(name, profession)` pair with more than one node. The function returns this counter, and `main` prints it.
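
In the `merge_nodes` code, the counter goes up once per `(name, profession)` key that has more than one node. The sketch below shows that count next to two related figures, including a per-person count of the kind mentioned as planned work. All names here (`person_to_nodes`, `merged_from`, `people_merged`, `nodes_absorbed`) are hypothetical, and strings stand in for `Node` objects.

```python
# Hypothetical grouping: (name, profession) -> unmerged nodes (placeholders).
person_to_nodes = {
    ("A", "scribe"): ["n1", "n2", "n3"],
    ("B", "merchant"): ["n4"],
    ("C", "scribe"): ["n5", "n6"],
}

# How many unmerged nodes each final person was built from.
merged_from = {person: len(nodes) for person, nodes in person_to_nodes.items()}

# People whose entry combines more than one node (one increment per person).
people_merged = sum(1 for count in merged_from.values() if count > 1)

# Total number of unmerged nodes absorbed into those people.
nodes_absorbed = sum(count for count in merged_from.values() if count > 1)

print(people_merged, nodes_absorbed)
```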

We are currently working to add an attribute to each node describing how many nodes were combined to make that final node for the unique person.

**How are edges weighted?**

Currently, edges have no weight assigned to them. In the future, we will decide how to weight them for a more accurate network.

# Process Breakdown
<a id='Process_Breakdown'></a>

In [78]:
import csv
from collections import defaultdict

### Node Creation

In [79]:
# key: (person, profession), value: list of Nodes
person_to_info = defaultdict(list)

In [80]:
# container to store row data
class Node:
    def __init__(self, name, role, profession, family, p_index, year, processed):
        self.name = name
        self.role = role
        self.profession = profession
        self.family = family
        self.p_index = p_index
        self.year = year
        self.processed = processed
        self.id = None

    def add_id(self, id):
        self.id = id


# reads the input CSV row by row and creates one Node per row
def create_nodes_list():
    with open('people.csv', 'rt', newline='', encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            name = row['name']
            role = row['roles']
            profession = row['profession']
            family = row['family']
            p_index = row['p index']
            year = row['date name']
            processed = row['processed date']
            person = (name, profession)
            node = Node(name, role, profession, family, p_index, year, processed)
            person_to_info[person].append(node)

### Node Merging

In [81]:
# key: p_index, value: list of (name of person, role in this p_index)
p_indexes_to_people = defaultdict(list)
# key: (person, profession), value: Node
new_person_to_info = {}

In [82]:
# traverses through all the people found so far; if a person has more than one node, those nodes are merged
# (people with a single node are not added to new_person_to_info)
def merge_nodes():
    number_of_nodes_merged = 0
    for key, value in sorted(person_to_info.items(), key = lambda x:x[0][0]):
        if len(value) > 1:
            merge_nodes_helper(key, value)
            number_of_nodes_merged += 1
    return number_of_nodes_merged


# the ith role, p_index, year, and family all come from the same unmerged node;
# processed dates skip empty values, so that list may be shorter than the others
def merge_nodes_helper(person, list_of_nodes):
    roles = []
    p_indexes = []
    years = []
    family = []
    processed = []

    for node in list_of_nodes:
        roles.append(node.role)
        p_indexes.append(node.p_index)
        years.append(node.year)
        family.append(node.family)
        if node.processed != '':
            processed.append(node.processed)
    node = Node(person[0], roles, person[1], family, p_indexes, years, processed)
    new_person_to_info[person] = node

In [83]:
### OPEN new_nodes.csv TO WRITE ###
# writes the new persons (merged nodes) into a nodes list
def create_new_nodes_list():
    # newline='' is what the csv module expects for files it writes to
    with open('new_nodes.csv', 'w', newline='', encoding="utf-8") as csvfile:
        fieldnames = ['id', 'name', 'role', 'profession', 'processed year', 'family', 'p_index', 'date name', 'maxYear', 'minYear', 'Max Gap Start Year', 'Max Gap End Year', 'Pre-Gap Instance Count', 'Post-Gap Instance Count']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        curr_id = 1

        for person, node in new_person_to_info.items():
            name = person[0]
            if isinstance(node, list):
                node = node[0]

            role = node.role
            profession = node.profession
            family = node.family
            p_index = node.p_index
            year = node.year
            
            # if there is no processed year data, set maxYear and minYear to 0;
            # these nodes will be ignored in the final timeline visualization
            processed = node.processed
            if len(processed) > 0:
                # the values are strings read from the CSV, so this comparison is lexicographic
                maxYear = max(processed)
                minYear = min(processed)
            else:
                maxYear = 0
                minYear = 0

            maxDiff = 0
            gStart = 0  # start year of largest gap between transactions for each node
            gEnd = 0  # end year of largest gap between transactions for each node
            pre_GapInstanceCount = 0
            post_GapInstanceCount = 0
            if maxYear == minYear:
                gStart = maxYear
                gEnd = maxYear
            # walk through consecutive (earlier, later) pairs in their original order
            for index, (earlier, later) in enumerate(zip(processed, processed[1:])):
                currDiff = float(later) - float(earlier)
                if currDiff > maxDiff:
                    maxDiff = currDiff
                    gStart = earlier  # the gap starts at the earlier entry
                    gEnd = later  # and ends at the later one
                    post_GapInstanceCount = len(processed[index + 1:])
                    pre_GapInstanceCount = len(processed[:index + 1])

            node.add_id(curr_id)

            writer.writerow({
                'id': curr_id,
                'name': name,
                'role': role,
                'profession': profession,
                'family': family,
                'p_index': p_index,
                'maxYear': maxYear,
                'minYear': minYear,
                'Max Gap Start Year': gStart,
                'Max Gap End Year': gEnd,
                'Post-Gap Instance Count': post_GapInstanceCount,
                'Pre-Gap Instance Count': pre_GapInstanceCount,
                'date name': year,
                'processed year': processed
                })
            curr_id += 1

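
The gap columns are easier to reason about when the calculation is isolated. The following is a minimal sketch, not the notebook's code: `largest_gap` is a hypothetical helper, and it assumes the processed years are in chronological order and can be converted with `float`.

```python
def largest_gap(years):
    """Return (gap_start, gap_end, pre_count, post_count) for the widest gap
    between consecutive entries of `years`, or None if no gap exists."""
    values = [float(year) for year in years]
    best = None
    max_diff = 0.0
    for index, (earlier, later) in enumerate(zip(values, values[1:])):
        diff = later - earlier
        if diff > max_diff:
            max_diff = diff
            # entries up to and including `earlier` come before the gap
            best = (earlier, later, index + 1, len(values) - index - 1)
    return best


print(largest_gap(["100", "101", "110", "112"]))  # (101.0, 110.0, 2, 2)
```

Here the widest gap lies between 101 and 110, with two entries before it and two after it.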

### Miscellaneous Data Creation 

In [84]:
# traverses through new_person_to_info to fill out p_indexes_to_people (a dict of p_indexes to people in the p_indexes)
def fill_new_person_to_info():
    for person, node in new_person_to_info.items():
        roles = node.role
        p_indexes = node.p_index
        years = node.year

        for i in range(len(roles)):
            curr_p_index = p_indexes[i]
            curr_role = roles[i]
            container = (person, curr_role, i)
            p_indexes_to_people[curr_p_index].append(container)


# collects the texts that have exactly one participant
def find_single_participant_texts():
    list_of_single_participant_texts = []
    for key, value in p_indexes_to_people.items():
        if len(value) == 1:
            list_of_single_participant_texts.append(key)
    return list_of_single_participant_texts


# finds p_indexes where 5 or more people are involved in the transaction
def find_multiple_transactions():
    with open('multiple_transactions.csv', 'w', newline='') as csvfile:
        fieldnames = ['p_index', 'people_involved']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for key, value in p_indexes_to_people.items():
            if len(value) >= 5:
                p_index = key
                people_involved = value
                
                writer.writerow({
                'p_index': p_index, 
                'people_involved': people_involved
                })
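
To get an overview of how many participants each text has, the same mapping can be summarized with a counter. This is a small sketch on hypothetical data: the dictionary below only mimics the shape of `p_indexes_to_people` (`p_index` mapped to `(person, role, index)` entries), and `size_histogram` and `single` are names introduced here.

```python
from collections import Counter

# Hypothetical mapping: p_index -> list of (person, role, index) entries.
p_indexes_to_people = {
    "P1": [(("A", "scribe"), "['source']", 0)],
    "P2": [(("A", "scribe"), "['source']", 1),
           (("B", "merchant"), "['recipient']", 0)],
    "P3": [(("C", "scribe"), "['recipient']", 0)],
}

# Number of texts for each participant count.
size_histogram = Counter(len(people) for people in p_indexes_to_people.values())

# Texts with exactly one participant (these cannot produce an edge).
single = sorted(key for key, people in p_indexes_to_people.items() if len(people) == 1)

print(size_histogram[1], size_histogram[2], single)
```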

### Edge Creation

In [85]:
# goal: the edge file should also cover rows such as source -> recipient or source -> intermediary
### Source -> recipient, source -> intermediary -> recipient, source -> representative -> recipient
### OPEN new_edges.csv TO WRITE ###
def create_edge_list():
    with open('new_edges.csv', 'w', newline='') as csvfile:
        fieldnames = ['id', 'source', 'target', 'p_index', 'year', 'type', 'source\'s role', 'target\'s role']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        curr_id = 1
        count = 0 # can move this around in the control flow to count how many transactions are of a certain type
        for key, value in p_indexes_to_people.items():
            # removing multiple transactions: this thins it down from 11384 edges to 9069 edges
            # (texts with 4 participants pass this filter, but only the 2- and 3-participant cases below create edges)
            if 1 < len(value) < 5:
                ### FOR NOW JUST CREATE EDGES WITH WHAT YOU KNOW
                ### 1. source -> recipient (Done)
                ### 2. source -> interm -> recipient (Done)
                ### 3. source -> recipient, source -> interm, interm -> recipient (Done)

                # only looking at transactions where there are only 2 people involved
                # there are 5912 transactions where there are only 2 people involved
                if len(value) == 2:
                    # each entry is a (person, role, index) tuple; the entries are used
                    # directly so that two participants with the same role are not
                    # collapsed into one (which would create a self-loop)
                    first, second = value
                    roles = [first[1], second[1]]

                    # there are 5176 transactions between "source" -> some other person("recipient", "intermediary", "representative", etc.)
                    # thus this creates 5176 edges
                    if "['source']" in roles:
                        entry1, entry2 = (first, second) if first[1] == "['source']" else (second, first)
                        edge_type = "Directed"

                    # there are 456 transactions between other person -> "recipient"
                    # thus this creates 456 edges
                    # sometimes both participants are recipients, in which case the direction is chosen arbitrarily
                    elif "['recipient']" in roles:
                        entry1, entry2 = (second, first) if first[1] == "['recipient']" else (first, second)
                        edge_type = "Directed"

                    # there are 280 transactions where there is neither a source nor a recipient
                    # thus this creates 280 edges
                    else:
                        entry1, entry2 = first, second
                        edge_type = "Undirected"

                    person1, role1 = entry1[0], entry1[1]
                    node1 = new_person_to_info[person1]
                    id1 = node1.id

                    person2, role2 = entry2[0], entry2[1]
                    node2 = new_person_to_info[person2]
                    id2 = node2.id

                    edge_id = curr_id
                    curr_id += 1
                    p_index = key
                    index_of_year = node1.p_index.index(p_index)
                    year = node1.year[index_of_year]

                    writer.writerow({
                        'id': edge_id,
                        'source': id1,
                        'target': id2,
                        'p_index': p_index,
                        'year': year,
                        'type': edge_type,
                        'source\'s role': role1,
                        'target\'s role': role2
                        })

                ### Some common structures:
                ### - source, recipient1, recipient2: source -> recipient1, source -> recipient2
                ### - source, something, recipient: source -> something, something -> recipient, source -> recipient
                ### - if neither of these structures, just make triangular undirected edges
                if len(value) == 3:
                    list_of_roles = []
                    role_to_node_list = defaultdict(list)
                    node1 = None
                    node2 = None
                    node3 = None

                    for node in value:
                        list_of_roles.append(node[1])
                        role_to_node_list[node[1]].append(node)

                    if "['source']" in list_of_roles and "['recipient']" in list_of_roles:
                        # source, recipient1, recipient2: source -> recipient1, source -> recipient2 (creates 2 edges)
                        # there are 196 transactions of this structure
                        # thus this creates 392 edges
                        if list_of_roles.count("['recipient']") == 2:
                            role1 = "['source']"
                            person1 = role_to_node_list[role1][0][0]
                            node1 = new_person_to_info[person1]
                            id1 = node1.id

                            role2 = "['recipient']"
                            person2 = role_to_node_list[role2][0][0]
                            node2 = new_person_to_info[person2]
                            id2 = node2.id

                            role3 = "['recipient']"
                            person3 = role_to_node_list[role2][1][0]
                            node3 = new_person_to_info[person3]
                            id3 = node3.id

                            edge_type = "Directed"
                            edge_id = curr_id
                            curr_id += 1
                            p_index = key
                            index_of_year = node1.p_index.index(p_index)
                            year = node1.year[index_of_year]

                            writer.writerow({
                                'id': edge_id,
                                'source': id1,
                                'target': id2,
                                'p_index': p_index,
                                'year': year,
                                'type': edge_type,
                                'source\'s role': role1,
                                'target\'s role': role2
                                })
                            edge_id = curr_id
                            curr_id += 1
                            writer.writerow({
                                'id': edge_id,
                                'source': id1,
                                'target': id3,
                                'p_index': p_index,
                                'year': year,
                                'type': edge_type,
                                'source\'s role': role1,
                                'target\'s role': role3
                                })

                        # source, something, recipient: source -> something, something -> recipient, source -> recipient (creates 3 edges)
                        # there are 208 transactions of this structure
                        # thus this creates 624 edges
                        else:
                            role1 = "['source']"
                            person1 = role_to_node_list[role1][0][0]
                            node1 = new_person_to_info[person1]
                            id1 = node1.id

                            list_of_roles.remove("['source']")
                            list_of_roles.remove("['recipient']")

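                            role2 = list_of_roles[0]
                            # take the last entry for this role so that a second "source"
                            # is not confused with the first one
                            person2 = role_to_node_list[role2][-1][0]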
                            node2 = new_person_to_info[person2]
                            id2 = node2.id

                            role3 = "['recipient']"
                            person3 = role_to_node_list[role3][0][0]
                            node3 = new_person_to_info[person3]
                            id3 = node3.id

                            edge_type = "Directed"
                            edge_id = curr_id
                            curr_id += 1
                            p_index = key
                            index_of_year = node1.p_index.index(p_index)
                            year = node1.year[index_of_year]

                            writer.writerow({
                                'id': edge_id,
                                'source': id1,
                                'target': id2,
                                'p_index': p_index,
                                'year': year,
                                'type': edge_type,
                                'source\'s role': role1,
                                'target\'s role': role2
                                })
                            edge_id = curr_id
                            curr_id += 1
                            writer.writerow({
                                'id': edge_id,
                                'source': id1,
                                'target': id3,
                                'p_index': p_index,
                                'year': year,
                                'type': edge_type,
                                'source\'s role': role1,
                                'target\'s role': role3
                                })
                            edge_id = curr_id
                            curr_id += 1
                            writer.writerow({
                                'id': edge_id,
                                'source': id2,
                                'target': id3,
                                'p_index': p_index,
                                'year': year,
                                'type': edge_type,
                                'source\'s role': role2,
                                'target\'s role': role3
                                })

                    # if neither of these structures, just make triangular undirected edges (creates 3 edges)
                    # there are 685 transactions of this structure
                    # thus this creates 2055 edges
                    else:
                        count += 1
                        list_of_roles = []
                        node_list = []
                        for node in value:
                            list_of_roles.append(node[1])
                            node_list.append(node)

                        role1 = list_of_roles[0]
                        person1 = node_list[0][0]
                        node1 = new_person_to_info[person1]
                        id1 = node1.id

                        role2 = list_of_roles[1]
                        person2 = node_list[1][0]
                        node2 = new_person_to_info[person2]
                        id2 = node2.id

                        role3 = list_of_roles[2]
                        person3 = node_list[2][0]
                        node3 = new_person_to_info[person3]
                        id3 = node3.id

                        edge_type = "Undirected"
                        edge_id = curr_id
                        curr_id += 1
                        p_index = key
                        index_of_year = node1.p_index.index(p_index)
                        year = node1.year[index_of_year]

                        writer.writerow({
                            'id': edge_id,
                            'source': id1,
                            'target': id2,
                            'p_index': p_index,
                            'year': year,
                            'type': edge_type,
                            'source\'s role': role1,
                            'target\'s role': role2
                            })
                        edge_id = curr_id
                        curr_id += 1
                        writer.writerow({
                            'id': edge_id,
                            'source': id1,
                            'target': id3,
                            'p_index': p_index,
                            'year': year,
                            'type': edge_type,
                            'source\'s role': role1,
                            'target\'s role': role3
                            })
                        edge_id = curr_id
                        curr_id += 1
                        writer.writerow({
                            'id': edge_id,
                            'source': id2,
                            'target': id3,
                            'p_index': p_index,
                            'year': year,
                            'type': edge_type,
                            'source\'s role': role2,
                            'target\'s role': role3
                            })

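
The "triangular undirected edges" case writes one edge per pair of participants, which can be expressed compactly with `itertools.combinations`. The sketch below is one possible way to factor out that repetition, not part of the notebook: `triangle_edges` and the `(node_id, role)` input shape are hypothetical, and the dictionary keys follow the `fieldnames` of *new_edges.csv*.

```python
from itertools import combinations


def triangle_edges(participants, p_index, year, first_id=1):
    """Build one undirected edge per pair of participants.

    `participants` is a list of (node_id, role) pairs (hypothetical shape).
    """
    edges = []
    pairs = combinations(participants, 2)
    for offset, ((id_a, role_a), (id_b, role_b)) in enumerate(pairs):
        edges.append({
            'id': first_id + offset,
            'source': id_a,
            'target': id_b,
            'p_index': p_index,
            'year': year,
            'type': 'Undirected',
            "source's role": role_a,
            "target's role": role_b,
        })
    return edges


edges = triangle_edges([(1, 'witness'), (2, 'scribe'), (3, 'witness')], 'P1', '10')
print([(edge['source'], edge['target']) for edge in edges])  # [(1, 2), (1, 3), (2, 3)]
```

Each returned dictionary could be passed to `writer.writerow`, and the pairs come out in the same order as the three `writerow` calls in that case (1-2, 1-3, 2-3).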

# Main 
<a id='Main'></a>

In [86]:
def main():
    # reading the input CSV to create a node per row
    create_nodes_list()

    # traversing through all the newly created persons; if a person has more than 1 node, merge those nodes
    number_of_nodes_merged = merge_nodes()
    # writing the new persons (merged nodes) into a nodes list
    create_new_nodes_list()

    # traversing through new_person_to_info to fill out p_indexes_to_people (a dict of p_indexes to people in the p_indexes)
    fill_new_person_to_info()
    
    # creates the edge list from new_person_to_info
    create_edge_list()

    # ##### DISPLAYING OUTPUT #####
    # displaying the amount of nodes merged
    print("AMOUNT OF NODES MERGED: ", number_of_nodes_merged)

    # acquiring the list of texts that have only one participant
    # uncomment below to retrieve the texts with a single participant
    # list_of_single_participant_texts = find_single_participant_texts()
    # print("LIST OF SINGLE PARTICIPANT TEXTS: ", list_of_single_participant_texts)
    # ##### END DISPLAYING OUTPUT #####


if __name__ == "__main__":
    main()

AMOUNT OF NODES MERGED:  1942
