# Structuring Unstructured Digitized Data

Problem: You would like to make a character network map from Shakespeare's *Hamlet*, similar to what Moretti does in his work. All you have is a raw text file of *Hamlet* that you have copied and pasted from Project Gutenberg. How do you go from raw text to a network map?

Solution: flex your Python muscles and bring together everything you know to structure the raw text (plus a new library, NetworkX!).

Outcome: combine the tools we already know (string methods, list methods, regular expressions, and Pandas) with one new library to create a very rough character network map of Shakespeare's *Hamlet*.

The interesting substantive question is [what constitutes a link between characters in a play](http://digitalhumanities.org/dhq/vol/11/2/000289/000289.html).

We will construct our (imperfect) network via the following method (from the linked article above):


>The network structure calculations were obtained by treating each speaking character as a node, and deeming two characters to be linked if there was at least one time slice of the play in which both were present (that is, if two characters spoke to each other or were in each other’s presence, then they have a link).  [[Stiller, Nettle, and Dunbar 2003, 399](https://link.springer.com/article/10.1007/s12110-003-1013-1)]

**This is not an active learning tutorial (sorry!).** Rather, it's a demonstration of my workflow as I face a new set of unstructured data, and an end goal of how I want it structured. 

## Learning Goals

* Think through how you can combine tools you already know to extract and structure the information you want from a raw, semi-structured text file.
    * Note the vast number of tools we use to do this relatively simple example. Getting comfortable with these basic tools is really important to do more sophisticated work! Practice, practice, practice.
* Review all the tools we have learned and how they work together to form one powerful ecosystem.
* Learn the basics of creating an adjacency matrix, and turning that matrix into a network graph using the Python library NetworkX.
    * **Careful!!** You now know just enough to be dangerous. **Proceed with caution.** If you think network analysis might be something you want to do in your own research **go take a class in it**. Lots of resources here to do excellent network analysis projects.
* Through this exercise, you should note how much cleaning happens, how many mistakes get through despite our best efforts, and how many decisions I as a researcher have to make. 
    * Note the importance of making explicit every decision you make. **Most authors unfortunately do not do this**. I hope with your generation we can change that.
    * Y'all, this is real. This is how research is being done right now.
* Learn strategies for breaking down a complicated task into smaller parts.
    * Proceed one step at a time, and print everything you do to check you're doing it correctly.
    * Note also the different types of computational essays you might create. You might create one that simply structures your data, and another to do the analyses, for example.

### OK, get ready!!!

In [1]:
#import our libraries
import pandas
import re
import networkx as nx
import numpy as np #needed to create a matrix for our network graph
import matplotlib.pyplot as plt

In [1]:
#read in our text file
with open('../data/hamlet.txt', 'r', encoding='utf-8') as f:
    hamlet = f.read()
hamlet[:1000]

'\nProject Gutenberg EBook of Hamlet, by William Shakespeare\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no\nrestrictions whatsoever.  You may copy it, give it away or re-use it\nunder the terms of the Project Gutenberg License included with this\neBook or online at www.gutenberg.org.  If you are not located in the\nUnited States, you’ll have to check the laws of the country where you\nare located before using this ebook.\n\n\n\nTitle: Hamlet\n\nAuthor: William Shakespeare\n\nRelease Date: November 1998 [EBook #1524]\n\nLast Updated: December 30, 2017\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\n*** START OF THIS PROJECT GUTENBERG EBOOK HAMLET ***\n\n\n\nThis etext was prepared by Dianne Bean.\n\n\nTHE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n\n\n\n\nby William Shakespeare\n\n\n\n\n\n\nContents\n\nACT I Scene I. Elsinore. A platform before the Castle. Scene II.\nElsinore. A room of state in 

Our first goal is to identify which characters were in each other's presence at any time throughout the play. We will approximate this by counting a positive tie if two characters had speaking lines in the same scene.
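
To make that tie definition concrete before we touch the real text, here is a minimal sketch on made-up data. The `toy_scenes` dictionary and its cast lists are invented for illustration; they are not taken from the play file.

```python
from itertools import combinations

# Hypothetical toy data: each scene maps to the set of characters who speak in it
toy_scenes = {
    "scene1": {"HAMLET", "HORATIO"},
    "scene2": {"HAMLET", "KING", "QUEEN"},
}

ties = set()
for speakers in toy_scenes.values():
    # sorted() fixes the order inside each pair, so (A, B) and (B, A) count as one tie
    for pair in combinations(sorted(speakers), 2):
        ties.add(pair)

print(sorted(ties))
```

HORATIO and KING never share a scene in this toy data, so they get no tie, even though both are tied to HAMLET.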

The text is semi-structured via ACTs and SCENEs. We'll use this semi-structure to split the full file into acts, and then scenes within each act, with the goal of creating a list of characters present in each scene.

First, use the string `.find` method to locate the ACTs:

In [None]:
hamlet.find("ACT I\n")

In [None]:
hamlet.find("ACT II\n")

In [None]:
hamlet.find("ACT III\n")

In [None]:
hamlet.find("ACT IV\n")

In [None]:
hamlet.find("ACT V\n")

In [None]:
hamlet.find("ACT VI\n")
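
That last call is a handy check. *Hamlet* has five acts, so I would expect no match, and `.find` reports a missing substring by returning `-1` rather than raising an error. The sketch below uses a made-up `toy_text` to show why that matters once the result is used as a slice boundary.

```python
# Hypothetical miniature text standing in for the play
toy_text = "ACT I\nsome lines\nACT II\nmore lines\n"

start = toy_text.find("ACT II\n")   # position of a heading that exists
end = toy_text.find("ACT VI\n")     # this heading is missing, so .find returns -1
print(start, end)

# Used as a slice end, -1 means "stop one character before the end",
# so the final character is silently dropped
clipped = toy_text[start:end]
print(repr(clipped))
```

Whenever you slice with the result of `.find`, it is worth printing the number first to confirm it is not `-1`.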

In [None]:
actI = hamlet[hamlet.find("ACT I\n") : hamlet.find("ACT II\n")]
actI

In [None]:
actII = hamlet[hamlet.find("ACT II\n") : hamlet.find("ACT III\n")]
actII

And so on. Do the next three acts.

In [None]:
actIII = hamlet[hamlet.find("ACT III\n") : hamlet.find("ACT IV\n")]
actIV = hamlet[hamlet.find("ACT IV\n") : hamlet.find("ACT V\n")]
actV = hamlet[hamlet.find("ACT V\n") : hamlet.find("End of the Project Gutenberg EBook")]
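
If typing five nearly identical slices feels repetitive, here is one possible alternative, sketched on a made-up miniature text. The names `toy_play` and `toy_acts` are mine, not part of the notebook, and the sketch assumes each act heading sits alone on its own line, as the `.find("ACT I\n")` calls above suggest.

```python
import re

# Hypothetical stand-in for the full play text
toy_play = (
    "Contents\n\n"
    "ACT I\nfirst act text\n"
    "ACT II\nsecond act text\n"
    "ACT III\nthird act text\n"
)

# Find every line that consists only of "ACT" plus a Roman numeral
headings = list(re.finditer(r"^ACT [IVX]+$", toy_play, flags=re.MULTILINE))

toy_acts = {}
for i, match in enumerate(headings):
    # Each act runs from its own heading to the next heading, or to the end of the text
    end = headings[i + 1].start() if i + 1 < len(headings) else len(toy_play)
    toy_acts[match.group()] = toy_play[match.start():end]

print(list(toy_acts))
```

On the real file, the final act would still need to be cut off at the Project Gutenberg footer, as we did by hand above.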

Next, split each ACT by SCENE, using regular expressions:

In [None]:
actI_scenes = re.findall("SCENE [A-Z]+", actI)
actI_scenes

In [None]:
#Write a for loop to split each act into scenes
for idx,scene in enumerate(actI_scenes):
    print(idx, scene)

In [None]:
actI_scenes.append("End of the Project Gutenberg EBook")
actI_scenes

In [None]:
#Split ACT I into a list of (label, scene text) tuples
actI_list = list()
for idx,scene in enumerate(actI_scenes):
    if idx <= (len(actI_scenes)-2):
        start = actI.find(actI_scenes[idx])
        end = actI.find(actI_scenes[idx+1])
        if end == -1: #end marker is not inside this act, so the last scene runs to the end
            end = len(actI)
        scene_text = actI[start:end]
        actI_list.append((("act1_scene"+str(idx+1)), scene_text))
actI_list

Next, identify the characters in each scene. We'll do this first with ACT I, SCENE 2, and then put it together and do all scenes.

In [None]:
act1_scene2 = actI_list[1][1]
act1_scene2

In [None]:
#use regular expressions to find all the characters in a scene
#the r prefix makes this a raw string, so Python leaves \w and \. for the regex engine
re.findall(r'\n\n([\w ]+)\.', act1_scene2)

In [None]:
#pull out the unique characters (we're ignoring number of lines spoken for now)
act1scene2_charaters = list(set(re.findall(r'\n\n([\w ]+)\.', act1_scene2)))
act1scene2_charaters
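
To see what this pattern is doing, here is a small sketch on an invented scene. The `toy_scene` text is my own approximation of the file's layout (a blank line, then a speaker name followed by a period), so treat the exact formatting as an assumption.

```python
import re

# Hypothetical scene text that imitates the layout the pattern relies on
toy_scene = (
    "SCENE II. A room.\n\n"
    "Enter King and Queen.\n\n"
    "KING.\nThough yet of Hamlet.\n\n"
    "QUEEN.\nGood Hamlet.\n\n"
    "KING.\nTake thy fair hour.\n"
)

# Blank line, then a run of word characters and spaces, then a literal period
speakers = re.findall(r"\n\n([\w ]+)\.", toy_scene)
print(speakers)

# set() removes repeated speakers, but the stage direction still slips through
print(set(speakers))
```

The pattern cannot tell a speaker heading from a stage direction that follows a blank line and ends in a period, which is exactly the kind of noise we clean up next.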

In [None]:
##Oops! There are some problems. Two characters listed in the same element, for example.
##Loop through and separate those elements
new_char_list = list()
for char in act1scene2_charaters:
    if 'Enter' in char:
        pass
    elif 'and' in char:
        new_char = char.split(' and ')
        print(new_char)

In [None]:
new_char_list = list()
for char in act1scene2_charaters:
    if 'Enter' in char:
        pass
    elif ' and ' in char: #check for the spaced word, not just the letters a-n-d
        new_char_list.extend(char.split(' and '))
    else:
        new_char_list.append(char)
new_char_list

In [None]:
#Put it all together with multiple for-loops and nested if/else statements
#create a master list of tuples, where the first element is the act/scene number
#the second element a list of unique characters that appear in that scene

acts = [actI, actII, actIII, actIV, actV]
scene_list = list()
for actidx, act in enumerate(acts):
    act_scenes = re.findall("SCENE [A-Z]+", act)
    act_scenes.append("End of the Project Gutenberg EBook")
    act_list = list()
    for idx,scene in enumerate(act_scenes):
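        if idx <= (len(act_scenes)-2):
            start = act.find(act_scenes[idx])
            end = act.find(act_scenes[idx+1])
            if end == -1: #end marker is not inside this act, so the last scene runs to the end
                end = len(act)
            scene_text = act[start:end]
            act_list.append((("act"+str(actidx+1)+"_scene"+str(idx+1)), scene_text))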
    act_list_char = list()
    for scene, text in act_list:
        act_list_char.append((scene, list(set(re.findall(r'\n\n([\w ]+)\.', text)))))
    for scene,character in act_list_char:
        new_char_list = list()
        for char in character:
            if 'Enter' in char:
                pass
            elif ' and ' in char: #check for the spaced word, not just the letters a-n-d
                new_char_list.extend(char.split(' and '))
            else:
                new_char_list.append(char)
        scene_list.append(tuple((scene,new_char_list)))
scene_list

# Part II: Network Graph

Great! We've put some useful structure to our unstructured text.

Next steps:

1. Create an adjacency matrix, counting the number of scenes in which each pair of characters appears together
2. Turn the adjacency matrix into a network object, to graph it and calculate network statistics

In [None]:
#create a list of all unique characters in the play. This will be our rows and columns
all_characters = list()
for key, value in scene_list:
    print(value)

In [None]:
all_characters = list()
for key, value in scene_list:
    all_characters.extend(value)
all_characters

In [None]:
unique_characters = list(set(all_characters))
unique_characters

In [None]:
#Roughly clean up some of the errors
unique_characters = [char for char in unique_characters if char.isupper()]
unique_characters
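
This filter leans on the fact that speaker names are printed in capitals. A quick sketch with made-up entries (`toy_names` is hypothetical) shows what `.isupper()` keeps and what it drops:

```python
# Hypothetical mix of real speaker names and leftover stage directions
toy_names = ["HAMLET", "Enter King", "FIRST PLAYER", "Exit", "GHOST"]

# .isupper() is True only when every cased letter is uppercase; spaces are ignored
kept = [name for name in toy_names if name.isupper()]
print(kept)
```

It is a rough heuristic: anything else printed in all capitals would pass too, so it is still worth reading the resulting list.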

In [None]:
len(unique_characters)

In [None]:
#create a zero square dataframe with characters as columns and rows
df = pandas.DataFrame(0, columns=unique_characters, index=unique_characters)
df

In [None]:
#reminder: how do we change values in a dataframe?
df_example = df.copy()
#pass the row and column labels to .loc together; chained [][] may update a temporary copy
df_example.loc['BARNARDO', 'QUEEN'] += 1
df_example

In [None]:
#set up a for loop to loop through the characters in the index
for char in df.index:
    print(char)

In [None]:
for scene, characters in scene_list:
    if 'GUILDENSTERN' in characters:
        print(scene, characters)

In [None]:
df_test = df.copy()
for scene, characters in scene_list:
    if 'GUILDENSTERN' in characters:
        for character in characters:
            if character in list(df_test.index):
                df_test.loc['GUILDENSTERN', character] += 1
df_test.loc["GUILDENSTERN"]

In [None]:
for character in df.index:
    for scene, characters in scene_list:
        if character in characters:
            #set() guards against a name listed twice in one scene being counted twice
            for char in set(characters):
                if char in df.index:
                    df.loc[character, char] += 1
df
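
The nested loops make every step visible, which is the point of this walkthrough. For comparison, the same counts can also be produced with a matrix product. This is a sketch on made-up data, and `toy_scene_list`, `incidence`, and `co_occurrence` are illustrative names of my own.

```python
import pandas as pd

# Hypothetical scene list in the same (label, characters) shape as scene_list
toy_scene_list = [
    ("act1_scene1", ["HAMLET", "HORATIO"]),
    ("act1_scene2", ["HAMLET", "KING", "QUEEN"]),
    ("act1_scene3", ["HAMLET", "HORATIO", "KING"]),
]
toy_characters = sorted({c for _, chars in toy_scene_list for c in chars})

# Rows are scenes, columns are characters; 1 means the character speaks in that scene
incidence = pd.DataFrame(
    [[int(c in chars) for c in toy_characters] for _, chars in toy_scene_list],
    index=[scene for scene, _ in toy_scene_list],
    columns=toy_characters,
)

# (characters x scenes) times (scenes x characters) counts shared scenes for every pair
co_occurrence = incidence.T.dot(incidence)
print(co_occurrence)
```

As with the loop, the diagonal holds each character's own scene count, so you may want to zero it out before graphing if you do not want self-ties.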

### Network Graphs

In [None]:
#.as_matrix() was removed from pandas; .to_numpy() returns the same NumPy array
df_matrix = df.to_numpy()
print(df_matrix)
print()
print(np.shape(df_matrix))

In [None]:
#Create our graph object
G = nx.to_networkx_graph(df_matrix,create_using=nx.DiGraph())
G
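
Because we hand NetworkX a bare matrix, the nodes come out numbered rather than named, which is why we relabel them in the next cells. One possible shortcut is `nx.from_pandas_adjacency`, which reads the labels straight from the dataframe. The sketch below uses a made-up `toy_df`. It builds an undirected graph, which seems a reasonable fit for symmetric co-presence, but that choice is my assumption and differs from the directed graph above.

```python
import networkx as nx
import pandas as pd

# Hypothetical adjacency matrix: shared-scene counts for three characters
toy_names = ["HAMLET", "HORATIO", "KING"]
toy_df = pd.DataFrame(
    [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 0]],
    index=toy_names,
    columns=toy_names,
)

# Index and column labels become node names; nonzero cells become weighted edges
toy_G = nx.from_pandas_adjacency(toy_df)
print(list(toy_G.nodes()))
print(list(toy_G.edges(data="weight")))
```

Zero cells produce no edge, so HORATIO and KING stay unconnected in this toy graph.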

In [None]:
#dictionary to label the character nodes
names = list(df.index) #list of character names

labels_names = {}

for n in range(0,np.shape(df_matrix)[0]):
    labels_names[n] = names[n]
labels_names

In [None]:
#How many nodes? It should equal the number of unique characters in our dataset
G.number_of_nodes()

In [None]:
#rename the nodes in our graph object, replacing the numbers with character names
nx.relabel_nodes(G, labels_names,copy=False)

In [None]:
plt.figure(figsize=(10,10))

nx.draw(G,
    with_labels = True,
    node_color = 'black',
    node_size = 50,
    edge_color = 'grey',
    linewidths = 0,
    width = 0.1,
    font_size = 16
    )

plt.show()

Let's look at some statistics.

In [None]:
betweeness_names = nx.betweenness_centrality(G, seed = 123)
betweeness_names

In [None]:
#sort by value to find the most central character
sorted(betweeness_names, key=betweeness_names.get, reverse=True)
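
The sorted list above shows names only. If you also want to see the scores, one option is to sort the dictionary's items instead. The `toy_scores` values below are invented for illustration, not results from the play.

```python
# Hypothetical centrality scores
toy_scores = {"HAMLET": 0.25, "HORATIO": 0.5, "OSRIC": 0.0}

# Sort (name, score) pairs by the score, largest first
ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:10s} {score:.3f}")
```

Seeing the numbers helps you judge whether the top character is far ahead or barely ahead of the next one.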

Horatio is the most important go-between character for Hamlet. Does this surprise anyone?

In [None]:
eigenvector_names = nx.eigenvector_centrality(G)

#sort by value to find the most central character
sorted(eigenvector_names, key=eigenvector_names.get, reverse=True)

Ah, a much different story!
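
Why can the two measures disagree? A toy graph helps build intuition. In this sketch (made-up nodes, not characters from the play), two triangles are joined by a single bridge node `x`. Every shortest path between the triangles runs through `x`, yet `x` has fewer neighbors than the nodes it connects.

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # first triangle
    ("d", "e"), ("d", "f"), ("e", "f"),   # second triangle
    ("c", "x"), ("x", "d"),               # x bridges the two triangles
])

btw = nx.betweenness_centrality(toy)
eig = nx.eigenvector_centrality(toy, max_iter=1000)

# Betweenness rewards sitting on shortest paths between other nodes
top_betweenness = max(btw, key=btw.get)
# Eigenvector centrality rewards being connected to well-connected nodes
top_two_eigenvector = sorted(eig, key=eig.get, reverse=True)[:2]

print(top_betweenness)
print(top_two_eigenvector)
```

Here the bridge wins on betweenness while its two neighbors, `c` and `d`, win on eigenvector centrality. Neither ranking is the "right" one; they answer different questions.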

### Final Thoughts

* Tom Stoppard is onto something, for sure. But what? What is this capturing? Don't deny the power of the little people? Peripheral characters might not be so peripheral? 

* Also, be careful with your centrality measures! Make sure you're measuring what you want to be measuring **(i.e., go take a class and learn the math behind this. You have just enough skills to be dangerous right now!).**

* How else might we measure a tie? If you want a challenge, repeat this exercise but measure ties differently. Do you get different graph statistics? 

* What else might we do with this?

### This is what structuring unstructured digitized data means. Questions? Reactions? What stands out to you from the exercise today?

When you're satisfied, change the last name on this file, save it, and upload it to Blackboard to get credit for your attendance today.