# Sample Notebook for Using DRO Objects

This sample Jupyter notebook demonstrates a prototype format for storing 
humanities data, tentatively called [TSVDRO](http://github.com/jeddobson/tsvdro).

Jed Dobson (james.e.dobson@dartmouth.edu)<br>
Dartmouth College<br>
http://www.dartmouth.edu/~jed

In [1]:
# calculate distance between texts using DRO objects
import os
from tsvdro import tsvdro
import sklearn
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
def unpack_text(input_object):
    # rebuild a bag-of-words string by repeating each token by its stored count
    expanded_doc = list()
    for key in input_object['data']:
        expanded_doc.extend([key] * input_object['data'][key])
    return ' '.join(expanded_doc)
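
To make the expansion step concrete, here is a minimal sketch run on a toy object. The `toy_object` dictionary and its tokens are hypothetical; the only assumption carried over from the cell above is that `input_object['data']` maps each token to an integer count. The function body is an equivalent version of `unpack_text`, repeated so the example runs on its own.

```python
def unpack_text(input_object):
    # repeat each token as many times as its stored count
    expanded_doc = []
    for key in input_object['data']:
        expanded_doc.extend([key] * input_object['data'][key])
    return ' '.join(expanded_doc)

# Hypothetical object for illustration; real DRO objects also carry a 'header'.
toy_object = {'data': {'freedom': 2, 'river': 1}}

expanded = unpack_text(toy_object)
print(expanded)
```

The result is a bag of words: token frequencies are preserved, but the original word order of the text is not recoverable.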

In [3]:
# load every DRO file in the data directory
data_dir = "na-slave-narratives"
objects_archive = list()
for filename in os.listdir(data_dir):
    objects_archive.append(tsvdro.load(os.path.join(data_dir, filename)))

In [4]:
# expand counts
objects_archive_counts=list()
for i in objects_archive:
    objects_archive_counts.append(unpack_text(i))

In [5]:
vectorizer = CountVectorizer(strip_accents='unicode',stop_words='english',lowercase=True)
trans_matrix = vectorizer.fit_transform(objects_archive_counts)
trans_matrix = trans_matrix.toarray()

In [6]:
# calculate euclidean distance between each text
from sklearn.metrics.pairwise import euclidean_distances
euclidean_dist_matrix = euclidean_distances(trans_matrix)

In [7]:
# calculate cosine distance (1 - cosine similarity) between each text
from sklearn.metrics.pairwise import cosine_similarity
cosine_dist_matrix = 1 - cosine_similarity(trans_matrix)
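
The difference between the two measures is easier to see on a small case. The sketch below recomputes both distances with plain NumPy on a hypothetical count matrix (the values are invented for illustration, not taken from the corpus); it is not how scikit-learn implements them internally, only the same definitions written out.

```python
import numpy as np

# Hypothetical term counts for three short documents (rows) over three terms.
counts = np.array([
    [2, 1, 0],
    [4, 2, 0],   # same proportions as row 0, but twice as long
    [0, 1, 3],
], dtype=float)

# Euclidean distance: length of the difference between two count vectors.
diff = counts[:, np.newaxis, :] - counts[np.newaxis, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=2))

# Cosine distance: 1 minus the cosine of the angle between two count vectors.
norms = np.linalg.norm(counts, axis=1)
cosine = 1 - (counts @ counts.T) / np.outer(norms, norms)

print(np.round(euclidean, 3))
print(np.round(cosine, 3))
```

Rows 0 and 1 use the same words in the same proportions, so their cosine distance is effectively zero while their Euclidean distance is not. This is one reason cosine distance is often preferred when the texts being compared differ greatly in length.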

In [8]:
objects_archive[0]['header']['bibliographic_data']

{'author_name': 'William E. Hatcher',
 'file_uri': 'http://docsouth.unc.edu/full-text/na-slave-narratives/data/texts/church-hatcher-hatcher.txt',
 'pages': '',
 'publication_date': False,
 'publisher': '',
 'publisher_location': '',
 'title': 'John Jasper: The Unmatched Negro Philosopher and Preacher',
 'volumes': ''}

In [9]:
# display cosine distances from the first text, nearest first
from operator import itemgetter
for x,y in sorted(enumerate(np.round(cosine_dist_matrix[0],3)), key=itemgetter(1)):
    print("{0:.3f} {1}".format(y, objects_archive[x]['header']['bibliographic_data']['title']))

0.000 John Jasper: The Unmatched Negro Philosopher and Preacher
0.548 My Southern Home: or, The South and Its People
0.582 A True Story, Repeated Word for Word As I Heard It. From The Atlantic Monthly. Nov. 1874: 591-594
0.589 The Life of Rev. John Jasper, Pastor of Sixth Mt. Zion Baptist Church, Richmond, Va., from His Birth to the Present Time, with His Theory on the Rotation of the Sun
0.611 Uncle Johnson, the Pilgrim of Six Score Years
0.658 Reminiscences of Isaac and Sukey, Slaves of B. F. Moore, of Raleigh, N.C.
0.673 Before the War, and After the Union.  An Autobiography
0.678 Autobiography of Rev. Francis Frederick, of Virginia
0.679 Narrative of Sojourner Truth; a Bondswoman of Olden Time, Emancipated by the New York Legislature in the Early Part of the Present Century; with a History of Her Labors and Correspondence, Drawn from Her "Book of Life"
0.679 Narrative of Sojourner Truth; a Bondswoman of Olden Time, Emancipated by the New York Legislature in the Early Part of the Pr