# 2.1 Tidy data

In this notebook, we take a closer look at the British Library (BL) books dataset with a view to tidying it up.

In [1]:
# imports

import pandas as pd
import json, os, codecs
from collections import defaultdict, OrderedDict

## Import the dataset
Let us import the sample dataset into memory as it is, without transformations. We rely on some Python libraries to do so.

In [2]:
root_folder = "../data/bl_books/sample/"

# metadata
filename = "book_data_sample.json"
with open(os.path.join(root_folder, filename), encoding="utf8") as infile:  # close the file once loaded
    metadata = json.load(infile)

# fulltexts
foldername = "full_texts"
texts = defaultdict(list)
for root, dirs, files in os.walk(os.path.join(root_folder,foldername)):
    for f in files:
        if f.endswith(".json"):  # match the file extension, not any name containing ".json"
            with codecs.open(os.path.join(root, f), encoding="utf8") as infile:
                texts[f] = json.load(infile)

# enriched metadata
filename = "MicrosoftBooks_filtered_list_sample.csv"
df_extra = pd.read_csv(os.path.join(root_folder,filename), delimiter=";")
df_extra = df_extra.rename(str.lower, axis='columns') # rename columns to lower case

## Take a look
Let's take a look at the dataset.

In [3]:
print(len(metadata))

452


In [4]:
metadata[0]

{'datefield': '1841',
 'shelfmarks': ['British Library HMNTS 11601.ddd.2.'],
 'publisher': 'Privately printed',
 'title': ["The Poetical Aviary, with a bird's-eye view of the English poets. [The preface signed: A. A.] Ms. notes"],
 'edition': '',
 'flickr_url_to_book_images': 'http://www.flickr.com/photos/britishlibrary/tags/sysnum000000196',
 'place': 'Calcutta',
 'issuance': 'monographic',
 'authors': {'creator': ['A. A.']},
 'date': '1841',
 'pdf': {'1': 'lsidyv35c55757'},
 'identifier': '000000196',
 'corporate': {},
 'fulltext_filename': 'sample/full_texts/000000196_01_text.json'}

Questions on 'metadata':

* Can you identify some messy aspects of this dataset representation?
* Take a look at the 'shelfmarks' or 'title' fields: what is the problem here?
* Do the same for the 'authors' and 'pdf' fields: what is the problem?
* Look at the datefield of the *third* item in this list: is there a problem?
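
One way to approach these questions is to inspect the Python type of every field in a record. The following is a minimal sketch, assuming a record shaped like the one printed above: the `record` literal is a hand-copied subset of that output, and `field_types` is a hypothetical helper, not part of the notebook.

```python
# Hand-copied subset of the first record printed above (illustration only).
record = {
    "datefield": "1841",
    "shelfmarks": ["British Library HMNTS 11601.ddd.2."],
    "publisher": "Privately printed",
    "authors": {"creator": ["A. A."]},
    "pdf": {"1": "lsidyv35c55757"},
    "identifier": "000000196",
    "corporate": {},
}


def field_types(rec):
    """Map every field name to the name of its Python type."""
    return {key: type(value).__name__ for key, value in rec.items()}


types = field_types(record)
# Fields that are not plain strings hold nested values (lists or dicts).
nested = sorted(key for key, name in types.items() if name != "str")
print(nested)
```

Fields that come back as `list` or `dict` pack several values into one cell, which is what a tidy representation tries to avoid.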

In [5]:
print(len(texts))

452


*Note: we have selected for the sample just the first volume/pdf for every book.*

In [6]:
texts['000000196_01_text.json'][:9]

[[1, ''],
 [2, ''],
 [3, ''],
 [4, ''],
 [5, ''],
 [6, ''],
 [7,
  "THE POETICAL AVIARY, WITH A B I R D'S-E YE VIEW OF THE ENGLISH POETS. (NOT PUBLISHED.) CALCUTTA: PRINTED AT THE BAPTIST MISSION PRESS, CIRCULAR ROAD. 1841."],
 [8, ''],
 [9,
  'POETICAL AVIARY. PART THE FIRST. BIRDS WITHOUT ALLUSION TO THEIR NOTES. One of the curious political medals that were struck in the reign of Charles II. represents, on one side, Titus Oates with two faces. On the reverse are the heads of the king and four of his principal ministers, with this motto round the border, " Birds of a feather flock together." This, as well as various other proverbs derived from birds, haee been introduced into poetry. Thus Anstey — And \'twas pretty to see how like birds of a feather The people of quality flocked all together, All pressing, addressing, caressing, and fond, Just the same as those animals do in a pond. Under the sign of an inn representing a man with a bird in his hand, and two birds in a bush I have se

In [7]:
df_extra[df_extra["first_pdf"] == "lsidyv35c55757"]

Unnamed: 0,aleph system no.,country code,language code (008),language code (041),ddc,personal author,corporate author,title,edition,imprint,series,subjects,other personal authors,other corporate authors,dom id,type,genre,first_pdf
234,14846757,|||,eng,,,A. A.,,"The Poetical Aviary, with a bird's-eye view of...",,"Calcutta : Privately printed, 1841.",,,,,lsidyv35c55757,poet,Poetry,lsidyv35c55757


Question: explore this data frame and find examples of messy aspects.
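
A possible starting point for this exploration is to measure how much of each column is missing and to look for placeholder values. The sketch below runs on a toy frame, not on `df_extra` itself: the column names come from the table above, but the second row (including the code `"enk"`) is made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_extra; the second row is invented for illustration.
toy = pd.DataFrame({
    "country code": ["|||", "enk"],   # "|||" looks like a placeholder, not a real code
    "ddc": [None, None],              # a column with no values at all
    "imprint": ["Calcutta : Privately printed, 1841.", None],  # place, publisher and date in one cell
})

# Share of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = toy.isna().mean()
# Number of rows that use the "|||" placeholder.
placeholder_rows = int(toy["country code"].eq("|||").sum())

print(missing_share)
print(placeholder_rows)
```

The same two calls, `isna().mean()` and `eq(...).sum()`, can be run on `df_extra` directly.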

In [8]:
# Create data frames for all datasets
# We drop some variables we don't need at this stage

# metadata
datefield = list() # '1841'
publisher = list() # 'Privately printed',
title = list() # ["The Poetical Aviary, with a bird's-eye view of the English poets. [The preface signed: A. A.] Ms. notes"]
edition = list() # ''
place = list() # 'Calcutta'
issuance = list() # 'monographic'
authors = list() # {'creator': ['A. A.']}
first_pdf = list() # {'1': 'lsidyv35c55757'}
number_volumes = list()
identifier = list() # '000000196'
fulltext_filename = list() # 'sample/full_texts/000000196_01_text.json'
for book in metadata:
    if book["date"]:
        datefield.append(int(book["date"][:4]))
    else:
        datefield.append(None)
    publisher.append(book["publisher"])
    title.append(book["title"][0])
    edition.append(book["edition"])
    place.append(book["place"])
    issuance.append(book["issuance"])
    if "creator" in book["authors"].keys():
        authors.append(book["authors"]["creator"]) # this is a list!
    else:
        authors.append([''])
    first_pdf.append(book["pdf"]["1"])
    number_volumes.append(len(book["pdf"]))
    identifier.append(book["identifier"])
    fulltext_filename.append(book["fulltext_filename"].split("/")[-1])
df_meta = pd.DataFrame.from_dict({"datefield": datefield, "publisher": publisher,
                                 "title": title, "edition": edition, "place": place,
                                 "issuance": issuance, "authors": authors, "first_pdf": first_pdf,
                                 "number_volumes": number_volumes, "identifier": identifier,
                                 "fulltext_filename": fulltext_filename})

# texts
how_many_lines = 200 # we keep only the first n entries of each text, to make it faster to play with it
fulltext_filename = list()
fulltext = list()
for f,t in texts.items():
    fulltext_filename.append(f)
    # slice the list of entries (not the characters of each entry), as the comment above states
    text = " ".join([line[1] for line in t[:how_many_lines]])
    fulltext.append(text)

df_texts = pd.DataFrame.from_dict({"fulltext_filename": fulltext_filename, "fulltext": fulltext})
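
Because the loop above stores `None` for books without a date, pandas will typically hold the `datefield` column as floats, which is why years are displayed as `1841.0`. A minimal sketch of one way around this, using the nullable `Int64` dtype on a toy series (the values other than 1841 are made up):

```python
import pandas as pd

# Toy years; None stands for a book without a date.
years = pd.Series([1841, None, 1855])

# With a missing value, pandas falls back to float64.
print(years.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
years_nullable = years.astype("Int64")
print(years_nullable.dtype)
```

Applied to the notebook, this would amount to `df_meta["datefield"].astype("Int64")`; whether that is desirable depends on how the column is used later.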

## UML modelling
From an Entity-Relationship model to a relational model (tidy data).

UML (Unified Modelling Language) is a visual design language for modelling systems, including data: https://en.wikipedia.org/wiki/Unified_Modeling_Language

In [9]:
df_meta.head(1)

Unnamed: 0,datefield,publisher,title,edition,place,issuance,authors,first_pdf,number_volumes,identifier,fulltext_filename
0,1841.0,Privately printed,"The Poetical Aviary, with a bird's-eye view of...",,Calcutta,monographic,[A. A.],lsidyv35c55757,1,196,000000196_01_text.json


In [10]:
df_texts[df_texts["fulltext_filename"] == '000000196_01_text.json']

Unnamed: 0,fulltext_filename,fulltext
150,000000196_01_text.json,"THE POETICAL AVIARY, WITH A B I R D'S-E ..."


In [11]:
df_extra[df_extra["first_pdf"] == "lsidyv35c55757"]

Unnamed: 0,aleph system no.,country code,language code (008),language code (041),ddc,personal author,corporate author,title,edition,imprint,series,subjects,other personal authors,other corporate authors,dom id,type,genre,first_pdf
234,14846757,|||,eng,,,A. A.,,"The Poetical Aviary, with a bird's-eye view of...",,"Calcutta : Privately printed, 1841.",,,,,lsidyv35c55757,poet,Poetry,lsidyv35c55757


*Note: switch to the blackboard and model!*

## Tidy dataset: relational-model

* Full view (for your curiosity): https://dbdiagram.io/d/5d06a4adfff7633dfc8e3a42
* Reduced view (we here use this one): https://dbdiagram.io/d/5d06a5d0fff7633dfc8e3a47

In [12]:
# first, join the extra metadata genre column to the metadata data frame

df_extra_genre = df_extra[["genre","first_pdf"]]
df_book = df_meta.join(df_extra_genre.set_index('first_pdf'), on='first_pdf')
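
The `join` above is a left join by default: every row of `df_meta` is kept, and books without a match in `df_extra` get a missing `genre`. The sketch below shows the same kind of operation written with `merge`, on toy frames with made-up values. The `validate` argument asks pandas to raise an error if `first_pdf` is not unique on the right-hand side, a situation that would otherwise silently duplicate books.

```python
import pandas as pd

# Toy frames with made-up values; only the column names mirror the notebook.
left = pd.DataFrame({"first_pdf": ["a1", "b2", "c3"], "title": ["T1", "T2", "T3"]})
right = pd.DataFrame({"first_pdf": ["a1", "b2"], "genre": ["Poetry", "Drama"]})

# Left join: keep all books, attach genre where a match exists.
merged = left.merge(right, on="first_pdf", how="left", validate="many_to_one")
print(merged)
```

The book `c3` has no match on the right, so its `genre` is missing, while the number of rows stays the same as in `left`.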

In [13]:
df_book.head(1)

Unnamed: 0,datefield,publisher,title,edition,place,issuance,authors,first_pdf,number_volumes,identifier,fulltext_filename,genre
0,1841.0,Privately printed,"The Poetical Aviary, with a bird's-eye view of...",,Calcutta,monographic,[A. A.],lsidyv35c55757,1,196,000000196_01_text.json,Poetry


In [14]:
# second, add the book_id to the book_text dataframe

df_book_text = df_texts.join(df_book[["identifier","fulltext_filename"]].set_index('fulltext_filename'), on='fulltext_filename')
df_book_text = df_book_text.rename(columns={"identifier":"book_id"})
df_book_text.head(3)

Unnamed: 0,fulltext_filename,fulltext,book_id
0,000551646_01_text.json,"' -■"" ' LiLitr-- )Wm&, HISTORY OF THE...",551646
1,002674278_01_text.json,The Great Revolution of 1840. REMINISC...,2674278
2,001975731_01_text.json,THE REAR-GUARD OF THE REVOLUTION. BY E...,1975731


In [15]:
# third, pull out author information and create the author table and the author-book table

author_id = 0 # counter that gives every author a distinct identifier
author_dict = OrderedDict()
author_book_table = {"book_id":list(),"author_id":list()}
for book_id, authors in df_book[["identifier","authors"]].values:
    for author in authors:
        if author not in author_dict.keys():
            author_dict[author] = author_id
            author_id += 1
        author_book_table["book_id"].append(book_id)
        author_book_table["author_id"].append(author_dict[author])
        
df_author_book = pd.DataFrame.from_dict(author_book_table)
df_author = pd.DataFrame.from_dict({"name": list(author_dict.keys()),
                                   "id": list(author_dict.values())})
df_author.set_index("id", inplace=True)
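
The loop above can also be expressed with pandas operations. The following is a sketch of an alternative, not the notebook's method: it uses `explode` to get one row per book-author pair and `pd.factorize` to assign integer identifiers in order of first appearance. The toy frame and its author names are made up.

```python
import pandas as pd

# Toy stand-in for df_book[["identifier", "authors"]]; values are made up.
books = pd.DataFrame({
    "identifier": ["001", "002", "003"],
    "authors": [["Author A"], ["Author B", "Author A"], ["Author B"]],
})

# One row per (book, author) pair.
pairs = books.explode("authors").rename(
    columns={"identifier": "book_id", "authors": "name"}
)

# Integer codes in order of first appearance, plus the distinct names.
codes, names = pd.factorize(pairs["name"])

author_book = pd.DataFrame({"book_id": pairs["book_id"].to_numpy(), "author_id": codes})
author = pd.DataFrame({"name": names}).rename_axis("id")

print(author)
print(author_book)
```

Both approaches should give the same identifiers as long as authors are visited in the same order.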

In [16]:
df_author.head(3)

id,name
0,A. A.
1,"Abbott, Evelyn"
2,"A'BECKETT, Gilbert Abbott."


In [17]:
df_author_book.head(3)

Unnamed: 0,book_id,author_id
0,196,0
1,4047,1
2,5382,2


*Note: you don't need to do this: these dataframes are already there!*

In [18]:
# let's now save our data frames for future use

root_folder = "../data/bl_books/sample_tidy/"
df_book.to_csv(os.path.join(root_folder,"df_book.csv"))
df_author.to_csv(os.path.join(root_folder,"df_author.csv"))
df_author_book.to_csv(os.path.join(root_folder,"df_author_book.csv"))
df_book_text.to_csv(os.path.join(root_folder,"df_book_text.csv"))
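
One thing to keep in mind when loading these files again: CSV stores no type information, so an identifier such as `'000000196'` is parsed as the integer `196` by default. A minimal sketch of the round trip, using an in-memory buffer instead of a file and a one-row toy frame:

```python
import io

import pandas as pd

# One-row toy frame with values taken from the first book shown earlier.
df = pd.DataFrame({"identifier": ["000000196"], "genre": ["Poetry"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing infers an integer and drops the leading zeros.
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Declaring the column as a string keeps the identifier intact.
buffer.seek(0)
as_string = pd.read_csv(buffer, dtype={"identifier": str})

print(as_default["identifier"].iloc[0])
print(as_string["identifier"].iloc[0])
```

Passing `dtype` for identifier columns when reading the saved files back keeps them usable as join keys against the original metadata.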