# Week 8 Lecture - Data Formats

Topics
* Review importing libraries
* Review CSV
* JSON Data Format
* XML Data Format
* JSON and Python
* Parsing MARC records


## Importing Libraries

Python has some functions already loaded and ready to use (like `print()`, `len()`, `type()` and `range()`). The full list of built-in functions can be found [on the Python Documentation website](https://docs.python.org/3/library/functions.html).

Beyond the built-in functions, Python has a very extensive [standard library](https://docs.python.org/3/library/index.html), which provides additional functionality in a variety of areas including [math](https://docs.python.org/3/library/numeric.html), [text processing](https://docs.python.org/3/library/text.html), and working with unique [file formats](https://docs.python.org/3/library/fileformats.html).

In [None]:
# import the time module from the standard library
import time

We can use the [time module](https://docs.python.org/3/library/time.html) to see how many seconds have elapsed since the [EPOCH](https://en.wikipedia.org/wiki/Unix_time).

In [None]:
# how many seconds since the epoch (as a floating point number)
time.time()

When is the EPOCH (on this computer)?

In [None]:
# use the gmtime function to see when time point zero is
time.gmtime(0)

The world (of computers) is only 51 years old.
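
As a small sketch of the same idea, the cell below formats the epoch as a readable date and estimates its age in years. The constant `SECONDS_PER_YEAR` is an approximation introduced here for illustration, and the age printed depends on when you run the code.

```python
import time

# time.time() returns the seconds (as a float) elapsed since the epoch
seconds_since_epoch = time.time()

# gmtime(0) describes time point zero as a struct_time in UTC
epoch = time.gmtime(0)

# format the epoch as a human-readable string
epoch_label = time.strftime("%Y-%m-%d %H:%M:%S", epoch)
print("The epoch starts at", epoch_label)

# rough age of the epoch in years (an approximation that averages leap years)
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
age_in_years = seconds_since_epoch / SECONDS_PER_YEAR
print("The epoch is about", int(age_in_years), "years old")
```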

## Parsing CSV files

* CSV stands for *comma separated values* 
* It is a file format for tabular data, e.g. Excel spreadsheets
* It is a pretty common data format
* So Python has a built-in parser in the [CSV module](https://docs.python.org/3/library/csv.html)
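
To see what the parser does without needing a file on disk, here is a minimal sketch that reads CSV text from memory. The sample rows and column names are made up for illustration; only the positions of breed, color, and name follow the dog names file used in this lecture.

```python
import csv
import io

# hypothetical sample data; io.StringIO lets a string behave like an open file
sample = io.StringIO(
    "license,breed,color,name\n"
    "A1,BEAGLE,BROWN,Biscuit\n"
    'A2,PUG,FAWN,"Sir Barks, Jr."\n'
)

# csv.reader turns each line into a list of strings
rows = list(csv.reader(sample))
header = rows[0]
second_dog = rows[2]

print(header)
# the quoted name keeps its comma instead of being split into two columns
print(second_dog[3])
```

Handling quoted commas like this is the main reason to use the `csv` module rather than splitting each line on `","` yourself.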


In [None]:
# load the csv module into memory
import csv

In [None]:
# open the dog names csv file and loop over the contents
with open("files/pgh-dog-names.csv", "r") as fileh:
    # create the special csv reader object that knows how to parse the file
    csv_reader = csv.reader(fileh)
    
    # initialize variables to hold the longest name and that dog's breed and color
    long_name = ""
    breed = ""
    color = ""

    # loop over each row, looking for the longest dog name
    for dog in csv_reader:
        # the name is the 4th column
        # check to see if that string is longer than long_name
        if len(dog[3]) > len(long_name):
            # if we have a new long name, save it in the variable long_name
            long_name = dog[3]
            breed = dog[1]
            color = dog[2]


# print the name of the dog with the longest name
print("The dog with the longest name in Pittsburgh is", long_name)
print("It is a", color, breed)

These data fit nicely into a two-dimensional table, but sometimes you have data that does not fit neatly into rows and columns.

## What is JSON

![json slide](json-slides/Slide2.png)
![json slide](json-slides/Slide3.png)
![json slide](json-slides/Slide4.png)
![json slide](json-slides/Slide5.png)
![json slide](json-slides/Slide6.png)
![json slide](json-slides/Slide7.png)
![json slide](json-slides/Slide8.png)

## What is XML?

![json slide](json-slides/Slide9.png)
![json slide](json-slides/Slide10.png)
![json slide](json-slides/Slide11.png)
![json slide](json-slides/Slide12.png)
![json slide](json-slides/Slide13.png)

## Parsing JSON and Python

Python is *batteries included* so there is already a [JSON module](https://docs.python.org/3/library/json.html) in the standard library.

The JSON module provides 4 main functions. Two for decoding (parsing) and two for encoding (*serializing*):
* `json.load()` - Parse JSON data from a file
* `json.loads()` - Parse JSON data from a string
* `json.dump()` - Serialize Python data into a JSON file
* `json.dumps()` - Serialize Python data into a JSON string

Notice a pattern in the naming convention?
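
The pattern is that the functions ending in `s` work with strings, while the ones without it work with file objects. A minimal sketch of all four, using a made-up dictionary and an in-memory `io.StringIO` object standing in for a real file:

```python
import io
import json

# made-up data for illustration
data = {"name": "Eleven", "powers": ["telekinesis"]}

# the "s" versions work with strings
as_string = json.dumps(data)
from_string = json.loads(as_string)

# the versions without "s" work with file-like objects
fake_file = io.StringIO()
json.dump(data, fake_file)
fake_file.seek(0)  # rewind to the start before reading
from_file = json.load(fake_file)

# both round trips give back the original data
print(from_string == from_file == data)
```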

In [None]:
import json

### Decoding / Parsing JSON data

I have included some fun JSON data in the `files` folder. Let's go and see what we have using JupyterLab and then open them with Python.

In [None]:
json_string = """
{"something":"I was not able to think of any fun data", 
"data":[1,2,3,4,5,6]}

"""


parsed_json = json.loads(json_string)
parsed_json

What *type* is this parsed data being represented in Python?

In [None]:
type(parsed_json)

JSON maps very nicely to Python dictionaries. You can see how the JSON data types map to Python data types in [the documentation](https://docs.python.org/3/library/json.html#encoders-and-decoders)
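
As a quick illustration of that mapping, this sketch parses a made-up JSON string containing one value of each JSON type and prints the Python type each one becomes.

```python
import json

# a made-up JSON document with one value of each JSON type
json_text = """
{"text": "hi", "number": 3, "decimal": 1.5, "flag": true,
 "nothing": null, "items": [1, 2], "nested": {"a": 1}}
"""

parsed = json.loads(json_text)

# show which Python type each JSON value was converted to
for key, value in parsed.items():
    print(key, "->", type(value).__name__)
```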

Usually we don't type our JSON literally as strings...we store it in files.

In [None]:
# open a JSON file handle and parse it into a Python dictionary
with open("files/stranger.json") as fileh:
    stranger_data = json.load(fileh) #stranger danger

In [None]:
# display the data
stranger_data

fun!

In [None]:
# get the value at the key summary
stranger_data['summary']

Using the Python dictionary indexing, we can reach into this complex data structure and grab subsets of the tree.

In [None]:
# how many episodes
len(stranger_data["_embedded"]["episodes"])

Note that we need a pre-existing understanding of the data structure so we know which keys to use.
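
When you don't know the structure yet, one way to find out is to ask each level for its keys. The sketch below uses a made-up, cut-down dictionary shaped like the show data (the values are invented), so it runs without the file.

```python
import json

# hypothetical data shaped like the show file; the values are invented
show = json.loads("""
{"name": "Example Show",
 "summary": "A short summary.",
 "_embedded": {"episodes": [{"name": "Pilot", "number": 1},
                            {"name": "Second", "number": 2}]}}
""")

# list the keys at each level to learn the shape of the data
top_level_keys = list(show.keys())
embedded_keys = list(show["_embedded"].keys())
episode_keys = list(show["_embedded"]["episodes"][0].keys())
print(top_level_keys)
print(embedded_keys)
print(episode_keys)

# .get() returns a fallback instead of raising KeyError for a missing key
missing = show.get("rating", "no rating key")
print(missing)
```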

### Encoding / Serializing JSON Data

From the [Rick and Morty API](https://rickandmortyapi.com)

In [None]:
with open("files/rm-characters.json") as fileh:
    rm_characters = json.load(fileh)
rm_characters["info"]

In [None]:
len(rm_characters["results"])

We have twenty characters, but maybe we just want to have the aliens.

In [None]:
# create an empty dictionary for our data
aliens = {}

# loop over all the characters
for character in rm_characters["results"]:
    # check to see if the character is an alien
    if character["species"] == "Alien":
        #using the name as a key, save the data for alien characters as values
        aliens[character["name"]] = character

# display the results
aliens

In [None]:
# save our data to disk
with open("aliens.json", "w") as fileh:
    json.dump(aliens, fileh)

![ancient aliens](https://i.kym-cdn.com/photos/images/original/000/183/103/alens.jpg)

In [None]:
# create an empty list for our data
aliens = []

# loop over all the characters
for character in rm_characters["results"]:
    # check to see if the character is an alien
    if character["species"] == "Alien":
        # add the data for each alien character to the list
        aliens.append(character)

# display the results
aliens

In [None]:
with open("aliens.json", "w") as fileh:
    json.dump(aliens, fileh)

If we want to get all this data as a JSON string, we can use `dumps`
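
A minimal sketch of `dumps`, using a made-up one-item list in place of the real character data. The `indent` argument is optional and only makes the string easier to read.

```python
import json

# made-up stand-in for the list of alien characters
aliens_sample = [{"name": "Example Alien", "species": "Alien"}]

# serialize to a string instead of writing to a file
json_string = json.dumps(aliens_sample, indent=2)

print(type(json_string))
print(json_string)
```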

## Parsing MARC records

Unfortunately, [MARC records are still a thing](https://www.libraryjournal.com/?detailStory=marc-must-die) and so we need to *deal with it*.

Double Unfortunately, Python does not include a MARC parser as part of the standard library.

![batteries not included](https://images-na.ssl-images-amazon.com/images/I/71uIJYNfJZL._SL1500_.jpg)

Fortunately, [a brave soul](https://inkdroid.org/about/) has built this capacity for us in the form of a third-party library.

We'll be working with the pymarc library for this lecture. Details about it can be found [here](https://gitlab.com/pymarc/pymarc), along with some documentation and resources for its use. The complete library documentation is also available [here](https://readthedocs.org/projects/pymarc/downloads/pdf/latest/).

### Installing 3rd party libraries

Libraries like Pymarc are not part of the "standard library," which means you have to install them on your computer yourself. There are currently 291,647 packages on the [PyPI website](https://pypi.org), the Python Package Index, a repository of 3rd party packages. You don't need to install all of them, only the ones you need.

In our case, we needed to install the [pymarc](https://pypi.org/project/pymarc/) package. 



Installing packages is done using a command line tool called `pip` and you can run the code cell below to execute a unix command to install pymarc. NOTE: this may not work on Windows computers AND you don't need to run this if you are on JupyterHub (it just won't do anything).

In [None]:
# run the command pip to install pymarc
!pip install pymarc --user

Now that pymarc is installed, we can `import` it like any other library.

In [None]:
# import the pymarc library
import pymarc
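
If you are not sure whether a package is installed, one option is to ask Python whether it can find the module before importing it. This is a small sketch using the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# json ships with Python, so this should always be found
print("json available:", is_installed("json"))

# pymarc is only found if it has been installed with pip
print("pymarc available:", is_installed("pymarc"))
```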

## Reading MARC files

MARC files are text files (technically they are MARC8 encoded text files) and they are UGLY. But because they are just text, you can open them using the regular Python file machinery.

In [None]:
# open MARC file called marc.dat
with open("files/marc.dat", 'r') as raw:
    print(raw.read(3000)) # read the first 3000 characters

There is a lot going on in that file and parsing it using Python string methods would be unpleasant. Because someone did that for us, we can stand on the shoulders of giants!
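
To get a feel for how unpleasant, here is a rough sketch of parsing by hand. It assumes the usual MARC layout of a 24-byte leader, a directory of 12-byte entries, and then the field data, and it only handles data fields with two indicators (not control fields). The record is made up and built in memory, and `parse_record` is a name invented for this example; it is not how pymarc does it.

```python
# Control characters used by the MARC transmission format
FIELD_END = b"\x1e"     # ends the directory and each field
RECORD_END = b"\x1d"    # ends the whole record
SUBFIELD = b"\x1f"      # starts each subfield

# Build one tiny, made-up record so the example does not need a file
field_data = b"10" + SUBFIELD + b"aA made-up title" + FIELD_END
directory = b"245" + b"%04d" % len(field_data) + b"00000" + FIELD_END
base_address = 24 + len(directory)
record_length = base_address + len(field_data) + len(RECORD_END)
leader = b"%05d" % record_length + b"nam  22" + b"%05d" % base_address + b"   4500"
raw_record = leader + directory + field_data + RECORD_END


def parse_record(raw):
    """Parse one raw record into a list of (tag, indicators, subfields)."""
    base = int(raw[12:17])              # where the field data begins
    directory_bytes = raw[24:base - 1]  # skip the leader and the terminator
    fields = []
    # each directory entry is 12 bytes: tag (3), length (4), start (5)
    for i in range(0, len(directory_bytes), 12):
        entry = directory_bytes[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = base + int(entry[7:12])
        data = raw[start:start + length].rstrip(FIELD_END)
        indicators = data[:2].decode("ascii")
        # each subfield chunk is a one-letter code followed by its value
        subfields = [
            (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
            for chunk in data[2:].split(SUBFIELD)[1:]
        ]
        fields.append((tag, indicators, subfields))
    return fields


print(parse_record(raw_record))
```

That is a lot of byte counting for a single title, which is the argument for using a library.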

In [None]:
# create an empty list to store the records
marc_records = []

# open the file 
with open("files/marc.dat", "rb") as fileh:
    
    # create an instance of the MARC reader, like with CSV
    marc_reader = pymarc.MARCReader(fileh)
    
    # loop over each record
    for record in marc_reader:
        # add record to our list
        marc_records.append(record)
        
print("There are", len(marc_records), "MARC records.")

What does a record look like?

In [None]:
# look at the 2nd record at index 1 because...
marc_records[1]

Drat! What we are looking at is a complex, custom data structure (technically an "object") of the type `Record`. We won't get into the details of object oriented programming. 

If we want to manipulate our Record, we should [read the documentation](https://pymarc.readthedocs.io/en/latest/#module-pymarc.record).

In [None]:
# see the dictionary representation of the MARC record
marc_records[1].as_dict()

Ugh, still a mess, but we can use handy methods like `title()` and `author()` to grab information without having to memorize MARC fields 

In [None]:
# loop over each record
for record in marc_records:
    # print name and author
    print(record.title(), "by", record.author())


Pymarc includes a bunch of helper functions for grabbing information from a MARC record

In [None]:
# display the ISBN 
marc_records[11].isbn()

In [None]:
# display the publisher
marc_records[12].publisher()

In [None]:
# display the publication year
marc_records[1].pubyear()

In [None]:
# display the subjects
marc_records[-4].subjects()

AH! We have a list of [Fields](https://pymarc.readthedocs.io/en/latest/#module-pymarc.field), so we need to do some looping and reading of documentation.

In [None]:
# loop over each field and display the field value
for subject in marc_records[-4].subjects() :
    print(subject.value())

In [None]:
# loop over each field and display the raw MARC
for subject in marc_records[-4].subjects() :
    print(subject)

If you are wicked smart and have memorized the MARC field numbers, you can always reference them directly using Python indexing syntax.

In [None]:
# get the value for field 300 from the second record
marc_records[1]['300'].value()

In [None]:
# display the 245 field in the original MARC format
marc_records[1]['245'].as_marc("utf-8")

## Writing

In [None]:
# create a new blank MARC record
record = pymarc.Record() 

# Create the 245 and 100 fields 
title = pymarc.Field(
        tag = '245',
        indicators = ['0','1'], 
        subfields = [
            'a', 'The Beaverkill : ', 'b', 'The History of a River and its People /'
        ])
author = pymarc.Field(
        tag = '100',
        indicators = ['0','1'], 
        subfields = [
            'a', 'Ed Van Put'
        ])
record.add_field(title, author)


# write the MARC record to disk
with open('test_write.dat', 'wb') as out:
    out.write(record.as_marc())

In [None]:
# what does it look like
print(record.as_marc())

Ugh.

In [None]:
records_test = []

# open the file 
with open("test_write.dat", "rb") as fileh:
    
    # create an instance of the MARC reader, like with CSV
    marc_reader = pymarc.MARCReader(fileh)
    
    # loop over each record
    for record in marc_reader:
        # add record to our list
        records_test.append(record)

In [None]:
# display the title and author of the record we read from disk
print(records_test[0].title(), "by", records_test[0].author())

![Hi MARC](https://media.giphy.com/media/G2vaqcEICxOyA/giphy.gif)