<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 Fall 2019</h2>
</center>

----

# Lists and Metadata Processing
 <center><pre>Rev: 08/23/2019</pre></center>

In [None]:
# We can create a simple list with square brackets:
var = [1,2,3,4,5]

In [None]:
# Again, to identify the datatype of a variable, we can use the type() function:
type(var)

In [None]:
# We can access list items by requesting the item in sequence.
# This works much like a string. To get the first character of
# a string, we use square brackets and 0 for the zero-th element:
# var[0]
# We use the same method with lists. This should return 2, the second item:

var[1]
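
The indexing described above can be tried a little further. This is a minimal sketch, with `var` redefined so the example runs on its own; negative indexes and slices are standard Python, and the names `first`, `last`, and `middle` are only illustrative.

```python
# A small list, redefined here so the example is self-contained.
var = [1, 2, 3, 4, 5]

first = var[0]      # index 0 is the first item
last = var[-1]      # negative indexes count back from the end
middle = var[1:4]   # a slice returns a new list: the items at index 1, 2, and 3

print(first, last, middle)
```

Slices come back later in this notebook, where `[1:]` is used to drop the first row of a file.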

In [None]:
# We can iterate through a list with a 'for loop'. We name the local variable
# that will be assigned a value for each list item.

# If we've not noticed this before, a for loop has three important features:
#
# 1) we name the local variable and the object to iterate through
# 2) follow the object to iterate through with a colon ':'
# 3) the instructions that will be iterated are indented (all to the same point)
#

for number in var:
    print(number,type(number))

In [None]:
# Note how the above shows that Python has autotyped each item in the list
# as an integer.

# Now, we can create a list for a book. Note that we can break lines easily
# here to make this readable.

metadata = ["Frederick Douglass",
            "Narrative of the Life of Frederick Douglass, an American Slave. Written by Himself"]

# This provides us with a semi-computable list of strings.

In [None]:
# But what if we want to add the year of publication?
# We can add an item to a list with the append() method. It operates directly
# on the list and modifies it in place.
#
# Strings are immutable while lists are mutable.

# If we add just the year, it will be autotyped as an integer:
metadata.append(1845)

In [None]:
print(metadata)
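
The difference between mutable lists and immutable strings can be seen directly. This is a small sketch using made-up values (`title`, `books`, and `string_changed` are illustrative names, not part of the notebook's data).

```python
title = "Narrative"
books = ["Narrative"]

# Lists are mutable: append() changes the list in place and returns None.
result = books.append(1845)

# Strings are immutable: assigning to a position raises a TypeError.
try:
    title[0] = "n"
    string_changed = True
except TypeError:
    string_changed = False

print(books, result, string_changed)
```

Because `append()` returns `None`, writing `books = books.append(1845)` would throw the list away, which is a common slip.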

In [None]:
# But what if we have multiple books? We can create what is called a list-of-lists.
# We can start by creating an empty list to hold all the items:

book_metadata = list()
book_metadata.append(metadata)
print(book_metadata)

In [None]:
# Now we can easily add another entry:
book_metadata.append(['Frederick Douglass',
                      'Life and Times of Frederick Douglass, Written by Himself. His Early Life as a Slave, His Escape from Bondage, and His Complete History to the Present Time, Including His Connection with the Anti-slavery Movement; His Labors in Great Britain as Well as in His Own Country; His Experience in the Conduct of an Influential Newspaper; His Connection with the Underground Railroad; His Relations with John Brown and the Harpers Ferry Raid;  His Recruiting the 54th and 55th Mass. Colored Regiments; His Interviews with Presidents Lincoln and Johnson; His Appointment by Gen. Grant to Accompany the Santo Domingo Commission&#x2014;Also to a Seat in the Council of the District of Columbia; His Appointment as United States Marshal by President R. B. Hayes; Also His Appointment to Be Recorder of Deeds in Washington by President J. A. Garfield; with Many Other Interesting and Important Events of His Most Eventful Life; With an Introduction by Mr. George L. Ruffin, of Boston',
                      1892])

In [None]:
# We can now count the number of entries with the len() function:
len(book_metadata)
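
Items inside a list-of-lists are reached with two indexes: the first picks the book, the second picks the field. The sketch below rebuilds a small `book_metadata` with shortened titles so that it runs on its own; the shortening is only for readability.

```python
# Titles are shortened here for illustration.
book_metadata = [
    ["Frederick Douglass", "Narrative of the Life of Frederick Douglass", 1845],
    ["Frederick Douglass", "Life and Times of Frederick Douglass", 1892],
]

# Second book (index 1), third field (index 2): the year of publication.
second_year = book_metadata[1][2]
print(second_year)

# A for loop can unpack each inner list into named variables.
for author, title, year in book_metadata:
    print(year, author, title)
```

Unpacking only works when every inner list has the same number of fields, which is one reason to keep entries consistent.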

## Now we'll set up a more complicated approach to examining book metadata
---

In [None]:
# This line tells the Python interpreter to load some additional functions.
# We need functions to parse Comma Separated Value (CSV) data.

import csv

# we'll set a variable to hold the number of records read from this CSV file.
ln = 0

# as above, we'll create a metadata variable (overwriting the previous variable)
metadata=list()

# Now we open the CSV file and read each line, appending to the metadata list:
# newline='' lets the csv module handle line endings inside quoted fields itself.
with open('data/Underwood_ch1/allgenremeta.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        metadata.append(row)
        
        # increment our counter
        ln += 1

# tell us how many entries we've read
print("read %s lines" % ln)

In [None]:
# The first line will give us some data about our metadata:
print(metadata[0])

In [None]:
# Let's remove the header (our "metadata") for easier processing:
metadata = metadata[1:]
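
The reading and header-removal steps above can be tried on a tiny in-memory sample. This is a sketch under assumptions: the column names and the single record are hypothetical and do not reflect the layout of `allgenremeta.csv`, and `io.StringIO` simply stands in for a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file.
sample = 'author,title,year\n"Douglass, Frederick","Narrative of the Life",1845\n'

rows = list(csv.reader(io.StringIO(sample)))

# The first row is the header; the slice [1:] keeps everything after it.
header, records = rows[0], rows[1:]

print(header)
print(records[0])
```

Two things are worth noticing: the comma inside the quoted author field does not split the column, and every value, including the year, is read as a string.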

In [None]:
# This metadata file contains 5,751 entries about a sample of books from 
# the 18th to 20th century. Some of the entries have uncertain publication 
# dates and there are three different genres:
#
# poe = poetry
# fic = fiction
# bio = biography

cleaned_up=list()
for book in metadata:

    # if the date just contains digits (i.e., just a year-of-publication)
    if book[2].isdigit():
       
        # only select for fiction
        if book[5] == "fic":
            cleaned_up.append(book)

In [None]:
# How many books now?
print(len(cleaned_up))
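
The two nested tests used for filtering can also be written as one condition joined with `and`. The rows below are made up for illustration and only fill the two positions the filter inspects (index 2 for the date, index 5 for the genre).

```python
# Hypothetical rows: only index 2 (date) and index 5 (genre) are filled in.
sample = [
    ["", "", "1850", "", "", "fic"],
    ["", "", "1850?", "", "", "fic"],   # uncertain date, so isdigit() is False
    ["", "", "1790", "", "", "poe"],    # a clean date, but not fiction
]

# Keep a book only when both tests pass.
kept = [book for book in sample if book[2].isdigit() and book[5] == "fic"]

print(len(kept))
```

Changing `"fic"` to `"poe"` or `"bio"` would select one of the other genres instead.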

In [None]:
# Sort by publication year (converted to an integer) and print the first and last entries:
cleaned_metadata = sorted(cleaned_up, key=lambda x: int(x[2]))
print(cleaned_metadata[:1])
print(cleaned_metadata[-1:])
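
Because the year column is read from the file as text, it matters whether a sort compares the years as strings or as numbers. For four-digit years the two orders agree, but a sketch with made-up rows shows where they would differ.

```python
# Hypothetical rows; the third item plays the role of the year column.
rows = [["a", "b", "1850"], ["c", "d", "975"], ["e", "f", "1799"]]

# Comparing strings goes character by character, so "975" lands after "1850".
by_text = sorted(rows, key=lambda x: x[2])

# Converting to int compares the years as numbers.
by_number = sorted(rows, key=lambda x: int(x[2]))

print([r[2] for r in by_text])
print([r[2] for r in by_number])
```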

### Title Lengths

Using our knowledge of conditional tests (if statements), we can get the average title 
length for each century of data.

---


In [None]:
# Eighteenth Century
average = 0
i = 0
for book in cleaned_metadata:
    pub_year = int(book[2])
    if pub_year >= 1700 and pub_year <= 1799:
        i = i + 1
        average = average + len(book[10])
print(average/i)

In [None]:
# Nineteenth Century
average = 0
i = 0
for book in cleaned_metadata:
    pub_year = int(book[2])
    if pub_year >= 1800 and pub_year <= 1899:
        i = i + 1
        average = average + len(book[10])
print(average/i)

In [None]:
# Twentieth Century
average = 0
i = 0
for book in cleaned_metadata:
    pub_year = int(book[2])
    if pub_year >= 1900 and pub_year <= 1999:
        i = i + 1
        average = average + len(book[10])
print(average/i)
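
The three cells above repeat the same loop with different year ranges. One possible way to avoid the repetition is a small function. This is a sketch, not part of the original notebook: `average_title_length` and `make_row` are hypothetical helpers, and the sample rows are made up, with the year at index 2 and the title at index 10 to mirror the positions used above.

```python
def average_title_length(books, start, end, year_col=2, title_col=10):
    """Return the mean title length for books published from start to end, inclusive."""
    lengths = [len(book[title_col]) for book in books
               if start <= int(book[year_col]) <= end]
    if not lengths:
        return None   # no books in this range, so avoid dividing by zero
    return sum(lengths) / len(lengths)

def make_row(year, title):
    """Build a made-up 11-column row with only the year and title filled in."""
    row = [""] * 11
    row[2] = year
    row[10] = title
    return row

sample = [make_row("1750", "abcd"), make_row("1790", "abcdef"), make_row("1850", "abc")]

results = {}
for label, start, end in [("18th", 1700, 1799), ("19th", 1800, 1899), ("20th", 1900, 1999)]:
    results[label] = average_title_length(sample, start, end)
    print(label, results[label])
```

With the real data, the call would be `average_title_length(cleaned_metadata, 1700, 1799)`. Returning `None` for an empty range is one design choice among several; raising an error would be another.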

### <font color="red">Question:</font> Now run the same for the other two genres (poetry and biography). Any differences?

In [None]:
# We can iterate through the title list and search for some of the examples
# invoked by Moretti.

i = 0
for book in cleaned_metadata:
    title = book[10]
    if title.startswith("The"):
        i += 1
print(i)

In [None]:
# But was that all of them?
i = 0
for book in cleaned_metadata:
    title = book[10]
    if title.lower().startswith("the"):
        i += 1
print(i)
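
Note that `startswith("the")` also matches titles that merely begin with those three letters, such as "Theodore" or "Then". One way to count only titles whose first word is "the" is to split the title into words first. The titles below are illustrative strings, not rows from the CSV file, and `starts_with_word` is a hypothetical helper.

```python
titles = ["The Monk", "the castle of otranto", "Theodore", "Then and Now", "Emma"]

def starts_with_word(title, word="the"):
    """Return True when the first whitespace-separated word equals `word`, ignoring case."""
    words = title.lower().split()
    return bool(words) and words[0] == word

loose = [t for t in titles if t.lower().startswith("the")]
strict = [t for t in titles if starts_with_word(t)]

print(len(loose), len(strict))
print(strict)
```

This sketch still misses a first word followed directly by punctuation (for example "The,"), so it is a starting point rather than a complete test.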

In [None]:
# What other patterns might we search for?