<a href="https://colab.research.google.com/github/lmrhody/femethodsS23/blob/main/week7_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 7: Jupyter Notebook Assignment - Working with Data

Fill out the cell below with your information. 

* Student Name: 
* Date: 
* Instructor: Lisa Rhody
* Assignment due: 
* Methods of Text Analysis
* MA in DH at The Graduate Center, CUNY

## Objectives
The purpose of this notebook is to get some hands-on experience putting what you've seen in tutorials about importing and working with text in Python into practice. You'll also be asked to put the reading you've been doing all semester into conversation with the process of importing, cleaning, and preparing data. 

The objectives of this week's notebook are: 
* To practice several ways of importing text into your Python environment to study; 
* To become more familiar with various pipelines for cleaning and preparing data for text analysis; 
* To consider the challenges that the availability and scarcity of data present to the literary scholar (and to consider how other kinds of research might also need to address similar issues); 
* To connect examples of real-world text analysis projects with the practical process of cleaning and preparing data. 

# Getting Started
We're going to start by importing some important libraries for working with text data. 

In [2]:
import nltk
import numpy as np
import pandas as pd
import urllib
import pprint

## Importing Data
So far, we have worked with data in the DataCamp exercises, but that was a much more controlled environment. When you do your own text analysis project, the process will be much messier. In this week's reading, you encountered several pieces about the cleaning that takes place and some of the challenges that data presents when working with text. We're looking at text analysis from a humanities / literary perspective; however, one might argue that these challenges are more similar to those of text analysis in the social sciences or with non-fiction work than they first appear. 

In this lesson, we'll practice importing data: 
* from a file already on your computer (using a directory path); 
* from a file on the web using a URL request; 
* from a file on the web using Beautiful Soup. 


### Loading data from a flat file on your local computer
Before you get started, be sure to download this file onto your local computer and save it as herland.txt. 

Next, we're going to import `herland.txt` using an upload function that is part of the google.colab Python package. This function will open a button under the cell that you can use to "Choose Files" from your local computer. Choose the `herland.txt` file and then upload it. The for loop below will print out what the name of the file is that you are saving to the Google Colab content folder. 

In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print("User uploaded file '{name}' with length {length} bytes".format(
      name=fn, length=len(uploaded[fn])))

Saving herland.txt to herland.txt
User uploaded file 'herland.txt' with length 328410 bytes


To find the file you just uploaded, look to the left side of this browser window. Click on the icon of a file folder. A directory structure should open. Click on the arrow next to `content` and you should see your uploaded file appear inside. 

Then we're going to use the Python function `open()`. The following code says that we want to `open` the file `herland.txt` so we can read it (argument `mode='r'`); the result is a file object that we name `herland`. Next, we call `.read()` on that file object and assign the resulting data, which is now a string, to the variable `hertext`. Finally, we close the file.


In [8]:
filename = 'herland.txt'
herland = open(filename, mode='r')
hertext = herland.read()
herland.close()

Another way to read the text from a file into Python is to use a "context manager." The following tells Python that, with the `herland.txt` file open under the name `file`, it should read in the text and print it. When the indented block ends, Python closes the file for you. When you run the next cell, it is going to print out the entire text of *Herland*. That's a lot of text, so once you've done it, you can clear the cell's output and move on to the next cell. 

In [9]:
# Here is how you print a string from a file without having to close the file using a context manager

with open('herland.txt','r') as file:
    print(file.read())

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***










HERLAND

by Charlotte Perkins Stetson Gilman




CHAPTER 1. A Not Unnatural Enterprise


This is written from memory, unfortunately. If I could have brought with
me the material I so carefully prepared, this would be a very different
story. Whole books full of notes, carefully copied records, firsthand
descriptions, and the pictures--that’s the worst loss. We had some
bird’s-eyes of the cities and

In [None]:
# If you don't want to save the text of the file, but just want to peek into it to see what's there, you could use this method. 

with open('herland.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman



This eBook is for the use of anyone anywhere at no cost and with
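
The behaviour described above can be checked on a small scale. The following is a minimal sketch, assuming a throwaway file (`sample.txt` is a hypothetical stand-in for `herland.txt`) so that it runs anywhere. It shows that `readline()` returns one line at a time, including its trailing newline, and that the file is closed once the `with` block ends.

```python
import os
import tempfile

# Hypothetical stand-in for herland.txt so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write("Line one\nLine two\nLine three\n")

# Inside the `with` block the file is open; each readline() call advances one line.
with open(sample_path, encoding="utf-8") as file:
    first_line = file.readline()
    second_line = file.readline()
    open_inside_block = not file.closed

# Once the block ends, Python has closed the file for us.
closed_after_block = file.closed

print(repr(first_line))
print(repr(second_line))
print(open_inside_block, closed_after_block)
```

Because each line keeps its own newline character and `print()` adds another, printing lines one at a time leaves a blank line between them, as in the output above.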



### What happens when you import a flat file? 

The Python function `type()` returns the data type of the object you are working with. When you pass the object `herland` to the `type()` function below, what response do you get? The response will look different from other data types that you've used before. In this case, `herland` is a "file object." Remember that Python won't know how to handle data unless it is of a type that the function you pass it to expects. In the next input, we ask Python for the length of the file. This will throw an error. Why do you think that is? 

In [None]:
# herland is a file object, not a string. 
type(herland)

_io.TextIOWrapper

In [None]:
# since herland is a file object and not a string, you can't find the length of it.
len(herland)

TypeError: ignored

#### Response here: 

### We had to go through a process to convert the file object to a string. 
Looking at the cells below, which variable should return `type()` as a string? (The answer is in the cell below.) 

In [None]:
# but hertext is a different datatype. How would you check? 
type(hertext)

str

Once you have a string, there are a number of functions that you can make use of. One of those is the `len()` command, which you can run below. 

In [None]:
# How many characters are in the hertext string? 
len(hertext)

315999
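
To see the difference between the two objects side by side, here is a small sketch. It assumes a temporary file containing the single word "Herland" (a hypothetical stand-in for the full novel) and records whether `len()` works on the file object before reading it into a string.

```python
import os
import tempfile

# Hypothetical one-word file standing in for herland.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write("Herland")

file_object = open(sample_path, mode="r", encoding="utf-8")
try:
    len(file_object)          # a file object has no length
    length_worked = True
except TypeError:
    length_worked = False

text = file_object.read()     # read() returns the contents as a str
file_object.close()

print(type(file_object).__name__, length_worked)
print(type(text).__name__, len(text))
```

The file object is a handle for reading; the string returned by `.read()` is the data itself, which is why only the string has a length.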

Once an object is recognized as a string, you can begin manipulating it. For example, you could count the number of times the sequence of characters "her" appears within the entire text of _Herland_.

In [None]:
hertext.count('her', 0, -1)

1244
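
Keep in mind that `str.count()` counts character sequences, not words, so "there" and "other" each contribute a match for "her". The sketch below uses an invented sample sentence (not taken from _Herland_) to compare a substring count with a whole-word count made with the `re` module.

```python
import re

# Invented sample sentence, used only for illustration.
sample = "She gave her brother the other book; there it was, hers at last."

# Counts every occurrence of the three characters h-e-r, even inside longer words.
substring_hits = sample.count("her")

# \b marks a word boundary, so only the standalone word "her" matches.
whole_word_hits = len(re.findall(r"\bher\b", sample))

print(substring_hits, whole_word_hits)
```

Both counts are case-sensitive; lowercasing the text first with `.lower()` is one way to catch "Her" as well.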

The ability to count characters, words, n-grams, etc. means that we can also more easily target specific sections of the text. For example, when you print to your screen the opening of the herland file, you notice that it is accompanied by metadata. For the purposes of text analysis, what would be the advantages or disadvantages of removing the metadata associated with _Herland_?

In [None]:
# What is happening at the beginning of the herland.txt file, though? We can check to see by using an index. 
print(hertext[:660])

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***










HERLAND

by Charlotte Perkins Stetson Gilman




CHAPTER 1. 


Working with a string is *more* helpful than simply working with a file object, but there are other things that we can do to the text to make it more easily manipulated in Python and NLTK. For example, when you're working with a string, it's not easy to count whole words. The NLTK word tokenizer function, however, will take a string and turn it into "tokens"--discrete segments of characters. Tokenized strings become a new data type--a list. 

In [None]:
hertokens = nltk.word_tokenize(hertext)
type(hertokens)

list
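
To get a feel for what tokenizing adds over a plain `split()`, here is a rough sketch. The regular expression is only an approximation written for this example; it is not NLTK's tokenizer, which applies more rules. The sample sentence is adapted for illustration.

```python
import re

sample = "We had some bird's-eyes of the cities, and more."

# Splitting on whitespace leaves punctuation attached to the neighbouring word.
whitespace_tokens = sample.split()

# Rough approximation only: runs of word characters (allowing internal
# apostrophes or hyphens) or single punctuation marks.
regex_tokens = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `'cities,'` is one item; with the tokenizing pattern, `'cities'` and `','` are separate items, which is closer to what the NLTK output shows.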

A tokenized list can be called, acted upon, and manipulated differently than a string. If we call just the first 15 tokens (index positions 0 through 14), here is what you would get:

In [None]:
hertokens[:15]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Herland',
 ',',
 'by',
 'Charlotte',
 'Perkins',
 'Stetson',
 'Gilman',
 'This',
 'eBook',
 'is']
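
The first token above is `'\ufeffThe'` rather than `'The'`. The `\ufeff` is a byte order mark (BOM) stored at the very start of the file. One way to drop it, sketched here on a small temporary file rather than on `herland.txt` itself, is to open the file with the `utf-8-sig` encoding.

```python
import os
import tempfile

sample_path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")

# Write a file that starts with a UTF-8 byte order mark, as herland.txt does.
with open(sample_path, "wb") as out:
    out.write(b"\xef\xbb\xbfThe Project Gutenberg EBook")

# Plain utf-8 keeps the mark as the first character of the string.
with open(sample_path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig strips the mark while decoding.
with open(sample_path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

If the text has already been read, `hertext.lstrip('\ufeff')` would have a similar effect on the existing string.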

In [None]:
text1 = nltk.Text(hertokens)

In [None]:
type(text1)

nltk.text.Text

In [None]:
len(text1)

68494

In [None]:
text1[1000:1025]

['that',
 'they',
 'seemed',
 'sure',
 '.',
 'I',
 'told',
 'the',
 'boys',
 'about',
 'these',
 'stories',
 ',',
 'and',
 'they',
 'laughed',
 'at',
 'them',
 '.',
 'Naturally',
 'I',
 'did',
 'myself',
 '.',
 'I']

## Review
When you import text from a flat file that is saved on your local computer, what will you need to do in order to select parts of the text using an index? 

## Next, we're going to retrieve text directly from a URL with the `urllib` package
To do this, we're going to call the `request` module of the `urllib` package and, specifically, its `urlopen` function. First, we assign the address of the file to a variable; in this case, that variable is `url`. Then we pass `url` to `urlopen` and assign the result to the variable `uncletom`. (A related function, `urlretrieve`, takes two parameters, the URL and a file name with extension such as `203-0.txt`, and saves a copy of the file to disk.) If you pay attention to the output, you'll realize that you've imported the file as an object rather than as text. 

In [None]:
import urllib.request

In [20]:
from urllib.request import urlopen
from urllib.request import Request
url = 'https://www.gutenberg.org/files/203/203-0.txt'
uncletom = urlopen(url)

### Using what you've learned so far, how would you figure out what data type the file `uncletom` is? Add a cell below and show how you would find the answer. 

In [21]:
type(uncletom)

http.client.HTTPResponse

Next, we're going to turn the text of _Uncle Tom's Cabin_ into a list. A list is a mutable, ordered sequence of items. It can be indexed, sliced, and changed. Items in the list can be accessed through their index positions. 

In [None]:
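# List the attributes and methods of the response object.
# (Calling uncletom.read() here would consume the response and leave nothing for the next cell to read.)
dir(uncletom)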

In [27]:
words = uncletom.read().decode().split()

In [28]:
print(type(words))

<class 'list'>
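
The chain `.read().decode().split()` moves through three data types. The sketch below walks through them one at a time. It uses `io.BytesIO` as an offline stand-in for the network response (an assumption made so the example runs without a connection), and it also shows that a stream can only be read once.

```python
import io

# io.BytesIO stands in for the network response here, so the sketch runs offline.
fake_response = io.BytesIO("Uncle Tom\u2019s Cabin".encode("utf-8"))

raw = fake_response.read()          # bytes
text = raw.decode("utf-8")          # str
words = text.split()                # list of str

second_read = fake_response.read()  # the stream is already at its end

print(type(raw).__name__, type(text).__name__, type(words).__name__)
print(words)
print(second_read)
```

An `HTTPResponse` behaves similarly: once `.read()` has returned everything, a second call returns empty bytes, so it is worth storing the result in a variable the first time.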


Let's practice those steps again, but with a new file this time. 

In [34]:
from urllib.request import urlopen

shakespeare = 'http://composingprograms.com/shakespeare.txt'

print( type(shakespeare) )

<class 'str'>


In [36]:
shakespeare_url = 'http://composingprograms.com/shakespeare.txt'
shakespeare = urlopen(shakespeare_url)
print(type(shakespeare))

<class 'http.client.HTTPResponse'>


In [37]:
dir(shakespeare)

['__abstractmethods__',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_abc_impl',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_check_close',
 '_close_conn',
 '_get_chunk_left',
 '_method',
 '_peek_chunked',
 '_read1_chunked',
 '_read_and_discard_trailer',
 '_read_next_chunk_size',
 '_read_status',
 '_readall_chunked',
 '_readinto_chunked',
 '_safe_read',
 '_safe_readinto',
 'begin',
 'chunk_left',
 'chunked',
 'close',
 'closed',
 'code',
 'debuglevel',
 'detach',
 'fileno',
 'flush',
 'fp',
 'getcode',
 'getheader',
 'getheaders',
 'geturl',
 'headers',
 'info',
 'isatty',
 'isclosed',

In [38]:
words = shakespeare.read().decode().split()

In [39]:
print(type(words))

<class 'list'>


In [42]:
title = words[0:3]

In [43]:
body = words[3:]

In [44]:
print(body[:10])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']



__Indexing Operator__

The indexing operator (`[ ]`) selects one or more elements from a sequence. Each element of a sequence is assigned a number: its position, or index. The index must be an integer and is written inside a pair of square brackets. 

The operation that extracts a subsequence is called __slicing__. When selecting more than one element, the __`:` operator__ is used, with an integer before it to indicate where to start and an integer after it to indicate where to stop. The element at the stop position is not included.

Python indexing starts at 0 and ends at (n-1), where n refers to the number of items in the sequence. The function `len` can be used to get the number of items in a list. 

Negative indexing is also supported by Python: placing `-` before the integer counts from the end of the sequence, so index -1 is the last item.

In [45]:
n_words = len(body)
print( n_words )

980634


In [47]:
print( body[980634])

IndexError: ignored

In [48]:
print( body[980633])

.


In [49]:
sub_body = body[:10]
print( sub_body)

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']


In [50]:
print( sub_body[:-2])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour']


In [51]:
print( sub_body[::2])    # gives every 2nd element

['Now', 'fair', ',', 'nuptial', 'Draws']
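
The cells above show slices from the front of the list. As a complement, here is a short sketch of negative indexing, using the same ten tokens typed out by hand so that it runs on its own.

```python
# The same ten tokens as sub_body above, typed out so the sketch is self-contained.
sub_body = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']

last_item = sub_body[-1]      # same element as sub_body[len(sub_body) - 1]
last_three = sub_body[-3:]    # from the third-to-last item through the end
reversed_copy = sub_body[::-1]  # a step of -1 walks the list backwards

print(last_item)
print(last_three)
print(reversed_copy)
```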


__Python Syntax__

Syntax refers to the structure of the language. 

The end of a statement does not require a semicolon or any other symbol; the end of the line ends the statement. However, a semicolon allows you to put two separate statements on the same line. 

Indentation (the whitespace at the start of a line) matters in Python. A block of code is a set of statements that should be treated as a unit, and a code block in Python is denoted by its indentation. For example, in compound statements such as loops and conditionals, after the colon we must start a new line and indent it. Four spaces per level is the usual convention, but any amount works as long as it is consistent within the block (the cell below uses two). Whitespace __within__ a line, however, generally does not matter.  

Comments about code can be made using the hash symbol `#`. Anything written after `#` on a line is ignored by the interpreter. Python does not have any dedicated syntax for multi-line comments. 

In [52]:
sub_body_lowercase = []
for word in sub_body:
  sub_body_lowercase.append(word.lower())
  #print(sub_body_lowercase)
#print(sub_body_lowercase)
sub_body_lowercase

['now', ',', 'fair', 'hippolyta', ',', 'our', 'nuptial', 'hour', 'draws', 'on']
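
The loop above builds the lowercase list one item at a time. A list comprehension is a more compact way to express the same idea; the sketch below repeats the ten tokens so that it can run by itself.

```python
# The same ten tokens as sub_body above, typed out so the sketch is self-contained.
sub_body = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']

# One line replaces the empty list, the for loop, and the append call.
sub_body_lowercase = [word.lower() for word in sub_body]

print(sub_body_lowercase)
```

Both versions produce the same list; which one reads more clearly is largely a matter of taste.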

## Importing an HTML file using an http: request
The previous two files that we imported were _plain text_ files. In other words, there is little to no descriptive encoding. However, we can also use the `urllib` package to import an .html file directly from the web. We can actually do this with just a few lines of code. First, we import the `urllib` package, and specifically the `request` module. We assign the URL we want to work with to a variable. Next, we pass the URL to the `urllib.request.urlopen` function and, at the same time, "read" the response. The result becomes the variable `html`. When we print the variable `html`, we discover that all of the HTML from the page has been pulled into it as bytes (note the `b'` at the start of the output). Unfortunately, it doesn't look very clean. 

In [53]:
# Now import the bibliography page from Colored Conventions in HTML
import urllib.request
anotherurl='http://coloredconventions.org/exhibits/show/bishophmturner'

In [54]:
html = urllib.request.urlopen(anotherurl).read()
print(html)

b'<!DOCTYPE html>\n<!--[if IE 6]>\n<html id="ie6" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 7]>\n<html id="ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 8]>\n<html id="ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->\n<html lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<!--<![endif]-->\n<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n\t\t\t\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<link rel="pingback" href="https://coloredconventions.org/before-garvey-mcneal-turner/xmlrpc.php" />\n\n\t\t<!--[if lt IE 9]>\n\t<script src="https://coloredconventions.o

If you are interested in doing text analysis of a webpage, and the only way to ingest the web page is with HTML included, what are things you might need to learn to do to separate the HTML tags from the text? Look at the code above and write a short description of what might need to stay and what might need to be extracted. Should the extracted data be preserved or discarded? 

# Importing Data by Webscraping with BeautifulSoup
If you are interested in scraping data from the open web, BeautifulSoup is a Python package worth exploring in detail. For our purposes here, though, we're going to consider how to use Beautiful Soup to turn "unstructured" data into "structured" data. As you read through this section, consider Muñoz and Rawson's argument about data cleaning. Is there a need for the data to stay unstructured? What is the value of cleaning? 

In [55]:
import requests
from bs4 import BeautifulSoup

In [56]:
# Specify url: url
url4 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url4)

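# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming a parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning)
soup = BeautifulSoup(html_doc, 'html.parser')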

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

<!DOCTYPE html>
<html lang="en" xmlns:addthis="https://www.addthis.com/help/api-spec" xmlns:fb="https://www.facebook.com/2008/fbml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://coloredconventions.org/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <style id="et-builder-googlefonts-cached-inline">
   /* Original: https://fonts.googleapis.com/css?family=Oswald:200,300,regular,500,600,700|Open+Sans:300,regular,500,600,700,800,300italic,italic,500italic,600italic,700italic,800italic&#038;subset=cyrillic,cyrillic-ext,latin,latin-ext,vietnamese,greek,greek-ext,hebrew&#038;display=swap *//* User Agent: Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1 */@font-face {font-family: 'Open Sans';font-style: italic;font-weight: 300;font-st

Compare the text imported using the "webscraping" method included with BeautifulSoup versus the option of importing the entire file using `urllib`. 

## Cleaning up Webscraped text

In [57]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url5 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url5)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Colored Conventions' webpage: ccc_title
ccc_title = soup.title

# Print the title of Colored Conventions' webpage to the shell
print(ccc_title)


<title>Press &amp; Notices - Colored Conventions Project</title>


In [None]:
# Get Colored Conventions' text: ccc_text
ccc_text = soup.get_text()

# Print CCC's text 
print(ccc_text)

In [None]:
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
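
To see roughly what is involved in pulling text and links out of markup, here is a sketch that uses only the standard library's `html.parser` module instead of BeautifulSoup. The class name `TextAndLinks`, the sample HTML, and the `example.org` addresses are all invented for this illustration; BeautifulSoup does considerably more than this.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collects visible text and href values while the parser walks the tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


# Invented sample page, used only for illustration.
sample_html = (
    "<html><head><title>Press</title></head>"
    "<body><p>Read the <a href='https://example.org/a'>first</a> and "
    "<a href='https://example.org/b'>second</a> notices.</p></body></html>"
)

parser = TextAndLinks()
parser.feed(sample_html)
parser.close()

print(parser.links)
print(" ".join(parser.text_parts))
```

The tags themselves are discarded, the text between them is kept, and the `href` attribute of each `a` tag is saved separately. That is the same division of labour that `get_text()` and `find_all('a')` perform in the cells above.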

Explain what the value is of importing HTML files using BeautifulSoup. How does this relate to the concerns that Rawson and Muñoz raise in their article? 

## Access data using an API
In the following exercise, you will import data from the Chronicling America API. You will set parameters for what content and keywords to pull in, then you will send the request to the server. After you import the data, you'll organize and clean it up--in other words, when you get your search results, they will come packaged in a format called JSON. We will ingest the JSON, turn it into a dictionary, and then turn part of that dictionary into a pandas DataFrame. All we're doing when we turn text data into a dataframe is organizing the metadata and the files into a format that can be used and acted upon in order to do other kinds of analysis. 

In [None]:
# Make the Requests module available
import requests
import pandas as pd

In [None]:
# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

In [None]:
# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
params = {
    'proxtext': 'poetry' # Search for this keyword -- feel free to change!
    
}

(Later on, you will be asked to return to the above cell and change the search parameters. You do this by replacing `poetry` with `yourterm`.)

In [None]:
# This adds a value for 'format' to our dictionary
params['format'] = 'json'

# Let's view the updated dictionary
params

{'proxtext': 'poetry', 'format': 'json'}
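
When the request is sent, the `params` dictionary is turned into the query string at the end of the URL. The sketch below reproduces that step by hand with the standard library's `urllib.parse.urlencode`, purely to show how the dictionary and the URL relate; `requests` handles this for you.

```python
from urllib.parse import urlencode

api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'
params = {'proxtext': 'poetry', 'format': 'json'}

# Each key/value pair becomes key=value, and the pairs are joined with '&'.
query_string = urlencode(params)
full_url = api_search_url + '?' + query_string

print(full_url)
```

Changing the value of `proxtext` changes only the part of the URL after `proxtext=`.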

In [None]:
# This sends our request to the API and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# This shows us the url that's sent to the API
print('Here\'s the formatted url that gets sent to the ChronAmerica API:\n{}\n'.format(response.url))

# This checks the status code of the response to make sure there were no errors
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))
    print('Try running this cell again.')

Here's the formatted url that gets sent to the ChronAmerica API:
https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=poetry&format=json

All ok


In [None]:
# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()

In [None]:
# Let's prettify the raw JSON data and then display it.

# We're using the Pygments library to add some colour to the output, so we need to import it
import json
from pygments import highlight, lexers, formatters

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)


The output of the above cell will be quite long. Before turning in this assignment, please delete the cell above so the file you turn in is not difficult to read. Thank you!
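
The question below refers to a variable named `outfile`, which no cell above defines. Judging from the printed output (`name='data.json' mode='w'`), it was probably created while saving the JSON results to a file. The following is a sketch under that assumption, using a tiny invented dictionary in place of the real API results and a temporary folder in place of the Colab content folder.

```python
import json
import os
import tempfile

# Hypothetical miniature of the API results, so the sketch runs without a request.
data = {"totalItems": 2, "items": [{"title": "The sun."}, {"title": "Evening star."}]}

out_path = os.path.join(tempfile.mkdtemp(), "data.json")

# Opening with mode 'w' gives a file object for writing; json.dump writes into it.
with open(out_path, "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, indent=2)
    type_name = type(outfile).__name__

# Read the file back to confirm it holds the same dictionary.
with open(out_path, encoding="utf-8") as infile:
    reloaded = json.load(infile)

print(type_name)
print(reloaded == data)
```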

What kind of data type is `outfile`?

In [None]:
print(outfile)

<_io.TextIOWrapper name='data.json' mode='w' encoding='UTF-8'>


In [None]:
# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()


In [None]:
type(data)

dict
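
Calling `response.json()` parses the JSON text of the response into ordinary Python objects, much as `json.loads` does for a string. The sketch below uses a heavily shortened, hand-written sample in the same general shape as the search results (only two fields per item) to show how the nesting of dictionaries and lists can be navigated.

```python
import json

# Hypothetical, heavily shortened response body in the same shape as the API results.
raw_json = (
    '{"totalItems": 2, "itemsPerPage": 20, "items": ['
    '{"title": "The sun. [volume]", "date": "19130504"}, '
    '{"title": "Evening star. [volume]", "date": "19480926"}]}'
)

data = json.loads(raw_json)

# The top level is a dictionary; its 'items' value is a list of dictionaries.
items = data["items"]
first_title = items[0]["title"]

print(type(data).__name__, type(items).__name__)
print(first_title)
```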

In the cell below, we will take the nested dictionary, which mirrors the structure of the JSON, and convert it into a DataFrame. 

In [None]:
pd.DataFrame.from_dict(data)

Unnamed: 0,totalItems,endIndex,startIndex,itemsPerPage,items
0,419070,20,1,20,"{'sequence': 25, 'county': ['New York'], 'edit..."
1,419070,20,1,20,"{'sequence': 131, 'county': [None], 'edition':..."
2,419070,20,1,20,"{'sequence': 15, 'county': ['Prince George's']..."
3,419070,20,1,20,"{'sequence': 17, 'county': ['Cook County'], 'e..."
4,419070,20,1,20,"{'sequence': 13, 'county': ['Cook County'], 'e..."
5,419070,20,1,20,"{'sequence': 9, 'county': ['Cook County'], 'ed..."
6,419070,20,1,20,"{'sequence': 3, 'county': ['Cook County'], 'ed..."
7,419070,20,1,20,"{'sequence': 1, 'county': ['Fayette', 'Hamilto..."
8,419070,20,1,20,"{'sequence': 93, 'county': [None], 'edition': ..."
9,419070,20,1,20,"{'sequence': 24, 'county': ['Cook County'], 'e..."


You may notice that a lot of the cells repeat the same data over and over again. What do you think is showing up in each row and column? 

In [None]:
pd.DataFrame.from_dict(data, orient='index')

Unnamed: 0,0
totalItems,419070
endIndex,20
startIndex,1
itemsPerPage,20
items,"[{'sequence': 25, 'county': ['New York'], 'edi..."


If we switch the layout of the dataframe, it becomes easier to see how the labels for the dataframe are different from the many items in the `items` observation. We can try to use the pandas function `json_normalize` to flatten the dictionary into columns. 


In [None]:
df = pd.json_normalize(data)
df.columns

Index(['endIndex', 'items', 'itemsPerPage', 'startIndex', 'totalItems'], dtype='object')

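The `MultiIndex` step below splits any dotted column name (such as `a.b`) into separate levels. None of the column names here contain a dot, so the layout stays the same: a single observation, with the entire `items` list held in one cell. 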

In [None]:
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('.')) for c in df.columns])
df

Unnamed: 0,endIndex,items,itemsPerPage,startIndex,totalItems
0,20,"[{'sequence': 25, 'county': ['New York'], 'edi...",20,1,419070


In [None]:
json=pd.DataFrame.from_dict(data)

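If we name the dataframe `json`, we can loop over it to return its column labels, which are the keys of the dictionary `data`. (Be aware that this name replaces the `json` module imported earlier, so `json.dumps` will no longer work until the module is imported again.)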

In [None]:
for key in json:
    print(key)

totalItems
endIndex
startIndex
itemsPerPage
items


The `.tail()` method returns just the last rows of the dataframe (in this case, the last 6).

In [None]:
json.tail(6)

Unnamed: 0,totalItems,endIndex,startIndex,itemsPerPage,items
14,419070,20,1,20,"{'sequence': 18, 'county': ['Douglas'], 'editi..."
15,419070,20,1,20,"{'sequence': 30, 'county': ['Cook County'], 'e..."
16,419070,20,1,20,"{'sequence': 38, 'county': ['New York'], 'edit..."
17,419070,20,1,20,"{'sequence': 9, 'county': ['Cook County'], 'ed..."
18,419070,20,1,20,"{'sequence': 9, 'county': ['Cook County'], 'ed..."
19,419070,20,1,20,"{'sequence': 9, 'county': ['Cook County'], 'ed..."


The `shape` attribute shows how many rows and how many columns are in your dataframe.

In [None]:
json.shape

(20, 5)

Ok, we have lots of differently shaped data objects now. Let's see what the differences are. In the first case, if we take the variable `json`, which is a DataFrame, and we print its `items` column, we get a pandas Series in which each entry is a dictionary describing one search result.

In [None]:
print(json['items'])

0     {'sequence': 25, 'county': ['New York'], 'edit...
1     {'sequence': 131, 'county': [None], 'edition':...
2     {'sequence': 15, 'county': ['Prince George's']...
3     {'sequence': 17, 'county': ['Cook County'], 'e...
4     {'sequence': 13, 'county': ['Cook County'], 'e...
5     {'sequence': 9, 'county': ['Cook County'], 'ed...
6     {'sequence': 3, 'county': ['Cook County'], 'ed...
7     {'sequence': 1, 'county': ['Fayette', 'Hamilto...
8     {'sequence': 93, 'county': [None], 'edition': ...
9     {'sequence': 24, 'county': ['Cook County'], 'e...
10    {'sequence': 41, 'county': [None], 'edition': ...
11    {'sequence': 42, 'county': [None], 'edition': ...
12    {'sequence': 40, 'county': [None], 'edition': ...
13    {'sequence': 9, 'county': ['Hennepin', 'Ramsey...
14    {'sequence': 18, 'county': ['Douglas'], 'editi...
15    {'sequence': 30, 'county': ['Cook County'], 'e...
16    {'sequence': 38, 'county': ['New York'], 'edit...
17    {'sequence': 9, 'county': ['Cook County'],

When we print `data` itself, we see the original dictionary.

In [None]:
print(data)

{'totalItems': 419070, 'endIndex': 20, 'startIndex': 1, 'itemsPerPage': 20, 'items': [{'sequence': 25, 'county': ['New York'], 'edition': None, 'frequency': 'Daily', 'id': '/lccn/sn83030272/1913-05-04/ed-1/seq-25/', 'subject': ['New York (N.Y.)--Newspapers.', 'New York (State)--New York County.--fast--(OCoLC)fst01234953', 'New York (State)--New York.--fast--(OCoLC)fst01204333', 'New York County (N.Y.)--Newspapers.'], 'city': ['New York'], 'date': '19130504', 'title': 'The sun. [volume]', 'end_year': 1916, 'note': ['A facsimile of Vol. 1, no. 1 (Sept. 3, 1833) issued by The Sun (New York, N.Y. : 1920) on Sept. 2, 1933.', 'Also issued on microfilm by New York Public Library.', 'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Evening eds.: Evening sun (New York, N.Y. : 1852), <1852>, and: Evening sun (New York, N.Y. : 1887), 1887-1916.', 'Publisher varies: Benjamin H. Day & George W. Wisner, 1833-1835; Benjamin H

When we "normalize" the dictionary key `items`, we turn it into a dataframe, and when we call the dataframe, we get the contents of this item in the dictionary in a dataframe format. Keys are at the top of each column.

In [None]:
json_file = pd.json_normalize(data['items'])

In [None]:
json_file

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,...,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,"[Extra sun, New York sun]",nn_ehrlich_ver02,[New York],New York,[New York],19130504,,,1916,Daily,...,Benj. H. Day,THIRD SECTION SUBURBAN REAL ESTATE SECTION,25,1833,[New York],"[New York (N.Y.)--Newspapers., New York (State...",The sun. [volume],sun.,page,https://chroniclingamerica.loc.gov/lccn/sn8303...
1,"[Star, Sunday star]",dlc_2goncharova_ver03,[Washington],District of Columbia,[None],19480926,,,1972,Daily,...,W.D. Wallach & Hope,,131,1854,[District of Columbia],"[Washington (D.C.)--fast--(OCoLC)fst01204505, ...",Evening star. [volume],evening star.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
2,[Greenbelt],mdu_annapolis_ver01,[Greenbelt],Maryland,[Prince George's],19380824,,,1954,Weekly,...,[s.n.],,15,1937,[Maryland],"[Greenbelt (Md.)--Newspapers., Maryland--Green...",Greenbelt cooperator.,greenbelt cooperator.,page,https://chroniclingamerica.loc.gov/lccn/sn8906...
3,[],iune_echo_ver01,[Chicago],Illinois,[Cook County],19120206,,,1917,Daily (except Sunday and holidays),...,N.D. Cochran,,17,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
4,[],iune_golf_ver01,[Chicago],Illinois,[Cook County],19150202,,LAST EDITION,1917,Daily (except Sunday and holidays),...,N.D. Cochran,,13,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
5,[],iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19140204,,NOON EDITION,1917,Daily (except Sunday and holidays),...,N.D. Cochran,,9,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
6,[],iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19140304,,LAST EDITION,1917,Daily (except Sunday and holidays),...,N.D. Cochran,,3,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
7,"[Blade, Bluegrass blade]",kyu_dylan_ver01,"[Lexington, Cincinnati]",Kentucky,"[Fayette, Hamilton]",19080209,,,1999,Weekly,...,Blade Pub. Co.,,1,1880,"[Kentucky, Ohio]","[Fayette County (Ky.)--Newspapers., Kentucky--...",Blue-grass blade. [volume],blue-grass blade.,page,https://chroniclingamerica.loc.gov/lccn/sn8606...
8,"[Star, Sunday star]",dlc_1noguchi_ver01,[Washington],District of Columbia,[None],19390122,,,1972,Daily,...,W.D. Wallach & Hope,,93,1854,[District of Columbia],"[Washington (D.C.)--fast--(OCoLC)fst01204505, ...",Evening star. [volume],evening star.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
9,[],iune_hotel_ver01,[Chicago],Illinois,[Cook County],19151214,,LAST EDITION,1917,Daily (except Sunday and holidays),...,N.D. Cochran,,24,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
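
The full table has many columns, and several of them hold lists rather than single values. One possible tidying step, sketched here in plain Python, is to keep only a few fields per record and join list values into strings. The three records are shortened copies of rows shown above; the choice of fields in `wanted` is just an example.

```python
# Shortened records shaped like entries in data['items'] (values copied from the table above).
items = [
    {"title": "The sun. [volume]", "date": "19130504", "city": ["New York"]},
    {"title": "Evening star. [volume]", "date": "19480926", "city": ["Washington"]},
    {"title": "Greenbelt cooperator.", "date": "19380824", "city": ["Greenbelt"]},
]

# Example selection of fields; any keys present in the records could be used.
wanted = ["title", "date", "city"]
rows = [{key: item.get(key) for key in wanted} for item in items]

# List-valued fields such as 'city' can be joined into plain strings.
for row in rows:
    row["city"] = ", ".join(row["city"])

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to get a smaller dataframe with one column per selected field.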


## Reflection
In this exercise, you queried an API from Chronicling America and pulled in files that included the search term "poetry." Those files were then cleaned and made slightly more tidy by highlighting the "keys" to the dictionary, and then taking one small section of the dictionary and turning it into a dataframe. In a markdown section, look over what you have done, and try changing the search *parameter* at the top of the exercise. What changes when you rerun the activity? What is "messy" about the file that makes it hard to work with? What is "clean" about the file that makes it easier to work with? 

In [63]:
import urllib, urllib.request
url = 'http://export.arxiv.org/api/query?search_query=all:genetics&start=0&max_results=4'
data = urllib.request.urlopen(url)
print(data.read().decode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3Agenetics%26id_list%3D%26start%3D0%26max_results%3D4" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:genetics&amp;id_list=&amp;start=0&amp;max_results=4</title>
  <id>http://arxiv.org/api/0dBGo2AFcn6SdKSuro8QOPE3+RU</id>
  <updated>2023-03-13T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">8478</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">4</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/1209.4847v1</id>
    <updated>2012-09-21T15:44:19Z</updated>
    <published>2012-09-21T15:44:19Z</published>
    <title>The new classes of the genetic algorithms are defined