<a href="https://colab.research.google.com/github/lmrhody/femethodsS23/blob/main/week7_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 7: Jupyter Notebook Assignment - Working with Data

Fill out the cell below with your information. 

* Student Name: Michael Smith
* Date: 3/20/23
* Instructor: Lisa Rhody
* Assignment due: 
* Methods of Text Analysis
* MA in DH at The Graduate Center, CUNY

## Objectives
The purpose of this notebook is to get some hands-on experience putting what you've seen in tutorials about importing and working with text in Python into practice. You'll also be asked to put the reading you've been doing all semester into conversation with the process of importing, cleaning, and preparing data. 

The object of the notebooks this week is: 
* To practice several ways of importing text into your Python environment to study; 
* To become more familiar with various pipelines for cleaning and preparing data for text analysis; 
* To consider the challenges that the availability and scarcity of data present to the literary scholar (and to consider how other kinds of research might also need to address similar issues); 
* To connect examples of real-world text analysis projects with the practical process of cleaning and preparing data. 

# Getting Started
We're going to start by importing some important libraries for working with text data. 

In [1]:
import nltk
import numpy as np
import pandas as pd
import urllib
import pprint

# Importing Data
So far, we have worked with data during the Datacamp exercises, but that was a much more controlled environment. When you are doing your own text analysis project, the process will be much messier. This week's reading included several pieces about what cleaning takes place and some of the challenges that data presents when working with text. In particular, we're looking at text analysis from a humanities / literary perspective; however, one might argue that these challenges are more similar to those of text analysis in the social sciences or with non-fiction work than they appear on the surface. 

In this lesson, we'll practice importing data: 
* from a file already on your computer (using a directory path); 
* from a file on the web using a URL request; 
* from a file on the web using Beautiful Soup. 


### Loading data from a flat file on your local computer
Before you get started, be sure to download this file onto your local computer and save it as herland.txt. 

Next, we're going to import `herland.txt` using an upload function that is part of the google.colab Python package. This function will open a button under the cell that you can use to "Choose Files" from your local computer. Choose the `herland.txt` file and then upload it. The for loop below will print out what the name of the file is that you are saving to the Google Colab content folder. 

In [4]:
# Running from a locally hosted notebook so the upload of the text is unnecessary

# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#   print("User uploaded file '{name}' with length {length} bytes".format(
#       name=fn, length=len(uploaded[fn])))

To find the file you just uploaded, look to the left side of this browser window. Click on the icon of a file folder. A directory structure should open. Click on the arrow next to `content` and you should see your uploaded file appear inside. 

Then we're going to use the Python function `open()`. The following code says that we want to `open` the file `herland.txt` so we can read it (argument `mode='r'`). Opening the file gives us a file object, which we assign to the variable `herland`. We then call `.read()` to copy the file's contents into a string called `hertext`, and finally we close the file.


In [2]:
filename = 'herland.txt'
herland = open(filename, mode='r')
hertext = herland.read()
herland.close()

Another way to read the text from a file into Python is to use a "context manager." The following tells Python that, with the `herland.txt` file open, it should create a variable called `file` for the open file object and read in the text; the file is closed automatically when the indented block ends. The `print` call then displays the text. Printing the whole of *Herland* would be a lot of text, so the cell below slices the result to show only the first 500 characters. 

In [4]:
# Here is how you print a string from a file without having to close the file using a context manager

with open('herland.txt','r') as file:
    print(file.read()[:500])

The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: Engl


In [5]:
# If you don't want to save the text of the file, but just want to peek into it to see what's there, you could use this method. 

with open('herland.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman



This eBook is for the use of anyone anywhere at no cost and with



### What happens when you import a flat file? 

The python function `type()` will return to you output that explains the data type you are working with. When you pass the new text object `herland` through the `type()` function below, what response do you get? The response will look different from other data types that you've used before. In this case, it is read in as a "file object." Remember that Python won't know how to handle data unless it fits a particular data type that the computer expects when passing a function to it. In the next input, we ask Python for the length of the file. This will throw an error. Why do you think that is? 

In [6]:
# herland is a file object, not a string. 
type(herland)

_io.TextIOWrapper

In [7]:
# herland is a file object, not a string, so len(herland) would raise a TypeError.
# hertext holds the string returned by herland.read(), so len() works on it.
len(hertext)

315999

#### Response here: `herland` is an I/O text stream object opened from the `herland.txt` file. To get the length of the text, the stream must first be read into a string with the `.read()` method, as was done with `hertext = herland.read()`. Calling `len()` on `hertext` returns 315,999 characters.

We had to go through a process to convert the file object to a string. 
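The difference between the file object and the string can be sketched on a small scale. This is only an illustration: the temporary file and its two lines of content are stand-ins invented for the example, not the real `herland.txt`.

```python
import os
import tempfile

# Write a small stand-in file so the sketch runs anywhere (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Herland\nby Charlotte Perkins Stetson Gilman\n")
    sample_path = tmp.name

sample_file = open(sample_path, mode="r", encoding="utf-8")
try:
    len(sample_file)               # a file object has no length
    length_error = None
except TypeError as err:
    length_error = str(err)        # keep the message so we can inspect it

sample_text = sample_file.read()   # .read() returns the contents as a str
sample_file.close()
os.remove(sample_path)             # clean up the stand-in file

print(type(sample_text).__name__, len(sample_text))
print(length_error)
```

The `try`/`except` is there only so the cell keeps running after the error; in practice you would simply call `len()` on the string.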

Looking at the cells below, which variable should return `type()` as a string? (The answer is in the cell below.) 

In [8]:
# but hertext is a different datatype. How would you check? 
type(hertext)

str

Once you have a string, there are a number of functions that you can make use of. One of those is the `len()` command, which you can run below. 

In [9]:
# How many characters are in the hertext string? 
len(hertext)

315999

Once an object is recognized as a string, you can begin manipulating it. For example, you could count the number of times the sequence of characters "her" appears within the entire text of _Herland_.

In [10]:
hertext.count('her', 0, -1)

1244
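Note that `str.count` matches the characters "her" anywhere, including inside longer words. A minimal sketch of the difference, using an invented sample sentence rather than the novel itself, and a regular expression with word boundaries as one possible way to count whole words:

```python
import re

# Invented sample sentence, not taken from Herland.
sample = "Her brother gathered there; she told her mother."

# Case-sensitive substring count: also matches inside "brother", "there", etc.
substring_hits = sample.count("her")

# Whole-word count, ignoring case: \b marks a word boundary.
whole_word_hits = len(re.findall(r"\bher\b", sample, flags=re.IGNORECASE))

print(substring_hits, whole_word_hits)
```

Here the substring count is 5 while the whole-word count is 2, which suggests the 1244 above overstates how often the word "her" itself occurs.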

The ability to count characters, words, n-grams, etc. means that we can also more easily target specific sections of the text. For example, when you print to your screen the opening of the herland file, you notice that it is accompanied with metadata. For the purposes of text analysis, what would be the advantages or disadvantages of removing the metadata associated with _Herland_?

Project Gutenberg text files include a metadata heading at the start of every text file. For text analysis we may wish to exclude this heading from word studies, as it is a description of the text, not the text itself. 

In [11]:
# What is happening at the beginning of the herland.txt file, though? We can check to see by using an index. 
print(hertext[:500])

The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: Engl


There is a Python library for removing the Gutenberg header and footer metadata. Below is the installation and import process, with a sample from `hertext` after the header and footer have been removed.

In [12]:
pip install gutenberg-cleaner

Note: you may need to restart the kernel to use updated packages.


In [13]:
from gutenberg_cleaner import simple_cleaner, super_cleaner

herclean = simple_cleaner(hertext)

In [14]:
herclean[:500]

'\n\n\n\n\n\n\n\n\n\nHERLAND\n\nby Charlotte Perkins Stetson Gilman\n\n\n\n\nCHAPTER 1. A Not Unnatural Enterprise\n\n\nThis is written from memory, unfortunately. If I could have brought with\nme the material I so carefully prepared, this would be a very different\nstory. Whole books full of notes, carefully copied records, firsthand\ndescriptions, and the pictures--that’s the worst loss. We had some\nbird’s-eyes of the cities and parks; a lot of lovely views of streets,\nof buildings, outside and in, and some of those '
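To make the cleaning step less of a black box, here is a hand-rolled sketch of the same idea. It is not how `gutenberg_cleaner` works internally; it simply assumes the file contains marker lines beginning with `*** START OF` and `*** END OF`, and the function name and sample text are invented for the example.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start_marker = "*** START OF"   # assumed marker prefix
    end_marker = "*** END OF"       # assumed marker prefix

    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1 or end < start:
        return text                 # markers missing: leave the text untouched

    body_start = text.find("\n", start)   # skip the rest of the START line
    if body_start == -1 or body_start > end:
        return text
    return text[body_start:end].strip()


# Invented miniature "ebook" used only to exercise the function.
sample = (
    "Title: Herland\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***\n"
    "CHAPTER 1. A Not Unnatural Enterprise\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK HERLAND ***\n"
    "License text...\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Writing the rule out by hand has one advantage for the questions raised in this week's reading: the decision about what counts as "not the text" is visible and can be documented.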

Working with a string is *more* helpful than simply working with a text object, but there are other things that we can do to the text to make it more easily manipulated in Python and NLTK. For example, when you're working with a string, it's not easy to count whole words. The NLTK word tokenizer function, however, will take a string and turn it into "tokens"--discrete segments of characters. Tokenized strings become a new data type--a list. 

In [16]:
hertokens = nltk.word_tokenize(herclean)
type(hertokens)

list

A tokenized list can be called, acted upon, and manipulated differently than a string. If we call just the first 15 tokens (index positions 0 through 14), here is what you would get:

In [17]:
hertokens[:15]

['HERLAND',
 'by',
 'Charlotte',
 'Perkins',
 'Stetson',
 'Gilman',
 'CHAPTER',
 '1',
 '.',
 'A',
 'Not',
 'Unnatural',
 'Enterprise',
 'This',
 'is']

In [18]:
text1 = nltk.Text(hertokens)

In [19]:
type(text1)

nltk.text.Text

In [20]:
len(text1)

65090

In [21]:
text1[1000:1025]

['for',
 'weeks',
 'past',
 ',',
 'the',
 'same',
 'taste',
 '.',
 'I',
 'happened',
 'to',
 'speak',
 'of',
 'that',
 'river',
 'to',
 'our',
 'last',
 'guide',
 ',',
 'a',
 'rather',
 'superior',
 'fellow',
 'with']

### Review
When you import text from a flat file that is saved on your local computer, what will you need to do in order to select parts of the text using an index? 

A list can be indexed by token only after the single text string is tokenized into a list of strings. The tokenizer uses spaces and punctuation as delimiters to separate each portion of the string. Once we have a list, index values can be used to select a single string in the list or a range of the list.
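A small comparison may help here: a string can also be indexed, but by character, whereas a token list is indexed by word. The regular expression below is a rough stand-in for a tokenizer, chosen so the sketch runs without NLTK data; it is not `nltk.word_tokenize` and will split some text differently.

```python
import re

sentence = "This is written from memory, unfortunately."

# A string is indexed by character.
first_chars = sentence[:4]

# Rough stand-in tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# A list of tokens is indexed by token.
first_tokens = tokens[:2]

print(first_chars)
print(first_tokens)
print(len(sentence), len(tokens))
```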

## Ingesting data from a URL

Next, we're going to retrieve text directly from a URL with the `urllib` package.
To do this, we're going to call the package `urllib` and, specifically, use `urlretrieve` from it. First, we need to assign the address of the file to a variable. In this case, that variable is `url`. We're going to run `urlretrieve` with two parameters: the URL you want to import (which you assigned to the variable `url`) and the file name and extension to save it under. Here that is `203-0.txt`. If you pay attention to the output, you'll realize that you've imported the file as an object. 

In [22]:
from urllib import request

In [23]:
# I was having issues getting a value back from the suggested urlretrieve pattern, so I am using the urlopen pattern
# from https://stackoverflow.com/questions/61897926/project-gutenberg-accessing-text-with-url
url = "https://www.gutenberg.org/files/203/203-0.txt"

response = request.urlopen(url)
raw = response.read()
uncletom = raw.decode("utf-8-sig")
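For comparison, the `urlretrieve` pattern described in the text can be sketched without touching the network. The assumption here is that a local file addressed with a `file://` URL stands in for the remote Gutenberg file; the file names and the one line of content are invented for the example.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Stand-in for a remote file: a local file addressed with a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.txt"
source.write_text("UNCLE TOM'S CABIN\n", encoding="utf-8")
url = source.as_uri()

# urlretrieve copies the resource to a local file and
# returns a (local filename, headers) tuple, not the text itself.
destination = os.path.join(workdir, "203-0.txt")
saved_path, headers = urlretrieve(url, destination)

# The text still has to be read from the saved file.
with open(saved_path, encoding="utf-8") as f:
    downloaded = f.read()

print(type(saved_path).__name__, downloaded.strip())
```

This is the main practical difference from the `urlopen` cell above: `urlretrieve` leaves you with a file on disk, while `urlopen(...).read().decode(...)` leaves you with a string in memory.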

### Question:
Using what you've learned so far, how would you figure out what data type the file `uncletom` is? Add a cell below and show how you would find the answer. 

With the pattern updated to include a `read()` and `decode()` of the URL stream, the data type is a string.

In [24]:
type(uncletom)

str

Next, we're going to turn the text of Uncle Tom's Cabin into a list. A list is a mutable, ordered sequence of items. It can be indexed, sliced, and changed. Items in the list can be accessed through their index positions. 
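Those three properties (indexed, sliced, changed) can be sketched with a short sample phrase; the phrase and variable names are only illustrative. The last few lines contrast this with a string, which cannot be changed in place.

```python
words = "Late in the afternoon of a chilly day".split()

first = words[0]               # indexing: one item
middle = words[2:4]            # slicing: a sub-list
words[0] = words[0].lower()    # item assignment works because lists are mutable
words.append("in")             # lists can also grow

text = "Late"
try:
    text[0] = "l"              # strings are immutable, so this raises TypeError
    string_changed = True
except TypeError:
    string_changed = False

print(first, middle, words[0], len(words), string_changed)
```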

In [25]:
# first remove the header and footer of the Gutenberg text
uncleclean = simple_cleaner(uncletom)
words = uncleclean.split()
len(words)

180925

In [26]:
print(type(words))

<class 'list'>


Let's practice those steps again, but with a new file this time. 

In [27]:
from urllib.request import urlopen

shakespeare = 'http://composingprograms.com/shakespeare.txt'

print( type(shakespeare) )

<class 'str'>


In [28]:
shakespeare = 'http://composingprograms.com/shakespeare.txt'
# shakespeare = urlopen('http://composingprograms.com/shakespeare.txt')
response = request.urlopen(shakespeare)
raw = response.read()
shakespeare = raw.decode("utf-8-sig")

print(type(shakespeare))

<class 'str'>


In [15]:
# dir(shakespeare)

In [29]:
words = shakespeare.split()

In [30]:
print(type(words))

<class 'list'>


In [31]:
title = words[0:3]
title

['A', "MIDSUMMER-NIGHT'S", 'DREAM']

In [32]:
body = words[3:]

In [33]:
print(body[:10])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']



__Indexing Operator__

The indexing operator ([ ]) selects one or more elements from a sequence. Each element of a sequence is assigned a number: its position, or index. An index must be an integer and is written inside a pair of square brackets. 

The operation that extracts a subsequence is called __slicing__. When selecting more than one element, the __colon operator (:)__ is used with an integer before and after it to indicate where to start and where to stop, respectively. The element at the stop position is not included.

Python indexing starts at 0 and ends at (n-1), where n refers to the number of items in the sequence. The function `len` can be used to get the number of items in a list. 

Negative indexing is also supported by Python. Placing "-" before the integer value counts from the end of the sequence, so -1 refers to the last item.

In [34]:
n_words = len(body)
print( n_words )

980634


In [91]:
# body[980634] is out of range: the first index is 0, so the final index is len(body) - 1 (980633)
# print( body[980634])



In [35]:
print( body[980633])

.


In [36]:
sub_body = body[:10]
print( sub_body)

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']


In [37]:
print( sub_body[:-2])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour']


In [38]:
print( sub_body[::2])    # gives every 2nd element

['Now', 'fair', ',', 'nuptial', 'Draws']
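The cells above can be complemented with a sketch of negative indexing and of what happens when an index is out of range. The list is the same ten tokens shown above, retyped so the example stands alone; the `try`/`except` is only there to keep the cell from stopping on the error.

```python
sample = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']
n = len(sample)

last_by_position = sample[n - 1]   # final valid index is n - 1
last_by_negative = sample[-1]      # -1 counts back from the end
last_three = sample[-3:]           # slice from the third-to-last item onward

try:
    sample[n]                      # one past the end
    out_of_range = False
except IndexError:
    out_of_range = True

print(last_by_position, last_by_negative, last_three, out_of_range)
```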


__Python Syntax__

Syntax refers to the structure of the language. 

The end of a statement does not require a semicolon or other symbol; a statement normally ends at the end of its line. However, a semicolon lets you place two separate statements on the same line. 

Indentation, i.e. leading whitespace, matters in Python. A block of code is a set of statements that is treated as a unit. Code blocks in Python are denoted by indentation. For example, in compound statements such as loops and conditionals, the body usually starts on a new line after the colon and must be indented consistently; four spaces per level is the usual convention, although any consistent amount works. Whitespace __within__ a line generally does not matter, however.  

Comments about code can be made using the hash symbol #. Anything written after # on that line is ignored by the interpreter. Python does not have any dedicated syntax for multi-line comments. 
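A compact sketch of these syntax points, with invented variable names. The triple-quoted string is not a true comment; it is an ordinary string expression that Python evaluates and discards, which is why it is sometimes used as a stand-in for a multi-line comment.

```python
count = 0; label = "words"     # two statements on one line (legal, rarely recommended)

for token in ["Now", ",", "fair"]:
    if token.isalpha():        # this indented line belongs to the for loop
        count += 1             # deeper indentation: runs only when the test passes

"""
A bare triple-quoted string like this is evaluated and discarded.
It is often used in place of a multi-line comment.
"""

summary = f"{count} {label}"
print(summary)
```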

In [39]:
sub_body_lowercase = []
for word in sub_body:
  sub_body_lowercase.append(word.lower())
  #print(sub_body_lowercase)
#print(sub_body_lowercase)
sub_body_lowercase

['now', ',', 'fair', 'hippolyta', ',', 'our', 'nuptial', 'hour', 'draws', 'on']
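The same lowercasing loop is often written as a list comprehension. This is offered as an alternative style rather than a correction; the second line goes one step further and keeps only alphabetic tokens, which is an extra cleaning decision not made in the cell above.

```python
sub_body = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']

# Build the lowercased list in a single expression.
sub_body_lowercase = [word.lower() for word in sub_body]

# Optional extra step: drop tokens that are not purely alphabetic (e.g. commas).
words_only = [word for word in sub_body_lowercase if word.isalpha()]

print(sub_body_lowercase)
print(words_only)
```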

## Importing an HTML file using an http: request
The previous two files that we imported were _plain text_ files. In other words, there is little to no descriptive encoding. However, we can also use `urllib` to import an .html file directly from the web. We can actually do this with just a few lines of code. First, we import the `urllib` package, and specifically the `request` module. We assign the URL we want to work with to a variable. Next, we pass the URL to the `urllib.request.urlopen` function and, at the same time, "read" the response. The output becomes the variable `html`, which holds the raw bytes of the page (note the `b'...'` prefix in the printed output). When we print the variable `html`, we discover that all of the HTML from the page has been pulled into it. Unfortunately, it doesn't look very clean. 

In [40]:
# Now import the bibliography page from Colored Conventions in HTML
import urllib.request
anotherurl='http://coloredconventions.org/exhibits/show/bishophmturner'

In [41]:
html = urllib.request.urlopen(anotherurl).read()
print(html[0:500])

b'<!DOCTYPE html>\n<!--[if IE 6]>\n<html id="ie6" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 7]>\n<html id="ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 8]>\n<html id="ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if !('
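The leading `b'` in that output signals a `bytes` object rather than a string. A minimal sketch of the difference, using a short invented byte sequence instead of the real page and assuming the content is UTF-8 encoded:

```python
# Invented sample bytes; the real page is much longer.
raw = b'<!DOCTYPE html>\n<html lang="en-US">\n<title>Caf\xc3\xa9</title>'

first_byte = raw[0]               # indexing bytes gives an integer, not a character
html_text = raw.decode("utf-8")   # decoding turns bytes into a str
first_char = html_text[0]

print(type(raw).__name__, first_byte)
print(type(html_text).__name__, first_char)
print(html_text[-13:])
```

Decoding is also where the two bytes `\xc3\xa9` become the single character "é", so string methods and tokenizers see text rather than raw byte values.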


If you are interested in doing text analysis of a webpage, and the only way to ingest the web page is with HTML included, what are things you might need to learn to do to separate the HTML tags from the text? Look at the code above and write a short description of what might need to stay and what might need to be extracted. Should the extracted data be preserved or discarded? 

Webscraping projects require looking at the structure of the HTML document and finding the particular pieces that you're interested in parsing. On this page there is a div element with `class="et_pb_text_inner"` which contains the entirety of the credits portion of the webpage. If you can select this particular element by its class (assuming it's unique) and then strip the HTML tags that wrap the text, you would be successful at getting this data.
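The approach described in this response can be sketched with only the standard library's `html.parser`, before any scraping package is introduced. Everything here is illustrative: the `DivTextExtractor` class is a hypothetical helper written for this example, and the sample HTML is invented rather than copied from the Colored Conventions page.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside <div> elements carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0            # > 0 while inside a matching div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching div, or track a div nested inside one.
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits inside the target div.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


sample_html = (
    '<div class="menu"><a href="/about/">About</a></div>'
    '<div class="et_pb_text_inner"><h3>Credits</h3>'
    '<p>Curated by <em>students</em></p></div>'
)
parser = DivTextExtractor("et_pb_text_inner")
parser.feed(sample_html)
credits_text = " ".join(parser.chunks)
print(credits_text)
```

Notice what is discarded along the way: the menu text, the heading level, and the emphasis on "students" all disappear, which is exactly the kind of cleaning decision the question asks about.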

# Importing Data by Webscraping with BeautifulSoup
If you are interested in scraping data from the open web, BeautifulSoup is a Python package worth exploring in detail. For our purposes here, though, we're going to consider how to use Beautiful Soup to turn "unstructured" data into "structured" data. As you read through this section, consider Muñoz and Rawson's argument about data cleaning. Is there a need for the data to stay unstructured? What is the value of cleaning? 

In [42]:
import requests
from bs4 import BeautifulSoup

In [43]:
# Specify url: url
url4 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url4)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup[0:500])

<!DOCTYPE html>
<html lang="en" xmlns:addthis="https://www.addthis.com/help/api-spec" xmlns:fb="https://www.facebook.com/2008/fbml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://coloredconventions.org/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <style id="et-builder-googlefonts


Compare the text imported using the "webscraping" method included with BeautifulSoup versus the option of importing the entire file using URLLIB. 

## Cleaning up Webscraped text

In [44]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url5 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url5)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Get the title of Colored Conventions' webpage: ccc_title
ccc_title = soup.title

# Print the title of Colored Conventions' webpage to the shell
print(ccc_title)


<title>Press &amp; Notices - Colored Conventions Project</title>


In [45]:
# Get Colored Conventions' text: ccc_text
ccc_text = soup.get_text()
# Print CCC's text 
# print(ccc_text[:500])

# to get only the citations select the divs with the class used for each set of citations
mydivs = soup.find_all("div", {"class": "et_pb_text"})
for div in mydivs:
    print(div.get_text())



Press & Notices
Academic Journals
Fagan, Benjamin. “Chronicling White America.” American Periodicals: A Journal of History & Criticism 26, no. 1 (2016): 10–13.
Spires, Derrick R. “The Captive Stage: Performance and the Proslavery Imagination of the Antebellum North by Douglas A. Jones (review).” Early American Literature 51, no. 1 (2016): 200–205.
Eric Gardner. and Joycelyn Moody. “Introduction: Black Periodical Studies.” American Periodicals: A Journal of History, Criticism, and Bibliography 25.2 (2015): 105-111. Project MUSE. Web.
Joycelyn Moody. and Howard Rambsy II. “Guest Editors’ Introduction: African American Print Cultures.”MELUS: Multi-Ethnic Literature of the U.S. 40.3 (2015): 1-11. Project MUSE. Web.
Roundtable: The Colored Conventions Project, Fall 2015. 
The Colored Conventions Project and the Changing Same, by P. Gabrielle Foreman
Toward Meaning-making in the Digital Age: Black Women, Black Data and Colored Conventions, by Sarah Patterson
The Colored Conventions Movement

In [47]:
# Find all 'a' tags (which define hyperlinks) that have an href attribute: a_tags
a_tags = soup.find_all('a', href=True)

# Print the URLs to the shell (an 'a' tag without href would otherwise return None)
for link in a_tags:
    print(link['href'][0:200])

/about/book/
https://coloredconventions.org/
https://coloredconventions.org/
https://coloredconventions.org/about-conventions/
https://coloredconventions.org/about-conventions/
https://coloredconventions.org/about-records/
https://coloredconventions.org/about-conventions/submit-records/
https://coloredconventions.org/about-records/ccp-corpus/
https://coloredconventions.org/bibliography/
https://coloredconventions.org/exhibits/
https://coloredconventions.org/teaching/
https://coloredconventions.org/teaching/#teaching-partners
https://coloredconventions.org/curriculum/
https://coloredconventions.org/news/
https://douglassday.org/
https://douglassday.org/
https://coloredconventions.org/digblk/symposium-ccp-making-social-movement/
https://coloredconventions.org/news/mural-dedication-philadelphia/
https://coloredconventions.org/news/
https://coloredconventions.org/about/press-notices/
https://coloredconventions.org/about/videos/
https://coloredconventions.org/about/
https://coloredconventio
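The first link in that output, `/about/book/`, is relative, while the rest are absolute. One possible cleaning step is to resolve every link against the page address with `urljoin` from the standard library. The base URL below is an assumption (the page may redirect elsewhere), and the `None` entry models an `a` tag that has no `href` at all.

```python
from urllib.parse import urljoin

base_url = "https://coloredconventions.org/press"   # assumed base address
hrefs = ["/about/book/", "https://douglassday.org/", None]

# Skip missing hrefs and resolve the rest against the base URL.
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```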

# Questions for reflection

Explain what the value is of importing HTML files using BeautifulSoup. How does this relate to the concerns that Rawson and Muñoz raise in their article? Are there times when you might want to keep the HTML? 

BeautifulSoup provides a number of methods to interact with the HTML document's structure and extract particular pieces of text based on an understanding of HTML markup and its usage. HTML tags are themselves a method for structuring text. Tags that wrap text (H1, H2, H3, P, I, NAV, etc.) all have meaning and are chosen to separate headings, paragraphs, lists, tables, and more. Preserving an understanding of which texts were structured with different tags could help keep clear which text elements are emphasized over others, as well as the hierarchical structures used to organize the text in HTML.

Consider Rob Kitchin's criteria of "good data." Would these datasets satisfy his definition of "good data"? Why or why not? What kinds of questions could one ask about the Colored Conventions Project using what you've learned here? 

I believe Kitchin would be able to label webscraped data as 'good' if certain conditions were met. For example, if the process for deciding which information is pulled from a webpage, how this happens, and what is left behind were documented, I believe this would show the good data practice of being accountable in documenting the process. As well, if there is a clearly named relationship between the webscraper and the creator of the webpages, then this would likely be considered ethically and socially responsible. The same holds if the perspectives of both sets of stakeholders, who may differ on what makes the data good, are incorporated in the history of the data.