<a href="https://colab.research.google.com/github/lclarete/DHUM72500-FINAL-PORTFOLIO/blob/main/Clarete_Week7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 7: Jupyter Notebook Assignment - Working with Data
* Student Name: Livia Clarete
* Date: March 29 2023
* Assignment Due: March 2023
* Instructor: Lisa Rhody
* Methods of Text Analysis, Spring 2023

Fill out the cell below with your information. 

## Objectives
The purpose of this notebook is to get some hands-on experience putting what you've seen in tutorials about importing and working with text in Python into practice. You'll also be asked to put the reading you've been doing all semester into conversation with the process of importing, cleaning, and preparing data. 

The objectives of this week's notebook are: 
* To practice several ways of importing text into your Python environment to study; 
* To become more familiar with various pipelines for cleaning and preparing data for text analysis; 
* To consider the challenges that the availability and scarcity of data presents to the literary scholar (and to consider how other kinds of research might also need to address similar issues); 
* To connect examples of real-world text analysis projects with the practical process of cleaning and preparing data. 

# Getting Started
We're going to start by importing some important libraries for working with text data. 

In [None]:
import nltk
import numpy as np
import pandas as pd
import urllib
import pprint

# Importing Data
So far, we have worked with data during the Datacamp exercises, but that was a much more controlled environment. When you are actually doing your own text analysis project, you will have a much messier process. During this week's reading, you will have read several pieces about what cleaning takes place and some of the challenges that data presents when working with text. In particular, we're looking at text analysis from a humanities / literary perspective; however, one might argue that these challenges are more similar to the text analysis one might perform in the social sciences or with non-fiction work than might appear to be the case on the surface. 

In this lesson, we'll practice importing data: 
* from a file already on your computer (using a directory path); 
* from a file on the web using a URL request; 
* from a file on the web using Beautiful Soup. 


### Loading data from a flat file on your local computer
Before you get started, be sure to download this file onto your local computer and save it as herland.txt. 

Next, we're going to import `herland.txt` using an upload function that is part of the google.colab Python package. This function will open a button under the cell that you can use to "Choose Files" from your local computer. Choose the `herland.txt` file and then upload it. The for loop below will print out what the name of the file is that you are saving to the Google Colab content folder. 

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print("User uploaded file '{name} with length {length} bytes".format(
      name=fn, length=len(uploaded[fn])))

Saving herland.txt to herland.txt
User uploaded file 'herland.txt with length 88 bytes


To find the file you just uploaded, look to the left side of this browser window. Click on the icon of a file folder. A directory structure should open. Click on the arrow next to `content` and you should see your uploaded file appear inside. 

Then we're going to use the Python function `open()`. The following code says that we want to `open` the file `herland.txt` so we can read it (argument `mode='r'`). The `read()` method then returns the contents of the file, which we assign to a variable: a string called `hertext`. Finally, we close the file with `close()`.


In [None]:
filename = 'herland.txt'
herland = open(filename, mode='r')
hertext = herland.read()
herland.close()

Another way to read the text from a file into Python is to use a "context manager." The following tells Python that, with the `herland.txt` file open under the name `file`, it should read in the text and print it. The file is closed automatically when the indented block ends, so there is no need to call `close()`. When you run the next cell, it is going to print out the entire text of *Herland*. That's a lot of text, so once you've done it, you can clear the cell's output and move on to the next cell. 

In [None]:
# Here is how you print a string from a file without having to close the file using a context manager

with open('herland.txt','r') as file:
    print(file.read())

langchain
openai
flask
flask_cors
transformers
flask_session
flask_socketio
PyPDF2<3.0




In [None]:
# If you don't want to save the text of the file, but just want to peek into it to see what's there, you could use this method. 

with open('herland.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

langchain

openai

flask



### What happens when you import a flat file? 

The Python function `type()` returns the data type of the object you pass to it. When you pass the file object `herland` to the `type()` function below, what response do you get? The response will look different from other data types that you've used before. In this case, it is read in as a "file object." Remember that a function can only handle data of a type it expects. In the next input, we ask Python for the length of the file. This will throw an error. Why do you think that is? 

In [None]:
# herland is a file object, not a string. 
type(herland)

_io.TextIOWrapper

In [None]:
# since herland is a file object and not a string, you can't find the length of it.
len(herland)

TypeError: ignored

#### Response here: 

We had to go through a process to convert the file object to a string. 
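The difference between the file object and the string can be sketched in a few lines. This is a minimal illustration, assuming a small temporary file as a stand-in for `herland.txt`; the names `sample_path`, `sample_file`, and `sample_text` are invented for the example.

```python
import os
import tempfile

# Write a tiny stand-in file so the example does not depend on herland.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('Herland\nby Charlotte Perkins Gilman\n')
    sample_path = tmp.name

# open() returns a file object: a handle to the file, not the text itself
with open(sample_path, mode='r', encoding='utf-8') as sample_file:
    file_type = type(sample_file).__name__
    sample_text = sample_file.read()   # read() returns the contents as a str

os.remove(sample_path)                 # clean up the temporary file

print(file_type)                       # TextIOWrapper
print(type(sample_text).__name__)      # str
print(len(sample_text))                # len() works on the string
```

Only the string supports `len()`, indexing, and the other string methods used in the rest of this section.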

Looking at the cells below, which variable should return `type()` as a string? (The answer is in the cell below.) 

In [None]:
# but hertext is a different datatype. How would you check? 
type(hertext)

str

Once you have a string, there are a number of functions that you can make use of. One of those is the `len()` command, which you can run below. 

In [None]:
# How many characters are in the hertext string? 
len(hertext)

88

Once an object is recognized as a string, you can begin manipulating it. For example, you could count the number of times the sequence of characters "her" appears within the entire text of _Herland_.

In [None]:
# With no start and end arguments, count() searches the whole string
hertext.count('her')

0
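Note that `count()` matches a sequence of characters wherever it occurs, including inside longer words. Here is a minimal sketch of the difference, using a made-up sentence (not a quotation from the novel) and a regular expression, which is only one of several ways to match whole words:

```python
import re

# A made-up sentence for illustration (not a quotation from Herland)
sample = "Her mother told her that there were others here."

# str.count() is case-sensitive and matches the characters anywhere,
# so it also finds "her" inside "mother", "there", "others", and "here"
substring_hits = sample.count('her')

# Word boundaries (\b) restrict the match to the standalone word;
# re.IGNORECASE also catches the capitalized "Her"
whole_word_hits = len(re.findall(r'\bher\b', sample, flags=re.IGNORECASE))

print(substring_hits)    # 5
print(whole_word_hits)   # 2
```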

The ability to count characters, words, n-grams, etc. means that we can also more easily target specific sections of the text. For example, when you print to your screen the opening of the herland file, you notice that it is accompanied with metadata. For the purposes of text analysis, what would be the advantages or disadvantages of removing the metadata associated with _Herland_?

In [None]:
# What is happening at the beginning of the herland.txt file, though? We can check to see by using an index. 
print(hertext[:660])

langchain
openai
flask
flask_cors
transformers
flask_session
flask_socketio
PyPDF2<3.0




Working with a string is *more* helpful than simply working with a file object, but there are other things that we can do to the text to make it more easily manipulated in Python and NLTK. For example, when you're working with a string, it's not easy to count whole words. The NLTK word tokenizer function, however, will take a string and turn it into "tokens"--discrete segments of characters. Tokenized strings become a new data type--a list. 

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
hertokens = nltk.word_tokenize(hertext)
type(hertokens)

list

A tokenized list can be called, acted upon, and manipulated differently than a string. If we call just the first 15 tokens (index positions 0 through 14), here is what you would get:

In [None]:
hertokens[:15]

['langchain',
 'openai',
 'flask',
 'flask_cors',
 'transformers',
 'flask_session',
 'flask_socketio',
 'PyPDF2',
 '<',
 '3.0']

In [None]:
text1 = nltk.Text(hertokens)

In [None]:
type(text1)

nltk.text.Text

In [None]:
len(text1)

10

In [None]:
text1[5:10]

['flask_session', 'flask_socketio', 'PyPDF2', '<', '3.0']

### Review
When you import text from a flat file that is saved on your local computer, what will you need to do in order to select parts of the text using an index? 

## Answer 
After importing text from a flat file (such as a text file), it's possible to use Python's built-in functions to open the file and read its contents into a string variable. Once the text is in a string variable, it's possible to select parts of the text using indexing or slicing, just like you would with any other string.

Here's an example code snippet that demonstrates how to open a file, read its contents into a string variable, and select parts of the text using indexing:


```
# Open the file for reading
with open('filename.txt', 'r') as file:
    # Read the entire file into a string variable
    text = file.read()

# Select the first 10 characters of the text using indexing
first_10_chars = text[0:10]

# Select the next 10 characters of the text using indexing
next_10_chars = text[10:20]

# Print the selected parts of the text
print(first_10_chars)
print(next_10_chars)
```



## Ingesting data from a URL

Next, we're going to retrieve text directly from a URL with the `urllib` package. To do this, we import the package's `request` module and, from it, the function `urlopen`. First, we assign the address of the file to a variable. In this case, that variable is `url`. Then we pass `url` to `urlopen`, which sends the request and returns the server's response. If you pay attention to the type of the result, you'll realize that you've imported the file as an object, not as a string. 

In [None]:
import urllib.request

In [None]:
from urllib.request import urlopen
from urllib.request import Request
url = 'https://www.gutenberg.org/files/203/203-0.txt'
uncletom = urlopen(url)

### Question:
Using what you've learned so far, how would you figure out what data type the file `uncletom` is? Add a cell below and show how you would find the answer. 

In [None]:
type(uncletom)

http.client.HTTPResponse

Next, we're going to turn the text of Uncle Tom's Cabin into a list. A list is a mutable, ordered sequence of items. It can be indexed, sliced, and changed. Items in the list can be accessed through their index positions. 

In [None]:
dir(uncletom.read())

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'center',
 'count',
 'decode',
 'endswith',
 'expandtabs',
 'find',
 'fromhex',
 'hex',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
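The listing above includes `decode` and `split`, the two methods that take us from raw bytes to a list of words. The sketch below walks through those steps without a network connection; it assumes an in-memory `io.BytesIO` stream as a stand-in for the HTTP response, and the byte string is a made-up sample built from the novel's title.

```python
import io

# Stand-in for the bytes a web server sends back (UTF-8 encoded)
raw = "Uncle Tom\u2019s Cabin; or, Life Among the Lowly".encode('utf-8')
stream = io.BytesIO(raw)

first_read = stream.read()     # bytes, not text
second_read = stream.read()    # the stream is now exhausted

# decode() turns bytes into a str; naming the encoding avoids relying on defaults
sample_text = first_read.decode('utf-8')

# split() with no arguments breaks the string on any run of whitespace
sample_words = sample_text.split()

print(type(first_read).__name__)    # bytes
print(second_read)                  # b''
print(type(sample_words).__name__)  # list
print(sample_words[:3])
```

Like this stand-in stream, a response from `urlopen` can only be read once; a second call to `read()` returns an empty bytes object.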

In [None]:
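# read() consumes the response, and the cell above already read it once,
# so request the file again before decoding and splitting
words = urlopen(url).read().decode().split()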

In [None]:
print(type(words))

<class 'list'>


Let's practice those steps again, but with a new file this time. 

In [None]:
from urllib.request import urlopen

shakespeare = 'http://composingprograms.com/shakespeare.txt'

print( type(shakespeare) )

<class 'str'>


In [None]:
shakespeare = 'http://composingprograms.com/shakespeare.txt'
shakespeare = urlopen('http://composingprograms.com/shakespeare.txt')
print(type(shakespeare))

<class 'http.client.HTTPResponse'>


In [None]:
dir(shakespeare)

['__abstractmethods__',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_abc_impl',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_check_close',
 '_close_conn',
 '_get_chunk_left',
 '_method',
 '_peek_chunked',
 '_read1_chunked',
 '_read_and_discard_trailer',
 '_read_next_chunk_size',
 '_read_status',
 '_readall_chunked',
 '_readinto_chunked',
 '_safe_read',
 '_safe_readinto',
 'begin',
 'chunk_left',
 'chunked',
 'close',
 'closed',
 'code',
 'debuglevel',
 'detach',
 'fileno',
 'flush',
 'fp',
 'getcode',
 'getheader',
 'getheaders',
 'geturl',
 'headers',
 'info',
 'isatty',
 'isclosed',

In [None]:
words = shakespeare.read().decode().split()

In [None]:
print(type(words))

<class 'list'>


In [None]:
title = words[0:3]

In [None]:
body = words[3:]

In [None]:
print(body[:10])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']



__Indexing Operator__

The indexing operator (`[ ]`) selects one or more elements from a sequence. Each element of a sequence is assigned a number: its position, or index. An index must be an integer and is written inside a pair of square brackets. 

The operation that extracts a subsequence is called __slicing__. To select more than one element, use the __`:` operator__ with an integer before and after it to indicate where the slice starts and where it stops, respectively. The element at the stop index is not included.

Python indexing starts at 0 and ends at (n-1), where n is the number of items in the sequence. The function `len()` returns the number of items in a list. 

Python also supports negative indexing: putting `-` before the integer counts positions from the end of the sequence, so `-1` refers to the last item.
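The rules above can be sketched on a short list. The list below is a hand-typed sample, not the full `body` list, so the numbers are small enough to check by eye.

```python
# A short hand-typed sample list
sample = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour']

n = len(sample)            # 8 items, so valid indices run from 0 to 7
first = sample[0]          # 'Now'
last = sample[n - 1]       # 'hour'
also_last = sample[-1]     # negative indices count back from the end
middle = sample[2:5]       # start at index 2, stop before index 5

# Asking for index n goes one past the end and raises an IndexError
try:
    sample[n]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, also_last)
print(middle)              # ['fair', 'Hippolyta', ',']
print(out_of_range)        # True
```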

In [None]:
n_words = len(body)
print( n_words )

980634


In [None]:
print( body[980634])

IndexError: ignored

In [None]:
print( body[980633])

.


In [None]:
sub_body = body[:10]
print( sub_body)

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']


In [None]:
print( sub_body[:-2])

['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour']


In [None]:
print( sub_body[::2])    # gives every 2nd element

['Now', 'fair', ',', 'nuptial', 'Draws']


__Python Syntax__

Syntax refers to the structure of the language. 

A statement does not need a semicolon or any other symbol at its end; it normally ends at the end of its line. However, a semicolon lets you put two separate statements on the same line. 

Indentation, i.e. leading whitespace, matters in Python. A block of code is a set of statements that should be treated as a unit. Code blocks in Python are denoted by indentation. For example, in compound statements such as loops and conditionals, the lines after the colon must be indented to belong to the block. Four spaces per level is the usual convention, but any amount works as long as it is consistent within the block. Whitespace __within__ a line, however, does not matter.  

Comments about code are written with the hash sign `#`. Anything written after `#` on a line is ignored by the interpreter. Python does not have a dedicated syntax for multi-line comments. 
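A minimal sketch of these syntax rules follows; the variable names and the three sample words are invented for the illustration.

```python
# A comment: everything after the hash sign is ignored by the interpreter
count = 0; label = 'lowercase words'   # a semicolon separates two statements

for word in ['Draws', 'on', 'apace']:
    # The indented lines form the body of the loop
    if word.islower():
        count += 1                     # nested block, indented one level further

# Whitespace inside a line is flexible: these two expressions are equivalent
total_a = count+1
total_b = count   +   1

print(label, count, total_a == total_b)   # lowercase words 2 True
```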

In [None]:
sub_body_lowercase = []
for word in sub_body:
  sub_body_lowercase.append(word.lower())
  #print(sub_body_lowercase)
#print(sub_body_lowercase)
sub_body_lowercase

['now', ',', 'fair', 'hippolyta', ',', 'our', 'nuptial', 'hour', 'draws', 'on']
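The same result can also be written as a list comprehension, which builds the new list in a single expression. This is an alternative sketch rather than a replacement for the loop above; `sub_body` is retyped here so the example runs on its own.

```python
# The ten tokens from the cell above, retyped so this example is self-contained
sub_body = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour', 'Draws', 'on']

# Apply lower() to every item and collect the results in a new list
sub_body_lowercase = [word.lower() for word in sub_body]

print(sub_body_lowercase)
```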

## Importing an HTML file using an http: request
The previous two files that we imported were _plain text_ files. In other words, there is little to no descriptive encoding. However, we can also use `urllib` to import an .html file directly from the web. We can actually do this with just a few lines of code. First, we import the `urllib` package, and specifically the `request` module. We assign the URL we want to work with to a variable. Next, we pass the URL to the `urllib.request.urlopen` function and, at the same time, "read" the response. The result of that read becomes the variable `html`. When we print the variable `html`, we discover that all of the HTML from the page has been pulled into it. Unfortunately, it doesn't look very clean. 

In [None]:
# Now import the bibliography page from Colored Conventions in HTML
import urllib.request
anotherurl='http://coloredconventions.org/exhibits/show/bishophmturner'

In [None]:
html = urllib.request.urlopen(anotherurl).read()
print(html)

b'<!DOCTYPE html>\n<!--[if IE 6]>\n<html id="ie6" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 7]>\n<html id="ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if IE 8]>\n<html id="ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<![endif]-->\n<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->\n<html lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n<!--<![endif]-->\n<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n\t\t\t\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<link rel="pingback" href="https://coloredconventions.org/before-garvey-mcneal-turner/xmlrpc.php" />\n\n\t\t<!--[if lt IE 9]>\n\t<script src="https://coloredconventions.o

If you are interested in doing text analysis of a webpage, and the only way to ingest the web page is with HTML included, what are things you might need to learn to do to separate the HTML tags from the text? Look at the code above and write a short description of what might need to stay and what might need to be extracted. Should the extracted data be preserved or discarded? 

## Answer
In order to perform text analysis on data from a webpage, which arrives wrapped in HTML tags, we have to scrape the page. Python libraries that can be used for this include Beautiful Soup, Scrapy, and Requests-HTML. For example, using the Beautiful Soup library, you can extract the text content of a webpage as follows:


```
import requests
from bs4 import BeautifulSoup

# Send a GET request to the webpage
response = requests.get('https://www.example.com')

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the text content of the webpage and discard the HTML tags
text = soup.get_text()
```

In this example, the `requests.get()` function sends a GET request to the specified webpage and returns the HTML content of the page as a response object. The response object's content attribute contains the HTML content of the page.

The HTML content is then parsed using the Beautiful Soup library's `BeautifulSoup()` constructor, which returns a BeautifulSoup object that can be used to extract the text content of the page. The `soup.get_text()` method extracts the text content of the page and discards the HTML tags.

When extracting text from a webpage, it's important to consider what data should be preserved or discarded. Typically, you would want to preserve the text content of the webpage while discarding any irrelevant or extraneous information, such as HTML tags, scripts, and stylesheets. However, the specific data that should be preserved or discarded will depend on the goals of the text analysis and the structure of the webpage.
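To make the idea of discarding scripts and stylesheets concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module rather than Beautiful Soup, so it runs without extra installs. The `TextExtractor` class and the sample HTML string are invented for this illustration and are not taken from any real page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, skipping the contents of <script> and <style> tags."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.chunks = []        # pieces of text we decide to keep
        self._skip_depth = 0    # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# A made-up page for illustration
sample_html = """
<html><head><title>Press</title>
<style>body {color: black;}</style>
<script>document.title = 'js';</script></head>
<body><h1>Academic Journals</h1>
<p>See the <a href="/bibliography/">bibliography</a>.</p></body></html>
"""

parser = TextExtractor()
parser.feed(sample_html)
parser.close()
extracted_text = ' '.join(parser.chunks)
print(extracted_text)
```

The tags, the CSS rule, and the JavaScript are discarded, while the title, heading, and paragraph text are kept. Whether to also keep something like the `href` values depends on the research question.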




# Importing Data by Webscraping with BeautifulSoup
If you are interested in scraping data from the open web, BeautifulSoup is a Python package worth exploring in detail. For our purposes here, though, we're going to consider how to use Beautiful Soup to turn "unstructured" data into "structured" data. As you read through this section, consider Muñoz and Rawson's argument about data cleaning. Is there a need for the data to stay unstructured? What is the value of cleaning? 

## Answer 
As mentioned, Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. It can be used to turn unstructured data, such as the HTML content of a webpage, into structured data, such as a dataset that can be analyzed using data analysis tools. The purpose of data cleaning is to improve the quality and reliability of the data, making it more suitable for analysis.
In Muñoz and Rawson's argument about data cleaning, they emphasize the importance of transforming unstructured data into structured data, as this can help to reduce errors, increase consistency, and improve the reliability of the data. They argue that data cleaning is a necessary step in the data analysis process, as it can help to uncover hidden patterns and insights that would not be visible in uncleaned data.





In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Specify url: url
url4 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url4)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming the parser explicitly avoids a warning about the default parser)
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

<!DOCTYPE html>
<html lang="en" xmlns:addthis="https://www.addthis.com/help/api-spec" xmlns:fb="https://www.facebook.com/2008/fbml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://coloredconventions.org/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <style id="et-builder-googlefonts-cached-inline">
   /* Original: https://fonts.googleapis.com/css?family=Oswald:200,300,regular,500,600,700|Open+Sans:300,regular,500,600,700,800,300italic,italic,500italic,600italic,700italic,800italic&#038;subset=cyrillic,cyrillic-ext,latin,latin-ext,vietnamese,greek,greek-ext,hebrew&#038;display=swap *//* User Agent: Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1 */@font-face {font-family: 'Open Sans';font-style: italic;font-weight: 300;font-st

Compare the text imported using the "webscraping" method included with BeautifulSoup versus the option of importing the entire file using URLLIB. 

## Answer
The two methods of importing data from a webpage, BeautifulSoup's web scraping method and the `urllib` method, differ in several ways. The BeautifulSoup method involves parsing the HTML code of a webpage to extract the desired content. It is a more targeted approach that extracts only the specific elements of the webpage that you are interested in. This method requires the use of a third-party library and some knowledge of HTML structure and syntax.

In exchange for that extra knowledge, the BeautifulSoup method lets you customize the data extraction process to suit your specific needs. The `urllib` method, on the other hand, is simpler and requires less technical expertise, but it is less flexible: it returns the entire page, tags included.

## Cleaning up Webscraped text

In [None]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url5 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url5)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming the parser explicitly avoids a warning about the default parser)
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Colored Conventions' webpage: ccc_title
ccc_title = soup.title

# Print the title of Colored Conventions' webpage to the shell
print(ccc_title)


<title>Press &amp; Notices - Colored Conventions Project</title>


In [None]:
# Get Colored Conventions' text: ccc_text
ccc_text = soup.get_text()

# Print CCC's text 
print(ccc_text)










Press & Notices - Colored Conventions Project





























































 



THE COLORED CONVENTIONS BOOK   ▶Learn more  ▶Order book
 












HOME
CONVENTIONS

About the Conventions
Digital Records
Submit Records
CCP Corpus
Bibliography


EXHIBITS
TEACHING

North American Teaching Partners
Curriculum


NEWS & EVENTS

Douglass Day
Transcribe Mary Ann Shadd Cary Papers
Symposium 2022: The Making of a Social Movement
Mural in Philadelphia
News
Press & Notices
Videos


ABOUT CCP

CCP Principles
Team
Committees
Project Curriculum Vitae
Speaker’s Agreement
#DigBlk, Center for Black Digital Research
How to Use This Site
Contact Us


DONATE
 





Select Page


  
 



 



 











Press & Notices
Academic Journals
Fagan, Benjamin. “Chronicling White America.” American Periodicals: A Journal of History & Criticism 26, no. 1 (2016): 10–13.
Spires, Derrick R. “The Captive Stage: Performance and the Proslavery Imagination of the Antebellum Nor
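The output above contains long runs of blank lines left behind where tags used to be. One possible cleanup step, sketched here on a short stand-in string modeled on that output (the variable names are invented for the example), is to strip each line and drop the empty ones:

```python
# A short stand-in for the kind of string that get_text() returns
raw_text = (
    "\n\n\n\nPress & Notices - Colored Conventions Project\n\n\n \n\n"
    "HOME\nCONVENTIONS\n\n\n  About the Conventions  \n\n"
)

# Strip leading and trailing whitespace from every line
lines = [line.strip() for line in raw_text.splitlines()]

# Keep only the lines that still contain characters
clean_lines = [line for line in lines if line]

clean_text = '\n'.join(clean_lines)
print(clean_text)
```

This keeps the navigation labels as well as the article text, so a further decision is still needed about which lines belong in the corpus.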

In [None]:
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

/about/book/
https://coloredconventions.org/
https://coloredconventions.org/
https://coloredconventions.org/about-conventions/
https://coloredconventions.org/about-conventions/
https://coloredconventions.org/about-records/
https://coloredconventions.org/about-conventions/submit-records/
https://coloredconventions.org/about-records/ccp-corpus/
https://coloredconventions.org/bibliography/
https://coloredconventions.org/exhibits/
https://coloredconventions.org/teaching/
https://coloredconventions.org/teaching/#teaching-partners
https://coloredconventions.org/curriculum/
https://coloredconventions.org/news/
https://douglassday.org/
https://douglassday.org/
https://coloredconventions.org/digblk/symposium-ccp-making-social-movement/
https://coloredconventions.org/news/mural-dedication-philadelphia/
https://coloredconventions.org/news/
https://coloredconventions.org/about/press-notices/
https://coloredconventions.org/about/videos/
https://coloredconventions.org/about/
https://coloredconventio
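The list of links above mixes a relative path (`/about/book/`) with full URLs and contains duplicates. A minimal sketch of one way to tidy such a list uses `urljoin` from the standard library. The `hrefs` list is a hand-copied sample of the output, and `base_url` is an assumption about the address the page is served from.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were scraped from
base_url = 'https://coloredconventions.org/about/press-notices/'

# Hand-copied sample of href values; None stands for an <a> tag with no href
hrefs = [
    '/about/book/',
    'https://coloredconventions.org/',
    'https://coloredconventions.org/',
    None,
    'https://douglassday.org/',
]

absolute_links = []
for href in hrefs:
    if href is None:                 # nothing to resolve
        continue
    full = urljoin(base_url, href)   # relative paths are resolved against the base
    if full not in absolute_links:   # skip duplicates, keep first-seen order
        absolute_links.append(full)

for link in absolute_links:
    print(link)
```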

# Questions for reflection

Explain what the value is of importing HTML files using BeautifulSoup. How does this relate to the concerns that Rawson and Muñoz raise in their article? Are there times when you might want to keep the HTML? 

## Answer

BeautifulSoup is a Python library that enables parsing and extracting data from HTML and XML files. It extracts specific elements like tables, paragraphs, or headings from HTML files for further analysis or visualization. The library's value lies in its ability to convert unstructured data into structured formats, such as CSV or JSON files, allowing for easier analysis and visualization. This addresses the concerns highlighted by Rawson and Muñoz, who emphasize the importance of data cleaning and structuring. By leveraging BeautifulSoup to extract data elements from HTML files, unstructured data can be transformed into structured data, resulting in reduced noise, improved accuracy, and increased data reliability.
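As a small illustration of what "structured" output might look like, the sketch below writes a few rows to CSV with the standard library. The rows are hypothetical pairs of link text and URL, standing in for data pulled from `<a>` tags, and the column names `text` and `url` are invented for the example.

```python
import csv
import io

# Hypothetical rows of (link text, URL), standing in for scraped data
rows = [
    ('Bibliography', 'https://coloredconventions.org/bibliography/'),
    ('Exhibits', 'https://coloredconventions.org/exhibits/'),
]

# Write to an in-memory buffer; a real project would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['text', 'url'])   # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```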

Consider Rob Kitchin's criteria of "good data." Would these datasets satisfy his definition of "good data"? Why or why not? What kinds of questions could one ask about the Colored Conventions Project using what you've learned here? 

## Answer

Rob Kitchin's criteria for "good data" relate to accuracy, reliability, validity, timeliness, and accessibility. The datasets from the Colored Conventions Project are likely to meet these criteria, as the project has been developed and maintained with careful curation and structuring. However, whether these criteria are satisfied depends on the specific dataset and its intended use.

The Colored Conventions Project offers researchers a valuable resource for studying Black activism in the 19th century. By utilizing the structured data, researchers can explore various aspects, such as how the Black community organized and mobilized for political and social change during that time. By analyzing patterns in the conventions' frequency, location, and content, and by examining the relationships between participants and organizations, researchers can gain insights into the conventions' impact, effectiveness, and influence on public opinion and policy. Overall, the project enables a wide range of research questions to be addressed, making it an important tool for studying the history and politics of Black activism in the 19th century.


