<a href="https://colab.research.google.com/github/lclarete/DHUM72500-FINAL-PORTFOLIO/blob/main/Clarete_Week8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 8: Working with Text Data from APIs
* Student Name: Livia Clarete
* Date: March 29 2023
* Assignment Due: March 2023
* Instructor: Lisa Rhody
* Methods of Text Analysis, Spring 2023

Fill out the cell above with your information. 

## Objectives
This week you'll be working with APIs and the data they provide, as well as learning about some of the steps that you need to take while cleaning and preparing text data for analysis. 

In this notebook, you will:
* ingest data into a notebook from an API; 
* use Pandas dataframes to find and organize data; 
* uncover some of the challenges of working with data, including variation, multiple words with similar word stems, words with similar meanings, stopword control, and more; 
* consider the relationship between the data you are working with, the forms of "data cleaning" or "data scrubbing" that most text analysis pipelines use, and the challenges that those methods present to a feminist critical approach. 

# Getting Started
As always, we need to begin by importing Python libraries that we know we will need to use. By this point, perhaps you are already familiar with some of them and know why we use them. Others may be less familiar to you. See if you can read through the list of imports and identify what each package does and why we will need it. 

In [None]:
import nltk
import numpy as np
import pandas as pd
import urllib
import pprint

## Access data using an API
In the following exercise, you will import data from the Chronicling America API. You will set parameters for what content and keywords to pull in, then send the request to the server. The search results come back packaged in a format called JSON. We will ingest the JSON, turn it into a dictionary, and then turn part of that dictionary into a Pandas DataFrame. All we're doing when we turn text data into a dataframe is organizing the metadata and the files into a format that can be used and acted upon in order to do other kinds of analysis. 

To work with APIs, we will need to import a new library called "[requests](https://pypi.org/project/requests/)."

In [None]:
# Make the Requests module available
import requests

# What are APIs? 
APIs (application programming interfaces) are sets of routines that allow you to build from and interact with a software application. APIs make it possible for two software programs to work with each other. We will use APIs to pull data from applications. Many applications, such as Twitter, Instagram, and LinkedIn, have APIs. 

In the following example, you are going to use the requests library to make an http request to the [Open Movie Database](https://www.omdbapi.com/) (OMDb). We are going to use a security protocol called an API Key and request that the API return results to a query about the movie *The Princess Bride*. 

First, go to the API Key generator on the [OMDb website](https://www.omdbapi.com/apikey.aspx?__EVENTTARGET=freeAcct&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=%2FwEPDwUKLTIwNDY4MTIzNQ9kFgYCAQ9kFgICBw8WAh4HVmlzaWJsZWhkAgIPFgIfAGhkAgMPFgIfAGhkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYDBQtwYXRyZW9uQWNjdAUIZnJlZUFjY3QFCGZyZWVBY2N0oCxKYG7xaZwy2ktIrVmWGdWzxj%2FDhHQaAqqFYTiRTDE%3D&__VIEWSTATEGENERATOR=5E550F58&__EVENTVALIDATION=%2FwEdAAU%2BO86JjTqdg0yhuGR2tBukmSzhXfnlWWVdWIamVouVTzfZJuQDpLVS6HZFWq5fYpioiDjxFjSdCQfbG0SWduXFd8BcWGH1ot0k0SO7CfuulHLL4j%2B3qCcW3ReXhfb4KKsSs3zlQ%2B48KY6Qzm7wzZbR&at=freeAcct&Email=). Click the radio button next to FREE. Enter your email address, first, and last name, and then in the "use" section, you can write: "Completing an assignment for class." Then click Submit. It usually takes just a few moments for a confirmation email to arrive in your email box. Be sure to click on the second link in the email first to validate your key. The email will include a sequence of characters and numbers that you will use in this exercise. Once you have set up a key, you can begin the rest of the activity. 

In order to retrieve data from the OMDb API using your API key, we need to make a request using a communication protocol that is common on the internet: HTTP. Essentially, we will create a variable called `url` in which we store the address of the request, which is sent to www.omdbapi.com. What follows the address, beginning with a question mark, is a query string. The query string is part of the URL: it is a series of `name=value` pairs joined by `&` that tells the API (in this case) what your key is and what information you would like to retrieve. In this case, that is everything included in the record with the title *The Princess Bride*. 

Notice that the URL cannot contain spaces. Therefore, we replace each space in the title with a `+` sign. Alternatively, we could use `%20`, the percent-encoded form of a space. 

In [None]:
url = 'http://www.omdbapi.com/?apikey=4603ce48&t=the+princess+bride'
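
Spaces and other special characters in a URL have to be encoded. As a minimal sketch, Python's standard library can do the encoding for you, so the query string does not have to be written by hand. The value `YOUR_API_KEY` below is a placeholder, not a real key.

```python
from urllib.parse import quote, quote_plus, urlencode

title = "the princess bride"

# quote() percent-encodes each space as %20
print(quote(title))

# quote_plus() uses the form-style + sign instead
print(quote_plus(title))

# urlencode() builds a whole query string from a dictionary
# (placeholder key; substitute the one OMDb emailed to you)
query = urlencode({"apikey": "YOUR_API_KEY", "t": title})
url = "http://www.omdbapi.com/?" + query
print(url)
```

Building the query from a dictionary also makes it easier to change one parameter without retyping the whole address.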

In [None]:
# Create a variable movies and use the get method in requests to read in the 
# response from the URL.
movies = requests.get(url)
# What datatype is the variable movies?
type(movies)

requests.models.Response

The result of the `requests.get()` method is a `Response` object, a type specific to the requests library. In order to use the data, we need to parse the JSON text in the response into a Python object. We do that by calling the `.json()` method on `movies`.

In [None]:
# Take movies and turn it into json. 
json_data = movies.json()
type(json_data)

dict

When you check the data type now, you will discover that the parsed JSON is stored as a dictionary, which is to say a series of "keys" and "values" saved in pairs. 
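
To see the same conversion without making a network request, here is a small sketch that parses a JSON string by hand with Python's built-in `json` module. The string is a shortened stand-in for an API response, not the full OMDb record, and calling `.json()` on a response behaves roughly like `json.loads()` applied to the response text.

```python
import json

# A small JSON string standing in for the body of an API response
raw_text = '{"Title": "The Princess Bride", "Year": "1987", "Response": "True"}'

# Parse the JSON text into Python objects
record = json.loads(raw_text)

print(type(record))
print(record["Title"])

# .get() returns a default instead of raising an error when a key is missing
print(record.get("Website", "N/A"))
```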

Finally, we need to create a for loop so that we can go through the API results and print out the keys and their associated values. So, for every key and its associated value in the json_data object, we print the key, then a colon, and then the associated value. 

In [None]:
for key, value in json_data.items():
  print(key + ':', value)

Title: The Princess Bride
Year: 1987
Rated: PG
Released: 09 Oct 1987
Runtime: 98 min
Genre: Adventure, Comedy, Family
Director: Rob Reiner
Writer: William Goldman
Actors: Cary Elwes, Mandy Patinkin, Robin Wright
Plot: A bedridden boy's grandfather reads him the story of a farmboy-turned-pirate who encounters numerous obstacles, enemies and allies in his quest to be reunited with his true love.
Language: English
Country: United States
Awards: Nominated for 1 Oscar. 7 wins & 10 nominations total
Poster: https://m.media-amazon.com/images/M/MV5BYzdiOTVjZmQtNjAyNy00YjA2LTk5ZTAtNmJkMGQ5N2RmNjUxXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg
Ratings: [{'Source': 'Internet Movie Database', 'Value': '8.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '97%'}, {'Source': 'Metacritic', 'Value': '77/100'}]
Metascore: 77
imdbRating: 8.0
imdbVotes: 433,206
imdbID: tt0093779
Type: movie
DVD: 18 Jul 2000
BoxOffice: $30,857,814
Production: N/A
Website: N/A
Response: True
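
Notice that the `Ratings` value is itself a list of dictionaries. As a rough sketch of how nested values like this can be unpacked, the example below copies just the `Title` and `Ratings` entries from the output above into a sample dictionary (the name `sample` is ours) and loops over the inner list.

```python
# Sample copied from the "Title" and "Ratings" entries printed above
sample = {
    "Title": "The Princess Bride",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "8.0/10"},
        {"Source": "Rotten Tomatoes", "Value": "97%"},
        {"Source": "Metacritic", "Value": "77/100"},
    ],
}

# Each rating is its own small dictionary, so we loop over the list
for rating in sample["Ratings"]:
    print(rating["Source"] + ":", rating["Value"])

# Collect the ratings into a single source -> value dictionary
ratings_by_source = {r["Source"]: r["Value"] for r in sample["Ratings"]}
print(ratings_by_source["Rotten Tomatoes"])
```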


We know that we can search by title in the API because of the documentation on the [OMDb website](https://www.omdbapi.com/). Look under Usage and Parameters. In fact, the OMDb site includes examples, so that you can do a search and find the query string you need to get the result you are looking for. All the search parameters are listed here. 

Another way to search for a particular movie is with its item ID in the IMDb database. You can find the item identifier by looking at the end of the URL when you search for a movie. For example, in this URL https://www.imdb.com/title/tt2906216/ we would use the item ID `tt2906216`. There are also additional arguments that can be used in the query string for OMDb to find the full plot description. 

In [None]:
url = 'http://www.omdbapi.com/?apikey=4603ce48&i=tt2906216&plot=full'
dandd = requests.get(url)
json_data = dandd.json()
for key, value in json_data.items():
  print(key + ':', value)

Title: Dungeons & Dragons: Honor Among Thieves
Year: 2023
Rated: N/A
Released: 31 Mar 2023
Runtime: 134 min
Genre: Action, Adventure, Fantasy
Director: John Francis Daley, Jonathan Goldstein
Writer: Michael Gilio, John Francis Daley, Chris McKay
Actors: Chris Pine, Michelle Rodriguez, Regé-Jean Page
Plot: A charming thief and a band of unlikely adventurers embark on an epic quest to retrieve a lost relic, but things go dangerously awry when they run afoul of the wrong people.
Language: English
Country: United States
Awards: N/A
Poster: https://m.media-amazon.com/images/M/MV5BZjAyMGMwYTEtNDk4ZS00YmY0LThhZjUtOWI4ZjFmZmU4N2I3XkEyXkFqcGdeQXVyMTEyNzQ1MTk0._V1_SX300.jpg
Ratings: []
Metascore: N/A
imdbRating: N/A
imdbVotes: N/A
imdbID: tt2906216
Type: movie
DVD: N/A
BoxOffice: N/A
Production: Paramount Pictures
Website: N/A
Response: True


# Chronicling America
For this activity, we are going to use the [Chronicling America API](https://chroniclingamerica.loc.gov/about/api/), which is created and maintained by the Library of Congress. Chronicling America is an archive of digitized newspapers from across the United States that are not under copyright protection by another digitization vendor. It is part of the National Digital Newspaper Program funded by the National Endowment for the Humanities. There are more than [140,000 newspaper titles](https://chroniclingamerica.loc.gov/search/titles/) included in the collection. 

If you're interested in some of the critiques of the Chronicling America project, you might want to read Benjamin Fagan's article "Chronicling White America." (Fagan, Benjamin. "Chronicling White America." American Periodicals: A Journal of History & Criticism, vol. 26 no. 1, 2016, p. 10-13. Project MUSE https://muse.jhu.edu/article/613375.) 

In [None]:
# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

In [None]:
# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
# The parameters are drawn from the API documentation which describes the fields in the API. 
params = {
    'proxtext': 'poetry' # Search for this keyword -- feel free to change!
    
}

(Later on, you will be asked to return to the above cell and change the search parameters. You do this by replacing `poetry` with `yourterm`.)

In [None]:
# The following line adds a value for 'format' to our dictionary
params['format'] = 'json'

# Let's view the updated dictionary
params

{'proxtext': 'poetry', 'format': 'json'}

In [None]:
# The next line uses the requests package that we imported above to pull data from the Chronicling America API 
# and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# We use a print statement to show us the url that we are sending to the API
print('Here\'s the formatted url that gets sent to the Chronicling America API:\n{}\n'.format(response.url)) 

# It's nice to have some feedback about the status of the API response 
# to make sure there were no errors. The following checks the status code and prints a message.
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))
    print('Try running this cell again.')

Here's the formatted url that gets sent to the Chronicling America API:
https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=poetry&format=json

All ok


In [None]:
# We are going to take the Chronicling America API's JSON results and turn them into a Python variable called 'data'
data = response.json()

The request that we made was for data formatted in JSON, which stands for JavaScript Object Notation. JSON is a structured way of organizing information, and it can be converted into a Pandas DataFrame; however, it's not always easy for a human to read. We're going to use Python's built-in `json` module to indent the output and a library called Pygments to add some colour, which makes the JSON a little more understandable to the human reader. 

In [None]:
# Let's prettify the raw JSON data and then display it.

# We're using the Pygments library to add some colour to the output, so we need to import it
import json
from pygments import highlight, lexers, formatters
import pprint

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)

{
  "totalItems": 519976,
  "endIndex": 20,
  "startIndex": 1,
  "itemsPerPage": 20,
  "items": [
    {
      "sequence": 25,
      "county": [
        "New York"
      ],
      "edition": null,
      "frequency":

In [None]:
type(data)

dict
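
The summary fields at the top of the response (`totalItems`, `itemsPerPage`) indicate that only 20 of the 519,976 matching pages came back. Here is a rough sketch of that arithmetic, and of how a later page of results might be requested. The `page` parameter is an assumption on our part; check the Chronicling America API documentation before relying on it.

```python
import math
from urllib.parse import urlencode

total_items = 519976    # "totalItems" in the response above
items_per_page = 20     # "itemsPerPage" in the response above

# Number of result pages needed to see every match
total_pages = math.ceil(total_items / items_per_page)
print(total_pages)

# Assumption: the search endpoint accepts a 'page' parameter
params_page_2 = {"proxtext": "poetry", "format": "json", "page": 2}
page_2_url = (
    "https://chroniclingamerica.loc.gov/search/pages/results/?"
    + urlencode(params_page_2)
)
print(page_2_url)
```

Requesting every page would mean tens of thousands of calls, so in practice it is kinder to the server to narrow the search or sample a handful of pages.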

## Reading text in a dataframe
Next, you will use what we learned about the data in the API using the keys. We're going to look into the "items" entry in the JSON file and create a dataframe using Pandas that pulls out the title, content, and year of publication for each of the items. 


In [None]:
# Drill down into the API data to find the fields we want to pull out and work with
json_data = json.loads(formatted_data)
# print(json_data['items'])
cleaned_papers = []
for item in json_data['items']:
  # print(item['ocr_eng'])
  cleaned_papers.append({'title': item['title'], 'content': item['ocr_eng'], 'date': item['date'] })
# print(cleaned_papers)
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(cleaned_papers)


The output of the above cell will be quite long. Before turning in this assignment, please delete the cell above so the file you turn in is not difficult to read. Thank you!

In [None]:
type(cleaned_papers)

list
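
Because `cleaned_papers` is a list of dictionaries that all share the same keys, it can be handed straight to Pandas. Below is a minimal sketch using two made-up records (not real API results) with the same three keys; the names `sample_papers` and `df_sample` are ours.

```python
import pandas as pd

# Hypothetical records shaped like the entries in cleaned_papers
sample_papers = [
    {"title": "Example Gazette", "content": "POETRY SECTION\nA poem", "date": "19130504"},
    {"title": "Sample Herald", "content": "FAVORITE POEMS\nDear neighbor", "date": "19380824"},
]

# Each dictionary becomes a row; each key becomes a column
df_sample = pd.DataFrame(sample_papers)

print(df_sample.shape)
print(list(df_sample.columns))
```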

In the cell below, we take the nested dictionary parsed from the JSON response and convert it into a DataFrame. 

In [None]:
df = pd.DataFrame.from_dict(json_data)
print(df.head(4))

   totalItems  endIndex  startIndex  itemsPerPage  \
0      519976        20           1            20   
1      519976        20           1            20   
2      519976        20           1            20   
3      519976        20           1            20   

                                               items  
0  {'sequence': 25, 'county': ['New York'], 'edit...  
1  {'sequence': 131, 'county': [None], 'edition':...  
2  {'sequence': 7, 'county': ['Fulton'], 'edition...  
3  {'sequence': 15, 'county': ['Prince George's']...  


If we switch the layout of the dataframe, it becomes easier to see how the labels for the dataframe are different from the many items in the `items` column. We can try the Pandas function `json_normalize` to flatten the data into columns. 


In [None]:
df = pd.json_normalize(json_data)
df.columns

Index(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'], dtype='object')
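
As the output shows, `json_normalize` leaves `items` as a single column, because that key holds a list of records and not a nested dictionary. One possible way around this, sketched here on a tiny made-up dictionary with the same shape as the API response, is the `record_path` argument, which tells Pandas which list to expand into rows.

```python
import pandas as pd

# Hypothetical stand-in for the API response
sample_response = {
    "totalItems": 2,
    "items": [
        {"title": "Example Gazette", "date": "19130504"},
        {"title": "Sample Herald", "date": "19380824"},
    ],
}

# One row per entry in 'items'; 'meta' copies top-level fields onto every row
flat = pd.json_normalize(sample_response, record_path="items", meta=["totalItems"])

print(list(flat.columns))
print(len(flat))
```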

In [None]:
dfpapers = pd.DataFrame.from_dict(json_data['items'])
dfpapers.head(5)

| | sequence | county | edition | frequency | id | subject | city | date | title | end_year | ... | language | alt_title | lccn | country | ocr_eng | batch | title_normal | url | place | page |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | [New York] | | Daily | /lccn/sn83030272/1913-05-04/ed-1/seq-25/ | [New York (N.Y.)--Newspapers., New York (State... | [New York] | 19130504 | The sun. [volume] | 1916 | ... | [English] | [Extra sun, New York sun] | sn83030272 | New York | `V\ngg POETRY SECTl GARDENS\nTHIRD SECTION.\nNE...` | nn_ehrlich_ver02 | sun. | https://chroniclingamerica.loc.gov/lccn/sn8303... | [New York--New York--New York] | |
| 1 | 131 | [None] | | Daily | /lccn/sn83045462/1948-09-26/ed-1/seq-131/ | [Washington (D.C.)--fast--(OCoLC)fst01204505, ... | [Washington] | 19480926 | Evening star. [volume] | 1972 | ... | [English] | [Star, Sunday star] | sn83045462 | District of Columbia | `...long limbed,\nathletic look in\nUMenea^le*\...` | dlc_2goncharova_ver03 | evening star. | https://chroniclingamerica.loc.gov/lccn/sn8304... | [District of Columbia--Washington] | 6 |
| 2 | 7 | [Fulton] | | Weekly | /lccn/2020233210/1911-03-16/ed-1/seq-7/ | [Atlanta (Ga.)--Newspapers., Christianity--Sou... | [Atlanta] | 19110316 | The Golden age. [volume] | 1920 | ... | [English] | [] | 2020233210 | Georgia | `HE MOST enthusiastic and partial\nstudent of c...` | gu_eridanus_ver02 | golden age. | https://chroniclingamerica.loc.gov/lccn/202023... | [Georgia--Fulton--Atlanta] | 7 |
| 3 | 15 | [Prince George's] | | Weekly | /lccn/sn89061521/1938-08-24/ed-1/seq-15/ | [Greenbelt (Md.)--Newspapers., Maryland--Green... | [Greenbelt] | 19380824 | Greenbelt cooperator. | 1954 | ... | [English] | [Greenbelt] | sn89061521 | Maryland | `Auyust 24, 1938\nFAVORITE POEMS\nDear NEIGHBOR...` | mdu_annapolis_ver01 | greenbelt cooperator. | https://chroniclingamerica.loc.gov/lccn/sn8906... | [Maryland--Prince George's--Greenbelt] | PAGE FIFTEEN |
| 4 | 17 | [Cook County] | | Daily (except Sunday and holidays) | /lccn/sn83045487/1912-02-06/ed-1/seq-17/ | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | [Chicago] | 19120206 | The day book. [volume] | 1917 | ... | [English] | [] | sn83045487 | Illinois | `IKiSKJSSSWPWf\n0 0\nThe Mercantile Muse.\n"Has...` | iune_echo_ver01 | day book. | https://chroniclingamerica.loc.gov/lccn/sn8304... | [Illinois--Cook County--Chicago] | |


In [None]:
for key in dfpapers:
    print(key)

sequence
county
edition
frequency
id
subject
city
date
title
end_year
note
state
section_label
type
place_of_publication
start_year
edition_label
publisher
language
alt_title
lccn
country
ocr_eng
batch
title_normal
url
place
page


The `.head()` method prints just the first five rows by default. Here we apply it to the `date` column.

In [None]:
dates = dfpapers["date"]
dates.head()

0    19130504
1    19480926
2    19110316
3    19380824
4    19120206
Name: date, dtype: object

In [None]:
dfpapers["date"] = pd.to_datetime(dfpapers["date"])
dfpapers["date"]

0    1913-05-04
1    1948-09-26
2    1911-03-16
3    1938-08-24
4    1912-02-06
5    1915-02-02
6    1914-02-04
7    1914-03-04
8    1908-02-09
9    1939-01-22
10   1915-12-14
11   1948-11-18
12   1905-11-19
13   1905-11-19
14   1905-11-19
15   1917-05-31
16   1960-04-03
17   1954-10-01
18   1912-07-28
19   1912-08-24
Name: date, dtype: datetime64[ns]
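
Once the dates are real datetime values, parts such as the year can be pulled out directly. Here is a small sketch using the first three date strings shown above. Passing `format` explicitly is optional, but it makes the expected layout (year, month, day) unambiguous. The names `sample_dates` and `parsed` are ours.

```python
import pandas as pd

# The first three date strings from the output above
sample_dates = pd.Series(["19130504", "19480926", "19110316"])

# Parse strings of the form YYYYMMDD
parsed = pd.to_datetime(sample_dates, format="%Y%m%d")

# The .dt accessor exposes datetime parts for a whole column
years = parsed.dt.year.tolist()
print(years)

# Datetime columns can also be compared and sorted chronologically
print(parsed.min())
```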

The `shape` attribute shows how many rows and how many columns are in your dataframe.

In [None]:
dfpapers.shape

(20, 28)

In [None]:
dfpapers.describe()

| | sequence | end_year | start_year |
|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 |
| mean | 35.55 | 1945.9 | 1891.2 |
| std | 42.257948 | 30.026129 | 35.256354 |
| min | 1.0 | 1916.0 | 1833.0 |
| 25% | 9.0 | 1917.0 | 1854.0 |
| 50% | 21.0 | 1938.0 | 1908.5 |
| 75% | 40.25 | 1972.0 | 1911.0 |
| max | 154.0 | 1999.0 | 1947.0 |


## Reflection and Writing
In this exercise, you queried the Chronicling America API and pulled in records that included the search term "poetry." Those records were then reshaped and made slightly more tidy by highlighting the "keys" of the dictionary and then taking one small section of the dictionary and turning it into a dataframe. 

Look back over the notebook and do the following: 


1.   Return to the section "Reading text in a dataframe." Read through some of the entries. Create a new text cell and explain what kind of "data cleaning" you would recommend to prepare the text for analysis. How can you take into consideration Munoz and Rawson's article? What makes the data "messy"? What messiness should remain? What messiness should be repaired? What messiness should be removed? 
2.    Go to the top of the Chronicling America section. Make a copy of the search query and try replacing the term "poetry" (the parameter of the search argument) with another one. What were the results? 
3.   What changes when you rerun the activity besides the results? Do you need to make any changes to the next cells for them to run? 


## Answer 1

After reading through the entries in the "Reading text in a dataframe" section, here are some recommendations for data cleaning to prepare the text for analysis:

1. Remove irrelevant metadata for focused text analysis.
2. Standardize capitalization and punctuation for better pattern analysis.
3. Exclude non-textual elements (images, ads) for text-focused examination.
4. Address OCR errors to minimize noise and preserve accuracy.
5. Preserve relevant messiness (misspellings, non-standard capitalization) for analysis.
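
A few of these steps can be sketched with Python's standard library. This is only an illustration: the sample text is adapted from one of the OCR snippets shown earlier (with an invented ending), the stopword list is a tiny made-up one (NLTK ships a fuller list), and what counts as noise is a judgment call that depends on the research question.

```python
import re

# Short sample in the style of the OCR text above (note the OCR error "Auyust")
raw = "Auyust 24, 1938\nFAVORITE POEMS\nDear NEIGHBOR, the poems are here."

# Tiny illustrative stopword list (an assumption, not a standard list)
stopwords = {"the", "are", "a", "of", "and"}

# Join the lines and standardize capitalization
text = raw.replace("\n", " ").lower()

# Keep alphabetic words only, which drops digits and punctuation
tokens = re.findall(r"[a-z]+", text)

# Remove stopwords
kept = [t for t in tokens if t not in stopwords]

print(kept)
```

Note that the OCR error survives as `auyust`: simple normalization does not repair it, which is one reason to decide deliberately which messiness to fix and which to leave in place.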


## Answer 2

In [None]:
# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

In [None]:
# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
# The parameters are drawn from the API documentation which describes the fields in the API. 
params = {
    'proxtext': 'song' # Search for this keyword -- feel free to change!
    
}

In [None]:
# The following line adds a value for 'format' to our dictionary
params['format'] = 'json'

# Let's view the updated dictionary
params

{'proxtext': 'song', 'format': 'json'}

In [None]:
# The next line uses the requests package that we imported above to pull data from the Chronicling America API 
# and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# We use a print statement to show us the url that we are sending to the API
print('Here\'s the formatted url that gets sent to the Chronicling America API:\n{}\n'.format(response.url)) 

# It's nice to have some feedback about the status of the API response 
# to make sure there were no errors. The following checks the status code and prints a message.
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))
    print('Try running this cell again.')

Here's the formatted url that gets sent to the Chronicling America API:
https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=song&format=json

All ok


In [None]:
# We are going to take the Chronicling America API's JSON results and turn them into a Python variable called 'data'
data = response.json()

The request that we made was for data formatted in JSON, which stands for JavaScript Object Notation. JSON is a structured way of organizing information, and it can be converted into a Pandas DataFrame; however, it's not always easy for a human to read. We're going to use Python's built-in `json` module to indent the output and a library called Pygments to add some colour, which makes the JSON a little more understandable to the human reader. 

In [None]:
# Let's prettify the raw JSON data and then display it.

# We're using the Pygments library to add some colour to the output, so we need to import it
import json
from pygments import highlight, lexers, formatters
import pprint

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)

## Answer 3

Nothing besides the results changes when we rerun the activity: we only update the value of `'proxtext'` to `'song'` in the `params` dictionary.


Before
```
params = {
    'proxtext': 'poetry' # Search for this keyword -- feel free to change!
    
}
```
After

```
params = {
    'proxtext': 'song' # Search for this keyword -- feel free to change!
    
}
```

We do not need to make any changes to the next cells for them to run.