# Week 4 Day 2: APIs

## Ethical web scraping

The phrase "data scraping" is colloquial and popular but has pejorative connotations. Data is valuable: other people invested time in collecting, organizing, and sharing it. When you show up demanding data with a scraper you built in maybe a dozen hours, you rarely pay the costs of labor, hosting, *etc*. that went into making the data available. There are *very* good rationales for making many kinds of data more available: reproducibility of scientific results, sharing publicly-funded and/or close-to-zero marginal cost resources, transparency and accountability in democratic institutions, remixing for innovative new analyses, *etc*.

But data breaches have become household names (Target in 2013, Equifax in 2017, Facebook in 2018, *etc*.) because they violate other values like privacy. These values are articulated most clearly in the principles outlined in the 1978 [Belmont Report](https://en.wikipedia.org/wiki/Belmont_Report):
* **Respect for persons**: protecting the autonomy of all people and treating them with courtesy and respect and allowing for informed consent. Researchers must be truthful and conduct no deception;
* **Beneficence**: The philosophy of "Do no harm" while maximizing benefits for the research project and minimizing risks to the research subjects; and
* **Justice**: ensuring reasonable, non-exploitative, and well-considered procedures are administered fairly and equally, with a fair distribution of costs and benefits to potential research participants.

(A fourth principle, "Respect for Law and Public Interest," added by the later Menlo Report, emphasizes compliance, accountability, and transparency in the conduct of research.)

In the context of data scraping, there are four "areas of difficulty":

* **Informed consent**: does the data scraper obtain consent from every person whose data is being retrieved?
* **Informational risk**: can the data scraper inflict economic, social, *etc*. harm on individuals by disclosing data?
* **Privacy**: does the data scraper know which information a person intended to be private or public? 
* **Decision-making under uncertainty**: does the data scraper know all the ways the data could be (mis)used? 

Ethical and legal risks involved with scraping:

* **[Copyright infringement](https://en.wikipedia.org/wiki/Copyright_infringement)**: compiling data that someone else can claim ownership over
* **[Trespass](https://en.wikipedia.org/wiki/Trespass_to_chattels#In_the_electronic_age)**: over-aggressive scraping shuts down someone else's property
* **[Computer Fraud & Abuse Act](https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act)**: misrepresenting yourself to access a system is "hacking"

While I cannot provide legal advice, we will revisit these concerns throughout the course through best practices for avoiding infringement, staggering data collection, simulating human requests, securing data, and protecting privacy.

James Densmore has a nice summary of [practices for ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

> * If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
> * I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
> * I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
> * I will only save the data I absolutely need from your page. If all I need it OpenGraph meta-data, that’s all I’ll keep.
> * I will respect any content I do keep. I’ll never pass it off as my own.
> * I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
> * I will respond in a timely fashion to your outreach and work with you towards a resolution.
> * I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other important components of ethical web scraping practices [include](http://robertorocha.info/on-the-ethics-of-web-scraping/):

* Read the Terms of Service and Privacy Policies for the site's rules on scraping.
* Inspect the robots.txt file for rules about what pages can be scraped, indexed, *etc*.
* Be gentle on smaller websites by running during off-peak hours and spacing out requests.
* Identify yourself by name and email in your User-Agent strings.
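The last two practices can be sketched in code. This is a minimal sketch, not a complete scraper: `USER_AGENT`, `build_request`, and `fetch_politely` are hypothetical names, the contact details are placeholders, and the one-second default delay is an assumption rather than a rule.

```python
import time
import urllib.request

# Placeholder identity: replace with your own name and contact email.
USER_AGENT = "course-scraper/0.1 (Jane Doe; jane.doe@example.edu)"


def build_request(url):
    """Build (but do not send) a GET request that identifies the scraper."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(request) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # space out requests to be gentle on the server
        results.append(fetch(build_request(url)))
    return results
```

Passing `urllib.request.urlopen` as `fetch` would send the requests for real; passing a stub function makes the pacing logic easy to try without touching any website.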

What does a robots.txt file look like? CNN's is a useful example. It helpfully provides a sitemap that points robots to other pages, allows all User-agents, and disallows crawling of pages in specific directories (ads, polls, tests).
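Python's standard library can read these rules for you. The sketch below uses `urllib.robotparser` on a hypothetical robots.txt written for this example (it is not CNN's actual file); the agent name `course-scraper` and the `example.com` URLs are likewise made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example.
SAMPLE_ROBOTS_TXT = """\
Sitemap: https://www.example.com/sitemap.xml
User-agent: *
Disallow: /ads/
Disallow: /polls/
Disallow: /tests/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

# can_fetch(useragent, url) reports whether the rules allow that agent to request the URL.
allowed_article = parser.can_fetch("course-scraper", "https://www.example.com/news/story.html")
allowed_ads = parser.can_fetch("course-scraper", "https://www.example.com/ads/banner.html")
print(allowed_article, allowed_ads)
print(parser.site_maps())  # sitemap URLs listed in the file (Python 3.8+)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a string.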

![Should you build a scraper flowchart](http://www.storybench.org/wp-content/uploads/2016/04/flowchart_final.jpeg)

<!-- 

### What is an API?
An API is a communication tool. You, the ***User***, communicate with the ***Client***, the computer that sends the request to the ***Server***, the computer that responds to your request. 

The server is where the information you’re looking for is stored, and it’s what responds to your request. Information about the server appears in the documentation. The documentation will include the endpoints where specific data can be found as well as the structure of the data on the server. 

How you make a request depends on the API you are using, and that is where documentation such as the Wikipedia API documentation comes into play.

A core concept that is prevalent across APIs is the ***Endpoint***: a specific route or URL where an API can be accessed. Each endpoint corresponds to a particular function or data point that the API exposes for use.

When interacting with an API, a **client** (the person or software using the API) will send **requests** to the **endpoint**. These requests tell the API what action to perform, and they often contain additional data or parameters to guide that action. Following a request, the API provides a **response**. This contains the data requested, or an error message detailing why the request couldn’t be completed.



<!-- JSON files are similar to dictionaries but they don't import as so-->

### import requests
One of the things we need in order to access an API is `requests`. This is a library that makes HTTP requests and retrieves the information a website serves, such as an RSS feed.

documentation - https://requests.readthedocs.io/en/latest/

- `requests.get(url, params={key: value}, ...)` sends a GET request to the specified URL.

- `requests.get(...).json()` decodes a JSON response body into Python objects.

Making a request with Requests is very simple. Begin by importing the Requests ... There's also a builtin JSON decoder. 
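To see what JSON decoding produces without making a network call, here is a small sketch using the standard-library `json` module. The `body` string is hand-written for illustration, modeled on the kind of response httpbin returns; it is not a recorded response.

```python
import json

# A hand-written response body, modeled on what an httpbin-style endpoint returns.
body = '{"args": {"a": "1", "b": "2"}, "url": "https://httpbin.org/get?a=1&b=2"}'

data = json.loads(body)        # a JSON object becomes a Python dict
print(type(data).__name__)
print(sorted(data.keys()))     # top-level keys, as with any dict
print(data["args"]["b"])       # nested values are reached by chaining keys
```

Once decoded, the result behaves like any other dictionary, so the usual tools (`.keys()`, indexing, loops) apply.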

In [11]:
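import requests
import urllib.request
import urllib3

print(urllib3.__version__)
# reuse one Session so the settings below apply to every request made through it
s = requests.Session()
s.verify = False  # disables TLS certificate verification; avoid outside of debugging
s.trust_env = True  # allow settings such as proxies to be read from the environment
s.proxies = urllib.request.getproxies()
s.get("https://httpbin.org/get?a=1&b=2").json()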

1.26.16




{'args': {'a': '1', 'b': '2'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.3',
  'X-Amzn-Trace-Id': 'Root=1-685ae37d-0ac554914205e901646b6688'},
 'origin': '67.172.153.46',
 'url': 'https://httpbin.org/get?a=1&b=2'}

<!--  -->

## PokeApi
### We're going to try the PokeApi: https://pokeapi.co

This is an API where we can get a bunch of information about a given Pokemon.

- Try it first in the UI on the website.
- The available Pokemon are listed at https://pokeapi.co/api/v2/pokemon

The URL you need to request is:

    "https://pokeapi.co/api/v2/pokemon/"+pokemonName


In [13]:
pokemonName = 'charmander'

pokemonSearch = requests.get("https://pokeapi.co/api/v2/pokemon/"+pokemonName).json()

SSLError: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/charmander (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1006)')))

In [21]:
#import pprint so we can see it
import pprint

In [23]:
#write your own

pokemonSearch_squirtle = requests.get("https://pokeapi.co/api/v2/pokemon/squirtle").json()

SSLError: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/squirtle (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1006)')))

In [1]:
# pretty print 



In [1]:
#get the keys of the dictionary

pokemonSearch.keys()

<!--  -->

## Exercise 1:

Write code to find the Pokemon's abilities, store them as a list, and then print them out. (Try a different Pokemon.)

Hint: A Pokemon can have multiple abilities, so you'll need to iterate over them.

In [None]:
abilityLyst = []  # start with an empty list to collect the names
for ability in pokemonSearch_squirtle['abilities']:
    abilityLyst.append(ability['ability']['name'])
print(abilityLyst)
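The same extraction can also be written as a list comprehension. The sketch below runs on a hand-written sample record that mimics the `['abilities'][i]['ability']['name']` shape used above; the values are illustrative, not copied from a live response.

```python
# Hand-written sample shaped like the 'abilities' field used above.
sample_pokemon = {
    "name": "squirtle",
    "abilities": [
        {"ability": {"name": "torrent"}},
        {"ability": {"name": "rain-dish"}},
    ],
}

# One pass over the list, pulling the nested name out of each entry.
ability_names = [entry["ability"]["name"] for entry in sample_pokemon["abilities"]]
print(ability_names)
```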

<!--  -->

## TVMaze

https://www.tvmaze.com/api#show-search

Note: Futurama is show 538.

In [27]:
#make a query for the show 'girls'

requests.get('https://api.tvmaze.com/search/shows?q=girls')

<Response [200]>

In [33]:
#make a query for the show 'girls' and decode the JSON response

girlsQ = requests.get('https://api.tvmaze.com/search/shows?q=girls').json()

girlsQ

[{'score': 0.90573967,
  'show': {'id': 139,
   'url': 'https://www.tvmaze.com/shows/139/girls',
   'name': 'Girls',
   'type': 'Scripted',
   'language': 'English',
   'genres': ['Drama', 'Romance'],
   'status': 'Ended',
   'runtime': 30,
   'averageRuntime': 30,
   'premiered': '2012-04-15',
   'ended': '2017-04-16',
   'officialSite': 'http://www.hbo.com/girls',
   'schedule': {'time': '22:00', 'days': ['Sunday']},
   'rating': {'average': 6.4},
   'weight': 97,
   'network': {'id': 8,
    'name': 'HBO',
    'country': {'name': 'United States',
     'code': 'US',
     'timezone': 'America/New_York'},
    'officialSite': 'https://www.hbo.com/'},
   'webChannel': None,
   'dvdCountry': None,
   'externals': {'tvrage': 30124, 'thetvdb': 220411, 'imdb': 'tt1723816'},
   'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/31/78286.jpg',
    'original': 'https://static.tvmaze.com/uploads/images/original_untouched/31/78286.jpg'},
   'summary': '<p>This Emmy winni

In [37]:
#look up Futurama's show number by searching for it
futuramaQ = requests.get('https://api.tvmaze.com/search/shows?q=futurama').json()

futuramaQ[0]['show']['id']

538
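Once the id is known, it can be used to address the show directly. The sketch below only builds URLs and does not send anything; `search_url` and `show_url` are hypothetical helper names, the `/shows/<id>` path follows the TVMaze API documentation linked above, and `sample_results` is a trimmed, hand-written stand-in for a search response.

```python
from urllib.parse import urlencode

TVMAZE_ROOT = "https://api.tvmaze.com"


def search_url(query):
    """URL for a show search; urlencode escapes spaces and punctuation in the query."""
    return f"{TVMAZE_ROOT}/search/shows?{urlencode({'q': query})}"


def show_url(show_id):
    """URL for a single show, addressed by its numeric id."""
    return f"{TVMAZE_ROOT}/shows/{show_id}"


# Trimmed, hand-written stand-in for the list a search returns.
sample_results = [{"show": {"id": 538, "name": "Futurama"}}]
first_id = sample_results[0]["show"]["id"]
print(search_url("futurama"))
print(show_url(first_id))
```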

<!--  -->

### APIs with keys

An application programming interface (API) key is a code used to identify and authenticate an application or user. API keys are issued by the platform that runs the API, for example through a white-labeled internal marketplace. The key acts both as a unique identifier and as a secret token for authentication.

#### What is an API query?
Parameters are the variables passed to an API endpoint to give the API server explicit instructions for processing the request. They can be included in the URL query string or in the request body.

![how-to-use-an-api-just-the-basics-4.png](attachment:how-to-use-an-api-just-the-basics-4.png)
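As a rough illustration of how parameters end up in a query string, the sketch below builds one with the standard library. The endpoint `https://api.example.com/search` and the parameter names are invented for the example; passing the same dictionary as `params=` to `requests.get` performs an equivalent encoding.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameters, for illustration only.
endpoint = "https://api.example.com/search"
params = {
    "q": "Florence + the Machine",  # spaces and '+' need escaping in a URL
    "limit": 5,
    "format": "json",
}

url = endpoint + "?" + urlencode(params)
print(url)

# Decoding the query string recovers the original values (as lists of strings).
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"])
```

Building the string by hand with `+` works for simple values, but an encoder avoids broken URLs when a value contains spaces, `&`, or `+`.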


<!--  -->

## Last.fm Music Discovery API

The Last.fm API allows anyone to build their own programs using Last.fm data.

- https://www.last.fm/api


In [43]:
#this is a key
aKey = "815f527f75d594aa272fc6c9205136b2"

#the api root information is found here - https://www.last.fm/api/intro
rootURL = "http://ws.audioscrobbler.com/2.0/"

#write a query
artistSearchQuery = requests.get(rootURL+"?method=artist.search&artist=eminem&api_key="+
                          aKey+"&format=json").json()
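Because a key is a secret, a common alternative to pasting it into the notebook is to read it from an environment variable. This is a sketch under assumptions: `load_api_key` and `lastfm_params` are hypothetical helpers, and `LASTFM_API_KEY` is a variable name chosen for this example, not something Last.fm defines.

```python
import os


def load_api_key(var_name, environ=os.environ):
    """Return the key stored in the named environment variable, or fail loudly."""
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def lastfm_params(method, artist, api_key):
    """Collect the query parameters used in the cell above into one dictionary."""
    return {"method": method, "artist": artist, "api_key": api_key, "format": "json"}
```

With the variable set, the request above could be written as `requests.get(rootURL, params=lastfm_params("artist.search", "eminem", load_api_key("LASTFM_API_KEY")))`, which keeps the key out of the saved notebook.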

In [47]:
pprint.pprint(artistSearchQuery)

{'results': {'@attr': {'for': 'eminem'},
             'artistmatches': {'artist': [{'image': [{'#text': 'https://lastfm.freetls.fastly.net/i/u/34s/2a96cbd8b46e442fc41c2b86b821562f.png',
                                                      'size': 'small'},
                                                     {'#text': 'https://lastfm.freetls.fastly.net/i/u/64s/2a96cbd8b46e442fc41c2b86b821562f.png',
                                                      'size': 'medium'},
                                                     {'#text': 'https://lastfm.freetls.fastly.net/i/u/174s/2a96cbd8b46e442fc41c2b86b821562f.png',
                                                      'size': 'large'},
                                                     {'#text': 'https://lastfm.freetls.fastly.net/i/u/300x300/2a96cbd8b46e442fc41c2b86b821562f.png',
                                                      'size': 'extralarge'},
                                                     {'#text': 'https://lastfm.f

In [49]:
#find the top albums

artistSearchQuery = requests.get(rootURL+"?method=artist.getTopAlbums&artist=eminem&api_key="+
                          aKey+"&format=json").json()


In [51]:
artistSearchQuery

{'topalbums': {'album': [{'name': 'The Eminem Show',
    'playcount': 66987229,
    'mbid': 'af71f60c-a8e8-4774-a2b3-30dbfaa13bd6',
    'url': 'https://www.last.fm/music/Eminem/The+Eminem+Show',
    'artist': {'name': 'Eminem',
     'mbid': 'b95ce3ff-3d05-4e87-9e01-c97b66af13d4',
     'url': 'https://www.last.fm/music/Eminem'},
    'image': [{'#text': 'https://lastfm.freetls.fastly.net/i/u/34s/74768435b4f70689863aa76f888d62a3.png',
      'size': 'small'},
     {'#text': 'https://lastfm.freetls.fastly.net/i/u/64s/74768435b4f70689863aa76f888d62a3.png',
      'size': 'medium'},
     {'#text': 'https://lastfm.freetls.fastly.net/i/u/174s/74768435b4f70689863aa76f888d62a3.png',
      'size': 'large'},
     {'#text': 'https://lastfm.freetls.fastly.net/i/u/300x300/74768435b4f70689863aa76f888d62a3.png',
      'size': 'extralarge'}]},
   {'name': 'Recovery',
    'playcount': 44846260,
    'mbid': 'dddf01df-f9f1-4ba6-b414-5ddf1984fc7f',
    'url': 'https://www.last.fm/music/Eminem/Recovery',
    '

### Exercise 2

Make a search query for your own favorite artist.

In [59]:
mySearchQuery = requests.get(rootURL+"?method=artist.getTopTracks&artist=Coldplay&api_key="+
                          aKey+"&format=json").json()

mySearchQuery['toptracks']['track'][1]['name']

'Viva la Vida'

In [63]:
for track in mySearchQuery['toptracks']['track']:
    print(track['name'])

Yellow
Viva la Vida
The Scientist
Clocks
Fix You
Sparks
Paradise
Don't Panic
Trouble
In My Place
Speed of Sound
A Sky Full of Stars
Violet Hill
Shiver
Adventure Of A Lifetime
Talk
Hymn for the Weekend
Green Eyes
God Put a Smile Upon Your Face
Strawberry Swing
Life in Technicolor
Parachutes
Politik
We Never Change
Cemeteries of London
Every Teardrop Is a Waterfall
The Hardest Part
Magic
Spies
High Speed
42
A Message
Lost!
A Rush of Blood to the Head
princess of china
Amsterdam
Daylight
My Universe
What If
Square One
White Shadows
Charlie Brown
Death and All His Friends
Swallowed in the Sea
A Whisper
Low
Yes
Christmas Lights
Twisted Logic


## National Park Service API

https://www.nps.gov/subjects/developer/index.htm

In [67]:
# authentication
#https://www.nps.gov/subjects/developer/guides.html

# set base url and authentication
apiKey = "lLqHaJEKm2wfhbVIZVVSPrUxkBxWKbDj0GcgplEk"

# request call
baseURL = "https://developer.nps.gov/api/v1"
HEADERS = {"X-Api-Key":apiKey}
req = requests.get(baseURL+"/campgrounds",headers=HEADERS).json()

In [69]:
pprint.pprint(req)

{'data': [{'accessibility': {'accessRoads': ['Paved Roads - All vehicles OK'],
                             'adaInfo': 'The main road leading to the '
                                        'campground is paved but the road that '
                                        'goes to each campsite is not.',
                             'additionalInfo': '',
                             'cellPhoneInfo': '',
                             'classifications': ['Limited Development '
                                                 'Campground'],
                             'fireStovePolicy': 'Ground fires are not '
                                                'permitted. Each campsite has '
                                                'a grill.',
                             'internetInfo': '',
                             'rvAllowed': '1',
                             'rvInfo': 'RV and Trailers are permitted',
                             'rvMaxLength': '0',
                             

In [5]:
# park code amis

In [71]:
# make a list of all activities you can do in the park

for activity in req['data'][15]['activities']:
    print(activity['name'])


KeyError: 'activities'
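The `KeyError` means the record at that index has no `'activities'` key. One defensive pattern is `dict.get` with a default, sketched here on hypothetical records (the names and fields below are made up and do not come from the NPS response).

```python
# Hypothetical records: only some of them carry an 'activities' list.
sample_data = [
    {"name": "Campground A"},
    {"name": "Park B", "activities": [{"name": "Hiking"}, {"name": "Fishing"}]},
]


def activity_names(record):
    """Return the activity names for a record, or an empty list if it has none."""
    return [activity["name"] for activity in record.get("activities", [])]


for record in sample_data:
    print(record["name"], activity_names(record))
```

Checking `record.keys()` first, as done earlier for the Pokemon response, is another way to find out which fields an endpoint actually returns.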

## Making Functions

## Wikipedia API

In [73]:
## Run this cell block
baseURL = "https://en.wikipedia.org/w/api.php" #base URL for the wikipedia API

In [77]:
search = 'microsoft' #the parameter we'll search for

testRequest = requests.get(baseURL+"?action=query&list=search&srsearch="+search+"&format=json").json() #our json request

In [79]:
testRequest

{'batchcomplete': '',
 'continue': {'sroffset': 10, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 43601},
  'search': [{'ns': 0,
    'title': 'Microsoft',
    'pageid': 19001,
    'size': 229016,
    'wordcount': 19729,
    'snippet': '<span class="searchmatch">Microsoft</span> Corporation is an American multinational corporation and technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the',
    'timestamp': '2025-06-24T00:25:20Z'},
   {'ns': 0,
    'title': 'Microsoft Office',
    'pageid': 20288,
    'size': 198060,
    'wordcount': 16053,
    'snippet': 'Bill Gates on August 1, 1988, at COMDEX, contained <span class="searchmatch">Microsoft</span> Word, <span class="searchmatch">Microsoft</span> Excel, and <span class="searchmatch">Microsoft</span> PowerPoint — all three of which remain core products',
    'timestamp': '2025-05-05T23:14:26Z'},
   {'ns': 0,
    'title': 'Microsoft Excel',
    'pageid': 20268,
    'size': 103566,
    'wordcount': 

In [81]:
# define a function
def WikiCall(search):
    # define the baseURL
    baseURL = "https://en.wikipedia.org/w/api.php" #base URL for the wikipedia API
    # make a query
    Q = requests.get(baseURL+"?action=query&list=search&srsearch="+search+"&format=json").json() #our json request

    return Q

In [83]:
WikiCall('Microsoft')

{'batchcomplete': '',
 'continue': {'sroffset': 10, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 43602},
  'search': [{'ns': 0,
    'title': 'Microsoft',
    'pageid': 19001,
    'size': 229016,
    'wordcount': 19729,
    'snippet': '<span class="searchmatch">Microsoft</span> Corporation is an American multinational corporation and technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the',
    'timestamp': '2025-06-24T00:25:20Z'},
   {'ns': 0,
    'title': 'Microsoft Office',
    'pageid': 20288,
    'size': 198060,
    'wordcount': 16053,
    'snippet': 'Bill Gates on August 1, 1988, at COMDEX, contained <span class="searchmatch">Microsoft</span> Word, <span class="searchmatch">Microsoft</span> Excel, and <span class="searchmatch">Microsoft</span> PowerPoint — all three of which remain core products',
    'timestamp': '2025-05-05T23:14:26Z'},
   {'ns': 0,
    'title': 'Microsoft Excel',
    'pageid': 20268,
    'size': 103566,
    'wordcount': 
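The response only holds the first ten hits; its `'continue'` block (`'sroffset': 10`) says where the next page starts. A hedged sketch of requesting further pages is to merge that block into the parameters of the next request. `next_page_params` is a hypothetical helper, and `first_response` is a trimmed copy of the output above.

```python
def next_page_params(params, response):
    """Merge the response's 'continue' block into params, or return None when done."""
    cont = response.get("continue")
    if cont is None:
        return None  # no 'continue' block means there are no more pages
    merged = dict(params)  # copy so the original parameters stay unchanged
    merged.update(cont)
    return merged


base_params = {"action": "query", "list": "search", "srsearch": "microsoft", "format": "json"}
first_response = {"batchcomplete": "", "continue": {"sroffset": 10, "continue": "-||"}}

print(next_page_params(base_params, first_response))
```

The merged dictionary could then be passed as `params=` to `requests.get` for the second page, repeating until the helper returns `None`.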

### Exercise 3: 

Call the same API and pass another variable as the search term.

In [85]:
WikiCall("Dog")

{'batchcomplete': '',
 'continue': {'sroffset': 10, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 124204,
   'suggestion': 'do',
   'suggestionsnippet': 'do'},
  'search': [{'ns': 0,
    'title': 'Dog',
    'pageid': 4269567,
    'size': 192447,
    'wordcount': 17696,
    'snippet': 'The <span class="searchmatch">dog</span> (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the gray wolf. Also called the domestic <span class="searchmatch">dog</span>, it was selectively bred',
    'timestamp': '2025-06-21T22:30:22Z'},
   {'ns': 0,
    'title': 'Dog (disambiguation)',
    'pageid': 2854454,
    'size': 7831,
    'wordcount': 1059,
    'snippet': 'Look up <span class="searchmatch">dog</span>, doggy, or doggie in Wiktionary, the free dictionary. The <span class="searchmatch">dog</span> is a domesticated canid species, Canis familiaris. <span class="searchmatch">Dog</span>(s), doggy, or doggie may',
    'timestamp': '2025-03-16T12:09:44Z'},
   {'ns': 

### Exercise 4: 
Create a function that searches for articles like in the previous two questions, but this time return the article with the highest word count. This should return the title, pageid, snippet and wordcount from the article with the highest word count from your search.

In [99]:
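def WikiCall_highest(search):
    best = None  # article with the highest word count seen so far
    # define the baseURL
    baseURL = "https://en.wikipedia.org/w/api.php" #base URL for the wikipedia API
    # make a query
    Q = requests.get(baseURL+"?action=query&list=search&srsearch="+search+"&format=json").json() #our json request

    for article in Q['query']['search']:

        wordCount = article['wordcount']

        if best is None or best['wordcount'] < wordCount:
            best = {'title': article['title'],
                    'pageid': article['pageid'],
                    'snippet': article['snippet'],
                    'wordcount': wordCount}

        # trace: the leading title so far, then this article's word count
        print(best['title'])
        print(wordCount)

    # return the title, pageid, snippet and wordcount of the longest article
    return best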

In [101]:
WikiCall_highest('Dog')

Dog
17696
Dog
1059
Dog
388
Dog
282
Dog
1322
Dog
469
Dog
5174
Dog
5108
Dog
17247
Dog
558
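Python's built-in `max` with a `key` function expresses the same selection without tracking the running maximum by hand. The sketch below uses two records trimmed from the `WikiCall("Dog")` output above; `longest_article` is a hypothetical helper name.

```python
# Two search hits, trimmed from the WikiCall("Dog") output above.
articles = [
    {"title": "Dog", "pageid": 4269567, "wordcount": 17696},
    {"title": "Dog (disambiguation)", "pageid": 2854454, "wordcount": 1059},
]


def longest_article(search_hits):
    """Return the hit with the largest word count, or None for an empty list."""
    return max(search_hits, key=lambda hit: hit["wordcount"], default=None)


longest = longest_article(articles)
print(longest["title"], longest["wordcount"])
```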
