In [1]:
from lec_utils import *


<div class="alert alert-info" markdown="1">

#### Lecture 10


# APIs II & Web Scraping
    
</div>


## More API Demos

---

In [7]:
import yaml

# Load the YAML config
with open("api_keys_private.yaml") as f:
    config = yaml.safe_load(f)

# Access values
dartmouth_chat_api = config["dartmouth_chat_api"]["api_key"]
openai_api_key = config["openai"]["api_key"]
rapidapi_api_key = config["rapidapi"]["api_key"]
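
If a service entry is missing from the YAML file, the lookups above fail with a bare `KeyError`. Below is a minimal sketch of a friendlier lookup, assuming the same `service -> api_key` layout. The helper name `get_api_key` and the sample dictionary are hypothetical and not part of the lecture code.

```python
def get_api_key(config: dict, service: str) -> str:
    """Return config[service]["api_key"], raising a descriptive error if absent."""
    entry = config.get(service) or {}
    key = entry.get("api_key")
    if not key:
        raise KeyError(f"No api_key found for '{service}' in the config file")
    return key


# Stand-in for the dictionary that yaml.safe_load returns (values are fake).
sample_config = {"openai": {"api_key": "sk-example"}, "rapidapi": {}}

print(get_api_key(sample_config, "openai"))
```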

## Dartmouth Chat API

- Get your API key: https://rcweb.dartmouth.edu/~d20964h/2024-12-11-dartmouth-chat-api/api_key/
- Basic usage: https://rcweb.dartmouth.edu/~d20964h/2024-12-11-dartmouth-chat-api/basic_usage/
- Using with OpenAI or Langchain packages: https://rcweb.dartmouth.edu/~d20964h/2024-12-11-dartmouth-chat-api/using_python_langchain/

### List available models

In [6]:
resp

<Response [401]>
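
A `401` status means the server rejected the request as unauthorized, typically because the API key was missing or invalid. Here is a small sketch, using only the standard library, of turning a status code into a readable message before trying to read the response body. The helper name `explain_status` is hypothetical.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Describe an HTTP status code and whether it signals success."""
    phrase = HTTPStatus(code).phrase
    outcome = "success" if 200 <= code < 300 else "request failed"
    return f"{code} {phrase}: {outcome}"


print(explain_status(401))
print(explain_status(200))
```

With `requests`, calling `resp.raise_for_status()` raises an exception for 4xx and 5xx responses, which serves a similar purpose.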

In [9]:
import requests
from pprint import pprint


resp = requests.get(
    "https://chat.dartmouth.edu/api/models",
    headers={"Authorization": f"bearer {dartmouth_chat_api}"},
)
models = [ model['id'] for model in resp.json()["data"] ]
pprint(models)

['openai.gpt-oss-120b',
 'google.gemma-3-27b-it',
 'meta.llama-3.2-11b-vision-instruct',
 'qwen.qwen3-vl-32b-instruct-fp8',
 'openai_responses.gpt-5.1-chat-latest',
 'openai_responses.gpt-5.1-2025-11-13',
 'openai_responses.gpt-5.2-chat-latest',
 'openai_responses.gpt-5.2-2025-12-11',
 'openai.gpt-4.1-mini-2025-04-14',
 'openai.gpt-4.1-2025-04-14',
 'openai_responses.gpt-5-mini-2025-08-07',
 'openai_responses.gpt-5-2025-08-07',
 'anthropic.claude-3-5-haiku-20241022',
 'anthropic.claude-haiku-4-5-20251001',
 'anthropic.claude-3-7-sonnet-20250219',
 'anthropic.claude-opus-4-5-20251101',
 'anthropic.claude-sonnet-4-20250514',
 'anthropic.claude-sonnet-4-5-20250929',
 'vertex_ai.gemini-2.0-flash-001',
 'vertex_ai.gemini-2.5-flash',
 'vertex_ai.gemini-2.5-pro',
 'mistral.mistral-large-2512',
 'mistral.mistral-medium-2508',
 'mistral.pixtral-large-2411',
 'meta.llama-3-2-3b-instruct',
 'baai.bge-large-en-v1-5',
 'baai.bge-m3',
 'google_genai.gemini-embedding-001',
 'meta.codellama-13b-instru
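
The ids in this list appear to follow a `provider.model-name` pattern. Assuming that reading is right, a short sketch like the following groups models by the prefix before the first dot. The function name `group_by_provider` is illustrative, and the sample ids are copied from the output above.

```python
from collections import defaultdict


def group_by_provider(model_ids):
    """Group model ids by the prefix before the first dot."""
    groups = defaultdict(list)
    for model_id in model_ids:
        # partition splits on the first dot only, so dots in version numbers survive.
        provider, _, name = model_id.partition(".")
        groups[provider].append(name)
    return dict(groups)


# A few ids copied from the output above.
sample_ids = [
    "openai.gpt-oss-120b",
    "google.gemma-3-27b-it",
    "openai.gpt-4.1-mini-2025-04-14",
    "mistral.mistral-large-2512",
]

for provider, names in group_by_provider(sample_ids).items():
    print(provider, names)
```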

### Sending a prompt via `requests`

In [10]:
import requests

resp = requests.post(
    "https://chat.dartmouth.edu/api/chat/completions",
    headers={"Authorization": f"bearer {dartmouth_chat_api}"},
    json={
        "model": "openai.gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is an API?"},
        ],
        "stream": False,
    },
)
print(resp.json()["choices"][0]["message"]["content"])

**API** stands for **A**pplication **P**rogramming **I**nterface. It’s a set of rules, protocols, and tools that lets one piece of software talk to another. Think of it as a contract between two programs: one side (the *provider*) offers certain functionality, and the other side (the *consumer*) can request that functionality by following the contract’s specifications.

---

## Core Ideas

| Concept | What It Means |
|---------|----------------|
| **Interface** | A defined way to interact: methods, endpoints, data formats, error codes, etc. |
| **Abstraction** | The consumer doesn’t need to know how the provider does its work; it only needs to know *what* it can ask for and *how* to ask. |
| **Loose Coupling** | Changes inside the provider can often be made without breaking the consumer, as long as the public contract stays the same. |
| **Automation** | APIs enable programs to call each other automatically, without human intervention. |

---

## Types of APIs

| Type | Typical 

### Sending a prompt via the `OpenAI` package with streaming

In [11]:
from openai import OpenAI

# Initialize the client, pointing it at the Dartmouth Chat API instead of OpenAI's default endpoint
client = OpenAI(
    base_url="https://chat.dartmouth.edu/api",
    api_key=dartmouth_chat_api
)

chat_completion = client.chat.completions.create(
    model="google.gemma-3-27b-it",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "How do you learn web scraping?"}
    ],
    stream=True
)

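# Iterate over the stream and print each piece of text as it arrives
for chunk in chat_completion:
    # Some chunks carry no choices or no text, so skip those instead of hiding errors
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
# Finish with a newline once the stream is exhausted
print()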

Okay, you want to learn web scraping! That's fantastic. It's a really useful skill. Here's a breakdown of how to learn it, broken down into steps, resources, and things to consider.  I'll try to be comprehensive.

**1. Understand the Basics & Legal/Ethical Considerations**

* **What *is* Web Scraping?** At its core, it's automatically extracting data from websites.  Instead of manually copying and pasting, you write code to do it for you.
* **Why Learn It?**  Data analysis, price monitoring, research, lead generation, content aggregation – the uses are diverse.
* **Legality & Ethics - *VERY IMPORTANT!***
    * **Terms of Service:** *Always* check a website's `robots.txt` file (e.g., `https://www.example.com/robots.txt`) and Terms of Service.  These dictate what you're allowed to scrape.  Respect these rules!
    * **Respect Server Load:** Don't bombard a website with requests.  Introduce delays (see "Best Practices" below).  Excessive scraping can overload a server and is considered 
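
With `stream=True`, each chunk carries a small piece of text in `choices[0].delta.content`, and some chunks carry none. The sketch below shows one way to collect the pieces into a single string. To keep it runnable without an API key, it uses stand-in objects built with `SimpleNamespace`. The names `make_chunk`, `fake_stream`, and `collect_stream` are hypothetical.

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a stand-in object shaped like one streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Fake stream: includes a chunk with no text and a chunk with no choices.
fake_stream = [
    make_chunk("Web "),
    make_chunk("scraping"),
    make_chunk(None),
    SimpleNamespace(choices=[]),
]


def collect_stream(stream):
    """Concatenate the text pieces of a stream, skipping empty chunks."""
    pieces = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            pieces.append(chunk.choices[0].delta.content)
    return "".join(pieces)


print(collect_stream(fake_stream))
```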

## OpenAI API

- Sign up for an account: https://platform.openai.com
- Get your API key from https://platform.openai.com/account/api-keys
- Add your API key to your `api_keys_private.yaml` file
- [Responses API Reference](https://platform.openai.com/docs/api-reference/responses)

In [12]:
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)

In [13]:
response = client.responses.create(
        model="gpt-4o-mini",
        input="What is the capital of New Hampshire? Provide a short answer."
    )
response.output_text

'The capital of New Hampshire is Concord.'

In [14]:
response

Response(id='resp_0c9caa729beb4483006984fb0f596c8195ba8fd66c0513f9c5', created_at=1770322703.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-mini-2024-07-18', object='response', output=[ResponseOutputMessage(id='msg_0c9caa729beb4483006984fb0fc8e08195b7a898b3ed18741c', content=[ResponseOutputText(annotations=[], text='The capital of New Hampshire is Concord.', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=ResponseUsa

In [15]:
# The response is a Pydantic model; model_dump() converts it to plain Python dicts and lists
response.model_dump()['output']

[{'id': 'msg_0c9caa729beb4483006984fb0fc8e08195b7a898b3ed18741c',
  'content': [{'annotations': [],
    'text': 'The capital of New Hampshire is Concord.',
    'type': 'output_text',
    'logprobs': []}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'}]
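
The generated text sits two levels down: each item in `output` has a `content` list, and each part of that list has a `text` field. As a rough sketch of walking that structure by hand, the function below collects every text part. The name `output_texts` is hypothetical, and the sample is copied from the output above.

```python
def output_texts(output_items):
    """Collect the text of every 'output_text' part in an output list."""
    texts = []
    for item in output_items:
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                texts.append(part["text"])
    return texts


# Copy of the output shown above.
sample_output = [
    {
        "id": "msg_0c9caa729beb4483006984fb0fc8e08195b7a898b3ed18741c",
        "content": [
            {
                "annotations": [],
                "text": "The capital of New Hampshire is Concord.",
                "type": "output_text",
                "logprobs": [],
            }
        ],
        "role": "assistant",
        "status": "completed",
        "type": "message",
    }
]

print(output_texts(sample_output))
```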

In [16]:
AOC_tweets = pd.read_json("../../public_data/AOC_recent_tweets.txt")
AOC_tweets
# list(AOC_tweets['full_text'][0:10])

Unnamed: 0,created_at,id,id_str,full_text,...,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,2021-02-06 20:22:38+00:00,1358149122264563712,1358149122264563712,RT @RepEscobar: Our country has the moral obli...,...,,,,
1,2021-02-06 20:16:39+00:00,1358147616400408576,1358147616400408576,RT @RoKhanna: What happens when we guarantee $...,...,,,,
2,2021-02-06 20:07:35+00:00,1358145332316667909,1358145332316667904,(Source: https://t.co/3o5JEr6zpd),...,,,,
...,...,...,...,...,...,...,...,...,...
3244,2019-10-09 14:00:32+00:00,1181932460516478976,1181932460516478976,Trump decision isn’t about drawing down US mil...,...,1.18e+18,1.18e+18,"{'url': 'https://t.co/ODpoyZI83r', 'expanded':...",{'created_at': 'Wed Oct 09 13:22:07 +0000 2019...
3245,2019-10-09 13:41:17+00:00,1181927615340453899,1181927615340453888,Federal govs are failing to act on the climate...,...,,,,
3246,2019-10-09 05:32:34+00:00,1181804625588051968,1181804625588051968,RT @LeanInOrg: Thank you @AOC for highlighting...,...,1.18e+18,1.18e+18,"{'url': 'https://t.co/S9q3D4wGcb', 'expanded':...",


In [17]:
list(AOC_tweets['full_text'][0:10])

['RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…',
 'RT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent…',
 '(Source: https://t.co/3o5JEr6zpd)',
 'Joe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU',
 'What’s even more gross is that Mace takes corporate PAC money.\n\nShe’s already funded by corporations. Now she’s choosing to swindle working people on top of it.\n\nPeak scam artistry. Caps for cash 💰 https://t.co/CcVxgDF6id',
 'Joe Cunningham already proving to be leagues more decent + honest than Mace seems capable of.\n\nThe House was far better off w/ Cunningham. It’s sad to see Mace diminish the representation of her community by launching a reput

In [14]:
print(AOC_tweets['full_text'][1])

RT @RoKhanna: What happens when we guarantee $15/hour?

💰 31% of Black workers and 26% of Latinx workers get raises.
😷 A majority of essent…


In [18]:
client = OpenAI(api_key=openai_api_key)

prompt_template = """
Is the following text making a statement about minimum wage? 
You should answer either Yes or No.

{text}

Answer:
"""

# Build prompts
questions = [
    prompt_template.format(text=t)
    for t in AOC_tweets["full_text"].head(10)
]

In [19]:
questions

['\nIs the following text making a statement about minimum wage? \nYou should answer either Yes or No.\n\nRT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…\n\nAnswer:\n',
 '\nIs the following text making a statement about minimum wage? \nYou should answer either Yes or No.\n\nRT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent…\n\nAnswer:\n',
 '\nIs the following text making a statement about minimum wage? \nYou should answer either Yes or No.\n\n(Source: https://t.co/3o5JEr6zpd)\n\nAnswer:\n',
 '\nIs the following text making a statement about minimum wage? \nYou should answer either Yes or No.\n\nJoe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU\n\nAnswer:\n',
 '

In [20]:
# Call GPT-4o-mini in a loop (or batch later)
responses = []
for q in questions:
    r = client.responses.create(
        model="gpt-4o-mini",
        input=q,
        temperature=0,
        max_output_tokens=16,
    )
    answer = r.output_text.strip()
    responses.append(answer)

In [16]:
responses

['No', 'Yes', 'Yes', 'No', 'No.', 'No.', 'No.', 'No', 'No', 'No']

In [21]:
AOC_tweets.loc[:9, "min_wage_label"] = responses
AOC_tweets.loc[:9, ["full_text", "min_wage_label"]]

Unnamed: 0,full_text,min_wage_label
0,RT @RepEscobar: Our country has the moral obli...,No
1,RT @RoKhanna: What happens when we guarantee $...,Yes
2,(Source: https://t.co/3o5JEr6zpd),Yes
...,...,...
7,RT @jaketapper: .@RepNancyMace fundraising off...,No
8,RT @RepMcGovern: One reason Washington can’t “...,No
9,"RT @JoeNeguse: Just to be clear, “targeting” s...",No.
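
The model's answers are not perfectly uniform: some come back as `No` and others as `No.`, which would count as two different labels in a `value_counts` or `groupby`. One possible cleanup step is sketched below. The function name `normalize_label` is hypothetical, and the raw answers are copied from the `responses` list above.

```python
def normalize_label(answer: str):
    """Map a raw model answer to 'Yes', 'No', or None if it is neither."""
    cleaned = answer.strip().rstrip(".").lower()
    if cleaned == "yes":
        return "Yes"
    if cleaned == "no":
        return "No"
    # Anything else is left for manual review rather than guessed at.
    return None


raw = ['No', 'Yes', 'Yes', 'No', 'No.', 'No.', 'No.', 'No', 'No', 'No']
labels = [normalize_label(r) for r in raw]
print(labels)
```

Normalizing only fixes formatting. It does not check whether a label is right, so spot-checking a sample of rows by hand is still worthwhile.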


## Rapid API

A marketplace for third-party APIs:
- Some users build and maintain the APIs
- Data collectors can pay for subscriptions
- Most APIs have "basic" plans that are free
- Browse the catalog at https://rapidapi.com/hub

- [Sports](https://rapidapi.com/search/Sports?sortBy=ByRelevance)
- [Social](https://rapidapi.com/search/Social?sortBy=ByRelevance)
- [News, Media](https://rapidapi.com/search/News%2C%20Media?sortBy=ByRelevance)

### Baseball from RapidAPI

- https://rapidapi.com/sparior/api/baseball4

In [22]:
url = "https://baseball4.p.rapidapi.com/v1/mlb/schedule"

querystring = {"date":"2025-10-25"}

headers = {
    "x-rapidapi-key": rapidapi_api_key,
    "x-rapidapi-host": "baseball4.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

response.json()

{'meta': {'version': 'v1.0',
  'status': 200,
  'copywrite': 'https://steadyapi.com',
  'date': '2025-10-25',
  'totalEvents': 0,
  'totalGames': 1,
  'totalGamesInProgress': 0},
 'body': {'0': {'gamePk': 813026,
   'gameGuid': 'd6817775-be45-4d6f-b02c-28bde5658918',
   'gameType': 'W',
   'season': '2025',
   'gameDate': '2025-10-26T00:00:00Z',
   'officialDate': '2025-10-25',
   'status': {'abstractGameState': 'Final',
    'codedGameState': 'F',
    'detailedState': 'Final',
    'statusCode': 'F',
    'startTimeTBD': False,
    'abstractGameCode': 'F'},
   'teams': {'away': {'team': {'id': 119, 'name': 'Los Angeles Dodgers'},
     'leagueRecord': {'wins': 1, 'losses': 1, 'pct': '.500'},
     'score': 5,
     'isWinner': True,
     'splitSquad': False,
     'seriesNumber': 1},
    'home': {'team': {'id': 141, 'name': 'Toronto Blue Jays'},
     'leagueRecord': {'wins': 1, 'losses': 1, 'pct': '.500'},
     'score': 1,
     'isWinner': False,
     'splitSquad': False,
     'seriesNumber'

In [23]:
response.json().keys()

dict_keys(['meta', 'body'])

In [24]:
response.json()['body']['0']['gamePk']

813026
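
Note that `body` here is a dictionary keyed by the strings `'0'`, `'1'`, and so on, rather than a list. A minimal sketch of looping over every game in such a response and summarizing the score follows. The function name `summarize_games` is hypothetical, and the sample is a trimmed copy of the response above.

```python
def summarize_games(schedule):
    """Return one 'Away score, Home score' line per game in a schedule response."""
    lines = []
    for game in schedule["body"].values():
        away = game["teams"]["away"]
        home = game["teams"]["home"]
        lines.append(
            f"{away['team']['name']} {away['score']}, "
            f"{home['team']['name']} {home['score']}"
        )
    return lines


# Trimmed copy of the response shown above.
sample_schedule = {
    "meta": {"status": 200, "date": "2025-10-25", "totalGames": 1},
    "body": {
        "0": {
            "gamePk": 813026,
            "teams": {
                "away": {"team": {"id": 119, "name": "Los Angeles Dodgers"}, "score": 5},
                "home": {"team": {"id": 141, "name": "Toronto Blue Jays"}, "score": 1},
            },
        }
    },
}

print(summarize_games(sample_schedule))
```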

In [25]:
url = "https://baseball4.p.rapidapi.com/v1/mlb/games-playbyplay"

querystring = {"gamePk":"813026"}

headers = {
    "x-rapidapi-key": rapidapi_api_key,
    "x-rapidapi-host": "baseball4.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

data = response.json()
data

{'meta': {'version': 'v1.0',
  'status': 200,
  'copywrite': 'https://steadyapi.com'},
 'body': [{'result': {'type': 'atBat',
    'event': 'Flyout',
    'eventType': 'field_out',
    'description': 'Shohei Ohtani flies out to left fielder Nathan Lukes.',
    'rbi': 0,
    'awayScore': 0,
    'homeScore': 0,
    'isOut': True},
   'about': {'atBatIndex': 0,
    'halfInning': 'top',
    'isTopInning': True,
    'inning': 1,
    'startTime': '2025-10-26T00:09:17.290Z',
    'endTime': '2025-10-26T00:10:16.216Z',
    'isComplete': True,
    'isScoringPlay': False,
    'hasReview': False,
    'hasOut': True,
    'captivatingIndex': 0},
   'count': {'balls': 0, 'strikes': 2, 'outs': 1},
   'matchup': {'batter': {'id': 660271, 'fullName': 'Shohei Ohtani'},
    'batSide': {'code': 'L', 'description': 'Left'},
    'pitcher': {'id': 592332, 'fullName': 'Kevin Gausman'},
    'pitchHand': {'code': 'R', 'description': 'Right'},
    'batterHotColdZones': [],
    'pitcherHotColdZones': [],
    'splits
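
Each play is a deeply nested dictionary, which is awkward to analyze directly. As a sketch of one way to flatten it, the function below pulls a handful of fields into a flat record. The name `flatten_play` and the choice of fields are illustrative, and the sample is a trimmed copy of the first play above.

```python
def flatten_play(play):
    """Pull a few fields out of one nested play-by-play entry."""
    return {
        "inning": play["about"]["inning"],
        "half": play["about"]["halfInning"],
        "batter": play["matchup"]["batter"]["fullName"],
        "pitcher": play["matchup"]["pitcher"]["fullName"],
        "event": play["result"]["event"],
    }


# Trimmed copy of the first play shown above.
sample_body = [
    {
        "result": {"type": "atBat", "event": "Flyout", "isOut": True},
        "about": {"atBatIndex": 0, "halfInning": "top", "inning": 1},
        "matchup": {
            "batter": {"id": 660271, "fullName": "Shohei Ohtani"},
            "pitcher": {"id": 592332, "fullName": "Kevin Gausman"},
        },
    }
]

rows = [flatten_play(p) for p in sample_body]
print(rows)
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame`.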

### YouTube from RapidAPI

- https://rapidapi.com/ytjar/api/yt-api

In [26]:
url = "https://yt-api.p.rapidapi.com/search"

querystring = {
    "query":"dartmouth",
    "geo":"US"
}

headers = {
    "x-rapidapi-key": rapidapi_api_key,
    "x-rapidapi-host": "yt-api.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

response.json()

{'continuation': 'ErADEglkYXJ0bW91dGgaogNTQlNDQVF0TGJVUlpXR0ZoVkRselFZSUJDMHhuYmsweE5taFhlQzAwZ2dFWVZVTnRla0YyVEdwUVdWVm1XVEJGVEZaNkxWUlBVR0ozZ2dFTGVrcE1SRXMyTTNGaVgydUNBUXRqVVRCbk1IbzRWa3RNWTRJQkMzcFFNamhHVTFZeGExOUZnZ0VMWnpCVGRHaExUbVpmYW5lQ0FRdHdaRUozT1d3MGJWZDBTWUlCQzNWR1puTlNkVk4zVW05VmdnRUxRMmxMTTI4NGVETlFRMmVDQVF0VlJWRjFPV3B4VEVOWFo0SUJDMGh3ZUhoNFpuaEtXSEJOZ2dFTFFXTXhRMDF2T1d0UFVWV0NBUXRKVVd0M1RsVmxZbXQwYzRJQkMxQTNWRmxsUm5rd1NuUkJnZ0VMTmpCb1EyZGZibWh6VnpDQ0FRdGhUVTFEWVd0bVIzSlBZNElCQ3poWGVuRTNSRlpzZGs0MGdnRUxXV1k1VlZsclJrVjFlV3VDQVF0a1ZWSlVVMlIxVkV0UlRiSUJCZ29FQ0J3UUF1b0JCQWdDRUJRJTNEGIHg6BgiC3NlYXJjaC1mZWVk.CgtlNk1BZ19kX2hnYyiqzZPMBjIKCgJTQRIEGgAgXw%3D%3D',
 'estimatedResults': '999290',
 'data': [{'type': 'shorts_listing',
   'title': 'Shorts',
   'data': [{'type': 'shorts',
     'videoId': 'GwHGlkc_ch0',
     'title': 'A Day at Dartmouth: Cake, Classes, & Cartwheels',
     'viewCountText': '3.3K views',
     'thumbnail': [{'url': 'https://i.ytimg.com/vi/GwHGlkc_ch0/frame0.jpg

In [27]:
response_json = response.json()
len(response_json['data'])

28

In [28]:
pd.DataFrame(response_json['data'])

Unnamed: 0,type,title,data,videoId,...,channelBadges,isVerifiedChannel,subtitle,isLive
0,shorts_listing,Shorts,"[{'type': 'shorts', 'videoId': 'GwHGlkc_ch0', ...",,...,,,,
1,video,Excellence in a Vibrant College Town,,HV9PPYV8Prk,...,,,,
2,video,Why Dartmouth?,,jIGj-4aum3I,...,,,,
...,...,...,...,...,...,...,...,...,...
25,video,Dartmouth president: Need to ensure American s...,,8Wzq7DVlvN4,...,[Verified],True,,
26,video,Dartmouth Engineering Video Tour,,Yf9UYkFEuyk,...,,,,
27,video,The Dartmouth College Murders : Robert Tulloch...,,dURTSduTKQM,...,[Verified],True,,
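
The first row above is a `shorts_listing` that wraps its own nested `data`, while the remaining rows are individual videos, which is why so many cells are empty. A small sketch of keeping only the `video` entries before building a DataFrame follows. The function name `video_rows` is hypothetical, and the sample entries are trimmed from the response above.

```python
def video_rows(items):
    """Keep only entries of type 'video', reduced to id and title."""
    return [
        {"videoId": item["videoId"], "title": item["title"]}
        for item in items
        if item.get("type") == "video"
    ]


# Trimmed copy of a few entries shown above.
sample_items = [
    {"type": "shorts_listing", "title": "Shorts",
     "data": [{"type": "shorts", "videoId": "GwHGlkc_ch0"}]},
    {"type": "video", "title": "Excellence in a Vibrant College Town", "videoId": "HV9PPYV8Prk"},
    {"type": "video", "title": "Why Dartmouth?", "videoId": "jIGj-4aum3I"},
]

print(video_rows(sample_items))
```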


### Twitter from RapidAPI

- https://rapidapi.com/davethebeast/api/twitter241

In [29]:
url = "https://twitter241.p.rapidapi.com/search-v2"

querystring = {
    "type":"Top",
    "count":"80",
    "query":"dartmouth"
}

headers = {
    "x-rapidapi-key": rapidapi_api_key,
    "x-rapidapi-host": "twitter241.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

response.json()

{'cursor': {'bottom': 'DAACCgACHAa-cEWAJxAKAAMcBr5wRX_Y8AgABAAAAAILAAUAAADoRW1QQzZ3QUFBZlEvZ0dKTjB2R3AvQUFBQUJNYi91QzNPTlpBU0J3RlNzSXZHNEF1SEFSV3dtU1dFTjBiL2RHY2dGb1FNUnYvb1J0R1ZoRHJHL29FaldHV2dZRWIvWm9hZUp0Um5Sd0FyQ1NWMTlFOEhBWmdDVTZhRVZvYitkdGpSUmJnNWh3QzZtWVZWOUhURy8ranZVaVc4SXdjQlhQNFh0cXdwaHdBV3lpR0d0R2hIQUt2OHV5V0FKUWNBNGxGdmhwUXNCd0Q5WVppMjJFWUcvM2c0b1dYc0t3Y0JVZ2pKOXNCc2c9PQgABgAAAAAIAAcAAAAADAAICgABG_nbY0UW4OYAAAA',
  'top': 'DAACCgACHAa-cEWAJxAKAAMcBr5wRX_Y8AgABAAAAAELAAUAAADoRW1QQzZ3QUFBZlEvZ0dKTjB2R3AvQUFBQUJNYi91QzNPTlpBU0J3RlNzSXZHNEF1SEFSV3dtU1dFTjBiL2RHY2dGb1FNUnYvb1J0R1ZoRHJHL29FaldHV2dZRWIvWm9hZUp0Um5Sd0FyQ1NWMTlFOEhBWmdDVTZhRVZvYitkdGpSUmJnNWh3QzZtWVZWOUhURy8ranZVaVc4SXdjQlhQNFh0cXdwaHdBV3lpR0d0R2hIQUt2OHV5V0FKUWNBNGxGdmhwUXNCd0Q5WVppMjJFWUcvM2c0b1dYc0t3Y0JVZ2pKOXNCc2c9PQgABgAAAAAIAAcAAAAADAAICgABG_nbY0UW4OYAAAA'},
 'result': {'timeline': {'instructions': [{'type': 'TimelineAddEntries',
     'entries': [{'entryId': 'toptabsrpusermodule-2019510872333877248',
       's

In [23]:
response_json = response.json()
response_json['result']['timeline']['instructions'][0]['entries'][0]['content']['items'][1]['item']['itemContent']

{'itemType': 'TimelineUser',
 '__typename': 'TimelineUser',
 'user_results': {'result': {'__typename': 'User',
   'id': 'VXNlcjo5MDkzOTM1Mjc5NTU3NDY4MTY=',
   'rest_id': '909393527955746816',
   'affiliates_highlighted_label': {},
   'avatar': {'image_url': 'https://pbs.twimg.com/profile_images/1891475534787264512/C0e8ry_E_normal.jpg'},
   'core': {'created_at': 'Sun Sep 17 12:28:07 +0000 2017',
    'name': 'Britannia Royal Naval College',
    'screen_name': 'DartmouthBRNC'},
   'dm_permissions': {'can_dm': True},
   'follow_request_sent': False,
   'has_graduated_access': True,
   'is_blue_verified': False,
   'legacy': {'default_profile': True,
    'default_profile_image': False,
    'description': 'Official feed for Britannia Royal Naval College delivering world class Naval Officer training in Dartmouth since 1863, and RN leadership training since 2008.',
    'entities': {'description': {'urls': []},
     'url': {'urls': [{'display_url': 'royalnavy.mod.uk/our-organisati…',
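
Long chains of keys and indexes like the one above raise a `KeyError` or `IndexError` as soon as one entry has a different shape. One defensive pattern, sketched here with a hypothetical helper named `dig`, is to follow the path step by step and fall back to a default. The sample is a trimmed stand-in with the same nesting as the item above.

```python
def dig(obj, *path, default=None):
    """Follow a sequence of dict keys and list indexes, returning default if any step fails."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Trimmed stand-in with the same nesting as the item shown above.
sample_item = {
    "itemType": "TimelineUser",
    "user_results": {
        "result": {
            "rest_id": "909393527955746816",
            "core": {"name": "Britannia Royal Naval College", "screen_name": "DartmouthBRNC"},
        }
    },
}

print(dig(sample_item, "user_results", "result", "core", "screen_name"))
print(dig(sample_item, "user_results", "result", "legacy", "description", default="(missing)"))
```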
       

### Reddit from RapidAPI

- https://rapidapi.com/sparior/api/reddit3

In [30]:
url = "https://reddit3.p.rapidapi.com/v1/reddit/search"

querystring = {
    "search":"investing",
    "subreddit":"wallstreetbets",
    "filter":"posts",
    "timeFilter":"year",
    "sortType":"relevance"
}

headers = {
    "x-rapidapi-key": rapidapi_api_key,
    "x-rapidapi-host": "reddit3.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

response.json()

{'meta': {'version': 'v1.0',
  'status': 200,
  'copywrite': 'https://steadyapi.com',
  'search': 'investing',
  'subreddit': 'wallstreetbets',
  'filter': 'posts',
  'timeFilter': 'year',
  'sortType': 'relevance',
  'cursor': 't3_1oecsuz'},
 'body': [{'approved_at_utc': None,
   'subreddit': 'wallstreetbets',
   'selftext': 'The year was 2015. I was working a shit job as a housing consultant at a for profit automotive school. The job sucked but I had plenty of desk time and I used it to research tickers. I had been trading off an on since around 2010 but with only about 1k. I remember buying Macys, Hecla Mining, and bullshit penny stocks until I knew better. I was over trading like a mad man. This was even dumber then because they charged commission of about 7$ per trade during those days. Just young and dumb not really making any money. Wasting company time so no harm no foul. \n\nIn the beginning of 2015 I received an inheritance from an Aunt for 10k. Which was basically my life sa

In [31]:
pd.DataFrame(response.json()['body'])

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,...,media,is_video,post_hint,preview
0,,wallstreetbets,The year was 2015. I was working a shit job as...,t2_1wpeqazaq7,...,,False,,
1,,investing,I’ve been trying to figure out how to position...,t2_o8n99at,...,,False,,
2,,fican,"I started investing 4 years ago, only make 70k...",t2_crrprcif,...,,False,,
...,...,...,...,...,...,...,...,...,...
22,,worldnews,,t2_b2qoadvw,...,,False,,
23,,wallstreetbets,No story just 0dte SPY. Maybe this year I’ll g...,t2_a0yvtjcr,...,,False,image,{'images': [{'source': {'url': 'https://previe...
24,,pcmasterrace,,t2_21wdg9e8,...,,False,link,{'images': [{'source': {'url': 'https://extern...
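
The `meta` block of the search response includes a `cursor` value, which suggests that further results can be requested page by page. The sketch below shows the general cursor-following loop under that assumption. It uses a fake two-page fetcher so it runs offline. The names `fetch_all`, `fake_fetch_page`, and `FAKE_PAGES` are hypothetical, and how the real API expects the cursor to be sent back is an assumption to check against its documentation.

```python
def fetch_all(fetch_page, max_pages=5):
    """Collect posts across pages by following the cursor in each response's meta."""
    posts, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        posts.extend(page["body"])
        cursor = page["meta"].get("cursor")
        if not cursor:
            # No cursor means there is nothing further to request.
            break
    return posts


# Fake two-page API for illustration. A real fetch_page would call requests.get
# and send the cursor back to the server (parameter name not confirmed here).
FAKE_PAGES = {
    None: {"meta": {"cursor": "t3_1oecsuz"}, "body": [{"title": "post 1"}, {"title": "post 2"}]},
    "t3_1oecsuz": {"meta": {"cursor": None}, "body": [{"title": "post 3"}]},
}


def fake_fetch_page(cursor):
    """Return the fake page that corresponds to the given cursor."""
    return FAKE_PAGES[cursor]


print(len(fetch_all(fake_fetch_page)))
```

The `max_pages` cap keeps the loop from running indefinitely and from using up a subscription's request quota.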


## Scraping vs. APIs

- There are two different ways of programmatically accessing data from the internet: either **by scraping**, or **through an API**.

- **Scraping** is the act of emulating a web browser to access a webpage's HTML source code.<br><small>When scraping, you get back data as HTML and have to **parse** that HTML to extract the information you want. To parse means to "extract meaning from a sequence of symbols".</small>

<center>
    
| ✅ Pros | ❌ Cons |
| --- | --- |
| If the website exists, you can usually scrape it.<br><small>This is what Google does!</small> | Scraping and parsing code gets **messy**, since <br>HTML documents contain lots of content unrelated to the<br>information you're trying to find (advertisements, formatting).<br><br>When the website's structure changes, your code will need to change, too.<br><br>The site owner may not _want_ you to scrape it! |
    
    
</center>

- An application programming interface, or **API**, is a service that makes data directly available to the user in a **convenient** fashion. Usually, APIs give us data back as JSON objects.<br><small>APIs are made by organizations that host data. For example, X (formerly known as Twitter) has an [API](https://developer.twitter.com/en/docs/twitter-api), as does [OpenAI](https://platform.openai.com/docs/overview?lang=python), the creator of ChatGPT.</small>


| ✅ Pros | ❌ Cons |
| --- | --- |
| If an API exists, the data are usually clean, up-to-date, and ready to use.<br><br>The presence of an API signals that the data provider<br> is okay with you using their data.<br><br>The data provider can plan and regulate data usage.<br><small>Sometimes, you may need to create an API "key",<br>which is like an account for using the API.<br>APIs can often give you access to data that isn't publicly available.</small> | APIs don't always exist for the data you want! |

- We'll start by learning how to scrape; we'll discuss APIs at the end of the lecture.
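
Because a site owner may not want to be scraped, a common courtesy is to consult the site's `robots.txt` before requesting pages. Here is a minimal sketch using the standard library's `urllib.robotparser`. The rules, user agent name, and URLs below are invented for illustration. In practice, `RobotFileParser.set_url` and `read` would download the site's real file.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules, supplied as a list of lines rather than downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch reports whether the given user agent may request the URL.
allowed_home = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_home, allowed_private)
```

Passing this check does not settle the question on its own, since a site's Terms of Service may impose further limits.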

### What is HTML?

- HTML (HyperText Markup Language) is **the** basic building block of the web.

- It is a **markup language**, not a programming language.<br><small>Markup languages specify what something should _look like_, while programming languages specify what something should _calculate_ or _do_.</small>

- Specifically, it defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.

- We're only going to learn enough HTML to help us scrape information.<br><small>See [this tutorial](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for more details on HTML.</small>

### An example webpage

- For instance, here's the source code of a very basic webpage.

In [36]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>'''

- Here's what that webpage actually looks like:

In [5]:
HTML(html_string)

### The anatomy of HTML documents

- **HTML document**: The totality of markup that makes up a webpage.

<center>

<img src="imgs/webpage_anatomy.png" width=500>

</center>

- **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

<center><img src="imgs/dom_tree.png" width=500></center>

- **Why do we care about the DOM?** Extracting information out of an HTML document will involve **traversing** this tree.

- **HTML element**: An object in the DOM, such as a paragraph, header, or title.

- **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.<br><small>See the attached reference slides for examples of common tags.</small>

- **Attributes**: Some tags have **attributes**, which further specify how to display information.

```html
        <p style="color: red">Look at my red text!</p>
```
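
To make the tree structure concrete, here is a minimal sketch that walks an HTML string and records each element together with its depth in the DOM. It uses the standard library's `html.parser` rather than a scraping library. The class name `OutlineParser`, the `VOID_TAGS` set, and the shortened `sample_html` are illustrative choices.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"hr", "br", "img", "input", "meta", "link"}


class OutlineParser(HTMLParser):
    """Record (depth, tag) for every element encountered."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


sample_html = """
<html>
  <body>
    <div id="content">
      <h1>Heading here</h1>
      <p>My <em>second</em> paragraph</p>
      <hr>
    </div>
    <div id="nav">
      <ul>
        <li>item 1</li>
        <li>item 2</li>
      </ul>
    </div>
  </body>
</html>
"""

parser = OutlineParser()
parser.feed(sample_html)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the parent-child relationships in the DOM tree.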

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Useful tags to know

</div>

- Often, the information we're looking for is nestled in one of these tags:

|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the header|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| header(s) |
|`<img>`| an image |

- There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Example tags and attributes

</div>

- Tags can have **attributes**, which further specify how to display information on a webpage.

- For instance, `<img>` tags have `src` and `alt` attributes, among others:<br>

```html
        <img src="cool-visualization.png" alt="My box plot that I'm super proud of." width=500>
```

- Hyperlinks have `href` attributes: 

```html
        Click <a href="https://kengchichang.github.io/QSS20/">this link</a> to access past exams.
```

- The `<div>` tag is one of the more common tags. It defines a "section" of an HTML document, and is often used as a container for other HTML elements.<br><small>Think of `<div>`s like cells in Jupyter Notebooks.</small>
    
```html
        <div class="background">
          <h3>This is a heading</h3>
          <p>This is a paragraph.</p>
        </div>
```

- Often, the information we're looking for is stored in an attribute!<br><small>You can imagine a situation where we want to get the URL behind a button, for example.</small>

## Parsing HTML

---

### Beautiful Soup 🍜

- [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.<br><small>Remember, **parse** means to "extract meaning from a sequence of symbols".</small>

- **Warning**: Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.

### Example HTML document

- To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width=600></center>

- This is the DOM tree of the HTML document `html_string` we created earlier.

In [37]:
print(html_string)


<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>


In [7]:
HTML(html_string)

### Instantiating `BeautifulSoup` objects

- `bs4`'s `BeautifulSoup` class takes in a string or file-like object representing HTML and returns a **parsed** document.

In [33]:
# We also could have used:
# import bs4
# But, then we'd need to use bs4.BeautifulSoup every time.
from bs4 import BeautifulSoup

- Normally, we pass the result of a `GET` request to `BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [38]:
soup = BeautifulSoup(html_string, "html.parser")  # Name the parser explicitly so bs4 doesn't have to guess
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [39]:
type(soup)

bs4.BeautifulSoup

- `BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [40]:
print(soup.text) 




Heading here
My First paragraph
My second paragraph




item 1
item 2
item 3






### Finding elements in a BeautifulSoup object

- The two main methods you will use to extract information from a BeautifulSoup object are `find` and `find_all`.

- `soup.find(tag)` finds the **first** instance of a tag (the first one on the page), and returns just that tag.<br><small>It has several optional arguments, including some that involve defining `lambda` functions: **look at the documentation!**</small>

- `soup.find_all(tag)` will find **all** instances of a tag, and returns a **list** of tags.

- Remember: **`find` finds tags!**
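
One detail worth knowing before using these methods: when nothing matches, `find` returns `None` and `find_all` returns an empty list. Calling `.text` on a failed `find` therefore raises an `AttributeError`. A short sketch of guarding against that follows, using a tiny invented document and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Tiny invented document with list items but no table.
sample_soup = BeautifulSoup("<ul><li>item 1</li><li>item 2</li></ul>", "html.parser")

# find returns None when nothing matches, so check before using the result.
table = sample_soup.find("table")
table_text = table.text if table is not None else None

# find_all returns an empty list in the same situation.
rows = sample_soup.find_all("tr")
items = sample_soup.find_all("li")

print(table_text, len(rows), len(items))
```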

### Using `find`

- Let's try to extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width=600> ⬇️ <center><img src="imgs/dom_subtree_1.png" width=325></center>  </center> 

In [41]:
soup.find('div') 

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

- Let's try to find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [42]:
soup.find('div', attrs={'id': 'nav'}) 

<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>

- `find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [43]:
# The ul child is not at the top of the tree, but we can still find it.
soup.find('ul') 

<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>

### Using `find_all`

- `find_all` returns a list of all matching tags.

In [16]:
soup


<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [44]:
soup.find_all('div') 

[<div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>,
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>]

In [45]:
soup.find_all('li') 

[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]

- We often use the `find_all` method in conjunction with a `for`-loop or list comprehension, to perform some operation on every matching tag.

In [46]:
[x.text for x in soup.find_all('li')] 

['item 1', 'item 2', 'item 3']

### Node attributes

- The `text` attribute of a tag element gets the text between the opening and closing tags.

In [47]:
soup.find('p')

<p>My First paragraph</p>

In [48]:
soup.find('p').text

'My First paragraph'

- The `attrs` attribute of a tag element lists all of its attributes.

In [49]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

In [50]:
soup.find('div').text

'\nHeading here\nMy First paragraph\nMy second paragraph\n\n'

In [52]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

In [53]:
soup.find('div').attrs

{'id': 'content'}

- The `get` method of a tag element **gets the value of an attribute**.<br><small>`find` and `get` are easy to get confused, but you'll use them both a lot.</small>

In [54]:
soup.find('div').get('id')

'content'

- The `get` method must be called directly on the node that contains the attribute you're looking for.

In [25]:
soup


<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [26]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [27]:
soup.find('div').get('id')

'content'
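
To make the point about calling `get` on the right node concrete, here is a small sketch on a stand-in snippet (the HTML string and variable names are made up for illustration). It also contrasts `get` with square-bracket indexing, which behaves like dictionary indexing.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup('<div id="content"><p>My First paragraph</p></div>', 'html.parser')
div = demo_soup.find('div')
p = demo_soup.find('p')

# The <div> itself carries the id attribute, so get finds it.
div_id = div.get('id')

# The <p> has no id of its own, even though its parent does, so get returns None.
p_id = p.get('id')

# Like dict.get, a default can be supplied for missing attributes.
p_id_default = p.get('id', 'no id')

# Square-bracket access raises a KeyError instead of returning None.
try:
    p['id']
    raised = False
except KeyError:
    raised = True
```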

## Example: Scraping quotes 💬

---

### Example: Scraping quotes

- Navigate to [quotes.toscrape.com](https://quotes.toscrape.com).

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- **Goal**: Extract quotes, and relevant metadata, into a DataFrame.

- Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### Organizing our work

- Eventually, we will implement a single function, `make_quote_df`, which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of [quotes.toscrape.com](https://quotes.toscrape.com).

- Along the way, we'll implement several helper functions, with the goal of separating our logic: **each function should either request information, OR parse, but not both!**

- This makes it easier to debug and catch errors.

- It also avoids **unnecessary requests**.
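
One possible way to see why this separation avoids unnecessary requests is sketched below. The helper `make_cached_downloader` and the stand-in `fake_fetch` are hypothetical names, not part of the lecture's code; the idea is only that once downloading is its own function, its results can be stored and the parsing code can be re-run as often as needed without contacting the server again.

```python
def make_cached_downloader(fetch):
    '''Wrap a fetch(url) -> str function so each URL is requested at most once.'''
    cache = {}

    def download(url):
        # Only call the underlying fetch function the first time we see a URL.
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return download


# Stand-in fetcher so the sketch runs offline; in practice this could be
# something like `lambda url: requests.get(url).text`.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f'<html><body><p>contents of {url}</p></body></html>'


download = make_cached_downloader(fake_fetch)
first = download('https://quotes.toscrape.com/page/2')
second = download('https://quotes.toscrape.com/page/2')  # Served from the cache.
```

Here `fake_fetch` is called only once, even though `download` is called twice with the same URL.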

### Downloading a single page

- First, let's figure out how to download a single page from [quotes.toscrape.com](https://quotes.toscrape.com).

- The URLs seem to follow a consistent pattern, with the page number at the end:

```html
        https://quotes.toscrape.com/page/2
```

In [55]:
def download_page(i):
    url = f'https://quotes.toscrape.com/page/{i}'
    res = requests.get(url)
    return BeautifulSoup(res.text)

- Let's test our function on a single page, like Page 2.<br><small>There's nothing special about Page 2; we chose it arbitrarily.</small>

In [56]:
soup = download_page(2) 
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters

- Now that this works, later on, we can call `download_page(1)`, `download_page(2)`, `download_page(3)`, ..., `download_page(n)`.

### Parsing a single page

- Now, let's try and extract the relevant information out of the `soup` object for Page 2.

- **Open [quotes.toscrape.com/page/2](https://quotes.toscrape.com/page/2/) in Chrome, right click the page, and click "Inspect"!**<br><small>This will help us find where each quote is located in the HTML.</small>

In [57]:
divs = soup.find_all('div', class_='quote') 
# The above is a shortcut for the following, just for when the attribute key is class:
# divs = soup.find_all('div', attrs={'class': 'quote'})

In [59]:
divs[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself

- From this `<div>`, we can extract the quote, author name, author's URL, and tags.<br><small>Strategy: Figure out how to process one `<div>`, then put that logic in a function to use on other `<div>`s.</small>

In [61]:
# The quote.
divs[0].find('span', class_='text').text 

"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most important

In [62]:
# The author.
divs[0].find('small', class_='author').text 

'Marilyn Monroe'

In [64]:
# The URL for the author.
divs[0].find('a').get('href') 

'/author/Marilyn-Monroe'

In [65]:
# The quote's tags.
divs[0].find('meta', class_='keywords').get('content') 

'friends,heartbreak,inspirational,life,love,sisters'
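
The tags come back as one comma-separated string. If individual tags are needed later, one option (a sketch, not something the lecture's functions do) is to split the string with `str.split`. The example below reuses tag strings that appear on Page 2 and counts how often each tag occurs.

```python
from collections import Counter

# Tag strings as they appear in the 'content' attribute of three quotes on Page 2.
tag_strings = [
    'friends,heartbreak,inspirational,life,love,sisters',
    'courage,friends',
    'fantasy',
]

# Split one string into a list of individual tags.
first_tags = tag_strings[0].split(',')

# Count how often each tag appears across all three quotes.
tag_counts = Counter(tag for s in tag_strings for tag in s.split(','))
```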

### Parsing a single quote, and then a single page

- Let's implement a function that takes in a `<div>` corresponding to a single quote and returns a dictionary containing the quote's information.<br><small>Why use a dictionary? Passing `pd.DataFrame` a list of dictionaries is an easy way to create a DataFrame.</small>

In [67]:
def process_quote(div):
    quote = div.find('span', class_='text').text
    author = div.find('small', class_='author').text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', class_='keywords').get('content')
    return {'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags}

In [69]:
divs[4]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”</span>
<span>by <small class="author" itemprop="author">Dr. Seuss</small>
<a href="/author/Dr-Seuss">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="fantasy" itemprop="keywords"/>
<a class="tag" href="/tag/fantasy/page/1/">fantasy</a>
</div>
</div>

In [68]:
# Make sure everything here looks correct based on what's on the webpage!
process_quote(divs[4]) 

{'quote': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”',
 'author': 'Dr. Seuss',
 'author_url': 'https://quotes.toscrape.com/author/Dr-Seuss',
 'tags': 'fantasy'}
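
`process_quote` builds the author's URL by concatenating two strings, which works here because every `href` on this site starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a link against a base URL and also copes with links that are already absolute. The name `BASE_URL` and the `example.com` link are introduced only for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://quotes.toscrape.com'

# A relative href, like the one extracted from the quote's <a> tag.
author_url = urljoin(BASE_URL, '/author/Dr-Seuss')

# An href that is already absolute is returned unchanged.
external_url = urljoin(BASE_URL, 'https://example.com/about')
```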

- Now, we can implement a function that takes in a **list** of `<div>`s, calls `process_quote` on each `<div>` in the list, and returns a **DataFrame**.

In [70]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [39]:
process_page(divs)

| | quote | author | author_url | tags |
|---|---|---|---|---|
| 0 | “This life is what you make it. No matter what... | Marilyn Monroe | https://quotes.toscrape.com/author/Marilyn-Monroe | friends,heartbreak,inspirational,life,love,sis... |
| 1 | “It takes a great deal of bravery to stand up ... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | courage,friends |
| 2 | “If you can't explain it to a six year old, yo... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | simplicity,understand |
| ... | ... | ... | ... | ... |
| 7 | “It is not a lack of love, but a lack of frien... | Friedrich Nietzsche | https://quotes.toscrape.com/author/Friedrich-N... | friendship,lack-of-friendship,lack-of-love,lov... |
| 8 | “Good friends, good books, and a sleepy consci... | Mark Twain | https://quotes.toscrape.com/author/Mark-Twain | books,contentment,friends,friendship,life |
| 9 | “Life is what happens to us while we are makin... | Allen Saunders | https://quotes.toscrape.com/author/Allen-Saunders | fate,life,misattributed-john-lennon,planning,p... |
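
To see in isolation why a list of dictionaries is a convenient input for `pd.DataFrame`, here is a minimal sketch with hand-written rows (the rows are abbreviated stand-ins, not scraped output). Each dictionary becomes one row, the keys become column names, and a key missing from one dictionary shows up as a missing value rather than an error.

```python
import pandas as pd

rows = [
    {'author': 'Dr. Seuss', 'tags': 'fantasy'},
    {'author': 'J.K. Rowling', 'tags': 'courage,friends'},
    {'author': 'Mark Twain'},  # No 'tags' key: pandas fills in NaN.
]

df = pd.DataFrame(rows)
```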


### Putting it all together

- Now, we can implement `make_quote_df`.

In [71]:
def make_quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.''' # This is called a docstring!
    dfs = []
    for i in range(1, n+1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', class_='quote')
        df = process_page(divs)
        # Append DataFrame to dfs.
        dfs.append(df)
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)
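
When looping over many pages, it is often considered polite to pause briefly between requests. The lecture's `make_quote_df` does not do this; the sketch below is one hypothetical way to structure such a loop. The function `scrape_pages`, its `delay` parameter, and the stand-in `fake_download` and `fake_parse` functions are all assumptions made for illustration, and the stand-ins let the sketch run without any network access.

```python
import time


def scrape_pages(n, download, parse, delay=0.5):
    '''Collect parsed rows from pages 1 through n, pausing between requests.'''
    rows = []
    for i in range(1, n + 1):
        page = download(i)       # Request step only.
        rows.extend(parse(page)) # Parse step only.
        if i < n:
            # Wait before the next request; there is no need to wait after the last one.
            time.sleep(delay)
    return rows


# Stand-ins so the sketch runs offline. In the lecture's setting, download_page
# and a function that returns a list of process_quote dictionaries could play these roles.
def fake_download(i):
    return f'page {i}'

def fake_parse(page):
    return [{'source': page}]


rows = scrape_pages(3, fake_download, fake_parse, delay=0)
```

Collecting plain dictionaries and building a single DataFrame at the end is an alternative to concatenating one DataFrame per page; both approaches give the same table.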

In [72]:
quotes = make_quote_df(3)
quotes.head()

| | quote | author | author_url | tags |
|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | https://quotes.toscrape.com/author/Jane-Austen | aliteracy,books,classic,humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | https://quotes.toscrape.com/author/Marilyn-Monroe | be-yourself,inspirational |


- Now, `quotes` is a DataFrame, like any other!

In [42]:
quotes['author'].value_counts().iloc[:10].sort_values().plot(kind='barh')

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Summary of our steps

1. Implement `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.

2. Implement `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a dictionary containing all of the relevant information for that quote.

3. Implement `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

4. Implement `make_quote_df(n)`.