# Data Science From Scratch Notes

## Chapter 9: Getting Data

## stdin and stdout

We pipe data through Python scripts at the command line using `sys.stdin` and `sys.stdout`. Here's a script that counts the lines it receives and then writes out the count:

In [1]:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)

0


On a Unix-like system, use:

In [2]:
cat SomeFile.txt | python line_count.py

cat: SomeFile.txt: No such file or directory
/Users/nathansquan/Documents/DataScience/DSFS/bin/python: can't open file '/Users/nathansquan/Documents/DataScience/DSFS/line_count.py': [Errno 2] No such file or directory


Similarly, here's a script that counts the words in its input and writes out the most common ones:
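The word-counting script itself is not included in these notes. Here is a minimal sketch of its core logic. The function name `most_common_words` and its parameters are assumptions made for illustration, and the sample lines are made up:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_common_words(lines: Iterable[str], num_words: int) -> List[Tuple[str, int]]:
    """Count lowercased, whitespace-separated words and return the top num_words."""
    counter = Counter(word.lower()
                      for line in lines
                      for word in line.strip().split())
    return counter.most_common(num_words)


# Made-up input lines standing in for whatever is piped to the script
sample_lines = ["a b a\n", "B c a\n"]
for word, count in most_common_words(sample_lines, 2):
    print(f"{count}\t{word}")
```

In an actual command-line script, `lines` would be `sys.stdin` and `num_words` would probably come from `sys.argv[1]`.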

## Reading Files

### The Basics of Text Files

We obtain a *file object* using `open`:

In [4]:
# 'r' means read-only, it's assumed if you leave it out
file_for_reading = open('reading_file.txt', 'r')

# 'w' is write -- will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' is append -- for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

# close files when done
file_for_writing.close()

FileNotFoundError: [Errno 2] No such file or directory: 'reading_file.txt'

Since it's easy to forget to close our files, we should always use them in a `with` block, at the end of which they will be closed automatically:

In [5]:
with open(filename) as f:
    data = function_that_gets_data_from(f)

# at this point f has already been closed, so don't try to use it
process(data)

NameError: name 'filename' is not defined
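The cell above uses placeholder names, so it cannot run as written. As a self-contained sketch of the same pattern, the following writes a throwaway file in a temporary directory (the file name and its contents are made up) and checks that the file object is closed once the `with` block ends:

```python
import os
import tempfile

# Work in a temporary directory so the example needs no existing files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    with open(path, "w") as f:
        f.write("first line\nsecond line\n")

    with open(path) as f:
        lines = f.readlines()  # read everything while the file is still open

    # The with block has ended, so the file object is already closed
    file_was_closed = f.closed

print(file_was_closed)
print(len(lines))
```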

If we need to read a whole text file, we can iterate over the lines of the file using `for`:

In [6]:
import re

starts_with_hash = 0

with open('input.txt') as f:
    for line in f: # look at each line in the file
        if re.match("^#", line): # use a regex to see if it starts with '#'
            starts_with_hash += 1 # if it does, add 1 to the count

FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'

Every line we get this way ends in a newline character, so we'll often want to `strip` it before doing anything with it:

In [7]:
def get_domain(email_address: str) -> str:
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# a couple of tests
assert get_domain('joelgrus@gmail.com') == 'gmail.com'
assert get_domain('joel@m.datasciencester.com') == 'm.datasciencester.com'

from collections import Counter

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)

FileNotFoundError: [Errno 2] No such file or directory: 'email_addresses.txt'

## Delimited Files

Delimited files can be complicated: the fields themselves may contain commas, tabs, or newlines. We should never try to parse them ourselves. Instead, we will use Python's `csv` module (or the pandas library).
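To see why hand-rolled parsing goes wrong, here is a small sketch comparing a naive `split(",")` with `csv.reader` on a single made-up record whose middle field contains a comma:

```python
import csv
import io

# One made-up record; the middle field contains a comma, so it is quoted
raw = 'AAPL,"Apple, Inc.",90.91\n'

naive_fields = raw.strip().split(",")            # splits inside the quotes
csv_fields = next(csv.reader(io.StringIO(raw)))  # respects the quotes

print(naive_fields)  # ['AAPL', '"Apple', ' Inc."', '90.91']
print(csv_fields)    # ['AAPL', 'Apple, Inc.', '90.91']
```

The naive version yields four fields instead of three and leaves stray quote characters in the data.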

If the file has no headers (which means we probably want each row as a `list`), we can use `csv.reader` to iterate over the rows, each of which will be an appropriately split list:

In [8]:
import csv

with open('tab_delimited_stock_prices.txt') as f:
    tab_reader = csv.reader(f, delimiter='\t')
    for row in tab_reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)  # `process` is a placeholder for whatever we do with each row

FileNotFoundError: [Errno 2] No such file or directory: 'tab_delimited_stock_prices.txt'

If the file has headers, we can either skip the header row with an initial call to `next(reader)`, or get each row as a `dict` (with headers as keys) by using `csv.DictReader`:

In [10]:
with open('colon_delimited_stock_prices.txt') as f:
    colon_reader = csv.DictReader(f, delimiter=':')
    for dict_row in colon_reader:
        date = dict_row["date"]
        symbol = dict_row["symbol"]
        closing_price = float(dict_row["closing_price"])
        process(date, symbol, closing_price)

FileNotFoundError: [Errno 2] No such file or directory: 'colon_delimited_stock_prices.txt'

Even if the file doesn't have headers, we can still use `DictReader` by passing it the keys as a `fieldnames` parameter.
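A short sketch of the `fieldnames` approach, using an in-memory `io.StringIO` object in place of a real file. The rows and the key names are made up for illustration:

```python
import csv
import io

# Stand-in for a header-less, tab-delimited file (made-up rows)
headerless = io.StringIO("6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n")

reader = csv.DictReader(headerless,
                        fieldnames=["date", "symbol", "closing_price"],
                        delimiter="\t")
rows = list(reader)

print(rows[0]["symbol"])                # AAPL
print(float(rows[1]["closing_price"]))  # 41.68
```

Because the keys are supplied explicitly, the first line of the input is treated as data rather than as a header.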

We can similarly write out delimited data using `csv.writer`:

In [11]:
todays_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5}

# newline='' lets the csv module control line endings itself
with open('comma_delimited_stock_prices.txt', 'w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',')
    for stock, price in todays_prices.items():
        csv_writer.writerow([stock, price])

## Scraping the Web

### HTML and the Parsing Thereof

To get data out of HTML, we use the Beautiful Soup library, which builds a tree out of the various elements on a web page and provides a simple interface for accessing them. We also use the Requests library to easily make HTTP requests.

We'll also use the `html5lib` parser.

To use Beautiful Soup, we pass a string containing HTML into the `BeautifulSoup` function. In our examples, this string will be the result of a call to `requests.get`:

In [12]:
from bs4 import BeautifulSoup
import requests

url = ("https://raw.githubusercontent.com/"
       "joelgrus/data/master/getting-data.html")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

We typically work with `Tag` objects, which correspond to the tags representing the structure of an HTML page.

For example, to find the first `<p>` tag and its contents, we can use:

In [14]:
first_paragraph = soup.find('p') # or just soup.p
first_paragraph

<p id="p1">This is the first paragraph.</p>

We can get the text contents of a `Tag` using its `text` property:

In [15]:
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

print(first_paragraph_text)
print(first_paragraph_words)

This is the first paragraph.
['This', 'is', 'the', 'first', 'paragraph.']


Extract a tag's attributes by treating it like a `dict`:

In [16]:
first_paragraph_id = soup.p['id'] # raises KeyError if no 'id'
first_paragraph_id2 = soup.p.get('id') # returns None if no 'id'

print(first_paragraph_id)
print(first_paragraph_id2)

p1
p1


Get multiple tags at once:

In [17]:
all_paragraphs = soup.find_all('p') # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

print(all_paragraphs)
print(paragraphs_with_ids)

[<p id="p1">This is the first paragraph.</p>, <p class="important">This is the second paragraph.</p>]
[<p id="p1">This is the first paragraph.</p>]


Find tags with a specific `class`:

In [21]:
important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                        if 'important' in p.get('class', [])]

print(important_paragraphs)
print(important_paragraphs2)
print(important_paragraphs3)

[<p class="important">This is the second paragraph.</p>]
[<p class="important">This is the second paragraph.</p>]
[<p class="important">This is the second paragraph.</p>]


We can combine these methods to implement more elaborate logic. For example, if we want to find every `<span>` element that is contained inside a `<div>` element:

In [23]:
# Warning: will return the same <span> multiple times
# if it sits inside multiple <div>s.
# Be more clever if that's the case
spans_inside_divs = [span
                     for div in soup('div') # for each <div> on the page
                     for span in div('span')] # find each <span> inside it

print(spans_inside_divs)

[<span id="name">Joel</span>, <span id="twitter">@joelgrus</span>, <span id="email">joelgrus-at-gmail</span>]


### Example: Keeping Tabs on Congress

Goal: Quantify what Congress is saying on the topic of data science. Find all the representatives who have press releases about "data".

We'll collect all of the URLs linked to from *https://www.house.gov/representatives*.

All the links to the websites look like:

In [24]:
<td>
    <a href="https://jayapal.house.gov">Jayapal, Pramila</a>
</td>


Let's first collect all the URLs linked to from this page:

In [26]:
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href']
            for a in soup('a')
            if a.has_attr('href')]

print(len(all_urls)) # 967, way too many

967


This returns too many URLs. We want ones that start with either *http://* or *https://*, have some kind of name, and end with either *.house.gov* or *.house.gov/*. This is a good place to use a regular expression:

In [28]:
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# write tests
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "http://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

# and now apply
good_urls = [url for url in all_urls if re.match(regex, url)]

print(len(good_urls)) # still 878

878


There are still way too many since there are only 435 representatives. Looking at the list, there are a lot of duplicates. We'll use `set` to get rid of them:

In [29]:
good_urls = list(set(good_urls))
print(len(good_urls)) # 439

439


When we look at the sites, most of them have a link to press releases:

In [35]:
html = requests.get('https://jayapal.house.gov/').text
soup = BeautifulSoup(html, 'html5lib')

# use a set because the links might appear multiple times
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links)

{'https://jayapal.house.gov/category/news/', 'https://jayapal.house.gov/category/press-releases/'}


Some sites might have relative links. This means we need to remember the originating site:

In [40]:
from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href']
                for a in soup('a')
                if 'press releases' in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

https://chrissmith.house.gov/: set()
https://blakemoore.house.gov: {'/media/press-releases'}
https://slotkin.house.gov/: {'/media/press-releases'}
https://cardenas.house.gov: {'https://cardenas.house.gov/media-center/press-releases'}
https://gohmert.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=1954'}
https://krishnamoorthi.house.gov: {'/media/press-releases'}
https://schweikert.house.gov/: {'/media-center/press-releases'}
https://gallagher.house.gov: {'/media/press-releases'}
https://timryan.house.gov/: {'/media/press-releases'}
https://axne.house.gov/: {'/media/press-releases'}
https://chu.house.gov/: {'/media-center/press-releases'}
https://bergman.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://vandrew.house.gov: {'/media/press-releases'}
https://danbishop.house.gov: {'/media/press-releases'}
https://horsford.house.gov: {'/media/press-releases'}
https://mullin.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://laturner.house.gov: {'/media/pr

https://debbiedingell.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27', '/news/'}
https://murphy.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://larson.house.gov/: {'/media-center/press-releases'}
https://frankel.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://bustos.house.gov: {'https://bustos.house.gov/category/press-release/'}
https://hartzler.house.gov/: {'/media-center/press-releases'}
https://lahood.house.gov/: {'/press-releases'}
https://jayapal.house.gov: {'https://jayapal.house.gov/category/news/', 'https://jayapal.house.gov/category/press-releases/'}
https://palmer.house.gov/: {'/media-center/press-releases'}
https://kim.house.gov/: {'/media/press-releases'}
https://upton.house.gov: {'/News/DocumentQuery.aspx?DocumentTypeID=1828'}
https://susielee.house.gov: {'/media/press-releases'}
https://cawthorn.house.gov: {'/media/press-releases'}
https://jacksonlee.house.gov/: {'/media-center/press-releases'}
https://norcross.house.gov: set(

https://gonzalez.house.gov: {'/media/press-releases'}
https://latta.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=1456'}
https://bilirakis.house.gov/: {'/media/press-releases'}
https://williams.house.gov: {'/media-center/press-releases'}
https://barrymoore.house.gov: {'/media/press-releases'}
https://dustyjohnson.house.gov/: {'/media/press-releases'}
https://costa.house.gov/: {'/media-center/press-releases'}
https://pocan.house.gov: {'/media-center/press-releases'}
https://mrvan.house.gov: {'/media/press-releases'}
https://barragan.house.gov: set()
https://youngkim.house.gov: {'/media/press-releases'}
https://mcclain.house.gov: {'/media/press-releases'}
https://davids.house.gov/: {'/media/press-releases'}
https://kaptur.house.gov/: {'/media-center/press-releases'}
https://kilmer.house.gov: {'https://kilmer.house.gov/news/press-releases'}
https://harder.house.gov/: {'/media/press-releases'}
https://phillips.house.gov/: {'/media/press-releases'}
https://pallone.house.gov: {'/medi

https://delauro.house.gov/: {'/media-center/press-releases'}
https://crow.house.gov/: {'/media/press-releases'}
https://trentkelly.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://carey.house.gov: {'/media/press-releases'}
https://vela.house.gov: {'/press-release'}
https://gooden.house.gov: {'/press-releases'}
https://grothman.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://pingree.house.gov/: set()
https://evans.house.gov/: {'/media-center/press-releases'}
https://sarbanes.house.gov/: {'/media-center/press-releases'}
https://bonamici.house.gov: {'/media/press-releases'}
https://cartwright.house.gov: {'/news/documentquery.aspx?DocumentTypeID=2442'}
https://roy.house.gov: {'/media/press-releases'}
https://sewell.house.gov/: {'/frontpage?qt-home_page_tabs=0#qt-home_page_tabs', '/media-center/press-releases'}
https://kevinbrady.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=2657'}
https://mcmorris.house.gov/: set()
https://lofgren.house.gov/: {'/

Now let's find out which congresspeople have press releases mentioning "data". Let's write a function that checks whether a page of press releases mentions any given term. We'll use the `<p>` tag to find keywords:

In [41]:
def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]
    
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

Write a quick test:

In [43]:
text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, "twitter") # is inside a <p>
assert not paragraph_mentions(text, "facebook") # is not inside a <p>

Find the relevant congresspeople and print the URLs of their sites:

In [44]:
for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        url = f"{house_url}/{pr_link}"
        text = requests.get(url).text
        
        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break # done with this house_url

https://mullin.house.gov
https://langevin.house.gov
https://cicilline.house.gov/
https://mikejohnson.house.gov
https://khanna.house.gov
https://guest.house.gov
https://spanberger.house.gov
https://anthonybrown.house.gov
https://reed.house.gov/
https://cawthorn.house.gov
https://jhb.house.gov/
https://simpson.house.gov
https://marymiller.house.gov
https://davids.house.gov/
https://lesko.house.gov
https://meijer.house.gov
https://delbene.house.gov
https://huizenga.house.gov/
https://crist.house.gov
https://gosar.house.gov/
https://pressley.house.gov
https://carbajal.house.gov


Notice that most of the press release pages are paginated. This means that we only retrieved the few most recent press releases for each congressperson. A more thorough solution would have iterated over the pages and retrieved the full text of each press release.
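A separate caveat: the loop above builds each URL with an f-string (`f"{house_url}/{pr_link}"`). That produces a doubled slash when `house_url` ends in `/` and `pr_link` starts with one, and it produces a malformed URL when `pr_link` is already absolute. One possible fix, sketched here with the standard library's `urllib.parse.urljoin` and a few site and link pairs copied from the output earlier in this section:

```python
from urllib.parse import urljoin

# (originating site, link found on that site), copied from the earlier output
examples = [
    ("https://slotkin.house.gov/", "/media/press-releases"),
    ("https://blakemoore.house.gov", "/media/press-releases"),
    ("https://cardenas.house.gov",
     "https://cardenas.house.gov/media-center/press-releases"),
]

# urljoin resolves a relative link against its base and leaves absolute links alone
resolved = [urljoin(house_url, pr_link) for house_url, pr_link in examples]

for url in resolved:
    print(url)
```

This is a sketch of an alternative rather than the code that produced the output above, so swapping it in could change which sites are reported.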




## Using APIs

Some websites provide *application programming interfaces* (APIs) to explicitly request data in a structured format. No need to scrape!

### JSON and XML

HTTP is a protocol for transferring *text*. Thus, the data requested through a web API needs to be *serialized* into a string format. Often this serialization uses JavaScript Object Notation (JSON).

We can parse JSON using Python's `json` module. We will use its `loads` function to deserialize a string representing a JSON object into a Python object:

In [46]:
import json

serialized = """{"title" : "Data Science Book",
                 "author" : "Joel Grus",
                 "publicationYear" : 2019,
                 "topics" : ["data", "science", "data science"]}"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
assert deserialized["publicationYear"] == 2019
assert "data science" in deserialized["topics"]
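Going the other way, `json.dumps` serializes a Python object back into a JSON string. A minimal round-trip sketch using the same book data:

```python
import json

book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2019,
        "topics": ["data", "science", "data science"]}

serialized_again = json.dumps(book)           # dict -> JSON string
round_tripped = json.loads(serialized_again)  # JSON string -> dict

print(type(serialized_again).__name__)  # str
print(round_tripped == book)            # True
```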

Sometimes an API provider offers responses only in XML. We can use Beautiful Soup to get data from XML in much the same way we used it to get data from HTML.
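As an illustration, here is a small sketch that parses an XML version of the book record from the JSON example. Note that it uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, so that it runs without extra dependencies. The XML layout and tag names are assumptions made for this example:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the book record used in the JSON example
xml_text = """<Book>
  <Title>Data Science Book</Title>
  <Author>Joel Grus</Author>
  <PublicationYear>2019</PublicationYear>
  <Topics>
    <Topic>data</Topic>
    <Topic>science</Topic>
    <Topic>data science</Topic>
  </Topics>
</Book>"""

root = ET.fromstring(xml_text)                         # root is the <Book> element
title = root.findtext("Title")                         # text of the first <Title> child
year = int(root.findtext("PublicationYear"))           # XML text is always a string
topics = [topic.text for topic in root.iter("Topic")]  # every <Topic> at any depth

print(title, year, topics)
```

Unlike JSON, XML carries no type information for values, which is why the year has to be converted with `int` explicitly.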


### Using an Unauthenticated API

Let's look at GitHub's API, with which we can do some simple things unauthenticated:

In [48]:
import requests, json

github_user = "joelgrus"
endpoint = f"https://api.github.com/users/{github_user}/repos"

repos = json.loads(requests.get(endpoint).text)

At this point, `repos` is a `list` of Python `dict`s, each representing a public repository in the GitHub account. We can use this to find the months and days of the week on which the account tends to create repositories.

In [50]:
from collections import Counter
from dateutil.parser import parse

dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

print(weekday_counts)

Counter({2: 8, 4: 6, 5: 5, 1: 5, 6: 4, 3: 2})
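The keys in this `Counter` are the integers returned by `datetime.weekday()`, which numbers days from 0 (Monday) to 6 (Sunday). A small sketch that relabels the counts shown above with day names (the `WEEKDAY_NAMES` list is a helper introduced here for illustration):

```python
from collections import Counter

# Counts copied from the output above
weekday_counts = Counter({2: 8, 4: 6, 5: 5, 1: 5, 6: 4, 3: 2})

# datetime.weekday() uses 0 for Monday through 6 for Sunday
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

named_counts = {WEEKDAY_NAMES[day]: count
                for day, count in sorted(weekday_counts.items())}
print(named_counts)
```

Days with no repositories (here, Monday) simply do not appear in the result.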


Similarly, we can get the languages of the last five repos:

In [51]:
last_5_repositories = sorted(repos,
                             key=lambda r: r["pushed_at"],
                             reverse=True)[:5]
last_5_languages = [repo["language"]
                    for repo in last_5_repositories]
print(last_5_languages)

['Python', 'JavaScript', 'Python', 'Python', 'Python']


## Example: Using the Twitter APIs

