# Scrape Stories from Story Center

This notebook scrapes English-language COVID-19 stories from the Story Center website: https://www.storycenter.org/covid-stories.

The stories are stored in dictionaries with the following fields:

- `link`: The link of the story page (string)
- `title`: The title of the story page (string)
- `summary`: A short summary of the story displayed on the main page (string)
- `date_published`: The date when the story was published (string)
- `date_modified`: The date when the story was last modified (string)
- `author`: The author of the story (string)
- `location`: The location of the author (string)
- `story_text`: The text of the story (string)

The fetched story pages are cached on disk as `.html` files in the folder `DATA_DIR`. The folder is created automatically if it does not exist.
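As a rough sketch of the record layout listed above, the fields can be written down once and used to check a scraped dictionary. The names `STORY_FIELDS`, `StoryDoc` and `missing_fields` are hypothetical helpers introduced here for illustration; the notebook itself works with plain dictionaries.

```python
from typing import TypedDict

# Field names as listed above (all values are strings)
STORY_FIELDS = (
    "link", "title", "summary", "date_published",
    "date_modified", "author", "location", "story_text",
)

# Illustrative type for one scraped story (hypothetical, for type checkers only)
StoryDoc = TypedDict("StoryDoc", {name: str for name in STORY_FIELDS})


def missing_fields(doc):
    """Return the fields that are absent from doc or whose value is not a string."""
    return [name for name in STORY_FIELDS if not isinstance(doc.get(name), str)]


# Example: a partially filled record
partial = {"link": "https://www.storycenter.org/covid-stories", "title": "Example"}
print(missing_fields(partial))
```

A check like this could be run on each scraped record before further processing, so that a changed page layout shows up as a missing field rather than as a later error.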

In [1]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os

In [2]:
# Set URL of the main page containing the links to the story pages

URL = "https://www.storycenter.org/covid-stories"

# Set directory for storing web pages

DATA_DIR = "../storycenter.org/scraped/"

In [3]:
# Function to create the directory for saving web pages if it does not exist yet

def check_dir(DATA_DIR):
    if not os.path.isdir(DATA_DIR):
        os.makedirs(DATA_DIR)
        print(f"Created saving directory at {DATA_DIR}")

In [4]:
# Function to load a web page from disk if it is cached; otherwise fetch it and save it as .html to disk.
# The page is always returned as a string.

def load_web_page(URL, file_name, DATA_DIR):
    check_dir(DATA_DIR)
    if os.path.exists(file_name):
        with open(file_name, "r", encoding="utf-8") as file:
            page = file.read()
        print(f"Loaded web page from {file_name}")
    else:
        # Send a browser-like User-Agent header with the request
        req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as response:
            page = response.read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as file:
            file.write(page)
        print(f"Saved web page to {file_name}")
    return page
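
The `file_name` argument is built from the story URL by the caller. A minimal sketch of one way to derive it, assuming the last non-empty segment of the URL path is a usable file name; `cache_path_for` is a hypothetical helper and not part of the notebook:

```python
import os
from urllib.parse import urlparse


def cache_path_for(url, data_dir):
    """Build the cache file path from the last non-empty segment of the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError(f"URL has no path segment to use as a file name: {url}")
    return os.path.join(data_dir, segments[-1] + ".html")


print(cache_path_for(
    "https://www.storycenter.org/covid-stories-1//thirty-three-days",
    "../storycenter.org/scraped/",
))
```

Unlike splitting the whole URL on `"//"`, this does not depend on the story links containing a double slash.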

In [5]:
# Function to extract links to the story pages from main page

def extract_hrefs_from_url(URL):
    main_page = BeautifulSoup(urlopen(URL).read(), "html.parser")
    
    hrefs = []

    for instance in main_page.find_all("a", attrs = {"class": "summary-thumbnail-container sqs-gallery-image-container"}):
        new_link = instance.get("href")
        hrefs.append("https://www.storycenter.org" + new_link)
        print("Extracted link: {}".format(hrefs[-1]))
    
    return hrefs
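
The link extraction can be tried without network access. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs offline; `StoryLinkParser` and `SAMPLE_HTML` are made up for illustration, and the sample markup only assumes that story links are `<a>` tags carrying both thumbnail classes used above.

```python
from html.parser import HTMLParser

BASE_URL = "https://www.storycenter.org"
THUMBNAIL_CLASSES = {"summary-thumbnail-container", "sqs-gallery-image-container"}


class StoryLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that carry all thumbnail classes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        # Keep the link only if it has an href and every expected class
        if href and THUMBNAIL_CLASSES <= classes:
            self.hrefs.append(BASE_URL + href)


# Hypothetical markup, not copied from the real page
SAMPLE_HTML = """
<a class="summary-thumbnail-container sqs-gallery-image-container"
   href="/covid-stories-1//thirty-three-days"></a>
<a class="nav-link" href="/about"></a>
"""

parser = StoryLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.hrefs)
```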

In [6]:
# Function to extract metadata and text from a story page

def extract_text_from_url(URL):
    # Cache file name: DATA_DIR + the part of the URL after the last "//" + ".html"
    new_file_name = DATA_DIR + URL.split("//")[-1] + ".html"
    new_page = BeautifulSoup(load_web_page(URL, new_file_name, DATA_DIR), "html.parser")
    # Keep only the part of the page title before the first " —"
    # (the whole title is kept if the separator is missing)
    new_title = new_page.title.string.split(" —")[0]
    # Summary and dates are read from the page's <meta itemprop="..."> tags
    new_summary = new_page.find("meta", attrs = {"itemprop": "description"}).get("content")
    new_date_pub = new_page.find("meta", attrs = {"itemprop": "datePublished"}).get("content")
    new_date_mod = new_page.find("meta", attrs = {"itemprop": "dateModified"}).get("content")
    # Collect the text fragments of the main layout <div> and replace non-breaking spaces
    new_inline_text = new_page.find("div", attrs = {"class": "sqs-layout sqs-grid-12 columns-12"})
    new_stripped_text = [text.replace("\xa0", " ") for text in new_inline_text.stripped_strings]
    # The first fragment is the byline: the first word is skipped, the next two words are
    # taken as the author and the rest as the location (this assumes a two-word author name)
    new_first_line = new_stripped_text[0].split(" ")
    new_author = "".join([s + " " for s in new_first_line[1:3]])
    new_location = "".join([s + " " for s in new_first_line[3:]])
    # The remaining fragments form the story text
    new_story_text = "\n".join(new_stripped_text[1:])
    new_doc = {
        "link": URL,
        "title": new_title,
        "summary": new_summary,
        "date_published": new_date_pub,
        "date_modified": new_date_mod,
        "author": new_author,
        "location": new_location,
        "story_text": new_story_text
    }
    print("Extracted text from: {}".format(URL))

    return new_doc
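
Splitting the byline by word position keeps the comma after the author and a trailing space, and it breaks for author names that are not exactly two words. A minimal alternative sketch, assuming the byline has the form "By <author>, <location>" (the leading "By" is a guess based on the first word being skipped above); `parse_byline` is a hypothetical helper:

```python
def parse_byline(line):
    """Split a byline of the assumed form 'By <author>, <location>' into (author, location)."""
    text = line.strip()
    # Drop the assumed leading "By " if present
    if text.lower().startswith("by "):
        text = text[3:]
    # Everything before the first comma is the author, the rest is the location
    author, _, location = text.partition(",")
    return author.strip(), location.strip()


print(parse_byline("By Alfreda Harris, Flynt, Michigan, U.S."))
```

Splitting on the first comma handles names of any length, but it would misread an author name that itself contains a comma.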

In [7]:
# Wrapper to extract texts for a list of story pages

def extract_texts_from_url_list(URL_list):
    docs = []
    for URL in URL_list:
        docs.append(extract_text_from_url(URL))
    
    print("Done")
    
    return docs

In [8]:
# Function to print doc

def print_doc(doc):
    for field in doc.keys():
        if isinstance(doc[field], list):
            print(field + ": " + "".join([s + "\n" + "\t" for s in doc[field]]) + "\n")
        else:
            print(field + ": " + doc[field] + "\n")
        
    return

In [9]:
# Extract links from main page

hrefs = extract_hrefs_from_url(URL)

Extracted link: https://www.storycenter.org/covid-stories-1//thirty-three-days
Extracted link: https://www.storycenter.org/covid-stories-1//spains-rainbows-of-hope
Extracted link: https://www.storycenter.org/covid-stories-1//funeral-for-our-dead-beliefs
Extracted link: https://www.storycenter.org/covid-stories-1//diary-of-a-queer-woman-during-covid-19-pandemic
Extracted link: https://www.storycenter.org/covid-stories-1//the-year-of-the-rat
Extracted link: https://www.storycenter.org/covid-stories-1//finding-hope-in-the-little-things
Extracted link: https://www.storycenter.org/covid-stories-1//a-ma
Extracted link: https://www.storycenter.org/covid-stories-1//broad-beans-on-the-wall
Extracted link: https://www.storycenter.org/covid-stories-1//what-is-it-like-to-be-you
Extracted link: https://www.storycenter.org/covid-stories-1//dear-coronavirus
Extracted link: https://www.storycenter.org/covid-stories-1//ive-found-my-marbles
Extracted link: https://www.storycenter.org/covid-stories-1//go

In [10]:
# Extract texts from links

docs = extract_texts_from_url_list(hrefs)

Loaded web page from ../storycenter.org/scraped/thirty-three-days.html
Extracted text from: https://www.storycenter.org/covid-stories-1//thirty-three-days
Loaded web page from ../storycenter.org/scraped/spains-rainbows-of-hope.html
Extracted text from: https://www.storycenter.org/covid-stories-1//spains-rainbows-of-hope
Loaded web page from ../storycenter.org/scraped/funeral-for-our-dead-beliefs.html
Extracted text from: https://www.storycenter.org/covid-stories-1//funeral-for-our-dead-beliefs
Loaded web page from ../storycenter.org/scraped/diary-of-a-queer-woman-during-covid-19-pandemic.html
Extracted text from: https://www.storycenter.org/covid-stories-1//diary-of-a-queer-woman-during-covid-19-pandemic
Loaded web page from ../storycenter.org/scraped/the-year-of-the-rat.html
Extracted text from: https://www.storycenter.org/covid-stories-1//the-year-of-the-rat
Loaded web page from ../storycenter.org/scraped/finding-hope-in-the-little-things.html
Extracted text from: https://www.storyce

In [11]:
# Show example doc

print_doc(docs[0])

link: https://www.storycenter.org/covid-stories-1//thirty-three-days

title: Thirty-Three Days

summary: Our baby sister, Sherill, has been rushed to the hospital. As her legal guardian, I immediately think to go there. Debbie reminds me, “Freda, we can’t go there.” Right, no one is allowed! The hospital is restricting visitors.

date_published: 2020-06-03T18:25:13-0700

date_modified: 2021-01-21T14:45:07-0800

author: Alfreda Harris, 

location: Flynt, Michigan, U.S. 

story_text: I am sheltering in place alone, but not lonely. I miss my Mama. She passed on Friday, January 24, 2020, after five weeks in the hospital. Besides being my mother, she was my housemate and best friend. The eldest of her six children (two boys and four girls), I am her next of kin. I am the one everyone looked to when it came to making decisions. I am the facilitator and advocate on behalf of our family. Now, walking through the house we shared and where I grew up, her presence is all around me. I hear her voi
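
The `date_published` and `date_modified` fields are kept as strings. If they are needed as timestamps, a minimal sketch for parsing the format shown in the example above is given below; `parse_story_date` is a hypothetical helper, and it assumes every page uses this same format.

```python
from datetime import datetime, timezone


def parse_story_date(value):
    """Parse a timestamp such as '2020-06-03T18:25:13-0700' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


published = parse_story_date("2020-06-03T18:25:13-0700")
modified = parse_story_date("2021-01-21T14:45:07-0800")

# Convert to UTC for comparison across time zone offsets
print(published.astimezone(timezone.utc).isoformat())
print(modified > published)
```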