# A - Data Collection & Scraping - Scraping the DayZero Project Platform

In this notebook, we scrape all necessary data from the DayZero Project website and start cleaning it.

In [None]:
# import relevant packages
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import re
import requests
import json
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pathlib import Path
from time import sleep
import pandas as pd
from tqdm import tqdm
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

To create a network of goals, we used data from the DayZero Project platform, a website and community of goal setters in existence since 2009. Our aim was to create a network of goals connected by co-occurrence on users' lists. Due to the structure of the platform, the scraping process was relatively complicated. The website does not offer an overview of all users (or goals). The only way to access a user's list is to go to their page using their username, and the only way to retrieve usernames in a semi-systematic manner is the search function, which returns a limited number (at most 50) of usernames per search query.

The process of extracting the necessary data went as follows:
1. We started with a list of the 400 most common names from the 2000s in the US. We retrieved this list from the social security website of the US government. Due to scraping restrictions, we had to manually save the html file of the page. We decided to use this list as the basis for extracting usernames from the website.
2. We used each of the 400 names in a search query, scraping the html of the page of search results and extracting all usernames from the html. This left us with 12998 unique usernames (15205 originally, as some search queries led to duplicates).
3. We then looped over the 12998 usernames to retrieve the list of goals for each, scraping the html of their lists and then extracting goals. Each goal has a unique ID which we extracted. This gave us a list of 231269 unique goals (503828 goals before clearing out duplicates).
4. Using the goals' IDs, which form the unique part of a goal URL, we scraped the html from all goal pages and extracted titles and descriptions for each goal.
5. Next, we extracted further attributes of each goal, such as tags, comments, and counts of completion and inclusion on lists (these attributes do not refer to the user cohort we extracted from the website, but are counts provided by the website relating to all users).
6. Finally, we compiled dictionaries for further processing.
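
The figures quoted in the steps above can be put in relation to the search cap with a little arithmetic. The snippet below is only a sanity check on the numbers stated in the text; the variable names are ours and do not appear elsewhere in the notebook.

```python
# Figures taken from the description above
n_queries = 400              # names used as search queries
max_results_per_query = 50   # cap on usernames returned per search
n_raw = 15205                # usernames before removing duplicates
n_unique = 12998             # usernames after removing duplicates

# The cap bounds how many usernames the searches could return in total
upper_bound = n_queries * max_results_per_query
duplicate_share = (n_raw - n_unique) / n_raw

print(f"Upper bound on raw usernames: {upper_bound}")
print(f"Average results per query: {n_raw / n_queries:.1f}")
print(f"Share of duplicates: {duplicate_share:.1%}")
```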

## 1. Importing File With Most Common Names

The first step in gathering our data is to read in the html file containing the 400 names used as a base, as described above.

In [None]:
# Load the local HTML file
tables = pd.read_html("../Data/Additional Data/2000_names_USA.html")

# Usually the first table is the one we want
names_df = tables[0]

# Rename columns for clarity
names_df.columns = ["Rank", "Male Name", "Male Count", "Female Name", "Female Count"]

# Extract male and female names
male_names = names_df["Male Name"].tolist()
female_names = names_df["Female Name"].tolist()

print("Male Names:", male_names[:10])
print("Female Names:", female_names[:10])

Male Names: ['Jacob', 'Michael', 'Joshua', 'Matthew', 'Daniel', 'Christopher', 'Andrew', 'Ethan', 'Joseph', 'William']
Female Names: ['Emily', 'Madison', 'Emma', 'Olivia', 'Hannah', 'Abigail', 'Isabella', 'Samantha', 'Elizabeth', 'Ashley']


In [4]:
male_names = male_names[:-1]
female_names = female_names[:-1]

In [5]:
names_list = female_names + male_names

The names_list contains the 400 first names to be used as search queries in the next step.

## 2. Scraping Usernames

### Creating a Cookie File to Simulate Logging In

In order to use the website's search function, one needs to be logged in as a user. Therefore, we save the cookies from a manual login to use in the scraping process further on.
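
Cookies saved to disk can expire before they are reloaded, in which case the scraper would silently run without being logged in. As a hedged sketch, the helper below filters out expired cookies before they are added to a driver. It assumes each cookie is a dict that may carry an `expiry` key in seconds since the epoch, as in the dicts returned by `driver.get_cookies()`. The function name `drop_expired_cookies` and the example data are invented for illustration.

```python
import time

def drop_expired_cookies(cookies, now=None):
    """Return only cookies that are session cookies or not yet expired."""
    now = time.time() if now is None else now
    # Cookies without an "expiry" key are session cookies and are kept as-is
    return [c for c in cookies if "expiry" not in c or c["expiry"] > now]

# Invented example data; real cookies would come from cookies.json
example_cookies = [
    {"name": "session", "value": "abc"},
    {"name": "old_token", "value": "x", "expiry": 1_000},
    {"name": "new_token", "value": "y", "expiry": 4_000_000_000},
]

fresh = drop_expired_cookies(example_cookies, now=2_000_000_000)
print([c["name"] for c in fresh])
```

If the filtered list comes back empty or lacks the login cookie, repeating the manual login is probably the simplest fix.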

In [None]:
driver = webdriver.Chrome()
driver.get("https://dayzeroproject.com")

# pause here so you can log in manually
input("Log in in the browser, then press Enter here...")

# after login, save cookies
cookies = driver.get_cookies()
cookie_file = Path("cookies.json")
cookie_file.write_text(json.dumps(cookies, indent=2))

driver.quit()
print("Saved cookies to cookies.json")

### Retrieving Usernames Based on Search Pages

The first step in retrieving usernames based on search pages is to perform a search for each of the 400 names from the names_list above and scrape the html of each search results page. We save this into a list called search_pages_html_list. For safekeeping, we export the list as a json file.

#### Scraping Search Pages

In [None]:
# -----------------------------
# 1. Setup Selenium
# -----------------------------
chrome_options = Options()
chrome_options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=chrome_options)

# -----------------------------
# 2. Go to the site first
# -----------------------------
driver.get("https://dayzeroproject.com")

# -----------------------------
# 3. Add cookies for login
# -----------------------------

# path to your saved cookie file
cookie_file = Path("cookies.json")  # adjust path if needed

# load cookies from JSON file
with cookie_file.open("r", encoding="utf-8") as f:
    cookies = json.load(f)

# add cookies to the Selenium driver
for cookie in cookies:
    driver.add_cookie(cookie)

# Refresh to apply cookies
driver.refresh()
time.sleep(2)

# -----------------------------
# 4. Go to the user's following page
# -----------------------------
USERNAME = "Rebecca2025"
driver.get(f"https://dayzeroproject.com/user/{USERNAME}/following")
time.sleep(2)

# -----------------------------
# 5. Click the FIND PEOPLE button (JS click)
# -----------------------------
find_people_button = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.showfindpeople.hand"))
)
driver.execute_script("arguments[0].click();", find_people_button)

# -----------------------------
# 6. Wait for the search box div to appear
# -----------------------------
search_container = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.ID, "following-searchbox"))
)
search_input = search_container.find_element(By.TAG_NAME, "input")
search_button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "following-search"))
)

# -----------------------------
# 7. Loop over names_list
# -----------------------------
search_pages_html_list = []  # store HTML for each search

for name in names_list:
    # clear previous input and type new name
    search_input.clear()
    search_input.send_keys(name)

    # click the SEARCH button
    driver.execute_script("arguments[0].click();", search_button)

    # wait a few seconds for JS to populate results
    time.sleep(5)  # adjust if results load slowly

    # click the SEARCH button a second time, then wait again
    driver.find_element(By.ID, "following-search").click()
    time.sleep(3)

    # get full page HTML and store
    page_html = driver.page_source
    search_pages_html_list.append({"name": name, "html": page_html})
    print(f"Saved HTML for search: {name}")

print("All search pages saved. You can now parse results with regex or BeautifulSoup.")

In [None]:
# -----------------------------
# save the list of HTML pages
# -----------------------------
output_file = Path("dayzero_search_pages.json")

with output_file.open("w", encoding="utf-8") as f:
    json.dump(search_pages_html_list, f, ensure_ascii=False, indent=2)

print(f"Saved {len(search_pages_html_list)} pages to {output_file}")

Using the html pages of the search queries, we extract the usernames with BeautifulSoup, taking each username from the href of its profile link.

#### Extracting Usernames

In [None]:
all_usernames = []  # this will hold usernames from all pages

for page in search_pages_html_list:
    html = page["html"]
    soup = BeautifulSoup(html, "html.parser")

    # find all divs with class 'following-username'
    user_divs = soup.find_all("div", class_="following-username")

    for div in user_divs:
        a_tag = div.find("a", class_="bold")
        if a_tag and "href" in a_tag.attrs:
            href = a_tag["href"]
            username = href.split("/user/")[-1]
            all_usernames.append(username)

print(f"Extracted {len(all_usernames)} usernames in total.")

In [None]:
username_counts = Counter(all_usernames)
duplicates = {u: c for u, c in username_counts.items() if c > 1}

print(f"Total usernames: {len(all_usernames)}")
print(f"Unique usernames: {len(username_counts)}")
print(f"Number of duplicates: {len(duplicates)}")

# Optional: see top repeated usernames
for u, c in sorted(duplicates.items(), key=lambda x: -x[1])[:20]:
    print(u, c)

This leaves us with 12998 unique usernames.
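
The counts above tell us how many usernames are unique, but the list `all_usernames` itself still contains the repeats. One way to obtain a duplicate-free list while keeping the original order is sketched below; `unique_in_order` is a hypothetical helper and the sample usernames are invented.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

sample_usernames = ["anna01", "ben_k", "anna01", "carla", "ben_k"]  # invented
unique_sample = unique_in_order(sample_usernames)
print(unique_sample)
```

Using such a list in the scraping loops avoids requesting the same user page more than once.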

## 3. Extracting Goals of Users

Using the list of usernames, we now scrape users' lists of goals.

### Scraping Users' Goal Lists

In [None]:
goal_lists_html_list = []

# --- 1. Start a new driver session ---
options = Options()
options.add_argument("--start-maximized")
# or headless mode if you don’t need the UI:
# options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

# --- 2. Load cookies if the site requires login ---
with open("cookies.json", "r", encoding="utf-8") as f:
    cookies = json.load(f)

driver.get("https://dayzeroproject.com")  # open the domain first
for cookie in cookies:
    driver.add_cookie(cookie)

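# loop over unique usernames only (all_usernames still contains duplicates)
unique_usernames = list(dict.fromkeys(all_usernames))

for user in tqdm(unique_usernames, desc="Fetching goal pages"):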
    try:
        url = f"https://dayzeroproject.com/user/{user}/list/-1"
        driver.get(url)

        # wait until the goals are visible or the page stabilizes
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div#content"))
        )
        time.sleep(2)  # short delay for JS completion

        html = driver.page_source
        goal_lists_html_list.append({"username": user, "html": html})
    except Exception as e:
        print(f"Error for {user}: {e}")
        continue

In [None]:
# specify the path and filename
output_file = "goal_lists_html_list.json"

# save the list of dictionaries to JSON
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(goal_lists_html_list, f, ensure_ascii=False, indent=2)

print(f"Saved {len(goal_lists_html_list)} pages to {output_file}")

The goal_lists_html_list now holds the html of every goal list page we scraped.

Now we can extract each user's goals (as a list of dicts that carries the necessary information) by using BeautifulSoup again.

### Extracting Goals From Pages

In [None]:
user_goals = {}

for item in tqdm(goal_lists_html_list, desc="Extracting user goals"):
    username = item.get("username")
    html = item.get("html")

    # Skip invalid entries
    if not username or not isinstance(html, str):
        continue

    soup = BeautifulSoup(html, "html.parser")

    # -----------------------------
    # Extract all goals
    # -----------------------------
    goals = []
    seen_ids = set()

    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"]
        if href.startswith("/goal/"):
            goal_id = href.split("/goal/")[-1]
            if goal_id not in seen_ids:
                text = a_tag.get_text(strip=True)
                goals.append({"id": goal_id, "href": href, "text": text})
                seen_ids.add(goal_id)

    user_goals[username] = goals

print(f"Extracted goals for {len(user_goals)} users.")

The result is a dictionary called user_goals, with the username as key and a list of that user's goals as value (one dict per goal, holding the goal's ID, href and text).
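
To make this structure concrete, here is a small invented example with the same shape, together with one way to pull out just the goal IDs per user. The usernames, IDs and titles are made up, and the exact form of a real goal ID (whatever follows `/goal/` in the link) is an assumption.

```python
# Invented example mirroring the structure of user_goals
example_user_goals = {
    "some_user": [
        {"id": "101", "href": "/goal/101", "text": "Run a marathon"},
        {"id": "202", "href": "/goal/202", "text": "Learn to juggle"},
    ],
    "empty_user": [],  # a user whose list contains no goals
}

# Reduce each user's list of goal dicts to a list of goal IDs
goal_ids_by_user = {
    user: [goal["id"] for goal in goals]
    for user, goals in example_user_goals.items()
}
print(goal_ids_by_user)
```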

Some users have no goals on their list, so we split the dictionary into two parts: one containing users with goals and one containing users without.

### Filtering Users With and Without Goals

In [None]:
users_with_goals = {}
users_without_goals = {}

for username, goals in tqdm(user_goals.items(), desc="Filtering users", unit="user"):
    if goals:  # non-empty list is truthy
        users_with_goals[username] = goals
    else:
        users_without_goals[username] = goals

print(f"Users with goals: {len(users_with_goals)}")
print(f"Users without goals: {len(users_without_goals)}")

In [None]:
# Save users_with_goals to JSON
with open("users_with_goals.json", "w", encoding="utf-8") as f:
    json.dump(users_with_goals, f, ensure_ascii=False, indent=2)

print("users_with_goals has been saved to 'users_with_goals.json'.")

This results in 10087 users with goals and 2910 users without.

### Checking Up on Total Goals Extracted

Since each value in the dictionary of users with goals is a list of nested dictionaries (one per goal), we can already count how many goals we have scraped: 503828, a figure that may include goals appearing on several users' lists.

In [None]:
total_goals = sum(len(goals) for goals in users_with_goals.values())
print(f"Total goals (including duplicates): {total_goals}")

## Creating a Dictionary of Unique Goals

Since the dictionary above contains duplicates, we filter it to create a new dictionary called goals_dict, containing only unique goals. This gives us 231269 goals.

In [None]:
# Initialize the goal container
goals_dict = {}

for username, goals in tqdm(users_with_goals.items(), desc="Processing users"):
    for goal in goals:
        goal_id = goal["id"]
        title = goal["text"]
        
        # Only add if not already in the dict (to avoid duplicates)
        if goal_id not in goals_dict:
            goals_dict[goal_id] = {
                "title": title,
                "html": None  # placeholder for HTML to be fetched later
            }

print(f"Total unique goals collected: {len(goals_dict)}")

## 4. Retrieving Goals from the Goal Pages

Now that we have the IDs of all the unique goals we have scraped, we can loop over them to extract the html of each goal page.

### Scraping Goal Pages

We start by defining a function to do this.
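
The function defined next retries failed requests and waits longer after each failure. As a minimal sketch of that waiting schedule, the hypothetical helper `backoff_schedule` lists the pauses implied by the `max_retries` and `backoff_factor` parameters; it is not part of the scraping pipeline itself.

```python
def backoff_schedule(max_retries=3, backoff_factor=2):
    """Seconds waited after each failed attempt except the last one."""
    # After the final attempt the function gives up, so there is no wait
    return [backoff_factor ** (attempt - 1) for attempt in range(1, max_retries)]

print(backoff_schedule())      # [1, 2]
print(backoff_schedule(5, 2))  # [1, 2, 4, 8]
```

With the defaults, a goal page that keeps failing therefore costs about three seconds of waiting on top of the request timeouts.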

In [None]:
# Function to fetch a single goal page
def fetch_goal_html(goal_id, max_retries=3, backoff_factor=2):
    """
    Fetches the HTML for a goal with a retry mechanism.
    Returns (goal_id, html, error_message or None)
    """
    url = f"https://dayzeroproject.com/goal/{goal_id}"
    
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return goal_id, response.text, None  # success
        
        except Exception as e:
            error_msg = str(e)
            if attempt < max_retries:
                sleep_time = backoff_factor ** (attempt - 1)
                print(f"⚠️ Attempt {attempt} failed for {goal_id}: {error_msg}. Retrying in {sleep_time}s...")
                time.sleep(sleep_time)
            else:
                print(f"❌ Failed for {goal_id} after {max_retries} attempts.")
                return goal_id, None, error_msg

# Function to save progress periodically
def save_goals_dict(filename="goals_dict_progress.json"):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(goals_dict, f, ensure_ascii=False)
    print(f"Progress saved to {filename}.")

Then we apply this function in our loop to retrieve all html pages.
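
Because the loop saves `goals_dict` after every batch, an interrupted run could in principle be resumed by fetching only the goals that have no html yet. The sketch below shows one way to select those IDs, assuming the `{"title": ..., "html": ...}` structure built earlier; `pending_goal_ids` is a hypothetical helper and the example entries are invented.

```python
def pending_goal_ids(goals):
    """Return the IDs of goals whose html has not been fetched yet."""
    # html stays None both before fetching and after a failed fetch
    return [goal_id for goal_id, data in goals.items() if data.get("html") is None]

# Invented example with the same shape as goals_dict
example_goals = {
    "101": {"title": "Run a marathon", "html": "<html>...</html>"},
    "202": {"title": "Learn to juggle", "html": None},
    "303": {"title": "Visit Iceland", "html": None},
}
print(pending_goal_ids(example_goals))
```

After reloading the saved progress file with `json.load`, passing the result of this helper to the batch loop instead of all goal IDs would skip pages that were already downloaded.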

In [None]:
# Parameters
all_goal_ids = list(goals_dict.keys())
batch_size = 1000  # number of goals per batch
max_workers = 10   # adjust based on network & server tolerance

for i in range(0, len(all_goal_ids), batch_size):
    batch_goal_ids = all_goal_ids[i:i+batch_size]
    print(f"Processing batch {i//batch_size + 1} ({len(batch_goal_ids)} goals)...")
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_goal_html, gid): gid for gid in batch_goal_ids}
        
        for future in tqdm(as_completed(futures), total=len(futures), desc=f"Batch {i//batch_size + 1}"):
            goal_id, html, error = future.result()
            goals_dict[goal_id]["html"] = html
            if error:
                print(f"❌ Error fetching {goal_id}: {error}")

    # Polite pause between batches
    time.sleep(5)
    
    # Save progress after each batch
    save_goals_dict()

print("Finished fetching HTML for all goals.")

One important aspect of goals is that some have descriptions, while others do not. For our analysis, we will only use goals which have a description in order to be able to fully analyze them. To separate those goals with text from those without, we first need to extract the text from the html.

### Extracting Text from Goal Pages

First, we use BeautifulSoup to extract the description from the html. For this, we first retrieve certain divs, and then the text within those divs.

In [None]:
goal_description_dict = {}
no_description_id_list = []

for goal_id, goal_data in tqdm(goals_dict.items(), desc="Extracting goal texts"):
    html = goal_data.get("html", "")
    
    if not html:
        # No HTML, empty list and track ID
        goal_description_dict[goal_id] = []
        no_description_id_list.append(goal_id)
        continue

    soup = BeautifulSoup(html, "html.parser")  # or "lxml" if installed
    goal_texts = []

    # Find all outer divs
    outer_divs = soup.find_all("div", class_="ml10 mr10")

    for outer in outer_divs:
        # Find all nested darkdarkdarkgrey divs
        inner_divs = outer.find_all("div", class_="darkdarkdarkgrey")
        for div in inner_divs:
            # Remove nested attribution divs
            for attrib in div.find_all("div", class_="goal-attribution"):
                attrib.decompose()
            
            text = div.get_text(strip=True)
            if text:
                goal_texts.append(text)

    # Save in final dict; track IDs with no descriptions
    if goal_texts:
        goal_description_dict[goal_id] = goal_texts
    else:
        goal_description_dict[goal_id] = []
        no_description_id_list.append(goal_id)

print(f"Total goals processed: {len(goal_description_dict)}")
print(f"Goals with no description: {len(no_description_id_list)}")

Our approach of retrieving divs leads to some goals having more than one description when the description is split into several paragraphs in the html.

In [None]:
# Filter goals with more than 1 description
goals_multiple_texts = {gid: texts 
                        for gid, texts in goal_description_dict.items() 
                        if len(texts) > 1}

print(f"Found {len(goals_multiple_texts)} goals with more than 1 description.")

Therefore, we merge the text if there are several items in the description list.

In [None]:
for goal_id, texts in goal_description_dict.items():
    if len(texts) > 1:
        # Merge all items into one string, separated by space (or newline if preferred)
        merged_text = " ".join(texts)
        goal_description_dict[goal_id] = [merged_text]  # keep as a single-item list

Every goal now has a description list of length 0 or 1, so we convert each list to a string (an empty string for goals without a description).

In [None]:
for goal_id, texts in goal_description_dict.items():
    if texts:  # exactly one item after merging
        goal_description_dict[goal_id] = texts[0]  # convert list to string
    else:
        goal_description_dict[goal_id] = ""  # goals without a description

Finally, we create an updated dictionary with titles and descriptions and export it.

In [None]:
goal_dict_with_description_updated = {}

for goal_id, description in goal_description_dict.items():
    title = goals_dict[goal_id].get("title", "")
    goal_dict_with_description_updated[goal_id] = {
        "title": title,
        "description": description
    }

print(f"Created updated goal dict with {len(goal_dict_with_description_updated)} items.")

In [None]:
output_file = "goal_dict_with_description_updated.json"

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(goal_dict_with_description_updated, f, ensure_ascii=False, indent=2)

print(f"Exported {len(goal_dict_with_description_updated)} goals to {output_file}")

## 5. Extracting Further Attributes

There are three additional attributes we are interested in extracting.
1. Counts of how many people have completed a certain goal and how many people want to complete it. We extract these in a two-step process, first retrieving the raw text of the div that holds the counts, then cleaning that text.
2. The comments on a goal. Not all goals have comments, and many comments are pictures, which we are not extracting. However, if there are textual comments, we are extracting them for each goal where this is applicable.
3. The tags assigned by the website itself to each goal. A goal can have one or many tags, which we therefore extract as a list.

In [None]:
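import lxml.html

# Containers for the three attributes, keyed by goal ID
counts_messy_dict = {}
goals_comments_dict = {}
goals_tags_dict = {}

for goal_id, content in tqdm(goals_dict.items(), desc="Parsing goals"):
    html = content.get("html")

    # Goals whose page could not be fetched have no HTML; store empty values
    if not html:
        counts_messy_dict[goal_id] = None
        goals_comments_dict[goal_id] = []
        goals_tags_dict[goal_id] = []
        continue

    try:
        doc = lxml.html.fromstring(html)

        # 1. Counts
        node = doc.xpath('//div[@class="places-peoplecount size90"]')
        counts_messy_dict[goal_id] = node[0].text_content().strip() if node else None

        # 2. Comments
        comments = doc.xpath('//div[@class="goal-notecontent rounded truncate"]')
        goals_comments_dict[goal_id] = [c.text_content().strip() for c in comments]

        # 3. Tags
        tags = doc.xpath('//a[starts-with(@href, "/tag/")]')
        goals_tags_dict[goal_id] = [t.text_content().strip() for t in tags]

    except Exception:
        print("Error at ID:", goal_id)
        raise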

To correctly retrieve the counts and the different variations of text used to describe them, we need to clean them further.
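
To illustrate the kind of cleaning involved, the sketch below applies two regular expressions to invented sample strings. The wording of the samples is an assumption modelled on the phrases "On X lists" and "X people have done it"; the real text on the site may vary, and `parse_counts` is a hypothetical, simplified helper rather than the full cleaning step.

```python
import re

def parse_counts(text):
    """Pull list and completion counts out of a raw count string."""
    wants_to_do, have_done = 0, 0

    # "On 1,234 lists": strip the thousands separator before converting
    match_lists = re.search(r"On ([\d,]+) lists", text)
    if match_lists:
        wants_to_do = int(match_lists.group(1).replace(",", ""))

    # "567 people have done it"
    match_done = re.search(r"([\d,]+) people have done it", text)
    if match_done:
        have_done = int(match_done.group(1).replace(",", ""))

    return {"wants_to_do": wants_to_do, "have_done": have_done}

# Invented sample strings
print(parse_counts("On 1,234 lists 567 people have done it"))
print(parse_counts("On 12 lists"))
print(parse_counts(""))
```

Strings that match neither pattern fall back to zero for both counts, which mirrors the default used when no count div is found.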

In [None]:
counts_clean_dict = {}

for goal_id, text in tqdm(counts_messy_dict.items(), desc="Parsing counts"):
    if text is None:
        counts_clean_dict[goal_id] = {"wants_to_do": 0, "have_done": 0}
        continue

    wants_to_do = 0
    have_done = 0

    # Remove HTML tags if present
    text_clean = BeautifulSoup(text, "html.parser").get_text()

    # Extract "On X lists"
    match_lists = re.search(r"On ([\d,]+) lists", text_clean)
    if match_lists:
        wants_to_do = int(match_lists.group(1).replace(",", ""))

    # Extract "One person wants to do this" / "X people want to do this"
    match_wants = re.search(r"([\d,]+|One) (?:persons?|people) wants? to do this", text_clean)
    if match_wants:
        if match_wants.group(1) == "One":
            wants_to_do = 1
        else:
            wants_to_do = int(match_wants.group(1).replace(",", ""))

    # Extract "X people have done it"
    match_done = re.search(r"([\d,]+) people have done it", text_clean)
    if match_done:
        have_done = int(match_done.group(1).replace(",", ""))

    counts_clean_dict[goal_id] = {"wants_to_do": wants_to_do, "have_done": have_done}

Now we can export all three dictionaries for safekeeping.

In [None]:
# Export counts_clean_dict
with open("counts_clean_dict.json", "w", encoding="utf-8") as f:
    json.dump(counts_clean_dict, f, ensure_ascii=False, indent=2)

# Export goals_comments_dict
with open("goals_comments_dict.json", "w", encoding="utf-8") as f:
    json.dump(goals_comments_dict, f, ensure_ascii=False, indent=2)

# Export goals_tags_dict
with open("goals_tags_dict.json", "w", encoding="utf-8") as f:
    json.dump(goals_tags_dict, f, ensure_ascii=False, indent=2)

## 6. Compiling a Final Dictionary

Finally, now that we have retrieved all goals based on the usernames, and extracted the relevant attributes, we can combine all of this into one dictionary.

In [None]:
goals_with_attributes_dict = {}

for goal_id, desc_info in goal_dict_with_description_updated.items():
    # Get counts, or default to 0 for both keys if missing
    counts = counts_clean_dict.get(goal_id, {"wants_to_do": 0, "have_done": 0})

    combined_info = {
        "title": desc_info.get("title", ""),
        "description": desc_info.get("description", ""),
        "wants_to_do": counts.get("wants_to_do", 0),
        "have_done": counts.get("have_done", 0),
        "comments": goals_comments_dict.get(goal_id, []),
        "tags": goals_tags_dict.get(goal_id, [])
    }
    goals_with_attributes_dict[goal_id] = combined_info

print(f"Combined dict created with {len(goals_with_attributes_dict)} items.")

In [None]:
output_file = "goals_with_attributes.json"

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(goals_with_attributes_dict, f, ensure_ascii=False, indent=2)

print(f"Exported {len(goals_with_attributes_dict)} goals to {output_file}")

In the next steps, this dictionary and the users_with_goals dictionary will be used for further pre-processing and to construct our network.
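
As a rough preview of what co-occurrence on users' lists means in practice, the sketch below counts, for every pair of goals, how many users have both on their list. It assumes the structure of users_with_goals described earlier. The actual network construction is not part of this notebook and may differ, and `cooccurrence_counts` and the example data are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(user_goal_lists):
    """Count how many users share each unordered pair of goal IDs."""
    edge_counts = Counter()
    for goals in user_goal_lists.values():
        # Sort the unique IDs so each pair is always stored in the same order
        goal_ids = sorted({goal["id"] for goal in goals})
        for pair in combinations(goal_ids, 2):
            edge_counts[pair] += 1
    return edge_counts

# Invented example with the same shape as users_with_goals
example_lists = {
    "u1": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
    "u2": [{"id": "a"}, {"id": "b"}],
}
edges = cooccurrence_counts(example_lists)
print(edges[("a", "b")])  # 2
print(len(edges))         # 3
```

Each key of the resulting Counter can be read as an edge between two goals, with the count as a candidate edge weight.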