This is part one of an attempt to expand on a working paper, "[Extracting protest events from newspaper articles with ChatGPT](https://osf.io/dvht7)," that I wrote with Andy Andrews and Rashawn Ray. In that paper, we tested whether ChatGPT could replace my undergraduate RAs in extracting details about Black Lives Matter protests from media accounts. This time, I want to extend that test to more articles, movements, and variables.

In this part, I largely copy [old code on downloading](https://nealcaren.github.io/notes/posts/scraping/bulk-download.html) to help gather a couple of thousand articles from the [Crowd Counting Consortium](https://github.com/nonviolent-action-lab/crowd-counting-consortium)'s dataset. Their dataset includes event characteristics for over a hundred thousand protest events and the source web addresses. I aim to test if GPT models can replicate their hand-coding results, but this script just gets the data.

In [None]:
pip install -U pyppeteer python-slugify chromedriver-py nest_asyncio

In [1]:
import os
import asyncio
import nest_asyncio
from random import shuffle
from collections import Counter

from slugify import slugify
from pyppeteer import launch
from pyppeteer.errors import NetworkError

import pandas as pd


nest_asyncio.apply()

In [3]:
df = pd.read_csv(
    "https://github.com/nonviolent-action-lab/crowd-counting-consortium/raw/master/ccc_compiled_2021-present.csv",
    encoding="latin",
    low_memory=False,
)
print(len(df))

103986


In [4]:
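# Step 1: Limit to events from 2023 or 2024
df = df[pd.to_datetime(df["date"]).dt.year.isin([2023, 2024])]
print(len(df))

# Keep only rows where source_2 is filled in (dropna drops the rows missing it).
# Note: to keep events with a single source instead, use df[df['source_2'].isna()].
df = df.dropna(subset=['source_2'])
df['Keep'] = True
print(df['Keep'].sum())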

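# Step 2: Eliminate social media URLs and ensure they contain 'http'
social_media_domains = ["twitter.com", "youtube.com", "facebook.com", "instagram.com", "tiktok.com", "bsky.com"]
# regex=False treats the '.' in each domain literally; na=False treats missing URLs as non-matches
df['Keep'] = df['Keep'] & df['source_1'].str.contains("http", regex=False, na=False)
for sm_domain in social_media_domains:
    df['Keep'] = df['Keep'] & ~df['source_1'].str.contains(sm_domain, regex=False, na=False)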

print(df['Keep'].sum())

# Step 3: Filter URLs to keep only those that appear once
# Count occurrences of each URL
url_counts = df.loc[df['Keep'], 'source_1'].value_counts()

# Here, we use 'map' to align counts with the original DataFrame, checking if each count equals 1
unique_url_mask = df['source_1'].map(url_counts) == 1

# Update the 'Keep' column: True only if previously True AND the URL is unique (appears once)
df['Keep'] = df['Keep'] & unique_url_mask

print(df['Keep'].sum())


37378
16284
8400
3398
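
To make the filtering logic concrete, here is a minimal pure-Python sketch of Steps 2 and 3 on a made-up list of URLs. The URLs and the `keep_unique_news_urls` helper are hypothetical, not part of the CCC data or the notebook. Note that a URL appearing more than once is dropped entirely rather than deduplicated, which mirrors the `value_counts() == 1` mask above.

```python
from collections import Counter

SOCIAL_MEDIA_DOMAINS = ["twitter.com", "youtube.com", "facebook.com",
                        "instagram.com", "tiktok.com", "bsky.com"]


def keep_unique_news_urls(urls):
    """Hypothetical helper mirroring Steps 2 and 3 on a plain list."""
    # Step 2: keep web addresses that are not on a social media domain
    candidates = [
        u for u in urls
        if "http" in u and not any(d in u for d in SOCIAL_MEDIA_DOMAINS)
    ]
    # Step 3: keep only URLs that appear exactly once among the candidates
    counts = Counter(candidates)
    return [u for u in candidates if counts[u] == 1]


toy_urls = [
    "https://example.com/news/rally",      # kept
    "https://twitter.com/someone/status",  # social media: dropped
    "https://example.org/story",           # appears twice: both copies dropped
    "https://example.org/story",
    "print edition, p. 4",                 # not a web address: dropped
]
print(keep_unique_news_urls(toy_urls))
```

Only the first URL survives, because it is the only non-social-media web address that occurs once.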


In [6]:
# Save the subset

df = df[df['Keep']]
df.to_json('ccc_sample.json', orient='records')

In [2]:
# Load the subset and make a list of the URLs

df = pd.read_json('ccc_sample.json')
urls = df['source_1'].tolist()  # a plain list, so random.shuffle can reorder it in place
shuffle(urls)

Below are slightly revised functions from [earlier](https://nealcaren.github.io/notes/posts/scraping/bulk-download.html). They include small changes to try to trick more websites into thinking the scraper is a human, such as `await page.setViewport({'width': 2560, 'height': 1600})` to match a laptop screen size in the headless browser.

In [9]:
# Ensure the HTML directory exists
html_dir = "HTML"
os.makedirs(html_dir, exist_ok=True)

# User agent to be used for all requests
ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15"
bad_urls = []


async def fetch(page, url, timeout=30):
    # Slugify the URL to create a valid filename
    filename = slugify(url) + ".html"
    file_path = os.path.join(html_dir, filename)

    if os.path.isfile(file_path):
        # print(f"File {file_path} already exists, skipping download.")
        return

    if url in bad_urls:
        print(f"Skipping bad URL: {url}")
        return

    try:
        # Set the user agent for the page
        await page.setUserAgent(ua)
        await page.setViewport({'width': 2560, 'height': 1600})
        await page.evaluateOnNewDocument("""
    () => {
        Object.defineProperty(navigator, 'webdriver', {
            get: () => false,
        });
    }
    """)
        page.on('dialog', lambda dialog: asyncio.create_task(dialog.dismiss()))

        # Navigate to the page with a timeout
        response = await asyncio.wait_for(
            page.goto(url, {"waitUntil": "networkidle0"}), timeout
        )

        # Check if the page was successfully retrieved
        if response and response.ok:
            content = await page.content()
            # Save the content to a file in the 'HTML' directory
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(content)
            print(f"Content from {url} has been saved to {file_path}")
        else:
            print(f"Failed to retrieve {url}")
            bad_urls.append(url)
    except asyncio.TimeoutError:
        print(f"Fetching {url} took too long and was cancelled.")
        bad_urls.append(url)
    except Exception as e:
        print(f"An error occurred while fetching {url}: {e}")
        bad_urls.append(url)
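
The timeout handling in `fetch` relies on `asyncio.wait_for`, which cancels the wrapped coroutine and raises `asyncio.TimeoutError` once the limit passes. The sketch below shows that behavior in isolation. The `slow_page_load` coroutine is a hypothetical stand-in for `page.goto`, and no browser is involved.

```python
import asyncio


async def slow_page_load(seconds):
    """Hypothetical stand-in for page.goto(): sleeps, then returns."""
    await asyncio.sleep(seconds)
    return "loaded"


async def load_with_timeout(seconds, timeout):
    """Return the result, or None if the load exceeds the timeout."""
    try:
        return await asyncio.wait_for(slow_page_load(seconds), timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled slow_page_load at this point
        return None


async def demo():
    fast = await load_with_timeout(0.01, timeout=0.5)  # finishes in time
    slow = await load_with_timeout(0.5, timeout=0.01)  # exceeds the limit
    return fast, slow


results = asyncio.run(demo())
print(results)
```

The second call returns `None` instead of raising, which is the same path that produces the "took too long and was cancelled" messages in the real scraper.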

In [10]:
async def process_url(page, url_queue):
    while not url_queue.empty():
        url = await url_queue.get()
        await fetch(page, url)  # fetch() is defined in the previous cell
        url_queue.task_done()

async def main():
    browser = await launch()
    pages = [await browser.newPage() for _ in range(5)]  # Initialize pages once

    # Create a queue of URLs
    url_queue = asyncio.Queue()
    for url in urls:
        if url not in bad_urls:
            await url_queue.put(url)

    # Create a task for each page to process URLs from the queue
    tasks = [asyncio.create_task(process_url(page, url_queue)) for page in pages]

    # Wait for all tasks to complete
    await asyncio.gather(*tasks)

    # Close pages and browser after all operations are complete
    for page in pages:
        await page.close()
    await browser.close()

    if bad_urls:
        print("The following URLs had issues and were not downloaded:")
        print("\n".join(bad_urls))

asyncio.run(main())

Fetching https://www.citizen-times.com/story/news/local/2023/08/14/two-asheville-police-cars-completely-destroyed-by-suspected-arson/70589191007/ took too long and was cancelled.
Fetching https://www.wbir.com/article/news/local/queer-in-sevier-variety-show-2023/51-eceff553-0dfe-4a02-8f48-d78a445fefc5#:~:text=The%20third%20annual%20%22Queer%20in,at%20the%20Sevier%20Civic%20Center. took too long and was cancelled.
Fetching https://www.michigandaily.com/news/news-briefs/2-geo-protestors-detained-and-released-in-police-encounter-outside-ann-arbor-restaurant/ took too long and was cancelled.
Fetching https://act.everytown.org/event/moms-demand-action-event/53943 took too long and was cancelled.
Fetching https://www.wtnh.com/news/connecticut/healthcare-workers-at-3-local-hospitals-to-rally-amid-financial-care-condition-troubles/ took too long and was cancelled.
Fetching https://www.wcvb.com/article/students-walk-out-wayland-massachusetts-protest-antisemitic-graffiti/45875442 took too long an

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.sendMessageToTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.sendMessageToTarget): No session with given id
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.detachFromTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.detachFromTarget): No session with given id


Fetching https://www.eventbrite.com/e/teens-against-gender-mutilation-rally-murfreesboro-tickets-485521647317 took too long and was cancelled.
Fetching https://www.phillyburbs.com/story/news/local/2023/12/09/gun-safety-advocates-ceasefireepa-stalled-gun-bills-bucks-county-red-flag-laws-background-checks/71850932007/ took too long and was cancelled.
Fetching https://www.ballstatedaily.com/article/2023/11/news-students-hold-protest-for-ceasefire-in-the-israel-hamas-war-at-scramble-light took too long and was cancelled.
Failed to retrieve https://www.nytimes.com/2023/11/01/world/middleeast/columbia-protest-hillary-clinton-class.html
Fetching https://www.washingtonpost.com/dc-md-va/2024/01/15/virginia-assembly-gun-rights-rally/ took too long and was cancelled.
Fetching https://tylerpaper.com/news/tyler-area-gears-up-for-pride-month-with-third-annual-march-other-events/article_a035edc8-00c9-11ee-ad97-531d226adae7.html took too long and was cancelled.
Failed to retrieve https://act.community

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.detachFromTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.detachFromTarget): No session with given id


Fetching https://www.wbaltv.com/article/federal-hill-residents-protest-bge-gas-regulators-exterior/44304834 took too long and was cancelled.
Fetching https://weartv.com/news/local/hundreds-gather-in-pensacola-to-protest-us-funding-for-israel-hamas-war took too long and was cancelled.
Failed to retrieve https://www.semissourian.com/story/3021988.html
Fetching https://act.everytown.org/event/moms-demand-action-event/53855/signup/?_gl=1*1imcj2z*_ga*NDYxNDc1MDczLjE2ODM1NDAxODg.*_ga_LT0FWV3EK3*MTY4MzU0MDE4OC4xLjAuMTY4MzU0MDE5MS4wLjAuMA.. took too long and was cancelled.
Fetching https://www.wisn.com/article/protesters-marching-against-republican-policies-gather-outside-fiserv-forum/44895022# took too long and was cancelled.
Fetching https://wjla.com/news/local/restaurant-maryland-rally-tax-credit-increase-minimum-wage-montgomery-county-prince-georges-county-md-association-ram-march-wayne-curry-opposed-support-tips-tipped-credit-higher-labor-costs-service-charges-claims took too long and was

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.detachFromTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.detachFromTarget): No session with given id


Fetching https://wlos.com/news/local/union-workers-rally-for-health-care-justice-and-workers-rights-in-downtown-asheville-pack-square-park-national-nurses-union-mission-hospital-international-brotherhood-of-electrical-workers-american-federation-of-labor-and-congress-of-industrial-organizat took too long and was cancelled.
Fetching https://kfoxtv.com/news/local/pro-palestine-rally-to-be-held-at-san-jacinto-plaza-saturday-israel-biden-netanyahu-middle-east-east-hamas-el-paso-texas-protest took too long and was cancelled.
Fetching https://www.kcra.com/article/republicans-small-business-owners-rally-to-pass-pro-public-safety-bills/44969031 took too long and was cancelled.
Fetching https://katu.com/news/local/rally-held-for-transgender-rights-in-la-center took too long and was cancelled.
Fetching https://www.kcra.com/article/sacramento-rally-domestic-workers-sb-686/44943835 took too long and was cancelled.
Fetching https://www.browndailyherald.com/article/2024/02/protestors-deliver-pro-div

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.sendMessageToTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.sendMessageToTarget): No session with given id


Fetching https://theblackwallsttimes.com/2024/02/20/vigil-for-non-binary-teen-who-died-after-school-bathroom-beating/ took too long and was cancelled.
Failed to retrieve https://pridemyrtlebeach.org/event/pride-in-the-park-festival-2023/
Fetching https://whyy.org/articles/delaware-county-lgbtq-pride-parade/ took too long and was cancelled.
Fetching https://www.actionnewsjax.com/news/local/protestors-march-jacksonville-city-hall-three-weeks-after-racist-triple-shooting/TTGNG5IK2ZDILCCPYP2WIEG2IU/ took too long and was cancelled.
Content from https://t.me/patriotfrontvideos/424 has been saved to HTML/https-t-me-patriotfrontvideos-424.html
Fetching https://dailycollegian.com/2023/04/cops-off-campus-protest/ took too long and was cancelled.
Failed to retrieve https://sdpride.org/event/drag-march-for-trans-rights/
Fetching https://www.wjcl.com/article/rally-to-take-place-at-governors-mansion-following-death-of-georgia-corrections-officer/45420102 took too long and was cancelled.
Fetching ht

PageError: Protocol Error: Connection Closed. Most likely the page has been closed.

I ended up running the last few cells almost a dozen times, sometimes tweaking the script in an attempt to download more articles, other times because I accidentally stopped the process by closing my laptop. Because `fetch` skips any URL whose file already exists, each rerun only retried the missing pages. I ended up with about 3,059 articles. I suspect that most of the URLs I couldn't get were from sites where Cloudflare or some similar pop-up check blocked me for not being a human, while a smaller share are invalid URLs.
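
One way to check that suspicion is to tally the failed URLs by domain and see whether a handful of sites account for most of the misses. This is a minimal sketch, assuming the failures are available as a list like `bad_urls`. The `failures_by_domain` helper is hypothetical, and the sample URLs are copied from the log output above.

```python
from collections import Counter
from urllib.parse import urlparse


def failures_by_domain(failed_urls):
    """Count failed URLs per host, ignoring a leading 'www.'."""
    hosts = []
    for url in failed_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    return Counter(hosts)


sample_failures = [
    "https://www.kcra.com/article/sacramento-rally-domestic-workers-sb-686/44943835",
    "https://www.kcra.com/article/republicans-small-business-owners-rally-to-pass-pro-public-safety-bills/44969031",
    "https://act.everytown.org/event/moms-demand-action-event/53943",
    "https://sdpride.org/event/drag-march-for-trans-rights/",
]
for domain, n in failures_by_domain(sample_failures).most_common():
    print(domain, n)
```

Domains that show up repeatedly are good candidates for a closer look, for example by opening one of their pages in a regular browser to see what kind of block appears.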