This is part one of an attempt to expand on a working paper, "[Extracting protest events from newspaper articles with ChatGPT](https://osf.io/dvht7)," which I wrote with Andy Andrews and Rashawn Ray. In that paper, we tested whether ChatGPT could replace my undergraduate RAs in extracting details about Black Lives Matter protests from media accounts. This time, I want to include more articles, movements, and variables.

In this part, I largely copy [old code on downloading](https://nealcaren.github.io/notes/posts/scraping/bulk-download.html) to gather a couple of thousand articles linked in the [Crowd Counting Consortium](https://github.com/nonviolent-action-lab/crowd-counting-consortium)'s dataset. Their dataset includes event characteristics and source web addresses for over a hundred thousand protest events. I aim to test whether GPT models can replicate their hand-coding results, but this script just gets the data.

In [41]:
pip install undetected-chromedriver



In [28]:
import os
from random import shuffle
from collections import Counter
from urllib.parse import urlparse
import re
from concurrent.futures import ThreadPoolExecutor
import subprocess
import json

from slugify import slugify
from selenium import webdriver
import undetected_chromedriver as uc
from newspaper import Article
import pandas as pd


In [29]:
# Load the subset and make a list of the URLs

df = pd.read_json('ccc_sample.json')
# Use a plain Python list so random.shuffle reorders it in place
urls = df['source_1'].tolist()
shuffle(urls)
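
Since paywalls and bot blocking tend to vary by outlet, it can help to see which sites dominate the sample before downloading anything. Here is a minimal sketch using `Counter` and `urlparse` from the imports above. The `count_domains` helper and the example.com URLs are hypothetical and only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many URLs come from each site, ignoring a leading 'www.'."""
    domains = []
    for url in urls:
        netloc = urlparse(url).netloc.lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]
        domains.append(netloc)
    return Counter(domains)


# Illustrative URLs only (hypothetical paths, not from the dataset)
sample_urls = [
    "https://www.example.com/news/story-1",
    "https://example.com/news/story-2",
    "https://www.example.org/local/story-3",
]
domain_counts = count_domains(sample_urls)
print(domain_counts.most_common(2))
```

With the real data, something like `count_domains(urls).most_common(20)` would list the outlets most likely to need special handling.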

In [30]:
user_agent = '''Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'''

def slugurl(url):
    filename = slugify(url) + ".html"
    file_path = os.path.join('HTML', filename)
    return file_path
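
The `slugurl` helper turns a URL into a file path, which makes it possible to cache each page on disk and skip it on later runs. The sketch below shows one way that might look. To keep it runnable without extra packages, it swaps in a regex-based `simple_slug` in place of `slugify`. The names `simple_slug`, `cached_path`, `save_html`, and `load_html` are all hypothetical.

```python
import os
import re
import tempfile


def simple_slug(url):
    """Stand-in for slugify(): lowercase, then collapse non-alphanumerics to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")


def cached_path(url, folder):
    """Build the file path where this URL's HTML would be stored."""
    return os.path.join(folder, simple_slug(url) + ".html")


def save_html(url, html, folder):
    """Write the page source to disk and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = cached_path(url, folder)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


def load_html(url, folder):
    """Return cached HTML, or None if the URL has not been saved yet."""
    path = cached_path(url, folder)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demo in a temporary folder with a made-up URL and page
demo_folder = tempfile.mkdtemp()
demo_url = "https://example.com/news/story-1"
first_lookup = load_html(demo_url, demo_folder)  # nothing cached yet
saved_path = save_html(demo_url, "<html><body>Hello</body></html>", demo_folder)
second_lookup = load_html(demo_url, demo_folder)  # now read back from disk
```

In a download loop, checking `load_html` before calling the browser would avoid requesting the same article twice.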



In [31]:
def get_article_info(url, html):
    article = Article(url=url)
    article.set_html(html)
    article.parse()

    article_details = {
        "title": article.title,
        "text": article.text,
        "url": article.meta_data["og"].get("url", article.url),
        "authors": article.authors,
        "date": article.publish_date,
        "meta": article.meta_data,
        "description": article.meta_description,
        
        "site": article.meta_data["og"].get("site_name", ""),
        "publisher": article.meta_data["publisher"],
    }

    return article_details
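
The function above indexes `meta_data["og"]` and `meta_data["publisher"]` directly, so it depends on the metadata object supplying a default for missing keys. If the metadata were a plain dict, or if `og` held a string rather than a dict, those lookups would raise an error. A more defensive lookup could be sketched as follows. The `meta_get` helper and the sample metadata are hypothetical, not part of any library.

```python
def meta_get(meta, *keys, default=""):
    """Walk nested dict keys; return `default` if a level is missing or not a dict."""
    current = meta
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical metadata resembling what a parser might return
sample_meta = {"og": {"site_name": "Example News"}, "publisher": {}}

site = meta_get(sample_meta, "og", "site_name")
og_url = meta_get(sample_meta, "og", "url", default="https://example.com/fallback")
not_a_dict = meta_get({"og": "unexpected string"}, "og", "url")
```

Checking `key not in current` instead of indexing also avoids adding empty entries to a dictionary that creates defaults on access.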

In [91]:
# Option 1: Standard Selenium

options = webdriver.ChromeOptions()
# Chrome documents "new" and "old" as values for this flag, not "True"
options.add_argument("--headless=new")
options.add_argument(f"--user-agent={user_agent}")
selenium_driver = webdriver.Chrome(options=options)

In [92]:
# Option 2: Add an extension

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument('--load-extension=bypass-paywalls-chrome-clean')
options.add_argument(f"--user-agent={user_agent}")
selenium_bypass_driver = webdriver.Chrome(options=options)

In [93]:
# Option 3: undetected-chromedriver, headless

options = uc.ChromeOptions()
#options.add_argument('--load-extension=bypass-paywalls-chrome-clean')
uc_driver = uc.Chrome(headless=True,
                       options=options,
                       use_subprocess=False)


In [94]:
# Option 4: undetected-chromedriver with a visible browser window

options = uc.ChromeOptions()
#options.add_argument('--load-extension=bypass-paywalls-chrome-clean')
uc_driver2 = uc.Chrome(headless=False,
                       options=options,
                       use_subprocess=False)


In [95]:
drivers = {
    "selenium": selenium_driver,
    "selenium_bypass": selenium_bypass_driver,
    "undetected": uc_driver,
    "uc_driver2" : uc_driver2
}

In [107]:
def fetch_html(url, drivers):
    """Load the URL in each browser and parse the resulting page source."""
    results = {}

    for driver_name, driver in drivers.items():
        driver.get(url)
        html = driver.page_source
        info = get_article_info(url, html)
        info['html'] = html
        info['url_retrieved'] = url
        results[driver_name] = info
    return results
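
As written, one failing browser stops the whole function, and any results already collected for that URL are lost. A variant that records the error and moves on might look like the sketch below. `fetch_with_each`, `FakeDriver`, and `fake_parse` are hypothetical names; the fake driver only mimics the `get` and `page_source` interface so the example runs without a browser.

```python
def fetch_with_each(url, drivers, parse):
    """Try every driver; record an error message instead of raising."""
    results = {}
    for name, driver in drivers.items():
        try:
            driver.get(url)
            html = driver.page_source
            info = parse(url, html)
            info["html"] = html
            info["error"] = None
        except Exception as exc:  # broad on purpose: browser failures vary
            info = {"html": None, "error": f"{type(exc).__name__}: {exc}"}
        info["url_retrieved"] = url
        results[name] = info
    return results


# Hypothetical stand-ins so the sketch runs without a browser
class FakeDriver:
    def __init__(self, fail=False):
        self.fail = fail
        self.page_source = ""

    def get(self, url):
        if self.fail:
            raise RuntimeError("tab crashed")
        self.page_source = f"<html><title>{url}</title></html>"


def fake_parse(url, html):
    return {"title": url}


demo = fetch_with_each(
    "https://example.com/a",
    {"good": FakeDriver(), "bad": FakeDriver(fail=True)},
    fake_parse,
)
print(demo["bad"]["error"])
```

With real browsers, a crashed session may stay unusable until the driver is recreated, which this sketch does not attempt.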
    

In [97]:
url = "https://www.bozemandailychronicle.com/news/international/protestors-picket-testers-weekend-fundraiser/article_241b4486-bec9-11ee-8e46-db5dbc78639a.html"
url = 'https://pantagraph.com/news/local/video-activists-gather-for-a-rally-in-support-of-palestinians-on-monday-in-normal/video_21ebf46a-1b3b-54f7-a761-457a60930271.html'
url = 'https://www.ack.net/stories/nhs-students-walk-out-to-support-lgbtq-rights,31618'
url = 'https://www.fox10phoenix.com/news/palestinian-americans-supporters-show-out-in-droves-at-tempe-rally-for-peace'
url = 'https://www.lmtonline.com/local/article/parents-plan-walk-peaceful-protest-canceled-uisd-18360224.php'
url = 'https://www.expressnews.com/politics/article/texas-capitol-survivors-child-sexual-abuse-17851830.php'
url = 'https://www.semissourian.com/story/3021988.html'
# Only the last `url` assignment above is used; the earlier ones are previous test cases
r = fetch_html(url, drivers)

In [108]:
shuffle(urls)

In [109]:
done = {}
 
for url in urls[:50]:
    r = fetch_html(url, drivers)
    done[url] = pd.DataFrame(r).T

WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: chrome=123.0.6312.59)


In [112]:
# Stack the per-URL results into a single DataFrame
retrieved = pd.concat(list(done.values()))

In [114]:
len(retrieved)

4

In [115]:
retrieved

| | title | text | url | authors | date | meta | description | site | publisher | html | url_retrieved |
|---|---|---|---|---|---|---|---|---|---|---|---|
| selenium | Lawmakers adjourn after protests erupt at Capitol | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com/video/2023/08/29/lawmaker...` | [] | 2023-08-29 00:00:00 | `{'apple-mobile-web-app-status-bar-style': 'bla...` | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com` | {} | `<html lang="en"><head><script async="" src="//...` | `https://www.wsmv.com/video/2023/08/29/lawmaker...` |
| selenium_bypass | Lawmakers adjourn after protests erupt at Capitol | 2 kids wander off as parents are found passed ... | `https://www.wsmv.com/video/2023/08/29/lawmaker...` | [] | 2023-08-29 00:00:00 | `{'apple-mobile-web-app-status-bar-style': 'bla...` | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com` | {} | `<html lang="en"><head><script charset="UTF-8" ...` | `https://www.wsmv.com/video/2023/08/29/lawmaker...` |
| undetected | Lawmakers adjourn after protests erupt at Capitol | ‘I feel so bad’: Man says he saw Mizzou studen... | `https://www.wsmv.com/video/2023/08/29/lawmaker...` | [] | 2023-08-29 00:00:00 | `{'apple-mobile-web-app-status-bar-style': 'bla...` | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com` | {} | `<html lang="en"><head><script charset="UTF-8" ...` | `https://www.wsmv.com/video/2023/08/29/lawmaker...` |
| uc_driver2 | Lawmakers adjourn after protests erupt at Capitol | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com/video/2023/08/29/lawmaker...` | [] | 2023-08-29 00:00:00 | `{'apple-mobile-web-app-status-bar-style': 'bla...` | The second week of Tennessee’s legislative spe... | `https://www.wsmv.com` | {} | `<html lang="en"><head><script async="" src="//...` | `https://www.wsmv.com/video/2023/08/29/lawmaker...` |
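
The four rows share a title but not always the same `text`, so it may be worth flagging URLs where the browsers disagree. One rough way to do that, sketched here with the standard library's `difflib`, is to compute pairwise similarity between the extracted texts. The `text_agreement` helper and the sample strings are hypothetical; in practice the input would come from the `text` column for one URL.

```python
from difflib import SequenceMatcher


def text_agreement(texts):
    """Return pairwise similarity ratios (0 to 1) between extracted texts."""
    names = sorted(texts)
    scores = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            ratio = SequenceMatcher(None, texts[first], texts[second]).ratio()
            scores[(first, second)] = round(ratio, 2)
    return scores


# Made-up extracted texts keyed by driver name
sample_texts = {
    "selenium": "Lawmakers adjourned early after protests.",
    "uc_driver2": "Lawmakers adjourned early after protests.",
    "undetected": "A completely unrelated sidebar headline.",
}
agreement = text_agreement(sample_texts)
for pair, score in agreement.items():
    print(pair, score)
```

A low score for any pair would suggest that at least one browser captured something other than the article body, such as a sidebar or a paywall notice.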
